# Category Archives: 2019

Spring 2019 blog posts

# Nearest Neighbor Analysis on Woodpecker Nest Locations Within Salvage Logging Units

Exercise 3: Nearest Neighbor Analysis

Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

• How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
• How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

Tools

Average Nearest Neighbor in ArcMap

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.
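The ratio the tool reports can be sketched in a few lines. This is a hypothetical pure-Python illustration of the standard formula (observed mean nearest-neighbor distance divided by the expected mean under a random pattern), not the tool's actual implementation:

```python
import math

def nn_ratio(points, area):
    """Average nearest neighbor ratio: observed mean nearest-neighbor
    distance over the expected mean for a random pattern in `area`."""
    n = len(points)
    # observed: mean distance from each point to its closest other point
    d_obs = sum(
        min(math.dist(p, q) for q in points if q is not p) for p in points
    ) / n
    # expected under complete spatial randomness: 0.5 / sqrt(n / A)
    d_exp = 0.5 / math.sqrt(n / area)
    return d_obs / d_exp  # < 1 suggests clustering, > 1 dispersion
```

For example, four points packed into one corner of a large study area return a ratio well below 1 (clustered), while the same four points spread evenly across a small area return a ratio above 1 (dispersed).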

Near Distance in ArcMap

The Near tool measures the distance from each feature in one input set to the nearest feature in another set. It appends columns to the original shapefile containing distance measurements in units of the user’s choosing. The user can enable a location option that records the X and Y components of each distance individually.
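As a rough illustration, the columns the Near tool appends can be mimicked for the point-to-centroid case. This is a simplified sketch (the real tool also handles polygon boundaries and other geometry types); the column names echo the tool's NEAR_DIST-style fields:

```python
import math

def near_table(targets, features):
    """For each target point, find the closest feature point (e.g. a salvage
    unit centroid) and record the distance plus X/Y offsets, mimicking the
    Near tool's distance and location columns."""
    rows = []
    for tx, ty in targets:
        nx, ny = min(features, key=lambda f: math.dist((tx, ty), f))
        rows.append({
            "NEAR_DIST": math.dist((tx, ty), (nx, ny)),
            "NEAR_DX": nx - tx,  # X offset to the nearest feature
            "NEAR_DY": ny - ty,  # Y offset to the nearest feature
        })
    return rows
```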

Data

For this analysis, I used 2016 and 2017 woodpecker nest point datasets clipped to each RMRS survey unit. I also used a polygon shapefile of 35 salvage harvest units within RMRS woodpecker survey units. I used another polygon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Nearest Neighbor Analysis Steps

1. Export 2016 and 2017 nest points as one nest shapefile per RMRS survey unit for inputs into the Average Nearest Neighbor tool.
2. Export salvage treatment polygons into three shapefiles for treatments 1, 2, and 3.
3. Run the Average Nearest Neighbor tool in ArcMap with multiple inputs. I ran the tool on each 2016 and 2017 nest shapefile per survey unit, all 2016 and 2017 nests, all salvage units, and each salvage treatment type shapefile (see table below).
4. Run the Near tool to produce near distances in meters for the distance of each nest to a salvage unit. I ran this tool for 2016 and 2017 nests using salvage unit centroids and salvage unit polygons, which produced different results as expected.
5. Use Excel to create an Average Nearest Neighbor table for comparing 2016 and 2017 results.
6. Use ggplot in R to plot near distance graphs for 2016 and 2017. These graphs display near distance distributions for nest points to salvage unit centroids and near distances for nest points to salvage polygon boundaries.

Results

Near Distances

I produced boxplots displaying near distance distributions for nest points. Sets of graphs are featured with and without the control nests for different visual interpretations.

Near Distances for Nest Points to Salvage Centroids (With Control Nests)

Near Distances for Nest Points to Salvage Centroids (Without Control Nests)

Near Distances for Nest Points to Salvage Polygons (With Control Nests)

Near Distances for Nest Points to Salvage Polygons (Without Control Nests)

Nearest Neighbor Results Table

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. An NN Ratio value of 1 marks the threshold between clustering and dispersion: values less than 1 indicate clustering and values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells indicate a decrease in value. NN Ratio cells with an increased value in 2017 indicate units where nests increased dispersion. NN Ratio cells with a decreased value in 2017 indicate units where nests increased clustering. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units experienced increased or present dispersion in 2017. The “All Nests” and salvage shapefile results were produced for comparisons and to evaluate how clustering/dispersion within datasets may affect clustering/dispersion of other datasets.

With these two techniques creating near distance values and a nearest neighbor table, we can implement a refined nearest neighbor analysis. This analysis examines nearest neighbor distances in combination with distances to the study boundary. Refined nearest neighbor uses standardized formulas integrating the distribution of observed nearest neighbor distances and the distribution of expected (random) nearest neighbor distances while factoring in distance to unit boundaries. In the table above, the NN Expected and NN Observed values will feed into these refined nearest neighbor calculations.
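One way to see what the refined analysis adds: compare the empirical distribution of observed nearest-neighbor distances with the distribution expected under complete spatial randomness. The sketch below is a simplified, hypothetical version that omits the boundary-distance correction described above:

```python
import math

def g_function(points, area, r):
    """Refined nearest-neighbor comparison at distance r: the observed share
    of points whose nearest neighbor lies within r, versus the share expected
    under complete spatial randomness, G(r) = 1 - exp(-lambda * pi * r^2).
    Edge (boundary-distance) corrections are omitted in this sketch."""
    n = len(points)
    nn = [min(math.dist(p, q) for q in points if q is not p) for p in points]
    g_obs = sum(d <= r for d in nn) / n
    lam = n / area  # point density
    g_exp = 1 - math.exp(-lam * math.pi * r * r)
    return g_obs, g_exp  # g_obs > g_exp at small r suggests clustering
```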

Problems and Critique

Did this nearest neighbor analysis account for the target area size and shape? In the ArcMap Average Nearest Neighbor help documents, these assumptions are made:

• “The equations used to calculate the average nearest neighbor distance index (1) and z-score (4) are based on the assumption that the points being measured are free to locate anywhere within the study area (for example, there are no barriers, and all cases or features are located independently of one another).”
• “The average nearest neighbor method is very sensitive to the Area value (small changes in the Area parameter value can result in considerable changes in the z-score and p-value results). Consequently, the Average Nearest Neighbor tool is most effective for comparing different features in a fixed study area.”

Because of this, I’m not sure whether I can interpret these results properly based on the study design. We are surveying specially designated study units, not the entire burn area. Therefore, our sample calculations should not assume the entire area is available for clustering or dispersion. There is an option in ArcMap to input “Area” as an integer but not as a polygon shapefile (for survey and salvage units). To control for this, I calculated Average Nearest Neighbor on nest shapefiles for each survey unit instead of all nests together. However, as shown in the table above, I also calculated Average Nearest Neighbor for all nests and different combinations of salvage units. These results may prove less reliable.

I would like to extend this analysis to a nearest neighbor technique adjusting for both the clustering/dispersion of the nests and the clustering/dispersion of the salvage units and survey units at the same time. The predetermined study design continues to require special consideration.

# Multiple Buffer Distance Analysis on Woodpecker Nest Buffers Intersecting Salvage Harvest Areas

Exercise 2: Multiple Buffer Distance Analysis

Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

• How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
• How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

Tool

Multiple Ring Buffer in ArcMap

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.

Above: Buffers created in a test analysis at 15 – 50 meter intervals around nest point features.

Data

For this exercise I used 2016 and 2017 woodpecker nest point shapefiles. I created multiple ring buffer outputs for each shapefile to use in the analysis. I also used a polygon shapefile of 35 salvage harvest units included within the treatment woodpecker survey units. I used another polygon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Multiple Buffer Distance Analysis Steps

1. Create separate point shapefiles for 2016 and 2017 nest points.
2. Run the Multiple Ring Buffer tool in ArcMap on each shapefile at 50 meter intervals from 50 – 300 meters. The tool creates six concentric buffer polygons around each nest point.
3. Use the Intersect tool in ArcMap to create a polygon shapefile of the buffers clipped to the salvage harvest units.
4. Use the Dissolve tool in ArcMap to merge overlapping buffer polygons of the same type for the same nest.

Above: The green polygons indicate areas where the Intersect tool identified overlap between the buffers and salvage unit polygons. The tool creates a polygon shapefile of the overlapping areas.

Above: The final result of six buffers at 50 m intervals around nest points intersected with salvage harvest units. Each color represents an individual polygon in a shapefile. The polygons are given size information and used to calculate percent overlap of nest buffers with salvage units.

5. Use the XTools Pro extension to add an area field in square meters to each intersected buffer shapefile.

6. Create a new field in each intersected buffer attribute table for percent area of the buffer intersected by a salvage unit.

7. Use the field calculator to create a formula for percent buffer area intersected: (Area field/Complete buffer size)*100

8. Export each intersected buffer table to Excel, including the salvage treatment type column and the percent buffer area intersected column for further analysis. Transfer the salvage treatment type and corresponding percent buffer area intersected columns to a combined Excel file with a sheet for every buffer distance.

9. Add zero values to each sheet for the nests falling within a control unit. The control nests act as a fourth treatment and should be included in the results.

10. Use ggplot2 in R to extract data from each sheet and create box plots for each buffer distance.

Above: Excel table of X and Y inputs for boxplots showing percent of the complete buffer intersected by a salvage harvest unit. Each buffer distance from 50 – 300 m has a sheet containing columns for these values.
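The field calculation in step 7 reduces to a one-line formula. A sketch, assuming the complete buffer is a full circle of area πr² (the actual attribute table may use dissolved ring areas instead):

```python
import math

def pct_buffer_intersected(intersect_area_m2, buffer_radius_m):
    """Step 7's field calculation: the percent of a nest buffer's area that
    falls inside salvage units, assuming a full circular buffer of pi*r^2."""
    complete_buffer = math.pi * buffer_radius_m ** 2
    return intersect_area_m2 / complete_buffer * 100.0
```

For a 50 m buffer (about 7,854 m²), an intersected area of half that returns 50%.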

Results

I generated graphs for each of the six buffer intervals (50-300 m at 50 m intervals) in 2016 (pre-salvage) and 2017 (post-salvage) for 12 total graphs. The 2016 and 2017 results for each buffer distance are presented side by side below. In each graph, the unlabeled grouping of points to the left of salvage treatment type 1 represents nests in the control area, or nests 100% inside salvage treatment 0. Visual analysis reveals interesting patterns about woodpecker nest site selection when considering the silvicultural prescriptions designating salvage treatment types. Below is a reminder of the salvage treatments:

Treatment 1 harvests the most trees overall but retains the most large-diameter trees, with spacing for Lewis’s woodpeckers. Treatment 2 harvests a moderate number of trees, retaining fewer large-diameter trees but more medium and small-diameter trees. Treatment 3 harvests a limited number of trees but retains barely any large-diameter trees, with a heavy focus on small-diameter retention for white-headed woodpeckers. The control treatments do not harvest any trees.

An obvious limitation of the following graphs is that as distance from the nest increases, the percent of the nest buffer intersecting a salvage polygon will decrease. There is an overall decrease in mean and midrange values for intersecting area as the buffer distance increases. However, some significant points emerge:

• Nest distribution in Treatment 1 units generally increases in breadth between 2016 and 2017. That is, in 2016 the distribution of percent buffer area intersected centered more closely around 30 – 50%, while in 2017 the values spread out, showing less preference for a specific distance from a salvage unit.
• Nests in Treatment 2 and 3 units generally decrease in percent area intersected by a salvage unit from 2016 to 2017. That is, after the salvage harvest, woodpeckers selected nest sites farther from Treatments 2 and 3.

50 Meter Buffer Results

______________________________________________________________________________________________________

100 Meter Buffer Results

______________________________________________________________________________________________________

150 Meter Buffer Results

______________________________________________________________________________________________________

200 Meter Buffer Results

______________________________________________________________________________________________________

250 Meter Buffer Results

______________________________________________________________________________________________________

300 Meter Buffer Results

Problems and Critique

I would perform this analysis again with larger buffer sizes and greater buffer intervals. I am not convinced the buffer size I chose for this exercise captured significant information at this scale. I created 10 buffers from 100 – 1000 m for a future analysis. Buffer size should be determined by assumed travel and foraging distances for each woodpecker species. The 50 – 300 m scale may not be a great enough distance for local trends to develop in the data. I chose these distances because the belt transects for the woodpecker point count surveys are 200 – 300 m apart to avoid interfering with birds on neighboring transects. I thought this would be a comparable scale for the buffer analysis.

This analysis fails to address or quantify the control nests because they are too far away from salvage units for their buffers to intersect. In Exercise 3, a near analysis in ArcMap can quantify trends in control nest distances from salvage areas.

In the future I would like to produce a statistics table for each 2016 and 2017 buffer distance displaying mean, standard deviation, and other metrics for more than a visual analysis.

# Using LiDAR data to assess forest aesthetics in McDonald-Dunn Forest

Bryan Begay

GEOG566

05/31/2019

How does forest aesthetics vary depending on forest structure as a result of active management that retains vegetation and landforms?

In order to answer my question, I looked at two stands in the McDonald-Dunn Forest and analyzed how their forest structure relates to forest aesthetics. The first stand, Saddleback, was logged in 1999 and shows signs of management. The second stand, identified near Baker Creek, is 1 mile west of Saddleback. Baker Creek was chosen for its riparian characteristics and because it shows no signs of management activities.

1. A description of the data set.

The LiDAR data I used in my initial analysis was 2008 DOGAMI LiDAR flown over the McDonald-Dunn Forest. The RMSE for this dataset was 1.5 cm. The DEM dataset I used, from 2009, has an RMSE of 0.3 m. A Canopy Height Model (CHM) was made with the lidR package in RStudio from a digital surface model with a 1 meter resolution. The CHM was used to create an individual tree segmentation, and the segmented trees were then converted to point data.
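Conceptually, the CHM is the per-cell difference between the surface model (canopy tops) and the bare-earth terrain. A minimal sketch of that subtraction, with toy grids standing in for the real 1 m rasters (Python used here purely for illustration; the actual processing was done with lidR):

```python
def canopy_height(dsm, dem):
    """CHM = DSM - DEM per cell, clamped at zero so slight elevation
    mismatches never produce negative tree heights."""
    return [[max(s - g, 0.0) for s, g in zip(srow, grow)]
            for srow, grow in zip(dsm, dem)]
```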

A link to a visualization of the raw point cloud that is georeferenced to its terrain.

2. Hypotheses: predictions of patterns and processes you looked for.

I initially suspected that the Baker Creek stand would have higher forest aesthetics, reflecting the stand’s unmanaged vegetation structure.

Rationale

Since the Saddleback stand had been managed and cut, I figured the more natural structure of the riparian stand would have generally higher forest aesthetics than a stand that has been altered by anthropogenic factors. One process I hypothesized was that the spatial point pattern of trees could be related to forest aesthetics. Insert forest aesthetic link:

3. Approaches: analysis approaches you used.

Exercise 1: Ripley’s K point pattern analysis

The steps taken to create a point pattern analysis were to identify individual trees and convert the trees into point data. The lidR package in RStudio was used to create a Canopy Height Model and then an individual tree segmentation. Rasters and shapefiles were created to export the data so the tree polygons could be used to identify tree points. The spatstat package in RStudio was then used to perform a Ripley’s K analysis on the point data.

Figure 1. Individual Tree Segmentation using watershed algorithm on Saddleback stand.
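The statistic spatstat estimates can be sketched without edge correction as follows. This is an illustrative naive estimator, not spatstat's corrected one; values well above πr² at a given distance suggest clustering:

```python
import math

def ripleys_k(points, area, r):
    """Naive Ripley's K estimate: K(r) = A / (n * (n - 1)) times the number
    of ordered point pairs within distance r. Under complete spatial
    randomness K(r) is about pi * r^2; no edge correction is applied."""
    n = len(points)
    pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r
    )
    return area * pairs / (n * (n - 1))
```

Comparing the returned value against `math.pi * r ** 2` at each distance is the clustered-vs-dispersed judgment the plots express.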

Exercise 2: Geographically weighted Regression

The steps taken for the geographically weighted regression included using the polygons created from the individual tree segmentation to delineate tree centers. Tree points were created from the centroids of the polygons and used as inputs for the GWR in ArcMap. Values from a density raster and the CHM raster were extracted to the point data so that density and tree height could be the variables used in the regression. Tree height was the explanatory variable and density was the dependent variable.

Figure 2. Polygon output from Individual tree segmentation using the lidR package. The Watershed algorithm was the means of segmentation, and points were created from polygon centroids in ArcMap.
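GWR repeats a distance-weighted local regression at every location. A minimal sketch of one such local fit, with a Gaussian kernel and hypothetical height/density values (the ArcMap tool fits the full model with bandwidth selection and diagnostics):

```python
import math

def gwr_local_slope(xy, x, y, site, bandwidth):
    """Fit a weighted simple regression y = a + b*x at one location, using
    Gaussian distance-decay weights; this per-location fit is what GWR
    repeats across the map (here x = tree height, y = density)."""
    w = [math.exp(-(math.dist(p, site) ** 2) / (2 * bandwidth ** 2)) for p in xy]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    b = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / \
        sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return b  # positive slope: taller trees coincide with denser stands
```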

Exercise 3: Supervised Classification

This analysis involved creating a supervised classification using training data from NAIP imagery and a maximum likelihood classification algorithm. It involved using the NIR band and creating a false color image that would show the difference in spectral reflectance values between conifers and deciduous trees. I used a histogram stretch to visualize the imagery better and spent time gathering quality training data. I then created a confusion matrix by using accuracy points on the training data. Finally, I clipped the thematic map outputs with my individual tree segmentation polygons to show how each tree had its pixels assigned.
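The maximum likelihood rule assigns each pixel to the training class whose statistics make the observed values most probable. The sketch below assumes independent bands with per-band mean and standard deviation, a simplification of the full covariance-based classifier in ArcMap; the class names and values are hypothetical:

```python
import math

def ml_classify(pixel, classes):
    """Assign a pixel to the training class with the highest Gaussian
    log-likelihood. `classes` maps a label to a list of (mean, std) pairs,
    one per band; bands are treated as independent in this sketch."""
    def log_lik(stats):
        return sum(
            -math.log(s) - (v - m) ** 2 / (2 * s ** 2)
            for v, (m, s) in zip(pixel, stats)
        )
    return max(classes, key=lambda label: log_lik(classes[label]))
```

For example, with hypothetical NIR/red training statistics for "conifer" and "deciduous", a pixel near the conifer means is labeled conifer.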

4. Results

The Ripley’s K analysis in ArcMap showed that the Saddleback stand’s trees are dispersed, while the Baker Creek stand’s trees are spatially clustered. The GWR output for the Saddleback stand showed tree height and density positively related; with an adjusted R2 of 0.69, the model produced a good output showing that the tallest and densest trees were on the edges of the Saddleback stand. The Baker Creek stand’s model performed poorly on the point data, with an adjusted R2 of 0.5, and relationships could only be modeled in the upper left of the stand. The classified image worked well on the Saddleback stand due to less distortion in the NAIP imagery on that stand, while the Baker Creek stand’s classification was not useful since its NAIP imagery had significant distortion.

Exercise 1:

Figure 3. ArcMap Ripley’s K function output for Saddleback stand assessing tree points.

Exercise 2:

Figure 4. Geographically weighted regression of Baker Creek and Saddleback stand. The Hotter colors indicate positive relationships between tree density and tree height.

Exercise 3.

Figure 5. Supervised image classification using a maximum likelihood algorithm on Saddleback stand.

5. What did you learn from your results? How are these results important to science? to resource managers?

I learned that Ripley’s K outputs can differ depending on which packages are used. The RStudio Ripley’s K outputs indicated that both of my stands had clustered tree patterning, while the ArcMap outputs, which made more sense, showed that the Saddleback stand was actually dispersed. Outputs can be variable if inputs are not explicitly understood or modeled with enough care. I also learned that modeling a very heterogeneous riparian stand is more difficult because of the variability. This is important for researchers who are interested in riparian areas like Baker Creek, since they might need more variables to adequately model those stands.

I became very familiar with processing and modelling with LiDAR point clouds. I also became familiar with Modelbuilder and learned how to use packages in R like Spatstat. I also found a new method for making a confusion matrix in ArcMap.

6. What did you learn about statistics or other techniques?

I learned how to do point pattern analysis with Ripley’s K on tree points, both in R and in ArcMap. In ArcMap, I also used the spatial statistics toolbox, which I still plan to use. When using GWR, I came to understand what it does, understood the outputs, and learned to properly interpret the results. I also became more concerned with issues of scale and networks that might affect my areas of interest.

# Geospatial Analyses Examining Woodpecker Nest Site Selection and Postfire Salvage Logging

Research Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

• How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
• How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

After evaluating my data with several spatial analyses methods, this research question still seems applicable to my project goals. A possible revision to this research question is to include phrasing about the internal spatial characteristics of the nest and salvage harvest location datasets aside from their relationship to each other.

Dataset

This project uses postfire datasets targeting the 2015 Canyon Creek fire complex on the Malheur National Forest in eastern Oregon. A salvage logging operation occurred in the burn area in July 2016. My geospatial analyses examine the relationships between woodpecker nest site locations and salvage harvest unit locations before and after the logging occurred. This will lead to correlating woodpecker population dynamics with salvage treatment effects. Woodpeckers are an ideal indicator species for postfire ecosystem monitoring due to their heavy reliance on burned snags for bark-beetle foraging and cavity-nesting. Their population dynamics indicate habitat quality for all snag-dependent wildlife. Previous research indicates that woodpeckers exhibit species-specific habitat preferences toward burned forest variables like stand density, diameter class, height, and snag decay class. 2016 pre-salvage and 2017 post-salvage lidar datasets are in processing for eventual correlation between these 3D forest structure variables and woodpecker nest site selection before and after harvest. However, the spatial analyses featured in this project focused on the initial task of relating 2D nest and salvage unit locations.

My research is in conjunction with a Rocky Mountain Research Station study examining salvage logging effects on three woodpecker species in the Canyon Creek complex. In 2016 and 2017 I led crews on this project collecting extensive postfire woodpecker occupancy and nest productivity datasets for black-backed, white-headed, and Lewis’s woodpecker populations. This resulted in a 148-nest dataset for 2016 and 2017, representing woodpecker populations before and after salvage logging. A polygon shapefile outlining ten RMRS woodpecker point count survey units serves as the area of interest (6 treatment, 4 control). Within the 6 treatment units, another polygon shapefile outlining 35 salvage harvest units indicates treatment areas. Three silvicultural prescriptions replicating optimal habitat types for each woodpecker species designate target salvage variables like average post-harvest stand density and diameter class. Each salvage unit adheres to one of these three harvest prescriptions. Supplementary geospatial data includes a 2015 WorldView-3 1 m raster and ArcGIS basemaps.

Above: The 2015 Canyon Creek Fire burning near John Day, OR.

Black-backed woodpecker (Picoides arcticus)     White-headed woodpecker (Picoides albolarvatus)                       Lewis’s woodpecker (Melanerpes lewis)

Above: The three target woodpecker species. All species are Oregon Sensitive Species. The white-headed woodpecker is a U.S. Fish and Wildlife Species of Concern.

Above: Survey units and corresponding salvage treatments for each unit. Each treatment type is based on a silvicultural prescription designed to benefit target woodpecker species. Based on previous research, each target species demonstrates preferred habitat characteristics in variables like stand density, diameter class, height, and decay class.

Above: The three salvage treatment types and the control treatment depicted as overhead stem maps. Each polygon represents a snag of indicated diameter class. The salvage treatments retain more trees of smaller diameter moving from Treatment 1 to Treatment 3, with the control units retaining all trees.

Above: A salvage treatment in the Crazy Creek woodpecker survey unit (Treatment 1).

Above: An early project map showing rough locations of Rocky Mountain Research Station woodpecker survey units. Since this map was created, two more survey units were added for 10 total (6 treatment, 4 control). This map effectively shows the woodpecker survey units occupying areas of highest burn severity.

Above: The Canyon Creek fire complex as a false color WorldView-3 1 m raster. The area of interest includes 10 survey units in blue, labeled with yellow text (6 treatment, 4 control). This visual orients the survey units to an area in eastern Oregon southeast of John Day and Canyon City. The false color image displays healthy vegetation as red, with the darkest areas displaying high burn severity. The survey units are found within some of the highest burn severity areas in the fire complex.

Above: A close-up of the 35 salvage treatment polygons outlined in red and labeled with white text. Control units lack red salvage polygons. This image does not include Overholt Creek.

Above: A subset of the 2016 (orange) and 2017 (pink) nest points featuring survey (gold) and salvage unit (green) polygons.

Hypotheses

I expected to see dispersed nests in 2016 with possible trends indicating species habitat preferences. Previous research indicates species-specific preferences for certain forest habitat variables. Black-backed woodpeckers prefer dense, small-diameter stands for foraging and nest excavation. White-headed woodpeckers prefer a mosaic of live and dead variable density and diameter stands for pine cone foraging. Lewis’s woodpeckers prefer tall medium to large-diameter stands for aerial foraging maneuvers. I expected to see nest sites in both years clustered in areas with these forest structures. In 2017 I also expected to see nest sites clustered near salvage treatments implemented for each species. Overall I expected the control units to exhibit nest dispersal and high woodpecker activity.

Approaches

Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point data clustering over a range of distances. Ripley’s K indicates how spatial clustering or dispersion changes with neighborhood size. For each evaluation distance (the user can specify the distances to evaluate, the starting distance, and the distance increment), Ripley’s K counts the neighboring features that lie closer to each point than the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors at a particular evaluation distance is higher than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.

Referenced from ArcGIS Desktop 9.3 Help (http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/spatial_statistics_tools/how_multi_distance_spatial_cluster_analysis_colon_ripley_s_k_function_spatial_statistics_works.htm)

Exercise 2: Multiple Buffer Distance Analysis

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.

Exercise 3: Nearest Neighbor

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.

Results

Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Above: Notable clustering results from Ripley’s K analysis. Most sample sizes were too small to produce significant results, but Big Canyon and Crawford Gulch in 2016 and Lower Fawn Creek in 2017 showed evidence of clustering.

Exercise 2: Multiple Buffer Distance Analysis

Above: The difference in results between 50 and 100 meter buffer distances for the 2016 nest dataset demonstrates the utility of graphs created from buffer analysis. These results indicate that woodpeckers consistently choose to place their nests in areas completely surrounded by salvage treatment 3. The results also indicate that woodpeckers consistently place their nests within varying distances of salvage treatment 1, with trends indicating average nest placement near but not completely within those salvage treatments.

Exercise 3: Nearest Neighbor

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. An NN Ratio value of 1 marks the threshold between clustering and dispersion: values less than 1 indicate clustering and values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells indicate a decrease in value. NN Ratio cells with an increased value in 2017 indicate units where nests increased dispersion. NN Ratio cells with a decreased value in 2017 indicate units where nests increased clustering. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units experienced increased or present dispersion in 2017.

Significance

I am processing two lidar datasets of the study area from 2016 and 2017. These datasets were acquired before and after the salvage logging treatments occurred. I will produce forest metrics such as stand density, diameter class, and height in salvage and survey units. I will then correlate geospatial and statistical properties of the nest datasets to quantified forest variables affecting woodpecker nest site selection. I will examine trends between 2016 and 2017 nest selection to understand the effects of harvest treatments on woodpecker populations. At least two more years of woodpecker data will exist for 2018 and 2019, so future research will add these datasets to the analyses. I would like to see a machine learning algorithm developed from this study that could predict areas of optimal habitat suitability for snag-dependent wildlife species. Postfire wildlife habitat prediction will be crucial to resource managers as wildfires increase in the coming decades alongside accelerated landscape restoration.

This spatial problem is important to science and resource managers as climate change amplifies wildfire effects. Using 3D remote sensing datasets for resource management is the future trend across all disciplines. Increased wildfire activity around the world necessitates cutting-edge methods for active fire and postfire ecosystem quantification. In the Blue Mountains ecoregion in eastern Oregon, Rocky Mountain Research Station, Malheur National Forest, and Blue Mountains Forest Partners rely on this project’s lidar quantification for their research and management decisions. Determining acceptable levels of salvage harvest for wildlife habitat affects whether government agencies and rural economies in this region will allow small businesses to profit from sustainable harvest operations. Science overall will benefit from the continued investigation of wildlife response to anthropogenic disturbance, specifically postfire forest management decisions and the controversial practice of salvage logging.

Learning Software

For my analyses I mostly used ArcMap, ArcGIS Pro, and R. I am very familiar with ArcMap, but this was my first time using ArcGIS Pro extensively. I enjoyed applying my R programming experience to my own datasets and creating visual aids for my research.

Learning Statistics

The statistical analyses I implemented for my research did an adequate job of characterizing my datasets. However, I would try a different set of analyses that better fit my small sample sizes. A spatial autocorrelation test may support these analyses. I used R heavily in my Ripley’s K and buffer analyses to create graphs.

# Courtney’s Final Project Post

Research Question

• “How is the spatial pattern of ion and isotope concentrations in wells tapping the basalt aquifer related to the spatial pattern of mapped faults via the mechanism of groundwater flow as determined by hydraulic transmissivity of the geologic setting?”

Description of dataset that I examined

• A: In my research I have analytical data for 31 wells, whose XY locations were determined by field confirmation of the Oregon Water Resources Department (OWRD) well log database. Because groundwater is a 3D system, I also have to consider Z values. The well depths and lithology information are also from the OWRD database. My analytical data provide a snapshot of water chemistry during the summer of 2018; I have only one temporal data point per well. At all 31 wells, I collected samples to be analyzed for pH, temperature, specific conductivity, oxygen isotopes 16 and 18, and hydrogen isotopes 1 and 2. At a subset of 18 of those wells I collected additional samples for tritium, carbon-14, and major ion analysis.
• B: The shapefile of faults mapped at the surface was created by Madin and Geitgey of the USGS in their 2007 publication on the geology of the Umatilla Basin. There is some uncertainty in my analysis as far as extending this surface information into the subsurface. USGS studies have constrained proposed ranges of dip angles for the families of faults that I am studying, but not exact angles for any single mapped fault.
• C: Results of pumping interference tests involving 29 wells, for 12 of which I also had chemical data. The data were collected by the OWRD in 2018 and 2019.

Hypotheses

• Faults can act as conduits or barriers to groundwater flow, depending on how their transmissivity compares to the transmissivity of the host rock.
• I hypothesize that clusters of similar chemical and isotopic properties of groundwater can indicate a shared aquifer unit/compartment, and that if faults separate clusters then the fault is influencing that difference in chemical/isotopic signatures. If the fault is between two clusters, I hypothesize that it is acting as a barrier. If it crosses through a cluster, I hypothesize that it acts as a conduit.
• Where faults act as barriers, I hypothesize that parameter values will differ in groups on either side of a fault. Specifically, a barrier fault might cause older, warmer water to rise into upper aquifer layers, and the downstream well might show a signature of more local recharge.
• Where faults act as conduits, I hypothesize that water chemistry and isotopes of samples from wells on either side of the fault would indicate a relatively direct flowpath from the upstream well to a downstream well. Over a short distance, this means that ion and isotope concentrations would not differ significantly in wells across the fault.
• My hypotheses depend on a “barrier” fault damming groundwater flow up-gradient of the fault, and compartmentalizing local recharge on the down-gradient side. This hypothesis is only relevant if the fault is roughly perpendicular to the flow direction, and so disrupts transmissivity between a recharge zone and the wells. If a fault that separates two wells is parallel to the flow direction and there is no obstacle between the wells and upstream recharge areas, then the fault might indeed limit communication between the wells, but they will have similar chemical signatures. Wells separated by this second kind of fault barrier would be better evaluated by a physical test of communication, such as a pumping interference test.

Analysis Approaches

• Principal component analysis: used to simplify the multivariate data set (19 variables!) into variable relationships that could represent aquifer processes
• Analysis of PCA results compared to distance from a fault along a flow path
• Interpolation of well water levels classified by well stratigraphy to estimate a potentiometric surface and evaluate groundwater flow directions.
• Raster calculations to compare flow direction to fault incidence angle
• Measuring distance from each well to the nearest fault along the flow path
• Simple linear regression comparing the Non-ion PC1 score of a well with its distance from a fault
• Two-sided T-tests comparing distance between wells, presence of a fault, and pumping interference test communication between wells
• ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups.
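The PCA step above can be sketched in Python; the matrix below is random stand-in data rather than the real 19-variable chemistry table, and the choice of three components is illustrative:

```python
import numpy as np

# Stand-in matrix: 31 wells x 19 chemistry/isotope variables.
# Random data here -- the real values come from the 2018 sampling campaign.
rng = np.random.default_rng(0)
X = rng.normal(size=(31, 19))

# Center and scale each variable, then take the PCA via SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T[:, :3]             # PC1-PC3 scores, one row per well
explained = s**2 / np.sum(s**2)      # variance fraction per component
print(scores.shape)                  # (31, 3)
```

The per-well PC scores are what get compared to fault distance and flow paths in the later steps.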

Results

• Principal component analysis – Patterns of variation between wells are statistically explained primarily by total ion concentration, a relationship between chemical evolution from ion exchange and decreasing stable isotope ratios, and the combination of well depth and water temperature. Moran’s I indicates that only Non-ion PC2 is spatially clustered, while the other PC models have a distribution not significantly different from random. The other PC models are useful for understanding the groundwater system, but not specifically for analyzing clustering correlated to faults.
• Interpolation of water level, and comparison of fault incidence angle with flow direction, indicates which faults can and cannot be tested under my hypotheses.
• Analysis of PCA results compared to distance from a fault along flow path – some wells that are “within” a fault zone have very old signatures and others have very young signatures. This could be related to the angle of the dip of the fault and the accuracy of mapping compared to the depth of the well. I hypothesize that the wells that are in the fault zone but have high PC1 scores are on the up-gradient side of the fault where older water is upwelling along a barrier. Wells in fault zones with low PC1 scores could indicate wells open to downgradient areas of the fault, where vertical recharge through the fault damage zone is able to reach the well.
• Returning to the conclusions I wrote in that blog post after I found improved stratigraphic data, I’m not sure I can make conclusions other than those about the wells that are mapped as “inside” a fault. Several wells that are closer to faults are also open to shallower aquifer units, so the effect of lower PC1 scores closer downgradient of faults might be confounded by lower PC1 scores caused by vertical inflow from the sedimentary aquifer and upper Saddle Mountain aquifer.
• Two-sided T-tests comparing distance between wells, presence of a fault, and communication between wells show that the presence of a fault has a greater effect on communication than the distance between the wells.
• ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups – stable isotopes and specific conductivity both show more variation between well groups than within well groups.
• Not covered in these blog posts, I also ran Moran’s I on my inputs to see which ones are clustered and so might be more related to horizontal grouping factors (such as faults) than vertical grouping parameters (such as stratigraphic unit). Of the PCA and individual variables, only d18O, d2H, and Non-ion PC2 (a combination of well depth and water temperature) were clustered. The other PCA models, temperature, pH, and specific conductivity were not significantly spatially clustered.
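Moran’s I itself is simple to compute; a minimal Python version with a binary k-nearest-neighbor weight matrix (the function and the choice of k are illustrative, not the settings used in the actual analysis):

```python
import numpy as np

def morans_i(values, coords, k=4):
    """Moran's I with a binary k-nearest-neighbor weight matrix."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    # pairwise distances between the sample points
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d[i])[:k]] = 1.0   # weight 1 for the k nearest wells
    z = x - x.mean()
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# A smooth east-west gradient on a grid should come out strongly clustered.
gx, gy = np.meshgrid(np.arange(6), np.arange(6))
pts = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
print(morans_i(pts[:, 0], pts, k=4) > 0.3)  # True
```

Values near the random-distribution expectation of -1/(n-1) indicate no spatial clustering; clearly positive values indicate clustering like that seen for d18O, d2H, and Non-ion PC2.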

Significance – Whether groundwater signatures related to faults agree or disagree with past understandings of differences between wells in the region can inform well management. If a senior permit holder puts a call on the water right and asks for junior users to be regulated off, it would not help that senior user if one of those junior permit holders’ wells is not hydraulically connected to the senior user’s.

• More wells would need to be sampled to be better able to disentangle the effects of faults from the effects of well stratigraphy.

My learning – I learned significantly more about how to code and troubleshoot in R. Additionally, I learned about the process of performing spatial clustering analysis in ArcGIS.

What did I learn about statistics?

• PCA was completely new to me, and it’s a cool method for dealing with multivariate data once I got past the steep learning curve involved in setting it up and interpreting the results. It was useful to get more practice performing and interpreting t-tests and ANOVA F-tests. I had not used spatial clustering before, and learning how to apply it was interesting. It gave me a much more concrete tool for disentangling the patterns in my effectively 3D data in the X,Y plane, as opposed to the Z direction.

# Relationship of river recession behavior to watershed lithological composition

Research Question: How is watershed recession behavior related to the spatial pattern of watershed lithological composition in Oregon coastal freshwater systems?

Data

Recession Analysis

For the recession analysis performed in this exercise, I used daily records of precipitation from the PRISM database. Daily data is the highest temporal resolution that PRISM offers. The PRISM time series were downloaded from the cells nearest the gage and span 2001 to April 2019. For further methodology on recession analysis, refer to Exercise 1. River discharge data were collected by the U.S. Geological Survey at the watershed pour point used here. Discharge is recorded every 15 minutes, which I aggregated to daily averages to match the resolution of the precipitation data. As a result, recession analysis results are on a daily timestep over the same 2001 to April 2019 period.

Geophysical data

Lithological data were sourced and derived from the Oregon Department of Geology and Mineral Industries state geological map. General lithology type is a listed attribute (GEN_LITH_TY), and ArcGIS was used to dissolve that attribute across the study area. The spatial resolution of this dataset is unclear. Additionally, the data represent “immutable” features, so there is no associated temporal scale.

Hypotheses

Streams and rivers in watersheds with predominantly volcanic geology will recede faster than those with predominantly sedimentary geology.

Streams and rivers in watersheds with similar underlying lithology will have similar recession characteristics.

Approaches

To perform this analysis, I took the following steps:

1. Recession analysis on the daily timescale across eight watersheds. Statistical methods and other analytical procedures are outlined in Exercise 1.
2. Profile lithological composition within each of 35 study watersheds across the Oregon Coast Range. Percentages of metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic lithologies were quantified for each study watershed. Technical procedures are outlined in Exercise 3.
3. Perform hierarchical clustering analysis on lithological composition profile to see which watersheds are most lithologically similar and dissimilar.
4. Compare recession analysis coefficients to clustering results and lithologic profiles to see if recession behavior appears to be different in different predominant lithologies.
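Step 3 can be sketched with SciPy’s hierarchical clustering tools; the composition profiles below are invented for illustration (four watersheds instead of 35):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical lithology profiles: rows are watersheds, columns are fractions
# of metamorphic, plutonic, sedimentary, surficial, tectonic, and volcanic
# lithologies; each row sums to 1.
profiles = np.array([
    [0.00, 0.02, 0.90, 0.05, 0.00, 0.03],   # sedimentary-dominated
    [0.01, 0.00, 0.85, 0.04, 0.00, 0.10],
    [0.00, 0.05, 0.20, 0.05, 0.00, 0.70],   # volcanic-dominated
    [0.02, 0.00, 0.30, 0.08, 0.00, 0.60],
])

# Ward linkage on Euclidean distance between composition vectors,
# then cut the tree into two branches.
Z = linkage(profiles, method="ward")
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # the sedimentary and volcanic watersheds land in separate branches
```

The linkage matrix `Z` is also what a dendrogram like Figure 1 is drawn from (e.g., `scipy.cluster.hierarchy.dendrogram(Z)`).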

Results

Results from the clustering analysis demonstrated that most coastal watersheds are predominantly underlain by sedimentary and volcanic lithologies, and that sedimentary lithologies dominate the region. However, volcanic lithologies are well represented, primarily at the northernmost latitudes. Because of this stratification, recession analyses could be performed on watersheds in both lithologies and compared to test whether lithology is a dominant control. However, because I conducted the recession analysis before profiling lithological composition, only one of the eight hydrologically analyzed watersheds is in predominantly volcanic geology (Figure 1). The watersheds grouped under the sedimentary branch span 54-100% sedimentary lithology; those grouped under the volcanic branch span 47-74% volcanic geology.

Figure 1. Results from hierarchical clustering analysis depicted in a dendrogram. Watersheds in which recession analyses were performed are circled.

When the coefficients from the recession analysis are compared with the clustering results, we see that coefficient a is fairly consistent among all watersheds, while coefficient b is much larger in the predominantly volcanic watershed (Table 1). The Wilson River near Tillamook, Oregon, is characterized as 50% volcanic lithology, the highest among the studied watersheds. Its recession coefficient b of 1.726 is also the highest of the eight studied systems.

Table 1. Table showing lithological composition of watersheds for which recession analysis was performed. Coefficient results from the analysis are also listed.
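Exercise 1 documents the actual recession method; as a rough illustration, assuming the common Brutsaert-Nieber form -dQ/dt = aQ^b, coefficients a and b can be recovered by linear regression in log space:

```python
import numpy as np

# Synthetic daily recession generated with -dQ/dt = a * Q**b; the
# Brutsaert-Nieber form is an assumption here, not necessarily Exercise 1's.
a_true, b_true = 0.05, 1.7
Q = [20.0]
for _ in range(60):                          # 60 days of receding discharge
    Q.append(Q[-1] - a_true * Q[-1] ** b_true)
Q = np.array(Q)

neg_dQdt = -np.diff(Q)                       # daily recession rate
# log(-dQ/dt) = log(a) + b*log(Q): slope recovers b, intercept recovers a
b_fit, log_a = np.polyfit(np.log(Q[:-1]), np.log(neg_dQdt), 1)
print(round(b_fit, 2), round(float(np.exp(log_a)), 3))  # prints: 1.7 0.05
```

With real, noisy discharge records the fitted slope and intercept scatter around the true values, which is why the b values in Table 1 carry the interpretive weight.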

Significance

While the results of this study are inconclusive given the small sample size, they do indicate that lithology is an important control on watershed recession behavior and should therefore be included in analyses of temporal patterns of hydrological behavior. This is relevant to resource managers in coastal PNW regions because it identifies a watershed characteristic that influences storage and release behavior. Such information may be valuable for identifying systems that are more or less susceptible to drought and changes in climatic patterns.

Learning Outcomes

• Watersheds can be delineated in MATLAB, and it is much faster than in ArcGIS.
• dplyr is a very important and useful package for reformatting data in a logical way
• I learned more about spatial autocorrelation from conversations with other students about their analyses, especially given the breadth of research topics in the classroom.
• Probably the two most important outcomes of this project were the recession analysis and the clustering analysis. The recession analysis involved multivariate regression, which makes far more sense to me now that I have applied it to my own data and research question. The clustering analysis helped get at a question I have had for some time: how similar or dissimilar are my study watersheds? For this project I focused on lithology, but I think clustering will be a useful tool for analyzing the other variables as well. Additionally, the clustering analysis will provide a useful method for stratifying my sampling design into different lithological groups.

# Relation of Potential Soil Carbon to Tree Height and Density across HJA

Major uncertainties exist in the geospatial distribution of terrestrial carbon (C) sources and sinks and the factors that influence soil C distribution and change. Temperate and boreal forests serve as a net sink of atmospheric CO2, so my aim is to understand how landscape features and vegetation vary across the HJ Andrews Forest (HJA) and determine how those factors influence soil C storage and CO2 release.

Research Question: How might landscape features and vegetation characteristics relate to accumulation of soil carbon, using tree height and density as a proxy for soil C?

Dataset description: I used LiDAR data from 2008 and 2011 at 0.3-0.5 m vertical resolution and at 1 m2 horizontal resolution covering the Lookout Creek Watershed and the Upper Blue River Watershed to the northern extent of HJA. These LiDAR data include a high hit model (tree tops) and a bare earth model from 2008. I also downloaded a shapefile of polygons of the reserved control areas within the HJA forest boundaries and used these areas to subset some of my later analyses.
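Per-cell tree heights come from differencing the two LiDAR surfaces (high hit minus bare earth); a toy sketch of that step, with invented elevations standing in for the 1 m rasters:

```python
import numpy as np

# Toy 2x2 grids standing in for the real 1 m LiDAR rasters (elevations in m).
high_hit   = np.array([[412.0, 415.5], [408.2, 430.1]])  # canopy-top surface
bare_earth = np.array([[410.0, 411.0], [407.9, 402.6]])  # ground surface

# Canopy height model: per-cell vegetation height above ground.
chm = high_hit - bare_earth
print(chm)
```

The resulting canopy height model is the tree-height raster that the height classes and hot spot analyses below are built on.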

Hypotheses:

1. Large storm events primarily topple larger (older) trees on ridges and upper slopes. Soils in these positions are typically thinner, making them quickly saturate from precipitation and making trees in those positions more vulnerable (Lal 2005; Overby 2003). In addition, the tallest trees on ridges may be exposed to greater risk from lightning strikes, greater wind gusts and more snow accumulation, resulting in shorter trees dominating ridgelines. Shorter trees have smaller canopies and occupy less physical space than taller trees, so shorter trees occupy denser stands and as they grow taller, each individual tree occupies more space. Because of these factors, I hypothesize that shorter trees that are more closely spaced will cluster more densely along ridges than other landforms.
2. I hypothesize that the tallest trees will be more densely clustered at low elevations along valleys because waterways carve out valleys and bring nutrients that then accumulate and build thick, carbon-rich soil. These areas tend to not suffer from the moisture limitations of ridges and steep slopes and are more protected from strong winds, so trees in these areas can maximize growth based on available solar radiation.
3. High density young forest stands exhibit characteristics similar to mature forests, like closed canopies and high LAI. However, recently disturbed forests have higher nutrient availability than undisturbed forests which has been shown to cause a shift in C-allocation from below- to aboveground, so I expect younger stands to negatively correlate with soil C accumulation.
4. I hypothesize that because vegetative growth is limited by solar radiation, vegetative growth on slopes that face S and W have greater exposure to solar radiation and will result in greater overall biomass (more dense stands of smaller trees, but not necessarily more tall trees). However, if the slope is too steep, I expect this pattern to diminish since trees will experience greater wind and water erosion and more tree mortality.

Note: I have yet to test hypotheses 3 and 4, but they are avenues for continuing research.

Approaches: I used a k-nearest neighbor analysis and k-means to cluster trees into 10 height classes and related each tree in the center of each height class to its distance from its 30 nearest neighboring trees of the same height class. I ran several hot spot analyses on tree heights and tree distances (tree density). Since my original hot spot analyses were subset by height class, which led to the algorithm naturally finding the arbitrary bins (upper and lower bounds) the height classes were based on, I performed new hot spot analyses on all the trees and all the tree distances within just the reserved control areas of the forest. I performed a linear regression to compare the regions of taller than expected trees to the regions of greater than expected distances between trees, and Chi-squared tests of independence for the hot spot analyses to compare hot spots and cold spots with variables like tree spacing, slope and elevation. Since hot spots of tree height and spacing did not overlap in all cases, I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. I calculated a confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3, -2, -1, 0, 1, 2, 3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots.
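The bin-by-bin comparison of height and distance Z-scores amounts to a cross-tabulation; a toy Python version (the eight trees and their bins below are invented):

```python
import pandas as pd

# Hypothetical per-tree hot spot results: Gi* Z-score bins for tree height and
# for mean distance to the 30 nearest neighbors, binned -3..3 as in the analysis.
trees = pd.DataFrame({
    "height_bin":   [3, 3, -3, -3, 0, 0, 3, -3],
    "distance_bin": [3, -3, -3, 3, 0, 0, 3, -3],
})

# Confusion-matrix-style crosstab: which height bins co-occur with which
# distance bins, expressed as a fraction of all trees.
ct = pd.crosstab(trees["height_bin"], trees["distance_bin"], normalize=True)
print(ct)
```

The diagonal cells correspond to the overlapping hot and cold spots discussed below, and the off-diagonal cells like (-3, 3) and (3, -3) are the mismatches worth mapping.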

My objective for the final synthesis was to find regions that are higher or greater than expected for given parameters (height and tree density) and group these clusters into polygons where they overlap. I hypothesized that soil C accumulation positively correlates with clusters of trees that are both taller and more densely spaced than expected. Conversely, I hypothesized that shorter trees that are more widely spaced negatively correlate with soil C accumulation. Therefore, overlapping hot spots of tree height and tree density and overlapping cold spots of tree height and tree density are prime areas for my planned future soil C sampling sites.

Results: I produced maps and statistical relationships between tree height and tree spacing. These included hot spot maps overlaid on elevation maps, density plots, graphical comparisons of Z-scores and many others (see previous blog posts). I found that hot spots of taller than expected trees overlapped with hot spots of greater than expected distances of trees, but not in all cases (Fig 1). When I examined a table of each of the distance hot spot bins compared with each of the height hot spot bins, I found that the most compelling correlations were between bins (0,0), (-3,-3) and (3,3) with 5%, 33% and 43% of total data, respectively. The proportion of data not covered by overlapping hot spots and by overlapping cold spots was minimal, but when mapped was visually compelling enough to warrant further exploration. I was particularly interested in areas with short trees and greater than expected distance between trees (bin -3,3) and areas with tall trees and shorter than expected distance between trees (bin 3,-3). I expect that bin (-3,3) would correlate with less soil C, so identifying those areas could be useful to sampling. I expect that bin (3,-3) would correlate with greater soil C. I can identify from Fig 2 where statistically significant hot spots and cold spots are spatially close and plan to sample heavily in those areas.

Fig 1. Tree height vs. distance between trees from hot spot analyses shows a highly linear pattern.

Fig 2. Tree density (mean distance in m between 30 closest trees) hot spot bins compared with tree height (m) hot spot bins in Lookout Mountain area. Bin comparisons (-3,-3) are areas of shorter than expected trees in more dense stands. Bin (3,-3) is taller than expected trees that are in more dense stands. Bin (3,3) is taller than expected trees in more sparsely populated stands.

Significance: Many of my results thus far confirm ecological phenomena that we already know to be true across forested landscapes. For example, I examined tree distance to stream and found that taller trees were more common closer to streams and less common at greater distances from streams. This makes sense with other landform features like valleys and ridges, so smaller trees tend to be along ridges (cold spots of tall trees according to my analysis) and taller trees tend to be in valleys. However, I have identified regions of taller than expected trees and regions of shorter than expected trees, and if those correlate with respectively more and less soil C, that provides evidence for LiDAR as an effective way to quantify soil C. If other landscape features that can be determined from geospatial data co-vary with tree height and/or tree density, it may be possible to quantify soil C at fine resolution. These data can be used to identify areas that have the potential to sequester more soil C and forest management could be tailored to support those regions.

Geospatial Learning: I learned a ton about ArcMap, QGIS and R packages that were new to me, like spatstat, caret, nngeo, and others. I had very little experience with GIS before this class, so it was a steep learning curve, but I’ll continue to learn as I perform this final synthesis. I learned how to perform hot spot analyses in ArcMap and export them to work with in QGIS. I learned how to manipulate spatial data in R and load it into QGIS and ArcMap for viewing. These are only a few examples of the many things I’ve learned!

Statistics Learning: I learned the limitations of hot spot analyses, with one of my criticisms being that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I already knew that my data were spatially autocorrelated, so I had to take that into account for all of my analyses. I learned that the confusion matrix is great for visually discerning where a model predicts best and where it doesn’t predict well, but that the scientist must figure out the reasons for the good or poor predictions.

Because interactions among geomorphic processes, landforms, and biota occur on various temporal and spatial scales, it is necessary to understand the scale of an ecosystem process when performing spatial analyses. It is also necessary to consider the potential reasons why a particular spatial pattern did not emerge from an analysis. Reasons can range from data not being explained by expected process or multiple spatial scales of interactions, failing to use the right tool, not being at the right spatial scale, data being too coarse, failing to include the right variables or a flaw in one’s fundamental idea about how the system works. This is why it is important to (1) formulate clear, specific hypotheses before performing an analysis and (2) explore spatial patterns using multiple approaches if possible.

Sources:

Harmon M.E. (2009) Woody Detritus its Contribution to Carbon Dynamics of Old-Growth Forests: the Temporal Context. In: Wirth C., Gleixner G., Heimann M. (eds) Old-Growth Forests. Ecological Studies (Analysis and Synthesis), vol 207. Springer, Berlin, Heidelberg

Lal, R. (2005). Forest soils and carbon sequestration. Forest Ecology and Management, 220(1-3), 242-258.

Overby et al. (2003). Impacts of natural disturbance on soil C dynamics in forest ecosystems: soil C sources include litter fall and tree mortality, root turnover and root exudates.

Dixon, R. K., Solomon, A. M., Brown, S., Houghton, R. A., Trexier, M. C., & Wisniewski, J. (1994). Carbon pools and flux of global forest ecosystems. Science, 263(5144), 185-190.

# Spatial and Temporal Patterns of Oregon Salmonella Cases

1. The research question that you asked.

My original question focused solely on the spatial patterns of Salmonella cases in Oregon. However, after realizing my dataset contained time series information I decided to pivot my analysis somewhat, and my research question became the following: How are the spatial and temporal patterns of reported Salmonella infections associated with sociodemographic factors in Oregon counties between 2008 and 2017? While the eventual discovery of causal drivers of disease will be important for disease prevention efforts, the focus here is what other factors are associated with disease at the county level. Through this analysis I hoped to identify sub-populations who are at a higher risk of becoming infected with Salmonella in Oregon.

1. A description of the dataset you examined, with spatial and temporal resolution and extent.

The data used in my analysis came from a variety of sources. Data regarding the number of Salmonella cases per county came from the Oregon Public Health Epidemiology User System (ORPHEUS). These data contained individual level information, but it was de-identified due to privacy concerns. Information concerning age, sex, disease onset, and county of residence was available with these data. Population estimates came from Portland State University’s Population Research Center which collected yearly county level population estimates. Other sociodemographic information came from the American Community Survey, a yearly survey which assesses various county-level characteristics like poverty, high school graduation rates, percentage of foreign-born residents, etc. All data used in this analysis was from years 2008-2017.
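Case counts and population estimates combine into the county-level incidence rates used throughout the analysis; a toy version in Python (the counties, counts, and populations below are invented, not ORPHEUS or PSU figures):

```python
import pandas as pd

# Hypothetical county-year case counts joined to population estimates.
df = pd.DataFrame({
    "county": ["Benton", "Benton", "Malheur"],
    "year":   [2008, 2009, 2008],
    "cases":  [12, 9, 4],
    "pop":    [85000, 86000, 31000],
})

# Incidence as cases per 100,000 residents, the standard epidemiological rate.
df["incidence"] = df["cases"] / df["pop"] * 100_000
print(df["incidence"].round(1).tolist())  # [14.1, 10.5, 12.9]
```

Normalizing by population this way is what makes sparsely and densely populated counties comparable in the later regressions.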

1. Hypotheses: predictions of patterns and processes you looked for.

Hypothesis 1: I expect counties with higher proportions of females to have higher levels of Salmonella infections due to findings in prior literature concluding that females have a higher incidence rate of Salmonella infections compared to males. The underlying causal mechanism for this pattern is unknown, but given the results of other studies of foodborne illness I expect Oregon’s population to be similar. As a result I would expect the percentage of a county that is female to be significantly associated with Salmonella incidence.

Hypothesis 2: I expect counties with higher proportions of infants and newborns to also have higher levels of Salmonella infections compared to counties with a higher proportion of older age groups. Other findings from the Oregon Health Authority indicate that young children are at high risk for developing foodborne illnesses. The reasoning here is that the immune systems of young children are not fully developed and less likely to effectively fight off infection resulting in a disproportionately high disease incidence in this group. I expect to see that the percentage of a county that belongs to the age group 0-4 years old will be significantly associated with Salmonella incidence.

Hypothesis 3: I expect to find a significant time trend in Salmonella incidence across Oregon counties. Disease incidence varies from year to year and can sometimes be volatile in sparsely populated counties or counties experiencing a major outbreak. Because of this natural variance over time I expect to see a significant time trend in reported cases in Oregon. Some counties will show a positive trend over time, others will show decreasing disease rates, and others will be relatively level. However, I expect time will be a significant factor associated with disease incidence.

1. Approaches: analysis approaches you used.

I used auto- and cross-correlation, longitudinal trend analysis, hotspot analysis, geographically weighted regression (GWR), and principal component analysis (PCA). All of these analytical approaches were performed in the R statistical software with the aid of various spatial analysis packages.
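As an illustration of the cross-correlation piece, a minimal lagged Pearson correlation in Python (the function and the stand-in yearly series are mine for illustration, not output from the actual analysis):

```python
import numpy as np

def cross_corr(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

# Stand-in yearly series (think 2008-2017): y echoes x one year later.
x = np.array([1.0, 3, 2, 5, 4, 6, 5, 8, 7, 9])
y = np.roll(x, 1)
y[0] = x[0]
print(round(cross_corr(x, y, 1), 2))  # prints: 1.0
```

Scanning this over a range of lags (and over pairs of county series) is what reveals whether a sociodemographic variable leads, lags, or tracks disease incidence.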

1. Results: what did you produce — maps? statistical relationships? other?

From these analyses I was able to create maps, plots of auto- and cross-correlation over time, and a bivariate plot of principal components which were associated with different regions of Oregon. Also, there was considerable numerical output from my time trend regression and GWR which provided evidence of statistically significant associations between the variables in my model and the outcome. These outputs can be seen in my second and third blog posts.

Hypothesis 1: Findings from my GWR support my first hypothesis that the proportion of a county which is female is significantly associated with Salmonella incidence. While coefficient estimates vary by county, most often there is a positive association between the proportion of a county that is female and disease incidence. Cross-correlation analysis found large areas of Oregon, particularly in the western part of the state, where the association between the two variables was positive, and some areas clustered in the eastern part of the state where this pattern was reversed.

Hypothesis 2: GWR findings and cross correlation analyses did not support my second hypothesis. The proportion of a county which is aged 0-4 is not significantly associated with county Salmonella incidence. However, it was found that county percentage of child poverty was significantly associated with disease incidence. Perhaps the reason for this finding is that the proportion of children alone may not be significantly associated with disease, but poverty is. Thus children in poverty partly explain the association seen for children as a whole.

Hypothesis 3: Longitudinal analysis and regression supported my third hypothesis that a significant time trend exists for Oregon’s Salmonella incidence. As expected, there was some variance in county disease rates over time due to a large host of factors; larger populations are likely to have more reported cases simply because there are more people to possibly infect. Overall, there is an increasing trend in Oregon’s Salmonella incidence, which may mean more disease is occurring in Oregon over time. Another possible explanation for some of the increase could be improvements in Oregon’s disease monitoring abilities and infrastructure over time.

1. What did you learn from your results? How are these results important to science? to resource managers?

This analysis found statistically significant associations between some population characteristics (like median age, childhood poverty levels, and time) and the reported rate of Salmonella. These results are important to researchers because the focus to find causal exposures for Salmonella can be narrowed to groups or areas more associated with the disease. These results can inform future disease prevention research. Resource managers would also be interested in this analysis as the counties identified here as being more associated with disease than others can be targeted for disease surveillance.

1. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

My analysis was completed solely in R. I learned a lot over this course about different spatial analysis packages, which made my R skills more robust. Some spatial packages have a steep learning curve and require a fair degree of technical knowledge to implement appropriately. I also learned a lot about troubleshooting from discussion threads and GitHub posts.

1. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I learned how hotspot analysis is performed, what a Getis-Ord Gi* score is, and how these scores are compared to yield hot and cold spots. For autocorrelation, I learned how to perform cross-correlation analysis through both space and time, and how to cluster different values into a map that is easier to interpret. During my geographically weighted regression analysis I learned how to transform my data to make it compatible with the technique, and how to choose variables for the GWR based on an ordinary least squares regression. Prior to this project I had never heard of PCA, so I learned introductory skills for applying this analytical technique to my dataset.
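For reference, the core of the PCA step can be sketched in a few lines of NumPy: center the data, eigendecompose the covariance matrix, and project onto the leading components. This is an illustrative sketch only, not the code used in this project; all names are mine.

```python
import numpy as np

def pca(data, n_components=2):
    """PCA via eigendecomposition of the covariance matrix.

    data : (n_samples, n_features) array, e.g., one row per county.
    Returns (scores, components, explained_variance_ratio).
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    scores = centered @ components               # project samples onto the PCs
    return scores, components, eigvals[:n_components] / eigvals.sum()
```

Plotting the first two columns of `scores` against each other gives the kind of bivariate principal-component plot described above.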

# Final: Evaluating a habitat suitability model for native Olympia oysters in Yaquina Bay

Research background
The goal of my graduate research project is to help build a more comprehensive picture of native Olympia oyster spatial distribution and abundance in Yaquina Bay, Oregon. Olympia oysters have experienced a major decline in population in the Pacific Northwest; it is estimated that the original population has been reduced by 90-99% (Beck et al. 2011). The species is an important ecological and cultural resource, which has garnered it support from resource management agencies, tribes, and local environmental groups. Yaquina Bay is one of only three Oregon estuaries that has historically supported Olympia oysters and may be the only estuary to have maintained a continuous population since Native American settlement (Groth and Rumrill 2009). Because of this, it has been identified by local shellfish biologists as a good candidate for species enhancement. However, current information about exactly where and how many wild oysters exist in Yaquina Bay is inexact and anecdotal. My research project aims to fill that data gap so that appropriate, informed actions can be taken for species enhancement and recovery.

Through this course, I have been experimenting with spatial analysis to see if I can create an accurate map of habitat suitability for Olympia oysters in Yaquina Bay. To do this, I’ve been focusing on three environmental parameters: salinity, substrate, and elevation. I then compared the map of predicted suitable habitat with field observations of known oyster locations. The main research question I examined in this project was:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Description of the dataset

Habitat parameters
I collected one raster layer for each of the three habitat parameters, then overlaid them on one another to establish suitable habitat. I ranked the subcategories within each parameter from highest to lowest likelihood to be suitable for oyster presence.

• Salinity: Salinity data was derived from an existing dataset available through the Oregon Department of Environmental Quality (DEQ). DEQ operates a publicly accessible online portal where results from its water quality monitoring stations are posted. Wet-season salinity was used to represent the extreme of salinity stress, since Olympia oysters are more capable of withstanding higher-salinity waters than fresher water. The salinity dataset is based on historical measurements from 1960-2006 and represents a gradient from highest salinity (~32 psu) to fresher water (<16 psu).
• Substrate: Substrate data was taken from the Estuary Habitat Map of Oregon using the Coastal and Marine Ecological Classification Standard (CMECS). CMECS is a national standard; publicly available data for Oregon’s coastal areas was made accessible through an effort by the Oregon Coastal Management Program (Department of Land Conservation and Development). The CMECS Estuarine Substrate Component layer is an Oregon-wide raster that I clipped to Yaquina Bay for this analysis. This layer characterizes substrate type broadly and is fairly low resolution.
• Elevation: Elevation is represented through a bathymetric dataset from 2001-2002, which is the most current estuary-wide available dataset. This raster was sourced from the Newport, Oregon office of the Environmental Protection Agency (EPA) through contact with a staff geographer.

Oyster location data
This spring, I completed several field surveys of the intertidal zone of Yaquina Bay during low tide series. These were conducted in collaboration with my advisor and field staff from the Oregon Department of Fish and Wildlife (ODFW) on April 19-20 and May 17, 2019. We used a stratified random sampling design to randomly select survey locations. Once at the field survey locations, we timed our search for five minutes and assessed as much area as possible within that time. Oysters were characterized as ‘present’ if evidence of at least one oyster, living or dead, was detected within the search area. The oyster or oyster shell needed to be attached to a substrate to qualify. These data points were collected using a handheld Garmin GPS unit and field notebook, then transferred to Google Earth for quality control, and finally to ArcGIS Pro for analysis.

Hypotheses
I originally hypothesized that the spatial pattern of Olympia oyster location data would correspond with the spatial pattern of suitable habitat as established by analysis of the three habitat parameters (salinity, substrate, elevation). I did not expect the correspondence to be exact, but thought it would at least allow for a somewhat accurate prediction of where the oysters were located in the estuary. I felt this model would be most useful for determining the spatial range of the oysters by finding the extreme upstream and downstream ends of the population. Of the three habitat parameters, I hypothesized that salinity would be the ultimate predictor of where oysters could be found and that substrate would be the limiting parameter.

Approaches
Exercise 1
I did not have any oyster location data to work with yet, so I first focused on the three habitat parameters to identify and rank where habitat was least to most suitable. I ranked the subcategories within each parameter raster layer by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer. I then combined them using the ‘Weighted Overlay’ tool in ArcGIS Pro to determine where overlap between the layers shows the best environmental conditions for oysters to survive.
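Outside of ArcGIS Pro, the core of a weighted overlay is simply a weighted sum of the ranked rasters, rounded back to the rank scale. The NumPy sketch below is a minimal stand-in for the tool; the function name and the equal weights in the example are my assumptions, not the settings used in the actual analysis.

```python
import numpy as np

def weighted_overlay(layers, weights):
    """Combine ranked raster layers into a single suitability raster.

    layers  : list of 2-D arrays of integer suitability ranks (e.g., 1-4)
    weights : percent-influence weights summing to 1.0
    Mimics the rounded weighted sum behind ArcGIS Pro's 'Weighted Overlay'.
    """
    stacked = np.stack(layers).astype(float)
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    return np.rint((stacked * w).sum(axis=0)).astype(int)

# Hypothetical usage with equal influence for the three parameters:
# suitability = weighted_overlay([salinity, substrate, elevation], [1/3, 1/3, 1/3])
```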

Exercise 2
After collecting field data to determine where oysters were present or absent within the intertidal zone of Yaquina Bay, I used the data points to compare against the map of suitable habitat. I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro where I used the ‘Buffer’ and ‘Zonal Statistics’ tools to analyze the neighborhood surrounding each data point. Looking at these neighborhoods, I determined what habitat suitability type that the majority of surrounding habitat fit into. Finally, I made a confusion matrix to statistically assess which data points fit into each habitat type.
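The neighborhood-majority and tallying steps above can be sketched outside ArcGIS as follows. This is a simplified stand-in: a square window replaces the circular buffer, and all function names are hypothetical.

```python
import numpy as np
from collections import Counter

def majority_class(raster, row, col, radius):
    """Most common suitability class in a square window around a survey point
    (a stand-in for the circular Buffer + Zonal Statistics majority workflow)."""
    window = raster[max(0, row - radius): row + radius + 1,
                    max(0, col - radius): col + radius + 1]
    return Counter(window.ravel().tolist()).most_common(1)[0][0]

def confusion_table(majority_classes, presence, levels):
    """Tally presence/absence observations for each majority habitat class."""
    table = {lvl: [0, 0] for lvl in levels}      # class -> [n_present, n_absent]
    for cls, present in zip(majority_classes, presence):
        table[cls][0 if present else 1] += 1
    return table
```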

Exercise 3
Based on the results of the confusion matrix in Exercise 2, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations. I decided to see if the spatial pattern of oyster location data could be better described by manipulating the spatial pattern of one of the habitat parameters. I decided to change the rank values of the salinity parameter to alter the map of suitable habitat. Once I had a new output for habitat suitability, I ran the same neighborhood analysis that I used in Exercise 2 and created another confusion matrix.

Results
Exercise 1
After applying the ‘Weighted Overlay’ to the input raster layers, this map was generated as a representation of habitat suitability:

Exercise 2
This neighborhood analysis map shows the majority habitat type surrounding each data point:

This table below shows how many data points fell into each habitat type:

| Habitat suitability | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| Presence | 13 [0.72] | 4 [0.4] | 2 [0.29] | 0 [0] |
| Absence | 5 [0.28] | 6 [0.6] | 5 [0.71] | 0 [0] |
| n = 35 | 18 [1.0] | 10 [1.0] | 7 [1.0] | 0 [0] |

This confusion matrix further examines the ‘errors’ in the data, i.e., the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1, since there were no observations in this category, and type 3, since it fell between high- and low-quality habitat. Decimals reported indicate the proportion of total observations (n = 35) that fell within each category:

| Habitat suitability | 4 (high) | 2 (low) |
|---|---|---|
| Presence | 0.37 | 0.06 |
| Absence | 0.14 | 0.14 |

Exercise 3
This second neighborhood analysis map shows the majority habitat type surrounding each data point after the habitat suitability model was altered (by manipulating salinity):

This table below shows how many data points fell into each habitat type:

| Habitat suitability | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| Presence | 1 [0.5] | 18 [0.58] | 0 [0] | 0 [0] |
| Absence | 1 [0.5] | 13 [0.42] | 2 [1.0] | 0 [0] |
| n = 35 | 2 [1.0] | 31 [1.0] | 2 [1.0] | 0 [0] |

I created another confusion matrix to further examine the ‘errors’ in the data. I removed habitat suitability type 1, since there were no observations in this category. Decimals reported indicate the proportion of total observations (n = 35) that fell within each category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category:

| Habitat suitability | 4 (high) | 3 | 2 (low) |
|---|---|---|---|
| Presence | 0.03 | 0.51 | 0 |
| Absence | 0.03 | 0.37 | 0.06 |

Significance
From this analysis, I feel that these three habitat parameters do a good job of predicting the spatial distribution of oysters generally. This approach can be used as a helpful tool for guiding resource managers and shellfish biologists to likely oyster locations, but cannot be the only method for validating their presence or absence at this point. Given budget cuts or competing priorities for today’s resource managers however, a map predicting where an oyster population is might be enough for making management decisions in a pinch.

In collaboration with my advisor at ODFW, I am using the results of this spatial analysis to fill an important data gap about where and how many Olympia oysters can be found in Yaquina Bay.  This information will be used to develop a repeatable monitoring protocol to be conducted by ODFW’s Shellfish and Estuarine Assessment of Coastal Oregon (SEACOR) program. I hope that my research can also help estimate population enhancement potential by identifying possible sites for habitat restoration. With increased accuracy, I’d also like to analyze potential oyster distribution under future environmental conditions and under different management scenarios.

Recovery of a robust population of Olympia oysters is important to Yaquina Bay and the surrounding community in a number of ways. The species provides many benefits to the ecosystem, including water filtration and habitat for other marine creatures. It is culturally significant to local tribes, including the Confederated Tribes of Siletz. While Olympia oysters are not currently listed as threatened or endangered, possible listing would trigger a number of mitigation and conservation measures that will be difficult and expensive for agencies and private landowners. Additionally, there’s been some exploration that if the population can become robust again, there is potential to grow and harvest this species as a specialty food product.

Software and statistics learning
Most of the spatial analysis tools I used were familiar to me, though I had only used them in exercises for other classes. I had not previously been able to use my own data to conduct an analysis, so that was exciting, but also a bit of a learning curve. The confusion matrix was a new tool that offered a very simple and useful way to statistically assess my data.

Some issues worth noting
Throughout this process, I identified several small issues that I will need to consider or modify to improve my research analysis:

1. Regarding the neighborhood analysis, the size of the buffer around each point affects how the majority of habitat type is calculated. It might be worthwhile to try another analysis where buffers are smaller or multiple sized buffers are assessed.
2. A major issue for creating the habitat suitability map was the resolution of the base raster layers for each parameter, especially the substrate layer. The current substrate layer does not capture small patches of available substrate, so it is not a true indicator of the substrate available to oysters and likely underestimates their extent, since oysters are known to be opportunistic in finding suitable substrate.
3. The area covered by each field survey (5-min search at a randomly selected starting point) was not equal at every location. Difficulty of travel at the site impacted the rate at which the survey could be conducted. For example, extremely muddy locations or locations with a lot of large boulders made traversing through the survey site more challenging and slowed the survey team down. Additionally, more ground could be covered if little substrate was available for examination.
4. During field surveys, no distinction was made for adult or juvenile oyster, or living or dead, for determining presence/absence. Separating out this information on the map might be interesting in determining where oysters are more likely to survive long-term. For example, it has been noted that young oysters can opportunistically settle on substrates under less ideal environmental conditions, but may not be able to survive into adulthood.
5. The simplistic approach for ranking the subcategories within each habitat parameter (on a scale from 1-4) ensured that the extreme ends of the rank value were always represented; there may not always be a definitive 1 or 4 value within each category. For example, maybe high salinity and low salinity both rank as 2’s rather than deciding one is slightly more appropriate than the other.

References:

Beck, M. W., R.D. Brumbaugh, L. Airoldi, A. Carranza, L.D. Coen, C. Crawford, O. Defeo, G.J. Edgar, B. Hancock, M.C. Kay, H.S. Lenihan, M.W. Luckenbach, C.L. Toropova, G. Zhang, X. Guo. 2011. Oyster Reefs at Risk and Recommendations for Conservation, Restoration, and Management. BioScience 61: 107-116.

Groth, S. and S. Rumrill. 2009. History of Olympia Oysters (Ostrea lurida Carpenter 1864) in Oregon Estuaries, and a Description of Recovering Populations in Coos Bay. J. Shellfish Research 28: 51-58.

# Final project: Juniper spatial distribution and slope, aspect, and surrounding juniper density

1. Research Questions.
• What are the spatial patterns of western juniper within the treated watershed? (The project was later expanded to address juniper density within the treated watershed, untreated watershed, and in nearby areas.)
• What are the spatial patterns of western juniper within the study area as they relate to the spatial pattern of slope and/or aspect (as potential indicators of soil moisture) and the density of juniper in a given area (as a potential indicator of seed source)?
2. Description of dataset.
• The study took place in the Camp Creek Paired Watershed Study (CCPWS), located in central Oregon. The majority of western juniper was removed from one watershed in 2005-2006 (“treated watershed”), while mature western juniper is the dominant vegetation at the other watershed (“untreated watershed”).
• Thirty-three belt transects (3m by 30m) within the treated watershed recorded the location and size of juniper within each transect.
• Multispectral imagery was collected using a UAV for a subset of the treated watershed. Brightness values for the red, green, blue, red-edge, and near-infrared wavelengths were used. Resolution is approximately 4 cm.
• National Agricultural Imagery Program (NAIP) imagery from 2016 was used for the study area. Resolution of this dataset is 1 m.
• A 30 m Digital Elevation Model (DEM) was used to determine the slope and aspect characteristics within the study area and develop a rating model as an indicator of potential soil moisture. Slope and aspect were each rated from 1-9, with areas associated with higher soil moisture (northerly aspects, flat slopes) being assigned higher values.
• All data was projected to NAD 1983 UTM Zone 10N.
• Supervised classification (support vector machine) was used to identify juniper within the NAIP and UAV imagery.
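As an illustration of the slope/aspect rating model described above, the sketch below assigns 1-9 ratings in NumPy, with flatter slopes and more northerly aspects rating higher. The class breaks used here are illustrative assumptions; the original model's breaks are not specified in this post.

```python
import numpy as np

def moisture_rating(slope_deg, aspect_deg):
    """Rate slope and aspect rasters from 1-9 as a soil-moisture proxy.

    Flatter slopes and more northerly aspects receive higher ratings, as in
    the rating model described above. The class breaks used below are
    illustrative assumptions, not the breaks from the original model.
    """
    # Flat terrain (0 deg) rates 9; ratings drop one class per 5 deg of slope
    slope_rating = np.clip(9 - (np.asarray(slope_deg) / 5).astype(int), 1, 9)
    # Angular distance from due north: 0 deg (north) rates 9, 180 deg (south) 1
    a = np.asarray(aspect_deg) % 360
    from_north = np.minimum(a, 360 - a)
    aspect_rating = np.clip(9 - (from_north / 22.5).astype(int), 1, 9)
    return slope_rating, aspect_rating
```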
3. Hypotheses.
• I expect juniper density to be dispersed unevenly throughout the study areas based on soil moisture, with high-density patches of juniper found in areas with northern aspects and/or flatter slopes. While juniper can survive in dry, rocky areas, we would expect greater numbers where higher soil moisture allows for improved survival and growth. Patches of highest juniper density are anticipated in the areas with the highest presumed soil moisture (northerly aspects and flat slopes), with steep, southerly-facing slopes presumed to have the lowest. Assuming similar soil depth and type, higher soil moisture is anticipated on north-facing slopes because they receive less intense sunlight (in the northern hemisphere), resulting in lower soil evaporation. Higher soil moisture is presumed on flatter slopes because infiltration is assumed to be higher there, while runoff is assumed to be higher on steeper slopes.
• It is anticipated that juniper density (mature and sapling) will be higher in areas with more seed sources (as indicated by the juniper density surrounding each point of interest analyzed). Specifically, higher densities of juniper saplings may be expected near higher densities of mature juniper (such as areas near the watershed boundary). Juniper seeds may be spread through a number of means, including birds, wind, or small mammals.
4. Approaches.
• Kriging and inverse distance weighted (IDW) interpolation were used with the belt transect data (represented as individual points at the start of each transect).
• Spatial autocorrelation (Global Moran’s I) was calculated using the belt transect data (using number of juniper per transect). Spatial autocorrelation was also assessed for juniper density as it related to slope and aspect characteristics.
• Hot spot analysis (Getis-Ord Gi*) was used to assess areas of juniper concentration.
• Juniper density surrounding specific points/areas (based on NAIP imagery) was analyzed.
• Ordinary least squares (OLS) and geographically weighted regression (GWR) were used to assess juniper density as it corresponded to slope and aspect characteristics.
• A chi-squared test and logistic model were used to assess the rating model and classified NAIP rasters.
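Of the approaches above, IDW is simple enough to sketch directly. The NumPy version below is a minimal illustration (the actual analysis used ArcGIS interpolation tools, and all names here are mine): each estimate is a distance-weighted average of the transect counts.

```python
import numpy as np

def idw(xy_known, values, xy_query, power=2):
    """Inverse distance weighted interpolation of transect juniper counts.

    xy_known : (n, 2) transect start-point coordinates
    values   : (n,) juniper counts per transect
    xy_query : (m, 2) locations at which to estimate density
    """
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    d = np.where(d == 0, 1e-12, d)         # let exact hits dominate the weights
    w = 1.0 / d ** power
    return (w * values).sum(axis=1) / w.sum(axis=1)
```

A query point that coincides with a transect start returns that transect's value; a point midway between two transects returns a blend weighted toward the nearer one.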
5. Results.
• The interpolation techniques used (kriging and IDW) indicated areas of high- and low-density juniper patches within the treated watershed. However, the locations of these patches did not correspond to regions where we expected high or low soil moisture; for example, the highest juniper densities based on the transects were found in southeastern and western areas of the watershed. This may be attributed to the small dataset (n=33) used for this analysis, but may also suggest that other processes (such as seed distribution via birds or the presence of other vegetation/ground cover) influence juniper density. One hot spot was indicated using the belt transects, with a northwestern aspect and 8.8% slope, and one cold spot, with a north-northwestern aspect and an 11.5% slope.
• The hot spot analysis based on the UAV imagery indicated four hot spots, with slope values ranging from 3-11% and different aspect characteristics (northwestern, southwestern, southwestern, and southeastern).
• The initial GWR and OLS analysis did not find that slope category and/or aspect category were good explanatory variables for juniper density based on the belt transects, based on adjusted R-squared values ranging from 0.19 to 0.28.
• A weakly significant relationship was indicated between the rating model and the classified NAIP raster (p=0.018).
• To address issues with limited variance encountered in earlier analysis, an additional GWR was conducted comparing the juniper density in a 150 m buffer around 84 random points to the rating model. Aside from the buffer size, the methods used in this GWR are similar to those described in previous exercises. This improved the adjusted R-squared value for the model to 0.53.
• The results of this analysis largely did not support the proposed hypotheses that greater juniper density would be associated with topographic characteristics associated with higher soil moisture. In situ soil moisture measurements were not used in this analysis, so differences in anticipated soil moisture associated with slope and aspect could not be compared to actual soil moisture measurements.
• Greater densities of western juniper are evident in the untreated watershed compared to the treated watershed (as expected). However, patterns of juniper distribution within the treated watershed (as indicated by the belt transects) did not show any correlation between slope and aspect and areas of higher or lower juniper density. Additionally, those areas closest to the watershed boundary (and therefore closest to areas of dense, mature juniper) did not have greater juniper density than areas further away from the watershed boundary.
6. Significance.
• While the results of this analysis are of limited use to land managers at this point, a similar approach with an expanded dataset may improve its utility. The use of higher-resolution imagery (such as UAV imagery) may improve the identification of juniper saplings and the reliability of this analysis. The NAIP imagery used here may be sufficient for identifying mature juniper, but small (sub-meter) juniper saplings may not be accurately detected or may be misidentified. Additionally, more ground-based data (e.g., additional belt transects distributed throughout both watersheds) could provide greater information about juniper density at this study site.
• Past research has found that juniper encroachment is associated with reduced undercanopy soil moisture [1]. The focus of this analysis was on the relationship between slope and aspect (as indicators of soil moisture) and juniper density, but further analysis may focus on how soil moisture varies with juniper density. Additionally, this analysis could also assess non-juniper vegetation density as it relates to juniper density, soil moisture, and other watershed characteristics.
• The results suggest that other factors may be influencing juniper density in addition to slope and aspect characteristics but further analysis is needed. It should also be noted that historically western juniper was largely found in rocky outcroppings and other areas protected from fire. However, fire suppression and other factors such as livestock grazing and climate change may be contributing to an increase of juniper density [2]. Future analysis may also focus on how these factors (e.g., disturbance related to land use, etc.) are related to juniper density over larger scales.
7. My learning: Software.
• Some difficulty was encountered when comparing raster and vector data.
• Due to the limited dataset and differences in resolution, limited conclusions could be drawn from the data. However, the methods could be applied to other datasets.
• Some difficulty was encountered using ArcGIS 10.6/Pro, but issues were resolved by switching between versions.
• Once packages and data were added, raster analysis in RStudio was relatively simple and efficient for this analysis. The R-ArcGIS bridge could potentially simplify this process, allowing for more complicated analysis.
• I was able to reinforce some basic skills in R and ArcGIS as well as gain exposure to new tools (such as Ripley’s K and hot spot analysis). In particular, I gained more exposure to using ModelBuilder in ArcGIS Pro for extraction and analysis tasks that would be time consuming and inefficient to perform otherwise.
8. My learning: Statistics.
• I did not have previous experience with several of the techniques used in this analysis, including hot spot analysis and OLS/GWR. This analysis demonstrated how these methods could be applied in the future, as well as their potential limitations (e.g., small datasets, clustered data, etc.). I would be interested in applying the hot spot analysis in particular to a larger dataset and to higher-resolution imagery for analyzing vegetation density.
• I have used several kriging approaches and IDW to a limited degree in the past. Similar limitations to those cited above were experienced, but this approach may be applicable to in situ soil moisture measurements in the watershed.
• In addition to the methods used in this analysis, the tutorials helped reinforce concepts related to other methods, including principal component analysis, Ripley’s K, and qualitative analysis using maps.
• A major takeaway of this analysis was the potential ways that different statistical approaches can be combined to perform analysis (e.g. using a combination of OLS, Moran’s I, and GWR, etc.).

References

1. Lebron, I.; Madsen, M.D.; Chandler, D.G.; Robinson, D.A.; Wendroth, O.; Belnap, J. Ecohydrological controls on soil moisture and hydraulic conductivity within a pinyon-juniper woodland. Water Resour. Res. 2007, 43, W08422.
2. Soulé, P.T.; Knapp, P.A.; Grissino-Mayer, H.D. Human Agency, Environmental Drivers, and Western Juniper Establishment During the Late Holocene. Ecol. Appl. 2004, 14, 96–112.

# Final Project: Characterizing the exclusion of refugee settlements from global settlement datasets

Final Project Blog: Characterizing the exclusion of refugee settlements from global settlement datasets

Anna Ballasiotes

RESEARCH QUESTION

This research is a small part of a larger question: where are refugees? That, in turn, is the basic question behind a larger problem: how will the global community provide resources to those who need help? To do so, understanding where these vulnerable populations exist is particularly important. In considering the hierarchy of understanding where IDPs and refugees exist, there are certain drivers of the creation of a ‘refugee population,’ notably conflict (most often varying levels of violent conflict) and, more recently, climate events (including natural disasters). Beyond these drivers, there are different typologies of refugee populations: refugees within cities blending into existing infrastructure, refugees in settlements and camps close to borders, refugees in settlements close to urban centers, and refugees in settlements and camps in areas where populations have not been established and land is therefore available. My focus lies largely in the last item of that list: areas where populations have not been formally established but camps and settlements are now expanding. It is a known problem that global settlement datasets do not do an adequate job of recognizing these camps and settlements as “population.” As such, my goal was to characterize how these global settlement datasets fail to recognize these camps and settlements, thereby excluding these populations. This asks the question: to what extent do global settlement datasets fail to detect known refugee camps and settlements that form in areas of low population density?

I originally sought to answer the question, “What about the classification metrics of global settlement datasets fails to detect settlements known to OSM?” I wasn’t able to truly answer this question because I needed to spend more time actually characterizing this exclusion before understanding why these global population datasets were unsuccessful. Characterizing where this exclusion occurred was difficult enough and took most of the class, especially as I explored what these refugee settlements looked like and considered how they may have formed.

DATA

I used three main data sources: Refugee settlement boundaries from OpenStreetMap, Facebook High Resolution Settlement Dataset, and the World Settlement Footprint. Only data within Uganda was used for these data, given the overlap of data availability and project partners with reliable data. The Facebook High Resolution Settlement Dataset is trained on 2018 imagery and is presented at a 30 meter resolution. World Settlement Footprint is trained on 2015 imagery and is also presented at 30 meter resolution, but with a minimum mapping unit mask that creates clusters of pixels that make the data appear to have a coarser resolution. The refugee settlement boundaries from OSM range in temporality from 2016 to 2019; however, this is only when the boundaries were added to the OpenStreetMap database and is not necessarily when camps or settlements were established. I had initially included the Global Human Settlement Layer, which is also trained on 2015 imagery, but GHSL did not appear to detect any settlement in the refugee camp boundaries within OSM, so I was not able to perform any meaningful analysis with it.

HYPOTHESES: predictions of patterns and processes.

I wanted to understand in what ways the global settlement datasets excluded the known refugee settlements: what percentage of each refugee settlement was captured, whether larger or smaller settlements showed a higher percentage of captured area, and whether settlements with a higher percentage of captured area showed a faster green-up moving away from the boundary. I suspected that smaller settlements would have a higher percentage of area captured, mostly because I predicted that smaller settlement boundaries conformed more closely to the actual built-up settlement area, whereas larger settlements included more land cover types that would be less likely to be detected as “built-up.” I also suspected that settlements with a higher detection percentage would have a more distinct natural boundary, with the green-up around the settlement occurring faster if a higher percentage was detected.

APPROACHES: analysis approaches used.

I had initially measured the spatial autocorrelation of the refugee settlements using ArcGIS Pro, but did not find that this analysis provided much meaningful information. I already knew that my settlements differed considerably in size; any clustering would have come from factors that are not obviously visible, like the way land was distributed or delegated for refugee settlements.

In order to characterize the percentage of area of refugee settlements captured by each settlement dataset, I brought everything into Earth Engine and calculated pixel area of every pixel both within the refugee boundaries and across WSF and Facebook datasets — counting only the pixels that have area detected as settlement.
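The pixel-counting step can be sketched outside Earth Engine as well. The following Python fragment is illustrative only (the actual analysis used Earth Engine's per-pixel area calculations); it assumes the refugee boundary and a settlement dataset such as WSF have been rasterized onto the same aligned grid, with 1 marking settlement pixels. All names are my own, not from the original workflow.

```python
# Sketch of the percent-area calculation, assuming the refugee boundary and a
# settlement dataset (e.g. WSF) are rasterized to the same aligned grid.
# 1 = settlement pixel, 0 = no settlement; the boundary mask marks pixels
# inside the refugee settlement polygon. Names are illustrative.

def percent_detected(boundary_mask, settlement_grid):
    """Percent of in-boundary pixels flagged as settlement by the dataset."""
    inside = detected = 0
    for mask_row, data_row in zip(boundary_mask, settlement_grid):
        for in_boundary, is_settlement in zip(mask_row, data_row):
            if in_boundary:
                inside += 1
                detected += is_settlement
    return 100.0 * detected / inside if inside else 0.0

# Toy example: a 4x4 boundary covering the whole grid, 4 detected pixels.
mask = [[1] * 4 for _ in range(4)]
wsf = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
print(percent_detected(mask, wsf))  # -> 25.0
```

With equal pixel sizes, counting pixels is equivalent to summing pixel areas, which is why the toy version can skip area weighting entirely.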

In order to refine the comparison of area detected within refugee settlements, I also needed to look only at pixels that appear “built-up” within the refugee settlements, since the boundaries themselves contain quite a lot of green space and other land cover types. I also performed a binary unsupervised classification based on the known locations of refugee settlements. Ultimately, I used a confusion matrix to compare this binary unsupervised classification to WSF and Facebook to refine the “percent area detected.”
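Once the grids are aligned, a confusion matrix between two binary rasters reduces to four pixel counts. A minimal Python sketch (illustrative names; the real comparison was done in Earth Engine):

```python
def confusion_matrix(reference, comparison):
    """2x2 confusion counts between two binary grids (flattened pixel lists).

    reference: e.g. the binary unsupervised classification;
    comparison: e.g. WSF or Facebook settlement pixels.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for ref, comp in zip(reference, comparison):
        if ref and comp:
            counts["tp"] += 1       # both call it settlement
        elif comp:
            counts["fp"] += 1       # only the comparison dataset does
        elif ref:
            counts["fn"] += 1       # only the reference does
        else:
            counts["tn"] += 1       # neither does
    return counts

# Toy example with six pixels.
ref = [1, 1, 0, 0, 1, 0]
wsf = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(ref, wsf)
agreement = (cm["tp"] + cm["tn"]) / len(ref)   # overall agreement fraction
```

The "percent area detected" refinement then follows from tp / (tp + fn), the share of reference settlement pixels the global dataset recovers.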

I also looked for the relationship of detection and greenness moving away from the boundary. To do this, I created a multi-ring buffer out to 1 kilometer at 100 meter increments, calculated the average maximum NDVI in each “donut buffer” for 2018 using Landsat 8 imagery. I then plotted these NDVI values over increasing distance from the refugee boundary to see the green-up.
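As a rough sketch of the donut-buffer averaging, assuming 30 m pixels and a single center point rather than the true settlement boundary (the real analysis buffered the boundary polygon and used Landsat 8 maximum NDVI), the per-ring means can be computed like this; all names are illustrative:

```python
import math

def ring_means(ndvi, center, pixel_size=30.0, ring_width=100.0, max_dist=1000.0):
    """Average NDVI in concentric distance rings around a center pixel.

    ndvi: 2-D list of NDVI values; center: (row, col). Rings are 100 m wide
    out to 1 km, mirroring the multi-ring buffer in the post.
    """
    n_rings = int(max_dist / ring_width)
    sums = [0.0] * n_rings
    counts = [0] * n_rings
    cr, cc = center
    for r, row in enumerate(ndvi):
        for c, value in enumerate(row):
            d = math.hypot(r - cr, c - cc) * pixel_size  # distance in meters
            ring = int(d / ring_width)
            if ring < n_rings:
                sums[ring] += value
                counts[ring] += 1
    # None marks rings with no pixels in this toy grid.
    return [s / n if n else None for s, n in zip(sums, counts)]
```

Plotting the returned list against ring distance gives the green-up curve described above.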

RESULTS

I produced a map of the spatial confusion of overlap between WSF, Facebook, and a binary unsupervised classifier within my settlement boundaries. I have refined this map a bit, but there are still further statistical comparisons that could be done to better characterize the overlap of WSF, Facebook, and the K-Means classifier.

I also have all of the settlement boundaries with the percent of area captured by each global dataset within each boundary: this could be a sorting mechanism in which I could perform further analysis to better answer my hypothesis, which I didn’t necessarily answer well, specifically with regards to the NDVI prediction. Within the exercises, I answered this question based on geography rather than percent detection: did settlements in the north vs. south show different patterns of vegetation green-up around the settlement?

Ayilo 1 & 2 Refugee Settlements; Classifications shown

SIGNIFICANCE: What did you learn from your results? How are these results important to science? To resource managers?

I learned that across the board, refugee settlements ARE systematically excluded from global datasets but can easily be detected at the local scale, and thus this systematic bias of these datasets must be addressed and solved. Those that are modeling systems of migration or providing access to humanitarian resources need to have more accurate representations of settlements. Beyond this, I was really hoping to be able to characterize the spatial patterns of the exclusion better than I was able to, and I think that’s where my analysis fell short and where my next hypothesis would manifest.

LEARNING: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other? What about statistics? including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I didn’t truly dive more deeply into these software packages than I have in the past; I think I just became more acutely aware of the statistical methods that exist in Arc toolboxes and became more comfortable manipulating data in R.

I think that the statistical learning was less specific or intense than I was hoping, but also in part because I think I was really challenged to think statistically and spatially-statistically about my research question. I wish I had spent more time understanding statistical methods rather than trying to mold my question into something that seemed to fit my elementary understanding of spatial statistics.

# Cluster analysis of watershed lithology in Oregon coastal systems

The question that I asked for Exercise 3 was: how similar are the lithological profiles of Oregon coastal watersheds?

I used cluster analysis to answer this question. Broadly, cluster analysis is a way of grouping data that minimizes within-group variance and maximizes between-group variance. There are many ways of performing a cluster analysis, usually involving statistical optimization computations. For this exercise, I used hierarchical clustering, which builds a hierarchical grouping network based on the distribution of numerical attributes in a dataset. My dataset was the lithological profile of 35 watersheds in the Oregon Coast Range. The lithological profile refers to the percentage of each lithology, including metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic types.

Methods

1. Computing lithological profile for each watershed
1. Downloaded “Oregon Geologic Data Compilation – 2015” from Oregon Spatial Data Library. https://spatialdata.oregonexplorer.info/geoportal/details;id=e71e1897f5864b689a3a4a131287a309
2. Used ArcMap to dissolve the geology layer based on the GEN_LITH_TY field, which classified each geological type into either: metamorphic, plutonic, sedimentary, surficial sediments, tectonic, or volcanic.
3. Used ArcMap to intersect geology data with study watersheds. The Intersect tool subsets the dissolved lithology layer to only include data bounded by the study watersheds.
1. Ran Intersect on multiple different shapefiles: wshd4, wshd5, and all Coos watersheds
2. The output of the Intersect tool is a shapefile with each discrete lithological type as a single feature. As a result, there was often more than one polygon per lithology in each watershed.
4. Used Calculate Geometry tool in ArcMap attribute table to calculate area (in square kilometers), and the latitude and longitude of each watershed polygon centroid.
5. Exported attribute tables of each output shapefile and standardized columns in Excel. Exported as CSV and reformatted table in R so that rows demonstrated each watershed and their lithological profile. Used dplyr to group each lithological type and sum percentage. Used tidyr (spread function) to make each lithology a column.
2. Cluster analysis
1. Used the daisy function in R (package: cluster) to compute all pairwise dissimilarities between the 35 study watersheds. The dissimilarity distances were calculated using the Euclidean method.
2. Performed hierarchical clustering analysis (HCA) on the dissimilarity matrix computed in Method 2a. I used the agnes function to compute the HCA using Ward’s method, which merges the pair of clusters that produces the minimum increase in within-cluster variance, calculated from squared Euclidean distances.
3. Validated hierarchical clustering procedure using eclust (package: factoextra) and fviz_silhouette. The latter function is just for visualizing the results from the former, the cluster validation.
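The agglomerative logic behind the daisy()/agnes() workflow can be illustrated with a simplified Python sketch. This toy version uses average linkage rather than Ward's method, and two-lithology profiles instead of the real six-lithology data, but the core idea is the same: start from singletons and repeatedly merge the closest pair of clusters.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_cluster(profiles, n_clusters):
    """Agglomerative clustering of lithology-percentage vectors.

    Simplified average-linkage stand-in for the daisy()/agnes() workflow:
    closeness of two clusters = mean pairwise Euclidean distance between
    their members. (The post used Ward's method, which differs in detail.)
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Toy profiles: (% sedimentary, % volcanic) for five watersheds.
profiles = [(100, 0), (95, 5), (90, 10), (10, 90), (5, 95)]
groups = hierarchical_cluster(profiles, 2)  # sedimentary vs volcanic group
```

Cutting the merge sequence at two clusters mirrors the two main branches the dendrogram in Figure 2 produced.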

Results

The HCA resulted in a fairly successful clustering with an agglomerative coefficient of .96. The agglomerative coefficient is a measure of the clustering structure, with values closer to one indicating a stronger, more distinct clustering structure. The cluster validation demonstrated that all but two watersheds were adequately grouped (Figure 1).

Figure 1. Results from the cluster analysis silhouette plot.

Figure 2. Dendritic hierarchical clustering results. Three numbers are written below each final branch: the top is the watershed ID, the middle is the percent sedimentary lithology (blue), and the bottom is the volcanic geology (purple).

The HCA resulted in two main groups with many sub branches (Figure 2). By comparing the lithology percentages to the clusters, I determined that one main branch includes watersheds dominated by volcanic lithology and the other branch includes watersheds dominated by sedimentary lithology. Eight of the 35 watersheds are primarily volcanic, 26 are primarily sedimentary, and one watershed does not fit this classification (Watershed 31: 85% surficial sediments).

Critique of Method

This method was very useful for me because it helped me characterize the lithology of my study watersheds in a quantitative way, and then identify the dissimilarities between the lithological profiles of each watershed. The results from this exercise will advance my ability to compare hydrologic regime characteristics across coastal watersheds and the driving controls on such hydrological processes. For example, the dendrogram in Figure 2 shows that there are five watersheds with 100% sedimentary lithology and five watersheds with predominantly volcanic geology. Such statistics will help me stratify my study design and guide my flow regime analysis.

# Final: Stream Change Recap

My project focused on a set of grain size and cross-sectional change data from a long term research project in the Andrews Forest. (Maps and study description from Blog Post 1)

I wanted to explore two questions:

• How do stream channel erosion, deposition, and particle size vary over time and space?
• How do these changes relate to adjacent changes in the across-stream or along-stream direction?

Hypotheses:

1. Existing stream bed morphology drives spatial patterns in cross-sectional change.
1. Unit-scale stream morphology should lead to patchy transport and different levels of autocorrelation of the change at different spatial scales
2. Either hillslope or hydrodynamic processes or both should result in relatively high cross sectional change near banks.
2. Extreme flow events alter the size distribution of in-channel sediment
1. High energy flows associated with high peak discharge events should increase median grain size
2. Transport of hillslope material (also associated with high peak discharge events) should decrease sorting
3. Extreme flow events reduce the relative impact of bed morphology in cross-sectional change.

Here’s a conceptual model to investigate the patterns of change:

Overall, I expected that different processes would drive change at different scales, so I expected the spatial and temporal patterns of change to vary based on scale as well.

Spatially:

• At reach scales: patterns of change are driven by reach level features including slope, watershed size, and land use history.
• At unit scales and smaller, patterns of change are driven by unit-scale bed morphology and positions of adjacent features.

Temporally:

• During time periods that include extreme flows, patterns of change are driven by the magnitude of the flow.
• During time periods of less extreme flows, patterns of change are more dependent on local features as described above

Approaches:

Due to the study design and to limits in data availability, it made sense for me to use one-dimensional analysis methods and correlation or autocorrelation tools on much of the data. I mostly worked in R. I also started but didn’t finish work on a network model using a tool specifically designed for stream network analysis.
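For the one-dimensional autocorrelation analyses, the core computation is just the sample autocorrelation at a given lag. A self-contained Python version for context (the actual work used R's correlation tools; variable names are illustrative):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a given lag.

    For cross-sectional change data, the series might be change values at
    evenly spaced points across the channel; lag is in units of the spacing.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Alternating scour/fill pattern: strong negative lag-1 autocorrelation.
change = [1, -1, 1, -1, 1, -1]
ac1 = autocorrelation(change, 1)
```

Comparing such coefficients across lags is the 1-D analogue of the "different levels of autocorrelation at different spatial scales" in hypothesis 1.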

Technical limitations:

1-dimensional models don’t perfectly capture the complex realities of stream networks and irregularly spaced spatial data, so I ran into a few hurdles trying to parse analysis results. I also couldn’t do some of my desired analyses due to the lack of shared coordinate systems for different data sets or for adjacent cross sections. However, thinking about these analyses helped me better understand exactly which data I would need to collect in order to do them in the future.

I am still having some trouble with a stream network analysis add-on in ArcGIS, and I am in the process of reaching out to experienced users who I hope know the way around my particular error.

Results:

I started the stream network analysis partly because I wanted more context for the results of exercise 3. Some 2017 REU students collected grain size data at sites that overlapped with the long term cross sectional study sites. I haven’t finished the network analysis, but I’ve mapped the REU student data below.

Here is a map of the median grain size (D50) at the REU student sites:

Here is a map of half the difference between the 84th and 16th percentile grain size at the REU student sites. This value is one metric of sorting, and it would equal the standard deviation if the sizes were normally distributed.
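Both mapped quantities come straight from the grain size distribution: D50 is the median, and the sorting metric is half the spread from D16 to D84. A small Python sketch with linear-interpolated percentiles (illustrative; the REU data processing may have used a different percentile convention):

```python
def percentile(sorted_sizes, p):
    """Linear-interpolated percentile of an already sorted list (p in 0-100)."""
    idx = (len(sorted_sizes) - 1) * p / 100.0
    lo = int(idx)
    hi = min(lo + 1, len(sorted_sizes) - 1)
    return sorted_sizes[lo] + (sorted_sizes[hi] - sorted_sizes[lo]) * (idx - lo)

def d50_and_sorting(grain_sizes_mm):
    """Median grain size and the (D84 - D16) / 2 sorting metric."""
    sizes = sorted(grain_sizes_mm)
    d16, d50, d84 = (percentile(sizes, p) for p in (16, 50, 84))
    return d50, (d84 - d16) / 2.0

# Toy pebble-count sample: sizes 0..100 mm.
d50, sorting = d50_and_sorting(list(range(101)))
```

As the post notes, this metric would equal the standard deviation if the sizes were normally distributed, which is why it serves as a sorting proxy.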

These results indicate that grain sizes and sorting in the grain sizes display more complex patterns than a simple fining of grain size downstream or consistent gradations in sorting.

For all of the analyses, I need to spend a little more time figuring out which of these results are usable and might reflect reality in the field, and which ones are more likely showing artifacts of data collection or processing.

Significance:

I hope to use these results to add more depth to my research and place my research in a broader context. I also personally learned more about the strengths and limitations of my data sets.

The broader purpose of the entire project is to see how streams like these respond over time. How much of their change is influenced by hydrologic rather than hillslope processes? How big a hydrologic event needs to happen to do substantial work on the channel? The project could be useful to land managers because channel mobility could damage streamside infrastructure or alter benthic ecology.

My learning (technical):

From a technical standpoint, I think I improved my ggplot skills. I started using cowplot and learned more about custom themes. I learned a lot about preparing data using the STARS toolbox, and I am excited to learn more about the SSN R package once I get my data working.

My learning (statistics):

# Examining the Spatial Patterns of NDVI Change near ASGM and non-ASGM villages in rural Senegal

Question

My research question for this project was how spatial patterns of land use/land cover (LULC) change are related to and/or associated with spatial patterns of artisanal, small-scale gold mining (ASGM) establishment. A possible mechanism influencing these patterns is deforestation associated with the establishment of ASGM operations, as miners require significant amounts of timber to bolster mine shafts, in addition to constructing shelters around the mine shafts and for the miners themselves. As such, my research question can be phrased as follows: “How is the spatial pattern of LULC change related to the spatial pattern of ASGM establishment via the mechanism of deforestation?” Additionally, I wanted to examine whether alternative factors may affect LULC change around sites; in this case, I examined the proximity of a site to a primary road and whether or not that related to rates of LULC change.

Dataset

My dataset for this project had several components. In previous work, I used very high resolution satellite imagery to map out the locations and spatial extents of nine ASGM sites, as well as nine non-ASGM sites for comparison (these sites are shown in fig. 1). After having mapped these sites, I acquired Landsat 5 and Landsat 8 imagery at 30 m spatial resolution from the USGS for April 2007 and April 2018, respectively. Using this Landsat imagery, I calculated normalized difference vegetation indices (NDVIs) for each year, and then subtracted the former from the latter to obtain a dataset showing NDVI loss between the two dates. This dataset enabled me to examine NDVI change around each of my 18 sites to evaluate LULC change in the form of NDVI change. In addition to these datasets, I also acquired a road network layer for Senegal from OpenStreetMap, which I filtered to include only “primary” roads in Kedougou. Together, these datasets enabled me to investigate LULC change around ASGM sites via NDVI change, and also how this change relates to proximity to primary roads.

Fig 1: Study Site

Hypotheses

My two hypotheses for this project are that 1) NDVI would be more negatively impacted near ASGM sites compared to non-ASGM sites as a result of deforestation, and 2) NDVI losses at ASGM sites could be attributed in part to each site’s proximity to primary roads.

Patterns of higher NDVI loss as a percentage of the total would occur at ASGM sites compared to non-ASGM sites. This may be because, as the larger populations near ASGM sites and also the activity itself would necessitate more timber harvesting, there would be more negative NDVI change in the immediate surroundings compared to sites without ASGM activity. The spatial process I would expect to see would be areas of higher NDVI loss nearer to the center of ASGM villages. These areas would be fragmented, indicating differences in vegetative cover around the village, but, overall, more negative than around non-ASGM villages. This pattern results from miners and villagers harvesting the nearby timber to bolster mine shafts, as well as for construction of structures around each mine and their own huts. This hypothesis is an explanation for the pattern of NDVI loss around a village.
ASGM sites may be, overall, closer to primary roads than non-ASGM sites, and so their growth and attendant NDVI loss is in part attributable to proximity to major travel corridors and the relative ease of reaching the sites. If a mine is closer to a primary road than another mine, it’s possible that this proximity may result in larger spatial patterns of NDVI loss. This hypothesis is an explanation for the process of deforestation around different sites.

The first hypothesis here is an attempt at understanding what spatial patterns may be occurring around ASGM sites; that is, it is an attempt to understand what NDVI losses may look like. The second hypothesis is an attempt to understand why those patterns may be occurring, and in finding some sort of correlation between the observed pattern and an alternative possible explanation for that pattern.

Approaches
Exercise 1: Moran’s I

Before examining how patterns of NDVI loss are shaped around ASGM and non-ASGM sites, I first wanted to examine how the patterns of NDVI loss and gain were self-related; that is, I wanted to know whether patterns of NDVI change occur near areas of similar change (spatially autocorrelated, or clustered) or away from self-similar patterns (dispersed). To do this, I took my layer of NDVI change and ran a script in RStudio to generate a Moran’s I value for the global layer, which is a measure of spatial autocorrelation: it results in values between -1 and 1, where values closer to -1 indicate negative spatial autocorrelation (dispersion) and values closer to 1 indicate positive spatial autocorrelation (clustering). I also performed this analysis at a smaller scale around one mining village to see how Moran’s I may change at a different scale.
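The Moran's I computation itself can be sketched in standalone Python. This toy version uses binary rook-adjacency weights on a small raster, which is one of several possible weighting schemes; the R script used for the actual analysis may have defined neighborhoods differently.

```python
def morans_i(grid):
    """Global Moran's I for a 2-D raster using rook (edge-sharing) neighbors
    with binary weights. Values near 1 indicate clustering, near -1 dispersion."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = 0.0
    w_sum = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    num += (grid[r][c] - mean) * (grid[nr][nc] - mean)
                    w_sum += 1
    return (n / w_sum) * num / denom

# A checkerboard is perfectly dispersed; a striped grid is clustered.
dispersed = morans_i([[1, 0], [0, 1]])
clustered = morans_i([[1, 1, 1], [1, 1, 1], [0, 0, 0]])
```

On these toy grids the checkerboard yields a strongly negative value and the striped grid a positive one, matching the interpretation given above.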

Exercise 2: Neighborhood Analysis

After determining the spatial autocorrelation of patterns of NDVI change, I could examine how those patterns were related to and varied between ASGM sites and non-ASGM sites. To do this, I performed a neighborhood analysis. At each village, I found its center using the Mean Center tool in ArcGIS Pro. Around each center point, I generated “donut” buffers using the Buffer tool at 250 meter increments, up to 2 kilometers, for eight different donuts (e.g. 0-250m, 251-500m, 501-750m, etc.). I then used each of these buffers to Clip the global NDVI layer (which I had also edited to omit values over rivers and Mali), generating eight raster layers representing NDVI change values between each spatial step. From there, I simply generated histograms for each step and counted the values below 0, indicating NDVI loss from 2007 to 2018, and found their percentage of the total. With these values, I could then see how NDVI loss as a percentage of total pixels in each donut varied with distance from the center of each village, and could more readily understand how NDVI values change with distance from both ASGM and non-ASGM villages.
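The per-donut summary reduces to counting the share of negative pixels in each ring. A minimal Python sketch, assuming the clipped NDVI-change values for each ring have already been extracted into lists (the actual clipping was done in ArcGIS Pro):

```python
def percent_loss_by_ring(ring_values):
    """Percent of pixels with NDVI change < 0 in each donut buffer.

    ring_values: one list of NDVI-change pixel values per ring,
    ordered outward from the village center (0-250 m, 251-500 m, ...).
    """
    return [100.0 * sum(1 for v in values if v < 0) / len(values)
            for values in ring_values]

# Toy example: inner ring half loss, outer ring no loss.
loss_curve = percent_loss_by_ring([[-0.1, 0.2, -0.3, 0.4], [0.1, 0.2]])
```

Plotting the returned percentages against ring distance reproduces the distance-decay curves compared between ASGM and non-ASGM villages in fig. 3.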

Exercise 3: Pearson’s R Correlation

After examining how NDVI changes across space near ASGM and non-ASGM villages, I wanted to further understand how these patterns may be related to another factor: proximity to a primary road. To do this, I used my road network layer from OpenStreetMap showing the locations of all primary roads in Kedougou and used the Near tool in ArcGIS Pro to find the shortest distance from each of my 18 sites to the nearest primary road. Then, I took only the first ring’s values of NDVI change (from 0 m to 250 m), and, using these values along with each site’s distance to the nearest primary road, I performed a Pearson’s R calculation (fig. 2). This approach thus attempts to explain how values of NDVI loss around both ASGM and non-ASGM sites may be related to a site’s proximity to a primary road, providing more information on how different rates of NDVI loss may be affected by the ability of people to access a particular village.

Fig 2: Pearson’s R Correlation. x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.
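The same correlation can be computed directly from its definition. A small Python version of Pearson's r, equivalent to the formula shown in fig. 2 (variable names are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between paired samples x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: loss falls off perfectly with road distance, so r = -1.
road_distance_m = [100, 200, 300]
ring1_ndvi_loss = [30, 20, 10]
r = pearson_r(road_distance_m, ring1_ndvi_loss)
```

Values near zero, as found below, indicate the two quantities vary almost independently.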

Results

I had several results from these three approaches.

Firstly, I produced a global Moran’s I of 0.6716816. The second Moran’s I produced, at a smaller scale, was .8079745. These results helped me to understand that, generally, patterns of NDVI loss are spatially autocorrelated, meaning that areas of loss and growth are clustered near self-similar areas. However, this analysis occurred over such a large area that it is possible the global Moran’s I may be accounting more for patterns near riparian corridors, which are of course autocorrelated, especially in areas where water bodies are visible. As such, this result needs to be interpreted with caution.

Secondly, the results from my neighborhood analysis are presented in fig. 3 [insert graph of NDVI changes for fig. 3; include key in caption]. These results indicate that, on average, villages which have ASGM activity present see more NDVI loss in their immediate surroundings than non-ASGM villages, up to about 2 km radially from their center. This means that, generally, villages with ASGM activities experience reduced NDVI, which may indicate vegetation loss or a loss of healthy vegetation, compared to villages without ASGM present.

Fig 3: NDVI results. Light blue lines indicate ASGM mines; the dark blue line is their average. Light red lines indicate non-ASGM villages; the dark red line is their average.

Finally, the results of my Pearson’s R correlation indicate relatively weak correlations between a village’s NDVI loss (at least, within the first ring) and its proximity to the nearest primary road. The results are shown in table 1 [insert table screen snip]. For ASGM villages, the correlation was -0.1418, whereas non-ASGM villages have a Pearson’s R of -0.0578. While both of these are weak correlations, there is some indication that ASGM villages and their attendant NDVI losses are more correlated with proximity to a primary road (i.e., the closer they are to a primary road, the more NDVI loss they experience) than non-ASGM villages.

Table 1: Pearson’s R Results

Significance

These results are significant for several reasons. They showed me that there is an observable pattern of NDVI loss that is greater around villages which develop or engage in ASGM than around villages which do not. However, there is still some work to do in order to understand the process by which NDVI varies, and it may not be explainable by a village’s proximity to a primary road. There are several possibilities that may explain the process: deforestation is more prevalent around the gold mines for reasons to do with securing timber for construction, or because an inherently larger population at gold mines necessitates more forest product harvesting. More work should be undertaken to understand these processes. On the whole, however, these results are useful for governments or NGOs which seek to improve the conditions in which ASGM practitioners live and work, now that they know that ASGM villages experience greater rates of NDVI loss, which may indicate environmental degradation.

Learning

This process taught me several techniques. Firstly, I had no experience at all with R and RStudio, though I do understand some coding. As such, this helped me learn more about a really useful language that I did not have any experience with. Secondly, any practice with ArcGIS Pro is good practice for me, and so I got a lot of time and work done using that piece of software. In particular, I got some good time in with the model builder to help expedite work, as generating eight different ring buffers and then clipping them to an NDVI layer for 18 different villages would have been very time-consuming otherwise. Finally, I gained some good experience performing a statistical analysis, in the form of Pearson’s R. I have very little experience performing statistics, so this was really useful in helping me understand and interpret statistical analyses and results.

Statistics

With regard to statistics, I learned several things. The three statistical analyses I used were Moran’s I for spatial autocorrelation, neighborhood analyses for comparing spatial patterns, and Pearson’s R for calculating a simple statistical correlation.

With regards to Moran’s I, I learned the importance of examining patterns at different scales, as a pattern which is apparent at one scale may not be so at another scale, and so interpretation and set-up must be conducted carefully.

Concerning neighborhood analyses, I learned the importance of a diligent workflow and juggling several different variables without muddying the results. This also demonstrated how to examine and structure a problem which seeks to understand change both across space and across time.

Finally, conducting a Pearson’s R helped me to understand better how statistical analyses are constructed and what their significance is. Additionally, this taught me the value of interpreting a statistical result, as their significance may not always be indicative of anything. Null results or negative results are just as valuable as positive ones, and this analysis showed me that. Now I know more about what process to examine and how to examine them!

# Final

Question

How are the spatial patterns of individuals’ perception of natural resource governance related to the spatial patterns of environmental restoration sites via distance and abundance of improved sites?

Datasets

Puget Sound Partnership Social Data

My data came from a survey on subjective human wellbeing related to natural environments. The sample was a stratified random sample (28% response, n= 2323) of the general public from the Puget Sound in Washington from one time period. I am specifically examining questions related to perceptions of natural resource governance. Around 1730 individuals gave location data (cross street and zip code) which have been converted to GPS points. I also have demographic information for individuals, their overall life satisfaction, and how attached and connected they feel to the environment.

Environmental Data

I have three different forms of environmental data: (1) point locations of restoration sites in the Puget Sound; (2) a shapefile of census-tract-level average environmental effects (the environmental condition ranked from 1, good, to 10, very poor); and (3) land cover raster files from 1992 and 2016.

• Individual restoration sites numbered over 12,000. They spanned many years, and point locations overlapped significantly through time.
• The environmental effects data comes from an online mapping tool that was created in a collaboration between the University of Washington, the Washington Department of Ecology, and the Washington Department of Health. Environmental effects are based on lead risk from housing, proximity to hazardous waste treatment, storage, and disposal facilities (TSDFs), proximity to National Priorities List facilities (Superfund sites), proximity to Risk Management Plan (RMP) facilities, and wastewater discharge. The combination of these effects was aggregated within census tracts to produce an environmental effects ‘Rank’ from low effects (Rank of 1) to high effects (Rank of 10).
• The land cover files contained 25 different types of land cover. I reclassified these land cover types by combining similar types of land cover (e.g. deciduous forest and conifer forest to ‘forest’).

Hypotheses

1) Shorter distances between individuals and restoration sites, and greater number of restoration sites near individuals, would correlate positively with governance perceptions

2) Positive environmental conditions will correlate positively with governance perceptions

Richard Petty and John Cacioppo developed the elaboration likelihood model (1980), which describes how individuals change their attitudes in response to persuasive stimuli. Petty and Cacioppo develop this idea through two routes of persuasion, the central and peripheral paths, where the route taken is determined by motivation, ability, and the nature of mental processing. The purpose of this theory is to help explain how individuals elaborate on ideas to form attitudes and how strong, or long lasting, those attitudes are. This theory may suggest that individuals will form stronger attitudes when the reasons to form them are stronger and more immediate. Stimuli of this kind tend to be nearer to individuals, as processing of issues far away is cognitively difficult. Research has shown that individuals have a difficult time thinking about problems at larger spatial scales, and there is an inverse relationship between feelings of individual responsibility for environmental problems and spatial scale: as scale gets larger, feelings of responsibility go down (Uzzell, 2000).

Cognitive biases also suggest that people use heuristics to answer questions when they are unsure. One common heuristic is the saliency bias, where individuals overemphasize things that are emotionally more important and ignore less interesting things (Kahneman, 2011). This may suggest that spatial patterns are more important than temporal patterns, because what is happening now is more salient to individuals than what happened in the past. This may imply that even if environmental outcomes have improved over time, the immediacy of the current environment, or of things influencing the environment, may have a greater effect on individual perceptions.

Approaches

Statistical Analyses

Spatial Autocorrelation Analyses

Moran’s I: To compute Moran’s I, I used the “ape” library in R. I also subset my data to examine spatial autocorrelation by demographics including area (urban, suburban, rural), political ideology, life satisfaction, income, and cluster (created by running a cluster analysis on the seven variables which comprise the governance index).

Correlograms: I created correlograms for the variables that were significant (urban, conservative, and liberal) using the “ncf” library and the correlog() function. These figures give a better picture of spatial autocorrelation at various distances.

Semivariograms: To create semivariograms, I used the “gstat” and “lattice” libraries which contain a function called variogram. This function takes the variable of interest along with latitude and longitude locations. The object created can then be plotted. For this analysis I used the same subsets as in the Moran’s I analysis. These figures are excluded from the results as they do not provide information beyond what the correlograms show.

Nearest Neighbor analysis: A file of restoration sites (point features) was loaded into R. This analysis required four libraries because the points came from different files: nlme, rpart, spatstat, and sf. I first had to turn my points into point pattern objects in R (.ppp objects). I used the convexhull.xy() function to create the ‘window’ for my point object, then used that to run the ppp() function. I then used the nncross() function, which produced the minimum, maximum, average, and quartile distances from individuals to the nearest restoration site. I added the distance to the nearest restoration site as a variable in a linear regression to test for an association. I used these distances (1st quartile, median, and 3rd quartile) to produce buffers and ran spatial joins between the buffers and the restoration sites. This produced an attribute table with a join count—the number of restoration sites within each buffer. In R, I ran a linear regression with join count as an independent variable.
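The distance step of this workflow can be sketched in base R; the coordinates below are synthetic stand-ins, but the quartiles computed this way are analogous to what nncross() reports and to the buffer radii used above:

```r
# Base-R version of the nearest-neighbour distances that spatstat's
# nncross() computes; coordinates are synthetic stand-ins.
set.seed(2)
people <- cbind(runif(30), runif(30))   # survey respondents
sites  <- cbind(runif(10), runif(10))   # restoration sites

# For every person, the distance to the closest restoration site
nn_dist <- apply(people, 1, function(p) {
  min(sqrt((sites[, 1] - p[1])^2 + (sites[, 2] - p[2])^2))
})

quantile(nn_dist, c(0.25, 0.5, 0.75))  # quartiles used as buffer radii
```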

Geographically weighted regression: I joined individual points to my rank data, a shapefile at the census tract level. This gave individuals a rank for environmental effects at their location. Initially I used the GWR tool in Arc to run the regression, but I wanted the flexibility of changing parameters and the easily readable output that R provides. In R, I used two different functions to run GWR: gwr() and gwr.basic(). gwr() required creating a bandwidth using gwr.sel(), and gwr.basic() required creating a bandwidth using bw.gwr(). The difference between these functions is that gwr.basic() produces p-values for the betas. I ran GWR on both my entire dataset and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals, whom I defined as those one standard deviation above and below the mean perception.
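Under the hood, GWR fits a separate weighted least-squares regression at every location, with weights that decay with distance. A minimal base-R sketch of the local fit at one point follows; the data are synthetic, and the fixed bandwidth stands in for the cross-validated one chosen by gwr.sel()/bw.gwr():

```r
# What gwr()/gwr.basic() do at each location: weighted least squares with
# a distance-decay kernel. Synthetic data; the bandwidth is arbitrary here
# (the packages select it by cross-validation).
set.seed(3)
n <- 100
coords <- cbind(runif(n), runif(n))
rank_  <- runif(n, 1, 7)                       # environmental-effects rank
percep <- 0.1 * rank_ + coords[, 1] + rnorm(n, sd = 0.2)

gwr_at <- function(pt, bw = 0.3) {
  d <- sqrt(rowSums((coords - matrix(pt, n, 2, byrow = TRUE))^2))
  w <- exp(-(d / bw)^2)                        # Gaussian kernel weights
  coef(lm(percep ~ rank_, weights = w))        # local intercept and slope
}

gwr_at(c(0.5, 0.5))   # local coefficients at the centre of the study area
```

Repeating this fit at every observation and mapping the coefficients (or residuals) is what produces the surfaces examined later.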

Neighborhood analysis: I created a dataset of the upper and lower values of my governance perceptions (one standard deviation above and below the means). I then added buffers to these points at 1 and 5 miles. I joined the buffers to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.

T-test: To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.
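Each of those comparisons is a standard two-sample t-test. A sketch with synthetic change values (the real inputs were the tabulated land cover changes for each group) is:

```r
# Two-sample t-test on land-cover change, as run for each cover type;
# the change values here are synthetic stand-ins for the tabulated areas.
set.seed(4)
above <- rnorm(50, mean = -13, sd = 8)  # change, above-average perception group
below <- rnorm(50, mean = -18, sd = 8)  # change, below-average perception group

t.test(above, below)$p.value            # compare against the 0.05 threshold
```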

Mapping

Kriging and IDW: I used the Spatial Analyst toolbox to perform kriging and IDW and compared the outputs of the two techniques. I used my indexed variable of governance perceptions, whose values vary from 1 to 7. I also uploaded a shapefile bounding the sample area.

Hotspot Analysis: I performed a hotspot analysis (Getis-Ord Gi*) in Arc on my indexed variable of governance perceptions, with values from 1 to 7.
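The hotspot tool computes a Getis-Ord Gi* z-score for each point. A base-R sketch of that statistic, using synthetic points and a binary fixed-distance band, is:

```r
# The Getis-Ord Gi* statistic behind the hotspot tool: a z-score comparing
# the local weighted sum of values to the global mean. Synthetic points;
# binary weights within a fixed-distance band.
set.seed(7)
n <- 60
coords <- cbind(runif(n), runif(n))
x <- runif(n, 1, 7)                       # governance perception index

gi_star <- function(i, band = 0.2) {
  d <- sqrt((coords[, 1] - coords[i, 1])^2 + (coords[, 2] - coords[i, 2])^2)
  w <- as.numeric(d <= band)              # neighbourhood includes point i (Gi*)
  xb <- mean(x)
  s  <- sqrt(mean(x^2) - xb^2)
  num <- sum(w * x) - xb * sum(w)
  den <- s * sqrt((n * sum(w^2) - sum(w)^2) / (n - 1))
  num / den
}

gi_star(1)   # |z| > 1.96 suggests a hot (positive) or cold (negative) spot
```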

Tabulate area: I used two land cover maps from 1992 and 2016 to approximate environmental condition. I loaded the raster layers into ArcGIS Pro, and reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

Results

What are the spatial patterns of perceptions of natural resource governance?

Moran’s I statistic for overall governance perceptions was insignificant, indicating a lack of spatial autocorrelation at the regional scale (I value; p = 0.51). Significant autocorrelation was detected when the data were subset by demographic variables. Individuals residing in urban areas exhibited spatial autocorrelation of perceptions, while individuals in suburban and rural areas did not. Lastly, individuals classified as ‘conservative’ (political ideology < 0.5) exhibited significant spatial autocorrelation for governance perceptions (I statistic: -0.0062, p = 0.0018). The spatial autocorrelation for governance perceptions was mainly restricted to short distances, as shown in the correlograms below. Significant correlations for the entire study population and individuals in urban areas were detected at near distances. Individuals classified as conservative exhibited significant correlation across multiple distances. No significant correlations were found among liberal individuals, suggesting a non-spatial driver mechanism may be at play.

Local patches of significantly lower perceptions (cold spots) and significantly higher perceptions (hot spots) were identified as shown in the map below. Three ‘cold spots’ were located in the areas around Port Angeles, Shelton, and the greater Everett area. Two hot spots were identified surrounding Bainbridge Island, and the city of Tacoma, which is where the Puget Sound Partnership is located.

Blue points are “cold spots” at 99%, 95%, and 90% confidence that the points reside in a cold spot. Cold spots are areas where perception is significantly lower than average compared to those around them. Red points are “hot spots” at 99%, 95%, and 90% confidence that the points reside in a hot spot. Hot spots are areas where perception is significantly higher than average compared to those around them. Small grey points are not significantly different. Bounding lines are counties in Northwest Washington.

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?

The median distance from individuals to the nearest restoration site was 0.037 degrees; the 1st quartile was 0.020 degrees and the 3rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was insignificant (p = 0.198), suggesting that distance from the nearest restoration site does not influence perceptions. For each regression on the number of sites near an individual, all coefficients were negative, meaning the more sites near an individual, the more disagreeable their perception. All produced significant results, but the effect size of the number of sites near individuals was very small (Table 1).

Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

| Buffer Size | b | p-value | Effect Size (rpb) |
|---|---|---|---|
| Buffer 1 (1st quartile) | -0.003 | 0.0181 | .003 |
| Buffer 2 (median) | -0.002 | 0.045 | .002 |
| Buffer 3 (3rd quartile) | -0.001 | 0.002 | .003 |

In a geographically weighted regression (GWR) model, rank, life satisfaction, years lived in the Puget Sound, and race are significant (Table 2). All other variables were insignificant. For rank (the combination score of environmental effects) the coefficient was positive. Higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 2).

A map of residuals from the GWR confirms the results from the hotspot analysis about where high and low responses congregate.

Table 2. Regression results for environmental effects at individuals’ locations.

| Variable1 | b | p-value2 |
|---|---|---|
| Rank | 0.077 | 0.007** |
| Life Satisfaction | 0.478 | <0.001** |
| Years Lived in the Puget Sound | -0.010 | 0.002** |
| Sex | -0.156 | 0.289 |
| Area | -0.150 | 0.179 |
| Education | -0.002 | 0.922 |
| Income | -0.022 | 0.622 |
| Race | 0.006 | 0.034* |
| Ideology | 0.103 | 0.533 |

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level

The plot of high and low governance values at the two buffers is presented below. The black and grey curves represent respondents from the survey that were at least one standard deviation lower than the mean (the mean was neutral). The green curves represent respondents from the survey that were at least one standard deviation higher than the mean. The solid curves are the average rank at the one mile buffer, and the dotted curve is the average rank at the five mile buffer.

This figure indicates there are two peaks in average rank, at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects.

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

I used land cover change as a proxy for environmental condition over time. Land cover change differed significantly between the two groups for four land cover types—grassland, forest, shrub/scrub, and bare land. Grassland cover decreased for both groups, but decreased by 5% more in areas with individuals with below average governance perceptions. Forest cover also decreased for both groups, but decreased by 1% less in areas with individuals with below average governance perceptions (the amount of forest cover in the region is very large, which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below average governance perceptions. Lastly, bare land decreased for both groups, but decreased by more in areas with individuals with below average governance perceptions (Table 3). The effect size of land cover change on perception was small for each of the significant variables, with point-biserial correlations between .07 and .09 (Table 3).

Table 3. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

| Land Cover Type | Above Average1 | Below Average1 | t-value | p-value | Effect size (rpb) |
|---|---|---|---|---|---|
| High Development | 17 | 16 | 0.80 | .426 | .02 |
| Medium Development | 19 | 19 | 0.67 | .500 | .02 |
| Low Development | 11 | 11 | 0.22 | .829 | .00 |
| Developed Open Space | 12 | 13 | 0.21 | .837 | .00 |
| Agriculture | -1 | 0 | 0.74 | .461 | .02 |
| Grassland | -13 | -18 | 3.03 | .002 | .09 |
| Forest | -10 | -9 | 2.61 | .009 | .08 |
| Shrub/Scrub | 40 | 43 | 3.04 | .002 | .09 |
| Wetlands | 0 | 0 | 1.67 | .095 | .05 |
| Unconsolidated Shore | 2 | 2 | 0.10 | .918 | .00 |
| Bare Land | -21 | -25 | 2.43 | .015 | .07 |
| Water | 0 | 0 | 0.90 | .368 | .03 |
| Aquatic Beds | -1 | -5 | 1.19 | .233 | .03 |
| Snow/Tundra | 0 | 0 | 0.00 | .997 | .00 |

1Cell entries are percent changes of land cover from 1992 to 2016

As one land cover type goes up, another land cover type must go down. While forest cover decreased in the region for both groups over time, areas with lower perceptions saw significantly less decrease. The percentage is small, but much of the area in the region is forested, so the actual amount of land is large. The two maps below show land cover in 2016 (top) and 1992 (bottom) around Port Angeles, an area that contained a significant cold spot (low perceptions). From these maps, you can see a spread of the dark green color (forest) and a decrease of the yellow (grassland). Port Angeles historically was a timber community (the same could be said for the areas around the other cold spots in the region). This change in forest and grassland could be the source of negative perceptions if these communities believe the governance structures are responsible for the change in forest cover.

Significance

Overall, in this analysis, poor environmental condition relates to positive perceptions, and land cover change may provide insight into reasons for poor perceptions. Areas with negative perceptions may not directly indicate poor natural resource governance. Elsewhere, trust in natural resource governance was found to be primarily driven by individual value orientations (Manfredo et al., 2017). The three areas identified as cold spots in this study could be areas where individuals do not feel their values are represented by the natural resource governing systems. Conversely, hot spot Area A is located near the headquarters of the Puget Sound Partnership, potentially facilitating trust, representation, and access to information that may influence the perceptions of individuals located nearby. This urban, liberal, developed area contrasts with the area to the west, which contains a highly significant cold spot (Area 2). Individuals from Area 2, a more rural, conservative, historic logging area, likely have different value orientations than those from Area A. Similar comparisons could be made between the other cold spots and the areas near them. More research should be conducted to determine whether value orientations have a significant influence on perceptions of natural resource governance.

During this course I significantly advanced my knowledge in ArcGIS Pro and in R studio. I learned how to run many new tools in Arc including hotspot, geographically weighted regression, and tabulate area. I also learned new skills in troubleshooting problems. For example, I was having difficulty getting the rank data to display on my map. Another student taught me that as I imported the data from a .csv and joined it to a shapefile, I would need to create a new feature class from it before it would be able to display. I also gained knowledge in how to run more spatial analyses in R. These included many spatial autocorrelation analyses, and geographically weighted regression (which involved creating a point object!).

What did you learn about statistics?

I first and foremost learned of the existence of many types of spatial statistics. These included hotspot analysis (Getis-Ord Gi*), spatial autocorrelation (correlograms, semivariograms, Moran’s I), and geographically weighted regression (GWR). For hotspot analysis, I learned what the hotspot clusters mean and the best ways to pick the fixed-distance band of neighbors the tool looks at (either by selecting a minimum number of nearby points, or by looking at the z-score at the distance that shows the greatest spatial autocorrelation). Speaking of spatial autocorrelation, I learned what it means in terms of the correlation of perceptions of governance between individuals depending on their distance from each other. Finally, I learned a little about GWR, including that it is necessary to plot the output to examine patterns over the space in question, and that the reason Arc does not display p-values for independent variables is not something to do with the equation, but is actually due to the computing power of Arc.

# Final Project: Searching For Faults with Kriging

Introduction:

I have looked at the spatial distribution of depth to first lava and the spatial relationship between the depth to first lava and the depth to first water. I found that the distance from volcanic centers might affect the depth we find the contact with first lava. I did not find any strong correlation between first water and first lava, but that does not preclude the idea that first water correlates to deeper lava contacts.

Now I want to look at what else might be affecting the spatial distributions of lava in the region. Note that the techniques I used for looking at lava can also be applied to the spatial distribution of water in the field area.

Figure 1: The style of active faulting in the Pit River study area (delineated roughly in green). N-S trending faults accommodate dip slip (up-down) motion, while NW-SE trending faults accommodate oblique slip motion (up-down and side to side). Structurally controlled volcanoes often lie at the elbows of intersection between the N-S and NW-SE trending faults. Modified from Blakely et al. 1997.

The Northern California field area lies at the intersection of the Walker Lane fault zone, the Basin and Range, and the Cascadia subduction zone. For the purpose of this examination I will focus on the relationship between the spatial distribution of Walker Lane style faulting (fig 1) and the spatial distributions of the depth to first water and the depth to first lava in the region.

The topography in the region reflects the interplay between active Walker Lane style faulting and the higher frequency volcanism that mutes it. For example, in the northern portion of the field area at Medicine Lake Volcano, whose flanks are covered by volcanism less than 10 kyr old (fig 2), active faults disappear beneath the volcanic edifice. They do not appear in the topography, but they exist in the subsurface. In regions where faults are well expressed, however, enough displacement has occurred to 1) be resistant to burial by periodic lavas, and 2) either serve as conduits for lava flow or displace the lava itself.

The Data:

Figure 2: Map of the Field area with active faults (pink lines), the Pit River (blue line), rough locations of ground water discharge (yellow circles), young lava flows (green outlines) and all of the available well data (blue dots).

The subset of the well data in figure 2 that I used is depicted in figure 3. Temporally, the wells range from ones dug in the 1950s to the present. It is important to note that depths to first water might have changed in the 60 years since the older well logs were recorded. For this reason I did not use many of the older well logs in my analyses.

Figure 3: The center of the township and range sections for the wells used and corresponding depths to first lava. Purple logs are the shallowest and Blue logs are the deepest.

Hypotheses:

Figure 3: Conceptual design for the project. A) Configuration before slip upon the fault. The lava flow is continuous. B) If a normal fault displaces a lava flow, then one side will go down with respect to the other. The lower block experiences deposition, and the development of a soil profile. If enough time has passed, then there will be a difference between the depth to first lava between the upper and lower block.

I predict that faults will run parallel to and coincide with large step changes in the depth to first lava.

If there are no large step changes in depth to first lava, I predict that these changes will be driven by changes in elevation and correspond to channelized lava flows rather than faults.

Methods:

I used the threshold kriging function in Arc to make a surface corresponding to the likelihood that the depth to first lava is less than 30 feet. The kriged surface interpolates the probability that other locations in the map have a depth to first lava of less than 30 feet. Where the surface has a value close to 1, depths are likely less than 30 ft; where it is close to zero, depths are likely greater than 30 feet.
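The core of threshold (indicator) kriging is recoding each well as a 0/1 indicator relative to the cutoff and interpolating that indicator, so the resulting surface estimates a probability. A minimal base-R sketch with synthetic wells follows; inverse-distance weighting stands in for the kriging weights Arc computes from the variogram:

```r
# The indicator transform behind threshold kriging: depths become 1 if
# shallower than the 30 ft cutoff, 0 otherwise, and the interpolated
# surface estimates P(depth < 30 ft). Synthetic well data; IDW stands in
# for the kriging weights here.
set.seed(5)
wells <- data.frame(x = runif(60), y = runif(60),
                    depth_ft = runif(60, 5, 120))
wells$ind <- as.numeric(wells$depth_ft < 30)   # 0/1 indicator

interp_prob <- function(px, py, p = 2) {
  d <- sqrt((wells$x - px)^2 + (wells$y - py)^2)
  w <- 1 / d^p
  sum(w * wells$ind) / sum(w)   # weighted mean of 0/1 indicators
}

interp_prob(0.5, 0.5)   # between 0 and 1: estimated P(depth < 30 ft)
```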

I then constructed a confusion matrix to see how my results compared to what I expected. This part I did by eye.

Results:

Figure 4: Kriged Surface with faults superimposed.

| Fault orientation | High-Low | Low-Low or High-High |
|---|---|---|
| Parallel | 257 | 315 |
| Perpendicular | 213 | 5 |

Table 1: Corresponding confusion matrix. There were 257 High-Low values that corresponded to parallel faults, but there were 315 Low-Low or High-High values that also corresponded to parallel faults. See discussion below.

Figure 5: Kriged surface with contours.

Discussion:

257 out of 790 of the faults were gradient parallel; that is, they corresponded to a step change in the kriged depth to first lava. However, many more did not. What is happening here? If you look at figure 5, you can see that some of the high gradients instead correspond to rapid changes in elevation. This block is the footwall of a range-bounding normal fault. That fault is not mapped because it is no longer active, but it has accommodated large amounts of uplift and likely corresponds to a different lithology entirely, so it does not fall within our hypothesis.

If I do this analysis for every lava layer I find, and can correctly link the well logs together, then this analysis could help me constrain slip upon the faults in the region. This information is good for hazards assessment in the region as PG&E has a hydroelectric dam in the area.

Learning outcomes:

Throughout the course of this class, I learned a few of the nuances of R, though my work here has just begun. I also feel like I have a better understanding of how to think about spatial processes, which is why I enrolled in the first place. Future goals include better locating my data in space. This way, many of the techniques I used in this class will pick up on actual processes, and not on the gridded spacing of my data.

Sources:

Blakely, Richard et al. “Gravity Anomalies, Quaternary Vents, and Quaternary Faults in the Southern Cascade Range, Oregon and California: Implications for Arc and Backarc Evolution.” Journal of Geophysical Research: Solid Earth, vol. 102, no. B10, 1997, pp. 22513–22527.

# Final Project: Western Hemlock Dwarf Mistletoe Spatial Patterns and Drivers

Western Hemlock Dwarf Mistletoe

Western hemlock dwarf mistletoe is a hemi-parasite primarily of western hemlock trees. It absorbs water, nutrients, and carbohydrates from its host. Infected branches produce small structures called aerial shoots that have minor photosynthetic capabilities but exist primarily for pollination and seed production, and that require a certain amount of light to emerge from an infection site. Seeds are explosively discharged from aerial shoots when fully mature. Once landing on a susceptible host branch, they begin germination and mechanical penetration of the host branch.

Research Question

I wanted to explore the spatial patterns of dwarf mistletoe infections after a mixed severity fire and the roles post-fire forest structure and fire refugia play.

How does the spatial pattern of post-fire stand structure and species composition affect spatial patterns of dwarf mistletoe distribution throughout the stand through physical barriers to susceptible hosts and seed dispersal?

Description of Dataset

I have a 2.2 ha rectangular study area (Wolf Rock), just northwest of the HJ Andrews Experimental Forest, that was stem mapped in 1992. Each tree has X and Y coordinates, as well as the tree species, height, diameter, and a variety of other tree inventory related data. I only have ages for a few western hemlocks, but many more for Douglas-fir. Only one western hemlock core was identified as being over 100 years old, at 170 years. I have a polygon layer for the fire refugia that are documented on the study area. For the western hemlocks I have presence/absence of western hemlock dwarf mistletoe as well as a severity measure. However, I am unsure of the scale and method of rating, so I will not be using it. The infection presence and absence data are from a single measurement season.

Hypotheses and Predictions

Western hemlock dwarf mistletoe spreads easily through moderate canopy densities, where distances between trees are close to the average dispersal distance of 2-4 meters. Disturbances that create patchy gaps increase likelihood of spread because of increased light reaching infected branches and increase the rate of infection abundance because of the lack of physical barriers such as high densities of branches and foliage.

Disturbances can also remove the disease from a forested stand by killing or knocking down infected western hemlocks. Post-disturbance regeneration can enable or inhibit the parasite’s reintroduction to the stand. Non-susceptible hosts such as Douglas-fir or western redcedar regenerate readily alongside western hemlock and will intercept seeds. Some gaps are only conducive to western hemlock regeneration which will be readily infected by any surviving infected western hemlocks post-disturbance.

The Wolf Rock stand experienced two fires that created a mosaic of ~110 year old regeneration and >110 year old remnant trees in fire refugia. Remnant, infected western hemlocks survived the most recent fire in those fire refugia. From this stand structure, I have several hypotheses and predictions:

1. Remnant, infected western hemlocks form the center of new infection centers post disturbance so infections in the regenerating susceptible hosts, will be clustered around these remnant trees.
2. Non-susceptible hosts regulate the rate of infection spread through physical barriers to dispersal, so infection cluster size will have an inverse relationship with non-host density.
3. Western hemlock infection spreads from a central remnant tree, so uninfected western hemlocks will have a dispersed spatial pattern from the remnant western hemlock, regardless of non-host density.
4. Post-fire regeneration with higher western hemlock composition will have more susceptible hosts and less physical barriers to spread so infection cluster size will have a positive relationship with western hemlock composition.

Analysis Approaches

For Exercise 1,  I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area. I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees related to the spatial pattern of fire refugia outlined in my study site. I used Geographically Weighted Regression to determine the significance of this relationship; however, I did not find a significant relationship between a western hemlock’s intensity of clustering and infection status and its distance to the nearest fire refugia polygon edge.

For Exercise 3, I wanted to know how the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species. I used the Kcross and Jcross functions in spatstat in R and prepared the data in ArcMap to analyze spatial relationships between trees. I found clustering between regenerating western hemlocks and non-hosts to remnant western hemlocks but the uninfected western hemlock’s spatial pattern was independent of the remnant western hemlocks.
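Conceptually, Kcross counts the average number of type-j points within distance r of a type-i point, scaled by the intensity of type j. A base-R sketch of that estimate, without the edge corrections spatstat’s Kcross applies and with synthetic points standing in for the stem map, is:

```r
# Unadjusted cross-K estimate: mean number of type-j points within radius
# r of each type-i point, divided by the intensity of type j. Edge
# corrections (applied by spatstat's Kcross) are omitted; points are
# synthetic stand-ins for the stem-mapped trees.
set.seed(6)
remnant <- cbind(runif(15), runif(15))   # "remnant infected hemlocks"
regen   <- cbind(runif(80), runif(80))   # "regenerating hemlocks"

kcross_naive <- function(from, to, r, area = 1) {
  lambda_to <- nrow(to) / area           # intensity of the "to" type
  counts <- apply(from, 1, function(p) {
    sum(sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2) <= r)
  })
  mean(counts) / lambda_to   # compare to pi * r^2 under independence
}

kcross_naive(remnant, regen, r = 0.1)
```

Values above pi * r^2 indicate clustering of regeneration around remnant trees at that scale; values below it indicate dispersion.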

Results

I produced several maps that showed the spatial patterns in my study site, which were helpful for understanding and investigating further relationships. I produced several charts from my Exercise 3 analysis that were useful visual representations of the relationships between trees. In Exercise 3 I also produced a map from raster addition that gave me the best visualization of where western hemlock and non-host trees were in the stand. Exercise 2 produced a map and a statistical relationship, but the model was not significant in explaining a hemlock’s infection and clustering status.

Significance

The biggest finding was that the fire refugia polygons are not significant for my analysis, the remnant infected hemlocks are more important explanatory variables in spatial patterns of infected trees. This supported hypothesis 1. Because refugia can be effectively defined using the “for what, from what” framework, western hemlock dwarf mistletoe refugia from fire could be delineated differently in the field focusing only on the remnant western hemlocks.

Data were not available to determine the rates of infection spread over time because I only had one season of measurements. I also could not evaluate the size of clusters because I did not have GPS points of infection center extents, so I could not assess hypotheses 2 and 4 directly. However, using Ripley’s K and its cross-type variants, I could see how the clusters changed over distance. I learned that infected, regenerating trees are found closer to the remnant infected trees, and that non-host trees may be blocking the spread of mistletoe into an uninfected patch, because they were found clustered around remnant trees as well. This provides support for hypotheses 1, 2, and 3.

Silvicultural prescriptions that aim to preserve old growth forest structure but limit the amount of dwarf mistletoe in a forest can appropriately remove old infected hemlocks to limit infection spread and extent. These prescriptions will also be able to predict future dwarf mistletoe spread. Forest operations that simulate disturbances leaving remnant hemlocks, such as harvests, can incorporate spread predictions to limit infection of the regeneration.

Learning From The Process

I learned a lot about the spatial analyst tools in ArcMap and how to produce easily interpretable maps and graphs. I also learned how to use several functions in spatstat, and a lot about interpreting R outputs and spatial statistics. Spatial autocorrelation can tell you a lot about what your data are doing, but I found it most useful to be able to see on a map or chart what is specifically happening.

# Final project: Washing out the (black) stain and ringing out the details

BACKGROUND

In order to explain my project, especially my hypotheses, some background information about this disease is necessary. Black stain root disease of Douglas-fir is caused by the fungus Leptographium wageneri. It infects the roots of its Douglas-fir host, growing in the xylem and cutting the tree off from water. It spreads between adjacent trees via growth through root contacts and grafts and long-distance via insects (vectors) that feed and breed in roots and stumps and carry fungal spores to new hosts.

Forest management practices influence the spread of disease because of the influence on (i) the distance between trees (determined by natural or planted tree densities); (ii) adjacency of susceptible species (as in single-species Douglas-fir plantations); (iii) road, thinning and harvest disturbance, which create suitable habitat for insect vectors (stumps, dead trees) and stress remaining live trees, attracting insect vectors; and (iv) forest age distributions, because rotation lengths determine the age structure in managed forest landscapes and younger trees (<30-40 years old) are thought to be more susceptible to infection and mortality by the disease.

RESEARCH QUESTION

How do (B) spatial patterns of forest management practices relate to (A) spatial patterns of black stain root disease (BSRD) infection probabilities at the stand and landscape scale via (C) the spatial configuration and connectivity of susceptible stands to infection?

In order to address my research questions, I built a spatial model to simulate BSRD spread in forest landscapes using the agent-based modeling software NetLogo (Wilensky 1999). I used Exercises 1-3 to focus on the spatial patterns of forest management classes. Landscapes were equivalent in terms of the proportion of each management class and number of stands, varying only in spatial pattern of management classes. In the exercises, I evaluated the relationship between management and disease by simulating disease spread in landscapes with two distinct spatial patterns of management:

• Clustered landscape: The landscape was evenly divided into three blocks, one for each management class. Each block was evenly divided into stands.
• Random landscape: The landscape was evenly divided into stands, and forest management classes were randomly assigned to each stand.
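Outside NetLogo, the two configurations can be sketched as assignments of the three management classes to a grid of stands (illustrative only; the grid size and class labels are placeholders, not the model’s actual parameters):

```r
# Illustrative sketch (not the NetLogo model): assign three management
# classes to a 6 x 6 grid of stands in "clustered" vs "random" layouts.
classes <- c("intensive", "extensive", "set-aside")

# Clustered: three contiguous blocks of stands, one per class
clustered <- matrix(rep(classes, each = 12), nrow = 6)

# Random: same class proportions, randomly assigned to stands
set.seed(8)
random <- matrix(sample(rep(classes, each = 12)), nrow = 6)

table(clustered)  # equal class proportions in both landscapes
table(random)
```

Holding the class proportions and stand count fixed while varying only the arrangement is what lets the simulations isolate the effect of spatial pattern.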

MY DATA

I analyzed outputs of my spatial model. The raster files contain the states of cells in forest landscapes at a given time step during one model run. States tracked include management class, stand ID number, presence/absence of trees, tree age, probability of infection, and infection status (infected/not infected). Management class and stand ID did not change during the model run. I analyzed tree states from the last step of the model run and did not analyze change over time.

Extent: ~20 hectares (much smaller than my full model runs will be)

Spatial resolution: ~1.524 x 1.524 m cells (maximum 1 tree per cell)

Three contrasting and realistic forest management classes for the Pacific Northwest were present in the landscapes analyzed:

• Intensive – Active management: 15-foot spacing, no thinning, harvest at 37 years.
• Extensive – Active management: 10-foot spacing, one pre-commercial and two commercial thinnings, harvest at 80 years.
• Set-aside/old-growth (OG) – No active management: Forest with Douglas-fir in Pacific Northwest old-growth densities and age distributions and uneven spacing with no thinning or harvest.

HYPOTHESES: PREDICTIONS OF PATTERNS AND PROCESSES I LOOKED FOR

Because forest management practices influence the spread of disease as described in the “Background” section above, I hypothesized that the spatial patterns of forest management practices would influence the spatial pattern of disease. Having changed my experimental design and learned about spatial statistics and analysis methods throughout the course, I hypothesize that…

• The “clustered” landscape will have (i) higher absolute values of infection probabilities, (ii) higher spatial autocorrelation in infection probabilities, and (iii) larger infection centers (“hotspots” of infection probabilities) than the “random” landscape because clustering of similarly managed forest stands creates continuous, connected areas of forest managed in a manner that creates suitable vector and pathogen habitat and facilitates the spread of disease (higher planting densities, lower age, frequent thinning and harvest disturbance in the intensive and extensive management). I therefore predict that:
• Intensive and extensive stands will have the highest infection probabilities with large infection centers (“hotspots”) that extend beyond stand boundaries.
• Spatial autocorrelation will therefore be higher and exhibit a lower rate of decrease with increasing distance because there will be larger clusters of high and low infection probabilities when stands with similar management are clustered.
• Set-aside (old-growth, OG) stands will have the lowest infection probabilities, with small infection centers that may or may not extend beyond stand boundaries.
• Where old-growth stands are in contact with intensive or extensive stands, neighborhood effects (and edge effects) will increase infection probabilities in those OG stands.
• In contrast, the “random” landscape will have (i) lower absolute values of infection probabilities, (ii) less spatial autocorrelation in infection probabilities, and (iii) smaller infection centers than the “clustered” landscape. This is because the random landscape will have less continuity and connectivity between similarly managed forest stands; stands with management that facilitates disease spread will be less connected and stands with management that does not facilitate the spread of disease will also be less connected. I would predict that:
• Intensive and extensive stands will still have the highest infection probabilities, but that the spread of infection will be limited at the boundaries with low-susceptibility old-growth stands.
• Because of the boundaries created by the spatial arrangement of low-susceptibility old-growth stands, clusters of similar infection probabilities will be smaller and values of spatial autocorrelation will be lower and decrease more rapidly with increasing lag distance. At the same time, old-growth stands may have higher infection probabilities in the random landscape than in the clustered landscape because they would be more likely to be in contact with high-susceptibility intensive and extensive stands.
• I also hypothesize that each stand’s neighborhood and spatial position relative to stands of similar or different management will influence that stand’s infection probabilities because of the difference in spread rates between management classes and the level of connectivity to high- and low-susceptibility stands based on the spatial distribution of management classes.
• Stands with a large proportion of high-susceptibility neighboring stands (e.g., extensive or intensive management) will have higher infection probabilities than similarly managed stands with a small proportion of high-susceptibility neighboring stands.
• High infection probabilities will be concentrated in intensive and extensive stands that have high levels of connectivity within their management class networks because high connectivity will allow for the rapid spread of the disease to those stands. In other words, the more connected you are to high-susceptibility stands, the higher your probability of infection.

APPROACHES: ANALYSIS APPROACHES I USED

Ex. 1: Correlogram, Global Moran’s I statistic

In order to test whether similar infection probability values were spatially clustered, I used the raster package in R (Hijmans 2019) to calculate the global Moran’s I statistic at multiple lag distances for both of the landscape patterns. I then plotted global Moran’s I vs. distance to create a correlogram and compared my results between landscapes.
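The correlogram logic can also be sketched outside of R. Below is a minimal Python/numpy version of global Moran's I computed over successive distance bands, using a synthetic gradient surface in place of my model output (the grid size, lag bands, and values are all illustrative assumptions, not my actual data):

```python
import numpy as np

def morans_i(values, coords, d_min, d_max):
    """Global Moran's I with binary distance-band weights (d_min, d_max]."""
    x = values - values.mean()
    n = len(x)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = ((d > d_min) & (d <= d_max)).astype(float)
    np.fill_diagonal(w, 0.0)          # a cell is not its own neighbor
    W = w.sum()
    if W == 0:
        return float("nan")
    return (n / W) * (w * np.outer(x, x)).sum() / (x ** 2).sum()

# illustrative surface: a smooth gradient plus noise, so nearby cells are similar
rng = np.random.default_rng(0)
side = 20
rows, cols = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
values = (rows + cols).ravel() + rng.normal(0.0, 1.0, side * side)

# correlogram: Moran's I over successive lag-distance bands
lags = [(0, 2), (2, 4), (4, 8), (8, 16)]
correlogram = [morans_i(values, coords, lo, hi) for lo, hi in lags]
```

Plotting `correlogram` against the band midpoints gives the correlogram; a spatially structured surface shows high Moran's I at short lags that decays with distance, the pattern I compared between landscapes.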

Ex. 2: Hotspot analyses (ArcMap), Neighborhood analyses (ArcMap)

First, I performed a non-spatial analysis comparing infection probabilities between (i) landscape patterns (ii) management classes, and (iii) management classes in each of the landscapes. Then, I used the Hotspot Analysis (Getis-Ord Gi*) tool in ArcMap to identify statistically significant hot- and cold-spots of high and low infection probabilities, respectively. I selected points within hot and cold spots and used the Multiple Ring Buffer tool in ArcMap to create distance rings, which I intersected with the management classes to perform a neighborhood analysis. This neighborhood analysis revealed how the proportion of each management class changed with increasing distance from hotspots in order to test whether the management “neighborhood” of trees influenced their probability of infection.
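For intuition, the statistic behind the ArcMap tool is straightforward to compute. The following is a hedged sketch of Getis-Ord Gi* z-scores with simple binary distance-band weights (ArcMap offers several schemes, including the inverse squared distance weighting I used elsewhere); the point pattern and the 0.7 "infection" boost are invented for illustration:

```python
import numpy as np

def getis_ord_gi_star(values, coords, dist):
    """Gi* z-scores with binary weights within `dist`; self-weight included."""
    n = len(values)
    xbar = values.mean()
    s = np.sqrt((values ** 2).mean() - xbar ** 2)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = (d <= dist).astype(float)                 # includes self (d = 0)
    wsum = w.sum(axis=1)
    num = w @ values - xbar * wsum
    den = s * np.sqrt((n * wsum - wsum ** 2) / (n - 1))
    return num / den

# invented point data: one patch of elevated infection probabilities
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 100.0, size=(200, 2))
values = rng.uniform(0.0, 0.2, size=200)
in_cluster = np.linalg.norm(coords - np.array([20.0, 20.0]), axis=1) < 15.0
values[in_cluster] += 0.7                         # hypothetical hotspot

z = getis_ord_gi_star(values, coords, dist=15.0)
hot = coords[z > 1.96]                            # ~95% confidence hot spots
cold = coords[z < -1.96]
```

Points with large positive z form the hot spots around which the distance rings are drawn.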

Ex. 3: Network and landscape connectivity analyses (Conefor)

I divided my landscape into three separate stand networks based on management class. Then, I used the free landscape connectivity software Conefor (Saura and Torné 2009) to analyze the connectivity of each stand based on its position within, and role in connecting, the network using the Integral Index of Connectivity (Saura and Rubio 2010). I then assessed the relationship between each stand's connectivity and the infection probabilities of trees within that stand using various summary statistics (e.g., mean, median) to test whether connectivity was related to infection probability.
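For reference, the index Conefor computes can be sketched directly: IIC sums a_i·a_j / (1 + nl_ij) over all stand pairs (nl_ij = number of links on the shortest path; unreachable pairs contribute nothing) and divides by the squared landscape area. A pure-Python sketch over a small, hypothetical stand network (areas and adjacency invented for illustration):

```python
from collections import deque

def shortest_links(adj, src):
    """BFS hop counts from `src` over an undirected stand-adjacency graph."""
    hops = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops

def iic(areas, adj, landscape_area):
    """IIC = sum_ij a_i * a_j / (1 + nl_ij) / A_L^2; unreachable pairs add 0."""
    total = 0.0
    for i in areas:
        for j, nl in shortest_links(adj, i).items():
            total += areas[i] * areas[j] / (1 + nl)
    return total / landscape_area ** 2

# hypothetical 4-stand network (areas in ha); s4 is isolated from the chain
areas = {"s1": 2.0, "s2": 2.0, "s3": 2.0, "s4": 2.0}
adj = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"], "s4": []}
iic_value = iic(areas, adj, landscape_area=20.0)
```

Linking the isolated stand into the chain raises the index, which is the sense in which a stand's position and connecting role contribute to landscape connectivity.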

RESULTS: WHAT DID I PRODUCE?

As my model had not been parameterized by the beginning of this term, I analyzed “dummy” data, where infection spread probabilities were calculated as a decreasing linear function of distance from infected trees. However, the results I produced still provided insights as to the general functioning of the model and factors that will likely influence my results in the full, parameterized model.

I produced both maps and numerical/statistical relationships that describe the patterns of “A” (infection probabilities), the relationship between “A” and “B” (forest management classes), and how/whether “A” and “B” are related via “C” (landscape connectivity and stand networks).

In Exercise 1, I found evidence to support my hypothesis of spatial autocorrelation at small scales in both landscapes and higher autocorrelation and slower decay with distance in the clustered landscape than the random landscape. This was expected because the design of the model calculated probability of infection for each tree as a function of distance from infected trees.

In Exercises 2 and 3, I found little to no evidence to support the hypothesis that either connectivity or neighboring stand management had significant influence on infection probabilities. Because the model that produced the “dummy” data limited infection to ~35 meters from infected trees and harvest and thinning attraction had not been integrated into infection calculations, this result was not surprising. In my full model where spread via insect vectors could span >1,000 m, I expect to see a larger influence of connectivity and neighborhood on infection probabilities.

A critical component of model testing is exploring the “parameter space” by considering a range of possible values for each parameter. This is especially important for agent-based models, where complex interactions between many individuals produce larger-scale patterns that may be emergent and not fully predictable from the simple sum of the parts. In my model, the disease parameters of interest are the factors influencing probability of infection (Fig. 1). To understand how the model reacts to changes in those parameters, I will perform a sensitivity analysis, systematically adjusting parameter values one by one and comparing the results of each series of model runs under each set of parameter values.

Fig.1. Two of the model parameters that will be systematically adjusted during sensitivity analysis. Tree susceptibility to infection as a function of age (left) and probability of root contact as a function of distance (right) will both likely influence model behavior and the relative levels of infection probability between the three management classes.
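The two Fig. 1 curves and the one-at-a-time sweep can be encoded as simple functions. The sigmoidal and exponential forms below match the general shapes of the parameters, but the midpoint, steepness, and scale constants are placeholder assumptions, exactly the values a sensitivity analysis would vary:

```python
import math

# Placeholder functional forms for the Fig. 1 parameters; midpoint, steepness,
# and scale are assumed values to be adjusted one-by-one during sensitivity analysis.
def susceptibility(age, midpoint=30.0, steepness=0.15):
    """Susceptibility declines sigmoidally with age (young trees most susceptible)."""
    return 1.0 / (1.0 + math.exp(steepness * (age - midpoint)))

def root_contact_prob(dist_m, scale=3.0):
    """Probability of root contact decays exponentially with distance."""
    return math.exp(-dist_m / scale)

# one-at-a-time sweep: vary a single parameter while holding the rest constant
sweep = {m: susceptibility(10.0, midpoint=m) for m in (20.0, 30.0, 40.0)}
```

Each sweep value would seed a series of model runs, with outputs compared across the series.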

This is especially relevant given that in Exercises 1 through 3, I found that the extensively managed plantations had the highest infection probabilities and most of the infection hotspots, likely because this management class has the highest initial density of trees. For the complete model, I hypothesize that the intensive plantations will have the highest infection probabilities because of the high frequency of insect-attracting harvests and short rotations that maintain trees in an age class highly susceptible to infection. In the full model, the extensive plantations will have higher initial density than the intensive plantations but will undergo multiple thinnings, decreasing tree density while attracting vectors, and will be harvested at age 80, allowing trees to grow into a less susceptible age class. In this final model, thinning, rotation length, and vector attraction will all factor into the calculation of infection probabilities.

My analysis made it clear that even a 1.5 m difference in tree spacing resulted in a statistically significant difference in disease transmission, with much higher disease spread in the denser forest. Because the model is highly sensitive to tree spacing, likely because its distance-related parameters drop off in sigmoidal or exponential decay patterns, I hypothesize that the values of the parameters governing the spatial spread of disease (i.e., insect dispersal distance, probability of root contact with distance) and the magnitude of vector attraction after harvest and thinning will determine whether the “extensive” or “intensive” forest management class ultimately has the highest levels of infection probability. In addition, the rate of decay of root contact and insect dispersal probabilities will determine whether management and infection within stands influence infection in neighboring stands, as well as the distance and strength of those neighborhood effects. I would like to test this by performing such analyses on the outputs of my sensitivity analyses.

SIGNIFICANCE: WHAT DID I LEARN FROM MY RESULTS? HOW ARE THESE RESULTS IMPORTANT TO SCIENCE? TO RESOURCE MANAGERS?

Ultimately, the significance of this research is to understand the potential threat of black stain root disease in the Pacific Northwest and to inform management by identifying evidence-based, landscape-scale strategies that could mitigate BSRD issues. While the results of Exercises 1-3 were interesting, they were produced using a model that had not been fully parameterized and thus are not representative of the likely actual model outcomes. Therefore, I was not able to test my hypotheses. That said, this course allowed me to design and develop an analysis to test my hypotheses, and the exercises I completed provided a deeper understanding of how my model works. Through this process, I have begun to generate additional testable hypotheses regarding model sensitivity to parameters and the relative spread rates of infection in each of the forest management classes.

Another key takeaway is the importance of producing many runs with the same landscape configuration and parameter settings to account for stochastic processes in the model. By analyzing only one run for each scenario, there is a chance that the results are not representative of the average behavior of the system or the full range of behaviors possible for those scenarios. For example, with the random landscape configuration, one generated landscape can be highly connected and the next highly fragmented with respect to intensive plantations; only a series of runs under the same conditions would provide reliable results for interpretation.

(a, b) Arc-Info, Modelbuilder and/or GIS programming in Python

This was my first opportunity to perform statistical analysis in ArcGIS, and I used multiple new tools, including hotspot analysis, multiple ring buffers, and extensions. Though I did not use Python or ModelBuilder, I realized that doing so will be critical for automating my analyses given the large number of model runs I will be analyzing. While I learned how to program in Python using arcpy in GEOG 562, I used this course to choose the appropriate tools and analyses for my questions and hypotheses rather than automating procedures I may not use again. I would now like to implement my neighborhood analysis procedures in Python in order to automate and increase the efficiency of my workflow.

(c) Spatial analysis in R

During this course, I learned the most about spatial data manipulation in R, since I had limited experience using R with spatial data beforehand. I used R for spatial statistics, data cleaning and management, and conversion between vector and raster data. I also learned about the limitations of R (and my personal limitations): learning to use packages and their functions is a challenge when documentation varies in quality and a wide variety of user-generated packages are available with little reference as to their quality and reliability. For example, for Exercise 2, I had trouble finding an up-to-date, high-quality package for hotspot analysis in R, with raster data or otherwise; this method was straightforward in ArcMap once the data were converted from raster to points. For Exercise 1, the only Moran's I calculation I was able to perform on my raster data was with the “moran” function in the raster package, which does not provide z- or p-values to evaluate the statistical significance of the calculated Moran's I and requires you to generate your own weights matrices, which is a pain. Using the spdep or ncf packages for this analysis was incredibly slow (though I am not sure why), and the learning curve for spatstat was too steep for me to overcome by the Exercise 1 deadline (but I hope to return to this package in the future).

Reading, manipulating, and converting data: With some trial and error and research into the packages available for working with spatial data in R (especially raster, sp/spdep, and sf), I learned how to quickly and easily convert data between raster and shapefile formats, which was very useful in automating the cleaning and preparation for multiple datasets and creating the inputs for the analyses I want to perform.

(d) Landscape connectivity analyses: I learned that there are a wide variety of metrics available through Fragstats (and the landscapemetrics and landscapetools packages in R); however, I was not able to perform my desired stand-scale analysis of connectivity because I could not determine whether it is possible to analyze contiguous stands with the same management class as separate patches (Fragstats considers all contiguous cells in the raster with the same class to be part of the same patch). Instead, I used Conefor, which has an ArcMap extension for generating node and connection files from a polygon shapefile, to calculate relatively few but robust and ecologically meaningful connectivity metrics for the stands in my landscape.

WHAT I LEARNED ABOUT… SPATIAL STATISTICS

Correlograms and Moran’s I: For this statistical method, I learned the importance of choosing meaningful lag distances based on the data being analyzed and the process being examined. For example, my correlogram consists of a lot of “noise”, with many peaks and troughs due to the empty cells between trees, but I also captured data at the relevant distances. Failure to choose appropriate lag distances means that some autocorrelation could be missed, yet analyzing large rasters at a fine resolution of lag distances results in very slow processing. In addition, I wanted to compare local vs. global Moran’s I to determine whether infections were sequestered in certain portions of the landscape or spread throughout, but the function for local Moran’s I returned values far outside the -1 to 1 range of the global Moran’s I, and I did not understand how to interpret or compare these values. Finally, global Moran’s I did not tell me where spatial autocorrelation was happening, but the fact that there was spatial autocorrelation led me to perform a…

Hotspot analysis (Getis-Ord Gi*): It became clear that a deep conceptual understanding of the hypothesized spatial relationships and processes in the data, along with a clear hypothesis, is critical for hotspot analysis. I performed multiple analyses with different distance weightings to compare the results, and there was large variation in both the number of points included in hot and cold spots and the landscape area covered by those spots between the different weighting and distance methods. I ended up choosing inverse squared distance weighting based on my understanding of root transmission and vector dispersal probabilities and because this weighting method was the most conservative (it produced the smallest hotspots). The confidence level chosen also resulted in large variation in the size of hotspots. After confirming that there was spatial autocorrelation in infection probabilities, this method helped me understand where in the landscape these patterns were occurring and thus how they related to management practices.

Neighborhood analysis: I did not find that this method provided much insight in my case, not because of the method itself but because of my data (it simply confirmed the landscape pattern that I had designed, clustered vs. random) and my approach (one hotspot and one coldspot point non-randomly selected in each landscape). I also found this method tedious in ArcMap, though I would like to automate it, and I later learned about the Zonal Statistics tool, which can make this analysis more efficient. In general, it is not clear what statistics I could have used to confirm whether results were significantly different between landscapes, but perhaps this is an issue caused by my own ignorance.

Network/landscape connectivity analyses: I learned that there are a wide variety of tools, programs, and metrics available for these types of analyses. I found the Integral Index of Connectivity (implemented in Conefor) particularly interesting because of the way it characterizes habitat patches based on multiple attributes in addition to their spatial and topological positions in the landscape. The documentation for this metric is thorough, its ecological significance has been supported in peer-reviewed publications (Saura and Rubio 2010), and it is relatively easy to interpret. In contrast, I found the number of metrics available in Fragstats overwhelming, especially during the data exploration phase.

REFERENCES

Hijmans, R. J. 2019. raster: Geographic Data Analysis and Modeling. R package version 2.8-19. https://CRAN.R-project.org/package=raster

Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

# Final Project: Examining the relationship between rurality and colorectal cancer mortality in Texas counties

1. The research question that you asked.

My initial research question was about the creation and comparison of rural indices for Texas. I did end up making a rural index for Texas, but instead of creating multiple indices and comparing the importance of different rural indicators, I incorporated an outcome variable, colorectal cancer (CRC) mortality, in Exercises 2 and 3. My final research question thus ended up being: how does the spatial pattern of CRC mortality in Texas relate to proxy measures of rurality and a constructed rural index for counties in Texas?

2. A description of the dataset you examined, with spatial and temporal resolution and extent.

For my project, I utilized Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

3. Hypotheses: predictions of patterns and processes you looked for.

Based on the research question I revised before Exercise 2, I expected two major spatial patterns in my analyses: 1) areas immediately surrounding high-CRC-mortality counties would have more “rural” values in 3 rural indicator variables (household income, population density, and percent land developed), with more “urban” values as distance from the counties increases; and 2) CRC mortality counts would significantly increase as county rural index values (built from the 3 indicator variables) became more rural. I chose rural indicator variables based both on true measures of where population centers in Texas are located (population density and land development) and a well-established marker of rural status (income). For spatial pattern 1, from my neighborhood analysis I expected uniform shifts in buffer means toward more “urban” values in each of the 3 indicator variables as distance increased away from the high-CRC counties. I expected this effect because counties with high CRC mortality are commonly rural, so areas in surrounding counties with lower CRC mortality may show increasingly urban indicator values. For spatial pattern 2, I expected CRC mortality to increase as county rural index values became more rural in Texas. I expected this pattern because CRC mortality has been linked to rural status and various indicator variables in previous research, though a weighted rural index has never been used for Texas cancer data.

4. Approaches: analysis approaches you used.

For my analyses I used the following approaches:

Exercise 1: I utilized principal component analysis (PCA) to construct a rural index for Texas counties using 3 indicator variables.
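As a sketch of this approach, PCA on standardized indicators can be done with a plain SVD, with the first principal component scores serving as the index. The synthetic county data below, driven by an invented latent “rurality” variable, stand in for my actual Texas indicators:

```python
import numpy as np

# invented county-level indicators driven by a latent "rurality" variable:
# median household income, population density, percent land developed
rng = np.random.default_rng(42)
latent_rural = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([
    60_000 - 30_000 * latent_rural + rng.normal(0, 3_000, 100),   # income
    2_000 * (1 - latent_rural) + rng.normal(0, 100, 100),         # pop. density
    80.0 * (1 - latent_rural) + rng.normal(0, 5, 100),            # % developed
])

# PCA via SVD on standardized indicators; PC1 scores act as the rural index
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
rural_index = Z @ Vt[0]
explained = s[0] ** 2 / (s ** 2).sum()
```

When the indicators share one underlying dimension, PC1 explains most of the variance and its scores order counties along the rural-urban continuum (the sign of the axis is arbitrary and may need flipping).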

Exercise 2: I did a neighborhood analysis of high CRC mortality counties by creating multi-ring buffers around centroids of 4 Texas counties, intersected the buffers with county polygons containing rural indicator variable values, and calculated buffer means for indicator variables for each donut around each county.
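The donut-mean calculation at the core of this exercise can be sketched as follows, using scattered points with an invented income surface in place of intersected county polygons (the radii and values are illustrative assumptions):

```python
import numpy as np

def ring_means(center, coords, values, radii):
    """Mean of `values` within each donut (r_lo, r_hi] around `center`."""
    d = np.linalg.norm(coords - center, axis=1)
    edges = (0.0, *radii)
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (d > lo) & (d <= hi)
        means.append(values[mask].mean() if mask.any() else float("nan"))
    return means

# invented surface: household income rises with distance from a rural centroid
rng = np.random.default_rng(7)
coords = rng.uniform(-100.0, 100.0, size=(2000, 2))
income = 35_000 + 150.0 * np.linalg.norm(coords, axis=1) + rng.normal(0, 1_000, 2000)

buffer_means = ring_means(np.array([0.0, 0.0]), coords, income, radii=(25, 50, 75, 100))
```

A monotone increase in `buffer_means` with distance is the “gradient” pattern I was testing for around each high-CRC county centroid.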

Exercise 3: I completed a Poisson regression of CRC mortality counts (Y) and my constructed rural index (X) to examine the effect of rural status on CRC mortality for Texas counties.
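A Poisson regression with a log link can be sketched without a GLM library via a few Newton/IRLS iterations. The synthetic counts below are generated from a known coefficient so the fit can be checked; the data and coefficient values are illustrative assumptions, not my Texas results:

```python
import numpy as np

def poisson_fit(X, y, iters=25):
    """Poisson GLM (log link) via Newton/IRLS; assumes column 0 is the intercept."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                   # sensible starting intercept
    for _ in range(iters):
        mu = np.exp(X @ beta)
        # Newton step: (X' W X) delta = X' (y - mu), with W = diag(mu)
        beta += np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return beta

# synthetic counties: mortality counts increase with a known rural-index effect
rng = np.random.default_rng(3)
n = 500
rural_index = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), rural_index])   # intercept + rural index
true_beta = np.array([2.0, 0.4])                 # assumed "true" coefficients
y = rng.poisson(np.exp(X @ true_beta)).astype(float)

beta_hat = poisson_fit(X, y)
```

A positive fitted slope corresponds to the pattern I hypothesized: expected CRC mortality counts rise multiplicatively (by exp(slope) per unit) as the rural index becomes more rural.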

5. Results: what did you produce — maps? statistical relationships? other?

For all 3 exercises, my results included maps (Texas county maps, buffer maps, standardized residual maps), statistical relationships (p-values, coefficients, etc.), and plotting of statistical relationships (biplots, influence plots, line plots). I produced the maps in Arc Pro, while all other visualizations were produced in R.

6. What did you learn from your results? How are these results important to science? to resource managers? + software/technical limitations

I believe exercise 1 displays the effectiveness of PCA for construction of rural indices. More deductive methods for rural classification are very much needed in rural health, and I believe this method could improve detection and prevention of rural health disparities.

In exercise 2, the 4 high CRC mortality counties did not all follow the rural indicator spatial “gradient” that I expected. Two of the counties exhibited increasingly urban scores as distance increased away from the counties, while the other two showed the opposite pattern. I think this result could be due to the arbitrary distances I chose for the buffers around the counties and the modifiable areal unit problem of utilizing county indicator variable data instead of more spatially refined values. Also, significant regionality could exist in Texas for the indicator variables I chose, as the 4 counties were not located in the same regions of the state; this could have affected the relationships I found in each of the counties. For example, certain rural regions in Texas may have lower or higher household income than other rural regions due to factors such as available jobs or regional/local governmental policies. Finally, there are likely other indicators of rurality that I could have included in the analysis that may have more consistent spatial patterns.

In exercise 3, the results from the Poisson regression followed the statistical pattern I expected: as county CRC mortality increases, rural index scores become more rural. I believe my results show important introductory associations between CRC mortality and rurality in Texas that indicate further and deeper study into these associations should be considered. To show the exercise 3 results, I utilized a map of standardized residuals in Arc and statistical plots in R.

The technical limitations of my analyses were mainly due to extensive missing cancer data for the state of Texas. This missingness was due to data suppression by the CDC’s National Vital Statistics System in order to maintain confidentiality of cancer cases. I did not have any major software issues besides difficulty with the Calculate Field tool in Arc Pro, where the tool would consistently fail when using more complicated formulas for data transformations. I also had some problems when joining Excel tables to attribute tables, where Arc would not show symbology for the newly joined data until I exported to a geodatabase and restarted the program.

7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

I believe I greatly improved my skills in ArcGIS this term, especially in data wrangling and cleaning to convert my data to formats that maximize the visualization. Also, I feel I became much better at using key plotting packages in R, such as ggplot2.

8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I believe the major statistical and spatial skills I improved in this course include PCA, neighborhood analysis, Poisson regression, and (especially) regression diagnostics. I had not used PCA in R before this term and feel very comfortable using it going forward in my research on rural health. The many assumptions I had to follow for the Poisson regression in exercise 3 improved my ability to run diagnostics in R and create intuitive assumption plots.

# Final project synthesis of parts 1, 2, and 3 on classification of Blackleg on turnip

Background

The objective of this research was to find a way to properly classify diseased versus non-diseased regions on a turnip leaf. To do this, I assessed the classification accuracy of an unsupervised machine learning classification algorithm, segmentation, against manual classification. Additionally, I hoped to reveal some spatial characteristics of the lesions on the leaf for future work.

Dataset

The cumulative data set is roughly 500 turnip leaves harvested here in the Willamette Valley from February to April of 2019. For this project, I selected 5 images to work with as the basis of this analysis. The spatial resolution is just under 1 mm, and the images are rasters of single turnip leaves with five bands (blue, green, red, red-edge, NIR). For this analysis, only the red band was used. The camera is a MicaSense RedEdge, which has 16-bit radiometric resolution.

Questions

Not all of these questions were anticipated at the beginning of the term; some developed over the course of the term as I ran into issues or found intriguing results.

• Can diseased patches be successfully identified and isolated by computer image classification and through manual classification?
If yes:
o Can the diseased patches based on image classification and manually classified diseased patches be characterized?
o How accurate is the segmentation classification?
o Are more raster cells being classified as false negatives or false positives when misclassified?
o What is going undetected in segmentation?

Hypothesis

My original hypothesis was that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves.

Approaches and results

I began in ArcGIS Pro with the following workflow: segmentation -> mask -> clip -> raster to polygon -> merge -> polygon to raster. This yielded a layer that included only the computer-generated segmented diseased patches, which were exported as TIFF files. I followed a similar process for my manually classified diseased patches but used the ‘training sample manager’ instead of segmentation. Both TIFF files were uploaded to Fragstats, where I performed a patch analysis between the two to characterize the diseased patches. I did not find this entirely helpful, but digging deeper into patch analysis could provide worthwhile information about the characteristics of the diseased lesions.

After characterizing the patches, I performed an error analysis using a confusion matrix in R. The two TIFF files were uploaded to R, converted to rasters to check quality, and then converted to tables. The tables were imported into Excel, where values were set to 1 for diseased patches and 0 for non-diseased; anything outside the area of the leaf was deleted. The tables were then imported back into R, and a confusion matrix comparing the computer-classified segmentation with the manually classified patches was computed. All of the confusion matrices were combined for the overall accuracy (Table 1). This step was a bit time-consuming, as was the original image processing, but it was a very useful analysis.
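The cell-by-cell comparison behind the confusion matrix reduces to four counts. A minimal sketch, assuming two flattened binary rasters (1 = diseased, 0 = non-diseased) restricted to leaf pixels; the ten values below are invented for illustration:

```python
import numpy as np

# invented flattened leaf rasters: 1 = diseased pixel, 0 = non-diseased
manual    = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
segmented = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])

tp = int(((manual == 1) & (segmented == 1)).sum())   # agreement: diseased
tn = int(((manual == 0) & (segmented == 0)).sum())   # agreement: healthy
fn = int(((manual == 1) & (segmented == 0)).sum())   # disease missed by segmentation
fp = int(((manual == 0) & (segmented == 1)).sum())   # healthy flagged as diseased

accuracy = (tp + tn) / manual.size
```

Comparing `fn` against `fp` answers the question of whether misclassifications lean toward false negatives or false positives.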

One other approach I used to help with visualization was showing the leaf with false positives and false negatives included (Image 1). This was necessary to help explain results more clearly to others and show results outside of the statistics.

Significance

I have created a method for getting from image acquisition all the way to statistical results. This work is, first of all, significant to my thesis research: it is something I will refine and streamline, but it is a very good starting point. Second, I obtained statistically significant results (based on the p-value) with high accuracy. This will be helpful to those in agronomy or crop consulting who are using remote sensing for disease detection. At this point I don’t think it is practical for agronomists to apply, but with further work I hope to make this methodology an available resource for other researchers to build on and for farmers to potentially try.

Hypothesis revised

After receiving these results and considering my next steps, my hypothesis has been refined. I now hypothesize that segmentation classification has an equal ability to classify diseased patches as manual classification because of the spectral values associated with diseased pixels.

New research questions

This research is far from complete and has led to a new set of questions to be answered.

• Is percent classification of diseased pixels the best metric for how well the classification method works?
• What is the accuracy of a support vector machine versus manually classified pixels?
• Is this classification translatable to other brassica species such as canola, radish, kale, etc.?

Future research

To answer some of my new questions, I believe it is critical that I learn how to create band composites from the five available bands. Currently, my issue is that the bands do not overlap. My approach for solving this is to use georeferenced points/pixels on each image to ensure they align; if I use the four corners of the image, I should get the overlap I want. By creating band composites, I can begin using vegetation indices like NDVI, which distinguish pixels more accurately than any one band alone.
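Once the bands are co-registered, computing an index like NDVI is a simple per-pixel operation. A sketch, assuming the red and near-infrared bands have already been loaded as aligned arrays (band loading is omitted here):

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel.
    Bands must be co-registered arrays of the same shape."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.zeros_like(denom)
    # only divide where the denominator is nonzero, to avoid 0/0
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out
```

Healthy vegetation reflects strongly in NIR and absorbs red, so diseased tissue should show depressed NDVI values relative to its surroundings.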

Another concern is that, despite classifying pixels accurately, I am wary of treating this as the sole determining method of classification. In my current classification there are instances where only 2 or 3 of the 6 or so lesions actually have pixels inside their perimeters classified as diseased. I think a more accurate, or at least more helpful, assessment would quantify the percent of lesions captured by the segmentation process out of the total number of lesions on the leaf.
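That lesion-level metric could be computed by labeling connected components in the manual mask and checking each one for overlap with the segmentation output. A sketch using SciPy's connected-component labeling (an assumed tool choice, not part of the original workflow):

```python
import numpy as np
from scipy import ndimage

def lesion_capture_rate(manual, segmented):
    """Fraction of manually delineated lesions that contain at least one
    segmentation-classified diseased pixel."""
    manual = np.asarray(manual)
    segmented = np.asarray(segmented)
    # connected components of the manual mask = individual lesions
    labels, n_lesions = ndimage.label(manual)
    if n_lesions == 0:
        return 0.0
    captured = sum(
        1 for i in range(1, n_lesions + 1)
        if np.any(segmented[labels == i] == 1)
    )
    return captured / n_lesions
```

A leaf where only 3 of 6 lesions overlap any segmented pixel would score 0.5, regardless of how many pixels were classified correctly overall.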

Using an unsupervised classification scheme for this work was a good way to lay the groundwork. In future work I would like to move to supervised machine learning classification schemes. My reasoning: when translating this algorithm from turnip classification to other crops such as canola and radish, I believe a supervised approach will achieve higher accuracy and consider more spatial characteristics. Currently, my classification scheme uses only the spectral values of the red band; I would like to add variables such as lesion size, proximity to the leaf margin, proximity to other lesions, shape, and NDVI values. This feeds into my last question about applying the classification to a broader range of plants.

Technical and software limitations

Some of my limitations are apparent in the sections above, because they led to my future research questions. One of my greatest limitations was my ability to use ArcGIS Pro effectively and efficiently, and to code in R. This meant image processing took longer than it would have if I were better in ArcGIS Pro. The major roadblocks I hit in ArcGIS Pro were the band-overlap issue mentioned earlier and the use of the support vector machine. I have a plan for solving the band composite issue, but I am still unsure about the support vector machine function in ArcGIS Pro: the way it is set up seems designed for collecting training data from one image rather than from many. My alternative for the SVM is to try conducting the analysis in R.

In R I struggled in my ability to code. This was a major strain on my time and required the assistance of others to help troubleshoot on many occasions. This can be overcome with more practice and is not so much a limitation of the software, but the user.

My learning

I learned the most in ArcGIS Pro, I would say. The image processing took a lot of time and many steps, which exposed me to a whole set of tools and analyses I was previously unaware of. I don't think my skills in R increased much, but I now have the code and steps for the analysis I would like to run on my future thesis work: the confusion matrix. Another program I gained experience with was Fragstats. It was useful for characterizing the diseased patches and may be an option for future work looking at the size, shape, distance, etc. of these patches.

Appendix

Table 1. Confusion matrix for pixel classification of five leaves.

Image 1. Example of one leaf analysis, where green cells are manually classified non-diseased pixels, white pixels inside the leaf are segmentation-classified diseased pixels, and black pixels are manually classified diseased pixels.

# Final Project: San Diego Bottlenose Dolphin Sighting Distributions

## The Research Question:

Originally, I asked the question: do common bottlenose dolphin sighting distances from shore change over time?

However, throughout the research and analysis process, I refined this question for several reasons. For example, I had planned to use all of my dolphin sightings from my six survey locations along the California coastline; because the bulk of my sightings are from the San Diego survey site, I chose that data set for completeness and feasibility. This data set also used the most standardized survey methods and has the largest span of sightings by both year and month. Rather than simply looking at distance from shore, which would be at a very fine scale (all of my sightings are within two kilometers of shore), I chose to try to identify changes in latitude. Furthermore, I wanted to see whether changes in latitude, if present, were related to El Niño Southern Oscillation (ENSO) cycles and then to distances to lagoons. As the hypotheses below show, my original research question morphed into much more specific hypotheses.

## Data Description:

My dolphin sighting data span 1981-2015 with a few absent years, and sightings cover all months, though not in all years sampled. The same transects were performed in a small boat with approximately a two-kilometer sighting span (one kilometer surveyed 90 degrees to starboard and port of the bow). These data points therefore have a resolution of approximately two kilometers. Much of the other data has a coarser spatial resolution, which is why it was important to use such a robust data set. The ENSO data I used gave a broad-brushstroke approach to ENSO indices: rather than using the exact fine-scale ENSO index, I used the NOAA database that splits month-years into positive, neutral, and negative indices (1, 0, and -1, respectively). These data were at a month-year temporal resolution, which I matched to the month-year information of my sighting data. Lagoon data were sourced from the mid-to-late 2000s; therefore, I treated lagoon distances as static.

## Hypotheses:

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015, due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. I also predicted that during warm (positive) ENSO months, dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expected the spatial gradient to shift north and south in relation to the ENSO state (warm, neutral, or cold).

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes.

Map with bottlenose dolphin sightings on the one-kilometer buffered transect line and the six major lagoons in San Diego.

## Approaches:

I utilized multiple approaches across different software platforms, including ArcMap, QGIS, Google Earth, and RStudio (with some data cleaning in Excel).

• Buffers in ArcMap
• Calculations in an attribute table
• ANOVA with Tukey HSD
• Nearest Neighbor averages
• Cluster analyses
• Histograms and Bar plots

## Results:

I produced several maps and found statistical relationships between sighting distribution patterns, ENSO indices and dolphin sighting latitudes, and distances to lagoons.

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

True: The results of the average nearest neighbor spatial analysis in ArcMap 10.6 produced a z-score of -127.16 with a p-value of < 0.000001, which translates into there being less than a 1% likelihood that this clustered pattern could be the result of random chance. Although I could not look directly at prey distributions because of data availability, it is well-known that schooling fishes exist in clustered distributions that could be related to these dolphin sightings also being clustered. In addition, bottlenose dolphins are highly social and although pods change in composition of individuals, the dolphins do usually transit, feed, and socialize in small groups. Also see Exercise 2 for other, relevant preliminary results, including a histogram of the distribution in differences of sighting latitudes.

Summary from the Average Nearest Neighbor calculation in ArcMap 10.6 displaying that bottlenose dolphin sightings in San Diego are highly clustered.
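The statistic behind the ArcMap tool can be reproduced directly: the observed mean nearest-neighbor distance is divided by the expected mean under complete spatial randomness, 0.5 * sqrt(area / n). A minimal NumPy sketch (the projected coordinates and study-area value would come from the real data; everything here is illustrative):

```python
import numpy as np

def nearest_neighbor_ratio(points, area):
    """Average nearest neighbor ratio: observed mean nearest-neighbor
    distance over the expected mean for a random pattern, 0.5*sqrt(area/n).
    Ratio < 1 suggests clustering; > 1 suggests dispersion."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # all pairwise distances, with self-distances masked out
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    observed = d.min(axis=1).mean()
    expected = 0.5 * np.sqrt(area / n)
    return observed / expected
```

A tight cluster of sightings yields a ratio well below 1, matching the strongly negative z-score reported above.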

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015, due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. With this, I predicted that during warm (positive) ENSO months, dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expected the spatial gradient to shift north and south in relation to the ENSO state (warm, neutral, or cold).

False: the sightings are clumped toward the lower latitudes overall (p < 2e-16), possibly due to habitat preference. The sightings are closer to beaches with higher human densities and human-related activities near Mission Bay, CA. It should be noted that just north of the San Diego transect is the Camp Pendleton Marine Base, which conducts frequent military exercises and could deter animals.

I used an ANOVA analysis and found there was a significant difference in sighting latitude distributions between monthly ENSO indices. A Tukey HSD was performed to determine where the differences between treatment(s) were significant. All differences (neutral and negative, positive and negative, and positive and neutral ENSO indices) were significant with p < 0.005.
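The same test can be sketched in Python (the analysis itself was done in R); SciPy's `f_oneway` gives the one-way ANOVA and `tukey_hsd` (SciPy 1.8+) the pairwise follow-up. The latitude values below are made up for illustration:

```python
from scipy import stats

# Hypothetical sighting latitudes grouped by monthly ENSO index
# (-1 negative, 0 neutral, +1 positive); values are illustrative only.
negative = [32.70, 32.71, 32.72, 32.70, 32.71]
neutral = [32.75, 32.76, 32.77, 32.75, 32.76]
positive = [32.80, 32.81, 32.82, 32.80, 32.81]

# One-way ANOVA: is at least one group mean different?
f_stat, p_value = stats.f_oneway(negative, neutral, positive)

# Tukey HSD identifies which specific pairs of groups differ
tukey = stats.tukey_hsd(negative, neutral, positive)
```

With clearly separated groups like these, both the overall ANOVA and every pairwise comparison come out significant, mirroring the pattern reported above.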

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes. See my Exercise 3 results.

Using a histogram, I was able to visualize how distances to each lagoon differed by lagoon. That is, dolphin sightings nearest to Lagoon 6, the San Dieguito Lagoon, are always within 0.03 decimal degrees of it. In comparison, sightings nearest to Lagoon 5, Los Penasquitos Lagoon, are distributed across distances, with the most sightings at a great distance.

Bar plot displaying the different distances from dolphin sighting location to the nearest lagoon in San Diego in decimal degrees. Note: Lagoon 4 is south of the study site and therefore was never the nearest lagoon.
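The nearest-lagoon assignment behind this figure reduces to an argmin over sighting-to-lagoon distances. A sketch in NumPy, using plain Euclidean distance in decimal degrees as in the post (the coordinates below are placeholders, not the real lagoon locations):

```python
import numpy as np

def nearest_lagoon(sightings, lagoons):
    """For each sighting (lat, lon), return the index of the nearest lagoon
    and the distance to it, in the same (decimal-degree) units as the inputs."""
    s = np.asarray(sightings, dtype=float)
    g = np.asarray(lagoons, dtype=float)
    # n_sightings x n_lagoons matrix of distances
    d = np.linalg.norm(s[:, None, :] - g[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(s)), idx]
```

The returned index column is what lets a lagoon south of the study area (Lagoon 4) simply never appear as the nearest lagoon.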

After running an ANOVA in RStudio, I found a significant difference in distance to nearest lagoon between ENSO index categories (p < 2.55e-9), with a Tukey HSD confirming that the difference was significant between neutral and negative months and between positive and neutral months. Therefore, I gather there must be something happening in neutral months that changes the distance to the nearest lagoon; potentially prey are more static, or more dynamic, in those months compared to positive and negative months. A violin plot suggests that Lagoon 5, Los Penasquitos Lagoon, has the widest span of sighting distances when it is the nearest lagoon, across all ENSO index values. In neutral months, Lagoon 0, the Buena Vista Lagoon, has more than a single sighting (there were none in negative months and only one in positive months). The Buena Vista Lagoon is the most northerly lagoon, which may indicate that in neutral ENSO months, dolphin pods are more northerly in their distribution.

## Takeaways to science and management:

Bottlenose dolphins have a clustered distribution that appears to be related to monthly ENSO indices, with certain months differing more in distribution, and likely to their sociality on a larger scale. Neutral ENSO months seem to have some characteristic that shifts sighting distributions along the San Diego coastline; more research is needed to determine what is different about neutral months and how it may affect this dolphin population. On a finer scale, the six lagoons in San Diego appear to have a spatial relationship with dolphin pod sighting distributions. These lagoons may provide critical habitat for the dolphins' preferred prey species, or preferred habitat for the dolphins themselves, either for cover or for hunting; different lagoons may exert influence over different distances, either by creating larger nutrient plumes or because of static geographic and geologic features. This could mean that specific areas should gain or maintain protection. For example, the Batiquitos and San Dieguito Lagoons have some Marine Conservation Areas with No-Take Zones. It is interesting to see the relationship to different lagoons, which may provide nutrient outflows and protection for key bottlenose dolphin prey species. The city of San Diego and the state of California need ways to assess the coastlines and how protecting the marine, estuarine, and terrestrial environments near and encompassing the coastlines impacts the greater ecosystem. Other than the Marine Mammal Protection Act and small protected zones, there are no safeguards for these dolphins.

## My Learning: about software (a) Arc-Info and b) R

1. Arc-Info: buffer creation, creating graphs, and nearest neighbor analyses; how to deal with transects, data with mismatching information, and conflicting shapefiles.
2. R: I knew little beyond the basics. I learned how to conduct ANOVAs and how to interpret the results. Mainly, I learned how to visualize my results and use new packages.

Throughout this project I learned that spatial statistics requires clear hypothesis testing in order to step clearly through a spatial process. Most specifically, I learned about spatial analyses in ArcMap and how I could use nearest neighbor calculations to assess distribution patterns. I now have a better understanding of spatial distribution patterns and how they are assessed: clustered versus random versus evenly dispersed. For data analysis and cleaning, I also learned how to apply my novice understanding of ANOVAs and then display results relating to spatial relationships (distances) using histograms and other graphical displays in RStudio.

________________________________________________________________________

# Manipulating salinity to create a better fit habitat suitability model for Olympia oysters

Follow-up from Exercise 2
In Exercise 2, I compared Olympia oyster location data to the model of predicted suitable habitat that I developed in Exercise 1. Results from that analysis showed that 13 of the 18 observations within or near areas of high-quality habitat (type 4) indicated the presence of Olympia oysters (72%) versus 5 locations where oysters were not found (28%). No field survey locations fell within areas of majority lowest quality habitat (type 1). Seven observations were found within the second lowest quality habitat type (2), with 2 of those observations indicating presence (29%) and 5 indicating absence (71%).

| Habitat suitability | 4 | 3 | 2 | 1 |
| --- | --- | --- | --- | --- |
| Presence | 13 [0.72] | 4 [0.4] | 2 [0.29] | 0 [0] |
| Absence | 5 [0.28] | 6 [0.6] | 5 [0.71] | 0 [0] |
| Total (n = 35*) | 18 [1.0] | 10 [1.0] | 7 [1.0] | 0 [0] |

*3 data points removed from analysis due to inconclusive search results.

To expand on this analysis, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1, since there were no observations in this category, and type 3, since it fell between high- and low-quality habitat.

| Habitat suitability | 4 (high) | 2 (low) |
| --- | --- | --- |
| Presence | 0.37 | 0.06 |
| Absence | 0.14 | 0.14 |

Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The habitat suitability model predicted that oysters would be present within the highest quality habitat type and absent in low-quality habitat. The confusion matrix shows that the model was successful in predicting that 37% of the total observations where oysters were present were found within habitat type 4 (high), and 14% of the observations where oysters were absent were found in habitat type 2 (low).

In the type 4 habitat, 14% of the total observations found that oysters were absent, which goes against the model prediction. I suspect this is partly due to the patchy nature of substrate availability in Yaquina Bay and the low-resolution quality of the substrate raster layer used for analysis. For the 6% of observations that show oyster presence within habitat type 2, it’s possible that these points were juvenile oysters that were able to settle in year-1, but are less likely to survive into adulthood. Both of these errors could also indicate issues with the weights assigned in the model back in Exercise 1.

For exercise 3, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations and account for the errors found in exercise 2. Here I asked:

Can the spatial pattern of Olympia oyster location data be more accurately described by manipulating the spatial pattern of one of the parameters of suitable habitat (salinity)?

I decided to modify the rank values of one of the model parameters: salinity. Based on my experience collecting oyster location data in the field, it seemed that salinity was the biggest influence in where oysters would be found. It was also the easiest parameter to change since it had the fewest rank categories. The excerpt below comes from the ranking value table I established for the habitat parameters in Exercise 1. Changes to rank value for salinity are indicated in the right-most column.

| Habitat parameter | Subcategory | Subcategory variable range | Olympia oyster tolerance | Rank value applied |
| --- | --- | --- | --- | --- |
| Mean wet-season salinity (psu) | Upper estuary | < 16 psu | somewhat, but not long-term | 1 –> 2 |
| | Upper mid estuary | 16.1 – 23 psu | yes | 4 –> 3 |
| | Lower mid estuary | 23.1 – 27 psu | yes | 3 –> 4 |
| | Lower estuary | > 27 psu | somewhat | 2 –> 1 |

Name of tool or approach
I combined my approach from exercise 1 and exercise 2 to create a different model output based on the new rank values applied to the salinity parameter. The analysis was completed in ArcGIS Pro and the table of values generated was reviewed in Excel.

Brief description of steps to complete the analysis

1. After assigning new rank values to the salinity parameter, I applied a new ‘weighted overlay’ to the salinity raster layer in ArcGIS. As I did in exercise 1, I used the ‘weighted overlay’ tool again to combine the weighted substrate and bathymetry layers with the updated salinity layer. A new map of suitable habitat was created based on these new ranking values.
2. Then, I added the field observation data of oyster presence/absence to the map and created a new map of all the data points overlaid on habitat suitability.
3. I then created buffers around each of the points using the ‘buffer’ tool. In the last analysis, I used the ‘multiple ring buffer’, but was only able to analyze the largest buffer (300m). This time, I created only the one buffer around each point.
4. Using the ‘Zonal Statistics’ tool, I overlaid the newly created buffers on the updated habitat suitability raster and viewed the results. I again chose ‘majority’ as the displayed statistic, which categorizes each buffer by the habitat suitability type occupying the largest area within it.
5. I also created a results table using the ‘Zonal Statistics as Table’ tool, then copied it over to Excel for additional analysis.
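The weighted-overlay and zonal-majority steps above amount to a weighted sum of ranked rasters followed by a per-buffer mode. A minimal NumPy sketch of the idea, where the rank arrays and percent-influence weights are placeholders rather than the real layers:

```python
import numpy as np

def weighted_overlay(layers, weights):
    """Combine ranked rasters (same shape, integer rank values) into one
    suitability raster using percent-influence weights that sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    stack = np.stack([np.asarray(l, dtype=float) for l in layers])
    combined = np.tensordot(weights, stack, axes=1)  # weighted sum per cell
    return np.rint(combined).astype(int)  # back to discrete classes

def zonal_majority(suitability, zone_mask):
    """Majority suitability class within a buffer (boolean mask), as in
    Zonal Statistics with the 'majority' statistic."""
    vals = suitability[zone_mask]
    classes, counts = np.unique(vals, return_counts=True)
    return int(classes[counts.argmax()])
```

Because the salinity layer carries the largest weight, swapping its rank values can flip whole regions of the combined raster, which is consistent with how drastically the map changed.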

Results
An updated table based on manipulated salinity rank values was generated to compare to the table created from exercise 2 and displayed at the top of this blog post. Results from this analysis showed that only 2 of the 35 total observations fell within or near areas of high-quality habitat (type 4), one indicated presence and the other absence. The adjustments to the salinity rank value allowed the habitat type 3 to dominate the map, with 31 of the total 35 observations falling in this category. Of the 31 points, 18 showed presence data (58%) and 13 were absence data (42%). Again, no field survey locations fell within areas of majority lowest quality habitat (type 1). Two observations were found within the second lowest quality habitat type (2), both indicating absence (100%).

| Habitat suitability | 4 | 3 | 2 | 1 |
| --- | --- | --- | --- | --- |
| Presence | 1 [0.5] | 18 [0.58] | 0 [0] | 0 [0] |
| Absence | 1 [0.5] | 13 [0.42] | 2 [1.0] | 0 [0] |
| Total (n = 35) | 2 [1.0] | 31 [1.0] | 2 [1.0] | 0 [0] |

Again, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. I removed habitat suitability type 1, since there were no observations in this category.

| Habitat suitability | 4 (high) | 3 | 2 (low) |
| --- | --- | --- | --- |
| Presence | 0.03 | 0.51 | 0 |
| Absence | 0.03 | 0.37 | 0.06 |

Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category, with a near even split between presence (18) and absence (13). In reference to the confusion matrix from exercise 2 at the top of this blog, it is difficult to make a direct comparison of the errors since most of the observations fell into type 3.

Critique of the method
I was surprised to see how drastically the map of suitable habitat changed by manipulating only one of the habitat parameters. The adjustment of the rank values for salinity resulted in a vast reduction in area attributed to the highest quality habitat (type 4). The results indicate that choosing the salinity parameter to manipulate did not result in a better fit model and that changes to salinity rank values were too drastic. Since the salinity parameter contains only 4 subcategories, or 4 different weighted salinity values, the impacts to the habitat suitability map were greater than if the parameter had had more nuance. For example, the bathymetry parameter has 10 subcategories and a reworking of the ranking values within could have made more subtle changes to the habitat suitability map.

The next steps would be to examine another parameter, either substrate or bathymetry, to see if adjustments to ranking values result in a more appropriate illustration of suitable habitat. Additionally, the collection of more oyster location data points will help in creating a better fit model and understanding the nuances of suitable habitat in Yaquina Bay.

# Deaggregation of multi-hazard risks, losses, and connectivity: An application to the joint earthquake-tsunami hazard at Seaside, OR

Research Question

The Pacific Northwest is subject to the rupture of the Cascadia Subduction Zone (CSZ) which will result in an earthquake and nearfield tsunami. Low-lying communities along the coast (such as Seaside, OR) are susceptible to infrastructure damage from both earthquake ground shaking and tsunami inundation. Given resource constraints (budget, personnel, time, etc.), it is not feasible for city planners and resource managers to expect to mitigate all possible damages resulting from an earthquake/tsunami event; however, it is possible to optimize resources in order to minimize the expected damages. Consequently, a first step toward resource optimization is a thorough understanding of the risks posed by the CSZ.

For this project, I investigated how the spatial pattern of tax-lot damage and connectivity to critical infrastructure following a multi-hazard earthquake/tsunami in Seaside, OR can provide insight into possible hazard mitigation measures. The damages and connectivity were deaggregated by both hazard (earthquake vs. tsunami) as well as intensity of the event. An understanding of the deaggregated hazard provides insight into vulnerable areas within Seaside.

Description of Dataset

My dataset consists of maps of the built environment (or infrastructure) and hazard maps (earthquake ground shaking and tsunami inundation) for Seaside, OR. The infrastructure is composed of buildings, an electric power network, a transportation network, and a water supply network (Figure 1). The earthquake and tsunami hazard maps were previously generated through numerical modeling (Park, Cox, Alam, & Barbosa, 2017). Due to computational limitations, tsunami inundation is modeled using a “bare-earth” model indicating that no infrastructure is included.

In addition to the infrastructure and hazard maps, a suite of probabilistic damage codes were utilized. These damage codes implement Monte-Carlo methods to evaluate the expected damages, economic losses, and connectivity associated with earthquake/tsunami hazards.

Figure 1: Infrastructure at Seaside (Kameshwar et al., n.d.)

Hypotheses

The project was divided into three exercises. The questions posed by each exercise, and hypotheses are outlined below:

1. What is the spatial pattern of economic losses resulting from a joint earthquake/tsunami event? How does each hazard contribute to this spatial pattern?
1. Earthquakes and tsunamis pose different hazards to the built environment: the former produces strong ground shaking, whereas the latter produces inundation. Tsunami inundation is much more spatially variable because it is limited by characteristics such as ground elevation and friction losses. Conversely, earthquake ground shaking is not limited by elevation or friction and is consequently less spatially variable within a geographic region (especially at scales the size of Seaside). Because of these differences, I expect the spatial pattern of tsunami losses to be concentrated along the coast, whereas the spatial pattern of earthquake losses will be dispersed around Seaside.

2. How does the spatial pattern of economic losses relate to the spatial pattern of tsunami momentum flux?
1. Because there is significantly more spatial variability in the tsunami hazard compared to the earthquake hazard, I wanted to investigate how this spatial pattern relates to the spatial pattern of tsunami economic losses. Economic losses are driven by the intensity of the hazard, therefore, I expect a significant correlation between economic losses and tsunami momentum flux (a measure of both inundation depth and velocity).

3. How vulnerable is Seaside’s networked infrastructure to a joint earthquake/tsunami event?
1. While economic losses to infrastructure can play a role in hazard mitigation planning, they should not serve as the only driver. Mitigation planning should also (and perhaps more importantly) consider resident accessibility to critical services and facilities immediately following a disaster. Given the importance of resident accessibility, I wanted to perform a connectivity analysis of networked infrastructure (electricity, transportation, and water) following a joint earthquake/tsunami event. The networked infrastructure in Figure 1 shows that failure of a few key links could cut off service to large populations of Seaside (e.g., the networks are not highly “parallel” but exhibit some “series” features). For example, failure of the pipes leading to the water treatment plant (Figure 1-d) would result in complete disconnectivity of water to Seaside. Given the structure of the networks at Seaside, I expect sharp increases in disconnectivity if some of the key links fail.

Approaches/Methods

1. Spatial pattern of economic losses:
1. To evaluate the spatial pattern of economic losses, I created real market value and economic loss heatmaps. Economic loss heatmaps provide insight into densely populated regions within Seaside that could potentially benefit from a single mitigation option. Heatmaps were generated by using the kernel density estimation tool within the QGIS processing toolbox.
2. Relationship between economic losses and tsunami momentum flux:
1. Performing a geographically weighted regression (GWR) provided insight into the relationship between economic losses and tsunami momentum flux. Here, the percent of economic loss depended on the tsunami momentum flux. The GWR was performed using the python spatial analysis library PySal.
3. Connectivity of networked infrastructure:
1. Connectivity analyses between tax lots and critical infrastructure provided insight into the probability of tax lots becoming disconnected from critical infrastructure. Additionally, maps showing the probability of link failure were generated to isolate vulnerable links within the networks. The connectivity analysis was performed using the python network analysis package python-igraph.
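The Monte Carlo connectivity idea in step 3 can be sketched without igraph: sample link failures, then check with a graph search whether a tax lot can still reach the critical facility. A toy pure-Python version (the node set, edges, and failure probabilities below are made up, not the Seaside networks):

```python
import random

def connectivity_probability(nodes, edges, fail_prob, source, target,
                             trials=5000, seed=0):
    """Monte Carlo estimate of the probability that `source` stays connected
    to `target` when each edge fails independently with probability
    fail_prob[edge]."""
    rng = random.Random(seed)
    connected = 0
    for _ in range(trials):
        # keep only the edges that survive this trial
        alive = [e for e in edges if rng.random() >= fail_prob[e]]
        # breadth-first search over the surviving edges
        adj = {n: [] for n in nodes}
        for a, b in alive:
            adj[a].append(b)
            adj[b].append(a)
        seen, queue = {source}, [source]
        while queue:
            for nbr in adj[queue.pop()]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        connected += target in seen
    return connected / trials
```

A two-link "series" path with 50% failure per link survives only about a quarter of the time, which is why the few series-like key links in Figure 1 dominate Seaside's disconnectivity.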

Results

Spatial pattern of economic losses

Heatmaps showing the spatial pattern of economic losses, deaggregated by both hazard and intensity, are shown in Figure 2. A hot spot of damages is located at the central business district (CBD) and appears even for low-magnitude earthquake events. The tsunami damages are more evenly distributed along the coast relative to the earthquake damages. Figure 3 shows the total economic risks for all of Seaside, deaggregated by hazard and intensity. Here, risk is defined as the economic losses multiplied by the annual probability of occurrence (the inverse of the return period). Quantifying the economic losses as risk isolates events that are both likely to occur and likely to produce significant damages. Figure 3 shows that the 250- to 1000-year events pose the highest economic risks. If economic losses are a priority, a city planner could use Figures 2 and 3 to identify the regions of buildings within Seaside most vulnerable to damage; those regions would benefit the most from mitigation options.

Figure 2: Deaggregated heatmaps of economic losses

Figure 3: Deaggregated economic risks for all of Seaside
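The risk deaggregation above is a simple calculation: loss times annual exceedance probability, i.e. loss divided by return period. A sketch with illustrative loss numbers (placeholders, not the actual Seaside results):

```python
# Risk = economic loss x annual probability of occurrence (1 / return period).
# Loss values below are placeholders, not the actual Seaside results.
losses_by_return_period = {
    100: 50e6,    # frequent event, modest loss
    250: 200e6,
    500: 450e6,
    1000: 600e6,
    2500: 800e6,  # rare event, large loss
}

risk = {rp: loss / rp for rp, loss in losses_by_return_period.items()}
# Rare events carry large losses but low annual probability, so the
# mid-range return periods can end up dominating the risk.
```

This is why the 2500-year event, despite producing the largest loss, does not pose the largest risk.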

Relationship between economic losses and tsunami momentum flux

The relationship between percent of economic losses and tsunami momentum flux was measured by performing a geographically weighted regression (GWR). A key parameter of the GWR is the bandwidth, which defines the area over which each local regression is performed. A bandwidth of 200 was initially used (see Exercise 2). The bandwidth was then optimized by comparing the resulting r^2 values of multiple GWRs (Figure 4). A bandwidth of 75 results in the largest r^2 value and was subsequently used for further analysis. Bandwidths below this value resulted in errors in the GWR code.

The results from the GWR with the updated bandwidth are shown in Figure 5. The slope is small near the coast, with hotspots of larger values located further inland. Conversely, the intercept is large near the coast and decreases further inland. The intercept pattern reflects the momentum flux itself, which is large near the coast and decreases as the tsunami propagates landward.

Interestingly, some of the slope values are less than 0, indicating that damages decrease as momentum flux increases. This is likely explained by the heterogeneity of building types within Seaside. For example, concrete buildings are more resistant to tsunami damage than wood buildings. Consequently, a concrete building that experiences a large momentum flux may sustain less damage than a wood building that experiences a small momentum flux. To validate this, the buildings in Seaside were deaggregated by their characteristics and a linear regression was performed for each class (Figure 6). The buildings can be classified according to construction material (wood vs. concrete), year built, and number of floors. Here, W1 corresponds to one-story wood buildings, W2 to two-story wood buildings, and C1 to concrete buildings with a moment frame. The concrete moment-frame buildings can be further divided into those with fewer than or more than four stories. Figure 6 shows that all regression slopes are positive, with relatively high r^2 values. Considering the most extreme case, if a W1 building built before 1979 is next to a C1 building built after 2003, the resulting local regression slope would likely be negative.
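The GWR itself was run with PySAL, but the underlying idea is compact: at each location, fit a weighted least-squares line with weights from a Gaussian kernel of distance. Below is a minimal from-scratch sketch on synthetic data; all names, coordinates, and coefficients are made up for illustration, not the Seaside values:

```python
import numpy as np

def gwr_coeffs(coords, x, y, bandwidth):
    """Fit y = b0 + b1*x at every sample location, weighting observations by a
    Gaussian kernel of their distance to that location (a minimal GWR sketch)."""
    Xd = np.column_stack([np.ones(len(x)), x])        # design matrix with intercept
    out = np.empty((len(coords), 2))
    for i, pt in enumerate(coords):
        d = np.linalg.norm(coords - pt, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)       # Gaussian kernel weights
        Xw = Xd * w[:, None]
        out[i] = np.linalg.solve(Xw.T @ Xd, Xw.T @ y) # local (intercept, slope)
    return out

# Synthetic stand-in data: momentum flux decays "inland", damage tracks it.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, size=(200, 2))          # metres, hypothetical
flux = 10 * np.exp(-coords[:, 0] / 500)               # high near the "coast" (x = 0)
damage = 5 + 8 * flux + rng.normal(0, 1, 200)         # percent loss, synthetic

betas = gwr_coeffs(coords, flux, damage, bandwidth=200)
print(betas[:3])                                      # local intercepts and slopes
```

Shrinking the bandwidth makes each local fit more sensitive to nearby points (higher r^2, but unstable below some size), which is consistent with the bandwidth-selection behavior described above.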

Figure 4: GWR bandwidth vs. r^2. Used for bandwidth selection.

Figure 5: Slope and intercept results from GWR analysis

Figure 6: Linear regression deaggregated by building characteristics

Connectivity of networked infrastructure

The probability of each tax lot being connected to critical facilities was evaluated by performing a connectivity analysis using the EPN, transportation, and water networks (see networks in Figure 1). The total fraction of tax lots disconnected from critical infrastructure is shown in Figure 7 (left-hand side). Similar to economic risks, the fraction of disconnection was multiplied by the probability of occurrence to determine the disconnectivity risk. For the 1000-year event, nearly all of Seaside becomes disconnected. Furthermore, the 250- to 500-year events pose the highest risk across the three networked infrastructures.

The network performance can also be characterized spatially by creating maps indicating the probability of link failure and tax lot disconnection. An example is shown in Figure 8 for the tsunami damage resulting from a 500-year event. This case was selected to view spatially because it drives the risk associated with the transportation network. Figure 8 shows that tax lots west of the Necanicum River have a high probability of becoming disconnected. Viewing the link failure map, it becomes clear that the disconnection is not driven by bridge failure, but rather from road washout.
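The post performed this analysis with python-igraph; the core idea (sample random link failures, then check which nodes can still reach a critical facility) can be sketched in plain Python on a made-up toy network. The network, failure probabilities, and node labels below are hypothetical:

```python
import random

def connected(n, edges, source):
    """Nodes reachable from `source` over the surviving edges (simple DFS)."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    seen, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v); stack.append(v)
    return seen

def disconnection_prob(n, links, p_fail, source, trials=2000, seed=1):
    """Monte Carlo: for each node, the fraction of trials in which it loses
    its path to the critical-facility node after random link failures."""
    rng = random.Random(seed)
    lost = [0] * n
    for _ in range(trials):
        surviving = [e for e, p in zip(links, p_fail) if rng.random() >= p]
        reach = connected(n, surviving, source)
        for v in range(n):
            if v not in reach:
                lost[v] += 1
    return [c / trials for c in lost]

# Toy 4-node chain: node 0 is the critical facility; link (2, 3) is fragile.
links  = [(0, 1), (1, 2), (2, 3)]
p_fail = [0.05,   0.05,   0.9]        # hypothetical per-link failure probabilities
probs = disconnection_prob(4, links, p_fail, source=0)
print(probs)
```

Maps like Figure 8 amount to plotting `p_fail` on the links and `probs` on the tax lots, which is how a fragile road segment (rather than a bridge) can be identified as the driver of disconnection.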

Figure 7: Disconnection results: (left) fraction of tax lots disconnected from critical infrastructure and (right) disconnection risk

Figure 8: Probability of (left) link failure and (right) tax-lot disconnection

Significance

The ability to spatially isolate vulnerable infrastructure within a community serves as a first step in optimizing resource management. Being able to identify highly vulnerable areas ensures that resources are not spent in areas that are already relatively resilient to the earthquake/tsunami hazard. The risk results from this analysis show that Seaside is particularly vulnerable to the 250- to 1000-year rupture of the CSZ. There is a large concentration of economic value located at the city center of Seaside. If resource managers and city planners place an emphasis on reducing economic losses, then they should focus on mitigation measures to reduce damages within this region. However, more important than economic losses is resident and tourist accessibility to critical facilities following a natural disaster. Network mitigation measures should focus on reducing the damages associated with the 250- to 500-year events.

The results from this analysis are not comprehensive and future work should include:

1. Spatial analysis of mitigation measures – Additional damage modeling should be performed with various mitigation measures in place. For example, the effect on transportation connectivity resulting from earthquake retrofitting of bridges can be incorporated into the damage model. Similar spatial maps as those produced in this project could be created in order to determine the spatial effect that mitigation measures have.
2. Incorporate population in analyses – Climate change and natural disasters tend to have a disproportionate effect on populations within a geographic region. Future work should include spatially mapping socio-economic vulnerability indices to identify vulnerable populations within Seaside.
3. Perform additional GWRs – It was shown that the mix of building classes led to counterintuitive results in the GWR. Additional GWRs could be performed by only considering similar building types within a geographic region.

Learning

Throughout this term, I learned how to both use new software as well as apply new statistical methods.

1. Software
1. Python – Prior to this course, I was familiar with both QGIS and Python; however, I had not used the two together. A personal goal was to learn how to perform geospatial analysis in Python. This was accomplished by performing the GWR using the Python library PySAL. I additionally learned how to read and modify shapefiles in Python using the geospatial library geopandas.
2. QGIS processing toolbox – I learned how to use multiple tools within the QGIS processing toolbox (e.g., heatmap interpolation and hotspot analysis).
2. Statistics
1. Hot spot analysis – Although I did not use hotspot analysis for my project, I learned about this method and did a test case for exercise 1. The hotspot analysis was used to isolate areas within Seaside that resulted in disproportionate damages relative to the surrounding area.
2. Geographically weighted regression – Geographically weighted regression was performed to evaluate the relationship between tsunami momentum flux and percent damage.

As a significant portion of my research will be spatially oriented, the tools and skills I learned during this term will be beneficial for future work. Furthermore, this course introduced additional geospatial statistical methods that were not implemented for this project but could be relevant for additional work.

References

Kameshwar, S., Farokhnia, K., Park, H., Alam, M. S., Cox, D., Barbosa, A., & van de Lindt, J. (n.d.). Probabilistic decision-support framework for community resilience: Incorporating multi-hazards, infrastructure interdependencies, and target objectives in a Bayesian network. Reliability Engineering and System Safety.

Park, H., Cox, D. T., Alam, M. S., & Barbosa, A. R. (2017). Probabilistic Seismic and Tsunami Hazard Analysis Conditioned on a Megathrust Rupture of the Cascadia Subduction Zone. Frontiers in Built Environment, 3(June), 1–19. https://doi.org/10.3389/fbuil.2017.00032

# Spatial pattern of ventenata invasion in eastern Oregon: Final Project

1. The research question that you asked.

I initially asked the question, “how is the spatial pattern of invasion by the recently introduced annual grass, ventenata, influenced by the spatial pattern of suitable habitat patches (scablands) via the susceptibility of these habitat patches to invasion and ventenata’s invasion potential?”

2. A description of the dataset you examined, with spatial and temporal resolution and extent.

In Exercise 1, I examined spatial autocorrelation in ventenata abundance and ventenata hotspots using spatial data (coordinates and environmental variables) and ventenata cover data that I collected in the field (summer 2018) for 110 plots within and surrounding seven burn perimeters across the Blue Mountain Ecoregion of eastern Oregon.

Target areas were located to capture a range of ventenata cover from 0% ventenata cover to over 90% cover across a range of plant community types and environmental variables including aspect, slope, and canopy cover within and just outside recently burned areas. Once a target area was identified, plot centers were randomly located using a random azimuth and a random number of paces between 5 and 100 from the target areas. Sample plots were restricted to public lands within 1600m of the nearest road to aid plot access. Environmental data for sample plots includes: canopy cover, soil variables (depth, pH, carbon content, texture, color, and phosphorus content), rock cover, average yearly precipitation, elevation, slope, aspect, litter cover, and percent bare ground cover.
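The plot-placement procedure described above (a random azimuth plus a random number of paces from a target area) can be sketched as follows. The pace length in metres is an assumed value, not stated in the post:

```python
import math, random

def random_plot_center(x, y, rng, min_paces=5, max_paces=100, pace_m=0.76):
    """Offset a target location by a random azimuth and a random number of
    paces (the 0.76 m pace length is an assumption for illustration)."""
    azimuth = rng.uniform(0, 2 * math.pi)
    dist = rng.randint(min_paces, max_paces) * pace_m
    return x + dist * math.sin(azimuth), y + dist * math.cos(azimuth)

rng = random.Random(6)
center = random_plot_center(500.0, 500.0, rng)   # hypothetical target coordinates
print(center)
```

In practice each candidate center would also be checked against the constraints mentioned above (public land, within 1600 m of a road).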

For Exercise 2, I examined how the spatial pattern of vegetation type influences the invasibility of plant communities by ventenata. To achieve this, I used the Simpson Potential Vegetation Type raster (Simpson 2013) at 30 m resolution in ArcGIS, which was developed to identify potential vegetation types across the Blue Mountain Ecoregion.

In Exercise 3, I explored how the spatial pattern of canopy cover was related to ventenata abundance. For this, I used a live tree canopy cover layer developed for the Pacific Northwest calculated using Forest Vegetation Simulator methods including the sum of canopy cover estimates for vegetation plots in the region (Crookston and Stage 1999).

3. Hypotheses: predictions of patterns and processes you looked for

Ventenata is an invasive annual grass that shares many functional traits with other impactful invasive annual grasses in the region, such as cheatgrass and medusahead, including similar vegetative height, fall germination, and a shallow root system. These similarities have led me to believe that, like cheatgrass and medusahead, ventenata will be more abundant in open areas with low canopy cover where competition from existing vegetation is lower.

The study area contains many open areas interspersed throughout the larger forested landscape. The patchy spatial distribution of open areas throughout the study area will likely result in a patchy distribution of areas with high ventenata cover. Additionally, ventenata produces many seeds, with the majority of these seeds dispersing short distances from the parent plant. This leads me to believe that areas with high ventenata cover will be clustered near other areas with high ventenata cover creating invasion “hot spots” across the study region.

Hypothesis 1: Areas with high ventenata cover will be clustered near other high cover areas and low cover areas will be clustered near other low cover areas.

Hypothesis 2: The spatial pattern of ventenata abundance will be positively correlated with a neighborhood of non-forest habitat types (shrub-lands and grasslands) and negatively correlated with a neighborhood of forest habitat types. This relationship will decrease in strength as distance increases from the high cover sample point, as vegetation types farther from an invasion point are likely not as strongly influencing invasion as vegetation types closer to that point.

Once a species has established in a suitable habitat, it may spread to areas of less suitable habitat aided by strong propagule pressure from a nearby population. Open areas may act as source populations, allowing ventenata to build propagule pressure to the point where it is able to successfully establish and maintain a population in less suitable habitat such as areas with high canopy cover.

Hypothesis 3: Plots where ventenata is present in areas with high canopy cover (e.g. forests) will be clustered near open areas. These open areas may provide strong propagule pressure to aid invasion into areas with fewer available resources (sunlight).

4. Approaches: analysis approaches you used.

To test these predictions I performed a handful of spatial analyses including:

Exercise 1: I tested for spatial autocorrelation using Moran’s I and created a correlogram in R and performed hot spot analysis in ArcGIS

Exercise 2: I explored the spatial relationship between the spatial pattern of ventenata abundance and the spatial pattern of different vegetation types using neighborhood analyses in ArcGIS and R

Exercise 3: I examined the spatial relationship between ventenata and canopy cover using a Ripley’s cross K analysis in R

5. Results: what did you produce — maps? statistical relationships? other?

Throughout the analyses, I produced a series of statistical relationships displayed as maps and graphs. Hot spot analysis produced a map that allowed me to visualize the relationship of autocorrelation between ventenata abundance at my sample points. For Moran’s I, neighborhood analysis, and Ripley’s cross K, I produced graphical representations of statistical relationships in R.

6. What did you learn from your results? How are these results important to science? to resource managers?

The correlogram and hotspot analysis results showed that the spatial pattern of ventenata is autocorrelated and has a patchy distribution. The hotspot analysis suggests that areas of high ventenata cover are clustered with other high-ventenata plots and that low-ventenata plots are likewise clustered, as I predicted in Hypothesis 1. This is likely a result of the patchy distribution of open areas and forested areas across the landscape and the dispersal ability of ventenata.

Neighborhood analysis showed that areas with high ventenata cover are more positively correlated with nearby forested areas (ponderosa pine) than I originally thought. This result suggests that ventenata may preferentially invade areas surrounded by ponderosa pine vegetation type as well as shrublands which would not support Hypothesis 2. However, ponderosa pine vegetation type does not necessarily indicate high canopy cover, and could represent invasion into an alternative low canopy cover vegetation type. Additionally, the vegetation type maps are mapped at large spatial scales and may not represent the fine scale variation in vegetative cover. Uncertainty in this result inspired a follow up analysis using canopy cover instead of vegetation type as a predictor variable in a Ripley’s cross K analysis.
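The neighborhood analysis above amounts to computing, for each plot, the share of a buffer occupied by a given vegetation class. A crude raster sketch of that computation on a toy landscape (the raster, plot locations, and radius are all made up):

```python
import numpy as np

def class_fraction(raster, centers, radius):
    """Fraction of cells equal to 1 within `radius` (in cell units) of each
    center: a crude stand-in for % area of a vegetation type in a buffer."""
    ny, nx = raster.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    out = []
    for cy, cx in centers:
        mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        out.append(raster[mask].mean())
    return np.array(out)

# Toy landscape: left half "forest" (1), right half open (0); hypothetical plots.
raster = np.zeros((100, 100), dtype=int)
raster[:, :50] = 1
plots = [(50, 10), (50, 40), (50, 60), (50, 90)]
frac = class_fraction(raster, plots, radius=15)
print(frac)   # forest fraction falls off moving from the forest half into the open
```

Correlating these fractions (at several radii) with ventenata cover at the plots is the essence of the distance-decaying neighborhood relationship hypothesized above.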

In my follow up analysis using Ripley’s cross K, I found that forest plots where ventenata was present were only weakly clustered around open areas despite my original hypothesis that there would be strong clustering (H3). These results could suggest that ventenata has a higher tolerance for high canopy cover than I originally predicted. Alternatively, these results could indicate that ventenata is capable of dispersing large quantities of seed much farther distances than originally thought, thus not requiring open areas in the immediate neighborhood. Moreover, the same issue of scale may apply to the canopy layer as the vegetation layer, and the 30m resolution may be over predicting canopy cover at my sample sites.

My findings could have serious implications for forest ecosystems, which are commonly thought to be relatively resistant to invasion by annual grasses and are now showing susceptibility to ventenata invasion. For example, ventenata could increase fine fuels in these systems, making them more likely to ignite and carry surface fire. Managers may want to consider incorporating annual grass management strategies into their current and future forest management plans to help reduce potential invasion impacts.

7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

During this class I learned a suite of new tools in ArcGIS including hotspot analysis and concentric ring buffer. I created my first model using ArcGIS Modelbuilder! I learned the basics of spatstat in R and successfully completed some spatial analysis which required transforming my data into a spatial data frame (I did not know that these existed prior to this class). Additionally, I was exposed to, and gained experience using many other new functions in R including Moran.I, correlog and kcross that were useful for spatial analysis.

8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) other techniques?

I learned that Moran’s I and correlograms are useful for testing spatial autocorrelation in data, but only if the scale applied is of interest. For example, it was not useful to compute only one Moran’s I value for my entire data set – this indicated that there was spatial autocorrelation in the data, but did not reveal a spatial pattern. However, when I computed Moran’s I at various distances and displayed these results in a correlogram, I uncovered the pattern of the spatial correlation. The hotspot analysis allowed me to visualize exactly where the high and low clustering was occurring across my sample plots while simultaneously providing a significance value for those hot and cold spots.
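The Moran's-I-at-several-distances procedure was run in R (Moran.I, correlog); as a sketch of what those functions compute, here is a minimal Python version using distance-band weights on synthetic, spatially patterned data:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for a value vector and a (zero-diagonal) weight matrix."""
    z = values - values.mean()
    return (len(values) / weights.sum()) * (z @ weights @ z) / (z @ z)

def correlogram(coords, values, bin_edges):
    """Moran's I per distance band: weight = 1 if the pair's distance falls in
    the band, else 0 (mirrors computing Moran's I at several lags in R)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    out = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        w = ((d > lo) & (d <= hi)).astype(float)
        out.append(morans_i(values, w))
    return out

# Synthetic patchy surface: nearby points share similar cover values.
rng = np.random.default_rng(2)
coords = rng.uniform(0, 100, size=(150, 2))
values = np.sin(coords[:, 0] / 15) + rng.normal(0, 0.3, 150)

ci = correlogram(coords, values, bin_edges=[0, 10, 25, 50, 100])
print(ci)   # high positive I at short lags, decaying at longer lags
```

A single whole-dataset Moran's I collapses all of this into one number, which is exactly why it indicated autocorrelation without revealing the pattern.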

Ripley’s cross K analysis was useful for testing the relationship of my ventenata points to another variable (canopy cover). I found this test appealing because it tests whether one variable is clustered around another, using a Poisson distribution to compare observed and expected values under an assumption of spatial randomness. However, I learned that this method was not appropriate for my data, as my sample plots were chosen based on field variables and were not a random sample. This violated the assumptions of randomness and homogeneity across the sampling region, as my plots were more heavily located in non-forested areas. To properly investigate these spatial questions, I would have to develop a more random sampling method.
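The analysis was done with kcross in R; as an illustration of the statistic itself, here is a naive Python sketch of the cross-K estimator (no edge correction, unlike spatstat) on synthetic data where one point type is deliberately clustered around the other:

```python
import numpy as np

def cross_k(pts_a, pts_b, radii, area):
    """Naive Ripley's cross-K (no edge correction): for each r, the average
    number of type-B points within r of a type-A point, scaled by B's intensity."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    lam_b = len(pts_b) / area
    return [(d <= r).sum() / (len(pts_a) * lam_b) for r in radii]

# Synthetic test: B points ("nearby invasions") clustered around A ("open areas").
rng = np.random.default_rng(3)
pts_a = rng.uniform(0, 100, size=(20, 2))
pts_b = pts_a[rng.integers(0, 20, 100)] + rng.normal(0, 2, (100, 2))

radii = [5, 10, 20]
k_obs = cross_k(pts_a, pts_b, radii, area=100 * 100)
k_csr = [np.pi * r ** 2 for r in radii]   # expectation under spatial randomness
print(k_obs, k_csr)   # k_obs >> k_csr indicates clustering of B around A
```

The comparison with `pi * r**2` is what makes the non-random plot placement problematic: if the sampling itself is biased toward open areas, the CSR benchmark no longer applies.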

Citations

Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

Crookston, NL and AR Stage. 1999. Percent canopy cover and stand structure statistics from the Forest Vegetation Simulator. Gen. Tech. Rep. RMRS-GTR-24. Ogden, UT: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. 11 p.

# Final Project Summary – Spatial Analyses of the Relationships Between Seascape Classes and Rockfish Distributions

The question that I sought to answer in this class was “How is the spatial pattern of forage fish assemblage in the California Current System related to the spatial pattern of seascapes based on the sea-surface conditions used to classify the seascapes?”  This is outlined in my first blog post, which can be found here. Additionally, I was seeking to characterize the spatial distribution and extent of seascape classes and forage fishes. Seascape classes are used as a way to simplify the myriad of physical, biological, and chemical processes that can affect organisms in the ocean. Eventually, the focus of the analyses shifted to young-of-the-year rockfish, as opposed to forage fish. All analyses mentioned in this blog post were conducted using the YOY-Rockfish data as opposed to the forage fish data.

Data

Midwater trawls have been conducted annually by the National Oceanic and Atmospheric Administration’s (NOAA) Southwest Fisheries Science Center (SWFSC) in an attempt to monitor the recruitment of pelagic rockfish (Sebastes spp.) and other epipelagic micronekton at SWFSC stations off California. The trawls have informed a dataset that represents overall abundance of all midwater pelagic species that commonly reside along the majority of the nearshore coast of California from 1998 to 2015. Each trawl collected information about both fish abundance, recorded in absolute abundance, and location data, recorded in the form of latitude and longitude. The dataset also includes a breakdown of species by taxa.

Seascapes have been classified using a combination of in-situ data (from the trawls) and remotely sensed data from NASA’s MODIS program. Seascapes were classified using the methods described in Kavanaugh et al., 2014 and represent the seascape class in the immediate area that each trawl occurred. Seascapes are classified at 1 km and 4 km spatial resolution and at 8-day and monthly temporal resolution. Each seascape has been assigned an ID number which is used to identify similar conditions throughout the dataset.

The current seascape classification methods are able to classify surface conditions into one of 33 distinct seascape classes. The variables used to classify the seascapes are listed below:

• Sea-surface temperature (Celsius)
• Sea-surface salinity (Practical salinity units)
• Absolute Dynamic Topography (meters)
• Ice cover (% of area)
• Colored dissolved organic matter (m^-1)
• Spatial averaged sea-surface chlorophyll a (mg m^-3)
• Phytoplankton physiological state (NFLH) (W m^-2 um^-1 sr^-1)
• NFLH:Chlorophyll ratio

It is inferred that certain, carefully-selected sea-surface conditions can imply favorable conditions to support healthy ocean ecosystems in the water column and near the ocean floor. The conditions listed above have been studied and selected for this reason. However, this may not be a perfect science – any attempt to delineate and simplify the multitude of factors and conditions that facilitate life in the ocean is likely to leave out important factors. This underscores the importance of research that tests these classification methods and measures their ability to replicate and predict the true natural environment. The analyses conducted for this class attempt to do that in the context of one functional group of fishes within one major ecosystem in the Pacific Ocean.

Hypotheses

Simply put, I hypothesized that measurable changes in the spatial extent of certain seascape classes would also be identifiable in the spatial variability of forage fish assemblages over time. The California Current ecosystem is meticulously studied by a myriad of researchers from a number of different affiliate institutions. Studies reviewing the physical and biogeochemical conditions of the area indicate that, from an environmental perspective, many areas off the coast of California should support high fish abundance.

Specifically, I was expecting areas of high forage fish and young of the year rockfish abundance to exist in seascape classes that represent nutrient-rich, upwelled water. These conditions have been shown to support thriving ecosystems throughout the water column due to an abundance of energy and food for fishes that live higher in the water column. Upwelling brings cold, nutrient-rich water to the surface and occurs seasonally in most places along the California Coast. Using the variables listed above, that would mean below average sea-surface temperature, high dissolved organic matter, and high chlorophyll a. Some classes that represent conditions similar to this are:

• Temperate Transition (Class 7)
• Tropical/Subtropical Upwelling (Class 11)
• Subpolar (Class 12)
• Temperate Blooms Upwelling (Class 14)
• Warm, Blooms, High Nutrients (Class 21)

Preliminary multivariate community structure analysis has shown some statistically significant relationships between certain species and certain seascape classes using this data. If spatial patterns do exist, I expect there to be some relationship between the surface conditions and the fish found at depth of the midwater trawls.

Since my focus shifted from forage fish to YOY rockfish, my hypothesis shifted from a prediction of correlation between certain seascape classes and forage fish abundance to relationships between rockfish and other seascape classes (as determined by the aforementioned multivariate analyses).

My hypothesis, taking into account all of the aforementioned considerations, can be restated as follows:

It is expected that the spatial pattern of areas of high young-of-the-year rockfish abundance will be related to the spatial patterns of seascape classes that represent areas of high nutrient availability and upwelled water along the California Coast.

Additionally, I expect even higher areas of abundance to occur in areas where two or more seascapes representing these favorable conditions converge or border one another. These border areas are likely to indicate much larger swaths of ocean that hold habitable conditions that are likely to be able to support entire communities rather than smaller numbers of fishes. While this hypothesis was not tested in this project, future work can be conducted using FRAGSTATS, patch analysis software, and other landscape ecology methods to seek an answer related to this prediction.

Approaches Used

Exercise 1: For the first exercise, different interpolation methods were used to explore their effects on the point trawl data. The interpolations (kriging and IDW were tested) were used to model the abundance of rockfish in four selected years.
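The interpolations were run in ArcGIS; as a sketch of what the simpler of the two methods (IDW) does, here is a minimal Python version on made-up trawl stations, where each grid value is a distance-weighted mean of the samples:

```python
import numpy as np

def idw(sample_xy, sample_vals, grid_xy, power=2, eps=1e-12):
    """Inverse-distance-weighted interpolation: each grid value is a weighted
    mean of the samples, with weight 1 / d**power."""
    d = np.linalg.norm(grid_xy[:, None, :] - sample_xy[None, :, :], axis=2)
    w = 1.0 / (d + eps) ** power
    return (w * sample_vals).sum(axis=1) / w.sum(axis=1)

# Toy trawl data: abundance at scattered stations, interpolated onto a grid.
rng = np.random.default_rng(4)
stations = rng.uniform(0, 50, size=(30, 2))          # hypothetical station coords
abundance = rng.poisson(20, 30).astype(float)        # synthetic catch counts

gx, gy = np.meshgrid(np.linspace(0, 50, 25), np.linspace(0, 50, 25))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = idw(stations, abundance, grid)
print(surface.min(), surface.max())
```

One consequence worth noting: IDW surfaces never exceed the range of the sampled values, whereas kriging can, which is one way the choice of method affects the resulting maps.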

Exercise 2: Next, a neighborhood analysis was used to determine what the dominant seascapes were in the areas around trawls that produced high rockfish abundance (both according to the data and the interpolation) and low rockfish abundance. Buffers were set at 5, 10, and 25km distances around two trawls of each kind and seascape classes were measured as a function of % area of each buffer.

Exercise 3: Finally, a confusion matrix was calculated that measured the agreement between occurrence of the significant seascapes (as determined by exercise 2) and areas of high abundance (as determined by the interpolations in exercise 1).

Results

Exercise 1: The interpolations produced a number of maps of varying spatial extents and with varying resolution. Methods had to be refined multiple times to procure workable results, but maps interpolating rockfish abundance for 4 distinct years were eventually created and used.

Exercise 2: The neighborhood analysis produced statistics related to the percent of each buffer area occupied by each seascape class. This information was then used to create two plots – one for areas of high abundance and one for areas of low abundance. The most important result was an understanding of which seascape classes represented areas of high rockfish abundance. The dominant seascape classes in areas of high abundance turned out to match two of the seascape classes predicted in my hypothesis: Temperate Transition (Class 7) and Subpolar (Class 12).

Exercise 3: The final exercise produced a confusion matrix that measured true positives, true negatives, false positives, and false negatives as they related to the agreement between the seascape classes and the interpolated maps. The final matrix also produced an overall agreement score (Kappa).
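The agreement score mentioned above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch on toy presence/absence arrays (the values are illustrative, not the project's data):

```python
import numpy as np

def kappa(a, b):
    """Cohen's kappa for two binary presence/absence arrays."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                         # observed agreement
    p_yes = a.mean() * b.mean()                  # chance both say "presence"
    p_no = (1 - a.mean()) * (1 - b.mean())       # chance both say "absence"
    pe = p_yes + p_no                            # expected agreement by chance
    return (po - pe) / (1 - pe)

# Toy flattened presence/absence rasters: seascape class vs. high abundance.
seascape  = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
abundance = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])
print(kappa(seascape, abundance))   # 1 = perfect, 0 = chance level, < 0 = worse than chance
```

A kappa below zero, as in the post's result, means the two presence/absence layers agreed less often than two unrelated layers with the same marginal rates would.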

Significance

The confusion matrix ended up being a bit of a failure, as both the seascape information and interpolated information had to be turned into presence-absence data for the matrix to be calculated. This eliminated some of the fine-tuned resolution associated with the interpolation and ultimately led to a Kappa score that suggested less agreement than would be expected by chance. While this specific result is not important to science or managers, these methods can be refined to improve the agreement between the two sources, ultimately providing a way to statistically link physical ocean conditions and rockfish abundances in California.

For these reasons, I consider this project to be an excellent pilot study for these data – one that introduces me to the types of analyses available to researchers who work with these types of data and their strengths and shortcomings.

Learning Outcomes

Software and Statistics

This class gave me an opportunity to brush up on my ArcMap and ModelBuilder skills, which I used to be very proficient in but had let lapse. I’ve also had the opportunity to explore some analyses through the lenses of R and Python, though I did not end up executing any parts of my project in these programs. The most important statistical learning outcomes for me had to do with interpolation and confusion matrices. The first exercise taught me about the different ways point data can be interpolated and the consequences that interpolation choice can have for results. I knew of the different types of hotspot analysis available in ArcGIS, but was not aware of their differences. The confusion matrix introduced me to an entirely new way to connect independent modeling methods. Methodologically, I learned about the importance of recording methods as you execute processes and about the benefits of naming files sequentially. Had I not been organized, I would have had a hard time keeping track of the different processes executed and files used during each step.

# Exercise 3: Poisson regression analysis of Texas rural index scores and colorectal cancer mortality counts

Research Question

What is the association between Texas county rurality index score and colorectal cancer (CRC) mortality?

In Exercise 1, I created a rurality index for Texas counties using rural indicator variables, while in Exercise 2, I selected Texas counties with the highest CRC death counts and created multi-ring buffers around them to visualize and measure how rural indicator variables shift as one moves away from those counties. In this exercise, I am doing a more direct analysis of CRC mortality and my rurality index by utilizing a Poisson regression model. I expect the results to show that as index scores become more “rural,” county CRC death counts will increase.

Tools and Data Sources Used

The analysis, diagnostic, and graphing procedures for this exercise were all completed using the following R functions/packages: glm(), vcd, AER, and car. More specifically, a generalized linear model (GLM) of the Poisson family was utilized for modeling, a rootogram from the vcd package was used to assess Poisson goodness-of-fit, a test for model over-dispersion was utilized from the AER package, and an influence plot was created using the car package. Other goodness-of-fit diagnostics and statistical measures were ascertained via the base glm() procedure.

The rural indicators utilized for the index in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census data), Texas county population density (2010 Census data), and rasterized discrete land use data for the state of Texas (2011 National Land Cover Database). As in Exercise 2, aggregated CRC mortality counts per 100,000 population for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Conversion to R: First, the attribute table of CRC death counts and rural index scores by county (created as part of Exercise 2) was exported to an Excel file using the Table to Excel tool in Arc. The Excel file was then loaded into R using the “readxl” package.

Univariate Poisson Diagnostics: Before introducing a Poisson regression, I wanted to confirm that the outcome variable follows a Poisson distribution. To do this, I used the rootogram() function from the vcd package in R to visualize the distribution of the CRC mortality count data. The figure below shows that the data roughly follow a Poisson distribution. Based on this, I proceeded to fit the model.
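
The quantity a hanging rootogram displays can be sketched in pure Python. This is only an illustration of what vcd’s rootogram() compares (square roots of observed vs. Poisson-expected frequencies), not the package itself, and the toy counts are hypothetical:

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    # P(X = k) for a Poisson distribution with mean lam
    return math.exp(-lam) * lam ** k / math.factorial(k)

def rootogram_values(counts):
    """Square roots of observed vs. Poisson-expected frequencies,
    the quantities a hanging rootogram plots."""
    n = len(counts)
    lam = sum(counts) / n  # maximum-likelihood estimate of the Poisson mean
    observed = Counter(counts)
    rows = []
    for k in range(max(counts) + 1):
        expected = n * poisson_pmf(k, lam)
        rows.append((k, math.sqrt(observed.get(k, 0)), math.sqrt(expected)))
    return rows

# toy counts standing in for the county CRC death counts
toy = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
for k, sqrt_obs, sqrt_exp in rootogram_values(toy):
    print(k, round(sqrt_obs, 2), round(sqrt_exp, 2))
```

If the square-root observed and expected bars track each other across k, a Poisson model is plausible.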

Model Fitting: I fit a GLM of the Poisson family using the following formula in R: glm(formula = CRCDeathrate ~ PCAWeighted_Index, family = "poisson"). Initially, I attempted to construct a linear OLS regression with this data, but after many attempted transformations, I realized the outcome count data was likely following a Poisson distribution. In subsequent steps, I walk through the post-fitting Poisson diagnostics that assess the appropriateness of this fit.
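
For intuition, the kind of fit glm() performs can be sketched from scratch. The following is a minimal single-predictor Newton-Raphson sketch in Python, not the model used in this exercise, and the data are synthetic:

```python
import math

def fit_poisson_glm(x, y, iters=25):
    """Newton-Raphson fit of log(mu) = b0 + b1 * x for Poisson counts y,
    a from-scratch sketch of the kind of fit R's glm() performs."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # score vector (gradient of the log-likelihood)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix entries
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta += I^(-1) * score
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (-i01 * g0 + i00 * g1) / det
    return b0, b1

# synthetic counts where the mean doubles with each unit of x
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16, 32]
b0, b1 = fit_poisson_glm(x, y)
print(round(b1, 3))  # b1 should approach log(2) ≈ 0.693
```

Because the log link is canonical for the Poisson family, the log-likelihood is concave and Newton’s method converges quickly from a reasonable start.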

Poisson Regression Diagnostics: In Poisson regressions, one of the main concerns is over-dispersion of the data, as Poisson distributions have only one free parameter and do not allow the variance to be adjusted independently of the mean. If over-dispersion exists, the data have more variation than the Poisson distribution allows, biasing the variance of the model. To ensure the Poisson regression is not over-dispersed, I utilized the dispersiontest() function from the AER package. The function indicates equi-dispersion when alpha = 0. The alpha value of -0.20 and p = 0.99 displayed below show that the model is not over-dispersed.
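
A simpler stand-in for the idea behind the dispersion check is the Pearson chi-square statistic divided by its degrees of freedom. This is not AER’s dispersiontest() implementation, and the counts and fitted means below are toy values:

```python
def dispersion_ratio(y, mu):
    """Pearson chi-square divided by residual degrees of freedom.
    Values near 1 suggest equi-dispersion; values well above 1
    suggest over-dispersion."""
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    df = len(y) - 2  # two fitted parameters: intercept and slope
    return pearson / df

# toy observed counts and fitted means (illustrative only)
y = [1, 3, 2, 4, 3, 5]
mu = [2, 3, 2, 4, 3, 4]
print(round(dispersion_ratio(y, mu), 4))  # 0.1875
```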

Further evidence that the Poisson model suits the data comes from an influence plot from the car package, in which only a few data points (101, 57) are highly influential (shown below).

I also used a chi-squared test to check the fit of the Poisson model. The resulting p-value (0.95) gives no evidence of lack of fit, supporting the Poisson model (shown below).

Exporting to Arc for Mapping: After creating the model, I computed the standardized residuals of the model in R for each county using the studres() function. I then exported these data to Arc and joined them to the attribute table I mentioned previously in this exercise. Finally, I used symbology to represent the residuals on the Texas map and created a layout.

Results

Because Poisson regression of counts uses a log link, the resulting coefficient(s) need to be exponentiated. The exponentiated coefficient for this regression shows that as county index scores increase (become more urban) by 1 unit, the expected CRC death count is multiplied by 0.99. In other words, for every unit increase in rurality, county CRC death counts increase by about 1% (p<0.001). The narrow confidence intervals for this estimate indicate that the coefficient is precisely estimated.
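
The exponentiation step can be illustrated with a short snippet. The coefficient value here is hypothetical, chosen only to reproduce the 0.99 rate ratio described above:

```python
import math

# hypothetical log-scale coefficient on the rurality index
# (illustrative, not the fitted value from this model)
beta = -0.01005
rate_ratio = math.exp(beta)        # multiplicative change in the expected count
print(round(rate_ratio, 3))        # 0.99: 1-unit more urban -> 1% fewer deaths
print(round(1 / rate_ratio, 3))    # 1.01: 1-unit more rural -> ~1% more deaths
```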

The map of standardized residuals can be seen above and indicates that there is significant regionality to the association between index scores and CRC death counts, where central Texas is much more green (negative standardized residuals) and counties near the edges are orange and red (positive standardized residuals). Also, as expected, the high CRC death count Texas counties from Exercise 2 (Anderson, Howard, Gonzales, and Newton) have standardized residuals significantly higher than expected in the Poisson model.

Critique

This process of regression construction was very helpful for determining how best to model my data. I initially attempted a linear regression, and after laboring to find a transformation for my outcome data, I finally settled on Poisson regression (count data, duh!). Based on the model diagnostics I performed, the data appear well suited to Poisson regression. In my opinion, the most difficult part of this exercise was interpreting the coefficients of the regression. Because the coefficients are on the log scale, they have to be exponentiated, which greatly confused my interpretation at first. Because an extensive amount of county CRC death data is missing from this analysis (for confidentiality), these results may be less indicative of the true rurality-CRC mortality association in Texas. Further, this analysis suffers from the modifiable areal unit problem, as counties may not be an appropriate boundary for comparison. In future analyses, I would like to use more complete and spatially defined data from the state cancer registry to shed more light on the relationship.

# Multi-Distance Spatial Cluster Analysis for Woodpecker Nests (Ripley’s K)

Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Questions

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

• How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
• How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

Tool

Multi-Distance Spatial Cluster Analysis (Ripley’s K) in R

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point-data clustering over a range of distances, indicating how spatial clustering or dispersion changes with neighborhood size. For each evaluation distance (the user can specify the distances to evaluate, a starting distance, or a distance increment), Ripley’s K counts the neighboring features within that distance of each point. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors at a particular evaluation distance is higher than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.

Referenced from ArcGIS Desktop 9.3 Help (http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/spatial_statistics_tools/how_multi_distance_spatial_cluster_analysis_colon_ripley_s_k_function_spatial_statistics_works.htm)
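
The quantity being estimated can be sketched in a few lines. This is a naive estimator with no edge correction, not the ArcGIS or spatstat implementation, and the points are invented:

```python
import math

def ripleys_k(points, d, area):
    """Naive Ripley's K estimate (no edge correction):
    K(d) = area / n^2 * number of ordered pairs closer than d."""
    n = len(points)
    pairs = sum(
        1
        for i, (xi, yi) in enumerate(points)
        for j, (xj, yj) in enumerate(points)
        if i != j and math.hypot(xi - xj, yi - yj) <= d
    )
    return area * pairs / (n * n)

# two tight pairs in a unit square; under complete spatial
# randomness K(d) would be close to pi * d^2
pts = [(0.1, 0.1), (0.15, 0.12), (0.8, 0.9), (0.82, 0.88)]
print(ripleys_k(pts, 0.1, area=1.0))  # 0.25
```

Here K(0.1) = 0.25 far exceeds pi * 0.1^2 ≈ 0.031, flagging clustering at that distance.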

Data Description

My polygon dataset includes 10 survey units (6 treatment, 4 control) totaling over 7,000 ac. Each survey unit is between 397 and 1,154 ac. 34 salvage logging units are dispersed throughout the 6 treatment units. The salvage units are characterized by three silvicultural prescriptions replicating preferred postfire habitat for the three target woodpecker species: black-backed, white-headed, and Lewis’s woodpeckers. The 4 control survey units and the unlogged landscape in the 6 treatment units act as the undisturbed control habitat.

In 2016 and 2017, belt transects with corresponding avian point counts were surveyed in each of the 10 survey units. These surveys detected woodpecker occupancy and nest locations based on audio playback and nest searching methodology. Woodpecker nest datasets were developed from these surveys for pre-salvage (2016, n = 71) and post-salvage (2017, n = 77) conditions. I wanted to run Ripley’s K on the two datasets separately to determine differences in nest clustering before and after the salvage treatments. In the future, with successive datasets collected in 2018 and 2019, I can analyze clustering trends up to three years after salvage logging. I will also integrate 3D forest structure variables from pre- and post-salvage lidar datasets.

A subset of the 2016 and 2017 nest datasets picturing clockwise from top right: East Fork Canyon Creek (Ctrl), Wall Creek (Ctrl), Alder Gulch (Trt), Lower Fawn Springs (Trt), Upper Fawn Springs (Trt).

By using Exercise 1 to determine how nests are clustered within the study units, I can use this clustering to inform my Exercises 2 and 3, which should reveal how the clustering relates to the salvage harvest units.

Ripley’s K Steps

1. Export nest points as their own shapefile. My preliminary point dataset includes both nest points and vegetation survey points, so I needed to isolate only nest points. This will indicate how nest points cluster within study units.
2. Further export nest points by survey unit, isolating the nests in each survey unit as their own shapefiles. Running Ripley’s K on the 2016 and 2017 nests as a whole would not be useful, since clustering across the entire nest dataset simply reflects the 10 survey units selected for this study. Ripley’s K would report strong clustering only because the surveys intentionally restricted nest detection to these areas instead of the entire Canyon Creek fire complex. I ran Ripley’s K on all the 2016 and 2017 nests and the output bore this out, indicating statistically significant clustering at smaller distances:

3. Use R to run Ripley’s K on each survey unit’s 2016 nest shapefile.
4. Use R to run Ripley’s K on each survey unit’s 2017 nest shapefile.

Below is the example script for Ripley’s K using the 2016 Alder Gulch survey unit. For 2017, I changed the variable names accordingly:

library(spatstat)  # Kest and envelope functions
library(rgdal)     # readOGR for shapefiles
library(maptools)  # coercion from sp objects to spatstat ppp
library(ggplot2)
library(sp)
library(nlme)
library(rpart)
library(raster)
setwd("N:/AIS_Blue/Woodpeckers")
getwd()
# Read the Alder Gulch 2016 nest points and convert to a point pattern
AGnests2016 <- readOGR("./data", layer = "AG_2016_nests")
plot(AGnests2016)
class(AGnests2016)
AGnests2016.ppp <- as(AGnests2016, "ppp")
# Ripley's K with a simulation envelope from n CSR simulations
n <- 100
AGnests2016RK <- envelope(AGnests2016.ppp, fun = Kest, nsim = n, verbose = F)
plot(AGnests2016RK)

I ran the tool with either 100 or 1000 iterations. The difference is that the simulation envelope (grey area in the output graphs) widened with 1000 iterations, shown below for the Big Canyon survey unit:

Below are the Ripley’s K results for each survey unit in 2016 and 2017 over 100 iterations. The accompanying visual plots display nest point distribution in each survey unit. Some survey units did not have large enough sample sizes for the tool to function correctly. Notable results from a visual analysis of each graph include nest clustering in Big Canyon (Treatment 1) and Crawford Gulch (Control) in 2016 and Lower Fawn Creek (Treatment 2) in 2017.

Alder Gulch 2016 (n = 5)

Alder Gulch 2017 (n = 7)

______________________________________________________________________________________________________________________________________

Big Canyon 2016 (n = 13)

Big Canyon 2017 (n = 10)

______________________________________________________________________________________________________________________________________

Crazy Creek 2016 (n = 10)

Crazy Creek 2017 (n = 11)

______________________________________________________________________________________________________________________________________

Crawford Gulch 2016 (n = 12)

Crawford Gulch 2017 (n = 8)

______________________________________________________________________________________________________________________________________

East Fork Canyon Creek 2016 (n = 3); Not surveyed in 2017

______________________________________________________________________________________________________________________________________

Lower Fawn Creek 2016 (n = 8)

Lower Fawn Creek 2017 (n = 14)

______________________________________________________________________________________________________________________________________

Overholt Creek 2017 (n = 4); Not surveyed in 2016

______________________________________________________________________________________________________________________________________

Sloan Gulch 2016 (n = 7)

Sloan Gulch 2017 (n = 7)

______________________________________________________________________________________________________________________________________

Upper Fawn Creek 2016 (n = 4)

Upper Fawn Creek 2017 (n = 5)

______________________________________________________________________________________________________________________________________

Wall Creek 2016 (n = 9)

Wall Creek 2017 (n = 11); Cannot import figures due to unresolved maximum file size error.

Problems/Critique

Because of the small sample size for each of the survey units (as few as 4 nests in some years/units), my Ripley’s K output graphs appeared blocky instead of continuous. Notice how continuous the graphs appeared when I ran all 2016 nests. Overall this analysis did not perform well with these datasets. I am also unclear how edge effects were factored into this particular analysis. There may be more parameters I could define in the code when running this analysis in R. For example, I did not manually specify distances in the R code.

Ripley’s K requires (X,Y) coordinates for each point location. I first tried to perform this analysis in ArcMap. However, my woodpecker nest point shapefiles contain UTM coordinates divided into two separate fields for northing and easting. This caused a problem when the Ripley’s K tool in ArcMap asked for the dependent variable and I could only select one field, northing or easting. Therefore, I needed to run the Ripley’s K tool in RStudio instead.

The above Step 2 explains another issue with trying to analyze all nests at once, but I resolved it by isolating each survey unit individually. Nest dataset analyses must also address a temporal component related to the salvage logging: 2016 nests must be analyzed separately from 2017-2019 nests, but 2017-2019 nests can be analyzed either by year or as a group.

# Exercise 3: Characterizing the spatial pattern of confusion between pixels classified by global settlement datasets and OSM refugee settlements.

In Exercise 3, I was asking how the different classification methods that I’m using align or do not align with each other — essentially, creating a confusion matrix to identify the pixels that are True Positives, False Positives, and False Negatives when comparing two of the classifications. That is — what is the spatial pattern of confusion between pixels classified by global settlement datasets and an unsupervised binary classification of pixels within an OSM boundary?

Approach that I used:

I used Earth Engine to derive the confusion categories among pixels. I did not literally build a matrix; instead, I used the concept of a confusion matrix to calculate a category value for each pixel.

Steps I followed to complete the analysis:

In order to do this, I needed to compare “Facebook” and “K-means” as well as “WSF” and “K-means.” In this case, the “K-means” classification was my “truth” classification, because I needed something to measure the other classifications against that was not just a summary of pixels inside of a vector but a more traditional binary unsupervised classification. Thus, I used the settlements to guide a K-means clustering of similar pixels, creating clusters of “settlement-like” pixels. I then used raster math to combine the three binary layers so that each confusion category received a unique value: positive K-means pixels were weighted 10, positive WSF pixels 1, and positive Facebook pixels 4 (a label of 1 multiplied by 4 for further additive properties), and the three layers were summed. Ultimately, this meant that pixels that were False Negative in either or both comparisons had pixel values of 10, 11, and 14, pixels that were False Positive in either or both comparisons had pixel values of 1, 4, and 5, and a value of 15 represented where all three datasets registered as True Positive.
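
Under the weighting described above (K-means 10, WSF 1, Facebook 4), the encoding reduces to a per-pixel weighted sum. The function and pixel values below are an illustrative sketch, not Earth Engine code:

```python
def confusion_code(kmeans, wsf, facebook):
    """Weighted overlay of three binary settlement rasters (0/1 per pixel),
    assuming K-means positives are weighted 10, WSF 1, and Facebook 4."""
    return 10 * kmeans + 1 * wsf + 4 * facebook

labels = {
    0: "negative in all three datasets",
    1: "false positive (WSF only)",
    4: "false positive (Facebook only)",
    5: "false positive (WSF and Facebook)",
    10: "false negative (K-means only)",
    11: "K-means and WSF positive",
    14: "K-means and Facebook positive",
    15: "true positive in all three",
}

# one hypothetical pixel per combination
for k in (0, 1):
    for w in (0, 1):
        for f in (0, 1):
            code = confusion_code(k, w, f)
            print((k, w, f), code, labels[code])
```

The weights are chosen so every combination of the three layers maps to a distinct value, which is what makes the single summed raster interpretable.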

Brief description of results I obtained:

I was interested in both False Positives and False Negatives, but for different reasons. With False Positives, I wanted to visualize where the global settlement datasets DID detect settlements while the K-means did not. False Negatives show where K-means picked up settlement pixels but neither of the global datasets did. Ultimately, False Negatives were much more prevalent throughout the dataset, further illustrating the exclusion of refugee settlements from global datasets. I chose to display a map of an interesting pattern that appeared over a larger settlement near a large body of water: because the binary clustering grouped water with “settlement”-type landscapes, this area shows up as a large False Negative. If I were to calculate statistics, it would make sense to exclude the water area to avoid biasing the results.

False Positives and False Negatives between Refugee Settlement Binary Clusters and Facebook Classification and World Settlement Footprint Classification

Critique of the method – what was useful, what was not?

This exercise probably caused the most trouble: methods that I thought made sense and would work presented more challenges and fewer patterns than I was hoping for. There wasn’t really the spatial pattern I was expecting, perhaps because the data are so noisy and cover discontinuous areas of the landscape. It was useful to dig into the overlapping data, but ultimately the actual spatial patterns were not very remarkable. Perhaps the numbers presented in a matrix would be a more useful representation of the data or patterns for this specific method. Aside from the spatial statistics, I ran into quite a few function issues in Earth Engine, Pro, and ArcGIS Desktop, all three of which I used to try to manipulate this data in the ways I was imagining. For my final project, I may need to revisit these methods or seek advice from others, because my techniques were not as effective as I’d hoped.

# Supervised Image classification on forested stands

Could I identify functional tree species with supervised image classification in my stands?

The reason I asked this question was that if I had to do a geographically weighted regression again, it would be valuable to have deciduous or coniferous tree species in my point data as an added variable.

Name of the tool or approach that you used.

The main tool that I used for image classification was the Maximum Likelihood Classification in ArcMap. I also used the Create Accuracy Assessment Points tool to help create a confusion matrix in Excel.

Brief description of steps you followed to complete the analysis.

I downloaded 2016 NAIP imagery in my area of interest and used the high-resolution imagery to create a false color image with the bands arranged as NIR, Red, and Green. To help delineate broad-leaf vegetation from coniferous vegetation, I applied a histogram equalize stretch that enhanced my ability to identify conifers in the landscape. From there I created a maximum likelihood classification by drawing training data polygons on the false color imagery, using the Esri digital imagery basemap as a reference image.

Once the image classification was complete, I used the Create Accuracy Assessment Points tool on my stand and then extracted the raster values from the thematic map output to those points to create a confusion matrix in Excel. I clipped the thematic map raster to the watershed polygons I made during the individual tree segmentation to show which pixel classifications were assigned to my tree tops.

Brief description of results you obtained.

The thematic map output was 83% accurate overall, with the conifer and developed land covers performing worst in the model. Developed land cover is generally difficult to model in a landscape, and the variability in urban spectral reflectance leads to classification errors. The conifer land cover performed poorly because of my trouble producing accurate training data at this imagery resolution, and because the model had trouble delineating conifers from grass and deciduous vegetation. Errors of commission (65% accuracy) and errors of omission (75% accuracy) led to the lower accuracy of the conifer land cover (Table 1). Despite these errors, the thematic map output performed well, and the land cover pixels in my stands showed that conifer trees were accurately assigned in the Saddleback stand (Figure 2). For the Baker Creek stand, the large amount of shadow, sun glare on canopies, and the classification cutoff led to a poor classification of that stand.
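
The accuracy figures quoted here fall out of a confusion matrix. Below is a minimal sketch of how overall accuracy, producer’s accuracy (1 − omission error), and user’s accuracy (1 − commission error) are computed; the counts are made up for illustration and are not the Table 1 values:

```python
def accuracy_metrics(matrix, classes):
    """Overall, producer's, and user's accuracy from a confusion matrix
    with rows = reference (truth) and columns = classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    overall = correct / total
    producers = {}   # 1 - omission error, per class
    users = {}       # 1 - commission error, per class
    for i, c in enumerate(classes):
        row_sum = sum(matrix[i])                   # reference totals
        col_sum = sum(row[i] for row in matrix)    # classified totals
        producers[c] = matrix[i][i] / row_sum
        users[c] = matrix[i][i] / col_sum
    return overall, producers, users

# toy 2-class example (illustrative counts only)
classes = ["conifer", "other"]
matrix = [[15, 5],    # reference conifer pixels
          [3, 77]]    # reference other pixels
overall, prod, user = accuracy_metrics(matrix, classes)
print(round(overall, 2), round(prod["conifer"], 2), round(user["conifer"], 2))  # 0.92 0.75 0.83
```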

Figure 1. The land cover thematic map for the entire NAIP image. The cyan blue color indicates the locations of Saddleback and Baker Creek stands.

Figure 2. The land cover classification output for the Saddleback stand.

Figure 3. The land cover classified output for Baker Creek Stand. Note that the NAIP imagery that was classified did not extend to cover the entire stand. The tree crown polygons were laid below the output to show where the land cover cuts off.

Table 1. Confusion matrix for the thematic map output.

Critique of the method – what was useful, what was not?

Some critiques about this process: it was time consuming to create training data detailed enough to capture the variation in the scene for my desired accuracy. Sources of error in the thematic map include shadows, resolution, and variable spectral response signatures in the remotely sensed vegetation. Shadows occluded trees that would otherwise stand out, and distorted the classification enough that I had to add a shadow land cover class to mask them out of the scene. The resolution issue means that NAIP imagery was not detailed enough for my application; imagery taken from unmanned aerial drones may be a potential avenue for acquiring a higher-resolution dataset. The confusion matrix highlights this issue, with an omission error of 65% for conifers and a 75% commission error. It was difficult to identify conifer trees accurately in the training data because of the variability of the spectral reflectance and the blurred crowns at 1 m resolution.

Since I only did a classification, I didn’t attempt to assign tree functional species to my tree polygons. The process that comes to mind is to visually determine which classification color is more pronounced in a tree top and then place that species in the point data as an attribute. This would be highly time consuming, and developing a methodology to streamline the classification of functional tree species to my tree points is potential future work. Overall, the thematic map outputs are useful for areas like the Saddleback stand that have less shadow and distortion, and less useful for areas with high distortion like my Baker Creek stand.

# Watershed recession behavior as a function of watershed environmental variables

Question asked: How is the spatial pattern of watershed recession coefficients related to the spatial patterns of watershed elevation, soils, geology, basin area, and precipitation?

Name of the tool or approach used: To perform this analysis, I used recession analysis to quantify low-flow metrics for 12 streams and rivers in the Oregon Coast Range. I classified the a and b recession coefficients using two different approaches in ArcGIS: equal interval and natural jenks. The resulting classifications are quite different.

Methods of procedure:

1. Recession Analysis: the methods employed for the recession analysis are described in Exercise 1. For Exercise 2, I calculated recession coefficients for an additional seven sites in the Oregon Coast Range for a total of nine sites. Data for the present analysis was analyzed on the hourly timestep.
2. Watershed Delineation: I delineated watersheds in ArcGIS based on each gaged pour point.
3. Spatial data: I downloaded relevant spatial data from various geospatial databases including Oregon Geospatial Enterprise Office, USGS, USDA, and OSU PRISM Climate Group.
4. Data Visualization and Classification: I used ArcGIS to visualize my spatial data and to classify the recession analysis results. I used two classification methods: equal interval and natural jenks. Both methods had 5 classes.
1. Equal interval classification: class breaks are calculated by dividing the range of the data by the designated number of classes.
2. Natural jenks classification: this is an optimization method of classification that minimizes variance within classes and maximizes variance between classes.
5. I analyzed the results of my classification by visual comparison: comparing the two classification methods against each other, as well as comparing the results of each classification method to the environmental variables across the watersheds.
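
The equal-interval step above can be sketched directly (natural jenks requires an optimization over within-class variance and is omitted here). The coefficient values below are hypothetical:

```python
def equal_interval_breaks(values, n_classes=5):
    """Upper class breaks for equal-interval classification:
    the data range divided into n_classes equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * k for k in range(1, n_classes + 1)]

# hypothetical recession-coefficient values for nine basins
coeffs = [0.02, 0.05, 0.07, 0.11, 0.13, 0.18, 0.22, 0.29, 0.42]
print([round(b, 3) for b in equal_interval_breaks(coeffs)])  # [0.1, 0.18, 0.26, 0.34, 0.42]
```

Because equal interval ignores how the values cluster, a single outlying basin can leave most basins in one or two classes, which is exactly why the two methods produce such different maps.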

Results:

Results from this analysis demonstrate that the recession coefficient classification method influences the apparent spatial variability of recession metrics. The natural jenks method results in more spatial heterogeneity across the latitudinal gradient (Figure 1). The a and b recession coefficients seem to be related in the way that they vary across space; however, coefficient b is slightly more variable than coefficient a. The equal interval classification method results in a more homogenous representation of both a and b coefficients (Figure 2). Coefficient a seems to be largely controlled by basin size using this method, as the smallest basin falls in the class with the highest values. Coefficient b is more variable, and also seems to be controlled by basin area. Precipitation may also be a primary control. Overall, between the two classification methods, the Nehalem watershed (furthest north) and the Chetco watershed (furthest south) seem to demonstrate similar recession coefficients. The watersheds between the Chetco and the Nehalem, in general, demonstrate similar recession behavior.

Figure 1. Classification of recession coefficients using the natural jenks method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Figure 2. Classification of recession coefficients using the equal interval method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Critique of method:

This was a simple method of visual comparison across space, however it was quite useful for both: 1) considering the spatial patterns of watershed recession behavior, and 2) comparing classification methods and how they influence the outcome of the analysis. Because it is just a visual comparison, there are no quantifiable differences presented here, which will be important moving forward. Additionally, this was an important exercise to understand the mechanical steps necessary for making this comparison.

# Monitoring Postfire Salvage Logging Effects on Woodpecker Population Dynamics

Research Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

Dataset

This project will use datasets targeting the 2015 Canyon Creek fire complex on the Malheur National Forest in eastern Oregon. A salvage logging operation occurred in the burn area in July 2016. My research is in cooperation with a Rocky Mountain Research Station study examining salvage logging effects on three woodpecker species in the Canyon Creek complex. In 2016 and 2017 I led crews on this project collecting extensive postfire woodpecker occupancy and nest productivity datasets for black-backed, white-headed, and Lewis’s woodpecker populations. This resulted in a 148-nest dataset for 2016 and 2017, representing woodpecker populations before and after salvage logging. A polygon shapefile outlining ten RMRS woodpecker point count survey units serves as the area of interest (6 treatment, 4 control). Within the 6 treatment units, another polygon shapefile outlining 34 salvage harvest units indicates treatment areas. Three silvicultural prescriptions replicating optimal habitat types for each woodpecker species designate salvage variables like post-harvest stand density and diameter. Each salvage unit adheres to one of these three harvest prescriptions. 2016 pre-salvage and 2017 post-salvage lidar datasets are in processing for eventual correlation between 3D forest structure variables and woodpecker nest site selection before and after harvest. Supplementary geospatial data includes a 2015 WorldView-3 1 m raster and ArcGIS basemaps.

Above: The 2015 Canyon Creek Fire burning near John Day, OR.

Above: The Canyon Creek fire complex as a false color WorldView-3 1 m raster. The area of interest includes 10 study units in blue, labeled with yellow text (6 treatment, 4 control). This visual orients the survey units to an area in eastern Oregon southeast of John Day and Canyon City. The false color image displays healthy vegetation as red, with the darkest areas displaying high burn severity. The survey units are found within some of the highest burn severity areas in the fire complex.

Above: A close-up of the 34 salvage treatment polygons outlined in red and labeled with white text. Control units lack red salvage polygons. This image does not include Overholt Creek.

Above: A subset of the 2016 and 2017 nest points featuring survey and salvage unit polygons.

Hypotheses

I expect to see dispersed nests in 2016 with possible trends indicating species habitat preferences. Previous research indicates species-specific preferences for certain forest habitat variables. Black-backed woodpeckers prefer dense, small-diameter stands for foraging and nest excavation. White-headed woodpeckers prefer a mosaic of live and dead variable density and diameter stands for pine cone foraging. Lewis’s woodpeckers prefer tall medium to large-diameter stands for aerial foraging maneuvers. I expect to see nest sites in both years clustered in areas with these forest structures. In 2017 I also expect to see nest sites clustered near salvage treatments implemented for each species. Overall I expect the control units to exhibit nest dispersal and high woodpecker activity.

Black-backed woodpecker (Picoides arcticus)     White-headed woodpecker (Picoides albolarvatus)                       Lewis’s woodpecker (Melanerpes lewis)

A graphic depicting 3 salvage harvest treatment types and a control designed to benefit each of the target woodpecker species (Dave Halemeier 2016).

Approaches

Analyses will consider pre- and post-salvage variables to determine changes in forest structure alongside woodpecker population dynamics. I would like to learn about spatial autocorrelation analyses such as Moran’s I. It is likely I will find patterns of dependent observations based on localized conditions. Woodpecker species presence and nest locations may be affected by burn severity, since highly weakened trees will host their primary food source, bark beetle larvae. In 2017 woodpecker species presence may be grouped according to salvage treatments targeting each species, or control areas. Regression analyses showing the relationship strength between nest distance from salvage units and salvage treatment types could indicate certain forest variables affecting postfire woodpecker colonization.
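
Since Moran’s I is mentioned above, a minimal sketch of the global statistic may be useful. The site values and binary neighbor weights below are invented for illustration, and this is not the ArcGIS implementation:

```python
def morans_i(values, weights):
    """Global Moran's I for a list of site values and a spatial weights
    matrix (weights[i][j] > 0 where sites i and j are neighbors)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# four sites on a line with rook neighbors; similar neighboring
# values yield positive spatial autocorrelation
vals = [1.0, 2.0, 8.0, 9.0]
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(round(morans_i(vals, w), 3))  # 0.4
```

Values near +1 indicate clustered (spatially dependent) observations, near 0 randomness, and negative values dispersion.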

Expected Outcome

Regression coefficients describing the relationship between woodpecker presence and salvage treatment location and type will help develop inferences towards postfire management effects. I will create interpretive maps of nest locations showing survey unit and salvage unit polygons. These maps could include statistical and geospatial relationships represented with different colors and symbols. Eventually, I will geovisualize the lidar data with these maps and statistical relationships for a comprehensive and communicative representation of woodpecker population and forest structure change dynamics.

Significance

I am processing two lidar datasets of the study area from 2016 and 2017. These datasets were acquired before and after the salvage logging treatments occurred. I will produce forest metrics such as stand density, diameter class, and height in salvage and survey units. I will then correlate geospatial and statistical properties of the nest dataset to quantified forest variables affecting woodpecker nest site selection. I will examine trends between 2016 and 2017 nest selection to understand the effects of harvest treatments on woodpecker populations. At least two more years of woodpecker data will exist for 2018 and 2019, so future research will add these datasets to the analyses. I would like to see a machine learning algorithm developed from this study that could predict areas of optimal habitat suitability for snag-dependent wildlife species. Postfire wildlife habitat prediction will be crucial to resource managers as wildfires increase in the coming decades alongside accelerated landscape restoration.

This spatial problem is important to science and resource managers as climate change amplifies wildfire effects. Using 3D remote sensing datasets for resource management is the future trend across all disciplines. Increased wildfire activity around the world necessitates cutting-edge methods for active fire and postfire ecosystem quantification. In the Blue Mountains ecoregion in eastern Oregon, Rocky Mountain Research Station, Malheur National Forest, and Blue Mountains Forest Partners rely on this project’s lidar quantification for their research and management decisions. Determining acceptable levels of salvage harvest for wildlife habitat affects whether government agencies and rural economies in this region will allow small businesses to profit from sustainable harvest operations. Science overall will benefit from the continued investigation of wildlife response to anthropogenic disturbance, specifically postfire forest management decisions and the controversial practice of salvage logging.

Above: A salvage treatment in the Crazy Creek woodpecker survey unit.

Preparation

I took an ArcInfo class (ARC Macro Language) during my undergraduate program. I am currently taking a Python class for geospatial programming. I have experience in image processing through an undergraduate degree in land use and GIS, multiple years of professional experience in remote sensing and GIS, and my current MS program in Forest Geomatics. I have academic and professional experience with R, C++, ArcGIS, and multiple types of remote sensing software for 2D and 3D data analysis.

# Landscape Patterns as Predictors of Tree Height

Question: Which landscape features correspond to clusters of greater than expected tree heights?

Methods: I performed two hot spot analyses in ArcMap: one on tree height and another on distance between trees. Both were constrained to the reserved control areas of the HJ Andrews Forest. Hot spots in tree height are regions of greater than expected tree height, while hot spots in distance between trees are regions of greater than expected distance between individual trees (more dispersed trees). Hot spots and cold spots from the two analyses generally overlapped. However, they did not overlap in all cases, so I wanted to know what landscape features might explain the difference. Covariates I explored included slope, aspect, elevation, and landform. Since the end goal is to find landscape features that may correlate with the amount of soil carbon, I conducted this analysis under the assumption that taller trees may correlate with regions of greater soil carbon. I used the ‘caret’ package in R to calculate a confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3, -2, -1, 0, 1, 2, 3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope, and elevation between the four combinations of the extreme hot and cold spots (Table 1).
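The confusion-matrix step amounts to cross-tabulating the paired hot spot bins for each cell. A minimal Python sketch with invented binned Z-scores (the original analysis used R’s ‘caret’ package):

```python
from collections import Counter

def confusion_counts(height_bins, dist_bins):
    """Cross-tabulate paired hot spot bins; each (height, distance) pair
    is one cell of a confusion matrix between the two analyses."""
    return Counter(zip(height_bins, dist_bins))

# Hypothetical binned Z-scores for a handful of raster cells.
h = [3, 3, -3, 0, 3, -3]
d = [3, 3, -3, 1, -3, -3]
table = confusion_counts(h, d)
# Diagonal entries (e.g. (3,3)) are cells where height and distance agree;
# off-diagonal entries (e.g. (3,-3)) are the disagreements of interest here.
print(table[(3, 3)], table[(-3, -3)], table[(3, -3)])
```

Cells falling off the diagonal are exactly the places where tall trees and tree spacing disagree, which is what motivated the covariate comparison.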

Results: Regions of taller than expected trees often correspond to regions of greater than expected distances between trees, which agrees with current forest growth models (Fig. 1). Hot spots of tall trees are typically in valleys and cold spots are commonly on ridges (Fig 3 & 4). When we zoom in to the Lookout Mountain area of HJ Andrews, we see that hot spots of tall trees are more concentrated in valleys and on footslopes, and cold spots are closer to mountain ridges (Fig 3). When compared with the distance hot spot map of the same area, we see that cold spots go much further down the footslopes and even into the valleys in some cases (Fig 4). So although we have evidence for a strongly linear relationship between height and distance between trees, we also have evidence that they do not fully explain each other and other landscape features are likely at play.

Fig 1. Distance Z-scores vs. Height Z-scores from hot spot analyses show a linear relationship.

Fig 2. HJ Andrews elevation with reserved control areas in orange and

inset area of Lookout Mountain hot spot maps (below)

Fig 3. Hot Spot Analysis showing hot spots of tree heights (tallest trees)

in the Lookout Mountain area

Fig 4. Hot Spot Analysis showing the greatest distance between trees

in the Lookout Mountain area

An elevation band that correlates with occurrences of tall trees exists up to around 1100 m, after which the number of tall trees drops off substantially (Fig. 5). Certain aspects seem to correlate with taller trees, but those relationships are harder to tease apart and I have yet to fully explore them. Greater slopes tend to correlate with shorter trees, but the relationship is not linear: there is an interesting upward trend at slopes between 30 and 50 degrees that seems to correlate with slightly taller trees, then a large drop in mean height Z-score at slopes of 60 degrees.

Fig 5. Aspect, elevation and slope compared with Z-scores of mean height.

A comparison of Z-scores from the hot spot analyses of height and distance shows that although hot spots of height and distance tightly correlate, the covariates that explain them (mean slope and elevation) differ. When we compare the most extreme Z-scores to one another, slope, height, and distance between trees are not particularly different. Mean elevation is similar in three of the Z-score categories, but significantly lower in the fourth group (>3, >3). A next step is to map out these well-predicted and poorly predicted regions spatially.

Table 1. Comparison between the most extreme Z-scores of tree height and tree spacing.

| Height Z-Score | Distance Z-Score | Mean Height (m) | Height SD | Mean Distance (m) | Distance SD | Mean Slope (°) | Slope SD | Mean Elevation (m) | Elevation SD |
|---|---|---|---|---|---|---|---|---|---|
| < -3 | < -3 | 22.8 | 10.5 | 5.1 | 2.4 | 27.9 | 10.5 | 1285 | 294 |
| < -3 | > 3 | 24.5 | 11.2 | 5.6 | 2 | 26.9 | 4.5 | 1377 | 153 |
| > 3 | < -3 | 34.8 | 7.1 | 5 | 2 | 31.8 | 5.9 | 1310 | 44 |
| > 3 | > 3 | 39.5 | 16.7 | 4.7 | 2.6 | 26.2 | 11 | 934 | 188 |

Critique: These analyses are still based on hot spot analyses, so they come with the same criticisms as the previous hot spot analyses. One of those criticisms is that hot spot analysis is essentially a smoothing function: since the LiDAR dataset I’m using is effectively a census of tree heights, running hot spot analyses reduces the information in that dataset unnecessarily. I have yet to map out the well-predicted and poorly predicted regions spatially, so I cannot fully discuss the merits of the confusion matrix method.

# Exercise 3: Lagoons, ENSO Indices, and Dolphin Sightings

## 1. Question that you asked

I was looking for a pattern at more than one scale, specifically the relationship between ENSO and sighting distributions off of San Diego. I asked the question: do bottlenose dolphin sighting distributions change latitudinally with ENSO in relation to distance from the nearest lagoon? The greater San Diego area has six major lagoons that constitute the major estuarine habitat along the San Diego coastline and are all recognized as separate estuaries. All of these lagoons/estuaries sit at the mouths of broad river valleys along the 18 miles of coastline between Torrey Pines to the south and Oceanside to the north. The small boat survey transects cover this entire stretch with near-exact overlap from start to finish. These habitats are known to be highly dynamic, to experience variable environmental conditions, and to support a wide range of native vegetation and wildlife species.

Distribution of common bottlenose dolphin sightings in the San Diego study area along boat-based transects with the six major lagoons.

| FID | NAME |
|---|---|
| 0 | Buena Vista Lagoon |
| 1 | Agua Hedionda Lagoon |
| 2 | Batiquitos Lagoon |
| 3 | San Elijo Lagoon |
| 4 | Tijuana Estuary |
| 5 | Los Penasquitos Lagoon |
| 6 | San Dieguito Lagoon |

## 2. Name of the tool or approach that you used.

I utilized the “Near” tool in ArcMap 10.6, which calculates the distance from each point to the nearest polygon and associates the point with the FID of that polygon. I also used R Studio for basic analysis, graphic displays, and an ANOVA with Tukey HSD.

## 3. Brief description of steps you followed to complete the analysis.

1. I researched the San Diego GIS database for the layer that would be most helpful and found the lagoon shapefile.
2. Imported the shapefile into ArcMap where I already had my sightings, transect line, and 1000m buffered transect polygon.
3. I used the “Near” tool in the Analysis toolbox, part of the “Proximity” toolset. I chose the point-to-polygon option with my dolphin sightings as the point layer and the lagoons as the polygon layer.
4. I opened the attribute table for my dolphin sightings, which now contained NEAR_FID and NEAR_DIST fields representing the identification number of the nearest lagoon and the distance in kilometers to that lagoon, respectively.
5. I exported using the “conversion” tool to Excel and then imported into R studio for further analyses (ANOVA between the differences in sighting distances to lagoons and ENSO indices).
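As a rough illustration of what the Near step computes, here is a minimal Python sketch using straight-line distance to hypothetical lagoon centroids. Note the assumptions: the real tool measures to the nearest polygon edge, not a centroid, and the coordinates below are invented:

```python
import math

def nearest_feature(pt, centroids):
    """Return (index, distance) of the nearest centroid — a simplified
    stand-in for ArcMap's Near tool, which uses true polygon geometry."""
    dists = [math.dist(pt, c) for c in centroids]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]

# Hypothetical sighting and lagoon centroids in projected coordinates (m).
sighting = (0.0, 0.0)
lagoons = [(3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
idx, d = nearest_feature(sighting, lagoons)
print(idx, round(d, 3))  # index and distance of the closest lagoon
```

In the actual workflow these two values are what populate the NEAR_FID and NEAR_DIST columns for every sighting.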

## 4. Brief description of results you obtained

After a quick histogram in ArcMap, it was visually clear that the distribution of distances to the nearest lagoon appeared clustered, skewed, or bimodal, without considering ENSO. Then, after importing into R Studio, I created a box plot of the distance to the nearest lagoon grouped by ENSO index (-1, 0, or 1). I ran an ANOVA, which returned a very small p-value of 2.55e-9. Further analysis using a Tukey HSD found that the differences between the neutral (0) ENSO state and -1, and between neutral and 1, were significant, but the difference between 1 and -1 was not. These results are interesting because they mean dolphin sighting distances differ most during neutral ENSO years. It could be that certain lagoons are preferred during extremes compared to neutral years. Therefore, yes, there is a difference in dolphin sighting distances to lagoons during different ENSO phases, driven largely by the neutral years.

Histogram comparing the distance from the dolphin sighting to nearest lagoon in San Diego during the three major indices of El Niño Southern Oscillation (ENSO): -1, 0, and 1.

Violin plot showing the breakdown of distributions of dolphin sighting distances to lagoons (numbered 0-6) during the three different ENSO indices.
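The ANOVA itself was run in R, but the F statistic it reports can be sketched from first principles. The distance values below are invented for illustration only, not the study data:

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical distances (km) to the nearest lagoon by ENSO index.
enso_neg = [1.0, 1.2, 0.9]   # ENSO -1
enso_neu = [2.0, 2.2, 1.9]   # ENSO  0
enso_pos = [1.1, 1.3, 1.0]   # ENSO +1
print(round(anova_f([enso_neg, enso_neu, enso_pos]), 2))
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) corresponds to the small p-value reported above; the Tukey HSD step then identifies which pairs of groups drive it.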

## 5. Critique of the method – what was useful, what was not?

This method was incredibly helpful and, compared to my previous steps, the easiest to apply once I got started. It allowed me to both visualize and quantify interesting results. I also learned some tricks for better graphing my data and for symbolizing my data in ArcMap.

# Does the presence of Olympia oysters correspond with predicted suitable habitat?

Question explored
In my last blog post, I mapped habitat suitability for Olympia oysters in Yaquina Bay, OR by assessing three environmental parameters: salinity, substrate availability, and elevation. In exercise 2, I brought in oyster location data points collected from field surveys of the intertidal zone to compare against the map of suitable habitat. The question I am examining in this exercise is:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Field surveys of the intertidal zone of Yaquina Bay were conducted on April 19-20 and May 17, 2019 during low tides. Oysters were characterized as ‘present’ if evidence of at least one oyster (living or dead) was detected within the predefined search area.

Name of tool or approach
I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro for spatial analysis. Statistical information was reviewed and plotted in Excel.

Brief description of steps to complete the analysis

1. After validating the data point locations and performing some minor quality control in Google Earth, I saved the points as a KML file. In ArcGIS Pro, I used the geoprocessing tool ‘KML to layer’ to convert them for analysis. Once I added the data points onto the map as a layer, I edited the symbology to display the points as ‘Present’ or ‘Absent’.
2. To assess the neighboring habitat, defined by 4 class types from least suitable to most suitable, surrounding each of the data points, I used the ‘Multiple Ring Buffer’ tool to create 3 buffer rings around each of the points at distances of 75, 150, and 300 meters. For the results in this blog post, only the 300-meter buffer was used. I selected ‘overlapping (disks)’ in the dissolve option to assess the habitat around each data point individually.
3. Once the buffers were created, I used the ‘Zonal Statistics’ tool to overlay the buffered areas onto the raster of habitat suitability. This tool lets the user select the desired statistic (mean, median, etc.) to generate a spatial output. I chose ‘majority’, which categorizes each buffer zone by the dominant habitat suitability type within it. Majority also represented the median in this output. For example, if the majority of the habitat within the buffer area is class 4 (most suitable), then the buffer is displayed as ‘4’.
4. In addition to the ‘Zonal Statistics’ spatial output, I used the tool ‘Zonal Statistics as Table’ to generate a table of all the statistical information relevant to this analysis. The same input data is used (overlaying the buffered zones on the habitat suitability raster) to create this table.
5. I copied the table generated into Excel where I split up the data on a couple levels: 1) Presence vs. Absence and 2) North shore vs. South shore for comparison. The north and south shores are managed very differently: the north shore is largely composed of rip rap and steeper slopes because Yaquina Bay Road runs right along the edge, whereas the south shore is more natural and less developed. I created box and whisker plots of the majority habitat suitability type, minority, variety, and mean.
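The ‘majority’ zonal statistic amounts to a mode over the raster cells inside each buffer. A minimal Python sketch with a toy raster and buffer mask (the actual workflow ran in ArcGIS Pro; all values here are invented):

```python
from collections import Counter

def zonal_majority(raster, zone_mask):
    """Majority class among raster cells where zone_mask is True —
    a sketch of the 'Zonal Statistics' MAJORITY option for one buffer."""
    cells = [v for row_r, row_m in zip(raster, zone_mask)
             for v, m in zip(row_r, row_m) if m]
    return Counter(cells).most_common(1)[0][0]

# Hypothetical 3x3 habitat suitability raster (classes 1-4, 4 = most suitable)
# and a boolean mask marking cells inside one 300 m buffer.
habitat = [[4, 4, 1],
           [4, 3, 2],
           [2, 3, 3]]
buffer_mask = [[True, True, False],
               [True, True, False],
               [False, False, False]]
print(zonal_majority(habitat, buffer_mask))  # dominant class in the buffer
```

Running this per buffer, then per statistic (minority, variety, mean), reproduces the kind of table exported to Excel in step 5.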

Results

The results show some mixed information. Looking at the ‘Presence’ data points, the majority habitat types surrounding these points on both the north and south shores are 3 or 4, the most suitable classes. The minority habitat type differs markedly between the north and south shores, with the more natural south shoreline showing a stronger pattern: low-suitability habitat is found less often around presence data points. The means for presence data points generally correspond with greater habitat suitability.

However, the absence data points show that the majority habitat type tends to be more closely aligned with predicted most suitable habitat, especially on the south shore. Additionally, the minority habitat types surrounding the absence data points are 1, 2, and 3, indicating that the least suitable habitat does not constitute much of the area. This could be partly due to low coverage by the least suitable habitat type overall (see maps). There appears to be very weak correlation between absence of oysters and location of suitable habitat. Absence of oysters is likely to be recorded in both suitable and unsuitable habitat.

Critique of the method
This method further revealed to me the importance of the resolution of the baseline data. Based on field observations and conversations with shellfish biologists, the distribution of Olympia oysters is very patchy due to substrate availability. The oysters may be found attached to a pile of rocks in the middle of a mud flat, but will not be found elsewhere in the mud flat. The raster layer I have available for substrate classifies substrate into large generalized categories, which does not reflect the nuanced nature of the oysters’ opportunistic settling strategy. Dividing habitat suitability into only 4 categories limits the complexity of the analysis, which can be helpful, but it also means there is not much distinction between suitable and unsuitable. Additionally, more data points will help make this analysis more robust.

Using the buffers and the ‘Zonal Statistics’ tools created a generalized output that provides some useful information for analyzing habitat suitability for the oysters. The approach is easily duplicated, which was helpful as I needed to add my field data points in batches as I collected them. What would be more informative for the next iteration is to be able to analyze multiple buffers side-by-side; how does the smallest neighborhood around each point compare to the larger ones?

# Ex. 3: Does black stain spread through landscape networks?

BACKGROUND

For those who have not seen my previous posts, my research involves building a model to simulate the spread of black stain root disease (a disease affecting Douglas-fir trees) in different landscape management scenarios. Each of my landscapes are made up of stands assigned one of three forest management classes. These classes determine the age structure, density, thinning, and harvest of the stands, factors that influence probability of infection.

QUESTION

Are spatial patterns of infection probabilities for black stain root disease related to spatial patterns of forest management practices via the connectivity structure of the network of stands in my landscape?

TOOLS AND APPROACH

I decided to look at how landscape connectivity influenced the spatial relationship between forest management practices and infection probabilities. This approach builds off of a network approach based in graph theory (where each component of the landscape is a “node” with “edges” connecting them) and incorporates concepts from landscape ecology regarding distance-dependent ecological processes and the importance of patch characteristics (e.g., area, habitat quality) in the contribution of patches to the connectivity of the landscape. I used ArcMap, R, and a free software called Conefor (Saura and Torné 2009) to perform my analysis.

DESCRIPTION OF STEPS I FOLLOWED TO COMPLETE THE ANALYSIS

1. Create a mosaic of the landscape

The landscape in my disease spread model is a torus (left and right sides connected, top and bottom connected). The raster outputs from my model with stand ID numbers and management classes do not account for this and are represented as a square. Thus, in order to fully consider the connectivity of each stand in the landscape, I needed to tile the landscape in a 3 x 3 grid so that each stand at the edge of the stand map would have the correct spatial position relative to its neighbors beyond the original raster boundary. I did this in R by making copies of the stand ID raster and adjusting their extent. In ArcMap, I then assigned the management classes to each of those stands by converting to polygons, using the “Identity” tool with the management class polygon, and then using the “Join Field” tool so that every stand with the same unique ID number would have the relevant management class assigned. If I had not done this step, the network position of stands at the edge of the raster would have been misrepresented.
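The tiling step can be sketched in Python with plain nested lists (the actual step used R rasters and extent adjustment; the stand IDs below are toy values):

```python
def tile_3x3(raster):
    """Tile a raster in a 3 x 3 grid so that edge stands see the neighbors
    they would have on the torus — mimics the mosaicking step."""
    wide_rows = [row * 3 for row in raster]  # repeat each row's columns 3x
    return wide_rows * 3                     # repeat the block of rows 3x

# Toy 2x2 stand-ID raster.
stands = [[1, 2],
          [3, 4]]
mosaic = tile_3x3(stands)
print(len(mosaic), len(mosaic[0]))  # 6 6
print(mosaic[0])                    # [1, 2, 1, 2, 1, 2]
```

After connectivity is computed on the mosaic, only the central copy of each stand is retained, as described in step 4 below.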

2. Calculate infection probability statistics for each stand

I then needed to relate each stand to the probability of infection of trees in that stand (generated by my model simulation and converted to point data in a previous exercise). In ArcMap, I used the “Spatial Join” tool to calculate statistics for infection probabilities in that stand, because each stand contains many trees. Statistics included the point count, median, mean, standard deviation, minimum, maximum, range, and sum.

3. Calculate the connectivity of each stand in the network of similarly managed stands in the landscape

3a. For this step, I used the free software Conefor, which calculates a variety of connectivity indices at the individual patch and overall landscape level. First, I used the Conefor extension for ArcMap to generate the input files for the Conefor analysis. The extension generates a “nodes” file for each feature and a “connection” file, which contains the distances between features and a binary indication of whether a link (“edge”) exists between two features. One can set the maximum distance for two features to be linked or generate a probability of connection based on an exponential decay function (a built-in feature of Conefor, which is an incredible application). For my analysis, I performed connectivity analyses that only considered features to be linked if (i) they had the same management class and (ii) there were no more than 10 meters between the stand boundaries. Ten meters is about the upper limit for the maximum likely root contact distance between two Douglas-fir trees.
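The linking rule in 3a can be sketched in Python. The stand classes and pairwise boundary distances below are hypothetical; Conefor itself derives the connection file from the actual geometry:

```python
def build_links(classes, distances, max_dist=10.0):
    """Link stands i, j only if they share a management class and their
    boundary distance is within max_dist (10 m ~ max root-contact range)."""
    n = len(classes)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if classes[i] == classes[j] and distances[i][j] <= max_dist]

# Hypothetical stands: management classes and pairwise boundary distances (m).
cls = ["ext", "ext", "int", "ext"]
dist = [[0, 5, 3, 40],
        [5, 0, 8, 6],
        [3, 8, 0, 2],
        [40, 6, 2, 0]]
print(build_links(cls, dist))  # [(0, 1), (1, 3)]
```

Note that stands 0 and 2 are close (3 m) but unlinked because their management classes differ, while stands 0 and 3 share a class but are too far apart: both conditions must hold.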

3b. For each management class, I ran the Conefor analysis to calculate multiple metrics. I focused primarily on:

• Number of links in the network
• Network components – each component is a set of patches (stands) that is internally connected but has no connection to any other set of patches.
• Integral Index of Connectivity (IIC) – this index gives each patch (stand) a value for its importance to connectivity in the network, based on its habitat attributes (e.g., area, habitat quality) and its topological position within the network. Higher values indicate higher importance for connectivity. The index is broken into three non-redundant components that sum to the total IIC (Saura and Rubio 2010):
• IIC intra – connectivity within a patch
• IIC flux – area-weighted dispersal flux
• IIC connector – importance of a patch for connecting other patches in the network

4. Analyze the relationship between connectivity metrics and infection probabilities

I reduced the mosaic to include only one feature per stand, eliminating those at the periphery and keeping those in the core. I confirmed that the values were similar for all copies of each stand near the center of the mosaic. I then mapped and plotted different combinations of connectivity and infection probability metrics to analyze the relationship for each management class (Fig. 1, Fig. 2).

Fig. 1. Map of IIC connectivity index and mean infection probability for the extensively managed stands.

RESULTS

I generally found no relationship between infection probability and the various metrics of connectivity. As connectivity increased, infection probabilities did not change for any of the metrics I examined (Fig. 2). I would like to analyze this for a series of landscape simulations in the future to see whether patterns emerge. I could also refine the distance used to generate links between patches to reflect the dispersal distance for the insects that vector the disease.

Fig. 2. Plots of infection probability statistics and connectivity metrics for each of the stands in the landscape. Each point represents one stand in the randomly distributed landscape, with extensively managed stands in red, intensively managed stands in blue, and old-growth stands in green.

CRITIQUE OF THE METHOD – What was useful, what was not?

I had originally planned to use the popular landscape ecology application Fragstats (or the R equivalent “landscapemetrics” package), but I ran into issues. As far as I could tell (though I may be incorrect), these options only use raster data and consider one value at a time. What I needed was for the analysis to consider groups of pixels by both their stand ID and their management class, because stands with the same management class are still managed independently. However, landscapemetrics would consider adjacent stands with the same management class to be all one patch. This meant that I could only calculate metrics for the entire landscape or the entire management class, which did not allow me to look at how each patch’s position relative to similarly or differently managed patches related to its probability of infection. In contrast, Conefor is a great application that allows for calculation of a large number of connectivity metrics at both the patch and landscape level.

References

Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

# Ex 3: Grain size distributions and peak flow events

The context:

For this exercise, I wanted to figure out if there was a relationship between the temporal pattern of peak discharge in my study creeks and the temporal pattern of grain size distributions. The temporal pattern of grain size distributions could help tell a story about the interplay between hillslope and alluvial processes and the forces involved in shaping and stabilizing the streams. One might predict that changes in grain size distribution are related to extreme events: large events could be associated with debris flows, which might reduce sorting, while more moderate flow events could increase sorting.

The data:

Pebble counts were conducted in conjunction with cross section surveys at five reaches between 1995 and 2011. They show both the median (D50) and standard deviation of grain size varying over time.

During this time period, two different cross section sampling methods were used, depending on the year. In one method, every cross section in a given year was surveyed. In the other, the only cross sections sampled were those in which the field crew thought the creek bed had changed. Because of this, the grain size samples in some years are incomplete and biased toward conditions that favor visible change.

The graphs below show, in subfigure (a), the mean D50 and the mean standard deviation for each set of cross sections in each year sampled. The error bars show standard error. The faded points represent years when the field crew sampled only a subset of the cross sections. Subfigure (b) shows the area-normalized annual peak discharge for the stream gauges on Mack and Lookout Creek; Cold Creek is ungauged.

These figures imply that most reaches were the least sorted in 1997, the year after the flood of record, followed by various decreases and increases over time.

The questions

I asked the following questions to try to understand the relationship between peak flow and grain size distribution:

1. Is the standard deviation of grain size in a given water year associated with the peak flow from that water year?
2. Is the standard deviation of grain size in a given water year associated with the peak flow from the previous water year?
3. Is the change in standard deviation of grain size between two years associated with the largest peak flow from the interceding years?
4. Is the change in D50 between two years associated with the largest peak flow from the interceding years?

The methods and results

I used simple linear regression to relate each of these variables and took the R2 value to represent how much of the sediment distribution variability each variable might explain. The results were negative in all cases.
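As a sketch of the regression step, here is a hypothetical Python version computing R² from a least-squares fit. The discharge and grain size values are invented, not the study data:

```python
def r_squared(x, y):
    """R^2 of a simple least-squares fit of y on x,
    computed as the squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical area-normalized peak discharge vs. grain size SD at one site.
peak_q = [1.0, 2.0, 3.0, 4.0]
sd_grain = [2.1, 2.0, 2.4, 2.2]
print(round(r_squared(peak_q, sd_grain), 3))
```

Repeating this per site and per question yields the table of R² values below; values near zero, as obtained, indicate the single predictor explains little of the variability.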

| SITE | (1) SD vs. same-year peak | (2) SD vs. prior-year peak | (3) ΔSD vs. max interceding peak | (4) ΔD50 vs. max interceding peak |
|---|---|---|---|---|
| LOL | 0.00983 | 0.0546 | 0.0845 | 0.0440 |
| LOM | 0.00320 | 0.0297 | 0.0718 | 0.0252 |
| MAC | 0.0340 | 0.0191 | 0.0842 | 0.0553 |
| MCC | 0.106 | 0.217 | 0.0852 | 0.00688 |

The answer to each of these questions appears to be no. Clearly hydrology has some impact on grain size distributions, but the relationship may be too complicated to address using a single predictor variable and limited data.

# Exercise 3: Perception of natural resource governance and changes in environment over time

Questions

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

Tools and Approaches

1. Tabulate Area in ArcGIS Pro
2. T-test in R

Analysis Steps

1. To begin to answer this question, I first needed to find spatial layers of the environmental condition in the Puget Sound over time. I could not use the same ‘environmental effects’ data I used for Exercise 2, as the data was only completed 5 months prior. Instead, I substituted two land cover maps from 1992 and 2016 to approximate environmental condition.

I first loaded the raster layers into ArcGIS Pro. I then reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

2. To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests on each land cover type’s change between the two groups.
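The two-sample t-tests were run in R; as a sketch, a Welch t statistic (which does not assume equal variances) can be computed directly. The percent-change values below are invented for illustration:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic for possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical percent grassland change around above- vs. below-average
# governance perception respondents.
above = [-12.0, -14.0, -13.0]
below = [-17.0, -19.0, -18.0]
print(round(welch_t(above, below), 2))
```

The t-value is then compared against the t distribution (with Welch-adjusted degrees of freedom) to obtain the p-values reported in Table 1.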

Results

Land cover change differed significantly between the two groups for four land cover types: grassland, forest, shrub/scrub, and bare land. Grassland cover decreased for both groups, but decreased by 5% more in areas with individuals with below-average governance perceptions. Forest cover also decreased for both groups, but decreased by 1% more in areas with individuals with below-average governance perceptions (forest cover in the region is very extensive, which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below-average governance perceptions. Lastly, bare land decreased for both groups, but decreased by 5% more in areas with individuals with below-average governance perceptions (Table 1).

The effect size of land cover change on perception was small for each of the significant variables with biserial correlations between .07 and .09 (Table 1).

Table 1. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

| Land Cover | Above Average1 | Below Average1 | t-value | p-value | Effect size (rpb) |
|---|---|---|---|---|---|
| High Development | 17 | 16 | 0.80 | .426 | .02 |
| Medium Development | 19 | 19 | 0.67 | .500 | .02 |
| Low Development | 11 | 11 | 0.22 | .829 | .00 |
| Developed Open Space | 12 | 13 | 0.21 | .837 | .00 |
| Agriculture | -1 | 0 | 0.74 | .461 | .02 |
| Grassland | -13 | -18 | 3.03 | .002 | .09 |
| Forest | -10 | -9 | 2.61 | .009 | .08 |
| Shrub/Scrub | 40 | 43 | 3.04 | .002 | .09 |
| Wetlands | 0 | 0 | 1.67 | .095 | .05 |
| Unconsolidated Shore | 2 | 2 | 0.10 | .918 | .00 |
| Bare Land | -21 | -25 | 2.43 | .015 | .07 |
| Water | 0 | 0 | 0.90 | .368 | .03 |
| Aquatic Beds | -1 | -5 | 1.19 | .233 | .03 |
| Snow/Tundra | 0 | 0 | 0.00 | .997 | .00 |

1Cell entries are percent changes of land cover from 1992 to 2016

Critique

The biggest problem I had with this method was figuring out the most appropriate way to test my question. I struggled with classifying ‘good’ or ‘bad’ land cover change, and ultimately decided it was inappropriate. I also don’t think land cover is necessarily the best way to test environmental condition. I think it would be more appropriate if I were able to use environmental effects over time. I think it would also be best if I did different sized buffers around individuals.

Additionally, I believe I should have controlled for how long respondents have lived in the region. The average residence time was 34 years, which is longer than the 23-year span of the land cover change data. Since years lived in the Puget Sound is a significant determinant of perception, it may have more to add in this case. The problem is that this method is clunky: I differentiated high and low perception by keeping separate csv files, and further splitting these by years in the region would create additional steps, and more files, which could ultimately increase personal error given the difficulty of keeping track of which buffer belongs to which type of individual.

# Ex 3: Analysis of juniper density versus slope and aspect characteristics

How does juniper density vary with slope and aspect characteristics (as indicators of soil moisture)?

The study site for this analysis is the Camp Creek Paired Watershed Study (CCPWS), located in a semiarid region of central Oregon. This research focuses on two watersheds: one in which most juniper was removed in 2005-2006, referred to as the treated WS, and one in which mature juniper is the dominant overstory species, referred to as the untreated WS. Belt transects (3 m wide by 30 m long) recording the number, height, and canopy cover of juniper within the treated WS were used for analysis. The number of juniper found in each transect was represented as a single point at the beginning of the transect. Additionally, NAIP imagery and a 30 m DEM were used to assess juniper density and slope and aspect characteristics across both watersheds and the surrounding areas.

2. Approach and tools used:

Analysis was conducted using R Studio, ArcGIS Pro, and Excel. A Chi-squared test and logistic regression were calculated in R, followed by Geographically Weighted Regression (GWR) in ArcGIS Pro. Excel was used to calculate the percentage of pixels classified as juniper within each buffer.

3. Steps used in analysis:

A. Data preparation

1. The following steps were conducted in a previous exercise: 1) a 30 m DEM was used to extract slope and aspect characteristics; 2) slope and aspect values were each divided into nine groups representing characteristics generally associated with increased soil moisture (e.g., flat slopes with northerly aspects were rated highest); 3) using the raster calculator, the slope and aspect categories were averaged ((slope category + aspect category)/2). The resulting raster rates each grid square from 1-9 (henceforth referred to as the “rating model”) and is shown below.

2. Classification of NAIP imagery (also from a previous exercise). The NAIP imagery was classified using Support Vector Machine supervised classification. (Please note: the resolution of the NAIP imagery makes identification of juniper saplings difficult, so conclusions from these data should be drawn with caution.) The NAIP raster was resampled to the same resolution as the 30 m DEM, and the pixels were reclassified into a binary raster in which each pixel is either “juniper” or “not juniper”. The resampled NAIP imagery is shown below.

3. The classified NAIP raster and the rating model were projected to NAD 1983 UTM Zone 10, and both rasters were masked to the same extent.

4. A table was created representing the number of juniper found per transect as a point. Using the “extract multi values to points” tool in ArcGIS, the corresponding rating model value was assigned to each transect point.

5. In a previous exercise, a 50 m buffer was created around each of the belt transects (treated WS only). The percentage of juniper (pixels classified as juniper / total pixels) was calculated in Excel and added to the belt transect table created previously.

6. Using the “create random points” tool in ArcGIS, 96 random points were created within the study area (treated WS, untreated WS, and the surrounding area). These points were a minimum of 150 m apart.

7. A 50 m buffer was created around each of the 96 random points. The percentage of juniper in each buffer was calculated as above, and the values were joined to a table of the random points.
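The raster-calculator averaging in step 1 ((slope category + aspect category)/2) can also be expressed outside of ArcGIS. A minimal numpy sketch, with toy 1-9 category grids standing in for the real binned slope and aspect rasters:

```python
import numpy as np

# Toy 1-9 suitability categories standing in for the binned slope/aspect grids
# (9 = conditions most associated with increased soil moisture).
slope_cat = np.array([[1, 5, 9],
                      [3, 7, 2]])
aspect_cat = np.array([[9, 5, 1],
                       [3, 3, 8]])

# Raster-calculator equivalent: (slope category + aspect category) / 2
rating = (slope_cat + aspect_cat) / 2.0

print(rating)
```

Every output cell stays on the 1-9 scale of the rating model, since it is the mean of two 1-9 inputs.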

B. Analysis

1. The binary classified NAIP raster and the rating model were loaded into RStudio and reviewed (file paths have been abbreviated):

```r
library(raster)
library(sp)
library(rgdal)

# Load the rating model and the binary juniper classification
apslop <- raster("C:/…AspSlop_projected_clip.tif")
juniperonly_NAIP <- raster("C:/…NAIP_resampled_clip1.tif")

# Review extent, resolution, and values
apslop
juniperonly_NAIP
plot(apslop)
plot(juniperonly_NAIP)

# Stack the rasters and build a data frame of cell values, dropping NAs
s2 <- stack(apslop, juniperonly_NAIP)
v2 <- data.frame(na.omit(values(s2)))
```

2. A Chi-squared test was conducted for the rating model and the classified NAIP raster (R):

```r
# Test for association between rating-model categories and juniper presence
chisq.test(apslop[], juniperonly_NAIP[])
```

3. Logistic model, summary of results, and associated plots (R):

```r
# Logistic regression: juniper presence/absence as a function of the rating model
glm.fit <- glm(juniperonly_NAIP[] ~ apslop[], data = v2, family = binomial)
summary(glm.fit)

plot(glm.fit)
exp(coef(glm.fit))  # exponentiated coefficients (odds ratios)
```

4. GWR was conducted in ArcGIS for the belt transects (in treated WS), using the juniper density in the 50m buffer as the dependent variable and the rating model as the independent variable. The distance band neighborhood type and golden search neighborhood selection were used.

5. GWR was conducted in ArcGIS using the 96 random points (in treated WS, untreated WS, and surrounding areas). The same neighborhood parameters described above were used.

6. Global Moran’s I was calculated using the residuals for both GWR analyses. Inverse distance and the Euclidean distance method were used for analysis.
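At its core, the GWR in steps 4-5 is a weighted least-squares fit repeated at each feature, with weights that fall to zero at the distance band. A stripped-down numpy sketch of that idea (bisquare kernel assumed; synthetic coordinates and values; no golden search or diagnostics):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
xy = rng.uniform(0, 1000, size=(n, 2))            # feature coordinates (m)
rating = rng.uniform(1, 9, size=n)                # explanatory variable (1-9)
density = 2.0 + 0.5 * rating + rng.normal(0, 0.1, size=n)  # synthetic response

def gwr_coefs(xy, x, y, band):
    """Local [intercept, slope] at every feature via bisquare-weighted OLS."""
    X = np.column_stack([np.ones_like(x), x])
    out = np.empty((len(x), 2))
    for i, p in enumerate(xy):
        d = np.linalg.norm(xy - p, axis=1)
        w = np.where(d < band, (1.0 - (d / band) ** 2) ** 2, 0.0)
        XtW = X.T * w                   # X'W for a diagonal weight matrix
        out[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return out

coefs = gwr_coefs(xy, rating, density, band=600.0)
# With a spatially constant true process, local slopes hover near 0.5.
```

The ArcGIS warning about "very limited variation after applying the weights" corresponds to the case where the local `XtW @ X` matrix is near-singular because too few (or too similar) points carry weight.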

4. Overview of Results:

A. The Chi-squared test of the binary NAIP raster (juniper versus non-juniper) and the rating model raster indicated a significant, though weak, relationship:

```
Pearson's Chi-squared test

data:  apslop[] and juniperonly_NAIP[]
X-squared = 28.624, df = 15, p-value = 0.01798
```

B. The logistic regression model indicated a significant relationship between the aspect and slope classification (from the rating model) and the presence or absence of juniper (the dependent variable). However, the small difference between the null deviance and the residual deviance suggests that this may not be an appropriate model for predicting the presence or absence of juniper. The exponentiated coefficient indicated that for every one-unit increase in the aspslop rating, the odds of juniper being present were multiplied by 1.06.

```
Call:
glm(formula = juniperonly_NAIP[] ~ apslop[], family = binomial, data = v2)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.6934  -0.6576  -0.6317  -0.5984   1.9537

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.80732    0.11048 -16.359  < 2e-16 ***
apslop[]     0.05935    0.01948   3.046  0.00232 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7425.5  on 7761  degrees of freedom
Residual deviance: 7416.1  on 7760  degrees of freedom
  (656 observations deleted due to missingness)
AIC: 7420.1

Number of Fisher Scoring iterations: 4

> exp(coef(glm.fit))
(Intercept)    apslop[]
   0.164094    1.061147
```

C. The GWR of the treated WS dataset indicated an adjusted R-squared value of 0.28. The following warning was also returned for this analysis: “At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results”. This further suggests that this model may not capture the explanatory factors related to juniper density. The GWR results are provided below.

```
Golden Search Results

Distance Band        AICc
     292.6545    222.3773
     884.3284    221.4705
     518.6538    216.8374
     658.3291    219.2658
     432.3298    217.0552
     572.0050    217.5695
     485.6810    216.6200
     465.3026    216.6782

WARNING 110259: At least one of your local regressions had very limited
variation after applying the weights. Please use caution when interpreting
the results.

------------------------- Analysis Details -------------------------
Number of Features:                                               33
Dependent Variable:        BELTTRANSECTS_50M_EXCELTOTABLE.F__JUNIPER
Explanatory Variables:
    C250M_ALLSITES_EXCELTOTABLE_PROJECT.ASPSLOP_PROJECTED_CLIP
Distance Band:                                              465.3026
--------------------------------------------------------------------

-------- Model Diagnostics --------
R2:                             0.4918
AICc:                         216.6782
Sigma-Squared:                 28.4635
Sigma-Squared MLE:             20.4670
Effective Degrees of Freedom:  23.7290
```

D. The GWR performed for the larger dataset (points distributed across both watersheds and the surrounding areas) indicated an R-squared value similar to that of the GWR based on the treated WS only. The same warning about limited local variation was returned for this analysis. The results of this GWR are displayed below.

```
Golden Search Results

Distance Band        AICc
    1360.4243    -98.1728
    3270.5186    -84.8038
    2090.0154    -90.0118
    2540.9275    -87.1310
    1811.3364    -92.9095
    1639.1033    -95.1341
    1532.6574    -96.4953
    1466.8702    -97.2349
    1426.2115    -97.6318
    1401.0830    -97.8502
    1385.5527    -97.9775
    1375.9545    -98.0539

WARNING 110259: At least one of your local regressions had very limited
variation after applying the weights. Please use caution when interpreting
the results.

------------------------- Analysis Details -------------------------
Number of Features:                                               96
Dependent Variable:
    RANDOMPTS_50MBUFFER_JUNIPER_EXCELTOTABLE.F_JUNIPER
Explanatory Variables:      RANDOM_POINTS_105.ASPSLOP_PROJECTED_CLIP
Distance Band:                                             1375.9545
--------------------------------------------------------------------

-------- Model Diagnostics --------
R2:                             0.2932
AICc:                          -98.0539
Sigma-Squared:                  0.0190
Sigma-Squared MLE:              0.0167
Effective Degrees of Freedom:  84.3419
```

E. The Global Moran’s I for the GWR residuals from the belt transect dataset (treated WS) indicated clustering (p < 0.0001), while the GWR residuals for the larger dataset (both WS plus surrounding area) indicated a random distribution (p = 0.327).
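As a cross-check on results like these, Global Moran’s I with inverse-distance weights can be computed directly. A minimal numpy sketch of the statistic itself (synthetic residuals and coordinates; no permutation test for significance):

```python
import numpy as np

def morans_i(xy, z):
    """Global Moran's I with inverse-distance weights (no row standardization)."""
    z = z - z.mean()
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    w = np.zeros_like(d)
    off_diag = ~np.eye(len(z), dtype=bool)
    w[off_diag] = 1.0 / d[off_diag]   # inverse distance; zero on the diagonal
    num = (w * np.outer(z, z)).sum()
    return len(z) / w.sum() * num / (z ** 2).sum()

rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(40, 2))     # synthetic point locations
random_resid = rng.normal(size=40)         # spatially random residuals
trend_resid = xy[:, 0] / 100.0             # smooth east-west trend

i_random = morans_i(xy, random_resid)
i_trend = morans_i(xy, trend_resid)
# The trending surface yields a clearly larger, positive Moran's I.
```

Clustered residuals (like the belt transect GWR) drive the statistic positive; spatially random residuals keep it near its small negative expectation of -1/(n-1).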

5. Discussion/critique of methods:

The analysis used here suggests that other explanatory factors, beyond the rating model created from slope and aspect, may influence juniper density. The R-squared values improved compared to previous exercises but are still relatively low. Additionally, it should be noted that the presence of juniper has been correlated with reduced soil moisture in some conditions (Lebron et al., 2007); therefore slope and aspect may not sufficiently represent soil moisture in these watersheds.

The intent of this analysis was to develop a potential workflow that can be used with other datasets. Caution should be used in drawing any conclusions from this analysis specifically. Additionally, caution should be taken in interpreting the results of this analysis due to the characteristics of the datasets used. For instance, the resolution of the NAIP imagery makes the detection of juniper saplings difficult. Therefore the classification results may not be an accurate representation of juniper density, particularly in the treated WS. However, this process may be applied to more expansive data to include higher-resolution imagery (such as imagery collected using UAVs, etc.).

Reference:

Lebron, I., Madsen, M. D., Chandler, D. G., Robinson, D. A., Wendroth, O., & Belnap, J. (2007). Ecohydrological controls on soil moisture and hydraulic conductivity within a pinyon-juniper woodland. Water Resources Research, 43(8), W08422. https://doi.org/10.1029/2006WR005398

# Does proximity of open areas influence invasibility of forested areas?

I am interested in how the spatial pattern of non-forested areas influences the invasibility of forest plots by the exotic annual grass ventenata (Ventenata dubia). For exercise 3, I asked the question: how are forest plots with high and low ventenata cover related to the spatial pattern of canopy openness?

Tools and Approaches Used

To explore this question, I performed a Ripley’s cross-K analysis between forest plots with high and low ventenata cover and non-forest areas (Ripley 1976). This analysis is used to interpret spatial relationships between two point patterns. I used ArcGIS and R to answer the question, following the steps below:

• I added my sample plots (with ventenata cover and coordinates) and fire perimeters into ArcGIS.
• To generate non-forest points, I added canopy cover raster data for the study area into ArcGIS, with values ranging from 0 to 95% canopy cover.
• I buffered the sampled fire perimeters by 3000 m using the “buffer” tool and generated 200 random points, no closer than 100 m apart, in each buffer using the “create random points” tool in ArcGIS.
• I extracted the raster values to the random points using the “values to points” tool, and extracted points with <10% canopy cover into a new layer file for open areas.
• I then extracted only my sample plots in forested areas into a forest_plots layer.
• I exported these plots and the random points into a data frame with spatial data and an indicator for sites with high ventenata (>15% cover), low ventenata (0-15% cover), or random open points with no ventenata recorded (labeled “high”, “low”, and “open”, respectively).
• I performed a Ripley’s cross-K between the high and low ventenata forest plots and the non-forest “open” areas and plotted the response in R with the package “spatstat” (thanks, Stephen C., for sharing R code!). I repeated the analysis individually for three sampled fires (Fox, South Fork, and Canyon Creek) to avoid picking up the inherently clustered spatial pattern of my sample plots, which would overshadow the research question. However, I also performed the analysis on the entire sample area to see what issues arise when a clustered sample design is ignored, and to compare with the results from individual fires.
• To perform the cross-K on each sampled fire, I first had to establish rectangular “windows” around each fire and provide the xy coordinates of the vertices, using tools in ArcGIS depicted below.
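The cross-K statistic itself is simple enough to compute without spatstat, which helps when sanity-checking the R output. A minimal, non-edge-corrected numpy sketch of the estimator, using synthetic “open” points and “high” plots clustered around them:

```python
import numpy as np

def cross_k(pts_i, pts_j, r, area):
    """Naive cross-K estimate (no edge correction): area-scaled count of
    type-j points within distance r of type-i points."""
    d = np.linalg.norm(pts_i[:, None, :] - pts_j[None, :, :], axis=2)
    return area * (d <= r).sum() / (len(pts_i) * len(pts_j))

rng = np.random.default_rng(2)
side = 100.0                                    # square observation window
open_pts = rng.uniform(0, side, size=(200, 2))  # "open" (non-forest) points
# "High" forest plots clustered near open points: jitter around a subset.
high_pts = open_pts[:20] + rng.normal(0, 2.0, size=(20, 2))

r = 5.0
k_obs = cross_k(high_pts, open_pts, r, side ** 2)
k_csr = np.pi * r ** 2   # expectation under complete spatial randomness
# Clustering of high plots around open points appears as k_obs > k_csr.
```

spatstat’s Kcross adds edge corrections and simulation envelopes on top of this same counting idea; the sketch only shows the direction of the comparison against the Poisson expectation.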

Results & Discussion

Most annual grass species prefer areas of low canopy cover. These open areas can, however, act as source populations for invasion into more resistant forested areas. I hypothesized that ventenata (an annual grass) abundance in forested areas would increase with increased proximity and abundance of open areas near the sampled forest (i.e. sites with high ventenata cover in forested areas would be more spatially “clustered” around non-forest areas than sites with low ventenata cover).

The results of my Ripley’s cross-K show that the spatial relationship between sites with high/low ventenata cover and open areas differs between fires. At the Fox fire (FOX), high and low sites closely follow the projected Poisson distribution when crossed with open areas, suggesting that ventenata abundance in forest sites is not related to nearby open areas. However, in the Corner Creek (CC) and South Fork (SF) fires, heavily invaded forest sites are more clustered around open points, and low sites more dispersed, than would be expected by chance. I may not have seen a relationship between open areas and high/low sites at the Fox fire because of the relatively low number of forest sample plots there (n = 5 high and 4 low). Another explanation could be that I am missing an interaction between fire and invasion by not accounting for canopy lost in forest plots as a function of fire. Analyzing burned vs. unburned forest plots (with fire as an additional driver of invasion) would have been ideal, but would have dropped the sample size too low to perform the analyses.

When I performed the cross-K on the entire study area, it showed strong clustering of both high and low plots around open areas. However, this is likely the result of the clustered spatial pattern of my sampling method (targeting specific burned areas), and less representative of the actual spatial relationship between ventenata invasion and canopy cover.

Critique of Method

Overall, I thought the cross-K was a useful method for evaluating the spatial relationship between two variables (ventenata cover and open areas). However, a method that could include interaction terms (burning) would probably have been more appropriate for this study. Additionally, the arrangement of sample plots made it difficult to evaluate overall patterns across the study region. It would have been a more useful analysis had my samples been evenly distributed across the region.

Ripley, B.D., 1976. The second-order analysis of stationary point processes. Journal of applied probability, 13(2), pp.255-266.

# Ex3: Correlation of Mining and Non-Mining Villages’ NDVI Loss with Distance to Nearest Primary Road

Following up from Exercise 2, where I determined how NDVI loss relates to distance from village center, I now wanted to explore whether or not the NDVI loss I observed could be explained or attributed to another factor: the distance of a village from a primary road.

To do this, I acquired a shapefile of roads from OpenStreetMaps for Senegal, which I used alongside my locations of mining and non-mining villages, ArcGIS Pro for some simple spatial work, and Excel to calculate a simple correlation.

First, I imported the shapefile of roads in Senegal into ArcGIS Pro and determined the nearest road to each village (fig. 1). Based on this first-pass glance, it appears that more mining sites in my sample are nearest to “primary” paved roads, while the majority of non-mining villages are nearest “tertiary” roads. This simply means that, in my sample, more mining happens closer to paved roads, which makes sense: gold needs to be transported out of gold sites, and equipment/supplies need to be transported in.

Fig 1

Next, I manipulated the shapefile to only include features which were designated “primary” roads (fig. 2). I then used ArcGIS Pro’s “Near” tool to generate new attributes for each set of villages which listed the shortest distance to the nearest primary road in meters.

Fig 2: Green dots are non-mining villages; red dots are mining villages

With these distances in hand, I plotted them in Excel against the NDVI % loss values for the first ring (0-250 m) around each village, as a simple first pass at understanding how distance from a primary road may be related to the amount of NDVI loss experienced at a particular village (fig. 3).

Fig 3

Then, within Excel, I calculated a simple correlation (a la fig. 4) to determine how distance from a nearest primary road relates to amount of NDVI loss experienced by a village.

Fig 4: x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.
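The Excel formula in fig. 4 is the standard Pearson correlation, which can be reproduced in a few lines. A numpy sketch with made-up distance/NDVI-loss pairs standing in for the real village table:

```python
import numpy as np

# Hypothetical (distance to primary road [m], NDVI % loss) pairs for a
# handful of villages -- stand-ins for the real mining-village data.
dist = np.array([200.0, 850.0, 1500.0, 3200.0, 5000.0])
ndvi_loss = np.array([42.0, 35.0, 30.0, 22.0, 12.0])

# Pearson correlation: off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(dist, ndvi_loss)[0, 1]
# A strongly negative r mirrors the negative correlation reported
# for mining villages below.
```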

For mining villages, I calculated a correlation of -.804, and for non-mining villages, a correlation of .686. This means that, at least for this sample, the NDVI loss experienced at a mining village is negatively correlated with distance from a primary road; that is, as a mining village’s distance from a primary road increases, its NDVI loss tends to decrease. This makes sense: mining villages closer to primary roads are easier for miners and other users to access, and to leave, encouraging larger groups of people to live there, which would decrease NDVI as a result of their activities.

Non-mining villages, on the other hand, are more difficult to explain. One plausible explanation is that more distant villages have fewer opportunities for livelihoods that are not environmentally dependent, and so are more obligated to work in and with the environment, perhaps encouraging vegetation loss as users rely on and draw down their environment more.

Overall, I think this was a useful approach for examining another possible explanation for why mining villages may have higher rates of NDVI loss than non-mining villages. Mining villages are closer to primary roads, which makes them more accessible to potential miners, and thus more likely to grow larger than non-mining villages and draw down more on their environment. That said, as can be seen in fig. 3, this is not a perfect relationship by any means, and a more robust analysis would be needed to say much more. But, as a first pass, it is useful for illustrating some additional explanation.

# Exercise 3: Confusion Matrix to Test Predictive Capacity of Seascapes

I’ve been using my time in this class to explore methods and analyses that could help me better understand the spatial and temporal relationships between satellite-derived seascapes and fishes in the California Current Ecosystem. In Exercise 1, different types of interpolation were explored and applied to the trawl data to see how the results changed. After refining methods related to the kriging interpolation, Exercise 2 explored the possible uses of neighborhood analysis in determining the prominent seascapes related to high rockfish abundance. Exercise 3 builds on the results of the previous two exercises: to explore the relationship between seascapes and rockfish abundance, a confusion matrix was calculated to measure the ability of certain seascapes to predict the occurrence of rockfish hotspots. Simply put, the question being asked is: how were seascapes related to areas of high rockfish abundance in the California Current Ecosystem in May of 2007?

Tool Used

In order to answer this question, I’m going to use a confusion matrix: a table that measures the agreement between two raster layers and provides an overall measure of a model’s predictive capacity. These matrices can also measure a raster’s error (in terms of false positives and false negatives); each statistic is useful because it clues the researcher into spatial or methodological processes that may account for certain types of error. The confusion matrix works by creating accuracy assessment points, randomly placed points that take on the raster value at their location. Those points are then matched up to the corresponding values in the second raster, and statistics are calculated measuring how much the two agree.

Methods

The data used for this exercise are the following:

• Significant seascape classes that indicate high rockfish abundance for May 2007, as determined by the neighborhood analysis in Exercise 2
• Seascape raster data for the relevant month
• Rockfish abundance raster (interpolated, from Exercise 1)

Confusion matrices can only be calculated when the ground-truthed and modeled layers are in the same units. For this reason, my raster data had to be converted to presence/absence data for both the seascape classes and the interpolated rockfish abundances. To do this, I used the Reclassify tool in ArcGIS to change “present” values to 1 and “absent” values to 0. For the interpolated abundance raster, any nonzero, non-negative values were changed to 1 to indicate modeled rockfish presence; the remaining cells were changed to 0 to indicate modeled rockfish absence. The result was the raster shown below:

Rockfish Presence-Absence Layer calculated using the Reclassify tool on the interpolated trawl data.
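If the interpolated raster is available as an array, the same reclassification can be expressed directly. A numpy sketch of the presence/absence conversion described above (toy values in place of the real raster):

```python
import numpy as np

# Toy interpolated abundance raster; negative cells stand in for
# kriging artifacts, zeros for modeled absence.
abundance = np.array([[0.0, 3.2, -0.4],
                      [1.7, 0.0, 12.5]])

# Reclassify: nonzero, non-negative abundance -> presence (1), else absence (0)
presence = np.where(abundance > 0, 1, 0)

print(presence)
```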

Similar steps were used to reclassify the seascape data – seascape classes 7, 12, 14, 17, and 27, which were shown to be present within a 25km radius of the high abundance trawls in Exercise 2, were counted as present cells, while the other classes were counted as absent cells. The result was the raster shown below:

Rockfish Presence-Absence raster calculated using the results from exercise 2 to reclassify the seascape data

It was determined that the seascape layer would serve as the base or “ground-truthed” layer and the interpolation layer as the modeled layer, since the interpolated values are only estimations. The next step was to use the “Create Accuracy Assessment Points” tool to create randomized points within the modeled layer. The result was the following shapefile:

Randomized accuracy assessment points created using the Create Accuracy Assessment Points tool

Once the points are created, they must be compared to the values in the second layer. The “Update Accuracy Assessment Points” tool does this automatically, and my points were matched up with the corresponding values from the seascape layer. Finally, the resulting shapefile was run through the “Calculate Confusion Matrix” tool.

Results/Discussion

The resulting confusion matrix is displayed below:

| ClassValue | C_0 | C_1 | Total | U_Accuracy | Kappa |
|---|---|---|---|---|---|
| C_0 | 9 | 79 | 88 | 0.102272727 | |
| C_1 | 89 | 276 | 365 | 0.756164384 | |
| Total | 98 | 355 | 453 | | |
| P_Accuracy | 0.091836735 | 0.777464789 | | 0.629139073 | |
| Kappa | | | | | -0.135711 |

The two rasters did not compare well at all, resulting in a negative Kappa value. Kappa measures overall agreement between two classifications, with 1 being perfect agreement; a negative score indicates less agreement than would be expected by chance. The interpolated model performed especially poorly on absent values, agreeing with the seascape model on only 9 pixels. While the two rasters agreed on a majority of the areas where rockfish may be found, the irregularities in the absence areas drove the matrix to failure. The large numbers of both false positives and false negatives show that this was not a fundamental disagreement based on scale or resolution of the models: the two methods of measuring rockfish abundance fundamentally disagreed on where rockfish should be found, to the point where they cannot be reconciled.
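The Kappa value in the table can be verified by hand from the four cell counts. A short Python sketch of Cohen’s kappa computed from the matrix above:

```python
# Counts from the confusion matrix (rows: seascape layer, cols: interpolation).
both_absent, false_present = 9, 79
false_absent, both_present = 89, 276
n = both_absent + false_present + false_absent + both_present  # 453 points

# Observed agreement: fraction of points where the two rasters match.
p_o = (both_absent + both_present) / n

# Chance agreement from the row and column totals.
p_absent = ((both_absent + false_present) / n) * ((both_absent + false_absent) / n)
p_present = ((false_absent + both_present) / n) * ((false_present + both_present) / n)
p_e = p_absent + p_present

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 6))  # -0.135711, matching the ArcGIS output
```

Because chance agreement (p_e ≈ 0.673) exceeds observed agreement (p_o ≈ 0.629), kappa comes out negative even though the rasters agree on most presence pixels.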

Critique of Method

Overall, I think a confusion matrix is a good way to compare modeled and true (ground-truthed) information. In my case, however, both rasters are the results of modeled analyses, each with its own set of assumptions and flaws. The interpolation, for example, is at the mercy of the settings used when executing the kriging: how many points should be measured per pixel? Is the sampling method fixed, or variable and spatially based? And the seascape raster is at the mercy of the neighborhood analysis, which was conducted on only 4 hand-picked points, introducing all kinds of sampling and spatial biases. All of this, coupled with the fact that both rasters had to be distilled into presence/absence data, which eliminates much of the resolution that makes kriging so valuable, results in a heavily flawed methodology that I do not believe truly represents the data. The framework I’ve used, however, has the potential to be modified, standardized, and rerun to address many of these issues. For now, I feel that the confusion matrix is a helpful tool, just not for two sets of modeled data.

# Ex 3: Relating distance between wells to communication type and fault presence

Is there a relationship between the distance between wells, their communication status based on a pumping interference test, and whether or not they are separated by a fault?

Name of the tool or approach used:

Polyline creation and classification in ArcGIS Pro, boxplot creation and two-sided t-tests in R.

Method:

29 wells in my study area were evaluated by the Oregon Water Resources Department during pumping interference tests in 2018 and 2019. This test involves pumping one well and seeing whether the water levels in nearby wells drop in response. I received a verbal account of the wells that did and did not communicate, sketched it on a map, and then transferred that information into ArcGIS Pro. I drew polylines using the well locations as snapped endpoints, then created and populated fields containing the well communication type (“communicate” for wells that respond to pumping at a nearby well, “does not communicate” for wells that do not) and whether or not the path between two wells crosses a fault. Shape_Length in feet was calculated automatically when I created the polylines, on account of the projection I used for the shapefile.

I exported that data table to a csv and imported it into R, where I subset it into three categories: all paths, paths between wells that communicate, and paths between wells that do not communicate. I then created box plots and ran t-tests to see differences between means and distributions of path length based on communication type or fault presence.
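The subset-and-test pattern translates directly to other environments. A standard-library Python sketch, using a toy table with illustrative columns in place of the exported csv (Welch’s t statistic, as computed by R’s default t.test):

```python
import math
from statistics import mean, variance

# Hypothetical rows from the exported table:
# (path length [ft], wells communicate?, path crosses a fault?).
paths = [
    (2900, True, False), (3100, True, False), (3400, True, False),
    (4800, True, True), (5200, True, True),
    (3600, False, False), (5100, False, True), (5300, False, True),
]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Subset by fault crossing and compare mean path lengths.
crosses = [p[0] for p in paths if p[2]]
no_cross = [p[0] for p in paths if not p[2]]
t = welch_t(crosses, no_cross)
# Faulted paths are longer on average in this toy table, so t > 0.
```

Swapping the filter condition from `p[2]` to `p[1]` gives the communication-type comparison from the same table.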

Results:

Comparing path length and communication type for all 29 wells involved in the interference tests, there is no significant evidence of a difference in mean path length between wells that do and do not communicate (p = 0.152, two-sided t-test). While the mean distance between wells that do not communicate is larger than the mean distance between wells that do, the overlapping interquartile ranges of the two categories weaken this difference. There is no clear evidence that distance plays a role in well communication.

There is some evidence for a difference in mean path lengths between wells that do and do not cross faults, based on a p-value of 0.047 in a two-sided t-test. The mean path length that crosses a fault is 5,139 ft, while the mean path length that does not cross a fault is 3,608 ft. Wells that are closer together are less likely to be separated by a fault.

For wells that do communicate, there is evidence of a difference between the mean path lengths that cross faults and the mean path lengths that do not cross faults. The p-value for a two-sided t-test was 0.024. Wells that communicate but are not separated by a fault are more likely to be closer together than wells that are separated by a fault.

For wells that do not communicate, there is no evidence of a difference in mean path lengths between paths that do and do not cross faults, given a p-value of 0.98 in a two-sided t-test. Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for wells separated by a fault that do not communicate.

Summary of results:

Wells that communicate in pumping tests do not have a significantly different mean distance between them than wells that do not communicate (p = 0.152)

Wells that are closer together are less likely to be separated by a fault. (p = 0.047)

Wells that communicate but are not separated by a fault are more likely to be closer together than communicating wells that are separated by a fault. (p = 0.024)

Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for non-communicating wells separated by a fault. (p = 0.98)

Critique: I wish I had more sample points and paths to work with, so I could use a more interesting analysis such as ANOVA.

# Estimates of connectivity to critical infrastructure in Seaside, OR following a rupture of the Cascadia Subduction Zone

Question

For exercise 3, I evaluated the connectivity of each building within Seaside, OR, to critical infrastructure following a rupture of the Cascadia Subduction Zone. The probability of connectivity for each building was determined using networks and considered the following:

• Electric Power Network (EPN): probability that each building has electricity.
• Transportation: probability that each building can reach the hospital or fire stations via the road network.
• Water Supply Network: probability that each building has access to running water.

The connectivity analysis was deaggregated by hazard as well as the intensity of the event.

Tool and approach

For this exercise, I used: (1) a probabilistic earthquake/tsunami damage model to evaluate the functionality of linkages; (2) the network analysis package python-igraph to evaluate the connectivity of each tax lot to critical infrastructure; and (3) QGIS for spatial visualization of the results.

Description of steps

Networks were created to represent the connectivity of the three infrastructure components (Figure 1). A network consists of nodes connected to each other through edges. When edges are removed, a connectivity analysis can be performed to determine whether there is any path from one node to any other specific node. A disconnection in the network results in two (or more) separate networks.

Here, the earthquake and tsunami hazards damage edges, which are removed from the network if deemed nonfunctional. A connectivity analysis between each tax lot and critical infrastructure was then performed, and each tax lot was flagged with a binary yes/no for connectivity. A Monte Carlo approach with 10,000 simulations was implemented to determine the probability of each tax lot being connected to critical infrastructure. The resulting probabilities were then added as attributes to GIS shapefiles in order to evaluate the spatial distribution of connectivity.

Figure 1: GIS and representative networks for each infrastructure component
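The analysis described above used python-igraph; as a minimal pure-Python sketch of the Monte Carlo connectivity idea (the network, edge failure probabilities, and node labels here are all hypothetical stand-ins for the damage model's output):

```python
import random

# Toy network: node 0 is the critical facility, node 4 is a tax lot.
# Edge failure probabilities are hypothetical stand-ins for the
# earthquake/tsunami damage model.
edges = {(0, 1): 0.1, (1, 2): 0.3, (2, 4): 0.2, (0, 3): 0.5, (3, 4): 0.5}

def connected(surviving, source, target):
    """Iterative graph search over the surviving edges."""
    adj = {}
    for a, b in surviving:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def p_connected(n_sims=10_000, seed=0):
    """Fraction of simulations in which the tax lot stays connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # each edge independently survives with probability 1 - p_fail
        surviving = [e for e, p_fail in edges.items() if rng.random() > p_fail]
        hits += connected(surviving, source=0, target=4)
    return hits / n_sims
```

For this toy network the exact connectivity probability works out to about 0.63, which the simulation approaches as the number of draws grows.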

Description of results

Characteristics of the network can be described by a degree distribution. In a network, the degree of a node is the number of immediate connections that the node has to other nodes (e.g. a node connected to 3 other nodes has a degree of 3). A histogram of the degrees can be generated to describe the overall distribution of the entire network. The degree distributions for the three infrastructure components are shown in Figure 2. It can be seen that in the EPN network, most nodes are connected to two other nodes. This is likewise apparent in the network of Figure 1, as the EPN network appears more “linear” compared to the transportation and water networks. The transportation and water networks exhibit similar characteristics to each other in that the majority of nodes have a degree of three.

Figure 2: Degree distribution for each infrastructure component
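A degree distribution like the ones in Figure 2 can be computed directly from an edge list; a minimal sketch using a hypothetical, roughly “linear” EPN-like network:

```python
from collections import Counter

# Hypothetical edge list for a small, roughly "linear" EPN-like network
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

# degree of a node = number of edges incident to it
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# degree distribution: how many nodes have each degree
distribution = Counter(degree.values())
```

In this toy chain the two end nodes have degree 1 and the three interior nodes have degree 2, matching the "most nodes have degree two" signature of a linear network.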

Using the results from the Monte-Carlo network analysis, maps were created to show the spatial variability of connectivity. The connectivity was deaggregated by both hazard and intensity of the event, as deaggregation provides an avenue for smart mitigation planning. Although similar maps were produced for all three networked infrastructure, for brevity, only the spatial distribution of the transportation network is shown (Figure 3). The maps show the probability of each tax lot becoming disconnected from the fire stations (2 in Seaside) and hospital (1 in Seaside) via the road network. It can be observed that the tsunami hazard results in significant damage to the transportation system relative to the earthquake hazard. The result of bridge failures caused by the tsunami can be observed for intensities larger than the 500-year event. The region west of the Necanicum River becomes completely disconnected from the fire stations and hospital which are located east of the river.

In addition to the spatial deaggregation, the aggregated results provide a comprehensive overview of the connectivity. Figure 4 shows the average fraction of tax lots disconnected from critical infrastructure across all of Seaside. The three networked infrastructure systems approach complete disconnection for hazard intensities larger than the 1000-year event. The transportation and water networks are dominated by the tsunami for the higher magnitude events, whereas the EPN sees significant damage from both the tsunami and the earthquake. Consequently, if resource managers are planning for high magnitude events, they should invest in tsunami damage mitigation measures.

Figure 3: Spatial distribution of connectivity to fire station and hospitals

Figure 4: Fraction of tax lots connected to critical infrastructure

Critique

Network analysis provides a means to evaluate the connectivity of tax lots to critical infrastructure, and incorporating probabilistic methods accounts for uncertainties as opposed to a deterministic approach. While this type of analysis can be useful to determine overall connectivity, it does not account for limitations and additional stresses in the “flow” of the network. For example, damage to the transportation network would result in additional travel times to the fire stations and hospital. In order to provide a more comprehensive analysis of the impact to networked infrastructure, both connectivity and flow should be considered.

# Comparing the Spatial Patterns of Remnant Infected Hemlocks, Regeneration Infected Hemlocks, and Non-Hosts of Western Hemlock Dwarf Mistletoe

### Overview

For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees was related to the spatial pattern of fire refugia outlined in my study site. I used geographically weighted regression to test the significance of this relationship; however, I did not find a significant relationship between a western hemlock’s intensity of clustering and infection status and its distance to the nearest fire refugia polygon.

This result led to the realization that the polygons as drawn on the map were not as relevant as the actual “functional refugia.” I hypothesized that, after the 1892 fire, the only way for western hemlock dwarf mistletoe to spread back into the stand would be from the trees that survived that fire, or “remnant” trees. These would then infect the “regeneration” trees that came after the 1892 fire. The functional refugia I am interested in are therefore defined by the locations of the remnant western hemlocks. I also hypothesized that the spatial pattern of non-susceptible trees (trees that are not western hemlocks) would play a role in the distribution of the mistletoe.

How are the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species, and how are these relationships related to the spread of western hemlock dwarf mistletoe in the stand?

### The Kcross and Jcross Functions

The cross-type functions (also referred to as multi-type functions) are tools for comparing the spatial patterns of two different types of events (type i and type j) within the same spatial window of some point process X. They do this by assigning labels that differentiate the event types and summarizing the counts of, or distances between, events at differing spatial scales, i.e. circles of increasing radius r.

The statistic Kij(r) summarizes the number of type j events around a type i event within distance r of a point process X. Deviation of the observed Kij(r) curve from the Poisson curve (the expectation if type j events were completely randomly distributed) indicates dependence of type j events on type i events. Interpretation parallels the regular Ripley’s K: deviations above the curve indicate clustering and deviations below indicate dispersal.

The statistic Jij(r) = (1 – Gij(r))/(1 – Fj(r)) compares the distribution of shortest distances from type i events to type j events, Gij(r), against the empty space function of type j events, Fj(r). This is another test for inferring independence or dependence of type j events on type i events: deviations of the Jij(r) curve from the value of 1 indicate dependence between the event types. Specific deviations from 1 can be hard to interpret without an understanding of the Fj(r) function, so it helps to imagine Fj(r) held fixed in the ratio. As Gij(r) increases (type j events occur at short distances from type i events more often than empty space alone would suggest), the numerator shrinks, producing a Jij(r) statistic below 1; this indicates attraction, or clustering, of type j events around type i events. Conversely, as Gij(r) decreases, the numerator grows, producing a Jij(r) statistic above 1; this indicates repulsion, or dispersal, of type j events away from type i events.
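As a concrete illustration of the Kij(r) statistic, a naive empirical version can be sketched in a few lines; spatstat’s Kcross() implements this with edge corrections that are omitted here, and the points, radii, and window area below are hypothetical:

```python
from math import dist

def kcross(points_i, points_j, r_values, area):
    """Naive empirical cross-K: average number of type-j events within
    distance r of a type-i event, scaled by the type-j intensity.
    (spatstat's Kcross() adds edge corrections omitted here.)"""
    lam_j = len(points_j) / area          # intensity of type-j events
    out = []
    for r in r_values:
        count = sum(dist(p, q) <= r for p in points_i for q in points_j)
        out.append(count / len(points_i) / lam_j)
    return out
```

For example, with one type-i event at the origin and two type-j events at distances 1 and 3 in a window of area 100, kcross returns 50 at r = 2 (one neighbor found) and 100 at r = 4 (both found), values that would be compared against the Poisson expectation πr².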

### Methods Overview

In R, the package “spatstat” provides a suite of spatial statistics functions, including the cross-type functions. In order to use these, you need to create a “point pattern process” object. These objects incorporate X and Y coordinates and a frame of reference, or “window,” giving spatial context to a list of values. Marks are then applied to these points to create the multi-type point pattern process object; these marks distinguish the type i and type j events described earlier in your analysis. Running the “Kcross()” or “Jcross()” functions with the specified event types then produces a graph you can interpret, much like producing a normal Ripley’s K plot.

• I took my X – Y coordinates of all trees on the stand and added a column called “Status” to serve as my mark for the point pattern analysis.
1. The four statuses were “Remnant,” “Regen,” “Uninfected,” and “NonHost” to identify my populations of interest.
1. I had access to tree cores, so I identified trees that were older than 170 yrs old and these trees’ diameters served as my cutoff for the “Remnant” diameter class.
2. All trees with DBH > 39.8 cm were assigned the “Remnant” status.
2. Doing this in ArcMap removed steps I would have to have taken when I migrated the dataset to R.
3. I removed all the dead trees because I wasn’t concerned with them for my analysis.
• I exported this attribute table to a csv and loaded it into R Studio.
• I created the boundary window of my study site using the “owin()” function, and the corner points from my study site polygon.
• The function “ppp()” creates the point pattern object, and I assigned the marks to the data set using the “Status” column I created in ArcMap.
1. It’s important that your marks are factors; otherwise the data set is not converted into a multi-type point pattern object.
• The last step is running “Kcross()” and “Jcross()” to compare the “Remnant” population to the “Regen,” “Uninfected,” and “NonHost” populations.
1. This produced 6 plots, 3 of each type of cross-type analysis.
2. Compare these easily using the “par()” function, for example:

par(mfrow = c(1,3))

plot(Ex3.Kcross1)

plot(Ex3.Kcross2)

plot(Ex3.Kcross3)

This produces the three plots in a single row and three columns.

### Results

Because I am assuming the remnant, infected western hemlock trees are one of the main factors for the spread of western hemlock dwarf mistletoe and that they are the center of  new infection centers on the study site, I did all my analysis centered on the remnant trees (points with status = “Remnant” treated as event type i).

1)   i = Remnant, j = Regen

The first analysis, between remnant and regeneration trees, demonstrates that the two event types are dependent on each other. At fairly small distances, or values of r, infected western hemlocks that regenerated after the 1892 fire cluster around infected remnant western hemlocks that survived it. This stands to reason because we assume that infected trees will be near other infected trees, and that infection centers usually start with a “mother tree”; in this case the remnant trees serve as the start of the new infection centers. The Jcross output also shows that the two types of trees are clustered, based on the frequency of the shortest distances. After ~8 meters the two tree types exhibit definite clustering. In terms of the function, the Gij(r) in the numerator of the Jij(r) statistic is approaching 1, i.e. the highest frequency of very short distances.

2)   i = Remnant, j = Uninfected

The Kcross plot from the second set of analyses, between remnant and uninfected trees, demonstrates that there is independence between the two events up to ~15 meters; after that, the trees exhibit slight clustering effects. The lack of dispersal tendencies is strange for these two types of trees because we expect uninfected trees to be furthest away from the centers of infection. The presence of clustering may be indicative of the small spatial scale of my study site. It may also be that the size of the infection centers is only about 15 meters (if we assume that remnant trees are the centers). The Jcross plot shows something similar: at small distances the two types of trees seem independent, and then around 8 meters they exhibit clustering.

3)   i = Remnant, j = NonHost

The Kcross from the last set of analyses, between the remnant trees and the non-hosts, demonstrates a pattern similar to that exhibited by the regeneration trees. After about 4 meters, the trees tend to be clustered. This is an interesting find because if the non-hosts cluster to remnant trees but uninfected trees are independent, then the non-hosts may be playing a role in this. The Jcross plot shows the same: the two types of trees exhibit clustering.

4)   Comparing Kcross Functions with eval.fv()

A useful way to compare patterns of Kcross functions is the eval.fv() function. The titles of each plot tell which Kcross was subtracted from which; note the difference in scales. The first plot shows that the regenerating trees’ spatial pattern relative to remnant trees is very different from the uninfected trees’ pattern. The regenerating trees’ spatial pattern is much more similar to the non-hosts’ spatial pattern at short distances, until about 15 meters; then the patterns differ, with the regenerating trees exhibiting more of a clustering tendency. However, the scale is much smaller than in the other two graphs. Lastly, the third plot shows the difference between the non-host trees’ spatial pattern and the uninfected trees’ spatial pattern. There appears to be a stepwise relationship where, at very near and very far distances, the non-host trees are much more clustered, but at moderate distances the differences may be less dramatic.

### Critique of Cross-Type Functions

The amount of easily interpretable literature on the spatstat package as a whole is sparse, although a wealth of very technical information exists. The functions themselves were easy to use and execute, though, as was the process of creating the point pattern object. These two functions can clearly show how the spatial patterns of two types of events change with scale. It would be helpful if there were a way to compare three or more types of events. The last drawback is the lack of specific information for each point on your map or study site: a pattern that generalizes to a whole set of points may not be as useful when trying to put together a story, such as the story of a stand’s development through time.

The last critique of the cross-type functions led me to attempt a visualization of these patterns on my stand. Very briefly, I determined densities of the infected, uninfected, and the non-host trees using the Kernel Density function in ArcMap. Then I classified these densities using natural breaks and coded these for raster addition. After adding all three density rasters together, I coded each unique density classification combination to tell me how the densities of the populations appeared in the study site.
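One way the “coding for raster addition” step might work (my assumption, not necessarily the exact coding used; class values here are hypothetical) is to multiply each classified density layer by a different power of ten before adding, so every combination of density classes stays distinguishable in the sum:

```python
# density classes 1 = low, 2 = medium, 3 = high (hypothetical values)
infected   = [[1, 3], [2, 1]]
uninfected = [[3, 1], [1, 2]]
nonhost    = [[2, 2], [3, 1]]

# cell-wise "raster addition" after scaling each layer by a power of ten
combo = [
    [i * 100 + u * 10 + n
     for i, u, n in zip(ri, ru, rn)]
    for ri, ru, rn in zip(infected, uninfected, nonhost)
]
# e.g. 132 = low infected, high uninfected, medium non-host density
```

Each unique three-digit code can then be mapped back to a density-combination class for display.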

It appears that there are distinct patches of high density separated by areas of low density. On the eastern side of my study site, the high density areas of infected trees cluster with the remnant infected trees. An interesting interaction is occurring between the high density patches of uninfected trees and infected trees in the western portion of my study site. The mechanism for the seemingly clear divide may be the non-TSHE (non-western hemlock) trees.

# Blackleg classification by segmentation error analysis and visualization

Background

Once again, I am using the confusion matrix to determine the accuracy of segmented classification against the ground truth of manual classification. This work builds on exercise two, where I also used a confusion matrix to determine accuracy. In my previous work I critiqued the inability to draw any relevant conclusions from a sample size of one. Here I have included a sample of five images, each of which has gone through the entire image processing, segmentation, manual classification, and confusion matrix workflow. This allows us to make more relevant comparisons and look for trends in the data. Additionally, I previously noted that the confusion matrix was inaccurate because it was computed over a rectangular region that was simply the extent of the image, not just the leaf. This has been corrected: the confusion matrix is now computed over the entire leaf and no outside area. Lastly, I adjusted the segmentation settings to increase accuracy based on the findings of exercises one and two.

Questions

• How accurate is the segmentation classification?
• Are there commonalities between the false positives and false negatives?
• Are more raster cells being classified as false negatives or false positives when misclassified?
• What is going undetected in segmentation?
• What is most easily detected in segmentation?

Tools and Methods

Many of the steps are repeated from exercise two. I will repeat them here in a more complete and coherent format, along with the new steps added to the process.

ArcGIS Pro

Begin by navigating to the Training Sample Manager. To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it “img_X_leaf” and supply a value (the image #) and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the entire leaf (Images 1-5). Save this training sample and the classification scheme. Save ArcGIS Pro and exit. Reopen ArcGIS Pro and go to Add data in the Layer group on the Map tab to upload the training sample polygon of the leaf. This process must be repeated for each leaf.

The polygon of the leaf will need to be converted to a raster. Click the Analysis tab and find the Geoprocessing group, then open Tools once again. Search for Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygon layer for Input Features, and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize, which I changed to 1 to match the cell size of my original raster. Select Run in the bottom right corner of the pane. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value, and Count, which can be found by going to the Attribute Table for that layer or by clicking on a group of pixels. This must also be completed for each of the polygons created for each leaf outline.

Now we should have three layers for each leaf that we want to export as Tiff files: the “leaf” layer we just created, plus the “manually classified” and “segmented” layers. To export these three, right click on each layer. Start with the “leaf” image and go to Data and Export Raster; a pane will appear on the right. Zoom into the image so there isn’t much area above or below the leaf. Select Clipping Geometry and click on Current Display Extent. This will export the image as a Tiff file, which we will use later. Continue this process with the other two images without changing the extent. Each image exported in this step should then share the same extent, resulting in the same number of pixels and pixel-to-pixel overlap if the layers are placed over one another. This can be confirmed in R in the following steps.

R

Open RStudio and use the following annotated code:

## required packages for rasters
install.packages("raster")
install.packages("rgdal")
install.packages("sp")
library(raster)
library(rgdal)
library(sp)
rm(list=ls()) ## clears variables
img_10_truth <- raster("E:/Exercise1_Geog566/MyProject3/img_10_truth.tif")
img_10_predict <- raster("E:/Exercise1_Geog566/MyProject3/img_10_predict.tif")
img_10_leaf <- raster("E:/Exercise1_Geog566/MyProject3/img_10_leaf_shape.tif")
## view the rasters
img_10_truth ## confirm dimensions, extent, etc. They need to be the same or very close for a confusion matrix
img_10_predict
img_10_leaf
plot(img_10_leaf) ## view the images
plot(img_10_predict)
plot(img_10_truth)
## export data to excel
img_10_leaf_table <- as.data.frame(img_10_leaf, xy=T) ## creates tables
img_10_predict_table <- as.data.frame(img_10_predict, xy=T)
img_10_truth_table <- as.data.frame(img_10_truth, xy=T)
install.packages("xlsx")
library(xlsx)
setwd("E:/Exercise_3/data_tables")
write.table(img_10_leaf_table, file = "img_10_leaf.csv", sep = ",")
write.table(img_10_predict_table, file = "img_10_predict.csv", sep = ",")
write.table(img_10_truth_table, file = "img_10_truth.csv", sep = ",")

##########################################################################################

Excel

At this point the three Tiff files should have been uploaded to R as rasters and undergone a quality check by viewing the images and the data using the supplied code above. Each should then have been converted to a table and exported as a csv file. From here, open all three files in Excel. Begin with the “Truth” and “Predict” files and select the entire column of values associated with each cell. Go to the Home tab and the Editing group, click the drop-down for Sort and Filter, and select A to Z. Change every value, whatever it is, to 1; this is easily accomplished with the Find and Replace function in the Editing group. Replace all the NA’s with 0’s, then go to the left column that simply lists each pixel number and sort it from A to Z so the rows are back in the order they were in when the file was originally opened in Excel. Once this is completed for both the “Predict” and “Truth” files, copy the rightmost column of values from each and paste them into the “Leaf” Excel file. Use the Sort function to sort the leaf values column from A to Z, allowing it to expand the selection. It should be a list of 0’s, and if you scroll down you will eventually begin seeing NA’s in the leaf value column. Delete everything below this point in the “Truth” and “Predict” value columns. Scroll to the top and delete the first four columns: the row numbers, the x and y coordinates, and the leaf values column. You should be left with two columns of 1’s and 0’s only, the “Truth” and “Predict” values. Give each column a label (Truth & Predict), save this document as an xlsx file, and exit.

This Excel step helped us overcome the issue of having values that weren’t in the 0’s and 1’s format needed for the confusion matrix. Here, 1 stands for a diseased patch and 0 for non-diseased. We also removed all the values that were outside the leaf area, so the only values we are working with are those within the leaf and not part of the background. The last step is importing the Excel file that was just created into R to perform the confusion matrix.
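The recoding and masking done in Excel can equivalently be sketched in Python; the toy arrays below stand in for the exported csv columns (values and layout hypothetical):

```python
import math

# Toy stand-ins for the exported csv columns: arbitrary class IDs
# where a pixel was classified, NaN where it was not / outside the polygon
truth   = [math.nan, 2.0, math.nan, 5.0, math.nan, math.nan]
predict = [math.nan, 3.0, math.nan, math.nan, 4.0, math.nan]
leaf    = [1.0, 1.0, 1.0, 1.0, 1.0, math.nan]  # leaf-outline raster

# keep only pixels inside the leaf, then recode: any value -> 1, NaN -> 0
inside = [not math.isnan(v) for v in leaf]
t = [int(not math.isnan(v)) for v, keep in zip(truth, inside) if keep]
p = [int(not math.isnan(v)) for v, keep in zip(predict, inside) if keep]
```

The resulting paired 0/1 vectors are exactly what the confusion matrix consumes in the next step.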

##########################################################################################

## confusion matrix
install.packages("caret")
library(caret)
library(xlsx)
## read the two-column Truth/Predict file created in Excel (path is an example)
img_10_confusion <- read.xlsx("E:/Exercise_3/data_tables/img_10_confusion.xlsx", 1)
img_10_confusion$Predict <- as.factor(img_10_confusion$Predict) ## convert to factors
img_10_confusion$Truth <- as.factor(img_10_confusion$Truth)
confusionMatrix(img_10_confusion$Predict, img_10_confusion$Truth)

You should now have R output of the confusion matrix and associated statistics.

ArcGIS Pro

For the purpose of visualization, I also changed the colors of the three Tiff files. To do this, I simply right clicked on each of the three layers and selected Symbology. A pane appears on the right side of the screen with options to change the Color Scheme. I made the leaf layer green, the segmented regions white, and the manually classified regions black. The snipping tool was used to extract these images.

Results

Across the five images, 15,519 pixels were classified in the confusion matrix: 14,884 were properly classified as non-diseased, 282 as true disease, 307 were false negatives, and 46 were false positives. Overall the segmentation classification had an accuracy of 97.7% with a p-value less than 0.05, indicating statistical significance. The model had a sensitivity of 99.69% and a specificity of 47.88%.
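The reported statistics can be reproduced from the combined pixel counts; note that they are consistent with caret’s default of treating the first factor level (“0”, non-diseased) as the positive class, which is why sensitivity is computed on the non-diseased pixels:

```python
# Combined counts from the five images (from the confusion matrix above)
tn = 14884  # truth non-diseased, predicted non-diseased
tp = 282    # truth diseased, predicted diseased
fn = 307    # truth diseased, predicted non-diseased (false negatives)
fp = 46     # truth non-diseased, predicted diseased (false positives)

total = tn + tp + fn + fp                 # 15,519 pixels
accuracy = (tn + tp) / total              # fraction correctly classified
sensitivity = tn / (tn + fp)              # "0" (non-diseased) as positive class
specificity = tp / (tp + fn)              # diseased pixels correctly found
```

Running these by hand recovers 97.7% accuracy, 99.69% sensitivity, and 47.88% specificity, matching the R output.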

Accuracy between the five images ranged from 94% to 99%, all with significant p-values. The sensitivity for all five individual matrices and the combined matrix was at 99% with no exceptions. Specificity had a fairly large range.

I was happy with the level of accuracy produced throughout all five images and the overall cumulative matrix. The five images varied a bit in complexity: the number of lesions present, leaf size and shape, size of lesions, etc. Looking at the confusion matrix we can see there were many more false negatives than false positives. This indicates the model is possibly not quite sensitive enough to detect all regions, or certain parts, of the disease. The specificity was not consistent between the five images, for reasons that are still unclear to me. As mentioned, although the model may have some issues detecting certain patches, the sensitivity indicated throughout all matrices is very high.

Viewing the images below we can see where the false positives and false negatives occur (Images 1-5). Both tend to appear on the margins of a lesion, when the lesion is detected at all. Additionally, larger patches seem to have a higher rate of detection than smaller ones. This is especially true in Images 1 & 2.

Critique

My only critique is of the process required to get to this step. I was very satisfied with my results but see logistical problems in getting through large numbers of samples. I will attempt to find shortcuts in the process on my next go-around. I will also look into building a model in ArcGIS Pro to automate some of the rudimentary image-processing tasks.

Appendix

Table 1. Confusion matrix of all five samples combined. Values are number of pixels classified out of 15,519 total.

Figure 1. Classification error for the combined matrix. The values are the number of pixels classified out of 15,519 total.

Image 1. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

Image 2. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

Image 3. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

Image 4. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

Image 5. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

# Exercise 2

## Question Asked: Are latitudinal differences in dolphin sightings in the San Diego, CA survey area related to El Niño Southern Oscillation (ENSO) index values on a monthly temporal scale?

1. My previous question for Exercise 1 was: do the numbers of dolphin sightings in the San Diego, CA survey region differ latitudinally? I was finally able to answer this question with a histogram of sighting counts by latitudinal difference. I defined latitudinal difference as the difference, in decimal degrees, between the highest latitude of dolphin sightings (the northernmost sighting point along the San Diego transect line) and each other sighting point; it is therefore a simple subtraction in ArcMap. Smaller differences correspond to more northerly sightings and larger differences to more southerly ones. I used all sightings in the San Diego region (1981 through 2015). As you can see below, there is an unequal distribution of sightings across latitudes. Because I had visual confirmation of differences, at least when all sightings are binned (all years from 1981-2015 treated the same), I looked for a process that could be driving these differences in latitude.

Comparing the Latitudes with the frequency of dolphin sightings in San Diego, CA

ENSO is a large-scale climate phenomenon in which climate modes periodically fluctuate (Sprogis et al. 2018). The climate variability produced by ENSO affects physical oceanic and coastal conditions that can both directly and indirectly influence ecological and biological processes. ENSO can alter food webs because climate changes may impact animal physiology, specifically metabolism, creating further trophic impacts on predator-prey dynamics, often through prey availability (Barber and Chavez 1983). During the surveys of bottlenose dolphins in California, multiple ENSO cycles caused widespread changes in the California Current Ecosystem (CCE), such as the squid fishery collapse (Nezlin, Hamner, and Zeidberg 2002). With this knowledge, I wanted to see whether the frequency of dolphin sightings at different latitudes of the most consistently studied area was driven by ENSO.

## Tool/Approach:

Primarily R Studio, some ArcMap 10.6 and Excel

## Step by Step:

1. For this portion of the analysis, I exported my table of latitudinal differences from the dolphin sightings attribute table in ArcMap 10.6. I saved this as a .csv and imported it into R Studio.
2. Some of the sighting data needed to be changed because R didn’t recognize the dates as dates, rather as factors. This is important in order to join ENSO data by month and year.
3. Meanwhile, I found NOAA data on a publicly available website with months as the columns and years as the rows, giving a matching ENSO index value of 1, 0, or -1 for each month/year combination. A value of 1 is a positive (warm) year, 0 is a neutral year, and -1 is a negative (cold) year. This is a coarse value, because the underlying indices are continuous, but to simplify my question this was the most logical first step.
4. I had to convert the NOAA data into two-column data with the date in one column by MM/YYYY and then the Index value in the other column. After multiple attempts in R studio, I hand-corrected them in Excel. Then, imported this data into R studio.
5. I was then able to tell R to match the sighting date’s month and year to the ENSO data’s month and year, and assign the respective ENSO value. Then I assigned the ENSO values as factors.
6. I created a boxplot to visualize whether the distributions of latitudinal differences differed by ENSO index (see figure illustrating the number of sightings grouped by ENSO index values 1, 0, and -1).
7. Then I ran an ANOVA to see if there was a reportable, strong difference in sighting latitudinal difference and ENSO index value.
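The ANOVA in step 7 was run in R; as a sketch of what the test computes, the one-way F statistic can be written by hand (the three groups of latitudinal differences below are toy, hypothetical numbers, not the survey data):

```python
# Toy latitudinal-difference values keyed by ENSO index (hypothetical)
groups = {
    -1: [0.02, 0.03, 0.01],
     0: [0.04, 0.05, 0.03],
     1: [0.08, 0.07, 0.09],
}

all_vals = [v for g in groups.values() for v in g]
grand = sum(all_vals) / len(all_vals)

# between-group and within-group sums of squares
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2
                 for g in groups.values())
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

k, n = len(groups), len(all_vals)
F = (ss_between / (k - 1)) / (ss_within / (n - k))  # one-way ANOVA F
```

A large F (compared against the F distribution with k−1 and n−k degrees of freedom) corresponds to the small p-value reported below; Tukey HSD then locates which pairs of ENSO levels differ.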

## Results:

From the boxplot, it appears that in warm years (ENSO index of 1), the dolphins are sighted more frequently at lower latitudes, closer to Mexican waters, than in neutral (0) and cold (-1) years. This result is intriguing because I would have expected dolphins to move north in warm years to stay within similar water temperatures. However, warm ENSO years could shift prey availability or nutrients south, which would explain more sightings further south. The ANOVA gave a p-value of <2e-16, providing very strong evidence to reject the null hypothesis of no difference. I followed up with a Tukey HSD and found strong evidence for differences between the 0 and -1, -1 and 1, and 1 and 0 values. Therefore, monthly-scale differences in ENSO index contribute significantly to the differences in sighting latitudes in the San Diego study area.

Tukey HSD output:

Comparison   diff         lwr           upr          p adj
0 - (-1)     0.01161047   0.004250827   0.01897011   0.0006422
1 - (-1)     0.04101170   0.030844193   0.05117920   0.0000000
1 - 0        0.02940123   0.020689737   0.03811272   0.0000000

## Critique of the Method(s):

These methods worked very well for visualizing and finally establishing that, at a broad level, sighting latitude differed by ENSO index value. Data transformation and clean-up were challenging in R and took much longer than I’d expected.

### References:

Barber, Richard T., and Francisco P. Chavez. 1983. “Biological Consequences of El Niño.” Science 222 (4629): 1203–10.

Sprogis, Kate R., Fredrik Christiansen, Moritz Wandres, and Lars Bejder. 2018. “El Niño Southern Oscillation Influences the Abundance and Movements of a Marine Top Predator in Coastal Waters.” Global Change Biology 24 (3): 1085–96. https://doi.org/10.1111/gcb.13892.

# Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

Here I asked if there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and, if so, what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they related to various demographic variables:

• percent of county aged 0-4
• percent of county aged 80+
• percent of county which is female
• median age
• percentage of county residents who have completed high school
• median county income
• percent of county born outside the US
• percent of county who speak a language other than English at home
• percentage of county estimated to be in poverty
• percent of children in a county estimated to be in poverty

To answer these questions, I used the same data outlined in the first exercise blog post, with the newer demographic variables taken from the American Community Survey and published on American FactFinder, which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to 2009 are unavailable, which shortened the window of analysis by a year.

1. Names of analytical tools/approaches used

The methods used to answer these questions were, first, to create an exhaustive general linear model where county Salmonella rates were a function of the demographic variables listed above. A Ljung-Box test was used to assess whether there was evidence of non-zero autocorrelation of the model residuals at various time lags. Following this, an ideal linear model was selected using a stepwise AIC selection process, and then different variance structures were compared by AIC, BIC, and log-likelihood metrics as well as ANOVA testing. After selecting an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model allowing variable-time interaction. A version of this model was later used in geographically weighted regression (GWR) analysis, in which the coefficients are allowed to vary across space in Oregon.

2. Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties, I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because this is longitudinal data measured over time, I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate in one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to evaluate for temporal autocorrelation, tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those normality tests. A simple backward stepwise AIC approach was used on my full model to identify and remove extraneous variables. This worked by removing variables from the full model one at a time, comparing the AIC values between the two models, and continuing this process until removing variables no longer improved the AIC. I then used this model to select an ideal variance structure for Salmonella rates autocorrelated at different time lags. The variance structures compared were: independent variance, compound symmetric, AR(1), MA(1), ARMA(1,1), unstructured, and compound symmetric allowing different variances across time. After AIC, BIC, log-likelihood, and ANOVA testing, an ideal variance structure was determined and the model using this variance structure was evaluated for basic interaction with time. All variables present in the model were allowed to have a time interaction, including time itself (i.e., allowing for a quadratic time trend). Once again AIC, BIC, log-likelihood, and ANOVA testing were used to select the best model.
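The backward elimination loop described above can be sketched generically. This is an illustrative Python sketch, not the R implementation actually used; `toy_aic` is a made-up stand-in for refitting the model and reading off its AIC:

```python
# Sketch of backward stepwise AIC: repeatedly drop the variable whose
# removal lowers the AIC most, until no removal improves the AIC.

def backward_stepwise(variables, aic_fn):
    current = list(variables)
    current_aic = aic_fn(current)
    improved = True
    while improved and current:
        improved = False
        best_vars, best_aic = current, current_aic
        # Try removing each remaining variable in turn.
        for drop in current:
            subset = [v for v in current if v != drop]
            score = aic_fn(subset)
            if score < best_aic:
                best_vars, best_aic = subset, score
                improved = True
        current, current_aic = best_vars, best_aic
    return current

# Toy AIC: pretend only these variables lower the AIC when retained.
useful = {"female", "childpov", "medianage", "year"}
def toy_aic(subset):
    # penalize every variable kept, reward the useful ones
    return 2 * len(subset) - 10 * len(useful.intersection(subset))

kept = backward_stepwise(
    ["female", "childpov", "medianage", "year", "income", "poverty"], toy_aic
)
print(sorted(kept))  # the extraneous variables are dropped
```

The real procedure refits the regression at each step; only the greedy drop-and-compare structure is shown here.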

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

3. Brief Description of Results

Every panel of the lattice plot represents one of Oregon’s 36 counties; panels containing an error did not have any cases of reported Salmonella. Some counties show rates decreasing with time, others show slightly increasing trends, and others show a fairly level rate over time. Clearly there is evidence of a time trend for some counties.

Results of normality testing found that age-adjusted rates of reported Salmonella in Oregon were not normally distributed. For ease of visualization, and to address the failure to meet the normality assumption of linear modeling, I log-transformed the rates of reported Salmonella. Results of the Ljung-Box test with my full model provided evidence of non-zero temporal autocorrelation in the data, and a visual inspection of the lattice plot supports this, with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage + year

Covariance structure comparison:

Covariance Model                                     logLikelihood   AIC   BIC
Independent                                          -431            874   896
Compound Symmetry                                    -423            860   887
AR(1)                                                -427            869   895
MA(1)                                                -427            869   895
ARMA(1,1)                                            -423            862   892
Unstructured                                         -387            859   1017
Compound Symmetry, Different Variance Across Time    -412            854   910

The AIC, BIC, and log-likelihood values were mostly clustered together across the different models. I based my choice primarily on the lowest AIC, because that is how I did variable selection up to this point. This resulted in choosing the compound symmetric model which allows for different variances across time.
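As a sanity check on the table, AIC is a simple function of the log-likelihood and the number of estimated parameters, AIC = 2k - 2*logLik. A small Python sketch, where the parameter counts are inferred from the reported values rather than taken from the original model output:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2*logLik."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logLik."""
    return k * math.log(n) - 2 * loglik

# Parameter counts below are inferred from the reported AIC values,
# not taken from the original model output.
print(aic(-431, 6))  # independent-variance model -> 874
print(aic(-423, 7))  # compound symmetry adds one correlation parameter -> 860
```

The one-parameter difference between the two rows reflects the single within-county correlation parameter that compound symmetry adds over the independent model.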

Next, I built models which allowed for simple interaction with time, meaning that three-way interactions with time were not evaluated. Subsequent ANOVA testing, comparing the different interaction models to each other, to a model with no interaction, and to a model without time, was used to select a final model.

Final model:

logsal ~ %female + %childpov + medianage + year + (medianage*year)

This model follows a 5×5 compound symmetric correlation structure which allows the variance to change over time. Code:

interact_m <- gls(logsal ~ female + childpov + medianage + year + (medianage*year),
                  na.action = na.omit,
                  correlation = corCompSymm(form = ~1|county),
                  weights = varIdent(form = ~1|year),
                  data = alldata)

Within-county standard error (95% CI): 0.92

Estimate Name      Estimate (log scale)   t-value   p-value
Intercept          -759.42                237.79    0.002
% Female           18.06                  4.46      <0.001
% Child Poverty    0.03                   3.27      0.001
Median Age         17.16                  3.11      0.002
Year               0.38                   3.18      0.002
Median Age*Year    -0.01                  -3.12     0.002

Estimates are on the log scale, making them difficult to interpret without exponentiation; however, it can be seen that a one-percent change in the proportion of females in a county, or a one-year change in median age, is associated with a much larger change in rates of reported Salmonella incidence than changes in percent child poverty or in year. Overall, incidence rates of reported Salmonella increased with time, county percentage of females, county percentage of child poverty, and county median age, with a significant protective interaction effect between time and median age.

For my GWR analysis I used a function derived from the rspatial.org website, which looks like:

regfun <- function(x) {
  dat <- alldata[alldata$county == x, ]
  m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data = dat)
  coefficients(m)
}

As can be seen, I retained the same significant variables identified in my time series regression above. GWR in this case allows the coefficient estimates to vary by county.
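The per-county logic of regfun can be illustrated outside R. A Python sketch with synthetic data and a single predictor (the real model has several), showing how fitting each county separately lets the coefficients vary by county:

```python
# Sketch of the per-county regression idea behind regfun: fit a separate
# simple regression for each county so the coefficients can vary by county.
# The data values here are synthetic.

def simple_ols(xs, ys):
    """Closed-form intercept/slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

rows = [
    # (county, year index, log rate) -- synthetic
    ("Benton", 0, 1.0), ("Benton", 1, 1.5), ("Benton", 2, 2.0),
    ("Lane",   0, 2.0), ("Lane",   1, 1.8), ("Lane",   2, 1.6),
]

coefs = {}
for county in {r[0] for r in rows}:
    xs = [r[1] for r in rows if r[0] == county]
    ys = [r[2] for r in rows if r[0] == county]
    coefs[county] = simple_ols(xs, ys)

print(coefs["Benton"][1], coefs["Lane"][1])  # slopes differ by county
```

Collecting one coefficient vector per county is exactly what produces the data frame of estimates that gets mapped in the following step.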

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.

For % Female

For % Child Poverty

For Median Age

For Year

For Median Age * Year

For the most part, the coefficients for county % female, median age, year, and the interaction term clustered close to 0 for most counties. Some counties showed highly positive or negative coefficient estimates, though no consistently high or low counties could be identified. The maps for the coefficients of median age and year are very similar, though I do not have a clear idea why this is the case. The map of the coefficients of child poverty showed the most varied distribution across space. Autocorrelation analysis of the residuals from this GWR model using Moran’s I did not find evidence of a significant non-random spatial pattern.

4. Critique of Methods Used

While the temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation in the data, and prior univariate spatial autocorrelation analysis provided limited evidence of variables being spatially autocorrelated at different spatial lags, I was unable to perform cross-correlation analysis between my outcome and predictor variables. One important note: I do plan on performing this analysis; I just need to figure out how the “ncf” package works in R. One of the more glaring shortcomings of my analysis so far is that I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective, with my model selection criterion being AIC. Basing my decision on other criteria would likely change my covariance structure. A similar argument applies to my variable selection being based on a backward stepwise AIC approach, where other selection criteria or methods would likely leave different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates; rather, they show the demographic characteristics associated with higher reported rates. While this information is useful, it does not identify any direct drivers of disease incidence. Further analysis will be required to see if these populations have different exposures or severity of risky exposures.

# Determining accuracy of segmentation versus ground truth disease patches of Blackleg on Turnip

Background
My research involves the classification of leaf spots on turnip caused by the blackleg pathogen. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves, because disease has a characteristic spectral signature of infection on the leaves. Here I focus on the accuracy of previously determined disease patches through segmentation versus ground truth classification. To do this, a confusion matrix is used, which allows for detection of true positives, true negatives, false positives, and false negatives. All of the image processing took place in ArcGIS Pro; the accuracy assessment was conducted in R.

Questions
How accurate is the computer image classification?
• Can the accuracy be quantified?
• How can the image classification error be visually represented?

Tools and Methodology
I began in ArcGIS Pro to visually represent the false negative and false positive regions of the disease patches produced by the segmentation-based classification. To start, I turned on both layers; the overlap between the two classification methods can be seen in Image 1.

I then went to the symbology of the segmented image and changed the color to white. I placed this layer on top, so it covered all the raster cells of the manually classified patches leaving only the false negatives, seen in Image 2. Next, I went back into the symbology, but for the manually classified image and changed the color scheme to white. I moved this layer to the top and changed the segmented image color scheme back to unique values. I was left with Image 3 showing the false positives. This was an easy way to visualize these disease patches and how well the classification method was working.

Next, I exported a Tiff file of both the manually classified patches and the segmented patches. To ensure each cell between the two layers lined up in R, I had to make sure the extent was the same for both layers when I exported. To do this I right-clicked on the layer, went down to data, and selected export raster. A pane appears on the right side of ArcGIS Pro, where I hit the drop-down arrow for clipping geometry and selected current display. I did this with both layers and had one Tiff file for the computer-classified image from segmentation and one for the manually classified disease patches.

Using the raster, rgdal, and sp packages, I was able to load my two Tiff files in raster format into R. I gave the two files each a name and used the plot function to view the two images. I noticed they both had values associated with each patch on a gradient scale. To correct for this, I converted my two raster layers in R to tables, which provided a coordinate for each cell and the value associated with it. In the image segmentation raster table the values ran from 0 to 2, and in the manually classified table from 1 to 6. All the white space was given ‘N/A’, which was another issue. I used the xlsx package to export my data tables to Excel files. I opened the two files in Excel and sorted smallest to largest. From there I used the replace function to change all the ‘N/A’ values to 0’s and all the values associated with patch pixels to 1’s; the original values were arbitrarily associated with the pixels, and I needed the rasters in 1/0 format. After doing this with both sheets, I pasted the two side by side and deleted the associated coordinates, which were unnecessary because each pixel from the same coordinate in the two layers was in the same row. I saved this Excel file and loaded it into R. I installed the caret package and computed a confusion matrix, which can be seen below in Table 1.
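The Excel round trip (replacing ‘N/A’ with 0 and patch values with 1) could also be scripted. A minimal Python sketch with made-up cell values:

```python
# Sketch: recode raster cell values to binary presence/absence.
# None stands in for the N/A background cells; any patch value becomes 1.
# The values below are made up for illustration.

seg_values    = [None, None, 2, 1, None, 2, None]   # segmented raster cells
ground_values = [None, 3, 3, 1, None, 6, None]      # manually classified cells

def to_binary(values):
    return [0 if v is None else 1 for v in values]

predicted = to_binary(seg_values)
reference = to_binary(ground_values)

# Side-by-side rows, as in the combined spreadsheet: (predicted, reference)
pairs = list(zip(predicted, reference))
print(pairs)  # → [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 1), (0, 0)]
```

Because the two rasters share an extent, position in the list stands in for the shared cell coordinate, so pairing by row is valid.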

Results
The visual representations of my false positive and false negative results can be seen below in Images 2 & 3, with Image 1 for comparison. You can see the false negatives for disease cover a much larger area than the false positives. This may imply that the segmentation is limited in its assessment of diseased area. What it tends to miss is the margins of the disease, but it does a fair job of predicting the center of the disease, where it likely originated and is most severe. To correct this, relaxing the threshold may allow less severe regions of the disease to be classified. Because the segmentation is based on pixel reflectance in the red band, this would mean slightly lowering the threshold value.

Additionally, an entire patch of disease was missed, which can be seen in the right corner of Image 1. Currently, the classification system is set to create segmented patches of only 10 or more pixels. This patch is 9 pixels and therefore just missed the cutoff. Even though it was just shy of this requirement, we are unsure whether the segmentation would have detected a difference in this diseased patch or whether it was also outside the threshold for classification. If it is common for disease patches to be this small, that may be an indicator to lower the minimum patch size to 5 or 6 pixels in the segmentation blocking.

There are multiple steps in the processing of the image where different routes could have been taken and potentially increased classification accuracy. The objective of this classification method is to have as little error as possible, or at least to decide whether you would rather have more false positives than false negatives, vice versa, or equal numbers. Here a greater share of the misclassified cells are false negatives, and the visual assessment of diseased pixels is modest.

To help quantify these images and determine how accurate the model was, a confusion matrix was computed in R, seen in Table 1. The segmented classification identified non-diseased regions very well and did a reasonable overall job of predicting disease that was confirmed with ground truth. Treating disease as the positive class, there were 41 true positives, 55 false negatives, 12 false positives, and 2075 true negatives. The model correctly identified 96.9% of the 2183 total cells. The sensitivity of the model was 42.7%, the precision 77.4%, and the specificity 99.4%. The accuracy was very good for this model and is essentially one minus the percent error. The sensitivity measures the proportion of truly diseased cells that are correctly identified, while the specificity measures the proportion of truly non-diseased cells correctly identified. The precision gives a sense of how many of the cells flagged as diseased really are diseased. The model did well overall, but the low sensitivity points to adjustments that could be made to increase its predictive power.
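These metrics follow directly from the four cell counts. A small Python sketch using the counts above, under the labeling in which disease is the positive class (so disease cells the segmentation missed count as false negatives):

```python
# Confusion-matrix metrics from the four cell counts, treating diseased
# pixels as the positive class.

tp, fn = 41, 55     # diseased cells: found / missed by the segmentation
tn, fp = 2075, 12   # non-diseased cells: correct / wrongly flagged

total = tp + fn + tn + fp
accuracy    = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall for the disease class
precision   = tp / (tp + fp)   # share of flagged cells that are truly diseased
specificity = tn / (tn + fp)

print(total)                  # → 2183
print(round(accuracy, 3))     # → 0.969
print(round(sensitivity, 3))  # → 0.427
print(round(precision, 3))    # → 0.774
print(round(specificity, 3))  # → 0.994
```

Note that which rate is called sensitivity versus specificity flips if the non-diseased background is treated as the positive class, which is caret's default for a 0/1 factor.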

Critique of method
One critique I have is the small sample size. While I only intended to lay down the framework for a stepwise process in disease classification, it supplies results that can hardly be statistically backed. I would like to increase my sample size to five images for part three and look for similarities and differences among the five. I also intend to make some adjustments to the process to try to increase overall accuracy, precision, specificity, and sensitivity. So essentially this critique is the limited conclusion that can be drawn from a sample size of one, and the need to increase that to five for now.

The second critique I have is the number of steps used to get to this point. I would like to find a more manageable way to do the segmentation and image processing. I have found small changes I can make along the way. Ideally, I can use ModelBuilder in ArcGIS Pro, where most of the processing is done; this will streamline the process once I find a way to do it.

One error present in my results concerns the extent used for the confusion matrix. The matrix considers 2183 raster cells, determined by the rectangle defined when exporting a Tiff file from ArcGIS Pro. Many of these cells are not even on the leaf, so the matrix is partly classifying regions outside of the leaf. To correct this, I would need to export a Tiff file clipped to the leaf shape. The confusion matrix results provided are therefore somewhat misleading.

Partner ideas
My partner talked about doing a neighborhood analysis, which could be practical for me to do. She had mentioned doing it in Earth Engine, which I haven’t used, but I could get some help from her. There is a multiple-ring buffer, and I could look at the false positives in this light. She also mentioned using geographically weighted regression. We didn’t discuss much about it, but it seemed like a good regression to perform on my error analysis. Our projects related on some level in data and issues, but at the time we didn’t have any clear resolutions. I will be curious to follow up on our chat, see what type of analysis was performed, and share my results as well.

Appendix

Image 1. Overlap between segmentation on top and manual classification below

Image 2. False negatives after subtracting segmented regions from manually classified.

Image 3. False positives after manually classified cells are subtracted from segmented.

Table 1. Confusion matrix

R Code

install.packages("raster")
install.packages("rgdal")
install.packages("sp")

library(raster)
library(rgdal)
library(sp)

img20_seg <- raster("E:/Exercise1_Geog566/MyProject3/RasterT_afr7_Polygon_1.tif")

img20_ground <- raster("E:/Exercise1_Geog566/MyProject3/diseased_20_PolygonT_1.tif")

img20_seg
img20_ground

plot(img20_ground)
plot(img20_seg)

##export data

raster.table <- as.data.frame(img20_seg, xy=T)
truth <- as.data.frame(img20_ground, xy=T)

install.packages("xlsx")
library(xlsx)
setwd("E:/")

tableimg <- raster.table

write.table(tableimg, file = "dataexport.csv", sep = ",")
write.table(truth, file = "truth1.csv", sep = ",")

##confusion matrix

# read the combined predicted/reference 0-1 table back in
# (the file name here is a placeholder for the merged spreadsheet)
confusionM <- read.csv("E:/confusionM.csv")

install.packages("caret")
library(caret)

table(confusionM$reference, confusionM$predicted)

# caret expects factors, with predictions first and reference second
confusionMatrix(as.factor(confusionM$predicted), as.factor(confusionM$reference))

##For the confusion matrix if the above doesn't work

myconfusionM <- table(confusionM$predicted, confusionM$reference)
print(myconfusionM)

##accuracy of matrix

2075+41+12+55 #total cells
(2075+41)/2183 #correctly classified/total = accuracy
(12+55)/2183 #misclassified/total

##Sensitivity (diseased cells found / all truly diseased)

41/(41+55)

##Precision (diseased cells found / all cells flagged as diseased)

41/(41+12)

##Specificity (non-diseased correct / all truly non-diseased)

2075/(2075+12)

# Ex2: Incremental pixel greenness while moving away from refugee settlement boundaries

Exercise 2: Incremental pixel greenness while moving away from refugee settlement boundaries

• The question I asked centered on how pixel greenness / NDVI varied in buffered increments around settlements within the Bidi Bidi, Imvepi, and Rhino refugee camps. I wanted to compare the settlements in Bidi Bidi and Imvepi, which tend to be larger, to those in Rhino, which tend to be smaller and more uniformly spaced and spread out. Given the more uniform and wider spacing in Rhino, I expect that green-up will happen more quickly there than around the settlements in Bidi Bidi and Imvepi, which are closer together and vary in size and development pattern. This leads me to believe that there is more cleared or available land in Rhino and that settlement size is based on political organization of camp blocks rather than natural boundaries that might already exist. One of the questions I am asking with these settlements and the geography of exclusion is essentially why different settlement areas, and parts of these areas, are included or excluded from global settlement datasets. One of the factors that contributes to this is a spectral and spatial distinction: how might the green space in and around a settlement change as you move away from said settlement? With this exercise, I wanted to compare the settlements in Rhino to the settlements in Bidi Bidi and Imvepi to see if the pixel greenness changed at a different rate or in a different pattern as one moved away from the settlement center.

Map of Regions of Interest & Buffers

• I used multiple tools: the Multi-Ring Buffer tool in QGIS (since I was working with exported geoJSONs), Earth Engine to extract mean NDVI values from Landsat 8 satellite images at these settlement locations, and the smoothing function in ggplot in R to plot and statistically examine how NDVI changed.
• I first needed to buffer the regions of interest that I wanted to study, of which there were 44 in Bidi Bidi and Imvepi and 41 in Rhino. I performed this buffering in QGIS using the Multi-Ring Buffer plugin and made buffers at 100-meter increments from 0 to 1000 meters. Some of the buffers overlapped, but for the sake of simplicity I ignored this. Within Earth Engine, I pulled Landsat 8 images from 2018 that covered these settlements. After adding NDVI and NDBI calculated bands to the collection of 2018 images, I performed a quality mosaic to compress the image collection into one image. I based this quality mosaic on NDVI, meaning that the pixel chosen from the image collection would be the pixel with the highest NDVI. While this can pull pixels from different dates, it excludes the possibility of clouds and seasonality affecting the dataset by comparing just the most vegetated pixel that occurred in each area. If I were to re-do this, I might choose a single-date image to capture phenological nuance. After reducing the image across all of the buffers (that is, calculating the mean NDVI within each buffer), I exported the geoJSONs, brought them back into QGIS, ensured that there was a spatial join linking all of the buffers and regions of interest together, and brought this data into R to plot the NDVI change over distance for all regions of interest.
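The buffer-and-reduce step can be illustrated on a toy grid. A Python sketch (synthetic NDVI values, not the Landsat data) computing the mean pixel value in 100 m rings around a settlement centre:

```python
# Sketch of the multi-ring buffer summary: mean pixel value in 100 m
# annuli around a settlement centre, on a synthetic NDVI grid.
import math

CELL = 30.0  # Landsat 8 pixel size in metres

def ring_means(grid, center, ring_width=100.0, n_rings=10):
    """Mean grid value within each concentric distance ring."""
    sums = [0.0] * n_rings
    counts = [0] * n_rings
    cr, cc = center
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            d = CELL * math.hypot(r - cr, c - cc)
            ring = int(d // ring_width)
            if ring < n_rings:
                sums[ring] += value
                counts[ring] += 1
    return [s / n if n else None for s, n in zip(sums, counts)]

# Synthetic grid: NDVI rises with distance from the centre cell,
# mimicking green-up away from a cleared settlement.
size, centre = 41, (20, 20)
grid = [
    [min(0.8, 0.1 + 0.001 * CELL * math.hypot(r - 20, c - 20))
     for c in range(size)]
    for r in range(size)
]

means = ring_means(grid, centre)
print([round(m, 2) for m in means if m is not None])  # increases ring by ring
```

The real analysis reduces a quality-mosaicked image over buffer polygons in Earth Engine; this sketch only shows the distance-binned averaging that the buffers implement.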

Bidi Bidi Settlement; Imveppi Settlement Buffers

Rhino Settlement Buffers

• The pattern of greening in the buffers around settlements in Rhino versus Imvepi and Bidi Bidi did differ, though the differences were not dramatic. It appears that the Rhino settlements had a faster rate of increase in greenness while moving away from the settlements, especially in the first 500 meters, whereas Bidi Bidi and Imvepi showed a more gradual green-up, although there also seems to be a small shift at 500 meters. These results are somewhat expected, but not very drastic. It would be interesting to see how the green-up changes if I increased my buffer extent or decreased my buffer increments.
• I think that looking at NDVI in buffers was an interesting approach, but as I said above, my choice of pixel quality selection (highest NDVI) could bias an otherwise neutral selection of data. Also, the buffer increments I chose were relatively arbitrary: I chose equal intervals, but this means that when the mean NDVI is calculated, the mean is taken across a larger area as each buffer gets further from the center. I could also try testing with larger buffers (200- or 500-meter increments) that extend beyond 1000 meters from the settlement edge. Further, some of my buffers overlapped and encroached on other actual boundaries, meaning the buffers sometimes contained pixels from other identified settlements. For this reason, I chose to present the data as a smoothed trend. I will likely need to fix some of these errors for the final project, because I do think this is a typical and useful spatial analysis for this type of data, and some of the errors are relatively easy to fix and would show stronger data integrity.

# Exercise 2: Geographically weighted regression on two forested stands.

Bryan Begay

1. Initial Spatial Question: How does the spatial arrangement of trees relate to forest aesthetics in my areas of interest?

Context:

To understand forest aesthetics in my stand, called Saddleback, I did a Ripley’s K analysis on Saddleback and on a riparian stand called Baker Creek to determine whether the stands are clustered or dispersed. The Baker Creek location is a mile west of the Saddleback stand.

2. Geographically Weighted Regression:

I performed a geographically weighted regression on both the Saddleback and Baker Creek stands. The dependent variable was a density raster value and the explanatory variable was tree height.

3. Tools and Workflow

Figure 1. The workflow for creating the Geographically Weighted Regression for the Saddleback Stand. The Baker Creek stand followed the same workflow as well.

Results:

Figure 2. Geographically weighted regression showing the explanatory-variable coefficients in the Saddleback and Baker Creek stands near Corvallis, Oregon. Yellow indicates negative relationships and hotter colors indicate positive relationships between tree height and density.

Figure 3. Geographically weighted regression showing the local R2 values in the Saddleback and Baker Creek stands near Corvallis, Oregon. Yellow indicates that the local model is performing poorly, while hotter colors indicate better local performance.

Table 1. Summary table output for the Saddleback stand’s geographically weighted regression.

Table 2. Summary table output for the Baker Creek stand’s geographically weighted regression.

4. Interpretation/Discussion:

Having done the Ripley’s K analysis, I wanted this exercise to connect with it, so I created a point density raster for both of my stands (Figure 1). The point density raster calculates a magnitude-per-unit-area from my tree points and outputs a density for the neighborhood around each tree point. The raster values are then a descriptor of each tree’s neighborhood density. Having the neighborhood density values describes the stand’s tree spatial arrangement and relates to the Ripley’s K outputs, which tell whether a stand is spatially clustered or dispersed.
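The point density calculation can be sketched directly: for each tree, count the neighbours within a fixed radius and divide by the neighbourhood area. A Python sketch with made-up coordinates:

```python
# Sketch of the point-density idea behind the raster: neighbours within a
# fixed radius per unit of neighbourhood area. Coordinates are made up.
import math

def point_density(points, radius):
    area = math.pi * radius ** 2
    densities = []
    for (x0, y0) in points:
        # count points (including the tree itself) within the radius
        n = sum(
            1 for (x, y) in points
            if math.hypot(x - x0, y - y0) <= radius
        )
        densities.append(n / area)
    return densities

# A tight cluster of trees plus one isolated tree.
trees = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50)]
d = point_density(trees, radius=5.0)

print(d[0] > d[-1])  # → True: clustered trees get higher neighbourhood density
```

The ArcMap tool does the same counting over raster cells with a kernel neighbourhood, which is why its output mirrors what Ripley's K reports about clustering versus dispersion.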

Figure 2 shows that there is a spatial pattern in the Saddleback stand between density and height. There is a positive relationship on the edges of the stand and a decreasing relationship in the middle of the stand. This makes sense when thinking about how the stand would have denser and taller trees on the edges of the managed stand to screen the forest operations. The coefficient values for Baker Creek showed a positive relationship in the northeastern portion of the stand, which would need further investigation to understand. Overall the relationship between density and height was negative in the Baker Creek stand, but this may be attributed to the low local R2 values that indicate poor model performance (Figure 3). Table 2 also shows that the Baker Creek model accounted for only 50% of the variance (adjusted R2), which suggests that more variables would be needed for the riparian stand. Table 1 shows the summary output for the GWR in the Saddleback stand.

4. Critiques

One critique of this exercise is that I only looked at height and density. If I had more knowledge of working with LAS datasets, I would have liked to implement the LiDAR return values as an indicator of density. Another critique is that I used density as the dependent variable and height as the explanatory variable. Using density as the dependent variable lets me see the spatial patterning of my trees when plotted in ArcMap, so I can reference the Ripley’s K outputs for further analysis. However, having height as the response variable with density as an explanatory variable would have been easier for me to understand and explain: density can affect tree height in a stand, but understanding tree height as a factor that affects density is not as intuitive. Looking at how tree height responds to density in my stand would tell something about tree height, but that relationship has already been explored in great depth.