Monthly Archives: June 2019

Nearest Neighbor Analysis on Woodpecker Nest Locations Within Salvage Logging Units

Exercise 3: Nearest Neighbor Analysis


How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)


Average Nearest Neighbor in ArcMap

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.

Near Distance in ArcMap

The Near tool measures near distances between one set of target features and another set of target features. It produces additional columns in the original shapefile containing distance measurements in units of the user’s choosing. The user can enable a location option displaying X and Y distances individually.


For this analysis, I used 2016 and 2017 woodpecker nest point datasets clipped to each RMRS survey unit. I also used a polygon shapefile of 35 salvage harvest units within RMRS woodpecker survey units. I used another polgyon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Nearest Neighbor Analysis Steps

  1. Export 2016 and 2017 nest points as one nest shapefile per RMRS survey unit for inputs into the Average Nearest Neighbor tool.
  2. Export salvage treatment polygons into three shapefiles for treatments 1, 2, and 3.
  3. Run the Average Nearest Neighbor tool in ArcMap with multiple inputs. I ran the tool on each 2016 and 2017 nest shapefile per survey unit, all 2016 and 2017 nests, all salvage units, and each salvage treatment type shapefile (see table below).
  4. Run the Near tool to produce near distances in meters for the distance of each nest to a salvage unit. I ran this tool for 2016 and 2017 nests using salvage unit centroids and salvage unit polygons, which produced different results as expected.
  5. Use Excel to create an Average Nearest Neighbor table for comparing 2016 and 2017 results.
  6. Use ggplot in R to plot near distance graphs for 2016 and 2017. These graphs display near distance distributions for nest points to salvage unit centroids and near distances for nest points to salvage polygon boundaries.


Near Distances

I produced boxplots displaying near distance distributions for nest points. Sets of graphs are featured with and without the control nests for different visual interpretations.

Near Distances for Nest Points to Salvage Centroids (With Control Nests)

Near Distances for Nest Points to Salvage Centroids (Without Control Nests)

Near Distances for Nest Points to Salvage Polygons (With Control Nests)

Near Distances for Nest Points to Salvage Polygons (Without Control Nests)

Nearest Neighbor Results Table

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. The NN Ratio is a threshold for expected clustering or dispersion. NN Ratio values less than 1 indicate clustering and NN Ratio values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells indicate a decrease in value. NN Ratio cells with an increased value in 2017 indicate units where nests increased dispersion. NN Ratio cells with a decreased value in 2017 indicate units where nests increased clustering. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units experienced increased or present dispersion in 2017. The “All Nests” and salvage shapefile results were produced for comparisons and to evaluate how clustering/dispersion within datasets may affect clustering/dispersion of other datasets.

With these two techniques creating near distance values and a nearest neighbor table, we can implement a refined nearest neighbor analysis. This analysis examines nearest neighbor distances in combination with distances to the study boundary. Refined nearest neighbor uses standardized formulas integrating the distribution of observed nearest neighbor distances and the distribution of expected (random) nearest neighbor distances while factoring in distance to unit boundaries. In the table above, the NN Expected and NN Observed values will aid  refined nearest neighbor results.

Problems and Critique

Did this nearest neighbor analysis account for the target area size and shape? In the ArcMap Average Nearest Neighbor help documents, these assumptions are made:

  • “The equations used to calculate the average nearest neighbor distance index (1) and z-score (4) are based on the assumption that the points being measured are free to locate anywhere within the study area (for example, there are no barriers, and all cases or features are located independently of one another).”
  • “The average nearest neighbor method is very sensitive to the Area value (small changes in the Area parameter value can result in considerable changes in the z-score and p-value results). Consequently, the Average Nearest Neighbor tool is most effective for comparing different features in a fixed study area.”

Because of this, I’m not sure whether I can interpret these results properly based on the study design. We are surveying specially designated study units, not the entire burn area. Therefore, our sample calculations should not assume the entire area is available for clustering or dispersion. There is an option in ArcMap to input “Area” as an integer but not as a polygon shapefile (for survey and salvage units). To control for this, I calculated Average Nearest Neighbor on nest shapefiles for each survey unit instead of all nests together. However, as shown in the table above I also calculated Average Nearest Neighbor for all nests and different combinations of salvage units. These results may prove less reliable.

I would like to extend this analysis to a nearest neighbor technique adjusting for both the clustering/dispersion of the nests and the clustering/dispersion of the salvage units and survey units at the same time. The predetermined study design continues to require special consideration.

Multiple Buffer Distance Analysis on Woodpecker Nest Buffers Intersecting Salvage Harvest Areas

Exercise 2: Multiple Buffer Distance Analysis


How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)


Multiple Ring Buffer in ArcMap

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.

Above: Buffers created in a test analysis at 15 – 50 meter intervals around nest point features.


For this exercise I used 2016 and 2017 woodpecker nest point shapefiles. I created multiple ring buffer outputs for each shapefile to use in the analysis. I also used a polygon shapefile of 35 salvage harvest units included within the treatment woodpecker survey units. I used another polgyon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Multiple Buffer Distance Analysis Steps

  1. Create separate point shapefiles for 2016 and 2017 nest points.
  2. Run the Multiple Ring Buffer tool in ArcMap on each shapefile at 50 meter intervals from 50 – 300 meters. The tool creates
  3. Use the Intersect tool in ArcMap to create a polygon shapefile of the buffers clipped to the salvage harvest units.
  4. Use the Dissolve tool in ArcMap to merge overlapping buffer polygons of the same type for the same nest.

Above: The green polygons indicate areas where the Intersect tool identified overlap between the buffers and salvage unit polygons. The tool creates a polygon shapefile of the overlapping areas.

Above: The final result of six buffers at 50 m intervals around nest points intersected with salvage harvest units. Each color represents an individual polygon in a shapefile. The polygons are given size information and used to calculate percent overlap of nest buffers with salvage units.

5. Use the XTools Pro extension to attribute size information to each intersected buffer shapefile for area in square meters.

6. Create a new field in each intersected buffer attribute table for percent area of the buffer intersected by a salvage unit.

7. Use the field calculator to create a formula for percent buffer area intersected: (Area field/Complete buffer size)*100

8. Export each intersected buffer table to Excel, including the salvage treatment type column and the percent buffer area intersected column for further analysis. Transfer the salvage treatment type and corresponding percent buffer area intersected columns to a combined Excel file with a sheet for every buffer distance.

9. Add zero values to each sheet for the nests falling within a control unit. The control nests act as a fourth treatment and should be included in the results.

10. Use ggplot2 in R to extract data from each sheet and create box plots for each buffer distance.

Above: Excel table of X and Y inputs for boxplots showing percent of the complete buffer intersected by a salvage harvest unit. Each buffer distance from 50 – 300 m has a sheet containing columns for these values.


I generated graphs for each of the six buffer intervals (50-300 m at 50 m intervals) in 2016 (pre-salvage) and 2017 (post-salvage)  for 12 total graphs. I presented the 2016 and 2017 results for each buffer distance side by side below. In each graph, the unlabeled grouping of points to the left of salvage treatment type 1 represents nests in the control area, or nests 100% inside salvage treatment 0. Visual analysis reveals interesting patterns about woodpecker nest site selection when considering the silvicultural prescriptions designating salvage treatment types. Below is a reminder of the salvage treatments:

Treatment 1 harvests the most trees overall but retains the most large diameter trees with spacing for Lewis’s woodpeckers. Treatment 2 harvests a moderate number of trees, retaining less large diameter trees but more medium and small diameter trees. Treatment 3 harvests a limited number of trees but retains barely any large diameter with a heavy focus on small diameter for white-headed woodpeckers. The control treatments do not harvest any trees.

An obvious downfall of the following graphs is that as distance from the nest increases, percent of the nest buffer intersecting a salvage polygon will decrease. There is an overall decrease in mean and midrange values for intersecting area as the buffer distance increases. However, some significant points emerge:

  • Nest distribution in Treatment 1 units generally increases in breadth between 2016 and 2017. Meaning, in 2016 the distribution of woodpecker nest distances from a salvage unit centered more closely around a mean distance near 30 – 50%. In 2017 the values appear to spread out with less preference towards a specific distance from a salvage unit.
  • Nests in Treatments 2 and 3 units generally decrease in percent area intersected by a salvage unit from 2016 to 2017. Meaning, after the salvage harvest, woodpeckers are selecting nest sites farther from Treatments 2 and 3.

50 Meter Buffer Results


100 Meter Buffer Results


150 Meter Buffer Results


200 Meter Buffer Results



250 Meter Buffer Results


300 Meter Buffer Results

Problems and Critique

I would perform this analysis again with larger buffer sizes and greater buffer intervals. I am not convinced the buffer size I chose for this exercise captured significant information at this scale. I created 10 buffers from 100 – 1000 m for a future analysis. Buffer size should be determined by assumed travel and foraging distances for each woodpecker species. The 50 – 300 m scale may not be a great enough distance for local trends to develop in the data. I chose these distances because the belt transects for the woodpecker point count surveys are 200 – 300 m apart to avoid interfering with birds on neighboring transects. I thought this would be a comparable scale for the buffer analysis.

This analysis fails to address or quantify the control nests because they are too far away from salvage units for their buffers to intersect. In Exercise 3, a near analysis in ArcMap can quantify trends in control nest distances from salvage areas.

In the future I would like to produce a statistics table for each 2016 and 2017 buffer distance displaying mean, standard deviation, and other metrics for more than a visual analysis.

Using LiDAR data to assess forest aesthetics in McDonald-Dunn Forest

Bryan Begay



  1. Question asked?

How does forest aesthetics vary depending on forest structure as a result of active management that retains vegetation and landforms?

In order to answer my question I would look at two stands in the McDonald-Dunn forest to do some analysis on how their forest structure is related to forest aesthetics. The first stand is Saddleback which had been logged in 1999 and shows signs of management. The second stand was identified near Baker Creek, and is 1 mile west of Saddleback. Baker Creek was chose for its riparian characteristics, as well as having no signs of management activates.

  1. A description of the data set.

The LiDAR data that I used with my initial analysis was 2008 DOGAMI LiDAR flown over the McDonald-Dunn forest. The RMSE for this data set was 1.5cm. The DEM data set I used was from 2009 has a RMSE of 0.3 m. A Canopy Height Model (CHM) was made in RStudio lidR package that used a digital surface model with a 1 meter resolution. The CHM was used to create an individual tree segmentation, where segmented trees were then converted to point data.

A link to a visualization of the raw point cloud that is georeferenced to its terrain.

  1. Hypotheses: predictions of patterns and processes you looked for.

I suspected that initially the Baker Creek stand would have higher forest aesthetic that would reflect in the stand’s unmanaged vegetation structure.


Since the Saddleback had been managed and cut I figured the more natural structure of the riparian stand would have generally higher forest aesthetics than a stand that has been altered by anthropogenic factors. Some processes that I hypothesized that relates to forest aesthetics to these stands was the spatial point pattern of trees could be related to forest aesthetics. Insert forest aesthetic link:

  1. Approaches: analysis approaches you used.

Exercise 1: Ripley’s K Analysis point pattern analysis

The steps taken to create a point pattern analysis was to identify individual trees and convert the trees into point data. The RStudio lidR package was used to create a Canopy Height Model and then an Individual tree segmentation. Rasters and shapefiles were create to export the data so I could then use the tree polygons to identify tree points. The spatstat  package was used in RStudio as well to perform a Ripley’s K analysis on the point data.

Figure 1. Individual Tree Segmentation using watershed algorithm on Saddleback stand.

Exercise 2: Geographically weighted Regression

The steps taken to do the geographically weighted regression included using the polyongs created from the individual tree segmentation to delineate tree centers. When tree points were created from the centroids of the polygons, which would be inputs for the GWR in ArcMap. A density raster and CHM raster had their data extracted to the point data so that density and tree height could be the variables used in the regression. Tree height was the explanatory variable and density was the independent variable.

Figure 2. Polygon output from Individual tree segmentation using the lidR package. The Watershed algorithm was the means of segmentation, and points were created from polygon centroids in ArcMap.

Exercise 3: Supervised Classification

This analysis involved creating a supervised classification by using training data from NAIP imagery and a maximum likelihood classification algorithm. It involved using the NIR band and creating a false color image that would show the difference spectral reflectance values from conifers and deciduous trees. I used a histogram stretch to visualize the imagery better and spent time gathering quality training data. I then created a confusion matrix by using accuracy points on the training data. I then clipped the thematic map outputs with my individual tree segmentation polygons to show how each tree had their pixels assigned.

  1. Results

The Ripley’s K analysis in ArcMap showed me that Saddleback stand’s trees are dispersed, and the Baker Creek stand’s trees were spatially clustered. GWR outputs told me that the model in the Saddle back stand showed me a map output where tree heights and density were positively related. The adjusted R2 was 0.69 and gave me a good output that showed me the tallest and densest trees were on the edges of Saddleback stand. The Baker Creek stand’s model performed poorly on the point data with an adjusted R2 of 0.5. The outputs only showed relationships could only be modeled on the upper left of the stand. The classified image worked well on Saddleback stand due to less distortion in the NAIP imagery on that stand, and the Baker Creek stand’s classification was not useful since it had significant distortion in the NAIP imagery.

Exercise 1:

Figure 3. ArcMap Ripley’s K function output for Saddleback stand assessing tree points.

Exercise 2:

Figure 4. Geographically weighted regression of Baker Creek and Saddleback stand. The Hotter colors indicate positive relationships between tree density and tree height.

Exercise 3.

Figure 5. Supervised image classification using a maximum likelihood algorithm on Saddleback stand.

  1. What did you learn from your results? How are these results important to science? to resource managers?

I learned that Ripley’s K outputs can differ depending on what packages used. R-studio Ripley’s K outputs told me that both my stands had clustered tree patterning. ArcMap outputs that made more sense told me that my Saddleback stand was actually dispersed. Outputs can be variable if inputs are not explicitly understood or modeled with enough care. I also learned that trying to model a very heterogeneous riparian stand is more difficult because of the variability. This is important for researchers who are interest in riparian areas like Baker Creek since they might need to have more variables to adequately model those stands.

  1. Your learning: what did you learn about software?

I became very familiar with processing and modelling with LiDAR point clouds. I also became familiar with Modelbuilder and learned how to use packages in R like Spatstat. I also found a new method for making a confusion matrix in ArcMap.

  1. What did you learn about statistics or other techniques?

I learned how to do point pattern analysis with Ripley’s K on tree points. This was done in R and in Arc. In Arc using the spatial statistics tool was also something I used and still plan to use. When using GWR I understood what it does, understood the outputs, and learned to properly interpret the results. I also became more concerned with issues of scale and networks that might affect my areas of interest.

Geospatial Analyses Examining Woodpecker Nest Site Selection and Postfire Salvage Logging

Research Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

After evaluating my data with several spatial analyses methods, this research question still seems applicable to my project goals. A possible revision to this research question is to include phrasing about the internal spatial characteristics of the nest and salvage harvest location datasets aside from their relationship to each other.


This project uses postfire datasets targeting the 2015 Canyon Creek fire complex on the Malheur National Forest in eastern Oregon. A salvage logging operation occurred in the burn area in July 2016. My geospatial analyses examine the relationships between woodpecker nest site locations and salvage harvest unit locations before and after the logging occured. This will lead to correlating woodpecker population dynamics with salvage treatment effects. Woodpeckers are an ideal indicator species for postfire ecosystem monitoring due to their heavy reliance on burned snags for bark-beetle foraging and cavity-nesting. Their population dynamics indicate habitat quality for all snag-dependent wildlife. Previous research indicates woodpeckers exhibiting species-specific habitat preferences toward burned forest variables like stand density, diameter class, height, and snag decay class. 2016 pre-salvage and 2017 post-salvage lidar datasets are in processing for eventual correlation between these 3D forest structure variables and woodpecker nest site selection before and after harvest. However, the spatial analyses featured in this project focused on the intial task of relating 2D nest and salvage unit locations.

My research is in conjunction with a Rocky Mountain Research Station study examining salvage logging effects on three woodpecker species in the Canyon Creek complex. In 2016 and 2017 I led crews on this project collecting extensive postfire woodpecker occupancy and nest productivity datasets for black-backed, white-headed, and Lewis’s woodpecker populations. This resulted in a 148-nest dataset for 2016 and 2017, representing woodpecker populations before and after salvage logging. A polygon shapefile outlining ten RMRS woodpecker point count survey units serves as the area of interest (6 treatment, 4 control). Within the 6 treatment units, another polygon shapefile outlining 35 salvage harvest units indicates treatment areas. Three silvicultural prescriptions replicating optimal habitat types for each woodpecker species designate target salvage variables like average post-harvest stand density and diameter class. Each salvage unit adheres to one of these three harvest prescriptions. Supplementary geospatial data includes a 2015 WorldView-3 1 m raster and ArcGIS basemaps.

Image result for canyon creek fire oregon

Above: The 2015 Canyon Creek Fire burning near John Day, OR.

Image result for black-backed woodpecker              Image result for white headed woodpecker             Image result for lewis's woodpecker

Black-backed woodpecker (Picodies arcticus)     White-headed woodpecker (Picoides albolarvatus)                       Lewis’s woodpecker (Melanerpes lewis)

Above: The three target woodpecker species. All species are Oregon Sensitive Species. The white-headed woodpecker is a U.S. Fish and Wildlife Species of Concern.

Above: Survey units and corresponding salvage treatments for each unit. Each treatment type is based on a silvicultural prescription designed to benefit target woodpecker species. Based on previous research, each target species demonstrates preferred habitat characteristics in variables like stand density, diameter class, height, and decay class.

Above: The three salvage treatment types and the control treatment depicted as overhead stem maps. Each polygon represents a snag of indicated diameter class. The salvage treatments retain more trees of smaller diameter moving from Treatment 1 to Treatment 3, with the control units retaining all trees.

Above: A salvage treatment in the Crazy Creek woodpecker survey unit (Treatment 1).

Above: An early project map showing rough locations of Rocky Mountain Research Station woodpecker survey units. Since this map was created, two more survey units were added for 10 total (6 treatment, 4 control). This map effectively shows the woodpecker survey units occupying areas of highest burn severity.

Above: The Canyon Creek fire complex as a false color WorldView-3 1 m raster. The area of interest includes 10 survey units in blue, labeled with yellow text (6 treatment, 4 control). This visual orients the survey units to an area in eastern Oregon southeast of John Day and Canyon City. The false color image displays healthy vegetation as red, with the darkest areas displaying high burn severity. The survey units are found within some of the highest burn severity areas in the fire complex.

Above: A close-up of the 35 salvage treatment polygons outlined in red and labeled with white text. Control units lack red salvage polygons. This image does not include Overholt Creek.

Above: A subset of the 2016 (orange) and 2017 (pink) nest points featuring survey (gold) and salvage unit (green) polygons.


I expected to see dispersed nests in 2016 with possible trends indicating species habitat preferences. Previous research indicates species-specific preferences for certain forest habitat variables. Black-backed woodpeckers prefer dense, small-diameter stands for foraging and nest excavation. White-headed woodpeckers prefer a mosaic of live and dead variable density and diameter stands for pine cone foraging. Lewis’s woodpeckers prefer tall medium to large-diameter stands for aerial foraging maneuvers. I expected to see nest sites in both years clustered in areas with these forest structures. In 2017 I also expected to see nest sites clustered near salvage treatments implemented for each species. Overall I expected the control units to exhibit nest dispersal and high woodpecker activity.


Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point data clustering over a range of distances. Ripley’s K indicates how spatial clustering or dispersion changes with neighborhood size. If the user specifies distances to evaluate, starting distance, or distance increment, Ripley’s K identifies the number of neighboring features associated with each point if the neighboring features are closer than the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors for a particular evaluation distance is higher/larger than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.

K Function Graphic










Referenced from ArcGIS Desktop 9.3 Help (

Exercise 2: Multiple Buffer Distance Analysis

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.

Exercise 3: Nearest Neighbor

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.


Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)





Above: Notable clustering results from Ripley’s K analysis. Most sample sizes were too small to produce significant results, but Big Canyon and Crawford Gulch in 2016 and Lower Fawn Creek in 2017 showed evidence of clustering.

Exercise 2: Multiple Buffer Distance Analysis

Above: The difference in results between 50 and 100 meter buffer distances for the 2016 nest dataset demonstrate the utility of graphs created from buffer analysis. These results indicate that woodpeckers consistently choose to place their nests in areas completely surrounded by salvage treatment 3. The results also indicate that woodpeckers consistently place their nests within varying distances of salvage treatment 1 with trends indicating average nest placement near but not completely within those salvage treatments.

Exercise 3: Nearest Neighbor

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. The NN Ratio is a threshold for expected clustering or dispersion. NN Ratio values less than 1 indicate clustering and NN Ratio values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells indicate a decrease in value. NN Ratio cells with an increased value in 2017 indicate units where nests increased dispersion. NN Ratio cells with a decreased value in 2017 indicate units where nests increased clustering. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units experienced increased or present dispersion in 2017.


I am processing two lidar datasets of the study area from 2016 and 2017. These datasets were acquired before and after the salvage logging treatments occurred. I will produce forest metrics such as stand density, diameter class, and height in salvage and survey units. I will then correlate geospatial and statistical properties of the nest datasets to quantified forest variables affecting woodpecker nest site selection. I will examine trends between 2016 and 2017 nest selection to understand the effects of harvest treatments on woodpecker populations. At least two more years of woodpecker data will exist for 2018 and 2019, so future research will add these datasets to the analyses. I would like to see a machine learning algorithm developed from this study that could predict areas of optimal habitat suitability for snag-dependent wildlife species. Postfire wildlife habitat prediction will be crucial to resource managers as wildfires increase in the coming decades alongside accelerated landscape restoration.

This spatial problem is important to science and resource managers as climate change amplifies wildfire effects. Using 3D remote sensing datasets for resource management is the future trend across all disciplines. Increased wildfire activity around the world necessitates cutting-edge methods for active fire and postfire ecosystem quantification. In the Blue Mountains ecoregion in eastern Oregon, Rocky Mountain Research Station, Malheur National Forest, and Blue Mountains Forest Partners rely on this project’s lidar quantification for their research and management decisions. Determining acceptable levels of salvage harvest for wildlife habitat affects whether government agencies and rural economies in this region will allow small businesses to profit from sustainable harvest operations. Science overall will benefit from the continued investigation of wildlife response to anthropogenic disturbance, specifically postfire forest management decisions and the controversial practice of salvage logging.

Learning Software

For my analyses I used mostly ArcMap, ArcGIS Pro, and R. I am very familiar with ArcMap, but it was my first time using ArcGIS Pro extensively. I enjoyed applying my R programming experience to my own datasets and creating visual aids for my research.

Learning Statistics

The statistical analyses I implemented for my research did an adequate job characterizing my datasets. However, I would try a different set of analyses that fit my small sample sizes better. A spatial autocorrelation test may support these analyses. I used R heavily in my Ripley’s K and buffer analyses to create graphs

Courtney’s Final Project Post

Research Question

  • “How is the spatial pattern of ion and isotope concentrations in wells tapping the basalt aquifer related to the spatial pattern of mapped faults via the mechanism of groundwater flow as determined by hydraulic transmissivity of the geologic setting?”

Description of dataset that I examined

  • A: In my research I have analytical data for 31 wells, whose XY locations were determined by field confirmation of the Oregon Water Resource Department (OWRD) well log database. As groundwater is a 3D system, I have to consider Z values as well. The well depths and lithology information are also from the OWRD database. My analytical data provides a snapshot of water chemistry during the summer of 2018. I have only one temporal data point per well. At all 31 wells, I collected samples to be analyzed for pH, temperature, specific conductivity, oxygen isotopes 16 and 18, and hydrogen isotopes 1 and 2. At a subset of 18 of those wells I collected additional samples for tritium, carbon 14, and major ion analysis.
  • B: The shapefile of faults mapped at the surface was created by Madin and Geitgey of the USGS in their 2007 publication on the geology of the Umatilla Basin. There is some uncertainty in my analysis as far as extending this surface information into the subsurface. USGS studies have constrained proposed ranges of dip angles for the families of faults that I am studying, but not exact angles for any single mapped fault.
  • C: results of pumping interference tests involving 29 wells, 12 of which I had chemical data for. The data was collected by the OWRD in 2018 and 2019.


  • Faults can act as conduits or barriers to groundwater flow, depending on how their transmissivity compares to the transmissivity of the host rock.
  • I hypothesize that clusters of similar chemical and isotopic properties of groundwater can indicate a shared aquifer unit/compartment, and that if faults separate clusters then the fault is influencing that difference in chemical/isotopic signatures. If the fault is between two clusters, I hypothesize that it is acting as a barrier. If it crosses through a cluster, I hypothesize that it acts as a conduit.
  • Where faults act as barriers, I hypothesize that parameter values will differ in groups on either side of a fault. Specifically, a barrier fault might cause older, warmer water to rise into upper aquifer layers, and the downstream well might show a signature of more local recharge.
  • Where faults act as conduits, I hypothesize that water chemistry and isotopes of samples from wells on either side of the fault would indicate a relatively direct flowpath from the upstream well to a downstream well. Over a short distance, this means that ion and isotope concentrations would not differ significantly in wells across the fault.
  • My hypotheses depend on a “barrier” fault damming groundwater flow up-gradient of the fault, and compartmentalizing local recharge on the down-gradient side. This hypothesis is only relevant if the fault is roughly perpendicular to the flow direction, and so disrupting transmissivity between a recharge zone and the wells. If a fault that separates two wells is parallel to the flow direction and there is no obstacle between the wells and upstream recharge areas, then the fault might indeed limit communication between the wells but they will have similar chemical signatures. Wells separate by this second kind of fault barrier would be better evaluated by a physical test of communication, such as a pumping interference test.

Analysis Approaches

  • Principal component analysis: used to simplify the multivariate data set (19 variables!) into variable relationships that could represent aquifer processes
  • Analysis of PCA results compared to distance from a flow path
    • Interpolation of well water levels classified by well stratigraphy to estimate a potentiometric surface and evaluate groundwater flow directions.
    • Raster calculations to compare flow direction to fault incidence angle
    • Measuring distance from each well to the nearest fault along the flow path
    • simple linear regression, comparing Non-ion PC1 score of a well with its distance from a fault.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and pumping interference test communication between wells
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups.


  • Principal component analysis – Patterns of variation between wells are statistically explained primarily by total ion concentration, a relationship between chemical evolution from ion exchange and decreasing stable isotope ratios, and the combination of well depth and water temperature. Moran’s I indicates that only Non-ion PC2 is spatially clustered, while the other PC models have a distribution not significantly different than random. The other PC models are useful to understand the groundwater system, but not specifically to analyze clustering correlated to faults.
  • Interpolation of water level, and comparison of fault incidence angle with flow direction, indicates faults that are and are not able to be tested by my hypotheses.
  • Analysis of PCA results compared to distance from a fault along flow path – some wells that are “within” a fault zone have very old signatures and others have very young signatures. This could be related to the angle of the dip of the fault and the accuracy of mapping compared to the depth of the well. I hypothesize that the wells that are in the fault zone but have high PC1 scores are on the up-gradient side of the fault where older water is upwelling along a barrier. Wells in fault zones with low PC1 scores could indicate wells open to downgradient areas of the fault, where vertical recharge through the fault damage zone is able to reach the well.
  • Returning to the conclusions I wrote in that blog post after I found improved stratigraphic data, I’m not sure if I can make conclusions other than those about the wells are that mapped as “inside” a fault. Several wells that are closer to faults are also open to shallower aquifer units, and so the effect of lower PC1 scores closer downgradient to faults might be confounded by lower PC1 scores caused by vertical inflow from the sedimentary aquifer and upper Saddle Mountain aquifer.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and communication between wells show that the presence of a fault has a greater effect on communication than the distance between the wells.
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups – stable isotopes and specific conductivity both show more variation between well groups than within well groups.
  • Not covering in these blog posts, I also ran Moran’s I on my inputs to see which ones are clustered and so might be more related to horizontal grouping factors (such as faults) than vertical grouping parameters (such as stratigraphic unit). Of the PCA and individual variables, only d18O, d2H, and Non-ion PC2(combination of well depth and water temperature) were clustered. The other PCA models, temperature, pH, and specific conductivity were not significantly spatially clustered.

Significance –  Groundwater signatures are related to faults agree/disagree with past understandings of differences between wells in the region, and can inform well management. If a senior permit holder puts a call on the water right and asks for junior users to be regulated off, it would not help that senior user if on of those junior permit holders’ wells is not hydraulically connected to the senior users.

  • More wells would need to be sampled to be better able to disentangle the effects of faults from the effects of well stratigraphy.

My learning – I learned significantly more about how to code and troubleshoot in R. Additionally, I learned about the process of performing spatial clustering analysis in ArcGIS.

What did I learn about statistics?

  • PCA was completely new to me, and it’s a cool method for dealing with multivariate data once I dealt with the steep learning curve involved in setting it up and interpreting the results. It was useful getting more practice performing and interpreting t-tests and Anova F-tests. I had not used spatial clustering before, and learning how to apply it was interesting. It gave me a much more concrete tool to try to disentangle the patterns in my effectively 3D data on X,Y plane, as opposed to the Z direction.

Relationship of river recession behavior to watershed lithological composition

Research Question: How is watershed recession behavior related to the spatial pattern of watershed lithological composition in Oregon coastal freshwater systems?


Recession Analysis

For the recession analysis performed in this exercise, I used daily records of precipitation from the PRISM database. Daily data is the highest temporal resolution that PRISM has to offer. The PRISM time series were downloaded from cells nearest the gage. Time series were from 2001 to April 2019. For further methodology on recession analysis, refer to Exercise 1. River discharge data was collected by the U.S. Geological Survey at the watershed pour point used here. Discharge data is recorded every 15 minutes, which was upscaled to daily average to match the resolution of the precipitation data. As a result, recession analysis results were on the daily timestep from April 2019 to 2001.

Geophysical data

Lithological data was sourced and derived from the Oregon Department of Geology and Mineral Industries state geological map. General lithology type is a listed variable in the attribute table (GEN_LITH_TY), and ArcGIS was used to dissolve that attribute across the study area. The spatial resolution of this dataset is unclear. Additionally, the data is representing “immutable” features, and therefore there is no associated temporal scale.


Streams and rivers in watersheds with predominantly volcanic geology will recede faster than those with predominantly sedimentary geology.

Streams and rivers in watersheds with similar underlying lithology will have similar recession characteristics.


To perform this analysis, I followed the following steps:

  1. Recession analysis on the daily timescale across eight watersheds. Statistical methods and other analytical procedures are outlined in Exercise 1.
  2. Profile lithological composition within each of 35 study watersheds across the Oregon Coast Range. Percentages of metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic lithologies were quantified for each study watershed. Technical procedures are outlined in Exercise 3.
  3. Perform hierarchical clustering analysis on lithological composition profile to see which watersheds are most lithologically similar and dissimilar.
  4. Compare recession analysis coefficients to clustering results and lithologic profiles to see if recession behavior appears to be different in different predominant lithologies.


Results from the clustering analysis demonstrated that a most coastal watersheds are predominantly underlain by sedimentary and volcanic lithologies, and that sedimentary lithologies dominate the region. However, volcanic lithologies are well-represented, primarily at the northernmost latitudes. Because of this stratification, recession analyses could be performed at watersheds in both lithologies and compared to test whether lithology is a dominant control. However, because I conducted the recession analysis before profiling lithological composition, only one of the eight hydrologically analyzed watersheds is in predominantly volcanic geology (Figure 1). The watersheds grouped under the sedimentary branch span values of 54-100% sedimentary lithology. Watersheds grouped under the volcanic branch span values of 47-74% volcanic geology.

Figure 1. Results from hierarchical clustering analysis depicted in a dendrogram. Watersheds in which recession analyses were performed are circled.

When recession coefficients from the recession analysis are compared with the clustering analysis results, we see that coefficient a is fairly consistent among all watersheds, while coefficient b is much larger in the predominantly volcanic watershed (Table 1). The Wilson River near Tillamook, Oregon, is characterized as 50% volcanic lithology, which is the highest among the studied watershed. Recession analysis coefficient b from the Wilson River is 1.726 is also the highest in the eight studied systems.

Table 1. Table showing lithological composition of watersheds for which recession analysis was performed. Coefficient results from the analysis are also listed.


While the results of this study are inconclusive given the small sample size, it does give indication that lithology is an important control on watershed recession behavior and should therefore be included in analysis of temporal patterns of hydrological behavior. It is relevant to resource managers in coastal PNW regions because it indicates a watershed characteristic that influences storage and release behavior. Such information may be valuable for identifying systems that are more or less susceptible to drought and changes in the climatic patterns.

Learning Outcomes

  • Watersheds can be delineated in Matlab, and it is much faster than ArcGIS.
  • dplyr is a very important and useful package for reformatting data in a logical way
  • I learned more about spatial autocorrelation from conversations with other students about their analyses, especially given the breadth of research topics in the classroom.
  • Probably the two most important outcomes of this project were the recession analysis and the clustering analysis. The recession analysis involved multi-variate regression, which when applied to my data and research question, makes far more sense. The clustering analysis helped get at the question I have had for some time: how similar/dissimilar are my study watersheds? For this project, I focused on lithology, but I think it will be a useful tool for analyzing the other variables as well. Additionally the clustering analysis will provide a useful method for stratifying my sampling design into different lithological groups.

Relation of Potential Soil Carbon to Tree Height and Density across HJA

Major uncertainties exist in the geospatial distribution of terrestrial carbon (C) sources and sinks and the factors that influence soil C distribution and change. Temperate and boreal forests serve as a net sink of atmospheric CO2, so my aim is to understand how landscape features and vegetation vary across the HJ Andrews Forest (HJA) and determine how those factors influence soil C storage and CO2 release.

Research Question: How might landscape features and vegetation characteristics relate to accumulation of soil carbon, using tree height and density as a proxy for soil C?

Dataset description: I used LiDAR data from 2008 and 2011 at 0.3-0.5 m vertical resolution and at 1 m2 horizontal resolution covering the Lookout Creek Watershed and the Upper Blue River Watershed to the northern extent of HJA. These LiDAR data include a high hit model (tree tops) and a bare earth model from 2008. I also downloaded a shapefile of polygons of the reserved control areas within the HJA forest boundaries and used these areas to subset some of my later analyses.


  1. Large storm events primarily topple larger (older) trees on ridges and upper slopes. Soils in these positions are typically thinner, making them quickly saturate from precipitation and making trees in those positions more vulnerable (Lal 2005; Overby 2003). In addition, the tallest trees on ridges may be exposed to greater risk from lightning strikes, greater wind gusts and more snow accumulation, resulting in shorter trees dominating ridgelines. Shorter trees have smaller canopies and occupy less physical space than taller trees, so shorter trees occupy denser stands and as they grow taller, each individual tree occupies more space. Because of these factors, I hypothesize that shorter trees that are more closely spaced will cluster more densely along ridges than other landforms.
  2. I hypothesize that the tallest trees will be more densely clustered at low elevations along valleys because waterways carve out valleys and bring nutrients that then accumulate and build thick, carbon-rich soil. These areas tend to not suffer from the moisture limitations of ridges and steep slopes and are more protected from strong winds, so trees in these areas can maximize growth based on available solar radiation.
  3. High density young forest stands exhibit characteristics similar to mature forests, like closed canopies and high LAI. However, recently disturbed forests have higher nutrient availability than undisturbed forests which has been shown to cause a shift in C-allocation from below- to aboveground, so I expect younger stands to negatively correlate with soil C accumulation.
  4. I hypothesize that because vegetative growth is limited by solar radiation, vegetative growth on slopes that face S and W have greater exposure to solar radiation and will result in greater overall biomass (more dense stands of smaller trees, but not necessarily more tall trees). However, if the slope is too steep, I expect this pattern to diminish since trees will experience greater wind and water erosion and more tree mortality.

Note: I have yet to test hypotheses 3 and 4, but they are avenues for continuing research.

Approaches: I used a k-nearest neighbor analysis and k-means to cluster trees into 10 height classes and related each tree in the center of each height class to its distance from its 30 nearest neighboring trees of the same height class. I ran several hot spot analyses on tree heights and tree distances (tree density). Since my original hot spot analyses were subset by height class, which led to the algorithm naturally finding the arbitrary bins (upper and lower bounds) the height classes were based on, I performed new hot spot analyses on all the trees and all the tree distances within just the reserved control areas of the forest. I performed a linear regression to compare the regions of taller than expected trees to the regions of greater than expected distances between tree and Chi-squared tests independence for the hot spot analyses to compare hot spots and cold spots with variables like tree spacing, slope and elevation. Since hot spots between tree height and spacing did not overlap in all cases, I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. I calculated a confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots.

My objective for the final synthesis was to find regions that are higher or greater than expected for given parameters (height and tree density) and group these clusters into polygons where they overlap. I hypothesized that that soil C accumulation positively correlates with clusters of trees that are both taller and more densely spaced than expected. Conversely, I hypothesized that shorter trees that are more widely spaced negatively correlate with soil C accumulation. Therefore, overlapping hot spots of tree height and tree density and overlapping cold spots of tree height and tree density are prime areas for my planned future soil C sampling sites.

Results: I produced maps and statistical relationships between tree height and tree spacing. These included hot spot maps overlaid on elevation maps, density plots, graphical comparisons of Z-scores and many others (see previous blog posts). I found that hot spots of taller than expected trees overlapped with hot spots of greater than expected distances of trees, but not in all cases (Fig 1). When I examined a table of each of the distance hot spot bins compared with each of the height hot spot bins, I found that the most compelling correlations were between bins (0,0), (-3,-3) and (3,3) with 5%, 33% and 43% of total data, respectively. The proportion of data not covered by overlapping hot spots and by overlapping cold spots was minimal, but when mapped was visually compelling enough to warrant further exploration. I was particularly interested in areas with short trees and greater than expected distance between trees (bin -3,3) and areas with tall trees and shorter than expected distance between trees (bin 3,-3). I expect that bin (-3,3) would correlate with less soil C, so identifying those areas could be useful to sampling. I expect that bin (3,-3) would correlate with greater soil C. I can identify from Fig 2 where statistically significant hot spots and cold spots are spatially close and plan to sample heavily in those areas.

Fig 1. Tree height vs. distance between trees from hot spot analyses shows a highly linear pattern.

Fig 2. Tree density (mean distance in m between 30 closest trees) hot spot bins compared with tree height (m) hot spot bins in Lookout Mountain area. Bin comparisons (-3,-3) are areas of shorter than expected trees in more dense stands. Bin (3,-3) is taller than expected trees that are in more dense stands. Bin (3,3) is taller than expected trees in more sparsely populated stands.


Significance: Many of my results thus far confirm ecological phenomena that we already know to be true across forested landscapes. For example, I examined tree distance to stream and found that taller trees were more common closer to streams and less common at greater distances from streams. This makes sense with other landform features like valleys and ridges, so smaller trees tend to be along ridges (cold spots of tall trees according to my analysis) and taller trees tend to be in valleys. However, I have identified regions of taller than expected trees and regions of shorter than expected trees, and if those correlate with respectively more and less soil C, that provides evidence for LiDAR as an effective way to quantify soil C. If other landscape features that can be determined from geospatial data co-vary with tree height and/or tree density, it may be possible to quantify soil C at fine resolution. These data can be used to identify areas that have the potential to sequester more soil C and forest management could be tailored to support those regions.

Geospatial Learning: I learned a ton about ArcMap, QGIS and packages that were new to me in R like spatstat, caret, nngeo, and others. I had very little experience with and GIS before this class, so it was a steep learning curve but I’ll continue to learn as I perform this final synthesis. I learned how to perform hot spot analyses in ArcMap and export them to work with in Q. I learned how to manipulate spatial data in R and load it in Q and ArcMap for viewing. These are only a few examples of the many things I’ve learned!

Statistics Learning: I learned the limitations of hot spot analyses, with one of my criticisms being that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I already knew that my data were spatially autocorrelated, so I had to take that into account for all of my analyses. I learned that the confusion matrix is great for visually discerning where a model predicts best and where it doesn’t predict well, but that the scientist must figure out the reasons for the good or poor predictions.

Because interactions among geomorphic processes, landforms, and biota occur on various temporal and spatial scales, it is necessary to understand the scale of an ecosystem process when performing spatial analyses. It is also necessary to consider the potential reasons why a particular spatial pattern did not emerge from an analysis. Reasons can range from data not being explained by expected process or multiple spatial scales of interactions, failing to use the right tool, not being at the right spatial scale, data being too coarse, failing to include the right variables or a flaw in one’s fundamental idea about how the system works. This is why it is important to (1) formulate clear, specific hypotheses before performing an analysis and (2) explore spatial patterns using multiple approaches if possible.


Harmon M.E. (2009) Woody Detritus its Contribution to Carbon Dynamics of Old-Growth Forests: the Temporal Context. In: Wirth C., Gleixner G., Heimann M. (eds) Old-Growth Forests. Ecological Studies (Analysis and Synthesis), vol 207. Springer, Berlin, Heidelberg

Lal, R. (2005). Forest soils and carbon sequestration. Forest ecology and management220(1-3), 242-258.Overby et al. 2003: Impacts of natural disturbance on soil C dynamic in forest ecosystems: Soil C sources include litter fall and tree mortality, root turnover and root exudates.

Dixon, R. K., Solomon, A. M., Brown, S., Houghton, R. A., Trexier, M. C., & Wisniewski, J. (1994). Carbon pools and flux of global forest ecosystems. Science263(5144), 185-190.

Spatial and Temporal Patterns of Oregon Salmonella Cases

  1. The research question that you asked.

My original question focused solely on the spatial patterns of Salmonella cases in Oregon. However, after realizing my dataset contained time series information I decided to pivot my analysis somewhat and my research question became the following: How are the spatial and temporal patterns of reported Salmonella infections associated with sociodemographic factors in Oregon counties between 2008-2017? While the eventual discovery of causal drivers of disease will be important for disease prevention efforts, the focus here is what other factors are associated with disease at the county level. Through this analysis I hoped to identify sub-populations who are at a higher risk of becoming infected with Salmonella in Oregon.

  1. A description of the dataset you examined, with spatial and temporal resolution and extent.

The data used in my analysis came from a variety of sources. Data regarding the number of Salmonella cases per county came from the Oregon Public Health Epidemiology User System (ORPHEUS). These data contained individual level information, but it was de-identified due to privacy concerns. Information concerning age, sex, disease onset, and county of residence was available with these data. Population estimates came from Portland State University’s Population Research Center which collected yearly county level population estimates. Other sociodemographic information came from the American Community Survey, a yearly survey which assesses various county-level characteristics like poverty, high school graduation rates, percentage of foreign-born residents, etc. All data used in this analysis was from years 2008-2017.

  1. Hypotheses: predictions of patterns and processes you looked for.

Hypothesis 1: I expect counties with higher proportions of females to have higher levels of Salmonella infections due to findings in prior literature concluding that females have a higher incidence rate of Salmonella infections compared to males. The underlying causal mechanism for this pattern is unknown, but given the results of other studies of foodborne illness I expect Oregon’s population to be similar. As a result I would expect the percentage of a county that is female to be significantly associated with Salmonella incidence.

Hypothesis 2: I expect counties with higher proportions of infants and newborns to also have higher levels of Salmonella infections compared to counties with a higher proportion of older age groups. Other findings from the Oregon Health Authority indicate that young children are at high risk for developing foodborne illnesses. The reasoning here is that the immune systems of young children are not fully developed and less likely to effectively fight off infection resulting in a disproportionately high disease incidence in this group. I expect to see that the percentage of a county that belongs to the age group 0-4 years old will be significantly associated with Salmonella incidence.

Hypothesis 3: I expect to find a significant time trend of Salmonella incidence over time in Oregon counties. Disease incidence varies from year to year and can sometimes be volatile in scarcely populated counties or counties experiencing a major outbreak. Because of this natural variance over time I expect to see a significant time trend of reported cases in Oregon. Some counties will show a positive trend over time, others will show decreasing disease rates, and others will be relatively level. However, I expect time will be a significant factor associated with disease incidence.

  1. Approaches: analysis approaches you used.

I used auto- and cross-correlation, longitudinal trend analysis, hotspot analysis, geographically weighted regression, and Principal Component Analysis. All of these analytical approaches were performed in the R statistical software with the aid of various packages which allowed me to perform all of my spatial analysis.

  1. Results: what did you produce — maps? statistical relationships? other?

From these analyses I was able to create maps, plots of auto- and cross-correlation over time, and a bivariate plot of principle components which were associated with different regions of Oregon. Also, there was considerable numerical output for my time trend regression and GWR which provided evidence of statistically significant associations between the variables in my model and the outcome. These outputs can be seen in my second and third blog posts.

Hypothesis 1: Findings from my GWR support my first hypothesis that the proportion of a county which is female is significantly associated with Salmonella incidence. While coefficient estimates vary by county, most often there is a positive association between the proportion of a county that is female and disease incidence. Cross-correlation analysis found there were large areas of Oregon, particularly in the Western part of the state, where there was a positive association between the two variables was positive and some areas clustered in the Eastern part of the state where this pattern was reversed.

Hypothesis 2: GWR findings and cross correlation analyses did not support my second hypothesis. The proportion of a county which is aged 0-4 is not significantly associated with county Salmonella incidence. However, it was found that county percentage of child poverty was significantly associated with disease incidence. Perhaps the reason for this finding is that the proportion of children alone may not be significantly associated with disease, but poverty is. Thus children in poverty partly explain the association seen for children as a whole.

Hypothesis 3: Longitudinal analysis and regression supported my third hypothesis that a significant time trend existed for Oregon’s Salmonella incidence. As expected, there was some variance in county disease rates over time due to a large host of factors. Larger populations are likely to have more cases of disease reported due simply to the fact that there are more people to possibly infect. Overall, there is an increasing trend in Oregon’s Salmonella incidence. There may be a higher amount of disease occurring in Oregon over time. Another possible explanation for some of the increase could be due to improvements in Oregon’s disease monitoring abilities and infrastructure over time.

  1. What did you learn from your results? How are these results important to science? to resource managers?

This analysis found statistically significant associations between some population characteristics (like median age, childhood poverty levels, and time) and the reported rate of Salmonella. These results are important to researchers because the focus to find causal exposures for Salmonella can be narrowed to groups or areas more associated with the disease. These results can inform future disease prevention research. Resource managers would also be interested in this analysis as the counties identified here as being more associated with disease than others can be targeted for disease surveillance.

  1. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

My analysis was completed solely in R statistical software. I learned a lot over this course about different spatial analysis packages in order to make my R more robust. Some spatial packages have a steep learning curve with a fair degree of technical knowledge to appropriately implement in your analysis. I would say I learned a lot about trouble shooting from discussion threads and GitHub posts.

  1. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I learned how hotspot analysis is performed and what a Getis-Ord Gi* score is and how they are compared to yield hot and cold spots. As for autocorrelation I learned how to perform cross-correlation analysis through both space and time, as well as how to cluster different values into a map which is easier to interpret. During my geographically weighted regression analysis I learned how to transform my data and make it compatible with the regression technique as well as figure out which variables to put in my GWR based on an ordinary least squares regression. Prior to this project I had never heard of PCA, so here I learned introductory skills about how to apply this analytical technique to my dataset.

Final: Evaluating a habitat suitability model for native Olympia oysters in Yaquina Bay

Research background
The goal of my graduate research project is to help build a more comprehensive picture of native Olympia oyster spatial distribution and abundance in Yaquina Bay, Oregon. Olympia oysters have experienced a major decline in population in the Pacific Northwest; it is estimated that the original population has been reduced by 90-99% (Beck et al. 2011). The species is an important ecological and cultural resource, which has garnered it support from resource management agencies, tribes, and local environmental groups. Yaquina Bay is one of only three Oregon estuaries that has historically supported Olympia oysters and may be the only estuary to have maintained a continuous population since Native American settlement (Groth and Rumrill 2009). Because of this, it has been identified by local shellfish biologists as a good candidate for species enhancement. However, current information about exactly where and how many wild oysters exist in Yaquina Bay is inexact and anecdotal. My research project aims to fill that data gap so that appropriate, informed actions can be taken for species enhancement and recovery.

Question asked
Through this course, I have been experimenting with spatial analysis to see if I can create an accurate map of habitat suitability for Olympia oysters in Yaquina Bay. To do this, I’ve been focusing on three environmental parameters: salinity, substrate, and elevation. I then compared the map of predicted suitable habitat with field observations of known oyster locations. The main research question I examined in this project was:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Description of the dataset

Habitat parameters
I collected one raster layer for each of the three habitat parameters, then overlaid them on one another to establish suitable habitat. I ranked the subcategories within each parameter from highest to lowest likelihood to be suitable for oyster presence.

  • Salinity: Salinity data was derived from an existing dataset available through the Oregon Department of Environmental Quality (DEQ). DEQ operates a publicly accessible online water monitoring data portal where results water quality monitoring stations are posted. Wet-season salinity was used to represent the extreme of salinity stress, since Olympia oysters are more capable of withstanding higher salinity waters than fresher water. The salinity dataset is based on historical measurements from 1960-2006 and represents a gradient from highest salinity (~32psu) to fresher water (<16psu).
  • Substrate: Substrate data was taken from the Estuary Habitat Map of Oregon using the Coastal and Marine Ecological Classification Standard (CMECS). CMECS is a national standard; publicly available data for Oregon’s coastal areas was made accessible through an effort by the Oregon Coastal Management Program (Department of Land Conservation and Development). The CMECS Estuarine Substrate Component layer is an Oregon-wide raster that I clipped to Yaquina Bay for this analysis. This layer characterizes substrate type broadly and is fairly low resolution.
  • Elevation: Elevation is represented through a bathymetric dataset from 2001-2002, which is the most current estuary-wide available dataset. This raster was sourced from the Newport, Oregon office of the Environmental Protection Agency (EPA) through contact with a staff geographer.

Oyster location data
This spring, I completed several field surveys of the intertidal zone of Yaquina Bay during low tide series. These were conducted in collaboration with my advisor and field staff from the Oregon Department of Fish and Wildlife (ODFW) on April 19-20 and May 17, 2019. We used a stratified random sampling design to randomly select survey locations. Once at the field survey locations, we timed our search for five minutes and assessed as much area as possible within that time. Oysters were characterized as ‘present’ if evidence of at least one oyster, living or dead, was detected within the search area. The oyster or oyster shell needed to be attached to a substrate to qualify. These data points were collected using a handheld Garmin GPS unit and field notebook, then transferred to Google Earth for quality control, and finally to ArcGIS Pro for analysis.

I originally hypothesized that the spatial pattern of Olympia oyster location data would correspond with the spatial pattern of suitable habitat as established by analysis of the three habitat parameters (salinity, substrate, elevation). I did not think it would be exact, but would at least allow for a somewhat accurate prediction of the where the oysters were located in the estuary. I felt this model would be most useful in determining the spatial range of the oysters by finding the extreme ends of the population upstream and downstream. Of the three habitat parameters, I hypothesized that salinity would be the ultimate predictor of where oysters could be found and that substrate would be the limiting parameter.

Exercise 1
I did not have any oyster location data to work with yet, so I first focused on the three habitat parameters to identify and rank where habitat was least to most suitable. I ranked the subcategories within each parameter raster layer by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer. I then combined them using the ‘Weighted Overlay’ tool in ArcGIS Pro to determine where overlap between the layers shows the best environmental conditions for oysters to survive.

Exercise 2
After collecting field data to determine where oysters were present or absent within the intertidal zone of Yaquina Bay, I used the data points to compare against the map of suitable habitat. I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro where I used the ‘Buffer’ and ‘Zonal Statistics’ tools to analyze the neighborhood surrounding each data point. Looking at these neighborhoods, I determined what habitat suitability type that the majority of surrounding habitat fit into. Finally, I made a confusion matrix to statistically assess which data points fit into each habitat type.

Exercise 3
Based on the results of the confusion matrix in Exercise 2, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations. I decided to see if the spatial pattern of oyster location data could be better described by manipulating the spatial pattern of one of the habitat parameters. I decided to change the rank values of the salinity parameter to alter the map of suitable habitat. Once I had a new output for habitat suitability, I ran the same neighborhood analysis that I used in Exercise 2 and created another confusion matrix.

Exercise 1
After applying the ‘Weighted Overlay’ to the input raster layers, this map was generated as a representation of habitat suitability:

Exercise 2
This neighborhood analysis map shows the majority habitat type surrounding each data point:

This table below shows how many data points fell into each habitat type:

Habitat suitability
4 3 2 1
Presence 13 [0.72] 4 [0.4] 2 [0.29] 0 [0]
Absence 5 [0.28] 6 [0.6] 5 [0.71] 0 [0]
n = 35 18 [1.0] 10 [1.0] 7 [1.0] 0 [0]

This confusion matrix further examines the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1 since there were not observations in this category, and type 3 since it fell in between high and low-quality habitat. Decimals reported indicate the proportion of total observations (n = 35) that fell within this category:

Habitat suitability
4 (high) 2 (low)
Presence 0.37 0.06
Absence 0.14 0.14

Exercise 3
This second neighborhood analysis map shows the majority habitat type surrounding each data point after the habitat suitability model was altered (by manipulating salinity):

This table below shows how many data points fell into each habitat type:

Habitat suitability
4 3 2 1
Presence 1 [0.5] 18 [0.58] 0 [0] 0 [0]
Absence 1 [0.5] 13 [0.42] 2 [1.0] 0 [0]
n = 35 2 [1.0] 31 [1.0] 2 [1.0] 0 [0]

I created another confusion matrix to further examine the ‘errors’ in the data. I removed habitat suitability type 1 since there were not observations in this category. Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category:

Habitat suitability
4 (high) 3 2 (low)
Presence 0.03 0.51 0
Absence 0.03 0.37 0.06

From this analysis, I feel that these three habitat parameters do a good job of predicting the spatial distribution of oysters generally. This approach can be used as a helpful tool for guiding resource managers and shellfish biologists to likely oyster locations, but cannot be the only method for validating their presence or absence at this point. Given budget cuts or competing priorities for today’s resource managers however, a map predicting where an oyster population is might be enough for making management decisions in a pinch.

In collaboration with my advisor at ODFW, I am using the results of this spatial analysis to fill an important data gap about where and how many Olympia oysters can be found in Yaquina Bay.  This information will be used to develop a repeatable monitoring protocol to be conducted by ODFW’s Shellfish and Estuarine Assessment of Coastal Oregon (SEACOR) program. I hope that my research can also help estimate population enhancement potential by identifying possible sites for habitat restoration. With increased accuracy, I’d also like to analyze potential oyster distribution under future environmental conditions and under different management scenarios.

Recovery of a robust population of Olympia oysters is important to Yaquina Bay and the surrounding community in a number of ways. The species provides many benefits to the ecosystem, including water filtration and habitat for other marine creatures. It is culturally significant to local tribes, including the Confederated Tribes of Siletz. While Olympia oysters are not currently listed as threatened or endangered, possible listing would trigger a number of mitigation and conservation measures that will be difficult and expensive for agencies and private landowners. Additionally, there’s been some exploration that if the population can become robust again, there is potential to grow and harvest this species as a specialty food product.

Software and statistics learning
Most of the spatial analysis tools I used were familiar to me, though I had only really used them in exercises in other classes. I have not been able to use my own data to conduct an analysis, so that was exciting, but also a bit of a learning curve. The confusion matrix was a new tool that was a very simple and useful way to statistically assess my data.

Some issues worth noting
Throughout this process, I identified several small issues that I will need to consider or modify to improve my research analysis:

  1. Regarding the neighborhood analysis, the size of the buffer around each point affects how the majority of habitat type is calculated. It might be worthwhile to try another analysis where buffers are smaller or multiple sized buffers are assessed.
  2. A major issue for creating the habitat suitability map was the resolution of the base raster layers for each parameter, especially the substrate layer. The current layer does not reflect the nuances of small patches of available substrate and therefore is not a true indicator of available substrate for oysters. The resolution of the substrate layer is not sophisticated enough to capture the true extent of the oysters, because they’re known to be opportunistic in finding suitable substrate.
  3. The area covered by each field survey (5-min search at a randomly selected starting point) was not equal at every location. Difficulty of travel at the site impacted the rate at which the survey could be conducted. For example, extremely muddy locations or locations with a lot of large boulders made traversing through the survey site more challenging and slowed the survey team down. Additionally, more ground could be covered if little substrate was available for examination.
  4. During field surveys, no distinction was made for adult or juvenile oyster, or living or dead, for determining presence/absence. Separating out this information on the map might be interesting in determining where oysters are more likely to survive long-term. For example, it has been noted that young oysters can opportunistically settle on substrates under less ideal environmental conditions, but may not be able to survive into adulthood.
  5. The simplistic approach for ranking the subcategories within each habitat parameter (on a scale from 1-4) ensured that the extreme ends of the rank value were always represented; there may not always be a definitive 1 or 4 value within each category. For example, maybe high salinity and low salinity both rank as 2’s rather than deciding one is slightly more appropriate than the other.


Beck, M. W., R.D. Brumbaugh, L. Airoldi, A. Carranza, L.D. Coen, C. Crawford, O. Defeo, G.J. Edgar, B. Hancock, M.C. Kay, H.S. Lenihan, M.W. Luckenbach, C.L. Toropova, G. Zhang, X. Guo. 2011. Oyster Reefs at Risk and Recommendations for Conservation, Restoration, and Management. BioScience 61: 107-116.

Groth, S. and S. Rumrill. 2009. History of Olympia Oysters (Ostrea lurida Carpenter 1864) in Oregon Estuaries, and a Description of Recovering Populations in Coos Bay. J. Shellfish Research 28: 51-58.



Final project: Juniper spatial distribution and slope, aspect, and surrounding juniper density

  1. Research Questions.
    • What are the spatial patterns of western juniper within the treated watershed? (The project was later expanded to address juniper density within the treated watershed, untreated watershed, and in nearby areas.)
    • What are the spatial patterns of western juniper within the study area as they relate to the spatial pattern of slope and/or aspect (as potential indicators of soil moisture) and the density of juniper in a given area (as a potential indicator of seed source)?
  2. Description of dataset.
    • The study took place in the Camp Creek Paired Watershed Study (CCPWS), located in central OR. The majority of western juniper was removed from one watershed in 2005-2006 (“treated watershed), while mature western juniper is the dominant vegetation at the other watershed (“untreated watershed”).
    • Thirty-three belt transects (3m by 30m) within the treated watershed recorded the location and size of juniper within each transect.
    • Multispectral imagery were collected using a UAV for a subset of the treated watershed. Brightness values for red, green, blue, red-edge, near infrared wavelengths were used. Resolution is approximately 4 cm.
    • National Agricultural Imagery Program (NAIP) imagery from 2016 was used for the study area. Resolution of this dataset is 1 m.
    • A 30 m Digital Elevation Model (DEM) was used to determine the slope and aspect characteristics within the study area and develop a rating model as an indicator of potential soil moisture. Slope and aspect were each rated from 1-9, with areas associated with higher soil moisture (northerly aspects, flat slopes) being assigned higher values.
    • All data was projected to NAD 1983 UTM Zone 10N.
    • Supervised classification (support vector machine) was used to identify juniper within the NAIP and UAV imagery.
  3. Hypotheses.
    • I expect juniper density will be dispersed unevenly throughout the study areas based on soil moisture, with high density patches of juniper being found in areas with northern aspects and/or flatter slopes. While juniper can survive in dry, rocky areas we would expect greater numbers where higher soil moisture allows for improved survival and growth. It is anticipated that patches of highest juniper density will be found in those areas with presumed highest soil moisture values based on slope and aspect characteristics, with northerly aspects and flat slopes presumed to have highest soil moisture values and areas of steep slopes with southerly aspect presumed to have the lowest soil moisture. Assuming similar soil depth and type, higher soil moisture is anticipated on northern-facing slopes compared to southerly-facing slopes because the intensity and amount of sunlight received is lower on northern slopes (in the northern hemisphere) resulting in lower soil evaporation. Higher soil moisture is presumed on flatter slopes because infiltration is assumed to be higher in these areas while higher rates of runoff are assumed on steeper slopes.
    • It is anticipated that juniper density (mature and sapling) will be higher in areas where there are higher levels of seed sources (as indicated by the juniper density surrounding each point of interest analyzed). Specifically, areas of higher density juniper saplings may be expected near higher densities of mature juniper (such as those areas near the watershed boundary). The seeds of juniper may be spread through a number of means, to include birds, winds, or small mammals.
  4. Approaches.
    • Kriging and inverse distance weighted (IDW) interpolation were used with the belt transect data (represented as individual points at the start of each transect).
    • Spatial autocorrelation (Global Moran’s I) was calculated using the belt transect data (using number of juniper per transect). Spatial autocorrelation was also assessed for juniper density as it related to slope and aspect characteristics.
    • Hot spot analysis (Getis-Ord Gi) was used to assess area of juniper concentration.
    • Juniper density surrounding specific points/areas (based on NAIP imagery) was analyzed.
    • Ordinary least squares (OLS) and geographically weighted regression (GWR) were used to assess juniper density as it corresponded to slope and aspect characteristics.
    • A chi-squared test and logistic model were used to assess the rating model and classified NAIP rasters.
  5. Results.
    • Based on the interpolation techniques (kriging and IDW) used areas of high and low density juniper patches were indicated within the treated watershed. However, the locations of these patches did not correspond to regions where we expected high or low soil moisture. For example, areas of highest densities of juniper density based on the transects were found in southeastern and western areas of the watershed. This may be attributed to the small dataset (n=33) used for this analysis but may also suggest that other processes (such as seed distribution via birds or the presence of other vegetation/ground cover) may influence juniper density.One hot spot was indicated using the belt transects with northwestern aspect and 8.8% slope and one cold spot was indicated with north-northwestern aspect and a 11.5% slope.
    • The hot spot analysis based on the UAV imagery indicated four hot spots, with slope values ranging from 3-11% and different aspect characteristics (northwestern, southwestern, southwestern, and southeastern).
    • The initial GWR and OLS analysis did not find that slope category and/or aspect category were good explanatory variables for juniper density based on the belt transects, based on adjusted R-squared values ranging from 0.19 to 0.28.
    • A weakly significant relationship was indicated between the rating model and the classified NAIP raster (p=0.018).
    • In order to address issues with limited variance encountered in earlier analysis, an additional GWR was conducted comparing the juniper density in a 150m buffer around 84 random points to the rating model. Outside of the buffer size, methods used in this GWR are similar to those described in previous exercises.  This improved the adjusted R-squared value for this model to 0.53.
    • The results of this analysis largely did not support the proposed hypotheses that greater juniper density would be associated with topographic characteristics associated with higher soil moisture. In situ soil moisture measurements were not used in this analysis, so differences in anticipated soil moisture associated with slope and aspect could not be compared to actual soil moisture measurements.
    • Greater densities of western juniper are evident in the untreated watershed compared to the treated watershed (as expected). However, patterns of juniper distribution within the treated watershed (as indicated by the belt transects) did not show any correlation between slope and aspect and areas of higher or lower juniper density. Additionally, those areas closest to the watershed boundary (and therefore closest to areas of dense, mature juniper) did not have greater juniper density than areas further away from the watershed boundary.
  6. Significance. 
    • While results of this analysis are of limited use to land managers at this point, a similar approach with an expanded dataset may improve is utility. The use of higher resolution imagery (such as UAV imagery) may improve the identification of juniper saplings and the reliability of this analysis. The NAIP imagery used in this analysis may be sufficient for identifying mature juniper, but small (sub-meter) juniper saplings may not be accurately detected and/or misidentified. Additionally, more ground-based data (additional belt transects, etc. distributed throughout both watersheds) could provide greater information about juniper density at this study site.
    • Past research has found that juniper encroachment is associated with reduced undercanopy soil moisture [1]. The focus of this analysis was on the relationship between slope and aspect (as indicators of soil moisture) and juniper density, but further analysis may focus on how soil moisture varies with juniper density. Additionally, this analysis could also assess non-juniper vegetation density as it relates to juniper density, soil moisture, and other watershed characteristics.
    • The results suggest that other factors may be influencing juniper density in addition to slope and aspect characteristics but further analysis is needed. It should also be noted that historically western juniper was largely found in rocky outcroppings and other areas protected from fire. However, fire suppression and other factors such as livestock grazing and climate change may be contributing to an increase of juniper density [2]. Future analysis may also focus on how these factors (e.g., disturbance related to land use, etc.) are related to juniper density over larger scales.
  7. My learning: Software.
    • Some difficulty encountered working when comparing raster and vector data.
    • Due to limited data set and differences in resolution, limited conclusions could be drawn from data. However, the methods could be applied to other datasets.
    • Some difficulty was encountered using ArcGIS 10.6/Pro, but issues were resolved by switching between versions.
    • Once packages and data was added, raster analysis in R-studio was relatively simple and efficient for this analysis. The R-bridge could potentially simplify this process allowing for more complicated analysis.
    • I was able to reinforce some basic skills in R and ArcGIS as well as gain exposure to new tools (such as the Ripey’s K and hot spot analysis). In particular, I was able to gain more exposure to using model builder in ArcGIS Pro for extracting and analyzing that would be time consuming and inefficient to perform otherwise.
  8. My learning: Statistics.
    • I did not have previous experience with several techniques used in this analysis, to include hot spot analysis and OLS/GWR. This analysis demonstrated how these methods could be applied in the future, as well as potential limitations (e.g., small datasets, clustered data, etc.). I would be interested in applying the hot spot analysis in particular to a larger data set and to higher resolution imagery for analyzing vegetation density.
    • I have used several kriging approaches and IDW to a limited degree in the past. Similar limitations to those cited above were experienced but this is an approach that may have applicability that for in situ soil moisture measurements in the watershed.
    • In addition to those methods used in this analysis, the tutorials helped reinforce concepts related to other methods, to include: principal component analysis, Ripley’s K, and qualitative analysis using maps.
    • A major takeaway of this analysis was the potential ways that different statistical approaches can be combined to perform analysis (e.g. using a combination of OLS, Moran’s I, and GWR, etc.).


  1. Lebron, I.; Madsen, M.D.; Chandler, D.G.; Robinson, D.A.; Wendroth, O.; Belnap, J. Ecohydrological controls on soil moisture and hydraulic conductivity within a pinyon-juniper woodland. Water Resour. Res. 2007, 43, W08422.
  2. Soulé, P.T.; Knapp, P.A.; Grissino-Mayer, H.D. Human Agency, Environmental Drivers, and Western Juniper Establishment During the Late Holocene. Ecol. Appl. 2004, 14, 96–112.

Final Project: Characterizing the exclusion of refugee settlements from global settlement datasets

Final Project Blog: Characterizing the exclusion of refugee settlements from global settlement datasets

Anna Ballasiotes


This research is a small part of a larger question: where are refugees? Which in itself is the basic question to a larger problem: how will the global community provide resources to those who need help — and in order to do so, understanding where these vulnerable populations exist is particularly important. In considering the hierarchy of understanding where IDPs and refugees exist, there are certain drivers of the creation of a ‘refugee population,’ notably conflict (most often varying levels of violent conflict) and more recently, climate events (including natural disaster events). Beyond the drivers of refugee populations, there are different typologies of refugee populations: refugees within cities and blending into existing infrastructure, refugees in settlements and camps that are close to borders, refugees in settlements close to urban centers, and refugees in settlements and camps in areas where populations have not been established and as such there is available land. My focus lies largely in the last item of that list: where populations have not been formally established but now camps and settlements are expanding. It is a known problem that global settlement datasets are not doing an adequate job of recognizing these camps and settlements as “population.” As such, my goal was to characterize how these global settlement datasets were not recognizing these camps and settlements — and thus excluding these populations. This asks the question: to what extent do global settlement datasets fail to detect known refugee camps and settlements that form in areas of low density population?

I originally sought to answer the question, “what about the classification metrics of global settlement dataset fails to detect settlements known to OSM.” I wasn’t able to truly answer this question because I needed to spend more time  on actually characterizing this exclusion before understanding why these global population datasets were unsuccessful. Characterizing where this exclusion occurred was difficult enough and took most of the class, especially as I explored what these refugee settlements looked like and considered how they may have formed.


I used three main data sources: Refugee settlement boundaries from OpenStreetMap, Facebook High Resolution Settlement Dataset, and the World Settlement Footprint. Only data within Uganda was used for these data, given the overlap of data availability and project partners with reliable data. The Facebook High Resolution Settlement Dataset is trained on 2018 imagery and is presented at a 30 meter resolution. World Settlement Footprint is trained on 2015 imagery and is also presented at 30 meter resolution, but with a minimum mapping unit mask that creates clusters of pixels that make the data appear to have a coarser resolution. The refugee settlement boundaries from OSM range in temporality from 2016 to 2019; however, this is only when the boundaries were added to the OpenStreetMap database and is not necessarily when camps or settlements were established. I had initially included the Global Human Settlement Layer, which is also trained on 2015 imagery, but GHSL did not appear to detect any settlement in the refugee camp boundaries within OSM, so I was not able to perform any meaningful analysis with it.

HYPOTHESES: predictions of patterns and processes.

I wanted to understand in what ways the global settlement datasets excluded the known refugee settlements — thus, I wanted to understand what percentage of refugee settlements were captured and whether larger settlements or smaller settlements showed a higher percentage of captured area, or if settlements that had a higher percentage of captured area showed a faster green-up around the settlement as one moved away from the boundary. I suspected that smaller settlements would have a higher percentage of area captured, mostly because I predicted that smaller settlements conformed more to the actual built-up of settlement area, whereas the larger settlements included more land cover types that would be less likely detected as “built-up.” I also suspected that settlements that showed a higher detection percentage would have a more distinct natural boundary — that the green-up around the settlements would occur faster if a higher percentage was detected.  

APPROACHES: analysis approaches used.

I had initially measured the spatial autocorrelation of the refugee settlements using ArcGIS Pro, but did not find this analysis to provide too much significant information for me. I already knew that my settlements were quite different from each other in size; any clustering would have really come from other factors that are not obviously visible, like the way land was distributed or delegated for refugee settlements.

In order to characterize the percentage of area of refugee settlements captured by each settlement dataset, I brought everything into Earth Engine and calculated pixel area of every pixel both within the refugee boundaries and across WSF and Facebook datasets — counting only the pixels that have area detected as settlement.

In order to refine the comparison of area detected within refugee settlements, I also needed to really only look at pixels that look like “built-up” within the refugee settlements, since the boundary themselves have quite a lot of green space and other land cover types. also performed a binary unsupervised classification based on the known locations of refugee settlements. Ultimately, I used a confusion matrix to compare this binary unsupervised classification to WSF and Facebook to refine the “percent area detected.”

I also looked for the relationship of detection and greenness moving away from the boundary. To do this, I created a multi-ring buffer out to 1 kilometer at 100 meter increments, calculated the average maximum NDVI in each “donut buffer” for 2018 using Landsat 8 imagery. I then plotted these NDVI values over increasing distance from the refugee boundary to see the green-up.


I produced a map of the spatial confusion of overlap between WSF, Facebook, and a binary unsupervised classifier within my settlement boundaries. I have refined this map a bit, but there are still further statistical comparisons that could be done to better characterize the overlap of WSF, Facebook, and the K-Means classifier.

I also have all of the settlement boundaries with the percent of area captured by each global dataset within each boundary: this could be a sorting mechanism in which I could perform furt

Ayilo 1 & 2 Refugee Settlements; Classifications shown

her analysis to better answer my hypothesis, which I didn’t necessarily answer well, specifically with regards to the NDVI prediction. Within the exercises, I answered this question based on geography rather than percent detection — did settlements in the north vs. south show different pattern of vegetation greenup around the settlement?

SIGNIFICANCE What did you learn from your results? How are these results important to science? to resource managers?

I learned that across the board, refugee settlements ARE systematically excluded from global datasets but can easily be detected at the local scale, and thus this systematic bias of these datasets must be addressed and solved. Those that are modeling systems of migration or providing access to humanitarian resources need to have more accurate representations of settlements. Beyond this, I was really hoping to be able to characterize the spatial patterns of the exclusion better than I was able to, and I think that’s where my analysis fell short and where my next hypothesis would manifest.

LEARNING: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other? What about statistics? including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I didn’t truly dive more deeply into these softwares than I have in the past; I think I just became more acutely aware of the statistical methods that exist in Arc toolboxes and became more comfortable manipulating data in R.  

I think that the statistical learning was less specific or intense than I was hoping, but also in part because I think I was really challenged to think statistically and spatially-statistically about my research question. I wish I had spent more time understanding statistical methods rather than trying to mold my question into something that seemed to fit my elementary understanding of spatial statistics.

Cluster analysis of watershed lithology in Oregon coastal systems

The question that I asked for Exercise 3 was: how similar are the lithological profiles of Oregon coastal watersheds?

I used cluster analysis to answer this question. Broadly, cluster analysis is a way of grouping data that minimizes within-group variance and maximizes between-group variance. There are many ways of performing a cluster analysis, usually involving statistical optimization computations. For this exercise, I used hierarchical clustering, which builds a hierarchical grouping network based on the distribution of numerical attributes in a dataset. My dataset was the lithological profile of 35 watersheds in the Oregon Coast Range. The lithological profile refers to the percentage of each lithology, including metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic types.


  1. Computing lithological profile for each watershed
    1. Downloaded “Oregon Geologic Data Compilation – 2015” from Oregon Spatial Data Library.;id=e71e1897f5864b689a3a4a131287a309
    2. Used ArcMap to dissolve the geology layer based on the GEN_LITH_TY field, which classified each geological type into either: metamorphic, plutonic, sedimentary, surficial sediments, tectonic, or volcanic.
    3. Used ArcMap to intersect geology data with study watersheds. The Intersect tool subsets the dissolved lithology layer to only include data bounded by the study watersheds.
      1. Ran Intersect on multiple different shapefiles: wshd4, wshd5, and all Coos watersheds
      2. The output of the Intersect tool is a shapefile with each discrete lithological type as a single feature. As a result, there was often more than one polygon per lithology in each watershed.
    4. Used Calculate Geometry tool in ArcMap attribute table to calculate area (in square kilometers), and the latitude and longitude of each watershed polygon centroid.
    5. Exported attribute tables of each output shapefile and standardized columns in Excel. Exported as CSV and reformatted table in R so that rows demonstrated each watershed and their lithological profile. Used dplyr to group each lithological type and sum percentage. Used tidyr (spread function) to make each lithology a column.
  2. Cluster analysis
    1. Used the daisy function in R (package: cluster) to compute all pairwise dissimilarities between the 35 study watersheds. The dissimilarity distances were calculated using the Euclidean method.
    2. Performed hierarchical clustering analysis (HCA) on the dissimilarity matrix computed in Method 2a. I used the agnes function to compute the HCA using Ward’s method. The minimum variances between groups in Ward’s method are calculated by squaring the Euclidean distance.
    3. Validated hierarchical clustering procedure using eclust (package: factoextra) and fviz_silhouette. The latter function is just for visualizing the results from the former, the cluster validation.


The HCA resulted in a fairly successful clustering with an agglomerative coefficient of .96. The agglomerative coefficient is a measure of the clustering structure, with values closest to one representing a high degree of dissimilarity between clusters. The cluster validation demonstrated that all but two watersheds were adequately grouped (Figure 1).

Figure 1.Results from cluster analysis silhouette plot.

Figure 2. Dendritic hierarchical clustering results. Three numbers are written below each final branch: the top is the watershed ID, the middle is the percent sedimentary lithology (blue), and the bottom is the volcanic geology (purple).

The HCA resulted in two main groups with many sub branches (Figure 2). By comparing the lithology percentages to the clusters, I determined that one main branch includes watersheds dominated by volcanic lithology and the other branch includes watersheds dominated by sedimentary lithology. Eight of the 35 watersheds are primarily volcanic, 26 are primarily sedimentary, and one watershed does not fit this classification (Watershed 31: 85% surficial sediments).

Critique of Method

This method was very useful for me because it helped me characterize the lithology of my study watersheds in a quantitative way, and then identify the dissimilarities between the lithological profiles of each watershed. The results from this exercise will advance my ability to compare hydrologic regime characteristics across coastal watersheds and the driving controls on such hydrological processes. For example, the dendrogram in Figure 2 shows that there are five watersheds with 100% sedimentary lithology and five watersheds with predominantly volcanic geology. Such statistics will help me stratify my study design and guide my flow regime analysis.

Final: Stream Change Recap

My project focused on a set of grain size and cross-sectional change data from a long term research project in the Andrews Forest. (Maps and study description from Blog Post 1)

I wanted to explore two questions:

  • How do stream channel erosion, deposition, and particle size vary over time and space?
  • How do these changes relate to adjacent changes in the across-stream or along-stream direction?


I tried to address these nested hypotheses about the system:

  1. Existing stream bed morphology drives spatial patterns in cross-sectional change.
    1. Unit-scale stream morphology should lead to patchy transport and different levels of autocorrelation of the change at different spatial scales
    2. Either hillslope or hydrodynamic processes or both should result in relatively high cross sectional change near banks.
  2. Extreme flow events alter the size distribution of in-channel sediment
    1. High energy flows associated with high peak discharge events should increase median grain size
    2. Transport of hillslope material (also associated with high peak discharge events) should decrease sorting
  3. Extreme flow events reduce the relative impact of bed morphology in cross-sectional change.


Here’s a conceptual model to investigate the patterns of change

Overall, I expected that different processes would drive change at different scales, so I expected the spatial and temporal patterns of change to vary based on scale as well.


  • At reach scales: patterns of change are driven by reach level features including slope, watershed size, and land use history.
  • At unit scales and smaller, patterns of change are driven by unit-scale bed morphology and positions of adjacent features.


  • During time periods that include extreme flows, patterns of change are driven by the magnitude of the flow.
  • During time periods of less extreme flows, patterns of change are more dependent on local features as described above



Due to the study design and to limits in data availability, it made sense for me to use one-dimensional analysis methods and correlation or autocorrelation tools on much of the data. I mostly worked in R. I also started but didn’t finish work on a network model using a tool specifically designed for stream network analysis.

Technical limitations:

1-dimensional models don’t perfectly capture the complex realities of stream networks and irregularly spaced spatial data, so I ran into a few hurdles trying to parse analysis results. I also couldn’t do some of my desired analyses due to the lack of shared coordinate systems for different data sets or for adjacent cross sections. However, thinking about these analyses helped me better understand exactly which data I would need to collect in order to do them in the future.

I am still having some trouble with a stream network analysis add-on in ArcGIS, and I am in the process of reaching out to experienced users who I hope know the way around my particular error.



I started the stream network analysis partly because I wanted more context for the results of exercise 3. Some 2017 REU students collected grain size data at sites that overlapped with the long term cross sectional study sites. I haven’t finished the network analysis, but I’ve mapped the REU student data below.

Here is a map of the median grain size (D50) at the REU student sites:

Here is a map of half the difference between the 84th and 16th percentile grain size at the REU student sites. This value is one metric of sorting, and it would equal the standard deviation if the sizes were normally distributed.

These results indicate that grain sizes and sorting in the grain sizes display more complex patterns than a simple fining of grain size downstream or consistent gradations in sorting.

For all of the analyses, I need to spend a little more time figuring out which of these results are useable and might reflect reality in the field and which ones are more likely showing artifacts of data collection or processing


I hope to use these results to add more depth to my research and place my research in a broader context. I personally learned more about strengths and limitations of data set

The broader purpose of entire project is to see how streams like these respond over time. How much of their changes are influenced by hydrologic rather than hillslope processes? How big of a hydrologic event needs to happen to do substantial work on the channel? The project could be useful to land managers because channel mobility could damage streamside infrastructure or alter benthic ecology

My learning (technical):

From a technical standpoint, I think improved ggplot skills. I started using cowplot and learned more about custom themes. I learned a lot about preparing data using the STARS toolbox, and I am excited to learn more about the SSN R package once I get my data working

My learning (statistics):

In this class, I learned more about autocorrelation methods and correlograms. Hope to learn more about network analysis soon too.

Examining the Spatial Patterns of NDVI Change near ASGM and non-ASGM villages in rural Senegal


My research question for this project was to investigate how spatial patterns of land use/land cover (LULC) change were related and/or associated with spatial patterns of artisanal, small-scale gold mining (ASGM) establishment. A possible mechanism I considered as influencing these patterns would be deforestation associated with the establishment of ASGM operations, as miners require significant amounts of timber to bolster mine shafts, in addition to constructing shelters around the mine shafts and for the miners themselves. As such, my research question can be phrased thusly: “How is the spatial pattern of LULC change related to the spatial pattern of ASGM establishment via the mechanism of deforestation?” Additionally, I wanted to examine whether alternative factors may affect LULC change around sites; in this case, I examined the proximity of a site to a primary road and whether or not that related to rates of LULC change.


My dataset for this project had several components. In previous work, I used very high resolution satellite imagery to map out the locations and spatial extent of nine ASGM sites, as well as nine non-ASGM sites to compare to (these sites are shown in fig. 1). After having mapped these sites, I acquired Landsat 5 and Landsat 8 imagery at 30m spatial resolution from the USGS for April 2007 and April 2018, respectively. Using this Landsat imagery, I calculated normalized difference vegetation indices (NDVIs) for each year, and then subtracted the former from the latter to obtain a dataset showing NDVI loss between each timeframe. This dataset enabled me to examine NDVI change around each of my 18 sites to evaluate LULC change in the form of NDVI change. In addition to these datasets, I also acquired a road network layer from OpenStreetMaps for Senegal, which I selected to only include “primary” roads in Kedougou. Together, these datasets enabled to me to investigate LULC change around ASGM sites via NDVI change and also how this change relates to proximity to primary roads.

Fig 1: Study Site


My two hypotheses for this project are that 1) NDVI would be more negatively impacted near ASGM sites compared to non-ASGM sites as a result of deforestation, and 2) NDVI losses at ASGM sites could be attributed in part to each site’s proximity to primary roads.

Patterns of higher NDVI loss as a percentage of the total would occur at ASGM sites compared to non-ASGM sites. This may be because, as the larger populations near ASGM sites and also the activity itself would necessitate more timber harvesting, there would be more negative NDVI change in the immediate surroundings compared to sites without ASGM activity. The spatial process I would expect to see would be areas of higher NDVI loss nearer to the center of ASGM villages. These areas would be fragmented, indicating differences in vegetative cover around the village, but, overall, more negative than around non-ASGM villages. This pattern results from miners and villagers harvesting the nearby timber to bolster mine shafts, as well as for construction of structures around each mine and their own huts. This hypothesis is an explanation for the pattern of NDVI loss around a village.
ASGM sites may be, overall, closer to primary roads than non-ASGM sites, and so their growth and attendant NDVI loss is in part attributable to proximity to major travel corridors and the relative ease of reaching the sites. If a mine is closer to a primary road than another mine, it’s possible that this proximity may result in larger spatial patterns of NDVI loss. This hypothesis is an explanation for the process of deforestation around different sites.

The first hypothesis here is an attempt at understanding what spatial patterns may be occurring around ASGM sites; that is, it is an attempt to understand what NDVI losses may look like. The second hypothesis is an attempt to understand why those patterns may be occurring, and in finding some sort of correlation between the observed pattern and an alternative possible explanation for that pattern.

Exercise 1: Moran’s I

Before examining how patterns of NDVI loss are shaped around ASGM and non-ASGM sites, I first wanted to examine how the patterns of NDVI loss and gain were self-related; that is, I wanted to know whether or not patterns of NDVI change occur near areas of similar change (spatially autocorrelated or clustered) or whether or not these patterns occur away from self-similar patterns (dispersed). To do this, I took my layer of NDVI change and ran a script in Rstudio to generate a Moran’s I value for the global layer, which is a measure of spatial autocorrelation: it results in values between -1 and 1, where values closer to -1 indicate negative spatial autocorrelation (dispersion) and values closer to 1 indicate positive spatial autocorrelation (clustering). I also performed this analysis at a smaller scale around one mining village to see how Moran’s I may change on a different scale.

Exercise 2: Neighborhood Analysis

After determining the spatial autocorrelation of patterns of NDVI change, I could examine how those patterns were related to and varied between ASGM sites and non-ASGM sites. To do this, I performed a neighborhood analysis. At each village, I found its center using the Mean Center tool in ArcGIS Pro. Around each center point, I generated “donut” buffers using the Buffer tool at 250 meter increments, up to 2 kilometers, for eight different donuts (e.g. 0-250m, 251-500m, 501-750m, etc.). I then used each of these buffers to Clip the global NDVI layer (which I had also edited to omit values over rivers and Mali), generating eight raster layers representing NDVI change values between each spatial step. From there, I simply generated histograms for each step and counted the values below 0, indicating NDVI loss from 2007 to 2018, and found their percentage of the total. With these values, I could then see how NDVI loss as a percentage of total pixels in each donut varied with distance from the center of each village, and could more readily understand how NDVI values change with distance from both ASGM and non-ASGM villages.

Exercise 3: Pearson’s R Correlation

After examining how NDVI change across space near ASGM and non-ASGM villages, I wanted to further understand how these patterns may be related to another factor: proximity to a primary road. To do this, I used my road network layer from OpenStreetMaps showing locations of all primary roads in Kedougou and used the Near tool in ArcGIS Pro to find the shortest distance from each of my 18 sites to the nearest primary road. Then, I took only the first values of NDVI change (from 0m to 250m), and, using these values, along with each site’s distance to the nearest primary road, I performed a Pearson’s R calculation (fig. 2). This approach thus attempts to explain how values of NDVI loss around both ASGM and non-ASGM sites may be related to a site’s proximity to a primary road, providing some more information on how different rates of NDVI loss may be affected by the ability of people to access a particular village.

Fig 2: Pearson’s R Correlation. x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.


I had several results from these three approaches.

Firstly, I produced a global Moran’s I of 0.6716816. The second Moran’s I produced, at a smaller scale, was .8079745. These results helped me to understand that, generally, patterns of NDVI loss are spatially autocorrelated, meaning that areas of loss and growth are clustered near self-similar areas. However, this analysis occurred over such a large area that it is possible the global Moran’s I may be accounting more for patterns of near riparian corridors, which are of course autocorrelated, especially areas where water bodies are visible. As such, this result needs to be interpreted with caution.

Secondly, the results from my neighborhood analysis are presented in fig. 3 [insert graph of NDVI changes for fig. 3; include key in caption]. These results indicate that, on average, villages which have ASGM activity present see more NDVI loss in their immediate surroundings than non-ASGM villages, up to about 2km radially from their center. This means that, generally, villages with ASGM activities experience reduced NDVI, which may indicate vegetation loss or loss healthy vegetation, than villages without ASGM present.

Fig 3: NDVI results. Light blue lines indicate ASGM mines; the dark blue line is their average. Light red lines indicate non-ASGM villages; the dark red line is their average.

Finally, the results of my Pearson’s R correlation indicate relatively weak correlations between a village’s NDVI loss (at least, within the first ring) and its proximity to the nearest primary road. The results are shown in table 1 [insert table screen snip]. For ASGM villages, their correlation was -0.1418, whereas non-ASGM villages have a Pearson’s R of -0.0578. Both of these are weak correlations, there is some indication that ASGM villages and their attendant NDVI losses are more correlated with proximity to a primary road (i.e., the closer they are to a primary road, the more NDVI loss they experience) than non-ASGM villages.

Table 1: Pearson’s R Results


These results are significant for several reasons. These showed to me that there is an observable pattern of NDVI loss that is greater around villages which develop or engage in ASGM than in villages which do not engage in ASGM. However, there is still some work to do in order to understand the process by which NDVI varies, and it may not be explainable by a village’s proximity to a primary road. There are several possibilities that may explain the process: deforestation is more prevalent around the gold mines for reasons to do with securing timber for construction, or because an inherently larger population at gold mines necessitates more forest product harvesting. More work should be undertaken to understand these processes. On the whole, however, these results are useful for governments or NGOs which seek to help better the conditions which ASGM practitioners participate and engage in, now that they know that ASGM villages experience greater rates of NDVI loss, which may indicate environmental degradation.


This process taught me several techniques. Firstly, I had no experience at all with R and R Studio, though I do understand some coding. As such, this helped me learn more about a really useful language that I did not have any experience with. Secondly, any practice with ArcGIS Pro is good practice for me, and so I got a lot of time and work done using that piece of software. In particular, I go some good time in with the model builder to help expedite work, as generating eight different ring buffers and then clipping them to an NDVI layer for 18 different villages would have been very time-consuming otherwise. Finally, I gained some good experience performing a statistical analysis, in the form of Pearson’s R. I have very little experience performing statistics, so this was really useful for me to help understand and interpret statistical analyses and results


With regard to statistics, I learned several things. The three statistical analyses I used were Moran’s I for spatial autocorrelation, neighborhood analyses for comparing spatial patterns, and Pearson’s R for calculating a simple statistical correlation.

With regards to Moran’s I, I learned the importance of examining patterns at different scales, as a pattern which is apparent at one scale may not be so at another scale, and so interpretation and set-up must be conducted carefully.

Concerning neighborhood analyses, I learned the importance of dilligent workflow and juggling several different variables without muddying the results. This also demonstrated how to examine and structure a problem which seeks to understand change both across space and scross time.

Finally, conducting a Pearson’s R helped me to understand better how statistical analyses are constructed and what their significance is. Additionally, this taught me the value of interpreting a statistical result, as their significance may not always be indicative of anything. Null results or negative results are just as valuable as positive ones, and this analysis showed me that. Now I know more about what process to examine and how to examine them!




How are the spatial patterns of individuals’ perception of natural resource governance related to the spatial patterns of environmental restoration sites via distance and abundance of improved sites?


 Puget Sound Partnership Social Data

My data came from a survey on subjective human wellbeing related to natural environments. The sample was a stratified random sample (28% response, n= 2323) of the general public from the Puget Sound in Washington from one time period. I am specifically examining questions related to perceptions of natural resource governance. Around 1730 individuals gave location data (cross street and zip code) which have been converted to GPS points. I also have demographic information for individuals, their overall life satisfaction, and how attached and connected they feel to the environment.

Environmental Data

I have three different forms of environmental data. (1) Point locations of restoration sites in the Puget Sound. (2) A shapefile of census tract level average environmental effects (the environmental condition ranked from 1 good to 10 very poor, and (3) Land cover raster files from 1992 and 2016.

  • Individual restoration sites numbered over 12,000. They spanned many years, and point locations overlapped significantly through time.
  • The environmental effects data comes from an online mapping tool that was created in a collaboration between the University of Washington, the Washington Department of Ecology, and the Washington Department of Health. Environmental effects are based on lead risk from housing, proximity to hazardous waste treatment storage and disposal facilities (TSDSs), proximity to national priorities list facilities (Superfund Sites). Proximity to Risk Management Plan (RMP) facilities, and wastewater discharge. The combination of these effects was aggregated within census tracks to produce an environmental effects ‘Rank’ from low effects (Rank of 1) to high effects (Rank of 10).
  • The land cover files contained 25 different types of land cover. I reclassified these land cover types by combining similar types of land cover (e.g. deciduous forest and conifer forest to ‘forest’).



 1) Shorter distances between individuals and restoration sites, and greater number of restoration sites near individuals, would correlate positively with governance perceptions

2) Positive environmental conditions will correlate positively with governance perceptions

Richard Petty and John Cacioppo developed the elaboration likelihood model (1980) which asserts how individuals change their attitude based on persuasive stimuli. Perry and Cacioppo develop this idea by implementing the idea of two routes of persuasion—the central and peripheral paths—where determinants of routes are determined by motivation, ability, and nature of mental processing. The purpose of this theory is to help explain how individuals elaborate on ideas to form attitudes and how strong, or long lasting, those attitudes are. This theory may suggest that individuals will form stronger attitudes when reasons to form attitudes are stronger and more immediate. Stimuli of this kind tend to be nearer to individuals, as processing of issues far away is cognitively difficult. Research has shown that individuals have a difficult time thinking about problems on larger spatial scales, and there is an inverse relationship between how feelings of individual responsibility for environmental problems and spatial scale (as scale gets larger, feelings of responsibility go down (Uzzell, 2000).

Cognitive biases also suggest that people use heuristics to answer questions when they are unsure. One common heuristic is the saliency bias, where individuals overemphasize things that are emotionally more important, and ignore less interesting things (Kahneman, 2011). This may suggest that spatial patterns are more important than temporal patterns because what is happening now is more salient to individuals than what happening in the past. This may imply that even if environmental outcomes have improved over time, the immediacy of the current environment or things influencing the environment may have a greater effect on individual perceptions.



Statistical Analyses

Spatial Autocorrelation Analyses

Moran’s I: To compute Moran’s I, I used the “ape” library in R. I also subset my data to examine spatial autocorrelation by demographics including area (urban, suburban, rural), political ideology, life satisfaction, income, and cluster (created by running a cluster analysis on the seven variables which comprise the governance index

Correlograms: I created correlograms for the variables that were significant (urban, conservative, and liberal) using the “ncf” library and the correlog() function. These figures give a better picture of spatial autocorrelation at various distances.

Semivariograms: To create semivariograms, I used the “gstat” and “lattice” libraries which contain a function called variogram. This function takes the variable of interest along with latitude and longitude locations. The object created can then be plotted. For this analysis I used the same subsets as in the Moran’s I analysis. These figures are excluded from the results as they do not provide information beyond what the correlograms show.

Nearest Neighbor analysis: A file of restoration sites was loaded into R. The sites were points. This analysis required four libraries as the points were from different files. The libraries were: nlem, rpart, spatstat, and sf. I first had to turn my points into point objects in R (.ppp objects). I used the convenxhull.xy() function to create the ‘window’ for my point object, then I used that to run the ppp() function. I then used the nncross() function. This function produced min, max, average, and quartile distances from individuals to the nearest restoration site. I added the ‘distance’ from the nearest neighbor as a variable in a linear regression to determine an association. I used these distances (1st quartile, median, and 3rd quartile) to produce buffers. I ran spatial joins between the buffers and the restoration sites. This produced an attribute table that had a join count—the number of restoration sites within the buffer. In R I ran a linear regression with join count as an independent variable.

Geographically weighted regression: I joined individual points to my rank data, a shapfile at census tract level. This gave individuals a rank for environmental effects at their location. Initially I used the GWR tool in Arc to run the regression, but wanted the flexibility of changing parameters and an easily readable output that R provides. In R, I used two different functions to run GWR, gwr(), and gwr.basic(). The gwr() required creating a band using gwr.sel, and the gwr.basic required creating a band using bw.gwr. The difference between these functions is that gwr.basic produces p-values for the betas. I ran gwr on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals who I defined as those one standard deviation above and below the mean perception.

Neighborhood analysis: I created a dataset of the upper and lower values of my governance perceptions (one standard deviation above and below the means). I then added buffers to these points at 1 and 5 miles. I joined the buffers to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.

T-test: To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.


Kriging and IDW: I used the Spatial Analysis toolbox to preform Kriging and IDW to compare the outputs of the two techniques. I used my indexed variable of governance perceptions. The values of the variable vary from 1 to 7. I then also uploaded a shapefile bounding the sample area.

Hotspot Analysis: I used the Spatial Analysis toolbox to preform ‘hotspot analysis.’ I used my indexed variable of governance perceptions with values from 1 to 7.

Tabulate area: I used two land cover maps from 1992 and 2016 to approximate environmental condition. I loaded the raster layers into ArcGIS Pro, and reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.



What are the spatial patterns of perceptions of natural resource governance?

 Moran’s I statistic for overall governance perceptions was insignificant, indicating a lack of spatial autocorrelation at the regional scale (I value; p = 0.51). Significant autocorrelation was detected when the data were subset by demographic variables. Individuals residing in urban areas exhibited spatial autocorrelation of perceptions while individuals in suburban and rural areas did not. Lastly, individuals classified as ‘conservative’ (political ideology < 0.5) exhibited significant spatial autocorrelation for governance perceptions (I statistic: -0.0062, p = 0.0018). The spatial autocorrelation for governance perceptions was mainly restricted to short distances as shown in the correlograms below. Significant correlations for the entire study population and individuals in urban areas were detected at near distances Individuals classified as conservative exhibited significant correlation across multiple distances. No significant correlations were found among liberal individuals, suggesting a non-spatial driver mechanism may be at play.


Local patches of significantly lower perceptions (cold spots) and significantly higher perceptions (hot spots) were identified as shown in the map below. Three ‘cold spots’ were located in the areas around Port Angeles, Shelton, and the greater Everett area. Two hot spots were identified surrounding Bainbridge Island, and the city of Tacoma, which is where the Puget Sound Partnership is located.

Blue points are “cold spots” at 99%, 95%, and 90% confidence that the points reside in a cold spot. Cold spots are areas where perception is significantly lower than the average compared to those around them. Red points are “hot spots” at 99%, 95%, and 90% confidence that the points resides in a hot spot. Hot spots are areas where perception is significantly higher than the average compared to those around them. Small grey points are insignificantly different. Bounding lines are counties in Northwest Washington.

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?

The median distance for individuals from restoration site was 0.037 degrees, 1st quartile was 0.020 degrees, and 3rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was insignificant (p = 0.198), which means distance from the nearest restoration site does not influence perceptions. For each regression on the number of sites near an individual, all coefficients were negative, meaning the more sites near an individual, the more disagreeable their perception was. All produced significant results, but the effect size of number of sites near individuals was very minimal (Table 1).

Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size b p-value Effect Size (rpb)
Buffer 1 (1st quartile) -0.003 0.0181 .003
Buffer 2 (median) -0.002 0.045 .002
Buffer 3 (3rd quartile) -0.001 0.002 .003

In a geographically weighted regression (GWR) model, rank, life satisfaction, years lived in the Puget Sound, and race are significant (Table 2). All other variables were insignificant. For rank (the combination score of environmental effects) the coefficient was positive. Higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 2).

A map of residuals from the GWR confirms the results from the hotspot analysis about where high and low responses congregate.

Table 2. Regression results for environmental effects at individuals’ locations.

Variable1 b p-value2
Rank 0.077 0.007**
Life Satisfaction 0.478 <0.001**
Years Lived in the Puget Sound -0.010 0.002**
Sex -0.156 0.289
Area -0.150 0.179
Education -0.002 0.922
Income -0.022 0.622
Race 0.006 0.034*
Ideology 0.103 0.533

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level

The plot of high and low governance values at the two buffers is presented below. The black and grey curves represent respondents from the survey that were at least one standard deviation lower than the mean (the mean was neutral). The green curves represent respondents from the survey that were at least one standard deviation higher than the mean. The solid curves are the average rank at the one mile buffer, and the dotted curve is the average rank at the five mile buffer.

This figure indicates there are two peaks in average rank, at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects.


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

I used land cover change as a proxy for environmental condition over time. Land cover change differed significantly for four different land cover types between the two groups—grassland, forest, shrub/scrub, and bare land cover. Grassland cover decreased for both groups, but decreased by 5% more in the areas with individuals with below average governance perceptions. Forest cover also decreased for both group, but decrease by 1% more individuals with below average governance perceptions (the amount of forest cover in the region is very large which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below average governance perceptions. Lastly, bare land decreased for both, but decreased by 5% more for individuals in areas with individuals with below average governance perceptions (Table 3). The effect size of land cover change on perception was small for each of the significant variables with biserial correlations between .07 and .09 (Table 1).

Table 3. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

Governance Perception1
Above Average Below Average t-value p-value Effect size (rpb)
High Development 17 16 0.80 .426 .02
Medium Development 19 19 0.67 .500 .02
Low Development 11 11 0.22 .829 .00
Developed Open Space 12 13 0.21 .837 .00
Agriculture -1 0 0.74 .461 .02
Grassland -13 -18 3.03 .002 .09
Forest -10 -9 2.61 .009 .08
Shrub/Scrub 40 43 3.04 .002 .09
Wetlands 0 0 1.67 .095 .05
Unconsolidated Shore 2 2 0.10 .918 .00
Bare Land -21 -25 2.43 .015 .07
Water 0 0 0.90 .368 .03
Aquatic Beds -1 -5 1.19 .233 .03
Snow/Tundra 0 0 0.00 .997 .00

1Cell entries are percent changes of land cover from 1992 to 2016

As one land cover type goes up, another land cover type must go down. While forest overall decreased in the region for both groups over time, areas with lower perceptions saw significantly less decrease. The percentage is small, but a lot of the area in the region is forested, so the actual amount of land is large. The two maps below show land cover in 2016 (top) and 1992 (bottom) around the area of Port Angeles. This area is one that had a significant cold spot (low perceptions). From these maps, you can see a spread of the dark green color (forest), and a decrease of the yellow (grassland). Port Angeles historically was a timber community (the same could be said for the regions of the other cold spots in the region. This change in forest and grassland could be the source of negative perceptions if these communities believe the governance structures are responsible for the change in forest cover.


Overall, in this analysis, poor environmental condition relates to positive perceptions, and land cover change may provide insight into reasons for poor perceptions. Areas with negative perceptions may not directly indicate poor natural resource governance. Elsewhere, trust in natural resource governance was primarily driven by individual value orientations (Manfredo et. al. (2017). The three areas identified as cold spots in this study could be areas where individuals do not feel their values are represented by the natural resource governing systems. Conversely, hot spot area A is located near the headquarters of the Puget Sound Partnership, potentially facilitating trust, representation, and access to information that may influence the perceptions of individuals located nearby. This urban, liberal, developed area contrasts the area to the west with a highly significant cold spot (Area 2). Individuals from area two, a more rural, conservative, historic logging site likely have different value orientations than those from Area A. Similar comparisons of the other cold spots could be made to areas near them. More research should be conducted in this area to determine if value orientations have a significant influence on perceptions of natural resource governance.


Your learning


During this course I significantly advanced my knowledge in ArcGIS Pro and in R studio. I learned how to run many new tools in Arc including hotspot, geographically weighted regression, and tabulate area. I also learned new skills in troubleshooting problems. For example, I was having difficulty getting the rank data to display on my map. Another student taught me that as I imported the data from a .csv and joined it to a shapefile, I would need to create a new feature class from it before it would be able to display. I also gained knowledge in how to run more spatial analyses in R. These included many spatial autocorrelation analyses, and geographically weighted regression (which involved creating a point object!).


What did you learn about statistics?


I first and foremost learned of the existence of many types of spatial statistics. These included statistics Hotspot (Getis-Ord Gi*), spatial autocorrelation (correlogram, semivariogram, Moran’s I), geographically weighted regression (GWR). For hotspot, I learned what the hotspot clusters mean, and the best ways to pick the fixed-distance band of the neighbors the tool looks at (either through selecting the number of minimal points near it, or by looking at the z-score at the distance that shows the highest amount of spatial autocorrelation). Speaking of spatial autocorrelation, I learned what that meant in terms of the correlation of perceptions of governance between individuals depending on their distance from each other. Finally, I learned a little about GWR, including that it is necessary to plot the output to examine patterns over the space in question, and I also learned that the reason Arc does not display p-values for independent variables is not something to due with the equation, but actually due to computing power of Arc.

Final Project: Searching For Faults with Kriging


I have looked at the spatial distribution of depth to first lava and the spatial relationship between the depth to first lava and the depth to first water. I found that the distance from volcanic centers might affect the depth we find the contact with first lava. I did not find any strong correlation between first water and first lava, but that does not preclude the idea that first water correlates to deeper lava contacts.

Now I want to look at what else might be effecting the spatial distributions of lava in the region. Note, that the techniques I used for looking at lava can also be applied to looking at the spatial distribution for water in the field area.

Figure 1: The style of active faulting in the Pit River Study area (delineated roughly in green). N-S trending faults accommodate dip slip (up-down) motion, while NW-Se trending faults accommodate oblique slip motion (up-down and side to side). Structurally controlled volcanoes often lie at the elbows of intersection between the N-S and NW-SE trending faults. Modified from Blakely et al 1997.

The Northern California Field area lies at the intersection of the Walker Lane Fault Zone, the Basin and Range and the Cascadia subduction zone. For the purpose of this examination I will focus on the relationship between the spatial distribution between Walker Lane style faulting (fig 1) and the spatial distribution of the depth to first water and the depth to first lava in the region.

The topography in the region reflects the interplay between active Walker Lane Style Faulting in the region and the higher frequency volcanism that mutes it. For, example, in the northern portion of the field area, Medicine Lake Volcano, with flanks covered by volcanism less than 10 kyr (fig 2), active faults disappear beneath the volcanic edifice. They do not appear in the topography, but they exist in the subsurface. However, in regions where faults are well expressed, enough displacement has occurred to 1) be resistant to burial by periodic lavas, and 2) to either serve as conduits to lava flow, or to displace lava itself.

The Data:

Figure 2: Map of the Field area with active faults (pink lines), the Pit River (blue line), rough locations of ground water discharge (yellow circles), young lava flows (green outlines) and all of the available well data (blue dots).

The subset of the data we used in figure 2 is depicted in figure 3. Temporally they range from wells dug in the1950s to the present. It is important to note that depths to first water might have changed in the 60 years since the older well logs were recorded. For this reason I did not use many of the older well logs in my analyses.

Figure 3: The center of the township and range sections for the wells used and corresponding depths to first lava. Purple logs are the shallowest and Blue logs are the deepest.



Figure 3: Conceptual design for the project. A) Configuration before slip upon the fault. The lava flow is continuous. B) If a normal fault displaces a lava flow, then one side will go down with respect to the other. The lower block experiences deposition, and the development of a soil profile. If enough time has passed, then there will be a difference between the depth to first lava between the upper and lower block.

I predict that faults will run parallel to and coincide with large step changes in the depth to first lava.

If there are no large step changes in depth to first lava, I predict that these changes will be driven by changes in elevation and correspond to channelized lava flows rather than faults.


I used the Threshold Kriging function in Arc to make a surface that corresponded to the likelihood of the depth to first lava being less than 30 feet deep. The Kriged surface then interpolates the probability that location at other points in the map correspond to less than 30 feet deep. When the surface has a value of 1, values are close to less than 30ft deep. When it is close to zero, values are greater than 30 feet deep.

I then preformed a confusion matrix to see how my results compared to what I expected. This part I did by eye.


Figure 4: Kriged Surface with faults superimposed.



High-Low Low-Low or High-High
parallel 257 315
perpendicular 213 5

Table 1: Corresponding confusion matrix. There were 257 High-Low values that corresponded to parallel faults, but there were 315 Low-Low or High Values that also corresponded to parallel faults. See discussion below.

Figure 5: Kriged surface with contours.



257 out of 790 of the faults were gradient parallel. If, corresponded to step function in kriged depth to first lava. However, many more did not. What is happening here? If you look to figure 5, you can see that while some of the high gradients correspond to rapid changes in elevation. This block is the footwall of a range bounding normal fault. That fault is not mapped because it is no longer active. But it has large amounts of uplift upon it, and likely corresponds to a different lithology entirely and so does not fall into our hypothesis.

If I do this analysis for every lava layer I find, and can correctly link the well logs together, then this analysis could help me constrain slip upon the faults in the region. This information is good for hazards assessment in the region as PG&E has a hydroelectric dam in the area.


Learning outcomes:

Throughout the course of this class, I learned a few of the nuances of R, though my work here has just begun. I also feel like I have a better understanding of how to think about spatial processes, which I why I enrolled in the first place. Future goals include better locating my data in space. This way, many of the techniques I used in this class will pick up on actual processes, and not on the gridded spacing of my data.



Blakely, Richard et al. “Gravity Anomalies, Quaternary Vents, and Quaternary Faults in the Southern Cascade Range, Oregon and California: Implications for Arc and Backarc Evolution.” Journal of Geophysical Research: Solid Earth, vol. 102, no. B10, 1997, pp. 22513–22527.


Final Project: Western Hemlock Dwarf Mistletoe Spatial Patterns and Drivers

Western Hemlock Dwarf Mistletoe

Western hemlock dwarf mistletoe is a hemi-parasite of primarily western hemlock trees. It absorbs water, nutrients, and carbohydrates from its host. Infected branches produce small structures called aerial shoots that have minor photosynthetic capabilities, but are primarily for pollination and seed production and require a certain amount of light to emerge from an infection site. Seeds are explosively discharged from aerial shoots when fully mature. Once landing on a susceptible host branch they being germination and mechanical penetration of the host branch.

Research Question

I wanted to explore the spatial patterns of dwarf mistletoe infections after a mixed severity fire and the roles post-fire, forest structure and fire refugia play.

How does the spatial pattern of post-fire stand structure and species composition affect spatial patterns of dwarf mistletoe distribution throughout the stand through physical barriers to susceptible hosts and seed dispersal?

Description of Dataset

I have a 2.2 ha rectangular study area (Wolf Rock), just northwest of the HJ Andrews Exp. Forest that was stem mapped in 1992. Each tree has an X and Y coordinates, as well as the tree species, height, diameter, and a variety of other tree inventory related data. I only have ages for a few western hemlocks, but many more for Douglas-fir. Only one western hemlock core was identified as being over 100 years old, at 170 years. I have a polygon layer for the fire refugia that are documented on the study area. For the western hemlocks I have a presence/absence of western hemlock dwarf mistletoe as well as a severity measure. However, I am unsure of the scale and ethod of rating, so I will not be using it. The infection presence and absence are from a single measurement season.

Hypotheses and Predictions

Western hemlock dwarf mistletoe spreads easily through moderate canopy densities, where distances between trees are close to the average dispersal distance of 2-4 meters. Disturbances that create patchy gaps increase likelihood of spread because of increased light reaching infected branches and increase the rate of infection abundance because of the lack of physical barriers such as high densities of branches and foliage.

Disturbances can also remove the disease from a forested stand by killing or knocking down infected western hemlocks. Post-disturbance regeneration can enable or inhibit the parasite’s reintroduction to the stand. Non-susceptible hosts such as Douglas-fir or western redcedar regenerate readily alongside western hemlock and will intercept seeds. Some gaps are only conducive to western hemlock regeneration which will be readily infected by any surviving infected western hemlocks post-disturbance.

The Wolf Rock stand experienced two fires that created a mosaic of ~110 year old regeneration and >110 year old remnant trees in fire refugia. Remnant, infected western hemlocks survived the most recent fire in those fire refugia. From this stand structure, I have several hypotheses and predictions:

  1. Remnant, infected western hemlocks form the center of new infection centers post disturbance so infections in the regenerating susceptible hosts, will be clustered around these remnant trees.
  2. Non-susceptible hosts regulate the rate of infection spread through physical barriers to dispersal, so infection cluster size will have an inverse relationship with non-host density.
  3. Western hemlock infection spreads from a central remnant tree, so uninfected western hemlocks will have a dispersed spatial pattern from the remnant western hemlock, regardless of non-host density.
  4. Post-fire regeneration with higher western hemlock composition will have more susceptible hosts and less physical barriers to spread so infection cluster size will have a positive relationship with western hemlock composition.

Analysis Approaches

For Exercise 1,  I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area. I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees were related to the spatial pattern of fire refugia outlined in my study site. I used Geographically Weighted Regression to determine the significance of this relationship, however I did not find a significant relationship between a western hemlock, its intensity of clustering and infection status, and it’s distance to its nearest fire refugia polygon edge.

For Exercise 3, I wanted to know how the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species. I used the Kcross and Jcross functions in spatstat in R and prepared the data in ArcMap to analyze spatial relationships between trees. I found clustering between regenerating western hemlocks and non-hosts to remnant western hemlocks but the uninfected western hemlock’s spatial pattern was independent of the remnant western hemlocks.


I produced several maps that showed the spatial patterns in my study site which were helpful for understanding and investigating further relationships. I produced several charts from my exercise 3 analysis that were useful for visual representations of the relationships between trees. In Exercise 3 I also produced a map from raster addition that gave me the best visualization of where western hemlock and non-host trees were in the stand. Exercise 2 produced a map and statistical relationship but was not significant in explaining a hemlock’s infection and density status.


The biggest finding was that the fire refugia polygons are not significant for my analysis, the remnant infected hemlocks are more important explanatory variables in spatial patterns of infected trees. This supported hypothesis 1. Because refugia can be effectively defined using the “for what, from what” framework, western hemlock dwarf mistletoe refugia from fire could be delineated differently in the field focusing only on the remnant western hemlocks.

Data was not available to determine the rates of infection spread over time because I only had one season of measurements. I also could not evaluate the size of clusters because I did not have GPS points of infection center extent so I could not assess hypothesis 2 and 4 directly. However using Ripley’s K and the cross-variant I could see how the clusters changed over distance. I learned that infected, regenerating trees are going to be found closer to the remnant infected trees and that non-host trees may be blocking the spread of mistletoe into an uninfected patch because they were found clustered around remnant trees as well. This provides support for Hypothesis 1, 2, and 3.

Silvicultural prescriptions with the goal to preserve old growth forest structure, but that want to limit the amount of dwarf mistletoe in a forest can appropriately remove old infected hemlocks to preserve infection spread and extent. These prescriptions will also be able to predict future dwarf mistletoe spread. Forest operations that simulate disturbances that leave remnant hemlocks such as harvests, can incorporate spread predictions to limit regeneration being infected.

Learning From The Process

I learned a lot about the spatial analyst tools in ArcMap and how to produce easily interpretable maps and graphs. I also learned how to use several function in spatstat. I learned a lot about interpreting R outputs and spatial. Spatial autocorrelation can tell you a lot about what your data are doing but I thought it was most useful to be able to see on a map or chart what is specifically happening.

Final project: Washing out the (black) stain and ringing out the details


In order to explain my project, especially my hypotheses, some background information about this disease is necessary. Black stain root disease of Douglas-fir is caused by the fungus Leptographium wageneri. It infects the roots of its Douglas-fir host, growing in the xylem and cutting the tree off from water. It spreads between adjacent trees via growth through root contacts and grafts and long-distance via insects (vectors) that feed and breed in roots and stumps and carry fungal spores to new hosts.

Forest management practices influence the spread of disease because of the influence on (i) the distance between trees (determined by natural or planted tree densities); (ii) adjacency of susceptible species (as in single-species Douglas-fir plantations); (iii) road, thinning and harvest disturbance, which create suitable habitat for insect vectors (stumps, dead trees) and stress remaining live trees, attracting insect vectors; and (iv) forest age distributions, because rotation lengths determine the age structure in managed forest landscapes and younger trees (<30-40 years old) are thought to be more susceptible to infection and mortality by the disease.


How do (B) spatial patterns of forest management practices relate to (A) spatial patterns of black stain root disease (BSRD) infection probabilities at the stand and landscape scale via (C) the spatial configuration and connectivity of susceptible stands to infection?

In order to address my research questions, I built a spatial model to simulate BSRD spread in forest landscapes using the agent-based modeling software NetLogo (Wilensky 1999). I used Exercises 1-3 to focus on the spatial patterns of forest management classes. Landscapes were equivalent in terms of the proportion of each management class and number of stands, varying only in spatial pattern of management classes. In the exercises, I evaluated the relationship between management and disease by simulating disease spread in landscapes with two distinct spatial patterns of management:

  • Clustered landscape: The landscape was evenly divided into three blocks, one for each management class. Each block was evenly divided into stands.
  • Random landscape: The landscape was evenly divided into stands, and forest management classes were randomly assigned to each stand.


I analyzed outputs of my spatial model. The raster files contain the states of cells in forest landscapes at a given time step during one model run. States tracked include management class, stand ID number, presence/absence of trees, tree age, probability of infection, and infection status (infected/not infected). Management class and stand ID did not change during the model run. I analyzed tree states from the last step of the model run and did not analyze change over time.

Extent: ~20 hectares (much smaller than my full models runs will be)

Spatial resolution: ~1.524 x 1.524 m cells (maximum 1 tree per cell)

Three contrasting and realistic forest management classes for the Pacific Northwest were present in the landscapes analyzed:

  • Intensive – Active management: 15-foot spacing, no thinning, harvest at 37 years.
  • Extensive – Active management: 10-foot spacing, one pre-commercial and two commercial thinnings, harvest at 80 years.
  • Set-aside/old-growth (OG) – No active management: Forest with Douglas-fir in Pacific Northwest old-growth densities and age distributions and uneven spacing with no thinning or harvest.


Because forest management practices influence the spread of disease as described in the “Background” section above, I hypothesized that the spatial patterns of forest management practices would influence the spatial pattern of disease. Having changed my experimental design and learned about spatial statistics and analysis methods throughout the course, I hypothesize that…

  • The “clustered” landscape will have (i) higher absolute values of infection probabilities, (ii) higher spatial autocorrelation in infection probabilities, and (iii) larger infection centers (“hotspots” of infection probabilities) than the “random” landscape because clustering of similarly managed forest stands creates continuous, connected areas of forest managed in a manner that creates suitable vector and pathogen habitat and facilitates the spread of disease (higher planting densities, lower age, frequent thinning and harvest disturbance in the intensive and extensive management). I therefore predict that:
    • Intensive and extensive stands will have the highest infection probabilities with large infection centers (“hotspots”) that extend beyond stand boundaries.
      • Spatial autocorrelation will therefore be higher and exhibit a lower rate of decrease with increasing distance because there will be larger clusters of high and low infection probabilities when stands with similar management are clustered.
    • Set-aside (old-growth, OG) stands will have the lowest infection probabilities, with small infection centers that may or may not extend beyond stand boundaries.
      • Where old-growth stands are in contact with intensive or extensive stands, neighborhood effects (and edge effects) will increase infection probabilities in those OG stands.
    • In contrast, the “random” landscape will have (i) lower absolute values of infection probabilities, (ii) less spatial autocorrelation in infection probabilities, and (iii) smaller infection centers than the “clustered” landscape. This is because the random landscape will have less continuity and connectivity between similarly managed forest stands; stands with management that facilitates disease spread will be less connected and stands with management that does not facilitate the spread of disease will also be less connected. I would predict that:
      • Intensive and extensive stands will still have the highest infection probabilities, but that the spread of infection will be limited at the boundaries with low-susceptibility old-growth stands.
        • Because of the boundaries created by the spatial arrangement of low-susceptibility old-growth stands, clusters of similar infection probabilities will be smaller and values of spatial autocorrelation will be lower and decrease more rapidly with increasing lag distance. At the same time, old-growth stands may have higher infection probabilities in the random landscape than in the clustered landscape because they would be more likely to be in contact with high-susceptibility intensive and extensive stands.
      • I also hypothesize that each stand’s neighborhood and spatial position relative to stands of similar or different management will influence that stand’s infection probabilities because of the difference in spread rates between management classes and the level of connectivity to high- and low-susceptibility stands based on the spatial distribution of management classes.
        • Stands with a large proportion of high-susceptibility neighboring stands (e.g., extensive or intensive management) will have higher infection probabilities than similarly managed stands with a small proportion of high-susceptibility neighboring stands.
        • High infection probabilities will be concentrated in intensive and extensive stands that have high levels of connectivity within their management class networks because high connectivity will allow for the rapid spread of the disease to those stands. In other words, the more connected you are to high-susceptibility stands, the higher your probability of infection.


Ex. 1: Correlogram, Global Moran’s I statistic

In order to test whether similar infection probability values were spatially clustered, I used the raster package in R (Hijmans 2019) to calculate the global Moran’s I statistic at multiple lag distances for both of the landscape patterns. I then plotted global Moran’s I vs. distance to create a correlogram and compared my results between landscapes.

Ex. 2: Hotspot analyses (ArcMap), Neighborhood analyses (ArcMap)

First, I performed a non-spatial analysis comparing infection probabilities between (i) landscape patterns (ii) management classes, and (iii) management classes in each of the landscapes. Then, I used the Hotspot Analysis (Getis-Ord Gi*) tool in ArcMap to identify statistically significant hot- and cold-spots of high and low infection probabilities, respectively. I selected points within hot and cold spots and used the Multiple Ring Buffer tool in ArcMap to create distance rings, which I intersected with the management classes to perform a neighborhood analysis. This neighborhood analysis revealed how the proportion of each management class changed with increasing distance from hotspots in order to test whether the management “neighborhood” of trees influenced their probability of infection.

Ex. 3: Network and landscape connectivity analyses (Conefor)

I divided my landscape into three separate stand networks based on their management class. Then, I used the free landscape connectivity software Conefor (Saura and Torné 2009) to analyze the connectivity of each stand based on its position within and role in connecting the network using the Integrative Index of Connectivity (Saura and Rubio 2010). I then assessed the relationship between the connectivity of each stand and infection probabilities of trees within that stand using various summary statistics (e.g., mean, median) to test whether connectivity was related to infection probability.


As my model had not been parameterized by the beginning of this term, I analyzed “dummy” data, where infection spread probabilities were calculated as a decreasing linear function of distance from infected trees. However, the results I produced still provided insights as to the general functioning of the model and factors that will likely influence my results in the full, parameterized model.

I produced both maps and numerical/statistical relationships that describe the patterns of “A” (infection probabilities), the relationship between “A” and “B” (forest management classes), and how/whether “A” and “B” are related via “C” (landscape connectivity and stand networks).

In Exercise 1, I found evidence to support my hypothesis of spatial autocorrelation at small scales in both landscapes and higher autocorrelation and slower decay with distance in the clustered landscape than the random landscape. This was expected because the design of the model calculated probability of infection for each tree as a function of distance from infected trees.

In Exercises 2 and 3, I found little to no evidence to support the hypothesis that either connectivity or neighboring stand management had significant influence on infection probabilities. Because the model that produced the “dummy” data limited infection to ~35 meters from infected trees and harvest and thinning attraction had not been integrated into infection calculations, this result was not surprising. In my full model where spread via insect vectors could span >1,000 m, I expect to see a larger influence of connectivity and neighborhood on infection probabilities.

A critical component of model testing is exploring the “parameter space”, including a range of possible values for each parameter. This is especially for agent-based models where there are complex interactions between many individuals that result in larger-scale patterns that may be emergent and not fully predictable by the simple sum of the parts. In my model, the disease parameters of interest are the factors influencing probability of infection (Fig. 1). In order to understand how the model reacts to changes in those parameters, I will perform a sensitivity analysis, systematically adjusting parameter values one-by-one and comparing the results of each series of model runs under each set of parameter values.

Fig.1. Two of the model parameters that will be systematically adjusted during sensitivity analysis. Tree susceptibility to infection as a function of age (left) and probability of root contact as a function of distance (right) will both likely influence model behavior and the relative levels of infection probability between the three management classes.

This is especially relevant given that in Exercises 1 through 3, I found that the extensively managed plantations had the highest values of infection probability and most of the infection hotspots, likely due to the fact that this management class has the highest [initial] density of trees. For the complete model, I am hypothesizing that the intensive plantations will have the highest infection probabilities because of high frequency of insect-attracting harvest and short rotations that maintain the trees in an age class highly susceptible to infection. In the full model, the extensive plantations will have higher initial density than the intensive plantations but will undergo multiple thinnings, decreasing tree density but attracting vectors, and will be harvested at age 80, thus allowing trees to grow into a less susceptible age class. In this final model, thinning, harvest length, and vector attraction will factor in to the calculation of infection probabilities. My analysis made it clear that even a 1.5 meter difference in spacing resulted in a statistically significant difference for disease transmission, with much higher disease spread in the denser forest. Because the model is highly sensitive to tree spacing, likely because the parameters of my model that relate to distance drop off in sigmoidal or exponential decay patterns, I would hypothesize that changes in the values of parameters that influence the spatial spread of disease (i.e., insect dispersal distance, probability of root contact with distance) and the magnitude of vector attraction after harvest and thinning will determine whether the “extensive” or “intensive” forest management class will ultimately the highest levels of infection probabilities. In addition, the rate of decay of root contact and insect dispersal probabilities will determine whether management and infection within stands influence infection in neighboring stands and the distance and strength of those neighborhood effects. I would like to test this my performing such analyses on the outputs from my sensitivity analyses.


Ultimately, the significance of this research is to understand the potential threat of black stain root disease in the Pacific Northwest and inform management practices by identifying evidence-based, landscape-scale management strategies that could mitigate BSRD disease issues. While the results of Exercises 1-3 were interesting, they were produced using a model that had not been fully parameterized and thus are not representative of the likely actual model outcomes. Therefore, I was not able to test my hypotheses. That said, this course allowed me to design and develop an analysis to test my hypotheses. The exercises I completed have also provided a deeper understanding of how my model works. Through this process, I have begun to generate additional testable hypotheses regarding model sensitivity to parameters and the relative spread rates of infection in each of the forest management classes. Another key takeaway is the importance of producing many runs with the same landscape configuration and parameter settings to account for stochastic processes in the model. By only analyzing one run for each scenario, there is a chance that the results are not representative of the average behavior of the system or the full range of behaviors possible for those scenarios. For example, with the random landscape configuration, one generated landscape can be highly connected and the next highly fragmented with respect to intensive plantations, and only a series of runs under the same conditions would provide reliable results for interpretation.


(a, b) Arc-Info, Modelbuilder and/or GIS programming in Python

This was my first opportunity to perform statistical analysis in ArcGIS, and I used multiple new tools, including hotspot analysis, multiple ring buffers, and using extensions. Though I did not use Python or Modelbuilder, I realized that doing so will be critical for automating my analyses given the large number of model runs I will be analyzing. While I learned how to program in Python using arcpy in GEOG 562, I used this course to choose the appropriate tools and analyses for my questions and hypotheses rather than automating procedures I may not use again. I would now like to implement my procedures for neighborhood analysis in Python in order to automate and increase the efficiency of my workflow.

(c) Spatial analysis in R

During this course, I learned most about spatial data manipulation in R, since I had limited experience using R with spatial data beforehand. I used R for spatial statistics, data cleaning and management, and conversion between vector and raster data. I also learned about the limitations of R (and my personal limitations) in terms of the challenge of learning how to use packages and their functions when documentation is variable in quality and a wide variety of user-generated packages are available with little reference as to their quality and reliability. For example, for Exercise 2, I had trouble finding an up-to-date and high-quality package for hotspot analysis in R, with raster data or otherwise. However, this method was straightforward in ArcMap once the data were converted from raster to points. For Exercise 1, the only Moran’s I calculation that I was able to perform with my raster data was the “moran” function in the raster package, which does not provide z- or p-values to evaluate the statistical significance of the calculated Moran’s I and requires you to generate your own weights matrices, which is a pain. Using the spdep or ncf packages for this analysis was incredibly slow (though I am not sure why), and the learning curve for spatstat was too steep for me to overcome by the Exercise 1 deadline (but I hope to return to this package in the future).

Reading, manipulating, and converting data: With some trial and error and research into the packages available for working with spatial data in R (especially raster, sp/spdep, and sf), I learned how to quickly and easily convert data between raster and shapefile formats, which was very useful in automating the cleaning and preparation for multiple datasets and creating the inputs for the analyses I want to perform.

(d) Landscape connectivity analyses: I learned that there are a wide variety of metrics available through Fragstats (and landscapemetrics and landscapetools packages in R), however, I was not able to perform my desired stand-scale analysis of connectivity because I could not determine whether it is possible to analyze contiguous stands with the same management class as separate patches (Fragstats considered all contiguous cells in the raster with the same class to be part of the same patch). Instead, I used Conefor, which has an ArcMap extension that allows you to generate a node and connection file from a polygon shapefile, to calculate relatively few but robust and ecologically meaningful connectivity metrics for the stands in my landscape.


Correlograms and Moran’s I: For this statistical method, I learned the importance of choosing meaningful lag distances based on the data being analyzed and the process being examined. For example, my correlogram consists of a lot of “noise” with many peaks and troughs due to the empty cells between trees, but I also captured data at the relevant distances. Failure to choose appropriate lag distances means that some autocorrelation could be missed, but analyses of large raster images at a high resolution of lag distances results in very slow processing. In addition, I wanted to compare local vs. global Moran’s I to determine whether infections were sequestered to certain portions of the landscape or spread throughout the entire landscape, but the function for local Moran’s I returned values far outside the -1 to 1 range of the global Moran’s I. As a result, I did not understand how to interpret or compare these values. In addition, global Moran’s I did not tell me where spatial autocorrelation was happening, but the fact that there was spatial autocorrelation led me to perform a…

Hotspot analysis (Getis-Ord Gi*): It became clear that deep conceptual understanding of hypothesized spatial relationships and processes in the data and a clear hypothesis are critical for hotspot analysis. I performed multiple analyses with difference distance weighting to compare the results, and there was a large variation in both the number of points included in hot and cold spots and the landscape area covered by those spots between the different weighting and distance methods. I ended up choosing the inverse squared distance weighting based on my understanding of root transmission and vector dispersal probabilities and because this weighting method was the most conservative (produced the smallest hotspots). The confidence level chosen also resulted in large variation in the size of hotspots. After confirming that there was spatial autocorrelation in infection probabilities, using this method helped me to understand where in the landscape these patterns were occurring and thus how they related to management practices.

Neighborhood analysis: I did not find this method provided much insight in my case, not because of the method itself but because of my data (it just confirmed the landscape pattern that I had designed, clustered vs. random) and my approach (one hotspot and one coldspot point non-randomly selected in each landscape. I also found this method to be tedious in ArcMap, though I would like to automate it, and I later learned about the zonal statistics tool, which can help make this analysis more efficient. In general, it is not clear what statistics I could have used to confirm whether results were significantly different between landscapes, but perhaps this is an issue caused by my own ignorance.

Network/landscape connectivity analyses: I learned that there are a wide variety of tools, programs, and metrics available for these types of analyses. I found the Integrative Index of Connectivity (implemented in Conefor) particularly interesting because of the way it categorizes habitat patches based on multiple attributes in addition to their spatial and topological positions in the landscape. The documentation for this metric is thorough, its ecological significance has been supported in peer-reviewed publications (Saura and Rubio 2010), and it is relatively easy to interpret. In contrast, I found the number of metrics available in Fragstats to be overwhelming especially during the data exploration phase.


Robert J. Hijmans (2019). raster: Geographic Data Analysis and Modeling. R package version 2.8-19.

Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Wilensky, U. (1999). NetLogo. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Final Project: Examining the relationship between rurality and colorectal cancer mortality in Texas counties

1.The research question that you asked.

My initial research question was about creation and comparison of rural indices for Texas. I did end up making a rural index for Texas, but instead of creating multiple indices and comparing the importance of different rural indicators, I incorporated an outcome variable, colorectal cancer (CRC) mortality, for Exercise 2 and 3. So my final research question ended up being as follows: how does the spatial pattern of CRC mortality in Texas relate to proxy measures of rurality and a constructed rural index for counties in Texas?


2. A description of the dataset you examined, with spatial and temporal resolution and extent.

For my project, I utilized Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


3. Hypotheses: predictions of patterns and processes you looked for.

Based on my revised research question I remade before exercise 2, I expected 2 major spatial patterns for my analyses: 1) areas immediately surrounding high CRC mortality counties to have higher “rural” values in 3 rural indicator variables: household income, population density, and percent land developed, with more “urban” values as distance increases away from the counties and 2) CRC mortality counts to significantly increase as county rural index values (from the 3 indicator variables) became more rural. I chose rural indicator variables based both on true measures of where population centers in Texas are located (population density and land development) and a well-established marker of rural status (income). For spatial pattern 1 from my neighborhood analysis, I expected there to be uniform shifts in buffer means towards more “urban” values in each of the 3 indicator variables as distance increased away from the high CRC counties. I expected this effect because counties with high CRC mortality are commonly rural, so areas in counties surrounding them with lower CRC mortality may show increasingly urban indicator values. For spatial pattern 2, I expected CRC mortality to increase as county rural index values became more rural in Texas. I expected this pattern because CRC mortality has been linked to rural status and various indicator variables in previous research, though a weighted rural index has never been used for Texas cancer data.


4. Approaches: analysis approaches you used.

For my analyses I used the following approaches:

Exercise 1: I utilized principal component analysis (PCA) to construct a rural index for Texas counties using 3 indicator variables.

Exercise 2: I did a neighborhood analysis of high CRC mortality counties by creating multi-ring buffers around centroids of 4 Texas counties, intersected the buffers with county polygons containing rural indicator variable values, and calculated buffer means for indicator variables for each donut around each county.

Exercise 3: I completed a Poisson regression of CRC mortality counts (Y) and my constructed rural index (X) to examine the effect of rural status on CRC mortality for Texas counties.


5. Results: what did you produce — maps? statistical relationships? other?

For all 3 exercises, my results included maps (Texas county maps, buffer maps, standardized residual maps), statistical relationships (p-values, coefficients, etc.), and plotting of statistical relationships (biplots, influence plots, line plots). I produced the maps in Arc Pro, while all other visualizations were produced in R.


6.What did you learn from your results? How are these results important to science? to resource managers? + software/technical limitations

I believe exercise 1 displays the effectiveness of PCA for construction of rural indices. More deductive methods for rural classification are very much needed in rural health, and I believe this method could improve detection and prevention of rural health disparities.

In exercise 2, the 4 high CRC mortality counties did not all follow the expected rural indicator spatial “gradient” that I expected. Two of the counties exhibited increasing urban scores as distance increased away from the counties, while the other two showed the opposite pattern. I think this result could be due to the arbitrary distances I chose for the buffers around the counties and the modifiable areal unit problem of utilizing county indicator variable data instead of more spatially defined values. Also, significant regionality in Texas could exist for the indicator variables I chose, as the 4 counties were not located in the same regions of Texas. This could have affected the relationships I found in each of the counties.  For example, certain rural regions in Texas may have lower or higher household income than other rural regions of Texas due to factors such as available jobs or regional/local governmental policies. Also, there are likely other variables that are indicators of rurality that I could have included in the analysis that may have more consistent spatial patterns.

In exercise 3, the results from the Poisson regression followed statistical pattern I expected: as county CRC mortality increases, rural index scores become more rural. I believe my results show important introductory associations between CRC mortality and rurality in Texas that indicate further and deeper study into these associations should be considered. To show the exercise 3 results, I utilized a map of standardized residuals in Arc and statistical plots in R.

The technical limitations of my analyses were mainly due to extensive missing cancer data for the state of Texas. This missingness was due to data suppression by the CDC’s National Vital Statistics System in order to maintain confidentiality of cancer cases. I did not have any obvious software issues besides difficulty with the Calculate Field tool in Arc Pro, where the tool would consistently fail when using more complicated formulas for data transformations. I also had some problems when joining Excel table to attribute tables, where Arc would not show symbology for the newly-joined data until exporting to geodatabase and restarting the program.


7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

I believe I greatly improved my skills in ArcGIS this term, especially in data wrangling and cleaning to convert my data to formats that maximize the visualization. Also, I feel I became much better at using key plotting packages in R, such as ggplot2.


8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I believe the major statistical and spatial skills I improved in this course include PCA, neighborhood analysis, Poisson regression, and (especially) regression diagnostics. I had not used PCA in R before this term and feel very comfortable using it going forward in my research on rural health. The many assumptions I had to follow for the Poisson regression in exercise 3 improved my ability to run diagnostics in R and create intuitive assumption plots.


Final project synthesis of parts 1, 2, and 3 on classification of Blackleg on turnip


The objective of this research was to find a way to attribute proper classification of diseased versus non-diseased regions on a turnip leaf. To do this I assessed the classification accuracy of the unsupervised machine learning classification algorithm, segmentation, with manual classification. Additionally, I hoped to reveal some spatial characteristics about the lesions on the leaf for future work.



The cumulative data set is roughly 500 turnip leaves, harvested here in the Willamette valley during the months of February to April of 2019. For the sake of this project I have selected 5 images in which I have worked with to complete the basis of this analysis. The spatial resolution is just under 1 mm and images are in raster format of single turnip leaves with five bands (blue, green, red, far-red, NIR). For this analysis only, the red band was used. The camera being used is a Micasense Red-edge which has a 16-bit radiometric resolution.



These were not all questions that I had anticipated answering at the beginning of the term, but have developed over the course of the term as I ran into issues or found intriguing results.

• Can diseased patches be successfully identified and isolated by computer image classification and through manual classification?
If yes;
o can the diseased patches based on image classification and manually classified diseased patches be characterized?
o How accurate is the segmentation classification?
o Are more raster cells being classified as false negatives or false positives when misclassified?
o What is going undetected in segmentation?



My original hypothesize was that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves.


Approaches and results

I began in ArcGIS Pro by using the: segmentation -> mask -> clip -> raster to polygon -> merge -> polygon to raster. This yielded a layer which included only the computer generated segmented diseased patches which were exported as Tiff files. I followed a similar process with my manually classified diseased patches but used the ‘training sample manager’ instead of segmentation. Both these Tiff files were uploaded to Fragstats where I did patch analysis between the two to characterize the diseased patches. I found this not entirely helpful, but if I dug deeper into patch analysis I could be provided with some worthwhile information about the diseased lesion characteristics.

After characterizing the patches, I performed an error analysis using a confusion matrix in R. The two Tiff files were uploaded to R and converted to rasters to check quality and then to tables. The tables were imported into excel and values were adjusted to 1 for diseased patches and 0’s for non-diseased. Anything outside the area of the leaf was deleted. The tables were imported back to R and a confusion matrix comparing the computer classified segmentation versus the manually classified patches was conducted. All the classification matrices were combined for the overall accuracy (Table 1). This aspect was a bit time consuming as was the original image processing but was a very useful analysis.

One other approach I used to help with visualization was showing the leaf with false positives and false negatives included (Image 1). This was necessary to help explain results more clearly to others and show results outside of the statistics.



I have created a method for getting from image acquisition all the way to statistical results. This work is first off significant to my thesis research. This will be something I refine to help streamline but is a very good starting point for me. Second, I got statistically significant results based on p-value and a high accuracy. This will be helpful to those in agronomy or crop consulting which are using remote sensing for disease detection. At this point I don’t think it is practical for application of agronomist but with further work I hope to make this methodology an available resource for other researchers to work off of and for farmers to potentially try and utilize.


Hypothesis revised

After receiving these results and considering my next steps, my hypothesis has been refined. I now hypothesize that segmentation classification has an equal ability to classify diseased patches as manual classification because of the spectral values associated with diseased pixels.


New research questions

This research is far from complete and has led to a new set of questions to be answered.

• Is percent classification of diseased pixels the best metric for how well the classification method works?
• What is the accuracy of a support vector machine versus manually classified pixels?
• Is this classification translatable to other brassica species such as canola, radish, kale, etc.?


Future research

To answer some of my new questions I believe it is critical I learn how to create band composites between the five available bands. Currently, my issue is the bands do not overlap. My approach for solving this includes the use of georeferenced points/pixels on each image to ensure they overlap. I think if I use the 4 corners of the image I should get the overlap I want. By creating band composites, I can begin using vegetative indices like NDVI to help more accurately distinguish pixels than can be done with one band alone.

Another concern I have is despite classifying pixels accurately I am weary about this being an all determining method of classification. I have instances in my current classification where only 2 or 3 of the 6 or so lesions are actually having pixels inside its perimeter classified as diseased. I think a more accurate or a helpful assessment would quantify the percent of lesions that are captured by the segmentation process out of the total number of lesions on the leaf.

Using an unsupervised classification scheme for this work was good to lay the groundwork. In future work I would like to move to supervised machine learning classification schemes. My reason being; when attempting to translate this algorithm used on turnip classification to others such as canola, radish, etc. I believe it will have a higher level of accuracy and consider more spatial characteristics. Currently, my classification scheme only uses the spectral values of the red band where I would like to implement variables such as lesion size, proximity from margin, proximity to other lesions, shape, NDVI values, etc. This works into my last question concerning the ability of the classification to be applied to broader ranges of plants.


Technical and software limitations

Some of my limitations are apparent in above sections because they led to my future research questions. One of my greatest limitations was my ability to effectively and efficiently use ArcGIS Pro as well as code in R. This led to longer measures of time conducting image processing which may have been streamlined if I was better in ArcGIS Pro. The major roadblocks I hit in ArcGIS Pro were related to the overlap of different bands that I mentioned earlier in addition to the use of the support vector machine. I have plans to effectively solve the band composite issue, but am still unsure about the support vector machine function in ArcGIS Pro. I believe the way it was setup is not designed for collecting training data for many images but collecting this data from one image. My alternative for the SVM is to try conducting the analysis in R.

In R I struggled in my ability to code. This was a major strain on my time and required the assistance of others to help troubleshoot on many occasions. This can be overcome with more practice and is not so much a limitation of the software, but the user.


My learning

I learned the most in ArcGIS Pro I would say. The image processing took lots of time and many steps which exposed me to a whole set of tools and analysis I was previously unaware of. I don’t think my skills in R increased much but I now have the code and steps for the analysis I would like to complete on my future thesis work, the confusion matrix. Another program I gained some experience in was Fragstats. It was useful for characterizing the diseased patches and may be an option for some of my future work when looking at the size, shape distance, etc. of these diseased patches.



Table 1. Confusion matrix for pixel classification of five leaves.


Image 1. Example of one leaf analysis where the green cells are manually classified non-diseased pixels, white pixels inside the leaf are segmentation classified diseased pixels and block pixels are the manually classified diseased pixels.

Final Project: San Diego Bottlenose Dolphin Sighting Distributions

Final Project: San Diego Bottlenose Dolphin Sighting Distributions

The Research Question:

Originally, I asked the question: do common bottlenose dolphin sighting distances from shore change over time?

However, throughout the research and analysis process, I refined this question for a multitude of reasons. For example, I planned on using all of my dolphin sightings from my six different survey locations along the California coastline. Because the bulk of my sightings are from the San Diego survey site, I chose this data set for completeness and feasibility. Additionally, this data set used the most standard survey methods. Rather than simply looking at distance from shore, which would be at a very fine scale, seeing as all of my sightings are within two kilometers from shore, I chose to try and identify changes in latitude. Furthermore, I wanted to see if changes in latitude (if present, were somehow related to the El Nino Southern Oscillation (ENSO) cycles and then distances to lagoons). This data set also has the largest span of sightings by both year and month. When you see my hypotheses, you will notice that my original research question morphed into much more specific hypotheses.

Data Description:

My dolphin sighting data spans 1981-2015 with a few absent years, and sightings covering all months, but not in all years sampled. The same transects were performed in a small boat with approximately a two kilometer sighting span (one kilometer surveyed 90 degrees to starboard and port of the bow). These data points therefore have a resolution of approximately two kilometers. Much of the other data has a coarser spatial resolution, which is why it was important to use such a robust data set. The ENSO data I used gave a broad brushstroke approach to ENSO indices. Rather than first using the exact ENSO index which is at a fine scale, I used the NOAA database that split month-years into positive, neutral, and negative indices (1, 0, and -1, respectively). These data were at a month-year temporal resolution, which I matched to my month-date information of my sighting data. Lagoon data were sourced from the mid-late 2000s, therefore I treated lagoon distances as static.


H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015 due to prey distributions shifting northward and less human activities in the northward sections of the transect. I predicted that during warm (positive) ENSO months, the dolphin sightings in San Diego would be distributed more northerly, predominantly with prey aggregations historically shifting northward into cooler waters, due to (secondarily) increasing sea surface temperatures. I expect the spatial gradient to shift north and south, in relation to the ENSO gradient (warm, neutral, or cold)

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes.

Map with bottlenose dolphin sightings on the one-kilometer buffered transect line and the six major lagoons in San Diego.


I utilized multiple approaches with different software platforms including ArcMap, qGIS, GoogleEarth, and R Studio (with some Excel data cleaning).

  • Buffers in ArcMap
  • Calculations in an attribute table
  • ANOVA with Tukey HSD
  • Nearest Neighbor averages
  • Cluster analyses
  • Histograms and Bar plots


I produced a few maps (will be), found statistical relationships between sightings and distribution patterns,  ENSO and dolphin latitudes, and distances to lagoons.

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

True: The results of the average nearest neighbor spatial analysis in ArcMap 10.6 produced a z-score of -127.16 with a p-value of < 0.000001, which translates into there being less than a 1% likelihood that this clustered pattern could be the result of random chance. Although I could not look directly at prey distributions because of data availability, it is well-known that schooling fishes exist in clustered distributions that could be related to these dolphin sightings also being clustered. In addition, bottlenose dolphins are highly social and although pods change in composition of individuals, the dolphins do usually transit, feed, and socialize in small groups. Also see Exercise 2 for other, relevant preliminary results, including a histogram of the distribution in differences of sighting latitudes.

Summary from the Average Nearest Neighbor calculation in ArcMap 10.6 displaying that bottlenose dolphin sightings in San Diego are highly clustered.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015 due to prey distributions shifting northward and less human activities in the northward sections of the transect. With this, I predicted that during warm (positive) ENSO months, the dolphin sightings in San Diego would be distributed more northerly, predominantly with prey aggregations historically shifting northward into cooler waters, due to (secondarily) increasing sea surface temperatures. I expect the spatial gradient to shift north and south, in relation to the ENSO gradient (warm, neutral, or cold).

False: the sightings are more clumped towards the lower latitudes overall (p < 2e-16), possibly due to habitat preference. The sightings are closer to beaches with higher human densities and human-related activities near Mission Bay, CA. It should be noted, that just north of the San Diego transect is the Camp Pendleton Marine Base which conducts frequent military exercises and could deter animals.

I used an ANOVA analysis and found there was a significant difference in sighting latitude distributions between monthly ENSO indices. A Tukey HSD was performed to determine where the differences between treatment(s) were significant. All differences (neutral and negative, positive and negative, and positive and neutral ENSO indices) were significant with p < 0.005.

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes. See my Exercise 3 results.

Using a histogram, I was able to visualize how distances to each lagoon differed by lagoon. That is dolphin sightings nearest to, Lagoon 6, the San Dieguito Lagoon, are always within 0.03 decimal degrees. In comparison, Lagoon 5, Los Penasquitos Lagoon, is distributed across distances, with the most sightings at a great distance.

Bar plot displaying the different distances from dolphin sighting location to the nearest lagoon in San Diego in decimal degrees. Note: Lagoon 4 is south of the study site and therefore was never the nearest lagoon.

After running an ANOVA in R Studio, I found that there was a significant difference between distance to nearest lagoon in different ENSO index categories (p < 2.55e-9) with a Tukey HSD confirming that the significant difference in distance to nearest lagoon being significant between neutral and negative values and positive and neutral years. Therefore, I gather there must be something happening in neutral months that changes the distance to the nearest lagoon, potentially prey are more static or more dynamic in those years compared to the positive and negative months. Using a violin plot, it appears that Lagoon 5, Los Penasquitos Lagoon, has the widest span of sighting distances when it is the nearest lagoon in all ENSO index month values. In neutral years, Lagoon 0, the Buena Vista Lagoon has more than a single sighting (there were none in negative months and only one in positive months). The Buena Vista Lagoon is the most northerly lagoon, which may indicate that in neutral ENSO months, dolphin pods are more northerly in their distribution.

Takeaways to science and management: 

Bottlenose dolphins have a clustered distribution which seems to be related to ENSO monthly indices, with certain years having more of a difference in distribution, and likely, their sociality on a larger scale. Neutral ENSO months seem to have a different characteristic that impact sighting distribution locations along the San Diego coastline. More research needs to be done in this to determine what is different about neutral months and how this may impact this dolphin population. On a finer scale, the six lagoons in San Diego appear to have a spatial relationship with dolphin pod sighting distributions. These lagoons may provide critical habitat for bottlenose dolphin preferred prey species or preferred habitat for the dolphins themselves either for cover or for hunting, and different lagoons may have different spans of impact at different distances, either by creating larger nutrient plumes, or because of static, geographic and geologic features. This could mean that specific areas should be protected more or maintain protection. For example, the Batiquitos and San Dieguito Lagoons have some Marine Conservation Areas with No-Take Zones. It is interesting to see the relationship to different lagoons, which may provide nutrient outflows and protection for key bottlenose dolphin prey species. The city of San Diego and the state of California are need ways to assess the coastlines and how protecting the marine, estuarine, and terrestrial environments near and encompassing the coastlines impact the greater ecosystem. Other than the Marine Mammal Protection Act and small protected zones, there are no safeguards for these dolphins.

My Learning: about software (a) Arc-Info and b) R

  1. a) Arc-Info: buffer creation, creating graphs, nearest neighbor analyses. How to deal with transects, certain data with mismatching information, conflicting shapefiles
  2. b) R: I didn’t know much, except the basics in R. I learned about how to conduct ANOVAs and then how to interpret results. Mainly I learned about how to visualize my results and use new packages.

My Learning: about statistics

Throughout this project I learned that spatial statistics requires clear hypothesis testing in order to clearly step through a spatial process. Most specifically, I learned about spatial analyses in ArcMap, and how I could utilize nearest neighbor calculations to assess distribution patters. Furthermore, I now have a better understanding of spatial distribution patterns and how they are assessed, such as clustering versus random versus equally dispersed distributions. For more data analysis and cleaning, I also learned how to apply my novice understanding of ANOVAs and then display results relating to spatial relationships (distances) using histograms and other graphical displays in R Studio.


Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Manipulating salinity to create a better fit habitat suitability model for Olympia oysters

Follow-up from Exercise 2
In Exercise 2, I compared Olympia oyster location data to the model of predicted suitable habitat that I developed in Exercise 1. Results from that analysis showed that 13 of the 18 observations within or near areas of high-quality habitat (type 4) indicated the presence of Olympia oysters (72%) versus 5 locations where oysters were not found (28%). No field survey locations fell within areas of majority lowest quality habitat (type 1). Seven observations were found within the second lowest quality habitat type (2), with 2 of those observations indicating presence (29%) and 5 indicating absence (71%).

Habitat suitability
4 3 2 1
Presence 13 [0.72] 4 [0.4] 2 [0.29] 0 [0]
Absence 5 [0.28] 6 [0.6] 5 [0.71] 0 [0]
Total (n = 35*) 18 [1.0] 10 [1.0] 7 [1.0] 0 [0]

*3 data points removed from analysis due to inconclusive search results.

To expand on this analysis, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1 since there were not observations in this category, and type 3 since it fell in between high and low-quality habitat.

Habitat suitability
4 (high) 2 (low)
Presence 0.37 0.06
Absence 0.14 0.14

Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The habitat suitability model predicted that oysters would be present within the highest quality habitat type and absent in low-quality habitat. The confusion matrix shows that the model was successful in predicting that 37% of the total observations where oysters were present were found within habitat type 4 (high), and 14% of the observations where oysters were absent were found in habitat type 2 (low).

In the type 4 habitat, 14% of the total observations found that oysters were absent, which goes against the model prediction. I suspect this is partly due to the patchy nature of substrate availability in Yaquina Bay and the low-resolution quality of the substrate raster layer used for analysis. For the 6% of observations that show oyster presence within habitat type 2, it’s possible that these points were juvenile oysters that were able to settle in year-1, but are less likely to survive into adulthood. Both of these errors could also indicate issues with the weights assigned in the model back in Exercise 1.

Question asked
For exercise 3, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations and account for the errors found in exercise 2. Here I asked:

Can the spatial pattern of Olympia oyster location data be more accurately described by manipulating the spatial pattern of one of the parameters of suitable habitat (salinity)?

I decided to modify the rank values of one of the model parameters: salinity. Based on my experience collecting oyster location data in the field, it seemed that salinity was the biggest influence in where oysters would be found. It was also the easiest parameter to change since it had the fewest rank categories. The excerpt below comes from the ranking value table I established for the habitat parameters in Exercise 1. Changes to rank value for salinity are indicated in the right-most column.

Habitat parameter Subcategories Subcategory variable range Olympia oyster tolerance Rank value applied
Mean wet-season salinity (psu) Upper estuary < 16 psu somewhat, but not long-term 1 –> 2
Upper mid estuary 16.1 – 23 psu yes 4 –> 3
Lower mid estuary 23.1 – 27 psu yes 3 –> 4
Lower estuary > 27 psu somewhat 2 –> 1

Name of tool or approach
I combined my approach from exercise 1 and exercise 2 to create a different model output based on the new rank values applied to the salinity parameter. The analysis was completed in ArcGIS Pro and the table of values generated was reviewed in Excel.

Brief description of steps to complete the analysis

  1. After assigning new rank values to the salinity parameter, I applied a new ‘weighted overlay’ to the salinity raster layer in ArcGIS. As I did in exercise 1, I used the ‘weighted overlay’ tool again to combine the weighted substrate and bathymetry layers with the updated salinity layer. A new map of suitable habitat was created based on these new ranking values.
  2. Then, I added the field observation data of oyster presence/absence to the map and created a new map of all the data points overlaid on habitat suitability.
  3. I then created buffers around each of the points using the ‘buffer’ tool. In the last analysis, I used the ‘multiple ring buffer’, but was only able to analyze the largest buffer (300m). This time, I created only the one buffer around each point.
  4. Using the ‘Zonal Statistics’ tool, I overlaid the newly created buffers on the updated raster of habitat suitability and viewed the results. I again chose ‘majority’ as my visual represented statistic, which categories the buffer based on the habitat suitability type occupying the largest area.
  5. I also created a results table using the ‘Zonal Statistics as Table’ tool, then copied it over to Excel for additional analysis.

An updated table based on manipulated salinity rank values was generated to compare to the table created from exercise 2 and displayed at the top of this blog post. Results from this analysis showed that only 2 of the 35 total observations fell within or near areas of high-quality habitat (type 4), one indicated presence and the other absence. The adjustments to the salinity rank value allowed the habitat type 3 to dominate the map, with 31 of the total 35 observations falling in this category. Of the 31 points, 18 showed presence data (58%) and 13 were absence data (42%). Again, no field survey locations fell within areas of majority lowest quality habitat (type 1). Two observations were found within the second lowest quality habitat type (2), both indicating absence (100%).

Habitat suitability
4 3 2 1
Presence 1 [0.5] 18 [0.58] 0 [0] 0 [0]
Absence 1 [0.5] 13 [0.42] 2 [1.0] 0 [0]
Total (n = 35) 2 [1.0] 31 [1.0] 2 [1.0] 0 [0]

Again, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. I removed habitat suitability type 1 since there were not observations in this category.

Habitat suitability
4 (high) 3 2 (low)
Presence 0.03 0.51 0
Absence 0.03 0.37 0.06


Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category, with a near even split between presence (18) and absence (13). In reference to the confusion matrix from exercise 2 at the top of this blog, it is difficult to make a direct comparison of the errors since most of the observations fell into type 3.

Critique of the method
I was surprised to see how drastically the map of suitable habitat changed by manipulating only one of the habitat parameters. The adjustment of the rank values for salinity resulted in a vast reduction in area attributed to the highest quality habitat (type 4). The results indicate that choosing the salinity parameter to manipulate did not result in a better fit model and that changes to salinity rank values were too drastic. Since the salinity parameter contains only 4 subcategories, or 4 different weighted salinity values, the impacts to the habitat suitability map were greater than if the parameter had had more nuance. For example, the bathymetry parameter has 10 subcategories and a reworking of the ranking values within could have made more subtle changes to the habitat suitability map.

The next steps would be to examine another parameter, either substrate or bathymetry, to see if adjustments to ranking values result in a more appropriate illustration of suitable habitat. Additionally, the collection of more oyster location data points will help in creating a better fit model and understanding the nuances of suitable habitat in Yaquina Bay.


Deaggregation of multi-hazard risks, losses, and connectivity: An application to the joint earthquake-tsunami hazard at Seaside, OR

Research Question

The Pacific Northwest is subject to the rupture of the Cascadia Subduction Zone (CSZ) which will result in an earthquake and nearfield tsunami. Low-lying communities along the coast (such as Seaside, OR) are susceptible to infrastructure damage from both earthquake ground shaking and tsunami inundation. Given resource constraints (budget, personnel, time, etc.), it is not feasible for city planners and resource managers to expect to mitigate all possible damages resulting from an earthquake/tsunami event; however, it is possible to optimize resources in order to minimize the expected damages. Consequently, a first step toward resource optimization is a thorough understanding of the risks posed by the CSZ.

For this project, I investigated how the spatial pattern of tax-lot damage and connectivity to critical infrastructure following a multi-hazard earthquake/tsunami in Seaside, OR can provide insight into possible hazard mitigation measures. The damages and connectivity were deaggregated by both hazard (earthquake vs. tsunami) as well as intensity of the event. An understanding of the deaggregated hazard provides insight into vulnerable areas within Seaside.

Description of Dataset

My dataset consists of maps of the built environment (or infrastructure) and hazard maps (earthquake ground shaking and tsunami inundation) for Seaside, OR. The infrastructure is composed of buildings, an electric power network, a transportation network, and a water supply network (Figure 1). The earthquake and tsunami hazard maps were previously generated through numerical modeling (Park, Cox, Alam, & Barbosa, 2017). Due to computational limitations, tsunami inundation is modeled using a “bare-earth” model indicating that no infrastructure is included.

In addition to the infrastructure and hazard maps, a suite of probabilistic damage codes were utilized. These damage codes implement Monte-Carlo methods to evaluate the expected damages, economic losses, and connectivity associated with earthquake/tsunami hazards.

Figure 1: Infrastructure at Seaside (Kameshwar et al., n.d.)



The project was divided into three exercises. The questions posed by each exercise, and hypotheses are outlined below:

  1. What is the spatial pattern of economic losses resulting from a joint earthquake/tsunami event? How does each hazard contribute to this spatial pattern?
    1. Earthquakes and tsunamis pose different hazards to the built environment. The former results in strong ground shaking whereas the latter results in inundation. Tsunami inundation is much more spatially variable due to limiting characteristics such as ground elevation and friction losses. Conversely, earthquake ground shaking is not limited by elevation or friction and is subsequently not as spatially variable within a geographic region (especially at scales the size of Seaside). Because of these differences in hazardous conditions, I expect the spatial pattern of tsunami losses to be concentrated along the coast, whereas the spatial pattern of economic losses will be dispersed around Seaside.


  1. How does the spatial pattern of economic losses relate to the spatial pattern of tsunami momentum flux?
    1. Because there is significantly more spatial variability in the tsunami hazard compared to the earthquake hazard, I wanted to investigate how this spatial pattern relates to the spatial pattern of tsunami economic losses. Economic losses are driven by the intensity of the hazard, therefore, I expect a significant correlation between economic losses and tsunami momentum flux (a measure of both inundation depth and velocity).


  1. How vulnerable is Seaside’s networked infrastructure to a joint earthquake/tsunami event?
    1. While economic losses to infrastructure can play a role in hazard mitigation planning, it should not serve as the only driver. Mitigation planning should also (and perhaps more importantly) consider resident accessibility to critical services and facilities immediately following a disaster. Given the importance of resident accessibility, I wanted to perform a connectivity analysis of networked infrastructure (electricity, transportation, and water) following a joint earthquake/tsunami event. The networked infrastructure in Figure 1 shows that failure of a few key links could severely limit large populations of Seaside (g. the network is not highly “parallel”, but exhibits some “series” features). For example, failure of the pipes leading to the water treatment plant (Figure 1-d) would result in complete disconnectivity of water to Seaside. Given the structure of the networks at Seaside, I expect sharp increases in the disconnectivity if some of the key links fail.


  1. Spatial pattern of economic losses:
    1. To evaluate the spatial pattern of economic losses, I created real market value and economic loss heatmaps. Economic loss heatmaps provide insight into densely populated regions within Seaside that could potentially benefit from a single mitigation option. Heatmaps were generated by using the kernel density estimation tool within the QGIS processing toolbox.
  2. Relationship between economic losses and tsunami momentum flux:
    1. Performing a geographically weighted regression (GWR) provided insight into the relationship between economic losses and tsunami momentum flux. Here, the percent of economic loss depended on the tsunami momentum flux. The GWR was performed using the python spatial analysis library PySal.
  3. Connectivity of networked infrastructure:
    1. Connectivity analyses between tax lots and critical infrastructure provided insight into the probability of tax lots becoming disconnected from critical infrastructure. Additionally, maps showing the probability of link failure were generated to isolate vulnerable links within the networks. The connectivity analysis was performed using the python network analysis package python-igraph.


Spatial pattern of economic losses

Heatmaps showing the spatial pattern of economic losses deaggregated by both hazard and intensity are shown in Figure 2. It can be seen that a hot spot of damages is located at the central business district (CBD), and appears for low magnitude earthquake events. The tsunami damages are more evenly distributed along the coast, relative to the earthquake damages. Figure 3 shows the total economic risks for of Seaside deaggregated by hazard and intensity. Here, risk is defined as the economic losses multiplied by the annual probability of occurrence (the inverse of the return year). Quantifying the economic losses by risk allows for the isolation of events that are both likely to occur and produce significant damages. It can be seen in Figure 3 that the 250 to 1000-year events pose the highest economic risks. If economic losses are a priority, using Figures 2 and 3, a city planner could identify regions of buildings within Seaside that are vulnerable to earthquake damage. Subsequently these regions would benefit the most from mitigation options.

Figure 2: Deaggregated heatmaps of economic losses

Figure 3: Deaggregated economic risks for all of Seaside

Relationship between economic losses and tsunami momentum flux

The relationship between percent of economic losses and tsunami momentum flux was measured by performing a geographically weighted regression (GWR). A key parameter of the GWR is the bandwidth, which describes the area under which the regression is performed. A bandwidth of 200 was initially used (see to exercise 2). The bandwidth has been further optimized by comparing the resulting r2 values of multiple GWRs (Figure 4).  It can be seen that a bandwidth of 75 results in the largest r2 value, and was subsequently used for further analysis. Bandwidths below this value resulted in errors in the GWR code.

The results from the GWR with an updated bandwidth are shown in Figure 5. It can be seen that the slope is small near to the coast with hotspots of larger values located further inland. Conversely, the intercept is large near to the coast, and decreases as further inland. The intercept can be explained by the large momentum flux near to the coast and decreases as the tsunami propagates landward.

Interestingly, some of the slope values are less than 0 indicating that the damages decrease as momentum flux increases. This is likely explained by the heterogeneity of building types within Seaside. For example, concrete buildings are more resistant to tsunami damage than wood buildings. Consequently, a concrete building that experiences a large momentum flux may result in less damage than a wood building that experiences a small momentum flux. To validate this, the buildings in Seaside were deaggregated by their characteristics and a linear regression was performed (Figure 6). The buildings can be classified according to the construction material (wood vs. concrete), year built, and number of floors. Here, W1 corresponds to one-story wood building; W2 corresponds to two-story wood building; and C1 corresponds to concrete buildings with moment a frame. The concrete moment frame can be further divided into less than or greater than 4 stories. It can be seen in Figure 6 that all regression slopes are positive with relatively high r2 values. Considering the most extreme case, it can be seen that if a W1 building built before 1979 is next to a C1 building built after 2003, the resulting regression slope would likely be negative.

Figure 4: GWR bandwidth vs. r^2. Used for bandwidth selection.

Figure 5: Slope and intercept results from GWR analysis

Figure 6: Linear regression deaggregated by building characteristics

Connectivity of networked infrastructure

The probability of each tax lot being connected to critical facilities was evaluated by performing a connectivity analysis using the EPN, transportation, and water networks (see networks in Figure 1). The total fraction of tax lots disconnected from critical infrastructure are shown in Figure 7 (left-hand side). Similar to economic risks, the fraction of disconnection was multiplied by the probability of occurrence to determine the disconnectivity risk. It can be seen that for the 1000-year event, nearly all of Seaside becomes disconnected. Furthermore, the 250- to 500-year events pose the highest risk across the three networked infrastructures.

The network performance can also be characterized spatially by creating maps indicating the probability of link failure and tax lot disconnection. An example is shown in Figure 8 for the tsunami damage resulting from a 500-year event. This case was selected to view spatially because it drives the risk associated with the transportation network. Figure 8 shows that tax lots west of the Necanicum River have a high probability of becoming disconnected. Viewing the link failure map, it becomes clear that the disconnection is not driven by bridge failure, but rather from road washout.

Figure 7: Disconnection results: (left) fraction of tax lots disconnected from critical infrastructure and (right) disconnection risk

Figure 8: Probability of (left) link failure and (right) tax-lot disconnection


The ability to spatially isolate vulnerable infrastructure within a community serves as a first step in optimizing resource management. Being able to identify highly vulnerable areas, ensures that resources are not spent in areas that are already relatively resilient to the earthquake/tsunami hazard. The risk results from this analysis show that Seaside is particularly vulnerable to the 250- to 1000-year rupture of the CSZ. There is a large concentration of economic value located at the city center of seaside. If resource managers and city planners place an emphasis on reducing economic losses, then they should focus on mitigation measures to reduce damages within this region. However, more important than economic losses is resident and tourist accessibility to critical facilities following a natural disaster. Network mitigation measures should focus on reducing the damages associated with the 250- to 500-year events.

The results from this analysis are not comprehensive and future work should include:

  1. Spatial analysis of mitigation measures – Additional damage modeling should be performed with various mitigation measures in place. For example, the effect on transportation connectivity resulting from earthquake retrofitting of bridges can be incorporated into the damage model. Similar spatial maps as those produced in this project could be created in order to determine the spatial effect that mitigation measures have.
  2. Incorporate population in analyses – Climate change and natural disasters tend to have a disproportionate effect on populations within a geographic region. Future work should include spatially mapping socio-economic vulnerability indices to identify vulnerable populations within Seaside.
  3. Perform additional GWRs – It was shown that the building classification lead to counterintuitive results in the GWR. Additional GWRs could be performed by only considering similar building types with a geographic region.


Throughout this term, I learned how to both use new software as well as apply new statistical methods.

  1. Software
    1. Python – Prior to this course, I was familiar with both QGIS and python; however, I had not used the two together. A personal goal was to learn how to perform geospatial analysis in python. This was accomplished by performing the GWR using the python library I additionally learned how to read and modify shapefiles in python using the geospatial library geopandas.
    2. QGIS processing toolbox – I learned how to use multiple tools within the QGIS processing toolbox (g. heatmap interpolation and hotspot analysis).
  2. Statistics –
    1. Hot spot analysis – Although I did not use hotspot analysis for my project, I learned about this method and did a test case for exercise 1. The hotspot analysis was used to isolate areas within Seaside that resulted in disproportionate damages relative to the surrounding area.
    2. Geographically weighted regression – Geographically weighted regression was performed to evaluate the relationship between tsunami momentum flux and percent damage.

As a significant portion of my research will be spatially-oriented, the tools and skills I learned during this term will be beneficial for future work. Furthermore, this course introduced additional geo-spatial statistic methods that were not implemented for this project but could be relevant for additional work.


Kameshwar, S., Farokhnia, K., Park, H., Alam, M. S., Cox, D., Barbosa, A., & van de Lindt, J. (n.d.). Probabilistic decision-support framework for community resilience: Incorporating multi-hazards, infrastructure interdepencies, and target objectives in a Bayesian network. Reliability Engineering and System Safety.

Park, H., Cox, D. T., Alam, M. S., & Barbosa, A. R. (2017). Probabilistic Seismic and Tsunami Hazard Analysis Conditioned on a Megathrust Rupture of the Cascadia Subduction Zone. Frontiers in Built Environment, 3(June), 1–19.


Spatial pattern of ventenata invasion in eastern Oregon: Final Project

  1. The research question that you asked.

I initially asked the question, “how is the spatial pattern of invasion by the recently introduced annual grass, ventenata, influenced by the spatial pattern of suitable habitat patches (scablands) via the susceptibility of these habitat patches to invasion and ventenata’s invasion potential?”

  1. A description of the dataset you examined, with spatial and temporal resolution and extent.

In Exercise 1, I examined spatial autocorrelation and in ventenata abundance and ventenata hotspots using spatial data (coordinates and environmental variables) and ventenata cover data that I collected in the field (summer 2018) for 110 plots within and surrounding seven burn perimeters across the Blue Mountain Ecoregion of eastern Oregon.

Target areas were located to capture a range of ventenata cover from 0% ventenata cover to over 90% cover across a range of plant community types and environmental variables including aspect, slope, and canopy cover within and just outside recently burned areas. Once a target area was identified, plot centers were randomly located using a random azimuth and a random number of paces between 5 and 100 from the target areas. Sample plots were restricted to public lands within 1600m of the nearest road to aid plot access. Environmental data for sample plots includes: canopy cover, soil variables (depth, pH, carbon content, texture, color, and phosphorus content), rock cover, average yearly precipitation, elevation, slope, aspect, litter cover, and percent bare ground cover.

For Exercise 2, I examined how the spatial pattern of vegetation type influences invasibility of plant communities by ventenata. To achieve this, I applied vegetation type data from the Simpson Potential Vegetation Type raster data (Simpson 2013) to 30m resolution in Arc GIS which was developed to identify potential vegetation types across the Blue Mountain Ecoregion.

In Exercise 3, I explored how the spatial pattern of canopy cover was related to ventenata abundance. For this, I used a live tree canopy cover layer developed for the Pacific Northwest calculated using Forest Vegetation Simulator methods including the sum of canopy cover estimates for vegetation plots in the region (Crookston and Stage 1999).

  1. Hypotheses: predictions of patterns and processes you looked for

Ventenata is an invasive annual grass that shares many functional traits to other impactful invasive annual grasses in the region such as cheatgrass and medusahead, including similar vegetative height, fall germination, and shallow root system. These similarities have led me to believe that, like cheatgrass and medusahead, ventenata will be more abundant in open areas with low canopy cover where competition from existing vegetation is lower.

The study area contains many open areas interspersed throughout the larger forested landscape. The patchy spatial distribution of open areas throughout the study area will likely result in a patchy distribution of areas with high ventenata cover. Additionally, ventenata produces many seeds, with the majority of these seeds dispersing short distances from the parent plant. This leads me to believe that areas with high ventenata cover will be clustered near other areas with high ventenata cover creating invasion “hot spots” across the study region.

Hypothesis 1: Areas with high ventenata cover will be clustered near other high cover areas and low cover areas will be clustered near other low cover areas.

Hypothesis 2: The spatial pattern of ventenata abundance will be positively correlated with a neighborhood of non-forest habitat types (shrub-lands and grasslands) and negatively correlated with a neighborhood of forest habitat types. This relationship will decrease in strength as distance increases from the high cover sample point, as vegetation types farther from an invasion point are likely not as strongly influencing invasion as vegetation types closer to that point.

Once a species has established in a suitable habitat, it may spread to areas of less suitable habitat aided by strong propagule pressure from a nearby population. Open areas may act as source populations, allowing ventenata to build propagule pressure to the point where it is able to successfully establish and maintain a population in less suitable habitat such as areas with high canopy cover.

Hypothesis 3: Plots where ventenata is present in areas with high canopy cover (e.g. forests) will be clustered near open areas. These open areas may provide strong propagule pressure to aid invasion into areas with fewer available resources (sunlight).

  1. Approaches: analysis approaches you used.

To test these predictions I performed a handful of spatial analyses including:

Exercise 1: I tested for spatial autocorrelation using Moran’s I and created a correlogram in R and performed hot spot analysis in ArcGIS

Exercise 2: I explored the spatial relationship between the spatial pattern of ventenata abundance and the spatial pattern of different vegetation types using neighborhood analyses in ArcGIS and R

Exercise 3: I examined the spatial relationship between ventenata and canopy cover using a Ripley’s cross K analysis in R

  1. Results: what did you produce — maps? statistical relationships? other?

Throughout the analyses, I produced a series of statistical relationships displayed as maps and graphs. Hot spot analysis produced a map that allowed me to visualize the relationship of autocorrelation between ventenata abundance at my sample points. For Moran’s I, neighborhood analysis, and Ripley’s cross K, I produced graphical representations of statistical relationships in R.

  1. What did you learn from your results? How are these results important to science? to resource managers?

The correlogram and hotspot analysis results showed that the spatial pattern of ventenata is auto correlated and has a patchy distribution. The hotspot analysis suggests that areas of high ventenata are clustered with other high ventenata plots and low ventenata plots are clustered as I predicted in Hypothesis 1.  This is likely a result of the patchy distribution of open areas and forested areas across the landscape and the dispersal ability of ventenata.

Neighborhood analysis showed that areas with high ventenata cover are more positively correlated with nearby forested areas (ponderosa pine) than I originally thought. This result suggests that ventenata may preferentially invade areas surrounded by ponderosa pine vegetation type as well as shrublands which would not support Hypothesis 2. However, ponderosa pine vegetation type does not necessarily indicate high canopy cover, and could represent invasion into an alternative low canopy cover vegetation type. Additionally, the vegetation type maps are mapped at large spatial scales and may not represent the fine scale variation in vegetative cover. Uncertainty in this result inspired a follow up analysis using canopy cover instead of vegetation type as a predictor variable in a Ripley’s cross K analysis.

In my follow up analysis using Ripley’s cross K, I found that forest plots where ventenata was present were only weakly clustered around open areas despite my original hypothesis that there would be strong clustering (H3). These results could suggest that ventenata has a higher tolerance for high canopy cover than I originally predicted. Alternatively, these results could indicate that ventenata is capable of dispersing large quantities of seed much farther distances than originally thought, thus not requiring open areas in the immediate neighborhood. Moreover, the same issue of scale may apply to the canopy layer as the vegetation layer, and the 30m resolution may be over predicting canopy cover at my sample sites.

My findings could have severe implications for forest ecosystems which are commonly thought to be relatively resistant to invasion by annual grasses and are now showing susceptibility to ventenata invasion. For example, vententata could increase fine fuels in these systems, making them more likely to ignite and carry surface fire. Managers may want to consider incorporating annual grass management strategies into their current and future forest management plans to help reduce potential invasion impacts.

  1. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

During this class I learned a suite of new tools in ArcGIS including hotspot analysis and concentric ring buffer. I created my first model using ArcGIS Modelbuilder! I learned the basics of spatstat in R and successfully completed some spatial analysis which required transforming my data into a spatial data frame (I did not know that these existed prior to this class). Additionally, I was exposed to, and gained experience using many other new functions in R including Moran.I, correlog and kcross that were useful for spatial analysis.

  1. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I learned that Moran’s I and correlograms are useful for testing spatial autocorrelation in data, but only if the scale applied is of interest. For example, it was not useful to compute only one Moran’s I value for my entire data set – this indicated that there was spatial autocorrelation in the data, but did not indicate a spatial pattern. However, when I computed Moran’s I at various distances and displayed these results in a correlogram, I found uncovered the pattern of the spatial correlation. The hotspot analysis allowed me to visualize exactly where the high and low clustering was occurring across my sample plots while simultaneously providing a significance value for those hot and cold spots.

Ripley’s cross K analysis was useful for testing the relationship of my ventenata points to another variable (canopy cover). I found this test appea ling because it tests whether or not one variable is clustered around another variable using a Poisson distribution to compare observed and expected values assuming spatial randomness. However, I learned that this method was not appropriate for my data, as my sample plots were chosen based on field variables and were not a random sample. This violated assumptions of randomness and homogeneity across the sampling region as my plots were more heavily located in non-forested areas. If I wanted to properly investigate these spatial questions, I would have to develop a more random sampling method.


Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

Crookston, NL and AR Stage. 1999. Percent canopy cover and stand structure statistics from the Forest Vegetation Simulator. Gen. Tech. Rep. RMRS-GTR-24. Ogden, UT: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. 11 p.


Final Project Summary – Spatial Analyses of the Relationships Between Seascape Classes and Rockfish Distributions

Question Asked

The question that I sought to answer in this class was “How is the spatial pattern of forage fish assemblage in the California Current System related to the spatial pattern of seascapes based on the sea-surface conditions used to classify the seascapes?”  This is outlined in my first blog post, which can be found here. Additionally, I was seeking to characterize the spatial distribution and extent of seascape classes and forage fishes. Seascape classes are used as a way to simplify the myriad of physical, biological, and chemical processes that can affect organisms in the ocean. Eventually, the focus of the analyses shifted to young-of-the-year rockfish, as opposed to forage fish. All analyses mentioned in this blog post were conducted using the YOY-Rockfish data as opposed to the forage fish data.


Midwater trawls have been conducted annually by the National Oceanic and Atmospheric Administration’s (NOAA) Southwest Fisheries Science Center (SWFSC) in an attempt to monitor the recruitment of pelagic rockfish (Sebastes spp.) and other epipelagic micronekton at SWFSC stations off California. The trawls have informed a dataset that represents overall abundance of all midwater pelagic species that commonly reside along the majority of the nearshore coast of California from 1998 to 2015. Each trawl collected information about both fish abundance, recorded in absolute abundance, and location data, recorded in the form of latitude and longitude. The dataset also includes a breakdown of species by taxa.

Seascapes have been classified using a combination of in-situ data (from the trawls) and remotely sensed data from NASA’s MODIS program. Seascapes were classified using the methods described in Kavanaugh et al., 2014 and represent the seascape class in the immediate area that each trawl occurred. Seascapes are classified at 1 km and 4 km spatial resolution and at 8-day and monthly temporal resolution. Each seascape has been assigned an ID number which is used to identify similar conditions throughout the dataset.

The current seascape classification methods are able to classify surface conditions into one of 33 distinct seascape classes. The variables used to classify the seascapes are listed below:

  • Sea-surface temperature (Celsius)
  • Sea-surface salinity (Practical salinity units)
  • Absolute Dynamic Topography (meters)
  • Ice cover (% of area)
  • Colored dissolved organic matter (m^-1)
  • Spatial averaged sea-surface chlorophyll a (mg m^-3)
  • Phytoplankton physiological state (NFLH) (W m^-2 um^-1 sr^-1)
  • NFLH:Chlorophyll ratio

It is inferred that certain, carefully-selected sea-surface conditions can imply favorable conditions to support healthy ocean ecosystems in the water column and near the ocean floor. The conditions listed above have been studied and selected for this reason. However, this may not be a perfect science – any attempt to delineate and simplify the multitude of factors and conditions that facilitate life in the ocean is likely to leave out important factors. This underscores the importance of research that tests these classification methods and measures their ability to replicate and predict the true natural environment. The analyses conducted for this class attempt to do that in the context of one functional group of fishes within one major ecosystem in the Pacific Ocean.


Simply put, I hypothesized that any measurable spatial changes in the spatial extent of certain seascape classes will also be identifiable in the spatial variability of forage fish assemblage over time. The California Current ecosystem is meticulously studied and examined by a myriad of researchers from a number of different affiliate institutions. Studies reviewing the physical and biogeochemical conditions of the area indicate that from an environmental perspective, many areas off of the coast of California should support areas of high fish abundance.

Specifically, I was expecting areas of high forage fish and young of the year rockfish abundance to exist in seascape classes that represent nutrient-rich, upwelled water. These conditions have been shown to support thriving ecosystems throughout the water column due to an abundance of energy and food for fishes that live higher in the water column. Upwelling brings cold, nutrient-rich water to the surface and occurs seasonally in most places along the California Coast. Using the variables listed above, that would mean below average sea-surface temperature, high dissolved organic matter, and high chlorophyll a. Some classes that represent conditions similar to this are:

  • Temperate Transition (Class 7)
  • Tropical/Subtropical Upwelling (Class 11)
  • Subpolar (Class 12)
  • Temperate Blooms Upwelling (Class 14)
  • Warm, Blooms, High Nutrients (Class 21)

Preliminary multivariate community structure analysis has shown some statistically significant relationships between certain species and certain seascape classes using this data. If spatial patterns do exist, I expect there to be some relationship between the surface conditions and the fish found at depth of the midwater trawls.

Since my shift focused from forage fish to YOY rockfish, my hypothesis shifted from a prediction of correlation between certain seascape classes and forage fish abundance to relationships between rockfish and other seascape classes (as determined by the aforementioned multivariate analyses).

My hypothesis, taking into account all of the aforementioned considerations, can be restated as follows:

It is expected that the spatial pattern of areas of high young-of-the-year rockfish abundance will be related to the spatial patterns of seascape classes that represent areas of high nutrient availability and upwelled water along the California Coast.

Additionally, I expect even higher areas of abundance to occur in areas where two or more seascapes representing these favorable conditions converge or border one another. These border areas are likely to indicate much larger swaths of ocean that hold habitable conditions that are likely to be able to support entire communities rather than smaller numbers of fishes. While this hypothesis was not tested in this project, future work can be conducted using FRAGSTATS, patch analysis software, and other landscape ecology methods to seek an answer related to this prediction.

Approaches Used

Exercise 1: For the first exercise, different interpolation methods were used to explore their effects on the point trawls data. The interpolations (Kriging and IDW were tested) were used to model the abundance of rockfish in four selected years.

Exercise 2: Next, a neighborhood analysis was used to determine what the dominant seascapes were in the areas around trawls that produced high rockfish abundance (both according to the data and the interpolation) and low rockfish abundance. Buffers were set at 5, 10, and 25km distances around two trawls of each kind and seascape classes were measured as a function of % area of each buffer.

Exercise 3: Finally, a confusion matrix was calculated that measured the agreement between occurrence of the significant seascapes (as determined by exercise 2) and areas of high abundance (as determined by the interpolations in exercise 1).


Exercise 1: The interpolations produced a number of maps of varying spatial extents and with varying resolution. Methods had to be refined multiple times to procure workable results, but maps interpolating rockfish abundance for 4 distinct years were eventually created and used.

Exercise 2: The neighborhood analysis produced statistics related to the percent of each buffer area occupied by each seascape class. This information was then used to create two plots – one for areas of high abundance and one for areas of low abundance. The most important result was an understanding of which seascape classes represented areas of high rockfish abundance. The dominant seascape classes in areas of high abundance turned out to match two of the seascape classes predicted in my hypothesis: Temperate Transition (Class 7) and Subpolar (Class 12).

Exercise 3: The final exercise produced a confusion matrix that measured true positives, true negatives, false positives, and false negatives as they related to the agreement between the seascape classes and the interpolated maps. The final matrix also produced an overall agreement score (Kappa).


The confusion matrix ended up being a bit of a failure, as both the seascape information and interpolated information had to be turned into presence-absence data for the matrix to be calculated. This eliminated some of the fine-tuned resolution associated with the interpolation and ultimately led to a Kappa score that suggested less agreement than would be expected by chance. While this specific result is not important to science or managers, these methods can be refined to improve the agreement between the two sources, ultimately providing a way to statistically link physical ocean conditions and rockfish abundances in California.

For these reasons, I consider this project to be an excellent pilot study for these data – one that introduces me to the types of analyses available to researchers who work with these types of data and their strengths and shortcomings.

Learning Outcomes

Software and Statistics

This class gave me an opportunity to brush up on my Arcmap and Modelbuilder skills, which I used to be very proficient in but had lost. I’ve also had the opportunity to explore some analyses through the lenses of R and Python, though I did not end up executing any parts of my project in these programs. The most important statistical learning outcomes for me had to do with interpolation and confusion matrices. The first exercise taught me all about the different ways point data can be interpolated and the consequences that that can have for results. I was familiar with the different types of hotspot analysis available to users in ArcGIS, but was not aware of their differences. The confusion matrix introduced me to an entirely new way to connect independent modeling methods. Methodologically, I learned about the importance of recording your methods as you execute processes and about the benefits of naming files sequentially. Had I not been organized, I would have had a hard time keeping track of the different processes executed and files used during each step.