Category Archives: Final Projects

Using LiDAR data to assess forest aesthetics in McDonald-Dunn Forest

Bryan Begay



  1. Question asked?

How do forest aesthetics vary with forest structure as a result of active management that retains vegetation and landforms?

To answer my question, I examined two stands in the McDonald-Dunn Forest and analyzed how their forest structure relates to forest aesthetics. The first stand, Saddleback, was logged in 1999 and shows signs of management. The second stand was identified near Baker Creek, about 1 mile west of Saddleback. Baker Creek was chosen for its riparian characteristics and because it shows no signs of management activities.

  2. A description of the data set.

The LiDAR data that I used for my initial analysis was 2008 DOGAMI LiDAR flown over the McDonald-Dunn Forest, with an RMSE of 1.5 cm. The DEM data set I used was from 2009 and has an RMSE of 0.3 m. A Canopy Height Model (CHM) was made in R with the lidR package, using a digital surface model at 1 meter resolution. The CHM was used to create an individual tree segmentation, and the segmented trees were then converted to point data.
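The CHM step reduces to a raster subtraction: canopy height is the digital surface model minus the bare-earth terrain model. A minimal Python/numpy sketch of the idea (the actual workflow used the lidR package in R; the arrays here are hypothetical 1 m cells):

```python
import numpy as np

# Hypothetical 1 m resolution surface models (values in meters).
# In the original workflow these came from the lidR package in R.
dsm = np.array([[212.0, 215.5],
                [210.0, 230.2]])  # digital surface model (top of canopy)
dtm = np.array([[210.0, 211.5],
                [209.0, 209.2]])  # digital terrain model (bare earth)

# Canopy Height Model: surface height minus ground elevation.
chm = dsm - dtm
```

Individual tree segmentation (e.g., the watershed algorithm) then operates on this CHM to delineate crowns, whose centroids become the tree points.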

A link to a visualization of the raw point cloud that is georeferenced to its terrain.

  3. Hypotheses: predictions of patterns and processes you looked for.

I initially suspected that the Baker Creek stand would have higher forest aesthetics, reflected in the stand’s unmanaged vegetation structure.


Since Saddleback had been managed and cut, I expected the more natural structure of the riparian stand to have generally higher forest aesthetics than a stand altered by anthropogenic factors. One process I hypothesized to relate to forest aesthetics in these stands was the spatial point pattern of trees. Insert forest aesthetic link:

  4. Approaches: analysis approaches you used.

Exercise 1: Ripley’s K point pattern analysis

The steps taken to create a point pattern analysis were to identify individual trees and convert them into point data. The lidR package in R was used to create a Canopy Height Model and then an individual tree segmentation. Rasters and shapefiles were created to export the data so I could use the tree polygons to identify tree points. The spatstat package in R was then used to perform a Ripley’s K analysis on the point data.
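The core of the Ripley’s K step can also be sketched outside of spatstat. This is a naive Python estimate with no edge correction (the points and plot area are hypothetical, for illustration only); under complete spatial randomness the expected value is πd², so larger observed values suggest clustering:

```python
import numpy as np

def ripley_k(points, d, area):
    """Naive Ripley's K estimate at distance d (no edge correction).
    points: (n, 2) array of tree stem coordinates; area: study region area."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # count ordered pairs i != j closer than d
    pairs = (dist <= d).sum() - n  # subtract the n zero self-distances
    return area * pairs / (n * n)

# Hypothetical tree points in a 10 x 10 m plot; under complete spatial
# randomness the expectation is K(d) = pi * d**2 (about 12.57 at d = 2).
pts = np.array([[1.0, 1.0], [1.5, 1.0], [8.0, 8.0], [8.0, 8.5]])
k2 = ripley_k(pts, 2.0, area=100.0)
```

Here the observed K(2) exceeds the CSR expectation, consistent with the two tight clumps in the toy data.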

Figure 1. Individual Tree Segmentation using watershed algorithm on Saddleback stand.

Exercise 2: Geographically Weighted Regression

The steps taken for the geographically weighted regression included using the polygons created from the individual tree segmentation to delineate tree centers. Tree points were created from the centroids of the polygons to serve as inputs for the GWR in ArcMap. A density raster and a CHM raster had their values extracted to the point data so that density and tree height could be the variables used in the regression, with tree height as the explanatory variable and density as the dependent variable.
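Conceptually, GWR fits a separate weighted least-squares regression at each point, with weights that decay with distance. This Python sketch is a simplified stand-in for the ArcMap GWR tool (Gaussian kernel, fixed bandwidth; all point data are hypothetical):

```python
import numpy as np

def gwr_local_coefs(coords, x, y, bandwidth):
    """Minimal geographically weighted regression sketch: at each point,
    fit y = b0 + b1 * x by weighted least squares, with Gaussian weights
    that decay with distance from that point. Returns (n, 2) coefficients."""
    X = np.column_stack([np.ones_like(x), x])
    coefs = []
    for c in coords:
        d = np.linalg.norm(coords - c, axis=1)
        w = np.exp(-(d / bandwidth) ** 2)          # Gaussian kernel weights
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        coefs.append(beta)
    return np.array(coefs)

# Hypothetical tree points: x = tree height (m), y = local density index.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
height = np.array([20.0, 22.0, 30.0, 34.0])
density = np.array([0.4, 0.5, 0.9, 1.1])
betas = gwr_local_coefs(coords, height, density, bandwidth=3.0)
```

Each row of `betas` is a local intercept and slope; mapping the local slopes is what produces the GWR surface of positive or negative height–density relationships.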

Figure 2. Polygon output from Individual tree segmentation using the lidR package. The Watershed algorithm was the means of segmentation, and points were created from polygon centroids in ArcMap.

Exercise 3: Supervised Classification

This analysis involved creating a supervised classification using training data from NAIP imagery and a maximum likelihood classification algorithm. It involved using the NIR band to create a false color image that would show the difference in spectral reflectance values between conifers and deciduous trees. I used a histogram stretch to visualize the imagery better and spent time gathering quality training data. I then created a confusion matrix by using accuracy points on the training data. Finally, I clipped the thematic map outputs with my individual tree segmentation polygons to show how each tree had its pixels assigned.
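The confusion-matrix step reduces to cross-tabulating reference labels against predicted labels at the accuracy points. A minimal Python sketch (the real matrix was built in ArcMap; the labels here are hypothetical):

```python
import numpy as np

def confusion_matrix(truth, pred, classes):
    """Confusion matrix and overall accuracy for a classified map.
    truth/pred: sequences of class labels at accuracy-assessment points."""
    idx = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(truth, pred):
        cm[idx[t], idx[p]] += 1          # rows = reference, cols = predicted
    accuracy = np.trace(cm) / cm.sum()   # overall accuracy = diagonal / total
    return cm, accuracy

# Hypothetical labels at 8 accuracy points.
truth = ["conifer", "conifer", "conifer", "deciduous",
         "deciduous", "conifer", "deciduous", "deciduous"]
pred  = ["conifer", "conifer", "deciduous", "deciduous",
         "deciduous", "conifer", "conifer", "deciduous"]
cm, acc = confusion_matrix(truth, pred, ["conifer", "deciduous"])
```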

  5. Results

The Ripley’s K analysis in ArcMap showed that the Saddleback stand’s trees are dispersed, while the Baker Creek stand’s trees are spatially clustered. The GWR output for the Saddleback stand showed that tree height and density were positively related; with an adjusted R2 of 0.69, the model showed that the tallest and densest trees were on the edges of the Saddleback stand. The Baker Creek stand’s model performed poorly on the point data, with an adjusted R2 of 0.5, and relationships could only be modeled in the upper left of the stand. The classified image worked well on the Saddleback stand because the NAIP imagery there had less distortion; the Baker Creek stand’s classification was not useful because its NAIP imagery had significant distortion.

Exercise 1:

Figure 3. ArcMap Ripley’s K function output for Saddleback stand assessing tree points.

Exercise 2:

Figure 4. Geographically weighted regression of Baker Creek and Saddleback stand. The Hotter colors indicate positive relationships between tree density and tree height.

Exercise 3:

Figure 5. Supervised image classification using a maximum likelihood algorithm on Saddleback stand.

  6. What did you learn from your results? How are these results important to science? to resource managers?

I learned that Ripley’s K outputs can differ depending on which software package is used. The Ripley’s K outputs from R indicated that both of my stands had clustered tree patterning, while the ArcMap outputs, which made more sense, indicated that the Saddleback stand was actually dispersed. Outputs can be variable if inputs are not explicitly understood or modeled with enough care. I also learned that modeling a very heterogeneous riparian stand is more difficult because of its variability. This is important for researchers who are interested in riparian areas like Baker Creek, since they may need more variables to adequately model those stands.

  7. Your learning: what did you learn about software?

I became very familiar with processing and modeling LiDAR point clouds. I also became familiar with ModelBuilder and learned how to use R packages like spatstat. I also found a new method for making a confusion matrix in ArcMap.

  8. What did you learn about statistics or other techniques?

I learned how to do point pattern analysis with Ripley’s K on tree points, both in R and in ArcMap. In ArcMap, I also used the spatial statistics toolbox, which I plan to keep using. In using GWR, I came to understand what it does, what its outputs mean, and how to properly interpret the results. I also became more attentive to issues of scale and networks that might affect my areas of interest.

Geospatial Analyses Examining Woodpecker Nest Site Selection and Postfire Salvage Logging

Research Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

After evaluating my data with several spatial analyses methods, this research question still seems applicable to my project goals. A possible revision to this research question is to include phrasing about the internal spatial characteristics of the nest and salvage harvest location datasets aside from their relationship to each other.


This project uses postfire datasets targeting the 2015 Canyon Creek fire complex on the Malheur National Forest in eastern Oregon. A salvage logging operation occurred in the burn area in July 2016. My geospatial analyses examine the relationships between woodpecker nest site locations and salvage harvest unit locations before and after the logging occurred. This will lead to correlating woodpecker population dynamics with salvage treatment effects. Woodpeckers are an ideal indicator species for postfire ecosystem monitoring due to their heavy reliance on burned snags for bark-beetle foraging and cavity nesting. Their population dynamics indicate habitat quality for all snag-dependent wildlife. Previous research indicates that woodpeckers exhibit species-specific habitat preferences for burned forest variables like stand density, diameter class, height, and snag decay class. 2016 pre-salvage and 2017 post-salvage lidar datasets are being processed for eventual correlation between these 3D forest structure variables and woodpecker nest site selection before and after harvest. However, the spatial analyses featured in this project focused on the initial task of relating 2D nest and salvage unit locations.

My research is in conjunction with a Rocky Mountain Research Station study examining salvage logging effects on three woodpecker species in the Canyon Creek complex. In 2016 and 2017 I led crews on this project collecting extensive postfire woodpecker occupancy and nest productivity datasets for black-backed, white-headed, and Lewis’s woodpecker populations. This resulted in a 148-nest dataset for 2016 and 2017, representing woodpecker populations before and after salvage logging. A polygon shapefile outlining ten RMRS woodpecker point count survey units serves as the area of interest (6 treatment, 4 control). Within the 6 treatment units, another polygon shapefile outlining 35 salvage harvest units indicates treatment areas. Three silvicultural prescriptions replicating optimal habitat types for each woodpecker species designate target salvage variables like average post-harvest stand density and diameter class. Each salvage unit adheres to one of these three harvest prescriptions. Supplementary geospatial data includes a 2015 WorldView-3 1 m raster and ArcGIS basemaps.


Above: The 2015 Canyon Creek Fire burning near John Day, OR.


Black-backed woodpecker (Picoides arcticus)     White-headed woodpecker (Picoides albolarvatus)     Lewis’s woodpecker (Melanerpes lewis)

Above: The three target woodpecker species. All species are Oregon Sensitive Species. The white-headed woodpecker is a U.S. Fish and Wildlife Species of Concern.

Above: Survey units and corresponding salvage treatments for each unit. Each treatment type is based on a silvicultural prescription designed to benefit target woodpecker species. Based on previous research, each target species demonstrates preferred habitat characteristics in variables like stand density, diameter class, height, and decay class.

Above: The three salvage treatment types and the control treatment depicted as overhead stem maps. Each polygon represents a snag of indicated diameter class. The salvage treatments retain more trees of smaller diameter moving from Treatment 1 to Treatment 3, with the control units retaining all trees.

Above: A salvage treatment in the Crazy Creek woodpecker survey unit (Treatment 1).

Above: An early project map showing rough locations of Rocky Mountain Research Station woodpecker survey units. Since this map was created, two more survey units were added for 10 total (6 treatment, 4 control). This map effectively shows the woodpecker survey units occupying areas of highest burn severity.

Above: The Canyon Creek fire complex as a false color WorldView-3 1 m raster. The area of interest includes 10 survey units in blue, labeled with yellow text (6 treatment, 4 control). This visual orients the survey units to an area in eastern Oregon southeast of John Day and Canyon City. The false color image displays healthy vegetation as red, with the darkest areas displaying high burn severity. The survey units are found within some of the highest burn severity areas in the fire complex.

Above: A close-up of the 35 salvage treatment polygons outlined in red and labeled with white text. Control units lack red salvage polygons. This image does not include Overholt Creek.

Above: A subset of the 2016 (orange) and 2017 (pink) nest points featuring survey (gold) and salvage unit (green) polygons.


I expected to see dispersed nests in 2016 with possible trends indicating species habitat preferences. Previous research indicates species-specific preferences for certain forest habitat variables. Black-backed woodpeckers prefer dense, small-diameter stands for foraging and nest excavation. White-headed woodpeckers prefer a mosaic of live and dead variable density and diameter stands for pine cone foraging. Lewis’s woodpeckers prefer tall medium to large-diameter stands for aerial foraging maneuvers. I expected to see nest sites in both years clustered in areas with these forest structures. In 2017 I also expected to see nest sites clustered near salvage treatments implemented for each species. Overall I expected the control units to exhibit nest dispersal and high woodpecker activity.


Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point data clustering over a range of distances, indicating how spatial clustering or dispersion changes with neighborhood size. Given a user-specified number of distance bands, starting distance, and distance increment, Ripley’s K counts, for each point, the neighboring features that fall within the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors at a particular evaluation distance is higher than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.
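The clustered/dispersed judgment is made against the complete spatial randomness (CSR) benchmark K(d) = πd², often via the L transform, L(d) = sqrt(K(d)/π) − d, which is zero under CSR, positive under clustering, and negative under dispersion. A small Python sketch of that comparison (the observed K value is hypothetical):

```python
import math

# Under complete spatial randomness (CSR), the expected K at distance d
# is pi * d**2. The L transform, L(d) = sqrt(K(d)/pi) - d, is 0 under
# CSR, positive where points are clustered, negative where dispersed.
def expected_k_csr(d):
    return math.pi * d ** 2

def l_function(k, d):
    return math.sqrt(k / math.pi) - d

# Example: a hypothetical observed K of 25 at d = 2 exceeds the CSR
# expectation (~12.57), so L is positive, indicating clustering there.
l_obs = l_function(25.0, 2.0)
l_csr = l_function(expected_k_csr(2.0), 2.0)
```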

K function graphic (referenced from ArcGIS Desktop 9.3 Help).

Exercise 2: Multiple Buffer Distance Analysis

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.
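The percent-area index can be sketched as follows. This Python stand-in estimates, by Monte Carlo sampling, the share of a circular nest buffer that falls inside a rectangular salvage unit (the real intersection was computed in ArcMap; the nest location and unit geometry are hypothetical):

```python
import math
import random

def pct_buffer_in_salvage(nest, radius, rect, n=100_000, seed=1):
    """Monte Carlo estimate of the percent of a circular nest buffer's
    area that falls inside a rectangular salvage unit (xmin, ymin,
    xmax, ymax). A stand-in for the buffer/intersect steps in ArcMap."""
    random.seed(seed)
    cx, cy = nest
    xmin, ymin, xmax, ymax = rect
    hits = 0
    for _ in range(n):
        # draw a uniform point inside the buffer disk
        r = radius * math.sqrt(random.random())
        t = 2 * math.pi * random.random()
        x, y = cx + r * math.cos(t), cy + r * math.sin(t)
        hits += (xmin <= x <= xmax and ymin <= y <= ymax)
    return 100.0 * hits / n

# Hypothetical nest at the origin with a 50 m buffer; a salvage unit
# covering everything north of y = 0 should intersect about half of it.
pct = pct_buffer_in_salvage((0.0, 0.0), 50.0, (-1000, 0, 1000, 1000))
```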

Exercise 3: Nearest Neighbor

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.
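The ratio can be computed directly: the observed mean nearest-neighbor distance divided by the CSR expectation 0.5 / sqrt(n / A). A minimal Python sketch with hypothetical nest points (the actual analysis used the ArcGIS tool):

```python
import math

def nn_ratio(points, area):
    """Average nearest neighbor ratio: observed mean NN distance divided
    by the CSR expectation 0.5 / sqrt(n / A). Values < 1 indicate
    clustering; values > 1 indicate dispersion."""
    n = len(points)
    nn_dists = []
    for i, (xi, yi) in enumerate(points):
        d = min(math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points) if j != i)
        nn_dists.append(d)
    observed = sum(nn_dists) / n
    expected = 0.5 / math.sqrt(n / area)
    return observed / expected

# Two tight clumps in a 100 x 100 unit: strongly clustered, ratio << 1.
pts = [(10, 10), (11, 10), (10, 11), (90, 90), (91, 90), (90, 91)]
ratio = nn_ratio(pts, area=100 * 100)
```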


Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)





Above: Notable clustering results from Ripley’s K analysis. Most sample sizes were too small to produce significant results, but Big Canyon and Crawford Gulch in 2016 and Lower Fawn Creek in 2017 showed evidence of clustering.

Exercise 2: Multiple Buffer Distance Analysis

Above: The difference in results between 50 and 100 meter buffer distances for the 2016 nest dataset demonstrates the utility of graphs created from buffer analysis. These results indicate that woodpeckers consistently place their nests in areas completely surrounded by salvage treatment 3. They also indicate that woodpeckers consistently place their nests within varying distances of salvage treatment 1, with trends indicating average nest placement near, but not completely within, those salvage treatments.

Exercise 3: Nearest Neighbor

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. The NN ratio indicates expected clustering or dispersion: values less than 1 indicate clustering, and values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells a decrease. NN ratio cells with an increased value in 2017 indicate units where nests became more dispersed; cells with a decreased value indicate units where nests became more clustered. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units were dispersed, or became more dispersed, in 2017.


I am processing two lidar datasets of the study area from 2016 and 2017. These datasets were acquired before and after the salvage logging treatments occurred. I will produce forest metrics such as stand density, diameter class, and height in salvage and survey units. I will then correlate geospatial and statistical properties of the nest datasets to quantified forest variables affecting woodpecker nest site selection. I will examine trends between 2016 and 2017 nest selection to understand the effects of harvest treatments on woodpecker populations. At least two more years of woodpecker data will exist for 2018 and 2019, so future research will add these datasets to the analyses. I would like to see a machine learning algorithm developed from this study that could predict areas of optimal habitat suitability for snag-dependent wildlife species. Postfire wildlife habitat prediction will be crucial to resource managers as wildfires increase in the coming decades alongside accelerated landscape restoration.

This spatial problem is important to science and resource managers as climate change amplifies wildfire effects. Using 3D remote sensing datasets for resource management is the future trend across all disciplines. Increased wildfire activity around the world necessitates cutting-edge methods for active fire and postfire ecosystem quantification. In the Blue Mountains ecoregion in eastern Oregon, Rocky Mountain Research Station, Malheur National Forest, and Blue Mountains Forest Partners rely on this project’s lidar quantification for their research and management decisions. Determining acceptable levels of salvage harvest for wildlife habitat affects whether government agencies and rural economies in this region will allow small businesses to profit from sustainable harvest operations. Science overall will benefit from the continued investigation of wildlife response to anthropogenic disturbance, specifically postfire forest management decisions and the controversial practice of salvage logging.

Learning Software

For my analyses I used mostly ArcMap, ArcGIS Pro, and R. I am very familiar with ArcMap, but it was my first time using ArcGIS Pro extensively. I enjoyed applying my R programming experience to my own datasets and creating visual aids for my research.

Learning Statistics

The statistical analyses I implemented for my research did an adequate job characterizing my datasets. However, I would try a different set of analyses that fit my small sample sizes better. A spatial autocorrelation test may support these analyses. I used R heavily in my Ripley’s K and buffer analyses to create graphs.

Courtney’s Final Project Post

Research Question

  • “How is the spatial pattern of ion and isotope concentrations in wells tapping the basalt aquifer related to the spatial pattern of mapped faults via the mechanism of groundwater flow as determined by hydraulic transmissivity of the geologic setting?”

Description of dataset that I examined

  • A: In my research I have analytical data for 31 wells, whose XY locations were determined by field confirmation of the Oregon Water Resource Department (OWRD) well log database. As groundwater is a 3D system, I have to consider Z values as well. The well depths and lithology information are also from the OWRD database. My analytical data provides a snapshot of water chemistry during the summer of 2018. I have only one temporal data point per well. At all 31 wells, I collected samples to be analyzed for pH, temperature, specific conductivity, oxygen isotopes 16 and 18, and hydrogen isotopes 1 and 2. At a subset of 18 of those wells I collected additional samples for tritium, carbon 14, and major ion analysis.
  • B: The shapefile of faults mapped at the surface was created by Madin and Geitgey of the USGS in their 2007 publication on the geology of the Umatilla Basin. There is some uncertainty in my analysis as far as extending this surface information into the subsurface. USGS studies have constrained proposed ranges of dip angles for the families of faults that I am studying, but not exact angles for any single mapped fault.
  • C: Results of pumping interference tests involving 29 wells, 12 of which I had chemical data for. The data were collected by the OWRD in 2018 and 2019.


  • Faults can act as conduits or barriers to groundwater flow, depending on how their transmissivity compares to the transmissivity of the host rock.
  • I hypothesize that clusters of similar chemical and isotopic properties of groundwater can indicate a shared aquifer unit/compartment, and that if faults separate clusters then the fault is influencing that difference in chemical/isotopic signatures. If the fault is between two clusters, I hypothesize that it is acting as a barrier. If it crosses through a cluster, I hypothesize that it acts as a conduit.
  • Where faults act as barriers, I hypothesize that parameter values will differ in groups on either side of a fault. Specifically, a barrier fault might cause older, warmer water to rise into upper aquifer layers, and the downstream well might show a signature of more local recharge.
  • Where faults act as conduits, I hypothesize that water chemistry and isotopes of samples from wells on either side of the fault would indicate a relatively direct flowpath from the upstream well to a downstream well. Over a short distance, this means that ion and isotope concentrations would not differ significantly in wells across the fault.
  • My hypotheses depend on a “barrier” fault damming groundwater flow up-gradient of the fault and compartmentalizing local recharge on the down-gradient side. This hypothesis is only relevant if the fault is roughly perpendicular to the flow direction, so that it disrupts transmissivity between a recharge zone and the wells. If a fault that separates two wells is parallel to the flow direction and there is no obstacle between the wells and upstream recharge areas, then the fault might indeed limit communication between the wells, but they will have similar chemical signatures. Wells separated by this second kind of fault barrier would be better evaluated by a physical test of communication, such as a pumping interference test.

Analysis Approaches

  • Principal component analysis: used to simplify the multivariate data set (19 variables!) into variable relationships that could represent aquifer processes
  • Analysis of PCA results compared to distance from a flow path
    • Interpolation of well water levels classified by well stratigraphy to estimate a potentiometric surface and evaluate groundwater flow directions.
    • Raster calculations to compare flow direction to fault incidence angle
    • Measuring distance from each well to the nearest fault along the flow path
    • simple linear regression, comparing Non-ion PC1 score of a well with its distance from a fault.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and pumping interference test communication between wells
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups.
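The PCA step above can be sketched in a few lines: standardize the variables, take the SVD, and read component scores and explained variance off the singular values. This Python sketch uses synthetic well data in place of the real 19-variable chemistry table:

```python
import numpy as np

def pca_scores(data, n_components=2):
    """PCA on standardized variables via SVD; returns component scores
    and the fraction of variance explained by each retained component."""
    z = (data - data.mean(0)) / data.std(0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt.T[:, :n_components]
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]

# Synthetic stand-in for the well chemistry table: 31 wells, 4 highly
# correlated variables (e.g., specific conductivity, d18O, temperature,
# well depth), so PC1 should capture most of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(31, 1))
data = np.hstack([base + 0.1 * rng.normal(size=(31, 1)) for _ in range(4)])
scores, explained = pca_scores(data, 2)
```

The PC1 score per well is the kind of quantity compared against distance-to-fault in the regression step.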


  • Principal component analysis – Patterns of variation between wells are statistically explained primarily by total ion concentration, a relationship between chemical evolution from ion exchange and decreasing stable isotope ratios, and the combination of well depth and water temperature. Moran’s I indicates that only Non-ion PC2 is spatially clustered, while the other PC models have a distribution not significantly different than random. The other PC models are useful to understand the groundwater system, but not specifically to analyze clustering correlated to faults.
  • Interpolation of water level, and comparison of fault incidence angle with flow direction, indicates faults that are and are not able to be tested by my hypotheses.
  • Analysis of PCA results compared to distance from a fault along flow path – some wells that are “within” a fault zone have very old signatures and others have very young signatures. This could be related to the angle of the dip of the fault and the accuracy of mapping compared to the depth of the well. I hypothesize that the wells that are in the fault zone but have high PC1 scores are on the up-gradient side of the fault where older water is upwelling along a barrier. Wells in fault zones with low PC1 scores could indicate wells open to downgradient areas of the fault, where vertical recharge through the fault damage zone is able to reach the well.
  • Returning to the conclusions I wrote in that blog post after I found improved stratigraphic data, I’m not sure I can make conclusions other than those about the wells that are mapped as “inside” a fault. Several wells that are closer to faults are also open to shallower aquifer units, so the effect of lower PC1 scores downgradient of faults might be confounded by lower PC1 scores caused by vertical inflow from the sedimentary aquifer and upper Saddle Mountain aquifer.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and communication between wells show that the presence of a fault has a greater effect on communication than the distance between the wells.
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups – stable isotopes and specific conductivity both show more variation between well groups than within well groups.
  • Not covered in these blog posts: I also ran Moran’s I on my inputs to see which ones are clustered and so might be more related to horizontal grouping factors (such as faults) than vertical grouping parameters (such as stratigraphic unit). Of the PCA and individual variables, only d18O, d2H, and Non-ion PC2 (a combination of well depth and water temperature) were clustered. The other PCA models, temperature, pH, and specific conductivity were not significantly spatially clustered.
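Moran’s I itself is a short computation once a spatial weights matrix is chosen. This Python sketch uses binary distance-band weights and hypothetical well values (the actual analysis was run in ArcGIS):

```python
import numpy as np

def morans_i(values, coords, bandwidth):
    """Global Moran's I with binary distance-band weights: w_ij = 1 when
    wells i and j lie within `bandwidth` of each other. I near +1 means
    similar values cluster in space; near 0 means no autocorrelation."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d <= bandwidth)).astype(float)
    z = x - x.mean()
    num = n * (w * np.outer(z, z)).sum()
    den = w.sum() * (z ** 2).sum()
    return num / den

# Hypothetical wells: low values in a western clump, high in an eastern
# one, so the statistic should come out strongly positive.
coords = np.array([[0, 0], [1, 0], [0, 1],
                   [10, 10], [11, 10], [10, 11]], dtype=float)
vals = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
i_stat = morans_i(vals, coords, bandwidth=2.0)
```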

Significance – Whether groundwater signatures related to faults agree or disagree with past understandings of differences between wells in the region, they can inform well management. If a senior permit holder puts a call on the water right and asks for junior users to be regulated off, it would not help that senior user if one of those junior permit holders’ wells is not hydraulically connected to the senior user’s well.

  • More wells would need to be sampled to be better able to disentangle the effects of faults from the effects of well stratigraphy.

My learning – I learned significantly more about how to code and troubleshoot in R. Additionally, I learned about the process of performing spatial clustering analysis in ArcGIS.

What did I learn about statistics?

  • PCA was completely new to me, and it’s a cool method for dealing with multivariate data once I got past the steep learning curve involved in setting it up and interpreting the results. It was useful to get more practice performing and interpreting t-tests and ANOVA F-tests. I had not used spatial clustering before, and learning how to apply it was interesting. It gave me a much more concrete tool for disentangling the patterns in my effectively 3D data in the X,Y plane, as opposed to the Z direction.

Relationship of river recession behavior to watershed lithological composition

Research Question: How is watershed recession behavior related to the spatial pattern of watershed lithological composition in Oregon coastal freshwater systems?


Recession Analysis

For the recession analysis performed in this exercise, I used daily records of precipitation from the PRISM database. Daily data is the highest temporal resolution that PRISM offers. The PRISM time series were downloaded from cells nearest the gage and span 2001 to April 2019. For further methodology on recession analysis, refer to Exercise 1. River discharge data were collected by the U.S. Geological Survey at the watershed pour point used here. Discharge is recorded every 15 minutes and was upscaled to daily averages to match the resolution of the precipitation data. As a result, recession analysis results were on the daily timestep from 2001 to April 2019.
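The upscaling step amounts to grouping the 15-minute records by calendar day and averaging. A minimal Python sketch with synthetic discharge values:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Synthetic two-day series of 15-minute discharge records (values are
# hypothetical; the real data come from a USGS gage).
start = datetime(2018, 7, 1)
records = [(start + timedelta(minutes=15 * i), 10.0 - 0.01 * i)
           for i in range(192)]  # 192 quarter-hours = 2 days

# Group by calendar day, then average to get the daily series that
# matches the resolution of the PRISM precipitation data.
by_day = defaultdict(list)
for t, q in records:
    by_day[t.date()].append(q)
daily_mean = {day: sum(v) / len(v) for day, v in by_day.items()}
```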

Geophysical data

Lithological data were sourced and derived from the Oregon Department of Geology and Mineral Industries state geological map. General lithology type is a listed variable in the attribute table (GEN_LITH_TY), and ArcGIS was used to dissolve that attribute across the study area. The spatial resolution of this dataset is unclear. Additionally, the data represent “immutable” features, and therefore have no associated temporal scale.


Streams and rivers in watersheds with predominantly volcanic geology will recede faster than those with predominantly sedimentary geology.

Streams and rivers in watersheds with similar underlying lithology will have similar recession characteristics.


To perform this analysis, I took the following steps:

  1. Recession analysis on the daily timescale across eight watersheds. Statistical methods and other analytical procedures are outlined in Exercise 1.
  2. Profile lithological composition within each of 35 study watersheds across the Oregon Coast Range. Percentages of metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic lithologies were quantified for each study watershed. Technical procedures are outlined in Exercise 3.
  3. Perform hierarchical clustering analysis on lithological composition profile to see which watersheds are most lithologically similar and dissimilar.
  4. Compare recession analysis coefficients to clustering results and lithologic profiles to see if recession behavior appears to be different in different predominant lithologies.
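The clustering in step 3 can be sketched as a small agglomerative procedure on the lithology percentage vectors. This naive average-linkage Python sketch stands in for the actual hierarchical clustering workflow; the six two-component profiles are hypothetical:

```python
import numpy as np

def agglomerative(profiles, n_clusters=2):
    """Naive average-linkage agglomerative clustering of watershed
    lithology profiles (rows are % compositions). Repeatedly merges the
    two clusters with the smallest mean pairwise Euclidean distance."""
    X = np.asarray(profiles, dtype=float)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters

# Hypothetical (% sedimentary, % volcanic) profiles for six watersheds:
# three sediment-dominated, three volcanic-dominated.
profiles = [[90, 10], [85, 15], [80, 20], [30, 70], [25, 75], [40, 60]]
groups = agglomerative(profiles, n_clusters=2)
```

Cutting the resulting hierarchy at two clusters separates the sedimentary-branch watersheds from the volcanic-branch watersheds, mirroring the dendrogram in Figure 1.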


Results from the clustering analysis demonstrated that most coastal watersheds are predominantly underlain by sedimentary and volcanic lithologies, and that sedimentary lithologies dominate the region. However, volcanic lithologies are well represented, primarily at the northernmost latitudes. Because of this stratification, recession analyses could be performed in watersheds of both lithologies and compared to test whether lithology is a dominant control. However, because I conducted the recession analysis before profiling lithological composition, only one of the eight hydrologically analyzed watersheds lies in predominantly volcanic geology (Figure 1). The watersheds grouped under the sedimentary branch span 54-100% sedimentary lithology; watersheds grouped under the volcanic branch span 47-74% volcanic geology.

Figure 1. Results from hierarchical clustering analysis depicted in a dendrogram. Watersheds in which recession analyses were performed are circled.

When recession coefficients from the recession analysis are compared with the clustering results, coefficient a is fairly consistent among all watersheds, while coefficient b is much larger in the predominantly volcanic watershed (Table 1). The Wilson River near Tillamook, Oregon, is characterized as 50% volcanic lithology, the highest among the studied watersheds. Its recession coefficient b, 1.726, is also the highest of the eight studied systems.

Table 1. Table showing lithological composition of watersheds for which recession analysis was performed. Coefficient results from the analysis are also listed.


While the results of this study are inconclusive given the small sample size, they do indicate that lithology is an important control on watershed recession behavior and should therefore be included in analyses of temporal patterns of hydrological behavior. This is relevant to resource managers in coastal PNW regions because it points to a watershed characteristic that influences storage and release behavior. Such information may be valuable for identifying systems that are more or less susceptible to drought and changes in climatic patterns.

Learning Outcomes

  • Watersheds can be delineated in Matlab, and it is much faster than ArcGIS.
  • dplyr is a very important and useful package for reformatting data in a logical way.
  • I learned more about spatial autocorrelation from conversations with other students about their analyses, especially given the breadth of research topics in the classroom.
  • Probably the two most important outcomes of this project were the recession analysis and the clustering analysis. The recession analysis involved multivariate regression, which makes far more sense now that I have applied it to my own data and research question. The clustering analysis helped answer a question I have had for some time: how similar or dissimilar are my study watersheds? For this project I focused on lithology, but I think it will be a useful tool for analyzing the other variables as well. Additionally, the clustering analysis will provide a useful method for stratifying my sampling design into different lithological groups.

Relation of Potential Soil Carbon to Tree Height and Density across HJA

Major uncertainties exist in the geospatial distribution of terrestrial carbon (C) sources and sinks and the factors that influence soil C distribution and change. Temperate and boreal forests serve as a net sink of atmospheric CO2, so my aim is to understand how landscape features and vegetation vary across the HJ Andrews Forest (HJA) and determine how those factors influence soil C storage and CO2 release.

Research Question: How might landscape features and vegetation characteristics relate to accumulation of soil carbon, using tree height and density as a proxy for soil C?

Dataset description: I used LiDAR data from 2008 and 2011 at 0.3-0.5 m vertical resolution and at 1 m2 horizontal resolution covering the Lookout Creek Watershed and the Upper Blue River Watershed to the northern extent of HJA. These LiDAR data include a high hit model (tree tops) and a bare earth model from 2008. I also downloaded a shapefile of polygons of the reserved control areas within the HJA forest boundaries and used these areas to subset some of my later analyses.


Hypotheses:
  1. Large storm events primarily topple larger (older) trees on ridges and upper slopes. Soils in these positions are typically thinner, so they saturate quickly during precipitation, making trees in those positions more vulnerable (Lal 2005; Overby 2003). In addition, the tallest trees on ridges may be exposed to greater risk from lightning strikes, stronger wind gusts, and more snow accumulation, resulting in shorter trees dominating ridgelines. Shorter trees have smaller canopies and occupy less physical space than taller trees, so shorter trees form denser stands; as they grow taller, each individual tree occupies more space. Because of these factors, I hypothesize that shorter, more closely spaced trees will cluster more densely along ridges than on other landforms.
  2. I hypothesize that the tallest trees will be more densely clustered at low elevations along valleys because waterways carve out valleys and bring nutrients that then accumulate and build thick, carbon-rich soil. These areas tend to not suffer from the moisture limitations of ridges and steep slopes and are more protected from strong winds, so trees in these areas can maximize growth based on available solar radiation.
  3. High density young forest stands exhibit characteristics similar to mature forests, like closed canopies and high LAI. However, recently disturbed forests have higher nutrient availability than undisturbed forests which has been shown to cause a shift in C-allocation from below- to aboveground, so I expect younger stands to negatively correlate with soil C accumulation.
  4. I hypothesize that because vegetative growth is limited by solar radiation, vegetative growth on slopes that face S and W have greater exposure to solar radiation and will result in greater overall biomass (more dense stands of smaller trees, but not necessarily more tall trees). However, if the slope is too steep, I expect this pattern to diminish since trees will experience greater wind and water erosion and more tree mortality.

Note: I have yet to test hypotheses 3 and 4, but they are avenues for continuing research.

Approaches: I used a k-nearest neighbor analysis and k-means to cluster trees into 10 height classes and related each tree in the center of each height class to its distance from its 30 nearest neighboring trees of the same height class. I ran several hot spot analyses on tree heights and tree distances (tree density). Since my original hot spot analyses were subset by height class, which led to the algorithm naturally finding the arbitrary bins (upper and lower bounds) the height classes were based on, I performed new hot spot analyses on all the trees and all the tree distances within just the reserved control areas of the forest. I performed a linear regression to compare the regions of taller than expected trees to the regions of greater than expected distances between trees, and Chi-squared tests of independence for the hot spot analyses to compare hot spots and cold spots with variables like tree spacing, slope, and elevation. Since hot spots between tree height and spacing did not overlap in all cases, I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. I calculated a confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots.
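The height-class and nearest-neighbor steps described above can be sketched as follows. The tree positions and heights are randomly generated stand-ins, and scikit-learn is assumed here rather than the ArcMap/R tools actually used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# hypothetical stand: x, y positions (m) and heights (m) for 500 trees
xy = rng.uniform(0, 1000, size=(500, 2))
heights = rng.uniform(5, 60, size=500)

# step 1: cluster trees into 10 height classes with k-means
classes = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(
    heights.reshape(-1, 1))

# step 2: within one height class, mean distance to the 30 nearest
# neighbors of the same class (a proxy for stand density)
members = xy[classes == classes[0]]
k = min(30, len(members) - 1)
nn = NearestNeighbors(n_neighbors=k + 1).fit(members)  # +1 to skip self
dists, _ = nn.kneighbors(members)
mean_spacing = dists[:, 1:].mean(axis=1)  # drop the zero self-distance
print(mean_spacing.shape)
```

Each tree then carries a mean-spacing value that can feed the hot spot analyses of tree density described above.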

My objective for the final synthesis was to find regions that are higher or greater than expected for given parameters (height and tree density) and group these clusters into polygons where they overlap. I hypothesized that soil C accumulation positively correlates with clusters of trees that are both taller and more densely spaced than expected. Conversely, I hypothesized that shorter trees that are more widely spaced negatively correlate with soil C accumulation. Therefore, overlapping hot spots of tree height and tree density and overlapping cold spots of tree height and tree density are prime areas for my planned future soil C sampling sites.

Results: I produced maps and statistical relationships between tree height and tree spacing. These included hot spot maps overlaid on elevation maps, density plots, graphical comparisons of Z-scores and many others (see previous blog posts). I found that hot spots of taller than expected trees overlapped with hot spots of greater than expected distances of trees, but not in all cases (Fig 1). When I examined a table of each of the distance hot spot bins compared with each of the height hot spot bins, I found that the most compelling correlations were between bins (0,0), (-3,-3) and (3,3) with 5%, 33% and 43% of total data, respectively. The proportion of data not covered by overlapping hot spots and by overlapping cold spots was minimal, but when mapped was visually compelling enough to warrant further exploration. I was particularly interested in areas with short trees and greater than expected distance between trees (bin -3,3) and areas with tall trees and shorter than expected distance between trees (bin 3,-3). I expect that bin (-3,3) would correlate with less soil C, so identifying those areas could be useful to sampling. I expect that bin (3,-3) would correlate with greater soil C. I can identify from Fig 2 where statistically significant hot spots and cold spots are spatially close and plan to sample heavily in those areas.
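The bin-by-bin comparison of hot spot z-scores can be illustrated with a cross-tabulation. The bin assignments below are simulated with an artificial 70% agreement rate, not the actual analysis results:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# hypothetical hot spot z-score bins for height and distance (per tree)
bins = [-3, -2, -1, 0, 1, 2, 3]
height_bin = rng.choice(bins, size=1000)
# simulated distance bins that mostly agree with the height bins
distance_bin = np.where(rng.random(1000) < 0.7,
                        height_bin, rng.choice(bins, size=1000))

# cross-tabulate the (height, distance) bin pairs
ct = pd.crosstab(height_bin, distance_bin,
                 rownames=["height"], colnames=["distance"])
ct = ct.reindex(index=bins, columns=bins, fill_value=0)
print(ct)

# proportion of data on the diagonal, i.e. matching hot/cold spot bins
diag = np.trace(ct.to_numpy()) / ct.to_numpy().sum()
```

Off-diagonal cells such as (-3,3) and (3,-3) are exactly the mismatched hot/cold spot combinations flagged above as candidate sampling areas.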

Fig 1. Tree height vs. distance between trees from hot spot analyses shows a highly linear pattern.

Fig 2. Tree density (mean distance in m between 30 closest trees) hot spot bins compared with tree height (m) hot spot bins in Lookout Mountain area. Bin comparisons (-3,-3) are areas of shorter than expected trees in more dense stands. Bin (3,-3) is taller than expected trees that are in more dense stands. Bin (3,3) is taller than expected trees in more sparsely populated stands.


Significance: Many of my results thus far confirm ecological phenomena that we already know to be true across forested landscapes. For example, I examined tree distance to stream and found that taller trees were more common closer to streams and less common at greater distances from streams. This makes sense with other landform features like valleys and ridges, so smaller trees tend to be along ridges (cold spots of tall trees according to my analysis) and taller trees tend to be in valleys. However, I have identified regions of taller than expected trees and regions of shorter than expected trees, and if those correlate with respectively more and less soil C, that provides evidence for LiDAR as an effective way to quantify soil C. If other landscape features that can be determined from geospatial data co-vary with tree height and/or tree density, it may be possible to quantify soil C at fine resolution. These data can be used to identify areas that have the potential to sequester more soil C and forest management could be tailored to support those regions.

Geospatial Learning: I learned a ton about ArcMap, QGIS and packages that were new to me in R like spatstat, caret, nngeo, and others. I had very little experience with R and GIS before this class, so it was a steep learning curve, but I’ll continue to learn as I perform this final synthesis. I learned how to perform hot spot analyses in ArcMap and export them to work with in Q. I learned how to manipulate spatial data in R and load it in Q and ArcMap for viewing. These are only a few examples of the many things I’ve learned!

Statistics Learning: I learned the limitations of hot spot analyses, with one of my criticisms being that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I already knew that my data were spatially autocorrelated, so I had to take that into account for all of my analyses. I learned that the confusion matrix is great for visually discerning where a model predicts best and where it doesn’t predict well, but that the scientist must figure out the reasons for the good or poor predictions.

Because interactions among geomorphic processes, landforms, and biota occur on various temporal and spatial scales, it is necessary to understand the scale of an ecosystem process when performing spatial analyses. It is also necessary to consider the potential reasons why a particular spatial pattern did not emerge from an analysis. Reasons can range from data not being explained by expected process or multiple spatial scales of interactions, failing to use the right tool, not being at the right spatial scale, data being too coarse, failing to include the right variables or a flaw in one’s fundamental idea about how the system works. This is why it is important to (1) formulate clear, specific hypotheses before performing an analysis and (2) explore spatial patterns using multiple approaches if possible.


Harmon M.E. (2009) Woody Detritus its Contribution to Carbon Dynamics of Old-Growth Forests: the Temporal Context. In: Wirth C., Gleixner G., Heimann M. (eds) Old-Growth Forests. Ecological Studies (Analysis and Synthesis), vol 207. Springer, Berlin, Heidelberg

Lal, R. (2005). Forest soils and carbon sequestration. Forest Ecology and Management 220(1-3), 242-258.

Overby et al. (2003). Impacts of natural disturbance on soil C dynamics in forest ecosystems: soil C sources include litter fall and tree mortality, root turnover and root exudates.

Dixon, R. K., Solomon, A. M., Brown, S., Houghton, R. A., Trexier, M. C., & Wisniewski, J. (1994). Carbon pools and flux of global forest ecosystems. Science 263(5144), 185-190.

Final: Evaluating a habitat suitability model for native Olympia oysters in Yaquina Bay

Research background
The goal of my graduate research project is to help build a more comprehensive picture of native Olympia oyster spatial distribution and abundance in Yaquina Bay, Oregon. Olympia oysters have experienced a major decline in population in the Pacific Northwest; it is estimated that the original population has been reduced by 90-99% (Beck et al. 2011). The species is an important ecological and cultural resource, which has garnered it support from resource management agencies, tribes, and local environmental groups. Yaquina Bay is one of only three Oregon estuaries that has historically supported Olympia oysters and may be the only estuary to have maintained a continuous population since Native American settlement (Groth and Rumrill 2009). Because of this, it has been identified by local shellfish biologists as a good candidate for species enhancement. However, current information about exactly where and how many wild oysters exist in Yaquina Bay is inexact and anecdotal. My research project aims to fill that data gap so that appropriate, informed actions can be taken for species enhancement and recovery.

Question asked
Through this course, I have been experimenting with spatial analysis to see if I can create an accurate map of habitat suitability for Olympia oysters in Yaquina Bay. To do this, I’ve been focusing on three environmental parameters: salinity, substrate, and elevation. I then compared the map of predicted suitable habitat with field observations of known oyster locations. The main research question I examined in this project was:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Description of the dataset

Habitat parameters
I collected one raster layer for each of the three habitat parameters, then overlaid them on one another to establish suitable habitat. I ranked the subcategories within each parameter from highest to lowest likelihood to be suitable for oyster presence.

  • Salinity: Salinity data was derived from an existing dataset available through the Oregon Department of Environmental Quality (DEQ). DEQ operates a publicly accessible online water monitoring data portal where results from water quality monitoring stations are posted. Wet-season salinity was used to represent the extreme of salinity stress, since Olympia oysters are more capable of withstanding higher-salinity waters than fresher water. The salinity dataset is based on historical measurements from 1960-2006 and represents a gradient from highest salinity (~32 psu) to fresher water (<16 psu).
  • Substrate: Substrate data was taken from the Estuary Habitat Map of Oregon using the Coastal and Marine Ecological Classification Standard (CMECS). CMECS is a national standard; publicly available data for Oregon’s coastal areas was made accessible through an effort by the Oregon Coastal Management Program (Department of Land Conservation and Development). The CMECS Estuarine Substrate Component layer is an Oregon-wide raster that I clipped to Yaquina Bay for this analysis. This layer characterizes substrate type broadly and is fairly low resolution.
  • Elevation: Elevation is represented through a bathymetric dataset from 2001-2002, which is the most current estuary-wide available dataset. This raster was sourced from the Newport, Oregon office of the Environmental Protection Agency (EPA) through contact with a staff geographer.

Oyster location data
This spring, I completed several field surveys of the intertidal zone of Yaquina Bay during low tide series. These were conducted in collaboration with my advisor and field staff from the Oregon Department of Fish and Wildlife (ODFW) on April 19-20 and May 17, 2019. We used a stratified random sampling design to randomly select survey locations. Once at the field survey locations, we timed our search for five minutes and assessed as much area as possible within that time. Oysters were characterized as ‘present’ if evidence of at least one oyster, living or dead, was detected within the search area. The oyster or oyster shell needed to be attached to a substrate to qualify. These data points were collected using a handheld Garmin GPS unit and field notebook, then transferred to Google Earth for quality control, and finally to ArcGIS Pro for analysis.

I originally hypothesized that the spatial pattern of Olympia oyster location data would correspond with the spatial pattern of suitable habitat as established by analysis of the three habitat parameters (salinity, substrate, elevation). I did not think the correspondence would be exact, but that it would at least allow for a somewhat accurate prediction of where the oysters were located in the estuary. I felt this model would be most useful in determining the spatial range of the oysters by finding the extreme ends of the population upstream and downstream. Of the three habitat parameters, I hypothesized that salinity would be the ultimate predictor of where oysters could be found and that substrate would be the limiting parameter.

Exercise 1
I did not have any oyster location data to work with yet, so I first focused on the three habitat parameters to identify and rank where habitat was least to most suitable. I ranked the subcategories within each parameter raster layer by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer. I then combined them using the ‘Weighted Overlay’ tool in ArcGIS Pro to determine where overlap between the layers shows the best environmental conditions for oysters to survive.
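A minimal sketch of what a weighted overlay computes, assuming hypothetical 40/30/30 weights (the post does not report the weights actually used in ArcGIS Pro):

```python
import numpy as np

# three tiny rasters of ranked suitability (1 = least suitable, 4 = most);
# values and weights are illustrative, not the study's inputs
salinity = np.array([[4, 3], [2, 1]])
substrate = np.array([[4, 4], [3, 1]])
elevation = np.array([[3, 3], [2, 2]])

weights = {"salinity": 0.4, "substrate": 0.3, "elevation": 0.3}
suitability = (weights["salinity"] * salinity
               + weights["substrate"] * substrate
               + weights["elevation"] * elevation)

# round to integer classes, mirroring a Weighted Overlay-style output;
# here the top-left cell scores highest (4) and the bottom-right lowest (1)
suitability_class = np.rint(suitability).astype(int)
print(suitability_class)
```

The resulting integer raster is the habitat suitability surface that the field observations are compared against in Exercise 2.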

Exercise 2
After collecting field data to determine where oysters were present or absent within the intertidal zone of Yaquina Bay, I used the data points to compare against the map of suitable habitat. I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro where I used the ‘Buffer’ and ‘Zonal Statistics’ tools to analyze the neighborhood surrounding each data point. Looking at these neighborhoods, I determined what habitat suitability type that the majority of surrounding habitat fit into. Finally, I made a confusion matrix to statistically assess which data points fit into each habitat type.
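The buffer-and-zonal-statistics step can be approximated in raster terms as a majority filter around each survey point. The raster values, point locations, and square window below are illustrative stand-ins for the ArcGIS Pro ‘Buffer’ and ‘Zonal Statistics’ tools:

```python
import numpy as np

# synthetic suitability raster (values 1-4) and survey points (row, col)
raster = np.array([[4, 4, 3, 2],
                   [4, 3, 3, 2],
                   [3, 3, 2, 1],
                   [2, 2, 1, 1]])
points = [(0, 0), (2, 3)]
radius = 1  # square neighborhood, a stand-in for a circular buffer

def majority_class(r, c):
    """Most common suitability class in the window around a point."""
    window = raster[max(r - radius, 0):r + radius + 1,
                    max(c - radius, 0):c + radius + 1]
    vals, counts = np.unique(window, return_counts=True)
    return int(vals[np.argmax(counts)])

print([majority_class(r, c) for r, c in points])
```

Each point’s majority class is what then gets tallied against observed presence/absence in the confusion matrix.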

Exercise 3
Based on the results of the confusion matrix in Exercise 2, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations. I decided to see if the spatial pattern of oyster location data could be better described by manipulating the spatial pattern of one of the habitat parameters. I decided to change the rank values of the salinity parameter to alter the map of suitable habitat. Once I had a new output for habitat suitability, I ran the same neighborhood analysis that I used in Exercise 2 and created another confusion matrix.

Exercise 1
After applying the ‘Weighted Overlay’ to the input raster layers, this map was generated as a representation of habitat suitability:

Exercise 2
This neighborhood analysis map shows the majority habitat type surrounding each data point:

This table below shows how many data points fell into each habitat type:

Habitat suitability
             4           3           2          1
Presence     13 [0.72]   4 [0.4]     2 [0.29]   0 [0]
Absence      5 [0.28]    6 [0.6]     5 [0.71]   0 [0]
n = 35       18 [1.0]    10 [1.0]    7 [1.0]    0 [0]

This confusion matrix further examines the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1 since there were no observations in this category, and type 3 since it fell in between high and low-quality habitat. Decimals reported indicate the proportion of total observations (n = 35) that fell within this category:

Habitat suitability
             4 (high)   2 (low)
Presence     0.37       0.06
Absence      0.14       0.14

Exercise 3
This second neighborhood analysis map shows the majority habitat type surrounding each data point after the habitat suitability model was altered (by manipulating salinity):

This table below shows how many data points fell into each habitat type:

Habitat suitability
             4          3           2         1
Presence     1 [0.5]    18 [0.58]   0 [0]     0 [0]
Absence      1 [0.5]    13 [0.42]   2 [1.0]   0 [0]
n = 35       2 [1.0]    31 [1.0]    2 [1.0]   0 [0]

I created another confusion matrix to further examine the ‘errors’ in the data. I removed habitat suitability type 1 since there were no observations in this category. Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category:

Habitat suitability
             4 (high)   3       2 (low)
Presence     0.03       0.51    0
Absence      0.03       0.37    0.06

From this analysis, I feel that these three habitat parameters do a good job of predicting the spatial distribution of oysters generally. This approach can be used as a helpful tool for guiding resource managers and shellfish biologists to likely oyster locations, but cannot be the only method for validating their presence or absence at this point. Given budget cuts or competing priorities for today’s resource managers however, a map predicting where an oyster population is might be enough for making management decisions in a pinch.

In collaboration with my advisor at ODFW, I am using the results of this spatial analysis to fill an important data gap about where and how many Olympia oysters can be found in Yaquina Bay.  This information will be used to develop a repeatable monitoring protocol to be conducted by ODFW’s Shellfish and Estuarine Assessment of Coastal Oregon (SEACOR) program. I hope that my research can also help estimate population enhancement potential by identifying possible sites for habitat restoration. With increased accuracy, I’d also like to analyze potential oyster distribution under future environmental conditions and under different management scenarios.

Recovery of a robust population of Olympia oysters is important to Yaquina Bay and the surrounding community in a number of ways. The species provides many benefits to the ecosystem, including water filtration and habitat for other marine creatures. It is culturally significant to local tribes, including the Confederated Tribes of Siletz. While Olympia oysters are not currently listed as threatened or endangered, a possible listing would trigger a number of mitigation and conservation measures that would be difficult and expensive for agencies and private landowners. Additionally, there has been some exploration of whether, if the population can become robust again, this species could be grown and harvested as a specialty food product.

Software and statistics learning
Most of the spatial analysis tools I used were familiar to me, though I had only really used them in exercises in other classes. I had not previously been able to use my own data to conduct an analysis, so that was exciting, but also a bit of a learning curve. The confusion matrix was a new tool that provided a very simple and useful way to statistically assess my data.

Some issues worth noting
Throughout this process, I identified several small issues that I will need to consider or modify to improve my research analysis:

  1. Regarding the neighborhood analysis, the size of the buffer around each point affects how the majority of habitat type is calculated. It might be worthwhile to try another analysis where buffers are smaller or multiple sized buffers are assessed.
  2. A major issue for creating the habitat suitability map was the resolution of the base raster layers for each parameter, especially the substrate layer. The current layer does not reflect the nuances of small patches of available substrate and therefore is not a true indicator of available substrate for oysters. The resolution of the substrate layer is not sophisticated enough to capture the true extent of the oysters, because they’re known to be opportunistic in finding suitable substrate.
  3. The area covered by each field survey (5-min search at a randomly selected starting point) was not equal at every location. Difficulty of travel at the site impacted the rate at which the survey could be conducted. For example, extremely muddy locations or locations with a lot of large boulders made traversing through the survey site more challenging and slowed the survey team down. Additionally, more ground could be covered if little substrate was available for examination.
  4. During field surveys, no distinction was made for adult or juvenile oyster, or living or dead, for determining presence/absence. Separating out this information on the map might be interesting in determining where oysters are more likely to survive long-term. For example, it has been noted that young oysters can opportunistically settle on substrates under less ideal environmental conditions, but may not be able to survive into adulthood.
  5. The simplistic approach for ranking the subcategories within each habitat parameter (on a scale from 1-4) ensured that the extreme ends of the rank value were always represented; there may not always be a definitive 1 or 4 value within each category. For example, maybe high salinity and low salinity both rank as 2’s rather than deciding one is slightly more appropriate than the other.


Beck, M. W., R.D. Brumbaugh, L. Airoldi, A. Carranza, L.D. Coen, C. Crawford, O. Defeo, G.J. Edgar, B. Hancock, M.C. Kay, H.S. Lenihan, M.W. Luckenbach, C.L. Toropova, G. Zhang, X. Guo. 2011. Oyster Reefs at Risk and Recommendations for Conservation, Restoration, and Management. BioScience 61: 107-116.

Groth, S. and S. Rumrill. 2009. History of Olympia Oysters (Ostrea lurida Carpenter 1864) in Oregon Estuaries, and a Description of Recovering Populations in Coos Bay. J. Shellfish Research 28: 51-58.



Final Project: Characterizing the exclusion of refugee settlements from global settlement datasets


Anna Ballasiotes


This research is a small part of a larger question: where are refugees? That question in turn underlies a larger problem: how will the global community provide resources to those who need help? In order to do so, understanding where these vulnerable populations exist is particularly important. In considering the hierarchy of understanding where IDPs and refugees exist, there are certain drivers of the creation of a ‘refugee population,’ notably conflict (most often varying levels of violent conflict) and, more recently, climate events (including natural disasters). Beyond these drivers, there are different typologies of refugee populations: refugees within cities blending into existing infrastructure, refugees in settlements and camps close to borders, refugees in settlements close to urban centers, and refugees in settlements and camps in areas where populations have not been established and land is therefore available. My focus lies largely in the last item of that list: areas where populations have not been formally established but where camps and settlements are now expanding. It is a known problem that global settlement datasets do not do an adequate job of recognizing these camps and settlements as “population.” As such, my goal was to characterize how these global settlement datasets fail to recognize these camps and settlements, and thus exclude these populations. This leads to the question: to what extent do global settlement datasets fail to detect known refugee camps and settlements that form in areas of low population density?

I originally sought to answer the question, “what about the classification metrics of global settlement datasets fails to detect settlements known to OSM?” I wasn’t able to truly answer this question because I needed to spend more time on actually characterizing this exclusion before understanding why these global population datasets were unsuccessful. Characterizing where this exclusion occurred was difficult enough and took most of the class, especially as I explored what these refugee settlements looked like and considered how they may have formed.


I used three main data sources: Refugee settlement boundaries from OpenStreetMap, Facebook High Resolution Settlement Dataset, and the World Settlement Footprint. Only data within Uganda was used for these data, given the overlap of data availability and project partners with reliable data. The Facebook High Resolution Settlement Dataset is trained on 2018 imagery and is presented at a 30 meter resolution. World Settlement Footprint is trained on 2015 imagery and is also presented at 30 meter resolution, but with a minimum mapping unit mask that creates clusters of pixels that make the data appear to have a coarser resolution. The refugee settlement boundaries from OSM range in temporality from 2016 to 2019; however, this is only when the boundaries were added to the OpenStreetMap database and is not necessarily when camps or settlements were established. I had initially included the Global Human Settlement Layer, which is also trained on 2015 imagery, but GHSL did not appear to detect any settlement in the refugee camp boundaries within OSM, so I was not able to perform any meaningful analysis with it.

HYPOTHESES: predictions of patterns and processes.

I wanted to understand in what ways the global settlement datasets excluded the known refugee settlements — thus, I wanted to understand what percentage of refugee settlements were captured and whether larger settlements or smaller settlements showed a higher percentage of captured area, or if settlements that had a higher percentage of captured area showed a faster green-up around the settlement as one moved away from the boundary. I suspected that smaller settlements would have a higher percentage of area captured, mostly because I predicted that smaller settlements conformed more to the actual built-up of settlement area, whereas the larger settlements included more land cover types that would be less likely detected as “built-up.” I also suspected that settlements that showed a higher detection percentage would have a more distinct natural boundary — that the green-up around the settlements would occur faster if a higher percentage was detected.  

APPROACHES: analysis approaches used.

I had initially measured the spatial autocorrelation of the refugee settlements using ArcGIS Pro, but did not find this analysis to provide too much significant information for me. I already knew that my settlements were quite different from each other in size; any clustering would have really come from other factors that are not obviously visible, like the way land was distributed or delegated for refugee settlements.

In order to characterize the percentage of area of refugee settlements captured by each settlement dataset, I brought everything into Earth Engine and calculated pixel area of every pixel both within the refugee boundaries and across WSF and Facebook datasets — counting only the pixels that have area detected as settlement.

In order to refine the comparison of area detected within refugee settlements, I also needed to look only at pixels that appear "built-up" within the refugee settlements, since the boundaries themselves contain quite a lot of green space and other land cover types. I also performed a binary unsupervised classification based on the known locations of refugee settlements. Ultimately, I used a confusion matrix to compare this binary unsupervised classification to WSF and Facebook to refine the "percent area detected."
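The confusion-matrix comparison boils down to cross-tabulating two binary rasters. A hedged Python sketch with toy arrays (the real comparison used the K-Means output against WSF and Facebook):

```python
import numpy as np

def binary_confusion(reference: np.ndarray, comparison: np.ndarray):
    """2x2 confusion counts between two binary settlement rasters.
    Returns (detected by both, reference only, comparison only, neither)."""
    r = reference.astype(bool).ravel()
    c = comparison.astype(bool).ravel()
    counts = ((r & c).sum(), (r & ~c).sum(), (~r & c).sum(), (~r & ~c).sum())
    return tuple(int(v) for v in counts)

# Hypothetical 2x2 rasters: a K-Means classification vs. WSF
kmeans = np.array([[1, 1], [0, 0]])
wsf    = np.array([[1, 0], [1, 0]])
print(binary_confusion(kmeans, wsf))  # (1, 1, 1, 1)
```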

I also looked for the relationship of detection and greenness moving away from the boundary. To do this, I created a multi-ring buffer out to 1 kilometer at 100 meter increments, calculated the average maximum NDVI in each “donut buffer” for 2018 using Landsat 8 imagery. I then plotted these NDVI values over increasing distance from the refugee boundary to see the green-up.
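The donut-buffer green-up profile can be approximated on a raster by binning pixels by their distance from the boundary. A sketch in Python, assuming a boolean boundary mask and an NDVI array (the actual analysis used vector buffers and Landsat 8 imagery in Earth Engine):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def greenup_profile(ndvi, inside, pixel_size=30.0, ring_width=100.0, max_dist=1000.0):
    """Mean NDVI in concentric 'donut' rings outside a settlement boundary.

    ndvi:   2-D array of (e.g. annual maximum) NDVI values.
    inside: boolean array, True inside the settlement boundary.
    Distances are measured outward from the boundary in map units (metres).
    """
    # Distance from each outside pixel to the nearest inside pixel
    dist = distance_transform_edt(~inside) * pixel_size
    edges = np.arange(0.0, max_dist + ring_width, ring_width)
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ring = (dist > lo) & (dist <= hi)
        means.append(float(ndvi[ring].mean()) if ring.any() else np.nan)
    return means
```

Plotting the returned means against ring distance gives the green-up curve described above.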


I produced a map of the spatial confusion (overlap) between WSF, Facebook, and a binary unsupervised classification within my settlement boundaries. I have refined this map a bit, but further statistical comparisons could still be done to better characterize the overlap of WSF, Facebook, and the K-Means classifier.

I also have all of the settlement boundaries with the percent of area captured by each global dataset within each boundary: this could be a sorting mechanism with which I could perform further analysis to better answer my hypothesis, which I didn't necessarily answer well, specifically with regards to the NDVI prediction. Within the exercises, I answered this question based on geography rather than percent detection: did settlements in the north vs. south show different patterns of vegetation green-up around the settlement?

Ayilo 1 & 2 Refugee Settlements; Classifications shown

SIGNIFICANCE What did you learn from your results? How are these results important to science? to resource managers?

I learned that across the board, refugee settlements ARE systematically excluded from global datasets but can easily be detected at the local scale, and thus this systematic bias of these datasets must be addressed and solved. Those that are modeling systems of migration or providing access to humanitarian resources need to have more accurate representations of settlements. Beyond this, I was really hoping to be able to characterize the spatial patterns of the exclusion better than I was able to, and I think that’s where my analysis fell short and where my next hypothesis would manifest.

LEARNING: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other? What about statistics? including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I didn't truly dive more deeply into this software than I have in the past; I think I just became more acutely aware of the statistical methods that exist in the Arc toolboxes and became more comfortable manipulating data in R.

I think that the statistical learning was less specific or intense than I was hoping, but also in part because I think I was really challenged to think statistically and spatially-statistically about my research question. I wish I had spent more time understanding statistical methods rather than trying to mold my question into something that seemed to fit my elementary understanding of spatial statistics.

Final: Stream Change Recap

My project focused on a set of grain size and cross-sectional change data from a long term research project in the Andrews Forest. (Maps and study description from Blog Post 1)

I wanted to explore two questions:

  • How do stream channel erosion, deposition, and particle size vary over time and space?
  • How do these changes relate to adjacent changes in the across-stream or along-stream direction?


I tried to address these nested hypotheses about the system:

  1. Existing stream bed morphology drives spatial patterns in cross-sectional change.
    1. Unit-scale stream morphology should lead to patchy transport and different levels of autocorrelation of the change at different spatial scales
    2. Either hillslope or hydrodynamic processes or both should result in relatively high cross sectional change near banks.
  2. Extreme flow events alter the size distribution of in-channel sediment
    1. High energy flows associated with high peak discharge events should increase median grain size
    2. Transport of hillslope material (also associated with high peak discharge events) should decrease sorting
  3. Extreme flow events reduce the relative impact of bed morphology in cross-sectional change.


Here’s a conceptual model to investigate the patterns of change

Overall, I expected that different processes would drive change at different scales, so I expected the spatial and temporal patterns of change to vary based on scale as well.


  • At reach scales: patterns of change are driven by reach level features including slope, watershed size, and land use history.
  • At unit scales and smaller, patterns of change are driven by unit-scale bed morphology and positions of adjacent features.


  • During time periods that include extreme flows, patterns of change are driven by the magnitude of the flow.
  • During time periods of less extreme flows, patterns of change are more dependent on local features as described above



Due to the study design and to limits in data availability, it made sense for me to use one-dimensional analysis methods and correlation or autocorrelation tools on much of the data. I mostly worked in R. I also started but didn’t finish work on a network model using a tool specifically designed for stream network analysis.
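For the one-dimensional analyses, the core operation is a sample autocorrelation along a cross section. A small Python sketch (the project itself used R's correlogram tools; the series below is made up):

```python
import numpy as np

def autocorr(series, max_lag):
    """Sample autocorrelation of a 1-D series (e.g. bed-elevation change
    along a cross section) at lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    return [float((x[:len(x) - k] * x[k:]).sum() / denom) for k in range(max_lag + 1)]

# A patchy scour/fill pattern shows positive correlation at short lags
change = np.array([0.2, 0.3, 0.25, -0.1, -0.2, -0.15, 0.1, 0.2])
acf = autocorr(change, 3)
print(round(acf[0], 2))  # 1.0
```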

Technical limitations:

1-dimensional models don’t perfectly capture the complex realities of stream networks and irregularly spaced spatial data, so I ran into a few hurdles trying to parse analysis results. I also couldn’t do some of my desired analyses due to the lack of shared coordinate systems for different data sets or for adjacent cross sections. However, thinking about these analyses helped me better understand exactly which data I would need to collect in order to do them in the future.

I am still having some trouble with a stream network analysis add-on in ArcGIS, and I am in the process of reaching out to experienced users who I hope know the way around my particular error.



I started the stream network analysis partly because I wanted more context for the results of exercise 3. Some 2017 REU students collected grain size data at sites that overlapped with the long term cross sectional study sites. I haven’t finished the network analysis, but I’ve mapped the REU student data below.

Here is a map of the median grain size (D50) at the REU student sites:

Here is a map of half the difference between the 84th and 16th percentile grain size at the REU student sites. This value is one metric of sorting, and it would equal the standard deviation if the sizes were normally distributed.

These results indicate that grain sizes and sorting in the grain sizes display more complex patterns than a simple fining of grain size downstream or consistent gradations in sorting.
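Both mapped quantities come straight from percentiles of the measured grain sizes. A short Python sketch with synthetic sizes standing in for the REU field data:

```python
import numpy as np

def grain_size_metrics(sizes_mm):
    """Median grain size (D50) and the sorting metric described above:
    half the difference between the 84th and 16th percentile sizes."""
    d16, d50, d84 = np.percentile(sizes_mm, [16, 50, 84])
    return float(d50), float((d84 - d16) / 2.0)

# For normally distributed sizes, the sorting metric approximates the
# standard deviation (here, synthetic sizes with mean 64 mm, SD 10 mm)
rng = np.random.default_rng(0)
d50, sorting = grain_size_metrics(rng.normal(64.0, 10.0, 10_000))
print(round(d50, 1), round(sorting, 1))
```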

For all of the analyses, I need to spend a little more time figuring out which of these results are usable and might reflect reality in the field, and which are more likely showing artifacts of data collection or processing.


I hope to use these results to add more depth to my research and place it in a broader context. I personally learned more about the strengths and limitations of the data sets.

The broader purpose of the entire project is to see how streams like these respond over time. How much of their changes are influenced by hydrologic rather than hillslope processes? How big of a hydrologic event needs to happen to do substantial work on the channel? The project could be useful to land managers because channel mobility could damage streamside infrastructure or alter benthic ecology.

My learning (technical):

From a technical standpoint, I think I improved my ggplot skills. I started using cowplot and learned more about custom themes. I learned a lot about preparing data using the STARS toolbox, and I am excited to learn more about the SSN R package once I get my data working.

My learning (statistics):

In this class, I learned more about autocorrelation methods and correlograms. I hope to learn more about network analysis soon, too.

Examining the Spatial Patterns of NDVI Change near ASGM and non-ASGM villages in rural Senegal


My research question for this project was to investigate how spatial patterns of land use/land cover (LULC) change were related and/or associated with spatial patterns of artisanal, small-scale gold mining (ASGM) establishment. A possible mechanism I considered as influencing these patterns would be deforestation associated with the establishment of ASGM operations, as miners require significant amounts of timber to bolster mine shafts, in addition to constructing shelters around the mine shafts and for the miners themselves. As such, my research question can be phrased thusly: “How is the spatial pattern of LULC change related to the spatial pattern of ASGM establishment via the mechanism of deforestation?” Additionally, I wanted to examine whether alternative factors may affect LULC change around sites; in this case, I examined the proximity of a site to a primary road and whether or not that related to rates of LULC change.


My dataset for this project had several components. In previous work, I used very high resolution satellite imagery to map out the locations and spatial extent of nine ASGM sites, as well as nine non-ASGM sites to compare to (these sites are shown in fig. 1). After having mapped these sites, I acquired Landsat 5 and Landsat 8 imagery at 30m spatial resolution from the USGS for April 2007 and April 2018, respectively. Using this Landsat imagery, I calculated normalized difference vegetation indices (NDVIs) for each year, and then subtracted the former from the latter to obtain a dataset showing NDVI loss between the two timeframes. This dataset enabled me to examine NDVI change around each of my 18 sites to evaluate LULC change in the form of NDVI change. In addition to these datasets, I also acquired a road network layer for Senegal from OpenStreetMap, which I selected to include only "primary" roads in Kedougou. Together, these datasets enabled me to investigate LULC change around ASGM sites via NDVI change, and also how this change relates to proximity to primary roads.

Fig 1: Study Site


My two hypotheses for this project are that 1) NDVI would be more negatively impacted near ASGM sites compared to non-ASGM sites as a result of deforestation, and 2) NDVI losses at ASGM sites could be attributed in part to each site’s proximity to primary roads.

Patterns of higher NDVI loss as a percentage of the total would occur at ASGM sites compared to non-ASGM sites. This may be because, as the larger populations near ASGM sites and also the activity itself would necessitate more timber harvesting, there would be more negative NDVI change in the immediate surroundings compared to sites without ASGM activity. The spatial process I would expect to see would be areas of higher NDVI loss nearer to the center of ASGM villages. These areas would be fragmented, indicating differences in vegetative cover around the village, but, overall, more negative than around non-ASGM villages. This pattern results from miners and villagers harvesting the nearby timber to bolster mine shafts, as well as for construction of structures around each mine and their own huts. This hypothesis is an explanation for the pattern of NDVI loss around a village.
ASGM sites may be, overall, closer to primary roads than non-ASGM sites, and so their growth and attendant NDVI loss is in part attributable to proximity to major travel corridors and the relative ease of reaching the sites. If a mine is closer to a primary road than another mine, it’s possible that this proximity may result in larger spatial patterns of NDVI loss. This hypothesis is an explanation for the process of deforestation around different sites.

The first hypothesis here is an attempt at understanding what spatial patterns may be occurring around ASGM sites; that is, it is an attempt to understand what NDVI losses may look like. The second hypothesis is an attempt to understand why those patterns may be occurring, and in finding some sort of correlation between the observed pattern and an alternative possible explanation for that pattern.

Exercise 1: Moran’s I

Before examining how patterns of NDVI loss are shaped around ASGM and non-ASGM sites, I first wanted to examine how the patterns of NDVI loss and gain were self-related; that is, I wanted to know whether or not patterns of NDVI change occur near areas of similar change (spatially autocorrelated or clustered) or whether or not these patterns occur away from self-similar patterns (dispersed). To do this, I took my layer of NDVI change and ran a script in Rstudio to generate a Moran’s I value for the global layer, which is a measure of spatial autocorrelation: it results in values between -1 and 1, where values closer to -1 indicate negative spatial autocorrelation (dispersion) and values closer to 1 indicate positive spatial autocorrelation (clustering). I also performed this analysis at a smaller scale around one mining village to see how Moran’s I may change on a different scale.
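For readers who want the mechanics, global Moran's I with inverse-distance weights (the approach used by ape::Moran.I in the R script) can be written out directly. A Python sketch with a toy gradient surface:

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights and a zero diagonal,
    similar in spirit to ape::Moran.I in R."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = len(x)
    # Pairwise distances; weight = 1/distance, zero for self-pairs
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    with np.errstate(divide="ignore"):
        w = np.where(d > 0, 1.0 / d, 0.0)
    z = x - x.mean()
    num = n * (w * np.outer(z, z)).sum()
    den = w.sum() * (z * z).sum()
    return num / den

# Values that increase smoothly across a grid are strongly clustered,
# so Moran's I should be positive
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
coords = np.column_stack([xs.ravel(), ys.ravel()])
vals = xs.ravel() + ys.ravel()
print(morans_i(vals, coords) > 0)  # True
```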

Exercise 2: Neighborhood Analysis

After determining the spatial autocorrelation of patterns of NDVI change, I could examine how those patterns were related to and varied between ASGM sites and non-ASGM sites. To do this, I performed a neighborhood analysis. At each village, I found its center using the Mean Center tool in ArcGIS Pro. Around each center point, I generated “donut” buffers using the Buffer tool at 250 meter increments, up to 2 kilometers, for eight different donuts (e.g. 0-250m, 251-500m, 501-750m, etc.). I then used each of these buffers to Clip the global NDVI layer (which I had also edited to omit values over rivers and Mali), generating eight raster layers representing NDVI change values between each spatial step. From there, I simply generated histograms for each step and counted the values below 0, indicating NDVI loss from 2007 to 2018, and found their percentage of the total. With these values, I could then see how NDVI loss as a percentage of total pixels in each donut varied with distance from the center of each village, and could more readily understand how NDVI values change with distance from both ASGM and non-ASGM villages.
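The per-donut tally reduces to counting negative pixels among the valid pixels of each clipped ring. A minimal Python sketch with a hypothetical ring raster (NaN marks pixels outside the ring):

```python
import numpy as np

def percent_loss(ndvi_change_ring):
    """Percent of valid pixels in one clipped 'donut' raster with
    NDVI change below zero (loss between the two dates)."""
    vals = np.asarray(ndvi_change_ring, dtype=float).ravel()
    vals = vals[~np.isnan(vals)]          # NaN = pixels outside the ring
    return 100.0 * (vals < 0).sum() / vals.size

# Hypothetical clipped ring: 3 valid pixels, 2 of them negative
ring = np.array([[0.1, -0.2], [-0.05, np.nan]])
print(round(percent_loss(ring), 2))  # 66.67
```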

Exercise 3: Pearson’s R Correlation

After examining how NDVI changes across space near ASGM and non-ASGM villages, I wanted to further understand how these patterns may be related to another factor: proximity to a primary road. To do this, I used my road network layer from OpenStreetMap showing the locations of all primary roads in Kedougou and used the Near tool in ArcGIS Pro to find the shortest distance from each of my 18 sites to the nearest primary road. Then I took only the first-ring values of NDVI change (from 0m to 250m) and, using these values along with each site's distance to the nearest primary road, performed a Pearson's R calculation (fig. 2). This approach thus attempts to explain how values of NDVI loss around both ASGM and non-ASGM sites may be related to a site's proximity to a primary road, providing more information on how different rates of NDVI loss may be affected by the ability of people to access a particular village.
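The correlation step itself is a single calculation. A Python sketch with hypothetical distances and losses (not the project's actual values):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between, e.g., distance to the nearest
    primary road and first-ring NDVI loss at each site."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical sites: loss shrinks as distance to the road grows,
# so the correlation should be strongly negative
dist_to_road = [500, 1200, 3000, 4500]   # metres (made up)
ndvi_loss    = [0.40, 0.35, 0.30, 0.20]  # first-ring loss fractions (made up)
print(round(pearson_r(dist_to_road, ndvi_loss), 2))  # -0.98
```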

Fig 2: Pearson’s R Correlation. x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.


I had several results from these three approaches.

Firstly, I produced a global Moran's I of 0.6716816. The second Moran's I, produced at a smaller scale, was 0.8079745. These results helped me to understand that, generally, patterns of NDVI loss are spatially autocorrelated, meaning that areas of loss and growth are clustered near self-similar areas. However, this analysis occurred over such a large area that it is possible the global Moran's I may be accounting more for patterns near riparian corridors, which are of course autocorrelated, especially in areas where water bodies are visible. As such, this result needs to be interpreted with caution.

Secondly, the results from my neighborhood analysis are presented in fig. 3 [insert graph of NDVI changes for fig. 3; include key in caption]. These results indicate that, on average, villages with ASGM activity present see more NDVI loss in their immediate surroundings than non-ASGM villages, up to about 2km radially from their center. This means that, generally, villages with ASGM activities experience reduced NDVI, which may indicate vegetation loss or a loss of healthy vegetation, relative to villages without ASGM present.

Fig 3: NDVI results. Light blue lines indicate ASGM mines; the dark blue line is their average. Light red lines indicate non-ASGM villages; the dark red line is their average.

Finally, the results of my Pearson's R correlation indicate relatively weak correlations between a village's NDVI loss (at least within the first ring) and its proximity to the nearest primary road. The results are shown in table 1 [insert table screen snip]. For ASGM villages, the correlation was -0.1418, whereas non-ASGM villages have a Pearson's R of -0.0578. Although both of these are weak correlations, there is some indication that ASGM villages and their attendant NDVI losses are more correlated with proximity to a primary road (i.e., the closer they are to a primary road, the more NDVI loss they experience) than non-ASGM villages.

Table 1: Pearson’s R Results


These results are significant for several reasons. They showed me that there is an observable pattern of NDVI loss that is greater around villages which develop or engage in ASGM than around villages which do not. However, there is still some work to do in order to understand the process by which NDVI varies, and it may not be explainable by a village's proximity to a primary road. There are several possibilities that may explain the process: deforestation is more prevalent around the gold mines because of the need to secure timber for construction, or because an inherently larger population at gold mines necessitates more forest product harvesting. More work should be undertaken to understand these processes. On the whole, however, these results are useful for governments or NGOs which seek to improve the conditions in which ASGM practitioners live and work, now that they know that ASGM villages experience greater rates of NDVI loss, which may indicate environmental degradation.


This process taught me several techniques. Firstly, I had no experience at all with R and RStudio, though I do understand some coding, so this helped me learn more about a really useful language that I had no prior experience with. Secondly, any practice with ArcGIS Pro is good practice for me, and so I got a lot of time and work done using that piece of software. In particular, I got some good time in with ModelBuilder to help expedite work, as generating eight different ring buffers and then clipping them to an NDVI layer for 18 different villages would have been very time-consuming otherwise. Finally, I gained some good experience performing a statistical analysis, in the form of Pearson's R. I have very little experience performing statistics, so this was really useful in helping me understand and interpret statistical analyses and results.


With regard to statistics, I learned several things. The three statistical analyses I used were Moran’s I for spatial autocorrelation, neighborhood analyses for comparing spatial patterns, and Pearson’s R for calculating a simple statistical correlation.

With regards to Moran’s I, I learned the importance of examining patterns at different scales, as a pattern which is apparent at one scale may not be so at another scale, and so interpretation and set-up must be conducted carefully.

Concerning neighborhood analyses, I learned the importance of a diligent workflow and juggling several different variables without muddying the results. This also demonstrated how to examine and structure a problem which seeks to understand change both across space and across time.

Finally, conducting a Pearson’s R helped me to understand better how statistical analyses are constructed and what their significance is. Additionally, this taught me the value of interpreting a statistical result, as their significance may not always be indicative of anything. Null results or negative results are just as valuable as positive ones, and this analysis showed me that. Now I know more about what process to examine and how to examine them!




How are the spatial patterns of individuals’ perception of natural resource governance related to the spatial patterns of environmental restoration sites via distance and abundance of improved sites?


 Puget Sound Partnership Social Data

My data came from a survey on subjective human wellbeing related to natural environments. The sample was a stratified random sample (28% response, n= 2323) of the general public from the Puget Sound in Washington from one time period. I am specifically examining questions related to perceptions of natural resource governance. Around 1730 individuals gave location data (cross street and zip code) which have been converted to GPS points. I also have demographic information for individuals, their overall life satisfaction, and how attached and connected they feel to the environment.

Environmental Data

I have three different forms of environmental data: (1) point locations of restoration sites in the Puget Sound, (2) a shapefile of census-tract-level average environmental effects (the environmental condition ranked from 1, good, to 10, very poor), and (3) land cover raster files from 1992 and 2016.

  • Individual restoration sites numbered over 12,000. They spanned many years, and point locations overlapped significantly through time.
  • The environmental effects data come from an online mapping tool that was created in a collaboration between the University of Washington, the Washington Department of Ecology, and the Washington Department of Health. Environmental effects are based on lead risk from housing, proximity to hazardous waste treatment, storage, and disposal facilities (TSDFs), proximity to National Priorities List facilities (Superfund sites), proximity to Risk Management Plan (RMP) facilities, and wastewater discharge. The combination of these effects was aggregated within census tracts to produce an environmental effects ‘Rank’ from low effects (Rank of 1) to high effects (Rank of 10).
  • The land cover files contained 25 different types of land cover. I reclassified these land cover types by combining similar types of land cover (e.g. deciduous forest and conifer forest to ‘forest’).



 1) Shorter distances between individuals and restoration sites, and greater number of restoration sites near individuals, would correlate positively with governance perceptions

2) Positive environmental conditions will correlate positively with governance perceptions

Richard Petty and John Cacioppo developed the elaboration likelihood model (1980), which describes how individuals change their attitudes in response to persuasive stimuli. Petty and Cacioppo develop this idea through two routes of persuasion, the central and peripheral paths, where the route taken is determined by motivation, ability, and the nature of mental processing. The purpose of this theory is to help explain how individuals elaborate on ideas to form attitudes and how strong, or long lasting, those attitudes are. This theory may suggest that individuals will form stronger attitudes when reasons to form attitudes are stronger and more immediate. Stimuli of this kind tend to be nearer to individuals, as processing of issues far away is cognitively difficult. Research has shown that individuals have a difficult time thinking about problems at larger spatial scales, and there is an inverse relationship between feelings of individual responsibility for environmental problems and spatial scale (as scale gets larger, feelings of responsibility go down) (Uzzell, 2000).

Cognitive biases also suggest that people use heuristics to answer questions when they are unsure. One common heuristic is the saliency bias, where individuals overemphasize things that are emotionally more important and ignore less interesting things (Kahneman, 2011). This may suggest that spatial patterns are more important than temporal patterns because what is happening now is more salient to individuals than what happened in the past. This may imply that even if environmental outcomes have improved over time, the immediacy of the current environment, or of things influencing the environment, may have a greater effect on individual perceptions.



Statistical Analyses

Spatial Autocorrelation Analyses

Moran’s I: To compute Moran’s I, I used the “ape” library in R. I also subset my data to examine spatial autocorrelation by demographics including area (urban, suburban, rural), political ideology, life satisfaction, income, and cluster (created by running a cluster analysis on the seven variables which comprise the governance index).

Correlograms: I created correlograms for the variables that were significant (urban, conservative, and liberal) using the “ncf” library and the correlog() function. These figures give a better picture of spatial autocorrelation at various distances.

Semivariograms: To create semivariograms, I used the “gstat” and “lattice” libraries which contain a function called variogram. This function takes the variable of interest along with latitude and longitude locations. The object created can then be plotted. For this analysis I used the same subsets as in the Moran’s I analysis. These figures are excluded from the results as they do not provide information beyond what the correlograms show.

Nearest Neighbor analysis: A file of restoration sites was loaded into R. The sites were points. This analysis required four libraries, as the points were from different files: nlme, rpart, spatstat, and sf. I first had to turn my points into point objects in R (.ppp objects). I used the convexhull.xy() function to create the ‘window’ for my point object, then I used that to run the ppp() function. I then used the nncross() function. This function produced min, max, average, and quartile distances from individuals to the nearest restoration site. I added the ‘distance’ to the nearest neighbor as a variable in a linear regression to determine an association. I used these distances (1st quartile, median, and 3rd quartile) to produce buffers. I ran spatial joins between the buffers and the restoration sites. This produced an attribute table that had a join count, the number of restoration sites within the buffer. In R, I ran a linear regression with join count as an independent variable.
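The nncross() step, finding each respondent's distance to the nearest restoration site, has a direct analogue with a k-d tree. A Python sketch with made-up points:

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_site_distances(individuals, sites):
    """Distance from each survey respondent to the nearest restoration
    site, analogous to spatstat's nncross() in R."""
    tree = cKDTree(np.asarray(sites, dtype=float))
    dist, _ = tree.query(np.asarray(individuals, dtype=float))
    return dist

# Hypothetical coordinates (projected units)
people = [(0.0, 0.0), (2.0, 2.0)]
sites  = [(1.0, 0.0), (5.0, 5.0)]
print(nearest_site_distances(people, sites).round(3))
```

The returned distances can then feed the regression and buffer steps described above.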

Geographically weighted regression: I joined individual points to my rank data, a shapefile at census tract level. This gave individuals a rank for environmental effects at their location. Initially I used the GWR tool in Arc to run the regression, but wanted the flexibility of changing parameters and an easily readable output that R provides. In R, I used two different functions to run GWR: gwr() and gwr.basic(). The gwr() function required creating a bandwidth using gwr.sel, and gwr.basic required creating a bandwidth using bw.gwr. The difference between these functions is that gwr.basic produces p-values for the betas. I ran GWR on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals, who I defined as those one standard deviation above and below the mean perception.
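The core of GWR is an ordinary weighted least squares fit repeated at each location, with weights from a distance kernel. A simplified Python sketch (Gaussian kernel, fixed bandwidth; real GWR implementations also select the bandwidth and report local diagnostics):

```python
import numpy as np

def gwr_coefficients(X, y, coords, bandwidth):
    """Minimal geographically weighted regression: at each location, fit
    weighted least squares with Gaussian kernel weights based on distance
    to every observation. Returns one coefficient vector (intercept
    first) per location."""
    X = np.column_stack([np.ones(len(y)), X])     # add intercept column
    coords = np.asarray(coords, dtype=float)
    betas = []
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian kernel weights
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        betas.append(beta)
    return np.array(betas)

# Hypothetical demo: y depends linearly on x everywhere, so every
# local fit recovers roughly the same coefficients
X = np.arange(10.0).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
coords = np.column_stack([np.arange(10.0), np.zeros(10)])
print(gwr_coefficients(X, y, coords, bandwidth=2.0).shape)  # (10, 2)
```

Mapping the per-location betas (or residuals) is what produces figures like the residual map mentioned in the results.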

Neighborhood analysis: I created a dataset of the upper and lower values of my governance perceptions (one standard deviation above and below the means). I then added buffers to these points at 1 and 5 miles. I joined the buffers to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.

T-test: To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.
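Each land cover comparison is a standard two-sample t-test. A Python sketch with hypothetical change values (not the actual survey data):

```python
import numpy as np
from scipy import stats

# Hypothetical change in one land cover class (e.g. 'forest' fraction)
# around high- vs low-perception respondents
high_perception_change = np.array([-0.02, -0.05, 0.01, -0.03, -0.04])
low_perception_change  = np.array([-0.10, -0.12, -0.08, -0.11, -0.09])

# Two-sample t-test: did the groups experience different change?
t_stat, p_value = stats.ttest_ind(high_perception_change, low_perception_change)
print(p_value < 0.05)  # True
```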


Kriging and IDW: I used the Spatial Analysis toolbox to perform Kriging and IDW to compare the outputs of the two techniques. I used my indexed variable of governance perceptions. The values of the variable vary from 1 to 7. I then also uploaded a shapefile bounding the sample area.
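IDW itself is straightforward to write out, which helps when interpreting the Arc output: each prediction is a distance-weighted average of the sampled perception values. A Python sketch with made-up points:

```python
import numpy as np

def idw(known_xy, known_vals, query_xy, power=2.0):
    """Inverse-distance-weighted interpolation of a surface (e.g. the
    1-7 governance perception index) at query points."""
    known_xy = np.asarray(known_xy, dtype=float)
    vals = np.asarray(known_vals, dtype=float)
    out = []
    for q in np.asarray(query_xy, dtype=float):
        d = np.linalg.norm(known_xy - q, axis=1)
        if np.any(d == 0):                 # query hits a sample exactly
            out.append(vals[d == 0][0])
            continue
        w = 1.0 / d ** power
        out.append(float((w * vals).sum() / w.sum()))
    return np.array(out)

# Hypothetical perception scores at three sampled locations
pts  = [(0, 0), (1, 0), (0, 1)]
vals = [1.0, 7.0, 4.0]
print(idw(pts, vals, [(0, 0)]))  # [1.]
```

Unlike Kriging, IDW makes no assumptions about spatial covariance, which is one reason comparing the two surfaces is informative.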

Hotspot Analysis: I used the Spatial Analysis toolbox to perform a hotspot analysis. I used my indexed variable of governance perceptions with values from 1 to 7.

Tabulate area: I used two land cover maps from 1992 and 2016 to approximate environmental condition. I loaded the raster layers into ArcGIS Pro, and reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.



What are the spatial patterns of perceptions of natural resource governance?

Moran’s I statistic for overall governance perceptions was insignificant, indicating a lack of spatial autocorrelation at the regional scale (I value; p = 0.51). Significant autocorrelation was detected when the data were subset by demographic variables. Individuals residing in urban areas exhibited spatial autocorrelation of perceptions, while individuals in suburban and rural areas did not. Lastly, individuals classified as ‘conservative’ (political ideology < 0.5) exhibited significant spatial autocorrelation for governance perceptions (I statistic: -0.0062, p = 0.0018). The spatial autocorrelation for governance perceptions was mainly restricted to short distances, as shown in the correlograms below. Significant correlations for the entire study population and for individuals in urban areas were detected at near distances. Individuals classified as conservative exhibited significant correlation across multiple distances. No significant correlations were found among liberal individuals, suggesting a non-spatial driver mechanism may be at play.


Local patches of significantly lower perceptions (cold spots) and significantly higher perceptions (hot spots) were identified as shown in the map below. Three ‘cold spots’ were located in the areas around Port Angeles, Shelton, and the greater Everett area. Two hot spots were identified surrounding Bainbridge Island, and the city of Tacoma, which is where the Puget Sound Partnership is located.

Blue points are “cold spots” at 99%, 95%, and 90% confidence that the point resides in a cold spot. Cold spots are areas where perception is significantly lower than the average of those around them. Red points are “hot spots” at 99%, 95%, and 90% confidence that the point resides in a hot spot. Hot spots are areas where perception is significantly higher than the average of those around them. Small grey points are insignificantly different. Bounding lines are counties in northwest Washington.

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?

The median distance from individuals to the nearest restoration site was 0.037 degrees; the 1st quartile was 0.020 degrees and the 3rd quartile was 0.057 degrees.

The regression of individual perception on distance to the nearest restoration site was not significant (p = 0.198), suggesting that distance from the nearest restoration site does not influence perceptions. For each regression on the number of sites near an individual, all coefficients were negative, meaning the more sites near an individual, the more disagreeable their perception. All produced significant results, but the effect sizes were very small (Table 1).
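The “sites near an individual” predictor is just a count within a buffer, regressed against the perception score; a minimal sketch of both steps (all data below are hypothetical, and the real analysis used a statistics package):

```python
import math

def sites_within(point, sites, radius):
    """Count restoration sites within `radius` of an individual's
    location (coordinates in decimal degrees, as in the analysis)."""
    return sum(1 for s in sites if math.dist(point, s) <= radius)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical pattern: perception declines as site count rises -> negative b
site_counts = [0, 1, 2, 3]
perceptions = [5, 4, 3, 2]
print(ols_slope(site_counts, perceptions))  # -1.0
```

A negative slope, as in Table 1, means individuals with more nearby restoration sites reported less agreeable perceptions.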

Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size               b        p-value   Effect Size (rpb)
Buffer 1 (1st quartile)   -0.003   0.0181    .003
Buffer 2 (median)         -0.002   0.045     .002
Buffer 3 (3rd quartile)   -0.001   0.002     .003

In a geographically weighted regression (GWR) model, rank, life satisfaction, years lived in the Puget Sound, and race were significant (Table 2); all other variables were not. For rank (the combined score of environmental effects), the coefficient was positive. Higher rank values indicate worse environmental effects, so more agreeable perceptions were associated with worse environmental condition. Overall, the explanatory power of this model for governance perceptions was small, explaining about 10% of the variance in the data (Table 2).
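GWR fits a separate regression at each location, down-weighting distant observations with a kernel. A minimal sketch of that core idea with one predictor and a Gaussian kernel (not the ArcGIS implementation, which handles multiple predictors and bandwidth selection; the data here are made up):

```python
import math

def gwr_local_fit(coords, x, y, focal, bandwidth):
    """Local intercept and slope at `focal` via Gaussian-kernel
    weighted least squares -- the core step of GWR."""
    w = [math.exp(-(math.dist(c, focal) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    b0 = my - b1 * mx
    return b0, b1

# If the relationship is the same everywhere (y = 1 + 2x), every local
# fit recovers it, whatever the focal point and bandwidth
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
print(gwr_local_fit(coords, x, y, focal=(0, 0), bandwidth=1.0))
```

In real data the local coefficients vary across space, which is why mapping the fitted surface and its residuals (as described below Table 2) is the standard way to read a GWR.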

A map of residuals from the GWR confirms the results from the hotspot analysis about where high and low responses congregate.

Table 2. Regression results for environmental effects at individuals’ locations.

Variable1                        b        p-value2
Rank                             0.077    0.007**
Life Satisfaction                0.478    <0.001**
Years Lived in the Puget Sound   -0.010   0.002**
Sex                              -0.156   0.289
Area                             -0.150   0.179
Education                        -0.002   0.922
Income                           -0.022   0.622
Race                             0.006    0.034*
Ideology                         0.103    0.533

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level

The plot of high and low governance values at the two buffers is presented below. The black and grey curves represent survey respondents whose scores were at least one standard deviation below the mean (the mean was neutral). The green curves represent respondents whose scores were at least one standard deviation above the mean. The solid curves are the average rank at the one-mile buffer, and the dotted curves are the average rank at the five-mile buffer.

This figure indicates there are two peaks in average rank, at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects.


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

I used land cover change as a proxy for environmental condition over time. Land cover change differed significantly between the two groups for four land cover types—grassland, forest, shrub/scrub, and bare land. Grassland cover decreased for both groups, but decreased by 5% more in areas with individuals with below-average governance perceptions. Forest cover also decreased for both groups, but decreased by 1% less in areas with individuals with below-average governance perceptions (the amount of forest cover in the region is very large, which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below-average governance perceptions. Lastly, bare land decreased for both groups, but decreased by 5% more in areas with individuals with below-average governance perceptions (Table 3). The effect size of land cover change on perception was small for each of the significant variables, with point-biserial correlations between .07 and .09 (Table 3).
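The group comparisons in Table 3 rest on a two-sample t statistic and its conversion to a point-biserial effect size; both can be sketched briefly (toy numbers below, not the survey data):

```python
import math

def welch_t(a, b):
    """Welch's t statistic for a difference in means between two groups
    (e.g., land cover change for above- vs below-average perceivers)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def point_biserial(t, df):
    """Convert a t statistic to the point-biserial effect size r_pb."""
    return math.sqrt(t * t / (t * t + df))

# A t of ~3 is significant with a large sample, yet the effect is tiny,
# matching the pattern in Table 3
print(round(point_biserial(3.03, 1000), 3))  # 0.095
```

This is why several rows of Table 3 can have p < .05 while every effect size stays below .10: with enough respondents, even small mean differences reach significance.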

Table 3. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

                       Governance Perception1
Land Cover Type        Above Average   Below Average   t-value   p-value   Effect size (rpb)
High Development             17              16          0.80     .426       .02
Medium Development           19              19          0.67     .500       .02
Low Development              11              11          0.22     .829       .00
Developed Open Space         12              13          0.21     .837       .00
Agriculture                  -1               0          0.74     .461       .02
Grassland                   -13             -18          3.03     .002       .09
Forest                      -10              -9          2.61     .009       .08
Shrub/Scrub                  40              43          3.04     .002       .09
Wetlands                      0               0          1.67     .095       .05
Unconsolidated Shore          2               2          0.10     .918       .00
Bare Land                   -21             -25          2.43     .015       .07
Water                         0               0          0.90     .368       .03
Aquatic Beds                 -1              -5          1.19     .233       .03
Snow/Tundra                   0               0          0.00     .997       .00

1Cell entries are percent changes of land cover from 1992 to 2016

As one land cover type goes up, another land cover type must go down. While forest cover overall decreased in the region for both groups over time, areas with lower perceptions saw significantly less decrease. The percentage is small, but much of the area in the region is forested, so the actual amount of land is large. The two maps below show land cover in 2016 (top) and 1992 (bottom) around Port Angeles, an area with a significant cold spot (low perceptions). From these maps, you can see a spread of the dark green color (forest) and a decrease of the yellow (grassland). Port Angeles historically was a timber community (the same could be said for the other cold spot areas in the region). This change in forest and grassland could be the source of negative perceptions if these communities believe the governance structures are responsible for the change in forest cover.


Overall, in this analysis, poor environmental condition relates to positive perceptions, and land cover change may provide insight into reasons for poor perceptions. Areas with negative perceptions may not directly indicate poor natural resource governance. Elsewhere, trust in natural resource governance has been found to be primarily driven by individual value orientations (Manfredo et al. 2017). The three areas identified as cold spots in this study could be areas where individuals do not feel their values are represented by the natural resource governing systems. Conversely, hot spot Area A is located near the headquarters of the Puget Sound Partnership, potentially facilitating trust, representation, and access to information that may influence the perceptions of individuals located nearby. This urban, liberal, developed area contrasts with the area to the west, which contains a highly significant cold spot (Area 2). Individuals from Area 2, a more rural, conservative, historic logging area, likely have different value orientations than those from Area A. Similar comparisons could be made between the other cold spots and the areas near them. More research should be conducted to determine whether value orientations have a significant influence on perceptions of natural resource governance.


Your learning


During this course I significantly advanced my knowledge of ArcGIS Pro and RStudio. I learned how to run many new tools in Arc, including hotspot analysis, geographically weighted regression, and tabulate area. I also learned new skills in troubleshooting problems. For example, I was having difficulty getting the rank data to display on my map. Another student taught me that since I had imported the data from a .csv and joined it to a shapefile, I needed to create a new feature class from it before it would display. I also gained knowledge in how to run more spatial analyses in R, including many spatial autocorrelation analyses and geographically weighted regression (which involved creating a point object!).


What did you learn about statistics?


I first and foremost learned of the existence of many types of spatial statistics, including hotspot analysis (Getis-Ord Gi*), spatial autocorrelation (correlogram, semivariogram, Moran’s I), and geographically weighted regression (GWR). For hotspot analysis, I learned what the hotspot clusters mean and the best ways to pick the fixed-distance band of neighbors the tool looks at (either by selecting a minimum number of nearby points, or by looking at the z-score at the distance that shows the highest amount of spatial autocorrelation). Speaking of spatial autocorrelation, I learned what that meant in terms of the correlation of perceptions of governance between individuals depending on their distance from each other. Finally, I learned a little about GWR, including that it is necessary to plot the output to examine patterns over the space in question, and that the reason Arc does not display p-values for independent variables is not something to do with the equation, but rather due to the computing power of Arc.

Final Project: Searching For Faults with Kriging


I have looked at the spatial distribution of depth to first lava and the spatial relationship between the depth to first lava and the depth to first water. I found that distance from volcanic centers might affect the depth at which we find the first lava contact. I did not find any strong correlation between first water and first lava, but that does not preclude the idea that first water correlates to deeper lava contacts.

Now I want to look at what else might be affecting the spatial distribution of lava in the region. Note that the techniques I used for looking at lava can also be applied to the spatial distribution of water in the field area.

Figure 1: The style of active faulting in the Pit River study area (delineated roughly in green). N-S trending faults accommodate dip slip (up-down) motion, while NW-SE trending faults accommodate oblique slip motion (up-down and side to side). Structurally controlled volcanoes often lie at the elbows of intersection between the N-S and NW-SE trending faults. Modified from Blakely et al. 1997.

The northern California field area lies at the intersection of the Walker Lane fault zone, the Basin and Range, and the Cascadia subduction zone. For the purpose of this examination I will focus on the relationship between the spatial distribution of Walker Lane style faulting (fig. 1) and the spatial distributions of the depth to first water and the depth to first lava in the region.

The topography in the region reflects the interplay between active Walker Lane style faulting and the higher-frequency volcanism that mutes it. For example, in the northern portion of the field area at Medicine Lake Volcano, whose flanks are covered by volcanic deposits younger than 10 kyr (fig. 2), active faults disappear beneath the volcanic edifice. They do not appear in the topography, but they exist in the subsurface. However, where faults are well expressed, enough displacement has occurred to 1) be resistant to burial by periodic lavas, and 2) either serve as conduits for lava flow or displace the lava itself.

The Data:

Figure 2: Map of the Field area with active faults (pink lines), the Pit River (blue line), rough locations of ground water discharge (yellow circles), young lava flows (green outlines) and all of the available well data (blue dots).

The subset of the data used in figure 2 is depicted in figure 3. Temporally, the records range from wells dug in the 1950s to the present. It is important to note that depths to first water might have changed in the 60 years since the older well logs were recorded. For this reason I did not use many of the older well logs in my analyses.

Figure 3: The center of the township and range sections for the wells used and corresponding depths to first lava. Purple logs are the shallowest and Blue logs are the deepest.



Figure 3: Conceptual design for the project. A) Configuration before slip upon the fault. The lava flow is continuous. B) If a normal fault displaces a lava flow, then one side will go down with respect to the other. The lower block experiences deposition, and the development of a soil profile. If enough time has passed, then there will be a difference between the depth to first lava between the upper and lower block.

I predict that faults will run parallel to and coincide with large step changes in the depth to first lava.

If there are no large step changes in depth to first lava, I predict that these changes will be driven by changes in elevation and correspond to channelized lava flows rather than faults.


I used the threshold (indicator) kriging function in Arc to make a surface corresponding to the likelihood that the depth to first lava is less than 30 feet. The kriged surface interpolates, at every other point on the map, the probability that the depth to first lava is less than 30 feet. Where the surface value is close to 1, the depth is likely less than 30 ft; where it is close to 0, the depth is likely greater than 30 ft.
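The indicator transform that underlies this kriging is simple; the kriging weights are the hard part. Below is a sketch that uses inverse-distance weights as a stand-in for the kriging weights, so it illustrates only the probability-surface idea, not kriging proper (all names and data are hypothetical):

```python
import math

def indicator(depths, threshold=30.0):
    """Indicator transform: 1 if depth to first lava is < threshold ft."""
    return [1.0 if d < threshold else 0.0 for d in depths]

def prob_below(coords, depths, target, threshold=30.0, power=2.0):
    """Surface value at `target`: a weighted average of the well
    indicators. Inverse-distance weights stand in for kriging weights."""
    ind = indicator(depths, threshold)
    num = den = 0.0
    for c, i in zip(coords, ind):
        d = math.dist(c, target)
        if d == 0.0:
            return i  # exactly at a well: use its indicator
        w = d ** -power
        num += w * i
        den += w
    return num / den

# Near a shallow-lava well the surface approaches 1
print(round(prob_below([(0, 0), (4, 0)], [10.0, 50.0], target=(1, 0)), 3))  # 0.9
```

Because the interpolated value is a weighted average of 0s and 1s, it always stays between 0 and 1 and reads directly as a probability-style surface.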

I then performed a confusion matrix analysis to see how my results compared to what I expected. This part I did by eye.


Figure 4: Kriged Surface with faults superimposed.



                High-Low   Low-Low or High-High
Parallel          257            315
Perpendicular     213              5

Table 1: Corresponding confusion matrix. There were 257 High-Low values that corresponded to parallel faults, but there were 315 Low-Low or High-High values that also corresponded to parallel faults. See discussion below.

Figure 5: Kriged surface with contours.



257 out of 790 of the faults were gradient parallel, i.e., they corresponded to a step change in the kriged depth to first lava. However, many more did not. What is happening here? If you look at figure 5, you can see that some of the high gradients correspond to rapid changes in elevation. This block is the footwall of a range-bounding normal fault. That fault is not mapped because it is no longer active, but it has accommodated large amounts of uplift and likely corresponds to a different lithology entirely, and so does not fall into our hypothesis.

If I do this analysis for every lava layer I find, and can correctly link the well logs together, then this analysis could help me constrain slip upon the faults in the region. This information is good for hazards assessment in the region as PG&E has a hydroelectric dam in the area.


Learning outcomes:

Throughout the course of this class, I learned a few of the nuances of R, though my work here has just begun. I also feel like I have a better understanding of how to think about spatial processes, which is why I enrolled in the first place. Future goals include better locating my data in space. This way, many of the techniques I used in this class will pick up on actual processes, and not on the gridded spacing of my data.



Blakely, Richard et al. “Gravity Anomalies, Quaternary Vents, and Quaternary Faults in the Southern Cascade Range, Oregon and California: Implications for Arc and Backarc Evolution.” Journal of Geophysical Research: Solid Earth, vol. 102, no. B10, 1997, pp. 22513–22527.


Final Project: Western Hemlock Dwarf Mistletoe Spatial Patterns and Drivers

Western Hemlock Dwarf Mistletoe

Western hemlock dwarf mistletoe is a hemi-parasite primarily of western hemlock trees. It absorbs water, nutrients, and carbohydrates from its host. Infected branches produce small structures called aerial shoots that have minor photosynthetic capabilities but exist primarily for pollination and seed production, and they require a certain amount of light to emerge from an infection site. Seeds are explosively discharged from aerial shoots when fully mature. Once landing on a susceptible host branch, they begin germination and mechanical penetration of the host branch.

Research Question

I wanted to explore the spatial patterns of dwarf mistletoe infections after a mixed-severity fire and the roles post-fire forest structure and fire refugia play.

How does the spatial pattern of post-fire stand structure and species composition affect spatial patterns of dwarf mistletoe distribution throughout the stand through physical barriers to susceptible hosts and seed dispersal?

Description of Dataset

I have a 2.2 ha rectangular study area (Wolf Rock), just northwest of the HJ Andrews Experimental Forest, that was stem mapped in 1992. Each tree has X and Y coordinates, as well as the tree species, height, diameter, and a variety of other tree-inventory-related data. I only have ages for a few western hemlocks, but many more for Douglas-fir. Only one western hemlock core was identified as being over 100 years old, at 170 years. I have a polygon layer for the fire refugia documented on the study area. For the western hemlocks I have presence/absence of western hemlock dwarf mistletoe as well as a severity measure; however, I am unsure of the scale and method of rating, so I will not be using it. The infection presence and absence data are from a single measurement season.

Hypotheses and Predictions

Western hemlock dwarf mistletoe spreads easily through moderate canopy densities, where distances between trees are close to the average dispersal distance of 2-4 meters. Disturbances that create patchy gaps increase likelihood of spread because of increased light reaching infected branches and increase the rate of infection abundance because of the lack of physical barriers such as high densities of branches and foliage.

Disturbances can also remove the disease from a forested stand by killing or knocking down infected western hemlocks. Post-disturbance regeneration can enable or inhibit the parasite’s reintroduction to the stand. Non-susceptible hosts such as Douglas-fir or western redcedar regenerate readily alongside western hemlock and will intercept seeds. Some gaps are only conducive to western hemlock regeneration which will be readily infected by any surviving infected western hemlocks post-disturbance.

The Wolf Rock stand experienced two fires that created a mosaic of ~110 year old regeneration and >110 year old remnant trees in fire refugia. Remnant, infected western hemlocks survived the most recent fire in those fire refugia. From this stand structure, I have several hypotheses and predictions:

  1. Remnant, infected western hemlocks form the centers of new infection centers post-disturbance, so infections in the regenerating susceptible hosts will be clustered around these remnant trees.
  2. Non-susceptible hosts regulate the rate of infection spread through physical barriers to dispersal, so infection cluster size will have an inverse relationship with non-host density.
  3. Western hemlock infection spreads from a central remnant tree, so uninfected western hemlocks will have a dispersed spatial pattern relative to the remnant western hemlock, regardless of non-host density.
  4. Post-fire regeneration with higher western hemlock composition will have more susceptible hosts and fewer physical barriers to spread, so infection cluster size will have a positive relationship with western hemlock composition.

Analysis Approaches

For Exercise 1,  I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area. I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees was related to the spatial pattern of fire refugia outlined in my study site. I used geographically weighted regression to determine the significance of this relationship; however, I did not find a significant relationship between a western hemlock’s intensity of clustering and infection status and its distance to the nearest fire refugia polygon edge.

For Exercise 3, I wanted to know how the spatial patterns of remnant western hemlocks related to the spatial patterns of regenerating western hemlocks, uninfected western hemlocks, and non-western-hemlock tree species. I used the Kcross and Jcross functions in the spatstat package in R, preparing the data in ArcMap, to analyze spatial relationships between trees. I found clustering of regenerating western hemlocks and non-hosts around remnant western hemlocks, but the uninfected western hemlocks’ spatial pattern was independent of the remnant western hemlocks.
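The quantity spatstat’s Kcross estimates can be sketched in Python without edge correction (a real estimate needs the edge corrections spatstat applies; the points below are toy data):

```python
import math

def kcross_naive(pts_i, pts_j, r, area):
    """Naive cross-K estimate: average number of type-j points within
    distance r of a type-i point, divided by the type-j intensity.
    Under independence this is about pi * r**2 (no edge correction)."""
    n_i = len(pts_i)
    lam_j = len(pts_j) / area
    count = sum(1 for a in pts_i for b in pts_j if math.dist(a, b) <= r)
    return count / (n_i * lam_j)

# One remnant tree (type i), two regenerating trees (type j):
# only the tree at distance 1 falls within r = 1.5
print(kcross_naive([(0, 0)], [(0, 1), (3, 0)], r=1.5, area=10.0))  # 5.0
```

Values of Kcross(r) above pi * r**2 indicate clustering of type-j points around type-i points at scale r (as found for regenerating hemlocks and non-hosts around remnants), while values near pi * r**2 indicate independence (as found for the uninfected hemlocks).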


I produced several maps that showed the spatial patterns in my study site, which were helpful for understanding and investigating further relationships. I produced several charts from my Exercise 3 analysis that were useful as visual representations of the relationships between trees. In Exercise 3 I also produced a map from raster addition that gave me the best visualization of where western hemlock and non-host trees were in the stand. Exercise 2 produced a map and a statistical relationship, but the relationship was not significant in explaining a hemlock’s infection and density status.


The biggest finding was that the fire refugia polygons were not significant in my analysis; the remnant infected hemlocks were more important explanatory variables for the spatial patterns of infected trees. This supported hypothesis 1. Because refugia can be effectively defined using the “for what, from what” framework, western hemlock dwarf mistletoe refugia from fire could be delineated differently in the field, focusing only on the remnant western hemlocks.

Data were not available to determine the rates of infection spread over time because I only had one season of measurements. I also could not evaluate the size of clusters because I did not have GPS points of infection center extent, so I could not assess hypotheses 2 and 4 directly. However, using Ripley’s K and its cross-type variants, I could see how the clustering changed over distance. I learned that infected, regenerating trees are going to be found closer to the remnant infected trees, and that non-host trees may be blocking the spread of mistletoe into an uninfected patch because they were found clustered around remnant trees as well. This provides support for hypotheses 1, 2, and 3.

Silvicultural prescriptions that aim to preserve old-growth forest structure but want to limit the amount of dwarf mistletoe in a forest can appropriately remove old infected hemlocks to limit infection spread and extent. These prescriptions will also be able to predict future dwarf mistletoe spread. Forest operations that simulate disturbances leaving remnant hemlocks, such as harvests, can incorporate spread predictions to limit infection of regeneration.

Learning From The Process

I learned a lot about the spatial analyst tools in ArcMap and how to produce easily interpretable maps and graphs. I also learned how to use several functions in spatstat. I learned a lot about interpreting R outputs and spatial statistics. Spatial autocorrelation can tell you a lot about what your data are doing, but I thought it was most useful to be able to see on a map or chart what is specifically happening.

Final project: Washing out the (black) stain and ringing out the details


In order to explain my project, especially my hypotheses, some background information about this disease is necessary. Black stain root disease of Douglas-fir is caused by the fungus Leptographium wageneri. It infects the roots of its Douglas-fir host, growing in the xylem and cutting the tree off from water. It spreads between adjacent trees via growth through root contacts and grafts, and over long distances via insects (vectors) that feed and breed in roots and stumps and carry fungal spores to new hosts.

Forest management practices influence the spread of disease through their influence on (i) the distance between trees (determined by natural or planted tree densities); (ii) the adjacency of susceptible species (as in single-species Douglas-fir plantations); (iii) road, thinning, and harvest disturbance, which create suitable habitat for insect vectors (stumps, dead trees) and stress remaining live trees, attracting insect vectors; and (iv) forest age distributions, because rotation lengths determine the age structure in managed forest landscapes, and younger trees (<30-40 years old) are thought to be more susceptible to infection and mortality by the disease.


How do (B) spatial patterns of forest management practices relate to (A) spatial patterns of black stain root disease (BSRD) infection probabilities at the stand and landscape scale via (C) the spatial configuration and connectivity of susceptible stands to infection?

In order to address my research questions, I built a spatial model to simulate BSRD spread in forest landscapes using the agent-based modeling software NetLogo (Wilensky 1999). I used Exercises 1-3 to focus on the spatial patterns of forest management classes. Landscapes were equivalent in terms of the proportion of each management class and number of stands, varying only in spatial pattern of management classes. In the exercises, I evaluated the relationship between management and disease by simulating disease spread in landscapes with two distinct spatial patterns of management:

  • Clustered landscape: The landscape was evenly divided into three blocks, one for each management class. Each block was evenly divided into stands.
  • Random landscape: The landscape was evenly divided into stands, and forest management classes were randomly assigned to each stand.


I analyzed outputs of my spatial model. The raster files contain the states of cells in forest landscapes at a given time step during one model run. States tracked include management class, stand ID number, presence/absence of trees, tree age, probability of infection, and infection status (infected/not infected). Management class and stand ID did not change during the model run. I analyzed tree states from the last step of the model run and did not analyze change over time.

Extent: ~20 hectares (much smaller than my full model runs will be)

Spatial resolution: ~1.524 x 1.524 m cells (maximum 1 tree per cell)

Three contrasting and realistic forest management classes for the Pacific Northwest were present in the landscapes analyzed:

  • Intensive – Active management: 15-foot spacing, no thinning, harvest at 37 years.
  • Extensive – Active management: 10-foot spacing, one pre-commercial and two commercial thinnings, harvest at 80 years.
  • Set-aside/old-growth (OG) – No active management: Forest with Douglas-fir in Pacific Northwest old-growth densities and age distributions and uneven spacing with no thinning or harvest.


Because forest management practices influence the spread of disease as described in the “Background” section above, I hypothesized that the spatial patterns of forest management practices would influence the spatial pattern of disease. Having changed my experimental design and learned about spatial statistics and analysis methods throughout the course, I hypothesize that…

  • The “clustered” landscape will have (i) higher absolute values of infection probabilities, (ii) higher spatial autocorrelation in infection probabilities, and (iii) larger infection centers (“hotspots” of infection probabilities) than the “random” landscape because clustering of similarly managed forest stands creates continuous, connected areas of forest managed in a manner that creates suitable vector and pathogen habitat and facilitates the spread of disease (higher planting densities, lower age, frequent thinning and harvest disturbance in the intensive and extensive management). I therefore predict that:
    • Intensive and extensive stands will have the highest infection probabilities with large infection centers (“hotspots”) that extend beyond stand boundaries.
      • Spatial autocorrelation will therefore be higher and exhibit a lower rate of decrease with increasing distance because there will be larger clusters of high and low infection probabilities when stands with similar management are clustered.
    • Set-aside (old-growth, OG) stands will have the lowest infection probabilities, with small infection centers that may or may not extend beyond stand boundaries.
      • Where old-growth stands are in contact with intensive or extensive stands, neighborhood effects (and edge effects) will increase infection probabilities in those OG stands.
    • In contrast, the “random” landscape will have (i) lower absolute values of infection probabilities, (ii) less spatial autocorrelation in infection probabilities, and (iii) smaller infection centers than the “clustered” landscape. This is because the random landscape will have less continuity and connectivity between similarly managed forest stands; stands with management that facilitates disease spread will be less connected and stands with management that does not facilitate the spread of disease will also be less connected. I would predict that:
      • Intensive and extensive stands will still have the highest infection probabilities, but that the spread of infection will be limited at the boundaries with low-susceptibility old-growth stands.
        • Because of the boundaries created by the spatial arrangement of low-susceptibility old-growth stands, clusters of similar infection probabilities will be smaller and values of spatial autocorrelation will be lower and decrease more rapidly with increasing lag distance. At the same time, old-growth stands may have higher infection probabilities in the random landscape than in the clustered landscape because they would be more likely to be in contact with high-susceptibility intensive and extensive stands.
      • I also hypothesize that each stand’s neighborhood and spatial position relative to stands of similar or different management will influence that stand’s infection probabilities because of the difference in spread rates between management classes and the level of connectivity to high- and low-susceptibility stands based on the spatial distribution of management classes.
        • Stands with a large proportion of high-susceptibility neighboring stands (e.g., extensive or intensive management) will have higher infection probabilities than similarly managed stands with a small proportion of high-susceptibility neighboring stands.
        • High infection probabilities will be concentrated in intensive and extensive stands that have high levels of connectivity within their management class networks because high connectivity will allow for the rapid spread of the disease to those stands. In other words, the more connected you are to high-susceptibility stands, the higher your probability of infection.


Ex. 1: Correlogram, Global Moran’s I statistic

In order to test whether similar infection probability values were spatially clustered, I used the raster package in R (Hijmans 2019) to calculate the global Moran’s I statistic at multiple lag distances for both of the landscape patterns. I then plotted global Moran’s I vs. distance to create a correlogram and compared my results between landscapes.

Ex. 2: Hotspot analyses (ArcMap), Neighborhood analyses (ArcMap)

First, I performed a non-spatial analysis comparing infection probabilities between (i) landscape patterns (ii) management classes, and (iii) management classes in each of the landscapes. Then, I used the Hotspot Analysis (Getis-Ord Gi*) tool in ArcMap to identify statistically significant hot- and cold-spots of high and low infection probabilities, respectively. I selected points within hot and cold spots and used the Multiple Ring Buffer tool in ArcMap to create distance rings, which I intersected with the management classes to perform a neighborhood analysis. This neighborhood analysis revealed how the proportion of each management class changed with increasing distance from hotspots in order to test whether the management “neighborhood” of trees influenced their probability of infection.
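The multiple-ring-buffer neighborhood analysis amounts to tallying the proportion of each management class in successive distance rings around a hot spot point; a minimal sketch of that tally (the cell data below are hypothetical, not the model output):

```python
import math

def ring_proportions(center, cells, radii):
    """Proportion of each management class in successive distance rings
    around a hotspot point. `cells` is a list of ((x, y), class) pairs
    and `radii` the increasing outer radii of the rings."""
    out, lo = [], 0.0
    for hi in radii:
        ring = [cls for (xy, cls) in cells
                if lo < math.dist(xy, center) <= hi]
        total = len(ring)
        props = ({c: ring.count(c) / total for c in set(ring)}
                 if total else {})
        out.append((hi, props))
        lo = hi
    return out

cells = [((1, 0), "intensive"), ((0, 1), "extensive"), ((3, 0), "set-aside")]
print(ring_proportions((0, 0), cells, radii=[2.0, 4.0]))
```

Plotting these proportions against ring distance shows whether hot spots tend to sit in neighborhoods dominated by particular management classes.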

Ex. 3: Network and landscape connectivity analyses (Conefor)

I divided my landscape into three separate stand networks based on their management class. Then, I used the free landscape connectivity software Conefor (Saura and Torné 2009) to analyze the connectivity of each stand based on its position within and role in connecting the network using the Integrative Index of Connectivity (Saura and Rubio 2010). I then assessed the relationship between the connectivity of each stand and infection probabilities of trees within that stand using various summary statistics (e.g., mean, median) to test whether connectivity was related to infection probability.


As my model had not been parameterized by the beginning of this term, I analyzed “dummy” data, where infection spread probabilities were calculated as a decreasing linear function of distance from infected trees. However, the results I produced still provided insights as to the general functioning of the model and factors that will likely influence my results in the full, parameterized model.

I produced both maps and numerical/statistical relationships that describe the patterns of “A” (infection probabilities), the relationship between “A” and “B” (forest management classes), and how/whether “A” and “B” are related via “C” (landscape connectivity and stand networks).

In Exercise 1, I found evidence to support my hypothesis of spatial autocorrelation at small scales in both landscapes and higher autocorrelation and slower decay with distance in the clustered landscape than the random landscape. This was expected because the design of the model calculated probability of infection for each tree as a function of distance from infected trees.

In Exercises 2 and 3, I found little to no evidence to support the hypothesis that either connectivity or neighboring stand management had significant influence on infection probabilities. Because the model that produced the “dummy” data limited infection to ~35 meters from infected trees and harvest and thinning attraction had not been integrated into infection calculations, this result was not surprising. In my full model where spread via insect vectors could span >1,000 m, I expect to see a larger influence of connectivity and neighborhood on infection probabilities.

A critical component of model testing is exploring the “parameter space” by considering a range of possible values for each parameter. This is especially important for agent-based models, where complex interactions between many individuals produce larger-scale patterns that may be emergent and not fully predictable from the simple sum of the parts. In my model, the disease parameters of interest are the factors influencing probability of infection (Fig. 1). In order to understand how the model reacts to changes in those parameters, I will perform a sensitivity analysis, systematically adjusting parameter values one by one and comparing the results of each series of model runs under each set of parameter values.

Fig. 1. Two of the model parameters that will be systematically adjusted during sensitivity analysis. Tree susceptibility to infection as a function of age (left) and probability of root contact as a function of distance (right) will both likely influence model behavior and the relative levels of infection probability between the three management classes.
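A one-at-a-time sensitivity sweep of this kind can be sketched as follows. The infection function, parameter names, and values here are simplified stand-ins for the model's actual parameterization, chosen only to illustrate the procedure:

```python
import numpy as np

def infection_prob(dist, age, decay=0.1, age_midpoint=20.0):
    """Toy stand-in for the model's infection calculation:
    root-contact probability decays exponentially with distance,
    and susceptibility declines sigmoidally with age."""
    p_contact = np.exp(-decay * dist)
    p_susceptible = 1.0 / (1.0 + np.exp(0.2 * (age - age_midpoint)))
    return p_contact * p_susceptible

# One-at-a-time sweep: vary the distance-decay rate, hold age fixed
dists = np.linspace(0, 50, 101)
results = {}
for decay in [0.05, 0.1, 0.2, 0.4]:
    # mean infection probability across distances for this parameter value
    results[decay] = infection_prob(dists, age=15.0, decay=decay).mean()
```

In the real sensitivity analysis, each parameter set would drive a series of full model runs rather than a closed-form function, and the comparison would be made across run outputs.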

This is especially relevant given that in Exercises 1 through 3, I found that the extensively managed plantations had the highest infection probabilities and most of the infection hotspots, likely because this management class has the highest initial density of trees. For the complete model, I hypothesize that the intensive plantations will have the highest infection probabilities because of the high frequency of insect-attracting harvests and short rotations that keep trees in an age class highly susceptible to infection. In the full model, the extensive plantations will have higher initial density than the intensive plantations but will undergo multiple thinnings, decreasing tree density while attracting vectors, and will be harvested at age 80, thus allowing trees to grow into a less susceptible age class. In this final model, thinning, harvest length, and vector attraction will all factor into the calculation of infection probabilities.

My analysis made it clear that even a 1.5 meter difference in spacing resulted in a statistically significant difference in disease transmission, with much higher disease spread in the denser forest. Because the model is highly sensitive to tree spacing, likely because its distance-related parameters drop off in sigmoidal or exponential decay patterns, I hypothesize that changes in the values of parameters that influence the spatial spread of disease (i.e., insect dispersal distance, probability of root contact with distance) and the magnitude of vector attraction after harvest and thinning will determine whether the “extensive” or “intensive” forest management class ultimately has the highest levels of infection probability. In addition, the rate of decay of root contact and insect dispersal probabilities will determine whether management and infection within stands influence infection in neighboring stands, as well as the distance and strength of those neighborhood effects. I would like to test this by performing such analyses on the outputs of my sensitivity analyses.


Ultimately, the significance of this research is to understand the potential threat of black stain root disease in the Pacific Northwest and to inform management practices by identifying evidence-based, landscape-scale strategies that could mitigate BSRD issues. While the results of Exercises 1-3 were interesting, they were produced using a model that had not been fully parameterized and thus are not representative of the likely actual model outcomes; therefore, I was not able to test my hypotheses. That said, this course allowed me to design and develop an analysis to test them, and the exercises I completed provided a deeper understanding of how my model works. Through this process, I have begun to generate additional testable hypotheses regarding model sensitivity to parameters and the relative spread rates of infection in each of the forest management classes. Another key takeaway is the importance of producing many runs with the same landscape configuration and parameter settings to account for stochastic processes in the model. By analyzing only one run per scenario, there is a chance that the results are not representative of the average behavior of the system or the full range of behaviors possible for that scenario. For example, with the random landscape configuration, one generated landscape can be highly connected and the next highly fragmented with respect to intensive plantations, and only a series of runs under the same conditions would provide reliable results for interpretation.


(a, b) Arc-Info, Modelbuilder and/or GIS programming in Python

This was my first opportunity to perform statistical analysis in ArcGIS, and I used multiple new tools, including hotspot analysis, multiple ring buffers, and extensions. Though I did not use Python or Modelbuilder, I realized that doing so will be critical for automating my analyses given the large number of model runs I will be analyzing. While I learned how to program in Python using arcpy in GEOG 562, I used this course to choose the appropriate tools and analyses for my questions and hypotheses rather than automating procedures I may not use again. I would now like to implement my neighborhood-analysis procedures in Python in order to automate and increase the efficiency of my workflow.

(c) Spatial analysis in R

During this course, I learned the most about spatial data manipulation in R, since I had limited experience using R with spatial data beforehand. I used R for spatial statistics, data cleaning and management, and conversion between vector and raster data. I also learned about the limitations of R (and my personal limitations): learning to use packages and their functions is challenging when documentation is variable in quality and a wide variety of user-generated packages are available with little reference as to their quality and reliability. For example, for Exercise 2, I had trouble finding an up-to-date, high-quality package for hotspot analysis in R, with raster data or otherwise, whereas this method was straightforward in ArcMap once the data were converted from raster to points. For Exercise 1, the only Moran’s I calculation I was able to perform with my raster data was the “moran” function in the raster package, which does not provide z- or p-values to evaluate the statistical significance of the calculated Moran’s I and requires you to generate your own weights matrices, which is cumbersome. Using the spdep or ncf packages for this analysis was incredibly slow (though I am not sure why), and the learning curve for spatstat was too steep for me to overcome by the Exercise 1 deadline (though I hope to return to this package in the future).

Reading, manipulating, and converting data: With some trial and error and research into the packages available for working with spatial data in R (especially raster, sp/spdep, and sf), I learned how to quickly and easily convert data between raster and shapefile formats, which was very useful in automating the cleaning and preparation for multiple datasets and creating the inputs for the analyses I want to perform.

(d) Landscape connectivity analyses: I learned that there are a wide variety of metrics available through Fragstats (and the landscapemetrics and landscapetools packages in R); however, I was not able to perform my desired stand-scale analysis of connectivity because I could not determine whether it is possible to treat contiguous stands of the same management class as separate patches (Fragstats considers all contiguous cells in the raster with the same class to be part of the same patch). Instead, I used Conefor, which has an ArcMap extension that generates node and connection files from a polygon shapefile, to calculate relatively few but robust and ecologically meaningful connectivity metrics for the stands in my landscape.


Correlograms and Moran’s I: For this statistical method, I learned the importance of choosing meaningful lag distances based on the data being analyzed and the process being examined. For example, my correlogram consists of a lot of “noise,” with many peaks and troughs due to the empty cells between trees, but I also captured data at the relevant distances. Failure to choose appropriate lag distances means that some autocorrelation could be missed, yet analyzing large raster images at a high resolution of lag distances results in very slow processing. In addition, I wanted to compare local vs. global Moran’s I to determine whether infections were confined to certain portions of the landscape or spread throughout it, but the function for local Moran’s I returned values far outside the -1 to 1 range of the global Moran’s I, and I did not understand how to interpret or compare these values. Finally, global Moran’s I did not tell me where spatial autocorrelation was happening, but the fact that there was spatial autocorrelation led me to perform a…

Hotspot analysis (Getis-Ord Gi*): It became clear that a deep conceptual understanding of the hypothesized spatial relationships and processes in the data, along with a clear hypothesis, is critical for hotspot analysis. I performed multiple analyses with different distance weightings to compare the results, and there was large variation between methods in both the number of points included in hot and cold spots and the landscape area covered by those spots. I ultimately chose inverse squared distance weighting based on my understanding of root transmission and vector dispersal probabilities and because this weighting was the most conservative (it produced the smallest hotspots). The confidence level chosen also resulted in large variation in the size of hotspots. After confirming that there was spatial autocorrelation in infection probabilities, this method helped me understand where in the landscape these patterns were occurring and thus how they related to management practices.

Neighborhood analysis: I did not find that this method provided much insight in my case, not because of the method itself but because of my data (it simply confirmed the landscape pattern that I had designed, clustered vs. random) and my approach (one hotspot and one coldspot point non-randomly selected in each landscape). I also found this method tedious in ArcMap, though I would like to automate it; I later learned about the Zonal Statistics tool, which could make this analysis more efficient. In general, it was not clear to me what statistics I could have used to confirm whether results were significantly different between landscapes.

Network/landscape connectivity analyses: I learned that there are a wide variety of tools, programs, and metrics available for these types of analyses. I found the Integrative Index of Connectivity (implemented in Conefor) particularly interesting because of the way it categorizes habitat patches based on multiple attributes in addition to their spatial and topological positions in the landscape. The documentation for this metric is thorough, its ecological significance has been supported in peer-reviewed publications (Saura and Rubio 2010), and it is relatively easy to interpret. In contrast, I found the number of metrics available in Fragstats to be overwhelming especially during the data exploration phase.


Hijmans, R.J. 2019. raster: Geographic Data Analysis and Modeling. R package version 2.8-19.

Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Wilensky, U. 1999. NetLogo. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Final Project: Examining the relationship between rurality and colorectal cancer mortality in Texas counties

1. The research question that you asked.

My initial research question was about the creation and comparison of rural indices for Texas. I did end up making a rural index for Texas, but instead of creating multiple indices and comparing the importance of different rural indicators, I incorporated an outcome variable, colorectal cancer (CRC) mortality, for Exercises 2 and 3. My final research question was therefore: how does the spatial pattern of CRC mortality in Texas relate to proxy measures of rurality and a constructed rural index for counties in Texas?


2. A description of the dataset you examined, with spatial and temporal resolution and extent.

For my project, I utilized Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


3. Hypotheses: predictions of patterns and processes you looked for.

Based on the research question I revised before Exercise 2, I expected two major spatial patterns in my analyses: 1) areas immediately surrounding high-CRC-mortality counties would have more “rural” values in the 3 rural indicator variables (household income, population density, and percent land developed), with more “urban” values as distance from those counties increases; and 2) CRC mortality counts would significantly increase as county rural index values (built from the 3 indicator variables) became more rural. I chose rural indicator variables based both on true measures of where population centers in Texas are located (population density and land development) and a well-established marker of rural status (income). For spatial pattern 1, from my neighborhood analysis I expected uniform shifts in buffer means toward more “urban” values in each of the 3 indicator variables as distance increased away from the high-CRC counties. I expected this effect because counties with high CRC mortality are commonly rural, so surrounding counties with lower CRC mortality may show increasingly urban indicator values. For spatial pattern 2, I expected CRC mortality to increase as county rural index values became more rural in Texas. I expected this pattern because CRC mortality has been linked to rural status and various indicator variables in previous research, though a weighted rural index has never been used for Texas cancer data.


4. Approaches: analysis approaches you used.

For my analyses I used the following approaches:

Exercise 1: I utilized principal component analysis (PCA) to construct a rural index for Texas counties using 3 indicator variables.

Exercise 2: I did a neighborhood analysis of high CRC mortality counties by creating multi-ring buffers around centroids of 4 Texas counties, intersected the buffers with county polygons containing rural indicator variable values, and calculated buffer means for indicator variables for each donut around each county.

Exercise 3: I completed a Poisson regression of CRC mortality counts (Y) and my constructed rural index (X) to examine the effect of rural status on CRC mortality for Texas counties.


5. Results: what did you produce — maps? statistical relationships? other?

For all 3 exercises, my results included maps (Texas county maps, buffer maps, standardized residual maps), statistical relationships (p-values, coefficients, etc.), and plotting of statistical relationships (biplots, influence plots, line plots). I produced the maps in Arc Pro, while all other visualizations were produced in R.


6. What did you learn from your results? How are these results important to science? to resource managers? + software/technical limitations

I believe exercise 1 displays the effectiveness of PCA for construction of rural indices. More deductive methods for rural classification are very much needed in rural health, and I believe this method could improve detection and prevention of rural health disparities.
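To illustrate the index-construction step (the actual analysis was done in R on real county data), the following Python sketch standardizes three simulated county indicators driven by one latent urban-rural axis and takes the first principal component as the rural index:

```python
import numpy as np

# Simulated indicators for n counties, all driven by one latent
# urban-rural axis (stand-ins for income, density, % developed)
rng = np.random.default_rng(42)
n = 254                                  # Texas has 254 counties
latent = rng.normal(size=n)              # unobserved urban-rural axis
X = np.column_stack([latent + rng.normal(0, 0.4, n) for _ in range(3)])

# Standardise each indicator, then PCA via the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / n)  # ascending eigenvalues
pc1 = eigvecs[:, -1]                     # loadings of the first component
rural_index = Z @ pc1                    # one index score per county
explained = eigvals[-1] / eigvals.sum()  # share of variance in PC1
```

Because the sign of a principal component is arbitrary, the index may need to be flipped so that higher values mean "more rural."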

In exercise 2, the 4 high CRC mortality counties did not all follow the expected rural indicator spatial “gradient” that I expected. Two of the counties exhibited increasing urban scores as distance increased away from the counties, while the other two showed the opposite pattern. I think this result could be due to the arbitrary distances I chose for the buffers around the counties and the modifiable areal unit problem of utilizing county indicator variable data instead of more spatially defined values. Also, significant regionality in Texas could exist for the indicator variables I chose, as the 4 counties were not located in the same regions of Texas. This could have affected the relationships I found in each of the counties.  For example, certain rural regions in Texas may have lower or higher household income than other rural regions of Texas due to factors such as available jobs or regional/local governmental policies. Also, there are likely other variables that are indicators of rurality that I could have included in the analysis that may have more consistent spatial patterns.

In exercise 3, the results from the Poisson regression followed the statistical pattern I expected: as county CRC mortality increases, rural index scores become more rural. I believe my results show important introductory associations between CRC mortality and rurality in Texas that indicate further and deeper study into these associations should be considered. To show the exercise 3 results, I utilized a map of standardized residuals in Arc and statistical plots in R.
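The regression step can be sketched as below. This is an illustrative Poisson GLM fit by iteratively reweighted least squares on simulated counts, not the actual Texas data; in practice R's glm(..., family = poisson) does the same job.

```python
import numpy as np

def poisson_fit(x, y, iters=25):
    """Poisson GLM with log link, fit by iteratively
    reweighted least squares (equivalent to Newton's method)."""
    X = np.column_stack([np.ones(len(y)), x])   # intercept + predictor
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)                   # fitted mean counts
        z = X @ beta + (y - mu) / mu            # working response
        # weighted least squares with weights mu
        beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
    return beta

# Simulated counties: mortality counts rise with the rural index
rng = np.random.default_rng(7)
rural_index = rng.normal(size=254)
y = rng.poisson(np.exp(1.0 + 0.3 * rural_index))
beta = poisson_fit(rural_index, y)  # beta[1] estimates the true 0.3 slope
```

A positive fitted slope corresponds to the pattern found in Exercise 3: more rural counties have higher expected mortality counts.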

The technical limitations of my analyses were mainly due to extensive missing cancer data for the state of Texas. This missingness was due to data suppression by the CDC’s National Vital Statistics System in order to maintain the confidentiality of cancer cases. I did not have any obvious software issues besides difficulty with the Calculate Field tool in Arc Pro, where the tool would consistently fail when using more complicated formulas for data transformations. I also had some problems when joining Excel tables to attribute tables, where Arc would not show symbology for the newly joined data until I exported to a geodatabase and restarted the program.


7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

I believe I greatly improved my skills in ArcGIS this term, especially in data wrangling and cleaning to convert my data to formats that maximize the visualization. Also, I feel I became much better at using key plotting packages in R, such as ggplot2.


8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I believe the major statistical and spatial skills I improved in this course include PCA, neighborhood analysis, Poisson regression, and (especially) regression diagnostics. I had not used PCA in R before this term and feel very comfortable using it going forward in my research on rural health. The many assumptions I had to follow for the Poisson regression in exercise 3 improved my ability to run diagnostics in R and create intuitive assumption plots.


Final project synthesis of parts 1, 2, and 3 on classification of Blackleg on turnip


The objective of this research was to find a way to properly classify diseased versus non-diseased regions on a turnip leaf. To do this, I compared the classification accuracy of segmentation, an unsupervised machine-learning classification algorithm, against manual classification. Additionally, I hoped to reveal some spatial characteristics of the lesions on the leaf for future work.



The cumulative data set is roughly 500 turnip leaves harvested in the Willamette Valley from February to April of 2019. For this project I selected 5 of these images to form the basis of the analysis. The spatial resolution is just under 1 mm, and the images are rasters of single turnip leaves with five bands (blue, green, red, red edge, NIR); for this analysis, only the red band was used. The camera is a MicaSense RedEdge, which has 16-bit radiometric resolution.



These were not all questions I had anticipated answering at the beginning of the term; rather, they developed over the course of the term as I ran into issues or found intriguing results.

• Can diseased patches be successfully identified and isolated by computer image classification and through manual classification?
If yes:
o Can the diseased patches based on image classification and manually classified diseased patches be characterized?
o How accurate is the segmentation classification?
o Are more raster cells being classified as false negatives or false positives when misclassified?
o What is going undetected in segmentation?



My original hypothesis was that the spatial patterns of pixels based on image classification are related to the manually classified spatial patterns of observed disease on turnip leaves because the disease has a characteristic spectral signature on the leaves.


Approaches and results

I began in ArcGIS Pro with the following workflow: segmentation -> mask -> clip -> raster to polygon -> merge -> polygon to raster. This yielded a layer containing only the computer-generated segmented diseased patches, which were exported as Tiff files. I followed a similar process for my manually classified diseased patches but used the Training Samples Manager instead of segmentation. Both Tiff files were uploaded to Fragstats, where I ran a patch analysis between the two to characterize the diseased patches. I did not find this entirely helpful, but a deeper patch analysis could provide worthwhile information about the characteristics of the diseased lesions.

After characterizing the patches, I performed an error analysis using a confusion matrix in R. The two Tiff files were loaded into R, converted to rasters to check quality, and then converted to tables. The tables were imported into Excel, where values were set to 1 for diseased patches and 0 for non-diseased, and anything outside the area of the leaf was deleted. The tables were then imported back into R, and a confusion matrix comparing the computer-classified segmentation with the manually classified patches was computed. All the confusion matrices were combined for the overall accuracy (Table 1). This step was time-consuming, as was the original image processing, but it was a very useful analysis.
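The error analysis can be sketched as follows (the actual workflow ran through R and Excel); here two hypothetical 0/1 pixel vectors stand in for the manual and segmentation classifications:

```python
import numpy as np

# Hypothetical flattened rasters: 1 = diseased pixel, 0 = non-diseased
rng = np.random.default_rng(3)
manual = rng.integers(0, 2, size=10_000)        # manual classification
flip = rng.random(10_000) < 0.05                # 5% disagreement
segmented = np.where(flip, 1 - manual, manual)  # segmentation output

# 2x2 confusion matrix: rows = manual class, columns = segmentation class
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (manual, segmented), 1)

accuracy = np.trace(cm) / cm.sum()
false_pos = cm[0, 1]  # non-diseased pixels called diseased
false_neg = cm[1, 0]  # diseased pixels called non-diseased
```

Comparing false_pos against false_neg answers the question of whether misclassifications lean toward false positives or false negatives.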

One other approach I used to help with visualization was showing the leaf with false positives and false negatives included (Image 1). This was necessary to help explain results more clearly to others and show results outside of the statistics.



I have created a method for getting from image acquisition all the way to statistical results. First, this work is significant to my thesis research: it is something I will refine and streamline, but it is a very good starting point. Second, I obtained statistically significant results based on the p-value and a high overall accuracy. This will be helpful to those in agronomy or crop consulting who are using remote sensing for disease detection. At this point I do not think it is practical for agronomists to apply, but with further work I hope to make this methodology an available resource for other researchers to build on and for farmers to potentially utilize.


Hypothesis revised

After receiving these results and considering my next steps, my hypothesis has been refined. I now hypothesize that segmentation classification can classify diseased patches as well as manual classification can, because of the spectral values associated with diseased pixels.


New research questions

This research is far from complete and has led to a new set of questions to be answered.

• Is percent classification of diseased pixels the best metric for how well the classification method works?
• What is the accuracy of a support vector machine versus manually classified pixels?
• Is this classification translatable to other brassica species such as canola, radish, kale, etc.?


Future research

To answer some of my new questions, I believe it is critical that I learn how to create band composites from the five available bands. Currently, the issue is that the bands do not overlap. My approach to solving this is to use georeferenced points/pixels on each image to ensure they overlap; I think using the 4 corners of each image should give the overlap I want. By creating band composites, I can begin using vegetation indices like NDVI to distinguish pixels more accurately than can be done with one band alone.
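Once the bands are co-registered, the NDVI calculation itself is a single array expression. The sketch below uses hypothetical red and NIR arrays in place of the actual composited imagery:

```python
import numpy as np

# Hypothetical co-registered bands (16-bit values, as from the RedEdge)
rng = np.random.default_rng(5)
red = rng.integers(500, 5_000, size=(64, 64)).astype(float)
nir = rng.integers(5_000, 30_000, size=(64, 64)).astype(float)

# NDVI = (NIR - Red) / (NIR + Red), guarding against zero denominators
denom = nir + red
ndvi = np.where(denom > 0, (nir - red) / denom, 0.0)
```

NDVI ranges from -1 to 1, with healthy vegetation typically well above 0, which is what makes it a candidate for separating diseased from healthy tissue more cleanly than the red band alone.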

Another concern is that, despite classifying pixels accurately, I am wary of this being the sole determining method of classification. In my current classification there are instances where only 2 or 3 of the 6 or so lesions actually have pixels inside their perimeters classified as diseased. I think a more accurate and helpful assessment would quantify the percentage of lesions captured by the segmentation process out of the total number of lesions on the leaf.

Using an unsupervised classification scheme for this work was good for laying the groundwork. In future work I would like to move to supervised machine-learning classification schemes. My reasoning is that, when translating this algorithm from turnip to other brassica species such as canola, radish, and kale, a supervised approach should achieve a higher level of accuracy and consider more spatial characteristics. Currently, my classification scheme uses only the spectral values of the red band, whereas I would like to incorporate variables such as lesion size, proximity to the leaf margin, proximity to other lesions, shape, and NDVI values. This ties into my last question concerning the ability of the classification to be applied to a broader range of plants.


Technical and software limitations

Some of my limitations are apparent in the sections above because they led to my future research questions. One of my greatest limitations was my ability to use ArcGIS Pro effectively and efficiently, as well as to code in R. This meant that image processing took longer than it might have if I were more proficient in ArcGIS Pro. The major roadblocks I hit in ArcGIS Pro were related to the band overlap mentioned earlier and to the support vector machine tool. I have plans to solve the band composite issue but am still unsure about the support vector machine function in ArcGIS Pro; I believe the way it is set up is designed for collecting training data from one image rather than from many images. My alternative for the SVM is to try conducting the analysis in R.

In R I struggled in my ability to code. This was a major strain on my time and required the assistance of others to help troubleshoot on many occasions. This can be overcome with more practice and is not so much a limitation of the software, but the user.


My learning

I would say I learned the most in ArcGIS Pro. The image processing took a lot of time and many steps, which exposed me to a whole set of tools and analyses I was previously unaware of. I do not think my skills in R increased much, but I now have the code and steps for the analysis I would like to run on my future thesis work: the confusion matrix. Another program I gained experience with was Fragstats. It was useful for characterizing the diseased patches and may be an option for some of my future work looking at the size, shape, distance, etc. of these diseased patches.



Table 1. Confusion matrix for pixel classification of five leaves.


Image 1. Example of one leaf analysis, where green cells are manually classified non-diseased pixels, white pixels inside the leaf are segmentation-classified diseased pixels, and black pixels are manually classified diseased pixels.

Final Project: San Diego Bottlenose Dolphin Sighting Distributions


The Research Question:

Originally, I asked the question: do common bottlenose dolphin sighting distances from shore change over time?

However, throughout the research and analysis process, I refined this question for a number of reasons. For example, I had planned on using all of my dolphin sightings from my six different survey locations along the California coastline, but because the bulk of my sightings are from the San Diego survey site, I chose this data set for completeness and feasibility; this data set also used the most standard survey methods. Rather than simply looking at distance from shore, which would be at a very fine scale seeing as all of my sightings are within two kilometers of shore, I chose to try to identify changes in latitude. Furthermore, I wanted to see whether changes in latitude, if present, were somehow related to El Niño Southern Oscillation (ENSO) cycles and to distances to lagoons. This data set also has the largest span of sightings by both year and month. When you see my hypotheses, you will notice that my original research question morphed into much more specific hypotheses.

Data Description:

My dolphin sighting data span 1981–2015 with a few absent years; sightings cover all months, but not in all years sampled. The same transects were performed in a small boat with an approximately two-kilometer sighting span (one kilometer surveyed 90 degrees to starboard and port of the bow). These data points therefore have a resolution of approximately two kilometers. Much of the other data has a coarser spatial resolution, which is why it was important to use such a robust data set. The ENSO data I used gave a broad-brushstroke approach to ENSO indices. Rather than using the exact ENSO index, which is at a fine scale, I used the NOAA database that splits month-years into positive, neutral, and negative indices (1, 0, and -1, respectively). These data were at a month-year temporal resolution, which I matched to the month-year information of my sighting data. Lagoon data were sourced from the mid-to-late 2000s; therefore, I treated lagoon distances as static.
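The month-year matching step described above (a categorical ENSO index joined to each sighting record) could be sketched in Python with pandas; the column names and the values below are hypothetical stand-ins for the NOAA table and the sighting data, not the real records.

```python
import pandas as pd

# Hypothetical sighting records and monthly ENSO categories (1, 0, -1)
sightings = pd.DataFrame({
    "date": pd.to_datetime(["1984-07-12", "1984-07-30", "1998-01-05"]),
    "latitude": [32.80, 32.85, 32.95],
})
enso = pd.DataFrame({
    "year": [1984, 1998],
    "month": [7, 1],
    "enso_index": [0, 1],  # neutral, positive
})

# Match each sighting to its month-year ENSO category
sightings["year"] = sightings["date"].dt.year
sightings["month"] = sightings["date"].dt.month
merged = sightings.merge(enso, on=["year", "month"], how="left")
```

A left join keeps every sighting even when a month-year has no ENSO entry, which makes gaps in the index table easy to spot.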


Hypotheses:

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod scale (usually one to ten individuals) at higher latitudes spanning 1981–2015, due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. I predicted that during warm (positive) ENSO months, the dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expected the spatial gradient of sightings to shift north and south in relation to the ENSO gradient (warm, neutral, or cold).

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes.

Map with bottlenose dolphin sightings on the one-kilometer buffered transect line and the six major lagoons in San Diego.


Approaches:

I utilized multiple approaches and software platforms, including ArcMap, QGIS, Google Earth, and RStudio (with some data cleaning in Excel).

  • Buffers in ArcMap
  • Calculations in an attribute table
  • ANOVA with Tukey HSD
  • Nearest Neighbor averages
  • Cluster analyses
  • Histograms and Bar plots


Results:

I produced several maps and found statistical relationships between sighting distribution patterns, ENSO indices and dolphin sighting latitudes, and distances to lagoons.

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

True: The results of the average nearest neighbor analysis in ArcMap 10.6 produced a z-score of -127.16 with a p-value of < 0.000001, which translates into less than a 1% likelihood that this clustered pattern could be the result of random chance. Although I could not look directly at prey distributions because of data availability, it is well known that schooling fishes exist in clustered distributions that could be related to these dolphin sightings also being clustered. In addition, bottlenose dolphins are highly social, and although pods change in composition of individuals, the dolphins usually transit, feed, and socialize in small groups. Also see Exercise 2 for other relevant preliminary results, including a histogram of the distribution in differences of sighting latitudes.

Summary from the Average Nearest Neighbor calculation in ArcMap 10.6 displaying that bottlenose dolphin sightings in San Diego are highly clustered.
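The Average Nearest Neighbor statistic can also be approximated outside ArcMap. This is a minimal sketch using the standard formulas for the expected distance and standard error under complete spatial randomness; the coordinates and study-area size are hypothetical, not the survey data.

```python
import numpy as np

def average_nearest_neighbor(points, area):
    """Average Nearest Neighbor ratio and z-score (ArcGIS-style formulas).

    points: (n, 2) projected coordinates; area: study-area size in the same
    squared units. A ratio < 1 with a strongly negative z indicates clustering.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # ignore self-distances
    observed = d.min(axis=1).mean()             # mean nearest-neighbor distance
    expected = 0.5 / np.sqrt(n / area)          # expectation under randomness
    se = 0.26136 / np.sqrt(n**2 / area)
    return observed / expected, (observed - expected) / se

# Hypothetical tight cluster of four sightings in a large study area
ratio, z = average_nearest_neighbor(
    [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1)], area=100)
```

Note the result depends on the study area supplied, which is one reason ArcMap asks the user to confirm (or supply) the area used in the calculation.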

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod scale (usually one to ten individuals) at higher latitudes spanning 1981–2015, due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. With this, I predicted that during warm (positive) ENSO months, the dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expected the spatial gradient of sightings to shift north and south in relation to the ENSO gradient (warm, neutral, or cold).

False: the sightings were more clumped toward the lower latitudes overall (p < 2e-16), possibly due to habitat preference. The sightings were closer to beaches with higher human densities and human-related activities near Mission Bay, CA. It should be noted that just north of the San Diego transect is the Camp Pendleton Marine Base, which conducts frequent military exercises and could deter animals.

I used an ANOVA and found there was a significant difference in sighting latitude distributions between monthly ENSO indices. A Tukey HSD test was performed to determine which differences between treatments were significant. All differences (neutral–negative, positive–negative, and positive–neutral ENSO indices) were significant, with p < 0.005.
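The ANOVA-plus-Tukey workflow was run in RStudio, but the same comparison could be sketched in Python with SciPy; the latitude samples below are simulated stand-ins for the three ENSO categories, not the actual sighting data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated sighting latitudes (decimal degrees) by monthly ENSO category
neg = rng.normal(32.80, 0.02, 40)   # negative (cold) months
neu = rng.normal(32.83, 0.02, 40)   # neutral months
pos = rng.normal(32.87, 0.02, 40)   # positive (warm) months

f_stat, p_value = stats.f_oneway(neg, neu, pos)   # overall difference
tukey = stats.tukey_hsd(neg, neu, pos)            # which pairs differ
```

`f_oneway` answers "do any groups differ?"; `tukey_hsd` (SciPy ≥ 1.8) returns a pairwise p-value matrix answering "which pairs differ?", mirroring the R `aov()` + `TukeyHSD()` sequence.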

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes. See my Exercise 3 results.

Using a histogram, I was able to visualize how distances to each lagoon differed by lagoon. That is, dolphin sightings nearest to Lagoon 6, the San Dieguito Lagoon, are always within 0.03 decimal degrees of it. In comparison, sightings nearest to Lagoon 5, Los Penasquitos Lagoon, are distributed across a range of distances, with the most sightings at a greater distance.

Bar plot displaying the different distances from dolphin sighting location to the nearest lagoon in San Diego in decimal degrees. Note: Lagoon 4 is south of the study site and therefore was never the nearest lagoon.

After running an ANOVA in RStudio, I found that there was a significant difference in distance to nearest lagoon between ENSO index categories (p < 2.55e-9), with a Tukey HSD confirming that the difference in distance to nearest lagoon was significant between neutral and negative months and between positive and neutral months. Therefore, I gather there must be something happening in neutral months that changes the distance to the nearest lagoon; potentially prey are more static or more dynamic in those months compared to the positive and negative months. Using a violin plot, it appears that Lagoon 5, Los Penasquitos Lagoon, has the widest span of sighting distances when it is the nearest lagoon across all ENSO index categories. In neutral months, Lagoon 0, the Buena Vista Lagoon, has more than a single sighting (there were none in negative months and only one in positive months). The Buena Vista Lagoon is the most northerly lagoon, which may indicate that in neutral ENSO months, dolphin pods are more northerly in their distribution.

Takeaways to science and management: 

Bottlenose dolphins have a clustered distribution which seems to be related to monthly ENSO indices, with certain months having more of a difference in distribution, and likely to their sociality on a larger scale. Neutral ENSO months seem to have a different characteristic that impacts sighting distribution locations along the San Diego coastline. More research needs to be done to determine what is different about neutral months and how this may impact this dolphin population. On a finer scale, the six lagoons in San Diego appear to have a spatial relationship with dolphin pod sighting distributions. These lagoons may provide critical habitat for preferred bottlenose dolphin prey species, or preferred habitat for the dolphins themselves, either for cover or for hunting; different lagoons may have different spans of impact at different distances, either by creating larger nutrient plumes or because of static geographic and geologic features. This could mean that specific areas should receive more protection or maintain current protection. For example, the Batiquitos and San Dieguito Lagoons have some Marine Conservation Areas with No-Take Zones. It is interesting to see the relationship to different lagoons, which may provide nutrient outflows and protection for key bottlenose dolphin prey species. The city of San Diego and the state of California need ways to assess the coastlines and how protecting the marine, estuarine, and terrestrial environments near and encompassing the coastlines impacts the greater ecosystem. Other than the Marine Mammal Protection Act and small protected zones, there are no safeguards for these dolphins.

My Learning: about software (a) Arc-Info and (b) R

  1. Arc-Info: buffer creation, creating graphs, and nearest neighbor analyses; how to deal with transects, data with mismatching information, and conflicting shapefiles.
  2. R: I didn’t know much beyond the basics in R. I learned how to conduct ANOVAs and how to interpret their results. Mainly, I learned how to visualize my results and use new packages.

My Learning: about statistics

Throughout this project I learned that spatial statistics requires clear hypothesis testing in order to step clearly through a spatial process. Most specifically, I learned about spatial analyses in ArcMap, and how I could utilize nearest neighbor calculations to assess distribution patterns. Furthermore, I now have a better understanding of spatial distribution patterns and how they are assessed, such as clustered versus random versus evenly dispersed distributions. For data analysis and cleaning, I also learned how to apply my novice understanding of ANOVAs and then display results relating to spatial relationships (distances) using histograms and other graphical displays in RStudio.


Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Spatial pattern of ventenata invasion in eastern Oregon: Final Project

  1. The research question that you asked.

I initially asked the question, “how is the spatial pattern of invasion by the recently introduced annual grass, ventenata, influenced by the spatial pattern of suitable habitat patches (scablands) via the susceptibility of these habitat patches to invasion and ventenata’s invasion potential?”

  1. A description of the dataset you examined, with spatial and temporal resolution and extent.

In Exercise 1, I examined spatial autocorrelation in ventenata abundance and ventenata hotspots using spatial data (coordinates and environmental variables) and ventenata cover data that I collected in the field (summer 2018) for 110 plots within and surrounding seven burn perimeters across the Blue Mountain Ecoregion of eastern Oregon.

Target areas were located to capture a range of ventenata cover from 0% to over 90% across a range of plant community types and environmental variables including aspect, slope, and canopy cover, within and just outside recently burned areas. Once a target area was identified, plot centers were randomly located using a random azimuth and a random number of paces between 5 and 100 from the target area. Sample plots were restricted to public lands within 1600 m of the nearest road to aid plot access. Environmental data for sample plots include: canopy cover, soil variables (depth, pH, carbon content, texture, color, and phosphorus content), rock cover, average yearly precipitation, elevation, slope, aspect, litter cover, and percent bare ground cover.

For Exercise 2, I examined how the spatial pattern of vegetation type influences the invasibility of plant communities by ventenata. To achieve this, I used the Simpson Potential Vegetation Type raster (Simpson 2013) at 30 m resolution in ArcGIS, which was developed to identify potential vegetation types across the Blue Mountain Ecoregion.

In Exercise 3, I explored how the spatial pattern of canopy cover was related to ventenata abundance. For this, I used a live tree canopy cover layer developed for the Pacific Northwest calculated using Forest Vegetation Simulator methods including the sum of canopy cover estimates for vegetation plots in the region (Crookston and Stage 1999).

  1. Hypotheses: predictions of patterns and processes you looked for

Ventenata is an invasive annual grass that shares many functional traits with other impactful invasive annual grasses in the region, such as cheatgrass and medusahead, including similar vegetative height, fall germination, and a shallow root system. These similarities led me to believe that, like cheatgrass and medusahead, ventenata would be more abundant in open areas with low canopy cover, where competition from existing vegetation is lower.

The study area contains many open areas interspersed throughout the larger forested landscape. The patchy spatial distribution of open areas throughout the study area will likely result in a patchy distribution of areas with high ventenata cover. Additionally, ventenata produces many seeds, with the majority of these seeds dispersing short distances from the parent plant. This leads me to believe that areas with high ventenata cover will be clustered near other areas with high ventenata cover creating invasion “hot spots” across the study region.

Hypothesis 1: Areas with high ventenata cover will be clustered near other high cover areas and low cover areas will be clustered near other low cover areas.

Hypothesis 2: The spatial pattern of ventenata abundance will be positively correlated with a neighborhood of non-forest habitat types (shrub-lands and grasslands) and negatively correlated with a neighborhood of forest habitat types. This relationship will decrease in strength as distance increases from the high cover sample point, as vegetation types farther from an invasion point are likely not as strongly influencing invasion as vegetation types closer to that point.

Once a species has established in a suitable habitat, it may spread to areas of less suitable habitat aided by strong propagule pressure from a nearby population. Open areas may act as source populations, allowing ventenata to build propagule pressure to the point where it is able to successfully establish and maintain a population in less suitable habitat such as areas with high canopy cover.

Hypothesis 3: Plots where ventenata is present in areas with high canopy cover (e.g. forests) will be clustered near open areas. These open areas may provide strong propagule pressure to aid invasion into areas with fewer available resources (sunlight).

  1. Approaches: analysis approaches you used.

To test these predictions I performed a handful of spatial analyses including:

Exercise 1: I tested for spatial autocorrelation using Moran’s I, created a correlogram in R, and performed hot spot analysis in ArcGIS.

Exercise 2: I explored the relationship between the spatial pattern of ventenata abundance and the spatial pattern of different vegetation types using neighborhood analyses in ArcGIS and R.

Exercise 3: I examined the spatial relationship between ventenata and canopy cover using a Ripley’s cross-K analysis in R.

  1. Results: what did you produce — maps? statistical relationships? other?

Throughout the analyses, I produced a series of statistical relationships displayed as maps and graphs. Hot spot analysis produced a map that allowed me to visualize the relationship of autocorrelation between ventenata abundance at my sample points. For Moran’s I, neighborhood analysis, and Ripley’s cross K, I produced graphical representations of statistical relationships in R.

  1. What did you learn from your results? How are these results important to science? to resource managers?

The correlogram and hotspot analysis results showed that the spatial pattern of ventenata is autocorrelated and has a patchy distribution. The hotspot analysis suggests that plots with high ventenata cover are clustered with other high-cover plots and that low-cover plots are clustered together, as I predicted in Hypothesis 1. This is likely a result of the patchy distribution of open areas and forested areas across the landscape and the dispersal ability of ventenata.

Neighborhood analysis showed that areas with high ventenata cover are more positively correlated with nearby forested areas (ponderosa pine) than I originally thought. This result suggests that ventenata may preferentially invade areas surrounded by ponderosa pine vegetation type as well as shrublands which would not support Hypothesis 2. However, ponderosa pine vegetation type does not necessarily indicate high canopy cover, and could represent invasion into an alternative low canopy cover vegetation type. Additionally, the vegetation type maps are mapped at large spatial scales and may not represent the fine scale variation in vegetative cover. Uncertainty in this result inspired a follow up analysis using canopy cover instead of vegetation type as a predictor variable in a Ripley’s cross K analysis.

In my follow up analysis using Ripley’s cross K, I found that forest plots where ventenata was present were only weakly clustered around open areas despite my original hypothesis that there would be strong clustering (H3). These results could suggest that ventenata has a higher tolerance for high canopy cover than I originally predicted. Alternatively, these results could indicate that ventenata is capable of dispersing large quantities of seed much farther distances than originally thought, thus not requiring open areas in the immediate neighborhood. Moreover, the same issue of scale may apply to the canopy layer as the vegetation layer, and the 30m resolution may be over predicting canopy cover at my sample sites.

My findings could have important implications for forest ecosystems, which are commonly thought to be relatively resistant to invasion by annual grasses and are now showing susceptibility to ventenata invasion. For example, ventenata could increase fine fuels in these systems, making them more likely to ignite and carry surface fire. Managers may want to consider incorporating annual grass management strategies into their current and future forest management plans to help reduce potential invasion impacts.

  1. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

During this class I learned a suite of new tools in ArcGIS including hotspot analysis and concentric ring buffer. I created my first model using ArcGIS Modelbuilder! I learned the basics of spatstat in R and successfully completed some spatial analysis which required transforming my data into a spatial data frame (I did not know that these existed prior to this class). Additionally, I was exposed to, and gained experience using many other new functions in R including Moran.I, correlog and kcross that were useful for spatial analysis.

  1. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I learned that Moran’s I and correlograms are useful for testing spatial autocorrelation in data, but only if the scale applied is of interest. For example, it was not useful to compute only one Moran’s I value for my entire data set – this indicated that there was spatial autocorrelation in the data, but did not indicate a spatial pattern. However, when I computed Moran’s I at various distances and displayed these results in a correlogram, I uncovered the pattern of the spatial correlation. The hotspot analysis allowed me to visualize exactly where the high and low clustering was occurring across my sample plots while simultaneously providing a significance value for those hot and cold spots.
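The distance-banded Moran's I described above can be sketched directly. This minimal version uses binary band-membership weights (pairs within a distance band get weight 1) and hypothetical plot coordinates and cover values; the R packages used in the actual analysis apply the same formula.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a given spatial weight matrix (diagonal = 0)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

def correlogram(coords, values, breaks):
    """Moran's I per distance band, using binary band-membership weights."""
    c = np.asarray(coords, dtype=float)
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    return [morans_i(values, ((d > lo) & (d <= hi)).astype(float))
            for lo, hi in zip(breaks[:-1], breaks[1:])]

# Hypothetical plots: high cover clustered apart from low cover
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cover = [80, 85, 90, 5, 10, 0]
bands = correlogram(coords, cover, breaks=[0, 2, 20])
```

For this patchy toy pattern, the short-distance band gives a strongly positive Moran's I (like values are near each other) and the long-distance band a strongly negative one, which is exactly the shape a correlogram reveals that a single global value hides.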

Ripley’s cross-K analysis was useful for testing the relationship of my ventenata points to another variable (canopy cover). I found this test appealing because it tests whether one variable is clustered around another, using a Poisson distribution to compare observed and expected values under an assumption of spatial randomness. However, I learned that this method was not appropriate for my data, as my sample plots were chosen based on field variables and were not a random sample. This violated the assumptions of randomness and homogeneity across the sampling region, as my plots were more heavily located in non-forested areas. If I wanted to properly investigate these spatial questions, I would have to develop a more random sampling method.
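For intuition, a bare-bones bivariate (cross) K estimate can be written in a few lines. This sketch omits the edge corrections that spatstat's `Kcross` applies, and the points below are hypothetical; under complete spatial randomness the expected value at radius r is πr².

```python
import numpy as np

def cross_k(pts1, pts2, radii, area):
    """Bivariate Ripley's K without edge correction.

    Values well above pi * r**2 suggest that pts2 clusters around pts1.
    """
    p1 = np.asarray(pts1, dtype=float)
    p2 = np.asarray(pts2, dtype=float)
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    scale = area / (len(p1) * len(p2))
    return np.array([scale * np.sum(d <= r) for r in radii])

# Hypothetical unit-square study area: three ventenata-present forest plots
# tightly clustered around one open-area point
open_pts = [(0.5, 0.5)]
vedu_pts = [(0.52, 0.50), (0.50, 0.48), (0.48, 0.52)]
k = cross_k(open_pts, vedu_pts, radii=[0.1], area=1.0)
```

Here all three points fall within r = 0.1 of the open-area point, so the estimate sits far above the π(0.1)² ≈ 0.031 expected under randomness — the signature of clustering that the test looks for.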


Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

Crookston, NL and AR Stage. 1999. Percent canopy cover and stand structure statistics from the Forest Vegetation Simulator. Gen. Tech. Rep. RMRS-GTR-24. Ogden, UT: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. 11 p.


Final Project Summary – Spatial Analyses of the Relationships Between Seascape Classes and Rockfish Distributions

Question Asked

The question that I sought to answer in this class was “How is the spatial pattern of forage fish assemblage in the California Current System related to the spatial pattern of seascapes based on the sea-surface conditions used to classify the seascapes?”  This is outlined in my first blog post, which can be found here. Additionally, I was seeking to characterize the spatial distribution and extent of seascape classes and forage fishes. Seascape classes are used as a way to simplify the myriad of physical, biological, and chemical processes that can affect organisms in the ocean. Eventually, the focus of the analyses shifted to young-of-the-year rockfish, as opposed to forage fish. All analyses mentioned in this blog post were conducted using the YOY-Rockfish data as opposed to the forage fish data.


Data Description

Midwater trawls have been conducted annually by the National Oceanic and Atmospheric Administration’s (NOAA) Southwest Fisheries Science Center (SWFSC) in an attempt to monitor the recruitment of pelagic rockfish (Sebastes spp.) and other epipelagic micronekton at SWFSC stations off California. The trawls have informed a dataset that represents overall abundance of all midwater pelagic species that commonly reside along the majority of the nearshore coast of California from 1998 to 2015. Each trawl collected information about both fish abundance, recorded in absolute abundance, and location data, recorded in the form of latitude and longitude. The dataset also includes a breakdown of species by taxa.

Seascapes have been classified using a combination of in-situ data (from the trawls) and remotely sensed data from NASA’s MODIS program. Seascapes were classified using the methods described in Kavanaugh et al., 2014 and represent the seascape class in the immediate area that each trawl occurred. Seascapes are classified at 1 km and 4 km spatial resolution and at 8-day and monthly temporal resolution. Each seascape has been assigned an ID number which is used to identify similar conditions throughout the dataset.

The current seascape classification methods are able to classify surface conditions into one of 33 distinct seascape classes. The variables used to classify the seascapes are listed below:

  • Sea-surface temperature (Celsius)
  • Sea-surface salinity (Practical salinity units)
  • Absolute Dynamic Topography (meters)
  • Ice cover (% of area)
  • Colored dissolved organic matter (m^-1)
  • Spatial averaged sea-surface chlorophyll a (mg m^-3)
  • Phytoplankton physiological state (NFLH) (W m^-2 um^-1 sr^-1)
  • NFLH:Chlorophyll ratio

It is inferred that certain, carefully-selected sea-surface conditions can imply favorable conditions to support healthy ocean ecosystems in the water column and near the ocean floor. The conditions listed above have been studied and selected for this reason. However, this may not be a perfect science – any attempt to delineate and simplify the multitude of factors and conditions that facilitate life in the ocean is likely to leave out important factors. This underscores the importance of research that tests these classification methods and measures their ability to replicate and predict the true natural environment. The analyses conducted for this class attempt to do that in the context of one functional group of fishes within one major ecosystem in the Pacific Ocean.


Hypotheses

Simply put, I hypothesized that any measurable changes in the spatial extent of certain seascape classes would also be identifiable in the spatial variability of forage fish assemblages over time. The California Current ecosystem is meticulously studied and examined by a myriad of researchers from a number of different affiliate institutions. Studies reviewing the physical and biogeochemical conditions of the area indicate that, from an environmental perspective, many areas off the coast of California should support high fish abundance.

Specifically, I was expecting areas of high forage fish and young of the year rockfish abundance to exist in seascape classes that represent nutrient-rich, upwelled water. These conditions have been shown to support thriving ecosystems throughout the water column due to an abundance of energy and food for fishes that live higher in the water column. Upwelling brings cold, nutrient-rich water to the surface and occurs seasonally in most places along the California Coast. Using the variables listed above, that would mean below average sea-surface temperature, high dissolved organic matter, and high chlorophyll a. Some classes that represent conditions similar to this are:

  • Temperate Transition (Class 7)
  • Tropical/Subtropical Upwelling (Class 11)
  • Subpolar (Class 12)
  • Temperate Blooms Upwelling (Class 14)
  • Warm, Blooms, High Nutrients (Class 21)

Preliminary multivariate community structure analysis has shown some statistically significant relationships between certain species and certain seascape classes using this data. If spatial patterns do exist, I expect there to be some relationship between the surface conditions and the fish found at depth of the midwater trawls.

Since my focus shifted from forage fish to YOY rockfish, my hypothesis shifted from a prediction of correlation between certain seascape classes and forage fish abundance to relationships between rockfish and other seascape classes (as determined by the aforementioned multivariate analyses).

My hypothesis, taking into account all of the aforementioned considerations, can be restated as follows:

It is expected that the spatial pattern of areas of high young-of-the-year rockfish abundance will be related to the spatial patterns of seascape classes that represent areas of high nutrient availability and upwelled water along the California Coast.

Additionally, I expect even higher areas of abundance to occur in areas where two or more seascapes representing these favorable conditions converge or border one another. These border areas are likely to indicate much larger swaths of ocean that hold habitable conditions that are likely to be able to support entire communities rather than smaller numbers of fishes. While this hypothesis was not tested in this project, future work can be conducted using FRAGSTATS, patch analysis software, and other landscape ecology methods to seek an answer related to this prediction.

Approaches Used

Exercise 1: For the first exercise, different interpolation methods were used to explore their effects on the point trawl data. Two interpolations (kriging and IDW) were tested and used to model the abundance of rockfish in four selected years.
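As an illustration of one of the two methods tested, a minimal IDW interpolator might look like the sketch below; the power parameter, station coordinates, and abundances are placeholders, not the settings or data used in the actual analysis.

```python
import numpy as np

def idw(xy_known, values, xy_query, power=2):
    """Inverse-distance-weighted interpolation at query points."""
    known = np.asarray(xy_known, dtype=float)
    query = np.asarray(xy_query, dtype=float)
    v = np.asarray(values, dtype=float)
    d = np.linalg.norm(query[:, None, :] - known[None, :, :], axis=-1)
    d = np.maximum(d, 1e-12)        # avoid dividing by zero at a known point
    w = 1.0 / d**power
    return (w @ v) / w.sum(axis=1)

# Hypothetical trawl abundances at two stations, estimated at two locations
est = idw([(0, 0), (1, 0)], values=[0, 10], xy_query=[(0.5, 0), (0, 0)])
```

Unlike kriging, IDW assumes nothing about spatial covariance: the weights depend only on distance, so the choice of `power` directly controls how local or smooth the resulting surface is — one of the consequences explored in this exercise.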

Exercise 2: Next, a neighborhood analysis was used to determine the dominant seascapes in the areas around trawls that produced high rockfish abundance (both according to the data and the interpolation) and low rockfish abundance. Buffers were set at 5, 10, and 25 km distances around two trawls of each kind, and seascape classes were measured as a percentage of the area of each buffer.

Exercise 3: Finally, a confusion matrix was calculated that measured the agreement between occurrence of the significant seascapes (as determined by exercise 2) and areas of high abundance (as determined by the interpolations in exercise 1).


Results Produced

Exercise 1: The interpolations produced a number of maps of varying spatial extents and with varying resolution. Methods had to be refined multiple times to procure workable results, but maps interpolating rockfish abundance for four distinct years were eventually created and used.

Exercise 2: The neighborhood analysis produced statistics related to the percent of each buffer area occupied by each seascape class. This information was then used to create two plots – one for areas of high abundance and one for areas of low abundance. The most important result was an understanding of which seascape classes represented areas of high rockfish abundance. The dominant seascape classes in areas of high abundance turned out to match two of the seascape classes predicted in my hypothesis: Temperate Transition (Class 7) and Subpolar (Class 12).

Exercise 3: The final exercise produced a confusion matrix that measured true positives, true negatives, false positives, and false negatives as they related to the agreement between the seascape classes and the interpolated maps. The final matrix also produced an overall agreement score (Kappa).


The confusion matrix ended up being a bit of a failure, as both the seascape information and interpolated information had to be turned into presence-absence data for the matrix to be calculated. This eliminated some of the fine-tuned resolution associated with the interpolation and ultimately led to a Kappa score that suggested less agreement than would be expected by chance. While this specific result is not important to science or managers, these methods can be refined to improve the agreement between the two sources, ultimately providing a way to statistically link physical ocean conditions and rockfish abundances in California.
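The Kappa score mentioned is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A small sketch shows how it is computed from a presence-absence agreement table, and how a table with poor diagonal agreement yields a negative kappa like the one found here; the counts are hypothetical.

```python
import numpy as np

def cohens_kappa(matrix):
    """Cohen's kappa for a square agreement (confusion) matrix."""
    m = np.asarray(matrix, dtype=float)
    n = m.sum()
    po = np.trace(m) / n                                   # observed agreement
    pe = (m.sum(axis=0) * m.sum(axis=1)).sum() / n**2      # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical presence-absence cross-tabulation: rows are significant
# seascape absent/present, columns are interpolated high abundance
# absent/present
table = [[20, 10],
         [15,  5]]
kappa = cohens_kappa(table)   # negative: worse agreement than chance
```

Because collapsing the interpolated surfaces to presence-absence discards their resolution, even a modest mismatch in the off-diagonal cells can push kappa below zero, which is consistent with the result described above.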

For these reasons, I consider this project to be an excellent pilot study for these data – one that introduces me to the types of analyses available to researchers who work with these types of data and their strengths and shortcomings.

Learning Outcomes

Software and Statistics

This class gave me an opportunity to brush up on my ArcMap and Modelbuilder skills, which I used to be very proficient in but had let lapse. I’ve also had the opportunity to explore some analyses through the lenses of R and Python, though I did not end up executing any parts of my project in these programs. The most important statistical learning outcomes for me had to do with interpolation and confusion matrices. The first exercise taught me about the different ways point data can be interpolated and the consequences those choices can have for results. I was familiar with the different types of hotspot analysis available to users in ArcGIS, but was not aware of their differences. The confusion matrix introduced me to an entirely new way to connect independent modeling methods. Methodologically, I learned about the importance of recording your methods as you execute processes and about the benefits of naming files sequentially. Had I not been organized, I would have had a hard time keeping track of the different processes executed and files used during each step.