Monthly Archives: May 2019

Exercise 3: Poisson regression analysis of Texas rural index scores and colorectal cancer mortality counts

Research Question

What is the association between Texas county rurality index score and colorectal cancer (CRC) mortality?

In Exercise 1, I created a rurality index for Texas counties using rural indicator variables, while in Exercise 2, I selected Texas counties with the highest CRC death counts and created multi-ring buffers around them to visualize and measure how rural indicator variables shift as ones moves away from the counties. In this exercise, I am doing a more direct analysis of CRC mortality and my rurality index by utilizing a Poisson regression model. I expect the results to show that as index scores become more “rural,” county CRC death counts will increase.

Tools and Data Sources Used

The analysis, diagnostic, and graphing procedures for this exercise were all completed using the following R functions/packages: glm(), vcd, AER, and car. More specifically on the R tools, a generalized linear model (GLM) of the family Poisson was utilized for modeling, a rootogram from the vcd package was used for Poisson goodness-of-fit, a test for model over-dispersion was utilized from the AER package, and an influence plot was created using the car package. Other goodness-of-fit diagnostics and statistical measures were ascertained via the base glm() procedure.The rural indicators utilized for the index in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Like in my Exercise 2, aggregated CRC mortality counts per 100,000 population for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


Attribute Table Conversion to R: First, the attribute table of CRC death counts and rural index scores by county (created as part of Exercise 2) was exported to an Excel file using the Table to Excel tool in Arc. The Excel file was then loaded into R using the “readxl” package.

Univariate Poisson Diagnostics: Before introducing a Poisson regression, I wanted confirm the outcome variable follows a Poisson distribution. To do this, I used the rootogram() function from the vcd package in R to visualize the distribution of the CRC mortality count data. The figure below show that the data seems to roughly follow a Poisson distribution. Based on this, I proceeded to fitting the model.

Model Fitting: I fit a GLM of the Poisson family using the following formula in R: glm(formula = CRCDeathrate ~ PCAWeighted_Index, family = “poisson”). Initially, I attempted to construct a linear OLS regression with this data, but after many attempted transformations, I realized the outcome count data was likely following a Poisson distribution. In subsequent steps, I walk through the post-fitting Poisson diagnostics about the appropriateness of this fit.

Poisson Regression Diagnostics: In Poisson regressions, one of the main concerns is over-dispersion of the data, as Poisson distributions have only one free parameter and do not allow for variance to be adjusted individually from the mean. Therefore, if over-dispersion exists, the data has more variation than the Poisson distribution allows, biasing the variance of the model. To ensure the Poisson regression is not over-dispersed, I utilized the dispersiontest() function from the AER package. Equi-dispersion from this function is displayed when alpha=0. The alpha value of -0.20 and p=0.99 displayed below show that model is not over-dispersed.

Further proof that Poisson is the correct model for the data can be shown through an influence plot from the car package, where only a few data points (101,57) are highly influential (shown below).

I also used a chi-squared test to confirm that a Poisson model is a good fit. The resulting p-value (0.95) indicates that a Poisson model is a very good fit for the data (shown below).

Exporting to Arc for Mapping: After creation the model, I computed the standardized residuals of the model in R for each county using the studres() function. I then exported this data to Arc and added it to the attribute table I mentioned previously in this exercise. I then used symbology to represent the residuals on the Texas map and created a layout


Because Poisson regression of counts utilized a log link, the resulting coefficient(s) need to be exponentiated. The exponentiated coefficient for this regression display that as county index scores increase (become more urban) by 1 unit, the intercept (mean of CRC counts) is multiplied by 0.99. In other words, for every unit increase in rurality, county CRC deaths counts increase by 1% (p<0.001). The narrow confidence intervals for this estimate indicate that the coefficients are accurate.

The map of standardized residuals can be seen above and indicates that there is significant regionality to the association between index scores and CRC death counts, where central Texas is much more green (negative standardized residuals) and counties near the edges are orange and red (positive standardized residuals). Also, as expected, the high CRC death count Texas counties from Exercise 2 (Anderson, Howard, Gonzales, and Newton) have standardized residuals significantly higher than expected in the Poisson model.


This process of regression construction was very helpful for determining how to best model my data. I initially attempted to use a linear regression and after laboring to find a transformation for my outcome data, I finally decided on Poisson regression (count data duh!). Based on the model diagnostics I performed, it seems that the data works perfectly for Poisson regression. In my opinion, the most difficult part of this exercise was interpreting the coefficients of the regression. Because the coefficients are on the log scale, they have to be exponentiated, which greatly confused my interpretation at first. Because there is an extensive amount of county CRC death data missing from this analysis (confidentiality), these results may be less indicative of the true rurality-CRC mortality associations in Texas. Further, this analysis suffers from the modifiable areal unit problem, where counties may not an appropriate boundary for comparison. In future analyses, I would like to use more complete and spatially defined data from the state cancer registry to shed more light on the relationship.

Multi-Distance Spatial Cluster Analysis for Woodpecker Nests (Ripley’s K)

Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)


How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)


Multi-Distance Spatial Cluster Analysis (Ripley’s K) in R

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point data clustering over a range of distances. Ripley’s K indicates how spatial clustering or dispersion changes with neighborhood size. If the user specifies distances to evaluate, starting distance, or distance increment, Ripley’s K identifies the number of neighboring features associated with each point if the neighboring features are closer than the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors for a particular evaluation distance is higher/larger than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.

K Function Graphic














Referenced from ArcGIS Desktop 9.3 Help (

Data Description

My polygon dataset includes 10 survey units (6 treatment, 4 control) totaling over 7,000 ac. Each survey unit is between 397 and 1154 ac2. 34 salvage logging units are dispersed throughout the 6 treatment units. The salvage units are characterized by three silvicultural prescriptions replicating preferred postfire habitat for the three target woodpecker species: black-backed, white-headed, and Lewis’s woodpeckers. The 4 control survey units and the unlogged landscape in the 6 treatment units act as the undisturbed control habitat.

In 2016 and 2017, belt transects with corresponding avian point counts were surveyed in each of the 10 survey units. These surveys detected woodpecker occupancy and nest locations based on audio playback and nest searching methodology. Woodpecker nest datasets were developed from these surveys for pre-salvage (2016, n = 71) and post-salvage (2017, n = 77) conditions. I wanted to run Ripley’s K on the two datasets separately to determine differences in nest clustering before and after the salvage treatments. In the future, with successive datasets collected in 2018 and 2019, I can analyze clustering trends up to three years after salvage logging. I will also integrate 3D forest structure variables from pre- and post-salvage lidar datasets.

A subset of the 2016 and 2017 nest datasets picturing clockwise from top right: East Fork Canyon Creek (Ctrl), Wall Creek (Ctrl), Alder Gulch (Trt), Lower Fawn Springs (Trt), Upper Fawn Springs (Trt).

By using Exercise 1 to determine how nests are clustered within the study units, I can use this clustering to inform my Exercises 2 and 3, which should reveal how the clustering relates to the salvage harvest units.

Ripley’s K Steps

  1. Export nest points as their own shapefile. My preliminary point dataset includes both nest points and vegetation survey points, so I needed to isolate only nest points. This will indicate how nest points cluster within study units.
  2. Further export nest points by survey unit. Isolate nests in each survey unit as their own shapefiles. Running Ripley’s K on 2016 and 2017 nests as a whole will not be useful, since clustering across an entire nest dataset will reflect the 10 survey units selected for this study. In that case, Ripley’s K would falsely demonstrate high clustering within those 10 survey units. In reality, the nests are only clustered here because the surveys intentionally restricted nest detection to these areas instead of across the entire Canyon Creek fire complex. I ran Ripley’s K on all the 2016 and 2017 nests and the output proved true to this phenomenon, indicating statistically significant clustering at smaller distances:

  1. Use R to run Ripley’s K on each survey unit’s 2016 nest shapefile.
  2. Use R to run Ripley’s K on each survey unit’s 2017 nest shapefile.

Below is the example script for Ripley’s K using the 2016 Alder Gulch survey unit. For 2017 I  changed the variable names to 2017:

AGnests2016 <- readOGR(“./data”, layer = “AG_2016_nests”)
AGnests2016.ppp <- as(AGnests2016,”ppp”)
n <- 100
AGnests2016RK <- envelope(AGnests2016.ppp, fun= Kest, nsim=n, verbose=F)

I gave the option to run either 100 or 1000 iterations. The difference is that the confidence interval (grey area in the output graphs) widened with 1000 iterations, shown below for the Big Canyon survey unit:

Below are the Ripley’s K results for each survey unit in 2016 and 2017 over 100 iterations. The accompanying visual plots display nest point distribution in each survey unit. Some survey units did not have large enough sample sizes for the tool to function correctly. Notable results from a visual analysis of each graph include nest clustering in Big Canyon (Treatment 1) and Crawford Gulch (Control) in 2016 and Lower Fawn Creek (Treatment 2) in 2017.

Alder Gulch 2016 (n = 5)


Alder Gulch 2017 (n = 7)


Big Canyon 2016 (n = 13)


Big Canyon 2017 (n = 10)


Crazy Creek 2016 (n = 10)

Crazy Creek 2017 (n = 11)


Crawford Gulch 2016 (n = 12)


Crawford Gulch 2017 (n = 8)



East Fork Canyon Creek 2016 (n = 3); Not surveyed in 2017



Lower Fawn Creek 2016 (n = 8)


Lower Fawn Creek 2017 (n = 14)


Overholt Creek 2017 (n = 4); Not surveyed in 2016


Sloan Gulch 2016 (n = 7)


Sloan Gulch 2017 (n = 7)



Upper Fawn Creek 2016 (n = 4)


Upper Fawn Creek 2017 (n = 5)


Wall Creek 2016 (n = 9)

Wall Creek 2017 (n = 11); Cannot import figures due to unresolved maximum file size error.


Because of the small sample size for each of the survey units (as few as 4 nests in some years/units), my Ripley’s K output graphs appeared blocky instead of continuous. Notice how continuous the graphs appeared when I ran all 2016 nests. Overall this analysis did not perform well with these datasets. I am also unclear how edge effects were factored into this particular analysis. There may be more parameters I could define in the code when running this analysis in R. For example, I did not manually specify distances in the R code.

Ripley’s K requires (X,Y) coordinates for each point location. I first tried to perform this analysis in ArcMap. However, my woodpecker nest point shapefiles contain UTM coordinates divided into two separate fields for northing and easting. This caused a problem when the Ripley’s K tool in ArcMap asked for the dependent variable and I could only select one field, northing or easting. Therefore, I needed to run the Ripley’s K tool in RStudio instead.

The above Step 2 explains another issue with trying to analyze all nests at once, but I was able to resolve it by individually isolating each survey unit. Nest dataset analyses must also address a temporal component related to the salvage logging. 2016 nests must be analyzed separately from 2017-2019 nests, but 2017-2019 nests can be analyzed either by year or as a group.

Exercise 3: Characterizing the spatial pattern of confusion between pixels classified by global settlement datasets and OSM refugee settlements.

Question that I asked:

In Exercise 3, I was asking how the different classification methods that I’m using align or do not align with each other — essentially, creating a confusion matrix to identify the pixels that are True Positives, False Positives, and False Negatives when comparing two of the classifications. That is — what is the spatial pattern of confusion between pixels classified by global settlement datasets and an unsupervised binary classification of pixels within an OSM boundary?

Approach that I used:

I used EarthEngine to simulate confusion among pixels — I did not necessarily create a matrix in order to do so, but considered the concept of a confusion matrix in order to calculate a number for each pixel.

Steps I followed to complete the analysis:

In order to do this, I needed to compare “Facebook” and “K-means” as well as “WSF” and “K-means.” In this case, the “K-means” classification was my “truth” classification, because I needed something to measure the other classifications against that was not just a summary of pixels inside of a vector but a more traditional binary unsupervised classification. Thus, I used the settlements to guide a clustering algorithm of K-Means to cluster similar pixels, and created clusters of “settlement-like” pixels. I needed to perform raster math in order to combine these datasets and represent False Positives, False Negatives, and True Positives. I added WSF to K-means, with each pixel labeled as “1” in each representing  a positive record. I then added Facebook to K-means, again with each pixel labeled as 1, but then multiplied this by 4 for further additive properties to identify the overlaps in False Positives, False Negatives, and True Positives. Ultimately, this meant that the pixels that were False Negative in either or both comparisons had pixel values of 10, 11, and 14. The data that were False Positive in either or both comparisons had pixel values of 1, 4, and 5. A value of 15 represented where all three datasets registered as True Positive.

Brief description of results I obtained:

I was interested in both False Positives and False Negatives, but for different reasons. In False Positives, I wanted to visualize where the different global settlement datasets DID detect settlements while the K-Means did not. False Negatives would show me where K-Means picked up settlement data but neither of the global datasets did. Ultimately, False Negatives were much more prevalent throughout the dataset, further illustrating the exclusion of refugee settlements from global datasets. I chose to display a map of a more interesting pattern that appeared when looking over a larger settlement that was near a large body of water: because the binary clustering grouped water with “settlement” type landscapes, this area is a significant False Negative in the data. If I were to calculate statistics, it would make sense to exclude the area of water in order to not promote bias in the data.

False Positives and False Negatives between Refugee Settlement Binary Clusters and Facebook Classification and World Settlement Footprint Classification

Critique of the method – what was useful, what was not?

This exercise probably caused the most trouble with regards to methods that I thought made sense and would work but both presented more challenges and fewer patterns than I was hoping. There wasn’t really the spatial pattern that I was expecting, perhaps because this data is so noisy and over discontinuous areas in the landscape. It was useful to dig into the overlapping data, but ultimately the actual spatial patterns were not very remarkable. Perhaps the numbers presented in a matrix would be a more useful representation of the data or patterns for this specific method. Aside from the spatial statistics, I actually ran into quite a few function issues in EarthEngine, Pro, and ArcGIS Desktop — all three of which I used to try to manipulate this data into the ways I was imagining it would. For my final project, I may need to revisit these methods or seek advice from others, because my techniques were not as effective as I’d hoped.

Supervised Image classification on forested stands

Question that I asked?

Could I identify functional tree species with supervised image-classification in my stands?

The reason I asked this question was so that If I had to do a geographically weighted regression again it would be valuable to have deciduous or coniferous tree species in my point data for an added variable.

Name of the tool or approach that you used.

The main tool that I used for image classification was the maximum-likelihood classification in ArcMap. I also used the create accuracy assessment points to help create a confusion matrix in excel.

Brief description of steps you followed to complete the analysis.

I downloaded 2016 NAIP imagery in my area of interest and used the high resolution imagery to create a false colored image with the bands being arranged as NIR, Red, and Green. To help delineate broad-leaf vegetation from coniferous vegetation, I applied a histogram equalize stretch that enhanced my ability to identity conifers in the landscape. From there I created a maximum likelihood classification by drawing training data polygons on the false color imagery, which involved me using the Esri digital imagery base map as a reference image.


Once the image classification was complete, I used the create accuracy points on my stand and then extracted the raster values from the thematic map output to those points to create a confusion matrix in Excel. I clipped the thematic map raster to the watershed polygons I made when I did an individual tree segmentation to show what pixel classifications were assigned in my tree tops.


Brief description of results you obtained.

The thematic map output was 83% accurate with conifer and developed land covers performing the worst in the model. The developed  land cover is generally difficult to model in a landscape, and the variability in urban spectral reflectance leads to errors in modeling. The conifer land cover performed more poorly due to my trouble achieving accurate training data with the imagery resolution, and also with the model having trouble delineating conifers from grass and deciduous vegetation. Errors of commission on my part (65% accuracy), and errors of omission (75%  accuracy) lead to the lower accuracy of the conifer land cover (Table 1).  Despite these errors, the thematic map output performed well, and the land cover pixels in my stands showed that conifer trees were accurately assigned in the Saddleback stand (Figure 2). For the baker creek stand the large amount of shadows, sun glare on canopies, and classification cut off, lead to a poor classification of that stand.

Figure 1. The land cover thematic map for the entire NAIP image. The cyan blue color indicates the locations of Saddleback and Baker Creek stands.

Figure 2.The land cover classification output for the Saddleback stand.

Figure 3. The land cover classified output for Baker Creek Stand. Note that the NAIP imagery that was classified did not extend to cover the entire stand. The tree crown polygons were laid below the output to show where the land cover cuts off.

Table 1. Confusion matrix for the thematic map output.


Critique of the method – what was useful, what was not?

Some critiques about this process was that it was time consuming to create training data detailed enough to capture the variation in the scene for my desired accuracy. Sources of errors in the thematic map include shadows, resolution and variable spectral response signatures in the remotely sensed vegetation. Shadows occluded trees that would otherwise stand out, and distorted the classification enough for me to have to add in a land cover classification for shadows to mask them out of the scene. The issue of resolution just means that NAIP imagery was not detailed enough for the applications I asked. Imagery taken from unmanned aerial drones may be a potential avenue for acquiring a more  higher resolution data set. The confusion matrix highlights this issue, with an omission error of 65% for conifers and 75% commission error. It was difficult to determine conifer trees accurately in the training data from the variability of the spectral reflectance and the blurred crowns from the 1 meter resolution.


Since I only did a classification, I didn’t attempt to classify tree functional species to my tree polygons. The process that comes to mind on how to do that is to visibly determine which classification color is more pronounced in a tree top, and then placing that species in the point data as an attribute. This process would be highly time consuming and developing a methodology to streamline the classification of functional tree species to my tree points would be potential future work.Overall, the thematic map outputs are useful for areas like the Saddleback stand that have less shadows and distortion. The map is less useful for areas with high distortion like my Baker Creek stand.

Watershed recession behavior as a function of watershed environmental variable

Question asked: How is the spatial pattern of watershed recession coefficients related to the spatial patterns of watershed elevation, soils, geology, basin area, and precipitation?

Name of the tool or approach used: To perform this analysis, I used recession analysis to quantify low flow metrics for 12 streams and rivers in the Oregon Coast Range. I classified the a and b recession coefficients using 2 different approaches in ArcGIS: equal interval and natural jenks. The resulting classifications are quite different.

Methods of procedure:

  1. Recession Analysis: the methods employed for the recession analysis are described in Exercise 1. For Exercise 2, I calculated recession coefficients for an additional seven sites in the Oregon Coast Range for a total of nine sites. Data for the present analysis was analyzed on the hourly timestep.
  2. Watershed Delineation: I delineated watersheds in ArcGIS based on each gaged pour point.
  3. Spatial data: I downloaded relevant spatial data from various geospatial databases including Oregon Geospatial Enterprise Office, USGS, USDA, and OSU PRISM Climate Group.
  4. Data Visualization and Classification: I used ArcGIS to visualize my spatial data and to classify the recession analysis results. I used two classification methods: equal interval and natural jenks. Both methods had 5 classes.
    1. Equal interval classification: classification categories using this method is calculated by dividing the range of data by the designated number of classes.
    2. Natural jenks classification: this is an optimization method of classification that minimizes variance within classes and maximizes variance between classes.
  5. The results of my classification were analyzed by visual comparison. That is comparison between the two classification methods, as well as how the results of each classification methods compared to the environmental variables across the watersheds.


Results from this analysis demonstrate that the recession coefficient classification method influences the apparent spatial variability of recession metrics. The natural jenks method results in more spatial heterogeneity across the latitudinal gradient (Figure 1). The a and b recession coefficients seem to be related in the way that they vary across space, however coefficient b is slightly more variable than coefficient a. The equal interval classification method results in a more homogenous representation of both a and b coefficients (Figure 2). Coefficient a seems to be largely controlled by basin size using this method, which is apparent because the smallest basin that is in the class with high values. Coefficient b is more variable, and also seems to be controlled by basin area. Precipitation may also be a primary control. Overall, between the two classification methods, the Nehalem watershed (furthest north) and the Chetco watershed (furthest south) seem to demonstrate similar recession coefficients. The watersheds between the Chetco and the Nehalem, in general, demonstrate similar recession behavior.

Figure 1. Classification of recession coefficients using the natural jenks method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Figure 2. Classification of recession coefficients using the equal interval method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Critique of method:

This was a simple method of visual comparison across space, however it was quite useful for both: 1) considering the spatial patterns of watershed recession behavior, and 2) comparing classification methods and how they influence the outcome of the analysis. Because it is just a visual comparison, there are no quantifiable differences presented here, which will be important moving forward. Additionally, this was an important exercise to understand the mechanical steps necessary for making this comparison.

Monitoring Postfire Salvage Logging Effects on Woodpecker Population Dynamics

Research Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?


This project will use datasets targeting the 2015 Canyon Creek fire complex on the Malheur National Forest in eastern Oregon. A salvage logging operation occurred in the burn area in July 2016. My research is in cooperation with a Rocky Mountain Research Station study examining salvage logging effects on three woodpecker species in the Canyon Creek complex. In 2016 and 2017 I led crews on this project collecting extensive postfire woodpecker occupancy and nest productivity datasets for black-backed, white-headed, and Lewis’s woodpecker populations. This resulted in a 148-nest dataset for 2016 and 2017, representing woodpecker populations before and after salvage logging. A polygon shapefile outlining ten RMRS woodpecker point count survey units serves as the area of interest (6 treatment, 4 control). Within the 6 treatment units, another polygon shapefile outlining 34 salvage harvest units indicates treatment areas. Three silvicultural prescriptions replicating optimal habitat types for each woodpecker species designate salvage variables like post-harvest stand density and diameter. Each salvage unit adheres to one of these three harvest prescriptions. 2016 pre-salvage and 2017 post-salvage lidar datasets are in processing for eventual correlation between 3D forest structure variables and woodpecker nest site selection before and after harvest. Supplementary geospatial data includes a 2015 WorldView-3 1 m raster and ArcGIS basemaps.

Image result for canyon creek fire oregon   

Above: The 2015 Canyon Creek Fire burning near John Day, OR.

Above: The Canyon Creek fire complex as a false color WorldView-3 1 m raster. The area of interest includes 10 study units in blue, labeled with yellow text (6 treatment, 4 control). This visual orients the survey units to an area in eastern Oregon southeast of John Day and Canyon City. The false color image displays healthy vegetation as red, with the darkest areas displaying high burn severity. The survey units are found within some of the highest burn severity areas in the fire complex.

Above: A close-up of the 34 salvage treatment polygons outlined in red and labeled with white text. Control units lack red salvage polygons. This image does not include Overholt Creek.

Above: A subset of the 2016 and 2017 nest points featuring survey and salvage unit polygons.


I expect to see dispersed nests in 2016 with possible trends indicating species habitat preferences. Previous research indicates species-specific preferences for certain forest habitat variables. Black-backed woodpeckers prefer dense, small-diameter stands for foraging and nest excavation. White-headed woodpeckers prefer a mosaic of live and dead variable density and diameter stands for pine cone foraging. Lewis’s woodpeckers prefer tall medium to large-diameter stands for aerial foraging maneuvers. I expect to see nest sites in both years clustered in areas with these forest structures. In 2017 I also expect to see nest sites clustered near salvage treatments implemented for each species. Overall I expect the control units to exhibit nest dispersal and high woodpecker activity.

Image result for black-backed woodpecker              Image result for white headed woodpecker             Image result for lewis's woodpecker

Black-backed woodpecker (Picodies arcticus)     White-headed woodpecker (Picoides albolarvatus)                       Lewis’s woodpecker (Melanerpes lewis)


A graphic depicting 3 salvage harvest treatment types and a control designed to benefit each of the target woodpecker species (Dave Halemeier 2016).


Analyses will consider pre- and post-salvage variables to determine changes in forest structure alongside woodpecker population dynamics. I would like to learn about spatial autocorrelation analyses such as Moran’s I. It is likely I will find patterns of dependent observations based on localized conditions. Woodpecker species presence and nest locations may be affected by burn severity, since highly weakened trees will host their primary food source, bark beetle larvae. In 2017 woodpecker species presence may be grouped according to salvage treatments targeting each species, or control areas. Regression analyses showing the relationship strength between nest distance from salvage units and salvage treatment types could indicate certain forest variables affecting postfire woodpecker colonization.

Expected Outcome

Regression coefficients describing the relationship between woodpecker presence and salvage treatment location and type will help develop inferences towards postfire management effects. I will create interpretive maps of nest locations showing survey unit and salvage unit polygons. These maps could include statistical and geospatial relationships represented with different colors and symbols. Eventually, I will geovisualize the lidar data with these maps and statistical relationships for a comprehensive and communicative representation of woodpecker population and forest structure change dynamics.


I am processing two lidar datasets of the study area from 2016 and 2017. These datasets were acquired before and after the salvage logging treatments occurred. I will produce forest metrics such as stand density, diameter class, and height in salvage and survey units. I will then correlate geospatial and statistical properties of the nest dataset to quantified forest variables affecting woodpecker nest site selection. I will examine trends between 2016 and 2017 nest selection to understand the effects of harvest treatments on woodpecker populations. At least two more years of woodpecker data will exist for 2018 and 2019, so future research will add these datasets to the analyses. I would like to see a machine learning algorithm developed from this study that could predict areas of optimal habitat suitability for snag-dependent wildlife species. Postfire wildlife habitat prediction will be crucial to resource managers as wildfires increase in the coming decades alongside accelerated landscape restoration.

This spatial problem is important to science and resource managers as climate change amplifies wildfire effects. Using 3D remote sensing datasets for resource management is the future trend across all disciplines. Increased wildfire activity around the world necessitates cutting-edge methods for active fire and postfire ecosystem quantification. In the Blue Mountains ecoregion in eastern Oregon, Rocky Mountain Research Station, Malheur National Forest, and Blue Mountains Forest Partners rely on this project’s lidar quantification for their research and management decisions. Determining acceptable levels of salvage harvest for wildlife habitat affects whether government agencies and rural economies in this region will allow small businesses to profit from sustainable harvest operations. Science overall will benefit from the continued investigation of wildlife response to anthropogenic disturbance, specifically postfire forest management decisions and the controversial practice of salvage logging.

Above: A salvage treatment in the Crazy Creek woodpecker survey unit.


I took an ArcInfo class (ARC Macro Language) during my undergraduate program. I am currently taking a Python class for geospatial programming. I have experience in image processing through an undergraduate degree in land use and GIS, multiple years of professional experience in remote sensing and GIS, and my current MS program in Forest Geomatics. I have academic and professional experience with R, C++, ArcGIS, and multiple types of remote sensing software for 2D and 3D data analysis.

Landscape Patterns as Predictors of Tree Height

Question: Which landscape features correspond to clusters of greater than expected trees?

Methods: I performed two Hot Spot Analyses in ArcMap; one on Hot Spots of tree height and another on hot spots of distance between trees. Both were constrained to the reserved control areas of the HJ Andrews Forest. Hot spots in tree height are regions of greater than expected tree height, while hot spots in distances between trees are regions of greater than expected distance between individual trees (more dispersed trees). Hot spots and cold spots of each analysis generally overlapped. However, hot spots between tree height and spacing did not overlap in all cases, so I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. Since the end goal is to find landscape features that may correlate with amount of soil carbon, I conducted this analysis with the assumption that taller trees may correlate with regions of greater soil carbon. I used the package ‘caret’ in R to calculate a  confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots (Table 1).

Results: Regions of taller than expected trees often correspond to regions of greater than expected distances between trees, which agrees with current forest growth models (Fig. 1). Hot spots of tall trees are typically in valleys and cold spots are commonly on ridges (Fig 3 & 4). When we zoom in to the Lookout Mountain area of HJ Andrews, we see that hot spots of tall trees are more concentrated in valleys and on footslopes, and cold spots are closer to mountain ridges (Fig 3). When compared with the distance hot spot map of the same area, we see that cold spots go much further down the footslopes and even into the valleys in some cases (Fig 4). So although we have evidence for a strongly linear relationship between height and distance between trees, we also have evidence that they do not fully explain each other and other landscape features are likely at play.

Fig 1. Distance Z-scores vs. Height Z-scores from hot spot analyses show a linear relationship.

Fig 2. HJ Andrews elevation with reserved control areas in orange and

inset area of Lookout Mountain hot spot maps (below)


Fig 3. Hot Spot Analysis showing hot spots of tree heights (tallest trees)

in the Lookout Mountain area


Fig 4. Hot Spot Analysis showing the greatest distance between trees

in the Lookout Mountain area


An elevation band that correlates with occurrences of tall trees exists up to around 1100 m, after which point number of tall trees drops off substantially (Fig. 5). Certain aspects seem to correlate with taller trees, but those relationships are harder to tease apart and I have yet to fully explore them. Greater slopes tend to correlate with shorter trees, but this relationship is not linear. There is an interesting upwards trend at slopes between 30 and 50 degrees that seems to correlate with slightly taller trees, then a big drop in mean height Z-score at slopes of 60 degrees.

Fig 5. Aspect, elevation and slope compared with Z-scores of mean height.

A comparison of Z-scores from hot spot analyses of height and distances shows that although hot spots of height and distance tightly correlate, covariates that explain them are different (mean slope and elevation). When we compare the most extreme Z-scores to one another, slope, height and distance between trees are not particularly different. Mean elevation in three categories of Z-score is similar, but mean elevation in the fourth group (>3,>3) is significantly lower. A next step is to map out these

Table 1. Comparison between the most extreme Z-scores of tree height and tree spacing.

Height Z-Score Distance Z-Score Mean Height (m) Height_SD Mean Distance (m) Distance_SD Mean Slope (m) Slope_SD Mean Elevation (m) Elevation_SD
<-3 <-3 22.8 10.5 5.1 2.4 27.9 10.5 1285 294
<-3 >3 24.5 11.2 5.6 2 26.9 4.5 1377 153
>3 <-3 34.8 7.1 5 2 31.8 5.9 1310 44
>3 >3 39.5 16.7 4.7 2.6 26.2 11 934 188

Critique: These analyses are still based on Hot Spot Analyses, so they still comes with the same criticisms as previous Hot Spot Analyses. One of these criticisms was that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I have yet to map out regions that were well-predicted and poorly predicted spatially, so I cannot fully discuss the merits of the confusion matrix method.


Exercise 3: Lagoons, ENSO Indices, and Dolphin Sightings

Exercise 3: Are bottlenose dolphin sightings distances to nearest lagoon related to ENSO indices in the San Diego, CA survey site?

1. Question that you asked

I was looking to see a pattern at more than one scale, specifically the relationship with ENSO and sighting distributions off of San Diego. I asked the question: do bottlenose dolphin sighting distributions change latitudinally with ENSO related to distance from the nearest lagoon. The greater San Diego area has six major lagoons that contribute the major estuarine habitat to the San Diego coastline and are all recognized as separate estuaries. All of these lagoons/estuaries sit at the mouths of broad river valleys along the 18 miles of coastline between Torrey Pines to the south and Oceanside to the north. The small boat survey transects cover this entire stretch with near-exact overlap from start to finish. These habitats are known to be highly dynamic, experience variable environmental conditions, and support a wide range of native vegetation and wildlife species.

Distribution of common bottlenose dolphin sightings in the San Diego study area along boat-based transects with the six major lagoons.


0 Buena Vista Lagoon
1 Agua Hedionda Lagoon
2 Batiquitos Lagoon
3 San Elijo Lagoon
4 Tijuana Estuary
5 Los Penasquitos Lagoon
6 San Dieguito Lagoon

2. Name of the tool or approach that you used.

I utilized the “Near” tool in ArcMap 10.6 that calculated the distance from points to polygons and associated the point with FID of that nearest polygon. I also used R Studio for basic analysis, graphic displays, and ANOVA with Tukey HSD.

3. Brief description of steps you followed to complete the analysis.

  1. I researched the San Diego GIS database for the layer that would be most helpful and found the lagoon shapefile.
  2. Imported the shapefile into ArcMap where I already had my sightings, transect line, and 1000m buffered transect polygon.
  3. I used the “Near” tool in the Analysis toolbox, part of the of the “proximity toolset”. I chose the point to polygon option with my dolphin sightings as the point layer and the lagoon polygons as the polygon layer.
  4. I opened the attribute table for my dolphin sightings and there was now a NEAR_FID and NEAR_DIST which represented the identification (number) related to the nearest lagoon and the distance in kilometers to the nearest lagoon, respectively.
  5. I exported using the “conversion” tool to Excel and then imported into R studio for further analyses (ANOVA between the differences in sighting distances to lagoons and ENSO indices).

4. Brief description of results you obtained

After a quick histogram in ArcMap, it was visually clear that the distribution of points with nearest lagoons appeared clustered, skewed, or to have a binomial distribution, without considering ENSO. Then, after importing into R studio, I created a box plot of the distance to nearest lagoon compared to the ENSO index (-1, 0, or 1). I ran an ANOVA which returned a very small p-value of 2.55 e-9. Further analysis using a Tukey HSD found that the differences between ENSO states of neutral (0) and -1 and neutral and 1 were significant, but not between 1 and -1. These results are interesting because this means the sightings of dolphins differ most during neutral ENSO years. This could be that certain lagoons are preferred during extremes compared to the neutral years. Therefore, yes, there is a difference in dolphin sightings distances to lagoons during different ENSO phases, specifically the neutral years.

Histogram comparing the distance from the dolphin sighting to nearest lagoon in San Diego during the three major indices of El Niño Southern Oscillation (ENSO): -1, 0, and 1.


Violin plot showing the breakdown of distributions of dolphin sighting distances to lagoons (numbered 0-6) during the three different ENSO indices.

5. Critique of the method – what was useful, what was not?

This method was incredibly helpful and also was the easiest to apply once I got started, in comparison to my previous steps. It allowed to both visualize and quantify interesting results. I also learned some tricks for how to better graph my data and to symbolize my data in ArcMap.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Does the presence of Olympia oysters correspond with predicted suitable habitat?

Question explored
In my last blog post, I mapped habitat suitability for Olympia oysters in Yaquina Bay, OR by assessing three environmental parameters: salinity, substrate availability, and elevation. In exercise 2, I brought in oyster location data points collected from field surveys of the intertidal zone to compare against the map of suitable habitat. The question I am examining in this exercise is:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Field surveys of the intertidal zone of Yaquina Bay were conducted on April 19-20 and May 17, 2019 during low tides. Oysters were characterized as ‘present’ if evidence of at least one oyster (living or dead) was detected within the predefined search area.

Name of tool or approach
I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro for spatial analysis. Statistical information was reviewed and plotted in Excel.

Brief description of steps to complete the analysis

  1. After validating the data point locations and performing some minor quality control in Google Earth, I saved the points as a KML file. In ArcGIS Pro, I used the geoprocessing tool ‘KML to layer’ to convert them for analysis. Once I added the data points onto the map as a layer, I edited the symbology to display the points as ‘Present’ or ‘Absent’.
  2. To assess the neighboring habitat, defined by 4 class types from least suitable to most suitable, surrounding each of the data points, I used the ‘Multiple Ring Buffer’ tool to create 3 buffer rings around each of the points at distances of 75, 150, and 300 meters. For the results in this blog post, only the 300-meter buffer was used. I selected ‘overlapping (disks)’ in the dissolve option to assess the habitat around each data point individually.
  3. Once the buffers were created, I used the ‘Zonal Statistics’ tool to overlay the buffered areas onto the raster of habitat suitability. This tool allows the user to select by statistic desired (mean, median, etc.) to generate a spatial output. I chose ‘majority’, which categorizes the buffer zones based on dominant habitat suitability type within the buffer. Majority also represented the median in this output. For example, if the majority of the suitable habitat within the buffer area is class 4 (most suitable) then the buffer display is shown as ‘4’.
  4. In addition to the ‘Zonal Statistics’ spatial output, I used the tool ‘Zonal Statistics as Table’ to generate a table of all the statistical information relevant to this analysis. The same input data is used (overlaying the buffered zones on the habitat suitability raster) to create this table.
  5. I copied the table generated into Excel where I split up the data on a couple levels: 1) Presence vs. Absence and 2) North shore vs. South shore for comparison. The north and south shores are managed very differently: the north shore is largely composed of rip rap and steeper slopes because Yaquina Bay Road runs right along the edge, whereas the south shore is more natural and less developed. I created box and whisker plots of the majority habitat suitability type, minority, variety, and mean.


The results show some mixed information. When looking ‘Presence’ data points, the majority habitat types surrounding these points on both the north and south shore are 3 or 4, most suitable areas. The minority habitat type is very different between the north and south shore, with the south, more natural shoreline showing a stronger correlation with low suitability habitat being found less often around presence data points. The means for presence data points generally correspond with greater habitat suitability.

However, the absence data points show that the majority habitat type tends to be more closely aligned with predicted most suitable habitat, especially on the south shore. Additionally, the minority habitat types surrounding the absence data points are 1, 2, and 3, indicating that the least suitable habitat does not constitute much of the area. This could be partly due to low coverage by the least suitable habitat type overall (see maps). There appears to be very weak correlation between absence of oysters and location of suitable habitat. Absence of oysters is likely to be recorded in both suitable and unsuitable habitat.

Critique of the method
This method further revealed to me that the importance of the resolution of the baseline data. Based on field observations and conversations with shellfish biologists, the distribution of Olympia oysters is very patchy due to substrate availability. The oysters may be found attached to a pile of rocks in the middle of the mud flat, but will not be found elsewhere in the mud flat. The raster layer I have available for substrate has classified substrate into large generalized categories, which does not reflect the nuanced nature of their opportunistic settling strategy. Dividing habitat suitability into only 4 categories limits the complexity of the analysis which can be helpful, but also means that there’s not a lot of distinction between suitable and unsuitable. Additionally, more data points will help make this analysis more robust.

Using the buffers and the ‘Zonal Statistics’ tools created a generalized output that provides some useful information for analyzing habitat suitability for the oysters. The approach is easily duplicated, which was helpful as I needed to add my field data points in batches as I collected them. What would be more informative for the next iteration is to be able to analyze multiple buffers side-by-side; how does the smallest neighborhood around each point compare to the larger ones?

Ex. 3: Does black stain spread through landscape networks?


For those who have not seen my previous posts, my research involves building a model to simulate the spread of black stain root disease (a disease affecting Douglas-fir trees) in different landscape management scenarios. Each of my landscapes are made up of stands assigned one of three forest management classes. These classes determine the age structure, density, thinning, and harvest of the stands, factors that influence probability of infection.


Are spatial patterns of infection probabilities for black stain root disease related to spatial patterns of forest management practices via the connectivity structure of the network of stands in my landscape?


 I decided to look at how landscape connectivity influenced the spatial relationship between forest management practices and infection probabilities. This approach builds off of a network approach based in graph theory (where each component of the landscape is a “node” with “edges” connecting them) and incorporates concepts from landscape ecology regarding distance-dependent ecological processes and the importance of patch characteristics (e.g., area, habitat quality) in the contribution of patches to the connectivity of the landscape. I used ArcMap, R, and a free software called Conefor (Saura and Torné 2009) to perform my analysis.


 1. Create a mosaic of the landscape

The landscape in my disease spread model is a torus (left and right sides connected, top and bottom are connected). The raster outputs from my model with stand ID numbers and management classes do not account for this and are represented as a square. Thus, in order to fully consider the connectivity of each stand in the landscape, I needed to tile the landscape in a 3 x 3 grid so that each stand at the edge of the stand map would have the correct spatial position relative to its neighbors beyond the original raster boundary. I did this in R by making copies of the stand ID raster and adjusting their extent. In ArcMap, I then assigned the management classes to each of those stands, converting to polygon, using the “Identity” tool with the polygon for management class, and then using the “Join Field” tool so that every stand with the same unique ID number would have the relevant management class assigned. If I had not done this step, then the position of stands at the edge of the raster in the network would have been misrepresented.

2. Calculate infection probability statistics for each stand

I then needed to relate each stand to the probability of infection of trees in that stand (generated by my model simulation and converted to point data in a previous exercise). In ArcMap, I used the “Spatial Join” tool to calculate statistics for infection probabilities in that stand, because each stand contains many trees. Statistics included the point count, median, mean, standard deviation, minimum, maximum, range, and sum.

3. Calculate the connectivity of each stand in the network of similarly managed stands in the landscape

3a. For this step, I used the free software Conefor, which calculates a variety of connectivity indices at the individual patch and overall landscape level. First, I used the Conefor extension for ArcMap to generate the input files for the Conefor analysis. The extension generates a “nodes” file for each feature and a “connection” file, which contains the distances between features a binary description of whether or not a link (“edge”) exists between two features. One can set the maximum distance for two features to be linked or generate a probability of connection based on an exponential decay function (built-in feature of Conefor, which is an incredible application). For my analysis, I performed connectivity analyses that only considered features to be linked if (i) they had the same management class and (ii) there were no more than 10 meters of distance between the stand boundaries. Ten meters is about the upper limit for the maximum likely root contact distance between two Douglas-fir trees.

3b. For each management class, I ran the Conefor analysis to calculate multiple metrics. I focused primarily on:

  • Number of links in the network
  • Network components – Each component is a set of connected patches (stands) that is internally connected but has no connection to any other set of patches.
  • Integral Index of Connectivity (IIC) – Essentially, this index gives each patch (stand) a value in terms of its importance for connectivity in the network based on its habitat attributes (e.g., area, habitat quality) and its topological position within the network. For this index, higher values indicate higher importance for connectivity. This is broken into three non-redundant components that sum to the total IIC:
    • IIC intra – connectivity within a patch
    • IIC flux – area-weighted dispersal flux
    • IIC connector – importance of a patch for connecting other patches in the network) (Saura and Rubino 2010)
  1. Analyze the relationship between connectivity metrics and infection probabilities

I reduced the mosaic to include only feature for each stand, eliminating those at the periphery and keeping those in the core. I confirmed that the values were similar for all of the copies of each stand near the center of the mosaic. I then mapped and plotted different combinations of connectivity and infection probability metrics to analyze the relationship for each management class (Fig. 1, Fig. 2).

Fig. 1. Map of IIC connectivity index and mean infection probability for the extensively managed stands.


I generally found no relationship between infection probability and the various metrics of connectivity. As connectivity increased, infection probabilities did not change for any of the metrics I examined (Fig. 2). I would like to analyze this for a series of landscape simulations in the future to see whether patterns emerge. I could also refine the distance used to generate links between patches to reflect the dispersal distance for the insects that vector the disease.

Fig. 2. Plots of infection probability statistics and connectivity metrics for each of the stands in the landscape. Each point represents one stands in the randomly distributed landscape, with extensively managed stands in red, intensively managed stands in blue, and old-growth stands in green.

CRITIQUE OF THE METHOD – What was useful, what was not?

I had originally planned to use the popular landscape ecology application Fragstats (or the R equivalent “landscapemetrics” package), but I ran into issues. As far as I could tell (though I may be incorrect), these options only use raster data and consider one value at a time. What I needed was for the analysis to consider groups of pixels by both their stand ID and their management class, because stands with the same management class are still managed independently. However, landscapemetrics would consider adjacent stands with the same management class to be all one patch. This meant that I could only calculate metrics for the entire landscape or the entire management class, which did not allow me to look at how each patch’s position relative to similarly or differently managed patches related to its probability of infection. In contrast, Conefor is a great application that allows for calculation of a large number of connectivity metrics at both the patch and landscape level.


Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Ex 3: Grain size distributions and peak flow events

The context:

For this exercise, I wanted to figure out if there was a relationship between the temporal pattern of peak discharge in my study creeks and the temporal pattern of grain size distributions. The temporal pattern of grain size distributions could help tell a story about the interplay between hillslope and alluvial process and the forces involved in shaping and stabilizing the streams. One might predict that changes in grain size distribution might be related to extreme events. Large events could associated with debris flows which might reduce sorting while more moderate flow events could increase sorting.

The data:

Pebble counts were conducted in conjunction with cross section surveys at five reaches between 1995 and 2011. They show both the median (D50) and standard deviation of grain size varying over time.

During this time period, two different cross section sampling methods were used, depending on the year. In one method, every cross section in a given year was surveyed. In another method, the only cross sections sampled were ones in which the field crew thought that the creek bed had changed. Because of this, the grain size samples in some year are incomplete and biased towards conditions that favor visible change.

The graphs below show the mean of the D50 and mean of the standard deviation for each set of cross section for each year sampled in subfigure a. The error bars show standard error. The faded points represent years where the field crew only sampled a subset of the cross sections. Subfigure b shows the area-normalized annual peak discharge for the stream gauges on Mack and Lookout Creek. Cold Creek is ungauged.


These figures imply that most reaches were the least sorted in 1997, the year after the flood of record, followed by various decreases and increases over time.

The questions

I asked the following questions to try to understand the relationship between peak flow and grain size distribution:

  1. Is the standard deviation of grain size in a given water associated with the peak flow from that water year?
  2. Is the standard deviation of grain size in a given water associated with the peak flow from the previous water year?
  3. Is the change in standard deviation of grain size between two years associated with the largest peak flow from the interceding years?
  4. Is the change in D50 between two years associated with the largest peak flow from the interceding years?


The methods and results

I used simple linear regression to relate each of these variables and took the R2 value to represent how much of the sediment distribution variability each variable might explain. The results were negative in all cases.


SITE  Rsquared

1 LOL    0.00983

2 LOM    0.00320

3 MAC    0.0340

4 MCC    0.106


SITE  Rsquared

1 LOL     0.0546

2 LOM     0.0297

3 MAC     0.0191

4 MCC     0.217


SITE  Rsquared

1 LOL     0.0845

2 LOM     0.0718

3 MAC     0.0842

4 MCC     0.0852


SITE  Rsquared

1 LOL    0.0440

2 LOM    0.0252

3 MAC    0.0553

4 MCC    0.00688

The answer to these questions all appear to be no – clearly hydrology has some impact on grain size distributions, but the relationship may be too complicated to address using a single predictor variable and limited data.


Exercise 3: Perception of natural resource governance and changes in environment over time


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

Tools and Approaches

  1. Tabulate area in ArcGIS Prop
  2. T-test in R

Analysis Steps

  1. To begin to answer this question, I first needed to find spatial layers of the environmental condition in the Puget Sound over time. I could not use the same ‘environmental effects’ data I used for Exercise 2, as the data was only completed 5 months prior. Instead, I substituted two land cover maps from 1992 and 2016 to approximate environmental condition.

I first loaded the raster layers into ArcGIS Pro. I then reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

  1. To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.


Land cover change differed significantly for four different land cover types between the two groups—grassland, forest, shrub/scrub, and bare land cover. Grassland cover decreased for both groups, but decreased by 5% more in in the areas with individuals with below average governance perceptions. Forest cover also decreased for both group, but decrease by 1% more individuals with below average governance perceptions (the amount of forest cover in the region is very large which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below average governance perceptions. Lastly, bare land decreased for both, but decreased by 5% more for individuals in areas with individuals with below average governance perceptions (Table 1).

The effect size of land cover change on perception was small for each of the significant variables with biserial correlations between .07 and .09 (Table 1).

Table 1. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

  Governance Perception1      
  Above Average Below Average t-value p-value Effect size (rpb)
High Development 17 16 0.80 .426 .02
Medium Development 19 19 0.67 .500 .02
Low Development 11 11 0.22 .829 .00
Developed Open Space 12 13 0.21 .837 .00
Agriculture -1 0 0.74 .461 .02
Grassland -13 -18 3.03 .002 .09
Forest -10 -9 2.61 .009 .08
Shrub/Scrub 40 43 3.04 .002 .09
Wetlands 0 0 1.67 .095 .05
Unconsolidated Shore 2 2 0.10 .918 .00
Bare Land -21 -25 2.43 .015 .07
Water 0 0 0.90 .368 .03
Aquatic Beds -1 -5 1.19 .233 .03
Snow/Tundra 0 0 0.00 .997 .00

1Cell entries are percent changes of land cover from 1992 to 2016


  1. Critique

The biggest problem I had with this method was figuring out the most appropriate way to test my question. I struggled with classifying ‘good’ or ‘bad’ land cover change, and ultimately decided it was inappropriate. I also don’t think land cover is necessarily the best way to test environmental condition. I think it would be more appropriate if I were able to use environmental effects over time. I think it would also be best if I did different sized buffers around individuals.

Additionally, I believe I should have controlled for how long they lived in the region. The average number of years individuals have lived in the region was 34, which is less than the land cover change of 23 years. However, as years lived in the Puget Sound is a significant determinant of perception, it may have more to add in this case. The problem here is that this method is clunky. The way in which I differentiated high/low perception was by having different csv files. Having to then split these by years in the region would create an additional step, and more files, that could ultimately increase personal error because of the difficulty in keeping track of which buffer is which type of individual.


Ex 3: Analysis of juniper density and slope, aspect characteristics

1.Question asked and data used: 

How does juniper density vary with slope and aspects characteristics (as indicators of soil moisture)?

This study site for this analysis is the Camp Creek Paired Watershed Study (CCPWS), located in a semiarid region of central Oregon. This research focuses on two watersheds: one in which most juniper was removed in 2005-2006, referred to as the treated WS, and one watershed in which mature juniper is the dominant overstory species, referred to as the untreated WS. Belt transects (3m wide by 30m long) which recorded the number, height, and canopy cover of juniper within the treated WS were used for analysis. The number of juniper found in each transect was represented as a single point at the beginning of each transect. Additionally, NAIP imagery and a 30m DEM were used to assess juniper density and slope and aspect characteristics across both watersheds and in the surrounding areas.

2. Approach and tools used:

Analysis was conducted using R Studio, ArcGIS Pro, and Excel. A Chi-squared test and logistic regression were calculated in R, followed by Geographically Weighted Regression (GWR) in ArcGIS Pro. Excel was used to calculate the percentage of pixels classified as juniper within each buffer.

3. Steps used in analysis:

     A. Data preparation

              1. The following steps were conducted in a previous exercise: 1) a 30m DEM was used to extract slope and aspect characteristics, 2)slope and aspect values were each divided into nine groups, representing characteristics generally associated with increased soil moisture (e.g., flat slopes with northerly aspects were rated highest), 3)using the raster calculator the values of slope and aspect were averaged ((slope category+aspect category)/2). The raster created using this process rated each grid square from 1-9 (henceforth referred to as “rating model”). The rating model is shown below.

             2.Classification of NAIP imagery (also from a previous exercise). The NAIP imagery was classified using Support Vector Machine supervised classification. (Please note: the resolution of the NAIP imagery makes identification of juniper saplings difficult, therefore conclusions from the data should be made with caution.) The NAIP raster was resampled to the same resolution as the 30m DEM. The pixels were reclassified into a binary raster, with each pixel being either “juniper” or “not juniper”. The resampled NAIP imagery is shown below.

            3.The classified NAIP raster and rating model were projected to NAD 1983 UTM zone 10. Both rasters were masked to cover the same extent.

            4. A table was created which represented the number of juniper found per transect as a point. Using the “extract multi values to points” tool in ArcGIS the corresponding value from the rating model was assigned to each transect point.

             5. In a previous exercise, a buffer of 50m was created surrounding each of the belt transects (treated WS only). The percentage of juniper (pixels classified as juniper/total pixels) was calculated in excel and added to the belt transect table created previously.

             6. Using the “create random points” tool in ArcGIS, 96 random points were created within the study area (treated WS, untreated WS, as well as surrounding area). These points were a minimum of 150m apart.

             7. A 50m buffer was created surrounding each of the 96 random points. The percentage of juniper in each buffer was calculated as indicated above. These values were added (using the join function) to a table with the random points.

      B. Analysis

             1. The binary classified NAIP raster and rating model were loaded into R Studio and reviewed (addresses have been abbreviated):





             2. Chi-squared test was conducted for the rating model and classified NAIP raster (R):


             3. Logistic model, summary of results and associated plots (R):<-glm(juniperonly_NAIP[]~apslop[],data=v2,family=binomial)


              4. GWR was conducted in ArcGIS for the belt transects (in treated WS), using the juniper density in the 50m buffer as the dependent variable and the rating model as the independent variable. The distance band neighborhood type and golden search neighborhood selection were used.

              5. GWR was conducted in ArcGIS using the 96 random points (in treated WS, untreated WS, and surrounding areas). The same neighborhood parameters described above were used.

              6. Global Moran’s I was calculated using the residuals for both GWR analyses. Inverse distance and the Euclidean distance method were used for analysis.

4. Overview of Results:

      A. The Chi-squared test of the binary NAIP raster (indicating juniper versus non-juniper) and the rating model raster indicated a slightly significant relationship:

Pearson’s Chi-squared test data:

apslop[] and juniperonly_NAIP[]

X-squared = 28.624, df = 15, p-value = 0.01798

      B. The logistic regression model indicated a significant relationship between the aspect and slope classification (from the rating model) and the presence or absence of juniper (dependent variable). However, the difference between the null deviance and the residual deviance suggests that this may not be an appropriate model for predicting the presence or absence of juniper. The exponent of the coefficient indicated that for every increase in aspslop rating the probability of juniper being present increased by 1.06.

Call: glm(formula = juniperonly_NAIP[] ~ apslop[], family = binomial, data = v2) Deviance Residuals:

Min 1Q Median 3Q Max

-0.6934 -0.6576 -0.6317 -0.5984 1.9537


Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.80732 0.11048 -16.359 < 2e-16 ***

apslop[] 0.05935 0.01948 3.046 0.00232 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7425.5 on 7761 degrees of freedom

Residual deviance: 7416.1 on 7760 degrees of freedom

(656 observations deleted due to missingness)

AIC: 7420.1

Number of Fisher Scoring iterations: 4

(Intercept)      apslop[]

0.164094       1.061147

      C. The GWR of treated WS dataset indicated an adjusted R-squared value of 0.28. Also, the following warning was indicated for this analysis: “At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results”.  This further suggests that this model may not be indicative of the explanatory factors related to juniper density. The GWR results are provided below.

Golden Search Results

Distance Band     AICc

292.6545      222.3773

884.3284      221.4705

518.6538      216.8374

658.3291      219.2658

432.3298      217.0552

572.0050      217.5695

485.6810      216.6200

465.3026      216.6782


 WARNING 110259: At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results.

——————————– Analysis Details ——————————-

Number of Features:                                                             33

Dependent Variable:                      BELTTRANSECTS_50M_EXCELTOTABLE.F__JUNIPER


Distance Band:                                                            465.3026


——– Model Diagnostics ———-

R2:                              0.4918

AdjR2:                           0.2845

AICc:                          216.6782

Sigma-Squared:                  28.4635

Sigma-Squared MLE:              20.4670

Effective Degrees of Freedom:   23.7290

     D. The GWR performed for the larger dataset (points distributed across both watersheds and in the surrounding areas) indicated a similar R-squared value to the GWR based in the treated WS only. A similar warning regarding the residuals was indicated for this GWR analysis. The results of this GWR are displayed below.

Golden Search Results

Distance Band     AICc

1360.4243     -98.1728

3270.5186     -84.8038

2090.0154     -90.0118

2540.9275     -87.1310

1811.3364     -92.9095

1639.1033     -95.1341

1532.6574     -96.4953

1466.8702     -97.2349

1426.2115     -97.6318

1401.0830     -97.8502

1385.5527     -97.9775

1375.9545     -98.0539


WARNING 110259: At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results.

—————————- Analysis Details —————————

Number of Features:                                                     96


Explanatory Variables:            RANDOM_POINTS_105.ASPSLOP_PROJECTED_CLIP

Distance Band:                                                   1375.9545


——— Model Diagnostics ———-

R2:                              0.2932

AdjR2:                           0.1943

AICc:                          -98.0539

Sigma-Squared:                   0.0190

Sigma-Squared MLE:               0.0167

Effective Degrees of Freedom:   84.3419

     E.The Global Moran’s I for the GWR residuals from the belt transects dataset (treated WS) indicated clustering (p<0.0001) while the GWR residuals for the larger dataset (both WS with surrounding area) indicated random distribution (p=0.327).

5. Discussion/critique of methods:

The analysis used here suggests that other explanatory factors (outside of the rating model created based slope and aspect) may be in place which influence the density of juniper. The R-squared values of this analysis improved compared to previous exercises but are still relatively low. Additionally it should be noted that the presence of juniper has been correlated to reduced soil moisture in some conditions (Lebron et al., 2007), therefore slope and aspect may not be sufficient representatives of soil moisture in these watersheds.

The intent of this analysis was to develop a potential workflow that can be used with other datasets. Caution should be used in drawing any conclusions from this analysis specifically. Additionally, caution should be taken in interpreting the results of this analysis due to the characteristics of the datasets used. For instance, the resolution of the NAIP imagery makes the detection of juniper saplings difficult. Therefore the classification results may not be an accurate representation of juniper density, particularly in the treated WS. However, this process may be applied to more expansive data to include higher-resolution imagery (such as imagery collected using UAVs, etc.).


Lebron, I., Madsen, M. D., Chandler, D. G., Robinson, D. A., Wendroth, O., & Belnap, J. (2007). Ecohydrological controls on soil moisture and hydraulic conductivity within a pinyon-juniper woodland. Water Resources Research, 43(8), W08422.


Does proximity of open areas influence invasibility of forested areas?

Question Asked

I am interested in how the spatial pattern of non-forested areas influences invasibility of forest plots by the exotic annual grass vententata (Ventenata dubia). For exercise 3 I asked the question, how are forest plots with high and low ventenata cover related to the spatial pattern of canopy openness?

Tools and Approaches Used

To explore this question, I performed a Ripley’s cross-K between ventenata forest plots with high and low cover and non-forest areas (Ripley 1976). This analysis is used to interpret spatial relationships between two point patterns. I used ArcGIS and R to answer the question following the steps below:

  • I added my sample plots with ventenata cover and coordinates and fire perimeters into ArcGIS.
  • To generate non-forest points, I added canopy cover raster data for the study area into ArcGIS with values ranging from 0 to 95% canopy cover.
  • I buffered the sampled fire perimeters by 3000m using the “buffer” tool and generate 200 random points, not closer than 100m in each buffer using the “create random points” tool in ArcGIS.
  • I extracted the raster values to the random points using the “values to points” tool, and extracted points that had <10% canopy cover into a new layer file for open areas.
  • I then extracted my sample plots only from forested areas into a forest_plots layer.
  • I exported these plots and the random points into a data frame with spatial data and an indicator for sites with high ventenata (>15% cover), low ventenata (0-15% cover), or random open points with no ventenata recorded (labeled “high”, “low”, and “open” respectively).
  • I performed a Ripleys cross K between the high and low ventenata forest plots and non-forest “open” areas and plotted the response in R with the package “spatstat” (Thanks, Stephen C. for sharing R code!). I repeated the analysis individually for three sampled fires (Fox, South Fork, and Canyon Creek) to avoid picking up the inherently clustered spatial pattern of my sample plots and overshadowing the research question. However, I also performed the analysis on the entire sample area to view where issues arise when ignoring a clustered sample design and compare to results from individual fires.
    1. In order to perform the cross Ripleys K on each sampled fire, I first had to establish rectangular “windows” around each fire and provide the xy coordinates for the vertices using tools in ArcGIS depicted below.

Results & Discussion

Most annual grass species prefer areas of low canopy cover. These open areas can, however, act as source populations for invasion into more resistant forested areas. I hypothesized that ventenata (an annual grass) abundance in forested areas would increase with increased proximity and abundance of open areas near the sampled forest (i.e. sites with high ventenata cover in forested areas would be more spatially “clustered” around non-forest areas than sites with low ventenata cover).

The results of my Ripley’s cross-K show that the spatial relationship between sites with high/ low ventenata cover differs between fires. At the Fox fire (FOX), high and low sites closely follow the projected Poisson distribution when crossed with open areas, suggesting that ventenata abundance in forest sites is not related to nearby open areas. However, in the Corner Creek (CC) and South Fork (SF) fires, heavily invaded forest sites are more clustered around open points and low sites are more dispersed around open points than would be expected by chance. I may not have seen a relationship between open areas and high/ low sites at the fox fire because of the relatively low number of forest sample plots (n = 5 high and 4 low). Another explanation could be that I am missing an interaction between fire and invasion by not accounting for canopy lost in forest plots as a function of fire. Analyzing burned vs. unburned forest plots (with fire as an additional driver of invasion) would have been ideal, but would have dropped the sample size too low to perform analyses.

When I performed the cross-K on the entire study area it shows strong clustering of both high and low plots around open areas. However, this is likely the result of the clustered spatial pattern of my sampling method (targeting specific burned areas), and less representative of the actual spatial relationship between ventenata invasion and canopy cover.

Critique of Method

Overall, I thought the cross-K was a useful method for evaluating the spatial relationship between two variables (ventenata cover and open areas). However, a method that could include interaction terms (burning) would probably have been more appropriate for this study. Additionally, the arrangement of sample plots made it difficult to evaluate overall patterns across the study region. It would have been a more useful analysis had my samples been evenly distributed across the region.

Ripley, B.D., 1976. The second-order analysis of stationary point processes. Journal of applied probability, 13(2), pp.255-266.

Ex3: Correlation of Mining and Non-Mining Villages’ NDVI Loss with Distance to Nearest Primary Road

Following up from Exercise 2, where I determined how NDVI loss relates to distance from village center, I now wanted to explore whether or not the NDVI loss I observed could be explained or attributed to another factor: the distance of a village from a primary road.

To do this, I acquired a shapefile of roads from OpenStreetMaps for Senegal, which I used alongside my locations of mining and non-mining villages, ArcGIS Pro for some simple spatial work, and Excel to calculate a simple correlation.

First, I imported the shapefile of roads in Senegal to ArcGIS Pro, and first determined what the nearest actual road to each village was (fig. 1). Based on this first-pass glance, it appears the more mining sites in my sample than non-mining sites are nearest to “primary” paved roads, and that the majority of non-mining villages are near “tertiary” roads. This simply means that, out of my sample, more mining happens closer to paved roads, which would make sense: gold needs to be transported from gold sites, and equipment/supplies need to be transported in.

Fig 1

Next, I manipulated the shapefile to only include features which were designated “primary” roads (fig. 2). I then used ArcGIS Pro’s “Near” tool to generate new attributes for each set of villages which listed the shortest distance to the nearest primary road in meters.

Fig 2: Green dots are non-mining villages; red dots are mining villages

With these distances in hand, I plotted them in excel against the NDVI % loss values for the first ring (0-250m) around each village, as a simple first pass of understanding how distance from a primary road may be related to the amount of NDVI loss being experienced at a particular road (fig. 3).

Fig 3

Then, within Excel, I calculated a simple correlation (a la fig. 4) to determine how distance from a nearest primary road relates to amount of NDVI loss experienced by a village.

Fig 4: x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.

For mining villages, I calculated a correlation of -.804, and for non-mining villages, I calculated a correlation of .686. This means that, for this sample at least, the NDVI loss experienced at a mining village is negatively correlated with distance from a primary road; that is, as a mining village’s distance from a primary road increases, its NDVI loss should decrease. This makes perfect sense: mining villages closer to primary roads are easier to access both by miners and by other users, and to leave as well, encouraging larger groups of people to live there, which would decrease NDVI as a result of their activities.

Non-mining villages, on the other hand, are more difficult to explain. One explanation, which may be plausible, is that more distant villages have less opportunities for other livelihoods which are not environmentally dependent, and so are more obligated to work in and with the environment, perhaps encouraging vegetation loss as users rely on and drawdown more on their environment.

Overall, I think this was a useful approach for examining another possible explanation for why mining villages may have higher rates of NDVI loss than non-mining villages. Mining villages are closer to primary roads, which means they are more accessible to potential miners, and thus more likely to grow larger than non-mining villages, and drawdown more on their environment. That being said, as can be seen in Fig. 3, this is not a perfect relationship by any means, and a more robust analysis would be useful to say much more. But, as a first pass, it’s useful for illustrating some additional explanation.

Exercise 3: Confusion Matrix to Test Predictive Capacity of Seascapes

Question Asked

I’ve been using my time in this class to explore methods and analyses that could help me to better understand the spatial and temporal relationships between satellite-derived seascapes and fishes in the California Current Ecosystem. In Exercise 1, different types of interpolations were explored and applied to the trawl data to see how results changed. After refining methods related to the Kriging Interpolation, Exercise 2 was conducted, exploring he possible uses of neighborhood analysis in determining the prominent seascapes related to high rockfish abundance in the California Coast Ecosystem. Exercise 3 builds on the results of the previous two exercises: in order to explore the relationship between seascapes and rockfish abundance, a confusion matrix was calculated to measure the ability for certain seascapes to predict the occurance of rockfish hotspots. Simply put, the question being asked is: How were seascapes related to areas of high rockfish abundance in the California Coast Ecosystem in May of 2007?

Tool Used

In order to answer this question, I’m going to use a confusion matrix. A confusion matrix is a table that measures the agreement between two raster layers and provides an overall measure of a model’s predictive capacity. Additionally, these matrices can also measure a raster’s error (in terms of false-positives or false negatives). Each statistic calculated is useful, as each one clues the researcher into spatial or methodological processes that may account for certain types of error. The confusion matrix works by taking original data and creating accuracy assessment points, which are points assigned to random spatial points and given the value indicated by the raster value at that point. Those points are then matched up to points on the model raster and statistics are calculated measuring how much they agree.


The data used for this exercise are the following:

  • Significant seascape classes that indicate high rockfish abundance for May 2007, as determined by the neighborhood analysis in Exercise 2
  • Seascape raster data for the relevant month
  • Rockfish abundance raster (interpolated, from Exercise 1)

Confusion matrices can only be calculated when your ground-truthed and modeled layers are the same units. For this reason, my raster data had to be converted to presence absence data for both the seascape classes and the interpolated rockfish abundances. In order to do this, I used the Reclassify tool in ArcGIS to change “present” values to 1 and “absent” values to 0. For the interpolated abundance raster, any nonzero and nonegative values were changed to 1 to indicated modeled rockfish presence. The remaining cells were changed to 0 to indicate modeled rockfish absence. The result was the raster shown below:

Rockfish Presence-Absence Layer calculated using the Reclassify too on the interpolated trawl data.

Similar steps were used to reclassify the seascape data – seascape classes 7, 12, 14, 17, and 27, which were shown to be present within a 25km radius of the high abundance trawls in Exercise 2, were counted as present cells, while the other classes were counted as absent cells. The result was the raster shown below:

Rockfish Presence-Absence raster calculated using the results from exercise 2 to reclassify the seascape data

It was determined that the seascpae layer would be the base or “ground-truthed” layer and the interpolation layer would be the modeled layer, since the interpolated values are simply estimations. The next step was using the “Create Accuracy Assessment Points” Tool to create randomized points within the modeled layer. The result was the following shapefile:

Randomized accuracy assessment points created using the Create Accuracy Assessment Points tool

Once the points are created, they must be compared to the values in the modeled layer. The Update Accuracy Assessment Points tool does this automatically, and my points were matched up to corresponding values from the seascape layer. Finally, the final step was to run the resulting shapefile through the “Calculate Confusion Matrix” tool.


The resulting confusion matrix is displayed below:

ClassValue C_0 C_1 Total U_Accuracy Kappa
C_0 9 79 88 0.102272727  
C_1 89 276 365 0.756164384  
Total 98 355 453  
P_Accuracy 0.091836735 0.777464789 0.629139073  
Kappa         -0.135711

The two rasters measured in the matrix did not compare well at all, resulting in a negative Kappa value. Kappa values measure overall agreement between two measures, with 1 being perfect agreement. A negative score indicate lower agreement than would be found by chance. The interpolated model performed especially poorly measuring absent values, only agreeing with the seascape model on 9 pixels. While the two rasters agreed on a vast majority of the areas where rockfish may have been found, the irregularities on the absence areas drove the matrix to failure. The large number of both false positive and false negative readings shows that this was not a fundamental disagreement based on scale or resolution of the models – the two methods of measuring rockfish abundance fundamentally disagreed on where they should be found, to the point where they cannot be reconciled.

Critique of Method

Overall, I think a confusion matrix is a good way to compare modeled and true (or ground-truthed) measured information. However, in my case, both of my informational rasters are the results of modeled analyses, each with their own set of assumptions and flaws. The interpolation, for example, is at the mercy of the settings used when executing the Kriging: how many points shall be measured per pixel? Is it a fixed sampling method, or variable, spatially based? And the seascape raster is at the mercy of the neighborhood analysis, which was only conducted on 4, hand-picked points, which introduce all kinds of sampling and spatial biases. All of this, coupled with the fact that both rasters had to be distilled into presence-absence data, which eliminates much of the resolution that makes Kriging so great, results in a heavily flawed methodology that I do not believe truly represents the data. This framework which I’ve used, however, has the potential to be modified, standardized, and rerun in order to address many of these issues. But for now, I feel as though the confusion matrix is a helpful tool, just not for two sets of modeled data.

Ex 3: Relating distance between wells to communication type and fault presence

Question that I asked:

Is there a relationship between the distance between wells, their communication status based on a pumping interference test, and whether or not they are separated by a fault?

Name of the tool or approach used:

Polyline creation and classification in ArcGIS Pro, boxplot creation and two-sided t-tests in R.


29 wells in my study area were evaluated by the Oregon Water Resources Department during pumping interference tests in 2018 and 2019. This test involves pumping one well, and seeing whether the water levels in nearby wells drop in response. I received a verbal account of the wells that did and did not communicate, sketched it on a map, and then transferred that information to ArcIS Pro. I drew polylines using the well locations as snapped endpoints. Then, I created and populated fields containing the well communication type (“communicate” for wells that respond to pumping at a nearby well, and “does not communicate” for wells that do not) and whether or not the path between two wells crosses a fault. Shape_Length in feet was automatically calculated when I created the polylines, on account of the projection I used for the shapefile.

I exported that data table to a csv and imported it in R, where I subset it into three categories: all paths, paths between wells that communicate, and paths between wells that do not communicate. I then created box plots and ran t-tests to see differences between means and distributions of path length based on communication type or fault length.


Comparing the path length and the communication type of all 29 wells involved in the communication test, there is not significant evidence of a difference in mean path length between wells that do and do not communicate because the p-value of a two-sided t-test was 0.152. While the mean distance between wells that do not communicate is larger than the mean distance between wells that do communicate, the overlapping interquartile ranges in both categories make this difference less significant. There is not clear evidence that distance plays a role in well communication.

There is some evidence for a difference in mean path lengths between wells that do and do not cross faults, based on a p-value of 0.047 in a two-sided t-test. The mean path length that crosses a fault is 5,139 ft, while the mean path length that does not cross a fault is 3,608 ft. Wells that are closer together are less likely to be separated by a fault.

For wells that do communicate, there is evidence of a difference between the mean path lengths that cross faults and the mean path lengths that do not cross faults. The p-value for a two-sided t-test was 0.024. Wells that communicate but are not separated by a fault are more likely to be closer together than wells that are separated by a fault.

For wells that do not communicate, there is no evidence of a difference in mean path lengths between paths that do and do not cross faults, given a p-value of 0.98 in a two-sided t-test. Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for wells separated by a fault that do not communicate.


Summary of results:

Wells that communicate in pumping tests do not have a significantly different mean distance between them than wells that do not communicate (p = 0.152)

Wells that are closer together are less likely to be separated by a fault. (p = 0.047)

Wells that communicate but are not separated by a fault are more likely to be closer together than communicating wells that are separated by a fault. (p = 0.024)

Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for non-communicating wells separated by a fault. (p = 0.98)

Critique: I wish I had more sample points and paths to work with, so I could use a more interesting analysis such as ANOVA.

Estimates of connectivity to critical infrastructure in Seaside, OR following a rupture of the Cascadia Subduction Zone


For exercise 3, I evaluated the connectivity of each building within Seaside, OR, to critical infrastructure following a rupture of the Cascadia Subduction Zone. The probability of connectivity for each building was determined using networks and considered the following:

  • Electric Power Network (EPN): probability that each building has electricity.
  • Transportation: probability that each building can reach the hospital or fire stations via the road network.
  • Water Supply Network: probability that each building has access to running water.

The connectivity analysis was deaggregated by hazard as well as the intensity of the event.

Tool and approach

For this exercise, I used: (1) a probabilistic earthquake/tsunami damage model to evaluate the functionality of linkages; (2) the network analysis package python-igraph to evaluate the connectivity of each tax lot to critical infrastructure; and (3) QGIS for spatial visualization of the results.

Description of steps

Networks were created to represent the connectivity of the three infrastructure components (Figure 1). A network consists of nodes connected to each other through edges. When edges are removed, a connectivity analysis can be performed to determine whether there is any path from one node to any other specific node. A disconnection in the network results in two (or more) separate networks.

Here, the earthquake and tsunami hazards cause damages to edges which are removed from the network if deemed nonfunctional. A connectivity analysis between each tax lot and critical infrastructure was performed, and each tax lot was triggered with a binary yes/no for connectivity. A Monte-Carlo approach with 10,000 simulations was implemented to determine the probability of each tax lot being connected to critical infrastructure. The resulting probabilities were then added as attributes to GIS shapefiles in order to evaluate the spatial distribution of connectivity.

Figure 1: GIS and representative networks for each infrastructure component


Description of results

Characteristics of the network can be described by a degree distribution. In a network, the degree of a node is the number of immediate connections that the node has to other nodes (e.g. a node connected to 3 other nodes has a degree of 3). A histogram of the degrees can be generated to describe the overall distribution of the entire network. The degree distribution for the three infrastructure components are shown in Figure 2. It can be seen that in the EPN network, most nodes are connected to two other nodes. This is likewise apparent in the network of Figure 1, as the EPN network appears more “linear” compared to the transportation and water networks. The transportation and water networks exhibit similar characteristics to each other in that the majority of nodes have a degree of three.

Figure 2: Degree distribution for each infrastructure component

Using the results from the Monte-Carlo network analysis, maps were created to show the spatial variability of connectivity. The connectivity was deaggregated by both hazard and intensity of the event, as deaggregation provides an avenue for smart mitigation planning. Although similar maps were produced for all three networked infrastructure, for brevity, only the spatial distribution of the transportation network is shown (Figure 3). The maps show the probability of each tax lot becoming disconnected from the fire stations (2 in Seaside) and hospital (1 in Seaside) via the road network. It can be observed that the tsunami hazard results in significant damage to the transportation system relative to the earthquake hazard. The result of bridge failures caused by the tsunami can be observed for intensities larger than the 500-year event. The region west of the Necanicum River becomes completely disconnected from the fire stations and hospital which are located east of the river.

In addition to the spatial deaggregation, the aggregated results provide a comprehensive overview of the connectivity. Figure 4 shows the average fraction of tax lots disconnected from critical infrastructure across all of Seaside. The three networked infrastructure systems approach complete disconnectivity for hazard intensities larger than the 1000-year event. The transportation and water networks are dominated by the tsunami for the higher magnitude events; whereas the EPN see’s significant damage from both the tsunami and earthquake. Consequently, if resource managers are planning for high magnitude events, they should invest in tsunami damage mitigation measures.

Figure 3: Spatial distribution of connectivity to fire station and hospitals

Figure 4: Fraction of tax lots connected to critical infrastructure



Network analysis provides a means to evaluate the connectivity of tax lots to critical infrastructure, and incorporating probabilistic methods accounts for uncertainties as opposed to a deterministic approach. While this type of analysis can be useful to determine overall connectivity, it does not account for limitations and additional stresses in the “flow” of the network. For example, damage to the transportation network would result in additional travel times to the fire stations and hospital. In order to provide a more comprehensive analysis of the impact to networked infrastructure, both connectivity and flow should be considered.

Comparing the Spatial Patterns of Remnant Infected Hemlocks, Regeneration Infected Hemlocks, and Non-Hosts of Western Hemlock Dwarf Mistletoe


For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees were related to the spatial pattern of fire refugia outlined in my study site. I used Geographically weighted Regression to determine the significance of this relationship, however I did not find a significant relationship between a western hemlock, its intensity of clustering and infection status, and it’s distance to its nearest fire refugia polygon.

This result led to the realization that the polygons as they were drawn on the map were not as relevant as the actual “functional refugia”. I hypothesized that, after the 1892 fire, the only way for western hemlock dwarf mistletoe to spread back into the stand would be from the trees that survived that fire, or “remnant” trees. These would then infect the “regeneration” tree that came after the 1892 fire. The functional refugia I am interested in are defined by the location of the remnant western hemlocks. I also hypothesized that the spatial pattern of non-susceptible host tree (trees that were not western hemlocks) would play a role in the distribution of the mistletoe.

Question Asked

How are the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species, and how are these relationships related to the spread of western hemlock dwarf mistletoe in the stand?

The Kcross and Jcross Functions

The cross-type functions (also referred to as multi-type functions) are tools capable of comparing the spatial patterns of two different type events (type i and j) in a similar spatial window, of some point process, X. It does this by assigning labels to the events differentiating the type and summarizing the number or distance, of and between events, at differing spatial scales, or radius circles (r).

The statistic Kij (r) , summarizes the number of type j events, around a type i event at a distance of r, or a point process X. Deviations of the observed Kij(r) curve from the the Poisson curve, or if type j events are truly randomly distributed, indicates dependence of type j events on type i events. Similar results can be obtained from the regular Ripley’s K: deviations above the curve indicate clustering and deviations below indicate dispersal.

(Incredibly helpful and interactive explanation link:

The statistic Jij(r) = (1 – Gij(r))/(1-Fj(r)) summarizes the shortest distance between a type i and j event and compares it to the empty space function of type j event. This is another test for inferring independence or dependence of type j events to type i. Deviations of the Jij(r) curve the value of 1, indicate levels of dependence of the events to each-other. Specific deviations from 1 can be hard to interpret without an understanding of the Fj(r) function so imagining it stationary in the ratio makes it easier. As Gij(r) increases, the numerator shrinks, creating a smaller Jij(r) statistic. Deviations below 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j increases. As Gij(r) decreases, the numerator grows, creating a larger Jij(r) statistic. Deviations above 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j decreases.

Methods Overview

In R, the package “spatstat” provides a suite of spatial statistic functions including the cross-type functions. In order to use these you need to create a “point pattern process” object. These objects incorporate X and Y coordinates, and a frame of reference, or a “window,” and give spatial context top a list of values. Then marks are applied to these points that create the necessary multi-type point pattern process object. These marks serve to distinguish the type i and type j events described earlier in your analysis. Then running the “Kcross()” or “Jcross()” functions with the specified type events produces a graph that you can interpret, very similar to producing the normal Ripley’s K plot.

  • I took my X – Y coordinates of all trees on the stand and added a column called “Status” to serve as my mark for the point pattern analysis.
    1. The four statuses were “Remnant,” “Regen,” “Uninfected,” and “NonHost” to identify my populations of interest.
      1. I had access to tree cores, so I identified trees that were older than 170 yrs old and these trees’ diameters served as my cutoff for the “Remnant” diameter class.
      2. All trees DBH > 39.8 cm.
    2. Doing this in ArcMap removed steps I would have to have taken when I migrated the dataset to R.
    3. I removed all the dead trees because I wasn’t concerned with them for my analysis.
  • I exported this attribute table to a csv and loaded it into R Studio.
  • I created the boundary window of my study site using the “owin()” function, and the corner points from my study site polygon.
  • The function “ppp()” creates the point pattern object and I assigned the marks to the data set using the “Status” column I created in ArcMap
    1. It’s important your marks are factors otherwise it is not converted into a multi-type point pattern object.
  • The last step is running the “Kcross()” and “Jcross” to compare the “Remnant” population to the “Regen,” “Uninfected,” and “NonHost” populations.
    1. This produced 6 plots, 3 of each type of cross-type analysis.
    2. Compare these easily using the “par()” function, for example:

par(mfrow = c(1,3))




This produces the three plots in a single row and three columns.


Because I am assuming the remnant, infected western hemlock trees are one of the main factors for the spread of western hemlock dwarf mistletoe and that they are the center of  new infection centers on the study site, I did all my analysis centered on the remnant trees (points with status = “Remnant” treated as event type i).

1)   i = Remnant, j = Regen

The first analysis between remnant and regeneration trees demonstrate that there is dependence on the two events to each other. At fairly small distances, or values of r, infected western hemlocks that have regenerated after the 1892 fire cluster around infected remnant western hemlocks that survived the 1892 fire. This stands to reason because we assume that infected trees will be near other infected trees, and that infection centers start usually with a “mother tree.” In this case the remnant trees serve as the start of the new infection centers. The Jcross output also shows me that the two types of trees are clustered using the frequency of the shortest distances. After ~8 meters the two tree types exhibit definite clustering. In terms of the function, the Gij(r) in the numerator of the Jij(r) function is approaching 1, or the highest frequency of very short distances.

2)   i = Remnant, j = Uninfected

The Kcross plot from the second set of analyses between remnant and uninfected trees demonstrates that there is independence between the two events up to ~15 meters. After that, the trees exhibit slight clustering effects. The lack of dispersal tendencies is strange for these two types of trees because we expect uninfected trees to be furthest away from the center of infection centers. The presence of clustering may be indicative of the small spatial scale of my study site. It may also be that the size of the infection centers are only about 15 meters (if we assume that remnant trees are the center). The Jcross plot shows something similar: at small distances the types of trees seem independent and then around 8 meters they exhibit clustering.

3)   i = Remnant, j = NonHost

The Kcross from the last set of analyses between the remnant trees and the non-hosts demonstrates a similar pattern exhibited by the regeneration trees. After about 4 meters, the trees tend to be clustered. This is an interesting find because if the non-hosts cluster to remnant trees but uninfected trees are independent, then the non-hosts may be playing a role in this. The Jcross plot shows the same: the two types of trees exhibit clustering.

4)   Comparing Kcross Functions with eval.fv()

A useful way to compare patterns of Kcross functions is using the eval.fv function. The titles of each plot tell which Kcross was subtracted from which; note the difference in scales. The first plot shows that the regenerating trees’ spatial pattern as related to remnant trees is very different from the uninfected trees’ pattern. The regenerating trees’ spatial pattern is much more similar to the non-hosts’ spatial pattern at short distances, until about 15 meters. Then the patterns differ with the regenerating trees exhibiting more of a clustering tendency. However the scale is much smaller than the other two graphs. Lastly, the third plot shows the difference between the non-host trees’ spatial pattern and the uninfected trees’ spatial pattern. There appears to be a stepwise relationship where, at very near and very far distances the non-host trees are much more clustered, but at moderate distances the differences may be less dramatic.

Critique of Cross-Type Functions

The amount of easily interpretable literature on the spatstat package as a whole is sparse, although a wealth of very technical information exists. The function was easy to use and execute though and so was the process of creating the point pattern object. These two functions can clearly show how the spatial patterns of the two types of events change with scale. It would be helpful if there was a way to compare three or more types of events. The last drawback is that there is a lack of specific information for each point on your map or study site. This pattern that is generalizable to a whole set of points may not be as useful when trying to put together a story, such as the story of a stand’s development through time.

Additional Raster Analysis

The last critique of the cross-type functions led me to attempt a visualization of these patterns on my stand. Very briefly, I determined densities of the infected, uninfected, and the non-host trees using the Kernel Density function in ArcMap. Then I classified these densities using natural breaks and coded these for raster addition. After adding all three density rasters together, I coded each unique density classification combination to tell me how the densities of the populations appeared in the study site.

It appears that there are distinct patches of high density separated by areas of low density. On the eastern side of my study site, it appears that the high density areas of infected trees cluster with the remnant infected trees. An interesting interaction is occurring between the high density patches of uninfected trees and infected trees in the western portion of my study site. The mechanism for the seemingly clear divide may be the non-TSHE trees.

Blackleg classification by segmentation error analysis and visualization


Once again, I am using the confusion matrix to determine the accuracy of segmented classification based on the ground truth of manual classification. This work piggybacks off of exercise two where I also used a confusion matrix for determining accuracy. In my previous work I critiqued the ability to draw any relevant conclusions based on a sample size of one. Here I have included a sample size of five images which have gone through the entire image processing, segmentation, manual classification and confusion matrix. This allows us to make more relevant comparisons and looks for trends in the data. Additionally, I mentioned the confusion matrix was inaccurate because it was performed on a rectangular region that was simply the extent of the image and not just the leaf. This has been corrected for and now conducts a confusion matrix on the entire leaf and no outside area. Lastly, I have adjusted the segmentation settings to increase accuracy based on previous findings in exercise one and two.


  • How accurate is the segmentation classification?
  • Are there commonalities between the false positives and false negatives?
  • Are more raster cells being classified as false negatives or false positives when misclassified?
  • What is going undetected in segmentation?
  • What is most easily detected in segmentation?

Tools and Methods

Many of the steps are repeated from exercise two. I will repeat them here in a completer and more coherent format with the new steps found in the process.

ArcGIS Pro

Begin by navigating to the Training Sample Manager. To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it “img_X_leaf” and supply a value (the image #) and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the entire leaf (Images 1-5). Save this training sample and the classification scheme. Save ArcGIS Pro and exit. Reopen ArcGIS Pro and go to Add data in the Layer group on the Map tab to upload the training sample polygon of the leaf. This process must be repeated for each leaf.

The polygon of the leaf will need to be converted to a raster. Click the Analysis tab and find the Geoprocessing group to Tools once again. Search Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygons layer for Input Features and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize, which I changed to 1 to maintain the cell size in my original raster. Select Run in the bottom right corner of the pane. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value and Count which can be found by going to the Attribute Table for that layer or clicking on a group of pixels. This also must be completed for each of the polygons created for each leaf outline.

Now we should have three layers for each leaf in which we want to export as Tiff files. We have our “leaf” that we just created, our “manually classified” and “segmented” layers. To export these three, you need to go to each layer and right click. Start with the “leaf” image and go to Data and Export Raster and a pane will appear on the right. Zoom into the image so there isn’t much area above or below the leaf. Select Clipping Geometry and click on Current Display Extent. This will export the image as a Tiff file which we will use later. Continue this process with the other two images without changing the extent. Each image should be exported after this step with the same extent resulting in the same number of pixels and overlap between pixels if placed overtop one another. This can be confirmed in R in the follow steps.


Open R studio and use the following annotated code below:

required packages for rasters
rm(list=ls()) ##clears variables
##raster upload
img_10_truth <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_truth.tif”)
img_10_predict <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_predict.tif”)
img_10_leaf <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_leaf_shape.tif”)
##view the raster
img_10_truth ##confirm dimensions, extent, etc. They need to be the same or very close for a confusion matrix
plot(img_10_leaf) ##view the images
##export data to excel
img_10_leaf_table <-, xy=T) ##creates tables
img_10_predict_table <-, xy=T)
img_10_truth_table <-, xy=T)
write.table(img_10_leaf_table, file = “img_10_leaf.csv”, sep = “,”)
write.table(img_10_predict_table, file = “img_10_predict.csv”, sep = “,”)
write.table(img_10_truth_table, file = “img_10_truth.csv”, sep = “,”)



At this point the three Tiff files should have been uploaded to R as raster’s and undergone a quality check by viewing the images and the data using the supplied code above. Next, it should have been converted to a table for each file and then all exported out as csv files. From here, open all three files into excel. Begin with the “Truth” and “Predict” files and select the entire column that is the farthest to the left, the values associated with each cell. Go to the Home tab and the Editing group and click on the drop down for Sort and Filter and select the A to Z. Change everything from whatever value they are, to 1. This can easily be accomplished by using the Find and Replace function in the Editing group. Replace all the NA’s with 0’s and then go to left column which simply lists each pixel number and sort the column from A to Z so its back to the order it was when originally opened into excel. Once this is completed for both the “Predict” and “Truth” files, copy the right most column of values and paste them into the Leaf” excel file. Use the Sort function to sort the right column from A to Z on the leaf values column and allow for it to expand the selection. It should be a list of 0’s and if you scroll down you will eventually begin seeing NA’s in the leaf value column. Delete everything below this point in the “Truth” and “Predict” value columns. Scroll to the top and delete the first four columns that are the numbers, x and y coordinates and the leaf values column. You should be left with two columns that are the “Truth” and “Predict” values with 1’s and 0’s only. Give each column a label (Truth & Predict), save this document as an xlsx file and exit out.

This excel step helped us overcome the issue of having values that weren’t in a 0’s and 1’s format needed for the confusion matrix. Here, 1’s stand for diseased patch and 0’s stand for non-diseased. What we also did was removed all the values that were outside of the leaf area. So, the only values we are working with are those which are in the leaf and not part of the background. The last step is importing the excel file that was just created into R to perform the confusion matrix.


##confusion matrix
img_10_confusion$Predict <- as.factor(img_10_confusion$Predict) ##convert to factors
img_10_confusion$Truth <- as.factor(img_10_confusion$Truth)
confusionMatrix(img_10_confusion$Predict, img_10_confusion$Truth)


You should now have R output of the confusion matrix and associated statistics.

ArcGIS Pro

For the purpose of visualization, I also changed the colors of the three Tiff files. To do this, I simply right clicked on each of the three layers and selected Symbology. The pane on the right side of the screen appears where you have options to change the Color Scheme. I made the leaf layer green, the segmented regions white and manually classified regions white. The snipping tool was used to extract these images.


Between the five images there were 15,519 pixels classified in the confusion matrix. 14,884 were properly classified as non-diseased. 282 were classified as true disease. 307 were identified as false negatives and 46 were classified as false positives. Overall the segmentation classification had an accuracy of 97.7% with a p-value less than 0.05 indicating statistical significance. The model had a sensitivity of 99.69% and a specificity of 47.88%.

Accuracy between the five images ranged from 94% to 99%, all with significant p-values. The sensitivity for all five individual matrices and the combined matrix was at 99% with no exceptions. Specificity had a fairly large range.

I was happy with the level of accuracy produced throughout all five images and the overall cumulative matrix. The five images ranged a bit in complexity with the number of lesions present, leaf size and shape, size of lesions, etc. Looking at the confusion matrix we can see there were many more false negatives than false positives. This indicates the model is possibly not quite sensitive enough to detect the all regions or certain parts of the disease. The specificity was not consistent between the five images for some reason which is still unclear to me. As mentioned, although the model may have some issues with detection certain patches, the sensitivity indicated throughout all matrices is very high.

Viewing the images below we can see where the false positives appear and where the false negatives occur (Images 1-5). It seems that both the false positives and negatives tend to appear on the margins of the lesion when they at least detected. Additionally, larger patches seem to have a higher rate of detection than those which are small in size. This is especially true in Image 1&2.


My only critique is the process to get to this step. I was very satisfied with my results but see logistical problems in getting through large amounts of samples. I will attempt to find shortcuts throughout the process my next go around. I will also seek creating a model in ArcGIS Pro for some of the rudimentary tasks done in the processing of the images.


Table 1. Confusion matrix of all five samples combined. Values are number of pixels classified out of 15,519 total.


Figure 1. Classification error for the combined matrix. The values are the number of pixels classified out of 15,519 total.


Image 1. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 2. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 3. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 4. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 5. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

Exercise 2: Possible Influence of ENSO Index on Dolphin Sighting Latitudes

Exercise 2

Question Asked: Are latitudinal differences in dolphin sightings in the San Diego, CA survey area related to El Niño Southern Oscillation (ENSO) index values on a monthly temporal scale?

  1. My previous question for Exercise 1 was: do the number of dolphin sightings in the San Diego, CA survey region differ latitudinally? I was finally able to answer this question with a histogram of sighting count by latitudinal difference. I defined latitudinal difference as the difference from the highest latitude of dolphin sightings (the Northernmost sighting point along the San Diego transect line) to the other sighting points, in decimal degrees. Therefore it becomes a simple mathematical subtraction in ArcMap. Smaller differences would be the result of a small difference and therefore mean more Northerly sighting, with large differences being from more Southerly areas. I used all sightings in the San Diego region (from 1981 through 2015). As you can see from below, there is an unequal distribution of sightings at different latitudes. Because I had visual confirmation of differences at least when all sightings are binned (in terms of all years from 1981-2015 treated the same), I looked for what process could be affecting these differences in latitude.

    Comparing the Latitudes with the frequency of dolphin sightings in San Diego, CA

ENSO is a large-scale climate phenomena where the climate modes periodically fluctuate (Sprogis et al. 2018). The climate variability produced by ENSO affects physical oceanic and coastal conditions that can both directly and indirectly influence ecological and biological processes. ENSO can alter food webs because climate changes may impact animal physiology, specifically metabolism. This creates further trophic impacts on predator-prey dynamics, often because of prey availability (Barber and Chavez 1983). During the surveys of bottlenose dolphins in California, multiple ENSO cycles have caused widespread changes in the California Current Ecosystem (CCE), such as the squid fishery collapse (Nezlin, Hamner, and Zeidberg 2002). With this knowledge, I wanted to see if the frequency of dolphin sightings in different latitudes of the most-consistently studied area was driven by ENSO.


Primarily R Studio, some ArcMap 10.6 and Excel

Step by Step:

  1. 1.For this portion of the analysis, I exported my table of latitudinal differences within my attribute table for dolphin sightings from ArcMap 10.6. I saved this as a .csv and imported it into R Studio.
  2. Some of the sighting data needed to be changed because R didn’t recognize the dates as dates, rather as factors. This is important in order to join ENSO data by month and year.
  3. Meanwhile, I found NOAA data on a publicly-sourced website that had months as the columns and years as the rows for a matching ENSO index value of either: 1, 0, or -1 for each month/year combination. A value of 1 is a positive (warm) year, a value of 0 is a neutral year, and a value of -1 is a negative (cold) year. This is a broad-value, because indices range from 1 to -1. But, to simplify my question this was the most logical first step.
  4. I had to convert the NOAA data into two-column data with the date in one column by MM/YYYY and then the Index value in the other column. After multiple attempts in R studio, I hand-corrected them in Excel. Then, imported this data into R studio.
  5. I was then able to tell R to match the sighting date’s month and year to the ENSO data’s month and year, and assign the respective ENSO value. Then I assigned the ENSO values as factors.
  6. I created a boxplot to visualize if there were differences in distributions of latitudinal differences and ENSO index. (See figure)Illustrating the number of sightings grouped by ENSO index values (1, 0, and -1).
  7. Then I ran an ANOVA to see if there was a reportable, strong difference in sighting latitudinal difference and ENSO index value.



    From the boxplot, it appears that in warm years (ENSO index level of “1”), the dolphins are sighted more frequently in lower latitudes, closer to Mexican waters when compared to the neutral (“0”) and cold years (“-1”). This result is intriguing because I would have expected dolphins to move northerly during warm months to maintain similar body temperatures in the same water temperatures. However, warm ENSO years could shift prey availability or nutrients southerly, which is why there are more sightings further south.  The result of the ANOVA, was a p-value of <2e-16, providing very strong evidence to reject the null of hypothesis of no difference. I followed up with a Tukey HSD and found that there is strong evidence for differences between both the 0 and -1, -1 and 1, and 1 and 0 values. Therefore, the different ENSO indices on a monthly scale are significantly contributing to the differences in sighting latitudes in the San Diego study area.

Tukey HSD output:

diff               lwr                        upr           p adj

0–1 0.01161047 0.004250827 0.01897011 0.0006422

1–1 0.04101170 0.030844193 0.05117920 0.0000000

1-0 02940123 0.020689737 0.03811272 0.0000000

 Critique of the Method(s):

These methods worked very well for visualization and finally solidifying that there was a difference on sighting latitude related to ENSO index value on a broad level. Data transformation and clean-up was challenging in R, and took much longer than I’d expected.



Barber, Richard T., and Francisco P. Chavez. 1983. “Biological Consequences of El Niño.” Science 222 (4629): 1203–10.

Sprogis, Kate R., Fredrik Christiansen, Moritz Wandres, and Lars Bejder. 2018. “El Niño Southern Oscillation Influences the Abundance and Movements of a Marine Top Predator in Coastal Waters.” Global Change Biology 24 (3): 1085–96.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

  1. Question Asked

Here I asked if there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and if so what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they related to various demographic variables like: Percent of county which is aged 0-4, percent of county which is aged 80+, percent of county which is female, median age, percentage of county residents who have completed high school, median county income, percent of county who is born outside the US, percent of county who speaks a language other than English at home, percentage of county estimated to be in poverty, and percent of children in a county estimated to be in poverty.

To answer these questions, I used the same data outlined in the first exercise blog post with newer demographic variables being taken from the American Community Survey and published on AmericanFactFinder which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to the year 2009 is unavailable which shortened the window of analysis by a year.

  1. Names of analytical tools/approaches used

The methods used to answer these questions was first to create an exhaustive general linear model where county Salmonella rates were a function of the above listed demographic variables. A Ljung-Box Test was used to assess if there was evidence of non-zero autocorrelation of residuals of the model at various time lags. Following this, an ideal linear model was selected using a stepwise AIC selection process and then different variance structures were compared by AIC, BIC, and log Likelihood metrics as well as ANOVA testing. Following the selection of an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model which allowed for variable time interaction. A version of this model would later be used in geographically weighted regression analysis. I performed geographically weighted regression allowing the coefficients to vary across space in Oregon.

  1. Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because this is longitudinal data measured over time I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to evaluate for temporal autocorrelation as well as tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those normality assumption tests. A simple backward step-wise AIC approach was used on my full model to identify and remove extraneous variables. This worked by removing variables in the full model in a stepwise fashion, comparing the AIC values between the two models, and continuing this process until the AIC values between the two models being compared are not significantly different. I then used this model to select an ideal variance structure to compare Salmonella rates autocorrelated at different time lags. The types of variance compared were: Independent variance, compound symmetric, AR1, MA1, ARMA (1,1), unstructured, and allowing for different variance structures across time. After AIC, BIC, log Likelihood, and ANOVA testing an ideal variance structure was determined and the model using this variance structure was evaluated for basic interaction with time. All variables present in the model were allowed to have time interaction including time itself (i.e. allowing for a quadratic time trend). Once again AIC, BIC, log Likelihood, and ANOVA testing were used to select the most ideal model.

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

  1. Brief Description of Results

Every panel represents one of Oregon’s 36 counties and panels containing an error did not have any cases of reported Salmonella. Some counties are seen decreasing with time, others show slightly increasing trends, and others show a fairly level rate over time. Clearly there is evidence of some time trend for some counties.

Results of normality testing found that age- adjusted rates of reported Salmonella in Oregon were not normally distributed, for the ease of visualization and as an attempt to address the failure to meet the assumption of normality in linear modeling I log-transformed the rates of reported Salmonella. Results of the Ljung-Box test with my full model provided evidence of non-zero time autocorrelation in the data and a visual inspection of the lattice plot supports this with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage +year

Covariance structure comparison:

Covariance Model logLikelihood AIC BIC
Independent -431 874 896
Compound Symmetry -423 860 887
AR(1) -427 869 895
MA(1) -427 869 895
ARMA(1,1) -423 862 892
Unstructured -387 859 1017
Compound Symmetry Different Variance Across Time -412 854 910


Mostly AIC, BIC, and Log Likelihood values were clustered together for the different models. I decided to base my choice primarily on the lowest AIC because that’s how I did variable selection to this point. This resulted in me choosing a compound symmetric model which allowed for different variances across time.

Next, I built models which allowed for simple interaction with time meaning that any three-way interaction with time was not evaluated for. Subsequent ANOVA testing comparing the different interaction models to each other, to a model where no interaction was present, and a model where time was absent were used in my selection of a final model.

Final Model: 

logsal ~ %female + %childpov + medianage + year +(medianage*year)

This model follows a 5×5 compound symmetric correlation variance model which allows for variance to change over time.

Code: interact_m <- gls(logsal ~ female + childpov + medianage + year +(medianage*year), na.action=na.omit, correlation= corCompSymm(form= ~1|county),weights=varIdent(form=~1|year),data=alldata)
Within County Standard Error  (95% CI): 0.92
Estimate Name Estimate (log-scale) Std. Error p-value
Intercept -759.42 237.79 0.002
% Female 18.06 4.46 <0.001
% Child Poverty 0.03 3.27 0.001
Median Age 17.16 3.11 0.002
Year 0.38 3.18 0.002
Median Age*Year -0.01 -3.12 0.002


Estimates are on the log scale making them difficult to interpret without exponentiation, however it can be seen that a percent change in the number of females in a county or a year change in median age are associated with much larger changes in rates of reported Salmonella incidence compared to changes in percent of child poverty and the year. Overall, incidence rates of reported Salmonella were shown to increase with time, county percentage females, county percentage of child poverty, and county median age with a significant protective interaction effect between time and median age.

For my GWR analysis I used a function derived from the website and looks like:

regfun <- function(x) {
dat <- alldata[alldata$county == x, ]
m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data=dat)

As can be seen, I retained the same significant variables found in the OLS regression for my time series analysis. GWR in this case allows for the coefficient estimates to vary by county.

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.

For % Female

For % Child Poverty

For Median Age

For Year

For Median Age * Year

For the most part, county % female, median age, year, and the interaction term clustered close to 0 for most counties. Some counties were showed highly positive/negative coefficient estimates though no consistently high/low counties could be identified. The maps for the coefficients of median age and year are very similar though I do not have a clear idea why this is the case. The map of the coefficients of child poverty showed the most varied distribution across space. Autocorrelation analysis using Moran’s I of the residuals from this GWR model did not find any evidence of significant autocorrelation. I could not find evidence of a significant non-random spatial pattern for the residuals of my model.

  1. Critique of Methods Used

While the temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation present in the data and prior univariate spatial autocorrelation provided limited evidence of variables being spatially autocorrelated at different spatial lags, I was unable to perform cross correlation analysis between my outcome and predictor variables. One important note: I do plan on performing this analysis I just need to figure out how the “ncf” package works in R. This is one of the more glaring shortcomings of my analysis so far is that I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective with my choice of model selection criteria being AIC. Basing my decision on other criteria would likely change my covariance structure. A similar argument could be said for my choice of variable selection being based on a backwards stepwise AIC approach where other selection criteria/methods would likely have different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates. Rather it shows the demographic characteristics associated with higher reported rates. While this information is useful it does not identify any direct drivers of disease incidence. Further analysis will be required to see if these populations have different exposures or severity of risky exposures.

Determining accuracy of segmentation versus ground truth disease patches of Blackleg on Turnip

My research involves the classification of a leaf spots on turnip derived from the pathogen blackleg. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves. Here I focus on the accuracy of previously determined disease patches through segmentation and ground truth classification. To do this a confusion matrix is used and allows for detection of true positives, true negatives, false positives and false negatives. All of the image processing took place in ArcGIS Pro. The next step involved the accuracy assessment which was conducted in R.

How accurate is the computer image classification?
• Can the accuracy be quantified?
• How can the image classification error be visually represented?

Tools and Methodology
I began in ArcGIS Pro to help visually represent the false negative and false positive regions of the disease patches through the segmentation process of classification. To start I turned on both layers where you can see the overlap between the two classification methods in Image 1.

I then went to the symbology of the segmented image and changed the color to white. I placed this layer on top, so it covered all the raster cells of the manually classified patches leaving only the false negatives, seen in Image 2. Next, I went back into the symbology, but for the manually classified image and changed the color scheme to white. I moved this layer to the top and changed the segmented image color scheme back to unique values. I was left with Image 3 showing the false positives. This was an easy way to visualize these disease patches and how well the classification method was working.

Next, I exported a Tiff file of both the manually classified patches and the segmented patches. To ensure each cell between the two layers lined up in R I had to make sure the extent was the same for both layers when I exported. To do this I right clicked on the layer and went down to data and selected export raster. A pane appears in right side of ArcGIS Pro where I hit the drop-down arrow for clipping geometry and selected current display. I did this with both layers and had one Tiff file for computer classified image through segmentation and one for the manually classified disease patches.

Using the raster, rgdal and sp packages I was able to upload my two Tiff files in raster format to R. I gave the two files each a name and used the plot function to view the two images. I noticed they both had values associated with each patch which were on a gradient scale. To correct for this, I converted my two raster layers in R to tables. This provided a coordinate for each cell and the value associated with it. In the image segmentation raster table I had 0 to 2 and the manually classified image I had 1 to 6. For all the white space I was given ‘N/A’, which was another issue. I used the xlsx package to export my data tables to excel files. I opened the two files in excel and used the sort smallest to largest function. From here I was able to use the replace function and change all the ‘N/A’ values to 0’s and all the values associated with pixels to 1’s. The values were arbitrary associated with the pixels and I needed the raster in 1’s and 0’s format. After doing this with both excel sheets I copy and pasted the two side by side and deleted all the associated coordinates. These were also unnecessary because each pixel from the same coordinate between the layers were in the same row. I saved this excel file and uploaded it into R. I downloaded the caret package and performed a confusion matrix which can be seen below in Table 1.

The visual representation for my false positive and false negative results can be seen below in Image 2 & 3 with Image 1 for comparison. You can see the false negatives for disease covers a much larger area than does the false positives. This may imply that the segmentation is limited in its assessment of diseased area. What it tends to miss is the margins of the disease but does a fair job of predicting the center of the disease where it likely originated and is most severe. To correct this, setting a larger threshold may allow for less severe regions of the disease to be classified. Because the segmentation is based on pixel reflectance value at the red band, this would mean the threshold value needs to be slightly lowered.

Additionally, an entire patch of disease was missed which can be seen in the right corner of Image 1. Currently, the classification system is set to only create segmented patches of 10 or more pixels. This patch is 9 pixels and therefore just missed the cutoff. Even though it was just shy of this requirement, we are unsure if the segmentation would have detected a difference in this diseased patch or if it was also out of the threshold for classification. If it is common for disease patches to be this small, it may be an indicator to lower the value to 5 or 6 for what it is allowed in the segmentation blocking.

There are multiple steps in the processing of the image where different routes could have been taken and potentially increased classification accuracy. The objective of this classification method is to have as little percent error as possible or at least determine if you’d rather have more false positive than false negatives, vice versa or equal. Here we have greater percent of cells which are false negatives and modest assessment of diseased pixels when simply visualizing the images.

To help quantify these images and determine how accurate the model was, a confusion matrix was used in R, seen in Table 1. The segmented classification correctly identified non-diseased regions very well and did a pretty good overall job of predicting disease that was confirmed with ground truth. There were 2075 true positives, 55 false negatives, 41 true negatives, and 12 false positives. The model correctly identified 97.1% of the 2183 total cells. The precision of the model was 42.7%. The sensitivity of the model was 77.4% and the specificity was 97.4%. The accuracy was very good for this model and is essentially a percent error calculation of the model. The sensitivity measures the proportion of actual positives that are correctly identified while the specificity is the opposite and measures the proportion of true negatives identified. The precision gives us a sense of how useful or complete the results are. The model did well overall but provided insights to possible adjustments that could be made which would increase the predictive power here.

Critique of method
One critique I have is the small sample size. While I simply intended to only lay down the framework for creating a stepwise process in disease classification, it supplies results that can hardly be statistically backed. I would like to increase my sample size to five images for part three and look for similarities and differences between the five. I also intend to make some adjustments to the process to try and increase overall accuracy, precision, specificity and sensitivity. So essentially this critique is the limited conclusion that can be drawn from a sample size of one and the need to increase that to five for now.

The second critique I have is the number of steps I have used to get to this point. I would like to find a more manageable way to do the segmentation process and the image processing steps to get to this point. I have found small changes I can make along the way. Ideally, I can use the Modelbuilder in ArcGIS Pro where most of the processing is done. This will streamline the process when I find a way to do this.

An error which is present in my results is the confusion matrix. The matrix is considering 2183 raster cells in order to perform the matrix. These cells were determined by a defined rectangle when exporting a Tiff file from ArcGIS Pro. Many of these cells are not even of the leaf and is simply classifying regions outside of the leaf. To correct this, I would need to export a Tiff file which is symbolic of the leaf shape. The confusion matrix results provided were therefore erroneous in a sense or a bit misleading.

Partner ideas
My partner talked about how they were doing a neighborhood analysis and it could be practical for me to do. She had mentioned doing it in earth engine which I haven’t used but could get some help from her. There is a multiple rings buffer and I could look at the false positives in this light. She also mentioned using geographically weighted regression. We didn’t discuss much about it, but it seemed like a good regression to perform on my error analysis. We related on some level with our projects data and issues but at the time didn’t have any clear resolves. I will be curious to follow up on our chat and see what type of analysis was performed and share my results as well.


Image 1. Overlap between segmentation on top and manual classification below

Image 2. False negatives after subtracting segmented regions from manually classified.

Image 3. False positives after manually classified cells are subtracted from segmented.

Table 1. Confusion matrix

R Code
##raster upload



img20_seg <- raster(“E:/Exercise1_Geog566/MyProject3/RasterT_afr7_Polygon_1.tif”)

img20_ground <- raster(“E:/Exercise1_Geog566/MyProject3/diseased_20_PolygonT_1.tif”)



##export data

raster.table <-, xy=T)
truth <-, xy=T)


tableimg <- raster.table

write.table(tableimg, file = “dataexport.csv”, sep = “,”)
write.table(truth, file = “truth1.csv”, sep = “,”)

##confusion matrix

#######this isn’t working


table(confusionM$reference, confusionM$predicted)

confusionMatrix(confusionM$reference, confusionM$predicted)

##For the confusion matrix if above doesn’t work

myconfusionM <- table(confusionM$predicted, confusionM$reference)

##accuracy of matrix

2075+41+12+55 #total
(12+55)/2183 #misclassified/total







Ex2: Incremental pixel greenness while moving away from refugee settlement boundaries

Exercise 2: Incremental pixel greenness while moving away from refugee settlement boundaries


  • The question I asked centered on how the pixel greenness / NDVI varied in buffered increments around settlements within BidiBidi, Imvepi, and Rhino Refugee camps. I wanted to compare the settlements in Bidi Bidi and Imvepi, which have a larger settlements, to Rhino, which tended to have smaller settlements more uniformly spaced and spread out. Given the more uniformed and wider spacing of Rhino, I expect that green-up will happen more quickly in comparison to the settlements in Bidi Bidi and Imvepi, which are closer together and varying in size and development pattern. This leads me to believe that there’s more cleared or available land and that

    Map of Regions of Interest & Buffers

    settlement size is based on political organization of camp blocks rather than natural boundaries that might exist already. One of the questions that I’m asking with these settlements and the geography of exclusion is essentially why different settlement areas and parts of these areas are included or excluded from global settlement datasets. One of the factors that contributes to this is a spectral and spatial distinction – that is, how might the green space in and around a settlement change as you move away from said settlement? With this exercise, I wanted to compare the settlements in Rhino to the settlements in Bidi Bidi and Imvepi to see if the pixel greenness changed at a different rate or in a different pattern as one moved away from the settlement center.

  • I used multiple different tools, including the Multi-Ring Buffer tool in QGIS (since I was working with export JSONs), EarthEngine to extract NDVI mean values from Landsat 8 satellite images at these settlement locations, and the smoothing factor in ggplot in R to plot and statistically examine the way NDVI changed.
  • I first needed to buffer the regions of interest that I wanted to study, of which there were 44 in Bidibidi and Impvipi and 41 in Rhino. I performed this buffering in QGIS using the Multi-Ring Buffer plugin and made buffers at 100-meter increments from 0 to 1000 meters. Some of the buffers overlapped, but for the sake of simplicity of this assessment, I ignored this. Within EarthEngine, I pulled Landsat8 images from 2018 that covered these settlements. After adding NDVI and NDBI calculated bands to the image collection of 2018 images, I performed a quality mosaic to compress the image collection into one image. I based this quality mosaic on NDVI, meaning that the pixel chosen from the image collection would be the pixel with the highest NDVI. While this can sometimes pull pixels from different dates, it does exclude the possibility of clouds and seasonality affecting the dataset by comparing just the most vegetated pixel that occurred in that area. If I were to re-do this, I might choose a single date image to capture phenological nuance. After reducing the image across all of the buffers (that is, calculating the mean NDVI within each buffer), I exported the geoJSONs, brought them back into QGIS, ensured that there was a spatial selection component linking all of the buffers and regions of interest together, and brought this data into R to plot the NDVI change over distance for all regions of interest.

Bidi Bidi Settlement; Imveppi Settlement Buffers

Rhino Settlement Buffers

  • The pattern of greening in the buffers around settlements in Rhino versus Imvipi and BidiBidi did present different patterns, but not particularly significant different patterns. It appears that the Rhino settlements had a faster rate of increase in greenness while moving away from the settlements especially in the first 500 meters, whereas Bidi Bidi and Imveppi showed a more gradual green-up, although there also seems to be a small shift at 500 meters. These results are somewhat expected, but also not very drastic. It would be interesting to see how the green-up changes if I increased my buffer extent or decreased my buffer increments.
  • I think that looking at NDVI in buffers was an interesting approach, but as I said above, my choice of pixel quality selection (highest NDVI) could alter a neutral selection of data. Also, what buffers I chose were relatively arbitrary – I chose equal intervals, but this does mean that when the mean NDVI is calculated, the mean is reduced across a larger area as each buffer gets further from the center. I could also try testing with larger buffers (200 or 500 meter buffers) that extend beyond 1000 meters from the settlement edge. Further, some of my buffers overlapped and encroached on other actual boundaries: this means that the buffers sometimes contained pixels from other identified settlements. For this reason, I chose to present the data in a smoothing trend. I will likely need to fix some of these errors for the final project, because I do think that this is a typical and useful spatial analysis to perform on this type of data and some of the errors are relatively easy to fix and would show stronger data integrity.


Exercise 2: Geographically weighted regression on two forested stands.

Bryan Begay

  1. Initial Spatial Question: How does the spatial arrangement of trees relate to forest aesthetics in my areas of interest?


To understand forest aesthetics in my stand called Saddleback, I did a Ripley’s K analysis for Saddleback and on a riparian stand called Baker Creek to determine if the stands are clustered or dispersed.  The Baker Creek location is a mile west of the Saddleback stand.

  1. Geographically weighted Regression:

I performed a geographically weighted regression on both the Saddleback and the Baker Creek stands. The dependent variable was a density raster value and the explanatory value was tree height.

  1. Tools and Workflow

Figure 1. The workflow for creating the Geographically Weighted Regression for the Saddleback Stand. The Baker Creek stand followed the same workflow as well.



Figure 2. Geographically Weighted Regression showing the explanatory variable coefficients in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates negative relationships and the hotter colors  indicate positive relationships between tree height and density.

Figure 3. Geographically Weighted Regression showing the Local R2 values in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates that the local model is performing poorly, while hotter colors indicate better performance locally.

Table 1. Summary table output for the Saddleback stand’s geographically weighted regression.

Table 2. Summary table output for the Back Creek stand’s geographically weighted regression.

4. Interpretation/Discussion:

Having done the Ripley’s K analysis, I wanted to have a connection with this exercise, so I created a point density raster on both my stands (Figure 1). The point density raster calculates a magnitude-per-unit area from my tree points and outputs a density for the neighborhood around each tree point. The raster values would then be a descriptor of the trees neighborhood density. Having the density neighborhood values describes the stands tree spatial arraignment and relates to the Ripley’s K analysis outputs of telling if a stand is spatially clustered or dispersed.

Figure 2. shows that there is a spatial pattern in the Saddleback stand between density and height. There is a positive relationship on the edges of the stand and a decreasing relationship in the middle of stand between the two variables. This makes sense when thinking about how the stand would have denser and higher trees on the edges of the managed stand to screen the forest operations. The coefficient values for the baker creek showed a positive relationship on the north eastern portion of the stand, which would need further investigation to understand the relationship between density and height. Overall the relationship was negative in the Baker creek stand between density and height, but this may be attributed to the low local R2 values that indicate poor modeling (figure 3). Table 2. also shows that the Baker Creek model only accounted for 50% of the variance for the adjusted R2 values, which would indicate that more variables would be needed for the riparian stand. Figure 1. shows the summary table for GWR in the Saddleback stand.

  1. Critiques

The critiques for this exercise is that I only look at height and density. If I had more knowledge of working with LAS data sets I would have liked to have implemented the return values on the LiDAR data as an indicator of density. Another critique would be that I used density as a dependent variable and height as an explanatory variable. Using density as the dependent value allows me to see the spatial patterning of my trees when plotted in ArcMap so I can reference the Ripley’s K outputs for further analysis. Having height as a response variable with density as an explanatory is something that would have been easier for me understand and explain that relationship. Density can affect tree height in a stand but understanding tree height as a factor that affects density is not as intuitive. Looking at how tree height responds to density in my stand would tell something about tree height, but that relationship has already been explored in great depth.

Spatial Distribution of Trees by Height Class, Slope and Elevation in the HJ Andrews Forest

Guiding Questions: How do distances between trees differ depending on tree height? How does the spatial pattern of tall trees relate to the spatial pattern of slope and elevation?

Methods: I used a combination of ArcMap, QGIS and R to perform analyses and view results. I used the results of my previous distance analysis within the HJ Andrews Forest, which grouped individual trees into ten height classes and calculated the mean distance between trees within the same height class, to correlate tree spacing with other spatial phenomena. I wanted to know if hot spots in within class tree spacing correlated with hot spots in tree height, so I examined hot spots and cold spots of tree distances within each height class and compared them to tree heights, slope and elevation. Height class 1 is the shortest class of trees and height class 10 is the tallest class of trees.

I used the Hot Spot Analysis Tool in the Arc Toolbox > Spatial Statistics Tools > Mapping Clusters > Hot Spot Analysis (Getis-Ord Gi*) to perform a Hot Spot Analysis on each of the ten height classes by mean distance to the 30 closest trees of the same height class. In the context of this analysis, the interpretation of a hot spot is that it is a region of greater than expected distances between trees of the same height class. For example, in the shortest height class, 1, hot spots are regions of greater than expected mean distance between a short tree and the 30 closest short trees. Cold spots would then be regions of closer than expected mean distance between short trees.

The Hot Spot Analysis in ArcMap used a self-generated distance band of 113m for my original hot spot analysis of the global dataset (not broken up by height class), so I decided to use a distance band of 100m for each subsequent hot spot analysis. Each height class has a different number of total trees in it, so by holding the distance band constant, I hoped to avoid influence from any differences in total number of trees between height classes.

After viewing the hot spot results, I plotted the z-scores of heights for each height class against the z-scores of the distances between trees to visually examine their relationship. If both heights and distances between trees were perfectly normally distributed, one would expect a circular distribution on the density plots with a slope of zero.

I then compared the mean slopes, elevations, and standard deviation of slopes and elevation within height classes across the entire forest. Since HJA is a research forest with many different management areas, including harvested patches and research plots, I limited the next part of the analysis to only within control areas of the forest. I downloaded the most recent (2014) land use designations from the HJA data repository ( For this analysis, I used Entity Title 3: Reserved areas (controls) within the Andrews Experimental Forest. I compared slopes and elevations within the control plots only by height class, to see if there were differences between the global dataset and the control regions of the forest.


The density plots of height z-score versus distance z-score revealed a different pattern between smaller height classes of trees and tree spacing than the relationship between larger height classes of trees and tree spacing. As we go from shorter height classes of trees to taller height classes, the density plot distributions change (Figures 1-10). There is strong evidence of positive correlation between hot spots of short trees and hot spots of distance between short trees, but from height class 6-10, there is little to no evidence of a relationship between hot spots of trees and distance between them. Tall trees are more or less distributed randomly throughout the forest.

Figure 1


Figure 2


Figure 3


Figure 4


Figure 5


Figure 6


Figure 7


Figure 8


Figure 9


Figure 10

There is clearly some structure to the density plots (especially in height classes 1-5), so we can assume that the trees are not randomly distributed and that there is a relationship between height and distance between trees. I compared mean and standard deviation of slope, as well as mean and standard deviation of elevation for each height class of trees (Table 1). Mean slopes do not significantly differ between height classes, so slope is likely not a main driver of tree height. However, there is some evidence that tree heights differ at lower and higher elevations, with the shortest height class of trees at a mean elevation of 1014 m, and the tallest height class at a lower mean elevation of 831 m. It’s important to note that mean elevations have large standard deviations, so the trend may not be as strong as it first appears. I wanted to know if there was more evidence for this pattern, so I calculated the same statistics for subsets of the hotspot analyses constrained to only the control areas of the forest (Table 2) to get an idea of how management may or may not influence the relationship between tree height and tree spacing throughout the forest. Mean slopes and elevations, accounting for standard deviation from the mean, do not differ significantly between the global and control datasets, meaning that the control regions are reasonable representations of the rest of the forest. To examine this further, I examined the same data for the entire forest excluding the control areas (Table 3). The same pattern holds between the three datasets; class 10 is the only class that is consistently at lower elevations across the forest. I made two density plots to display this relationship. Figure 11 shows the distribution of elevation for the shortest height class (class 1) of trees (green) versus the global dataset of all trees (red). The short trees follow the same distribution as the rest of the dataset, meaning that they are dispersed more or less evenly across elevations. Figure 12 shows the distribution of the tallest height class of trees (class 10; light blue) by elevation versus the global dataset of trees by elevation (red). This clearly shows that tall trees are not distributed at higher elevations.

Figure 11. Density of trees by elevation in height class 1 (shortest trees; green) versus global dataset of all trees (red).


Figure 12. Density of trees in height class 10 (tallest trees; light blue) versus global dataset of all trees (red) by elevation.

Table 1: Global dataset

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 23 11.7 1014 294
2 23 11.6 1008 291
3 25 11.6 990 295
4 25 11.5 948 291
5 25 11.3 931 293
6 26 10.9 972 291
7 27 10.5 982 250
8 26 10.6 959 221
9 25 10.7 926 207
10 23 11.2 831 197


Table 2: Subset of control regions

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 27 11.9 1104 317
2 27 11.5 1136 328
3 28 11 1157 329
4 28 10.8 1143 311
5 28 10.2 1133 285
6 28 10 1071 262
7 28 10 992 228
8 27 10.6 955 193
9 26 10.8 935 187
10 24 11 869 189


Table 3: Global dataset excluding control regions

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 22 11.4 991 284
2 22 11.4 979 273
3 24 11.6 952 272
4 24 11.5 902 266
5 24 11.5 867 265
6 25 11.3 908 290
7 26 10.7 973 267
8 25 10.5 964 242
9 23 10.4 919 221
10 22 11.2 793 198


Critique of the method:

A criticism of hot spot analysis is that it’s basically a smoothing function that places a focal around an area but does not account for the distribution of values within that area. So, the tallest tree in the dataset could be in cold spot (region of shorter than expected trees) and the hot spot analysis would give you no indication of that, so one may miss out on potentially useful/interesting information.

This is only a cursory look at the data and a next step is to more closely examine how slope, elevation and aspect influence distribution and height of trees, particularly within the control areas of the forest.

Ex 2: Spatial distribution of juniper saplings and slope,aspect, and seed source

1. Question Asked: How does the density of juniper vary based on characteristics associated with soil moisture? Additionally, does this pattern vary with juniper density (as an indicator of potential seed source) surrounding each transect?

The study site for this analysis is the Camp Creek Paired Watershed Study (CCPWS) in central Oregon. The majority of juniper was removed from one watershed (“treated WS”) while one watershed is dominated by mature juniper (“untreated WS”). This analysis seeks to examine the density of juniper sapling reestablishment in the treated WS and mature juniper density in the untreated WS as it relates to indicators of soil moisture (slope, aspect, and a combination of both) and seed sources (surrounding juniper density).

2. Tools and steps used in analysis: All steps were completed in ArcGIS 10.6 or ArcGIS Pro.

Surrounding juniper density: Thirty-three belt transects were conducted in the summer of 2018 in the treated WS. The number of juniper for each transect (30m long by 3m wide) was recorded. For the purposes of this analysis, the transects were represented by a point. Supervised classification (support vector machine) was applied to National Agriculture Imagery Program (NAIP) imagery for this area and pixel-based analysis was used to estimate the density of juniper. General steps are as follows:

  1. Buffers were created surrounding each transect point at 50m, 100m, and 250m using the buffer tool.
  2. The zonal histogram was used to extract the pixel values within each buffer. To ensure proper counts, each buffer must be iterated separately if there is overlap between buffers.
  3. A model was created in ModelBuilder to extract values from the buffers and combine values in a table. In-line variables were used to separate the data for each transect into separate tables prior to being merged.The submodel (first image) and full model (second image) are displayed below:


  1. The percentage of juniper pixels within each buffer (number of juniper pixels/total number of pixels) was calculated using excel. This percentage was used as an estimate of surrounding juniper canopy/density within the surrounding areas.
  2. Point plots were created indicating the number of juniper per transect and the corresponding juniper density.
  3. The transects were divided in to two groups representing “low” counts (transects where one or fewer juniper saplings were found) and “high” counts (transects where five or greater juniper were found).

Slope and Aspect: Slope and aspect were considered to be indicators of soil moisture within the watershed. Separate approaches were used to assess if juniper density was found to vary with these characteristics: regression (using ordinary least squares and global weighted regression) and a rating model based on slope and aspect categories. General steps are as follows:

  1. Using a 30m DEM, the slope and aspect values were extracted using the slope and aspect tools within ArcMap.
  2. Using the calculate field function, aspect and slope were each ranked from 1-9, with areas were greater soil moisture is expected (flat and shallow slopes and northern slopes were weighted heaviest). The following parser was used (aspect values are in degrees and slope is in percent):

def Reclass(Aspect_category):

    if (Aspect_category<0 and Aspect_category>=-1):
        return 9
    if (Aspect_category>=337.5 and Aspect_category<360):
        return 8
    if (Aspect_category>=0 and Aspect_category<22.5):
        return 8
    if (Aspect_category>=22.5 and Aspect_category<67.5):
        return 7
    if (Aspect_category>=292.5 and Aspect_category<337.5):
        return 6
    if (Aspect_category>=67.5 and Aspect_category<112.5):
        return 5
    if (Aspect_category>=247.5 and Aspect_category<292.5):
        return 4
    if (Aspect_category>=112.5 and Aspect_category<157.5):
        return 3
    if (Aspect_category>=202.5 and Aspect_category<247.5):
        return 2
    if (Aspect_category>=157.5 and Aspect_category<202.5):
        return 1
def Reclass(Slope_category):
    if(Slope_category>=0 and Slope_category<2):
        return 9
    if(Slope_category>=2 and Slope_category<4):
        return 8
    if(Slope_category>=4 and Slope_category<6):
        return 7
    if(Slope_category>=6 and Slope_category<8):
        return 6
    if(Slope_category>=8 and Slope_category<10):
        return 5
    if(Slope_category>=10 and Slope_category<12):
        return 4
    if(Slope_category>=12 and Slope_category<14):
        return 3
    if(Slope_category>=14 and Slope_category<16):
        return 2
      return 1
  1. Based on the categorized values for aspect and slope, a rating model was calculated using the raster calculator to determine if areas of expected greatest soil moisture corresponded to juniper density. The formula for this rating model was: (slope category +aspect category)/2.
  2. As the rating model had a lower spatial resolution and extent than the classified NAIP raster, the classified raster was resampled to be the same resolution as the rating model. The NAIP raster was masked using the extent of the rating model raster. These rasters will be used for further analysis in exercise 3. For the purposes of exercise 2, two maps were created for visual analysis. For future analysis, both rasters were projected to NAD 1983 UTM Zone 10N.
  3. The hot spot analysis from exercise one was overlaid on the rating model to visually assess the patterns over a small scale.
  4. An ordinary least squares (OLS) analysis was performed using the number of juniper counted for each transect as the dependent variable and the slope category, aspect category, and combination of both categories as the explanatory variables.
  5. The standard residual for each OLS was used to calculate Global Moran’s I.
  6. Geographically Weighted Regression (GWR) was used with the same inputs described for the OLS.

3. Overview of results.

Buffer. No clear trend was found between juniper density and the number of juniper sapling found in each transect. However, for all transects the juniper density was lower with increasing buffer distance. Results are displayed in the chart below, points are labeled by distance and “high” (corresponding to transects with 5 or more juniper) and “low” (corresponding to transects with 1 or fewer juniper:


Comparison of Rating Model and Classified Raster. The classified NAIP (prior to resampling and reprojection) and rating model raster are displayed below. Red pixels within the classified raster are pixels classified as juniper. For the rating model, lighter shaded areas correspond to those areas where higher soil moisture is expected based on the slope and aspect characteristics. The hot spot analysis provided limited use (as only one hot spot and one cold spot were indicated) but could be useful for additional analysis. However, the cold spot aligned with areas of expected lower soil moisture and the hot spot corresponded to areas with expected higher soil moisture. Further analysis of these two datasets will continue with exercise 3.

Regression analysis. The R-squared values for OLS for each explanatory variable(s)(slope category only, aspect category only, and a combination of slope and aspect) ranged from -0.025 to -0.060, suggesting that these characteristics were not a good predictor of juniper density (at least for the data set used in this case). The coefficient for each explanatory variable(s) ranged from -0.087 to 0.024, also indicating that this model was not a good fit. The Koenker (BP) statistic was insignificant for all cases. The p-values (p>0.86 for all cases) for Global Moran’s I for each combination of explanatory variables indicated random spatial patterns. As expected, results were similar for the GWR analysis with R-squared values ranging from -0.011 to -0.037. The GWR residual squares ranged from 180.8 to 182.3.


4. Discussion/critique of methods. Based on the buffer analysis, no clear patterns were indicated for repulsion/attraction based on the transect numbers and the density analysis. However, the juniper density (both in range and overall percentage) decreased for the 250m buffer regardless of the number of juniper counted per transect. These patterns are most likely related to the fact that juniper was removed from one watershed fifteen year ago and that the adjacent watershed was left untreated. This coupled with the small sample size may not be sufficient to indicate if spatial patterns between juniper density and saplings found for each transect exist. The OLS and GWR analysis indicated that slope and aspect (at least based on the data used here) were not good predictors of juniper density. For exercise 3, regression analysis of the classified raster and the rating model will take place.

The methods used here were relatively easy to apply, and can be accomplished using tools available in ArcGIS Pro. The ModelBuilder process required time to develop and some issues, particularly with the merge function, required that the tables be manually manipulated. However, ModelBuilder still was more efficient than executing the buffer and zonal histogram for each transect separately. Also, the supervised classification can easily be accomplished within ArcGIS. However, the spatial resolution of the NAIP imagery (1m) likely led to reduced accuracy of juniper sapling identification compared to images with higher spatial resolution. This was the highest resolution imagery available for the entirety of the two watersheds but this methodology could easily be applied to other data.

Overall, the methods used here may inform future analysis to be applied to other datasets. For the purpose of these exercises, very few spatial trends are evident but that may be the result of a)the sample size, b)the effects of juniper removal, and c)characteristics of the dataset (e.g., spatial resolution of the raster). In particular, I would be interested to see how these approaches would work in areas where larger-scale removal has taken place. Additionally, the use of rasters with higher spatial resolution (such as UAV-based imagery) would likely improve the classification and increase accuracy of juniper sapling detection.

Ex 2: Relationship between stream cross-sectional change and across-channel slope

For Exercise 1, I investigated autocorrelation in patterns of cross-sectional change across and along stream reaches.

For Exercise 2, I wanted to know whether these patterns of change were related to channel geometry.

Specifically, I wanted to know if erosion or deposition happened more frequently close to cut banks than further away from them.

I was originally going to investigate change in both the along-channel and across-channel directions, but after consulting with my classmates, I decided that I would not be able to draw as many conclusions in the along-channel direction because I don’t have good information about the spacing and orientation of my cross sections.


To find cross-sectional change, I paired sequential years and calculated change along each cross section for each pair of years as I did in exercise #1.

I wanted to compare the change to the location of the steepest bank, but I couldn’t figure out how to identify what parts of a cross section counted as a “bank” using a computer. Instead, I looked at the spatial pattern of change in relationship to the point of steepest slope in the across-channel direction.

I used the original cross section profiles to identify the point of steepest across-channel slope. Since the point of steepest slope may move from year to year, I used the steepest slope from the first year in every year pair. I used the loess function in R to lightly smooth each cross-sectional profile before extracting slope in hopes of reducing the effects of some small bed features. This worked well in most cross sections, but in some cross sections, especially those with prominent midchannel features, the point of steepest slope occurred in the middle of the channel.

Once I had identified the steepest point in each of the cross section for each year pair, I calculated how far every other point in the cross section was from the steep point. Then, within each reach, I aggregated the data by distance, rounding to the nearest decimeter, and calculated the mean absolute elevation change (that is, counting both erosion and deposition as positive values). I wanted to see broad patterns overall, so the aggregate combines data for every cross section and every pair of sample years.

I plotted the resulting data from each reach. In the figure below, the colors represent how many horizontal centimeters of reach are aggregated into each point on the line graph. Bigger numbers and more blue colors represent averages from more cross sections and years while smaller numbers and more yellow colors represent distances where fewer cross sections or fewer years had data at that distance from the steep point.


The figure implies that perhaps a lot of channel change tends to happen very close to the steepest point, but then stabilizes. Far from the point, the average vertical change values become very unsteady, perhaps because fewer data points are integrated into the average.


I thought that this was an interesting and fairly straightforward analysis to conduct, but I am not sure how physically meaningful the results are, since the steepest points are not placed in the same location in each cross section. The results figure looks a bit like a channel cross section itself, which I thought was very interesting! I wonder if this is because the averaging falls apart at a distance roughly equal to the average channel width or if there really is more change happening near channel banks on average in these streams.



Ex 2: A stain on the neighborhood… How does management relate to infection for black stain root disease?

Sorry for the bad puns in the title… Could not help myself.


How do spatial patterns of black stain root disease infection probabilities relate to spatial patterns of forest management classes? How do these spatial relationships differ between landscapes where similar management classes are clustered and landscapes where management classes are randomly distributed?


I used neighborhood analysis to analyze the spatial relationship between hot and cold spots of infection probabilities and the three forest management classes simulated in my model (extensively managed plantations, intensively managed plantations, and passively managed old-growth stands).

I used a combination of ArcMap (to perform the majority of the procedure) and R (mostly for spatial data wrangling and analyzing and plotting results).


1. Compare the distribution of infection probabilities between landscapes and management classes

I performed this step in R to determine whether there was evidence of significant differences in the infection probabilities (a) between the clustered and the random landscapes and (b) between management classes both within and among landscapes.

2. Hotspot analysis

Converted my raster data of infection probabilities to point data (in R, using the “raster” package) and perform a hotspot analysis (hotspot tool in ArcMap) (Fig. 1). For the hotspot analysis, I used inverse squared distance weighting to conservatively include trees within hotspots.

I also created polygon shapefiles for areas identified as hotspots to the 99% confidence level in each of the landscapes and calculated the area of these hotspots for each landscape, management class, and landscape X management class.

Figure 1. Hotspot analysis overlaid on forest management classes for the “clustered” landscape. The same analysis was performed on the “random” landscape.

3. Select point for neighbor analysis

For this step, I chose one point in a hot spot and one point in a cold spot for each of the two landscapes to perform the neighborhood analysis. In the future, I would write a script to repeat this procedure with a random sample of a large number of points, but for this exercise, I just used one point for each.

For the hotspots, I only used points identified in hotspots at the 99% confidence level. For the cold spots, I used a point from the 99% confidence level for the clustered landscape, but my only options for the random landscape were points identified at the 90% confidence level. I visually selected the points for analysis (non-random), but I would not do so for my full analysis.

4. Create concentric ring buffers for the neighborhood analysis

I used the Multiple Ring Buffer tool in ArcMap to generate hollow ring buffers at a series of distances around each hotspot and cold spot point selected for analysis (Fig. 2). I then intersected these buffers with the management class shapefile and calculated the proportion of each buffer ring composed of each management class (by area). I plotted these proportions as a function of distance to complete the neighborhood analysis.

Figure 2. Concentric ring buffers used for the neighborhood analysis, with outer ring distances labeled. Example shown for the “random” landscape for both hot and cold spots.


Non-spatial analysis

Before performing the neighborhood analysis, I wanted to know whether there was any difference in the infection probabilities between the two landscapes and the three management classes. Visual assessment of the box plots (Fig. 3) led me to believe that there were significant differences between the means of the infection probabilities between the landscapes and management classes, which was supported by the results of student t-tests (p << 0.01). There were also significant differences in the infection probabilities between management classes both within and among the two landscapes. Since the outputs I am analyzing are “dummy data” because my model is not fully complete, this does not surprise me and I did not perform further, more rigorous statistical analysis. The higher infection probabilities simply reflect the density of trees in each of these management classes. Extensive management had the highest infection probabilities (highest initial planting density, evenly spaced), followed by old-growth (lower density but clustered trees) and finally intensive management.

Figure 3. Comparisons of infection probability distribution between (a) landscapes, (b) management classes, (c) management classes by landscape; (d) the proportion of each management class covered by an infection hot spot in each of the two landscapes (by area).

Neighborhood analysis

For the clustered landscape, the management class in which the point was located made up the largest proportion of the first several distances analyzed (cold spot: old-growth, hot spot: extensive). This makes sense given that the management classes were spatially clustered in this landscape. At a certain threshold, the proportion of this management class started to decrease, as the other two management classes increased in proportion at a similar rate. For the random landscape, the decrease in the starting management class proportion was relatively rapid with distance, and all three management classes converged at their landscape proportion by 150 meters, with some fluctuation (in both landscapes, each of the three management classes makes up about 1/3 of the total landscape).

Figure 4. Neighborhood analysis showing the proportion of each (hollow) ring buffer accounted for by each management class for hot and cold spot points in each of the two landscapes.

Critique of the method – what was useful, what was not?

Most of the drawbacks of this analysis were due to my process and the nature of my data. Because I only used one hotspot and one cold spot point for each of the two landscapes, it is difficult to say much from this analysis. If I were to automate the analysis and run it on a series of random points drawn from the hot and cold spots, I would get a lot more insight as to the patterns of neighborhood effects if any exist. In addition, the “dummy data” used for analysis come from model runs that only have local disease transmission (between neighbors at a radius of several meters). However, the full model will also incorporate long-distance dispersal by insect vectors (on the scale of kilometers), which will likely be much more interesting and less predictable when neighborhood analysis is performed.

Issues with the method itself is that it is quite time intensive – an issue that could be cleared up with a script written in Python or R to automate this process given a set of input files. If there is a way to do this already built into either of these platforms (or Arc/QGIS), I was not able to find it. Also, there is a lack of clear quantitative interpretation for these plots to separate statistically significant variability in the proportion of each management class at each distance from non-significant variability. A means of doing so would enrich the analysis.

Exercise 2: Using Neighborhood Analysis to Identify Relationships Between Seascape Classes and Rockfish Abundance Hotspots

Background and Question Asked

In Exercise 1, different interpolation methods were used to create a heat map of rockfish abundance based off of a large collection of point data. That blog discussed some of the challenges that arose while attempting to use a time series of point data with many points in close proximity to one another (if not overlapping). The exploring was in many ways successful: it was discovered that the Kriging method provided a more robust representation of the data than Inverse Distance Weighting. However, in the time since that post was published, my interpolation methods have been refined:

  • Instead of using the entire time series as an input for the interpolation, four individual years were selected to represent the whole dataset (2003, 2007, 2011, 2015). Kriging was then used to create heat maps for each individual year.
  • Additionally, the union tool was used to remove the land boundaries from the environment so that the interpolation only affected parts of the ocean
  • The symbology of the abundance point data was synced across all four years being used in the analyses so that they could be easily compared to one another
  • The symbology of the interpolated heat maps was also modified to be consistent throughout the analyses

For this exercise, I plan to compare my new, interpolated data to an already existing set of data, effectively comparing my two variables. Specifically, I hope to answer the question “Is there a spatial relationship between areas of significantly high and low rockfish abundances and specific seascape classes?”

An example of the most recent Kriging output using the 2007 data.

Name of Tool or Approach Used

I will be using a neighborhood analysis to seek an answer to this question. The neighborhood analysis requires taking areas of interest and examining the environmental conditions around that area from the perspective of another variable. By varying the distance from your original point of interest, a researcher is able to infer about the spatial relationship between the two variables.


Data Used

  • “Points of Interest” chosen from plot below
  • Buffers created around points of interest at 5km, 10km, and 25km radii
  • YOYRock Kriging Abundance Interpolation for 2007
  • Seascape NetCDF Raster File for May 5th, 2007

Rockfish abundance plotted against water column depth for trawls from 2007.

The first thing that was needed to complete this analysis was points of interest. I chose to use four points form the year 2007, as the data from this year provided the largest spatial footprint of all of the years of interest. Two of the points represented trawls that found significantly high rockfish abundance, and the other two represent trawls which found no rockfish. All four points vary spatially and physically (latitude, longitude, water column depth, etc). All points were selected from interpolated areas with different modeled outputs. Next, circular buffers were created around each point of interest with 5km, 10km, and 25km radii.

Map showing Points of Interest with circular buffers overlaid on seascape NetCDF file.

In order to use the overlay tool in ArcGIS, two polygon features are needed. In order to convert my NetCDF Raster files into a polygon, I used the Raster to Polygon tool. Once the seascape classes were converted to polygons, the Intersect tool was used to measure the shape area of each seascape class within each buffer. Those statistics were then converted to .xlsx files and summarized in Excel.

Results and Discussion

Example of data after Raster to Polygon and Intersect tools used

The neighborhood analysis found evidence that specific seascape classes may have impacted young of the year rockfish abundances in the locations selected to be a part of this analysis.

The low-abundance trawls were dominated by three seascape classes: Class 14 (Temperate Upwelling Blooms), Class 19 (Subpolar Shelves), and Class 21 (Warm, Blooms, high Nutrients). While there were more classes represented overall by the high abundance trawls, those areas were mostly dominated by two seascape classes: Class 7 (Unnamed) and Class 12 (Subpolar Nutrient). Additionally, there was very little overlap between the two areas – the only seascape class that appeared in both the high abundance radii and the low abundance radii was Class 14. Further analyses would be needed to determine if these trends are representative to the entire region or year, but this neighborhood analysis provides results that give us a place to start. Overall, I found this analysis to be extremely useful despite the number of steps needed to make it work. In addition to working in GIS normally, the data type of my seascapes had to be changed and much of my analysis had to be done in Excel, as ArcGIS cannot summarize key statistics. However, I feel as though streamlining this method could be done now that I am familiar with it.

Exercise 2: Cross Variogram and Kriging

In Volcanic Regions fluid flow paths are limited to the rubbley bases and flow tops of lava flows where permeability promotes transitivity.


The depths to first water will corresponds to depths located near lava flows.


Contact: location on the surface, or at depth where two different rock types touch.

Depth to first water: the depth from a particular well log where water was first noted, this is not always listed.

Depth to first lava: the depth at which the first lava is noted in a particular well log. There can be multiple contacts to a lava in a well log, which is why I specified first.

Figure 1: Block diagram of what well log depicts. The green and red planes represent contacts between the two rock types. Well (grey) and well logs record these contacts as depths from the surface (brown).


Does the depth to first water correspond to the depth to first lava in my data set?


Cross Variogram and Kriging

Like the variogram, the cross variogram is a tool that allows you to compare spatial data at multiple scales. Unlike the variogram, the cross variogram compares one data set to another data set at multiple scales.

Kriging uses the variogram to interpolate a surface.

Brief Description:

In order to use both the variogram and the cross variogram you must normalize the data you are working with. Otherwise the semivariance values can range from 0 to infinity. Normalizing the data allows you to distinguish data that are correlated (semivariance<1) from data that have no correlation with eachother (semivariance>1).

In order to use the R function gstat, I had to turn the data into a spatial data frame.

The function gstat allows you to simultaneously create variograms for each of the induvidal data sets you are working with and compares them with each other. In this case I just compared the depth to first lava with the depth to first water.

I used Arc GIS kriging formula (ordinary) to krige a surface that represented the difference between between the depth to first lava and the depth to first water. In other words I subtracted the depth to first lava from the depth to first water and kriged that “surface”. I wanted to see if there was a spatial distribution around which those difference were low. I tried to use R to krige, but did not have the time to work out the krige function.


My results were strange, though ultimately unsurprising.

Figure 2: Variograms of my two variables water1 and lava1 as well as the cross variogram that comes from comparing the two. Note that the water1lava1 cross variogram has negative values for the semivariance.

Figure 3: Plot of well log data with the difference between first lava and first water plotted on top of its kriged surface.

The strangest thing to resolve from this exercise are the negative semivariance values for the cross variogram. Semivariance is a squared values and therefore should not have negative values. I have no idea what is happening here. I need to ruminate upon it. Either way, the data does not appear to be well correlates, or at the very least, I am not comfortable making conclusions about it with the negative semivariance values.

The ordinary Kriged surface interpolated from the difference between lava and water lets me know that the highest Kriged surface (Fig 3, white) between water and contact lies in the middle of the study area . In geographic and geologic space this corresponds to a basin filled with sediment and inter-fingered with lava. Many of the well logs in the region are not deep, they don’t have to be because the water here is near the surface, and close to some of these buried basalt flows. The data at the far edges of the map are spatial outlies, and thus we can’t look at any of the map that lies far from the main cluster of data points.

From physically looking at the well logs I know that while the well logs do often correspond to a lava flow, it is almost never the first lava flow. I am not surprised that the semivariance indicates that the data are not correlated.

Critique of the method:

what was useful, what was not?

It was not particularly useful, because it told me what I already know and left me with more questions than answers. However, I did walk away with some considerations. The cross variogram (and variogram) might work at a smaller scale.

In other words, if I broke my field area up in regions where I think the lava layers might source from the same place (Lassen Peak or Medicine Lake Volcano) I would be able to make the assumption that lava layers that are at similar depths in the well log correspond to the same lava flow in space. If we consider figure 1, in a small area we would be able to link the green layer to other locals where the green layers lies at depth, and would be able to spatially autocorrelate them with the variogram.

The next step, one I narrowed down my area, would be to correlate the depth to water with the depth to every lava flow I found in the well log. This would allow me to see which lava layer best corresponds to the depth to first water.

One of the things I discussed with my partner was trying to figure out what the negative values meant in my variogram. As I stated above, I still need to think about this, or figure out what I did wrong. I also discussed taking data out of lat-long space and into UTM space; that is something I am also still thinking about.

One final note: At the moment my data is both clustered around certain spots, and I do not have much of it. Every time I add a few data points, the shape of the variogram changes.  Some of the spikiness I am seeing is likely from that.



Exercise 2: Neighborhood analysis of Texas counties with high colorectal cancer mortality rates

Question being asked

How do rurality indicator variables shift as distance increases away from Texas counties with high colorectal cancer (CRC) mortality?

In exercise 1, I used principal component analysis (PCA) to create a PCA-weighted rural index of the state of Texas using 3 scaled variables: population density, land development percentage, and median income. In this exercise I applied these same variables to determine how they change as distance increases away from the 4 Texas counties with the highest CRC mortality rates. To do this, I created multi-ring buffers around Anderson, Gonzales, Howard, and Newton county and computed averages of each rural indicator variable for each successive buffer “donut.” I hypothesize that as distance increases away from high CRC county centroids, rurality indicator measures will have more “urban” values (i.e. higher population density, higher percent developed, higher median income) and CRC mortality rates will decrease.

Tools and Data Sources Used

For this exercise, I utilized the intersection, feature-to-point, and multi-ring buffer tools in ArcGIS along with the latticeExtra/gridExtra plotting packages in R. The rural indicator data used in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). These data were then scaled using the same procedure from Exercise 1, which can be found here.  Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


Attribute Table Wrangling: The Texas county indicator variables were linked to county polygons in my Exercise 1, but cancer mortality data was not. For polygon linkage in this exercise, I imported the mortality data excel sheet into Arc and used the join procedure to insert the data into the existing attribute table (with indicator variables) for county polygons.

Centroid & Multi-ring Buffer Creation: First, I utilized the point-to-feature tool in Arc to create a layer of county centroid points from the county polygon layer. Once the county polygons had been converted to centroids, I identified the 4 Texas counties with the highest CRC mortality rates. Then, using the select features tool and multi-ring buffer procedure, I selected each of the 4 counties separately and created multi-ring buffers at 50, 75, and 100 Km. These distances were chosen based on the size of the selected counties and the size of the full state of Texas.

Intersection & Donut Summary Statistics: Once the multi-ring buffer layers were created, I intersected each of the 4 buffer layers with the original county polygon layer containing all relevant variables. Then, mean via the summary statistics tool were computed in Arc for population density, percent developed land, and median income for each successive donut in the multi-ring buffers. The computed tables of buffer donut means for each variable and county were then exported to Excel files.

R Plotting: The Excel files were then imported into R and line plots of buffer means by distance were created using the xyplot function within the latticeExtra package. Plots were then combined into figures by county using the gridExtra package.


This figure of scaled population density means for county multi-ring buffer donuts indicates varying trends for population density between the 4 high CRC counties. Two counties have increasing population density as distance increases away from centroids, and two have decreasing population density as distance increases away from centroid. Only the buffer map for population density was presented above for post conciseness and space limitations on this blog site. More specific neighborhood relationships between CRC death rates and all indicator variables can be seen in the line plots and explanations below.

Line Plots of County Indicator Variables Over Buffer Distances

The above plots display with more specificity than the buffer map that areas surrounding the 4 counties have differences in indicator variable trends as distance increases away from county centroids. Both Anderson and Newton counties largely follow the trend hypothesized: as distance from the county centroids increases, rural indicator variables have more “urban” values and CRC mortality rate decreases. For Newton county, this trend does not hold for median income, because as CRC mortality decreases away from the county centroid, median income also decreases. For the other other 2 counties, Gonzales and Howard, the hypothesized relationship does not hold, because as distance increases away from county centroids, the rural indicator variables become more “rural” as CRC mortality decreases. This indicates that the associations between CRC mortality and rural indicator variables are complex and that neighborhood analysis does not capture all relationships.


This sort of neighborhood analysis was effective at determining trends in rural indicator variables in the areas surrounding high CRC mortality counties in Texas. The buffer map produced great broad results, where more generalized trends can be determined. The line plots specifically were highly useful for visualizing more specific changes in indicator variables and CRC mortality rates over distance. These results should be considered in the light of some limitations. First, county level data was used for all variables and the buffer donuts may be too large or too small to capture the true neighborhood relationships in the analysis, as statistical procedures were not utilized to determine the distances. Further, more buffer donuts may have been useful to see more nuanced trends over distance. Secondly, as can be seen in the buffer map, there is a lack of CRC mortality data for many counties in west and southern Texas due to data suppression in order to preserve patient confidentiality. This presents significant bias in result interpretation, especially in Howard county, where many of the counties surrounding it have suppressed CRC mortality data.

My future analysis of this data will likely be a comparative confusion matrix of the PCA-weighted index I created in Exercise 1 and CRC mortality data used in this exercise.

Exercise 2: Spatial Patterns of Perceptions of Natural Resource Governance and Environmental Condition


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?


Environmental condition was represented as: (1) Environmental restoration sites, and (2) aggregate ‘environmental effects’ scores by census tract on a scale from (1) low to (10) high that include lead risk from housing, proximity to hazardous waste treatment storage and disposal facilities, proximity to nation priorities list facilities (Superfund Sites), proximity risk management plan facilities, and wastewater discharge.


Sub-questions related to the primary question are:

a1. Does being near a restoration site associate with individual perception? a2. Does being near more restoration sites associate with individual perception?

b1. Do the environmental effects where an individual lives associate with individual perception? b2. Do the environmental effects around an individual associate with individual perception?

Tools and Approaches

  1. Nearest Neighbor analysis in ArcGIS Pro and R studio
  2. Geographically weighted regression in R studio
  3. Neighborhood analysis in ArcGIS Pro

Analysis Steps

  1. Nearest neighbor analysis was used to examine questions a1 and a2. First, a file of restoration sites was loaded into R. The sites were points in the same projection as my participant location points. This analysis required four libraries (as the points were from different files. The libraries were: nlem, rpart, spatstat, and sf. To use the tool I needed, I first had to turn my points into point objects in R (.ppp objects). First I used the convenxhull.xy() function to create the ‘window’ for my point object, then I used that to run the ppp() function. After doing this for both sets of points, I was able to use the nncross() function. This function produced min, max, average, and quartile distances from individuals to the nearest restoration site. I added the ‘distance’ from the nearest neighbor as a variable in a linear regression to determine an association for a1.


To examine a2, I used these distances (1st quartile, median, and 3rd quartile) to produce buffers in ArGIS Pro. After creating the buffers, I ran spatial joins between them and the restoration sites. This produced an attribute table that had a join count—the number of restoration sites within the buffer. I exported the three attribute tables from the three buffer distances back to R. In R I ran a linear regression with join count as an independent variable.


  1. To test b1, I preformed geographically weighted regression. I used both ArcGIS Pro and R studio to run this analysis. Initially I used the GWR tool in Arc to run the regression, but wanted the flexibility of changing parameters and an easily readable output that R provides. First I joined individual points to my rank data, a shapfile at census tract level. This gave individuals a rank for environmental effects at their location. In R, I used two different functions to run GWR, gwr(), and gwr.basic(). The gwr() required creating a band using gwr.sel, and the gwr.basic required creating a band using bw.gwr. The difference between these functions is that gwr.basic produces p-values for the betas. I ran gwr on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals who I defined as those one standard deviation above and below the mean perception.


  1. I preformed neighborhood analysis to test the final question, b2. First, I created a dataset that was just the upper and lower values of my governance perceptions (one standard deviation above and below the means. I then added buffers to these points at 1 and 5 miles. I then joined the buffers in ArcGIS Pro to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I R I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.


  1. The median distance for individuals from restoration site was 0.037 degrees, 1st quartile was 0.020 degrees, and 3rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was insignificant (p = 0.198). This led to the conclusion that distance from the nearest restoration site does not influence perceptions.


For each regression on the number of sites near an individual, all coefficients were negative. This implies that the more sites near an individual, the more disagreeable their perception was. All produced significant results, but the effect size of number of sites near individuals was very minimal (Table 1).


Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size b p-value Effect Size (rpb)
Buffer 1 (1st quartile) -0.003 0.0181 .003
Buffer 2 (median) -0.002 0.045 .002
Buffer 3 (3rd quartile) -0.001 0.002 .003





  1. For the geographically weighted regression, I am presenting the results from the gwr.basic() model. For this model, I included demographic variables to control for these factors. In the model, rank, life satisfaction, years lived in the Puget Sound, and Race are significant (Table 2). All other variables were insignificant, so I will not discuss their trends. For rank (the main variable of interest), the coefficient was positive. In this case higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Life satisfaction is a variable of how satisfied individuals are with their life overall, which correlated positively with perception (Table 1). Years lived in the Puget Sound correlated negatively, or perception decreased the longer someone lives there (Table 1). Race, a dummy variable of white or not-white, indicated higher perceptions were held by white individuals.


Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 1).


Table 1. Regression results for environmental effects at individuals’ locations.

Variable1 b p-value2
Rank 0.077 0.007**
Life Satisfaction 0.478 <0.001**
Years Lived in the Puget Sound -0.010 0.002**
Sex -0.156 0.289
Area -0.150 0.179
Education -0.002 0.922
Income -0.022 0.622
Race 0.006 0.034*
Ideology 0.103 0.533

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level


  1. The plot of high and low governance values at the two buffers is presented below. The black and red curves represent respondents from the survey that were at least one standard deviation lower than the mean (the mean was neutral). The black curve is average rank with a one mile buffer, and the red curve is average rank at the five mile buffer. The green and blue curves represent respondents from the survey that were at least one standard deviation higher than the mean. The green curve is the average rank at the one mile buffer, and the blue curve is the average rank at the five mile buffer.

What this figure indicates is that there are two peaks in average rank at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects. The bimodal nature of these density functions leads me to believe there is some variable moderating governance perception related to environmental effects.


  1. Critique

The methods I used were all helpful in determining the spatial relation of environmental condition to perceptions of natural resource governance. However, switching between the two programs (ArcGIS and R) was a little bit of a hassle. R has greater flexibility when running analyses, as in running is easier. This meant it was ideal for the analyses that required me to tweak my models multiple times. I also ran into little problems with Arc in terms of loading the data which made everything run slower, but Arc is more intuitive in terms of finding and executing analyses/

Does the spatial arrangement of vegetation cover influence ventenata invasion?

Question Asked

To predict the invasion potential of a species, it is necessary to understand the spatial pattern of the invasion in relation to landscape scale variables. For exercise 2, I explored how the spatial pattern of invasion by the recently introduced annual grass, Ventenata dubia (ventenata) relates to the spatial pattern of vegetation cover categories throughout the Blue Mountain Ecoregion of eastern Oregon.

Tools and Approaches Used

To unpack this question, I performed a neighborhood analysis to explore how the proportion of different vegetation type cover differ at increasing distances from plots with high versus low ventenata cover.

The neighborhood analysis required several steps performed in ArcGIS and in R:

  • I split my sample plot layer in ArcGIS into two layers – one containing plots with only high ventenata cover (>50%) and one containing plots with only low cover (<5%).
  • I buffered each plot by 10m, 50m, 100m, 200m, 400m, and 800m using the “buffer” tool in ArcGIS and then erased each buffer layer by the preceding buffer layer to create “donuts” surrounding each sample points using the “erase” tool in ArcGIS (Fig. 1).
  • I brought in a vegetation cover categories raster file (Simpson 2013) that overlaps with my study area and used the “tabulate area” tool in ArcGIS to calculate the total cover of each vegetation type (meadow, shrub steppe, juniper, ponderosa pine, Douglas-fir, grand-fir, hardwood forest/ riparian, and subalpine parkland) that fell within each buffer for every point. I repeated this for high and low ventenata points.
  • Finally, I consolidated the tables in R and created a line graph with the ggplot2 package to plot how the proportion of vegetation type differed by buffer distance from point (Fig. 2). Cover represents percent cover of each vegetation type at each buffer distance. Error bars at each distance represent standard error. VEDU refers to the plant code for Ventenata dubia (ventenata).

I was also curious to how the high and low points differed from random points in the same area. To explore this I:

  • Created 110 random points that followed the same selection criteria of 1000m proximity to fire perimeter used to select the ventenata sampling points.
  • Repeated steps 2 through 4 above to graphically represent how vegetation cover differs as a function of distance from these random points in relation to low and high ventenata points (Fig. 3).

Results & Discussion

My analysis revealed that vegetation type differs between high and low ventenata sites and random sites within the study area. The high ventenata plots were located entirely in ponderosa pine and shrub steppe vegetation types, but as distance increased from the plots, the distribution of about half of the vegetation types became more evenly distributed (Fig. 2). Ponderosa covers over 75% of the high ventenata 10m buffer areas with shrub steppe making up the remaining 25%. However, as distance increased, ponderosa cover dropped sharply to under 35% at 400m. Shrub steppe gradually declined throughout the 800m distance, and was surpassed by grand fir and Douglas fir by 800m. Meadows covered about 10% of the 50m buffer but declined to about 5% by the 400m buffer. The remaining vegetation types, juniper, riparian, and subalpine fir, were consistently under 5% cover throughout the buffer analysis.

In the low ventenata sites, shrub steppe vegetation was the most dominant, but the distribution was spread more evenly across the vegetation types than in the high ventenata sites (Fig. 2). Shrub steppe vegetation droped from 45% to 30% from the 10m to the 50m buffer, and then remained relatively constant throughout the remaining buffer distances. Like the high ventenata sites, grand-fir gradually increased in cover throughout, becoming the most dominant vegetation type of the 800m buffer. Unlike the high sites, ponderosa pine made up only about 10% of each buffer. Riparian vegetation was the only cover type that remained 0 throughout all the buffers.

In the random sites, the distributions of vegetation type were steady throughout the 800m, with only small fluctuations in cover with increasing distances (Fig. 3). Shrub steppe vegetation type was the highest at about 30% throughout, followed by juniper, ponderosa pine, and grand fir at about 20% cover.

This analysis demonstrates that ventenata could be dependent on specific vegetation types not only at the sample location, but also in the vicinity surrounding the sample area. This is evident in the high ventenata analysis where ponderosa pine cover remains much higher than the low sites and the random sites throughout the 800m buffered area. This analysis also depicts my sample bias as it demonstrates which community types I was targeting for sampling (shrub steppe and dry forest communities), which may not be representative of the area as a whole (as demonstrated in the random points analysis).

Critique of Method

The neighborhood analysis was a useful way of visualizing how vegetation type changes with distance from high and low ventenata points and may have helped uncover the importance of large areas of ponderosa pine as a driver of invasion; however, the results of the analysis could be a relic of my sampling bias towards shrub steppe and dry forest communities rather than an absolute reflection of community drivers of ventenata. The vegetation layer that I used was also not as accurate or as detailed as I would have liked to capture the nuance of the different shrub steppe and forest community types that I was attempting to differentiate in my sampling. If I were to do this again, I would try to find and use a more accurate potential vegetation layer with details on specific community attributes. Additionally, the inclusion of error bars was not possible using the “multiple ring buffer” tool in ArcGIS, so, I instead had to make each buffer distance as a separate layer and erase each individually to maintain the variation in the data.  I like the idea of the random points as a sort of randomization test; however, more randomizations would make this a more robust test. With more time and more knowledge of coding in ArcGIS/ python, I would attempt a more robust randomization test.


Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

On the relationship between spatial variation of economic losses and tsunami momentum flux


For Exercise 1, I evaluated the spatial pattern of economic losses resulting from a joint earthquake and tsunami event. I then deaggregated the losses by hazard (earthquake only, tsunami only, and combined) as well the intensity of the event.

For Exercise 2, I evaluated how the spatial pattern of economic losses resulting from a tsunami relates to the spatial pattern of tsunami momentum flux (a measure of velocity and inundation depth) by performing a geographically weighted regression (GWR). For this analysis, I only considered the tsunami because there is significant spatial variation in the hazard, whereas the spatial variation for the earthquake is minimal.

Tool and approach

I performed the GWR using the python library PySAL (Python Spatial Analysis Library). The independent variable was defined as the momentum flux, and the dependent variable defined as the percent loss of economic value.

Description of steps

The average losses at each building resulting from an earthquake/tsunami loss model were first converted to percent loss (loss divided by real market value), and added as an attribute to a GIS shapefile. The percent loss was used as opposed to the economic losses because each building has a different initial value. Consequently, the percent loss serves to normalize the economic losses across all buildings within Seaside. For this analysis, the results from the “1000-year” tsunami event were analyzed.

The GWR was then performed using PySAL with the momentum flux defined as the independent variable and the percent loss defined as the dependent variable. The GWR resulted in a slope and intercept at each tax lot, as well as a global r2 value. Two separate maps were generated wherein each tax lot was color coded based on values of the slope and intercept.

Description of results

The results from the GWR and a global regression are shown in Figures 1 and 2 respectively. A global r-squared value of 0.575 was obtained, indicating that the data is moderately correlated. In Figure 1, it can be seen that the intercept is larger near to the ocean, and decreases as the distance to the shore increases. This can be explained by the fact that the momentum flux is the largest near to the coast, and decreases as the tsunami propagates over the land.

Similar trends would be expected for the slope coefficients; however, it can be seen that along the coast the results are negative indicating that the economic losses decrease as the momentum flux increases. This can likely be explained by inconsistent building types within Seaside. For example, concrete buildings are able to better withstand the impact of a tsunami compared to their wood counterparts. Similarly, buildings of different heights (number of floors) have different damage properties. Consequently, because the building types are not consistent within Seaside, significant variations in the percent of loss within a small spatial region can occur (e.g. a wood building is located next to a concrete building). This would lead to a decrease in percent loss for a larger momentum flux.

Figure 1: Spatial variation of slope and intercept resulting from the GWR

Figure 2: Global regression and line of best fit


While the GWR does provide a means to evaluate correlation between two variables that are within the same geographical region, there are limitations for this particular application. The results showed negative slopes in some locations, which is likely caused by the large variation in the percent loss. To alleviate this, alternative statistical models could be developed using GWR that only consider similar building types. An example of a non-spatial regression for wood buildings with 2 and 3 floors can be seen in Figure 3. The improvement in r-squared values can be observed, and would likely translate to the GWR.

Figure 3: Example of global regression considering specific building types


Fire Refugia’s Effects on Clustering of Infected and Uninfected Western Hemlock Trees


For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

In my study site, 2 fires burned. Once in 1829, burning most of the stand, and then again in 1892, burning everywhere except the fire refugia (polygons filled in blue). This created a multi-storied forest with remnant trees located in the fire refugias. One component of the remnant forest are infected western hemlocks. These remnant hemlocks serve as the source of inoculum for the hemlocks regenerating after the 1892 fire.

For Exercise 2, my research question was: How does the spatial pattern of fire refugia affect the spatial pattern of western hemlock dwarf mistletoe?

I predicted that a cluster of infected western hemlocks are more likely to be next to a fire refugia than a cluster of uninfected trees. In order to assess this relationship, I used the geographically weighted regression tool in ArcMap.

Geographically Weighted Regression

Geographically weight regression (GWR) works by creating a local regression equation for each feature in a data set you want to analyze, using an explanatory variable(s) to predict values for the response variable, using the least squares method. The Ordinary Least Squares (OLS) tool differs from GWR because OLS creates a global regression model (one model for all features) whereas GWR creates local models (one model per feature) to account for the spatial relationship of the features to each other. Because the method of least squares is still used, assumptions should still be met for statistically rigorous testing. The output of the GWR tool is a feature class of the same type as the input, with a variety of attributes for each feature. These attributes summarize the ability of the local regression model to predict the actual observed value at that feature’s location. If you have an explanatory variable that explains a significant amount of the variation of the response variable, this is useful for seeing how its coefficient varies spatially.

Execution of GWR

To use this tool, I quantified the relationship between the trees and the fire refugia. I used the “Near” tool for this to calculate the nearest distance to a fire refugia polygon’s edge. This was my explanatory variable. My response variable was the z-score that was output for each tree from the Optimized Hot Spot Analysis. Then I ran the GWR tool. I then used the Moran’s I tool to check for spatial autocorrelation of the residuals. This is to check the clustering of residuals. Clustering indicates I may have left out a key explanatory variable. The figure below displays my process.

I tested the relationship between nearest distance to a fire refugia polygon’s edge and the z-score that was output for each tree from the Optimized Hot Spot Analysis using OLS, which is necessary to develop a well specified model. My R2 value for this global model was 0.005, which is incredibly small. Normally I would have stopped here and sought out other variables to explain this pattern, but for this exercise I continued the process. 


This GWR produced a high global R2 value of 0.98 (Adj R2 0.98) indicating that distance to refugia does a good job of explaining variance in the spatial pattern of infected and uninfected trees. However, examining the other metrics for the local model performance gives a different picture of model performance.

Map 2 displays results for the coefficients for the explanatory variable of distance to nearest refugia. As this variable changes, the z-score increases or decreases. These changes in z-scores indicate a clustering of high or low values. From examining the range of coefficient values, the range is quite small, -0.513 to 0.953. This means that across my study site, the coefficient only changes slightly from positive to negative. In the north western corner, we see a cluster of positive coefficient values. Here, as distance to refugia increases, the z-score of trees increases, predicting a clustering of infected trees. These values are associated with high local R2 values (Map 4). In other places of the stand we see slight clustering of negative coefficients, indicating distance to refugia decreases the z-score of trees, predicting a clustering of uninfected trees.

Map 3 displays the standardized residuals for each tree. Blue values indicate where the local model over-predicted what the actual observed value was, and red values are under-predictions. When residuals from the local regression models are distributed randomly (i.e. not clustered or dispersed) over the study area, then the geographically weighted regression model is fit well, or well specified. The residuals of the local regression models were significantly clustered. (Moran’s Index of 0.265, p-value of 0.000, z-score of 24.344). Because we can observe clustering in my study area of residuals, there is another phenomenon driving the changes in z-scores; in other words, driving the clustering of infected and uninfected trees.

From the previous two map evaluations I saw that the distance of a tree to fire refugia was not the only explanatory variable necessary to explain why infected and uninfected trees clustered. Map 4 displays the local R2 values for each feature. The areas in red are high local R2 values. We see the northwestern corner has a large number of large values which correspond to a cluster of small residuals and positive coefficients. Here, distance to fire refugia explains the clustering of infected trees well. The reverse is observed in several other places (clusters of blue) where distance to fire refugia does not explain why infected or uninfected trees cluster. In fact the majority of observations had a local R2 of 0.4 or less. From this evaluation, I believe this GWR model using distance to refugia does a good job of explaining the clustering of infected trees, but not much else.


GWR is useful for determining how the coefficient of an explanatory variable can change across an area. One feature in a specified area may have a slightly different coefficient from another feature, indicating these two features are experiencing different conditions in space. This allows the user to make decisions about where the explanatory has the most positive or negative impact. This result is not something you can derive from a simple OLS global model. This local regression process is something you could do manually but the tool in ArcMap makes this process easy. The output of GWR is also easy to interpret visually.

Some drawbacks are that you need to run the OLS model first for your data to determine which variables are significant in determining your response variable. If not, then a poorly specified model can lead to inappropriate conclusions about the explanatory variable (i.e. high R2 values). Also, the evaluation of how the features interact in space is not totally clear. The features are evaluated within a fixed distance or number of neighbors, but there is no description for how weights are applied to each neighboring feature. Lastly, for incidence data, this tool is much harder to use if you want to determine what is driving the spatial pattern of your incidence data. Some other continuous metric (in my case a z-score) must be used as the response variable, making results harder to interpret.

Model Results Follow-Up

After finding that distance to a refugia was not a significant driver for the majority of trees, I examined my data for other spatial relationships. After a hotspot analysis on solely the infected trees, I found that the dispersal of infected trees slightly lined up with the fire refugia drawn on the map (Map 5).

Among other measures, forest structure was used to determine where fire refugia were located. Old forest structure is typically more diverse vertically and less clustered spatially. Also infected western hemlocks are good indicators of fire refugia boundaries because as a fire sensitive tree species, they would not survive most fire damage and the presence of dwarf mistletoe indicates they have been present on the landscape for a while. From the map we can see that the dispersal of infected trees only lines up with the refugia in a few places. This mis-drawing of fire refguia bounds may be a potential explanation for under-performance of the GWR model.

Courtney’s EX2: Comparing faults and principal components

Question asked in this exercise:

How does ion principal component 2 at a well vary with the well’s distance from faults along the groundwater flow path?

In EX1, I used principal component analysis to evaluate how parameters accounted for variance between the wells I studied. Based on my knowledge of how chemistry varies as water flows through the basalt, ion PC2 accounts for variance caused by ion exchange between the basalt and the groundwater, with increasing sodium ion concentration/pH and decreasing calcium/magnesium ion concentrations as the water spends more time underground.

In this exercise, I estimated the groundwater flow directions in my study area using interpolation, calculated fault incidence direction, calculated angular difference between flow direction and fault direction, and then manual measurement of the distance between each well and the distance to the nearest fault segment that had flow direction within 45 degrees of parallel to the fault along the estimated flow path.

Name of tool or approach used:

Interpolation, reclassification, raster math, distance measurement in ArcGIS Pro


Input data:

  • 2018 static water levels in wells provided by Oregon Water Resource Department (OWRD)
  • Well lithology from OWRD groundwater information system (GWIS)
  • Well seal depth from well logs accessed through OWRD GWIS
  • Fault polyline shapefile from Madin and Geitgey 2017
  • Well locations from OWRD database, with ion concentration information based on my sampling in the summer of 2018


  1. Classified wells in the static water level dataset by the basalt formation that they were open to, based on lithology and seal depth. Excluded wells that lacked this information.
    1. Output: wells classified as open to Saddle Mountain Basalt (Smb) Wanapum Basalt (Wb), Grande Ronde Basalt (Grb), or both Wanapum and Grande Ronde Basalt (WbGrb). These classifications were joined to the static water level information.
  2. Created an interpolation surface for static water levels of wells open to both the Wanapum and Grande Ronde. I ignored the Saddle Mountain formation, since the wells that I had sampled were not open to it. I used kriging with a cell size of 200 ft, and this created an estimated potentiometric surface for these wells. The interpolation was a bit ugly because my wells were not ideally distributed for this.
    1. I tried creating interpolations based on other combinations of formations to better approximate the potentiometric surfaces posited by past studies in the region. I ended up creating two potentiometric surfaces: one using wells that were only open to the Wanapum Basalt (Wb), and a second using wells that were open only the Grande Ronde Basalt as well as those that were open to both the Grande Ronde Basalt and Wanapum Basalt(WbGrb_Grbonly).
  3. Calculated flow direction in the two aquifer groups – Wb and WbGrb_Grbonly
    1. Used the Hydrology toolbox to fill sinks and then calculate flow direction.
    2. This creates out a raster with eight possible values between 1 and 256.
    3. I then reclassified it so the values corresponded to the eight primary cardinal directions (N, NE, E, SE, S, SW, W, NW), which range from 0 to 360 degrees
  4. Calculated fault incidence angle
    1. Added a cell to the polyline attribute table called “angle”
    2. Split the polyline at each vertex, which creates a new shapefile
    3. Used the field calculator and a python script to assign an angular value between 0 and 359 to each fault polyline segment
    4. Performed the polyline to raster function, using the angle as the cell value. I used a cell size of 200 ft.
    5. This created a raster where only pixels that include part of a fault polyline had values, and those values ranged between 0 and 359.
  5. Subtracted the flow direction raster from the fault incidence angle raster, in order to create fault line pixels that had values that reflected the difference between the fault incidence angle and flow direction. I did this twice, once each for the Wb and WbGrb_Grbonly rasters
    1. Reclassified this raster so that pixels with fault direction within 45 degrees of parallel to the flow direction were 0, and the pixels with fault direction with 45 degrees of perpendicular to the flow direction were 1.
    2. I then added the WB and WbGrb_Grbonly results together, so that pixels with fault direction within 45 degrees of parallel to the flow direction in both were 0, and the pixels with fault direction within 45 degrees of perpendicular in either raster to the flow direction were 1, and pixels with fault direction within 45 degrees of perpendicular in both rasters were 2. I named this allwbgrb_ff.
  6. On a map layout, I added my sampled sites with PCA data, the interpolated potentiometric surface for wells open to the Wanapum and Grande Ronde aquifers, and allwbgrb_ff.
    1. I added a field to the sampled sites with PCA data called “dist_from_fault”
    2. Using the measure tool, I measure the distance from each well along the path most perpendicular to potentiometric contours to the nearest allwbgrb_ff pixel with a value of 1 or 2.
    3. This was subjective because the potentiometric surface is imperfect because of the erratic spacing of wells. In areas where the potentiometric surface had noticeable glitches, I used my own judgement based on topography and literature about groundwater flow direction in the region.
  7. I then graphed the “dist_from_fault” against the ionsPC2 category.

Discussion and Results:

I hypothesized that ion PC2 would increase with decreasing potentiometric surface elevation. An increased score in ionPC2 indicates an elevation sodium concentrations and pH and a decrease in calcium and magnesium caused by a progressive ion exchange reaction between the groundwater and the basalt. Because water flows from higher potentiometric elevations to lower potentiometric elevations, I would expect water samples from lower potentiometric elevations to show chemical evidence of increased interaction with the basalt. If this process of down-gradient groundwater flow were the only process influencing the ion exchange reactions, the well symbols on the map below would become progressively darker as the interpolated well level surface elevation decreased and the wells were further from the up-gradient recharge zone.

However, upon examining a map of ion PC2 values this is not the case – there are anomalously low values of PC2 in the valley, where one would expect to see increased values if groundwater flowed uninterrupted from the up-gradient recharge zones. In this study I introduced the variable of distance from faults in order to test another hypothesis: that faults compartmentalized groundwater flow, blocking lateral flow through the aquifer while promoting vertical permeability and modern recharge into the down-gradient aquifer. I also hypotehsized that if a fault was a barrier, PC2 values up-gradient of the fault would be elevated as the fault trapped water behind it. This would result in more chemically evolved groundwater backed up behind the fault, and less chemically evolved groundwater down-gradient of the fault.

The results of this study tentatively support the conceptual model of fault compartmentalization. In particular, water samples from wells in the valley down-gradient of the fault zones have evidence of less exposure to the basalt than wells further towards the recharge zones. 15 of the 18 wells sampled show a positive correlation between distance from a fault along a flow path and ion PC2 score, especially when graphed points are compared to their up-gradient and down-gradient neighbors (i.e. 57946 being up-gradient from 57235 and down-gradient from 54277).

Because of the width of the raster cells indicating faults and their flow direction in this model, four wells ended up have a distance from faults of 0. This does not seem unreasonable, because examination of exposed fault zones in the area indicated that many are up to a couple hundred meters wide. Additionally, while geological studies of the area indicate that the faults are close to vertical, their exact dip angles are unknown and this introduces a certain amount of uncertainty about their location at depth. Ion PC2 values of wells with dist_from_fault = 0 show an interesting dichotomy of either very high (4167, 4179) or very low (3929, 3962) values. I believe this indicates that wells 4167 and 4179 are up-gradient of faults that are acting as barrier, while 3929 and 3962 are slightly down-gradient or within the fault damage zone itself.

map of fault locations, potentiometric surface, and wells, plus a chart of distance vs ion PC2

Critique of Method:

This is admittedly a crude method of estimating groundwater flow direction. However, a certain amount of estimation is necessary to model a system with relatively few and unevenly spaced measured water level points in a hydrogeologic regime with many individual water-bearing layers of lava interflow zones that are unpredictably connected. I wish I had been able to find an automated way to measure well distance from faults along the groundwater flow path, which would have taken away some of the subjectivity of measuring the flow paths by hand the old-fashioned way.

Because of the uneven spacing of wells used for interpolation, the flow direction rasters that I created have glitchy areas and some of these areas of physically improbable values influenced the Fault and GW Flow Direction raster. I reclassified that raster into broad categories to avoid creating a false sense of precision.