Category Archives: Exercise 3

Nearest Neighbor Analysis on Woodpecker Nest Locations Within Salvage Logging Units

Exercise 3: Nearest Neighbor Analysis


How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)


Average Nearest Neighbor in ArcMap

The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor’s centroid location and averages the distances for a nearest neighbor ratio. If the ratio is less than the average for a hypothetical random distribution, the feature distribution is clustered. If the ratio is greater than a hypothetical random distribution average, the features are dispersed.

Near Distance in ArcMap

The Near tool measures near distances between one set of target features and another set of target features. It produces additional columns in the original shapefile containing distance measurements in units of the user’s choosing. The user can enable a location option displaying X and Y distances individually.


For this analysis, I used 2016 and 2017 woodpecker nest point datasets clipped to each RMRS survey unit. I also used a polygon shapefile of 35 salvage harvest units within RMRS woodpecker survey units. I used another polgyon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Nearest Neighbor Analysis Steps

  1. Export 2016 and 2017 nest points as one nest shapefile per RMRS survey unit for inputs into the Average Nearest Neighbor tool.
  2. Export salvage treatment polygons into three shapefiles for treatments 1, 2, and 3.
  3. Run the Average Nearest Neighbor tool in ArcMap with multiple inputs. I ran the tool on each 2016 and 2017 nest shapefile per survey unit, all 2016 and 2017 nests, all salvage units, and each salvage treatment type shapefile (see table below).
  4. Run the Near tool to produce near distances in meters for the distance of each nest to a salvage unit. I ran this tool for 2016 and 2017 nests using salvage unit centroids and salvage unit polygons, which produced different results as expected.
  5. Use Excel to create an Average Nearest Neighbor table for comparing 2016 and 2017 results.
  6. Use ggplot in R to plot near distance graphs for 2016 and 2017. These graphs display near distance distributions for nest points to salvage unit centroids and near distances for nest points to salvage polygon boundaries.


Near Distances

I produced boxplots displaying near distance distributions for nest points. Sets of graphs are featured with and without the control nests for different visual interpretations.

Near Distances for Nest Points to Salvage Centroids (With Control Nests)

Near Distances for Nest Points to Salvage Centroids (Without Control Nests)

Near Distances for Nest Points to Salvage Polygons (With Control Nests)

Near Distances for Nest Points to Salvage Polygons (Without Control Nests)

Nearest Neighbor Results Table

Above: Nearest neighbor results for woodpecker nests in each survey unit in 2016 and 2017. The NN Ratio is a threshold for expected clustering or dispersion. NN Ratio values less than 1 indicate clustering and NN Ratio values greater than 1 indicate dispersion. In 2017, green cells indicate an increase in value and red cells indicate a decrease in value. NN Ratio cells with an increased value in 2017 indicate units where nests increased dispersion. NN Ratio cells with a decreased value in 2017 indicate units where nests increased clustering. Alder Gulch (Treatment 3), Lower Fawn Creek (Treatment 2), and Upper Fawn Creek (Treatment 2) demonstrated increased clustering in 2017. Big Canyon (Treatment 1), Crazy Creek (Treatment 1), and Sloan Gulch (Treatment 3) experienced increased nest dispersion in 2017. All control units experienced increased or present dispersion in 2017. The “All Nests” and salvage shapefile results were produced for comparisons and to evaluate how clustering/dispersion within datasets may affect clustering/dispersion of other datasets.

With these two techniques creating near distance values and a nearest neighbor table, we can implement a refined nearest neighbor analysis. This analysis examines nearest neighbor distances in combination with distances to the study boundary. Refined nearest neighbor uses standardized formulas integrating the distribution of observed nearest neighbor distances and the distribution of expected (random) nearest neighbor distances while factoring in distance to unit boundaries. In the table above, the NN Expected and NN Observed values will aid  refined nearest neighbor results.

Problems and Critique

Did this nearest neighbor analysis account for the target area size and shape? In the ArcMap Average Nearest Neighbor help documents, these assumptions are made:

  • “The equations used to calculate the average nearest neighbor distance index (1) and z-score (4) are based on the assumption that the points being measured are free to locate anywhere within the study area (for example, there are no barriers, and all cases or features are located independently of one another).”
  • “The average nearest neighbor method is very sensitive to the Area value (small changes in the Area parameter value can result in considerable changes in the z-score and p-value results). Consequently, the Average Nearest Neighbor tool is most effective for comparing different features in a fixed study area.”

Because of this, I’m not sure whether I can interpret these results properly based on the study design. We are surveying specially designated study units, not the entire burn area. Therefore, our sample calculations should not assume the entire area is available for clustering or dispersion. There is an option in ArcMap to input “Area” as an integer but not as a polygon shapefile (for survey and salvage units). To control for this, I calculated Average Nearest Neighbor on nest shapefiles for each survey unit instead of all nests together. However, as shown in the table above I also calculated Average Nearest Neighbor for all nests and different combinations of salvage units. These results may prove less reliable.

I would like to extend this analysis to a nearest neighbor technique adjusting for both the clustering/dispersion of the nests and the clustering/dispersion of the salvage units and survey units at the same time. The predetermined study design continues to require special consideration.

Cluster analysis of watershed lithology in Oregon coastal systems

The question that I asked for Exercise 3 was: how similar are the lithological profiles of Oregon coastal watersheds?

I used cluster analysis to answer this question. Broadly, cluster analysis is a way of grouping data that minimizes within-group variance and maximizes between-group variance. There are many ways of performing a cluster analysis, usually involving statistical optimization computations. For this exercise, I used hierarchical clustering, which builds a hierarchical grouping network based on the distribution of numerical attributes in a dataset. My dataset was the lithological profile of 35 watersheds in the Oregon Coast Range. The lithological profile refers to the percentage of each lithology, including metamorphic, plutonic, sedimentary, surficial sediments, tectonic, and volcanic types.


  1. Computing lithological profile for each watershed
    1. Downloaded “Oregon Geologic Data Compilation – 2015” from Oregon Spatial Data Library.;id=e71e1897f5864b689a3a4a131287a309
    2. Used ArcMap to dissolve the geology layer based on the GEN_LITH_TY field, which classified each geological type into either: metamorphic, plutonic, sedimentary, surficial sediments, tectonic, or volcanic.
    3. Used ArcMap to intersect geology data with study watersheds. The Intersect tool subsets the dissolved lithology layer to only include data bounded by the study watersheds.
      1. Ran Intersect on multiple different shapefiles: wshd4, wshd5, and all Coos watersheds
      2. The output of the Intersect tool is a shapefile with each discrete lithological type as a single feature. As a result, there was often more than one polygon per lithology in each watershed.
    4. Used Calculate Geometry tool in ArcMap attribute table to calculate area (in square kilometers), and the latitude and longitude of each watershed polygon centroid.
    5. Exported attribute tables of each output shapefile and standardized columns in Excel. Exported as CSV and reformatted table in R so that rows demonstrated each watershed and their lithological profile. Used dplyr to group each lithological type and sum percentage. Used tidyr (spread function) to make each lithology a column.
  2. Cluster analysis
    1. Used the daisy function in R (package: cluster) to compute all pairwise dissimilarities between the 35 study watersheds. The dissimilarity distances were calculated using the Euclidean method.
    2. Performed hierarchical clustering analysis (HCA) on the dissimilarity matrix computed in Method 2a. I used the agnes function to compute the HCA using Ward’s method. The minimum variances between groups in Ward’s method are calculated by squaring the Euclidean distance.
    3. Validated hierarchical clustering procedure using eclust (package: factoextra) and fviz_silhouette. The latter function is just for visualizing the results from the former, the cluster validation.


The HCA resulted in a fairly successful clustering with an agglomerative coefficient of .96. The agglomerative coefficient is a measure of the clustering structure, with values closest to one representing a high degree of dissimilarity between clusters. The cluster validation demonstrated that all but two watersheds were adequately grouped (Figure 1).

Figure 1.Results from cluster analysis silhouette plot.

Figure 2. Dendritic hierarchical clustering results. Three numbers are written below each final branch: the top is the watershed ID, the middle is the percent sedimentary lithology (blue), and the bottom is the volcanic geology (purple).

The HCA resulted in two main groups with many sub branches (Figure 2). By comparing the lithology percentages to the clusters, I determined that one main branch includes watersheds dominated by volcanic lithology and the other branch includes watersheds dominated by sedimentary lithology. Eight of the 35 watersheds are primarily volcanic, 26 are primarily sedimentary, and one watershed does not fit this classification (Watershed 31: 85% surficial sediments).

Critique of Method

This method was very useful for me because it helped me characterize the lithology of my study watersheds in a quantitative way, and then identify the dissimilarities between the lithological profiles of each watershed. The results from this exercise will advance my ability to compare hydrologic regime characteristics across coastal watersheds and the driving controls on such hydrological processes. For example, the dendrogram in Figure 2 shows that there are five watersheds with 100% sedimentary lithology and five watersheds with predominantly volcanic geology. Such statistics will help me stratify my study design and guide my flow regime analysis.

Manipulating salinity to create a better fit habitat suitability model for Olympia oysters

Follow-up from Exercise 2
In Exercise 2, I compared Olympia oyster location data to the model of predicted suitable habitat that I developed in Exercise 1. Results from that analysis showed that 13 of the 18 observations within or near areas of high-quality habitat (type 4) indicated the presence of Olympia oysters (72%) versus 5 locations where oysters were not found (28%). No field survey locations fell within areas of majority lowest quality habitat (type 1). Seven observations were found within the second lowest quality habitat type (2), with 2 of those observations indicating presence (29%) and 5 indicating absence (71%).

Habitat suitability
4 3 2 1
Presence 13 [0.72] 4 [0.4] 2 [0.29] 0 [0]
Absence 5 [0.28] 6 [0.6] 5 [0.71] 0 [0]
Total (n = 35*) 18 [1.0] 10 [1.0] 7 [1.0] 0 [0]

*3 data points removed from analysis due to inconclusive search results.

To expand on this analysis, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1 since there were not observations in this category, and type 3 since it fell in between high and low-quality habitat.

Habitat suitability
4 (high) 2 (low)
Presence 0.37 0.06
Absence 0.14 0.14

Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The habitat suitability model predicted that oysters would be present within the highest quality habitat type and absent in low-quality habitat. The confusion matrix shows that the model was successful in predicting that 37% of the total observations where oysters were present were found within habitat type 4 (high), and 14% of the observations where oysters were absent were found in habitat type 2 (low).

In the type 4 habitat, 14% of the total observations found that oysters were absent, which goes against the model prediction. I suspect this is partly due to the patchy nature of substrate availability in Yaquina Bay and the low-resolution quality of the substrate raster layer used for analysis. For the 6% of observations that show oyster presence within habitat type 2, it’s possible that these points were juvenile oysters that were able to settle in year-1, but are less likely to survive into adulthood. Both of these errors could also indicate issues with the weights assigned in the model back in Exercise 1.

Question asked
For exercise 3, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations and account for the errors found in exercise 2. Here I asked:

Can the spatial pattern of Olympia oyster location data be more accurately described by manipulating the spatial pattern of one of the parameters of suitable habitat (salinity)?

I decided to modify the rank values of one of the model parameters: salinity. Based on my experience collecting oyster location data in the field, it seemed that salinity was the biggest influence in where oysters would be found. It was also the easiest parameter to change since it had the fewest rank categories. The excerpt below comes from the ranking value table I established for the habitat parameters in Exercise 1. Changes to rank value for salinity are indicated in the right-most column.

Habitat parameter Subcategories Subcategory variable range Olympia oyster tolerance Rank value applied
Mean wet-season salinity (psu) Upper estuary < 16 psu somewhat, but not long-term 1 –> 2
Upper mid estuary 16.1 – 23 psu yes 4 –> 3
Lower mid estuary 23.1 – 27 psu yes 3 –> 4
Lower estuary > 27 psu somewhat 2 –> 1

Name of tool or approach
I combined my approach from exercise 1 and exercise 2 to create a different model output based on the new rank values applied to the salinity parameter. The analysis was completed in ArcGIS Pro and the table of values generated was reviewed in Excel.

Brief description of steps to complete the analysis

  1. After assigning new rank values to the salinity parameter, I applied a new ‘weighted overlay’ to the salinity raster layer in ArcGIS. As I did in exercise 1, I used the ‘weighted overlay’ tool again to combine the weighted substrate and bathymetry layers with the updated salinity layer. A new map of suitable habitat was created based on these new ranking values.
  2. Then, I added the field observation data of oyster presence/absence to the map and created a new map of all the data points overlaid on habitat suitability.
  3. I then created buffers around each of the points using the ‘buffer’ tool. In the last analysis, I used the ‘multiple ring buffer’, but was only able to analyze the largest buffer (300m). This time, I created only the one buffer around each point.
  4. Using the ‘Zonal Statistics’ tool, I overlaid the newly created buffers on the updated raster of habitat suitability and viewed the results. I again chose ‘majority’ as my visual represented statistic, which categories the buffer based on the habitat suitability type occupying the largest area.
  5. I also created a results table using the ‘Zonal Statistics as Table’ tool, then copied it over to Excel for additional analysis.

An updated table based on manipulated salinity rank values was generated to compare to the table created from exercise 2 and displayed at the top of this blog post. Results from this analysis showed that only 2 of the 35 total observations fell within or near areas of high-quality habitat (type 4), one indicated presence and the other absence. The adjustments to the salinity rank value allowed the habitat type 3 to dominate the map, with 31 of the total 35 observations falling in this category. Of the 31 points, 18 showed presence data (58%) and 13 were absence data (42%). Again, no field survey locations fell within areas of majority lowest quality habitat (type 1). Two observations were found within the second lowest quality habitat type (2), both indicating absence (100%).

Habitat suitability
4 3 2 1
Presence 1 [0.5] 18 [0.58] 0 [0] 0 [0]
Absence 1 [0.5] 13 [0.42] 2 [1.0] 0 [0]
Total (n = 35) 2 [1.0] 31 [1.0] 2 [1.0] 0 [0]

Again, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not correlate with my model of predicted suitable habitat. I removed habitat suitability type 1 since there were not observations in this category.

Habitat suitability
4 (high) 3 2 (low)
Presence 0.03 0.51 0
Absence 0.03 0.37 0.06


Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category, with a near even split between presence (18) and absence (13). In reference to the confusion matrix from exercise 2 at the top of this blog, it is difficult to make a direct comparison of the errors since most of the observations fell into type 3.

Critique of the method
I was surprised to see how drastically the map of suitable habitat changed by manipulating only one of the habitat parameters. The adjustment of the rank values for salinity resulted in a vast reduction in area attributed to the highest quality habitat (type 4). The results indicate that choosing the salinity parameter to manipulate did not result in a better fit model and that changes to salinity rank values were too drastic. Since the salinity parameter contains only 4 subcategories, or 4 different weighted salinity values, the impacts to the habitat suitability map were greater than if the parameter had had more nuance. For example, the bathymetry parameter has 10 subcategories and a reworking of the ranking values within could have made more subtle changes to the habitat suitability map.

The next steps would be to examine another parameter, either substrate or bathymetry, to see if adjustments to ranking values result in a more appropriate illustration of suitable habitat. Additionally, the collection of more oyster location data points will help in creating a better fit model and understanding the nuances of suitable habitat in Yaquina Bay.


Exercise 3: Poisson regression analysis of Texas rural index scores and colorectal cancer mortality counts

Research Question

What is the association between Texas county rurality index score and colorectal cancer (CRC) mortality?

In Exercise 1, I created a rurality index for Texas counties using rural indicator variables, while in Exercise 2, I selected Texas counties with the highest CRC death counts and created multi-ring buffers around them to visualize and measure how rural indicator variables shift as ones moves away from the counties. In this exercise, I am doing a more direct analysis of CRC mortality and my rurality index by utilizing a Poisson regression model. I expect the results to show that as index scores become more “rural,” county CRC death counts will increase.

Tools and Data Sources Used

The analysis, diagnostic, and graphing procedures for this exercise were all completed using the following R functions/packages: glm(), vcd, AER, and car. More specifically on the R tools, a generalized linear model (GLM) of the family Poisson was utilized for modeling, a rootogram from the vcd package was used for Poisson goodness-of-fit, a test for model over-dispersion was utilized from the AER package, and an influence plot was created using the car package. Other goodness-of-fit diagnostics and statistical measures were ascertained via the base glm() procedure.The rural indicators utilized for the index in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Like in my Exercise 2, aggregated CRC mortality counts per 100,000 population for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


Attribute Table Conversion to R: First, the attribute table of CRC death counts and rural index scores by county (created as part of Exercise 2) was exported to an Excel file using the Table to Excel tool in Arc. The Excel file was then loaded into R using the “readxl” package.

Univariate Poisson Diagnostics: Before introducing a Poisson regression, I wanted confirm the outcome variable follows a Poisson distribution. To do this, I used the rootogram() function from the vcd package in R to visualize the distribution of the CRC mortality count data. The figure below show that the data seems to roughly follow a Poisson distribution. Based on this, I proceeded to fitting the model.

Model Fitting: I fit a GLM of the Poisson family using the following formula in R: glm(formula = CRCDeathrate ~ PCAWeighted_Index, family = “poisson”). Initially, I attempted to construct a linear OLS regression with this data, but after many attempted transformations, I realized the outcome count data was likely following a Poisson distribution. In subsequent steps, I walk through the post-fitting Poisson diagnostics about the appropriateness of this fit.

Poisson Regression Diagnostics: In Poisson regressions, one of the main concerns is over-dispersion of the data, as Poisson distributions have only one free parameter and do not allow for variance to be adjusted individually from the mean. Therefore, if over-dispersion exists, the data has more variation than the Poisson distribution allows, biasing the variance of the model. To ensure the Poisson regression is not over-dispersed, I utilized the dispersiontest() function from the AER package. Equi-dispersion from this function is displayed when alpha=0. The alpha value of -0.20 and p=0.99 displayed below show that model is not over-dispersed.

Further proof that Poisson is the correct model for the data can be shown through an influence plot from the car package, where only a few data points (101,57) are highly influential (shown below).

I also used a chi-squared test to confirm that a Poisson model is a good fit. The resulting p-value (0.95) indicates that a Poisson model is a very good fit for the data (shown below).

Exporting to Arc for Mapping: After creation the model, I computed the standardized residuals of the model in R for each county using the studres() function. I then exported this data to Arc and added it to the attribute table I mentioned previously in this exercise. I then used symbology to represent the residuals on the Texas map and created a layout


Because Poisson regression of counts utilized a log link, the resulting coefficient(s) need to be exponentiated. The exponentiated coefficient for this regression display that as county index scores increase (become more urban) by 1 unit, the intercept (mean of CRC counts) is multiplied by 0.99. In other words, for every unit increase in rurality, county CRC deaths counts increase by 1% (p<0.001). The narrow confidence intervals for this estimate indicate that the coefficients are accurate.

The map of standardized residuals can be seen above and indicates that there is significant regionality to the association between index scores and CRC death counts, where central Texas is much more green (negative standardized residuals) and counties near the edges are orange and red (positive standardized residuals). Also, as expected, the high CRC death count Texas counties from Exercise 2 (Anderson, Howard, Gonzales, and Newton) have standardized residuals significantly higher than expected in the Poisson model.


This process of regression construction was very helpful for determining how to best model my data. I initially attempted to use a linear regression and after laboring to find a transformation for my outcome data, I finally decided on Poisson regression (count data duh!). Based on the model diagnostics I performed, it seems that the data works perfectly for Poisson regression. In my opinion, the most difficult part of this exercise was interpreting the coefficients of the regression. Because the coefficients are on the log scale, they have to be exponentiated, which greatly confused my interpretation at first. Because there is an extensive amount of county CRC death data missing from this analysis (confidentiality), these results may be less indicative of the true rurality-CRC mortality associations in Texas. Further, this analysis suffers from the modifiable areal unit problem, where counties may not an appropriate boundary for comparison. In future analyses, I would like to use more complete and spatially defined data from the state cancer registry to shed more light on the relationship.

Exercise 3: Characterizing the spatial pattern of confusion between pixels classified by global settlement datasets and OSM refugee settlements.

Question that I asked:

In Exercise 3, I was asking how the different classification methods that I’m using align or do not align with each other — essentially, creating a confusion matrix to identify the pixels that are True Positives, False Positives, and False Negatives when comparing two of the classifications. That is — what is the spatial pattern of confusion between pixels classified by global settlement datasets and an unsupervised binary classification of pixels within an OSM boundary?

Approach that I used:

I used EarthEngine to simulate confusion among pixels — I did not necessarily create a matrix in order to do so, but considered the concept of a confusion matrix in order to calculate a number for each pixel.

Steps I followed to complete the analysis:

In order to do this, I needed to compare “Facebook” and “K-means” as well as “WSF” and “K-means.” In this case, the “K-means” classification was my “truth” classification, because I needed something to measure the other classifications against that was not just a summary of pixels inside of a vector but a more traditional binary unsupervised classification. Thus, I used the settlements to guide a clustering algorithm of K-Means to cluster similar pixels, and created clusters of “settlement-like” pixels. I needed to perform raster math in order to combine these datasets and represent False Positives, False Negatives, and True Positives. I added WSF to K-means, with each pixel labeled as “1” in each representing  a positive record. I then added Facebook to K-means, again with each pixel labeled as 1, but then multiplied this by 4 for further additive properties to identify the overlaps in False Positives, False Negatives, and True Positives. Ultimately, this meant that the pixels that were False Negative in either or both comparisons had pixel values of 10, 11, and 14. The data that were False Positive in either or both comparisons had pixel values of 1, 4, and 5. A value of 15 represented where all three datasets registered as True Positive.

Brief description of results I obtained:

I was interested in both False Positives and False Negatives, but for different reasons. In False Positives, I wanted to visualize where the different global settlement datasets DID detect settlements while the K-Means did not. False Negatives would show me where K-Means picked up settlement data but neither of the global datasets did. Ultimately, False Negatives were much more prevalent throughout the dataset, further illustrating the exclusion of refugee settlements from global datasets. I chose to display a map of a more interesting pattern that appeared when looking over a larger settlement that was near a large body of water: because the binary clustering grouped water with “settlement” type landscapes, this area is a significant False Negative in the data. If I were to calculate statistics, it would make sense to exclude the area of water in order to not promote bias in the data.

False Positives and False Negatives between Refugee Settlement Binary Clusters and Facebook Classification and World Settlement Footprint Classification

Critique of the method – what was useful, what was not?

This exercise probably caused the most trouble with regards to methods that I thought made sense and would work but both presented more challenges and fewer patterns than I was hoping. There wasn’t really the spatial pattern that I was expecting, perhaps because this data is so noisy and over discontinuous areas in the landscape. It was useful to dig into the overlapping data, but ultimately the actual spatial patterns were not very remarkable. Perhaps the numbers presented in a matrix would be a more useful representation of the data or patterns for this specific method. Aside from the spatial statistics, I actually ran into quite a few function issues in EarthEngine, Pro, and ArcGIS Desktop — all three of which I used to try to manipulate this data into the ways I was imagining it would. For my final project, I may need to revisit these methods or seek advice from others, because my techniques were not as effective as I’d hoped.

Supervised Image classification on forested stands

Question that I asked?

Could I identify functional tree species with supervised image-classification in my stands?

The reason I asked this question was so that If I had to do a geographically weighted regression again it would be valuable to have deciduous or coniferous tree species in my point data for an added variable.

Name of the tool or approach that you used.

The main tool that I used for image classification was the maximum-likelihood classification in ArcMap. I also used the create accuracy assessment points to help create a confusion matrix in excel.

Brief description of steps you followed to complete the analysis.

I downloaded 2016 NAIP imagery in my area of interest and used the high resolution imagery to create a false colored image with the bands being arranged as NIR, Red, and Green. To help delineate broad-leaf vegetation from coniferous vegetation, I applied a histogram equalize stretch that enhanced my ability to identity conifers in the landscape. From there I created a maximum likelihood classification by drawing training data polygons on the false color imagery, which involved me using the Esri digital imagery base map as a reference image.


Once the image classification was complete, I used the create accuracy points on my stand and then extracted the raster values from the thematic map output to those points to create a confusion matrix in Excel. I clipped the thematic map raster to the watershed polygons I made when I did an individual tree segmentation to show what pixel classifications were assigned in my tree tops.


Brief description of results you obtained.

The thematic map output was 83% accurate with conifer and developed land covers performing the worst in the model. The developed  land cover is generally difficult to model in a landscape, and the variability in urban spectral reflectance leads to errors in modeling. The conifer land cover performed more poorly due to my trouble achieving accurate training data with the imagery resolution, and also with the model having trouble delineating conifers from grass and deciduous vegetation. Errors of commission on my part (65% accuracy), and errors of omission (75%  accuracy) lead to the lower accuracy of the conifer land cover (Table 1).  Despite these errors, the thematic map output performed well, and the land cover pixels in my stands showed that conifer trees were accurately assigned in the Saddleback stand (Figure 2). For the baker creek stand the large amount of shadows, sun glare on canopies, and classification cut off, lead to a poor classification of that stand.

Figure 1. The land cover thematic map for the entire NAIP image. The cyan blue color indicates the locations of Saddleback and Baker Creek stands.

Figure 2.The land cover classification output for the Saddleback stand.

Figure 3. The land cover classified output for Baker Creek Stand. Note that the NAIP imagery that was classified did not extend to cover the entire stand. The tree crown polygons were laid below the output to show where the land cover cuts off.

Table 1. Confusion matrix for the thematic map output.


Critique of the method – what was useful, what was not?

Some critiques about this process was that it was time consuming to create training data detailed enough to capture the variation in the scene for my desired accuracy. Sources of errors in the thematic map include shadows, resolution and variable spectral response signatures in the remotely sensed vegetation. Shadows occluded trees that would otherwise stand out, and distorted the classification enough for me to have to add in a land cover classification for shadows to mask them out of the scene. The issue of resolution just means that NAIP imagery was not detailed enough for the applications I asked. Imagery taken from unmanned aerial drones may be a potential avenue for acquiring a more  higher resolution data set. The confusion matrix highlights this issue, with an omission error of 65% for conifers and 75% commission error. It was difficult to determine conifer trees accurately in the training data from the variability of the spectral reflectance and the blurred crowns from the 1 meter resolution.


Since I only did a classification, I didn’t attempt to classify tree functional species to my tree polygons. The process that comes to mind on how to do that is to visibly determine which classification color is more pronounced in a tree top, and then placing that species in the point data as an attribute. This process would be highly time consuming and developing a methodology to streamline the classification of functional tree species to my tree points would be potential future work.Overall, the thematic map outputs are useful for areas like the Saddleback stand that have less shadows and distortion. The map is less useful for areas with high distortion like my Baker Creek stand.

Landscape Patterns as Predictors of Tree Height

Question: Which landscape features correspond to clusters of greater than expected trees?

Methods: I performed two Hot Spot Analyses in ArcMap; one on Hot Spots of tree height and another on hot spots of distance between trees. Both were constrained to the reserved control areas of the HJ Andrews Forest. Hot spots in tree height are regions of greater than expected tree height, while hot spots in distances between trees are regions of greater than expected distance between individual trees (more dispersed trees). Hot spots and cold spots of each analysis generally overlapped. However, hot spots between tree height and spacing did not overlap in all cases, so I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. Since the end goal is to find landscape features that may correlate with amount of soil carbon, I conducted this analysis with the assumption that taller trees may correlate with regions of greater soil carbon. I used the package ‘caret’ in R to calculate a  confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots (Table 1).

Results: Regions of taller than expected trees often correspond to regions of greater than expected distances between trees, which agrees with current forest growth models (Fig. 1). Hot spots of tall trees are typically in valleys and cold spots are commonly on ridges (Fig 3 & 4). When we zoom in to the Lookout Mountain area of HJ Andrews, we see that hot spots of tall trees are more concentrated in valleys and on footslopes, and cold spots are closer to mountain ridges (Fig 3). When compared with the distance hot spot map of the same area, we see that cold spots go much further down the footslopes and even into the valleys in some cases (Fig 4). So although we have evidence for a strongly linear relationship between height and distance between trees, we also have evidence that they do not fully explain each other and other landscape features are likely at play.

Fig 1. Distance Z-scores vs. Height Z-scores from hot spot analyses show a linear relationship.

Fig 2. HJ Andrews elevation with reserved control areas in orange and

inset area of Lookout Mountain hot spot maps (below)


Fig 3. Hot Spot Analysis showing hot spots of tree heights (tallest trees)

in the Lookout Mountain area


Fig 4. Hot Spot Analysis showing the greatest distance between trees

in the Lookout Mountain area


An elevation band that correlates with occurrences of tall trees exists up to around 1100 m, after which point number of tall trees drops off substantially (Fig. 5). Certain aspects seem to correlate with taller trees, but those relationships are harder to tease apart and I have yet to fully explore them. Greater slopes tend to correlate with shorter trees, but this relationship is not linear. There is an interesting upwards trend at slopes between 30 and 50 degrees that seems to correlate with slightly taller trees, then a big drop in mean height Z-score at slopes of 60 degrees.

Fig 5. Aspect, elevation and slope compared with Z-scores of mean height.

A comparison of Z-scores from hot spot analyses of height and distances shows that although hot spots of height and distance tightly correlate, covariates that explain them are different (mean slope and elevation). When we compare the most extreme Z-scores to one another, slope, height and distance between trees are not particularly different. Mean elevation in three categories of Z-score is similar, but mean elevation in the fourth group (>3,>3) is significantly lower. A next step is to map out these

Table 1. Comparison between the most extreme Z-scores of tree height and tree spacing.

Height Z-Score Distance Z-Score Mean Height (m) Height_SD Mean Distance (m) Distance_SD Mean Slope (m) Slope_SD Mean Elevation (m) Elevation_SD
<-3 <-3 22.8 10.5 5.1 2.4 27.9 10.5 1285 294
<-3 >3 24.5 11.2 5.6 2 26.9 4.5 1377 153
>3 <-3 34.8 7.1 5 2 31.8 5.9 1310 44
>3 >3 39.5 16.7 4.7 2.6 26.2 11 934 188

Critique: These analyses are still based on Hot Spot Analyses, so they still comes with the same criticisms as previous Hot Spot Analyses. One of these criticisms was that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I have yet to map out regions that were well-predicted and poorly predicted spatially, so I cannot fully discuss the merits of the confusion matrix method.


Exercise 3: Lagoons, ENSO Indices, and Dolphin Sightings

Exercise 3: Are bottlenose dolphin sightings distances to nearest lagoon related to ENSO indices in the San Diego, CA survey site?

1. Question that you asked

I was looking to see a pattern at more than one scale, specifically the relationship with ENSO and sighting distributions off of San Diego. I asked the question: do bottlenose dolphin sighting distributions change latitudinally with ENSO related to distance from the nearest lagoon. The greater San Diego area has six major lagoons that contribute the major estuarine habitat to the San Diego coastline and are all recognized as separate estuaries. All of these lagoons/estuaries sit at the mouths of broad river valleys along the 18 miles of coastline between Torrey Pines to the south and Oceanside to the north. The small boat survey transects cover this entire stretch with near-exact overlap from start to finish. These habitats are known to be highly dynamic, experience variable environmental conditions, and support a wide range of native vegetation and wildlife species.

Distribution of common bottlenose dolphin sightings in the San Diego study area along boat-based transects with the six major lagoons.


0 Buena Vista Lagoon
1 Agua Hedionda Lagoon
2 Batiquitos Lagoon
3 San Elijo Lagoon
4 Tijuana Estuary
5 Los Penasquitos Lagoon
6 San Dieguito Lagoon

2. Name of the tool or approach that you used.

I utilized the “Near” tool in ArcMap 10.6 that calculated the distance from points to polygons and associated the point with FID of that nearest polygon. I also used R Studio for basic analysis, graphic displays, and ANOVA with Tukey HSD.

3. Brief description of steps you followed to complete the analysis.

  1. I researched the San Diego GIS database for the layer that would be most helpful and found the lagoon shapefile.
  2. Imported the shapefile into ArcMap where I already had my sightings, transect line, and 1000m buffered transect polygon.
  3. I used the “Near” tool in the Analysis toolbox, part of the of the “proximity toolset”. I chose the point to polygon option with my dolphin sightings as the point layer and the lagoon polygons as the polygon layer.
  4. I opened the attribute table for my dolphin sightings and there was now a NEAR_FID and NEAR_DIST which represented the identification (number) related to the nearest lagoon and the distance in kilometers to the nearest lagoon, respectively.
  5. I exported using the “conversion” tool to Excel and then imported into R studio for further analyses (ANOVA between the differences in sighting distances to lagoons and ENSO indices).

4. Brief description of results you obtained

After a quick histogram in ArcMap, it was visually clear that the distribution of points with nearest lagoons appeared clustered, skewed, or to have a binomial distribution, without considering ENSO. Then, after importing into R studio, I created a box plot of the distance to nearest lagoon compared to the ENSO index (-1, 0, or 1). I ran an ANOVA which returned a very small p-value of 2.55 e-9. Further analysis using a Tukey HSD found that the differences between ENSO states of neutral (0) and -1 and neutral and 1 were significant, but not between 1 and -1. These results are interesting because this means the sightings of dolphins differ most during neutral ENSO years. This could be that certain lagoons are preferred during extremes compared to the neutral years. Therefore, yes, there is a difference in dolphin sightings distances to lagoons during different ENSO phases, specifically the neutral years.

Histogram comparing the distance from the dolphin sighting to nearest lagoon in San Diego during the three major indices of El Niño Southern Oscillation (ENSO): -1, 0, and 1.


Violin plot showing the breakdown of distributions of dolphin sighting distances to lagoons (numbered 0-6) during the three different ENSO indices.

5. Critique of the method – what was useful, what was not?

This method was incredibly helpful and also was the easiest to apply once I got started, in comparison to my previous steps. It allowed to both visualize and quantify interesting results. I also learned some tricks for how to better graph my data and to symbolize my data in ArcMap.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Ex. 3: Does black stain spread through landscape networks?


For those who have not seen my previous posts, my research involves building a model to simulate the spread of black stain root disease (a disease affecting Douglas-fir trees) in different landscape management scenarios. Each of my landscapes are made up of stands assigned one of three forest management classes. These classes determine the age structure, density, thinning, and harvest of the stands, factors that influence probability of infection.


Are spatial patterns of infection probabilities for black stain root disease related to spatial patterns of forest management practices via the connectivity structure of the network of stands in my landscape?


 I decided to look at how landscape connectivity influenced the spatial relationship between forest management practices and infection probabilities. This approach builds off of a network approach based in graph theory (where each component of the landscape is a “node” with “edges” connecting them) and incorporates concepts from landscape ecology regarding distance-dependent ecological processes and the importance of patch characteristics (e.g., area, habitat quality) in the contribution of patches to the connectivity of the landscape. I used ArcMap, R, and a free software called Conefor (Saura and Torné 2009) to perform my analysis.


 1. Create a mosaic of the landscape

The landscape in my disease spread model is a torus (left and right sides connected, top and bottom are connected). The raster outputs from my model with stand ID numbers and management classes do not account for this and are represented as a square. Thus, in order to fully consider the connectivity of each stand in the landscape, I needed to tile the landscape in a 3 x 3 grid so that each stand at the edge of the stand map would have the correct spatial position relative to its neighbors beyond the original raster boundary. I did this in R by making copies of the stand ID raster and adjusting their extent. In ArcMap, I then assigned the management classes to each of those stands, converting to polygon, using the “Identity” tool with the polygon for management class, and then using the “Join Field” tool so that every stand with the same unique ID number would have the relevant management class assigned. If I had not done this step, then the position of stands at the edge of the raster in the network would have been misrepresented.

2. Calculate infection probability statistics for each stand

I then needed to relate each stand to the probability of infection of trees in that stand (generated by my model simulation and converted to point data in a previous exercise). In ArcMap, I used the “Spatial Join” tool to calculate statistics for infection probabilities in that stand, because each stand contains many trees. Statistics included the point count, median, mean, standard deviation, minimum, maximum, range, and sum.

3. Calculate the connectivity of each stand in the network of similarly managed stands in the landscape

3a. For this step, I used the free software Conefor, which calculates a variety of connectivity indices at the individual patch and overall landscape level. First, I used the Conefor extension for ArcMap to generate the input files for the Conefor analysis. The extension generates a “nodes” file for each feature and a “connection” file, which contains the distances between features a binary description of whether or not a link (“edge”) exists between two features. One can set the maximum distance for two features to be linked or generate a probability of connection based on an exponential decay function (built-in feature of Conefor, which is an incredible application). For my analysis, I performed connectivity analyses that only considered features to be linked if (i) they had the same management class and (ii) there were no more than 10 meters of distance between the stand boundaries. Ten meters is about the upper limit for the maximum likely root contact distance between two Douglas-fir trees.

3b. For each management class, I ran the Conefor analysis to calculate multiple metrics. I focused primarily on:

  • Number of links in the network
  • Network components – Each component is a set of connected patches (stands) that is internally connected but has no connection to any other set of patches.
  • Integral Index of Connectivity (IIC) – Essentially, this index gives each patch (stand) a value in terms of its importance for connectivity in the network based on its habitat attributes (e.g., area, habitat quality) and its topological position within the network. For this index, higher values indicate higher importance for connectivity. This is broken into three non-redundant components that sum to the total IIC:
    • IIC intra – connectivity within a patch
    • IIC flux – area-weighted dispersal flux
    • IIC connector – importance of a patch for connecting other patches in the network) (Saura and Rubino 2010)
  1. Analyze the relationship between connectivity metrics and infection probabilities

I reduced the mosaic to include only feature for each stand, eliminating those at the periphery and keeping those in the core. I confirmed that the values were similar for all of the copies of each stand near the center of the mosaic. I then mapped and plotted different combinations of connectivity and infection probability metrics to analyze the relationship for each management class (Fig. 1, Fig. 2).

Fig. 1. Map of IIC connectivity index and mean infection probability for the extensively managed stands.


I generally found no relationship between infection probability and the various metrics of connectivity. As connectivity increased, infection probabilities did not change for any of the metrics I examined (Fig. 2). I would like to analyze this for a series of landscape simulations in the future to see whether patterns emerge. I could also refine the distance used to generate links between patches to reflect the dispersal distance for the insects that vector the disease.

Fig. 2. Plots of infection probability statistics and connectivity metrics for each of the stands in the landscape. Each point represents one stands in the randomly distributed landscape, with extensively managed stands in red, intensively managed stands in blue, and old-growth stands in green.

CRITIQUE OF THE METHOD – What was useful, what was not?

I had originally planned to use the popular landscape ecology application Fragstats (or the R equivalent “landscapemetrics” package), but I ran into issues. As far as I could tell (though I may be incorrect), these options only use raster data and consider one value at a time. What I needed was for the analysis to consider groups of pixels by both their stand ID and their management class, because stands with the same management class are still managed independently. However, landscapemetrics would consider adjacent stands with the same management class to be all one patch. This meant that I could only calculate metrics for the entire landscape or the entire management class, which did not allow me to look at how each patch’s position relative to similarly or differently managed patches related to its probability of infection. In contrast, Conefor is a great application that allows for calculation of a large number of connectivity metrics at both the patch and landscape level.


Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Ex 3: Grain size distributions and peak flow events

The context:

For this exercise, I wanted to figure out if there was a relationship between the temporal pattern of peak discharge in my study creeks and the temporal pattern of grain size distributions. The temporal pattern of grain size distributions could help tell a story about the interplay between hillslope and alluvial process and the forces involved in shaping and stabilizing the streams. One might predict that changes in grain size distribution might be related to extreme events. Large events could associated with debris flows which might reduce sorting while more moderate flow events could increase sorting.

The data:

Pebble counts were conducted in conjunction with cross section surveys at five reaches between 1995 and 2011. They show both the median (D50) and standard deviation of grain size varying over time.

During this time period, two different cross section sampling methods were used, depending on the year. In one method, every cross section in a given year was surveyed. In another method, the only cross sections sampled were ones in which the field crew thought that the creek bed had changed. Because of this, the grain size samples in some year are incomplete and biased towards conditions that favor visible change.

The graphs below show the mean of the D50 and mean of the standard deviation for each set of cross section for each year sampled in subfigure a. The error bars show standard error. The faded points represent years where the field crew only sampled a subset of the cross sections. Subfigure b shows the area-normalized annual peak discharge for the stream gauges on Mack and Lookout Creek. Cold Creek is ungauged.


These figures imply that most reaches were the least sorted in 1997, the year after the flood of record, followed by various decreases and increases over time.

The questions

I asked the following questions to try to understand the relationship between peak flow and grain size distribution:

  1. Is the standard deviation of grain size in a given water associated with the peak flow from that water year?
  2. Is the standard deviation of grain size in a given water associated with the peak flow from the previous water year?
  3. Is the change in standard deviation of grain size between two years associated with the largest peak flow from the interceding years?
  4. Is the change in D50 between two years associated with the largest peak flow from the interceding years?


The methods and results

I used simple linear regression to relate each of these variables and took the R2 value to represent how much of the sediment distribution variability each variable might explain. The results were negative in all cases.


SITE  Rsquared

1 LOL    0.00983

2 LOM    0.00320

3 MAC    0.0340

4 MCC    0.106


SITE  Rsquared

1 LOL     0.0546

2 LOM     0.0297

3 MAC     0.0191

4 MCC     0.217


SITE  Rsquared

1 LOL     0.0845

2 LOM     0.0718

3 MAC     0.0842

4 MCC     0.0852


SITE  Rsquared

1 LOL    0.0440

2 LOM    0.0252

3 MAC    0.0553

4 MCC    0.00688

The answer to these questions all appear to be no – clearly hydrology has some impact on grain size distributions, but the relationship may be too complicated to address using a single predictor variable and limited data.


Exercise 3: Perception of natural resource governance and changes in environment over time


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

Tools and Approaches

  1. Tabulate area in ArcGIS Prop
  2. T-test in R

Analysis Steps

  1. To begin to answer this question, I first needed to find spatial layers of the environmental condition in the Puget Sound over time. I could not use the same ‘environmental effects’ data I used for Exercise 2, as the data was only completed 5 months prior. Instead, I substituted two land cover maps from 1992 and 2016 to approximate environmental condition.

I first loaded the raster layers into ArcGIS Pro. I then reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

  1. To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.


Land cover change differed significantly for four different land cover types between the two groups—grassland, forest, shrub/scrub, and bare land cover. Grassland cover decreased for both groups, but decreased by 5% more in in the areas with individuals with below average governance perceptions. Forest cover also decreased for both group, but decrease by 1% more individuals with below average governance perceptions (the amount of forest cover in the region is very large which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below average governance perceptions. Lastly, bare land decreased for both, but decreased by 5% more for individuals in areas with individuals with below average governance perceptions (Table 1).

The effect size of land cover change on perception was small for each of the significant variables with biserial correlations between .07 and .09 (Table 1).

Table 1. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

  Governance Perception1      
  Above Average Below Average t-value p-value Effect size (rpb)
High Development 17 16 0.80 .426 .02
Medium Development 19 19 0.67 .500 .02
Low Development 11 11 0.22 .829 .00
Developed Open Space 12 13 0.21 .837 .00
Agriculture -1 0 0.74 .461 .02
Grassland -13 -18 3.03 .002 .09
Forest -10 -9 2.61 .009 .08
Shrub/Scrub 40 43 3.04 .002 .09
Wetlands 0 0 1.67 .095 .05
Unconsolidated Shore 2 2 0.10 .918 .00
Bare Land -21 -25 2.43 .015 .07
Water 0 0 0.90 .368 .03
Aquatic Beds -1 -5 1.19 .233 .03
Snow/Tundra 0 0 0.00 .997 .00

1Cell entries are percent changes of land cover from 1992 to 2016


  1. Critique

The biggest problem I had with this method was figuring out the most appropriate way to test my question. I struggled with classifying ‘good’ or ‘bad’ land cover change, and ultimately decided it was inappropriate. I also don’t think land cover is necessarily the best way to test environmental condition. I think it would be more appropriate if I were able to use environmental effects over time. I think it would also be best if I did different sized buffers around individuals.

Additionally, I believe I should have controlled for how long they lived in the region. The average number of years individuals have lived in the region was 34, which is less than the land cover change of 23 years. However, as years lived in the Puget Sound is a significant determinant of perception, it may have more to add in this case. The problem here is that this method is clunky. The way in which I differentiated high/low perception was by having different csv files. Having to then split these by years in the region would create an additional step, and more files, that could ultimately increase personal error because of the difficulty in keeping track of which buffer is which type of individual.


Does proximity of open areas influence invasibility of forested areas?

Question Asked

I am interested in how the spatial pattern of non-forested areas influences invasibility of forest plots by the exotic annual grass vententata (Ventenata dubia). For exercise 3 I asked the question, how are forest plots with high and low ventenata cover related to the spatial pattern of canopy openness?

Tools and Approaches Used

To explore this question, I performed a Ripley’s cross-K between ventenata forest plots with high and low cover and non-forest areas (Ripley 1976). This analysis is used to interpret spatial relationships between two point patterns. I used ArcGIS and R to answer the question following the steps below:

  • I added my sample plots with ventenata cover and coordinates and fire perimeters into ArcGIS.
  • To generate non-forest points, I added canopy cover raster data for the study area into ArcGIS with values ranging from 0 to 95% canopy cover.
  • I buffered the sampled fire perimeters by 3000m using the “buffer” tool and generate 200 random points, not closer than 100m in each buffer using the “create random points” tool in ArcGIS.
  • I extracted the raster values to the random points using the “values to points” tool, and extracted points that had <10% canopy cover into a new layer file for open areas.
  • I then extracted my sample plots only from forested areas into a forest_plots layer.
  • I exported these plots and the random points into a data frame with spatial data and an indicator for sites with high ventenata (>15% cover), low ventenata (0-15% cover), or random open points with no ventenata recorded (labeled “high”, “low”, and “open” respectively).
  • I performed a Ripleys cross K between the high and low ventenata forest plots and non-forest “open” areas and plotted the response in R with the package “spatstat” (Thanks, Stephen C. for sharing R code!). I repeated the analysis individually for three sampled fires (Fox, South Fork, and Canyon Creek) to avoid picking up the inherently clustered spatial pattern of my sample plots and overshadowing the research question. However, I also performed the analysis on the entire sample area to view where issues arise when ignoring a clustered sample design and compare to results from individual fires.
    1. In order to perform the cross Ripleys K on each sampled fire, I first had to establish rectangular “windows” around each fire and provide the xy coordinates for the vertices using tools in ArcGIS depicted below.

Results & Discussion

Most annual grass species prefer areas of low canopy cover. These open areas can, however, act as source populations for invasion into more resistant forested areas. I hypothesized that ventenata (an annual grass) abundance in forested areas would increase with increased proximity and abundance of open areas near the sampled forest (i.e. sites with high ventenata cover in forested areas would be more spatially “clustered” around non-forest areas than sites with low ventenata cover).

The results of my Ripley’s cross-K show that the spatial relationship between sites with high/ low ventenata cover differs between fires. At the Fox fire (FOX), high and low sites closely follow the projected Poisson distribution when crossed with open areas, suggesting that ventenata abundance in forest sites is not related to nearby open areas. However, in the Corner Creek (CC) and South Fork (SF) fires, heavily invaded forest sites are more clustered around open points and low sites are more dispersed around open points than would be expected by chance. I may not have seen a relationship between open areas and high/ low sites at the fox fire because of the relatively low number of forest sample plots (n = 5 high and 4 low). Another explanation could be that I am missing an interaction between fire and invasion by not accounting for canopy lost in forest plots as a function of fire. Analyzing burned vs. unburned forest plots (with fire as an additional driver of invasion) would have been ideal, but would have dropped the sample size too low to perform analyses.

When I performed the cross-K on the entire study area it shows strong clustering of both high and low plots around open areas. However, this is likely the result of the clustered spatial pattern of my sampling method (targeting specific burned areas), and less representative of the actual spatial relationship between ventenata invasion and canopy cover.

Critique of Method

Overall, I thought the cross-K was a useful method for evaluating the spatial relationship between two variables (ventenata cover and open areas). However, a method that could include interaction terms (burning) would probably have been more appropriate for this study. Additionally, the arrangement of sample plots made it difficult to evaluate overall patterns across the study region. It would have been a more useful analysis had my samples been evenly distributed across the region.

Ripley, B.D., 1976. The second-order analysis of stationary point processes. Journal of applied probability, 13(2), pp.255-266.

Ex3: Correlation of Mining and Non-Mining Villages’ NDVI Loss with Distance to Nearest Primary Road

Following up from Exercise 2, where I determined how NDVI loss relates to distance from village center, I now wanted to explore whether or not the NDVI loss I observed could be explained or attributed to another factor: the distance of a village from a primary road.

To do this, I acquired a shapefile of roads from OpenStreetMaps for Senegal, which I used alongside my locations of mining and non-mining villages, ArcGIS Pro for some simple spatial work, and Excel to calculate a simple correlation.

First, I imported the shapefile of roads in Senegal to ArcGIS Pro, and first determined what the nearest actual road to each village was (fig. 1). Based on this first-pass glance, it appears the more mining sites in my sample than non-mining sites are nearest to “primary” paved roads, and that the majority of non-mining villages are near “tertiary” roads. This simply means that, out of my sample, more mining happens closer to paved roads, which would make sense: gold needs to be transported from gold sites, and equipment/supplies need to be transported in.

Fig 1

Next, I manipulated the shapefile to only include features which were designated “primary” roads (fig. 2). I then used ArcGIS Pro’s “Near” tool to generate new attributes for each set of villages which listed the shortest distance to the nearest primary road in meters.

Fig 2: Green dots are non-mining villages; red dots are mining villages

With these distances in hand, I plotted them in excel against the NDVI % loss values for the first ring (0-250m) around each village, as a simple first pass of understanding how distance from a primary road may be related to the amount of NDVI loss being experienced at a particular road (fig. 3).

Fig 3

Then, within Excel, I calculated a simple correlation (a la fig. 4) to determine how distance from a nearest primary road relates to amount of NDVI loss experienced by a village.

Fig 4: x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.

For mining villages, I calculated a correlation of -.804, and for non-mining villages, I calculated a correlation of .686. This means that, for this sample at least, the NDVI loss experienced at a mining village is negatively correlated with distance from a primary road; that is, as a mining village’s distance from a primary road increases, its NDVI loss should decrease. This makes perfect sense: mining villages closer to primary roads are easier to access both by miners and by other users, and to leave as well, encouraging larger groups of people to live there, which would decrease NDVI as a result of their activities.

Non-mining villages, on the other hand, are more difficult to explain. One explanation, which may be plausible, is that more distant villages have less opportunities for other livelihoods which are not environmentally dependent, and so are more obligated to work in and with the environment, perhaps encouraging vegetation loss as users rely on and drawdown more on their environment.

Overall, I think this was a useful approach for examining another possible explanation for why mining villages may have higher rates of NDVI loss than non-mining villages. Mining villages are closer to primary roads, which means they are more accessible to potential miners, and thus more likely to grow larger than non-mining villages, and drawdown more on their environment. That being said, as can be seen in Fig. 3, this is not a perfect relationship by any means, and a more robust analysis would be useful to say much more. But, as a first pass, it’s useful for illustrating some additional explanation.

Exercise 3: Confusion Matrix to Test Predictive Capacity of Seascapes

Question Asked

I’ve been using my time in this class to explore methods and analyses that could help me to better understand the spatial and temporal relationships between satellite-derived seascapes and fishes in the California Current Ecosystem. In Exercise 1, different types of interpolations were explored and applied to the trawl data to see how results changed. After refining methods related to the Kriging Interpolation, Exercise 2 was conducted, exploring he possible uses of neighborhood analysis in determining the prominent seascapes related to high rockfish abundance in the California Coast Ecosystem. Exercise 3 builds on the results of the previous two exercises: in order to explore the relationship between seascapes and rockfish abundance, a confusion matrix was calculated to measure the ability for certain seascapes to predict the occurance of rockfish hotspots. Simply put, the question being asked is: How were seascapes related to areas of high rockfish abundance in the California Coast Ecosystem in May of 2007?

Tool Used

In order to answer this question, I’m going to use a confusion matrix. A confusion matrix is a table that measures the agreement between two raster layers and provides an overall measure of a model’s predictive capacity. Additionally, these matrices can also measure a raster’s error (in terms of false-positives or false negatives). Each statistic calculated is useful, as each one clues the researcher into spatial or methodological processes that may account for certain types of error. The confusion matrix works by taking original data and creating accuracy assessment points, which are points assigned to random spatial points and given the value indicated by the raster value at that point. Those points are then matched up to points on the model raster and statistics are calculated measuring how much they agree.


The data used for this exercise are the following:

  • Significant seascape classes that indicate high rockfish abundance for May 2007, as determined by the neighborhood analysis in Exercise 2
  • Seascape raster data for the relevant month
  • Rockfish abundance raster (interpolated, from Exercise 1)

Confusion matrices can only be calculated when your ground-truthed and modeled layers are the same units. For this reason, my raster data had to be converted to presence absence data for both the seascape classes and the interpolated rockfish abundances. In order to do this, I used the Reclassify tool in ArcGIS to change “present” values to 1 and “absent” values to 0. For the interpolated abundance raster, any nonzero and nonegative values were changed to 1 to indicated modeled rockfish presence. The remaining cells were changed to 0 to indicate modeled rockfish absence. The result was the raster shown below:

Rockfish Presence-Absence Layer calculated using the Reclassify too on the interpolated trawl data.

Similar steps were used to reclassify the seascape data – seascape classes 7, 12, 14, 17, and 27, which were shown to be present within a 25km radius of the high abundance trawls in Exercise 2, were counted as present cells, while the other classes were counted as absent cells. The result was the raster shown below:

Rockfish Presence-Absence raster calculated using the results from exercise 2 to reclassify the seascape data

It was determined that the seascpae layer would be the base or “ground-truthed” layer and the interpolation layer would be the modeled layer, since the interpolated values are simply estimations. The next step was using the “Create Accuracy Assessment Points” Tool to create randomized points within the modeled layer. The result was the following shapefile:

Randomized accuracy assessment points created using the Create Accuracy Assessment Points tool

Once the points are created, they must be compared to the values in the modeled layer. The Update Accuracy Assessment Points tool does this automatically, and my points were matched up to corresponding values from the seascape layer. Finally, the final step was to run the resulting shapefile through the “Calculate Confusion Matrix” tool.


The resulting confusion matrix is displayed below:

ClassValue C_0 C_1 Total U_Accuracy Kappa
C_0 9 79 88 0.102272727  
C_1 89 276 365 0.756164384  
Total 98 355 453  
P_Accuracy 0.091836735 0.777464789 0.629139073  
Kappa         -0.135711

The two rasters measured in the matrix did not compare well at all, resulting in a negative Kappa value. Kappa values measure overall agreement between two measures, with 1 being perfect agreement. A negative score indicate lower agreement than would be found by chance. The interpolated model performed especially poorly measuring absent values, only agreeing with the seascape model on 9 pixels. While the two rasters agreed on a vast majority of the areas where rockfish may have been found, the irregularities on the absence areas drove the matrix to failure. The large number of both false positive and false negative readings shows that this was not a fundamental disagreement based on scale or resolution of the models – the two methods of measuring rockfish abundance fundamentally disagreed on where they should be found, to the point where they cannot be reconciled.

Critique of Method

Overall, I think a confusion matrix is a good way to compare modeled and true (or ground-truthed) measured information. However, in my case, both of my informational rasters are the results of modeled analyses, each with their own set of assumptions and flaws. The interpolation, for example, is at the mercy of the settings used when executing the Kriging: how many points shall be measured per pixel? Is it a fixed sampling method, or variable, spatially based? And the seascape raster is at the mercy of the neighborhood analysis, which was only conducted on 4, hand-picked points, which introduce all kinds of sampling and spatial biases. All of this, coupled with the fact that both rasters had to be distilled into presence-absence data, which eliminates much of the resolution that makes Kriging so great, results in a heavily flawed methodology that I do not believe truly represents the data. This framework which I’ve used, however, has the potential to be modified, standardized, and rerun in order to address many of these issues. But for now, I feel as though the confusion matrix is a helpful tool, just not for two sets of modeled data.

Ex 3: Relating distance between wells to communication type and fault presence

Question that I asked:

Is there a relationship between the distance between wells, their communication status based on a pumping interference test, and whether or not they are separated by a fault?

Name of the tool or approach used:

Polyline creation and classification in ArcGIS Pro, boxplot creation and two-sided t-tests in R.


29 wells in my study area were evaluated by the Oregon Water Resources Department during pumping interference tests in 2018 and 2019. This test involves pumping one well, and seeing whether the water levels in nearby wells drop in response. I received a verbal account of the wells that did and did not communicate, sketched it on a map, and then transferred that information to ArcIS Pro. I drew polylines using the well locations as snapped endpoints. Then, I created and populated fields containing the well communication type (“communicate” for wells that respond to pumping at a nearby well, and “does not communicate” for wells that do not) and whether or not the path between two wells crosses a fault. Shape_Length in feet was automatically calculated when I created the polylines, on account of the projection I used for the shapefile.

I exported that data table to a csv and imported it in R, where I subset it into three categories: all paths, paths between wells that communicate, and paths between wells that do not communicate. I then created box plots and ran t-tests to see differences between means and distributions of path length based on communication type or fault length.


Comparing the path length and the communication type of all 29 wells involved in the communication test, there is not significant evidence of a difference in mean path length between wells that do and do not communicate because the p-value of a two-sided t-test was 0.152. While the mean distance between wells that do not communicate is larger than the mean distance between wells that do communicate, the overlapping interquartile ranges in both categories make this difference less significant. There is not clear evidence that distance plays a role in well communication.

There is some evidence for a difference in mean path lengths between wells that do and do not cross faults, based on a p-value of 0.047 in a two-sided t-test. The mean path length that crosses a fault is 5,139 ft, while the mean path length that does not cross a fault is 3,608 ft. Wells that are closer together are less likely to be separated by a fault.

For wells that do communicate, there is evidence of a difference between the mean path lengths that cross faults and the mean path lengths that do not cross faults. The p-value for a two-sided t-test was 0.024. Wells that communicate but are not separated by a fault are more likely to be closer together than wells that are separated by a fault.

For wells that do not communicate, there is no evidence of a difference in mean path lengths between paths that do and do not cross faults, given a p-value of 0.98 in a two-sided t-test. Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for wells separated by a fault that do not communicate.


Summary of results:

Wells that communicate in pumping tests do not have a significantly different mean distance between them than wells that do not communicate (p = 0.152)

Wells that are closer together are less likely to be separated by a fault. (p = 0.047)

Wells that communicate but are not separated by a fault are more likely to be closer together than communicating wells that are separated by a fault. (p = 0.024)

Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for non-communicating wells separated by a fault. (p = 0.98)

Critique: I wish I had more sample points and paths to work with, so I could use a more interesting analysis such as ANOVA.

Estimates of connectivity to critical infrastructure in Seaside, OR following a rupture of the Cascadia Subduction Zone


For exercise 3, I evaluated the connectivity of each building within Seaside, OR, to critical infrastructure following a rupture of the Cascadia Subduction Zone. The probability of connectivity for each building was determined using networks and considered the following:

  • Electric Power Network (EPN): probability that each building has electricity.
  • Transportation: probability that each building can reach the hospital or fire stations via the road network.
  • Water Supply Network: probability that each building has access to running water.

The connectivity analysis was deaggregated by hazard as well as the intensity of the event.

Tool and approach

For this exercise, I used: (1) a probabilistic earthquake/tsunami damage model to evaluate the functionality of linkages; (2) the network analysis package python-igraph to evaluate the connectivity of each tax lot to critical infrastructure; and (3) QGIS for spatial visualization of the results.

Description of steps

Networks were created to represent the connectivity of the three infrastructure components (Figure 1). A network consists of nodes connected to each other through edges. When edges are removed, a connectivity analysis can be performed to determine whether there is any path from one node to any other specific node. A disconnection in the network results in two (or more) separate networks.

Here, the earthquake and tsunami hazards cause damages to edges which are removed from the network if deemed nonfunctional. A connectivity analysis between each tax lot and critical infrastructure was performed, and each tax lot was triggered with a binary yes/no for connectivity. A Monte-Carlo approach with 10,000 simulations was implemented to determine the probability of each tax lot being connected to critical infrastructure. The resulting probabilities were then added as attributes to GIS shapefiles in order to evaluate the spatial distribution of connectivity.

Figure 1: GIS and representative networks for each infrastructure component


Description of results

Characteristics of the network can be described by a degree distribution. In a network, the degree of a node is the number of immediate connections that the node has to other nodes (e.g. a node connected to 3 other nodes has a degree of 3). A histogram of the degrees can be generated to describe the overall distribution of the entire network. The degree distribution for the three infrastructure components are shown in Figure 2. It can be seen that in the EPN network, most nodes are connected to two other nodes. This is likewise apparent in the network of Figure 1, as the EPN network appears more “linear” compared to the transportation and water networks. The transportation and water networks exhibit similar characteristics to each other in that the majority of nodes have a degree of three.

Figure 2: Degree distribution for each infrastructure component

Using the results from the Monte-Carlo network analysis, maps were created to show the spatial variability of connectivity. The connectivity was deaggregated by both hazard and intensity of the event, as deaggregation provides an avenue for smart mitigation planning. Although similar maps were produced for all three networked infrastructure, for brevity, only the spatial distribution of the transportation network is shown (Figure 3). The maps show the probability of each tax lot becoming disconnected from the fire stations (2 in Seaside) and hospital (1 in Seaside) via the road network. It can be observed that the tsunami hazard results in significant damage to the transportation system relative to the earthquake hazard. The result of bridge failures caused by the tsunami can be observed for intensities larger than the 500-year event. The region west of the Necanicum River becomes completely disconnected from the fire stations and hospital which are located east of the river.

In addition to the spatial deaggregation, the aggregated results provide a comprehensive overview of the connectivity. Figure 4 shows the average fraction of tax lots disconnected from critical infrastructure across all of Seaside. The three networked infrastructure systems approach complete disconnectivity for hazard intensities larger than the 1000-year event. The transportation and water networks are dominated by the tsunami for the higher magnitude events; whereas the EPN see’s significant damage from both the tsunami and earthquake. Consequently, if resource managers are planning for high magnitude events, they should invest in tsunami damage mitigation measures.

Figure 3: Spatial distribution of connectivity to fire station and hospitals

Figure 4: Fraction of tax lots connected to critical infrastructure



Network analysis provides a means to evaluate the connectivity of tax lots to critical infrastructure, and incorporating probabilistic methods accounts for uncertainties as opposed to a deterministic approach. While this type of analysis can be useful to determine overall connectivity, it does not account for limitations and additional stresses in the “flow” of the network. For example, damage to the transportation network would result in additional travel times to the fire stations and hospital. In order to provide a more comprehensive analysis of the impact to networked infrastructure, both connectivity and flow should be considered.

Comparing the Spatial Patterns of Remnant Infected Hemlocks, Regeneration Infected Hemlocks, and Non-Hosts of Western Hemlock Dwarf Mistletoe


For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees were related to the spatial pattern of fire refugia outlined in my study site. I used Geographically weighted Regression to determine the significance of this relationship, however I did not find a significant relationship between a western hemlock, its intensity of clustering and infection status, and it’s distance to its nearest fire refugia polygon.

This result led to the realization that the polygons as they were drawn on the map were not as relevant as the actual “functional refugia”. I hypothesized that, after the 1892 fire, the only way for western hemlock dwarf mistletoe to spread back into the stand would be from the trees that survived that fire, or “remnant” trees. These would then infect the “regeneration” tree that came after the 1892 fire. The functional refugia I am interested in are defined by the location of the remnant western hemlocks. I also hypothesized that the spatial pattern of non-susceptible host tree (trees that were not western hemlocks) would play a role in the distribution of the mistletoe.

Question Asked

How are the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species, and how are these relationships related to the spread of western hemlock dwarf mistletoe in the stand?

The Kcross and Jcross Functions

The cross-type functions (also referred to as multi-type functions) are tools capable of comparing the spatial patterns of two different type events (type i and j) in a similar spatial window, of some point process, X. It does this by assigning labels to the events differentiating the type and summarizing the number or distance, of and between events, at differing spatial scales, or radius circles (r).

The statistic Kij (r) , summarizes the number of type j events, around a type i event at a distance of r, or a point process X. Deviations of the observed Kij(r) curve from the the Poisson curve, or if type j events are truly randomly distributed, indicates dependence of type j events on type i events. Similar results can be obtained from the regular Ripley’s K: deviations above the curve indicate clustering and deviations below indicate dispersal.

(Incredibly helpful and interactive explanation link:

The statistic Jij(r) = (1 – Gij(r))/(1-Fj(r)) summarizes the shortest distance between a type i and j event and compares it to the empty space function of type j event. This is another test for inferring independence or dependence of type j events to type i. Deviations of the Jij(r) curve the value of 1, indicate levels of dependence of the events to each-other. Specific deviations from 1 can be hard to interpret without an understanding of the Fj(r) function so imagining it stationary in the ratio makes it easier. As Gij(r) increases, the numerator shrinks, creating a smaller Jij(r) statistic. Deviations below 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j increases. As Gij(r) decreases, the numerator grows, creating a larger Jij(r) statistic. Deviations above 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j decreases.

Methods Overview

In R, the package “spatstat” provides a suite of spatial statistic functions including the cross-type functions. In order to use these you need to create a “point pattern process” object. These objects incorporate X and Y coordinates, and a frame of reference, or a “window,” and give spatial context top a list of values. Then marks are applied to these points that create the necessary multi-type point pattern process object. These marks serve to distinguish the type i and type j events described earlier in your analysis. Then running the “Kcross()” or “Jcross()” functions with the specified type events produces a graph that you can interpret, very similar to producing the normal Ripley’s K plot.

  • I took my X – Y coordinates of all trees on the stand and added a column called “Status” to serve as my mark for the point pattern analysis.
    1. The four statuses were “Remnant,” “Regen,” “Uninfected,” and “NonHost” to identify my populations of interest.
      1. I had access to tree cores, so I identified trees that were older than 170 yrs old and these trees’ diameters served as my cutoff for the “Remnant” diameter class.
      2. All trees DBH > 39.8 cm.
    2. Doing this in ArcMap removed steps I would have to have taken when I migrated the dataset to R.
    3. I removed all the dead trees because I wasn’t concerned with them for my analysis.
  • I exported this attribute table to a csv and loaded it into R Studio.
  • I created the boundary window of my study site using the “owin()” function, and the corner points from my study site polygon.
  • The function “ppp()” creates the point pattern object and I assigned the marks to the data set using the “Status” column I created in ArcMap
    1. It’s important your marks are factors otherwise it is not converted into a multi-type point pattern object.
  • The last step is running the “Kcross()” and “Jcross” to compare the “Remnant” population to the “Regen,” “Uninfected,” and “NonHost” populations.
    1. This produced 6 plots, 3 of each type of cross-type analysis.
    2. Compare these easily using the “par()” function, for example:

par(mfrow = c(1,3))




This produces the three plots in a single row and three columns.


Because I am assuming the remnant, infected western hemlock trees are one of the main factors for the spread of western hemlock dwarf mistletoe and that they are the center of  new infection centers on the study site, I did all my analysis centered on the remnant trees (points with status = “Remnant” treated as event type i).

1)   i = Remnant, j = Regen

The first analysis between remnant and regeneration trees demonstrate that there is dependence on the two events to each other. At fairly small distances, or values of r, infected western hemlocks that have regenerated after the 1892 fire cluster around infected remnant western hemlocks that survived the 1892 fire. This stands to reason because we assume that infected trees will be near other infected trees, and that infection centers start usually with a “mother tree.” In this case the remnant trees serve as the start of the new infection centers. The Jcross output also shows me that the two types of trees are clustered using the frequency of the shortest distances. After ~8 meters the two tree types exhibit definite clustering. In terms of the function, the Gij(r) in the numerator of the Jij(r) function is approaching 1, or the highest frequency of very short distances.

2)   i = Remnant, j = Uninfected

The Kcross plot from the second set of analyses between remnant and uninfected trees demonstrates that there is independence between the two events up to ~15 meters. After that, the trees exhibit slight clustering effects. The lack of dispersal tendencies is strange for these two types of trees because we expect uninfected trees to be furthest away from the center of infection centers. The presence of clustering may be indicative of the small spatial scale of my study site. It may also be that the size of the infection centers are only about 15 meters (if we assume that remnant trees are the center). The Jcross plot shows something similar: at small distances the types of trees seem independent and then around 8 meters they exhibit clustering.

3)   i = Remnant, j = NonHost

The Kcross from the last set of analyses between the remnant trees and the non-hosts demonstrates a similar pattern exhibited by the regeneration trees. After about 4 meters, the trees tend to be clustered. This is an interesting find because if the non-hosts cluster to remnant trees but uninfected trees are independent, then the non-hosts may be playing a role in this. The Jcross plot shows the same: the two types of trees exhibit clustering.

4)   Comparing Kcross Functions with eval.fv()

A useful way to compare patterns of Kcross functions is using the eval.fv function. The titles of each plot tell which Kcross was subtracted from which; note the difference in scales. The first plot shows that the regenerating trees’ spatial pattern as related to remnant trees is very different from the uninfected trees’ pattern. The regenerating trees’ spatial pattern is much more similar to the non-hosts’ spatial pattern at short distances, until about 15 meters. Then the patterns differ with the regenerating trees exhibiting more of a clustering tendency. However the scale is much smaller than the other two graphs. Lastly, the third plot shows the difference between the non-host trees’ spatial pattern and the uninfected trees’ spatial pattern. There appears to be a stepwise relationship where, at very near and very far distances the non-host trees are much more clustered, but at moderate distances the differences may be less dramatic.

Critique of Cross-Type Functions

The amount of easily interpretable literature on the spatstat package as a whole is sparse, although a wealth of very technical information exists. The function was easy to use and execute though and so was the process of creating the point pattern object. These two functions can clearly show how the spatial patterns of the two types of events change with scale. It would be helpful if there was a way to compare three or more types of events. The last drawback is that there is a lack of specific information for each point on your map or study site. This pattern that is generalizable to a whole set of points may not be as useful when trying to put together a story, such as the story of a stand’s development through time.

Additional Raster Analysis

The last critique of the cross-type functions led me to attempt a visualization of these patterns on my stand. Very briefly, I determined densities of the infected, uninfected, and the non-host trees using the Kernel Density function in ArcMap. Then I classified these densities using natural breaks and coded these for raster addition. After adding all three density rasters together, I coded each unique density classification combination to tell me how the densities of the populations appeared in the study site.

It appears that there are distinct patches of high density separated by areas of low density. On the eastern side of my study site, it appears that the high density areas of infected trees cluster with the remnant infected trees. An interesting interaction is occurring between the high density patches of uninfected trees and infected trees in the western portion of my study site. The mechanism for the seemingly clear divide may be the non-TSHE trees.

Blackleg classification by segmentation error analysis and visualization


Once again, I am using the confusion matrix to determine the accuracy of segmented classification based on the ground truth of manual classification. This work piggybacks off of exercise two where I also used a confusion matrix for determining accuracy. In my previous work I critiqued the ability to draw any relevant conclusions based on a sample size of one. Here I have included a sample size of five images which have gone through the entire image processing, segmentation, manual classification and confusion matrix. This allows us to make more relevant comparisons and looks for trends in the data. Additionally, I mentioned the confusion matrix was inaccurate because it was performed on a rectangular region that was simply the extent of the image and not just the leaf. This has been corrected for and now conducts a confusion matrix on the entire leaf and no outside area. Lastly, I have adjusted the segmentation settings to increase accuracy based on previous findings in exercise one and two.


  • How accurate is the segmentation classification?
  • Are there commonalities between the false positives and false negatives?
  • Are more raster cells being classified as false negatives or false positives when misclassified?
  • What is going undetected in segmentation?
  • What is most easily detected in segmentation?

Tools and Methods

Many of the steps are repeated from exercise two. I will repeat them here in a completer and more coherent format with the new steps found in the process.

ArcGIS Pro

Begin by navigating to the Training Sample Manager. To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it “img_X_leaf” and supply a value (the image #) and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the entire leaf (Images 1-5). Save this training sample and the classification scheme. Save ArcGIS Pro and exit. Reopen ArcGIS Pro and go to Add data in the Layer group on the Map tab to upload the training sample polygon of the leaf. This process must be repeated for each leaf.

The polygon of the leaf will need to be converted to a raster. Click the Analysis tab and find the Geoprocessing group to Tools once again. Search Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygons layer for Input Features and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize, which I changed to 1 to maintain the cell size in my original raster. Select Run in the bottom right corner of the pane. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value and Count which can be found by going to the Attribute Table for that layer or clicking on a group of pixels. This also must be completed for each of the polygons created for each leaf outline.

Now we should have three layers for each leaf in which we want to export as Tiff files. We have our “leaf” that we just created, our “manually classified” and “segmented” layers. To export these three, you need to go to each layer and right click. Start with the “leaf” image and go to Data and Export Raster and a pane will appear on the right. Zoom into the image so there isn’t much area above or below the leaf. Select Clipping Geometry and click on Current Display Extent. This will export the image as a Tiff file which we will use later. Continue this process with the other two images without changing the extent. Each image should be exported after this step with the same extent resulting in the same number of pixels and overlap between pixels if placed overtop one another. This can be confirmed in R in the follow steps.


Open R studio and use the following annotated code below:

required packages for rasters
rm(list=ls()) ##clears variables
##raster upload
img_10_truth <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_truth.tif”)
img_10_predict <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_predict.tif”)
img_10_leaf <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_leaf_shape.tif”)
##view the raster
img_10_truth ##confirm dimensions, extent, etc. They need to be the same or very close for a confusion matrix
plot(img_10_leaf) ##view the images
##export data to excel
img_10_leaf_table <-, xy=T) ##creates tables
img_10_predict_table <-, xy=T)
img_10_truth_table <-, xy=T)
write.table(img_10_leaf_table, file = “img_10_leaf.csv”, sep = “,”)
write.table(img_10_predict_table, file = “img_10_predict.csv”, sep = “,”)
write.table(img_10_truth_table, file = “img_10_truth.csv”, sep = “,”)



At this point the three Tiff files should have been uploaded to R as raster’s and undergone a quality check by viewing the images and the data using the supplied code above. Next, it should have been converted to a table for each file and then all exported out as csv files. From here, open all three files into excel. Begin with the “Truth” and “Predict” files and select the entire column that is the farthest to the left, the values associated with each cell. Go to the Home tab and the Editing group and click on the drop down for Sort and Filter and select the A to Z. Change everything from whatever value they are, to 1. This can easily be accomplished by using the Find and Replace function in the Editing group. Replace all the NA’s with 0’s and then go to left column which simply lists each pixel number and sort the column from A to Z so its back to the order it was when originally opened into excel. Once this is completed for both the “Predict” and “Truth” files, copy the right most column of values and paste them into the Leaf” excel file. Use the Sort function to sort the right column from A to Z on the leaf values column and allow for it to expand the selection. It should be a list of 0’s and if you scroll down you will eventually begin seeing NA’s in the leaf value column. Delete everything below this point in the “Truth” and “Predict” value columns. Scroll to the top and delete the first four columns that are the numbers, x and y coordinates and the leaf values column. You should be left with two columns that are the “Truth” and “Predict” values with 1’s and 0’s only. Give each column a label (Truth & Predict), save this document as an xlsx file and exit out.

This excel step helped us overcome the issue of having values that weren’t in a 0’s and 1’s format needed for the confusion matrix. Here, 1’s stand for diseased patch and 0’s stand for non-diseased. What we also did was removed all the values that were outside of the leaf area. So, the only values we are working with are those which are in the leaf and not part of the background. The last step is importing the excel file that was just created into R to perform the confusion matrix.


##confusion matrix
img_10_confusion$Predict <- as.factor(img_10_confusion$Predict) ##convert to factors
img_10_confusion$Truth <- as.factor(img_10_confusion$Truth)
confusionMatrix(img_10_confusion$Predict, img_10_confusion$Truth)


You should now have R output of the confusion matrix and associated statistics.

ArcGIS Pro

For the purpose of visualization, I also changed the colors of the three Tiff files. To do this, I simply right clicked on each of the three layers and selected Symbology. The pane on the right side of the screen appears where you have options to change the Color Scheme. I made the leaf layer green, the segmented regions white and manually classified regions white. The snipping tool was used to extract these images.


Between the five images there were 15,519 pixels classified in the confusion matrix. 14,884 were properly classified as non-diseased. 282 were classified as true disease. 307 were identified as false negatives and 46 were classified as false positives. Overall the segmentation classification had an accuracy of 97.7% with a p-value less than 0.05 indicating statistical significance. The model had a sensitivity of 99.69% and a specificity of 47.88%.

Accuracy between the five images ranged from 94% to 99%, all with significant p-values. The sensitivity for all five individual matrices and the combined matrix was at 99% with no exceptions. Specificity had a fairly large range.

I was happy with the level of accuracy produced throughout all five images and the overall cumulative matrix. The five images ranged a bit in complexity with the number of lesions present, leaf size and shape, size of lesions, etc. Looking at the confusion matrix we can see there were many more false negatives than false positives. This indicates the model is possibly not quite sensitive enough to detect the all regions or certain parts of the disease. The specificity was not consistent between the five images for some reason which is still unclear to me. As mentioned, although the model may have some issues with detection certain patches, the sensitivity indicated throughout all matrices is very high.

Viewing the images below we can see where the false positives appear and where the false negatives occur (Images 1-5). It seems that both the false positives and negatives tend to appear on the margins of the lesion when they at least detected. Additionally, larger patches seem to have a higher rate of detection than those which are small in size. This is especially true in Image 1&2.


My only critique is the process to get to this step. I was very satisfied with my results but see logistical problems in getting through large amounts of samples. I will attempt to find shortcuts throughout the process my next go around. I will also seek creating a model in ArcGIS Pro for some of the rudimentary tasks done in the processing of the images.


Table 1. Confusion matrix of all five samples combined. Values are number of pixels classified out of 15,519 total.


Figure 1. Classification error for the combined matrix. The values are the number of pixels classified out of 15,519 total.


Image 1. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 2. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 3. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 4. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 5. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.