GEOG 566

         Advanced spatial statistics and GIScience

Archive for 2019

May 24, 2019

Landscape Patterns as Predictors of Tree Height

Question: Which landscape features correspond to clusters of greater than expected trees?

Methods: I performed two Hot Spot Analyses in ArcMap; one on Hot Spots of tree height and another on hot spots of distance between trees. Both were constrained to the reserved control areas of the HJ Andrews Forest. Hot spots in tree height are regions of greater than expected tree height, while hot spots in distances between trees are regions of greater than expected distance between individual trees (more dispersed trees). Hot spots and cold spots of each analysis generally overlapped. However, hot spots between tree height and spacing did not overlap in all cases, so I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. Since the end goal is to find landscape features that may correlate with amount of soil carbon, I conducted this analysis with the assumption that taller trees may correlate with regions of greater soil carbon. I used the package ‘caret’ in R to calculate a  confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots (Table 1).

Results: Regions of taller than expected trees often correspond to regions of greater than expected distances between trees, which agrees with current forest growth models (Fig. 1). Hot spots of tall trees are typically in valleys and cold spots are commonly on ridges (Fig 3 & 4). When we zoom in to the Lookout Mountain area of HJ Andrews, we see that hot spots of tall trees are more concentrated in valleys and on footslopes, and cold spots are closer to mountain ridges (Fig 3). When compared with the distance hot spot map of the same area, we see that cold spots go much further down the footslopes and even into the valleys in some cases (Fig 4). So although we have evidence for a strongly linear relationship between height and distance between trees, we also have evidence that they do not fully explain each other and other landscape features are likely at play.

Fig 1. Distance Z-scores vs. Height Z-scores from hot spot analyses show a linear relationship.

Fig 2. HJ Andrews elevation with reserved control areas in orange and

inset area of Lookout Mountain hot spot maps (below)


Fig 3. Hot Spot Analysis showing hot spots of tree heights (tallest trees)

in the Lookout Mountain area


Fig 4. Hot Spot Analysis showing the greatest distance between trees

in the Lookout Mountain area


An elevation band that correlates with occurrences of tall trees exists up to around 1100 m, after which point number of tall trees drops off substantially (Fig. 5). Certain aspects seem to correlate with taller trees, but those relationships are harder to tease apart and I have yet to fully explore them. Greater slopes tend to correlate with shorter trees, but this relationship is not linear. There is an interesting upwards trend at slopes between 30 and 50 degrees that seems to correlate with slightly taller trees, then a big drop in mean height Z-score at slopes of 60 degrees.

Fig 5. Aspect, elevation and slope compared with Z-scores of mean height.

A comparison of Z-scores from hot spot analyses of height and distances shows that although hot spots of height and distance tightly correlate, covariates that explain them are different (mean slope and elevation). When we compare the most extreme Z-scores to one another, slope, height and distance between trees are not particularly different. Mean elevation in three categories of Z-score is similar, but mean elevation in the fourth group (>3,>3) is significantly lower. A next step is to map out these

Table 1. Comparison between the most extreme Z-scores of tree height and tree spacing.

Height Z-Score Distance Z-Score Mean Height (m) Height_SD Mean Distance (m) Distance_SD Mean Slope (m) Slope_SD Mean Elevation (m) Elevation_SD
<-3 <-3 22.8 10.5 5.1 2.4 27.9 10.5 1285 294
<-3 >3 24.5 11.2 5.6 2 26.9 4.5 1377 153
>3 <-3 34.8 7.1 5 2 31.8 5.9 1310 44
>3 >3 39.5 16.7 4.7 2.6 26.2 11 934 188

Critique: These analyses are still based on Hot Spot Analyses, so they still comes with the same criticisms as previous Hot Spot Analyses. One of these criticisms was that it’s basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses is reducing the information in that dataset unnecessarily. I have yet to map out regions that were well-predicted and poorly predicted spatially, so I cannot fully discuss the merits of the confusion matrix method.


Exercise 3: Lagoons, ENSO Indices, and Dolphin Sightings

Exercise 3: Are bottlenose dolphin sightings distances to nearest lagoon related to ENSO indices in the San Diego, CA survey site?

1. Question that you asked

I was looking to see a pattern at more than one scale, specifically the relationship with ENSO and sighting distributions off of San Diego. I asked the question: do bottlenose dolphin sighting distributions change latitudinally with ENSO related to distance from the nearest lagoon. The greater San Diego area has six major lagoons that contribute the major estuarine habitat to the San Diego coastline and are all recognized as separate estuaries. All of these lagoons/estuaries sit at the mouths of broad river valleys along the 18 miles of coastline between Torrey Pines to the south and Oceanside to the north. The small boat survey transects cover this entire stretch with near-exact overlap from start to finish. These habitats are known to be highly dynamic, experience variable environmental conditions, and support a wide range of native vegetation and wildlife species.

Distribution of common bottlenose dolphin sightings in the San Diego study area along boat-based transects with the six major lagoons.

2. Name of the tool or approach that you used.

I utilized the “Near” tool in ArcMap 10.6 that calculated the distance from points to polygons and associated the point with FID of that nearest polygon. I also used R Studio for basic analysis, graphic displays, and ANOVA with Tukey HSD.

3. Brief description of steps you followed to complete the analysis.

  1. I researched the San Diego GIS database for the layer that would be most helpful and found the lagoon shapefile.
  2. Imported the shapefile into ArcMap where I already had my sightings, transect line, and 1000m buffered transect polygon.
  3. I used the “Near” tool in the Analysis toolbox, part of the of the “proximity toolset”. I chose the point to polygon option with my dolphin sightings as the point layer and the lagoon polygons as the polygon layer.
  4. I opened the attribute table for my dolphin sightings and there was now a NEAR_FID and NEAR_DIST which represented the identification (number) related to the nearest lagoon and the distance in kilometers to the nearest lagoon, respectively.
  5. I exported using the “conversion” tool to Excel and then imported into R studio for further analyses (ANOVA between the differences in sighting distances to lagoons and ENSO indices).

4. Brief description of results you obtained

After a quick histogram in ArcMap, it was visually clear that the distribution of points with nearest lagoons appeared clustered, skewed, or to have a binomial distribution, without considering ENSO. Then, after importing into R studio, I created a box plot of the distance to nearest lagoon compared to the ENSO index (-1, 0, or 1). I ran an ANOVA which returned a very small p-value of 2.55 e-9. Further analysis using a Tukey HSD found that the differences between ENSO states of neutral (0) and -1 and neutral and 1 were significant, but not between 1 and -1. These results are interesting because this means the sightings of dolphins differ most during neutral ENSO years. This could be that certain lagoons are preferred during extremes compared to the neutral years. Therefore, yes, there is a difference in dolphin sightings distances to lagoons during different ENSO phases, specifically the neutral years.

5. Critique of the method – what was useful, what was not?

This method was incredibly helpful and also was the easiest to apply once I got started, in comparison to my previous steps. It allowed to both visualize and quantify interesting results. I also learned some tricks for how to better graph my data and to symbolize my data in ArcMap.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Ex. 3: Does black stain spread through landscape networks?


For those who have not seen my previous posts, my research involves building a model to simulate the spread of black stain root disease (a disease affecting Douglas-fir trees) in different landscape management scenarios. Each of my landscapes are made up of stands assigned one of three forest management classes. These classes determine the age structure, density, thinning, and harvest of the stands, factors that influence probability of infection.


Are spatial patterns of infection probabilities for black stain root disease related to spatial patterns of forest management practices via the connectivity structure of the network of stands in my landscape?


 I decided to look at how landscape connectivity influenced the spatial relationship between forest management practices and infection probabilities. This approach builds off of a network approach based in graph theory (where each component of the landscape is a “node” with “edges” connecting them) and incorporates concepts from landscape ecology regarding distance-dependent ecological processes and the importance of patch characteristics (e.g., area, habitat quality) in the contribution of patches to the connectivity of the landscape. I used ArcMap, R, and a free software called Conefor (Saura and Torné 2009) to perform my analysis.


 1. Create a mosaic of the landscape

The landscape in my disease spread model is a torus (left and right sides connected, top and bottom are connected). The raster outputs from my model with stand ID numbers and management classes do not account for this and are represented as a square. Thus, in order to fully consider the connectivity of each stand in the landscape, I needed to tile the landscape in a 3 x 3 grid so that each stand at the edge of the stand map would have the correct spatial position relative to its neighbors beyond the original raster boundary. I did this in R by making copies of the stand ID raster and adjusting their extent. In ArcMap, I then assigned the management classes to each of those stands, converting to polygon, using the “Identity” tool with the polygon for management class, and then using the “Join Field” tool so that every stand with the same unique ID number would have the relevant management class assigned. If I had not done this step, then the position of stands at the edge of the raster in the network would have been misrepresented.

2. Calculate infection probability statistics for each stand

I then needed to relate each stand to the probability of infection of trees in that stand (generated by my model simulation and converted to point data in a previous exercise). In ArcMap, I used the “Spatial Join” tool to calculate statistics for infection probabilities in that stand, because each stand contains many trees. Statistics included the point count, median, mean, standard deviation, minimum, maximum, range, and sum.

3. Calculate the connectivity of each stand in the network of similarly managed stands in the landscape

3a. For this step, I used the free software Conefor, which calculates a variety of connectivity indices at the individual patch and overall landscape level. First, I used the Conefor extension for ArcMap to generate the input files for the Conefor analysis. The extension generates a “nodes” file for each feature and a “connection” file, which contains the distances between features a binary description of whether or not a link (“edge”) exists between two features. One can set the maximum distance for two features to be linked or generate a probability of connection based on an exponential decay function (built-in feature of Conefor, which is an incredible application). For my analysis, I performed connectivity analyses that only considered features to be linked if (i) they had the same management class and (ii) there were no more than 10 meters of distance between the stand boundaries. Ten meters is about the upper limit for the maximum likely root contact distance between two Douglas-fir trees.

3b. For each management class, I ran the Conefor analysis to calculate multiple metrics. I focused primarily on:

  • Number of links in the network
  • Network components – Each component is a set of connected patches (stands) that is internally connected but has no connection to any other set of patches.
  • Integral Index of Connectivity (IIC) – Essentially, this index gives each patch (stand) a value in terms of its importance for connectivity in the network based on its habitat attributes (e.g., area, habitat quality) and its topological position within the network. For this index, higher values indicate higher importance for connectivity. This is broken into three non-redundant components that sum to the total IIC:
    • IIC intra – connectivity within a patch
    • IIC flux – area-weighted dispersal flux
    • IIC connector – importance of a patch for connecting other patches in the network) (Saura and Rubino 2010)
  1. Analyze the relationship between connectivity metrics and infection probabilities

I reduced the mosaic to include only feature for each stand, eliminating those at the periphery and keeping those in the core. I confirmed that the values were similar for all of the copies of each stand near the center of the mosaic. I then mapped and plotted different combinations of connectivity and infection probability metrics to analyze the relationship for each management class (Fig. 1, Fig. 2).

Fig. 1. Map of IIC connectivity index and mean infection probability for the extensively managed stands.


I generally found no relationship between infection probability and the various metrics of connectivity. As connectivity increased, infection probabilities did not change for any of the metrics I examined (Fig. 2). I would like to analyze this for a series of landscape simulations in the future to see whether patterns emerge. I could also refine the distance used to generate links between patches to reflect the dispersal distance for the insects that vector the disease.

Fig. 2. Plots of infection probability statistics and connectivity metrics for each of the stands in the landscape. Each point represents one stands in the randomly distributed landscape, with extensively managed stands in red, intensively managed stands in blue, and old-growth stands in green.

CRITIQUE OF THE METHOD – What was useful, what was not?

I had originally planned to use the popular landscape ecology application Fragstats (or the R equivalent “landscapemetrics” package), but I ran into issues. As far as I could tell (though I may be incorrect), these options only use raster data and consider one value at a time. What I needed was for the analysis to consider groups of pixels by both their stand ID and their management class, because stands with the same management class are still managed independently. However, landscapemetrics would consider adjacent stands with the same management class to be all one patch. This meant that I could only calculate metrics for the entire landscape or the entire management class, which did not allow me to look at how each patch’s position relative to similarly or differently managed patches related to its probability of infection. In contrast, Conefor is a great application that allows for calculation of a large number of connectivity metrics at both the patch and landscape level.


Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Ex 3: Grain size distributions and peak flow events

Filed under: 2019,Exercise 3 @ 5:47 pm

The context:

For this exercise, I wanted to figure out if there was a relationship between the temporal pattern of peak discharge in my study creeks and the temporal pattern of grain size distributions. The temporal pattern of grain size distributions could help tell a story about the interplay between hillslope and alluvial process and the forces involved in shaping and stabilizing the streams. One might predict that changes in grain size distribution might be related to extreme events. Large events could associated with debris flows which might reduce sorting while more moderate flow events could increase sorting.

The data:

Pebble counts were conducted in conjunction with cross section surveys at five reaches between 1995 and 2011. They show both the median (D50) and standard deviation of grain size varying over time.

During this time period, two different cross section sampling methods were used, depending on the year. In one method, every cross section in a given year was surveyed. In another method, the only cross sections sampled were ones in which the field crew thought that the creek bed had changed. Because of this, the grain size samples in some year are incomplete and biased towards conditions that favor visible change.

The graphs below show the mean of the D50 and mean of the standard deviation for each set of cross section for each year sampled in subfigure a. The error bars show standard error. The faded points represent years where the field crew only sampled a subset of the cross sections. Subfigure b shows the area-normalized annual peak discharge for the stream gauges on Mack and Lookout Creek. Cold Creek is ungauged.

These figures imply that most reaches were the least sorted in 1997, the year after the flood of record, followed by various decreases and increases over time.

The questions

I asked the following questions to try to understand the relationship between peak flow and grain size distribution:

  1. Is the standard deviation of grain size in a given water associated with the peak flow from that water year?
  2. Is the standard deviation of grain size in a given water associated with the peak flow from the previous water year?
  3. Is the change in standard deviation of grain size between two years associated with the largest peak flow from the interceding years?
  4. Is the change in D50 between two years associated with the largest peak flow from the interceding years?


The methods and results

I used simple linear regression to relate each of these variables and took the R2 value to represent how much of the sediment distribution variability each variable might explain. The results were negative in all cases.


  SITE  Rsquared
1 LOL    0.00983
2 LOM    0.00320
3 MAC    0.0340 
4 MCC    0.106


  SITE  Rsquared
1 LOL     0.0546
2 LOM     0.0297
3 MAC     0.0191
4 MCC     0.217


  SITE  Rsquared
1 LOL     0.0845
2 LOM     0.0718
3 MAC     0.0842
4 MCC     0.0852


SITE  Rsquared
1 LOL    0.0440 
2 LOM    0.0252 
3 MAC    0.0553 
4 MCC    0.00688

The answer to these questions all appear to be no – clearly hydrology has some impact on grain size distributions, but the relationship may be too complicated to address using a single predictor variable and limited data.


Exercise 3: Perception of natural resource governance and changes in environment over time


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

Tools and Approaches

  1. Tabulate area in ArcGIS Prop
  2. T-test in R

Analysis Steps

  1. To begin to answer this question, I first needed to find spatial layers of the environmental condition in the Puget Sound over time. I could not use the same ‘environmental effects’ data I used for Exercise 2, as the data was only completed 5 months prior. Instead, I substituted two land cover maps from 1992 and 2016 to approximate environmental condition.

I first loaded the raster layers into ArcGIS Pro. I then reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

  1. To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.


Land cover change differed significantly for four different land cover types between the two groups—grassland, forest, shrub/scrub, and bare land cover. Grassland cover decreased for both groups, but decreased by 5% more in in the areas with individuals with below average governance perceptions. Forest cover also decreased for both group, but decrease by 1% more individuals with below average governance perceptions (the amount of forest cover in the region is very large which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below average governance perceptions. Lastly, bare land decreased for both, but decreased by 5% more for individuals in areas with individuals with below average governance perceptions (Table 1).

The effect size of land cover change on perception was small for each of the significant variables with biserial correlations between .07 and .09 (Table 1).

Table 1. Differences in land cover change between individuals with above average perceptions of natural resource governance perceptions and individuals with below average governance perceptions

  Governance Perception1      
  Above Average Below Average t-value p-value Effect size (rpb)
High Development 17 16 0.80 .426 .02
Medium Development 19 19 0.67 .500 .02
Low Development 11 11 0.22 .829 .00
Developed Open Space 12 13 0.21 .837 .00
Agriculture -1 0 0.74 .461 .02
Grassland -13 -18 3.03 .002 .09
Forest -10 -9 2.61 .009 .08
Shrub/Scrub 40 43 3.04 .002 .09
Wetlands 0 0 1.67 .095 .05
Unconsolidated Shore 2 2 0.10 .918 .00
Bare Land -21 -25 2.43 .015 .07
Water 0 0 0.90 .368 .03
Aquatic Beds -1 -5 1.19 .233 .03
Snow/Tundra 0 0 0.00 .997 .00

1Cell entries are percent changes of land cover from 1992 to 2016


  1. Critique

The biggest problem I had with this method was figuring out the most appropriate way to test my question. I struggled with classifying ‘good’ or ‘bad’ land cover change, and ultimately decided it was inappropriate. I also don’t think land cover is necessarily the best way to test environmental condition. I think it would be more appropriate if I were able to use environmental effects over time. I think it would also be best if I did different sized buffers around individuals.

Additionally, I believe I should have controlled for how long they lived in the region. The average number of years individuals have lived in the region was 34, which is less than the land cover change of 23 years. However, as years lived in the Puget Sound is a significant determinant of perception, it may have more to add in this case. The problem here is that this method is clunky. The way in which I differentiated high/low perception was by having different csv files. Having to then split these by years in the region would create an additional step, and more files, that could ultimately increase personal error because of the difficulty in keeping track of which buffer is which type of individual.


Ex 3: Analysis of juniper density and slope, aspect characteristics

Filed under: 2019 @ 3:19 pm

1.Question asked and data used: 

How does juniper density vary with slope and aspects characteristics (as indicators of soil moisture)?

This study site for this analysis is the Camp Creek Paired Watershed Study (CCPWS), located in a semiarid region of central Oregon. This research focuses on two watersheds: one in which most juniper was removed in 2005-2006, referred to as the treated WS, and one watershed in which mature juniper is the dominant overstory species, referred to as the untreated WS. Belt transects (3m wide by 30m long) which recorded the number, height, and canopy cover of juniper within the treated WS were used for analysis. The number of juniper found in each transect was represented as a single point at the beginning of each transect. Additionally, NAIP imagery and a 30m DEM were used to assess juniper density and slope and aspect characteristics across both watersheds and in the surrounding areas.

2. Approach and tools used:

Analysis was conducted using R Studio, ArcGIS Pro, and Excel. A Chi-squared test and logistic regression were calculated in R, followed by Geographically Weighted Regression (GWR) in ArcGIS Pro. Excel was used to calculate the percentage of pixels classified as juniper within each buffer.

3. Steps used in analysis:

     A. Data preparation

              1. The following steps were conducted in a previous exercise: 1) a 30m DEM was used to extract slope and aspect characteristics, 2)slope and aspect values were each divided into nine groups, representing characteristics generally associated with increased soil moisture (e.g., flat slopes with northerly aspects were rated highest), 3)using the raster calculator the values of slope and aspect were averaged ((slope category+aspect category)/2). The raster created using this process rated each grid square from 1-9 (henceforth referred to as “rating model”). The rating model is shown below.

             2.Classification of NAIP imagery (also from a previous exercise). The NAIP imagery was classified using Support Vector Machine supervised classification. (Please note: the resolution of the NAIP imagery makes identification of juniper saplings difficult, therefore conclusions from the data should be made with caution.) The NAIP raster was resampled to the same resolution as the 30m DEM. The pixels were reclassified into a binary raster, with each pixel being either “juniper” or “not juniper”. The resampled NAIP imagery is shown below.

            3.The classified NAIP raster and rating model were projected to NAD 1983 UTM zone 10. Both rasters were masked to cover the same extent.

            4. A table was created which represented the number of juniper found per transect as a point. Using the “extract multi values to points” tool in ArcGIS the corresponding value from the rating model was assigned to each transect point.

             5. In a previous exercise, a buffer of 50m was created surrounding each of the belt transects (treated WS only). The percentage of juniper (pixels classified as juniper/total pixels) was calculated in excel and added to the belt transect table created previously.

             6. Using the “create random points” tool in ArcGIS, 96 random points were created within the study area (treated WS, untreated WS, as well as surrounding area). These points were a minimum of 150m apart.

             7. A 50m buffer was created surrounding each of the 96 random points. The percentage of juniper in each buffer was calculated as indicated above. These values were added (using the join function) to a table with the random points.

      B. Analysis

             1. The binary classified NAIP raster and rating model were loaded into R Studio and reviewed (addresses have been abbreviated):





             2. Chi-squared test was conducted for the rating model and classified NAIP raster (R):


             3. Logistic model, summary of results and associated plots (R):<-glm(juniperonly_NAIP[]~apslop[],data=v2,family=binomial)


              4. GWR was conducted in ArcGIS for the belt transects (in treated WS), using the juniper density in the 50m buffer as the dependent variable and the rating model as the independent variable. The distance band neighborhood type and golden search neighborhood selection were used.

              5. GWR was conducted in ArcGIS using the 96 random points (in treated WS, untreated WS, and surrounding areas). The same neighborhood parameters described above were used.

              6. Global Moran’s I was calculated using the residuals for both GWR analyses. Inverse distance and the Euclidean distance method were used for analysis.

4. Overview of Results:

      A. The Chi-squared test of the binary NAIP raster (indicating juniper versus non-juniper) and the rating model raster indicated a slightly significant relationship:

Pearson’s Chi-squared test data:

apslop[] and juniperonly_NAIP[]

X-squared = 28.624, df = 15, p-value = 0.01798

      B. The logistic regression model indicated a significant relationship between the aspect and slope classification (from the rating model) and the presence or absence of juniper (dependent variable). However, the difference between the null deviance and the residual deviance suggests that this may not be an appropriate model for predicting the presence or absence of juniper. The exponent of the coefficient indicated that for every increase in aspslop rating the probability of juniper being present increased by 1.06.

Call: glm(formula = juniperonly_NAIP[] ~ apslop[], family = binomial, data = v2) Deviance Residuals:

Min 1Q Median 3Q Max

-0.6934 -0.6576 -0.6317 -0.5984 1.9537


Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.80732 0.11048 -16.359 < 2e-16 ***

apslop[] 0.05935 0.01948 3.046 0.00232 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7425.5 on 7761 degrees of freedom

Residual deviance: 7416.1 on 7760 degrees of freedom

(656 observations deleted due to missingness)

AIC: 7420.1

Number of Fisher Scoring iterations: 4

(Intercept)      apslop[]

0.164094       1.061147

      C. The GLM of treated WS dataset indicated an adjusted R-squared value of 0.28. Also, the following warning was indicated for this analysis: “At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results”.  This further suggests that this model may not be indicative of the explanatory factors related to juniper density. The GWR results are provided below.

Golden Search Results

Distance Band     AICc

292.6545      222.3773

884.3284      221.4705

518.6538      216.8374

658.3291      219.2658

432.3298      217.0552

572.0050      217.5695

485.6810      216.6200

465.3026      216.6782


 WARNING 110259: At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results.

——————————– Analysis Details ——————————-

Number of Features:                                                             33

Dependent Variable:                      BELTTRANSECTS_50M_EXCELTOTABLE.F__JUNIPER


Distance Band:                                                            465.3026


——– Model Diagnostics ———-

R2:                              0.4918

AdjR2:                           0.2845

AICc:                          216.6782

Sigma-Squared:                  28.4635

Sigma-Squared MLE:              20.4670

Effective Degrees of Freedom:   23.7290

     D. The GWR performed for the larger dataset (points distributed across both watersheds and in the surrounding areas) indicated a similar R-squared value to the GWR based in the treated WS only. A similar warning regarding the residuals was indicated for this GWR analysis. The results of this GWR are displayed below.

Golden Search Results

Distance Band     AICc

1360.4243     -98.1728

3270.5186     -84.8038

2090.0154     -90.0118

2540.9275     -87.1310

1811.3364     -92.9095

1639.1033     -95.1341

1532.6574     -96.4953

1466.8702     -97.2349

1426.2115     -97.6318

1401.0830     -97.8502

1385.5527     -97.9775

1375.9545     -98.0539


WARNING 110259: At least one of your local regressions had very limited variation after applying the weights. Please use caution when interpreting the results.

—————————- Analysis Details —————————

Number of Features:                                                     96


Explanatory Variables:            RANDOM_POINTS_105.ASPSLOP_PROJECTED_CLIP

Distance Band:                                                   1375.9545


——— Model Diagnostics ———-

R2:                              0.2932

AdjR2:                           0.1943

AICc:                          -98.0539

Sigma-Squared:                   0.0190

Sigma-Squared MLE:               0.0167

Effective Degrees of Freedom:   84.3419

     E.The Global Moran’s I for the GWR residuals from the belt transects dataset (treated WS) indicated clustering (p<0.0001) while the GWR residuals for the larger dataset (both WS with surrounding area) indicated random distribution (p=0.327).

5. Discussion/critique of methods:

The analysis used here suggests that other explanatory factors (outside of the rating model created based slope and aspect) may be in place which influence the density of juniper. The R-squared values of this analysis improved compared to previous exercises but are still relatively low. Additionally it should be noted that the presence of juniper has been correlated to reduced soil moisture in some conditions (Lebron et al., 2007), therefore slope and aspect may not be sufficient representatives of soil moisture in these watersheds.

The intent of this analysis was to develop a potential workflow that can be used with other datasets. Caution should be used in drawing any conclusions from this analysis specifically. Additionally, caution should be taken in interpreting the results of this analysis due to the characteristics of the datasets used. For instance, the resolution of the NAIP imagery makes the detection of juniper saplings difficult. Therefore the classification results may not be an accurate representation of juniper density, particularly in the treated WS. However, this process may be applied to more expansive data to include higher-resolution imagery (such as imagery collected using UAVs, etc.).


Lebron, I., Madsen, M. D., Chandler, D. G., Robinson, D. A., Wendroth, O., & Belnap, J. (2007). Ecohydrological controls on soil moisture and hydraulic conductivity within a pinyon-juniper woodland. Water Resources Research, 43(8), W08422.


Does proximity of open areas influence invasibility of forested areas?

Filed under: 2019,Exercise 3 @ 2:34 pm

Question Asked

I am interested in how the spatial pattern of non-forested areas influences invasibility of forest plots by the exotic annual grass vententata (Ventenata dubia). For exercise 3 I asked the question, how are forest plots with high and low ventenata cover related to the spatial pattern of canopy openness?

Tools and Approaches Used

To explore this question, I performed a Ripley’s cross-K between ventenata forest plots with high and low cover and non-forest areas (Ripley 1976). This analysis is used to interpret spatial relationships between two point patterns. I used ArcGIS and R to answer the question following the steps below:

  • I added my sample plots with ventenata cover and coordinates and fire perimeters into ArcGIS.
  • To generate non-forest points, I added canopy cover raster data for the study area into ArcGIS with values ranging from 0 to 95% canopy cover.
  • I buffered the sampled fire perimeters by 3000m using the “buffer” tool and generate 200 random points, not closer than 100m in each buffer using the “create random points” tool in ArcGIS.
  • I extracted the raster values to the random points using the “values to points” tool, and extracted points that had <10% canopy cover into a new layer file for open areas.
  • I then extracted my sample plots only from forested areas into a forest_plots layer.
  • I exported these plots and the random points into a data frame with spatial data and an indicator for sites with high ventenata (>15% cover), low ventenata (0-15% cover), or random open points with no ventenata recorded (labeled “high”, “low”, and “open” respectively).
  • I performed a Ripleys cross K between the high and low ventenata forest plots and non-forest “open” areas and plotted the response in R with the package “spatstat” (Thanks, Stephen C. for sharing R code!). I repeated the analysis individually for three sampled fires (Fox, South Fork, and Canyon Creek) to avoid picking up the inherently clustered spatial pattern of my sample plots and overshadowing the research question. However, I also performed the analysis on the entire sample area to view where issues arise when ignoring a clustered sample design and compare to results from individual fires.
    1. In order to perform the cross Ripleys K on each sampled fire, I first had to establish rectangular “windows” around each fire and provide the xy coordinates for the vertices using tools in ArcGIS depicted below.

Results & Discussion

Most annual grass species prefer areas of low canopy cover. These open areas can, however, act as source populations for invasion into more resistant forested areas. I hypothesized that ventenata (an annual grass) abundance in forested areas would increase with increased proximity and abundance of open areas near the sampled forest (i.e. sites with high ventenata cover in forested areas would be more spatially “clustered” around non-forest areas than sites with low ventenata cover).

The results of my Ripley’s cross-K show that the spatial relationship between sites with high/ low ventenata cover differs between fires. At the Fox fire (FOX), high and low sites closely follow the projected Poisson distribution when crossed with open areas, suggesting that ventenata abundance in forest sites is not related to nearby open areas. However, in the Corner Creek (CC) and South Fork (SF) fires, heavily invaded forest sites are more clustered around open points and low sites are more dispersed around open points than would be expected by chance. I may not have seen a relationship between open areas and high/ low sites at the fox fire because of the relatively low number of forest sample plots (n = 5 high and 4 low). Another explanation could be that I am missing an interaction between fire and invasion by not accounting for canopy lost in forest plots as a function of fire. Analyzing burned vs. unburned forest plots (with fire as an additional driver of invasion) would have been ideal, but would have dropped the sample size too low to perform analyses.

When I performed the cross-K on the entire study area it shows strong clustering of both high and low plots around open areas. However, this is likely the result of the clustered spatial pattern of my sampling method (targeting specific burned areas), and less representative of the actual spatial relationship between ventenata invasion and canopy cover.

Critique of Method

Overall, I thought the cross-K was a useful method for evaluating the spatial relationship between two variables (ventenata cover and open areas). However, a method that could include interaction terms (burning) would probably have been more appropriate for this study. Additionally, the arrangement of sample plots made it difficult to evaluate overall patterns across the study region. It would have been a more useful analysis had my samples been evenly distributed across the region.

Ripley, B.D., 1976. The second-order analysis of stationary point processes. Journal of applied probability, 13(2), pp.255-266.

Ex3: Correlation of Mining and Non-Mining Villages’ NDVI Loss with Distance to Nearest Primary Road

Filed under: Exercise 3 @ 2:26 pm

Following up from Exercise 2, where I determined how NDVI loss relates to distance from village center, I now wanted to explore whether or not the NDVI loss I observed could be explained or attributed to another factor: the distance of a village from a primary road.

To do this, I acquired a shapefile of roads from OpenStreetMaps for Senegal, which I used alongside my locations of mining and non-mining villages, ArcGIS Pro for some simple spatial work, and Excel to calculate a simple correlation.

First, I imported the shapefile of roads in Senegal to ArcGIS Pro, and first determined what the nearest actual road to each village was (fig. 1). Based on this first-pass glance, it appears the more mining sites in my sample than non-mining sites are nearest to “primary” paved roads, and that the majority of non-mining villages are near “tertiary” roads. This simply means that, out of my sample, more mining happens closer to paved roads, which would make sense: gold needs to be transported from gold sites, and equipment/supplies need to be transported in.

Fig 1

Next, I manipulated the shapefile to only include features which were designated “primary” roads (fig. 2). I then used ArcGIS Pro’s “Near” tool to generate new attributes for each set of villages which listed the shortest distance to the nearest primary road in meters.

Fig 2: Green dots are non-mining villages; red dots are mining villages

With these distances in hand, I plotted them in excel against the NDVI % loss values for the first ring (0-250m) around each village, as a simple first pass of understanding how distance from a primary road may be related to the amount of NDVI loss being experienced at a particular road (fig. 3).

Fig 3

Then, within Excel, I calculated a simple correlation (a la fig. 4) to determine how distance from a nearest primary road relates to amount of NDVI loss experienced by a village.

Fig 4: x, in this case, is distance (m) from nearest primary road and y is NDVI loss at the first ring.

For mining villages, I calculated a correlation of -.804, and for non-mining villages, I calculated a correlation of .686. This means that, for this sample at least, the NDVI loss experienced at a mining village is negatively correlated with distance from a primary road; that is, as a mining village’s distance from a primary road increases, its NDVI loss should decrease. This makes perfect sense: mining villages closer to primary roads are easier to access both by miners and by other users, and to leave as well, encouraging larger groups of people to live there, which would decrease NDVI as a result of their activities.

Non-mining villages, on the other hand, are more difficult to explain. One explanation, which may be plausible, is that more distant villages have less opportunities for other livelihoods which are not environmentally dependent, and so are more obligated to work in and with the environment, perhaps encouraging vegetation loss as users rely on and drawdown more on their environment.

Overall, I think this was a useful approach for examining another possible explanation for why mining villages may have higher rates of NDVI loss than non-mining villages. Mining villages are closer to primary roads, which means they are more accessible to potential miners, and thus more likely to grow larger than non-mining villages, and drawdown more on their environment. That being said, as can be seen in Fig. 3, this is not a perfect relationship by any means, and a more robust analysis would be useful to say much more. But, as a first pass, it’s useful for illustrating some additional explanation.

Exercise 3: Confusion Matrix to Test Predictive Capacity of Seascapes

Filed under: 2019,Exercise 3 @ 2:10 pm

Question Asked

I’ve been using my time in this class to explore methods and analyses that could help me to better understand the spatial and temporal relationships between satellite-derived seascapes and fishes in the California Current Ecosystem. In Exercise 1, different types of interpolations were explored and applied to the trawl data to see how results changed. After refining methods related to the Kriging Interpolation, Exercise 2 was conducted, exploring he possible uses of neighborhood analysis in determining the prominent seascapes related to high rockfish abundance in the California Coast Ecosystem. Exercise 3 builds on the results of the previous two exercises: in order to explore the relationship between seascapes and rockfish abundance, a confusion matrix was calculated to measure the ability for certain seascapes to predict the occurance of rockfish hotspots. Simply put, the question being asked is: How were seascapes related to areas of high rockfish abundance in the California Coast Ecosystem in May of 2007?

Tool Used

In order to answer this question, I’m going to use a confusion matrix. A confusion matrix is a table that measures the agreement between two raster layers and provides an overall measure of a model’s predictive capacity. Additionally, these matrices can also measure a raster’s error (in terms of false-positives or false negatives). Each statistic calculated is useful, as each one clues the researcher into spatial or methodological processes that may account for certain types of error. The confusion matrix works by taking original data and creating accuracy assessment points, which are points assigned to random spatial points and given the value indicated by the raster value at that point. Those points are then matched up to points on the model raster and statistics are calculated measuring how much they agree.


The data used for this exercise are the following:

  • Significant seascape classes that indicate high rockfish abundance for May 2007, as determined by the neighborhood analysis in Exercise 2
  • Seascape raster data for the relevant month
  • Rockfish abundance raster (interpolated, from Exercise 1)

Confusion matrices can only be calculated when your ground-truthed and modeled layers are the same units. For this reason, my raster data had to be converted to presence absence data for both the seascape classes and the interpolated rockfish abundances. In order to do this, I used the Reclassify tool in ArcGIS to change “present” values to 1 and “absent” values to 0. For the interpolated abundance raster, any nonzero and nonegative values were changed to 1 to indicated modeled rockfish presence. The remaining cells were changed to 0 to indicate modeled rockfish absence. The result was the raster shown below:

Rockfish Presence-Absence Layer calculated using the Reclassify too on the interpolated trawl data.

Similar steps were used to reclassify the seascape data – seascape classes 7, 12, 14, 17, and 27, which were shown to be present within a 25km radius of the high abundance trawls in Exercise 2, were counted as present cells, while the other classes were counted as absent cells. The result was the raster shown below:

Rockfish Presence-Absence raster calculated using the results from exercise 2 to reclassify the seascape data

It was determined that the seascpae layer would be the base or “ground-truthed” layer and the interpolation layer would be the modeled layer, since the interpolated values are simply estimations. The next step was using the “Create Accuracy Assessment Points” Tool to create randomized points within the modeled layer. The result was the following shapefile:

Randomized accuracy assessment points created using the Create Accuracy Assessment Points tool

Once the points are created, they must be compared to the values in the modeled layer. The Update Accuracy Assessment Points tool does this automatically, and my points were matched up to corresponding values from the seascape layer. Finally, the final step was to run the resulting shapefile through the “Calculate Confusion Matrix” tool.


The resulting confusion matrix is displayed below:

ClassValue C_0 C_1 Total U_Accuracy Kappa
C_0 9 79 88 0.102272727  
C_1 89 276 365 0.756164384  
Total 98 355 453  
P_Accuracy 0.091836735 0.777464789 0.629139073  
Kappa         -0.135711

The two rasters measured in the matrix did not compare well at all, resulting in a negative Kappa value. Kappa values measure overall agreement between two measures, with 1 being perfect agreement. A negative score indicate lower agreement than would be found by chance. The interpolated model performed especially poorly measuring absent values, only agreeing with the seascape model on 9 pixels. While the two rasters agreed on a vast majority of the areas where rockfish may have been found, the irregularities on the absence areas drove the matrix to failure. The large number of both false positive and false negative readings shows that this was not a fundamental disagreement based on scale or resolution of the models – the two methods of measuring rockfish abundance fundamentally disagreed on where they should be found, to the point where they cannot be reconciled.

Critique of Method

Overall, I think a confusion matrix is a good way to compare modeled and true (or ground-truthed) measured information. However, in my case, both of my informational rasters are the results of modeled analyses, each with their own set of assumptions and flaws. The interpolation, for example, is at the mercy of the settings used when executing the Kriging: how many points shall be measured per pixel? Is it a fixed sampling method, or variable, spatially based? And the seascape raster is at the mercy of the neighborhood analysis, which was only conducted on 4, hand-picked points, which introduce all kinds of sampling and spatial biases. All of this, coupled with the fact that both rasters had to be distilled into presence-absence data, which eliminates much of the resolution that makes Kriging so great, results in a heavily flawed methodology that I do not believe truly represents the data. This framework which I’ve used, however, has the potential to be modified, standardized, and rerun in order to address many of these issues. But for now, I feel as though the confusion matrix is a helpful tool, just not for two sets of modeled data.

Ex 3: Relating distance between wells to communication type and fault presence

Filed under: 2019,Exercise 3 @ 1:46 pm
Tags: , ,

Question that I asked:

Is there a relationship between the distance between wells, their communication status based on a pumping interference test, and whether or not they are separated by a fault?

Name of the tool or approach used:

Polyline creation and classification in ArcGIS Pro, boxplot creation and two-sided t-tests in R.


29 wells in my study area were evaluated by the Oregon Water Resources Department during pumping interference tests in 2018 and 2019. This test involves pumping one well, and seeing whether the water levels in nearby wells drop in response. I received a verbal account of the wells that did and did not communicate, sketched it on a map, and then transferred that information to ArcIS Pro. I drew polylines using the well locations as snapped endpoints. Then, I created and populated fields containing the well communication type (“communicate” for wells that respond to pumping at a nearby well, and “does not communicate” for wells that do not) and whether or not the path between two wells crosses a fault. Shape_Length in feet was automatically calculated when I created the polylines, on account of the projection I used for the shapefile.

I exported that data table to a csv and imported it in R, where I subset it into three categories: all paths, paths between wells that communicate, and paths between wells that do not communicate. I then created box plots and ran t-tests to see differences between means and distributions of path length based on communication type or fault length.


Comparing the path length and the communication type of all 29 wells involved in the communication test, there is not significant evidence of a difference in mean path length between wells that do and do not communicate because the p-value of a two-sided t-test was 0.152. While the mean distance between wells that do not communicate is larger than the mean distance between wells that do communicate, the overlapping interquartile ranges in both categories make this difference less significant. There is not clear evidence that distance plays a role in well communication.

There is some evidence for a difference in mean path lengths between wells that do and do not cross faults, based on a p-value of 0.047 in a two-sided t-test. The mean path length that crosses a fault is 5,139 ft, while the mean path length that does not cross a fault is 3,608 ft. Wells that are closer together are less likely to be separated by a fault.

For wells that do communicate, there is evidence of a difference between the mean path lengths that cross faults and the mean path lengths that do not cross faults. The p-value for a two-sided t-test was 0.024. Wells that communicate but are not separated by a fault are more likely to be closer together than wells that are separated by a fault.

For wells that do not communicate, there is no evidence of a difference in mean path lengths between paths that do and do not cross faults, given a p-value of 0.98 in a two-sided t-test. Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for wells separated by a fault that do not communicate.


Summary of results:

Wells that communicate in pumping tests do not have a significantly different mean distance between them than wells that do not communicate (p = 0.152)

Wells that are closer together are less likely to be separated by a fault. (p = 0.047)

Wells that communicate but are not separated by a fault are more likely to be closer together than communicating wells that are separated by a fault. (p = 0.024)

Wells that do not communicate are likely to be separated by the same mean distance whether or not they are separated by faults, although there is a larger range of path length values for non-communicating wells separated by a fault. (p = 0.98)

Critique: I wish I had more sample points and paths to work with, so I could use a more interesting analysis such as ANOVA.

May 23, 2019

Estimates of connectivity to critical infrastructure in Seaside, OR following a rupture of the Cascadia Subduction Zone


For exercise 3, I evaluated the connectivity of each building within Seaside, OR, to critical infrastructure following a rupture of the Cascadia Subduction Zone. The probability of connectivity for each building was determined using networks and considered the following:

  • Electric Power Network (EPN): probability that each building has electricity.
  • Transportation: probability that each building can reach the hospital or fire stations via the road network.
  • Water Supply Network: probability that each building has access to running water.

The connectivity analysis was deaggregated by hazard as well as the intensity of the event.

Tool and approach

For this exercise, I used: (1) a probabilistic earthquake/tsunami damage model to evaluate the functionality of linkages; (2) the network analysis package python-igraph to evaluate the connectivity of each tax lot to critical infrastructure; and (3) QGIS for spatial visualization of the results.

Description of steps

Networks were created to represent the connectivity of the three infrastructure components (Figure 1). A network consists of nodes connected to each other through edges. When edges are removed, a connectivity analysis can be performed to determine whether there is any path from one node to any other specific node. A disconnection in the network results in two (or more) separate networks.

Here, the earthquake and tsunami hazards cause damages to edges which are removed from the network if deemed nonfunctional. A connectivity analysis between each tax lot and critical infrastructure was performed, and each tax lot was triggered with a binary yes/no for connectivity. A Monte-Carlo approach with 10,000 simulations was implemented to determine the probability of each tax lot being connected to critical infrastructure. The resulting probabilities were then added as attributes to GIS shapefiles in order to evaluate the spatial distribution of connectivity.

Figure 1: GIS and representative networks for each infrastructure component


Description of results

Characteristics of the network can be described by a degree distribution. In a network, the degree of a node is the number of immediate connections that the node has to other nodes (e.g. a node connected to 3 other nodes has a degree of 3). A histogram of the degrees can be generated to describe the overall distribution of the entire network. The degree distribution for the three infrastructure components are shown in Figure 2. It can be seen that in the EPN network, most nodes are connected to two other nodes. This is likewise apparent in the network of Figure 1, as the EPN network appears more “linear” compared to the transportation and water networks. The transportation and water networks exhibit similar characteristics to each other in that the majority of nodes have a degree of three.

Figure 2: Degree distribution for each infrastructure component

Using the results from the Monte-Carlo network analysis, maps were created to show the spatial variability of connectivity. The connectivity was deaggregated by both hazard and intensity of the event, as deaggregation provides an avenue for smart mitigation planning. Although similar maps were produced for all three networked infrastructure, for brevity, only the spatial distribution of the transportation network is shown (Figure 3). The maps show the probability of each tax lot becoming disconnected from the fire stations (2 in Seaside) and hospital (1 in Seaside) via the road network. It can be observed that the tsunami hazard results in significant damage to the transportation system relative to the earthquake hazard. The result of bridge failures caused by the tsunami can be observed for intensities larger than the 500-year event. The region west of the Necanicum River becomes completely disconnected from the fire stations and hospital which are located east of the river.

In addition to the spatial deaggregation, the aggregated results provide a comprehensive overview of the connectivity. Figure 4 shows the average fraction of tax lots disconnected from critical infrastructure across all of Seaside. The three networked infrastructure systems approach complete disconnectivity for hazard intensities larger than the 1000-year event. The transportation and water networks are dominated by the tsunami for the higher magnitude events; whereas the EPN see’s significant damage from both the tsunami and earthquake. Consequently, if resource managers are planning for high magnitude events, they should invest in tsunami damage mitigation measures.

Figure 3: Spatial distribution of connectivity to fire station and hospitals

Figure 4: Fraction of tax lots connected to critical infrastructure



Network analysis provides a means to evaluate the connectivity of tax lots to critical infrastructure, and incorporating probabilistic methods accounts for uncertainties as opposed to a deterministic approach. While this type of analysis can be useful to determine overall connectivity, it does not account for limitations and additional stresses in the “flow” of the network. For example, damage to the transportation network would result in additional travel times to the fire stations and hospital. In order to provide a more comprehensive analysis of the impact to networked infrastructure, both connectivity and flow should be considered.

May 22, 2019

Comparing the Spatial Patterns of Remnant Infected Hemlocks, Regeneration Infected Hemlocks, and Non-Hosts of Western Hemlock Dwarf Mistletoe


For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

For Exercise 2, I wanted to know how the spatial pattern of these clustered infected and uninfected trees were related to the spatial pattern of fire refugia outlined in my study site. I used Geographically weighted Regression to determine the significance of this relationship, however I did not find a significant relationship between a western hemlock, its intensity of clustering and infection status, and it’s distance to its nearest fire refugia polygon.

This result led to the realization that the polygons as they were drawn on the map were not as relevant as the actual “functional refugia”. I hypothesized that, after the 1892 fire, the only way for western hemlock dwarf mistletoe to spread back into the stand would be from the trees that survived that fire, or “remnant” trees. These would then infect the “regeneration” tree that came after the 1892 fire. The functional refugia I am interested in are defined by the location of the remnant western hemlocks. I also hypothesized that the spatial pattern of non-susceptible host tree (trees that were not western hemlocks) would play a role in the distribution of the mistletoe.

Question Asked

How are the spatial patterns of remnant western hemlocks related to the spatial patterns of regeneration western hemlocks, uninfected western hemlocks, and non-western hemlock tree species, and how are these relationships related to the spread of western hemlock dwarf mistletoe in the stand?

The Kcross and Jcross Functions

The cross-type functions (also referred to as multi-type functions) are tools capable of comparing the spatial patterns of two different type events (type i and j) in a similar spatial window, of some point process, X. It does this by assigning labels to the events differentiating the type and summarizing the number or distance, of and between events, at differing spatial scales, or radius circles (r).

The statistic Kij (r) , summarizes the number of type j events, around a type i event at a distance of r, or a point process X. Deviations of the observed Kij(r) curve from the the Poisson curve, or if type j events are truly randomly distributed, indicates dependence of type j events on type i events. Similar results can be obtained from the regular Ripley’s K: deviations above the curve indicate clustering and deviations below indicate dispersal.

(Incredibly helpful and interactive explanation link:

The statistic Jij(r) = (1 – Gij(r))/(1-Fj(r)) summarizes the shortest distance between a type i and j event and compares it to the empty space function of type j event. This is another test for inferring independence or dependence of type j events to type i. Deviations of the Jij(r) curve the value of 1, indicate levels of dependence of the events to each-other. Specific deviations from 1 can be hard to interpret without an understanding of the Fj(r) function so imagining it stationary in the ratio makes it easier. As Gij(r) increases, the numerator shrinks, creating a smaller Jij(r) statistic. Deviations below 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j increases. As Gij(r) decreases, the numerator grows, creating a larger Jij(r) statistic. Deviations above 1 indicate that type i and j events are dependent and that as r increases, the shortest distance between points of type i and j decreases.

Methods Overview

In R, the package “spatstat” provides a suite of spatial statistic functions including the cross-type functions. In order to use these you need to create a “point pattern process” object. These objects incorporate X and Y coordinates, and a frame of reference, or a “window,” and give spatial context top a list of values. Then marks are applied to these points that create the necessary multi-type point pattern process object. These marks serve to distinguish the type i and type j events described earlier in your analysis. Then running the “Kcross()” or “Jcross()” functions with the specified type events produces a graph that you can interpret, very similar to producing the normal Ripley’s K plot.

  • I took my X – Y coordinates of all trees on the stand and added a column called “Status” to serve as my mark for the point pattern analysis.
    1. The four statuses were “Remnant,” “Regen,” “Uninfected,” and “NonHost” to identify my populations of interest.
      1. I had access to tree cores, so I identified trees that were older than 170 yrs old and these trees’ diameters served as my cutoff for the “Remnant” diameter class.
      2. All trees DBH > 39.8 cm.
    2. Doing this in ArcMap removed steps I would have to have taken when I migrated the dataset to R.
    3. I removed all the dead trees because I wasn’t concerned with them for my analysis.
  • I exported this attribute table to a csv and loaded it into R Studio.
  • I created the boundary window of my study site using the “owin()” function, and the corner points from my study site polygon.
  • The function “ppp()” creates the point pattern object and I assigned the marks to the data set using the “Status” column I created in ArcMap
    1. It’s important your marks are factors otherwise it is not converted into a multi-type point pattern object.
  • The last step is running the “Kcross()” and “Jcross” to compare the “Remnant” population to the “Regen,” “Uninfected,” and “NonHost” populations.
    1. This produced 6 plots, 3 of each type of cross-type analysis.
    2. Compare these easily using the “par()” function, for example:

par(mfrow = c(1,3))




This produces the three plots in a single row and three columns.


Because I am assuming the remnant, infected western hemlock trees are one of the main factors for the spread of western hemlock dwarf mistletoe and that they are the center of  new infection centers on the study site, I did all my analysis centered on the remnant trees (points with status = “Remnant” treated as event type i).

1)   i = Remnant, j = Regen

The first analysis between remnant and regeneration trees demonstrate that there is dependence on the two events to each other. At fairly small distances, or values of r, infected western hemlocks that have regenerated after the 1892 fire cluster around infected remnant western hemlocks that survived the 1892 fire. This stands to reason because we assume that infected trees will be near other infected trees, and that infection centers start usually with a “mother tree.” In this case the remnant trees serve as the start of the new infection centers. The Jcross output also shows me that the two types of trees are clustered using the frequency of the shortest distances. After ~8 meters the two tree types exhibit definite clustering. In terms of the function, the Gij(r) in the numerator of the Jij(r) function is approaching 1, or the highest frequency of very short distances.

2)   i = Remnant, j = Uninfected

The Kcross plot from the second set of analyses between remnant and uninfected trees demonstrates that there is independence between the two events up to ~15 meters. After that, the trees exhibit slight clustering effects. The lack of dispersal tendencies is strange for these two types of trees because we expect uninfected trees to be furthest away from the center of infection centers. The presence of clustering may be indicative of the small spatial scale of my study site. It may also be that the size of the infection centers are only about 15 meters (if we assume that remnant trees are the center). The Jcross plot shows something similar: at small distances the types of trees seem independent and then around 8 meters they exhibit clustering.

3)   i = Remnant, j = NonHost

The Kcross from the last set of analyses between the remnant trees and the non-hosts demonstrates a similar pattern exhibited by the regeneration trees. After about 4 meters, the trees tend to be clustered. This is an interesting find because if the non-hosts cluster to remnant trees but uninfected trees are independent, then the non-hosts may be playing a role in this. The Jcross plot shows the same: the two types of trees exhibit clustering.

4)   Comparing Kcross Functions with eval.fv()

A useful way to compare patterns of Kcross functions is using the eval.fv function. The titles of each plot tell which Kcross was subtracted from which; note the difference in scales. The first plot shows that the regenerating trees’ spatial pattern as related to remnant trees is very different from the uninfected trees’ pattern. The regenerating trees’ spatial pattern is much more similar to the non-hosts’ spatial pattern at short distances, until about 15 meters. Then the patterns differ with the regenerating trees exhibiting more of a clustering tendency. However the scale is much smaller than the other two graphs. Lastly, the third plot shows the difference between the non-host trees’ spatial pattern and the uninfected trees’ spatial pattern. There appears to be a stepwise relationship where, at very near and very far distances the non-host trees are much more clustered, but at moderate distances the differences may be less dramatic.

Critique of Cross-Type Functions

The amount of easily interpretable literature on the spatstat package as a whole is sparse, although a wealth of very technical information exists. The function was easy to use and execute though and so was the process of creating the point pattern object. These two functions can clearly show how the spatial patterns of the two types of events change with scale. It would be helpful if there was a way to compare three or more types of events. The last drawback is that there is a lack of specific information for each point on your map or study site. This pattern that is generalizable to a whole set of points may not be as useful when trying to put together a story, such as the story of a stand’s development through time.

Additional Raster Analysis

The last critique of the cross-type functions led me to attempt a visualization of these patterns on my stand. Very briefly, I determined densities of the infected, uninfected, and the non-host trees using the Kernel Density function in ArcMap. Then I classified these densities using natural breaks and coded these for raster addition. After adding all three density rasters together, I coded each unique density classification combination to tell me how the densities of the populations appeared in the study site.

It appears that there are distinct patches of high density separated by areas of low density. On the eastern side of my study site, it appears that the high density areas of infected trees cluster with the remnant infected trees. An interesting interaction is occurring between the high density patches of uninfected trees and infected trees in the western portion of my study site. The mechanism for the seemingly clear divide may be the non-TSHE trees.

May 20, 2019

Blackleg classification by segmentation error analysis and visualization

Filed under: 2019,Exercise 3 @ 9:45 pm


Once again, I am using the confusion matrix to determine the accuracy of segmented classification based on the ground truth of manual classification. This work piggybacks off of exercise two where I also used a confusion matrix for determining accuracy. In my previous work I critiqued the ability to draw any relevant conclusions based on a sample size of one. Here I have included a sample size of five images which have gone through the entire image processing, segmentation, manual classification and confusion matrix. This allows us to make more relevant comparisons and looks for trends in the data. Additionally, I mentioned the confusion matrix was inaccurate because it was performed on a rectangular region that was simply the extent of the image and not just the leaf. This has been corrected for and now conducts a confusion matrix on the entire leaf and no outside area. Lastly, I have adjusted the segmentation settings to increase accuracy based on previous findings in exercise one and two.


  • How accurate is the segmentation classification?
  • Are there commonalities between the false positives and false negatives?
  • Are more raster cells being classified as false negatives or false positives when misclassified?
  • What is going undetected in segmentation?
  • What is most easily detected in segmentation?

Tools and Methods

Many of the steps are repeated from exercise two. I will repeat them here in a completer and more coherent format with the new steps found in the process.

ArcGIS Pro

Begin by navigating to the Training Sample Manager. To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it “img_X_leaf” and supply a value (the image #) and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the entire leaf (Images 1-5). Save this training sample and the classification scheme. Save ArcGIS Pro and exit. Reopen ArcGIS Pro and go to Add data in the Layer group on the Map tab to upload the training sample polygon of the leaf. This process must be repeated for each leaf.

The polygon of the leaf will need to be converted to a raster. Click the Analysis tab and find the Geoprocessing group to Tools once again. Search Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygons layer for Input Features and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize, which I changed to 1 to maintain the cell size in my original raster. Select Run in the bottom right corner of the pane. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value and Count which can be found by going to the Attribute Table for that layer or clicking on a group of pixels. This also must be completed for each of the polygons created for each leaf outline.

Now we should have three layers for each leaf in which we want to export as Tiff files. We have our “leaf” that we just created, our “manually classified” and “segmented” layers. To export these three, you need to go to each layer and right click. Start with the “leaf” image and go to Data and Export Raster and a pane will appear on the right. Zoom into the image so there isn’t much area above or below the leaf. Select Clipping Geometry and click on Current Display Extent. This will export the image as a Tiff file which we will use later. Continue this process with the other two images without changing the extent. Each image should be exported after this step with the same extent resulting in the same number of pixels and overlap between pixels if placed overtop one another. This can be confirmed in R in the follow steps.


Open R studio and use the following annotated code below:

required packages for rasters
rm(list=ls()) ##clears variables
##raster upload
img_10_truth <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_truth.tif”)
img_10_predict <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_predict.tif”)
img_10_leaf <- raster(“E:/Exercise1_Geog566/MyProject3/img_10_leaf_shape.tif”)
##view the raster
img_10_truth ##confirm dimensions, extent, etc. They need to be the same or very close for a confusion matrix
plot(img_10_leaf) ##view the images
##export data to excel
img_10_leaf_table <-, xy=T) ##creates tables
img_10_predict_table <-, xy=T)
img_10_truth_table <-, xy=T)
write.table(img_10_leaf_table, file = “img_10_leaf.csv”, sep = “,”)
write.table(img_10_predict_table, file = “img_10_predict.csv”, sep = “,”)
write.table(img_10_truth_table, file = “img_10_truth.csv”, sep = “,”)



At this point the three Tiff files should have been uploaded to R as raster’s and undergone a quality check by viewing the images and the data using the supplied code above. Next, it should have been converted to a table for each file and then all exported out as csv files. From here, open all three files into excel. Begin with the “Truth” and “Predict” files and select the entire column that is the farthest to the left, the values associated with each cell. Go to the Home tab and the Editing group and click on the drop down for Sort and Filter and select the A to Z. Change everything from whatever value they are, to 1. This can easily be accomplished by using the Find and Replace function in the Editing group. Replace all the NA’s with 0’s and then go to left column which simply lists each pixel number and sort the column from A to Z so its back to the order it was when originally opened into excel. Once this is completed for both the “Predict” and “Truth” files, copy the right most column of values and paste them into the Leaf” excel file. Use the Sort function to sort the right column from A to Z on the leaf values column and allow for it to expand the selection. It should be a list of 0’s and if you scroll down you will eventually begin seeing NA’s in the leaf value column. Delete everything below this point in the “Truth” and “Predict” value columns. Scroll to the top and delete the first four columns that are the numbers, x and y coordinates and the leaf values column. You should be left with two columns that are the “Truth” and “Predict” values with 1’s and 0’s only. Give each column a label (Truth & Predict), save this document as an xlsx file and exit out.

This excel step helped us overcome the issue of having values that weren’t in a 0’s and 1’s format needed for the confusion matrix. Here, 1’s stand for diseased patch and 0’s stand for non-diseased. What we also did was removed all the values that were outside of the leaf area. So, the only values we are working with are those which are in the leaf and not part of the background. The last step is importing the excel file that was just created into R to perform the confusion matrix.


##confusion matrix
img_10_confusion$Predict <- as.factor(img_10_confusion$Predict) ##convert to factors
img_10_confusion$Truth <- as.factor(img_10_confusion$Truth)
confusionMatrix(img_10_confusion$Predict, img_10_confusion$Truth)


You should now have R output of the confusion matrix and associated statistics.

ArcGIS Pro

For the purpose of visualization, I also changed the colors of the three Tiff files. To do this, I simply right clicked on each of the three layers and selected Symbology. The pane on the right side of the screen appears where you have options to change the Color Scheme. I made the leaf layer green, the segmented regions white and manually classified regions white. The snipping tool was used to extract these images.


Between the five images there were 15,519 pixels classified in the confusion matrix. 14,884 were properly classified as non-diseased. 282 were classified as true disease. 307 were identified as false negatives and 46 were classified as false positives. Overall the segmentation classification had an accuracy of 97.7% with a p-value less than 0.05 indicating statistical significance. The model had a sensitivity of 99.69% and a specificity of 47.88%.

Accuracy between the five images ranged from 94% to 99%, all with significant p-values. The sensitivity for all five individual matrices and the combined matrix was at 99% with no exceptions. Specificity had a fairly large range.

I was happy with the level of accuracy produced throughout all five images and the overall cumulative matrix. The five images ranged a bit in complexity with the number of lesions present, leaf size and shape, size of lesions, etc. Looking at the confusion matrix we can see there were many more false negatives than false positives. This indicates the model is possibly not quite sensitive enough to detect the all regions or certain parts of the disease. The specificity was not consistent between the five images for some reason which is still unclear to me. As mentioned, although the model may have some issues with detection certain patches, the sensitivity indicated throughout all matrices is very high.

Viewing the images below we can see where the false positives appear and where the false negatives occur (Images 1-5). It seems that both the false positives and negatives tend to appear on the margins of the lesion when they at least detected. Additionally, larger patches seem to have a higher rate of detection than those which are small in size. This is especially true in Image 1&2.


My only critique is the process to get to this step. I was very satisfied with my results but see logistical problems in getting through large amounts of samples. I will attempt to find shortcuts throughout the process my next go around. I will also seek creating a model in ArcGIS Pro for some of the rudimentary tasks done in the processing of the images.


Table 1. Confusion matrix of all five samples combined. Values are number of pixels classified out of 15,519 total.


Figure 1. Classification error for the combined matrix. The values are the number of pixels classified out of 15,519 total.


Image 1. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 2. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 3. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 4. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.


Image 5. Classified leaf is green with black manually classified diseased patches and white overlapping segmented pixels. R output is also included for the confusion matrix.

May 16, 2019

Possible Influence of ENSO Index on Dolphin Sighting Latitudes

Exercise 2

Question Asked: Are latitudinal differences in dolphin sightings in the San Diego, CA survey area related to El Niño Southern Oscillation (ENSO) index values on a monthly temporal scale?

  1. My previous question for Exercise 1 was: do the number of dolphin sightings in the San Diego, CA survey region differ latitudinally? I was finally able to answer this question with a histogram of sighting count by latitudinal difference. As you can see from below, there is an unequal distribution of sightings at different latitudes. Because I had visual confirmation of differences at least when all sightings are binned (in terms of all years from 1981-2015 treated the same), I looked for what process could be affecting these differences in latitude.

    Comparing the Latitudes with the frequency of dolphin sightings in San Diego, CA

ENSO is a large-scale climate phenomena where the climate modes periodically fluctuate (Sprogis et al. 2018). The climate variability produced by ENSO affects physical oceanic and coastal conditions that can both directly and indirectly influence ecological and biological processes. ENSO can alter food webs because climate changes may impact animal physiology, specifically metabolism. This creates further trophic impacts on predator-prey dynamics, often because of prey availability (Barber and Chavez 1983). During the surveys of bottlenose dolphins in California, multiple ENSO cycles have caused widespread changes in the California Current Ecosystem (CCE), such as the squid fishery collapse (Nezlin, Hamner, and Zeidberg 2002). With this knowledge, I wanted to see if the frequency of dolphin sightings in different latitudes of the most-consistently studied area was driven by ENSO.


Primarily R Studio, some ArcMap 10.6 and Excel

Step by Step:

  1. 1.For this portion of the analysis, I exported my table of latitudinal differences within my attribute table for dolphin sightings from ArcMap 10.6. I saved this as a .csv and imported it into R Studio.
  2. Some of the sighting data needed to be changed because R didn’t recognize the dates as dates, rather as factors. This is important in order to join ENSO data by month and year.
  3. Meanwhile, I found NOAA data on a publicly-sourced website that had months as the columns and years as the rows for a matching ENSO index value of either: 1, 0, or -1 for each month/year combination. A value of 1 is a positive (warm) year, a value of 0 is a neutral year, and a value of -1 is a negative (cold) year. This is a broad-value, because indices range from 1 to -1. But, to simplify my question this was the most logical first step.
  4. I had to convert the NOAA data into two-column data with the date in one column by MM/YYYY and then the Index value in the other column. After multiple attempts in R studio, I hand-corrected them in Excel. Then, imported this data into R studio.
  5. I was then able to tell R to match the sighting date’s month and year to the ENSO data’s month and year, and assign the respective ENSO value. Then I assigned the ENSO values as factors.
  6. I created a boxplot to visualize if there were differences in distributions of latitudinal differences and ENSO index. (See figure)Illustrating the number of sightings grouped by ENSO index values (1, 0, and -1).
  7. Then I ran an ANOVA to see if there was a reportable, strong difference in sighting latitudinal difference and ENSO index value.


    The result of this, was a p-value of <2e-16, providing very strong evidence to reject the null of hypothesis of no difference. I followed up with a Tukey HSD and found that there is strong evidence for differences between both the 0 and -1, -1 and 1, and 1 and 0 values. Therefore, the different ENSO indices on a monthly scale are significantly contributing to the differences in sighting latitudes in the San Diego study area.

Tukey HSD output:

diff               lwr                        upr           p adj

0–1 0.01161047 0.004250827 0.01897011 0.0006422

1–1 0.04101170 0.030844193 0.05117920 0.0000000

1-0 02940123 0.020689737 0.03811272 0.0000000

 Critique of the Method(s):

These methods worked very well for visualization and finally solidifying that there was a difference on sighting latitude related to ENSO index value on a broad level. Data transformation and clean-up was challenging in R, and took much longer than I’d expected.



Barber, Richard T., and Francisco P. Chavez. 1983. “Biological Consequences of El Niño.” Science 222 (4629): 1203–10.

Sprogis, Kate R., Fredrik Christiansen, Moritz Wandres, and Lars Bejder. 2018. “El Niño Southern Oscillation Influences the Abundance and Movements of a Marine Top Predator in Coastal Waters.” Global Change Biology 24 (3): 1085–96.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki
May 13, 2019

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

  1. Question Asked

Here I asked if there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and if so what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they related to various demographic variables like: Percent of county which is aged 0-4, percent of county which is aged 80+, percent of county which is female, median age, percentage of county residents who have completed high school, median county income, percent of county who is born outside the US, percent of county who speaks a language other than English at home, percentage of county estimated to be in poverty, and percent of children in a county estimated to be in poverty.

To answer these questions, I used the same data outlined in the first exercise blog post with newer demographic variables being taken from the American Community Survey and published on AmericanFactFinder which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to the year 2009 is unavailable which shortened the window of analysis by a year.

  1. Names of analytical tools/approaches used

The methods used to answer these questions was first to create an exhaustive general linear model where county Salmonella rates were a function of the above listed demographic variables. A Ljung-Box Test was used to assess if there was evidence of non-zero autocorrelation of residuals of the model at various time lags. Following this, an ideal linear model was selected using a stepwise AIC selection process and then different variance structures were compared by AIC, BIC, and log Likelihood metrics as well as ANOVA testing. Following the selection of an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model which allowed for variable time interaction. A version of this model would later be used in geographically weighted regression analysis. I performed geographically weighted regression allowing the coefficients to vary across space in Oregon.

  1. Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because this is longitudinal data measured over time I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to evaluate for temporal autocorrelation as well as tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those normality assumption tests. A simple backward step-wise AIC approach was used on my full model to identify and remove extraneous variables. This worked by removing variables in the full model in a stepwise fashion, comparing the AIC values between the two models, and continuing this process until the AIC values between the two models being compared are not significantly different. I then used this model to select an ideal variance structure to compare Salmonella rates autocorrelated at different time lags. The types of variance compared were: Independent variance, compound symmetric, AR1, MA1, ARMA (1,1), unstructured, and allowing for different variance structures across time. After AIC, BIC, log Likelihood, and ANOVA testing an ideal variance structure was determined and the model using this variance structure was evaluated for basic interaction with time. All variables present in the model were allowed to have time interaction including time itself (i.e. allowing for a quadratic time trend). Once again AIC, BIC, log Likelihood, and ANOVA testing were used to select the most ideal model.

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

  1. Brief Description of Results

Every panel represents one of Oregon’s 36 counties and panels containing an error did not have any cases of reported Salmonella. Some counties are seen decreasing with time, others show slightly increasing trends, and others show a fairly level rate over time. Clearly there is evidence of some time trend for some counties.

Results of normality testing found that age- adjusted rates of reported Salmonella in Oregon were not normally distributed, for the ease of visualization and as an attempt to address the failure to meet the assumption of normality in linear modeling I log-transformed the rates of reported Salmonella. Results of the Ljung-Box test with my full model provided evidence of non-zero time autocorrelation in the data and a visual inspection of the lattice plot supports this with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage +year

Covariance structure comparison:

Covariance Model logLikelihood AIC BIC
Independent -431 874 896
Compound Symmetry -423 860 887
AR(1) -427 869 895
MA(1) -427 869 895
ARMA(1,1) -423 862 892
Unstructured -387 859 1017
Compound Symmetry Different Variance Across Time -412 854 910


Mostly AIC, BIC, and Log Likelihood values were clustered together for the different models. I decided to base my choice primarily on the lowest AIC because that’s how I did variable selection to this point. This resulted in me choosing a compound symmetric model which allowed for different variances across time.

Next, I built models which allowed for simple interaction with time meaning that any three-way interaction with time was not evaluated for. Subsequent ANOVA testing comparing the different interaction models to each other, to a model where no interaction was present, and a model where time was absent were used in my selection of a final model.

Final Model: 

logsal ~ %female + %childpov + medianage + year +(medianage*year)

This model follows a 5×5 compound symmetric correlation variance model which allows for variance to change over time.

Code: interact_m <- gls(logsal ~ female + childpov + medianage + year +(medianage*year), na.action=na.omit, correlation= corCompSymm(form= ~1|county),weights=varIdent(form=~1|year),data=alldata)
Within County Standard Error  (95% CI): 0.92
Estimate Name Estimate (log-scale) Std. Error p-value
Intercept -759.42 237.79 0.002
% Female 18.06 4.46 <0.001
% Child Poverty 0.03 3.27 0.001
Median Age 17.16 3.11 0.002
Year 0.38 3.18 0.002
Median Age*Year -0.01 -3.12 0.002


Estimates are on the log scale making them difficult to interpret without exponentiation, however it can be seen that a percent change in the number of females in a county or a year change in median age are associated with much larger changes in rates of reported Salmonella incidence compared to changes in percent of child poverty and the year. Overall, incidence rates of reported Salmonella were shown to increase with time, county percentage females, county percentage of child poverty, and county median age with a significant protective interaction effect between time and median age.

For my GWR analysis I used a function derived from the website and looks like:

regfun <- function(x) {
dat <- alldata[alldata$county == x, ]
m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data=dat)

As can be seen, I retained the same significant variables found in the OLS regression for my time series analysis. GWR in this case allows for the coefficient estimates to vary by county.

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.

For % Female

For % Child Poverty

For Median Age

For Year

For Median Age * Year

For the most part, county % female, median age, year, and the interaction term clustered close to 0 for most counties. Some counties were showed highly positive/negative coefficient estimates though no consistently high/low counties could be identified. The maps for the coefficients of median age and year are very similar though I do not have a clear idea why this is the case. The map of the coefficients of child poverty showed the most varied distribution across space. Autocorrelation analysis using Moran’s I of the residuals from this GWR model did not find any evidence of significant autocorrelation. I could not find evidence of a significant non-random spatial pattern for the residuals of my model.

  1. Critique of Methods Used

While the temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation present in the data and prior univariate spatial autocorrelation provided limited evidence of variables being spatially autocorrelated at different spatial lags, I was unable to perform cross correlation analysis between my outcome and predictor variables. One important note: I do plan on performing this analysis I just need to figure out how the “ncf” package works in R. This is one of the more glaring shortcomings of my analysis so far is that I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective with my choice of model selection criteria being AIC. Basing my decision on other criteria would likely change my covariance structure. A similar argument could be said for my choice of variable selection being based on a backwards stepwise AIC approach where other selection criteria/methods would likely have different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates. Rather it shows the demographic characteristics associated with higher reported rates. While this information is useful it does not identify any direct drivers of disease incidence. Further analysis will be required to see if these populations have different exposures or severity of risky exposures.

Determining accuracy of segmentation versus ground truth disease patches of Blackleg on Turnip

Filed under: 2019,Exercise 2 @ 11:17 am

My research involves the classification of a leaf spots on turnip derived from the pathogen blackleg. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves. Here I focus on the accuracy of previously determined disease patches through segmentation and ground truth classification. To do this a confusion matrix is used and allows for detection of true positives, true negatives, false positives and false negatives. All of the image processing took place in ArcGIS Pro. The next step involved the accuracy assessment which was conducted in R.

How accurate is the computer image classification?
• Can the accuracy be quantified?
• How can the image classification error be visually represented?

Tools and Methodology
I began in ArcGIS Pro to help visually represent the false negative and false positive regions of the disease patches through the segmentation process of classification. To start I turned on both layers where you can see the overlap between the two classification methods in Image 1.

I then went to the symbology of the segmented image and changed the color to white. I placed this layer on top, so it covered all the raster cells of the manually classified patches leaving only the false negatives, seen in Image 2. Next, I went back into the symbology, but for the manually classified image and changed the color scheme to white. I moved this layer to the top and changed the segmented image color scheme back to unique values. I was left with Image 3 showing the false positives. This was an easy way to visualize these disease patches and how well the classification method was working.

Next, I exported a Tiff file of both the manually classified patches and the segmented patches. To ensure each cell between the two layers lined up in R I had to make sure the extent was the same for both layers when I exported. To do this I right clicked on the layer and went down to data and selected export raster. A pane appears in right side of ArcGIS Pro where I hit the drop-down arrow for clipping geometry and selected current display. I did this with both layers and had one Tiff file for computer classified image through segmentation and one for the manually classified disease patches.

Using the raster, rgdal and sp packages I was able to upload my two Tiff files in raster format to R. I gave the two files each a name and used the plot function to view the two images. I noticed they both had values associated with each patch which were on a gradient scale. To correct for this, I converted my two raster layers in R to tables. This provided a coordinate for each cell and the value associated with it. In the image segmentation raster table I had 0 to 2 and the manually classified image I had 1 to 6. For all the white space I was given ‘N/A’, which was another issue. I used the xlsx package to export my data tables to excel files. I opened the two files in excel and used the sort smallest to largest function. From here I was able to use the replace function and change all the ‘N/A’ values to 0’s and all the values associated with pixels to 1’s. The values were arbitrary associated with the pixels and I needed the raster in 1’s and 0’s format. After doing this with both excel sheets I copy and pasted the two side by side and deleted all the associated coordinates. These were also unnecessary because each pixel from the same coordinate between the layers were in the same row. I saved this excel file and uploaded it into R. I downloaded the caret package and performed a confusion matrix which can be seen below in Table 1.

The visual representation for my false positive and false negative results can be seen below in Image 2 & 3 with Image 1 for comparison. You can see the false negatives for disease covers a much larger area than does the false positives. This may imply that the segmentation is limited in its assessment of diseased area. What it tends to miss is the margins of the disease but does a fair job of predicting the center of the disease where it likely originated and is most severe. To correct this, setting a larger threshold may allow for less severe regions of the disease to be classified. Because the segmentation is based on pixel reflectance value at the red band, this would mean the threshold value needs to be slightly lowered.

Additionally, an entire patch of disease was missed which can be seen in the right corner of Image 1. Currently, the classification system is set to only create segmented patches of 10 or more pixels. This patch is 9 pixels and therefore just missed the cutoff. Even though it was just shy of this requirement, we are unsure if the segmentation would have detected a difference in this diseased patch or if it was also out of the threshold for classification. If it is common for disease patches to be this small, it may be an indicator to lower the value to 5 or 6 for what it is allowed in the segmentation blocking.

There are multiple steps in the processing of the image where different routes could have been taken and potentially increased classification accuracy. The objective of this classification method is to have as little percent error as possible or at least determine if you’d rather have more false positive than false negatives, vice versa or equal. Here we have greater percent of cells which are false negatives and modest assessment of diseased pixels when simply visualizing the images.

To help quantify these images and determine how accurate the model was, a confusion matrix was used in R, seen in Table 1. The segmented classification correctly identified non-diseased regions very well and did a pretty good overall job of predicting disease that was confirmed with ground truth. There were 2075 true positives, 55 false negatives, 41 true negatives, and 12 false positives. The model correctly identified 97.1% of the 2183 total cells. The precision of the model was 42.7%. The sensitivity of the model was 77.4% and the specificity was 97.4%. The accuracy was very good for this model and is essentially a percent error calculation of the model. The sensitivity measures the proportion of actual positives that are correctly identified while the specificity is the opposite and measures the proportion of true negatives identified. The precision gives us a sense of how useful or complete the results are. The model did well overall but provided insights to possible adjustments that could be made which would increase the predictive power here.

Critique of method
One critique I have is the small sample size. While I simply intended to only lay down the framework for creating a stepwise process in disease classification, it supplies results that can hardly be statistically backed. I would like to increase my sample size to five images for part three and look for similarities and differences between the five. I also intend to make some adjustments to the process to try and increase overall accuracy, precision, specificity and sensitivity. So essentially this critique is the limited conclusion that can be drawn from a sample size of one and the need to increase that to five for now.

The second critique I have is the number of steps I have used to get to this point. I would like to find a more manageable way to do the segmentation process and the image processing steps to get to this point. I have found small changes I can make along the way. Ideally, I can use the Modelbuilder in ArcGIS Pro where most of the processing is done. This will streamline the process when I find a way to do this.

An error which is present in my results is the confusion matrix. The matrix is considering 2183 raster cells in order to perform the matrix. These cells were determined by a defined rectangle when exporting a Tiff file from ArcGIS Pro. Many of these cells are not even of the leaf and is simply classifying regions outside of the leaf. To correct this, I would need to export a Tiff file which is symbolic of the leaf shape. The confusion matrix results provided were therefore erroneous in a sense or a bit misleading.

Partner ideas
My partner talked about how they were doing a neighborhood analysis and it could be practical for me to do. She had mentioned doing it in earth engine which I haven’t used but could get some help from her. There is a multiple rings buffer and I could look at the false positives in this light. She also mentioned using geographically weighted regression. We didn’t discuss much about it, but it seemed like a good regression to perform on my error analysis. We related on some level with our projects data and issues but at the time didn’t have any clear resolves. I will be curious to follow up on our chat and see what type of analysis was performed and share my results as well.


Image 1. Overlap between segmentation on top and manual classification below

Image 2. False negatives after subtracting segmented regions from manually classified.

Image 3. False positives after manually classified cells are subtracted from segmented.

Table 1. Confusion matrix

R Code
##raster upload



img20_seg <- raster(“E:/Exercise1_Geog566/MyProject3/RasterT_afr7_Polygon_1.tif”)

img20_ground <- raster(“E:/Exercise1_Geog566/MyProject3/diseased_20_PolygonT_1.tif”)



##export data

raster.table <-, xy=T)
truth <-, xy=T)


tableimg <- raster.table

write.table(tableimg, file = “dataexport.csv”, sep = “,”)
write.table(truth, file = “truth1.csv”, sep = “,”)

##confusion matrix

#######this isn’t working


table(confusionM$reference, confusionM$predicted)

confusionMatrix(confusionM$reference, confusionM$predicted)

##For the confusion matrix if above doesn’t work

myconfusionM <- table(confusionM$predicted, confusionM$reference)

##accuracy of matrix

2075+41+12+55 #total
(12+55)/2183 #misclassified/total







May 12, 2019

Ex2: Incremental pixel greenness while moving away from refugee settlement boundaries

Filed under: 2019,Exercise 2 @ 11:45 pm

Exercise 2: Incremental pixel greenness while moving away from refugee settlement boundaries


  • The question I asked centered on how the pixel greenness / NDVI varied in buffered increments around settlements within BidiBidi, Imvepi, and Rhino Refugee camps. I wanted to compare the settlements in Bidi Bidi and Imvepi, which have a larger settlements, to Rhino, which tended to have smaller settlements more uniformly spaced and spread out. Given the more uniformed and wider spacing of Rhino, I expect that green-up will happen more quickly in comparison to the settlements in Bidi Bidi and Imvepi, which are closer together and varying in size and development pattern. This leads me to believe that there’s more cleared or available land and that

    Map of Regions of Interest & Buffers

    settlement size is based on political organization of camp blocks rather than natural boundaries that might exist already. One of the questions that I’m asking with these settlements and the geography of exclusion is essentially why different settlement areas and parts of these areas are included or excluded from global settlement datasets. One of the factors that contributes to this is a spectral and spatial distinction – that is, how might the green space in and around a settlement change as you move away from said settlement? With this exercise, I wanted to compare the settlements in Rhino to the settlements in Bidi Bidi and Imvepi to see if the pixel greenness changed at a different rate or in a different pattern as one moved away from the settlement center.

  • I used multiple different tools, including the Multi-Ring Buffer tool in QGIS (since I was working with export JSONs), EarthEngine to extract NDVI mean values from Landsat 8 satellite images at these settlement locations, and the smoothing factor in ggplot in R to plot and statistically examine the way NDVI changed.
  • I first needed to buffer the regions of interest that I wanted to study, of which there were 44 in Bidibidi and Impvipi and 41 in Rhino. I performed this buffering in QGIS using the Multi-Ring Buffer plugin and made buffers at 100-meter increments from 0 to 1000 meters. Some of the buffers overlapped, but for the sake of simplicity of this assessment, I ignored this. Within EarthEngine, I pulled Landsat8 images from 2018 that covered these settlements. After adding NDVI and NDBI calculated bands to the image collection of 2018 images, I performed a quality mosaic to compress the image collection into one image. I based this quality mosaic on NDVI, meaning that the pixel chosen from the image collection would be the pixel with the highest NDVI. While this can sometimes pull pixels from different dates, it does exclude the possibility of clouds and seasonality affecting the dataset by comparing just the most vegetated pixel that occurred in that area. If I were to re-do this, I might choose a single date image to capture phenological nuance. After reducing the image across all of the buffers (that is, calculating the mean NDVI within each buffer), I exported the geoJSONs, brought them back into QGIS, ensured that there was a spatial selection component linking all of the buffers and regions of interest together, and brought this data into R to plot the NDVI change over distance for all regions of interest.

Bidi Bidi Settlement; Imveppi Settlement Buffers

Rhino Settlement Buffers

  • The pattern of greening in the buffers around settlements in Rhino versus Imvipi and BidiBidi did present different patterns, but not particularly significant different patterns. It appears that the Rhino settlements had a faster rate of increase in greenness while moving away from the settlements especially in the first 500 meters, whereas Bidi Bidi and Imveppi showed a more gradual green-up, although there also seems to be a small shift at 500 meters. These results are somewhat expected, but also not very drastic. It would be interesting to see how the green-up changes if I increased my buffer extent or decreased my buffer increments.
  • I think that looking at NDVI in buffers was an interesting approach, but as I said above, my choice of pixel quality selection (highest NDVI) could alter a neutral selection of data. Also, what buffers I chose were relatively arbitrary – I chose equal intervals, but this does mean that when the mean NDVI is calculated, the mean is reduced across a larger area as each buffer gets further from the center. I could also try testing with larger buffers (200 or 500 meter buffers) that extend beyond 1000 meters from the settlement edge. Further, some of my buffers overlapped and encroached on other actual boundaries: this means that the buffers sometimes contained pixels from other identified settlements. For this reason, I chose to present the data in a smoothing trend. I will likely need to fix some of these errors for the final project, because I do think that this is a typical and useful spatial analysis to perform on this type of data and some of the errors are relatively easy to fix and would show stronger data integrity.


May 11, 2019

Exercise 2: Geographically weighted regression on two forested stands.

Bryan Begay

  1. Initial Spatial Question: How does the spatial arrangement of trees relate to forest aesthetics in my areas of interest?


To understand forest aesthetics in my stand called Saddleback, I did a Ripley’s K analysis for Saddleback and on a riparian stand called Baker Creek to determine if the stands are clustered or dispersed.  The Baker Creek location is a mile west of the Saddleback stand.

  1. Geographically weighted Regression:

I performed a geographically weighted regression on both the Saddleback and the Baker Creek stands. The dependent variable was a density raster value and the explanatory value was tree height.

  1. Tools and Workflow

Figure 1. The workflow for creating the Geographically Weighted Regression for the Saddleback Stand. The Baker Creek stand followed the same workflow as well.



Figure 2. Geographically Weighted Regression showing the explanatory variable coefficients in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates negative relationships and the hotter colors  indicate positive relationships between tree height and density.

Figure 3. Geographically Weighted Regression showing the Local R2 values in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates that the local model is performing poorly, while hotter colors indicate better performance locally.

Table 1. Summary table output for the Saddleback stand’s geographically weighted regression.

Table 2. Summary table output for the Back Creek stand’s geographically weighted regression.

4. Interpretation/Discussion:

Having done the Ripley’s K analysis, I wanted to have a connection with this exercise, so I created a point density raster on both my stands (Figure 1). The point density raster calculates a magnitude-per-unit area from my tree points and outputs a density for the neighborhood around each tree point. The raster values would then be a descriptor of the trees neighborhood density. Having the density neighborhood values describes the stands tree spatial arraignment and relates to the Ripley’s K analysis outputs of telling if a stand is spatially clustered or dispersed.

Figure 2. shows that there is a spatial pattern in the Saddleback stand between density and height. There is a positive relationship on the edges of the stand and a decreasing relationship in the middle of stand between the two variables. This makes sense when thinking about how the stand would have denser and higher trees on the edges of the managed stand to screen the forest operations. The coefficient values for the baker creek showed a positive relationship on the north eastern portion of the stand, which would need further investigation to understand the relationship between density and height. Overall the relationship was negative in the Baker creek stand between density and height, but this may be attributed to the low local R2 values that indicate poor modeling (figure 3). Table 2. also shows that the Baker Creek model only accounted for 50% of the variance for the adjusted R2 values, which would indicate that more variables would be needed for the riparian stand. Figure 1. shows the summary table for GWR in the Saddleback stand.

  1. Critiques

The critiques for this exercise is that I only look at height and density. If I had more knowledge of working with LAS data sets I would have liked to have implemented the return values on the LiDAR data as an indicator of density. Another critique would be that I used density as a dependent variable and height as an explanatory variable. Using density as the dependent value allows me to see the spatial patterning of my trees when plotted in ArcMap so I can reference the Ripley’s K outputs for further analysis. Having height as a response variable with density as an explanatory is something that would have been easier for me understand and explain that relationship. Density can affect tree height in a stand but understanding tree height as a factor that affects density is not as intuitive. Looking at how tree height responds to density in my stand would tell something about tree height, but that relationship has already been explored in great depth.

Spatial Distribution of Trees by Height Class, Slope and Elevation in the HJ Andrews Forest

Filed under: 2019,Exercise 2 @ 11:32 am
Tags: , ,

Guiding Questions: How do distances between trees differ depending on tree height? How does the spatial pattern of tall trees relate to the spatial pattern of slope and elevation?

Methods: I used a combination of ArcMap, QGIS and R to perform analyses and view results. I used the results of my previous distance analysis within the HJ Andrews Forest, which grouped individual trees into ten height classes and calculated the mean distance between trees within the same height class, to correlate tree spacing with other spatial phenomena. I wanted to know if hot spots in within class tree spacing correlated with hot spots in tree height, so I examined hot spots and cold spots of tree distances within each height class and compared them to tree heights, slope and elevation. Height class 1 is the shortest class of trees and height class 10 is the tallest class of trees.

I used the Hot Spot Analysis Tool in the Arc Toolbox > Spatial Statistics Tools > Mapping Clusters > Hot Spot Analysis (Getis-Ord Gi*) to perform a Hot Spot Analysis on each of the ten height classes by mean distance to the 30 closest trees of the same height class. In the context of this analysis, the interpretation of a hot spot is that it is a region of greater than expected distances between trees of the same height class. For example, in the shortest height class, 1, hot spots are regions of greater than expected mean distance between a short tree and the 30 closest short trees. Cold spots would then be regions of closer than expected mean distance between short trees.

The Hot Spot Analysis in ArcMap used a self-generated distance band of 113m for my original hot spot analysis of the global dataset (not broken up by height class), so I decided to use a distance band of 100m for each subsequent hot spot analysis. Each height class has a different number of total trees in it, so by holding the distance band constant, I hoped to avoid influence from any differences in total number of trees between height classes.

After viewing the hot spot results, I plotted the z-scores of heights for each height class against the z-scores of the distances between trees to visually examine their relationship. If both heights and distances between trees were perfectly normally distributed, one would expect a circular distribution on the density plots with a slope of zero.

I then compared the mean slopes, elevations, and standard deviation of slopes and elevation within height classes across the entire forest. Since HJA is a research forest with many different management areas, including harvested patches and research plots, I limited the next part of the analysis to only within control areas of the forest. I downloaded the most recent (2014) land use designations from the HJA data repository ( For this analysis, I used Entity Title 3: Reserved areas (controls) within the Andrews Experimental Forest. I compared slopes and elevations within the control plots only by height class, to see if there were differences between the global dataset and the control regions of the forest.


The density plots of height z-score versus distance z-score revealed a different pattern between smaller height classes of trees and tree spacing than the relationship between larger height classes of trees and tree spacing. As we go from shorter height classes of trees to taller height classes, the density plot distributions change (Figures 1-10). There is strong evidence of positive correlation between hot spots of short trees and hot spots of distance between short trees, but from height class 6-10, there is little to no evidence of a relationship between hot spots of trees and distance between them. Tall trees are more or less distributed randomly throughout the forest.

Figure 1


Figure 2


Figure 3


Figure 4


Figure 5


Figure 6


Figure 7


Figure 8


Figure 9


Figure 10

There is clearly some structure to the density plots (especially in height classes 1-5), so we can assume that the trees are not randomly distributed and that there is a relationship between height and distance between trees. I compared mean and standard deviation of slope, as well as mean and standard deviation of elevation for each height class of trees (Table 1). Mean slopes do not significantly differ between height classes, so slope is likely not a main driver of tree height. However, there is some evidence that tree heights differ at lower and higher elevations, with the shortest height class of trees at a mean elevation of 1014 m, and the tallest height class at a lower mean elevation of 831 m. It’s important to note that mean elevations have large standard deviations, so the trend may not be as strong as it first appears. I wanted to know if there was more evidence for this pattern, so I calculated the same statistics for subsets of the hotspot analyses constrained to only the control areas of the forest (Table 2) to get an idea of how management may or may not influence the relationship between tree height and tree spacing throughout the forest. Mean slopes and elevations, accounting for standard deviation from the mean, do not differ significantly between the global and control datasets, meaning that the control regions are reasonable representations of the rest of the forest. To examine this further, I examined the same data for the entire forest excluding the control areas (Table 3). The same pattern holds between the three datasets; class 10 is the only class that is consistently at lower elevations across the forest. I made two density plots to display this relationship. Figure 11 shows the distribution of elevation for the shortest height class (class 1) of trees (green) versus the global dataset of all trees (red). The short trees follow the same distribution as the rest of the dataset, meaning that they are dispersed more or less evenly across elevations. Figure 12 shows the distribution of the tallest height class of trees (class 10; light blue) by elevation versus the global dataset of trees by elevation (red). This clearly shows that tall trees are not distributed at higher elevations.

Figure 11. Density of trees by elevation in height class 1 (shortest trees; green) versus global dataset of all trees (red).


Figure 12. Density of trees in height class 10 (tallest trees; light blue) versus global dataset of all trees (red) by elevation.

Table 1: Global dataset

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 23 11.7 1014 294
2 23 11.6 1008 291
3 25 11.6 990 295
4 25 11.5 948 291
5 25 11.3 931 293
6 26 10.9 972 291
7 27 10.5 982 250
8 26 10.6 959 221
9 25 10.7 926 207
10 23 11.2 831 197


Table 2: Subset of control regions

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 27 11.9 1104 317
2 27 11.5 1136 328
3 28 11 1157 329
4 28 10.8 1143 311
5 28 10.2 1133 285
6 28 10 1071 262
7 28 10 992 228
8 27 10.6 955 193
9 26 10.8 935 187
10 24 11 869 189


Table 3: Global dataset excluding control regions

Height Class Mean Slope SD Slope Mean Elevation (m) SD Elevation
1 22 11.4 991 284
2 22 11.4 979 273
3 24 11.6 952 272
4 24 11.5 902 266
5 24 11.5 867 265
6 25 11.3 908 290
7 26 10.7 973 267
8 25 10.5 964 242
9 23 10.4 919 221
10 22 11.2 793 198


Critique of the method:

A criticism of hot spot analysis is that it’s basically a smoothing function that places a focal around an area but does not account for the distribution of values within that area. So, the tallest tree in the dataset could be in cold spot (region of shorter than expected trees) and the hot spot analysis would give you no indication of that, so one may miss out on potentially useful/interesting information.

This is only a cursory look at the data and a next step is to more closely examine how slope, elevation and aspect influence distribution and height of trees, particularly within the control areas of the forest.

May 10, 2019

Ex 2: Spatial distribution of juniper saplings and slope,aspect, and seed source

Filed under: 2019 @ 6:44 pm

1. Question Asked: How does the density of juniper vary based on characteristics associated with soil moisture? Additionally, does this pattern vary with juniper density (as an indicator of potential seed source) surrounding each transect?

The study site for this analysis is the Camp Creek Paired Watershed Study (CCPWS) in central Oregon. The majority of juniper was removed from one watershed (“treated WS”) while one watershed is dominated by mature juniper (“untreated WS”). This analysis seeks to examine the density of juniper sapling reestablishment in the treated WS and mature juniper density in the untreated WS as it relates to indicators of soil moisture (slope, aspect, and a combination of both) and seed sources (surrounding juniper density).

2. Tools and steps used in analysis: All steps were completed in ArcGIS 10.6 or ArcGIS Pro.

Surrounding juniper density: Thirty-three belt transects were conducted in the summer of 2018 in the treated WS. The number of juniper for each transect (30m long by 3m wide) was recorded. For the purposes of this analysis, the transects were represented by a point. Supervised classification (support vector machine) was applied to National Agriculture Imagery Program (NAIP) imagery for this area and pixel-based analysis was used to estimate the density of juniper. General steps are as follows:

  1. Buffers were created surrounding each transect point at 50m, 100m, and 250m using the buffer tool.
  2. The zonal histogram was used to extract the pixel values within each buffer. To ensure proper counts, each buffer must be iterated separately if there is overlap between buffers.
  3. A model was created in ModelBuilder to extract values from the buffers and combine values in a table. In-line variables were used to separate the data for each transect into separate tables prior to being merged.The submodel (first image) and full model (second image) are displayed below:


  1. The percentage of juniper pixels within each buffer (number of juniper pixels/total number of pixels) was calculated using excel. This percentage was used as an estimate of surrounding juniper canopy/density within the surrounding areas.
  2. Point plots were created indicating the number of juniper per transect and the corresponding juniper density.
  3. The transects were divided in to two groups representing “low” counts (transects where one or fewer juniper saplings were found) and “high” counts (transects where five or greater juniper were found).

Slope and Aspect: Slope and aspect were considered to be indicators of soil moisture within the watershed. Separate approaches were used to assess if juniper density was found to vary with these characteristics: regression (using ordinary least squares and global weighted regression) and a rating model based on slope and aspect categories. General steps are as follows:

  1. Using a 30m DEM, the slope and aspect values were extracted using the slope and aspect tools within ArcMap.
  2. Using the calculate field function, aspect and slope were each ranked from 1-9, with areas were greater soil moisture is expected (flat and shallow slopes and northern slopes were weighted heaviest). The following parser was used (aspect values are in degrees and slope is in percent):

def Reclass(Aspect_category):

    if (Aspect_category<0 and Aspect_category>=-1):
        return 9
    if (Aspect_category>=337.5 and Aspect_category<360):
        return 8
    if (Aspect_category>=0 and Aspect_category<22.5):
        return 8
    if (Aspect_category>=22.5 and Aspect_category<67.5):
        return 7
    if (Aspect_category>=292.5 and Aspect_category<337.5):
        return 6
    if (Aspect_category>=67.5 and Aspect_category<112.5):
        return 5
    if (Aspect_category>=247.5 and Aspect_category<292.5):
        return 4
    if (Aspect_category>=112.5 and Aspect_category<157.5):
        return 3
    if (Aspect_category>=202.5 and Aspect_category<247.5):
        return 2
    if (Aspect_category>=157.5 and Aspect_category<202.5):
        return 1
def Reclass(Slope_category):
    if(Slope_category>=0 and Slope_category<2):
        return 9
    if(Slope_category>=2 and Slope_category<4):
        return 8
    if(Slope_category>=4 and Slope_category<6):
        return 7
    if(Slope_category>=6 and Slope_category<8):
        return 6
    if(Slope_category>=8 and Slope_category<10):
        return 5
    if(Slope_category>=10 and Slope_category<12):
        return 4
    if(Slope_category>=12 and Slope_category<14):
        return 3
    if(Slope_category>=14 and Slope_category<16):
        return 2
      return 1
  1. Based on the categorized values for aspect and slope, a rating model was calculated using the raster calculator to determine if areas of expected greatest soil moisture corresponded to juniper density. The formula for this rating model was: (slope category +aspect category)/2.
  2. As the rating model had a lower spatial resolution and extent than the classified NAIP raster, the classified raster was resampled to be the same resolution as the rating model. The NAIP raster was masked using the extent of the rating model raster. These rasters will be used for further analysis in exercise 3. For the purposes of exercise 2, two maps were created for visual analysis. For future analysis, both rasters were projected to NAD 1983 UTM Zone 10N.
  3. The hot spot analysis from exercise one was overlaid on the rating model to visually assess the patterns over a small scale.
  4. An ordinary least squares (OLS) analysis was performed using the number of juniper counted for each transect as the dependent variable and the slope category, aspect category, and combination of both categories as the explanatory variables.
  5. The standard residual for each OLS was used to calculate Global Moran’s I.
  6. Geographically Weighted Regression (GWR) was used with the same inputs described for the OLS.

3. Overview of results.

Buffer. No clear trend was found between juniper density and the number of juniper sapling found in each transect. However, for all transects the juniper density was lower with increasing buffer distance. Results are displayed in the chart below, points are labeled by distance and “high” (corresponding to transects with 5 or more juniper) and “low” (corresponding to transects with 1 or fewer juniper:


Comparison of Rating Model and Classified Raster. The classified NAIP (prior to resampling and reprojection) and rating model raster are displayed below. Red pixels within the classified raster are pixels classified as juniper. For the rating model, lighter shaded areas correspond to those areas where higher soil moisture is expected based on the slope and aspect characteristics. The hot spot analysis provided limited use (as only one hot spot and one cold spot were indicated) but could be useful for additional analysis. However, the cold spot aligned with areas of expected lower soil moisture and the hot spot corresponded to areas with expected higher soil moisture. Further analysis of these two datasets will continue with exercise 3.

Regression analysis. The R-squared values for OLS for each explanatory variable(s)(slope category only, aspect category only, and a combination of slope and aspect) ranged from -0.025 to -0.060, suggesting that these characteristics were not a good predictor of juniper density (at least for the data set used in this case). The coefficient for each explanatory variable(s) ranged from -0.087 to 0.024, also indicating that this model was not a good fit. The Koenker (BP) statistic was insignificant for all cases. The p-values (p>0.86 for all cases) for Global Moran’s I for each combination of explanatory variables indicated random spatial patterns. As expected, results were similar for the GWR analysis with R-squared values ranging from -0.011 to -0.037. The GWR residual squares ranged from 180.8 to 182.3.


4. Discussion/critique of methods. Based on the buffer analysis, no clear patterns were indicated for repulsion/attraction based on the transect numbers and the density analysis. However, the juniper density (both in range and overall percentage) decreased for the 250m buffer regardless of the number of juniper counted per transect. These patterns are most likely related to the fact that juniper was removed from one watershed fifteen year ago and that the adjacent watershed was left untreated. This coupled with the small sample size may not be sufficient to indicate if spatial patterns between juniper density and saplings found for each transect exist. The OLS and GWR analysis indicated that slope and aspect (at least based on the data used here) were not good predictors of juniper density. For exercise 3, regression analysis of the classified raster and the rating model will take place.

The methods used here were relatively easy to apply, and can be accomplished using tools available in ArcGIS Pro. The ModelBuilder process required time to develop and some issues, particularly with the merge function, required that the tables be manually manipulated. However, ModelBuilder still was more efficient than executing the buffer and zonal histogram for each transect separately. Also, the supervised classification can easily be accomplished within ArcGIS. However, the spatial resolution of the NAIP imagery (1m) likely led to reduced accuracy of juniper sapling identification compared to images with higher spatial resolution. This was the highest resolution imagery available for the entirety of the two watersheds but this methodology could easily be applied to other data.

Overall, the methods used here may inform future analysis to be applied to other datasets. For the purpose of these exercises, very few spatial trends are evident but that may be the result of a)the sample size, b)the effects of juniper removal, and c)characteristics of the dataset (e.g., spatial resolution of the raster). In particular, I would be interested to see how these approaches would work in areas where larger-scale removal has taken place. Additionally, the use of rasters with higher spatial resolution (such as UAV-based imagery) would likely improve the classification and increase accuracy of juniper sapling detection.

Ex 2: Relationship between stream cross-sectional change and across-channel slope

Filed under: 2019,Exercise 2 @ 6:11 pm

For Exercise 1, I investigated autocorrelation in patterns of cross-sectional change across and along stream reaches.

For Exercise 2, I wanted to know whether these patterns of change were related to channel geometry.

Specifically, I wanted to know if erosion or deposition happened more frequently close to cut banks than further away from them.

I was originally going to investigate change in both the along-channel and across-channel directions, but after consulting with my classmates, I decided that I would not be able to draw as many conclusions in the along-channel direction because I don’t have good information about the spacing and orientation of my cross sections.


To find cross-sectional change, I paired sequential years and calculated change along each cross section for each pair of years as I did in exercise #1.

I wanted to compare the change to the location of the steepest bank, but I couldn’t figure out how to identify what parts of a cross section counted as a “bank” using a computer. Instead, I looked at the spatial pattern of change in relationship to the point of steepest slope in the across-channel direction.

I used the original cross section profiles to identify the point of steepest across-channel slope. Since the point of steepest slope may move from year to year, I used the steepest slope from the first year in every year pair. I used the loess function in R to lightly smooth each cross-sectional profile before extracting slope in hopes of reducing the effects of some small bed features. This worked well in most cross sections, but in some cross sections, especially those with prominent midchannel features, the point of steepest slope occurred in the middle of the channel.

Once I had identified the steepest point in each of the cross section for each year pair, I calculated how far every other point in the cross section was from the steep point. Then, within each reach, I aggregated the data by distance, rounding to the nearest decimeter, and calculated the mean absolute elevation change (that is, counting both erosion and deposition as positive values). I wanted to see broad patterns overall, so the aggregate combines data for every cross section and every pair of sample years.

I plotted the resulting data from each reach. In the figure below, the colors represent how many horizontal centimeters of reach are aggregated into each point on the line graph. Bigger numbers and more blue colors represent averages from more cross sections and years while smaller numbers and more yellow colors represent distances where fewer cross sections or fewer years had data at that distance from the steep point.


The figure implies that perhaps a lot of channel change tends to happen very close to the steepest point, but then stabilizes. Far from the point, the average vertical change values become very unsteady, perhaps because fewer data points are integrated into the average.


I thought that this was an interesting and fairly straightforward analysis to conduct, but I am not sure how physically meaningful the results are, since the steepest points are not placed in the same location in each cross section. The results figure looks a bit like a channel cross section itself, which I thought was very interesting! I wonder if this is because the averaging falls apart at a distance roughly equal to the average channel width or if there really is more change happening near channel banks on average in these streams.



A stain on the neighborhood… How does management relate to infection for black stain root disease?

Filed under: Exercise 2 @ 5:30 pm

Sorry for the bad puns in the title… Could not help myself.


How do spatial patterns of black stain root disease infection probabilities relate to spatial patterns of forest management classes? How do these spatial relationships differ between landscapes where similar management classes are clustered and landscapes where management classes are randomly distributed?


I used neighborhood analysis to analyze the spatial relationship between hot and cold spots of infection probabilities and the three forest management classes simulated in my model (extensively managed plantations, intensively managed plantations, and passively managed old-growth stands).

I used a combination of ArcMap (to perform the majority of the procedure) and R (mostly for spatial data wrangling and analyzing and plotting results).


1. Compare the distribution of infection probabilities between landscapes and management classes

I performed this step in R to determine whether there was evidence of significant differences in the infection probabilities (a) between the clustered and the random landscapes and (b) between management classes both within and among landscapes.

2. Hotspot analysis

Converted my raster data of infection probabilities to point data (in R, using the “raster” package) and perform a hotspot analysis (hotspot tool in ArcMap) (Fig. 1). For the hotspot analysis, I used inverse squared distance weighting to conservatively include trees within hotspots.

I also created polygon shapefiles for areas identified as hotspots to the 99% confidence level in each of the landscapes and calculated the area of these hotspots for each landscape, management class, and landscape X management class.

Figure 1. Hotspot analysis overlaid on forest management classes for the “clustered” landscape. The same analysis was performed on the “random” landscape.

3. Select point for neighbor analysis

For this step, I chose one point in a hot spot and one point in a cold spot for each of the two landscapes to perform the neighborhood analysis. In the future, I would write a script to repeat this procedure with a random sample of a large number of points, but for this exercise, I just used one point for each.

For the hotspots, I only used points identified in hotspots at the 99% confidence level. For the cold spots, I used a point from the 99% confidence level for the clustered landscape, but my only options for the random landscape were points identified at the 90% confidence level. I visually selected the points for analysis (non-random), but I would not do so for my full analysis.

4. Create concentric ring buffers for the neighborhood analysis

I used the Multiple Ring Buffer tool in ArcMap to generate hollow ring buffers at a series of distances around each hotspot and cold spot point selected for analysis (Fig. 2). I then intersected these buffers with the management class shapefile and calculated the proportion of each buffer ring composed of each management class (by area). I plotted these proportions as a function of distance to complete the neighborhood analysis.

Figure 2. Concentric ring buffers used for the neighborhood analysis, with outer ring distances labeled. Example shown for the “random” landscape for both hot and cold spots.


Non-spatial analysis

Before performing the neighborhood analysis, I wanted to know whether there was any difference in the infection probabilities between the two landscapes and the three management classes. Visual assessment of the box plots (Fig. 3) led me to believe that there were significant differences between the means of the infection probabilities between the landscapes and management classes, which was supported by the results of student t-tests (p << 0.01). There were also significant differences in the infection probabilities between management classes both within and among the two landscapes. Since the outputs I am analyzing are “dummy data” because my model is not fully complete, this does not surprise me and I did not perform further, more rigorous statistical analysis. The higher infection probabilities simply reflect the density of trees in each of these management classes. Extensive management had the highest infection probabilities (highest initial planting density, evenly spaced), followed by old-growth (lower density but clustered trees) and finally intensive management.

Figure 3. Comparisons of infection probability distribution between (a) landscapes, (b) management classes, (c) management classes by landscape; (d) the proportion of each management class covered by an infection hot spot in each of the two landscapes (by area).

Neighborhood analysis

For the clustered landscape, the management class in which the point was located made up the largest proportion of the first several distances analyzed (cold spot: old-growth, hot spot: extensive). This makes sense given that the management classes were spatially clustered in this landscape. At a certain threshold, the proportion of this management class started to decrease, as the other two management classes increased in proportion at a similar rate. For the random landscape, the decrease in the starting management class proportion was relatively rapid with distance, and all three management classes converged at their landscape proportion by 150 meters, with some fluctuation (in both landscapes, each of the three management classes makes up about 1/3 of the total landscape).

Figure 4. Neighborhood analysis showing the proportion of each (hollow) ring buffer accounted for by each management class for hot and cold spot points in each of the two landscapes.

Critique of the method – what was useful, what was not?

Most of the drawbacks of this analysis were due to my process and the nature of my data. Because I only used one hotspot and one cold spot point for each of the two landscapes, it is difficult to say much from this analysis. If I were to automate the analysis and run it on a series of random points drawn from the hot and cold spots, I would get a lot more insight as to the patterns of neighborhood effects if any exist. In addition, the “dummy data” used for analysis come from model runs that only have local disease transmission (between neighbors at a radius of several meters). However, the full model will also incorporate long-distance dispersal by insect vectors (on the scale of kilometers), which will likely be much more interesting and less predictable when neighborhood analysis is performed.

Issues with the method itself is that it is quite time intensive – an issue that could be cleared up with a script written in Python or R to automate this process given a set of input files. If there is a way to do this already built into either of these platforms (or Arc/QGIS), I was not able to find it. Also, there is a lack of clear quantitative interpretation for these plots to separate statistically significant variability in the proportion of each management class at each distance from non-significant variability. A means of doing so would enrich the analysis.

Exercise 2: Using Neighborhood Analysis to Identify Relationships Between Seascape Classes and Rockfish Abundance Hotspots

Background and Question Asked

In Exercise 1, different interpolation methods were used to create a heat map of rockfish abundance based off of a large collection of point data. That blog discussed some of the challenges that arose while attempting to use a time series of point data with many points in close proximity to one another (if not overlapping). The exploring was in many ways successful: it was discovered that the Kriging method provided a more robust representation of the data than Inverse Distance Weighting. However, in the time since that post was published, my interpolation methods have been refined:

  • Instead of using the entire time series as an input for the interpolation, four individual years were selected to represent the whole dataset (2003, 2007, 2011, 2015). Kriging was then used to create heat maps for each individual year.
  • Additionally, the union tool was used to remove the land boundaries from the environment so that the interpolation only affected parts of the ocean
  • The symbology of the abundance point data was synced across all four years being used in the analyses so that they could be easily compared to one another
  • The symbology of the interpolated heat maps was also modified to be consistent throughout the analyses

For this exercise, I plan to compare my new, interpolated data to an already existing set of data, effectively comparing my two variables. Specifically, I hope to answer the question “Is there a spatial relationship between areas of significantly high and low rockfish abundances and specific seascape classes?”

An example of the most recent Kriging output using the 2007 data.

Name of Tool or Approach Used

I will be using a neighborhood analysis to seek an answer to this question. The neighborhood analysis requires taking areas of interest and examining the environmental conditions around that area from the perspective of another variable. By varying the distance from your original point of interest, a researcher is able to infer about the spatial relationship between the two variables.


Data Used

  • “Points of Interest” chosen from plot below
  • Buffers created around points of interest at 5km, 10km, and 25km radii
  • YOYRock Kriging Abundance Interpolation for 2007
  • Seascape NetCDF Raster File for May 5th, 2007

Rockfish abundance plotted against water column depth for trawls from 2007.

The first thing that was needed to complete this analysis was points of interest. I chose to use four points form the year 2007, as the data from this year provided the largest spatial footprint of all of the years of interest. Two of the points represented trawls that found significantly high rockfish abundance, and the other two represent trawls which found no rockfish. All four points vary spatially and physically (latitude, longitude, water column depth, etc). All points were selected from interpolated areas with different modeled outputs. Next, circular buffers were created around each point of interest with 5km, 10km, and 25km radii.

Map showing Points of Interest with circular buffers overlaid on seascape NetCDF file.

In order to use the overlay tool in ArcGIS, two polygon features are needed. In order to convert my NetCDF Raster files into a polygon, I used the Raster to Polygon tool. Once the seascape classes were converted to polygons, the Intersect tool was used to measure the shape area of each seascape class within each buffer. Those statistics were then converted to .xlsx files and summarized in Excel.

Results and Discussion

Example of data after Raster to Polygon and Intersect tools used

The neighborhood analysis found evidence that specific seascape classes may have impacted young of the year rockfish abundances in the locations selected to be a part of this analysis.

The low-abundance trawls were dominated by three seascape classes: Class 14 (Temperate Upwelling Blooms), Class 19 (Subpolar Shelves), and Class 21 (Warm, Blooms, high Nutrients). While there were more classes represented overall by the high abundance trawls, those areas were mostly dominated by two seascape classes: Class 7 (Unnamed) and Class 12 (Subpolar Nutrient). Additionally, there was very little overlap between the two areas – the only seascape class that appeared in both the high abundance radii and the low abundance radii was Class 14. Further analyses would be needed to determine if these trends are representative to the entire region or year, but this neighborhood analysis provides results that give us a place to start. Overall, I found this analysis to be extremely useful despite the number of steps needed to make it work. In addition to working in GIS normally, the data type of my seascapes had to be changed and much of my analysis had to be done in Excel, as ArcGIS cannot summarize key statistics. However, I feel as though streamlining this method could be done now that I am familiar with it.

Exercise 2: Cross Variogram and Kriging

Filed under: 2019,Exercise 2 @ 3:19 pm

In Volcanic Regions fluid flow paths are limited to the rubbley bases and flow tops of lava flows where permeability promotes transitivity.


The depths to first water will corresponds to depths located near lava flows.


Contact: location on the surface, or at depth where two different rock types touch.

Depth to first water: the depth from a particular well log where water was first noted, this is not always listed.

Depth to first lava: the depth at which the first lava is noted in a particular well log. There can be multiple contacts to a lava in a well log, which is why I specified first.

Figure 1: Block diagram of what well log depicts. The green and red planes represent contacts between the two rock types. Well (grey) and well logs record these contacts as depths from the surface (brown).


Does the depth to first water correspond to the depth to first lava in my data set?


Cross Variogram and Kriging

Like the variogram, the cross variogram is a tool that allows you to compare spatial data at multiple scales. Unlike the variogram, the cross variogram compares one data set to another data set at multiple scales.

Kriging uses the variogram to interpolate a surface.

Brief Description:

In order to use both the variogram and the cross variogram you must normalize the data you are working with. Otherwise the semivariance values can range from 0 to infinity. Normalizing the data allows you to distinguish data that are correlated (semivariance<1) from data that have no correlation with eachother (semivariance>1).

In order to use the R function gstat, I had to turn the data into a spatial data frame.

The function gstat allows you to simultaneously create variograms for each of the induvidal data sets you are working with and compares them with each other. In this case I just compared the depth to first lava with the depth to first water.

I used Arc GIS kriging formula (ordinary) to krige a surface that represented the difference between between the depth to first lava and the depth to first water. In other words I subtracted the depth to first lava from the depth to first water and kriged that “surface”. I wanted to see if there was a spatial distribution around which those difference were low. I tried to use R to krige, but did not have the time to work out the krige function.


My results were strange, though ultimately unsurprising.

Figure 2: Variograms of my two variables water1 and lava1 as well as the cross variogram that comes from comparing the two. Note that the water1lava1 cross variogram has negative values for the semivariance.

Figure 3: Plot of well log data with the difference between first lava and first water plotted on top of its kriged surface.

The strangest thing to resolve from this exercise are the negative semivariance values for the cross variogram. Semivariance is a squared values and therefore should not have negative values. I have no idea what is happening here. I need to ruminate upon it. Either way, the data does not appear to be well correlates, or at the very least, I am not comfortable making conclusions about it with the negative semivariance values.

The ordinary Kriged surface interpolated from the difference between lava and water lets me know that the highest Kriged surface (Fig 3, white) between water and contact lies in the middle of the study area . In geographic and geologic space this corresponds to a basin filled with sediment and inter-fingered with lava. Many of the well logs in the region are not deep, they don’t have to be because the water here is near the surface, and close to some of these buried basalt flows. The data at the far edges of the map are spatial outlies, and thus we can’t look at any of the map that lies far from the main cluster of data points.

From physically looking at the well logs I know that while the well logs do often correspond to a lava flow, it is almost never the first lava flow. I am not surprised that the semivariance indicates that the data are not correlated.

Critique of the method:

what was useful, what was not?

It was not particularly useful, because it told me what I already know and left me with more questions than answers. However, I did walk away with some considerations. The cross variogram (and variogram) might work at a smaller scale.

In other words, if I broke my field area up in regions where I think the lava layers might source from the same place (Lassen Peak or Medicine Lake Volcano) I would be able to make the assumption that lava layers that are at similar depths in the well log correspond to the same lava flow in space. If we consider figure 1, in a small area we would be able to link the green layer to other locals where the green layers lies at depth, and would be able to spatially autocorrelate them with the variogram.

The next step, one I narrowed down my area, would be to correlate the depth to water with the depth to every lava flow I found in the well log. This would allow me to see which lava layer best corresponds to the depth to first water.

One of the things I discussed with my partner was trying to figure out what the negative values meant in my variogram. As I stated above, I still need to think about this, or figure out what I did wrong. I also discussed taking data out of lat-long space and into UTM space; that is something I am also still thinking about.

One final note: At the moment my data is both clustered around certain spots, and I do not have much of it. Every time I add a few data points, the shape of the variogram changes.  Some of the spikiness I am seeing is likely from that.



Exercise 2: Neighborhood analysis of Texas counties with high colorectal cancer mortality rates

Question being asked

How do rurality indicator variables shift as distance increases away from Texas counties with high colorectal cancer (CRC) mortality?

In exercise 1, I used principal component analysis (PCA) to create a PCA-weighted rural index of the state of Texas using 3 scaled variables: population density, land development percentage, and median income. In this exercise I applied these same variables to determine how they change as distance increases away from the 4 Texas counties with the highest CRC mortality rates. To do this, I created multi-ring buffers around Anderson, Gonzales, Howard, and Newton county and computed averages of each rural indicator variable for each successive buffer “donut.” I hypothesize that as distance increases away from high CRC county centroids, rurality indicator measures will have more “urban” values (i.e. higher population density, higher percent developed, higher median income) and CRC mortality rates will decrease.

Tools and Data Sources Used

For this exercise, I utilized the intersection, feature-to-point, and multi-ring buffer tools in ArcGIS along with the latticeExtra/gridExtra plotting packages in R. The rural indicator data used in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). These data were then scaled using the same procedure from Exercise 1, which can be found here.  Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.


Attribute Table Wrangling: The Texas county indicator variables were linked to county polygons in my Exercise 1, but cancer mortality data was not. For polygon linkage in this exercise, I imported the mortality data excel sheet into Arc and used the join procedure to insert the data into the existing attribute table (with indicator variables) for county polygons.

Centroid & Multi-ring Buffer Creation: First, I utilized the point-to-feature tool in Arc to create a layer of county centroid points from the county polygon layer. Once the county polygons had been converted to centroids, I identified the 4 Texas counties with the highest CRC mortality rates. Then, using the select features tool and multi-ring buffer procedure, I selected each of the 4 counties separately and created multi-ring buffers at 50, 75, and 100 Km. These distances were chosen based on the size of the selected counties and the size of the full state of Texas.

Intersection & Donut Summary Statistics: Once the multi-ring buffer layers were created, I intersected each of the 4 buffer layers with the original county polygon layer containing all relevant variables. Then, mean via the summary statistics tool were computed in Arc for population density, percent developed land, and median income for each successive donut in the multi-ring buffers. The computed tables of buffer donut means for each variable and county were then exported to Excel files.

R Plotting: The Excel files were then imported into R and line plots of buffer means by distance were created using the xyplot function within the latticeExtra package. Plots were then combined into figures by county using the gridExtra package.


This figure of scaled population density means for county multi-ring buffer donuts indicates varying trends for population density between the 4 high CRC counties. Two counties have increasing population density as distance increases away from centroids, and two have decreasing population density as distance increases away from centroid. Only the buffer map for population density was presented above for post conciseness and space limitations on this blog site. More specific neighborhood relationships between CRC death rates and all indicator variables can be seen in the line plots and explanations below.

Line Plots of County Indicator Variables Over Buffer Distances

The above plots display with more specificity than the buffer map that areas surrounding the 4 counties have differences in indicator variable trends as distance increases away from county centroids. Both Anderson and Newton counties largely follow the trend hypothesized: as distance from the county centroids increases, rural indicator variables have more “urban” values and CRC mortality rate decreases. For Newton county, this trend does not hold for median income, because as CRC mortality decreases away from the county centroid, median income also decreases. For the other other 2 counties, Gonzales and Howard, the hypothesized relationship does not hold, because as distance increases away from county centroids, the rural indicator variables become more “rural” as CRC mortality decreases. This indicates that the associations between CRC mortality and rural indicator variables are complex and that neighborhood analysis does not capture all relationships.


This sort of neighborhood analysis was effective at determining trends in rural indicator variables in the areas surrounding high CRC mortality counties in Texas. The buffer map produced great broad results, where more generalized trends can be determined. The line plots specifically were highly useful for visualizing more specific changes in indicator variables and CRC mortality rates over distance. These results should be considered in the light of some limitations. First, county level data was used for all variables and the buffer donuts may be too large or too small to capture the true neighborhood relationships in the analysis, as statistical procedures were not utilized to determine the distances. Further, more buffer donuts may have been useful to see more nuanced trends over distance. Secondly, as can be seen in the buffer map, there is a lack of CRC mortality data for many counties in west and southern Texas due to data suppression in order to preserve patient confidentiality. This presents significant bias in result interpretation, especially in Howard county, where many of the counties surrounding it have suppressed CRC mortality data.

My future analysis of this data will likely be a comparative confusion matrix of the PCA-weighted index I created in Exercise 1 and CRC mortality data used in this exercise.

Exercise 2: Spatial Patterns of Perceptions of Natural Resource Governance and Environmental Condition

Filed under: 2019,Exercise 2 @ 2:25 pm
Tags: , ,


What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?


Environmental condition was represented as: (1) Environmental restoration sites, and (2) aggregate ‘environmental effects’ scores by census tract on a scale from (1) low to (10) high that include lead risk from housing, proximity to hazardous waste treatment storage and disposal facilities, proximity to nation priorities list facilities (Superfund Sites), proximity risk management plan facilities, and wastewater discharge.


Sub-questions related to the primary question are:

a1. Does being near a restoration site associate with individual perception? a2. Does being near more restoration sites associate with individual perception?

b1. Do the environmental effects where an individual lives associate with individual perception? b2. Do the environmental effects around an individual associate with individual perception?

Tools and Approaches

  1. Nearest Neighbor analysis in ArcGIS Pro and R studio
  2. Geographically weighted regression in R studio
  3. Neighborhood analysis in ArcGIS Pro

Analysis Steps

  1. Nearest neighbor analysis was used to examine questions a1 and a2. First, a file of restoration sites was loaded into R. The sites were points in the same projection as my participant location points. This analysis required four libraries (as the points were from different files. The libraries were: nlem, rpart, spatstat, and sf. To use the tool I needed, I first had to turn my points into point objects in R (.ppp objects). First I used the convenxhull.xy() function to create the ‘window’ for my point object, then I used that to run the ppp() function. After doing this for both sets of points, I was able to use the nncross() function. This function produced min, max, average, and quartile distances from individuals to the nearest restoration site. I added the ‘distance’ from the nearest neighbor as a variable in a linear regression to determine an association for a1.


To examine a2, I used these distances (1st quartile, median, and 3rd quartile) to produce buffers in ArGIS Pro. After creating the buffers, I ran spatial joins between them and the restoration sites. This produced an attribute table that had a join count—the number of restoration sites within the buffer. I exported the three attribute tables from the three buffer distances back to R. In R I ran a linear regression with join count as an independent variable.


  1. To test b1, I preformed geographically weighted regression. I used both ArcGIS Pro and R studio to run this analysis. Initially I used the GWR tool in Arc to run the regression, but wanted the flexibility of changing parameters and an easily readable output that R provides. First I joined individual points to my rank data, a shapfile at census tract level. This gave individuals a rank for environmental effects at their location. In R, I used two different functions to run GWR, gwr(), and gwr.basic(). The gwr() required creating a band using gwr.sel, and the gwr.basic required creating a band using bw.gwr. The difference between these functions is that gwr.basic produces p-values for the betas. I ran gwr on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals who I defined as those one standard deviation above and below the mean perception.


  1. I preformed neighborhood analysis to test the final question, b2. First, I created a dataset that was just the upper and lower values of my governance perceptions (one standard deviation above and below the means. I then added buffers to these points at 1 and 5 miles. I then joined the buffers in ArcGIS Pro to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I R I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.


  1. The median distance for individuals from restoration site was 0.037 degrees, 1st quartile was 0.020 degrees, and 3rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was insignificant (p = 0.198). This led to the conclusion that distance from the nearest restoration site does not influence perceptions.


For each regression on the number of sites near an individual, all coefficients were negative. This implies that the more sites near an individual, the more disagreeable their perception was. All produced significant results, but the effect size of number of sites near individuals was very minimal (Table 1).


Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size b p-value Effect Size (rpb)
Buffer 1 (1st quartile) -0.003 0.0181 .003
Buffer 2 (median) -0.002 0.045 .002
Buffer 3 (3rd quartile) -0.001 0.002 .003





  1. For the geographically weighted regression, I am presenting the results from the gwr.basic() model. For this model, I included demographic variables to control for these factors. In the model, rank, life satisfaction, years lived in the Puget Sound, and Race are significant (Table 2). All other variables were insignificant, so I will not discuss their trends. For rank (the main variable of interest), the coefficient was positive. In this case higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Life satisfaction is a variable of how satisfied individuals are with their life overall, which correlated positively with perception (Table 1). Years lived in the Puget Sound correlated negatively, or perception decreased the longer someone lives there (Table 1). Race, a dummy variable of white or not-white, indicated higher perceptions were held by white individuals.


Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 1).


Table 1. Regression results for environmental effects at individuals’ locations.

Variable1 b p-value2
Rank 0.077 0.007**
Life Satisfaction 0.478 <0.001**
Years Lived in the Puget Sound -0.010 0.002**
Sex -0.156 0.289
Area -0.150 0.179
Education -0.002 0.922
Income -0.022 0.622
Race 0.006 0.034*
Ideology 0.103 0.533

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level


  1. The plot of high and low governance values at the two buffers is presented below. The black and red curves represent respondents from the survey that were at least one standard deviation lower than the mean (the mean was neutral). The black curve is average rank with a one mile buffer, and the red curve is average rank at the five mile buffer. The green and blue curves represent respondents from the survey that were at least one standard deviation higher than the mean. The green curve is the average rank at the one mile buffer, and the blue curve is the average rank at the five mile buffer.

What this figure indicates is that there are two peaks in average rank at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects. The bimodal nature of these density functions leads me to believe there is some variable moderating governance perception related to environmental effects.


  1. Critique

The methods I used were all helpful in determining the spatial relation of environmental condition to perceptions of natural resource governance. However, switching between the two programs (ArcGIS and R) was a little bit of a hassle. R has greater flexibility when running analyses, as in running is easier. This meant it was ideal for the analyses that required me to tweak my models multiple times. I also ran into little problems with Arc in terms of loading the data which made everything run slower, but Arc is more intuitive in terms of finding and executing analyses/

Does the spatial arrangement of vegetation cover influence ventenata invasion?

Question Asked

To predict the invasion potential of a species, it is necessary to understand the spatial pattern of the invasion in relation to landscape scale variables. For exercise 2, I explored how the spatial pattern of invasion by the recently introduced annual grass, Ventenata dubia (ventenata) relates to the spatial pattern of vegetation cover categories throughout the Blue Mountain Ecoregion of eastern Oregon.

Tools and Approaches Used

To unpack this question, I performed a neighborhood analysis to explore how the proportion of different vegetation type cover differ at increasing distances from plots with high versus low ventenata cover.

The neighborhood analysis required several steps performed in ArcGIS and in R:

  • I split my sample plot layer in ArcGIS into two layers – one containing plots with only high ventenata cover (>50%) and one containing plots with only low cover (<5%).
  • I buffered each plot by 10m, 50m, 100m, 200m, 400m, and 800m using the “buffer” tool in ArcGIS and then erased each buffer layer by the preceding buffer layer to create “donuts” surrounding each sample points using the “erase” tool in ArcGIS (Fig. 1).
  • I brought in a vegetation cover categories raster file (Simpson 2013) that overlaps with my study area and used the “tabulate area” tool in ArcGIS to calculate the total cover of each vegetation type (meadow, shrub steppe, juniper, ponderosa pine, Douglas-fir, grand-fir, hardwood forest/ riparian, and subalpine parkland) that fell within each buffer for every point. I repeated this for high and low ventenata points.
  • Finally, I consolidated the tables in R and created a line graph with the ggplot2 package to plot how the proportion of vegetation type differed by buffer distance from point (Fig. 2). Cover represents percent cover of each vegetation type at each buffer distance. Error bars at each distance represent standard error. VEDU refers to the plant code for Ventenata dubia (ventenata).

I was also curious to how the high and low points differed from random points in the same area. To explore this I:

  • Created 110 random points that followed the same selection criteria of 1000m proximity to fire perimeter used to select the ventenata sampling points.
  • Repeated steps 2 through 4 above to graphically represent how vegetation cover differs as a function of distance from these random points in relation to low and high ventenata points (Fig. 3).

Results & Discussion

My analysis revealed that vegetation type differs between high and low ventenata sites and random sites within the study area. The high ventenata plots were located entirely in ponderosa pine and shrub steppe vegetation types, but as distance increased from the plots, the distribution of about half of the vegetation types became more evenly distributed (Fig. 2). Ponderosa covers over 75% of the high ventenata 10m buffer areas with shrub steppe making up the remaining 25%. However, as distance increased, ponderosa cover dropped sharply to under 35% at 400m. Shrub steppe gradually declined throughout the 800m distance, and was surpassed by grand fir and Douglas fir by 800m. Meadows covered about 10% of the 50m buffer but declined to about 5% by the 400m buffer. The remaining vegetation types, juniper, riparian, and subalpine fir, were consistently under 5% cover throughout the buffer analysis.

In the low ventenata sites, shrub steppe vegetation was the most dominant, but the distribution was spread more evenly across the vegetation types than in the high ventenata sites (Fig. 2). Shrub steppe vegetation droped from 45% to 30% from the 10m to the 50m buffer, and then remained relatively constant throughout the remaining buffer distances. Like the high ventenata sites, grand-fir gradually increased in cover throughout, becoming the most dominant vegetation type of the 800m buffer. Unlike the high sites, ponderosa pine made up only about 10% of each buffer. Riparian vegetation was the only cover type that remained 0 throughout all the buffers.

In the random sites, the distributions of vegetation type were steady throughout the 800m, with only small fluctuations in cover with increasing distances (Fig. 3). Shrub steppe vegetation type was the highest at about 30% throughout, followed by juniper, ponderosa pine, and grand fir at about 20% cover.

This analysis demonstrates that ventenata could be dependent on specific vegetation types not only at the sample location, but also in the vicinity surrounding the sample area. This is evident in the high ventenata analysis where ponderosa pine cover remains much higher than the low sites and the random sites throughout the 800m buffered area. This analysis also depicts my sample bias as it demonstrates which community types I was targeting for sampling (shrub steppe and dry forest communities), which may not be representative of the area as a whole (as demonstrated in the random points analysis).

Critique of Method

The neighborhood analysis was a useful way of visualizing how vegetation type changes with distance from high and low ventenata points and may have helped uncover the importance of large areas of ponderosa pine as a driver of invasion; however, the results of the analysis could be a relic of my sampling bias towards shrub steppe and dry forest communities rather than an absolute reflection of community drivers of ventenata. The vegetation layer that I used was also not as accurate or as detailed as I would have liked to capture the nuance of the different shrub steppe and forest community types that I was attempting to differentiate in my sampling. If I were to do this again, I would try to find and use a more accurate potential vegetation layer with details on specific community attributes. Additionally, the inclusion of error bars was not possible using the “multiple ring buffer” tool in ArcGIS, so, I instead had to make each buffer distance as a separate layer and erase each individually to maintain the variation in the data.  I like the idea of the random points as a sort of randomization test; however, more randomizations would make this a more robust test. With more time and more knowledge of coding in ArcGIS/ python, I would attempt a more robust randomization test.


Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

On the relationship between spatial variation of economic losses and tsunami momentum flux


For Exercise 1, I evaluated the spatial pattern of economic losses resulting from a joint earthquake and tsunami event. I then deaggregated the losses by hazard (earthquake only, tsunami only, and combined) as well the intensity of the event.

For Exercise 2, I evaluated how the spatial pattern of economic losses resulting from a tsunami relates to the spatial pattern of tsunami momentum flux (a measure of velocity and inundation depth) by performing a geographically weighted regression (GWR). For this analysis, I only considered the tsunami because there is significant spatial variation in the hazard, whereas the spatial variation for the earthquake is minimal.

Tool and approach

I performed the GWR using the python library PySAL (Python Spatial Analysis Library). The independent variable was defined as the momentum flux, and the dependent variable defined as the percent loss of economic value.

Description of steps

The average losses at each building resulting from an earthquake/tsunami loss model were first converted to percent loss (loss divided by real market value), and added as an attribute to a GIS shapefile. The percent loss was used as opposed to the economic losses because each building has a different initial value. Consequently, the percent loss serves to normalize the economic losses across all buildings within Seaside. For this analysis, the results from the “1000-year” tsunami event were analyzed.

The GWR was then performed using PySAL with the momentum flux defined as the independent variable and the percent loss defined as the dependent variable. The GWR resulted in a slope and intercept at each tax lot, as well as a global r2 value. Two separate maps were generated wherein each tax lot was color coded based on values of the slope and intercept.

Description of results

The results from the GWR and a global regression are shown in Figures 1 and 2 respectively. A global r-squared value of 0.575 was obtained, indicating that the data is moderately correlated. In Figure 1, it can be seen that the intercept is larger near to the ocean, and decreases as the distance to the shore increases. This can be explained by the fact that the momentum flux is the largest near to the coast, and decreases as the tsunami propagates over the land.

Similar trends would be expected for the slope coefficients; however, it can be seen that along the coast the results are negative indicating that the economic losses decrease as the momentum flux increases. This can likely be explained by inconsistent building types within Seaside. For example, concrete buildings are able to better withstand the impact of a tsunami compared to their wood counterparts. Similarly, buildings of different heights (number of floors) have different damage properties. Consequently, because the building types are not consistent within Seaside, significant variations in the percent of loss within a small spatial region can occur (e.g. a wood building is located next to a concrete building). This would lead to a decrease in percent loss for a larger momentum flux.

Figure 1: Spatial variation of slope and intercept resulting from the GWR

Figure 2: Global regression and line of best fit


While the GWR does provide a means to evaluate correlation between two variables that are within the same geographical region, there are limitations for this particular application. The results showed negative slopes in some locations, which is likely caused by the large variation in the percent loss. To alleviate this, alternative statistical models could be developed using GWR that only consider similar building types. An example of a non-spatial regression for wood buildings with 2 and 3 floors can be seen in Figure 3. The improvement in r-squared values can be observed, and would likely translate to the GWR.

Figure 3: Example of global regression considering specific building types


May 3, 2019

Fire Refugia’s Effects on Clustering of Infected and Uninfected Western Hemlock Trees


For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

In my study site, 2 fires burned. Once in 1829, burning most of the stand, and then again in 1892, burning everywhere except the fire refugia (polygons filled in blue). This created a multi-storied forest with remnant trees located in the fire refugias. One component of the remnant forest are infected western hemlocks. These remnant hemlocks serve as the source of inoculum for the hemlocks regenerating after the 1892 fire.

For Exercise 2, my research question was: How does the spatial pattern of fire refugia affect the spatial pattern of western hemlock dwarf mistletoe?

I predicted that a cluster of infected western hemlocks are more likely to be next to a fire refugia than a cluster of uninfected trees. In order to assess this relationship, I used the geographically weighted regression tool in ArcMap.

Geographically Weighted Regression

Geographically weight regression (GWR) works by creating a local regression equation for each feature in a data set you want to analyze, using an explanatory variable(s) to predict values for the response variable, using the least squares method. The Ordinary Least Squares (OLS) tool differs from GWR because OLS creates a global regression model (one model for all features) whereas GWR creates local models (one model per feature) to account for the spatial relationship of the features to each other. Because the method of least squares is still used, assumptions should still be met for statistically rigorous testing. The output of the GWR tool is a feature class of the same type as the input, with a variety of attributes for each feature. These attributes summarize the ability of the local regression model to predict the actual observed value at that feature’s location. If you have an explanatory variable that explains a significant amount of the variation of the response variable, this is useful for seeing how its coefficient varies spatially.

Execution of GWR

To use this tool, I quantified the relationship between the trees and the fire refugia. I used the “Near” tool for this to calculate the nearest distance to a fire refugia polygon’s edge. This was my explanatory variable. My response variable was the z-score that was output for each tree from the Optimized Hot Spot Analysis. Then I ran the GWR tool. I then used the Moran’s I tool to check for spatial autocorrelation of the residuals. This is to check the clustering of residuals. Clustering indicates I may have left out a key explanatory variable. The figure below displays my process.

I tested the relationship between nearest distance to a fire refugia polygon’s edge and the z-score that was output for each tree from the Optimized Hot Spot Analysis using OLS, which is necessary to develop a well specified model. My R2 value for this global model was 0.005, which is incredibly small. Normally I would have stopped here and sought out other variables to explain this pattern, but for this exercise I continued the process. 


This GWR produced a high global R2 value of 0.98 (Adj R2 0.98) indicating that distance to refugia does a good job of explaining variance in the spatial pattern of infected and uninfected trees. However, examining the other metrics for the local model performance gives a different picture of model performance.

Map 2 displays results for the coefficients for the explanatory variable of distance to nearest refugia. As this variable changes, the z-score increases or decreases. These changes in z-scores indicate a clustering of high or low values. From examining the range of coefficient values, the range is quite small, -0.513 to 0.953. This means that across my study site, the coefficient only changes slightly from positive to negative. In the north western corner, we see a cluster of positive coefficient values. Here, as distance to refugia increases, the z-score of trees increases, predicting a clustering of infected trees. These values are associated with high local R2 values (Map 4). In other places of the stand we see slight clustering of negative coefficients, indicating distance to refugia decreases the z-score of trees, predicting a clustering of uninfected trees.

Map 3 displays the standardized residuals for each tree. Blue values indicate where the local model over-predicted what the actual observed value was, and red values are under-predictions. When residuals from the local regression models are distributed randomly (i.e. not clustered or dispersed) over the study area, then the geographically weighted regression model is fit well, or well specified. The residuals of the local regression models were significantly clustered. (Moran’s Index of 0.265, p-value of 0.000, z-score of 24.344). Because we can observe clustering in my study area of residuals, there is another phenomenon driving the changes in z-scores; in other words, driving the clustering of infected and uninfected trees.

From the previous two map evaluations I saw that the distance of a tree to fire refugia was not the only explanatory variable necessary to explain why infected and uninfected trees clustered. Map 4 displays the local R2 values for each feature. The areas in red are high local R2 values. We see the northwestern corner has a large number of large values which correspond to a cluster of small residuals and positive coefficients. Here, distance to fire refugia explains the clustering of infected trees well. The reverse is observed in several other places (clusters of blue) where distance to fire refugia does not explain why infected or uninfected trees cluster. In fact the majority of observations had a local R2 of 0.4 or less. From this evaluation, I believe this GWR model using distance to refugia does a good job of explaining the clustering of infected trees, but not much else.


GWR is useful for determining how the coefficient of an explanatory variable can change across an area. One feature in a specified area may have a slightly different coefficient from another feature, indicating these two features are experiencing different conditions in space. This allows the user to make decisions about where the explanatory has the most positive or negative impact. This result is not something you can derive from a simple OLS global model. This local regression process is something you could do manually but the tool in ArcMap makes this process easy. The output of GWR is also easy to interpret visually.

Some drawbacks are that you need to run the OLS model first for your data to determine which variables are significant in determining your response variable. If not, then a poorly specified model can lead to inappropriate conclusions about the explanatory variable (i.e. high R2 values). Also, the evaluation of how the features interact in space is not totally clear. The features are evaluated within a fixed distance or number of neighbors, but there is no description for how weights are applied to each neighboring feature. Lastly, for incidence data, this tool is much harder to use if you want to determine what is driving the spatial pattern of your incidence data. Some other continuous metric (in my case a z-score) must be used as the response variable, making results harder to interpret.

Model Results Follow-Up

After finding that distance to a refugia was not a significant driver for the majority of trees, I examined my data for other spatial relationships. After a hotspot analysis on solely the infected trees, I found that the dispersal of infected trees slightly lined up with the fire refugia drawn on the map (Map 5).

Among other measures, forest structure was used to determine where fire refugia were located. Old forest structure is typically more diverse vertically and less clustered spatially. Also infected western hemlocks are good indicators of fire refugia boundaries because as a fire sensitive tree species, they would not survive most fire damage and the presence of dwarf mistletoe indicates they have been present on the landscape for a while. From the map we can see that the dispersal of infected trees only lines up with the refugia in a few places. This mis-drawing of fire refguia bounds may be a potential explanation for under-performance of the GWR model.

May 2, 2019

Courtney’s EX2: Comparing faults and principle components

Filed under: 2019,Exercise 2 @ 10:49 am

Question asked in this exercise:

How does ion principle component 2 at a well vary with the well’s distance from faults along the groundwater flow path?

In EX1, I used principle component analysis to evaluate how parameters accounted for variance between the wells I studied. Based on my knowledge of how chemistry varies as water flows through the basalt, ion PC2 accounts for variance caused by ion exchange between the basalt and the groundwater, with increasing sodium ion concentration/pH and decreasing calcium/magnesium ion concentrations as the water spends more time underground.

In this exercise, I estimated the groundwater flow directions in my study area using interpolation, calculated fault incidence direction, calculated angular difference between flow direction and fault direction, and then manual measurement of the distance between each well and the distance to the nearest fault segment that had flow direction within 45 degrees of parallel to the fault along the estimated flow path.

Name of tool or approach used:

Interpolation, reclassification, raster math, distance measurement in ArcGIS Pro


Input data:

  • 2018 static water levels in wells provided by Oregon Water Resource Department (OWRD)
  • Well lithology from OWRD groundwater information system (GWIS)
  • Well seal depth from well logs accessed through OWRD GWIS
  • Fault polyline shapefile from Madin and Geitgey 2017
  • Well locations from OWRD database, with ion concentration information based on my sampling in the summer of 2018


  1. Classified wells in the static water level dataset by the basalt formation that they were open to, based on lithology and seal depth. Excluded wells that lacked this information.
    1. Output: wells classified as open to Saddle Mountain Basalt (Smb) Wanapum Basalt (Wb), Grande Ronde Basalt (Grb), or both Wanapum and Grande Ronde Basalt (WbGrb). These classifications were joined to the static water level information.
  2. Created an interpolation surface for static water levels of wells open to both the Wanapum and Grande Ronde. I ignored the Saddle Mountain formation, since the wells that I had sampled were not open to it. I used kriging with a cell size of 200 ft, and this created an estimated potentiometric surface for these wells. The interpolation was a bit ugly because my wells were not ideally distributed for this.
    1. I tried creating interpolations based on other combinations of formations to better approximate the potentiometric surfaces posited by past studies in the region. I ended up creating two potentiometric surfaces: one using wells that were only open to the Wanapum Basalt (Wb), and a second using wells that were open only the Grande Ronde Basalt as well as those that were open to both the Grande Ronde Basalt and Wanapum Basalt(WbGrb_Grbonly).
  3. Calculated flow direction in the two aquifer groups – Wb and WbGrb_Grbonly
    1. Used the Hydrology toolbox to fill sinks and then calculate flow direction.
    2. This creates out a raster with eight possible values between 1 and 256.
    3. I then reclassified it so the values corresponded to the eight primary cardinal directions (N, NE, E, SE, S, SW, W, NW), which range from 0 to 360 degrees
  4. Calculated fault incidence angle
    1. Added a cell to the polyline attribute table called “angle”
    2. Split the polyline at each vertex, which creates a new shapefile
    3. Used the field calculator and a python script to assign an angular value between 0 and 359 to each fault polyline segment
    4. Performed the polyline to raster function, using the angle as the cell value. I used a cell size of 200 ft.
    5. This created a raster where only pixels that include part of a fault polyline had values, and those values ranged between 0 and 359.
  5. Subtracted the flow direction raster from the fault incidence angle raster, in order to create fault line pixels that had values that reflected the difference between the fault incidence angle and flow direction. I did this twice, once each for the Wb and WbGrb_Grbonly rasters
    1. Reclassified this raster so that pixels with fault direction within 45 degrees of parallel to the flow direction were 0, and the pixels with fault direction with 45 degrees of perpendicular to the flow direction were 1.
    2. I then added the WB and WbGrb_Grbonly results together, so that pixels with fault direction within 45 degrees of parallel to the flow direction in both were 0, and the pixels with fault direction within 45 degrees of perpendicular in either raster to the flow direction were 1, and pixels with fault direction within 45 degrees of perpendicular in both rasters were 2. I named this allwbgrb_ff.
  6. On a map layout, I added my sampled sites with PCA data, the interpolated potentiometric surface for wells open to the Wanapum and Grande Ronde aquifers, and allwbgrb_ff.
    1. I added a field to the sampled sites with PCA data called “dist_from_fault”
    2. Using the measure tool, I measure the distance from each well along the path most perpendicular to potentiometric contours to the nearest allwbgrb_ff pixel with a value of 1 or 2.
    3. This was subjective because the potentiometric surface is imperfect because of the erratic spacing of wells. In areas where the potentiometric surface had noticeable glitches, I used my own judgement based on topography and literature about groundwater flow direction in the region.
  7. I then graphed the “dist_from_fault” against the ionsPC2 category.

Discussion and Results:

I hypothesized that ion PC2 would increase with decreasing potentiometric surface elevation. An increased score in ionPC2 indicates an elevation sodium concentrations and pH and a decrease in calcium and magnesium caused by a progressive ion exchange reaction between the groundwater and the basalt. Because water flows from higher potentiometric elevations to lower potentiometric elevations, I would expect water samples from lower potentiometric elevations to show chemical evidence of increased interaction with the basalt. If this process of down-gradient groundwater flow were the only process influencing the ion exchange reactions, the well symbols on the map below would become progressively darker as the interpolated well level surface elevation decreased and the wells were further from the up-gradient recharge zone.

However, upon examining a map of ion PC2 values this is not the case – there are anomalously low values of PC2 in the valley, where one would expect to see increased values if groundwater flowed uninterrupted from the up-gradient recharge zones. In this study I introduced the variable of distance from faults in order to test another hypothesis: that faults compartmentalized groundwater flow, blocking lateral flow through the aquifer while promoting vertical permeability and modern recharge into the down-gradient aquifer. I also hypotehsized that if a fault was a barrier, PC2 values up-gradient of the fault would be elevated as the fault trapped water behind it. This would result in more chemically evolved groundwater backed up behind the fault, and less chemically evolved groundwater down-gradient of the fault.

The results of this study tentatively support the conceptual model of fault compartmentalization. In particular, water samples from wells in the valley down-gradient of the fault zones have evidence of less exposure to the basalt than wells further towards the recharge zones. 15 of the 18 wells sampled show a positive correlation between distance from a fault along a flow path and ion PC2 score, especially when graphed points are compared to their up-gradient and down-gradient neighbors (i.e. 57946 being up-gradient from 57235 and down-gradient from 54277).

Because of the width of the raster cells indicating faults and their flow direction in this model, four wells ended up have a distance from faults of 0. This does not seem unreasonable, because examination of exposed fault zones in the area indicated that many are up to a couple hundred meters wide. Additionally, while geological studies of the area indicate that the faults are close to vertical, their exact dip angles are unknown and this introduces a certain amount of uncertainty about their location at depth. Ion PC2 values of wells with dist_from_fault = 0 show an interesting dichotomy of either very high (4167, 4179) or very low (3929, 3962) values. I believe this indicates that wells 4167 and 4179 are up-gradient of faults that are acting as barrier, while 3929 and 3962 are slightly down-gradient or within the fault damage zone itself.

map of fault locations, potentiometric surface, and wells, plus a chart of distance vs ion PC2

Critique of Method:

This is admittedly a crude method of estimating groundwater flow direction. However, a certain amount of estimation is necessary to model a system with relatively few and unevenly spaced measured water level points in a hydrogeologic regime with many individual water-bearing layers of lava interflow zones that are unpredictably connected. I wish I had been able to find an automated way to measure well distance from faults along the groundwater flow path, which would have taken away some of the subjectivity of measuring the flow paths by hand the old-fashioned way.

Because of the uneven spacing of wells used for interpolation, the flow direction rasters that I created have glitchy areas and some of these areas of physically improbable values influenced the Fault and GW Flow Direction raster. I reclassified that raster into broad categories to avoid creating a false sense of precision.

April 28, 2019

Determining suitable habitat for Olympia oysters in Yaquina Bay, OR

Exercise #1

Question that you asked:
My goal for my thesis work is to evaluate the distribution of native Olympia oysters in Yaquina Bay, Oregon by assessing habitat suitability through spatial analysis of three habitat parameters: salinity, substrate availability, and elevation. A map of predicted suitable habitat as a result of the spatial analysis will be compared with field observations of oyster locations within Yaquina Bay. The main research question I am examining in this project is:

How is the spatial pattern of three habitat parameters (salinity, substrate, elevation) [A] related to the spatial pattern of Olympia oysters in the Yaquina estuary [B] over time [C]?

For this blog post, I will be evaluating the [A] portion of this question and the three habitat parameters simultaneously to identify where habitat is least suitable to most suitable. To better understand the spatial pattern of the habitat parameters, I am evaluating a raster layer for each parameter, then combining them to determine where overlap between the layers shows the best environmental conditions for oysters to survive.

Name of the tool or approach that you used:
For this portion of my research analysis, I wanted to be able to make an educated guess about where the best and worst habitat for Olympia oysters would be located within Yaquina Bay by ranking different subcategories within each of the salinity, substrate, and elevation datasets.

To do this, I started by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer in order to apply a ranking value. The following table is a compilation of that information:

Habitat parameter Subcategories Subcategory variable range Olympia oyster tolerance Rank value applied
Mean wet-season salinity (psu) Upper estuary < 16 psu somewhat, but not long-term 1
Upper mid estuary 16.1 – 23 psu X 4
Lower mid estuary 23.1 – 27 psu X 3
Lower estuary > 27 psu somewhat 2
Substrate availability 1.2 Unconsolidated mineral substrate possible 3 Gravelly mud unlikely 2 Sandy mud no 1
2 Biogenic substrate yes 4
3 Anthropogenic substrate yes 4
3.1 Anthropogenic rock yes 4
3.1.2 Anthropogenic rock rubble unlikely 2
3.1.3 Anthropogenic rock hash no 1 Unclassified uncertain
Bathymetric depth (compared to MLLW) 1.5 – 2.5m supratidal no 1
1 – 1.5m supratidal no 1
0.5 – 1m intertidal maybe 2
0 – 0.5m intertidal yes 3
-2 – 0m intertidal yes 4
-3 – -2m subtidal yes 4
-4 – -3m subtidal yes 4
-6 – -4m subtidal yes 4
-8 – -6m subtidal yes 3
-12.5 – -8m subtidal yes 3

Once I established my own ranking values, I decided to use the ‘weighted overlay’ function, found within the Spatial Analyst toolbox in ArcGIS Pro. Weighted overlay applies a numeric rank to values within the raster inputs on a scale that the ArcGIS user is able to set. For example, on a scale from 1-9 ranking 1 as areas of least fit and 9 as areas of best fit. This can be used to determine the most appropriate site or location for a desired product or phenomenon. I used the ranking value scale 1-4 where 1 indicates the lowest suitability of subcategories for that parameter and 4 indicates the highest suitability.

Brief description of steps you followed to complete the analysis:

To apply the weighted overlay function:

  1. Open the appropriate raster layers for analysis in ArcGIS Pro. Weighted overlay will only work with a raster input, specifically integer raster data. Here, I pulled all three of my habitat parameter raster layers from my geodatabase into the Contents pane and made each one visible in turn as I applied the weighted overlay function.
  2. In the Geoprocessing pane, type ‘weighted overlay’ into the search box. Weighted overlay can also be found in the Spatial Analyst toolbox.
  3. Once in the weighted overlay window within the Geoprocessing pane, determine the appropriate scale or ranking values for the analysis. I used a scale from 1-4, where 1 was low suitability and 4 was high suitability.
  4. Add raster layers for analysis by selecting them from your geodatabase and adding them into the window at the top left. To add more than one raster, click ‘Add raster’ at the bottom of the window.
  5. Select one of the raster inputs and see the subcategories for that raster appear on the upper right. Here, ranking values within the predetermined can be individually applied to the subcategories by clicking from a drop-down list. Do this for each subcategory within each raster input. I ranked each subcategory within each of my habitat rasters according to the ranks listed on the table above.
  6. Determine the weights of each raster input. The weights must add up to 100, but can be manipulated according to the needs of the analysis. A raster input can be given greater or lesser influence if that information is known. For my analysis, I made all three of my habitat raster inputs nearly equal weight (two inputs were assigned a weight of 33, one was weighted 34 to equal 100 total).
  7. Finally, run the tool and assuming no errors, an output raster will appear in the Contents pane and in the map window.

Brief description of results you obtained:

The first three images show each habitat parameter weighted by suitability, with green indicating most suitable and red indicating least suitable.

Salinity —

Bathymetry —

Substrate —

The results of the final weighted overlay show that the oysters are most likely to be in the mid estuary where salinity, bathymetry, and substrate are appropriate.


Critique of the method – what was useful, what was not?:

The weighted overlay was a simple approach to combining all of the raster layers for habitat and creating something spatially meaningful for my research analysis. The areas indicated in green in the resulting map generally reinforce what was found in the literature and predicted by local shellfish biologists. While the weighted overlay tool did generate a useful visual, it is highly dependent on the quality of the raster inputs. In my analysis, the detailed resolution of the bathymetry layer was very helpful, but the substrate layer is a more generalized assessment of sediment types within Yaquina Bay. It doesn’t show the nuances of substrate availability that might be important for finding exactly where an opportunistic species like Olympia oysters might actually have settled. For example, in Coos Bay Olympia oysters have been found attached to shopping carts that have been dumped. The substrate raster is a generalized layer that uses standardized subcategories and does not pinpoint such small features.

Additionally, the salinity layer is an average of wet-season salinity, but it can change dramatically throughout the year. Some in situ measurements from Yaquina Bay this April showed that the surface salinity with the subcategory range of 16-23 psu were actually <10 psu. While it is more reasonable to generalize salinity for the purposes of this analysis, it is important to note that the oysters are exposed to a greater range over time.

This spatial information serves as a prediction of suitable oyster habitat. The next step is to compare this predicted suitability to actual field observations. I’ve recently completed my first round of field surveys and will be analyzing how closely the observations align with the prediction in Exercise #2.

April 26, 2019

Spatial Pattern of NDVI change from the Center of an Artisanal Gold Mine

Filed under: Exercise 2 @ 5:22 pm

For Exercise 2, I wanted to explore how the spatial distribution of land use/land cover (LULC) varied in relation to an artisanal, small-scale gold mine (ASGM). To do this, I took the NDVI change maps which I had generated for Exercise 1 (modified slightly to produce better, more accurate results), as well as my dataset of mapped footprints of known areas of ASGM activity, found the median center of those areas, generated buffers/donuts at 250m apart, from 250m to 2000m, clipped the NDVI change layer to each buffer, and counted the amount of each type of NDVI pixel contained within each buffer. In addition, I performed this same analysis around non-mining villages, to examine how the spatial pattern of NDVI loss changes with distance from the center of non-mining villages. With this data, I could generate a chart showing how the percentage of pixels representing loss or decrease in NDVI changed as you moved further away from the center of mining activity. This examination looked at nine mining villages and nine non-mining villages

In order to perform this analysis, I continued my work using ArcGIS Pro, in addition to Excel for a bit of charting work.

To begin, I imported my NDVI change map, which detailed increases and decreases in NDVI values between imagery taken in April 2018 with Landsat 8, and imagery taken in April 2007 with Landsat 5, representing 11 years of NDVI change. I also imported my shapefile containing polygons which I had digitized over my high-resolution satellite imagery depicting areas of known ASGM activity. With this shapefile, I used ArcGIS’s Median Center tool, which found the median center of the mining areas near the village of Douta (fig. 1). From there, I generated buffers/rings at 250m intervals (e.g. 0-250m, 251-500m, 501-750m, etc.), from 250m to 2000m around this median center. I then used the clip tool to clip the overall NDVI change layer to each buffer, resulting in NDVI change for 0m to 250m from center, NDVI change for 251m to 500m from center, and so on.

Fig 1: Center of the village of Douta with a 2km NDVI buffer; the blue buffer shows the 1750m extent, whose values were subtracted from the 2000m values, to only give values between 1751 and 2000m

Once I had accomplished this, I generated histograms for each buffered NDVI change layer in order to count the amount of pixels contained within each buffer, and assign them to one of two classes: negative values, representing overall NDVI decrease from 2007 to 2018, and positive values, representing overall NDVI increase from 2007 to 2018. I did not account for magnitude of change, as I wanted a general idea of how NDVI was changing from the center. Fig. 2 shows an example of these histograms, specifically for the 500m buffer. The values from the previous buffers, e.g. the one to 250m, were subtracted to only show values from 251 to 500m.

Fig. 2: Douta histogram

I entered all the pixel values for negative change at each distance into Excel, as well as all of the pixel values for positive change, and was able to generate a chart showing how the percentage of the overall pixels in each subsequent buffer from center representing NDVI decrease change over distance (fig. 3)

Fig. 3: Pale blue lines indicate individual mining villages. The dark blue line indicates the average of those villages. Thin red lines indicate non-mining villages. The dark red line indicates the average for those villages.

On the whole, this exercise was useful for illustrating the problem I’m attempting to grapple with. I was frustrated with the lack of Landsat imagery from the late 2000s — I was unable to find any Landsat 5 imagery corrected for surface reflection aside from the year 2007. Additionally, there are problems with comparing this 2007 image to the 2018 image. I found that the rainy season before the 2018 image was taken, in 2017, was wetter than average, while I was unable to determine the rain fall that preceded the image in 2007. As such, it is possible that the 2018 Landsat 8 image shows a non-normal vegetative pattern — or it’s even possible that the 2007 image is showing a non-normal vegetative pattern! I require some more investigation into the historical meteorology of the area before I can say either way. Regardless, I feel that this is a useful first step in investing how LULC change relates to the establishment of ASGM.

What is the spatial pattern of the ground cover vegetation within 100 m, 500 m and 1 km of 20 wetland sites?

Filed under: 2019,Exercise 1 @ 11:58 am

Exercise 1: What is the spatial pattern of the ground cover vegetation within 100 m, 500 m and 1 km of 20 wetland sites?

To determine these spatial patterns I first imported the latitudinal and longitudinal locations of my wetland sites as a point layer by adding a csv file with this data from the catalog to my map and choosing the Display X, Y Data option after right clicking on the imported file.
Display X, Y Data: X= longitude
Y= latitude

I then created 3 buffers around each point using the Buffer tool in spatial analyst. The buffers created had a 100m radius, a 500m radius, and a 1km radius.
Buffer: Input vector= wetland site points
Buffer= 100 (then 500, then 1000)
Unit= Meters

I used Clip in raster processing to clip my raster layer to the extent of my largest buffer layer so that the computer did not have to process so much extra raster data moving forward.
Clip: Input= raster layers
Clip to= 1kmBufferLayer

Then I created an NDVI raster layer from Landsat data obtained by the USGS Earth Explorer. I downloaded and then imported data from 1995 and 2019. The LIDAR data from 2019 was collected via the LandSat 8 satellite which collects, among other things, Near Infra-Red (NIR) and red wavelength reflectance. NIR is categorized as Band 5 and Red as Band 4. To create an NDVI layer from this data I had to rescale the Band 5 and Band 4 raster layers because the USGS self-correcting algorithm overcorrected for atmospheric disturbances. To do this I used to Raster Calculator in spatial management. This resulted in 2 new raster layers within the acceptable range (0-1000) for Band 5 and Band 4 of the 2019 LIDAR data. I did not have to rescale the 1995 data because there was no self-correcting algorithm used in that data.
Raster Calculator: Query= [(grid – Min value from grid) * (Max scale value – Min scale
value) / (Max value from grid – Min value from grid)] + Min scale value

I used the Raster Calculator again to create the final NDVI layer for 1995 and one for 2019 using the NIR and red wavelength data layers. The 1995 data was collected via LandSat 4 which categorizes NIR wavelengths as Band 4 and red wavelengths as Band 3.
Raster Calculator: Query= ((“NIR_Layer”-“red_Layer”)/(“NIR_Layer” + “red_Layer”))

2019 NDVI result below

I then used the Raster to Point tool to turn both the 1995 NDVI and 2019 NDVI data into a polygon layer.

Raster to Point: NDVI layers to point

Finally, then I used the Clip in spatial analyst tool to clip the polygon NDVI layers to the buffer layers I created earlier.
Clip: Clip “NDVI layers” to “Buffer layers”

Now I have the land cover type (in the form of a number signifying vegetation density) information for all of my sites within all of my buffer zones.

Ripley’s K analysis on two forested stands

Bryan Begay

  1. Question asked

Can a point pattern analysis help describe the relationship between the spatial pattern of trees in my area of interest with forest aesthetics? More specifically, how does Ripley’s K function describe forest aesthetics in different parts of the forest on the McDonald-Dunn Forest.

  1. Tool

The main tool that I used for the point pattern analysis was the Ripley’s K function.

  1. Steps taken for analysis

My workflow involved doing preprocessing of the LiDAR data, then creating a canopy height model to obtain an individual tree segmentation. The individual tree segmentation would then allow me to extract tree points with coordinates that could be usable points for the Ripley’s K Function.


LiDAR preprocessing

I started off with finding my harvest unit area of interest (Saddleback stand) and finding a nearby riparian stand that would be used to compare the Ripley’s K function outputs.   I create polygons to clip the LiDAR point clouds onto. I found the LiDAR files that were over the AOIs and used the ArcMap Create LAS Dataset (Data Management) to make workable files, then clipped the data sets to the polygons using the Extract LAS (3D analyst) tool. Fusion was used to merge the clipped LiDAR tiles to make one continuous data set for both AOIs. Then I normalized the point cloud with FUSION by using a 2008 DEM raster from DOGAMI, and the FUSION tools ASCII2DTM and Clipdata.


CHM, tree segmentation, and Ripley’s K

With the normalized point cloud, a canopy height model (CHM) was created in R-studio, and then an individual tree segmentation was made with an R package called lidR by using a watershed algorithm. The segmented trees were exported as a polygon shapefile that could be used in ArcMap. The Feature to Point tool (Data Management) was used to calculate the centroid of the polygons to identify individual trees as points. The points could then be used in RStudio spatstat package to be used in a Ripley’s K Function. The function was calculated for both Saddleback stand and a nearby riparian area.

  1. Results

The results show that the pattern for both the Saddleback stand and the riparian area were clustered. Both stands observed lines were plotted above the expected line for a random spatial pattern.  The lines were significantly different, being above the higher confidence envelope. The riparian stand has higher levels of clustering compared to Saddleback stand. The Saddleback stand showed a plotted clustering pattern as well, but not to the degree of the riparian stand.


  1. Critiques

Some critiques for my analysis would be to use a more robust individual tree segmentation algorithm analysis. For the sake of processing speed and creating delineated polygons with reduced noise, I used a resolution of 1 meter for my CHM. The 1 meter resolution for my CHM smoothed over the tree segmentation, possibly removing potential tree polygons but creating more defined segmented trees. The CHM lower resolution was used with a relatively simple watershed algorithm. Past algorithms I’ve used showed better results than watershed but required more detailed inputs. Another criticism I have is that using the feature to point does not necessarily give me the tree tops, but finds the centers of polygons that the tree segmentation identified as individual trees. Finding a more robust method for determining tree points would be more preferable.

April 25, 2019

Recession Analysis in Two Coastal Basins

Filed under: 2019,Exercise 1 @ 3:21 pm

Research Question

The question that I asked for this exercise was: What is the spatial variability of the recession timescale at different flow rates in rain-dominated, coastal basins?


Stream flow regimes are generally defined by five components: magnitude, frequency, duration, timing, and rate of change (Poff et al., 1997). For the purposes of this exercise, I am investigating the rate of change (or “flashiness”) component of flow regime in two river systems in the Oregon Coast Range: the Nehalem and Alsea River watersheds. I am analyzing recession behavior in these two systems to quantify a rate of change metric. I used the recession curve method, largely developed by Brutsaert and Nieber (1977), and later built upon by Krakauer et al. (2011) among many others. Recession curves describe the rate at which streamflow recedes in various streamflow conditions. In more general terms, recession curves provide an indication of watershed storage and groundwater behavior.


To complete the recession analysis in the Nehalem and Alsea watersheds, I followed the following steps:

  1. Downloaded streamflow data for the available period of record in 1-hour observation intervals using the dataRetrieval and EGRET (both developed by the USGS) packages in R.
  2. The data were in cubic-feet-per-second (cfs) units, which I converted to unit discharge (mm/hour) using respective basin areas.
  3. For recession analyses, only data points with insignificant precipitation are viable. Therefore, all time steps when precipitation was greater than 10% of total streamflow were removed. To do this, local precipitation data was necessary, however it is often hard to come by. As a result, the precipitation data sources for the Nehalem and the Alsea are different and the methods diverge slightly hereafter.
    1. Nehalem precipitation data was sourced from the USGS Vernonia site, which is upstream from the mainstem river gauging site where streamflow data for this analysis was sampled. For the purposed of this exercise, I assumed that rainfall (measured in inched at 30-minute intervals) was spatially consistent across the basin. For a more precise estimate of precipitation, multiple, spatially distributed rain gages would be needed.
    2. Alsea precipitation data in hourly timesteps was not readily available. There are a couple NOAA rain gages in the vicinity, but the data appeared to be spotty. As a result, I used daily precipitation data from PRISM to identify days in which total daily precipitation was greater than 10% of total daily streamflow. Next, I used the subset out the identified dates from the hourly streamflow data.
  4. Calculating rate of change: For this component of the analysis, I only wanted data from the receding hydrograph. I used the equation to estimate hourly -. Hourly streamflow (Q) for the corresponding hours was estimated as .
  5. (or  ) is the rate at which flow jumps values from one time-step to the next at a given flowrate. Because this is a recession analysis, I only wanted data that was on the receding slope and therefore subset the data to time-steps when  was negative.
  6. Lastly, I developed a simple linear regression model for vs Q for each watershed using the following equation:



When log-transformed and plotted, the discharge and rate of change of discharge follow a linear relationship, however the relationship is different for data analyzed at different temporal resolutions (Figures 1 a-b). The recession results from the daily timestep are likely muted because recession processed occur on shorter timescales, on the order of minutes to hours.

Figures 1 a-b. On the left, recession curve results for the Alsea River on a daily time step, after removing days with high precipitation. On the right, recession curve results for the river on an hourly time step, after days with high precipitation were removed. Both a and b include only recession data.


Table 1. Table showing different coefficient values for different temporal resolutions of Alsea recession data.

Data a coefficient b coefficient
Alsea (daily) -3.029 1.639
Alsea (hourly) -5.2346 0.9138


Recession analysis on year-round, hourly Nehalem River data resulted in an a coefficient value of -4.696 and a b coefficient value of 1.049, which are similar values for the Alsea hourly results. However, the Nehalem recession data is showing two linear trends in the plotted data (Figure 2). The two trendlines remained after controlling for both diurnal flux and season (Figure 3).

Figure 2. Recession results from the Nehalem River data. The red line is the calculated linear regression model, however two lines may be more representative of the recession behavior, such as those estimated in blue.

Figure 3. Recession data separated into seasons and only including receding slopes, night time hours, insignificant precipitation. The two trendlines are apparently less distinguished in certain seasons of the year.

Critique of Method

            Recession analysis is a method that I will use in my research, and this was a helpful exercise in that it helped me understand the nuances of recession data analysis, including data acquisition and availability, and opportunities for improvement. Moving forward, this analysis would benefit from: more spatially accurate precipitation data, more investigation of the explanations for the observed patterns and trends, and more sophisticated statistical comparison between sites.


PRISM Climate Group, Oregon State University,, created 4 Feb 2004.

Krakauer, N. Y., & Temimi, M. (2011). Stream recession curves and storage variability in small watersheds. Hydrology and Earth System Sciences, 15(7), 2377–2389.

Sawaske, S. R., & Freyberg, D. L. (2014). An analysis of trends in baseflow recession and low-flows in rain-dominated coastal streams of the pacific coast. Journal of Hydrology, 519, 599–610.

Brutsaert, W., & Nieber, J. L. (1977). Regionalized drought flow hydrographs from a mature glaciated plateau. Water Resources Research, 13(3), 637–643.

Poff, N. L., Allan, J. D., Bain, M. B., Karr, J. R., Prestegaard, K. L., Richter, B. D., … Stromberg, J. C. (1997). The Natural Flow Regime. BioScience, 47(11), 769–784.

April 20, 2019

Leaf spot patch analysis caused by Blackleg

Filed under: 2019,Exercise 1 @ 4:55 pm

My research involves the classification of a leaf spots on turnip derived from the pathogen blackleg. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves. This post focuses on the analysis of the clusters based on image classification through segmentation as well as the manually classified clusters. These clusters of pixels are expected to be representative of the diseased patches on the leaves. Here we seek to obtain some patch statistics which will be compared for relationships and accuracy at a later time. A large portion of this process went into image processing before the analysis could be conducted. All of the image processing took place in ArcGIS Pro. The next step involved the patch analysis which was conducted in Fragstats.

Some questions I asked myself about element A & B of my hypothesis included:

Can diseased patches be successfully identified and isolated by computer image classification and through manual classification?

If yes; what is the area and perimeter of the diseased patches based on image classification and manually classified diseased patches? I was mostly looking to obtain and gain some experience in the patch analysis provided by Fragstats. Much more thorough analysis can be and will be completed in Fragstats when variable A & B are compared for accuracy assessment in exercise two.

Tools used and methodology
The image processing and classification of pixels was conducted in ArcGIS Pro. This analysis required extracting both manually classified diseased patches and computer classified patches. I began by uploading band 3, which captures light energy in the red wavelength at 670 – 675 nm.

1. For computer classification of the image I used Segmentation.

a) I went to the Imagery tab and the Image Classification group. Click the drop-down arrow and an option for Segmentation should appear. Be sure the layer you desire to perform the segmentation on was selected prior to selecting the Segmentation Classification Tools. In the pane you have options to adjust Spectral detail, Spatial detail and Minimum segment size in pixels. There are variable and depend on image resolution, level of accuracy you wish to achieve, etc. Because my resolution to high I chose for higher spectral detail and spatial detail. I adjusted the spectral detail from 15.50 to 17 and the spatial detail from 15 to 17 as well. For the minimum segment size in pixels I chose 10. As mentioned these values are variable and although the standard values worked well for segmentation, I previewed other values and achieved greater results by adjusting. After previewing other options and deciding what values work best, click the Run button in the bottom right corner of the pane.

b) Masking all the cells that weren’t diseased was the next step and crucial to this diseased region selection process. To do this, go to the Analysis tab and the Raster group to find the Raster Functions. Open the Raster Functions and it should appear in the pane to the right. Search Mask and select it. Open the segmented image in the raster box and three options will be available for Included Ranges. These three options are representative of the blue, green and red bands. Because the data was adjusted from 12-bit to 8-bit in the segmentation process, we should have a range of values between 1 and 255. Despite having three band options, all the bands hold the same value because we are only working with the red band. Click on the segments you wish to include in the working window to check their values. Include a minimum and maximum which should be the same for all 1, 2 and 3 that includes the segments you want. The range I had included was 100 to 160, which is the RGB.Pixel Values that appear when clicking a pixel. Once you have these entered, click create new layer in the bottom of the right pane.

c) Next, I used the Clip function by navigating to the Imagery tab, the Analysis group and selecting Raster Functions. In the window pane to the left a search bar can be found along the top where you can enter Clip. Although there are other methods and locations the clip function can be found, I experienced different results depending on how I navigated to clipping, as well different clipping options in the window pane. In the Parameters section for Clip Properties, select the drop-down arrow for the raster you previously segmented. Leave Clipping Type and Clipping Geometry/Raster as the default. Adjust the active map view by zooming in or out to the region you want preserved after the clip. I use as small of area as I possibly can when clipping, while keeping all required segmented regions. Click the Capture Current Map Extent button found just to the right of the extent options in the pane (Green square map). To clip, click Create new layer at the bottom of the pane and a new map layer with the clipped region should appear in the map contents.

d) From here, go to the Analysis tab and Geoprocessing group where you can find the Tools. Click the Tools to open the options in the left pane. Here we want to use the Raster to Polygon (Conversion Tools) which can be found by searching at the top of the tools window pane. Use the most recent segmented-clipped raster for the Input raster, leave the Field blank and select where you would like the Output Polygon features to save. I left the remaining option as the default and clicked the Run arrow in the bottom right corner. Although we need the final product to be in raster format, this step is important for creating separate polygons which are grouped together as one unit.

e) As mentioned we must get the polygons back into a raster. Before we use the Polygon to Raster tool, polygons which overlap must be merged. This is one of the two reasons why we converted the raster to polygons. When the segmentation was conducted in (step d), diseased regions were not all categorized in the same bin. This resulted in regions which were maintained after masking but are different shades grey. Even one diseased patch may have two to three tones to it, resulting in segments for a patch. Now that everything is a polygon, we can merge these polygons that should be one polygon. Click the Edit tab and select Modify in the Features group. In the Modify Features, scroll down to the Construct group and click Merge. In Existing Features click Change the Selection and while holding shift, select the polygons which belong in one group. They will be highlighted in blue and appear in the pane if properly selected and if so, click Merge. Navigate to the Map tab and Selection group and click Clear. Use the same merge steps to merge any other polygons which belong to the same diseased lesion but were classified separately in raster segmentation.

f) Finally, the polygons can be converted back to a raster for the last step of image processing. Click the Analysis tab and find the Geoprocessing group to Tools once again. Search Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygons layer for Input Features and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize which I changed to 1 from 0.16 to maintain the cell size in my original raster. Select Run in the bottom right corner of the pane for the final layer. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value and Count which can be found by going to the Attribute table for that layer or clicking on a group of pixels. This allows for each diseased patch to be aggregated and analyzed as such in Fragstats.

g) Go to the final raster layer, right click and go to data and export data. What I wanted is a TIFF file for analysis in Fragstats.

2. To compare the accuracy of the segmentation as a classification method for diseased pixels, manual classification was used as a ground truth technique.

a) For manual classification of the original red band, the Training Sample Manager was used. This allows for manual classification which can be used for supervised machine learning techniques of classification. Here, I simply used it to select diseased regions manually, but have an end goal of using the support vector machine learning model.

b) To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it diseased pixels and supply a value which is arbitrary and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the diseased regions in the image. Save these polygons and the classification scheme. Go to Add data in the Layer group on the Map tab to upload the training sample polygons.

c) Once the polygons are uploaded, use the Polygon to Raster protocol which can be found in step 1, part (f) followed by part (g).

Fragstats was used for the analysis of the polygons. For exercise 1 the intent in Fragstats was to simply obtain information about the diseased patches that were extracted from the red band image. This included the pixel number, area, perimeter, perimeter-area ratio and shape index. Many more options were available for patch analyses and will be considered for exercise 2.

1. Open up Fragstats for the patch analysis.

a) To import your first image click Add layer and go down to GeoTIFF grid and select it. In the Dataset name, search for your tiff file you created in ArcGIS Pro and select it and click ok. You have to do this for each of the datasets. Then go to the Analysis parameters and click Patch metrics in the Sampling strategy. For General options hit Browse to find a location for the output to save and click the Automatically save results checkbox.

b) Click on the red box called Patch metrics and in the Area – Edge tab select the Patch Area and Patch Perimeter. Click on the Shape tab and click Perimeter – Area ratio and Shape Index.

c) In the upper left corner, you can hit Run and it will check everything you have uploaded and the analysis you have selected. You must then hit Proceed after it checked the model consistency and it runs the analyses.

d) After this hit the Results options for viewing the output.

The patches aligned pretty well through visualization in ArcGIS Pro when comparing the two layers. The comparison between patches will be done in exercise 2. I did notice that the segmentation process missed many of the smaller diseased patches. Looking at Image 2, you can see that they were segmented from the surrounding regions but were lumped into a bin with that caught areas that appeared lighter in the image because of a reflection. Different thresholds could be used in the segmentation process in order to include those smaller patches. This could maybe be done if they followed a parameter involving shape. This may be considered for future segmentation. A concern is that you set the threshold too low and it includes many regions that aren’t diseased but include all the diseased regions also. This would result in many false positives for disease. The alternative is to set a high threshold value for diseased regions which would result in false negatives. The idea is to find a good balance between the two. Currently the segmentation is strictly based on the segments value which is attributed to the reflectance in the red band. As mentioned another parameter worth considering is shape. Since the diseased regions tend to be leaf spots resulting in a circular area typically, adjustments could be made to include more circular patches that don’t extend past a certain number of pixels. In table 1 we see that number of pixels for patches ranges from 8 to 50. The patch analysis provides some insights into possible spatial factors that explain these segmented regions and how the classification process could be done more accurately.

Table 1. Showing the 11 different patches that were manually selected and analyzed in Fragstats. The color indicates the corresponding patch that was detected through segmentation shown in Table 2.

Table 2. The 4 different patches that were identified through segmentation and analyzed in Fragstats. The color indicates the corresponding patch that was detected through segmentation shown in Table 1.


Critique of the method
The MicaSense red-edge camera has 5 bands which can be very helpful for applying different vegetative indices and compositing bands in order to help bring out the variation in spectral signatures between diseased and un-diseased tissue. Although the pixel size is just under 1 mm, which appears to be adequate for identifying lesions on the leaf, the bands do not have perfect overlap. This is due to the design of the camera which has five different lenses for the five different bands that are all separated by one to two inches. This could be corrected for if the extent for each of the five bands was manually adjusted for a near perfect overlap. Until this is resolved, the red band seems to show the most variation in pixel value for diseased patches in comparison to the other four bands and was used for this analysis.

Additionally, the amount of processing that is done in part 1 step (a) to get the segmented raster patches is has many steps. Methods for speeding up this process need to be considered for future analysis especially when conducting this analysis on the 500 leaves.

As mentioned before, Fragstats has many more statistically capabilities that were not applied in this analysis. Getting more statistics on the manually and segmented patches would be helpful for determining the level of accuracy as well as other parameters worth considering.


Image 1. The red band tif file uploaded into ArcGIS Pro for classification. The white patches are what we would expect the diseased regions to look like.

Image 2. The result of the segmentation performed on Image 1 described in step 1 part (a)

Image 3. The patches determined based on segmentation in part 1 from Image 2. Separate patches were created by using the Raster to Polygon tool, followed by the Polygon to Raster tool. Two layers were turned on here for the intent of showing the overlap between the raster patches and the original image.

Image 4. The patches determined based on the manual classification described in part 2 from Image 1. Training sample manager followed by polygon to raster was used to get these patches. Two layers were turned on here for the intent of showing the overlap between the raster patches and the original image.

April 19, 2019

Spatial Patterns of Salmonella Rates in Oregon

  1. Question Asked

At this stage I asked several questions regarding the spatial distribution of population characteristics in all counties in Oregon in 2014: What are the county level spatial patterns of reported age-adjusted Salmonella rates within Oregon in 2014? County level spatial patterns of proportions of females? Median Age? Proportion of infants/young children aged 0-4 years?

To answer these questions I used several different datasets. The first dataset used is a collection of all reported Salmonella cases in Oregon from 2008-2017 which includes information like sex, age group, county in which the case was reported, and onset of illness. The information in this dataset was deidentified by Oregon Health Authority. The second dataset used was a collection of Oregon population estimates over the same time period. This dataset includes sex and age group specific county level population information. I also obtained county level median ages from AmericanFactFinder. The last dataset used is a shapefile from the Oregon Spatial Data Library containing polygon information of all Oregon counties.

  1. Names of analytical tools/approaches used

I used a direct age adjustment (using the 2014 statewide population as the standard population) to obtain county level age-adjusted Salmonella rates. After calculating county level summary data e.g. proportion of females, proportion of children aged 0-4, median age, and age-adjusted Salmonella rates, I merged this information with a spatial dataframe containing polygonal data of every county in Oregon. After doing this I did both local (between 0-150 km) and global (statewide) spatial autocorrelation to get a Moran’s I statistic for each of the population variables listed above. I produced choropleth maps of each of the variables for Oregon as well. Finally, I produced a heatmap for county-level age-adjusted Salmonella rates using a Getis-Ord Gi* local statistic to evaluate statistically significant clustering of high/low rates of reported Salmonella cases.

  1. Description of the analytical process

After extensive reformatting, I was able to organize cases of Salmonella by age group and by county for the year 2014. After this I formatted 2014 county level population estimates in the same way. I then divided the Salmonella case dataframe by the population estimate dataframe to get rates by the different age groups. To get county age-adjusted rates I created a “standard population”, in this case I used Oregon’s statewide population broken down into the same age groups as above. I then multiplied the each of the county’s age-specific rates by the standard population’s matching age groups to create a dataframe of hypothetical cases. This dataframe represents the number of cases we would expect in each of the counties if they had the same population and age distribution as Oregon as a whole. I summed the expected Salmonella cases by county and divided this number by the 2014 statewide population. This yielded age-adjusted reported Salmonella rates by county.

Given that the population data contained county level populations broken down by age group and by sex I was able to calculate proportions of county populations which were female, and which were young children aged 0-4 years by dividing those respective group populations by the total county population.

After this I performed local and global spatial autocorrelation with Moran’s I using the county level median age, proportion of children, proportion of females, and age adjusted Salmonella rates which were associated with centroid points for each county. The global Moran’s I was calculated using the entire extent of the state and the local Moran’s I was calculated by limiting analysis to locations within 150 km of the centroid. Both global and local Moran’s I statistics were calculated using the Monte-Carlo method with 599 simulations.

Finally, I completed a Hot Spot Analysis using Getis-Ord Gi* to assess for any statistically significant hot or cold spots in Oregon. This was only done for the age-adjusted Salmonella rates. This was completed using the same county centroid points as above. I completed this analysis with a local weights matrix using Queen Adjacency for neighbor connectivity. The weighting scheme was set to where all neighbor weights when added together equaled 1.

  1. Brief description of results you obtained

Choropleth Maps of Oregon:

From the median age map, we can see that there are some clusters of older counties in the northeastern portion of the state and along west coast. Overall, the western portion of Oregon is younger than the eastern portion of the state.

From the proportion of children map there are a few clusters of counties in the northern portion of the state with high proportions of children compared to the rest of the state. Overall, the counties surrounding the Portland metro area have higher proportions of children compared to the rest of the state.

From the proportion of females map, we can see that the counties with the highest proportion of females are clustered in the western portion of the state.

Finally, from the age-adjusted county Salmonella rates map we can see that the highest rates of Salmonella occur mostly in the western portion of the state with a few counties in the northeast having high rates as well. Overall, the counties surrounding Multnomah county have the highest rates of Salmonella.

The global Moran’s I statistics:

  • County proportions of females: 0.053 with a p-value of 0.15. This suggests insignificant amounts of slight clustering.
  • County median age: 0.175 with a p-value of 0.02. This provides evidence of some significant mild clustering.
  • County proportions of children: 0.117 with a p-value of 0.05. This provides evidence of significant mild clustering
  • County age-adjusted Salmonella rates: -0.007 with a p-value of 0.32. This suggests insignificant amounts of higher dispersal than would be expected.

Local Moran’s I Statistics:

  • County proportions of females: 0.152 with a p-value of 0.02. This suggests significant amounts of mild clustering.
  • County median age: 0.110 with a p-value of 0.07. This provides evidence of some insignificant mild clustering.
  • County proportions of children: 0.052 with a p-value of 0.1617. This provides evidence of insignificant slight clustering
  • County age-adjusted Salmonella rates: -0.032 with a p-value of 0.5083. This suggests insignificant amounts of higher dispersal than would be expected.

Getis-Ord Gi*:

  • The heatmap shows a significant hotspot (with 95% confidence) in Clackamas county with another hotspot (with 90% confidence) in Hood River County. Three cold spots (with 90% confidence) are seen in Malheur, Crook, and Morrow counties.

  1. Critique of Methods

The choropleth maps were very useful at showing areas with high/values however this method was not able to detect counties with significantly different values compared their neighbors. Overall, it was useful as an exploratory tool. The global and local Moran’s I calculations were able to detect if high/low values were closely clustered or more dispersed than what is expected. However, I am unsure if this method was completely appropriate given the coarseness of this county level data. At a local scale, only the proportion of women showed a significant amount of clustering, and globally median age and proportion of children showed some amount of significant clustering. Given that most of the Moran’s I statistics were not associated with significant values, I don’t believe this analytical method highlighted a particularly meaningful spatial pattern in my data. The heatmap provided evidence of some significant hot and cold spots in Oregon, however this was based on immediate neighbor weights and perhaps global weights would be more appropriate. Overall, this tool was very useful in detecting significantly high/low Salmonella rates.

Exercise 1: Preparing for Point Pattern Analysis

Exercise 1

The Question in Context

In order to answer my question: are the dolphin sighting data points clustered along the transect surveys or do they have an equal distribution pattern? I need to use point pattern analysis. I am trying visualize where in space dolphins were sighted along the coast of California, specifically from my San Diego sighting area. In this exercise, the variable of interest is dolphin sightings. These are x,y coordinates (point data) indicating the presence of common bottlenose dolphins along a transect. However, these transect data were not recorded and I needed to recreate these lines to my best abilities. This process is more challenging than anticipated, but will prove useful in the short-term view of this class and project and long-term in management ramifications.

The Tools

As part of this exercise, I used ArcMap 10.6, GoogleEarth, qGIS, and Excel. Although I was only intending on importing my Excel data, saved as a .csv file into ArcMap, that was not working, so other tools were necessary. The final goal of this exercise was to complete point-pattern analyses comparing distance along recreated transects to sightings. From there, the sightings would be broken down by year, season, or environmental factor (El Niño versus La Niña years) to look for distributing patterns, specifically if the points were ever clustered or equally distributed at different points in time.

Steps/Outputs/Review of Methods and Analysis

My first step was to clean up my sightings data enough that it could be exported as a .csv and imported as x-y data into ArcMap. However, ArcMap, no matter the transformation equation, seemed to understand the projected or geographic coordinate systems. After many attempts, where my data ended up along the east coast of Africa or in the Gulf of Mexico, I tried a work around; I imported the .csv file into qGIS with the help of a classmate, and then exported that file as a shape file. Then, I was able to import that shape file into ArcMap and select the correct geographic and projected coordinate systems. The points finally appeared off the coast of California.

I then found a shape file of North America with a more accurate coastline, to add to the base map. This step will be important later when I add in track lines, and how the distributions of points along these track lines are related to bathymetry. The bathymetric lines will need to be rasterized and later interpolated.

The next step was the track line recreation. I chose to focus on the San Diego study site. This site has the most data and the most consistently and standardly collected data. The surveys always left the same port of Mission Bay, San Diego, CA traveled north at 5-10km/hr to a specific beach (landmark), then turned around. It is noted on sighting data whether the track line was surveyed on both directions (South to North and North to South), or unidirectional (South to North). Because some data were collected prior to the invention of a GPS and the commercial availability, I have to recreate these track lines. I started trying to use ArcMap to draw the lines but had difficulty. Luckily, after many attempts, it was suggested that I use Google Earth. Here I found a tool to create a survey line where I can mark the edges along the coastline at an approximate distance from shore, and then export that file. It took a while to realize that the file needed to be exported as a .kml and not a .kmz.

Once exported as a .kml, I was able to convert the .kml file to a layer file and then to a shape file in ArcMap. The next step in this is somehow getting all points within one kilometer of the track line (my spatial scale for this part of the project) to associate with that track line. One idea was snapping the points to the line. However, this did not work. I am still stuck here: the major step before I can have my point data with an association to the line and then begin a point pattern analysis in ArcMap and/or R Studio.


Although I do not currently have results of this exercise, fully. I can say for certain, that it has not been without trying, nor am I stopping. I have been brainstorming and milking resources from classmates and teaching assistants about how to associate the sighting data points with the track line to then do this cluster analysis. Hopefully, based on this can be exported to R studio where I can see distributions along the transect. I may be able to do a density-based analysis which would show if different sections along the transect, which I would need to designate and potentially rasterize first, have different densities of points. I would expect the sections to differ seasonally.


Although I add in my opinions on usefulness and ease above, I do believe this will be very helpful in analyzing distribution patterns. Right now, it is largely unknown if there are differences in distribution patterns for this population because they move rapidly and at great distances. But, by investigating data from only the San Diego site, I can determine if there are differences in distributions along the transects temporally and spatially. In addition, the total counts of sightings in each location per unit effort will be useful to see the influx to that entire survey area over time.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Point Pattern Analysis of Tree Distribution by Height in the HJ Andrews Forest

1. Given that HJ Andrew Experimental Forest is a 16,000-acre forest with manipulations spanning back to its establishment as a Long-Term Ecological Research (LTER) site in 1948, it is a highly spatially heterogeneous ecosystem. Forest harvest began in the 1950s and resulted in a mosaic of young plantation forest (~30 percent of total forest area) and old growth (~40 percent of forest) ( My objective is to quantify the spatial pattern of trees across the forest and eventually relate that to quantifiable landscape features.

Motivating Questions: How does the spatial pattern of trees vary across the HJ Andrews Forest? Specifically, I’m exploring the relationship between tree height and tree spacing. One specific question of interest is: How does the mean distance between trees in the same height class differ from the mean distance between a single height class of tree and all other trees? This question attempts to address the clustering vs. dispersion of trees by height.

This is an analysis of the spatial distribution of one variable, tree height, so a consideration of the internal processes that may influence the spatial distribution of this variable is necessary.

  • Microclimate caused by the clustering or dispersion of trees could be either an attraction or repulsion process. Microclimate influences relative humidity and exposure to wind, along with many other factors, so clusters of trees would tend to have different microclimate features than more dispersed trees.
  • Population and community dynamics will influence the spatial distribution of trees. The speed at which a colonizer can take over a space and competition between different colonizers influence the distribution. Aboveground and belowground tree growth adn spacing of trees will be influenced by these factors.
  • Source and sink processes in a forest may result from topographical features, like valleys and hillslopes. I expect valleys to be a source (capable of producing a surplus of organisms) sink they will tend to be in areas near streams, so not water limited and in areas that serve as catchments for nutrients, so not nutrient limited.
  • Spatial distribution of tree height certainly is different according to the scale. The spatial pattern looks different at a single tree scale compared with a 50-m scale, compared with a 5-km scale. Some of these differences are revealed by Figures 3, 4 and 5, below.

2. My approach is to use k-nearest neighbor to examine distance between a given tree and proximal trees. I created ten height classes by using kmeans to find ten cluster centers in the tree height data, then used K nearest neighbor to examine the distance between each cluster center and the 30 closest trees.

3. For this analysis, I used a LiDAR dataset of vegetation heights downloaded from the HJ Andrews online data portal ( The selected data is from the third entry, titled, “Vegetation Height digital elevation model (DEM) from 2011 LiDAR, Upper Blue River Watershed (Oct 27th, 2011 – Nov 1st, 2011). A description of this dataset can be found here:

I performed the majority of the analyses in R and used QGIS for data visualization. I used the ‘st_read’ function from the ‘sf’ package in R to read in the stem map shapefile (stem_map.shp). The stem map shapefile includes crown radii (m) and tree heights (m) as well as the point locations of trees.

Because the stem map shapefile essentially provides a census of trees within the HJA Forest, and that is (1) way too much data to deal with at one time, and (2) not useful to perform statistical analyses on since we can just look at the data mapped out and visually discern where trees are located, I decided to use kmeans to group the trees into 10 height classes, resulting in the data space being partitioned into Veronoi cells.

I used the ‘nngeo’ package to examine k-nearest neighbors relations between and within height clusters. I examined the relationship between a single tree (point) and its nearest neighbor of the same height class. I also examined the relationship between a single tree of one height class and its nearest neighbor of any height class to start to elucidate to what extent trees of similar heights cluster or are dispersed spatially.

I performed T-tests on each tree height class to test if there was a significant difference between (1) the mean distance between a tree and the next closest tree within the same height class and (2) the mean distance between a tree and the next closest tree of any height class. I calculated the mean, standard deviation and kurtosis of each height class distance (both within and between height classes).

4. Results

The table below shows the center of each of the ten height class groupings, mean distances between and within tree height classes, standard deviation and kurtosis of those means, and p-values. Results of all t-tests were significant (p<0.01), meaning that there is strong evidence that the within group mean distances are not equal to zero, so there are differences in mean distances between trees of the same height class (within group) and mean distance to the closest trees of any height class. In other words, the distance between a large tree and its nearest neighbor (of any height class) is significantly different than the distance between a large tree and its nearest neighbor within the large tree height class. The same is true of trees within and outside of the small height class, as well as each of the other ten height classes.

Table 1. Results from k-nearest neighbor analysis and t-tests

Tree Class Class Center (Height (m)) Mean Distance to 30 closest trees (m) St Dev Mean Distance Kurtosis Mean Distance Mean Distance to Closest Tree (m) St Dev Mean Mean Distance to Closest Tree Within Group (m) St Dev Group Mean P-value
1 11.8 12.5 4.4 6.4 2.1 2 4 6.3 <0.01
2 15.1 11.7 3.9 3 2.4 1.7 4.9 6.6 <0.01
3 19.9 12.6 3.4 4 3.6 1.3 6.8 6.7 <0.01
4 24.4 13.4 3.1 5.6 4 1.4 6.9 6.5 <0.01
5 29.7 14.6 2.9 3.9 4.6 1.4 8 6.9 <0.01
6 35.5 16.7 3 2.8 5.3 1.6 9.3 7.2 <0.01
7 42 18.6 2.6 4.5 6.1 1.7 10.1 7 <0.01
8 50 19.7 2.5 6.6 6.9 1.8 11.4 7.3 <0.01
9 59 20.5 2.6 3.1 7.5 1.8 13.5 8.7 <0.01
10 70 21.2 2.7 0.7 8.1 1.9 15 11 <0.01

Fig 1. Distribution of all tree heights (m) in HJ Andrews Forest.




Fig 2. Mean distance (m) and standard deviation (m) between trees of each of ten height classes and the next closest 30 trees within that height class. Generally, as trees get larger, mean distance between them is larger.

Fig 3. Distribution of the tallest tree class (70 m tree class) across the HJ Andrews Forest.


Fig 4. A representative distribution of all tree height classes on either side of a road in the HJ Andrews Forest, showing the extent of clustering and the extent of dispersion of height classes. Height classes are in ascending order from smallest class (~12m tall; Class 1) to tallest class (70m tall; Class 10).


Fig 5. A closer look in the same area at Fig. 4, where small trees are clustered near the road and clustered tightly together, while larger trees are more dispersed.


Fig 6. Mean distance between trees of the same height class and other trees of the same height class (blue) and mean distance between trees of one height class and any other tree (red). The overlapping red confidence interval with the blue points suggests that the average distance between small trees is not significantly different than distance between small trees and any trees. The general upward trend suggests that as trees get taller, the distance between them increases and variance slightly decreases.

5. Critique of the method:

The results make sense, but do not provide much more information about the actual distribution of trees (clustering vs. dispersion) than simple maps of point data, so the next step might be to examine tree heights within different management regimes. The current analysis tells me that trees are somewhat clustered by height, and that the mean distance between a tree of one height class to a tree of the same height class is, in most cases, different from the mean distance of a tree of one height class to a tree of any other height class. I’ve examined a map of different management regimes within the HJ Andrews Forest and there are clear areas of old growth, harvested areas, clearly defined plots, etc., so I would expect some of these areas to show tree clustering by height class. The patterns I found using this analysis were not as clear as I was expecting. Using kmeans and nearest neighbor analysis is a great way to start to examine the spatial relationships between and among data, but with such a large and highly varied dataset there can be shortcomings, especially when it comes to drawing any concrete conclusions.


HJ Andrews Online Data Repository:

Johnson, S.; Lienkaemper, G. 2016. Stream network from 1997 survey and 2008 LiDAR flight, Andrews Experimental Forest. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: (10 April 2019) DOI:

Spies, T. 2016. LiDAR data (August 2008) for the Andrews Experimental Forest and Willamette National Forest study areas. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: (10 April 2019) DOI:

Spies, T. 2016. LiDAR data (October 2011) for the Upper Blue River Watershed, Willamette National Forest. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: (10 April 2019) DOI:

Spies, T. 2014. Forest metrics derived from the 2008 Lidar point clouds, includes canopy closure, percentile height, and stem mapping for the Andrews Experimental Forest.. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: (10 April 2019) DOI: 6073/pasta/875e10383e8c8aee3c9a49e0155eef1d.

Exercise 1: Principal component analysis for rural index weighting & rural indicator measure contributions for Texas counties

Filed under: 2019,Exercise 1 @ 3:08 pm
Tags: , ,

Questions Being Asked:

How much do different indicator measures of rurality in Texas counties each contribute to a basic index of rurality both statistically and spatially? How much does the weighted basic index differ from a simple calculated mean for rurality?

Tools and Approaches Used

The major statistical method I used to answer these questions was principal component analysis (PCA) via the prcomp function in R and subsequent rurality indicator variable weighting for counties in Texas. The raw indicator data used in this analysis are as follows: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database).

Description of Analysis Steps

1) Data Wrangling: Both income and population density data were natively in county-level polygon form, so no conversion was required in ArcGIS. Land use required extensive conversion in order to produce the indicator variable that would both be at the appropriate spatial definition and measure the correct aspect of land use. The raw land use discrete data was first reclassified from many categories of land, such as “deciduous forest,” “barren land,” and “scrub/shrub,” into binary “developed” and “undeveloped” classification groups for the entire state of Texas. Then, the resulting binary raster was converted to unsimplified polygons and the “tabulate intersection” ArcGIS tool was utilized in order to intersect Texas county boundaries with the produced unsimplified polygons. This tool allowed for calculation of percent of area of each Texas county that is considered developed land by the USGS.

2) Data Scaling: After the 3 variables were in analyzable form, each were scaled on a 0 to 100-point scale for visual comparison, PCA analysis, and computation of index weights.

3) Creation of Maps of Scaled Variables for Comparison: ArcGIS was utilized to create maps of each of the scaled variables to visually compare the 3 factors to one another.

4) PCA and Weight Computation: The PCA was completed in R using the prcomp function and variable weights were extracted from the procedure by determining the proportion of variance each variable contributed to. This specific step helps answer how much each indicator variable contributes to rurality in Texas.

5) Simple Means Map, Weighted Index Map, and Statistical Comparison: The scaled variables were used to calculate both an unweighted mean rurality score and weighted mean rurality score for all Texas counties. Maps of each of these measures were produced for visual comparison. A student’s t-test was then completed to compare unweighted and weighted rurality scores for counties.


Maps of Scaled Variables of Texas Counties for Comparison

Rough but similar spatial patterns can be seen in all three maps. Counties with large cities and generally more urbanized counties have higher median income, population density, and percent of land area developed (Green=high, yellow=medium, red=low). There are more extreme values in the population density and percent of land area developed maps than median income.

PCA and Weighting Computation

The biplot of the PCA procedure helps visualize how the samples of the 3 variables relate to one another and how much each variable contributes to the first principal component. Important factors to notice in this plot are the direction of the arrows and clustering of the points. Typically, the more the variable arrow points toward the principal component 1 (PC1) axis, the more it contributes to the proportion of variance. The population density and percent land area developed variables happen to overlap one another in this plot, as they have highly similar contributions to the variance in PC1.

This table shows that population density contributes to 44.4% of the variance in PC1, income contributes to 11.11%, and percent land area developed contributes 44.5%. This PCA procedure indicates that when weighting rurality scores by county in Texas using these 3 variables, percent land area developed should have the highest weight, population density should have the second highest weight, and income should have the lowest. The weighting of variables in the weighted county rurality mean score follows these proportions.

Simple Means Map, Weighted Index Map, and Statistical Comparison

Visually comparing the two maps of rurality scores, there is is some overlap, but quite stark differences. County scores overall have become more rural (lower scores) based on simple visual comparison. Statistical analysis will allow a more nuanced view of the actual differences between the two.

The computed t-test indicates that the weighted index, on average, produces county rurality scores that are significantly more rural (lower scores on the 0-100 scale) than the unweighted index (p<0.001). The overall unweighted rurality score mean for all Texas counties is ~17, while the overall weighted rurality score mean is ~7.5. This indicates that the basic weighted index may better capture the multidimensionality of rurality for Texas counties than unweighted mean scores.


Based on my understanding, PCA is quite useful for creation of weighted indexes. Unfortunately, the variables I utilized for this analysis have somewhat varying patterns of clustering and possible outliers, which can be seen in the above biplot. These possible outliers could have an effect on the weighting of variables in the index. Due to the small sample size of counties and in order to produce a complete picture of all counties in Texas, I did not want to remove any counties from the analysis. PCA as it relates to index weighting also requires that there is either an a priori understanding or statistical proof of a correlation between variables. Based on the directions of the arrows in the biplot, there is likely correlation between the variables, but more considerations may be needed to confirm this. Also, the scaling of the median income variable is slightly off because the maximum value is over 100. This will need to be corrected in a later analysis.

My future analysis will include more health-related variables in the PCA procedure to determine how health variables contribute to rurality in comparison to the basic rural indicator variables analyzed in this exercise.

Mapping the stain: Using spatial autocorrelation to look at clustering of infection probabilities for black stain root disease

My questions:

I am using a simulation model to analyze spatial patterns of black stain root disease of Douglas-fir at the individual tree, stand, and landscape scales. For exercise 1, I focused on the spatial pattern of probability of infection, asking:

  • What is the spatial pattern of probability of infection for black stain root disease in the forest landscape?
  • How does this spatial pattern differ between landscapes where stands are clustered by management class and landscapes where management classes are randomly distributed?

    Fig 1. Left: Raster of the clustered landscape, where stands are spatially grouped by each of the three forest management classes. Each management class has a different tree density, making the different classes clearly visible as three wedges in the landscape. Right: Raster of the landscape where management classes are randomly assigned to stands with no predetermined spatial clustering. The color of each cell represents the value for infection probability of that cell. White cells in both landscapes are non-tree areas with NA values.

Tool or approach that you used: Spatial autocorrelation analysis, Moran’s I, correlogram (R)

My model calculates probability of infection for each tree based on a variety of tree characteristics, including proximity to infected trees, so I expected to see spatial autocorrelation (when a variable is related to itself in space) with the clustering of high and low values of probability of infection. Because some management practices (i.e., high planting density, clear-cut harvest, thinning, shorter rotation length) have been shown to promote the spread of infection, there is reason to hypothesize that more intensive management strategies – and their spatial patterns in the landscape – may affect the spread of black stain at multiple scales.

I am interested in hotspot analysis to later analyze how the spatial pattern of infection hotspots map against different forest management approaches and forest ownerships. However, as a first step, I needed to show that there is some clustering in infection probabilities (spatial autocorrelation) in my data. I used the “Moran” function in the “raster” package (Hijmans 2019) in R to calculate the global Moran’s I statistic. The Moran’s I statistic ranges from -1 (perfect dispersion, e.g., a checkerboard) to +1 (perfect clustering), with a value of 0 indicating perfect randomness.

Moran’s I = -1

Moran’s I = 0

Moran’s I = 1









I calculated this statistic at multiple lag distances, h, to generate a graph of the values of the Moran’s I statistic across various values of h. You can think of the lag distance of the size of the window of neighbors being considered for each cell in a raster grid. The graph produced by plotting the calculated value of Moran’s I across various lag values is called a “correlogram.”

What did I actually do? A brief description of steps I followed to complete the analysis

1. Imported my raster files, corrected the spatial scale, and re-projected the rasters to fall somewhere over western Oregon.

I am playing with hypothetical landscapes (with the characteristics of real-world landscapes), so the spatial scale (extent, resolution) is relevant but the geographic placement is somewhat arbitrary. I looked at two landscapes: one where management classes are clustered (“clustered” landscape), and one where management classes are randomly distributed (“random”). For each landscape, I used two rasters: probability of infection (continuous values from 0 to 1) and non-tree/tree (binary, 0s and 1s).

2. Masked non-tree cells

Since not all cells in my raster grid contain trees, I set all non-tree cells to NA for my analysis in order to avoid comparing the probability of infection between trees and non-trees. I used the tree rasters to create a mask.
c.tree[ c.tree < 1 ] <- NA # Set all non-tree cells in the tree raster to NA
c.pi.tree <- mask(c.pi, c.tree) # Combine the prob inf with tree, leaving all others NA
# Repeat with randomly distributed management landscape
r.tree[ r.tree < 1 ] <- NA # Set all non-tree cells in the tree raster to NA
r.pi.tree <- mask(r.pi, r.tree) # Combine the prob inf with tree, leaving all others NA

Fig 2. Filled and hollow weights matrices.

3. Calculated Global Moran’s I for multiple values of lag distance.

For each lag distance, I created a weights matrix so the Moran function in the raster package would know how to weight each neighbor pixel at a given distance. Then, I let it run, calculating Moran’s I for each lag to create the data points for a correlogram.

I produced two correlograms: one where all cells within a given distance (lag) were given a weight of 1 and another “hollow” weights matrix when only cells at a given distance were given a weight of 1 (see example below).

4. Plotted the global Moran’s I for each landscape and compare.







What did I find? Brief description of results I obtained.

The correlograms show that similar values become less clustered at greater distances, approaching a random distribution by about 50 cell distances. In other words, cells are more similar to the cells around them than they are to more-distant cells. The many peaks and troughs in the correlogram are present because there are gaps between trees because of their regular spacing in plantation management.

In general, the highest values of Moran’s I were similar between the landscape with clustered management landscape and the landscape with randomly distributed management classes. However, the rate of decrease in the value of Moran’s I with increasing lag distance was higher for the random landscape than the clustered landscape. In other words, similar infection probabilities had larger clusters when forest management classes were clustered. For the clustered landscape, there was actually spatial autocorrelation at lag distances of 100 to 150, likely because of the clusters of higher infection probability in the “old growth” management cluster.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Filled” weights matrix.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Hollow” weights matrix.














Critique of the method – what was useful, what was not?

My biggest issue initially was finding a package to perform a hotspot analysis on raster data in R. I found some packages with detailed tutorials (e.g., hotspotr), but some had not been updated recently enough to work in the latest version of R. I could have done this analysis in ArcMap, but I am trying to use open-source software and free applications and improve my programming abilities in R.

The Moran function I eventually used in the raster package worked quickly and effectively, but it does not provide statistics (e.g., p-values) to interpret the significance of the Moran’s I values produced. I also had to make the correlogram by hand with the raster package. Other packages do include additional statistics but are either more complex to use or designed for point data. There are also built-in correlogram functions in packages like spdep or ncf, but they were very slow, potentially taking hours on a 300 x 300 cell raster. That said, it may just be my inexperience that made a clear path difficult to find.


Glen, S. 2016. Moran’s I: Definition, Examples.

Robert J. Hijmans (2019). raster: Geographic Data Analysis and Modeling. R package version 2.8-19.


Exercise 1: Comparison of Interpolation Methods Using Fisheries Abundance Data

Filed under: 2019,Exercise 1 @ 1:07 pm

Question Asked

I would like to understand how the spatial variability of forage fish in the California Current Ecosystem is related to the spatial pattern of seascape classes (based on remotely sensed sea-surface variables)? In exercise 1, I will asking “what is the spatial pattern of forage fish in the California Current ecosystem?” In order to address this question, I will be testing the use of different types of interpolation on my point-data.

Approaches Used and Methods

To address these questions, the Kriging Interpolation and Inverse Distance Weighting Interpolation tools were employed in ArcMap. All processes were completed in ESRI ArcMap 10.6.1. Interpolation is described as a method of constructing new data using existing data as references. In spatial and temporal analyses, there are a range of different types of interpolation that can be used.

The original data, which includes a series of about 1300 trawls, the catch per unit effort (CPUE) per species per trawl, the latitude, longitude, and water column depth of each trawl, and the functional group of each species caught, were loaded into ArcMap using the “Display X,Y Point Data” function. The four functional groups used in this analysis are Forage, Young-of-the-Year Rockfish (YOYRock), YOYGround, and crustaceans. In the case of this analysis, all fishes that are part of the YOYRock functional group were included.

Representation of the raw YOYRock Trawl data in ArcMap

After the data were uploaded, they had to be converted to feature classes. For the purposes of this exercise, all trawl data from 2002 to 2015 was included as one feature class, though the way in which the data are organized make it easy to break the trawls down at a finer temporal resolution. The result was a Point Shapefile that was then binned into four abundance groups to make interpretation easier. The IDW and Kriging tools (from the “Spatial Analyst” Toolbox) were then employed. The base equation for both interpolations is virtually the same, and both are commonly used methods for continuous data in point form, but there are some major differences in the calculation of some of the stated variables:

Representation of the equation used to assign values to missing cells using the IDW and Kriging methods. Z(si) = the measured value at the ith location, λi = an unknown weight for the measured value at the ith location, s0 = the prediction location, and N = the number of measured values

Z(si) = the measured value at the ith location, λi = an unknown weight for the measured value at the ith location, s0 = the prediction location, and N = the number of measured values

1) IDW: IDW, or Inverse-distance weighting, is what’s known as a deterministic interpolation method, as it relies on surrounding values to populate the resulting surface. One of the major assumptions with using IDW is that it assumes that the variables being mapped decrease in influence linearly as you move away from a given value. In IDW, the weight given to the “measured value at the ith location” is solely calculated linearly, decreasing as you move farther away from a given value. In this case, the “power” value, which corresponds to the weight, was the default 2.

2) Kriging: Kriging is an interpolation method based on geostatistics including autocorrelation. The equation used to calculate the missing values is the same as for IDW, except that the weight variable is calculated using a mathematical function within a certain specified radius of the missing value. For this reason, one of the main assumptions when using Kriging is that the “distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface.” (ESRI, ND).


The resulting interpolations can be found below. I’ve included output that focuses on the significant region of Monterey Bay to provide context regarding the proximity of the trawls to one another and to show detail on the boundaries of the interpolations.

Full Interpolation using IDW

Full Interpolation using Kriging

Detail of IDW in Monterey Bay, CA

Detail of Kriging in Monterey Bay, CA

Critiques of Methods

Any analysis of the results will conclude that the Kriging Tool provided a much more robust interpretation of the same patterns that can be observed in the original data and in the IDW interpolation. Both interpolations displayed significant patterns near Monterey Bay, and the Kriging Interpolation also represented additional areas of higher abundance north of San Francisco. While I do believe that both interpolation methods provide a good place to begin analysis, I believe that several adjustments will have to be made in order to create a useable result.

My first mistake was using the entire time series – since the interpolations are distance-based and extremely sensitive to points within close proximity to one another, the clear clusters of point most likely influenced the interpolations. A next step for me will be breaking the data down to an annual resolution, as I feel that shapefiles with one point at each trawl location will provide better data for interpolation. Additionally, this will provide multiple maps, which will allow for a chance to observe how the modeled patterns of YOYRock abundance have changed over time.

Another next step will be exploring the fine adjustments available within each interpolation method. I now have a greater understanding of the mechanics which drive IDW, so I’m eager o rerun the analyses at different powers to see how each impacts the interpolation. Similarly, the radius used to decide which points are considered while calculating a Kriging Interpolation can be adjusted, so that will be done in the future as an experiment.

Finally, I ran out of time to explore the symbology of the results – I hypothesize that classification by a smaller number of classes would result in more robust interpolation maps, as the visualizations now show a vast majority of the space to fall into the lowest class. The interpolation data is relatively bimodal in nature, so an adjustment in the symbology tab would likely result in a a more accurate and precise representation of abundance.

Overall, I see interpolation as a valuable way to identify spatial patterns from point data. In the case of species abundance data, I believe that Kriging is the superior method, as it does not have the linear influence assumption that’s baked into IDW. Additionally, the geostatical methods used in Kriging generally allow for a more robust and precise interpolation regardless of the type of continuous data being used.

The next steps mentioned above will be taken before the presentation of Tutorial 1.


Santora, Jarrod & Hazen, Elliott & Schroeder, Isaac & Bograd, Steven & Sakuma, KM & Field, JC. (2017). Impacts of ocean-climate variability on biodiversity of pelagic forage species in an upwelling ecosystem. Marine Ecology Progress Series. 580. 10.3354/meps12278.


Determining spatial autocorrelation of refugee settlements in Uganda

Filed under: 2019,Exercise 1 @ 12:40 pm
  1. Question that you asked

In analyzing the distribution of settlement boundaries that I obtained from OpenStreetMap, I wanted to know the general clustering and regionality of settlements in order to understand how other spatial statistics that I perform in my next step will behave based on the results from this explanatory step. The question I’m introducing for Exercise A centers around how similar or different nearby settlements are to each other: is there a regionality in settlement sizes? Some future questions that I’m considering are whether or not clustered settlements have higher detection in the World Settlement Footprint classification (which I’m using instead of GHSL because it’s a more localized classification). This would be determined using the percent of OSM settlement area that was detected by WSF.

  1. Name of the tool or approach that you used.

I decided to use spatial autocorrelation to answer this question, since the essence of my question centers around what the pattern of my data looks like, how clustered or not clustered are these settlements, and how that might affect my future analysis and considerations.

For this, I employed the Spatial Autocorrelation tool in ArcGIS Pro, which uses Global Moran I’s algorithm.

  1. Brief description of steps you followed to complete the analysis.

In order to identify the refugee settlements, I went through multiple rounds of querying with OpenStreetMap to extract the boundaries that are directly related to refugee camp boundaries. Hannah Friedrich and I have worked together on defining some of these boundaries, and she took some time to delineate boundaries for a separate study of hers. While I will incorporate these delineated ‘regions of interest,’ I will most likely not move forward with them in further analyses because it would present an issue with scaling up and manual interpretation. Below I list the three polygon data I’m analyzing here.

OSM Boundaries: This is a result of multiple query efforts within Overpass Turbo, a online server for downloading OpenStreetMap data. Various queries on attribute tags were performed in order to select relevant refugee boundary data.

Regions of Interest: This is a result from Hannah’s efforts of limiting areas within the OSM Boundaries that appear to be built up using high resolution imagery available on Google Satellite.

Selected OSM Boundaries: This is a selection from OSM Boundaries that merges polygons of the same settlement into multipolygons; this also a selection of boundary polygons that are less than 2000 hectares to account for some boundaries that are designated versus actually settled.

I then ingested these boundaries into Google Earth Engine, extracted the pixel areas for the WSF pixels within the three different boundary layers, and re-exported these.

I then imported these layers into ArcGIS Pro and performed the Spatial Autocorrelation tool and experimented with multiple parameters.

I ultimately chose the “row standardization” option given the possible bias in ‘sampling’ design, since this creation of this data is most often from the HOT Uganda team and resources might limit where they can travel and collect data.

  1. Brief description of results you obtained.

Ultimately, my results showed clustering within the OSM-identified settlements, which is to be expected. There are a variety of questions that arise from this that might confound my analysis that are still moderately troubling: the method of data recording, the spotlight effect, the presence of organizations able to record data, and the inherent clustering based on human behavior, or drivers such as conflict or environmental conditions. This initial analysis is really to understand the degree of clustering that I might expect in my future analyses – if there is high clustering, then I need to understand that clustering when I’m testing additional questions with my next exercise, like nearest road or nearest urban area. The images below demonstrates the result of an analysis settlement area in the “Selected OSM Boundaries,” “OSM Boundaries,” and “Regions of Interest.”

Moran I, Selection of OSM Boundaries

Moran I, Delineated Regions of Interest

Moran I, All OSM Boundaries


While OSM Boundaries and Selection from OSM Boundaries show high confidence of clustering, the “Regions of Interest” pattern shows a less confident result of clustering. This makes sense, given that both the OSM Selection and OSM Boundaries both have overlapping polygons and varying sizes.

Below is a map illustrating the distribution of refugee areas in northwest Uganda, where most of the settlements are concentrated.

  1. Critique of the method – what was useful, what was not?

Given my shortcomings with understanding statistical analyses, the interpretation of the results was most difficult for me. While my patterns appeared mostly clustered, the z-score and Moran’s Index showed changes when different parameters of comparison were chosen. Ultimately, I’m not sure if there would be a more effective spatial analysis to analyze aspects of these settlements that would prove more helpful in figuring out relevant information for moving forward with Exercise 2 and beyond.

Ex. 1: Spatial Autocorrelation in Stream Cross-Sectional Change In the Andrews Forest

Filed under: 2019,Exercise 1 @ 12:00 pm

Question that you asked

How much spatial autocorrelation exists in historic streambed changes in the Andrews Forest, both in the across-stream direction and the along-stream direction?

Name of the tool or approach that you used

I addressed each direction of the problems separately using. Since this resulted in one-dimensional, evenly spaced or faux-evenly spaced data, I just used the acf function in R to analyze the data.

Critique of the method – what was useful, what was not?

The methods provided an interesting look at patterns within the cross section data set. However, I think the one-dimensional approach fails to capture interactions between across- and along- channel changes as well as any temporal autocorrelation.

Description of steps you followed to complete the analysis and results you obtained.

For my project, I am looking at changes in the streambed along 4 stream reaches (black boxes) along two streams (blue lines) in the Andrews forest. The watershed of the larger stream, Lookout Creek is shown in black.

Researchers in the Andrews forest repeatedly surveyed fixed cross sections along the reaches from 1978 to 2011. In this exercise, I worked with a subset of the data from 1978 to 1998. The cartoon below does not contain real data but shows the arrangement of numbered reaches (blue lines) and their fixed endpoint stakes (blue dots) along a stream.

Part A – across-channel changes

Each individual cross section spans from bank to bank and show the topography and bathymetry across the width of the stream. Below are two consecutive years of cross-sectional profiles of the stream bed at Lower Lookout Creek, cross section # 3.

Profile from 1995.

Profile from 1996.

It looks like there was a lot of change between these two profiles. The active stream channel appears to have avulse from one side of the cross section to the other.

I compared the two profiles to see how much elevation was gained or lost.

Here grey line represents the surface of the stream bed in 1995, and the black line represents the surface of the stream bed in 1996. The area show in red represents material that was eroded away, and the green area represents material that was newly deposited.

We can simplify the plot above to show only the positive or negative elevation change across the channel. The figure below shows the same information as the figure above, but directly in terms of elevation change.

You can clearly see from this figure that this site at this year had one broad area of elevation gain or deposition, and one broad area of elevation loss or erosion. This isn’t precisely an attraction/repulsion phenomenon, but there can be positive feedbacks associated with erosion and deposition. For example, sediment deposited in a stream may locally obstruct or slow down the water passing by it and encourage deposition in adjacent areas.

Now, not every pair of years had this *much* elevation change, but also, not every pair of years shows such broad regions of positive or negative elevation change. Here is a figure of the elevation change at the same cross section between 1996 and 1997. You can see that elevation change is both smaller in magnitude and shows up in smaller patches.

I used spatial autocorrelation metrics to investigate the patch size. Note that autocorrelation at some scales is always expected when looking at topographic change. Even if you had extremely precise and accurate measurements, you would expect the stream bed change at one point to be strongly correlated with the change a few cm away both because the two points probably experience very similar forces and also because gravity tends to smooth out topographic irregularities in nature. But in most scenarios, you would probably not expect to find a correlation between the changes in a streambed with a change at a point 100 m away from the bank.

I picked two distances that seemed potentially interesting: 1 meter and 5 meters. Then I investigated autocorrelation at 1m and 5m lag distances using the acf function in R.

These lag distances are superimposed on the first change graph in the figure below.

I expected that all cross sections would have higher autocorrelation at the shorter lag distance than the longer one, but I also expected that a set of cross sections with very large patch sizes might show relatively higher autocorrelation at the larger lag distance compared to the small-patched cross sections. I need to spend more time looking into how the magnitude of the changes influences the autocorrelation function so that I can understand it better.

I repeated the process for every cross section at every year. Results are shown below.

The cross sections showed mostly positive autocorrelation at 1 m. Some years showed more autocorrelation across the board than others, and some individual cross sections seemed like they frequently had high autocorrelation values at multiple years.

Different patterns emerge at this spatial scale. At a 5 m lag, neutral or negative autocorrelation was much more common.


Part B – along-channel changes

I also wanted to look at autocorrelation between cross sections. Was the change at one cross section related to the change at cross sections immediately up- or down-stream?

Because I don’t have true geographic locations of the cross sections, I couldn’t do any fancy two dimensional work like interpolating the data into a river bathymetry model. Instead, I consolidated the cross-sectional change data into a single metric for each cross section.

Here is another look at the change plot I showed earlier. This is the same plot as before, but I’ve colored the positive values green to show deposition and the negative values red to show erosion.

I simplified this plot by removing the horizontal across-stream component and summing the red and green areas separately.

Then I had a choice: I could add the red and green areas together to show the total area of the cross section that changed in any way, or I could subtract the red area from the green area to show the net deposition in the cross section. Each of these metrics could be interesting for different reasons. For example, the total area of change might be relevant for ecological processes, but the net overal change could be relevant for tracking net sediment transport in these streams.

I started by adding the boxes together to calculate the total area of change.

In this way, I made a score for every cross section for every year pair. Hypothetical scores for the cartoon river are shown below.

Then I used the acf function again in R to determine how correlated adjacent cross section scores were with each other, compared to more distant cross section scores. This method is not fully spatial explicit, since I did not include any information about how far apart each cross section was from its neighbors. I just assumed that they were roughly equidistant within each reach. Indeed, the distance between cross sections can become hard to define when they are placed along curved parts of the stream.

I calculated these values with every year pair on each of the four reaches. I then repeated it with the subtraction method described above.

For the “addition” method, autocorrelation values were generally close to neutral with some reaches in some years tending towards positive values. Positive values could occur if changes in one reach directly influenced an adjacent reach (i.e. if upstream channel change influenced the hydraulics directly downstream) or if adjacent cross sections were influenced by similar other factors (i.e. two adjacent cross sections placed on a steep part of the reach might experience more similar changes to each other than two cross sections placed in a shallower part of the reach.)

The pattern is different for net change, where negative autocorrelation is more common. I think this is very interesting, because it implies that parts of the stream that had net deposition were more likely to be adjacent to parts of the stream that had net erosion. This could be an expression of a sink/source phenomenon where material scoured from one area is transported and deposited immediately downstream at the next cross section.

Exercise 1: Spatial Distribution of Juniper Saplings

Filed under: 2019,Exercise 1 @ 11:52 am

1.Question asked and data used: 

  • What are the spatial patterns of juniper sapling establishment within a treated watershed 14 years after most juniper were removed?
  • Two data sets were used for this analysis:
    • Thirty-three belt transects (30 m in length, 3 m in width) were used to assess the density of juniper saplings within the watershed (data in form of excel spreadsheet listing latitude and longitude with number of juniper per transect)
    • An orthomosaic created from UAV imagery of a small area within the watershed (data in form of multispectral raster, with brightness values for red, green, blue, near-infrared, and re-edge wavelengths)
  • Watershed map (National Agricultural Imagery Program image from 2016 shown) with belt transects and UAV study area:

2. Approach and tools used: A number of techniques were initially applied to explore the data and to examine which approaches were more (or less) effective.

  • Belt transects:
    • Kriging
    • Inverse Distance Weighted (IDW) interpolation
    • Spatial autocorrelation (Global Moran’s I)
    • Hot-spot analysis
  • Classified raster (UAV orthomosaic):
    • Hot-spot analysis

3. Steps used in analysis:

  • Slope and aspect calculation:
    • The slope and aspect tools (Spatial Analyst Tools) were applied to 30 m DEM (from This information was noted for hot spots and cold spots but not used for further analysis during this exercise.
    • The extract multi values to points was used to extract the slope, aspect, and elevation value
  • Belt transects:
    • Projected data into NAD 1983 UTM Zone 10N
    • The location of each survey was symbolized by color based on the number of juniper found.
    • Kriging (Spatial Analyst Tools and Geostatistical Wizard) was used for initial exploratory analysis to assess general spatial distribution across the watershed
      • The number of juniper per transect used as the z-value field
      • Histogram and QQPlot used assess distribution of the number of juniper per transect for the geostatistical layer
      • For the raster layer, the predicted values were compared to the observed values by extracting the values at each survey location
    • IDW interpolation
      •  Juniper used as input feature
      • Maximum neighbors:15; Minimum neighbors:10
    • Spatial autocorrelation (Global Moran’s I) (Spatial Statistics Tools)
      • Calculated using the number of juniper per transect
      • HTML file created lists Moran’s index, z-score, and p-value
    • Hot Spot Analysis (Getis-Ord Gi) (Spatial Statistics Tools)
      • Conceptualization of spatial relationships: Fixed distance band
      • Distance method: Euclidean distance
      • Default was used for the distance band
  • Classified Raster (orthomosaic):
    • Supervised classification (support vector machine) was used to identify juniper within the UAV orthomosaic
    • General procedure for classification: create training samples-> assess using interactive training sample manager and scatterplots->create .ecd file->apply classifier->apply Majority Filter tool
    • A point file was created from pixel clusters identified as juniper within the image
    • Hot Spot analysis (Getis-Ord Gi)

      • Conducted to assess areas of concentration of juniper saplings within the sample area
      • Integrate tool used to aggregate juniper data (5 m used for analysis, but other distances were initially tested)
      • Hot Spot analysis tool using aggregated data created in previous step as input layer (other inputs were the same as those used for the belt transect hot spot analysis)

4. Overview of Results:

  • Belt Transects
    • Kriging. I used two different approaches to kriging. First, I used the kriging tool under spatial analyst tools and then used the geostatistical wizard to calculate ordinary prediction kriging. The resulting maps using these two approaches were different. In particular, the use of the geostatistical wizard created maps more similar to the IDW while the geostatistical wizard created a map with different contours.
      • Related statistics:
        • minimum:0.05
        • maximum: 6.97
        • mean: 2.83
        • standard deviation: 0.86
      • Spatial Analyst Kriging Map:
      • Using the export values function, the predicted values of this kriging method at each belt transect site were very similar (within 0.05) to the observed values.
      • Ordinary Prediction Kriging:
      • The QQ plot indicates that assumptions of normality may not be met:
      • The ordinary prediction kriging also tended to overestimate the juniper density based upong the distribution chart:
        • Related cross-validation/error statistics for the ordinary prediction kriging:
          • Mean -0.0583700520511708
            Root-Mean-Square 2.09329560839956
            Mean Standardized -0.0193126589526469
            Root-Mean-Square Standardized 0.986330022079785
            Average Standard Error 2.09938534113134
    • IDW
      • While general trends were similar to kriging, the size and shape of contours between the methods were different.
      • Mean: 3.08, range: 0.00 to 7.00, standard deviation: 1.16
      • IDW
    • Spatial autocorrelation (Global Moran’s I)
      • Moran’s I based on juniper per transect: -0.019, with a p-value of 0.92
      • Indicates that the pattern of juniper density is considered random and not significantly different than a random distribution
    • Hot Spot analysis (Getis-Ord Gi)
      • One hot spot found (90% confidence): northwestern aspect (300 degrees) with 8.8% slope
      • One cold spot found (95% confidence): north-northwestern aspect (347 degrees) with 11.5% slope
      • Remaining points were considered insignificant
      • Map of Hot Spot analysis:
  • Classified Raster
    • Hot Spot analysis (Getis-Ord Gi)
      • Within the area covered by the orthomosaic, four hot spots were found:
        • One area with 99% confidence: 11% slope, 325 degree aspect
        • Three areas with 95% confidence: 3% slope, 254 degree aspect; 11% slope, 167 degree aspect; and 8% slope,137 degree aspect
      • UAV Orthomosaic Hot Spot map:

5. Discussion/Critique of Methods.

Based on the maps produced using the IDW and kriging methods, some general trends in juniper spatial distribution exist within this watershed. For example, we see lower densities in the north-northeastern areas and greater densities of juniper in the southeastern and western areas. However, both the IDW and the kriging raster using the spatial analyst tool produced the characteristic “bulls-eye” pattern often associated with IDW. When compared to the ordinary prediction kriging maps, different interpretations of juniper density in the watershed might be made. Compared to the belt transects data, the kriging created using the spatial analyst tool was similar to the observed values while the ordinary prediction kriging tended to overestimate the distribution. However, more ground data is necessary to determine how accurate these prediction would be in other areas.  One of the biggest takeaways from this is to carefully consider the approach used and the nature of the data (for example, the use of ordinary versus simply kriging, etc.). From a user standpoint, I found the geostatistical wizard the most useful approach — particularly as it made inspecting the statistics and semivariogram very easy. However, in the future I would explore different methods of kriging within this tool.

The spatial autocorrelation and hot spot tools were useful, although the results did not provide much significant information regarding the spatial distribution of juniper in the cases examined here. For future analysis, particularly when examining the relationship between juniper density and other watershed characteristics (e.g., slope, aspect, etc.) these tools may become more important. In the case of the UAV orthomosaic, this provided a “test run” for what will be a larger dataset. The steps taken prior to analysis, particularly creating a point layer from the classified pixel clusters, will be time intensive and may require alternative approaches. At the limited scale of the UAV orthomosaics these hotspots did not provide much useful information, but if extrapolated over a larger area more patterns may be observed.

The distribution of juniper saplings in this case may be associated with a number of factors. Juniper seeds may be spread through birds or wind, resulting in the spread of juniper saplings throughout the watershed. At a much larger scale, if seed sources are less available, this distribution may be more localized. This pattern may also vary with the presence of mature juniper. As previously discussed some patterns emerged in this data (lower densities in the northern areas of the study area, for example) which may be associated with the spatial distribution of other characteristics, such as soil moisture. For example, sources and areas of higher juniper may include areas were we anticipate higher soil moisture such as flatter slopes with northerly aspects. These factors will be assessed further in exercise 2.


April 18, 2019

Spatial variation of economic losses resulting from a joint earthquake/tsunami event: An application to Seaside, OR


What is the spatial variability of economic value of buildings as well as losses resulting from a joint earthquake/tsunami event? How does this spatial variability relate to independent earthquake and tsunami events, as well as the intensity of the hazard?

The purpose of this analysis was to consider the spatial variability of initial economic value, as well as economic losses resulting from a joint earthquake/tsunami event. The losses were deaggregated by hazard (earthquake only, tsunami only, joint earthquake/tsunami), as well as intensity of the event (100-year, 250-year, etc.).

Tool and approach

Two methods were implemented to evaluate the spatial variability economic value and losses: (1) interpolation via kernel density, and (2) a hotspot analysis using the Getis-Ord Gi* statistic. The economic losses were determined using a probabilistic earthquake/tsunami damage and loss model. This model implements Monte-Carlo methods to estimate the expected economic losses following an earthquake/tsunami event. For this application, 10,000 simulations were ran, from which the average loss at each building was computed for earthquake only, tsunami only, and a joint earthquake/tsunami event.

Description of steps

The average losses at each building resulting from the earthquake/tsunami loss model were added as an attribute to a GIS shapefile. Two methods to evaluate the spatial distribution were considered:

  1. Interpolation via kernel density: Spatial interpolation was performed using a kernel density estimate. A kernel with a size proportional to the value of the attribute under consideration (in this case economic value/loss) is placed at each data point. A map is then created by taking the sum of all kernels. The kernel radius and shape can vary to produce different results. In this analysis, a quartic kernel was utilized with a radius of 200 meters. The interpolation was performed using the built in interpolation feature in QGIS 3.
  2. Hotspot analysis using the Getis-Ord Gi* statistic: Hotspot analysis was performed using the Getis-Ord Gi* statistic. This statistic results in a p-value and z-score at each attribute, providing insight into whether the null-hypothesis can be rejected (in this case, spatial randomness). As such, features with a small p-value and a very large (or very small) z-score indicate that the null can be rejected (or that the data is not spatially random). Consequently, applied across an entire spatial dataset, the hotspot analysis identifies statistically significant clusters of high (or low) values. The hotspot analysis was performed using the available Hotspot Analysis plugin for QGIS 3.

Description of results

The results of the analysis are shown in Figure 1. The columns correspond to interpolation and hotspot analysis respectively. The first row shows the building values, whereas the second shows the economic losses resulting from a joint earthquake/tsunami event (2,500-year return period).

Areas of high economic value and losses can be easily observed from the interpolation analysis. Here, areas of red correspond to larger damages. Similarly, statistically significant clusters of large (and small) damages can be observed from the hotspot analysis. Again, red corresponds to a statistically significant hot spot (e.g. a cluster of large values and losses), whereas blue corresponds to a statistically significant cold spot (e.g. a cluster of small economic values and losses).

A large concentration of economic value is centrally located along the coast, and is due to the presence of resorts and condominium complexes. This area is observed from both the interpolation and hotspot analysis. Interestingly, more clusters are observed from the hotspot analysis as opposed to the interpolation. This could be explained by the scaling of the interpolation. In this case, the red regions correspond to a maximum value of $20M. If this value was reduced by half, more areas of high concentration would be observed.

The hotspot analysis provides insight into statistically significant clusters of high and low values, as opposed to single points of high values; however, when comparing interpolation and hotspot analysis, it should not be neglected that the results of the latter are visually more difficult to observe. This is due to the discrete nature of the Getis-Ord Gi* statistic (e.g. each point corresponds to a p-value and z-score, as opposed to the continuous surfaces of interpolation). This results in polygons that are shaded according to confidence levels.

Figure 1: Comparison of interpolation and hotspot analysis for both initial building value and economic losses

In addition to the initial value and economic losses resulting from the 2,500-year earthquake/tsunami event, interpolated maps were deaggregated based on hazard (earthquake only, tsunami only, combined) as well as intensity of the event (return years 100, 250, 500, 1,000, and 2,500). The results are shown in Figure 2, where each row corresponds to the hazard, and each column to the intensity of the event. Furthermore, the total economic losses to all buildings in Seaside were determined based on hazard and intensity (Figure 3).

Figure 2 shows that the economic losses are spatially consistent across Seaside for the 100-year event, and begin to exhibit spatial variability as the intensity increases. Losses begin to accumulate for the 250-year event near the center of Seaside, and it can be seen that the earthquake is the primary driving force. Similar trends are observed for the 500-year event. The 1000- and 2500-year events begin to see significant tsunami losses that are not as spatially concentrated as the earthquake losses, but are more evenly distributed along the coast. Figure 3 shows that the tsunami losses begin to dominate for the 1000-year event.

Figure 2: Earthquake and tsunami hazard deaggregated by hazard and intensity

Figure 3: Total earthquake and tsunami damages across Seaside, OR


Both the interpolation and hotspot analyses have limitations. As previously mentioned the hotspot analysis can be aesthetically challenging. Additionally, difficulties may arise in communicating the confidence levels to community planners and resource managers who may not have a statistical background.

Similarly, spatial interpolation via kernel density has its own limitations. As there are subjective options when performing the interpolation and viewing the results (e.g. radius, color scheme, and maximum values), the resulting maps could easily be deceiving. Figure 4 shows the same data but use of a different radius to define the kernel. It can be seen that the map on the right appears more severe than the map on the left. The practicality of a spatial interpolation map ultimately depends on the GIS analyst.

Figure 4: Comparison of interpolation resulting from different kernel radii.

April 17, 2019

Courtney’s Ex 1: Spatial patterns of ion concentrations in groundwater

1: Question being asked

How do ion and isotope concentrations vary spatially? And what trends in concentrations account for variability between my samples?

For detailed background information, I point you towards my problem statement for this class. In short, I’m hoping to use geochemical and isotope values for wells 32 well I sampled last summer in order to better understand how groundwater flows across faults in the Walla Walla Subbasin in northeastern Oregon. In the analysis for this post, I only used a subset of 18 wells for which I had the largest suite of measured parameters.

2: Approach used

For this exercise, I’m investigating spatial patterns of my ion and isotope concentrations. I could do this by individually examining distributions of all 19 of the parameters that I sampled for… but that would be graphically overwhelming and not terribly useful. It’s too difficult to draw conclusions by comparing 19 different maps of distributions.  Instead, I’m using a method called principal component analysis (PCA for short) to statistically evaluate my samples by the groups of parameters that account for large amounts of the variation that differentiates one sample from another.

The nuts and bolts of PCA involve sophisticated analysis that calculates eigenvectors, linear algebra, and n-dimensional properties of a data set, and I will let people more knowledgeable than me explain this if you’re curious about how exactly PCA works.

R spits out two kinds of useful graphic output for PCA based on the concept of a “biplot”. The first is called a loading plot, where two principal components are graphed against each other, for example PC1 and PC2. The value for each parameter in PC1 and PC2 is used to give the point a cartesian coordinate which defines a vector in comparison to the origin. The second kind of output is called a scree plot, which applies the same concept to the sample points (called “individuals” in the code) instead of to the  parameters (called “variables” in the code). I used an R package called ggbiplot to combined the scree and loading plots in one figure.

3: Methods

I used R to calculate the principal component vectors for my data and to create charts for them, and then symbolized my data in ArcGIS Pro to see if there was a spatial pattern to the information in the PCA output. The code in R is fairly manageable. I’ve included it below with my comments.

# Source:

# Preliminary set up stuff – add libraries, set working directory



setwd(“~/_Oregon State University/Thesis/Summer Fieldwork/PCA”)

# Before importing your data, make sure that every entry in its table has a numeric entry in it.

# This process will spit out garbage if you have any blank cells or text.

# The number of rows in the data set needs to exceed the number of columns. I used an argument like [,c(1:11)] to enforce this in my data

#Process 1 of 2: read data set 1 into R, view it, compute PCA, display summaries and biplots

# it’s important to set the row names so we can use them as labels in ggbiplot

PCAdata.ions2 <- read.csv(“ions_allchem20190401_usgsdwnld.csv”, row.names = 1)


Ions2.pca <- prcomp(PCAdata.ions2[,c(1:11)], center = TRUE, scale. = TRUE)


ggbiplot(Ions2.pca, choices=c(1,2), labels=rownames(PCAdata.ions2))

ggbiplot(Ions2.pca, choices=c(1,3), labels=rownames(PCAdata.ions2))

#read data set 2 into R, view it, compute PCA, display summaries and biplots

PCAdata.nonions <-read.csv(“NonIons_allchem20190401_usgsdwnld.csv”, row.names = 1)


NonIons.pca <- prcomp(PCAdata.nonions[,c(1:8)],center = TRUE, scale. = TRUE)


ggbiplot(NonIons.pca, labels=rownames(PCAdata.nonions))

ggbiplot(NonIons.pca, choices=c(1,3), labels=rownames(PCAdata.nonions))

# if you want plots other than PC1 vs PC2 or PC1 vs PC3 just change the argument “choices = (1,2)”

# Process 2 of 2 add tools to get individual coordinates

# Source

# only install the package if you haven’t before. You will need to have RTools 3.5 already installed

# install.packages(“devtools”)

# library(“devtools”)

# install_github(“kassambara/factoextra”)

# now run things, it will spit out tab-delimited data. The “ind” command outputs PC information for the individuals, and the “var” command outputs the PC information for the variables.


ind <- get_pca_ind(Ions2.pca)


ind2 <- get_pca_ind(NonIons.pca)


# copy that info into text files, import them into csv format, join in Arc!

var1 <- get_pca_var(Ions2.pca)


var2< get_pca_var(NonIons.pca)


Next, I symbolized my data in ArcGIS Pro. Because I only had 18 samples, it wasn’t too onerous to use selection in the attribute table to export data to another layer to symbolize the groups I saw in the principle component analysis biplot.

To examine how the principle components affected individual points, I used the second process in the above code to generate a tab-delimited chart of the principal component loadings for each sample. I copied this into a .txt file, imported it into Excel to create a CSV, and then joined that CSV to my shapefile of well locations. For the variables’ coordinates, I likewise created .txt files and imported the into Excel, but I symbolized them in Excel.

4: Results

In this section, I’ll show the graphs and maps that I created for the ion section of my data, and give an explanation of what I learned from this analysis.

For my ion data, the first two out of the 11 principle components explain about 80% of the variation in my data.

The first one, PCI, explains 55% of the variation in my data set and is shown on the X axis of the graph below. It is strongly influenced by concentrations of calcium, magnesium, chloride, alkalinity, sulfate, bromide, and sodium, which all have eigenvalues on the X axis greater than |1|. Comparing the scree plot to the loading plot, wells 56287, 3074, 4179, and 4167 are the primary individual drivers of this PC.

PC2 explains another 25% of the variance in my data, and is shown on the Y axis. Fluoride, silicon, and manganese are the primary variable drivers of variation between the individuals in PC2.

CvS PC1 PC2 ions biplot with groups

Based on the coordinate information for the variables, the variation explained by PC1 is due to increased levels of calcium, magnesium, potassium, sodium, alkalinity, bromide, chloride, and sulfate at four wells. This combination of ions all being similar could be explained by recent recharge influenced by agricultural chemicals, such as fertilizers. In an undisturbed basalt setting though time, calcium and magnesium tend to decrease as sodium and potassium increase. The fact the these four parameters vary together in PCI indicates that a process other than ion replacement is driving ion concentration. The variation in PC2 is predominantly explained by fluoride, dissolved silica, and manganese. Elevated concentrations of these three variables in comparison to the other variables indicates that the groundwater has spent more time in the basalt.In PC2, calcium and magnesium are inversely related to sodium, indicating that the cation concentrations are driven by the ion replacement process.

While the figure above locates wells based on their grouping on a biplot comparing two principle components, the following figures show the wells color-coded by their loadings in PC1 and in PC2. I’m hypothesizing that wells with smaller PC1 values are influenced by agricultural recharge, and wells with more positive values on PC2 are more influenced by the processes of ion dissolution as water travels through the basalt.

color-coded map of PC1 values for the 18 wells

PC1 is heavily driven by four particular wells, which show up here as being in colors of blue and blue-grey.

The distribution of PC2 values shows a pattern that I find really interesting. More negative – bluer – values indicate samples which were less influenced by the ion replacement and dissolution processes of groundwater traveling through the basalt. The values at high elevation are negative, which makes sense because they’re closer to the upland recharge zone, but so do some samples in the lowlands to the north which are downgradient of certain faults. This could indicate that those PC2 negative downgradient wells are cut off from the longer flowpaths of groundwater by faults that act as barriers to groundwater flow. The pinker – more positive – points indicate samples that were more influenced by the along-flowpath ion reactions.

Next, I present the PCA results for the non-ion parameters. Unlike the “ions” set whose title was somewhat self-explanatory, this PCA brought together different types of chemical and physical properties of water. These include pH, temperature, and specific conductivity which were measured in the field,  the analyzed values of hydrogen and oxygen isotopes as well as tritium, the elevation of the bottom of the well, and the cation ratio. The cation ratio is a standard calculated parameter for groundwater chemistry which is equal to ([Na] + [K]) / ([Ca] + [Mg]). As water spends more time in contact with the rock, its ion exchange with the rock leads to a smaller cation ratio. The first biplot graphs PC1 (~40% of variation) on the X  axis against PC2 (22% of variation) on the Y axis, and the second biplot graphs PC1 on the X axis versus PC3 (~18% of variation) on the Y axis.

The PC1 grouping explains variance among the samples by inversely correlating tritium, deuterium (d2H) and oxygen-18 (d18O) with pH and the cation ratio.  More positive PC1 values show that the individual has high d18O and d2H values, and low cation ratio and pH values. All d18O and d2H values are negative; values that are closer to 0 (“larger: for the purposes of this plot) are more enriched in the heavier isotope. PC2 is predominantly driven by an inverse correlation between well bottom elevation and temperature. PC3 inversely correlates tritium, specific conductance, and well bottom elevation with temperature, deuterium, and 18-oxygen.

Yellow/orange points on this map have more positive PC1 loadings. They have higher d18O and d2H values, and lower cation ratio and pH values. The lower cation ratio and pH values indicate that the sample is less chemically evolved based on the expected ion exchange reactions in the basalt. I found it interested that samples on that end of the spectrum were also more enriched in the heavier isotopes of hydrogen and oxygen.

Unfortunately I can’t seem to upload the maps for PC2 and PC3 for non-ions, I’m getting a message that all of my allowance for image uploads has been used up…

5: Critique

I’m pretty happy with what I put together using principal component analysis. In a perfect world I would have learned to do the spatial visualization elegantly in R instead of messing around with exporting data to Arc, and made my variable charts in R instead of exporting them to excel. However for the purposes of this project the less elegant method was quicker for me.

Exercise 1: Ventenata spatial clustering

Filed under: 2019,Exercise 1 @ 1:56 pm
Tags: , ,

Question Asked

I am interested in understanding the invasion potential of the recently introduced annual grass ventenata (Ventenata dubia) across eastern Oregon. Here I ask, what is the spatial pattern of the ventenata invasion across the Blue Mountains Ecoregion of eastern Oregon?

Tools and Approaches Used

To address this question, I (1) tested for spatial correlation at various distances using Moran’s I spatial autocorrelation coefficients plotted with a correlogram, and (2) performed hot-spot analysis (Getis-Ord Gi) to identify statistically significant clusters of areas with high and low ventenata cover.

Description of Analysis Steps

1a) Moran’s I: To compute Moran’s I spatial autocorrelation coefficient for all of my sample units, I used the “ape” package in R version 3.5.1. The first step to this analysis was to convert the ventenata data and associated coordinates into a distance matrix. Once the distance matrix was created, the Moran.I function computed the observed and expected spatial autocorrelation coefficients for the variable of interest (ventenata abundance). The function produces a test statistic that tests the null hypothesis of no correlation. See Gittleman and Kot (1990) for details on how the Moran.I function calculates Moran’s I statistics.

1b) Correlogram: I plotted a correlogram using Moran’s I coefficients with increasing distances (lags) to examine patterns of spatial autocorrelation in my data. I used the correlog function in the spdep package in R to plot a correlogram with lag intervals of 10,000m. The function has the option of randomly resampling the data at each increment to incorporate statistical significance. This randomization tests the null hypothesis of no autocorrelation. I ran the function with resamp = 100. Black points on the correlogram are indicative of Moran’s I values significantly larger or smaller than expected under the null hypothesis.

2) Hot Spot Analysis: I used the hot spot analysis (Getis-Ord Gi*) tool in Arc GIS to identify statistically significant clusters of areas with high and low ventenata cover across my study area. The tool produces z-scores and p-values that test the null hypothesis of a random distribution of high and low values rather than clusters of high or low values. High z-scores indicate clusters of high values and low z-scores indicate clusters of low values. Low p-values indicate that these clusters are more pronounced than would be expected by chance.


1a) Moran’s I: The Moran’s I spatial autocorrelation coefficient estimate for all of the points across the entire sample area was 0.3 ± 0.05 (p < 0.3). This value is not particularly informative, as it only indicates that the data is positive spatially autocorrelated, but does not provide information to describe the spatial pattern. I chose to follow the Moran’s I up with a correlogram to uncover the spatial pattern driving the autocorrelation.

1b) Correlogram: The Moran’s I spatial correlogram shows a general trend of decreasing autocorrelation from 0 to about 70,000m where sudden jumps in Moran’s I values occur to up to ~0.3. Following this jump, the correlation decreases to -0.5 to -0.2 between 120,000 and 152,000m, then increases to ~0.3 at 170,000m, decreases to almost -1.0 just after 200,000m, and finally increases to almost 1 at 220,000m. The general trend appears to be decreasing from 0.2 to -0.9 at 220,000m with some high peaks interspersed. These high and low peaks indicate distinct ventenata patches distributed throughout the study area, suggesting a clustered spatial pattern of the ventenata invasion. The extreme high and low values at distances over 200,000 are likely a result of the few sample units being compared at these distances, thus these are not so informative of the overall spatial pattern.

2) Hot Spot Analysis: Hot spot analysis in ArcGIS depicted clusters ranging from high ventenata cover (large red circles) to low ventenata cover (small blue circles) across my study area (Fig. 2) using the calculated z-scores and p-values for each sample unit. The resulting map shows distinct clusters of high, low, and moderate ventenata cover distributed across seven sampled burn perimeters (displayed in light orange). The highest cover clusters are all located within the Ochoco and Aldrich Mountains in the center of the study region. The fires on the perimeters of the region exhibited clusters of low to no ventenata cover.

Critique of Methods Used

When run on all of the data across the entire region, Moran’s I did not produce a useful statistic, indicating only if the data was spatial autocorrelation without indicating a spatial pattern. However, when visualized with a correlogram at varying distances, the correlation coefficients suddenly told a story of spatial clustering. The results from the hot spot analysis reinforce the findings from the correlogram by clearing depicting clusters on a map of the study area. The hot spot analysis further explores these results by mapping the clusters of high and low ventenata cover on top of each of my sample units, providing a useful visualization of exactly where the clusters of high and low cover fall across the region.


Gittleman, J. L. and Kot, M. (1990) Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology39, 227–241.


Exercise 1: What is the spatial pattern of western hemlock dwarf mistletoe at the Wolf Rock reference stand?

For Exercise 1, I wanted to analyze the spatial pattern of western hemlock dwarf mistletoe infections in live western hemlocks on my 2.2 ha reference stand (Wolf Rock). This was without considering any attributes of the western hemlock trees themselves. Simply, what was the spatial pattern of infection?

To answer this I used the “Average Nearest Neighbor” tool in the Spatial Statistics toolbox in ArcMap. This tool calculates a z-score and a p-value from that z distribution. This is a commonly used method in dwarf mistletoe literature for assessing the clustering of infection centers. Also, the equations for this tool assume that points are free to locate wherever in space and that there are no barriers to spread.

ArcMap makes running these analyses very simple so I created a selection of infected trees (red dots), created a new feature, and then ran the tool. The p-value from my test was 0.097 and my Nearest Neighbor Index was 0.970, indicating that the spatial pattern of the infections are somewhat clustered with an alpha of 0.10.

Average Nearest Neighbor is a good test for analyzing whether or not a set of coordinates are clustered. The degree of clustering of may be harder to interpret as a lower p-value may not necessarily mean points are more clustered. Also I was unable to see where my clusters are, and if my intuitions match the analysis (see map). One other important consideration is the study area. Changes in analysis area can drastically change the result of your clustering analysis (i.e. larger study areas may make data look more clustered). Lastly, there was no option for edge correction. This may have skewed some of the clustering results along the edge of my study site and 2.2 ha is pretty small to be subsampled without losing a lot of my data.


After confirming that my infections were clustered, I wanted to see if the pattern I saw in my map, was actually on the ground. I wanted to know, where are infected trees clustered with infected trees and where are uninfected trees clustered with uninfected trees? Again, this was without considering any attributes of the western hemlock trees themselves.

I used the “Optimized Hot Spot Analysis” tool in the Mapping Clusters toolbox to analyze the incidence of infection data (0 = absence, and 1 = presence). The Optimized Hot Spot Analysis tool can automatically aggregate incidence data that are normally not appropriate for hot spot analysis. It also calculates several other metrics for me that made analysis easy. I could take these automatically calculated metrics and alter them in a regular hot spot analysis if needed.

This map displays clustering that matched up closely with my intuitions from Map 1. On the left, the blue values show a cluster of uninfected trees that are closely clustered with other uninfected trees. The larger swath on the right show a cluster of trees that are closely clustered with other infected trees. In the middle a mix of uninfected trees and infected trees are mixed without displaying any significant clustering. Lastly, small clusters in the top left and bottom left of infected trees were identified. These clusters may be edge of larger clusters outside my stand, or lightly infected trees that are starting a new infection center. These results will be extremely valuable in informing my steps for Exercise 2 because I can assess the conditions of both patches and determine differences between the two. I can also determine if distance to the refugia impact the clustering of infection because it appears the infected cluster is closer to the fire refugia.

The hot spot analysis was extremely useful for analyzing and displaying the information I needed about the clustering and was very useful for building off of the Average Nearest Neighbor analysis.

My data set also included a severity rating for dwarf mistletoe infected western hemlocks in my study site. I ran a similar hot spot analysis to above to determine if there were any similarities with how severity played out in the stand compared to solely incidence data. My data ranged from 0 – 5, 0 indicating uninfected trees and 5 indicating most heavily infected. These are classified data, not continuous but still appropriate for the optimized hot spot analysis. Western hemlock dwarf mistletoe forms infection centers, starting from a residual infected western hemlock that survived some disturbance. From there the infection spreads outwards. Another facet of infection centers is that the most heavily infected trees are almost always aggregated in the center of the infection center and infection severity decreases as you move towards the outside of the infection center. This is intuitive when you think about infected trees in terms of the time they’ve been exposed to a dwarf mistletoe seed rain: the trees in the center of the infection center likely have been exposed to infectious seed the longest. These trees can be rated using a severity rating system that essentially determines the proportion of tree crown infected. This is calculated in a way that gives a rating that is easily interpretable, in this case, 0-5.

This third map tells me about how severity is aggregated in the stand. I can see that the wide swath in the middle of the stand, associated with the fire refugia, has the largest aggregation of severely infected trees. This is what I expected in the stand because the trees in the fire refugia survived the fire and provide an infectious seed source for the post-fire regeneration. Also, on the edges of this high severity cluster, are lower severity values indicating the expected pattern of infection centers are playing out. The west side of the stand shows a large clustering of low severity ratings. We can see that the high density of uninfected trees, falls into our cold spot of low or no severity. Interestingly, the hot spot of trees found previously  in the southwest corner, is actually a cluster of low severity trees. This may be a new infection center forming or an exterior edge of another infection center outside the plot.  Lastly, the two pockets of low severity on the east side of the stand are more distinct when considering their severity.

This second application of hot spot analysis tells another story about my data and how dwarf mistletoe is patterned spatially. The non-significant swath in the center of my stand using the incidence data turns out to be a significant clustering of highly infected trees among other new observations.


Exercise 1: The Variogram


How does the variance between the depths to the first lava flow in my filed area vary with increasing distance?

The Tool: Variogram

I used a variogram to analyze the variance within my dataset. Variograms are discrete functions calculated using to measure the average correlation between pairs of measurements at various distance (Cameron and Hunter 2002). These distances are referred to as the binned distances (Anselin, 2016). In this study, binned distances determine the distances by which the depth to first lava flow is autocorrelated (Aneslin, 2016).

variog(geodata,coords = geodata$coords,data = geodata$data,max.dist=”number”) (R documentation)

The R code above need the geodata, an array of the data you are testing, the cords, or the coordinates those data correspond too.

Brief Methodology:

I selected data based on the quality of the descriptions, in the well log and assigned each well log I have interacted with a value from 0 to 3. Data with a score 0 represents either a dry well or a destroyed well. Data that has a score of 3 is well described and denotes either the depth to first water or the standing water level. I used data with a score of 3 for this analysis. I denoted the depth to the first lava flow in each of the well logs.

Figure 1: My field area, the well log locations are the blue circles, a rough delineation of my field area is in while.

First I normalized my data, giving a mean of 0 and a standard deviation of 1. Then I determined a max distribution, which determines the max bin size the variogram uses. The max distribution determines the lag increment, or the distances between which the increment is calculated (Anselin, 2016).

The data are projected, meaning that their horizontal distance is measured in meters. However, they are recorded in decimal degrees and located at the center of the township and range section that they are located in. Rather than convert my data from decimal degrees to meters, I recognized that there are around 111 km in 1 degree (at the equator), that there are 0.6214 miles in a kilometer and that there are 6 miles in a township. This helped me determine the max distribution for the variog function in R. I decided upon a max distributions of 0.09 and 0.5 degrees.
I then used R’s variog function on the normalized data with the max distribution of 0.09 and 0.5 degrees.

Succinct Results:

Figure 2: Variogram of the MLV-LP-BVM triangle with a max distribution of 0.09 degrees. Note the low semivariance with the lower bin size (.02 to 0.08 degrees).

Figure 3: Variogram of the MLV-LP-BVM triangle with a max distribution of 0.5 degrees. Note the low saw-toothed pattern of semivariance. From 0.01 to 0.08, semivariance is low, it spikes up, and lowers again around 0.2, spikes again at 0.3 degrees and lowers again at 0.4.


I tested max bin sizes of 0.5 and 0.09 degrees to see how the variogram changes with an increasing bin size. Changing the max bin size, called the max distribution in R, changes the variogram. The smaller bin size, 0.09 degrees, limits the max bin size to a narrow range of values. In effect, the variogram only tests the covariance of data points that are separated by a maximum .09 or degrees. Increasing the bin size increases the distance around which the covariance of the data points are tested. Thus, mad distributions of 0.5 result in a spikey plot. Normally, one would expect the variogram to plateau at larger bin sizes, representing large variance with the data with larger distances, but figure 2 does not.

At its simplest, the variogram in figures 1 implies that data points that are correlated in space are more likely to have similar depths to the first lava flow. You can see this in the low variance in you find in locations that are close to each other, the smaller distances, and the higher variance in distances that are farther from each other.

Figure 2, with the max distribution of 0.5 degrees one could made an argument for a distribution of the locations of lava flows based on the locations of volcanic centers. Medicine Lake Volcano lies in the North of the study area and Lassen Peak lies in the south. The two spikes in variance with location might be linked to the distances between those two centers. In other words, distances that correspond to low values of semivariance (>1) correspond to either regions that lie on the same lava flow, or another near surface lava flow sourced from another volcanic center in the region (figure 3). Rather than finding the same lava flow at depth, we are seeing different lava flows at similar depths.

Figure 4: My field area with the rough delineations of two of the near surface lava flows in the region. 1 degree of longitude corresponds to roughly 111 km.

Lava flows are not attracted or repulsed to or from each other, but they do follow the laws of physics. Often, Volcanoes build topography, when lava erupts from them, the lava will flow from the high elevations of the volcano, to basins, using paleo-valleys as conduits for flow. Thus, if you know the paleo topography you can understand where a lava might have flowed, and where it might emplace. Predicting paleo-topography can be difficult in old volcanic regimes, but on the geologic timescales we are looking at, I can predict that the topography of the MLV-LP-BVM triangle has not changed much over the past 5 ma.

Lavas flows from high to low topography and from Volcanos. The two main volcanos in my field area are Lassen Peak in the South and Medicine Lake Volcano, in the North. Moreover, lava flows from high to low elevation. Lavas emplace in basin, if they sit long enough, the top soils form, the basin subsides, and another the cycle repeats.

My data does have different spatial patterns at different scales. If you look at regions that are within 0.1 of a degree of distance then you would expect to see a similar depth to the lava flow. If you move to 0.5 of a degree, you see a sees-saw effect, where the depth to the lava flow moves from having lava close to the surface of deeper down. This variation stems from the proximity of the data point to the sources and the sinks of the lava flow.

I would use this technique again, it helped me think about my problem.

Cameron, K, and P Hunter. 2002. Using Spatial Models and Kriging Techniques to Optimize Long-Term Ground-Water Monitoring Networks: A Case Study. Environmetrics 13:629-59.

Anselin, Luc. “Variogram”. Copyright 2016. Powerpoint File.

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

Next Page »

© 2019 GEOG 566   Powered by WordPress MU    Hosted by