GEOG 566

         Advanced spatial statistics and GIScience

Archive for Tutorial 1 2017

June 1, 2017

Processing Multispectral Imagery from an Unmanned Aerial Vehicle For Phenotyping Seedlings in Common Garden Boxes: Part 1

Filed under: Tutorial 1 2017 @ 6:11 pm


A total of three common garden sites were imaged by flying a UAV (Tarot 810 hex-copter) over the boxes at 15m AGL. All three sites are within the Kaibab National Forest area near Jacob Lake, Arizona, USA (Figure 1). The aircraft carried two sensors on a servo-powered 2-axis gimbal. The first sensor, a Micasense Rededge multispectral imager, simultaneously records images in 5 discrete spectral bands as seen in Figure2. The second sensor, a FliR Tau 2 longwave infrared thermal camera, records images in the 7.5-13.5 µm spectral band. Sensors were programmed to trigger so that images would achieve 70% overlap and sidelap. Because of the similarities between the three locations, this tutorial will focus on only one of the sites, named “Little Mountain”. For this site, one flight resulted in 540 multispectral images (108 per band) and 125 thermal infrared images that were used for the processing and analyses that follow.

Figure 1: The approximate location of the common garden boxes, about 70 miles north of the Grand Canyon’s North Rim in Northern Arizona, USA.

Figure 2: The designations and spectral characteristics of bands associated with Micasense Rededge sensor

Orthomosaic Reconstruction

Due to the large number of raw images, the processing workflow could be streamlined by consolidating them into orthomosaic raster files using Agisoft Photoscan. Conveniently, the entire workflow in the program is located under the ‘Workflow’ tab on the top/center of the screen. This includes the first step, using either the ‘Add Photos’ or ‘Add Folder’ button to load in the raw imagery. Once the photos have been added, the program will give a prompt and the user should select ‘Create Cameras From Images’ for most datasets. In the case of multispectral data, however, select ‘Create Multispectral Cameras From Images’. This feature allows for simultaneous processing of all 5 bands from the multispectral sensor. Now, you can verify the sensor specification information by clicking on ‘Camera Calibration’ under the ‘Tools’ tab. Make sure that the camera type, pixel size, and focal length inputs are correct. This will save extra steps later.

Next, select ‘Align Photos’ under the ‘Workflow’ tab. This step is quite computationally demanding, so prepare for long processing times. Also, as the number of photos in the reconstruction increases, the processing demands increase exponentially. To test the adequacy of your imagery for coverage, adjust the accuracy to a lower level and check if all the photos are registered in the alignment. If some of the photos are missing, it is likely due to insufficient overlap and/or sidelap. For a very high-quality reconstruction, set inputs to match those in Figure 3. If the highest possible resolution is required, increase key point limit to 200,000 and increase tie point limit to 40,000.

Figure 3: Input parameters for very high resolution initial reconstruction. To lessen computational demands, lower accuracy levels can be used. For highest resolution, increase key point limit to 200,000 and tie point limit to 40,000.

Once the initial alignment is complete, use the ‘Optimize Cameras’ button on the Reference Toolbar to increase the accuracy of the camera positions. Use the three-axis orb to move your reconstruction around in the viewing window. Toggle camera visibility with the ‘Show Cameras’ button on the Main Toolbar. The next step in processing is to use densify the point cloud using ‘Build Dense Cloud’ under the ‘Workflow’ tab. Use the inputs in Figure 4 for highest possible quality. The computing requirements for this step are less than for the initial reconstruction, so reducing the quality will not have great impact on processing time.

Figure 4: Input parameters for building dense point cloud based on sparse cloud from initial photo alignment. Lowering quality level in this step will not result in much change in the processing time required.

Next, use the ‘Build Mesh’ (Figure 5) tab to make a mesh surface from the dense cloud. Increasing the face count field is necessary when creating detailed surface models, but in this case is not necessary. Finally, use  ‘Build Orthomosaic’ (Figure 6) to create the raster for export. Use ‘Top XY’ in the projection plane field to generate a nadir perspective image. To export the file, follow the path shown in Figure 7. Use TIFF format for easiest transition into ArcGIS and buffer creation steps that follow. The finished orthoimage can be seen in Figure 8.

Figure 5: Inputs for building mesh from dense cloud. Increase face count to generate more detailed mesh surface when planning to create surface models from data.

Figure 6: Input parameters for building orthomosaic image from mesh surface. This configuration will produce a nadir perspective orthophoto due to the ‘Planar > Top XY’ projection.

Figure 7: Location for exporting orthomosiac in image file formats. Exporting as TIFF will allow for raster viewing and manipulation using ArcMap subsequent steps.

Figure 8: The final product from image pre-processing:  an orthomosaic image of the area of interest. These common garden boxes are in the Kaibab National Forest, Arizona, USA.

Buffer Creation

Due to dense spacing within the boxes, a protocol was needed to identify where the plant centers were located, and ultimately, which pixels from the raster could be assigned to each seedling as part of its crown area. The first step toward accomplishing this was also the most tedious: to make three new shapefiles; one with points at the center of each seedling, another with points at the center of each box, and one with a single point at the center of the plot (Figure9). These shapefiles are used to calculate the spatial variables for the dataset displayed in Figure 10: distance to nearest neighbor (nn), distance to box center (bc) and distance to plot center.

Figure 9: Manually drawn points representing plant centers (yellow), box centers (red) and the plot center (green).  Note that the upper left box was omitted from analyses due to blurring effects in the reconstructed orthoimage.

A. B.  C.

Figure 10 (A-C): Visual representations of the spatial variables produced during buffer creation workflow. Buffer circles have been scaled by size according to distance to nearest neighbor (nn), center of box (bc), and center of plot (pc).

Circular buffer zones were drawn around plant centers using the formula  where R is the radius of the buffer and nn is the distance to the nearest neighboring point. This method was chosen over choosing an arbitrary uniform buffer dimension (i.e. 2 cm radius) due to its objectivity and repeatability. The new buffer layer represents the extent of the seedling crown areas, but still lacking is a filter to exclude non-vegetation pixels. The solution to this problem is in the next section.

Vegetation Masking

To ensure that my spectral analysis would only include values from ‘vegetation’ pixels, a mask layer would be needed. Making this layer was relatively easy due to something called the normalized differential vegetation index (NDVI). An NDVI layer was added to the dataset by performing raster algebra on the rasters from bands 3 and 4 from the Micasense sensor according to the formula : . One unique and particularly useful quality of NDVI imagery is that there is great contrast between vegetation and non-vegetation features (Figure 11). By performing an unsupervised natural breaks (Jenks) classification I split the newly created raster into two classes, exported only the pixels which were classified as vegetation, and ‘Voila!’, the vegetation mask layer was born (Figure 12).

Figure 11:  One common garden box viewed using normalized differential vegetation index (NDVI) to distinguish vegetation features from non-vegetation.

Figure 12: Vegetation mask layer (yellow) created from natural breaks (Jenks) classification of NDVI layer into 2 classes. Pixels classified as ‘vegetation’ based on the classification were exported into a new ‘mask’ layer.

Once the mask layer was created, all of the necessary inputs were in place to transform the dataset using my R-code (shown in part 3). The code uses the scaled buffer zones and vegetation mask layers to include only vegetation pixels for each plant and take an average for each of the spectral variables from within the crown area. Also, the spatial variables for each seedling are included. As a result, the new attribute table contains 11 variables (7 spectral and 4 spatial) for all 1697 seedlings. It is good to note, however, that if spatial variables were in fact independent of spectral response then it would be reasonable to classify seedlings based on spectral response alone (Figure 13). Unfortunately, the spatial variables do seem to make a difference, setting the stage for Part 2 of this tutorial.

Figure 13: Example analysis application for data resulting from part 1 of this tutorial: Individual seedling Isocluster unsupervised classification by standard deviation based on mean crown NDVI. Similar analyses could be conducted using other spectral variables such as TGI, but more complex methods are required to investigate the relationship between spatial variables and spectral response.


May 8, 2017

Tutorial 1: Nitrate Concentration Differences: Between Sites and Over Time

Filed under: 2017,Tutorial 1 2017 @ 10:26 am

Research Question: 

Do the differences in nitrate concentrations indicate a pattern between site locations, and do they show a pattern over time.

Tools and Approach: 

I performed the analysis largely in excel, though ARC GIS mapping software was useful in spatial proximity analysis. I was able to produce a chart and representative scatter plot of the nitrate concentration differences over time between various sites. The sites are naturally grouped/divided into three groups based on spatial location. The first is the Rock Creek basin (and the example used in this tutorial) additionally I analyzed Mary’s Peak streams and finally the Coastal streams.


First, I sorted my concentration data that I had produced for exercise one (Table 1). Table 1 represents total nitrate concentrations in ppm for all sites. Next I subtracted the concentration data between streams within varying proximity (Table 2). The second table represents the differences in concentration values, a negative value correlates with a higher nitrate concentration in the stream that was being subtracted. This analysis allowed me t
o find concentration of the nitrate data between points (stream sites) and over time.


Rock Creek Basin:I found higher differences (indicating higher nitrate concentrations) in the mainstem rivers than the tributaries. Additionally, I saw higher concentrations of nitrate in the spring samples as compared to the winter samples. I     expected higher values for nitrate in the mainstem, such as rock creek due to the concentration of water coming from upstream. Naturally the smaller tributaries would exhibit lower concentrations than the larger order stream that catches the two. I also expected nitrate differences and concentrations to decrease in the spring time. If the inputs of nitrogen to the system remain relatively stable, then the uptake length of nitrate (the most sought after in stream nutrient cycling) would shorten and cause the concentrations of nitrate sampled to decrease.









Critique of Method:

This method was useful as it made me think about my data set in a different way. Using the differences in concentrations over time, I was able to think about two variables (Y) difference in concentration, and stream site location to assess why each site exhibited the data it did. I was able to determine that upstream site locations typically have lower concentrations of nitrate then the downstream sites. I also was able to think about how nitrate is available as the seasons change from winter to spring. At first I wasn’t able to understand why nitrate concentrations could decrease in the winter months, but a combination
of lower flows and increase macroinvertebrate activity are potential players in a decrease in concentration.


May 7, 2017

Tutorial 1: Visual data analysis

Filed under: Tutorial 1 2017 @ 11:30 pm

Reva Gillman

Geog 566

  1. Research question: What is the geographic distribution of the native regions of the Japanese Tsunami Marine Debris (JTMD) species? Which regions are the most prevalent for JTMD species to be native to, and which are least prevalent?
  2. Approach that I used: visually represent native region frequency of JTMD species within 12 different coastal geographic regions of the world.
  3. I was able to get a shape file of the Marine Ecoregions of the World (MEOW), which are geographic coastal regions that match with my JTMD species data, into ArcMap, and categorize the 12 realms with different colors. After that was complete, I made an excel sheet of the species totals per region, which had a realm column of the same attribute of the shape file. I was then able to do a ‘join’ in ArcMap, to align the species data with the geographic shape file. Then I played around with data intervals for the species sum categorizations for the legend, to arrive with the following map of the JTMD native regions:

realms (1)


Marine Ecoregions of the World (MEOW)

  1. Results: The most prevalent region that JTMD species are native to is the Temperate Northern Pacific. This seems obvious, as JTMD originated from Japan, so we expect most species to be native to that same region of the globe. Next most prevalent native region for JTMD species is Eastern Indo-Pacific, the region southeast of Japan. However, after that the native regions that are prevalent for JTMD species begin to span the globe: Temperate Northern Atlantic, and Tropical Eastern Pacific. At the other end is the least prevalent region: the Southern Ocean, only one JTMD species is native to this cold southern region.
  2. Critique of the method – what was useful, what was not? This is a largely visual interpretation of the data, without using statistical analyses of the data. However, it is very useful to be able to visualize the native regions spatially on a map of the geographic regions, and be able to see the different species sum categories according to the legend, to determine high and low frequencies of occurrence. It is a good start, and good beginning before I further analyze other aspects of the JTMD species spatial dimensions.


Spatial autocorrelation in NDVI for agricultural fields vs the average biomass of that field

Filed under: Tutorial 1 2017 @ 4:29 pm


  1. Question that you asked…

Looking at the spatial autocorrelation of NDVI within a field and the average wet biomass of that field, is there a correlation? When the NDVI distribution is patchy, is the biomass lower? Or maybe higher? Or is there no relation at all.

  1. Name of the tool or approach that you used…

To assess the patchiness of the NDVI I used a variogram. The variogram shows the characteristics of spatial autocorrelation. There are three parameters of interest that can be read from a variogram graph; a) the range, which gives an indication of the lag in the autocorrelation. b) the sill, which represents the variance of the variable at the range lag. And, c) the nugget. This should be zero, but due to errors in the data this can vary slightly. The variograms were used for two analysis. First, we visually compare multiple variograms from different fields categorized for biomass. And second, we analytically compare the variograms with scatterplots between i) range and biomass, ii) sill and biomass, and iii) range and sill with biomass color scheme. For the scatter plots the correlation coefficients are determined with Pearson’s R and a p-value.


  1. Brief description of steps you followed to complete the analysis…

The variograms are created in R with the ‘usdm’ package.

Step1. Create a list of all the geotiff files in a folder

Step2. Create for every file a variogram.

var<- Variogram(raster,size=50)

Make sure to use a capital V in Variogram.

Step3. Plot the variograms with points and lines

Plot(var)                              # points

Plot(var,type=’l’)               # lines

Step4. Plot all the variograms on one graph

Par(new=TRUE)                 # this line makes sure all graphs are in the same figure


Step5. Include legend with field number and average biomass

Step6. For every variogram estimate the range and the sill and put it in an excel sheet.

Step7. Create scatterplots for i) range-biomass, ii) sill-biomass, and iii) range-sill color coded for biomass. Over here I switched to Matlab, because I’m more familiar with this. But R could do the trick as well.


Step8. Calculate Pearson’s correlation coefficient and p-value.

[R,p]=corrcoef(range,biomass)                   % repeat this for the other scatter plots.

  1. Brief description of results you obtained…

In the variogram plot, Figure 1, we can see that there is a high inter field variability in spatial auto correlation for NDVI. It is difficult to tell from the graph if there is a correlation between biomass and variogram type. Also, there is a difference in crop type between the field, which has a considerable influence on the biomass. For further analysis, a distinction between crop types should be made.



Figure 1: Variograms for the different fields with average biomass


Also in the scatter plots the relation between biomass and the variogram parameters in not apparent. The Pearson’s correlation coefficients are very low, between 0.049 and 0.158 with significant p-values (for a 5% level).

Figure 2: Scatter plots for Range vs Biomass and Sill vs Biomass

Figure 3: Scatter plot for Range vs Sill with biomass color coded


  1. Critique of the method – what was useful, what was not?…

This method does not show a correlation between variogram parameters and average biomass for the fields, however it is possible that a distinction in crop type would improve the results. The biggest gap in this method is the subjective estimation of the range and sill values for the variograms. In the future, a variogram can be fitted and the range and sill parameter can be automatically generated. However, also for this variogram fitting decisions need to be made, such as fitting type, which could possibly introduce subjectivity.

May 6, 2017

Forest Structure Assessment Using Variogram in R

Filed under: 2017,Tutorial 1 2017 @ 8:16 am

Question and tool

The questions that I try to answer in tutorial 1 is how the height and crown width vary in space in the study area, whether those variables are autocorrelated. I use variogram analysis in RStudio to answer these questions.

The variogram is defined as the variance of the difference between field values at two locations x and y across realizations of the field (Cressie 1993). The main goal of a variogram analysis is to construct a variogram that best estimates the autocorrelation structure of the underlying stochastic process. Most variograms are defined through several parameters; namely, the nugget effect, sill, and range.

Figure 1. The example of variogram graph

Figure 1. The example of variogram graph

  • Nugget effect represents micro-scale variation or measurement error. It is estimated from the empirical variogram as the value of y(h) for h = 0.
  • Sill representing the variance of the random field.
  • Range is the distance (if any) at which data are no longer autocorrelated.

Data Analysis and Result

The data used in this tutorial is generated from UAV visual images. The UAV visual images were processed through some software. Agisoft photoscan was used to make a 3D point cloud based on the UAV visual images. Then, the 3D point cloud was used in Fusion software to derive forest structure measurements, such as number of trees, height, and canopy width. The result of this data processing is a data set in a table consisting of 804 detected trees with their height and crown width. Unfortunately, because the images were not georeferenced yet, the height in this tutorial is the sum of tree height and elevation.

Figure 2. Detected trees by Fusion software.

In Rstudio, I use the R codes that are stated below to do the variogram analysis. First, I need to install the “gstat” package and load the package using the following R codes:

> install.packages(“gstat”)

> library(gstat)

Next, to load my data into the RStudio, I have to set my working directory. It can be done through the “session” toolbar. The first variogram analysis is for crown width and space (x and y). The R codes used for this first variogram analysis can be seen below. Based on the graph, in general,, crown width is not autocorrelated although there are some patches with different nugget, sill, and range. In the first part of the graph, there is patch with 25 range (distance) which is relatively larger than the other range. I think it can be happened due to different size of crown width in this particular study area. It seems like there are some trees with big crown and some other with smaller crown that are clustered.

> Treelist1<-read.table(“D:/Oregon State University/Term/Spring 2017/GEOG 566/Exercises/Exercise 2/treelist1.csv”, sep=”,”, header=T)

> g1 <- gstat(id = “”, formula =, locations = ~X+Y, +            data = Treelist1)

> plot(variogram(g1))

Figure 3. Variogram graph of crown width

The next variogram analysis is for height and location (x and y). The R codes and graph are shown below. From the graph, it can be seen that the semivariance is incremental with the increase of range (distance). It means that the height is autocorrelated. I think that can happen because of the height data which is the sum of tree height and elevation. Since the elevation is included in the height data, the elevation or the topography of the study area will influence the variance and the variogram analysis.

> g2 <- gstat(id = “Height”, formula = Height~1, locations = ~X+Y, +            data = Treelist1)> plot(variogram(g2))

> plot(variogram(g2))

Figure 4. Variogram graph of height.

My next variogram analysis is the cross variogram analysis. The R codes and graph are given below. It appears that the semivariance of height and crown width is negatively related, which means there is decrease in semivariance when the distance is increased. Cross-variogram values can increase or decrease with distance h depending on the correlation between variable A and variable B. Semi-variogram values, on the other hand, are by definition positive. One thing that is confusing is the negative value of semivariance. In my understanding, the semivariance should have positive value. Maybe there is mistake in my R codes that causes this confusing result.

> g3 <- gstat(g1, id = “Height”, formula = Height~1, locations = ~X+Y, +            data = Treelist1)

> plot(variogram(g3))

Figure 5. Cross variogram graph of crown width and height.

Critique of the method

As explained in the data analysis and result, variogram analysis is useful to assess how the forest structure parameters, such as crown width and height vary in space in the study area, and whether they are autocorrelated or not. In addition, for me, it is faster to do variogram in R using “gstat” package especially when you have data set in table format. I tried using my raster data for variogram analysis in R, but R crashed and there was not enough memory. Therefore, I prefer to use “gstat” and data set in table for variogram analysis. However, variogram analysis in R has limitation in term of visual presentation. It is quite hard to identify which part of the study area that has higher variability (larger patch) since the result is in graph format and not in map.


May 5, 2017

Tutorial 1: Use of the Optimized Hot Spot Analysis for explaining patterns of Socio-demographic and spatial variables of survey responses regarding flood-safety in neighborhoods

Filed under: Tutorial 1 2017 @ 11:52 pm
  • 1. Question Asked

    How socio-demographic and spatial variables explain patterns of survey responses about perceptions and attitudes regarding flood-safety in neighborhoods?

    Location of survey responses and variable values (variable: “perception of flooding reaching your home”) are shown in Figure 1.1.

    Figure 1.1. Location of survey responses and variable values (variable: “perception of flooding reaching your home”)

    2. Name of the tool or approach used

    Various tools are used to accomplish the objective of setting up inputs for the optimized hot spot analysis. The motivation of optimizing this analysis is because I was not able to identify any clear spatial pattern (I was getting trivial results) on my initial hot spot analysis without optimizing it.

Four tools are used to accomplish our objective:

  • For finding out the “Analysis Field” for optimizing calculations
    • Generate Near table in Arc Map


  • For missing data:
    • Replace all missing values in numerical variables with the mean using R or/and
    • Replace all missing values in categorical variables with the median using R


  • For the spatial analysis I am using:
    • Optimized Hot Spot Analysis in ArcMap


  • For creating a prediction surface map using results from the Hot Spot Analysis:
    • Kriging in ArcMap

3. A brief description of steps followed to complete the analysis

I followed the next steps:

  • For finding out an “Analysis Field” for optimizing calculations

Find distances from neighborhood lot properties to near streams

  • Use Near features tool in ArcMap. For “Input Feature” use the lot property shape file. For “Near Features” use the streams shape file.

  • In the output table, Identify “IN_FID”, which is the ID of the input feature(lot property ID)

  • Join the output table to the input features table based on “OBJECTID” of the input feature and “IN_FID” of the output table.

  • Now you have the distances in the “Input Feature” table. Next, export the table and copy variables of interest as one more value of the Survey dataset.
  • For missing data:

I have used a very simple concept: Replace all missing values in numerical variables with the mean and/or, replace all missing values in categorical variables with the median. The task for analyzing 108 variables was coded in R. The coding steps, showing only for some variables, is as follows:

  • For the spatial analysis:

Now we are ready to perform the hot spot analysis using the “Optimized Hot Spot Analysis” tool of ArcMap. As you can see in the figure below, there is an “Analysis Field (optional)” field, which in our case is quite useful.

For my problem:

Input Features: categorical survey responses with its corresponding location.

Analysis field (optional): corresponds to the properties distance to the water streams. This variable was calculated in 3.1. This is a continuous variable.

  • For creating a prediction surface map:

For this purpose, I have used results from the Hot Spot Analysis in the “Kriging” tool of ArcMap.

For my problem:

Input point features: Feature results from the Hot Spot Analysis.

Z value field: corresponds to the “GiZ scores” results from the Hot Spot Analysis.

4. Brief description of obtained result

The following results correspond to one variable of the survey responses (the survey contains 104 categorical variables to be analyzed).

Three scenarios of spatial correlation have been analyzed.

 Scenario 1: Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of all streams (Willamette River, Marys River, Millrace) surrounding near the neighborhood. The results are shown in Figure 4.1

Figure 4.1. Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of all streams (Willamette River, Marys River, Millrace) surrounding near the neighborhood.

The hot spot analysis yields patterns formation correlated to the distance of all streams (Willamette River, Marys River, Millrace) surrounding near the neighborhood.

Scenario 2: Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of two major rivers (Willamette River and Marys River) surrounding near the neighborhood. The results are shown in Figure 4.2

Figure 4.2. Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of two major rivers (Willamette River and Marys River) surrounding near the neighborhood.

The hot spot analysis yields patterns formation correlated to the distance of two major rivers (Willamette River and Marys River) surrounding near the neighborhood.

Scenario 3: Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of a seasonal stream (Millrace) crossing the neighborhood. The results are shown in Figure 4.3

Figure 4.3. Patterns of resident’s perceptions of flooding reaching their home correlated to the distance of a seasonal stream (Millrace) crossing the neighborhood.

The hot spot analysis yields patterns formation correlated to the distance of a seasonal stream (Millrace) crossing the neighborhood.

5. Critique of the method

  • Generate Near table:

This method is useful for finding the shortest distance between two features. This information was quite useful for my analysis.

  • Replace missing data:

Missing data was one of the issues I encountered analyzing my data. I have first tried to remove missing values using the “VIM” package in R, which uses multiple imputation methodologies, but this method didn’t work out in my case. I was getting messages of problems trying to invert matrices. This problem can be attributed to the small range of values of the categorical variables vectors. So, I have used a very simple concept: Replace all missing values in numerical variables with the mean and/or, replace all missing values in categorical variables with the median (IBM Knowledge, Center  n.d.). This helped me a lot.

  • Optimized Hot Spot Analysis:

For my problem, found more useful the “Optimized Hot spot Analysis” tool rather than the “Hot Spot Analysis (Getis-Ord Gi*)” tool because the “Analysis field” allowed me to find and optimize clusters formation in my data.

  • Kriging:

This ArcMap tool allowed me mapping cluster formation based on the hot spot analysis outputs. This tool allow better visualization of spatial patches.


Getis, A., & Ord, J. K. (2010). The Analysis of Spatial Association by Use of Distance Statistics. Geographical Analysis, 24(3), 189–206.

IBM Knowledge Center – Estimation Methods for Replacing Missing Values. (n.d.). Retrieved May 8, 2017, from


Tutorial 1: Calculation, Graphing, and Analysis of Correlation Coefficients in Excel

Filed under: 2017,Tutorial 1 2017 @ 2:27 pm

Question: How well correlated are stage gage data at different locations within a single basin, based on the amount of distance between them?

Approach: To explore this question, I used the Correlation function in Excel to do a pairwise correlation analysis of my six sets of stage gage data, and then correlated the results with the distance between each pair of instruments, measured in ArcGIS.


Calculate correlation coefficients in Excel

  1. Import stage gage arrays into Excel with one column for each site. Measurements arrays won’t exactly match so make sure that the first value in each array was recorded within 5-10 minutes of the other first values.
  2. Create pairwise correlation table – blank out one half to prevent redundancy in calculations. Correlating a site with itself will always result in a correlation coefficient of 1 so blank out the diagonal as well.
  3. Calculate correlation coefficient for each pair of sites using CORREL function in Excel: =CORREL(array1,array2) in the corresponding cell (Table 1).

Table 1. Pairwise correlation table filled in with example values.

Calculate distance between sites in ArcGIS

  1. Import stream layer and site points into ArcGIS (Figure 1).

Figure 1. Map with imported stream layer (blue line) and site coordinates (yellow and green points). Stage gage locations are indicated with red stars.

  1. Site locations will probably not all be exactly on top of stream polyline. Snap site points to stream polyline using Editor Toolbar.
    1. Make sure Snapping is on in the Snapping Toolbar.
    2. Right click on site point layer and choose “Edit Features”, then “Start Editing”.
    3. Use the Edit tool to select the site you want to snap.
    4. Drag the point perpendicular (the shortest distance) from its current location toward the stream until it snaps to it.
    5. Repeat for all other sites that are not already snapped to streamline.
    6. “Save Edits” and “Stop Editing”.
  1. Split stream polyline at intersections with site points for easier measuring.
    1. Start Edit session.
    2. Use Split Tool to click at each intersection of a site point and a stream polyline. This will split the line in two at the place where you click.
    3. “Save Edits” and “Stop Editing”.
  2. Create another pairwise table in Excel to input distance data.
  3. Calculate distance between points.
    1. Start an Edit session.
    2. Highlight all portions of stream between one pair of points using Edit tool. Hold down shift key to select multiple portions at once (Figure 2).

Figure 2. Example selection of stream polyline sections between two stage gage sites. Selected lines are in light blue.

  1. Open stream polyline attribute table and right click on “Length.”
  2. Use Statistics function to calculate total length (Sum) (Figure 3).

Figure 3. Attribute table with selected stream polyline sections and statistics summary.

  1. Input total length value into Excel table.
  2. Repeat step 5 for all pairs of points.
    1. Use “Clear Selection” to clear previously selected points before starting process again.

Make correlation graphs

  1. Make x-column with pairwise distances and y-column with corresponding correlation coefficients. Include a third column with the pair for clarity (Table 2).

Table 2. Distance and correlation table.

  1. Graph x vs y and examine the relationship.

Figure 4. Plot of relationship between streamwise distance and correlation coefficient of all possible pairs of six stage gage sites.

Results: When I started this analysis, I expected to see a clear negative relationship between distance between instruments and correlation coefficient. I did not obtain this result when I performed the analysis, however. The plot that I made showed a slight negative relationship between distance and correlation coefficient. Upon further reflection, this relationship actually makes sense for the basin and is actually the result that I hoped to see. This relationship shows that regardless of location in the basin, the water levels are responding relatively uniformly, with only a slight delay between sites that are spaced further apart.

The points representing the Site 9 vs Site 36 and Site 9 vs Mainstem correlation coefficients could be considered outliers with their unusually low values. This could be because the two sites are located on different tributaries and are two of the most widely spaced apart pairs.

Utility: This method was useful for illustrating the relationships between different stage gages within my study basin. The instruments are located on different tributaries so the relationships are slightly more complicated than I expected but I think I will still be able to apply what I have learned from this analysis to extrapolate to other parts of the basin where I do not have stage data.

Tutorial 1: Raster auto correlation in R

Filed under: Tutorial 1 2017 @ 1:47 pm

A Quick Primer

My spatial question is how the radar backscatter (measured by the Sentinel-1 satellite) relates to features on the ground. The backscatter is influenced by slope, aspect, forest characteristics (tree height, density, moisture content etc.)

The question I was asking for the first excercise was whether or not my backscatter was autocorrelating. If it is, then I will need to reconvene on how to approach my question (spoiler alert: the data are not autocorrelated).

I tried a few approaches, but the one that worked best was using R code and the raster tool in R (not a default tool but easy enough to download). However, before I could do that I had to reduce my dataset to something that was manageable.

The variogram is a useful tool to measure autocorrelation. The process boils down to this:

The value of interest at each pixel (or point) is compared to the values of interest at every other pixel and the similarity between the values is examined spatially. If there is autocorrelation in the data, then pixel values will be more similar to each other in a spatially coherent way (another way of putting it: the values of a given pixel can be predicted by the values of pixels that are nearby). Spatially clumped data are an example of autocorrelated data, whereas uniform or random data typically do not display autocorrelation.

The backscatter data collected by the satellite should not display autocorrelation, unless the underlying features were spatially autocorrelated. Since I was looking at data for a hilly, forested area, it is expected that the radar data should not be autocorrelated.

Getting to the Good Stuff: Methods

First, I had to import my data into ArcMap. This was after pre-processing the radar data into a spatially and radiometrically corrected raster with pixel dimensions of 10 m x 10 m. I attempted to run a variogram on this raster in Arc, but it crashed. So I attempted to run a variogram on just a subset of this raster, and it crashed again. Finally, I developed a variogram within Arc on just a subset of my data. Unfortunately, there were still so many points that the variogram looked more like the shower scene from Psycho.

On to R, where I brought in my original raster to run a variogram. This crashed the whole computer this time, so I decided a small subset would be better to work with. Using the same subset as before, I brought in the GeoTiff file into R (using the ‘raster’ package), making sure that the data was not corrupted (previously I had tried to export the file from Arc, but it gave each pixel an 8-bit RGB value rather than the float decimal value of backscatter).

Above, the raster as it appeared in R. The extent of the raster was much larger than the raster itself, which I attribute to the export procedures in ArcMap. I made a histogram in R to view the values of the backscatter raster (values were in digital number dNBR):

Using the ‘usdm’ package in R, creating and viewing a variogram was as simple as:

var2 = Variogram(Gamma0)

The resulting plot showed little correlation beyond the immediate lag distance, and a rapid leveling off.

This corresponds to the data being relatively non-correlated, which is precisely what I was hoping for.


ArcMap is wholly unfit for processing raster autocorrelation beyond a certain resolution. I’m not sure what that resolution is, but the data I was working with was too much for Arc to handle. It was even too much for R to handle, which is why I had to use a subset of my data and hope that it was a representative sample of the data at large. In comparing other projects within the class, it seems like autocorrelation within Arc using variograms works when you have limited data; but when the data is composed of 100,000 individual points (or pixels) it is far too much for ArcMap. I would recommend anyone using high resolution data to stay away from the variogram, and use a different autocorrelation measure. This is probably due to the process that variograms use (i.e. comparing every point to every other point) that is susceptible to a rapid inflation of required computational time.

May 4, 2017

Tutorial 1: Using a Geographically Weighted Regression to assess the spatial relationship between blue whales and chlorophyll-a

Filed under: 2017,Tutorial 1 2017 @ 4:41 pm

Research Question

The goal of my spatial problem is to assess the relationship between blue whales in the South Taranaki Bight region of New Zealand and the environmental factors which define their habitat and influence their spatial distribution. Chlorophyll-a (chl-a) concentration is an indicator of productivity in the marine environment. Chl-a can be remotely-sensed, and the concentration reflects the abundance of phytoplankton, the tiny photosynthesizing organisms which form the base of the marine food web. Blue whales do not directly feed on phytoplankton. However, krill, which feed on aggregations of phytoplankton, are the main prey type for blue whales. Remotely-sensed chl-a can therefore be used as a proxy for the location of productive waters where I might expect to find blue whales. For this exercise, I asked the following question:

“What is the spatial relationship between blue whale group size and chlorophyll-a concentration during the 2017 survey?”


I downloaded a chl-a concentration raster layer and used the “extract values to points” tool in Arc in order to obtain a chl-a value for each blue whale group size value. I then used the geographically weighted regression (GWR) tool from Arc’s spatial statistics toolbox in order to investigate the spatial relationship between the whales and the concentration of chl-a.


The chl-a concentration layer was downloaded from the NASA Moderate Resolution Imaging Spetrometer (MODIS aqua) website. MODIS data can be accessed here, and available layers include numerous sources of remotely-sensed data besides chl-a concentration (Figure 1). Chl-a concentration values are averaged over a 4 km2 spatial area and a one-month time period. I downloaded the raster for February 2017, as our survey lasted for three weeks during the month of February.

Figure 1. NASA Moderate Resolution Imaging Spectrometer (MODIS) data sources, including chlorophyll-a concentration which is used in this tutorial.

I then used the extract values to points tool, which is located in the extraction toolset within Arc’s spatial analyst toolbox, to extract values from the chl-a concentration raster for each of the blue whale sighting data points. This resulted in a layer for blue whale sightings which contained the location of each sighting, the number of whales sighted at each location (group size), and the chl-a concentration for each location (Figure 2).

Figure 2. Blue whale sighting locations and group sizes overlaid with chlorophyll-a concentration.

The geographically weighted regression tool is found within the spatial statistics toolbox in Arc. The dependent variable I used in my analysis was blue whale group size, and the explanatory variable was chl-a concentration. I used a fixed kernel, and a bandwidth parameter of 20 km (Figure 3). Functionally, what this means is that the regression looks at point values that are within 20 km of one another, and then fits a linear relationship across the entire study area.

Figure 3. The geographically weighted regression tool in Arc’s spatial statistics toolbox.


The results of the GWR are shown graphically in the figure 4. The GWR fits a linear equation to the data, and the values which are plotted spatially on the map are coded according to their deviation from their predicted values. Many of the points are > 0.5 standard deviations from their predicted values. One point, which appears in red in figure 4, is > 2.5 standard deviations above its predicted value, meaning that blue whale group size at that location was much higher than expected given the chl-a concentration at that same location.

Figure 4. Result of the geographically weighted regression (GWR). Color codes represent the deviation from the expected values for blue whale group size according to the chl-a concentration value at that location.

The attribute table from the GWR output layer shows all of the components of the equation fitted by the GWR, as well as the predicted values for the dependent variable according to the fitted linear equation (Figure 5). It appears that there are several points which are far from their predicted values according to the GWR, such as the row highlighted in figure 5.

Figure 5. The attribute table associated with the geographically weighted regression (GWR) output layer. The highlighted row is an example of where the observed value was dramatically different from the predicted value.

The local R2 values are < 0.04 for all values, demonstrating that the data do not fit the relationship fitted by the GWR very well at all. My interpretation of this result is that, across the spatial scale of my study area, the relationship between blue whale group size and chl-a concentration is not linear. This is apparent from looking at the raw data as presented in figure 2, as it appears that there are several whales in an area of high chl-a concentration and some large groups of whales that are in an area of low chl-a concentration.

Critique of the Method

Although the results showed that there was no linear relationship between blue whale group size and chl-a concentration across the spatial extent of my study area, this is a very useful result. The ecological relationship between blue whales and chl-a concentration is not direct—remotely-sensed chl-a concentration indicates locations of productivity, where phytoplankton are present in high abundance. These areas of high phytoplankton are presumed to drive the location of dense aggregations of krill, the prey source for blue whales. However, there is likely both spatial and temporal lag between the phytoplankton and the blue whales. The fact that the ecological relationship of the two variables investigated here is an intricate one is reflected by the fact that a simple linear relationship does not capture how the variables are related to one another in space. I look forward to exploring this relationship further, perhaps incorporating other environmental factors that contribute to the location of krill patches such as sea surface temperature.

Tutorial 1: Using Logistic Regression vs Ordinary Least Squares Regression.

Filed under: Tutorial 1 2017 @ 3:04 pm


  • How is chytrid fungus presence/absence related to the distance from road or trail? I chose this question because humans are known to be one of the primary causes of pathogen transmission.
    • This question involves a relationship between presence or absence of chytrid fungus (dependent variable) and distance from road or trail (explanatory).


  • I attempted to use Logistic Regression and Ordinary Least Squares for my y variable (chytrid presence/absence) related to an x variable (for this I used distance from road or trail) by using python script and R. Although OLS may not be the best test to run for my data, (logistic regression is shown to be the best based on Table 6.1 in the exercise because my x and y are both binary or categorical data) I wanted to try it anyways to see if there was any linear relationship between x and y. The OLS tool also outputs some cool visuals, figures, and plots. This could be useful for someone who has continuous data.


  • To run OLS and Logistic regression, I had to prep my data. I wanted to answer the question, is the presence of Bd more likely near a road or trail (where humans are transporting the pathogen)? Therefore, I had to get distance from sample points to the nearest road. To do this, I first had to bring in multiple layers of roads and trails within Deschutes National Forest given to me by an employee. I used the “Merge” tool to bring all the layers together. My next step was to find the distance from the sample point to the nearest road or trail in my new “merged roads and trails layer”. I used the “Near” tool which generated a value representing the distance from the sample point to the nearest road or trail. Once I had that information, I ran the Ordinary Least Squares tool and logistic regression where I used Bd as my dependent variable, and distance from road as my explanatory variable. Below is my code and used for both OLS and logistic regression.


  • I am not certain whether OLS was useful or not, but I think not because my data was not continuous which violates one of the assumptions of the test. Although my data violates assumptions made by OLS, it backed up results by the logistic regression, showing a relationship between distance from road and Bd presence.However, while the logistic regression model stated a significant difference, OLS stated a non-significant difference.Logistic regression works better for my data and was useful because it showed a significant relationship between the two variables. Below are some figures produced by OLS, and above are the statistical results along with script from both OLS and logistic regression.

May 3, 2017

Tutorial 1: Cross-Correlation between tsunami inundation time series extracted from gridded netCDF data

Filed under: Tutorial 1 2017 @ 12:44 pm


Tsunami inundate shorelines in a chaotic manner, resulting in varying degrees of flooding from location to location. The spatial variability of tsunami hazards has important implications for the tsunami-resistant design of structures and evacuation plans because different safety margins may be required to achieve consistent reliability levels. Currently, tsunami mitigation strategies are primarily informed prediction models due to the lack of nearshore field data from tsunami disasters. Thus, to investigate the spatial variability of tsunami hazards, we analyze the relationship between different test locations in our domain. The question is:

“How much does tsunami inundation vary spatially?”

Essentially saying if one observes a certain flooding depth at their location, could they assume that a similar flood level is occurring elsewhere? And how does this change with time? And how to certain coastal features affect this?


To answer this question, I examined the cross correlation between time series of inundation extracted from simulated tsunami inundation data. All of the tools described in this tutorial were implemented in R.


My data was stored in large netCDF (.nc) files which had to be opened in R. This required the “ncdf4” package. The netCDF file was read into R by using the “ncopen” function and its contents were printed to the Console window in RStudio by using “print” as shown in figure 1. At this step we can use the “nvar_get” function to extract the time and geographic coordinate variables from “ncin” object which contains the data from the netCDF file.

Figure 1 – Reading and viewing contents of netCDF file containing tsunami simulation data.

After this, we want to extract the time series of inundation at particular geographical points specified by the user (me). In this case, I’m interested in the water level at the jetties for Channel Island Harbor (Figure 2) and Port Hueneme (Figure 3) and how correlated they are to one another. After determining the geographic coordinates for my points of interest, the closest points within the gridded data were identified by using a the minimization function “which.min(abs(desired_value – data))” in R as shown in Figure 4. Using the indices identified by the minimization procedure, the “ncvar_get” function is used obtain the particular time series. In this case, however, since the inundation data is very large, we only want to extract the desired subset and thus had to be careful with our “start” and “count” inputs in the “ncvar_get” function (see Figure 4).

Figure 2 – Geographic location of Channel Island Harbor Jetty

Figure 3 – Geographic location of Port Hueneme Jetty

Figure 4 – Code excerpt for extracting time series from user specified geographical points

After obtaining the time series, we plot to visualize in Figure 5.

Figure 5 – Time series of free surface variation at extracted jetty locations for Channel Island Harbor and Port Hueneme

Now that we have stored our inundation time series in arrays (ha.CHjetty and ha.PHjetty), we can now perform a cross correlation by using the “ccf” function from the “stats” package in R.


The result from this procedure was essentially a list of correlation coefficients for specific time lag values. This information is best visualized as a plot as shown in Figure 6.

Figure 6 – Cross correlation plot between the inundation time series at the Channel Island Harbor and Port Hueneme Jetties.

Figure 6 shows that the inundation at the two different jetties are somewhat lagged from one another, but by only a short time. Additionally, there are significant differences between the observations with the maximum correlation coefficients to be ~0.6.

Critique of Method

Overall this method was quite effective and very useful for the specific analysis. From a data analysis standpoint, this setup can be effectively automated and can be used to extract the time series at any given set of locations to compare correlations. This methodology was quite simple to implement and produced simple and interpretable results.

Furthermore, this information can be useful from the bigger question standpoint in the sense that we get pertinent information regarding the inundation levels at this different locations. If we treat the tsunami and incoming waves, we can consider the lag effect as being how long it takes for the wave to get from one jetty to the other. In this case, only minutes separate the largest effects of the wave from getting from one jetty to another which suggests that actions should be taken simultaneously to mitigate any issues that might occur.

The other aspect to note is that there are significant differences between the inundation levels observed at these two locations. This suggests that the measures that take the magnitude of tsunami inundation into account should be determined individually at these two harbor/ports. This preliminary analysis highlights a need for more localized and site-specific tsunami mitigation measures.

May 1, 2017

Tutorial 1: Assessing distribution shifts in kernel density probability surfaces

Filed under: Tutorial 1 2017 @ 1:48 pm

Overview: question & approach

Identifying spatial repulsion and/or shifts in distribution are often examined to identify interspecific competition effects. Therefore, I wanted to assess the distribution of cougar kill sites including the spatial clustering and frequency in pre- and post-wolf time periods focusing on identifying shifts. A simple solution to my presence only data (kill events) was to create a density feature. However, there are several elements in my data sets, besides the presence of wolves, which could produce a shift in kill density or distribution that I will need to account for including: 1) catch-per-unit effort discrepancies (i.e. larger sample sizes of cougars (and kill sites) in one data set), or 2) time effects from seasonal distribution shifts (i.e. prey migration patterns). Before getting into an investigation of causal mechanisms, a first step is to determine if there are differences in where cougar are killing prey between study time periods by asking:

Is the distribution of post-wolf cougar kill sites different than the distribution of pre-wolf cougar kill sites?

There are two ways to assess this question. For the purposes of this analysis the data would be “location, implicit” with the variable of interest (cougar kill sites post-wolf) having a covariate (cougar kill site density/distribution pre-wolf) measurable at each sampled location and the causal mechanism inferred from co-variation. Density measures for each kill site could be pulled from the density feature raster created, but would be most relatable on a prey species level since elk and mule deer behave differently at a landscape level. Alternatively, an isopleth (or contour line) probability surface created from the density features could be used to calculate the proportion of post-wolf cougar kill sites within specified probability features of the pre-wolf cougar kill site distribution.


My approach to this question was to relate the proportion of post-wolf cougar kill sites (points) to the pre-wolf cougar kill site density (raster) using the probability contours (polygon feature) of kill site density as a measure of distribution. I used several tools in ArcGIS and the Geospatial Modeling Environment (GME) to carry out this analysis. GME is a standalone application that makes use of ArcGIS shape files and R software to carry out spatial and quantitative analyses.


Figure 1. Geospatial Modeling Environment (GME) home page.



I had already created point shape files of cougar kill sites for each time period in ArcCatalog, and used the kde call in GME to calculate kernel density estimates, or utilization distributions, for cougar kill sites in each time period (exercise 1). For that analysis, I used the PLUGIN bandwidth and a 30-m resolution cell size. Multiple bandwidth estimators are available as well as input options for scaling, weighted fields, data subset and/or edge inflation.

Figure 2. GME kde call in the command builder GUI.


The next step for this analysis was to standardize the two cougar kill time period KDE raster data sets so the densities relative to catch (in this case kill events) per sample (cougar) were comparable. This was necessary because the sample of kill sites (and sample of cougar) in the pre-wolf time period was higher (45-48%) than the post-wolf sample of kill sites (and cougar). I used the raster calculator in ArcGIS to divide each period kill site KDE raster by that periods’ respective kills/per cougar ‘effort’.

Figure 3. Density raster data standardization using ArcGIS raster calculator.


Figure 4. Raw code for analysis. GME runs like an R GUI, therefore code can be written in any text editor and past into the command window.


Using the new density raster, I then used the ‘isopleth’ command in GME to create a polygon probability surface. The resulting polygons represent quantiles of interest (programmable) expressed as the proportion of the density surface volume. I specified isopleths for the 25th, 50th, 75th, 95th, and 99th quartiles. I also used the ‘addarea’ command in GME to calculate and add information to the isopleth shapefiles for each polygons’ area and perimeter. Next, I used the ‘countpntsinpolys’ command to add a column with the count of post-wolf cougar kills in each pre-wolf cougar kill isopleth polygon. In ArcGIS, I created another column in the attribute table and used the ‘Field Calculator’ to fill each field with the proportion of total kills. Finally, I used ArcMap to visualize the data and make a figure showcasing the process.


Figure 5. GME command text box where you can enter code directly, copy from the command builder, or paste code constructed in another editor. Because GME uses R you can program iterative analysis for loops to batch process large data sets.



Visual examination of the standardized kill density rasters demonstrated that part of the distributional shift observed was related to catch-per-unit effort influences, but a shift between the two highest density areas of kills was still evident from pre- to post-wolf time periods (Figure 6). This suggests other variables, like time effects from seasonal prey distribution changes or the presence of wolves, could also be factors influencing the distribution of kill density.

Figure 6. Comparison of pre- and post-wolf cougar kill site kernel density estimate (KDE) rasters before and after standardization for catch-per-unit effort (kills/cougar), and with 25th, 50th, 75th, 95th, and 99th isopleth probability contours. The isopleth contours for pre-wolf cougar kill site distribution is also fit with post-wolf cougar kill point locations to demonstrate posterior distributional shift.


The observed shift was further evident from the proportional changes in post-wolf cougar kills within the pre-wolf cougar kill probability surface. For example, only 28.5% of all post-wolf cougar kills were within the 50% pre-wolf cougar kill probability contour. Even if I exclude the kills outside the study area boundary the proportion of kills in each probability contour were 5-22% lower than would be expected based on the pre-wolf kill site distribution.


Table 1. Pre-wolf cougar kill site probability contour attribute table. Isopleths represent the 25th, 50th, 75th, 95th, and 99th quartiles. ‘Out’ refers to areas outside the probability contour surface. Number of kills (No. Kills) is the number of post-wolf cougar kill sites, % is the proportion of all kills within each polygon ‘donut’, and % Kills is the cumulative proportion of all post-wolf cougar kills within each quartile class.

Method Critique

Standardization of the two kill density rasters improved visual interpretation of the spatial patterns and accounted for one of the other factors (catch-per-unit effort) that might mask distributional influences related to wolf presence. However, similar to the visual examination of the non-standardized density raster this method allowed for only an implicit understanding of space-time concepts and lacked inferential measures of significance to quantify the shifts and formally relate the patterns of cougar predation across time. Further, because the isopleth contours represent quantiles expressed as the proportion of the density surface volume they were identical when created from the standardized or non-standardized density rasters. Using the probability contours as a ‘prior’ distribution of kill sites offered a more robust and meaningful measure to quantify the shift in cougar kill distribution. However, the mechanism behind the shift could still be related to factors other than wolf presence like seasonal shifts in prey distribution or density. Both methods (standardization and prior distribution comparison) were useful and provide evidence toward the presence of a shift in where cougars are killing prey (i.e. the answer to my ‘question asked’ is yes), but further analysis is necessary to infer causal mechanisms.

April 30, 2017

Tutorial 1: An analysis of the spatial distribution of the North American deer mouse (Peromyscus maniculatus) in the Toiyabe mountain range, NV, USA.

Filed under: 2017,Tutorial 1 2017 @ 10:07 pm

Background & Question

The north American deer mouse (Paramyscus maniculatus) is the most widely distributed rodent species in north America and has the highest proportional abundance across many of the rodent communities in the Great Basin. Understanding its distribution in space is important for understanding its impact on the spatial arrangement of the rest rodent community to which it belongs. In this regard, I wanted to evaluate the degree to which this species is clustered in the Toiyabe mountain range in central Nevada. I expect the pattern of P. maniculatus distribution in the Toiyabe mountain range to be reflective of its distribution behavior in other similar mountain ranges in the Great Basin, and thus exerts similar forces across the landscape. The first step in understanding the relationship between the distribution of P. maniculatus and the other species in small mammal communities was to map the distribution of P. maniculatus. Here I used a descriptive statistical technique of spatial analysis, specifically, Hot Spot analysis. Because data are spatially explicit at the scale of interest the spatial relationship I was interested in measuring could be expressed using the equation for spatial autocorrelation below (eq. 1); where the number of P. maniculatus at location i is a function of the number of P. maniculatus at location i±h, where h is some displacement of space.

Eq. 1.               P. maniculatusi = f(P. maniculatusi±h)

I expected that at the scale of interest, the number of individual P. maniculatus captured would be distributed approximately equally across trap sites. Hot spot analysis provides an index of spatial clustering from clustered to random to dispersed. Clustering suggests the spatial process of attraction, while dispersal suggests repulsion between points. These patterns could be explained as the ecological processes of intraspecific commensalism and competition, respectively. Alternatively, if the distribution of points is random it may suggest that the underlying process producing the distribution of points is not captured by the sampling design, or that the distribution of individuals is not spatially dependent. I do not expect the distribution of P. maniculatus to be spatially dependent.


Raw data tables represent all trap data from 2009 – 2011 for the Toiyabe mountain range small mammal survey. Each individual animal is associated with explicit and implicit spatial data, as well as biological measurements including body size measurements and age-class. Additional information about collection and preservation methods are also associated with individual animals. I collapsed this comprehensive data set by constraining my analysis to one species, Peromyscus maniculatus.  I then collapsed this dataset further by converting it to a count of individual P. maniculatus at each unique location. Explicit location data was recorded as latitude, longitude using the NAD1927 Datum.

Using ArcMap10, I projected these points onto relief base map of North America. To accomplish this I imported the raw data as a .csv file into ArcMap.

.csv file

After importing the .csv file, set the datum for the projection by right clicking on the .csv file in the table of contents and selecting layer properties and setting the data source.

import .csv

Within data source menu, select the datum that your coordinates were measured in, and indicate the X and Y coordinate you wish to display.

.csv properties

Set datum and XY

Once the datum is set, it is easiest to then select a base map to project your points onto you’re your tool bar Select the add basemap button and choose an appropriate map. Here I used terrain with labels.

Select Basemap

Once you have your points projected onto the base map you will need to export them as a shape file. Right click on the layer you wish to export and choose the selection to make a shape file.

Export shape file

Perform Hot Spot analysis on this shape file by assigning the count data to each point.

Perform hotspot analysis

The output for the Hot Spot analysis is the confidence level that clustering (hot spot) or dispersal (cold spot) is occurring.

Hot spot output 


Hot Spot analysis of P. maniculatus based on trap line locality provided mixed insights. Specifically, this analysis does not show how P. maniculatus is distributed on the landscape, but rather how the researchers distributed trap lines on the landscape. However, it does demonstrate that across trap lines P. maniculatus essentially evenly distributed. Only one trap line represents a hot spot for this species, while all other localities were not significantly cold or hot, that is not clustered or dispersed. The single hot spot occurred at a point in space where a human food subsidy to the population likely drove increased population density, as it was located within a campground.


Final map


Despite the non-random distribution of point localities, the distribution of P. maniculatus across the points was consistent with the generalist ecology of this species. This suggests that members of small mammal community at all measured points in space will be influenced by the presence of P. maniculatus. However, the results of this analysis do not suggest that that influence of this generalist species will be uniform at different spatial localities as the habitat and other rodent species present vary in space despite the distribution of P. maniculatus.

While Hot spot analysis is a useful tool for understanding the distribution of individual occurrences in space, sampling will exert a strong influence over the inference one is capable of making based on this analysis. Trap line data do not provide much inferential power as the distribution of points is determined by the researcher and at best the hot spot analysis provides the same information that a histogram might, with count as the response variable and the latitude or longitude of each trap line as the explanatory variable. The greatest value in performing this analysis on this type of data is that it provides a standardized visual representation of the data. Which may help me see how variable the distribution of a species is between trap lines. However, moving forward I will be using elevation rather than latitude and longitude and my spatially explicit variable.


April 28, 2017

Tutorial 1: Calculating Global Moran’s Index on Point Data

Filed under: 2017,Tutorial 1 2017 @ 2:19 pm

A quick summary of my project and objectives: I am using a subset of the FIA (Forest Inventory and Analysis program run by the USFS) dataset, which is contained in a .csv file. I want to assess the understory vegetation response to ponderosa pine canopy cover and other environmental factors, and I am using USFS plot data to simulate the process I will be using on my own dataset in the future. The dataset currently looks like this:

Obviously there are multiple variables associated with each plot, but for the sake of this exercise, I will only be assessing canopy cover and total vegetation cover. Because my dataset is not a raster, nor is it regularly gridded, I decided not to attempt a semi-variogram. I was also unable to kick my ArcMap habit for this exercise. I decided to run a Global Moran’s Index assessment on the points in ArcMap.

The first step of this process was to load my .csv into Arc. This can be accomplished by adding the file as X,Y data. However, you can’t do much with X,Y data for our purposes, so I exported the X,Y data as a shapefile. This created an attribute table holding all the same information as the .csv file. I also added a boundary of Deschutes County, which is where my subset of data is located. I used the Project tool to make sure both the points and county boundary were properly oriented on the landscape, using the WGS_1984_UTM_Zone_10N coordinate system.

From here, the process was very simple. I used the search tool to seek out a spatial autocorrelation tool, which brought up the Global Moran’s Index. I selected my points layer as the input, and selected canopy cover and total vegetation as my input fields (in separate trials).

The reports showed the levels of clustering for each variable. The reports are shown below.

Canopy cover spatial autocorrelation:

Vegetation percent cover spatial autocorrelation:

I found that the variables had very different levels of clustering. The canopy cover variable had essentially no clustering. The Moran’s Index value was -0.034, with a p-value of 0.57, meaning the level of clustering was non-significant. However, the vegetation cover was highly clustered, with a Moran’s Index value of 0.0496 and a p-value of 0.00028.

This method was useful in giving me a different perspective on the data. It is difficult to simply review a spreadsheet and try to make any meaningful interpretation of the numbers. While the tool did not give any sort of visualization, it allowed me to imagine how the vegetation might be distributed across the landscape. The results mean that canopy cover varies across the focus area randomly, so stands with thin or dense canopies can be found just about anywhere. However, stands with dense understory vegetation are probably located in similar regions, as are stands with thin understories. This might imply that other environmental factors have a strong influence on how densely vegetation grows over a wide area.

Tutorial 1: Species Area Curves with Vegetation Community Data

Filed under: 2017,Tutorial 1 2017 @ 1:58 pm

Salmon River Estuary: Map of study site and plot locations

Species Area curves represent the relationship between a specified area of habitat and the number of species present there. Species-area curves are useful for comparing species richness data from sample units that differ in area, especially nested sampling plots. Empirically, the larger the area, the larger the number of species you are likely to capture; a principle that has held mathematically true in ecology. Species-area curves can address the adequacy of sample size and estimate the overall diversity and spatial heterogeneity within a community dataset. Species-area curves help to conceptualize the relationship between species richness and spatial scale in a sampled environment. Typically species-area curves take only species presence, into account and demonstrates average number of species per plot sampled. Species-area curves can also help researchers determine the appropriate spatial grain for a study area, as the spatial grain (the size/dimensions of the smallest sample unit) has a strong impact on the shape of a species area curve. Species area curves are a basic diversity measure that are helpful for describing the heterogeneity of a community sample. Species accumulation curves, are derived by a log transformation of the species area curve, and describe a linear trend of how species richness increases over area sampled. The species accumulation curve is also called the ‘effort’ or ‘discovery’ curve, as the species richness measured is often related to the amount of time and space sampled by a researcher. The goal for my tutorial is to demonstrate how I have created species-area and species accumulation curves to investigate spatial pattern of vegetative richness within and between tidal marshes at my study site.

Equation for Species Area Relationship: S = cAz  >>>>> log(S) = c + zlog(A)

Where S = Species Richness, A = area, and c and z are empirically determined constants from plotted data (Rosenzweig 1995).

The research question I have relevant to this exercise/tutorial is: Do species area relationships differ between restored and remnant tidal marshes at Salmon River Estuary? How do field methods, specifically, nested, rectangular (Modified Whittaker) plots and square, non-nested (Transect) plots capture these species-area relationships differently?

Between the two types of sampling techniques I applied, I would ultimately expect the Modified Whittaker plots, which cover more area, to capture the same number if not more salt marsh species compared to transect plots. I hypothesize that the reference marsh would have a greater number of species, and a steeper slope for the species-area curve compared to restored marshes. I would also expect more recently restored marshes to have less ‘steep curves’ than marshes that were restored less frequently.

Name of the tool or approach that you used:

PC-ORD is statistical software counterpart to R that that performs ordinations and multivariate analyses on community datasets. Microsoft Excel was also used for graphics. Initially I used PC-ORD to generate the average number of species per plot over cumulative area sampled, for each plot size in each tidal marsh. I then exported the data to excel and produced graphs of cumulative species over log10 area and figured out the slope of each line. I kept it simple and used software I am familiar with, to expedite the process of producing useful results I can interpret and be confident in.

Brief description of steps you followed to complete the analysis:

Step 1: Enter data into an excel spreadsheet and format for PC-ORD

Open PC-ORD. Start a new project (File > new project). Name the project appropriately and import the excel sheet into the ‘main matrix’ input. Save the project. Use the summary command to produce a species-area curve output. The output will include average number of species per plot over sampled distance. Repeat this for every plot size sampled ((x2)1 m2, 10 m2, 100 m2, 1,000 m2).

Save PC-ORD output and import back into Excel. Graph average species richness against area sampled for species area curves. Do this for transect 1 m2 plots and MW 1 m2 plots, for each tidal marsh, using Excel commands (highlight data > insert > scatter plot graph). Calculate the log10 of the area sampled and log10 of species richness, for all nested MW plots (1 m2, 10 m2, 100 m2, 1,000 m2). The trendline is a fitted linear trend that describes log species richness vs. log species area.

Brief description of results:

The species curve for transect data appears to rise and plateau quickly, whereas the MW species curve rises more steadily and plateaus later (Figure 1 a-d, Figure 2 a-d). MW plots had higher overall species richness (21) and lower species turnover (1.7) compared to transect data species richness (17) and species turnover (4.5). This is an expected result, as there were fewer larger MW plots that covered more area compared to transect plots. Furthermore, the transect plots exist on environmental gradients related to salinity and elevation, suggesting greater species turnover within sample units, compared to MW plots that are not aligned with gradients (McCune and Grace 2002; Rosenzweig 1995). Average species accumulation for transects is similar (16.26) to average species accumulation for MW plots (17.93), suggesting that both represent adequate sample size and appropriate spatial grain for this study site (Figure 2 a, b, c, d, Table 2). ANOVA and post hoc t tests indicated that all tidal marshes have significantly different species accumulation rates.

Critique of the method – what was useful, what was not?

I would describe this method as useful, fast and simple. It was reasonably accessible and easy to do, without requiring extensive knowledge of statistical programs or software packages. However, I would not have found the method as easy if I had not just completed a course at OSU that instructed me on how to use PC-ORD (taught by Dr. Bruce McCune, who created the software). Needing to use two programs to get my results may be a bit cumbersome, however with an initial quick google search I was not able to find any helpful code on how to produce semilog species area graphs and species area curves with R. It seems like pursuing R would have been more frustrating and the results would not have been as clear to me. I am satisfied with my approach and pleased with my outcome, as it is simple and the results make sense to me.

Example of Results:





Tutorial 1: Using variograms to assess patchiness in raster pixels for a single wildfire event

Filed under: 2017,Tutorial 1 2017 @ 10:32 am

Overview: Question & Tool

Often patch dynamics are examined under the lens of single disturbance events, but contemporary disturbance regimes are demonstrating that large-scale disturbances can overlap in time and space.  The spatial configuration of patches from an initial disturbance may influences the spatial configuration of patches of a sequential disturbance.  The broader research question is focused on determining if there is a relationship between patches from an initial disturbance of beetle outbreak and a sequential disturbance of wildfire.  Before I address the broader question, I would like to determine if the post-fire landscape indicates some inherent patchiness. The research question here asks:

What is the patchiness (variability) of  fire severity for a single fire disturbance?

In order to do this I employed a variogram to assess the spatial pattern for a post-fire image.  The post-fire image was generated for the Entiako fire that burned in 2012 in central interior British Columbia.


The burn severity raster was generated by following protocols outlined by Key and Benson (2006). I collected a pre and post fire Landsat (30m) images for the Entiako fire that burned in 2012.  Both images were from the same month to match phenology and had minimal cloud cover.  I calculated the differenced Normalized Burn Ratio (dNBR), a pixel indexing classification.  Initially, NBR is computed for each individual image:



Then the difference is taken between the pre- and post-fire NBR images:

dNBR = NBRprefire – NBRpostfire

The dNBR image is useful for differentiating a gradient of fire severity within a fire perimeter.

A variogram was used to assess the spatial variability between pixel values in the dNBR image. A variogram is a means of evaluating spatial correlation between locations, and the shape of the variogram may indicate the level of patchiness and average patch size. The following spatial equation was considered as a guide:

  • yi = f(yi-h,i+h)

The variogram analysis was conducted in R version 3.3.2 (Team 2016) with the USDM Package (Naimi 2015). I used the following code:






The variogram analysis indicated a spherical relationship in the dNBR image, which suggests some level of patchiness (Figure 1).  The sill of the variogram suggests that patches averaged about 1500m, which seemed a bit high (Figure 1).  The dNBR image contained a couple large bodies of water, which may influence the variogram.  I removed the water bodies from the dNBR image, and ran the variogram on the water free image (Figure 1).  The water free analysis shows a shift in the sill location and indicates that average patch size might be around 500m (Figure 1).  The variogram without water also seems to have a double sill, which might indicate that there are both smaller and larger patches.

Critique of the method – what was useful, what was not?

The variogram seemed like a useful descriptive statistical tool for assessing the spatial relationship of pixels, as an indicator of overall patchiness, and an indicator of average patch size.  It allowed us to examine this relationship while maintaining the continuous dNBR raster.  Conducting the variogram in R was the key to maintaining the continuous raster data.  In order to conduct the analysis in Arc GIS I would have had to convert the continuous raster to point data.  This analysis allowed us to consider both implicit and explicit concepts of time.  The dNBR image captures this idea of implicit time, by measuring the difference between the pre- and post-fire images.  The variogram allows us to capture the idea of explicit time by describing the inherent patchiness in the post-fire landscape. It is important to recognize that the variogram analysis is taking a sample of the data.  The sample is dependent on the spatial resolution and extent.  If you rerun the variogram analysis the curve is slightly different each time, because the model re-samples the data set.  In trying this a few times it did seem like the sill remained at a relatively consistent lag distance.


Figure 1. Variogram for the Entiako dNBR image with water included (a) and excluded (b). The red circles indicate the sill, which provide a general indication of patch size. With increasing distance (Lag) there is increasing variability. The dNBR images (c and d) for the Entiako Fire that burned in 2012. Image c was used for variogram a, and Image d was use for variogram b. Removal of large bodies of water is visible in d (black circle), which correspond to variogram b.



Key, C. H., and N. C. Benson. 2006. Landscape assessment (LA): Sampling and analysis methods. USDA Forest Service General Technical Report RMS-GTR-164-CD:1–55.

Naimi, B. 2015. Package “ usdm .” R Topics Document.

Team, R. C. 2016. R: A language and environment for statisitical computing. R Foundation for Statistical Computing, Vienna, Austria.

April 20, 2017

Exercise 1 Part 2: Assessing spatial autocorrelation of snowpack using semivariograms

Filed under: Tutorial 1 2017 @ 10:50 am

Research Question

For this exercise, I elected to compare the spatial autocorrelation of snow cover frequency (SCF) before and after the Grays Creek wildfire in central Idaho.

Approach/tools used

After trying out a few different tools, I decided to use ArcGIS to create semivariogram plots for my SCF data within the Grays Creek wildfire perimeter.

Analysis procedure

First, a bit of data processing was required for this exercise. Since conducting spatial autocorrelation analyses in ArcGIS is more attainable using point data, I converted my SCF data from its original raster format into an array of equally spaced points (see example in figure below). A point was created at the center of each 500m pixel. This process was completed for both the winter (Oct. 1 – May 1) preceding the wildfire and the winter following the wildfire. Finally, the semivariogram clouds were produced using ArcMap’s geostatistical analyst toolbox.


SCF Pre-fire Semivariogram cloud:

SCF Post-fire Semivariogram cloud:

Overall, the semivariograms display similar distributions between the pre- and post-wildfire data. The semivariogram clouds show the empirical semivariance values for each pair of points plotted against the lag (distance) separating these two points. For this data, higher semivariance values correspond to large differences in SCF, while low values indicate similar SCF values. Although subtle, it does appear as though the post-fire semivariogram has a less defined trend of increasing semivariance with distance. There’s not enough information to say whether this is a result of severely burned wildfire patches, but it would be interesting to examine this potential relationship further.

As expected, data point pairs generally become less similar as the distance between them increases. In both plots, the semivariogram clouds display a steady increasing trend. This trend is likely a product of the snowpack gradient that exists across elevations at a local scale – as elevation increases, so does snowpack. In other words, points that are closer together will be more similar in elevation, and since snowpack is closely linked with elevation at a local scale, nearby points will also have more similar SCF values.


Method critique

The main critique of this procedure was the variable that I chose to analyze. Because snowpack is undoubtedly spatially autocorrelated at close distances (due to the strict elevational relationship), the semivariogram did not reveal any patterns that I was not already aware of.

However, I anticipate the semivariogram to be a useful spatial analysis tool as I move forward with my research. The semivariogram statistics would probably be much more appropriate for either my vegetation index data or burn severity data as these datasets have a less predictable distribution across a landscape.

April 5, 2017

Tutorial 1

Filed under: 2017,Tutorial 1 2017 @ 8:15 pm

This is the blog page for Spring 2017 Tutorial 1 posts

© 2019 GEOG 566   Powered by WordPress MU    Hosted by