GEOG 566

         Advanced spatial statistics and GIScience

Archive for Exercise/Tutorial 1 2018

April 30, 2018

Path Analysis for Understanding Visitor Movement

Filed under: 2018,Exercise/Tutorial 1 2018 @ 11:45 am

Question Asked

The movement ecology paradigm provides a useful, organizing framework for understanding and conducting path analysis on GPS tracks of overnight, wilderness sea kayakers in Glacier Bay National Park and Preserve Wilderness. Formally proposed in 2008, the movement ecology paradigm was designed to provide an overarching framework to guide research related to the study of organismal movement, with specific emphasis on guiding questions of why organisms move and how movement occurs through the lens of space and time (Nathan et al., 2008). The framework emphasizes understanding components of the movement, looking for patterns among those components, and understanding meaning behind movement through the underlying patterns (Figure 1). Ultimately, the target understanding is the movement path itself, which can be understood through quantification of the external factors, internal factors, and capacity for movement and/or navigation by the moving organism (Figure 2; Nathan et al., 2008). Through employing a movement ecology approach to the study of overnight kayaker movements in a protected area, individual movement tracks can be broken down into relevant components, the components of the path can be studied for patterns, and ultimately internal and external factors can be explored for influence or explanation of the movement path.

Following the movement ecology paradigm, the questions of focus for Exercise 1 include the following:

  1. How does the distribution of step lengths and turning angles vary throughout the duration of the overnight kayaker’s trip?
  2. What is the frequency of step lengths (in meters) at one-minute time intervals for overnight kayaker movements?
  3. What is the frequency of turning angles (in degrees) at one-minute time intervals for overnight kayaker movements?

These questions focus on understanding two parameters of interest for individual movement paths: step length and turning angle. These parameters have traditionally been used to describe and quantify the movement of animals. Changes in step length and/or turning angle have been used to identify changes in behavioral states among animals, such as migration or foraging behaviors. In the context of this work, I hypothesize that understanding changes in step length and turning angle among kayakers will allow us to identify changes in the behavioral states of the recreationists of focus in this work. Therefore, univariate analyses to create histograms of step length and turning angle were produced for a small test batch of five GPS tracks.

Tool/Approach Used

Jenna and I worked together on developing the analytical approach – both from a conceptual perspective and from a pratical and mechanical perspective in R. We used the R analysis package moveHMM to generate step lengths and turning angles for each minute of the track (Michelot, Langrock, & Patterson, 2017). The packages uses hidden Markov models and the frequentist (drawing conclusions from sampled data emphasizing frequencies and/or proportions) statistical approach. The hidden Markov model approach requires that spatial data meet the following two criteria: 1) the data are sampled at regular intervals, and 2) the spatial data have a high degree of accuracy with minimal positional error. The data used in this exercise are believed to meet the criteria for hidden Markov models. First, while the data were not originally collected in one-minute increments, the maximum amount of time between collection of X,Y coordinates throughout a track was one-minute. The data were down-sampled into one-minute time bins to standardize the sampling interval through averaging the values of the X and Y coordinates for each one-minute interval. The maximum number of X,Y points averaged per one-minute was 6, as the minimum amount of time between X,Y observations was 8 seconds. Second, the spatial accuracy of the GPS units used to collect the tracks advertise a spatial accuracy of 2.5 meters. Given that the emphasis of the overall analysis is to understand general patterns in movement and behavioral relative to time rather than in relation to specific X,Y point locations the potential for up to 2.5 meters of spatial inaccuracy in the data does not present a cause for concern. Additionally, the data were visually inspected using ArcMap and significant spatial inaccuracies were removed prior to analysis. The tracks analyzed are believed to be free from spatial inaccuracy due to unit malfunction or loss of satellite signal.

Additional tools used for data preparation and data visualization for use before and after the moveHMM package tool include the following:

  1. A function to convert latitude and longitude spatial measures to UTM measures (source code taken from
  2. The plotSat function in the R package ggmap.

Description of Steps Used to Complete the Analysis

Jenna and I developed R scripts to complete the data transformation and formatting needed for data output from the collected GPS units to be ready for analysis by the moveHMM R package, run the tool, and create additional data visualizations.

Data Input: The input data was the attribute table of all the collected GPS tracks exported as a CSV file from ArcMap.

Steps for Analysis:

Data preparation for moveHMM: The tool requires that data be regularly sampled and that each observation has a unique numeric ID code associated with the data in the data frame. The tool will operate on latitude and longitude coordinates, but for this application of the tool it was decided that UTM coordinates, with units of meters, produced results that were practically meaningful.

  1. Import CSV file into R Studio (manually). Make sure the data column with the date and time of X,Y point collection is set to datetime.
  2. Down sample the data so that X,Y observations are aggregated into one-minute time bins. Average X and Y coordinates within each minute to produce one-minute time bin summary.
  3. Convert the averaged latitude and longitude coordinates to UTM coordinates.
  4. Insert a new ID data column to provide a numeric ID code for each GPS track in the data frame.

Running the moveHMM tool and generating output:

  1. Run the moveHMM tool and create plots. See lines of code below for example commands:

#Loading prepared dataframe into moveHMM for creation of step length and turning angle histograms

data <- prepData(PreppedData_MoveHMM, type = “UTM”, coordNames = c(“X”, “Y”))


#Creating plots of and histograms


The output of the moveHMM tool is a series of four figures produced for each GPS track (Figure 3). The top left histogram depicts how the step length (meters) varies through time (t = minutes). The top right histogram depicts how the turning angle (radians) varies through time (t = minutes). The bottom two histograms show the frequencies of step length and turning angle. The moveHMM summaries leave something to be desired, namely, the frequency histograms are not expressed as a percentage on the Y axis and the histograms are not standard across all tracks analyzed – the lengths of the X and Y axes vary. To produce more descriptive figures, the output file produced from the moveHMM analysis was manipulated to produce additional histograms for step length and turning angle.

Data Visualization of moveHMM results:

Step Length

  1. Identify the minimum and maximum step length values from the entire dataset. Given the minimum and maximum step length values, determine the desired number of step length bins for the X axis.
  2. Summarize the step lengths into the desired bins. Calculate the percentage of step lengths per bin.
  3. Create bar charts of individual tracks and all tracks together using ggplot.

Turning Angle

  1. Display the calculated turning angels in a rose diagram using the rose.diag tool from the R package Circular using 12 bins.

Description of Results Obtained

The step length and turning angle histograms suggest that across the five test tracks processed using the moveHMM tool, the majority of step lengths across the tracks were between 600-700 meters per minute and kayakers generally traveled in a straight direction turning infrequently. The step lengths were greater than originally anticipated, but after investigation into average speeds of travel for water-based recreationists, it appears that having a step length of 600-700 meters per minute is not unreasonable. Outside of the 600-700 meter per minute step length bin, the 0-100 meter per minute step length bin was the next most frequent among the step length bins. The 0-100 meter per minute step length bin likely represents stoppage time. The disparity between the two step length categories suggests that in the case of water-based recreationists, step length may be a good metric for examining changes in behavioral state in future bivariate analyses.

When looking at the available geography for water travel and the paths taken by the overnight kayakers, the small range of values observed for turning angle movements is consistent with the data visualization in space on the satellite image. Presently, the scale as which the turning angle histograms are presented is difficult to interpret and perhaps not refined enough to identify cut points in turning angle movements that would suggest changes in behavioral state. Further analysis and exploration is needed to understand whether or not turning angle would be a useful metric for further consideration in Exercise 2 analyses.

Critique of Method

Using the moveHMM tool to produce measures of step length and turning angle per minute of a recreationist’s trip provided a novel application for looking at changes in movement path overtime. Once the data were formatted properly, the tool was easy to run, producing a dataset with the calculated step length and movement angle measures while also producing four figure displays per track analyzed.

The data formatting for the tool and post processing visualization generated by the tool were cumbersome and had a steep learning curve. The tool requires data to be summarized into regular time bins for analysis. My data were not originally in this format; therefore, down-sampling was preformed to meet this requirement of the tool. Because the down-sampling was performed, in part, on a datetime variable in R, I experienced a learning curve in getting R to accurately read the date and time information in my data frame as a datetime class rather than a character or string class. Additionally, my data were not originally projected into UTM coordinates and I could not generate the necessary UTM coordinates through analysis in ArcGIS. I therefore had to search around for a work around and ended up finding a function that I was able to customize for use in transforming my data. While the moveHMM tool will work with latitude and longitude coordinates, the output is less meaningful and difficult to interpret when considering the variable of step length. Therefore, converting to UTMs was another data transformation step that was needed in order to run moveHMM smoothly.

The histograms generated by moveHMM, while useful in looking at overall patterns and trends are not user-friendly. The histograms themselves cannot be customized from within the package; therefore, any customization or additional data visualization must be done outside of the moveHMM package using different R tools. I would have also appreciated more documentation on the turning angle calculation, movements, and output. The associated tool documentation does not provide any guidance for interpreting the histograms, and while the step length histogram was pretty straight forward I had more difficulty with the turning angle diagram. This may be due to my inexperience with this measure, but I do feel the tool’s documentation could have provided more concrete examples of interpretation and additional documentation on units.



Michelot, T., Langrock, R., Patterson, T. (2017). An R package for the analysis of animal movement data. Available:

Nathan, R., Getz, W. M., Revilla, E., Holyoak, M., Kadmon, R., Saltz, D., & Smouse, P. E. (2008). A movement ecology paradigm for unifying organismal movement research. PNAS 105(49): 19052 – 19059.

Advanced Spatial Statistics: Blog Post 1 (.5) : Temporal autocorrelation of phenological metrics

Filed under: 2018,Exercise/Tutorial 1 2018 @ 11:11 am

1. Key Question

How has the phenology of production changed over time at two grassland sites, and does this change differ between and C3 and a C4 grassland?

2. Approach used

My approach uses an autocorrelational analysis to assess whether differences in phenological indices are indicative of change over time, or cyclical patterns.

3. Methods / steps followed

To answer this question, I used the R package greenbrown to extract phenological indices from time series of MODIS NDVI data from 2001-2015 at two locations in C3 and a C4 grassland. The sites correpond to Eddy Covariance Flux tower locations at the University of Kansas Biological Station and Konza Prairie Biological station, in eastern Kansas.

Phenological metrics include the start of the growing season, the end of the growing season, the length of the growing season, the peak growing season productivity. The Phenology() function calculates the phenological metrics by 1) identifying and filling permanent (i.e., winter) gaps in the time sereis, 2) smoothing and interpolating the time series, 3) detecting the phenology metrics from the smoothed and interpolated time sereis, and 4) correcting the annual DOY time series such that the metrics associated with days of the year (e.g., start of season, end of season) don’t jump between years. The Phenology() function provides several different approaches to calculate phenology metrics and to conduct temporal smoothing and gap filling. For this analysis, I used the “White” approach to calculation phenology metrics by scaling annual cycles between 0 and 1 (White et al. 1997) and used Linear interpolation / a running median for temporal smoothing and gap filling. The code for the phenology calculation function is “kon_phen <- Phenology(kon_ndvi, tsgf=”TSGFlinear”, approach=”White”)”. The end result is a dataframe with annual phenology metrics.

After calculating the annual phenology metrics, I used the acf() function to assess whether annual differences in phenology were a product of change over time, or cyclical trends.

4.1 Results: Phenological metrics

The phenological metrics appear to differ distinctly between the sites, which reflects established differences between the phenology of C3 vs. C4 grasses. The C3 site has a consistently longer growing season than the C4 site, with an earlier start of season and an later end of season. Based on the NDVI data, the sites have similar mean growing season (MGS) and peak growing season values of NDVI.

4.2 Results: autocorrelation analysis

In the autocorrelograms above, the dashed lines represent the upper and lower thresholds for statistically significant autocorrelation. Vertical lines represent 1-year lags, and a line at 0 is provided for reference. In each plot, the C4 site is orange, and the C3 site is blue.

The autocorrelational analysis reveals only a few instances of temporal autocorrelation that appear to be marginally significant. Overall, there does not appear to be strong temporal autocorrelation in the phenological metrics, suggesting that there are not annual or interannual cycles influencing the phenological metrics.

There appear to be different, but not significantly distinct, patterns of autocorrelation between the C3 and the C4 site, suggesting that the production patterns are being controlled by the same environmental drivers.

The few instances of statistically significant autocorrelation are:
– 3- and 4-year lags for EOS for the C3 site, indicating that the first positive peak in a cyclical pattern of EOS would occur at 3 years, and the first trough for EOS would occur at 4 years. This pattern is not evident at the C4 site.
– 2-, 3-, and 5-year lags for MGS at the C3 site, indicating that the first negative trough in a cyclical pattern of the mean growing season value would occur at 2 and 3 years intervals, and that the first positive peak in the cycle would occur at 5 years. Again, the C4 site does not exhibit a similar pattern.

Anecdotally, the result that the C3 site shows more statistically significant autocorrelation might indicate that the C3 site follows more cyclical patterns of phenology than the C4 site– perhaps suggesting that production at the C3 site is less sensitive to interannual variation in climate.

5. Critique

The distinction between the autocorrelational patterns at the C3 and C4 site may be due to a change in the land management at the C3 site over the course of the time series analyzed. In 2007 management shifted from an irrigation / field management type to lleaving the area in a more natural prairie state, when the Eddy Covariance Flux tower was installed. In contrast, the C4 site was maintained as a natural, unirrigated prairie for the duration of the time series.

Further, though NDVI is easy and accessible, comparison of the NDVI record with the Eddy Covariance Flux record at these sites suggests that it does not accurately capture intra-annual variation in production dynamics between the C3 and C4 sites. Eddy covariance flux tower records show that the C4 site has consistently higher max annual production than the C3 site, and a more distinct phenology. Because NDVI is a proxy of vegetation health by measuring greenness, rather than the physiology of plant production, it is less useful when plants look the same, but have distinct resource-use efficiences.

This method appears to work well on NDVI data; the greenbrown package appears to have been optimized for a ~2-week temporal resolution. When I attempted to use the package on Eddy Covariance flux data at daily resolution, the Phenology() function returned errors or missing data, and was unable to produce a smooth time series. Next steps include further processing the flux data for use with the greenbrown package, and performing a bivariate analysis to link annual phenological metrics with annual climate variables (e.g., mean annual temperature, mean annual precipitation, monthly precipitation variables, growing degree days).

April 29, 2018

Mapping Beaver Dam Habitat and Geometry through OD Cost Analysis

Filed under: Exercise/Tutorial 1 2018 @ 6:08 pm


Question 1: How well do habitat models predict suitable beaver dam locations?

This analysis was initially driven by the question of how well models of suitable habitat for beaver dams could predict stream reaches where observed beaver dam locations were identified during a pilot field season in the West Fork Cow Creek of the South Umpqua River (see Map: Umpqua Basin  below) in southern Oregon in fall of 2017. This question focused initially on implementing a Habitat Suitability Index (HSI) described by Suzuki and McComb (1998).  The findings from this query are noted below and ultimately led to a second question for this analysis.

Map: Umpqua and WFCC basins

Click to enlarge


Question 2: Is there a difference in the geometry of suitable dam habitats with and without observed  beaver dams?

After applying the Suzuki and McComb HSI to the stream network it appeared that observed beaver dams occurred more frequently on those sections of streams where habitat ‘patches’ – that is, contiguous segments of the stream classified as suitable for damming – were larger and/or separated from other patches by smaller breaks of unsuitable habitat.  These obervations are consistent with theory from landscape ecology that suggest habitat ‘geometry’ (see Figure 1) such as size and relationship to other habitats can be an important factor in selection of habitat by animals (Dunning et al, 1992).  To approach this question I focused on generating two metrics of habitat geometry: 1) patch size, and 2) distance to next patch.

Figure 1.

Figure from Dunning et al. 1992 showing how habitats A and B, while both too small to support a population, may be occupied (A) if in close proximity to other habitats


Click to enlarge

Seemingly simple, these tasks ultimately proved more time intensive than I first anticipated so my contribution for this first exercise focuses less on results than on describing the geoprocesing procedures and workflow used derive this information.  The hope is that others may be able to repeat the processes for similar questions related to stream habitat without running into as many dead end efforts as it took me.

Data and Tools Used:

For this exercise, I used data from Netmap, a modeled stream layer representing the stream network with polylines at approximately 100m segments across more than two dozen attributes.  More information can be found here.

Select by Attribute query to identify suitable habitat.

The first task was to identify all segments in the stream network that fit the HSI criteria: stream gradient ≤ 3%, active channel width > 3m but < 6m, and valley floor width ≥ 25m.   To accomplish this, I used a Select by Attribute query, which can be found in the Attribute Table “Options” dropdown.  The procedure is relatively simple but requires that you use SQL (Structured Query Language) syntax to define the variable(s) and respective parameters for the attributes that you wanted selected.

Analyzing Patch Length with Buffer and Dissolve tools

To combine contiguous segments of the stream suitable for damming into single/unique habitat patches I used a combination of tools found the in the ArcMap Toolbox. The primary tool was Dissolve, which can combine a layer’s features, in this case the stream segments defined as habitat, along with the attribute information for each feature.  After several iterations, I found that using the Buffer function to ensure overlap among contiguous habitat segments in the stream network prior to dissolving was helpful.

Analyzing Distance to the next Habitat Patch with OD Cost Matrix

I initially tried several approaches to measure the minimum distance to the next patch for every patch in the stream network. The first approach I used was the nndist function in the R spatstats package that calculates the distance from one point to the nearest neighboring point. A similar tool exists in ArcMap, Near, and was appealing so that I could maintain workflow in the same program if possible.  Ultimately, I landed on OD Cost Matrix, a traffic planning tool in ArcMap that looks at distance between points using a specified network of travel corridors (aka streets).   This seemed best suited to my needs of looking at travel distances between patches when considering realistic pathways of movement for beavers along the stream network.

Steps to complete the analysis:

Identify Suitable Habitat with Select by Attribute.

These steps were relatively straight-forward. I opened ‘Select by Attribute’ window from the options dropdown menu in my attribute table of the stream network layer and used the following syntax to select all stream segments that met the HSI criteria: “WIDTH_M” >=3 AND “WIDTH_M” <=6 AND “GRADIENT” <= .03 AND “VAL_WIDTH” >= 25.  After running this query, all of the segments matching these criteria are highlighted on the stream network layer.  From there I exported the selected items into a new layer Suitable_habitat by right clicking the stream layer in Table of contents, select export data, and chose to export ‘selected features ‘from dropdown, being sure to click what geo-referencing system I wanted to use. I usually choose to reference based on the existing data frame when exporting data since I specified that when first opening the map and its an easy way to keep all exported layers consistent.

Dissolve Contiguous Stream Habitat Segments into Single Patches

I defined a “habitat patch” as any contiguous length of suitable habitat segments in the stream network. My goal was to take these individual contiguous segments and convert them into single polylines of “patches’ through the dissolve function that would ultimately provide a measure of the total patch length.

Step 1: Buffer all suitable habitat segments.

Before dissolving I created a 20m buffer on my habitat segments so that I could ensure that contiguous sections were overlapping eachother.  To do this, I opened the Buffer tool, selected the Suitable_habitat layer and specified 20 meter buffers.  This generated a new layer of the buffered habitat segments I labeled Suitable_habitat_20mbuff.

Step 2: Dissolve Buffered layer into unique patches.

I then opened the Dissolve tool, selected the buffered layer.  For the second part of this function I needed to choose which attributes to keep from each habitat segment (e.g. length, gradient, width, valley width, etc.) and how I wanted it combined with the other segments as attributes of the patch layer that I was generating.  To do this I went to the Statistics drop down in the Dissolve window and added each attribute along with the relevant calculation. For example, I specified that Length (LENGTH_M) be the sum (SUM) of all the segments for that patch, for GRADIENT I choose to carry over the mean of each segment (MEAN) and so on.   Finally, I unchecked the “Feature with multiple parts” box and checked “Unsplit line” box. By doing so I told ArcMap that I do not want all of these segments to be to turned into one feature (meaning one row of attributes in the attribute table) and instead turn any overlapping line vertices (the ends of my stream segments) into a single feature or ‘unsplit” line.  I then ran the function and labeled the layer output as Habitat_patch.

*Note, that I’ve had to repeat the Dissolve more than once and found the Results window in ArcMap highly useful. (Main Toolbar > Customize > Results).  By using it I can re-navigate to the Dissolve tool table with all the Statistic/attribute selections already populated.

Calculating Distance between Patches Using OD Cost Matrix

The OD Cost Matrix is a tool developed for transportation planning that can calculate the distance (referred to in Arc as Cost) from starting points (Origins), to ending points, (Destinations), using a specified network of travel corridors (Network Dataset) such as streets, highways, etc.  When completed, the analysis generates a dataset with distance from every specified origin to every specified destination along the network paths and can become quite large depending on the number of origins/destinations that you specify.

Step 1: Turn on relevant tools:

The first step I needed to do was turn on the Network Analyst extension found on the Main Toolbar > Customize > Extensions and click the Network Analyst to add a check next to it.  Then I needed open the Network Analyst toolbar.  Main Menu Toolbar > Customize > Toolbars, click on Network Analyst.

Step 2: Create Network Dataset

In the Arc Catalogue, I navigated to the network I wanted to use as travel paths (the stream network layer), right clicked and selected New Network Dataset.  From there, Arc moved me through a series of windows.  The most important of which are to specify ‘any point’ under the connectivity policy.  The default ‘end point’ leaves a number of Origin points out of the analysis for some reason.  The other important window/step was adding the metric I wanted to use for the ‘cost’ analysis.  In my case this was the LENGTH_M variable, which is the length in meters of each stream segment in the network.  Note, that the name needed to match the variable name in the dataset exactly.

Step 3: Convert Dissolved Patch Habitats into Point Data.

Because OD Cost Matrix relies on starting points and ending points I needed to convert the polylines representing my patches into points.  To do this I used the tool Feature Vertices to Point. The tool itself is fairly straight-forward but requires a tradeoff decision on whether to convert the patches to a single “MID” for the mid-point of the patch or “DANGLE” for a point at each end of the patch. The Mid Point selection offers a simpler output from the OD Cost Analysis, and the “Dangle” offered a more precise measure of nearest distance, but cumbersome output because it had twice the data points to and from which distances needed to be calculated (more on that below). In the end I did both and used the mid-points as barriers (see next step).

Step 4: Load Data in OD Cost Matrix

In the Network Analyst toolbar I selected “OD Cost Matrix” from the dropdown menu, then, clicked the icon with a small flag overlaid on a grid.  This last step opens a new window adjacent to the table of contents. In that new window I right click the bolded “Origins” and selected “Load data” and selected the patch points I created in Step 3.  Then I repeat the process for “Destinations”.  Then for “Point Barriers” I loaded my mid-point patch layer.  It took me a few times to realize this, but the barriers data prevents the analysis from computing numerous unnecessary distances.

Step 5: Solve OD Cost Matrix

Once all the data is loaded I clicked the solve icon on the Network Analyst toolbar that looks like a chart with a upward line curve overlaid on it.

Step 6: Export Distances from OD Cost Matrix into New Layer.

Once OD Cost Matrix completes the analysis the Table of Contents is populated with a number of outputs including a layer called “Lines”.  This layer shows straight lines from every Origin to every Destination.  At first this was a little confusing because the lines suggest that these are Euclidian distances, but once opening the attribute table I was able to verify that they are in fact distances via the stream network.  I exported this as a new layer as patch_distances.

Step 7: Create Summary Table for Distances and Join to Dissolved Patch Layer

The last step I made was to add the new distance information to my layer of dissolved habitat patches and then add that information to the Habitat_patch layer.  To do this I opened the attribute table for the patch_distances and right clicked on one of the column headers and selected “Summarize”.  This opens a window with a drop down for what column to create summary data and with a number of options lower in the window for what information to summarize.  Here I used the patch orgin and destination ID and summarized by minimum, maximum, sum and average distances.  This then generates a new table summarizing each of those metrics by the Origin/Destination ID. I saved this table as Patch_Dist_Summary.  Then, I used the Join tool to add the summary information to the Habitat_patch layer.  Note doing this required that the table have an identifier for each distance that relates to the corresponding patch feature in the Habitat_patch layer so that it can associate (or add) the distance metrics appropriately.


Identifying Suitable Dam Habitat

Map 1 shows the results of the habitat selection based on the Suzuki and McComb (1998) HSI criteria and includes 15.8 km of stream length (2.3% of the total stream network).  Also shown are the dam sites that were observed during our pilot field season in 2017. Dam sites occurred across 21 stream segments for a total 2.2km of stream length, all of which fall within the bounds of the HSI criteria. However, as a predictive tool the HSI selection criteria also identified a larger number of stream segments that were observed as unoccupied in our site surveys, meaning that the model tends to ‘over predict habitat’ based on our observed dam sites.  One aspect that stood out, however was that the dams sites tended to occur on segments of suitable habitat that were contiguous to other suitable stream segments and/or separated only by small distances of unsuitable stream segments.

Map 1

Click to enlarge

Patch Length:

Map 2 shows all habitat patches, color coded by length with observed dam sites occurring among the longest patches.  These preliminary findings support the hypothesis that patch geometry may be a relevant factor in the beavers habitat selection for dam building.   It should also be noted, however that there are a few patches that were not surveyed so it is possible that unobserved dams exist in those locations.

Map 2

Click to enlarge

Patch Distances:

Map 3 shows the stream network distances generated from the OD Cost Matrix with lines showing the origin/destination points for all route distances calculated.  (NOTE: while the lines are straight, the distance calculations relate the ‘path distance’ or distance if traveling through the stream network) Here green lines indicate shorter distances and red lines indicating longer distances. Map 4 applies similar color codes to the patches themselves with red indicated a patches relative isolation and green indicting high connectivity to the nearest patch.

Map 3


Click to enlarge

Map 4

Click to enlarge

Distance weighted Length

Lastly, I wanted to consider both length and distance to nearest patch together and so created a final metric with Patch Length/Distance to nearest Patch.  The results are show in Map 5.  Overall there is not a large change either of these

Map 5

Click to enlarge

Critiques Critique of Habitat selection

I do not have a lot of input on this.  It’s a very useful procedure to quickly select out features of a dataset based on a fairly simple process.  The challenge, or course, is that SQL is a language and like any other can be fussy when first learning it’s syntax.

Critique of Habitat patching process

Overall, the dissolve process accomplished most of what I needed but with some notable caveats.  The most significant limitation is that the ‘unsplit lines’ function I used to combine stream layers only works where two line vertices overlaps.  In a handful of instances there were locations in the patches where three vertices came together at the confluence of a higher and lower order stream.  These cases are ignored by unsplit line function and remain ‘undissolved’. I found a work around by later adding a Patch ID to the attribute table by hand for these three segments and then dissolving by that common ID.  However, I’m not excited about applying the same work around where the number of suitable stream segments will increases from 158 in West Fork Cow Creek to over 11, 474 when I expand this procedure the rest of the Umpqua Basin (See Map 6 of suitable habitat segments in the greater Umpqua).

Critique of Distance Estimates

My initial efforts to identify patch distances began with the nndist function in spatstats package of R and the Near tools in ArcMap.  The main limitation for these is that they calculate the Euclidean distance (i.e. straight lines) between points and are not particularly realistic based on what we know about beaver movement in a watershed and their fidelity to streams for protection from predators (i.e. they tend not to waddle up ridges and into the next drainage).  Instead I needed a tool that would calculate the distance it would take a beaver to move from one patch to the next using the stream network as it’s travel corridor.

For my purposes the OD Cost Matrix seems very promising.  The greatest challenge I had with this procedure was relating the distance output layer back to the original patch layer. When deriving a new dataset from analysis of an existing one, Arc carries the original feature ID information forward but re-labels as “OriginID”  In my case, that means the Patch ID were carried forward in the conversion to point data, but when the point data was used in the OD Cost Matrix, and then subsequently Summarized, those Patch IDs were long gone.  As a result, I had to trace back the geneology and create a new Patch ID column in the summary table so that I could Join it with the original patch layer. Again, this is doable in a sub-basin context but when I apply this procedure to the entire Umpqua (Map 6) I’ll have to find a better solution.

Map 6:

Click to enlarge


Dunning, J. B., Danielson, B. J., & Pulliam, H. R. (1992). Ecological Processes That Affect Populations in Complex Landscapes. Oikos, 65(1), 169–175.

Suzuki, N., & McComb, W. C. (1998). Habitat classification models for beaver (Castor canadensis) in the streams of the central Oregon Coast Range. Retrieved from


April 27, 2018

Fitting and Assessing Fit of Lognormal Regression to Phenology Data

Filed under: 2018,Exercise/Tutorial 1 2018 @ 6:30 pm


The purpose of this analysis was to find ways to describe phenological data I gathered in 2017, in order to quantify trends and compare between sites. Phenology was recorded for two interacting species: the introduced cinnabar moth (Tyria jacobaeae), and native perennial herb Senecio triangularis, which is a novel host to the moth larvae in North America. For S. triangularis, individual flowerheads (capitula) were scored into six phenological stages: primorida (P), buds (B), young flower (YF), flowers (F), fruit (FR), and dehisced fruit (D).

The motivating question is: does the flowering phenology differ by site for five surveyed sites? And if so, is the difference apparent at all flowering stages, or only for specific stages?

In order to complete this analysis, I first needed to convert my survey dates to growing degree days, defined as accumulated daily heat gain above 5˚ and below 37˚ C. The development thresholds used here were estimated from previous studies on phenology of alpine flowers (Kudo and Suzuki, 1999). I used the single triangle method with upper threshold (Sevacherian et al., 1977) with 2017 Tmin and Tmax rasters acquired from PRISM. This was accomplished using R code developed by Dr. Tyson Wepprich (personal communication), and provided me with accumulated or growing degree days (hereafter GDD) as a unit of thermal time, against which I plotted the response variable of capitula counted in one of the flowering stages.

I will focus on F stage capitula to describe the workflow that will be expanded to each flowering stage. I hoped to be able to describe an overall behavior for observations of F stage capitula (hereafter called F), producing a predictive curve fit to the data. The goodness of fit of this curve could then be examined for each site separately, to yield an estimate in difference in timing at certain sites.



I used a combination of tools to choose and fit a curve to my observations of F against GDD. First, I used an R package called fitdistrplus to visually compare and also estimate parameters for a variety of curves, including Weibull, gamma and lognormal curves. Then I used the built-in nls (non-linear least squares) function in R with parameter starting estimates determined from output of fitdistrplus.



Initial trials with fitdistrplus showed that the package was only able to fit density plots of one variable. In order to create a density plot that reflected capitula counts on a given degree day, I used package splitstackexchange. I used the expandRows() command to expand rows of the data frame by values in the column contain F counts. In the resulting dataset, each observation of a time (GDD) represented an individual capitula counted at that time, so that a plant with seven F capitula at time t would be represented by seven observations of t. This method was in part based upon methods by Emerson and Murtaugh for the statistical analysis of insect phenology (Murtaugh et al., 2012).

Using the fitdist() command from fitdistrplus and the expanded set of GDD observations for F, I fit gamma, lognormal, and Weibull curves to my data, then calling the denscomp function to visually compare each curve against the density plot of observations. I repeated this method across all floral stages to determine that lognormal curves seemed to best reflect visual trends in my data (Figure 1a & b).

Figure 1a: Observations of F stage capitula against growing degree days

Figure 1b. Density plot produced with fitdistrplus showing observations of F-stage capitula at time x in growing degree days, with fitted Weibull, lognormal & gamma curves.

The lognormal fitdist object created for the steps above also contained estimates of the parameters for the lognormal curves, which I extracted with the summary() call.

Seeking a more flexible lognormal model, I saved these estimated parameters and turned next to the built-in nls() (non-linear least squares) function. This function requires the user to specify a formula and vector of starting estimates for parameters, and then uses least squares methods to find a best fit for the parameters with the provided data. My formula in this case was the lognormal probability density function (pdf):

I created an nls object using this formula, the original (non-expanded) dataset, and the estimates from the fitdist() function for starting estimates of the parameters  and . I was able to obtain an nls object using these methods, but had a scaling issue between the axes – because my x axis was stretched over 900 degree days, and because of inherent properties of the lognormal pdf, the scale of the y axis was off by about three orders of magnitude. To address this, I scaled GDD by dividing the number of degree days by 100, and converted my counts to a per-plant proportion of F. Obtaining new estimates from fitdist() for  and, I found these adjustments indeed yielded an appropriately fitting curve for my data. I used least squares and root mean squared error (RMSE) methods to check the fit of the curve.



With the predict() function and the nls object fit above, I plotted a predictive curve based on my lognormal model against the prepared data, and used a visual check to determine that this result appeared to be a relatively good fit, especially compared to earlier attempts to fit simple polynomials to the data (Figure 2).

Figure 2. Observations of F stage capitula fitted with lognormal (red) and cubic polynomial (blue) curves using non-linear least squares method.

The original goal was to compare observations for each site to this overall predictive curve. I was able to plot site-specific data with the curve for visual inspection (Figure 3a-e).

Figure 3a.

Figure 3b.

Figure 3c.

Figure 3d.

Figure 3e.

Additionally, I calculated mean absolute error (MAE) and root mean squared error (RMSE) to summarize the difference between observed values for each site and predicted values given by the overall lognormal curve. Of the five sites, Juniper had the lowest values for both RMSE  (0.1274 compared to 0.1848 for the data overall) and MAE (0.0712 compared to 0.103 overall); and Waterdog had the highest RMSE and MAE (0.2202 and 0.1269 respectively). A summary of these values is included in Figure 4:

Figure 4. Lognormal curve fit by nls shown against F stage observations by site. Inset shows root mean squared error (RMSE) and mean absolute error (MAE) calculated overall and by site.


There was no one tool I could find that was able to fit a phenomenological model to these data. The use of fitdistrplus in combination with nls was an adequate solution because in the end it did yield an appropriate looking curve. To use fitdistrplus, I had to collapse my explanatory and response variables into one variable whose value represented the explanatory variable and frequency represented the response variable. This was a little overcomplicated for my purposes, but did allow me to visually compare fitted lognormal, gamma & Weibull curves, which was a great benefit.

The nls function, which allowed me to switch back to my initial data format, allowed me to easily extract estimates and predicted values, and plot the curve of predicted values. This was extremely useful for both visual and mathematical assessments of fit. A drawback of the nls function is that without starting parameters estimated by fitdistrplus, it was extremely difficult to fit a lognormal curve; some starting points yielded no tractable model and others yielded models with little context about the goodness of fit.

I am concerned that I could only fit the curve by rescaling the data, which may due to an underlying property of the lognormal pdf: that the area under the curve must be equal to 1. At moments I felt unsure whether I was fitting the data to the curve or the curve to the data.

But in the end, this method yielded a lognormal curve which is able to capture important properties of my count/proportion data, such as rising and falling at appropriate rates around the peak and not dipping into negative values. I will continue to explore, and attempt to verify that the method I used for estimating parameters in fitdistrplus to be used in nls was valid.


Kudo, G., and Suzuki, S. (1999). Flowering phenology of alpine plant communities along a gradient of snowmelt timing.

Murtaugh, P.A., Emerson, S.C., Mcevoy, P.B., and Higgs, K.M. (2012). The Statistical Analysis of Insect Phenology. Environ. Entomol. 41, 355–361.

Wepprich, Tyson. (2018). “PRISM_to_GDD_daterange”, R Code.

Sevacherian, V., Stern, V.M., and Mueller, A.J. (1977). Heat Accumulation for Timing Lygus Control Measures in a Safflower-Cotton Complex13. J. Econ. Entomol. 70, 399–402.


April 26, 2018

Using multivariate statistics to test spatial variability of model performance.

Filed under: Exercise/Tutorial 1 2018 @ 2:42 pm

Question that you asked

Understanding where my model performs well compared to in-situ observations is important for assessing the appropriateness of snow and climate metrics derived from it in comparison to Dall sheep demographic data. For instance, is the model adequately capturing the paths of wind-blown snow that are so critical for winter grazing? Are there arrangements of landscape whose snow evolution is consistently difficult to represent accurately? Answering these questions produces dual benefit; identification of space to improve the model and the possibility to characterise and handle uncertainties in subsequent analyses using model data.


The data I’m using in this analysis are measurements of snow depth taken at a field site in the Wrangell St Elias National Park between the 18thand 24thMarch 2017. These snow depth measurements were taken at relatively high spatial resolution using a GPS enabled Magnaprobe and have been therefore aggregated into 30m by 30m grid cells of mean snow depth, along with other simple statistics (e.g. standard deviation). These grid cells match the output rasters of my spatially distributed snow evolution model enabling comparison of model performance by comparison of observed mean snow depth to modelled snow depth. For simplicity’s sake, I have chosen the middle date of the observation period, the 21stMarch, for the modelled observations. This is done in the knowledge that there was no snowfall or high winds in the entire observation period, both in reality and model space, and that snowpack evolution governed by other processes is relatively minor under such conditions and time-period.  Important to note also is that the model output compared to is the product of >250 iterations of model calibration. The selected model after this process produces the lowest Root Mean Squared Error (RMSE) between model and observed snow depth and water equivalent. This analysis is hence designed to examine the spatial distribution of this error. To do this I have categorised the result from subtracting the mean observed snow depth and the modelled observed snow depth as either being weak or strong, over or under prediction based on whether it is within or over the RMSE, positively or negatively.

The landcover dataset, derived from the National Land Cover Database 2011 Alaska (Homer et al. 2015), alongside a DEM for the model domain provide landscape data.

Fig 1; Spiral of Magnaprobe measurements on Jaeger Mesa

Name of the tool or approach that you used.

To produce landscape metrics I used the following data processing tools in QGIS;

  • GRASS GIS 7 r.slope and r.aspect,
  • Raster Calculator to produce a ‘northerness’ layer – cosine(aspect)

I also used QGIS to aggregate the point snow depth measurements into 30m resolution rasters for mean depth, standard deviation, max/min depth.

To perform the multivariate analysis I used the R package FactoMineR (Lê, S., Josse, J. & Husson, F. 2008). FactoMinerR is a Multivariate Exploratory Data Analysis and Data Mining Tool that enables Multiple Factor Analysis to deal with discrete data such as the nominal landcover data in my analysis. The particular package used is FAMD – Factor Analysis for Mixed Data.

Brief description of steps you followed to complete the analysis.

Data preparation;

For the aggregation of the in-situ measurements I used QGIS tools to create a 30m polygon grid that matched the extent and aligned to my SnowModel domain. I then used the ‘Join Attributes by Location’ tool to intersect the point data to these polygons, selecting mean, min, max, and standard deviation statistics for summary. The Rasterise tool then could be used to convert the polygon layer into ArcInfo Ascii format.

To compare model results vs observations, I first subtracted the model result raster for the 21stMarch 2017 from the mean observed snow depth raster. The resulting raster showing the difference could then be mapped on top of the landscape variables such as elevation, slope etc. (see figs 2 to 3 below) in QGIS.

Fig. 2; Model performance of snow depth mapped against elevation

Fig. 3; Model performance of snow depth mapped against landcover class

Matlab was employed to build the table for Multiple Factor Analysis in R. Using the arcgridread function it was possible to open the arrays of observations, model results and landcover variables, and then build a table holding each sampled cell’s corresponding variables. This table was then converted to .csv format for use in R.

Within R the table was imported as a data frame and the land cover vector converted into factor type.

Data Analysis;

The FAMD package of FactoMineR was used to perform the multivariate analysis, producing tables and plots of the results.

Brief description of results you obtained.

Fig. 4; Screeplot of explained variance by dimension

Table 1; Eigenvalues, percent variance and cumulative variance by dimension

The FAMD analysis allows insight into which dimensions describe the greatest variability, effectively ‘flattening’ the data, reducing a large set of variables to a selection that contains most of the information. In Fig 4 and Table 1 above, we can see that for my data 24 % of the variance is explained by the 1stdimension, 16 % by the 2nddimension. Cumulatively, by the 5thdimension 70% of the variance is described.

Fig. 5; Variable contribution to dimension 1

It is then possible to see which variables most contribute to each variable. Figure 5 shows that elevation and landcover both contribute >20 % of the variance seen in the 1stdimension. Above the 95% significance level, the Observed Mean Snow Depth (obsMeanSNOD) is also seen as a contributor.

Fig. 6; Variable contribution to dimension 2

The second dimension has the greatest contribution from the categorical values of the model performance (diffCategory), and significant contributions by both modelMeanSNOD and landcover, see figure 6.

Fig. 7; Quantitive variables on dimensions 1 and 2 coloured by the cosine of their contribution.

It is possible to then plot both the quantitive, figure 7, and qualitative, figure 8, variables on both the 1stand 2nddimension axis. By doing so some insight into how the variables interact is possible. The high positive standing on the 1stdimension and 2nddimensions by Observed Mean Snow Depth (fig. 7, obsMeanSNOD) is matched by the qualitative variables of coniferous forest and strong under predict. This suggests that the model strongly under predicts when the land cover is coniferous forest and the observed snow depth is high. This is confirmed by the greatest influence of elevation on the 1stdimension, higher elevations in my study domain have lower snow depth and bare or prostrate shrub tundra land cover. Indeed, my model is specifically calibrated to best reproduce snow in these areas as they’re prime Dall Sheep habitat. Looking further at figure 8 it is possible to see that weak underpredict matches the bare land cover class quite well, and similarly weak overpredict lines up nicely with prostrate shrub tundra. On the 2nddimension it is possible to see that erect shrub tundra is in the direction of the Model Mean Snow Depth and strong over-predict. This would suggest that for this land cover the model is producing too much snow.

Fig. 8. Qualitative variables contribution to dimensions 1 and 2

Figures 9 and 10 further describe the patterns explored above by plotting the individual rows, i.e. the pixels where I’ve sampled, on the first 2 dimensions and colouring them by their category for both the categorical description of model performance and land cover. Here however, we see that the inferences aren’t necessarily as clear cut. For example, the erect shrub tundra land cover class does not entirely comprise of strong model overprediction.

Fig. 9; Individuals by model performance category

Fig. 10; Individuals by land cover category

Critique of the method – what was useful, what was not?

This task was useful for quite a number of reasons, not least because it forced me into baby steps use of R. It neatly confirmed my suspicions about where my model is over predicting, high elevations with prostrate shrub tundra, but also identified other areas of poor representation, namely strong under prediction coniferous forest. It too gave me a certain satisfaction that for the majority of pixels in high elevation, low vegetation Dall Sheep terrain, there’s only weak over or under prediction, suggesting my calibration efforts haven’t been without cause.

It is also interesting to observe that the bare land cover class is subject to weak under prediction, whereas the prostrate shrub tundra class tends to be over predicted. This is quite likely due to with how the model treats the roughness of the land cover compared to how rough that land cover is in reality (as well as whether the 30m NLCD land cover classes are actually a good representation of ground conditions). In my study area, the bare patches were more frequently pretty rough scree patches that had the capacity to intercept a fair amount of snow. Conversely the prostrate shrub tundra snow holding height in the model parameters is set at 10 cm, whereas the reality of the land cover is these areas was a patchwork of sparse alpine grasses and sedges, probably not over 5 cm in height. This is useful information that allows me to play with adjusting these parameters.

A surprise is that the other landscape variables, slope, aspect, northerness, have little influence on the variability. I had expected that slope may have had more of an influence given how important it is to wind redistribution of snow

Where I have queries with the method in respect to my particular problem is the selection, type (qualitative vs quantitative) and quantity of the variables included. For instance, I originally left the description of model performance as a continuous variable but found the results not nearly as easily interpretable as when I categorised them. There is also an element of variation within variation that is masked by attempting to flatten the data. Dominant variables take precedence, such as land cover and elevation in my instance, so that the causes of variation within land covers, for example, are hard to see. A further exercise useful to my specific problem would be to just run the analysis on pixels from the bare and prostrate shrub tundra classes and seeing whether landscape variables are greater contributors.



Homer, C.G., Dewitz, J.A., Yang, L., Jin, S., Danielson, P., Xian, G., Coulston, J., Herold, N.D., Wickham, J.D., and Megown, K., 2015, Completion of the 2011 National Land Cover Database for the conterminous United States-Representing a decade of land cover change information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354

Lê, S., Josse, J. & Husson, F. (2008). FactoMineR: An R Package for Multivariate AnalysisJournal of Statistical Software25(1). pp. 1-18.

April 25, 2018

Using a hidden Markov Model tool to identify movement patterns of recreationists

Filed under: 2018,Exercise/Tutorial 1 2018 @ 2:48 pm

Question that you asked

For this exercise, the question that I asked was how to use path segmentation analysis to identify movement patterns of day-user recreationists at String and Leigh Lake in Grand Tetons National Park.

Name of the tool or approach that you used.

To answer this question, Susie and I employed methods derived from the hidden Markov Model (hMM). Hidden Markov Models, a type of path segmentation analysis, are typically used in the field of movement ecology to understand animal behavior. These models relate animal behavior to covariates, i.e. other environmental factors or ‘hidden states’ that may drive and explain animal movement. Often the movement of animals are separated into researcher-defined ‘states’: slow moving states (foraging) or faster moving states (in transit). Understanding the conditions and underlying features that influence the change in movement between these states is at the heart of what makes this analytical tool useful for movement ecologists – and potentially for social scientists.

What makes hMM an appealing tool for our GPS data is the assumptions necessary for this analysis align with the features of our dataset: (1) measurement error is small to negligible, and (2) the sampling units collect data at regular intervals.  The field of recreation science is only beginning to explore more advanced spatial methods to analyze and understand human behavior in outdoor recreation systems. Thus, this exercise is an opportunity to adopt tools and methods employed by other disciplines and test how compatible they are with human movement data.

In July 2017, Michelot, Langrock, & Patterson developed an R package —  ‘moveHMM’ — that provides step-by-step instructions to prep data into appropriate formats for hMM analysis. This package is built with the functions and algorithms necessary for more rigorous statistical testing with multiple variables, but for the purposes of this exercise we used moveHMM package to conduct a univariate analysis of two variables:  step-length and turning angle.

Brief description of steps you followed to complete the analysis.

Step 1: Subset data I chose to use nine tracks that were collected on July 19, 2017.

Step 2: Create unique ID column The moveHMM package requires that each track contain a ID column with a unique numeric value allocated to each individual.

Step 3: Establish temporal parameters To calculate the step lengths and turning angles, we needed to define the temporal parameter for each step segment (i.e. every minute, every five minutes, every 6 hours, etc). The original dataframe had a temporal resolution of 10 seconds. However, that level of precision was not necessary for this analysis as we assumed human movement did not vary significantly every 10 seconds. Thus, we aggregated the points to 1-minute intervals. To do this I used the dplyr package in R to average the coordinates to the minute level for each individual.

Step 4: Choose what type of coordinates to use for analysis The moveHMM package allows the user to use either Latitude/Longitude coordinates or UTM coordinates. My original data contained both. I choose to use UTM coordinates rather than Lat/Long. This is because the output for UTM coordinates is represented in meters which is a little easier to interpret and understand than the Latitude and Longitude units.

Step 5: Calculate step-length and turning angle My prepped dataframe is ready to go. It contains: unique ID number, timestamp (at the minute-level), and x & y projection values.

Move HMM has an easy code that quickly calculates the step-length and turning angle. With this code I created a new dataframe that includes these two additional variables.

Step 6: Plot the data The moveHMM package offers a code that plots histograms of the frequency and distribution of step lengths and turning angles for each individual (see results section).

Step 7: Create additional graphs that represent frequency of step-length and angle in % We noticed that the frequency graphs generated in the moveHMM output represented counts on the y-axis rather than percentages. Because the y-axis wasn’t standardized, it made it difficult to compare step-length and turning angle across the nine individuals. Thus, we created two additional histograms that represented frequency as a percent on the y-axis. Note: the histogram of the turning angles is represented as a rose diagram.

Step 8: Plot the tracks on a graph for a visual bonus The final thing we did was plot each GPS track onto a graph so the viewer can have additional context when interpreting the results.

Brief description of results you obtained.

We were able to successfully calculate the turning angles and step-lengths of each individual at one-minute intervals. We also produced graphs of the frequency and distribution of these values, allowing us to make comparisons between individuals and identify common patterns and trends. See images and pdf links below.

I was surprised to see the variation in movement across the nine individuals. Given that 8 of the 9 individuals were hiking on designated trails, I assumed the behavior would be relatively homogeneous, particularly the turning angles. However, the figures indicated that people who spent a relatively short amount of time in the area (<30 minutes) tended to have shorter step lengths and more variability in turning angles.  This could be explained by the environmental features of the areas near the parking lot. Along the southeastern shoreline of String Lake there are locations that are denuded of vegetation and offer ideal conditions for ‘beaching’, sight-seeing, and/or picnicking. Therefore, people who recreate in these areas may have shorter step lengths and greater variation in turning angles than those people who choose to venture further along the trail.

Figure 1: Example of a water-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

Figure 2: Example of a land-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

Figure 3: Example of another land-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

To see all of the output, click on the pdf links below





Critique of the method – what was useful, what was not?

What was useful:

The moveHMM package was easy to understand and provided an example dataset that guided me throughout the analysis. Once I was able to get my data into the proper format, the moveHMM package took care of the calculations and created the initial figures. Additionally, the vignette included a brief overview of the theory and defined the concepts of the hidden Markov Model.

Some criticisms:

The figures that were generated in the moveHMM package were helpful in representing the distribution and values of each person’s step length and turning angle. However, the y-axis of the histograms represented frequency as a count rather than a percentage. Therefore, the y-axis varied for each individual depending on the number of step segments each user generated while carrying the GPS unit. This variation made it difficult to make comparisons between individuals. Therefore, we chose to create supplemental histograms that represented frequency in percent rather than tallies.


Click here to view the moveHMM documentation

Edelhoff, H., Signer, J., & Balkenhol, N. (2016). Path segmentation for beginners: An overview of current methods for detecting changes in animal movement patterns. Movement Ecology, 4(1).

Michelot, T., Langrock, R., & Patterson, T. (2016). moveHMM An R package for the analysis of animal movement data, 1–24.

Patterson, T. A., Basson, M., Bravington, M. V., & Gunn, J. S. (2009). Classifying movement behaviour in relation to environmental conditions using hidden Markov models. Journal of Animal Ecology, 78(6), 1113–1123.



April 24, 2018

Spatial distribution of behavior changes

Filed under: Exercise/Tutorial 1 2018 @ 9:00 pm


Since writing ‘My Spatial Problem’, a behavioral change point analysis tool was implemented on the swim paths of juvenile salmon as they encountered boom angles of 20, 30, and 40 degrees. Boiling entire swim paths down to one behavioral change point greatly simplifies the analysis, by investigating the hydraulic thresholds that may or may not incite behavioral changes. A behavioral change is identified by a change in swim velocity, swim direction, or both. However, for the purposes of this study, the type of behavioral change is important. Do fish that pass the boom do so at one hydraulic threshold or location, while fish that are halted do so at another? The question asked in this investigation is: do ‘passing’ and ‘halting’ behaviors occur at different locations in space at any boom angle, or between boom angles?

Method and steps for analysis:

Python was used to visualize behavior changes in two ways: kernel density estimation and multidimensional confidence intervals. First, behavior changes were classified as either ‘passing’ or ‘halting’. ‘Passing’ behavior precedes downstream movement. ‘Halting’ behavior precedes upstream or pausing movement. The results of these classifications for all 20 degree trials are shown in Figure 1.

Figure 1. Behavioral changes of all fish during 20 degree trials. Red markers indicate halting behavior, that which precedes upstream movement or pausing of downstream movement. Blue markers indicate passing behavior.

Second, to determine if the spatial distributions of passing and halting behavior overlap, kernel density estimates were calculated using the scipy.stats function, Gaussian_kde. Kernel density estimates estimate a variable’s probability density function, two of which are shown in Figure 2. However, this method presents two shortfalls for the purposes of this investigation: 1) it fails to provide direct estimates of confidence, so its statistical power is low, and 2) the overlap between kernel density estimates, of which there is plenty in these data, closely resembles a bruise.


Figure 2. Kernel density estimates of passing (blue) and halting (red) behaviors over all 30 degree trials. Overlap is shown in purple.

A direct method of determining spatial independence in two dimensions is with multidimensional confidence intervals. A slew of boot-legged functions for calculating and plotting confidence ellipses are available on Python help pages like GitHub and StackOverflow. StackOverflow user ‘Rodrigo’ provides helpful code which was modified to create the confidence intervals in Figure 3.


Figure 3. Multidimensional, 95% confidence intervals for the spatial distributions of halting and passing fish behaviors during trials at 40 degrees.


Because the 95% multidimensional confidence intervals between passing and halting fish behaviors show substantial amounts of overlap at all boom angles, no evidence exists to suggest that passing and halting behavioral changes occur at different locations in the channel during trials. Furthermore, the behavioral changes between 20, 30, and 40 degrees show no significant differences in spatial distribution from one another (Figure 4). This finding holds promise for future analyses: if a threshold exists in some hydraulic variable (turbulence, water speed, etc.) for inciting behavioral changes (either halting or passing), it likely exists where a threshold appears in Figure 4, when a fish has passed between 0% and 25% of the floating guidance structure.

Figure 4. Behavioral changes and their 95% confidence intervals at 20, 30, and 40 degrees imply that the hydraulic signature of a floating guidance structure at a consistent fraction of its length (between 0 and 0.25) incites a reaction from juvenile fish.

Critique of methods:

Kernel density estimates are useful for visualizing the density of entities that are spatially independent of one another. However, the statistical significance of any overlap is unclear and difficult to present with small numbers of observations, which blur densities. Multidimensional confidence intervals, on the other hand, show clear estimates of confidence and overlap in this dataset.

Among-site differences in giant kelp (Macrocystis pyrifera) temporal autocorrelation

Filed under: 2018,Exercise/Tutorial 1 2018 @ 5:15 pm


As a preliminary analysis to explore pattern before incorporating other variables, I investigated how giant kelp (Macrocystis pyrifera) autocorrelation varied through time. By way of refresh, these abundance data are from seven subtidal sites, with each site comprised of five 10x2m2 transects. All 35 transects were sampled biannually (every 6 months) since 1980. Based on personal observations, I suspect that the physical substrate underlying these sites to be highly variable, both within- and among-sites (as is the associated subtidal community structure), such that averaging these transects up to the site level would gloss over pattern that might otherwise provide insight into spatiotemporal dynamics. Specifically, for this exercise I asked:

  • Are patterns of M. pyrifera temporal autocorrelation similar within-sites (among-transects)?
  • Are patterns of M. pyrifera temporal autocorrelation similar among-sites?


I used the base autocorrelation function in R studio (acf) and ggplot2 to visualize these data, and excel to create a spreadsheet tallying instances outside of the confidence interval.


  • Existing code was used to structure my data into data frames for the relevant sites and transects (dplyr)
  • Use the base R acf function to calculate lagged temporal autocorrelation coefficients for each of the 35 transects.
  • Create new data frame of correlation coefficients, and use those data to group and visualize with ggplot2 the five transects comprising each site.
  • Use correlation coefficient data to create spreadsheet tallying the instances of positive autocorrelation before the ‘first drop’ below the noise confidence interval, and tally subsequent peaks or dips above or below the confidence interval.
  • Examine within- and among-site patterns


While I did not use a statistical test to evaluate my spreadsheet values, a visual examination provides insight into the differences both within and among sites. To address my first question (within-site, or among-transect variation), I do see differences in the temporal scale of correlation, though it is unclear how significant or meaningful these differences are.

The patterns among-site are more apparent, with certain sites exhibiting either no instances of positive autocorrelation, or a single point before dropping into the confidence interval (e.g., West Dutch, East Dutch). This indicates rapid shifts in M. pyrifera abundance at the 6 month interval. These same two sites also exhibited the longest lagged temporal scale of positive correlation (e.g., at the 15, 16, and 18 lagged scale, or 8-9 years).

Other sites exhibited longer periods of positive autocorrelation before the ‘first drop’ (i.e., West End Urchin, West End Kelp, and NavFac), with one site—Sandy Cove—almost uniformly exhibiting positive autocorrelation out to 7 (3.5 years). However, results from this site must be cautioned by the dramatic shifts in M. pyrifera over time, and thus non-stationarity is almost guaranteed. That being said, Sandy Cove also exhibited long periods of negative autocorrelation, often into the ‘uninterpretable range’ (i.e., past approximately 1/3rd of the total temporal scale).

Critique of the method

I did find it useful to calculate and view the temporal autocorrelation coefficients for M. pyrifera. These results support my ‘sense’ of the system that there is substantial variation among-sites (despite all these sites being within 10km of one another). While differences were found within-site, it is unclear how significant or substantive the within-site variation is. While autocorrelation obviously cannot shed light on causal factors, mechanisms, or even associations (without other variables included) underlying these patterns of correlation coefficients, I do believe they provide grounds to proceed with an investigation into the associations between physical substrate complexity and M. pyrifera density through time.

Day_acf Day ED_acf ED WD_acf WD WEK_acf WEK WEU_acf WEU SC_acf SC NF_acf NF

Table 1: temporal lag of steps before dropping into the confidence interval, and any subsequent departures from the confidence interval. The orange box are sites exposed to high storm surge, and the blue box are sites relatively sheltered from large wave events. The column on the far right uses color to qualitativley depict the relative degree of physical substrate complexity at each site, with red sites exhibiting large pinnacle structures (i.e., high-relief), and green sites almost uniformly flat (i.e., low-relief).

Fig. 1: 2m bathymetry for Sandy Cove, Dutch Harbor, and Dayona (L-R), with red depicting vertical slope, and green depicting flat (i.e., no slope)

April 22, 2018

Mapping fire extent from binary point data

Filed under: Exercise/Tutorial 1 2018 @ 11:25 am

(Please click the links to view all figures.  They aren’t very clear or large in the post!)

Question that you asked?

My overall objective is to build a predictive model of the annual spatial distribution of fire across my fire reconstruction area.  This will inform how fire extent and distribution were related to climate, topography, and fuels. Prior to doing this I need to answer these questions:

  1. What is the best method for mapping fire boundaries from binary point data?
  2. What do these maps indicate about spatial patterns of historical fires?

Dataset – I reconstructed fire history at 31 sampling points evenly distributed on a 5km grid, and 21 points that encircled landscaped patches of lodgepole pine forest (Figure 1). Point samples were denser near lodgepole pine forests because they may limit fire spread due to slow fuel recovery. At each sampling point I collected 3-6 cross sections from fire-scarred trees (194 trees total). All cross sections were precisely dated, and 1,969 fire scars were assigned to the calendar year of their formation. At each sample point individual tree records were composited into a record of fire occurrence at the sample point. Pyrogeographers composite fire records at points because most trees that survive fire do not form and preserve fire scars even when directly in the path of fire, and recorder trees record fire events over different time periods. Obtaining a full census of fire events at a sample point (e.g. figure 2) requires sampling multiple recorder trees within a defined search radius (250 meters in my study). I eliminated 89 scars that occurred on only one tree or could have been attributed to lightning, mechanical, or insect damage.

Figure 1 Point samples and recorder trees across the study area

Figure 2. Top panel –  individual recorder trees at a sample point, Bottom panel – composite records at each sample point. Vertical slashes on timelines indicate fire events.


Name of the tool or approach that you used?

Question 1 – I compared three different tools for mapping fire extent from binary point data. This has also been done by Hessl et al. 2008, but for a different study landscape using a smaller sampling grain over smaller areas.

  1. Thiessen polygons are polygons whose boundaries define the area that is closet to a sample point. Using thiessen polygons to map fires assumes the best evidence of fire or no fire at an unsampled point is the record at the nearest sampled point. Thus all unsampled areas are assigned the record of the nearest sampled point.
  2. Kriging is an optimal interpolation method that uses a linear combination of weights at know points to estimate the value at an unknown point (Burrough and McDonnel 1998).
  3. Inverse Distance Weighting (IDW) is a deterministic interpolation method that calculates the value of an unsampled cell as a distance-weighted average of sample points within a specified neighbor (Burrough and McDonnel 1998).

Question 2 – To assess spatial variation in mapped fires and the sum of all fires that occurred I used three approaches

  1. I created an animation of fires from 1700 – 1925 using the Time Slider tool in arc map to visualize each fire event using the thiessen polygon mapping method
  2. I mapped fire history statistics (e.g. mean, maximum, interval, CV of interval, fire size) at each point using the Thiessen Polygon and Kriging methods
  3. I performed cluster analysis on the occurrence of fire from 1700-1918, and then mapped the fire groups.


Brief description of steps you followed to complete the analysis?

All mapping approaches were testing in ArcMap.

Question 1

1a. I used the create thiessen polygons tool with my sample points as the input. Make sure to click environments to specify the processing extent or you will likely get unintended results! Outer polygons extend infinitely, and need to be clipped. I clipped them using my study area boundary (2.5km from any sampled point).

To map fires for each year, I used R to subset my fire year data for each tree to a composite at each sample point.  In Arc map, I joined this data to my sample points and thiessen polygons surrounding each sample point

1b and c. Prior to Kriging and IDW I created a binary matrix of fire occurrence for each fire year at each plot. Rows were plots and columns were fire years (52 rows by 132 columns). For both Kriging and IDW the input features were my sample points, and the value field was the binary presence/absence of the fire year I wanted to map.

I initially used the default settings for Kriging and IDW. After comparing results with the Thiessen polygon method I adjusted the importance of near versus far points by adjusting the search radius. For IDW you can also adjust the power parameter. As power increases IDW approximates the Thiessen polygon method where the interpolated value takes on the value of the nearest known point.

Kriging settings: method = ordinary semivariogram, model = spherical, search radius variable, number of points = 12.

IDW settings: power = 2, search radius variable, number of points = 12.

Question 2

2a. ArcMap has a convenient time slider that allows you to move through time to visual temporal variation in spatial data. I simply followed a tutorial to use the tool.

Make sure that you store all shapefiles that depend on an animation in a file geodatabase or the animation and Time Slider will not function correctly. The nice feature about enabling time is that any joins you make are dynamic and info tables update with each time step, but only if you store files in a geodatabase!

2b. I calculated summary statistics the fire record at each plot from 1700-1918 using the ddply package in R. This helpful package allows you to apply a set of functions to a group identifier (plot) within a data frame.  I’m happy to share the code if this is useful for someone in the class. After the summary table was created I joined this to my shapefile of sample points in Arc map to map variation in summary statistics. I used Thiessen polygons and Kriging to spatially represent variation in the statistics.

2c. Taylor and Skinner (2003) used cluster analysis to identify and map spatial patterns of fire occurrence. Similarly, I created a binary matrix of fire occurrence where rows were sample points and columns were fire years.  Cluster analysis was performed in PC-ORD using a Sorensen distance measure with a flexible beta method of β = 0.25. The resulting dendrogram was pruned by examining stem length and branching distribution to identify nodes that maximized both within-group homogeneity and between-group differences, while minimizing the number of groups (McCune and Grace 2002).


Brief description of the results you obtained? (I went overboard on what I included, but its all useful at this stage)

I obtained maps of all fire events using Thiessen Polygons, and maps of selected fires using Kriging and IDW for comparing the methods. In these maps red indicates area burned and blue indicates unburned.  The gradient between indicates uncertainty for Kriging and IDW. Trees that recording fire are represented by black points and trees that did not are represented by white points.

1918 map of fire extent comparing interpolation methods

1829 map of fire extent comparing interpolation methods

A movie of fire events based on Thiessen Polygon Mapping this file is too large to attach 🙁

FireMetrics mapped using thiessen polygons

FireMetrics_Kriging Metrics mapped using Kriging

FireGroups Fire occurrence groups identified through cluster analysis

I was also able to calculate and graph Fire Extent by year using the Theissen polygon method


Critique of the method- what was useful, what was not?

 Fire mapping techniques – The advantage of the Kriging and IDW techniques are that multiple data points are used to interpolate fires, whereas Thiessen polygons are informed only by the nearest point. Additionally Kriging and IDW are able to represent uncertainty of fire perimeters while Thiessen polygons produce abrupt fire boundaries that are an artifact of the sample distribution.

Kriging produced the most seemingly realistic and attractive fire maps for many fires (e.g. 1918).  However, Kriging poorly represented several large fires with irregular un-burned patches (e.g. 1829). Kriging requires the spatial variation in the variable represented can be similarly observed at all locations (requires spatial homogeneity), and performs best with uniform sampling density. Irregular unburned patches occur in several of the large fires that occurred in my study landscape. Logically they occur where fire burned and removed fuel in the years preceding the fire of interest.  For example, the large unburned area in 1829 on the East side of the study landscape burned in 1827. In combination, irregular burn probabilities and non-uniform sampling limit the utility of Kriging to consistently represent fire perimeters for my data and study landscape.

IDW was not similarly limited by irregular burn patterns. However, IDW creates a bullseye pattern of high to low burn probability where sample points are isolated or are on burn perimeters.  This imputes lower burn probability in the area between the somewhat isolated point and the main mass of the fire.  When all points within a large area recorded fire, IDW imputes a higher burn probability to the unsampled area at the center of the sampled points that actually recorded fire. In reality we know the sampled points burned, and they should not have lower burn probability than the unsampled points. IDW’s representation of fire can be improved by decreasing the search neighborhood or increasing the power function. However, this approaches the Thiessen polygon interpolation technique (see 1829 map).

Both Kriging and IDW are time intensive and would require a different and subjective threshold to be applied to each fire map to delineate burned and unburned area. The Thiessen polygon method ultimately provides the most efficient, objective, and parsimonious method to map fire perimeters based on the distribution of my sample points. After watching the animation of fire events mapped with Thiessen polygons the most important pattern that appeared was a consistent lag between fires that prevented reburn within short (<5 year) time periods.  Thus, fuel recovery after fire may constrain fire extent, and time since fire may be an important predictor of the annual spatial distribution of fire in the landscape. Kriging and IDW assume higher likelihood of fire at points where no fire was recorded that are surrounded by points that did record fire.  This provides another rationale for using the binary Thiessen interpolation method.

In making my choice to use the Theissen polygon method I also considered that area burned and fire metrics were highly and significantly recorded across all interpolation techniques (Hessl et al. 2007). Furthermore, Thiessen polygons accurately represented burn perimeters and fire frequency through a validation using known modern day fire perimeters (Farris et al. 2010).

Mapping Fire Metrics

I preferred the Kriging method to identify regions of the landscape with distinct fire regime metrics.  The Kriging method incorporates more than just the sample point allowing regions with higher or lower values for a metric to be clearly represented. The Southeast region of the study area burned with lower frequency, longer maximum intervals, and higher variability. This area has a high concentration of low productivity and relatively fuel limited lodgepole pine forest.

Identifying and Mapping Fire Types

Cluster analysis of fire years appears to be a promising technique for identifying regions with a similar history. The FireGroups were geographically clustered in the study landscape.  It may be possible to use these fire types to identify landscape features that constrain fire. This map suggests that the similarity or spatial auto correlation of fire history varies depending on position within the landscape.  This suggests a non uniform distribution of landscape features that constrain fire.


See this link for more about Kriging

Burrough P and McDonnel R. 1998. Principles of geographical information systems. Oxford: Oxford University Press.
Dieterich JH. 1980. The composite fire interval: a tool for more accurate interpretation of fire history. USFS GTR- RM-81.
Farris CA, Baisan CH, Falk DA, Yool SR, Swetnam TW (2010) Spatial and temporal corroboration of a fire-scar based fire history in a frequently burned ponderosa pine forest. Ecological Applications 20(6):1598-1614.
Hessl A, Miller J, Kernan J, Keenum D, McKenzie D. 2007. Mapping Paleo-Fire Boundaries form Binary Point Data: Comparing Interpolation Methods. The Professional Geographer 59:1, 87-104.
Taylor AH, and CN Skinner (2003) Spatial Patterns and Controls on Historical Fire Regimes and Forest Structure in the Klamath Mountains. Ecological Applications 13(3):704-719.
Farris CA, Baisan CH, Falk DA, Yool SR, Swetnam TW (2010) Spatial and temporal corroboration of a fire-scar based fire history in a frequently burned ponderosa pine forest. Ecological Applications 20(6):1598-1614.

April 20, 2018

Temporal Cross Correlation Between Bull Kelp Patches in Oregon

Filed under: 2018,Exercise/Tutorial 1 2018 @ 7:32 am

One of the ultimate questions of my work is comparing what factors drive bull kelp populations in northern California versus in Oregon. With this exercise I wanted to examine whether patches exhibited temporal synchrony from year to year. If they are in sync, this may suggest that the two populations are being driven by large-scale, coastwide factors (e.g. ENSO phase, wind strength). If not, this gives evidence that the populations are being influenced more so by local factors than shared, coast-wide factors. One way to look at temporal synchrony of two areas is via cross correlation.

To conduct cross correlation, I used both the randomly generated data and my actual data. The randomly generated data should be representative of the null hypothesis that there is no synchrony between patches. I can then compare this to what patterns of cross correlation I get for my actual data, to see if it differs measurably or conforms to a similar pattern as my null hypothesis.

To interpret cross correlation you first need to look the autocorrelation for each variable. I used the acf function in the ‘ncf’ package in R to conduct autocorrelation on the time-series for two patches, Orford Reef and Rogue Reef (See Figure 1). With the randomly generated data, neither of the patches shows any kind of significant autocorrelation.

Figure 1: Autocorrelation of maximum annual kelp coverage for Orford Reef (left) and Rogue Reef (right) with randomly generated data. Lag is in terms of years.

I also ran autocorrelation on two kelp patches from my real data. I would expect to see some kind of autocorrelation in real populations. It makes intrinsic sense that with a real population, the size of the population now should influence how many there are one step into the future. However, the patch at Orford Reef had no autocorrelation other than at time=0 (see Figure 2). The patch at Rogue Reef was somewhat auto-correlated at a time lag of a year, but otherwise had no significant autocorrelation. This indicates that the size of the kelp canopy one year tells us very little about what it will look like in the future, although at Rogue Reef, canopy size in the current year will have some positive correlation with the canopy size next year.

Figure 2: Autocorrelation of maximum annual kelp coverage for Orford Reef (left) and Rogue Reef (right). Lag time is in years.

Once I understood what autocorrelation looked like for each of these reefs, I then moved on to looking at cross correlation between them. I did this using the ccf function in the R package ‘ncf’. For the randomly generated data, I expected no cross correlation, and this is what I saw (see Figure 3). The ccf results for the random data of Rogue Reef and Orford Reef did not go beyond the confidence envelope (except for one small point at a 10 year leading lag). While there is this small bit of cross correlation, since the data is randomly generated we can assume this is coincidental.


Figure 3: Cross correlation between maximum annual kelp cover at Rogue Reef and Orford Reef over a 35 year time series. The lag is in years.

With the real data, I would have expected some kind of cross-correlation between the reefs. The two are less than 20 miles apart and should be influenced by similar oceanographic conditions. However, the ccf graph for the real data was very similar to that of the random data. Other than a small amount of correlation at 0 years and -18 years, these two reefs were essentially uncorrelated.

Figure 4: Cross correlation between annual maximum kelp cover at Rogue Reef and Orford Reef. Lag is in years.

Overall, these results were surprising. For one, I was expecting some level of autocorrelation for the reefs. However, because bull kelp is an annual species, it is possible that each year the population resets and recruitment next spring is determined by density independent factors. Furthermore, since bull kelp have a bipartite life cycle, alternating between gametophytes and sporophytes, its possible that the transition over the winter from spores to gametophytes to gametes to baby kelp next spring may further decouple maximum canopy size in the fall from population size the next year.

I was also somewhat surprised that there was no cross correlation between the two patches. The lack of cross correlation suggests that two reefs are not correlated on an annual or multi-annual scale. Therefore, despite the fact that the patches are within 20 miles of one another, apparently there is enough local variation in the factors controlling population size to create substantially different sizes and patterns between the two.

I found this technique to be very useful. If I am interpreting the results correctly, then this technique is already helping me uncover some surprising results. One caveat about these results is that my data may not be stationary. However, there may be some reasons that this test is not the most appropriate for my data. Cross-correlation assumes stationarity, and given the intense inter-annual variability, it is not clear whether there are any long term changes in the mean of the population. Therefore, my data may not fulfill this assumption. I would welcome any feedback on how to better assess stationarity in my time series.


April 18, 2018

Visualizing a recreationists movement through space and time

Filed under: 2018,Exercise/Tutorial 1 2018 @ 1:30 pm

Question that you asked

The question I asked for this exercise was how to plot and visually represent recreationists’ movements through time and space.  My goal was to plot five individual GPS tracks onto a graph with X & Y coordinates (Long & Lat), and a Z-coordinate representing time. To visualize and identify patterns within and between each individual’s movements through time, I aimed to represent change in time using color gradients. In order to make comparisons between the five tracks, I also sought to standardize the color ramp so that each GPS track was color coded using the same timestamp parameters.

For this exercise, I used five GPS tracks that were collected on July 19, 2017 at String & Leigh Lake in Grand Teton National Park.

Name of the tool or approach that you used.

Although the goal for this exercise was seemingly simple (visualizing a person’s movement through space and time) the biggest challenge for me was learning how to produce this result using R. I am still familiarizing myself to this software, so the learning curve was steep throughout all phases of the process.

The packages I used in R were:

ggmap – to load in a satellite image of the location where the GPS tracks were collected

ggplot – to plot the tracks onto a graph

leaflet – to create an interactive plot of the GPS tracks


Brief description of steps you followed to complete the analysis.

Step 1: Put data into correct format. The first thing I needed to do was manipulate the data into an appropriate format for analysis.  I converted the shapefiles into a csv format in R.

Step 2: Subset data The dataframe I was working in contained 652 GPS tracks that spanned from July 15 – September 8. Thus, I needed to take a subset of those tracks to simplify the initial analysis and enable me to more easily test out my code. I chose to subset five tracks that were collected on July 19, 2017.

Step 3: Set up temporal color ramp parameters and palette In order for me to make temporal comparisons between each of the GPS tracks, I needed to standardize the color ramp so each GPS point was given the same color code depending on their date stamp. To do this, I calculated the maximum and minimum time value for all five tracks combined.  These max/min values gave me the temporal parameters for the color ramp.

After I defined the values for the color ramp, I created a color palette that could be applied to each of the tracks. I coded and defined this palette in R.

Step 4: Pull in a satellite map image layer for the graph To provide more context and meaning to the plotted GPS tracks, I thought it would be helpful to add a visual of the location where the tracks were collected. To do this, I downloaded the ‘ggmap’ package in R which allowed me to plug in the coordinates of the GPS tracks and extract a map image layer from Google Earth.

Step 5: Plot the GPS tracks onto the graph I used the ‘ggplot2’ package to plot the GPS points onto the XY coordinate graph. I applied the satellite imagery as a background layer and the color palette that I defined in Step 3. I produced five graphs (see results section).

Bonus approach: After successfully plotting the GPS tracks, I was curious to find out if there were more interactive ways to visualize a recreationists’ movement through time and space. I discovered the ‘leaflet’ package that allows a person to plot data (from both raster and vector data structures) and then zoom in and out, click on features, and essentially interact more dynamically with the results. I was eager to explore this package using my GPS data. After familiarizing myself with the syntax and objects for this package, I was able to successfully plot the GPS points and color code the tracks based on their timestamp (see results sections).


Brief description of results you obtained.

The results I obtained were five graphs that represented the movement of five recreationists through time at String and Leigh Lake in Grand Teton National Park. These results successfully demonstrate the first step necessary for most data analysis: visualizing the data to identify patterns and inform the next steps. Among the five tracks I selected, there was variability in the amount time spent in the area, the locations traveled, and the time for recreation. I intentionally selected one track that represented a water-based recreation user to visualize how their movement compared to those on land.

I was also successful in generating plots that allowed me to zoom-in and out of the points. By doing this, I was able to recognize more detail in the recreation user’s movement. Further, this result allowed me to clearly see where the GPS points were stacked up or close together, i.e. stationary, or slowed movement; or where the person was moving at a higher speed, i.e. points more spread out, evenly spaced.  I was unable to embed the html document into WordPress but am currently working on a method to display these plots. For now, here are a few screenshots that hopefully illustrates the dynamism of this type of plot.

From these results I intend to dive into the next phase of analysis to better understand the behaviors and underlying processes that cause and influence human movement in recreation settings. These processes could be temporal (are there certain times of day or times in the season that influence the movement of people?), environmental (how does the landscape influence the behavior and movement of a person?), social (does group size influence how people move and recreate?) etc. Before doing that, my goal is to analyze some behavioral characteristics within each person’s track. These characteristics include the individual’s step length at one-minute time intervals, and the person’s turning angles at one-minute time intervals. Stay tuned for that tutorial..

Critique of the method – what was useful, what was not?

What was useful: Once I learned the code and syntax to build these graphical representations of a human’s movement through time, I was very pleased with the results. The ggplot package has great functionality and allows me to represent the data in a number of ways. I also appreciated the ability to add a basemap to the graph using ggmap. Because I was new to the software and syntax, the bulk of the work involved building the code.

If people are interested in understanding how I built my code, I generated a text file that annotates the code and steps I took. Click here for the Exercise1_Code

Some criticisms:

a. Limitations in the zoom function for the satellite image —  Adding the satellite image to the graph provided additional meaning to the results. However, the zoom feature was clunky. For example, a zoom of a value of 5 compared to a zoom of a value of 6 dramatically changed the scale of the image. I’m sure there are ways to work around this, but I wasn’t able to make it work for this exercise.

b. Interpreting time as an integer – In order to denote color to the gps points based on a date stamp, I needed to convert the value of the date stamp to an integer in R. For a non-academic viewing the graphs, he/she/they may be able to understand overall changes in movement through time by simply looking at the variation in color, but they wouldn’t be able to interpret the integer values on the color ramp legend. In other words, by changing the date stamp to an integer, it’s hard to conceptually place the data points within a real point in time. I chose to work around this issue by including a start time and end time to each graph, but this was a cumbersome, crude approach.

c.The interactive leaflet map only works with an internet connection. Also, this course’s WordPress site won’t let me embed the image into the blog. The leaflet map was a neat, bonus approach for visualizing the data. However, it would only be useful in situations where there is an internet connection and computer monitor. Therefore, this tool is best used for power-point presentations, websites, or web-tutorials. I was frustrated that I couldn’t embed the html file into the Word Press blog, so you will have to simply view a screenshot of the plots. I believe this inability to embed the html is a function of the settings for the OSU blogs.

April 13, 2018

Autocorrelation of GPP and LUE at C3 vs C4 grassland sites

Filed under: 2018,Exercise/Tutorial 1 2018 @ 12:36 pm

Question Asked:

The question I asked in this exercise was: What degree and timing of temporal autocorrelation exists for LUE and GPP derived from eddy covariance flux data, and how does the degree and timing differ between a C3 and a C4 grassland site?

I aim to use an autocorrelational analysis to assess temporal patterns in these production indices, and to assess whether they reflect distinct seasonality between the C3 and C4 site. I.e., if the production indices show distinct autocorrelational patterns, that may indicate that the production of the C3 and C4 sites are responding to distinct environmental drivers. Because this is just an autocorrelational analysis and not a cross-correlational analysis, I am not quantifying the relationships between environmental drivers in this exercise.

Tool or approach used:

I used the R function stats::acf() to compute an autocorrelation function for my time series of production indices.

Steps followed to complete the analysis:

My data are derived from eddy covariance flux tower measurements, which are taken every 30 minutes. My data are from two grasslands—Konza Prairie Biological Station (US-Kon) is 99% C4 grass, and the University of Kansas Field Station is 75% C3 grass. I obtained gap-filled, 30-minute resolution data from site PIs, for 2008-2015.

From the 30-minute resolution data, I calculated GPP and LUE for time units of days, months, and years. I first sum and convert the EC flux measurements of GPP to be in units of gC/m^2/day. LUE is calculated as the total daily GPP divided by the total daily photosynthetic photon flux density (PPFD) and is in units of gC/MJ/m^2/day. I then smooth the dataset using a 7-day rolling mean, and filter it to remove extreme values that are artifacts of pre-processing or instrument error.

For the monthly time interval, I calculate the monthly average of the daily values of GPP and LUE. I do this by grouping the data by month, and then calculating the mean of all values for that month.

For the annual time interval, I calculate the total annual GPP, and the annual average of daily LUE. I sum the daily values of GPP and PPFD, and then divide the two to obtain annual LUE. This yields GPP in gC/m^2/year and LUE in gC/MJ/m^2/day.

However, the quality of these data vary widely, due to environmental inconsistency and equipment error. Thus, even though the instruments are ostensibly recording data every 30 minutes, after post-processing and gap-filling, I still end up with entire days– and sometimes months of data missing.

For each dataset– daily, monthly, and annual GPP and LUE– I used the R stats::acf() function to compute estimates of the autocorrelation function.


Fig. 1: A plot of autocorrelation versus lag, for daily total GPP and LUE, for each site. Dashed lines represent upper and lower 95% confidence intervals, and vertical lines represent lags of 365 and 730 days, to approximate 1- and 2-year lags.

For the daily data, the autocorrelation shows distinct autocorrelational patterns between each of the sites. These plots appear to reflect distinct seasonality of production between the C3 and the C4 sites.

However, I’m not certain whether the different timing of autocorrelation may reflect data that are missing due to filtering– I excluded certain values that were outside thresholds I had set, rather than setting values to 0 or NA. If values are missing for dates, then that would affect the autocorrelation.


Fig.2: Plot of autocorrelation vs. lag for the monthly average of daily GPP and LUE. Dashed lines represent 95% confidence intervals for the autocorrelation function. Vertical lines represent lags of 12, 24, and 36 months.

For the monthly data, and in contrast to the daily data, this plot demonstrates autocorrelation of production indices that is very similar between the two sites.


Fig. 3: Plot of autocorrelation of autocorrelation vs. lag of GPP. Dashed lines represent 95% confidence intervals for the autocorrelation function, colored for each site.

Similar to the plots of autocorrelation of the monthly average of daily GPP and LUE, the autocorrelation of the total annual GPP and average annual LUE show similar patterns between the two sites.

The daily, monthly, and annual datasets suggest alternate conclusions about the temporal autocorrelation of the data. Because the data are filtered heavily to exclude outlying values, this means that data for days- and months- are missing, and are not accounted for in the time series of observations.

Critique of the method:

Because the autocorrelation function relies on the observations that are present in order to calculate lags, and does not use a date-time field present in the function– the autocorrelation function is unable to accurately represent missing data or irregular observation intervals. The unseen, missing data– particularly in the daily dataset– appear to be causing the offset in the autocorrelation function between the two sites. If the sites are missing data from different times, and are missing different numbers of observations at those times, then the autocorrelation function will look different.

I will continue to explore gap-filling and extrapolation methods from my data in order to compare autocorrelation on sub-month time intervals.

April 2, 2018

Tutorial 1 Sample

Filed under: Exercise/Tutorial 1 2018 @ 2:36 pm

© 2019 GEOG 566   Powered by WordPress MU    Hosted by