GEOG 566

         Advanced spatial statistics and GIScience

Archive for Exercise/Tutorial 3 2018

June 11, 2018

Assessing the importance of 3 predictors of fire occurrence and the random effect of site with a GLMM model

Filed under: Exercise/Tutorial 3 2018 @ 4:11 pm


The question I asked was whether fire occurrence from 1700-1918, a binary response (0 = no fire, 1 = fire), was related to three explanatory variables:

Climate – Annual Palmer Drought Severity Index (PDSI), a measure of moisture stress reconstructed from tree rings (range -6.466 to 7.234; severe drought to cool and wet). In Exercise 2 I found, using superposed epoch analysis, that large and extensive fires occurred during years of extreme drought and that previous-year PDSI was not related to fire occurrence.

Interval – The number of years since the last fire event (range 1-46). This variable is a surrogate for fuel, a primary constraint on fire spread. The longer an area goes without fire, the more fuel accumulates, potentially increasing the probability of fire spread.

Note – Calculating time since fire for each year at each plot was quite tricky. I eventually found a solution with the dplyr package. This code counts the number of rows (years) between events (FIRE_PLUS) within each plot and writes the running count of years since fire to a new column. This almost does what I want, but the counter restarts in each year a fire occurred. The interval in the year of fire occurrence should be the years since the last fire (e.g., 20, not 0), so I shifted fire occurrence forward in time one year (FIRE_PLUS).

library(dplyr)

binary2 <- binary1 %>%
  group_by(SP_ID, idx = cumsum(FIRE_PLUS == 1L)) %>%
  mutate(counter = row_number()) %>%
  ungroup() %>%
  select(-idx)
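To check the counter logic, here is a minimal, self-contained toy example (the column names mirror the fire dataset; the values are made up):

```r
library(dplyr)

toy <- tibble(
  SP_ID     = rep("SD200", 6),
  YEAR      = 1700:1705,
  FIRE_PLUS = c(0, 0, 1, 0, 0, 1)
)

toy2 <- toy %>%
  group_by(SP_ID, idx = cumsum(FIRE_PLUS == 1L)) %>%
  mutate(counter = row_number()) %>%
  ungroup() %>%
  select(-idx)

# counter is 1, 2, 1, 2, 3, 1: it restarts in each fire year,
# which is why fire occurrence was shifted forward (FIRE_PLUS)
toy2$counter
```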

Lodgepole – The proportion of area occupied by lodgepole pine within a 2 km buffer surrounding each sample plot.  Lodgepole pine stands have lower productivity and relatively sparse understory fuels.  Fire occurrence may be less frequent in areas where lodgepole pine forest limits fire spread.

My data looks like this:

SP_ID YEAR FIRE FIRE_PLUS PDSI INTERVAL PICO
SD200 1700 0 0 2.935 13 40
SD200 1701 0 0 2.028 14 40
SD200 1702 0 0 4.049 15 40
SD200 1703 0 0 -5.034 16 40
SD200 1704 0 0 2.306 17 40
SD200 1705 1 0 0.773 18 40
SD200 1706 0 1 -2.451 1 40

SP_ID is the sample plot identifier (n = 52), YEAR is the year of observation, FIRE is the binary record of whether a fire occurred, FIRE_PLUS is fire occurrence shifted so I could calculate the interval since fire for each year and each plot, INTERVAL is the number of years since the last fire, and PICO is the proportion of area around each plot occupied by lodgepole pine.

I’m also interested in the relative importance in the model of each explanatory variable, and whether the response varies by sample plot.


I used the lme4 package in R to construct generalized linear mixed models (GLMM; Bolker et al. 2009) to complete my analysis. The GLMM approach is useful for a binary response where observations are nested within groups. In this case my response is nested within sample plots, and a mixed model allows me to pool information about fire occurrence across sites.

With a binary response, effective sample size is limited by the number of 1 (fire) occurrences of the response variable. In my case I have ~200 binary observations (years) at each plot, but fire occurs in only 8-28 years depending on the sample plot, so sample sizes within any single plot would be too small for 3 predictor variables. The GLMM approach allowed me to pool information across plots and to determine whether sample plot, the random effect in the model, produced a different response to the fixed effects (climate, interval, lodgepole).

I followed these steps:

  1. Checked for independence of the explanatory variables.
  2. Specified the model with sample plot as the random effect; climate, interval, and lodgepole as the fixed effects; and fire as the response. This is a GLMM fit by maximum likelihood with a logit link.

fire.glmm1 <- glmer(FIRE ~ PDSI + INTERVAL + PICO + (1|SP_ID), family = "binomial", data = binary2, control = glmerControl(optCtrl = list(maxfun = 2e5)))  # binary2 is the data frame assembled with dplyr above

  3. Checked the model fit and the assumption of residual independence. To do this I used the DHARMa package and its vignette, which describes why residual interpretation for GLMMs is problematic and provides a solution:

If the model is specified correctly then the observed data should look like it was created from the fitted model. Hence, for a correctly specified model, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).
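The residual checks below sketch how this was done with DHARMa; fire.glmm1 and binary2 are the model and data frame from the earlier steps, and the column names are those of my dataset.

```r
library(DHARMa)

# Simulate scaled (quantile) residuals from the fitted GLMM
sim <- simulateResiduals(fittedModel = fire.glmm1, n = 250)

# Overall residuals: QQ plot plus residuals vs. predicted
plot(sim)

# Residuals against each fixed effect
plotResiduals(sim, binary2$PDSI)
plotResiduals(sim, binary2$INTERVAL)
plotResiduals(sim, binary2$PICO)

# Temporal autocorrelation: aggregate residuals by year, then test
sim_year <- recalculateResiduals(sim, group = binary2$YEAR)
testTemporalAutocorrelation(sim_year, time = unique(binary2$YEAR))
```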

Overall residuals for the full model


Residuals vs. Variables – Residuals for each fixed effect


Temporal autocorrelation of residuals


ACF plots of Residuals


4. I used R2 for GLMMs (Nakagawa and Schielzeth 2012) to indicate the relative importance of the fixed effects and to determine the importance of sample plot (the random effect). The marginal R2 is the variance explained by the fixed effects; the conditional R2 is that of the fixed effects plus the random effects. To determine the relative importance of the fixed effects I dropped one fixed effect at a time and compared changes in R2 among models.
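A sketch of this comparison using MuMIn's r.squaredGLMM(), one implementation of the Nakagawa and Schielzeth method (the reduced-model names are my own):

```r
library(MuMIn)

# Marginal (fixed effects) and conditional (fixed + random) R2
r.squaredGLMM(fire.glmm1)

# Drop one fixed effect at a time and compare R2 across models
fire.noPDSI <- update(fire.glmm1, . ~ . - PDSI)
fire.noINT  <- update(fire.glmm1, . ~ . - INTERVAL)
fire.noPICO <- update(fire.glmm1, . ~ . - PICO)

lapply(list(full = fire.glmm1, noPDSI = fire.noPDSI,
            noINT = fire.noINT, noPICO = fire.noPICO),
       r.squaredGLMM)
```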

5. To interpret the influence of each fixed effect I exponentiated the coefficients and graphed the probability functions.
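Step 5 can be sketched as follows; the values held constant for INTERVAL and PICO are illustrative assumptions:

```r
library(lme4)

# Odds ratios: (OR - 1) * 100 is the % change in the odds of fire
# per one-unit increase in a predictor
exp(fixef(fire.glmm1))

# Predicted probability of fire across the PDSI range, holding the
# other predictors at typical values and ignoring the random effect
newdat <- data.frame(PDSI = seq(-6.5, 7.2, length.out = 100),
                     INTERVAL = 10, PICO = 40)
newdat$p <- predict(fire.glmm1, newdata = newdat,
                    type = "response", re.form = NA)
plot(p ~ PDSI, data = newdat, type = "l",
     xlab = "PDSI", ylab = "P(fire)")
```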


All three fixed effects were significant in the GLMM, though PICO only marginally so (p = 0.0458). For every 1-unit increase in PDSI the probability of fire occurring decreased by ~25%, for each year without fire (interval) the probability of fire increased by 7%, and for each percent increase in lodgepole forest around a sample plot the probability of fire decreased by 1%.


The difference in marginal and conditional R2 for the full model was small. The random intercept of site explained little additional variance. This suggests there’s not a lot more to be explained at the sample plot level.

R2m = 0.2072182; R2c = 0.2329679

Comparing conditional R2 among models while dropping fixed effects indicated that climate, a top-down driver consistent across sample plots, accounted for most of the R2 (0.15 for PDSI, 0.5 for Interval, and 0.1 for Lodgepole).

These results suggest that bottom-up drivers such as local topography, composition, or microclimate have little influence on fire occurrence. I was surprised to see that climate had a stronger influence on fire occurrence within a year, but this makes sense as I think more about time since fire and fuel accumulation. If fuel recovers quickly (e.g., 2-5 years) in this landscape, then it won't be limiting in most years. If fuel is generally not a limiting constraint on fire, then fire occurrence shouldn't change dramatically with time since fire: essentially, 5 years without fire and 20 years without fire have the same influence on whether a fire occurs, and fire is simply more likely over a longer period of years. Thus climate is the more influential predictor of fire occurrence.

Overall, most of the variance was not explained. If climate and fuels are generally not limiting, then fire occurrence in a given year results from human and lightning ignitions. Data for these predictors are not available.


Overall I found GLMM modeling to be quite challenging, and based on the number of R packages and recent publications on the topic it seems like an evolving statistical method. I had trouble understanding the implications of the random effect (sample plot) accounting for little of the R2. I interpret this to mean that there's not much left to be explained at the sample plot level that has not been included.

Interpreting coefficients from a GLMM with a logit link is also challenging, and there are few examples available online.


Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H. & White, J.S.S. (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24, 127–135.

Hartig, F. (2018) DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.0.

Nakagawa, S. & Schielzeth, H. (2012) A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142.

The Hidden Behavioral States May Still Be Hidden: Exploring the Applicability of Hidden Markov Models and Environmental Covariates for Modeling Movement Data (Exercise 3 Part 2)

Filed under: 2018,Exercise/Tutorial 3 2018 @ 10:10 am

Question Asked

Overall, my aim for Exercise 3 was to determine the extent to which environmental covariates (of the type explored in Exercise 2) could be related to spatially explicit behavioral states defined by step length and turning angle data generated from GPS tracks of movement in Exercise 1. As described in my Exercise 3 Part 1 blogpost, to operationalize the behavioral states, paired step length and turning angle measurements were generated from the raw GPS tracks, as described in my Exercise 1 blogpost. Histograms of both step length and turning angle distributions for five sample tracks revealed that two states may be emerging from the data: 1) a state characterized by small step lengths and wide turning angles and 2) a state characterized by large step lengths and very narrow (near-zero) turning angles. The emergence of these two potential behaviors from visual inspection of the histograms is characteristic of the behavioral states used to describe the movement behaviors of animals, to which hidden Markov model approaches have been applied.

Through coursework and collaboration in this class, my lab mate Jenna and I discovered the moveHMM R data analysis package, which uses hidden Markov model statistical theory to fit a two-state model of behavior driven by step length and turning angle observations. The hidden Markov model approach requires that the measured data, in this case time-stamped X and Y coordinates, actually represent movement. The literature on the application of hidden Markov models to movement behavior suggests this requirement can be met by assuming that spatial inaccuracy does not exist within the data (i.e., the measured coordinates represent actual movement behaviors) and by regular time-stamped sampling of the GPS data (i.e., no missing data). By assuming that the measured data represent a known state (in this case, movement), the hidden Markov model uses patterns in the measured data to reveal the "hidden" underlying states. In this application, the hidden states being modeled are two behavioral states defined by combinations of co-occurring step length and turning angle movements. This approach is well documented in the movement ecology literature for understanding transitions between two states of movement in animals, using tracking approaches such as GPS or telemetry. Given its applicability for the study of animal movement, Jenna and I thought it would be an interesting approach for understanding the movement behavior of people on the landscape.

Additionally, covariates can be explored to understand how a co-occurring environmental variable, changing in space and time along with the step length and turning angle data, may or may not correlate with behavioral shifts between states. In this way, landscape-level environmental data of the type explored in Exercise 2 (i.e., vegetation type and elevation) can be related to spatially explicit behavioral data. It should be noted that for Exercise 3, bathymetry and distance to shoreline were the environmental variables used. I previously explored land-based vegetation cover class and elevation as environmental variables in Exercise 2; however, I have since been able to get the bathymetry data originally intended for Exercise 2 to work with my dataset in ArcGIS. Therefore, instead of continuing to explore land cover vegetation class and elevation, I will explore distance to shoreline and bathymetry as behavioral covariates in Exercise 3 Part 2, as these data co-occur in space with the GPS movement data.

Given the above background, my research questions are as follows:

  1. How does bathymetry influence the transition probability between a movement-based behavioral state and a stationary behavioral state?
  2. How does distance to shoreline influence the transition between a movement-based behavioral state and a stationary behavioral state?
  3. Which covariate, if either, has more explanatory power?

Tool/Approach Used

I used the R package moveHMM (Michelot, Langrock, & Patterson, 2016) to run the hidden Markov models, generate behavioral probabilities, and run an AIC analysis on my fitted models. For data preparation, I used the fitdistrplus R package to define parameter estimates and distributions (see Exercise 3 Part 1 post) and ArcGIS spatial analyst tools (see Exercise 2 post) to relate bathymetry and distance to shoreline data to each step length and turning angle observation in the dataset.

Description of Steps Used to Complete the Analysis

Many of the initial preparatory data wrangling and formatting steps used to set up the Exercise 3 Part 2 analysis are documented in blogposts for Exercises 1, 2, and 3 Part 1. The below list describes steps that were taken as part of this analysis that have been previously described in other posts. New analytic steps are subsequently described:

Previously completed workflows:

See Exercise 1 blogpost for a description of the steps used to generate step length and turning angles from raw GPS data for this analysis using the “prepData” function in moveHMM.

See Exercise 2 blogpost for a description of how raster-based elevation data were related to point-based movement data using the Extract Values to Points tool in ArcGIS. Similarly, see Exercise 2 blogpost for a description of how distance to shoreline was calculated for each point-based movement data observation using the Spatial Joins tool in ArcGIS.

See Exercise 3 Part 1 blogpost for a description of how initial distribution and parameter estimates were generated for step length and turning angle inputs for the moveHMM tool.

New analytic workflows:

New steps completed to run the moveHMM tool and explore model fits, visualize results, and calculate model AIC values are as follows. The below-described steps are adapted from published moveHMM workflows (Michelot, Langrock, & Patterson, 2017; Michelot, Langrock, & Patterson, 2016).

  1. Prior to fitting the moveHMM model, define distribution parameters for step length and turning angle. The default distribution for step length is gamma and the default distribution for turning angle is von Mises. If other distributions are used, they must be defined in the fitHMM function.
  2. Run model using the “fitHMM” function. Define input data (must be generated through prepData function), number of behavioral states, distribution parameters (define prior and then call in command), and covariates. This step generates numeric and text output reporting the model parameter estimates for each behavioral state.
  3. Generate a visual summary of the model, including density histograms for step length and turning angle, transition probabilities for behavioral states, and spatially explicit behavioral state occurrences through “plot(model)” function.
  4. Generate visual summaries of state transitions through “plotStates(model)” for each track.
  5. Run AIC on model(s) to provide a measure for which model is statistically favored.
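The five steps above can be sketched as follows; hmmData stands for the output of prepData, and the starting values and covariate column names (bathymetry, dist_shore) are illustrative assumptions, not the values actually used:

```r
library(moveHMM)

# 1. Starting values for the default distributions: gamma step lengths
#    (means then SDs for states 1 and 2) and von Mises turning angles
#    (means then concentrations)
stepPar0  <- c(1, 200, 1, 100)
anglePar0 <- c(pi, 0, 1, 5)

# 2. Fit two-state models with each covariate on the transition probabilities
m_bathy <- fitHMM(data = hmmData, nbStates = 2,
                  stepPar0 = stepPar0, anglePar0 = anglePar0,
                  formula = ~ bathymetry)
m_shore <- fitHMM(data = hmmData, nbStates = 2,
                  stepPar0 = stepPar0, anglePar0 = anglePar0,
                  formula = ~ dist_shore)

# 3. Density plots, transition probabilities, and mapped states
plot(m_bathy)

# 4. Most likely state sequence over time, per track
plotStates(m_bathy)

# 5. Compare the two covariate models
AIC(m_bathy, m_shore)
```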

Description of Results Obtained

Overall, the results obtained using this tool varied widely. The results presented below are those that I think best represent the data and shed light on the research questions I posed at the beginning of the exercise. Results are presented for each of the above-listed analysis steps.

Results: Define distribution parameters for step length and turning angle and run model

This part of the exercise turned out to be more difficult than anticipated. I had hoped that by doing a thorough exploration of my data, both in aggregate and segmented into data-driven two-state bins, the parameter estimates defined by the movement data would ultimately map well into the moveHMM analytic environment. Unfortunately, this was not the case. The first round of parameter estimates I tried in the model fitting process were those derived from the gamma and wrapped Cauchy distributions identified in Exercise 3 Part 1. These parameter estimates produced a model that included only one state (state 2) and had standard deviation estimates of infinity for the model parameters. These results were not expected and suggested that the model was not a good fit for the data.

Given that the original, data-driven estimates did not result in a well-fitting model, I began trying various combinations of parameter estimates seen in the literature and developed through my own knowledge of travel rates. Additionally, I reverted to using the two default distributions in the moveHMM tool, gamma for step length and von Mises for turning angle, as the tool seemed to perform better with those distributions. Table 1 reports a set of initial parameter estimates derived from the fitdistrplus tool and a second set of parameter estimates, developed through trial and error, that was ultimately used in model development.

Table 1. Parameter estimates for two model fitting trials.

Figure 1. Model outputs for the initial parameters (left panel). The initial parameters resulted in only one behavioral state (State 1), with a mean value of 0 for State 2, which essentially indicates a non-state because the gamma distribution is strictly positive and thereby does not include parameter values of 0 or below.

Given the poor results from the initial parameters, the best fitting model parameters were used to fit two models for exploration in this exercise: a two-behavioral state model with distance to shore as a covariate for behavior transition and a two-behavioral state model with bathymetry as a covariate for behavior transition. Figure 2 reports the model outputs for these model runs.

Figure 2. Results from the model outputs for the best fitting parameters with distance to shore as a covariate (left panel) and bathymetry as a covariate (right panel). The distance to shore model suggests a two-state model, with one state characterized by shorter step lengths (mean = approximately 1 meter) and wide turning angles (mean = 3.1 radians, i.e., near a full reversal) and a second state characterized by longer step lengths (mean = approximately 224 meters) and narrow turning angles (mean = -0.014 radians). The bathymetry model suggests a two-state model, with one state characterized by shorter step lengths (mean = approximately 2.5 meters) and wide turning angles (mean = 3.1 radians) and a second state characterized by longer step lengths (mean = approximately 262 meters) and narrow turning angles (mean = -0.007 radians).

Visual Presentation of Results and AIC Analysis

Results of the AIC analysis to determine the favored model suggest that Model 2, the bathymetry model, is favored over Model 1, the distance to shore model, given AIC values of 72708.23 and 82097.23, respectively. Given these results, the remaining visualizations and interpretation of results are provided for Model 2, with bathymetry as a covariate.

Figure 3. Density histogram of step length states for the Bathymetry Model. The figure suggests that the distribution for state 2, the movement oriented state, has a higher density across step lengths from slightly greater than 0 to approximately 300 meters in length than state 1, which peaks right around a step length of 0.


Figure 4. Density histogram of turning angle states for the Bathymetry Model. Turning angles for State 2 cluster around 0, which is expected given the narrow turning angle movement mean reported for this state. The curve of State 1 is unexpected, given that the mean of the turning angle movement reported for this state is around 3.

Figure 5. Example output for the location of behavioral states in space for track ID 2. In theory, the figure would show blue for portions of the track during which the individual is in state 2 and orange for portions during which the individual is in state 1. At this scale, state 1 is hard to see, but there are small clusters around the beginning and end of the trip that show some orange track pieces. This figure shows that for individual 2, the majority of the track is in state 2 behavior rather than state 1 behavior.

Figure 6. Distribution of behavioral states across time for track ID 2. The figures show that the majority of the movement track is spent in state 2, the moving state, and that state 1, the resting state, occurs infrequently throughout the movement track.

Figure 7. Transition probability matrix for the influence of bathymetry on transitions between behavioral states. The four transition probability plots show that as bathymetry (water depth) increases, the likelihood of staying in state 1, the resting state, decreases dramatically at a small increase in bathymetry and then remains at 0. Similarly, at small increases in bathymetry, the probability of transitioning from state 1, the resting state, to state 2, the moving state, increases rapidly and then stays at a probability of 100%.

Returning to my research questions for this exercise, I have reached the following conclusions:

  1. How does bathymetry influence the transition probability between a movement-based behavioral state and a stationary behavioral state?

As bathymetry increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning between a stationary state and a movement-oriented state increases rapidly and abruptly as bathymetry increases.

  1. How does distance to shoreline influence the transition between a movement-based behavioral state and a stationary behavioral state?

(Results not pictured in this exercise) As distance to shore increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning between a stationary state and a movement-oriented state increases rapidly and abruptly as distance to shore increases.

  1. Which covariate, if either, has more explanatory power?

Through an AIC analysis comparing the Bathymetry Model and the Distance to Shore Model, the Bathymetry Model was favored over the Distance to Shore Model as the preferred model for modeling two-state movement behavior.

Despite being able to answer the three research questions I originally posed, I am not yet confident enough to conclude that these results, or the modeled two-state behaviors, are valid in their current form. The critique below identifies why I am cautious about the presented results.

Critique of Method

Conceptually, I think the application of hidden Markov models and the use of the moveHMM tool has great potential for exploring movement data outside the realm of animal-based movement. The types of analyses the tool is capable of are exciting, particularly given the ability to relate covariates to the movement data. However, I have several critiques of the method that arose through my experimentation in Exercise 3:

  1. Additional guidance is needed for establishing null distributions and parameters in the model – the moveHMM package documentation states that establishing the movement parameters for each behavioral state, prior to running the model, is the most important component of the modeling process. Given that caveat, it seems that a high degree of familiarity with the movement data, and the population under study, is needed in order to run the two-state behavioral model. The model therefore does not seem exploratory in nature, but rather a tool to identify the transition probabilities between two known states, not a tool to identify the behavioral states themselves. The literature does not present hidden Markov models in this light, and I think the amount and degree of background knowledge about expected model outcomes should be discussed more candidly in the literature on applying hidden Markov models to movement data. It essentially felt like I was trying random combinations of parameters in order to get the models to run. Part of the struggle is certainly a function of being a beginner at this type of modeling.
  2. Better error messaging needed in moveHMM – When working with the tool, the errors returned were difficult to interpret and, given the newness of the tool, there is not yet an established online community providing responses to hiccups. Therefore, when I was getting errors about value estimates outside of the parameters or missing fields, I had difficulty identifying exactly where I was going wrong, given that I was using the published vignettes as guides for working with my own data. More descriptive error handling guidance would be helpful in using this tool again in the future.
  3. Small sample size and combined sample leading to erratic results – The sample of tracks modeled represents only 5 movement patterns, and in reviewing these movement patterns they are quite distinct from one another. I think the model would perform better on a larger sample of data, which would help iron out some of the disparities in the parameter estimates. Additionally, the modeling exercise assumes that one distribution per state is appropriate for the whole population, so any variability in the individual track data is minimized in the model. This approach is helpful only if the researcher believes that all actors behave similarly; this is a big assumption for my data, and given this modeling exercise I do not think it necessarily holds.

For these reasons, the numeric results presented should not be interpreted literally, but rather as an exercise in how the tool could be applied in the future with additional data and a more involved initial understanding of the behaviors of interest.

June 10, 2018

Fitting Distributions for Two Spatial Data Behavioral Measures: Step Length and Turning Angle (Exercise 3, Part 1)

Filed under: Exercise/Tutorial 3 2018 @ 10:11 pm

Question Asked

For Exercise 3, I wanted to explore the degree to which two environmental covariates influence the transition probabilities between two behavioral states using a hidden Markov model approach. To operationalize the behavioral states, paired step length and turning angle measurements were generated from the raw GPS tracks, as described in my Exercise 1 blogpost. Histograms of both step length and turning angle distributions for five sample tracks revealed that two states may be emerging from the data: 1) a state characterized by small step lengths and wide turning angles and 2) a state characterized by large step lengths and very narrow (near-zero) turning angles. The emergence of these two potential behaviors from visual inspection of the histograms is characteristic of the behavioral states used to describe the movement behaviors of animals, to which hidden Markov model approaches have been applied.

In order to fit a hidden Markov model to the step length and turning angle data using the moveHMM tool (Exercise 3, Part 2), the null distributions and associated parameters for step length and turning angle must be defined for each state. The moveHMM documentation states that one of three possible distributions must be defined for step length (gamma, Weibull, or lognormal) and one of two for turning angle (von Mises or wrapped Cauchy). Therefore, for Exercise 3 Part 1, I pose the following research questions related to the distributions of step length and turning angle:

  1. Of gamma, lognormal, and Weibull distributions, which distribution best fits the step length dataset?
    1. For the best fitting distribution, what are the associated model parameters for state 1 and state 2 behaviors?
  2. Of von Mises and wrapped Cauchy, which distribution best fits the turning angle dataset?
    1. For the best fitting distribution, what are the associated model parameters for state 1 and state 2 behaviors?

Tool/Approach Used

Earlier in the term, another classmate, faced with a similar need to define distributions for her data, discovered and presented on an R package called fitdistrplus (Delignette-Muller & Dutang, 2014) – a package allowing for the exploration of various distributions to user-provided data. Remembering that she had presented on Weibull, lognormal, and gamma distributions, I decided to answer my research questions through use of the fitdistrplus R package.

Description of Steps Used to Complete the Analysis

My overall approach to the analysis was exploratory, following a learn-by-doing process to understand the best inputs for fitting the model. Of the variety of model-fitting options available within the fitdistrplus R package, I ended up using the following three model-fitting approaches on my data:

  1. Maximum likelihood estimation
  2. Maximum goodness of fit estimation with a Cramer-von Mises distance
  3. Maximum goodness of fit estimation with a Kolmogorov-Smirnov distance

I selected the above approaches because, of the methods provided by the package (moment matching estimation and quantile matching estimation are also options), these three seemed the most straightforward, were referenced by other researchers in the literature and in discussion, and would provide the parameter estimates needed in later analytical phases.

I also generated parameter estimates for three different slices of my data. First, I generated parameter estimates for all of the step length and turning angle data. Then I generated estimates for a rough cut at two behavioral states for the step length data: step lengths less than 400 meters (state 1) and step lengths of 400 meters or greater (state 2). This cutoff point emerged from my data as a trough in an otherwise somewhat bimodal step length distribution, with a grouping of small step lengths (100 meters or less) and a grouping of larger step lengths (around 600 meters).

To fit the individual models, I used the "fitdist" function for each combination of data and model-fitting type. I also used the "plot" function on the fitted objects to generate visualizations of the distributions.
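The fitting calls can be sketched as follows, assuming a numeric vector steps of step lengths (the 400 m split follows the description above):

```r
library(fitdistrplus)

# State 1 slice: step lengths under 400 m (gamma needs positive values)
state1 <- steps[steps > 0 & steps < 400]

# The three fitting approaches used, for the gamma distribution
fit_mle <- fitdist(state1, "gamma", method = "mle")
fit_cvm <- fitdist(state1, "gamma", method = "mge", gof = "CvM")
fit_ks  <- fitdist(state1, "gamma", method = "mge", gof = "KS")

# Shape and rate estimates, e.g. as moveHMM starting values
fit_mle$estimate

# Density, CDF, Q-Q, and P-P comparison panels
plot(fit_mle)
```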

Description of Results Obtained

Through my exploration with the tool, the various methods for fitting models, and my three different data slices, I ended up generating parameter estimates for 45 different model/data-fitting/data combinations. Of those combinations, the results presented below were the most interesting and influential in my conclusions:

Step Length

Figure 1 displays the histogram and theoretical densities for the step length dataset as a whole, showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model-fitting technique. Among the three model-fitting techniques tested, I did not see any practically significant differences between the curves produced or the estimated model parameters. Looking at the distribution of the step length data all together, I concluded that the Weibull and gamma distributions did not differ in a practical sense, at least when looking at the dataset as a whole. Therefore, for the remainder of this blogpost I'll present figures from the maximum likelihood estimation method, which is the default for the fitdistrplus tool. I also noticed the bimodal nature of the step length histogram (x axis labeled "data") and realized I should divide the data into two sets in order to generate parameter estimates for two behavioral states.

Figure 1. Distribution fit using maximum likelihood estimation for the step length dataset as a whole

Figure 2 displays the histogram and theoretical densities for the state 1 slice of the step length dataset (step lengths less than 400 meters), showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model fitting technique. Looking at the state 1 distributions, I noticed that the gamma distribution appeared to be intermediate between the Weibull (the more conservative) and the lognormal (the more extreme). Given that the lognormal distribution appeared to be influenced more by the height of the frequency distribution and the Weibull more by its width, the gamma distribution emerged as a potential contender for use in modeling the data.

Figure 2. Distribution fit using maximum likelihood estimation for State 1 of the step length dataset

Figure 3 displays the histogram and theoretical densities for the state 2 slice of the step length dataset (step lengths 400 meters and greater), showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model fitting technique. Looking at the state 2 distributions, noticeable differences emerged between the Weibull and the gamma/lognormal curves. The Weibull curve seemed to be much more influenced by the height of the frequency distribution rather than its width. Some background exploration of my data revealed that the height was being driven primarily by a single track’s data (of the set of 5 tracks under exploration in this class). Therefore, to not unduly weight one track’s behavior over the others’, I ruled out Weibull for use in the moveHMM tool.

Figure 3. Distribution fit using maximum likelihood estimation for State 2 of the step length dataset

Between the gamma and lognormal distributions, gamma is more frequently used by movement ecologists in modeling step length data. Given that the results of the distribution explorations did not practically differ between lognormal and gamma, I also felt confident moving forward with the gamma parameter estimates. The estimates of the gamma distribution parameters of shape and rate for each state are as follows:

State 1 Shape = 2.07, Rate = 0.031

State 2 Shape = 89.09, Rate = 0.140

Turning Angle

The turning angle parameters showed less variability from the onset of the project; my use of the fitdistrplus tool for turning angle was more for my own learning than for deriving data-driven parameter estimates for the moveHMM model. Through the fitdistrplus tool I wanted to explore whether wrapped Cauchy or von Mises was a better fit for the data. Unlike the step length plots, I was unable to create comparative plots for the wrapped Cauchy and von Mises distributions (this might be user error). Instead, I generated the diagnostic plots below for each distribution for each state. I ran the model fitting analyses for turning angle using the states defined by step length presented previously; I do not yet know enough about either distribution or the turning angle data generally to make an informed estimate of what the values might be for each state. Figures 4 and 5 below present the diagnostic plots for the wrapped Cauchy distribution. Ultimately, I selected this distribution because the Q-Q and P-P plots for wrapped Cauchy indicated a marginally better model fit than von Mises; however, both sets of diagnostic plots indicated well-fitting distributions, as shown by little deviation of the scatter points from the plot lines.

Figure 4. Diagnostic plots for maximum likelihood estimation fit for wrapped Cauchy for State 1 turning angles


Figure 5. Diagnostic plots for maximum likelihood estimation fit for wrapped Cauchy for State 2 turning angles

The wrapped Cauchy parameter estimates (location and concentration) for the two states are as follows:

State 1 Location = -0.014, Concentration = 0.15

State 2 Location = -0.015, Concentration = 0.06
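A hedged sketch of feeding these estimates into moveHMM follows. Note that moveHMM parameterizes the gamma step distribution by mean and standard deviation rather than shape and rate, so the estimates above must be converted (mean = shape/rate, sd = sqrt(shape)/rate); the commented fitHMM call assumes tracks prepared with prepData():

```r
library(moveHMM)

# Gamma shape/rate estimates from fitdistrplus, per state
shape <- c(2.07, 89.09)
rate  <- c(0.031, 0.140)

# Convert to moveHMM's mean/sd parameterization
step_mean <- shape / rate        # approx. 66.8 and 636.4 meters
step_sd   <- sqrt(shape) / rate  # approx. 46.4 and 67.4 meters

stepPar0  <- c(step_mean, step_sd)
anglePar0 <- c(-0.014, -0.015, 0.15, 0.06)  # locations, then concentrations

# data would come from moveHMM::prepData() on the raw tracks
# m <- fitHMM(data, nbStates = 2, stepPar0 = stepPar0,
#             anglePar0 = anglePar0, stepDist = "gamma",
#             angleDist = "wrpcauchy")
```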

Critique of Method

Overall, I found the tool to be very useful in allowing me to compare and contrast the fit of various distributions for my step length and turning angle data. It was convenient and straightforward to use a single tool to fit the various distributions, rather than having to use separate tools for step length and turning angle and/or individual tools for different distributions.

My critique of the tool lies in my own lack of knowledge in how best to interpret the outputs of the distributional fits. I appreciated that the tool presented both numeric estimates of the parameters and comparative plots, but my own unfamiliarity with each of the distributions I was fitting made it difficult for me to evaluate the output. Through online research, the moveHMM tool documentation, movement ecology publications, and the fitdistrplus documentation, I feel I was able to put together enough of a working understanding to make sense of the output; however, my interpretation could certainly be helped by additional understanding of the distributions and model fitting techniques themselves.


Delignette-Muller, M.L., & Dutang, C. (2014). fitdistrplus: An R package for fitting distributions. Journal of Statistical Software 64(4) [online]

Logistic Regression of Plant “Vulnerability” Against Three Explanatory Variables

Filed under: Exercise/Tutorial 3 2018 @ 3:46 pm


In this exercise, I used logistic regressions to investigate whether three explanatory variables (growing degree days, soil temperature, soil moisture) could be related to changes in the probability of finding vulnerable versus invulnerable plants in my 2017 data. Logistic regressions allow us to fit models to probabilities that range between 0 and 1 (Hosmer et al., 2013). Vulnerable plants were defined as plants with any vulnerable capitula present, whereas plants were defined as invulnerable if they had only invulnerable stages (vulnerable stages: primordia, buds, young flowers, and flowers; invulnerable stages: fruit and dehisced fruit). Based on these definitions, I created a binary response variable at the plant level:

vul.plant = 1 for a plant having any vulnerable capitula
vul.plant = 0 for a plant that has only invulnerable capitula

Because all flowering capitula begin at a vulnerable stage (primordia, bud) and pass through to invulnerable stages (fruit, dehisced fruit), a baseline expectation was that early season samples would have a probability of vulnerability close to 100%, and that samples late in the season would have a probability of vulnerability close to 0%. (Prediction 1): growing degree days are expected to explain a significant degree of the change in probability from 100% to 0%.

Previous analysis showed that warmer soil temperatures (˚F) were linearly associated with higher growing degree days, so I expected that soil temperature would explain some of the change in probability of vulnerability, both through this correlation with growing degree days and also through directly impacting flowering phenology. (2) Higher soil temperature is expected to be associated with decreased probability of vulnerability. On the other hand, the correlation between growing degree days and soil moisture (measured as percent soil moisture) was negative and neither as strong nor as consistent as that with soil temperature. For these reasons, soil moisture might explain a portion of the change in probability of plant vulnerability, and (3) higher soil moisture is expected to be associated with increased probability of vulnerability.

Finally, I also wanted to ask whether the patterns shown in the logistic regressions I performed would differ between the five surveyed sites represented in the 2017 data.


To complete this analysis, I used the generalized linear model function glm() to perform logistic regressions on data prepared in tutorial 2. Because soil temperature measurements were significantly impacted by the time of day at which they were measured, I removed the linear trend between soil temperature measured and time of day of measurement, creating adjusted soil temperature (also in ˚F). Soil moisture was used directly.
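The detrending step can be sketched as follows, assuming hypothetical columns soil_temp (˚F) and time_of_day (decimal hours) on toy stand-in data:

```r
# Toy stand-in data
dat <- data.frame(
  soil_temp   = c(55, 61, 67, 58, 64, 70),
  time_of_day = c(9, 12, 15, 10, 13, 16)
)

# Remove the linear trend with time of day; re-center the residuals
# on the overall mean so adjusted values stay in interpretable deg F
trend <- lm(soil_temp ~ time_of_day, data = dat)
dat$soil_temp_adj <- residuals(trend) + mean(dat$soil_temp)
```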

The binary response variable for plants was created using the dplyr package and the summarise function to collapse per-capitula phenology scores to plant-level presence/absence scores for each flowering stage, then further collapsing the presence/absence scores for six stages into two classes, “vulnerable” and “invulnerable,” as defined above. Similar methods were used to create a binary variable for the presence of virulent larvae (where virulent was defined as larval stages L4 & L5), and a third binary variable for phenological synchrony, defined as the presence on a single plant of both vulnerable capitula and virulent larvae. These additional binary variables will not be analyzed in this exercise.


  1. Create binary response variables – this was a little tricky because the initial data structure was plant-level counts of capitula by stage. Using the dplyr package and the group_by and summarise functions, I grouped the data by all plant-level variables I wished to keep in the resulting dataset, and then used summarise to create two new binary response variables: presence of vulnerable capitula (vulnerable capitula present = 1, none present = 0) and presence of virulent larvae (virulent larvae present = 1, none present = 0). I created a third binary response variable by multiplying the first two to document “phenological synchrony”, the co-occurrence on a single plant of vulnerable capitula and virulent larvae.
  2. Once data were properly wrangled, I used the glm() function with family set to binomial to fit logistic regression models of the binary response variable of interest (plant vulnerability) as a function of included explanatory variables.
  3. A summary() of the glm object gave estimates of the model coefficients and their significance levels (z scores, p-values). Based on methods outlined in the Datacamp module on logistic regressions, I used the predict() function to create a list of fitted probabilities and checked how often these correctly classified plants as vulnerable or invulnerable.
  4. To visualize the results as a smooth curve of predicted probabilities given by the glm object, I again used the predict() function with a regular sequence of x values overlapping the range of the explanatory variable and plotted these together using the baseplot or ggplot functions. For each explanatory variable, I eventually created a single plot with an overall model fit, and models fit to data from each individual site, in order to visually compare patterns/responses at different sites.
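Steps 2–4 above can be sketched in R as follows; the data frame and column names (vul.plant, gdd100) are hypothetical, with simulated stand-in data:

```r
# Simulated stand-in for the wrangled plant-level data
set.seed(42)
plants <- data.frame(gdd100 = runif(200, 2, 12))
plants$vul.plant <- rbinom(200, 1, plogis(5 - 0.7 * plants$gdd100))

# Step 2: logistic regression via glm with a binomial family
fit <- glm(vul.plant ~ gdd100, data = plants, family = binomial)
summary(fit)  # coefficient estimates, z scores, p-values

# Step 3: mean success rate of the fitted classifications
pred_class <- as.numeric(predict(fit, type = "response") > 0.5)
mean(pred_class == plants$vul.plant)

# Step 4: smooth curve of predicted probabilities over the x range
xs <- data.frame(gdd100 = seq(min(plants$gdd100), max(plants$gdd100),
                              length.out = 200))
xs$prob <- predict(fit, newdata = xs, type = "response")
plot(vul.plant ~ gdd100, data = plants)
lines(prob ~ gdd100, data = xs)
```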


With plant vulnerability as the response variable, I fit a model with growing degree days (in hundreds) as the single explanatory variable to confirm my initial expectation, that increased growing degree days would be associated with a move from near 100% vulnerability to 0% vulnerability. A logistic regression model fit to the data showed strongly significant estimates for the coefficients. The mean success rate for predicting vulnerability classification correctly using this fitted model was 91%. A visual representation of the probability of plant vulnerability showed the probability of plants being vulnerable starting very high (close to 1.0) and ending very low (close to 0.0) with the threshold value of 0.50 falling at about 720 growing degree days (Figure 1). A drop-in-deviance test for the model including growing degree days against a reduced model of plant vulnerability gave strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on number of growing degree days.
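The drop-in-deviance test can be run by comparing the full model against an intercept-only reduced model with anova(); the data frame and column names here are hypothetical, simulated stand-ins:

```r
# Simulated stand-in data
set.seed(42)
plants <- data.frame(gdd100 = runif(200, 2, 12))
plants$vul.plant <- rbinom(200, 1, plogis(5 - 0.7 * plants$gdd100))

full    <- glm(vul.plant ~ gdd100, data = plants, family = binomial)
reduced <- glm(vul.plant ~ 1,      data = plants, family = binomial)

# Chi-squared test on the drop in deviance (df = 1 here)
anova(reduced, full, test = "Chisq")
```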

Figure 1. Logistic regression of plant vulnerability (vulnerable = 1) against growing degree days (hundreds) with fitted curve shows a strong switch in probability of vulnerability from high to low with increasing growing degree days.

Next, I fit a model with plant vulnerability as the response and adjusted soil temperature (˚F) as the explanatory variable, understanding that some of the behavior seen between plant vulnerability and soil temperature can be attributed to the positive correlation between adjusted soil temperature and growing degree days (Figure 2). Coefficients fitted in this model were also highly significant (***see note on these p-values under “CRITIQUE”); but the mean success rate for fitted predictions was at 62% using this fitted model. Probabilities plotted using this fitted model ranged between 0.8 and 0.2 for the range of adjusted soil temperatures represented in the data. I found strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on adjusted soil temperature. The correlation between growing degree day and adjusted soil temperature is estimated at 0.30.

Figure 2. Logistic regression of plant vulnerability (vulnerable = 1) against adjusted soil temperature (°F) with fitted curve shows increasing adjusted soil temperature accounts for some decrease in probability of vulnerability.

I also fitted individual logistic regression models for plant vulnerability against adjusted soil temperature for each site, and found that Buck Mt. and Holland Mdw. had significant p-values for coefficient estimates while Juniper Mt., Waterdog Lk., and Blair Lk. did not***. A plot of fitted values from each site-level logistic regression model and an overall logistic regression model of plant vulnerability against adjusted soil temperature is given in Figure 3. In this figure, we see that the curve fit for Blair never predicts plants as being ‘vulnerable’, since the curve does not reach above probability 0.5, while curves for several other sites (Buck, Juniper, Holland and, weakly, Waterdog) predict a switch from high to low probability of vulnerability with increasing soil temperature. The poorness of fit for Blair Lake could be due to missing data, since one survey date did not have soil temperature data for that site.

Figure 3. Logistic regression of probability of plant vulnerability against adjusted soil temperature (degrees F) for all sites combined (grey dashed curve) and individual sites.

We see differences in how well these fitted models describe and predict the occurrence of vulnerable plants at the five sites. I do not see clear indication that the inflection points differ significantly between sites; however, the rate of change from probabilities near 1.0 to probabilities near 0.0 is steeper for certain sites.

Logistic regressions of plant vulnerability as a function of soil moisture showed an opposite relationship to previous regressions: increasing values of soil moisture saw the probability of plants being vulnerable going from low (near 0.0) to high (near 1.0) (Figure 4).  I found strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on soil moisture.

Figure 4. Logistic regression of probability of plant vulnerability against soil moisture (percent) shows an increase in probability of vulnerability with increased soil moisture.

Fitting logistic regressions for each site individually did show some different patterns per site (Figure 5). Waterdog Lake reached probabilities near 1.0 much sooner than the rest, while Buck Mt., which tended to have higher soil moisture than the other four sites, had a much slower transition between low and high probabilities of vulnerability.

Figure 5. Logistic regression of probability of plant vulnerability against soil moisture (percent) for all sites combined (grey dashed curve) and individual sites.



Prior to this analysis, I had been informed that the Wald’s test z-statistics and p-values reported in the output of the glm() fitted model are not a reliable test for significance of coefficient estimates and inclusion of terms. Instead, a drop-in-deviance test comparing a full versus a reduced model yields a p-value for the inclusion of the variables represented in the full model (Ganio, 2018). The drop-in-deviance test is intuitive and interpretable, but somewhat inconvenient for an analysis involving many separate logistic regressions (e.g., per site), and the reporting of the Wald’s test significance levels can be misleading. For this reason, I performed a drop-in-deviance test for each of the fullest models (growing degree days, adjusted soil temperature, soil moisture) that included all data. For the by-site logistic regressions, I found that calculating the mean success rate of fitted probabilities was a nice way to compare model goodness of fit quickly (Datacamp, 2018), though my own comprehension of the steps involved is not as solid as with the drop-in-deviance test, making interpretation more difficult. Calculating odds ratios and analyzing inflection points and slopes for the by-site curves would help me interpret these results in a more biologically meaningful manner and address the overarching research questions that motivated this data analysis.

An additional critique is that for the regressions of soil temperature and moisture to be most meaningful, I would like to include growing degree days in the model so that I can test the significance of the effect of soil temperature or moisture on the probability of vulnerability after accounting for the effect of growing degree days. The challenges that stopped me from performing this analysis were that soil temperature and growing degree days were strongly correlated, and that interpretation of the x-axis was intractable for a model that included multiple explanatory variables with different scales. Further research could help make this analysis feasible.


Datacamp. (2018). Logistic Regression in R Tutorial.

Ganio, Lisa. (2018). FES 524 Natural Resources Data Analysis Lecture, Feb 27 2018.

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.


June 7, 2018

Wavelet and Cross Wavelet Analyses of Kelp Canopy Cover and Temperature

Filed under: Exercise/Tutorial 3 2018 @ 4:26 pm

Question: With this exercise I wanted to look at patterns in the cycles of temperature and kelp coverage in southern Oregon. I wanted to examine whether these two variables fluctuate together and, if they do, whether that synchronization is constant throughout the timeseries.

Tool: To do this I used wavelet analysis and cross wavelet analysis. These analyses do not depend on stationarity, unlike many of our other spatio-temporal tools, so they can detect changes in the period and frequency of a cycling time series or pair of time series.

Steps: Wavelet analyses require evenly sampled data. My temperature data are sampled evenly, with one mean temperature per month (for both the intertidal and satellite datasets). My kelp timeseries, however, is based on sporadic sampling (i.e., whenever we could get a good cloud-free satellite image). So the first task was to interpolate my kelp timeseries to obtain monthly average canopy cover.

To do this, I interpolated each year’s timeseries separately. I first added a low canopy coverage for all winter months that did not have a data point. For the winter months (November–April), I found the median winter-month canopy coverage was about 1200 m², so all empty winter months were assigned this median value. Every year had at least 3 data points in the non-winter months, so between the added winter-month data and the summer-month points, I was able to fit a polynomial spline to each year’s kelp timeseries. I did this in R using the lm(y ~ bs(x, degree = n)) function. I then interpolated the value on the 15th of every month from this spline using the predict function.
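The per-year spline interpolation can be sketched as follows, assuming hypothetical columns doy (day of year) and canopy (m²) for one year's observations:

```r
library(splines)

# Toy stand-in for one year's kelp observations, with the median
# winter value (~1200 m^2) filled in for empty winter months
one_year <- data.frame(
  doy    = c(15, 46, 74, 105, 135, 160, 200, 230, 260, 288, 319, 349),
  canopy = c(1200, 1200, 1200, 1200, 3000, 8000, 15000,
             12000, 6000, 1200, 1200, 1200)
)

# Polynomial spline fit via lm() and a B-spline basis
spline_fit <- lm(canopy ~ bs(doy, degree = 3), data = one_year)

# Interpolate canopy cover on roughly the 15th of each month
mid_months <- data.frame(doy = c(15, 46, 74, 105, 135, 166,
                                 196, 227, 258, 288, 319, 349))
mid_months$canopy <- predict(spline_fit, newdata = mid_months)
```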

I then used the WaveletComp package in R to conduct wavelet and cross wavelet analyses on my kelp and temperature timeseries.
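A hedged sketch of the WaveletComp calls on simulated stand-in series (column names are hypothetical); analyze.wavelet() handles a single series and analyze.coherency() the cross-wavelet of a pair:

```r
library(WaveletComp)

# Simulated stand-in: 20 years of monthly kelp and temperature data
set.seed(1)
n <- 240
t <- 1:n
my.data <- data.frame(
  kelp = 5000 + 4000 * sin(2 * pi * t / 12) + rnorm(n, sd = 500),
  sst  = 12 + 4 * sin(2 * pi * t / 12) + rnorm(n, sd = 0.5)
)

# Wavelet analysis of one series (dt = 1 month; n.sim kept small for speed)
w_kelp <- analyze.wavelet(my.data, my.series = "kelp", dt = 1, n.sim = 10)
wt.image(w_kelp, siglvl = 0.1)

# Cross-wavelet of the kelp/temperature pair
w_cross <- analyze.coherency(my.data, my.pair = c("kelp", "sst"),
                             dt = 1, n.sim = 10)
wc.image(w_cross, siglvl.contour = 0.1)
```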



Figure 1: A) Wavelet analysis of my kelp canopy timeseries. B) Wavelet analysis of the satellite temperature timeseries. For both, colors correspond to power, white contour lines to the 0.1 significance level, and black lines to the power ridge.

My interpretation of the wavelets plots is:

1) Kelp Canopy – The highest power comes at the 6-month and 12-month periods. The 12-month wavelet makes sense, considering Nereocystis is an annual species, and the 6-month period wavelet corresponds to intra-growing-season changes in canopy cover, possibly due to changes resulting from wave events. These areas of high power at 6 and 12 months are not consistent throughout the dataset and only occur when there are spikes in canopy coverage, which points to years where kelp cover, even at the height of the growing season, was quite low. Another result to note is that in the first half of the timeseries, there is some power at the 5-year period. This could potentially be related to a climatic oscillation such as El Niño or the PDO.

2) Satellite Temperature – similar to kelp canopy, the highest power for this wavelet analysis was at the 12 month period. Unlike the kelp wavelet, this 12 month period was consistently high power for the entire duration of the timeseries.

My interpretation of the cross-wavelet plots I generated is:

1) Kelp Canopy and Satellite Temperature – kelp canopy and satellite temperature appear to both have high power at 12-month periods. The arrows in this 12-month band largely point to the right, which suggests that x and y are in phase. A few of the arrows also point down to some extent (around -30 degrees). This band is not constant, however, and the significance of this high-power band is lost around 1997 and 2003–2005. This suggests neither timeseries had strong annual cycles around those times. There are also a few other periods that appear to have a great enough amplitude to be considered significant. Perhaps most interesting is that the 5-year/60-month period appears to be significant throughout the series. The arrows along this band mostly point to the left and upwards, indicating that the two may be out of phase and that temperature (y) may be leading kelp (x). We saw a similar trend in the wavelet analysis for kelp, and this may be indicative of some kind of effect of climate oscillations.

2) Kelp Canopy and Local, Intertidal Temperature – This cross wavelet analysis was done on a dataset shorter than the kelp/satellite dataset. This cross wavelet power graph is ‘chunkier’ at the annual scale than that between kelp and satellite data. This cross wavelet shows a dip in power from 2003–2005, like the above graph did, but also a dip around 2009–2011. This indicates that the two timeseries are not cycling together as strongly as the kelp and satellite data did. However, this cross wavelet has additional high-power areas at periods of 3–11 months around 2014 and at 13–32 months from 2006–2010. There also appears to be some significant power at the 5-year range, similar to several previous power graphs, although this is hard to tell with a shorter timeseries.

Overall, my interpretation would be that both temperature datasets tend to cycle with kelp most strongly on the annual timescale, which makes sense. The local, intertidal dataset does not appear to cycle as strongly with kelp as the satellite-derived temperature does. My guess is that we got these results because local temperature is more susceptible to interruptions by relatively small-scale phenomena such as upwelling or snowmelt and therefore adheres less strongly to the strong global cycles stemming from sun angle and exposure. I also think that these power graphs taken together suggest that some kind of multi-year climate oscillation may be involved in cycles of kelp canopy cover, since several of these analyses found significant amplitude at a roughly 5-year frequency.

Critique: My critique of this method is that I feel like wavelet and cross wavelet analysis is less informative than I was expecting it to be. Some of the strongest results just show cycles at an annual period that have a lower amplitude some years, which is pretty obvious from just looking at the graphs. I still feel like I’m not interpreting my results correctly, and I think part of the reason is that it’s hard for me to stop thinking in terms of correlation after my first two exercises.

Aggregating Suitable Dam Habitat to Consider the Role of Habitat Size and Connectivity

Filed under: 2018,Exercise/Tutorial 3 2018 @ 11:47 am


How does the size and connectivity of beaver dam habitat relate to observed dam sites in West Fork Cow Creek?


In Exercise 2, I considered whether stream gradient, active channel width, and valley bottom width, described by Suzuki & McComb (1998) as the most important predictors of beaver damming, corresponded to the observed dam sites collected in our field surveys last fall. And while I found that their thresholds defining suitable damming habitat captured all observed dam sites, it was also clear that there was a significant amount of suitable habitat where no observed dams occurred. In other words, there was still quite a bit of ‘available’ habitat.

To understand why some of these habitats were used and others were not, I turned to work from landscape ecology, which posits that the size and connectivity of habitat is an important consideration in the resource selection of animals (Dunning et al. 1992). To consider whether this may help explain why dams were observed in some suitable habitats and not others, I conducted the ‘patching’ process and an OD Cost Matrix analysis to estimate the size and connectivity of suitable damming habitats using the Suzuki & McComb (1998) criteria; these measures are considered in this final exercise.

Steps used to complete the analysis:

Data preparation: Despite having already generated the size and connectivity measures for patches of damming habitat, there was still some geoprocessing required. In particular, I needed to identify which patches of habitat had actually been surveyed during our fall 2017 field work, as well as which patches were observed to have beaver dams somewhere in them. To do this, I used the Spatial Join tool, choosing the ‘INTERSECT’ method of identifying overlapping patches and survey sites. I repeated this process to identify patches with observed dam sites. Of the 49 habitat patches generated in Exercise 1, 29 coincided with our survey points. Of those, 48 dam sites were observed on only 4 of the 29 patches. All patches that were not surveyed were removed from the dataset, and fields for a dummy variable and dam counts were added for observed dams.

Variable generation: Using the data from the OD Cost Matrix analysis in Exercise 1, I had two variables of interest: 1) the length of contiguous habitat reaches, which I refer to as patch ‘Length’, and 2) the distance of a patch to its nearest neighboring patch when traveling through the stream network, referred to here as ‘Nearest Neighbor (NN) Distance’. For this exercise I was curious, however, whether the size of the NN was important, surmising that a patch would be more likely to be occupied if it was located 100 meters away from a large patch (e.g., 900 m) as opposed to a small patch (e.g., 100 m). As a result, I calculated an NN Length weighted by the distance to the NN patch (NN Length/NN Distance), referred to as ‘NNLwDist’.

Analysis: Similar to Exercise 2 I used the easyGgplot2 package to generate overlapping histograms of habitat patches with and without observed dam sites for these three variables. I also used this same package to generate density plots to show the relative distribution of patches with and without observed dams.  I then computed the means, standard deviation, and range for each variable with and without observed dams.
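easyGgplot2 is a non-CRAN wrapper; an equivalent overlapping histogram and density plot can be produced with plain ggplot2 (the data frame and column names here are hypothetical, simulated stand-ins):

```r
library(ggplot2)

# Simulated stand-in: 29 patches, 4 with observed dams (dam = 1)
set.seed(3)
patches <- data.frame(
  length = c(runif(25, 100, 900), runif(4, 900, 2000)),
  dam    = rep(c(0, 1), c(25, 4))
)

# Overlapping histograms of patch length by dam presence
ggplot(patches, aes(x = length, fill = factor(dam))) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 15)

# Density plot of the same comparison
ggplot(patches, aes(x = length, fill = factor(dam))) +
  geom_density(alpha = 0.5)
```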

Lastly, I compared the difference in means using a permutation test, because a standard Welch’s t-test was inappropriate due to the small and unequal sample sizes in each group. I also applied a Wilcoxon rank sum test, but this proved problematic due to a high number of ties in these data.
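A minimal sketch of the permutation test on the difference in mean patch length, assuming hypothetical columns length and dam (0/1) on simulated stand-in data:

```r
# Simulated stand-in: 29 patches, 4 with observed dams
set.seed(3)
patches <- data.frame(
  length = c(runif(25, 100, 900), runif(4, 900, 2000)),
  dam    = rep(c(0, 1), c(25, 4))
)

obs_diff <- with(patches,
                 mean(length[dam == 1]) - mean(length[dam == 0]))

# Shuffle group labels to break the link between dam and length
perm_diffs <- replicate(10000, {
  shuffled <- sample(patches$dam)
  mean(patches$length[shuffled == 1]) - mean(patches$length[shuffled == 0])
})

# Two-sided p-value: proportion of permuted differences at least as extreme
mean(abs(perm_diffs) >= abs(obs_diff))
```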


Distributions and Summary Statistics:

Table 1






Patch length: The histogram and density plot in figures 1 and 2 show a clear delineation in the patch length between those with and without observed dams.  On average, patches where dams were observed were more than 900 meters longer (z=3.6, p<0.0001) than those without (table 1).


Figure 1

Figure 2










Distance to Nearest Neighbor Patch: On average, patches with dams were nearly 350 meters closer to their nearest neighboring patch, and their distances overall were highly skewed toward the shortest values (95 m to 100 m), which is clearly visible in the density plot (figure 4). Patches without dams had a much wider distribution of distances (96 m to 1591 m). However, the permutation test did not find the difference in distances to be significant (z = -1.5, p = 0.14).

Figure 3

Figure 4










Nearest Neighbor Length weighted by Distance: The mean difference between patches with and without observed dams was 6.4 m, and the permutation test suggested this was significant (z = 3.001, p = 0.003). Patches with observed dams were more widely distributed than in the other metrics (2 m to 15 m), which is particularly apparent in the density plots.

Figure 5

Figure 6










Lastly, figures 7, 8, and 9 show plots of Patch Length by NN Distance and NNLwDist, as well as NN Distance by NNLwDist, respectively. The clustering of patches with observed dams by patch length is consistent with the discussion above; however, patches with observed dams also seem bounded, or highly clustered, at only small distances to the NN. Given a larger sample size, I suspect that differences in patches with and without dams would be more significant. Conversely, patches with observed dams are more widely distributed by NNLwDist, and it is curious that those differences are significant.


Figure 7

Figure 8

Figure 9










Critique of methods:

Given the simplicity of the analysis in this exercise, I don’t have a lot of critiques to offer. The fundamental challenge with analyzing these data was that they are zero-inflated, meaning that of the 29 patches generated, only 4 had observed dam sites. I had initially applied a logistic regression model to consider the probability of dams as a function of patch length, distance to nearest neighbor, and the weighted neighbor length, but refrained from providing the results, opting instead for the simpler analysis above given the small and lopsided samples. That said, the overall effort confirms that patch geometry seems to have a role in the occurrence of beaver dams. In particular, there is strong evidence of a difference in the size of patches where dams were observed compared to those where they were not, which is consistent with theory from landscape ecology suggesting that animals will preferentially select larger habitats over smaller habitats, all else being equal. Curiously, these data do not show compelling evidence of a difference in NN Distance between patches with and without observed dam sites, but they do for the NN Length weighted by the distance to that patch.



June 6, 2018

Geographically weighted regression and the location of behavior changes

Filed under: Exercise/Tutorial 3 2018 @ 6:07 pm


Exercise 2 determined that velocity gradient, water speed, and turbulent kinetic energy (TKE) gradient are the most important aspects of the principal components that predict the spatial distribution of behavior changes. However, because channel hydraulics vary on a small scale (e.g., centimeters) within the experimental channel, the relationship between each hydraulic variable and the location of behavior changes also varies spatially. For example, a multiple linear regression comparing the downstream distance of a behavior change and the three hydraulic variables of interest shows that the model residuals are non-uniformly distributed (Figure 1). As the portion of the boom passed increases, the model shifts from over-predicting behavior change location to under-predicting it. In this exercise, a geographically weighted regression was conducted to examine the relationship between the three hydraulic variables and the downstream distance of behavior changes at fine spatial scales. The objective was to determine if one or more independent variables have a spatially consistent relationship with the location of behavior changes.

Figure 1. A general linear model relating the distance downstream of a behavior change and three hydraulic variables (water speed, TKE gradient, and velocity gradient) shows non-uniformly distributed residuals. These results indicate that the relationship between independent and dependent variables may vary in space, and warrant analysis using a geographically weighted regression.




A geographically weighted regression (GWR) differs from multiple linear regressions by estimating local relationship coefficients for every point in the dataset, rather than assigning a global coefficient across the entire dataset. In this way, local variation in the relationship between independent and dependent variables can be seen. A 2014 blog post by Adam Dennett was very helpful in understanding and implementing a GWR using R. Once local estimates of regression coefficients between velocity gradient, water speed, and TKE gradient and the location of a behavior change were found, the data were visualized using Python. Only statistically significant relationships (p-value < 0.05) were displayed, so that local coefficients can be regarded with confidence.
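The mechanics of a GWR can be sketched compactly. The original analysis was done in R (following Dennett's tutorial, presumably with the spgwr package); below is a minimal Python illustration of the core idea: fitting a weighted least-squares regression at every observation point, with weights from a Gaussian distance kernel centered on that point. The data, function name, and bandwidth choice are all invented for illustration.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Estimate local regression coefficients at every observation point.

    coords: (n, 2) array of point locations
    X:      (n, p) array of predictors (an intercept column is added)
    y:      (n,) response vector
    bandwidth: Gaussian kernel bandwidth, in the same units as coords
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)   # distances to point i
        w = np.exp(-0.5 * (d / bandwidth) ** 2)          # Gaussian kernel weights
        W = np.diag(w)
        # weighted least squares: solve (X'WX) beta = X'Wy
        betas[i] = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)
    return betas

# toy example: the slope of y on x increases along the channel
rng = np.random.default_rng(0)
coords = np.column_stack([np.linspace(0, 10, 50), np.zeros(50)])
x = rng.normal(size=50)
y = (1 + 0.3 * coords[:, 0]) * x + rng.normal(scale=0.1, size=50)
local = gwr_coefficients(coords, x.reshape(-1, 1), y, bandwidth=1.0)
print(local[:3, 1])   # local slope estimates at the most upstream points
```

In the toy data the local slope estimates grow from the upstream to the downstream end, which is exactly the kind of spatial variation in coefficients that a global regression would average away.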


The relationship between hydraulic variables and the location of behavior changes varies substantially in space. Local coefficients of both TKE gradient and water speed range from positive to negative values (Figure 2 and Figure 3). That is, depending on location within the experimental channel, high values of TKE gradient may incite a behavior change relatively early or late as a fish passes the guidance structure. Similarly, high values of water speed may incite a behavior change relatively early or late as a fish passes the guidance structure.

Figure 2. Locally-derived coefficients between water speed and the location of a behavior change. Grey shading indicates the intensity of water speed.

Figure 3. Locally-derived coefficients between TKE gradient and the location of a behavior change. Grey shading indicates the intensity of TKE gradient.

Only velocity gradient demonstrates a consistent relationship with the downstream location of a behavior change (Figure 4). For all observed behavior changes, increasing velocity gradients correlate with increasing distance downstream of the behavior change. Stated another way, if a behavior change occurred at low velocity gradients, we can expect its location to be relatively far upstream. If a behavior change occurred at high velocity gradients, we can expect its location to be relatively far downstream. This differs from the local relationships between the location of behavior changes and TKE gradient and water speed, indicating that velocity gradient may be the best predictor of the distance downstream that a fish passes a guidance structure before changing its behavior. The variability of hydraulic intensity that incites a behavior change is a result of natural variability in a fish’s past experience, physiology, and perception of stimulus, making absolute thresholds of fish behavior difficult to determine. However, velocity gradient may most consistently predict the location of behavior changes in this experiment.

Figure 4. Local coefficients between velocity gradient and the location of a behavior change show consistently positive relationships. Grey shading indicates the intensity of velocity gradient.

Critique of methods:

Geographically weighted regression is a statistical tool that informs local variation in the relationship between independent and dependent variables. If a researcher has reason to believe that global coefficients don’t capture variation in space, a GWR is a useful, if slightly confusing, method of analysis.

Observed vs. expected time spent in vegetation types as a function of distance to water

Filed under: 2018,Exercise/Tutorial 3 2018 @ 10:32 am

Disclaimer – this post is long, particularly in the ‘steps taken’ section. I may use some of these methods in the Fall, so I made it very detailed.

Question asked

For this analysis I wanted to dig a little deeper into the relationship between vegetation type, distance to water, and the movement of the recreationist. The reason is that in Exercise 2 I learned that coniferous woodland dominates the study area and the hikers’ path. However, one could wonder whether people are less likely to spend time near water if the surrounding vegetation is dense with conifer. Perhaps dense vegetation is less appealing than the lake feature, and the hiker consequently keeps trekking until they reach a more open area. Understanding the significance of this relationship is important prior to applying the data to a hidden Markov model. Results from this exercise may indicate that I should include vegetation data in the model, because vegetation (such as coniferous woodland, meadows, etc.) may explain the movement of recreationists in addition to, or regardless of, distance to water.

My questions for this exercise are:

1.) What are the distributions of the length of the hikers’ tracks versus the time they spend in the area, as a function of distance to water?

2.) Are hikers spending significantly more or less time in certain vegetation types despite being near water?

Tool/Approach Used

I used ArcGIS and R to complete my analysis. I did a few things differently in this exercise in contrast to exercise 2. Steps outlined below:

(1) Simplify data and reduce to 1 minute time intervals between points: I am using a sub-sample of 5 GPS tracks. These tracks collected point data every 20 seconds. I decided to aggregate each point to 1 minute intervals to make it a little easier to analyze and interpret. To do this I used the dplyr package in R to aggregate the GPS points to 1 minute timestamps and average the xy coordinates.
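The aggregation step above was done with dplyr in R; the same logic can be sketched in pandas, flooring each timestamp to its minute and averaging the coordinates within each minute. The coordinates below are made up for illustration.

```python
import pandas as pd

# hypothetical 20-second GPS fixes for one track
track = pd.DataFrame({
    "time": pd.date_range("2018-07-01 09:00:00", periods=9, freq="20s"),
    "x": [100, 101, 102, 110, 111, 112, 120, 121, 122],
    "y": [200, 201, 202, 210, 211, 212, 220, 221, 222],
})

# floor each fix to its minute, then average the coordinates within that minute
per_minute = (track
              .assign(minute=track["time"].dt.floor("min"))
              .groupby("minute", as_index=False)[["x", "y"]]
              .mean())
print(per_minute)
```

Nine 20-second fixes collapse to three 1-minute rows, each carrying the mean position for that minute.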

(2) Import revised and reduced spatial dataframe to ArcGIS: I ran into a snag here. In the process of trying to convert the spreadsheet to a shapefile I kept losing the timestamp information. As a workaround, rather than converting the spreadsheet to a shapefile, I converted it to a dBASE table which (fortunately) preserved the timestamp information.

(3) Create 5 unique track layers: At this point, my data were still in one large table. I then split up the merged table into the five unique tracks and layers. To do this I used the select by attribute tool and create layer from selection tool.

(4) Build attribute table: I wanted the following variables in the attribute tables of each track: distance to water feature, vegetation type, length of track, timestamp, and xy coordinates. So far, I had the xy coordinates and timestamps. To calculate distance to water feature I used the joins and relates tool to calculate the nearest distance between each GPS point and the outline of the water feature (I had previously created a polygon of the water feature using the digitize tool in editor mode; see Figure 1). To extract vegetation data I used the spatial join tool to link the vegetation type to each GPS point (Figure 2).

(4b) Calculating length of track: the final attribute I needed was the length of the GPS track. To do this, I used the Tracking Analyst tool (thank you, Sarah, for the suggestion!) then track intervals to feature option. What this does is calculate the distance between one point and the subsequent point. This is why the time stamp interval was crucial for me to maintain. I needed to ensure that the tool was using the timestamp to determine the subsequent point for calculating length. This tool added a new column to the attribute table of each track denoting the calculated distance between each point to the next point. Now, I have information on the actual distance (in meters) traveled.

(5) At this point, I’m very happy. I finally have an attribute table with all the information I need for analysis. I then export the table as a text file. I import the table into R for manipulation, visualization, and analysis.

In R:

(6) Bin distance to water data: to create histograms representing proportions I needed to bin the distance to water data. I used the cut tool in R to bin the distance to water data into groups of 20 meters (i.e. 1 – 20, 21 – 40, etc.)

(7) Create dummy variables for vegetation type: I used ifelse statements with the dplyr package to create new columns for conifer presence (0 = not present, 1 = present), meadow presence, and residential facilities presence.
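Steps 6 and 7 can be sketched together. The original used R's cut() and dplyr ifelse() statements; this is an equivalent pandas version with hypothetical distances and vegetation labels.

```python
import pandas as pd

# hypothetical points: distance to water (m) and vegetation label
df = pd.DataFrame({
    "dist_to_water": [5, 18, 35, 62, 140],
    "veg": ["conifer", "residential", "conifer", "meadow", "conifer"],
})

# step 6: bin distance to water into 20 m classes: (0, 20], (20, 40], ...
df["dist_class"] = pd.cut(df["dist_to_water"], bins=range(0, 181, 20))

# step 7: dummy columns for each vegetation type (0 = absent, 1 = present)
for veg_type in ["conifer", "meadow", "residential"]:
    df[veg_type] = (df["veg"] == veg_type).astype(int)
print(df[["dist_class", "conifer", "meadow", "residential"]])
```

Note that pd.cut produces right-closed intervals by default, matching the 1-20, 21-40 style grouping described above.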

(8) Calculate total length of hikers path in vegetation type each distance class: I again used dplyr to calculate the total length of each hiker’s path grouped by the distance class. I then used a conditional statement to only calculate the length of data points in each vegetation type (i.e. conifer, meadow, residential).

(9) Calculate time spent in each distance class: I first created a column of 1’s to represent 1 minute per point. I then calculated the time each hiker spent in each distance class. I used a conditional statement to only calculate the time spent in each vegetation type (i.e. conifer, meadow, residential).

(10) I then calculated ‘Observed’ versus ‘Expected’ proportions.

Expected = length of [vegetation type] / total length within distance class (assumes people are traveling at a constant speed)

Observed = time spent in [vegetation type] / total time within distance class

This will tell me if people are spending more or less time in a certain vegetation type despite being close to water.

(11) I did a chi-square test for each vegetation type to see if the observed vs expected distributions were significantly different.
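The observed/expected comparison in steps 10 and 11 can be illustrated with hypothetical numbers (the original test was run in R; here the odds ratio and the Pearson chi-square statistic are computed directly so the arithmetic is visible).

```python
# hypothetical per-distance-class data for one vegetation type (conifer)
minutes_in_class = [60, 20, 10, 10]       # total minutes hikers spent in each class
length_share     = [0.5, 0.4, 0.3, 0.2]   # conifer track length / total length in class
observed_conifer = [18, 9, 4, 3]          # minutes actually spent in conifer

# step 10: expected minutes if hikers moved at constant speed
expected_conifer = [t * p for t, p in zip(minutes_in_class, length_share)]

# odds ratio: > 1 means more time in conifer than constant-speed travel predicts
odds = [o / e for o, e in zip(observed_conifer, expected_conifer)]

# step 11: Pearson chi-square statistic comparing observed to expected minutes
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_conifer, expected_conifer))
print([round(r, 2) for r in odds], round(chi2, 2))
```

With these made-up numbers the near-water class has an odds ratio below 1, i.e. less time in conifer than constant-speed travel would predict, which is the pattern the post reports for the real data.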

Description of Results Obtained:

The first three figures I generated demonstrate the overall proportions of the length of the hikers’ tracks grouped by distance class, and the time spent in each distance class. Figure 4 indicates that the five hikers spent 75% of their time 0-20 meters from the lake. Additionally, over 60% of the entire track length was 0-20 meters from the lakeshore. From a visual check, it looks like my hypothesis is being supported: people are spending more time near the water. I wonder if that’s a function of the lake or another variable?

Laura introduced me to a neat way to conceptualize differences in distributions, particularly when the distributions represent different units of measurement (i.e. time vs. length). Creating an odds ratio essentially divides the values of time by length to give a ratio. A value less than 1 indicates that people spent less time in the distance class than they would have if they had been traveling at a constant pace; a value greater than 1 indicates that people spent more time in the distance class than they would have if traveling at a constant pace. Figure 5 demonstrates that people were spending more time in the area when they were 0-20 meters and 61-80 meters from the lakeshore. I am ignoring the 141-160 and 161-180 distance classes because the sample sizes are too small. Once again, it looks like the hypothesis is being supported. But is this because of the water feature? Or vegetation? Or both?

Figure 6 illustrates the proportion of vegetation types within each distance class. It looks like, 0-20 meters from the lakeshore, the predominant vegetation types are conifer woodland (~50%) and residential and facilities (~40%), i.e. paved parking lots, toilets, picnic areas, and paved sidewalks.

The next set of figures illustrates the expected amount of time a person would spend in specific vegetation types within each distance class if traveling at a constant rate, compared to the observed amount of time that person actually spent in each vegetation type within each distance class. Coniferous woodland was the first vegetation type I examined. Figure 7 indicates that people were spending less time in conifer despite being closer to water! The further away from water, the more time they spent in conifer. The chi-square test indicates these differences in distributions are statistically significant (χ² = 90.359, df = 7, p-value < .001). Figure 8 demonstrates the odds ratio of observed/expected. Interestingly, Figure 5 suggests people are spending more time near water, yet when conifer is present this is not the case; I wonder if there is another vegetation type explaining this?

Figure 9 represents the observed versus expected distributions for herbaceous vegetation types. In other words, open meadows. Interestingly, it appears that the hikers spent more time in meadows than was expected (see Figure 10 for the odds ratio). It also appears that meadows are more prevalent further away from shore. Back in Figure 5 we learned that people are spending, on the whole, more time in areas 0-20 meters and 61-80 meters from shore. Perhaps for the 61-80 meter distance class this can be explained by the presence of open meadows. Interestingly, the chi-square test indicates that these distributions are not significantly different from one another (χ² = 7.4523, df = 3, p-value = .06).

Figure 11 is very revealing, as it indicates that people are spending more time in residential facilities that are close to shore. This includes areas with picnic tables, paved sidewalks, and some areas of the parking lot. This makes sense, as I imagine people are drawn to the amenities offered. Also, the chi-square test indicates that the differences in these distributions are significant (χ² = 197.86, df = 3, p-value < .01). Seeing these results, I wonder if next time I should modify my hypothesis and examine if/how distance from vehicle explains behavior.

Critique of Method

I created a workflow that made sense to me, but overall the process was very time consuming, particularly in ArcMap. In my experience thus far, I have appreciated the visualization ArcMap provides and find the tools easy to learn and navigate. However, during this particular exercise ArcMap was especially clunky and slow to process. I was glad to use ArcMap so I could gain experience using the software, but many hours elapsed before I could build an adequate attribute table for subsequent analysis in R. Fortunately, thanks to this class, I am also growing more and more comfortable in R, so, with relative ease, I was able to quickly modify my data and build code that allowed me to analyze and visually represent the data.

Another limitation and critique of this method is that I am only using a conveniently chosen subset of 5 GPS tracks. Therefore, my sample size of humans (not GPS points) is low and not randomly selected; thus, I can’t generalize beyond the five hikers in the subsample. Given the time constraints at this point in the term, I decided to stick with the five hikers rather than boosting my sample size. Despite the low sample size, this type of analysis introduced me to a new approach for examining relationships between recreationists and their surrounding environment. I plan to adopt these methods for future research.

Driven by curiosity, at the end of this exercise I looked at the tracks in ArcMap again. It appears that the majority of use is occurring near the parking lot. Additionally, the ‘meadow’ in this dataset is sandwiched between residential facilities (i.e. the parking lot and picnic tables). This situation makes me wonder if distance to vehicle or parking lot is also playing a role in the response. Further, I wonder if I should even classify that polygon as meadow, especially since it is unclear if people are spending time there because of the vegetation or because of the distance to the vehicle/easy access to facilities. If I had more time, I would examine this additional variable.

June 5, 2018

Comparing SnowModel output to metrics of Dall Sheep recruitment.

Filed under: 2018,Exercise/Tutorial 3 2018 @ 4:16 pm

Question asked

Is Dall Sheep recruitment more influenced by near-summer snow conditions or do early snow season conditions also play a role?

A typical metric for assessing sheep recruitment, i.e. the number of young animals available to ‘recruit’ into the population, is the lamb to ewe ratio (hereafter referred to as lamb:ewe). In the case of Dall sheep, demographic surveys take place most frequently in the summer months of June and July, after the lambing months of April and May. Here I will examine the prevailing theory that spring snow storms, by causing high lamb mortality, are a cause of Dall sheep population decline. I do this by comparing summer lamb:ewe ratios to monthly and seasonal aggregations of snow data derived from a spatially explicit snow evolution model run at daily timesteps.

Data / Tool / Approach used

Snow Data

The snow data for this analysis is derived from SnowModel, a spatially explicit snow evolution model, and consists of daily mean snow depth, total snowfall, mean air temperature, and forageable area (the percentage area under snow depth and density thresholds that allow Dall sheep to graze) aggregated into monthly and seasonal means and totals.

Sheep Data

Sheep data used here is from annual surveys completed by Alaska Department of Fish and Game, Bureau of Land Management, US Fish and Wildlife Service and National Park Service in 6 different Dall Sheep ranges – see blogpost 2. Survey methods include distance sampling, stratified random sampling, and minimum count methods either from a ground location or fixed wing aircraft.

Approach used and steps followed

At this initial stage a simple approach was employed to test the research question: counting, by month and season, the observations that satisfy the following conditions.

  1. High month/season snowfall and low lamb:ewe ratios
  2. Low month/season air temperature and low lamb:ewe ratios
  3. High month/season snow depth and low lamb:ewe ratios
  4. Low month/season forageable area and low lamb:ewe ratios

The basis of these statements is the assumption that increased snowfall and snow depth, or lower air temperatures and forageable area, produce conditions where greater energy expenditure is required for survival. Dall sheep are in calorific deficit during the snow season, so benign conditions mean that ewes reach the lambing period in better condition and are potentially more able to provide for their lambs, increasing the observed lamb:ewe ratio. An alternative, or complementary, idea is that conditions during and after lambing are more important, as lambs require a narrower range of conditions than adult sheep to survive.

To test these conditions the snow data and lamb:ewe ratios were converted into anomalies and coded as being strongly/weakly positive or negative based on whether they were outside or within one standard deviation and the direction of their sign. The above conditions were then assessed based on this coded data via the plotting of heatmaps and tallying occurrences (see results below).
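A minimal sketch of that coding scheme, assuming anomalies are taken relative to the series mean and "strong" means beyond one standard deviation (the original analysis was in R; the snowfall values here are hypothetical):

```python
import numpy as np

def code_anomalies(series):
    """Anomalies relative to the series mean, coded by sign and by whether
    they fall outside one standard deviation."""
    x = np.asarray(series, dtype=float)
    anom = x - x.mean()
    sd = x.std()
    codes = np.where(anom > sd, "strong+",
            np.where(anom >= 0, "weak+",
            np.where(anom >= -sd, "weak-", "strong-")))
    return anom, codes

# hypothetical monthly snowfall totals (cm)
snowfall = [40, 55, 62, 38, 90, 50, 47, 58]
anom, codes = code_anomalies(snowfall)
print(list(codes))
```

The coded labels are what feed the heatmaps and tallies: a condition such as "high snowfall" is met wherever the code is positive (or strongly positive, in the stricter tests described later).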

This approach is not complex but does begin to examine which time-scales and time periods of the snow season are important for Dall sheep, insight that can later be used in more complex predictive models.

The analysis took place in R, and the biggest hurdle was passing column names as arguments into functions. This was overcome by learning to use ‘enquo()’, the tilde ‘~’, and the nuances of standard versus lazy evaluation that govern whether to use the underscore variants of functions (e.g. aes_()) in dplyr pipes.

Brief description of results obtained

In the following section I will present graphs primarily from the Wrangell St. Elias domain. Analysis was also conducted in the five other domains, but in the interest of brevity I am limiting the number of figures.

Above average snowfall and below average lamb:ewe ratios by month


The heatmap, top left, shows in blue the months where the condition is met (the x axis is the same as in the bar chart beneath). The dark grey bars correspond to years when a sheep survey took place (y axis; 15 out of 37). On the right panel, the scatter diagram shows the lamb:ewe anomaly for each year.

From fig. 1 we can see that 9 of the 15 years of sheep surveys had a lamb:ewe ratio below the mean. Of these nine years the most common months that had higher than average snowfall were October and November, with 6 each. By contrast, the months believed to be important to lamb survival (Apr, May, June) only have 4 recorded instances of above average snowfall. Each year with below average lamb:ewe ratios had at least 4 months of higher than average snowfall (excluding August, which comes after the survey). 50 out of the 99 months (9 years with below average lamb:ewe ratios, excluding August) have above average snowfall.

Below average air temperature and below average lamb:ewe ratios by month


Fig. 2, by contrast, shows air temperature by month. Here we see that May has the greatest number of months where the condition is met (n = 6). However, October and November have the same number of instances as March, June, and July. 48 out of 99 months agree with the condition.

Above average snow depth and below average lamb:ewe ratios by month


Mean snow depth, fig. 3, shows 4 to 5 instances per month meet the condition from October to June for years with low lamb:ewe ratios. 45 out of 99 months agree with the condition.

Below average forageable area and below average lamb:ewe ratios by month


The autumn months of September to November have comparatively few instances where the below average forageable area condition is met (n = 4 to 5), compared with December through May, where at least 6 out of 9 years show the condition met. In February the condition is recorded 8 times. 66 of 99 possible months agree with the condition.

By season


When considering the conditions by season, both autumn and spring snowfall meet the condition in 6 years out of 9. Above average snowfall is seen in 19 out of 36 possible seasons in low lamb:ewe ratio years. Summer air temperature meets the condition 6 times, winter and spring 5 times each, and autumn just twice; 19 out of 36 possible seasons show lower than average air temperature. Snow depth does not meet the condition more than 5 times for any season (winter and summer), and 17 out of 36 seasons meet the condition. Forageable area meets the condition in 23 out of 36 seasons, with winter having the highest count: 7 out of 9 years.

Conclusions / Critique of method

This method was a simple approach to examining which variables, and at which times, could have an effect on Dall sheep summer recruitment. Both by month and by season, below average forageable area had the most recorded instances of co-occurring with below average lamb:ewe ratios: 66 out of 99 possible months and 23 out of 36 seasons. Snow depth did not appear as important as either snowfall or air temperature in monthly or seasonal comparisons.

A critique of this method is that it doesn’t capture instances where the opposite of the condition occurs, e.g. high lamb:ewe ratios and high forageable area. It also doesn’t test the significance of any relationships, and it is susceptible to anomalies affecting the limited sample size, and its mean, in the lamb:ewe ratios. The same tests presented above, but with conditions describing instances of a snow variable and lamb:ewe ratios outside of one standard deviation, did not produce any meaningful patterns, with occurrences isolated to single months or seasons, if present at all. Despite these limitations, this approach does give insight for exploring the variables using more complex regression methods.

June 4, 2018

Shift in the lag of GPP with soil water content at a C4 grassland site

Filed under: 2018,Exercise/Tutorial 3 2018 @ 4:48 pm

1. Question asked:

How is soil water content related to gross primary production (GPP) and the light use efficiency of photosynthesis (LUE) of a natural grassland? How do these relationships differ between C3- and C4-dominated grasslands? Specifically, how do they vary interannually among water years?

2. Tool / approach used:

To answer this question, I calculated the cross-correlation of GPP and LUE with soil water content (SWC). I then compared the maximum cross-correlation, as well as the lag at which the maximum cross-correlation occurs. I performed these calculations separately for two sites, and for each water year. Water years are measured from October to September and, in contrast to calendar years, are a more biologically meaningful unit for examining the relationship between plant growth and water variables.

My data are derived from eddy covariance flux tower locations in natural grasslands in Eastern Kansas. Soil water content, like other eddy covariance flux data, is recorded at 30-minute intervals. Here, I have aggregated flux and environmental data to daily sums and averages.

3. Steps followed to complete analysis

Data prep and selection

– Because SWC data are most consistently available at both sites between 2009-2014, I limited my analysis to data collected between those years. I further limited the data to 2010-2013 due to data availability.

– I also initially examined the water use efficiency (WUE) of production, but WUE, calculated as GPP / evapotranspiration (ET), has distinct dynamics from GPP or LUE and does not appear to be strongly cross-correlated with soil water content, so I have omitted it from this analysis.

Plotting GPP and light use efficiency (LUE) against soil water content, for each water year, helps visualize the initial lag correlation between the variables. Here it seems evident that the water year captures the temporal extent of the relationship between production and soil water content: a hump in soil water content (dashed line) precedes a hump in GPP and LUE at each site, in each year.

Visual assessment of additional environmental variables that may be related to production

This plot shows the annual timecourse of cumulative precipitation and GPP for water years from 2010-2013. This was an attempt to investigate more closely the environmental conditions that might be driving the soil water content and its relationship with production. However, the cumulative annual precipitation, calculated as the rolling sum of precipitation measured at the flux tower over the course of a water year, shows very low values of precipitation in 2012 and 2013 at KFS and Konza, respectively. This suggests that there was an error in the collection of the precipitation data. As I proceed with this analysis, I will substitute DayMet climate data for the site-level flux data, which will sacrifice spatial resolution but should improve data quality.

Analytical Methods

I used the ccf() function to calculate cross-correlation between GPP and SWC, and between LUE and SWC. First, I calculated the cross-correlation of GPP and LUE, with soil water content, at each site separately, for all years from 2010-2013.

However, I’m most interested in comparing patterns of cross-correlation among years, particularly with regard to the timing of peak cross-correlation in 2012, a drought year. Does the drought year have a distinct pattern of cross-correlation between GPP, LUE, and soil water content than a non-drought year?

I then used the ccf() function to calculate the cross-correlation of GPP and LUE with soil water content at each site, for each year from 2010-2013, in order to assess interannual variability in these cross-correlations. To visualize trends both between sites and between years, I separately plotted each site by year, as well as each year by site.

Lastly, to better visualize the shift in timing and magnitude of cross-correlation at each site, I extracted the date and value of the maximum and minimum cross-correlation and plotted these values for each year.
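The core of what R's ccf() computes can be sketched as a Pearson correlation at each lag, after which the lag of maximum correlation is extracted. The example below uses synthetic SWC and GPP series in which GPP trails SWC by a known number of days, and recovers that lag; the function name and data are my own, not from the original analysis.

```python
import numpy as np

def ccf_at_lag(x, y, lag):
    """Pearson correlation of x[t] with y[t + lag]; positive lag means x leads y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:lag]
    return np.corrcoef(a, b)[0, 1]

# synthetic daily series: GPP peaks 10 days after soil water content
t = np.arange(200)
swc = np.sin(2 * np.pi * t / 100)
gpp = np.roll(swc, 10)

lags = range(0, 31)
corrs = [ccf_at_lag(swc, gpp, k) for k in lags]
best = max(lags, key=lambda k: corrs[k])
print(best)   # lag (in days) of maximum cross-correlation
```

Applied per site and per water year, this is the quantity whose timing shift, from roughly 100 days down to roughly 30 days at the C4 site in 2012, the results below describe.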

4. Results obtained

Cross-correlation of production with soil water content, for all years

First, I calculated cross-correlation between GPP and SWC, and between LUE and SWC, for all years at both sites. Thus, this is the “average” cross-correlation between production and soil water content, which has a peak maximum cross-correlation at a lag of about 100 days for both GPP and LUE. However, I’m most interested in looking at how interannual variation in climate affects plant production.

Cross-correlation between production and soil water content, by site and water year

When we examine how the cross-correlation between GPP and soil water content varies between sites and between years, we see a clear shift at Konza Prairie, the C4 grassland, in 2012. This shift indicates that GPP and LUE are peaking earlier in relation to soil water content: with a lag of about 30 days, as opposed to 100 days at the Kansas Field Station and at Konza in previous years.

Because 2013 followed the severe drought year, 2012, this shift suggests that the C4 grasses at Konza may be more flexible in shifting the timing of their water use in response to drought or other irregular moisture conditions.

To visualize this further, I also extracted the value and lag date of the maxima and minima of the cross-correlation function to visualize how these points change over time at each year.

Interannual variation in the timing and value of maximum cross-correlation

Plotting the date of the maximum cross-correlation, as it changes over years at each site, also helps visualize the divergence in the timing of the relationship between the production indices and soil water content.

5. Critique of the method

While this method is useful in assessing which variables are cross-correlated, and when those relationships peak, it does not clarify which environmental variables are controlling the production dynamics that I am seeing. In order to determine whether soil moisture content, or another variable (like cumulative precipitation, air temperature, soil temperature, a drought severity index, or others) is more responsible for the production dynamics, I will need to build and compare linear mixed models that predict GPP and LUE, from a suite of environmental variables, for sites at different years. This will help me assess which environmental variables may be more or less important in certain years, and better quantify their relationships with production.

This task will be made more complex as the timing of the relationships, where cross-correlation is a strong factor, change over time. So, it may be difficult to construct an appropriate linear mixed model with the most influential lag variable.
