GEOG 566






         Advanced spatial statistics and GIScience

Archive for 2018

June 11, 2018

Analysis of flowering phenology as a step towards understanding phenological synchrony

Filed under: 2017,Final Project @ 5:55 pm

BACKGROUND: While the cinnabar moth (Tyria jacobaeae) was intentionally introduced to North America as a biological control insect, its release had unintended consequences as it established on a non-target native perennial herb, Senecio triangularis (Diehl and McEvoy, 1988). Populations of S. triangularis colonized by the cinnabar moth can experience heavy foliar herbivory (up to 100%) without showing reductions in plant size or seed set (Rodman, 2017). Herbivory of flowers is less common given a mismatch between the flowering time of S. triangularis and peak feeding stages for the moth, but when it does occur, floral/seed herbivory may have direct impacts on population dynamics. Previous work has shown that larvae experiencing phenological synchrony with S. triangularis flowers decreased seed set by 95% (Rodman, 2017); and that S. triangularis is seed-limited, so that reduction of seed set decreases seedling recruitment to the next generation (Lunde, unpublished data).

Phenological synchrony is defined in this study as the overlap of “virulent” larval stages (late stages: L4 and L5) and “vulnerable” flower stages (early stages: primordia (P), buds (B), young flowers (YF) & flowers (F)). Synchrony varies by site and year, and the environmental drivers causing this variation are unknown. Knowing which populations of S. triangularis would be most likely to experience seed herbivory by cinnabar larvae could help managers track and respond to cinnabar moth presence and the risk to S. triangularis on a site-by-site basis. Because phenological synchrony was a rare occurrence for the five sites represented in 2017 data, most of the analyses described here focus on characterizing the timing of flowering of S. triangularis, while later work will focus on the timing of the L4 and L5 stage “virulent” cinnabar larvae.

 

RESEARCH QUESTION: How does the timing of “vulnerable” flower stages differ by site? Can I describe and quantify between-site differences in the timing of these stages? And how well do differences in candidate environmental factors account for differences in the occurrence of vulnerable flowers?

 

DATASET: This project used data collected from July to September, 2017, at five sites in a region of the Willamette National Forest near Oakridge, Oregon. A sixth site, Bristow, was lost due to fire, but is occasionally present in data analyses. Phenology scores (for cinnabar larvae and Senecio triangularis) and environmental variables were taken at each site during surveys at regular 10-day intervals. Some data were taken at the individual plant level, while some variables (such as snowmelt date) could not be measured in a meaningful manner below the site level. Data details:

  1. Population locations: a polygon layer outlining the 5 meadow sites surveyed over the 2017 season.
  2. Phenology: counts of vulnerable and invulnerable flower heads (capitula), and counts of cinnabar larvae in peak feeding stages (4th & 5th instars), collected from random subsets of tagged plants during each survey. Binary response variables created from phenology scores included plant vulnerability (1=vulnerable, 0=invulnerable), larval virulence (1=virulent larvae present, 0=absent), and phenological synchrony (1=co-occurrence of virulent larvae & vulnerable plants, 0 = one or none occurred).
  3. The following environmental variables are included in the phenology dataset: ambient temperature at 36 inches from the ground; soil temperature and soil moisture at 6 inch depth. Soil moisture and temperature readings were taken at each individual plant.
  4. Growing degree days for each site and survey day were modeled using a single triangle method, with developmental thresholds set at 5˚ C (minimum) and 37˚ C (maximum). Growing degree days are defined as accumulated daily heat gain between these developmental thresholds. The thresholds were estimated from previous studies on the phenology of alpine flowers (Kudo and Suzuki, 1999), though I recognize that 10˚ C may be more commonly used as the minimum temperature for development of insects.
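The single triangle calculation described above can be sketched compactly. This is a minimal illustration (Python, for brevity; the analysis itself was done in R) of accumulating heat between a lower and upper developmental threshold, assuming daily minimum and maximum temperatures and a horizontal cutoff at the upper threshold; the post does not state which upper-cutoff variant was used, so that part is an assumption.

```python
def _tri_above(tmin, tmax, thr):
    """Area of the day's triangular temperature curve above a threshold."""
    if tmax <= thr:
        return 0.0
    if tmin >= thr:
        return (tmax + tmin) / 2.0 - thr
    # Triangle only partially above the threshold.
    return (tmax - thr) ** 2 / (2.0 * (tmax - tmin))

def single_triangle_dd(tmin, tmax, lower=5.0, upper=37.0):
    """Degree-days for one day: heat above the lower threshold, with a
    horizontal cutoff so heat above the upper threshold is not counted."""
    return _tri_above(tmin, tmax, lower) - _tri_above(tmin, tmax, upper)
```

Summing `single_triangle_dd` over consecutive days gives the accumulated growing degree days used as the season-time axis throughout the analyses below.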

 

HYPOTHESES: Alpine and subalpine perennial plant species have been shown to vary widely in terms of which environmental factors drive flowering phenology (Dunne et al., 2003). I expected to see differences in flowering phenology between sites, and I expected those differences to correlate with differences in the abiotic environment: growing degree days, soil temperature, and soil moisture. In order to test this, I also needed to explore how these three environmental variables were correlated with one another, so that I could understand issues underlying the analysis of these explanatory variables.

Because insects are ectotherms whose development is largely constrained by their ability to gain heat externally, the phenology of insects is often predicted from growing degree days (Johnson et al., 2007). I would also expect to see variation between sites in the timing of occurrence of virulent larvae, perhaps driven more strongly by degree days and less strongly by soil moisture and temperature; however, I was not able to address that hypothesis in the analysis presented here.

 

APPROACHES:

  • Lognormal Regression of Flowering Phenology Data – As a first approach to understanding the patterns of flowering phenology in my 2017 data, I used the fitdistrplus package in R to compare and fit phenomenological curves to counts of vulnerable (F-stage) capitula on given degree days. This required combining the degree days of each observation date with the capitula counts into a single response variable, creating a frequency distribution that a density function could be fit to. I used the splitstackshape package and methods described in Emerson and Murtaugh (2012) to create an observation of time (in growing degree days) for each event (F-stage capitulum). Lognormal curves were then fit with the nls() function, with parameter estimates given by the fitdist() function in fitdistrplus. The goodness of fit for the lognormal curve fit to all observations was assessed for data from each individual site as an attempt to describe variation in the timing of F-stage flowering by site. Visual comparison of lognormal curves fit to each site gave further insight into left-right shifts of these curves.
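The expand-to-events step in Emerson and Murtaugh's method can be sketched briefly. This is a Python illustration of the idea with hypothetical count data (the analysis itself was done in R with fitdistrplus): each (degree-day, count) pair is expanded into one observation per capitulum, and lognormal parameters are then just the mean and standard deviation of the log-transformed event times.

```python
import math
import statistics

def expand_rows(counts):
    """One observation per event: (degree_day, n_capitula) pairs become
    n repeated degree-day values, following Emerson and Murtaugh (2012)."""
    return [gdd for gdd, n in counts for _ in range(n)]

def lognormal_params(times):
    """Maximum-likelihood lognormal parameters: mean and (population)
    standard deviation of the log-transformed event times."""
    logs = [math.log(t) for t in times]
    return statistics.fmean(logs), statistics.pstdev(logs)
```

In R the equivalent expansion is a one-liner with a row-expanding helper, after which `fitdist(..., "lnorm")` recovers the same two parameters.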

 

  • Assessing correlation in soil temperature and moisture data – In order to examine patterns in the variation of soil temperature and moisture, particularly with respect to time, I converted dates and times to POSIXct objects using the lubridate package. After examining soil temperature against time of day of survey, I used the lm() function to fit linear regressions to this data series, and used the mutate() function within the dplyr package to create a new variable of “adjusted soil temperature” normalized to a common time-of-day (noon). I fit a linear regression to soil temperature against growing degree days for all sites, and repeated this for each individual site to see if time-of-season had significantly different impacts on soil temperature per site. I also looked at the relationship of soil moisture with time-of-day and time-of-season, but found less distinct trends and did not normalize for time-of-day.
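The time-of-day adjustment amounts to fitting a line of temperature against hour of reading and shifting every reading along that line to noon, keeping its residual. A minimal sketch (Python for illustration; the analysis used lm() and dplyr::mutate() in R):

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def adjust_to_noon(hours, temps):
    """Remove the fitted time-of-day trend, re-expressing every soil
    temperature as if it had been read at noon (hour 12)."""
    b = ols_slope(hours, temps)
    return [t - b * (h - 12.0) for h, t in zip(hours, temps)]
```

A reading taken at 10 am on a warming morning is shifted up to its noon equivalent, so that remaining variation reflects site and season rather than survey schedule.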

 

  • Logistic Regression of probability of plant “vulnerability” against growing degree day, soil temperature, and soil moisture – By simplifying phenology scores per capitulum to a binary response of “vulnerable” or “invulnerable” for an entire plant, I was able to analyze how the probability p of a plant being vulnerable varied with growing degree day, soil temperature, or soil moisture using logistic regressions (Hosmer et al., 2013). I used the glm() function in R to fit logistic regressions (family = binomial) to the binary data against growing degree days, adjusted soil temperature (see part 2), and soil moisture. I used the predict() function to test how these fitted models performed in classifying plants as vulnerable or invulnerable with a threshold value of 0.50 (Datacamp, 2018). I compared predicted response curves for the five sites visually, and made individual pairwise comparisons between sites using the coefficient estimates for site contrast parameters (Faustini, 2000).
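The classification step with a 0.50 cutoff can be sketched as follows (Python for illustration; b0 and b1 here are hypothetical coefficients, not the fitted values from the analysis): the fitted logistic curve is evaluated at a predictor value and the plant is called vulnerable when the predicted probability crosses the threshold.

```python
import math

def classify_vulnerable(gdd_hundreds, b0, b1, threshold=0.5):
    """Evaluate a fitted logistic curve at one predictor value and
    classify with a 0.50 cutoff. b0, b1 are hypothetical coefficients."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * gdd_hundreds)))
    return int(p >= threshold)
```

Comparing these 0/1 calls against the observed vulnerability column gives the classification rate used to judge each fitted model.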

 

RESULTS: This project was an exploration of the data I had gathered in 2017, and produced a series of comparative figures showing variation in flowering phenology, measured environmental factors, and the relationship between the two. Early in analysis, a frequency distribution of stages counted per site per date showed that phenological synchrony (co-occurrence of F stage or prior with L4 and L5 larvae) was quite rare in the data (Figure 1).

Flowering phenology (as proportion of capitula counted in stage F) was best characterized by lognormal curves using scaled growing degree days (in hundreds). A curve was fit to data from all sites, and then both root mean squared error (RMSE) and mean absolute error (MAE) were calculated for data from each individual site relative to this all-site curve (summarized in Figure 2).
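The two error measures used here are simple to state; a minimal sketch (Python for illustration), comparing each site's observed F-stage proportions against the all-site curve's fitted values:

```python
import math

def rmse(obs, pred):
    """Root mean squared error between observed and fitted proportions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error between observed and fitted proportions."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)
```

RMSE penalizes large site-level departures from the shared curve more heavily than MAE, which is why reporting both gives a fuller picture of per-site fit.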

Figure 2. Lognormal curve fit by nls shown against F stage observations by site. Inset shows root mean squared error (RMSE) and mean absolute error (MAE) calculated overall and by site.

Figure 3. Lognormal curves of proportion of capitula found in F stage against growing degree days (hundreds).

This method showed some variation in goodness of fit but didn’t lend insight into the direction or magnitude of “shift” in timing between sites. A visual comparison of individual curves fit to each site showed slight shifts in the peak of each curve by site (Figure 3). This could be partially caused by the fact that a single survey represented different degree days between sites. However, Buck and Waterdog, which were modeled as having the same degree days, had different peaks in this figure, showing some ability to distinguish timing by site using this method.

Analysis of soil temperature found that data had a lurking effect of time-of-day of survey, which was characterized with a linear regression (Figure 4).

Normalizing soil temperature by time-of-day to a common time (Noon) allowed seasonal trends in soil temperature to emerge (Figure 5), where originally the impact of time-of-day obscured these trends (Figure 6).

Figure 5. Trends of adjusted soil temperature across the season (measured in hundreds of growing degree days).

Figure 6. Trends of non-adjusted soil temperature across the season (measured in hundreds of growing degree days)

Soil moisture was not found to be impacted by time of day, but was found to have low variability between most sites except one (Figure 7).

Logistic regression of a per-plant measure of “vulnerability” (as a binary response, 1 = any vulnerable capitula present and 0 = no vulnerable capitula present) against growing degree day showed a clear effect of growing degree day on the probability of plant vulnerability, with increasing degree days leading to a decrease in vulnerability with a clear threshold-like response (Figure 8).

Figure 8. Logistic regression of plant vulnerability (vulnerable = 1) against growing degree days (hundreds) with fitted curve shows a strong switch in probability of vulnerability from high to low with increasing growing degree days.

A similar regression against soil temperature (adjusted for time-of-day) showed that increasing soil temperature also decreased the probability of plant vulnerability, but with a more gradual response rather than a threshold-type response (Figure 9).

Figure 9. Logistic regression of plant vulnerability (vulnerable = 1) against adjusted soil temperature (°F) with fitted curve shows increasing adjusted soil temperature accounts for some decrease in probability of vulnerability.

Per-site logistic regression of plant vulnerability against soil temperature showed that models (shown as predicted response curves) fit better for some sites than others, and slopes appeared to differ but not inflection points (Figure 10). This difference could be further resolved by fitting logistic regressions to pairs of sites and analyzing the coefficient of the interaction term (which represents the difference in slope for the second site) (Faustini, 2000). The poor fit for the Blair data could be due to one missing survey date of soil temperature measurements.

Figure 10. Logistic regression of probability of plant vulnerability against adjusted soil temperature (degrees F) for all sites (grey dashed curve) and individual sites.

Soil moisture had an opposite relationship with plant vulnerability, with increasing soil moisture associated with increasing probability of plant vulnerability, using the same methods described above (Figure 11). Comparisons of predicted response curves by site for soil moisture showed a variety of slopes depending on the site, with all curves fitting reasonably well (Figure 12).

Figure 11. Logistic regression of probability of plant vulnerability against soil moisture (percent) shows an increase in probability of vulnerability with increased soil moisture.

Figure 12. Logistic regression of probability of plant vulnerability against soil moisture (percent) for all sites (grey dashed curve) and individual sites.

 

SIGNIFICANCE: Previous work showed that the loss of seeds due to cinnabar larval herbivory could significantly impact population dynamics for Senecio triangularis. The analyses presented here showed the variation in plant phenology, environmental factors, and the relationship between the two for the overall data set and for each site, individually. The biggest finding of this project was that phenological synchrony was too rare in this data set to model directly, as I had originally set out to do. Focusing instead on flowering phenology allowed development of methods and a greater understanding of the data captured in 2017.

For each variable examined, differences could be seen between sites with each site showing distinct behaviors. Further analysis of a full logistic regression model including all terms of interest could help understand the relative contributions of each explanatory variable to the change in probability of plant vulnerability.

For certain variables, it seems I may not have captured a wide enough range of variation in focal explanatory variables to draw clear connections between differences in phenology and these explanatory factors. For instance, soil moisture was largely similar between four sites and then quite distinct for a fifth site, Buck Mountain. Future work will focus on capturing greater variation in the abiotic environment, and hopefully also capturing greater variation in plant phenology.

 

LEARNING OUTCOMES: Through this project, my understanding and flexibility in R coding has increased by leaps and bounds. This includes restructuring data frames using dplyr pipelines (wide to long data, and collapsing nested data to compare at larger spatial/temporal scales), subsetting at will, and plotting data from every angle (mostly with ggplot). I discovered many useful packages (fitdistrplus, lubridate, splitstackshape) and delved back into some previously encountered but less well understood analyses (logistic regressions). I used ArcMap a few times to extract GPS points and other meadow statistics, and I hope to weave together the current analyses with explanatory variables not yet fully explored after the conclusion of this course.

 

STATISTICAL LEARNING: In terms of statistics used in this project, I learned about fitting phenomenological curves to data and analyzing goodness of fit using RMSE and MAE. I also learned about de-trending data to remove lurking effects such as the time-of-day effect in my soil temperature data. This will be very useful when dealing with other environmental factors that vary with time-of-day, and for which I do not have continuous data. I learned about fitting and analyzing logistic regression models to binary data using visual comparisons as well as numerical comparisons and interpretations for changes in slope and intercept. I was also able to understand and use methods developed by Emerson and Murtaugh (2012) for measuring “time of stage” for phenological data.

 

REFERENCES:

Datacamp. (2018). Logistic Regression in R Tutorial. https://www.datacamp.com/community/tutorials/logistic-regression-R

Diehl, J.W., and McEvoy, P.B. (1988). Impact of the Cinnabar Moth (Tyria jacobaeae) on Senecio triangularis, a Non-target Native Plant in Oregon. In Proceeding VII International Symposium on Biological Control of Weeds, (Rome, Italy), p.

Dunne, J.A., Harte, J., and Taylor, K.J. (2003). Subalpine Meadow Flowering Phenology Responses to Climate Change: Integrating Experimental and Gradient Methods. Ecol. Monogr. 73, 69–86.

Faustini, J.M. (2000). Stream channel response to peak flows in a fifth-order mountain watershed. PhD Thesis.

Hosmer Jr, D.W., Lemeshow, S., and Sturdivant, R.X. (2013). Applied logistic regression (John Wiley & Sons).

Kudo, G., and Suzuki, S. (1999). Flowering phenology of alpine plant communities along a gradient of snowmelt timing.

Murtaugh, P.A., Emerson, S.C., Mcevoy, P.B., and Higgs, K.M. (2012). The Statistical Analysis of Insect Phenology. Environ. Entomol. 41, 355–361.

Rodman, M. (2017). Non-target Effects of Biological Control: Ecological Risk of Tyria jacobaeae to Senecio triangularis in Western Oregon. Oregon State University.

Exploring historical constraints on fire across the Central Oregon Pumice Plateau

Filed under: 2018,Final Project @ 5:45 pm

Question

My original question was how historical fire occurrence varies spatially on the central Oregon pumice plateau. My question evolved into asking how constraints on fire (climate, fuel abundance and continuity, and lodgepole pine forest) influenced the occurrence of fire.

Dataset

My dataset consists of records of fire occurrence for 52 sample points distributed over an 85,000 ha area.

Study Area & Sample Points

At each sample point the annual occurrence of fire was reconstructed from tree rings for ~1700-1918.  Cross sections were removed from dead trees, sanded, and precisely dated, and injuries created by non-lethal low-severity surface fires were dated to their exact year of occurrence.

 

Fire scars collected at individual sample sites were composited into a single record of fire occurrence at each sample point.  Between 8 and 28 fires occurred at each sample point (mean = 16).  In the Composite graph, horizontal lines represent sample points and vertical tick marks represent fire occurrence.

Composite

Hypothesis

I hypothesized that climate, time since fire, and lodgepole pine act as constraints on fire occurrence across the central Oregon pumice plateau.

Rationale

Earlier investigations demonstrate that fires historically occurred in dry forests in Oregon during drought years.  However, fire size has not been related to climate.  It could be that fires of different sizes have different relationships to previous-year climate.  In exercise 2, I checked whether small fires, large fires, and extensive fires were similarly related to climate.

Recent investigations north and northwest of this study area demonstrated lower fire frequency in lodgepole pine forests and that areas of lodgepole pine forest acted as intermittent barriers to fire spread.

Fuel is an obvious limiting factor for fire spread.  After a fire, fuel must recover sufficiently for fire to spread again, so time since fire may be predictive of fire occurrence.

Methods/Approaches

Ex 1 – Mapping fires from Binary Point Data

Prior to exploring constraints on fire occurrence I needed to produce maps of fire occurrence.  To do this I used ArcGIS to evaluate Thiessen polygons, kriging, and inverse distance weighting (IDW) for mapping historical fires. Once I had made maps of fires, I visually examined spatial and temporal variation in fire occurrence by creating an animation of fire events over time, mapping fire history statistics, and identifying fire occurrence groups with cluster analysis and then mapping the groups.
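The Thiessen polygon method assigns every location the record of its nearest sample point, so the fire map for a year is just a nearest-neighbor lookup. A minimal sketch of that logic (Python for illustration, with hypothetical coordinates and a binary fire record per point):

```python
def nearest_record(samples, query):
    """Thiessen-polygon logic: a query location takes the fire record
    of its nearest sample point. samples: list of (x, y, fire) tuples."""
    def dist2(s):
        return (s[0] - query[0]) ** 2 + (s[1] - query[1]) ** 2
    return min(samples, key=dist2)[2]
```

Repeating this lookup over a grid of locations for each fire year reproduces the polygon-based fire perimeters without any interpolation parameters to tune, which is part of why the method reads as objective and parsimonious.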

Ex 2 – Determining how fire size and climate were related

I used superposed epoch analysis (SEA) to determine how annual climate and climate in antecedent years were related to fire occurrence.  By breaking fire events into size classes I was able to see whether relationships with climate varied with fire extent.
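The core of SEA is averaging the climate series at fixed lags around the event years; a minimal sketch without the randomization significance test (Python for illustration, with hypothetical data; the climate record is a year-to-value mapping):

```python
def sea(climate, fire_years, lags=range(-4, 1)):
    """Superposed epoch analysis: mean climate at each lag relative to
    the fire years (lag 0 = the fire year itself; no significance test)."""
    means = {}
    for lag in lags:
        vals = [climate[y + lag] for y in fire_years if y + lag in climate]
        means[lag] = sum(vals) / len(vals) if vals else None
    return means
```

Running this separately on the small, large, and extensive fire-year lists is what allows the lag-0 drought signal to be compared across size classes; in practice the means are judged against confidence bounds from randomly drawn pseudo-event years.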

Ex 3 – Using a GLMM to understand the influence of climate, time since fire, and lodgepole pine on annual fire occurrence

I used a generalized linear mixed model (GLMM) to determine the influence of fixed effects (climate, time since fire, and lodgepole) on annual fire occurrence across my study area.  Sample point was included as a random effect to see if the model varied across the study area. I used R2 for GLMM models to determine the relative importance of each fixed effect and the random intercept of sample point in the model.

The Results and the Significance

Ex 1

In exercise 1 I learned the tradeoffs between the three different approaches to mapping fires from binary point data. The Thiessen polygon method provided the most efficient, objective, and parsimonious way to map fire perimeters given the distribution of my sample points.

Fire Extent mapped from Thiessen Polygons

Ex 2

Superposed epoch analysis demonstrated that extensive fires occurred during years of extreme drought, large fires occurred during average droughts, and small fires were not related to either dry or cool-wet climate years. No relationships with climate in antecedent years were found for fires of any size, or for years without fire.

SEA_FireSize

These results demonstrate that climate in the years before a fire is not driving fuel abundance and connectivity. Cool, wet years that may have been necessary to produce the fine fuels that carry surface fires did not precede fire years; only a dry, hot year during the fire year itself was associated with large and extensive fire spread. For my investigation this result means that fuel recovery following fire is not moderated by climate, and thus time since fire may act as an independent predictor of fire occurrence.

Ex 3

Using the GLMM approach I determined that climate, time since fire (interval), and lodgepole influenced annual fire occurrence.  However, lodgepole was only weakly significant and produced only a small change in the annual probability of fire. The conditional R2 for the GLMM demonstrated that the random effect of site accounted for very little of the variance explained by the model. Climate, and not interval as I expected, explained most of the variance accounted for by the model.

For every 1 unit increase in PDSI the probability of fire occurring decreased by ~25%, for each year without fire (Interval) the probability of fire increased by 7%, and for each percent increase in lodgepole forest around a sample point the probability of fire decreased by 1%.

Significance of Ex 3 – Because lodgepole was significant but only weakly so, I want to explore other metrics that capture how lodgepole varies with respect to each plot.  It could be that lodgepole nearer the plot is more important, or it could be its spatial pattern that causes variation in fire occurrence.  I was surprised that interval was less influential than climate. I interpret this as a strong indication that fuel is only limiting for a few short years after fire in the central Oregon pumice plateau. The small influence of site in the model suggests that site-specific bottom-up controls (microclimate, slope, local composition) have little influence on fire frequency and the fire regime.

My next step is to focus on a single explanatory variable and switch to using site as a fixed effect with an interaction with that explanatory variable.  In addition, I plan to fit 52 models, one for each site, using a single explanatory variable.  This would switch from a GLMM approach to a GLM approach, and allow me to map the coefficient for each site.

Software Learning

In R I learned to use the lme4 package, learned how to perform SEA analysis, learned several new scripts for data wrangling, and learned how to make graphical summaries of fire history in ggplot.

In ArcGIS I learned to create animations of fire events and how to map fires with three different interpolations. I also used kriging to summarize variation in fire metrics in exercise 1.

Statistical Learning

I learned the pros and cons of different techniques to map fires, the ins and out of SEA analysis, and got an initial understanding of GLMMs. I plan to spend a lot more time working with GLMMs now that I’ve had this introduction and have found a flexible tiered approach to linear models.

References

Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H., and White, J.S.S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution 24, 127–135.

Hessl, A., Miller, J., Kernan, J., Keenum, D., and McKenzie, D. (2007). Mapping Paleo-Fire Boundaries from Binary Point Data: Comparing Interpolation Methods. The Professional Geographer 59, 87–104.

Hartig, F. (2018). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.0. http://florianhartig.github.io/DHARMa/

Prager, M.H., and Hoenig, J.M. (1989). Superposed Epoch Analysis: A Randomization Test of Environmental Effects on Recruitment with Application to Chub Mackerel. Transactions of the American Fisheries Society 118, 608–618. DOI: 10.1577/1548-8659(1989)118<0608:SEAART>2.3.CO;2

Nakagawa, S., and Schielzeth, H. (2012). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution 4, 133–142.

Assessing the importance of 3 predictors of fire occurrence and the random effect of site with a GLMM model

Filed under: Exercise/Tutorial 3 2018 @ 4:11 pm

Question

The question I asked was whether fire occurrence from 1700-1918, a binary response (0 = no fire, 1 = fire), was related to three explanatory variables:

Climate – Annual Palmer Drought Severity Index (PDSI), a measure of moisture stress reconstructed from tree rings (range -6.466 to 7.234; severe drought to cool and wet). In exercise 2 I found, using superposed epoch analysis, that large and extensive fires occurred during years of extreme drought, and that previous-year PDSI was not related to fire occurrence.

Interval – The number of years since the last fire event (range 1-46). This variable is a surrogate for fuel, a primary constraint on fire spread. The longer an area goes without fire, the more fuel accumulates, potentially increasing the probability of fire spread.

Note – Calculating time since fire for each year at each plot was quite tricky. I eventually found a solution with the dplyr package.  This code counts the number of rows (years) between events (FIRE_PLUS) within each plot and writes the running count of years since fire to each year (row) in a new column. This almost does what I want, but the count restarts in the year/row when a fire occurred.  The interval in the year of fire occurrence should be the years since the last fire (e.g. 20, not 0), so I shifted fire occurrence forward in time by one year (FIRE_PLUS).

binary2 <- binary1 %>%
  group_by(SP_ID, idx = cumsum(FIRE_PLUS == 1L)) %>%
  mutate(counter = row_number()) %>%
  ungroup() %>%
  select(-idx)
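The same counter logic, sketched for a single plot in Python for clarity (assuming the shifted FIRE_PLUS column as a plain list, and no fire history before the first row): the count restarts in the row where the shifted fire indicator is 1, mirroring the group_by(cumsum) plus row_number() trick.

```python
def years_since_fire(fire_plus):
    """Row counter that restarts on each shifted fire year, mirroring
    dplyr's group_by(cumsum(FIRE_PLUS == 1)) + row_number() for one plot."""
    out, counter = [], 0
    for f in fire_plus:
        counter = 1 if f == 1 else counter + 1
        out.append(counter)
    return out
```

Because FIRE_PLUS is the fire record shifted forward one year, the row of the actual fire keeps the accumulated interval, and the reset lands on the following year.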

Lodgepole – The proportion of area occupied by lodgepole pine within a 2 km buffer surrounding each sample plot.  Lodgepole pine stands have lower productivity and relatively sparse understory fuels.  Fire occurrence may be less frequent in areas where lodgepole pine forest limits fire spread.

My data looks like this:

SP_ID YEAR FIRE FIRE_PLUS PDSI INTERVAL PICO
SD200 1700 0 0 2.935 13 40
SD200 1701 0 0 2.028 14 40
SD200 1702 0 0 4.049 15 40
SD200 1703 0 0 -5.034 16 40
SD200 1704 0 0 2.306 17 40
SD200 1705 1 0 0.773 18 40
SD200 1706 0 1 -2.451 1 40

SP_ID is the sample plot identifier (n =52), YEAR is the year of observation, FIRE is the binary record of whether a fire occurred, FIRE_PLUS is fire occurrence shifted so I could calculate interval since fire for each year and each plot, INTERVAL is the year since fire occurrence, and PICO is the proportion of area around each plot occupied by lodgepole.

I’m also interested in the relative importance in the model of each explanatory variable, and whether the response varies by sample plot.

Approach

I used the lme4 package in R to construct generalized linear mixed models (GLMM; Bolker et al. 2009) for this analysis. The GLMM approach is useful for a binary response where data are nested; in this case my response is nested within sample plots, and a mixed model allows me to pool information about fire occurrence across sites.

With a binary response, effective sample size is limited by the number of 1 (fire) occurrences of the response variable. In my case I have ~200 binary observations at each plot (years), but fire occurs in only 8-28 of those years depending on the plot, so per-plot sample sizes would be too small for 3 predictor variables. The GLMM approach allowed me to pool information across plots and determine whether sample plot, the random effect in the model, created a different response to the fixed effects (climate, interval, lodgepole).

I followed these steps:

  1. Checked for independence of the explanatory variables
  2. Specified the model with sample plot as the random effect; climate, interval, and lodgepole as the fixed effects; and fire as the response. This is a GLMM fit by maximum likelihood with a logit link.

fire.glmm1 <- glmer(FIRE ~ PDSI + INTERVAL + PICO + (1|SP_ID),
                    family = "binomial", data = fire.data,
                    control = glmerControl(optCtrl = list(maxfun = 2e5)))

  3. Checked the model fit and the assumption of residual independence. To do this I used the DHARMa package and the following vignette, which describes why residual interpretation for GLMMs is problematic and provides a solution.

https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html

If the model is specified correctly then the observed data should look like it was created from the fitted model. Hence, for a correctly specified model, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).

Overall residuals for the full model

simulation_residuals

Residuals vs. Variables – Residuals for each fixed effect

PDSI_RESIDUALS

Temporal autocorrelation of residuals

Temporal_Autocorrelation

ACF plots of Residuals

ACF_RESIDUALS

4. I used R2 for GLMMs (Nakagawa and Schielzeth 2012) to indicate the relative importance of the fixed effects and to determine the importance of sample point (the random effect). The marginal R-squared values are those associated with the fixed effects; the conditional ones are those of the fixed effects plus the random effects. To determine the relative importance of the fixed effects I dropped one fixed effect at a time and compared changes in R2 among models.
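For a logit-link GLMM, the Nakagawa and Schielzeth R2 uses a fixed distribution-specific variance of pi^2/3 in the denominator; a minimal sketch of the two quantities (Python for illustration, taking the fixed-effect and random-intercept variances on the latent scale as inputs):

```python
import math

def r2_glmm_binomial(var_fixed, var_random):
    """Nakagawa & Schielzeth (2012) marginal and conditional R2 for a
    logit-link GLMM; distribution-specific variance is pi^2 / 3."""
    var_dist = math.pi ** 2 / 3.0
    denom = var_fixed + var_random + var_dist
    return var_fixed / denom, (var_fixed + var_random) / denom
```

When the random-intercept variance is near zero the two R2 values converge, which is exactly the pattern reported for sample plot below.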

5. To interpret the influence of each fixed effect I exponentiated the coefficients and graphed the probability functions.
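Exponentiating a logit-link coefficient gives the multiplicative change in the odds of fire per one-unit increase in the predictor; a minimal sketch (Python for illustration; note that a change in odds only approximates a change in probability when probabilities are small):

```python
import math

def pct_odds_change(coef):
    """Percent change in the odds of fire for a one-unit increase in a
    predictor, from a logit-link coefficient: 100 * (exp(b) - 1)."""
    return 100.0 * (math.exp(coef) - 1.0)
```

For example, a coefficient of log(0.75) corresponds to a 25% decrease in the odds of fire per unit increase in the predictor.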

Results

All three fixed effects were significant in the GLMM, but PICO was only barely significant (p = 0.0458). For every 1-unit increase in PDSI the probability of fire occurring decreased by ~25%, for each year without fire (Interval) the probability of fire increased by 7%, and for each percent increase in lodgepole forest around a sample point the probability of fire decreased by 1%.

 

The difference between marginal and conditional R2 for the full model was small: the random intercept of site explained little additional variance, suggesting there is not much left to explain at the sample-plot level.

R2m = 0.2072182, R2c = 0.2329679

Comparing conditional R2 among models while dropping fixed effects indicated that climate, a top-down driver consistent across sample points, accounted for most of the R2 (0.15 for PDSI, 0.5 for Interval, and 0.1 for Lodgepole).

These results suggest that bottom-up drivers such as local topography, composition, or microclimate have little influence on fire occurrence. I was surprised that within-year climate had a stronger influence on fire occurrence than time since fire, but this makes sense the more I think about fuel accumulation. If fuel recovers quickly (e.g., 2-5 years) in this landscape, then it is not limiting in most years. If fuel is generally not a limiting constraint on fire, then fire occurrence should not change dramatically with time since fire: 5 years without fire and 20 years without fire have essentially the same influence on whether a fire occurs, and fire is simply more likely over a longer span of years. Thus climate is the more influential predictor of fire occurrence.

Overall, most of the variance was not explained. If climate and fuels are generally not limiting, then fire occurrence in a given year would result from human and lightning ignitions; data for these predictors are not available.

Critique

Overall I found GLMM modeling quite challenging, and judging by the number of R packages and recent publications on the topic, it is still an evolving statistical method. I had trouble understanding the implications of the random effect (sample point) accounting for little of the R2; I interpret this to mean that there is not much left to explain at the sample-point level beyond what has been included.

Interpreting coefficients from a GLMM with a logit link is also challenging, and there are few examples available online.

References

Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H. & White, J.S.S. (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24, 127–135.

Florian Hartig (2018). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.0. http://florianhartig.github.io/DHARMa/

Nakagawa, S. & Schielzeth, H. (2012) A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142.

Exploring recreational movement behavior through hidden Markov models

Filed under: 2018,Final Project @ 3:55 pm

Research Question Asked

Glacier Bay National Park and Preserve (GLBA), located in southeast Alaska, contains over 2.7 million acres of federally designated terrestrial and marine wilderness (National Park Service, 2015). Recreation users access GLBA Wilderness primarily by watercraft; the park lacks formal trail networks in its wilderness, and terrestrial connectivity is fragmented by the park’s water resources. First designated as wilderness in 1980 through the Alaska National Interest Lands Conservation Act, the park’s wilderness has been guided by a 1989 Wilderness Management Plan (National Park Service, 1989). Much has changed in Alaska and GLBA since that time, and the park is currently updating its Wilderness Management Plan to adapt its management practices to modern contexts. Park managers are particularly interested in developing a better understanding of the wilderness experiences of water-based backcountry overnight users in GLBA Wilderness. As such, a dataset of global-positioning system (GPS) tracks of water-based visitor travel patterns was collected during the summer 2017 use season to record the spatial and temporal behaviors of recreationists in GLBA Wilderness.

The movement ecology paradigm provides a useful, organizing framework for understanding and conducting path analysis on GPS tracks of recreationists in GLBA Wilderness. Formally proposed in 2008, the movement ecology paradigm was designed to provide an overarching framework to guide research related to the study of organismal movement, with specific emphasis on guiding questions of why organisms move and how movement occurs through the lens of space and time (Nathan et al., 2008). The framework emphasizes understanding components of the movement, looking for patterns among those components, and understanding meaning behind movement through the underlying patterns (Figure 1). Ultimately, the target understanding is the movement path itself, which can be understood through quantification of the external factors, internal factors, and capacity for movement and/or navigation by the moving organism (Figure 2, Nathan et al., 2008). Through employing a movement ecology approach to the study of overnight kayaker movements in a protected area, individual movement tracks can be broken down into relevant components, the components of the path can be studied for patterns, and ultimately internal and external factors can be explored for influence or explanation of the movement path.

The following two aspects of the movement ecology framework were the focus of this study:

Movement States (Figure 1) – The movement ecology framework operationalizes movement as a series of step lengths and turning angles that can be used to identify underlying behavioral states. This follows the assumption that organisms move for specific reasons, and those reasons are reflected in the distance and direction an organism travels in a set amount of time. The focus of this study was to transform raw GPS data into a path of step lengths and turning angles and to identify underlying behavioral states from those data.

Figure 1. Organizing framework for path analysis approach for this study (figure from Nathan et al., 2008): A) Understanding movement as a series of movement steps and associated turning angles, and B) Understanding values in the series of step lengths and turning angles as characteristic of behavioral states.

External Factors Affecting Movement (Figure 2) – According to Nathan et al. (2008), external factors are among the factors influencing the movement path. In this study, external factors were operationalized as landscape-level features that may influence where a recreationist travels in Wilderness and the type of behavior in which the recreationist engages. Two environmental variables, distance to shore and bathymetry, were explored for influence on the movement paths.

Figure 2. Movement ecology framework (figure from Nathan et al., 2008). Of focus in this study is the exploration of two external factors, bathymetry and distance to shoreline, and their relationship to the movement path and associated behavioral states.

To apply the above-mentioned aspects of the movement ecology framework to my study of recreationist behavior in GLBA Wilderness, the following research questions were asked:

  • What are the mean step length and mean turning angle values for emergent behavioral states observed among the movement patterns of recreational kayakers?
  • How does bathymetry influence the transition probability between emergent behavioral states?
  • How does distance to shoreline influence the transition between emergent behavioral states?
  • Which external factor, distance to shoreline or bathymetry, has more explanatory power for transition probabilities in emergent behavioral states?

Description of Datasets Used

Dependent Variables Dataset:

GPS Dataset: The dataset for analysis was a test group of five GPS tracks taken from a sample of 38 GPS tracks collected during the summer of 2017 (Figure 3). Recreation-grade personal GPS units were distributed to a sample of recreationists entering GLBA Wilderness via personal kayak, June through August 2017. Study participants were asked to carry the GPS unit for the duration of their trip and return it at the end. GPS units continuously recorded movement throughout the trip.

Temporal Resolution and Extent: Units recorded a GPS point at various intervals, determined as a function of speed of travel. When speed was recorded at 0 miles per hour (MPH), the GPS units recorded an X,Y location point every 60 seconds. When speed was recorded at 1 MPH, the GPS units recorded an X,Y point every 15 seconds. When speed was 2 MPH or greater, the units recorded an X,Y GPS point every 8 seconds. The test group of GPS tracks was collected on trips taken in late June and early July 2017; the recorded tracks had between two and four days of data. For the purposes of this analysis, each test batch track was down-sampled by aggregating the data into one-minute time bins, with X and Y averaged across each minute. In this way, the temporal resolution of the data during the analysis phase was one minute.
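The down-sampling step can be sketched in base R (the coordinates and timestamps below are hypothetical, and the actual processing pipeline may have differed):

```r
# Hypothetical raw track: timestamps and projected (UTM) coordinates
track <- data.frame(
  time = as.POSIXct("2017-06-20 09:00:00", tz = "UTC") + c(0, 8, 16, 60, 75),
  x    = c(440000, 440010, 440020, 440100, 440110),
  y    = c(6500000, 6500005, 6500010, 6500050, 6500055)
)

# Assign each point to a one-minute bin, then average X and Y within bins
track$bin <- format(track$time, "%Y-%m-%d %H:%M")
binned <- aggregate(cbind(x, y) ~ bin, data = track, FUN = mean)
```

Each row of `binned` is then one regularly sampled observation suitable for step length and turning angle generation.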

Spatial Resolution and Extent: At each time interval (described above), the GPS units recorded X and Y coordinates. Coordinates were recorded in decimal degrees. The geographic coordinate system for the data is GCS_WGS_1984. For analysis, the data were projected into the NAD_1983_UTM Coordinate System with a Zone 8N projection. The spatial extent for the dataset is the park boundary for GLBA.

Independent Variables Datasets:

Bathymetry: Bathymetry was incorporated using a 25-meter raster layer of the bathymetry underlying GLBA’s marine Wilderness. The bathymetry data layer was accessed from the publicly available National Park Service data clearinghouse at www.irma.nps.gov. The downloaded data layer was added to an existing base map in ArcMap 10.3 containing the analysis area (Glacier Bay National Park Wilderness) and the aggregated point shapefile of test batch movement data. The Extract Values to Points (Spatial Analyst) tool in ArcGIS was used to attach the raster cell value underlying each movement point for analysis.

Distance to Shoreline Dataset: A landcover vector dataset was accessed from the publicly available National Park Service data clearinghouse available at the following link www.irma.nps.gov. The downloaded data layer was added into an existing base map in ArcMap 10.3 that contained the analysis area (Glacier Bay National Park Wilderness) and the aggregated point shapefile of test batch movement data. Using the Joins and Relates option available in ArcGIS, the point-based GPS data were joined to the polygon (landcover) data. During the joining process, the distance from the landcover layer (the shoreline) was measured from each point, and added to the dataset as a unique field for each point. In this way, the distance from shoreline, in meters, was calculated for each GPS data point for use in analysis.

Figure 3. Map displaying datasets used in analysis. The yellow points on the map represent the aggregated shapefile of movement data used in the analysis. The blue area represents the bathymetry data layer. The distance from each yellow point to the nearest land cover class was calculated through a spatial join to operationalize the distance to shore dataset.

Hypotheses

For each research question, the following hypotheses were developed:

  • What are the mean step length and mean turning angle values for emergent behavioral states observed among the movement patterns of recreational kayakers?
    • Two behavioral states will emerge from analysis of the step length and turning angle movement data, one state representing a movement-oriented state in which step lengths are longer and turning angles are narrower and the second state representing a resting-oriented state in which step lengths are shorter and turning angles are wider. Specific values for step length and turning angle for each behavioral state were not hypothesized.
  • How does bathymetry influence the transition probability between emergent behavioral states?
    • As bathymetry increases (i.e., as the depth of the ocean increases), kayakers will be more likely to be in a movement-oriented state rather than a resting-oriented state.
  • How does distance to shoreline influence the transition between emergent behavioral states?
    • As distance to shoreline increases (i.e., as the kayaker is further away from the shoreline), kayakers will be more likely to be in a movement-oriented state rather than a resting-oriented state.
  • Which external factor, distance to shoreline or bathymetry, has more explanatory power for transition probabilities in emergent behavioral states?
    • Distance to shoreline is likely to have greater explanatory power than bathymetry. This is an informal assumption: because kayakers can see the shoreline and landscape features, their behavioral state is more likely to be influenced by distance to shoreline than by bathymetry, which would be a harder environmental variable to perceive while kayaking. The hypothesis is therefore based on assumptions about human perception of environmental characteristics.

Approaches

The overarching analytic approach for this study was the use of hidden Markov models to operationalize behavioral states within the test batch of movement paths (Langrock et al., 2012; Michelot et al., 2016). An R package, moveHMM (Michelot, Langrock, & Patterson, 2016), was the primary analytic tool used. The hidden Markov model approach requires that the measured data, in this case time-stamped X and Y coordinates, actually represent movement. The literature on applying hidden Markov models to movement behavior suggests this requirement can be met by assuming that spatial inaccuracy does not exist within the data (i.e., the measured coordinates represent actual movement behaviors) and by regular time-stamped sampling of the GPS data (i.e., no missing data). By assuming that the measured data represent a known state (in this case, movement), the hidden Markov model uses patterns in the measured data to reveal the “hidden” underlying states. In this application, the hidden states being modeled are two behavioral states defined by combinations of co-occurring step length and turning angle movements. Additionally, covariates can be included to explore whether an environmental variable that changes in space and time alongside the step length and turning angle data correlates with shifts between behavioral states. Data processing and analytic approaches comprised three main phases, described below.

Phase 1 Summary: Relating Environmental Covariates to Step Length and Turning Angle Data (Exercise 2)

ArcGIS tools were used to spatially join environmental covariates (bathymetry and distance to shore) datasets to the point data prior to generating step length and turning angles. ArcGIS was also used as a data visualization tool to generate an overall understanding of the spatial extent of datasets being used.

Phase 2 Summary: Generation of Step Lengths and Turning Angles from GPS Data (Exercise 1)

The “prepData” function in the moveHMM R package was used to convert the series of X and Y coordinates into a series of step length, turning angle, and averaged covariate values. The “prepData” function requires that the GPS data be regularly sampled, and that each observation has a unique numeric ID code associated with the data in the data frame – data processing and summarizing tools in R were used to meet these requirements. Additionally, an R function was used to convert data from a latitude and longitude geographic coordinate system to a projected, UTM coordinate system to generate meaningful step length values of meters through the prepData tool.
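The prepData step can be sketched as follows (a minimal sketch; the data frame `binned_tracks` and its column names are assumptions standing in for the regularly sampled, projected GPS data described above):

```r
library(moveHMM)

# Convert regularly sampled UTM coordinates into step lengths (meters)
# and turning angles (radians); an ID column distinguishes tracks, and
# any covariate columns in the input are carried through
hmm_data <- prepData(binned_tracks, type = "UTM", coordNames = c("x", "y"))

head(hmm_data)  # columns: ID, step, angle, x, y, and covariates
```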

Phase 3 Summary: Fitting the moveHMM Models and Evaluating Results (Exercise 3, Parts 1 and 2)

The “fitHMM” function was used from the moveHMM R package to run the hidden Markov models, generate behavioral probabilities, and run an AIC analysis on the fit models.
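The model-fitting and comparison step can be sketched as follows (a sketch, assuming a prepared moveHMM data frame `hmm_data` with covariate columns `bathymetry` and `dist_shore`; those names and the starting parameter values are illustrative assumptions):

```r
library(moveHMM)

# Illustrative starting values for a 2-state model:
# step lengths (mean_1, mean_2, sd_1, sd_2) in meters;
# turning angles (mean_1, mean_2, concentration_1, concentration_2) in radians
stepPar0  <- c(5, 250, 5, 100)
anglePar0 <- c(pi, 0, 0.5, 2)   # state 1 tends to reverse, state 2 goes straight

m_bathy <- fitHMM(data = hmm_data, nbStates = 2,
                  stepPar0 = stepPar0, anglePar0 = anglePar0,
                  formula = ~ bathymetry)
m_shore <- fitHMM(data = hmm_data, nbStates = 2,
                  stepPar0 = stepPar0, anglePar0 = anglePar0,
                  formula = ~ dist_shore)

AIC(m_bathy, m_shore)   # lower AIC indicates the favored model
plotStates(m_bathy)     # decoded behavioral-state sequence over time
```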

Results

Step Length and Turning Angle Generation:

The step length and turning angle histograms suggest that across the five test tracks processed with the moveHMM tool, the majority of step lengths fell between 600 and 700 meters per minute, and kayakers generally traveled in a straight direction, turning infrequently. The step lengths were greater than originally anticipated; after further investigation I learned this is likely because kayakers have the option of taking a day boat back to the visitor center from their backcountry location rather than paddling back. This was a surprise to discover so late in the analysis, but it likely explains why step lengths of 600-700 meters per minute occur in the dataset. Future applications of these techniques to this dataset will need to account for this discovery, likely by eliminating the day boat portion of each GPS track from analysis. The 0-100 meter per minute step length bin likely represents stoppage time. The disparity between the two step length categories suggests that for water-based recreationists, step length may be a good metric for examining changes in behavioral state in future bivariate analyses.

Turning angle histograms suggest that for the most part, kayakers are traveling in a relatively straight direction with little variation away from that direction. When turning movements are made, they tend to turn toward the left. This appears to be the result of making a loop trip in which kayakers initiate their trips following the eastern coast of GLBA inlet and finish their trips following the western coast of GLBA inlet.

Figure 4. Histograms of generated turning angles and step lengths for movement path two.

Best Fitting Models:

Two models were fit to the step length and turning angle data using the moveHMM tool: a two-behavioral state model with distance to shore as a covariate for behavior transition and a two-behavioral state model with bathymetry as a covariate for behavior transition. The decision to model a two-behavioral state model came from review of the step length and turning angle histograms and conversation with my research advisors, instructor, and classmates. Figure 5 reports the model outputs for these model runs.

Figure 5. Results from the model outputs for the best fitting parameters with distance to shore as a covariate (left panel) and bathymetry as a covariate (right panel). The distance to shore model suggests a two-state model, with one state characterized by shorter step lengths (mean ≈ 1 meter) and wide turning angles (mean = 3.1 radians, i.e., near-reversals) and a second state characterized by longer step lengths (mean ≈ 224 meters) and narrow turning angles (mean = -0.014 radians). The bathymetry model suggests a two-state model, with one state characterized by shorter step lengths (mean ≈ 2.5 meters) and wide turning angles (mean = 3.1 radians) and a second state characterized by longer step lengths (mean ≈ 262 meters) and narrow turning angles (mean = -0.007 radians).

Visual Presentation of Results and AIC Analysis

Results of the AIC analysis to determine the favored model suggest that model 2, the bathymetry model, is favored over model 1, the distance to shore model, given AIC values of 72708.23 and 82097.23, respectively. Given these results, the remaining visualizations and interpretation of results are provided for Model 2, with bathymetry as a covariate.

Figure 6. Density histogram of step length states for the Bathymetry Model. The density distribution and model fit curves do not suggest practically significant visual differences in the step length distributions for states 1 and 2. However, the figure suggests that the distribution for state 2, the movement-oriented state, has a higher density across step lengths from slightly greater than 0 to approximately 300 meters than state 1, which peaks right around a step length of 0.

Figure 7. Density histogram of turning angle states for the Bathymetry Model. Turning angles for State 2 cluster around 0, which is expected given the narrow turning angle movement mean reported for this state. The curve of State 1 is unexpected, given that the mean of the turning angle movement reported for this state is around 3.

Figure 8. Distribution of behavioral states across time for track ID 2. The figures show that the majority of the movement track is spent in state 2, the moving state, and that state 1, the resting state, occurs infrequently throughout the movement track.

Figure 9. Transition probability matrix for the influence of bathymetry on transitions between behavioral states. The four transition probability plots show that as bathymetry increases (water depth increases), the likelihood of staying in state 1, the resting state, decreases dramatically at a small increase in bathymetry and then remains at 0. Similarly, at small increases in bathymetry, the probability of transitioning from state 1, a resting state, to state 2, a moving state, increases rapidly and then stays at a probability of 100%.

Figure 10. Three separate outputs for Track 2 that begin to tell a story of what may be going on in the data. A) Example output for the location of behavioral states in space for track ID 2. In theory, the figure would show blue for portions of the track during which the individual is in state 2 and orange for portions during which the individual is in state 1. At this scale, state 1 is hard to see, but there are small clusters around the beginning and end of the trip that show some orange track pieces. This figure shows that for individual 2, the majority of the track is in state 2 behavior rather than state 1. B) A display of the track (the orange color has no relevance) overlaid on a satellite image of the surrounding area. The few instances of state 1 resting behavior in the track co-occur in space with access to a glacier, suggesting that landscape features other than bathymetry and distance to shore may be better predictors of behavioral state changes. C) Temporal display of track three, with the transition in color through the track displaying the passage of time. The color gradient changes somewhat rapidly in a small amount of space where the state 1 behavior occurs.

For each research question, the results were as follows:

  • What are the mean step length and mean turning angle values for emergent behavioral states observed among the movement patterns of recreational kayakers?
    • Result: A two-behavioral-state model was developed, with the following parameter estimates for step length and turning angle by state derived from the best fitting model, the bathymetry model: one state characterized by shorter step lengths (mean ≈ 2.5 meters) and wide turning angles (mean = 3.1 radians) and a second state characterized by longer step lengths (mean ≈ 262 meters) and narrow turning angles (mean = -0.007 radians).

 

  • How does bathymetry influence the transition probability between emergent behavioral states?
    • Result: As bathymetry increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning between a stationary state and a movement-oriented state increases rapidly and abruptly as bathymetry increases.

 

  • How does distance to shoreline influence the transition between emergent behavioral states?
    • Result: As distance to shore increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning from a stationary state to a movement-oriented state increases rapidly and abruptly as distance to shore increases. (Results not pictured in this exercise)

 

  • Which external factor, distance to shoreline or bathymetry, has more explanatory power for transition probabilities in emergent behavioral states?
    • Result: Through an AIC analysis comparing the Bathymetry Model and the Distance to Shore Model, the Bathymetry Model was favored over the Distance to Shore Model as the preferred model for modeling two-state movement behavior. This result is counter to my hypothesis that distance to shore would be the favored model.

Despite these findings, I am not overly confident in the model outputs and conclude that additional work is needed to fully apply and understand hidden Markov models for movement behavior. First, the sample modeled represents only five movement patterns, and on review these patterns are quite distinct from each other. Additionally, embedded in the data is the potential for segments collected while recreationists were riding the day boat rather than independently kayaking; this affects not only the step lengths and turning angles but also the underlying motivations related to external factors that served as the original premise for this work. The model would also likely perform better on a larger sample of data, helping iron out some of the disparities in the parameter estimates. Finally, the modeling exercise presented assumes that one distribution per state is appropriate for the population, so any variability among individual tracks is minimized in the model. This is only a helpful approach if the researcher believes that all actors behave in a similar capacity – a big assumption for my data, and given this modeling exercise I do not necessarily think it holds.

Significance of results to science and resource managers

Conceptually, I think the application of hidden Markov models and the moveHMM tool has great potential for exploring movement data outside the realm of animal-based movement, particularly for natural resources managers. The study of human movement in recreation settings through the collection of GPS track data is a relatively new methodological development. As such, analytic methods applied to spatial data in the social sciences have to date focused on descriptive summaries of track lengths and travel times and on summarizing the data in aggregate. A movement ecology approach provides a lens through which to understand individual movement patterns and to look at variation in both space and time within a track. These methods also have the potential to help researchers and managers understand how landscape-level features may influence movement patterns and what the characteristics of movement patterns are in certain places and at certain times. As researchers and managers begin to build empirical relationships between behavioral states and landscape-level features, managers can work to engage in more directed landscape restoration, visitor use management, and park planning.

Learning: What did you learn about software a) Arc-Info, b) Modelbuilder and/or GIS programming in Python, c) R, d) other?

Through this class, I worked primarily in ArcGIS and R, exploring new tools and developing additional skills in both software packages. Through working in ArcGIS, I explored tools for completing spatial joins with both raster and vector data, data manipulation and formatting challenges and limits, and began to develop a rudimentary process for working between ArcGIS and R. Through the course, I moved beyond my prior knowledge of ArcGIS, using tools through the course exercises that I had not previously used such as the generate random points tool, extract values to points tool, spatial join with distance calculation, and mosaic raster tool.

The majority of my work in this class was performed in R, an analytic programming language in which my only prior experience was completing homework for STAT 511 last term. My proficiency in R grew immensely through this course. Skills I gained include opening, manipulating, and saving Excel and CSV data files in R, map-based data visualization tools, data summary tools, and several movement-based analytic packages including adehabitat and moveHMM. A central learning opportunity for me was exposure to the various ways R can be used to analyze and manipulate spatial data. Prior to this class, I had no concept of the depth of the spatial analytical packages available through R. I was introduced to those packages through recommendations by classmates, and the experience has changed how I will approach analysis in the future.

I did not work with Modelbuilder in the course, nor did I work with Python. I have worked with those programs in the past and was happy to develop new skills in R and additional skills in ArcGIS.

Learning: What did you learn about statistics, including a) hotspot, b) spatial autocorrelation, c) regression, and d) multi-variate methods?

Through the presentations of my classmates and through my own project work in this course, I expanded my understanding of how to apply statistical concepts to spatial data analysis. Through several presentations given by my classmates in Tutorial exercises, I developed a working knowledge of autocorrelation, including what the analysis seeks to identify, how the term “lag” is operationalized, and tools/packages for completing autocorrelation analysis. To date, this concept had eluded me, and while I did not work directly with autocorrelation analysis in my own project, the exposure gained in this class has helped me move forward in understanding what an autocorrelation analysis seeks to accomplish.

For multi-variate statistical methods, I took a deep dive into hidden Markov models as a mechanism for understanding transitions among behavioral states and the potential for relationships with environmental covariates, and I ran an AIC analysis to identify a favored HMM. Exercise 2 also exposed me to new thinking about how landscape-level variables relate to one another and why that relationship matters when looking at GPS tracks of use. Finally, I was excited to be introduced to geographically weighted regression through Sam’s presentation, and I look forward to exploring this analytic tool in the future. I liked the tool’s ability to bring in the spatial component of the data as a sort of third dimension to the analysis.

More than anything, this course has introduced me to new ways of thinking about spatial data generally, and exposed me to the wide range of possibilities available for future analyses. One of the greatest values of the course has been hearing the presentations of my classmates and learning about their research problems, questions, and tools used.

References

Langrock, R., King, R., Matthiopoulos, J., Thomas, L., Fortin, D., & Morales, J. (2012). Flexible and practical modeling of animal telemetry data: hidden Markov models and extensions. Ecology 93(11): 2336-2342.

Michelot, T., Langrock, R., & Patterson, T. (2017). An R package for the analysis of animal movement data. [online]. https://cran.r-project.org/web/packages/moveHMM/vignettes/moveHMM-guide.pdf.

Michelot, T., Langrock, R., & Patterson, T. (2016). moveHMM: An R package for the statistical modelling of animal movement data using hidden Markov models. Methods in Ecology and Evolution 7: 1308-1315.

Nathan, R., Getz, W.M., Revilla, E., Holyoak, M., Kadmon, R., Saltz, D., & Smouse, P.E. (2008). A movement ecology paradigm for unifying organismal movement research. PNAS 105(49): 19052-19059.

National Park Service. (1989). Wilderness Visitor Use Management Plan: Glacier Bay National Park and Preserve.

National Park Service. (2015). Glacier Bay: Wilderness Character Narrative. Available: https://www.nps.gov/glba/learn/news/wilderness-character-narrative-released.htm.

Modelling snow for sheep

Filed under: 2018,Final Project @ 1:21 pm

RESEARCH QUESTION – How do seasonal snow conditions affect Dall Sheep recruitment?

Dall sheep are an emblematic species of alpine regions in high-latitude North America. Their ranges extend from the mountains of the Yukon Territory, Canada, to the furthest western extent of the Brooks Range in Alaska. Populations of Dall sheep have declined 21% range-wide since the 1990s, with a major mechanism of decline thought to be the increased frequency of extreme spring snow conditions (Alaska Department of Fish and Game, 2014). During the months of April and May, mature Dall sheep ewes typically give birth to one lamb. The survival of this lamb depends on the mother’s ability to protect it from predators and guide it to accessible forage. If successful, the lamb is ‘recruited’ into the population. Hence, a commonly used metric of population growth potential is the mother-to-offspring ratio, in this case the lamb-to-ewe ratio (hereafter written as lamb:ewe). Extreme spring snow conditions are thought to decrease lamb survival by limiting access to forage, either through deep snow coverage or through ice layers formed in the snowpack after rain-on-snow events. The limited forage could cause starvation or push sheep into areas where vulnerability to predation is increased. In this project I examine this question using a spatially explicit snow-evolution model, SnowModel, and lamb:ewe ratios from summer sheep surveys.

DESCRIPTION OF DATASET

In this project I used three primary datasets:

Snow / climate dataset; SnowModel (Liston and Elder, 2006) was used to simulate daily snow and climate conditions in 6 different Dall sheep domains where survey data were available. SnowModel was forced with the Modern Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) product (Gelaro et al., 2017). SnowModel effectively downscales temperature, humidity, precipitation, and wind speed and direction from a 0.5º by 0.625º grid to a 30 m resolution, and physically evolves and distributes a simulated snowpack across digital elevation and landcover layers, derived respectively from the IfSAR DTM distributed by the US Geological Survey and the NLCD 2011 product distributed by the Multi-Resolution Land Characteristics Consortium (Homer et al., 2015). For 5 of the domains (Brooks Range, Denali, Gates of the Arctic, Lake Clark and Yukon Charley), in-situ data on snow depth and snow water equivalent, where available, were used to calibrate and validate the model. For the 6th domain, in the Wrangell St Elias, in-situ data from a March 2017 field campaign were used to calibrate and validate the model and test for model performance (see below). SnowModel was run for the entire period between September 1st 1980 and August 31st 2017. Daily snow and climate data above the elevation of shrubline were then aggregated into monthly and seasonal metrics, e.g. mean monthly snow depth (m). Seasons in this case were taken as September to November (Fall), December to February (Winter), March to May (Spring) and June to August (Summer).
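The daily-to-seasonal aggregation described above can be sketched as follows. This is an illustrative stdlib-Python sketch of the logic (the actual workflow was run on SnowModel output in R); the daily records here are hypothetical:

```python
from datetime import date
from collections import defaultdict
from statistics import mean

# Hypothetical daily snow-depth records: (date, depth in metres)
daily = [(date(2016, 9, 15), 0.05), (date(2016, 12, 1), 0.60),
         (date(2017, 1, 10), 0.85), (date(2017, 4, 2), 1.10),
         (date(2017, 5, 20), 0.40), (date(2017, 7, 4), 0.00)]

# Month-to-season mapping, as defined in the text
SEASONS = {9: "Fall", 10: "Fall", 11: "Fall",
           12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer"}

def seasonal_means(records):
    """Aggregate daily values into a mean value per season."""
    groups = defaultdict(list)
    for d, depth in records:
        groups[SEASONS[d.month]].append(depth)
    return {season: mean(v) for season, v in groups.items()}

print(seasonal_means(daily))
```

The same grouping step, keyed on month rather than season, yields the monthly metrics.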

Figure 1; Map of Alaska with Dall Sheep ranges

Sheep data; The sheep data used here are from annual surveys completed by the Alaska Department of Fish and Game, Bureau of Land Management, US Fish and Wildlife Service and National Park Service in the same areas as the SnowModel domains. Lamb:ewe ratios were calculated from the number of lambs recorded and the number of ewes and ewe-likes. Survey methods include distance sampling, stratified random sampling, and minimum count methods, conducted either from a ground location or from fixed-wing aircraft.

Climate index data; Climate indices of larger-scale weather patterns were downloaded from the National Oceanic and Atmospheric Administration (NOAA) for 7 different indices: the Pacific Decadal Oscillation (PDO), Arctic Oscillation (AO), East Pacific / North Pacific Oscillation (EP/NP), North Pacific Pattern (NP), West Pacific Pattern (WP), Pacific North American Index (PNA), and the North Atlantic Oscillation (NAO) (https://www.esrl.noaa.gov/psd/data/climateindices/list/).

HYPOTHESES

The hypotheses for this project were delineated by blog post, the three installments of which looked at a) model performance, b) climate index and snow condition relationships, and c) snow condition and lamb:ewe ratio relationships. They are hence;

  1. SnowModel performs best at the elevations and landcover classes where the greatest amount of in-situ data is available
  2. The influence of different climate indices on snow conditions is not uniform throughout Dall sheep ranges
  3. Spring snow conditions have the greatest impact on lamb:ewe ratios surveyed in summer

APPROACHES

For hypothesis A I used a multivariate approach, using the FAMD tool of the FactoMineR library in R to test where SnowModel was over- or under-predicting snow depth by landcover class and other metrics of topography (elevation, slope, aspect and northness).

For hypothesis B I used autocorrelation and cross-correlation functions in R to test for patterns within the time series of the snow metrics and for correlations with the climate indices.
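R’s stats::acf and stats::ccf compute these correlations across a range of lags; the core computation at a single lag can be sketched in stdlib Python (the two series below are hypothetical, constructed so that one lags the other by one step):

```python
from statistics import mean, stdev

def cross_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag] (ccf-style)."""
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    mx, my = mean(xs), mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (len(xs) - 1) * stdev(xs) * stdev(ys)
    return num / den

# Hypothetical climate index (x) and snowfall anomaly (y) series,
# where y simply repeats x one time step later
x = [0.1, 0.5, 0.9, 0.4, -0.2, -0.6, 0.3, 0.8]
y = [0.0] + x[:-1]
print(round(cross_correlation(x, y, 1), 3))  # → 1.0 at lag 1
```

With lag = 0 the same function gives the ordinary Pearson correlation; sweeping lag over a window reproduces the cross-correlogram examined in the figures below.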

For hypothesis C I used a simple approach, comparing whether snow conditions were above or below their mean with whether lamb:ewe ratios were above or below their mean, to see which conditions, and at what time of year, might best predict levels of recruitment. This then informed a multiple logistic regression model computed in R.

RESULTS

Hypothesis A;

The results for hypothesis A are best seen in figures 2 to 4:

       eigenvalue  variance.percent  cumulative.variance.percent
Dim.1  3.1793016   24.456166         24.45617
Dim.2  2.1249171   16.345516         40.80168
Dim.3  1.4969078   11.514675         52.31636
Dim.4  1.2530914    9.639165         61.95552
Dim.5  0.9902287    7.617144         69.57267

Figure 2; Scree plot and table from the multiple factor analysis

Figure 3; Plot of the importance of the quantitative variables to the 1st and 2nd dimensions of the factor analysis

Figure 4; Plot of the importance of the qualitative variables to the 1st and 2nd dimensions of the factor analysis.

The factor analysis revealed that only 40% of the variation in the data could be explained by the first 2 dimensions (figure 2), with elevation being the biggest contributor to the 1st dimension, and the category of SnowModel error (diffCategory) being the biggest contributor to the 2nd dimension (figures 3 and 4). However, and perhaps expectedly given its elevational transitions and role in snow accumulation processes, landcover type was a significant contributor to both dimensions. Broadly speaking, as elevation increased and the landcover went from coniferous forest to prostrate shrub tundra and bare ground, model accuracy increased. Interestingly, there is also a pattern of underprediction in bare ground and coniferous forest landcover and overprediction in prostrate shrub tundra and erect shrub tundra, although the magnitude of the error is greatest in the landcover classes that typically occupy lower elevations. As I had greater amounts of in-situ data from higher elevations I could confirm my hypothesis; however, the analysis did reveal an under- and overprediction pattern between bare ground and prostrate shrub tundra that I didn’t expect.

Hypothesis B;

I conducted analyses for all 6 domains described above; however, the autocorrelation of monthly and seasonal snow/climate metrics from 1980 to 2017, and the cross-correlations of monthly and seasonal total snowfall from 1980 to 2017 with the climate indices, did not reveal any meaningful patterns. Though some statistically significant results at certain lags were observed, they never rose above ~0.5 for the autocorrelation and ~0.25 for the cross-correlation. This suggests that larger-scale climate patterns do not explain a large proportion of the inter- and intra-annual variability in snow conditions in alpine areas. Of note is the stronger influence of different climate indices in different domains; although only weakly significant, there appears to be a pattern dependent on latitude and continentality (please refer to blog post 2 to examine this). In the meantime, hypothesis B can be considered open for further testing. For this post, please see the examples from the Wrangell St Elias below as illustrative:

Figure 5; Autocorrelation of monthly variables 1980 to 2017, Wrangell St Elias domain

 

Figure 6; Autocorrelation of seasonal variables 1980 to 2017, Wrangell St Elias domain

 

Figure 7; Cross-correlation of monthly variables to climates indices 1980 to 2017, Wrangell St Elias domain

Hypothesis C;

To test hypothesis C, a simple approach was used to check for the co-occurrence of the following conditions:

  1. High month/season snowfall and low lamb:ewe ratios
  2. Low month/season air temperature and low lamb:ewe ratios
  3. High month/season snow depth and low lamb:ewe ratios
  4. Low month/season forageable area and low lamb:ewe ratios

These statements are based on what we expect the relationship between snow conditions and sheep recruitment to be – increased snowfall and snow depth, or lower air temperatures and forageable area, produce conditions where greater energy expenditure is required for survival. Dall sheep are in caloric deficit during the snow season, so benign conditions mean that ewes reach the lambing period in better condition and are potentially better able to provide for their lambs, increasing the observed lamb:ewe ratio. An alternative or complementary idea is that conditions during and after lambing are more important, as lambs require a narrower range of conditions than adult sheep to survive. From the frequency of agreement in these conditions we can select variables for use in a logistic regression model.
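The agreement-counting step can be sketched as follows: convert each series to above/below-mean indicators and count the years in which the hypothesised pairing holds. This is an illustrative stdlib-Python sketch with hypothetical yearly values, not the actual survey data:

```python
from statistics import mean

def below_mean(series):
    m = mean(series)
    return [v < m for v in series]

def above_mean(series):
    m = mean(series)
    return [v > m for v in series]

# Hypothetical yearly series (one value per survey year)
spring_snowfall = [120, 80, 150, 60, 140, 90]          # cm
lamb_ewe        = [0.25, 0.40, 0.18, 0.45, 0.22, 0.38]  # lamb:ewe ratio

# Condition 1: high snowfall co-occurring with low lamb:ewe
agreement = sum(h and l for h, l in
                zip(above_mean(spring_snowfall), below_mean(lamb_ewe)))
print(f"{agreement} of {len(lamb_ewe)} years agree")
```

Repeating this count for each metric and month/season gives the frequency-of-agreement table from which candidate predictors were chosen.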

I will present my results only for the Wrangell St Elias domain by month, however seasonal results and other domains have been analysed but not yet interpreted.


Figure 9; Wrangell St Elias results by month

Figure 10; Logistic regression result

From the simple comparison we can see that more hazardous snow conditions in the spring months (March, April, May) do not predict low lamb:ewe ratios any better than those in the winter months (December, January, February). From the logistic regression model, in which I included the variables by month that most frequently met the conditions, February forageable area and total November snowfall came out as the strongest predictors in the final model.
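The logistic regression fitted in R (e.g. via glm with a binomial family) can be sketched conceptually with a minimal gradient-descent implementation. Everything here is hypothetical: the standardised predictor values, the 0/1 labels, and the learning settings are for illustration only:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimal logistic regression via stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)            # intercept + one weight per predictor
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z)) # predicted probability
            err = yi - p
            w[0] += lr * err               # update intercept
            for j, xj in enumerate(xi):    # update predictor weights
                w[j + 1] += lr * err * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical standardised predictors: [Nov snowfall, Feb forageable area]
X = [[1.2, -0.8], [-0.5, 0.9], [0.8, -1.1], [-1.0, 1.2], [0.3, -0.4], [-0.9, 0.7]]
y = [1, 0, 1, 0, 1, 0]                     # 1 = below-average lamb:ewe ratio
w = fit_logistic(X, y)
print([round(predict(w, xi)) for xi in X])
```

In practice one would use a fitted GLM with proper standard errors and an AIC, as was done in R; this sketch only shows the shape of the model relating the two selected predictors to below-average recruitment.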

SIGNIFICANCE

Each of my blog posts / hypotheses took me further towards a better idea of how best to address and answer my research question. Hypothesis A described where SnowModel was performing best, and worst, and gives a quantifiable error to propagate through the analysis (although I have not done that here). Most importantly, it gave me reasonable confidence that the model performs reasonably well in sheep territory, and refined the areas where I can spend energy trying to improve it.

Hypothesis B, and its tests, were murkier, producing no results that really jumped out in confirmation or rejection of the hypothesis. Further work would be to aggregate the indices and snow seasons further into yearly, or September-to-August, averages and compare them to lamb:ewe ratios as well as snow and climate metrics.

Hypothesis C produced a model that rejected the original hypothesis, suggesting that snowfall in the autumn and the area available to forage in February are most important in predicting lamb:ewe ratios. However, the pseudo R-squared of this model (derived from its AIC score) is not much to shout about, and could likely be improved by introducing seasonal values and more variables into the analysis. Logistic regressions weren’t conducted in domains other than the Wrangell St Elias – it would be interesting to compare results across domains to examine whether there are range-wide similarities.

The results from hypothesis C can be used to infer whether years in which no sheep survey took place are likely to have had below- or above-average lamb:ewe ratios, giving an indication as to whether the observed population decline is a result of recruitment being affected by an increasing occurrence of hazardous snow years.

YOUR LEARNING

Using R and Rstudio for the first time was a significant feature of this class for me. I began to appreciate the value of organising data according to the tidyverse philosophy and saw incremental improvement in my abilities to conduct analyses and create figures using the packages available. I’m not yet certain whether it is an overall improvement on Matlab, which I previously used, but I do prefer the ease of ggplot2 over equivalent Matlab tools.

I was pretty unversed statistically at the beginning of this class and have gained an understanding of the principles, use, and concepts of Principal Component Analysis, Multiple Factor Analysis, spatial and temporal autocorrelation and cross-correlation, and multiple logistic regression. My future work will be much enhanced as a result.

REFERENCES

Alaska Department of Fish and Game, 2014. Trends in Alaska Sheep Populations, Hunting, and Harvests. Division of Wildlife Conservation, Wildlife Management Report ADF&G/DWC/ WMR-2014-3, Juneau.

Gelaro, R., McCarty, W., Suárez, M.J., Todling, R., Molod, A., Takacs, L., Randles, C.A., Darmenov, A., Bosilovich, M.G., Reichle, R., Wargan, K., Coy, L., Cullather, R., Draper, C., Akella, S., Buchard, V., Conaty, A., da Silva, A.M., Gu, W., Kim, G.-K., Koster, R., Lucchesi, R., Merkova, D., Nielsen, J.E., Partyka, G., Pawson, S., Putman, W., Rienecker, M., Schubert, S.D., Sienkiewicz, M., Zhao, B., 2017. The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). J. Clim. 30, 5419–5454. https://doi.org/10.1175/JCLI-D-16-0758.1

Homer, C., Dewitz, J., Yang, L., Jin, S., Danielson, P., 2015. Completion of the 2011 National Land Cover Database for the Conterminous United States – Representing a Decade of Land Cover Change Information. Photogramm. Eng. 11.

Liston, G.E., Elder, K., 2006. A Distributed Snow-Evolution Modeling System (SnowModel). J. Hydrometeorol. 7, 1259–1276. https://doi.org/10.1175/JHM548.1

Considering Beaver Dam Occurrence based on Stream Habitat and Landscape Characteristics

Filed under: Final Project @ 1:18 pm
  1. Research Question

Q1: How does the Suzuki and McComb Habitat Suitability Index (HSI) relate to observed beaver dams in the West Fork Cow Creek drainage?

 Q2: What other factors explain the selection of suitable habitat?

At the beginning of this effort, the question I was most interested in was “How well does the Habitat Suitability Index (HSI) developed by Suzuki and McComb (1998) predict beaver dams in the West Fork Cow Creek drainage, located in the South Umpqua River Basin in Southern Oregon?”  The HSI suggests that beaver dams are most likely to occur where stream gradients are less than 3%, active channel widths are three to six meters, and valley bottom widths are at least 25 meters.

The second question I had was what other variables might explain the selection of some suitable stream reaches for dam building, while others were seemingly ignored.  Other models, such as the Beaver Restoration Assessment Tool developed by Macfarlane et al. (2017) out of Utah, weight a number of other variables, including flow permanence and vegetation, as possible limitations on dam building that I wanted to consider.  However, because of dataset availability I decided instead to follow another line of inquiry from landscape ecology that considers the influence of habitat size and connectivity.

  2. Datasets:

To answer these questions I considered two primary datasets. The first dataset is NetMap, a proprietary stream network layer generated through a combination of digital elevation models (DEMs) and data collected through state and federal agency stream surveys. This included estimates of stream gradient, active channel width, and valley bottom width.  The second dataset was a collection of survey locations randomly selected from the stream network layer, stratified by locations considered suitable and unsuitable for beaver damming based on the HSI criteria.  Given the rarity of beaver dam occurrences in a watershed, however, the sample was weighted toward sites considered suitable for beaver damming. This dataset also included observations from surveys that took place in the fall months of 2017.  In total, 48 beaver dams were observed from the survey locations, principally on two streams in the drainage.

 

Map 1: West Fork Cow Creek


  3. Hypotheses:

H1: All observed dams will occur in stream reaches classified as suitable by the Suzuki and McComb HSI. 

 H2: Suitable stream reaches that are longer and located in closer proximity to other suitable reaches are more likely to be occupied than those that are smaller and more isolated.

Figure from Dunning et al. 1992 showing how habitats A and B, while both too small to support a population, may be occupied (A) if in close proximity to other habitats


  4. Approaches:

The most challenging but most informative approaches I used all related to geoprocessing in ArcMap, which I was largely unfamiliar with as a software package prior to this class.

Using SQL to identify suitable damming habitat:

The first effort involved using SQL to identify reaches in the stream network data that fit the criteria.  The language was a little clunky at first, but after a little time it was simple enough to select the appropriate thresholds for each of the habitat variables from the HSI.
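The threshold selection can be sketched as a simple filter. The field names in the SQL-style comment and the reach attribute values below are hypothetical; the thresholds are the HSI criteria stated earlier:

```python
# Hypothetical stream reaches: (name, gradient %, active channel width m,
# valley bottom width m)
reaches = [
    ("reach_a", 1.5, 4.2, 40.0),
    ("reach_b", 5.0, 3.5, 30.0),   # too steep
    ("reach_c", 2.0, 8.0, 50.0),   # channel too wide
    ("reach_d", 2.8, 5.5, 20.0),   # valley bottom too narrow
]

def suitable(gradient, acw, vbw):
    """Suzuki & McComb (1998) HSI thresholds, as described above."""
    return gradient < 3.0 and 3.0 <= acw <= 6.0 and vbw >= 25.0

# Equivalent in spirit to an ArcMap attribute query such as:
#   GRADIENT < 3 AND ACW >= 3 AND ACW <= 6 AND VBW >= 25
# (field names hypothetical)
print([name for name, g, a, v in reaches if suitable(g, a, v)])
```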

Converting suitable reaches into habitat patches:
The second approach required that I convert contiguous reaches of suitable habitat into habitat ‘patches’ so that I could eventually calculate the length of each patch and the distance of that patch to its nearest neighboring patch.  While seemingly simple, there was not one tool in Arc to accomplish this task.  All told I used a combination of functions in Arc including Buffer, Dissolve, Join and Spatial Join to accomplish this.

Calculating distance between patches:

To accomplish the last approach, I borrowed a tool designed for traffic engineers that calculates the distance from one point to another via a specified network.  In most cases the network is a series of roads, but for this problem I used the stream network polylines, since there is literature to support that beavers have high fidelity to water as they move through the landscape.
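Under the hood, a network-distance tool like the OD Cost Matrix is computing shortest paths over the network graph, which is classically done with Dijkstra’s algorithm. Below is a conceptual stdlib-Python sketch on a toy, hypothetical stream network (node names and segment lengths are made up):

```python
import heapq

def shortest_network_distance(graph, source, target):
    """Dijkstra's shortest path: the computation underlying an OD cost matrix."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue                        # stale queue entry
        for nbr, length in graph.get(node, []):
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return float("inf")                     # target unreachable

# Hypothetical stream network: node -> [(neighbour, segment length in m)]
stream_net = {
    "patch_1": [("confluence", 300.0)],
    "confluence": [("patch_1", 300.0), ("patch_2", 450.0), ("patch_3", 900.0)],
    "patch_2": [("confluence", 450.0)],
    "patch_3": [("confluence", 900.0)],
}
print(shortest_network_distance(stream_net, "patch_1", "patch_2"))  # → 750.0
```

Running this between every pair of patches, and keeping the minimum per patch, gives the distance-to-nearest-neighbor metric mapped in the results.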

 

  5. Results:

Exercise 1 produced two metrics that I mapped.  The first was the patch length, and the second was distance to the nearest neighboring patch.

Map 2: Suitable dam habitat and survey locations

 

In Exercise 2, I demonstrated that the habitat criteria from the HSI correctly identified all areas where dams were observed, but that not all reaches meeting the HSI criteria had dam sites.  In short, based on these data, the HSI criteria seemed to be necessary but not ultimately sufficient for damming.


In Exercise 3, I then looked at the relationship between patch length, distance to nearest neighbor, and a final metric that considers the size of the nearest neighbor weighted by the distance to that neighbor. As a result of aggregating contiguous stream reaches into single features, the number of patches with observed dams decreased to 4, which was problematic for a logistic regression analysis because the small sample size can lead to Type 1 errors.

 

Map 3. Habitat Patch Length based on HSI criteria

Map 4. Distance to Nearest Neighboring Patch


  6. Significance

I found two important take-home messages from these efforts. The first is that the HSI developed by Suzuki and McComb (1998) seems to work relatively well for identifying suitable dam habitat in the West Fork Cow Creek, in that observed dams all fell within the criteria thresholds.  However, there is still quite a bit of variation in the suitable habitat locations where beaver dams were observed that cannot be explained by these data.  The second take-home message is that the selection within habitats appears to coincide with the landscape hypothesis that habitats are more likely to be selected based on their size and connectivity.  Due to sample size limitations, however, the strength of evidence for this is not conclusive.

These results are important to managers because there is a growing emphasis on the use of beavers to improve stream characteristics thought to be conducive to salmonids, particularly Threatened Oregon Coast Coho Salmon (Oncorhynchus kisutch).  To date, most management efforts in this regard have focused on relocating beavers from locations where dam building is problematic (e.g. road culverts) to areas where dams will not be a nuisance and will be accessible to anadromous fish.  Yet our anecdotal understanding from conversations with managers in the Umpqua Basin is that these efforts have not been guided by an evaluation of suitable habitats for this area.  In other words, this work suggests that, at a minimum, relocation efforts should focus on streams that meet the HSI criteria.

  7. Geo-processing learning

These efforts were helpful to me in two ways.  First, it was useful to consider the HSI criteria relative to the observed dam sites that we found, which improves my confidence that these may be necessary criteria for dam building, at least in similar drainages.  Second, I had relatively little experience with ArcMap, and what I did have was from nearly a decade ago.  These exercises greatly improved my familiarity and confidence using this software.  This is particularly important as we scale this research up to the larger Umpqua Basin, where the number of stream reaches increases by a factor of 10.  I’ve even begun automating the geoprocessing steps for the patching and OD Cost Matrix in Model Builder, which surpasses any expectations I had at the beginning of the quarter.

  8. Statistical learning

From a statistical standpoint I am disappointed that I was not able to attempt more sophisticated analyses.  In particular, I had hoped to apply logistic regression and produce a predictive map to inform my survey selection for the rest of the Umpqua Basin, but that was not feasible because of my small sample sizes.  As it was, I was limited to very basic analyses, such as the permutation test, given the small and lopsided data.  Yet, as I turn to defining my protocol for surveys in the rest of the Umpqua, I have a much better appreciation for the relative rarity of dam site occurrences, and I will need to generate my sample accordingly so that I can develop a larger dataset that allows more robust statistical procedures.  Despite this, it was helpful to see how other students approached their problems and to consider which of those tools I may apply to future datasets.

References:

Dunning, J. B., Danielson, B. J., & Pulliam, H. R. (1992). Ecological Processes That Affect Populations in Complex Landscapes. Oikos, 65(1), 169–175. https://doi.org/10.2307/3544901

Suzuki, N., & McComb, W. C. (1998). Habitat classification models for beaver (Castor canadensis) in the streams of the central Oregon Coast Range. Retrieved from https://research.libraries.wsu.edu:8443/xmlui/handle/2376/1217


The Hidden Behavioral States May Still Be Hidden: Exploring the Applicability of Hidden Markov Models and Environmental Covariates for Modeling Movement Data (Exercise 3 Part 2)

Filed under: 2018,Exercise/Tutorial 3 2018 @ 10:10 am

Question Asked

Overall, my aim for Exercise 3 was to determine the extent to which environmental covariates (of the type explored in Exercise 2) could be related to spatially explicit behavioral states defined by the step length and turning angle data generated from GPS tracks of movement in Exercise 1. To operationalize the behavioral states, paired step length and turning angle measurements were generated from the raw GPS tracks, as described in my Exercise 1 blogpost. Histograms of both step length and turning angle distributions for five sample tracks revealed that two states may be emerging from the data: 1) a state characterized by small step lengths and wide turning angles and 2) a state characterized by large step lengths and very narrow (near-zero) turning angles. The emergence of these two potential behaviors from visual inspection of the histograms is characteristic of the behavioral states used to describe the movement behaviors of animals, to which hidden Markov model approaches have been applied.

Through coursework and collaboration in this class, my lab mate Jenna and I discovered the moveHMM R data analysis package, which uses hidden Markov model statistical theory to fit a two-state model of behavior driven by step length and turning angle observations. The hidden Markov model approach requires that the measured data (in this case, time-stamped X and Y coordinates) actually represent movement. The literature on the application of hidden Markov models to movement behavior suggests this requirement can be met by assuming that spatial inaccuracy does not exist within the data (i.e., the measured coordinates represent actual movement behaviors) and by regular time-stamped sampling of the GPS data (i.e., no missing data). By assuming that the measured data represent a known state (in this case, movement), the hidden Markov model uses patterns in the measured data to reveal the “hidden” underlying states in the data. In this application, the hidden states being modeled are two behavioral states defined by combinations of co-occurring step length and turning angle movements. This approach is well documented in the movement ecology literature for understanding transitions between two states of movement in animals, using tracking approaches such as GPS or telemetry. Given its applicability for the study of the movement of animals, Jenna and I thought it would be an interesting approach for understanding the movement behavior of people on the landscape.

Additionally, covariates can be explored to understand how a co-occurring environmental variable, changing in space and time along with the step length and turning angle data, may or may not correlate with shifts between behavioral states. In this way, landscape-level environmental data of the type explored in Exercise 2 (i.e., vegetation type and elevation) can be related to spatially explicit behavioral data. It should be noted that for Exercise 3, bathymetry and distance to shoreline were the environmental variables used. I previously explored land-based vegetation cover class and elevation as environmental variables in Exercise 2; however, I was subsequently able to get the bathymetry data originally intended for Exercise 2 to work with my dataset in ArcGIS. Therefore, instead of continuing with land cover vegetation class and elevation, I will be exploring distance to shoreline and bathymetry as behavioral covariates in Exercise 3 Part 2, as these data co-occurred in space with the GPS movement data.

Given the above background, my research questions are as follows:

  1. How does bathymetry influence the transition probability between a movement-based behavioral state and a stationary behavioral state?
  2. How does distance to shoreline influence the transition between a movement-based behavioral state and a stationary behavioral state?
  3. Which covariate, if either, has more explanatory power?

Tool/Approach Used

I used the R package moveHMM (Michelot, Langrock, & Patterson, 2016) to run the hidden Markov models, generate behavioral probabilities, and run an AIC analysis on my fit models. For data preparation, I used the fitdistrplus R package to define parameter estimates and distributions (see Exercise 3 Part 1 post) and ArcGIS spatial analyst tools (see Exercise 2 post) to relate bathymetry and distance to shoreline data to each step length and turning angle observation in the dataset.

Description of Steps Used to Complete the Analysis

Many of the initial preparatory data wrangling and formatting steps used to set up the Exercise 3 Part 2 analysis are documented in blogposts for Exercises 1, 2, and 3 Part 1. The below list describes steps that were taken as part of this analysis that have been previously described in other posts. New analytic steps are subsequently described:

Previously completed workflows:

See Exercise 1 blogpost for a description of the steps used to generate step length and turning angles from raw GPS data for this analysis using the “prepData” function in moveHMM.

See Exercise 2 blogpost for a description of how raster-based elevation data were related to point-based movement data using the Extract Values to Points tool in ArcGIS. Similarly, see Exercise 2 blogpost for a description of how distance to shoreline was calculated for each point-based movement data observation using the Spatial Joins tool in ArcGIS.

See Exercise 3 Part 1 blogpost for a description of how initial distribution and parameter estimates were generated for step length and turning angle inputs for the moveHMM tool.

New analytic workflows:

New steps completed to run the moveHMM tool and explore model fits, visualize results, and calculate model AIC values are as follows. The below-described steps are adapted from published moveHMM workflows (Michelot, Langrock, & Patterson, 2017; Michelot, Langrock, & Patterson, 2016).

  1. Prior to fitting the moveHMM model, define distribution parameters for step length and turning angle. The default distribution for step length is gamma and the default distribution for turning angle is von Mises. If other distributions are used, they must be defined in the fitHMM function.
  2. Run model using the “fitHMM” function. Define input data (must be generated through prepData function), number of behavioral states, distribution parameters (define prior and then call in command), and covariates. This step generates numeric and text output reporting the model parameter estimates for each behavioral state.
  3. Generate a visual summary of the model, including density histograms for step length and turning angle, transition probabilities for behavioral states, and spatially explicit behavioral state occurrences through “plot(model)” function.
  4. Generate visual summaries of state transitions through “plotStates(model)” for each track.
  5. Run AIC on model(s) to provide a measure for which model is statistically favored.
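The covariate mechanism invoked in step 2 can be sketched conceptually. moveHMM links the state transition probabilities to covariates through a multinomial logit function, which for a two-state model reduces to the logistic form below. The coefficient values here are hypothetical (the real model estimates them from the data); the sketch only shows how a covariate such as distance to shoreline shifts the transition probabilities:

```python
import math

def transition_matrix(beta, covariate):
    """2-state transition probabilities with a logit link on a covariate:
    logit(gamma[i][j]) = b0 + b1 * covariate for each off-diagonal entry."""
    gamma = [[0.0, 0.0], [0.0, 0.0]]
    for i, j in ((0, 1), (1, 0)):
        b0, b1 = beta[(i, j)]
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * covariate)))
        gamma[i][j] = p           # probability of switching state
        gamma[i][i] = 1.0 - p     # probability of staying put
    return gamma

# Hypothetical coefficients: leaving the stationary state (0 -> 1)
# becomes more likely as (standardised) distance to shoreline increases
beta = {(0, 1): (-2.0, 1.5), (1, 0): (-1.0, -0.5)}
near = transition_matrix(beta, 0.0)   # at the shoreline
far = transition_matrix(beta, 2.0)    # far from the shoreline
print(near[0][1], far[0][1])          # switching probability rises with distance
```

Comparing these matrices across covariate values is exactly what the transition-probability plots produced in step 3 visualize.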

Description of Results Obtained

Overall, the results obtained using this tool varied wildly. The results presented below are those that I think best represent the data and shed light on the research questions posed at the beginning of the exercise. Results are presented for each of the analysis steps listed above.

Results: Define distribution parameters for step length and turning angle and run model

This part of the exercise turned out to be more difficult than anticipated. I had hoped that by doing a thorough exploration of my data, both in aggregate and segmented into data-driven two-state bins, the parameter estimates defined by the movement data would map well into the moveHMM analytic environment. Unfortunately, this was not the case. The first round of parameter estimates I tried in the model-fitting process were those derived from the gamma and wrapped Cauchy distributions identified in Exercise 3 Part 1. The result was a model that included only 1 state (state 2) and had standard deviation estimates for the model parameters of infinity. These results were not expected, and suggested that the model was not a good fit for the data.

Given that the original, data-driven estimates did not result in a well-fitting model, I began trying various combinations of parameter estimates seen in the literature and developed through my own knowledge of travel rates. Additionally, I reverted to the two default distributions in the moveHMM tool (gamma for step length and von Mises for turning angle), as the tool seemed to perform better with those distributions. Table 1 reports the set of initial parameter estimates derived from the fitdistrplus tool and a second set of parameter estimates, developed through trial and error, that was ultimately used in model development.

Table 1. Parameter estimates for two model fitting trials.

Figure 1. Model outputs for the initial parameters (left panel). The initial parameters resulted in only one behavioral state (State 1), with a mean value of 0 for State 2, which essentially indicates a non-state: the gamma distribution is strictly positive and does not admit a zero mean.

Given the poor results from the initial parameters, the best fitting model parameters were used to fit two models for exploration in this exercise: a two-behavioral state model with distance to shore as a covariate for behavior transition and a two-behavioral state model with bathymetry as a covariate for behavior transition. Figure 2 reports the model outputs for these model runs.

Figure 2. Results from the model outputs for the best fitting parameters with distance to shore as a covariate (left panel) and bathymetry as a covariate (right panel). The distance to shore model suggests a two-state model, with one state characterized by shorter step lengths (mean of approximately 1 meter) and wide turning angles (mean = 3.1 radians, i.e., near-reversals) and a second state characterized by longer step lengths (mean of approximately 224 meters) and narrow turning angles (mean = -0.014 radians). The bathymetry model suggests a two-state model, with one state characterized by shorter step lengths (mean of approximately 2.5 meters) and wide turning angles (mean = 3.1 radians) and a second state characterized by longer step lengths (mean of approximately 262 meters) and narrow turning angles (mean = -0.007 radians).

Visual Presentation of Results and AIC Analysis

Results of the AIC analysis to determine the favored model suggest that model 2, the bathymetry model, is favored over model 1, the distance to shore model, given AIC values of 72708.23 and 82097.23, respectively. Given these results, the remaining visualizations and interpretation of results are provided for Model 2, with bathymetry as a covariate.

Figure 3. Density histogram of step length states for the Bathymetry Model. The figure suggests that the distribution for state 2, the movement oriented state, has a higher density across step lengths from slightly greater than 0 to approximately 300 meters in length than state 1, which peaks right around a step length of 0.

 

Figure 4. Density histogram of turning angle states for the Bathymetry Model. Turning angles for State 2 cluster around 0, which is expected given the narrow mean turning angle reported for this state. The curve for State 1 is unexpected, given that the mean turning angle reported for this state is around 3 radians.

Figure 5. Example output for the location of behavioral states in space for track ID 2. In theory, the figure would show blue for portions of the track during which the individual is in state 2 and orange for portions during which the individual is in state 1. At this scale, state 1 is hard to see, but there are small clusters around the beginning and end of the trip that show some orange track pieces. This figure shows that for individual 2, the majority of the track is in state 2 behavior rather than state 1 behavior.

Figure 6. Distribution of behavioral states across time for track ID 2. The figures show that the majority of the movement track is spent in state 2, the moving state, and that state 1, the resting state, occurs infrequently throughout the movement track.

Figure 7. Transition probability matrix for the influence of bathymetry on transitions between behavioral states. The four transition probability plots show that as bathymetry (water depth) increases, the likelihood of staying in state 1, the resting state, drops dramatically at a small increase in bathymetry and then remains at 0. Similarly, at small increases in bathymetry, the probability of transitioning from state 1, the resting state, to state 2, the moving state, increases rapidly and then stabilizes at a probability of 100%.
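Plots of this kind come directly from moveHMM's built-in plotting methods. A minimal sketch, assuming a fitted model object m:

```r
library(moveHMM)

# The default plot method includes transition probabilities plotted
# against the covariate, with optional confidence bands:
plot(m, plotCI = TRUE)

# Stationary (long-run) state probabilities as a function of the covariate:
plotStationary(m, plotCI = TRUE)
```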

Returning to my research questions for this exercise, I have reached the following conclusions:

  1. How does bathymetry influence the transition probability between a movement-based behavioral state and a stationary behavioral state?

As bathymetry increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning between a stationary state and a movement-oriented state increases rapidly and abruptly as bathymetry increases.

  2. How does distance to shoreline influence the transition between a movement-based behavioral state and a stationary behavioral state?

(Results not pictured in this exercise) As distance to shore increases, the likelihood of staying in a stationary behavioral state, if already in a stationary behavioral state, decreases rapidly. Similarly, the likelihood of transitioning from a stationary state to a movement-oriented state increases rapidly and abruptly as distance to shore increases.

  3. Which covariate, if either, has more explanatory power?

Through an AIC analysis comparing the Bathymetry Model and the Distance to Shore Model, the Bathymetry Model was favored over the Distance to Shore Model as the preferred model for modeling two-state movement behavior.

Despite being able to answer the three research questions I originally posed, I am not yet confident enough to conclude that these results, or the modeled two-state behaviors, are valid in their current estimations. The critique below identifies why I am cautious about the presented results.

Critique of Method

Conceptually, I think the application of hidden Markov models and the use of the moveHMM tool has great potential for exploring movement data outside the realm of animal-based movement. The types of analyses the tool is capable of are exciting, particularly given the ability to relate covariates to the movement data. However, I have several critiques of the method that arose through my experimentation in Exercise 3:

  1. Additional guidance is needed for establishing null distributions and parameters in the model. The moveHMM package documentation states that establishing the movement parameters for each behavioral state, prior to running the model, is the most important component of the modeling process. Given that caveat, a high degree of familiarity with the movement data, and with the population under study, is needed in order to run the two-state behavioral model. The model therefore does not seem overly exploratory in nature, but rather a tool to estimate the transition probabilities between two known states, not a method to identify the behavioral states themselves. The literature does not present hidden Markov models in this light, and I think the amount and degree of background knowledge needed about the expected model outcomes should be discussed more candidly in the literature on applying hidden Markov models to movement data. It essentially felt like I was trying random combinations of parameters in order to get the models to run. Part of the struggle is certainly a function of being a beginner at this type of modeling.
  2. Better error messaging is needed in moveHMM. When working with the tool, the errors returned were difficult to interpret and, given the newness of the tool, there is not yet an established online community providing responses to hiccups in it. When I got errors about value estimates outside the parameters or missing fields, I had difficulty identifying exactly where I was going wrong, even though I was using the published vignettes as guides for working with my own data. More descriptive error handling would make the tool easier to use in the future.
  3. Small sample size and a combined sample led to erratic results. The sample of tracks modeled represents only 5 movement patterns, and in reviewing these patterns they are quite distinct from each other. I think the model would perform better on a larger sample of data, which would help iron out some of the disparities in the parameter estimates. Additionally, the modeling exercise presented assumes that one distribution per state is appropriate for the whole population, so any variability in the individual track data is minimized in the model. This is only a helpful approach if the researcher believes that all actors behave similarly; that is a big assumption for my data, and given this modeling exercise I do not think it necessarily holds.

For these reasons, the numeric results presented should not be interpreted literally, but rather as an exercise in how the tool could be applied in the future with additional data and a more involved initial understanding of the behaviors of interest.

Using correlation-based techniques to investigate population trends in Bull Kelp (Nereocystis luetkeana) in southern Oregon

Filed under: Final Project @ 9:52 am

The Question: I was initially looking to explore correlation between a kelp canopy coverage data set and a suite of environmental variables. My question morphed into examining the correlation between canopy cover and two temperature data sets.

The Data: I used three datasets to investigate this question. The first was a 35-year time series of kelp canopy cover in southern Oregon derived from Landsat satellite imagery. This dataset was shared with me by my colleague Tom Bell and reported the percent cover of each 30 m x 30 m pixel whenever cloud-free imagery was available. The other two datasets were time series of temperature on the Oregon coast, one derived from satellite imagery and the other from direct measurements of intertidal temperature near Port Orford. The satellite-derived dataset was over 35 years long but came at a resolution of 0.1-degree raster cells, meaning that it represented offshore sea temperatures more so than local, nearshore temperatures. The intertidal dataset measured nearshore water temperatures within 20 miles of most of my kelp beds; however, that time series covered only about half of the kelp time series, from late 1999 to the present.

 

The Hypotheses: Worldwide, warming temperatures have been implicated in global declines of kelp populations (Wernberg et al., 2016; Filbee-Dexter et al., 2016). Perhaps even more important than the direct effects of temperature is the fact that ocean temperature is closely correlated with nutrient availability, which is of crucial importance to these fast-growing primary producers. However, in many local or regional studies, temperature does not necessarily emerge as one of the top drivers of kelp growth and biomass. For example, in southern California, giant kelp biomass was not significantly impacted by the extreme 2013-2015 warm water anomalies that affected the Pacific coast of North America (Reed et al., 2016). Furthermore, several studies on giant kelp (Macrocystis pyrifera) in southern California have identified wave intensity as one of the primary factors controlling giant kelp biomass (Bell et al., 2015; Parnell et al., 2010). Biotic interactions, such as competition with understory kelp and herbivory by urchins, can also tightly control kelp populations (Dayton and Tegner, 1984). Overall, untangling the relative importance of abiotic drivers of kelp populations is a highly context-dependent business. In southern Oregon, I expect temperature/nutrient availability to have an effect on kelp biomass, but for that effect to be moderated by the presence of urchins, wave events, and climate oscillations.

 

The Methods: To investigate the relationship between kelp and temperature, I employed three methods:

1) Autocorrelation and cross correlation to investigate temporal correlation within and between the two biggest kelp patches in southern Oregon.

2) Graphical representations and cross correlation to examine patterns between satellite-derived sea surface temperatures and local, directly measured nearshore temperatures.

3) Interpolation of my kelp time series to monthly values using polynomial splines, in order to apply wavelet and cross-wavelet analysis and look for areas of shared power within and between the kelp time series and the two temperature time series.
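The three methods can be sketched in R with base-stats correlation functions and a wavelet package. The series names (rogue_kelp, orford_kelp, kelp_cover, sst) and the numeric time vectors here are hypothetical placeholders, and biwavelet is one possible choice of wavelet package rather than necessarily the one used in this project:

```r
# 1) Temporal correlation within and between the two largest patches:
acf(rogue_kelp)               # autocorrelation within one patch
ccf(rogue_kelp, orford_kelp)  # cross-correlation between patches (lag in years)

# 3) Spline-interpolate the irregularly sampled kelp series onto a
# monthly grid before wavelet analysis:
monthly <- spline(x = obs_time, y = kelp_cover, xout = monthly_time)

# Wavelet and cross-wavelet analysis; each input is a two-column
# (time, value) matrix:
library(biwavelet)
wt(cbind(monthly_time, monthly$y))                             # kelp power
xwt(cbind(monthly_time, monthly$y), cbind(monthly_time, sst))  # kelp vs. SST
```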

The Results and the Significance

Exercise 1: Autocorrelation and cross correlation were quite limited in the kelp patches I looked at, even at the yearly scale (e.g., Fig 1). This suggests a) that the species responds quickly to changing environmental conditions rather than holding on to momentum from previous population sizes, and b) that local factors (less than 20 km apart) are more important in driving kelp population size than regional factors.

Figure 1: Cross Correlation between maximum annual kelp cover at Rogue Reef and Orford Reef over a 35 year time series. The lag is in years.

While this finding was one of the simplest, it was also one of the most significant for me. This finding suggests that focusing on local, patch-specific dynamics will be important in untangling environmental drivers of kelp in Oregon. I’ve already utilized this finding when responding to feedback from managers. An ODFW employee suggested that my satellite-derived kelp time series was incorrect because it showed moderately sized populations since 2014. He said that Orford Reef, one of the largest kelp beds in the state, had been practically non-existent in past years, so there was no way regional populations were at anything other than historical lows. I looked into patch-specific dynamics over the past 4 years and found that, while Orford Reef had indeed gone essentially to zero since 2014, other patches in Oregon had recovered in that same time period and were bolstering regional kelp cover numbers.

Exercise 2: Overall, the relationship between SST and nearshore temperatures was fairly consistent. The temperatures matched very well in the winter, but nearshore temperatures warmed less, and more slowly, in the summer months. The difference between the mean annual temperatures of the two datasets was consistent as well, usually staying between 1.1 and 1.5 degrees (Fig 2). The rankings of the warmest versus coolest years were also very similar between the datasets.

Figure 2: Difference in annual mean temperature between satellite derived sea surface temperature (black) and measured, nearshore temperatures (red) from 2000-2016. Note how consistent the difference between the two is.

While I can use satellite SST to infer average nearshore temperatures, it will not give me good insight into the variability of nearshore temperature, which could be important to kelp population regulation. If I want to try to incorporate this variability into my analyses, it may be useful to look further into the effect of upwelling and terrestrial water cycles on nearshore temperatures. However, considering a) that much of the interannual variability in temperatures (hottest to coolest years) is similar between satellite and nearshore temperatures and b) how much time it might take to more finely predict local temperatures from satellite temperatures, I think that the satellite temperatures may be good enough to use for most of my analyses.

Exercise 3: According to wavelet analysis, strong annual summer peaks in canopy were not a consistent pattern in my kelp time series, but rather only in the 1985-1991, 1999-2000, and 2014-2017 periods (Fig 3). This inconsistency extends to the cross wavelet analysis of kelp and temperature, where only certain years have high-power, annual cycles. Another important finding from this exercise was that sometimes there appear to be high-power interannual oscillations between kelp and temperature, suggesting climate oscillations may also be influencing kelp cover.

For me, there are two important takeaways from this analysis. First, it is particularly interesting to see a high degree of power in the 2014-2017 period because in these years a) we have had a boom in urchin populations, b) we have had anomalously high water temperatures, and c) kelp in northern California has collapsed. All of these other factors suggest that annual kelp population oscillations would have a smaller amplitude, not a larger one. Second, the high-power periods and years for temperature (both satellite and intertidal) were not necessarily the same as those for kelp. To me, this indicates something other than a 1:1 relationship between temperature and kelp. It suggests that I need to explore other environmental variables, and that I might need to explore them in a multivariate analysis rather than multiple univariate analyses.

Figure 3: A) Wavelet analysis of the kelp canopy time series and B) Cross-Wavelet analysis of kelp canopy and satellite temperature. Areas of high-power are in red. The black line in A represents the power ridge and the black arrows in B represent whether the two are in phase (right), out of phase (left), x leading (down) or x lagging (up).

I think my next steps with this project will be to use PCA to try to identify the environmental factors most strongly correlated with kelp canopy cover. Incorporating my findings from Exercise 1, I will utilize PCA not only on the regional Oregon kelp time series, but also on the local population time series from individual patches. In this way, I will be able to look at regional drivers of kelp canopy cover as well as the extent to which local drivers may differ a) between patches and b) from the regional drivers.

 

The Learning: I almost exclusively utilized R for this class, and gained familiarity with interpolation techniques as well as a number of very user-friendly, useful packages for performing correlation analyses. Most of the techniques I utilized dealt with autocorrelation in one form or another, but I learned about a number of other spatio-temporal statistics via the student presentations. In particular, I'm excited to utilize Geographically Weighted Regression to better understand local kelp patch correlations with various environmental variables, and PCA to identify the environmental variables most closely correlated with kelp cover.

The References:

Bell, T. W., Cavanaugh, K. C., Reed, D. C., & Siegel, D. A. (2015). Geographical variability in the controls of giant kelp biomass dynamics. Journal of Biogeography, 42(10), 2010-2021.

Dayton, P. K., & Tegner, M. J. (1984). Catastrophic storms, El Niño, and patch stability in a southern California kelp community. Science, 224(4646), 283-285.

Filbee-Dexter, K., Feehan, C. J., & Scheibling, R. E. (2016). Large-scale degradation of a kelp ecosystem in an ocean warming hotspot. Marine Ecology Progress Series, 543, 141-152.

Reed, D., Washburn, L., Rassweiler, A., Miller, R., Bell, T., & Harrer, S. (2016). Extreme warming challenges sentinel status of kelp forests as indicators of climate change. Nature communications, 7, 13757.

Wernberg, T., Bennett, S., Babcock, R. C., de Bettignies, T., Cure, K., Depczynski, M., … & Harvey, E. S. (2016). Climate-driven regime shift of a temperate marine ecosystem. Science, 353(6295), 169-172.


June 10, 2018

Fitting Distributions for Two Spatial Data Behavioral Measures: Step Length and Turning Angle (Exercise 3, Part 1)

Filed under: Exercise/Tutorial 3 2018 @ 10:11 pm

Question Asked

For Exercise 3, I wanted to explore the degree to which two environmental covariates influence the transition probabilities between and among two behavioral states using a hidden Markov model approach. To operationalize the behavioral states, paired step length and turning angle measurements were generated from the raw GPS tracks, as described in my Exercise 1 blogpost. Histograms of both step length and turning angle distributions for five sample tracks revealed that two states may be emerging from the data: 1) a state characterized by small step lengths and wide turning angles and 2) a state characterized by large step lengths and very narrow (near zero) turning angles. The emergence of these two potential behaviors from visual inspection of the histograms is characteristic of the behavioral states used to describe the movement of animals, to which hidden Markov model approaches have been applied.

In order to fit a hidden Markov model to the step length and turning angle data using the moveHMM tool (Exercise 3, Part 2), the null distributions and associated parameters for step length and turning angle must be defined for each state. The moveHMM documentation states that one of three possible distributions must be chosen for step length (gamma, Weibull, or lognormal) and one of two possible distributions for turning angle (von Mises or wrapped Cauchy). Therefore, for Exercise 3 Part 1, I pose the following research questions about the distributions of step length and turning angle:

  1. Of gamma, lognormal, and Weibull distributions, which distribution best fits the step length dataset?
    1. For the best fitting distribution, what are the associated model parameters for state 1 and state 2 behaviors?
  2. Of von Mises and wrapped Cauchy, which distribution best fits the turning angle dataset?
    1. For the best fitting distribution, what are the associated model parameters for state 1 and state 2 behaviors?

Tool/Approach Used

Earlier in the term, another classmate, faced with a similar need to define distributions for her data, discovered and presented on an R package called fitdistrplus (Delignette-Muller & Dutang, 2014), which allows various distributions to be fitted to user-provided data. Remembering that she had presented on Weibull, lognormal, and gamma distributions, I decided to answer my research questions using the fitdistrplus R package.

Description of Steps Used to Complete the Analysis

My overall approach to the analysis was exploratory, following a learn-by-doing approach to understanding the best inputs for fitting the model. Of the variety of model-fitting options available within the fitdistrplus R package, I used the following three approaches on my data:

  1. Maximum likelihood estimation
  2. Maximum goodness of fit estimation with a Cramer-von Mises distance
  3. Maximum goodness of fit estimation with a Kolmogorov-Smirnov distance

I selected the approaches listed above because, of the methods provided by the package (moment matching estimation and quantile matching estimation are also options), these three seemed the most straightforward, were referenced by other researchers in the literature and in discussion, and would provide the parameter estimates needed in later analytical phases.

I also generated parameter estimates for three different slices of my data. First, I generated parameter estimates for all of the step length and turning angle data. Then I generated estimates for a rough cut at two behavioral states for the step length data: step lengths less than 400 meters (state 1) and step lengths of 400 meters or greater (state 2). This cutoff point emerged from my data as a trough in an otherwise somewhat bimodal step length distribution, with one grouping of small step lengths (100 meters or less) and another of larger step lengths (around 600 meters).

To fit the individual models, I used the “fitdist” function for each combination of data slice and model fitting method, and I used the “plot” function on the fitted objects to generate visualizations of the distributions.
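The fitting calls can be sketched as follows. The step length vector (steps) and the 400 m split are placeholders standing in for the data described above; the method and gof arguments are fitdistrplus's names for the three fitting approaches listed earlier:

```r
library(fitdistrplus)

state1 <- steps[steps < 400]  # rough state 1 slice of step lengths

# The three fitting approaches, shown for the gamma distribution:
f_mle <- fitdist(state1, "gamma", method = "mle")                # max. likelihood
f_cvm <- fitdist(state1, "gamma", method = "mge", gof = "CvM")   # Cramer-von Mises
f_ks  <- fitdist(state1, "gamma", method = "mge", gof = "KS")    # Kolmogorov-Smirnov

# Diagnostic panel (histogram/density, Q-Q, P-P, CDF) for one fit:
plot(f_mle)

# Overlaid theoretical densities for the three candidate distributions:
fits <- list(fitdist(state1, "gamma"),
             fitdist(state1, "lnorm"),
             fitdist(state1, "weibull"))
denscomp(fits)
```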

Description of Results Obtained

Through my exploration with the tool, the various methods for fitting models, and my three different data slices, I ended up generating parameter estimates for 45 different model/fitting method/data combinations. Of those combinations, the results presented below were the most interesting and influential in my conclusions:

Step Length

Figure 1 displays the histogram and theoretical densities for the step length dataset as a whole, showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model fitting technique. Among the three model fitting techniques tested, I did not see any practically significant differences between the curves produced or the estimated model parameters. Looking at the step length data all together, I concluded that the Weibull and gamma distributions did not differ from each other in any practical sense, at least for the dataset as a whole. Therefore, for the remainder of this blogpost I'll present figures from the maximum likelihood estimation method, which is the default for fitdistrplus. I also noticed the bimodal nature of the step length histogram (x axis labeled “data”), which suggested that I should divide the data into two sets in order to generate parameter estimates for two behavioral states.

Figure 1. Distribution fit using maximum likelihood estimation for the step length dataset as a whole

Figure 2 displays the histogram and theoretical densities for the state 1 slice of the step length dataset (step lengths less than 400 meters), showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model fitting technique. Looking at the state 1 distributions, I noticed that the gamma distribution appeared to be intermediate between Weibull, the more conservative, and lognormal, the more extreme. Given that the lognormal distribution appeared to be influenced more by the height of the frequency distribution and the Weibull more by its width, the gamma distribution emerged as a potential contender for modeling the data.

Figure 2. Distribution fit using maximum likelihood estimation for the State 1 of the step length dataset

Figure 3 displays the histogram and theoretical densities for the state 2 slice of the step length dataset (step lengths of 400 meters and greater), showing curves for the Weibull, lognormal, and gamma distributions, generated using maximum likelihood estimation as the model fitting technique. Looking at the state 2 distributions, noticeable differences emerged between the Weibull and gamma/lognormal curves. The Weibull curve seemed to be much more influenced by the height of the frequency distribution than by its width. Some background exploration of my data revealed that the height was being driven primarily by a single track (of the set of 5 tracks under exploration in this class). Therefore, to avoid unduly weighting one track's behavior over the others', I ruled out Weibull for use in the moveHMM tool.

Figure 3. Distribution fit using maximum likelihood estimation for the State 2 of the step length dataset

Between gamma and lognormal, gamma is more frequently used by movement ecologists in modeling step length data. Given that the distribution explorations did not show practical differences between lognormal and gamma, I felt confident moving forward with the gamma parameter estimates. The estimates for the gamma distribution parameters (shape and rate) for each state are as follows:

State 1 Shape = 2.07, Rate = 0.031

State 2 Shape = 89.09, Rate = 0.140

Turning Angle

The turning angle parameters showed less variability from the outset of the project; my use of the fitdistrplus tool for turning angle was more for my own learning than for deriving data-driven parameter estimates for the moveHMM model. Through the fitdistrplus tool I wanted to explore whether wrapped Cauchy or von Mises was a better fit for the data. Unlike the step length plots, I was unable to create comparative plots for the wrapped Cauchy and von Mises distributions (this might be user error). Instead, I generated the diagnostic plots below for each distribution for each state. I ran the model fitting analyses for turning angle using the states defined by step length presented previously; I do not yet know enough about either distribution, or about the turning angle data generally, to make an informed estimate of the values for each state. Figures 4 and 5 present the diagnostic plots for the wrapped Cauchy distribution. Ultimately, I selected this distribution because the Q-Q and P-P plots for wrapped Cauchy indicated a marginally better model fit than von Mises; however, both sets of diagnostic plots indicated well-fitting distributions, with little deviation of the scatter points from the plot lines.

Figure 4. Diagnostic plots for the maximum likelihood estimation fit of the wrapped Cauchy distribution for State 1 turning angles

 

Figure 5. Diagnostic plots for the maximum likelihood estimation fit of the wrapped Cauchy distribution for State 2 turning angles

The wrapped Cauchy parameter estimates (location and concentration) for the two states are as follows:

State 1 Location = -0.014, Concentration = 0.15

State 2 Location = -0.015, Concentration = 0.06

Critique of Method

Overall, I found the tool very useful for comparing and contrasting the fit of various distributions to my step length and turning angle data. It was convenient and straightforward to use a single tool to fit the various distributions, rather than having to use separate tools for step length and turning angle and/or for the various distributions.

My critique of the tool lies in my own lack of knowledge of how best to interpret the outputs of the distribution fits. I appreciated that the tool presented both numeric parameter estimates and comparative plots, but my unfamiliarity with each of the distributions I was fitting made it difficult to evaluate the output. Through online research, the moveHMM documentation, movement ecology publications, and the fitdistrplus documentation, I was able to put together enough of a working understanding to make sense of the output; however, my interpretation could certainly be helped by a deeper understanding of the distributions and model fitting techniques themselves.

References

Delignette-Muller, M.L., & Dutang, C. (2014). fitdistrplus: An R package for fitting distributions. Journal of Statistical Software 64(4) [online] https://www.jstatsoft.org/article/view/v064i04.

Logistic Regression of Plant “Vulnerability” Against Three Explanatory Variables

Filed under: Exercise/Tutorial 3 2018 @ 3:46 pm

QUESTION:

In this exercise, I used logistic regressions to investigate whether three explanatory variables (growing degree days, soil temperature, soil moisture) could be related to changes in the probability of finding vulnerable versus invulnerable plants in my 2017 data. Logistic regressions allow us to fit models to probabilities that range between 0 and 1 (Hosmer et al., 2013). Vulnerable plants were defined as plants with any vulnerable capitula present, whereas plants were defined as invulnerable if they had only invulnerable stages (vulnerable stages: primordia, buds, young flowers, and flowers; invulnerable stages: fruit and dehisced fruit). Based on these definitions, I created a binary response variable at the plant level:

vul.plant = 1 for a plant having any vulnerable capitula
vul.plant = 0 for a plant that has only invulnerable capitula

Because all flowering capitula begin at a vulnerable stage (primordia, bud) and pass through to invulnerable stages (fruit, dehisced fruit), a baseline expectation was that early season samples would have a probability of vulnerability close to 100%, and that samples late in the season would have a probability of vulnerability close to 0%. (Prediction 1): growing degree days are expected to explain a significant degree of the change in probability from 100% to 0%.

Previous analysis showed that warmer soil temperatures (˚F) were linearly associated with higher growing degree days, so I expected that soil temperature should explain some degree of the change in probability of vulnerability, both through this correlation with growing degree days and also through directly impacting flowering phenology. (2) Higher soil temperature is expected to be associated with decreased probability of vulnerability. On the other hand, the correlation between growing degree days and soil moisture (measured as percent soil moisture) was negative, and neither as strong nor as consistent as that with soil temperature. For these reasons, soil moisture might explain a portion of the change in probability of plant vulnerability, and (3) higher soil moisture is expected to be associated with increased probability of vulnerability.

Finally, I also wanted to ask whether the patterns shown in the logistic regressions I performed would differ between the five surveyed sites represented in the 2017 data.

APPROACH:

To complete this analysis, I used the generalized linear model function glm() to perform logistic regressions on data prepared in tutorial 2. Because soil temperature measurements were significantly impacted by the time of day at which they were taken, I removed the linear trend between measured soil temperature and time of day, creating an adjusted soil temperature (also in ˚F). Soil moisture was used directly.
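Mechanically, this detrending amounts to fitting a linear model of soil temperature on time of day and keeping the residuals (re-centered on the overall mean so the result is still in ˚F). A minimal sketch with invented values; the column names soil_temp and time_of_day are placeholders, not the actual dataset's:

```r
# Illustrative data: soil temperature (F) measured at different times of day
dat <- data.frame(
  time_of_day = c(9, 10, 11, 13, 14, 16),   # decimal hours
  soil_temp   = c(52, 54, 57, 61, 63, 66)   # degrees F
)

# Fit the linear time-of-day trend
trend <- lm(soil_temp ~ time_of_day, data = dat)

# Adjusted soil temperature = residuals re-centered on the overall mean (still F)
dat$soil_temp_adj <- residuals(trend) + mean(dat$soil_temp)
```

By construction, the adjusted values are uncorrelated with time of day while keeping the original mean.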

The binary response variable for plants was created using the dplyr package and the summarise() function to collapse per-capitula phenology scores to plant-level presence/absence scores for each flowering stage, then further collapsing the presence/absence scores for six stages into two classes, “vulnerable” and “invulnerable,” as defined above. Similar methods were used to create a binary variable for the presence of virulent larvae (where virulent was defined as larval stages L4 & L5), and a third binary variable for phenological synchrony, defined as the presence on a single plant of both vulnerable capitula and virulent larvae. These additional binary variables are not analyzed in this exercise.

METHODS:

  1. Create binary response variables – this was a little tricky because the initial data structure was plant-level counts of capitula by stage. Using the dplyr package and the group_by and summarise functions, I grouped the data by all plant-level variables I wished to keep in the resulting dataset, and then used summarise to create 2 new binary response variables: presence of vulnerable capitula (vulnerable capitula present = 1, none present = 0), and presence of virulent larvae (virulent larvae present = 1, none present = 0). I created a third binary response variable by multiplying the first two to document “phenological synchrony,” the co-occurrence on a single plant of vulnerable capitula and virulent larvae.
  2. Once data were properly wrangled, I used the glm() function with family set to binomial to fit logistic regression models of the binary response variable of interest (plant vulnerability) as a function of included explanatory variables.
  3. A summary() of the glm object gave estimates of the model coefficients and their significance levels (z scores, p-values). Based on methods outlined in the DataCamp module on logistic regressions, I used the predict() function to create a list of fitted probabilities and checked how often these correctly classify plants as vulnerable or invulnerable (https://www.datacamp.com/community/tutorials/logistic-regression-R).
  4. To visualize the results as a smooth curve of predicted probabilities given by the glm object, I again used the predict() function with a regular sequence of x values spanning the range of the explanatory variable, and plotted these together using base plot or ggplot functions. For each explanatory variable, I eventually created a single plot with an overall model fit and models fit to data from each individual site, in order to visually compare patterns/responses at different sites.
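The glm() workflow in steps 2–4 can be sketched as follows, using simulated data in place of the field data; the variable name gdd and the coefficient values are illustrative only:

```r
# Simulate a vulnerability response declining with growing degree days (hundreds)
set.seed(1)
gdd <- runif(200, 3, 11)
vul.plant <- rbinom(200, 1, plogis(10 - 1.4 * gdd))

# Step 2: logistic regression via glm() with a binomial family
fit <- glm(vul.plant ~ gdd, family = binomial)

# Step 3: coefficients, z scores, p-values; then classification success rate
summary(fit)
fitted_class <- ifelse(fitted(fit) > 0.5, 1, 0)
mean(fitted_class == vul.plant)   # proportion classified correctly

# Step 4: smooth curve of predicted probabilities over the predictor range
grid <- data.frame(gdd = seq(3, 11, length.out = 100))
grid$prob <- predict(fit, newdata = grid, type = "response")
plot(gdd, vul.plant, xlab = "Growing degree days (hundreds)",
     ylab = "P(vulnerable)")
lines(grid$gdd, grid$prob)
```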

RESULTS:

With plant vulnerability as the response variable, I fit a model with growing degree days (in hundreds) as the single explanatory variable to confirm my initial expectation, that increased growing degree days would be associated with a move from near 100% vulnerability to 0% vulnerability. A logistic regression model fit to the data showed strongly significant estimates for the coefficients. The mean success rate for predicting vulnerability classification correctly using this fitted model was 91%. A visual representation of the probability of plant vulnerability showed the probability of plants being vulnerable starting very high (close to 1.0) and ending very low (close to 0.0) with the threshold value of 0.50 falling at about 720 growing degree days (Figure 1). A drop-in-deviance test for the model including growing degree days against a reduced model of plant vulnerability gave strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on number of growing degree days.
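The drop-in-deviance test reported here can be run in R with anova() on nested glm fits, or computed by hand from the model deviances. A sketch on simulated data (variable names are illustrative):

```r
# Simulated vulnerability data with a growing-degree-day effect
set.seed(2)
gdd <- runif(150, 3, 11)
vul.plant <- rbinom(150, 1, plogis(9 - 1.2 * gdd))

reduced <- glm(vul.plant ~ 1,   family = binomial)  # intercept-only model
full    <- glm(vul.plant ~ gdd, family = binomial)  # model with predictor

# Drop-in-deviance (likelihood ratio) chi-square test, df = 1
anova(reduced, full, test = "Chisq")

# Equivalently, by hand from the two deviances
pchisq(deviance(reduced) - deviance(full), df = 1, lower.tail = FALSE)
```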

Figure 1. Logistic regression of plant vulnerability (vulnerable = 1) against growing degree days (hundreds) with fitted curve shows a strong switch in probability of vulnerability from high to low with increasing growing degree days.

Next, I fit a model with plant vulnerability as the response and adjusted soil temperature (˚F) as the explanatory variable, understanding that some of the behavior seen between plant vulnerability and soil temperature can be attributed to the positive correlation between adjusted soil temperature and growing degree days (Figure 2). Coefficients fitted in this model were also highly significant (***see note on these p-values under “CRITIQUE”), but the mean success rate for fitted predictions was only 62% using this fitted model. Probabilities plotted using this fitted model ranged between 0.8 and 0.2 for the range of adjusted soil temperatures represented in the data. I found strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on adjusted soil temperature. The correlation between growing degree days and adjusted soil temperature is estimated at 0.30.

Figure 2. Logistic regression of plant vulnerability (vulnerable = 1) against adjusted soil temperature (°F) with fitted curve shows increasing adjusted soil temperature accounts for some decrease in probability of vulnerability.

I also fitted individual logistic regression models for plant vulnerability against adjusted soil temperature for each site, and found that Buck Mt. and Holland Mdw. had significant p-values for coefficient estimates while Juniper Mt., Waterdog Lk., and Blair Lk. did not***. A plot of fitted values from each site-level logistic regression model and an overall logistic regression model of plant vulnerability against adjusted soil temperature is given in Figure 3. In this figure, we see that the curve fit for Blair never predicts plants as being ‘vulnerable’, since the curve does not reach above probability 0.5, while curves for several other sites (Buck, Juniper, Holland, and weakly, Waterdog) predict a switch from high to low probability of vulnerability with increasing soil temperature. The poor fit for Blair Lake could be due to missing data, since one survey date did not have soil temperature data for that site.

Figure 3. Logistic regression of probability of plant vulnerability against adjusted soil temperature (degrees F) for all sites combined (grey dashed curve) and individual sites.

We see differences in how well these fitted models describe and predict the occurrence of vulnerable plants at the five sites. I do not see a clear indication that the inflection points differ significantly between sites; however, the rate of change from probabilities near 1.0 to probabilities near 0.0 is steeper for certain sites.

Logistic regressions of plant vulnerability as a function of soil moisture showed an opposite relationship to previous regressions: increasing values of soil moisture saw the probability of plants being vulnerable going from low (near 0.0) to high (near 1.0) (Figure 4).  I found strong evidence (P<0.001 for χ2 with df = 1) that the probability of a flower being vulnerable depends on soil moisture.

Figure 4. Logistic regression of probability of plant vulnerability against soil moisture (percent) shows an increase in probability of vulnerability with increased soil moisture.

Fitting logistic regressions for each site individually did show some different patterns per site (Figure 5). Waterdog Lake reached probabilities near 1.0 much sooner than the rest, while Buck Mt., which tended to have higher soil moisture than the other four sites, had a much slower transition between low and high probabilities of vulnerability.

Figure 5. Logistic regression of probability of plant vulnerability against soil moisture (percent) for all sites combined (grey dashed curve) and individual sites.


CRITIQUE

Prior to this analysis, I had been informed that the Wald’s test z-statistics and p-values reported in the output of the glm() fitted model are not a reliable test for significance of coefficient estimates and inclusion of terms. Instead, a drop-in-deviance test comparing a full versus a reduced model will yield a p-value for the inclusion of the variables represented in the full model (Ganio, 2018). The drop-in-deviance test is intuitive and interpretable, but somewhat inconvenient for an analysis involving many separate logistic regressions (e.g., per site), and the reporting of the Wald’s test significance levels can be misleading. For this reason, I performed a drop-in-deviance test for each of the fullest models (growing degree days, adjusted soil temperature, soil moisture) that included all data. For the by-site logistic regressions, I found that calculating the mean success rate of fitted probabilities was a nice way to compare model goodness of fit quickly (DataCamp, 2018), though my own comprehension of the steps involved is not as solid as with the drop-in-deviance test, making interpretation more difficult. Calculating odds ratios and analyzing inflection points and slopes for the by-site curves would help me interpret these results in a more biologically meaningful manner, to address the overarching research questions that motivated this data analysis.
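As a pointer for that future step: odds ratios fall directly out of a fitted glm(), since the odds ratio for a one-unit increase in a predictor is exp() of its coefficient. A sketch on simulated data (names and values are illustrative):

```r
# Simulated binary response declining with a continuous predictor x
set.seed(3)
x <- runif(100, 0, 10)
y <- rbinom(100, 1, plogis(4 - 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Odds ratio per one-unit increase in x (< 1 here: odds of y = 1 decrease)
exp(coef(fit)[["x"]])

# Wald confidence intervals transformed to the odds-ratio scale
exp(confint.default(fit))
```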

An additional critique is that for the regressions of soil temperature and moisture to be most meaningful, I would like to include growing degree days in the model so that I can test the significance of the effect of soil temperature or moisture on probability of vulnerability after accounting for the effect of growing degree days. The challenges that stopped me from performing this analysis were that soil temperature and growing degree days were strongly correlated, and that interpretation of the x-axis was intractable for a model that included multiple explanatory variables with different scales. Further research could help make this analysis feasible.

REFERENCES

Datacamp. (2018). Logistic Regression in R Tutorial. https://www.datacamp.com/community/tutorials/logistic-regression-R

Ganio, Lisa. (2018). FES 524 Natural Resources Data Analysis Lecture, Feb 27 2018.

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.


June 9, 2018

Relationships between environmental features and recreationist behavior in Grand Teton National Park

Filed under: 2018,Final Project @ 7:12 pm

The research question that you asked.

Broadly, my research question at the beginning of the course was: “what spatial and temporal patterns emerge from day-use hikers in Grand Teton National Park?”

I appreciated starting out with a broad, exploratory question as it allowed me to think creatively and learn a variety of approaches for analyzing and conceptualizing human behavior using GPS data. As the course progressed, and as I examined patterns within the data, my research question became more specific:

“What are the relationships among the spatial and temporal behavior of recreationists and the environmental features within the recreation area?”

A description of the dataset you examined, with spatial and temporal resolution and extent.

Throughout the course, I used a sub-set of five GPS tracks (i.e. five hikers) collected on July 19, 2017. This subset came from a collection of 652 GPS tracks of day-use visitors at String and Leigh Lakes in Grand Teton National Park. The GPS units were distributed to a random sample of visitors between July 15 – September 8, 2017. Each intercepted visitor was asked to carry the GPS unit with them throughout the duration of their visit at String and Leigh Lakes. When deploying the units, study technicians also recorded the total number of people in the group, and the intended destination for their day visit. To maintain independence between samples, only one GPS unit was given to each group.

The GPS units used in this study were Garmin eTrex 10 units. These units collected point data every 5 seconds. The GPS tracks were saved as point features for analysis in ArcGIS so that each visitor’s hiking path could be represented by a series of points. The positional accuracy of these units can vary up to 15 meters. However, the Garmin units were calibrated against a high-accuracy Trimble GPS unit, which indicated a low average positional error of 1.18 meters.

Hypotheses: predictions of patterns and processes you looked for.

I hypothesize that people will spend more time, take shorter steps, and turn at more acute angles the closer they are to a water feature.

This hypothesis is grounded in an assumption that summertime recreationists are drawn to open viewscapes, particularly cooling water features like lakes, waterfalls, and streams. The recreation site where the data were collected contains stunning lakes nestled right at the edge of the Teton mountain range. Perhaps hikers will feel compelled to stop and sightsee the closer they are to these water features, and will show less stopping behavior the further away they are from these features.

Approaches: analysis approaches you used.

I sourced data layers representing vegetation cover and elevation from https://irma.nps.gov/. I used tools in ArcMap to build and calculate attributes for analysis. All analyses were conducted in R.

Exercise 1: (a) I used R to plot spatially explicit graphs of human movement through space and time. This allowed me to visually examine how people were using the recreation site and provide context for subsequent analysis.

(b) I created histograms representing the proportions of the step lengths and turning angles of the five recreationists. This approach provided me with a better understanding of the distributions and characteristics of the response variable.

Exercise 2: I generated box plots and scatter plots that represented the relationships among various environmental features — my independent variables — in the study area. The variables I examined were: vegetation, elevation, and distance to water.

Exercise 3: I conducted an observed vs. expected (O/E) analysis to examine whether hikers were spending significantly more or less time in certain vegetation types despite being near water. I followed this with a chi-square test to see if the differences between the observed and expected distributions were statistically significant.
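In R, this comparison can be run with chisq.test(), passing the observed counts and the expected proportions. The values below are invented for illustration, not the study's data:

```r
# Observed minutes spent per distance class (illustrative)
observed <- c(120, 45, 30, 25)

# Expected proportions per class if travel were at a constant pace
# (i.e., proportional to track length in each class); must sum to 1
expected_prop <- c(0.40, 0.25, 0.20, 0.15)

# Chi-square goodness-of-fit test of observed vs. expected distributions
res <- chisq.test(x = observed, p = expected_prop)
res
```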

Final analysis: (a) I ran the data through the R hidden Markov model package “moveHMM” to identify the probability of people changing from state 1 (short steps, acute turning angles) to state 2 (long steps, obtuse turning angles) as a function of their distance to water.

(b) I also attempted a multiple linear regression on the data. Unfortunately, I couldn’t normalize the distribution of step length, and realized this approach may not be an adequate choice for this dataset.

Results: what did you produce — maps? Statistical relationships? other?

Throughout the course I produced numerous graphs representing recreationists’ movements through space and time. Figure 1 demonstrates a simple plot of the five tracks I worked with. This plot indicates most people in this sample stayed close to shore and relatively close to the parking lot. This makes sense as the trail hugs the lake shore.

Figure 1. A visual representation of the five tracks used throughout the analysis. All tracks collected on July 19, 2017.

Figure 2 summarizes the distribution of step lengths and turning angles for all five hikers. Step length is calculated as the distance between one GPS point to the subsequent GPS point (all GPS points have a temporal resolution of 1 minute). This figure suggests that these hikers typically walked straight, and nearly 40% of the time their step lengths, or distances between points, were 41-60 meters.
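For readers unfamiliar with these movement metrics, step length and turning angle can be computed from successive projected coordinates roughly as follows (the coordinates here are invented, UTM-style meters):

```r
# Invented projected coordinates of successive GPS fixes (meters)
x <- c(0, 30, 55, 90, 100)
y <- c(0, 10, 40, 45, 80)

# Step length: Euclidean distance between successive points
step_length <- sqrt(diff(x)^2 + diff(y)^2)

# Turning angle: change in bearing between successive steps,
# wrapped into (-pi, pi]
heading <- atan2(diff(y), diff(x))
turning_angle <- diff(heading)
turning_angle <- ((turning_angle + pi) %% (2 * pi)) - pi
```

In practice a package like moveHMM computes these automatically from the track coordinates, but the underlying arithmetic is just this.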

 

Figure 2. Distributions of turning angle (left) and step length (right).

Once I had a better understanding of the features of my response variable, I was curious to learn more about the environmental variables that I thought could be influencing human behavior. I also wanted to explore these variables to check for any confounding relationships between human movement and distance to water (my overarching hypothesis).

Examining the environmental features suggested that elevation may not be an influential variable in relation to the behavior of the recreationists (Figure 3). The range in elevation within the hikers’ movement paths was only 10 meters. However, the results did indicate that the predominant vegetation type for the hikers was coniferous woodland, and that conifer was primarily located near the lake shore. Perhaps the presence of conifer is playing a role in the amount of time the hiker spends near water. In other words, is the conifer deterring a person from stopping, even though they are near water? These questions encouraged me to examine the relationships among the recreationists time spent near water and the vegetation type they were in.

Figure 3. Box plots representing vegetation type relative to elevation (top left) and distance to water (top right). Elevation plotted against distance to water.

Before diving into analyzing the relationship between time, distance to water, and vegetation, I first calculated the overall proportions of the length of the hikers’ tracks grouped by distance class, and the time spent in each distance class. Figure 4 indicates that the five hikers spent 75% of their time 0–20 meters from the lake. Additionally, over 60% of the entire track length is 0–20 meters from the lakeshore. From a visual check, it looks like my hypothesis is being supported: people are spending more time near the water. But is this a function of the lake feature or some other variable?

Figure 4. Proportion of time spent in the area compared to the length of the track (left). Frequency of vegetation types along every distance class (right).

The next set of figures illustrates the expected amount of time a person would spend in specific vegetation types within each distance class if they were traveling at a constant rate, compared to the observed amount of time that person actually spent in each vegetation type within each distance class. Coniferous woodland was the first vegetation type I examined.

Additionally, I was introduced to a neat way to conceptualize differences in distributions, particularly when the distributions represent different units of measurement (i.e., time vs. length). Creating this odds-ratio-style metric essentially divides the proportion of time by the proportion of track length in each distance class. A value less than 1 indicates that people spent less time in the distance class than they would have if they had been traveling at a constant pace; a value greater than 1 indicates that people spent more time in the distance class than they would have if traveling at a constant pace.
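The calculation itself is simple; with invented proportions for four distance classes:

```r
# Proportion of total time and of total track length in each distance class
# (values invented for illustration)
time_prop   <- c(0.75, 0.10, 0.05, 0.10)
length_prop <- c(0.60, 0.20, 0.10, 0.10)

# Ratio > 1: more time spent than constant-pace travel would predict
# Ratio < 1: less time spent than constant-pace travel would predict
ratio <- time_prop / length_prop
ratio
```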

Figure 5 indicates that people were spending less time in conifer despite being closer to water! The further away from water, the more time they spent in conifer. The chi-square test indicates these differences in distributions are statistically significant (χ2 = 90.36, df = 7, p-value < .001).

Figure 5. Observed/Expected for coniferous vegetation types grouped by distance class.

Figure 6 represents the observed versus expected distributions for herbaceous vegetation types (e.g., open meadows). Interestingly, it appears that the hikers spent more time in meadows than was expected. It also appears that meadows are prevalent further away from shore. Back in Figure 5 we learned that people are spending, on the whole, more time in areas 0–20 meters from shore and 61–80 meters from shore. Perhaps for the 61–80 meter distance class, this can be explained by the presence of open meadows. However, the chi-square test indicates that these distributions are not significantly different from one another (χ2 = 7.45, df = 3, p-value = .06).

Figure 6. Observed/Expected for meadow vegetation types grouped by distance class.

Figure 7 is very revealing, as it indicates that people are spending more time in residential facilities that are close to shore. This includes areas with picnic tables, paved sidewalks, and some areas of the parking lot. This makes sense, as I imagine people are drawn to the amenities that are offered. Also, the chi-square test indicates that the differences in these distributions are significant (χ2 = 197.86, df = 3, p-value < .01). Seeing these results, I wonder if next time I should modify my hypothesis and examine if/how distance from vehicle explains behavior.

Figure 7. Observed/Expected for residential facilities grouped by distance class.

These results suggested that environmental variables may play a larger role in human movement and behavior than originally hypothesized. As a final step in this exploratory journey, I experimented with the moveHMM package to estimate the probability of a person changing from state 1 (short step lengths, acute turning angles) to state 2 (long step lengths, obtuse turning angles) as a function of their distance to water. It is important to note that I still don’t have a comprehensive understanding of hidden Markov models. Therefore, the following figures represent only a fraction of the output generated from this package. Further, the following figures aim to demonstrate how outdoor recreation scientists could potentially use these methods to better understand human behavior.

Figure 8 represents the two state model distributions of step length and turning angle. Figure 9 shows us that, indeed, as the distance to water increases, the likelihood a person will transition from state 1 (slow) to state 2 (fast) increases. According to this output my hypothesis is being supported!

While HMMs provide a statistically rigorous framework for incorporating covariates, allow for the autocorrelation commonly experienced with GPS data, and enable researchers to make inferences about changes in behavioral states, my own knowledge of these processes remains limited. Although it was relatively simple for me to push the data through an HMM model, this final step in my analysis still leaves me with a lot of questions. Further, I was unable to build an HMM model that controls for the vegetation type; thus I have a feeling that only including distance to water as a covariate may not be telling the whole story.

Figure 8. State-dependent distributions in the 2 state model.

Figure 9. Effect of the covariate ‘distance to water’ on the transition probabilities

As a final exercise, I was curious to see if it was possible to do a multiple linear regression on my own as another way to address the hypothesis. First, I checked for autocorrelation in the data. The output below represents one track.

Figure 10. Autocorrelation on distance to water (above) and step length (below).

The ACF informed me that the spacing at which observations are no longer autocorrelated is about 2 minutes. I subsequently subsampled each track to select only points that are more than two minutes apart. After re-sampling, I re-tested for autocorrelation: both variables are no longer autocorrelated (Figure 11).
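The thinning procedure can be sketched with acf() and a simulated autocorrelated series (an AR(1) process standing in for the real track data; the thinning interval of 3 is illustrative):

```r
# Simulate an autocorrelated series, e.g. step lengths at 1-minute spacing
set.seed(4)
step <- as.numeric(arima.sim(list(ar = 0.6), n = 300))

# Check autocorrelation: acf[1] is lag 0 (always 1), acf[2] is lag 1, etc.
a <- acf(step, plot = FALSE)
a$acf[1:5]

# If correlation dies out by lag 2 (2 minutes), keep every 3rd point
# so retained points are > 2 minutes apart
step_thin <- step[seq(1, length(step), by = 3)]
a_thin <- acf(step_thin, plot = FALSE)
a_thin$acf[2]   # lag-1 autocorrelation after thinning, much reduced
```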

Figure 11. Autocorrelation on re-sampled data on distance to water (above) and step length (below).

I then tested for normality. The distributions were skewed, so I did a square root transformation on the distance to water. This normalized the distribution and a Shapiro-Wilk normality test provided a p-value > .05, suggesting the distribution is not statistically different from the normal curve.
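This transform-and-test sequence looks roughly like the following, with a simulated right-skewed variable standing in for distance to water:

```r
# Simulated skewed variable, analogous to distance-to-water measurements
set.seed(5)
dist_to_water <- rchisq(100, df = 4) * 10

# Shapiro-Wilk on the raw variable: small p-value indicates non-normality
shapiro.test(dist_to_water)

# Square-root transform pulls in the right tail
dist_sqrt <- sqrt(dist_to_water)
shapiro.test(dist_sqrt)   # p-value increases after the transform
```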

Figure 12. Density plot of distance to water before (above) and after (below) transformation.

Unfortunately, I wasn’t able to normalize step length. Doing a log transform or square root just made the skewness even more severe.

Figure 13. Density plot of step length. The responses are not normally distributed.

At this point, I realized that doing a regression was perhaps not the best approach – especially since I wasn’t able to normalize the distribution of the response variable (step length). However, despite the cliff hanger ending, I intentionally included this final exercise in the blog post to demonstrate the use and consideration of autocorrelation on GPS data.

Significance. What did you learn from your results?  How are these results important to science? to resource managers?

These results revealed to me that environmental features such as vegetation and open viewsheds have an influence on the behavior of recreationists. Further, the results indicate that vegetation type also plays more of a role in the behavior of recreationists than originally anticipated. I was surprised to see how much vegetation type influenced the amount of time the hikers spent in the area.

Overall, these results also demonstrate how recreationist activity can be measured and analyzed to develop deeper understandings of behavior.

These results are important to both science and resource managers. Parks and protected area land managers strive to provide a quality user experience while also protecting natural and cultural resources. Accurately understanding how people move and behave in a recreation system allows for more informed management decision making. For example, understanding what environmental conditions influence behavior could indicate a need for additional infrastructure, signage, or educational initiatives, depending on the management objectives for the area. Additionally, by using statistical tools to analyze these relationships, the results can have predictive power for managers.

In the scientific and academic communities, applying spatial methods to outdoor recreation science allows for a more accurate understanding of how people move, experience, and interact in outdoor spaces. By integrating GIScience with other common social science techniques in outdoor recreation — such as surveys, observations, and interviews — scientists glean richer results that can support and contribute to existing theory, generate deeper understandings about human behavior, and inspire additional studies.

Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

I worked exclusively in ArcMap and R. I went into this class with very limited knowledge and minimal confidence with the software; now, I am coming out of the class with more confidence in my ability to learn and figure it out. I appreciated the exploratory, self-guided structure of the class which provided space to play around with the technology and make mistakes. Even more so, I appreciated the insight of Julia and Laura, as well as the tips, tricks, and support from fellow classmates.

I learned how to work with GPS data in ArcMap and source data layers that were relevant to my research question. I learned various tools in ArcMap that allowed me to make calculations (i.e. TrackAnalyst tool and spatial join tools).

I learned how to work with spatial data in R and became more efficient at aggregating and manipulating dataframes for analysis. I also learned how to represent the data in a way that is meaningful to other audiences.

What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), and (d) multivariate methods (e.g., PCA)?

I learned how to do expected vs. observed analysis and chi-square test for significant differences in the distributions. I learned more about the application of hidden Markov Models in developing probability models representing changes in human behaviors. I also was able to work a little with autocorrelation.

Beyond my project I learned quite a bit from listening to the tutorials of other classmates. I am now (ever so slightly) more familiar with concepts like spatial and temporal autocorrelation, kriging, and geographically weighted regression. Additionally, it was extremely helpful working with Susie who had a similar dataset and was also using a hidden Markov Model approach.

References

Michelot, T., Langrock, R., & Patterson, T. (2017). moveHMM: An R package for the analysis of animal movement data. Available: https://cran.r-project.org/web/packages/moveHMM/vignettes/moveHMM-guide.pdf.

June 8, 2018

Interannual variation in phenology and productivity at a C3 and a C4 grassland

Filed under: 2018,Final Project @ 12:48 pm

1. Research question

Introduction

Climate change is altering the production and distribution of plant species (Kelly & Goulden, 2008); however, the response of grasses to climate change is relatively understudied compared to woody plants, especially in tropical regions (e.g., Schimel et al. 2015). Globally, grasslands and savannas are estimated to comprise 30 percent of non-glacial land cover, and many important crop species are grasses (Still et al. 2003). Grasses are predicted to respond more quickly to climate change because of their short life span and dispersibility, and are aggressive invaders that can exacerbate fire cycles, cause land cover transitions, and alter carbon cycling (D’Antonio and Vitousek 1992, Angelo and Daehler 2013).

The community composition of a grassland mediates its response to its environment, and is critical to consider in forecasting climate change impacts (Knapp et al., 2015). Grasses with the C4 photosynthetic pathway, in contrast to the ancestral C3 pathway, have comparatively higher resource-use efficiencies for light, water, and nitrogen, especially under high temperatures. As a result, species using the C3 or C4 pathway will have distinct responses to a warming climate and rising CO2 (Collatz et al., 1992, 1998; Lloyd & Farquhar, 1994; Suits et al., 2005). Understanding the differential environmental constraints on the phenology, productivity, and distribution of grasses of different functional types will be critical in predicting how they will respond to future climate change, as well as in identifying areas that may be prone to invasion by non-native grasses.

Questions

My initial research question was, broadly: How are phenology and productivity of natural grasslands related to local climate variables, and how do these relationships differ between C3 and C4 grasslands?

Specifically, the research questions I answered were:

  1. Is interannual variation in production the result of multi-year patterns in production (autocorrelation)?
  2. How does the timing of the relationship between production and local climate variables differ between a C3 and a C4 grassland, and between years?

2. Data

My study sites are two eddy covariance (EC) flux tower locations in natural grassland areas located ~90 miles apart in eastern Kansas. The sites experience nearly identical climates, but the first is a natural tallgrass prairie composed of 99% C4 grass at Konza Prairie Biological Station outside Manhattan, KS, while the second is a replanted agricultural field composed of 75% C3 grass at the University of Kansas Field Station, outside Lawrence, KS. Because the two sites experience a very similar climate, despite being in distinct ecoregions, I hypothesize that photosynthetic type strongly controls differences in production and phenology at each site.

At these sites, production is measured first from satellite data using the normalized difference vegetation index (NDVI), which essentially measures the greenness of vegetation. In this analysis, I use NDVI derived from the MODIS satellites, which has 16-day temporal resolution and is available from 2001 to present.

Production is also measured at the EC flux towers, and is calculated from ecophysiological equations that use ground-based measurements of atmospheric gas concentrations and meteorological data. The eddy covariance (EC) flux approach uses tower-mounted instruments to measure atmospheric concentrations of water and CO2, as well as air temperature, solar radiation, and other environmental data. All measurements are taken continuously at 30-minute intervals. EC flux data reflect the “footprint,” or area upwind of the tower where the instruments are mounted. The footprint varies with wind speed and direction, but averages about 250 m^2. EC flux data span 2008-2015.

The main metrics I am interested in are daily total gross primary production (GPP) and the mean daily light use efficiency of production (LUE). GPP is reported in units of gC•day^-1•m^-2. LUE is calculated as the daily sum of GPP divided by the daily sum of incoming radiation, or photosynthetic photon flux density (PPFD). Daily LUE is converted to units of gC•MJ^-1•day^-1•m^-2 from units of µmol CO2•µmol photon^-1•day^-1•m^-2 using the molecular weight of carbon and Planck’s equation.
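The unit conversion above can be sketched in a few lines. This is not the authors' script (the analysis was done in R), and the mean PAR wavelength of 550 nm is my assumption — the post does not state which wavelength was used with Planck's equation.

```python
# Sketch of the LUE unit conversion described above. The 550 nm mean PAR
# wavelength is an illustrative assumption, not stated in the post.
N_A = 6.02214076e23   # Avogadro's number (photons per mole)
h = 6.62607015e-34    # Planck's constant (J s)
c = 2.99792458e8      # speed of light (m / s)

def energy_per_mol_photons(wavelength_m=550e-9):
    """Energy of one mole of photons (J/mol) via Planck's equation E = h*c/lambda."""
    return N_A * h * c / wavelength_m

def lue_gc_per_mj(lue_umol_per_umol, wavelength_m=550e-9):
    """Convert LUE from umol CO2 / umol photon to gC / MJ."""
    e_mol = energy_per_mol_photons(wavelength_m)   # J per mol of photons
    gc_per_umol_co2 = 12.011e-6                    # grams of C per umol CO2
    mj_per_umol_photon = e_mol * 1e-6 * 1e-6       # umol -> mol, J -> MJ
    return lue_umol_per_umol * gc_per_umol_co2 / mj_per_umol_photon
```

With these constants, a molar LUE of 0.02 µmol CO2 per µmol photon converts to roughly 1.1 gC per MJ of PPFD, a plausible magnitude for grassland LUE.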

3. Hypotheses

My hypotheses were:

  1. Interannual variation in the phenology of production will not be strongly autocorrelated; interannual variation in production is instead more strongly controlled by local climate variables.
    a. Phenology will differ between the C3 and C4 sites: the C4 site will have a longer growing season and higher mean and maximum production.
  2. The C3 and the C4 site will have distinct cross-correlative relationships between production and local climate. They will differ in the timing and the strength of this relationship, and will respond distinctly to interannual variation in climate.

4. Approaches

My approach was to:

  1. Calculate annual phenology metrics from NDVI at each site, using the R package [greenbrown](http://greenbrown.r-forge.r-project.org/).
    • Phenological metrics include: the timing of the start of season and end of season, the length of the season, and mean and peak NDVI over the season. The phenological metrics are calculated by fitting curves to the annual growth cycle at each site.
    • After calculating the annual phenology metrics, I calculated autocorrelation using the acf() function in R to assess whether annual differences in phenology were a product of change over time or of cyclical trends.
  2. Calculate cross-correlation between GPP, LUE, and soil water content at each site, using the R function ccf(). Assess how cross-correlation differs between water years at each site.
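The autocorrelation step above used R's acf(), whose correlogram draws approximate 95% bounds at ±1.96/√n. A numpy sketch of the same computation (function names and toy data are mine, not from the post):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag, as R's acf() computes it."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

def significant_lags(x, max_lag):
    """Lags (>0) whose autocorrelation exceeds the approximate 95% bounds
    +/- 1.96/sqrt(n) -- the dashed lines in R's correlograms."""
    r = acf(x, max_lag)
    bound = 1.96 / np.sqrt(len(x))
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > bound]
```

For white noise, significant_lags() should usually return an empty (or near-empty) list, which mirrors the interpretation in the Results: phenology metrics whose acf stays inside the bounds are not meaningfully cyclic.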

5. Results

Phenology

Fig. 1: Start, end, and length of growing season over time at each site.

Fig. 2: Mean (MGS) and Peak growing season value of NDVI at each site.

The C3 site has a consistently longer growing season than the C4 site, with an earlier start of season and a later end of season. However, based on the NDVI data, the mean and peak growing season values look like they may be the result of cyclic patterns. I fit an autocorrelation function to assess whether there were interannual patterns in the phenological indices.

Fig. 3: Autocorrelation function for start, end and length of season, at each site. Sites are plotted with distinct line types (KFS, C3 =solid; Konza, C4 = dashed).

Fig. 4: Autocorrelation function for mean and peak growing season value, at each site. Sites are plotted with distinct line types.

In the autocorrelograms above, the dashed lines represent confidence intervals for determining whether autocorrelation is statistically significant. Because the autocorrelation function only rarely, and marginally, rises above the confidence intervals for any of the phenological indices, the phenology we observe is more likely to result from interannual climatic variation rather than from cycles in production.

Production and cross-correlation with climate

After inferring that interannual variation in phenology is likely not the result of temporal autocorrelation of the indices, I investigated the relationship between production and climate. Specifically, I examined the timing and magnitude of the relationship between GPP and soil water content. Importantly, I compare the timing of cross-correlation between water years, which run from October to September. Water years, in contrast to calendar years, are a more biologically meaningful division when examining the relationship between plant growth and water variables.

For this project, I will just show the relationships between GPP and soil water content. Because LUE is derived from GPP, the cross-correlation between LUE and soil water content looks similar to the relationship between SWC and GPP.

Fig. 5: Annual time series of GPP (gC / m^2/ day) and soil water content (% * 10), for 2010-2013, plotted by water year. The thick lines represent the monthly average of total daily GPP for each site, in each year. The dashed lines represent the average daily soil water content. The daily total of GPP for each site and for each year is plotted in thin, light-colored lines in the background.

Inspecting the time series of GPP and soil water content by water year, we can see a clear hump in SWC that precedes a hump in GPP. Calculating cross-correlation between SWC and GPP, using the ccf() function in R, allows us to quantify this relationship.
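What ccf() quantifies here is the lag at which the SWC and GPP series are most strongly correlated. A numpy sketch of that idea, on synthetic humps (the Gaussian curves and the sign convention — positive lag means the first series leads — are my illustrative choices, not R's exact ccf() conventions):

```python
import numpy as np

def ccf_peak_lag(x, y, max_lag):
    """Return (best_lag, r): the lag maximizing the cross-correlation of x
    and y. Positive lag means x leads y (y peaks later), analogous to
    reading the peak of an R ccf() plot."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    x = x - x.mean()
    y = y - y.mean()
    n = len(x)
    denom = n * x.std() * y.std()
    best_lag, best_r = 0, -np.inf
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            r = np.sum(x[:n - k] * y[k:]) / denom   # pairs x[t] with y[t+k]
        else:
            r = np.sum(x[-k:] * y[:n + k]) / denom
        if r > best_r:
            best_lag, best_r = k, r
    return best_lag, best_r
```

For a SWC hump that peaks 30 days before a GPP hump, the peak lag comes out near +30, matching the kind of lead/lag reading made from Fig. 7.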

In my phenological questions, I acknowledged the caveats of NDVI data, which capture distinctions in the timing, but not the magnitude, of differences between C3 and C4 production (Fig. 7). I was curious whether cross-correlation between NDVI and soil water content would reveal similar or different trends from the cross-correlation between GPP and soil water content.

Fig. 6: Time series of GPP and NDVI at the C3 and C4 sites, from 2010-2013. NDVI is multiplied by 10 in order to visualize it on a similar scale to GPP. Mean monthly GPP is plotted in the thick, solid lines; the daily average of GPP for each site is plotted in the background using thin, lighter-colored lines.

Importantly, NDVI does not accurately capture the magnitude of difference between C3 and C4 production. While C3 and C4 grasses have distinct relationships between NDVI and GPP, NDVI alone is not always a great indicator of production differences between the two photosynthetic types.

Fig. 7: Cross-correlation strength vs. lag for GPP and soil water content, for 2010 -2013. The cross-correlation between NDVI and soil water content for each site in each year is plotted in lighter-colored lines in the background of each plot.

The vertical lines above are plotted at x = 0, and x = 100, to simplify visual comparison. Dashed lines represent confidence intervals.

When we examine how cross-correlation between GPP and soil water content varies between sites and between years, we see a clear shift at Konza Prairie, the C4 grassland, in 2013. This shift indicates that GPP and LUE peak earlier in relation to soil water content: with a lag of about 30 days, as opposed to 100 days at the Kansas Field Station and at Konza in previous years.

Because 2013 followed the severe drought year, 2012, this shift suggests that the C4 grasses at Konza may be more flexible in shifting the timing of their water use in response to drought or other irregular moisture conditions.

Importantly, NDVI reveals the same patterns of cross-correlation between production and soil water content.

6. Significance.

First, my results confirm that C3 and C4 grasses respond distinctly to similar climatic variation. The shift in the timing of cross-correlation of GPP with soil water content in 2013 (Fig 7) is evidence that the same climatic change will affect these photosynthetic pathways in different ways. We need to continue using both experimental and observational research to quantify the interacting effects of rising temperatures, increased aridity, and elevated CO2 on the production of C3 and C4 grasses.

Second, my results show that though NDVI does not capture differences in the magnitude of production at C3 vs C4 sites, it does capture differences in phenology, i.e., the timing of production. This means that NDVI, though a coarse metric of production, is valid for assessing variation in the timing of production. Particularly because NDVI is widely used and easy to obtain and calculate, this distinction will be helpful in future phenology research. Importantly, however, phenological metrics that calculate mean, max, or total growing season productivity — metrics that describe the magnitude, rather than the timing, of production — should be interpreted from NDVI with caution. Ideally, other production indices should be used to assess variation across functional types in the magnitude of production.

7. My learning, w/r/t software

Over the course of this class, I have become much more confident using R Markdown to edit and generate reports for publication. This blog post was drafted and formatted using R Markdown, with only minimal modifications to fit the blog format. I have also learned:

  • how to gapfill time series data using the function zoo::na.approx, which uses linear interpolation to replace NAs in a time series
  • how to convert outputs from acf() and ccf() functions into data frames that are easily plottable using ggplot
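The gapfilling step mentioned above (zoo::na.approx in R) is linear interpolation of interior missing values. A numpy stand-in, with my own function name and toy data:

```python
import numpy as np

def na_approx(values):
    """Linearly interpolate interior NaNs in a series, mirroring what
    zoo::na.approx does in R. Leading/trailing NaNs are left in place."""
    v = np.asarray(values, float).copy()
    idx = np.arange(len(v))
    ok = ~np.isnan(v)
    first, last = idx[ok][0], idx[ok][-1]
    # only fill gaps that have observed values on both sides
    interior = (idx >= first) & (idx <= last) & ~ok
    v[interior] = np.interp(idx[interior], idx[ok], v[ok])
    return v
```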

8. My learning, w/r/t statistics

I became much more comfortable generating and interpreting correlograms. I used autocorrelation and cross-correlation to determine which variables to proceed with in my analysis.

Future directions will involve using the information I now have about cross-correlation to select variables for future regressions. E.g., I learned that GPP is cross-correlated with soil water content; as a result, I am more interested in investigating shifts in the timing of that relationship than in its magnitude. I learned that GPP is more strongly correlated, than cross-correlated, with air temperature and soil temperature. As a result, I will use linear mixed models to assess the mean annual relationships between daily total GPP and means of these variables.

Ultimately, this work prepares me to quantify and discuss the similarities and differences in the timing of key phenological and production events for a C3 vs a C4 grassland.

 

References

Angelo, Courtney L., and Curtis C. Daehler (2013). “Upward expansion of fire-adapted grasses along a warming tropical elevation gradient.” Ecography 36.5, 551-559.

Collatz G, Ribas-Carbo M, Berry J (1992) Coupled Photosynthesis-Stomatal Conductance Model for Leaves of C4 Plants. Australian Journal of Plant Physiology, 19, 519.

Collatz GJ, Berry JA, Clark JS (1998) Effects of climate and atmospheric CO2 partial pressure on the global distribution of C4 grasses: Present, past, and future. Oecologia, 114, 441–454.

D’Antonio, Carla M., and Peter M. Vitousek (1992). “Biological invasions by exotic grasses, the grass/fire cycle, and global change.” Annual Review of Ecology and Systematics 23.1: 63-87.

Kelly AE, Goulden ML (2008) Rapid shifts in plant distribution with recent climate change. Proceedings of the National Academy of Sciences, 105, 11823–11826.

Knapp AK, Carroll CJW, Denton EM, La Pierre KJ, Collins SL, Smith MD (2015) Differential sensitivity to regional-scale drought in six central US grasslands. Oecologia, 177, 949–957.

Lloyd J, Farquhar GD (1994) 13C Discrimination during CO₂ Assimilation by the Terrestrial Biosphere. Oecologia, 99, 201–215.

Schimel, David, et al. (2015) “Observing terrestrial ecosystems and the carbon cycle from space.” Global Change Biology 21.5: 1762-1776.

Still CJ, Berry JA, Collatz GJ, DeFries RS (2003) Global distribution of C3 and C4 vegetation: Carbon cycle implications. Global Biogeochemical Cycles, 17, 6-1–6-14.

Suits NS, Denning AS, Berry JA, Still CJ, Kaduk J, Miller JB, Baker IT (2005) Simulation of carbon isotope discrimination of the terrestrial biosphere. Global Biogeochemical Cycles, 19, 1–15.

June 7, 2018

Wavelet and Cross Wavelet Analyses of Kelp Canopy Cover and Temperature

Filed under: Exercise/Tutorial 3 2018 @ 4:26 pm

Question: With this exercise I wanted to look at patterns in the cycles of temperature and kelp coverage in southern Oregon. I wanted to examine whether these two variables fluctuate together and, if they do, whether that synchronization is constant throughout the timeseries.

Tool: To do this I used wavelet analysis and cross wavelet analysis. These analyses do not depend on stationarity, unlike many of our other spatio-temporal tools, so they can detect changes in the period and frequency of a cycling time series or pair of time series.

Steps: Wavelet analyses require evenly sampled data. My temperature data is sampled evenly, with one mean temperature per month (for both the intertidal and satellite data sets). My kelp timeseries, however, is based on sporadic sampling (i.e. whenever we could get a good, cloud-free satellite image). So the first task was to interpolate my kelp timeseries to monthly average canopy cover.

To do this, I interpolated each year’s timeseries separately. I first added a low canopy coverage for all winter months that did not have a data point. For the winter months (November–April), the median winter-month canopy coverage was about 1200 m2, so all empty winter months were assigned this median value. Every year had at least 3 data points in the non-winter months, so between the added winter data and the summer points, I was able to fit a polynomial spline to each year’s kelp timeseries. I did this in R using the lm(y ~ bs(x, degree = #)) function. I then interpolated the value on the 15th of every month from this spline using the predict() function.
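The fit-then-predict step above can be sketched as follows. This is a numpy stand-in, not the post's R code: np.polyfit replaces the B-spline basis of lm(y ~ bs(...)), and the function name, default degree, and toy data are my assumptions.

```python
import numpy as np

def monthly_canopy(doy_obs, area_obs, degree=3):
    """Fit a polynomial to one year's sparse canopy observations
    (day-of-year vs canopy area) and predict the 15th of each month --
    a numpy stand-in for the post's lm(y ~ bs(x, degree = #)) + predict().
    """
    coefs = np.polyfit(doy_obs, area_obs, degree)
    # approximate day-of-year for the 15th of each month (non-leap year)
    mid_month = np.array([15, 46, 74, 105, 135, 166,
                          196, 227, 258, 288, 319, 349])
    return mid_month, np.polyval(coefs, mid_month)
```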

I then used the WaveletComp package in R to conduct wavelet and cross wavelet analyses on my kelp and temperature timeseries.
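The wavelet step itself was done with WaveletComp's functions in R. As a rough illustration of what a wavelet power spectrum measures, here is a bare-bones Morlet continuous wavelet transform in numpy; the mother-wavelet parameter (omega0 = 6), the truncated support, and the period grid are my simplifying assumptions, not WaveletComp's exact implementation.

```python
import numpy as np

def morlet_power(signal, periods, omega0=6.0):
    """Wavelet power |W(s,t)|^2 of a monthly series at the given periods
    (in months), using a Morlet mother wavelet. A bare-bones sketch of the
    quantity that WaveletComp's power plots display."""
    x = np.asarray(signal, float) - np.mean(signal)
    n = len(x)
    power = np.empty((len(periods), n))
    # Fourier-period -> scale conversion for the Morlet wavelet
    fourier_factor = (4 * np.pi) / (omega0 + np.sqrt(2 + omega0 ** 2))
    for i, p in enumerate(periods):
        s = p / fourier_factor
        half = min(int(4 * s), (n - 1) // 2)      # truncate the support
        t = np.arange(-half, half + 1)
        psi = (np.pi ** -0.25
               * np.exp(1j * omega0 * t / s)
               * np.exp(-(t / s) ** 2 / 2)) / np.sqrt(s)
        w = np.convolve(x, np.conj(psi)[::-1], mode="same")
        power[i] = np.abs(w) ** 2
    return power

def dominant_period(signal, periods):
    """Period with the highest time-averaged wavelet power."""
    return periods[int(np.argmax(morlet_power(signal, periods).mean(axis=1)))]
```

On a pure 12-month sine wave the time-averaged power peaks at the 12-month period, which is the pattern read off the kelp and temperature power plots below.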

Results: 


Figure 1: A) Wavelet analysis of my kelp canopy timeseries. B) Wavelet Analysis of the satellite temperature timseries. For both, colors correspond to power, white contour lines to the 0.1 significance level, and black lines to the power ridge.

My interpretation of the wavelets plots is:

1) Kelp Canopy – The highest power comes at the 6-month and 12-month periods. The 12-month wavelet makes sense, considering Nereocystis is an annual species, and the 6-month period corresponds to intra-growing-season changes in canopy cover, possibly resulting from wave events. These areas of high power at 6 and 12 months are not consistent throughout the data set, and only occur when there are spikes in canopy coverage; this points to years where kelp cover, even at the height of the growing season, was quite low. Another result to note is that in the first half of the timeseries, there is some power at the 5-year period. This could potentially be related to a climatic oscillation such as El Nino or the PDO.

2) Satellite Temperature – similar to kelp canopy, the highest power for this wavelet analysis was at the 12 month period. Unlike the kelp wavelet, this 12 month period was consistently high power for the entire duration of the timeseries.

My interpretation of the cross-wavelet plots I generated is:

1) Kelp Canopy and Satellite Temperature – kelp canopy and satellite temperature appear to both have high power at 12-month periods. The arrows in this 12-month band largely point to the right, which suggests that x and y are in phase. A few of the arrows point down to some extent as well (around -30 degrees). This band is not constant, however, and the significance of this high-power band is lost around 1997 and 2003-2005. This suggests neither timeseries had strong annual cycles around those times. There are also a few other periods that appear to have a great enough amplitude to be considered significant. Perhaps most interesting is that the 5-year/60-month period appears to be significant throughout the series. The arrows along this band mostly point to the left and upwards, indicating that the two may be out of phase and that temperature (y) may be leading kelp (x). We saw a similar trend in the wavelet analysis for kelp, which may be indicative of some kind of effect of climate oscillations.

2) Kelp Canopy and Local, Intertidal Temperature – This cross wavelet analysis was done on a shorter dataset than the kelp/satellite dataset. This cross wavelet power graph is ‘chunkier’ at the annual scale than that between kelp and satellite data. It shows a dip in power from 2003-2005, like the above graph did, but also a dip around 2009-2011. This indicates that the two timeseries are not cycling together as strongly as the kelp and satellite data did. However, this cross wavelet has additional high-power areas in the 3-11 month periods around 2014, and at 13-32 months from 2006-2010. There also appears to be some significant power at the 5-year range, similar to several previous power graphs, although this is hard to tell with a shorter timeseries.

Overall, my interpretation would be that both temperature datasets tend to cycle with kelp most strongly on the annual timescale, which makes sense. The local, intertidal data set does not appear to cycle as strongly with kelp as the satellite derived temperature does. My guess is that we got these results because local temperature is more susceptible to interruptions by relatively small scale phenomena such as upwelling or snowmelt and therefore adhere less strongly to the strong global cycles stemming from sun angle and exposure. I also think that these power graphs taken together suggest that some kind of multi-year climate oscillations may be involved in cycles of kelp canopy cover, since several of these analyses found significant amplitude at a roughly 5 year frequency.

Critique: My critique of this method is that I feel like wavelet and cross wavelet analysis is less informative than I was expecting it to be. Some of the strongest results just show cycles at an annual period that have a lower amplitude some years, which is pretty obvious from just looking at the graphs. I still feel like I’m not interpreting my results correctly, and I think part of the reason is that it’s hard for me to stop thinking in terms of correlation after my first two exercises.

Aggregating Suitable Dam Habitat to Consider the Role of Habitat Size and Connectivity

Filed under: 2018,Exercise/Tutorial 3 2018 @ 11:47 am

Question:

How does the size and connectivity of beaver dam habitat relate to observed dam sites in West Fork Cow Creek?

Approach:

In Exercise 2, I considered whether stream gradient, active channel width, and valley bottom width, described by Suzuki & McComb (1998) as the most important predictors of beaver damming, corresponded to the observed dam sites collected in our field surveys last fall. And while I found that their thresholds defining suitable damming habitat captured all observed dam sites, it was also clear that there was a significant amount of suitable habitat where no observed dams occurred. In other words, there was still quite a bit of ‘available’ habitat.

To understand why some of these habitats were used and others were not, I turned to work from landscape ecology, which posits that the size and connectivity of habitat is an important consideration in the resource selection of animals (Dunning et al. 1992). To consider whether this may help explain why dams were observed in some suitable habitats and not others, I used the ‘patching’ process and OD Cost Matrix to estimate the size and connectivity of suitable damming habitats defined by the Suzuki & McComb (1998) criteria, which are considered in this final exercise.

Steps used to complete the analysis:

Data preparation: Despite having already generated the size and connectivity measures for patches of damming habitat, there was still some geoprocessing required. In particular, I needed to identify which patches of habitat had actually been surveyed during our fall field work, as well as which patches were observed to have beaver dams somewhere in them. To do this, I used the Spatial Join tool, choosing the ‘INTERSECT’ method of identifying overlapping patches and survey sites. I repeated this process to identify patches with observed dam sites. Of the 49 habitat patches generated in Exercise 1, 29 coincided with our survey points. Of those, 48 dam sites were observed on only 4 of the 29 patches. All patches that were not surveyed were removed from the dataset, and fields for a dummy variable and dam counts were added for observed dams.

Variable generation: Using the data from the OD Cost Matrix analysis in Exercise 1, I had two variables of interest: 1) the length of contiguous habitat reaches, which I refer to as patch ‘Length’; and 2) the distance from a patch to its nearest neighboring patch when traveling through the stream network, referred to here as ‘Nearest Neighbor (NN) Distance’. For this exercise I was curious, however, whether the size of the NN was important, surmising that a patch would be more likely to be occupied if it was located 100 meters away from a large patch (e.g. 900m) as opposed to a small patch (e.g. 100m). As a result, I calculated an NN Length weighted by the distance to the NN patch (NN Length/NN Distance), referred to as ‘NNLwDist’.

Analysis: Similar to Exercise 2, I used the easyGgplot2 package to generate overlapping histograms of habitat patches with and without observed dam sites for these three variables. I also used this package to generate density plots showing the relative distribution of patches with and without observed dams. I then computed the mean, standard deviation, and range for each variable, with and without observed dams.

Lastly, I compared the difference in means using a permutation test, because a standard Welch’s t-test was inappropriate due to the small and unequal sample sizes in each group. I also applied a Wilcoxon rank-sum test, but this proved problematic due to a high number of ties in these data.
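A permutation test of a mean difference can be sketched in a few lines. This is an illustrative numpy version, not the code used in the analysis (which reported z statistics, suggesting a different implementation); the toy groups mimic the small, unequal sample sizes here (4 patches with dams vs 25 without).

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=10000, seed=1):
    """Two-sided permutation test for a difference in means.
    Repeatedly shuffles group labels and counts how often the shuffled
    |mean(a) - mean(b)| is at least as large as the observed difference.
    Returns (observed difference, p-value)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    obs = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(obs):
            count += 1
    return obs, count / n_perm
```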

Results:

Distributions and Summary Statistics:

Table 1


Patch length: The histogram and density plot in figures 1 and 2 show a clear delineation in patch length between patches with and without observed dams. On average, patches where dams were observed were more than 900 meters longer (z=3.6, p<0.0001) than those without (table 1).

 

Figure 1

Figure 2


Distance to Nearest Neighbor Patch: On average, patches with dams were nearly 350 meters closer to their nearest neighboring patch, and overall are highly skewed toward the shortest distances (95m to 100m), which is clearly visible in the density plot (figure 4). Patches without dams had a much wider distribution of distances (96m to 1591m). However, the permutation test did not find the difference in distances to be significant (z= -1.5, p=0.14).

Figure 3

Figure 4


Nearest Neighbor weighted by Distance: The mean difference between patches with and without observed dams was 6.4m, with the permutation test suggesting this was significant (z=3.001, p=0.003). Patches with observed dams were more widely distributed than in the other metrics (2m to 15m), which is particularly apparent in the density plots.

Figure 5

Figure 6


Lastly, figures 7, 8, and 9 show plots of Patch Length by NN Distance and NNLwDist, as well as NN Distance by NNLwDist, respectively. The clustering of patches with observed dams by patch length is consistent with the above discussion; however, patches with observed dams also seem bounded, or highly clustered, at only small distances to the NN. Given a larger sample size, I suspect that differences between patches with and without dams would be more significant. Conversely, patches with observed dams are more widely distributed by NNLwDist, and it is curious that this metric is the one that proved significant.

 

Figure 7

Figure 8

Figure 9


Critique of methods:

Given the simplicity of the analysis in this exercise, I don’t have a lot of critiques to offer. The fundamental challenge with analyzing these data was that they are zero-inflated, meaning that of the 29 patches generated, only 4 had observed dam sites. I had initially applied a logistic regression model to consider the probability of dams as a function of patch length, distance to nearest neighbor, and the weighted neighbor length, but refrained from providing the results, opting instead for the simpler analysis above given the small and lopsided samples. That said, the overall effort confirms that patch geometry seems to have a role in the occurrence of beaver dams. In particular, there is strong evidence of a difference in the size of patches where dams were observed compared to those where they were not, which is consistent with theory from landscape ecology suggesting that animals will preferentially select larger habitats over smaller ones, all else being equal. Curiously, these data do not show compelling evidence of a difference in NN Distance between patches with and without observed dam sites, but they do for the NN Length weighted by the distance to that patch.

 

 

June 6, 2018

Exploring the relationship between floating guidance structures, hydraulics, and the location of behavior changes in fish

Filed under: Final Project @ 7:43 pm

Question, dataset, and approach:

This research investigated the hydraulic and behavioral impacts of a floating guidance structure in an experimental channel on juvenile Chinook salmon. Three exercises were conducted to determine if any relationships exist between channel hydraulics (which were stationary, measured at discrete locations and interpolated to become spatially continuous) and the locations of behavior changes (as determined by a behavioral change point analysis tool, “smoove”). Exercise 1 examined differences in the spatial distribution of behavior changes at 20, 30, and 40 degree guide wall angles. Exercise 2 determined which hydraulic variables were most predictive of the location of behavior changes using multivariate methods. Exercise 3 used geographically weighted regression (GWR) to examine spatial variation in the relationship between each of the above variables and the location of behavior changes.

Hypotheses:

1) Because the hydraulics created by guide wall angles of 20, 30, and 40° differ from one another, we expect the location of behavior changes to vary in space with angle.

2)  Because similar flume experiments have found hydraulic thresholds in velocity gradient for fish behavior, we hypothesize that velocity gradient (or another hydraulic variable) will predict the downstream location of behavior changes at all 3 guide wall angles.

Results:

Although the spatial distribution of the location of behavior changes appears to vary, no pattern of statistical significance can be concluded from these data (Figure 1). Because potential hydraulic thresholds are created at increasingly downstream positions along the guide wall with increasing angle, it was hypothesized that the average location of behavior changes would similarly appear farther downstream with increasing angle. Although such a pattern appears to exist, the limited number of observations precludes drawing statistically significant conclusions from the spatial distribution of behavior changes alone (Figure 2).

Figure 1. Two displays of behavior changes and the 95% confidence intervals surrounding their spatial distributions. The data in the graphic on the left, presented in Exercise 2, was displayed backwards.
Figure 2. Although average distance downstream of behavior changes varies with guide wall angle, no statistical significance was found in this study.

A comprehensive analysis of channel hydraulics and the location of behavior changes found turbulent kinetic energy (TKE) gradient, water speed, and velocity gradient to be potential predictors of the spatial distribution of behavior changes. A multivariate regression analysis relating the X and Y coordinates of a behavior change with water speed, TKE, TKE gradient, velocity gradient, and acceleration found only TKE to not be significantly correlated with the location of a behavior change (p-value > 0.05). A Principal Component Analysis (PCA) further determined that water speed, as the largest component of the first principal component, may be an important predictor of behavior change (Figure 3). However, because PCA could relate the components to only one dependent variable (distance downstream), the analysis was extended with Partial Least Squares regression (PLS). PLS found that velocity gradient and TKE gradient best predict the spatial distribution of behavior changes, while water speed contributed the least to each component axis (Figure 4).
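The PCA step itself is a small computation: standardize the hydraulic variables, decompose, and read off loadings and variance explained. A numpy sketch (not the study's actual script, and with toy data in place of the real hydraulic measurements):

```python
import numpy as np

def pca(X):
    """PCA of a variables-in-columns data matrix via SVD of the
    standardized data. Returns (loadings, var_explained): loadings[:, j]
    is the j-th principal component's loading vector; var_explained[j]
    is its share of total variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    var_explained = s ** 2 / np.sum(s ** 2)
    return Vt.T, var_explained
```

Reading the loadings of the first component, as done for water speed in Figure 3, means looking at the column loadings[:, 0] alongside var_explained[0].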

Figure 3. Principal Component Analysis (PCA) indicated that water speed is an important hydraulic variable in predicting the downstream location of behavior changes.

Figure 4. Partial Least Squares (PLS) regression indicated that velocity gradient and TKE gradient are perhaps good predictors of the spatial distribution of behavior changes in both the X and Y directions.

Water speed, TKE gradient, and velocity gradient were found to vary spatially in their relationship with the location of behavior changes in a geographically weighted regression (GWR). GWR differs from other regressions by estimating local relationship coefficients for every point in the dataset, rather than assigning a global coefficient across the entire dataset. In this way, local variation in the relationship between independent and dependent variables can be seen. The relationships of TKE gradient and water speed with the location of behavior changes were both strongly negative and strongly positive at fine local scales (Figures 5 and 6). However, velocity gradient demonstrated a consistent positive relationship with the downstream location of a behavior change (Figure 7). That is, if a behavior change occurred at high (low) velocity gradients, we can expect its location to be relatively far downstream (upstream). Although variability in these data is a result of a fish’s past experience, physiology, and perception of stimuli, velocity gradient may most consistently predict the location of behavior changes in this experiment.
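The local-coefficient idea behind GWR can be sketched compactly: at each observation, fit a weighted least-squares regression in which nearby points get high weight and distant points get low weight. The Gaussian kernel, fixed bandwidth, and toy data below are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def gwr_slopes(coords, x, y, bandwidth):
    """Local slope estimates, one per observation: weighted least squares
    of y on x at each point, with Gaussian kernel weights
    exp(-d^2 / (2 * bandwidth^2)). A minimal sketch of geographically
    weighted regression."""
    coords = np.asarray(coords, float)
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])   # intercept + predictor
    slopes = np.empty(len(x))
    for i, c in enumerate(coords):
        d2 = np.sum((coords - c) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        slopes[i] = beta[1]
    return slopes
```

When the true slope drifts across space, the local estimates track it, which is exactly the kind of spatially varying coefficient surface shown in Figures 5-7.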

Figure 5. Geographically weighted regression depicted local variation in coefficients between water speed and the portion of the boom passed that span positive and negative values.

Figure 6. Spatial variability exists in the relationship between TKE gradient and the downstream location of behavior changes.

Figure 7. Velocity gradient is the only hydraulic variable that showed a consistent relationship with the distance downstream of a behavior change. This implied that if a behavior change occurred at high velocity gradients, we could expect its location to be relatively far downstream, and vice versa.

Significance:

These results may corroborate previous experiments, which found velocity gradient to be an important predictor of the location of behavior changes in fish. If so, designs for improving fish passage at dams should aim to minimize velocity gradients created by floating guidance structures to promote safe bypass routes. However, substantial variation exists in these results (e.g. the spatial distribution of behavior changes does not vary significantly with guide wall angle, as hypothesized; no single hydraulic variable consistently predicted the location of behavior changes across all 3 guide wall angles). Perhaps scientists should focus future studies on hydraulic stimuli as experiential (e.g. signal-to-background noise, cumulative to a threshold) rather than absolute (e.g. one magnitude of a hydraulic variable as a single threshold).

My learning:

For this course, I was forced to revisit R in order to run complicated statistical analyses. Although I was familiar with multiple linear regression analyses already, more sophisticated approaches (multivariate linear regressions, PCA, PLS, and GWR) were new to me. Furthermore, my ability to troubleshoot in Python in order to visualize the results of my analyses (creating multi-dimensional confidence intervals for Exercise 1, for example) was tested and grew.

Geographically weighted regression and the location of behavior changes

Filed under: Exercise/Tutorial 3 2018 @ 6:07 pm

Question:

Exercise 2 determined that velocity gradient, water speed, and turbulent kinetic energy (TKE) gradient are the variables most important to the principal components that predict the spatial distribution of behavior changes. However, because channel hydraulics vary on a small scale (e.g. centimeters) within the experimental channel, the relationship between each hydraulic variable and the location of behavior changes also varies spatially. For example, a multiple linear regression comparing the downstream distance of a behavior change and three hydraulic variables of interest shows that the model residuals are non-uniformly distributed (Figure 1). As the portion of the boom passed increases, the model shifts from over-predicting behavior change location to under-predicting it. In this exercise, a geographically weighted regression was conducted to examine the relationship of three hydraulic variables and the downstream distance of behavior changes on fine spatial scales. The objective was to determine if one or more independent variables have a spatially-consistent relationship with the location of behavior changes.

Figure 1. A general linear model relating the distance downstream of a behavior change and three hydraulic variables (water speed, TKE gradient, and velocity gradient) shows non-uniformly distributed residuals. These results indicate that the relationship between independent and dependent variables may vary in space, and warrant analysis using a geographically weighted regression.


Steps:

A geographically weighted regression (GWR) differs from multiple linear regressions by estimating local relationship coefficients for every point in the dataset, rather than assigning a global coefficient across the entire dataset. In this way, local variation in the relationship between independent and dependent variables can be seen. A 2014 blog post by Adam Dennett was very helpful in understanding and implementing a GWR using R. Once local estimates of regression coefficients between velocity gradient, water speed, and TKE gradient and the location of a behavior change were found, the data were visualized using Python. Only statistically significant relationships (p-value < 0.05) were displayed, so that local coefficients can be regarded with confidence.
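The core idea of GWR, a separate weighted least-squares fit at every point, with weights that decay with distance via a kernel, can be sketched as follows. This is a minimal illustration on synthetic data, not the bandwidth-optimized GWR from the R package used in the analysis.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Return one coefficient vector (intercept + slopes) per observation."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # add intercept column
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian kernel weights
        W = np.diag(w)
        # Weighted least squares centered on observation i
        betas[i] = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)
    return betas

# Toy data in which the true slope varies with location
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(100, 2))
x = rng.normal(size=100)
true_slope = coords[:, 0] / 10.0          # slope increases "downstream"
y = true_slope * x + rng.normal(scale=0.1, size=100)

betas = gwr_coefficients(coords, x.reshape(-1, 1), y, bandwidth=2.0)
print(betas[:, 1].min(), betas[:, 1].max())  # local slopes vary in space
```

A global regression on these data would return a single averaged slope; the local slopes recovered here vary across the domain, which is exactly the spatial variation the maps in Figures 2-4 display.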

Results:

The relationship between hydraulic variables and the location of behavior changes varies substantially in space. Local coefficients of both TKE gradient and water speed range from positive to negative values (Figure 2 and Figure 3). That is, depending on location within the experimental channel, high values of TKE may incite a behavior change relatively early or late as a fish passes the guidance structure. Similarly, high values of water speed may incite a behavior change relatively early or late as a fish passes the guidance structure.

Figure 2. Locally-derived coefficients between water speed and the location of a behavior change. Grey shading indicates the intensity of water speed.

Figure 3. Locally-derived coefficients between TKE gradient and the location of a behavior change. Grey shading indicates the intensity of TKE gradient.

Only velocity gradient demonstrates a consistent relationship with the downstream location of a behavior change (Figure 4). For all observed behavior changes, increasing velocity gradients correlate with increasing distance downstream of the behavior change. Stated another way, if a behavior change occurred at low velocity gradients, we can expect its location to be relatively far upstream. If a behavior change occurred at high velocity gradients, we can expect its location to be relatively far downstream. This differs from the local relationships between the location of behavior changes and TKE gradient and water speed, indicating that velocity gradient may be the best predictor of the distance downstream that a fish passes a guidance structure before changing its behavior. The variability of hydraulic intensity that incites a behavior change is a result of natural variability in a fish’s past experience, physiology, and perception of stimulus, making absolute thresholds of fish behavior difficult to determine. However, velocity gradient may most consistently predict the location of behavior changes in this experiment.

Figure 4. Local coefficients between velocity gradient and the location of a behavior change show consistently positive relationships. Grey shading indicates the intensity of velocity gradient.

Critique of methods:

Geographically weighted regression is a statistical tool that reveals local variation in the relationship between independent and dependent variables. If a researcher has reason to believe that global coefficients don’t capture variation in space, a GWR is a useful, if slightly confusing, method of analysis.

Observed vs. expected time spent in vegetation types as a function of distance to water

Filed under: 2018,Exercise/Tutorial 3 2018 @ 10:32 am

Disclaimer – this post is long, particularly in the ‘steps taken’ section. I may use some of these methods in the Fall, so I made it very detailed.

Question asked

For this analysis I wanted to dig a little more into the relationship between vegetation type, distance to water, and the movement of the recreationist. The reason is that in Exercise 2 I learned that coniferous woodland dominates the study area and the hikers’ path. However, one could wonder if people are less likely to spend time near water if the surrounding vegetation is dense with conifer. Perhaps this dense vegetation is less appealing than the lake feature and the hiker consequently keeps trekking until they reach a more open area. Understanding the significance of this relationship is important prior to applying the data to a hidden Markov model. Results from this exercise may indicate that I should include vegetation data in the model, because vegetation (such as coniferous woodland, meadows, etc.) may explain the movement of recreationists in addition to, or regardless of, distance to water.

My questions for this exercise are:

1.) What are the distributions of the length of the hikers’ tracks versus the time they spend in the area as a function of distance to water?

2.) Are hikers spending significantly more or less time in certain vegetation types despite being near water?

Tool/Approach Used

I used ArcGIS and R to complete my analysis. I did a few things differently in this exercise in contrast to exercise 2. Steps outlined below:

(1) Simplify data and reduce to 1 minute time intervals between points: I am using a sub-sample of 5 GPS tracks. These tracks collected point data every 20 seconds. I decided to aggregate each point to 1 minute intervals to make it a little easier to analyze and interpret. To do this I used the dplyr package in R to aggregate the GPS points to 1 minute timestamps and average the xy coordinates.
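The aggregation in step (1) can be sketched with pandas instead of dplyr; the column names (timestamp, x, y) and coordinate values below are hypothetical stand-ins for the GPS data.

```python
import pandas as pd

# Toy track: one GPS fix every 20 seconds
pts = pd.DataFrame({
    "timestamp": pd.date_range("2018-06-01 09:00:00", periods=9, freq="20s"),
    "x": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "y": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Collapse to 1-minute intervals, averaging the coordinates in each bin
track_1min = (pts.set_index("timestamp")
                 .resample("1min")[["x", "y"]]
                 .mean()
                 .reset_index())
print(track_1min)
```

Each 1-minute row is the centroid of the three 20-second fixes that fall inside it, which is the same reduction the dplyr group-and-average step performs.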

(2) Import revised and reduced spatial dataframe to ArcGIS: I ran into a snag here. In the process of trying to convert the spreadsheet to a shapefile I kept losing the timestamp information. As a workaround, rather than converting the spreadsheet to a shapefile, I converted it to a dBASE table which (fortunately) preserved the timestamp information.

(3) Create 5 unique track layers: At this point, my data were still in one large table. I then split up the merged table into the five unique tracks and layers. To do this I used the select by attribute tool and create layer from selection tool.

(4) Build attribute table: I wanted the following variables in the attribute tables of each track: distance to water feature, vegetation type, length of track, timestamp, xy coordinates. So far, I had the xy coordinates and timestamps. To calculate distance to water feature I used the joins and relates tool to calculate the nearest distance between the GPS point and the outline of the water feature (I had previously created a polygon of the water feature using the digitize tool in editor mode; see Figure 1). To extract vegetation data I used the spatial join tool to link the vegetation type to the GPS point (Figure 2).

(4b) Calculating length of track: the final attribute I needed was the length of the GPS track. To do this, I used the Tracking Analyst tool (thank you, Sarah, for the suggestion!) then track intervals to feature option. What this does is calculate the distance between one point and the subsequent point. This is why the time stamp interval was crucial for me to maintain. I needed to ensure that the tool was using the timestamp to determine the subsequent point for calculating length. This tool added a new column to the attribute table of each track denoting the calculated distance between each point to the next point. Now, I have information on the actual distance (in meters) traveled.

(5) At this point, I’m very happy. I finally have an attribute table with all the information I need for analysis. I then export the table as a text file. I import the table into R for manipulation, visualization, and analysis.

In R:

(6) Bin distance to water data: to create histograms representing proportions I needed to bin the distance to water data. I used the cut tool in R to bin the distance to water data into groups of 20 meters (i.e. 1 – 20, 21 – 40, etc.)

(7) Create dummy variables for vegetation type: I used ifelse statements with the dplyr package to create new columns for conifer presence (0 = not present, 1 = present), meadow presence, and residential facilities presence.
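Steps (6) and (7) can be sketched with pandas; the column names, distances, and vegetation labels below are hypothetical, not my actual attribute table.

```python
import pandas as pd

df = pd.DataFrame({
    "dist_to_water": [5, 18, 25, 47, 63, 110],
    "veg": ["conifer", "conifer", "meadow",
            "residential", "meadow", "conifer"],
})

# Step (6): bin distance to water into 20 m classes (0-20, 21-40, ...)
bins = range(0, 201, 20)
df["dist_class"] = pd.cut(df["dist_to_water"], bins=bins)

# Step (7): 0/1 indicator columns, analogous to the ifelse()/dplyr step in R
df["conifer"] = (df["veg"] == "conifer").astype(int)
df["meadow"] = (df["veg"] == "meadow").astype(int)
df["residential"] = (df["veg"] == "residential").astype(int)
print(df[["dist_class", "conifer", "meadow", "residential"]])
```

pd.cut uses right-closed intervals by default, so a point 5 m from shore lands in (0, 20], matching the 1-20 m class described above.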

(8) Calculate total length of hikers path in vegetation type each distance class: I again used dplyr to calculate the total length of each hiker’s path grouped by the distance class. I then used a conditional statement to only calculate the length of data points in each vegetation type (i.e. conifer, meadow, residential).

(9) Calculate time spent in each distance class: I first created a column with 1’s to represent 1 minute. I then calculated the time each hiker spent in each distance class. I used a conditional statement to only calculate the time spent in each vegetation type (i.e. conifer, meadow, residential)

(10) I then calculated ‘Observed’ versus ‘Expected’ proportions.

Expected = length of [vegetation type] / total length within distance class (assumes people are traveling at a constant speed)

Observed = time spent in [vegetation type] / total time within distance class

This will tell me if people are spending more or less time in a certain vegetation type despite being close to water.

(11) I did a chi-square test for each vegetation type to see if the observed vs expected distributions were significantly different.
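The observed/expected calculation in steps (10) and (11) amounts to the following; the minute counts and length shares are made up for illustration, and the chi-square statistic is computed directly rather than with R's chisq.test.

```python
import numpy as np

# Minutes actually spent in one vegetation type, per distance class (observed)
observed_time = np.array([30, 22, 18, 10])

# Share of track length in that vegetation type per distance class; at a
# constant travel speed, time would be proportional to length
length_share = np.array([0.45, 0.25, 0.20, 0.10])
expected_time = length_share * observed_time.sum()

# Odds ratio: >1 means more time spent than constant-speed travel predicts
odds_ratio = observed_time / expected_time

# Chi-square goodness-of-fit statistic (df = number of classes - 1 = 3)
chi_sq = ((observed_time - expected_time) ** 2 / expected_time).sum()
print(odds_ratio, chi_sq)
```

With these toy numbers the hiker spends less time than expected in the first class (odds ratio below 1) and more in the later ones, the same pattern of interpretation used for the figures below.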

Description of Results Obtained:

The first three figures I generated demonstrated to me the overall proportions of the length of the hikers’ tracks grouped by distance class, and the time spent in each distance class. Figure 4 indicates that the five hikers spent 75% of their time 0-20 meters from the lake. Additionally, over 60% of the entire track length was 0-20 meters from the lakeshore. From a visual check, it looks like my hypothesis is supported: people are spending more time near the water. I wonder if that’s a function of the lake or another variable?

Laura introduced me to a neat way to conceptualize differences in distributions, particularly when the distributions represent different units of measurement (i.e. time vs. length). Creating an odds ratio essentially divides the values of time vs. length to give me a ratio. A value less than 1 indicates that people spent less time in the distance class than they would have if they had been traveling at a constant pace. A value greater than 1 indicates that people spent more time in the distance class than they would have if traveling at a constant pace. Figure 5 demonstrates that people were spending more time in the area when they were 0-20 meters and 61-80 meters from the lake shore. I am ignoring the 141-160 and 161-180 distance classes because the sample sizes are too small. Once again, it looks like the hypothesis is supported. But is this because of the water feature? Or vegetation? Or both?

Figure 6 illustrates the proportion of vegetation types within each distance class. It looks like, 0-20 meters from the lake shore, the predominant vegetation types are conifer woodland (~50%) and residential and facilities (~40%), i.e. paved parking lots, toilets, picnic areas, and paved sidewalks.

The next set of figures illustrates the expected amount of time a person would spend in specific vegetation types within each distance class if they were traveling at a constant rate, compared to the observed amount of time that person actually spent in each vegetation type within each distance class. Coniferous woodland was the first vegetation type I examined. Figure 7 indicates that people were spending less time in conifer despite being closer to water! The further away from water, the more time they spent in conifer. The chi-square test indicates these differences in distributions are statistically significant (χ² = 90.359, df = 7, p-value < .001). Figure 8 demonstrates the odds ratio of observed/expected. Interestingly, Figure 5 suggests people are spending more time near water, yet when there’s conifer present, this is not the case; I wonder if there is another vegetation type explaining this?

Figure 9 represents the observed versus expected distributions for herbaceous vegetation types, in other words, open meadows. Interestingly, it appears that the hikers spent more time in meadows than was expected (see Figure 10 for the odds ratio). It also appears that meadows are prevalent further away from shore. Back in Figure 5 we learned that people are spending, on the whole, more time in areas 0-20 meters and 61-80 meters from shore. Perhaps for the 61-80 meter distance class, this can be explained by the presence of open meadows. Interestingly, the chi-square test indicates that these distributions are not significantly different from one another (χ² = 7.4523, df = 3, p-value = .06).

Figure 11 is very revealing as it indicates that people are spending more time in residential facilities that are close to shore. This includes areas with picnic tables, paved sidewalks, and some areas of the parking lot. This makes sense as I imagine people are drawn to the amenities that are offered. Also, the chi-square test indicates that the differences in these distributions are significant (χ² = 197.86, df = 3, p-value < .01). Seeing these results, I wonder if next time I should modify my hypothesis and examine if/how distance from vehicle explains behavior.

Critique of Method

I created a workflow that made sense to me, but overall the process was very time consuming, particularly in ArcMap. In my experience thus far, I have appreciated the visualization ArcMap provides and find the tools easy to learn and navigate. However, during this particular exercise ArcMap was especially clunky and slow to process. I was glad to use ArcMap so I could gain experience using the software, but many hours elapsed before I could build an adequate attribute table for subsequent analysis in R. Fortunately, thanks to this class, I am also growing more and more comfortable in R, so, with relative ease, I was able to quickly modify my data and build code that allowed me to analyze and visually represent the data.

Another limitation and critique of this method is that I am only using a conveniently chosen subset of 5 GPS tracks. Therefore, my sample size of humans (not GPS points) is low and not randomly selected; thus, I can’t generalize beyond the five hikers in the subsample. Given the time constraints at this point in the term, I decided to stick with the five hikers rather than boosting my sample size. Despite the low sample size, this type of analysis introduced me to a new approach for examining relationships between recreationists and their surrounding environment. I plan to adopt these methods for future research.

Driven by curiosity, at the end of this exercise I looked at the tracks in ArcMap again. It appears that the majority of use is occurring near the parking lot. Additionally, the ‘meadow’ in this dataset is sandwiched between residential facilities (i.e. the parking lot and picnic tables). This situation makes me wonder if distance to vehicle or parking lot is also playing a role in the response. Further, I wonder if I should even classify that polygon as meadow, especially since it is unclear if people are spending time there because of the vegetation or because of the distance to the vehicle/easy access to facilities. If I had more time, I would examine this additional variable.

June 5, 2018

Comparing SnowModel output to metrics of Dall Sheep recruitment.

Filed under: 2018,Exercise/Tutorial 3 2018 @ 4:16 pm

Question asked

Is Dall Sheep recruitment more influenced by near-summer snow conditions or do early snow season conditions also play a role?

A typical metric for assessing sheep recruitment, i.e. the number of young animals available to ‘recruit’ into the population, is the lamb to ewe ratio (hereafter referred to as lamb:ewe). In the case of Dall sheep, demographic surveys take place most frequently in the summer months of June and July, after the lambing months of April and May. Here I will examine the prevailing theory that spring snow storms causing high lamb mortality are a driver of Dall sheep population decline, by comparing summer lamb:ewe ratios to monthly and seasonal aggregates of snow data derived from a spatially explicit snow evolution model run at daily timesteps.

Data / Tool / Approach used

Snow Data

The snow data for this analysis is derived from SnowModel, a spatially explicit snow evolution model, and consists of daily mean snow depth, total snowfall, mean air temperature, and forageable area (the percentage area under snow depth and density thresholds that allow Dall sheep to graze) aggregated into monthly and seasonal means and totals.

Sheep Data

Sheep data used here is from annual surveys completed by Alaska Department of Fish and Game, Bureau of Land Management, US Fish and Wildlife Service and National Park Service in 6 different Dall Sheep ranges – see blogpost 2. Survey methods include distance sampling, stratified random sampling, and minimum count methods either from a ground location or fixed wing aircraft.

Approach used and steps followed

At this initial stage, a simple approach was employed to test the research question: counting the occurrence of observations, by month and season, that meet the following conditions.

  1. High month/season snowfall and low lamb:ewe ratios
  2. Low month/season air temperature and low lamb:ewe ratios
  3. High month/season snow depth and low lamb:ewe ratios
  4. Low month/season forageable area and low lamb:ewe ratios

The basis of these statements is the assumption that increased snowfall and snow depth, or lower air temperatures and forageable area, produce conditions where greater energy expenditure is required for survival. Dall sheep are in calorific deficit during the snow season, so benign conditions mean that ewes reach the lambing period in better condition and are potentially then more able to provide for their lambs, increasing the observed lamb:ewe ratio. An alternative or complementary idea is that conditions during and after lambing are more important, as lambs require a narrower range of conditions than adult sheep to survive.

To test these conditions the snow data and lamb:ewe ratios were converted into anomalies and coded as being strongly/weakly positive or negative based on whether they were outside or within one standard deviation and the direction of their sign. The above conditions were then assessed based on this coded data via the plotting of heatmaps and tallying occurrences (see results below).
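The anomaly-coding step can be sketched as follows, with a made-up series; the thresholding at one standard deviation is the same idea applied to the snow variables and lamb:ewe ratios.

```python
import numpy as np

def code_anomalies(values):
    """Convert a series to anomalies and code each value by sign and
    whether it falls inside or outside one standard deviation."""
    v = np.asarray(values, dtype=float)
    anom = v - v.mean()
    sd = v.std(ddof=1)
    codes = np.where(anom >= sd, "strong+",
            np.where(anom >= 0, "weak+",
            np.where(anom > -sd, "weak-", "strong-")))
    return anom, codes

# Toy series with one clear outlier year
anom, codes = code_anomalies([10, 12, 11, 25, 9, 11])
print(codes)
```

Conditions like "above average snowfall and below average lamb:ewe ratio" then reduce to tallying years where one coded series is positive while the other is negative.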

This approach is not complex but does begin to examine which time-scales and time periods of the snow season are important for Dall sheep, insight that can later be used in more complex predictive models.

The analysis took place in R, and the biggest hurdle was being able to pass column names as arguments into functions. This was overcome by learning the use of ‘enquo()’, the tilde ‘~’, and the nuances of standard versus lazy evaluation that govern whether to include an underscore (e.g. aes_()) in function calls in dplyr pipes. See https://dplyr.tidyverse.org/articles/programming.html for further, and better described, info!

Brief description of results obtained

In the following section I present graphs primarily from the Wrangell St Elias domain; analysis was also conducted in the five other domains, but in the interest of brevity I am limiting the number of figures.

Above average snowfall and below average lamb:ewe ratios by month

Fig. 1: WRST_weak_spre

The heatmap, top left, shows in blue the months where the condition is met (the x-axis is the same as in the bar chart beneath). The dark grey bars correspond to years where a sheep survey took place (y-axis; 15 out of 37). In the right panel, the scatter diagram shows the lamb:ewe anomaly for each year.

From fig. 1 we can see that 9 of the 15 years of sheep surveys had a lamb:ewe ratio below the mean. Of these nine years the most common months that had higher than average snowfall were October and November, with 6 each. By contrast, the months believed to be important to lamb survival (Apr, May, June) only have 4 recorded instances of above average snowfall. Each year with below average lamb:ewe ratios had at least 4 months of higher than average snowfall (excluding August, which comes after the survey). 50 out of the 99 months (9 years with below average lamb:ewe ratios, excluding August) have above average snowfall.

Below average air temperature and below average lamb:ewe ratios by month

Fig. 2: WRST_weak_tair

Fig. 2, by contrast, shows air temperature by month. Here we see that May has the greatest number of months where the condition is met (n = 6). However, October and November have the same number of instances as March, June, and July. 48 out of 99 months agree with the condition.

Above average snow depth and below average lamb:ewe ratios by month

Fig. 3: WRST_weak_snod

Mean snow depth (fig. 3) shows 4 to 5 instances per month meeting the condition from October to June in years with low lamb:ewe ratios. 45 out of 99 months agree with the condition.

Below average forageable area and below average lamb:ewe ratios by month

Fig. 4: WRST_weak_pc_area

The autumn months of September to November have comparatively few instances where the condition is met for below average forageable area (n = 4 to 5), compared with December through May, where at least 6 out of 9 years meet the condition. In February the condition is recorded 8 times. 66 of 99 possible months agree with the condition.

By season

Fig. 5: WRST_weak_all_season

When considering the conditions by season, both autumn and spring snowfall meet the condition in 6 years out of 9, and above average snowfall is seen in 19 out of 36 possible seasons in low lamb:ewe ratio years. Summer air temperature meets the condition 6 times, winter and spring 5 times each, and autumn just twice; 19 out of 36 possible seasons show lower than average air temperature. Snow depth by season does not meet the condition more than 5 times for any season (winter and summer), and 17 out of 36 seasons meet the condition. Forageable area meets the condition in 23 out of 36 seasons, with winter having the highest count, met in 7 out of 9 years.

Conclusions / Critique of method

This method was a simple approach to examining which variables, and at which times, could affect Dall sheep summer recruitment. Both by month and by season, below average forageable area was most frequently recorded alongside below average lamb:ewe ratios: 66 out of 99 possible months and 23 out of 36 seasons. Snow depth did not appear as important as either snowfall or air temperature in the monthly or seasonal comparisons.

A critique of this method is that it doesn’t capture instances where the opposite of the condition occurs, e.g. high lamb:ewe ratios and high forageable area. It also doesn’t test the significance of any relationships and is susceptible to anomalies affecting the limited lamb:ewe sample and its mean. The same tests presented above, but with conditions describing instances of a snow variable and lamb:ewe ratios outside of one standard deviation, did not produce any meaningful patterns, with occurrences isolated to single months or seasons, if present at all. Despite these limitations, this approach does give insight for exploring the variables using more complex regression methods.

June 4, 2018

Shift in the lag of GPP with soil water content at a C4 grassland site

Filed under: 2018,Exercise/Tutorial 3 2018 @ 4:48 pm

1. Question asked:

How is soil water content related to gross primary production (GPP) and the light use efficiency of photosynthesis (LUE) of a natural grassland? How do these relationships differ between C3- and C4-dominated grasslands? Specifically, how do these relationships vary among water years?

2. Tool / approach used:

To answer this question, I calculated the cross-correlation of GPP and LUE with soil water content (SWC). I then compared the maximum cross-correlations, as well as the timing of the lag at which the maximum cross-correlation occurs. I performed these calculations separately for two sites, and for individual water years. Water years are measured from October to September and, in contrast to calendar years, are a more biologically meaningful unit for examining the relationship between plant growth and water variables.

My data are derived from eddy covariance flux tower locations in natural grasslands in Eastern Kansas. Soil water content, like other eddy covariance flux data, is recorded at 30-minute intervals. Here, I have aggregated flux and environmental data to daily sums and averages.

3. Steps followed to complete analysis

Data prep and selection

– Because SWC data are most consistently available at both sites between 2009-2014, I limited my analysis to data collected between those years. I further limited the data to 2010-2013 due to data availability.

– I also initially examined the water use efficiency (WUE) of production, but WUE, calculated as GPP / evapotranspiration (ET), has distinct dynamics from GPP or LUE and does not appear to be strongly cross-correlated with soil water content, so I have omitted it from this analysis.

Plotting GPP and light use efficiency (LUE) against soil water content, for each water year, helps visualize the initial lag correlation between the variables. Here, it seems evident that the water year captures the temporal extent of the relationship between production and soil water content: a hump in soil water content (dashed line) precedes a hump in GPP and LUE at each site, in each year.

Visual assessment of additional environmental variables that may be related to production

This plot shows the annual timecourse of cumulative precipitation and GPP, for water years from 2010-2013. This was an attempt to investigate more closely the environmental conditions that might be driving the soil water content and its relationship with production. However, the cumulative annual precipitation, calculated as the rolling sum of precipitation measured at the flux tower over the course of a water year, shows very low values of precipitation in 2012 and 2013 at KFS and Konza, respectively. This suggests that there was an error in the collection of the precipitation data. As I proceed with this analysis, I will substitute DayMet (https://daymet.ornl.gov/) climate data for the site-level flux data, which will sacrifice spatial resolution but should improve data quality.

Analytical Methods

I used the ccf() function to calculate cross-correlation between GPP and SWC, and between LUE and SWC. First, I calculated the cross-correlation of GPP and LUE, with soil water content, at each site separately, for all years from 2010-2013.

However, I’m most interested in comparing patterns of cross-correlation among years, particularly with regard to the timing of peak cross-correlation in 2012, a drought year. Does the drought year have a distinct pattern of cross-correlation between GPP, LUE, and soil water content than a non-drought year?

I then used the ccf() function to calculate the cross-correlation of GPP and LUE with soil water content at each site, for each year from 2010-2013, in order to assess interannual variability in these relationships. To visualize trends both between sites and between years, I separately plotted each site by year, as well as each year by site.

Lastly, to better visualize the shift in the timing and magnitude of cross-correlation at each site, I extracted the date and value of the maximum and minimum cross-correlation and plotted these values for each year.
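A rough Python analogue of the ccf() workflow is sketched below. The two series are synthetic Gaussian seasonal curves, with the SWC peak deliberately placed 100 days ahead of the GPP peak to mirror the lag scale discussed in the results; the real analysis uses the flux-tower data.

```python
import numpy as np

def cross_corr(x, y, max_lag):
    """Correlation of x against y shifted by each lag (y leads x for lag > 0)."""
    lags = np.arange(-max_lag, max_lag + 1)
    r = []
    for lag in lags:
        if lag > 0:
            r.append(np.corrcoef(x[lag:], y[:len(y) - lag])[0, 1])
        elif lag < 0:
            r.append(np.corrcoef(x[:len(x) + lag], y[-lag:])[0, 1])
        else:
            r.append(np.corrcoef(x, y)[0, 1])
    return lags, np.array(r)

days = np.arange(365)
swc = np.exp(-0.5 * ((days - 120) / 40) ** 2)   # SWC "hump" peaks ~day 120
gpp = np.exp(-0.5 * ((days - 220) / 40) ** 2)   # GPP "hump" peaks ~day 220

lags, r = cross_corr(gpp, swc, max_lag=150)
best_lag = lags[np.argmax(r)]                   # lag of maximum cross-correlation
print(best_lag)
```

Extracting the lag of the maximum (and minimum) of r is the same step used to compare the timing of peak cross-correlation across sites and water years.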

4. Results obtained

Cross-correlation of production with soil water content, for all years

First, I calculated cross-correlation between GPP and SWC, and between LUE and SWC, for all years at both sites. Thus, this is the “average” cross-correlation between production and soil water content, which has a peak maximum cross-correlation at a lag of about 100 days for both GPP and LUE. However, I’m most interested in looking at how interannual variation in climate affects plant production.

Cross-correlation between production and soil water content, by site and water year

When we examine how cross-correlation between GPP and soil water content varies between sites and between years, we see a clear shift at Konza Prairie, the C4 grassland, in 2012. This shift indicates that GPP and LUE are peaking earlier in relation to soil water content: with a lag of about 30 days, as opposed to 100 days at the Kansas Field Station, and at Konza in previous years.

Because 2013 followed the severe drought year, 2012, this shift suggests that the C4 grasses at Konza may be more flexible in shifting the timing of their water use in response to drought or other irregular moisture conditions.

To visualize this further, I also extracted the value and lag of the maxima and minima of the cross-correlation function, to see how these points change from year to year at each site.

Interannual variation in the timing and value of maximum cross-correlation

Plotting the date of the maximum cross-correlation, as it changes over years at each site, also helps visualize the divergence in the timing of the relationship between the production indices and soil water content.

5. Critique of the method

While this method is useful in assessing which variables are cross-correlated, and when those relationships peak, it does not clarify which environmental variables are controlling the production dynamics that I am seeing. In order to determine whether soil moisture content or another variable (such as cumulative precipitation, air temperature, soil temperature, a drought severity index, or others) is more responsible for the production dynamics, I will need to build and compare linear mixed models that predict GPP and LUE from a suite of environmental variables, for sites in different years. This will help me assess which environmental variables may be more or less important in certain years, and better quantify their relationships with production.

This task is complicated by the fact that the timing of these relationships (the lag at which cross-correlation is strongest) changes over time, so it may be difficult to construct an appropriate linear mixed model with the most influential lag variable.

May 30, 2018

Trends in 2017 Phenology Study Soil Temperature and Moisture Data

Filed under: 2018,Exercise/Tutorial 2 2018 @ 9:58 am

QUESTION:

For this exercise, I looked into environmental variables that I had measured in conjunction with plant and insect phenology data during the 2017 growing season. In 2017, explanatory variables that I tracked included: soil temperature & soil moisture at a depth of 8cm, measured per individual plant during each survey; ambient temperature at 30cm and 100cm from the ground, measured continuously at each site; sun exposure per site, which was estimated using a tool called the solar pathfinder; and snow disappearance date per site, which will be modeled – data not yet acquired.

Focusing on soil temperature and moisture, an initial task was to examine how much variation was captured for each variable at each site. If the five sampled sites do not show adequate variation in terms of soil moisture and temperature, then it would not be a useful exercise to attempt to use these variables as explanatory variables in models of plant and insect phenology. The questions answered in this exercise were: what is the range of values observed for soil temperature and moisture across 2017 surveys? How do these values vary across the season, and between sites? Because my data exploration quickly uncovered a lurking effect of time of day on soil temperature, I found that I needed to characterize and adjust for this effect before I could address these questions, so an intermediate question became: can I remove or lessen the impact of time-of-day on my measurements of soil temperature?  A description of those methods follows.

TOOLS/APPROACH:

An initial exploration of the soil temperature data found that soil temperature values were affected by the time of day at which the measurements were taken, with soils reading warmer the later in the day they were measured. For ease of manipulation, I converted character date & time columns to datetime or POSIXct objects using the lubridate package. I used the lm function along with ggplot2 to fit a linear regression to a plot of soil temperature as a function of time of day. I used mutate to create an “adjusted soil temperature” variable by using the linear regression to estimate what each individual soil temperature value would have been had they all been measured at noon. I plotted the adjusted soil temperature against growing degree days to look at seasonal trends both for all data combined and for each site, individually. I used RColorBrewer to directly manipulate color palettes for objects in ggplot2.

METHODS:

While trying to assess whether these data showed a linear trend, I found myself bumping against the limitations of treating time as a discrete factor; so I converted dates & times of surveys to datetime or POSIXct objects using the lubridate package. Using existing character columns for time and date, I created a new “datetime” variable with values in a mdy_hm format (eg. 2017-07-13 13:00:00). I also created a “time” variable which expressed time of day in the same format, fixing a dummy date (2018-05-28) so that these POSIXct objects would only vary by the time. These two steps put the data in a much more flexible form, with time expressed as a continuous and consistently spaced variable. I eventually converted time-of-day to decimal time for ease of use in linear formulas, etc.
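The decimal-time conversion at the end of this step can be sketched as follows; this is a minimal Python analogue of the lubridate-based workflow (the original was done in R), expressing clock time as minutes after midnight.

```python
from datetime import datetime

def decimal_time(clock, fmt="%H:%M"):
    """Convert a clock-time string to minutes after midnight (noon = 720),
    mirroring the continuous decimal-time variable derived from the
    POSIXct objects."""
    t = datetime.strptime(clock, fmt)
    return t.hour * 60 + t.minute

noon = decimal_time("12:00")      # 720
afternoon = decimal_time("13:00") # 780
```

Expressing time this way makes it a continuous, consistently spaced predictor that drops directly into a linear regression formula.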

As noted above, an initial exploration showed that soil temperatures read warmer the later in the day they were measured (Figure 1). This pattern interfered with my ability to examine and interpret the patterns I was actually interested in, such as soil warming across the season and variation between sites, so I sought to “detrend” the data from this time-of-day effect. I used ggplot to plot soil temperature versus time of day (which ranged from 9am to 7pm) and used the lm function to fit a linear model to the data (Figure 2). The linear model appeared to fit well, with no systematic deviations, which was confirmed by examining residuals (Figure 3).

Figure 1. A plot of soil temperature observations (across all sites and surveys) against time of day (decimal time; 720 = noon) shows a positive trend.

Figure 2. A linear regression of soil temperature data against measurement time of day (blue) shows a shift of more than 5ºF in average soil temperature when comparing measurements taken early in the day to those taken late in the day.

Figure 3. Residual plots from the linear regression of soil temperature against time of day (decimal time) show no distinct patterns or deviations in the residuals.

Because this appeared to be a relatively good fit, I used this linear regression to create adjusted soil temperature values for all sites and survey dates; in essence, manually de-trending the data. I extracted the slope from the lm object and used this overall linear trend to estimate what each soil temperature reading would have been had every point been taken at noon (720 in decimal time). The new “adjusted soil temp” variable was created using the mutate function. I confirmed that I had successfully remapped soil temperature observations to noon by drawing parallel lines between the observed points and the new, adjusted soil temperature measurements (Figure 4). A plot of adjusted soil temperatures against time of day (decimal time) further confirmed that I had successfully removed the trend present in the original data (Figure 5).

Figure 4. Parallel lines confirm that adjusted soil temperature values (series: adjstemp) represent what each measured soil temperature value would be at noon, if the daily trend is adequately described by the linear regression.

Figure 5. Plot of adjusted soil temperature values against time of day (decimal time) confirms that these adjusted values no longer exhibit a positive linear trend.
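The noon adjustment described above can be sketched in code; this is a minimal Python analogue of the lm/mutate steps (the original analysis used R), with synthetic data standing in for the soil temperature observations.

```python
import numpy as np

def adjust_to_noon(temps, dec_times, noon=720):
    """Fit temp ~ time-of-day across all observations, then slide each
    reading along the fitted slope to its predicted value at noon (720
    in decimal time)."""
    slope, _ = np.polyfit(dec_times, temps, 1)
    return temps - slope * (dec_times - noon)

# Synthetic readings that warm through the day (9:00-19:00 in decimal time)
rng = np.random.default_rng(1)
times = rng.uniform(540, 1140, 300)
temps = 55 + 0.01 * (times - 720) + 0.5 * rng.standard_normal(300)

adjusted = adjust_to_noon(temps, times)
residual_slope = np.polyfit(times, adjusted, 1)[0]  # ~0 after de-trending
```

Refitting the adjusted values against time of day yields a slope of essentially zero, which is the check that Figure 5 performs visually.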

RESULTS:

Having removed the time-of-day trend from my soil temperature data, I returned my attention to answering the questions of interest: do soil temperatures exhibit trends across the season, and do these trends differ by site? Plotting adjusted soil temperature against growing degree days (in hundreds) showed a positive relationship between soil temperature and growing degree days; a trend that held within and between sites. This trend also appeared to be reasonably described by a linear regression; so I fit both an overall linear model (grey) and per-site linear models (Figure 6).

Figure 6. Adjusted soil temperature observations plotted against growing degree days (in hundreds) show a positive trend, with warmer soil temperatures measured later in the season. Linear regressions for all data (in grey), and linear regressions for each individual site (see legend), show similar behaviors across most sites.

Comparing these to linear regressions that would have resulted from non-adjusted soil temperatures shows that de-trending the data for time-of-day effects allowed seasonal trends to emerge, so I conclude that the de-trending of soil temperatures for time-of-day trend was relatively successful (Figure 7). Some limitations are discussed below in the “Critiques” section.

Figure 7. Linear regression of non-adjusted soil temperature observations show overall seasonal trends would be obscured without the removal of the time-of-day trend lurking in the original data.

Linear regressions for each site had similar slopes and soil temperature ranges, with the exception of Bristow Prairie, which had unbalanced sampling since the site was made inaccessible by a fire after just two survey dates. This may imply that sites used in this study did not represent enough variation in soil temperature to adequately test the hypothesis that soil temperature is a driver of flowering phenology for Senecio triangularis (my focal plant species). However, I measured soil temperature at each individual plant, and the data show within-site variation in soil temperature and phenology, so I will continue to explore and test whether there is any correlation between soil temperature and flowering phenology. The daily trends in soil temperature shown through this analysis lead me to believe that my measurements on a given day or at a given point were strongly impacted by sun exposure and daily weather.

A quick glance at soil moisture found that soil moisture measurements (in percent soil moisture) were not impacted by the time of day that measurements were taken, but remained relatively stable. However, sites surveyed did not capture much variation in soil moisture with the exception of Buck Mountain, which was the wettest site throughout the season (Figure 8).

Figure 8. Plotting soil moisture (percent) by unique plant ID shows similar range and variation for five sites, with distinctly wetter soils at one site (Buck Mountain).

CRITIQUE:

This procedure assumed that the linear regression fit to data from all sites and dates was a reasonable approximation of how soil temperatures warmed within a day at each individual site and date. At the site level, comparing adjusted to non-adjusted soil temperatures (plotted against survey days) shows that de-trending for time-of-day removed some, but not all, of the potential influence of time-of-day on soil temperature readings (Figures 9a and 9b). The impact of time-of-day on soil temperature may be directly linked to the timing and duration of sun exposure at each site: Figures 9a and 9b show that the latest time point (17:00) had cooler soil temperatures than mid-afternoon, which may be directly caused by the east-facing site losing sun exposure in the evening. These data also show that it is difficult to disentangle effects of time-of-day from effects of season, since the later-season measurements at this site both happened to be taken early in the day.

Figure 9a. Non-adjusted soil temperature observations at Buck Mountain (values are jittered to show overlapping observations) plotted against growing degree days (hundreds), with color denoting time of day that measurements were taken. The data show elevated soil temperatures for measurements taken in the middle of the day (blue, aqua). This is a site-level trend that is not necessarily accounted for by the overall linear trend used to adjust soil temperature measurements.

Figure 9b. Adjusted soil temperature observations at Buck Mountain plotted against growing degree days (hundreds), with color denoting time of day that measurements were taken.

In terms of R packages used, I was surprised that there was not an easier way to plot linear regressions in ggplot. I ended up using geom_abline() within the ggplot call, specifying the intercept and slope that I extracted from lm objects. Another option was geom_smooth() with method=’lm’, but this did not allow extrapolation of the lines beyond the provided data range, and it automatically added confidence-interval ribbons that were too visually busy in this context.

Because it was not possible for me to measure soil temperature at the same time of day at each site, this post hoc analysis allowed me to correct for the impact of time of day. It did not, however, account for the impact of weather at the time of collection (such as cloudy skies) and other factors. A better solution would be to take continuous data for variables such as this which experience daily fluctuations; but even with that data, decisions would need to be made about how to characterize the daily fluctuations and derive a single measure to compare across sites and surveys, and relate to response variables (flowering & insect phenology) measured at discrete time-points.

Exploring the Distribution of Stream Habitat Characteristics and Observed Beaver Dam Locations in the West Fork Cow Creek Watershed

Filed under: 2018,Exercise/Tutorial 2 2018 @ 9:56 am

Question that I asked:

How do stream reaches with observed beaver dam sites in West Fork Cow Creek relate to stream gradient, active channel width, and valley bottom width?

In this exercise I developed a few charts to consider the distribution of stream habitat in terms of stream gradient, active channel width, and valley bottom width in West Fork Cow Creek relative to observed beaver dam sites. The intent of these charts is to explain how I arrived at the criteria that were ultimately used in my first exercise to generate metrics of habitat ‘patches’.

Tools used:

I used a couple of graphing tools in R to relate the distribution of stream habitat to observed dam sites, including base plot, ggplot2, and easyGgplot2. I also used the SQL and Calculate functions in ArcMap.

Steps to generate charts:

This was a simple process overall. The first step was to generate dummy variables for my stream reaches to indicate whether beaver dams had been observed. Because the number of dams observed was relatively low, I was able to select all relevant stream reaches by hand using the selection tool, then use the Calculate function to add a 1 to all selected rows in a new field titled Dams Observed, and a 0 to all rows where dams were not observed. Once that was complete I exported the stream network shapefile and read it into R.
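The same dummy coding could be scripted outside ArcMap; here is a minimal Python sketch, where the reach IDs and which reaches have dams are purely illustrative.

```python
# Hypothetical reach IDs; which reaches have observed dams is illustrative.
reach_ids = [101, 102, 103, 104, 105]
dam_reaches = {102, 104}  # reaches where beaver dams were observed

# Equivalent of selecting reaches by hand and calculating a 0/1
# "Dams Observed" field:
dams_observed = [1 if rid in dam_reaches else 0 for rid in reach_ids]
```

The resulting 0/1 indicator is what lets the later histograms and density curves split all reaches from dam-site reaches.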

Generating the charts was simply a matter of iterating on code I scavenged from StackExchange and other sources online.

Results

Gradient:

Stream gradients are fairly well distributed throughout the basin (Figure 1), ranging from 0 to over 8 degrees but averaging just over 2 degrees. Figure 2 shows this same distribution with reaches containing observed dam sites superimposed in red. Figure 3 shows a similar plot, but with the dam-site reaches normalized as a density curve. Overall, dam sites occurred on the reaches with the lowest gradients.

Figure 1

Figure 2

Figure 3

Active Channel Width:

Active channel widths (ACW) are heavily skewed in the basin, with most at 5 m or less (Figure 4). Figure 5 overlays reaches with dams, but these are difficult to see because of the scaling. The differences are much more apparent in Figure 6, which depicts density curves for all stream reaches and for reaches with observed dams. Overall, dams occurred on streams with ACW between 3 and 6 meters.

Figure 4

Figure 5

Figure 6

Valley Bottom Width:

Valley bottom width (VBW) ranges from 10 m to over 400 m, but most reaches (>50%) occur between 25 m and 100 m (Figure 7). Dam sites, however, were observed between 25 m and just over 150 m (Figures 8, 9).

Figure 7

Figure 8

Figure 9

Variable comparison via scatter plots:

Figures 10, 11 and 12 show the distribution of all streams (grey circles) and those with observed dams (red triangles) across these habitat metrics.

Figure 10

Figure 11

Figure 12

Critique of the Method:

Coding in a dummy variable through the SQL and Calculate functions of ArcMap, while simple, was very helpful and will be even more so as I expand the analysis to the rest of the Umpqua Basin.

The R code for the overlapping histograms was helpful, but the large difference in scale between all stream reaches and reaches with observed dams made the histograms difficult to compare. The density plots provided a reasonable improvement for comparing the distributions of the two reach types.

May 29, 2018

Exploring relationships among environmental variables and time for hikers in Grand Teton National Park

Filed under: 2018,Exercise/Tutorial 2 2018 @ 9:56 am

Question that you asked:

Brief context: Throughout this class we have been exploring how human behavior and movement are influenced by environmental factors: are there underlying features in the surrounding environment that relate to the patterns of a recreationist’s movement? For example, are there environmental conditions that tend to result in shorter step lengths and acute turning angles for the hiker (spending more time in an area, stopping, sightseeing) versus longer step lengths and obtuse turning angles (in transit, hiking)? The dataset I am using contains hundreds of high-resolution GPS tracks of day hikers at a popular lake in Grand Teton National Park. Overall, I hypothesize a positive relationship between the distance of the recreationist to the water feature and the length of their step and turning angle, i.e. as the distance between the water and the recreationist increases, their step length will increase and their turning angle will become more obtuse. Another way to frame it: the closer to water the recreationist is, the more likely they will stop and view the lake; the further from water, the more likely they will speed up.

Specific Questions for this Exercise: Before testing this hypothesis, I needed to ensure other environmental variables wouldn’t confound this relationship I’m interested in exploring. Therefore, my questions for this exercise are:

1.) What is the relationship between elevation and distance to water?

2.) What is the relationship between elevation and vegetation type?

3.) What is the relationship between vegetation type and distance to water?

Then, within the actual movement path of the recreationist:

4.) What is the relationship among the recreationists’ movement path, time, vegetation, and distance to water?

I chose these variables because a recreationist may have shorter step lengths when closer to water not because of the water feature (my hypothesis), but because of a change in elevation, or the type of vegetation (conifer woodland vs. meadow, etc.). Thus, to prevent any spurious relationships in my results, I am using this exercise as an opportunity to examine the relationships between these independent variables.

Tool/Approach used:

I worked in ArcMap and R to address these questions. ArcMap allowed the functionality for me to delineate the study area, randomly select data points for analysis, and extract the values of my three independent variables to build an appropriate attribute table for subsequent analysis in R.

Using R, I created simple scatter plots, box plots, and spatially explicit plots of the data. These results allowed me to visually analyze the relationships between the variables while also providing me with a richer understanding of the environmental features of the study area and the recreationists’ movement paths.

Description of steps used to complete the analysis:

(1) Define study area -> Using the buffer tool, I created a 100 m buffer around the trail network in Grand Teton National Park (GRTE) (Figure 1). Because my dataset contained only day users, I used the digitize tool in Editor to delineate a study area specific to the recreationists in my sample (Figure 2), then used the clip tool to clip the buffer to the study area I defined. Per Julia’s suggestion, I later selected 200 random points within this study area for analysis.

Figure 1. Applying a buffer around all of the trails in GRTE. Too much!

Figure 2. Using the clip tool to limit and define study area for analysis.

(2) Generate random points for analysis -> I used the create random points tool to create 200 random points within the defined buffer area (see figure 3). Note: I intentionally removed any data point that was created in the lake.

Figure 3. Random points created within defined study area.

(3) Build attribute table -> I used various spatial joins to assign each randomly selected point the following attribute data: vegetation type, elevation, and distance (m) to the water feature. For vegetation type, I did a simple spatial join with the existing vegetation layer that I had previously imported into GIS. For elevation, I used the Extract Values to Points tool. To calculate distance to water, I used the join and relate tools in ArcMap, which calculate the nearest distance between each random point and the polygon I had previously created denoting the lake boundary. I also extracted the XY values for each randomly selected point using the Features toolset in ArcToolbox.
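The nearest-distance step can be illustrated in code; here is a minimal Python sketch of point-to-polygon-boundary distance (the actual values came from ArcMap's join and relate tools), with a toy square standing in for the lake polygon.

```python
import math

def seg_dist(px, py, ax, ay, bx, by):
    """Distance from point (px, py) to the segment (ax, ay)-(bx, by)."""
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    t = 0.0 if denom == 0 else max(0.0, min(1.0,
        ((px - ax) * abx + (py - ay) * aby) / denom))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))

def dist_to_boundary(px, py, ring):
    """Nearest distance from a point to a polygon boundary, where ring is a
    closed list of (x, y) vertices (first vertex repeated at the end)."""
    return min(seg_dist(px, py, *a, *b) for a, b in zip(ring, ring[1:]))

# Toy square "lake": a point 5 m east of its edge is 5 m from the boundary.
lake = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
```

This is the same quantity (nearest distance to the lake boundary) that was attached to each random point and track fix.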

(4) Use R for visual analysis -> I exported the attribute table I built in GIS and imported it into R. Within R I used the ggplot package to plot the values of each variable of interest.

(5) After examining the environmental features around the study area, I was interested specifically in the environmental features experienced by the five hikers in my sub-sample. Thus, back in ArcGIS I used the same sub-sample of five tracks that I have been working with throughout the course to extract vegetation data, elevation data, and the distance of the track from the water shoreline (see Figure 4).

Figure 4. The tracks of the five recreationists.

Description of results obtained

Among the 200 randomly selected points, I learned about the environmental features defining the area.

Figure 5. Elevation and vegetation type of 200 randomly selected points in study area.

Figure 6. Distance to water and vegetation type of 200 randomly selected points in study area.

Figure 7. Elevation and distance to water of 200 randomly selected points in study area.

In general, coniferous woodland dominates the study area, particularly areas close to the shoreline. Additionally, the area does not have much variation in elevation. Of the 200 randomly selected data points, the range in elevation is 400 meters.

For the next step, I was curious to explore how the hikers’ actual movement path related to these environmental variables, particularly vegetation. I was curious to see if people spent more time close to shore because of the water feature (my hypothesis), or perhaps because of the surrounding vegetation? Or are people spending less time near the shoreline because of these environmental features?

The figures below represent the environmental features extracted from the movement paths of a sub-sample of five hikers:

Figure 8. Distance to water and vegetation type of the movement path of five recreationists.

Figure 9. Elevation and vegetation type of the movement path of five recreationists.

Figure 10. Elevation and distance to water of the movement path of five recreationists.

The next set of figures includes time as a new variable of interest. These figures provide me with an initial visual to better understand the relationship between the length of time people spend in the area and the vegetation type they are hiking in.

Figure 11. Track 1 – the relationship between time spent in area and distance from water. Colors indicate the vegetation type.

 Figure 12. Track 1 – the relationship between the latitude coordinates of the recreationist’s movement and time spent in area. Colors indicate the vegetation type.

Figure 13. Track 1 – The path of the recreationist. Colors indicate the vegetation type.

As you may notice, the predominant vegetation type is conifer woodland. I wonder if the presence of conifer influences the amount of time people spend near water. In other words, does the conifer woodland influence a person to keep hiking until they reach a more open area (meadow, shrubland, etc.)? Examining the relationship between the presence of conifer and time spent near water is important because my initial hypothesis is that people will spend more time near water and less time further away from it; perhaps the presence of conifer is playing a role in this relationship. In exercise 3 I will examine whether people are spending significantly less time in conifer despite being near water.

To see output for all tracks:

GEOG566_Track1

GEOG566_Track2

GEOG566_Track3

GEOG566_Track4

GEOG566_Track5

GEOG566_OverallSummaries

Critique of the method:

Overall, the tools available in ArcMap and R provided the functionality for me to explore and examine the relationships among the movement path of the recreationist and environmental factors. I was able to successfully extract the attributes I needed from ArcGIS and then use R to manipulate, calculate, and visualize the data. It was helpful showing Julia and Laura the initial figures, which guided me into exploring the relationships among conifer presence, distance to water, and the time the recreationist spent in the area. So, if anything, my critique is the learning curve I went through to wrap my head around these new concepts! Overall, this was both a fun and enlightening exercise.

May 23, 2018

Examining time-series of downscaled alpine air temperature and snowfall across different climate classes in Alaska in comparison to major climate indices.

Filed under: 2017,Exercise/Tutorial 2 2018 @ 11:52 am

Question asked

Do large scale patterns of climate, e.g. the Pacific Decadal Oscillation, explain long-term variability in alpine snowfall and air temperature in Alaska? Is their effect spatially variable across the state?

Data / Tool / Approach used

To answer this question, I used monthly and seasonally aggregated data from daily model output for 6 mountainous domains in Alaska where Dall sheep are present. The domains are situated within different climate divisions of Alaska (Bieniek et al., 2012) and are as follows:

Name | Climate Division | Abbreviation
Brooks Range | Northeast Interior | BRKS
Gates of the Arctic | North Slope | GAAR
Denali | Southeast Interior | DENA
Lake Clark | Bristol Bay | LACL
Wrangell St Elias | Southeast Interior | WRST
Yukon Charley | Southeast Interior | YUCH

These domains are connected to summer surveys of sheep populations that I will use to examine relationships between summer recruitment and different metrics characterizing the seasonal snowpack in Exercise 3.

Fig.2; BRKS SnowModel domain showing base DEM (black/white), vegetation layer (mixed colours), sheep survey area (pale white), MERRA2 forcing data, and snow course and SNOTEL stations for assimilation

I first ran autocorrelations on the monthly and seasonal snow data to detect whether there were any significant instances of repeating patterns that might describe a larger scale process at play. Following this, I cross-correlated the monthly and seasonal snow data with indices of larger scale climate patterns that might affect Alaska. These indices were downloaded from https://www.esrl.noaa.gov/psd/data/climateindices/list/ in monthly format and include:

Name | Abbreviation
Pacific Decadal Oscillation | PDO
Arctic Oscillation | AO
East Pacific/North Pacific Oscillation | EP_NP
North Pacific Pattern | NP
West Pacific Index | WP
Pacific North American Index | PNA
North Atlantic Oscillation | NAO

Steps followed to complete analysis

SnowModel – data preparation

The first step in these analyses was to aggregate daily SnowModel data into monthly and seasonal values. MATLAB was used to do this, and the snow/climate metrics I used included: snow depth (snod), snow density (rosnow), snowfall (spre), percent forageable area (snow under 30 cm depth and 330 kg m-3 density; pc_area), and air temperature (tair). The mean of each metric was taken, except for snowfall, where the cumulative sum was used. The seasons are as follows: autumn (September, October, November); winter (December, January, February); spring (March, April, May); summer (June, July, August).
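The aggregation logic can be sketched in a few lines; this is a minimal Python analogue of the MATLAB step (variable names are illustrative), taking the mean for state variables and the cumulative sum for snowfall.

```python
from collections import defaultdict
from datetime import date

def monthly_aggregate(daily, how="mean"):
    """Group {date: value} daily model output into (year, month) bins,
    taking the mean for state variables (snod, rosnow, tair, pc_area)
    or the cumulative sum for snowfall (spre)."""
    bins = defaultdict(list)
    for d, v in daily.items():
        bins[(d.year, d.month)].append(v)
    agg = sum if how == "sum" else (lambda vs: sum(vs) / len(vs))
    return {k: agg(vs) for k, vs in sorted(bins.items())}

# Example: 2 mm of snowfall on every day of January 2016
daily_spre = {date(2016, 1, d): 2.0 for d in range(1, 32)}
```

Seasonal values follow the same pattern, binning by the season definitions above instead of by calendar month.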

These data were then visually inspected for trends or patterns via heatmaps and time series graphs.

Fig. 3; heat map of mean monthly snow depth for WRST 1980 – 2016

Climate Indices – data preparation

The climate indices are already in monthly form, so additional processing took place in R to aggregate them to seasonal values by taking their means.

Autocorrelation and Cross-correlation

The acf function in R was used to perform autocorrelations and cross-correlations on the variables after further data wrangling with functions from the tidyverse package (https://www.tidyverse.org).

Autocorrelations were conducted for each month and season from 1980 to 2016 with a maximum lag of 30 years. Cross-correlations were conducted between monthly air temperature and snowfall and each climate index from 1980 to 2016 with a maximum lag of 12 months.
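For reference, the quantity R's acf() reports can be sketched as follows; a minimal Python analogue using the standard normalization by the lag-0 autocovariance, demonstrated on a toy series with an obvious repeating pattern.

```python
import numpy as np

def acf(x, max_lag):
    """Autocorrelation at lags 0..max_lag, normalized by the lag-0
    autocovariance (the convention R's acf() uses)."""
    x = np.asarray(x, float) - np.mean(x)
    c0 = float(np.dot(x, x))
    return [float(np.dot(x[k:], x[:len(x) - k]) / c0)
            for k in range(max_lag + 1)]

# A series that flips sign each step is negatively correlated at lag 1
# and positively correlated at lag 2:
r = acf([1.0, -1.0] * 20, max_lag=2)
```

Significant spikes in this function at yearly lags are what the monthly and seasonal autocorrelation figures above are screening for.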

Brief description of results obtained

Autocorrelation

Domain | Monthly autocorrelation figure | Seasonal autocorrelation figure | Significant correlations
Brooks | BRKS_monthly_figure | BRKS_seasonal_figure | Positive autocorrelation with winter snowfall at a 12-year lag
Denali | DENA_monthly_figure | DENA_seasonal_figure | Positive autocorrelation with summer snow depth at a 1-year lag
Gates of the Arctic | GAAR_monthly_figure | GAAR_seasonal_figure | Negative autocorrelation with spring snowfall at a 7-year lag
Lake Clark | LACL_monthly_figure | LACL_seasonal_figure | Negative autocorrelation with autumn snowfall at a 1-year lag
Wrangell St Elias | WRST_monthly_figure | WRST_seasonal_figure | No significant autocorrelations
Yukon Charley | YUCH_monthly_figure | YUCH_seasonal_figure | No significant autocorrelations

Cross-correlation

Domain | Climate Division | Monthly cross-correlation figure | Significant correlations
Brooks | Northeast Interior | BRKS_ccf_fig | Snowfall: strong positive correlation with AO at 0-month lag. Air T: PDO dominant in the seasonal cycle with a maximum lag of 2 months.
Denali | Southeast Interior | DENA_ccf_fig | Snowfall: positive correlation with AO, negative correlation with WP and PNA at 0-month lag. Air T: PDO dominant as for BRKS; PNA positive and significant at lag 0.
Gates of the Arctic | North Slope | GAAR_ccf_fig | Snowfall: positive with AO at lag 0. Air T: PDO dominant as per BRKS; NAO negative and significant at lag 0.
Lake Clark | Bristol Bay | LACL_ccf_fig | Snowfall: strong positive correlation with WP and NP at lag 0. Air T: PDO dominant as per BRKS; AO negative and significant at lag 0.
Wrangell St Elias | Southeast Interior | WRST_ccf_fig | Snowfall: PNA negative and significant at lag 0. Air T: PDO as per BRKS; NAO negative and significant at lag 0.
Yukon Charley | Southeast Interior | YUCH_ccf_fig | Snowfall: PNA negative and significant at lag 0. Air T: PDO as per BRKS; NAO negative and significant at lag 0.

Critique of method

The method employed above has been useful in showing that the different ranges of Dall sheep in Alaska are subject to different larger-scale climate patterns. In the case of Denali, it also shows that these relationships might not be uniform within Bieniek et al.’s (2012) climate divisions. An issue with the approach is that it does not include snow metrics that relate directly to the conditions that affect sheep. Snowfall is likely important, but it isn’t necessarily a reliable guide to how much snow is found on the ground, as wind redistribution and sublimation are key processes governing snowpack evolution in these environments. Carrying this work forward, it would be interesting to cross-correlate other indices such as change in snow depth or total sublimation. An issue arises here when using cross-correlation, as snowpack evolution is strongly autocorrelated throughout the season, so alternative approaches need to be explored. Likewise, this analysis cannot identify the importance of the cumulative effect of persistent climate-index patterns on snowpacks at different stages of the snow season.

May 21, 2018

Exploring landcover class and elevation as potential explanatory variables for movement

Filed under: Exercise/Tutorial 2 2018 @ 9:30 pm

Question Asked

For Exercise 2, I sought to explore the potential relationship between two independent, terrestrial variables of the Glacier Bay National Park and Preserve Wilderness landscape that may have explanatory power for the dependent variables of step length and turning angle generated through use of the moveHMM tool in Exercise 1. Of the publicly available geospatial data layers from the National Park Service (NPS) data clearinghouse (irma.nps.gov), two stood out as having the potential to influence kayaker movements: 1) a land cover polygon dataset, and 2) a digital elevation model raster dataset. These two variables were selected for the following potential impacts on kayaker behavior. The items below are not a comprehensive list, but rather examples of the types of impacts the landscape could have on the movement of recreationists (in this case, kayakers).

  • The type of land cover near the shore may cause kayakers to move more slowly (decreased step length) and engage in searching behaviors (greater turning angles). This could be caused by the desire to look for wildlife along the coastline while paddling, the desire to paddle slowly by a glacier that reaches the shoreline, and/or the need to look for a camping or stopping location on shore. These searching/stopping behaviors would result in decreased step lengths and increased turning angle movements. These behaviors are expected of the overnight kayakers in this study because wildlife and glacier viewing is a recreation activity of interest in Glacier Bay and, due to the multi-day nature of the trip, kayakers will necessarily need to come ashore at some time during their trip.
  • The elevation of the terrestrial surroundings also has the potential to cause kayakers to move faster (larger step lengths) and engage in behavior more characteristic of migration/continuous movement (smaller turning angles). This could be caused by the desire to paddle quickly past areas with higher elevation in which the kayaker cannot go ashore due to the lack of access or cannot see wildlife, glaciers, or other landscape level features due to obstructed views caused by large differences between kayaker elevation (at sea level) and terrestrial elevation.

Prior to engaging in analysis to understand the potential impact of land cover type and elevation on step length and turning movement (Exercise 3 Spoiler Alert!), I wanted to explore potential relationships between elevation and the landcover data. Understanding the pattern and extent of any relationship between elevation and landcover class will be helpful in understanding the nature of the relationship between landscape level factors (landcover type and elevation) and movement behavior (step length and turning angle) in subsequent analyses. Therefore, this exercise highlights exploratory analyses.

Additionally, I sought to resolve issues of mismatched spatial resolution between point, polygon, and raster data types to make associations between the point data and the polygon and raster data types. While I did not have a formal question for this part of Exercise 2, practically speaking, I wanted to find a mechanism for assigning values of spatially co-occurring rasters and polygons to individual points.

Tool/Approach Used

I worked primarily in ArcMap 10.3 to explore and link the independent variables to point data, and subsequently used R to generate box and whisker plots to look at the relationships between landcover type and elevation. In ArcMap, I used a data-driven approach to identify the “area of landscape influence” for kayakers, clipped my input landscape datasets to the identified area of influence, generated a random sample of points across the landscape, and used spatial joining techniques for raster and vector data to link landscape level data to the generated random sample. I then generated a figure for the random sample of points in R, providing a visualization of the relationship between elevation (Y axis) and landcover class (X axis).

Description of Steps Used to Complete the Analysis

  • Data Access: Data were accessed from the publicly available National Park Service data clearinghouse available at the following link irma.nps.gov. Downloaded data were added into an existing base map (created for Exercise 1) in ArcMap 10.3 that contained the analysis area (Glacier Bay National Park Wilderness) and the aggregated point shapefile of visitor tracks.

  • Defining the “Area of Landscape Influence”: To define the “area of landscape influence” on kayaker travel patterns, the point shapefile was converted to a line shapefile using the Points to Line (Data Management) tool in ArcMap. The Buffer (Analysis) tool was used to create a 10-kilometer buffer around the input line shapefile, thereby creating a polygon that represented 10 kilometers in the line of sight of recreating kayakers. The 10-kilometer distance was selected after some initial testing of the sensitivity of the buffer size on the landscape: I wanted the buffer to be sufficiently large to capture the landscape a kayaker would be able to see, while also being smaller than the overall Wilderness area extent. The generated buffer was then further reduced using the Clip (Analysis) tool to include only terrestrial land area in the Wilderness — this step removed all water within the park and land area outside of the park, creating a land-only analysis area. Converting the raw point data to a single line shapefile was necessary to provide a continuous input into the Buffer tool, so that the Buffer output would be a single analysis polygon rather than individual polygons around each point. The images below show the area of influence (and therefore analysis) relative to the overall landscape for landcover and elevation.

                 

  • Generating Random Points for Analysis: To explore potential relationships between landcover and elevation outside of the visual exploration available through looking at the displayed layers in ArcMap, 200 points were randomly generated in the analysis area to represent random sample locations where landcover and elevation data could be extracted and explored in plots in R. The Create Random Points (Data Management) tool was used to generate a shapefile containing 200 randomly distributed points within the analysis area.

  • Spatially Join Layers to Points: The landcover vector data and the elevation raster data were spatially joined to the randomly generated points using the Joins and Relates option to join the point and polygon (landcover) data together and the Extract Values to Points (Spatial Analyst) tool to join the point and raster data together. Essentially, both tools complete the same task but each is designed for the input data type, either raster or vector data, that is being used for the join. In both instances, the land cover and elevation attribute information for the polygon or raster cell in which the point fell was joined to the point attribute table.
  • Export for use in R: The attribute table for the shapefile of 200 random sample points was exported from ArcMap as an Excel file using the Table to Excel (Conversion) tool.
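For readers without ArcMap, the same workflow can be sketched in R with the sf and terra packages. Everything below is a simulated stand-in for the NPS layers (DEM, landcover, and track points), so the extents, class names, and values are illustrative only, not the actual Glacier Bay data:

```r
# Sketch of the workflow in R: buffer the track, clip to the study area,
# sample random points, and join raster and polygon attributes to them.
library(sf)
library(terra)

set.seed(42)

# Stand-in DEM over a 5 x 5 km toy extent
dem <- rast(nrows = 50, ncols = 50, xmin = 0, xmax = 5000,
            ymin = 0, ymax = 5000, crs = "EPSG:32608")
values(dem) <- runif(ncell(dem), 0, 1500)          # fake elevations, meters

# Stand-in landcover polygons (two classes tiling the area)
landcover <- st_sf(
  class = c("vegetation", "rock/bare"),
  geometry = st_as_sfc(c(
    "POLYGON ((0 0, 2500 0, 2500 5000, 0 5000, 0 0))",
    "POLYGON ((2500 0, 5000 0, 5000 5000, 2500 5000, 2500 0))"),
    crs = 32608))

# Points to Line + Buffer equivalents (1 km buffer to fit the toy extent)
track <- st_as_sf(data.frame(x = c(500, 1500, 2500, 3500),
                             y = c(1000, 1200, 900, 1100)),
                  coords = c("x", "y"), crs = 32608)
influence <- st_buffer(st_cast(st_combine(track), "LINESTRING"), 1000)

# Clip equivalent: intersect the buffer with the study area
study_area <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 5000,
                                  ymax = 5000), crs = st_crs(32608)))
analysis_area <- st_intersection(influence, study_area)

# Create Random Points equivalent: 200 points inside the analysis area
pts <- st_sample(analysis_area, size = 200, exact = TRUE)

# Extract Values to Points equivalent (raster -> point)
elev <- terra::extract(dem, vect(pts))[, 2]

# Joins and Relates equivalent (polygon -> point)
sampled <- st_join(st_sf(elev = elev, geometry = pts), landcover)
```

The resulting `sampled` object plays the role of the exported attribute table: one row per random point with its elevation and landcover class.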

Description of Results Obtained

I produced two plots to explore the potential relationship between the landcover class at a given point and the elevation of the digital elevation model raster cell at that point. The first plot treated each landcover class independently, with 29 different categories of landscape features including various types of vegetation, water features (snow, ice, ponds, etc.), and rock and bare ground. The plot below is what resulted; it was difficult to read and interpret due to the number of categories and the limited observations per category for anything outside of rock/bare ground (Category 46) and snow and ice (Category 99).

Because this initial output had too many categories with too few observations per category to really make any meaningful conclusions about the relationship between the variables, a simplified boxplot was created including five vegetation categories, one category for rock/bare ground, and one category for ice, snow, or other terrestrial water features. The categories are as follows:

  1. Sitka Spruce, Hemlock, or Mixed Sitka Spruce Forest/Woodland
  2. Black Cottonwood Forest/Woodland
  3. Tall Alder Shrub
  4. Low or Dwarf Shrub
  5. Willow or Mixed Willow Shrub
  6. Rock/Bare Ground
  7. Snow, Ice, or Other Terrestrial Water Feature
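A condensed boxplot of this kind is straightforward to produce once the attribute table is in R. This sketch uses simulated elevations and the seven category labels above (the column names are hypothetical, not those of the exported table):

```r
# Sketch: elevation-by-landcover boxplot with the seven condensed classes.
# Data are simulated stand-ins for the exported 200-point attribute table.
set.seed(7)
classes <- c("Sitka Spruce", "Cottonwood", "Tall Alder", "Low/Dwarf Shrub",
             "Willow", "Rock/Bare", "Snow/Ice")
dat <- data.frame(
  landcover = factor(sample(classes, 200, replace = TRUE), levels = classes),
  elev_m    = runif(200, 0, 1500)
)
b <- boxplot(elev_m ~ landcover, data = dat, las = 2,
             ylab = "Elevation (m)", main = "Elevation by landcover class")
```

In practice the condensing step is just a recode of the 29 detailed classes into these seven levels before calling boxplot().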

The simplified box plot shows that there does appear to be a relationship between landcover class and elevation, with landcover classes comprised of vegetation tending to be located at lower elevations and landcover classes comprised of rock/bare ground and snow, ice, or other terrestrial water features tending to be located at higher elevations. The median elevation values for Sitka Spruce, Black Cottonwood, and Willow land cover classes were comparable and tended to be right around 100 meters with little variation while the other two vegetation classes (Tall Alder Shrub and Low or Dwarf Shrub) had median elevation values near 300 and 500 meters respectively, with a more even spread of the distribution of points in the upper and lower quartiles around the median. Finally, the rock/bare ground and snow/ice categories have median values of 600 meters and about 1000 meters respectively, with the greatest amount of spread around the median values. The results demonstrate expected relationships, with vegetation land cover types being constrained to specific elevations while rock/bare ground and snow/ice occur less discriminately across the landscape, but overall at higher elevations than vegetative cover.

Critique of Method

Overall, the progression of tools in ArcMap was straightforward, and each tool performed the data formatting or joining function as anticipated. Stringing together the correct tools in ArcMap took some time, but I was impressed that the workflow I wanted to accomplish could be performed in ArcMap using standalone tools. Specifically, prior to this exercise, I had never used the Create Random Points or Extract Values to Points tools and was initially skeptical that the functionality existed in ArcMap. I was pleasantly surprised!

One limitation I encountered is the format of some of the other landscape-level data files that I had planned to use in this analysis. Originally, my Exercise 2 analysis was going to encompass potential independent variables from both land and sea. I had planned to perform the analysis presented for the terrestrial data on bathymetry, bathymetry slope, and bathymetry aspect data downloaded from the irma.nps.gov data clearinghouse. Unfortunately, the bathymetry layers are only available for download as .lpk files, which are layer packages that are not stand-alone layers, but rather rely on underlying data from ESRI or other sources. Upon download, these three raster files do not have accessible attribute tables. I think the source of this is that the pixel type in the raster files is set to “floating point” rather than integer; the integer pixel type is required to manipulate the rasters using the workflow presented above. Due to time constraints, I abandoned working with the bathymetry data for Exercise 2, but I am still interested in it and will need to figure out a workaround for this limitation.

Finally, the resolution of the landcover class data was both a blessing and a curse. The landcover class dataset is incredibly detailed, with primary and secondary classes identified, and vegetation classes broken out into separate land cover types depending on canopy structure (open or closed) and the homogeneity/heterogeneity of the dominant plant community in the landcover polygon. This level of detail was a bit over my head, and therefore I needed to reduce the number of categories to make sense of the relationships present. The data layer did not come with detailed metadata; it would have been helpful to have more information on how the data were collected and how categories were defined and identified, to make more informed judgements on how land cover categories could be condensed.

How does cross-correlation of GPP and LUE with soil water content differ among years and photosynthetic functional types?

Filed under: 2018,Exercise/Tutorial 2 2018 @ 11:57 am

1. Question asked:

How is soil water content related to gross primary production (GPP) and the light use efficiency of photosynthesis (LUE) of a natural grassland? How do these relationships differ between C3- and C4-dominated grasslands?

2. Tool / approach used:

To answer this question, I calculated the cross-correlation of GPP and LUE with soil water content (SWC). I then compared the maximum cross-correlations, as well as the lags at which the maximum cross-correlation occurs.

My data are derived from eddy covariance flux tower locations in natural grasslands in Eastern Kansas. Soil water content, like other eddy covariance flux data, is recorded at 30-minute intervals. Here, I have aggregated flux and environmental data to daily sums and averages.

3. Steps followed to complete analysis

Data prep and selection

– Because SWC data are most consistently available at both sites between 2009 and 2014, I limited my analysis to data collected in those years, and further restricted it to 2010-2013 due to gaps in data availability.

– I also initially examined the water use efficiency (WUE) of production, but WUE, calculated as GPP / evapotranspiration (ET), has distinct dynamics from GPP or LUE and does not appear to be strongly cross-correlated with soil water content, so I have omitted it from this analysis.

First, I visually inspected the annual timecourse and cross-correlation of GPP and LUE with soil water content (SWC). Soil water content is recorded as a decimal (0-1) representing a percentage. Here, SWC is multiplied by 10 so that it can be visualized on a similar scale to GPP and LUE. These plots display daily values (light shading) and monthly averages of daily GPP, LUE, and soil water content.

We can clearly see distinct patterns at each site, in each year, and for each index.

For GPP, the annual patterns appear relatively similar between KFS and Konza. During 2012, the drought year, Konza appears to have a more substantial reduction in GPP than does KFS.

There appear to be double peaks of GPP in 2012 at the KFS site, and in 2013 at both KFS and Konza.

For both GPP and LUE, we can see a peak in SWC early in the year that is followed later by a peak in GPP or LUE. This pattern is most evident for LUE at Konza, the C4 site.

Analysis

I used the ccf() function to calculate cross-correlation between GPP and SWC, and between LUE and SWC. First, I calculated the cross-correlation of GPP and LUE, with soil water content, at each site separately, for all years from 2010-2013.

However, I’m most interested in comparing patterns of cross-correlation among years, particularly with regard to the timing of peak cross-correlation in 2012, a drought year. Does the drought year have a pattern of cross-correlation between GPP, LUE, and soil water content that is distinct from that of a non-drought year?

I then used the ccf() function to calculate the cross-correlation of GPP and LUE with soil water content at each site, for each year from 2010-2013, in order to assess interannual variability in the cross-correlation of GPP and LUE with soil water content. To visualize trends both between sites and between years, I plotted each site separately by year, as well as each year by site.

Lastly, to better visualize the shift in timing and magnitude of cross-correlation at each site, I extracted the date and value of the maximum and minimum cross-correlation and plotted these values for each year.
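The per-year analysis can be sketched as follows. GPP and SWC here are simulated (GPP is built to trail SWC by about 100 days, mimicking the pattern described below), so only the mechanics of the ccf() call and the peak extraction carry over to the real data:

```r
# Sketch: cross-correlate daily GPP with SWC and extract the peak.
# Both series are simulated stand-ins for one site-year of flux data.
set.seed(3)
days <- 365
swc <- as.numeric(arima.sim(list(ar = 0.9), days))          # fake daily SWC
gpp <- c(rep(0, 100), swc[1:(days - 100)]) + rnorm(days, sd = 0.2)

# ccf(x, y) estimates cor(x[t + k], y[t]); a positive peak lag means
# x (GPP) trails y (SWC)
cc <- ccf(gpp, swc, lag.max = 150, plot = FALSE)

peak_lag <- cc$lag[which.max(cc$acf)]   # lag (days) of maximum cross-correlation
peak_val <- max(cc$acf)                 # value of maximum cross-correlation
```

For the real data, the same call would simply be repeated for each site-year subset, e.g. with lapply() over a list of yearly data frames.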

4. Results Obtained

Comparing sites, for all years

The dashed lines represent confidence intervals for a significant cross-correlation at each site. The vertical line at Lag days = 100 marks a visually identified peak in cross-correlation, and is provided for reference when comparing cross-correlation among indices.

We see somewhat distinct patterns between sites in the cross-correlation of GPP and LUE with soil water content. Both sites have a peak correlation of GPP and LUE with soil water content at around 100 days. This indicates that GPP and LUE tend to be well-predicted by the soil water content of about 100 days earlier– a peak in GPP or LUE follows a peak in SWC by about 100 days. If we look back at our time series of these indices, we can visually detect this pattern.

The minimum in cross-correlation at about -25 days indicates that roughly 25 days before SWC peaks, GPP tends to be low. This appears to reflect the seasonality of plant production relative to precipitation: SWC tends to peak in the spring, before plant production ramps up in earnest, so there is likely little production occurring 25 days before that peak.

For GPP: both sites have similar timing, but distinct magnitude of cross-correlation of GPP with SWC. Konza, the C4 site, is less strongly cross-correlated with soil water content, suggesting less dependence on soil water content than the C3 site.

In contrast, for LUE: both sites have similar magnitude, but more distinct timing in the cross-correlation of LUE with SWC across years. We see distinct minima as well.

Assessing variation between years, for each site

At both sites, we see interannual variation in the timing and magnitude of the cross-correlation of GPP and LUE with soil water content. GPP exhibits more consistent patterns of timing and magnitude of cross-correlation between years than does LUE. Something strange is clearly happening at Konza in 2013, which may be due to bad data, or a breakdown of the relationship between GPP and SWC. This is unexpected because 2013 was not a known drought year.

At both sites, LUE follows similar patterns of interannual variation to GPP– which makes sense, because LUE is derived from GPP. But the patterns of cross-correlation of LUE are perhaps more distinct than those of GPP between C3 and C4 sites.

Assessing variation between sites, for each year

Next, I wanted to examine how patterns of cross-correlation differed between sites, in each year.
This figure shows the cross-correlation of GPP with soil water content for each year from 2010-2013, plotted separately. Both sites show interannual variation in the timing and magnitude of their cross-correlation of GPP with soil water content.

For GPP, when looking at these patterns between years, we see clear interannual variation in the timing and magnitude of cross-correlation at both sites. However, excluding 2013, both sites exhibit relatively similar patterns in each year in the timing and magnitude of the cross-correlation of GPP and SWC. They are particularly well-aligned in 2011. This indicates that despite the distinct resource-use efficiencies at each site, they are responding to similar environmental cues in the timing of soil water content.

For LUE, the sites exhibit more distinct patterns in the timing and magnitude of cross-correlation with soil water content. There is similarly evident interannual variation at each site.

When compared between sites in each year, it is further evident that while the timing of the cross-correlation of LUE is similar between sites, the magnitude differs. Konza, the C4 site, has a higher cross-correlation of LUE with soil water content in both 2010 and 2012. We know that C4 grasses are more strongly controlled by seasonal precipitation than C3 grasses, and this stronger cross-correlation supports that ecophysiological distinction.

Assessing changes in the timing of maxima and minima, over each year

I have cropped the view window of these plots to exclude the 2013 values for GPP. The 2013 values of maximum cross-correlation for GPP with SWC occur at -100 days, suggesting a switch in the dependence of GPP on SWC, such that GPP leads soil water content, rather than the reverse. However, the uneven time series in 2013 means that I’m not confident in these values, so I have cut off the graph where they occur.

For GPP, the timing of maximum cross-correlation changed, but followed similar patterns for KFS and Konza.

For LUE, the changes in the timing of maximum cross-correlation between LUE and SWC are slightly more distinct between sites, but still follow similar patterns.

This suggests that the patterns of cross-correlation of GPP and SWC are similar despite the distinct photosynthetic functional types. Though we think of C4 grasses as being more water-use efficient, and also more strongly controlled by seasonal precipitation than C3 grasses, in this area the patterns of production with water appear to be similar. The same is true for LUE.

In contrast, the timing of the minimum cross-correlation value exhibits more distinct patterns among sites. This represents the timing of the annual minimum GPP value relative to soil water content. It is possible that this is more closely related to production values from the previous year, which are not part of this analysis: the data are grouped by site and by year, so previous-year values are not considered.

5. Critique of the method

Particularly when comparing patterns of cross-correlation between years, it is useful to extract the maximum value and date, to visualize the patterns across years. Reducing the dataset to one point is valuable for interpreting the metric that I care about most in cross-correlation– when the maximum occurs.

However, this last graphic in particular trades complexity for simplicity. In particular, it does not capture whether the shape of the relationship changes — i.e., whether the cross-correlation function peaks for a longer or shorter duration. Further, using a simple maximum to identify a peak is somewhat simplistic. A more robust approach would be to fit a curve to the data and extract the peak from the fitted curve; a curve-fitting function might better capture the dynamics of the cross-correlation than a single maximum that could be an artifact.

We also see that the efficacy of this method is highly dependent on the quality of the data. In 2013, where GPP at Konza has multiple peaks and substantial variation, the cross-correlation function returns results that are not easily interpretable. Similarly, I am unable to use this method in years that do not have data of consistent quality available.

This method could be further improved by developing a method of considering previous-year values when looking at annual patterns of cross-correlation– perhaps by appending previous year values to an annual dataset, or finding a different way to create annual groups that still contain previous-year data to assess long-term lag correlations.

May 20, 2018

Subtidal bathymetry slope characteristics at increasing spatial scales around long-term monitoring sites

Filed under: 2018,Exercise/Tutorial 2 2018 @ 11:13 am

Question

 I derived statistics from nearshore subtidal bathymetry (my hypothesized explanatory variable), including slope angle: (1) minimum, (2) maximum, (3) mean, and (4) standard deviation. These four statistics were calculated at increasing spatial scales around the long-term subtidal monitoring sites at San Nicolas Island, using buffers around the 10x2m2 transects at: 2m, 5m, 10m, 25m, and 50m. My overarching objective was to quantify the relative levels of physical substrate heterogeneity, with an eye towards correlating these statistics with metrics from the time series for exercise #3, allowing me to relate variation in bathymetry to the variable frequencies of community shift exhibited over the decades in the subtidal at San Nicolas Is.

Approach

 I first plotted GPS coordinates of the long-term sites onto the side-scan sonar bathymetry. These GPS points mark the 0m and 50m start and end of the mainline, off of which our 10x2m2 transects run. Some slight observer error in GPS point accuracy is expected due to current, surge, kelp, etc., and if present, is likely under 5m. Given the spatial scale of this inquiry is at the site level (i.e., out to 50m), and not at fine scale (i.e., 1m), this minor deviation is not expected to significantly hamper results.

      

     

Fig. 1: Two low-relief sites (upper two), a high-relief site (bottom left), and a low/moderate-relief site with surrounding high-relief structure, i.e., high heterogeneity (bottom right).

I plotted the 50m mainline with shapefiles, and likewise plotted the 10x2m2 transects of interest (using the distance measures in ArcGIS to ensure my transects were plotted at the proper distance down the mainline). I created buffers around the transects at 2m, 5m, 10m, 25m, 50m.

  

  

Fig. 2: 2m, 5m, 10m, 25m, and 50m buffers for the four sites plotted in Fig. 1.

I then used the ‘Mask Grab’ tool to extract values of the bathymetry underlying each buffered layer for all sites, and recorded the summary statistics: slope angle minimum, maximum, mean, and standard deviation. These values were entered into Excel, and subsequently plotted in R.
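The buffer-statistics step can be sketched in R with sf and terra as an alternative to the 'Mask Grab' tool. The slope raster and site location below are simulated stand-ins for the side-scan bathymetry and transect midpoints, so the values are illustrative only:

```r
# Sketch: min / max / mean / sd of slope angle within nested buffers
# around a site (stand-in data; the real analysis used side-scan
# bathymetry and the 'Mask Grab' tool).
library(sf)
library(terra)

set.seed(11)
slope <- rast(nrows = 100, ncols = 100, xmin = 0, xmax = 100,
              ymin = 0, ymax = 100, crs = "EPSG:32611")
values(slope) <- runif(ncell(slope), 0, 60)          # fake slope angles, degrees

site  <- st_sfc(st_point(c(50, 50)), crs = 32611)    # transect midpoint
radii <- c(2, 5, 10, 25, 50)

# One row of summary statistics per buffer radius
stats <- t(sapply(radii, function(r) {
  vals <- terra::extract(slope, vect(st_buffer(site, r)))[, 2]
  c(radius = r, min = min(vals), max = max(vals),
    mean = mean(vals), sd = sd(vals))
}))
```

Because the buffers are nested, the maximum can only grow (and the minimum shrink) as the radius increases, which matches the pattern described for Daytona below.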

Results  

     

 Minimum slope angle: unsurprisingly, at least one very low-relief 2m cell was present at all sites (as I’d expect from working at the sites), so the minimum provides little useful information, other than conveying the sensitivity of the side-scan bathymetry (i.e., values less than one were common).

Maximum slope angle: this statistic appears to have successfully captured broad differences between sites that are qualitatively low-relief (i.e., fairly flat, or low angle), and high-relief (i.e., large structure, and higher angles). In particular, the maximum at Daytona is low at the spatial scale of the actual monitoring site (2m – 10m), but then rapidly increases as the surrounding reef structure is encountered by the larger buffer layers. This captures the high level of substrate heterogeneity found at Daytona, a ‘low-relief’ site for the actual monitoring transects, but unique in the high level of surrounding high-relief structure. In contrast, the other two high-relief sites (East Dutch and West Dutch) exhibit more homogeneous high relief structure.

Mean slope angle: the mean, max, and standard deviation all captured the lack of heterogeneity at the low-relief sites. At West Dutch the mean drops off, reflecting the relatively small reef: once the buffer scale increases (e.g., 25m, 50m), a majority of the substrate is sand.

Standard deviation: these values tended not to shift much across scales, though large differences were observed among sites.

Critique of the method

 I found this to be a straightforward method, though in hindsight, I plan to modify these methods slightly for subsequent analyses. For example, my larger spatial scales have large gaps (e.g., 10m, 25m, 50m), and to gain better insight into the actual scale at which heterogeneity shifts (or not), I need finer spatial resolution.

Secondly, instead of lumping these buffered layers together (i.e., include the previous 10m in the buffer of the 25m), I should have constructed ‘donut’ layers. This would provide a more explicit representation of how the increased spatial scale behaves in comparison to the previous scales. I will use this approach going forward, and with greater spatial resolution, as mentioned above.
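The 'donut' idea can be sketched with sf::st_difference. The site coordinates and radii below are illustrative stand-ins, not the San Nicolas transect locations:

```r
# Sketch: 'donut' (annulus) buffers, so each spatial scale excludes
# the scales nested inside it (coordinates and radii are stand-ins).
library(sf)

site  <- st_sfc(st_point(c(0, 0)), crs = 32611)
radii <- c(2, 5, 10, 25, 50)
bufs  <- lapply(radii, function(r) st_buffer(site, r))

# Each ring = this buffer minus the next-smaller buffer; the innermost
# 'ring' is just the smallest buffer itself
rings <- c(bufs[1], Map(st_difference, bufs[-1], bufs[-length(bufs)]))
areas <- sapply(rings, function(g) as.numeric(st_area(g)))
```

Summary statistics extracted within each ring then describe each scale independently, rather than cumulatively.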

PDF links for full graphs / individual sites:

Mask_Grab_IndSites

Mask_grab_all

18_NMDS_Transect

Tracks_N2D

ALL_7_TS

May 15, 2018

Determining Fire-Climate Relationships with Superposed Epoch Analysis

Filed under: Exercise/Tutorial 2 2018 @ 12:14 pm

Background

My overall goal is to build a model that predicts the probability of fire at a point in each year from 1700-1918 based on climate, fuel (time since fire), local environment (annual temperature and precipitation, elevation, forest type, etc), and surrounding environment.

Prior to constructing this model I need to understand how annual climate is related to fire occurrence and whether that relationship changes with fire size. This exercise should also indicate whether fuel abundance and connectivity required for fire spread is related to antecedent climate.

Questions asked

  1. How are fire events related to climate during the fire year and antecedent years?
  2. Does this relationship vary among small, medium, and large fires?

Tool and Approach

Superposed Epoch Analysis (SEA) is a non-parametric statistical tool that uses Monte Carlo simulations to identify relationships, including non-linear ones, between a time series and key events, across the time period preceding and following each event. Because statistical significance is determined from null distributions generated by a Monte Carlo randomization procedure, SEA does not require the assumptions of parametric tests. The user must specify the key, the time series, and the analysis window.

The key is the event of interest. It could be a rapid increase in a kelp population, a volcanic eruption, or an increase in a pollutant; in my case it is fire years. I considered all fire years where fire was recorded at >2 contiguous points, but broke these fires into three groups based on fire size.

  • Small: <10,000 ha (39 fire years)
  • Large: 10,000–40,000 ha (33 fire years)
  • Extensive: >40,000 ha (7 fire years: 1741, 1783, 1795, 1800, 1822, 1829, 1918)

The time series may be a climate index, temperature record, growth record, etc. I used the Palmer Drought Severity Index (PDSI; Palmer, 1965) to represent drought. PDSI is a measurement of dryness based on recent precipitation and temperature, but it is only available for the instrumental record (~1900 to present). Cook et al. (2004) used tree rings to provide a gridded network of reconstructed PDSI for North America. I downloaded reconstructed data from grid point 45 (120.0°W, 42.5°N), which is just east of my study landscape.

https://iridl.ldeo.columbia.edu/SOURCES/.LDEO/.TRL/.NADA2004/.pdsi-atlas.html

The window is the temporal period surrounding the key event that SEA examines. The window I specified tested whether mean annual PDSI departures were statistically significant in the fire year, in the five years before, and in the four years after the fire.

SEA can be conducted using the dplR package in R:

https://cran.r-project.org/web/packages/dplR/dplR.pdf

How SEA works

SEA creates a matrix where rows correspond to key events (fire years) and columns correspond to the time series over the specified window (Table 1). For the seven extensive fire years, the matrix is shown below. The plotted time series window is the superposed epoch (Figure 1; Prager and Hoenig 1989).

SEA calculates a mean for each column in the time series window, called the composite. A Monte Carlo randomization procedure is then used to determine the probability that each composite value would occur by chance alone. This procedure shuffles the values in the matrix (typically 10,000 times, but user-specified) and calculates the mean for each time step in the window from the shuffled data. These means are used to generate a null distribution for each column, against which the real data are tested at the 95th and 99th percentile confidence levels.
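The matrix-and-randomization mechanics described above can be sketched in base R. The PDSI values here are simulated white noise, so only the procedure is meaningful (the real analysis used the Cook et al. reconstruction and the dplR implementation):

```r
# Base-R sketch of the SEA randomization; 'pdsi' is simulated white
# noise standing in for the reconstructed PDSI series.
set.seed(5)
years  <- 1695:1923                       # padded so all lags exist
pdsi   <- setNames(rnorm(length(years)), years)
keys   <- c(1741, 1783, 1795, 1800, 1822, 1829, 1918)
window <- -5:4                            # 5 years before to 4 after

# Composite: mean PDSI at each lag across the key (fire) years
epoch <- function(event_years) {
  colMeans(sapply(window, function(lag) pdsi[as.character(event_years + lag)]))
}
composite <- epoch(keys)

# Null distribution: composites from randomly drawn pseudo-event years
null  <- replicate(1000, epoch(sample(1700:1918, length(keys))))
p_val <- rowMeans(abs(null) >= abs(composite))   # two-sided p, per lag
```

With the real reconstruction, a strongly negative composite at lag 0 with a small `p_val` would correspond to the significant fire-year drought signal reported below.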

 

Fire year   PDSI -5  PDSI -4  PDSI -3  PDSI -2  PDSI -1  PDSI 0  PDSI +1  PDSI +2  PDSI +3  PDSI +4
1741           -1.5      0.3      2.2     -3.1      1.4    -3.1      2.2      1.1     -0.9      3.1
1783           -2.2      0.3      0.7      0.5     -2.9    -5.5      2.1      2.1     -0.6      0.7
1795            2.8      1.7      2.2     -2.7     -5.2    -3.9     -0.7      0.0     -1.8      3.7
1800           -3.9     -0.7      0.0     -1.8      3.7    -5.6      3.4      2.6      3.7     -1.1
1822           -1.7      1.0      1.0      0.0      0.0    -4.3     -0.6     -2.3      4.3      2.3
1829           -2.3      4.3      2.3     -0.9     -0.6    -6.5      1.4     -1.7      4.0     -1.5
1918            2.0      1.3     -0.7      2.7     -2.1    -4.2     -0.9     -1.7      3.7     -2.3
Composite      -1.0      1.2      1.1     -0.7     -0.8    -4.7      1.0      0.0      1.8      0.7

Table 1. Time series (PDSI) matrix for the seven extensive fire years; rows are key events (fire years) and columns are lags relative to the fire year.

Figure 1. Superposed Epoch of Extensive Fire Years

What to watch out for

Outliers – SEA is vulnerable to leverage from an anomalously large or small value in the time series, especially when there are few key events. Data can be relativized or transformed to minimize the influence of outliers.

Autocorrelation – If time series are autocorrelated at the same lag windows as the SEA analysis, spurious results may be produced. The acf function in R can be used to test for autocorrelation by linear regression of each climate reconstruction year against consecutive years, and with autocorrelation function plots (Figure 2) lagged at the same window as the SEA analysis (Johnston et al. 2017). PDSI exhibited no autocorrelation (P ≥ 0.12).
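
As a quick illustration, this check can be run on any annual series with acf(); here white noise stands in for the PDSI reconstruction:

```r
# ACF check at the SEA lags (white noise stands in for the climate series)
set.seed(42)
x <- rnorm(300)
a <- acf(x, lag.max = 10, plot = FALSE)

# Approximate 95% significance bound drawn by the default ACF plot
bound <- qnorm(0.975) / sqrt(length(x))
signif_lags <- which(abs(a$acf[-1]) > bound)  # lags exceeding the bound (lag 0 dropped)
```

Any lag whose coefficient exceeds the bound would warrant caution before running SEA at that window.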

 

Figure 2. ACF plot

 

Results and their significance

Not every warm, dry year produced a large or extensive fire, but all extensive fire years occurred in severe drought years (mean PDSI = -4.7; p < 0.01). Most large fires occurred in droughts (mean PDSI = -1.29; p < 0.01). PDSI values were not lower than expected in the fire year for small fires. Antecedent years were neither drought nor pluvial climate events for small, large, or extensive fires. Years with no fire were significantly wetter than years with fire of any size (PDSI = 0.5; p < 0.01).

Figure 3. SEA results for years with no fire, a small fire, a large fire, and an extensive fire

These results suggest that climate in the years preceding a fire is not related to fuel abundance and connectivity. Cool, wet years that might have been necessary to produce the fine fuels that carry surface fires did not occur before fire years; only a hot, dry fire year itself was associated with large and extensive fire spread. This suggests that fine fuel production, with respect to fire spread, is not related to climate in this system. In contrast, in the SW United States the two years preceding extensive fire are significantly cooler and wetter than average; there, fuel production sufficient to carry fires is related to antecedent climate.

For my investigation, this result means that fuel recovery following fire is not moderated by post-fire climate, and that time since fire, regardless of climate in the years following fire, may be a strong predictor of fire occurrence.

Critique

The dplR and burnr packages both include a routine to conduct SEA. Both are designed to work with ring-width chronologies and fire history chronologies (.fhx format). The original software used to analyze tree-ring chronologies and fire histories used file formats designed to save disk space that are neither time-efficient nor user-friendly. Avoid the burnr package unless you have FHX-format data, which you are unlikely to have if you aren't a dendropyrochronologist. dplR allows you to input your own data from a spreadsheet more easily, but it requires climate data in a single-column format where the row names are years. This code can be used to create that structure from a time series (PDSI in this case):

###example code

pdsi <- read.csv("PDSI45.csv") #read in climate data

names(pdsi) # see column names

 

#create climate file, select column of interest (RECON PDSI in this case), subset years, and delete NA rows

climate <- pdsi[, c("YEAR", "RECON")]

cols <- c("year", "clim_index")

colnames(climate) <- cols

climate <- climate[(climate$year >=1700 ),]

climate <- climate[!(rowSums(is.na(climate))),]

 

#Make rownames the years and delete the year column leaving a dataframe with one column.

rownames(climate) <- climate$year

climate$year <- NULL

head(climate)

tail(climate)
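
With the climate data in this shape, the SEA itself is one call to dplR's sea() routine. The snippet below uses stand-in values in place of the reconstructed PDSI; argument names follow the dplR documentation (lag sets the window half-width, resample the number of randomizations).

```r
library(dplR)

# Stand-in one-column data frame with years as row names (same shape as above)
set.seed(8)
climate <- data.frame(clim_index = rnorm(300))
rownames(climate) <- 1700:1999

fire_years <- c(1741, 1783, 1795, 1800, 1822, 1829, 1918)
sea_out <- sea(climate, fire_years, lag = 5, resample = 1000)
sea_out   # composite departures and simulation p-values by lag
```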

 

Benefits of SEA approach

  • Does not require random sampling, normality, homogeneity of variance, or independence
  • Can be used on autocorrelated time series (climate data, recruitment data, growth data)

 

References

Cook, E. R., C. A. Woodhouse, C. M. Eakin, D. M. Meko, and D. W. Stahle (2004), Long‐term aridity changes in the western United States, Science, 306(5698), 1015–1018, doi:10.1126/science.1102586.

Johnston JD, Bailey JD, Dunn CJ, Lindsay A (2017) Historical fire-climate relationships in contrasting interior Pacific Northwest forest types. Fire Ecology 13(2).

Haurwitz MW and Brier GW (1981) A critique of the superposed epoch analysis method: its application to solar-weather relations. Monthly Weather Review 109:2074-2079

McKenzie D, Hessl A, Peterson D et al. (2004) Fire and Climatic Variability in the Inland Pacific Northwest: Integrating Science and Management. Final report to the Joint Fire Science Program on Project #01-1-6-01.

Prager MH, Hoenig JM (1989) Superposed Epoch Analysis: A Randomization Test of Environmental Effects on Recruitment with Application to Chub Mackerel. Transactions of the American Fisheries Society 118(6):608-618. DOI: 10.1577/1548-8659(1989)118<0608:SEAART>2.3.CO;2


Swetnam TW and Betancourt JL (1990) Fire-Southern Oscillation Relations in the Southwestern United States. Science Vol. 249:1017-1020.

Palmer WC (1965) Meteorological Drought. Research Paper No. 45, U.S. Department of Commerce Weather Bureau, Washington, D.C. 58 pp.

May 10, 2018

Relationship between satellite derived and nearshore ocean temperatures

Filed under: Exercise/Tutorial 2 2018 @ 1:41 pm

My goal for this tutorial was to look at the relationship between satellite-derived sea surface temperatures (SST) and measured intertidal temperature. I want to use temperature as one of my potential explanatory variables for kelp population size, but neither of these data sets matches my temporal and spatial scales. The satellite SST record extends back to 1982, which covers the entirety of my kelp population series, but it is only available at a 0.5 degree latitude/longitude scale. Conversely, the intertidal measurements taken by my lab were collected only a few miles from most of the relevant kelp beds, so they match my spatial scale better. However, these nearshore measurements are only available for a fraction of the kelp population time series.

 

Ideally, I would be able to relate the intertidal temperatures to the larger scale satellite SST so that I could infer nearshore temperatures from satellite SST and therefore utilize the extended historical extent of these satellite SST data. To begin, I plotted the two sets of temperature data for 2007-2012 by month (see Figure 1). The satellite and intertidal temperatures match closely in the winter and spring months, but diverge in the summer. Intertidal temperatures were consistently lower in the summer months than the satellite data.

Figure 1

I then looked at the autocorrelation and cross correlation of these two data sets. Satellite temperature was highly positively autocorrelated at 12-month intervals and negatively autocorrelated at 6-month intervals, all the way out to 4-5 years. Intertidal temperatures showed a similar pattern, but the autocorrelation dropped off much more steeply and fell below the significance level after about 6-12 months. Cross correlation (see Figure 2) was fairly high at a time lag of 0, around 50%. However, the cross correlation was distributed asymmetrically around time lag 0, with significant correlation at -1, -2, and -3 months on the negative axis but only at 1 month on the positive axis. This suggests that the cycle for intertidal temperature lags that of the satellite temperature by a month or two. Referencing Figure 1, this appears to be because satellite temperatures begin their summer increase a month or two before intertidal temperatures.
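
The cross correlation was computed with R's ccf() function; a minimal sketch with simulated seasonal series (stand-ins for the real monthly data) shows the idea:

```r
# ccf() sketch: the intertidal series is constructed to lag the satellite
# series by one month, mimicking the pattern described above
set.seed(7)
months <- 1:72
satellite  <- sin(2 * pi * months / 12) + rnorm(72, sd = 0.2)
intertidal <- sin(2 * pi * (months - 1) / 12) + rnorm(72, sd = 0.3)

cc <- ccf(satellite, intertidal, lag.max = 12, plot = FALSE)
best_lag <- cc$lag[which.max(cc$acf)]   # expected near -1 for this construction
```

A negative peak lag means the first series (satellite) leads the second (intertidal), matching the asymmetry seen in Figure 2.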

 

Figure 2

This summer time lag may be due to upwelling, which affects nearshore areas in Oregon more so than offshore areas. Upwelling is probably responsible for the lower intertidal summer temperatures. However, we also see distinct drops in nearshore temperature that occur around the same time as influxes of terrestrial rain and snowmelt (e.g. summer 2011). This suggests that the terrestrial water cycle could be influencing nearshore ocean temperature, and therefore that early summer lags in nearshore temperature could also be caused by snowmelt runoff.

 

One of the most encouraging results for me is that the difference between satellite and intertidal temperatures follows a fairly consistent pattern (see Figures 3 and 4). On a monthly basis the difference is fairly small in the winter months, begins to increase in spring, and hits a maximum in August. The difference is fairly consistent for each month (although the variation increases in the winter months). This suggests that I could take the average difference for each month and use it to adjust the historical satellite SST to reflect roughly what temperatures were occurring nearshore.
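
That monthly-offset adjustment could be implemented straightforwardly; the sketch below uses made-up temperatures and hypothetical column names:

```r
# Monthly offset correction sketch (made-up temperatures in deg C)
set.seed(6)
df <- data.frame(
  month      = rep(1:12, times = 5),
  satellite  = rep(c(8, 8, 9, 10, 12, 14, 16, 17, 16, 13, 10, 9), 5) + rnorm(60, sd = 0.3),
  intertidal = rep(c(8, 8, 9, 10, 11, 12, 13, 13, 13, 12, 10, 9), 5) + rnorm(60, sd = 0.5)
)

# Mean satellite-minus-intertidal offset for each calendar month
df$diff <- df$satellite - df$intertidal
offset <- aggregate(diff ~ month, data = df, FUN = mean)

# Apply the offsets to a satellite-only record to estimate nearshore temperature
hist_sst <- data.frame(month = 1:12,
                       satellite = c(8, 8, 9, 10, 12, 14, 16, 17, 16, 13, 10, 9))
hist_sst$nearshore_est <- hist_sst$satellite -
  offset$diff[match(hist_sst$month, offset$month)]
```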

Figure 3

 

Figure 4

Altogether, these results provide a partial answer to my question of whether I can use satellite sea surface temperatures to accurately estimate nearshore temperature on a monthly basis. I did find a relationship between satellite and intertidal values that was fairly consistent on both an annual and a monthly basis. I also now know I would need to account for the lag in summer temperature increases in the intertidal (around April and May) relative to satellite temperatures. One thing this has illustrated, however, is that nearshore temperature is more variable than satellite temperature, potentially because of the influence of temporally variable nearshore processes such as upwelling and terrestrial runoff. While I can use satellite SST to infer average nearshore temperatures, it will not give me good insight into the variability of nearshore temperature, which could be important to kelp population regulation. If I want to incorporate this variability, I will probably need to look further into the effects of upwelling and terrestrial water cycles on nearshore temperature.

Multivariate analysis of the location of behavior changes

Filed under: Exercise/Tutorial 2 2018 @ 1:25 pm

Question:

Because the spatial distribution of behavior changes doesn’t change with boom angle, the results of Exercise 1 imply that hydraulic conditions do not drive fish behavior in this experiment (Figure 1). However, because analogous research and common sense imply that hydraulic thresholds do affect fish behavior, a statistical analysis is necessary to determine whether environmental and/or internal factors affected the location of the behavior changes we observed. Five hydraulic variables – water speed (m/s), turbulent kinetic energy (or TKE, m2/s2), TKE gradient (m2/s2/m), velocity gradient (m/s/m, or s-1), and acceleration (m/s2) – were drawn from the location of every behavior change in Exercise 1. Then, the locations of behavior changes were compared with channel hydraulics, boom angle, and a fish’s visual fitness (as measured by an optomotor assay) using three methods: 1) multivariate regression analysis, 2) principal component analysis (PCA), and 3) partial least squares (PLS) regression. With this analysis, we hope to answer the question: do any of the 5 hydraulic variables, the geometry of the channel, or the visual fitness of a fish correlate well with the location of its behavior change?

Figure 1. The results of Exercise 1 indicate that, despite differences in hydraulics created by varying boom angle, the spatial distribution of behavior-change locations is not observed to change between boom angles. Contour lines show two-dimensional 95% confidence intervals.

Method and steps for analysis:

Three methods of regression analysis were used to answer the question above. Although previous work was conducted in Python, R is better suited for complex statistical analyses. First, a multivariate regression examined the correlation between the locations of behavior change and the independent variables. A multivariate regression enables the analysis of more than one dependent variable – in this case, the X- and Y-coordinates of a behavior change. This is not to be confused with multiple linear regression, which analyzes the correlation of predictor variables with only one dependent variable (i.e., just X or just Y). Interestingly, angle, water speed, and velocity gradient show significant positive correlations with location (where positive locations are downstream and against the left channel wall; Figure 2). Acceleration and TKE gradient, on the other hand, show significant negative correlations with the location of behavior change. TKE (red box) shows no significant correlation, nor does visual fitness (blue box). Clearly, the high correlation among hydraulic variables warrants further analyses to better understand which hydraulic variable, if any, truly influences behavior changes.
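
In R, a multivariate regression is simply an lm() fit with the two response columns bound together; the sketch below uses simulated stand-ins for the real predictors:

```r
# Multivariate regression sketch: both coordinates fit in one lm() call
set.seed(3)
n <- 100
dat <- data.frame(speed = rnorm(n), tke = rnorm(n), angle = rnorm(n))
dat$X <- 2 * dat$speed + rnorm(n)   # downstream coordinate (simulated)
dat$Y <- -dat$angle + rnorm(n)      # cross-channel coordinate (simulated)

fit <- lm(cbind(X, Y) ~ speed + tke + angle, data = dat)
coef(fit)   # one coefficient column per response variable
```

summary(fit) prints a separate coefficient table, with p-values, for each of X and Y.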

Figure 2. The results of a multivariate regression analysis of hydraulic, geometric, and visual variables on the locations of behavior change.

Principal component analysis is one method of identifying important variables (in the form of components) from a larger set of variables, especially when they are highly correlated. A principal component is a linear combination of input variables that explains variation in the original data. By quantifying the amount of variation explained by each principal component, an idea of which variables are influential can be grasped. In the case of X (the downstream position of a behavior change), the first and second principal components account for over 85% of the variation observed (Figure 3). Within Principal Components 1 and 2, boom angle, water speed, and velocity gradient influence the dependent variable, X, to the greatest degree, as indicated by the lengths of the arrows in Figure 4. Again, visual fitness, TKE, and now TKE gradient lack influence on X. Although promising, the results of PCA are limited by the analysis of only one dependent variable. A partial least squares regression allows a similar investigation into the principal components of behavior change locations in both the X and Y dimensions.
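
A minimal prcomp() sketch (simulated data; scaling matters because the hydraulic variables have very different units) shows how the variance proportions in Figure 3 are obtained:

```r
# PCA sketch: proportion of variance explained by each component
set.seed(5)
hydro <- data.frame(speed = rnorm(50, 1, 0.2),
                    tke   = rnorm(50, 0.01, 0.002),
                    accel = rnorm(50, 0, 0.5))

pc <- prcomp(hydro, scale. = TRUE)            # scale. = TRUE standardizes units
var_explained <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per PC
cumsum(var_explained)                          # cumulative proportion
```

The loadings in pc$rotation correspond to the arrows in a biplot like Figure 4.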

 Figure 3. Proportion of variation explained by principal components as a result of PCA. The first two PC’s account for over 85% of variation in the dependent variable X.

 Figure 4. The contribution (as indicated by length of arrows) of independent variables to Principal Components 1 and 2. Variables near the bottom, top, and left of the graph have more influence than those near the apex of the arrows.

Partial least squares regression, or PLS regression, combines principal components of PCA and linear regression of multivariate regression. It benefits our data because multiple dependent variables may be analyzed for principal components of many correlated predictor variables. However, the results of PLS perhaps realize our original fear – that no consistent hydraulic variable emerges as a strong predictor of the location of behavior change in our experiment (Figure 5). Instead, TKE gradient now dominates Axis 1 (analogous to Principal Component 1), while velocity gradient and angle largely influence Axis 2. A PLS regression within boom angle failed to identify consistent hydraulic or visual variables that dominate the analysis’s axes.
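
In R, PLS regression is available through the pls package; a minimal sketch with simulated variables (the names are stand-ins for the real predictors):

```r
# PLS regression sketch: one matrix of responses, several correlated predictors
library(pls)
set.seed(9)
n <- 80
preds <- data.frame(speed = rnorm(n), grad = rnorm(n), angle = rnorm(n))
resp <- cbind(X = preds$speed + rnorm(n),   # simulated behavior-change coords
              Y = preds$angle + rnorm(n))

fit <- plsr(resp ~ speed + grad + angle, data = preds, ncomp = 2)
loadings(fit)   # variable contributions to each PLS axis
```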

 

 

 

Figure 5. Partial least squares regression of independent hydraulic, visual, and geometric variables on X and Y, the locations of behavior change. TKE gradient now dominates Axis 1, while velocity gradient and boom angle influence Axis 2 the greatest.

Results:

No hydraulic or visual variable measured in this experiment consistently predicted the locations of behavior changes observed in this experiment. If anything, boom angle most consistently dominates the axes, principal components, and correlations of PLS, PCA, and multivariate regressions. This implies channel geometry, independent of the hydraulics it created, had the largest influence on fish behaviors as we observed them. Taken at face value, this result seems illogical. However, it may indicate that a bias existed in the behavior changes we observed.

 

Critique of methods:

Multivariate regression analysis easily identifies correlations of independent variables with more than one response variable. However, highly correlated independent variables require PCA or PLS to explain variation among many variables. Although powerful, these analyses are more difficult to interpret (in the case of PCA and PLS) and unable to investigate more than one response variable (in the case of PCA).

April 30, 2018

Path Analysis for Understanding Visitor Movement

Filed under: 2018,Exercise/Tutorial 1 2018 @ 11:45 am

Question Asked

The movement ecology paradigm provides a useful, organizing framework for understanding and conducting path analysis on GPS tracks of overnight, wilderness sea kayakers in Glacier Bay National Park and Preserve Wilderness. Formally proposed in 2008, the movement ecology paradigm was designed to provide an overarching framework to guide research related to the study of organismal movement, with specific emphasis on guiding questions of why organisms move and how movement occurs through the lens of space and time (Nathan et al., 2008). The framework emphasizes understanding components of the movement, looking for patterns among those components, and understanding meaning behind movement through the underlying patterns (Figure 1). Ultimately, the target understanding is the movement path itself, which can be understood through quantification of the external factors, internal factors, and capacity for movement and/or navigation by the moving organism (Figure 2; Nathan et al., 2008). Through employing a movement ecology approach to the study of overnight kayaker movements in a protected area, individual movement tracks can be broken down into relevant components, the components of the path can be studied for patterns, and ultimately internal and external factors can be explored for influence or explanation of the movement path.

Following the movement ecology paradigm, the questions of focus for Exercise 1 include the following:

  1. How does the distribution of step lengths and turning angles vary throughout the duration of the overnight kayaker’s trip?
  2. What is the frequency of step lengths (in meters) at one-minute time intervals for overnight kayaker movements?
  3. What is the frequency of turning angles (in degrees) at one-minute time intervals for overnight kayaker movements?

These questions focus on understanding two parameters of interest for individual movement paths: step length and turning angle. These parameters have traditionally been used to describe and quantify the movement of animals. Changes in step length and/or turning angle have been used to identify changes in behavioral states among animals, such as migration or foraging behaviors. In the context of this work, I hypothesize that understanding changes in step length and turning angle among kayakers will allow us to identify changes in the behavioral states of the recreationists of focus in this work. Therefore, univariate analyses to create histograms of step length and turning angle were produced for a small test batch of five GPS tracks.

Tool/Approach Used

Jenna and I worked together on developing the analytical approach – both from a conceptual perspective and from a practical and mechanical perspective in R. We used the R analysis package moveHMM to generate step lengths and turning angles for each minute of the track (Michelot, Langrock, & Patterson, 2017). The package uses hidden Markov models and a frequentist statistical approach (drawing conclusions from sampled data by emphasizing frequencies and/or proportions). The hidden Markov model approach requires that spatial data meet two criteria: 1) the data are sampled at regular intervals, and 2) the spatial data have a high degree of accuracy with minimal positional error. The data used in this exercise are believed to meet both criteria. First, while the data were not originally collected in one-minute increments, the maximum amount of time between collection of X,Y coordinates throughout a track was one minute. The data were down-sampled into one-minute time bins to standardize the sampling interval, by averaging the X and Y coordinates within each one-minute interval. The maximum number of X,Y points averaged per minute was 6, as the minimum amount of time between X,Y observations was 8 seconds. Second, the GPS units used to collect the tracks advertise a spatial accuracy of 2.5 meters. Given that the emphasis of the overall analysis is on understanding general patterns in movement and behavior relative to time, rather than in relation to specific X,Y point locations, the potential for up to 2.5 meters of spatial inaccuracy does not present a cause for concern. Additionally, the data were visually inspected using ArcMap and significant spatial inaccuracies were removed prior to analysis. The tracks analyzed are believed to be free from spatial inaccuracy due to unit malfunction or loss of satellite signal.

Additional tools used for data preparation and data visualization for use before and after the moveHMM package tool include the following:

  1. A function to convert latitude and longitude spatial measures to UTM measures (source code taken from https://www.youtube.com/watch?v=XfdEnE99lq8)
  2. The plotSat function in the R package ggmap.

Description of Steps Used to Complete the Analysis

Jenna and I developed R scripts to complete the data transformation and formatting needed for data output from the collected GPS units to be ready for analysis by the moveHMM R package, run the tool, and create additional data visualizations.

Data Input: The input data was the attribute table of all the collected GPS tracks exported as a CSV file from ArcMap.

Steps for Analysis:

Data preparation for moveHMM: The tool requires that data be regularly sampled and that each observation has a unique numeric ID code associated with the data in the data frame. The tool will operate on latitude and longitude coordinates, but for this application of the tool it was decided that UTM coordinates, with units of meters, produced results that were practically meaningful.

  1. Import CSV file into R Studio (manually). Make sure the data column with the date and time of X,Y point collection is set to datetime.
  2. Down sample the data so that X,Y observations are aggregated into one-minute time bins. Average X and Y coordinates within each minute to produce one-minute time bin summary.
  3. Convert the averaged latitude and longitude coordinates to UTM coordinates.
  4. Insert a new ID data column to provide a numeric ID code for each GPS track in the data frame.
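
The down-sampling step (step 2 above) can be sketched in base R; the column names and coordinate values below are hypothetical stand-ins for the real GPS export:

```r
# One-minute down-sampling sketch: average X,Y coordinates within each minute
set.seed(2)
track <- data.frame(
  time = as.POSIXct("2018-06-01 09:00:00", tz = "UTC") + seq(0, 599, by = 8),
  lon  = -136.0 + cumsum(rnorm(75, sd = 1e-5)),
  lat  =   58.5 + cumsum(rnorm(75, sd = 1e-5))
)

track$minute <- format(track$time, "%Y-%m-%d %H:%M")   # one-minute time bin label
binned <- aggregate(cbind(lon, lat) ~ minute, data = track, FUN = mean)
nrow(binned)   # one averaged row per minute of the track
```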

Running the moveHMM tool and generating output:

  1. Run the moveHMM tool and create plots. See lines of code below for example commands:

#Loading prepared dataframe into moveHMM for creation of step length and turning angle histograms

data <- prepData(PreppedData_MoveHMM, type = "UTM", coordNames = c("X", "Y"))

head(data)

#Creating plots and histograms

plot(data,compact=T)

The output of the moveHMM tool is a series of four figures produced for each GPS track (Figure 3). The top left plot depicts how step length (meters) varies through time (t = minutes). The top right plot depicts how turning angle (radians) varies through time (t = minutes). The bottom two histograms show the frequencies of step length and turning angle. The moveHMM summaries leave something to be desired: the frequency histograms are not expressed as percentages on the Y axis, and the histograms are not standardized across tracks – the lengths of the X and Y axes vary. To produce more descriptive figures, the output file produced from the moveHMM analysis was manipulated to produce additional histograms of step length and turning angle.

Data Visualization of moveHMM results:

Step Length

  1. Identify the minimum and maximum step length values from the entire dataset. Given the minimum and maximum step length values, determine the desired number of step length bins for the X axis.
  2. Summarize the step lengths into the desired bins. Calculate the percentage of step lengths per bin.
  3. Create bar charts of individual tracks and all tracks together using ggplot.
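
The binning and percentage calculation (steps 1-2 above) can be done in base R before handing the summary to ggplot; a sketch with made-up step lengths:

```r
# Step-length binning sketch (made-up values, meters per minute)
set.seed(4)
steps <- c(runif(20, 0, 100), runif(60, 600, 700), runif(20, 200, 500))

bins <- cut(steps, breaks = seq(0, 800, by = 100), right = FALSE)  # 100 m bins
pct <- 100 * table(bins) / length(steps)   # percentage of steps per bin
round(pct, 1)
```

The resulting table can be converted to a data frame and plotted with a fixed axis range, giving histograms that are standardized across tracks.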

Turning Angle

  1. Display the calculated turning angles in a rose diagram, using the rose.diag tool from the R package circular with 12 bins.

Description of Results Obtained

The step length and turning angle histograms suggest that across the five test tracks processed using the moveHMM tool, the majority of step lengths across the tracks were between 600-700 meters per minute and kayakers generally traveled in a straight direction turning infrequently. The step lengths were greater than originally anticipated, but after investigation into average speeds of travel for water-based recreationists, it appears that having a step length of 600-700 meters per minute is not unreasonable. Outside of the 600-700 meter per minute step length bin, the 0-100 meter per minute step length bin was the next most frequent among the step length bins. The 0-100 meter per minute step length bin likely represents stoppage time. The disparity between the two step length categories suggests that in the case of water-based recreationists, step length may be a good metric for examining changes in behavioral state in future bivariate analyses.

When looking at the available geography for water travel and the paths taken by the overnight kayakers, the small range of values observed for turning angle movements is consistent with the data visualization in space on the satellite image. Presently, the scale at which the turning angle histograms are presented is difficult to interpret and perhaps not refined enough to identify cut points in turning angle that would suggest changes in behavioral state. Further analysis and exploration is needed to understand whether or not turning angle would be a useful metric for further consideration in Exercise 2 analyses.

Critique of Method

Using the moveHMM tool to produce measures of step length and turning angle per minute of a recreationist’s trip provided a novel application for looking at changes in a movement path over time. Once the data were formatted properly, the tool was easy to run, producing a dataset with the calculated step length and movement angle measures while also producing four figure displays per track analyzed.

The data formatting for the tool and the post-processing visualization it generates were cumbersome and had a steep learning curve. The tool requires data to be summarized into regular time bins for analysis. My data were not originally in this format; therefore, down-sampling was performed to meet this requirement. Because the down-sampling was performed, in part, on a datetime variable in R, I experienced a learning curve in getting R to read the date and time information in my data frame as a datetime class rather than a character or string class. Additionally, my data were not originally projected into UTM coordinates, and I could not generate the necessary UTM coordinates through analysis in ArcGIS. I therefore had to search for a workaround and ended up finding a function that I was able to customize for transforming my data. While the moveHMM tool will work with latitude and longitude coordinates, the output is less meaningful and difficult to interpret for the step length variable. Converting to UTMs was therefore another data transformation step needed to run moveHMM smoothly.

The histograms generated by moveHMM, while useful for looking at overall patterns and trends, are not user-friendly. The histograms cannot be customized from within the package; therefore, any customization or additional data visualization must be done outside of the moveHMM package using different R tools. I would also have appreciated more documentation on the turning angle calculation, movements, and output. The associated tool documentation does not provide any guidance for interpreting the histograms, and while the step length histogram was straightforward, I had more difficulty with the turning angle diagram. This may be due to my inexperience with this measure, but I do feel the tool’s documentation could have provided more concrete examples of interpretation and additional documentation on units.

Exercise1_Figures

References

Michelot, T., Langrock, R., Patterson, T. (2017). An R package for the analysis of animal movement data. Available: https://cran.r-project.org/web/packages/moveHMM/vignettes/moveHMM-guide.pdf.

Nathan, R., Getz, W. M., Revilla, E., Holyoak, M., Kadmon, R., Saltz, D., & Smouse, P. E. (2008). A movement ecology paradigm for unifying organismal movement research. PNAS 105(49): 19052 – 19059.

Advanced Spatial Statistics: Blog Post 1 (.5) : Temporal autocorrelation of phenological metrics

Filed under: 2018,Exercise/Tutorial 1 2018 @ 11:11 am

1. Key Question

How has the phenology of production changed over time at two grassland sites, and does this change differ between a C3 and a C4 grassland?

2. Approach used

My approach uses an autocorrelational analysis to assess whether differences in phenological indices are indicative of change over time, or cyclical patterns.

3. Methods / steps followed

To answer this question, I used the R package greenbrown to extract phenological indices from time series of MODIS NDVI data from 2001-2015 at two locations, in a C3 and a C4 grassland. The sites correspond to eddy covariance flux tower locations at the University of Kansas Biological Station and Konza Prairie Biological Station, in eastern Kansas.

Phenological metrics include the start of the growing season, the end of the growing season, the length of the growing season, and the peak growing season productivity. The Phenology() function calculates the phenological metrics by 1) identifying and filling permanent (i.e., winter) gaps in the time series, 2) smoothing and interpolating the time series, 3) detecting the phenology metrics from the smoothed and interpolated time series, and 4) correcting the annual DOY time series so that the metrics associated with days of the year (e.g., start of season, end of season) don’t jump between years. The Phenology() function provides several different approaches to calculate phenology metrics and to conduct temporal smoothing and gap filling. For this analysis, I used the "White" approach, which calculates phenology metrics by scaling annual cycles between 0 and 1 (White et al. 1997), and used linear interpolation / a running median for temporal smoothing and gap filling. The call is kon_phen <- Phenology(kon_ndvi, tsgf="TSGFlinear", approach="White"). The end result is a dataframe with annual phenology metrics.

After calculating the annual phenology metrics, I used the acf() function to assess whether annual differences in phenology were a product of change over time, or cyclical trends.
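
With only 15 annual values (2001-2015), the acf() significance bounds are quite wide, which is worth bearing in mind when interpreting the autocorrelograms. A quick sketch with stand-in start-of-season values:

```r
# ACF on a short annual series (stand-in start-of-season day-of-year values)
set.seed(11)
sos <- rnorm(15, mean = 100, sd = 10)   # 15 years of hypothetical SOS dates
a <- acf(sos, lag.max = 5, plot = FALSE)

# Approximate 95% bound for n = 15: |r| must exceed ~0.51 to register as significant
bound <- qnorm(0.975) / sqrt(15)
```

Only fairly strong cyclical signals can clear that threshold with so few years of data.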

4.1 Results: Phenological metrics

The phenological metrics appear to differ distinctly between the sites, which reflects established differences between the phenology of C3 vs. C4 grasses. The C3 site has a consistently longer growing season than the C4 site, with an earlier start of season and a later end of season. Based on the NDVI data, the sites have similar mean growing season (MGS) and peak growing season values of NDVI.

4.2 Results: autocorrelation analysis

In the autocorrelograms above, the dashed lines represent the upper and lower thresholds for statistically significant autocorrelation. Vertical lines represent 1-year lags, and a line at 0 is provided for reference. In each plot, the C4 site is orange, and the C3 site is blue.

The autocorrelation analysis reveals only a few instances of temporal autocorrelation that appear to be marginally significant. Overall, there does not appear to be strong temporal autocorrelation in the phenological metrics, suggesting that there are no annual or interannual cycles influencing them.

The C3 and C4 sites show different, though not significantly distinct, patterns of autocorrelation; the lack of a significant difference suggests that production at both sites may be controlled by the same environmental drivers.

The few instances of statistically significant autocorrelation are:
– 3- and 4-year lags for EOS for the C3 site, indicating that the first positive peak in a cyclical pattern of EOS would occur at 3 years, and the first trough for EOS would occur at 4 years. This pattern is not evident at the C4 site.
– 2-, 3-, and 5-year lags for MGS at the C3 site, indicating that the first negative troughs in a cyclical pattern of the mean growing season value would occur at 2- and 3-year intervals, and that the first positive peak in the cycle would occur at 5 years. Again, the C4 site does not exhibit a similar pattern.

Anecdotally, the result that the C3 site shows more statistically significant autocorrelation might indicate that the C3 site follows more cyclical patterns of phenology than the C4 site, perhaps suggesting that production at the C3 site is less sensitive to interannual variation in climate.

5. Critique

The distinction between the autocorrelation patterns at the C3 and C4 sites may be due to a change in land management at the C3 site over the course of the time series analyzed. In 2007, when the Eddy Covariance Flux tower was installed, management shifted from irrigated field management to leaving the area in a more natural prairie state. In contrast, the C4 site was maintained as a natural, unirrigated prairie for the duration of the time series.

Further, though NDVI is easy and accessible, comparison of the NDVI record with the Eddy Covariance Flux record at these sites suggests that it does not accurately capture intra-annual variation in production dynamics between the C3 and C4 sites. Eddy covariance flux tower records show that the C4 site has consistently higher maximum annual production than the C3 site, and a more distinct phenology. Because NDVI proxies vegetation health through greenness rather than measuring the physiology of plant production, it is less useful when plants look the same but have distinct resource-use efficiencies.

This method appears to work well on NDVI data; the greenbrown package appears to have been optimized for a ~2-week temporal resolution. When I attempted to use the package on Eddy Covariance flux data at daily resolution, the Phenology() function returned errors or missing data, and was unable to produce a smooth time series. Next steps include further processing the flux data for use with the greenbrown package, and performing a bivariate analysis to link annual phenological metrics with annual climate variables (e.g., mean annual temperature, mean annual precipitation, monthly precipitation variables, growing degree days).

April 29, 2018

Mapping Beaver Dam Habitat and Geometry through OD Cost Analysis

Filed under: Exercise/Tutorial 1 2018 @ 6:08 pm

Questions:

Question 1: How well do habitat models predict suitable beaver dam locations?

This analysis was initially driven by the question of how well models of suitable habitat for beaver dams could predict stream reaches where observed beaver dam locations were identified during a pilot field season in the West Fork Cow Creek of the South Umpqua River (see Map: Umpqua Basin  below) in southern Oregon in fall of 2017. This question focused initially on implementing a Habitat Suitability Index (HSI) described by Suzuki and McComb (1998).  The findings from this query are noted below and ultimately led to a second question for this analysis.

Map: Umpqua and WFCC basins


 

Question 2: Is there a difference in the geometry of suitable dam habitats with and without observed  beaver dams?

After applying the Suzuki and McComb HSI to the stream network, it appeared that observed beaver dams occurred more frequently on those sections of streams where habitat ‘patches’ – that is, contiguous segments of the stream classified as suitable for damming – were larger and/or separated from other patches by smaller breaks of unsuitable habitat.  These observations are consistent with theory from landscape ecology suggesting that habitat ‘geometry’ (see Figure 1), such as size and relationship to other habitats, can be an important factor in animals’ selection of habitat (Dunning et al., 1992).  To approach this question I focused on generating two metrics of habitat geometry: 1) patch size, and 2) distance to next patch.

Figure 1.

Figure from Dunning et al. 1992 showing how habitats A and B, while both too small to support a population, may be occupied (A) if in close proximity to other habitats

 


Seemingly simple, these tasks ultimately proved more time intensive than I first anticipated, so my contribution for this first exercise focuses less on results than on describing the geoprocessing procedures and workflow used to derive this information.  The hope is that others may be able to repeat the processes for similar questions related to stream habitat without running into as many dead-end efforts as it took me.

Data and Tools Used:

For this exercise, I used data from NetMap, a modeled stream layer representing the stream network as polylines in approximately 100m segments, each carrying more than two dozen attributes.  More information can be found here.

Select by Attribute query to identify suitable habitat.

The first task was to identify all segments in the stream network that fit the HSI criteria: stream gradient ≤ 3%, active channel width > 3m but < 6m, and valley floor width ≥ 25m.   To accomplish this, I used a Select by Attribute query, which can be found in the Attribute Table “Options” dropdown.  The procedure is relatively simple but requires using SQL (Structured Query Language) syntax to define the variable(s) and respective parameters for the attributes you want selected.

Analyzing Patch Length with Buffer and Dissolve tools

To combine contiguous segments of the stream suitable for damming into single/unique habitat patches, I used a combination of tools found in the ArcMap Toolbox. The primary tool was Dissolve, which can combine a layer’s features, in this case the stream segments defined as habitat, along with the attribute information for each feature.  After several iterations, I found it helpful to use the Buffer function to ensure overlap among contiguous habitat segments in the stream network prior to dissolving.

Analyzing Distance to the next Habitat Patch with OD Cost Matrix

I initially tried several approaches to measure the minimum distance to the next patch for every patch in the stream network. The first was the nndist function in the R spatstat package, which calculates the distance from one point to its nearest neighboring point. A similar tool exists in ArcMap, Near, which was appealing because it would let me maintain the workflow in the same program.  Ultimately, I landed on OD Cost Matrix, a traffic planning tool in ArcMap that measures distance between points using a specified network of travel corridors (aka streets).   This seemed best suited to my need to measure travel distances between patches along realistic pathways of movement for beavers within the stream network.

Steps to complete the analysis:

Identify Suitable Habitat with Select by Attribute.

These steps were relatively straightforward. I opened the ‘Select by Attribute’ window from the options dropdown menu in the attribute table of the stream network layer and used the following syntax to select all stream segments that met the HSI criteria: “WIDTH_M” >= 3 AND “WIDTH_M” <= 6 AND “GRADIENT” <= .03 AND “VAL_WIDTH” >= 25.  After running this query, all of the segments matching these criteria are highlighted on the stream network layer.  From there I exported the selected items into a new layer, Suitable_habitat, by right-clicking the stream layer in the Table of Contents, selecting Export Data, and choosing ‘selected features’ from the dropdown, being sure to specify which geo-referencing system to use. I usually reference the existing data frame when exporting data, since I specified that when first opening the map and it’s an easy way to keep all exported layers consistent.
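The same selection logic can be reproduced outside ArcMap; here is a minimal base-R sketch, assuming the stream network attribute table has been exported with the same column names (the segment values below are hypothetical):

```r
# Hypothetical attribute table for four stream segments
streams <- data.frame(
  SEG_ID    = 1:4,
  WIDTH_M   = c(2.5, 4.0, 5.5, 7.0),    # active channel width (m)
  GRADIENT  = c(0.02, 0.01, 0.05, 0.02),
  VAL_WIDTH = c(30, 40, 50, 20)         # valley floor width (m)
)

# Equivalent of: "WIDTH_M" >= 3 AND "WIDTH_M" <= 6 AND
#                "GRADIENT" <= .03 AND "VAL_WIDTH" >= 25
suitable <- subset(streams,
                   WIDTH_M >= 3 & WIDTH_M <= 6 &
                   GRADIENT <= 0.03 & VAL_WIDTH >= 25)
print(suitable$SEG_ID)  # only segment 2 meets all criteria
```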

Dissolve Contiguous Stream Habitat Segments into Single Patches

I defined a “habitat patch” as any contiguous length of suitable habitat segments in the stream network. My goal was to take these individual contiguous segments and convert them, through the Dissolve function, into single polylines of “patches” that would ultimately provide a measure of total patch length.

Step 1: Buffer all suitable habitat segments.

Before dissolving, I created a 20m buffer on my habitat segments to ensure that contiguous sections overlapped each other.  To do this, I opened the Buffer tool, selected the Suitable_habitat layer, and specified 20 meter buffers.  This generated a new layer of the buffered habitat segments, which I labeled Suitable_habitat_20mbuff.

Step 2: Dissolve Buffered layer into unique patches.

I then opened the Dissolve tool and selected the buffered layer.  For the second part of this function I needed to choose which attributes to keep from each habitat segment (e.g. length, gradient, width, valley width, etc.) and how they should be combined into attributes of the patch layer I was generating.  To do this I went to the Statistics dropdown in the Dissolve window and added each attribute along with the relevant calculation. For example, I specified that Length (LENGTH_M) be the sum (SUM) of all the segments for that patch, for GRADIENT I chose to carry over the mean of each segment (MEAN), and so on.   Finally, I unchecked the “Feature with multiple parts” box and checked the “Unsplit line” box. By doing so I told ArcMap that I did not want all of these segments turned into one feature (meaning one row of attributes in the attribute table), but instead to turn any overlapping line vertices (the ends of my stream segments) into a single feature, or “unsplit” line.  I then ran the function and labeled the output layer Habitat_patch.
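The patch-building logic itself (merging contiguous suitable segments into one patch with summed length) can be sketched in base R for a single ordered reach. This is a hypothetical stand-in for the Buffer/Dissolve steps, assuming segments are already sorted along the stream:

```r
# Suitability flags and lengths (m) for ten ordered stream segments
suitable <- c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
length_m <- c(100, 95, 110, 100, 90, 105, 100, 100, 95, 120)

# Run-length encoding groups contiguous runs of identical flags,
# so each run of TRUEs is one habitat patch
r      <- rle(suitable)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Sum segment lengths within each suitable run = patch lengths
patch_len <- tapply(length_m[suitable], run_id[suitable], sum)
print(patch_len)  # three patches: 305, 200, 120 m
```

The real workflow must also handle confluences (three vertices meeting), which is exactly where the Unsplit line option fell short, as noted in the critique.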

*Note, that I’ve had to repeat the Dissolve more than once and found the Results window in ArcMap highly useful. (Main Toolbar > Customize > Results).  By using it I can re-navigate to the Dissolve tool table with all the Statistic/attribute selections already populated.

Calculating Distance between Patches Using OD Cost Matrix

The OD Cost Matrix is a tool developed for transportation planning that can calculate the distance (referred to in Arc as Cost) from starting points (Origins) to ending points (Destinations) using a specified network of travel corridors (Network Dataset), such as streets or highways.  When completed, the analysis generates a dataset with the distance from every specified origin to every specified destination along the network paths; this dataset can become quite large depending on the number of origins/destinations specified.

Step 1: Turn on relevant tools:

The first step was to turn on the Network Analyst extension, found under Main Toolbar > Customize > Extensions, and click Network Analyst to add a check next to it.  Then I needed to open the Network Analyst toolbar: Main Menu Toolbar > Customize > Toolbars, click on Network Analyst.

Step 2: Create Network Dataset

In ArcCatalog, I navigated to the network I wanted to use as travel paths (the stream network layer), right-clicked, and selected New Network Dataset.  From there, Arc moved me through a series of windows.  The most important is specifying ‘any point’ under the connectivity policy; the default ‘end point’ leaves a number of Origin points out of the analysis for some reason.  The other important window/step was adding the metric I wanted to use for the ‘cost’ analysis.  In my case this was the LENGTH_M variable, the length in meters of each stream segment in the network.  Note that the name needed to match the variable name in the dataset exactly.

Step 3: Convert Dissolved Patch Habitats into Point Data.

Because OD Cost Matrix relies on starting and ending points, I needed to convert the polylines representing my patches into points.  To do this I used the Feature Vertices To Points tool. The tool itself is fairly straightforward but requires a tradeoff decision: convert each patch to a single “MID” point at the mid-point of the patch, or to “DANGLE” points at each end of the patch. The mid-point selection offers a simpler output from the OD Cost Analysis; “Dangle” offers a more precise measure of nearest distance, but a more cumbersome output because there are twice as many points to and from which distances must be calculated (more on that below). In the end I did both and used the mid-points as barriers (see next step).

Step 4: Load Data in OD Cost Matrix

In the Network Analyst toolbar I selected “OD Cost Matrix” from the dropdown menu, then clicked the icon with a small flag overlaid on a grid.  This opens a new window adjacent to the Table of Contents. In that window I right-clicked the bolded “Origins”, selected “Load data”, and selected the patch points I created in Step 3.  I then repeated the process for “Destinations”.  For “Point Barriers” I loaded my mid-point patch layer.  It took me a few tries to realize this, but the barriers data prevents the analysis from computing numerous unnecessary distances.

Step 5: Solve OD Cost Matrix

Once all the data was loaded, I clicked the Solve icon on the Network Analyst toolbar, which looks like a chart with an upward curve overlaid on it.

Step 6: Export Distances from OD Cost Matrix into New Layer.

Once OD Cost Matrix completes the analysis, the Table of Contents is populated with a number of outputs, including a layer called “Lines”.  This layer shows straight lines from every Origin to every Destination.  At first this was a little confusing because the lines suggest Euclidean distances, but on opening the attribute table I was able to verify that they are in fact distances via the stream network.  I exported this as a new layer, patch_distances.

Step 7: Create Summary Table for Distances and Join to Dissolved Patch Layer

The last step was to add the new distance information to my layer of dissolved habitat patches.  To do this I opened the attribute table for patch_distances, right-clicked on one of the column headers, and selected “Summarize”.  This opens a window with a dropdown for the column on which to create summary data, and a number of options lower in the window for what information to summarize.  Here I used the patch origin and destination IDs and summarized by minimum, maximum, sum, and average distances.  This generates a new table summarizing each of those metrics by Origin/Destination ID, which I saved as Patch_Dist_Summary.  Then I used the Join tool to add the summary information to the Habitat_patch layer.  Note that doing this required the table to have an identifier for each distance relating it to the corresponding patch feature in the Habitat_patch layer, so that the distance metrics could be associated appropriately.
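The Summarize-and-Join step has a direct base-R analogue; a sketch with hypothetical origin-destination distances and patch attributes:

```r
# Hypothetical network (path) distances in m between three patches
od <- data.frame(
  OriginID = c(1, 1, 2, 2, 3, 3),
  DestID   = c(2, 3, 1, 3, 1, 2),
  dist_m   = c(400, 900, 400, 550, 900, 550)
)

# Equivalent of ArcMap's Summarize: minimum distance from each origin
dist_summary <- aggregate(dist_m ~ OriginID, data = od, FUN = min)
names(dist_summary)[2] <- "min_dist_m"

# Equivalent of the Join: attach the summary back to the patch layer by ID
patches <- data.frame(PatchID = 1:3, length_m = c(305, 200, 120))
patches <- merge(patches, dist_summary,
                 by.x = "PatchID", by.y = "OriginID", all.x = TRUE)
print(patches)
```

Keeping the Patch ID through every derived table is the crux; as noted in the critique below, Arc re-labels IDs at each step, so the merge key has to be carried forward deliberately.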

Results

Identifying Suitable Dam Habitat

Map 1 shows the results of the habitat selection based on the Suzuki and McComb (1998) HSI criteria, which includes 15.8 km of stream length (2.3% of the total stream network).  Also shown are the dam sites observed during our pilot field season in 2017. Dam sites occurred across 21 stream segments for a total of 2.2 km of stream length, all of which fall within the bounds of the HSI criteria. However, as a predictive tool the HSI selection criteria also identified a large number of stream segments that were observed as unoccupied in our site surveys, meaning that the model tends to over-predict habitat based on our observed dam sites.  One aspect that stood out, however, was that the dam sites tended to occur on segments of suitable habitat that were contiguous with other suitable stream segments and/or separated only by small distances of unsuitable stream segments.

Map 1


Patch Length:

Map 2 shows all habitat patches, color-coded by length, with observed dam sites occurring among the longest patches.  These preliminary findings support the hypothesis that patch geometry may be a relevant factor in beavers’ habitat selection for dam building.   It should be noted, however, that a few patches were not surveyed, so it is possible that unobserved dams exist in those locations.

Map 2


Patch Distances:

Map 3 shows the stream network distances generated from the OD Cost Matrix, with lines connecting the origin/destination points for all route distances calculated.  (NOTE: while the lines are straight, the distance calculations reflect the ‘path distance’, i.e., the distance if traveling through the stream network.) Here green lines indicate shorter distances and red lines longer distances. Map 4 applies similar color codes to the patches themselves, with red indicating a patch’s relative isolation and green indicating high connectivity to the nearest patch.

Map 3

 


Map 4


Distance weighted Length

Lastly, I wanted to consider both length and distance to the nearest patch together, so I created a final metric: Patch Length / Distance to Nearest Patch.  The results are shown in Map 5.  Overall, this combined metric does not produce a large change relative to either of the individual metrics.

Map 5


Critiques

Critique of Habitat Selection

I do not have a lot of input on this.  It’s a very useful procedure for quickly selecting out features of a dataset through a fairly simple process.  The challenge, of course, is that SQL is a language, and like any other it can be fussy when first learning its syntax.

Critique of Habitat patching process

Overall, the Dissolve process accomplished most of what I needed, but with some notable caveats.  The most significant limitation is that the ‘unsplit lines’ function I used to combine stream layers only works where two line vertices overlap.  In a handful of instances there were locations in the patches where three vertices came together at the confluence of a higher- and lower-order stream.  These cases are ignored by the unsplit line function and remain ‘undissolved’. I found a workaround by adding a Patch ID to the attribute table by hand for these three segments and then dissolving by that common ID.  However, I’m not excited about applying the same workaround when the number of suitable stream segments increases from 158 in West Fork Cow Creek to over 11,474 as I expand this procedure to the rest of the Umpqua Basin (see Map 6 of suitable habitat segments in the greater Umpqua).

Critique of Distance Estimates

My initial efforts to identify patch distances began with the nndist function in the spatstat package in R and the Near tool in ArcMap.  The main limitation of these is that they calculate the Euclidean distance (i.e. straight lines) between points, which is not particularly realistic given what we know about beaver movement in a watershed and their fidelity to streams for protection from predators (i.e. they tend not to waddle up ridges and into the next drainage).  Instead I needed a tool that would calculate the distance it would take a beaver to move from one patch to the next using the stream network as its travel corridor.

For my purposes the OD Cost Matrix seems very promising.  The greatest challenge I had with this procedure was relating the distance output layer back to the original patch layer. When deriving a new dataset from analysis of an existing one, Arc carries the original feature ID information forward but re-labels it as “OriginID”.  In my case, that means the Patch IDs were carried forward in the conversion to point data, but when the point data were used in the OD Cost Matrix and then subsequently summarized, those Patch IDs were long gone.  As a result, I had to trace back the genealogy and create a new Patch ID column in the summary table so that I could Join it with the original patch layer. Again, this is doable in a sub-basin context, but when I apply this procedure to the entire Umpqua (Map 6) I’ll have to find a better solution.

Map 6:


References:

Dunning, J. B., Danielson, B. J., & Pulliam, H. R. (1992). Ecological Processes That Affect Populations in Complex Landscapes. Oikos, 65(1), 169–175. https://doi.org/10.2307/3544901

Suzuki, N., & McComb, W. C. (1998). Habitat classification models for beaver (Castor canadensis) in the streams of the central Oregon Coast Range. Retrieved from https://research.libraries.wsu.edu:8443/xmlui/handle/2376/1217

 

April 27, 2018

Fitting and Assessing Fit of Lognormal Regression to Phenology Data

Filed under: 2018,Exercise/Tutorial 1 2018 @ 6:30 pm

QUESTION

The purpose of this analysis was to find ways to describe phenological data I gathered in 2017, in order to quantify trends and compare between sites. Phenology was recorded for two interacting species: the introduced cinnabar moth (Tyria jacobaeae), and the native perennial herb Senecio triangularis, which is a novel host to the moth larvae in North America. For S. triangularis, individual flowerheads (capitula) were scored into six phenological stages: primordia (P), buds (B), young flower (YF), flowers (F), fruit (FR), and dehisced fruit (D).

The motivating question is: does the flowering phenology differ by site for five surveyed sites? And if so, is the difference apparent at all flowering stages, or only for specific stages?

In order to complete this analysis, I first needed to convert my survey dates to growing degree days, defined as accumulated daily heat gain above 5 °C and below 37 °C. The development thresholds used here were estimated from previous studies on the phenology of alpine flowers (Kudo and Suzuki, 1999). I used the single triangle method with an upper threshold (Sevacherian et al., 1977), with 2017 Tmin and Tmax rasters acquired from PRISM. This was accomplished using R code developed by Dr. Tyson Wepprich (personal communication), and provided accumulated growing degree days (hereafter GDD) as a unit of thermal time, against which I plotted the response variable of capitula counted in one of the flowering stages.

I will focus on F stage capitula to describe the workflow, which will then be expanded to each flowering stage. I hoped to describe an overall behavior for observations of F stage capitula (hereafter F), producing a predictive curve fit to the data. The goodness of fit of this curve could then be examined for each site separately, to yield an estimate of the difference in timing at certain sites.

 

APPROACH:

I used a combination of tools to choose and fit a curve to my observations of F against GDD. First, I used an R package called fitdistrplus to visually compare and also estimate parameters for a variety of curves, including Weibull, gamma and lognormal curves. Then I used the built-in nls (non-linear least squares) function in R with parameter starting estimates determined from output of fitdistrplus.

 

METHODS:

Initial trials with fitdistrplus showed that the package was only able to fit density plots of one variable. In order to create a density plot that reflected capitula counts on a given degree day, I used the package splitstackshape. I used the expandRows() command to expand rows of the data frame by the values in the column containing F counts. In the resulting dataset, each observation of a time (GDD) represented an individual capitulum counted at that time, so that a plant with seven F capitula at time t would be represented by seven observations of t. This method was based in part on Murtaugh and Emerson’s methods for the statistical analysis of insect phenology (Murtaugh et al., 2012).
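The row expansion is easy to illustrate in base R, where rep() does the same job as splitstackshape::expandRows (the survey values below are hypothetical):

```r
# Two plants surveyed at thermal times 350 and 420 GDD,
# with 7 and 3 F-stage capitula respectively (hypothetical values)
surveys <- data.frame(GDD = c(350, 420), F_count = c(7, 3))

# Expand rows by count: each capitulum becomes one observation of GDD,
# matching splitstackshape::expandRows(surveys, "F_count")
expanded <- surveys[rep(seq_len(nrow(surveys)), surveys$F_count),
                    "GDD", drop = FALSE]

nrow(expanded)       # 10 observations
table(expanded$GDD)  # 7 at 350 GDD, 3 at 420 GDD
```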

Using the fitdist() command from fitdistrplus and the expanded set of GDD observations for F, I fit gamma, lognormal, and Weibull curves to my data, then called the denscomp() function to visually compare each curve against the density plot of observations. Repeating this method across all floral stages, I determined that lognormal curves seemed to best reflect visual trends in my data (Figure 1a & b).
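For the lognormal case, the maximum-likelihood estimates that fitdist() returns have a closed form, which can be checked in base R without the package (synthetic data, since the survey counts aren't reproduced here):

```r
# Synthetic expanded GDD observations drawn from a known lognormal
set.seed(42)
x <- rlnorm(500, meanlog = 6.5, sdlog = 0.3)

# Closed-form MLEs for the lognormal, matching what
# fitdistrplus::fitdist(x, "lnorm") estimates
meanlog_hat <- mean(log(x))
sdlog_hat   <- sqrt(mean((log(x) - meanlog_hat)^2))  # note: n, not n-1

c(meanlog_hat, sdlog_hat)  # close to the true 6.5 and 0.3
```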

Figure 1a: Observations of F stage capitula against growing degree days

Figure 1b. Density plot produced with fitdistrplus showing observations of F-stage capitula at time x in growing degree days, with fitted Weibull, lognormal & gamma curves.

The lognormal fitdist object created for the steps above also contained estimates of the parameters for the lognormal curves, which I extracted with the summary() call.

Seeking a more flexible lognormal model, I saved these estimated parameters and turned next to the built-in nls() (non-linear least squares) function. This function requires the user to specify a formula and a vector of starting estimates for the parameters, and then uses least squares methods to find a best fit for the parameters with the provided data. My formula in this case was the lognormal probability density function (pdf):

f(x) = 1 / (xσ√(2π)) · exp( −(ln x − μ)² / (2σ²) )

I created an nls object using this formula, the original (non-expanded) dataset, and the estimates from the fitdist() function as starting estimates of the parameters μ and σ. I was able to obtain an nls object with these methods, but had a scaling issue between the axes: because my x axis was stretched over 900 degree days, and because of inherent properties of the lognormal pdf, the scale of the y axis was off by about three orders of magnitude. To address this, I scaled GDD by dividing the number of degree days by 100, and converted my counts to a per-plant proportion of F. Obtaining new estimates from fitdist() for μ and σ, I found these adjustments indeed yielded an appropriately fitting curve for my data. I used least squares and root mean squared error (RMSE) methods to check the fit of the curve.
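Putting the pieces together, this is a minimal sketch of the nls() step on synthetic data (proportions generated from a known lognormal pdf on the scaled GDD axis; the true parameter values and noise level are assumptions):

```r
# Synthetic per-plant proportions of F capitula against scaled thermal
# time (GDD / 100), generated from a lognormal pdf with mu=1.6, sigma=0.35
set.seed(7)
t <- seq(0.5, 9, by = 0.25)
y <- dlnorm(t, meanlog = 1.6, sdlog = 0.35) + rnorm(length(t), sd = 0.01)

# Fit the lognormal pdf by non-linear least squares; in the real workflow
# the starting values come from fitdist()
fit <- nls(y ~ 1 / (t * s * sqrt(2 * pi)) * exp(-(log(t) - m)^2 / (2 * s^2)),
           start = list(m = 1.5, s = 0.4))

coef(fit)                        # recovered mu and sigma
rmse <- sqrt(mean(resid(fit)^2)) # same RMSE check as used for site fits
```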

 

RESULTS

With the predict() function and the nls object fit above, I plotted a predictive curve based on my lognormal model against the prepared data, and used a visual check to determine that this result appeared to be a relatively good fit, especially compared to earlier attempts to fit simple polynomials to the data (Figure 2).

Figure 2. Observations of F stage capitula fitted with lognormal (red) and cubic polynomial (blue) curves using non-linear least squares method.

The original goal was to compare observations for each site to this overall predictive curve. I was able to plot site-specific data with the curve for visual inspection (Figure 3a-e).

Figure 3a.

Figure 3b.

Figure 3c.

Figure 3d.

Figure 3e.

Additionally, I calculated mean absolute error (MAE) and root mean squared error (RMSE) to summarize the difference between observed values for each site and predicted values given by the overall lognormal curve. Of the five sites, Juniper had the lowest values for both RMSE (0.1274, compared to 0.1848 for the data overall) and MAE (0.0712, compared to 0.103 overall); Waterdog had the highest RMSE and MAE (0.2202 and 0.1269 respectively). A summary of these values is included in Figure 4:

Figure 4. Lognormal curve fit by nls shown against F stage observations by site. Inset shows root mean squared error (RMSE) and mean absolute error (MAE) calculated overall and by site.
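The site-level fit statistics are simple to compute once predicted values exist; a base-R sketch with hypothetical observed/predicted vectors for one site:

```r
# Hypothetical observed per-plant proportions and model predictions
obs  <- c(0.10, 0.35, 0.60, 0.40, 0.15)
pred <- c(0.12, 0.30, 0.55, 0.45, 0.10)

rmse <- sqrt(mean((obs - pred)^2))  # penalizes large misses more heavily
mae  <- mean(abs(obs - pred))       # average absolute miss

round(c(RMSE = rmse, MAE = mae), 4)
```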

CRITIQUE:

There was no single tool I could find that could fit a phenomenological model to these data. The combination of fitdistrplus with nls was an adequate solution because in the end it did yield an appropriate-looking curve. To use fitdistrplus, I had to collapse my explanatory and response variables into one variable whose value represented the explanatory variable and whose frequency represented the response variable. This was a little overcomplicated for my purposes, but it did allow me to visually compare fitted lognormal, gamma, and Weibull curves, which was a great benefit.

The nls function, which allowed me to switch back to my initial data format, made it easy to extract estimates and predicted values and to plot the curve of predicted values. This was extremely useful for both visual and mathematical assessments of fit. A drawback of nls is that without starting parameters estimated by fitdistrplus, it was extremely difficult to fit a lognormal curve; some starting points yielded no tractable model, and others yielded models with little context about goodness of fit.

I am concerned that I could only fit the curve by rescaling the data, which may be due to an underlying property of the lognormal pdf: that the area under the curve must equal 1. At moments I felt unsure whether I was fitting the data to the curve or the curve to the data.

But in the end, this method yielded a lognormal curve that captures important properties of my count/proportion data, such as rising and falling at appropriate rates around the peak and not dipping into negative values. I will continue to explore, and attempt to verify that the method I used for estimating parameters in fitdistrplus for use in nls was valid.

REFERENCES

Kudo, G., and Suzuki, S. (1999). Flowering phenology of alpine plant communities along a gradient of snowmelt timing.

Murtaugh, P.A., Emerson, S.C., McEvoy, P.B., and Higgs, K.M. (2012). The Statistical Analysis of Insect Phenology. Environ. Entomol. 41, 355–361.

Wepprich, Tyson. (2018). “PRISM_to_GDD_daterange”, R Code. https://github.com/tysonwepprich/photovoltinism/blob/master/PRISM_to_GDD_daterange.R

Sevacherian, V., Stern, V.M., and Mueller, A.J. (1977). Heat Accumulation for Timing Lygus Control Measures in a Safflower-Cotton Complex. J. Econ. Entomol. 70, 399–402.

 

April 26, 2018

Using multivariate statistics to test spatial variability of model performance.

Filed under: Exercise/Tutorial 1 2018 @ 2:42 pm

Question that you asked

Understanding where my model performs well compared to in-situ observations is important for assessing the appropriateness of snow and climate metrics derived from it for comparison with Dall sheep demographic data. For instance, does the model adequately capture the paths of wind-blown snow that are so critical for winter grazing? Are there arrangements of the landscape whose snow evolution is consistently difficult to represent accurately? Answering these questions produces a dual benefit: identifying where to improve the model, and the possibility of characterising and handling uncertainties in subsequent analyses using model data.

Data:

The data I’m using in this analysis are measurements of snow depth taken at a field site in the Wrangell St Elias National Park between the 18th and 24th March 2017. These snow depth measurements were taken at relatively high spatial resolution using a GPS-enabled Magnaprobe, and have therefore been aggregated into 30m by 30m grid cells of mean snow depth, along with other simple statistics (e.g. standard deviation). These grid cells match the output rasters of my spatially distributed snow evolution model, enabling assessment of model performance by comparison of observed mean snow depth to modelled snow depth. For simplicity’s sake, I have chosen the middle date of the observation period, the 21st March, for the modelled values. This is done in the knowledge that there was no snowfall or high wind in the entire observation period, both in reality and in model space, and that snowpack evolution governed by other processes is relatively minor under such conditions and over such a time period.

Important to note also is that the model output compared here is the product of >250 iterations of model calibration. The model selected by this process produces the lowest Root Mean Squared Error (RMSE) between modelled and observed snow depth and snow water equivalent. This analysis is hence designed to examine the spatial distribution of that error. To do this I categorised the difference between mean observed and modelled snow depth as either weak or strong over- or under-prediction, based on whether it falls within or beyond the RMSE in the positive or negative direction.
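The categorisation step can be sketched in base R; the per-cell differences and RMSE below are hypothetical stand-ins, and the sign convention assumes the difference is observed minus modelled (so a positive difference means the model under-predicts depth):

```r
# Per-cell error: observed mean depth minus modelled depth (m), hypothetical
diff <- c(-0.35, -0.10, 0.05, 0.22, 0.40)
rmse <- 0.25  # overall calibration RMSE (hypothetical)

# Classify each cell: under- vs over-prediction, and weak (within RMSE)
# vs strong (beyond RMSE)
category <- ifelse(diff > 0,
                   ifelse(diff >  rmse, "strong under-prediction",
                                        "weak under-prediction"),
                   ifelse(diff < -rmse, "strong over-prediction",
                                        "weak over-prediction"))
print(category)
```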

The landcover dataset, derived from the National Land Cover Database 2011 Alaska (Homer et al. 2015), alongside a DEM for the model domain provide landscape data.

Fig. 1; Spiral of Magnaprobe measurements on Jaeger Mesa

Name of the tool or approach that you used.

To produce landscape metrics I used the following data processing tools in QGIS:

  • GRASS GIS 7 r.slope and r.aspect,
  • Raster Calculator to produce a ‘northerness’ layer – cosine(aspect)
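For instance, the ‘northerness’ layer is just the cosine of the aspect raster; in script form (a sketch assuming aspect in degrees clockwise from north, not the QGIS Raster Calculator expression itself):

```python
import numpy as np

def northness(aspect_deg):
    """Cosine of aspect: +1 facing due north, -1 due south, ~0 east/west."""
    return np.cos(np.radians(np.asarray(aspect_deg, dtype=float)))

# Hypothetical aspect values: north-, east-, and south-facing cells
n = northness([0.0, 90.0, 180.0])
```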

I also used QGIS to aggregate the point snow depth measurements into 30m resolution rasters for mean depth, standard deviation, max/min depth.

To perform the multivariate analysis I used the R package FactoMineR (Lê, Josse & Husson, 2008). FactoMineR is a multivariate exploratory data analysis and data mining tool whose methods can handle discrete data such as the nominal landcover data in my analysis. The particular function used is FAMD (Factor Analysis for Mixed Data).

Brief description of steps you followed to complete the analysis.

Data preparation;

For the aggregation of the in-situ measurements I used QGIS tools to create a 30 m polygon grid that matched the extent of, and aligned to, my SnowModel domain. I then used the ‘Join Attributes by Location’ tool to intersect the point data with these polygons, selecting mean, min, max, and standard deviation statistics for the summary. The Rasterise tool could then be used to convert the polygon layer into ArcInfo ASCII format.
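The same aggregation could be sketched outside QGIS by binning point coordinates into 30 m cell indices and summarising per cell (illustrative points only; the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical Magnaprobe points: x, y in metres, snow depth in cm
pts = pd.DataFrame({
    "x":     [3.0, 12.0, 45.0, 50.0],
    "y":     [5.0, 20.0, 10.0, 15.0],
    "depth": [80.0, 100.0, 60.0, 70.0],
})
cell = 30.0
pts["col"] = np.floor(pts["x"] / cell).astype(int)
pts["row"] = np.floor(pts["y"] / cell).astype(int)

# Per-cell summary statistics, mirroring the mean/min/max/std aggregation
grid = pts.groupby(["row", "col"])["depth"].agg(["mean", "min", "max", "std"])
```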

To compare model results against observations, I first subtracted the model result raster for the 21st of March 2017 from the mean observed snow depth raster. The resulting difference raster could then be mapped on top of landscape variables such as elevation and slope (see Figs. 2 and 3 below) in QGIS.

Fig. 2; Model performance of snow depth mapped against elevation

Fig. 3; Model performance of snow depth mapped against landcover class

MATLAB was employed to build the table for the Multiple Factor Analysis in R. Using the arcgridread function it was possible to open the arrays of observations, model results and landcover variables, and then build a table holding each sampled cell’s corresponding variables. This table was then exported to .csv format for use in R.

Within R the table was imported as a data frame and the land cover vector converted into factor type.

Data Analysis;

The FAMD function of FactoMineR was used to perform the multivariate analysis, producing tables and plots of the results.

Brief description of results you obtained.

Fig. 4; Screeplot of explained variance by dimension

Table 1; Eigenvalues, percent variance and cumulative variance by dimension


The FAMD analysis shows which dimensions describe the greatest variability, effectively ‘flattening’ the data by reducing a large set of variables to a selection that contains most of the information. In Fig. 4 and Table 1 above, we can see that for my data 24% of the variance is explained by the 1st dimension and 16% by the 2nd. Cumulatively, 70% of the variance is described by the 5th dimension.

Fig. 5; Variable contribution to dimension 1

It is then possible to see which variables contribute most to each dimension. Figure 5 shows that elevation and landcover each contribute >20% of the variance seen in the 1st dimension. Above the 95% significance level, the observed mean snow depth (obsMeanSNOD) is also a contributor.

Fig. 6; Variable contribution to dimension 2

The second dimension receives its greatest contribution from the categorical model-performance variable (diffCategory), with significant contributions from both modelMeanSNOD and landcover; see Figure 6.

Fig. 7; Quantitative variables on dimensions 1 and 2, coloured by the cosine of their contribution.

It is then possible to plot both the quantitative (figure 7) and qualitative (figure 8) variables on the 1st and 2nd dimension axes. Doing so gives some insight into how the variables interact. The high positive loading of observed mean snow depth (fig. 7, obsMeanSNOD) on the 1st and 2nd dimensions is matched by the qualitative variables of coniferous forest and strong under-prediction. This suggests that the model strongly under-predicts where the landcover is coniferous forest and the observed snow depth is high. This is consistent with elevation having the greatest influence on the 1st dimension: higher elevations in my study domain have lower snow depth and bare or prostrate shrub tundra landcover. Indeed, my model is specifically calibrated to best reproduce snow in these areas as they’re prime Dall sheep habitat. Looking further at figure 8, weak under-prediction matches the bare landcover class quite well, and similarly weak over-prediction lines up nicely with prostrate shrub tundra. On the 2nd dimension, erect shrub tundra lies in the direction of the modelled mean snow depth and strong over-prediction. This would suggest that for this landcover the model is producing too much snow.

Fig. 8. Qualitative variables contribution to dimensions 1 and 2

Figures 9 and 10 further describe the patterns explored above by plotting the individual rows, i.e. the pixels I sampled, on the first two dimensions and colouring them by their category for both the categorical description of model performance and landcover. Here, however, we see that the inferences aren’t necessarily so clear cut. For example, the erect shrub tundra landcover class is not entirely composed of strong model over-prediction.

Fig. 9; Individuals by model performance category

Fig. 10; Individuals by land cover category

Critique of the method – what was useful, what was not?

This task was useful for quite a number of reasons, not least because it forced me into baby-steps use of R. It neatly confirmed my suspicions about where my model is over-predicting (high elevations with prostrate shrub tundra), but also identified other areas of poor representation, namely strong under-prediction in coniferous forest. It also gave me a certain satisfaction that, for the majority of pixels in high-elevation, low-vegetation Dall sheep terrain, there is only weak over- or under-prediction, suggesting my calibration efforts haven’t been in vain.

It is also interesting to observe that the bare landcover class is subject to weak under-prediction, whereas the prostrate shrub tundra class tends to be over-predicted. This is quite likely due to how the model treats the roughness of each landcover compared to how rough that landcover is in reality (as well as whether the 30 m NLCD landcover classes are actually a good representation of ground conditions). In my study area, the bare patches were frequently rather rough scree that had the capacity to intercept a fair amount of snow. Conversely, the prostrate shrub tundra snow-holding height in the model parameters is set at 10 cm, whereas in reality this landcover was a patchwork of sparse alpine grasses and sedges, probably not over 5 cm in height. This is useful information that allows me to experiment with adjusting these parameters.

A surprise is that the other landscape variables (slope, aspect, northerness) have little influence on the variability. I had expected slope to have more of an influence given how important it is to wind redistribution of snow.

Where I have queries about the method, with respect to my particular problem, is in the selection, type (qualitative vs quantitative) and quantity of the variables included. For instance, I originally left the description of model performance as a continuous variable but found the results not nearly as interpretable as when I categorised them. There is also an element of variation within variation that is masked by attempting to flatten the data. Dominant variables take precedence, such as landcover and elevation in my case, so that the causes of variation within a landcover class, for example, are hard to see. A further exercise useful to my specific problem would be to run the analysis only on pixels from the bare and prostrate shrub tundra classes and see whether landscape variables are greater contributors.

 

REFERENCES

Homer, C.G., Dewitz, J.A., Yang, L., Jin, S., Danielson, P., Xian, G., Coulston, J., Herold, N.D., Wickham, J.D., and Megown, K., 2015, Completion of the 2011 National Land Cover Database for the conterminous United States-Representing a decade of land cover change information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354

Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18.

April 25, 2018

Using a hidden Markov Model tool to identify movement patterns of recreationists

Filed under: 2018,Exercise/Tutorial 1 2018 @ 2:48 pm

Question that you asked

For this exercise, the question that I asked was how to use path segmentation analysis to identify movement patterns of day-user recreationists at String and Leigh Lake in Grand Tetons National Park.

Name of the tool or approach that you used.

To answer this question, Susie and I employed methods derived from the hidden Markov model (HMM). Hidden Markov models, a type of path segmentation analysis, are typically used in the field of movement ecology to understand animal behavior. These models relate animal behavior to covariates, i.e. other environmental factors or ‘hidden states’ that may drive and explain animal movement. Often the movement of animals is separated into researcher-defined ‘states’: slow-moving states (foraging) or faster-moving states (in transit). Understanding the conditions and underlying features that influence the change in movement between these states is at the heart of what makes this analytical tool useful for movement ecologists – and potentially for social scientists.

What makes the HMM an appealing tool for our GPS data is that the assumptions necessary for this analysis align with the features of our dataset: (1) measurement error is small to negligible, and (2) the sampling units collect data at regular intervals. The field of recreation science is only beginning to explore more advanced spatial methods to analyze and understand human behavior in outdoor recreation systems. Thus, this exercise is an opportunity to adopt tools and methods employed by other disciplines and test how compatible they are with human movement data.

Michelot, Langrock, & Patterson (2016) developed an R package, ‘moveHMM’, that provides step-by-step instructions for prepping data into the appropriate format for HMM analysis. The package is built with the functions and algorithms necessary for more rigorous statistical testing with multiple variables, but for the purposes of this exercise we used moveHMM to examine two variables: step length and turning angle.

Brief description of steps you followed to complete the analysis.

Step 1: Subset data. I chose to use nine tracks that were collected on July 19, 2017.

Step 2: Create a unique ID column. The moveHMM package requires that each track contain an ID column with a unique numeric value allocated to each individual.

Step 3: Establish temporal parameters. To calculate step lengths and turning angles, we needed to define the temporal parameter for each step segment (i.e. every minute, every five minutes, every six hours, etc.). The original dataframe had a temporal resolution of 10 seconds. However, that level of precision was not necessary for this analysis, as we assumed human movement did not vary significantly every 10 seconds. Thus, we aggregated the points to 1-minute intervals. To do this I used the dplyr package in R to average the coordinates to the minute level for each individual.
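In pandas, an analogous minute-level averaging might look like this (a sketch with made-up GPS fixes, not the dplyr code we actually used):

```python
import pandas as pd

# Hypothetical 10-second GPS fixes for one individual (UTM metres)
track = pd.DataFrame({
    "time": pd.to_datetime([
        "2017-07-19 09:00:00", "2017-07-19 09:00:10",
        "2017-07-19 09:00:50", "2017-07-19 09:01:20",
    ]),
    "x": [520000.0, 520010.0, 520020.0, 520060.0],
    "y": [4840000.0, 4840010.0, 4840020.0, 4840030.0],
})

# Average coordinates within each 1-minute bin, as
# group_by(minute) %>% summarise(mean(x), mean(y)) would in dplyr
per_min = track.set_index("time").resample("1min")[["x", "y"]].mean().dropna()
```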

Step 4: Choose which type of coordinates to use for analysis. The moveHMM package allows the user to use either latitude/longitude or UTM coordinates. My original data contained both. I chose to use UTM coordinates rather than lat/long because the output for UTM coordinates is in meters, which is a little easier to interpret and understand than degrees of latitude and longitude.

Step 5: Calculate step length and turning angle. My prepped dataframe is ready to go. It contains: unique ID number, timestamp (at the minute level), and x & y projection values.

moveHMM has a function (prepData) that quickly calculates the step lengths and turning angles. With it I created a new dataframe that includes these two additional variables.
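For intuition, the underlying calculation is simple: step length is the Euclidean distance between consecutive fixes, and turning angle is the change in heading wrapped into (−π, π]. A hand-rolled sketch (not moveHMM’s implementation; coordinates are hypothetical):

```python
import numpy as np

def steps_and_angles(x, y):
    """Step length = Euclidean distance between consecutive fixes;
    turning angle = change in heading, wrapped into [-pi, pi)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = np.diff(x), np.diff(y)
    step = np.hypot(dx, dy)                      # n-1 step lengths
    heading = np.arctan2(dy, dx)                 # direction of each step
    turn = np.diff(heading)                      # n-2 turning angles
    turn = (turn + np.pi) % (2 * np.pi) - np.pi  # wrap into [-pi, pi)
    return step, turn

# Three fixes: a 3-4-5 step east-north, then a step due north
step, turn = steps_and_angles([0.0, 3.0, 3.0], [0.0, 4.0, 10.0])
```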

Step 6: Plot the data. The moveHMM package offers a function that plots histograms of the frequency and distribution of step lengths and turning angles for each individual (see results section).

Step 7: Create additional graphs that represent frequency of step length and angle as percentages. We noticed that the frequency graphs generated in the moveHMM output represented counts on the y-axis rather than percentages. Because the y-axis wasn’t standardized, it was difficult to compare step length and turning angle across the nine individuals. Thus, we created two additional histograms that represent frequency as a percentage on the y-axis. Note: the histogram of turning angles is represented as a rose diagram.
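Converting counts to percentages only requires dividing each bin count by the individual’s total number of steps; a sketch with hypothetical step lengths and bins:

```python
import numpy as np

# Hypothetical step lengths (m) for one individual
steps = np.array([5.0, 12.0, 18.0, 25.0, 40.0, 44.0])

# Histogram counts in 20 m bins, then normalise to percent of that
# individual's steps so y-axes are comparable across people
counts, edges = np.histogram(steps, bins=[0, 20, 40, 60])
percent = 100.0 * counts / counts.sum()
```

The same `percent` values can be passed to any plotting routine so every individual’s histogram shares a 0–100% y-axis.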

Step 8: Plot the tracks on a graph for a visual bonus. The final thing we did was plot each GPS track onto a graph so the viewer has additional context when interpreting the results.

Brief description of results you obtained.

We were able to successfully calculate the turning angles and step-lengths of each individual at one-minute intervals. We also produced graphs of the frequency and distribution of these values, allowing us to make comparisons between individuals and identify common patterns and trends. See images and pdf links below.

I was surprised to see the variation in movement across the nine individuals. Given that 8 of the 9 individuals were hiking on designated trails, I assumed the behavior would be relatively homogeneous, particularly the turning angles. However, the figures indicated that people who spent a relatively short amount of time in the area (<30 minutes) tended to have shorter step lengths and more variability in turning angles.  This could be explained by the environmental features of the areas near the parking lot. Along the southeastern shoreline of String Lake there are locations that are denuded of vegetation and offer ideal conditions for ‘beaching’, sight-seeing, and/or picnicking. Therefore, people who recreate in these areas may have shorter step lengths and greater variation in turning angles than those people who choose to venture further along the trail.

Figure 1: Example of a water-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

Figure 2: Example of a land-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

Figure 3: Example of another land-user track. From the top (going clockwise) : plots describing the change in step lengths and turning angles over time; plot of the track with long and lat coordinates; rose diagram representing the frequency of turning angles; histogram representing frequency of step length (in meters).

To see all of the output, click on the pdf links below

Histograms_StepLength

Histograms_TurningAngle

Original_moveHMM_Output

Satelite_Track_Images

Critique of the method – what was useful, what was not?

What was useful:

The moveHMM package was easy to understand and provided an example dataset that guided me throughout the analysis. Once I was able to get my data into the proper format, the moveHMM package took care of the calculations and created the initial figures. Additionally, the vignette included a brief overview of the theory and defined the concepts of the hidden Markov Model.

Some criticisms:

The figures generated by the moveHMM package were helpful in representing the distribution and values of each person’s step length and turning angle. However, the y-axis of the histograms represented frequency as a count rather than a percentage, so the y-axis varied for each individual depending on the number of step segments each user generated while carrying the GPS unit. This variation made it difficult to make comparisons between individuals, so we chose to create supplemental histograms that represent frequency in percent rather than tallies.

References:

Click here to view the moveHMM documentation

Edelhoff, H., Signer, J., & Balkenhol, N. (2016). Path segmentation for beginners: An overview of current methods for detecting changes in animal movement patterns. Movement Ecology, 4(1).

Michelot, T., Langrock, R., & Patterson, T. (2016). moveHMM An R package for the analysis of animal movement data, 1–24.

Patterson, T. A., Basson, M., Bravington, M. V., & Gunn, J. S. (2009). Classifying movement behaviour in relation to environmental conditions using hidden Markov models. Journal of Animal Ecology, 78(6), 1113–1123.

 

 

April 24, 2018

Spatial distribution of behavior changes

Filed under: Exercise/Tutorial 1 2018 @ 9:00 pm

Question:

Since writing ‘My Spatial Problem’, a behavioral change point analysis tool was applied to the swim paths of juvenile salmon as they encountered boom angles of 20, 30, and 40 degrees. Boiling entire swim paths down to one behavioral change point greatly simplifies the investigation of the hydraulic thresholds that may or may not incite behavioral changes. A behavioral change is identified by a change in swim velocity, swim direction, or both; for the purposes of this study, the type of behavioral change is important. Do fish that pass the boom do so at one hydraulic threshold or location, while fish that are halted do so at another? The question asked in this investigation is: do ‘passing’ and ‘halting’ behaviors occur at different locations in space at any boom angle, or between boom angles?

Method and steps for analysis:

Python was used to visualize behavior changes in two ways: kernel density estimation and multidimensional confidence intervals. First, behavior changes were classified as either ‘passing’ or ‘halting’. ‘Passing’ behavior precedes downstream movement. ‘Halting’ behavior precedes upstream or pausing movement. The results of these classifications for all 20 degree trials are shown in Figure 1.

Figure 1. Behavioral changes of all fish during 20 degree trials. Red markers indicate halting behavior, that which precedes upstream movement or pausing of downstream movement. Blue markers indicate passing behavior.

Second, to determine whether the spatial distributions of passing and halting behavior overlap, kernel density estimates were calculated using the scipy.stats function gaussian_kde. A kernel density estimate approximates a variable’s probability density function; two are shown in Figure 2. However, this method presents two shortfalls for the purposes of this investigation: 1) it fails to provide direct estimates of confidence, so its statistical power is low, and 2) the overlap between kernel density estimates, of which there is plenty in these data, closely resembles a bruise.
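In outline, the gaussian_kde step looks like this (the coordinates below are simulated for illustration, not trial data):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical x, y positions of behavior-change points for one class
xy = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(200, 2)).T  # shape (2, n)

kde = gaussian_kde(xy)                    # fit the 2-D density
xx, yy = np.mgrid[0:4:50j, -1:3:50j]      # evaluation grid over the channel
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
```

`density` can then be contoured (e.g. with matplotlib) to produce the filled overlap plots shown in Figure 2.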

 

Figure 2. Kernel density estimates of passing (blue) and halting (red) behaviors over all 30 degree trials. Overlap is shown in purple.

A more direct method of determining spatial independence in two dimensions is with multidimensional confidence intervals. A slew of boot-legged functions for calculating and plotting confidence ellipses is available on Python help pages like GitHub and StackOverflow. StackOverflow user ‘Rodrigo’ provides helpful code, which was modified to create the confidence intervals in Figure 3.
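The core of such a confidence ellipse is an eigendecomposition of the sample covariance scaled by a chi-square quantile; a self-contained sketch (not the StackOverflow code that was actually used, and assuming approximately Gaussian scatter):

```python
import numpy as np
from scipy.stats import chi2

def confidence_ellipse(points, level=0.95):
    """Centre, semi-axis lengths, and orientation (radians) of a 2-D
    confidence ellipse for approximately Gaussian scatter."""
    pts = np.asarray(points, float)
    centre = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)              # ascending eigenvalues
    scale = chi2.ppf(level, df=2)                   # ~5.991 for 95%, 2 dof
    semi_axes = np.sqrt(scale * evals)
    angle = np.arctan2(evecs[1, -1], evecs[0, -1])  # major-axis direction
    return centre, semi_axes, angle

# Hypothetical scatter of behavior-change locations
rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 2))
centre, semi_axes, angle = confidence_ellipse(pts)
```

Two classes’ ellipses overlapping substantially, as in Figure 3, is what motivates the “no spatial separation” conclusion below.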

 

Figure 3. Multidimensional, 95% confidence intervals for the spatial distributions of halting and passing fish behaviors during trials at 40 degrees.

Results:

Because the 95% multidimensional confidence intervals for passing and halting fish behaviors show substantial overlap at all boom angles, there is no evidence that passing and halting behavioral changes occur at different locations in the channel during trials. Furthermore, the behavioral changes at 20, 30, and 40 degrees show no significant differences in spatial distribution from one another (Figure 4). This finding holds promise for future analyses: if a threshold exists in some hydraulic variable (turbulence, water speed, etc.) for inciting behavioral changes (either halting or passing), it likely exists where a threshold appears in Figure 4, once a fish has passed between 0% and 25% of the floating guidance structure.

Figure 4. Behavioral changes and their 95% confidence intervals at 20, 30, and 40 degrees imply that the hydraulic signature of a floating guidance structure at a consistent fraction of its length (between 0 and 0.25) incites a reaction from juvenile fish.

Critique of methods:

Kernel density estimates are useful for visualizing the density of entities that are spatially independent of one another. However, the statistical significance of any overlap is unclear and difficult to present with small numbers of observations, which blur densities. Multidimensional confidence intervals, on the other hand, show clear estimates of confidence and overlap in this dataset.

Among-site differences in giant kelp (Macrocystis pyrifera) temporal autocorrelation

Filed under: 2018,Exercise/Tutorial 1 2018 @ 5:15 pm

Questions

As a preliminary analysis to explore pattern before incorporating other variables, I investigated how giant kelp (Macrocystis pyrifera) autocorrelation varied through time. By way of refresher, these abundance data are from seven subtidal sites, each comprised of five 10 m × 2 m transects. All 35 transects have been sampled biannually (every 6 months) since 1980. Based on personal observations, I suspect that the physical substrate underlying these sites is highly variable, both within and among sites (as is the associated subtidal community structure), such that averaging these transects up to the site level would gloss over pattern that might otherwise provide insight into spatiotemporal dynamics. Specifically, for this exercise I asked:

  • Are patterns of M. pyrifera temporal autocorrelation similar within-sites (among-transects)?
  • Are patterns of M. pyrifera temporal autocorrelation similar among-sites?

Approach

I used the base autocorrelation function (acf) in RStudio and ggplot2 to visualize these data, and Excel to create a spreadsheet tallying instances outside of the confidence interval.

Steps

  • Existing code was used to structure my data into data frames for the relevant sites and transects (dplyr)
  • Use the base R acf function to calculate lagged temporal autocorrelation coefficients for each of the 35 transects.
  • Create new data frame of correlation coefficients, and use those data to group and visualize with ggplot2 the five transects comprising each site.
  • Use correlation coefficient data to create spreadsheet tallying the instances of positive autocorrelation before the ‘first drop’ below the noise confidence interval, and tally subsequent peaks or dips above or below the confidence interval.
  • Examine within- and among-site patterns
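The acf coefficients, the ±1.96/√N noise band, and the ‘first drop’ tally can all be reproduced by hand; the steps above can be sketched on a synthetic seasonal series (illustrative only, not the kelp data):

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag, as R's acf() computes it."""
    x = np.asarray(series, float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic series with an 8-step cycle (cf. biannual sampling swings)
x = np.sin(np.arange(40) * 2 * np.pi / 8)
r = acf(x, max_lag=10)
ci = 1.96 / np.sqrt(len(x))   # white-noise confidence bound drawn by acf()

# 'First drop': first lag whose coefficient falls inside the noise band
first_drop = next(k for k in range(1, 11) if abs(r[k]) < ci)
```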

Results

While I did not use a statistical test to evaluate my spreadsheet values, a visual examination provides insight into the differences both within and among sites. To address my first question (within-site, or among-transect variation), I do see differences in the temporal scale of correlation, though it is unclear how significant or meaningful these differences are.

The among-site patterns are more apparent, with certain sites exhibiting either no instances of positive autocorrelation, or a single point before dropping into the confidence interval (e.g., West Dutch, East Dutch). This indicates rapid shifts in M. pyrifera abundance at the 6-month interval. These same two sites also exhibited the longest lagged temporal scale of positive correlation (e.g., at lags 15, 16, and 18, or 8–9 years).

Other sites exhibited longer periods of positive autocorrelation before the ‘first drop’ (i.e., West End Urchin, West End Kelp, and NavFac), with one site, Sandy Cove, almost uniformly exhibiting positive autocorrelation out to lag 7 (3.5 years). However, results from this site must be qualified by the dramatic shifts in M. pyrifera abundance over time, and thus non-stationarity is almost guaranteed. That said, Sandy Cove also exhibited long periods of negative autocorrelation, often into the ‘uninterpretable range’ (i.e., past approximately one third of the total temporal scale).

Critique of the method

I did find it useful to calculate and view the temporal autocorrelation coefficients for M. pyrifera. These results support my ‘sense’ of the system that there is substantial variation among-sites (despite all these sites being within 10km of one another). While differences were found within-site, it is unclear how significant or substantive the within-site variation is. While autocorrelation obviously cannot shed light on causal factors, mechanisms, or even associations (without other variables included) underlying these patterns of correlation coefficients, I do believe they provide grounds to proceed with an investigation into the associations between physical substrate complexity and M. pyrifera density through time.

Day_acf Day ED_acf ED WD_acf WD WEK_acf WEK WEU_acf WEU SC_acf SC NF_acf NF

Table 1: temporal lag of steps before dropping into the confidence interval, and any subsequent departures from the confidence interval. The orange box indicates sites exposed to high storm surge, and the blue box sites relatively sheltered from large wave events. The column on the far right uses color to qualitatively depict the relative degree of physical substrate complexity at each site, with red sites exhibiting large pinnacle structures (i.e., high relief), and green sites almost uniformly flat (i.e., low relief).

Fig. 1: 2m bathymetry for Sandy Cove, Dutch Harbor, and Dayona (L-R), with red depicting vertical slope, and green depicting flat (i.e., no slope)

April 22, 2018

Mapping fire extent from binary point data

Filed under: Exercise/Tutorial 1 2018 @ 11:25 am

(Please click the links to view all figures.  They aren’t very clear or large in the post!)

Question that you asked?

My overall objective is to build a predictive model of the annual spatial distribution of fire across my fire reconstruction area.  This will inform how fire extent and distribution were related to climate, topography, and fuels. Prior to doing this I need to answer these questions:

  1. What is the best method for mapping fire boundaries from binary point data?
  2. What do these maps indicate about spatial patterns of historical fires?

Dataset – I reconstructed fire history at 31 sampling points evenly distributed on a 5 km grid, and 21 points that encircled landscape patches of lodgepole pine forest (Figure 1). Point samples were denser near lodgepole pine forests because these forests may limit fire spread due to slow fuel recovery. At each sampling point I collected 3–6 cross sections from fire-scarred trees (194 trees total). All cross sections were precisely dated, and 1,969 fire scars were assigned to the calendar year of their formation. At each sample point, individual tree records were composited into a record of fire occurrence at that point. Pyrogeographers composite fire records at points because most trees that survive fire do not form and preserve fire scars even when directly in the path of fire, and because recorder trees record fire events over different time periods. Obtaining a full census of fire events at a sample point (e.g. Figure 2) requires sampling multiple recorder trees within a defined search radius (250 meters in my study). I eliminated 89 scars that occurred on only one tree or could have been attributed to lightning, mechanical, or insect damage.
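Compositing amounts to taking the union of dated fire years across a point’s recorder trees, optionally filtering out years recorded by only one tree; a sketch with hypothetical tree records (not the actual fire-year data):

```python
from collections import Counter

# Hypothetical recorder trees at one sample point: sets of dated fire years
trees = {
    "tree_a": {1706, 1729, 1751},
    "tree_b": {1729, 1783},
    "tree_c": {1751, 1783, 1822},
}

# Composite record: a fire is recorded at the point in any year in which
# at least one recorder tree has a dated scar
composite = sorted(set().union(*trees.values()))

# A minimum-recording filter: drop years scarred on only one tree
year_counts = Counter(y for yrs in trees.values() for y in yrs)
filtered = sorted(y for y, n in year_counts.items() if n >= 2)
```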

Figure 1 Point samples and recorder trees across the study area

Figure 2. Top panel –  individual recorder trees at a sample point, Bottom panel – composite records at each sample point. Vertical slashes on timelines indicate fire events.

 

Name of the tool or approach that you used?

Question 1 – I compared three different tools for mapping fire extent from binary point data. This has also been done by Hessl et al. (2008), but for a different study landscape, using a smaller sampling grain over smaller areas.

  1. Thiessen polygons are polygons whose boundaries define the area closest to a sample point. Using Thiessen polygons to map fires assumes the best evidence of fire or no fire at an unsampled point is the record at the nearest sampled point; thus all unsampled areas are assigned the record of the nearest sampled point.
  2. Kriging is an optimal interpolation method that uses a linear combination of weights at known points to estimate the value at an unknown point (Burrough and McDonnell 1998).
  3. Inverse Distance Weighting (IDW) is a deterministic interpolation method that calculates the value of an unsampled cell as a distance-weighted average of sample points within a specified neighborhood (Burrough and McDonnell 1998).
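As a concrete illustration of method 3 (and of how it converges on method 1), an IDW estimate at a single unsampled point might be sketched as follows (the coordinates and burn records are hypothetical):

```python
import numpy as np

def idw(xy_known, values, xy_target, power=2.0):
    """Inverse-distance-weighted estimate at one unsampled point.
    As `power` grows, the nearest sample dominates and the estimate
    approaches the Thiessen-polygon (nearest-neighbour) value."""
    d = np.linalg.norm(np.asarray(xy_known, float) - np.asarray(xy_target, float), axis=1)
    if np.any(d == 0):                      # target coincides with a sample
        return float(np.asarray(values, float)[d == 0][0])
    w = 1.0 / d**power
    return float(np.sum(w * np.asarray(values, float)) / np.sum(w))

# Binary burned (1) / unburned (0) records at three sample points
est = idw([(0, 0), (10, 0), (0, 10)], [1, 0, 0], (2, 2), power=2)
hi  = idw([(0, 0), (10, 0), (0, 10)], [1, 0, 0], (2, 2), power=20)
```

With binary inputs the output is a burned/unburned gradient between 0 and 1, which is why intermediate values read as uncertainty in the maps described below.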

Question 2 – To assess spatial variation in mapped fires and the sum of all fires that occurred I used three approaches

  1. I created an animation of fires from 1700–1925 using the Time Slider tool in ArcMap to visualize each fire event using the Thiessen polygon mapping method
  2. I mapped fire history statistics (e.g. mean, maximum, interval, CV of interval, fire size) at each point using the Thiessen polygon and Kriging methods
  3. I performed cluster analysis on the occurrence of fire from 1700–1918, and then mapped the fire groups.

 

Brief description of steps you followed to complete the analysis?

All mapping approaches were tested in ArcMap.

Question 1

1a. I used the Create Thiessen Polygons tool with my sample points as the input. Make sure to click Environments to specify the processing extent or you will likely get unintended results! Outer polygons extend infinitely and need to be clipped; I clipped them using my study area boundary (2.5 km from any sampled point).

To map fires for each year, I used R to composite the fire-year data for each tree into a record at each sample point. In ArcMap, I joined these data to my sample points and the Thiessen polygons surrounding each sample point.

1b and 1c. Prior to Kriging and IDW I created a binary matrix of fire occurrence for each fire year at each plot, where rows were plots and columns were fire years (52 rows by 132 columns). For both Kriging and IDW the input features were my sample points, and the value field was the binary presence/absence of the fire year I wanted to map.

I initially used the default settings for Kriging and IDW. After comparing results with the Thiessen polygon method, I adjusted the importance of near versus far points by adjusting the search radius. For IDW you can also adjust the power parameter; as power increases, IDW approaches the Thiessen polygon method, where the interpolated value takes on the value of the nearest known point.

Kriging settings: method = ordinary semivariogram, model = spherical, search radius variable, number of points = 12.

IDW settings: power = 2, search radius variable, number of points = 12.

Question 2

2a. ArcMap has a convenient Time Slider that allows you to move through time to visualize temporal variation in spatial data. I simply followed a tutorial to use the tool.

http://desktop.arcgis.com/en/arcmap/10.3/map/animation/creating-a-time-animation.htm

Make sure that you store all shapefiles that an animation depends on in a file geodatabase, or the animation and Time Slider will not function correctly. The nice feature of enabling time is that any joins you make are dynamic and info tables update with each time step, but only if you store files in a geodatabase!

2b. I calculated summary statistics for the fire record at each plot from 1700-1918 using the ddply function in the plyr package in R. This helpful function allows you to apply a set of functions to a group identifier (plot) within a data frame. I’m happy to share the code if this is useful for someone in the class. After the summary table was created, I joined it to my shapefile of sample points in ArcMap to map variation in the summary statistics. I used Thiessen polygons and Kriging to spatially represent variation in the statistics.
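The same per-plot summaries can be computed with plyr::ddply or, as sketched here, with base R's aggregate(). The data frame, field names, and interval statistics below are hypothetical examples, not the actual dataset:

```r
# Per-plot fire-interval statistics (mean, max, CV of interval), using base
# R's aggregate() as an analogue to plyr::ddply(). Field names are made up.
fires <- data.frame(
  plot      = c("P01", "P01", "P01", "P02", "P02", "P02"),
  fire_year = c(1700, 1712, 1730, 1705, 1725, 1760)
)

interval_stats <- function(years) {
  iv <- diff(sort(years))  # fire-free intervals between successive fires
  c(mean_interval = mean(iv),
    max_interval  = max(iv),
    cv_interval   = sd(iv) / mean(iv))
}

stats_by_plot <- aggregate(fire_year ~ plot, data = fires, FUN = interval_stats)
stats_by_plot
```

The resulting table (one row per plot) is what gets joined back to the sample-point shapefile for mapping.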

2c. Taylor and Skinner (2003) used cluster analysis to identify and map spatial patterns of fire occurrence. Similarly, I created a binary matrix of fire occurrence where rows were sample points and columns were fire years.  Cluster analysis was performed in PC-ORD using a Sorensen distance measure with a flexible beta method of β = 0.25. The resulting dendrogram was pruned by examining stem length and branching distribution to identify nodes that maximized both within-group homogeneity and between-group differences, while minimizing the number of groups (McCune and Grace 2002).
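The clustering step was done in PC-ORD, but an analogous run can be sketched in base R. The Sorensen distance on binary data equals Bray-Curtis, which is easy to compute by hand; base R's hclust() does not offer the flexible-beta linkage, so average linkage is substituted here as an approximation. The small matrix is invented for illustration:

```r
# Sorensen (Bray-Curtis on binary data) distances between plots' fire
# records, followed by hierarchical clustering. Average linkage stands in
# for PC-ORD's flexible-beta method, which base hclust() lacks.
fire_matrix <- rbind(
  P01 = c(1, 1, 0, 1),
  P02 = c(1, 1, 0, 0),
  P03 = c(0, 0, 1, 1)
)

sorensen <- function(a, b) 1 - 2 * sum(a & b) / (sum(a) + sum(b))

n <- nrow(fire_matrix)
d <- matrix(0, n, n,
            dimnames = list(rownames(fire_matrix), rownames(fire_matrix)))
for (i in 1:n) for (j in 1:n)
  d[i, j] <- sorensen(fire_matrix[i, ], fire_matrix[j, ])

# Cut the dendrogram into k groups; these labels are the "fire groups"
groups <- cutree(hclust(as.dist(d), method = "average"), k = 2)
groups  # P01 and P02 share fire years and cluster together
```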

 

Brief description of the results you obtained? (I went overboard on what I included, but it’s all useful at this stage)

I obtained maps of all fire events using Thiessen polygons, and maps of selected fires using Kriging and IDW for comparing the methods. In these maps, red indicates area burned and blue indicates unburned. The gradient between them indicates uncertainty for Kriging and IDW. Trees that recorded fire are represented by black points, and trees that did not are represented by white points.

1918 map of fire extent comparing interpolation methods

1829 map of fire extent comparing interpolation methods

A movie of fire events based on Thiessen polygon mapping (this file is too large to attach 🙁)

FireMetrics mapped using thiessen polygons

FireMetrics_Kriging Metrics mapped using Kriging

FireGroups Fire occurrence groups identified through cluster analysis

I was also able to calculate and graph fire extent by year using the Thiessen polygon method.

 

Critique of the method- what was useful, what was not?

Fire mapping techniques – The advantage of the Kriging and IDW techniques is that multiple data points are used to interpolate fires, whereas Thiessen polygons are informed only by the nearest point. Additionally, Kriging and IDW can represent uncertainty in fire perimeters, while Thiessen polygons produce abrupt fire boundaries that are an artifact of the sample distribution.

Kriging produced the most realistic-looking and attractive fire maps for many fires (e.g. 1918). However, Kriging poorly represented several large fires with irregular unburned patches (e.g. 1829). Kriging requires that the spatial variation in the variable being represented is similar at all locations (spatial homogeneity), and it performs best with uniform sampling density. Irregular unburned patches occur in several of the large fires in my study landscape; logically, they occur where fire burned and removed fuel in the years preceding the fire of interest. For example, the large unburned area in 1829 on the east side of the study landscape burned in 1827. In combination, irregular burn probabilities and non-uniform sampling limit the utility of Kriging to consistently represent fire perimeters for my data and study landscape.

IDW was not similarly limited by irregular burn patterns. However, IDW creates a bullseye pattern of high-to-low burn probability where sample points are isolated or lie on burn perimeters. This imputes a lower burn probability to the area between a somewhat isolated point and the main mass of the fire. When all points within a large area recorded fire, IDW imputes a higher burn probability to the unsampled area at the center than to the sampled points that actually recorded fire. In reality we know the sampled points burned, and they should not have lower burn probability than the unsampled points. IDW’s representation of fire can be improved by decreasing the search neighborhood or increasing the power, but doing so approaches the Thiessen polygon interpolation technique (see the 1829 map).

Both Kriging and IDW are time intensive and would require a different, subjective threshold for each fire map to delineate burned and unburned area. The Thiessen polygon method ultimately provides the most efficient, objective, and parsimonious way to map fire perimeters given the distribution of my sample points. Watching the animation of fire events mapped with Thiessen polygons, the most important pattern that appeared was a consistent lag between fires that prevented reburn within short (<5 year) time periods. Thus, fuel recovery after fire may constrain fire extent, and time since fire may be an important predictor of the annual spatial distribution of fire in the landscape. Kriging and IDW assume a higher likelihood of fire at points where no fire was recorded that are surrounded by points that did record fire. This provides another rationale for using the binary Thiessen interpolation method.

In choosing the Thiessen polygon method, I also considered that area burned and fire metrics were highly and significantly correlated across all interpolation techniques (Hessl et al. 2007). Furthermore, Thiessen polygons accurately represented burn perimeters and fire frequency in a validation against known modern-day fire perimeters (Farris et al. 2010).

Mapping Fire Metrics

I preferred the Kriging method for identifying regions of the landscape with distinct fire regime metrics. Kriging incorporates more than just the nearest sample point, allowing regions with higher or lower values of a metric to be clearly represented. The southeast region of the study area burned with lower frequency, longer maximum intervals, and higher variability. This area has a high concentration of low-productivity, relatively fuel-limited lodgepole pine forest.

Identifying and Mapping Fire Types

Cluster analysis of fire years appears to be a promising technique for identifying regions with a similar fire history. The fire groups were geographically clustered in the study landscape. It may be possible to use these fire types to identify landscape features that constrain fire. This map suggests that the similarity, or spatial autocorrelation, of fire history varies depending on position within the landscape, which in turn suggests a non-uniform distribution of landscape features that constrain fire.

References

See this link for more about Kriging

http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/spatial_analyst_tools/how_krige_and_variogram_work.htm

Burrough P and McDonnell R (1998) Principles of Geographical Information Systems. Oxford: Oxford University Press.
Dieterich JH (1980) The composite fire interval: a tool for more accurate interpretation of fire history. USFS GTR-RM-81.
Farris CA, Baisan CH, Falk DA, Yool SR, Swetnam TW (2010) Spatial and temporal corroboration of a fire-scar based fire history in a frequently burned ponderosa pine forest. Ecological Applications 20(6):1598-1614.
Hessl A, Miller J, Kernan J, Keenum D, McKenzie D (2007) Mapping paleo-fire boundaries from binary point data: comparing interpolation methods. The Professional Geographer 59(1):87-104.
Taylor AH and Skinner CN (2003) Spatial patterns and controls on historical fire regimes and forest structure in the Klamath Mountains. Ecological Applications 13(3):704-719.

April 20, 2018

Temporal Cross Correlation Between Bull Kelp Patches in Oregon

Filed under: 2018,Exercise/Tutorial 1 2018 @ 7:32 am

One of the ultimate questions of my work is what factors drive bull kelp populations in northern California versus in Oregon, and how those drivers compare. With this exercise I wanted to examine whether patches exhibited temporal synchrony from year to year. If they are in sync, this may suggest that the two populations are being driven by large-scale, coastwide factors (e.g. ENSO phase, wind strength). If not, this gives evidence that the populations are being influenced more by local factors than by shared, coast-wide factors. One way to look at temporal synchrony of two areas is via cross correlation.

To conduct cross correlation, I used both the randomly generated data and my actual data. The randomly generated data should be representative of the null hypothesis that there is no synchrony between patches. I can then compare this to what patterns of cross correlation I get for my actual data, to see if it differs measurably or conforms to a similar pattern as my null hypothesis.

To interpret cross correlation, you first need to look at the autocorrelation of each variable. I used the acf function (from R’s built-in stats package) to conduct autocorrelation on the time series for two patches, Orford Reef and Rogue Reef (see Figure 1). With the randomly generated data, neither of the patches shows any significant autocorrelation.

Figure 1: Autocorrelation of maximum annual kelp coverage for Orford Reef (left) and Rogue Reef (right) with randomly generated data. Lag is in terms of years.

I also ran autocorrelation on two kelp patches from my real data. I would expect to see some kind of autocorrelation in real populations. It makes intrinsic sense that with a real population, the size of the population now should influence how many there are one step into the future. However, the patch at Orford Reef had no autocorrelation other than at time=0 (see Figure 2). The patch at Rogue Reef was somewhat auto-correlated at a time lag of a year, but otherwise had no significant autocorrelation. This indicates that the size of the kelp canopy one year tells us very little about what it will look like in the future, although at Rogue Reef, canopy size in the current year will have some positive correlation with the canopy size next year.

Figure 2: Autocorrelation of maximum annual kelp coverage for Orford Reef (left) and Rogue Reef (right). Lag time is in years.

Once I understood what autocorrelation looked like for each of these reefs, I moved on to the cross correlation between them. I did this using the ccf function, which, like acf, is part of R’s built-in stats package. For the randomly generated data, I expected no cross correlation, and this is what I saw (see Figure 3). The ccf results for the random data of Rogue Reef and Orford Reef did not go beyond the confidence envelope (except for one small point at a 10-year leading lag). Since the data are randomly generated, we can assume this small bit of cross correlation is coincidental.
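A minimal version of this null-model check can be run with acf() and ccf() from R's stats package. The simulated series below stand in for the randomly generated canopy data; the reef names are just labels:

```r
# Autocorrelation and cross-correlation of two white-noise series,
# mimicking the null model of no synchrony between kelp patches.
set.seed(42)
orford <- rnorm(35)  # 35 simulated years of max canopy cover
rogue  <- rnorm(35)

a <- acf(orford, plot = FALSE)
x <- ccf(orford, rogue, plot = FALSE)

a$acf[1]                         # lag-0 autocorrelation is always 1
conf <- qnorm(0.975) / sqrt(35)  # the ~95% confidence bound drawn in the plots
mean(abs(x$acf) > conf)          # fraction of lags outside the envelope (~5% expected)
```

Under the null, roughly 5% of lags will poke outside the envelope by chance alone, which is why the isolated significant lag in the random data can be dismissed.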

 

Figure 3: Cross correlation between maximum annual kelp cover at Rogue Reef and Orford Reef over a 35 year time series. The lag is in years.

With the real data, I would have expected some kind of cross-correlation between the reefs. The two are less than 20 miles apart and should be influenced by similar oceanographic conditions. However, the ccf graph for the real data was very similar to that of the random data. Other than a small amount of correlation at 0 years and -18 years, these two reefs were essentially uncorrelated.

Figure 4: Cross correlation between annual maximum kelp cover at Rogue Reef and Orford Reef. Lag is in years.

Overall, these results were surprising. For one, I was expecting some level of autocorrelation for the reefs. However, because bull kelp is an annual species, it is possible that the population resets each year and recruitment the next spring is determined by density-independent factors. Furthermore, since bull kelp have a bipartite life cycle, alternating between gametophytes and sporophytes, it’s possible that the overwinter transition from spores to gametophytes to gametes to young kelp the next spring further decouples maximum canopy size in the fall from population size the next year.

I was also somewhat surprised that there was no cross correlation between the two patches. The lack of cross correlation suggests that two reefs are not correlated on an annual or multi-annual scale. Therefore, despite the fact that the patches are within 20 miles of one another, apparently there is enough local variation in the factors controlling population size to create substantially different sizes and patterns between the two.

I found this technique to be very useful. If I am interpreting the results correctly, then this technique is already helping me uncover some surprising results. One caveat is that cross-correlation assumes stationarity, and given the intense inter-annual variability, it is not clear whether there are long-term changes in the mean of the population; my data may not fulfill this assumption. I would welcome any feedback on how to better assess stationarity in my time series.

 

April 18, 2018

Visualizing a recreationists movement through space and time

Filed under: 2018,Exercise/Tutorial 1 2018 @ 1:30 pm

Question that you asked

The question I asked for this exercise was how to plot and visually represent recreationists’ movements through time and space.  My goal was to plot five individual GPS tracks onto a graph with X & Y coordinates (Long & Lat), and a Z-coordinate representing time. To visualize and identify patterns within and between each individual’s movements through time, I aimed to represent change in time using color gradients. In order to make comparisons between the five tracks, I also sought to standardize the color ramp so that each GPS track was color coded using the same timestamp parameters.

For this exercise, I used five GPS tracks that were collected on July 19, 2017 at String & Leigh Lake in Grand Teton National Park.

Name of the tool or approach that you used.

Although the goal for this exercise was seemingly simple (visualizing a person’s movement through space and time) the biggest challenge for me was learning how to produce this result using R. I am still familiarizing myself to this software, so the learning curve was steep throughout all phases of the process.

The packages I used in R were:

ggmap – to load in a satellite image of the location where the GPS tracks were collected

ggplot – to plot the tracks onto a graph

leaflet – to create an interactive plot of the GPS tracks

 

Brief description of steps you followed to complete the analysis.

Step 1: Put data into correct format. The first thing I needed to do was manipulate the data into an appropriate format for analysis.  I converted the shapefiles into a csv format in R.

Step 2: Subset data. The dataframe I was working in contained 652 GPS tracks spanning July 15 – September 8. Thus, I needed to take a subset of those tracks to simplify the initial analysis and make it easier to test my code. I chose to subset five tracks that were collected on July 19, 2017.

Step 3: Set up temporal color ramp parameters and palette. In order to make temporal comparisons between each of the GPS tracks, I needed to standardize the color ramp so each GPS point was given the same color code based on its timestamp. To do this, I calculated the maximum and minimum time value for all five tracks combined. These max/min values gave me the temporal parameters for the color ramp.

After I defined the values for the color ramp, I created a color palette that could be applied to each of the tracks. I coded and defined this palette in R.
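A shared palette like this can be built with base R's colorRampPalette, keyed to the combined min/max timestamps. The times, colors, and helper function below are hypothetical, not the code used for the actual tracks:

```r
# Map each GPS point's timestamp onto a shared color ramp so that the same
# clock time gets the same color across all five tracks.
t_min <- as.numeric(as.POSIXct("2017-07-19 08:00:00"))  # earliest point, all tracks
t_max <- as.numeric(as.POSIXct("2017-07-19 17:00:00"))  # latest point, all tracks

pal <- colorRampPalette(c("navy", "yellow", "red"))(100)

time_color <- function(t) {
  # Scale a timestamp into 1..100 and pick the corresponding ramp color
  idx <- 1 + round(99 * (as.numeric(t) - t_min) / (t_max - t_min))
  pal[idx]
}

time_color(as.POSIXct("2017-07-19 08:00:00"))  # first color on the ramp
```

Because every track is scaled against the same t_min and t_max, identical timestamps get identical colors, which is what makes the five plots comparable.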

Step 4: Pull in a satellite map image layer for the graph. To provide more context and meaning to the plotted GPS tracks, I thought it would be helpful to add a visual of the location where the tracks were collected. To do this, I downloaded the ‘ggmap’ package in R, which allowed me to plug in the coordinates of the GPS tracks and extract a map image layer from Google Earth.

Step 5: Plot the GPS tracks onto the graph. I used the ‘ggplot2’ package to plot the GPS points onto the XY coordinate graph. I applied the satellite imagery as a background layer and the color palette that I defined in Step 3. I produced five graphs (see results section).

Bonus approach: After successfully plotting the GPS tracks, I was curious to find out if there were more interactive ways to visualize a recreationists’ movement through time and space. I discovered the ‘leaflet’ package that allows a person to plot data (from both raster and vector data structures) and then zoom in and out, click on features, and essentially interact more dynamically with the results. I was eager to explore this package using my GPS data. After familiarizing myself with the syntax and objects for this package, I was able to successfully plot the GPS points and color code the tracks based on their timestamp (see results sections).

 

Brief description of results you obtained.

The results I obtained were five graphs that represented the movement of five recreationists through time at String and Leigh Lake in Grand Teton National Park. These results successfully demonstrate the first step necessary for most data analysis: visualizing the data to identify patterns and inform the next steps. Among the five tracks I selected, there was variability in the amount of time spent in the area, the locations traveled, and the timing of recreation. I intentionally selected one track that represented a water-based recreation user to visualize how their movement compared to those on land.

I was also successful in generating plots that allowed me to zoom in and out of the points. By doing this, I was able to recognize more detail in the recreation user’s movement. Further, this result allowed me to clearly see where the GPS points were stacked up or close together (i.e. stationary or slowed movement) and where the person was moving at a higher speed (i.e. points more spread out, evenly spaced). I was unable to embed the html document into WordPress but am currently working on a method to display these plots. For now, here are a few screenshots that hopefully illustrate the dynamism of this type of plot.

From these results I intend to dive into the next phase of analysis to better understand the behaviors and underlying processes that cause and influence human movement in recreation settings. These processes could be temporal (are there certain times of day or times in the season that influence the movement of people?), environmental (how does the landscape influence the behavior and movement of a person?), social (does group size influence how people move and recreate?), etc. Before doing that, my goal is to analyze some behavioral characteristics within each person’s track. These characteristics include the individual’s step length at one-minute time intervals, and the person’s turning angles at one-minute time intervals. Stay tuned for that tutorial.
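The planned movement metrics can be sketched in base R. This is a toy example with invented projected coordinates (meters); real long/lat fixes would need great-circle distances rather than the Euclidean distances used here:

```r
# Step lengths and turning angles from consecutive GPS fixes.
# Coordinates are assumed projected (meters); lat/long data would
# require great-circle distances instead of Euclidean ones.
x <- c(0, 3, 3, 6)
y <- c(0, 4, 8, 8)

dx <- diff(x); dy <- diff(y)
step_length <- sqrt(dx^2 + dy^2)  # distance moved per interval
heading     <- atan2(dy, dx)      # direction of each step (radians)
turn_angle  <- diff(heading)      # change in direction between steps

step_length  # 5, 4, 3
```

Short steps with large turning angles flag stationary or meandering behavior, while long steps with small turns flag directed travel, which is exactly the contrast visible in the stacked versus evenly spaced points above.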

Critique of the method – what was useful, what was not?

What was useful: Once I learned the code and syntax to build these graphical representations of a human’s movement through time, I was very pleased with the results. The ggplot package has great functionality and allows me to represent the data in a number of ways. I also appreciated the ability to add a basemap to the graph using ggmap. Because I was new to the software and syntax, the bulk of the work involved building the code.

If people are interested in understanding how I built my code, I generated a text file that annotates the code and steps I took. Click here for the Exercise1_Code

Some criticisms:

a. Limitations in the zoom function for the satellite image — Adding the satellite image to the graph provided additional meaning to the results. However, the zoom feature was clunky. For example, a zoom value of 5 compared to a zoom value of 6 dramatically changed the scale of the image. I’m sure there are ways to work around this, but I wasn’t able to make it work for this exercise.

b. Interpreting time as an integer – In order to assign a color to the GPS points based on a date stamp, I needed to convert the value of the date stamp to an integer in R. A non-academic viewing the graphs may be able to understand overall changes in movement through time by simply looking at the variation in color, but they wouldn’t be able to interpret the integer values on the color ramp legend. In other words, by changing the date stamp to an integer, it’s hard to conceptually place the data points within a real point in time. I chose to work around this issue by including a start time and end time on each graph, but this was a cumbersome, crude approach.

c. The interactive leaflet map only works with an internet connection. Also, this course’s WordPress site won’t let me embed the image into the blog. The leaflet map was a neat bonus approach for visualizing the data. However, it would only be useful in situations where there is an internet connection and a computer monitor. Therefore, this tool is best used for PowerPoint presentations, websites, or web tutorials. I was frustrated that I couldn’t embed the html file into the WordPress blog, so you will have to simply view a screenshot of the plots. I believe this inability to embed the html is a function of the settings for the OSU blogs.

April 13, 2018

Autocorrelation of GPP and LUE at C3 vs C4 grassland sites

Filed under: 2018,Exercise/Tutorial 1 2018 @ 12:36 pm

Question Asked:

The question I asked in this exercise was: What degree and timing of temporal autocorrelation exists for LUE and GPP derived from eddy covariance flux data, and how does the degree and timing differ between a C3 and a C4 grassland site?

I aim to use an autocorrelational analysis to assess temporal patterns in these production indices, and to assess whether they reflect distinct seasonality between the C3 and C4 site. That is, if the production indices show distinct autocorrelation patterns, that may indicate that production at the C3 and C4 sites is responding to distinct environmental drivers. Because this is just an autocorrelational analysis and not a cross-correlational analysis, I am not quantifying the relationships between environmental drivers in this exercise.

Tool or approach used:

I used the R function stats::acf() to compute an autocorrelation function for my time series of production indices.

Steps followed to complete the analysis:

My data are derived from eddy covariance flux tower measurements, which are taken every 30 minutes. My data are from two grasslands—Konza Prairie Biological Station (US-Kon) is 99% C4 grass, and the University of Kansas Field Station is 75% C3 grass. I obtained gap-filled, 30-minute resolution data from site PIs, for 2008-2015.

From the 30-minute resolution data, I calculated GPP and LUE at daily, monthly, and annual time steps. I first sum and convert the EC flux measurements of GPP to units of gC/m^2/day. LUE is calculated as the total daily GPP divided by the total daily photosynthetic photon flux density (PPFD), giving units of gC/MJ. I then smooth the dataset using a 7-day rolling mean, and filter it to remove extreme values that are artifacts of pre-processing or instrument error.

For the monthly time interval, I calculate the monthly average of the daily values of GPP and LUE. I do this by grouping the data by month, and then calculating the mean of all values for that month.

For the annual time interval, I calculate the total annual GPP and the annual average LUE. I sum the daily values of GPP and PPFD, and then divide the two to obtain annual LUE. This yields GPP in gC/m^2/year and LUE in gC/MJ.
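The daily-to-monthly/annual aggregation described above can be sketched in base R. The simulated GPP and PPFD values and the column names below are hypothetical; the real data come from the flux towers:

```r
# Aggregate daily GPP (gC/m^2/day) and PPFD (MJ/m^2/day) to monthly means
# and annual totals, with annual LUE = total GPP / total PPFD.
daily <- data.frame(
  date = seq(as.Date("2008-01-01"), as.Date("2008-12-31"), by = "day")
)
set.seed(1)
daily$gpp  <- pmax(rnorm(nrow(daily), mean = 5, sd = 2), 0)   # simulated GPP
daily$ppfd <- pmax(rnorm(nrow(daily), mean = 8, sd = 2), 0.1) # simulated PPFD
daily$lue  <- daily$gpp / daily$ppfd                          # daily LUE (gC/MJ)

# 7-day rolling mean to smooth the daily series
daily$gpp_smooth <- as.numeric(stats::filter(daily$gpp, rep(1 / 7, 7), sides = 2))

month <- format(daily$date, "%Y-%m")
monthly_gpp <- tapply(daily$gpp, month, mean)   # monthly mean of daily GPP

annual_gpp <- sum(daily$gpp)                    # gC/m^2/year
annual_lue <- sum(daily$gpp) / sum(daily$ppfd)  # annual LUE (gC/MJ)
```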

However, the quality of these data varies widely due to environmental inconsistency and equipment error. Thus, even though the instruments are ostensibly recording data every 30 minutes, after post-processing and gap-filling I still end up with entire days, and sometimes months, of data missing.

For each dataset (daily, monthly, and annual GPP and LUE), I used the R stats::acf() function to compute estimates of the autocorrelation function.

Results:

Fig. 1: A plot of autocorrelation versus lag, for daily total GPP and LUE, for each site. Dashed lines represent upper and lower 95% confidence intervals, and vertical lines represent lags of 365 and 730 days, to approximate 1- and 2-year lags.

For the daily data, the two sites show distinct autocorrelation patterns. These plots appear to reflect distinct seasonality of production between the C3 and the C4 sites.

However, I’m not certain whether the different timing of autocorrelation reflects data that are missing due to filtering: I excluded values outside thresholds I had set, rather than setting them to 0 or NA. If values are missing for certain dates, that would affect the autocorrelation.

 

Fig.2: Plot of autocorrelation vs. lag for the monthly average of daily GPP and LUE. Dashed lines represent 95% confidence intervals for the autocorrelation function. Vertical lines represent lags of 12, 24, and 36 months.

For the monthly data, and in contrast to the daily data, this plot demonstrates autocorrelation of production indices that is very similar between the two sites.

 

Fig. 3: Plot of autocorrelation vs. lag for annual GPP. Dashed lines represent 95% confidence intervals for the autocorrelation function, colored for each site.

Similar to the plots of autocorrelation of the monthly average of daily GPP and LUE, the autocorrelation of the total annual GPP and average annual LUE show similar patterns between the two sites.

The daily, monthly, and annual datasets suggest alternate conclusions about the temporal autocorrelation of the data. Because the data are heavily filtered to exclude outlying values, data for entire days and months are missing and are not accounted for in the time series of observations.

Critique of the method:

Because the autocorrelation function relies only on the observations that are present to compute lags, and does not consult a date-time index, it cannot accurately represent missing data or irregular observation intervals. The unseen, missing data (particularly in the daily dataset) appear to be causing the offset in the autocorrelation function between the two sites. If the sites are missing data from different times, and are missing different numbers of observations at those times, then their autocorrelation functions will look different.
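This critique can be demonstrated with simulated data. The seasonal sine signal and the two-month gap below are invented, but they show the mechanism: dropping missing rows compresses the lag axis, while keeping them as NA (via na.action = na.pass) preserves the true time base:

```r
# Dropping missing days shifts every subsequent lag, while keeping them as
# NA (na.action = na.pass) preserves the calendar spacing of the series.
set.seed(7)
x <- sin(2 * pi * (1:730) / 365) + rnorm(730, sd = 0.2)  # two "years" of daily data

x_gappy <- x
x_gappy[100:160] <- NA  # two months of missing observations

# Lags in a_dropped no longer correspond to days: the series shrinks to 669 obs
a_dropped <- acf(na.omit(x_gappy), plot = FALSE, lag.max = 400)
a_passed  <- acf(x_gappy, na.action = na.pass, plot = FALSE, lag.max = 400)

length(na.omit(x_gappy))  # 669: the dropped-row series is 61 days shorter
```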

I will continue to explore gap-filling and extrapolation methods from my data in order to compare autocorrelation on sub-month time intervals.

April 9, 2018

Predicting Produce Safety Rule compliance through spatiotemporal analysis of publicly-available water quality data

Filed under: 2018,My Spatial Problem @ 11:05 am
Tags: ,

Research Question

Because of the numerous foodborne illness outbreaks associated with fresh produce, the Food and Drug Administration finalized the Produce Safety Rule in November 2015. This rule implements a variety of new food safety practices on the farm to prevent foodborne pathogens from reaching the consumer. As part of this new rule, growers of fresh produce are required to meet water testing requirements for all water used in the growing, handling, and harvesting of produce. Growers are expected to test their surface water source a minimum of 20 times to establish a baseline Water Quality Profile (WQP). The WQP is then to be updated annually with 5 additional samples. The WQP consists of the geometric mean and statistical threshold value of generic E. coli in the water.
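The two WQP statistics take only a few lines of R. The STV shown here is estimated as the lognormal 90th percentile of the samples, which is my reading of the common approach under the rule; the sample values are invented for illustration:

```r
# Water Quality Profile statistics from generic E. coli samples (CFU/100 mL).
# STV is estimated as the lognormal ~90th percentile (z = 1.282).
# Sample values are made up.
ecoli <- c(18, 45, 120, 300, 75, 33, 210, 88, 150, 60,
           25, 410, 95, 180, 55, 130, 70, 240, 40, 110)

log_vals <- log10(ecoli)
gm  <- 10^mean(log_vals)                           # geometric mean
stv <- 10^(mean(log_vals) + 1.282 * sd(log_vals))  # statistical threshold value

c(geometric_mean = gm, stv = stv)
```

With the 20 baseline samples plus 5 annual updates described above, both statistics are simply recomputed on the rolling sample set.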

My objectives with this dataset are twofold:

  1. Determine whether Oregon produce growers will face difficulty in meeting the water quality requirements based on historical trends
  2. Explore whether produce growers who share a common surface water source can pool their data to collectively establish a WQP to meet the requirements

Dataset

Oregon Department of Environmental Quality maintains a public database (Ambient Water Quality Monitoring System) of statewide surface water testing for a variety of contaminants. I will analyze the dataset for generic E. coli. I have also acquired data for pH and temperature as potential explanatory variables for the data set. These data exist as point data at DEQ monitoring stations that are adjacent to a water source (river/stream), with the data spanning from January 1, 2013 through December 21, 2016. Each monitoring station has different temporal spans (for example: one monitoring station only contains data for 2015, while another covers the entire three-year span).

Hypotheses

I hypothesize that generic E. coli concentrations will correlate most strongly with the time of year of sampling. I predict that pH and temperature variations will contribute insignificantly to fluctuations in generic E. coli. Additionally, I predict that trends will be consistent within each watershed but vary greatly between watersheds.

Approaches

 I will test the dataset within ArcGIS and R to determine statistically significant factors in generic E. coli concentrations within watersheds in the state of Oregon.

Expected Outcome

The outcome of this research will help inform food safety extension work. Additionally, these data may help growers alleviate the water-testing burden if we identify that current testing regimes by government agencies are sufficient to meet the requirements of the rule, or that the data within a watershed can be collectively shared to meet compliance standards.

Significance

This study will help guide the direction of future food safety extension work related to the Produce Safety Rule to prevent foodborne illness outbreaks associated with fresh produce.

Preparation

I am very comfortable working with the suite of ArcInfo software. I have limited beginner level experience with R, Python, Modelbuilder, and image-processing software (ENVI Classic).

April 8, 2018

Spatial and temporal variation in the historical fire regime of the Oregon pumice ecoregion

A description of the research question that you are exploring.

I’m researching historical fire regimes in ponderosa pine, lodgepole pine, and mixed-conifer forests in south central Oregon. Previous fire history reconstructions demonstrate that fires were historically frequent, occurring every 10-20 years, and were predominantly low severity. However, the size and sampling design of earlier reconstructions do not describe historical fire sizes and their spatial patterns across landscapes. We mapped historical fire perimeters to answer the following research questions:

1) How did topography, vegetation (fuels), climate, and previous fires constrain fire spread and fire perimeters?

2) Do these constraints create landscape regions (firesheds) with a distinct fire regime?

3) What defines a fireshed?  Do landforms or distinct vegetation types envelop them? In other words, are there significant spatial relationships between firesheds and a combination of topography and vegetation?

4) Are constraints on fire spread and perimeters stable over time, or do they vary temporally with climate or land use?

 

A description of the dataset you will be analyzing, including the spatial and temporal resolution and extent

I systematically reconstructed fire history at 52 plots by removing partial cross sections from 3-10 fire-scarred trees within 250 m (20 ha) of plot center. Fire scars were dated to their exact year of formation to build a record of fire events during ~1670-1919. The fire record for unsampled areas is represented by the nearest known fire record from a sampling point using Thiessen polygons (Figure 1). This method of interpolating fire history to unsampled areas assumes that the best predictor of fire history in an unsampled area is the nearest sampled area (Farris 2010). The fire reconstruction area is 85,000 ha and includes a mosaic of landforms, soil types, and forest types.
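The nearest-sample rule that defines Thiessen (Voronoi) polygons can be sketched as a simple nearest-neighbor lookup. The plot coordinates and fire years below are invented for illustration, not actual reconstruction data.

```python
import numpy as np

# Hypothetical plot coordinates (x, y in meters) and the fire years
# recorded at each plot; values are illustrative only.
plots = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
plot_fire_years = [[1700, 1720, 1745], [1705, 1750], [1700, 1760]]

def nearest_plot_record(point, plots, records):
    """Assign an unsampled point the fire record of the nearest sampled
    plot -- the assignment rule that defines Thiessen (Voronoi) polygons."""
    dists = np.hypot(plots[:, 0] - point[0], plots[:, 1] - point[1])
    return records[int(np.argmin(dists))]

# A location at (6, 6) falls within the Thiessen polygon of the third plot
print(nearest_plot_record((6.0, 6.0), plots, plot_fire_years))  # [1700, 1760]
```

In practice the polygons themselves would be built in a GIS; the lookup above is the underlying assumption made explicit.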

Fire_maps -Fig. 1

 

Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

Background

Fire regimes vary with and are constrained by broad-scale top-down drivers, primarily climate, and local bottom-up drivers, including topography (slope, aspect, landforms, soil), vegetation, and ignitions.

Top Down controls on fire

In central Oregon, climate and ignitions are generally not limiting to fire. Summers are hot and dry and lightning ignitions are abundant (Morris 1934). Previous dendrochronological reconstructions of historical fires demonstrate most fires burned in years with below average spring precipitation (Johnston 2016). However, fire size has not yet been related to climate and we do not know if extensive fires (10,000-80,000 ha) have a different relationship with annual and previous year climate than small fires.

Hypothesis 1 – Large fire events depend on continuous fuels in a hot, dry year. A series of wetter-than-average years followed by a drought year provides abundant fuel with low moisture, allowing extensive fires to spread.

If this hypothesis is true, I expect extensive fires to be negatively related to drought in previous years and positively related to drought in the year of the fire event. Small fires should have weaker relationships with climate or may be more common during cool, wet years.

Hypothesis 2 – Large and small fire events are limited by fuel not climate.

If this hypothesis is true, I expect no relationship between previous-year climate and fire size, and that both extensive and small fires are positively correlated with drought in the year of the fire event. Fire maps would also show that fires did not burn areas that had recently burned.

 

Bottom-up controls on fire

I expect that the bottom-up drivers of topography and fuels are stronger constraints on fire spread than climate and will explain more of the spatial variation in fire history. Merschel et al. (2018) demonstrated that pumice basins characterized by coarse soils, low productivity, and low fuel abundance constrained the spread of some extensive fires and drove spatial variation in fire history on the eastern slope of the central Oregon Cascades. Similar, but more extensive, pumice basins occur in this study area. In addition, large topographic features, including long steep ridges, large volcanic buttes, and gentle rolling topography, are intermixed across the study region.

Hypothesis 3 – Lodgepole pumice basins limited fire spread because of slow fuel recovery and formed firesheds.

If this hypothesis is true, I expect fire frequency to decrease with increasing abundance of lodgepole pumice basin within an analysis polygon. I also expect variability in fire frequency and the number of small fires to increase with increasing abundance of lodgepole pumice basin.

Hypothesis 4 – Fuel recovery controls fire spread throughout the area.

If this hypothesis is true, I expect fire frequency to vary little across the reconstruction area, and that it would not vary significantly among vegetation (forest) types. Time since fire would be the best predictor of whether an area burned in a fire event, and firesheds would not be apparent across the study area.

 

Approaches: describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

My first priority is to develop maps of fire events and the metrics that describe fire history.  For example, I would want to map how frequency, variation in frequency, minimum and maximum fire interval, and fire size vary spatially across the study area. It would be very helpful to develop fire maps that also show how time since last fire spatially varied across the study area.  This would help us understand whether fuel recovery limits or constrains fire in the region. I’ve made an animation of fire maps through time before, but it was extremely glitchy and limited in ArcMap. I would like to learn a better animation process because videos of fire maps are effective in presentations. Ideally unburned reconstruction polygons would be shaded by time since fire, current year fires would be depicted in red, and the maps would include a sidebar that summarized drought conditions in the current year. The viewer would simultaneously learn how fires were related to topography, vegetation type, fuel recovery, and climate.

Potential analyses (I’ve got a lot more to learn about)

Cluster analysis to identify fire regime types based on fire regime metrics

Hot Spot analysis to identify how fire history or fire regime metrics vary spatially

Superposed epoch analysis to identify relationships between climate and fire occurrence, and between climate and fire size.

Generalized linear mixed modeling to check for spatial autocorrelation in fire regime metrics and to understand relationships between fire regime metrics and topography, vegetation, and landscape structure
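As a concrete illustration of the superposed epoch analysis listed above, the sketch below averages a synthetic annual drought index over lag windows centered on fire years. The index series and fire years are made up; a real analysis would use a reconstructed drought index such as PDSI.

```python
import numpy as np

# Synthetic annual drought index and reconstructed fire years (illustrative)
years = np.arange(1700, 1721)
drought = np.sin(years.astype(float))  # stand-in for a real drought series
fire_years = [1705, 1712, 1718]

def superposed_epoch(index, years, events, window=3):
    """Mean climate-index value at each lag year relative to event years."""
    lags = np.arange(-window, window + 1)
    means = []
    for lag in lags:
        # Collect the index value at (event year + lag) for each event that
        # falls inside the record
        vals = [index[np.where(years == ev + lag)[0][0]]
                for ev in events if (ev + lag) in years]
        means.append(np.mean(vals))
    return lags, np.array(means)

lags, means = superposed_epoch(drought, years, fire_years)
# The lag-0 entry is the mean drought index in the fire years themselves
```

Significance would normally be assessed by comparing the lag-0 value against a bootstrap distribution of randomly chosen event years.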

 

Expected outcome: what do you want to produce — maps? statistical relationships? other?

I want to produce maps of fire events and fire regime metrics.

I would like to know how fire size and occurrence were statistically related to climate.

I want to produce statistical models that describe how fire regimes vary spatially with topography, vegetation type, and fuel recovery. I suspect that relationships may vary with climate and that interactions between variables may be important (e.g., vegetation type may only matter on steep volcanic buttes).

 

Significance. How is your spatial problem important to science? to resource managers?

Currently there is much interest in restoring fire in forested landscapes with a history of fire exclusion, and vigorous debate on the historical role of fire in different forest types and environmental settings. The science of fire ecology provides a good theoretical understanding of what drives variation in fire regimes. However, few available datasets allow us to quantify which drivers of variation have the most influence and how they interact. By identifying what drives variation in fire regimes, we can better plan for, manage, and reintroduce fire into forests with a history of fire suppression.

 

  1. I’m comfortable working with Arc-Info to make maps, but would like more experience with spatial statistics.
  2. I’m a Python beginner
  3. I’m comfortable working through problems in R, but I often have sloppy code. I could take a big step forward by learning how to produce maps in R and perform multivariate stats in R.

 

 


April 6, 2018

It’s all in the timing: Assessing risk of an introduced insect on a native plant through investigations of phenological synchrony

Filed under: 2018,My Spatial Problem @ 10:39 pm

1. Research Question / Background:

Biological control of weeds involves introducing or augmenting natural enemies, such as insect herbivores, for population control of a target weed. Choosing insect herbivores that are highly specific to their host plants gives managers some confidence that these species will be safe & effective as biocontrol agents. However, our ability to predict outcomes of introductions is imperfect, and resulting risks to non-target native plants must be weighed in evaluating success & safety of biological control programs.

The cinnabar moth was released in western Oregon from the 1970s through the 1990s to control a European weed, tansy ragwort. However, redistribution of the cinnabar moth was halted after it established on Senecio triangularis, a non-target native wildflower (Diehl and McEvoy, 1988). The moth has established and maintained populations on S. triangularis even in the absence of its ancestral host. Cinnabar herbivory of foliage has not been found to cause long-term decreases in plant fitness or reproduction (Rodman, 2017). Herbivory of flowers is less common given a mismatch between the flowering time of S. triangularis and peak feeding stages for the moth, but when it does occur, floral/seed herbivory may have direct impacts on population dynamics. Previous work has shown that larvae experiencing phenological synchrony with S. triangularis flowers decreased seed set by 95% (Rodman, 2017), and that S. triangularis is seed-limited, so that reduction of seed set decreases seedling recruitment to the next generation (Lunde, unpublished data).

Because this plant is a novel host-plant, and because insects and plants can respond to disparate environmental cues to determine the timing of their life cycles (phenology), we expect to see phenological synchrony varying depending on environmental factors. Knowing which populations of S. triangularis would be most likely to experience seed herbivory by cinnabar larvae could help managers track and respond to cinnabar moth presence and the risk to S. triangularis on a site-by-site basis.

This project seeks to explore how variation in phenological synchrony is related to a set of environmental variables for a set of known populations of cinnabar moth on Senecio triangularis. Using a linear mixed model, and environmental variables measured directly or derived from zonal statistics of spatial data, we will use variable selection processes to determine which candidate factors best explain variation in phenological synchrony seen on a single date (July 22) in 2017. This relationship will then be used to predict phenological synchrony, and subsequent risk of seed herbivory, for a larger set of 26 sites with known S. triangularis populations extending across the Oregon Cascade & Coast Ranges.

2. Data:

This project uses a dataset of phenology scores (for cinnabar larvae and Senecio triangularis) and environmental variables taken for five populations in the Willamette National Forest near Oakridge, OR. The project also estimates environmental variables from assorted spatial datasets. Though some data were taken at the level of individual plants, all variables will be considered at the site level, because some variables (such as soil type and snowmelt date) cannot be measured meaningfully at the level of individual plants. Data sources/details are as follows:

Population locations: two polygon layers — one includes 5 sites surveyed over the 2017 season; one includes 26 additional sites previously surveyed for cinnabar presence and host-plant damage.

Phenological synchrony: tabular, count data for number of vulnerable and invulnerable flower heads (capitula), and counts of cinnabar larvae in peak feeding stages (4th & 5th instars), collected from a random subset of tagged plants on July 22, 2017. These count data will be used to derive a measure of phenological synchrony for each site. While data were collected at 10-day intervals, in order to capture variation in synchrony, this analysis will focus on data from July 22. This date is halfway between the date at which all capitula were vulnerable and the date at which all capitula were invulnerable in 2017.
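The exact synchrony measure is still to be derived from these counts. One plausible site-level index (my own illustration, not the study's definition) is the proportion of capitula in vulnerable stages, set to zero when no virulent larvae are present:

```python
def synchrony_index(n_vulnerable, n_invulnerable, n_virulent_larvae):
    """Hypothetical synchrony index: fraction of capitula in vulnerable
    stages (P, B, YF, F), zeroed when no L4/L5 larvae are present.
    This formula is an assumption for illustration only."""
    total = n_vulnerable + n_invulnerable
    if total == 0 or n_virulent_larvae == 0:
        return 0.0
    return n_vulnerable / total

print(synchrony_index(30, 70, 5))  # 0.3
print(synchrony_index(30, 70, 0))  # 0.0 -- no virulent larvae, no synchrony
```

Whatever form the final metric takes, it must combine both the flower-side and larva-side counts so that sites with vulnerable flowers but no feeding larvae score as asynchronous.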

The following environmental variables are included in the phenology dataset: ambient temperature at 6 and 36 inches above the ground; soil temperature and moisture at 6-inch depth; and approximate sun exposure measured by a Solar Pathfinder. These data are all constrained to the 5 surveyed sites.

Snow disappearance date will be approximated from a model developed by Ann Nolin and the Mountain Hydroclimatology Group. This model uses MODIS data to determine date of snow disappearance at a resolution of approximately 500m. Accumulated degree days will be approximated for July 22, 2017 from a model developed by Len Coop of the OSU Integrated Plant Protection Center (IPPC), based on a variety of climate data: AGRIMET, HYDROMET, ASOS/METAR and COOP networks, RAWS network, SNOTEL network, and others.
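For intuition, accumulated degree days can be computed with the simple averaging method sketched below. The base temperature and daily values are illustrative, and the IPPC model may well use a different calculation (e.g., the single-sine method).

```python
def accumulated_degree_days(tmax, tmin, base=10.0):
    """Accumulate daily degree days above a base temperature using the
    simple averaging method: max(0, (Tmax + Tmin)/2 - base), summed.
    Base temperature of 10 C is an arbitrary example value."""
    return sum(max(0.0, (hi + lo) / 2.0 - base) for hi, lo in zip(tmax, tmin))

# Three hypothetical days of daily max/min temperatures (deg C)
print(accumulated_degree_days([22, 25, 18], [10, 12, 8]))  # 17.5
```

Summing daily values from a biologically meaningful start date (e.g., January 1 or snowmelt) through July 22 would give the accumulation used as a predictor.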

Soil type for each site will be determined from SSURGO soil layers, which provide polygon boundaries of component soil types as determined by the National Cooperative Soil Survey. Elevation and aspect (site average) will be determined from elevation and aspect surface previously developed from a Digital Elevation Model of Oregon at a 30m resolution. Each layer will be obtained for the extent of Oregon, and considered at the scale of 31 identified sites using zonal statistics.
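The zonal-statistics step can be illustrated with a toy raster: the mean of a value surface (e.g., elevation) within each zone of a rasterized site layer. All values below are invented.

```python
import numpy as np

# Toy elevation raster and a zone raster of site IDs (illustrative values)
elevation = np.array([[100., 120., 130.],
                      [110., 150., 160.],
                      [105., 155., 165.]])
zones = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [1, 2, 2]])

def zonal_mean(values, zones):
    """Mean of `values` within each unique zone ID, as a dict."""
    return {int(z): float(values[zones == z].mean()) for z in np.unique(zones)}

print(zonal_mean(elevation, zones))  # {1: 108.75, 2: 152.0}
```

This is the same operation ArcGIS performs with Zonal Statistics as Table; circular statistics would be needed for aspect, since it is a directional variable.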

3. Hypotheses:

The phenology of insects is usually predicted from degree days, because insects are ectotherms whose development is largely constrained by external heat gain (Johnson et al., 2007), whereas alpine and subalpine species of perennial plants have been shown to vary widely in terms of which environmental factors drive flowering phenology (Dunne et al., 2003). We expect to see differences in phenological synchrony of host plant and larvae to the extent that the environmental drivers underlying each species vary independently.

My hypothesis is that phenological synchrony varies between sites on July 22, 2017, and that a significant amount of the variation in phenological synchrony can be explained by a combination of candidate environmental factors (listed above). Further, I hypothesize that this model can be used to predict risk of cinnabar seed herbivory to S. triangularis based on values of environmental factors.

It is possible that the five sites represented in the available data do not represent a wide enough range in explanatory variables to adequately test this hypothesis. In this case, I would seek to approximate what site conditions need to be represented in the data in order to answer the research question.

4. Approaches:

Initial analysis will use zonal statistics and statistical analysis of tabular data to determine the variation in phenological synchrony and candidate explanatory variables across 5 sites surveyed in 2017. This will be used to develop a linear mixed model using appropriate variable selection methods to determine which environmental factors best explain variation in phenological synchrony for July 22.
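A minimal sketch of comparing candidate predictors by fit is shown below, using ordinary least squares and residual sum of squares on synthetic data. The variable names are hypothetical, and the real analysis would compare mixed models with a formal criterion such as AIC rather than raw fit.

```python
import numpy as np

# Synthetic site-level data; synchrony is driven by snowmelt by construction
rng = np.random.default_rng(1)
n = 30
snowmelt = rng.normal(size=n)
degree_days = rng.normal(size=n)
synchrony = 2.0 * snowmelt + rng.normal(scale=0.5, size=n)

def rss(x, y):
    """Residual sum of squares from a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    _, residuals, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(residuals[0])

# The predictor that actually generated the response fits much better
print(rss(snowmelt, synchrony) < rss(degree_days, synchrony))  # True
```

With only five sites, any selection procedure will be limited to one or two predictors at a time, which is one reason the broader 26-site dataset matters.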

Then, using a broader set of 26 sites with known S. triangularis populations and the same set of environmental data, I will use the linear mixed model to predict phenological synchrony for theoretical cinnabar moth populations at these sites. Results of this analysis would be used to display the estimated phenological synchrony, and risk of seed herbivory, for all 31 sites.

5. Expected outcome:

Outcomes for this project will be a linear mixed model that can predict mid-season phenological synchrony of cinnabar larvae and S. triangularis flowers from a set of environmental explanatory variables derived from spatial datasets and 2017 phenology survey data.

An additional product will be a map showing phenological synchrony for 26 additional sites as predicted from environmental factors deemed significant in this first part. Not all of these populations have cinnabar moth populations, but this map would allow us to identify sites at which S. triangularis would be at high or low risk of seed herbivory if populations of cinnabar moth were to establish.

6. Significance:

Estimations and predictions of phenological synchrony determined in this study will be significant in answering how often and under what conditions cinnabar larvae have the potential to decrease seed set for Senecio triangularis through floral herbivory. Experimental data could be used to estimate decreases in annual seedling recruitment based on seed reduction scenarios. Meanwhile, a map showing relative risk of seed herbivory due to phenological synchrony will allow managers to identify high-risk populations of S. triangularis in order to focus monitoring efforts at these sites and possibly intervene by reducing or moving cinnabar moth populations.

7. Level of preparation:

ArcINFO: 3 terms of coursework (GIS I, II, & III) and independent work; relatively confident.

Modelbuilder and/or GIS programming in Python: one term of coursework (GIS III); somewhat confident.

R: three terms coursework (Stats 511, 512 & FES 524); limited proficiency, no experience with spatial data

WORKS CITED

Diehl, J.W., and McEvoy, P.B. (1988). Impact of the Cinnabar Moth (Tyria jacobaeae) on Senecio triangularis, a Non-target Native Plant in Oregon. In Proceeding VII International Symposium on Biological Control of Weeds, (Rome, Italy), p. 119-126.

Dunne, J.A., Harte, J., and Taylor, K.J. (2003). Subalpine Meadow Flowering Phenology Responses to Climate Change: Integrating Experimental and Gradient Methods. Ecol. Monogr. 73, 69–86.

Johnson, D., Bessin, R., and Townsend, L. (2007). Cooperative Extension Service, University of Kentucky. Resource 474, 7727.

McEvoy, P.B., Higgs, K.M., Coombs, E.M., Karaçetin, E., and Ann Starcevich, L. (2012). Evolving while invading: rapid adaptive evolution in juvenile development time for a biological control organism colonizing a high-elevation environment. Evol. Appl. 5, 524–536.

Rodman, M. (2017). Non-target Effects of Biological Control: Ecological Risk of Tyria jacobaeae to Senecio triangularis in Western Oregon. Oregon State University.

Associations between physical substrate complexity and spatiotemporal kelp forest dynamics

Filed under: 2018,My Spatial Problem @ 10:08 pm

BACKGROUND

Nearshore temperate kelp forests are structured by an interaction of physical forces and biological processes that produce patchy species assemblages across space and rapid shifts in community structure through time. At short temporal scales, seasonal storm surge removes adult giant kelp (Macrocystis pyrifera) and can instigate community shifts, while longer-term periodicity (e.g., El Niño Southern Oscillation events, Pacific Decadal Oscillation) modulates the oceanographic conditions influencing macroalgal growth. Disturbance- and herbivore-driven community shifts are often transient; however, deforested regions do not always immediately recover, and urchin-dominated communities may persist for a decade or more before reverting to a macroalgal state. While the processes involved in both directions of the switch between urchin and macroalgal states have been repeatedly observed and qualitatively described, it is less clear how biological and physical context at local scales may modify positive and negative feedback mechanisms to either perpetuate or dampen these shifts. To (1) investigate how variation in physical substrate complexity is associated with varying temporal kelp forest dynamics, and (2) map a future survey, I will incorporate a spatially explicit 38-year time series of community dynamics with 2 m side-scan sonar bathymetry around San Nicolas Island, CA, in the Channel Islands.

SPECIFICALLY

(1) Are spatial associations between physical habitat complexity (i.e., relief) and community structure consistent through time? Are varying levels of relief associated with (a) cyclical periodicity through time or (b) edges in community structure? Does increasing the spatial scale of bathymetry incorporated around each site influence the previous results? That is, does the composition or heterogeneity of an expanding spatial window of substrate provide insight into the nature or persistence of species-habitat associations?
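The expanding-window idea can be sketched as computing relief heterogeneity (here, standard deviation) within growing square windows around a site cell. The raster below is random stand-in data; the real analysis would read the 2 m bathymetry-derived slope layer.

```python
import numpy as np

# Stand-in relief (slope) raster; illustrative random values
rng = np.random.default_rng(0)
relief = rng.random((21, 21))
site_row, site_col = 10, 10

def window_std(raster, r, c, half_width):
    """Standard deviation of raster values in a (2*half_width+1)^2 window
    centered on cell (r, c); assumes the window fits inside the raster."""
    win = raster[r - half_width:r + half_width + 1,
                 c - half_width:c + half_width + 1]
    return float(win.std())

# Heterogeneity at increasing spatial scales around the site
for hw in (1, 3, 5):
    print(hw, window_std(relief, site_row, site_col, hw))
```

Plotting this statistic against window size for each site would show whether apparent species-habitat associations depend on the spatial extent of substrate considered.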

(2) Create a pool of possible survey sites for fieldwork currently slated for summer 2019. Based on previous surveys, our inference into community structure is influenced by the complexity of the substrate sampled, and it is thus possible to obtain a misleading or incomplete snapshot of community structure if a completely random design is implemented irrespective of local heterogeneity. Therefore, my second objective is to create a sampling design that incorporates physical substrate complexity (e.g., a stratified sampling design, but not necessarily scaled by area per condition), allowing divers to survey both low- and high-relief substrate all around the island, providing an independent test of my hypotheses for how species-habitat relationships vary over time given broader context (e.g., sea otter distribution, storm exposure).

 DATA

Benthic bathymetry

I will use existing side-scan sonar bathymetry data of the nearshore subtidal around San Nicolas Island (SNI), CA (Kvitek, 2012). These data have a 2 m grain and extend approximately 1 km offshore around the island. I will predominantly use a slope layer containing measures of substrate verticality.

Time series

I will use data from an ongoing 38-year biannual sampling program that has surveyed seven subtidal sites in the nearshore subtidal (35-40 feet sea water) around SNI (Kenner et al., 2013). Macroalgal and urchin species are surveyed within 10x2 m transects (5 per site, 35 total), filamentous red algae and colonial species are recorded in 1 m2 percent-cover quadrats (10 per site, 70 total), and fish are recorded in benthic and midwater 8x50 m transects (5 per site, 35 total). As incorporating multiple community matrices sampled at different scales may require more time than the scope of this class, a few key indicator species will be retained and standardized for initial analyses, including 1) giant kelp (Macrocystis pyrifera), 2) purple urchins (Strongylocentrotus purpuratus), 3) California sheephead (Semicossyphus pulcher), and percent coverage of 4) fleshy red algae and 5) suspension feeders.

 HYPOTHESES

1) I hypothesize sites predominantly comprised of low-relief will exhibit large shifts in community structure through time (e.g., large swings in local urchin densities), sites comprised of a mix of low- and high-relief will exhibit spatially explicit differences in species-habitat associations through time, and sites comprised of high-relief will exhibit relatively uniform species-habitat associations and minimal shifts in community structure over time (e.g., consistently low urchin densities).

2) I hypothesize low-relief sites that are homogenous across an increasing window of spatial scale will experience rapid and lasting shifts in community structure, while other sites comprised of a mix of low- and high-relief will exhibit a spatial patchwork of urchin barrens and kelp regions. I hypothesize homogenous high-relief sites will be associated with high Sheephead densities, an urchin predator whose spatial aggregation may locally increase the strength of top-down trophic regulation, limiting herbivory, and yielding macroalgal species that exhibit cycles with periodicities characteristic of populations governed by age-structured growth and senescence.

APPROACH

My objective is to increase my technical proficiency in ArcGIS and learn how to apply various spatiotemporal analyses, e.g., autocorrelation and spatial and temporal cross-correlation among transects within a site. Wavelet analysis has proven very useful, and perhaps those results could be related to the bathymetry. I would like to explore a variety of methods to motivate future efforts (e.g., incorporate the entire time series at a later date, perform more robust spatial autocorrelation analysis once the 2019 island-wide survey has taken place).

PRODUCTS

I would like to produce maps that show the increasing scales of benthos analyzed for the later part of question (1), along with various statistical output for analyzing patterns within- and among-sites (e.g., variograms, correlograms). For the sampling design I would like to map a pool of potential sample sites linked to GPS coordinates from which I’ll randomly select and sample 35.

SIGNIFICANCE

Results from these analyses will provide insight into how local features are associated with varying temporal dynamics over time, potentially contextualizing experimental work planned for summer 2018 that will test a key mechanism hypothesized to vary with substrate complexity. Additionally, abrupt transitions in community structure often negatively affect ecosystem function over time; for example, sea star wasting has resulted in the domination of purple urchin populations and a crash in macroalgae, and as a consequence, for the first time since its creation, the recreational abalone fishery will not open this season. Results for the islandwide snapshot survey sampling design will directly inform questions of management interest, as the recent arrival of the invasive macroalgae Sargassum horneri threatens native species at SNI, and the current distribution is unknown.

 PREPARATION

I have a rudimentary working knowledge of ArcGIS, and can probably figure out how to do most tasks once I explicitly know what it is I need to do (and which tools or packages are required for the specific tasks). I conceptually know what I need to do, but I don’t know how that translates in terms of ESRI tools. I’ve used R to structure data, run analyses, and create figures, but I have yet to analyze spatial data.

LITERATURE CITED

Kvitek, R. 2012. Bathymetry data used in this study were acquired, processed, archived, and distributed by the Seafloor Mapping Lab of California State University Monterey Bay. http://seafloor.otterlabs.org/SFMLwebDATA.htm

Kenner, M.C., Estes, J.A., Tinker, M.T., Bodkin, J.L., Cowen, R.K., Harrold, C.H., Hatfield, B.B., Novak, M., Rassweiler, A., Reed, D.C. 2013. A multi-decade time series of kelp forest community structure at San Nicolas Island, California (USA). Ecology 94(11): 2654

What landscape factors are most important to predicting beaver dam occurrence in the Oregon Coast Range?

Filed under: 2018,My Spatial Problem @ 5:51 pm

Project Overview:

North American beavers (Castor canadensis) are often referred to as ‘ecosystem engineers’ because they can fundamentally transform stream and riparian ecosystems through dam building, pond creation, and intensive foraging on vegetation. Recent literature suggests that beaver restoration, via introduction of beavers to unoccupied stream reaches, may provide a cost-effective strategy to restore degraded watersheds. Despite these potential benefits, there is also considerable uncertainty around the efficacy of beaver restoration, including 1) survival of reintroduced beavers, 2) what constitutes suitable habitat for dam building, 3) quantifiable benefits of those dams, and 4) possible unintended consequences of beaver restoration efforts, such as flooding and damage to private property. The goal of this analysis is to consider what landscape factors can be used to predict 1) the presence or absence of beaver occupancy and 2) the presence or absence of beaver dams in the West Fork Cow Creek, a tributary of the South Umpqua River in southern Oregon.

Datasets:

These questions will be analyzed using data collected during beaver occupancy and dam presence surveys in the West Fork of Cow Creek (WFCC) during August and September of 2017. A total of 144 survey locations were sampled from the basin using metrics of stream gradient, bank-full stream width, and valley floor width. Using these geomorphic characteristics, survey locations were organized into three strata: 1) suitable for damming and beaver occupancy; 2) unsuitable for damming but suitable for beaver occupancy; and 3) unsuitable for both damming and beaver occupancy. Surveys were conducted along longitudinal transects extending 100 m upstream from each survey location.

Hypotheses:

  1. Availability of suitable beaver dam habitat will be most limited by stream gradients in the West Fork Cow Creek.

Literature on suitable damming habitat identifies more than a dozen variables that have been used to predict dam sites in watersheds throughout North America (Dittbrenner et al., 2018), but these generally include factors related to perennial streamflow, stream geomorphology, and food supply. In the Oregon Coast Range, dam sites have been found to occur most commonly in stream reaches with low gradient (≤5%), moderate bank-full width (3-6 m), and wide valley floors (≥25 m) (Suzuki & McComb, 1998). These characteristics reflect the criteria for selection of stream reaches in Stratum 1. However, it is not clear that each of these factors exerts an equal influence on the occurrence of beaver dams; there is evidence that stream slope may be the most important factor, because high annual precipitation and the generally steep, dissected nature of the region’s watersheds produce high seasonal peak flows that can cause dam failures.

  2. Observed dam sites will occur more frequently where connectivity among suitable damming habitat is greatest.

Connectivity and neighborhood effects are important factors in habitat selection studies. For example, Isaak et al. (2007) found that Chinook salmon preferentially selected spawning locations with greater habitat size and connectivity over habitat quality. To my knowledge, these factors have not been well considered in efforts to predict the occurrence of beaver dam sites across watersheds.

Approaches:

I would like to build a logistic regression model that predicts the occurrence of beaver dam sites from a number of explanatory variables, including stream slope, bankfull width, valley slope, connectivity, and proximity to non-damming beavers.
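Once fitted, such a model would yield dam-presence probabilities via the inverse-logit transform, as in the sketch below. The coefficients here are invented for illustration; real values would come from the fitted regression.

```python
import math

def dam_probability(gradient_pct, bankfull_m, valley_m,
                    b0=-1.0, b_grad=-0.8, b_bf=0.3, b_vf=0.05):
    """Inverse-logit prediction of dam presence from reach characteristics.
    All coefficients are made-up placeholders, not fitted estimates."""
    eta = b0 + b_grad * gradient_pct + b_bf * bankfull_m + b_vf * valley_m
    return 1.0 / (1.0 + math.exp(-eta))

# A low-gradient, wide-valley reach should outrank a steep, narrow one
print(dam_probability(2, 4, 30) > dam_probability(8, 4, 10))  # True
```

Applying the fitted equation reach-by-reach across the stream network is what would produce the predictive map of likely dam sites and ‘opportunity areas’.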

Expected Outcome:

My goal in this effort is to develop a predictive model of where beaver dams are most likely to occur, or to identify ‘opportunity areas’ where dam sites could occur if beavers were introduced in the West Fork Cow Creek drainage. This would include maps of predicted dam locations and dammed stream reaches, as well as locations where conflict may arise due to proximity to roads or agricultural lands.

Significance:

There has been growing interest among stakeholders and watershed managers in the Umpqua River Basin in exploring what opportunities beaver restoration may provide for watershed enhancement. A predictive tool would provide guidance and help improve the chances of success in relocating beavers.

Level of preparation:

My experience with Arc-Info is low to moderate and it has been quite some time since I used any of the ESRI products on a regular basis so I anticipate there will be a challenging learning curve. Over the past 6 months I have been using R studio and feel moderately comfortable running regression analyses and developing basic charts and figures. I have no experience with Python.

References:

Dittbrenner, B. J., Pollock, M. M., Schilling, J. W., Olden, J. D., Lawler, J. J., & Torgersen, C. E. (2018). Modeling intrinsic potential for beaver (Castor canadensis) habitat to inform restoration and climate change adaptation. PLOS ONE, 13(2), e0192538. https://doi.org/10.1371/journal.pone.0192538

Isaak, D. J., Thurow, R. F., Rieman, B. E., & Dunham, J. B. (2007). Chinook salmon use of spawning patches: relative roles of habitat quality, size, and connectivity. Ecological Applications, 17(2), 352–364. https://doi.org/10.1890/05-1949

Suzuki, N., & McComb, W. C. (1998). Habitat classification models for beaver (Castor canadensis) in the streams of the central Oregon Coast Range. Northwest Science, 72, 102–110.

 

How does photosynthetic pathway of a grassland affect seasonality and drought response of productivity?

Filed under: 2018,My Spatial Problem @ 3:24 pm
  1. A description of the research question that you are exploring.

Grasslands are key social, economic, and ecological components of US landscapes, and globally, ecosystems containing abundant grassy cover are estimated to compose ~30 percent of non-glacial land cover (Still et al., 2003; Asner et al., 2004). Yet, compared to forests, we know relatively little about how the productivity of grassy landscapes will respond to future, more-intense droughts induced by climate change. The community composition of a grassland mediates its response to drought and is critical to consider in forecasting climate change impacts (Knapp et al., 2015). The photosynthetic pathway (C3 or C4) used by grass species is a first-order factor of community composition that strongly affects resource-use efficiencies. Grasses with the C4 photosynthetic pathway, in contrast to the ancestral C3 pathway, have comparatively higher light-use and photosynthetic efficiencies, especially under high temperatures, as well as higher water-use efficiency. As a result, C3 and C4 grasses will have distinct responses to warming climate and rising CO2 (Collatz et al., 1992, 1998; Lloyd & Farquhar, 1994; Suits et al., 2005). Thus, the photosynthetic pathway composition (C3 or C4) of grass communities is a fundamental aspect of grassland and savanna function, ecology, and biogeography.

The light use efficiency (LUE) of photosynthesis is one metric that we can use to track the growth of natural grasslands. LUE is calculated as gross primary productivity (GPP) divided by absorbed photosynthetically active radiation (APAR), and can be obtained from eddy covariance (EC) flux tower measurements of ecosystem productivity and environmental conditions. Though the comparative water-use efficiency (WUE) of C3 and C4 grasslands has been well-studied, LUE has received less attention. Importantly, LUE is also correlated with sun-induced chlorophyll fluorescence (SIF), a new remote sensing index that captures intra-annual variation in production better than NDVI and EVI (Rossini et al., 2010; Guanter et al., 2014), and should be particularly useful across systems with distinct resource-use efficiencies.

Guided by these knowledge gaps, I am interested in a) comparing the seasonal dynamics of LUE between C3 and C4 grassland sites, and b) quantifying the impacts of a 2012 drought on the LUE of C3 and C4 grasslands.

Specific questions include:

  • How does the timing of annual spring greenup (the increase in LUE) differ between the C3 and C4 sites?
  • How does the slope of the annual increase in LUE differ between the C3 and C4 sites?
  • How much do these parameters vary from year to year?
  • What climatic factors (e.g., degree days, temperature, precipitation, drought severity, previous year production) are correlated, autocorrelated, or temporally cross-correlated with this variation?
  • What anomalies are associated with the timing, slope, and magnitude of LUE during a known drought year, and how do these anomalies differ between a C3 and C4 site?
  • How is coarse-scale SIF correlated with EC flux tower-scale measurements of GPP and LUE, and how does this relationship differ between the C3 and C4 sites?
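As a first pass at the correlation questions above, the lagged cross-correlation between a climate series and a production series can be screened directly. The sketch below is illustrative only (the function name and lag convention are my own choices), and assumes regularly spaced series, e.g. daily values, with no gaps:

```python
import numpy as np

def lagged_crosscorr(x, y, max_lag):
    """Pearson correlation between y and x at each integer lag.
    A negative lag means y trails (responds after) x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            xs, ys = x[:lag], y[-lag:]
        elif lag > 0:
            xs, ys = x[lag:], y[:-lag]
        else:
            xs, ys = x, y
        corr[lag] = float(np.corrcoef(xs, ys)[0, 1])
    return corr
```

The lag with the largest correlation gives a rough estimate of how long LUE takes to respond to, say, a precipitation pulse.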
  2. A description of the dataset you will be analyzing, including the spatial and temporal resolution and extent.

My study sites are two eddy covariance (EC) flux tower locations in natural grassland areas located ~90 miles apart in eastern Kansas. The sites experience nearly identical climates, but the first is a natural tallgrass prairie composed of 99% C4 grass at Konza Prairie Biological Station outside Manhattan, KS, while the second is a replanted agricultural field composed of 75% C3 grass at the University of Kansas Field Station (Fig. 1). Because the two sites experience a very similar climate, I hypothesize that photosynthetic type strongly controls differences in LUE at each site.

LUE can be calculated from ecophysiological equations that use ground-based measurements of atmospheric gas concentrations and meteorological data. The eddy covariance (EC) flux approach uses tower-mounted instruments to measure atmospheric concentrations of water vapor and CO2, as well as air temperature, solar radiation, and other environmental variables. Measurements are recorded continuously at 30-minute intervals. EC flux data reflect the “footprint,” the area upwind of the tower where the instruments are mounted. The footprint varies with wind speed and direction, but averages ~250 m2. EC flux data span 2008–2015.

The main metric I am interested in is daily total LUE, calculated as the daily sum of gross primary productivity (GPP) divided by the daily sum of incoming radiation, or photosynthetic photon flux density (PPFD). Daily LUE is converted from units of µmol CO2·µmol photon-1·m-2·day-1 to units of gC·MJ-1·m-2·day-1 using the molecular weight of carbon and Planck’s equation. Example time series of GPP, PPFD, and LUE appear in Fig. 2.
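The conversion described above can be sketched as a short function. This is a minimal illustration rather than the exact code used for the analysis: it assumes daily sums in µmol m-2 day-1 and evaluates Planck’s equation at a single representative PAR wavelength (550 nm here), which is my assumption, not a stated choice:

```python
# Physical constants
H = 6.626e-34    # Planck's constant, J s
C = 2.998e8      # speed of light, m s^-1
N_A = 6.022e23   # Avogadro's number, mol^-1
M_C = 12.011     # molar mass of carbon, g mol^-1

def daily_lue(gpp_umol, ppfd_umol, par_wavelength_nm=550.0):
    """Daily LUE in gC MJ^-1 from daily sums of GPP (umol CO2 m^-2 day^-1)
    and PPFD (umol photons m^-2 day^-1). Photon energy is taken at one
    representative PAR wavelength (an illustrative assumption)."""
    gpp_gc = gpp_umol * 1e-6 * M_C                       # umol CO2 -> gC
    e_photon = H * C / (par_wavelength_nm * 1e-9)        # Planck: E = hc/lambda, J
    par_mj = ppfd_umol * 1e-6 * N_A * e_photon * 1e-6    # umol photons -> MJ
    return gpp_gc / par_mj
```

For scale, one mole of photons at 550 nm carries about 0.22 MJ, so an LUE of ~1 gC MJ-1 corresponds to roughly 0.2 gC fixed per mole of incident photons.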

SIF is available from the NASA GOME-2 satellite at 0.5-degree spatial resolution and at 14- and 30-day temporal resolution. Because of this coarse spatial resolution, SIF from GOME-2 will be weighted by MODIS Land Cover Type data to quantify the amount and type of land cover within the GOME-2 grid cell associated with each flux tower site, sensu Wagle et al. (2016).
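The land-cover weighting, sensu Wagle et al. (2016), starts from the fraction of each cover type within the coarse GOME-2 cell. A minimal sketch, where the class codes passed in are placeholders rather than actual MODIS IGBP codes:

```python
def cover_fraction(class_map, target_classes):
    """Fraction of fine-resolution land-cover pixels inside one coarse
    grid cell that belong to the target classes. class_map is a 2-D
    (nested-list) window of class codes clipped to the coarse cell."""
    flat = [c for row in class_map for c in row]
    return sum(1 for c in flat if c in target_classes) / len(flat)
```

The resulting fraction can then down-weight (or screen out) GOME-2 SIF retrievals whose footprints contain little grassland around the tower.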

Fig. 1: Study sites and ecoregions considered in the analysis. Study sites are the EC flux tower site at Konza Prairie Biological Station (US-Kon), in Manhattan KS, and the EC flux tower site at the University of Kansas Biological Field Station (US-KFS), outside Lawrence, KS.
 
Fig. 2: Time series of GPP, photosynthetic photon flux density (PPFD), and LUE from EC flux data, as well as SIF from GOME-2, for the C3 (orange) and C4 (blue) Kansas flux tower sites. The 2012 drought appears between the gray lines.
  3. Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

My hypotheses are driven by the seasonality and comparative resource-use efficiencies of C3 and C4 photosynthesis (Fig. 3). I hypothesize that, when examining multi-year trends, the C4 site will have a later greenup, but higher maximum GPP and LUE than the C3 site. I also hypothesize that the average slope of annual increase in GPP will be statistically significantly different between the C3 and C4 sites.

I hypothesize that precipitation and growing degree days will be strongly correlated with parameters describing the timing and seasonality of GPP and LUE.

I hypothesize that C4 sites, compared to C3 sites, will show more stable GPP and LUE under 2012 drought conditions, due to the higher WUE of the C4 pathway and higher rates of photosynthesis under high temperatures.

I also hypothesize that there will be distinct relationships between SIF and GPP, and between SIF and LUE, when compared between C3 and C4 sites, driven by the distinct resource use efficiencies of the distinct functional types. I hypothesize that the slope of the relationship between SIF and GPP and SIF and LUE will be statistically significantly higher at the C4 site than at the C3 site.

Fig. 3: Comparison of the simulated responses of C3 (solid line) and C4 (dashed line) photosynthesis. Response of net photosynthesis (a) to quantum flux, at 25 degC and intercellular CO2 partial pressure (pi) of 25 and 15 Pa for C3 and C4 respectively; and (c) to leaf temperature, at pi of 25 and 15 Pa for C3 and C4 respectively and quantum flux of 1500 µmol m-2 s-1. From Collatz et al. 1992.
 
  4. Approaches: describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

I am interested in learning about harmonic curve fitting this term. I expect that harmonic curve fitting will allow me to quantify and investigate interannual patterns in production, and to extract coefficients, minima, maxima, and the timing of production dynamics, better than simple linear regression or generalized linear models.
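A first-order harmonic fit can be set up as an ordinary least-squares problem. The sketch below is illustrative (the function and its return values are my own choices): it fits y ≈ a0 + a1·cos(2πt/365) + b1·sin(2πt/365) and recovers the mean level, the seasonal amplitude, and the day of year of peak production:

```python
import numpy as np

def fit_harmonic(doy, y, period=365.0):
    """First-order harmonic fit by least squares.
    Returns (mean level, seasonal amplitude, day-of-year of the peak)."""
    w = 2.0 * np.pi * np.asarray(doy, dtype=float) / period
    X = np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])
    a0, a1, b1 = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)[0]
    amp = np.hypot(a1, b1)                                  # amplitude of the cycle
    phase = np.arctan2(b1, a1) % (2.0 * np.pi)              # phase of the peak
    return a0, amp, phase * period / (2.0 * np.pi)
```

Higher harmonics (period/2, period/3, …) can be appended as extra columns to capture asymmetric greenup and senescence.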

I am also curious about exploring wavelet analysis with my EC flux data to investigate the degree to which annual patterns of production mimic daily patterns of production.

  5. Expected outcome: what do you want to produce — maps? statistical relationships? other?

I want to produce statistical models that describe interannual patterns of seasonality at the C3 and C4 grassland sites. I also want to produce statistical relationships between metrics of average annual seasonality and environmental conditions, and between drought-year anomalies in production indices and environmental conditions.

  6. Significance. How is your spatial problem important to science? to resource managers?

LUE is a relatively under-utilized metric for tracking plant production, but it will become increasingly valuable because of its relationship to new remote sensing indices. Quantifying seasonal differences in LUE and other production indices, and their drought responses, at closely located C3 and C4 grassland sites will a) clarify how LUE differs between C3 and C4 grasslands, and b) describe the drought response of LUE and how it differs between C3 and C4 grasslands. Exploring LUE dynamics facilitates using LUE-correlated satellite indices to track and predict variation in plant production. Exploring initial correlations between LUE and SIF at these sites will facilitate using SIF to track variation in production across functional types at larger spatial scales. Ultimately, I am interested in investigating how seasonal dynamics of LUE differ across plant communities of varying C4 percentage; developing statistical relationships between the C4 percentage of a site, seasonality, and drought response; and using SIF to track the drought response of plant communities of varying functional types.

  7. Your level of preparation: how much experience do you have with (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R?

  • a)  Significant experience with Arc software and GUI-based image processing and analysis in Arc.
  • b)  Some exposure to ModelBuilder and Python in Arc. Some exposure to coding in Python outside Arc.
  • c)  Proficient and comfortable in statistical and spatial analysis and data visualization using R.

Works Cited

Asner GP, Elmore AJ, Olander LP, Martin RE, Harris AT (2004) Grazing Systems, Ecosystem Responses, and Global Change. Annual Review of Environment and Resources, 29, 261–299.

Collatz G, Ribas-Carbo M, Berry J (1992) Coupled Photosynthesis-Stomatal Conductance Model for Leaves of C4 Plants. Australian Journal of Plant Physiology, 19, 519.

Collatz GJ, Berry JA, Clark JS (1998) Effects of climate and atmospheric CO2 partial pressure on the global distribution of C4 grasses: Present, past, and future. Oecologia, 114, 441–454.

Guanter L, Zhang Y, Jung M et al. (2014) Global and time-resolved monitoring of crop photosynthesis with chlorophyll fluorescence. Proceedings of the National Academy of Sciences, 111, E1327–E1333.

Knapp AK, Carroll CJW, Denton EM, La Pierre KJ, Collins SL, Smith MD (2015) Differential sensitivity to regional-scale drought in six central US grasslands. Oecologia, 177, 949–957.

Lloyd J, Farquhar GD (1994) 13C Discrimination during CO2 Assimilation by the Terrestrial Biosphere. Oecologia, 99, 201–215.

Rossini M, Meroni M, Migliavacca M et al. (2010) High resolution field spectroscopy measurements for estimating gross ecosystem production in a rice field. Agricultural and Forest Meteorology, 150, 1283–1296.

Still CJ, Berry JA, Collatz GJ, DeFries RS (2003) Global distribution of C3 and C4 vegetation: Carbon cycle implications. Global Biogeochemical Cycles, 17, 6-1–6-14.

Suits NS, Denning AS, Berry JA, Still CJ, Kaduk J, Miller JB, Baker IT (2005) Simulation of carbon isotope discrimination of the terrestrial biosphere. Global Biogeochemical Cycles, 19, 1–15.

Wagle P, Zhang Y, Jin C, Xiao X (2016) Comparison of solar-induced chlorophyll fluorescence, light-use efficiency, and process-based GPP models in maize. Ecological Applications, 26, 1211–1222.

 

The wrong snow for sheep…

Filed under: 2018,My Spatial Problem @ 3:20 pm

Dall Sheep on Jaeger Mesa, Wrangell St Elias National Park, Alaska. Laura Prugh.

Research question.

Dall Sheep are a species of wild sheep whose ranges extend throughout mountainous Alaska and the Yukon Territory. As a large ungulate specialised in grazing sub-Arctic to Arctic alpine regions, they are important maintainers of ecosystem function in habitats considered particularly sensitive to environmental change. They also provide an important service to local, often remote, human populations, traditionally through subsistence hunting but more recently through lucrative trophy hunting and wildlife tourism facilitation.

Since the 1980s range-wide populations of Dall Sheep have decreased by up to 21%, and in some areas emergency harvest closures have been enacted. The common explanation for this population decline has been an increase in extreme winter snow conditions reducing access to forage and increasing energy expenditure. Dall Sheep lamb in April and May and the physical condition of the ewes at the end of winter is a key determinant of the survival of their lambs, and hence longer-term population size.

To date, there has been limited empirical investigation into the relationships between longer-term Dall Sheep population health and patterns of seasonal snow cover. This project seeks to address this gap by answering the following research questions:

  • Has there been an increased frequency of extreme seasonal snow conditions (e.g. snow depth, density/icing, duration) from 1980 to present day in Dall Sheep habitats?
  • Do instances of extreme seasonal snow conditions correlate to reduced recruitment of Dall Sheep?

Dall Sheep ranges in Alaska and the Yukon Territory. Study site is the author’s field site in the Wrangell St Elias.

Datasets

Seasonal snow condition data for the analyses in this project will be prepared from the outputs of a physically based, spatially explicit snow evolution model, SnowModel (Liston and Elder, 2006). SnowModel has been run at a daily timestep, using climate reanalysis forcing data, for domains within 6 Alaska National Parks and Preserves (NP) from 1980 to 2017: Wrangell St Elias NP, Lake Clark NP, Denali NP, Gates of the Arctic NP, and Yukon Charley NP. Snow condition data, e.g. depth and density, have been aggregated over sheep habitat (e.g. mean snow depth above shrubline) by month and season (e.g. Winter = December, January, and February) for each water year.

For each of these domains we have summer sheep count datasets collected at differing frequencies and with differing methodologies. For the scope of this project we will use a metric derived from these counts as an indication of recruitment success: the lamb:ewe ratio. More lambs per ewe each summer season indicates greater recruitment success.

Hypotheses

    • Deeper seasonal snow will inhibit Dall Sheep movement and forage access, increasing energy expenditure during winter and leading to decreased spring reproductive success
    • Longer durations of seasonal snow cover will correspondingly cause poorer spring sheep condition and hence decreased reproductive success

Approaches

For RQ#1 I would like to explore approaches that detect and describe statistically significant features of trends in snow cover data from 1980 to 2017. These might include the incidence of hazardous events per season (e.g. rain-on-snow), mean snow depth by month, season, or water year, or snow cover duration. A potential extension of RQ#1 would be to compare the influence of climate indices, e.g. the Pacific Decadal Oscillation, on the spatiotemporal patterns of seasonal snow in each domain.
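One common, distribution-free option for detecting monotonic trends in seasonal snow metrics is the Mann-Kendall test. Below is a minimal sketch of the classic statistic (with no correction for tied values or serial autocorrelation, which real snow series may require):

```python
import numpy as np
from math import sqrt, erf

def mann_kendall(x):
    """Mann-Kendall trend test (no tie correction).
    Returns the S statistic, the normal approximation Z, and a
    two-sided p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S = sum of signs over all ordered pairs
    s = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p = 1.0 - erf(abs(z) / sqrt(2.0))  # two-sided normal p-value
    return s, z, p
```

Running this per season and per domain would flag where, for example, winter snow depth has trended upward since 1980.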

For RQ#2 varying complexities and flavours of regression analysis will be explored to discover the most important features of the snow season that influence the following summer’s Dall Sheep recruitment. Initial ideas are linear mixed models and random forest.
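Before fitting mixed models or random forests, a simple multiple-regression baseline can relate snow metrics to the lamb:ewe ratio. A sketch under assumed inputs (the predictor columns named in the comment, e.g. winter mean depth and snow duration, are hypothetical):

```python
import numpy as np

def ols_baseline(X, y):
    """Multiple linear regression by least squares.
    Returns coefficients (intercept first) and R^2."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Illustrative use: X columns might be winter mean snow depth and snow
# cover duration (hypothetical predictors); y the observed lamb:ewe ratios.
```

The R^2 from this baseline gives a floor that the more flexible models should beat before their extra complexity is justified.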

Preliminary results comparing summer lamb:ewe ratios to simple metrics of seasonal snow and climate

Expected Outcomes

Expected, and hoped-for, outcomes are statistical relationships between Dall Sheep recruitment success and seasonal snow conditions, aiding our understanding of the mechanisms driving their population decline. The sheep survey data are patchy in time and space, so identification of broad, range-wide relationships will help identify years that may have been hazardous for sheep but in which no surveys were conducted. Statistical relationships describing correlations of seasonal snowfall with the strength of teleconnections, e.g. the Arctic Oscillation, for each domain could help wildlife managers anticipate sheep-hostile years and make informed decisions regarding controlled harvest.

Significance

This project will help improve our understanding of the drivers of long-term Dall Sheep decline and inter-annual recruitment success. Understanding these drivers will improve evidence-based decision making in regard to their future management, aiding the sustainability of a critical ecosystem service.

Proficiencies

I have a reasonable level of proficiency in ArcInfo, less so in Model Builder. I am confident in Python for geospatial analysis, though I most often use Matlab when working with SnowModel output. I have zero experience in R.

Reference(s)

Liston, G.E. and Elder, K., 2006. A distributed snow-evolution modeling system (SnowModel). Journal of Hydrometeorology, 7(6), pp.1259-1276.

 

Spatial and Temporal Patterns Among Multi-day, Overnight Wilderness Users

Filed under: My Spatial Problem @ 10:48 am
  1. A description of the research question that you are exploring.

Glacier Bay National Park and Preserve (GLBA), located in southeast Alaska, contains over 2.7 million acres of federally designated terrestrial and marine wilderness (National Park Service, 2015). Recreation users access GLBA Wilderness primarily by watercraft; the park lacks formal trail networks in its wilderness, and terrestrial connectivity is fragmented by the park’s water resources. First designated as wilderness in 1980 through the Alaska National Interest Lands Conservation Act, the park’s wilderness has been managed under a 1989 Wilderness Management Plan (National Park Service, 1989). Much has changed in Alaska and GLBA since that time, including a growing cruise ship tourism industry using the park’s waters and reductions in glacial ice, and the park is currently updating its Wilderness Management Plan to adapt its management practices to these modern contexts. Additionally, the Wilderness Act of 1964 includes explicit statements about how wilderness should be managed; these statements have been operationalized into the Wilderness Character framework (US Forest Service, 2008). This framework provides managers with benchmarks for understanding the degree to which the wilderness experiences of recreationists align with the characteristics of wilderness described in its enabling legislation. Of interest to the park in this wilderness management planning process is developing a better understanding of the wilderness experiences of backcountry overnight users.

The advent of widely available access to Global Positioning System (GPS) technology has led to the ability to continuously track the movement of people through space and time (van der Spek et al., 2009). Using GPS tracking in recreation research expands on previous methods of data collection by providing a reliable way to continuously measure behavior and to generate precise estimates of the spatial and temporal components of visitor movement (van der Spek et al., 2009). Specifically, GPS provides researchers with exact locations and time stamps of visitor movement, whereas self-reported or researcher recorded methods of data collection are subject to estimation error and imprecision that can lead to misrepresentations of actual travel patterns (Hallo et al., 2012; van der Spek et al., 2009). Using GPS technology also removes the potential negative impacts to experience caused by more invasive methods of data collection such as physically following the visitor or observing visitor movements (Cole & Hall, 2012). Furthermore, as GPS technology has continued to advance, earlier obstacles to using GPS technology in visitor use studies, such as burden to the visitor and unit cost, have been resolved (Hallo et al., 2012).

In wilderness settings, GPS technology has primarily been employed as a tool for studying the behavior of day users, meaning those users who do not stay overnight in wilderness as part of their recreation experience. Previously, limits in GPS battery life have been the primary factor preventing the study of overnight wilderness users with GPS technology. Recently, Stamberger et al. (2018) used recreation-grade GPS units to track overnight users in Denali National Park and Preserve. While Stamberger et al. (2018) successfully collected 113 GPS tracks from multi-day users, the study was limited by GPS battery life, the reliability of the units used, and challenges in data collection and management. Additionally, Stamberger et al.’s analyses focused primarily on the spatial distributions of users, with primary results focusing on use density. This study expands on those contributions by overcoming the battery-life and reliability limitations through use of a different recreation-grade GPS unit with enhanced battery life, and by implementing field data collection methods that reduce and address the reported logistical challenges. Moreover, this study explores new approaches for analyzing GPS data from overnight wilderness users, employing analyses that not only describe the spatial component of use but equally consider the temporal component. In this way, this study seeks to describe patterns in the behavior of multi-day, backcountry users through analysis of the spatial and temporal data collected.

Research Questions

Primary Focus: Behavior of Wilderness Users

  • What spatial and temporal use patterns emerge among overnight, multi-day wilderness users in Glacier Bay National Park and Preserve?
  • What differences or similarities emerge in spatial and temporal use among days in an overnight, multi-day wilderness trip (i.e., looking at the spatial and temporal use of all trips on day 1 do we see emerging characteristics)?

Secondary Focus: Intersection of Behavior and Location

  • What are the land cover characteristics of terrestrial wilderness use in Glacier Bay National Park and Preserve? Do relationships exist between the spatial and temporal characteristics of terrestrial use and land cover classes?
  • What are the bathymetric (or other?) characteristics of marine wilderness use in Glacier Bay National Park and Preserve? Do relationships exist between the spatial and temporal characteristics of marine use and marine features? Note: I’m not sure what data I’d use to operationalize this at this time.

Practical/Data Analysis/Class Questions

  • How can the overnight, multiday tracks be meaningfully displayed and/or symbolized for reporting? I’d like to try to figure out a way that both space and time can be represented given that a central contribution of tracking overnight users is seeing their use of space through time (i.e. multiple days).
  • What analyses can be used that move beyond descriptive statistics (i.e., calculations of distance traveled, time spent)? Are there clustering analyses that take into account both spatial and temporal characteristics rather than just spatial characteristics?
  • Are there standard diagnostic or exploratory data plots (outside of viewing the data in ArcGIS) that can be used to understand the GPS data and determine appropriate spatial statistics for analysis?
  • How can I “normalize” the data to ensure that observed differences are not a function of the number of GPS points dropped but a function of actual differences in behavior? Do I need to normalize? Note: This may not be relevant for this dataset, but I’m working with another spatial dataset for a publication where this is relevant.
  2. A description of the dataset you will be analyzing, including the spatial and temporal resolution and extent.

Dataset: The dataset for analysis is a sample of 38 GPS tracks of multi-day trips in GLBA Wilderness (Figure 1). Recreation-grade, personal GPS units were distributed to a sample of wilderness visitors, June through August 2017, prior to the start of their multi-day wilderness trips. Study participants were asked to carry the GPS unit for the duration of their trip and return the unit at its end. GPS units tracked visitor movement continuously throughout the trip.

Figure 1. GPS track data collected from wilderness users in GLBA Wilderness during the summer 2017 use season.

Temporal Resolution and Extent: Units recorded a GPS point at varying intervals, determined as a function of travel speed. When speed was recorded at 0 miles per hour (MPH), the GPS units recorded an X,Y location point every 60 seconds. When speed was recorded at 1 MPH, the units recorded an X,Y point every 15 seconds. When speed was 2 MPH or greater, the units recorded an X,Y point every 8 seconds. Data collection began with the first GPS unit distributed on June 17, 2017 and ended with the last GPS unit returned on August 6, 2017. Most tracks recorded between two and four days of data. Some tracks are incomplete (i.e., the entire trip was not recorded) because the battery died or the unit malfunctioned before being returned at the end of the participant’s overnight trip.
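The speed-dependent recording rule described above can be expressed as a small function. This is a sketch for clarity only; the behavior at speeds between the stated values (e.g., 0.5 MPH) is an assumption, since the text specifies only 0, 1, and 2-or-greater MPH:

```python
def record_interval_s(speed_mph):
    """Seconds between recorded X,Y points as a function of travel speed,
    per the GPS unit settings described in the text. Treatment of
    intermediate speeds is an assumption."""
    if speed_mph >= 2:
        return 8
    if speed_mph >= 1:
        return 15
    return 60
```

This matters for analysis because stationary periods (camps) are sampled 7.5x less densely than travel, so raw point counts over-represent movement.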

Spatial Resolution and Extent: At each time interval (described above), the GPS units recorded X and Y coordinates. Coordinates were recorded in decimal degrees. The geographic coordinate system for the data is GCS_WGS_1984. The spatial extent for the dataset is the park boundary for GLBA.

  3. Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

At this point, I do not have any formal hypotheses about the spatial or temporal behavior of wilderness users in GLBA Wilderness. My analyses will be exploratory, and I hope to look at several different analyses and outputs to ultimately identify an approach/analysis that works well within the limits of the data and will be practically meaningful for Wilderness managers. Ultimately, I’d like to be able to describe hot spots in both space and time and to identify spatial and temporal trends among the days of each trip.

  4. Approaches: describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

Spatial Descriptions: I would like to create a kernel density map and perform a hot spot analysis, both to get practice using those tools and to understand spatially where clustering occurs in the data. These are common density outputs in the recreation literature, and I’d like to make sure that I’m applying them appropriately. I am also interested in potentially using a nearest neighbor hierarchical cluster analysis to understand where spatially explicit clusters exist in the data. I’ve used this analysis before, but again I’d like to make sure that I’m applying it and interpreting its output appropriately. I am also interested in an analysis (maybe path analysis?) that identifies statistical patterns in the sequence of the X,Y points, rather than identifying statistical hot or cold spots among the points in the GPS tracks.
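For the kernel density map, the underlying computation is straightforward even outside ArcGIS. Below is a minimal sketch of a Gaussian kernel density surface over a regular grid; the bandwidth choice and normalization here are illustrative, not the ArcGIS Kernel Density defaults (which use a quartic kernel):

```python
import numpy as np

def kernel_density(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density surface over a regular grid.
    points is an iterable of (x, y) pairs in the grid's coordinate units;
    the output integrates to ~1 over a sufficiently wide grid."""
    xx, yy = np.meshgrid(grid_x, grid_y)
    dens = np.zeros_like(xx, dtype=float)
    for px, py in points:
        dens += np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2 * bandwidth ** 2))
    return dens / (2 * np.pi * bandwidth ** 2 * len(points))
```

Seeing the bandwidth parameter explicitly like this makes clear why two kernel density maps of the same tracks can look very different: the smoothing radius, not the data, often drives the apparent hot spots.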

Spatiotemporal Analyses: I have read a paper that uses the Space-Time Cube in ArcGIS to understand hot and cold spots in space and time, and thought the output was interesting; I would be interested in using that tool, if appropriate, to analyze the space and time elements of the GPS tracks together. Generally, this next level of analysis is an area where I am looking for guidance, as I’m not very familiar with other spatiotemporal analyses. I’ve been doing some initial research, but need to keep working to find out what analytical tools are available. At this point, I’d be looking for something that is descriptive and data-driven, as I do not have formal hypotheses to test.

  5. Expected outcome: what do you want to produce — maps? statistical relationships? other?

Ideally, I’d like to be able to produce visualizations, whether it be maps or other, that represent statistically significant spatiotemporal behaviors in the data. In essence, when a manager looks at a map or visual output, I’d like to be able to show that what is displayed is statistically significant and doesn’t just look significant because of the symbology used.

  6. How is your spatial problem important to science? to resource managers?

Since the establishment of “wilderness” as a federal lands designation, recreation researchers have sought to understand the unique experience of recreating in wilderness. To date, the primary methods for shedding light on the quality of wilderness experiences have used qualitative and quantitative approaches to collect interview and/or survey data from wilderness recreationists. These studies have focused on understanding individuals’ perceptions of their experiences, with topics including motivations, meaning and importance, aspects of the experience, preferences for management, the social and environmental impacts of wilderness use, and the emotional benefits of wilderness experiences (Dawson & Hendee, 2009). Researchers have also sought to understand wilderness behaviors through data collection techniques such as visitor-recorded trip itineraries, visitor-mapped travel trajectories, and visitor reports on items such as trip duration, activities, and encounters with other users. A common characteristic of these methods is that all measure visitor perceptions or recollections of wilderness experiences and behaviors rather than the experiences and behaviors themselves. While approximations of actual behavior, these measures are limited in utility, creating uncertainty around such basic questions as where wilderness users go during their trips and how long they stay in wilderness. In seeking to measure actual behavior, researchers have employed methods such as researcher observation to record occurrences of specific behaviors, or sensor technology to count visitors passing a location.
These techniques provide measures of actual behavior; however, the measurements are made at a single point in time and rarely provide a continuous record of visitor behavior in wilderness. By using GPS technology to track overnight wilderness users, this study achieves a level of data accuracy and resolution for analyzing and understanding patterns in overnight wilderness use that has not previously been possible.

From a managerial perspective, the analysis of these data will provide Wilderness managers at GLBA with an increased understanding of the overnight wilderness visitor population for use in upcoming Wilderness management planning efforts. This new information is notable, as the overnight wilderness visitor population is the primary user group in GLBA Wilderness.

  7. Your level of preparation: how much experience do you have with (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R?

Arc-Info: My level of experience with Arc-Info is proficient – I can easily navigate my way around the software and work independently to problem solve. I would not consider myself an expert as there are several tool boxes within Arc-Info that I have never used. I work primarily with vector data (point, line, and polygon) and am much more familiar with tools built for these data types. I am somewhat familiar with the spatial analysis toolbox, but have not had much success using these tools in Arc-Info as my datasets have been too large in the past. I would consider myself an expert in navigating the online help available through ESRI.

Modelbuilder/Python: My level of experience with Modelbuilder is proficient, although I have not used it in recent years. I know that it is an available tool for linking processes, but in the past I have automated those processes using Python rather than Modelbuilder. In my master’s program I took a course specifically oriented around learning how to use Python for data management and to leverage Arc-Info tools. The course content focused on batch processing, data management, and calling Arc-Info tools using Arcpy. It has been a while since I have used these skills directly, but I’ve tried to maintain them and could work through some code if needed. I consider myself a beginning Python programmer with much to learn. I saved my resources from my prior class and have a great textbook on Python programming in the Arc environment that I’d be happy to share with others.

R: My R experience is new, gained through taking STAT 511 last term. I feel comfortable in the RStudio environment, and find many similarities between Python and R. I would like to learn how to work with and analyze my GPS data in R, and how to leverage any spatial visualization tools that R has to offer. I am a novice in R, but not intimidated!


April 5, 2018

Exploring spatial and temporal behavior patterns of recreationists in Grand Teton National Park

Filed under: My Spatial Problem @ 12:33 pm

Overall context about my research and spatial problem:

For my Master’s thesis I will be exploring the spatial and temporal behavior patterns of water-based recreationists at a popular lake destination in Grand Teton National Park. More specifically, I will be examining whether there are differences in the movements of three primary paddlesport user groups: canoers, kayakers, and stand-up paddleboarders. I will analyze the total distance people traveled, the amount of time people spent on the lake, the total distance traveled from shore, and whether there are hot/cold spots of visitor use. This spatial analysis will be coupled with a survey that uses goal interference theory to explore perceptions of conflict between and among these user groups. Each person who receives a GPS unit will also participate in a survey about their experience and self-reported behavior during their visit. The analysis for the survey component of the research will be a bivariate regression analysis, examining the relationship between user group (independent variable) and perception of conflict (dependent variable). This research will be one of the first to combine survey data with spatial data to understand how people perceive and respond to conflict in time and space within water-based recreation settings. Further, this research will help fill the gap in knowledge about the spatial/temporal movements of water-based recreationists in parks and protected areas.

The caveat is that I do not yet have these data, as I will be collecting them this summer. Therefore, for the purposes of this class, I will be using a mock dataset that should allow me to practice spatial analysis tools applicable to my upcoming research. It is important to note that the dataset I’m using for this class is not water-based, but rather land-based hikers along a complex trail system. Because water-based recreation movement is typically more diffuse than trail-based recreation, and because I don’t have survey data, the research question for this class will differ from my actual thesis research question. However, the questions proposed for this course parallel those I will be asking in my own research.

1. A description of the research question that you are exploring.

A. What spatial and temporal patterns emerge from the movements of day-use hikers in Grand Teton National Park?

This research question seeks mostly descriptive answers about human movement and behavior within this trail system. How far are people going? How much time is spent recreating? Where is visitor movement clustered? Where is movement more diffuse?

B. To what extent does group size influence the spatial and temporal behaviors of people pausing or congregating in certain areas along a trail system in Grand Teton National Park?

This research question seeks to examine the relationship between two variables and allows me to explore temporal characteristics of visitor movement.

I imagine as I delve into the data other research questions will emerge.

2. A description of the dataset you will be analyzing, including the spatial and temporal resolution and extent.

The dataset I will be analyzing is a collection of 652 GPS tracks of day-use visitors at String and Leigh Lakes in Grand Teton National Park. These GPS units were distributed to a random sample of visitors between July 15 – September 8, 2017. Each intercepted visitor was asked to carry the GPS unit with them throughout the duration of their visit at String and Leigh Lakes. When deploying the units, study technicians also recorded the total number of people in the group, and the intended destination for their day visit. To maintain independence between samples, only one GPS unit was given to each group.

The GPS units used in this study were Garmin eTrex 10 units. These units collected point data every 5 seconds. The GPS tracks were saved as point features for analysis in ArcGIS so that each visitor’s hiking path can be represented by a series of points. The positional error of these units can be up to 15 meters. However, the Garmin units were calibrated against a high-accuracy Trimble GPS unit, which indicated a low average positional error of 1.18 meters.
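Once collected, these 5-second point fixes make per-visitor summaries (total distance, time on trail) straightforward to compute outside of ArcGIS as well. Below is a minimal sketch in Python; the function names and sample coordinates are my own, and real tracks would first need cleaning for dropped fixes and GPS jitter within the stated error range.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def track_summary(fixes, interval_s=5):
    """Total distance (m) and elapsed time (s) for one visitor's track.

    fixes: list of (lat, lon) tuples in recorded order, one per GPS fix.
    """
    dist = sum(haversine_m(*fixes[i], *fixes[i + 1]) for i in range(len(fixes) - 1))
    return dist, (len(fixes) - 1) * interval_s
```

Summing segment distances like this is the same logic ArcGIS applies when converting the point series back into a track length.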

3. Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

I expect to find various spatial and temporal hot spots in the trail system surrounding String and Leigh Lakes. Specifically, I predict that people will cluster around the eastern shoreline of String Lake, the section of shoreline that connects to the parking lots. I imagine this clustering will be influenced by a couple of factors: 1.) this area is closest to the parking lots, allowing for easy access to and from vehicles; and 2.) this area provides beach access, with sections of land denuded of vegetation providing spaces for picnicking, lounging, and watersport activity.

I expect that larger groups will take more breaks than smaller groups. Specifically, I predict a positive relationship between group size and stopping behavior, i.e., as group size increases, so will stopping behavior. A process that may produce this pattern is that having more people in a group increases the likelihood that at least one person will want, or need, to stop; when that happens, everyone in the group is more likely to stop.

4. Approaches: describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

As stated previously, my aim for this course is to learn analysis tools that will enable me to analyze my research data next Fall. There may be additional questions and tools I will discover throughout this course. As of now, to answer the above research questions, I intend to learn the following analyses:

Spatial Pattern Analysis Tools:

  1. Density analysis – where are there clusters in visitor use? I’d like to try a kernel density analysis to achieve this.
  2. Hotspot analysis – are these clusters statistically significant compared to use in other areas of the trail system?
  3. Nearest Neighbor analysis – I am interested in learning how to use this tool but am unsure if it is appropriate given my data set. I need to investigate this analysis further.
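For the density analysis in particular, a kernel density estimate can be prototyped outside of ArcGIS. Here is a minimal sketch using SciPy's gaussian_kde; the clustered "beach" points, grid extent, and projected coordinates are all invented for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
# Hypothetical projected coordinates (meters): a tight cluster (e.g., a beach)
# plus diffuse background points across the rest of the study area
cluster = rng.normal(loc=(500.0, 300.0), scale=25.0, size=(200, 2))
background = rng.uniform(0.0, 1000.0, size=(100, 2))
pts = np.vstack([cluster, background])

# Fit the KDE and evaluate it on a regular 50 x 50 grid
kde = gaussian_kde(pts.T)
xs, ys = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 1000, 50))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
```

The resulting surface is the continuous analogue of the ArcGIS Kernel Density output; testing whether a cluster is statistically significant (the hotspot question above) is a separate step, e.g. the Getis-Ord Gi* statistic behind ArcGIS's Hot Spot Analysis tool.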

Modeling Spatial Relationships Using Regression Analysis Tools:

  1. Ordinary Least Squares Regression – determine the relationship between group size (independent variable) and stopping behavior (dependent variable).
  2. Unknown. Perhaps there are more appropriate analyses available to answer this question. Ultimately, I would like to learn how to do bivariate and multivariate correlation analysis in this course as these approaches will be used in my own research.
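As a preview of the OLS step, here is a minimal sketch using SciPy's linregress on made-up group-size and stop-count data (the numbers are invented; stopping behavior would actually be derived from the GPS tracks).

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical observations: group size and number of stops per track
group_size = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 6])
stops = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 7])

fit = linregress(group_size, stops)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r^2={fit.rvalue**2:.3f}")
```

A positive slope would support the hypothesized relationship, and fit.pvalue tests whether the slope differs from zero; the same model in ArcGIS's OLS tool would additionally report spatial diagnostics on the residuals.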

Spatial-Temporal Analysis Tools:

  1. ArcGIS space-time cube – to determine the length of time people spend in certain areas. In general, I am interested in learning more about how to apply temporal analysis to these data.
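Short of a full space-time cube, dwell time can be approximated by binning each track's 5-second fixes into grid cells and counting fixes per cell. A minimal sketch, where the cell size and coordinates are hypothetical:

```python
import numpy as np
from collections import Counter

def dwell_time(points, cell_size=50.0, interval_s=5):
    """Approximate seconds spent in each grid cell by one track.

    points: (n, 2) array-like of projected x, y coordinates in meters,
    recorded at a fixed interval (5 s for the eTrex units).
    """
    cells = np.floor(np.asarray(points, dtype=float) / cell_size).astype(int)
    counts = Counter(map(tuple, cells))  # number of fixes per cell
    return {cell: n * interval_s for cell, n in counts.items()}
```

Aggregating these per-track dictionaries across all 652 visitors would yield a use-intensity surface; the space-time cube essentially adds temporal binning on top of this spatial binning.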

5. Expected outcome: what do you want to produce — maps? statistical relationships? Other?

I’d like to create maps that represent visitor density and hot/cold spots. This will visually indicate where people are clustering both spatially and temporally.  I also want to produce a linear representation of the relationship between group size and stopping behavior. I’d also like to represent temporal results in a way that is digestible to outside audiences; perhaps through the space-time cube?

6. Significance. How is your spatial problem important to science? to resource managers?

Parks and protected area land managers strive to provide a quality user experience while also protecting natural and cultural resources. Accurately understanding how people move and behave in a recreation system allows for more informed management decision-making. For example, understanding where and when there are hot spots in visitor use could indicate a need for additional infrastructure, signage, or educational initiatives, depending on the management objectives for the area. Additionally, by exploring spatial relationships between variables (in this case, between group size and stopping behavior), the results can have predictive power for managers.

In the scientific and academic communities, applying spatial methods to outdoor recreation science allows for a more accurate understanding of how people move, experience, and interact in outdoor spaces. By integrating GIScience with other common social science techniques in outdoor recreation — such as surveys, observations, and interviews — scientists glean richer results that can support and contribute to existing theory, generate deeper understandings about human behavior, and inspire additional studies.

7. Your level of preparation: how much experience do you have with (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R?

I am new to all the tools necessary for answering these research questions. Therefore, I anticipate needing to spend additional time outside of class familiarizing myself with the software before diving into the analysis.

a.) Arc-Info — I took an introductory GIS course during Winter term 2018. While I did well in the course, I did not gain as much hands-on experience with ArcGIS as I would have liked.

b.) I used Modelbuilder once during a lab exercise. Other than that, I have little experience. I have no experience in Python.

c.) I am familiar with R and became fast friends with YouTube and Google while learning how to use this software. I initially learned R in the Statistics 511 course. I also used R to analyze and graphically represent summary statistics from numerous datasets for a large visitor use and visitor impact study for the National Park Service.

 


© 2018 GEOG 566   Powered by WordPress MU    Hosted by blogs.oregonstate.edu