QUESTION:

In this exercise, I used logistic regressions to investigate whether whether three explanatory variables (growing degree days, soil temperature, soil moisture) could be related to changes in the probability of finding vulnerable versus invulnerable plants in my 2017 data. Logistic regressions allow us to fit models to data of probabilities that range between 0 and 1 (Hosmer et al., 2013). **Vulnerable** plants were defined as plants with any vulnerable capitula present whereas plants were defined as **invulnerable** if they had only invulnerable stages (vulnerable stages: primordia, buds, young flowers and flowers; invulnerable stages: fruit and dehisced fruit). Based on these definitions, I created binary response variable at the plant level:

vul.plant = 1 for a plant having any vulnerable capitula

vul.plant = 0 for a plant that has only invulnerable capitula

Because all flowering capitula begin at a vulnerable stage (primordia, bud) and pass through to invulnerable stages (fruit, dehisced fruit), a baseline expectation was that early season samples would have a probability of vulnerability close to 100%, and that samples late in the season would have a probability of vulnerability close to 0%. **(Prediction 1):** **growing degree days are expected to explain a significant degree of the change in probability from 100% to 0%.**

Previous analysis showed that warmer soil temperatures (˚F) were linearly associated with higher growing degree days, so I expected that soil temperature should explain some degree of the change in probability of vulnerability, both through this correlation with growing degree days and also through directly impacting flowering phenology. **(2) Higher soil temperature is expected to be associated with decreased probability of vulnerability**. On the other hand, the correlation between growing degree days and soil moisture (measured as percent soil moisture) was negative and neither as strong nor as consistent as soil temperature. For these reasons, soil moisture might explain a portion of the change in probability of plant vulnerability**, **and** (3)** **higher soil moisture is expected to be associated with increased probability of vulnerability**.

Finally, I also wanted to ask whether the patterns shown in the logistic regressions I performed **would differ between the five surveyed sites** represented in the 2017 data.

APPROACH:

To complete this analysis, I used the generalized linear model function **glm()** to perform logistic regressions on data prepared in tutorial 2. Because soil temperature measurements were significantly impacted by the time of day at which they were measured, I removed the linear trend between soil temperature measured and time of day of measurement, creating adjusted soil temperature (also in ˚F). Soil moisture was used directly.

Binary response variable for plants was created using **dplyr** package and the **summarise** function to collapse per-capitula phenology scores to plant-level presence/absence scores for each flowering stage, then further collapsing the presence/absence scores for six stages into two classes, “vulnerable” and “invulnerable,” as defined above. Similar methods were used to create a binary variable for the presence of virulent larvae (where virulent was defined as larval stages L4 & L5), and a third binary variable for phenological synchrony, defined as presence on a single plant of both vulnerable capitula and virulent larvae. These additional binary variables will not be analyzed in this exercise.

METHODS:

- Create binary response variables – this was a little tricky because initial data structure was plant-level counts of capitula by stage. Using
**dplyr**package and the**group_by**and**summarise**functions, I grouped the data by all plant-level variables I wished to keep in the resulting dataset, and then used summarise to create 2 new binary response variables: presence of vulnerable capitula (vulnerable capitula present = 1, none present = 0), and presence of virulent larvae (virulent larvae present = 1, none present = 0). I created a third binary response variable by multiplying the first two to document “phenological synchrony”, the co-occurrence on a single plant of vulnerable capitula and virulent larvae (virulent larvae. - Once data were properly wrangled, I used the
**glm()**function with family set to binomial to fit logistic regression models of the binary response variable of interest (plant vulnerability) as a function of included explanatory variables. - A
**summary()**of the glm object gave estimates of the model coefficients, and their significance levels (z scores, p-values). Based on methods outlined in the datacamp module on logistic regressions, I used**predict()**function to create a list of fitted probabilities and check how often these correctly classify plants as vulnerable or invulnerable (https://www.datacamp.com/community/tutorials/logistic-regression-R). - To visualize the results as a smooth curve of predicted probabilities given by the glm object, I again used the
**predict()**function with a regular sequence of x values overlapping the range of the explanatory variable and plotted these together using the baseplot or ggplot functions. For each explanatory variable, I eventually created a single plot with an overall model fit, and models fit to data from each individual site, in order to visually compare patterns/responses at different sites.

RESULTS:

With plant vulnerability as the response variable, I fit a model with growing degree days (in hundreds) as the single explanatory variable to confirm my initial expectation, that increased growing degree days would be associated with a move from near 100% vulnerability to 0% vulnerability. A logistic regression model fit to the data showed strongly significant estimates for the coefficients. The mean success rate for predicting vulnerability classification correctly using this fitted model was 91%. A visual representation of the probability of plant vulnerability showed the probability of plants being vulnerable starting very high (close to 1.0) and ending very low (close to 0.0) with the threshold value of 0.50 falling at about 720 growing degree days (Figure 1). A drop-in-deviance test for the model including growing degree days against a reduced model of plant vulnerability gave strong evidence (P<0.001 for ** χ^{2}** with df = 1) that the probability of a flower being vulnerable depends on number of growing degree days.

Next, I fit a model with plant vulnerability as the response and adjusted soil temperature (˚F) as the explanatory variable, understanding that some of the behavior seen between plant vulnerability and soil temperature can be attributed to the positive correlation between adjusted soil temperature and growing degree days (Figure 2). Coefficients fitted in this model were also highly significant (***see note on these p-values under “CRITIQUE”); but the mean success rate for fitted predictions was at 62% using this fitted model. Probabilities plotted using this fitted model ranged between 0.8 and 0.2 for the range of adjusted soil temperatures represented in the data. I found strong evidence (P<0.001 for ** χ^{2}** with df = 1) that the probability of a flower being vulnerable depends on adjusted soil temperature. The correlation between growing degree day and adjusted soil temperature is estimated at 0.30.

I also fitted individual logistic regression model for plant vulnerability against adjusted soil temperature for each site, and found that Buck Mt. and Holland Mdw. had significant p-values for coefficient estimates while Juniper Mt., Waterdog Lk., and Blair Lk. did not***. A plot of fitted values from each site-level logistic regression model and an overall logistic regression model of plant vulnerability against adjusted soil temperature is given in Figure 3. In this figure, we see that the curve fit for Blair never predicts plants as being ‘vulnerable’, since the curve does not reach above probability 0.5, while curves for several other sites (Buck, Juniper, Holland and weakly, Waterdog) predict a switch from high to low probability of vulnerability with increasing soil temperature. The poorness of fit for Blair Lake could be due to missing data, since one survey date did not have soil temperature data for that site.

We see differences in how well these fitted models describe and predict the occurrence of vulnerable plants at the five sites; I do not see clear indication that the inflection points differ significantly between sites, however the rate of change from probabilities near 1.0 to probabilities near 0.0 is steeper/quicker for certain sites.

Logistic regressions of plant vulnerability as a function of soil moisture showed an opposite relationship to previous regressions: increasing values of soil moisture saw the probability of plants being vulnerable going from low (near 0.0) to high (near 1.0) (Figure 4). I found strong evidence (P<0.001 for ** χ^{2}** with df = 1) that the probability of a flower being vulnerable depends on soil moisture.

Fitting logistic regressions for each site individually did show some different patterns per site (Figure 5). Waterdog Lake reached probabilities near 1.0 much sooner than the rest, while Buck Mt., which tended to have higher soil moisture than the other four sites, had a much slower transition between low and high probabilities of vulnerability.

CRITIQUE

Previous to this analysis, I had been informed that the Wald’s test z-statistic and p-values reported in the output of the **glm()** fitted model are not a reliable test for significance of coefficient estimates and inclusion of terms. Instead, a drop-in-deviance F-test testing a full versus a reduced model will yield a p-value for the inclusion of variables represented in the full model (Ganio, 2018). The drop-in-deviance test is intuitive and interpretable, but somewhat inconvenient for an analysis involving many separate logistic regressions (like, per site), and the reporting of the Wald’s test significance levels can be misleading. For this reason, I performed a drop-in-deviance test for each of the fullest models (growing degree days, adjusted soil temperature, soil moisture) that including all data. For the by-site logistic regressions, I found that calculating the mean success rate of fitted probabilities was a nice way to compare model goodness of fit quickly (datacamp, 2018), though my own comprehension of the steps involved is not as solid as with the drop-in-deviance test, making interpretation more difficult. Calculating odds ratios and analyzing inflection points and slopes for the by-site curves would help me interpret these results in a more biologically meaningful manner, to address the overarching research questions that motivated this data analysis.

An additional critique is that for the regressions of soil temperature and moisture to be most meaningful, I would like to include growing degree days in the model so that I can test the significance of the effect of soil temperature or moisture on probability of vulnerability *after accounting for the affect of growing degree days*. The challenges which stopped my from performing this analysis were that soil temperature and growing degree day were strongly correlated, and that interpretation of the x-axis was intractable for the model that included multiple explanatory variables with different scales. Further research could help make this analysis feasible.

REFERENCES

Datacamp. (2018). *Logistic Regression in R Tutorial*. https://www.datacamp.com/community/tutorials/logistic-regression-R

Ganio, Lisa. (2018). *FES 524 Natural Resources Data Analysis Lecture, Feb 27 2018*.

Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). *Applied logistic regression* (Vol. 398). John Wiley & Sons.