GEOG 566

         Advanced spatial statistics and GIScience

April 27, 2018

Fitting and Assessing Fit of Lognormal Regression to Phenology Data

Filed under: 2018,Exercise/Tutorial 1 2018 @ 6:30 pm


The purpose of this analysis was to find ways to describe phenological data I gathered in 2017, in order to quantify trends and compare between sites. Phenology was recorded for two interacting species: the introduced cinnabar moth (Tyria jacobaeae), and native perennial herb Senecio triangularis, which is a novel host to the moth larvae in North America. For S. triangularis, individual flowerheads (capitula) were scored into six phenological stages: primorida (P), buds (B), young flower (YF), flowers (F), fruit (FR), and dehisced fruit (D).

The motivating question is: does the flowering phenology differ by site for five surveyed sites? And if so, is the difference apparent at all flowering stages, or only for specific stages?

In order to complete this analysis, I first needed to convert my survey dates to growing degree days, defined as accumulated daily heat gain above 5˚ and below 37˚ C. The development thresholds used here were estimated from previous studies on phenology of alpine flowers (Kudo and Suzuki, 1999). I used the single triangle method with upper threshold (Sevacherian et al., 1977) with 2017 Tmin and Tmax rasters acquired from PRISM. This was accomplished using R code developed by Dr. Tyson Wepprich (personal communication), and provided me with accumulated or growing degree days (hereafter GDD) as a unit of thermal time, against which I plotted the response variable of capitula counted in one of the flowering stages.

I will focus on F stage capitula to describe the workflow that will be expanded to each flowering stage. I hoped to be able to describe an overall behavior for observations of F stage capitula (hereafter called F), producing a predictive curve fit to the data. The goodness of fit of this curve could then be examined for each site separately, to yield an estimate in difference in timing at certain sites.



I used a combination of tools to choose and fit a curve to my observations of F against GDD. First, I used an R package called fitdistrplus to visually compare and also estimate parameters for a variety of curves, including Weibull, gamma and lognormal curves. Then I used the built-in nls (non-linear least squares) function in R with parameter starting estimates determined from output of fitdistrplus.



Initial trials with fitdistrplus showed that the package was only able to fit density plots of one variable. In order to create a density plot that reflected capitula counts on a given degree day, I used package splitstackexchange. I used the expandRows() command to expand rows of the data frame by values in the column contain F counts. In the resulting dataset, each observation of a time (GDD) represented an individual capitula counted at that time, so that a plant with seven F capitula at time t would be represented by seven observations of t. This method was in part based upon methods by Emerson and Murtaugh for the statistical analysis of insect phenology (Murtaugh et al., 2012).

Using the fitdist() command from fitdistrplus and the expanded set of GDD observations for F, I fit gamma, lognormal, and Weibull curves to my data, then calling the denscomp function to visually compare each curve against the density plot of observations. I repeated this method across all floral stages to determine that lognormal curves seemed to best reflect visual trends in my data (Figure 1a & b).

Figure 1a: Observations of F stage capitula against growing degree days

Figure 1b. Density plot produced with fitdistrplus showing observations of F-stage capitula at time x in growing degree days, with fitted Weibull, lognormal & gamma curves.

The lognormal fitdist object created for the steps above also contained estimates of the parameters for the lognormal curves, which I extracted with the summary() call.

Seeking a more flexible lognormal model, I saved these estimated parameters and turned next to the built-in nls() (non-linear least squares) function. This function requires the user to specify a formula and vector of starting estimates for parameters, and then uses least squares methods to find a best fit for the parameters with the provided data. My formula in this case was the lognormal probability density function (pdf):

I created an nls object using this formula, the original (non-expanded) dataset, and the estimates from the fitdist() function for starting estimates of the parameters  and . I was able to obtain an nls object using these methods, but had a scaling issue between the axes – because my x axis was stretched over 900 degree days, and because of inherent properties of the lognormal pdf, the scale of the y axis was off by about three orders of magnitude. To address this, I scaled GDD by dividing the number of degree days by 100, and converted my counts to a per-plant proportion of F. Obtaining new estimates from fitdist() for  and, I found these adjustments indeed yielded an appropriately fitting curve for my data. I used least squares and root mean squared error (RMSE) methods to check the fit of the curve.



With the predict() function and the nls object fit above, I plotted a predictive curve based on my lognormal model against the prepared data, and used a visual check to determine that this result appeared to be a relatively good fit, especially compared to earlier attempts to fit simple polynomials to the data (Figure 2).

Figure 2. Observations of F stage capitula fitted with lognormal (red) and cubic polynomial (blue) curves using non-linear least squares method.

The original goal was to compare observations for each site to this overall predictive curve. I was able to plot site-specific data with the curve for visual inspection (Figure 3a-e).

Figure 3a.

Figure 3b.

Figure 3c.

Figure 3d.

Figure 3e.

Additionally, I calculated mean absolute error (MAE) and root mean squared error (RMSE) to summarize the difference between observed values for each site and predicted values given by the overall lognormal curve. Of the five sites, Juniper had the lowest values for both RMSE  (0.1274 compared to 0.1848 for the data overall) and MAE (0.0712 compared to 0.103 overall); and Waterdog had the highest RMSE and MAE (0.2202 and 0.1269 respectively). A summary of these values is included in Figure 4:

Figure 4. Lognormal curve fit by nls shown against F stage observations by site. Inset shows root mean squared error (RMSE) and mean absolute error (MAE) calculated overall and by site.


There was no one tool I could find that was able to fit a phenomenological model to these data. The use of fitdistrplus in combination with nls was an adequate solution because in the end it did yield an appropriate looking curve. To use fitdistrplus, I had to collapse my explanatory and response variables into one variable whose value represented the explanatory variable and frequency represented the response variable. This was a little overcomplicated for my purposes, but did allow me to visually compare fitted lognormal, gamma & Weibull curves, which was a great benefit.

The nls function, which allowed me to switch back to my initial data format, allowed me to easily extract estimates and predicted values, and plot the curve of predicted values. This was extremely useful for both visual and mathematical assessments of fit. A drawback of the nls function is that without starting parameters estimated by fitdistrplus, it was extremely difficult to fit a lognormal curve; some starting points yielded no tractable model and others yielded models with little context about the goodness of fit.

I am concerned that I could only fit the curve by rescaling the data, which may due to an underlying property of the lognormal pdf: that the area under the curve must be equal to 1. At moments I felt unsure whether I was fitting the data to the curve or the curve to the data.

But in the end, this method yielded a lognormal curve which is able to capture important properties of my count/proportion data, such as rising and falling at appropriate rates around the peak and not dipping into negative values. I will continue to explore, and attempt to verify that the method I used for estimating parameters in fitdistrplus to be used in nls was valid.


Kudo, G., and Suzuki, S. (1999). Flowering phenology of alpine plant communities along a gradient of snowmelt timing.

Murtaugh, P.A., Emerson, S.C., Mcevoy, P.B., and Higgs, K.M. (2012). The Statistical Analysis of Insect Phenology. Environ. Entomol. 41, 355–361.

Wepprich, Tyson. (2018). “PRISM_to_GDD_daterange”, R Code.

Sevacherian, V., Stern, V.M., and Mueller, A.J. (1977). Heat Accumulation for Timing Lygus Control Measures in a Safflower-Cotton Complex13. J. Econ. Entomol. 70, 399–402.


Print Friendly, PDF & Email

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

© 2019 GEOG 566   Powered by WordPress MU    Hosted by