### Consolidating Spectral Variables – PCA

Before comparing spatial and spectral variables, I needed to look more closely at each set on its own. Retaining a large number of spectral variables would make my models much more complicated, so it would be advantageous to reduce that number. A correlation matrix constructed with the R function ‘corrplot’ (Figure 14) shows several strong correlations between variables. In particular, the red edge band has strong positive correlations with the other 4 MicaSense RedEdge bands. This suggested that principal components analysis (PCA) could be used to condense them.

**Figure 14:** Correlation matrix of mean crown response for the 5 spectral bands from the MicaSense RedEdge sensor (blue, green, red, near-infrared, red edge) and two spectral vegetation indices (NDVI and TGI).
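The same screening step can be sketched outside of R. The block below is a minimal Python illustration, not the corrplot workflow used for Figure 14: the data are synthetic stand-ins, the variable names are hypothetical, and the 0.8 cutoff for a "strong" correlation is an arbitrary assumption.

```python
import numpy as np

# Synthetic stand-in data: 200 hypothetical crowns, 4 bands sharing one signal
# plus one unrelated variable. These are NOT values from the study.
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
names = ["blue", "green", "red", "red_edge", "unrelated"]
X = np.column_stack(
    [shared + 0.3 * rng.normal(size=200) for _ in range(4)]
    + [rng.normal(size=200)]
)

# Pearson correlation matrix; rowvar=False treats each column as a variable.
R = np.corrcoef(X, rowvar=False)


def strong_pairs(corr, labels, threshold=0.8):
    """List variable pairs whose absolute correlation meets the threshold."""
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((labels[i], labels[j], round(float(corr[i, j]), 2)))
    return pairs


for a, b, r in strong_pairs(R, names):
    print(f"{a:>9} ~ {b:<9} r = {r}")
```

Many strongly correlated pairs, as in Figure 14, are a hint that a few components may carry most of the information.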

Though I knew I wanted to use PCA to reduce the number of spectral variables in my dataset and had some evidence that it could work, I still needed to decide how many components to retain. Plotting the eigenvalues in a scree plot using the R package ‘doParallel’ (Figure 15) made it evident that retaining three principal components would be appropriate.

**Figure 15:** Scree plot for the 7 spectral variables. Based on the shape of the curve and the eigenvalue = 1 line, 3 principal components should be retained.
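The eigenvalue = 1 rule (often called the Kaiser criterion) is simple to compute by hand. The following is a rough Python sketch under stated assumptions: the function names are my own, and the 7 variables are synthetic, built from 3 made-up latent signals rather than taken from the study.

```python
import numpy as np


def scree_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, largest first."""
    R = np.corrcoef(X, rowvar=False)
    # eigvalsh is meant for symmetric matrices and returns ascending order.
    return np.linalg.eigvalsh(R)[::-1]


def kaiser_count(eigenvalues):
    """Number of components whose eigenvalue exceeds 1."""
    return int(np.sum(np.asarray(eigenvalues) > 1.0))


# Synthetic example: 7 hypothetical variables driven by 3 latent signals.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
mixing = np.array(
    [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]],
    dtype=float,
)
X = latent @ mixing.T + 0.3 * rng.normal(size=(300, 7))

eigvals = scree_eigenvalues(X)
print(np.round(eigvals, 2), "-> retain", kaiser_count(eigvals))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 means the component explains more variance than any single standardized variable does on its own.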

The R package ‘psych’ was used to carry out PCA with 3 components on the 7 spectral variables. Results (Figure 16) indicate that the three components account for 94% of the total variance in the data. The principal components RC1, RC2, and RC3 mostly represent green, infrared, and blue, respectively. In short, the spectral dataset can be consolidated from 7 variables to 3 while retaining 94% of the original variance.

**Figure 16:** Output of the 3-component principal components analysis summarizing the 7 spectral variables. Retaining these three components explains 94% of the variance in the original variables.
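To make the "variance explained" figure concrete, here is a minimal Python sketch of PCA on a correlation matrix. It is an assumption-laden illustration rather than the psych code behind Figure 16: `pca_summary` is a hypothetical helper, the data are synthetic, and the components are left unrotated. The `principal` function in psych applies a varimax rotation by default (hence the "RC" labels), so individual loadings from this sketch would not match rotated output, although an orthogonal rotation leaves the total variance explained by the retained components unchanged.

```python
import numpy as np


def pca_summary(X, n_components):
    """Unrotated PCA on the correlation matrix.

    Returns (loadings, explained, scores), where `explained` is the share of
    total variance captured by the retained components.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize columns
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)                # ascending order
    order = np.argsort(eigvals)[::-1]                   # largest first
    eigvals = np.clip(eigvals[order], 0.0, None)        # guard tiny negatives
    eigvecs = eigvecs[:, order]
    kept = eigvecs[:, :n_components]
    loadings = kept * np.sqrt(eigvals[:n_components])   # variable-component correlations
    explained = float(eigvals[:n_components].sum() / eigvals.sum())
    scores = Z @ kept                                   # one row per observation
    return loadings, explained, scores


# Synthetic example: 7 hypothetical variables driven by 3 latent signals.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 3))
groups = [0, 0, 0, 1, 1, 2, 2]                          # latent signal behind each variable
X = latent[:, groups] + 0.3 * rng.normal(size=(300, 7))

loadings, explained, scores = pca_summary(X, n_components=3)
print(f"variance explained by 3 components: {explained:.0%}")
```

The `scores` array is what would replace the original 7 columns in later models: 3 values per crown instead of 7.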

### Choosing Spatial Variables – Hot Spot Analysis and GWR

Though several spatial variables had been added to the data, it was unclear how those variables related to the spectral data. As a preliminary investigation, I built an ordinary least squares (OLS) model in ArcMap with NDVI as the dependent variable and all 4 spatial variables as explanatory variables. The variable distributions and relationships plot from the OLS output (Figure 17) suggests that box number is the spatial variable most likely to have a significant effect on the spectral data. Box center and plot center both appear slightly skewed in their residual plots, while nearest neighbor appears to be normally distributed.

To further inform this analysis, hot spot analysis (Getis-Ord Gi*) was carried out on the NDVI variable (Figure 18). The results show clear spatial trends. The most obvious disparity lies between the boxes. There is also some evidence that position within a box has an effect: the healthiest seedlings tend to appear in the center of boxes, whereas the least healthy seedlings tend to appear around the edges.

**Figure 17:** Variable and residual distributions for spatial variables in the study. Note that nearest neighbor (nn), box center, and plot center appear to have randomly distributed residuals, whereas box # (box_) does not.
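For readers without ArcMap, the preliminary OLS step can be approximated in a few lines. This is only a sketch under assumptions: `ols_fit` is a hypothetical helper, the four predictor columns are synthetic stand-ins for the spatial variables, and box number is entered as a plain numeric column, which may not match how the ArcMap model was set up.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, residuals, r_squared); coefficients[0] is the intercept.
    """
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])           # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, resid, 1.0 - ss_res / ss_tot


# Synthetic example (NOT study data): NDVI depends mostly on box number.
rng = np.random.default_rng(3)
n = 120
box = rng.integers(1, 7, size=n).astype(float)           # hypothetical box number
nn_dist = rng.uniform(0.1, 0.5, size=n)                  # nearest-neighbor distance
box_center = rng.uniform(0.0, 1.0, size=n)               # distance to box center
plot_center = rng.uniform(0.0, 10.0, size=n)             # distance to plot center
ndvi = 0.60 + 0.02 * box - 0.01 * box_center + 0.01 * rng.normal(size=n)

X = np.column_stack([box, nn_dist, box_center, plot_center])
coef, resid, r2 = ols_fit(X, ndvi)
print("coefficients:", np.round(coef, 3), "R^2:", round(r2, 3))
```

Plotting `resid` against each predictor gives a rough equivalent of the residual panels in Figure 17.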

**Figure 18:** Hot spot analysis of seedlings based on NDVI shows spatial irregularities between and within boxes. This figure serves as a baseline for the regressions that follow and suggests that both box # and distance to the center of the box may be significant covariates.
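The Getis-Ord Gi* statistic behind a hot spot map is a z-score comparing the sum of values in each point's neighborhood with what the global mean would predict. The sketch below is a simplified Python illustration, assuming binary weights within a fixed distance band. The neighborhood definition used for Figure 18 is not stated here, so the `radius` parameter, the function name, and the toy data are all assumptions.

```python
import numpy as np


def getis_ord_gi_star(coords, values, radius):
    """Gi* z-score for each point, using binary weights within `radius`.

    Each point counts as its own neighbor (the "star" variant). The result is
    undefined (division by zero) if a neighborhood contains every point.
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    W = (dist <= radius).astype(float)                  # includes self: dist is 0
    x_bar = x.mean()
    s = np.sqrt((x ** 2).mean() - x_bar ** 2)           # population std. deviation
    w_sum = W.sum(axis=1)
    w_sq_sum = (W ** 2).sum(axis=1)
    numerator = W @ x - x_bar * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Toy example: a row of 10 points with high values clustered at one end.
coords = [(i, 0.0) for i in range(10)]
values = [10, 10, 10, 1, 1, 1, 1, 1, 1, 1]
z = getis_ord_gi_star(coords, values, radius=1.0)
print(np.round(z, 2))
```

Large positive z-scores mark hot spots (clusters of high NDVI) and large negative z-scores mark cold spots. Values beyond roughly ±1.96 are conventionally treated as significant at the 5% level, before any correction for multiple testing.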

Using geographically weighted regression (GWR) in ArcMap, I looked at how the spatial pattern of NDVI variability changed as each of my spatial variables was included as an effect. Because box # seemed to have the greatest effect, I started by modeling it alone as a predictor of NDVI. Compared to the hot spot analysis, the result (Figure 19) is visibly closer to a random spatial distribution of variability. Adding box center to the model (Figure 20) had only a marginal effect, though some outliers at the east and west extremes of the site were corrected.

**Figure 19:** GWR of NDVI by box. The reduction in the number of severe outliers in box centers suggests that the box effect is likely significant.

**Figure 20:** Adding distance to box center to the model had only a small effect. The distribution of variability in the regressed model appears much more uniform and is presumably closer to corrected NDVI values, free of the spatial effects of box number and box center.
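Conceptually, GWR fits a separate weighted regression at every location, giving nearby observations more influence than distant ones. The following Python sketch shows that idea with a Gaussian kernel and a fixed bandwidth. Both choices are assumptions, as are the function name and the synthetic data; the kernel and bandwidth settings used in ArcMap for Figures 19 and 20 may well differ.

```python
import numpy as np


def gwr_fit(coords, X, y, bandwidth):
    """Fit one distance-weighted least squares model per location.

    Returns (betas, residuals): betas[i] holds the local intercept and slopes
    at location i, and residuals[i] is y[i] minus the local prediction.
    """
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])           # prepend intercept column
    n, p = A.shape
    betas = np.empty((n, p))
    fitted = np.empty(n)
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)      # Gaussian kernel weights
        sw = np.sqrt(w)
        # Scaling rows by sqrt(w) turns weighted least squares into plain lstsq.
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        betas[i] = beta
        fitted[i] = A[i] @ beta
    return betas, y - fitted


# Synthetic example (NOT study data): seedlings on a grid with a box effect.
rng = np.random.default_rng(4)
coords = np.array([(i, j) for i in range(8) for j in range(8)], dtype=float)
box = rng.integers(1, 5, size=len(coords)).astype(float)  # hypothetical box number
ndvi = 0.55 + 0.03 * box + 0.01 * rng.normal(size=len(coords))

betas, residuals = gwr_fit(coords, box, ndvi, bandwidth=2.0)
print("mean local box coefficient:", round(float(betas[:, 1].mean()), 3))
```

In this framing, the residuals are one way to think about "spatially corrected" NDVI: what remains after the local effect of the spatial predictors has been removed.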

Though there is still much work to be done, this workflow lays the foundation for processing multispectral imagery that has, or may have, spatial biases, and it proposes ways to evaluate and work through them. Many applications exist for this type of work. In my study, the hope is to use spatially corrected NDVI values to classify seedlings by their drought response, which could end up looking very similar to Figure 21.

**Figure 21:** Potential application for spatially corrected data: cluster and outlier analysis to identify individuals and groups of individuals with unexpected NDVI values compared to predictions based on spatial variables. Seedlings in high clusters (pink) or high outliers (red) could be strong candidates for genetic drought resistance.