Having endured much frustration trying to use Geographically Weighted Regression to determine relationships between the patch metrics I calculated using FRAGSTATS, I’ve decided to change things up a bit for my final blog post. I’ll continue working with Geographically Weighted Regression, but rather than applying it to FRAGSTATS-derived metrics (shape index, edge:area ratio etc.), I will experiment with a different LandTrendr output: forest biomass.


Explaining more about the dataset here will help me articulate the research question below. The biomass layer is a raster calculated using a Tasseled Cap Transformation, a method of enhancing the spectral information content of Landsat imagery. It is essentially a measure of per-pixel brightness (soil), greenness (vegetation), and wetness (soil and canopy moisture). I will be using yearly time-series biomass layers and “time-stamped” clear-cut disturbance patches.

Research Question:

I’m still not sure how well I can articulate the question, but here goes: Is there a statistically significant relationship between the mean biomass value within a clear-cut patch at the time of the clear-cut and the mean biomass within that same patch before and after the clear-cut?



Clear-cutting obviously results in a loss of biomass, and I expect that quantities of biomass within a clear-cut patch before, during and after a clear-cut will exhibit a significant relationship.


While I had easy access to the biomass data, creating a dataset of disturbance patches with attributes for biomass before, during, and after the timing of each clear-cut was a carpal-tunnel-inducing task. I ought to have approached it programmatically, and I did try, but my Python skills were lacking. I ended up using a definition query to filter disturbance patches by year, and then ran three Zonal Statistics operations on the biomass layers (one for the year before a given set of clear-cuts, one during, and one after). I then joined each biomass calculation back to the clear-cut patches. Below is an attribute table for one set of disturbance patches (note the three mean biomass values) and an example of a disturbance patch overlaid on a “before, during, and after” set of images. I did this for three sets of yearly clear-cuts, and then merged them into one dataset of roughly 700 features.
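For what it’s worth, the per-patch, per-year zonal means could be scripted. Below is a minimal NumPy sketch of the idea, with tiny made-up arrays standing in for the patch raster and the yearly biomass layers (in a real workflow the aligned rasters would be read with a GIS library):

```python
import numpy as np

# Hypothetical inputs: a patch-label raster (0 = background) and one
# biomass raster per year, all aligned on the same grid.
patches = np.array([[1, 1, 0],
                    [1, 2, 2],
                    [0, 2, 2]])
biomass = {2004: np.full((3, 3), 80.0),   # year before the clear-cut
           2005: np.full((3, 3), 20.0),   # year of the clear-cut
           2006: np.full((3, 3), 35.0)}   # year after

def zonal_means(labels, raster):
    """Mean raster value within each nonzero patch label."""
    return {pid: raster[labels == pid].mean()
            for pid in np.unique(labels) if pid != 0}

# One record per patch: mean biomass (before, during, after).
means = {y: zonal_means(patches, biomass[y]) for y in (2004, 2005, 2006)}
table = {pid: (means[2004][pid], means[2005][pid], means[2006][pid])
         for pid in means[2005]}
```

The resulting `table` plays the role of the joined attribute table: one row per patch with three mean-biomass columns.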


I then ran Geographically Weighted Regression, with “mean biomass after clear-cut” as the dependent variable, and “mean biomass before” and “mean biomass at timing of clear-cut” as explanatory variables.


I experimented with multiple combinations of the three biomass mean variables, and also tried adjusting the kernel type. The most significant run was that on the left, which had parameters as described above.
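For context, geographically weighted regression fits a separate weighted least-squares model at each location, down-weighting distant observations with a kernel. Here is a bare-bones sketch of that idea in Python (a Gaussian kernel with a fixed bandwidth; the data are synthetic, and this is not ArcMap’s implementation):

```python
import numpy as np

def gwr(coords, X, y, bandwidth):
    """Fit a local weighted least-squares model at every observation.

    coords: (n, 2) point locations; X: (n, p) design matrix (with an
    intercept column); y: (n,) response. Returns (n, p) local betas.
    """
    n = len(y)
    betas = np.empty((n, X.shape[1]))
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))       # Gaussian kernel
        XtW = X.T * w                                 # weighted design
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)  # local WLS fit
    return betas

# Toy data where y = 2 + 3*x everywhere: every local fit should
# recover roughly the same coefficients.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))
x = rng.uniform(0, 1, 50)
y = 2 + 3 * x
X = np.column_stack([np.ones(50), x])
local_betas = gwr(coords, X, y, bandwidth=2.0)
```

On this toy data the response is exactly linear in x, so every local fit recovers the same coefficients; with real data such as the biomass means, the local betas would vary across space, which is the whole point of GWR.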







While it was satisfying to finally produce some significant statistics, I recognize that the analysis is not groundbreaking. While change in biomass is certainly of interest to forest managers and ecologists, the way in which it was calculated here (as a mean of an annual snapshot within a patch) may not have significant implications.

What I learned:

If this course is nicknamed “Arc-aholics Anonymous” then you could say I had somewhat of a relapse, as most of my analyses throughout the quarter made use of tools I had used in ArcMap before. That said, I gained a much more thorough understanding of their functionality, and feel I have a better command over interpreting their sometimes perplexing results. I now have a much better idea of the types of datasets and variables that lend themselves to certain methods, and the experience of working with a large dataset of Landsat time-series-derived forest disturbance will be invaluable to my research moving forward. I learned a great deal from others in the course and am glad to have made some new contacts (especially you coding gurus). Some of the work produced in this course was truly outstanding and I feel inspired to hone my own skills further, particularly with open-source software.

For this final tutorial post, I’ll be describing a workflow in which FRAGSTATS is used to generate a number of continuous metrics of disturbance patches. Then, using these metrics, a Grouping Analysis is performed in ArcMap to identify groups of patches that have multiple similarities.

FRAGSTATS is a spatial pattern analysis program developed by landscape ecologists for quantifying the structure (i.e., composition and configuration) of landscapes at multiple scales (see this link to the documentation). Through the lens of a landscape ecologist, a landscape may be considered as an assemblage of patches whose spatial homo/heterogeneity characterize the landscape they comprise. While the patches of interest to landscape ecologists are often habitat types, they may represent any spatial phenomenon, including forest disturbance.

The installation of FRAGSTATS is very straightforward (link to downloads page), and the GUI is friendly! Below I outline the steps for adding data to FRAGSTATS and generating patch metrics:

1) Add a layer. FRAGSTATS accepts raster images in a variety of formats (again, see documentation here). I worked with GeoTIFF (.tif) files representing disturbance patches.

2) Select metrics to calculate. Descriptions of each metric can be found in the documentation.


3) Generate Results. Simply click the run button on the main toolbar, and view the results.


If your goal is to attach these tables back to your input data for mapping/analysis, in a GIS for example, then it is crucial to generate a “patch ID file”. To do this, simply check the box for “Generate patch ID file” under the Analysis parameters tab:


The patch ID file and associated tables will be saved to the directory you choose. Note that here I’ve checked only the box for Patch metrics. The patch ID file will have a suffix of “_id8” appended to whatever name your input file has, along with its associated extension (“input”_id8.tif). The patch metrics file will have a .patch extension. Open the .patch file in Excel or the spreadsheet editor of your choice, delimit by comma, and save it as a file type that ArcMap will recognize, such as .csv or .txt. I suggest removing the “LID” field, which contains the file path where your initial input raster resides.
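If you’d rather skip the spreadsheet round-trip, the same cleanup can be scripted. Here is a standard-library-only Python sketch; the .patch contents and file layout below are fabricated for illustration:

```python
import csv, io

# Stand-in for the contents of a FRAGSTATS .patch file; a real run
# would use open("output.patch") instead of io.StringIO.
patch_text = """LID, PID, TYPE, AREA
C:\\data\\input.tif, 1, cls_1, 0.27
C:\\data\\input.tif, 2, cls_1, 1.53
"""

reader = csv.reader(io.StringIO(patch_text), skipinitialspace=True)
rows = list(reader)
header = rows[0]
drop = header.index("LID")                 # file-path column to remove
cleaned = [[v for i, v in enumerate(row) if i != drop] for row in rows]

out = io.StringIO()                        # stands in for output.csv
csv.writer(out, lineterminator="\n").writerows(cleaned)
```

The result is a comma-delimited table with LID dropped, ready for the tabular join on PID described in step 4.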

4) Join output .patch file to patch ID file. In ArcMap, bring in both the patch ID file and the copy of the .patch file in .csv or .txt format. Then, proceed with a tabular join:


Right-click the _id8 file and click Join… In the prompt window that appears, choose “Value” for the field in the patch ID file, and PID for the field in the table of patch attributes. Click OK. The patch attributes are now joined to each patch in the patch ID file. Open the attribute table of the patch ID file to verify. Remember to export this file to make the join permanent.

*Note: if you don’t see “Value” as an option for the join field, you may have to run the “Build Raster Attribute Table” tool on your patch ID file.

5) Proceed with Mapping/Analysis!

*Note that the FRAGSTATS tabular outputs are useful in their own right. Below I’ve charted “class” level metrics (as opposed to patch level) that show trends in clear-cut patches over time. In this case, I filtered the disturbance data for clear-cuts only, and had FRAGSTATS generate outputs by year (year is the “class”).








The Willamette Valley of the Terminal Pleistocene (10,000+ years ago) was a wildly different place than the valley we are used to seeing today. There were massive animals roaming the valley floor, which was a lot wetter, with more marshes, bogs, and lakes. Oak savannah and forested swaths of land made their way towards the Willamette River… This wonderful landscape was likely also inhabited by humans, though that is a very difficult question to explore.

In order to find out where in the Pleistocene Willamette Valley people might have lived, we must first understand the sediments that lie under our feet here in the valley, and the best way to do that is by extracting them from the ground and analyzing them.

This project is part of a bigger-picture study that uses sediment cores extracted from a buried peat bog found at Woodburn High School in Woodburn, Oregon to identify the buried sediments that are of the appropriate age and environment to contain both potential Pleistocene-aged archaeological sites and further evidence of Pleistocene megafauna.


Research Question

For the purposes of this project, I am seeking a different method for identifying stratigraphic breaks in a sediment profile. The most popular methods for identifying stratigraphy are visual examination, multivariate statistical analysis, and texturing the sediment.

Using x-ray fluorescence geochemical data and wavelet analysis, a method typically used for studying time-series data, is it possible to determine the site stratigraphy using one or two different variables?


The Data

The dataset consists of XRF data taken at 2 mm intervals from 65 1.5-meter core samples. These cores come from 14 different boreholes covering the majority of the defined study area, which is approximately a 200×50-meter area. The cores were extracted using a Geoprobe direct-push coring rig.


The Site:


The Geoprobe in action at the site:


The core samples were halved and run through an iTrax core scanning unit. The iTrax scans the cores using an optical camera, a radiograph (similar to medical x-ray), and an x-ray fluorescence scanner, which collects geochemical data consisting of 35 different element counts at 2mm intervals. The data is organized into 14 CSV files containing the XRF results.


The iTrax:



Using wavelet analysis, significant increases and decreases of geochemical properties can indicate where stratigraphic breaks in the sediment occur. This pattern should be repeatable across all of the gathered cores.



The method I chose to analyze my core data and attempt to break apart the stratigraphy was Wavelet Analysis using an R package called “WaveletComp”.

“WaveletComp” takes any form of continuous data (typically time-series data; spatial data in this case) and uses a small waveform called a wavelet to run variance calculations along the dataset. The resulting output is a power diagram, which shows (in red) the locations along the dataset where there is a great change in variance. A cross-wavelet power diagram can also be generated; this can indicate when two different variables are experiencing rises and/or drops at the same time.


Example of a wavelet.


There are two equations used when generating a wavelet power diagram…


The above equation uses the dataset to calculate the appropriate size of the wavelet according to the number of points in the dataset.


The above equation uses the wavelet to run variance calculations across the dataset and output the power diagram.
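To make the procedure concrete, here is a toy numerical version in Python: convolve a depth series with a scaled Morlet wavelet and take the squared magnitude as power. This is a simplification of what WaveletComp computes, not its actual code, and the element-count series is fabricated:

```python
import numpy as np

def morlet_power(x, scale, omega0=6.0):
    """Wavelet power |W(t)|^2 of series x at a single scale (Morlet)."""
    t = np.arange(-4 * scale, 4 * scale + 1)
    psi = (np.pi ** -0.25) * np.exp(1j * omega0 * t / scale) \
          * np.exp(-(t / scale) ** 2 / 2) / np.sqrt(scale)
    w = np.convolve(x - x.mean(), psi, mode="same")   # wavelet transform
    return np.abs(w) ** 2                             # power

# Toy "element count" profile: constant, with a stratigraphic break
# (a jump in mean) at depth index 200.
x = np.concatenate([np.full(200, 10.0), np.full(200, 30.0)])
power = morlet_power(x, scale=16)
break_at = int(np.argmax(power))   # highest-variance location
```

On this series the power peaks at the abrupt jump in mean, which is exactly the kind of feature being read here as a stratigraphic break.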


Using the “WaveletComp” package in R, I processed 5 different core scans. To conduct the analysis properly, elements had to be selected for both univariate and bivariate analysis. There were a variety of ways I could have selected the data, but ultimately, in the test sample, I chose to look at the elements that had the most obvious changes: aluminum and iron.

The details on how to actually run the “WaveletComp” package can be found in my wavelet analysis tutorial.




After all of the tests were run, the resulting wavelet power diagrams were placed alongside line graphs of the element that was run, as well as a cross-wavelet power diagram, which indicates when the two selected elements change at the same time.

The wavelet power diagrams show significant changes in the waveform, as indicated by the red (high variance). The blue in the power diagram shows low variance. In the cross-wavelet power diagram (the one on the far right), the arrows indicate if the waveforms of the two elements are in-phase (pointing left) or out of phase (pointing right) with each other.

In order to identify stratigraphic breaks, I looked at the “taller” red plumes in the wavelet power diagrams, which indicated a significant change in the amount of an element present, either a great increase or a great decrease. This result was compared to the graph, as well as the image of the core (for visual identification of changes in color or texture). The spots that have the high plumes, and correspond with a significant color or texture change in the image are presumably the stratigraphic breaks.

Each of the five cores showed promise that we can identify stratigraphy using wavelet analysis. The two most significant cores are discussed below:


The results for the first core segment show various distinct changes, as indicated by the tall red plumes in the power diagrams, but there is one major problem with the results. The graphed data show areas of zero data at certain points, as indicated by the drops to the very bottom of the graph. These spots also have the most distinct plumes (for the most part), but about 2/3 of the way down there is a distinct plume that, while it contains zero values, is also a spot with a distinct texture change in the sediment. At this point the sediment changes texture and feel entirely.


The results of the second core sample were more interesting, as there were no zero values. The Al+Fe cross-wavelet diagram shows significant plumes at almost all of the locations where significant-looking color changes appear in the profile’s stratigraphy.




The second core was the most interesting of the analysis, due to the lack of zero/null data. With a little bit of data management, the zero data can be reduced by interpolating some of the values and re-running the analysis with the interpolated data. That should help reduce the effects of the zero data demonstrated in the first core’s results.
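That interpolation step could look like the following in Python (NumPy only; the counts are made up, and a real run would operate on the XRF CSV columns):

```python
import numpy as np

# Hypothetical element counts at 2 mm steps, with zero dropouts.
counts = np.array([120., 118., 0., 0., 125., 130., 0., 128.])

good = counts != 0                         # indices with real data
filled = counts.copy()
# Linearly interpolate each zero from its nonzero neighbours.
filled[~good] = np.interp(np.flatnonzero(~good),
                          np.flatnonzero(good), counts[good])
```

The filled series can then be fed back through the wavelet analysis without the artificial plumes that the dropouts create.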



Wavelet analysis is an excellent tool for observing patterns in spatio-temporal data that is sequential. As for the significance of this project…

Once all of the observed kinks are fixed in the data, this could serve as a new method for identifying changes in stratigraphy. This could be a good method to identify stratigraphy in sediment cores that are extremely similar in color or texture.


What I learned about the software

R is an excellent tool for conducting many different kinds of spatio-temporal analysis, from wavelet analysis to regression and really any other tool that ArcGIS has to offer.

ArcGIS is an excellent tool for data visualization, but it is a very finicky program.


What I learned about statistics.

Statistical applications in geospatial analysis are very important to understanding even the smallest of changes, such as an increase or decrease in iron in a sediment core.



The relationship between microbiome composition and host health has recently generated a great deal of attention and research. The importance of host-associated microbiomes is still poorly understood, although significant relationships between gut microbiome composition and host health have been described. Although the gut microbiome has received the most attention, each body site is home to its own distinct microbial community. The nasal microbiome has received relatively little attention, although a few studies suggest that there is a relationship between nasal microbiome composition and incidence of infections. Using a unique system of closely studied semi-wild African Buffalo, I propose to study the drivers of nasal microbiome composition in a social species.

Data Collection

Our study herd of semi-wild African Buffalo (Syncerus caffer) is kept in a 900-hectare predator-free enclosure located in Kruger National Park, South Africa. Every 2-3 months, all 60-70 individuals are captured, and biological samples are collected for diagnosis as part of a larger study that the Jolles lab is conducting on Foot-and-Mouth Disease. Age, sex, and body condition are recorded, in addition to a number of other physiological parameters. Degree of relatedness is known for each pair of individuals in the herd. Each animal is fitted with a GPS collar that is programmed to record location every 30 minutes, and with a contact collar that records identity and duration of contacts with other buffalo. The GPS data used in this exploratory analysis were collected between October and December 2015.

Overarching Hypotheses

My research is guided by the following two hypotheses:

  1. Conspecific contacts drive nasal microbiome similarity and disease transmission.
  2. Habitat overlap drives nasal microbiome similarity and disease transmission.

I also propose to examine the relationship between the other parameters (age, sex, body condition, relatedness, etc) on nasal microbiome composition, but for this class I focused on the spatial parameters.


The first step toward addressing hypothesis 1 is to quantify conspecific contacts. I tested several approaches during the course of this class. The first approach, presented during my first presentation, was to generate buffers in ArcGIS around each animal at a given time point and generate a measure of “crowdedness.” The second method was to generate a contact matrix in R (see figure 1) to show distances between each individual at a given time point. Detailed methods and R code are given in my first blog post.


Figure 1: Sample distance matrix, generated for a subset of 5 animals at a single time point. Distances range between 3-18 meters.
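For readers working in Python rather than R, the same pairwise distance matrix is a single broadcasting step in NumPy (the coordinates below are fabricated):

```python
import numpy as np

# Hypothetical positions (metres) of 5 animals at one time point.
pts = np.array([[0., 0.], [3., 4.], [6., 8.], [3., 0.], [0., 4.]])

# Pairwise Euclidean distances via broadcasting: D[i, j] = |p_i - p_j|.
diff = pts[:, None, :] - pts[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
```

`D` is symmetric with a zero diagonal, matching the structure of the matrix in figure 1.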

As an intermediate step towards addressing my second hypothesis, it will be necessary to measure the percent habitat overlap between every pair in the herd. I used a suite of tools to carry out a test analysis using GME, ArcMap, and Excel. Detailed methods are outlined in my second blog post. In summary, I used GME to generate kernel densities and 95% isopleths for two individuals, then used the identity feature in Arc to calculate area of overlap. Figure 2 shows a sample of the output from the identity tool.


Figure 2: Habitat overlap between two individuals in the study herd. Individuals’ home ranges are colored yellow and purple, respectively. Area of overlap between the two animals is outlined in red. Overlap between the two animals is 80% and 90%, respectively.
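The asymmetry in those two percentages follows directly from the definition: the shared area is divided by each animal’s own home-range area. A toy Python sketch using axis-aligned rectangles in place of the 95% isopleths (real isopleth polygons would need a geometry library such as Shapely; all numbers here are made up):

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def rect_intersection_area(a, b):
    """Overlap area of two axis-aligned rectangles (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

home_1 = (0, 0, 10, 10)    # hypothetical home range, area 100
home_13 = (2, 0, 12, 8)    # hypothetical home range, area 80

shared = rect_intersection_area(home_1, home_13)   # 8 x 8 = 64
pct_1 = shared / rect_area(home_1)     # overlap from animal 1's view
pct_13 = shared / rect_area(home_13)   # overlap from animal 13's view
```

Here the shared area is 64% of one animal’s range but 80% of the other’s, which is exactly the kind of non-matching pairwise overlap noted as a potential issue.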

Eventually, once I have determined contact network and habitat overlap matrices, I will look at correlation between microbiome similarity and habitat overlap and average distance with conspecifics.


  1. Distance matrix: this analysis showed potential usefulness for researching herd behavior and structure, and I will likely return to it in future analysis. However, since my question of interest relates to microbe transmission, which is likely to be associated only with close contacts, I plan to focus my current efforts on utilizing the contact collars I described in the data section. Although I will lose some spatial information, it will simplify my analysis and increase temporal resolution.
  2. Habitat overlap: The method described here for measuring habitat overlap shows great promise for use in my research, especially if the process can be automated. I will explore iteration functions in GME and ArcGIS ModelBuilder to find the best way to expedite this analysis across multiple pairs of individuals at multiple time points.

Potential Issues

A few problems that I will likely need to deal with as my analysis progresses:

  1. The possibility of high correlation between conspecific contacts and habitat overlap. I will try to control for this by looking for animals that have high spatial overlap but low contact rate, and vice versa.
  2. Non-matching pairwise percent overlap. For example, individual 13 showed 90% overlap with individual 1, whereas 1 showed only 80% overlap with 13. I can deal with this by looking at pairwise averages, unless the percent overlap is too different, in which case I will need to explore other options.

Lessons Learned

Software Packages:

Thanks to help from my classmates, I became much more comfortable and familiar with geospatial functions in R. I also used GME for the first time and discovered that it has great potential usefulness for my future analyses. I became familiar with ModelBuilder in ArcGIS while attempting to iterate the buffering analysis. I still have a lot to learn with all these tools, but I feel much more confident than I did prior to this class.

Statistical methods:

Although I did not utilize any of the statistical methods outlined in the syllabus due to the unique nature of my dataset, I learned a great deal from watching my classmates present. I expect that hotspot analysis, multivariate statistics, and different types of regression models will be part of my future. In particular, I plan to use regression models and PCA to help analyze my data in the future.





Moving forward with the disturbance centroids, I ran two other analysis tools in ArcMap that identify not only spatial clustering, but clustering of similar numeric (not necessarily continuous) variables attributed to the centroids. These tools are Cluster and Outlier Analysis (Anselin Local Moran’s I) and Hot Spot Analysis (Getis-Ord Gi*). While these tools essentially answer the same question (where are clusters of similar values?), they calculate this in slightly different ways. See this link for more information. For both tools I used year of disturbance as the input value to be evaluated:


The map on the left shows the results of the Local Moran’s I tool. High-High and Low-Low Clusters represent statistically significant clusters of high and low values, whereas High-Low and Low-High clusters are outliers representing either high values surrounded by low values, or vice versa. The results of the Hot Spot Analysis show similar patterns of significant clustering by year. These patterns are indicative of geographic shifts in timber harvest concentration, which could reflect management decisions. Using “Select by Location”, the land use designations associated with significant Moran’s I clusters were tabulated. Low-Low Moran’s I clusters are on the left (1985-1989) and High-High clusters (1996-2012) on the right:


Here we see an expected increase in clustering of clear-cuts and partial harvests on Non-Forest Service (private) and Matrix lands, yet it is important to keep in mind that these land-use designations did not exist until 1994.
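For readers curious what the Hot Spot Analysis computes under the hood, the Getis-Ord Gi* statistic is a neighborhood-weighted z-score. Here is a compact NumPy sketch of the standard formula, using binary distance-band weights and fabricated points (ArcMap offers several other weighting schemes):

```python
import numpy as np

def getis_ord_gstar(coords, x, threshold):
    """Gi* z-scores with binary distance-band weights (self included)."""
    n = len(x)
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    w = (d <= threshold).astype(float)        # includes w_ii = 1
    xbar, s = x.mean(), x.std()               # global mean and std
    Wi = w.sum(axis=1)                        # sum of weights per point
    num = w @ x - xbar * Wi                   # local sum vs. expectation
    den = s * np.sqrt((n * (w ** 2).sum(axis=1) - Wi ** 2) / (n - 1))
    return num / den

# Toy points: a tight cluster of high values amid a low-value cluster.
coords = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]])
x = np.array([9., 8., 9., 1., 1., 2.])
z = getis_ord_gstar(coords, x, threshold=2.0)
```

Large positive z-scores mark hot spots (high values surrounded by high values) and large negative scores mark cold spots, which is how the significance classes in the maps above are derived.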

Following this analysis of clustering by year, I was interested in attaching continuous variables to the actual disturbance patches, rather than their centroids. Toward this end, I brought in another output from the LandTrendr algorithm: disturbance magnitude, which is a function of the slope of change in spectral values for each pixel over a yearly time step. Since magnitude is measured on a per-pixel basis, I used the Zonal Statistics tool in ArcMap to calculate the mean disturbance magnitude within each patch and attach it to the patch attributes. I then ran Hot Spot Analysis again, with mean disturbance magnitude as the input value.


The distribution of high and low magnitude disturbances is interesting, especially when overlaid on the land use designations. As shown in the map on the right, a cluster of high magnitude disturbance is associated with a section of adjacent private lands (in grey) and a cluster of low magnitude disturbance is associated with Matrix land (in orange). This may indicate a higher proportion of clear-cutting (high magnitude disturbance) on private lands, and more partial harvesting (lower magnitude disturbance) on Matrix lands.

Mapping Weeds in Dryland Wheat: A spectral trajectory approach

Introductions: Into the weeds

When I started this course, I had a central goal in mind: I wanted to use remotely sensed data to identify weeds. What I didn’t have was a clue as to how I was going to get there. I had a few data sets to work with, and although I had made some limited progress with them, I had not yet been able to ask the kind of spatial questions I knew were possible with them. As a part of this course, I formed this central goal into two research questions:

  • How does spectral trajectory relate to weed density? Can variation in spectral trajectory be used to distinguish weed species from crop species?
  • Can this information be used to distinguish the spatial extent of weeds in dryland cropping systems? How well do different methods of assessing weed density perform?

Originally, I had planned on using some low-altitude aerial imagery I had taken in 2015 to address this question. After many, many late nights trying to get these data into a form suitable for analysis, I decided that, for the sake of brevity and actually getting something done in this class, I would use an alternative data set.

My alternative data set consisted of three kinds of reference data regarding the location and species of weeds in a 17-acre field NE of Pendleton, OR, as well as Landsat 8 images taken over the course of summer 2015. The reference data sets consisted of transect data where species-level density counts were made, spectral measurements of green foreign material made during harvest, and observations made from the front of a combine at the time of harvest. In the Landsat data set I used in this class, I included only cloud-free images from the 2015 growing season.



Motivations and Relevance:

My goal in this class was to use variation in phenology between weed and crop species to distinguish weed density using time-series Landsat data. One of the major issues with weeds in cropping systems, however, is that they typically represent only a very small fraction of the total cover in a pixel. The revised hypotheses I ended up testing in this class were:

  1. Weeds and crop can be distinguished based on their relative rates of change in greenness (NDVI).
  2. Visual estimates of weed density will be more accurate than automated inline assessments.

The main reason I find these questions interesting is that species-level discrimination in satellite-based remote sensing is still such a challenge. Most remote sensing efforts to identify weeds using satellite data perform poorly. A major reason for this is the typically low signal weeds will have in a cropping system. Advancement in the methodologies for identifying species in satellite imagery would be a significant contribution. As well, there have been few attempts at using a time-series approach to distinguishing weeds to the species level. Most work has concentrated on increasing the spatial resolution or the spectral resolution of imaging efforts for making species-level classifications. Finally, the data I was working with represented as complete a sampling of the field as is reasonably possible. All plant material from the field had been cut and measured by the spectrophotometer. In this way, the spatial resolution of my reference data was very fine.
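The “relative rates of change in greenness” in hypothesis 1 boil down to a per-pixel slope of NDVI against time. Here is a NumPy sketch with a fabricated four-date NDVI stack (the dates, values, and the assumption that weeds stay green later than crop are all illustrative):

```python
import numpy as np

# Day-of-year for each cloud-free Landsat scene (hypothetical dates).
doy = np.array([130, 160, 190, 220], dtype=float)

# NDVI stack, shape (dates, rows, cols). Three pixels senesce (NDVI
# falls) while one stays green and rises; values are made up.
ndvi = np.array([
    [[0.80, 0.75], [0.78, 0.40]],
    [[0.70, 0.74], [0.66, 0.45]],
    [[0.55, 0.72], [0.50, 0.52]],
    [[0.35, 0.70], [0.30, 0.60]],
])

# Least-squares slope of NDVI vs. time at every pixel at once.
t = doy - doy.mean()
slope = np.tensordot(t, ndvi - ndvi.mean(axis=0), axes=1) / (t ** 2).sum()
```

Strongly negative slopes flag senescing crop, while near-zero or positive slopes flag pixels that stay green late in the season, the candidate weedy areas under the phenology argument above.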


In this class, I ended up attempting to answer these hypotheses using two general approaches that were made available to us in the class. I used hotspot analysis to classify areas of the field with statistically significant positive values as “weedy.” With previous attempts at classification, I typically ended up with poor results when comparing the spectrally generated data set to my reference data set. Hotspot analysis allowed me to overcome this, in that the classification was not as sensitive to extremely high or low values. The improved performance is likely a result of the fact that hotspot analysis takes into consideration the neighborhood of values a measurement resides in. I then used the output of the hotspot analysis as the response variable in a geographically weighted regression, with NDVI as the explanatory variable. With geographically weighted regression, we remove the assumption of stationarity and fit regression models based on a local neighborhood rather than to the entire dataset. Finally, I classified the output of the regression analysis based on the coefficients for each of the terms of the local linear models. My goal here was to attempt to get at why the relationships between weed density and NDVI appeared to vary as they do.


The specific details of each of these analyses are available in their individual posts, but overall the results were very promising. I was able to improve the classification accuracy of a spectrophotometric assessment of weed density from ~50% to 85%, as good as a human observer. This result indicates that in-situ assessment of weed density may be a real possibility in the near future, which has implications for farmers as well as scientists and land managers. Generation of quality reference data is very difficult to do over broad spatial areas, and any way to improve that process is a good thing.

Regarding the output of the geographically weighted regression, I find I’m still grappling with the results. While I did end up with distinct areas of the field having differing responses of weed density to NDVI, I’m finding difficulty in interpreting the output. If additional explanatory variables could account for why weeds respond the way they do, wouldn’t one simply include those variables in the original model for predicting weed density? So while geographically weighted regression may be suitable for a data exploration exercise, I’m not sufficiently convinced of its utility in predicting weed density. More work is necessary to come to a conclusive answer as to why local relationships between NDVI and weed density appear to vary as they do.



I learned an incredible amount this quarter, mostly from the incredibly high-quality presentations and conversations I had with my colleagues in this class. An incredibly high bar was set by the work done by this cohort. Although I didn’t get as far as I might have liked to with these data, this was my first foray into spatial statistics. I had the advantage of being very comfortable in R, but this class forced me to branch out into ArcMap, and even, by the end of the quarter, into Python and Earth Engine. One of the major issues I had was that the data set I was working with was not as high a resolution as I had originally intended. I wanted to be using machine learning methods on a very high-resolution multispectral data set I’ve been working on, but it wasn’t ready for analysis by the time I needed to produce results. I ended up doing a lot of work that never saw the light of day, looking at issues in spatial variance and resolution, that with more time I would have liked to present on.


The research question and Background

Soil microbes are critically important for soil function, because they provide the functional backbone for many soil ecosystem services such as the carbon and nitrogen cycles. The structure and composition of microbial communities can have a large influence on the rates at which these ecosystem services are performed; therefore, determining the spatial distribution of these microbial communities is necessary to establish accurate functionality of soils across a landscape. However, this can be very difficult due to the fact that microbial communities are extremely diverse: in one gram of soil there can be over a billion microbes, with anywhere from 4,000 to 50,000 different species. There are two different potential forces governing the spatial distribution of microbial communities. The first theory, popularized by Baas Becking (1934), states that “everything is everywhere, but the environment selects”. This idea is in line with classical niche theory in ecology, in which species distribution is dictated by their niche. The opposite of this paradigm is the idea that microbes are more limited by dispersal potential than by environmental selection factors. This reflects neutral theory, which states that species distributions are completely stochastic, and at any given location the species composition is determined by the species composition of its neighboring areas.
Until recently, these theories had not really been tested in microbial communities. However, with the advent of high-throughput sequencing in the 1990s, an in situ view of microbial communities became possible. In a recent study, O'Brien et al. (2016) determined that at a fine scale (1 cm²) community composition was extremely heterogeneous and correlated weakly with environmental factors; however, as the scale and scope increased, patterns in microbial community distribution began to emerge and were closely linked to environmental parameters. This study illustrates that when examining microbial communities, the spatial scale of the study will influence the results.
In my study, I want to know which principles govern microbial distributions across a regional landscape. Are microbial communities across a regional landscape influenced by environmental parameters, or is their distribution dictated more by dispersal limitation? Some work at a similar scale has determined that bacterial diversity is linked more closely to environmental parameters than to dispersal limitation. Specifically, studies from all over the world have found the most influential environmental factor to be soil pH (Fierer and Jackson 2006, Griffiths et al. 2011).
The Dataset
During the spring of 2015, 113 soil samples were collected across the state of Oregon to capture the spatial pattern of microbial community structure across a regional landscape. Sampling locations were stratified by Common Resource Areas (CRAs), which are geographically associated land resource regions based on geology, climate, water, soils, biological resources, and land use. This stratification was used to help capture the spatial heterogeneity of soil across a regional landscape. A subset of these samples (90 of 113) had previously been sampled by the NRCS in a soil series survey, in which an extensive lab workup was conducted to catalog each soil's physical and chemical properties.
My hypothesis is that the physical and chemical properties of soil have a greater influence on microbial communities than spatial variability does. In essence, at a regional landscape scale, the factors dictating microbial community composition will be linked more closely to environmental factors in the soil than to how geographically distant two samples are.
If this is true, then we need to be able to quantify the heterogeneity of these environmental parameters across a landscape. In this blog post I use the 90 NRCS sample points to see how well interpolation methods perform when examining the spatial heterogeneity of edaphic properties, using soil pH as a proxy. To examine the accuracy of these interpolation methods, we use the SSURGO database of soil pH as our truth. This allows me to compare the interpolation techniques and determine their strengths and weaknesses.
The first method used was spatially extrapolating the pH point data over their respective CRA polygons. In theory, since these CRA polygons were delineated by similar environmental factors, edaphic properties such as soil pH should be conserved within each polygon.
The Inverse Distance Weighted (IDW) interpolation method was also attempted. This method estimates the value at any unknown point as a weighted average of all known points, where the weights are determined by each known point's distance to the unknown point.
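The weighted-average logic can be sketched in a few lines. This is a toy Python illustration with hypothetical pH samples, not the ArcGIS tool itself:

```python
import numpy as np

def idw(xy_known, values, xy_target, power=2):
    """Inverse Distance Weighted estimate at one target location.

    Weights are 1/d**power, so nearer known points dominate the average.
    """
    d = np.linalg.norm(xy_known - xy_target, axis=1)
    if np.any(d == 0):                 # target coincides with a sample point
        return values[np.argmin(d)]
    w = 1.0 / d**power
    return np.sum(w * values) / np.sum(w)

# Hypothetical pH samples at three locations
pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
ph = np.array([5.0, 7.0, 6.0])

# Estimate near the pH-5.0 sample: the result is pulled strongly toward 5.0
est = idw(pts, ph, np.array([1.0, 1.0]))
```

The power parameter (2 here, a common default) controls how quickly a known point's influence decays with distance.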
The last interpolation method used was ordinary kriging. Like IDW, kriging considers the distances between points; unlike IDW, however, kriging incorporates the fact that beyond a certain distance there is no more useful information to explain the variation at a given point, a distance determined through the variogram. Kriging therefore uses a variogram model to determine the maximum distance from any given point that information should come from. The figure below shows the relationship between the variance explained and the distance from a given point. The fitted line is the model used in the ordinary kriging interpolation and determines the weights for the average.
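The empirical semivariogram behind that figure is half the mean squared difference between pairs of values, grouped by the distance separating them. Below is a minimal Python sketch on a hypothetical autocorrelated pH transect; in practice, gstat in R or ArcGIS's Geostatistical Analyst does the binning and fits the model curve:

```python
import numpy as np

def semivariogram(coords, z, bins):
    """Empirical semivariogram: half the mean squared difference between
    pairs of values, binned by the distance separating the pair."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    sq = 0.5 * (z[:, None] - z[None, :]) ** 2
    iu = np.triu_indices(len(z), k=1)          # count each pair once
    dist, gamma = d[iu], sq[iu]
    which = np.digitize(dist, bins)
    return np.array([gamma[which == k].mean() for k in range(1, len(bins))])

# Hypothetical pH transect with a smooth spatial trend: nearby points similar
x = np.linspace(0, 10, 25)
coords = np.column_stack([x, np.zeros_like(x)])
z = np.sin(x / 3.0)                            # spatially autocorrelated values
gam = semivariogram(coords, z, bins=np.array([0.0, 2.0, 5.0, 10.0]))
# Semivariance is lowest for short separations and rises with distance
```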


After interpolation, these maps were compared to the SSURGO map to test their performance. This comparison was done by finding the absolute difference at each pixel and summing these differences into a single variance score. Since the SSURGO map was not complete for all of Oregon, each interpolated map was clipped to the same spatial extent as the SSURGO map.
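That comparison amounts to a per-pixel absolute difference summed into one score. A minimal sketch with small hypothetical pH grids standing in for an interpolated map and the SSURGO truth:

```python
import numpy as np

# Hypothetical 3x3 pH rasters: an interpolated map and the SSURGO "truth"
interpolated = np.array([[6.1, 6.0, 5.9],
                         [6.3, 6.2, 6.0],
                         [6.5, 6.4, 6.2]])
ssurgo       = np.array([[6.0, 6.0, 6.0],
                         [6.0, 6.0, 6.0],
                         [7.0, 7.0, 7.0]])

# A NaN mask would mimic clipping to the extent where SSURGO is available
mask = ~np.isnan(ssurgo)

# Per-pixel absolute difference, summed into a single variance score
score = np.abs(interpolated - ssurgo)[mask].sum()
```

Lower scores mean the interpolated surface tracks the reference map more closely, which is how the IDW, kriging, and CRA maps were ranked.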

The map shown below is the soil pH values extrapolated over their respective CRAs. You can see that the Willamette Valley is quite homogeneous in its soil pH values, while eastern Oregon is considerably more heterogeneous. The problem with this map is that it draws sharp lines between CRA polygons, where in reality pH values should vary continuously as you move from one CRA to another.


The IDW map showed the same pattern in the Willamette Valley; however, the heterogeneity of eastern Oregon was blended together. The IDW interpolation also developed pockets of similar pH values, which look more like artifacts of the IDW algorithm than real characteristics of the landscape.

Like IDW, ordinary kriging also blended together the heterogeneity of eastern Oregon; however, it did so without creating many of the artificial pockets that the IDW method produced.

The main quantitative difference between the SSURGO map and the interpolated maps can be seen in western Oregon: the SSURGO map shows a much higher level of heterogeneity in the Willamette Valley than the interpolated maps do. It also shows a large amount of heterogeneity in southeastern Oregon that was captured only in the CRA map.


As expected, the IDW map performed the least favorably (a variance score of 2676), while the kriging and CRA maps had very similar results (variance scores of 2440 and 2414, respectively). This is somewhat surprising, since the kriging map had a homogenizing effect on soil pH in eastern Oregon, smoothing over heterogeneity that was present in both the CRA map and the SSURGO map. Below is a map of the difference in pH between the CRA map and the SSURGO values. You can see that the key areas where the CRA approach failed were the areas of high heterogeneity in the SSURGO map.

Difference map

Through this analysis we have determined that across this landscape there is a large amount of heterogeneity in edaphic properties. There may be significantly correlated parameters that have not been captured in the SSURGO database. If so, it may be inappropriate to interpolate these values using just our sampling points.
Finally, if geographic distance turns out to be the best proxy for microbial community distribution at this scale, an interpolated map of the community needs to be presented in a way that shows the audience where the map is reliable and where we are quite sure it is wrong.
What I Learned: Software
The majority of the project was conducted in the R statistical environment, for several reasons. First, I was already familiar with the programming language. Moreover, R is quite powerful in its data manipulation capability and its ability to execute complex statistics with its base functions. Because R is also completely free and available to anyone, it allows me to conduct my analyses without relying on paid programs. The main drawback of this particular statistical software is its rather steep learning curve, which made ArcGIS preferable in some respects. I found ArcGIS useful for its ability to display results and data with minimal effort, presenting a large amount of information quickly in a user-friendly way. However, when attempting statistical interpolation methods in ArcGIS, I felt rather limited in its customizability and was largely ignorant of the underlying processes it performed during interpolation.
R, on the other hand, requires more work upfront to create and display raster and vector files; however, once the data are in, manipulating and analyzing them is quite straightforward. When conducting spatial analysis, R is transparent: each function comes with its own documentation about exactly what happens during its execution. It also requires users to have a working understanding of the basic methodology of spatial analysis, which is always a good thing.
All in all, both programs have their niches and best uses; however, one should not feel limited to just one. Exploiting the strengths of each program and bypassing their relative weaknesses by shifting between platforms should be everyone's goal when conducting any sort of analysis, including spatial analysis.

What I Learned: Spatial Statistics
Before joining this class I had a very limited understanding of spatial statistics. Through the class exercises and a lot of reading, I developed a comfortable understanding of the basic ideas behind several spatial statistics methods, including IDW, kriging, geographically weighted regression, PCA ordination, and boosted regression trees. However, I feel the most important lesson I learned was how to critically interpret the results of these analyses, such as geographically weighted regression, in ways that make intuitive sense.
Works Cited
Fierer, Noah, and Robert B. Jackson. “The Diversity and Biogeography of Soil Bacterial Communities.” Proceedings of the National Academy of Sciences of the United States of America 103, no. 3 (January 17, 2006): 626–31. doi:10.1073/pnas.0507535103.

Griffiths, Robert I., Bruce C. Thomson, Phillip James, Thomas Bell, Mark Bailey, and Andrew S. Whiteley. “The Bacterial Biogeography of British Soils.” Environmental Microbiology 13, no. 6 (June 1, 2011): 1642–54. doi:10.1111/j.1462-2920.2011.02480.x.

O’Brien, Sarah L., Sean M. Gibbons, Sarah M. Owens, Jarrad Hampton-Marcell, Eric R. Johnston, Julie D. Jastrow, Jack A. Gilbert, Folker Meyer, and Dionysios A. Antonopoulos. “Spatial Scale Drives Patterns in Soil Bacterial Diversity.” Environmental Microbiology, March 1, 2016, n/a-n/a. doi:10.1111/1462-2920.13231.

To begin my analysis of forest disturbance patterns, I wanted to get a broad scale understanding of how clear-cuts and partial harvests are distributed throughout Willamette National Forest, as well as the land use designations and forest districts they spatially coincide with. Using ArcMap I ran the Kernel Density tool on cumulative disturbances from 1985 to 2012. This first required converting my disturbance patch data (rasters) into polygons (vectors). Additionally, Kernel Density requires point or polyline inputs, so I also generated disturbance centroids from the polygons. Below I outline the function, results and my interpretations:

Kernel Density calculates a magnitude per unit area using a kernel function to fit a smoothly tapered surface to each input feature. Here are the results for two runs of this tool:
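The smoothly tapered surface comes from a kernel function such as the quartic. Below is a minimal Python sketch of the idea with hypothetical centroid coordinates; it ignores the cell size, area units, and other handling that the ArcGIS tool performs:

```python
import numpy as np

def kernel_density(points, grid_xy, radius):
    """Quartic-kernel density surface, similar in spirit to ArcGIS's
    Kernel Density tool: each point contributes a smoothly tapered bump
    that falls to zero at the search radius."""
    density = np.zeros(len(grid_xy))
    for p in points:
        d = np.linalg.norm(grid_xy - p, axis=1)
        inside = d < radius
        density[inside] += (3 / np.pi) * (1 - (d[inside] / radius) ** 2) ** 2
    return density / radius**2

# Hypothetical disturbance centroids and two evaluation cells
pts = np.array([[0.0, 0.0], [100.0, 0.0]])
cells = np.array([[0.0, 0.0], [1000.0, 1000.0]])
dens = kernel_density(pts, cells, radius=1600.0)
# Density is highest at the cell sitting right on top of the points
```

A larger search radius spreads each point's bump over more cells, which is why the radius choice changes how smooth the output surface looks.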


Below are the parameters used for each kernel density output above. In the first run, on the left, the output cell size of 250 meters results in a surface that is less smooth than that of the second run, on the right, which has a much smaller output cell size. Additionally, in the second run, the optional search radius parameter was set to 1600 meters, roughly double the average nearest-neighbor distance between disturbance centroids. This creates a density surface that is more appropriate for the scale of the map (on the right). Both maps indicate that higher densities of clear-cuts and partial harvests between 1985 and 2012 occurred on Matrix Lands and Non-Forest Service Lands, which are mostly composed of private and industrial landowners. These results are not surprising, given that timber harvest is the primary function of Matrix Lands, and that private and industrial landowners are inclined to produce timber.
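The search-radius rule of thumb can be computed directly from the centroids. A short sketch with four hypothetical centroid coordinates (ArcGIS's Average Nearest Neighbor tool reports the same mean distance):

```python
import numpy as np

# Hypothetical disturbance centroids (map units: meters)
pts = np.array([[0.0, 0.0], [800.0, 0.0], [800.0, 600.0], [0.0, 700.0]])

# Pairwise distances, with self-distances masked out
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)

mean_nn = d.min(axis=1).mean()       # average nearest-neighbor distance
search_radius = 2 * mean_nn          # the "roughly double" rule of thumb
```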


*It is important to note that after converting the disturbance patches to polygons, I used the Normal QQ Plot and Histogram functions (built in to the Geostatistical Analyst extension for ArcMap) to remove outliers using the “Shape_Area” attribute field. Polygons of extremely low area (less than eleven 30-meter Landsat pixels) were removed because they likely represent incorrectly classified pixels. Polygons of extremely high area were also removed, because they would not be accurately represented by a single centroid point.

Mapping cumulative clear-cuts and partial harvests over the 27-year period between 1985 and 2012 paints an interesting picture, but in order to better assess the effects of forest governance on landscape patterns, it is probably more interesting to map disturbances temporally. Using the Definition Query function in ArcMap, I filtered the disturbance data at 5-year intervals and ran the Kernel Density tool again for each interval:


Bringing in the temporal dimension reveals some interesting changes in the density of forest disturbance that may correspond with significant events in the history of forest governance. For example, 1990 shows a very high density of clear-cuts and partial harvests, which indicates a peak in timber harvest activity prior to the May 29, 1991 Dwyer Injunction, which banned timber sales. Following the 1994 Record of Decision for the Northwest Forest Plan, the maps for 2000, 2005 and 2010 show the expected drop in overall disturbance density, but interesting peaks associated with certain forest districts. For one final run of the Kernel Density tool, I mapped cumulative disturbance during the period of the Dwyer Injunction (1990-1994). The result shows higher disturbance density than expected, but I assume that there is a lag between the timing of timber sales and when sold forest land is actually harvested. Thus, this map likely shows clear-cuts and harvests on forest lands sold prior to the injunction, but may also be indicative of which forest districts are more inclined toward timber production.


Exploring Giant Gourami Distribution Models

The research question:

In the context of predicting the impacts of the rapidly expanding aquaculture industry and understanding the impacts of natural and human changes on the landscape, one of the overarching research questions I am interested in is: What is the natural distribution of the air-breathing giant gourami (Osphronemus goramy) in Southeast Asia, and how is it predicted to change over time with respect to 1) the biophysical properties of the landscape, 2) human impacts, and 3) future climate projections?

Specific to this class, the research question I explored was: What is the distribution of the giant gourami in SE Asia in the year 2000, based on four environmental variables (NDVI, precipitation, surface temperature, and river flow accumulation) and human population density?

Background: The giant gourami inhabits fresh to brackish water in slow-moving areas like swamps, lakes, and large rivers.  Given its unique ability to breathe air, this fish can survive in poorly oxygenated, even anoxic, water.  Already popular in the aquarium trade and eaten in some regions, this fish is farmed in SE Asia.  I expect that with climate change, increased urbanization, and the changing hydrologic profile of the system due to potential dams, this fish may become better suited than others to these ‘poorer’ environmental conditions.

The Dataset:

My dataset consists of points of fish presence/pseudo-absence across SE Asia (image above) characterized by associated environmental variables of interest.  The ‘presence’ portion of my dataset was pulled from fishbase.com and consists of 47 occurrence points for giant gourami, collected primarily between the 1980s and 1990s through an unknown sampling protocol.  Clipping to my study region (SE Asia) left me with 36 presence points.  Background data, or pseudo-absence points, were generated randomly along the rivers and streams of SE Asia in ArcMap.  Environmental data for all points were taken from the freely available satellite datasets listed below via Google Earth Engine, using a 1 km buffer around each point and filtered to the date range of Feb–Jun 2000 where possible (to retain consistency in the dataset).

  • NDVI: 32-day Landsat 7
  • Precip: CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data) at 0.05deg resolution.
  • Flow: HydroSHEDS flow accumulation 15-arc seconds (Feb 11-22, 2000)
  • Temp: NCEP/NCAR reanalysis 2.5deg resolution
  • Population: WorldPop project- Estimated mean population density 2010 & 2016 at a 100x100m resolution


Based on my point data and the variables used, my hypotheses are below (with the third un-tested):

  1. Giant gourami are distributed across a wide range of habitat types, including relatively warm surface temperatures, a range of flows, and areas with a range of precipitation.
  2. This distribution of the Giant gourami is not affected by human population density.
  3. Giant gourami distributions will not change much given predicted climate warming (untested)


This was an exploratory analysis that employed a species distribution model: boosted regression trees (BRT).  This is an ensemble tree-based model that iteratively grows small, simple regression trees based on the residuals from all previous trees, and it handles non-parametric data well. BRTs consist of two components, regression trees and boosting, and ultimately help identify the variables in the dataset that best predict where the species is expected, based on the presence/absence data.  (You can view my boosted regression tutorial for more on this analysis approach.)
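The "fit small trees to the residuals of all previous trees" loop can be illustrated with depth-one stumps. This is a toy Python sketch of the boosting idea on hypothetical data, not the R gbm implementation the analysis actually used:

```python
import numpy as np

def best_stump(x, residual):
    """Fit the single best threshold split (a depth-1 regression tree)."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((residual - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_trees=50, lr=0.1):
    """Boosting: each tiny tree is fit to the residuals of the ensemble."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        t, lv, rv = best_stump(x, y - pred)       # fit current residuals
        pred += lr * np.where(x <= t, lv, rv)     # shrunken update
        stumps.append((t, lv, rv))
    return pred, stumps

# Hypothetical 1-D "presence" signal: presence only above a threshold
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
pred, _ = boost(x, y)
# After many shrunken stumps, pred closely approximates y
```

The learning rate (lr) is the shrinkage that makes each tree contribute only a small correction, which is why many trees are needed but overfitting is tempered.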


After building my dataset in Arc and Google Earth Engine, I was able to produce BRT results in RStudio for several combinations of learning rates and tree complexities, with and without the population data as a variable.  Preliminary analysis indicates that precipitation and temperature contribute the most in determining where giant gourami are expected, based on the data.  Digging into my two hypotheses, I explored the strongest environmental contributor to giant gourami distribution, along with population density.

Precipitation was identified by the BRT models as the strongest contributor in determining the likelihood of finding a giant gourami in my study area.

R output:

    > gourami.tc3.lr005$contributions
    var          rel.inf
    mean_Precip  60.511405
    mean_temp    25.182547
    pop_density  10.984079
    NDVI_mean     1.673674
    flow          1.64829

Exploring the spatial relationship between precipitation trends and my presence/pseudo-absence point data, it is clear that the presence points are clustered in regions with lower rainfall.  To explore this relationship further, it might be helpful to shrink the extent of my study area to a smaller region more representative of the conditions associated with the presence points.

As expected, population density did not appear to have an effect on model performance.  The model output below displays results for models with population density as a variable in the left column and without it on the right.  Model deviance in cross-validation does not change when population density is removed as an explanatory variable.

Exploring this relationship visually on a map, this result makes sense, as the point data are distributed across a wide range of population densities.



The results of this exploratory analysis revealed some interesting patterns in giant gourami distribution, but they are limited in a big way by the point data used.  Points were present only in Malaysia due to availability, so this distribution pattern is possibly more a function of sampling effort than of actual distribution.  Limiting the analysis to Malaysia only might provide a better representation of the data.

Understanding the spatio-temporal patterns that govern the range of species like the giant gourami allows resource managers to help meet increasing demands while mitigating environmental harm.  Protein consumption is expected to increase to 45 kg per capita by 2020, a 25% increase from 1997, and the fish consumption rate is no outlier.  The growing aquaculture industry provides roughly half of the global fish supply (FAO, 2014).  The giant gourami is one of roughly 400 air-breathing fish species, which are ideal candidates for aquaculture.  This prospect presents an opportunity for increased protein production in a changing climate, but also an increased threat of invasion and outcompetition.


Lessons Learned:

In this analysis I learned about integrating Arc, Google Earth Engine, and R for building my dataset, as well as running a boosted regression tree model.  Through the process, I tracked my progress, workflow, and challenges, which allowed me to identify areas where I could have been more efficient.  For example, I generated my background/pseudo-absence points in Arc and then brought them into Earth Engine, when I could have done the whole process in Earth Engine.

A note on Earth Engine: I chose this platform for its ability to quickly access large global datasets across wide-ranging time frames, without needing to download data in order to manipulate or extract it.  Earth Engine is scripted in JavaScript, which is a hurdle to overcome for someone new to programming.

My data analysis was done in RStudio, through which I learned to run a BRT model with the ‘gbm’ package, some simple histogram plotting functions, and how to import and view raster images with the ‘raster’ and ‘rgdal’ packages.

In terms of the statistics that I explored for this analysis, I have listed some of my take home points below:

  • Boosted Regression Tree strengths and limitations
    • Deals with non-parametric data well
    • Is able to deal with small sample sizes
    • Pseudo-absence points present issues in interpreting results since they are not true absence points
    • Sample bias is potentially a problem
    • Is not a spatially explicit analysis so issues like spatial autocorrelation are not dealt with
  • Hot Spot Analysis (conducted on a different dataset)
    • A quick and easy way to identify patterns at different scales for data points that are independent of the researcher and span a large temporal and spatial range.  
    • Could be an approach to hypothesis generation.


Research Question:  What is the correlation between the location of leatherback sea turtles, sea surface temperature, and chlorophyll?


Leatherback sea turtle locations: The leatherback turtle dataset was obtained from http://seamap.env.duke.edu/dataset.  The dataset is in point feature format, and its extent is within the Gulf of Mexico.  Observations of the sea turtles were collected from January through December 2005.


Figure 1: Sea turtle locations in the Gulf of Mexico, derived from http://seamap.env.duke.edu/

Sea Surface Temperature: The Sea Surface Temperature (SST) dataset was obtained from NOAA.  The dataset was downloaded from the NOAA website in .csv format with latitude and longitude coordinates and temperature data associated with each coordinate.  The extent of the dataset was the Gulf of Mexico.  The SST data was the average daily maximum for January and December 2005.  The data ranged from 2.44 – 26.89 Celsius, the mean was 23.57 Celsius.


Figure 2: Sea Surface Temperature (Celsius) data for January 2005 within the Gulf of Mexico, derived from NOAA.

Chlorophyll-a: The chlorophyll data was obtained from NOAA.  The dataset was downloaded from the NOAA website in .csv format with latitude and longitude coordinates and chlorophyll data associated with each coordinate.  The extent of the dataset was the Gulf of Mexico.  The Chlorophyll data was a 3-day composite in December 2005.  The data ranged from 0 to 1.63 mg/m3, the mean was 0.29 mg/m3.


Figure 3: Chlorophyll-a (mg/m3) data for January 2005 within the Gulf of Mexico, derived from NOAA.

Hypothesis: I expected leatherback sea turtles to be clustered in areas based on high jellyfish concentrations.  Jellyfish tend to be located in areas that have higher chlorophyll-a concentrations and where sea surface temperature is low.  Thus sea turtle locations should correlate well with higher values of chlorophyll-a and lower values of SST.

Analysis Approaches:  To test my hypothesis, I utilized the Hotspot Analysis, Spatial Autocorrelation (Moran’s I), Ordinary Least Squares, and Geographically Weighted Regression tools in ArcGIS.  In addition, I used the R statistical package to create a graph correlating chlorophyll with SST.

a.  Hotspot Analysis: The spatial distribution of the sea turtle locations appeared to be clustered toward the middle of the Gulf of Mexico.  The hotspot analysis tool helps to identify where statistically significant hotspots or clusters of sea turtles are located within the Gulf of Mexico.

b.  Spatial Autocorrelation (Moran’s I): This tool measures spatial autocorrelation using feature locations and feature values simultaneously. The Moran’s I index is a value between -1 and 1: positive spatial autocorrelation indicates clustered values, negative autocorrelation indicates dispersed values, and values close to zero indicate randomness. The tool generates a z-score and p-value, which help evaluate the significance of the Moran’s index. I tested the spatial autocorrelation of chlorophyll and sea surface temperature at each feature location.  The conceptualization of spatial relationships used was inverse distance, with Euclidean distance as the distance method.  I selected a 500 km distance band (smaller distances were too small for the study site).
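The statistic itself is straightforward to compute. Below is a Python sketch using inverse-distance weights with no distance cutoff for brevity (the actual analysis used a 500 km band), on hypothetical clustered values:

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights (no cutoff):
    positive when nearby features have similar values."""
    n = len(values)
    z = values - values.mean()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    with np.errstate(divide="ignore"):
        w = 1.0 / d                      # inverse-distance weights
    np.fill_diagonal(w, 0.0)             # no self-weight
    num = (w * np.outer(z, z)).sum()
    return (n / w.sum()) * (num / (z ** 2).sum())

# Hypothetical clustered pattern: low values in the west, high in the east
coords = np.array([[0, 0], [1, 0], [0, 1], [10, 0], [11, 0], [10, 1]], float)
vals = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.9])
I = morans_i(vals, coords)   # positive index -> clustered
```

The significance test in ArcGIS comes from comparing this index against its expected value under randomization, which is where the reported z-scores and p-values originate.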

c.  Ordinary Least Squares: This tool performs a global linear regression to “generate predictions or model a dependent variable in terms of its relationships to a set of explanatory variables.”  Before conducting this test, I sampled the SST and Chl-a values at each of the feature locations (sea turtle locations) using the Extract Multi Values to Points tool, which “extracts cell values at locations specified in a point feature class from one or more rasters, and records the values to the attribute table of the point feature class.”  This model was run three separate times, adding more explanatory variables each time, with chlorophyll as the dependent variable in each run: the first run used SST as the explanatory variable; the second used SST and depth (m); and the third used SST, depth, and turtle count.
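The global regression can be sketched with a design matrix and least squares. The values below are synthetic stand-ins for the extracted SST, depth, and count data, generated with a known negative SST effect so the fit can be checked:

```python
import numpy as np

# Hypothetical extracted values at turtle locations:
# chlorophyll (dependent); SST, depth, turtle count (explanatory)
rng = np.random.default_rng(0)
sst = rng.uniform(20, 27, 50)
depth = rng.uniform(500, 3000, 50)
count = rng.integers(1, 5, 50).astype(float)
chl = 1.5 - 0.05 * sst + 0.1 * count + rng.normal(0, 0.01, 50)

# Design matrix with an intercept column: Chl = b0 + b1*SST + b2*Depth + b3*Count
X = np.column_stack([np.ones_like(sst), sst, depth, count])
coef, *_ = np.linalg.lstsq(X, chl, rcond=None)

# R-squared of the fit
resid = chl - X @ coef
r2 = 1 - resid.var() / chl.var()
```

The coefficient signs and the overall R-squared are the quantities reported in the model results below; ArcGIS additionally reports per-variable p-values and diagnostics such as the Koenker statistic.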

d.  Geographically Weighted Regression: Based on the observations and results of the OLS analysis (the data being nonstationary), I decided to conduct a Geographically Weighted Regression analysis.  This tool performs a local form of linear regression used to model spatially varying relationships.  The dependent variable was chlorophyll and the explanatory variable was SST.
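The local form of the regression can be sketched by refitting weighted least squares at each feature location. This is a simplified Python illustration on hypothetical data; ArcGIS's GWR tool additionally handles bandwidth selection, kernel choice, and diagnostics:

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """GWR sketch: refit a weighted least-squares model at each location,
    with nearby observations weighted more heavily (Gaussian kernel)."""
    coefs = []
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian spatial weights
        XtW = X.T * w                             # X.T @ diag(w), cheaply
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        coefs.append(beta)
    return np.array(coefs)

# Hypothetical data where the Chl~SST slope drifts with easting
rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, (60, 2))
sst = rng.uniform(20, 27, 60)
slope = -0.05 - 0.0002 * coords[:, 0]     # -0.05 in the west, -0.07 in the east
chl = 2.0 + slope * (sst - 23.5) + rng.normal(0, 0.005, 60)

X = np.column_stack([np.ones(60), sst])
local = gwr_coefficients(coords, X, chl, bandwidth=25.0)
# local[:, 1] holds the spatially varying SST coefficient at each location
```

Mapping `local[:, 1]` is exactly the "spatial pattern of coefficients" idea shown in the results: the sign and magnitude of the SST effect can differ from place to place.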


Hotspot Analysis:  The results of the hotspot analysis (shown below) suggest that the turtle locations are significantly clustered off the coast of Louisiana and Texas in water approximately 2,000 to 3,000 m deep.  However, the results of this analysis appear to be quite deceptive: upon taking measurements of turtles in the hotspot cluster, it appears they may be more dispersed.  Further analysis is needed to determine additional patterns in sea turtle locations.


Figure 4: Results of the hotspot analysis for leatherback sea turtle locations

Spatial Autocorrelation (Moran’s I) – sea surface temperature:  The results of the spatial Autocorrelation tool suggest that the pattern of Sea Surface temperature at each feature location is clustered.  The Moran’s Index was 0.402514, the z-score was 2.608211, and the p-value was 0.009102.   Since the critical value (z-score) was greater than 2.58 there is less than 1-percent likelihood that the clustered pattern is a result of random chance.


Figure 5: Sea surface temperature results for Moran’s I tool.

Spatial Autocorrelation (Moran’s I) – Chlorophyll-a :  The results of the spatial Autocorrelation tool suggest that the pattern of chlorophyll at each feature location is clustered.  The Moran’s Index was 0.346961, the z-score was 2.216243, and the p-value was 0.026675.  The critical value (z-score) was less than 2.58 but greater than 1.96 thus suggesting that there is less than 5-percent likelihood that the clustered pattern is a result of random chance.


Figure 6:  Results of the spatial autocorrelation Moran’s I for chlorophyll-a at the leatherback sea turtle locations.

Ordinary Least Squares:  As described above, this model was run three times with chlorophyll as the dependent variable, adding explanatory variables each time: SST alone; then SST and depth (m); then SST, depth, and turtle count.

Model 1:

Model Structure:  Chl-a = f(SST)

Model Results:

a) Overall r2: 0.449473

b) Coefficient on SST in model: -0.052654

The results suggest that Chl is negatively related to SST and given the p-value of 0.000, we can deduce that SST is a significant predictor of Chlorophyll-a.

Model 2:

Model Structure:  Chl-a = f(SST, Depth)

Model Results:

  a) Overall r2: 0.452436
  b) Coefficient on SST: -0.051769
  c) Coefficient on depth: -0.000015

The results suggest that SST and Depth are negatively correlated with Chlorophyll.  Given the p-value of 0.056397 for depth, this is not a statistically significant relationship.  The p-value for SST is 0.0000 suggesting that it is a statistically significant relationship and is a better predictor of chlorophyll-a than depth.

Model 3:

Model Structure:  Chl-a = f(SST, Depth, Count)

Model Results:

  a) Overall r2: 0.460299
  b) Coefficient on SST: -0.050454
  c) Coefficient on Depth: -0.000014
  d) Coefficient on Count: 0.121644

The results suggest that SST and Depth are negatively correlated with Chlorophyll and the count is positively correlated.  Given the p-value of 0.067482 for depth, this is not a statistically significant relationship.  The p-value for Count is 0.001815, suggesting that the relationship is statistically significant.  The p-value for SST is 0.0000 suggesting that it too is a statistically significant relationship.  In this model we see that the number of turtles found at each location and the SST values have statistically significant relationships to Chlorophyll.  SST has the lowest p-value and would suggest that it is the best indicator for chlorophyll, though we should not discount the count variable.

Overall results: Running the model with three explanatory variables provided the best overall R-squared value of 0.46.  Because the Koenker (BP) statistic was statistically significant, I used the Joint Wald statistic to assess overall model significance.  The Joint Wald statistic was significant, as shown below:


Figure 7: OLS model 3 – Joint Wald Statistic

The Koenker statistic is used to assess model stationarity.  This statistic revealed that the model was nonstationary in geographic space and/or data space, due to its significance (p-value < 0.05), as shown below:


Figure 8:  OLS model 3 – Koenker BP Statistic

Since the Koenker statistic was significant, it was appropriate to look at the robust probabilities of each variable to assess its effectiveness.  The robust probability scores reveal that count and SST are statistically significant (as shown below), and thus important to the regression model.  The model’s nonstationarity also makes it a good candidate for Geographically Weighted Regression.


Figure 9: OLS model 3 – Variable Statistics


Geographically Weighted Regression:

Model Structure:  Chl = f(SST, Depth, Count)

Model Result

a) spatial pattern of r2 values (map)


Figure 10: Geographically Weighted Regression Analysis: Map of the Local R-Squared values

After conducting the GWR analysis using chlorophyll as the dependent variable and sea surface temperature as the explanatory variable, I mapped the local R-squared value at each feature location to show where the model predicted well and where it predicted poorly.  The map shows that predictions are best where turtle locations fall in areas of cooler sea surface temperature off the Texas coastline.
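Conceptually, GWR just refits a weighted least-squares model at every location, down-weighting distant observations with a kernel; the local R-squared mapped above comes from each of those local fits. A minimal sketch with a Gaussian kernel and a fixed bandwidth on synthetic data (real GWR software also selects the bandwidth, e.g. by AICc):

```python
import numpy as np

def gwr_local(coords, X, y, bandwidth):
    """Minimal GWR: one weighted least-squares fit per location.
    X must include an intercept column. Returns local coefficients
    and a local weighted R-squared at every point."""
    n, k = X.shape
    betas = np.empty((n, k))
    local_r2 = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)      # Gaussian weights
        W = np.diag(w)
        # Weighted least squares: beta = (X'WX)^-1 X'Wy
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        betas[i] = beta
        resid = y - X @ beta
        ybar = np.average(y, weights=w)
        local_r2[i] = 1 - (w * resid**2).sum() / (w * (y - ybar)**2).sum()
    return betas, local_r2

# Hypothetical usage: the slope varies smoothly across space,
# which a global OLS fit would miss but local fits capture
rng = np.random.default_rng(1)
n = 100
coords = rng.uniform(0, 1, (n, 2))
x = rng.normal(0, 1, n)
y = 1 + (1 + coords[:, 0]) * x + rng.normal(0, 0.05, n)
X = np.column_stack([np.ones(n), x])
betas, local_r2 = gwr_local(coords, X, y, bandwidth=0.2)
```

Mapping `local_r2` back onto the point coordinates is what produces a surface like Figure 10; mapping a column of `betas` produces a coefficient surface like Figure 11.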

b) Spatial pattern of coefficients for SST


Figure 11:   Geographically Weighted Regression Analysis: Map of the coefficients

I mapped the coefficients in order to understand regional variation in the model.  When using GWR to model chlorophyll (the dependent variable), I was interested in understanding the factors that contribute to the turtle locations (or the chlorophyll at each turtle location).  I was also interested in examining the stationarity of the data, since the OLS model revealed it was nonstationary.  To do this, I mapped the coefficient distribution as a surface to show where, and by how much, the coefficients vary.  As the map shows, sea surface temperature has a negative relationship with chlorophyll.  The range of the coefficients is -0.066176 to -0.64989, with very little variation.  The results of this test can help inform policies at a regional scale.

Significance: What did you learn from your results?  How are these results important to science? to resource managers?

The results of the tests suggest that chlorophyll-a has a negative relationship with sea surface temperature at leatherback sea turtle locations in the Gulf of Mexico.  Sea turtles are found in areas where sea surface temperature is lower and chlorophyll values are higher.  After mapping the coefficients of the variables, we see little variation across the study area, suggesting that policies protecting leatherback sea turtles should extend throughout the entire Gulf of Mexico rather than being limited to a few selected areas.

Learning Outcomes:

Programming in R:

With the help of Mark Roberts, I was able to use R to load the ggplot2 package and plot the correlation between chlorophyll-a and sea surface temperature. There is lots of room for improvement here!  The following is the code used to make the plot, along with the resulting plot:


Figure 12: ggplot code used in R to plot the correlation between chlorophyll-a and sea surface temperature.



Figure 13: Correlation between chlorophyll-a and sea surface temperature.
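Since Figures 12 and 13 are images, here is a rough numeric counterpart to the plotted relationship: the Pearson correlation coefficient, computed on hypothetical data built with the same negative SST-chlorophyll association found above.

```python
import numpy as np

rng = np.random.default_rng(7)
sst = rng.uniform(18, 30, 100)
# Chlorophyll declining with SST, as the analysis above found
chl = 1.5 - 0.05 * sst + rng.normal(0, 0.05, 100)

r = np.corrcoef(sst, chl)[0, 1]    # Pearson correlation coefficient
print(round(r, 3))
```

A strongly negative `r` here corresponds to the downward-sloping fit line in Figure 13.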


What did you learn about spatial statistics:

I think scale is an important factor.  You need to understand your data and not take results at face value; investigate everything to make sure the data and the results make sense.  If the results appear incorrect, they probably are incorrect.  I also learned that it is dangerous to conduct spatial statistical analyses unless you can interpret what makes sense: the spatial autocorrelation tests on SST and Chl-a indicated that they were clustered, but I was still confused by that outcome.

While conducting research, it is important to think about the scale of the data to ensure that all of the datasets line up.  I encountered an issue with the SST data due to the way it was sampled: many sea turtle locations had no sea surface temperature data associated with them.  A classmate also pointed out that I did not consider the effects of currents on jellyfish.  This could be a significant factor in understanding why sea turtles are present in the Gulf of Mexico, since jellyfish are partially distributed by currents.

Another lesson learned is to remember the basic principles of geographic information systems while conducting these analyses.  For instance, it is important to turn on extensions and to make sure all data are properly projected before conducting any analysis, as projection problems can and will throw off the results.  Overall, it is important to know your data.