Data Wrangling to Assess Data Availability: A Data Detective at Work

By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

Data wrangling, in my own loose definition, is the necessary combination of both data selection and data collection. Wrangling your data requires accessing then assessing your data. Data collection is just what it sounds like: gathering all data points necessary for your project. Data selection is the process of cleaning and trimming data for final analyses; it is a whole new bag of worms that requires decision-making and critical thinking. During this process of data wrangling, I discovered there are two major avenues to obtain data: 1) you collect it, which frequently requires an exorbitant amount of time in the field, in the lab, and/or behind a computer, or 2) other people have already collected it, and through collaboration you put it to good use (often a different use than its initial intent). The latter approach may result in the collection of so much data that you must decide which data should be included to answer your hypotheses. This process of data wrangling is the hurdle I am facing at this moment. I feel like I am a data detective.

Data wrangling illustrated by members of the R-programming community. (Image source: R-bloggers.com)

My project focuses on assessing the health conditions of the two ecotypes of bottlenose dolphins in the waters from Ensenada, Baja California, Mexico to San Francisco, California, USA from 1981 to 2015. During the government shutdown, much of my data was inaccessible because it was in the possession of my collaborators at federal agencies. However, now that the shutdown is over, my data is flowing in, and my questions are piling up. I can now begin to look at where these animals have been sighted over the past decades, which ecotypes have higher contaminant levels in their blubber, which animals have higher stress levels and if these are related to geospatial location, where animals are more susceptible to human disturbance, if sex plays a role in stress or contaminant load levels, which environmental variables influence stress levels and contaminant levels, and more!

Alexa, alongside collaborators, photographing transiting bottlenose dolphins along the coastline near Santa Barbara, CA in 2015 as part of the data collection process. (Image source: Nick Kellar).

Over the last two weeks, I was emailed three separate Excel spreadsheets representing three datasets that contain partially overlapping data. If Microsoft Access is foreign to you, I would compare this dilemma to a very confusing exam question of “matching the word with the definition”, except with the words being in different languages from the definitions. If you have used Microsoft Access databases, you probably know the system of querying and matching data in different databases. Well, imagine trying to do this with Excel spreadsheets because the databases are not linked. Now you can see why I need to take a data management course and start using platforms other than Excel to manage my data.

A visual interpretation of trying to combine datasets being like matching the English definition to the Spanish translation. (Image source: Enchanted Learning)

In the first dataset, there are 6,136 sightings of Common bottlenose dolphins (Tursiops truncatus) documented in my study area. Some years have no sightings, some years have fewer than 100 sightings, and other years have over 500 sightings. In another dataset, there are 398 bottlenose dolphin biopsy samples collected between the years of 1992-2016 in a genetics database that can provide the sex of the animal. The final dataset contains records of 774 bottlenose dolphin biopsy samples collected between 1993-2018 that could be tested for hormone and/or contaminant levels. Some of these samples have identification numbers that can be matched to the other dataset. Within these cross-reference matches there are conflicting data in terms of amount of tissue remaining for analyses. Sorting these conflicts out will involve more digging from my end and additional communication with collaborators: data wrangling at its best. Circling back to what I mentioned in the beginning of this post, this data was collected by other people over decades and the collection methods were not standardized for my project. I benefit from years of data collection by other scientists and I am grateful for all of their hard work. However, now my hard work begins.
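The cross-referencing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`sample_id`, `tissue_mg`, `sex`) and every record are invented, and the real spreadsheets will be organized differently. The idea is to match samples by identification number and flag any match where the two sources disagree about the tissue remaining.

```python
# Hypothetical records: field names and values are invented for illustration only.
genetics = [
    {"sample_id": "B001", "sex": "F", "tissue_mg": 40},
    {"sample_id": "B002", "sex": "M", "tissue_mg": 0},
    {"sample_id": "B003", "sex": "F", "tissue_mg": 25},
]
hormones = [
    {"sample_id": "B002", "tissue_mg": 15},
    {"sample_id": "B003", "tissue_mg": 25},
    {"sample_id": "B004", "tissue_mg": 60},
]


def cross_reference(left, right, key="sample_id", field="tissue_mg"):
    """Group sample IDs into matched, conflicting, and unmatched sets."""
    left_by_id = {row[key]: row for row in left}
    right_by_id = {row[key]: row for row in right}

    # IDs present in both datasets
    shared = sorted(left_by_id.keys() & right_by_id.keys())
    # Shared IDs where the two sources report different values for `field`
    conflicts = [k for k in shared if left_by_id[k][field] != right_by_id[k][field]]

    return {
        "matched": shared,
        "conflicts": conflicts,
        "left_only": sorted(left_by_id.keys() - right_by_id.keys()),
        "right_only": sorted(right_by_id.keys() - left_by_id.keys()),
    }


report = cross_reference(genetics, hormones)
for group, ids in report.items():
    print(group, ids)
```

The IDs listed under `conflicts` are the ones that would go back to collaborators for clarification; the script only finds the disagreements, it does not decide which source is right.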

The cutest part of data wrangling: finding adorable images of bottlenose dolphins, photographed during a coastal survey. (Image source: Alexa Kownacki).

There is also a large amount of data that I downloaded from federally-maintained websites. For example, dolphin sighting data from research cruises are available for public access from the OBIS (Ocean Biogeographic Information System) Sea Map website. It boasts 5,927,551 records from 1,096 data sets containing information on 711 species with the help of 410 collaborators. This website is incredible: it allows you to search through different data criteria, download the data in a variety of formats, and explore an interactive map of the data. You can explore this at your leisure, but I want to point out the sheer amount of data. In my case, the OBIS Sea Map website is only one major platform that contains many sources of data that have already been collected, not specifically for me or my project, but that I will use. As a follow-up to using data collected by other scientists, it is critical to give credit where credit is due. One of the benefits of using this website is that it provides information about how to properly credit the collaborators when downloading data. See below for an example:

Example citation for a dataset (Dataset ID: 1201):

Lockhart, G.G., DiGiovanni Jr., R.A., DePerte, A.M. 2014. Virginia and Maryland Sea Turtle Research and Conservation Initiative Aerial Survey Sightings, May 2011 through July 2013. Downloaded from OBIS-SEAMAP (http://seamap.env.duke.edu/dataset/1201) on xxxx-xx-xx.

Citation for OBIS-SEAMAP:

Halpin, P.N., A.J. Read, E. Fujioka, B.D. Best, B. Donnelly, L.J. Hazen, C. Kot, K. Urian, E. LaBrecque, A. Dimatteo, J. Cleary, C. Good, L.B. Crowder, and K.D. Hyrenbach. 2009. OBIS-SEAMAP: The world data center for marine mammal, sea bird, and sea turtle distributions. Oceanography 22(2):104-115

Another federally-maintained data source that boasts more data than I can quantify is the well-known ERDDAP website. After a few Google searches, I finally discovered that the acronym stands for Environmental Research Division’s Data Access Program. Essentially, this is the holy grail of environmental data for marine scientists. I have downloaded so much data from this website that Excel cannot open the csv files. Here is yet another reason why young scientists, like myself, need to transition out of using Excel and into data management systems that are developed to handle large-scale datasets. My downloads include everything from daily sea surface temperatures collected on every one-degree line of latitude and longitude from 1981-2015 over my entire study site to Ekman transport levels taken every six hours on every longitudinal degree line over my study area. I will add some environmental variables in species distribution models to see which account for the largest amount of variability in my data. The next step in data selection begins with statistics. It is important to find if there are highly correlated environmental factors prior to modeling data. Learn more about fitting cetacean data to models here.
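One simple way to screen for highly correlated environmental factors is to compute pairwise Pearson correlations and flag any pair above a cutoff. The sketch below assumes invented variable names and values, and the 0.7 threshold is a rule-of-thumb assumption rather than a value taken from this project.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(variables, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical environmental series (values invented for illustration)
env = {
    "sst_c": [14.1, 15.3, 16.8, 18.2, 17.5, 15.0],
    "air_temp_c": [13.0, 14.5, 16.1, 17.9, 17.0, 14.2],
    "ekman_transport": [120.0, 40.0, 95.0, 60.0, 130.0, 75.0],
}

flagged = correlated_pairs(env)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are strongly correlated (r = {r})")
```

When a pair is flagged, a common choice is to keep only one of the two variables in a given model, so that their effects are not confounded.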

The ERDDAP website combined all of the average Sea Surface Temperatures collected daily from 1981-2018 over my study site into a graphical display of monthly composites. (Image Source: ERDDAP)

As you can imagine, this amount of data from many sources and collaborators is equal parts daunting and exhilarating. Before I even begin the process of determining the spatial and temporal spread of dolphin sightings data, I have to identify which data points have sex identified from either hormone levels or genetics, which data points have contaminant levels already quantified, which samples still have tissue available for additional testing, and so on. Once I have cleaned up the datasets, I will import the data into the R programming environment. Then I can visualize my data in plots, charts, and graphs; this will help me identify outliers and potential challenges with my data, and, hopefully, start to see answers to my focal questions. Only then can I dive into the deep and exciting waters of species distribution modeling and more advanced statistical analyses. This is data wrangling and I am the data detective.

What people may think a ‘data detective’ looks like, when, in reality, it is a person sitting at a computer. (Image source: Elder Research)

Like the well-known phrase, “With great power comes great responsibility”, I believe that with great data, comes great responsibility, because data is power. It is up to me as the scientist to decide which data is most powerful at answering my questions.

Data is information. Information is knowledge. Knowledge is power. (Image source: thedatachick.com)

 

The seamounts are calling and I must go: a humpback’s landscape

Solène Derville, Entropie Lab, Institute of Research for Development, Nouméa, New Caledonia (Ph.D. student under the co-supervision of Dr. Leigh Torres)

The deep ocean is awe-inspiring: vast, mysterious, and complex… I can find many adjectives to describe it, yet the immensity of it prevents me from picturing it in my mind. Landscapes are easy to imagine because we see them all the time, but their hidden ocean counterparts of seascapes with several kilometer-high seamounts and abyssal trenches are hard to visualize.

When I started a PhD on the spatial ecology of humpback whales, a species typically known for its coastal distributions, I never imagined my research would lead me to seamounts. Lesson of the day: you never know where research will lead you… So here is how it happened.

About twenty years ago when my supervisor, Dr Claire Garrigue, started working on humpback whales in New Caledonia, she was told by fishermen that humpbacks were often observed in prime fishing locations, about 170 km south of the mainland. After a little more investigation into this claim, it was discovered that these fishing spots corresponded with two seafloor topographic features: the Antigonia seamount and Torch Bank (Fig. 1). These features rise from the seafloor to depths of 30 m and 60 m respectively and are surrounded by waters about 1500 m deep. This led Dr. Garrigue to implement an ARGOS-satellite tagging program to follow the movements of humpbacks leaving the South Lagoon (one of the main breeding areas in New Caledonia, Fig. 1). Sure enough, most of the tagged whales (61%) visited the Antigonia seamount (Fig. 2; Garrigue et al. 2015).

Figure 1: Map of New Caledonia and our study areas: the South Lagoon and the “Southern Seamounts”. Light grey lines represent 200m isobaths. Land is shown in black and reefs in grey.
Figure 2: ARGOS tracking of 34 humpback whales tagged between 2007 and 2012 in the South Lagoon. The Antigonia seamount and Torch Bank are completely covered by tracklines.

 

Seamounts are defined as “undersea mountains rising at least 100m from the ocean seafloor” (Staudigel et al. 2010). Most of them have a volcanic origin and the majority of them are located in the Pacific Ocean (Wessel 2001). But what is the link between these structures and marine life? The physical and biological mechanisms by which seamounts attract marine wildlife are diverse (for a review see: Pitcher et al. 2008). In a nutshell, the topography of the ocean floor influences water circulation, and isolated seabed features such as seamounts affect vertical mixing and create turbulence, consequently resulting in higher productivity.

For instance, have you ever heard of internal waves? Contrary to the surface waves people play in at the beach, internal waves propagate in three dimensions within the water column and can reach heights greater than 100 m! When these waves encounter steep topography, they break, similar to what a “normal” wave would do when reaching shore. This creates complex turbulence, which in turn may attract megafauna such as cetaceans (see com. by Hans van Haren).

The importance of seamounts for cetaceans is often referenced in the literature. However, few studies have tried to quantify this preference (one of which was recently published by our labmate Courtney Hann, see Hann et al. 2016 for details). So what importance do these seamounts have for humpback whales in New Caledonia? Are they breeding grounds, do they serve as a navigation cue, a resting area, or even a foraging spot (the latter being the least likely hypothesis given that humpback whales have never been observed feeding in tropical waters)?

To answer this question, an expedition to Antigonia was organized in 2008 and about 40 groups of whales were observed in only 7 days! The density of this aggregation, the high occurrence of groups with calves and the consistent singing of males suggested that this area may be associated with breeding or calving behavior. Several other missions followed, confirming the importance of this offshore habitat for humpbacks.

Looking through all this data I was struck by two things: 1) whales were densely aggregated on top of these seamounts but were rarely found in the surrounding area (Fig. 3), and 2) other seamounts with similar characteristics are only a few kilometers from Antigonia, but seem to be rarely visited by tagged whales.

What is so special about these seamounts? Why would energetically depleted females with calves choose to aggregate in these off-shore, densely occupied and unsheltered waters?

 

Figure 3: 3D surface plot of the seabed in the Southern Seamounts area. Humpback whale groups observed during the boat-based surveys (2001-2011) are projected at the surface of the seabed: blue points represent groups without a calf and white points represent groups with a calf. Antigonia and Torch Bank have a clear flat-topped shape and are called “guyot” seamounts. Most whale groups aggregated on top of these guyots. For 3D interactive plot: click here.

I will spend the next two months at the GEMM lab in Newport, OR, trying to answer these questions using ocean models developed by New Caledonian local research teams (at IRD and Ifremer). I will be comparing maps of local currents and topography of several seabed features located south of the New Caledonia main island. The oceanographic model used for this study will allow me to analyze a great number of environmental variables (temperature, salinity, vertical mixing, vorticity etc.) through the water column (one layer every 10m, from 0 to 500m deep) and at a very fine spatio-temporal scale (1km and 1day, even 1 hour at specific discrete locations) to better understand humpback whale habitat preferences.

Figure 4: Modeled Sea Surface Temperature for July 15th 2013 (model in progress, based on MARS3D, development by Romain Le Gendre). A temperature front occurs in the middle of the study area, along the Norfolk ridge. On this image, a cold eddy is forming right on top of the Antigonia seamount.

 

Looking forward to uncovering the mysteries of seamounts and sharing the results in December!

Literature Cited

Garrigue C, Clapham PJ, Geyer Y, Kennedy AS, Zerbini AN (2015) Satellite tracking reveals novel migratory patterns and the importance of seamounts for endangered South Pacific Humpback Whales. R Soc Open Sci

Hann CH, Smith TD, Torres LG (2016) A sperm whale’s perspective: The importance of seasonality and seamount depth. Mar Mammal Sci:1–12

Pitcher TJ, Morato T, Hart PJ, Clark MR, Haggan N, Santos RS (2008) Seamounts: ecology, fisheries & conservation. Oxford, UK: Blackwell Publishing Ltd.

Wessel P (2001) Global distribution of seamounts inferred from gridded Geosat/ERS-1 altimetry. J Geophys Res 106:19431–19441

Staudigel H, Koppers AP, Lavelle JW, Pitcher TJ, Shank TM (2010) Defining the word ‘seamount’. Oceanography 23:20–21

New Zealand’s mega-fauna come to Newport, Oregon.

By Olivia Hamilton, PhD Candidate, University of Auckland, New Zealand.

The week leading up to my departure from New Zealand was an emotional rollercoaster. Excited, nervous, eager, reluctant… I did not feel like the fearless adventurer that I thought I was. D-day arrived and I said my final goodbyes to my boyfriend and mother at the departure gate. Off I went on my three-month research stint at the Hatfield Marine Science Center.

Some thirty hours later I touched down in Portland. I collected my bags and headed towards the public transport area at the airport. A young man greeted me, “Would you like to catch a taxi or a shuttle, ma’am?” “A taxi please! I have no idea where I am”, I responded. He nodded and smiled. I could see the confusion all over his face… My thick kiwi accent was going to make for some challenging conversations.

After a few days in Portland acclimatizing to the different way of life in Oregon, it was time to push on to Newport. I hit a stroke of luck and was able to take the scenic route with one of the girls in the GEMM lab, Rachael Orben. With only one wrong turn we made it to the Oregon coast. I was instantly hit with a sense of familiarity. The rugged coastline and temperate coastal forest resembled that of the west coast of New Zealand. However, America was not shy in reminding me of where I was with its big cars, drive-through everything, and RVs larger than some small kiwi houses.

The Oregon Coast. Photo by Olivia Hamilton.

We arrived at Hatfield Marine Science Center: the place I was to call home for the next quarter of a year.

So, what am I doing here?

In short, I have come to do computer work on the other side of the world.

Dr. Leigh Torres is on my PhD committee and I am lucky enough to have been given the opportunity to come to Newport and analyze my data under her guidance.

My PhD has a broad interest in the spatial ecology of megafauna in the Hauraki Gulf, New Zealand. For my study, megafauna includes whales, dolphins, sharks, rays, and seabirds. The Hauraki Gulf is adjacent to Auckland, New Zealand’s most populated city and home to one of our largest commercial ports. The Hauraki Gulf is a highly productive area, providing an ideal habitat for a number of fish species, thus supporting a number of top marine predators. As with many coastal areas, anthropogenic activities have degraded the health of the Gulf’s ecosystem. Commercial and recreational fishing, run-off from surrounding urban and rural land, boat traffic, pollution, dredging, and aquaculture are some of the main activities that threaten the Gulf and the species that inhabit it. For instance, the Nationally Endangered Bryde’s whale is a year-round resident in the Hauraki Gulf and these whales spend much of their time close to the surface, making them highly vulnerable to injury or death from ship-strikes. In spite of these threats, the Gulf supports a number of top marine predators. Therefore, it is important that we uncover how these top predators are using the Gulf, in both space and time, to identify ecologically important parts of their habitat. Moreover, this study presents a unique opportunity to look at the relationships between top marine predators and their prey inhabiting a common area.

The Hauraki Gulf, New Zealand. The purple lines represent the track lines that aerial surveys were conducted along.

 

Common dolphins in the Hauraki Gulf. Photo by Olivia Hamilton

 

A Bryde’s whale, common dolphins, and some opportunistic seabirds foraging in the Hauraki Gulf. Photo by Isabella Tortora Brayda di Belvedere.

 

Australasian Gannets and shearwaters foraging on a bait ball in the Hauraki Gulf. Photo by Olivia Hamilton.

To collect the data needed to understand the spatial ecology of these megafauna, we conducted 22 aerial surveys over a year-long period along pre-determined track lines within the Hauraki Gulf. On each flight we had four observers who collected sightings data for cetaceans, sharks, predatory fish, prey balls, plankton, and other rare species such as manta rays. An experienced seabird observer joined us approximately once a month to identify seabirds. We collected environmental data for each sighting including Beaufort Sea State, glare, and water color.

The summary of our sightings shows that common dolphins were indeed common, being the most frequent species we observed. The most frequently encountered sharks were bronze whalers, smooth hammerhead sharks, and blue sharks. Sightings of Bryde’s whales were lower than we had hoped, most likely an artifact of our survey design relative to their distribution patterns. In addition, we counted a cumulative total of 11,172 individual seabirds representing 16 species.

Summary of sightings of megafauna in the Hauraki Gulf.

My goal while here at OSU is to develop habitat models for the megafauna species to compare the drivers of their distribution patterns. But, at the moment I am in the less glamorous, but highly important, data processing and decision-making stage. I am grappling with questions like: What environmental variables affected our ability to detect which species on surveys? How do we account for this? Can we clump species that are functionally similar to increase our sample size? These questions are important to address in order to produce reliable results that reflect the megafauna species’ true distribution patterns.

Once these questions are addressed, we can get on to the fun stuff – the habitat modeling and interpretation of the results. I will hopefully be able to start addressing these questions soon: What environmental and biological variables are important predictors of habitat use for different taxa? Are there interactions (attraction or repulsion) between these top predators? What is driving these patterns? Predator avoidance? Competition? So many questions to ask! I am looking forward to answering these questions and reporting back.

International Collaborations: What do the Oregon Coast and Maui’s dolphins have in common?

My name is Solène Derville and I am a master’s student in the Department of Biology at the Ecole Normale Supérieure of Lyon, France. As part of my master’s, I am spending a few months in Newport, where I am working under Dr Leigh Torres’s supervision in the GEMM Lab. Hopefully, this will be the starting point for a longer-term collaboration on a PhD project that I am currently preparing about the spatial ecology of humpback whales in New Caledonia (South Western Pacific Ocean).

Solene at Crater lake

Early one morning in February 2015, I am waiting at the airport for my flight to PORTLAND/PDX. I’ve had only one day to pack but I feel confident that I’ve made the right choices as my 23kg luggage contains mainly jumpers, sweatshirts, thick socks, and a brand new umbrella. I’ve got everything I need to face my four-month internship in rainy Newport, Oregon.

A few disillusionments await me when I finally land: 1) my “saucisson” (fancy sausage) can’t pass customs and ends up in a bin despite my attempts to negotiate with the customs official, and 2) as soon as I am out of the airport, it starts raining. At first sight this looks like the harmless kind of drizzle I’ve experienced in England, until I realize it’s raining sideways! So much for buying a new umbrella…

Luckily, these small inconveniences don’t affect my spirits for long as I get to discover the riches Oregon has to offer.

My mouth drops open the first time someone tells me that I can see elk around Newport and that gray whales are commonly observed next to the jetty at this time of year. It’s difficult to describe to someone who’s always been living in this environment how exciting it is to me. I am not used to all this wilderness and certainly not to living so close to it. It’s a thrill to think that I only need to ride my bike for a few miles to meet the amazing local fauna.

Oregon Coast by Solene

Of course, the beauty of Oregon’s landscapes and the richness of its wildlife are not the only things that catch my attention. I am immediately touched by the kindness of people, the sense of sharing and the deeply rooted sense of community. I feel welcomed at HMSC and by my colleagues in the GEMM lab, and I am eager to start my internship.

So what is my work here exactly?

Well, believe it or not, I’ve crossed the Atlantic Ocean and come to the US to actually work on a dolphin endemic to New Zealand! Dr Leigh Torres and I are investigating the fine-scale distribution and habitat selection patterns of Maui’s dolphin (Cephalorhynchus hectori maui). This subspecies of the more common Hector’s dolphin (Cephalorhynchus hectori, also endemic to New Zealand) is the smallest dolphin in the world and unfortunately among the most endangered (listed as “critically endangered” by the IUCN). The Maui’s dolphin population is thought to have decreased to under 100 individuals in the past decades.

Maui's dolphin credit: Will Rayment

In practice, this means I am doing data analysis, so I spend my days in front of my computer. This may sound a bit dull, but computer work is actually a large part of research in ecology (the awesome field work stage is only the tip of the iceberg). Speaking for myself, I’ve always found it very exciting to put together all this hard-won data to answer important questions, especially when the conservation of species as emblematic as the Maui’s dolphin is at stake. To tell the truth, the nerdy code writing work is also a lot of fun!

My data set consists of boat-based observations of Maui dolphin groups made during the 2010, 2011, 2013 and 2015 summer surveys. Overall about a hundred groups were observed. Based on these observations we would like to know: WHERE are the Maui dolphins (distribution pattern)? And WHY (habitat preferences)?

New Zealand

My job is first to describe the spatial distribution patterns of these observations given the year, composition of groups, or group behaviour (whether animals were feeding, resting etc.). This can be done using kernel density estimates: a very good method for “smoothing” a distribution in 2 dimensions and highlighting its main characteristics (extent, core areas etc.). This allows us to answer (or try to answer) the “WHERE” question.
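To make the “smoothing” idea concrete, here is a minimal sketch of a two-dimensional Gaussian kernel density estimate. It is not the analysis used in this project: the coordinates are invented, positions are assumed to be in kilometres on a projected grid, and the bandwidth is fixed by hand rather than estimated from the data.

```python
from math import exp, pi


def gaussian_kde_2d(points, grid_x, grid_y, bandwidth):
    """Return density[j][i], the estimated density at (grid_x[i], grid_y[j])."""
    # Normalising constant of a 2D isotropic Gaussian kernel, averaged over points
    norm = 1.0 / (len(points) * 2.0 * pi * bandwidth ** 2)
    density = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            # Each observation contributes a Gaussian "bump" centred on itself
            total = sum(
                exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        density.append(row)
    return density


# Hypothetical dolphin group positions (x, y) in km; values invented for illustration
sightings = [(2.0, 3.0), (2.5, 3.2), (2.2, 2.8), (7.0, 8.0)]
grid = [float(i) for i in range(11)]  # 0 to 10 km in 1 km steps

density = gaussian_kde_2d(sightings, grid, grid, bandwidth=1.0)

# Locate the grid cell with the highest estimated density (the "core area")
peak_value, peak_x, peak_y = max(
    (value, x, y) for y, row in zip(grid, density) for x, value in zip(grid, row)
)
print(f"Highest density near x = {peak_x} km, y = {peak_y} km")
```

A larger bandwidth gives a smoother surface that emphasizes overall extent, while a smaller one picks out tight clusters, so the choice of bandwidth strongly shapes what the resulting map appears to show.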

Kernel density maps

The second stage of my analysis is to describe the environmental conditions at each of the dolphin group locations and compare them with the environmental conditions in surveyed areas where Maui dolphins were not observed. This allows us to better understand the environmental cues that Maui dolphins might be following to find “suitable” places for their every-day activities and therefore try to answer the “WHY” question. In statistical jargon, we are exploring the relationship between probability of presence of Maui dolphins and environmental predictors such as: sea surface temperature, turbidity of the water, distance to closest river mouths, distance to the coast and depth.
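One common way to relate probability of presence to an environmental predictor is logistic regression. The sketch below is an assumption-laden illustration, not the model used in this study: it uses a single predictor (distance to the closest river mouth), invented presence/absence data, and a plain gradient-ascent fit written out by hand.

```python
from math import exp


def sigmoid(z):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Fit P(presence) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for xi, yi in zip(x, y):
            error = yi - sigmoid(b0 + b1 * xi)  # observed minus predicted
            grad0 += error
            grad1 += error * xi
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: distance to the closest river mouth (km) and whether
# dolphins were observed (1) or not (0). Values invented for illustration.
distance_km = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
presence = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic(distance_km, presence)
p_near = sigmoid(b0 + b1 * 1.0)
p_far = sigmoid(b0 + b1 * 7.0)
print(f"slope = {b1:.2f}; P(presence) at 1 km = {p_near:.2f}, at 7 km = {p_far:.2f}")
```

With these made-up data the fitted slope is negative, meaning presence becomes less likely farther from a river mouth. A real analysis would include several predictors at once and could allow non-linear responses.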

The resulting models will be used to predict seasonal variations in Maui’s dolphin distribution, notably in winter when direct surveying is difficult because of weather conditions. Based on the resulting dynamic distribution models, we finally aim to predict how Maui’s dolphins might interact with anthropogenic activities or react to changes in their environment.

So far, preliminary results are very promising and I am hoping to share these soon!