Coding stories, tips, and tricks

Clara Bird¹ and Karen Lohman²

¹Masters Student in Wildlife Science, Geospatial Ecology of Marine Megafauna Lab

²Masters Student in Wildlife Science, Cetacean Conservation and Genomics Laboratory

In a departure from my typical science-focused blog, this week I thought I would share more about myself. I was inspired by International Women’s Day and, with some reflection on the last eight months as a graduate student, I decided to look back on the role that coding has played in my life. We hear a lot about how empowering coding can be, so I thought it might be interesting to talk about my personal experience of feeling empowered by coding. I’ve also invited a fellow grad student in the Marine Mammal Institute, Karen Lohman, to co-author this post. We’re going to briefly talk about our experiences with coding and then finish with advice for getting started with coding and coding for data analysis.

Our Stories

Clara

I’ve only been coding for a little over two and a half years. In summer 2017 I did an NSF REU (Research Experience for Undergraduates) at Bigelow Laboratory for Ocean Sciences, where I taught myself Python (with the support of a post-doc) for a data analysis project. During those 10 weeks, I coded all day, every workday. From that experience, I not only acquired the hard skill of programming, but I also gained a good amount of confidence in myself, and here’s why: for the first three years of my undergraduate career, coding was a daunting skill that I knew I would eventually need but did not know where to start. So, I essentially ended up learning by jumping off the deep end, and I found that immersion was the most effective learning method for me. With coding, you find out whether you got something right (or wrong) almost instantaneously. I’ve found that this is a double-edged sword: it means you can easily have days where everything goes wrong, but the feeling when it finally works is what I think of when I hear the term empowerment. I’m not quite sure how to put it into words, but it’s a combination of independence, confidence, and success.

Aside from learning the fundamentals, I finished that summer with confidence in my ability to teach myself not just new coding skills, but other skills as well. Feeling confident in my ability to learn something new has been the most helpful factor in allowing me to hit the ground running in grad school and in keeping the ‘imposter syndrome’ at bay (most of the time).

Clara’s Favorite Command: groupby (Python/pandas) – Say you have a column of measurements and a second column with the location where each measurement was taken. If you want the mean measurement for each location, you can use groupby. It would look like this: dataframe.groupby('Location')['Measurement'].mean().reset_index()
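For anyone who wants to try that pattern, here is a minimal runnable sketch (the dataframe contents are made up for illustration):

```python
import pandas as pd

# Hypothetical field data: one measurement column, one location column
df = pd.DataFrame({
    "Location": ["Bay", "Bay", "Reef", "Reef", "Reef"],
    "Measurement": [2.0, 4.0, 3.0, 5.0, 7.0],
})

# Mean measurement per location; reset_index turns the result
# back into a regular dataframe instead of an indexed series
means = df.groupby("Location")["Measurement"].mean().reset_index()
print(means)
```

Running this prints one row per location with its mean measurement.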

Karen

I’m quite new to coding, but once I started learning I was completely enchanted! I was first introduced to coding while working as a field assistant for a PhD student (a true R wizard who has since developed deep learning computer vision packages for automated camera trap image analysis) in the cloud forest of the Ecuadorian Andes. This remote jungle was where I first saw how useful coding can be for data management and analysis. It was a strange juxtaposition between being fully immersed in nature for remote field work and learning to think along the lines of coding syntax. It wasn’t the typical introduction to R most people have, but it was an effective hook. We were able to produce preliminary figures and analysis as we collected data, which made a tough field season more rewarding. Coding gave us instant results and motivation.

I committed to fully learning how to code during my first year of graduate school. I first learned linux/command line and python, and then I started working in R the following summer. My graduate research uses population genetics/genomics to better understand the migratory connections of humpback whales, which means I spend a great deal of time developing bioinformatics and big data skills that are essential for this area of research and a goal for my career. For me, coding is a skill that only returns what you put in; you can learn to code quite quickly if you devote the time. After a year of intense learning and struggle, I am writing better code every day.

In grad school research progress can be nebulous, but for me coding has become a concrete way to measure success. If my code ran, I have a win for the week. If not, then I have a clear place to start working the next day. These “tiny wins” are adding up, and coding has become a huge confidence boost.

Karen’s Favorite Command: grep (linux) – Searches for a string pattern and prints all lines containing a match to the screen. Grep has a variety of flags, making it a versatile command that I use every time I’m working in linux.

Advice

Getting Started

  • Be kind to yourself; think of coding as a foreign language. It takes a long time and a lot of practice.
  • Once you know the fundamental concepts in any language, learning another will be easier (we promise!).
  • Ask for help! The chances that you have run into a unique error are quite small; someone out there has already solved your problem, whether it’s a lab mate or another researcher you find on Google!

Coding Tips

1. Set yourself up for success by formatting your datasheets properly

  • Instead of making your spreadsheet easy to read, think about how you want to use the data in the analysis.
  • Avoid formatting (merged cells, wrapped text) and spaces in column headers
  • Try to think ahead when formatting your spreadsheet
    • Maybe chat with someone who has experience and get their advice!
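To see why spaces and formatting in headers cause trouble later, here is a small pandas sketch (the headers are hypothetical) of standardizing column names right after import, which you won’t need if the spreadsheet was formatted well to begin with:

```python
import pandas as pd

# A spreadsheet exported with spaces and mixed case in the headers (made-up example)
df = pd.DataFrame({"Site Name ": ["A", "B"], "Water Temp (C)": [10.2, 11.5]})

# Standardize headers once, up front: strip whitespace, lowercase,
# and replace any run of non-alphanumeric characters with an underscore
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[^a-z0-9]+", "_", regex=True)
              .str.strip("_")
)
print(list(df.columns))  # column names are now code-friendly
```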

2. Start with a plan, start on paper

This low-tech solution saves countless hours of code confusion. It is especially helpful when manipulating large data frames or running multistep analyses. Drawing out the structure of your data and checking it frequently in your code (with 'head' in R/linux) after each manipulation can keep you on track. It is easy to code yourself into circles when you don’t have a clear understanding of what you’re trying to do in each step. Or worse, you could end up with code that runs but doesn’t conduct the analysis you intended, or needed, to do.
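If you work in Python rather than R, pandas has the same idea; a quick sketch (the dataframe is illustrative) of checking the structure after each manipulation:

```python
import pandas as pd

df = pd.DataFrame({"whale_id": ["W1", "W2", "W3"],
                   "length_m": [11.2, 12.8, 10.5]})

# After each manipulation step, glance at the structure before moving on
subset = df[df["length_m"] > 11.0]
print(subset.head())   # first rows: do the columns look right?
print(subset.shape)    # (rows, columns): did the filter keep what you expected?
```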

3. Good organization and habits will get you far

There is an excellent blog by Nice R Code on project organization and file structure. I highly recommend reading and implementing their self-contained scripting suggestions. The further you get into your data analysis, the more object, directory, and function names you have to remember. Develop a naming scheme that makes sense for your project (i.e., flexible, number-based, etc.) and stick with it. Temporary object names in functions or code blocks can be a good way to distinguish code-in-progress from final results.

Figure 1. An example of project-based workflow directory organization from Nice R Code (https://nicercode.github.io/blog/2013-04-05-projects/)

4. Annotate. Then annotate some more.

Make comments in your code so you can remember what each section or line is for. This makes debugging much easier! Annotation is also a good way to stay on track as you code, because you’ll be describing the goal of every line (remember tip 2?). If you’re following a tutorial (or Stack Overflow answer), copy the web address into your annotation so you can find it later. At the end of a coding session, make a quick note of your thought process so it’s easier to pick up where you left off when you come back. It’s also a good habit to add some ‘metadata’ details to the top of your script describing what the script is intended for, what the input files are, the expected outputs, and any other pertinent details for that script. Your future self will thank you!
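Here is a sketch of what that kind of ‘metadata’ header and annotation might look like in a Python script (the script name, filenames, and data are all hypothetical):

```python
# -----------------------------------------------------------
# summarize_sightings.py  (hypothetical example script)
# Purpose: summarize whale sightings per survey day
# Input:   sightings.csv  (columns: date, species, count)
# Output:  daily_totals.csv
# Notes:   groupby pattern adapted from a Stack Overflow
#          answer -- keep the link in a comment like this one!
# -----------------------------------------------------------
import pandas as pd

def daily_totals(sightings: pd.DataFrame) -> pd.DataFrame:
    """Sum sighting counts per date."""
    # One row per date, total count across all species
    return sightings.groupby("date")["count"].sum().reset_index()

# Small made-up dataset so the script runs standalone
demo = pd.DataFrame({"date": ["2020-03-01", "2020-03-01", "2020-03-02"],
                     "species": ["gray", "humpback", "gray"],
                     "count": [2, 1, 3]})
print(daily_totals(demo))
```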

Figure 2. Example code with comments explaining the purpose of each line.

5. Get with git/github already

GitHub is a great way to manage version control. Remember how life-changing the advent of Dropbox was? This is like that, but for code! It has also become a great open-source repository for newly developed code and packages. In addition to backing up and storing your code, GitHub can serve as a ‘coding CV’ that other researchers look to when hiring.

Wondering how to get started with GitHub? Check out this guide: https://guides.github.com/activities/hello-world/

Looking for a good text/code editor? Check out Atom (https://atom.io/); you can push your edits straight to git from there.

6. You don’t have to learn everything, but you should probably learn the R Tidyverse ASAP

The Tidyverse is a collection of data manipulation packages that make data wrangling a breeze. It also includes ggplot2, an incredibly versatile data visualization package. For Python users hesitant to start working in R, the Tidyverse is a great place to start: the syntax will feel more familiar to Python users, and it has wonderful documentation online. It’s also reminiscent of the awk/sed tools in linux, since dplyr removes most of the need to write loops. Loops in any language are painful; learn how to write them, and then learn how to avoid them.
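The loop advice holds outside R too. As a hedged pandas sketch (values are made up), here is the same calculation written as a loop and as a single vectorized column operation:

```python
import pandas as pd

df = pd.DataFrame({"length_m": [11.2, 12.8, 10.5, 13.1]})

# Loop version: works, but slow and verbose on big dataframes
lengths_ft_loop = []
for value in df["length_m"]:
    lengths_ft_loop.append(value * 3.281)

# Vectorized version: one expression applied to the whole column
df["length_ft"] = df["length_m"] * 3.281

# Both give identical results; the vectorized form is shorter and faster
assert list(df["length_ft"]) == lengths_ft_loop
```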

7. Functions!

Break your code into blocks that can be run as functions! This allows easier repetition of data analysis in a more readable format. If you need to call your functions across multiple scripts, put them all into one ‘function.R’ script and source them in your working scripts. This approach ensures that all the scripts access the same function without your copying and pasting it into each of them. Then if you edit the function, the change is made in one place and passed on to all dependent scripts.
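The same pattern works in Python: keep shared helpers in one module and import them from every analysis script instead of copy-pasting. A sketch, with a hypothetical module and function:

```python
# helpers.py  (hypothetical shared module, analogous to a 'function.R' script)
def meters_to_feet(m: float) -> float:
    """Convert a length in meters to feet."""
    return m * 3.28084

# In each analysis script you would then write:
#     from helpers import meters_to_feet
# so every script picks up the same, single definition,
# and editing helpers.py updates them all at once.
print(meters_to_feet(10.0))
```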

8. Don’t take error messages personally

  • Repeat after me: everyone googles every other line of code, and everyone forgets the command some (…er, every) of the time.
  • Debugging is a lifestyle, not a task item.
  • One way to make it less painful is to keep a list of fixes that you find yourself needing multiple times. And ask for help when you’re stuck!

9. Troubleshooting

  • Know that you’re supposed to google but not sure what to search for?
    • Start by copying and pasting the error message.
  • When I started, it was hard to know how to phrase what I wanted; here are some common terms:
    • A dataframe is the coding equivalent of a spreadsheet/table.
    • Do you want to combine two dataframes side by side? That’s a merge.
    • Do you want to stack one dataframe on top of another? That’s concatenating.
    • Do you want the average (or some other statistic) of the values in a column that all belong to one group or category? Check out group by or aggregate.
    • A loop is when you step through every value in a column or list and do something with it (use it in an equation, use it in an if/else statement, etc.).
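In pandas, those terms look like this (a small sketch with made-up dataframes):

```python
import pandas as pd

left = pd.DataFrame({"whale_id": ["W1", "W2"], "length_m": [11.2, 12.8]})
right = pd.DataFrame({"whale_id": ["W1", "W2"], "site": ["Bay", "Reef"]})

# Combine side by side, matching rows on a shared key: a merge
merged = left.merge(right, on="whale_id")

# Stack one dataframe on top of another (same columns): concatenating
more = pd.DataFrame({"whale_id": ["W3"], "length_m": [10.5]})
stacked = pd.concat([left, more], ignore_index=True)

# Group statistic: mean length per site (group by / aggregate)
mean_by_site = merged.groupby("site")["length_m"].mean().reset_index()
```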

Favorite Coding Resource (other than github….)

  • Learnxinyminutes.com
    • This is great ‘one stop googling’ for coding in almost any language! I frequently switch between coding languages, and as a result almost always have this open to check syntax.
  • https://swirlstats.com/
    • This is a really good resource for getting an introduction to R

Parting Thoughts

We hope that our stories and advice have been helpful! As with many skills, you tend to only see people once they have made it over the learning curve. But as you’ve read, Karen and I both started recently and felt intimidated at the beginning. So, be patient, be kind to yourself, believe in yourself, and good luck!

A Weekend of Inspiration in Marine Science: NWSSMM and Dr. Sylvia Earle!

By Karen Lohman, Masters Student in Wildlife Science, Cetacean Conservation and Genomics Lab, Oregon State University

My name is Karen Lohman, and I’m a first-year student in Dr. Scott Baker’s Cetacean Conservation and Genomics Lab at OSU. Dr. Leigh Torres is serving on my committee and has asked me to contribute to the GEMM lab blog from time to time. For my master’s project, I’ll be applying population genetics and genomics techniques to better understand the degree of population mixing and breeding ground assignment of feeding humpback whales in the eastern North Pacific. In other words, I’ll be trying to determine where the humpback whales off the U.S. West Coast are migrating from, and at what frequency.

Earlier this month I joined the GEMM Lab members in attending the Northwest Student Society of Marine Mammalogy Conference in Seattle. We made the trip up to the University of Washington to present our work to our peers from across the Pacific Northwest. All five GEMM Lab graduate students, GEMM Lab intern Acacia Pepper, and I gave talks presenting our research. I presented preliminary results on the population structure of humpback whales across feeding habitat shared by multiple breeding groups in the eastern North Pacific, based on mitochondrial DNA haplotype frequencies. In the end, the GEMM Lab’s Dawn Barlow took home the “Best Oral Presentation” prize. Way to go Dawn!

A few of the GEMM lab members and me presenting our research at the NWSSMM conference in May 2019 at the University of Washington.

While conferences always have a strong networking component, this one feels unique. It is a chance to network with peers who are working through the same challenges in graduate school and who will hopefully be our future collaborators in marine mammal research when we finish our degrees. It’s also one of the few groups of people who understand the challenges of studying marine mammals. Not every day is full of dolphins and rainbows; for me, it’s mostly lab work or writing code to overcome small and/or patchy sample size problems.

All of the CCGL and GEMM Lab members excited to hear Dr. Sylvia Earle’s presentation at Portland State University in May 2019 (from L to R: Karen L., Lisa H., Alexa K., Leila L., Dawn B., and Dom K.) . Photo Source: Alexa Kownacki

On the way back from Seattle we stopped in Portland to hear the one and only Dr. Sylvia Earle talk. With 27 honorary doctorates and over 200 publications, Dr. Sylvia Earle is a legend in marine science. Hearing a distinguished marine researcher talk about her journey in research and present such an inspiring message of ocean advocacy was a great way to end our weekend away from normal grad school responsibilities. While the entirety of her talk was moving, one of her final comments really stood out. Near the end she called the audience to action by saying, “Look at your abilities and have confidence that you can and must make a difference. Do whatever you’ve got.” As a first-year graduate student trying to figure out my path forward in research and conservation, I couldn’t think of better advice to end the weekend on.

 

Photogrammetry Insights

By Leila Lemos, PhD Candidate, Fisheries and Wildlife Department, Oregon State University

After three years of fieldwork and analyzing a large dataset, it is time to finally start compiling the results, create plots and see what the trends are. The first dataset I am analyzing is the photogrammetry data (more on our photogrammetry method here), which so far has been full of unexpected results.

Our first big expectation was to find a noticeable intra-year variation. Gray whales spend their winter in the warm waters of Baja California, Mexico, a period during which they fast. In the spring, they undertake a long migration to higher latitudes. Only when they reach their summer feeding grounds, which extend from Northern California to the Bering and Chukchi Seas off Alaska, do they start feeding and gaining enough calories to support their migration back to Mexico and the subsequent fasting period.

 

Northeastern gray whale migration route along the NE Pacific Ocean.
Source: https://journeynorth.org/tm/gwhale/annual/map.html

 

Thus, we expected to see whales arriving along the Oregon coast in skinny body condition that would gradually improve over the months of the feeding season. Some exceptions are reasonable, such as a lactating mother or a debilitated individual. However, datasets are often more complex than we expect, and many variables can influence the results. Our photogrammetry dataset is no different!

In addition, I need to decide which plots best display the results and how to make them. For years now I’ve been hearing about the wonders of R, but I’ve been skeptical about learning a whole new programming/coding language “just to make plots”, as I first thought of it. I have always used statistical programs such as SPSS or Prism for my plots, and they were easy to work with. However, there is a lot more we can do in R than “just plots”. Also, just because something seems hard doesn’t mean you shouldn’t try. We need to push ourselves outside our comfort zones sometimes. So, I decided to give it a try (and I am proud of myself that I did), and here are some of the results:

 

Plot 1: Body Area Index (BAI) vs Day of the Year (DOY)

 

In this plot, we wanted to assess the annual Body Area Index (BAI) trends. BAI describes how skinny (lower number) or fat (higher number) a whale is; it is a simplified version of the BMI (Body Mass Index) used for humans. If you are interested in this method, which we developed at our lab in collaboration with the Aerial Information Systems Laboratory/OSU, you can read more about it in our publication.

The plots above are three versions of the same data displayed in different ways. The first plot, on the left, shows all the data points by year, with polynomial best-fit lines and confidence intervals (in gray). There are many overlapping observation points, so for the middle plot I tried to clean things up by reducing the size of the points and removing the gray confidence interval range around the lines. In the last plot, on the right, I used a linear regression best-fit line instead of a polynomial one.

We can see a general trend that the BAI was considerably higher in 2016 (red line), when compared to the following years, which makes us question the accuracy of the dataset for that year. In 2016, we also didn’t sample in the month of July, which is causing the 2016 polynomial line to show a sharp decrease in this month (DOY: ~200-230). But it is also interesting to note that the increasing slope of the linear regression line in all three years is very similar, indicating that the whales gained weight at about the same rate in all years.

 

Plot 2: Body Area Index (BAI) vs Body Condition Score (BCS)

 

In addition to the photogrammetry method of assessing whale body condition, we have also applied a body condition scoring method to all the photos we have taken in the field (based on the method described by Bradford et al. 2012). Thus, with this second set of plots, we wanted to compare the two methods of assessing whale body condition, in order to evaluate when the methods agree or not, and which method works best in which situation. Our hypothesis was that whales with a ‘fair’ body condition would have a lower BAI than whales with a ‘good’ body condition.

The plots above illustrate two versions of the same data, grouped by year on the left and by month on the right. In general, we see that no whales were observed with a poor body condition in the last months analyzed (August to October), with both methods agreeing on this. Additionally, many whales still had a fair body condition in August and September, but fewer in October, indicating that most whales gained weight over the foraging season and were ready to start their southbound migration and another fasting period. This result provides important information for monitoring and conservation.

However, the 2016 dataset is still a concern, since the whales appear to have considerably higher body condition (BAI) compared to other years.

 

Plot 3: Temporal Body Area Index (BAI) for individual whales

 

In this last group of plots, we wanted to visualize BAI trends over the season (using day of year, DOY, on the x-axis) for individuals we measured more than once. Here we can see the temporal patterns for the whales “Bit”, “Clouds”, “Pearl”, “Scarback”, “Pointy”, and “White Hole”.

We expected to see an overall gradual increase in body condition (BAI) over the seasons, such as what we can observe for Pointy in 2018. However, some whales decreased their condition, such as Bit in 2018. Could this trend be accurate? Furthermore, what about BAI measurements that are different from the trend, such as Scarback in 2017, where the last observation point shows a lower BAI than past observation points? In addition, we still observe a high BAI in 2016 at this individual level, when compared to the other years.

My next step will be to check the whole dataset again and search for inconsistencies. Something may be making these 2016 values wrong, and I need to find out what it is. The measured photogrammetry images were overall good quality and in focus, but other variables could be influencing the quality and accuracy of the measurements.

For instance, when measuring images, I often struggled with glare, water splash, water turbidity, ocean swell, and shadows, as you can see in the photos below. All of these variables made the borders of the whale body hard to clearly see and identify, which may have led to inaccurate measurements.

 

Examples of bad conditions for performing photogrammetry: (1) glare and water splash, (2) water turbidity, (3) ocean swell, and (4) a shadow cast on one side of the whale body.
Source: GEMM Lab. Taken under NMFS permit 16111 issued to John Calambokidis.

 

Thus, I will need to check all of these variables to identify the causes for bad measurements and “clean the dataset”. Only after this process will I be able to make these plots again to look at the trends (which will be easy since I already have my R code written!). Then I’ll move on to my next hypothesis that the BAI of individual whales varied by demographics including sex, age and reproductive state.

To carry out robust science that produces results we can trust, we can’t simply collect data, perform a basic analysis, create plots, and believe everything we see. Data are often messy, especially when developing new methods like we have done here with drone-based photogrammetry and the BAI. So, I need to spend some important time checking my data for accuracy and examining confounding variables that might affect the dataset. Science can be challenging, whether interpreting data or learning a new coding language, but it is all worth it in the end when we produce results we know we can trust.

 

 

 

On learning to Code…

By Amanda Holdman, MSc student, Dept. Fisheries and Wildlife, OSU

I’ve never sworn so much in my life. I stared at a computer screen for hours trying to fix a bug in my script. The cause of the error escaped me, pushing me into a cycle of tension, self-loathing, and keyboard smashing.

The cause of the error? A typo in the filename.

When I finally fixed the error in my filename and my code ran perfectly, my mood quickly changed. I felt invincible, like I had just won the World Cup. I did a quick victory dance in my kitchen and high-fived my roommate, and then sat down and moved on to the next task that needed to be conquered with code. Just like that, programming has quickly become a drug that makes me come back for more despite the initial pain I endure.

I had never opened computer programming software until my first year of graduate school. Before then, Matlab was just the subject of a muttered complaint by my college engineering roommate. As a biology major, I blew it off as something (thank goodness!) I would never need to use. Needless to say, that set me up for a rude awakening just one year later.

The time has finally come for me to, *gulp*, learn how to code. I honestly think I went through all 5 stages of grief before I realized I was at the point where I could no longer put it off.

By now you are familiar with the GEMM Lab updating you with photos of our charismatic study species in our beautiful study areas. However, summer is over. My field work is complete, and I’m enrolled in my last course of my master’s career. So what does this mean? Winter. And with winter comes data analysis. So, instead of spending my days out on a boat in calm seas, watching humpbacks breach, or tagging along with Florence to watch gray whales forage along the Oregon coast, I’ve reached the point of my graduate career that we don’t often tell you about: Figuring out what story our data is telling us. This stage requires lots of coffee and patience.

However, in just two short weeks of learning how to code, I feel like I’ve climbed mountains. I tackle task after task, each allowing me to learn new things, revise old knowledge, and make it just a little bit closer to my goals. One of the most striking things about learning how to code is that it teaches you how to problem solve. It forces you to think in a strategic and conceptual way, and to be honest, I think I like it.

For example, this week I mapped the percentage of my harbor porpoise detections over tidal cycles. One of the most important factors explaining the distribution and behavior of coastal marine mammals is the tide. Tidal forces drive a number of primary and secondary oceanographic processes, like changes in water depth, salinity, temperature, and the speed and direction of currents. It’s often difficult to unravel which part of the tidal process is most influential for a species, due to the several covariates related to the change in tides, how inter-related those covariates are, and the elusive nature of the species (like the cryptic harbor porpoise). However, while the analysis is preliminary, if we map the acoustic detections of harbor porpoise over the tidal cycle, we can already start to see some interesting trends between the number of porpoise detections and the phases of the tide. Check it out!

Figure: Harbor porpoise acoustic detections (clicks) over the tidal cycle at reef site 3.

Now, I won’t promise that I’ll be an excellent coder by the end of the winter, but I think I might have a good chance at being able to mark the “proficient” box next to Matlab and R on my first job application. Yet, whatever your reason for learning code – whether you are an undergraduate hoping to get ahead for graduate school, a graduate student hoping to escape the inevitable (like me), or just someone who thinks getting a code to work properly is a fun game – my advice to you is this:

Google first. If that fails, take mental breaks. Revisit the problem later. Think through all possible sources of error. Ask around for help. Then, when you finally fix the bug or get the code to work the way you would like it to, throw a mini-party. After it’s all over, take a deep breath and go again. Remember, you are not alone!

Happy coding this winter GEMM Lab readers – and I wish you lots of celebratory dancing!