Mapmaking: Part 3

Posted on December 12, 2015 by Ryan

In the first two parts of this series, I introduced Lightroom, the Lightroom plugins LR/Transporter and FTP Publisher, and the programming languages AWK and R. With those tools, I organized my photos and got some of their metadata into a format that I can easily manipulate with R code.

After getting the photo information organized, I had a few more pieces of metadata to get together. In particular, I wanted to organize the map based on the taxonomy of the corals, and I wanted to include some information about the site of collection that wasn’t included in my sample metadata file. We are keeping this information in separate files, for a couple of reasons. Over the course of the project, multiple people have collected replicates of the same species of coral in different locations. Every time we collect a coral, we need to fill in a line of data in the sample metadata table. Right now, we have 57 columns in that table, meaning we have to manually fill in 57 pieces of information for each sample. On a whirlwind trip where we collect 50 samples, that adds up quickly to 2850 values, or 2850 opportunities to make a typo or some other error.

If any two columns in our table are highly repetitive and are dependent on each other, we should be able to allow the computer to fill one in based on the other. For example, we could create seven columns in the sample metadata file that detail each sample’s species, genus, family, order, phylogenetic clade, NCBI taxonomy ID number, and perhaps some published physiological data. However, all of these pieces of information are dependent on the first value: the species of coral sampled. If we collect the same species, say, Porites lobata, 25 times throughout the project, all the information associated with that species is going to be repeated again and again in our metadata sheet. However, if instead we create a single column in our sample metadata table for the species ID, we can then create a separate table for all the other information, with only one row per species. We cut down on the amount of manual data entry we have to do by 144 values for that species alone!* Not only does that save time; it helps to avoid errors. The same general principle applies to each site we’ve visited: certain values are consistent and prone to repetition and error, such as various scales of geographical information, measurements of water temperature and visibility, and locally relevant collaborators. So we created another table for ‘sites’. **
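For what it’s worth, the 144 can be checked with a couple of lines of R. This is only a back-of-the-envelope sketch: it assumes that the species name itself stays in the sample table as the lookup key, so that six of the seven columns move to the new table, and the variable names are made up for the occasion.

```r
# Back-of-the-envelope check of the savings for one species.
# Assumption: the species name stays in the sample table as the key,
# so 6 of the 7 taxonomy columns move to the separate 'species' table.
dependent_columns <- 6
times_collected   <- 25

entered_in_one_big_table <- dependent_columns * times_collected  # typed once per sample
entered_in_species_table <- dependent_columns * 1                # typed once per species

values_saved <- entered_in_one_big_table - entered_in_species_table
print(values_saved)
```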

Excerpt from 'species' metadata table

genus_species	genus	species	family	clade	TAXON_ID	NCBI_blast_name
Tubastrea coccinea	Tubastrea	coccinea	Dendrophyllidae	II	46700	stony corals
Turbinaria reniformis	Turbinaria	reniformis	Dendrophyllidae	II	1381352	stony corals
Porites astreoides	Porites	astreoides	Poritidae	III	104758	stony corals
Acropora palmata	Acropora	palmata	Acroporidae	VI	6131	stony corals
Pavona maldivensis	Pavona	maldivensis	Agaricidae	VII	1387077	stony corals
Herpolitha limax	Herpolitha	limax	Fungiidae	XI	371667	stony corals
Diploastrea heliopora	Diploastrea	heliopora	Diploastreidae	XV	214969	stony corals
Symphyllia erythraea	Symphyllia	erythraea	Lobophyllidae	XIX	1328287	stony corals
Heliopora coerulea	Heliopora	coerulea	Helioporaceae	Outgroup	86515	blue corals
Stylaster roseous	Stylaster	roseous	Stylasteridae	Outgroup	520406	stony corals

Excerpt from 'sites' metadata table

reef_name	date	reef_type	site_name	country	collected_by	relevant_collaborators	visibility
Big Vickie	20140728	Midshelf inshore reef	Lizard Island	Australia	Ryan McMinds	David Bourne, Katia Nicolet, Kathy Morrow, and many others at JCU, AIMS, and LIRS	12
Horseshoe	20140731	Midshelf inshore reef	Lizard Island	Australia	Ryan McMinds	David Bourne, Katia Nicolet, Kathy Morrow, and many others at JCU, AIMS, and LIRS	15
Al Fahal	20150311	Offshore reef	KAUST House Reefs	Saudi Arabia	Ryan McMinds, Jesse Zaneveld	Chris Voolstra, Maren Ziegler, Anna Roik, and many others at KAUST	Unknown
Far Flats	20150630	Fringing Reef	Lord Howe Island	Australia	Joe Pollock		15
Raffles Lighthouse	20150723	Inshore Reef	Singapore	Singapore	Jesse Zaneveld, Monica Medina	Danwei Huang	4.5
Trou d'Eau	20150817	Lagoon Patch Reef	Reunion West	France	Ryan McMinds, Amelia Foster, Jerome Payet	Le Club de Plongee Suwan Macha, Jean-Pascal Quod	10
LTER_1_Fringing	20151109	Fringing Reef	Moorea	French Polynesia	Ryan McMinds, Becky Vega Thurber	the Burkepile Lab	>35

Thus, after loading and processing the sample and photo metadata files as in the last post, I needed to load these two extra files and merge them with our sample table. This is almost trivial, using commands that are essentially in English:

# merge() matches rows using the column names that the two tables share
sites <- read.table('sites_metadata_file.txt',header=TRUE,sep='\t',quote="\"")
data <- merge(samples,sites)
species_data <- read.table('species_metadata_file.txt',header=TRUE,sep='\t',quote="\"")
data <- merge(data,species_data)

And we now have a fully expanded table.
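To make the merge step concrete, here is a minimal sketch on made-up tables. The species rows are borrowed from the excerpt above, but the sample IDs (S1 to S4) and the object names are hypothetical. It also shows one thing to watch for: by default merge() silently drops samples whose species isn’t in the lookup table, while all.x=TRUE keeps them and fills the gaps with NA.

```r
# Hypothetical sample table: four samples, one column of species names
toy_samples <- data.frame(
  sample_name   = c('S1', 'S2', 'S3', 'S4'),
  genus_species = c('Diploastrea heliopora', 'Acropora palmata',
                    'Diploastrea heliopora', 'Porites lobata'),
  stringsAsFactors = FALSE
)

# Lookup table with one row per species (values from the 'species' excerpt)
toy_species <- data.frame(
  genus_species = c('Diploastrea heliopora', 'Acropora palmata'),
  family        = c('Diploastreidae', 'Acroporidae'),
  clade         = c('XV', 'VI'),
  stringsAsFactors = FALSE
)

# The only shared column name is genus_species, so merge() joins on it
expanded <- merge(toy_samples, toy_species)

# Same join, but keep every sample even if its species has no row yet
expanded_all <- merge(toy_samples, toy_species, all.x = TRUE)

print(expanded)
print(expanded_all)
```

In the first result, S4 disappears because ‘Porites lobata’ has no row in the toy species table; in the second it stays, with NA for family and clade.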

A couple of commands are needed to account for empty values that are awaiting completion when we get the time:

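# Blank text cells are read in as '' rather than NA, so check for both.
no_collab <- is.na(data$relevant_collaborators) | data$relevant_collaborators == ''
data$relevant_collaborators[no_collab] <- 'many collaborators'
data$photo_name[is.na(data$photo_name)] <- 'no_image'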

These commands subset the table to just rows that had empty values for collaborators and photos, and assign to the subset a consistent and useful value. Empty collaborator cells aren’t accurate – we’ve gotten lots of help everywhere we’ve gone, and just haven’t pulled all the information from all the teams together yet! As for samples without images, I created a default image with the filename ‘no_image.jpg’ and uploaded it to the server as a stand-in.
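To see what that subsetting does, here is a sketch on a made-up two-row table modeled on the ‘sites’ excerpt. One assumption worth flagging: read.table() reads blank text cells as empty strings by default, so this sketch passes na.strings so that they arrive as NA. That argument and the object names are additions for illustration, not part of the commands above.

```r
# A tiny tab-separated table with one blank collaborators cell
toy_text <- "reef_name\trelevant_collaborators\tvisibility
Far Flats\t\t15
Raffles Lighthouse\tDanwei Huang\t4.5"

# na.strings = c('NA', '') tells read.table() to treat blank cells as missing
toy_sites <- read.table(text = toy_text, header = TRUE, sep = '\t',
                        quote = "\"", na.strings = c('NA', ''),
                        stringsAsFactors = FALSE)

# is.na() returns TRUE/FALSE for each row; inside [ ] it selects the TRUE rows,
# and the assignment overwrites only those rows
missing <- is.na(toy_sites$relevant_collaborators)
toy_sites$relevant_collaborators[missing] <- 'many collaborators'

print(toy_sites)
```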

Default image shown when a sample has no pictures.

Now I need to introduce the R package that I used to build my map: Leaflet for R. Leaflet is actually an extensive JavaScript library, but the R wrapper makes it convenient to integrate my data. The package allows considerable control of the map within R, but the final product can be saved as an HTML file that sources the online JavaScript libraries. Once it’s created, I just upload it to our webpage and direct you there!

Note that although I usually use R from the Terminal, it’s very convenient to use the application RStudio with this package, because you can see the product progress as it’s built, and then easily export it at the end.

To make my map more interesting, I took advantage of the fact that each marker on the Leaflet map can have a popup with its own arbitrary HTML-coded content. Thus, for each sample I integrated all my selected metadata into an organized graphical format. The potential uses for this are exciting to me; it means I could put more markers on the map, with tables, charts, interactive media, or lots of other things that can be specified with HTML. For now, though, I decided I wanted the popups to look like this, with just some organized text, links, and a photo:

Acanthastrea echinata: E13.20.Aca.echi.1.20151110

Site: LTER_1_Backreef

Date: 20151110

Country: French Polynesia

Collected by Ryan McMinds, Becky Vega Thurber with the help of the Burkepile Lab.

So, I wrote the HTML and then used R’s paste0() function to plug in the sample-specific data in between HTML strings.

data$html <- paste0('300px; overflow:auto;">',
'<div width="100%" style="clear:both;">',
'<p>',
'<a href="https://www.flickr.com/search/?text=GCMP%20AND%20',data$genus_species,'" target="_blank">',data$genus_species,'</a>: ',
'<a href="https://www.flickr.com/search/?text=',gsub('.','',data$sample_name,fixed=TRUE),'" target="_blank">',data$sample_name,'</a>',
'</p>',
'</div>',
'<div width="100%" style="float:left;clear:both;">',
'<img src="http://files.cgrb.oregonstate.edu/Thurber_Lab/GCMP/photos/sample_photos/processed/small/',data$photo_name,'.jpg" width="50%" style="float:left;">',
'<div width="50%" style="float:left; margin-left:10px; max-width:140px;">',
'Site: <a href="https://www.flickr.com/search/?text=GCMP%20AND%20',data$reef_name,'" target="_blank">',data$reef_name,'</a>',
'<p>Date: <a href="https://www.flickr.com/search/?text=GCMP%20AND%20',data$date,'" target="_blank">',data$date,'</a></p>',
'<p>Country: <a href="https://www.flickr.com/search/?text=GCMP%20AND%20',data$country,'" target="_blank">',data$country,'</a></p>',
'</div>',
'</div>',
'<div width="100%" style="float:left;">',
'<p>',
'Collected by <a href="https://www.flickr.com/search/?text=GCMP%20AND%20(',gsub(', ','%20OR%20',data$collected_by,fixed=TRUE),')" target="_blank">',data$collected_by,'</a>',
' with the help of ',data$relevant_collaborators,'.',
'</p>',
'</div>',
'<div style="clear:both;"></div>',
'</div>')

Yeesh! I hate HTML. Building the code inside an R function definitely makes it uglier, but hey, it works. If you want, we can go over that rat’s nest in more detail another time, but for now, the basics: I’ve created another column in our sample metadata table (data$html) that contains a unique string of HTML code on each row. The first inner <div> is a container for the first line of the popup, which contains the species name and sample name, each stitched into a link to its photos on Flickr. The <img> tag pastes together a source call to the sample’s photo on our server. The <div> that sits next to the image holds the metadata (site, date, and country), with links to all photos associated with that metadata on Flickr. And the last <div> with content stitches together some text and links to acknowledge the people who worked to collect that particular sample. Looking at that code right now, I’m marveling at how much nicer it looks now that I’ve cleaned it up for presentation…
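If the big paste0() call is hard to follow, the trick it relies on can be shown on a much smaller scale. The sketch below uses a made-up two-row table (values borrowed from the ‘sites’ excerpt, object names hypothetical): paste0() repeats the fixed HTML strings for every row and glues them to that row’s values. The last step, which uses base R’s URLencode() to escape spaces in the link, is an addition for illustration and is not something the code above does.

```r
# Made-up two-row table
toy <- data.frame(
  reef_name = c('Big Vickie', 'Al Fahal'),
  country   = c('Australia', 'Saudi Arabia'),
  stringsAsFactors = FALSE
)

# paste0() recycles the fixed strings across rows,
# producing one HTML string per row of the table
toy$html <- paste0('<p>Site: ', toy$reef_name, '</p>',
                   '<p>Country: ', toy$country, '</p>')
print(toy$html)

# Optional extra: escape spaces and other unsafe characters before
# pasting a value into a URL (URLencode() comes with base R's utils package)
toy$link <- paste0('https://www.flickr.com/search/?text=',
                   sapply(toy$reef_name, URLencode, USE.NAMES = FALSE))
print(toy$link)
```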

And now that I’ve gotten all the metadata together and prepared the popups, the only thing left to do is create the map itself. However, I’ll leave that for just one more post in the series.

*math not thoroughly verified.

**edit: My father points out that we are essentially building a relational database of our metadata. In fact, I did initially intend to do that explicitly by loading these separate tables into a MySQL database. For now, however, our data isn’t all that complex or extensive, and separate tables that can be merged with simple R or Python code are working just fine. I’m sure someday we will return to a discussion of databases, but that day is not today.

Mapmaking: Part 2

Posted on December 11, 2015 by Ryan

No, you didn’t miss Mapmaking: Part 1. Before getting interrupted by last-minute extra fieldwork with the Waitt Foundation (which was awesome!), I gave an intro to photo management in Lightroom. Today I’ll expand on that, beginning a series of posts explaining how I created this map. On the way, I’ll introduce a little bit of…

*shudder*

coding.

Some really ugly code that I once wrote.

If you’ve been following my blog just to look at pretty beach pictures, I apologize. But I encourage you to keep reading. If any of the code makes you go cross-eyed, don’t worry; it does the same to me. I would love to field some questions in the comment section to make things clearer.

So. I have all of my photos keyworded to oblivion, and those keywords include sample IDs. How did I get them into my map? First, I needed to make sure I could link a given sample with its photos programmatically. I have a machine-readable metadata table that stores all our sample information, which we’ll be using later for data analysis. Metadata just refers to ‘extra’ information about the samples, and by machine-readable, I mean it’s stored in a format that is easy to parse with code. I used this table to build the map because it specifies GPS coordinates and provides things like the site name to fill in the pop-ups. But I didn’t have any photo filenames in this table, because it’s easier to organize the photos by tagging them with their sample IDs, like I explained last post. I simply needed to extract sample IDs from the photos’ keywords and add their filenames to my sample metadata table. And not by hand.

Excerpt from sample metadata table

sample_name	reef_name	date	time	genus_species	latitude	longitude
E1.3.Por.loba.1.20140724	Lagoon entrance	20140724	11:23	Porites lobata	-14.689414	145.468137
E1.19.Sym.sp.1.20140724	Lagoon entrance	20140724	11:26	Symphyllia sp	-14.689414	145.468137
E1.6.Acr.sp.1.20140726	Trawler	20140726	10:35	Acropora sp	-14.683931	145.466483
E1.15.Dip.heli.1.20140726	Trawler	20140726	10:38	Diploastrea heliopora	-14.683931	145.466483
E1.3.Por.loba.1.20140726	Trawler	20140726	10:41	Porites lobata	-14.683931	145.466483

A popup from the map on our webpage, displaying the sample ID, selected metadata information, and a photo.

To get started, I installed a Lightroom plugin called LR/Transporter. This plugin contains many functions for programmatically messing with photo metadata. Using it, I created a ‘title’ for all of my photos with a sequence of numbers in the order that they were taken. The first sample photo from the project was one that Katia took while I was working in Australia, and it’s now called ‘GCMP_sample_photo_1’. Katia and I also took 17 other photos that contained this same sample, incrementing up to ‘GCMP_sample_photo_18’. The last photo I have from the project is one from my last trip, to Mo’orea, and it now has the title ‘GCMP_sample_photo_3893’.

Then, I exported small versions of all my photos to a publicly accessible internet server that our lab uses for data. I did this with another Lightroom plugin called FTP Publisher, from the same company that made LR/Transporter. Each photo was uploaded to a specific folder and given a filename based on its new arbitrary title. Thus my first photo, GCMP_sample_photo_1, is now easily located at:

http://files.cgrb.oregonstate.edu/Thurber_Lab/GCMP/photos/sample_photos/processed/small/GCMP_sample_photo_1.jpg

Next, I used LR/Transporter to export a machine-readable file where the first item in every line is the new title of the photo, and the second item is a comma-separated list of all the photo’s keywords, which include sample IDs.

Excerpt from Lightroom photo metadata table

GCMP_sample_photo_1	E1.3.Por.loba.1.20140724, Fieldwork, GCMP Sample, ID by Ryan McMinds, Lagoon Entrance, Pacific Ocean
GCMP_sample_photo_2	E1.3.Por.loba.1.20140724, Fieldwork, GCMP Sample, ID by Ryan McMinds, Lagoon Entrance, Pacific Ocean, Ryan McMinds
GCMP_sample_photo_124	20140807, E1.5.Gal.astr.1.20140807, GCMP Sample, ID by Ryan McMinds, Pacific Ocean, Trawler Reef
GCMP_sample_photo_1051	Al Fahal, E4.3.Por.lute.1.20150311, GCMP Sample, ID by Ryan McMinds, KAUST, Red Sea
GCMP_sample_photo_3893	E13.Out.Mil.plat.1.20151111, GCMP Sample, Mo'orea

Now comes the fun part.

To associate each sample with a URL for one of its photos, I needed to search for its ID in the photo keywords and retrieve the corresponding photo titles, then paste one of these titles to the end of the server URL. The only way I know to do this automatically is by coding, or maybe in Excel if I were a wizard. I’ve learned how to code almost 100% through Google searches and trial-and-error, so when I write something, it’s a mashing-together of what I’ve learned so far, and it’s made for results, not beauty. The first programming language I learned that was good for parsing tables was AWK, because I do a lot of work in the shell on the Mac terminal. I thus tackled my problem with that language first, in an excellent example of an inefficient method to get results:

while read -r line; do
search=$(awk '{print $1}' <<< $line)
awk -v search=$search 'BEGIN {list=""}
$0 ~ search && list != "" {list = list","$1}
$0 ~ search && list == "" {list = $1}
END {print search"\t"list}' photo-metadata-file.txt
done < sample-metadata-file.txt > output-file.txt

Ew.

I’ve been issuing my AWK commands from within the shell, which is a completely separate programming language. For the life of me, I couldn’t remember how to use AWK to read two separate files simultaneously while I was writing this code. I know I’ve done it before, but I couldn’t find any old scripts with examples, and rather than re-learn the efficient, correct way, I mashed together commands from two different languages. I then decided I needed to go back and do it the right way, so I rewrote the code entirely in AWK. That code snippet isn’t very long, but it took a lot of re-learning for me to figure it out. So it was about a week or so before I realized that since my map-making had to occur in yet another language (called R), it was ridiculous for me to be messing with AWK in the first place…

So I came to my senses and started over.

In R, I simply import the two tables, like so:

# Both files are tab-separated; only the sample table has a header row
samples <- read.table('sample-metadata-file.txt',header=TRUE,sep='\t',fill=TRUE,quote="\"")
photo_data <- read.table('photo-metadata-file.txt',header=FALSE,sep='\t',quote="\"")

Then I use a process similar to the AWK one to create a new column of photo titles in the sample metadata table (this time I simply add the first photo instead of the whole list):

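# For each sample, take the title of the first photo whose keywords contain its ID (NA if none); fixed=TRUE matches the ID literally
samples$photo_name <- as.character(sapply(samples$sample_name, function(x) { photo_data[grep(x,photo_data[,2],fixed=TRUE)[1],1] }))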

And now, I have a single table that tells me the coordinates, metadata, and photo titles of each sample. With this, I can make the map, with one point drawn for each line in the table. I’ll continue explaining this process in another post.

Excerpt from sample metadata table

sample_name	reef_name	date	time	genus_species	latitude	longitude	photo_name
E1.3.Por.loba.1.20140724	Lagoon entrance	20140724	11:23	Porites lobata	-14.689414	145.468137	GCMP_sample_photo_1
E1.19.Sym.sp.1.20140724	Lagoon entrance	20140724	11:26	Symphyllia sp	-14.689414	145.468137	GCMP_sample_photo_17
E1.6.Acr.sp.1.20140726	Trawler	20140726	10:35	Acropora sp	-14.683931	145.466483	GCMP_sample_photo_37
E1.15.Dip.heli.1.20140726	Trawler	20140726	10:38	Diploastrea heliopora	-14.683931	145.466483	GCMP_sample_photo_37
E1.3.Por.loba.1.20140726	Trawler	20140726	10:41	Porites lobata	-14.683931	145.466483	GCMP_sample_photo_40
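Here is the same lookup on a toy scale, followed by the step of pasting the title onto the server URL. It is only a sketch: the photo rows are trimmed versions of the excerpt above, the third sample deliberately has no photo in the toy table, and the names toy_samples, toy_photos, first_photo, and base_url are made up for illustration.

```r
# Toy versions of the two tables
toy_samples <- data.frame(
  sample_name = c('E1.3.Por.loba.1.20140724',
                  'E4.3.Por.lute.1.20150311',
                  'E1.6.Acr.sp.1.20140726'),
  stringsAsFactors = FALSE
)
toy_photos <- data.frame(
  title    = c('GCMP_sample_photo_1', 'GCMP_sample_photo_2', 'GCMP_sample_photo_1051'),
  keywords = c('E1.3.Por.loba.1.20140724, Fieldwork, GCMP Sample',
               'E1.3.Por.loba.1.20140724, Fieldwork, GCMP Sample, Ryan McMinds',
               'Al Fahal, E4.3.Por.lute.1.20150311, GCMP Sample, Red Sea'),
  stringsAsFactors = FALSE
)

# Return the title of the first photo whose keyword list contains the ID
first_photo <- function(id) {
  hits <- grep(id, toy_photos$keywords, fixed = TRUE)  # row numbers that match
  toy_photos$title[hits[1]]                            # hits[1] is NA if nothing matched
}
toy_samples$photo_name <- sapply(toy_samples$sample_name, first_photo, USE.NAMES = FALSE)

# Paste each title onto the server folder to get the photo's URL
base_url <- 'http://files.cgrb.oregonstate.edu/Thurber_Lab/GCMP/photos/sample_photos/processed/small/'
toy_samples$photo_url <- ifelse(is.na(toy_samples$photo_name), NA,
                                paste0(base_url, toy_samples$photo_name, '.jpg'))

print(toy_samples)
```

The first sample matches two photos and gets the first one; the third matches none in this toy table and is left as NA, which is the kind of gap that a default image can fill in later.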

By the way, I am working on translating my blog into Spanish and French, to make it more accessible and just to help myself learn. If you want to help me, you can find the active translation of this post and others on Duolingo. Thanks!

The Cnidae Gritty

Searching for details in the study of coral microbiology

Category Archives: Coding
