DADA2 Pipeline for 16S datasets in R

By Kelly Shannon

#This pipeline is adapted from the benjjneb DADA2 tutorial on GitHub (https://benjjneb.github.io/dada2/tutorial.html), based on Callahan et al., 2016, DADA2: High-resolution sample inference from Illumina amplicon data, Nature Methods, as well as from a pipeline created by MK English in Dr. Ryan Mueller’s lab at OSU.

#As an example walkthrough, the pipeline will be performed on 14 paired-end fastq files (7 individual samples). These are V4 hypervariable region amplicons sequenced by the Oregon State University CGRB on an Illumina MiSeq.

#Packages to load: I like to go ahead and load all of the packages I think I will need before I start. If you do not yet have a package in your library, the command to install it is <install.packages("name_of_package")>. You may not need all of these, but it doesn’t hurt to load them regardless in case you do, since they are all commonly used packages that complement dada2.

library(BiocManager)
library(dplyr)
library(scales)
library(grid)
library(reshape2)
library(dada2)
library(tidyr)
library(stringr)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
library(ggplot2); packageVersion("ggplot2")
library(vegan)
library(lessR)
library(ape)
library(microbiome)

#First things first, we want to set a path for R to find our paired-end reads, which are the fastq sequences that were received from the sequencing facility. This will be the same path as where they are located on your computer, and can easily be copied by moving into the directory where they are located on the command line and copy-pasting the output of pwd. What I like to do is make a directory that contains only my fastq files of interest; that way the path to them will only display those files. We can store this path in a vector.

path_to_files <- "/Users/kellyshannon/BIOINFORMATICS/MiSeq_Beavers/paired_reads"
list.files(path_to_files)

The output from this should show your fastq files, and maybe other directories that you have within this directory; that’s fine too. Your files should end in “.fastq”.

#We now want to create individual vectors containing just the forward and reverse reads with the <sort()> command. The pattern we put into this can really be anything, but it MUST match the common wording of your forward and reverse reads. For example, I had renamed mine before this pipeline and each forward read follows the pattern: “samplename_F.fastq”, where the sample name is unique to each sample, but they all end in “F.fastq”. Therefore, the pattern will be “F.fastq”; same deal for the reverse reads.

forward_seqs <- sort(list.files(path_to_files, pattern = "_F.fastq", full.names = TRUE))
reverse_seqs <- sort(list.files(path_to_files, pattern = "_R.fastq", full.names = TRUE))

#Now that we have vectors made for our forward and reverse reads, we can extract the sample names from them; again, this assumes the file names follow the structure “samplename_X.fastq”.

sample_names <- sapply(strsplit(basename(forward_seqs), "_"), `[`, 1)
sample_names

The output of the vector should show only the names of your samples.

#We are now going to want to inspect the quality of our forward and reverse reads. Remember that this will display the Phred score on the y-axis (the confidence in the identity of each nucleobase: https://en.wikipedia.org/wiki/Phred_quality_score) and the position along the reads on the x-axis. Now, we can use DADA2 for this, but I think MultiQC does a much better job of displaying read quality, and gives lots of additional info. Assuming your reads have been demultiplexed by the sequencing center (they generally will be if they come from the OSU CGRB), they may also have come with their own FastQC reports, which can be fed directly into MultiQC to generate an overall report. If not, you can download FastQC and MultiQC individually and use them on your sequences (links below). Essentially, FastQC will generate a similar report to what we’ll generate with DADA2, below, but MultiQC can then compile all of your FastQC output into one beautiful file that can be viewed in your browser. It will also show you whether or not there are any adapters in your reads, which would need to be trimmed away.

#https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ #https://multiqc.info/

#We’ll continue with the DADA2 way of visualizing read quality, because the pipeline won’t change whether you use the DADA2 <plotQualityProfile()> command or MultiQC output. This will output the total # of reads in each fastq file, as well as showing us where the reads drop off in quality. In general, the forward reads will be much better than the reverse, meaning their quality will not drop off until toward the end of the reads.

plotQualityProfile(forward_seqs)
plotQualityProfile(reverse_seqs)

The mean quality score of all of the reads in each file is shown by the green line, so this is the one to pay attention to. You will generally want to trim off the last little section of nucleotides where the avg. read quality drops off, which is why we analyze these plots.
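If you have a lot of samples, plotting every file at once can be slow and cluttered. A lighter-weight check, in the style of the official tutorial, is to plot just the first couple of files:

plotQualityProfile(forward_seqs[1:2]) #plot only the first two forward-read profiles as a quick check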

#First, let’s assign file paths for what will be the filtered fastq.gz files; note that we will gzip these so that they take up less space on our computer. The reason this is done is so that the output of the filtering command (the next step) can be assigned to file names separate from the forward and reverse sequence vectors that we have already made.

filt_forward_seqs <- file.path(path_to_files, "filtered", paste0(sample_names, "_F_filt.fastq.gz"))
filt_reverse_seqs <- file.path(path_to_files, "filtered", paste0(sample_names, "_R_filt.fastq.gz"))
names(filt_forward_seqs) <- sample_names
names(filt_reverse_seqs) <- sample_names

That last step is so that we can still apply our sample names to the filtered reads.

#Okay, now it’s time to actually perform the filtering of our reads, which we do with the <filterAndTrim()> command. Now, there are tons of parameters that can be specified with this command, and they need to be tuned to your samples or else you may either drop too many reads (the most likely outcome) and end up with too few ASVs, or keep too many reads and get spurious ASVs (less common). I’d highly recommend reading the documentation of this command (I know, scary, right?) so that you can tailor it to your experiment, but this does take some practice and know-how.

filtered_reads <- filterAndTrim(forward_seqs, filt_forward_seqs, reverse_seqs, filt_reverse_seqs, 
truncLen = c(250,140), #Truncate forward reads at 250 nt and reverse reads at 140 nt, based on the quality profiles above.
maxN = 0, #This is getting rid of any ambiguous base calls. 
maxEE = c(2,2), #Setting the amount of expected errors in reads, increasing this value lets more reads through.  
truncQ = 2, #Truncates reads at the first sign of a quality score equal to the set value. 
rm.phix = TRUE, #Removes any phiX phage genomes, which are often used by sequencing facilities as controls. 
compress = TRUE, multithread = TRUE)

A couple notes: the truncLen parameter should be set to where the avg. base quality in your forward and reverse reads steeply drops. Depending on how strict you want to be, this means where the reads drop below ~25 on the y-axis (~99.7% accuracy per nucleobase). For my data, my forward reads maintained high Phred scores (above 30) all the way through, so I truncated at 250 (where V4 reads should end). My reverse reads are of much lower quality, which is typical, so I truncated these at a lower value, where the Phred scores drop below about 30. The values of these parameters are going to be highly specific to your own analysis, but you can also play it safe and go with the default values listed on the official DADA2 pipeline.

My reverse reads are pretty garbage, so I truncated those at 140. REMEMBER: there must be at least a 20 nt overlap between your forward and reverse reads for them to merge into contigs. This won’t be a problem here, since 250 + 140 = 390 nt comfortably spans the ~253 bp V4 amplicon, leaving well over 100 nt of overlap.

We can view the output at the end so we can keep track of how many reads were filtered. This is a good way to learn how changing parameters influences the number of output reads.
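For example, printing the returned matrix shows one row per sample, with the reads going in and the reads kept (the default column names are reads.in and reads.out):

head(filtered_reads) #each row is a sample: reads in vs. reads retained after filtering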

#It will speed things up down the line if we dereplicate our reads, meaning we collapse all identical reads into unique sequences with abundance counts.

derep_forward <- derepFastq(filt_forward_seqs, verbose = TRUE)
derep_reverse <- derepFastq(filt_reverse_seqs, verbose = TRUE)

#Now that we have filtered our reads, we want to learn the error rates of our dataset, which will be specific to the actual sequencing run. This command should therefore be run only on data that has all been sequenced on the same run (i.e. the same MiSeq run). This is looking for incorrect nucleotide substitutions as a result of sequencing errors, and will be performed on the forward and reverse reads separately. This command may take some time, depending on how many samples you have.

errors_F <- learnErrors(filt_forward_seqs, multithread = TRUE)
errors_R <- learnErrors(filt_reverse_seqs, multithread = TRUE)

This will only use a subset of your reads to learn the error rates, so don’t be alarmed if you don’t see all of your reads being used for learning the error rates.

You can also visualize plots of your error rates with <plotErrors()>, which is a sanity check and should display a decrease in error frequency (y-axis) concurrent with the increase in per-base quality score (x-axis). This step is somewhat optional, but is a good check that the learnErrors command worked.
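A minimal version of that check, as in the official tutorial; nominalQ = TRUE adds a red line showing the error rates expected under the nominal definition of the quality score:

plotErrors(errors_F, nominalQ = TRUE) #points = observed error rates, black line = fitted estimates
plotErrors(errors_R, nominalQ = TRUE)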

#We can now apply the “core sample inference algorithm” to the filtered and dereplicated reads, using the learned error rates. This is a pretty complicated algorithm that uses the error rates gathered from the command above to quantify the rate at which an amplicon read is produced from a sample sequence as a function of sequence composition and quality (don’t worry if you don’t fully understand what this means). The number of repeated observations of a sample sequence is used to calculate a p-value for whether that abundance is consistent with the error model. Low p-values mean the sequence was observed more often than sequencing errors alone can explain, or in simpler terms, these are “true biological sequences” rather than spurious ones. The default threshold (OMEGA_A) is 1e-40, which is REALLY low, but is the value recommended by the pipeline. You can change this, but it is not necessarily recommended, because at the default you can be very sure that any sequences passing it are true biological ones. Raising this value could potentially allow spurious reads to pass through (but again, this depends on how strict you want to be in your analysis).

setDadaOpt(DETECT_SINGLETONS=TRUE, OMEGA_A=1e-30)

dada_forward_seqs <- dada(derep_forward, err=errors_F, multithread=TRUE, pool = "pseudo")
dada_reverse_seqs <- dada(derep_reverse, err=errors_R, multithread=TRUE, pool = "pseudo")

I’ve changed a couple of major parameters above from the default pipeline. 1. I’m keeping singletons because they make up such a large portion of my sequences, and I do not want to drop rare sequences. FAIR WARNING, this is controversial and not recommended by many, because singletons arguably have a higher chance of being erroneous than other sequences. I would like to retain as many rare sequences as possible for my analysis, but you may prefer not to take the risk and drop them. 2. I raised the OMEGA_A threshold (from 1e-40 to 1e-30), which dictates how strong the abundance evidence must be before DADA2 calls a sequence a true variant and uses it to seed ASV assignments of other similar sequences. This is also not necessarily recommended because it can increase the chance of false positives, but again I’d like to try to get my rarer sequences through. 3. I set pool = "pseudo". The default is pool = FALSE, which is much less computationally intensive and treats each sample independently. Pooling (pool = TRUE) shares information across samples when denoising, which favors rare sequences that are better represented in more deeply sequenced samples. Pseudo-pooling is a hybrid of these two methods that is less computationally intensive than full pooling. Phew, we got through that explanation.
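As a quick sanity check on the denoising, printing the dada-class object for a single sample reports how many true sequence variants were inferred from how many unique input sequences:

dada_forward_seqs[[1]] #summary of denoising for the first sample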

#Now we are ready to merge our forward and reverse reads and make contigs! We’ll also make a sequence table to display our formed contigs.

contigs <- mergePairs(dada_forward_seqs, derep_forward, dada_reverse_seqs, derep_reverse, verbose=TRUE)
seqtable <- makeSequenceTable(contigs)

#Now we’ll remove chimeric sequences from our contigs and track the number of sequences at each contig length, as well as how many we lose to chimeras.

seqtable_nochim <- removeBimeraDenovo(seqtable, method = "consensus",  multithread=TRUE, verbose=TRUE)
dim(seqtable_nochim)
table(nchar(getSequences(seqtable)))

Find the percentage of chimeric reads by comparing total read counts before and after chimera removal. Chimeras usually account for only a small fraction of total reads; if the fraction removed is much higher than a few percent, spurious reads (e.g., leftover primer sequences) may have escaped through one of the filtering steps above.
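A minimal sketch of that calculation, comparing total read counts (not unique sequences) before and after chimera removal:

sum(seqtable_nochim) / sum(seqtable) #proportion of reads kept; 1 minus this is the chimeric fraction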

#Since these are 16S V4 amplicons, they should be around 250-255-ish base pairs in length. If there are sequences outside of this general region, as shown by the table() command, then they can be trimmed away. You can see how many unique sequences you have after all of this too.

seqtable_nochim.2 <- seqtable_nochim[,nchar(colnames(seqtable_nochim)) %in% 250:257]
seqtable_nochim <- seqtable_nochim.2
table(nchar(getSequences(seqtable_nochim)))
dim(seqtable_nochim)

Note that 250-255 is not a set-in-stone range of values. It’s possible that your sequences may be outside of this range and this does not necessarily mean that they aren’t true biological sequences (all kinds of things could cause this from the actual sequencing). However, for my data, this is the range that all of my contigs fall into.

If you were to view the chimera-free sequence table, you would see the number of reads, in each sample, attributed to each unique sequence, where the columns are the full ~250 bp sequences. The sequences in these columns are the “ASVs”, which we will soon assign taxonomy to by aligning them to a reference database.
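For example, peeking at a small corner of the table (assuming you have at least three samples and two ASVs) shows the counts per sample, with the full sequences as column names:

seqtable_nochim[1:3, 1:2] #first 3 samples x first 2 ASV columns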

#Now this part is important. We want to know exactly how many reads, in each sample, were dropped from each step we have gone through, which we can track with the following commands…

getN <- function(x) sum(getUniques(x))
track_reads <- cbind(filtered_reads, sapply(dada_forward_seqs, getN), sapply(dada_reverse_seqs, getN), sapply(contigs, getN), rowSums(seqtable_nochim))
colnames(track_reads) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track_reads) <- sample_names
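Printing the table shows one row per sample; ideally no single step wipes out a large majority of reads:

track_reads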

#Now we can add taxonomic classifications to each ASV by aligning our unique sequences to a reference database. There are several different databases to choose from, but we will use Silva. The trainset we will align to can be obtained here: https://benjjneb.github.io/dada2/training.html. As you can see on the site, there are a couple of different ones to choose from, but the one we want is Silva version 138.1. Silva is great because it is constantly being updated; if a new one has come out, just use the newest version!

#After downloading the most up-to-date version of Silva, move it into your directory of choice and we’ll assign a vector to its path. Note that this assigns taxonomy only down to the genus level. That is adequate for my samples because, being remote sediment samples, they will be less represented in the database. Other samples, such as human microbiome, will have greater representation in the database and you may have more luck assigning taxonomy to the species level. You may have noticed when you clicked the link that there is a separate trainset for species; this is the one that can be used for that purpose.

#Both of these commands will take some time.

silva_trainset <- "/Users/kellyshannon/BIOINFORMATICS/MiSeq_Beavers/silva_nr99_v138.1_train_set.fa.gz"
file.exists(silva_trainset)
taxonomy <- assignTaxonomy(seqtable_nochim, silva_trainset, multithread = TRUE)

If the output of <file.exists()> is FALSE, then the path to the database is not correct.

taxonomy_species <- addSpecies(taxonomy, "/Users/kellyshannon/BIOINFORMATICS/MiSeq_Beavers/silva_species_assignment_v138.1.fa.gz")

#I see this step of adding species as somewhat optional because, especially if you have environmental samples, the vast majority of your sequences will likely not get assigned to species.
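A quick way to inspect the assignments without the unwieldy sequence rownames (adapted from the official tutorial; taxa_print is just a throwaway copy):

taxa_print <- taxonomy_species
rownames(taxa_print) <- NULL #drop the sequence rownames for display only
head(taxa_print)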

#Finally, we want to remove “junk” sequences from our taxonomy table such as Chloroplasts and Mitochondria. We’ll use the dim() command throughout to keep track of how many sequences are removed.

is.chloroplast <- taxonomy[,"Order"] %in% "Chloroplast"
seqtable.nochloro <- seqtable_nochim[,!is.chloroplast]
dim(seqtable.nochloro)
taxonomy.nochloro <- taxonomy[!is.chloroplast,]
dim(taxonomy.nochloro)

is.mitochondria <- taxonomy.nochloro[,"Family"] %in% "Mitochondria"
seqtable.nomito <- seqtable.nochloro[,!is.mitochondria]
taxonomy.nomito <- taxonomy.nochloro[!is.mitochondria,]
dim(seqtable.nomito)
dim(seqtable.nochloro)
dim(seqtable_nochim)

seqtable_nochim <- seqtable.nomito
taxonomy <- taxonomy.nomito
write.csv(taxonomy, "~/BIOINFORMATICS/MiSeq_Beavers/bp_taxonomy.csv")

The output of these last three dim() commands should show increasingly large values, because the .nomito table has both chloroplasts and mitochondria removed, the .nochloro table only chloroplasts, and seqtable_nochim neither.

We want to reset our sequence table to encompass the cleaned-up sequences, so we re-assign seqtable_nochim. Same deal for the taxonomy table.

As a reminder, when brackets are placed after a table, they will be in the format [rows,columns].

The exclamation mark negates the logical vectors “is.mitochondria” and “is.chloroplast”, telling R to keep only the sequences that do NOT match them. These commands can be used for any other sequences you hope to remove from your table. All you need to do is specify the taxonomic level (“Order” for Chloroplast, “Family” for Mitochondria) and the name of the taxon you wish to remove at that level.
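For example, a hypothetical extra cleanup step removing anything assigned to Eukaryota at the Kingdom level would follow the exact same pattern:

is.eukaryote <- taxonomy[,"Kingdom"] %in% "Eukaryota" #flag ASVs classified as eukaryotic
seqtable_nochim <- seqtable_nochim[,!is.eukaryote]
taxonomy <- taxonomy[!is.eukaryote,]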

Finally, we want to save this table so that we have access to it with the full-length sequences as rows. This is helpful if you want to BLAST any of the sequences to ensure their taxonomic assignment is accurate.

#The final step is to make these data tables easier to work with by swapping the full ~250 bp sequences for “asv” labels, so that instead of seeing a mess of sequences you will see asv1, asv2, etc.

library(lessR)
library(Biostrings) #DNAStringSet() comes from Biostrings
rows <- nrow(taxonomy)
asvs <- lessR::to("asv", rows)
sample_asvs <- t(seqtable_nochim) #transpose so ASVs (sequences) are rows and samples are columns
seqs <- rownames(sample_asvs)
length(seqs)
seqs <- DNAStringSet(seqs)
names(seqs) <- asvs
rownames(sample_asvs) <- asvs
rownames(taxonomy) <- asvs

The length() of seqs should equal the number of ASV columns in seqtable_nochim (i.e., ncol(seqtable_nochim)).

When you view the final matrix, the rows should be labeled asv1, asv2, etc.
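If you do want to BLAST any ASVs (as mentioned above), one convenient option is writing the renamed sequences to a FASTA file with Biostrings; the file name here is just an example:

writeXStringSet(seqs, "~/BIOINFORMATICS/MiSeq_Beavers/asv_sequences.fasta") #one entry per ASV, named asv1, asv2, ...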

#The “DADA2” section is now over, and we can move on to analyzing and visualizing the data with the phyloseq and microbiome packages. You will want to install and load both. I will go through the basics here of generating a phyloseq object and plotting and analyzing alpha and beta diversity. For more in-depth information, you can check out the microbiome package tutorials here: https://microbiome.github.io/tutorials/

#The required phyloseq components are a taxonomy matrix and an OTU/ASV abundance matrix, which we have already made. It’s also good to include a metadata table with sample information. We’ll do this below, then make our phyloseq object.

Group <- rownames(seqtable_nochim)
Samples <- c("ER", "RCb", "RCp", "RC2", "RC4", "RC4s", "WC")
Metadata <- data.frame(Group = Group, SampleID = Samples)
rownames(Metadata) <- Group
library(phyloseq)
phyloseq_object <- phyloseq(otu_table(sample_asvs, taxa_are_rows = TRUE), sample_data(Metadata), tax_table(taxonomy))
phyloseq_object

You may have other metadata you’d like to add, which you can do so by adding additional vectors to the dataframe.

The rownames of the metadata must be the same as the column names of the OTU abundance table for graphing later on. This should be the sample sites in the same order.
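A one-line sanity check that the names line up (this should return TRUE):

all(rownames(Metadata) == colnames(sample_asvs))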

#Now we want to check the sequence abundance at all our sites and graph it for visualization.

Sequences_per_sample <- sample_sums(phyloseq_object)
Sequences_per_sample
sample_richness <- data.frame(Sequences = Sequences_per_sample, Sample_Site = Group)

This shows us the range in sequencing depth by sample site. This next step is optional, but it’s good to rarefy your data so that samples can be more fairly compared when analyzing alpha and beta diversity. Some argue against rarefying, but it is a straightforward way to normalize for uneven sequencing depth.

rarefied_microbes <- rarefy_even_depth(phyloseq_object, rngseed = 711) #setting rngseed makes the random subsampling reproducible
library(microbiome)
alpha_table <- microbiome::alpha(rarefied_microbes, index = "all")

measurements <- c("Observed", "Shannon", "Simpson", "Chao1")
plot_richness(rarefied_microbes, x = "SampleID", measures = measurements, color = "Group", nrow = 1) + 
  geom_point(size = 3) + theme_bw() + 
  theme(legend.position = "none", 
        axis.text.x = element_text(size = 10, angle = 90, vjust = 0.5, hjust = 1), 
        axis.text.y = element_text(size = 10)) + 
  labs(x = "Beaver Pond", y = "Alpha Diversity Measure")

rarecurve(t(otu_table(rarefied_microbes)), step = 50, cex = .6, col = "red")

rarefied_microbes_phyla <- aggregate_taxa(rarefied_microbes, level = "Phylum")
plot_heatmap(rarefied_microbes_phyla, method = "NMDS", distance = "bray", sample.label = "SampleID", taxa.label = "Phylum", title = "Phylum-Level Richness by Site", low = "darkslategray", high = "darkorange2")

This gives us a table of alpha diversity measurements, plots of four commonly used measures, a rarefaction curve, and a heatmap of the phyla; note that you can aggregate at a different taxonomic level by changing the level argument in aggregate_taxa() (and taxa.label to match).

Now for beta diversity.

ordination <- ordinate(rarefied_microbes, method = "NMDS")
plot_ordination(rarefied_microbes, ordination, type = "samples", color = "SampleID") + geom_point(size = 5)
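For NMDS specifically, it is worth checking the stress of the fit; <ordinate()> wraps vegan’s metaMDS() here, and stress values below roughly 0.2 are usually considered an acceptable 2-D representation:

ordination$stress #goodness-of-fit for the NMDS ordination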

#Now we are done! There’s tons more you can do with phyloseq and the microbiome package, but those are the basics. Now you have some results!
