{"id":356,"date":"2021-09-28T21:57:59","date_gmt":"2021-09-28T21:57:59","guid":{"rendered":"https:\/\/blogs.oregonstate.edu\/earthmotes\/?p=356"},"modified":"2021-09-30T16:38:52","modified_gmt":"2021-09-30T16:38:52","slug":"dada2-pipeline-for-16s-datasets-in-r","status":"publish","type":"post","link":"https:\/\/blogs.oregonstate.edu\/earthmotes\/2021\/09\/28\/dada2-pipeline-for-16s-datasets-in-r\/","title":{"rendered":"DADA2 Pipeline for 16S datasets in R"},"content":{"rendered":"\n<p>By Kelly Shannon<\/p>\n\n\n\n<p>#This pipeline is adapted from the benjjneb dada2 pipeline found on GitHub:&nbsp;<a href=\"https:\/\/benjjneb.github.io\/dada2\/tutorial.html\">https:\/\/benjjneb.github.io\/dada2\/tutorial.html<\/a>&nbsp;and is based on Callahan et al., 2016, \u201cDADA2: High-resolution sample inference from Illumina amplicon data,\u201d Nature Methods. It is also adapted from a pipeline created by MK English in Dr.&nbsp;Ryan Mueller\u2019s lab at OSU. #The pipeline will be demonstrated with 14 total paired-end fastq files (7 individual samples) as an example walkthrough. These are V4 hypervariable region amplicon sequences generated by the Oregon State University CGRB on an Illumina MiSeq NGS sequencer.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>#Packages to load: I like to go ahead and load all of the packages I think that I will need before I start. If you do not yet have a package in your library, the command to install it is &lt;install.packages(\u201cname_of_package\u201d)&gt;. 
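<\/p>\n\n\n\n<p>One caveat: several of the packages below (dada2, phyloseq, Biostrings, microbiome) live on Bioconductor rather than CRAN, so they are installed through BiocManager instead. A minimal sketch (the package list here is just the Bioconductor subset used later in this pipeline):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#Install BiocManager from CRAN if it is not already present:\nif (!requireNamespace(\"BiocManager\", quietly = TRUE)) install.packages(\"BiocManager\")\n#Then use it to install the Bioconductor packages:\nBiocManager::install(c(\"dada2\", \"phyloseq\", \"Biostrings\", \"microbiome\"))<\/code><\/pre>\n\n\n\n<p>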
You may not need all of these, but it doesn\u2019t hurt to load them regardless in case you do, since they are all commonly used packages that complement dada2.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(BiocManager)\nlibrary(dplyr)\nlibrary(scales)\nlibrary(grid)\nlibrary(reshape2)\nlibrary(dada2)\nlibrary(tidyr)\nlibrary(stringr)\nlibrary(phyloseq); packageVersion(\"phyloseq\")\nlibrary(Biostrings); packageVersion(\"Biostrings\")\nlibrary(ggplot2); packageVersion(\"ggplot2\")\nlibrary(vegan)\nlibrary(lessR)\nlibrary(ape)\nlibrary(microbiome)<\/code><\/pre>\n\n\n\n<p>#First things first, we want to set a path for R to find our paired-end reads, which are the fastq sequences that were received from the sequencing facility. This is the path to wherever they are located on your computer, and it can easily be obtained by moving into the directory that contains them on the command line and copying the output of&nbsp;&lt;pwd&gt;. What I like to do is make a directory that contains only my fastq files of interest; that way the path will display only those files. We can store this path in a vector.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>path_to_files &lt;- \"\/Users\/kellyshannon\/BIOINFORMATICS\/MiSeq_Beavers\/paired_reads\"\nlist.files(path_to_files)<\/code><\/pre>\n\n\n\n<p>The output from this should show your fastq files, and maybe other directories that you have within this directory; that\u2019s fine too. Your files should end in \u201c.fastq\u201d.<\/p>\n\n\n\n<p>#We now want to create individual vectors containing just the forward and reverse reads with the &lt;sort()&gt; command. The pattern we put into this can really be anything, but it MUST match the common wording of your forward and reverse reads. 
For example, I renamed mine before running this pipeline so that each forward read follows the pattern \u201csamplename_F.fastq\u201d, where the sample name is unique to each sample, but they all end in \u201cF.fastq\u201d. Therefore, the pattern will be \u201cF.fastq\u201d; same deal for the reverse reads.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>forward_seqs &lt;- sort(list.files(path_to_files, pattern = \"_F.fastq\", full.names = TRUE))\nreverse_seqs &lt;- sort(list.files(path_to_files, pattern = \"_R.fastq\", full.names = TRUE))<\/code><\/pre>\n\n\n\n<p>#Now that we have vectors made for our forward and reverse reads, we can extract the sample names from them; again, this assumes the file names follow the structure \u201csamplename_X.fastq\u201d.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sample_names &lt;- sapply(strsplit(basename(forward_seqs), \"_\"), `&#091;`, 1)\nsample_names<\/code><\/pre>\n\n\n\n<p>The output of the vector should show only the names of your samples.<br><br>#We are now going to want to inspect the quality of our forward and reverse reads. Remember that this will display the Phred Score on the y-axis (the confidence in the identity of each nucleobase:&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Phred_quality_score\">https:\/\/en.wikipedia.org\/wiki\/Phred_quality_score<\/a>) and the length of the reads on the x-axis. Now, we can use DADA2 for this, but I think MultiQC does a much better job of displaying read quality, and gives lots of additional info. Assuming your reads have been demultiplexed by the sequencing center (they generally will be if they came from the OSU CGRB), they also may have come with their own FastQC reports, which can be fed directly into MultiQC to generate an overall report. If not, you can download FastQC and MultiQC individually and use them on your sequences (links below). 
Essentially, FastQC will generate a similar report to what we\u2019ll generate with DADA2, below, but then MultiQC can be used to compile all of your FastQC output into one beautiful file that can be visualized in your browser. It will also show you whether or not there are any adapters in your reads, which will need to be trimmed away.<\/p>\n\n\n\n<p>#<a href=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/\">https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/<\/a>&nbsp;#<a href=\"https:\/\/multiqc.info\/\">https:\/\/multiqc.info\/<\/a><\/p>\n\n\n\n<p>#We\u2019ll continue with the DADA2 way of visualizing read qualities because the pipeline really won\u2019t change whether you are using the DADA2 &lt;plotQualityProfile()&gt; command or MultiQC output. This will output the total number of reads in each fastq file, as well as showing us where the reads drop off in quality. In general, the forward reads will be much better than the reverse, meaning their quality will not drop off until toward the end of the read.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>plotQualityProfile(forward_seqs)\nplotQualityProfile(reverse_seqs)<\/code><\/pre>\n\n\n\n<p>The mean quality score of all of the reads in each file is shown by the green line, so this is the one to pay attention to. You will generally want to trim off the last little section of nucleotides where the avg. read quality drops off, which is why we analyze these plots.<\/p>\n\n\n\n<p>#First let\u2019s assign file names for what will be the filtered fastq.gz files; note that we will zip these so that they take up less space on our computer. 
The reason this is done is so that the output of the filtering command (coming up shortly) can be assigned to file names separate from the forward and reverse sequence vectors that we have already made.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>filt_forward_seqs &lt;- file.path(path_to_files, \"filtered\", paste0(sample_names, \"_F_filt.fastq.gz\"))\nfilt_reverse_seqs &lt;- file.path(path_to_files, \"filtered\", paste0(sample_names, \"_R_filt.fastq.gz\"))\nnames(filt_forward_seqs) &lt;- sample_names\nnames(filt_reverse_seqs) &lt;- sample_names<\/code><\/pre>\n\n\n\n<p>That last step is so that we can still apply our sample names to the filtered reads.<\/p>\n\n\n\n<p>#Okay, now it\u2019s time to actually perform the filtering of our reads. We do this with the &lt;filterAndTrim()&gt; command. Now, there are tons of parameters that can be specified with this command, and they need to be precise for your samples, or else you may either drop too many reads (the most likely outcome) and get too few eventual ASVs, or keep too many reads and get spurious ASVs (the less common outcome). I\u2019d highly recommend reading the documentation for this command (I know, scary, right?) so that you can tailor it to your experiment, but this does take some practice and know-how.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>filtered_reads &lt;- filterAndTrim(forward_seqs, filt_forward_seqs, reverse_seqs, filt_reverse_seqs, \ntruncLen = c(250,140),\nmaxN = 0, #Discards reads with any ambiguous base calls. \nmaxEE = c(2,2), #Sets the maximum number of expected errors allowed in a read; increasing this value lets more reads through. \ntruncQ = 2, #Truncates reads at the first instance of a quality score less than or equal to the set value. \nrm.phix = TRUE, #Removes any phiX phage genomes, which are often used by sequencing facilities as controls. 
\ncompress = TRUE, multithread = TRUE)<\/code><\/pre>\n\n\n\n<p>A couple notes: the truncLen parameter should be set to where in your forward and reverse reads the avg. base quality steeply drops. Depending on how strict you want to be, this means where the reads drop below ~25 on the y-axis (~99.7% confidence per nucleobase). For my values, my forward reads maintained high Phred Scores (above 30) all the way through, so I truncated at 250 (where V4 reads should end). My reverse reads are of much lower quality, which is typical, so I truncated these at a lower value, where the Phred Scores drop below about 30. The values of these parameters are going to be highly specific to your own analysis, but you can also play it safe and go with the default values listed on the official DADA2 pipeline.<\/p>\n\n\n\n<p>My reverse reads are pretty garbage, so I truncated those at 140 (the second value in truncLen above). REMEMBER: there must be at least a 20 nt overlap between your forward and reverse reads for them to merge into contigs; this won\u2019t be a problem for mine, since my untrimmed forward reads span nearly the full amplicon.<\/p>\n\n\n\n<p>We can view the output at the end so we can keep track of how many reads were filtered. This is a good way to learn how changing parameters influences the number of reads that come out.<\/p>\n\n\n\n<p>#It will speed things up down the line if we dereplicate our reads, meaning we collapse all identical reads.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>derep_forward &lt;- derepFastq(filt_forward_seqs, verbose = TRUE)\nderep_reverse &lt;- derepFastq(filt_reverse_seqs, verbose = TRUE)<\/code><\/pre>\n\n\n\n<p>#Now that we have filtered our reads, we want to learn the error rates of our dataset, which will be specific to the actual sequencing run. This command should therefore be run only on data that has all been sequenced on the same run (i.e.&nbsp;the same MiSeq run). 
This is looking for incorrect nucleotide substitutions introduced by sequencing errors, and will be performed on the forward and reverse reads separately. This command may take some time, depending on how many samples you have.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>errors_F &lt;- learnErrors(filt_forward_seqs, multithread = TRUE)\nerrors_R &lt;- learnErrors(filt_reverse_seqs, multithread = TRUE)<\/code><\/pre>\n\n\n\n<p>This will only use a subset of your reads to learn the error rates, so don\u2019t be alarmed if you don\u2019t see all of your reads being used.<\/p>\n\n\n\n<p>You can also visualize plots of your error rates with &lt;plotErrors()&gt;, which is a sanity check and should display a decrease in error frequency (y-axis) concurrent with the increase in per-base quality score (x-axis). This step is somewhat optional, but is a good check that the learnErrors command worked.<\/p>\n\n\n\n<p>#We can now apply the \u201ccore sample inference algorithm\u201d to the filtered and dereplicated reads with the learned error rates. This is a pretty complicated algorithm that uses the error rates gathered from the command above to quantify the rate at which an amplicon read is produced from a sample sequence as a function of sequence composition and quality (don\u2019t worry if you don\u2019t understand what this means, I don\u2019t really either). The number of repeated observations of the sample sequence is used to calculate a p-value that the number of observed sequences is consistent with the error model. Low p-values mean that the number of sample sequences is too high to be explained by sequencing errors, or in simpler terms, these are \u201ctrue biological sequences\u201d rather than spurious ones. The default p-value threshold (OMEGA_A) is 1e-40, which is REALLY low, but is the value recommended by the pipeline. 
You can change this, but it is not necessarily recommended, because at the default you can be very sure that any reads that pass this p-value are true biological ones. Raising this threshold could potentially allow spurious reads to pass through (but again, this depends on how strict you want to be in your analysis).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>setDadaOpt(DETECT_SINGLETONS=TRUE, OMEGA_A=1e-30)\n\ndada_forward_seqs &lt;- dada(derep_forward, err=errors_F, multithread=TRUE, pool = \"pseudo\")\ndada_reverse_seqs &lt;- dada(derep_reverse, err=errors_R, multithread=TRUE, pool = \"pseudo\")<\/code><\/pre>\n\n\n\n<p>I\u2019ve changed up a couple major parameters above from the default pipeline. 1. I\u2019m keeping singletons because they make up such a large portion of my sequences, and I do not want to drop rare sequences. FAIR WARNING, this is controversial and not recommended by many because it is argued that singletons have higher chances of being erroneous than other sequences. I would like to maintain as many rare sequences as possible for my analysis so I am keeping them, but you may prefer to not take the risk and drop them. 2. I raised the OMEGA_A threshold from 1e-40 to 1e-30, which loosens the cutoff at which DADA2 calls sequences overabundant and uses them to seed ASV assignments of other similar sequences. This is also not necessarily recommended because it can increase the chances of false positives, but again I\u2019d like to try and get my rarer sequences through. 3. I set pool equal to \u201cpseudo\u201d. The default is pool = FALSE, which will be much less computationally intensive, and will treat each sample independently. Pooling compares samples and takes this information into account when denoising, which favors rare sequences that may be better represented in deeply sequenced samples than in samples with fewer sequences. Pseudo-pooling is sort of a hybrid between these two methods that is less computationally intensive than pool = TRUE. 
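<\/p>\n\n\n\n<p>As a quick sanity check on the error models feeding into dada(), the optional &lt;plotErrors()&gt; visualization mentioned earlier can be generated like so (nominalQ = TRUE overlays the error rates expected from the nominal Q scores):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#Points: observed error rates; black line: fitted error rates; red line: rates expected from the nominal Q score.\nplotErrors(errors_F, nominalQ = TRUE)\nplotErrors(errors_R, nominalQ = TRUE)<\/code><\/pre>\n\n\n\n<p>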
Phew, we got through that explanation.<\/p>\n\n\n\n<p>#Now we are ready to merge our forward and reverse reads and make contigs! We\u2019ll also make a sequence table to display our formed contigs.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>contigs &lt;- mergePairs(dada_forward_seqs, derep_forward, dada_reverse_seqs, derep_reverse, verbose=TRUE)\nseqtable &lt;- makeSequenceTable(contigs)<\/code><\/pre>\n\n\n\n<p>#Now we\u2019ll remove chimeric sequences from our contigs and track the number of sequences we get at each contig size, as well as the number that we lose to chimeras.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>seqtable_nochim &lt;- removeBimeraDenovo(seqtable, method = \"consensus\",  multithread=TRUE, verbose=TRUE)\ndim(seqtable_nochim)\ntable(nchar(getSequences(seqtable)))<\/code><\/pre>\n\n\n\n<p>Find the percentage of chimeric reads by comparing the total read counts before and after chimera removal. This percentage is generally below 1%; if it is much higher than this, then spurious reads may have escaped through any one of the above filtering steps in the pipeline.<\/p>\n\n\n\n<p>#Since these are 16S V4 amplicons, they should be around 250-255-ish base pairs in length. If there are sequences outside of this general region, as shown by the table() command, then they can be trimmed away. You can see how many unique sequences you have after all of this too.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>seqtable_nochim.2 &lt;- seqtable_nochim&#091;,nchar(colnames(seqtable_nochim)) %in% 250:257]\nseqtable_nochim &lt;- seqtable_nochim.2\ntable(nchar(getSequences(seqtable_nochim)))\ndim(seqtable_nochim)<\/code><\/pre>\n\n\n\n<p>Note that 250-257 is not a set-in-stone range of values. It\u2019s possible that your sequences may be outside of this range, and this does not necessarily mean that they aren\u2019t true biological sequences (all kinds of things could cause this from the actual sequencing). 
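<\/p>\n\n\n\n<p>As a quick aside, the chimera percentage mentioned above can be computed directly from the two sequence tables; this one-liner is just a sketch using the objects defined in this pipeline, and should be run right after the removeBimeraDenovo() step:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#Fraction of total reads flagged as chimeric and removed:\n1 - sum(seqtable_nochim)\/sum(seqtable)<\/code><\/pre>\n\n\n\n<p>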
However, for my data, this is the range that all of my contigs fall into.<\/p>\n\n\n\n<p>If you were to view the chimera-removed sequence table, you would see the number of reads, in each sample, attributed to each unique sequence, where the columns are the full ~250 bp sequences. The sequences in each of these columns are the \u201cASVs\u201d, which we will soon assign taxonomy to by aligning them to a reference database.<\/p>\n\n\n\n<p>#Now this part is important. We want to know exactly how many reads, in each sample, were dropped at each step we have gone through, which we can track with the following commands\u2026<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>getN &lt;- function(x) sum(getUniques(x))\ntrack_reads &lt;- cbind(filtered_reads, sapply(dada_forward_seqs, getN), sapply(dada_reverse_seqs, getN), sapply(contigs, getN), rowSums(seqtable_nochim))\ncolnames(track_reads) &lt;- c(\"input\", \"filtered\", \"denoisedF\", \"denoisedR\", \"merged\", \"nonchim\")\nrownames(track_reads) &lt;- sample_names<\/code><\/pre>\n\n\n\n<p>#Now we can add taxonomic classifications to each ASV by aligning our unique sequences to a reference database. There are several different databases out there that we can choose from, but we will use Silva. The trainset that we will align to can be obtained here:&nbsp;<a href=\"https:\/\/benjjneb.github.io\/dada2\/training.html\">https:\/\/benjjneb.github.io\/dada2\/training.html<\/a>. As you can see on the site, there are a couple different ones to choose from, but the one we want is Silva version 138.1. Silva is great because it is constantly being updated. If a new one has come out, just use the newest version! #After downloading the most up-to-date version of Silva, you can move it into your directory of choice and we\u2019ll store its path in a vector. Note that this is assigning taxonomy only down to the genus level. 
This is adequate for my samples because, being remote sediment samples, they will be less represented in the database. Other samples, such as human microbiome samples, will have greater representation in the database, and you may have more luck assigning taxonomy to the species level. You may have noticed when you clicked the link that there is a separate trainset for species; this is the one that can be used for that purpose. #Both of these commands will take some time.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>silva_trainset &lt;- \"\/Users\/kellyshannon\/BIOINFORMATICS\/MiSeq_Beavers\/silva_nr99_v138.1_train_set.fa.gz\"\nfile.exists(silva_trainset)\ntaxonomy &lt;- assignTaxonomy(seqtable_nochim, silva_trainset, multithread = TRUE)<\/code><\/pre>\n\n\n\n<p>If the output of &lt;file.exists()&gt; is FALSE, then the path to the database is not correct.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>taxonomy_species &lt;- addSpecies(taxonomy, \"\/Users\/kellyshannon\/BIOINFORMATICS\/MiSeq_Beavers\/silva_species_assignment_v138.1.fa.gz\")<\/code><\/pre>\n\n\n\n<p>#I see this step of adding species as somewhat optional because, especially if you have environmental samples, the vast majority of your sequences will likely not get assigned to species.<\/p>\n\n\n\n<p>#Finally, we want to remove \u201cjunk\u201d sequences from our taxonomy table, such as Chloroplasts and Mitochondria. 
We\u2019ll use the dim() command throughout to keep track of how many sequences are removed.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>is.chloroplast &lt;- taxonomy&#091;,\"Order\"] %in% \"Chloroplast\"\nseqtable.nochloro &lt;- seqtable_nochim&#091;,!is.chloroplast]\ndim(seqtable.nochloro)\ntaxonomy.nochloro &lt;- taxonomy&#091;!is.chloroplast,]\ndim(taxonomy.nochloro)\n\nis.mitochondria &lt;- taxonomy.nochloro&#091;,\"Family\"] %in% \"Mitochondria\"\nseqtable.nomito &lt;- seqtable.nochloro&#091;,!is.mitochondria]\ntaxonomy.nomito &lt;- taxonomy.nochloro&#091;!is.mitochondria,]\ndim(seqtable.nomito)\ndim(seqtable.nochloro)\ndim(seqtable_nochim)\n\nseqtable_nochim &lt;- seqtable.nomito\ntaxonomy &lt;- taxonomy.nomito\nwrite.csv(taxonomy, \"~\/BIOINFORMATICS\/MiSeq_Beavers\/bp_taxonomy.csv\")<\/code><\/pre>\n\n\n\n<p>The output of these last three dim() commands should have increasingly large values, because the .nomito table will have both chloroplasts and mitochondria removed.<\/p>\n\n\n\n<p>We want to reset our sequence table to encompass the cleaned-up sequences, so we re-assign seqtable_nochim. Same deal for the taxonomy table.<\/p>\n\n\n\n<p>As a reminder, when brackets are placed after a table, they are in the format [rows,columns].<\/p>\n\n\n\n<p>The exclamation mark negates each logical vector (\u201cis.chloroplast\u201d and \u201cis.mitochondria\u201d), telling R to keep only the sequences that do not match it. These commands can also be used for any other sequences you hope to remove from your table. All you need to do is specify the taxonomic level (here \u201cOrder\u201d and \u201cFamily\u201d) and the name you wish to remove at that level (here \u201cChloroplast\u201d and \u201cMitochondria\u201d).<\/p>\n\n\n\n<p>Finally, we want to save this table so that we have access to it with the full-length sequences as rows. 
This is helpful if you feel you want to BLAST any of the sequences to ensure their taxonomic assignment is accurate.<\/p>\n\n\n\n<p>#The final step is to make these data tables easier to work with by replacing the full ~250 bp sequences with short \u201casv\u201d labels, so that instead of seeing a mess of sequences you will see asv1, asv2, etc.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(lessR)\nlibrary(S4Vectors)\nrows &lt;- nrow(taxonomy)\nasvs &lt;- lessR::to(\"asv\", rows)\nsample_asvs &lt;- seqtable_nochim\nflipped_sample_asvs &lt;- t(sample_asvs)\nsample_asvs &lt;- flipped_sample_asvs\nseqs &lt;- rownames(sample_asvs)\nlength(seqs)\nseqs &lt;- DNAStringSet(seqs)\nnames(seqs) &lt;- asvs\nlength(seqs)\nrownames(sample_asvs) &lt;- asvs\nrownames(taxonomy) &lt;- asvs<\/code><\/pre>\n\n\n\n<p>The length() of seqs should match the number of unique sequences (columns) in seqtable_nochim.<br><br>When you view the final matrix, the rows should be labeled with the \u201casv\u201d prefix instead of the full sequences.<br><br>#The \u201cDADA2\u201d section is now over, and we can now move on to analyzing and visualizing the data with the phyloseq and microbiome packages. You will want to install both of these and load them up. I will go through the basics here of generating a phyloseq object and plotting and analyzing alpha and beta diversity. For more in-depth information, you can check out the Microbiome package GitHub page here:&nbsp;<a href=\"https:\/\/microbiome.github.io\/tutorials\/\">https:\/\/microbiome.github.io\/tutorials\/<\/a><\/p>\n\n\n\n<p>#The required phyloseq components are a taxonomy matrix and an OTU\/ASV abundance matrix. We have already made both. It\u2019s also good to include a metadata table with sample information. 
We\u2019ll do this below, then make our phyloseq object.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Group &lt;- colnames(sample_asvs)\nSamples &lt;- c(\"ER\", \"RCb\", \"RCp\", \"RC2\", \"RC4\", \"RC4s\", \"WC\")\nMetadata &lt;- data.frame(Group = Group, SampleID = Samples)\nrownames(Metadata) &lt;- Group\nlibrary(phyloseq)\nphyloseq_object &lt;- phyloseq(otu_table(sample_asvs, taxa_are_rows = TRUE), sample_data(Metadata), tax_table(taxonomy))\nphyloseq_object<\/code><\/pre>\n\n\n\n<p>You may have other metadata you\u2019d like to add, which you can do by adding additional vectors to the dataframe.<\/p>\n\n\n\n<p>The rownames of the metadata must be the same as the column names of the OTU abundance table for graphing later on. This should be the sample sites in the same order.<\/p>\n\n\n\n<p>#Now we want to check the sequence abundance at all our sites and store it for visualization.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Sequences_per_sample &lt;- sample_sums(phyloseq_object)\nSequences_per_sample\nsample_richness &lt;- data.frame(Sequences = Sequences_per_sample, Sample_Site = Group)<\/code><\/pre>\n\n\n\n<p>This shows us the range in sequencing depth by sample site. This next step is optional, but it\u2019s good to rarefy your data so that it can be more accurately compared when analyzing alpha and beta diversity. 
Some argue not to do so, but it is a good way to normalize the data.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rarefied_microbes &lt;- rarefy_even_depth(phyloseq_object)\nlibrary(microbiome)\nalpha_table &lt;- microbiome::alpha(rarefied_microbes, index = \"all\")\n\nmeasurements &lt;- c(\"Observed\", \"Shannon\", \"Simpson\", \"Chao1\")\nplot_richness(rarefied_microbes, x = \"SampleID\", measures = measurements, color = \"Group\", nrow = 1) + \n  geom_point(size = 3) + theme_bw() + \n  theme(legend.position = \"none\", \n        axis.text.x = element_text(size = 10, angle = 90, vjust = 0.5, hjust = 1), \n        axis.text.y = element_text(size = 10)) + \n  labs(x = \"Beaver Pond\", y = \"Alpha Diversity Measure\")\n\nrarecurve(t(otu_table(rarefied_microbes)), step = 50, cex = .6, col = \"red\")\n\nrarefied_microbes_phyla &lt;- aggregate_taxa(rarefied_microbes, level = \"Phylum\")\nplot_heatmap(rarefied_microbes_phyla, method = \"NMDS\", distance = \"bray\", sample.label = \"SampleID\", taxa.label = \"Phylum\", title = \"Phylum-Level Richness by Site\", low = \"darkslategray\", high = \"darkorange2\")<\/code><\/pre>\n\n\n\n<p>This gives us a table of alpha diversity measurements, plots of four commonly used measurements, a rarefaction curve, and a heatmap of the phyla; note that you can change the taxonomic level with the taxa.label option on the heatmap.<\/p>\n\n\n\n<p>Now for beta diversity.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ordination &lt;- ordinate(rarefied_microbes, method = \"NMDS\")\nplot_ordination(rarefied_microbes, ordination, type = \"samples\", color = \"SampleID\") + geom_point(size = 5)<\/code><\/pre>\n\n\n\n<p>#Now we are done! There\u2019s tons more you can do with phyloseq and the microbiome package, but those are the basics. 
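<\/p>\n\n\n\n<p>One last convenience step worth considering: save the phyloseq object to disk so you can reload it in later sessions without rerunning the whole pipeline (the file name below is just an example):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#Save the phyloseq object for later sessions:\nsaveRDS(phyloseq_object, \"phyloseq_object.rds\")\n#Reload it later with:\nphyloseq_object &lt;- readRDS(\"phyloseq_object.rds\")<\/code><\/pre>\n\n\n\n<p>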
Now you have some results!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Kelly Shannon ###This pipeline is adapted from the benjjneb dada2 pipeline found on GitHub:&nbsp;https:\/\/benjjneb.github.io\/dada2\/tutorial.html&nbsp;and based on Callahan et al., 2016. DADA2: High Resolution Sample Inference from Illumina Amplicon Data. Nature Methods. As well as from one created by MK English in Dr.&nbsp;Ryan Mueller\u2019s lab at OSU. ##The pipeline will be performed with 14 total [&hellip;]<\/p>\n","protected":false},"author":11692,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-356","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"_links":{"self":[{"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/posts\/356","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/users\/11692"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/comments?post=356"}],"version-history":[{"count":4,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/posts\/356\/revisions"}],"predecessor-version":[{"id":360,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/posts\/356\/revisions\/360"}],"wp:attachment":[{"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/media?parent=356"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/categories?post=356"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/earthmotes\/wp-json\/wp\/v2\/ta
gs?post=356"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}