16S MiSeq Analysis Tutorial Part 1: NMDS and Environmental Vectors

This tutorial aims to guide the user through a NMDS analysis of 16S abundance data using R, starting with a ‘sample x taxa’ distance matrix and corresponding metadata.

My final code here represents a wide conglomeration of script from previous Colwell Lab graduate students and my own invention.

Set Up your Environment

Load in the necessary libraries and set your working directory.

library(vegan)
library(ggplot2)
library(gridExtra)
library(grid)
library(ape)

setwd("Your working directory here")

Create and Visualize a Dissimilarity Matrix

Read in your abundance matrix and metadata, ensuring that row order is the same on the two datasets.

data <- read.table("sv-asv-abund.txt", header=T, row.names = 1, sep="\t")

#Rotate to match metadata if necessary:
#rownames(data) <- factor(rownames(data), levels = rownames(data))
#data <- as.data.frame(t(data[rowSums(data)!=0, ]))

metadata <- read.table("metadata.txt", header = T, row.names = 1, sep="\t")

After importing your abundance matrix as a ‘data’ object, apply the statistic Bray-Curtis Dissimilarity to quantify the dissimilarity of taxa counts at each site/sample, then convert to a PCoA object.

dist <- vegdist(data, "bray")
pcoa <- pcoa(dist)

Visualize the first ten principal components to identify principal component contribution to data variability.

barplot(pcoa$values$Relative_eig[1:10], names = paste ('PCoA, 1:10'), ylab = 'eigenvalues')

Plot the principal components to identify any data outliers. Any extreme outliers may be considered for removal in downstream analyses due to their effect on NMDS.

biplot.pcoa(pcoa)

NMDS Distance and Selection of Dimensionality

Visualize the dimensionality of your dissimilarity distance matrix by creating a scree plot in real time (kinda fun). A scree plot will show the eigenvlaues of your principal components. At the “elbow” of the plot, choose the number of dimensions your dataset posses (usually between 2 and 4).

NMDS.scree <- function(x) { #where x is the name of the data frame variable
plot(rep(1, 10), replicate(10, metaMDS(x, autotransform = F, k = 1)$stress),
xlim = c(1, 10),ylim = c(0, 0.30), xlab = "# of Dimensions", ylab = "Stress",
main = "NMDS stress plot")
for (i in 1:10) {
points(rep(i + 1,10),replicate(10, metaMDS(x, autotransform = F, k = i + 1)$stress))
}
}
NMDS.scree(dist)

The dimensional value you choose from the scree plot will be the ‘k’ value for your ordination. Now, preform an ordination using your Bray-Curtis dissimilarity distance matrix. It is crucial you set the seed for reproducibility and record the stress of your NMDS found in the output of ‘ord’ for future reference. Stress greater than 0.2 is incredibly suspect, while > 0.3 shows complete nonsensical relationships. Acceptable stress is < 0.2, with ideal stress < 0.1.

set.seed(2)
ord <- metaMDS(dist, k = 2, trymax = 100, trace = FALSE)
ord

Visualize the stress using a Shepard scatter plot. You should expect a linear or non-metric fit close to .90 or greater.

Addition of Environmental Variables

Now, identify correlations between your environmental/metadata and data shape. This will create vectors overlaid on the NMDS plot to show biogeochemical relevance to the taxa distance matrix.

env_fit <- envfit(ord$points, metadata, perm=999)
env_fit_df <- as.data.frame(env_fit$vectors$arrows*sqrt(env_fit$vectors$r))
env_fit_df$vector <- rownames(env_fit_df)
env_fit

Extract the scaling scores and sample names from your NMDS ordination and convert to a data frame.

data.scores <- as.data.frame(scores(ord))
data.scores$sample <- rownames(data.scores)

Add any information from your metadata table to the score dataframe. “Pond” in this example refers to the geographic location of the samples.

data.scores <- data.scores[order(data.scores$sample),]
data.scores$Pond <- metadata$pond

Clean up the score dataframe strings (make it nice and pretty).

data.scores <- as.data.frame(data.scores, stringsAsFactors = TRUE)
data.scores$sample <- factor(data.scores$sample, levels = data.scores$sample)

In addition to environmental data correlation, find any dispersion in the dissimilarity distance matrix (same as the one used for NMDS).

dist <- vegdist(data, method="bray")
disper_pond <- betadisper(dist, data.scores$Pond)
disper_pond

Visualization of NMDS

The moment we’ve all been waiting for! This will create an NMDS plot containing environmental vectors and ellipses showing significance based on NMDS groupings. Tweak away to create the NMDS of your dreams.

p.nmds <- ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2, fill=Pond)) +
geom_point(aes(), colour="black", pch=21, size = 3) +
xlab(paste0("NMDS1")) +
ylab(paste0("NMDS2")) +
stat_ellipse(aes(x = NMDS1, y = NMDS2, color = Pond), level = 0.05) +
coord_fixed() +
geom_segment(data=env_fit_df, aes(x=0, xend=env_fit_df$MDS1, y=0, yend=env_fit_df$MDS2), inherit.aes = FALSE,
arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
theme_linedraw(base_size = 18) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
strip.text = element_text(face = "bold")) +
theme(text=element_text(size = 20)) +
geom_text(data=env_fit_df, aes(x=env_fit_df$MDS1, y=env_fit_df$MDS2, label=vector),
inherit.aes = FALSE, size=3) +
ggtitle("NMDS of ____")
print(p.nmds)

And of course, to wrap up the part 1 of this analysis, finish with PERMANOVA comparisons to identify significance based on environmental variables.

perm_pond <- adonis(data ~ pond, data=metadata, permutations = 10000, method = "bray")
perm_pond$aov.tab

Analysis of similarities (ANOSIM):

anosim_pond <- anosim(dist, metadata$pond, permutations = 999, distance = "bray")
anosim_pond$signif

And finally, similarity percentages (SIMPER) for identifying the driving variable in a community matrix. In this example, “6.4” and “16.8” are distances in km from proglacial lake, and “rhm” and “ts” are ponds.

sim <- simper(comm = data, group = metadata$glacialpath, permutations = 100)

sim.rhm.ts <- as.data.frame(sim$6.4_16.8$species)
names(sim.rhm.ts)[1] <- "taxa"
sim.rhm.ts$cumsum <- as.data.frame(sim$6.4_16.8$cusum)
names(sim.rhm.ts)[2] <- "cumsum"

Up next: Centroid Visualization, Alpha Diversity Analyses, Canonical Analyses of Principal Components, Relative Abundance Visualization, and Phylogenetic Analysis

One thought on “16S MiSeq Analysis Tutorial Part 1: NMDS and Environmental Vectors

  1. Paola

    Thanks a million for the insights you share on your blog.I’m excited to join your newsletter but need assistance with the confirmation process.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *