Keywords
whole genome sequencing - microbiome - assembly - metabolic pathways
The relationship of humans with our environmental microbes is documented throughout
history. The discovery of the smallpox vaccine by Edward Jenner and the great pandemics
of the Bubonic Plague and the 1918 influenza have demonstrated the volatile and parasitic
side of microbes. However, we also have mutualistic and commensal relationships with
the microbes in our environment. Recently, the Human Microbiome Project (HMP) consortium
took on the task of documenting what constitutes a healthy microbiome.[1][2][3][4]
This question has helped to highlight studies demonstrating that dysbiosis of the
microbiome is associated with type 2 diabetes mellitus, obesity, inflammatory bowel
disease, and colorectal cancer.[5][6][7][8] Further, dysbiosis of the microbiome has
been implicated as a cause of preterm
birth. Gravidae that undergo preterm birth often have an intrauterine infection with
increases in inflammatory cytokines,[9][10] such as IL-6 and IL-1β. These studies
thus raise the question of how we
establish and maintain a healthy microbiome. With the exponentially expanding interest
in human microbiome research, a working knowledge of the methodology and tools used
in this field is fundamental to translational research. This is notably true in reproductive
and perinatal research initiatives, where there is a tremendous potential need for
investigators well versed in both the technology and biology of the expanding field
of research. While an increasing number of investigators are familiar with and employing
16S-based metagenomic approaches, there are far fewer investigators who have a working
knowledge of alternative metagenomic approaches.
Before the era of massively parallel NextGen sequencing, the clone-based metagenome
approach in combination with Sanger sequencing was used for early metagenomics research.
First, the DNA content of an environmental sample is sheared into random fragments,
which are cloned into plasmid vectors and grown to produce monoclonal libraries
containing enough genomic material for sequencing. Although Sanger sequencing produces
long reads (100–2,000 bp), usually only a few selected inserts can be sequenced. Thus,
this process is low throughput, suffers from cloning bias, and struggles to assemble
regions with large repeats. Hence, it is not surprising that NextGen sequencing
(NGS) techniques have quickly replaced Sanger sequencing because of their collectively
unique advantages. In addition to a lower per-base cost and higher throughput,
the cloning step and its inherent problems seen in Sanger sequencing methods are no
longer issues for NGS techniques. Environmental samples can be sequenced directly
by NGS techniques, which allows for the investigation of unculturable and low abundance
species. Therefore, the comprehensive characterization of more complex and diverse
microbial communities, such as microbial communities related to the human reproductive
system, becomes feasible. NGS techniques used in metagenomics research mainly include
454 Genome Sequencer Pyrosequencing (454 Life Sciences; Roche Company, Branford, CT)
for 16S rDNA sequencing, Solexa/Illumina (Illumina Inc., San Diego, CA) for whole
genome shotgun sequencing (WGS) studies, and the most recent Helicos (Helicos
BioSciences, Cambridge, MA) single-molecule sequencing technology, also for WGS studies.[11]
The majority of recent studies examining the bacterial flora communities residing
within humans have utilized 16S rDNA sequencing techniques. The nine variable regions
of the 16S rRNA gene are flanked by conserved stretches in the majority of bacteria.
These conserved stretches can serve as targets for PCR primers with near-universal
bacterial specificity. Through 454 Pyrosequencing, sequence reads are obtained from
one region of the 16S rRNA gene, quantified, and subsequently assigned a taxonomy.
Thus, compared with Sanger sequencing of the full-length 16S rRNA gene, the coverage
of each sample is much higher, and many more samples can be run in parallel using a
bar-coding system. The downside, however, is that only a small proportion of reads
can be assigned to lower-level taxonomy because of the shorter read length. Overall,
the resolution of community composition obtained with 16S Pyrosequencing is orders
of magnitude greater than that of Sanger sequencing, at a lower per-base cost.
The Illumina technology was introduced around the same time as 454 Pyrosequencing
technology. The Illumina instruments produce more than 10 times the number of reads
per run as the 454 GS FLX machines, albeit of much shorter lengths (less than 100 bp
compared with 400–500 bp of 454 reads). The advantages of WGS sequencing on Illumina
platform over 16S rDNA sequencing on 454 platform are the ability to provide information
on genome assembly, species-level taxonomic abundance, gene prediction, and metabolic
pathway reconstruction.[12] However, each stage of the analysis is complicated by
incomplete coverage, the high volume of data, the short length of reads, and intrinsic
errors introduced by massively parallel sequencing.[13][14] In this review, we will
primarily focus on the bioinformatics procedures that transform Illumina-generated
short reads into biologically meaningful taxonomic and functional entities ([Fig. 1]).
Recently developed tools specific to metagenomic data analysis, and their application
to human reproductive medicine, will be discussed as well ([Table 1]).
Table 1
Tools for metagenomics analysis mentioned in this review

Process                            | Tools               | Reference
Metagenome assembly                | MetaVelvet          | [20]
                                   | Meta-IDBA           | [21]
                                   | GeneStitch          | [22]
                                   | Ray Meta            | [23]
                                   | Bambus 2            | [25]
Gene prediction                    | MetaGene Annotator  | [26]
                                   | MetaGeneMark        | [28]
                                   | Orphelia            | [29]
Taxonomic identification           | MEGAN               | [31]
                                   | WebCARMA            | [33]
                                   | Phymm/PhymmBL       | [35]
                                   | MetaPhyler          | [36]
                                   | MetaPhlAn           | [37]
Functional and pathway annotation  | BLASTX              | –
                                   | UBLAST              | [40]
                                   | HMMER               | [41]
                                   | THINK-Back          | [42]
                                   | HUMAnN              | [45]
Data analysis pipeline             | MetAMOS             | [58]
                                   | MG-RAST             | [55]
                                   | Genboree tool kit   | [54]
Figure 1 Flowchart of metagenomics whole genome shotgun sequencing data analysis.
Genome Assembly
Genome assembly is essential for the study of gene arrangements and gene function.
For assembly in a single organism, all the DNA fragments come from the same genome.
However, this is not the case when it comes to metagenome assembly, and several obstacles
make metagenome assembly especially challenging. For samples from an environment with
low microbial abundance, the coverage of each genome is usually incomplete. Although
assembly can yield longer sequences, there is a risk of producing chimeric contigs
that merge different operational taxonomic units (OTUs). This risk of chimeric contigs
is further compounded by genomic repeats. For the same reasons, the assembly process
can distort the apparent species abundance as well.[15]
Reconstructing genomes without reference to a previously sequenced genome is called
de novo assembly, a problem proven to be computationally hard (NP-hard). The
traditional method used for the assembly of Sanger-based sequences is Overlap Layout
Consensus. An overlap graph is first constructed, with each read as a node and edges
representing the overlaps identified between reads. The graph is thereafter analyzed
to determine the paths connecting reads together to construct the genome. However,
this method is poorly suited to assembling the short reads generated on NGS platforms
because, in the worst case, each read must be compared with all other reads. NGS methods
usually generate an order of magnitude more reads compared with Sanger sequencing,
which significantly increases the computational complexity. Most of the recently developed
metagenomics assembly algorithms are based on Eulerian tours of de Bruijn graphs. In
de Bruijn graphs, reads are first decomposed into k-mers of fixed length. Nodes represent
(k-1)-mers, and the k-mers derived from the reads form the edges connecting the nodes
([Fig. 2]). The overlaps are implicitly represented in the graph by paths that traverse from
one read to its neighbor. The output contigs correspond to simple paths in the graph.
In this way, the number of reads does not affect the number of nodes, and because each
repeat appears only once, the problem of high redundancy among reads is also solved.
Moreover, the solution to a de Bruijn graph is an Eulerian path, and a linear-time
algorithm for finding an Eulerian path exists.
Figure 2 A sample de Bruijn graph with k = 4. The edges of the graph are unique subsequences of reads of length k. The nodes represent common subsequences of length k-1. Two nodes are connected if the suffix of one matches the prefix of the other over a length
of k-2. This graph consists of short reads covering the consensus
sequence “GTTTGGTTGT.”
Traditional assemblers designed for the assembly of single organism genomes were initially
applied to assemble metagenomes (i.e., Velvet,[16] Celera,[17] and SOAPdenovo[18]) with limited success. Recently, various Eulerian strategy-based assemblers have
been developed specifically for the assembly of metagenomes.[19]
One such assembler is MetaVelvet,[20] which was developed on the basis of Velvet,[16] a popular assembler for single genomes. The basic idea of MetaVelvet is to treat
the de Bruijn graph constructed from sequences obtained from multiple species as a
mixture of multiple de Bruijn subgraphs, where each subgraph represents an individual
species. The mixed de Bruijn graph is then decomposed into individual subgraphs based
on coverage difference and graph connectivity, and the subgraphs are subsequently
used for building scaffolds.
The program Meta-IDBA[21] addresses the branches in the de Bruijn graph that are
caused either by polymorphisms in similar subspecies (sp-branches) or by similar genomic
regions shared by different species (cr-branches). Meta-IDBA first identifies and
removes cr-branches in the graph, which leaves connected components corresponding
to a set of subgraphs of the same species. Finally, each component is transformed
into a multiple alignment of consensus sequences to represent the contigs of different
subspecies.
GeneStitch[22] uses prior knowledge of species composition and gene content to guide
the assembly process. The idea is that, because assembled contigs should resemble given
reference genes, contigs can be inferred from the tangled de Bruijn graph using a
network matching algorithm; if no prior sequence knowledge is available, a general
gene dataset can serve as the reference instead. With
the ever increasing number of samples and reads, scalability is becoming important
for metagenome assembly.
Ray Meta[23] is a method that was developed for scalable distributed de novo metagenome assembly
on Ray.[24] Unlike MetaVelvet and Meta-IDBA, Ray Meta does not partition the de Bruijn
graph into subgraphs; instead, it applies heuristic-guided graph traversals over k-mers
in parallel, which is more amenable to distributed computing. All of these methods are
reported to yield longer contigs and more representative taxonomic profiles on simulated
and real data compared with assemblers designed for single-genome assembly.
After the reads are assembled into contigs, the relative positions of the contigs
along a genome are determined by scaffolding, a process that depends on mate-pair
information. If two ends of the mate pair are in differing contigs, the two contigs
are inferred to be adjacent to each other on the genome. Most assemblers contain a
scaffolding module. The scaffolding algorithm starts with the most reliable information
and gradually adds more data as long as the new information agrees with the constructed
scaffold. Tools, such as Bambus 2,[25] have been developed specifically for metagenome
scaffolding; Bambus 2 can be applied to virtually all existing sequencing technologies
and to the output of popular assemblers.
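The mate-pair logic can be sketched in a few lines of Python. This is a toy illustration of the adjacency inference described above, not Bambus 2's algorithm; the minimum-link threshold is a hypothetical parameter guarding against single chimeric pairs.

    from collections import Counter

    def infer_adjacencies(mate_pairs, min_links=2):
        """mate_pairs: (contig_of_read1, contig_of_read2) for each mate pair.
        Contig pairs supported by enough links are inferred to be adjacent."""
        links = Counter()
        for c1, c2 in mate_pairs:
            if c1 != c2:  # the two ends landed in different contigs
                links[tuple(sorted((c1, c2)))] += 1
        # Require several concordant pairs so one chimeric pair cannot join contigs.
        return [pair for pair, n in links.items() if n >= min_links]

    pairs = [("ctg1", "ctg2"), ("ctg1", "ctg2"), ("ctg2", "ctg3"),
             ("ctg2", "ctg3"), ("ctg1", "ctg3")]
    print(infer_adjacencies(pairs))  # [('ctg1', 'ctg2'), ('ctg2', 'ctg3')]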
Gene Prediction
A fundamental purpose of genome assembly is to enable gene prediction from scaffolds,
such that genes can be assigned their correct functions. There are two classically
described approaches to gene prediction: (1) training model parameters on known
annotations to predict unknown annotations (ab initio prediction) or (2) homology
search, which aligns sequences against a gene database to find homologous sequences.
However, these traditional methods cannot be applied directly to metagenomics data.
The incomplete open reading frames (ORFs) acquired from metagenome assembly often
lack start or stop codons; therefore, ab initio programs do not work in this scenario.
In addition, there is not yet a sufficiently large metagenomics sequence database from
which to build a statistical model that distinguishes coding from noncoding ORFs. The
obvious drawback of homology-based methods is that they only provide information for
known genes.
Recent tools have been developed to address this core issue of metagenome gene prediction.
MetaGene Annotator (MGA)[26] is an upgraded version of MetaGene.[27] First, MGA
extracts all ORFs and scores them on a model estimated from annotated genomes. Then,
an optimal combination of ORFs is calculated using scores based on the orientations
and distances of neighboring ORFs. MGA also uses the logistic regression models of
the GC content and di-codon frequencies from MetaGene. In addition, it has an adaptable
ribosomal binding site model based on complementary sequences of 16S ribosomal RNA,
which helps MGA to precisely predict translation start sites.
MetaGeneMark[28] uses a heuristic approach originally developed for finding genes within small fragments
of anonymous prokaryotic genomes and/or highly inhomogeneous genomes. The training
dataset consists of 357 bacterial and archaeal species. Linear regression is applied to the relevant information in the training
set, such as the relationship between positional nucleotide frequencies and the global
nucleotide frequencies and the relationship between the amino acid frequencies and
the global GC content. The initial frequency values of the occurrence of 61 codons
are calculated based on the above information and subsequently modified by the frequency
of each amino acid determined by the GC content. Finally, the Markov model of a protein
coding region is constructed based on the usage of all 61 codons.
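The sequence statistics such heuristics draw on are simple to compute. The sketch below, using a made-up DNA fragment, computes GC content and sense-codon frequencies for one reading frame; it illustrates the model inputs rather than MetaGeneMark itself.

    from collections import Counter

    def gc_content(seq):
        return (seq.count("G") + seq.count("C")) / len(seq)

    def codon_frequencies(seq, frame=0):
        """Relative frequencies of the sense codons in one reading frame."""
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        codons = [c for c in codons if c not in ("TAA", "TAG", "TGA")]
        counts = Counter(codons)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    fragment = "ATGGCGTTTGGTTGTGCGGCGAAATAA"  # hypothetical fragment
    print(round(gc_content(fragment), 2))
    print(codon_frequencies(fragment))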
Orphelia is the third recently developed tool for gene prediction,[29] and is unique in that it adopts a neural network-based method. The neural network
is trained on randomly excised DNA fragments from the genomes used for discriminative
training. The artificial neural network combines sequence features, such as monocodon
usage, dicodon usage, and translational initiation sites, with ORF length and GC content
to compute a posterior probability of an ORF to encode a protein.
One recent study has benchmarked these three gene prediction methods and demonstrated
variable performance at different read lengths and fragment types.[26] As might be logically predicted, the authors found that longer reads result in better
gene prediction. In addition, while MGA had the best sensitivity, it was the worst
in specificity for most read lengths. MetaGeneMark had average sensitivity but much
better specificity than MGA, and Orphelia had the lowest annotation error for longer
read lengths. Therefore, the combination of several methods, screening intergenic
regions for overlooked genes, and using dedicated frameshift detectors may result
in better prediction accuracy.[26] Which method performs optimally will also depend on the number
of reads per sample and on the ratio of bacterial to human reads, which in turn is
related to the body niche from which the sample originates.
Taxonomy Identification
One important question to answer in metagenomics analysis is “What microbes are present?”
which leads to the identification of taxonomy distribution in metagenomic samples.
16S rDNA-based surveys produce on average 10,000 sequences per sample, ranging from
400 to 700 bp in length. Rapid taxonomic classifiers, such as the Ribosomal Database
Project (RDP) classifier,[30] use these sequences to generate a taxonomic distribution
down to the genus level. Despite their popularity, 16S rDNA-based methods suffer from
biased estimation of microbial diversity owing to variability in 16S gene copy number
and biases introduced by PCR.
There are two main strategies for taxonomic identification using WGS reads. The
first employs homology search against a reference gene database. For example, MEGAN[31] first performs a BLASTX search against the NCBI-NR database. Taxonomic analysis
is then conducted by placing each read onto a node of the NCBI taxonomy according
to the lowest common ancestor of the top hits, and the NCBI taxonomy is based on a
hierarchically structured classification of all species represented in the NCBI. In
contrast, CARMA[32] and its refined version WebCARMA[33] use Pfam domains and protein
families as phylogenetic markers to identify
the source organisms of unassembled reads using hidden Markov models. Then a phylogenetic
tree is reconstructed for each matching Pfam family and the corresponding query reads.
Finally, the reads are classified into a higher-order taxonomy depending on their
phylogenetic relationships to family members with known taxonomic affiliations. It
is worth noting that only a small portion of reads have matches by BLAST against the
microbial database.
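The lowest-common-ancestor rule itself is straightforward. Below is a minimal Python sketch of the idea with a hypothetical child-to-parent taxonomy (ranks compressed for brevity); MEGAN's actual implementation adds bit-score thresholds and other refinements.

    def lineage(taxonomy, taxon):
        """Walk child -> parent pointers up to the root."""
        path = [taxon]
        while taxon in taxonomy:
            taxon = taxonomy[taxon]
            path.append(taxon)
        return path

    def lowest_common_ancestor(taxonomy, hits):
        """Assign a read to the deepest taxon shared by all of its top hits."""
        paths = [lineage(taxonomy, h) for h in hits]
        shared = set(paths[0]).intersection(*map(set, paths[1:]))
        return next(t for t in paths[0] if t in shared)

    # Hypothetical child -> parent map (ranks compressed for brevity)
    taxonomy = {"L. crispatus": "Lactobacillus", "L. iners": "Lactobacillus",
                "Lactobacillus": "Lactobacillaceae", "Lactobacillaceae": "Bacteria",
                "G. vaginalis": "Bacteria"}
    print(lowest_common_ancestor(taxonomy, ["L. crispatus", "L. iners"]))  # Lactobacillus
    print(lowest_common_ancestor(taxonomy, ["L. iners", "G. vaginalis"]))  # Bacteria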
An alternative strategy to homology-based approaches is to use machine learning and
statistical methods to classify reads based on the composition of the DNA base signatures.
The interpolated Markov models (IMMs) have been employed with success in bacterial
gene classification using the GLIMMER system.[34] Compared with other methods, IMMs utilize information from sequences of different
lengths and integrate the results. The program Phymm[35] demonstrates the use of IMMs in classification of metagenomic reads. In Phymm, a
classifier is trained on a large amount of curated genomes. This classifier constructs
probability distributions that represent the observed patterns of nucleotides characterizing
each chromosome or plasmid. PhymmBL[35] demonstrates that the combination of machine learning and BLAST produces higher
accuracy than either method alone.
Given the complexity of metagenomic assemblies, the taxonomic classification can also
be achieved by using reads directly, before assembly. Large-scale studies, such as
the HMP,[1][2][3] often include hundreds of samples. In these large-scale studies, the computational
efficiency of BLAST becomes the bottleneck for the analysis process if all the reads
are used for classification without assembly. Therefore, reference marker gene sets
are constructed to reduce the size of the database. MetaPhyler[36] is one of these methods, which relies on 31 phylogenetic marker genes derived from
existing genomes and the NCBI-NR database. Furthermore, instead of using a universal
classification threshold for all genes at all taxonomic levels, MetaPhyler uses
thresholds specific to each reference gene and taxonomic level, which
results in much faster analysis. MetaPhlAn[37] first identified more than 2 million potential markers using 2,887 genomes from
Integrated Microbial Genomes (IMG) system,[38] which was further refined to a catalog of 1,221 species with 231 markers per species
and > 115,000 markers at higher taxonomic levels. The relative abundance at each taxonomic
level is estimated by aligning reads to the clade-specific marker sequences in this
catalog. Microbial clade abundance is then estimated by normalizing read-based counts
by the average genome size of each clade. MetaPhlAn has been applied to the analysis
of the vaginal microbiome (posterior fornix) of asymptomatic women enrolled in the HMP.
As Lactobacillus is the dominant genus in the vaginal microbiome, it is important to further classify
this genus down to the species level to reflect the detailed microbial profile. Using
this strategy, all five of the signature Lactobacillus species could be identified by MetaPhlAn. Despite the technical differences between
16S sequencing and WGS, the estimated relative abundances are quite similar.
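The genome-size normalization can be illustrated with a short sketch. The counts and genome sizes below are invented, and this simplifies MetaPhlAn's actual estimation procedure.

    def clade_relative_abundance(marker_read_counts, avg_genome_size):
        """marker_read_counts: reads mapped to each clade's specific markers.
        Dividing by average genome size converts read counts toward cell-count
        proportions, so clades with large genomes do not appear inflated."""
        weighted = {c: n / avg_genome_size[c] for c, n in marker_read_counts.items()}
        total = sum(weighted.values())
        return {c: w / total for c, w in weighted.items()}

    counts = {"L. crispatus": 9000, "L. iners": 3000, "G. vaginalis": 2400}
    sizes = {"L. crispatus": 2.0e6, "L. iners": 1.3e6, "G. vaginalis": 1.6e6}
    print(clade_relative_abundance(counts, sizes))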
Functional and Pathway Annotation
After characterizing the microbial constituency, the next question to be answered from
WGS data is “What are these microbes able to do?” There are two issues involved
in this process. The first is to assign functional annotations to the assembled
ORFs or to the reads directly. The other is to place the genes in the context
of biological pathways, especially metabolic pathways.[39] The most straightforward approach to functional prediction is to align query sequences
to an existing reference protein database, but then one must determine which database
to use. The size and contents of the databases are different, which will in turn affect
the efficiency and accuracy. If one is interested in annotating as many sequences
as possible, the NCBI RefSeq database would be a good choice because it has the most
comprehensive collections of genomes. For this purpose, various versions of BLAST,
including BLASTX and BLASTP, can be applied. However, this approach suffers from
the long computation time required to search each sequence in the dataset against
all homologs in the reference database. To speed up the process, BLAST can be run
in parallel, as with MBLASTX (MulticoreWare, St. Louis, MO), which was used by the HMP.
Tools, such as UBLAST,[40] have been developed for high-throughput sequence classifications that are often
an order of magnitude faster than BLAST in real applications, albeit at the cost of
sensitivity. The raw quantification obtained from alignments needs to be normalized
by the size of the reference coding sequences. The results of a homology search
are often confounded by sequence conservation arising from functional homology across
organisms: when sequences map to structurally or functionally conserved regions,
they can easily be assigned to the wrong species if only a similarity score is used.
A possible solution to misclassification is to adopt the more sensitive profile-based
search method. This method uses databases with profiles generated from alignments
of protein families that share similar functions, such as COG, Pfam, or TIGRfam. Hidden
Markov–based HMMER[41] was designed to perform a fast search against profiles generated from multiple sequence
alignments. Although more sensitivity is achieved this way, fewer sequences get annotated.
For partial proteins generated from short contigs or unassembled sequences, a repository
of patterns or motifs (e.g., PROSITE) might be used for a functionality search.
If gene prediction is successful, genomic neighborhood, phylogenetic profiling, and
gene coexpression analysis may provide useful information for functional prediction
as well.
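As an example of the length normalization mentioned above, the sketch below converts raw alignment counts into length-normalized relative abundances (reads per kilobase of coding sequence). The gene names and numbers are invented for illustration.

    def normalized_gene_abundance(read_counts, cds_lengths):
        """Normalize raw alignment counts by reference coding-sequence length
        (reads per kilobase), so long genes do not dominate the profile."""
        per_kb = {g: n / (cds_lengths[g] / 1000) for g, n in read_counts.items()}
        total = sum(per_kb.values())
        return {g: v / total for g, v in per_kb.items()}  # relative abundance

    counts = {"geneA": 500, "geneB": 500}       # identical raw counts...
    lengths = {"geneA": 1000, "geneB": 4000}    # ...but geneB is 4x longer
    print(normalized_gene_abundance(counts, lengths))  # geneA 0.8, geneB 0.2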
Pathway-based analysis was originally developed to interpret the results of microarray
experiments and has since been applied to metagenomics data. A pathway here denotes
a connected set of genes, with nodes representing genes and edges representing
their relationships ([Fig. 3]). The significance of a pathway is determined by functional enrichment statistics
(Fisher exact test) or by scoring based on the pool of genes in the sample (gene set
enrichment analysis [GSEA]). One major drawback of these count-based methods is that
they disregard the topology of the pathways, although the order of genes in a pathway
could aid the interpretation of the results. Fortunately, more sophisticated methods
have been developed to address this problem. THINK-Back[42] (Knowledge-based
Interpretation of High-Throughput data) is a suite of tools that aims to generate
biologically meaningful hypotheses by using knowledge in pathway databases, such as
KEGG, PANTHER, Reactome, and BioCarta. One method in THINK-Back adjusts the score
generated by GSEA[43] by incorporating the appearance frequency of the genes in the
KEGG database.[44] Another method takes the topology of the pathways into account to calculate a density
score, which is subsequently used to adjust GSEA scores.[42] Pathway reconstruction from WGS data essentially uses the number of gene
copies as a proxy for pathway activity, which is quite different from RNA-based
microarray and RNA-seq analysis. Nevertheless, the idea of integrating pathway topology
into pathway analysis can still be applied to metagenomics pathway reconstruction.
Figure 3 Mock KEGG pathway map illustrating the concept of pathway analysis. The top
panel contains all the KEGG pathways involved in metabolic processes. Blue and red
indicate the enrichment of genes in either the case or the control group. Two examples
of KEGG pathways are shown, with nodes representing proteins or molecules and lines
representing their biological relationships. The gradient of red indicates the average
relative ratio of gene abundance between case and control samples.
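Returning to the enrichment statistics above, a Fisher exact test on a 2x2 contingency table can be run directly with SciPy. The sketch below tests whether a hypothetical pathway is over-represented among a set of differentially abundant genes; all gene identifiers are invented.

    from scipy.stats import fisher_exact

    def pathway_enrichment(pathway_genes, hit_genes, background_genes):
        """2x2 contingency: pathway membership vs. presence in the hit list."""
        in_path_hit = len(pathway_genes & hit_genes)
        in_path_miss = len(pathway_genes - hit_genes)
        out_path_hit = len(hit_genes - pathway_genes)
        out_path_miss = len(background_genes - pathway_genes - hit_genes)
        table = [[in_path_hit, in_path_miss], [out_path_hit, out_path_miss]]
        _, p = fisher_exact(table, alternative="greater")  # over-representation
        return p

    background = {f"g{i}" for i in range(1000)}
    pathway = {f"g{i}" for i in range(30)}                  # hypothetical pathway
    hits = {f"g{i}" for i in range(15)} | {"g400", "g500"}  # differential genes
    print(f"enrichment p = {pathway_enrichment(pathway, hits, background):.2e}")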
As described above, most of the ORFs assembled from metagenomics reads are partial
and likely contain errors caused by frame shifts; therefore, another way to perform
functional annotation is to skip the gene calling altogether and use the protein coding
sequences identified from the reads. In HUMAnN (the HMP Unified Metabolic Analysis
Network),[45] the reconstruction of a network is accomplished by mapping the protein coding genes
onto reference pathway collections, such as eggNOG and KEGG orthology groups, based
on their homology to previously characterized reference genes. MinPath[46] adopts an integer programming algorithm to reconstruct the “minimal pathway” set,
defined as the smallest set of pathways that can account for all the functions
annotated to a set of genes. This approach avoids the identification
of spurious pathways and the overestimation of microbial abundance. After data normalization
and smoothing, pathway coverage (relative confidence of each pathway being present
in the sample) and pathway abundance (relative “copy number” of each pathway in the
sample) are generated and organized into a matrix-like format for postprocessing.
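MinPath's parsimony idea can be conveyed with the classic greedy set-cover approximation, sketched below. MinPath itself solves the problem exactly with integer programming, and the pathway definitions and KEGG ortholog identifiers here are invented.

    def minimal_pathways(pathway_to_functions, observed_functions):
        """Greedy set cover: repeatedly keep the pathway explaining the most
        still-unexplained functions."""
        unexplained = set(observed_functions)
        kept = []
        while unexplained:
            best = max(pathway_to_functions,
                       key=lambda p: len(pathway_to_functions[p] & unexplained))
            covered = pathway_to_functions[best] & unexplained
            if not covered:  # remaining functions map to no known pathway
                break
            kept.append(best)
            unexplained -= covered
        return kept

    pathways = {"glycolysis": {"K00001", "K00002", "K00003"},
                "TCA cycle": {"K00003", "K00004"},
                "spurious": {"K00002"}}
    print(minimal_pathways(pathways, {"K00001", "K00002", "K00003", "K00004"}))
    # ['glycolysis', 'TCA cycle'] -- the 'spurious' pathway is never needed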
Comparative Metagenomics
Despite all the challenges with WGS as covered in this review, important environmental
and biological questions have been investigated through comparisons of taxonomic abundance
and metabolic pathway activity. Because of the dynamic nature of the microbiome, there
is large variation in microbiome profiles even from samples obtained from a similar
environment. Therefore, a larger number of samples needs to be collected to obtain an
accurate measure of the microbiome. Thus, in addition to a large number of sequences,
there is also a large number of samples with accompanying metadata. The taxonomy profile is often
organized into a matrix with rows representing taxonomy (either relative abundance
for each taxonomic level or OTU counts) and columns representing each sample. Depending
on the complexity of the microbiome, there could be thousands of rows and columns
in the matrix.
As the matrix expands, data dimensionality becomes an issue, and dimensionality
reduction becomes important to decrease the computational cost. If the taxa table is large, it will
be helpful to first filter the table to remove singletons or OTUs/species only appearing
in a small number of samples. As singletons or rare species may be generated by sequencing
error, they are not helpful for the purpose of comparative metagenomics. Principal
coordinates analysis (PCoA) is the most popular technique to perform dimensionality
reduction. PCoA takes the results of β-diversity comparisons, generated using
phylogenetic (UniFrac[47]) or nonphylogenetic (such as Canberra) distances on the taxon table, and produces a new
matrix with fewer dimensions by solving an eigenvalue problem. The direction of each axis is
chosen to maximize the variation in the data. Normally, the first three coordinates
are chosen to visualize the samples in three dimensions. The points that cluster together
indicate that these samples have similar taxonomy profiles. An alternative, nonparametric
rank-based method is to use nonmetric multidimensional scaling, which could avoid
the arch effect caused by the sparsity of the matrix.[48]
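Classical PCoA reduces to an eigendecomposition of the double-centered squared distance matrix. The sketch below implements this with NumPy on an invented four-sample distance matrix; dedicated packages (e.g., scikit-bio) provide hardened implementations.

    import numpy as np

    def pcoa(dist, n_axes=3):
        """Principal coordinates analysis (classical MDS) on a distance matrix."""
        n = dist.shape[0]
        # Double-center the squared distances: B = -1/2 * J D^2 J, J = I - 11'/n
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (dist ** 2) @ J
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        pos = eigvals > 0                        # keep positive-eigenvalue axes
        coords = eigvecs[:, pos] * np.sqrt(eigvals[pos])
        return coords[:, :n_axes]

    # Hypothetical pairwise distances among four samples (two apparent clusters)
    D = np.array([[0.0, 0.2, 0.7, 0.8],
                  [0.2, 0.0, 0.6, 0.7],
                  [0.7, 0.6, 0.0, 0.1],
                  [0.8, 0.7, 0.1, 0.0]])
    print(pcoa(D, n_axes=2))  # the two clusters separate along the first axis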
If clusters are observed from a 3D PCoA plot, most biologists will be interested to
know which taxa cause the differences in the microbial community depending on the
metadata. Thus, some machine learning techniques could be used to answer this question.
Not every statistical test should be used for every analysis, but the combination
of several analyses can produce a more accurate result. The Random Forest generates
a large ensemble of decision trees, each built from a random subset of the data and a random selection
of the variables. The resulting ensemble of trees is then used with a majority-voting
approach to decide which sample belongs to which group. One advantage of the Random
Forest is that there is no need for cross validation to get an unbiased estimate of
the test set error. An out-of-bag error estimate is generated internally by a bootstrap
sample from the original data. This is very useful when the sample size is small.
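The sketch below illustrates the out-of-bag estimate with scikit-learn on a simulated taxa table in which one taxon separates the two groups; the data are synthetic.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Simulated taxa table: rows are samples, columns are taxa abundances
    rng = np.random.default_rng(0)
    X = rng.random((40, 50))
    y = np.repeat([0, 1], 20)          # e.g., case vs. control metadata
    X[y == 1, 3] += 0.5                # taxon 3 differs between groups

    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"OOB error: {1 - rf.oob_score_:.3f}")   # no cross-validation needed
    top = np.argsort(rf.feature_importances_)[::-1][:3]
    print("most discriminative taxa:", top)        # taxon 3 should rank first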
Boruta is an all-relevant feature selection wrapper algorithm around Random Forest.
It finds important features by iterative learning of the Random Forest classifier.
In the end, a list of features confirmed to differentiate groups is generated.[4] LEfSe (linear discriminant analysis effect size)[49] is a recently developed tool to identify genomic features (genes, pathways, or taxa)
specific to each group. LEfSe first uses the Kruskal–Wallis rank-sum test and the Wilcoxon
rank-sum test to identify features with significant differential abundance with respect to the
class of interest. Then, linear discriminant analysis is applied to estimate the effect
size of each differentially abundant feature. LEfSe also provides bar plots and cladogram
plots to represent the discovered biomarkers.
Random Forest and linear discriminant analysis are both supervised machine learning
methods, which means that the samples have been assigned to a group before the learning
task. Unsupervised learning has been applied to metagenomics data as well to discover
the hidden structure of the microbiome. One effort is the introduction of the “enterotype”
by Arumugam et al[50] using the human gut samples from the MetaHIT consortium.[50] Enterotypes are generated by performing clustering analysis on the gut microbial
communities. The differences among the three enterotypes are driven by key bacterial genera
and are not related to age, gender, or body weight. There is also a report of the
existence of enterotype-like clusters in the vaginal microbiome community based on
the abundance of bacterial species, mainly species in the Lactobacillus genus.[51] However, recent research, including our own, indicates that one should take precautions
when performing enterotyping.[52][53] Regardless of how the taxa table is generated,
clustering is a statistical procedure whose performance is affected by many factors.
One recent study tried to identify
the influence of various factors on enterotyping, including clustering methodology,
distance metrics, OTU-picking methods, sequencing depth, and sequencing methods.[52] Using the HMP data, instead of discrete enterotypes, a smooth gradient distribution
of Bacteroides abundances was observed in the gut microbiome. For the vaginal microbiome, depending
on the taxonomy level, distance metrics, and scoring methods, two to five clusters
were found using the HMP vaginal data. These results suggest that distance metrics
and the clustering methods have the largest effect on enterotyping. At least one absolute
scoring method combined with two to three distance metrics should be used to verify
the existence of enterotypes.[52]
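The sketch below illustrates that kind of verification: clustering a simulated abundance table at several values of k and comparing an absolute score (here, the silhouette) across solutions. The data and the choice of k-means are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Simulated genus-level relative-abundance table (samples x taxa)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.dirichlet(np.ones(10) * a, 30) for a in (0.5, 5.0)])

    # An absolute score such as the silhouette asks whether discrete clusters
    # exist at all, rather than forcing a preset number of enterotypes.
    for k in range(2, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))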
WGS Data Analysis Pipelines
As described above, metagenomics data analysis includes assembly, gene prediction
and annotation, taxonomic classification, and so on, but each of these tasks is performed
by specific software that requires installation, configuration, and integration. This
is a daunting task even for bioinformatics experts. Most research groups construct
analysis pipelines by picking tools for each task based on their own experience. For
a laboratory without bioinformatics support, it may be difficult to perform meaningful
analysis on a large amount of data. With this in mind, we have recently worked to
produce single-site, publicly available tool sets.[54] Specifically designed for 16S
analysis, our Genboree Microbiome tool set was deployed on the web-based Genboree
Workbench, which has an easy-to-use graphical interface. Users upload their sequencing
files and metadata and choose the desired analysis tasks with a few clicks.
Similar web tools for WGS data analysis have been developed too. MG-RAST[55] is a comprehensive web tool for both phylogenetic and functional summaries. MG-RAST
is based on a modified version of the RAST (rapid annotation using subsystem technology)
server,[56] built upon the SEED framework, which provides automated sequence assignment by comparison
with both protein and nucleotide databases. Users can upload the sequence file to
the server and keep data private or public.
Like QIIME for 16S-based analysis, a similar standardized framework for WGS data analysis
has been created. SmashCommunity[57] is one of the early pipelines designed for 454 and Sanger data with limited capability
for follow-up analysis. MetAMOS[58] is a modular, customizable, and user-friendly
framework for metagenomic assembly and analysis. It is built upon the AMOS open-source
genome assembly framework. A collection of publicly available tools is tied together by the
lightweight workflow system Ruffus, including Meta-IDBA,[21] MetaVelvet,[20] SOAPdenovo,[18] Bowtie,[59] MetaGeneMark,[28] MetaPhyler,[36] and more. The modular design enables users to check the output for each step and
facilitates the integration of data generated by other tools or in different formats.
WGS Application in Human Reproductive Medicine
Despite these advances in WGS analysis, the prevailing technique used for microbiome
research in the area of human reproductive medicine is still 16S rDNA sequencing.
However, WGS techniques have been adopted in several recent studies in addition to
the 16S sequencing. A subset of samples from HMP was subjected to Illumina sequencing,
which included samples from the posterior fornix.[2] The results of this study demonstrated that although each body site is characterized
by a signature clade, most of the metabolic pathways are evenly distributed and prevalent
across both individuals and body habitats. However, this analysis revealed that
pathways related to the oligosaccharide and polyol transport systems are more active
in posterior fornix samples. One recent study on dynamic changes in the gut microbiota
from the first to the third trimester also used the Illumina HiSeq 2000 to examine the enrichment
of specific metabolic pathways.[60] The analysis did not find differences in the mean relative abundance of gene
categories or metabolic pathways between trimesters. Therefore, the shifts in the gut
microbiome during pregnancy may not be associated with metabolic changes. However,
a network analysis of correlations between COG (cluster of orthologous groups) abundances
across samples indicated the loss of network modularity in the third trimester, which
indicates a reduction in phylogenetic diversity and a more uneven distribution of
taxa.[60] The results of this study are in agreement with studies of phylogenetic diversity
from our laboratory.[4] As WGS techniques continue to improve and become more user friendly, WGS will be
a powerful tool in future studies with a focus on human reproduction. For instance,
in studying preterm birth, it can be challenging to detect bacteria in the amniotic
fluid of patients.[61] Yet, recent studies have detected bacteria deep within human fetal membranes.[62] In addition, an independent study found that bacteria were harbored in the basal
plate of the placenta.[63] Remarkably, there was no statistically significant difference in the presence of bacteria between
preterm and term patients.[63] Thus, while previous sequencing techniques have failed to detect bacteria in the
placenta, the advent of NGS techniques may help to advance our understanding of the
role of the microbiome in promoting healthy, term pregnancies.
Conclusion
The rapid advancement of sequencing technology has brought both promise and challenges
to the metagenomics field. We can now explore unknown environments as community genomic
and ecological niches in a previously unparalleled and dynamic fashion. However, the downstream
analysis currently lags behind the sequencing technology. Compared with 16S-based
metagenomic sequencing, WGS generates exponentially more sequences, which necessitates
greater storage capacity, and reveals large numbers of unknown species, which demands
more computational resources. In this review, we have introduced several recently
developed tools dedicated to metagenomics assembly, gene prediction, and pathway
reconstruction. There is still a high demand for more efficient and more sensitive
tools to perform standardized analysis. In addition to WGS, RNA-based metatranscriptomics
is also under development to provide more detail on dynamic changes in the community,
which may alleviate the limitations of DNA-based methods. Metabolomics attempts
to measure the complete set of molecules in the community, which could provide important
information on the study of host–microbe interactions. In our era of “omics-based
discovery science,” physician scientists are increasingly called upon to work side
by side with computational scientists. It is our hope that this review will provide
our fellow microbiome-minded reproductive and perinatal biologists with a working
knowledge of the current state of the science.