An Overview of Next-Generation Sequencing
Over the last 56 years, researchers have been developing methods and technologies to assist in the determination of nucleic acid sequences in biological samples. Our ability to sequence DNA and RNA accurately has had a great impact in numerous research fields. This article discusses what next-generation sequencing (NGS) is, advances in the technology and its applications.
Contents
What is next-generation sequencing?
Next-generation sequencing methods
What is next-generation sequencing?
The structure of DNA was determined in 1953 by Watson and Crick based on the fundamental DNA crystallography and X-ray diffraction work of Rosalind Franklin.1,2 However, the first molecule to be sequenced was actually RNA – tRNA – in 1965 by Robert Holley, followed later by the RNA of bacteriophage MS2.3,4 Various research groups then began adapting these methods to sequence DNA, with a breakthrough coming in 1977 when Frederick Sanger and colleagues developed the chain-termination method.5 By 1986, the first automated DNA sequencing method had been developed.6,7 This was the beginning of a golden era for the development and refinement of sequencing platforms, including the pivotal capillary DNA sequencer.
The chain-termination method, also known as Sanger sequencing, uses a DNA sequence of interest as a template for a PCR that adds modified nucleotides, called dideoxyribonucleotides (ddNTPs), to the DNA strand during the extension step.8 When the DNA polymerase incorporates a ddNTP, extension ceases, generating numerous copies of the DNA sequence of all lengths spanning the amplified fragment. These chain-terminated oligonucleotides are then separated by size, using gel electrophoresis in early methods or capillary tubes in later automated capillary sequencers, and the DNA sequence is determined. With these immense technological advances, the Human Genome Project was completed in 2003.9 In 2005, the first commercially available NGS platform, or second-generation (2G) platform as it has become known, was introduced. In contrast to Sanger sequencing, it was able to amplify millions of copies of a particular DNA fragment in a massively parallel way.10
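The read-out logic of chain termination can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: it produces exactly one terminated fragment per position (real reactions terminate stochastically), it ignores primers and dye detection, and all function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the strand synthesised against `seq`, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def chain_terminated_fragments(template):
    """Model one terminated product per position of the new strand.

    Each fragment ends in the labelled ddNTP that stopped extension.
    """
    new_strand = reverse_complement(template)
    return [new_strand[:end] for end in range(1, len(new_strand) + 1)]


def read_by_size(fragments):
    """Order fragments by length, as electrophoresis does, and read each terminal base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


template = "GATTACA"
fragments = chain_terminated_fragments(template)
random.shuffle(fragments)  # the reaction mix holds fragments in no particular order
print(read_by_size(fragments))  # sequence of the newly synthesised strand
```

Sorting by length stands in for the size separation step: once the fragments are ordered, the identity of each terminal base spells out the synthesised strand one position at a time.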
The key principles behind Sanger sequencing and 2G NGS share some similarities.11,12 In 2G NGS, the genetic material (DNA or RNA) is fragmented and oligonucleotides of known sequence are attached to the fragments in a step known as adapter ligation, enabling the fragments to interact with the chosen sequencing system. The bases of each fragment are then identified by their emitted signals. The main difference between Sanger sequencing and 2G NGS stems from sequencing volume, with NGS allowing the processing of millions of reactions in parallel, resulting in high throughput, higher sensitivity, greater speed and reduced cost. A plethora of genome sequencing projects that took many years with Sanger sequencing methods can now be completed within hours using NGS.
There are two main approaches in NGS technology, short-read and long-read sequencing, each with its own advantages and limitations (Table 1).13 The main motivation for investing in the development of NGS is its wide applicability in both clinical and research settings. In clinical settings, NGS is used to diagnose various disorders, via identification of germline or somatic mutations.14,15 The shift towards NGS in clinical practice is justified by the power of the technique paired with the continually declining costs. NGS is also a valuable tool in metagenomic studies and is used for infectious disease diagnostics, monitoring and management.16,17 In 2020, NGS methods were pivotal in characterizing the SARS-CoV-2 genome and continue to contribute to monitoring of the COVID-19 pandemic.18,19
Figure 1: The evolution of sequencing methodologies.
Next-generation sequencing methods
The term NGS is often taken to mean 2G technologies; however, third (3G) and fourth (4G) generation technologies that work on different underlying principles have since evolved.
Sequencing platforms/sequencing technology
Second-generation sequencing methods are well-established and share many features in common. They can, however, be subdivided according to their underlying detection chemistries including sequencing by ligation (incorporating nanoball) and sequencing by synthesis (SBS), which further divides into proton detection, pyrosequencing and reversible terminator (Figure 2).
Figure 2: Diagram representing the principal 2G sequencing platforms and chemistries.
Proton detection sequencing relies on counting hydrogen ions released during the polymerization of DNA. Unlike other techniques, it does not use fluorescence and does not use modified nucleotides or optics. Instead, pH changes are detected by semiconductor sensor chips and converted to digital information.20
Pyrosequencing utilizes the detection of pyrophosphate generation and light release to determine whether a specific base has been incorporated in a DNA chain.21,22
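Both proton detection and pyrosequencing supply one nucleotide species at a time and record a signal whose size reflects how many bases were incorporated. The following is a minimal sketch of that idea, not any vendor's algorithm: the flow order `TACG` and the function names are assumptions, and the signal is modelled as a noise-free integer.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed


def flowgram(read, flow_order=FLOW_ORDER):
    """Return (base, signal) per flow, where signal = bases incorporated in that flow."""
    if set(read) - set(flow_order):
        raise ValueError("read contains a base that is never flowed")
    signals = []
    position = 0
    flow = 0
    while position < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated within a single flow.
        while position < len(read) and read[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Rebuild the read from ideal (noise-free) flow signals."""
    return "".join(base * run for base, run in signals)


print(flowgram("TTAG"))
```

Because a run of identical bases produces one larger signal rather than several separate ones, real instruments must estimate run length from signal intensity, which is why long homopolymers are harder to call accurately.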
By far the most popular SBS method is reversible terminator sequencing, which utilizes “bridge amplification”. During the synthesis reactions, the fragments bind to oligonucleotides on the flow cell, creating a bridge from one side of the sequence (P5 oligo on flow cell) to the other (P7), which is then amplified. The added fluorescently-labeled nucleotides are detected using direct imaging.23
Unlike SBS, sequencing by ligation does not use DNA polymerase to create a second strand. The sensitivity of DNA ligase to base-pairing mismatches is utilized instead, with the fluorescence produced used to determine the target sequence. Digital images taken after each reaction are then used for analysis. DNA nanoball sequencing is a form of sequencing by ligation that exploits rolling circle replication. Concatenated DNA copies are compacted into DNA nanoballs and bound to sequencing slides in a dense grid of spots ready for ligation-based sequencing reactions.24,25 Whilst the nanoball technique reduces running costs, the short sequences produced can be problematic for read mapping.
2G NGS technologies in general offer several advantages over alternative sequencing techniques, including the ability to generate sequencing reads in a fast, sensitive and cost-effective manner. However, there are also disadvantages, including poor interpretation of homopolymers and incorporation of incorrect dNTPs by polymerases, resulting in sequencing errors. The short read lengths also create the need for deeper sequencing coverage to enable accurate contig and final genome assembly.26–30 The major disadvantage of all 2G NGS techniques is the need for PCR amplification prior to sequencing. This is associated with PCR bias during library preparation (sequence GC-content, fragment length and false diversity) and analysis (base errors/favoring certain sequences over others).
The introduction of 3G sequencing circumvents the need for PCR, sequencing single molecules without prior amplification steps. The first single molecule sequencing (SMS) technology was developed by Stephen Quake and colleagues.31 Here, sequence information is obtained with the use of DNA polymerase by monitoring the incorporation of fluorescently labeled nucleotides to DNA strands with single base resolution. Depending on the method and the instrument used, some of the advantages of 3G NGS include:
- Real-time monitoring of nucleotide incorporation
- Non-biased sequencing and
- Longer read lengths
Nevertheless, high costs, high error rates, large quantities of sequencing data and low read depth can be problematic.32,33
In 4G systems the single-molecule sequencing of 3G is combined with nanopore technology. Similar to 3G, nanopore technology requires no amplification and uses the concept of single molecule sequencing but with the integration of tiny biopores of nanoscale diameter (nanopores) through which the single molecule passes and is identified. The 4G systems currently offer the fastest whole genome sequence scan but are still quite expensive, error prone compared to 2G techniques and relatively new. Consequently, there is currently less extensive data available for the technique.34
Main steps of 2G sequencing methods and next-generation sequencing library prep
Regardless of the 2G NGS method chosen, there are several main steps that must be tailored and optimized to the target (RNA or DNA) and sequencing system selected.
(1) Sample preparation (pre-processing)
Nucleic acids (DNA or RNA) are extracted from the selected samples (blood, sputum, bone marrow etc.). Extracted samples undergo quality control (QC) checks using standard methods (spectrophotometric, fluorometric or gel electrophoretic). If using RNA, this must be reverse transcribed into cDNA; however, some library preparation kits may include this step.
(2) Library preparation
Random fragmentation of the cDNA or DNA, typically by enzymatic treatment or sonication, is performed. The optimal fragment length depends on the platform being used. It may be necessary to run a small amount of fragmented sample on an electrophoresis gel when optimizing this process. These fragments are then end-repaired and ligated to smaller generic DNA fragments called adapters. Adapters have defined lengths with known oligomer sequences to be compatible with the applied sequencing platform and identifiable where multiplex sequencing is performed. Multiplex sequencing, using individual adapter sequences per sample, enables large numbers of libraries to be pooled and sequenced simultaneously in a single run. This pool of DNA fragments with adapters attached is known as a sequencing library.
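To illustrate how sample-specific adapter sequences allow pooled libraries to be separated again after sequencing, here is a minimal demultiplexing sketch. It assumes, purely for illustration, that each read begins with an exact-match inline barcode. Many platforms read the index separately and tolerate mismatches, and the barcodes, sample names and function name here are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by its leading barcode and strip the barcode."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No known barcode matched this read.
            bins["undetermined"].append(read)
    return dict(bins)


barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}  # hypothetical barcodes
pooled = ["ACGTGGGA", "TGCATTTC", "CCCCAAAA", "ACGTCCTA"]
print(demultiplex(pooled, barcodes))
```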
Size selection may then be performed, by gel electrophoresis or using magnetic beads, to remove any fragments that are too short or too long for optimal performance on the sequencing platform and protocol selected. Library enrichment/amplification is then achieved using PCR. In techniques involving emulsion PCR, each fragment is bound to a single emulsion bead which will go on to form the basis of sequencing clusters. Amplification is often followed by a “clean-up” step (e.g., using magnetic beads) to remove undesired fragments and improve sequencing efficiency.
The final libraries can undergo QC checks using qPCR, to confirm DNA quality and quantity. This will also allow the correct concentration of sample to be prepared for sequencing.
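Preparing the correct loading concentration usually means converting a mass concentration into molarity and then diluting. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA. That constant, the example numbers and the function names are assumptions for illustration rather than values taken from a specific protocol.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (widely used approximation)


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a library concentration in ng/uL to nM."""
    return conc_ng_per_ul / (AVG_BP_MASS * mean_fragment_bp) * 1e6


def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (stock uL, diluent uL) using C1 * V1 = C2 * V2."""
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds stock concentration")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul


# Example: a 10 ng/uL library with a mean fragment length of 500 bp.
print(round(library_molarity_nm(10, 500), 2), "nM")
print(dilution_volumes(stock_nm=20, target_nm=4, final_volume_ul=50))
```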
(3) Sequencing
Depending on the selected platform and chemistry, clonal amplification of library fragments may occur prior to sequencer loading (emulsion PCR) or on the sequencer itself (bridge PCR). Sequences are then detected and reported according to the platform selected.35
(4) Data analysis
The generated data files are analyzed depending on the workflow used. Analysis methods are highly dependent on the aim of the study.36–38
Whilst they may reduce the number of samples that can be analyzed in a given run, paired-end and mate pair sequencing offer advantages in downstream data analysis, particularly for de novo assemblies. The techniques link sequencing reads together that are read from both ends of a fragment (paired-end) or are separated by an intervening DNA region (mate pair).
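The relationship between a fragment and its two paired-end reads can be sketched as follows. This is an idealized, error-free model with hypothetical function names: read 1 is taken from one end of the fragment and read 2 from the opposite strand at the other end, so it appears as a reverse complement.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) sequenced inward from both ends of the fragment."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


def inner_distance(fragment_length, read_length):
    """Unsequenced bases between the two reads (negative if the reads overlap)."""
    return fragment_length - 2 * read_length


fragment = "AACCGGTTAC"
print(paired_end_reads(fragment, 4))
print(inner_distance(len(fragment), 4))
```

Because the approximate fragment length is known from library preparation, the expected distance between the two reads constrains where they can be placed, which is what helps assemblers and aligners in repetitive regions.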
There are clearly many options when it comes to selecting a sequencing strategy. The following are some of the key considerations when deciding on the appropriate library preparation and sequencing platform:
(a) Research question being asked
(b) Sample type
(c) Short-read or long-read sequencing
(d) DNA or RNA sequencing – do you need to look at the genome or transcriptome?
(e) Is the whole genome required or only specific regions?
(f) Read depth (coverage) needed – experiment-specific
(g) Extraction method
(h) Sample concentration
(i) Single end, paired end or mate pair reads
(j) Specific read length required
(k) Could samples be multiplexed?
(l) Bioinformatic tools – experiment dependent. Depending on the sample and the biological question, the entire process of sequence analysis can be adapted.
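Several of these considerations (read depth, read length and multiplexing) are linked by a simple estimate: mean coverage is roughly the number of reads multiplied by the read length and divided by the genome size. The sketch below applies that estimate. The genome size is an approximate figure used as an assumption, the function names are hypothetical, and the calculation ignores duplicates and unmapped reads.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Expected mean depth: coverage = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


HUMAN_GENOME_BP = 3_100_000_000  # approximate haploid genome size (assumption)

# Reads of 150 bp needed for 30x mean coverage of a human genome.
print(reads_needed(30, 150, HUMAN_GENOME_BP))
```

For paired-end runs each fragment yields two reads, so the number of fragments (clusters) required is half the value returned by `reads_needed`.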
Short-read vs long-read sequencing
The advantages and disadvantages of short- and long-read sequencing are summarized in Table 1.
Table 1: Advantages and disadvantages of short-read vs long-read sequencing.
Approach | Advantages | Disadvantages |
Short-read sequencing | · Higher sequence fidelity · Cheap · Can sequence fragmented DNA | · Not able to resolve structural variants, phasing alleles or distinguish highly homologous genomic regions · Unable to provide coverage of some repetitive regions |
Long-read sequencing | · Able to sequence genetic regions that are difficult to characterize with short-read seq due to repeat sequences · Able to resolve structural rearrangements or homologous regions · Able to read through an entire RNA transcript to determine the specific isoform · Assists de novo genome assembly | · Lower per read accuracy · Bioinformatic challenges, caused by coverage biases, high error rates in base allocation, scalability and limited availability of appropriate pipelines |
Next-generation sequencing data analysis
Any kind of NGS technology generates a significant amount of output data. The basics of sequence analysis follow a centralized workflow which includes a raw read QC step, pre-processing and mapping, followed by post-alignment processing, variant calling, variant annotation and visualization.
Assessment of the raw sequencing data is imperative to determine their quality and pave the way for all downstream analyses. It can provide a general view on the number and length of reads, any contaminating sequences, or any reads with low coverage. One of the most well-established applications for computing quality control statistics of sequencing reads is FastQC. However, for further pre-processing, such as read filtering and trimming, additional tools are needed. Trimming bases towards the ends of reads and removing leftover adapter sequences generally improves data quality. More recently, ultra-fast tools have been introduced, such as fastp, that can perform quality control, read filtering and base correction on sequencing data, combining most features from the traditional applications while also running two to five times faster than any of them alone.39
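The trimming operations described above can be sketched in a few lines. This is a deliberately simple illustration, not how FastQC or fastp work internally: it assumes Phred+33 quality encoding, trims only the 3' end, requires an exact adapter match, and uses a placeholder adapter sequence and hypothetical function names.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_low_quality_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose Phred score is below `min_q`."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    index = seq.find(adapter)
    return seq if index == -1 else seq[:index]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 under Phred+33.
print(trim_low_quality_3prime("ACGTACGT", "IIIII##!"))
print(trim_adapter("ACGTAGATCGG", "AGATCGG"))  # placeholder adapter sequence
```

Dedicated tools typically use sliding-window quality statistics and mismatch-tolerant adapter matching, so a sketch like this is useful for understanding the principle rather than for production use.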
After the quality of the reads has been checked and pre-processing performed, the next step will depend on the existence of a reference genome. In the case of a de novo genome assembly, the generated sequences are aligned into contigs using their overlapping regions. This is often done with the assistance of processing pipelines that can include scaffolding steps to help with contig ordering, orientation and the removal of repetitive regions, thus increasing the assembly continuity.40,41
If the generated sequences are mapped (aligned) to a reference genome or transcriptome, variations compared to the reference sequence can be identified. Today, there is a plethora of mapping tools (more than 60) that have been adapted to handle the growing quantities of data generated by NGS, exploit technological advancements and tackle protocol developments.42 One difficulty, due to the increasing number of mappers, is being able to find the most suitable one. Information is usually scattered through publications, source code (when available), manuals and other documentation. Some of the tools will also offer a mapping quality check, which is necessary as some biases will only show after the mapping step.
Similar to quality control prior to mapping, the correct processing of mapped reads is a crucial step, during which duplicated mapped reads (including but not limited to PCR artifacts) are removed. This is a standardized method, and most tools share common features. Once the reads have been mapped and processed, they need to be analyzed in an experiment-specific fashion, in what is known as variant analysis. This step can identify single nucleotide polymorphisms (SNPs), indels (an insertion or deletion of bases), inversions, haplotypes, differential gene transcription in the case of RNA-seq and much more.
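Duplicate removal and SNP identification can be illustrated with a toy pileup. The sketch assumes ungapped alignments given as (0-based start, sequence) pairs that lie fully within the reference, treats identical start and sequence as a duplicate, and uses arbitrary depth and allele-fraction thresholds. Production tools use richer criteria, and all names here are hypothetical.

```python
from collections import Counter, defaultdict


def remove_duplicates(alignments):
    """Keep the first alignment for each (start, sequence) pair."""
    seen = set()
    unique = []
    for start, seq in alignments:
        if (start, seq) not in seen:
            seen.add((start, seq))
            unique.append((start, seq))
    return unique


def call_snps(reference, alignments, min_depth=2, min_fraction=0.8):
    """Report (position, ref_base, alt_base) where most reads disagree with the reference."""
    pileup = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1
    snps = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (depth >= min_depth and base != reference[position]
                and support / depth >= min_fraction):
            snps.append((position, reference[position], base))
    return snps


reference = "ACGTACGTAC"
alignments = [(0, "ACGTTC"), (0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA")]
unique = remove_duplicates(alignments)
print(call_snps(reference, unique))
```

Removing duplicates before counting matters because a PCR duplicate would otherwise be counted as independent evidence for the same allele.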
Despite the multitude of tools for genome assembly, alignment and analysis, there is a constant need for new and improved versions to ensure that the sensitivity, accuracy and resolution can match the rapidly advancing NGS techniques.
The final step is visualization, for which data complexity can pose a significant challenge. Depending on the experiment and the research questions posed, there are a number of tools that can be used. If a reference genome is available, the Integrative Genomics Viewer (IGV) is a popular choice,43 as is the Genome Browser. If experiments include WGS or WES, the Variant Explorer is a particularly good tool as it can be used to sieve through thousands of variants and allow users to focus on their most important findings. Visualization tools like VISTA allow for comparison between different genomic sequences. Programs suitable for de novo genome assemblies are more limited.44 However, tools like Bandage and Icarus have been used to explore and analyze the assembled genomes.
Next-generation sequencing bottlenecks
NGS has enabled us to discover and study genomes in ways that were never possible before. However, the complexity of the sample processing for NGS has exposed bottlenecks in managing, analyzing and storing the datasets. One of the main challenges is the computational resources required for the assembly, annotation, and analysis of sequencing data.45 The vast amount of data generated by NGS analysis is another critical challenge. Data centers are reaching high storage capacity levels and are constantly trying to cope with increasing demands, running the risk of permanent data loss.46 More strategies are continuously being suggested with the aim to increase efficiency, reduce sequencing error, maximize reproducibility and ensure correct data management.
Next-generation sequencing applications
Since the early 2000s NGS has become an invaluable tool in both research and clinical/diagnostic settings for modern medicine and in drug discovery, with the use of methods including WGS, WES, targeted sequencing, transcriptome, epigenome and metagenome sequencing dramatically increasing. Figure 3 summarizes workflows and options for targeting different datasets.
Figure 3: Flow diagram indicating possible sequencing strategies for different sample types.
Through WGS, researchers are able to study not only genes and their involvement in disease in humans and animals, but also characteristics of microbial and agricultural populations, providing important epidemiological and evolutionary data.47–52 Thus far, there has been a plethora of studies where mutations, rearrangements and fusion events were identified using WGS. Currently, WGS is used for the surveillance of antimicrobial resistance, one of the major global health challenges.53,54 As the costs are constantly decreasing, WGS is more frequently used for resequencing the entire human genome in clinical samples and may soon become routine in clinical practice.55 Ultimately, WGS will be needed to assign functionality to the remaining majority of the genome and decipher its role in diseases.
Their more focused nature makes WES and targeted sequencing attractive options for population and clinical studies.56,57 Although, as its name suggests, WES is limited to the exome, it is an important clinical tool in the personalized medicine field. Genetic diagnoses for certain diseases, like cancer, as well as genetic characterization for other disorders can be achieved with this method in a more cost-effective way than WGS.
In addition to the many applications that NGS has in sequencing DNA, it can also be used for RNA analysis. This enables, for example, the genomes of RNA viruses, such as SARS and influenza, to be determined. Importantly, RNA-seq is frequently used in quantitative studies, facilitating not only the identification of transcribed genes in a DNA genome, but also the level at which they are transcribed (transcription level) according to the relative abundance of RNA transcripts. Potential rearrangements of the DNA sequences may also be identified through the identification of novel transcripts.58,59
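Relative transcript abundance is often expressed in normalized units so that genes of different lengths and libraries of different sizes can be compared. As one common example (the article does not prescribe a particular unit), the sketch below computes transcripts per million (TPM) from read counts and transcript lengths. The gene names, counts and function name are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Compute transcripts per million from raw counts and transcript lengths."""
    # Reads per kilobase corrects for longer transcripts yielding more reads.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scaling so that values sum to one million corrects for sequencing depth.
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300}       # hypothetical mapped read counts
lengths = {"geneA": 1000, "geneB": 3000}    # hypothetical transcript lengths (bp)
print(tpm(counts, lengths))
```

In this example geneB has three times as many reads but is also three times as long, so both genes receive the same TPM value.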
Epigenomic sequencing allows the study of changes caused by histone modifications and DNA methylation. There are different methods employed for the study of epigenetic mechanisms, including whole genome bisulfite sequencing (WGBS), chromatin immunoprecipitation (ChIP-seq) and methylated DNA immunoprecipitation (MeDIP-seq) followed by sequencing.60,61 Depending on the selected method, the complete DNA methylome and histone modification profiles can be mapped and studied, gaining insights into genomic regulatory mechanisms.
Metagenomic sequencing can provide information for samples collected in a specific environment. It enables the comparison of differences and interactions between mixed microbial populations, as well as host responses. Some of the potential applications of metagenomic sequencing include, but are not limited to, infectious disease diagnostics and infection surveillance, antimicrobial resistance monitoring, microbiome studies and pathogen discovery.62
Technological advances in sample preparation, sequencing technologies and data analysis mean that NGS is also being used at the single cell level to study heterogeneities and rare changes in DNA, RNA and the epigenome.
Next-generation sequencing key terms and abbreviations
Table 2: Key terms and abbreviations relating to NGS.
DNA | Deoxyribonucleic acid |
RNA | Ribonucleic acid |
tRNA | Transfer ribonucleic acid |
NGS | Next-generation sequencing |
PCR | Polymerase chain reaction |
cDNA | Complementary DNA |
gDNA | Genomic DNA |
RNA-seq | RNA-sequencing |
SMS | Single molecule sequencing |
SBS | Sequencing by synthesis |
WGS | Whole genome sequencing |
WES | Whole exome sequencing |
WGBS | Whole genome bisulfite sequencing |
ChIP-seq | Chromatin immunoprecipitation sequencing |
MeDIP-seq | Methylated DNA immunoprecipitation followed by sequencing |
P5 | Primer 5 (sequencing adapter) |
P7 | Primer 7 (sequencing adapter) |
3G | Third-generation sequencing |
4G | Fourth-generation sequencing |
dNTPs | Deoxynucleoside triphosphates |
FastQC | Fast quality control |
Flow cell | Glass slide containing fluidic channels |
Library | Pool of DNA fragments with adapters attached |
Indel | Insertion or deletion of bases |
Adapters | Platform-specific sequences for fragment recognition |
fastp | Fast preprocessor |
De novo sequencing | Novel genome sequencing in the absence of a reference sequence |
Contigs | From “contiguous” - overlapping DNA fragments |
SNP | Single nucleotide polymorphism |
Scaffold | Created by linking contigs together using additional information |
SBL | Sequencing by ligation |
Paired-end | Reading a sequencing fragment from both ends and linking the data |
Mate pair | Linking sequencing reads separated by an intervening DNA region |