Long-read sequencing to understand genome biology and cell function

https://doi.org/10.1016/j.biocel.2020.105799

Abstract

Determining the sequence of DNA and RNA molecules has a huge impact on the understanding of cell biology and function. Recent advancements in next-generation short-read sequencing (NGS) technologies, drops in cost, and a resolution down to the single-cell level have shaped our current view of genome structure and function. Third-generation sequencing (TGS) methods further complete the knowledge about these processes based on long reads and the ability to analyze DNA or RNA at the single-molecule level. Long-read sequencing provides additional possibilities to study genome architecture and the composition of highly complex regions and to determine epigenetic modifications of nucleotide bases at a genome-wide level. We discuss the principles and advancements of long-read sequencing and its applications in genome biology.

Keywords

Third-generation sequencing

Long-read sequencing

Nanopore sequencing

SMRT sequencing

Genomics

1. Introduction

Massively parallel sequencing, also known as next-generation sequencing (NGS), came as a disruptive innovation into the field of life science. Within a couple of years, NGS led to a dramatic increase in knowledge of the genomes of different organisms, their architecture, function, and genetic variation down to the single-cell level (Shendure et al., 2017). Various methods based on semiconductors (Ion Torrent), pyrosequencing (454 Life Sciences, Roche), sequencing by ligation (Applied Biosystems), and sequencing by synthesis with reversible terminators (Solexa, Illumina) allowed fast and precise DNA and RNA sequencing (Metzker, 2010). However, short-read sequencing methods are limited in their capability to investigate complex genomes, repetitive elements, full-length transcripts, or native base modifications. Several of these limitations can be overcome by long-read technologies (third-generation sequencing technologies, TGS). In the following, we discuss the applications of long-read sequencing to understand genome function. The review focuses on the technical applications of long-read methods, which can be applied to the most diverse questions in cell biology.

2. Nanopore sequencing

The original idea of analyzing nucleotide sequences with nanopores was born in the 1980s, but it took more than 30 years for the technology to reach market maturity (company: Oxford Nanopore Technologies, ONT) (Deamer et al., 2016; Kasianowicz and Bezrukov, 2016). In Nanopore sequencing, a voltage is applied across a tiny pore to drive an ion flow. Each molecule entering the pore interferes with the ion flow and therefore induces a characteristic and measurable change in the current. ONT utilizes a biological (protein-based) nanopore derived from a bacterial protein. This pore is embedded in an electrically resistant polymer membrane, which is surrounded by a microscaffold structure that separates each pore from the others. DNA or RNA translocates through the pore, driven by the applied voltage, and the translocation is actively controlled by an ATP-dependent motor protein that slows down the translocation speed and allows measurements at high resolution. As the narrowest part of the pore harbors at least 5 bases of the oligonucleotide at a time, the measured current signal corresponds to a 5mer. This has the advantage that every base is measured 5 times while being pulled through the pore, which improves the consensus accuracy to 96 % with the most advanced pore (R10.3) and basecaller (guppy version 3.6). The respective current profile is characteristic for a 5mer and can be translated into nucleotide sequence by the basecalling algorithm (Fig. 1).
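The statement that every base is measured 5 times follows from the overlap of consecutive 5mers. The following minimal Python sketch illustrates this with a made-up sequence; the function names are hypothetical, and it models only the sliding window, not the current signal or any ONT basecaller.

```python
def kmers(sequence: str, k: int = 5) -> list[str]:
    """Return all overlapping k-mers in the order they would occupy the pore."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_covering(sequence: str, position: int, k: int = 5) -> list[str]:
    """Return the k-mers whose window contains the base at `position` (0-based)."""
    first = max(0, position - k + 1)           # leftmost window that still includes the base
    last = min(position, len(sequence) - k)    # rightmost window that starts at or before the base
    return [sequence[i:i + k] for i in range(first, last + 1)]


read = "ACGTTGCAAGT"                 # illustrative sequence, not real data
print(kmers(read))                   # all 5mers along the read
print(kmers_covering(read, 5))       # the 5mers that include the base at index 5
```

An interior base contributes to five consecutive 5mers, whereas bases near the ends of the molecule are covered by fewer windows.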

Fig. 1. Principle of Nanopore sequencing.

DNA or RNA is analyzed by threading oligonucleotides through a tiny protein pore. A voltage applied along the pore drives an ion flow leading to a measurable current. The DNA/RNA is unzipped by a motor protein which also controls the translocation speed. Contiguous nucleotides entering the pore lead to a characteristic change in the current and the signal can be translated into the respective nucleotide sequence. DNA and RNA can be sequenced and processed in real time at a speed of 450 bases per second per pore. Each flow cell operates up to 50 (Flongle), 512 (MinION / GridION) or 3000 (PromethION) pores in parallel.
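The figures in the legend allow a rough estimate of the theoretical output ceiling per flow cell. The sketch below multiplies the translocation speed by the number of parallel pores; it assumes that every pore sequences continuously, which is an idealization (pore occupancy and pore loss over time are ignored), so real yields are lower.

```python
BASES_PER_SECOND_PER_PORE = 450  # translocation speed given in the figure legend

# Maximum number of parallel pores per flow cell, as given in the figure legend
PORES_PER_FLOW_CELL = {"Flongle": 50, "MinION/GridION": 512, "PromethION": 3000}


def max_bases_per_hour(pores: int, rate: int = BASES_PER_SECOND_PER_PORE) -> int:
    """Upper bound on bases per hour, assuming all pores are busy all the time."""
    return pores * rate * 3600


for device, pores in PORES_PER_FLOW_CELL.items():
    gigabases = max_bases_per_hour(pores) / 1e9
    print(f"{device}: at most {gigabases:.2f} Gb per hour")
```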

This polymerase-free method is also unique in its ability to sequence native RNA in real time and hence to analyze the many different base modifications that are present in RNA molecules. Modified nucleotides block the current in a characteristic manner and hence can be identified. However, current basecallers are only able to distinguish a few of these modifications (Liu et al., 2019; Xu and Seki, 2020). Similarly, DNA modifications can be analyzed without pretreatment (e.g. bisulfite conversion) of the genomic DNA (Giesselmann et al., 2019; Ni et al., 2019; Sharim et al., 2019). Another advantage of Nanopore sequencing is the enormous read length of up to 2 million bases, since the method is independent of polymerase processivity.

3. Single molecule real-time (SMRT) sequencing

SMRT (single molecule real-time) sequencing from Pacific Biosciences (PacBio) also provides long reads of native DNA. The method relies on fluorescence-labeled nucleotides incorporated by a polymerase that is immobilized at the bottom of so-called ZMWs (zero-mode waveguides). These picoliter-sized wells are assembled on a flow cell and allow the detection of fluorescence signals from millions of molecules in parallel. In contrast to NGS methods, the incorporation of nucleotides is detected in real time and from single molecules. Furthermore, the incorporation time allows conclusions about the respective base modifications. To reduce the error rate, PacBio provides a high-accuracy protocol (circular consensus sequencing, CCS). The circularized DNA templates are read several times by the polymerase, which increases the accuracy from ∼90 % up to 99.8 %.
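Why repeated passes over the same template raise accuracy can be illustrated with a simple binomial model. The sketch below is not PacBio's consensus algorithm: it assumes that each pass calls a base either correctly or incorrectly, that errors in different passes are independent, and that the consensus is a plain majority vote. The function name is hypothetical.

```python
from math import comb


def majority_accuracy(per_pass_accuracy: float, passes: int) -> float:
    """Probability that more than half of `passes` independent calls of a base are correct."""
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    needed = passes // 2 + 1  # smallest number of correct calls that forms a majority
    return sum(
        comb(passes, k) * per_pass_accuracy**k * (1 - per_pass_accuracy) ** (passes - k)
        for k in range(needed, passes + 1)
    )


# Per-pass accuracy of ~90 % as stated in the text
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_accuracy(0.90, n), 5))
```

Under these simplifying assumptions the consensus accuracy rises quickly with the number of passes. Real consensus callers weigh the evidence probabilistically and must also handle insertions and deletions, so the numbers are illustrative only.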

4. Other long-read/ cytogenetic technologies

Synthetic long-read technologies provide alternative methods to obtain information on long DNA fragments. Methods such as linked-read sequencing (10x Genomics) and stLFR (MGI) allow the in silico assembly of long sequences from short-read NGS data. Moreover, next-generation cytogenetics enables the analysis of single DNA strands at megabase scale. Optical mapping approaches (Bionano) and molecular combing techniques (Genomic Vision) are amongst these novel cytogenetic approaches. Bionano utilizes restriction enzymes to add fluorescence labels to defined sites of the DNA. Labeled DNA molecules are linearized in nanochannel arrays on a chip and are imaged and assembled into genomic optical maps. Changes in the label pattern can define structural variants at a genome-wide level. Molecular combing (Genomic Vision) uses fluorescence in situ hybridization (FISH) probes that are hybridized to DNA immobilized and stretched on glass slides. The distance and order of the respective fluorescence signals allow conclusions about genomic alterations.

5. Structural variations, complex haplotypes and chromosomal rearrangements

Structural variations (SVs) are a rich source of genome evolution and inter-individual variation, but acquired SVs can also drive pathological processes such as cancer development. SVs including copy number variants (deletions, amplifications) can be detected by comparative genomic hybridization approaches (SNP arrays, CGH arrays) and, to a certain extent, by short-read sequencing methods. However, complex structural rearrangements, inversions, balanced chromosomal translocations, and other copy-neutral SVs are easily missed or impossible to detect by array or short-read sequencing approaches. Long reads, in contrast, facilitate unambiguous mapping and thereby the detection of a large proportion of genomic SVs (Fig. 2). Complex or repetitive regions of the genome such as segmental duplications, telomeres, or centromeres become accessible (Li and Freudenberg, 2014). Jain et al., for example, demonstrated the sequencing of the centromeric region of the Y chromosome (Jain et al., 2018). Many examples of SV detection by long-read sequencing have recently been published and include constitutive structural variants (Kraft et al., 2019; Sanchis-Juan et al., 2018), complex genomic aberrations (Cretu Stancu et al., 2017; Sedlazeck et al., 2018b), or somatic genomic rearrangements (Euskirchen et al., 2017; Gong et al., 2018). Exact breakpoints of indels or fusion genes can be defined using long-read approaches (Chaisson et al., 2019; De Coster et al., 2018; Jeck et al., 2019; Kraft et al., 2019; Seo et al., 2016). Highly complex genomic regions have been resolved, as exemplified by sequencing of killer cell immunoglobulin-like receptor (KIR) or human leukocyte antigen (HLA) regions (Ameur et al., 2019). The largest human Nanopore sequencing study to date, on 1817 Icelanders, identified a median of 23,111 autosomal SVs per individual, with a median of 11,506 insertions and 11,576 deletions (Beyter et al., 2019). The total size of SVs comprised a median of 9.9 Mb and thus a substantial part of genetic information (Beyter et al., 2019). Next-generation cytogenetics methods are also suitable for detecting genomic aberrations (Zhang et al., 2019). Here, optical mapping allows high coverage per sample (∼300x) and has advantages in the detection of low-grade somatic SVs (Neveling et al., 2020).
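Because a long read can span an entire insertion or deletion, such events appear directly in the alignment of a single read. The following sketch scans a SAM-style CIGAR string for large insertions and deletions. It is a simplified illustration rather than the method of any cited SV caller: the function name and the example CIGAR string are made up, and the 50 bp size cutoff is an assumed convention. Dedicated tools additionally evaluate split alignments, read depth, and support across reads.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # CIGAR operations that advance the reference position


def large_indels(cigar: str, ref_start: int, min_size: int = 50):
    """Return (type, reference_position, length) for insertions/deletions >= min_size."""
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "ID" and length >= min_size:
            events.append(("INS" if op == "I" else "DEL", ref_pos, length))
        if op in CONSUMES_REFERENCE:
            ref_pos += length
    return events


# Hypothetical alignment of one long read starting at reference position 10,000
print(large_indels("1200M310I800M2D450M95D3000M", ref_start=10_000))
```

Small indels such as the 2 bp deletion in the example fall below the cutoff; at this size they are often indistinguishable from sequencing errors in a single read.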

Fig. 2. Overview of the possibilities of long-read sequencing. (A) The major difference between NGS methods and TGS techniques is the read length of the individual reads. The average NGS read length is between 50 and 300 nucleotides, whereas TGS allows a mean read length between 10,000 and 100,000 nucleotides and is able to generate reads in the megabase range (top panel). This read length is helpful to resolve structural variations (B) or to determine the exact size of tandem repeats (C). (D) Phasing of sequence data can be dramatically improved, which allows, for example, pseudogenes to be discriminated from genes of interest and makes it possible to determine whether variants occur in cis or in trans (E). Figure adapted from (Mantere et al., 2019).

6. Repeat architecture

The size and structure of many repetitive regions of genomes are hardly accessible with short-read sequencing technologies (Tørresen et al., 2019). However, an increasing number of repetitive elements have been linked to human diseases, which has led to a growing interest in the study of these regions (Hagerman et al., 2017; McColgan and Tabrizi, 2018; Paulson, 2018). Long-read sequencing enables their analysis in a single read and thus the exact determination of length, composition, and repeat count (Giesselmann et al., 2019). For example, novel intronic TTTTA/TTTCA repeat expansions in different genes (MARCH6, SAMD12, STARD7) were shown to cause inherited types of epilepsy (Corbett et al., 2019; Florian et al., 2019; Ishiura et al., 2018). The underlying repeats are several kb long and consist of variable numbers of TTTTA and TTTCA motifs. Another example is a trinucleotide (GGC) repeat expansion in the NOTCH2NLC gene leading to a neurodegenerative disease (Sone et al., 2019). In the case of nanopore sequencing, the length of the repeats can be estimated directly from the raw signal without the need for error-prone basecalling (Fig. 3).

Fig. 3. Detection of tandem repeat expansions from Nanopore sequencing raw signal traces.

Example plots showing the raw Nanopore sequencing signal from a tandem repeat expansion. The repeat consists of two distinct sequence motifs indicated in green and blue. Adjacent sequences are shown in grey. The upper plots show representative current profiles of three repeat units for each motif (own unpublished data).

Besides measuring repeat size, the methylation status of repeats can be determined. For example, both the length and the methylation status of an unstable CGG repeat in the 5′ untranslated region of the FMR1 gene correlate with the severity of Fragile X syndrome, a neurogenetic disorder (Ardui et al., 2017; Giesselmann et al., 2019). The structure and length of low-complexity regions at human telomeres and centromeres can also be delineated with Nanopore sequencing, and the first end-to-end chromosome assemblies are available (Miga et al., 2019).
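When a single read spans a whole repeat, its length and composition can be read off the sequence directly. The sketch below counts the units of a mixed TTTTA/TTTCA repeat in a basecalled read. It is a deliberately naive illustration: the function name and the example read are made up, exact string matching is assumed, and sequencing errors inside the repeat would interrupt the run. This limitation is one reason why the signal-level approaches mentioned in this section are attractive.

```python
import re


def count_repeat_units(sequence: str, motifs=("TTTTA", "TTTCA")):
    """Locate the longest uninterrupted run of the given motifs and count each motif in it."""
    unit = "|".join(re.escape(m) for m in motifs)
    runs = re.finditer(f"(?:{unit})+", sequence)
    longest = max(runs, key=lambda m: len(m.group()), default=None)
    if longest is None:
        return {"start": None, "length": 0, "counts": {m: 0 for m in motifs}}
    units = re.findall(unit, longest.group())  # split the run back into its repeat units
    return {
        "start": longest.start(),
        "length": len(longest.group()),
        "counts": {m: units.count(m) for m in motifs},
    }


# Made-up read: flanking sequence, then 4 + 3 + 2 repeat units, then flanking sequence
read = "GATTACA" + "TTTTA" * 4 + "TTTCA" * 3 + "TTTTA" * 2 + "CCGGA"
print(count_repeat_units(read))
```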

7. Epigenetic regulation

Over 150 types of base modifications have been described so far (Xu and Seki, 2020). These modifications are crucial in many aspects of biology, including development, cellular maintenance, ageing, or cancer. However, previously available sequencing technologies allowed only limited insight into nucleic acid modifications. Because base modifications lead to characteristic changes in the current profiles when the respective bases are pulled through nanopores, the method detects various chemical modifications of nucleic acids. These modifications can be detected from the raw electric signals of nanopore data by a bidirectional recurrent neural network (Liu et al., 2019) or deep learning algorithms (Ni et al., 2019). SMRT sequencing also allows the detection of DNA modifications (Flusberg et al., 2010). The modification of the bases leads to altered incorporation times of nucleotides by the polymerase; however, a relatively high coverage is needed for reliable detection (Kelleher et al., 2018). Both methods make it possible to discriminate between 5-methylcytosine and 5-hydroxymethylcytosine, or to detect N6-methyladenosine (Rand et al., 2017; Simpson et al., 2017; Sun et al., 2019). Moreover, native CpG methylation and chromatin accessibility/structure can be studied in parallel using long reads (Lee et al., 2018; Shipony et al., 2018). Similarly, various base modifications present on native RNA molecules can be detected (Garalde et al., 2018). Nanopore sequencing is also capable of analyzing regions of differential methylation between parental alleles, which results in parent-of-origin-specific control of gene expression. These so-called imprinted regions show differential methylation of CpG sites, and the long reads make it possible to determine the haplotype of each read (“phasing”) (Gigante et al., 2019) (Fig. 2). Recently, a sequencing method for the measurement of replication fork movements on single molecules was reported, which detects the signal currents of nucleotide analogs in nanopore traces. This makes it possible to determine replication dynamics, which are otherwise difficult to address and still poorly understood (Hennion et al., 2018; Muller et al., 2019). The three-dimensional spatial organization of chromatin in cells was investigated by a combination of long-read sequencing and modified chromatin-conformation capture (Hi-C) assays (Ulahannan et al., 2019; Vermeulen et al., 2020). Applying these methods allows a deeper understanding of epigenetic regulation and chromatin architecture.
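Once each read has been assigned to a haplotype and carries per-site methylation calls, allele-specific methylation at imprinted regions reduces to a grouped frequency. The sketch below shows this aggregation step only. The input format (haplotype label, CpG position, methylation call) is hypothetical, and both phasing and methylation calling are assumed to have been done beforehand by dedicated tools.

```python
from collections import defaultdict


def methylation_by_haplotype(calls):
    """Aggregate per-read CpG calls into methylation fractions per haplotype and site.

    calls: iterable of (haplotype, cpg_position, is_methylated), one entry per read and site.
    Returns {haplotype: {cpg_position: fraction_methylated}}.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [methylated, total]
    for haplotype, position, is_methylated in calls:
        tally = counts[haplotype][position]
        tally[0] += int(is_methylated)
        tally[1] += 1
    return {
        hap: {pos: meth / total for pos, (meth, total) in sites.items()}
        for hap, sites in counts.items()
    }


# Made-up calls: haplotype H1 is methylated, H2 is largely unmethylated
calls = [
    ("H1", 101, True), ("H1", 101, True), ("H1", 150, True),
    ("H2", 101, False), ("H2", 101, False), ("H2", 150, False), ("H2", 150, True),
]
print(methylation_by_haplotype(calls))
```

A large difference in methylation fraction between the two haplotypes at the same CpG sites is the pattern expected at an imprinted region.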

8. RNA sequencing, alternative splicing, and single cell sequencing

Alternative splicing of mRNAs is a mechanism to increase protein diversity and function. Nanopore and SMRT sequencing make it possible to capture entire transcripts within single reads, which provides a comprehensive view of isoforms and splicing events (Soneson et al., 2019). The power of long-read sequencing in RNA analysis is underlined by the fact that over 50 % of the isoforms identified in Nanopore sequencing transcriptome analyses are not covered by short-read sequencing datasets (Workman et al., 2019). Furthermore, long reads capture the size of the poly-A tail, which impacts mRNA stability and translation (Legnini et al., 2019). This length can also be analyzed from the nanopore sequencing raw signals without basecalling (Krause et al., 2019; Parker et al., 2020; Workman et al., 2019). In addition, base modifications on RNA molecules can be studied (Xu and Seki, 2020). Recent publications also show that long-read sequencing can be combined with single-cell RNA sequencing methods (Gupta et al., 2018; Lebrigand et al., 2019; Singh et al., 2019; Volden et al., 2018; Volden and Vollmers, 2020). Combining the technologies generates sequence information not only from the 3′ or 5′ end of the transcript, but also makes it possible to analyze full-length transcripts from single cells (Lebrigand et al., 2019; Singh et al., 2019; Volden and Vollmers, 2020). Long-read sequencing methods show limitations such as lower accuracy (nanopore sequencing) and limited sequencing depth (SMRT sequencing) in comparison to short-read single-cell analysis; however, the ability to address full-length transcriptomics and cell type-specific alternative splicing makes these technologies attractive for several applications. Moreover, combining transcriptome analysis with unique molecular identifiers (UMIs) increases the accuracy of the Nanopore data to 99.9 % (Karst et al., 2019). In summary, long-read sequencing is a powerful tool for in-depth analysis of mRNA composition and biology.
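The idea behind UMI-based error correction is that reads sharing a UMI derive from the same original molecule, so random errors can be voted out. The following sketch shows the principle with a per-position majority vote. It is not the pipeline of Karst et al.: the function name and data are made up, and the reads within a UMI group are assumed to have equal length and to differ only by substitutions. Real nanopore reads contain insertions and deletions and must be aligned before a consensus is built.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Build one consensus sequence per UMI by per-position majority vote.

    reads: iterable of (umi, sequence); sequences sharing a UMI are assumed
    to be equal-length copies of one molecule (a simplification).
    """
    groups = defaultdict(list)
    for umi, sequence in reads:
        groups[umi].append(sequence)

    consensus = {}
    for umi, sequences in groups.items():
        columns = zip(*sequences)  # one tuple of bases per read position
        consensus[umi] = "".join(Counter(col).most_common(1)[0][0] for col in columns)
    return consensus


reads = [
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACCT"),  # one substitution error at position 6
    ("AACG", "ACGTACGT"),
    ("TTGC", "GGGTTTAA"),  # a molecule observed only once
]
print(umi_consensus(reads))
```

The erroneous base in the second read is outvoted by the two correct copies, whereas a molecule observed only once cannot be corrected.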

9. De novo genome assembly

An important application of long-read sequencing is the de novo assembly of prokaryotic and eukaryotic genomes (van Dijk et al., 2018). Especially in polyploid organisms such as wheat or Xenopus species and in regions of low complexity, long reads facilitate correct genome assembly into large continuous contigs (Genova et al., 2019; Kapustová et al., 2019; Schmid et al., 2018; Schmidt et al., 2017; Shin et al., 2019; Wang et al., 2019). De novo assemblies are possible without laborious BAC or YAC generation and sequencing. Due to the high error rate in homopolymer regions, Nanopore sequencing is currently mostly used for scaffolding, but de novo assembly without additional data from short-read sequencing is possible for prokaryotic, eukaryotic, and even human genomes (LaPierre et al., 2019; Liem et al., 2017; Scheunert et al., 2019; Shafin et al., 2019). A combination of Hi-C, optical mapping, short-read and long-read sequencing can even generate chromosome-sized contigs (Miga et al., 2019). Optical mapping methods are often useful for scaffolding and further increase contig size compared to sequencing alone (Deschamps et al., 2018; Etherington et al., 2020). Table 1 provides an overview of the applications of Nanopore sequencing.

Table 1. Examples of applications of long-read sequencing.

Application | Examples and references
Highly polymorphic regions | HLA (Ton, K.N.T. et al. 2018), KIR (Roe, D. et al. 2017)
Infection | Antibiotic resistance (Břinda, K. et al. 2018), Ebola (Quick, J. et al. 2016), Gonorrhea (Golparian, D. et al. 2018), West Nile virus, Zika (Grubaugh, N. D. et al. 2018), Meningitis (Brønstad Brynildsrud, O. et al. 2018), Tuberculosis (George, S. et al. 2018), Sepsis
Methylation analysis | Gigante, S. et al. 2019; Gilpatrick, T. et al. 2019; Lee, I. et al. 2018
Microbiome analysis | Kerkhof, L.J. et al. 2017; Nicholls, S.M. et al. 2019; Shin, J. et al. 2016
Pseudogene discrimination | CYP2D6 (Liau, Y. et al. 2019), IKBKG (Ardui, S. et al. 2018), PKD1 (Borràs, D.M. et al. 2017), SMN1
Repeat structure / expansions | ABCA7 (De Roack, A. et al. 2018), C9orf72, FMR1, HTT, INS, MUC1, NOTCH2NLC, SAMD12, SCA2, SCA3, SCA10, SCA17
RNA isoform detection | Clark, M.B. et al. 2018; Tang, A.D. et al. 2018
Translocations | BCR-ABL (Jeck, W.R. et al. 2018), t(X;20) (Dutta, U.R. et al. 2018)
Structural variants | De Coster, W. et al. 2018; Sanchis-Juan, A. et al. 2018
STR profiling | Cornelis, S. et al. 2018

10. Challenges of long-read sequencing

Preparing DNA for long-read sequencing has several pitfalls in terms of obtaining optimal sequencing libraries. Size selection can be an issue, since very large DNA molecules tend to block nanopores and very short molecules reduce the overall sequencing output. Moreover, libraries from freshly isolated DNA/RNA produce a higher output than those from long-term stored samples due to less degradation and oxidation. Furthermore, sample purity is an issue because long-read sequencing requires a high input of DNA, which can carry a higher content of organic solvents or secondary metabolites that harm the pores (nanopore sequencing) or the polymerase (SMRT sequencing). Another challenge in TGS is a comparatively high sequencing error rate, but higher coverage and optimized filtering strategies can improve consensus accuracy (Amarasinghe et al., 2020; Deamer et al., 2016; Sedlazeck et al., 2018a). PacBio has already released the CCS high-accuracy sequencing chemistry with an error rate below 1 %. The launch of a new consensus sequencing chemistry for nanopore sequencing has been announced, but it is not available so far. UMIs in combination with PCR are another promising option to increase sequencing accuracy above 99 % (Karst et al., 2020). An alternative option is the R2C2 protocol, which uses a circularized sequencing template and rolling circle amplification (RCA). This method makes it possible to read the same sequence several times to increase consensus accuracy. Moreover, modulating the voltage in nanopore sequencing has been described as a way to overcome the primary error modes in basecalling (Noakes et al., 2019). However, most errors in Nanopore sequencing occur in homopolymer regions. This issue might be addressed by the next generation of convolution-based basecallers and new pores. Bioinformatics strategies for processing long-read sequencing data are rapidly evolving, and the landscape of available tools is growing. A catalogue of long-read sequencing data analysis tools can be found at https://long-read-tools.org/. Another problem is the large raw data file size, which places high demands on storage and data management. However, it is likely that further advancements in the technology and bioinformatics will address most of the remaining issues.

Original source: https://www.cnblogs.com/wangprince2017/p/13756018.html