Abstract

In recent years long-read technologies have moved from being a niche and specialist field to a point of relative maturity likely to feature frequently in the genomic landscape. Analogous to next generation sequencing, the cost of sequencing using long-read technologies has materially dropped whilst the instrument throughput continues to increase. Together these changes present the prospect of sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution. In this article, we will endeavour to present an introduction to long-read technologies showing: what long reads are; how they are distinct from short reads; why long reads are useful and how they are being used. We will highlight the recent developments in this field, and the applications and potential of these technologies in medical research, and clinical diagnostics and therapeutics.

When Short Reads Are Not Enough

DNA is an extraordinarily compact storage medium, so small that developing ways to decode the sequence encoded in these molecules has been a topic of research for many years. The first method developed for sequencing DNA, often known as Sanger sequencing (1), was a low throughput process that detected bases by incorporation into a template strand, sequencing fragments of DNA up to 1000 bp long. The breakthrough allowing sequencing at scale came with the advent of next generation sequencing (NGS) technology, which employed massively parallel reactions for high throughput. While these technologies have been able to capture sequence from the majority of the genome and have found utility in the study of disease, their short reads and lack of contextual information have limited their utility in genome assembly and in resolving complex and repetitive regions of the genome.

Incremental improvements in the read-length of this generation of technology yield diminishing returns. Thus, to achieve substantial gains in mapping, assembly and phasing, one must consider technology that provides an order of magnitude increase in read-length (2). In practice, there are also many important problems in genetics for which a short read of DNA (<1000 base pairs) is insufficient (Table 1 and Fig. 1).

Table 1. Advantages and applications of long-read sequencing

Limitations of short read data

  • Access to high GC content regions
  • Resolution of complex regions of the genome (e.g. MHC[a])
  • Repetitive regions where short reads will not map uniquely
  • Systematic context-specific error modes
  • Structural variation, and large segmental duplications
  • Paralogous regions of the genome
  • Resolution of phase (read-based phasing)

Applications and advantages of long-read sequencing

  • De novo assembly from long reads to span the low complexity and repetitive regions, to create accurate assemblies (3)
  • Targeted sequencing of complex genomic and paralogous regions and resolution of phase for clinical applications, e.g. HLA[b] typing, ADPKD[c] (4)
  • Transcriptomics, allowing full length sequencing of isoforms and examination of splicing (5)
  • Detection of structural variants (e.g. segmental duplications, gene loss and fusion events)
  • Single molecule sequencing allows examination of clonal heterogeneity of pathogens, and immunogenic cells
  • Long-range characterization of methylation patterns

[a] MHC: Major histocompatibility complex.
[b] HLA: Histocompatibility leucocyte antigen.
[c] ADPKD: Autosomal-dominant polycystic kidney disease.

Figure 1.

Behaviour of reads around genomic events. (A) Large insertion: short reads at the edge of the variant are soft-clipped. Reads within the insertion will be either unmapped or mapped incorrectly. Long reads will either span the insertion or have enough context to be marked as inserted sequence.

(B) Large deletion: short read pairs spanning the deletion may be mismapped, or only one read of the pair may be marked as mapped, because the distance measured on the reference implies an insert size that deviates from the expected distribution. Long reads will span the gap, and most will have enough context to call the deletion.

(C) Copy number variation: where the read-length exceeds the length of the CNV region, reads will map correctly. Shorter reads may be collapsed and show up as increased depth in a pileup or be marked as mapping poorly.

(D) Inversion: reads will either be represented as a primary alignment with an inverted supplementary alignment, or manifest as soft clipping around the edge of the inversion with a reduction in depth where reads span the edge of the inversion.

Key to achieving high quality results with all long-read technologies is the use of high molecular weight DNA as a starting material.

The utility of these methods depends on a long DNA fragment size, with DNA damage and fragmentation limiting the quality of data obtained.

Specific protocols for DNA extraction such as the agarose gel protocol for BioNano are ideal to maximize yield from these methods.

Long-Read Technologies

Single molecule real time sequencing

The first long-read sequencing technology to achieve widespread deployment was single molecule real time (SMRT) sequencing from Pacific Biosciences (PacBio). The SMRT system implemented in their Sequel and RS II platforms uses a massively parallel array of polymerases, each bound to a single molecule of target DNA that has been circularized with a pair of hairpin sequencing adaptors (the SMRTbell) (Fig. 2A). Incorporation of labelled bases by the polymerase on the template strand causes fluorescence. The resulting signal is detected by a CCD camera via a zero-mode waveguide (6,7), yielding a combination of signal and time series information. Reads produced by this technology typically reach a maximum of about 100 Kbp in length, and a typical N50 on recent polymerases is ∼20 Kbp.
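The N50 quoted above is a length-weighted summary: it is the read length L such that reads of length L or longer contain at least half of all sequenced bases. The following is a minimal sketch of that calculation in Python; the function name and the example lengths are illustrative assumptions, not values from any real run.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths (in bp).

    N50 is the length L such that reads of length >= L together
    contain at least half of the total number of bases.
    """
    if not read_lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(read_lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths (bp), chosen only to illustrate the calculation.
example_lengths = [2_000, 5_000, 8_000, 20_000, 25_000, 40_000]
print(n50(example_lengths))  # 25000
```

Because the statistic is weighted by bases rather than by read count, a few very long reads raise the N50 far more than many short reads lower it, which is why it is reported alongside the maximum read length.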

Figure 2.

Long-read sequencing technologies.

(A) PacBio SMRT sequencing. Double stranded DNA is first sheared and size selected to the desired length, and then sequencing adaptors are annealed. The adaptors are bound to a sequencing primer and a strand displacing polymerase, which adheres to the bottom of a well containing a zero-mode waveguide. Following a pre-extension period where the polymerase reaction is run in the dark, the fragment is illuminated with a laser and, as each base in the sequencing solution is incorporated, the fluorophore is detected and the polymerase reaction displaces it, giving a time and intensity signal which is converted into a base call.

(B) Oxford Nanopore Technology passes the DNA molecule through a nanopore attached to the flow cell surface membrane. As each base of the DNA molecule passes through the pore, changes to the current passing through the pore are detected and converted into a signal. The signal detected is passed to a recurrent neural network (RNN) which converts it into base calls.

(C) 10X Genomics Chromium technology works by means of an emulsion droplet technology, where gel beads are mixed with high molecular weight genomic DNA and an enzyme. Within each gel bead, DNA is sheared and barcoded, creating fragments which can then be sequenced with Illumina sequencing. The presence of the Chromium barcode then provides a mapper or assembler with linked-reads, allowing the relative spatial position of the fragments to be estimated. Components of figure reproduced with permission from Pacific Biosciences, Oxford Nanopore Technologies and 10X Genomics.

One complication of SMRT sequencing is the high error rate of this process relative to short read sequencing, at 11–14% depending on polymerase and chemistry. However, this error mode is stochastic (by contrast with other technologies), and can be mitigated by repeated measurements of the sequence. With PacBio sequencing, this is carried out by repeated forward and reverse sequencing passes over the circularized SMRTbell molecule (Fig. 2A). Adaptor sequences can be removed from the generated sequence to provide enough subreads to generate a highly accurate consensus of each molecule. This process is known as circular consensus sequencing and has been shown to reduce basecalling error substantially (8) whilst also enabling the strand specific calling of base modifications in unamplified DNA (9). When long DNA fragments are sequenced, they may not be read more than once in the SMRTbell; in this case, increasing coverage and then calling a consensus across reads can also achieve a reduced error rate, a method frequently used in polishing assemblies (10).
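The reason repeated passes help can be illustrated with a toy simulation: if errors are random and independent between passes, a per-position majority vote rarely agrees on the same wrong base. The sketch below is a deliberately simplified assumption-laden model, not the circular consensus algorithm itself: it uses substitution errors only, treats the subreads as already aligned and of equal length, and the 13% error rate (within the range quoted above), the nine passes and all function names are illustrative choices.

```python
import random
from collections import Counter


def add_errors(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(subreads):
    """Per-position majority vote over equal-length, pre-aligned subreads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


def error_rate(seq, truth):
    """Fraction of positions at which `seq` differs from `truth`."""
    return sum(a != b for a, b in zip(seq, truth)) / len(truth)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
truth = "".join(rng.choice("ACGT") for _ in range(2000))
subreads = [add_errors(truth, 0.13, rng) for _ in range(9)]  # nine simulated passes

single_pass = error_rate(subreads[0], truth)
multi_pass = error_rate(consensus(subreads), truth)
print(f"single pass error:      {single_pass:.3f}")
print(f"9-pass consensus error: {multi_pass:.4f}")
```

The same vote would not remove a systematic error, since every pass would make the same mistake at the same position; this is the distinction between stochastic and context-specific error drawn in the text.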

Oxford Nanopore Technologies

The next successful single molecule technology to reach the market was that produced by Oxford Nanopore Technologies (ONT) (11). This technology is based on passing a single strand of DNA through a nanopore with an enzyme attached, and measuring changes in the electrical signal across the pore (Fig. 2B). The signal is then amplified and measured to determine the bases that passed through. As the pore holds several bases at a time (typically 5-mers), overlapping k-mers that cause changes in raw current must be inferred and used to make base calls, a process which can be error prone. By measuring the shape of the molecule passing through the pore, ONT not only reads the sequence of the DNA but, like SMRT, is also able to detect base modifications (12). However, unmodelled base modifications and systematic DNA context-specific errors (13) currently limit the utility of the technology.
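To make the overlapping k-mer idea concrete, the short Python sketch below lists the successive 5-mers that would occupy a pore as a strand moves through one base at a time. It only enumerates the windows; it does not model current levels or base calling, and the function name and example sequence are assumptions made for illustration.

```python
def overlapping_kmers(sequence, k=5):
    """Return every k-mer seen as a window of width k slides one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


example = "GATTACAGG"  # arbitrary illustrative sequence
for kmer in overlapping_kmers(example):
    print(kmer)
# GATTA
# ATTAC
# TTACA
# TACAG
# ACAGG

# With four bases there are 4**5 = 1024 distinct 5-mers to tell apart.
print(4 ** 5)  # 1024
```

Each interior base contributes to five consecutive windows, so a single base call has to be inferred from several overlapping signal segments rather than from one isolated measurement.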

Oxford Nanopore MinION technology heralds the promise of a pocket size sequencer, with reads that can stretch into the hundreds of kilobases with appropriate DNA preparation, and megabase long reads observed when a large number of flow cells have been used. There appears to be no intrinsic read-length limit for ONT, other than the size of the DNA fragments. Recent improvements in technology, library preparation and throughput allowed the first human cell line (GM12878) to be sequenced on the MinION earlier this year (14). This study generated ultra-long reads (>800 Kbp), and suggested that adding modest coverage from ultra-long-read sequencing to existing assemblies may substantially improve resolution of contigs and haplotypes. While the error rate is comparable to SMRT sequencing, a component of the error is systematic and context-specific, limiting the ability to correct it by increasing coverage (13) and requiring polishing with other technologies instead.

ONT has developed a distinct strategy to mitigate stochastic error on their platform, focusing on the way that the template strand passes through the pore. ONT cannot simply circularize the DNA. Instead, the template and complement strands of the DNA molecule are either joined by a hairpin loop during library prep (2D) or tethered in such a way (1D²) as to allow sequential forward and reverse strand sequencing. Combining these data greatly enhances accuracy and reduces random error.

The use of nanopores as a nucleic acid sequencing technology is not entirely exclusive to ONT; at least one similar but distinct competing technology is also under development by Roche.

10X Genomics Chromium system

An alternative to the aforementioned single molecule sequencing methods is the 10X Genomics Chromium system. Whilst this is not technically a long-read sequencing technology, it is an important member of this ecosystem and can solve similar problems such as mapping, phasing and assembly (Fig. 2C). Chromium has lower cost compared to ONT and SMRT because of the use of the nearly ubiquitous Illumina short reads in its sequencing process.

The basis of this technology (15) is the barcoding of large fragments of DNA (preferably >100 Kbp) in an initial digital droplet polymerase chain reaction (PCR) step. In each droplet, a single fragment is both sheared and then tagged with a semi-unique molecular barcode (Fig. 2C). The resulting fragments are then sequenced like any other Illumina library. The barcode allows for determination of the relative spatial orientation of the tags, and allows phasing and assembly of contigs by combining information across multiple tags (15,16). Additionally, because the data provide spatial orientation across the genome, it is possible to use it to scaffold data from other methods (17).
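As a rough illustration of how barcodes supply long-range context, the Python sketch below groups mapped short reads by barcode and infers the span of each original long fragment. Because barcodes are only semi-unique, one barcode can label more than one molecule, so reads sharing a barcode are split wherever neighbouring positions are far apart. The tuple layout, the `max_gap` threshold of 50 Kbp and all names are assumptions for this example and do not describe the vendor's software.

```python
from collections import defaultdict


def group_by_barcode(reads):
    """Collect mapped positions per (barcode, chromosome).

    `reads` is an iterable of (barcode, chromosome, position) tuples.
    """
    groups = defaultdict(list)
    for barcode, chrom, pos in reads:
        groups[(barcode, chrom)].append(pos)
    return groups


def molecule_spans(reads, max_gap=50_000):
    """Infer (barcode, chromosome, start, end) spans of the source molecules.

    Reads with the same barcode are assumed to come from one molecule
    unless neighbouring positions are more than `max_gap` bp apart.
    """
    spans = []
    for (barcode, chrom), positions in group_by_barcode(reads).items():
        positions.sort()
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos - prev > max_gap:  # too far apart: start a new molecule
                spans.append((barcode, chrom, start, prev))
                start = pos
            prev = pos
        spans.append((barcode, chrom, start, prev))
    return spans


# Hypothetical barcoded reads: (barcode, chromosome, mapped position).
example_reads = [
    ("BC01", "chr1", 10_000), ("BC01", "chr1", 45_000), ("BC01", "chr1", 90_000),
    ("BC01", "chr1", 900_000),  # same barcode, but a distant second molecule
    ("BC02", "chr1", 20_000), ("BC02", "chr1", 60_000),
]
for span in molecule_spans(example_reads):
    print(span)
```

Variants that fall within one inferred span and share a barcode can then be treated as lying on the same original molecule, which is the information a phasing or scaffolding tool needs.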

Allied technologies

Allied technologies associated with long-read sequencing, such as optical mapping, HiC and similar methods, have been used to enhance the final results from sequencing. Optical mapping technologies such as BioNano Irys and Saphyr label DNA and then image the labelled DNA to generate genome maps. These genome maps are used to scaffold contigs produced by assembly (18) and also to discover large (>500 bp) structural variants and inversions. HiC can be used to assay chromosomal conformation and is particularly useful in assigning assembled sequences to chromosomes (19).

The Utility of Long-Read Technology: Recent Developments

High resolution genome assemblies

Accurate assemblies of the genomes of organisms are crucial to understanding organismal diversity, speciation, evolution of species and the impact of genomic diversity on health and disease. The current human genome reference GRCh38 has been assembled from the DNA of multiple donors, and represents a mosaic of haplotypes. However, several studies have suggested that existing human reference genomes may not fully reflect the diversity of global human populations, and may be biased towards diversity in European populations (20–22). This has important implications for human basic and medical research. Assembling the human genome has involved extensive curation with clone-based assembly methods and Sanger sequencing. Long-read technologies provide a high throughput platform for characterization of genomes through highly contiguous assemblies (Fig. 3).

Figure 3.

Long reads span and call variations that short reads cannot. IGV (http://software.broadinstitute.org/software/igv/home) image of (top) PacBio reads from a sample sequenced as part of the GDAP project. The reads span a 6 kb heterozygous LINE-1 element deletion and show clear depth variation.

(Bottom) Illumina reads from the same sample could not be clearly mapped around the deletion; reads in white indicate where reads could not be uniquely mapped.

 
The early long-read platforms produced reads that were only a few kilobases long with a high per-base cost; however, they quickly carved a niche in the creation and finishing of assemblies. These long reads could close gaps in genomes by spanning the low complexity regions that would otherwise require many costly YAC, BAC and fosmid clones to be created and sequenced. Thus, many of the early tools such as PBJelly were focused on gap closure (23,24). The high per-base error rate also required new assembly algorithms, and new tools were created to polish the final assembly with Illumina reads to eliminate basecalling error (12). Clone based assembly methods were not eliminated entirely, as they provided useful spatial context, but long reads provided a new way to sequence clones in a high throughput manner (25).

Long-read sequencing methods have contributed to platinum quality reference sequences such as NA12878 (14,26) and the haploid sequences CHM1 (27) and CHM13 (28), as well as filling many gaps in the human reference (18,25,29). Of particular note are the first Chinese (18) and Korean (25) human reference genomes, which have been created to answer questions about population-specific sequence. These efforts have resulted in highly contiguous assemblies, closing a high proportion of gaps in the human genome. They have led to the discovery of population-specific sequences, demonstrating the need for further assemblies from non-European population groups. Recently, higher coverage sequencing (∼60×) of two haploid genomes has also been used to identify substantial structural variation, the vast majority of which has not been recovered from sequencing using NGS technologies (28). Characterization of high resolution population-specific reference genomes from initiatives such as the Genome in a Bottle (GIAB) (30) and the Genome Diversity in Africa project (GDAP) (31) (Fig. 3) will provide important resources for population and medical genetics, and also allow a clearer understanding of the evolutionary demographic history of different populations by better delineation of phase (31).

Most human assemblies have involved a haploid representation of the genome, where information from the two chromosomes is collapsed into a single sequence. Generation of haplotype representations of the genome can reduce error in the final assembly, particularly in the case of segmental duplications (16,32). While long-read technologies can generate phase information over long contiguous segments, these methods cannot resolve phase over long regions of homozygosity or assembly gaps. Haplotype-resolved assembly, therefore, requires additional contextual information, which can be provided by linked-read approaches. More recently, trio based methods, in which the parents are sequenced using Illumina short reads and the offspring with long reads, have been used to provide this contextual information by separating maternal and paternal haplotypes prior to assembly (33). This method has been applied to yield a highly contiguous diploid assembly of an F1 hybrid of two bovine subspecies with a quality surpassing previous cattle reference genomes (33).
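The trio idea can be sketched in a few lines of Python: k-mers seen in only one parent's short reads act as markers, and each long read from the offspring is assigned to whichever parent contributes more markers. This is a simplified illustration under stated assumptions, not the published method: it ignores sequencing error and reverse complements, the value k = 4 is far smaller than would be used in practice, and the function names and toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    maternal = set().union(*(kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(kmers(r, k) for r in paternal_reads))
    return maternal - paternal, paternal - maternal


def bin_read(read, maternal_only, paternal_only, k):
    """Assign an offspring long read to a parental haplotype by marker count."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative markers, e.g. a homozygous region


K = 4  # toy value; real data would need much longer k-mers
mat_only, pat_only = parent_specific_kmers(["ACGTACGTTTGA"], ["ACGTACGATTGA"], K)

print(bin_read("GTACGTTTG", mat_only, pat_only, K))  # maternal
print(bin_read("TACGATTGA", mat_only, pat_only, K))  # paternal
print(bin_read("ACGTAC", mat_only, pat_only, K))     # unassigned
```

Once the reads are separated in this way, each bin can be assembled independently, so the assembler never has to decide between two haplotypes in the same graph. The third read shows the limit noted in the text: where the parents are identical there are no markers, and the read cannot be assigned.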

Long reads have been successfully applied to organisms with smaller genomes as well as bacteria and viruses, with the advantage that for some of these the entire genome can be spanned by a single long read (34). The Tree of Life initiative, a collaboration across multiple centres, is in the process of developing high resolution reference sequences for >50 vertebrate species using a combination of long read, short read and linked-read approaches. Another leading project is the large bacterial sequencing project NCTC 3000 at the Wellcome Sanger Institute, which is using PacBio sequencing to sequence complete bacterial genomes (https://www.phe-culturecollections.org.uk/collections/nctc-3000-project.aspx). These relatively small genomes (that of Escherichia coli, for example, is 4.6 Mbases) can often have their chromosomes and plasmids assembled into single contigs. The construction of full and accurate assemblies allows fine-scale phylogenies of these organisms to be constructed and is also helpful in epidemiology when tracing the source of an outbreak. A recent example was a study where SMRT sequencing was used to identify a reservoir of antibiotic resistant plasmids within hospitals (35).

In addition to DNA sequencing, ONT sequencing has been applied to sequence RNA directly rather than relying on an intermediate cDNA step, allowing direct sequencing of RNA viruses and detection of splice variants and base modifications directly from RNA molecules. An example of this is the recent direct sequencing and assembly of the influenza A virus in a native RNA form without amplification or conversion to DNA (36).

Targeted sequencing

From a clinical point of view, targeted sequencing is an area where long reads are likely to have the greatest initial impact. In diverse, complex and clinically relevant regions such as the histocompatibility leucocyte antigen (HLA) (37), killer cell immunoglobulin-like receptor (KIR) (38) and BRCA, and in pharmacologically relevant genes such as CYP2D6 (39,40), targeted sequencing has allowed clinicians and researchers to characterize areas of the genome which were previously inaccessible using NGS methods. In addition, where diversity is high it has become possible to call and phase variation across the entire gene. This approach has since been used to retype 126 HLA reference samples across 6 loci and is now considered a gold standard for clinical sequencing for stem cell transplants (41).

Typically, when targeting such a region, a long-range PCR reaction is used to specifically amplify the genes of interest. Recently, however, studies have demonstrated the use of pulldowns and CRISPR-Cas9 to capture the region of interest with little or no amplification (42). The advantage of these reduced and non-amplification based approaches is the removal of PCR error as a factor, particularly in tandem repeats and GC rich regions (42). Additionally, in the case of CRISPR methods, capture of raw genomic material allows DNA modification information to be read.

Transcriptomics and RNA

In addition to its many uses with DNA, long-read technology has also provided many new insights into the world of transcriptomes and ncRNA by allowing full length isoforms to be sequenced rather than relying on the assembly of sheared NGS fragments, a method prone to a high rate of false positives and ambiguities (43). Direct sequencing of isoforms can be particularly useful in complex polyploid genomes such as the coffee plant (44), where construction of a reference transcriptome is otherwise extremely challenging. In addition to its usefulness in reference transcriptomes, IsoSeq has been used in functional studies to analyse the expression of various disease-linked proteins such as TP53 in leukaemia (45).

The MinION platform has recently been used to sequence cDNA; applications of this, such as single cell sequencing of immune cells, illustrate the power of such methods to examine clonal heterogeneity in gene expression and isoform usage, potentially revolutionizing our understanding of the repertoire and functions of immunological cell receptors (46).

Epigenetics

SMRT sequencing technology is able to detect base modifications, as it records the base incorporation kinetics of the polymerase when DNA molecules are sequenced directly without PCR. Similarly, Nanopore technology can detect base modifications through variation in ionic currents. However, because amplification of DNA would erase base modifications, these methods require relatively large amounts of native, unamplified DNA as input material. Recent innovations that combine bisulphite conversion with SMRT sequencing have allowed direct high throughput analysis of CpG methylation without requiring large quantities of sample (47), providing an avenue for more accurate assessment of CpG islands and allele-specific CpG methylation.

Clinical applications

The advantages of long-read technologies in accessing complex regions of the genome make them ideal for clinical applications in diagnosis, prognostication and personalized medicine. Early clinical applications have included sequencing of tandem repeats in fragile X syndrome and spinocerebellar ataxia, providing accurate diagnostics and potential for prognostication in clinical genetics. SMRT sequencing has also been used to resolve structural variants associated with Mendelian disease (48).

Long-read sequencing technologies are rapidly becoming the mainstay of high resolution HLA typing for transplant registries in certain regions (37), with high resolution typing potentially having implications for better matching and clinical outcomes of patients undergoing transplantation. This is even more important in populations which are poorly represented in current reference sequence databases, which limits disambiguation of clinical types when using standard methods for typing. The HLA diversity in Africa project, which aims to characterize high resolution HLA types across >20 ethno-linguistic groups in Africa, has recently completed sequencing of ∼2000 individuals using long-read sequencing, identifying high levels of novelty in class I and class II HLA types (49). This panel will provide an important resource for clinical HLA typing in populations of African ancestry, as well as a platform for highly accurate imputation of HLA types in medical genetics research.

Long-range PCR amplicons, combined with barcoding and long-read technology, also allow better delineation of genes from pseudogenes, such as when sequencing PKD1 to diagnose autosomal-dominant polycystic kidney disease, for which the diagnostic accuracy of NGS technologies has been limited (50). SMRT sequencing has also been used to tailor treatment in patients with cancer, by identifying low frequency resistance mutations in BCR-ABL1 that affect treatment efficacy in patients with CML (51). Applications of SMRT sequencing in reproductive medicine, to identify parent of origin effects and for pre-implantation diagnosis, have been previously noted (52).

Full sequencing of several virus genomes in a single contig by long-read sequencing has provided unique avenues for identification of resistance mutations for clinical applications. Proof-of-concept studies have generated protocols to examine low frequency (up to 0.25%) mutations associated with HIV and HCV resistance to drugs, through deep sequencing of full length quasispecies (53). Methylation profiles of pathogens examined using SMRT approaches have also been shown to correlate with pathogenicity and virulence, potentially providing a new avenue for applications in infectious disease surveillance.

The Future

Long-read technologies are improving rapidly, and may become the mainstay of sequencing; however, the broader application of long-read technologies is currently limited by lower throughput, a higher error rate and a higher cost per base relative to short read sequencing. Wider use of such technologies in the clinical context may rapidly improve our understanding of cancer, pathogen evolution, drug resistance and genetic diversity in complex regions of the genome that have important implications for clinical care. Parallel development of existing technology to allow high throughput PCR-free sequencing will be important in sequencing difficult regions of the genome (54).

At present, no single long-read technology has a clear advantage from a scientific point of view, and thus the future of long-read sequencing seems more likely to be decided on commercial terms than on scientific ones. Whichever technology captures the market, it is clear that as these technologies become more affordable they will continue to shine a light into previously intractable regions of the genome with ever larger sample sizes and longer read-lengths, allowing new discovery in these evolving fields.