Genome assembly using Nanopore-guided long and error-free DNA reads

Abstract

Background

Long-read sequencing technologies were launched a few years ago and, in contrast with short-read sequencing technologies, they promised to solve assembly problems for large and complex genomes. Moreover, by providing long-range information, they could also resolve haplotype phasing. However, existing long-read technologies still have several limitations that complicate their use in most research laboratories, as well as in large and/or complex genome projects. In 2014, Oxford Nanopore released the MinION® device, a small and low-cost single-molecule nanopore sequencer, which offers the possibility of sequencing long DNA fragments.

Results

The assembly of long reads generated using the Oxford Nanopore MinION® instrument is challenging because existing assemblers were not designed to handle long reads with an error rate close to 30%. Here, we present a hybrid approach developed to take advantage of data generated using the MinION® device. We sequenced a well-known bacterium, Acinetobacter baylyi ADP1, and applied our method to obtain a highly contiguous (one single contig) and accurate genome assembly, even in repetitive regions, in contrast to an Illumina-only assembly. Our hybrid strategy generated NaS (Nanopore Synthetic-long) reads of up to 60 kb that aligned entirely and without error to the reference genome and that spanned highly conserved repetitive regions. The average accuracy of NaS reads reached 99.99% without losing the initial size of the input MinION® reads.

Conclusions

We describe the NaS tool, a hybrid approach that allows the sequencing of microbial genomes using the MinION® device. Our method, ideally based on 20× coverage of NaS reads and 50× coverage of Illumina reads, provides an efficient and cost-effective way of sequencing microbial or small eukaryotic genomes in a very short time, even in small facilities. Moreover, we demonstrate that although the Oxford Nanopore technology is relatively new and currently has a high error rate, it is already useful for generating high-quality genome assemblies.

Background

Long-read sequencing now offers several alternatives for solving genome assembly problems (for example, in complex regions involving repeated elements or segmental duplications) and haplotype phasing, which cannot be resolved adequately by short-read sequencing. Application of the single-molecule real-time (SMRT) sequencing platform produced by Pacific Biosciences to small microbial genomes, as well as to large, complex eukaryotic genomes, demonstrated that genome assembly quality can be improved considerably [1-4]. Microbial genomes can now be fully assembled (at least in some cases) using Pacific Biosciences SMRT reads alone [2] or in combination with short but high-quality reads [1]. The high error rate of SMRT reads necessitates either deep coverage or an error-correction strategy using Illumina reads. The current yield and high cost per base of this technology clearly remain a barrier for most genomic projects targeting large genomes. Moreover, the price of the commercially available Pacific Biosciences PacBio RS II instrument is high, and its infrastructure and implementation requirements do not make it accessible to the whole research community. Similar improvements in read length were also achieved by the Illumina TruSeq synthetic long-read sequencing strategy; its application to the human genome and to the resolution of highly repetitive elements in the fly genome provided encouraging results [5,6] and showed the importance of long, high-quality reads. Nonetheless, the long-range polymerase chain reaction step included in the library preparation may introduce substantial genome coverage biases. Moreover, the time needed for library construction may be a limitation in a time-constrained project, which again does not make the method accessible to the whole research community.

In 2014, Oxford Nanopore Technologies Ltd released the MinION® device, a single-molecule nanopore sequencer connected to a laptop through a USB 3.0 interface, to hundreds of members of the MinION® Access Programme (MAP), who are testing the new device. The technology is based on an array of nanopores embedded in a chip that detects consecutive 5-mers of a single-stranded DNA molecule by electrical sensing [7]. This new technology provides several advantages: the MinION® device is small and low cost, the library construction is simplified, no amplification step is needed, and data acquisition and analysis occur in real time. In the Oxford Nanopore technology, the two strands of a DNA molecule are linked by a hairpin and sequenced consecutively. When both strands of the molecule are read successfully, a consensus is built to obtain a more accurate read (called a 2D read). Otherwise, only the forward strand sequence is provided (called a 1D read).

MinION® tests were performed by all early access members, first on the phage lambda genome. Three recent publications of these studies [8-10] reported the production of long reads with average sizes of 5,000 and 5,500 bp. These primary studies point to a high error rate in reads from the current version of MinION®. However, despite the high error rate, Ashton et al. [10] demonstrated the potential of the MinION® device for microbial sequencing. This motivated the development of new tools, either for MinION® read correction or for read alignment. Methods for correcting long reads produced by the Pacific Biosciences sequencer have already been proposed [1,11-13]. However, these methods are based on read alignment, so their ability to correct input reads depends on the local error rate. As a consequence, the size of the corrected read is closely correlated with the sequencing errors of the input long read. Long and relatively inaccurate reads that harbor hotspots of sequencing errors will lead to mosaic reads, with alternating regions of high and low fidelity. As existing assembly software was not designed to deal with long reads with a high error rate, we developed a method based on a combination of two sequencing technologies, Oxford Nanopore and Illumina, to produce long and accurate synthetic reads before assembly.

Results and discussion

Overview of MinION® reads

We performed five runs of MinION® sequencing with four different A. baylyi genomic DNA libraries (targeting two different mean fragment sizes: 8 kb and 20 kb) and two different flowcell chemistries, R7 and R7.3 (see Methods and Table 1). We produced a total of 66,492 reads, representing a genome coverage of approximately 57×. About 13% of these 66,492 reads were 2D reads, which represent 42% of the cumulative size, indicating a significant difference in length between 1D and 2D reads. The 1D reads had an average size of 2,052 bp, whereas the average size of 2D reads reached 10,033 bp (Table 2). The N50 size was two times higher when using the 20 kb library, suggesting that longer MinION® reads are obtained when the sheared fragment size is increased. The lower average read size of run4 and run5 was due to a high proportion of very short 1D reads (< 500 bp). These two runs were performed using the same library preparation (library4, Table 1). As previously reported [8-10], we observed low mappability on the reference genome [14]; 83.2% of 2D reads and 16.6% of 1D reads were aligned (Figure 1 and Table 2). Thus, the real genome coverage, when only taking into account aligned nucleotides, is about 34×. The mean identity to the reference of 1D reads was 56.5%, while 2D reads showed a mean identity of 74.5%. The R7.3 chemistry showed several improvements in terms of throughput, proportion of 2D bases (greater than 42% with R7.3 and less than 27% with R7) and quality of 2D reads (Additional file 1: Table S1). Even though the more recent chemistry and flowcells showed significant progress, these first results still showed heterogeneity in throughput and in the proportion of 2D reads.
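The read statistics above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the function names (`n50`, `summarize_run`) are hypothetical, and the genome size of roughly 3.6 Mb for A. baylyi ADP1 is an assumption that is not stated in this section. Under that assumption, the reported read counts and mean lengths reproduce both the approximately 57× coverage and the 42% share of bases carried by 2D reads.

```python
def n50(lengths):
    """Return the N50 of a list of read lengths.

    The N50 is the length L such that reads of length >= L
    together contain at least half of all sequenced bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one read length")


def summarize_run(total_reads, fraction_2d, mean_1d_bp, mean_2d_bp, genome_size_bp):
    """Estimate total bases, 2D base share and raw coverage from summary values."""
    reads_2d = round(total_reads * fraction_2d)
    reads_1d = total_reads - reads_2d
    bases_2d = reads_2d * mean_2d_bp
    bases_1d = reads_1d * mean_1d_bp
    total_bases = bases_1d + bases_2d
    return {
        "total_bases": total_bases,
        "share_2d_bases": bases_2d / total_bases,
        "coverage": total_bases / genome_size_bp,
    }


if __name__ == "__main__":
    # Values taken from the text; genome size is an ASSUMPTION (~3.6 Mb).
    stats = summarize_run(
        total_reads=66_492,
        fraction_2d=0.13,
        mean_1d_bp=2_052,
        mean_2d_bp=10_033,
        genome_size_bp=3_600_000,
    )
    print(f"total bases    : {stats['total_bases']:,}")
    print(f"2D base share  : {stats['share_2d_bases']:.1%}")
    print(f"raw coverage   : {stats['coverage']:.1f}x")

    # Toy read lengths to illustrate the N50 definition.
    print("toy N50        :", n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because the summary is computed from rounded means and a rounded 2D fraction, the output should be read as a consistency check on the reported figures rather than as an exact recomputation from the raw reads.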

Mohammed-Amin Madoui, Stefan Engelen, Corinne Cruaud, Caroline Belser, Laurie Bertrand, Adriana Alberti, Arnaud Lemainque, Patrick Wincker & Jean-Marc Aury
BMC Genomics volume 16, Article number: 327 (2015)

