HISEA: HIerarchical SEed Aligner for PacBio data

Nilesh Khiste & Lucian Ilie

BMC Bioinformatics volume 18, Article number: 564 (2017)

Abstract

Background

Next-generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology, Illumina, produces reads that are too short to bridge many repeats, which limits what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive and have a much higher error rate than Illumina's, around 10-15%, mostly due to indels. New algorithms are needed to take advantage of the long reads while mitigating the effect of the high error rate and lowering the required coverage.

Methods

An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. The high error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new pairwise read aligner, or overlapper, HISEA (Hierarchical SEed Aligner), for SMRT sequencing data. HISEA uses a novel two-step k-mer search that employs consistent clustering, k-mer filtering, and read alignment extension.
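
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea of seed-based overlap detection, not HISEA's actual implementation. It assumes a coarse pass with longer k-mers to pick a candidate diagonal (target position minus query position), followed by a finer pass with shorter k-mers restricted to that diagonal. All function names, the parameter values (`k_large`, `k_small`, `band`, `min_seeds`), and the toy sequences are hypothetical choices made for illustration.

```python
from collections import defaultdict


def kmer_index(read, k):
    """Map each k-mer of `read` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(read) - k + 1):
        index[read[i:i + k]].append(i)
    return index


def seed_hits(query, target_index, k):
    """Return (query_pos, target_pos) pairs for every shared k-mer."""
    hits = []
    for q in range(len(query) - k + 1):
        for t in target_index.get(query[q:q + k], ()):
            hits.append((q, t))
    return hits


def best_cluster(hits, band):
    """Find the largest group of hits whose diagonals differ by at most `band`.

    Hits produced by a true overlap share roughly the same diagonal; indels
    shift it slightly, hence the tolerance. Returns (lowest diagonal in the
    group, group size), or None when there are no hits.
    """
    if not hits:
        return None
    diags = sorted(t - q for q, t in hits)
    best_count, best_start, left = 0, diags[0], 0
    for right in range(len(diags)):
        while diags[right] - diags[left] > band:
            left += 1
        if right - left + 1 > best_count:
            best_count, best_start = right - left + 1, diags[left]
    return best_start, best_count


def overlap_candidate(query, target, k_large=8, k_small=4, band=3, min_seeds=2):
    """Two-pass seed search (illustrative assumption, not the published method)."""
    # Pass 1: long k-mers give few, reliable seeds; keep the best diagonal group.
    coarse = best_cluster(seed_hits(query, kmer_index(target, k_large), k_large), band)
    if coarse is None or coarse[1] < min_seeds:
        return None
    diag, coarse_seeds = coarse
    # Pass 2: short k-mers are noisier, so keep only hits near that diagonal.
    fine = [(q, t) for q, t in seed_hits(query, kmer_index(target, k_small), k_small)
            if diag - band <= t - q <= diag + band]
    return {"diagonal": diag, "coarse_seeds": coarse_seeds, "fine_seeds": len(fine)}


if __name__ == "__main__":
    target = "ACGTTGCAAGGCTTACGGATCCATGCA"   # toy sequence, not real data
    query = target[10:25]                    # error-free fragment of the target
    print(overlap_candidate(query, target))
```

A real overlapper would additionally filter overly frequent k-mers and extend the seeded region into a full alignment; both steps are omitted here for brevity.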

Results

We compare HISEA against several state-of-the-art programs (BLASR, DALIGNER, GraphMap, MHAP, and Minimap) on real datasets from five organisms. We compare their sensitivity, precision, specificity, and F1-score, as well as time and memory usage. We also introduce a new, more precise evaluation method. Finally, we compare the two leading programs, MHAP and HISEA, for their genome assembly performance in the Canu pipeline.
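
For readers unfamiliar with these metrics, here is a minimal sketch of how they are conventionally computed from a set of predicted overlaps and a set of true overlaps. It uses the standard textbook definitions and is not the paper's new evaluation method, which the abstract does not describe. The function names, the representation of an overlap as a sorted pair of read indices, and the counts in the example are assumptions for illustration only.

```python
def confusion_counts(predicted, truth, n_reads):
    """Count (tp, fp, tn, fn) over all unordered read pairs.

    `predicted` and `truth` are sets of (i, j) tuples with i < j
    (assumed representation of an overlap between reads i and j).
    """
    all_pairs = n_reads * (n_reads - 1) // 2
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = all_pairs - tp - fp - fn
    return tp, fp, tn, fn


def overlap_metrics(tp, fp, tn, fn):
    """Standard definitions; each ratio falls back to 0.0 on an empty denominator."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall: true overlaps found
    precision = tp / (tp + fp) if tp + fp else 0.0     # reported overlaps that are true
    specificity = tn / (tn + fp) if tn + fp else 0.0   # non-overlaps correctly rejected
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(overlap_metrics(tp=90, fp=10, tn=880, fn=20))
```

Because the number of non-overlapping read pairs is usually far larger than the number of true overlaps, specificity tends to sit close to 1 for any reasonable tool, which is why sensitivity, precision, and F1-score are typically the more informative figures.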

Discussion

Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the current best. The best current assembler for SMRT data is Canu, which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner into the Canu pipeline and benchmarked it against the original Canu+MHAP pipeline on multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP at both coverage levels. Moreover, Canu+HISEA assemblies at 30x coverage are comparable with Canu+MHAP assemblies at 50x coverage, while being faster and cheaper.

Conclusions

The HISEA algorithm produces alignments with the highest sensitivity among the current state-of-the-art algorithms. Integrated into the Canu pipeline, currently the best for assembling PacBio data, it produces better assemblies than Canu+MHAP.

Original post: https://www.cnblogs.com/wangprince2017/p/13756576.html