Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

Aleksey V. Zimin¹,², Daniela Puiu¹, Ming-Cheng Luo³, Tingting Zhu³, Sergey Koren⁴, Guillaume Marçais²,⁵, James A. Yorke²,⁶, Jan Dvořák³ and Steven L. Salzberg¹,⁷
Author Affiliations
¹Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA;
²Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA;
³Department of Plant Sciences, University of California, Davis, California 95616, USA;
⁴National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;
⁵Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;
⁶Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA;
⁷Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA
Corresponding author: salzberg@jhu.edu
Abstract

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
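The abstract reports contiguity as an N50 contig size. As a rough illustration of what that statistic measures, here is a minimal sketch that assumes the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The function name `n50` and the toy lengths are hypothetical. The abstract does not say whether the reported value was computed against the total assembly length or an estimated genome size, so this should not be read as the authors' exact procedure.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: sort contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches at least half of the summed length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example (made-up lengths, not data from the paper).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150, half of 300, so this prints 70
```

A higher N50 means more of the assembly sits in long contigs, which is why the statistic is used to compare contiguity between assemblies of the same genome.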

Original post: https://www.cnblogs.com/wangprince2017/p/13756583.html