Hybrid assembly with long and short reads improves discovery of gene family expansions

Abstract

Background

Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation.

Methods

We developed a hybrid assembly pipeline called “Alpaca” that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation.

Results

Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca showed the greatest agreement with a conspecific reference and predicted tandemly repeated genes absent from the other assemblies.

Conclusion

Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.

Background

Tandemly duplicated genes are important contributors to genomic and phenotypic variation both among and within species [1]. Clusters of tandemly duplicated genes have been associated with disease resistance [2], stress response [3], and other biological functions [4, 5]. Confounding the analysis of tandem repeats in most organisms is their underrepresentation in genome assemblies constructed from short-read sequence data, typically Illumina reads, for which the sequence reads are shorter than repeats [6,7,8,9].

The ALLPATHS-LG software [10] overcomes some of the assembly limitations of short-read sequencing by combining Illumina paired-end reads from both short-insert and long-insert libraries. Applied to human and mouse genomes, the ALLPATHS assembler produced assemblies with more contiguity, as indicated by contig N50 and scaffold N50, than had been attainable from other short-read sequence assemblers. ALLPATHS also performs well on many other species [11, 12]. The ALLPATHS assemblies approached the quality of Sanger-era assemblies by measures such as exon coverage and total genome coverage. However, the ALLPATHS assemblies captured only 40% of the genomic segmental duplications present in the human and mouse reference assemblies [10]. Similarly, an ALLPATHS assembly of the rice (Oryza sativa Nipponbare) genome [13] was missing nearly 12 Mbp of the Sanger-era reference genome, including more than 300 Kbp of annotated coding sequence. These findings illustrate the potential for loss of repeated coding sequence in even the highest quality draft assemblies constructed exclusively from short-read sequence data.
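Contig N50 and scaffold N50, the contiguity measures mentioned above, can be illustrated with a short calculation. The following is a minimal sketch of the common definition (the length of the shortest sequence in the smallest set of longest sequences that together hold at least half of the assembled bases); the function name `n50` is ours and is not taken from ALLPATHS or any other assembler.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids rounding issues with total / 2.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), for illustration only.
    print(n50([5, 4, 3, 2, 1]))
```

A higher N50 indicates that more of the assembly resides in long sequences, but the statistic says nothing about correctness, which is why the text also considers exon coverage and captured duplications.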

Long-read sequencing offers great potential to improve genome assemblies. Read lengths from PacBio platforms (Pacific Biosciences, Menlo Park CA) vary but reach into the tens of kilobases [9]. The base call accuracy of individual reads is about 87% [14], and chimeras, i.e. falsely joined sequences, can occur within reads [15]. Although low base call accuracy and chimeric reads create challenges for genome assembly, these challenges can be addressed by a hierarchical approach [9] in which the reads are corrected and then assembled. The pre-assembly correction step modifies individual read sequences based on their alignments to other reads from any platform. The post-correction assembly step can use a long-read assembler such as Celera Assembler [16,17,18], Canu [19], HGAP [20], PBcR [21], MHAP [22], or Falcon [23]. Because most of the errors in PacBio sequencing are random, PacBio reads can be corrected by alignment to other PacBio reads, given sufficient coverage redundancy [24]. For example, phased diploid assemblies of two plant genomes and one fungal genome were generated by hierarchical approaches using 100X to 140X PacBio coverage [25], and a human genome was assembled from 46X PacBio coverage plus physical map data [23]. Despite the potential of long-read assembly, its high coverage requirements increase cost and thereby limit applicability.
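The reasoning that random errors can be corrected given enough coverage redundancy can be made concrete with a toy example. The sketch below is not the algorithm of any of the tools cited above; it assumes the reads have already been aligned into equal-length rows with substitutions only (real correctors must also handle insertions and deletions), and the function name `column_consensus` is hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return the per-column majority base of pre-aligned, equal-length reads.

    Assumption: substitutions only, no gaps. Because random errors rarely
    hit the same column in most reads, the majority base tends to recover
    the true sequence when coverage is high enough.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same aligned length")
    # zip(*reads) iterates over alignment columns.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


if __name__ == "__main__":
    # Five toy reads of one locus; three carry a single random substitution.
    reads = ["ACGTACGT", "ACGTTCGT", "ACCTACGT", "ACGTACGA", "ACGTACGT"]
    print(column_consensus(reads))
```

With only one or two reads per column the vote is unreliable, which is one way to see why self-correction of long reads demands the high coverage noted above, and why a hybrid approach can instead borrow accurate short reads for correction.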

Several hybrid approaches use low-coverage PacBio to fill gaps in an assembly of other data. The ALLPATHS pipeline for bacterial genomes maps uncorrected long reads to the graph of an assembly in progress [26]. SSPACE-LongRead, also for bacterial genomes, maps long reads to contigs assembled from short reads [27]. PBJelly [28] maps uncorrected long reads to the sequence of previously assembled scaffolds and performs local assembly to fill the gaps. In tests on previously-existing assemblies of eukaryotic genomes, PBJelly was able to fill most of the intra-scaffold gaps between contigs using 7X to 24X long-read coverage [28]. These gap filling approaches add sequence between contigs but still rely on the contig sequences of the initial assemblies. As such, gap filling may not correct assembly errors such as missing segmental duplications or collapsed representations of tandemly duplicated sequence. Long reads that span both copies of a genomic duplication, including the unique sequences at the repeat boundaries, are needed during the initial contig assembly to avoid the production of collapsed repeats.

We developed a novel hybrid pipeline named Alpaca (ALLPATHS and Celera Assembler) that exploits existing tools to assemble Illumina short-insert paired-end short reads (SIPE), Illumina long-insert paired-end short reads (LIPE), and PacBio unpaired long reads. Unlike other approaches that use Illumina or PacBio sequencing for only certain limited phases of the assembly, Alpaca uses the full capabilities of the data throughout the entire assembly process: 1) contig structure is primarily formed by long reads that are error corrected by short reads, 2) consensus accuracy is maximized by the highly accurate base calls in Illumina SIPE reads, and 3) scaffold structure is enhanced by Illumina LIPE that can provide high-coverage connectivity at scales similar to the PacBio long reads. We targeted low-coverage, long-read data in order to make the pipeline a practical tool for non-model systems and for surveys of intraspecific structural variation.

We evaluated the performance of Alpaca using data from Oryza sativa Nipponbare (rice), assembling the genome sequence of the same O. sativa Nipponbare accession used to construct the 382 Mbp reference, which had been constructed using clone-by-clone assembly, Sanger-sequenced BAC ends, physical and genetic map integration, and prior draft assemblies [29]. We also sequenced and assembled three accessions of Medicago truncatula, a model legume, and compared these to the M. truncatula Mt4.0 reference assembly of the A17 accession [30]. The Mt4.0 reference had been constructed using Illumina sequencing, an ALLPATHS assembly, Sanger-sequenced BAC ends, a high-density linkage map, plus integration of prior drafts that integrated Sanger-based BAC sequencing and optical map technology [31].

For the Medicago analyses, where no high-quality reference sequence was available for the accessions whose genomes we assembled, we focused our evaluation on the performance of Alpaca on two large multigene families: the NBS-LRR family, which plays important roles in plant defense, and the Cysteine-Rich Peptide (CRP) family, which functions in various regulatory processes involving cell-to-cell communication. Members of these multigene families are highly clustered; the reference genome of M. truncatula harbors more than 846 NBS-LRR genes, approximately 62% of them in tandemly arrayed clusters, and 1415 annotated CRP genes, approximately 47% of them in tandemly arrayed clusters. Resolving variation in gene clusters like these is crucial for identifying the contribution of copy number variation (CNV) to phenotypic variation as well as for understanding the evolution of complex gene families.
