PBSIM2: a simulator for long read sequencers with a novel generative model of quality scores
Abstract
Recent high-throughput long-read sequencers, such as PacBio and Oxford Nanopore sequencers, produce longer reads with more errors than short-read sequencers. In addition to the high error rates of reads, non-uniformity of errors leads to difficulties in various downstream analyses using long reads. Many useful simulators, which characterize long-read error patterns and simulate them, have been developed. However, there is still room for improvement in the simulation of the non-uniformity of errors.
To capture the characteristics of errors in reads from long-read sequencers, we introduce a generative model for quality scores that uses a hidden Markov model (HMM) combined with a recent model selection method, the factorized information criterion. We evaluated the simulator from several perspectives, and the results indicate that it successfully simulates reads that are consistent with real reads.
1 Introduction
High-throughput DNA sequencing technology has markedly changed the style of biological research, from hypothesis-driven biology to data-driven biology.
Notably, recent advances in long read sequencers, including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (Nanopore), have accelerated studies on the genome (Chaisson et al., 2015; Korlach et al., 2017; Jain et al., 2018; Bowden et al., 2019; Dilthey et al., 2019), epigenome (Simpson et al., 2017), and transcriptome (Weirather et al., 2017), among others (van Dijk et al., 2018; Mantere et al., 2019).
It is known that reads generated by long read sequencers include more errors than those generated by short read sequencers (e.g., Illumina HiSeq), and many tools and algorithms that specifically target long read sequencers have been developed (Sedlazeck et al., 2018; Makałowski and Shabardina, 2019; Amarasinghe et al., 2020).
However, when developing tools/algorithms for long read sequencers, it is generally difficult to evaluate them using real data.
This is because real data that meet the necessary conditions cannot always be prepared; in addition, the true error information of real data is not easy to obtain.
Therefore, simulators that generate reads with error information, such as alignments between reads and the reference sequences, are useful for the evaluation of new tools/algorithms.
(See Escalona et al. (2016); Alosaimi et al. (2020) for comprehensive reviews of read simulators.)
Moreover, these simulators are useful for experimental design such as estimating the depth coverage required for genome assembly and variant detection.
To make this possible, it is crucial to be able to properly simulate the characteristics of real reads, especially the characteristics of errors.
PacBio sequencers have fewer systematic (or context-specific) errors (e.g., errors in high- and low-GC regions and at homopolymer runs) than short read sequencers, such as Illumina (Eid et al., 2009; Ross et al., 2013; Laehnemann et al., 2016).
In contrast, it has been reported that PacBio reads have a regional bias of error distribution within the reads, and very low-quality regions are sometimes observed.¹
Low-quality regions are caused by chimeras and undetected adapter sequences, as well as non-uniformity of errors.
Figure 1 clearly shows the non-uniformity of quality scores through the distributions of accuracy over disjoint 800 bp intervals within reads.² ‘Random models’ generate quality scores randomly according to the real frequencies of quality scores, leading to a normal distribution of quality scores.
Compared with random models, the distributions of real reads have broader ranges of accuracy over 800 bp intervals, especially for low read accuracy.
Our previously developed simulator, PBSIM (Ono et al., 2013), employs a random model (Eid et al., 2009), and the reads generated by it are simpler and easier to handle than real reads; this is a problem when evaluating the tools/algorithms for long read sequencers.
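The narrow accuracy distribution of a random model can be illustrated with a minimal sketch. It assumes that the accuracy of an interval is one minus the mean Phred-derived error probability of its positions; the frequency table (`SCORES`, `WEIGHTS`) and all function names are hypothetical and are not taken from PBSIM or from real data.

```python
import random

def phred_to_error_prob(q):
    """Convert a Phred quality score q to an error probability 10^(-q/10)."""
    return 10 ** (-q / 10)

def window_accuracies(quals, window=800):
    """Return the mean accuracy of each full, disjoint window of a read."""
    accuracies = []
    for start in range(0, len(quals) - window + 1, window):
        chunk = quals[start:start + window]
        mean_error = sum(phred_to_error_prob(q) for q in chunk) / window
        accuracies.append(1.0 - mean_error)
    return accuracies

def random_model_read(length, scores, weights, rng):
    """Random model: draw every position independently from one frequency table."""
    return rng.choices(scores, weights=weights, k=length)

# Hypothetical frequency table; not taken from real PacBio or Nanopore data.
SCORES = [2, 5, 8, 11, 14]
WEIGHTS = [1, 2, 4, 6, 4]

rng = random.Random(0)
quals = random_model_read(8000, SCORES, WEIGHTS, rng)
accs = window_accuracies(quals)
print(f"{len(accs)} windows, accuracy range {min(accs):.3f} to {max(accs):.3f}")
```

Because every position is drawn independently, averaging over 800 positions pulls each interval toward the same mean, so the interval accuracies cluster tightly. This is the behaviour that real reads, with their broader accuracy ranges, do not show.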
Currently, there are several simulators that generate long reads (see Supplementary Table S1 for summary).
With regard to the simulation of low-quality regions, NanoSim (Yang et al., 2017) generates a set of read profiles from alignment-based analysis, and simulates low-quality regions using the profiles.
PaSS (Zhang et al., 2019) adopts preset high error rates for both ends of the reads, to simulate low-quality regions. Badread (Wick, 2019) can introduce chimeras, adapter sequences, low-quality regions, and low-complexity repetitive sequences into simulated reads.
However, there is still room for improvement in the simulation of the non-uniformity of errors (or quality scores).
To simulate the non-uniformity of quality scores, in this study, we developed a generative model for quality scores, based on a hidden Markov model in combination with a recent model selection criterion.
Our computational experiments show that PBSIM2, the new version of PBSIM, simulates reads that have a tendency similar to real reads.
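The general idea of generating quality scores from a hidden Markov model can be sketched as follows. This is only an illustration under stated assumptions: the two hidden states, the transition and emission probabilities, the score alphabet and the function name `sample_hmm_quals` are all hypothetical, and the sketch does not reproduce the trained models or the model selection step of PBSIM2.

```python
import random

def sample_hmm_quals(length, init, trans, emit, scores, rng):
    """Sample a quality-score sequence from a discrete HMM.

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][k]  : probability that state i emits scores[k]
    """
    states = range(len(init))
    state = rng.choices(states, weights=init)[0]
    quals = []
    for _ in range(length):
        # Emit a score from the current state, then move to the next state.
        quals.append(rng.choices(scores, weights=emit[state])[0])
        state = rng.choices(states, weights=trans[state])[0]
    return quals

# Hypothetical parameters: state 0 favours high scores, state 1 low scores,
# and both states are "sticky" so that runs of similar quality are produced.
SCORES = [3, 8, 13]
INIT = [0.8, 0.2]
TRANS = [[0.999, 0.001], [0.005, 0.995]]
EMIT = [[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]

rng = random.Random(0)
quals = sample_hmm_quals(8000, INIT, TRANS, EMIT, SCORES, rng)
print(len(quals), sorted(set(quals)))
```

Unlike a random model, consecutive positions here share a hidden state for long stretches, so a read can contain extended regions of lower or higher quality. In practice the number of states and the probabilities would be learned from real reads rather than set by hand.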
This article is organized as follows: In Section 2, after introducing a novel generative model for quality scores, we describe the detailed design of PBSIM2.
In Section 3 we report comprehensive evaluations of PBSIM2 and related discussions.
PBSIM2 adds the ability to simulate Nanopore reads, whereas it removes the ability to simulate circular consensus sequencing (CCS, also known as HiFi) reads.
This is because the average accuracy of CCS reads exceeds 99%, which is outside the purpose of PBSIM, namely the simulation of error-prone reads. PBSIM2 is freely available from https://github.com/yukiteruono/pbsim2, and it will be useful for various studies using long reads.
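4 Conclusion
In this study, we presented a new simulator for long reads from PacBio and Nanopore sequencers, in which a novel generative model for quality scores is used.
The novelty of this study is the introduction of a generative model for quality scores based on a hidden Markov model (HMM) and a model selection procedure.
Our experiments showed that the generative model simulates quality scores that are more consistent with real reads of PacBio and Nanopore than other existing simulators.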