Abstract
While second-generation sequencing led to a vast increase in sequenced data, the shorter reads that came with it made assembly a much harder task and, for some regions, impossible with short-read data alone. This changed again with the advent of third-generation long-read sequencers. The length of the long reads allows much better resolution of repetitive regions; their high error rate, however, is a major challenge. Using the data successfully requires removing most of the sequencing errors. The first correction methods were hybrid: they used low-noise second-generation data to correct third-generation data. This approach has issues when repeats make it unclear where to place the short reads, and also because second-generation sequencers fail to sequence some regions that third-generation sequencers can cover. Non-hybrid methods appeared later. We present a new method for non-hybrid long-read error correction based on De Bruijn graph assembly of short windows of long reads, followed by combination of these corrected windows into corrected long reads. Our experiments show that this method yields better correction than other state-of-the-art non-hybrid correction approaches.
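To make the window-level idea concrete, the following is a minimal sketch of De Bruijn graph consensus for a single short window. It is not the implementation described in this work. It assumes that the fragments of several overlapping long reads covering the same window have already been extracted, and it swaps in a simple greedy heaviest-edge walk in place of whatever traversal the actual method uses. The function name `window_consensus`, the parameters `k` and `min_count`, and the toy fragments are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def window_consensus(fragments, k=4, min_count=2):
    """Greedy De Bruijn consensus of one window (illustrative assumption,
    not the authors' algorithm).

    fragments: noisy substrings of different long reads covering one window.
    k:         k-mer length (hypothetical default).
    min_count: k-mers seen fewer times are treated as sequencing errors.
    """
    # Count every k-mer over all fragments of the window.
    counts = Counter(
        frag[i:i + k] for frag in fragments for i in range(len(frag) - k + 1)
    )
    # Keep only "solid" k-mers; rare ones are assumed to contain errors.
    solid = {kmer: c for kmer, c in counts.items() if c >= min_count}
    if not solid:
        return ""

    # De Bruijn graph: nodes are (k-1)-mers, each solid k-mer is an edge.
    successors = defaultdict(list)
    has_incoming = set()
    for kmer, c in solid.items():
        successors[kmer[:-1]].append((c, kmer))
        has_incoming.add(kmer[1:])

    # Start at a node without incoming edges; fall back to the prefix of
    # the most frequent k-mer if the graph has no such node (a cycle).
    sources = [node for node in successors if node not in has_incoming]
    if sources:
        start = max(sources, key=lambda n: max(successors[n])[0])
    else:
        start = max(solid, key=solid.get)[:-1]

    # Follow the heaviest unused edge until a dead end is reached.
    node, path, used = start, start, set()
    while True:
        candidates = [
            (c, kmer) for c, kmer in successors.get(node, []) if kmer not in used
        ]
        if not candidates:
            break
        _, kmer = max(candidates)
        used.add(kmer)
        path += kmer[-1]
        node = kmer[1:]
    return path


# Toy window: two clean copies, one substitution, one deletion.
fragments = [
    "ACGTTGCATGGA",
    "ACGTTGCATGGA",
    "ACGTAGCATGGA",  # substitution T -> A
    "ACGTTGCTGGA",   # deletion of one A
]
print(window_consensus(fragments))  # ACGTTGCATGGA
```

In this toy case every erroneous k-mer occurs only once, so the frequency filter leaves a single path through the graph and the walk recovers the error-free window. With real long reads, whose error rates are far higher, the choice of `k`, the filtering threshold, and the traversal strategy would all need much more care than this sketch suggests.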