Why you should QC your reads AND your assembly?

Carp genome: http://www.ntv.cn/a/20140923/52953.shtml

The data quality control behind the sequencing of the carp genome has been called into question.

http://grahametherington.blogspot.co.uk/2014/09/why-you-should-qc-your-reads-and-your.html

The genome sequence of the Common Carp, Cyprinus carpio, was published in Nature last week. By coincidence, I was doing some QC on some domesticated ferret (Mustela putorius furo) reads, which had thrown some k-mer warnings in the FastQC tool. I BLASTed the k-mers at NCBI and was quite perplexed by the number of hits I found in the carp genome: nearly all of the first 150 hits came from it. Anyway, I looked a bit further into my odd k-mers, and it turns out that they were the ends of some Illumina adapter sequences that had presumably been incorporated into the paired reads from inserts at the shorter end of the insert-size distribution. This then took me back to the carp genome: what had crept into that?

In the paper, the authors state that they used 454, Illumina and SOLiD sequencing, and also some previously published BAC-end sequences. The BAC-end and 454 sequences were assembled with the Celera assembler, and the Illumina, SOLiD and 454 8 kb mate-pair sequences were mapped to the assembly to construct the scaffolds. Finally, they used the paired-end information from the short paired-end reads to fill the gaps between the scaffolds. The final assembly consists of 9377 scaffolds.

The only quality control they speak of is "We then filtered out low-quality and short reads to obtain a set of usable reads".

So I thought I'd look at what was actually in their assembly. I downloaded the carp genome assembly (9377 scaffolds) and created a BLAST database from it. I then created a FASTA file of Illumina adapter sequences (found here) and used them as query sequences to BLAST against the carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, retaining only unique sequences, and then removed any adapter sequence that was a sub-sequence of a longer adapter (the final file consisted of 81 sequences). The BLAST search resulted in 3750 hits (E-value < 8.00E-06), of which 1009 were of 100% identity.
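The collapsing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script actually used for the post: the function names `read_fasta` and `collapse_adapters` are made up here, the comparison is a plain case-insensitive string match, and reverse complements are not considered.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def collapse_adapters(records):
    """Drop exact duplicates, then drop sequences contained in a longer one.

    Hypothetical helper: the first name seen for each sequence is kept.
    """
    unique = {}
    for name, seq in records:
        unique.setdefault(seq, name)
    kept = []
    # Visit longest sequences first, so anything that could contain the
    # current sequence has already been considered.
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return [(unique[seq], seq) for seq in kept]
```

Run on the adapter FASTA file, the output of `collapse_adapters` would then be written back out as the query file for the BLAST search.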

This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final Common Carp genome assembly. Out of the 9377 scaffolds, 277 appear to have Illumina adapter sequences in them. I've included the counts of the different (non-redundant) Illumina adapter sequences for the scaffolds at the bottom of the page.
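For anyone wanting to repeat this kind of tally, here is a rough Python sketch. It assumes the BLAST results were saved in the standard 12-column tabular format (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the format of the results file linked from this post may differ, and the function name `tally_hits` is invented for the example.

```python
from collections import Counter


def tally_hits(blast_tsv, max_evalue=8e-6):
    """Summarise tab-separated BLAST hits with an E-value below max_evalue.

    Assumes 12-column tabular output: column 1 is the adapter (query),
    column 2 the scaffold (subject), column 3 percent identity and
    column 11 the E-value.
    """
    per_adapter = Counter()
    scaffolds = set()
    total = perfect = 0
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        adapter, scaffold = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if evalue >= max_evalue:
            continue
        total += 1
        if identity == 100.0:
            perfect += 1
        per_adapter[adapter] += 1
        scaffolds.add(scaffold)
    return {
        "hits": total,
        "perfect_hits": perfect,
        "adapters_found": len(per_adapter),
        "scaffolds_hit": len(scaffolds),
        "per_adapter": per_adapter,
    }
```

Note that the identity column refers to the aligned region only, so a 100% identity hit does not necessarily cover the full length of the adapter; checking the alignment length against the adapter length would be a sensible extra filter.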

I've not looked for adapter sequences used in SOLiD or 454 sequencing yet. It would be interesting to see what that throws up.

So, a lesson to be learned here. QC your assembly, especially if you're not overly stringent with your read QC.

Here's the data:
Common Carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v Carp genome blast