Monday, February 24, 2014

A Sunset for Draft Genomes?

The sun set during AGBT 2014 for a final time over a week ago.  The posters have long been down, and perhaps the liver enzyme levels of the attendees are now down to normal as well.  This year’s conference underscored a possibility that was suggested last year: that the era of the poorly connected, low quality draft genome is headed for the sunset as well.

Early complete genomes were truly complete, using a mix of libraries and other strategies to gain truly closed genomes.  These were monumental achievements and, other than some minor sniping, they were heralded as such.  But they were hard and expensive, and certainly a lot of biological value could be extracted from lower quality genomes, though those also laid traps for careless workers to blunder into and declare artifacts as discoveries.  As short read sequencing enabled fast and cheap genome sequences, quality dipped lower and lower as the cost and labor gap between draft and high quality genomes became enormous.  Strategies such as mate-pair libraries tried to bridge the gap, but tended to disappoint and fall short of the goal of driving genomes to completion.

Two recent genomes illustrate what has been commonplace.  The duckweed genome is 158Mb, or about the same size as Drosophila, and was assembled using Illumina paired end reads and 454 mate pairs plus BAC end sequencing.  The mean contig size after this was only 8.2Kb (contig N50 was not given), and the assembly came to 1071 scaffolds.  The flatfish genome comes in at over 400Mb, was assembled using a gemisch of read types, and attained a contig N50 of 26.5Kb and scaffold N50 of 868Kb.  That is, assuming this all came out correctly; a major revision of the Aedes aegypti genome assembly was just announced that made many edits to the scaffolds.

What has radically changed in the last year and change is the emergence of true long read sequencing, spearheaded by Pacific Biosciences.  Bolstered by clever software, long reads have democratized high quality microbial sequencing.  With the newest chemistries and RS II instruments, a single SMRT cell is sufficient to obtain a completely closed genome of many bacteria of interest for less than $1K.  While short read sequencing can still deliver some sort of assembly for significantly less, the price gap has narrowed to a few fold, or several hundred dollars.

At AGBT, the building momentum for larger genomes was demonstrated by William McCombie (CSHL) and others.  McCombie showed assemblies of S. cerevisiae and S. pombe which resolved nearly all chromosomes to single contigs; a few were represented by two contigs, one for each arm.  Arabidopsis and Drosophila have shown very impressive results, and his group’s results on the over half gigabase rice genome show that high continuity assemblies are possible even for large eukaryotic genomes.  Coupled with Jason Chin showing off a diploid assembler for long reads, Gene Myers announcing a much faster read cleaning pipeline and a Japanese group presenting a poster on a read cleaning pipeline that works with less coverage, the future is very rosy for long read assembly.  Moore’s Law helps the sequencing world here too: genome sizes are not doubling every year!  So the hardware to process these will continue to get cheaper.

Coming down from above, Josh Burton discussed his Lachesis software, which can use Hi-C data to generate chromosome arm (or whole chromosome) scaffolds.  Hi-C is a library preparation method from which the read pairs indicate regions of DNA that were physically proximate in the eukaryotic nucleus.  The signal is a bit faint, but with scaffolds of 50kb or greater Lachesis shows a strong ability to detect that signal and organize the data.

There is still a cost difference, which can be extreme for larger genomes.  But as PacBio continues to improve their chemistry and preparation protocols, that will shrink.  Right now, any genome under about 6Mb can be reasonably expected to be closed for under $1K on PacBio, though this depends on the quality of the DNA preparation, and care must be taken or small plasmids can be lost.  PacBio library prep is more sensitive than short read preps to various contaminants, but this problem appears to be limited in impact.  PacBio has announced an intention to improve throughput by 4X this year, and while that is a projection, they did meet their goals last year.  So, by the end of the year it is not unreasonable to think that any bacterium and many smaller eukaryotic genomes can be had for a single SMRT cell, which is around $600 with library prep, and $10K would be enough to cover genomes up in the 150Mb range or more.  Using high-coverage Illumina data to reduce the coverage requirements is another option, with a number of PacBio-with-Illumina read cleaning pipelines available.  This is all without mate pair libraries, sequencing the ends of cosmids or BACs, or optical mapping.  It will also be a relief for everyone who can't remember how to calculate N50s; when your contigs or scaffolds are chromosome-scale, the N50 statistic is meaningless!
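For anyone who really can't remember how to calculate an N50, here is a minimal Python sketch.  It assumes the common definition: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size.  The function name and the toy lengths are made up for illustration and do not come from any of the assemblies mentioned above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: the length L such that contigs of length >= L
    together cover at least half of the summed assembly length.
    """
    ordered = sorted(lengths, reverse=True)  # longest first
    if not ordered:
        raise ValueError("need at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150 passes half of 290
```

Note that a single chromosome-scale contig simply reports its own length, which is one way to see why the statistic stops telling you much once assemblies get that good.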

There is also the very real possibility of other long read technologies getting into the game.  Illumina’s Moleculo technology has been used for eukaryotic assembly, though it has not been put head-to-head with PacBio and the nature of some repeats and PCR-unfriendly regions would make it unsurprising if it were not quite as good.  But Moleculo may be much cheaper for large genomes, though I don’t believe the values to calculate the cost tradeoffs are widely available.  Oxford Nanopore data was seen for the first time at AGBT, and while it was a bit of a lackluster debut, the ability to resolve repeats with this data was demonstrated.  With Oxford’s MinION access program scheduled to get devices to users next month, and a permissive policy for data release on users’ samples, the sequencing world should soon be awash in MinION data (several Tweeters have already pledged release of their data).

So genome assembly remains in flux, but long reads are taking over and will simply get better and cheaper.  Given this, isn’t it time to start making the push to make poor quality draft genomes less respectable?  While there will remain hard cases, such as metagenomic samples and single cell genomes or when surveying many related isolates, the time would seem nigh to make high quality references the standard.  For prokaryotes, it really is reasonable now to expect closed genomes as typical.  For small eukaryotic genomes, say under 50Mb, complete chromosome arms should be the standard, and for larger genomes that should still be a goal. For larger eukaryotes, scaffolding with Lachesis and Hi-C libraries should be expected.  

Such a change won't happen overnight, but it needs to start happening.  Journal editors, reviewers, lab heads, staff scientists, post-docs, graduate students and everyone else who wants to lead the charge should do so. Dispelling the current complacency with low quality drafts requires active effort. But the rewards are substantial, so it is a crusade worth joining!


Keith Robison said...

Jonathan Eisen Storified the Twitter commentary on this post.

Rick said...

I can't wait. Currently I am working on a ~65Kb genome which despite Illumina paired-end and mate-paired reads is refusing to close up. We may need more sequencing done but the PI is being cheap. :-( Anyway our first attempt with just PEs was rejected by a low-tier journal because the genome was not complete enough. Despite the added work I silently applauded. There are too many draft, or sub-draft, genomes out there.