Long-read sequencing: the next next generation?
The ability to accurately sequence large sections of DNA is important in some areas of healthcare, and is enabling progress in others
Next-generation sequencing (NGS), the technology used for the 100,000 Genomes Project, involves cutting DNA into small fragments (several hundred base pairs) and amplifying them (copying them many times) before the sequencer reads them and the genome is pieced back together using bioinformatics software.
NGS has brought many benefits, including substantial increases in the speed at which genomes can be sequenced. But it has limitations, and long-read sequencing can be a better option in certain situations.
Long-read, or third-generation, sequencing involves reading sequences of between 10,000 and 100,000 base pairs in one go (although much longer reads have also been reported), without the need to cut up and amplify DNA samples.
One of the main benefits is that because the genome sequence is assembled from much larger pieces, opportunities for error and uncertainty are greatly reduced.
Long-read sequencing has promising applications not just in genomics but also in transcriptomics, as it can potentially read complete RNAs.
NGS is very good at finding small variations in DNA, such as changes to a single letter of the genetic code, but there are other types of variation that long-read sequencing can identify better. These include changes where large sections of DNA are inserted, deleted or moved around, as well as copy number variations (CNVs).
CNVs occur where a sequence of base pairs in the DNA is repeated, and where the number of repeats varies between people. It is estimated that CNVs may account for as much as 12% of the genome, and they can have significant health implications. For example, the short sequence ‘CAG’ is repeated near the start of the huntingtin gene, and the number of repeats determines whether a person will develop Huntington’s disease.
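To make the idea of counting repeats concrete, here is a minimal Python sketch. It assumes a read long enough to span the whole repeat region is already available as a string. The function name `longest_cag_run` and the two toy sequences are invented for illustration; they are not real gene data, and the repeat counts of 20 and 45 are arbitrary example values, not clinical thresholds.

```python
import re

def longest_cag_run(sequence: str) -> int:
    """Return the number of 'CAG' units in the longest uninterrupted run."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)

# Toy sequences, made up for this example (hypothetical flanking DNA).
flank_left, flank_right = "ATGGCG", "CCGCCA"
fewer_repeats = flank_left + "CAG" * 20 + flank_right
more_repeats = flank_left + "CAG" * 45 + flank_right

print(longest_cag_run(fewer_repeats))  # 20
print(longest_cag_run(more_repeats))   # 45
```

The count is only this simple because the whole repeat, plus some flanking sequence on each side, sits inside a single read.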
Long-read sequencing is considered to be better for sequencing repetitive DNA because it is less prone to assembly errors: when smaller reads are reassembled using bioinformatics software, repeats may be overlooked or duplicated.
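The following Python sketch illustrates why short reads struggle with repeats. It is a deliberately idealised model, not how real assembly software works: reads are error-free, every possible read is observed, and coverage depth (which real tools can use as a hint about copy number) is ignored. The repeat unit, the flanking sequences and the read lengths are all hypothetical.

```python
def all_reads(genome: str, read_length: int) -> set:
    """Every substring of the given length: an idealised, error-free set of reads."""
    return {genome[i:i + read_length] for i in range(len(genome) - read_length + 1)}

# Two toy "genomes" that differ only in how many times the unit is repeated.
unit = "ACGTTGCA"                          # hypothetical 8-letter repeat unit
left, right = "G" * 10, "T" * 10           # hypothetical flanking sequence
genome_a = left + unit * 10 + right        # 10 copies, 100 letters in total
genome_b = left + unit * 12 + right        # 12 copies, 116 letters in total

short_len = 12    # shorter than the repeat region
long_len = 100    # long enough to span genome_a's repeat region and both flanks

# Short reads: both genomes yield exactly the same set of reads.
print(all_reads(genome_a, short_len) == all_reads(genome_b, short_len))  # True

# Long reads: the sets differ, so the two copy numbers can be told apart.
print(all_reads(genome_a, long_len) == all_reads(genome_b, long_len))    # False
```

In this toy model, no read shorter than the repeat region can connect one flank to the other, so the short reads alone say nothing about how many copies lie in between.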
Shedding light on new regions of the genome
In 2018, researchers used a long-read technology called nanopore sequencing to produce an accurate reference map of the centromere of the human Y chromosome for the first time. Their results were published in Nature Biotechnology.
Centromeres are crucial parts of the genome that ensure each pair of chromosomes is properly aligned for cell division. Centromere malfunction is still poorly understood, but it is thought to have a role in the development of some cancers and possibly also in causing disorders in which an embryo is formed with too many or too few chromosomes, such as Down’s syndrome.
When the Human Genome Project was originally declared complete in 2003, the reference genome that had been generated did not include centromeres; contemporary whole-genome sequencing does not include them either. This is because centromeres are full of near-identical repeating sequences which are millions of base pairs long, thus making it impossible to reassemble smaller fragments with any degree of certainty – a challenge likened to completing a jigsaw puzzle made of a picture of a clear blue sky.
Because long-read sequencing divides the puzzle into larger pieces, it should make the centromeres much easier to understand. And at a time when the drive to understand more about genomics and its impact on health is stronger than ever, long-read sequencing looks set to become a key part of scientists’ toolkit.