Are we there yet? Genome sequencing turns 20

We look back on two decades of genome sequencing and explore four areas where continued progress could provide future breakthroughs

The first draft of the human genome was reported in 2001, just over a decade after the launch of the Human Genome Project in 1990. Since then, our knowledge of the genome has come a long way, taking genomics from a specialist area to one that touches our whole healthcare system. But what are we still working on?

The ‘junk’ that isn’t junk

One of the main roles of DNA is to code for proteins. Together, the parts of the genome that do this are known as the exome, which makes up only around 2% of the human genome. The rest, known as non-coding DNA – or ‘junk DNA’, as it was often called – was thought for a long time to have little value.

As tools to explore the transcriptome – or RNA – have improved, we now understand that many non-coding parts of the genome are transcribed, with the resulting RNAs having important functions in cells. Furthermore, it appears that some non-coding regions of the genome act as promoters, enhancers and suppressors, regulating the activation of nearby genes.

In fact, research suggests that “the vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.”

Non-coding DNA continues to be at the heart of new discoveries about human health and disease, including inflammatory and allergic conditions, and atherosclerosis, as reported recently on this blog.

Heritability of complex diseases

While some conditions, such as achondroplasia and cystic fibrosis, are caused by a limited number of possible genetic variants, other conditions are more complex. For example, diabetes and cardiovascular disease can sometimes be the result of a single genetic variant, but in many cases the cause is much less clear. This is because these conditions are often the result of the interactions of many different genetic variants, combined with environmental factors.

Due to the limitations of sequencing technology, it was previously impossible to pinpoint the many small differences that can add up to large variations in risk for these more complex conditions.

Now, large genomic datasets linked to health records are available because of next-generation sequencing, allowing genome-wide association studies (GWAS) to identify many variants with small effect sizes. These can be combined to create a personalised polygenic risk score, which aims to quantify the cumulative effects of genes on a person’s susceptibility to a condition.
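As a rough illustration of how many small effects can be combined, a polygenic risk score is often calculated as a weighted sum: each variant’s effect size multiplied by the number of copies of the risk allele a person carries. The sketch below assumes that simple additive model. The variant names, effect sizes and genotypes are invented for the example and do not come from any real study, and treating a missing variant as zero copies is a simplification.

```python
def polygenic_risk_score(effect_sizes, genotype):
    """Weighted sum of risk-allele counts under a simple additive model.

    effect_sizes: variant ID -> effect size (for example, a GWAS log odds ratio)
    genotype:     variant ID -> copies of the risk allele carried (0, 1 or 2)
    """
    score = 0.0
    for variant_id, beta in effect_sizes.items():
        # Assumption: a variant missing from the genotype counts as 0 copies.
        dosage = genotype.get(variant_id, 0)
        score += beta * dosage
    return score


# Hypothetical effect sizes and genotype, for illustration only.
effect_sizes = {"variant_a": 0.12, "variant_b": 0.05, "variant_c": -0.03}
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(effect_sizes, person)
print(f"Polygenic risk score: {score:.2f}")
```

On its own, the resulting number means little. In practice a score is usually interpreted by comparing it with the distribution of scores in a reference population.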

Despite the existing benefits of large datasets, work is ongoing to make genomic datasets even more representative and inclusive. This is because many large databases mainly contain genomes from people of European ancestry and so are less useful when applied to other populations. This lack of diversity can have serious consequences, as explained by bioinformatician Nana E Mensah in our recent feature.

Variants of uncertain significance

Our understanding of the genome has increased, but there is still a lot that we do not fully understand. Many genetic variants have been identified, but for a large number of them it is still unclear whether they contribute to disease. These are called variants of uncertain significance.

Even in the BRCA1 and BRCA2 genes – some of the best-studied in the genome – there are variants that we know do not affect the chance of developing cancer, some that we know do, and many for which there is not yet enough evidence to say either way.

Due to the size of the genome, we have not yet been able to study every gene in the same depth as BRCA1 and BRCA2. In some cases, variants can be very rare or unique and, without a prior reference, it can be difficult to know their effects and how they may impact on health.

With time, and with the aid of new sequencing technologies and infrastructure, we will continue to answer more questions about the human genome and the many complex factors that can cause disease, but it is a considerable and ongoing task.

Reference genome: nearly there?

The 2001 draft of the human reference genome, as well as the 2003 follow-up, excluded regions of the genome that were impossible to sequence with the technology available at the time.

The ends (telomeres) and the middle part (centromere) of chromosomes are composed of very long strings of repeating sequences. Although these can be read by next-generation sequencing, it is difficult to then piece the short fragments together into a complete sequence because there is so much repetition.
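The difficulty with repeats can be shown with a toy example. The sketch below builds a made-up sequence containing ten copies of TTAGGG (the repeat unit found in human telomeres) between two invented, non-repetitive flanks, then counts how many places a short read could have come from. The sequences, read lengths and the find_all helper are all assumptions chosen for illustration. Real assemblers are far more sophisticated.

```python
def find_all(genome, read):
    """Return every start position at which `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy sequence: invented unique flank + tandem repeat + invented unique flank.
repeat_unit = "TTAGGG"
genome = "ACGTACCA" + repeat_unit * 10 + "GATTACAG"

unique_read = "GTACCATT"      # overlaps the left flank, so it can be placed
repeat_read = "TTAGGGTTAGGG"  # lies entirely inside the repeat

unique_positions = find_all(genome, unique_read)
repeat_positions = find_all(genome, repeat_read)

print("Read overlapping a flank fits at:", unique_positions)
print("Read inside the repeat fits at:", repeat_positions)
```

The read that overlaps a flank fits in one place only, while the read from inside the repeat fits equally well in several. With nothing to tell those positions apart, short reads alone cannot show how many copies of the repeat there are.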

Now, with the rise of ultra-long-read sequencing, researchers have been able to start filling in these gaps. The first complete centromere sequence was published in 2018 and the first whole chromosome sequence, including its centromere and both telomeres, followed in 2020. A complete reference genome including all the human chromosomes could soon be a reality.

Please note: This article is for informational or educational purposes, and is not a substitute for professional medical advice.