Reference genome: defining human difference

In order to make assertions about genomic variation and health, scientists rely on the ‘reference genome’ – but how definitive is it?

The president of one of the major players in next generation sequencing estimated several years ago that by 2017 1.6 million human genomes would have been sequenced. The information gathered is allowing us to make great advances in understanding the differences between people that contribute to health and disease.

Every time a genome as large as ours is sequenced, the genetic material must be broken up into short overlapping fragments, often numbering in the millions. Once the genome has been sequenced, the readings of these individual fragments need to be put back together so that scientists can analyse them, looking for variation in regions of DNA that could have an impact on our health. To do this, scientists refer to what is known as a ‘reference genome’ – a template genome incorporating the most up-to-date information we have on human genomics. The current human reference genome build, known as GRCh38, is used worldwide, but whose genome is it based on and how reliable is it? To answer these questions, we must first go back to 2003 and the completion of the Human Genome Project (HGP).
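To make the idea of matching fragments to a reference concrete, here is a minimal sketch in Python. It is a toy illustration only: the sequences are invented, the function names (`map_read`, `call_variants`) are hypothetical, and it handles single-letter substitutions by brute force. Real alignment software uses indexed data structures and also copes with insertions, deletions and sequencing errors.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best placement of a read,
    or None if no placement is within max_mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def call_variants(reads, reference, max_mismatches=1):
    """List (position, reference_base, sample_base) for every mismatch
    seen in reads that could be placed on the reference."""
    variants = set()
    for read in reads:
        hit = map_read(read, reference, max_mismatches)
        if hit is None:
            continue  # read could not be placed; a real pipeline would log it
        pos, _ = hit
        for offset, base in enumerate(read):
            ref_base = reference[pos + offset]
            if base != ref_base:
                variants.add((pos + offset, ref_base, base))
    return sorted(variants)


# Invented toy data: the sample differs from the reference at index 13 (A -> C).
reference = "ACGTTGCAAGCTTAGGCTAACGTA"
sample = "ACGTTGCAAGCTTCGGCTAACGTA"

# Break the sample into overlapping 10-letter fragments, as a sequencer would.
reads = [sample[i:i + 10] for i in range(0, len(sample) - 9, 4)]

print(call_variants(reads, reference))
```

The point of the sketch is that every difference is reported relative to the reference, so an error or gap in the reference would show up as a misleading "variant" in every sample compared against it.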

The Human Genome Project

It took 15 years and an international effort to produce the first draft of the human genome. For the very first time, nearly all the three billion letters that make up the genome had been mapped using Sanger sequencing. The published version of the human genome was 99% complete and formed the basis for the reference genome used today. It didn’t represent the genome of one individual but was built using information from the DNA of several volunteers living near the laboratories involved in the project. The identities of those who participated have never been revealed; even the participants themselves do not know if their DNA was used to produce the final published genome.

Since the completion of the project, there have been many iterations of the human reference genome in line with new scientific findings. It is curated by the Genome Reference Consortium (GRC), which is made up of five teams, including the National Center for Biotechnology Information (NCBI) in Maryland, USA, The Genome Institute at Washington University, and the Wellcome Trust and European Bioinformatics Institute, based in the UK. For over a decade it has been the GRC’s job to ensure that the reference genome is revised and updated regularly as new information emerges. This is imperative, as any inaccuracies in the reference genome will affect the inferences that are made about genomes compared against it.

Build 38

Released in 2014, GRCh38 is the most accurate and up-to-date version of the human genome in the world. Prior to this, the last build (GRCh37) was released in 2009. In the interim, the reference genome was constantly refined and, in places, added to through updates known as ‘patches’. Build 38 is updated four times a year. The GRC also relies on collaborators to identify and report problems within the reference sequence. Build 38 was a significant ‘upgrade’, and due to its accuracy and reputation it is the ‘go-to’ reference for many large-scale projects, including the UK’s 100,000 Genomes Project.

Limitations

Despite its undoubted value, GRCh38 still has limitations. First, it is based on data collected from the HGP, which focused on a small number of people from a small number of countries, and is therefore not truly representative of human diversity. In an effort to work towards a more representative human reference genome, the GRC engages with clinicians and researchers to ensure data from ethnically diverse populations is included. Some parts of the genome show so much ‘normal’ diversity between individuals that it is necessary to build alternate scaffolds so that all known benign or normal variations can be aligned accurately. When GRCh37 was initially released it had nine alternate scaffolds, but as time passed and more diversity was identified, this figure rose to 60. In comparison, GRCh38 has 261 alternate scaffolds – indicative of the incredible advances in our understanding over a relatively short period of time.

The second limitation of GRCh38 is that it has 603 ‘gaps’. These gaps represent those regions of the genome that are particularly difficult to sequence. However, as technology improves and scientists learn more, the gaps in the reference genome are gradually being filled. As an example, the most recent build was the first to include centromere sequences – highly repetitive regions that are millions of base pairs long. Known for a long time to have a structural role within the cell, centromeres are believed by some scientists to be “a major source of sequence variation” between individuals – and therefore to have the potential to influence health.

Lastly, as the human reference genome is a mosaic, built from information from multiple ‘donors’, it is not representative of one actual complete set of chromosomes, and this can cause a number of issues when matching newly sequenced genomes to it. The ideal would be a reference sequence built from one complete set of chromosomes, and scientists are trying to do exactly that using the genome from a non-viable but fertilised egg called a hydatidiform mole. These cells have 23 pairs of chromosomes, and the two chromosomes in each pair are identical. Because there is only one version of each DNA sequence to resolve, building a reference for each of the chromosomes is a lot easier.

Looking forward

Research teams are working towards building the first ever complete human reference genome – an assembly that has been provisionally named the platinum genome.

One way researchers are trying to reach this goal is by sequencing longer fragments of DNA. Up until now, technological limitations and cost have prevented the use of longer reads in genome sequencing, but there is now technology in use that can sequence fragments thousands of base pairs long – a vast improvement compared to the 300-500 base pair fragments sequenced routinely. If longer fragments can be sequenced, it is anticipated that there will be fewer gaps, making it easier to piece genomes back together; if a plate smashes into only a few pieces, it will be easier to fix than one that has smashed into hundreds of pieces. This technology is being used to look at particularly tricky regions of GRCh38, and is contributing to a clearer picture of the human genome in its entirety.
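The benefit of longer fragments in repetitive regions can be shown with a small Python sketch. Everything here is assumed for illustration: the sequence is invented, the repeat is only 12 letters long rather than the millions of base pairs found in centromeres, and `unique_fraction` is a hypothetical helper, not part of any sequencing toolkit. It simply asks what share of fragments of a given length occur in exactly one place.

```python
from collections import Counter


def unique_fraction(reference, read_length):
    """Fraction of all possible reads of this length that occur
    exactly once in the reference, so can be placed unambiguously."""
    windows = [
        reference[i:i + read_length]
        for i in range(len(reference) - read_length + 1)
    ]
    counts = Counter(windows)
    unique = sum(1 for w in windows if counts[w] == 1)
    return unique / len(windows)


# Invented toy sequence: a 12-letter "CA" repeat between two unique flanks.
reference = "ATTG" + "CA" * 6 + "GTCC"

for length in (4, 8, 14):
    print(length, round(unique_fraction(reference, length), 2))
```

Fragments shorter than the repeat can fall entirely inside it and fit in several places, like identical-looking shards of the plate. Once a fragment is longer than the repeat, it must include some of the unique sequence on either side, and so it can be placed in only one position.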

The current human reference genome “is still the benchmark by which all other human assemblies must be compared”. It has underpinned numerous studies looking at our evolution and development, and has been invaluable in the study of human variation and disease. As more genomes are sequenced, the full extent of human diversity will become apparent. At first, new findings can cause more confusion, but scientists believe that with advances in technology and collaboration between genomic studies across the world, we are well on the way to building the first platinum genome.