Genomic datasets are widely used by researchers, but are they representative enough? We look at recent research that aims to shift the balance
A collaboration led by researchers in Germany and the US has published a diverse set of newly-sequenced human genomes.
The research, published in Science, included complete genomes from 32 individuals from representing populations in Africa, the Americas, East and South Asia, and Europe.
As previously covered in this blog, genomic databases still disproportionately represent people of European ancestry. Because of this, the researchers wanted to create a dataset that would more effectively capture the genetic differences in a variety of human populations, in order to gain better insights into disease association.
The new dataset represented 26 different populations worldwide and was assembled independently of the original composite human reference genome.
Other initiatives such as the Human Genome Structural Variation Consortium and Human Pangenome Reference Consortium intend to use similar approaches to generate 500 more globally-representative human genomes, adding to the diversity of resources available for future studies.
The availability of more diverse genomic data will allow researchers to look at genetic predisposition to disease in specific populations, as well as the impact of human genome variation on health.
The genomes in the study were presented as 64 haplotype assemblies. European Molecular Biology Laboratory (EMBL) group leader, Dr Jan Korbel, explained: “For each human individual that participated in the study, we identified not one but two genomes – one for each set of chromosomes. Previously we could not distinguish whether genetic variation came from one chromosome set or the other.”
Working with haplotype assemblies can provide a more accurate picture of genomic variation within populations and therefore contributes to improved genomic diversity.
Until recently, understanding which parts of a person’s genome were inherited from each parent also meant sequencing all three of their genomes. These three samples are known collectively as a ‘trio’ and are frequently used in genomic tests in healthcare settings.
The researchers showed that it is now possible to understand where genetic material comes from without needing a trio, which could lead to a quicker turnaround of results and the ability to work with more samples where the genomic information of the parents is unavailable. Trio sequencing will, however, most likely continue in medical settings for the foreseeable future.
Structural variants: a sequencing challenge
Over the past 20 years, a great deal of work has gone into identifying and cataloguing differences between the human reference genome and the genomes of individuals.
Many of the differences that have been successfully identified so far are changes to single base pairs, called single nucleotide polymorphisms (SNPs). Next-generation sequencing is effective at finding SNPs but far less so at locating other types of variation, especially those longer and more complex regions, called structural variants.
Structural variants tend to be chunks of DNA that have been added (insertions), are missing (deletions) or repeated (duplications), or are the wrong way round (inversions). These are overrepresented in the genomes of people with certain types of disease, and so are worthy of investigation.
The research found that over two-thirds of discovered variations would not have been discoverable by other sequencing methods, including over 100,000 structural variants over 50 base-pairs long.
A realistic global picture
The study demonstrated three different approaches that can help to provide genomic data that is more diverse:
- Sequencing the genomes of different populations from around the world.
- Using haplotype assembles to provide a more accurate picture without the guidance of a reference genome or a reliance on trios
- Using long-read sequencing to resolve regions of the genome that would be problematic using other next-generation sequencing methods.
Over time, the use of these approaches will lead to datasets that are more representative of real-world diversity, enabling the development of more effective treatments for a wider variety of people.