Featuring: Joanne Mason, West Midlands Regional Genetics Laboratory
It’s my job to actually look at the variants that we’re generating with all of this sequencing data and decide what the impact of those variants is and how likely they are to be causing the patient’s condition and then decide what gets fed back to patients.
So I’m going to be introducing some quite complicated concepts and hopefully explain them to you in a way that is understandable, and I’m happy to take questions at the end.
It helps if we start to think about what does a genome actually look like. If you want to see a genome you can go down to the Wellcome Collection just opposite Euston Station in London and actually have a look at the human genome, because there you’ll find printed out all 6 billion letters of the human genome in the volume of over 100 books.
However, on a more visual basis the genome actually looks more like this. So we have a sequence: those 6 billion bases are composed of just 4 different letters – A, C, G and T – which are the four nucleotides that make up our DNA. These together with a phosphate backbone are packaged up into a structure that’s probably familiar to all of you, called a double helix. Then that double helix of DNA is further packaged up in a very specific and ordered way such that all 6 billion bases will fit into every single cell in your body. They’re packaged up into chromosomes and you have 23 pairs of these in every single cell in your body.
This is what chromosomes look like down the microscope, and we can look at these via a technique called cytogenetics. You are looking at a cell that has been arrested in a particular stage of the cell division cycle called metaphase, and this is the point at which the chromosomes are most condensed and easiest for us to see. And then the cells have been burst open to reveal the chromosomes and they’ve been stained with specific chemicals which reveal a very particular banding pattern, such that every chromosome has a very characteristic banding pattern that can be recognised by skilled staff. The chromosomes are then paired up and ordered in what we refer to as a karyotype, and this process of staining the chromosomes is actually called G banding. So cytogenetics is like the real archetypal genomic technology; it’s a way of looking at the entire genome all in one go, but it will only reveal very gross aberrations involving the chromosomes.
So one of these, to show you an example of the kind of thing we do, is illustrated here, so this is the karyotype of a little girl, we know it’s a girl because there’re two copies of the X chromosome and no copies of the Y, and you can see straight away that we’ve got three copies of chromosome 21 here. That’s also known as trisomy 21, or more familiarly Down’s syndrome. So if we had a baby born with a suspicion of Down’s syndrome, a sample of blood would be sent off to our laboratory. We’d look at the chromosomes down the microscope, and if we see three copies of 21 we can confirm that clinical diagnosis.
Another example comes from cancer; this time we are talking about acquired genetic disease. In this case, we’ve got a translocation between chromosomes 4 and 11 such that the long arm of chromosome 4 – so this is the long arm of 4 – but the long arm of this one of the pair has actually translocated onto the bottom of chromosome 11, and we’ve got the reciprocal process: the little bit of the bottom of 11, the long arm, has gone on to chromosome 4. So we’ve got now a new derived chromosome 4 and a derived chromosome 11. The effect of this translocation is that the genes located at the break points, so sort of here and here, have been disrupted in some way and the consequence is that the protein is also disrupted, and this has led to it having oncogenic potential and has led to the formation of the leukaemia. And it’s really important – the reason we want to know this is partially because we can diagnose a specific type of leukaemia, but even more than that we are even able to say what kind of therapies are likely to work best, so this is why we use chromosome analysis. People think of it perhaps traditionally for diagnosing inherited genetic diseases, but we use it an awful lot in cancer as well.
So that was looking at chromosomes: when they’re easily visible down the microscope we can band them, then trained analysts can look for aberrations. But in some situations that’s just not possible, either because cells aren’t actively dividing or because the kind of aberrations we’re looking for perhaps just involve the very tips of the chromosomes, right down here, and it’s just not possible to actually see that. They’re what we call cytogenetically cryptic.
So then we can use another technique, a really powerful technique, called fluorescent in situ hybridisation, or molecular cytogenetics. This particular example here is a lovely illustration of the technique, although I should point out it’s not commonly used in diagnostic labs but it shows you what the technique can do. We’ve been able to label up every single pair of chromosomes with a different colour so that they can be easily recognised down the microscope. So if you have say a tiny little bit of one chromosome translocated to the end of another one, it’s revealed by the colour – what would be invisible when you are doing G-band analysis.
This slide also shows the difference between cells at different stages in the cell cycle. So these cells are in the metaphase when the chromosomes are nicely condensed and they’ve actually been split open from the cell and released and spread out on a slide such that we can see them like this. However, at another stage in the cell cycle called interphase, this is what the chromosomes look like inside the cell. They’re much more spread out: you can see from the colours that the genetic material is diffusely spread throughout the cell; the chromosomes are just a big tangle. So again this is where this technique of FISH comes in because we are still able to analyse cells in this interphase even if we can’t generate metaphase chromosomes by applying FISH probes to them.
So I’ve illustrated that with the next picture. So here’s a probe to a gene on the Y chromosome seen in metaphase where you can still recognise the Y chromosome, this tiny little chromosome down here. You are also able to look at cells in the interphase of the cell cycle and there’s the Y chromosome down there. And this technique is very good if you just wanted to do a quick count. We can just count the number of copies of Y or any other chromosome or any other gene which is relevant in certain situations.
We can take that technique even further and use differentially coloured probes. Here, we’ve got a probe to a gene on chromosome 9, coloured in red, and another probe to a gene on chromosome 22, coloured green. This is the normal 22, this is the normal 9, and you’ll have to believe me here but the signals on this particular chromosome 9 – this is the pair of 9s here – these signals are slightly less intense than this one, and the little bit of the signal here has actually moved onto this chromosome 22 down here.
And where the red and green are now co-located, the signal appears yellow. So what this tells us is that we’ve had a translocation between this chromosome 9 and this chromosome 22, and the result of this translocation is that the genes at the two break points have fused together to create a new chimeric protein which has oncogenic potential. And this particular genetic aberration, called translocation 9;22, is characteristic of chronic myeloid leukaemia – it’s actually how you diagnose chronic myeloid leukaemia. And because it’s so characteristic, the derived chromosome 22 has its own name: it’s called the Philadelphia chromosome. So again by having the sample sent into us with a suspected diagnosis and looking down the microscope, we can confirm that this is indeed chronic myeloid leukaemia and suggest the most appropriate course of treatment for this patient.
So moving on from gross chromosomal aberrations and delving down more to the actual DNA sequence itself, I just thought it would be worth reminding everyone how the genetic apparatus actually works and how the sequence is read and the code is understood by the cell in the formation of a protein.
So triplets of bases, or nucleotides, form what’s known as a codon, and each codon codes for a different amino acid, and it’s the sequence of amino acids that makes up a protein. So a very simplistic example here is a substitution in the sequence where a thymine has been replaced by an adenine, so this codon originally coded for a cysteine amino acid but now the sequence has changed from T G T to A G T, so the amino acid coded for now is a serine, so we’ve got a change for one single amino acid in this protein. The consequences of this are completely variable. Often these kinds of changes are tolerated by the cell; there is no consequence to the protein. Every single one of us contains thousands of these types of changes – this is what’s responsible for the normal differences between us all. However, at the other end of the scale, the consequence could be completely devastating, and it could be just this single type of change here that will stop the protein from perhaps folding properly or functioning properly, and the consequence could be a very serious genetic condition.
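The codon example above can be sketched in a few lines of Python. This is only an illustration: the lookup table holds just the handful of codons needed here, and the nine-base sequence is a made-up example rather than anything from the slide.

```python
# Partial codon table: only the codons used in this illustration.
CODON_TABLE = {
    "TGT": "Cys",  # cysteine
    "TGC": "Cys",
    "AGT": "Ser",  # serine
    "AGC": "Ser",
    "GCT": "Ala",  # alanine
    "GAA": "Glu",  # glutamate
}


def translate(dna):
    """Read a coding sequence three bases at a time and look up each codon."""
    usable = len(dna) - len(dna) % 3  # ignore any incomplete trailing codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE.get(codon, "???") for codon in codons]


reference = "GCTTGTGAA"  # hypothetical sequence: Ala-Cys-Glu
variant = "GCTAGTGAA"    # same sequence with T>A at the first base of codon 2

print(translate(reference))
print(translate(variant))
```

The single-base change turns the middle codon from T G T (cysteine) into A G T (serine) and leaves the neighbouring codons untouched, which is the situation described above.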
So the types of changes we’re looking for – the very small changes – are perhaps best illustrated by the analogy of the set of instructions in a book. And if we start with a very simple sentence, ‘The fat red fox can hop’, there are various different changes that can happen. First of these is a substitution, which is like a spelling mistake, and here you can still read the sentence but its meaning’s changed slightly and the consequence of that to the cell is that it could still understand that sentence and there be no real phenotypic effect, or it could change it completely and it no longer works properly.
So after substitutions we then come to deletions. Now you can have what’s known as an in-frame deletion, where in this case we’ve deleted three letters but the sentence can still be read and make sense. Paradoxically, if we only delete one letter rather than three, this throws the reading frame out and this sentence is now gobbledygook and we can’t read it any more. So this is a sort of nice illustration of how just one single change, one single letter missing, can have such a dramatic effect.
The converse of deletions are insertions, and again if you’ve got an insertion of just a single nucleotide you throw out the reading frame and your sentence no longer makes sense. And then slightly bigger than insertions, we could have a duplication. So in this case we’ve duplicated one whole word and you can still read the sentence, or we duplicate four letters and then throw the reading frame out. So I hope you can start to see the reason I’ve put this slide up: it shows just that one single change in your 6 billion letters can have such a devastating impact and be the cause of either cancer or a nasty genetic condition.
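The sentence analogy can be played out in code. The sketch below reads the letters of ‘The fat red fox can hop’ in groups of three, the way a cell reads codons; the particular letters deleted and inserted are chosen here for illustration and may not match the ones on the slide.

```python
def read_in_frame(letters, size=3):
    """Split a run of letters into consecutive 'words' of `size` letters."""
    return " ".join(letters[i:i + size] for i in range(0, len(letters), size))


original = "thefatredfoxcanhop"

# In-frame deletion: remove one whole three-letter word.
in_frame_deletion = original.replace("fat", "", 1)

# Frameshift deletion: remove the single letter 'f' at index 3.
frameshift_deletion = original[:3] + original[4:]

# Frameshift insertion: add a single letter after the first word.
frameshift_insertion = original[:3] + "x" + original[3:]

for label, text in [
    ("original", original),
    ("in-frame deletion", in_frame_deletion),
    ("frameshift deletion", frameshift_deletion),
    ("frameshift insertion", frameshift_insertion),
]:
    print(f"{label:>20}: {read_in_frame(text)}")
```

Removing three letters leaves every later word intact, whereas removing or adding one letter shifts every word that follows it.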
So when we’re looking for variation in the genome at the sequence level, we’re looking for everything from single base changes right up to perhaps whole extra chromosomes. And traditionally we’ve had to use a variety of techniques in the genetics lab to be able to find all of these different types of changes at different orders of magnitude of scale. But what we really want, to be able to do this most efficiently with the highest diagnostic yield at reduced cost, is one technique that will fit all. And we finally have such a technique, which is showing great promise now to be able to do all of these things. That technique is known as next-generation sequencing, and that’s what the rest of my talk is really going to be about.
So next-generation sequencing. The reason it’s so important is it is completely changing the way we deliver healthcare. In a nutshell, what it does is it decodes the DNA and produces the precise order of the four component bases. But the difference between next-generation sequencing and what’s gone before – and bearing in mind we’ve used a process called Sanger sequencing for decades – with next-generation sequencing, we can actually sequence billions of DNA fragments in parallel in one go, so it’s very high throughput. And it means that we can now sequence an entire human genome in one run, as Tom’s already alluded to.
So within the clinical lab we’ve used, as I said, Sanger sequencing for a long time now. However, the throughput of that is pretty low: we can sequence 96 fragments per run on one of these plates; that’s equivalent to about 75 thousand bases of DNA. So in practice that means we could perhaps look at 96 different patients for one single exon, or maybe screen a whole gene for just one patient one run at a time. However, with the benchtop next-generation sequencing apparatus that we have access to in the lab, we can now look at 20 million fragments per run, so orders of magnitude greater than what’s gone before. This is the equivalent of looking at around 200,000 of these plates, so the amount of data we are generating is huge – 15,000 million bases in one go – so this enables us to screen gene panels for a number of patients in one go, or potentially look at a whole exome, which is the entire coding sequence of your DNA. So it’s given us a huge increase in our sequencing output and it’s completely revolutionised our clinical services.
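As a quick check on the scale of that change, the comparison can be worked through from the per-run figures quoted above (96 fragments and about 75 thousand bases for a Sanger plate; 20 million fragments and 15,000 million bases for the benchtop sequencer). Nothing here is measured data; it is just arithmetic on those four numbers.

```python
# Per-run figures as quoted in the talk.
SANGER_FRAGMENTS_PER_RUN = 96
SANGER_BASES_PER_RUN = 75_000
NGS_FRAGMENTS_PER_RUN = 20_000_000
NGS_BASES_PER_RUN = 15_000_000_000

# How many Sanger plates would match one benchtop NGS run?
plates_by_fragments = NGS_FRAGMENTS_PER_RUN / SANGER_FRAGMENTS_PER_RUN
plates_by_bases = NGS_BASES_PER_RUN / SANGER_BASES_PER_RUN

# Implied average fragment length on each platform.
sanger_mean_length = SANGER_BASES_PER_RUN / SANGER_FRAGMENTS_PER_RUN
ngs_mean_length = NGS_BASES_PER_RUN / NGS_FRAGMENTS_PER_RUN

print(f"Plates needed, counting fragments: {plates_by_fragments:,.0f}")
print(f"Plates needed, counting bases:     {plates_by_bases:,.0f}")
print(f"Implied bases per fragment: Sanger {sanger_mean_length:.0f}, NGS {ngs_mean_length:.0f}")
```

Counted either way, one benchtop run comes out at roughly 200,000 plates’ worth of Sanger sequencing.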
So talking about how we are going to use next-generation sequencing in our clinical labs, which we’re moving across from calling genetics labs to genomic laboratories, we’ve got questions to ask: is it more appropriate to look at the whole genome, or could we get away with just looking at the coding part of the genome – that 1-5% of the genome that actually makes the proteins, which is known as the exome? Or is it more appropriate to concentrate on specific gene panels? I’ll illustrate this on another slide in a minute, but the different strategies are used in different situations. Say, for example, why you might go straight on to gene panels: for a patient referred with hereditary cardiomyopathy, there are up to 70 genes involved in that condition, so it’s more appropriate perhaps just to design a gene panel and look at those 70 genes. That’s the most cost-effective way of doing things as it stands, but in time that may well change.
Whereas if we are thinking about whether to look at whole exomes or whole genomes, just to illustrate the differences, why we may choose one over the other: as I’ve already said, an exome comprises just the coding region of your genome. That’s about 20,000 genes that make up between 1% and 5% of your genome. This will generate about 30 megabases of DNA, as opposed to whole genome sequencing, which gives you 100 times more output. By using exome sequencing, you’ll discover about 85% of your disease-causing variants in Mendelian conditions, which are those where the inheritance pattern has been well worked out. Whereas whole genome sequencing gives you the potential to detect all of your disease-causing variants. With whole exome sequencing, your output will be about 30,000 variants, so at the end of that process you’re left with 30,000 variants that you then need to decide, ‘What do I do with these?’ And we’ll come on to that in a minute. Whereas with whole genome sequencing, the figure is more like 3 million. So that’s the challenge; that’s the problem we face. And then in the rest of my talk, I’ll go on and say how we are actually getting around that challenge.
So just to recap on the strategy right from the beginning. You’ll need to take into account the phenotype and whether it’s well described, how much money you have, what kind of sensitivity you need to go down to, so for cancer we need a very high level of sensitivity when we are looking at clones in tumours. And how many patients you’ve got: obviously the 100,000 Genomes Project gives you volume and the opportunity to reduce your costs massively. So we can either employ gene panels, which may be appropriate for specific presentations such as breast cancer where we only want to look at a limited number of genes. For more heterogeneous conditions, such as developmental delay, it’s perhaps more appropriate to sequence the coding region, known as the clinical exome. Or as we’ve already heard from Tom, the 100,000 Genomes Project is enabling us just to go straight in and look at the entire genome and then select virtual panels out of that that we apply more scrutiny to.
So, again, this is the problem: 3 billion letters, and what do we do with all of that data?
So to generate the actual data, the process that we employ, illustrated here, is you have to start off with the specimen. So if we are looking for germline variants, we start off with blood or saliva, or other tissue, but in cancer we actually need a little bit of the tumour – we are obviously trying to sequence the genetic content of the tumour. We then extract nucleic acid; usually this is DNA, sometimes RNA if we are interested in expression. Then – this is a key part of the process – we prepare our library, and this determines how much of the genome we’re going to be looking at. So it could be the whole genome or we could just pull out the exome, or we may just want to look at gene panels, and the way we do this is illustrated here. So if we start off with the entire genome, then we can do what’s known as target enrichment to pull out, for example, just cardiac genes, just breast cancer genes, just the deafness genes, or the entire coding region or look at the whole genome. So that’s sort of how we do it.
So moving on to the sequencing process itself, there are loads of different chemistries and platforms that we can use to do this. But, generally speaking, the process itself sequences in short fragments, which we call reads, which can be between 75 and 300 base pairs. And then they have to be mapped back to a reference genome so that they can be ordered and we can make sense of them. So that’s the first part of the bioinformatics pipeline – mapping all these reads back to the reference genome.
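To give a feel for what mapping reads back to a reference means, here is a deliberately simplified sketch. It places each read wherever it matches the reference exactly; real aligners work against the whole genome, tolerate mismatches and gaps, and handle reads that fit in more than one place, none of which is attempted here. The reference string and the reads are invented.

```python
REFERENCE = "ACGTTGCAAGGCTTACCGATAGGCTTAGCAATCG"  # toy reference, hypothetical


def map_read(read, reference=REFERENCE):
    """Return the 0-based position of the first exact match, or None."""
    position = reference.find(read)
    return position if position != -1 else None


reads = ["GGCTTACC", "CAAGGCTT", "TTACCGAT"]

# Pair each read with its position, drop unmapped reads, order along the reference.
alignments = [(map_read(read), read) for read in reads]
alignments = sorted(hit for hit in alignments if hit[0] is not None)

print(REFERENCE)
for position, read in alignments:
    print(" " * position + read)
```

Printing each read indented by its mapped position gives a small text version of the stacked view a genome browser shows.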
Then the BI pipeline will call any variants and highlight them to us, and then we do some further bioinformatics to actually annotate those variants to decide what the consequences are likely to be and finally to assign pathogenicity to them, basically to make a decision. Are they just part of normal variation? Can we write them off as benign variants, or are they actually likely to be responsible for this condition? So that really is the hardest part, believe me. I think we’ve got the sequencing side of things sussed – yes, it would be nice if it were cheaper – but this is the really difficult bit now.
This is what the next-generation sequencing data actually looks like. This is how it is presented to me; this is a browser that we use. So these are all the reads here. This is why it’s called parallel sequencing, because we generate loads and loads of these different reads, which are really quite short – as I said, between 75 and 150 base pairs. Then the bioinformatics lines it back up to the genome, so all of these reads here come from just this little region. This is chromosome 18 here across the top, just a part of chromosome 18; this is the centromere of 18 and all these regions come from just this little bit here, which is the GATA6 gene. Down here at the bottom, we’ve got the sequence of amino acids and the sequence of nucleotides. What we can see here is the software highlighting a variant, and this variant is actually a change in a single base from an adenine to a guanine, and the consequence of that is to change the amino acid sequence. Actually this is a pathogenic mutation, so that’s what it actually looks like.
So one of the ways we actually can start to home in on the regions of interest and to make sense of all this data is to apply panels. In the genetics diagnostics lab, all of our familial cancer patients go down one single panel. We look at 94 genes implicated in hereditary cancer in one go through one pipeline. It’s a very streamlined process: we can get DNA to data in just three days through a single technical workflow and then we apply virtual gene panels to this, such that all of the patients go through a filtering process and we only look at the genes of interest at the end. So the breast cancer patients, whilst they get sequenced for all of those genes, we’ll only actually look at the breast cancer genes. In that way, we limit the number of variants we’re left with at the end.
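What the browser is highlighting can be thought of as a count of the bases that the stacked reads show at one position. The sketch below is a toy version of that idea, not how any particular variant caller works: the 20% and 80% cut-offs and the example read depth are assumptions made purely for illustration.

```python
from collections import Counter


def call_site(ref_base, pileup, min_fraction=0.2, hom_fraction=0.8):
    """Summarise the bases observed across reads at one reference position.

    `pileup` is a string with one base per overlapping read.
    The two fraction thresholds are illustrative, not validated values.
    """
    counts = Counter(pileup)
    depth = len(pileup)
    # Most frequent non-reference base, if there is one.
    alt_base, alt_count = max(
        ((base, n) for base, n in counts.items() if base != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    fraction = alt_count / depth if depth else 0.0
    if fraction < min_fraction:
        genotype = "reference"
    elif fraction < hom_fraction:
        genotype = "heterozygous"
    else:
        genotype = "homozygous variant"
    return alt_base, fraction, genotype


# Hypothetical pileup: 20 reads cover the site, 11 show A and 9 show G.
print(call_site("A", "A" * 11 + "G" * 9))
```

When only about half of the reads carry the G, as in this made-up pileup, the natural reading is that one of the two copies of the gene carries the change, which is a heterozygous variant.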
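A virtual gene panel is, in effect, a filter applied after sequencing. The sketch below shows the idea with invented variant identifiers; the gene lists are small illustrative examples and are not the laboratory’s actual 94-gene panel or its virtual panels.

```python
# Illustrative virtual panels keyed by referral reason (assumed contents).
VIRTUAL_PANELS = {
    "breast_cancer": {"BRCA1", "BRCA2", "PALB2"},
    "colorectal_cancer": {"MLH1", "MSH2", "APC"},
}


def apply_virtual_panel(variants, referral):
    """Keep only the variants that fall in genes on the referral's panel."""
    genes = VIRTUAL_PANELS[referral]
    return [variant for variant in variants if variant["gene"] in genes]


# Every patient is sequenced for the full panel; the IDs here are hypothetical.
called_variants = [
    {"gene": "BRCA1", "id": "var_001"},
    {"gene": "MLH1", "id": "var_002"},
    {"gene": "PALB2", "id": "var_003"},
    {"gene": "APC", "id": "var_004"},
]

for variant in apply_virtual_panel(called_variants, "breast_cancer"):
    print(variant["id"], variant["gene"])
```

The wet-lab workflow stays the same for every patient; only the gene list chosen at analysis time changes with the referral.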
So the variants that we see may be false calls as a result of the sequencing process itself, known as artefacts, and the bioinformatics helps us to decide which of these it’s going to be – whether it’s a false call, whether it’s likely to just be benign variation. Sometimes the variants just have uncertain clinical significance and we have to decide whether to report those out or hold them back. Ultimately, what we’re really looking for is the likely pathogenic mutations, which will be reported back to the patients, and there are various databases and tools online that we use to make these decisions.
So that’s how we deal with gene panels. Then, when we talk about the results of whole genome sequencing, we are dealing with vastly increased amounts of data. The way we deal with this is to take our 3 billion bases – which are sequenced from either the germline in a rare disease or the somatic tumour in cancer – and then we can actually subtract from the cancer the germline variants, and that automatically limits the amount of data we are left with to look at. For rare disease, if we’ve got a de novo condition and we have both parents, we can subtract the parental variants because we know the parents aren’t affected and again that leaves us with a more limited pool of variants. We can then apply some virtual panels and that will give us again a reduced number of variants at the end of the pipeline, which are then annotated, filtered and prioritised. When it comes to rare diseases, the genomic variants are filtered based on such things as their frequency, so if it’s a rare disease the variant itself needs to be rare. They are also filtered on location, which needs to be inside a known gene panel containing genes associated with that condition; on genotype; on inheritance pattern; and on predicted consequence.
So this is probably the important slide to go through; it describes how we actually filter. So if we have a variant, what we need to know is that the variant allele is not commonly found in the general healthy population – we can use population databases to make that assessment. And the allelic state matches the known mode of inheritance for the gene and the disorder. If the answer to those questions is yes, we then start to look at the consequence of that variant. If it’s a known pathogenic mutation, it goes straight down into tier 1 and that gets reported back to us from the 100,000 Genomes Project back to the Genomic Medicine Centres. If it’s a protein-truncating variant, even if it’s not been reported before, that follows the same path. And then if it’s one of the genes on our virtual panel, it either goes into tier 1 or tier 3, which I’ll just skip through in a minute. However, if it’s protein altering rather than protein truncating, it is slightly more difficult to deduce what the effect might be, and in this case it’s tiered as a tier 2 rather than a tier 1, which means the effect is less likely – you would have to do a bit more work to decide what the effect might be.
So very quickly, tiers 1, 2 and 3. Tier 1 are the known pathogenic variants and the protein-truncating variants, and these come back to the Genomic Medicine Centres – we do a bit more work just to make sure that we’re happy that they are responsible for the genetic condition and they’ll get reported back to the patients. Whereas with the tier 2 variants, we may have to do a lot more work to assign pathogenicity to them or discount them as a likely benign variant. Tier 3 variants are the ones where we really don’t know about them and it’s unlikely we’ll be looking at these at this stage, although in the future we’ll perhaps give them a bit more scrutiny.
So, just to summarise, the way that we’ve adopted next-generation sequencing in clinical genomic services: we’ve employed it because it has dramatically increased the amount of sequencing data we have access to at reduced cost, which means that we can look at more patients, reduce our reporting times and look at many more genomic targets, increasing our diagnostic yields. So it really has been the answer to the demands of genomic medicine.
So, as to the future: will we be doing whole genome sequencing for everybody? And at what point? Will we be screening everybody at birth? Those ethical questions I’m not prepared to answer, but they’re things that have been mooted and discussed. The one thing I would say is that whole genome sequencing is great, but it can’t do everything yet; it doesn’t provide all the answers. So, for example, a particular area I’m involved in is monitoring residual disease in leukaemia, and I can’t see a way in which, at the current costs and so forth, we can do that with next-generation sequencing. But that’s not to say it’s not changing the way we deliver genomics services. So thanks for your attention and I’m happy to take any questions and sorry that was rushed at the end.
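The subtraction steps described above are essentially set differences. In the sketch below each variant is written as a (chromosome, position, reference base, alternate base) tuple; all of the variants, the population frequencies and the 1% rarity threshold are invented for illustration and do not come from any real pipeline.

```python
def subtract_background(variants, *backgrounds):
    """Remove from `variants` anything also seen in the background sets."""
    background = set().union(*backgrounds)
    return variants - background


def keep_rare(variants, population_frequency, threshold=0.01):
    """Keep variants at or below `threshold` in a population lookup.

    Variants missing from the lookup are treated as never observed.
    """
    return {v for v in variants if population_frequency.get(v, 0.0) <= threshold}


# Cancer: tumour variants minus the same patient's germline variants.
tumour = {("chr9", 1000, "A", "G"), ("chr17", 2000, "C", "T"), ("chr22", 3000, "G", "A")}
germline = {("chr9", 1000, "A", "G")}
somatic = subtract_background(tumour, germline)

# Rare disease: child's variants minus those of both unaffected parents.
child = {("chr1", 500, "T", "C"), ("chr2", 700, "G", "A"), ("chr18", 900, "A", "G")}
mother = {("chr1", 500, "T", "C")}
father = {("chr2", 700, "G", "A")}
de_novo_candidates = subtract_background(child, mother, father)

# Hypothetical population frequencies for the rarity filter.
frequencies = {("chr17", 2000, "C", "T"): 0.15}
rare_somatic = keep_rare(somatic, frequencies)

print(sorted(somatic))
print(sorted(de_novo_candidates))
print(sorted(rare_somatic))
```

Each step only ever shrinks the pool, which is the point: by the time annotation and prioritisation start, far fewer variants are left to assess.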
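The tiering logic described over the last two slides can be written down as a small decision function. This is one reading of the description given in the talk, simplified, and not the 100,000 Genomes Project’s actual rules: the field names, the 1% frequency cut-off and the example variants are all assumptions made for the sketch.

```python
def assign_tier(variant, panel_genes, max_population_frequency=0.01):
    """Return 1, 2 or 3, or None if the variant fails the initial filters."""
    # Initial filters: rare in the population, allelic state fits the inheritance.
    if variant["population_frequency"] > max_population_frequency:
        return None
    if not variant["matches_inheritance"]:
        return None

    on_panel = variant["gene"] in panel_genes
    high_impact = (
        variant["known_pathogenic"]
        or variant["consequence"] == "protein_truncating"
    )
    if on_panel and high_impact:
        return 1
    if on_panel and variant["consequence"] == "protein_altering":
        return 2
    return 3


panel = {"GATA6"}  # hypothetical one-gene virtual panel

example = {
    "gene": "GATA6",
    "population_frequency": 0.0001,
    "matches_inheritance": True,
    "known_pathogenic": False,
    "consequence": "protein_altering",
}
print(assign_tier(example, panel))
```

A variant that passes the filters but sits outside the virtual panel lands in tier 3 here, matching the description of tier 3 as the variants that are not being scrutinised at this stage.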
Questions from the audience
Audience member: Thanks very much, two very quick questions. Firstly, when you show the alignment of the contigs and you show the transition from an A to a G, not all of the contigs have the mutation. Is this because it’s heterozygous?
Joanne: Exactly, yes.
Audience member: And the other question is what if you serendipitously discover a mutation for something you are not actually looking for – how do you deal with that?
Joanne: That’s very interesting. One of the answers to that is to apply these gene panels, so if you’re only looking at the panels which are likely to be associated with the disease, then by definition you’re only going to be looking at things which are likely to be responsible. There is the potential if you open that up and start to do more on a research basis, yes you absolutely will start to find things. And that’s part of the 100,000 Genomes Project – to consent patients right at the beginning and have them be clear about what kind of data are going to be returned to them. Tom, am I right in saying they are actually able to choose what levels of return?
Dr Tom Fowler, Genomics England: Yes, they can choose. So I think one of the points about this is that people actually think of incidental findings – the framework in this is just incidental findings – and actually we’ve both emphasised the huge amount of data, so there’s a difference between going and looking for specific things and coming across specific things that have an implication for something else. I think at the moment we don’t actually return the incidental findings per se, but we do return additional looked-for findings. But I suspect the direction of travel is to actually move the incidental findings that are come across more in the direction that we do in the rest of the NHS, which is, you know, if you do see a dark mass on a lung X-ray, you will report that. The bit where I think you can get caught up is the issue of how certain you are that something means something, and people are often debating this, where you actually go, ‘Well, we are not 100% sure, so actually is it a finding or isn’t it?’ And I think we often tie ourselves up in knots when actually the technical bit says, ‘No, not ready to report on this yet.’
Joanne: That’s a good point. So generally speaking I would say with anything in the rare disease programme, where we’re dealing with constitutional genetic conditions, be they hereditary or de novo, people don’t report back variants of unknown significance because it’s of no use and it could have the potential to be misinterpreted. Those boundaries aren’t quite so clear with cancer, because even if we don’t know exactly the function of something, it could still mean that the patient could potentially receive a treatment that could still work, so it all starts to become a bit more complicated.