In the emerging NGS era, scalable variant interpretation requires automated options
By Anika Joecker, PhD, and Sohela Shah, PhD
For both cancer and hereditary diseases, interpreting the genetic variants found in patient specimens has become a mission-critical, and incredibly time-consuming, process for clinical labs. Single-gene tests are being supplanted by gene-panel tests, and an increasing number of clinicians are ordering exome or even whole-genome tests in order to elucidate the causes of rare diseases and put a stop to seemingly endless diagnostic odysseys.
Compared with the amount of data generated by genetic testing over recent decades, all of this new-model testing is generating orders of magnitude more genetic variants to be identified, analyzed, and interpreted. The resulting data glut puts tremendous pressure on existing variant interpretation pipelines, many of which were built for evaluating just a few mutations at a time. Attempting to force hundreds or thousands of variants of interest through those existing pipelines is becoming increasingly difficult.
In the emerging era of next-generation sequencing (NGS), it’s clear that variant interpretation processes must become more automated. There is a need for better software and other automation options that enable clinical geneticists to funnel most variants through machine-based analysis, leaving only the most complex variants to require expert attention and manual interpretation. But implementing such automated tools can be a complicated venture for labs that have built up their own proprietary databases of variant analysis information. Extensive validation will be necessary for any solution.
This article will review the challenges of variant interpretation in existing pipelines, and will also look at newer options that some labs have begun using to save time and improve results. Study data will illustrate how some software solutions can increase solve rates for patient cases, while also minimizing variants of unknown significance.
VARIANT DISCOVERY
When Nic Volker became one of the first children to have his exome sequenced, in 2009, he didn’t just make scientific headlines. Interpretation of Volker’s exome delivered the answer to a years-long quest to figure out why this boy was so sick—something that no other test or expert had managed to do. Apparently cured by a genome-guided treatment, Volker became nationally known as the first child whose life had been saved by DNA sequencing.1–4
Volker’s story is important to the variant interpretation world because it put a spotlight on large-scale genetic tests. His illness could never have been solved with single-gene or even gene-panel tests. Suddenly, doctors and patients around the world realized that exome and genome sequencing offered a very real chance to explain mysterious diseases. While the tests remained quite expensive, the demand for this kind of approach appeared virtually overnight and has been growing quickly ever since.
In 2009, most clinical labs were getting along just fine with interpretation pipelines built for a relatively low capacity. But exome- or genome-sequencing tests can uncover millions of variants for each person sequenced. Filtering out variants predicted to have no downstream effect could reduce the number of mutations needing to be manually inspected and assessed—but not nearly enough. And while exome and genome sequencing are still typically used as a last resort—after other tests have failed to produce answers—there is a strong trend to include more and more genes even for simple tests.
As our understanding of the genetic mechanisms behind disease improves, it’s likely that panel tests will grow to include even more genes, and that exome and genome sequencing will be used more frequently. In the Human Gene Mutation Database, a widely used resource for comprehensive data on published inherited human disease mutations, there are now more than 187,000 mutation entries, and the number of mutations added each year has more than doubled in the past decade.5 As scientists and physicians become more skilled at associating genotypes and phenotypes, the number of recognized mutations can be expected to soar.
We are, after all, still in the era of variant discovery. Much of the human genome remains uncharted territory, and there are new findings to be made with each new patient who walks through the door. Every new Nic Volker will stoke more interest in using genomics to solve clinical challenges. It’s an exhilarating time for biology—but an overwhelming time to be managing variant interpretation.
INTERPRETATION PIPELINE
In most clinical labs, the processes currently used for variant interpretation range from manual to semiautomated. Today, almost every variant of interest must be scrutinized by a clinical geneticist. That requirement highlights the tremendous skill of these experts, but it also reveals an opportunity: improving the interpretation process would let such highly skilled professionals focus their attention on the complex variants that truly require manual interpretation.
The interpretation pipeline picks up where the sequencing pipeline ends, with an annotated report of DNA results that typically flags variants that require follow-up investigation and deprioritizes the majority of common variants that have been reported in public databases. For patients with hereditary disease, the goal is to isolate the variant most likely to be causing that disease. For cancer cases, analysts are typically looking for somatic variants that offer clues to guide treatment, or for variants that can be used as prognostic or diagnostic markers. In either case, the analytical process is far from trivial.
Clinical geneticists begin with information gathering, scouring PubMed and Google Scholar to find references in the literature to the variants present in their reports. Then they must review each paper to determine whether the information it offers about the variant is applicable to the patient being tested. Details of interest include which organism is being studied, the type of disease reported, the number of human subjects in the study, variant heterozygosity, the strength of correlation between the variant and the phenotype, and much more.
This step alone represents a massive time investment. Assuming each paper can be thoroughly reviewed in just 5 minutes—a tall order—performing a literature review for a rare disease such as Bloom syndrome, with just 92 articles listed in PubMed, would take more than 7 hours. A well-characterized disease such as cystic fibrosis, with nearly 3,500 papers listed in PubMed, would take more than 280 hours. Clearly, it is impossible for variant analysts to perform a comprehensive review of all disease-associated papers for any individual patient.
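The arithmetic behind these estimates is easy to reproduce. The short Python sketch below uses only the figures quoted above (5 minutes per paper, 92 and 3,500 PubMed articles); the function name is illustrative.

```python
def review_hours(paper_count: int, minutes_per_paper: float = 5.0) -> float:
    """Return the hours needed to read every paper at a fixed pace."""
    return paper_count * minutes_per_paper / 60.0


# Article counts quoted in the text above.
bloom_hours = review_hours(92)
cystic_fibrosis_hours = review_hours(3500)

print(f"Bloom syndrome: {bloom_hours:.1f} hours")
print(f"Cystic fibrosis: {cystic_fibrosis_hours:.1f} hours")
```

Even at this optimistic reading pace, the cystic fibrosis review alone would consume several working weeks for a single analyst.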
Next, clinical geneticists turn to public databases of variant information and to websites that model protein changes based on a DNA mutation. These sources help determine whether the variants uncovered are likely to affect the function of a gene or protein that might result in the patient’s phenotype.
For some labs, this phase of the investigation also includes screening variants against a proprietary database. This practice is especially useful for tests that a lab runs frequently; with each run of that test, the lab develops deeper expertise in recognizing the variants most likely to cause disease. Some clinical labs have invested in developing internal databases to capture such institutional knowledge, effectively building a competitive advantage for interpreting those variants and avoiding redundant research for variants that have already been analyzed. Proprietary knowledgebases such as these help to accelerate the process of interpreting and reporting variants, and represent an important step toward automating some of the analytical workflow.
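As a rough illustration of how an internal database avoids redundant research, the Python sketch below splits a case's variants into those the lab has already classified and those that still need full review. Everything here is an assumption made for illustration: the class name, the key format (chromosome, position, reference allele, alternate allele), and the placeholder variants do not describe any particular lab's system.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key: (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


class InternalKnowledgebase:
    """Toy in-memory store of classifications a lab has already made."""

    def __init__(self) -> None:
        self._classifications: Dict[VariantKey, str] = {}

    def record(self, key: VariantKey, classification: str) -> None:
        """Save the lab's classification for a variant."""
        self._classifications[key] = classification

    def lookup(self, key: VariantKey) -> Optional[str]:
        """Return the prior classification, or None if the variant is new."""
        return self._classifications.get(key)


def split_known_and_novel(
    keys: List[VariantKey], kb: InternalKnowledgebase
) -> Tuple[Dict[VariantKey, str], List[VariantKey]]:
    """Separate previously classified variants from those needing full review."""
    known: Dict[VariantKey, str] = {}
    novel: List[VariantKey] = []
    for key in keys:
        prior = kb.lookup(key)
        if prior is None:
            novel.append(key)
        else:
            known[key] = prior
    return known, novel


# Placeholder data, not real variants.
kb = InternalKnowledgebase()
kb.record(("chr1", 12345, "A", "G"), "likely benign")
case_variants = [("chr1", 12345, "A", "G"), ("chr2", 67890, "C", "T")]
known, novel = split_known_and_novel(case_variants, kb)
print("Already classified:", known)
print("Needs full review:", novel)
```

In practice a prior classification would still need periodic re-review as new evidence appears; the sketch only shows the lookup step.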
Finally, clinical geneticists pull together all of the information they’ve amassed, and use their expertise to weigh the evidence. There is a great deal of pressure on each geneticist to make sure that no relevant piece of information escapes notice. Ultimately, their judgments will yield the lists of variants classified as pathogenic or likely pathogenic, benign or likely benign, and those that defy useful classification—variants of unknown significance (VUS). For known variants, the entire interpretation process can take as little as 30 minutes; but for the most complex variants and those not previously seen, the process can take as much as a full day.
AUTOMATED OPTIONS
As the bottleneck in NGS-based testing has shifted from sequencing DNA to analyzing results, public and private institutions alike are making a greater effort to develop new ways to improve and automate some or all of the interpretation process.
Particularly impressive in this area is the work of the Clinical Sequencing Exploratory Research program (CSER), a nationwide consortium that was launched with funding from the National Cancer Institute and the National Human Genome Research Institute.6 CSER brings together laboratories across the United States to tackle challenges in the integration of genomic sequencing into clinical care—including variant analysis. For example, the consortium is benchmarking variant calls from lab to lab in order to better understand differences in how variants are classified from one clinical geneticist to another. It is expected that recommendations from CSER will eventually increase reproducibility and establish community-standard best practices for variant reporting.
Publicly accessible clinical databases are also providing variant analysts with reliable, one-stop resources that may one day help reduce the need for extensive literature searches. One such repository is ClinVar, through which users share genotype-phenotype associations and the supporting evidence for them. Hosted by the National Center for Biotechnology Information, ClinVar is just one of many needed tools that will help clinical teams classify variants with greater confidence and speed.7
Scientists have also introduced a few freely available tools that help clinical geneticists determine how often a variant is seen in the population, which makes it easier to classify the variant as likely benign or likely pathogenic. The Allele Frequency Community was launched by several organizations to aggregate and share data about variant frequencies across many populations, which is particularly useful for geneticists working with populations that are underrepresented in public databases.8 Separately, the Exome Aggregation Consortium reports variants discovered across thousands of sequenced exomes to provide additional detail on both rare and common mutations.9
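To see why population-specific frequencies matter, consider the minimal Python sketch below. It treats a variant as common if it reaches a cutoff in any single population, so a variant that looks rare in one group can still be recognized as common in another. The 1% cutoff, the function names, and the population labels are illustrative assumptions, not values taken from the resources named above; appropriate cutoffs depend on the disease and its inheritance pattern.

```python
from typing import Dict, Optional


def max_population_frequency(freqs: Dict[str, float]) -> Optional[float]:
    """Highest allele frequency seen in any population, or None if unobserved."""
    return max(freqs.values()) if freqs else None


def is_common(freqs: Dict[str, float], threshold: float = 0.01) -> bool:
    """True if the variant reaches the (assumed) cutoff in at least one population."""
    top = max_population_frequency(freqs)
    return top is not None and top >= threshold


# Placeholder frequencies for a hypothetical variant.
variant_freqs = {"population_A": 0.0002, "population_B": 0.03}

print(is_common({"population_A": 0.0002}))  # judged on population A alone
print(is_common(variant_freqs))             # judged on both populations
```

A variant with no frequency data at all is not treated as common here, since absence from a database may simply reflect an undersampled population.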
Meanwhile, commercial organizations are also developing new tools to interpret cases of hereditary disease. Our organization, for instance, offers the Qiagen hereditary disease solution, an automated tool that streamlines the variant analysis and interpretation process.10 For clinical laboratories, we also provide Qiagen Clinical Insight Interpret—clinical decision support software with versions that enable variant classification and reporting for hereditary and somatic cancer.11,12 Powered by the manually curated Qiagen Knowledge Base, these solutions automatically query the peer-reviewed literature, many public databases, and even a lab’s proprietary database, to pull together relevant information from a single search.
SOFTWARE AT WORK
Some recent studies—several performed in collaboration with clinical labs and one performed internally—illustrate how automated end-to-end solutions might accelerate interpretation and improve results.
In a benchmarking study conducted with one whole-exome trio and six whole-genome trios from the Inova Translational Medicine Institute, Falls Church, Va, we found that an end-to-end automated analysis and interpretation pipeline can rapidly reduce the number of variants requiring manual follow-up.13–15 The six whole-genome trios each began with 4.8 million to 7.8 million variants. Using default parameters, we quickly filtered each case down to between four and 266 variants. Adding the patient’s phenotype information to the analysis reduced the candidates to one, two, or three variants in all but one case (see Figure 1).
For the whole-exome trio we observed a similar pattern. Beginning with about 500,000 variants, and applying default parameter settings and patient phenotypic information, the automated solution quickly narrowed the list to the single likely disease-causing variant. This dramatic reduction in candidate variants can help to increase the solve rate for patient cases, particularly for rare or unknown diseases.
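The general shape of such a filtering funnel can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the record fields, the 1% frequency cutoff, the two inheritance patterns checked (de novo and simple recessive, with both parents assumed unaffected), and the toy variants are all hypothetical. Real pipelines also handle compound heterozygous and X-linked inheritance, among other cases.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class TrioVariant:
    gene: str
    child_alt_alleles: int    # copies of the alternate allele: 0, 1, or 2
    mother_alt_alleles: int
    father_alt_alleles: int
    population_af: float      # population allele frequency
    predicted_damaging: bool  # output of some effect predictor
    gene_phenotypes: FrozenSet[str]  # phenotypes linked to the gene


def fits_inheritance(v: TrioVariant) -> bool:
    """Check two simple patterns, assuming both parents are unaffected."""
    de_novo = (v.child_alt_alleles >= 1
               and v.mother_alt_alleles == 0
               and v.father_alt_alleles == 0)
    recessive = (v.child_alt_alleles == 2
                 and v.mother_alt_alleles == 1
                 and v.father_alt_alleles == 1)
    return de_novo or recessive


def filter_funnel(
    variants: List[TrioVariant],
    patient_phenotypes: Set[str],
    max_af: float = 0.01,
) -> Tuple[Dict[str, int], List[TrioVariant]]:
    """Apply successive filters and report how many variants survive each one."""
    rare = [v for v in variants if v.population_af < max_af]
    damaging = [v for v in rare if v.predicted_damaging]
    inherited = [v for v in damaging if fits_inheritance(v)]
    matched = [v for v in inherited if v.gene_phenotypes & patient_phenotypes]
    counts = {
        "input": len(variants),
        "rare": len(rare),
        "predicted damaging": len(damaging),
        "fits inheritance": len(inherited),
        "matches phenotype": len(matched),
    }
    return counts, matched


# Toy data with placeholder gene names and phenotypes.
variants = [
    TrioVariant("GENE_A", 1, 0, 0, 0.0, True, frozenset({"seizures"})),
    TrioVariant("GENE_B", 1, 1, 0, 0.20, False, frozenset()),
    TrioVariant("GENE_C", 2, 1, 1, 0.001, True, frozenset({"hearing loss"})),
    TrioVariant("GENE_D", 1, 0, 1, 0.002, True, frozenset({"seizures"})),
]
counts, candidates = filter_funnel(variants, {"seizures"})
for stage, n in counts.items():
    print(f"{stage}: {n}")
print("Candidates:", [v.gene for v in candidates])
```

The phenotype step comes last here only for readability; the point is that each added source of information shrinks the list that reaches a human reviewer.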
In an internal study designed to assess the importance of using multiple data sources for variant interpretation, we randomly selected about 280 variants for two disease types and analyzed them according to American College of Medical Genetics and Genomics guidelines in two ways: first using only public databases, and next using those databases as well as a curated repository of peer-reviewed literature.16,17 For 180 variants associated with Lynch syndrome, we found that using more data sources reduced the number of VUS by 27%. Likewise, we saw a 33% reduction in the VUS group when the 99 variants linked to heart disease were analyzed with both databases and literature (see Figure 2).
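For readers who want to apply the same measure to their own data, the reductions quoted above are relative reductions in the size of the VUS group. The Python sketch below shows the calculation. The VUS counts in it are hypothetical placeholders chosen only to demonstrate the formula; the article reports percentages, not the underlying VUS counts.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction, in percent, from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Variant totals quoted in the text: 180 (Lynch syndrome) and 99 (heart disease).
total_selected = 180 + 99
print("Variants analyzed:", total_selected)

# HYPOTHETICAL counts, for illustration only (not study data).
vus_public_databases_only = 100
vus_with_literature = 73
print("Reduction:", percent_reduction(vus_public_databases_only, vus_with_literature), "%")
```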
These results demonstrate that using more sources of information that can be queried easily through an automated workflow enables geneticists to classify more variants into clinically meaningful categories. The study also showed the importance of having full access to the literature, which can be challenging for clinical labs due to paywall restrictions for many journals.
CONCLUSION
Successful variant interpretation may be the single most important element required for translating genomic advances into mainstream medicine. It is essential that genetic analysts get this right—but clearly the manual approach is not sustainable or scalable for such a rapidly growing field.
Investment in automated tools—whether that means automating individual pieces of the workflow or adopting an end-to-end solution—will be of paramount importance for clinical labs eager to meet this challenge. We look forward to a time when most variants are interpreted automatically, freeing up clinical geneticists to apply their considerable expertise to the most complicated variants and cases.
Anika Joecker, PhD, is director of global product management for the clinical program at Qiagen Bioinformatics, Aarhus, Denmark, and Sohela Shah, PhD, is principal genome scientist for advanced clinical testing at Qiagen Bioinformatics, Redwood City, Calif. For further information, contact CLP chief editor Steve Halasey.
REFERENCES
1. One in a billion: a boy’s life, a medical mystery: part 1: a baffling illness [online]. Milwaukee Journal Sentinel, December 18, 2010. Available at: http://archive.jsonline.com/news/health/111641209.html. Accessed September 14, 2016.
2. One in a billion: a boy’s life, a medical mystery: part 2: sifting through the DNA haystack [online]. Milwaukee Journal Sentinel, December 21, 2010. Available at: http://archive.jsonline.com/news/health/112248249.html. Accessed September 14, 2016.
3. One in a billion: a boy’s life, a medical mystery: part 3: gene insights lead to a risky treatment [online]. Milwaukee Journal Sentinel, December 25, 2010. Available at: http://archive.jsonline.com/news/health/112249759.html. Accessed September 14, 2016.
4. Johnson M, Gallagher K. Young patient faces new struggles years after DNA sequencing [online]. Milwaukee Journal Sentinel, October 25, 2015. Available at: http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-years-after-dna-sequencing-b99602505z1-336977681.html. Accessed September 14, 2016.
5. Cooper DN, Ball EV, Stenson PD, et al. The human gene mutation database [online]. Cardiff, Wales: Institute of Medical Genetics, Cardiff University, 2015. Available at: www.hgmd.cf.ac.uk/ac/index.php. Accessed October 13, 2016.
6. Clinical Sequencing Exploratory Research [homepage online]. Seattle: CSER Coordinating Center, University of Washington, 2016. Available at: www.cser-consortium.org. Accessed October 13, 2016.
7. ClinVar [homepage online]. Bethesda, Md: National Center for Biotechnology Information, 2016. Available at: www.ncbi.nlm.nih.gov/clinvar. Accessed October 13, 2016.
8. Allele Frequency Community [homepage online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: www.allelefrequencycommunity.org. Accessed October 13, 2016.
9. About ExAC [homepage online]. Cambridge, Mass: Exome Aggregation Consortium, Broad Institute, 2015. Available at: www.exac.broadinstitute.org. Accessed October 13, 2016.
10. End-to-end hereditary disease solution [online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: www.qiagenbioinformatics.com/solutions/hereditary-disease. Accessed October 13, 2016.
11. QCI Interpret for somatic cancer [online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: www.qiagenbioinformatics.com/products/qiagen-clinical-insight. Accessed October 13, 2016.
12. QCI Interpret for hereditary cancer [online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: www.qiagenbioinformatics.com/products/qci-for-hereditary cancer. Accessed October 13, 2016.
13. Joecker A, Shah S, Solomon B, et al. A [sic] efficient and accurate end-to-end next-generation sequencing solution for identifying and interpreting disease-causing variants in rare diseases [abstract P15.06]. Poster presented at the 2016 annual meeting of the European Society of Human Genetics (Barcelona: May 21–24, 2016). Eur J Hum Genet. 2016; 24(E-Suppl 1):333. Available at: www.eshg.org/fileadmin/www.eshg.org/conferences/2016/downloads/eshg2016_abstracts_final.pdf. Accessed October 18, 2016.
14. Shah S, Krämer A, Boycott K, et al. Leveraging network analytics to infer patient syndrome and identify causal mutations using patient DNA sequence and phenotype data [abstract P14.086]. Poster presented at the 2016 annual meeting of the European Society of Human Genetics (Barcelona: May 21–24, 2016). Eur J Hum Genet. 2016; 24(E-Suppl 1):327–328. Available at: www.eshg.org/fileadmin/www.eshg.org/conferences/2016/downloads/eshg2016_abstracts_final.pdf. Accessed October 18, 2016.
15. Joecker A, Shah S, Hadjisavas M, et al. An automatic end-to-end solution for disease-causing variant detection in rare and hereditary diseases with a high case solve rate and a much-reduced false positive rate [poster online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: http://resources.qiagenbioinformatics.com/posters/eshg_anika_fin.pdf. Accessed October 18, 2016.
16. Qiagen Bioinformatics clinical portfolio overview [online]. Redwood City, Calif: Qiagen Bioinformatics, 2016. Available at: http://resources.qiagenbioinformatics.com/flyers-and-brochures/clinical_portfolio_overview.pdf. Accessed October 18, 2016.
17. Richards S, Aziz N, Bale S, et al., on behalf of the American College of Medical Genetics and Genomics laboratory quality assurance committee. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17(5):405–424; doi: 10.1038/gim.2015.30.
18. Rienhoff HY Jr, Yeo CY, Morissette R, et al. A mutation in TGFB3 associated with a syndrome of low muscle mass, growth retardation, distal arthrogryposis and clinical features overlapping with Marfan and Loeys-Dietz syndrome. Am J Med Genet A. 2013;161A(8):2040–2046; doi: 10.1002/ajmg.a.36056.