SMRT sequencing enables the detection of genomic variants for a wide range of clinical applications
By Robert P. Sebra, PhD, and Melissa Laird Smith, PhD
The Mount Sinai Health System in New York serves some 3.5 million patients each year. In doing so, its 7,000-plus physicians rely heavily on clinical laboratory testing to diagnose patients and make critical treatment decisions. Some of the earliest incarnations of genetic assays used in such applications are born in the technology development lab of the Icahn School of Medicine at Mount Sinai.
The task of Mount Sinai’s technology development lab is to assess new technologies, assays, and molecular methods to determine which ones will meet the needs of the systemwide clinical community. While this work falls under the umbrella of research and development, every study is done with an eye toward clinical utility. The overall approach is straightforward: identify a healthcare challenge and determine what set of tools will help to address it quickly, affordably, and through the generation of robust data that facilitates the use of that solution in clinical genetics labs while meeting regulatory guidelines. Once the technology development lab has played its role, the assays can be implemented by the team of experts in Mount Sinai’s clinical genetics or clinical pathology labs.
Mount Sinai’s technology development lab is extremely fortunate to have a broad range of sequencing platforms available. Often, the biggest challenge is selecting which platform will produce the type of data that will be most influential for both physicians and patient populations. Major targets include diseases or genomic variations that cannot be diagnosed with conventional methods.
The technology development lab also actively seeks out tools that can fill gaps in existing knowledge. For example, it is now widely accepted that large and repetitive structural variants have been dramatically underrepresented in analyses of the human genome, yet they are clinically important for accurate typing of the polymorphic human leukocyte antigen (HLA) region, and for diagnosing diseases such as fragile X syndrome that require the assessment of long tandem genetic repeats [1].
When setting up the technology development lab a few years ago, it was decided to implement a single-molecule, real-time (SMRT) sequencing system from Pacific Biosciences, Menlo Park, Calif (see Figures 1, 2). The system was selected because its long reads are uniquely capable of resolving structural variants and other large elements in the genome that go undetected by short-read sequencers. Since then, long-read sequencing has been incorporated as an important validation tool for cancer-related single-nucleotide polymorphisms, gene fusions, and more.
This article will describe several assays that have been developed and evaluated in the lab using SMRT sequencing for a variety of applications. For clinical labs with access to this technology, these assays provide useful data that, in most cases, cannot be generated any other way.
Pharmacogenomics: Drug Metabolism
The CYP2D6 enzyme plays a central role in drug metabolism and is estimated to have direct involvement in metabolizing about 25% of commonly used medications. Genotyping the CYP2D6 gene is widely performed to predict how an individual patient will metabolize drugs ranging from antipsychotics to codeine, making CYP2D6 analysis a prime example of the importance of pharmacogenomic testing to any clinical lab [2].
As the community has learned more about the CYP2D6 gene, however, it has become clear that simple genotyping methods are not sufficient to resolve this highly polymorphic region. Scientists have now classified more than 150 different CYP2D6 alleles [3]. Unambiguous genotyping is further complicated by the presence of copy number variation, with variants demonstrating up to 97% homology to the functional gene. These confounding factors can affect the interpretation of a CYP2D6 genotype and lead to erroneous conclusions about how a patient will metabolize a drug.
Analyzing this region with typical genotyping platforms has a serious limitation: most methods only test for around a dozen of the most common alleles, making it impossible to detect rare or novel alleles that convey important information for drug selection and dosage. Moreover, short-read next-generation sequencing is thwarted by the frequent duplications, deletions, and other structural variants that stretch far beyond what a standard 200-base read can capture.
To get past those limitations, we evaluated the ability of SMRT sequencing to finely resolve the CYP2D6 gene and phase alleles (see “What Is SMRT Sequencing?”). Our hypothesis was that the long reads obtained via SMRT sequencing, many of which are tens of kilobases in length, would generate the nucleotide-level resolution across single molecules needed to consistently and reliably classify alleles (see Figure 3). Working with Mount Sinai clinical colleagues Stuart Scott, PhD, and Yao Yang, PhD, we developed a targeted long-range polymerase chain reaction method for generating amplicons containing the full sequence of interest, to be used as templates for long-read sequencing [4]. These CYP2D6 data captured the full-length gene, its upstream and downstream copies, and even copy number variation duplication events.
We first validated the assay using 10 DNA samples previously classified by the Coriell Institute for Medical Research and available from the institute’s human reference genetic material repository. The SMRT sequencing results called each genotype accurately, and generated new information about variants that had been missed by other orthogonal analysis tools. These data made it possible to refine the genotype calls, detect allele-specific duplication, and identify novel alleles.
We then began a new series of tests, this time applying the pipeline to 14 samples that had been previously analyzed but had produced inconclusive or unreliable results. By directly observing the full-length amplicon sequence—rather than inferring it from a reconstruction of short reads that could too easily conflate sequence data—the pipeline addressed the earlier test discrepancies. This approach confirmed a number of challenging diplotypes and also allowed for better resolution, structural variant detection, and allele phasing.
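The idea of reading phase directly from full-length molecules can be illustrated with a toy sketch. This is not the published pipeline: the function name, the read strings, and the site positions are all hypothetical, and the sketch assumes every read spans the whole amplicon, is already aligned to the same coordinates, and contains no sequencing errors or indels.

```python
from collections import Counter

def phase_from_full_length_reads(reads, sites):
    """Count the haplotypes observed across full-length reads.

    reads: aligned amplicon sequences of equal length (hypothetical input).
    sites: 0-based positions of heterozygous sites within the amplicon.
    Because each read covers every site, the bases seen on one read
    belong to the same molecule, so no statistical inference is needed.
    """
    haplotypes = Counter(tuple(read[pos] for pos in sites) for read in reads)
    # A diploid sample without duplications is expected to show two haplotypes.
    return haplotypes.most_common(2)

# Toy data: five reads, two heterozygous sites at positions 1 and 3.
toy_reads = ["ACGT", "ACGT", "ATGA", "ATGA", "ACGT"]
print(phase_from_full_length_reads(toy_reads, [1, 3]))
# [(('C', 'T'), 3), (('T', 'A'), 2)]
```

With short reads, the two sites would often fall on different fragments, and the pairing of C with T (rather than C with A) would have to be inferred rather than observed.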
These studies yielded compelling data suggesting that other methods for genotyping CYP2D6 have been misclassifying results more often than anticipated. Based on our results, about 20% of samples tested had their genotypes revised, in some cases to include a novel allele and in others to reassign the call to an existing rare allele. The discovery of three novel alleles in a region as well studied as this one, and with such small sample numbers, underscores the importance of generating higher-resolution information. Overall, these findings suggest that many established genotyping pipelines for CYP2D6 are producing inaccurate data for a significant proportion of patients, leading to less-informed therapeutic decisions.
In addition to publishing the full pipeline for others to evaluate, we have also improved the sequencing multiplex capability using current technologies. By barcoding samples, it is now possible to multiplex up to 384 samples in a single run for >100x coverage of each individual’s CYP2D6 region using the PacBio Sequel system. It seems likely that long-read CYP2D6 analysis could address some of the inconsistencies reported in earlier pharmacogenomic studies, as these may well have been caused by inaccurate or incomplete genotype information of the samples or participants involved.
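The multiplexing figures above imply a simple read budget, sketched here as back-of-the-envelope arithmetic. The sketch assumes each read spans the full amplicon (so one read contributes 1x coverage to one sample); the function names and the pooling-imbalance factor are illustrative assumptions, not measured properties of any instrument.

```python
import math

def reads_required(samples: int, coverage: int) -> int:
    """Minimum full-length reads for a pooled run with perfectly balanced barcodes."""
    return samples * coverage

def reads_with_imbalance(samples: int, coverage: int, min_fraction: float) -> int:
    """Reads needed if the least-represented barcode receives only
    `min_fraction` of its fair share (hypothetical pooling imbalance)."""
    if not 0 < min_fraction <= 1:
        raise ValueError("min_fraction must be in (0, 1]")
    return math.ceil(samples * coverage / min_fraction)

print(reads_required(384, 100))             # 38400
print(reads_with_imbalance(384, 100, 0.5))  # 76800
```

In other words, 384 barcoded samples at 100x each call for at least 38,400 usable full-length reads, and more if barcodes are unevenly represented in the pool.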
Personalized Cancer Therapy Program
Clinical laboratories around the world are working to build robust processes to realize the potential of cancer genomics for tailoring treatment to each patient. The advances from genomic data in biomedical cancer research are so promising that everyone is trying to find ways to accelerate the generation and use of this information for optimal and individualized patient care.
In many cases, labs rely on tumor exome sequencing to identify driver mutations that might provide guidance on the best course of treatment. At Mount Sinai, our research institute was founded on the principle that integrating technologies and respective data sets provides deeper insight and a more comprehensive view than any individual source. That concept guided what has become our personalized cancer therapy research program.
This integrative genomics approach was developed to combine exome or whole-genome sequencing, microarray genotyping, RNA-seq, and targeted long-read sequencing for each patient’s tumor and a matched normal sample. The in-depth analyses interrogate somatic and germline mutations, copy number variations, changes in gene expression, and gene fusions.
In a study designed to test this pipeline, we returned results to 46 patients and their physicians. The reports included genomic results predictive of prognosis, drug response, and toxicity for a wide range of cancer types [5]. For most patients, we found nearly five times (and in some cases more than 13 times) more relevant mutations than were detected by typical cancer-specific genomic panel tests used in clinical labs today. The method revealed potentially actionable variants for 91% of patients, also a significant improvement over existing clinical tests. In four cases, those findings changed the treatment course for patients.
Naturally, given the cost of generating so many data sets for each individual patient, it would be challenging to roll out this concept in a clinical lab. We believe that a phased approach will provide the best combination of high-quality results and cost-effectiveness. Clinical labs could, for instance, start with targeted panels and add new layers of data as appropriate when actionable mutations are elusive (see Figure 4).
One of the most important aspects of Mount Sinai’s personalized cancer therapy research program is giving clinicians and bioinformaticians the ability to review reported mutations and quickly filter them from the total variants detected down to the handful that are most diagnostic, prognostic, or therapeutically actionable. In our study, the pipeline had to allow for in-depth bioinformatics analyses while still ending with a clinician-friendly report of key variants.
To achieve this goal, we began with somatic variants uncovered in short-read exome or whole-genome sequencing data, then narrowed the list to a manageable subset of no more than 100 variants most likely to be useful for characterizing the cancer. Because these variants were considered the high-value targets, it was essential to carefully validate each one. SMRT sequencing was used to perform highly accurate circular consensus sequencing for each genomic region harboring the mutations of interest. This protocol offers the same level of orthogonal validation as Sanger sequencing at a lower cost when multiplexed. With the Sequel system, we could detect variants present at frequencies below 1%, and because we sequenced many individual molecules from each sample, we obtained an accurate view of even the most heterogeneous tumors. With barcoding, it is also possible to pool many amplicons in a single run, in order to analyze several patients simultaneously and keep costs in check.
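To give a feel for why sequencing many molecules matters for low-frequency variants, the following sketch uses a plain binomial model to ask how many molecules must be read before a variant at a given frequency is likely to be seen at least a chosen number of times. It is an idealized illustration, not our validation protocol: it assumes molecules are sampled independently and ignores sequencing error, and the function names and thresholds are hypothetical.

```python
from math import comb

def prob_at_least(k: int, n: int, f: float) -> float:
    """P(X >= k) for X ~ Binomial(n, f): the chance of observing a variant
    with frequency f on at least k of n independently sampled molecules."""
    below = sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))
    return 1.0 - below

def molecules_needed(f: float, k: int, target: float) -> int:
    """Smallest n for which the variant is seen at least k times
    with probability >= target (simple linear search)."""
    n = k
    while prob_at_least(k, n, f) < target:
        n += 1
    return n

# A variant in 1% of molecules, to be seen at least once or at least five times
# with 95% probability (illustrative thresholds only).
print(molecules_needed(0.01, 1, 0.95))
print(molecules_needed(0.01, 5, 0.95))
```

Requiring several supporting molecules rather than one raises the read count substantially, which is one reason pooling barcoded amplicons helps keep per-sample costs manageable.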
Once the target variants were validated, we gained a better sense of how each annotated variant was likely to alter biological function. That information is culled from public databases and manually curated repositories, and can be followed up with cell-based and other functional assays to confirm the effect. The end result is a clear view of mutations that are likely to have diagnostic, prognostic, or therapy-selection value. This approach can be replicated in clinical laboratories that have access to several genomic technologies, such as short- and long-read sequencing, plus microarrays and panel tests.
One of the most exciting clinical areas now being explored is profiling the complexity of the adaptive immune system. This is a rapidly developing field, as Sanger sequencing has never had sufficient throughput to efficiently sample the diversity of the immune response. The high throughput of next-generation sequencing has opened these doors, however, enabling investigators to explore and begin to map the highly variable landscape of each individual’s immune system.
Since this hasn’t been a common application for clinical labs, here’s a quick refresher: three families of genes in the human genome (V, D, and J genes) encode multiple alleles that are recombined in a mix-and-match type of process. Known as V(D)J recombination, this process forms the astonishingly diverse array of antibodies and T-cell receptors that enable the human immune system to respond to a wide range of threats.
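The mix-and-match step can be sketched in a few lines. The segment names and counts below are toy placeholders, not actual human gene counts, and the function name is hypothetical. The sketch covers only the combinatorial choice of one V, one D, and one J segment; in real cells, changes introduced at the junctions between segments add further diversity that is not modeled here.

```python
from itertools import product

def vdj_combinations(v_segments, d_segments, j_segments):
    """Enumerate every V-D-J combination from the supplied segment lists."""
    return [f"{v}-{d}-{j}" for v, d, j in product(v_segments, d_segments, j_segments)]

# Toy example: 3 V, 2 D, and 2 J segments give 3 * 2 * 2 = 12 combinations.
combos = vdj_combinations(["V1", "V2", "V3"], ["D1", "D2"], ["J1", "J2"])
print(len(combos))  # 12
print(combos[:3])   # ['V1-D1-J1', 'V1-D1-J2', 'V1-D2-J1']
```

Because the number of combinations is the product of the segment counts, even modest numbers of segments in each family yield a very large repertoire.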
Characterizing the immunoglobulin loci has been difficult on its own because the regions are marked by segmental duplication, structural variants, and other challenging sequence contexts. SMRT sequencing was chosen to address this task because its long reads can generate extremely accurate and contiguous sequences across the short (~600 bp) V(D)J regions without any bioinformatics manipulation, and because its lack of systematic bias ensures an accurate representation.
Because examination of this region has been limited by technical and throughput challenges, V(D)J sequencing has not previously been used in clinical diagnostics. But this region influences patient susceptibility and outcomes in a large number of common diseases, such as diabetes, HIV, influenza, and lupus erythematosus. It also affects the ability of an individual to respond to vaccination. In the future, incorporating V(D)J sequencing into clinical lab testing could have a major impact on improving patient care.
Getting to that point, however, will require building a public database that represents as much of the existing V(D)J diversity as possible. Today’s studies of this region are reminiscent of the early days of HLA typing: investigators are rushing to catalog and validate as many variants as possible, including the specific receptors that are important for modulating the immune response in an effective manner. Considering that every T-cell and B-cell theoretically expresses a different V(D)J sequence, it is apparent what a massive undertaking it will be just to describe the diversity of the human immune system. Resolving this complexity at the single-molecule level will be a huge step forward.
It’s still early days for elucidating the full repertoire of V(D)J alleles, but as teams around the world continue the cataloging effort, clinical labs will ultimately be able to produce medically actionable information about the immune response for any patient. In the past few years, scientists have identified more and more diseases in which particular alleles of these V(D)J genes encode antibodies or T-cell receptors that appear to produce a more or less effective immune response. Knowing a patient’s specific risk and likelihood of effective response to a disease or vaccine will be incredibly useful in tailoring care.
The efforts described here are in varying stages of being deployed for clinical use. Naturally, all of them are ready and available to the R&D community, and we anticipate that some labs will use our reports to develop their own protocols for laboratory-developed tests. At Mount Sinai, we work closely with our cutting-edge clinical genetics team to ensure that the methods we hand off meet the requirements for certification under the Clinical Laboratory Improvement Amendments of 1988 (CLIA). We also actively collaborate with external groups interested in adopting these methods for research use.
Ultimately, our lab does not focus on any single clinical indication. Our mandate is to find the right combination of science, technology, and medicine to make a difference in patient care, and we look for opportunities to do that across as many disease types and conditions as possible. For the work presented here, and many other ongoing projects, long-read sequencing is an excellent approach on its own, or serves as an ideal orthogonal validation tool.
There’s regrettably little information available to most clinical labs about which sequencing technology is the best fit for interrogating a particular genomic region. Our rule of thumb is that in any case where large elements such as repeat expansions, copy number variants, or indels matter, SMRT sequencing is much more likely to produce reliable results than short-read sequencing or array-based genotyping.
Whenever possible, we opt for the highest-resolution technology or combination of tools to improve discovery of all potentially useful variants and generate the most information we can from a region, as this ultimately will have to inform the course of patient treatment. Elucidating the full range of natural human genetic variation will be a necessary foundation for future development of new tests, therapies, and other elements of optimal healthcare.
Robert P. Sebra, PhD, is an associate professor of genetics and genomic sciences, and director of technology development, in the Institute of Genomics and Multiscale Biology at the Icahn School of Medicine at Mount Sinai. Melissa Laird Smith, PhD, is an assistant professor of genetics and genomic sciences, and assistant director of the technology development team, in the Institute of Genomics and Multiscale Biology at the Icahn School of Medicine at Mount Sinai. For further information, contact CLP chief editor Steve Halasey.
References
1. Ritz A, Bashir A, Sindi S, Hsu D, Hajirasouliha I, Raphael BJ. Characterization of structural variants with single molecule and hybrid sequencing approaches. Bioinformatics. 2014;30(24):3458–3466; doi: 10.1093/bioinformatics/btu714.
2. Scott SA, Yang Y. Long-read CYP2D6 sequencing enables full gene characterization and novel allele discovery. Drug Discovery and Development. October 13, 2016. Available at: www.dddmag.com/article/2016/10/long-read-cyp2d6-sequencing-enables-full-gene-characterization-and-novel-allele-discovery. Accessed February 27, 2017.
3. CYP2D6 allele nomenclature. In: Human Cytochrome P450 (CYP) Allele Nomenclature Database [online]. Solna, Sweden: Karolinska Institute, 2017. Available at: www.cypalleles.ki.se/cyp2d6.htm. Accessed February 27, 2017.
4. Qiao W, Yang Y, Sebra R, Mendiratta G, Gaedigk A, Desnick RJ, Scott SA. Long-read single molecule real-time full gene sequencing of cytochrome P450-2D6. Hum Mutat. 2016;37(3):315–323; doi: 10.1002/humu.22936.
5. Uzilov AV, Ding W, Fink MY, et al. Development and clinical application of an integrative genomic approach to personalized cancer therapy. Genome Med. 2016;8(1):62; doi: 10.1186/s13073-016-0313-0.