Integrated digital error suppression for improved detection of circulating tumor DNA

Abstract

High-throughput sequencing of circulating tumor DNA (ctDNA) promises to facilitate personalized cancer therapy. However, low quantities of cell-free DNA (cfDNA) in the blood and sequencing artifacts currently limit analytical sensitivity. To overcome these limitations, we introduce an approach for integrated digital error suppression (iDES). Our method combines in silico elimination of highly stereotypical background artifacts with a molecular barcoding strategy for the efficient recovery of cfDNA molecules. Individually, these two methods each improve the sensitivity of cancer personalized profiling by deep sequencing (CAPP-Seq) by about threefold, and synergize when combined to yield 15-fold improvements. As a result, iDES-enhanced CAPP-Seq facilitates noninvasive variant detection across hundreds of kilobases. Applied to non-small cell lung cancer (NSCLC) patients, our method enabled biopsy-free profiling of EGFR kinase domain mutations with 92% sensitivity and >99.99% specificity at the variant level, and with 90% sensitivity and 96% specificity at the patient level. In addition, our approach allowed monitoring of NSCLC ctDNA down to 4 in 10^5 cfDNA molecules. We anticipate that iDES will aid the noninvasive genotyping and detection of ctDNA in research and clinical settings.

Figure 1: Framework for noninvasive profiling of ctDNA.
Figure 2: Development of iDES.
Figure 3: Technical performance of iDES.
Figure 4: Noninvasive tumor genotyping with iDES-enhanced CAPP-Seq.
Figure 5: Ultrasensitive ctDNA detection and monitoring with iDES-enhanced CAPP-Seq.
Figure 6: iDES-enhanced CAPP-Seq.

Accession codes

Primary accessions: Sequence Read Archive

Referenced accessions: Sequence Read Archive

References

  1. Heitzer, E., Ulz, P. & Geigl, J.B. Circulating tumor DNA as a liquid biopsy for cancer. Clin. Chem. 61, 112–123 (2015).

  2. Diehl, F. et al. Circulating mutant DNA to assess tumor dynamics. Nat. Med. 14, 985–990 (2008).

  3. Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med. 6, 224ra24 (2014).

  4. Bratman, S.V., Newman, A.M., Alizadeh, A.A. & Diehn, M. Potential clinical utility of ultrasensitive circulating tumor DNA detection with CAPP-Seq. Expert Rev. Mol. Diagn. 15, 715–719 (2015).

  5. Diaz, L.A. Jr. & Bardelli, A. Liquid biopsies: genotyping circulating tumor DNA. J. Clin. Oncol. 32, 579–586 (2014).

  6. Kurtz, D.M. et al. Noninvasive monitoring of diffuse large B-cell lymphoma by immunoglobulin high-throughput sequencing. Blood 125, 3679–3687 (2015).

  7. Butler, T.M. et al. Exome sequencing of cell-free DNA from metastatic cancer patients identifies clinically actionable mutations distinct from primary disease. PLoS One 10, e0136407 (2015).

  8. Newman, A.M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).

  9. Taniguchi, K. et al. Quantitative detection of EGFR mutations in circulating tumor DNA derived from lung adenocarcinomas. Clin. Cancer Res. 17, 7808–7815 (2011).

  10. Jabara, C.B., Jones, C.D., Roach, J., Anderson, J.A. & Swanstrom, R. Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proc. Natl. Acad. Sci. USA 108, 20166–20171 (2011).

  11. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K.W. & Vogelstein, B. Detection and quantification of rare mutations with massively parallel sequencing. Proc. Natl. Acad. Sci. USA 108, 9530–9535 (2011).

  12. Schmitt, M.W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl. Acad. Sci. USA 109, 14508–14513 (2012).

  13. Kennedy, S.R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).

  14. Gregory, M.T. et al. Targeted single molecule mutation detection with massively parallel sequencing. Nucleic Acids Res. 44, e22 (2016).

  15. Kukita, Y. et al. High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients. DNA Res. 22, 269–277 (2015).

  16. Lou, D.I. et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc. Natl. Acad. Sci. USA 110, 19872–19877 (2013).

  17. Schmitt, M.W. et al. Sequencing small genomic targets with high efficiency and extreme accuracy. Nat. Methods 12, 423–425 (2015).

  18. De Mattos-Arruda, L. et al. Cerebrospinal fluid-derived circulating tumour DNA better represents the genomic alterations of brain tumours than plasma. Nat. Commun. 6, 8839 (2015).

  19. Costello, M. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).

  20. Chen, G., Mosier, S., Gocke, C.D., Lin, M.T. & Eshleman, J.R. Cytosine deamination is a major cause of baseline noise in next-generation sequencing. Mol. Diagn. Ther. 18, 587–593 (2014).

  21. Leon, S.A., Shapiro, B., Sklaroff, D.M. & Yaros, M.J. Free DNA in the serum of cancer patients and the effect of therapy. Cancer Res. 37, 646–650 (1977).

  22. Hafner, C. et al. Oncogenic PIK3CA mutations occur in epidermal nevi and seborrheic keratoses with a characteristic mutation pattern. Proc. Natl. Acad. Sci. USA 104, 13450–13454 (2007).

  23. Higgins, M.J. et al. Detection of tumor PIK3CA status in metastatic breast cancer using peripheral blood. Clin. Cancer Res. 18, 3462–3469 (2012).

  24. Sequist, L.V. et al. Rociletinib in EGFR-mutated non-small-cell lung cancer. N. Engl. J. Med. 372, 1700–1709 (2015).

  25. Oxnard, G.R. et al. Noninvasive detection of response and resistance in EGFR-mutant lung cancer using quantitative next-generation genotyping of cell-free plasma DNA. Clin. Cancer Res. 20, 1698–1705 (2014).

  26. Pao, W. et al. EGF receptor gene mutations are common in lung cancers from “never smokers” and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc. Natl. Acad. Sci. USA 101, 13306–13311 (2004).

  27. Pao, W. et al. Acquired resistance of lung adenocarcinomas to gefitinib or erlotinib is associated with a second mutation in the EGFR kinase domain. PLoS Med. 2, e73 (2005).

  28. Sequist, L.V. et al. Genotypic and histological evolution of lung cancers acquiring resistance to EGFR inhibitors. Sci. Transl. Med. 3, 75ra26 (2011).

  29. Douillard, J.Y. et al. Gefitinib treatment in EGFR mutated caucasian NSCLC: circulating-free tumor DNA as a surrogate for determination of EGFR status. J. Thorac. Oncol. 9, 1345–1353 (2014).

  30. Mok, T. et al. Detection and dynamic changes of EGFR mutations from circulating tumor DNA as a predictor of survival outcomes in NSCLC patients treated with first-line intercalated erlotinib and chemotherapy. Clin. Cancer Res. 21, 3196–3203 (2015).

  31. Misale, S. et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 486, 532–536 (2012).

  32. Murtaza, M. et al. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature 497, 108–112 (2013).

  33. Thress, K.S. et al. Acquired EGFR C797S mutation mediates resistance to AZD9291 in non-small cell lung cancer harboring EGFR T790M. Nat. Med. 21, 560–562 (2015).

  34. Marchetti, A. et al. Early prediction of response to tyrosine kinase inhibitors by quantification of EGFR mutations in plasma of NSCLC patients. J. Thorac. Oncol. 10, 1437–1443 (2015).

  35. Dawson, S.J. et al. Analysis of circulating tumor DNA to monitor metastatic breast cancer. N. Engl. J. Med. 368, 1199–1209 (2013).

  36. Garcia-Murillas, I. et al. Mutation tracking in circulating tumor DNA predicts relapse in early breast cancer. Sci. Transl. Med. 7, 302ra133 (2015).

  37. Roschewski, M. et al. Circulating tumour DNA and CT monitoring in patients with untreated diffuse large B-cell lymphoma: a correlative biomarker study. Lancet Oncol. 16, 541–549 (2015).

  38. Samorodnitsky, E. et al. Evaluation of hybridization capture versus amplicon-based methods for whole-exome sequencing. Hum. Mutat. 36, 903–914 (2015).

  39. Drilon, A. et al. Broad, hybrid capture-based next-generation sequencing identifies actionable genomic alterations in lung adenocarcinomas otherwise negative for such alterations by other genomic testing approaches. Clin. Cancer Res. 21, 3631–3639 (2015).

  40. Rehm, H.L. et al. ACMG clinical laboratory standards for next-generation sequencing. Genet. Med. 15, 733–747 (2013).

  41. Ellis, P.M., Verma, S., Sehdev, S., Younus, J. & Leighl, N.B. Challenges to implementation of an epidermal growth factor receptor testing strategy for non-small-cell lung cancer in a publicly funded health care system. J. Thorac. Oncol. 8, 1136–1141 (2013).

  42. Leighl, N.B. et al. Molecular testing for selection of patients with lung cancer for epidermal growth factor receptor and anaplastic lymphoma kinase tyrosine kinase inhibitors: American Society of Clinical Oncology endorsement of the College of American Pathologists/International Association for the study of lung cancer/association for molecular pathology guideline. J. Clin. Oncol. 32, 3673–3679 (2014).

  43. Lim, C. et al. Biomarker testing and time to treatment decision in patients with advanced nonsmall-cell lung cancer. Ann. Oncol. 26, 1415–1421 (2015).

  44. Shiau, C.J. et al. Sample features associated with success rates in population-based EGFR mutation testing. J. Thorac. Oncol. 9, 947–956 (2014).

  45. Yatabe, Y. et al. EGFR mutation testing practices within the Asia Pacific region: results of a multicenter diagnostic survey. J. Thorac. Oncol. 10, 438–445 (2015).

  46. Hindson, B.J. et al. High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal. Chem. 83, 8604–8610 (2011).

  47. Forbes, S.A. et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 43, D805–D811 (2015).

  48. Su, Z. et al. A platform for rapid detection of multiple oncogenic mutations with relevance to targeted therapy in non-small-cell lung cancer. J. Mol. Diagn. 13, 74–84 (2011).

  49. Lambert, D. Zero-inflated poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14 (1992).

Acknowledgements

This work was supported by grants from the Department of Defense (A.M.N., M.D., A.A.A.), the National Cancer Institute (A.M.N., 1K99CA187192-01A1; M.D., A.A.A., R01CA188298), the US National Institutes of Health Director's New Innovator Award Program (M.D., 1-DP2-CA186569), a US Public Health Service/National Institutes of Health U01 CA194389 (A.A.A.), the Ludwig Institute for Cancer Research (M.D., A.A.A.), a Stanford Cancer Institute-Developmental Cancer Research Award (M.D., A.A.A.), the CRK Faculty Scholar Fund (M.D.), V-Foundation (A.A.A.), Damon Runyon Cancer Research Foundation (A.A.A.) and a grant from both the Siebel Stem Cell Institute and the Thomas and Stacey Siebel Foundation (A.M.N.).

Author information

Contributions

A.M.N., A.F.L., D.M. Klass, M.D., and A.A.A. developed the concept, designed the experiments, and analyzed the data. A.M.N., A.F.L., M.D., and A.A.A. wrote the manuscript. A.F.L. and D.M. Klass performed the molecular biology experiments with assistance from D.M. Kurtz, J.J.C., F.S., S.V.B., and L.Z. Bioinformatics analyses were performed by A.M.N. with assistance from A.F.L., H.S., and C.L.L. Patient specimens were provided by C.S., J.N.C., R.B.W., G.W.S. Jr., J.B.S., B.W.L. Jr., J.W.N., H.A.W., and M.D. All authors commented on the manuscript at all stages. A.A.A. and M.D. contributed equally as senior authors.

Corresponding authors

Correspondence to Maximilian Diehn or Ash A. Alizadeh.

Ethics declarations

Competing interests

A.M.N., D.M. Klass, S.V.B., M.D., and A.A.A. are co-inventors on patent applications related to CAPP-Seq. A.M.N., M.D., and A.A.A. are consultants for, and A.F.L. and D.M. Klass are employed by, Roche Molecular Systems. A.A.A. has served as a consultant for Genentech, Gilead, and Celgene. M.D. has served as a consultant for Novartis and Quanticel Pharmaceuticals. M.D. and B.W.L. Jr. have received research funding from Varian Medical Systems. B.W.L. Jr. has received research support from RaySearch Laboratories and is a founder and board member of TibaRay, Inc.

Integrated supplementary information

Supplementary Figure 1 Overview and initial performance characterization of CAPP-Seq barcode adapters.

(a) Diagram illustrating design and usage of custom sequencing adapters that implement two types of molecular barcodes. Shown are the initial molecule to which adapters are ligated (left), the two molecules derived from one round of PCR applied to the original molecule (top right), and the sequencing reads derived from these two post-PCR molecules (bottom right). Index and insert barcode types are indicated by blue/red and purple/green blocks, respectively. The sample multiplexing barcode is indicated in orange. (b) Comparison of selector-wide error rates and base substitution distributions across 12 healthy control cfDNA samples (Supplementary Table 2) for the following methods: no barcoding or polishing, index barcode de-duplication, insert barcode de-duplication (with and without considering duplex-supported barcodes), insert barcode followed by index barcode de-duplication (insert–index; here, singleton variants were ignored for insert de-duplication and were only eliminated if not supported by an index barcode family with ≥2 members), and duplex-only de-duplication (i.e., only molecules with both strands of the original duplex). Error bars represent s.e.m. (c) CAPP-Seq libraries were made from 12 healthy control cfDNA samples (Supplementary Table 2), and the libraries were sequenced with and without the inclusion of 10% PhiX during sequencing. In addition, CAPP-Seq libraries were made from seven different healthy control cfDNA samples using staggered insert barcodes—four with short and three with long barcodes (Methods). The selector-wide error rates for each of these 31 samples are shown. Errors in b,c were determined as described in Calculation of selector-wide error profiles in Methods. Group comparisons were performed with a paired two-sided t test (**, P < 1.4 × 10^−10; *, P < 0.03).

Supplementary Figure 2 Analysis of barcode complexity.

(a,b) The fraction of distinct cfDNA molecules that have the same start/end positions and the same barcodes (i.e., barcode collisions) is shown for two clinically obtainable quantities of recovered hGEs from the same 32ng normal donor sample. Barcode collisions were predicted based on the number of molecules with identical start/end coordinates and the number of possible physical UIDs (=256). (c) Estimated percentages of unambiguously barcoded molecules, pooled from results in a,b. (d) Percentage of uniquely barcoded molecules in a,b calculated using an approach that counts collisions within UID families containing heterozygous SNPs. Data are presented as means +/− 95% confidence intervals, and are shown for 2,100 and 4,000 recovered hGEs across 169 and 153 heterozygous SNPs with adequate coverage (>50% of the median depth), respectively. Additional details are provided in Supplementary Note.
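
The collision prediction in a,b can be approximated with a simple occupancy argument. The sketch below assumes that barcodes are drawn independently and uniformly from the 256 possible UIDs and that molecules are compared only within a group sharing start/end coordinates. The function name and the example group sizes are illustrative and are not taken from the Supplementary Note.

```python
def collision_fraction(n_molecules: int, n_uids: int = 256) -> float:
    """Expected fraction of molecules that share their UID with at least one
    other molecule having identical start/end coordinates.

    Assumes each molecule draws its UID independently and uniformly.
    """
    if n_molecules < 1:
        raise ValueError("need at least one molecule")
    # A given molecule is collision-free when none of the other
    # n_molecules - 1 molecules drew the same UID.
    p_unique = (1.0 - 1.0 / n_uids) ** (n_molecules - 1)
    return 1.0 - p_unique


if __name__ == "__main__":
    # Hypothetical numbers of molecules sharing one start/end position.
    for n in (1, 2, 5, 20):
        print(n, round(collision_fraction(n), 4))
```

Under these assumptions the collision fraction grows with the number of molecules sharing a position, which is why the prediction depends on the quantity of recovered hGEs.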

Supplementary Figure 3 Evaluation of CAPP-Seq library efficiency.

(a) Workflow for estimating the fraction of input haploid genome equivalents (hGEs) recovered post-capture but prior to sequencing using the mark and recapture method to estimate population size. (b) Comparison of observed total hGEs (left) and duplex hGEs (right) with estimations of their respective numbers in the post-capture library. The robustness of mark-recapture estimation was assessed over a wide range of sequencing depths by (i) using two lanes with ~2 fold difference in sequencing reads and (ii) down-sampling Lane 1 to 1/2 and 1/10 the original number of reads. Notably, when Lane 1 was down-sampled to approximately the same number of reads as Lane 2, the quantities of observed hGEs were comparable between the two lanes (see green square and red triangle), validating the results. Additional details are provided in Supplementary Note.
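
For readers unfamiliar with mark and recapture, the classic Lincoln-Petersen estimator gives the flavor of the calculation in panel a. This is a generic sketch, not the procedure from the Supplementary Note: treating the molecules seen in one sequencing sample as "marked" and those in a second sample as the "recapture" is an assumption made here for illustration, as are the function name and the counts.

```python
def lincoln_petersen(marked: int, captured: int, recaptured: int) -> float:
    """Estimate population size N from a mark-recapture experiment.

    marked     -- distinct molecules observed in the first sample (M)
    captured   -- distinct molecules observed in the second sample (C)
    recaptured -- molecules observed in both samples (R)
    """
    if recaptured <= 0:
        raise ValueError("no recaptures: population size cannot be estimated")
    if recaptured > min(marked, captured):
        raise ValueError("recaptured cannot exceed either sample size")
    # If sampling is random, R / C estimates the marked fraction M / N.
    return marked * captured / recaptured


if __name__ == "__main__":
    # Hypothetical counts: 4,000 molecules in sample 1, 3,000 in sample 2,
    # 1,500 seen in both.
    print(lincoln_petersen(4000, 3000, 1500))  # 8000.0
```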

Supplementary Figure 4 Assessment of barcode recovery rates.

Fold over-sequencing relative to input hGEs is shown for (a) the number of on-target input reads needed to build a single stranded consensus sequence (SSCS) molecule and (b) the number of SSCS molecules needed to build a double stranded consensus sequence (DCS) molecule. Regression lines (dashed lines), showing a linear fit (a) and power series (b), were determined across 202 cfDNA samples sequenced with insert barcodes (Fig. 2a, Supplementary Table 2). (c) DCS efficiency, as described in Kennedy et al. (ref. 13), was determined for all on-target reads as a function of peak SSCS family size (mode). Results are shown for a highly over-sequenced 32ng cfDNA sample that was down-sampled in defined intervals. Further details are provided in Supplementary Note. (d) Observed versus predicted duplex molecules recovered by CAPP-Seq. Duplex recovery rates were predicted for 202 cfDNA samples sequenced with insert barcodes (Supplementary Table 2), and were determined using both the number of input hGEs and number of recovered single stranded hGEs for each sample. Modeling was performed as described in Supplementary Note. Results in a,b,d are shown separately for our most common input mass (32ng, n = 153) and lower input masses (<32ng, n = 48).

Supplementary Figure 5 Reproducibility of cfDNA sequence errors, and performance comparison of error suppression techniques.

(a) Top: Heat map illustrating recurrent background patterns across ~1.3Mb of targeted cfDNA sequence data profiled by De Mattos-Arruda and colleagues (ref. 18). Five representative cfDNA samples are shown (4 plasma and 1 cerebrospinal fluid sample (CSF)). Accession numbers are provided in the caption of Supplementary Fig. 6e. Bottom: Heat map depicting recurrent background errors across 53kb of shared genomic coordinates between the samples above and 12 normal control cfDNA samples from Fig. 2b of this work. Color scales are identical to panel b. (b) Top: Heat map showing stereotypical background errors across all 172 cfDNA samples from subjects analyzed in this study, including 30 normal controls, 12 of which were used as a training cohort to learn stereotypical background errors, and 142 cfDNA samples collected from NSCLC patients (Supplementary Table 2). The differential impact of barcoding and background polishing is shown. Bottom: Base substitution distributions and selector-wide base-level error rates corresponding to samples in the heat map above. Errors in a,b were determined as described in Calculation of selector-wide error profiles in Methods.

Supplementary Figure 6 Characterization of base substitution errors following treatment with DNA repair enzymes, during hybridization, and in independent studies.

(a) Prior to library preparation, cfDNA from a single healthy control patient was treated with one of six DNA repair enzyme conditions: UDG, FPG, UDG and FPG, PreCR, PreCR with BSA (Methods). Median depths, error rates, and base substitution distributions are shown. (b) Error rates of reciprocal base substitution types were compared across 12 healthy control patients (Supplementary Table 2) from two vantage points: the sequencer (left) or the human reference genome (+ strand) (right). Left: Errors that would be read identically on the sequencer were pooled (G>T on the genomic + strand and G>T on the genomic – strand were treated identically), and the ratios of complementary base substitutions were determined. Right: Errors that mapped to the same substitution on the + strand were combined (G>T on the genomic + strand and C>A on the genomic – strand were treated identically), and the ratios of complementary base substitutions were determined. These results indicate that errors on the sequencer are unlikely to contribute significantly to the high G>T skew observed after mapping. (c) CAPP-Seq was performed on 12 sequencing libraries derived from cfDNA from the same healthy control patient, with hybridization capture performed for a range of times (Supplementary Table 2). Three samples were captured for each time point and the genome equivalent recovery (% hGEs recovered), percent of reads that mapped to selector regions (% on-target rate), and ratio of complementary error types (mapped to the + strand of the genome) were calculated for all samples, then averaged for each time point. Error bars = s.e.m. (d) Proposed model to explain the imbalance of G>T versus C>A observed in mapped sequencing data (Fig. 2b). Following adapter ligation and pre-capture PCR, targeted enrichment is performed. Since only the plus strand of the genome is captured, errors occurring on the minus strand do not propagate through post-capture PCR and sequencing. 
We hypothesize that oxidation-induced 8-oxoguanine is the primary cause of G>T damage (ref. 19). Notably, this model predicts that hybrid capture reagents targeting the minus strand will yield the opposite base substitution imbalance (i.e., higher ratio of C>A to G>T errors). (e) The ratio of background errors for reciprocal base substitutions in cfDNA samples from two independent studies and three hybrid capture reagents. Left: Five cfDNA samples captured by NimbleGen SeqCap (ref. 18), which targets the plus strand of the reference genome. Right: Five cfDNA samples captured by the Agilent SureSelect Human All Exon v4 UTR kit (n = 3; ref. 7) and by the Nextera Rapid Capture Exome kit (37Mb) (Illumina) (n = 2; ref. 18), both of which target the minus strand of the reference genome. Medians with min-to-max ranges are shown. The following accession numbers from the NCBI Sequence Read Archive (SRA) were analyzed: Left panel, SRR1657007, SRR1646742, SRR1656953, SRR1657055, SRR1657062; Right panel, ERR852106, ERR855950, ERR855949, SRR1654347, SRR1654380. Additional details are provided in Supplementary Note. Errors in a–c,e were determined as described in Calculation of selector-wide error profiles in Methods.

Supplementary Figure 7 Statistical modeling and polishing of recurrent background errors.

(a) For all genomic positions modeled by the Weibull distribution (Background polishing in Methods), we assessed goodness-of-fit using linear regression applied to quantile-quantile plots. Resulting data are depicted as density plots showing Pearson’s correlations (top) and corresponding p-values (bottom). (b) Scatterplots comparing fractions of background alleles between pre- and post-barcode de-duplicated data, shown for the same normal controls plotted in Fig. 2b. Allele fractions (AFs) were averaged across the 12 controls for each base substitution at a given genomic position. (c) Two sets of 12 genetically distinct normal control cfDNA samples were run on separate lanes on different dates (Supplementary Table 2). Each cohort was then used to learn position-specific background distributions in data without duplicates removed (‘without barcoding’), and the resulting models were applied to polish errors in 142 barcode de-duplicated NSCLC cfDNA samples. Resulting selector-wide error rates (with medians and interquartile ranges) are shown before and after barcode de-duplication (‘without and with barcoding’, respectively) and after iDES, trained using either batch 1 or batch 2. Importantly, iDES results were only marginally different, suggesting that batch-to-batch variation is not a significant factor affecting the performance of in silico polishing. Errors were determined as described in Calculation of selector-wide error profiles in Methods.
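
A goodness-of-fit check in the spirit of panel a can be sketched with a Weibull probability plot. The code below is an illustration under stated assumptions, not the authors' pipeline: it uses the linearized form ln(-ln(1 - p)) versus ln(x) instead of a fitted quantile-quantile plot, median-rank plotting positions, and a plain Pearson correlation. All names and the toy data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def weibull_plot_correlation(values):
    """Correlation of a Weibull probability plot for positive values.

    For Weibull data, ln(-ln(1 - F(x))) is linear in ln(x), so a
    correlation near 1 indicates a good fit.
    """
    data = sorted(v for v in values if v > 0)
    n = len(data)
    if n < 3:
        raise ValueError("need at least three positive values")
    # Median-rank (Bernard) approximation of the empirical CDF.
    probs = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    x = [math.log(v) for v in data]
    y = [math.log(-math.log(1.0 - p)) for p in probs]
    return pearson(x, y)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy background allele fractions drawn from a Weibull distribution
    # (scale 1e-4, shape 1.5); these values are made up for illustration.
    sample = [rng.weibullvariate(1e-4, 1.5) for _ in range(200)]
    print(round(weibull_plot_correlation(sample), 3))
```

Applied per genomic position, a check of this kind would yield one correlation per position, which could then be summarized as a density plot.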

Supplementary Figure 8 Theoretical detection limits of barcode-mediated error suppression methods for clinically practical quantities of cfDNA.

(a) Global error rates (x-axis) and barcode efficiencies (y-axis) for methods reported in this work (i.e., iDES, barcoding or polishing only, duplex only) compared with barcoding methods from several previous studies (refs. 12, 13, 14, 17). Barcode efficiency reflects the number of reads required to build a barcode consensus sequence and is defined as the number of consensus sequences per read. Of note, these comparisons are not perfect since both error rate and barcode efficiency can be affected by heterogeneous sequencing amounts and sample types used in the different studies. For details on the derivation of these quantities, see Error rates and efficiencies of previous barcoding methods in Methods. In addition, the theoretical error rate of duplex sequencing is approximately equivalent to the error rate of single-stranded barcoding multiplied by itself and divided by 3 (to account for all possible base substitutions; ref. 12). Bona fide low frequency mutations (i.e., biological background) may skew observed duplex error rates, resulting in both their overestimation and variability across studies. Data are presented as means +/- s.d. Note that due to log scaling, the limits of some error bars are shown for one direction, but not the other. (b) Comparison of all methods in a in relation to estimated detection-limit over a range of sequencing depth and cfDNA input (assuming a 90% detection probability). Sequencing was calibrated to iDES, such that the quantity of reads R needed to recover a desired number of hGEs was determined (assuming constant yield and sufficient input molecules). R was then used to calculate the number of recovered hGEs for all other methods. Modeling was performed as described in Statistical methods for ctDNA detection in Methods (and in Figs. 1a, 3d). Note that the maximum attainable detection-limit for each method is bound by its error rate. 
For example, barcoding-only (this study) and background polishing reach their respective maximum detection limits in this analysis. Although background polishing is not a barcoding strategy, it was included in this analysis to compare its detection-limit with other approaches. The “barcode efficiency” of polishing is identical to our barcoding approach since the same sequencing adapters were used for both.
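
The relationship between single-stranded and duplex error rates stated in panel a is easy to make concrete. The sketch below simply encodes that approximation; the function name and the example error rate are hypothetical and are not values reported in this work.

```python
def duplex_error_rate(single_strand_error: float) -> float:
    """Theoretical duplex error rate from a single-stranded barcoding error rate.

    Follows the approximation in the caption: the single-strand error rate
    multiplied by itself, divided by 3 to account for the three possible
    base substitutions.
    """
    if not 0.0 <= single_strand_error <= 1.0:
        raise ValueError("error rate must be a probability")
    return single_strand_error ** 2 / 3.0


if __name__ == "__main__":
    # Hypothetical single-strand error rate of 1 in 100,000 bases.
    print(duplex_error_rate(1e-5))
```

Because the duplex rate scales with the square of the single-strand rate, modest gains in single-strand accuracy translate into much larger gains in the theoretical duplex floor.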

Supplementary Figure 9 Evaluation of duplex-boosting and ultralow frequency allele detection.

(a) Impact of duplex-supported barcodes on iDES noninvasive genotyping performance. Here, we performed background polishing on single-stranded barcode reads (ignoring duplex strand information) and repeated our noninvasive genotyping analysis of the 5% variant blend analyzed in Fig. 4a (right). While specificity, PPV, and NPV remained comparable to duplex-boosted iDES (bottom panel), the mean sensitivity dropped by 10% from 96% to 86% when ignoring duplexes (top panel). Thus, “duplex-boosting” improves the sensitivity of noninvasive tumor genotyping, highlighting the value of a hybrid barcoding approach within iDES. (b) Analysis of noninvasive tumor genotyping using the same variant blend in Fig. 4a,b, but added into normal control cfDNA at a 10-fold lower dilution (0.5%) and sequenced using twice the input mass (72ng DNA). Genotyping results are shown for different error suppression methods (iDES = barcoding + polishing). In all, there were 13 known alleles with externally validated AFs covered by our NSCLC selector (Supplementary Table 4). Due to the ultralow range of expected AFs (0.005% ≤ AF ≤ 0.16%), 16 additional alleles known to be present in the variant blend, but lacking external AF validation, were excluded from this analysis (Supplementary Table 4). Specificity was assessed using nearly 300 hotspot variants not present in the variant blend (Supplementary Table 4). Sensitivity and negative predictive value (NPV) were evaluated using all 13 alleles with external validation, however given the median number of hGEs recovered (=12,630), the detection-limit was determined to be 0.024% AF with 95% confidence (Statistical methods for ctDNA detection in Methods). Therefore, we also calculated detection-limit-adjusted sensitivity and NPV using only those variants with an expected AF >0.024% (n = 10 of 13; denoted with an asterisk). Genotyping was performed as described in Noninvasive tumor genotyping of hotspot alleles and selected regions in Methods. 
Sn, sensitivity; Sp, specificity; PPV, positive predictive value.
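
The 0.024% detection-limit quoted in panel b is consistent with a simple sampling argument. The sketch below assumes a single mutant allele, independent sampling of each recovered hGE, and detection defined as observing at least one mutant molecule. The full model is described in Methods and may differ; the function name is illustrative.

```python
def detection_limit(recovered_hges: int, confidence: float = 0.95) -> float:
    """Smallest allele fraction f for which at least one mutant molecule is
    sampled with the given probability among recovered_hges molecules.

    Solves 1 - (1 - f) ** n = confidence for f.
    """
    if recovered_hges < 1:
        raise ValueError("need at least one recovered hGE")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / recovered_hges)


if __name__ == "__main__":
    # Median number of hGEs recovered, as stated in panel b.
    limit = detection_limit(12630, 0.95)
    print(f"{limit * 100:.3f}%")  # 0.024%
```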

Supplementary Figure 10 Overview and technical assessment of adaptive SNV genotyping.

(a) Plots showing cumulative selector-wide background errors as a function of supporting reads (only non-reference bases with ≤5x support were considered). Lines were fit using semi-log linear regression, in which the y-axis was represented in log space and the x-axis in linear space. Data from representative cfDNA and NSCLC tumor samples are shown. (b) Left: schematic showing key steps performed by the adaptive SNV genotyping approach introduced in this work (e.g., Fig. 4c). Right: Anecdotal example illustrating potential false positive SNVs in the lower tail of the fractional distribution of candidate variant calls. The final step of the adaptive genotyping approach involves the detection and elimination of tail-end variants. (c) Performance of the adaptive genotyping approach for recovering 100 ground truth SNVs from a simulated spike series consisting of 5% and 0.5% ctDNA. Details related to a–c are provided in Selector-wide genotyping in Methods and Additional details related to selector-wide genotyping in Supplementary Note. PPV, positive predictive value.
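
The semi-log linear regression mentioned in panel a amounts to an ordinary least-squares fit of log-transformed counts against the number of supporting reads. A minimal sketch follows, assuming a base-10 logarithm and strictly positive counts; the function name and the toy counts are hypothetical.

```python
import math


def semilog_fit(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * x.

    Returns (slope, intercept); every y must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    logs = [math.log10(y) for y in ys]
    n = len(xs)
    mx, ml = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((x - mx) * (l - ml) for x, l in zip(xs, logs)) / sxx
    return slope, ml - slope * mx


if __name__ == "__main__":
    # Hypothetical cumulative error counts at 1 to 5 supporting reads.
    support = [1, 2, 3, 4, 5]
    errors = [10000, 1000, 100, 10, 1]
    slope, intercept = semilog_fit(support, errors)
    print(round(slope, 6), round(intercept, 6))
```

A fitted line of this form can be extrapolated to find the read support at which the expected number of background errors falls below a chosen threshold.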

Supplementary Figure 11 Specificity of biopsy-free genotyping and correlation of hotspot mutation frequency with NSCLC disease stage.

(a) Analysis of the specificity of noninvasive tumor genotyping using normal control cfDNA samples. To compare specificity among error-suppression methods, we only analyzed normal controls that were not used for building the background-polishing database (n = 18; Methods, Supplementary Table 4). (b) Comparison of error suppression methods for the mean number of variants detected per cfDNA sample in 18 normal controls (same as in a) and all 24 pretreatment NSCLC samples with matching tumor biopsies (Supplementary Table 2). Group comparisons were performed using a two-sided Wilcoxon rank sum test (NS, not significant). Data are expressed as means +/- 95% confidence intervals. (c) Percentage of pretreatment NSCLC cfDNA samples (n = 24), organized by tumor stage, with at least one variant noninvasively detected by iDES-enhanced CAPP-Seq (related to Fig. 4d). (d) AFs for all variants noninvasively detected in cfDNA samples from 24 NSCLC patients (with sequenced tumors) using iDES-enhanced CAPP-Seq (Supplementary Table 4). Samples are ranked from left to right by decreasing mean AF. Error bars denote AF range. Tick marks (x-axis) indicate individual cfDNA samples. A list of 292 candidate hotspot variants was used for the analyses in a-d, excluding those specific to the variant blend analyzed in Fig. 4a,b (Supplementary Table 4). Genotyping was performed as described in Noninvasive tumor genotyping of hotspot alleles and selected regions in Methods.

Supplementary Figure 12 Empirical spike analysis to determine the detection-limit of iDES-enhanced CAPP-Seq and duplex sequencing.

Analysis of the detection-limit of each method in this study. Acoustically shorn DNA from a hyper-mutated glioblastoma (GBM) tumor was added into healthy donor-derived cfDNA in four 32 ng dilutions, ranging from 2.5 in 10⁴ molecules down to 2.5 in 10⁶ molecules. All mixtures were assessed in two technical replicates except for the lowest fraction, which was assessed in four replicates. A custom selector targeting all 1,502 non-silent mutations identified in this tumor was used. (a) Comparison of error-suppression methods (with/without barcoding and with/without background polishing) applied to the spike series. Data with two technical replicates are presented as means with minimum-to-maximum ranges. Data with four technical replicates (lowest spike only) are presented as medians +/- interquartile range. (b) Analysis of one technical replicate from the spike series in a using 20 mutations randomly selected from the pool of 1,502 total mutations. Random sampling was repeated 50 times and the results are presented as means +/- 95% confidence intervals. (c) Same as a, but for duplex molecules only. The detection-limit (dashed line) was determined by pooling duplex sequence data from 12 normal control cfDNA samples and by calculating the mean AF of the 1,502 GBM mutations. The presentation of data and error bars is identical to panel a. (d) Observed versus expected mutation counts for each sample plotted in c. The number of expected mutations was estimated based on the expected fraction of tumor-derived DNA molecules in each sample and the number of duplex hGEs observed (Statistical methods for ctDNA detection in Methods). Additional details related to these analyses are provided in ctDNA detection limits for iDES, duplex sequencing, and other methods in Methods.
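
The resampling in panel b (20 of 1,502 mutations, repeated 50 times) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `subsample_mean_ci`, the normal-approximation 95% interval (1.96 standard errors across resamples), and the placeholder per-mutation values are all hypothetical.

```python
import math
import random
import statistics


def subsample_mean_ci(values, k=20, repeats=50, seed=0):
    """Repeatedly draw k values without replacement and summarize the
    per-draw means as (mean, lower, upper).

    The interval is a normal-approximation 95% CI across the resampled
    means; how the study computed its intervals is not stated here.
    """
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(values, k)) for _ in range(repeats)]
    center = statistics.fmean(means)
    half_width = 1.96 * statistics.stdev(means) / math.sqrt(repeats)
    return center, center - half_width, center + half_width


# Placeholder per-mutation allele fractions (synthetic, not study data).
placeholder_rng = random.Random(1)
per_mutation_af = [placeholder_rng.uniform(0.0, 5e-4) for _ in range(1502)]
center, lower, upper = subsample_mean_ci(per_mutation_af)
print(f"mean={center:.2e}, 95% CI=({lower:.2e}, {upper:.2e})")
```

Fixing the seed keeps the resampling reproducible; a different seed gives a slightly different interval.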

Supplementary Figure 13 Sensitivity and specificity of ctDNA detection in pretreatment samples.

Plots are identical to Fig. 5b, except for the focus on pretreatment cfDNA and the inclusion of detailed sample annotation. In all, 30 sets of patient mutations (columns) were queried in both NSCLC pretreatment cfDNA (top rows; n = 30) and control cfDNA samples (bottom rows; n = 30). The latter were subdivided into training and test cohorts for background polishing (Background polishing in Methods) and are indicated by cyan and gray bars. P value cutoffs were selected to maximize sensitivity and specificity (ctDNA monitoring analysis in Methods). Specificity was determined separately for all cfDNA samples and for healthy controls (CTR). Patient samples (columns) are grouped and colored according to the source of reporters (ctDNA monitoring analysis in Methods). Red squares, true positives; blue squares, false positives; white squares, not detected. Sn, sensitivity; Sp, specificity.

Supplementary Figure 14 Noninvasive ctDNA detection and monitoring with iDES-enhanced CAPP-Seq (related to Fig. 5d).

To assess noninvasive ctDNA quantitation, monitoring using variants called from a tumor biopsy (‘Tumor’, blue line) was compared to monitoring using variants called directly from pretreatment cfDNA (‘cfDNA’, red line) from the same patient (using iDES). All 8 analyzed patients had a sequenced PBL sample and at least 3 longitudinal plasma time points available for correlation assessments. Open circles/squares indicate time points without significantly detectable ctDNA. ND, not detected. Time points are shown in chronological order (1st time point, pretreatment; later time points, post-treatment).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 and Supplementary Note (PDF 10964 kb)

Supplementary Table 1

NSCLC selector designs, final coordinates, and in silico performance (XLSX 128 kb)

Supplementary Table 2

Patient details, sample inventory with QC metric, and figure-sample directory (XLSX 123 kb)

Supplementary Table 3

Somatic mutations detected in primary tumor biopsies or pretreatment plasma (XLSX 95 kb)

Supplementary Table 4

List of variants interrogated in plasma for noninvasive tumor genotyping and corresponding variant calls in NSCLC patients (XLSX 94 kb)

Supplementary Table 5

ctDNA monitoring results, including sensitivity and specificity, and raw data (XLSX 119 kb)

Supplementary Software

iDES software package (ZIP 48641 kb)

About this article

Cite this article

Newman, A., Lovejoy, A., Klass, D. et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat Biotechnol 34, 547–555 (2016). https://doi.org/10.1038/nbt.3520

  • DOI: https://doi.org/10.1038/nbt.3520
