Risk prediction models for selection of lung cancer screening candidates: A retrospective validation study

Kevin ten Haaf; Jihyoun Jeon; Martin C. Tammemägi; Summer S. Han; Chung Yin Kong; Sylvia K. Plevritis; Eric J. Feuer; Harry J. de Koning; Ewout W. Steyerberg; Rafael Meza

doi:10.1371/journal.pmed.1002277

Abstract

Background

Selection of candidates for lung cancer screening based on individual risk has been proposed as an alternative to criteria based on age and cumulative smoking exposure (pack-years). Nine previously established risk models were assessed for their ability to identify those most likely to develop or die from lung cancer. All models considered age and various aspects of smoking exposure (smoking status, smoking duration, cigarettes per day, pack-years smoked, time since smoking cessation) as risk predictors. In addition, some models considered factors such as gender, race, ethnicity, education, body mass index, chronic obstructive pulmonary disease, emphysema, personal history of cancer, personal history of pneumonia, and family history of lung cancer.

Methods and findings

Retrospective analyses were performed on 53,452 National Lung Screening Trial (NLST) participants (1,925 lung cancer cases and 884 lung cancer deaths) and 80,672 Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO) ever-smoking participants (1,463 lung cancer cases and 915 lung cancer deaths). Six-year lung cancer incidence and mortality risk predictions were assessed for (1) calibration (graphically) by comparing the agreement between the predicted and the observed risks, (2) discrimination (area under the receiver operating characteristic curve [AUC]) between individuals with and without lung cancer (death), and (3) clinical usefulness (net benefit in decision curve analysis) by identifying risk thresholds at which applying risk-based eligibility would improve lung cancer screening efficacy. To further assess performance, risk model sensitivities and specificities in the PLCO were compared to those based on the NLST eligibility criteria. Calibration was satisfactory, but discrimination ranged widely (AUCs from 0.61 to 0.81). The models outperformed the NLST eligibility criteria over a substantial range of risk thresholds in decision curve analysis, with a higher sensitivity for all models and a slightly higher specificity for some models. The PLCOm2012, Bach, and Two-Stage Clonal Expansion incidence models had the best overall performance, with AUCs >0.68 in the NLST and >0.77 in the PLCO. These three models had the highest sensitivity and specificity for predicting 6-y lung cancer incidence in the PLCO chest radiography arm, with sensitivities >79.8% and specificities >62.3%. In contrast, the NLST eligibility criteria yielded a sensitivity of 71.4% and a specificity of 62.2%. 
Limitations of this study include the lack of identification of optimal risk thresholds, as this requires additional information on the long-term benefits (e.g., life-years gained and mortality reduction) and harms (e.g., overdiagnosis) of risk-based screening strategies using these models. In addition, information on some predictor variables included in the risk prediction models was not available.

Conclusions

Selection of individuals for lung cancer screening using individual risk is superior to selection criteria based on age and pack-years alone. The benefits, harms, and feasibility of implementing lung cancer screening policies based on risk prediction models should be assessed and compared with those of current recommendations.

Author summary

Why was this study done?

In the United States, lung cancer screening is currently recommended based on age, pack-years smoked, and years since smoking cessation, the criteria used to select participants for the National Lung Screening Trial (NLST).
A number of recent investigations suggest that using lung cancer risk prediction models could lead to more effective screening programs compared to the current recommendations.
External validation and direct comparisons between risk models are often limited due to insufficient numbers of events or methodological limitations.

What did the researchers do and find?

Various performance characteristics of nine risk prediction models for lung cancer incidence or mortality were assessed using data from two randomized controlled trials on lung cancer screening: the NLST and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO).
The calibration performance of the models was satisfactory, but discrimination ranged widely between models. However, all models had a higher sensitivity—and some models had a slightly higher specificity—than the NLST eligibility criteria.

What do these findings mean?

Using risk prediction models to select individuals for lung cancer screening is superior to currently recommended selection criteria.
The benefits, harms, and feasibility of using risk prediction models to select individuals for lung cancer screening should be assessed and compared with current recommendations.

Citation: ten Haaf K, Jeon J, Tammemägi MC, Han SS, Kong CY, Plevritis SK, et al. (2017) Risk prediction models for selection of lung cancer screening candidates: A retrospective validation study. PLoS Med 14(4): e1002277. https://doi.org/10.1371/journal.pmed.1002277

Academic Editor: John D. Minna, University of Texas Southwestern Medical Center at Dallas, UNITED STATES

Received: September 12, 2016; Accepted: February 27, 2017; Published: April 4, 2017

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: The authors confirm that, for approved reasons, some access restrictions apply to the data underlying the findings. Data are available from the U.S. NCI Cancer Data Access Center at https://biometry.nci.nih.gov/cdas for researchers who meet the criteria for access to confidential data.

Funding: This report is based on research conducted by the National Cancer Institute’s (NCI) Cancer Intervention and Surveillance Modeling Network (CISNET) (NIH grants U01-CA152956 & U01-CA199284). EWS is supported by a National Institutes of Health grant (Value of personalized risk information, U01 AA022802). The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, and approval of the manuscript; or the decision to submit the manuscript for publication.

Competing interests: HJdK and KtH are members of the Cancer Intervention and Surveillance Modeling Network (CISNET) Lung working group (grant 1U01CA199284-01 from NIH). HJdK is the principal investigator of the Dutch-Belgian Lung Cancer Screening Trial (Nederlands-Leuvens Longkanker Screenings onderzoek; the NELSON trial). KtH is a researcher affiliated with the NELSON trial. HJdK and KtH received a grant from the University of Zurich to assess the cost-effectiveness of computed tomographic lung cancer screening in Switzerland. HJdK took part in a 1-day advisory meeting on biomarkers organized by M.D. Anderson/Health Sciences during the 16th World Conference on Lung Cancer. HJdK and KtH were involved in a Health Technology Assessment study for CT Lung Cancer Screening in Canada (dr. Paszat, Cancer Care Ontario). MCT is the developer of the PLCOm2012 lung cancer risk prediction model. Use of the model is free of charge to all non-commercial users of the PLCOm2012, whether clinical or personal use. MCT has assigned the commercial intellectual property rights to Brock University. Part of the net proceeds from Brock University’s commercial licensing of use of the PLCOm2012 is to be paid to MCT. To date no such payments have been made to or received by MCT. At Cancer Care Ontario, MCT is Senior Scientist and Scientific Lead for the High Risk Lung Cancer Screening Pilot Studies, which plan to begin recruitment in April 2017. Recruitment will be based on lung cancer risk estimated by the PLCOm2012. Cancer Care Ontario is the agency representing Ontario’s Ministry of Health and Long-Term Care for cancer screening and is a not-for-profit organization. Neither Cancer Care Ontario nor MCT will receive any funds for use of the PLCOm2012 in the pilot studies or subsequently if lung cancer screening of high-risk individuals is expanded across Ontario. 
MCT was an invited speaker at the following meetings in which the PLCOm2012 model was discussed and MCT travel expenses were in part paid for. In none of these presentations did the total expenses paid for exceed the actual total costs, i.e., MCT did not financially benefit from the presentation.
1. MCT, Invited speaker: Selection Criteria for Lung Cancer Screening – Incidence versus Mortality Models. Cancer Intervention and Surveillance Modeling Network (CISNET) meeting, Ann Arbor, Michigan, 6 May 2013.
2. MCT, Invited speaker: Modeling Lung Cancer Risk, In Symposium: Lung Cancer Screening: Moving Forward, and Looking Forward. American Thoracic Society (ATS) Annual Meeting. May 17-22, 2013, Philadelphia, Pennsylvania.
3. MCT, Invited speaker: Lung Cancer Screening, In Symposium: High-Risk Populations and Cancer Screening. American Society for Clinical Oncology (ASCO) Annual Meeting. June 1-4, 2013, Chicago, Illinois.
4. MCT, Invited speaker: Risk Stratification for Lung Cancer Screening Studies. In session: Low-dose computed lung screening. At the 15th International Association for the Study of Lung Cancer (IASLC) World Conference on Lung Cancer. October 28, 2013, Sydney, Australia.
5. MCT, Invited speaker: Screening for lung cancer. At the Canadian Lung Cancer Conference. 7 February 2014, Vancouver, BC.
6. MCT, Invited speaker: Lung Cancer Screening – Issues & Updates. 9th Ontario Thoracic Cancer Conference. Niagara-on-the-Lake. 26 April 2014.
7. MCT, Invited speaker: Risk models for selection of individuals for lung cancer screening. Pan-Canadian Lung Cancer Screening Meeting Montreal, QC. 29 May 2014.
8. MCT, Invited speaker: Lung cancer risk models for targeting screening. Cancer Intervention and Surveillance Modeling Network (CISNET) meeting. Minneapolis, MN. June 2, 2014.
9. MCT, Invited speaker: Selection of Individuals for Lung Cancer Screening. McMaster University – Juravinski Cancer Center – Regional Oncology Rounds. Hamilton, ON. 13 November 2014.
10. MCT, Invited speaker: Lung cancer risk and screening. Cancer Intervention and Surveillance Modeling Network (CISNET) annual meeting. Bethesda, MD. December 10, 2014.
11. MCT, Invited speaker: National Academy of Sciences, Institute of Medicine, National Cancer Policy Forum, Workshop – Implementation of Lung Cancer Screening. Identifying High-Risk Populations for Screening: Risk Modeling – Current Ideas, New Developments, and Future Potentials. June 20-21, 2016, Washington, D.C.

Abbreviations: AUC, area under the receiver operating characteristic curve; CT, computed tomography; CXR, chest radiography; LLP, Liverpool Lung Project; NLST, National Lung Screening Trial; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial; TSCE, Two-Stage Clonal Expansion; USPSTF, United States Preventive Services Task Force

Introduction

The National Lung Screening Trial (NLST) found that screening with low-dose computed tomography (CT) can reduce lung cancer mortality by 20% [1]. Based on an evidence review, including the results of the NLST and a comparative microsimulation modeling study, the United States Preventive Services Task Force (USPSTF) recommended lung cancer screening for current and former smokers aged 55 through 80 y who smoked at least 30 pack-years and, if quit, quit less than 15 y ago [2–4]. To our knowledge, only the United States has implemented lung cancer screening policies. Although the province of Ontario, Canada, recommends screening individuals at high risk for lung cancer through an organized program, no program has yet been established [5]. Cancer Care Ontario (the provincial cancer agency of Ontario) is currently evaluating the feasibility of implementing such a program [6]. European countries have not yet made any recommendations on lung cancer screening, as the final results of the Dutch-Belgian Lung Cancer Screening Trial (Nederlands-Leuvens Longkanker Screenings Onderzoek [NELSON] trial), potentially pooled with high-quality data from other trials, are still awaited [7–9].

The screening eligibility criteria used in the current USPSTF recommendations are based on age and pack-years, a measure of cumulative smoking exposure. Thus, these recommendations do not take other important risk factors into account, such as family history, nor other relevant aspects of smoking, such as smoking duration or intensity. Recently, a number of investigations have suggested that determining screening eligibility using an individual’s risk based on age, more detailed smoking history, and other risk factors such as ethnicity and family history of lung cancer could lead to more effective screening programs compared with the USPSTF recommendations [10–13]. Indeed, some lung cancer screening guidelines already encourage assessment of an individual’s risk to determine screening eligibility [14].

While various lung cancer risk prediction models have been developed, external validation and direct comparisons between models have been limited due to insufficient numbers of events or methodological limitations [15–21]. Such validations are essential, as risk prediction models generally have optimistic performance within their development dataset [15–17]. This study aims to externally validate and directly compare the performance of nine currently available lung cancer risk prediction models for stratifying lung cancer risk groups and determining screening eligibility.

Methods

Ethics statement

No identifiable information was used; therefore, no institutional review board (IRB) approval was needed. Nonetheless, a determination of exempt status was given by the University of Michigan IRB (HUM00054750), and a determination that this study was not human subjects research was given by the Fred Hutchinson Cancer Research Center (former affiliation of J. J.) IRB (6007–680).

Study population

We used data from two large randomized controlled screening trials: the NLST and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO) [1,22–24]. All participants in the CT arm (n = 26,722) and chest radiography (CXR) arm (n = 26,730) of the NLST and ever-smoking participants in the CXR arm (n = 40,600) and control arm (n = 40,072) of the PLCO were included in the analysis. Never-smokers in the PLCO were not considered, as (1) not all lung cancer risk prediction models can be applied to never-smokers and (2) never-smokers are unlikely to reach levels of risk that allow them to benefit from screening [13,25].

Data on the predictor variables in each trial were collected through epidemiologic questionnaires administered at study entry and harmonized across both trials. Reported average numbers of cigarettes smoked per day above 100 were considered implausible and recoded as 100 cigarettes per day (n = 11). Furthermore, body mass index values less than 14 and over 60 kg/m² were considered implausible for enrollment in both trials and recoded as 14 (n = 5) and 60 kg/m² (n = 18), respectively. Lung cancer diagnoses (1,925 in the NLST and 1,463 in the PLCO) and lung cancer deaths (884 in the NLST and 915 in the PLCO) that occurred between study entry and 6 y of follow-up were included in the final dataset and were considered as binary outcomes.
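The recoding of implausible values described above amounts to limiting each variable to a plausible range. The following is a minimal Python sketch of that step, not the authors' code (the analyses were performed in R). The field names `cigs_per_day` and `bmi` are hypothetical and are not taken from the trial datasets; only the bounds come from the text.

```python
def clamp(value, low, high):
    """Return value limited to the closed interval [low, high]."""
    return max(low, min(high, value))


def recode_implausible(record):
    """Return a copy of a participant record with implausible values recoded.

    Field names are hypothetical. Bounds follow the text:
    cigarettes per day above 100 -> 100; BMI below 14 -> 14; BMI above 60 -> 60.
    """
    cleaned = dict(record)
    cleaned["cigs_per_day"] = min(record["cigs_per_day"], 100)
    cleaned["bmi"] = clamp(record["bmi"], 14, 60)
    return cleaned


if __name__ == "__main__":
    print(recode_implausible({"cigs_per_day": 120, "bmi": 12.5}))
```

Values inside the plausible range pass through unchanged, so the function can be applied to every record.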

Lung cancer risk prediction models

Our study includes nine risk prediction models for lung cancer incidence or death that have been used frequently in the literature. Risk prediction models were excluded from this investigation if they (1) were developed for specific ethnicities and are therefore not broadly applicable [26–28], (2) used information on biomarkers or lung nodules and are therefore not readily applicable for the prescreening selection of individuals [29–33], (3) were developed for identifying symptomatic patients [34,35], (4) did not incorporate smoking behavior [36], (5) did not provide information on parameter estimates (e.g., baseline risk parameters) necessary to allow replication of the model [11,12], or (6) had poor discriminative ability in their development dataset [37].

Nine models remained and were investigated: the Bach model, the Liverpool Lung Project (LLP) model, the PLCOm2012 model, the Two-Stage Clonal Expansion (TSCE) model for lung cancer incidence, the Knoke model, two versions of the TSCE model for lung cancer death [10,38–44], and simplified versions of the PLCOm2012 and LLP models. The characteristics of these models are shown in Table 1. The TSCE and Knoke models consider only age, gender, and smoking-related characteristics as risk factors [40–43]. The Bach model considers asbestos exposure as an additional risk factor, while the LLP and PLCOm2012 models consider multiple additional risk factors [10,38,39]. The simplified versions of the PLCOm2012 and LLP models considered only age, gender, and smoking variables. A detailed description of each model can be found in S1 Appendix.

Table 1. Characteristics of investigated risk models.

https://doi.org/10.1371/journal.pmed.1002277.t001

Data on frequency and intensity of asbestos exposure, used in the LLP and Bach models, were not available for the PLCO participants and could not be accurately derived for the NLST participants [38,39]. Therefore, we assumed that none of the participants were exposed to asbestos, even though this assumption may lead to biased estimates [45]. However, as the potential number of individuals with asbestos exposure was low (less than 5% of the NLST participants reported ever working with asbestos), this bias is expected to be minor [46].

The LLP model incorporates age at lung cancer diagnosis of a first-degree relative: early age (60 y or younger) versus late age (older than 60 y) [38]. However, while both the PLCO and the NLST had information about the occurrence of family history of lung cancer (yes/no), neither had information on the age of diagnosis for the affected relative(s). Since the median age of lung cancer diagnosis in the United States is 70 y and the majority of lung cancers occur after the age of 65 y (68.6%), we assumed that lung cancer in first-degree relatives in the PLCO and the NLST always occurred after the age of 60 y [47,48].

In addition, the LLP model incorporates a history of pneumonia as a risk factor [38]. While information on this risk factor was available in the NLST, it was not available in the PLCO. Therefore, we assumed that none of the PLCO participants had a history of pneumonia for the complete case analyses. While 22.1% of NLST participants had a history of pneumonia (Table 2), the association of a history of pneumonia with a lung cancer diagnosis within 6 y was not clear (p = 0.3378 in the CT arm and p = 0.0035 in the CXR arm). Missing history of pneumonia for PLCO participants was imputed by using information from the NLST participants [49].

Table 2. Baseline characteristics of National Lung Screening Trial and Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial participants according to 6-y lung cancer incidence.

https://doi.org/10.1371/journal.pmed.1002277.t002

Statistical analyses

To assess the performance of the risk prediction models, several metrics were employed: calibration, discrimination, and clinical usefulness (net benefit over a range of risk thresholds) [50]. The performance of all investigated risk prediction models was assessed in each arm of each trial separately, for both lung cancer incidence and lung cancer mortality, as these outcomes may be influenced differently by screening. Screening may affect the predictive performance for lung cancer incidence through the advance in time of detection (lead time) and through the detection of cancers that would never have been detected if screening had not occurred (overdiagnosis) [51–53]. Furthermore, CT screening reduces lung cancer mortality compared to CXR screening, which may influence the predictive performance of models for lung cancer mortality in the CT arm of the NLST [1]. In addition, the sensitivity and specificity of each model in the PLCO cohorts were compared to the sensitivity and specificity of the NLST/USPSTF smoking eligibility criteria (being a current or former smoker who smoked at least 30 pack-years and, if quit, quit less than 15 y ago). Model performance was assessed by varying follow-up duration and outcome (5- and 6-y lung cancer incidence or mortality) to investigate the effect of follow-up duration on the discrimination performance of each model [54]. The 5- and 6-y time frames were chosen because the LLP and PLCOm2012 models were calibrated to these respective time frames, and complete follow-up of NLST participants was limited to 6 y [10,38]. Since performance was similar for 5- and 6-y outcomes, only the results of the 6-y outcomes are presented. Performance was evaluated for the risk prediction models as presented in their original publication, without any recalibration or reparameterization to the NLST and the PLCO. The only exception is the PLCOm2012 model, which was originally developed based on data from the control arm of the PLCO [10]. All analyses were performed in R (version 3.3.0) [55].

Aspects of calibration performance

Calibration plots were constructed for the observed proportions of outcome events against the predicted risks for individuals grouped by similar ranges of predicted risk [56]. Perfect predictions should show an ideal 45-degree line that can be described by an intercept of 0 and a slope of 1 in the calibration plot [57]. The calibration intercept quantifies the extent to which a model systematically under- or overestimates a person’s risk; an intercept value of 0 represents perfect calibration in the large. The calibration slope was estimated by logistic regression analysis, using the log odds of the predictions as the single predictor of the binary outcome [50]. For a (near-)perfect calibration in the large, a calibration slope less than 1 reflects that predictions for individuals with low risk are too low and predictions for individuals with high risk are too high [50]. The calibration plots, calibration in the large, and calibration slopes for each model were obtained using the R package rms [58].

Discrimination

Discrimination reflects the capability of a model to distinguish individuals with the event from those without the event; the risk predicted by the model should be higher for individuals with the event compared with those without the event [59]. Discrimination was assessed using the area under the receiver operating characteristic curve (AUC), which ranges between 0.5 and 1.0 for sensible models. The AUCs for each model were obtained using the R package rms [58].
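The AUC can be read as the probability that a randomly chosen individual with the event has a higher predicted risk than a randomly chosen individual without the event, with ties counted as one half. The sketch below computes it directly from that definition. It is illustrative only, with a hypothetical function name and toy data, and it is not the rms routine used in the study.

```python
def auc(pred, y):
    """Concordance probability between predicted risks and binary outcomes."""
    cases = [p for p, yi in zip(pred, y) if yi == 1]
    controls = [p for p, yi in zip(pred, y) if yi == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                concordant += 1.0
            elif c == n:
                concordant += 0.5   # ties count as half
    return concordant / (len(cases) * len(controls))


if __name__ == "__main__":
    print(auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```

The pairwise loop is quadratic in the number of participants. For cohorts the size of the NLST or PLCO, a rank-based calculation would be preferable.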

Clinical usefulness

While discrimination and calibration are important statistical properties of a risk prediction model, they do not assess its clinical usefulness [50,54,59]. For example, if a false-negative result causes greater harm than a false-positive result, one would prefer a model with a higher sensitivity over a model that has a greater specificity but a slightly lower sensitivity, even though the latter might have a higher AUC [60].

In the context of selecting individuals for lung cancer screening, a model is clinically useful if applying that model to determine screening eligibility yields a better ratio of benefits to harms than not applying it. Decision curve analysis has been proposed to assess the net benefit of using a risk prediction model [60,61]. Decision curve analysis evaluates the net benefit of a model over a range of risk thresholds, i.e., the level of risk used to classify predictions as positive or negative for the predicted outcome. For example, for the PLCOm2012 model, a risk threshold of 1.51% has been suggested, meaning that individuals with an estimated risk of 1.51% or higher are classified as positive (and thus eligible for screening) and individuals with an estimated risk lower than 1.51% as negative (and thus ineligible for screening) [13].

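The net benefit is defined as:

$$\text{Net benefit} = \frac{\text{True-positive classifications} - \text{Weighting factor} \times \text{False-positive classifications}}{\text{Total number of individuals}}$$

where the weighting factor is defined as:

$$\text{Weighting factor} = \frac{\text{Risk threshold}}{1 - \text{Risk threshold}}$$

This weighting factor represents how the relative harms of false-positive (classifying a person as eligible for screening who does not develop, or die from, lung cancer) and false-negative (classifying a person as ineligible for screening who develops, or dies from, lung cancer) results are valued at a given risk threshold, i.e., the ratio of harm to benefit, and is estimated by the threshold odds. For example, a risk threshold of 2.5% yields the following weighting factor:

$$\text{Weighting factor} = \frac{0.025}{1 - 0.025} = \frac{0.025}{0.975} = \frac{1}{39}$$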

This weighting factor implies that missing one case of lung cancer that could be detected through screening is valued as 39 times worse than unnecessarily screening one person, or that one case should be detected per 40 screened persons. Consequently, the less relative weight one gives to detecting a lung cancer case, the higher the risk threshold one will favor.

The net benefit can then be interpreted as follows: if the net benefit at a risk threshold of 2.5% is 0.002 greater than that of screening all persons eligible according to the NLST criteria, taking the weighting factor into account, this is equivalent to a net improvement in true-positive results of 0.002 × 1,000 = 2 per 1,000 persons assessed for screening eligibility, or a net reduction in false-positive results of 0.002 × 1,000/(0.025/0.975) = 78 per 1,000 persons assessed for screening eligibility [60]. Thus, if the risk model has a positive net benefit at the preferred risk threshold, this indicates that applying the model at this risk threshold provides a better ratio of benefits to harms than current screening guidelines based on pack-years. Decision curves visualize the net benefit over a range of risk thresholds, allowing one to discern whether and at which risk thresholds applying the risk model can be clinically useful [61]. Decision curves were used to determine at which range of risk thresholds applying the models provides a net benefit over using the NLST eligibility criteria for selecting individuals for lung cancer screening.
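The arithmetic in this section can be checked with a short sketch. It assumes the usual decision curve definition of net benefit: true positives minus false positives weighted by the threshold odds, divided by the number of individuals assessed. The function names and the toy eligibility data are hypothetical; this is not the code used in the study.

```python
def weighting_factor(threshold):
    """Threshold odds: the relative weight of a false-positive classification."""
    return threshold / (1.0 - threshold)


def net_benefit(selected, y, threshold):
    """Net benefit of a selection rule, valued at the given risk threshold.

    selected: booleans, True if the person is classified as eligible.
    y: binary outcomes (1 = lung cancer incidence or death, 0 = none).
    """
    tp = sum(1 for s, yi in zip(selected, y) if s and yi == 1)
    fp = sum(1 for s, yi in zip(selected, y) if s and yi == 0)
    return (tp - weighting_factor(threshold) * fp) / len(y)


def select_by_risk(pred, threshold):
    """Classify as eligible when the predicted risk reaches the threshold."""
    return [p >= threshold for p in pred]


def extra_true_positives_per_1000(delta_nb):
    """Net benefit difference expressed as additional true positives per 1,000."""
    return delta_nb * 1000


def fewer_false_positives_per_1000(delta_nb, threshold):
    """Net benefit difference expressed as avoided false positives per 1,000."""
    return delta_nb * 1000 / weighting_factor(threshold)


if __name__ == "__main__":
    print(weighting_factor(0.025))                       # threshold odds at 2.5%
    print(extra_true_positives_per_1000(0.002))          # worked example from the text
    print(fewer_false_positives_per_1000(0.002, 0.025))  # worked example from the text
```

Evaluating `net_benefit` for a model-based rule and for the NLST criteria at the same threshold, then taking the difference, gives the quantity that a decision curve plots against the risk threshold.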

Finally, we identified the risk threshold for each model in the PLCO cohorts that selected a similar number of individuals for screening as the NLST eligibility criteria, on which most lung cancer screening recommendations are currently based. We then assessed the sensitivity (the number of individuals with lung cancer incidence or death classified as eligible for screening divided by the total number of individuals with lung cancer incidence or death) and specificity (the number of individuals without lung cancer incidence or death classified as ineligible for screening divided by the total number of individuals without lung cancer incidence or death) for each model compared to the NLST criteria at the chosen risk threshold, as reported before by Tammemägi et al. [13].
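The comparison described above can be sketched as follows: choose the risk threshold at which a model selects as many individuals as the NLST criteria, then compute sensitivity and specificity for both selection rules. The function names and toy data are hypothetical, and tied risks at the threshold may select slightly more individuals than requested.

```python
def threshold_matching_count(pred, n_selected):
    """Risk threshold such that the n_selected highest risks satisfy risk >= threshold."""
    if not 1 <= n_selected <= len(pred):
        raise ValueError("n_selected must be between 1 and the cohort size")
    return sorted(pred, reverse=True)[n_selected - 1]


def sensitivity_specificity(selected, y):
    """Return (sensitivity, specificity) of a selection rule against binary outcomes."""
    tp = sum(1 for s, yi in zip(selected, y) if s and yi == 1)
    fn = sum(1 for s, yi in zip(selected, y) if not s and yi == 1)
    tn = sum(1 for s, yi in zip(selected, y) if not s and yi == 0)
    fp = sum(1 for s, yi in zip(selected, y) if s and yi == 0)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy cohort: predicted risks, outcomes, and eligibility under fixed criteria.
    pred = [0.05, 0.01, 0.03, 0.002, 0.04, 0.015]
    y = [1, 0, 1, 0, 0, 0]
    criteria_eligible = [True, True, False, False, True, False]

    threshold = threshold_matching_count(pred, sum(criteria_eligible))
    model_eligible = [p >= threshold for p in pred]
    print(threshold)
    print(sensitivity_specificity(criteria_eligible, y))
    print(sensitivity_specificity(model_eligible, y))
```

Because both rules select the same number of individuals, any difference in sensitivity and specificity reflects who is selected rather than how many.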

Multiple imputation of missing values

Multiple imputation of missing data for all considered risk factors was performed through the method of chained equations using the R package MICE [62]. History of pneumonia was not measured in the PLCO but was measured in the NLST; therefore, data from the NLST were used to impute history of pneumonia for PLCO participants [49]. Analyses were performed using 20 imputations, and the results were pooled through applying Rubin’s rules [63]. The results of the analyses with imputation of missing variables were similar to those obtained from complete case analyses. The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines suggest applying multiple imputation when missing data are present, as complete case analyses can lead to inefficient estimates [64,65]. Therefore, all analyses reported here were performed with multiple imputation of missing values.
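Pooling with Rubin’s rules combines the estimates from each imputed dataset into a single estimate whose variance reflects both within-imputation and between-imputation uncertainty. The sketch below shows the standard formulas for a scalar estimate. The function name and input numbers are hypothetical; the study itself used the R package MICE with 20 imputations.

```python
def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances using Rubin's rules.

    Returns (pooled estimate, total variance), where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Hypothetical estimates and variances from three imputed datasets.
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term grows when the imputed datasets disagree, which is how the extra uncertainty due to missing data enters the pooled variance.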

Results

Characteristics of study populations

An overview of the characteristics of the four study cohorts (two trial arms in each trial) is given in Table 2, stratified by 6-y lung cancer incidence. A similar table stratifying participants by 6-y lung cancer mortality is provided in S2 Appendix. An overview of the proportion of individuals with complete information on all risk factors, stratified by trial arm and 6-y outcome, is given in S3 Appendix. Overall, approximately 93% of the study population had complete information for all considered risk factors.

Differences in levels of absolute risk

The risk prediction models included in this study were developed in different populations (Table 1) and incorporate risk factors, specifically smoking behavior, in different ways (S1 Appendix). In addition, some models predict lung cancer incidence, while others predict lung cancer mortality. Therefore, the estimated absolute risk for the same individual varies between models [66]. Fig 1 shows the estimated 6-y risk of lung cancer incidence or mortality (depending on the target outcome of the model) across the models for five individuals with different risk factor profiles. This difference in estimated absolute risk between models suggests that specific risk thresholds might be needed for each model.

Fig 1. Examples of projected absolute risk for individuals with different risk factor profiles by model.

Person 1: 70-y-old, high-school-graduated white male, current smoker, who smoked 30 cigarettes per day for 55 y, has a BMI of 28 kg/m², has COPD, no asbestos exposure, no personal history of cancer, no personal history of pneumonia, but has a family history of lung cancer (relative was diagnosed at age > 60 y). Person 2: 63-y-old, college-graduated black woman, former smoker who quit 10 y ago, who smoked 15 cigarettes per day for 40 y, has a BMI of 25 kg/m², does not have COPD, no asbestos exposure, no personal history of cancer, has a personal history of pneumonia, and no family history of lung cancer. Person 3: 65-y-old Asian male with some college education, former smoker who quit 14 y ago, who smoked 10 cigarettes per day for 30 y, has a BMI of 24 kg/m², does not have COPD, has asbestos exposure, no personal history of cancer, no personal history of pneumonia, and no family history of lung cancer. Person 4: 58-y-old, post-graduate-educated Hispanic woman, current smoker, who smoked 5 cigarettes per day for 38 y, has a BMI of 22 kg/m², does not have COPD, no asbestos exposure, has a personal history of cancer, no personal history of pneumonia, and no family history of lung cancer. Person 5: 50-y-old, college-educated white woman, current smoker, who smoked 5 cigarettes per day for 30 y, has a BMI of 22 kg/m², does not have COPD, no asbestos exposure, no personal history of cancer, no personal history of pneumonia, and no family history of lung cancer. BMI, body mass index; COPD, chronic obstructive pulmonary disease; CPS, Cancer Prevention Study; HPFS, Health Professionals Follow-up Study; LLP, Liverpool Lung Project; NHS, Nurses’ Health Study; TSCE, Two-Stage Clonal Expansion.

https://doi.org/10.1371/journal.pmed.1002277.g001