

Predicting risk of COPD in primary care: development and validation of a clinical risk score
  1. Shamil Haroon,
  2. Peymane Adab,
  3. Richard D Riley,
  4. Tom Marshall,
  5. Robert Lancashire and
  6. Rachel E Jordan
  1. Department of Public Health, Epidemiology & Biostatistics, School of Health and Population Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  1. Correspondence to Professor Peymane Adab; p.adab{at}


Objectives To develop and validate a clinical risk score to identify patients at risk of chronic obstructive pulmonary disease (COPD) using clinical factors routinely recorded in primary care.

Design Case–control study in which each incident COPD case was matched to two controls on age, sex and general practice. Candidate risk factors were included in a conditional logistic regression model to produce a clinical score. Accuracy of the score was estimated on a separate external validation sample derived from 20 purposively selected practices.

Setting UK general practices enrolled in the Clinical Practice Research Datalink (1 January 2000 to 31 March 2006).

Participants Development sample included 340 practices containing 15 159 newly diagnosed COPD cases and 28 296 controls (mean age 70 years, 52% male). Validation sample included 2259 cases and 4196 controls (mean age 70 years, 50% male).

Main outcome measures Area under the receiver operator characteristic curve (c statistic), sensitivity and specificity in the validation practices.

Results The model included four variables: smoking status, history of asthma, and lower respiratory tract infections and salbutamol prescriptions in the previous 3 years. It had a high average c statistic of 0.85 (95% CI 0.83 to 0.86) and yielded a sensitivity of 63.2% (95% CI 63.1 to 63.3) and specificity of 87.4% (95% CI 87.3 to 87.5).

Conclusions Risk factors associated with COPD and routinely recorded in primary care have been used to develop and externally validate a new COPD risk score. This could be used to target patients for case finding.

Key messages

  • Opportunities to diagnose chronic obstructive pulmonary disease (COPD) in primary care are frequently missed.

  • Data routinely recorded in primary care can be used to identify patients with undiagnosed COPD.

  • We report the development and external validation of a clinical risk score for COPD in primary care, providing important information for the future development of risk prediction models for COPD that may be used to stratify patients for case finding.


Chronic obstructive pulmonary disease (COPD) is the third leading cause of mortality.1 However, population studies suggest that 50–80% of the disease burden remains undiagnosed.2,3 A recent analysis of UK primary healthcare records showed that opportunities to diagnose COPD are frequently missed, with up to 85% of patients presenting with indicative symptoms and clinical events in the 5 years before their diagnosis.4 There is now a drive to identify such patients in order to instigate early management and reduce disease progression.5,6 A variety of screening tools have been proposed and evaluated, including symptom-based questionnaires,7 and use of handheld8 and diagnostic spirometry.9 However, mass screening is likely to be costly and a more targeted approach is required to improve its efficiency.

Several clinical prediction models have been developed to identify individuals at risk of undiagnosed COPD. These include two developed in the USA using administrative claims data,10,11 one in Denmark using primary and secondary care data,12 and, most recently, one in Scotland using routine primary healthcare data.13 The first three models are unlikely to be implementable in a UK or similar primary care setting because of differences in healthcare structures and because many of the included predictor variables are not routinely recorded. The Scottish model, while likely to be implementable, considered only a very limited number of potential risk factors and was not externally validated.13

We report the development and external validation of a clinical prediction model that provides a score for identifying patients at high risk of undiagnosed COPD in primary care.


Study design

Electronic primary care records were available from a matched case–control data set obtained from the General Practice Research Database (GPRD; now the Clinical Practice Research Datalink). Cases with incident COPD were matched by age, sex and general practice with controls without COPD (1:2).

Risk score derivation and external validation

Description of dataset

The GPRD is a computerised database of longitudinal anonymised patient records from a representative sample of 480 general practices across the UK, covering approximately 6% of the population.14

Selection of cases and controls

Cases consisted of all patients aged ≥35 years on 1 April 2006 with a new diagnosis of COPD recorded between 1 January 2000 and 1 April 2006 (see online supplementary table S1 for clinical codes). Cases had at least 3 years of up-to-standard data (ie, data entry meeting set quality standards) prior to the date of COPD diagnosis (index date). Controls had no diagnosis of COPD, were registered on the index date and also had at least 3 years of up-to-standard data.

Identification of candidate risk factors

Risk factors associated with newly diagnosed COPD were identified from published epidemiological studies. Studies were identified from Medline, Embase and Google Scholar using ‘COPD’ (and relevant synonyms) and ‘risk factor’ as Medical Subject Headings and free text. From 544 articles we identified 46 candidate risk factors that were likely to be routinely recorded in primary care (see online supplementary table S2). The final list included smoking history, comorbidities (including asthma, ischaemic heart disease and depression), lower respiratory tract infections (LRTIs) and upper respiratory tract infections, respiratory symptoms (including cough, dyspnoea, wheeze and sputum production), systemic symptoms (including unintentional weight loss, chronic fatigue, and poor sleep), body mass index and health service use including medication prescriptions (salbutamol, oral prednisolone and antibiotics for a LRTI) and number of previous primary care consultations.

Data extraction

Clinical codes for each variable were identified using the CPRD medical and product dictionaries and the NHS Clinical Terminology Browser V.1.04.15 Data on demographic characteristics, smoking, comorbidities, respiratory symptoms and health service use were extracted over the specified period (see online supplementary table S2). Data recorded within 60 days prior to the index date were excluded since a clinical suspicion of COPD could have influenced clinical activity during this period.10 Smoking status closest to the COPD diagnosis date (or matched time point) was used to reflect likely clinical practice.
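The exclusion of data recorded shortly before the index date can be illustrated with a minimal sketch. The 60-day exclusion comes from the text; the function name, the example dates and the treatment of the 3-year look-back as 3 × 365 days are assumptions made for illustration only and do not describe the authors' extraction code.

```python
from datetime import date, timedelta

# The 60-day exclusion is stated in the text. Treating "3 years" as
# 3 * 365 days is a simplification assumed for this sketch.
EXCLUSION_DAYS = 60
LOOKBACK_DAYS = 3 * 365


def events_in_window(event_dates, index_date):
    """Keep events from LOOKBACK_DAYS before the index date up to, but not
    including, the final EXCLUSION_DAYS before the index date."""
    window_start = index_date - timedelta(days=LOOKBACK_DAYS)
    window_end = index_date - timedelta(days=EXCLUSION_DAYS)
    return [d for d in event_dates if window_start <= d < window_end]


# Hypothetical LRTI consultation dates for one patient.
index_date = date(2005, 6, 1)
lrti_dates = [
    date(2001, 1, 10),  # before the look-back window
    date(2003, 11, 2),  # inside the window
    date(2005, 3, 1),   # inside the window
    date(2005, 5, 20),  # within 60 days of the index date
]
kept = events_in_window(lrti_dates, index_date)
print(len(kept))
```

For controls, the same filter would be applied using the matched case's index date.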

Sample size and creation of derivation and validation data sets

The data set was split into a development and external validation sample (while preserving matching of cases and controls) by purposively selecting 20 general practices that reflected the full range of practice and population characteristics where the risk score would be applicable. These practices each had at least 200 individuals to ensure validation statistics were estimated with high precision.

Model development

Both the unadjusted and adjusted association between each factor and COPD were estimated using conditional logistic regression (to account for matching of cases and controls). Risk factors were included in the model based on statistical significance (adjusted OR≥1.5 and p value <0.05) and clinical understanding, with the aim to achieve a parsimonious and clinically acceptable model. The final model was simplified by including only four risk factors that had the highest adjusted ORs and were most likely to be recorded in a range of primary care settings.
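To make the role of matching concrete, the following is a minimal sketch of the conditional log-likelihood for matched sets that each contain exactly one case. Each set contributes the log of exp(β·x_case) divided by the sum of exp(β·x) over all members of the set, so anything constant within a set (such as the matching factors) cancels. The function name, covariates and data are hypothetical; in practice the coefficients would be estimated by maximising this quantity with a statistical package.

```python
import math


def conditional_log_likelihood(beta, matched_sets):
    """Conditional log-likelihood for matched sets with one case each.

    beta: list of coefficients (log ORs).
    matched_sets: list of (case_x, [control_x, ...]); each x is a list of
    covariate values in the same order as beta.
    """
    def linear_predictor(x):
        return sum(b * v for b, v in zip(beta, x))

    total = 0.0
    for case_x, controls_x in matched_sets:
        case_lp = linear_predictor(case_x)
        all_lps = [case_lp] + [linear_predictor(x) for x in controls_x]
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(all_lps)
        log_denominator = m + math.log(sum(math.exp(lp - m) for lp in all_lps))
        total += case_lp - log_denominator
    return total


# Hypothetical 1:2 matched sets with two binary covariates
# (for example ever-smoker and prior asthma); values are illustrative only.
sets = [
    ([1, 1], [[0, 0], [1, 0]]),
    ([1, 0], [[0, 0], [0, 1]]),
]
ll_null = conditional_log_likelihood([0.0, 0.0], sets)
ll_alt = conditional_log_likelihood([1.0, 0.5], sets)
print(ll_null, ll_alt)
```

With all coefficients at zero, each 1:2 set contributes log(1/3), which gives a convenient check on the implementation.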

Missing smoking status was accounted for by including a missing value category in the regression model. Patients in primary care often have unknown smoking status and the model in practice may be applied to such patients. Missing data for other factors was assumed indicative of their true absence. Risk scores were computed for each individual by combining the estimated regression log ORs (β coefficients) from the final model with the individual’s risk factor values.
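The score calculation described above amounts to a weighted sum of an individual's risk factor values. The sketch below shows the idea only: the weights are placeholders, not the published coefficients from table 3, and treating LRTIs as a linear count and asthma and salbutamol as yes/no indicators is an assumption made for illustration.

```python
# Placeholder weights (log ORs). These are NOT the published coefficients;
# they are hypothetical values chosen only to illustrate the calculation.
HYPOTHETICAL_BETAS = {
    "smoking": {"never": 0.0, "ex": 1.0, "current": 2.0, "missing": 0.5},
    "asthma": 1.0,
    "salbutamol": 1.0,
    "lrti_per_episode": 0.5,
}


def risk_score(smoking, asthma, salbutamol, n_lrti, betas=HYPOTHETICAL_BETAS):
    """Sum the weight for each risk factor present in the record.

    smoking: one of "never", "ex", "current", "missing" (missing smoking
    status has its own category, as in the text).
    asthma, salbutamol: booleans; absence of a record is treated as False.
    n_lrti: number of LRTIs in the look-back period (assumed linear effect).
    """
    score = betas["smoking"][smoking]
    score += betas["asthma"] * int(asthma)
    score += betas["salbutamol"] * int(salbutamol)
    score += betas["lrti_per_episode"] * n_lrti
    return score


example = risk_score("current", asthma=True, salbutamol=False, n_lrti=2)
unknown_smoker = risk_score("missing", asthma=False, salbutamol=False, n_lrti=0)
print(example, unknown_smoker)
```

Keeping the missing smoking category as an explicit key means a patient with unrecorded smoking status still receives a score rather than being dropped.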

External validation and model performance

The accuracy of the risk score was evaluated in each of the 20 validation practices by producing the corresponding receiver operator characteristic (ROC) curve and estimating the area under it (c statistic). To summarise the average performance across the 20 practices, the c statistic estimates were synthesised in a random effects meta-analysis. We also summarised the heterogeneity in performance by estimating a 95% interval for the range of potential c statistics.16
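As a reminder of what the c statistic measures, it can be computed as the proportion of case–control pairs in which the case has the higher score, counting ties as half. The sketch below is a direct pairwise implementation on hypothetical scores; it pools all pairs within a practice and does not use the matched-set structure, which is a simplification.

```python
def c_statistic(case_scores, control_scores):
    """Proportion of case-control pairs in which the case scores higher,
    with ties counted as half (equal to the area under the ROC curve)."""
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


# Hypothetical risk scores for one practice.
cases = [4.0, 2.5, 3.0, 1.0]
controls = [0.0, 1.0, 2.5, 0.5, 0.0, 1.5]
auc = c_statistic(cases, controls)
print(auc)
```

A value of 0.5 corresponds to no discrimination and 1 to perfect separation of cases from controls.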

A score threshold to define ‘high risk’ was selected by optimising the balance between the sensitivity and positive predictive value (PPV), assuming a prevalence of undiagnosed COPD of 5.5% in the general population.17 We divided the total number screened by the number of true positives to derive the number-needed-to-screen (NNS) to detect a single case of COPD. The number of diagnostic assessments needed to detect a single case of COPD was estimated as the reciprocal of the PPV.
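Choosing a threshold involves tabulating accuracy at each candidate cut point. The following sketch computes sensitivity, specificity and PPV across thresholds for hypothetical scores, using the assumed 5.5% prevalence from the text to convert sensitivity and specificity into a PPV. The function name and data are illustrative, and the sketch does not encode the authors' criterion for the "optimal" balance.

```python
def accuracy_by_threshold(case_scores, control_scores, thresholds, prevalence):
    """Return sensitivity, specificity and PPV for each score threshold.

    A score >= threshold counts as a positive (high-risk) result. PPV is
    derived from sensitivity, specificity and the assumed prevalence, not
    from the case:control ratio in the sample.
    """
    rows = []
    for t in thresholds:
        sens = sum(s >= t for s in case_scores) / len(case_scores)
        spec = sum(s < t for s in control_scores) / len(control_scores)
        true_pos = sens * prevalence
        false_pos = (1 - spec) * (1 - prevalence)
        positives = true_pos + false_pos
        ppv = true_pos / positives if positives > 0 else float("nan")
        rows.append(
            {"threshold": t, "sensitivity": sens, "specificity": spec, "ppv": ppv}
        )
    return rows


# Hypothetical scores; 0.055 is the prevalence assumed in the text.
cases = [4.0, 2.5, 3.0, 1.0]
controls = [0.0, 1.0, 2.5, 0.5, 0.0, 1.5]
rows = accuracy_by_threshold(cases, controls, [1.0, 2.5, 4.0], prevalence=0.055)
for row in rows:
    print(row)
```

Raising the threshold trades sensitivity for specificity and PPV, which is the balance the threshold selection has to strike.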


Development sample: population characteristics

15 159 newly diagnosed COPD cases and 28 296 controls from 340 general practices were included in the development sample (tables 1 and 2). Mean age was 70 years and 52% were male. Cases and controls were matched and therefore identical in age, sex, and socioeconomic status of registered practice. 27% were current smokers, 25% ex-smokers and 40% had never smoked. A significantly higher proportion of cases than controls had a positive smoking history (77% vs 38%, respectively). All co-morbidities except hyperlipidaemia and diabetes mellitus were more common in cases than in controls. This was also true for respiratory and systemic symptoms, including fatigue and poor sleep, as well as health service use.

Table 1

General characteristics of participants in the development sample

Table 2

Comorbidities, symptoms, and healthcare use of participants in the development sample

Model results

The final model included history of smoking, asthma and salbutamol prescriptions and number of LRTIs in the previous 3 years (table 3). There was a significant drop in the model fit when removing asthma, salbutamol and LRTIs. The model was used to derive a clinical score ranging from 0 to 6.5 as shown below table 3. This had a c statistic in the development sample of 0.85 (95% CI 0.845 to 0.853). A more comprehensive model that incorporates additional variables, including symptoms, is provided in table 4.

Table 3

Adjusted ORs and regression coefficients (β) for risk factors included in the final risk model

Table 4

Adjusted ORs and regression coefficients (β) for variables included in a more comprehensive risk score

External validation sample: population characteristics

A total of 2259 newly diagnosed cases and 4196 controls from 20 general practices were included in the validation sample (table 5). The mean age was 70 years, 50% were men, and 26.6% were current smokers. A greater proportion of participants in the validation sample than the development sample were from the lowest socioeconomic quintile (42.4% vs 26.8%, respectively).

Table 5

Characteristics of subjects in the external validation sample (derived from 20 general practices)

External validation: discriminative ability

The final risk score had a c statistic of 0.84 (95% CI 0.83 to 0.85) in the validation sample when analysing the data from all 20 practices combined (ignoring clustering of patients within practices; figure 1). The c statistic in each of the validation practices separately was consistently high (figure 2) and a random effects meta-analysis (which takes into account clustering) produced a similar summary c statistic of 0.85 (95% CI 0.83 to 0.86), with a 95% prediction interval of 0.80 to 0.90. The more comprehensive score had a marginally higher c statistic (0.87, 95% CI 0.86 to 0.87).

Figure 1

Receiver operator characteristic (ROC) curve for the test accuracy of the final risk score in the entire external validation sample (c statistic=0.84, 95% CI 0.83 to 0.85), ignoring clustering of patients within practices. Each point on the graph represents the performance (sensitivity and specificity) of the risk score at a specific threshold.

Figure 2

Random effects meta-analysis of the c statistics obtained for the final risk score when applied in each of the 20 validation practices separately. The summary result is the estimate of the average c statistic across the validation practices.
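For readers unfamiliar with how practice-level estimates are pooled, the sketch below shows one common approach: a DerSimonian-Laird random effects summary with an approximate 95% prediction interval of the form summary ± t × √(τ² + SE²), with the t value taken on k−2 degrees of freedom. The text does not state which estimator or scale was used, so the estimator, pooling on the untransformed c statistic scale, the function name and the input values are all assumptions. The t critical value is supplied by the caller (about 2.101 for 20 practices; 3.182 for the five hypothetical practices used here).

```python
import math


def random_effects_summary(estimates, std_errors, t_crit, z=1.96):
    """DerSimonian-Laird random effects pooling of per-practice estimates.

    t_crit: 97.5th percentile of the t distribution with k - 2 degrees of
    freedom, used for the approximate 95% prediction interval.
    """
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q and the method-of-moments between-practice variance
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight with the between-practice variance added to each variance
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    half_pi = t_crit * math.sqrt(tau2 + se_pooled ** 2)
    return {
        "pooled": pooled,
        "tau2": tau2,
        "ci": (pooled - z * se_pooled, pooled + z * se_pooled),
        "prediction_interval": (pooled - half_pi, pooled + half_pi),
    }


# Hypothetical c statistics and standard errors for five practices.
c_stats = [0.80, 0.83, 0.85, 0.87, 0.90]
ses = [0.02, 0.02, 0.02, 0.02, 0.02]
summary = random_effects_summary(c_stats, ses, t_crit=3.182)
print(summary)
```

The confidence interval describes uncertainty in the average c statistic, whereas the wider prediction interval describes the range of performance expected in a new practice.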

Table 6 summarises the performance of the final score across a range of thresholds in the validation sample. A score threshold ≥2.5 yielded a sensitivity of 63.2% (95% CI 63.1% to 63.3%) and specificity 87.4% (95% CI 87.3% to 87.5%). Assuming a 5.5% prevalence of undiagnosed COPD,17 the score at our suggested threshold would have a PPV of 22.6%, NPV of 97.6%, and an overall screening yield of 3.5% when applied to patients over the age of 35 years. At this threshold the score would need to be applied to 29 patients, 5 of whom would require a clinical assessment, to identify one with COPD (figure 3).
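The predictive values and screening numbers quoted above follow from the reported sensitivity and specificity and the assumed prevalence. The short calculation below reproduces them as a check; the variable names are illustrative, and rounding the number needed to screen and the number of assessments up to whole patients is an assumption that happens to match the reported figures.

```python
import math

sensitivity = 0.632  # reported at a score threshold of >= 2.5
specificity = 0.874  # reported at a score threshold of >= 2.5
prevalence = 0.055   # assumed prevalence of undiagnosed COPD

# Expected proportions of the screened population in each cell
true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
true_neg = specificity * (1 - prevalence)
false_neg = (1 - sensitivity) * prevalence

ppv = true_pos / (true_pos + false_pos)
npv = true_neg / (true_neg + false_neg)
screening_yield = true_pos  # proportion of all screened who are true positives

# Total screened per true positive, and assessments per true positive
number_needed_to_screen = math.ceil(1 / true_pos)
assessments_per_case = math.ceil(1 / ppv)

print(f"PPV {ppv:.1%}, NPV {npv:.1%}, yield {screening_yield:.1%}")
print(number_needed_to_screen, assessments_per_case)
```

The same arithmetic can be rerun with a different assumed prevalence to see how sensitive the PPV and number needed to screen are to that assumption.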

Table 6

Test accuracy of the final risk score in the external validation sample

Figure 3

Screening test accuracy of the final risk score at a threshold of ≥2.5 when applied to 100 patients aged ≥35 years in primary care with an assumed prevalence of undiagnosed chronic obstructive pulmonary disease (COPD) of 5%.


Principal findings

We have developed and validated a clinical prediction model for identifying patients at high risk of COPD in primary care. Our clinical score incorporates smoking status, previous diagnosis of asthma and LRTIs, and prescriptions for salbutamol. The score showed good discrimination characteristics in the external validation population and our choice of optimal cut point yielded a relatively high sensitivity and specificity. It can potentially detect about three out of every five patients with undiagnosed COPD while also being able to effectively rule out patients at low risk of disease. The score threshold, however, can be altered to either maximise sensitivity or specificity.

This builds on our previous published model (based on data from the Health Survey for England) which would require 19 patients to actively undertake a screening process (19 questionnaire responses and 7 clinical assessments) to identify one individual with COPD.17 Our new clinical score, where we use routine data from primary care records, would significantly improve the efficiency of this process.

Comparison with existing literature

The first published risk model to identify patients with undiagnosed COPD was based on managed (predominantly secondary) care administrative claims data in the USA.18 Using a case–control design, 19 health service utilisation characteristics were included, many of which are unlikely to be routinely recorded in primary care. In contrast, we developed a more parsimonious model that uses data routinely recorded in primary care. Furthermore, our study population had more complete data on smoking history. A further US model was developed using outpatient pharmacy data.11 This incorporated respiratory and cardiovascular medications and antibiotics, and had a sensitivity of 60.6% and specificity of 70.5% when externally validated. Our risk score similarly included prior prescription of salbutamol as an important predictor. However, neither the ROC curve nor the c statistic was reported for either US model, which makes it impossible to evaluate their discriminatory accuracy.

In Denmark, Smidth et al12 used administrative data on hospital admissions for lung disorders, respiratory prescriptions and lung function tests to develop a model to identify COPD. This had a much lower sensitivity (29.7–44.8%) but higher specificity (98.9%) than the score we developed. While it had a high PPV in the Danish population (65.0–72.9%; based on an overall COPD prevalence of 9%), it would be difficult to administer in a UK or similar primary care setting where primary and secondary care data are currently poorly linked. This model also relies on prior diagnoses of emphysema and chronic bronchitis at hospital admissions and, because of its low sensitivity, would miss a significant number of patients.

Kotz et al13 recently developed and internally validated a COPD risk model using routine longitudinal data from primary care in Scotland, including a very large (n=480 903 in the development cohort) and relatively young population (mean age 55.6 years). Their model demonstrated similar discrimination characteristics to our own (c statistic 0.85 (95% CI 0.84 to 0.85) in women and 0.83 (95% CI 0.83 to 0.84) in men), with good calibration. However, this has only been internally validated since the study population was randomly split into derivation and validation samples. Furthermore only a very limited range of risk factors were considered (age, sex, smoking status, socioeconomic status and history of asthma) and important predictors such as respiratory infections were not. They constructed separate models for men and women since they found an interaction between smoking status and sex. We also stratified our model by sex and repeated our analysis but found the ORs to be broadly similar to those in the non-stratified model.

A variety of screening questionnaires have also been evaluated. For example, Price et al7 assessed the accuracy of a case finding questionnaire which included items on respiratory symptoms, smoking and allergies, and showed good discrimination characteristics.8 This and other questionnaire-based tools can only be administered in face-to-face consultations or distributed by mail or online. If used in a population with a 5.5% prevalence of undiagnosed COPD, 26 patients would need to be screened to identify one case of COPD. Our model has the advantage that it can be applied in face-to-face consultations and can also be integrated with clinical information systems and used at a practice level to identify whole groups of high-risk patients who could be invited to screening sessions. With the latter approach, only five patients would need to be invited for assessment to identify one case of COPD, an approximately fivefold improvement in efficiency over current screening questionnaires.


We used data from a large primary care population and explored a wide range of risk factors, focusing on those routinely recorded in primary care. Both aspects help ensure this clinical score will be widely applicable in primary care in the UK and other similar health systems. The score was also validated in a number of non-randomly selected practices allowing for assessment of the heterogeneity of its performance.


Ideally we would like to have used previously undiagnosed COPD cases identified by case-finding/screening to derive our risk score since their characteristics may differ from incident cases identified clinically. We used a coded diagnosis of COPD for our case definition. However, there is good evidence that COPD is misdiagnosed and underdiagnosed in primary care,19 a proportion of patients are likely to have undergone spirometry of variable quality,20 and this may have led to some misclassification of our cases and controls. Unfortunately there was insufficient spirometry data in our data set to validate the diagnosis. Quint et al21 recently demonstrated that clinical codes specific for COPD and emphysema have a high PPV for validated COPD. We used clinical codes for COPD that were recommended by the GPRD (now CPRD) at the time of our analysis. Although these largely overlapped with those recommended by Quint et al they also included codes specific for chronic bronchitis, which would not necessarily constitute a diagnosis of COPD (although may increase the likelihood of the development of airflow obstruction and risk of mortality).22

The mean age of our study population (70 years) was older than that of patients who would typically be targeted for case finding. Age and sex, which are likely to be predictors of COPD, could not be incorporated in the model because of the matched case–control data. This also prevented us from examining calibration performance in the validation practices. Another limitation of the matched case–control design is that c statistics are generally downwardly biased when estimated in such data.23,24 Therefore, it is possible the true c statistic may be closer to 1 on application.

Some of the variables we explored, such as hospitalisations, were poorly recorded and may actually be significant predictors of COPD. In addition, the apparent absence of a risk factor could be secondary to under-recording. However, we aimed to produce a model that would be implementable in a common primary care setting drawing on routinely recorded data. If clinical coding improves over time, some of these variables may need to be revisited as potential predictors and considered for inclusion in future revised models. Finally, our clinical score may not be applicable in health settings where exposure to risk factors other than cigarette smoking (eg, biomass fuels) is a significant cause of COPD.

Implications for clinicians, policymakers and research

Our clinical score, once further validated, could be used by clinicians in primary care to stratify patients by risk of COPD. This could be achieved primarily with the aid of software applications that would automate the calculations. Since the model was based entirely on routinely collected data, it could also be integrated into primary care clinical information systems to use data on risk factors to stratify all eligible patients. Patients predicted to be at high risk of COPD could then be referred for a clinical assessment including confirmatory spirometry testing.

However, further work is needed to validate or adapt this preliminary model in other populations, notably in case finding trials that have enrolled patients with previously undiagnosed COPD. This includes examining our matching factors (age, sex and socioeconomic deprivation) as potential predictors. The cost-effectiveness of targeting patients at different thresholds should also be evaluated. Future studies should also address the impact of this tool on use and outcomes in general practice.


Our risk score shows promising accuracy and increased efficiency over current methods for identifying patients with COPD in primary care. An externally validated score could be used for risk stratification so that high-risk patients can be efficiently identified and referred for confirmatory spirometry. However, evidence that early identification of COPD results in improved patient outcomes must be robustly assessed before screening for COPD can be recommended as part of routine practice.


This study is based in part on data from the Full Feature General Practice Research Database obtained under licence from the UK Medicines and Healthcare Products Regulatory Agency. However, the interpretation and conclusions contained in this study are those of the authors alone. Access to the CPRD database was funded through the Medical Research Council's licence agreement with MHRA. This CPRD data set was obtained under MRC licence. Approval was given by the Independent Scientific Advisory Committee for the Medicines and Healthcare products Regulatory Agency for this project (protocol 07_089R). The authors are grateful to the General Practice Research Database for access to their data. We obtained the data with a maximum of 100 000 records in order to compare the characteristics and health service use of prevalent patients with COPD with matched controls without COPD.


  • Contributors The idea for this study was initially conceived by REJ. REJ and PA applied for approval, designed the protocol for data extraction from the CPRD and obtained the CPRD data set. SH, RDJ, PA and REL identified appropriate clinical codes for data extraction. RL extracted and manipulated the data from the CPRD data set to create a STATA file. SH led the design of the study with advice from PA, RDJ, TM and RDR. SH undertook the statistical analysis with specific advice from RDR and additional input from PA, REJ and TM. SH wrote the manuscript with advice and input from all authors. All authors agreed to the final version.

  • Funding  This paper presents independent research funded by the National Institute for Health Research (NIHR). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. Shamil Haroon is funded by a National Institute for Health Research (NIHR) doctoral fellowship (DRF-2011-04-064). Rachel Jordan was funded by an NIHR post-doctoral fellowship (pdf/01/2008/023). Tom Marshall is partly funded by the National Institute for Health Research (NIHR) through the Collaborations for Leadership in Applied Health Research and Care for Birmingham and Black Country (CLAHRC-BBC) programme.

  • Competing interests All authors have completed the Unified Competing Interest form at (available on request from the corresponding author) and declare: no support from any organisation for the submitted work (except described above); no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years, no other relationships or activities that could appear to have influenced the submitted work.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
