
Reliability and responsiveness of the D12 and validity of its scores as a measure of dyspnoea severity in patients with rheumatoid arthritis-related interstitial lung disease
Jeffrey J Swigris (1), Sonye Danoff (2), Paul F Dellaripa (3), Tracy J Doyle (4) and Joshua J Solomon (1)

  1. National Jewish Health Center for Interstitial Lung Disease, Denver, Colorado, USA
  2. Division of Pulmonary Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
  3. Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
  4. Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA

Correspondence to Professor Jeffrey J Swigris; swigrisj{at}


Background Interstitial lung disease due to rheumatoid arthritis (RA-ILD) affects a substantial minority of patients with RA, inducing life-altering symptoms, impairing quality of life (QOL) and forcing patients to confront the potential for shortened survival. Dyspnoea is the predominant respiratory symptom of RA-ILD and a strong driver of QOL impairment in patients with it. The D12 is a 12-item questionnaire that assesses the physical and affective components of dyspnoea. It was one of a battery of patient-reported outcomes used in the double-blind, placebo-controlled TRAIL 1 trial of pirfenidone for RA-ILD. There is little information on the reliability, validity or responsiveness of the D12 in RA-ILD.

Methods In accordance with COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) methodology, we conducted analyses on data from the TRAIL 1 trial to assess the measurement properties of the D12.

Results Internal consistency (Cronbach’s α=0.95, 0.95, 0.95, 0.95 and 0.96 at baseline, 13, 26, 39 and 52 weeks) and test-retest reliability (intraclass correlation coefficient 0.85 (0.71 to 0.92)) exceeded acceptability criteria. Of 46 hypotheses around D12 measurement properties, 43 (93%) were confirmed, well over the 75% benchmark. Known-groups validity was supported by significant differences in D12 scores between subgroups of patients with differing levels of dyspnoea (eg, St. George’s Respiratory Questionnaire (SGRQ) Activity score ≥50 vs <50, 9.36 (1.27) points, p<0.0001, with a large effect size=1.7) and physiological impairment at baseline. Longitudinal validity was supported by significant associations between D12 and anchor scores over time (eg, at 52 weeks, the correlation between D12 change and SGRQ Activity change was 0.54, p<0.0001; between D12 change and change in the Routine Assessment of Patient Index Data (RAPID) Functioning component, it was 0.41, p<0.0001). A battery of analyses confirmed the responsiveness of D12 scores for capturing change in dyspnoea over time. We estimated the minimal within-patient change threshold for worsening as 3 points.

Conclusions D12 scores possess acceptable measurement properties in RA-ILD, such that the D12 can be used with confidence in this population to assess dyspnoea severity defined by its physical and affective components. As validation is an ongoing process, and never accomplished in a single study, additional research on the psychometric properties of the D12 in RA-ILD is encouraged.

  • interstitial fibrosis
  • systemic disease and lungs
  • surveys and questionnaires

Data availability statement

No data are available.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



  • Dyspnoea is the most common symptom of rheumatoid arthritis-related interstitial lung disease (RA-ILD) and a strong driver of quality of life impairment in patients with RA-ILD.


  • In this study, we confirm the reliability, validity and responsiveness of the Dyspnoea-12 (D12) as a measure of dyspnoea severity in patients with RA-ILD.


  • This study suggests that the D12 could be used in patients with RA-ILD to assess the effect of therapeutic interventions on dyspnoea severity.


Introduction

Interstitial lung disease (ILD) is a common manifestation of rheumatoid arthritis (RA), with clinically significant ILD affecting a non-trivial proportion of patients with RA.1 2 RA-related ILD (RA-ILD) causes life-altering symptoms and impairments in quality of life (QOL) that add to the burdens imposed by joint manifestations.3 As in patients with other forms of ILD, dyspnoea is a strong driver of impaired health status and QOL in those with RA-ILD.4 Thus, assessing dyspnoea (with reliable, valid and responsive metrics) and attempting to limit its worsening are worthwhile endeavours in therapeutic trials enrolling patients with RA-ILD.

The D12 is a questionnaire developed to assess overall dyspnoea severity by assessing how shortness of breath is perceived (physically) and how it makes the respondent feel (affective effects).5 Pirfenidone is an antifibrotic medication that slows forced vital capacity (FVC) decline in patients with idiopathic pulmonary fibrosis (IPF).6 In TRAIL 1, a randomised, double-blind, placebo-controlled, phase II trial, the efficacy of pirfenidone was examined in RA-ILD.7 The D12 and other patient-reported outcome measures (PROMs) were collected during TRAIL 1 as lower-tier end points. The psychometric properties of the D12 are largely unknown in patients with RA-ILD. We aimed to examine them in accordance with COSMIN methodology8; in brief, we tested hypotheses around the reliability and responsiveness of the D12 and the validity of its scores as a measure of dyspnoea severity in patients with mild to moderately severe RA-ILD enrolled in TRAIL 1.


Methods

Patient and public involvement

Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Design and participants

The design and primary results for the TRAIL 1 trial have been published.7 Briefly, 123 subjects with RA-ILD were randomised to receive pirfenidone or matching placebo for 52 weeks. Study visits occurred at baseline and 13, 26, 39 and 52 weeks at which subjects performed spirometry and completed a battery of PROMs.

PROMs used in this study

The D12

The D12 is a 12-item questionnaire designed to assess dyspnoea severity by having respondents reflect on “your breathing these days” and give responses of ‘none’, ‘mild’, ‘moderate’ or ‘severe’ to each item. The premise is that more severe dyspnoea is perceived with more uncomfortable physical and/or more bothersome affective characteristics. The recall period is ‘these days’, and respondents are not asked to consider any specific activities, instances or times of day as they reflect on their dyspnoea and respond to items. Its single score ranges from 0 to 36, and a higher score indicates more severe dyspnoea.9 The format of the D12 can be found in online supplemental table S1.
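The scoring described above can be sketched in a few lines. This is a minimal illustration, not the official scoring algorithm: the 0 to 3 coding of the four response options is an assumption inferred from the 12 items and the 0 to 36 score range, and the names `ITEM_POINTS` and `d12_total` are hypothetical.

```python
# Hypothetical scoring helper. The 0-3 item coding is an assumption inferred
# from the 12 items and the 0-36 total score range described in the text.
ITEM_POINTS = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}


def d12_total(responses):
    """Sum 12 item responses into a single D12 score (0 to 36)."""
    if len(responses) != 12:
        raise ValueError("the D12 has exactly 12 items")
    return sum(ITEM_POINTS[r.lower()] for r in responses)


# Made-up response set: 6 x none, 3 x mild, 2 x moderate, 1 x severe.
example = ["none"] * 6 + ["mild"] * 3 + ["moderate"] * 2 + ["severe"]
print(d12_total(example))  # 0*6 + 1*3 + 2*2 + 3*1 = 10
```

A respondent answering ‘severe’ to every item would score 36, and one answering ‘none’ to every item would score 0, which is the floor examined later in the results.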


The SGRQ

The St. George’s Respiratory Questionnaire (SGRQ) is a 50-item questionnaire designed to assess respiratory health status in patients with asthma or chronic obstructive pulmonary disease, but it has been used frequently in research on patients with ILD. The SGRQ yields four scores (Symptoms, Activity and Impacts domains, and a Total), each of which ranges from 0 to 100, with higher scores indicating worse respiratory health status.10 11


The RAPID3

The Routine Assessment of Patient Index Data 3 (RAPID3) is a sum of the physical functioning (FN), pain (PN) and global assessment (PTGL) components from the Multi-Dimensional Health Assessment Questionnaire. Each component is scored 0–10, with higher scores indicating worse status.12

In TRAIL 1, the D12 and SGRQ were collected at all timepoints, and the RAPID3 was collected at baseline and 52 weeks.

Statistical analyses

Baseline data were tabulated and summarised using counts and measures of central tendency. Analyses were based on COSMIN recommendations for studies on measurement properties of PROMs.8 13 Three anchors (SGRQ Activity, RAPID FN and RAPID PTGL) were used for validity assessment of D12 scores, but in certain instances, results are reported for other PROM scores (eg, SGRQ Total score) and physiological variables of clinical interest (eg, FVC and diffusing capacity of the lung for carbon monoxide (DLCO)). The rationale for using the three anchors includes the following: (1) the SGRQ Activity domain is essentially a measure of dyspnoea severity (the same construct assessed by D12); (2) the RAPID FN is a measure of physical functioning, with items that address activities of varying metabolic demand (relevant to patients with exertional dyspnoea) and (3) RAPID PTGL is a global assessment of physical well-being, and dyspnoea is known as a strong driver of impaired physical well-being in patients with RA-ILD.3 Historically, physiological variables (eg, FVC, DLCO) have not correlated highly enough with PROM scores to serve as adequate anchors for PROMs (including the D12 in a cohort of patients with various forms of ILD, predominantly IPF); furthermore, it has been recommended that rather than physiological variables, PROMs be used as anchors in psychometric assessments of other PROMs.5 14 Because the cut-points for the anchors have not been determined for RA-ILD, we borrowed from the ILD literature15 and drew from our experience and intuition to select them.

Analyses assessed the following: (1) internal consistency, (2) test-retest reliability, (3) convergent and known-groups analyses to assess validity, (4) responsiveness and (5) estimation of the minimal within-patient change (MWPC) threshold for worsening. Online supplemental table S2 shows hypotheses tested and whether they were confirmed by our analyses. Analyses were conducted in SAS, V.9.4 (SAS Institute; Cary, North Carolina, USA).

Item-item correlations and internal consistency

We used Spearman’s item-item correlations to examine inter-relatedness of items and Cronbach’s α to assess internal consistency.
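For readers unfamiliar with Cronbach’s α, the sketch below shows the standard formula, α = k/(k−1) × (1 − Σ item variances / variance of total scores), using only the Python standard library. The analyses in this study were run in SAS; the function name `cronbach_alpha` and the toy data are hypothetical and are not TRAIL 1 data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table.

    rows: one list of item scores per respondent, all of the same length.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])  # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up responses from four respondents to a three-item scale (coded 0-3).
toy = [[0, 1, 0], [1, 1, 2], [2, 3, 2], [3, 2, 3]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

When every item ranks respondents identically, α reaches 1; values near the 0.95 reported here indicate that the items move together very closely.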

Test-retest reliability

We used the intraclass correlation coefficient (ICC(1,2)), from a two-way mixed-effects analysis of variance (ANOVA) model with interaction for absolute agreement, as a measure of test-retest reliability (ie, stability) of D12 scores among subjects deemed stable (absolute change <5 points) according to the SGRQ Activity anchor from baseline to 13 weeks.
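As an illustration of the test-retest approach, the sketch below first keeps subjects whose anchor changed by fewer than 5 points in either direction and then computes a single-measure, absolute-agreement ICC from two-way ANOVA mean squares. This is one common formulation and is an assumption on our part; it is not necessarily the exact SAS model used in the analysis. The function names, the record layout and the toy values are hypothetical.

```python
def stable_subjects(records, limit=5):
    """Keep (baseline, week 13) D12 pairs where |anchor change| < limit.

    records: tuples of (d12_baseline, d12_week13, sgrq_activity_change).
    """
    return [[b, w] for b, w, anchor in records if abs(anchor) < limit]


def icc_absolute_agreement(pairs):
    """Single-measure, absolute-agreement ICC from a two-way ANOVA table."""
    n, k = len(pairs), len(pairs[0])  # subjects, occasions
    grand = sum(sum(row) for row in pairs) / (n * k)
    row_means = [sum(row) / k for row in pairs]
    col_means = [sum(row[j] for row in pairs) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in pairs for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Made-up records: the last subject is excluded because the anchor moved by 8.
toy_records = [(4, 5, 1.0), (10, 9, -3.5), (0, 0, 0.0), (18, 20, 4.0), (7, 15, 8.0)]
pairs = stable_subjects(toy_records)
print(len(pairs), round(icc_absolute_agreement(pairs), 3))
```

Because the coefficient measures absolute agreement, a systematic shift between occasions lowers it even when subjects keep the same rank order.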

Convergent and known-groups analyses to assess construct validity

We assessed convergent validity by examining pairwise Spearman’s correlations between D12 and other outcomes at the various study timepoints. We used ANOVA with p value-corrected pairwise comparisons to assess differences in D12 scores between subgroups stratified on the anchors and other outcomes. We calculated effect sizes (ESs) for those differences using Cohen’s d: $d = (M_1 - M_2)/s_{\text{pooled}}$, where $s_{\text{pooled}} = \sqrt{(s_1^2 + s_2^2)/2}$. We considered ESs of 0.2 as small, 0.5 as medium and 0.8 as large.
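The effect size calculation above can be sketched as follows, with the pooled SD taken as the square root of the average of the two subgroup variances. The function names and toy scores are hypothetical, and the ‘negligible’ label for values below 0.2 is our assumption; the text defines only small, medium and large.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group1, group2):
    """Cohen's d with s_pooled = sqrt((s1^2 + s2^2) / 2)."""
    s_pooled = sqrt((stdev(group1) ** 2 + stdev(group2) ** 2) / 2)
    return (mean(group1) - mean(group2)) / s_pooled


def es_label(d):
    """Map |d| onto the 0.2 / 0.5 / 0.8 benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # assumption: the text does not name this band


# Made-up D12 scores for a more impaired and a less impaired subgroup.
more_impaired = [10, 12, 14]
less_impaired = [4, 6, 8]
d = cohens_d(more_impaired, less_impaired)
print(d, es_label(d))  # means 12 vs 6, both SDs 2, so d = 3.0 (large)
```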


Responsiveness

We used several methods to examine the ability of D12 scores to respond to changes in dyspnoea severity. We examined pairwise Spearman’s correlations between D12 change scores and anchor and other outcome change scores. For each anchor, we built a repeated-measures, longitudinal model with D12 change score as the outcome variable and anchor change as the lone predictor variable. Because RAPID FN and RAPID PTGL were collected only at baseline and week 52, for these (rather than a repeated-measures model), we simply regressed D12 change on FN or PTGL change.

We generated empirical cumulative distribution function (eCDF) plots for D12 change-by-change strata for the anchors and other outcomes. For all subjects (from baseline to week 52) and for the subgroups of subjects who worsened (from baseline to week 52) according to the RAPID PTGL anchor or SGRQ Activity anchor, we generated ES and standardised response mean (SRM) statistics for the D12 and anchors: ES, defined as the change divided by baseline SD, and SRM, defined as change divided by SD of change. Finally, we generated receiver operating characteristic (ROC) curves for D12 change as an identifier of worsening from baseline to week 52 (the cohort was divided into two groups, worsened versus not, according to each of the three anchors). We examined area under the curve (AUC) for each ROC curve. For test-retest and the eCDF, ES/SRM and AUC analyses, we converted the RAPID PTGL anchor (21 response options) to a five-category variable (called RAPID PGIS) by dividing the original response by 2.5 and rounding to the nearest integer (online supplemental table S3). Doing so consolidated change on the RAPID PTGL anchor into a more manageable number of categories (5 as opposed to over 40).
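The ES, SRM and RAPID PGIS definitions above translate directly into code. The sketch below is illustrative only: it takes ‘change’ to mean the mean change score, and it assumes the 21 RAPID PTGL response options are 0 to 10 in steps of 0.5. The function names and toy scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def rapid_pgis(ptgl):
    """Collapse a 0-10 RAPID PTGL response by dividing by 2.5 and rounding.

    With responses in steps of 0.5 (an assumption), ptgl / 2.5 never lands
    exactly on a .5 boundary, so the tie-breaking rule of round() is moot.
    """
    return round(ptgl / 2.5)


# Made-up baseline and week 52 scores for four subjects.
baseline = [2, 4, 6, 8]
week52 = [3, 6, 7, 12]
print(round(effect_size(baseline, week52), 3))
print(round(standardised_response_mean(baseline, week52), 3))
print(sorted({rapid_pgis(0.5 * step) for step in range(21)}))  # categories 0 to 4
```

Under the stated assumption, the conversion maps the 21 response options onto the five integer categories 0 to 4.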

Minimal within-patient change threshold

We used data from the ROC curves to calculate Youden’s statistic to identify cut-points (ie, thresholds) for D12 change that distinguished subjects who worsened versus those who did not worsen according to the dichotomised anchors (as defined above). Each of the three anchors yielded an estimate, and the final estimate for the MWPC was weighted based on the correlation16 between D12 change and dichotomised anchor change.
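A minimal sketch of the cut-point search follows. For each candidate threshold, it computes Youden’s J (sensitivity + specificity − 1) and keeps the threshold with the largest J; the per-anchor thresholds are then combined as a correlation-weighted average. The function names, the toy data and the weights are hypothetical, and the exact weighting scheme used in the analysis may differ.

```python
def youden_cut_point(changes, worsened):
    """Return (threshold, J) maximising sensitivity + specificity - 1.

    A subject is classified as worsened when D12 change >= threshold.
    """
    positives = sum(worsened)
    negatives = len(worsened) - positives
    best = None
    for threshold in sorted(set(changes)):
        tp = sum(1 for c, w in zip(changes, worsened) if w and c >= threshold)
        tn = sum(1 for c, w in zip(changes, worsened) if not w and c < threshold)
        j = tp / positives + tn / negatives - 1
        if best is None or j > best[1]:
            best = (threshold, j)
    return best


def weighted_threshold(thresholds, correlations):
    """Average per-anchor thresholds, weighting each by its correlation."""
    return sum(t * r for t, r in zip(thresholds, correlations)) / sum(correlations)


# Made-up D12 change scores and anchor-defined worsening flags.
changes = [-2, 0, 1, 2, 3, 4, 5, 6]
worsened = [False, False, False, False, True, False, True, True]
threshold, j = youden_cut_point(changes, worsened)
print(threshold, round(j, 2))

# Hypothetical correlation weights for three anchors sharing one threshold.
print(round(weighted_threshold([2.99, 2.99, 2.99], [0.5, 0.4, 0.3]), 2))
```

When all anchors yield the same threshold, as in the second call, the weights have no effect on the final estimate.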



Results

Table 1 shows baseline characteristics for subjects included in the analysis. Pulmonary physiology suggested mild to moderate impairment in lung function.

Table 1

Baseline characteristics for subjects from the TRAIL 1 trial

Item-item correlations and internal consistency

An item-item and item-total correlation matrix for the D12 at baseline can be found in online supplemental figure S1. All item-total correlations were strong (>0.8). As hypothesised, internal reliability of the D12 was excellent (α=0.95, 0.95, 0.95, 0.95 and 0.96 at baseline, 13, 26, 39 and 52 weeks).

Test-retest reliability

As hypothesised, the ICC(1,2) for subjects who were stable at 13 weeks according to the SGRQ Activity anchor (ie, absolute change <5 points) was 0.85 (0.71 to 0.92).

Floor/Ceiling effects and missingness

The percentage of subjects who scored at the floor (ie, 0) of the D12 for each study timepoint was 13.1% at baseline, 14.8% at 13 weeks, 18.7% at 26 weeks, 17.0% at 39 weeks and 17.2% at 52 weeks. No subject had a D12 score at the ceiling (ie, 36) at any timepoint. Of the 537 total non-missing D12 evaluations during the study (of a possible 123×5=615 administrations), there were only 6 instances in which the D12 score was 30 or greater (see online supplemental table S3 for missingness for each PROM).
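The floor, ceiling and completeness figures above are simple proportions. The sketch below shows the arithmetic on made-up scores, with `None` standing in for a missing administration; only the 537 and 123×5 counts come from the text, and the function name is hypothetical.

```python
def floor_ceiling_percent(scores, floor=0, ceiling=36):
    """Percentage of non-missing scores at the floor and at the ceiling."""
    observed = [s for s in scores if s is not None]
    at_floor = 100 * sum(s == floor for s in observed) / len(observed)
    at_ceiling = 100 * sum(s == ceiling for s in observed) / len(observed)
    return at_floor, at_ceiling


# Made-up D12 scores at one timepoint; None marks a missing evaluation.
toy_scores = [0, 0, 4, 9, 15, 22, None, 7]
pct_floor, pct_ceiling = floor_ceiling_percent(toy_scores)
print(round(pct_floor, 1), pct_ceiling)

# Completeness from the counts reported in the text: 537 of 123 x 5 administrations.
completeness = 100 * 537 / (123 * 5)
print(round(completeness, 1))  # 87.3
```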

Convergent validity

Table 2 shows correlations between D12 scores and values for other variables. As hypothesised, at all study timepoints, correlations between the D12 and other PROMs were statistically significant, moderately strong or strong, and in the hypothesised directions. At baseline, correlations between D12 scores and FVC or DLCO were weak and not statistically significant.

Table 2

Spearman’s correlation coefficients for D12 scores and other outcomes at the five timepoints of the TRAIL 1 trial

Known-groups validity

At baseline, D12 scores were associated with other outcome measures (table 3). D12 scores differed between subgroups of subjects stratified (least versus most impaired/severe/dyspnoeic) on the other PROMs or FVC% (table 3). As hypothesised, ESs were large for between-subgroups differences for the three anchors.

Table 3

Known-groups validity for D12 at baseline


Responsiveness

There were significant associations between D12 and other change scores over time, with correlations between D12 and PROM change scores ranging from 0.28 to 0.65 across study timepoints (table 4). As hypothesised, longitudinal models confirmed the same association: change in anchor or physiological measure was associated with change in D12 score (table 4 and figure 1). eCDF plots showed a separation of D12 curves for categories of change in anchors and other outcomes (online supplemental figure S2).

Figure 1

Observed and model-predicted values for D12 change according to change values for anchors (panel A, SGRQ Activity; panel B, RAPID FN; panel C, RAPID Global (PTGL)). DLCO, diffusing capacity of the lung for carbon monoxide; DLCO%, per cent predicted DLCO; FVC, forced vital capacity; FVC%, per cent predicted FVC; RAPID FN, functional domain from Multi-Dimensional Health Assessment Questionnaire; RAPID Global (PTGL), global assessment from Multi-Dimensional Health Assessment Questionnaire; SGRQ, St. George’s Respiratory Questionnaire.

Table 4

Associations between change in D12 and change in other outcomes over time

ES and SRM values for the entire cohort and for the subgroups that worsened according to the RAPID PGIS or SGRQ Activity score are presented in table 5. Overall, among all subjects with available data, the ES and SRM for the D12 were at least as high as the two RAPID3 anchors but lower than the SGRQ Activity anchor. Whether worsening was defined by the RAPID PGIS or SGRQ Activity anchor, except for the SRM for the RAPID FN anchor, ES and SRM values for the D12 were greater than values for the other anchors. The AUC values for D12 according to dichotomised change (worsened vs not) in SGRQ Activity, RAPID FN and RAPID PTGL anchors were 0.82, 0.71 and 0.76, respectively.

Table 5

ES and SRM values for all subjects and for subgroup who worsened according to the RAPID PGIS or SGRQ Activity anchor

MWPC thresholds for worsening from baseline to 52 weeks

Thresholds (95% confidence limits) for MWPC for the SGRQ Activity, RAPID FN and RAPID PTGL anchors were 2.99 (1.42 to 5.88), 2.99 (−2.29 to 10.12) and 2.99 (0.13 to 6.46), yielding a final, weighted estimate of 3 points (ie, an increase in D12 score of ≥3 points is considered worsening).

In total, we generated 46 hypotheses, of which 43 (93%) were met (online supplemental table S4).


Discussion

We used a battery of analyses based on the COSMIN framework to assess the D12 in patients with RA-ILD. We found it possesses measurement properties that make its scores suitable for assessing dyspnoea severity (defined as a combination of its physical and affective aspects) in this patient group.

As with any clinical outcome assessment (COA), a PROM must possess characteristics that give users confidence—analyses must confirm that, in the target population, all items are strongly related to each other and the construct of interest; scores are a metric of what the PROM is purported to measure; scores remain stable in respondents who are unchanged on the construct, and scores change as expected in respondents who are changed on the construct.

We used hypothesis testing to assess these characteristics (ie, psychometric properties) of the D12 and found high inter-relatedness among items and an instrument with high internal consistency and responsiveness. Nearly every hypothesis was confirmed, thus surpassing the 75% benchmark recommended by measurement experts.13 Analyses of baseline data support construct validity of D12 scores, that is, they reflect aspects of dyspnoea severity. When there is no gold standard against which to compare a COA (as is the case for PROMs), anchors previously shown and/or hypothesised to capture the same construct are used instead. We selected as anchors the Activity domain from the SGRQ, the FN component from the RAPID3 and the PTGL from the RAPID3, because of their direct relationship to dyspnoea severity. The SGRQ Activity domain functions as a dyspnoea index, asking respondents to reveal which activities make them short of breath and how shortness of breath affects their ability to perform various activities. The FN component asks about the ability to complete certain tasks of varying metabolic demand, which are therefore likely to induce varying levels of dyspnoea. The PTGL is a global assessment of health status, which is known to be affected by dyspnoea in patients with ILD, including those with connective tissue disease, like RA.4 As recommended, we used scores from these other PROMs as anchors in our analyses of the D12. Physiological measures, like FVC and DLCO, tend not to be suitable anchors.17 The moderately strong, statistically significant associations (in expected directions) between D12 and anchor scores suggest they all (the D12 as well as the anchors) tap aspects of the same construct (dyspnoea severity). D12 scores also differed significantly and, most importantly, by the hypothesised large ES between subgroups hypothesised to have different levels of dyspnoea severity (eg, those with the most severe dyspnoea according to the anchors vs those with the least). These results show that the D12 can discriminate between subgroups with differing severity of RA-ILD.

Several results supported the longitudinal validity and responsiveness of D12 scores to capture change in dyspnoea severity over time in patients with RA-ILD; these included the simple correlational analyses of change scores, the longitudinal models which also revealed the association between D12 change and anchor change over the course of TRAIL 1, and others. Test-retest confirmed stability (ie, ICC ≥0.8) in D12 scores among the subgroup of subjects hypothesised to have stable dyspnoea severity (based on the SGRQ Activity anchor).

The ES and SRM yield change scores in SD units (SD either for baseline (ES) or for change scores (SRM)). Validity for the D12 is supported when change in SD units for the D12 is similar to change in SD units for the anchors (ie, no more than 0.05 units lower, and preferably the same or greater). This criterion was met in most instances but not for the overall study population for the D12 vs the SGRQ Activity anchor. This likely relates to differences in aspects of dyspnoea assessed by the D12 and the SGRQ Activity domain (discussed more below). The eCDF plots show separation of curves (ie, differing changes in D12 over time) for anchor subgroups which, themselves, have differing changes in dyspnoea severity over time. Finally, the ROC-derived MWPC threshold for worsening was the same for each anchor, which bolsters confidence in the overall estimation.

Dyspnoea is a known driver of impaired health status and physical well-being in patients with RA-ILD.3 Results here suggest that the D12 is a suitable tool to capture dyspnoea severity at baseline, to distinguish subgroups of patients with different levels of dyspnoea severity and to capture changes in dyspnoea severity over time (eg, in response to therapeutic interventions). However, the D12 and our study have limitations. By design, the D12 is intended to be somewhat vague in its recall period, and the vagueness induced by ‘these days’ (the D12 recall period) is almost certainly interpreted differently by different respondents, thus introducing variability in scores. Likewise, D12 respondents are asked to “place a tick in the box that best matches your breathing”, but there is no direction on whether respondents should consider their breathing at rest, during light/moderate/vigorous activity or at a particular time of day. The skewed distribution of D12 response data (to the mild side) suggests respondents may be reflecting on their breathing at rest, a situation in which only patients with the most severe RA-ILD have dyspnoea.

In contrast, items on the SGRQ Activity domain (from the version of the SGRQ used in TRAIL 1) refer to a specific time period for recall (1 month) and ask which of the given physical activities “usually make you feel breathless”, and how specific activities “may be affected by your breathing” (eg, cause slowing or the need to stop and rest). For these activity-based items, there is no subjective interpretation about the circumstances in which to reflect on dyspnoea. These differences likely account for the differences in ES and SRM between the D12 and SGRQ Activity scores we observed in the overall analysis. The D12 taps aspects of dyspnoea that are related to, but distinct from, those captured by other dyspnoea domains or indices, and although our results confirm acceptable psychometric properties in patients with RA-ILD, it remains unknown precisely how perceptions of the physical and affective components of dyspnoea (and by extension, D12 scores) vary when patients participate in activities of varying metabolic demand. Thus, the D12 would not be the instrument of choice if such information is desired. Nor is the D12 adequate to assess overall functioning, given the potential for articular manifestations of RA to affect how patients feel and function in their daily lives. Because of the paucity of PROM research in RA-ILD, anchor cut-points were selected based on data from a related condition (IPF in the case of the SGRQ Activity domain) and our integration of our research experience, intuition and information from studies of patients with RA but not ILD.

Finally, although 123 subjects in TRAIL 1—the first-ever trial dedicated solely to this patient population—is a respectable number of subjects, particularly for a trial conducted during the COVID-19 pandemic, in several subgroup analyses, the small N creates imprecision in point estimates.


Conclusion

Our analyses support the D12 as fit for the purpose of assessing physical and affective aspects of dyspnoea in patients with RA-ILD. Our analyses highlight three additional things: (1) people with more severe RA-ILD (eg, those who need supplemental oxygen or have lower FVC) will perceive more severe negative physical and emotional effects of dyspnoea, (2) these negative effects become stronger if RA-ILD worsens and (3) a D12 score increase of 3 is the minimal within-patient change threshold for worsening dyspnoea severity in patients with moderately severe RA-ILD.


Ethics statements

Patient consent for publication

Ethics approval

Our analyses were performed under an approved research protocol by the National Jewish Health Institutional Review Board (HS# 2584).




  • Twitter @SwigOutFishing

  • Contributors Study planning/conceptualisation: all authors. Statistical analysis: JJSw. Interpretation of results: all authors. Writing, editing and approving of manuscript: all authors. JJSw is the author responsible for the overall content as the guarantor and accepts full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.