Rationale Spirometry and plethysmography are the gold standard pulmonary function tests (PFT) for diagnosis and management of lung disease. Because plethysmography is often inaccessible, spirometry is frequently used alone, but this leads to missed diagnoses or misdiagnoses because spirometry cannot identify restrictive disease without plethysmography. We aimed to develop a deep learning model to improve interpretation of spirometry alone.
Methods We built a multilayer perceptron model using full PFTs from 748 patients, interpreted according to international guidelines. Inputs included spirometry (forced vital capacity, forced expiratory volume in 1 s, forced mid-expiratory flow25–75), plethysmography (total lung capacity, residual volume) and biometrics (sex, age, height). The model was developed with 2582 PFTs from 477 patients, randomly divided into training (80%), validation (10%) and test (10%) sets, and refined using 1245 previously unseen PFTs from 271 patients, split 50/50 as validation (136 patients) and test (135 patients) sets. Only one test per patient was used for each of 10 experiments conducted for each input combination. The final model was compared with interpretation of 82 spirometry tests by 6 trained pulmonologists and a decision tree.
Results Accuracies from the first 477 patients were similar when inputs included biometrics+spirometry+plethysmography (95%±3%) vs biometrics+spirometry (90%±2%). Model refinement with the next 271 patients improved accuracies with biometrics+spirometry (95%±2%) but did not change accuracies with biometrics+spirometry+plethysmography (95%±2%). The final model significantly outperformed (94.67%±2.63%, p<0.01 for both) interpretation of 82 spirometry tests by the decision tree (75.61%±0.00%) and pulmonologists (66.67%±14.63%).
Conclusions Deep learning improves the diagnostic acumen of spirometry and classifies lung physiology better than pulmonologists with accuracies comparable to full PFTs.
- Lung Physiology
- Respiratory Measurement
Data availability statement
Data are available on reasonable request. Data sharing will be available in response to a written request following review and approval by the responsible institutions to ensure appropriate guidelines are met.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Spirometry is the most commonly used pulmonary function test for screening and management of lung disease. Without assessment of lung volumes using plethysmography, spirometry misses restrictive defects and can lead to misdiagnoses. Computer-aided tools have been developed to improve classification of lung physiology patterns. However, these tools require the inclusion of plethysmography measurements and/or clinical symptoms. No study has developed machine learning algorithms for classifying the major lung conditions using spirometry only.
WHAT THIS STUDY ADDS
Deep learning using a multilayer perceptron model with spirometry data provides classification accuracies of lung physiology patterns that are comparable to full pulmonary function testing, which includes both spirometry and plethysmography, and better than trained pulmonologists.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
Implementation of the deep learning model (code appended in this paper) in spirometers will facilitate accurate identification of pulmonary physiology patterns to appropriately triage patients for subsequent investigations and/or therapy. This will improve equity in healthcare access; patients who live in regions of the world where spirometry is available, but access to diagnostic laboratories for full pulmonary testing is limited, will receive equitable care when machine learning is applied. The machine learning model will also increase healthcare delivery efficiency and improve patient outcomes by facilitating earlier diagnosis of lung diseases.
The current gold standard pulmonary function test (PFT) consists of both spirometry and plethysmography,1 2 which have well-established guidelines for conduct and interpretation.3 However, patient access to plethysmography is often limited due to the need for expensive infrastructure and technical expertise. Furthermore, the plethysmograph does not readily accommodate patients with physical disabilities and/or claustrophobia. Spirometry alone is the most common PFT modality; it is portable and easily deployed in multiple settings including the bedside, clinic, home or workplace.4 However, spirometry can miss or misdiagnose lung disease as it has limited ability to identify early obstructive lung disease and restrictive defects in the absence of plethysmography.5 6
Many physicians use computer-aided tools to facilitate PFT interpretations, such as the decision-tree algorithm developed in our laboratory (online supplemental file 1). However, few studies have applied machine learning for de novo interpretation of PFTs. One group developed a multiclass support vector machine algorithm to classify normal, obstructive and restrictive patterns based on forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC.7 While they reported high validation accuracies, determination of the true labels of PFTs did not adhere to international guidelines; obstruction was defined by FEV1/FVC <75%, rather than the lower limit of normal and restriction by FVC <80% with normal FEV1/FVC but without plethysmography.7 Biometrics, key for the derivation of normal reference values,8 were not included in their model.7 Topalovic et al hypothesised that machine learning could reduce the inter-rater variability commonly observed in PFT interpretation by pulmonologists9 and compared their interpretations to a decision tree model built with data from 1430 subjects.10 The gold standard label used for comparison was based on consensus diagnosis made by three clinicians following review of the clinical history, complete PFTs (prebronchodilator and postbronchodilator spirometry, lung volumes, airways resistance and diffusing capacity) and all tests deemed necessary by the responsible physician. Pulmonologists and the machine learning model were given the same data: full PFTs and clinical information (smoking history, cough, sputum and dyspnoea). Physicians correctly classified the respiratory patterns with 74% accuracy (ranging 56%–88%) in contrast to 100% by the machine learning model.9
Multilayer perceptron (MLP) is a type of artificial neural network (ANN) that models non-linear input and output relationships by learning the statistics of large general datasets.11 MLP consists of neurons arranged in an input layer, one or more hidden layers and an output layer.12 MLP containing more than one hidden layer is called deep MLP (DMLP). The input layer includes multiple attributes and input variables, which the model uses to classify the data into different categories. The hidden layers between the input and output layers include intermediate neurons. Each intermediate neuron performs a weighted summation of its inputs and passes the sum to an activation function to produce a value that represents the neuron’s firing intensity.12 Each layer of neurons activates the sequential layer, eventually generating the output variables in the output layer.12 The output variables are digital series representing the defined categories that the model aims to classify. DMLP has no restriction on the type or number of input variables; it considers every possible interaction between input variables, enhancing complexity and classification capability.13
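The layer-by-layer computation described above can be illustrated with a minimal sketch. It is not the study's model: the layer sizes, the random weights, the ReLU hidden activation and the softmax output are assumptions chosen for illustration, and the helper names (dense_layer, forward, random_layer) are hypothetical.

```python
import math
import random


def relu(z):
    """Rectified linear unit, one common non-linear activation (assumed here)."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation):
    """Each neuron sums its weighted inputs, adds a bias and applies the activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw output scores into class probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, layers):
    """Pass inputs through the hidden layers (ReLU), then a softmax output layer."""
    *hidden, (out_w, out_b) = layers
    activations = inputs
    for weights, biases in hidden:
        activations = dense_layer(activations, weights, biases, relu)
    scores = dense_layer(activations, out_w, out_b, lambda z: z)
    return softmax(scores)


def random_layer(n_in, n_out, rng):
    """Hypothetical helper: small random weights and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Six scaled inputs (for example sex, age, height, FVC, FEV1, FEF25-75),
# two small hidden layers and four output classes; all sizes are illustrative.
layers = [random_layer(6, 8, rng), random_layer(8, 4, rng), random_layer(4, 4, rng)]
probabilities = forward([1.0, 0.4, 0.6, 0.5, 0.45, 0.3], layers)
predicted_class = probabilities.index(max(probabilities))
```

The predicted class is simply the output neuron with the highest probability; with untrained random weights, as here, the prediction carries no meaning.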
In developing a DMLP, the model is first given a training dataset of prelabelled samples to learn the classification rules. Learning is achieved by adjusting the network parameters (weights) to generate the best fit for the dataset without explicit instructions. A subsequent validation dataset is used to estimate how well the model learnt the classification rules and to tune the hyperparameter values to optimise classification accuracies. Finally, a test dataset containing unseen samples is applied to the refined model to assess its classification performance. During training, weights are updated layer-by-layer based on discrepancies between the true label and the predicted label of each sample. Since ANNs with more than one hidden layer and non-linear activation functions cannot be expressed using linear equations, trained models provide limited information on the decision-making processes.13 We can evaluate whether the ANN has been appropriately trained by assessing its performance or classification accuracy, but we cannot identify how the model learnt to make classifications.13 14 MLP has been shown to yield better classification outcomes compared with statistical methods in practice.11 12
We hypothesise that a DMLP model can accurately distinguish normal, obstructive, restrictive and mixed obstructive-restrictive physiology patterns based on spirometric and biometric measurements.
Data collection and labelling
We used 3827 full PFTs from 748 adult patients collected as routine care between June 2018 and October 2021. Spirometry and plethysmography were performed in the sitting position using BodyBox (Medisoft, Sorinnes, Belgium), following American Thoracic Society/European Respiratory Society (ATS/ERS) guidelines.4 15 Tests were labelled with their ‘true’ physiological pattern (normal, obstructive, restrictive or mixed obstructive-restrictive) based on biometrics (sex, age, height), spirometry and lung volume measurements, in accordance with ATS/ERS guidelines,4 16 facilitated by a computer-aided algorithm (online supplemental file 1) and confirmed by one of six pulmonologists.
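A simplified sketch of guideline-style labelling follows. It assumes the commonly cited ATS/ERS rule that obstruction is an FEV1/FVC ratio below its lower limit of normal (LLN) and restriction is a TLC below its LLN. The laboratory's own algorithm (online supplemental file 1) is not reproduced here and may include further branches; the function name and the example values are hypothetical.

```python
def label_pattern(fev1_fvc, fev1_fvc_lln, tlc, tlc_lln):
    """Assign one of four physiological patterns from full PFT values.

    Simplified assumption: obstruction if FEV1/FVC < LLN, restriction if TLC < LLN.
    """
    obstructed = fev1_fvc < fev1_fvc_lln
    restricted = tlc < tlc_lln
    if obstructed and restricted:
        return "mixed obstructive-restrictive"
    if obstructed:
        return "obstructive"
    if restricted:
        return "restrictive"
    return "normal"


# Hypothetical example: ratio 0.62 against an LLN of 0.70, TLC 5.8 L against an LLN of 4.9 L.
example_label = label_pattern(fev1_fvc=0.62, fev1_fvc_lln=0.70, tlc=5.8, tlc_lln=4.9)
```

The restrictive and mixed branches both depend on TLC, which is why labelling requires plethysmography even when the classifier is later given spirometry alone.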
Patient and public involvement
Patients and/or the public were not involved in the design, conduct, reporting or dissemination plans of this study.
Sex was coded as a binary variable: 0 (female) and 1 (male). Age, height, spirometric and plethysmographic absolute values were scaled from 0 to 1 using the MinMax scaling equation, $y = \frac{x - \min(X)}{\max(X) - \min(X)}$, where y is the scaled value, x is the original value and X is the collection of values for a specific input variable. True labels of each test were digitalised: output classes were converted to 3 for normal, 4 for obstructive, 5 for mixed obstructive-restrictive and 6 for restrictive pattern.
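The preprocessing described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the study's code: the function name, the handling of a constant-valued variable and the example heights are hypothetical.

```python
SEX_CODES = {"female": 0, "male": 1}  # binary coding as described in the text


def min_max_scale(values):
    """Scale a collection of values for one input variable to the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant variable carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


heights_cm = [150.0, 175.0, 200.0]  # hypothetical values
scaled_heights = min_max_scale(heights_cm)
```

Each input variable is scaled independently, so height and FVC, which have very different units and ranges, end up on the same 0 to 1 scale.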
DMLP model development
The DMLP model was developed using 2582 tests (collected June 2018 to March 2020) from 477 patients. A Random Search was performed to identify optimal DMLP hyperparameters and regularisation values to be used in the model.17 The model was built with two hidden layers of 180 and 30 neurons, with drop-out rates of 0.2 and 0.1, respectively. Weights were initialised to a random set of small values from a normal distribution with mean of 0.005 and SD of 0.001667. The adaptive moment estimation (Adam) optimiser (β1=0.9, β2=0.999, ε=10⁻⁸) was used to adapt the learning rate. Learning rates from 0.0001 to 0.1 were tested with logarithmic increments. The model was trained in batches of 32 samples for a maximum of 900 epochs. Early stopping was set such that model training would stop if the validation loss had not improved for 100 epochs (online supplemental file 2).
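Two of the training controls described above, the logarithmic learning-rate grid and early stopping, can be sketched without any deep learning library. The sketch assumes one candidate learning rate per decade, which the text does not specify, and the function names are hypothetical; the patience of 100 epochs and the cap of 900 epochs are taken from the text.

```python
def log_spaced_learning_rates(low_exponent=-4, high_exponent=-1):
    """Candidate learning rates from 10^-4 to 10^-1, one per decade (assumed spacing)."""
    return [10.0 ** e for e in range(low_exponent, high_exponent + 1)]


def early_stopping_epoch(val_losses, patience=100, max_epochs=900):
    """Return the epoch at which training stops.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


learning_rates = log_spaced_learning_rates()
# Hypothetical loss curve: improves for three epochs, then plateaus.
stop_epoch = early_stopping_epoch([1.0, 0.9, 0.8] + [0.85] * 5, patience=3)
```

In the toy curve the best loss occurs at epoch 3 and a patience of 3 stops training at epoch 6; with the study's settings the same logic would wait 100 epochs before stopping.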
The DMLP model was given four input variable combinations: (1) biometrics with spirometry and plethysmography; (2) biometrics with spirometry; (3) spirometry and plethysmography and (4) spirometry alone. The included values for spirometry were FVC, FEV1 and forced mid-expiratory flow (FEF25-75); plethysmography were total lung capacity (TLC) and residual volume (RV), and biometrics were sex, age and height. Ten experimental runs were completed for each input combination. For each run, the model randomly selected one test per patient so that each run used 477 tests from 477 unique patients to reduce redundancy in the dataset. The inputted data were randomly partitioned into training (80%), validation (10%) and test (10%) sets of unique patients.18 19
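The sampling and partitioning for one experimental run can be sketched as follows. The data structure, the function names, the random seed and the rounding of the 80/10/10 fractions are assumptions for illustration; the text states only that one test per patient was selected at random and that the split was made over unique patients.

```python
import random


def one_test_per_patient(tests, rng):
    """tests: list of (patient_id, record) pairs. Keep one random record per patient."""
    by_patient = {}
    for patient_id, record in tests:
        by_patient.setdefault(patient_id, []).append(record)
    return {pid: rng.choice(records) for pid, records in by_patient.items()}


def split_patients(patient_ids, rng, train_fraction=0.8, val_fraction=0.1):
    """Shuffle patients and split them into training, validation and test groups."""
    ids = sorted(patient_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    n_val = round(len(ids) * val_fraction)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


rng = random.Random(42)
# Hypothetical dataset: 477 patients with between one and five tests each.
tests = [(pid, {"test_index": k}) for pid in range(477) for k in range(1 + pid % 5)]
selected = one_test_per_patient(tests, rng)
train_ids, val_ids, test_ids = split_patients(selected.keys(), rng)
```

Splitting by patient rather than by test keeps all data from one person in a single set, so the test set is never scored on a patient the model has already seen. With the rounding used in this sketch, 477 patients divide into 382, 48 and 47.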
The mean accuracy, precision and recall values of each experimental run were calculated as follows: accuracy by dividing the number of correct predictions by the total number of samples in the test set; precision by dividing the number of true positives by the sum of true positives and false positives for each lung pattern predicted by the model on the test set; recall by dividing the number of true positives by the sum of true positives and false negatives for each lung pattern predicted by the model on the test set. The F1 score, a machine learning metric of model performance, for each lung pattern was calculated using the formula $F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$.20
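These definitions translate directly into code. The sketch below uses the standard definition of the F1 score as the harmonic mean of precision and recall; the function names, the convention of returning 0 when a denominator is zero and the toy labels are assumptions for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_metrics(true_labels, predicted_labels, classes):
    """Precision, recall and F1 for each class, treated one-versus-rest."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Assumed convention: report 0 when a denominator would be zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


CLASSES = ["normal", "obstructive", "mixed", "restrictive"]
# Toy example with six tests; not study data.
true_labels = ["normal", "normal", "obstructive", "obstructive", "restrictive", "mixed"]
predicted = ["normal", "obstructive", "obstructive", "obstructive", "restrictive", "restrictive"]
overall_accuracy = accuracy(true_labels, predicted)
metrics = per_class_metrics(true_labels, predicted, CLASSES)
```

In the toy example the single mixed test is predicted as restrictive, so the mixed class has a recall and F1 of 0 even though overall accuracy is four out of six, which shows why per-class metrics are reported alongside accuracy.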
Model refinement and application to unseen data
The DMLP model was refined using 1245 previously unseen PFTs (collected July 2020–October 2021) from 271 new patients. Here, the data from the first 477 patients used in model development were placed into the training set and those from the 271 new patients were split evenly and randomly into validation (136 patients) and test (135 patients) sets. Evaluation of the refined model with the four input combinations was repeated as described above. DMLP design and data partitioning for model development, refinement and application are outlined in online supplemental file 3.
Comparing DMLP to pulmonologists and ATS/ERS decision tree
We evaluated the performances of the DMLP model, ATS/ERS decision tree (online supplemental file 4) and six pulmonologists who were given standard reports (online supplemental file 5) in classifying 82 spirometry tests. Accuracies for the DMLP model, ATS/ERS decision tree and pulmonologists were calculated by comparing their classifications to the ‘true’ interpretations, which were determined from full PFTs (described above). We used two-sample t-tests to compare the model to the decision tree and pulmonologists.
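A two-sample comparison of accuracies can be sketched from summary statistics alone. The text does not state which variant of the t-test was used, so the sketch assumes Welch's unequal-variance form; it also assumes that the reported ± values are standard deviations, that the model's summary comes from 10 runs and that the pulmonologists' summary comes from 6 readers. It returns the test statistic and degrees of freedom only, since the p value requires a t-distribution that the Python standard library does not provide.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A
    var_b = sd_b ** 2 / n_b  # squared standard error of group B
    t_statistic = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    degrees_of_freedom = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_statistic, degrees_of_freedom


# Reported accuracies (percent); the group sizes of 10 and 6 are assumptions.
t_statistic, degrees_of_freedom = welch_t(94.67, 2.63, 10, 66.67, 14.63, 6)
```

A comparison against the decision tree needs separate handling, because a deterministic classifier has no run-to-run variance.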
The PFTs were concordant with their ‘true’ lung pattern labels (table 1). Normal PFTs had values greater than 80% predicted. Obstructive patterns had FEV1/FVC ratios below 80%, and FVC and TLC were greater than 80% predicted. Restrictive patterns had FEV1/FVC ratios greater than 80%, with both FVC and TLC below 70% predicted. Mixed obstructive-restrictive patterns exhibited FEV1/FVC ratios below 82%, with both FVC and TLC below 70% predicted.
Using data from the first 477 patients to build the DMLP model, we found comparable accuracies when the inputs included biometrics with spirometry and plethysmography versus biometrics with spirometry only (table 2). Next, we validated the DMLP model with previously unseen data. Here, data from the initial 477 patients were placed into the training set; data from the 271 new patients were equally divided into the validation (136 patients, to further refine the hyperparameters) and test (135 patients) sets. Again, we observed high test set classification accuracies when inputs included biometrics with full PFTs versus biometrics with spirometry only (table 2). Biometrics are important, as their absence reduced test accuracies both for spirometry with plethysmography and for spirometry only (table 2).
The test set precision improved from 74%–100% for the input combination of biometrics, spirometry and plethysmography to 88%–100% for the combination of biometrics and spirometry (tables 3 and 4). Test set recall values were similarly high for the input combinations of biometrics, spirometry and plethysmography (88%–100%) and biometrics with spirometry (86%–100%) (tables 3 and 4). Both the precision and recall values decreased drastically when biometrics were omitted (tables 3 and 4). Larger datasets improve the accuracy of machine learning, as illustrated by higher F1, precision and recall values in the larger (table 4) versus the smaller datasets (table 3).
Lastly, we compared the DMLP classification with ATS/ERS decision tree3 and interpretations by six board-certified pulmonologists using 82 spirometry tests (table 5). The DMLP significantly outperformed the decision tree and pulmonologists (p<0.0001 and 0.0051, respectively) with no significant difference between pulmonologists and the decision tree (p=0.5958). The confusion matrix (online supplemental file 6) indicated that the mixed obstructive-restrictive pattern was the most difficult to classify, correlating with the lower F1 score for this category when the DMLP was inputted with biometrics and spirometry (table 4).
We specifically focused on improving the diagnostic accuracies of spirometry as it is the most common PFT modality used for initial assessment of patients with suspected lung disease. Many patients, particularly those in underserviced, rural and remote areas, have limited access to the gold standard full PFT. While diagnosis and management of patients with lung diseases, particularly restrictive lung disease, require clinical evaluation and full PFT that includes spirometry, plethysmography and diffusion capacity, maximising the utility of readily available diagnostic modalities to improve diagnostic acumen will alleviate some of the inequities of healthcare access. Thus, development of machine learning that improves the diagnostic yield of spirometry will improve equity and healthcare delivery for everyone regardless of access to plethysmography.
Our DMLP model was developed with readily available, clinically relevant spirometry variables (FVC, FEV1 and FEF25-75). We compared its classification accuracies to true labels determined by full PFTs following international interpretation standards.4 The model using biometrics and spirometry classified the major physiological patterns with 95% accuracy and was comparable to the model with full PFTs and biometrics. In other words, interpretation of spirometry inclusive of biometrics using DMLP classifies respiratory patterns accurately without the need for plethysmography. The DMLP model also out-performed the ATS/ERS decision tree and trained pulmonologists. For both the DMLP model and physicians, the mixed obstructive-restrictive defect was the most difficult to classify as indicated by the low F1 score (91.53%±11.83%) and confusion matrix (online supplemental file 6). This pattern also had the highest interphysician discrepancies (online supplemental file 7) and suggests these patients should be triaged early for further investigations (full PFT, imaging) to better characterise the disorder.
A key strength of our study is the adherence to international guidelines for conduct and interpretation of PFTs. The ‘true’ labels of PFTs used to evaluate the performance of the DMLP model were determined using spirometry, plethysmography and calculated lower and upper limits of normal.4 21 To our knowledge, no study has investigated DMLP with this approach to interpret spirometry. With a few exceptions,9 22 23 previous studies did not use clear criteria for PFT collection and interpretations nor articulate the criteria used to diagnose the underlying lung disease.24 25
Ioachimescu and Stoller developed ANNs with two hidden layers and 15,308 PFTs to classify the four respiratory patterns.22 PFTs were labelled following ATS/ERS guidelines, with 43% being obstructive, 16.5% restrictive and 4.5% mixed physiological patterns. ANNs using four input parameters (area under the expiratory flow-volume curve and z-scores for FEV1, FVC, FEV1/FVC) yielded the highest accuracies (91% and 92% in the validation and test sets, respectively). The area under the expiratory flow-volume curve contributed significantly to the accuracies, when compared with ANN models that only included the FEV1, FVC, FEV1/FVC z-scores.22 This was particularly true for classifying mixed defects.22 This non-traditional metric is retrievable but not readily available. Unlike our study, biometrics were not included in the ANN but were implied in the z-scores.
Others have compared physicians’ interpretation of PFTs and clinical diagnoses to a decision tree model built in MATLAB 8.3 (Statistics and Machine Learning Toolbox) using data from 1430 patients, with 10-fold internal cross-validation.9 Inputs included full PFTs (absolute, percent predicted and z-scores for prebronchodilator and postbronchodilator spirometry, plethysmography for lung volumes and airway resistance, diffusing capacity), age, sex, body mass index, smoking pack-years, presence of cough, sputum and dyspnoea.9 10 Given 50 cases, pulmonologists interpreted lung function patterns with 74.4%±5.9% accuracy, with lower rates for restrictive patterns. Conversely, the machine learning model had 100% classification accuracy. When asked to categorise the cases into specific diagnostic categories (eg, asthma, chronic obstructive pulmonary disease (COPD), neuromuscular, interstitial lung disease), machine learning achieved accuracies of only 82%, but still higher than the clinicians at 44.6%.9
A recent study compared a fully convolutional neural network (CNN), random forest model and traditional spirometry for classifying the COPD phenotypes of predominant airway versus predominant emphysema. Data came from the COPDGene study: 3926 participants had no airflow obstruction, 3901 had Global Initiative for Chronic Obstructive Lung Disease (GOLD) stages 1–4 and 1066 had preserved ratio impaired spirometry.23 The COPD phenotypes were labelled according to computer-aided quantitative analysis of CT chest imaging. The CNN and random forest model were trained using all the datapoints in the expiratory flow-volume curve. Participants were split 80% for training and 20% for validation. The CNN significantly outperformed the random forest classifier and traditional spirometry (FEV1/FVC and %predicted FEV1).23 A strength of this study, like the study by Ioachimescu,22 is that the models learnt from all the datapoints in the expiratory flow-volume curve.
Although our dataset included PFTs from the full spectrum of respiratory defects and a wide range of abnormal findings, the data came from a single centre, which is a limitation. As a tertiary referral centre and the major lung transplant centre in Canada, our data were collected mostly from lung transplant recipients who had higher tests-to-patient ratios and a higher prevalence of restrictive defects compared with other patient cohorts. The imbalance between the number of tests among the four lung physiology patterns may have skewed the variability of our dataset.
While our samples were labelled using the current clinical gold standard (spirometry and plethysmography), this method is imperfect. The FEV1 is limited in detecting small airway function and early airflow obstruction.26 The contribution from the small airways to total airway resistance is low unless advanced or severe small airway obstruction is present.27 Conversely, flow-volume loops demonstrate a complete representation of flow in regions where the small airways are not as distended as they are in the first second of forced expiration.28 These are important limitations in the current conventional labelling system. Inclusion of data from the entire flow-volume loop may improve the detection of small airway or early-stage abnormalities and will be included in future deep learning models.
Lastly, our laboratory uses the Canadian reference equations to calculate percent predicted values.21 The application of different reference equations can alter the ‘true’ label of the PFT from normal to abnormal; this is another limitation. The use of reference equations that are most appropriate for the specific patient population should be considered, and the MLP model retrained.
We developed a DMLP model to classify lung function patterns using biometrics and spirometry with comparable accuracies to full PFTs inclusive of plethysmography and biometrics. Hand-held spirometers are affordable and widely used as stand-alone diagnostic tools in primary care and outpatient settings. Implementation of the DMLP model into the software of spirometers can facilitate screening of patients with suspected lung disease. Implementation of the model will improve access to high calibre healthcare for patients who cannot perform or access diagnostic laboratories for full PFT with plethysmography. It will particularly benefit patients who live in regions of the world where only spirometry is available, and thus improve healthcare equity. It is also anticipated to improve patient outcomes by focusing subsequent investigations, such as full PFTs for patients identified by the DMLP model to have restrictive or mixed obstructive-restrictive defects, to facilitate earlier diagnoses, leading to reduced healthcare expenditure.
Patient consent for publication
The study was approved by the University Health Network (protocols 17-5652 and 17-5373) and University of Toronto Research Ethics Board (protocol 376870). Written informed consent was obtained from all participants before they took part in the study.
AM and TX are joint first authors.
Contributors AM and TX conducted the research, performed the data analysis and drafted the manuscript. JKYW maintained the research ethics protocol, developed the standard operating procedures and ensured quality control of pulmonary function data. NB, HK, NV and DR contributed to data collection and edited the manuscript. CMR refined the research protocol, ensured quality control of pulmonary function data and edited the manuscript. SV developed the research plan, performed, oversaw the data analysis and drafted the manuscript. C-WC developed the concept, study protocol and oversaw all aspects of the project. C-WC is the guarantor and accepts full responsibility for the conduct of the study; she had access to the data, and controlled the decision to publish.
Funding The study is supported by a grant-in-aid from the Lung Health Foundation, the Pettit Block Term Grants, the CIHR/NSERC Collaborative Health Research Program (grant # 415013) and the Ajmera Foundation Multi-Organ Transplant Innovation Fund. AM was supported by an Amgen Scholarship. HK was supported by a scholarship funded by the Nakayama Foundation for Human Science. DR receives research support from the Sandra Faire and Ivan Fecan Professorship in Rehabilitation Medicine. We thank the Registered Cardio-Pulmonary Technologists at Toronto General Hospital for helping to conduct the study, members of C-WC’s laboratory for collecting and maintaining the research data, and the 2019 University of Toronto PFT Committee for their work on developing the University of Toronto Guidelines for PFT Interpretation, Eighth Edition.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.