Abstract
Purpose Acute exacerbation of idiopathic pulmonary fibrosis (AE-IPF) is the primary cause of death in patients with IPF, characterised by diffuse, bilateral ground-glass opacification on high-resolution CT (HRCT). This study proposes a three-dimensional (3D)-based deep learning algorithm for classifying AE-IPF using HRCT images.
Materials and methods A novel 3D-based deep learning algorithm, SlowFast, was developed by applying a database of 306 HRCT scans obtained from two centres. The scans were divided into four separate subsets (training set, n=105; internal validation set, n=26; temporal test set 1, n=79; and geographical test set 2, n=96). The final training data set consisted of 1050 samples with 33 600 images for algorithm training. Algorithm performance was evaluated using accuracy, sensitivity, specificity, positive predictive value, negative predictive value, receiver operating characteristic (ROC) curve and weighted κ coefficient.
Results The accuracy of the algorithm in classifying AE-IPF on test sets 1 and 2 was 93.9% and 86.5%, respectively. Interobserver agreement between the algorithm and the majority opinion of the radiologists was excellent for test set 1 (κw=0.90) and good for test set 2 (κw=0.73). The area under the ROC curve of the algorithm for classifying AE-IPF on test sets 1 and 2 was 0.96 and 0.92, respectively. The algorithm's performance was superior to visual analysis in accurately diagnosing radiological findings. Furthermore, the algorithm's categorisation was a significant predictor of IPF progression.
Conclusions The deep learning algorithm provides high auxiliary diagnostic efficiency in patients with AE-IPF and may serve as a useful clinical aid for diagnosis.
- Interstitial Fibrosis
- Imaging/CT MRI etc
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Acute exacerbation of idiopathic pulmonary fibrosis (AE-IPF) is a significant cause of death in patients with IPF, characterised by clinically significant respiratory deterioration and diffuse bilateral ground-glass opacification on high-resolution CT (HRCT) scans. However, radiological evaluation of AE-IPF remains challenging and is subject to substantial interobserver variability. Several deep learning models have been applied to diagnostic support for fibrotic interstitial lung disease; however, most research has focused on models based on two-dimensional data, with limited work exploring deep learning for AE-IPF diagnosis.
WHAT THIS STUDY ADDS
This study proposes an innovative three-dimensional video-sequence methodology, SlowFast, to classify acute exacerbation in patients with IPF on HRCT scans. The algorithm's accuracy in predicting the radiological diagnosis was superior to that of thoracic radiologists (area under the curve=0.96), with excellent interobserver agreement (κw=0.90).
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
The study provides a valuable contribution to the field by demonstrating the potential of deep learning algorithms to provide low-cost, consistent patient stratification and assistance with radiological decision-making in AE-IPF.
Introduction
Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive pulmonary disease of unknown aetiology.1 A significant minority of patients with IPF develop episodes of acute clinical respiratory worsening, termed acute exacerbations of IPF (AE-IPF).2 AE-IPF is difficult to predict or prevent and precedes approximately half of IPF-related deaths, with a mean survival of 3–4 months.3 4 The most common radiological feature in patients with AE-IPF is the presence of new ground-glass opacities (GGO) superimposed on subpleural reticular and honeycomb-like densities.2 Several studies have shown that high-resolution CT (HRCT) plays a central role in the adequate diagnosis and early intervention of AE-IPF.5–7 However, radiological evaluation of AE-IPF remains challenging and is susceptible to significant variability between observers, even among experienced radiologists.7 8 Therefore, developing better methods for HRCT-based detection and disease classification using deep learning algorithms has the potential to improve the radiological diagnosis of AE-IPF.
Deep learning, a subset of artificial intelligence (AI) technology that efficiently identifies patterns in high-dimensional data, has recently entered an accelerated phase in medical image interpretation.9–11 Several two-dimensional (2D)-data-based deep learning models (convolutional neural networks, recurrent neural networks, deep belief networks, etc) have been successfully applied to diagnostic support of fibrotic interstitial lung disease (ILD), early detection of clinically significant fibrotic lung disease and prediction of progressive fibrotic lung disease.9 12 13 However, despite the apparent benefits of these models for fibrotic lung disease, some critical limitations remain, including algorithm performance, data heterogeneity and constraints, the relative opacity of neural networks (the black box phenomenon) and the lack of a histopathological reference standard.14 More importantly, 2D-based models randomly select four segmented axial HRCT image slices from a range of 250–450 axial image slices per patient for algorithm training, which inevitably leads to the loss of some HRCT image information.9 15 To address this research gap, this study proposes a three-dimensional (3D) video-sequence-based methodology, SlowFast, which incorporates a novel algorithm for analysing HRCT scans. In general, the SlowFast architecture uses a low frame-rate, high-capacity pathway (the Slow pathway) to analyse the static content of a video while, in parallel, a high frame-rate, lightweight pathway (the Fast pathway) analyses the dynamic content.16 17 HRCT scans consist of large numbers of ordered high-resolution images,18 making them highly suitable for deep learning models based on 3D data. However, most research so far has focused on deep learning models based on 2D data, with limited research exploring deep learning for AE-IPF diagnosis. This study, for the first time, applies the 3D video-sequence-based algorithm SlowFast to classify acute exacerbation in patients with IPF on HRCT scans.
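To make the two-pathway idea concrete, the sketch below shows one way a stack of HRCT slices could be split into Slow- and Fast-pathway inputs by temporal subsampling. This is an illustrative sketch only; the tensor shapes and the subsampling ratio (alpha) are assumptions, not details reported in this study.

```python
import torch

def make_slowfast_inputs(clip: torch.Tensor, alpha: int = 4):
    """Split one 'clip' of HRCT slices into Slow- and Fast-pathway inputs.

    clip: tensor of shape (channels, slices, height, width); the slice axis
    plays the role of the temporal axis in a video.
    alpha: how much more sparsely the Slow pathway samples the slice axis.
    """
    fast = clip                      # Fast pathway: every slice, lightweight processing
    slow = clip[:, ::alpha, :, :]    # Slow pathway: every alpha-th slice, high-capacity processing
    return [slow, fast]

# Example: a 32-slice, single-channel clip resized to 320x320
clip = torch.randn(1, 32, 320, 320)
slow, fast = make_slowfast_inputs(clip)
print(slow.shape, fast.shape)        # (1, 8, 320, 320) and (1, 32, 320, 320)
```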
Materials and methods
Patient and public involvement statement
The patients or the public were not involved in the design, or conduct, or reporting, or dissemination plans of our research.
Data split
For model training and internal validation, an internal data set A comprising 131 patients with HRCT scans taken between December 2015 and December 2018 was obtained from Nanjing Drum Tower Hospital, consisting of 62 cases of stable IPF, 40 cases of AE-IPF and 29 healthy controls. For external validation, a data set B comprising 175 patients with HRCT scans taken between January 2019 and December 2022 was obtained from Nanjing Drum Tower Hospital and Nanjing Traditional Chinese Medicine Hospital, consisting of 84 cases of stable IPF, 57 cases of AE-IPF and 34 healthy controls.
The inclusion criteria were: (1) availability of HRCT with a slice thickness of less than 1.5 mm, with each HRCT showing evidence supporting the diagnosis.3 19 20 For AE-IPF, the diagnostic evidence was the presence of new bilateral GGO and/or consolidation superimposed on a background pattern consistent with usual interstitial pneumonia. (2) Other clinical data met the diagnostic criteria.3 19 20 For stable IPF, the diagnostic criteria included stable clinical symptoms, HRCT imaging and pulmonary function tests for at least 1 month prior to inclusion. The exclusion criterion was the use of contrast enhancement.
The ground truth labels were provided by four thoracic radiologists/respirologists (with 5–20 years of experience in diagnosing ILD). The total data set size was 3060 samples with 97 920 images (204–351 axial image slices per patient). The internal data set A was split into a training set (n=105) and a validation set (n=26). The external data set B was split into two sets, test set 1 (n=79) and test set 2 (n=96), which were used for temporal and geographical validation of the model, respectively (figure 1).
Image preprocessing and resampling
Data set for semantic segmentation: the semantic segmentation data set consists of 512 HRCT slices selected from data set A with uniform sampling. The original images, initially sized at 512×512 pixels, were cropped to 320×320 and labelled by the four radiologists/respirologists using the graphical image annotation tool LabelMe.21 The data set was split into training, validation and test sets in proportions of 0.70:0.15:0.15, containing 358, 77 and 77 images, respectively. The segmentation model DeepLabV3+ was trained on this data set to remove redundant information from the original HRCT scans.
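As a minimal sketch of this preprocessing step, the snippet below centre-crops a 512×512 slice to 320×320 and applies a DeepLabV3+ model to zero out non-lung pixels. It assumes the third-party segmentation_models_pytorch package for the DeepLabV3+ implementation; the encoder choice, the single-channel input and the lung-class index are our assumptions rather than details from the paper.

```python
import numpy as np
import torch
import segmentation_models_pytorch as smp  # assumed DeepLabV3+ implementation

def center_crop(img: np.ndarray, size: int = 320) -> np.ndarray:
    """Crop a 512x512 HRCT slice to its central size x size region."""
    h, w = img.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return img[..., top:top + size, left:left + size]

# Two output classes assumed: background (0) and lung (1); in practice the
# weights would come from training on the 512 LabelMe-annotated slices.
seg_model = smp.DeepLabV3Plus(encoder_name="resnet34", encoder_weights=None,
                              in_channels=1, classes=2).eval()

def segment_lungs(slice_512: np.ndarray) -> np.ndarray:
    """Return the cropped slice with non-lung pixels set to zero."""
    x = torch.from_numpy(center_crop(slice_512)).float()[None, None]  # (1, 1, 320, 320)
    with torch.no_grad():
        mask = seg_model(x).argmax(dim=1, keepdim=True)               # predicted class per pixel
    return (x * mask).squeeze().numpy()                               # keep lung pixels only
```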
Data set for video classification: after segmentation by the segmentation model, the HRCT scans of the 306 patients were used to generate the video classification data set. For each patient, 128 consecutive slices from the middle section of the series were selected and divided into 32 equal parts; one slice was then randomly chosen from each part to form a learnable sample of 32 slices, an appropriate input length for training the video classification model SlowFast. Samples were randomly drawn 10 times from each patient series, yielding a total of 3060 samples. Finally, the 3060 samples were split into the training set, validation set, test set 1 and test set 2 with 1050, 260, 790 and 960 samples, respectively (figure 2).
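A minimal sketch of this sampling scheme is shown below; the representation of a series as an ordered list of slices is an assumption for illustration.

```python
import random

def sample_clip(series, n_slices=128, n_parts=32):
    """Draw one learnable sample from an ordered HRCT series.

    Takes the middle n_slices consecutive slices, divides them into
    n_parts equal bins and randomly picks one slice per bin, yielding
    a 32-slice clip for the video classifier.
    """
    start = (len(series) - n_slices) // 2                 # middle section of the series
    middle = series[start:start + n_slices]
    part_len = n_slices // n_parts                        # 128 / 32 = 4 slices per bin
    return [random.choice(middle[i * part_len:(i + 1) * part_len])
            for i in range(n_parts)]

# Each patient contributes 10 samples drawn this way, e.g.:
# samples = [sample_clip(series) for _ in range(10)]
```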
Data augmentation: to increase the size and diversity of the data set, data augmentation techniques were employed during the preprocessing stage, including Flip, Rotate and Dropout. Horizontal flipping was applied to each image with a 50% probability to generate additional images by creating mirror images of the originals. Additionally, each image was randomly rotated by a degree between −20° and +20° to simulate different viewing angles and orientations. Finally, the Dropout technique was used to randomly select and set 5% of pixels in each image to zero.
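The following sketch illustrates the three augmentations as described; SciPy's rotate is an assumed choice for the rotation step, not necessarily the implementation used by the authors.

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng()

def augment(img: np.ndarray) -> np.ndarray:
    """Apply horizontal flip (p=0.5), random rotation (±20°) and 5% pixel dropout."""
    out = img.copy()
    if rng.random() < 0.5:                       # horizontal flip with 50% probability
        out = np.fliplr(out)
    angle = rng.uniform(-20.0, 20.0)             # random rotation between -20° and +20°
    out = rotate(out, angle, reshape=False, order=1)
    out[rng.random(out.shape) < 0.05] = 0        # set 5% of pixels to zero (dropout)
    return out
```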
Algorithm development
Two neural networks were used in this work. The semantic segmentation network DeepLabV3+ was used to separate the lung area from the original HRCT scan; the resulting segmented images were then used for the subsequent classification task. The video classification network SlowFast was used to make a diagnosis from the image sequence extracted from the segmented results.17 The final output was a prediction of the diagnostic category: stable IPF, AE-IPF or healthy control (figure 3). The algorithms were developed using the PyTorch framework (V.1.9.0 with CUDA V.10.2) on 4 NVIDIA V100 GPUs. Specifically, the model was trained for 60 epochs with a batch size of 16, using the Adam optimiser with a learning rate of 1e-3 and a weight decay of 1e-4, and the classical cross-entropy loss as the loss function.
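A minimal training-loop sketch using the reported settings (Adam, learning rate 1e-3, weight decay 1e-4, cross-entropy loss, 60 epochs, batch size 16) is given below. The construction of the SlowFast network and of the dataset objects is not specified beyond the details above, so the assumption that each batch provides a [slow, fast] clip pair is ours.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, device: str = "cuda"):
    """Train a 3-class classifier (stable IPF, AE-IPF, healthy control)."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(60):
        model.train()
        for inputs, labels in loader:
            # Assumed input format: a [slow_clip, fast_clip] pair per sample.
            inputs = [x.to(device) for x in inputs]
            labels = labels.to(device)
            optimiser.zero_grad()
            logits = model(inputs)              # shape (batch, 3)
            loss = criterion(logits, labels)
            loss.backward()
            optimiser.step()
        # Per-epoch evaluation on the internal validation set omitted for brevity.
    return model
```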
Radiologist classification
Each HRCT scan in the test sets was visually scored by 3 fellowship-trained radiologists (with 3–24 years of post-fellowship experience) on a 3-point ordinal scale corresponding to the 2018 American Thoracic Society guidelines for IPF and the 2016 International Working Group report on AE-IPF3 19 20 (0=AE-IPF, 1=stable IPF, 2=healthy control). The scores were compared with the diagnostic output of the algorithm. To align with the AI diagnostic setting, radiologists had complete access to all series and images for each HRCT scan in the test sets, while being blinded to other medical imaging and patient history.
Statistical analysis
Statistical analysis was performed in Python (V.3.7). The performance of the algorithm was evaluated by comparing the areas under the receiver operating characteristic curves (AUCs) using the paired DeLong test. Accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were also used to assess performance. Sensitivity, also known as the true positive rate (TPR), was calculated as the percentage of positive patients that were correctly identified. Specificity, also known as the true negative rate (TNR), was calculated as the percentage of negative patients that were correctly identified. Accuracy was the percentage of all patients that were correctly classified (true positives plus true negatives). To evaluate interobserver agreement between the algorithm and the radiologists, Cohen's weighted kappa coefficient (κw) was calculated for each diagnostic category.22 Weighted κ coefficients were categorised as follows: poor (0<κw≤0.20), fair (0.20<κw≤0.40), moderate (0.40<κw≤0.60), good (0.60<κw≤0.80) and excellent (0.80<κw≤1.00).23 The correlation of the algorithm's categorisation with physiological variables (PaO2/FiO2) was evaluated using logistic regression. For all comparisons, a two-sided p value threshold of 0.05 was considered statistically significant. The Python package scikit-learn V.1.0.2 was used for statistical calculation.
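The per-class metrics and the weighted kappa can be computed with scikit-learn as sketched below (a one-vs-rest evaluation, e.g. AE-IPF vs the other two categories). The 'linear' kappa weighting is an assumption, as the paper states only that a weighted kappa was used, and the DeLong test for comparing AUCs is not part of scikit-learn and would need a separate implementation.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    """Accuracy, sensitivity, specificity, PPV, NPV and AUC for one class vs the rest."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "auc": roc_auc_score(y_true, y_score),
    }

# Interobserver agreement between two raters on the 3-category labels (0/1/2);
# the labels here are hypothetical toy values.
rater_a = [0, 1, 2, 2, 1, 0]
rater_b = [0, 1, 2, 1, 1, 0]
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
```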
Results
Patient characteristics
A total of 306 participants were included after applying inclusion and exclusion criteria. 146 patients were diagnosed with stable IPF, 97 with AE-IPF and 63 were healthy controls. The demographic and clinical information of patients with IPF can be found in table 1. Briefly, the average age of patients undergoing HRCT was 69.2 years, with men accounting for 77.0% of the entire cohort. Among them, 54.2% were never-smokers, 34.6% had concomitant pulmonary infections and 60.1% experienced acute exacerbations. The mean PaO2/FiO2 ratio among patients with available blood gas analysis results was 282.4±129.3. In test set 1, 79.7% of patients were men and the mean age was 70.7 years, with 55.1% never-smokers, 30.5% having concomitant pulmonary infections and 23.7% experiencing acute exacerbations. The mean PaO2/FiO2 ratio for patients in test set 1 was 307.7±148.5. In test set 2, 70.7% of patients were men and the mean age was 69.9 years, with 63.5% never-smokers, 39.0% having concomitant pulmonary infections and 52.4% experiencing acute exacerbations. The mean PaO2/FiO2 ratio for patients in test set 2 was 255.8±141.5.
Classification performance
The deep learning algorithm SlowFast and the radiologists were evaluated using AUC, accuracy, sensitivity (TPR), specificity (TNR), PPV and NPV. The SlowFast model achieved AUCs of 0.96 and 0.92 for classifying AE-IPF in test sets 1 and 2, respectively, while the radiologists achieved mean AUCs of 0.91 and 0.81 (figure 4A,B). Furthermore, in test set 1, SlowFast achieved an accuracy of 93.9%, with a sensitivity of 90.0%, specificity of 95.7%, PPV of 90.0% and NPV of 95.7%. In test set 2, SlowFast achieved an accuracy of 86.5%, with a sensitivity of 80.9%, specificity of 91.8%, PPV of 90.5% and NPV of 83.3% (table 2). In comparison, in test set 1, radiologists achieved an accuracy of 77.8±9.1%, with a sensitivity, specificity, PPV and NPV of 59.8±12.2%, 87.8±15.2%, 81.1±15.5% and 78.0±7.0%, respectively. In test set 2, radiologists achieved an accuracy of 76.4±8.0%, with a sensitivity, specificity, PPV and NPV of 69.6±34.0%, 82.6±18.3%, 83.7±15.6% and 80.3±16.4%, respectively. For classifying stable IPF, the SlowFast model achieved AUCs of 0.97 and 0.91 in test sets 1 and 2, respectively (figure 4C,D), with accuracy, sensitivity, specificity, PPV and NPV of 93.9%, 93.9%, 93.9%, 93.9% and 93.9% in test set 1 and 86.5%, 81.6%, 89.7%, 83.8% and 88.1% in test set 2, respectively (table 2). The radiologists achieved AUCs of 0.87 and 0.79 (figure 4C,D), with accuracy, sensitivity, specificity, PPV and NPV of 77.8±9.1%, 72.7±24.5%, 84.2±12.9%, 85.4±8.6% and 76.1±14.9% in test set 1, and 76.4±8.0%, 74.6±23.7%, 77.6±25.8%, 75.4±18.2% and 85.4±11.0% in test set 2, respectively (table 2).
Figure 4E shows an example of an HRCT scan that was accurately identified as AE-IPF by the SlowFast model but misclassified as an alternative diagnosis by two radiologists. One case was accurately classified as AE-IPF by the radiologists but misclassified by the SlowFast model (figure 4F).
Interobserver agreement
We used Cohen's weighted kappa coefficient (κw) to assess interobserver agreement between the algorithm and the radiologists for each diagnostic category.22 Weighted κ coefficients were categorised as poor (0<κw≤0.20), fair (0.20<κw≤0.40), moderate (0.40<κw≤0.60), good (0.60<κw≤0.80) and excellent (0.80<κw≤1.00). In test set 1, interobserver agreement between the algorithm and the majority opinion of the radiologists was excellent (weighted κ, κw=0.90), and the median agreement between each of the thoracic radiologists and the majority opinion was good (weighted κ, κw=0.65±0.13; table 3). Similarly, in test set 2, the algorithm showed good interobserver agreement with the majority opinion of the radiologists (weighted κ, κw=0.73), and the median agreement between each of the thoracic radiologists and the majority opinion was good (weighted κ, κw=0.69±0.19; table 3).
Correlation between the model’s categorisation and prognosis
To investigate the prognostic value of the deep learning model in AE-IPF, logistic regression was used to assess the correlation between the model's categorisation and the PaO2/FiO2 ratio, a prognostic factor for AE-IPF.24 The results indicated that both the model's categorisation and the radiologists' majority opinion were significant predictors of disease severity in IPF (p<0.0001, OR=1.007, 95% CI=1.003 to 1.010 for the model's categorisation; p<0.0001, OR=1.007, 95% CI=1.004 to 1.011 for the radiologists' majority opinion) (figure 5A,B).
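For illustration, odds ratios and 95% CIs of this kind can be obtained from a logistic regression with statsmodels, as sketched below; the data and the binary coding of the categorisation here are hypothetical, not values from the study.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical inputs: PaO2/FiO2 per patient and a binary categorisation
# (e.g. 1 = not classified as AE-IPF, 0 = classified as AE-IPF).
pf_ratio = np.array([310., 250., 420., 180., 290., 360., 150., 400., 230., 380.])
category = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 0])

X = sm.add_constant(pf_ratio)          # intercept + PaO2/FiO2
fit = sm.Logit(category, X).fit(disp=0)

odds_ratios = np.exp(fit.params)       # OR per unit increase in PaO2/FiO2
conf_int = np.exp(fit.conf_int())      # 95% CI for the ORs
p_values = fit.pvalues
```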
Discussion
In this study, we investigated the potential of the deep learning algorithm SlowFast to classify AE-IPF using HRCT scans and compared its performance with the diagnostic performance of radiologists. Our study suggested that the model provided almost instantaneous reporting with accuracy and reproducibility comparable to human experts (AUC 0.96 vs 0.91 in test set 1; AUC 0.92 vs 0.81 in test set 2). As a fatal complication of IPF, the accurate classification of AE-IPF plays a crucial role in improving prognostication, directing patient treatment and facilitating research.25–27 AE-IPF shares similar pathophysiological characteristics with acute respiratory distress syndrome, which can be triggered by COVID-19 and is considered one of the major causes of increased mortality.28 29 However, owing to the complicated clinical course, making an accurate diagnosis in patients with AE-IPF remains a significant challenge for clinicians.30 Notably, HRCT-based deep learning models and diagnostic biomarkers for ILDs have garnered widespread attention in the precision diagnosis of IPF.31–35 Under these circumstances, the importance of using deep learning models to assist radiologists in the accurate diagnosis of AE-IPF becomes evident: such models could provide cheap and consistent patient stratification for clinical trials, thereby reducing screening failures and costs. Moreover, there is a pressing clinical need to identify contributing or alternative causes of decline in patients with GGO and/or consolidation on a background of IPF.36 Therefore, predicting future functional decline or the occurrence of AE-IPF remains a valuable and unmet objective, which could be addressed by applying advanced deep learning techniques to the analysis of HRCT scans.
Previous research has explored the potential of deep learning algorithms to classify fibrotic lung disease on chest HRCT scans. Walsh et al developed a deep learning algorithm for classifying usual interstitial pneumonia (UIP) on HRCT based on a neural network architecture, which achieved human-level accuracy (76.4% vs 70.7% on the test set).14 Alex et al employed a custom deep learning algorithm to predict histopathological diagnosis (UIP vs non-UIP) from chest CT patterns, which provided better diagnostic performance than visual evaluation (AUC 0.87 vs 0.80; p=0.03).37 In addition, Kim et al applied content-based image retrieval (CBIR) to improve diagnostic accuracy for patients with ILD (before vs after CBIR, 46.1% vs 60.9%).38 In the study by Tzouvelekis et al, a machine learning software system (Imbio V.1.4.2) was used to evaluate HRCT in patients with non-IPF ILDs receiving mycophenolate mofetil; the software demonstrated performance similar to that of specialist radiologists, indicating its potential as a valuable diagnostic and prognostic tool (ICC 0.73 vs 0.88).39 Beyond diagnostic support, some studies have also focused on the early detection or prediction of progressive fibrotic lung disease. In the study by Agarwala et al, a deep learning framework was developed to automatically identify ILD patterns in HRCT images, achieving an 86% success rate and 74% sensitivity in sections with lung fibrosis.40 Simon et al developed a deep learning algorithm, SOFIA, and demonstrated that it improved outcome prediction in patients with progressive fibrotic lung disease when compared with radiologist evaluation (HR 1.73; p<0.0001; 95% CI=1.40 to 2.14).15 Notably, researchers have endeavoured to address two major barriers in the management of ILD: the diagnosis of disease subtypes and the prediction of patient prognosis. Yang et al employed RadImageNet pretrained models to diagnose five types of ILD and a transformer model to determine a patient's 3-year survival rate, which proved to be a useful tool for distinguishing ILD subcategories and managing the long-term progression of patients.41
However, at present, there has been limited research on using deep learning for the diagnosis of AE-IPF, with most deep learning models being trained on 2D data. Our study highlights several advantages of applying this deep learning model to image analysis in fibrotic lung disease. First, our model achieved better diagnostic performance than visual evaluation. Second, we innovatively used the 3D video-sequence-based methodology SlowFast for image analysis of HRCT scans, which provide sequential cross-sectional images of the lungs; this approach allows for more objective analysis than traditional 2D-image-based models and is the first application of this 3D video-sequence-based methodology to HRCT scans. Third, we directly compared the performance of our model with that of radiologists, and our model demonstrated the potential to outperform the established chest CT classification scheme based on visual analysis.
It should be noted that our study suffers from a few limitations. First, due to the low incidence of AE-IPF in the general patient population,42 the number of cases for model training and testing was small. Although we have employed external validation to confirm the transportability and generalisability of our model, we acknowledge the need for a large-scale, multicentre study of AE-IPF, which could lead to the development of more robust and effective algorithms. Second, only cases of stable IPF, AE-IPF and healthy control were covered in this study. Therefore, algorithm performance for other ILD subtypes is unknown. Further versions of the algorithm will include an extension to cover these other patterns. Despite this, including healthy control subjects in the data set provides a valuable reference point for comparison with patients with IPF. This approach may help to identify HRCT scan features that are specific to the disease and facilitate the development of more robust models. Third, the algorithm was designed to alleviate the workload, improve accuracy and enhance consistency in challenging diagnoses made by radiologists. Nevertheless, the performance of the algorithm was only benchmarked against three radiologists, which may not accurately represent the entire spectrum of human capabilities.
In conclusion, we have developed a deep learning algorithm with similar performance to a human reader for classifying AE-IPF on HRCT scans. In principle, this algorithm has the potential to provide low-cost, consistent patient stratification and assist in radiological decision-making.
Data availability statement
Data are available upon reasonable request.
Ethics statements
Patient consent for publication
Ethics approval
This retrospective study was approved by IRB of Nanjing Drum Tower Hospital of Nanjing University Medical School (approval: 23-01-18) with a waiver for written informed consent.
Footnotes
XH and WS contributed equally.
Contributors XH: conceptualisation (lead); data curation (lead); investigation (equal); methodology (equal); project administration (equal); writing—original draft (lead); writing—review and editing (lead). WS: conceptualisation (equal); formal analysis (lead); investigation (equal); methodology (equal); software (lead); writing—review and editing (supporting). XY: investigation (supporting); resources (supporting); writing—review and editing (supporting). YZ: investigation (supporting); methodology (supporting); writing—review and editing (supporting). HG: investigation (supporting); validation (lead); writing—review and editing (supporting). MZ: investigation (supporting); validation (supporting); visualisation (supporting); writing—review and editing (supporting). SW: investigation (supporting); software (supporting); writing—review and editing (supporting). YS: investigation (supporting); validation (supporting); writing—review and editing (supporting). XG: supervision (equal); writing—review and editing (supporting). YX: funding acquisition (equal); supervision (equal); writing—review and editing (supporting). MC: funding acquisition (equal); supervision (equal); visualisation (lead); writing—review and editing (supporting).
Funding This work was supported by National Natural Science Foundation of China (82070064, 81670059 and 81200049), Natural Science Found of Jiangsu province (SBK20230140), Fundings for Clinical Trials from the Nanjing University Medical School Affiliated Drum Tower Hospital (2022-LCYJ-MS-11), Special fund project for clinical research of Nanjing Drum Tower Hospital (2021-LCYJ-DBZ-06).
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.