Discussion
Our objective in this study was to develop a CNN model for the classification of benign and malignant lesions in rEBUS images. The proposed model demonstrated favourable performance on an internal validation cohort and satisfactory performance on two independent test cohorts. Performance declined when the model was applied to the external test sets; however, the negative effects were mitigated by TTA and fine-tuning. To our knowledge, this is the first study of its kind to include independent cohorts for external validation and to assess the feasibility of a CNN in identifying lung cancer subtypes.
rEBUS is a valuable tool for localising lung nodules in patients undergoing TBB; however, relatively few studies have investigated its clinical applicability in practice. Chao et al used various image features, such as the margin outside the lesion, homogeneity among internal echoes, hyperechoic dots and concentric circles along the echo probe, to distinguish between neoplastic and non-neoplastic lesions. Some of these features were identified as diagnostic markers; however, the interpretation of images remains highly subjective.25 There is a pressing need for an objective method of interpreting rEBUS images. This study demonstrated the efficacy of the proposed CNN prediction model in differentiating between malignant and benign lesions during bronchoscopic biopsy.
Numerous researchers have investigated the use of deep learning for the interpretation of medical ultrasound images26–28; however, there has been very little work on applying this technology to rEBUS images. Chen et al applied a CNN with transfer learning to 164 rEBUS images from 164 patients. Their results (AUC=0.8705, accuracy=85.4% and specificity=82.1%) were similar to those obtained using the internal validation cohort in the current study.18 Note, however, that they selected only one rEBUS image per patient (ie, rather than including all recorded images), which raises concerns about selection bias. They also enrolled patients from only one hospital (ie, no external validation), which may limit the generalisability of their results. Hotta et al used EBUS data from 213 participants to train a CNN algorithm, which achieved an accuracy of 83.4%, a sensitivity of 95.3% and a specificity of 53.6% in differentiating benign from malignant lung lesions.29 Their results provide further support for our assertion that CNN models could be used to differentiate between benign and malignant lesions based on rEBUS images.
One of the major strengths of our study was the enrolment of patients from three different hospitals, involving different rEBUS probes and different operating physicians. We also used all recorded images to minimise selection bias. We assessed our CNN model using images generated by EBUS devices from two different manufacturers (Olympus and Fujifilm). We also included in our analysis two different external validation cohorts. Taken together, our results can be considered highly robust and generalisable to real-world clinical settings.
Machine learning encompasses a variety of techniques, including supervised learning, unsupervised learning and reinforcement learning.30 The CNN is a supervised learning technique well suited to image classification.31 CNNs have been applied with considerable success in various medical imaging applications, such as mammography for breast cancer and spine X-rays for scoliosis.32 33 They have also been used for outcome prediction in radiation dose planning, as well as in the interpretation of serial CT images to assess the response to treatments for lung cancer.34 35 In many studies, CNNs have achieved accuracy comparable to or even surpassing that of human experts.36 However, the repeated use of rEBUS probes can introduce speckle noise, which can interfere with the machine learning process. To address this issue, we developed a denoising technique to reduce the impact of noise.37 We also employed image augmentation to compensate for imbalances in the training cohort.38
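The exact denoising and augmentation procedures are those cited above (refs 37 38) and are not reproduced here; purely as an illustrative sketch, the code below shows one generic way to suppress speckle-like outliers (a median filter, a standard remedy for impulsive noise) and one simple augmentation (a horizontal flip) of the kind used to enlarge an under-represented class. The function names and the toy 3×3 patch are our own invention, not the study's implementation.

```python
def median_filter(img, k=3):
    """Suppress speckle-like outliers by replacing each pixel with the
    median of its k x k neighbourhood (edges use the available pixels)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            neigh = [img[ii][jj]
                     for ii in range(max(0, i - r), min(h, i + r + 1))
                     for jj in range(max(0, j - r), min(w, j + r + 1))]
            neigh.sort()
            out[i][j] = neigh[len(neigh) // 2]
    return out

def hflip(img):
    """Horizontal flip: one simple augmentation that can be used to
    rebalance an under-represented class in the training cohort."""
    return [row[::-1] for row in img]

# A 3x3 patch with a single speckle outlier (value 9) at the centre:
patch = [[1, 1, 1],
         [1, 9, 1],
         [1, 1, 1]]
denoised = median_filter(patch)  # the outlier becomes the median (1)
```

In practice the study's pipeline operates on full greyscale rEBUS frames; the median filter is shown only because it is a widely used baseline for speckle-type noise.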
During external validation, it was observed that the discrimination performance of conventional CNN analysis at NTUH-TPE and NTUH-BIO was lower than at NTUH-HC, where internal validation was also performed. This discrepancy can perhaps be explained by variations in rEBUS probes and image processors across institutions. In the current study, we addressed this challenge by implementing fine-tuning and TTA. Fine-tuning was performed using 10% of the data from each external validation cohort: NTUH-TPE (malignant, n=15; benign, n=15) and NTUH-BIO (malignant, n=5; benign, n=5). Note that the improvement in discrimination performance obtained using TTA was similar to that obtained with fine-tuning. This suggests that in situations where it is not feasible to include rEBUS images from different image processors, TTA could serve as an alternative means of bridging the performance gap between internal and external validation. Note also that implementing fine-tuning in conjunction with TTA could further enhance discrimination performance.
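The fine-tuning protocol actually used (which layers were updated, learning rate, number of steps) belongs to the Methods; the sketch below illustrates the general idea only, using a tiny pure-Python logistic-regression "model" in place of a CNN: pretrain on source-domain data, then continue training with a smaller learning rate on a small balanced subset of target-domain data, mirroring the 10% external-cohort split. All data and hyperparameters here are invented for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, w, b, lr, epochs):
    """Plain stochastic gradient descent on the logistic loss.
    `data` is a list of ([features], label) pairs."""
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the logistic loss wrt the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

random.seed(0)
# Source domain: label 1 when the first feature is large.
source = ([([random.random(), random.random()], 0) for _ in range(50)] +
          [([random.random() + 2.0, random.random()], 1) for _ in range(50)])
# Target domain: same rule, but with a shift (a domain gap, standing in
# for a different rEBUS probe/image processor).
target = ([([random.random() + 0.5, random.random() + 0.5], 0) for _ in range(20)] +
          [([random.random() + 2.5, random.random() + 0.5], 1) for _ in range(20)])

# Pretraining on the source domain.
w, b = train(source, [0.0, 0.0], 0.0, lr=0.5, epochs=30)
# Fine-tuning: a small balanced target subset, smaller learning rate.
subset = target[:2] + target[20:22]
w, b = train(subset, w, b, lr=0.05, epochs=10)
```

A CNN would typically freeze early layers and update only the later ones during fine-tuning; the toy model here has no layers to freeze, so the whole parameter vector is updated.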
The TTA method used in this study involves using a classifier to make predictions on multiple augmented versions of each test image and then determining the final diagnosis by voting. This approach closely resembles the decision-making process of clinicians, in which a decision is made only after carefully inspecting an image by zooming or rotating it back and forth. Previous studies have reported that TTA can significantly improve prediction performance by helping the classifier detect objects that might otherwise be missed in the original image.34–36 TTA proved to be a valuable technique, yielding superior diagnostic performance compared with conventional methods when applied to the external validation cohorts. Scaling augmentation, in particular, enhanced diagnostic performance by mitigating the impact of image variations arising from the use of different rEBUS equipment across institutions.
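The specific augmentations and voting rule used in the study are defined in the Methods; as a generic sketch only, the code below classifies several transformed copies of a test image with a placeholder classifier and returns the majority-vote label. Both the transforms and the threshold-based stand-in classifier are our own assumptions, not the study's CNN.

```python
from collections import Counter

def hflip(img):
    """Horizontal flip of a 2D image (list of rows)."""
    return [row[::-1] for row in img]

def rot180(img):
    """180-degree rotation of a 2D image."""
    return [row[::-1] for row in img[::-1]]

def tta_predict(classify, img):
    """Test-time augmentation: classify several augmented copies of
    the image and return the majority-vote label."""
    variants = [img, hflip(img), rot180(img)]
    votes = [classify(v) for v in variants]
    return Counter(votes).most_common(1)[0][0]

# Placeholder classifier: calls the lesion 'malignant' when the mean
# intensity exceeds a threshold (a stand-in for the trained CNN).
def toy_classify(img):
    flat = [p for row in img for p in row]
    return 'malignant' if sum(flat) / len(flat) > 0.5 else 'benign'

img = [[0.9, 0.8],
       [0.7, 0.2]]
label = tta_predict(toy_classify, img)  # majority vote over 3 variants
```

In the study's setting the variant set would also include the scaling augmentation highlighted above, which is harder to express meaningfully on a toy 2×2 grid.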
Histological subtyping of lung cancer plays a crucial role in various aspects of patient management, including molecular testing, treatment planning and prognosis assessment.39–41 The histological classification of malignancies using CT or MRI has previously been investigated42 43; however, few researchers have applied machine learning to ultrasound images for the subtyping of malignancies.44 In the current study, we extended the applicability of the proposed model to the differentiation of lung cancer subtypes. Note, however, that the diagnostic performance of the model (as indicated by AUC) was not satisfactory. Several factors may have contributed to this. First, detecting subtle differences at the cellular or histological level solely from ultrasound images is inherently challenging. Second, some subtypes of lung cancer share similar echo-textural characteristics, making them even more difficult to differentiate; this overlap in imaging features may have contributed to the relatively low diagnostic performance of the CNN in this study. Further studies using larger, higher-quality datasets will be required to fully explore the potential of AI in this type of application.
This study has several limitations that could affect the generalisability of our findings. First, the static rEBUS images were recorded by multiple bronchoscopists, resulting in inconsistent image quality. Second, the rEBUS images were linked to the corresponding TBB pathology reports, despite the fact that in clinical practice, definitive results for lung lesions are not always available (ie, the biopsy yield is not 100%).45 This could have introduced discrepancies between the rEBUS images and the pathology results. We sought to minimise these effects by ensuring that the images were obtained by experienced bronchoscopists and that an average of four to six biopsy specimens were obtained. We also conducted clinical follow-up of tumours identified as benign for at least 6 months after rEBUS. In this study, rEBUS-TBB analysis did not detect any benign tumours (eg, hamartomas), perhaps because clinicians adopted diagnostic modalities other than rEBUS-TBB for patients suspected of having benign lesions based on chest CT scans. Lastly, this was a retrospective study in which the brightness and contrast of the rEBUS images were not standardised. We attempted to mitigate this variance by adjusting brightness and contrast through augmentation during training. This variance could also be considered a strength of the study, as the model demonstrated good diagnostic performance despite variations in parameter settings.