Abstract
Background Diagnosing mediastinal tumours, including incidental lesions, on low-dose CT (LDCT) performed for lung cancer screening is challenging. Proper characterisation and surgical planning often require additional invasive and costly tests, indicating a gap in existing diagnostic methods, the need for a more efficient and patient-centred approach, and the potential for artificial intelligence technologies to address this gap. This study aimed to create a multimodal hybrid transformer model using the Vision Transformer that leverages LDCT features and clinical data to improve surgical decision-making for patients with incidentally detected mediastinal tumours.
Methods This retrospective study analysed patients with mediastinal tumours between 2010 and 2021. Patients eligible for surgery (n=30) were considered ‘positive,’ whereas those without tumour enlargement (n=32) were considered ‘negative.’ We developed a hybrid model combining a convolutional neural network with a transformer to integrate imaging and clinical data. The dataset was split in a 5:3:2 ratio for training, validation and testing. The model’s efficacy was evaluated using a receiver operating characteristic (ROC) analysis across 25 iterations of random assignments and compared against conventional radiomics models and models excluding clinical data.
Results The multimodal hybrid model demonstrated a mean area under the curve (AUC) of 0.90, significantly outperforming the non-clinical data model (AUC=0.86, p=0.04) and radiomics models (random forest AUC=0.81, p=0.008; logistic regression AUC=0.77, p=0.004).
Conclusion Integrating clinical and LDCT data using a hybrid transformer model can improve surgical decision-making for mediastinal tumours, showing superiority over models lacking clinical data integration.
- Imaging/CT MRI etc
- Thoracic Surgery
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
CT is essential for diagnosing mediastinal diseases, particularly for incidental lesions. Traditional management of these lesions typically involves additional examinations like positron emission tomography, MRI and histological studies, which can be costly and burdensome for patients. However, recent advancements in artificial intelligence (AI) technologies, particularly the Vision Transformer (ViT), have revolutionised medical imaging. The ViT, with its attention mechanism, offers enhanced visual recognition capabilities, merging clinical and imaging data for improved diagnostic accuracy.
WHAT THIS STUDY ADDS
The multimodal hybrid transformer model, incorporating Vision Transformer and clinical data from low-dose CT, improved surgical decision-making for mediastinal tumours. This model outperformed models lacking clinical data and radiomics models. Implementing AI in low-dose CT screening streamlines clinical decisions, potentially reducing time and cost by avoiding unnecessary tests.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
The hybrid model’s superior diagnostic performance indicates potential for streamlined surgical decision-making, leading to more efficient and cost-effective patient management.
Introduction
CT is a crucial imaging technique for diagnosing mediastinal diseases, especially incidental mediastinal lesions.1 The role of CT, especially low-dose CT (LDCT), has been significantly emphasised in lung cancer screening programmes. Lung cancer remains a leading cause of cancer-related deaths worldwide, and early detection through screening is crucial for improving patient outcomes. The use of LDCT in screening, particularly among high-risk populations such as heavy smokers, has been proven to be the only effective method to significantly reduce mortality rates.2 3
Historically, in a lung cancer screening study including heavy smokers, the Early Lung Cancer Action Project used LDCT and found mediastinal lesions in a small but notable fraction of cases (prevalence of 0.77%), indicating the importance of further exploration.4 Although mediastinal lesions are detected incidentally through CT, traditional management requires various additional examinations, such as positron emission tomography (PET), MRI and histological studies, to ascertain the lesions’ characteristics, determine malignancy and plan surgical interventions.5
However, these conventional approaches prove burdensome both in terms of cost and patient experience. As an alternative, the introduction of artificial intelligence (AI) technologies, such as the Vision Transformer (ViT), has ushered in a new era for medical imaging.6 7 The ViT is a model that applies the transformer architecture, a de facto standard successful in natural language processing, to computer vision. Unlike previous architectures such as convolutional neural networks (CNNs), the ViT leverages the attention mechanism of the transformer architecture, offering advanced capabilities in visual recognition tasks, including image classification and semantic segmentation.7 This innovation represents a significant advancement, particularly because of its multimodal capacity, integrating diverse data such as clinical and imaging information.
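The patch-based mechanism described above can be illustrated with a minimal sketch (NumPy only; the image and patch sizes are illustrative, not those used in this study):

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split a square (H, W, C) image into a sequence of flattened patches,
    as the ViT does before linear embedding."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))

# a 224x224 RGB image with 16x16 patches yields 196 tokens of length 768
img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)
print(tokens.shape)  # (196, 768)
```

Each of these tokens is then linearly projected and processed by the transformer's attention layers, which is what allows the ViT to model long-range dependencies across the image.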
The distinctiveness of this study, compared with previous studies, is rooted in its endeavour to use the specific capabilities of a multimodal hybrid transformer model, especially the ViT, to optimise the surgical decision-making process for mediastinal tumours incidentally identified on LDCT. This represents an original contribution in the field, positioning AI not only as a diagnostic aid but also as an integral facet of the surgical planning process. By comparing multimodal hybrid transformer models with existing single-modal or radiomics-based machine learning models, this study aimed to demonstrate superior efficacy and efficiency in handling this complex clinical scenario. Recent work has explored similar multimodal concepts and merits further investigation.8 9
This study aimed to pioneer a transformative approach for diagnostic AI, especially in the context of mediastinal tumours, by leveraging new technologies such as the ViT, which enable the integration of clinical information with radiological features. The objective is to provide a more efficient, patient-centred approach that may reduce the need for additional diagnostic burdens, ultimately striving for improved medical outcomes.
Methods
Participants
This single-centre retrospective study was approved by the Ethics Review Committee of Saint Luke’s International University (approval number: 21-R147). The requirement for informed consent was waived due to the retrospective study design. Mediastinal tumours incidentally detected on LDCT performed for lung cancer screening at our facility between January 2010 and December 2021 by a team of experienced radiologists specialising in thoracic imaging were retrieved from the electronic medical records and Picture Archiving and Communication System. For the purposes of this study, 30 patients who underwent surgical biopsy or resection based on clinical judgement were defined as ‘positive patients,’ and 32 patients who showed no mediastinal tumour enlargement during an average follow-up period of 2176±1261 days were defined as ‘negative patients.’ These patients were selected from a larger cohort of 2321 patients based on specific inclusion criteria, ensuring a focused and relevant analysis of the study objectives (figure 1). The inclusion criterion was an interval of no more than 3 months between LDCT examination and surgical resection. Each cohort (positive and negative patients) was randomly assigned at a 5:3:2 ratio and used for training, validation and testing.
CT protocols
Each patient underwent whole-lung LDCT, which was conducted using either a 64-detector scanner (Revolution EVO; GE Healthcare) or a 16-detector scanner (LightSpeed Ultra; GE Healthcare). The parameters for CT examinations performed using the Revolution EVO unit were as follows: 120 kV tube voltage, Auto Exposure Control (max 200 mA) tube current, 40 mm collimation, 1.375 pitch, 320 mm field of view and 512 × 512 matrix. The parameters for the LightSpeed Ultra unit were as follows: 120 kV tube voltage, Auto Exposure Control (max 200 mA) tube current, 20 mm collimation, 1.375 pitch, 320 mm field of view and 512 × 512 matrix. Unenhanced (non-contrast) scans were obtained for all patients. The dose length product was 64.4 ± 3.07 mGy·cm, with a volume CT dose index of 1.57 ± 0.08 mGy.
Modelling
A radiologist (D.Y., 10 years of experience) annotated the voxel-wise mediastinal tumour mask on LDCT scans for the training data. The model consists of a CNN, which extracts imaging features while segmenting the mediastinal tumour from the CT volume and mask data, and a transformer, which integrates the extracted features and clinical information (figure 2). The CNN performed U-net segmentation on 3D blocks of 50 mm around the mediastinal tumour, and imaging features were obtained by global average pooling and max pooling of the last hidden layer. Segmentation loss was defined as binary cross-entropy with the mask data. The aim of the U-net segmentation is to extract efficient imaging features that characterise mediastinal tumours in CT images through a segmentation task focused on them. Whereas the ViT divides input images into smaller patches and feeds them sequentially into the model, our model compresses the feature maps using global average and max pooling operations.7 Although this discards spatial information, the model’s only target is the mediastinal tumour at the centre of the input block, whose average CT value, image texture and edge features are still captured.
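The pooling step can be sketched in PyTorch as follows (the channel count of 32 is an assumption chosen so that the concatenated vector has the 64 dimensions described below; the spatial dimensions are illustrative):

```python
import torch

def pooled_features(fmap: torch.Tensor) -> torch.Tensor:
    """Compress a (B, C, D, H, W) feature map from the U-net's last hidden
    layer into a (B, 2*C) vector: global average pooling and global max
    pooling over the spatial axes, concatenated channel-wise."""
    avg = fmap.mean(dim=(2, 3, 4))        # (B, C) global average pooling
    mx = fmap.amax(dim=(2, 3, 4))         # (B, C) global max pooling
    return torch.cat([avg, mx], dim=1)    # (B, 2*C)

fmap = torch.randn(2, 32, 50, 50, 50)     # batch of 2; 32 channels; 50 mm block
vec = pooled_features(fmap)
print(vec.shape)  # torch.Size([2, 64])
```

This is how a full 3D block collapses into one compact vector per patient, at the cost of the spatial information noted above.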
The obtained imaging features were tokenised using a linear projection to serve as input to the transformer. The transformer comprises standard and masked self-attention layers, a well-established concept in natural language processing. As described in the implementation details, the length of the input vectors to the transformer was 128, whereas the clinical information was two-dimensional (age and sex). To keep the clinical data distinct during training, it was concatenated just before the multilayer perceptron layer that produces the prediction from the transformer output.
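A minimal sketch of this fusion stage, under the dimensions stated above (the class-token handling and the MLP head sizes are assumptions for illustration, not the study's exact implementation):

```python
import torch
import torch.nn as nn

class HybridHead(nn.Module):
    """Sketch: project 64-dim pooled imaging features to 128-dim tokens,
    run a 6-layer, 8-head transformer encoder, then concatenate the 2-dim
    clinical vector (age, sex) just before the prediction MLP."""
    def __init__(self, in_dim=64, embed=128, n_layers=6, n_heads=8, n_clinical=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed)               # linear projection to tokens
        self.cls = nn.Parameter(torch.zeros(1, 1, embed))  # learnable class token
        layer = nn.TransformerEncoderLayer(embed, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlp = nn.Sequential(nn.Linear(embed + n_clinical, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_feats, clinical):
        # img_feats: (B, 64) pooled U-net features; clinical: (B, 2)
        tok = self.proj(img_feats).unsqueeze(1)            # (B, 1, 128)
        tok = torch.cat([self.cls.expand(len(tok), -1, -1), tok], dim=1)
        out = self.encoder(tok)[:, 0]                      # class-token output
        return self.mlp(torch.cat([out, clinical], dim=1))  # surgical score logit

model = HybridHead()
score = model(torch.randn(4, 64), torch.randn(4, 2))
print(score.shape)  # torch.Size([4, 1])
```

Concatenating the clinical vector after the attention layers, rather than mixing it into the tokens, is what keeps it distinct during training.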
Implementation details
Each CT volume is initially resampled into 1 mm iso-voxels and then normalised to a window range of −57 HU to 164 HU. During training, a 3D block measuring 50 mm around the mediastinal tumour was cropped and augmented with random intensity shift (prob=0.5, offsets=2), random rotation (prob=0.5, rotation=90°), random flip (prob=0.5), random zoom (prob=0.5, min_zoom=0.95, max_zoom=1.05) and random affine transform (prob=0.5) using Project MONAI.10 The feature maps of the last hidden layer of the U-net were global average-pooled, max-pooled and concatenated into one vector with 64 dimensions. A linear projection was performed to obtain embeddings with 128 dimensions. The embeddings are concatenated with class tokens and fed to the transformer. The transformer contained six consecutive 8-head attention blocks. The AdamW11 optimiser was used to train our hybrid model in an end-to-end manner. We used One Cycle LR12 with a maximum learning rate of 1×10−4 and 100 total steps, and performed that cycle five times. The program was coded in Python (version 3.11.3; python.org) with PyTorch (version 2.0.1; pytorch.org).
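The intensity-windowing step can be sketched without the MONAI dependency (the clip-then-rescale convention shown here is an assumption; MONAI's intensity-scaling transforms apply the same idea):

```python
import numpy as np

def window_normalise(volume: np.ndarray, lo: float = -57.0, hi: float = 164.0) -> np.ndarray:
    """Clip a CT volume (in HU) to the window [lo, hi] and rescale to [0, 1]."""
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / (hi - lo)

vol = np.array([-1000.0, -57.0, 53.5, 164.0, 400.0])  # sample HU values
out = window_normalise(vol)  # 53.5 HU, the window midpoint, maps to 0.5
```

This soft-tissue window suppresses lung and bone intensities so the network sees contrast concentrated on mediastinal structures.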
Evaluation methods and metrics
To evaluate the performance of the model on independent testing data, a repeated random-split cross-validation approach was employed. To retain sufficient data for training while securing data for evaluating generalisation performance, 50% of the data was used for training,13 and the remaining 50% was split between validation and testing at a 3:2 ratio. The validation samples were used to evaluate the generalisation performance of the model during training; the testing data were not used until the model was finalised, eliminating the optimistic bias that arises from data-dependent choices made on the validation set. After training was completed, the test samples were used for receiver operating characteristic (ROC) analysis to calculate the area under the curve (AUC), a standard measure of machine learning model accuracy.14–16 To account for AUC variability, the random assignment and ROC analysis were repeated 25 times and the mean AUC was calculated. A surgical recommendation score was then calculated for each patient.
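The repeated-split evaluation can be sketched as follows (pure NumPy; the rank-based AUC below is the Mann-Whitney formulation equivalent to sklearn's `roc_auc_score`, and the toy scores merely stand in for model outputs):

```python
import numpy as np

def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """ROC AUC as the probability that a random positive is scored above
    a random negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
aucs = []
for _ in range(25):  # 25 random assignments, as in the study
    labels = rng.permutation(np.r_[np.ones(6), np.zeros(6)])   # ~20% test split of 62
    scores = labels + rng.normal(0, 0.7, size=12)              # toy "model" scores
    aucs.append(auc(labels, scores))
print(float(np.mean(aucs)))  # mean AUC over the 25 iterations
```

Averaging over repeated random assignments is what makes the reported AUC robust to any single lucky or unlucky split of a 62-patient cohort.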
Comparison with other modelling
As prognostication of mediastinal tumour malignancy through LDCT remains relatively undefined, we used several modelling patterns to identify the key elements of the prediction. First, we evaluated the contribution of clinical information to malignancy prediction by comparing our hybrid transformer with an otherwise identical model in which only the clinical information input was removed, leaving the number of parameters unchanged. Second, to validate the superiority of neural networks in autonomously selecting optimal image features over radiomics (pyradiomics), we compared our model with a random forest model using radiomics features obtained from the masked mediastinal tumours. Finally, to examine the contribution of the modelling process, we compared the above neural networks and random forest, which are strong learners, with a model using logistic regression, a comparatively weak learner. All training, validation and test sets were identical across models.
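The two radiomics baselines can be sketched under the same split protocol (the synthetic feature matrix stands in for the pyradiomics output, and default hyperparameters are an assumption, not the study's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(62, 20))             # 62 patients x 20 radiomic features (synthetic)
y = np.r_[np.ones(30), np.zeros(32)]      # 30 positive, 32 negative, as in the cohort
X[y == 1] += 0.6                          # inject a weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    score = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={score:.2f}")
```

Keeping the train/validation/test assignments identical across all models, as the study did, is what makes the paired AUC comparison in the next section valid.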
Statistical analysis
G*Power V.3.1 was used to determine the sample size.17 Since the patient set of the 25 random trials was identical for all model patterns, the sample size was calculated assuming a paired two-sided t-test. A sample size of 62 was deemed sufficient to attain at least 80% statistical power at a 5% significance level, with an effect size of 0.4. Model performance was evaluated in terms of the difference in AUC between models. For all statistical tests, p<0.05 was assumed to indicate a significant difference. Statistical tests were performed in Python 3.11.3 (python.org) using sklearn 1.3.0 (scikit-learn.org) and scipy 1.11.1 (scipy.org).
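The paired comparison of per-iteration AUCs can be sketched with SciPy (the two AUC arrays here are illustrative, generated to resemble the reported means, not the study's actual values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 25 paired AUCs: hybrid model vs the model without clinical data (illustrative)
auc_hybrid = np.clip(rng.normal(0.90, 0.04, 25), 0, 1)
auc_no_clinical = np.clip(auc_hybrid - rng.normal(0.04, 0.02, 25), 0, 1)

# paired two-sided t-test, pairing by random-split iteration
t, p = stats.ttest_rel(auc_hybrid, auc_no_clinical)
print(f"t={t:.2f}, p={p:.4f}")
```

Pairing by iteration removes the split-to-split variance shared by both models, which is why the same 25 random assignments were reused for every model pattern.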
Results
Study population
Among the initial cohort of 2321 consecutive Asian patients with mediastinal tumours detected using LDCT screening, detailed analysis was conducted on the 62 patients selected under the strict inclusion criterion detailed in the ‘Participants’ subsection of the Methods. This focused approach allowed a more precise examination of the characteristics and outcomes of patients with incidental mediastinal tumours. All cases included in our study, both ‘positive’ and ‘negative’ for mediastinal intervention, were exclusively related to mediastinal findings, without any diagnosed lung cancer. Additional patient characteristics for this selected group are provided in table 1.
Evaluation of individual models
The analysis of diagnostic models revealed that the AUC for the multimodal hybrid transformer model that included a ViT was 0.90, significantly higher than that of the single-modal model that excluded clinical information (AUC=0.86, p=0.04; figures 3 and 4). The AUCs for the radiomics models were lower, with the random forest model achieving an AUC of 0.81 (p=0.008) and the logistic regression model an AUC of 0.77 (p=0.004; figure 4). The importance of each input variable was assessed using a random forest model (online supplemental table S1): the maximum CT value was the most significant predictor, variables such as short diameter and internal heterogeneity had moderate importance, and sex was of little importance. Examination of individual cases revealed that thymic epithelial tumours, including thymic carcinoma and thymoma, were frequently associated with high surgical recommendation scores (figure 5). Notably, benign conditions such as large cysts also presented with high surgical recommendation scores (figure 6).
Discussion
The primary objective of this study was to develop a ViT-based software to aid in surgical decision-making for mediastinal tumours. Our study’s main finding was that the multimodal hybrid transformer model using a ViT achieved a mean AUC of 0.90, significantly outperforming comparison models such as single-modal models, random forest model and logistic regression model. This supports the superiority of the multimodal hybrid transformer model using a ViT and validates its potential application in clinical settings. The successful integration of clinical information, such as age and sex, with LDCT imaging data, led to an enhanced prediction of mediastinal tumour malignancy. This aligns with recent research findings emphasising the importance of multimodal data in medical imaging.8 9 AI-based analysis of medical images could be the key to overcoming barriers in the radiological evaluation of lung diseases in general. The combination of clinical features and radiographic biomarkers seems rational for a holistic approach to patients with thoracic tumours.18
The mediastinum is susceptible to a multitude of tumour types, rendering it one of the most challenging regions for diagnosis. Therefore, achieving a definitive diagnosis—particularly in distinguishing between benign and malignant tumours—often necessitates surgical removal. Even benign tumours can cause serious symptoms if they gradually grow and exert pressure on vital organs in the chest (such as the heart, major vessels, oesophagus and trachea). Decisions regarding resection of mediastinal tumours are based on information obtained from multiple imaging studies, such as contrast-enhanced CT, MRI and PET. Conventional imaging examinations, including contrast-enhanced CT, bone scintigraphy and contrast-enhanced brain MRI, are commonly used for staging thymic epithelial tumours (TETs).19–24 Ohno et al reported that whole-body FDG PET/MRI and MRI have better potential for diagnosing the IASLC/ITMIG TET stage than do conventional imaging examinations, such as whole-body contrast-enhanced CT, contrast-enhanced brain MRI and bone scintigraphy, and can be considered as effective as whole-body FDG PET/CT.25 Multimodality imaging is required to diagnose mediastinal tumours, including TETs. To date, no studies have examined whether LDCT alone can be used to determine the indications for surgery for mediastinal tumours.
Radiomics is an emerging field in translational research that aims to extract features from radiological images beyond what radiologists observe for clinical decision-making.26 Quantitative radiomic analysis based on CT, MRI and PET/CT has shown good diagnostic performance in differentiating tumour subtypes, staging, invasiveness and risk classification of TETs.27 Many studies have used radiomics to evaluate the risk of TETs, the most prevalent primary tumour in the anterior mediastinum, accounting for approximately 50% of all mediastinal tumours.28–31 Deep learning is a subfield of machine learning that focuses on training artificial neural networks with multiple layers to learn and extract complex patterns and representations from data.32 Deep learning has revolutionised many fields by enabling breakthroughs in tasks, such as image recognition. This success is attributed to the ability to model complex relationships and learn representations that capture intricate patterns in data. A deep learning radiomics nomogram (DLRN) is used in medical image analysis and combines deep learning and radiomics methodologies33; it involves extracting quantitative features (radiomics features) from medical images and using them to construct predictive models for clinical outcomes such as prognosis and treatment response. DLRNs have emerged as critical instruments in cancer research and clinical diagnosis, facilitating clinical diagnosis, prognostic prediction and treatment response assessment. By combining advanced feature extraction with deep learning and quantitative analysis of radiomic features, more accurate predictive models can be developed. DLRNs have also been used to study mediastinal tumours. Chen et al reported that DLRNs demonstrated superior performance in differentiating the risk status of TETs compared with deep learning signatures, radiomics signatures, or clinical models.34
The ViT is a deep-learning model specifically designed for image classification tasks.6 It differs from traditional CNNs by using a transformer architecture originally developed for natural language processing tasks.7 The ViT breaks down an image into a sequence of patches and processes them using a transformer encoder, enabling it to capture global information and long-range dependencies in the image. The ViT is a novel multimodal AI technology that integrates imaging and clinical data. To the best of our knowledge, no previous study has reported the clinical application of the ViT in mediastinal tumours.
To date, most mediastinal tumour machine learning studies have classified the histological subtypes of TET.27–31 34 Decisions regarding surgical intervention for mediastinal tumours are not limited solely to malignant lesions; some lesions, even benign ones, require surgical resection because of their relationship with the surrounding organs. The complex anatomy of mediastinal lesions makes biopsies difficult and invasive. While multimodality imaging (such as CT, MRI and PET) is necessary for the diagnosis of mediastinal tumours, it is desirable to determine whether surgery should be performed with only a simple examination that is both minimally invasive and cost-effective. In this study, we developed a multimodal hybrid transformer model using only LDCT and clinical information to assist in surgical decision-making for mediastinal tumours. In contrast to previous studies, our investigation has shown that surgical decision support could be provided only by LDCT, which is minimally invasive and cost-effective. Our findings contribute to the growing body of evidence supporting the application of deep learning and ViT in medical imaging.6 7 34 Unlike previous studies focusing solely on histological classification,27–31 34 our approach extends to the critical aspect of surgical determination. The potential for minimally invasive and cost-effective decision-making aligns with current trends in medical practice and economics.
The strengths of our study include the novel application of the ViT to mediastinal tumours and the innovative focus on surgical determination rather than mere classification.31 Our results build on previous work in deep learning and radiomics,32 33 offering a fresh perspective on mediastinal tumour management. Our work introduces a new pathway for mediastinal tumour diagnosis and treatment, potentially changing the way clinicians approach these complex cases.6 7 The integration of LDCT and clinical information offers a more nuanced and patient-centred approach to care, a theme consistent with modern medical practice.33
This study also has some limitations. First, this was a retrospective study from a single centre, which might have introduced selection bias; a multicentre study with a larger sample size is therefore required to validate these results. Second, the decision to surgically treat each mediastinal tumour was made in a collaborative meeting of respiratory surgeons, radiologists, respiratory medicine physicians and anaesthesiologists at our institution; future multicentre studies are needed to confirm the validity of our selection criteria. Third, regarding labelling methods, when the training set is relatively large, the segmentation method may require considerable time and effort, so bounding box labels may be preferable for saving time in image processing. Finally, long-term patient prognoses and clinical trials need to be evaluated before this model can be applied in clinical practice.
In conclusion, our study provides a pioneering exploration of a multimodal hybrid transformer approach integrating both clinical information and imaging features for mediastinal tumours. The promising results suggest potential changes in clinical practice and patient care, with wider implications for the field of medical imaging. Future research should focus on extending this work, validating the findings through multicentre studies and exploring additional clinical variables to refine the model.
Data availability statement
Data are available upon reasonable request.
Ethics statements
Patient consent for publication
Ethics approval
This study involves human participants and was approved by the Ethics Review Committee of Saint Luke’s International University (approval number: 21-R147). The requirement for informed consent was waived due to the retrospective study design.
Footnotes
Contributors DY and FK conceived the idea of the study. YO developed the statistical analysis plan and conducted statistical analyses. KK, NK and KO contributed to the interpretation of the results. DY drafted the original manuscript and is responsible for the overall content as the guarantor. TB, MM and YK supervised the conduct of this study. All authors reviewed the manuscript draft and revised it critically for intellectual content. All authors approved the final version of the manuscript to be published.
Funding This research received funding from the Supplementary Research Support Program of St. Luke’s Health Science Research and Bayer Academic Support of Bayer Yakuhin.
Competing interests YO is an employee of Plusman LLC and Milliman, Inc. The remaining authors do not have any conflict of interest to disclose.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.