Introduction The lack of effective, consistent, reproducible and efficient asthma ascertainment methods results in inconsistent asthma cohorts and study results for clinical trials or other studies. We aimed to assess whether application of expert artificial intelligence (AI)-based natural language processing (NLP) algorithms for two existing asthma criteria to electronic health records of a paediatric population systematically identifies childhood asthma and its subgroups with distinctive characteristics.
Methods Using the 1997–2007 Olmsted County Birth Cohort, we applied validated NLP algorithms for Predetermined Asthma Criteria (NLP-PAC) as well as Asthma Predictive Index (NLP-API). We categorised subjects into four groups (both criteria positive (NLP-PAC+/NLP-API+); PAC positive only (NLP-PAC+ only); API positive only (NLP-API+ only); and both criteria negative (NLP-PAC−/NLP-API−)) and characterised them. Results were replicated in unsupervised cluster analysis for asthmatics and a random sample of 300 children using laboratory and pulmonary function tests (PFTs).
Results Of the 8196 subjects (51% male, 80% white), we identified 1614 (20%), NLP-PAC+/NLP-API+; 954 (12%), NLP-PAC+ only; 105 (1%), NLP-API+ only; and 5523 (67%), NLP-PAC−/NLP-API−. Asthmatic children classified as NLP-PAC+/NLP-API+ showed earlier onset asthma, more Th2-high profile, poorer lung function, higher asthma exacerbation and higher risk of asthma-associated comorbidities compared with other groups. These results were consistent with those based on unsupervised cluster analysis and lab and PFT data of a random sample of study subjects.
Conclusion Expert AI-based NLP algorithms for two asthma criteria systematically identify childhood asthma with distinctive characteristics. This approach may improve precision, reproducibility, consistency and efficiency of large-scale clinical studies for asthma and enable population management.
- asthma epidemiology
- paediatric asthma
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Can expert artificial intelligence (AI)-based natural language processing (NLP) systematically identify childhood asthma and a subgroup of asthmatic children with distinctive clinical characteristics by leveraging electronic health records (EHRs)?
Expert AI-based NLP algorithms unlock the vast and valuable information in free text embedded in EHRs, systematically identifying childhood asthma and reducing the methodological heterogeneity of asthma identification so that its true biological heterogeneity can be captured.
Expert AI-based NLP algorithms help clinicians and researchers systematically identify childhood asthma and its subgroups with distinctive characteristics from EHRs with precision, reproducibility and affordability.
Important concerns in current asthma care and research are the use of inconsistent asthma criteria, asthma ascertainment processes and sampling frames.1 The resultant variability in identification of asthma across practice and research settings may cause inconsistent results in studies, including genome-wide association studies, clinical trials and biomarker studies, and may delay translation of important study findings into clinical practice.2–10 For example, one previous study reported that 60 different definitions of childhood asthma have been used among 122 published studies.1 This is in part due to: (1) the lack of consensus for asthma ascertainment, (2) inherent limitations of structured data to ascertain asthma (eg, poor sensitivity of International Classification of Diseases (ICD) codes, 31%11), (3) suggested biomarkers for ascertaining asthma that are expensive and difficult to use in large-scale studies and (4) labour-intensive, expensive and inconsistent manual chart review of great volumes of records to apply asthma criteria despite their availability.
Given the growing deployment of electronic health record (EHR) systems enabling large practice-based longitudinal data mining, advances in artificial intelligence (AI) approaches such as natural language processing (NLP; expert AI) may enable us to address these challenges, as NLP can extract, process and classify free-text data from EHRs.12–15 For example, we recently developed and validated NLP algorithms for two existing retrospective criteria for childhood asthma (NLP algorithms for Predetermined Asthma Criteria (NLP-PAC) and NLP algorithms for Asthma Predictive Index (NLP-API)).14 15 The performance of each NLP algorithm in determining asthma status based on comprehensive EHRs including free text was close to that of human reviewers (eg, 97% sensitivity and 95% specificity for NLP-PAC).14 We also demonstrated external validation of our NLP algorithms for these asthma criteria across different study settings despite different populations, practices and EHR systems.16 17 Thus, such capabilities of NLP using EHRs are poised to address the current challenges in asthma research and care described above by applying the existing asthma criteria to cohorts of children in a consistent manner on a large scale.
While the two asthma criteria are complementary, it is unknown whether NLP algorithms for the two asthma criteria systematically identify childhood asthma and its subgroups with distinctive clinical characteristics. We applied the NLP algorithms to a large birth cohort in a real-world setting and systematically characterised subgroups of asthmatic children.
This is a cross-sectional analysis nested in the retrospective birth cohort study using the 1997–2007 Olmsted County Birth Cohort. We applied NLP algorithms for the two asthma criteria to the EHRs of the birth cohort to identify children with asthma and characterise subgroups of these children by using supervised cross-sectional analysis for the whole birth cohort. Then, we replicated the original results by performing unsupervised cluster analysis for asthmatic subgroups and a cross-sectional analysis for laboratory and pulmonary function test (PFT) data of a random stratified sample of 300 children.
Patient and public involvement
This research was done without patient involvement. Patients were not invited to comment on the study design and were not consulted to develop patient relevant outcomes or interpret the results. Patients were not invited to contribute to the writing or editing of this document for readability or accuracy.
Olmsted County, Minnesota, is a virtually self-contained healthcare environment (only two healthcare providers provide clinical care to Olmsted County, Minnesota residents), and 98% of residents authorise their medical records to be used for research.18 Under the auspices of the Rochester Epidemiology Project (REP), all clinical diagnoses and procedures are linked between healthcare providers and individual patients and retrievable from medical records.18
We enrolled all eligible children who were born at Mayo Clinic Rochester and received their primary care there throughout the study period (1997–2015). We excluded: (1) children who did not have research authorisation, (2) those who visited a non-Mayo Clinic healthcare provider in the community with a diagnostic code related to asthma (eg, asthma, bronchiolitis, pneumonia and wheezing), which was captured in the REP database and (3) those who did not have any visits at Mayo Clinic within the last 3 years.
Asthma defined by NLP-PAC and NLP-API as predictor variables
Drs Yunginger and Reed, renowned asthma researchers, developed and validated PAC for retrospective studies among children and adults based on comprehensive medical record review (table 1-1),19 and it has been used extensively for asthma research over time. PAC is conceptually similar to the 2015 Canadian Thoracic and Canadian Pediatric Society asthma criteria consisting of: (1) recurrent wheezing episodes or airflow obstruction, (2) reversibility to bronchodilator and (3) exclusion of alternative diagnoses.20 Since most cases of probable asthma became definite asthma over time, both definite and probable asthma were considered as PAC positive.19 Although the API was originally developed to predict asthma among preschoolers, the National Asthma Education and Prevention Program recommended it for identification of asthmatic children for timely asthma treatment (table 1-2).14 15 We previously reported the details of the development and validation of both NLP algorithms,14 15 which showed strong performance (sensitivity, specificity, positive predictive value and negative predictive value: 97%, 95%, 90% and 98% for NLP-PAC, and 86%, 98%, 88% and 98% for NLP-API). Briefly, both NLP algorithms follow a sequential process to determine positivity for asthma criteria: (1) text extraction that searches for evidence concepts for asthma in EHRs, (2) processing of the extracted concepts based on rules for asthma criteria and (3) categorisation of asthma status accordingly.
The algorithm was implemented using the open-source NLP pipeline MedTagger (http://ohnlp.org/index.php/MedTagger) developed by Mayo Clinic.21 NLP-PAC has been externally validated at both Mayo Clinic and another study setting with a different practice, population and EHRs (Epic Systems) (Sioux Falls, South Dakota).16 17 We applied these two NLP algorithms, NLP-PAC and NLP-API, to the entire EHRs of the eligible subjects of the 1997–2007 Olmsted County Birth Cohort up to 31 August 2015, or the last follow-up date, and categorised them into four groups: both criteria positive (NLP-PAC+/NLP-API+), PAC only positive (NLP-PAC+ only), API only positive (NLP-API+ only), and both criteria negative, non-asthmatic (NLP-PAC−/NLP-API−). An asthma index date was defined as when PAC or API was met, whichever came first.
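The grouping and index-date rule described above can be illustrated with a minimal sketch. It assumes that each algorithm's output has already been reduced to the date on which the criterion was first met (or no date if never met); the function name `classify` and this input format are hypothetical and are not part of MedTagger or the published algorithms.

```python
from datetime import date
from typing import Optional, Tuple

# Group labels follow the four categories named in the text.
BOTH = "NLP-PAC+/NLP-API+"
PAC_ONLY = "NLP-PAC+ only"
API_ONLY = "NLP-API+ only"
NEITHER = "NLP-PAC-/NLP-API-"


def classify(pac_date: Optional[date],
             api_date: Optional[date]) -> Tuple[str, Optional[date]]:
    """Return (group, asthma index date).

    Hypothetical inputs: the date each criterion was first met,
    or None if the criterion was never met during follow-up.
    """
    if pac_date is not None and api_date is not None:
        group = BOTH
    elif pac_date is not None:
        group = PAC_ONLY
    elif api_date is not None:
        group = API_ONLY
    else:
        group = NEITHER
    # Index date: when PAC or API was met, whichever came first.
    met = [d for d in (pac_date, api_date) if d is not None]
    index_date = min(met) if met else None
    return group, index_date
```

Non-asthmatic children receive no index date in this sketch, since neither criterion was met.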
Clinical variables for characterising subgroups of asthmatic children
To characterise subgroups of the birth cohort, we collected pertinent variables from EHRs listed in tables 2 and 3. Socioeconomic status (SES) at birth was defined by the validated HOUsing-based Index of SocioEconomic Status (HOUSES).22 We also identified asthma-associated infectious and inflammatory multimorbidities (AIMs) based on the previously reported conditions associated with asthma.23
Replication of the initial results by analysing lab and PFT data of a random sample and performing unsupervised cluster analysis
We performed an unsupervised cluster analysis to replicate the initial results based on a supervised cross-sectional analysis as described in the Statistical Analysis section. In addition, as not all subjects had laboratory and PFT data available in EHRs of the birth cohort, to replicate the initial supervised cross-sectional analysis results based on the whole birth cohort, we performed a stratified random sampling of a total of 300 subjects from four subgroups of the whole cohort described above and prospectively enrolled them to obtain laboratory and PFT data. We included total and specific IgE, serum eosinophil count, exhaled nitric oxide (eNO), serum periostin and forced expiratory volume in 1 s (FEV1)/forced vital capacity (FVC). Serum periostin was measured by Periostin ELISA kit (Shino-Test Corporation).
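As an illustration of stratified random sampling from the four subgroups, the sketch below draws a fixed number of subjects from each stratum. The text does not state how the 300 subjects were allocated across strata, so the equal allocation of 75 per group is purely an assumption for illustration, as are the function name `stratified_sample`, the seed and the placeholder subject IDs; only the stratum sizes come from the reported results.

```python
import random
from typing import Dict, List


def stratified_sample(ids_by_group: Dict[str, List[int]],
                      n_per_group: int,
                      seed: int = 0) -> Dict[str, List[int]]:
    """Draw n_per_group subjects without replacement from each stratum.

    Equal allocation is an ASSUMPTION; the study's allocation is not stated.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for group, ids in ids_by_group.items():
        if len(ids) < n_per_group:
            raise ValueError(f"stratum {group!r} has only {len(ids)} subjects")
        sample[group] = sorted(rng.sample(list(ids), n_per_group))
    return sample


# Placeholder IDs; stratum sizes match the four groups reported in the Results.
strata = {
    "NLP-PAC+/NLP-API+": list(range(0, 1614)),
    "NLP-PAC+ only": list(range(1614, 2568)),
    "NLP-API+ only": list(range(2568, 2673)),
    "NLP-PAC-/NLP-API-": list(range(2673, 8196)),
}
drawn = stratified_sample(strata, n_per_group=75)
total_drawn = sum(len(ids) for ids in drawn.values())
```

Sampling within strata guarantees that the smallest group (NLP-API+ only) is represented, which simple random sampling of 300 from 8196 would not.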
Baseline characteristics for the four groups described above were summarised using frequencies for categorical variables and means (±SD) for continuous variables in both the whole cohort and the random sample. Statistical significance for the associations of individual clinical and laboratory variables with the four groups was tested using Pearson’s χ2 or Fisher’s exact test and the Kruskal-Wallis rank-sum test. For an unsupervised cluster analysis, we performed a non-negative matrix factorisation approach24 to identify clusters of variables and subgroups of subjects with asthma (excluding non-asthmatics) described above. The variables for cluster analysis comprised the same variables included in the initial analysis for the whole cohort (see tables 2 and 3). The optimal number of clusters was determined by finding the first value for which the cophenetic coefficient, which measures the stability of the clusters, starts decreasing drastically.25 Once the optimal number of clusters was determined, clusters were created by following standard approaches for non-negative matrix factorisation.24 All analyses were performed using R statistical software.
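The rule for choosing the number of clusters can be sketched as follows. This is one possible reading of "starts decreasing drastically", not the authors' R implementation: it returns the first candidate number of clusters after which the cophenetic coefficient falls by more than a threshold. The threshold of 0.05, the function name `pick_rank` and the coefficient values are all hypothetical and are not results from this study.

```python
from typing import Dict, Optional


def pick_rank(cophenetic_by_rank: Dict[int, float],
              drop_threshold: float = 0.05) -> Optional[int]:
    """Return the first rank after which the cophenetic coefficient
    drops by more than drop_threshold (an ASSUMED cut-off).

    Returns None if no such drop is found among the candidate ranks.
    """
    ranks = sorted(cophenetic_by_rank)
    for current, following in zip(ranks, ranks[1:]):
        drop = cophenetic_by_rank[current] - cophenetic_by_rank[following]
        if drop > drop_threshold:
            return current
    return None


# Made-up coefficients for illustration only.
toy_coefficients = {2: 0.99, 3: 0.98, 4: 0.90, 5: 0.88}
chosen = pick_rank(toy_coefficients)
```

In practice the coefficient for each candidate rank would come from repeated factorisation runs; this sketch only covers the selection step once those values are available.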
Characteristics of study subjects are summarised in table 2. Of the 22 011 children in the Olmsted County Birth Cohort, we excluded 13 815 subjects (n=1528 for no research authorisation, n=4412 for asthma-related diagnosis outside Mayo Clinic EMRs and n=7875 for no visit within 3 years), resulting in 8196 children. Of the eligible 8196 subjects, 51% were male, 80% were white and mean age (±SD) at the last follow-up date was 11.8 (±3.2) years. Asthmatic children (those who met either or both asthma criteria) were more likely to be male (p<0.001) and had lower SES at birth as measured by HOUSES compared with those without asthma (p=0.004). The frequency of well-child visits in the NLP-PAC+/NLP-API+ group was clinically similar to that of non-asthmatics (about one visit per year), while children in the NLP-PAC+ only group seemed to have a slightly lower frequency of well-child visits (p<0.001). There was no difference between asthmatics and non-asthmatics with regard to birth season or maternal smoking rate during pregnancy. The maternal smoking rate during pregnancy in this birth cohort was only 4%.
Prevalence of asthma
During the study period, 1679 (21%) children had a physician diagnosis of asthma in EHRs, and the mean age at the first physician diagnosis was 4.9 (±3.8) years, whereas 2568 (31%) and 1719 (21%) children met PAC and API, respectively. With inclusion of all three asthmatic groups (NLP-PAC+/NLP-API+, NLP-PAC+ only and NLP-API+ only), the mean age at asthma index date by both algorithms was 3.9 (±3.8) years. The breakdown of the four groups is as follows: 1614 (20%, NLP-PAC+/NLP-API+), 954 (12%, NLP-PAC+ only), 105 (1%, NLP-API+ only) and 5523 (67%, NLP-PAC−/NLP-API−; no asthma). Ninety-one per cent of PAC positive children were definite asthma by PAC. The highest proportion of those with a physician diagnosis of asthma (70%) and the earliest onset of asthma (4.3 years) were observed among children who met both criteria (NLP-PAC+/NLP-API+ group) (table 3).
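The counts above are internally consistent, which can be checked with a short calculation. The sketch below uses only the four group counts reported in this section; the variable names are illustrative.

```python
# Group counts as reported in the text.
groups = {
    "NLP-PAC+/NLP-API+": 1614,
    "NLP-PAC+ only": 954,
    "NLP-API+ only": 105,
    "NLP-PAC-/NLP-API-": 5523,
}

total = sum(groups.values())  # should equal the eligible cohort size

# Percentage of the cohort in each group, rounded to whole numbers.
percent = {g: round(100 * n / total) for g, n in groups.items()}

# Children meeting each criterion, regardless of the other criterion.
pac_positive = groups["NLP-PAC+/NLP-API+"] + groups["NLP-PAC+ only"]
api_positive = groups["NLP-PAC+/NLP-API+"] + groups["NLP-API+ only"]
pac_percent = round(100 * pac_positive / total)
api_percent = round(100 * api_positive / total)
```

The four groups sum to the 8196 eligible children, and the PAC positive and API positive totals reproduce the 2568 (31%) and 1719 (21%) quoted above.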
Characteristics of subgroups of asthma
As expected, the NLP-PAC+/NLP-API+ and NLP-API+ only groups were more likely to have a history of allergic rhinitis, eczema, a family history of asthma, elevated eosinophil count and total IgE level than their counterparts in the NLP-PAC−/NLP-API− and NLP-PAC+ only groups (tables 2 and 3). Importantly, the NLP-PAC+/NLP-API+ group was more likely to have impaired lung function, frequent asthma exacerbations, persistent asthma and overall higher risk of AIMs, compared with other asthmatic groups (either NLP-PAC+ only or NLP-API+ only) (table 3).
Laboratory and PFT measures for a random sample of subjects from subgroups
Among a random stratified sample of study subjects (n=300) for replicating the results based on the whole cohort, 53% were male, 81% were white and mean age (±SD) at the enrolment date was 13.2 (±2.5) years similar to the whole cohort. NLP-PAC+/NLP-API+ children showed the highest likelihood of atopic conditions, allergic sensitisations, Th2-high immune responses (elevated eNO and serum periostin) and impaired pulmonary function compared with other groups (figure 1).
Unsupervised cluster analysis
In an independent cluster analysis among asthmatics only, three clusters of subjects emerged, and cluster A was the most distinctive based on the heatmap in figure 2 and table 4. Subjects in cluster A, defined in the purple column and row (n=655), were characterised by a greater likelihood of persistent asthma, asthma exacerbation, pneumonia, pertussis, pressure equaliser (PE) tube, coeliac disease, viral and streptococcal infection, family history of asthma, eczema, allergic rhinitis, eosinophilia, no smoking during pregnancy, higher SES and spring birth. Importantly, cluster A had a disproportionately higher proportion of NLP-PAC+/NLP-API+ (82%) compared with cluster B (51%) or cluster C (55%). As most of cluster A represented the NLP-PAC+/NLP-API+ group, these results are consistent with those by the supervised analysis of entire study subjects (table 1 and figure 1).
To our knowledge, this is the first study demonstrating that the AI using NLP algorithms for two asthma criteria systematically identified childhood asthma and its subgroup with distinctive clinical characteristics on a large scale.
Clinical characteristics of NLP-PAC+/NLP-API+ subjects observed in our study (male sex, early onset, a family history of asthma and atopic tendency) are consistent with those of children who had poor asthma outcomes in the literature; these characteristics have been reported to be predictors of poor asthma outcomes.7 26 They had a greater likelihood of Th2-high, persistent asthma, frequent asthma exacerbation, impaired lung function and high risk of AIMs compared with non-asthmatics and those who met only NLP-PAC or NLP-API. Importantly, the findings based on a supervised cross-sectional analysis (tables 2 and 3) were replicated by a stratified random sample of 300 children selected from the four subgroups as shown in figure 1, which showed NLP-PAC+/NLP-API+ had a high likelihood of atopy (high eosinophil count, total IgE and allergen-specific IgE), Th2-high profile (FeNO and serum periostin) and impaired lung function (FEV1/FVC <85%), suggesting that application of NLP-based phenotyping to a large population is reasonable when laboratory testing is not feasible. Also, an independent unsupervised cluster analysis for asthmatic subgroups corroborated the findings as it identified cluster A (defined in the purple column and row in figure 2) characterised by atopy, persistent asthma, frequent asthma exacerbation, impaired lung function and high risk of AIMs as shown in table 5. Importantly, cluster A had a disproportionately higher proportion of NLP-PAC+/NLP-API+ (82%) compared with cluster B (51%) or cluster C (55%).
In our study, 30% of children with NLP-PAC+/NLP-API+ did not have a physician diagnosis of asthma. In the context of ‘under-diagnosis’ of asthma,27 28 the lack of diagnosis might deter access to preventive and therapeutic interventions for asthma. As the asthma index date by the criteria was almost 1 year earlier than the first date of physician diagnosis of asthma (3.9 years vs 4.9 years), our NLP algorithms may be helpful as a population management or clinical decision support tool in the era of EHRs for early identification of asthmatic children. For example, in our recent clinical trial (Developing and Implementing Asthma-Guidance and Prediction System (a-GPS) for Better Asthma Management, Young J Juhn, MD), these two algorithms were used to inform clinicians of their patients who met two criteria without a diagnosis of asthma to help with a timely diagnosis. Nonetheless, given the wide range of different asthma ascertainment methods (eg, 60 different criteria in the literature)1 causing inconsistent results,1–10 delaying translation of scientific findings into practice and obscuring the true biological heterogeneity of asthma, our study provides an effective, consistent, reproducible and cost-efficient method of asthma ascertainment on a large scale, while not relying on self-report or ICD codes. At present, the literature on application of NLP to asthma is severely limited. One study applied a machine learning technique on EHR data (ie, codes, drugs and clinical text) in order to identify children with asthma.29 Their approach relied on a physician diagnosis of asthma (instead of asthma criteria) and did not take into account the patient’s asthma symptoms that could precede the physician’s asthma diagnosis. Thus, timely identification of asthma might not be feasible, and this approach is not able to provide physicians with evidence of the likelihood of asthma that would assist in their clinical decision making. 
A few studies demonstrated feasibility of extracting PFT information and smoking status from structured and semistructured data by applying NLP,13 30 while other studies attempted to predict asthma outcomes by applying machine learning or artificial neural network approaches.31 32 Nonetheless, as rich clinical information for asthma exists in free text embedded in EHRs, it is crucially important to develop an emerging and innovative AI approach enabling automated chart review and extraction or retrieval of relevant data for asthma from EHRs to make precision medicine in asthma care scalable in the future. In this respect, our study results demonstrate feasibility of such approach in a real-world setting, and this is a significantly understudied area.
The main strength of our study is the design that uses a large population-based birth cohort with longitudinal follow-up. Our study setting also has the epidemiological advantages of being a self-contained healthcare environment with a medical record linkage system through the REP enabling comprehensive medical record review for all eligible children. Our study results are based on two asthma criteria,19 33 which have been extensively used for epidemiological investigations for asthma studies. NLP-PAC was validated at both our study setting and another study setting (Sioux Falls, South Dakota) (external validity).14 17 This suggests that the NLP algorithm can be adapted in a different care setting with comparable performance, which may enable us to define and identify childhood asthma in a timely manner. This supports feasibility of application of our NLP algorithms to other study settings while recognising the need for further multisite studies in the future. Along these lines, we discussed the results of our study with the Research Advisory Board for Community Engagement consisting of parents, community members and representatives of community agencies to seek their input. The advisory board provided valuable feedback for implementation of NLP algorithms in clinical care (eg, timely identification of children with asthma). This study has the inherent limitation of retrospective studies in that laboratory and lung function data are not available for all study subjects. However, we included prospectively obtained laboratory and PFT measures for a random sample of the whole cohort that replicated the findings observed in the whole cohort. The two asthma criteria used in this study are not intended to replace a physician diagnosis of asthma.
However, it is challenging to determine asthma in young children retrospectively as tests for the diagnosis of asthma are frequently not feasible, and to our knowledge, these two criteria are the only validated criteria that have been retrospectively applied to EHRs. In our study, children who met NLP-PAC during the first 4 years of life, compared with those who did not, were more likely, at a later date, to have a timely physician diagnosis of asthma (62% vs 10%, p<0.001) and reduction in FEV1/FVC (<0.85) (p<0.001). These data suggest that for research purposes, the PAC is a reasonable asthma ascertainment criterion for younger children, largely overlapping with the Canadian Thoracic Society guidelines for asthma diagnosis for preschoolers.20 Even though asthma is a dynamic condition that changes over time, we did not address this issue as it goes beyond the scope of this study. However, we recently developed and validated an NLP algorithm for asthma prognosis after asthma onset.34 We should be able to extend NLP algorithms for asthma prognosis to the same birth cohort and report the results in the near future.
In conclusion, expert AI-based NLP algorithms for two existing asthma criteria systematically identified childhood asthma and its subgroup with distinctive characteristics on a large scale, minimising methodological heterogeneity in defining asthma and maximising our ability to detect true biological heterogeneity among asthmatic patients. In the era of EHRs, this approach enables precision population management strategies for asthma care and the execution of large-scale clinical studies with improved precision, reproducibility and affordability.
We would like to thank Mrs Kelly Okeson for her administrative assistance. We would also like to thank Drs Rohit D Divekar, Thanai Pongdee, Bong Seok Choi and Mrs Julie C Porcher for their review and helpful comments. Funding information: National Institutes of Health (NIH)-funded R01 grant (R01 HL126667), R21 grants (R21AI116839-01 and R21AI142702) and T. Denny Sanford Pediatric Collaborative Research Fund. The resources of the Rochester Epidemiology Project (R01-AG34676) from the National Institute on Aging and CTSA Grant Number UL1 TR000135 from the National Center for Advancing Translational Sciences.
Contributors Study concept and design: HL, WC, SS, ER, MAP, HK, IC and YJ; acquisition, analysis or interpretation of data: HYS, MCR, HL, WC, SS, ER, JO, SMA, JDW and YJ; drafting of the manuscript: HYS, MCR, WC, HL and YJ; critical revision of the manuscript for important intellectual content: HYS, MCR, HL, WC, SS, ER, MAP, HK, JO, IC, SMA, JAC-R, JDW and YJ; statistical analysis: WC, ER and SMA; study supervision: HL, WC, SS, ER, MAP, HK and YJ.
Funding National Institute of Health (NIH)-funded R01 grant (R01 HL126667) and R21 grant (R21AI116839-01 and R21AI142702), and T. Denny Sanford Pediatric Collaborative Research Fund. The resources of the Rochester Epidemiology Project (R01-AG34676) from the National Institute on Aging and CTSA Grant Number UL1 TR000135 from the National Center for Advancing Translational Sciences.
Competing interests None declared.
Patient consent for publication Not required.
Ethics approval The study protocol was approved by the Institutional Review Board (IRB) at Mayo Clinic (14-009934).
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement No data are available. The datasets generated and/or analysed during the current study are not publicly available as they include protected health information. Access to data could be discussed per the institutional policy after the IRB at Mayo Clinic approves it.