Utility of established prognostic scores in COVID-19 hospital admissions: multicentre prospective evaluation of CURB-65, NEWS2 and qSOFA

Introduction The COVID-19 pandemic is ongoing, yet, due to the lack of a COVID-19-specific tool, clinicians must use pre-existing illness severity scores for initial prognostication. However, the validity of such scores in COVID-19 is unknown. Methods The North West Collaborative Organisation for Respiratory Research performed a multicentre prospective evaluation of adult patients admitted to the hospital with confirmed COVID-19 during a 2-week period in April 2020. Clinical variables measured as part of usual care at presentation to the hospital were recorded, including the Confusion, Urea, Respiratory Rate, Blood Pressure and Age Above or Below 65 Years (CURB-65), National Early Warning Score 2 (NEWS2) and Quick Sequential (Sepsis-Related) Organ Failure Assessment (qSOFA) scores. The primary outcome of interest was 30-day mortality. Results Data were collected for 830 people with COVID-19 admitted across seven hospitals. By 30 days, a total of 300 (36.1%) had died and 142 (17.1%) had been in the intensive care unit. All scores underestimated mortality compared with pre-COVID-19 cohorts, and overall prognostic performance was generally poor. Among the ‘low-risk’ categories (CURB-65 score<2, NEWS2<5 and qSOFA score<2), 30-day mortality was 16.7%, 32.9% and 21.4%, respectively. NEWS2≥5 had a negative predictive value of 98% for early mortality. Multivariable logistic regression identified features of respiratory compromise rather than circulatory collapse as most relevant prognostic variables. Conclusion In the setting of COVID-19, existing prognostic scores underestimated risk. The design of new prognostic tools should focus on features of respiratory compromise rather than circulatory collapse. We provide a baseline set of variables which are relevant to COVID-19 outcomes and may be used as a basis for developing a bespoke COVID-19 prognostication tool.


INTRODUCTION
The novel coronavirus SARS-CoV-2 is causing a global pandemic of the infectious disease termed COVID-19. COVID-19 is frequently associated with a pneumonia syndrome, and the large ISARIC (International Severe Acute Respiratory and Emerging Infection Consortium) observational study estimates a case fatality rate of 33% among those admitted to hospital. 1 Prognostic scores can improve clinical decision making, and pre-COVID-19 several scores had been extensively validated and supported by national and international guidelines for application in the context of acute infectious disease. [2][3][4] The Confusion, Urea, Respiratory Rate, Blood Pressure and Age Above or Below 65 Years (CURB-65) score is a community-acquired pneumonia (CAP)-specific tool for predicting all-cause mortality within 30 days. CURB-65 has been validated across large, diverse patient populations and has been endorsed by national and international guidelines as an aid to clinical decision making. [5][6][7][8][9] The National Early Warning Score 2 (NEWS2) is a scoring system based on routine physiological measurements, and its implementation in all English National Health Service (NHS) hospitals was mandated in the pre-COVID-19 era. 10 NEWS2 is a disease-agnostic early warning tool used to trigger escalation of care in the deteriorating patient, with high scores being associated with death or unanticipated intensive care unit (ICU) admission within 24 hours. 2 The Quick Sequential (Sepsis-Related) Organ Failure Assessment (qSOFA) score is a tool for predicting mortality and ICU admission among patients with suspected infection in prehospital, emergency department and ward settings. It has been validated through large datasets and has gained prominence following its recommendation by the Sepsis-3 task force. 4 11
At the onset of the UK epidemic, in the absence of COVID-19-specific prognostic tools, CURB-65, NEWS2 and qSOFA remained in widespread use, but little was known about their validity in the COVID-19 setting. The primary aim of this study was to determine the performance characteristics of these scores in the context of COVID-19 and, secondarily, to investigate potential components of a COVID-19-specific prognostication tool for future validation.

METHODS
Study setting and participants
The North West Collaborative Organisation for Respiratory Research (NW-CORR) collected data during the 2-week period from 1 April 2020 to 14 April 2020 on prospective adult COVID-19 admissions at seven acute hospitals in North West England. NW-CORR constitutes a group of research-interested, respiratory, specialist trainee grade doctors, and the recruiting centres were those with an NW-CORR member available. Collaborators were asked to record routinely collected clinical data for consecutive patients admitted to their hospitals who met the Public Health England inpatient case definition for COVID-19 12 and had a positive SARS-CoV-2 PCR test. There were no exclusion criteria. No approach to the patient was made, and only fully anonymised, routinely available clinical information was collated; on this basis, consent was not required under pandemic-specific guidance from the NHS Health Research Authority (https://www.hra.nhs.uk/covid-19-research/guidance-using-patient-data/). 13

Patient and public involvement
Healthcare professionals with COVID-19 were involved in the design, conception and conduct of the study. All agreed the project was desirable and non-intrusive for patients and deliverable during a time of crisis.

Outcomes and prognostic scores
Data collected included demographic characteristics, vital signs and blood test results. All physiological values and blood results constituting components of the CURB-65, NEWS2 and qSOFA scores were the earliest measurement recorded in the hospital. The variables included in each risk score are shown in table 1. At the point of data entry, collaborators were also asked to comment on the presence or absence of consolidation on chest radiography. Outcomes studied were 30-day all-cause mortality, 72-hour mortality and ICU admission. With respect to the three risk scores, these outcome measures are in some cases validated and in other cases unvalidated but widely applied in clinical practice; in the unvalidated context, the analysis was therefore exploratory.
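As an illustration of how two of these scores are assembled from admission observations, the following Python sketch computes CURB-65 and qSOFA. The thresholds are those of the original published score definitions rather than values taken from table 1 of this paper, and the `Admission` record and its field names are hypothetical. NEWS2 is omitted because its banded scoring chart is too long to reproduce in a short example.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical record of the earliest observations recorded in hospital."""
    age_years: int
    confusion: bool        # new confusion or altered mentation
    urea_mmol_l: float     # blood urea, mmol/L
    resp_rate: int         # breaths per minute
    systolic_bp: int       # mm Hg
    diastolic_bp: int      # mm Hg


def curb65(a: Admission) -> int:
    """One point per criterion met (range 0-5), using the published cut-offs."""
    return sum([
        a.confusion,
        a.urea_mmol_l > 7.0,
        a.resp_rate >= 30,
        a.systolic_bp < 90 or a.diastolic_bp <= 60,
        a.age_years >= 65,
    ])


def qsofa(a: Admission) -> int:
    """One point per criterion met (range 0-3), using the published cut-offs."""
    return sum([
        a.resp_rate >= 22,
        a.confusion,
        a.systolic_bp <= 100,
    ])


# Hypothetical patient: older, raised urea, mildly tachypnoeic, normotensive.
patient = Admission(age_years=72, confusion=False, urea_mmol_l=8.2,
                    resp_rate=24, systolic_bp=118, diastolic_bp=70)
print(curb65(patient), qsofa(patient))
```

In this hypothetical case only the age and urea criteria of CURB-65 and the respiratory rate criterion of qSOFA are met, so neither blood pressure component contributes to the score.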

Data handling
Anonymised study data were collated centrally and managed using the secure, web-based software platform Research Electronic Data Capture (Vanderbilt University, USA) hosted at the University of Liverpool. In general, there were minimal missing data with >99% completeness for all constituent variables of the three prognostic scores (table 2 and figure 1).

Statistical analysis
Scores were assessed individually against their validated outcomes and overall for their ability to identify people at risk of mortality within 72 hours (early death) and 30 days of admission. This analysis included sensitivity and specificity of each score's respective risk strata followed by an evaluation of discrimination and calibration in keeping with TRIPOD guidelines. 14 Discriminatory ability was assessed by comparison of the corresponding receiver operating characteristic curves with computation of area under the curve (AUC). Calibration was assessed visually by plotting the observed risk for a score's individual strata against published reference risk derived from their original validation. In order to allow direct comparison of the clinical scoring systems, only patients with complete data for all variables were included in comparative statistical analyses. Data for the Clinical Frailty Scale (CFS) score were missing from one centre where these data were not recorded; since this was confined to one centre and therefore was not randomly missing, analyses were restricted to the population where all variables were available.
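To make the discrimination measure concrete, the sketch below computes the AUC directly from its rank interpretation: the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor, with ties counted as one half. This is a plain-Python illustration, not the pROC routine used in the study, and the function name and toy data are hypothetical.

```python
def auc(scores, died):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: one prognostic score per patient
    died:   truthy if the patient met the outcome, falsy otherwise
    """
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("need at least one patient in each outcome group")
    # Count pairs where the non-survivor scored higher; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: five patients, two of whom died.
toy_scores = [0, 1, 2, 3, 4]
toy_died = [0, 0, 1, 0, 1]
toy_auc = auc(toy_scores, toy_died)
```

An AUC of 0.5 corresponds to a score that ranks patients no better than chance, and 1.0 to perfect separation of the two outcome groups. The pairwise form is quadratic in cohort size, which is acceptable for a cohort of a few hundred patients.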
Multiple logistic regression models were fitted for each of the outcome variables (30-day mortality, 72-hour mortality and ICU admission) using each score (CURB-65, NEWS2 and qSOFA). To identify variables relevant to COVID-19 outcomes and to assess the association of each individual variable (eg, age and respiratory rate) with each of the outcomes, a multiple regression model including all variables was fitted by applying backward variable selection. Data heterogeneity introduced by differences among hospitals was assessed by adding a random intercept to the model. However, clustering by hospital did not improve the accuracy of the model, and the final models did not include a random term. The performance of the fitted models was assessed by sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and AUC. These analyses were performed using the statistical software package pROC in R V.3.5.3. pROC features internal cross validation based on a bootstrap sampling method.

RESULTS
The discriminatory ability of each score was assessed for death within 30 days, death within 72 hours and admission to critical care, and the results are presented in figure 2. In general, performance was modest, with AUCs ranging from 0.62 to 0.77. Calibration was computed by comparing the predicted risk from each score against the respective observed risk in the study cohort. Visual comparison of each calibration plot confirmed slopes of >1 and intercepts of >0, suggestive of underestimation (see figure 3).
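The calibration comparison described above amounts to tabulating observed risk per score stratum against a published reference risk. A minimal Python sketch follows; the function names, the toy cohort and the reference values are hypothetical placeholders for illustration and are not the study data.

```python
from collections import defaultdict


def observed_risk_by_stratum(strata, died):
    """Observed proportion of deaths within each score stratum."""
    deaths = defaultdict(int)
    totals = defaultdict(int)
    for stratum, outcome in zip(strata, died):
        totals[stratum] += 1
        deaths[stratum] += int(bool(outcome))
    return {s: deaths[s] / totals[s] for s in totals}


def calibration_rows(observed, reference):
    """Pair observed and reference risk for each stratum present in both.

    Each row is (stratum, reference risk, observed risk, difference).
    A positive difference means the score underestimates risk there.
    """
    return [
        (s, reference[s], observed[s], observed[s] - reference[s])
        for s in observed
        if s in reference
    ]


# Hypothetical toy cohort: stratum label and 30-day outcome per patient.
strata = ["<2", "<2", "<2", "<2", ">=3", ">=3"]
died = [0, 0, 0, 1, 1, 0]
observed = observed_risk_by_stratum(strata, died)

# Placeholder reference risks, for illustration only.
reference = {"<2": 0.03, ">=3": 0.22}
rows = calibration_rows(observed, reference)
```

Plotting the observed column against the reference column gives the kind of calibration plot described here; points lying above the line of identity indicate that the score underestimates risk.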
To test scores' performance at their individual validated thresholds, the sensitivity, specificity, NPV and PPV were calculated for patients with complete datasets (see figure 1) and are presented in table 3. Overall, for 30-day mortality, scores failed to accurately identify a low-risk group, with mortality in the lowest-risk strata ranging from 16% to 33%. For 72-hour mortality, a CURB-65 threshold of 2 and a NEWS2 threshold of 5 both identified a low-risk group with just 2%-3% mortality, and NEWS2 achieved a sensitivity of 92% with an NPV of 98%. All scores performed poorly in predicting admission to ICU (see online supplemental table S2). The relative likelihood of mortality at each stratum of each score is presented in table 4.
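The threshold-based measures reported above all derive from a 2x2 table of risk group against outcome. The sketch below shows one way to compute them in Python; the function names and the eight-patient dataset are hypothetical, and patients scoring at or above the threshold are treated as 'high risk'.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(scores, died, threshold):
    """Tabulate score >= threshold ('high risk') against the outcome."""
    tp = fp = tn = fn = 0
    for score, outcome in zip(scores, died):
        high_risk = score >= threshold
        if high_risk and outcome:
            tp += 1          # high risk, died
        elif high_risk:
            fp += 1          # high risk, survived
        elif outcome:
            fn += 1          # low risk, died
        else:
            tn += 1          # low risk, survived
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        # Mortality in the low-risk stratum is the complement of the NPV.
        "low_risk_mortality": _ratio(fn, tn + fn),
    }


# Hypothetical scores and 72-hour outcomes for eight patients.
scores = [1, 3, 5, 6, 7, 2, 8, 4]
died = [0, 0, 1, 0, 1, 0, 1, 0]
metrics = threshold_metrics(scores, died, threshold=5)
```

Because low-risk mortality equals 1 minus the NPV, an NPV of 98% for early death corresponds to roughly 2% mortality in the group below the threshold, which matches the way the two figures are reported together in the text.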
When all individual variables were considered, multivariable logistic regression revealed that confusion and blood pressure (BP) were less relevant to 30-day mortality
Finally, a backward selection multivariable model fitted for each outcome identified de novo a set of variables independently associated with 30-day mortality (Clinical Frailty Scale, Urea, Consolidation, Age, FiO2, Sex, Respiratory rate (CUCAF-SR)) and a similar set of variables for 72-hour mortality (Clinical Frailty Scale, Urea,

DISCUSSION
We analysed the accuracy with which admission CURB-65, NEWS2 and qSOFA scores predict ICU admission, early in-hospital mortality and all-cause mortality within 30 days of hospital admission in the context of COVID-19. In general, calibration was poor as all three scores underestimated the risk of adverse outcomes. CURB-65 and qSOFA both performed poorly in comparison to their respective standard applications, suggesting their utility is limited in COVID-19. In contrast, NEWS2 and CURB-65 were better at predicting early death, defined here as death within 72 hours of admission, where a NEWS2 threshold of 5 (the recommended threshold for urgent intervention) showed excellent NPV. We went on to derive two sets of parameters which, when combined on admission with COVID-19, may provide a more accurate prediction of mortality and may provide a useful basis for predictive scores to be validated in larger datasets.
CURB-65 was derived and validated to predict low, moderate and high risk of death at 30 days in patients with CAP and therefore assisted healthcare workers in deciding who required admission to the hospital. Pre-COVID validation demonstrated a score of <2 was consistently associated with low mortality rates of 0.4%-2.8% at 30 days. 6 15 However, here in the COVID-19 setting, a CURB-65 score of <2 was associated with a mortality of 17%. In clinical practice, its utility has been expanded to support decision making around antimicrobial prescribing and escalation of care. We therefore performed an exploration of CURB-65 performance with respect to ICU admission and early mortality prediction. Our data suggest low CURB-65 scores may not support early COVID-19 discharge, but higher scores may still have value in predicting particularly poor outcomes; CURB-65 scores of ≥3 were associated with death in 60% of cases, compared with just 22% in the pre-COVID era. 5 On that basis, high scores could prompt early escalation planning and inform discussions with patients and their families.
In the pre-COVID era, at the time of presentation to hospital with CAP, it was extremely unusual for the medical team to know the causal pathogen, and although it was well recognised that there was a range of virulence among the possible viral and bacterial causes of CAP, the data presented here confirm that SARS-CoV-2 is a highly virulent outlier. This finding has particular relevance to the evolving pandemic since, as transmission reduces, SARS-CoV-2 will become one of numerous endemic causes of CAP in many countries. It will therefore be important to recognise that this will reduce the performance of CURB-65 on undifferentiated CAP cases and makes a strong case for the implementation of rapid diagnostics to determine the aetiology of CAP.
NEWS2 has been widely implemented in English hospitals as a simple score consisting of routine physiological measurements. While it is most widely recognised as a simple tool to identify inpatients in need of urgent or emergent medical attention based on changing physiological measurements, it is also often used in the emergency department, where it has been validated in a number of syndromes, including sepsis and acute dyspnoea. 16 17 We found that a NEWS2 score of <5 accurately identified a group of patients at low risk of early mortality; however, it was less successful when 30-day mortality was considered. Our findings are based on a single measure of NEWS2 on admission, and future longitudinal work would be needed to confirm if established NEWS2 trigger thresholds remain valid for inpatients with COVID-19.
qSOFA was validated for use among hospital inpatients and emergency department admissions as a simple and accurate way to identify people with infections at higher risk of poor outcomes. 4 11 18 Early data from China suggested those who survived COVID-19 had lower qSOFA scores, a finding replicated here. 19 However, in our study, the median qSOFA in those who died within 30 days was <2, and mortality in this 'low-risk' qSOFA group was 32.5%. Taken together, the poor overall discriminatory ability of qSOFA and the poor diagnostic performance seen here suggest a qSOFA score on admission is not a useful prognostic tool in COVID-19. These findings are supported by a recent study which found qSOFA was low in people with COVID-19 admitted to critical care and could not reliably identify those at risk of death. 20
The poor performance of qSOFA is interesting, given it was derived from cohorts of patients with sepsis, a syndrome defined as a 'life-threatening organ dysfunction due to a dysregulated host response to infection'. It would be expected that many with COVID-19-associated mortality would meet that definition on the basis of respiratory failure. However, the striking difference between the physiology of bacterial sepsis and severe COVID-19 is that cardiovascular instability is rare in COVID-19. 18 In our modelling of individual variables of CURB-65, we found that unlike the respiratory components, blood pressure was not independently associated with adverse outcome. Similarly, confusion, often a sign of haemodynamic compromise, was less relevant to outcomes in the CURB-65 score.
Blood pressure and mental status are integral components of the qSOFA score and in other contexts contribute to its ability to prognosticate; thus, the poor performance of qSOFA is explained by the limited effect of COVID-19 on these physiological parameters. These findings suggest COVID-19-associated mortality may be mediated by different mechanisms than conventional bacterial sepsis. An example of this may be the profound endothelial injury and abundant microthrombi identified in a recent postmortem study. 21
Given the limited performance of the previously validated and widely used scores seen here, we explored whether performance could be improved by deriving new models. Using multiple logistic regression, we derived new models, CUCAF-SR and CUCA-SF, in an attempt to predict 30-day and 72-hour mortality, respectively. In keeping with the findings described earlier, markers of cardiovascular compromise were not independently associated with poorer outcomes, with markers of respiratory function, age and frailty appearing more relevant. This finding is supported by the findings of the ISARIC study, where respiratory function (respiratory rate and oxygen saturations), age and comorbidities, but not cardiovascular parameters, were important constituents of the 4C mortality score. 22
Some limitations must be addressed. First, we only included a 2-week period, and it is possible demographics and outcomes may change across the course of the COVID-19 pandemic. 23 Reassuringly, the characteristics and outcomes in the study population seen here are in keeping with those reported by the ISARIC study, one of the largest studies in this setting to date.
For example, the median age here was 70 years compared with 72 years in the ISARIC study; 61% were male here compared with 59.9% in ISARIC; 17.1% here were admitted to critical care compared with 17% in ISARIC; and we observed 34% 30-day mortality, which is comparable to the hospital case fatality rate of 33.6% reported by ISARIC. 24 25
A further limitation here is the inclusion only of those people admitted to hospital, thus excluding those well enough to be discharged from the emergency department. As a consequence, the observed risk presented among low-risk categories seen here may, in theory, be inflated; for example, it is possible only the sickest patients with CURB-65 scores of 0-1 were admitted. However, the main derivation dataset for the CURB-65 score included only patients admitted to the hospital, replicating the methods here. 5 Similarly, a large validation study that included both admissions and those discharged directly from the emergency department found that 30-day mortality in those admitted to hospital with a low-risk CURB-65 score remained low at 0.0%-1.6%. 15 This supports the conclusion that the observed high mortality among low CURB-65 scores in this study was due to SARS-CoV-2 virulence rather than study design. Conversely, some patients presented to the emergency department in a moribund state and did not survive long enough for a viral swab to be taken. Such patients were not included in our study, and generalisability to that small subset of patients may be limited.
Data collection here did not include assessment of detailed patient demographics or comorbidities, but instead focused on clinical measurements normally taken at presentation to the hospital. Characteristics such as obesity, ethnicity and comorbidities are reported to be relevant to COVID-19 outcomes but are not included here. 26 27 It may be that a combination of clinical parameters and patient characteristics is more informative than either in isolation, and validation of such an approach is required. 28
We defined 'early mortality' as death occurring within 72 hours of admission in order to capture patients who deteriorated quickly, but within a time frame that would allow them to have a known SARS-CoV-2 status and be identified by our investigators. This 72-hour timepoint differs from the 24-hour timepoint used in the NEWS2 score's validation studies but, given the above constraints, was considered a pragmatic approach for our analysis. Finally, ICU admission is not appropriate for all patients, and our analysis of ICU admissions may be prone to unmeasured confounding in that regard.
The strengths of this study lie in the prospective collection of data on consecutive admissions from multiple regional hospitals with rigorous assessment of the performance of each score. These readily available data were compiled from real-world clinical assessment, and outcomes followed usual clinical care. We also demonstrate the hitherto underappreciated potential of highly trained and motivated specialty trainees and their ability to coordinate and collaborate for research.

CONCLUSION
CURB-65, NEWS2 and qSOFA underestimate 30-day mortality among patients admitted to the hospital with COVID-19. CURB-65 and NEWS2 were slightly better at predicting early mortality. However, our data suggest CURB-65 should not be used to prognosticate in the setting of COVID-19 pneumonia since low CURB-65 scores were associated with high mortality rates. We provide a set of clinical parameters which appear relevant to outcomes in COVID-19 and should be considered in future studies aimed at deriving COVID-19-specific prognostic tools.