Background During the COVID-19 pandemic, portable pulse oximeters were issued to some patients to permit home monitoring and alleviate pressure on inpatient wards. Concerns were raised about the accuracy of these devices in some patient groups. This study was conducted in response to these concerns.
Objectives To evaluate the performance characteristics of five portable pulse oximeters and their suitability for deployment on home-use pulse oximetry pathways created during the COVID-19 pandemic. This study considered the effects of different device models and patient characteristics on pulse oximeter accuracy, false negative and false positive rate.
Methods A total of 915 oxygen saturation (spO2) measurements, paired with measurements from a hospital-standard pulse oximeter, were taken from 50 patients recruited from respiratory wards and the intensive care unit at an acute hospital in London. The effects of device model and several patient characteristics on bias, false negative and false positive likelihood were evaluated using multiple regression analyses.
Results and conclusions All five portable pulse oximeters appeared to outperform the standard to which they were manufactured. Device model, patient spO2 and patient skin colour were significant predictors of measurement bias, false positive and false negative rate, with some variation between models. The false positive and false negative rates were 11.2% and 24.5%, respectively, with substantial variation between models.
- equipment evaluations
Data availability statement
Data are available on reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
What is already known on this topic?
Pulse oximeter performance varies with factors like subject skin colour and oxygen saturation.
The pulse oximeters investigated for this study are known to meet the requirements for “CE marking”, which guarantees a minimum level of safety and performance.
What this study adds
This study considered the performance, and factors which affect performance, of five devices issued for home-use during the COVID-19 pandemic, for which no similar analysis has previously been published.
The study also considered the implications of any inaccuracy with respect to the escalation thresholds within the COVID-19 home-use pathways.
How this study might affect research, practice or policy
This study should improve understanding of the performance and limitations of these devices and the implications when they are deployed for home-use within the COVID‑19 home oximetry pathways.
It is likely that more pathways involving the home use of diagnostic medical devices will be developed. The methods used and the findings reported here may inform further and larger-scale evaluations of oximeters and other devices.
The COVID-19 pandemic created unprecedented demand for bed space within National Health Service (NHS) hospitals. Home-use pulse oximeters were introduced to allow home monitoring of some patients who would otherwise occupy a hospital bed. This was recommended by WHO guidance.1 There is evidence of uptake of this recommendation in various health systems.2 3 Use of similar pathways is likely to increase,4 including for non-COVID-19 patients.5
The NHS in England created two virtual wards to manage the surge of COVID-19 patients: ‘COVID-19 Virtual Wards’ and ‘COVID-19 Oximetry @Home’. Both virtual wards were supported by portable pulse oximeters (as opposed to hospital-grade pulse oximeters) due to their wider availability. For the rest of the article, portable pulse oximeters will be referred to as pulse oximeters for simplicity.
COVID-19 Virtual Wards were operated by secondary care providers and contained an ascending and a descending pathway. The descending pathway included patients in the recovery phase, deemed fit for transfer to a virtual ward for home monitoring. The ascending pathway included patients with a COVID-19 diagnosis deemed appropriate for home monitoring. A pulse oximeter was issued to patients on both pathways.6 Patients were asked to record their pulse oximetry derived oxygen saturation (spO2) three times daily. They were proactively contacted by phone daily, and asked to contact the hospital if their spO2 fell below 92%2 or if other symptoms worsened. Asymptomatic patients were considered for discharge by fourteen days.
The other virtual ward, COVID-19 Oximetry @Home, was led by primary care providers,7 and pulse oximeters were issued to COVID-19 positive patients. Patients were supported to self-escalate if their spO2 fell to 92% or below, and to call a low-acuity hotline or their general practitioner if it fell to 93% or 94% or if other symptoms worsened.
Various media articles8 9 suggested that the public purchase pulse oximeters and monitor themselves, hence forming a self-referral pathway. The recommended decision boundaries were the same as for the two virtual wards.
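The COVID-19 Oximetry @Home escalation rules described above can be sketched as a simple mapping from a reading to an action. This is an illustration only, not clinical guidance: the function name and return labels are hypothetical, the "continue monitoring" branch for readings of 95% and above is an assumption (the text does not state it), and worsening symptoms, which also trigger contact, are not modelled.

```python
def oximetry_at_home_action(spo2_percent: int) -> str:
    """Map a home spO2 reading (%) to the escalation step described in the text.

    Hypothetical helper for illustration; symptoms are deliberately ignored.
    """
    if not 0 <= spo2_percent <= 100:
        raise ValueError("spO2 must be a percentage between 0 and 100")
    if spo2_percent <= 92:
        # 92% or below: patients were supported to self-escalate.
        return "self-escalate"
    if spo2_percent <= 94:
        # 93% or 94%: call a low-acuity hotline or general practitioner.
        return "call low-acuity hotline or GP"
    # Assumption: no action is described for readings of 95% and above.
    return "continue monitoring"


for reading in (91, 92, 93, 94, 95):
    print(reading, oximetry_at_home_action(reading))
```

Because the boundaries sit only one or two percentage points apart, a device bias of similar size can move a patient from one action to another.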
Pulse oximeter performance
The ISO 80601-2-61:2019 standard gives performance requirements for pulse oximeters.10 Pulse oximeter manufacturers are not mandated to comply with this standard, but compliance is generally expected for NHS devices. All of the virtual ward devices claimed compliance with this standard.7
The standard requires that manufacturers perform a study to quantify their pulse oximeter’s accuracy. This can be a desaturation study on healthy volunteers, using either a CO-oximeter or another pulse oximeter as reference. Patient studies are also permitted provided that a CO-oximeter is used as reference. Root-mean-square error may not exceed 4% for saO2 in the range 70%–100%. The standard recommends that study subjects ‘should vary in their physical characteristics to the greatest extent possible’ to permit broad application to different patient groups. It does not prescribe sample size or analysis methodology. Publication and peer-review of these studies is not required.
Assurance of the accuracy of the pulse oximeters deployed during the pandemic was sought, since they had little history of NHS use and the results of previous investigations of pulse oximeter performance have been variable.
A desaturation study on six non-US Food and Drug Administration (FDA)-approved low-cost pulse oximeters11 found that 4/6 did not meet ISO 80601-2-61 requirements. The study included two device models that are earlier versions of devices used for the virtual wards: the Contec CMS50DL (which did meet the requirements) and the Beijing Choice MD300C23 (which did not). Two further clinical studies found that several portable pulse oximeters (including the Contec CMS50D, CMS50DL and ChoiceMMed MD300C52) performed acceptably in patients with relatively high spO2,12 13 but offered no definitive comment for patients with lower spO2.
Other studies that have considered the accuracy of portable pulse oximeters similar to those investigated for this study have found that their measurements agree closely with gold standard measurements from CO-oximeters. These include studies that consider these devices’ use at high altitude (a common use case for these devices, due to their easy portability).14 15
Concerns had also been raised about pulse oximeter performance for patients with darker skin. A California based research group has undertaken desaturation studies to evaluate the relationship between pulse oximeter performance and skin tone,16 17 finding that skin tone, gender and saO2 range were consistent predictors of bias for various pulse oximeters. They found that spO2 was overestimated in dark-skinned, hypoxic individuals (saO2 <80%), and concluded that this effect could be clinically significant. Similar findings were presented in a letter to the New England Journal of Medicine,18 and an early study of ear pulse oximetry also found that performance was poorer for darker skinned patients.19 Other studies have found no relationship between pulse oximeter performance and skin colour,20 21 including one study on the Contec CMS50D.12
Factors other than skin tone may affect pulse oximeter performance. These devices rely on arterial flow pulsatility to measure spO2, so conditions like peripheral arterial disease (PAD), which reduce pulsatility, may affect function.22–24 The presence of carboxyhaemoglobin in the blood (for instance, due to smoking) also causes pulse oximeters to overestimate spO2.22
While questions remain about the performance characteristics of home-use pulse oximeters, there is evidence that they reduce COVID-19 mortality in some populations. A South African retrospective cohort study of high risk COVID-19 positive individuals found that mortality was 48% lower in individuals issued with a pulse oximeter than in those without one.25
Previous studies have not investigated the performance of pulse oximeters used within the context of these virtual wards, considering the 92% decision threshold. For this study, a ‘false positive’ was defined as a spO2 reading less than 92% for a patient whose ‘true’ spO2 is greater than or equal to 92%. False positive rate (FPR) was defined as the proportion of negative results recorded as positive by the pulse oximeter. A ‘false negative’ was defined as a spO2 reading greater than or equal to 92% for a patient whose true spO2 is less than 92%. False negative rate (FNR) was defined as the proportion of positive results recorded as negative by the pulse oximeter. The objective of this work was to estimate the impact of performance characteristics of these devices on the home-use pathways in terms of FPR and FNR. This work also investigated the effects of factors such as skin tone, smoking status and PAD status on FPR and FNR.
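The definitions above can be expressed as a short calculation. The sketch below is a minimal illustration, assuming the reference reading stands in for the ‘true’ spO2; the function name and the example readings are made up and are not study data.

```python
THRESHOLD = 92  # decision threshold (%) used in the home-use pathways


def error_rates(pairs, threshold=THRESHOLD):
    """Return (FPR, FNR) for an iterable of (test_spo2, reference_spo2) pairs.

    A case is 'positive' when spO2 < threshold. The reference reading is
    treated as the true value, which is an assumption of this sketch.
    """
    false_pos = false_neg = negatives = positives = 0
    for test, reference in pairs:
        if reference >= threshold:          # truly negative case
            negatives += 1
            if test < threshold:            # recorded as positive
                false_pos += 1
        else:                               # truly positive case
            positives += 1
            if test >= threshold:           # recorded as negative
                false_neg += 1
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return fpr, fnr


# Hypothetical paired readings: (test device, reference device).
example = [(91, 93), (95, 96), (94, 95), (93, 94),
           (92, 90), (89, 91), (90, 90), (88, 89)]
fpr, fnr = error_rates(example)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Note that each rate has its own denominator: FPR is taken over truly negative cases and FNR over truly positive cases, so both depend on how the population's spO2 is distributed around the threshold.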
The accuracies of the five pulse oximeters were evaluated by comparison with a hospital-use pulse oximeter, an M1191BL digital probe (Philips, Eindhoven, the Netherlands) connected to an IntelliVue X3 Patient Monitor (Philips, Eindhoven, the Netherlands). The test pulse oximeters were: (1) Oxywatch MD300C19 (ChoiceMMed, Beijing, China), (2) Oxywatch MD300C29 (ChoiceMMed, Beijing, China), (3) PC-60F (Creative Medical, Shenzhen, China), (4) Contec CMS50D (Contec Medical Systems, Hebei, China) and (5) AM801 (Shenzhen Med-link Electronics Tech, Shenzhen, China).
Use and maintenance of the M1191BL digital probe and IntelliVue X3 Patient Monitor was in accordance with the manufacturer’s instructions and standard hospital practice. Measurements made using this device therefore represent ‘standard’ measurements that would be performed in a hospital setting in the absence of any home-oximetry pathway.
This evaluation was authorised by the Guy’s and St Thomas’ NHS Foundation Trust Quality Improvement and Patient Safety team. Informed consent was taken from all conscious participants, and the responsible nursing team advised on the inclusion of sedated patients. Participants’ clinical care was unaffected by inclusion in the study.
Patient and public involvement
Patients and members of the public were involved in the determination of this study’s value, and in the design of the information sheets issued to participants.
Patients with spO2 <85% (according to inpatient monitoring equipment) were excluded, together with unstable patients likely to experience acute changes to spO2 (eg, due to respiratory support or positional changes). Each participant underwent up to three sets of measurements, at least 90 min apart.
Inpatients on the intensive care units, recovery and respiratory wards who met the inclusion criteria were approached, directly or through their nursing teams, for inclusion in the study. Fifty patients were recruited in total.
Measurements were collected in pairs from a test device and the reference device, placed simultaneously on the ring and index finger. Each measurement was taken 30 s after placement of the pulse oximeter, in line with the longest stabilisation period recommended by any of the device manufacturers. This was repeated three times for each test device. Test order and finger selection were randomly determined for each session.
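The per-session randomisation described above could be generated along the following lines. The text does not say how the randomisation was actually performed, so this is a sketch under assumptions: the function name and the returned structure are hypothetical, and one finger assignment is drawn per session.

```python
import random

DEVICES = ["MD300C19", "MD300C29", "PC-60F", "CMS50D", "AM801"]
REPEATS_PER_DEVICE = 3   # each test device was measured three times
STABILISATION_S = 30     # seconds to wait after placement before reading


def session_plan(rng):
    """Return a randomised plan for one measurement session (hypothetical)."""
    order = DEVICES[:]
    rng.shuffle(order)                      # random test order
    test_finger = rng.choice(["index", "ring"])
    reference_finger = "ring" if test_finger == "index" else "index"
    return {
        "order": order,
        "test_finger": test_finger,
        "reference_finger": reference_finger,
    }


plan = session_plan(random.Random(0))       # seeded only for repeatability
print(plan)
```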
Skin pigmentation was quantified using the Fitzpatrick Skin Pigmentation (FSP) scale and recorded alongside spO2 measurements. Patients’ ages, genders, PAD and smoking statuses were extracted from their medical notes.
The effects of device model, gender, age, smoking status, PAD status, FSP score and spO2 (as recorded using the reference device) on bias, and on the likelihoods of a false positive or false negative result were evaluated using multiple effects analyses.
The effect on bias was modelled using linear multiple regression analysis, with significance calculated using t-tests. The effects on false positive and false negative likelihoods were modelled using binary logistic regression, with significance calculated using the Wald test. Results were considered significant where p<0.05.
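As a simplified illustration of the bias model, the sketch below fits a single-predictor ordinary least squares line (bias against reference spO2) and computes the t-statistic for the slope. This is not the study's analysis, which used multiple predictors in SPSS; the function name and the data are invented for illustration.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (b, t), where t = b / SE(b) tests the hypothesis b = 0 and
    would be compared with a t-distribution on n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(residual_ss / (n - 2) / sxx)
    return b, b / se_b


# Hypothetical data: reference spO2 (%) and bias (test minus reference, %).
reference_spo2 = [88, 90, 92, 94, 96, 98]
bias = [0.5, 0.0, -0.5, -1.0, -1.0, -2.0]
slope, t_stat = slope_t_statistic(reference_spo2, bias)
print(f"slope = {slope:.3f} per 1% spO2, t = {t_stat:.2f}")
```

A negative slope here corresponds to bias becoming more negative as spO2 increases. A full analysis would add the other predictors (device model, FSP score, age and so on) and use logistic regression for the false positive and false negative outcomes.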
In all cases, an initial analysis was conducted on the data set as a whole with subsequent analyses performed for each test device model individually. All analyses were performed using the IBM SPSS software, release 188.8.131.52.
Subject characteristics are shown in table 1.
Multiple regression analyses
Aggregated data set
Table 2 shows the results of the multiple effects analyses with the aggregated data from all five test pulse oximeters. This allows investigation of the differences between the reference device and the general performance of test pulse oximeters.
The relationships between bias and its significant predictors are shown in figure 1.
Bias is plotted against device model (top left), FSP (top right), subject spO2 as measured by the reference device (bottom left) and subject age (bottom right). Zero bias is shown for each plot with a red dashed line. Error bars show the bias SD associated with each data group.
On average, measurements from all five test pulse oximeters were lower than those from the reference device. The mean biases were −1.1% (AM801), −2.1% (CMS50D), −0.6% (MD300C19), −1.3% (MD300C29) and −0.1% (PC-60F). The associated SD were 2.2%, 3.3%, 2.8%, 3.7% and 2.6%, respectively. These SD overestimate the devices’ RMS error, as they also include reference device error (quoted by the manufacturer as 2.5% RMS).
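For readers who wish to reproduce this kind of summary from paired readings, the sketch below computes the mean bias, the SD of the differences and the RMS difference. The function name and readings are hypothetical. Bias is taken as test minus reference, so a negative value means the test device under-reads relative to the reference.

```python
import math
import statistics


def bias_summary(test_readings, reference_readings):
    """Return (mean bias, SD of differences, RMS difference) for paired readings."""
    diffs = [t - r for t, r in zip(test_readings, reference_readings)]
    mean_bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)                      # sample SD (n - 1)
    rms = math.sqrt(statistics.fmean([d * d for d in diffs]))
    return mean_bias, sd, rms


# Hypothetical paired readings (%), not study data.
test = [94, 95, 90, 97]
reference = [96, 95, 92, 97]
mean_bias, sd, rms = bias_summary(test, reference)
print(f"bias = {mean_bias:.1f}%, SD = {sd:.2f}%, RMS = {rms:.2f}%")
```

The RMS difference combines the mean bias and the spread of the differences. Because each difference contains error from both devices, these summaries describe agreement with the reference device rather than the test device's error alone.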
Subject spO2 (as measured using the reference device) was a significant predictor of bias, false negative likelihood and false positive likelihood. Bias became more negative with increasing spO2. The directions of the effects on FPR and FNR indicate that false results are more likely for patients whose spO2 is close to the threshold value of 92%, as would be expected. Device model was also a significant predictor of bias, false negative likelihood and false positive likelihood.
Skin tone was a significant predictor of bias and of false negative likelihood. Bias was more negative for subjects with darker skin. False negative likelihood was lower for subjects with darker skin; this is consistent with a stronger tendency for test pulse oximeters to under-read for these subjects.
A significant effect existed between smoking status and FNR, with current smokers less likely to receive a false negative result than non-smokers.
The only other significant effect was between subject age and bias. The test pulse oximeters’ tendency to under-read was greater for older subjects. However, the effect size was too small to be clinically significant.
Results for individual test pulse oximeters
Factors (such as age, gender, FSP score, smoking status, PAD and reference spO2 range) that may affect measurement bias, FPR and FNR were considered, and table 3 summarises the statistically significant results from the multiple effects analyses for individual test pulse oximeters. A summary of all the results (including non-significant results) is given in the online supplemental material.
Subject spO2 (as measured using the reference device) had a significant effect on bias for all pulse oximeters except the CMS50D and MD300C29. The effect direction was the same in all cases, with bias becoming more negative for greater spO2. The effect of spO2 on false positive likelihood was also fairly consistent, being significant for all devices other than the MD300C19 and MD300C29. The direction of this effect was also the same in all cases, indicating that false positive likelihood increases for subjects whose spO2 is close to the 92% threshold value. The equivalent effect on false negative likelihood (with likelihood increasing for spO2 close to the threshold) was significant for the AM801 only.
Skin tone had a significant effect on bias for the MD300C19 and MD300C29, with bias becoming more negative for subjects with darker skin.
PAD status had a significant effect on bias for the MD300C29 and PC-60F. The effects were in opposite directions, with the MD300C29 tending to underestimate saturation level compared with the reference device for subjects with a PAD diagnosis, and the PC-60F tending to overestimate saturation level compared with the reference device.
The only other significant effect was between smoking status and bias for the MD300C19. This pulse oximeter was more likely to over-read in current smokers.
Observed error rates
The observed FNRs and FPRs in the study population are shown in table 4. Any attempt to generalise the results shown in table 4 should account for variation in population characteristics, particularly spO2 distribution. The spO2 distribution of the study population is therefore presented in online supplemental figure 2.
False negative and false positive likelihood vary depending on subject spO2, being greatest for subjects with spO2 close to the threshold value of 92%. The pulse oximeters’ tendency to under-read might be expected to produce a higher FPR than FNR; however, the opposite was true in this case. This was due to the high proportion of positive cases with spO2 close to the threshold value (46.5% within 2% of this value), so small biases were more likely to cause false negative results. The proportion of negative cases close to the threshold was smaller (20.1% within 2% of this value), so larger biases were generally required to produce false positive results.
The results indicate that, for the study population, both the FPR and FNR were appreciable, with approximately one in four positive cases incorrectly identified as negative, and one in nine negative cases incorrectly identified as positive by the portable pulse oximeters.
Level of agreement
All five portable pulse oximeters tended to under-read spO2 relative to the reference device. The bias SD was between 2% and 4% in all cases. This would be expected to include the error associated with the portable pulse oximeters, as well as that associated with the reference device, suggesting that all devices meet the 4% RMS error requirement in the ISO 80601-2-61 standard. However, this error would be enough to affect clinical decisions, particularly for patients with spO2 close to a decision threshold (as a small bias may be sufficient to push these patients’ measurements across the threshold).
Pulse oximeter performance
This study evaluated the individual performance of five portable pulse oximeters, and also their aggregated performance, to identify the performance attributes they shared. More significant effects were identified for the aggregated data set than for the individual devices. This is likely partly due to the smaller sample size, and consequently lower analytical power, for the individual models. Power was particularly limited for the analyses of false negative likelihood, due to the small number of subjects with spO2 readings below 92%.
The aggregate analysis showed that pulse oximeter error (whether expressed in terms of bias, false positive likelihood or false negative likelihood) varied with patient characteristics and, independently, with device model. A personalised approach to spO2-based decision-making could account for both effects by modelling these dependencies.
Dependency on spO2
The variation in pulse oximeter performance with saO2 (as measured by CO-oximetry) is well documented, and is permitted by the ISO 80601-2-61 standard.10 11 16 17 Pulse oximeters are known to perform more poorly at lower saO2.10 11 16 17 This study suggests that a separate dependency exists between the performance of the portable pulse oximeters and that of the ‘hospital standard’ device used as reference. The direction of this dependency was the same for all five pulse oximeters, with all five under-reading by a greater amount at higher spO2. However, this effect only reached statistical significance for three of the five devices when considered individually.
Within the home-use oximetry pathways, the effect of this dependency is likely to be modest. Irrespective of this effect, the patients most likely to be wrongly classified remain those whose spO2 is closest to the decision threshold. The relationship between bias and spO2 is not strong enough to substantially increase the likelihood of false positive results among patients with high spO2 (and consequently higher bias).
Dependency on skin colour
Previous studies have reported that pulse oximeters tend to over-read spO2, relative to CO-oximeters, for dark-skinned subjects whose saO2 is very low.16 17 This study did not include subjects with spO2 low enough to be influenced by this effect. However, the data do suggest that a separate effect exists for the portable pulse oximeters. These devices appear to under-read spO2 more for subjects with darker skin than those with lighter skin. This effect would be expected to increase the likelihood of false positive results, and reduce the risk of false negative results in this patient group. There was evidence of this effect in the data, although it did not achieve significance in all cases.
In the context of the home oximetry pathways, this effect could mean that failure to appropriately escalate is less likely for darker-skinned patients, and that inappropriate escalation is more likely.
Other dependencies were found to exist between age and bias, between smoking status and FNR, between PAD status and bias for specific pulse oximeter models, and between smoking status and bias for one model. These effects were not consistent for different devices or between the aggregate data set and the device-specific data sets. It is difficult to draw firm conclusions from these results; however, they would merit further investigation if widespread deployment of the affected devices were planned.
Observed and forecasted error rates
The observed FNR and FPR for all five pulse oximeters are given in table 4. These parameters are strongly dependent on population characteristics, particularly spO2 distribution. Caution should be exercised before generalising these results to other populations. Any generalisation should be informed by a comparison of the respective population characteristics (see table 1 and the online supplemental material). These results provide a first estimate of the likely practical implications of deployment of these devices.
Within a home-oximetry pathway, a high FNR may result in failure to detect clinical deteriorations, representing a patient safety risk. This study found the FNR to be appreciable in the study population (table 4). The associated risks may be reduced by emphasising the importance of other information available to the responsible clinical team, including regular communications and the symptom diary—in line with elaborated advice.2
The home oximetry pathways were intended to reduce hospital bed occupancy by allowing some COVID-19 patients, who would otherwise occupy a hospital bed, to instead be monitored at home. Use of pulse oximeters with high FPR would tend to increase (re)admission and reduce this benefit. The results of this study suggest that this effect would be modest in size for these pulse oximeters.
Choice of reference measurement device
Due to practical constraints on the protocol, a widely used hospital pulse oximeter was used as the reference device in this study, rather than a CO-oximeter. This study, therefore, gives an indication of how portable pulse oximeters compare to hospital-based oximetry. It is not possible to definitively determine the devices’ accuracies from these results.
The relevant FDA and ISO standards require that pulse oximeters are validated across an spO2 range of 70%–100%. This study only included patients with spO2 ≥85%. It is therefore not possible to comment on these devices’ performance for spO2 <85%. In practice, errors large enough for a patient with spO2 <85% to receive a measurement ≥92% are unlikely; this limitation is therefore unlikely to affect interpretation with respect to the home oximetry pathways.
This study considered a small subset of the home-use oximeters available on the market. A comprehensive analysis of all available models would be logistically challenging and might, for example, require aggregation of data from multiple centres over longer time periods.
Further research, with a larger sample size and use of CO-oximeter measurements as reference, is required to precisely determine these devices’ bias. A simple approach to accounting for this bias would be to apply the offset back to observed data, or to modify the threshold value. More sophisticated techniques, which account for the test’s receiver-operator characteristics, are also available.26 27 Another alternative would be to use a longitudinal deterioration model as described by Prower et al.28
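The two simple approaches mentioned above, removing the offset from each reading or shifting the threshold, lead to the same classification, as the sketch below illustrates. The function names are hypothetical. The example offset of −2.1% is the mean bias reported in this study for the CMS50D relative to the reference pulse oximeter, used purely for illustration; it is not a validated correction against CO-oximetry.

```python
def corrected_reading(observed_spo2, mean_bias):
    """Remove an estimated mean bias (test minus reference) from a reading."""
    return observed_spo2 - mean_bias


def adjusted_threshold(threshold, mean_bias):
    """Shift the decision threshold by the bias instead of changing the reading."""
    return threshold + mean_bias


def positive_by_correction(observed, mean_bias, threshold=92.0):
    return corrected_reading(observed, mean_bias) < threshold


def positive_by_threshold(observed, mean_bias, threshold=92.0):
    return observed < adjusted_threshold(threshold, mean_bias)


EXAMPLE_BIAS = -2.1  # illustrative offset (%), see the caveat above

for observed in (89.0, 91.0, 93.0):
    a = positive_by_correction(observed, EXAMPLE_BIAS)
    b = positive_by_threshold(observed, EXAMPLE_BIAS)
    print(observed, a, b)
```

A constant offset corrects only the average error. It does not reduce the spread of individual readings, nor does it capture the dependence of bias on spO2, skin tone or device model reported in this study.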
Conclusions and final remarks
All the portable pulse oximeters investigated out-performed the standard to which they were manufactured. When used within a home-use pathway, pulse oximetry measurements must be interpreted in isolation, without the benefit of the other information available to a clinician reviewing a patient who is physically present (such as observable breathlessness and work of breathing). This creates an increased risk of error. Notwithstanding methodological limitations discussed, the variation between the pulse oximeters upholds this consideration.
This supports the use of trend analysis, which may give more robust indications and mitigate inter-oximeter variation. The practice of observing baselines and trends goes beyond the initial SOP documents but is clear in elaborated guidance.2 Other risk mitigations include use of conservative escalation thresholds, and inclusion of other clinical information from patients’ symptom diaries and from telephone contact with clinicians. Advanced approaches might include use of a priori data and analyses to tailor these approaches for individual patients and/or devices.
There was also evidence of performance being dependent on subject skin tone. This would tend to increase the likelihood of inappropriate escalations of patients with darker skin, and reduce the risk of these patients not being escalated when appropriate.
Patient consent for publication
This evaluation (12094) was authorised by the Guy’s and St Thomas’ NHS Foundation Trust Quality Improvement and Patient Safety team. Non-sedated participants gave informed consent to participate in the study before taking part. Standard ethics processes were followed for consent of sedated patients.
The authors would like to thank the team of training Clinical Scientists who came together to carry out data collection: namely Charlotte Jones, Ruphinder Kaur, Amy Morris and Alexander Mitton. They were supported by Clinical Scientists Adam Shortland, Nicola Fry and Shima Maqsood. The authors would also like to thank Marlies Ostermann and Ruth Thomsen (NHSE&I Medical Directorate) for their support and facilitation of this work. Finally, we would like to thank the members of the public who helped us to verify the importance of this study, and who helped us to redraft our information sheets.
Contributors DS carried out data postprocessing, modelling and analysis, and led the drafting of the manuscript. JJN was the principal investigator, had input to protocol design and advised on ethics and governance. RHK had input into protocol design, particularly around data collection. MTK also supported drafting, modelling and analysis. MJRJ and LJ coordinated data collection. GG provided specialist input to the protocol design and to the interpretation of results. EA was the chief investigator, had input to protocol design, modelling and analysis, and the drafting of the manuscript. EA was responsible overall for the content as guarantor.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.