Discussion
Here, we describe a rigorous method for manually annotating coughs from continuous audio recordings. We also describe how different units of cough analysis can be applied to these annotations, and we use summary statistics from linear and Bland-Altman analyses to assess agreement within and between human annotators across different metrics. Finally, we analyse individual cough length and epoch size by sex and diagnosis.
The cough recordings used in this study come from a diverse group of participants, which allowed the cough metrics and annotation process to be evaluated in samples from males and females across a range of ages and acoustic environments. Although COVID-19 was the most prevalent cause of cough at the time of the study, over 40% of the participants in this sample had other causes of cough.
We collected over 400 hours of continuous audio, and human annotators labelled a random 10% plus the full 24 hours of two patients compatible with the intended use population of a cough monitor. In this sample, 4112 labels were placed, encompassing a broad range of possible labels including coughs (over 50% of all labels), throat clears, sneezes and other cough-like sounds.
It is known that human annotators show better agreement on cough counts than on other aspects of cough such as severity, strength or quality.17 18 In our data, intraobserver and interobserver agreement on the number of coughs per unit of time was high regardless of the unit of measure or analysis strategy used. This is consistent with previous studies on this topic18–23 as well as with studies assessing agreement on the numerical assessment of other clinical processes such as the respiratory rate.24
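As an illustration of the Bland-Altman summary statistics referred to in this discussion, the following is a minimal sketch rather than the analysis code used in the study. It assumes paired hourly cough counts from two annotators (or two passes by the same annotator) and the conventional 1.96 SD limits of agreement; the function name `bland_altman` and the example counts are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(counts_a, counts_b):
    """Return (bias, lower limit, upper limit) for paired counts.

    Assumption: each position holds the counts given by two annotators
    for the same hour of audio. Limits of agreement use the conventional
    bias +/- 1.96 * SD of the paired differences.
    """
    if len(counts_a) != len(counts_b) or len(counts_a) < 2:
        raise ValueError("need two paired sequences with at least 2 values")
    diffs = [a - b for a, b in zip(counts_a, counts_b)]
    bias = mean(diffs)      # mean difference between annotators
    sd = stdev(diffs)       # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical hourly cough counts from two annotators
annotator_1 = [10, 12, 8, 15]
annotator_2 = [9, 12, 9, 13]
bias, lower, upper = bland_altman(annotator_1, annotator_2)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias close to zero with narrow limits indicates that the two sets of counts can be used interchangeably for the hours compared.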
Far-labelled coughs assessed by the same annotator show a high numerical slope with a correlation of 0.6, which suggests that numerical corrections can improve performance. However, the low correlation obtained between labellers means that the ‘far’ tag will not suffice on its own to assess acoustic contamination.
We described several ways to express cough frequency. As previously described, coughs, mobile cough seconds and fixed cough seconds are all highly correlated, and any of them could be used to reflect a user’s cough by applying a correction factor between them.2 The number of intra-annotator disagreements per hour, although low across all cough metric units, is reduced by almost 50% when mobile cough seconds are used instead of coughs alone (table 4). Hence, for evaluating automatic systems, mobile cough seconds are a more reproducible metric. All three metrics can be reported in a way that has intuitive value to patients and providers, as a ‘coughing rate per hour’.
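To make the relationship between the three units concrete, the sketch below counts each of them from a list of cough onset times. The definitions are not given in this section, so the ones used here are assumptions for illustration only: a fixed cough second is taken to be a clock-aligned 1 s bin containing at least one cough onset, and a mobile cough second is taken to be a 1 s window that opens at the first cough onset not already covered by a previous window. The function names and example timestamps are hypothetical.

```python
import math


def fixed_cough_seconds(onsets):
    """Count clock-aligned 1 s bins holding at least one cough onset.

    Assumed definition, for illustration only.
    """
    return len({math.floor(t) for t in onsets})


def mobile_cough_seconds(onsets, window=1.0):
    """Count windows that open at an uncovered cough onset.

    Assumed definition, for illustration only: each window lasts
    `window` seconds and absorbs every onset that falls inside it.
    """
    count = 0
    window_end = -math.inf
    for t in sorted(onsets):
        if t >= window_end:          # onset not covered: open a new window
            count += 1
            window_end = t + window
    return count


# Hypothetical cough onset times in seconds from the start of a recording
onsets = [0.9, 1.1, 1.5, 3.0, 3.95, 4.05]
print(len(onsets), fixed_cough_seconds(onsets), mobile_cough_seconds(onsets))
```

Under these assumed definitions, closely spaced explosive peaks collapse into the same cough second, so an annotator's decision to split or merge them does not change the count.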
This study also explored two different ways to quantitatively assess the accuracy of automated cough monitoring in comparison with human listeners. Comparing on an event-by-event basis in terms of specific coughs is problematic because, as this report demonstrated, annotators cannot reliably discern whether sequential explosive peaks are separate coughs or part of the same cough epoch. Comparing instead on the basis of specific cough seconds minimises this source of noise and allows the sensitivity and specificity of an automatic cough counter for individual coughs to be calculated. This approach to describing performance is standard in many settings, such as diagnostic testing. However, performance at the level of individual coughs is not the most appropriate measure, because the clinically relevant question concerns coughing trends and totals rather than any single cough, and is therefore best expressed as a cough frequency or rate. Thus, we propose that the clinically relevant performance metric is the correlation of hourly cough-second rates, which can be expressed as a Pearson correlation, y-intercept and slope, and which is highly correlated with the ‘raw’ cough rate.
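The three summary statistics named above can be computed from paired hourly rates as in the following minimal sketch. It assumes one list of hourly cough-second rates from human annotation (the reference) and one from the system under evaluation, and fits an ordinary least-squares line; the function name `linear_agreement` and the example rates are hypothetical, and this is not the study's own analysis code.

```python
from statistics import mean


def linear_agreement(reference, test):
    """Return (Pearson r, slope, intercept) for paired hourly rates.

    Assumption: `reference` holds human-annotated hourly rates and
    `test` the rates from the system being evaluated, hour by hour.
    The line is an ordinary least-squares fit of test on reference.
    """
    if len(reference) != len(test) or len(reference) < 2:
        raise ValueError("need two paired sequences with at least 2 values")
    mx, my = mean(reference), mean(test)
    sxx = sum((x - mx) ** 2 for x in reference)
    syy = sum((y - my) ** 2 for y in test)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, test))
    if sxx == 0 or syy == 0:
        raise ValueError("rates must vary for r and slope to be defined")
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return r, slope, intercept


# Hypothetical hourly cough-second rates
human = [0, 10, 20, 30]
automatic = [1, 9, 17, 25]
r, slope, intercept = linear_agreement(human, automatic)
print(f"r={r:.3f}, slope={slope:.3f}, intercept={intercept:.3f}")
```

A slope near 1 and an intercept near 0 indicate that the automated rate tracks the human rate without systematic over- or under-counting; a high correlation with a slope far from 1 suggests that a correction factor may be applicable.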
Manual cough counts are the most commonly used gold standard for determining the accuracy of automatic cough detectors.9 10 Recorded audio has been shown to be equivalent to video recordings for assessing cough frequency.25 When human annotators are involved, a visual depiction of the sounds has proven useful.23 While Audacity has been used successfully in the past, it requires manual handling of data, which is prone to human error. Hyfe’s labelling software is reported by annotators to be easy to use and allows for automatic management of databases in the backend, reducing the likelihood of errors in data management. The completed and ongoing studies using the annotation protocol described here have informed the most recent update to the continuous cough sound annotation standard operating procedure, which can be accessed here.
There are limited published data on the duration of cough sounds.16 26 27 The normal duration of a cough sound varies in the literature from 0.3 to 1.0 s,27 with lengthening described in association with disease or smoking status.27 Despite well-described sex differences in cough severity and cough reflex,28–31 there are few data on differences in cough sound length. A previous study using only 234 coughs from 24 participants found shorter coughs in males.32 Similarly, literature describing differences in cough epoch size is scarce.
Here, we show that, in this cohort, women’s cough sounds are significantly shorter, while their epochs tend to contain more coughs. Voluntary suppression of cough by women has been proposed as a mechanism for the development of specific infections in poorly drained lung regions,33 and our findings further support this concept. Voluntary suppression and shorter coughs may contribute to suboptimal airway clearance, hence driving longer cough epochs in women.
The findings of specific differences in cough length and epoch size associated with COVID-19 are also worthy of further exploration.
Among the limitations of this study, although 400 hours of continuous audio were collected, these come from a relatively small number of patients, and only 40 hours were selected for triple labelling. A proper evaluation of sex differences in cough sound length requires a much larger sample size, corrected for clustering at the patient level.
In summary, we describe the performance of different metrics and analysis methods for describing agreement between cough labellers. We use a robust dataset of 40 hours of continuous audio sampled from a total of over 400 hours collected from a diverse group of participants. The most intuitive way to annotate coughs is by time stamping the first explosive phase. However, this creates ambiguity as to whether sequential peaks are the same or different coughs. In contrast, we use segment-length labels and count cough seconds, and we show that this metric can be used interchangeably with coughs while still capturing the clinically meaningful quantitative significance of cough.
Finally, we describe sex differences in cough sound length and epoch size. These findings have implications for sex disparities in disease progression, disease awareness and diagnosis associated with cough as a syndrome, and in disease transmission.