To the Editor — Over the next decade, systems centered on artificial intelligence (AI), particularly machine learning, are predicted to become key components of many workflows in the health sector. Medical diagnosis is seen as one of the first areas that may be revolutionized by AI innovations. Indeed, more than 90% of health-related AI systems that have received regulatory approval from the US Food and Drug Administration belong to the field of diagnostics1.

In the current paradigm, most diagnostic investigations require interpretation by a clinician to identify the presence of a target condition, a crucial step in determining subsequent treatment strategies. Although this interpretation is essential to the provision of patient care, many health systems find it increasingly difficult to meet the demand for it. To address this issue, diagnostic AI systems have been characterized as medical devices that may alleviate the burden placed on diagnosticians by serving as case-triage tools, enhancing diagnostic accuracy and stepping in as second readers when necessary. As AI-centered diagnostic test accuracy (AI DTA) studies emerge, there has been a concurrent rise in systematic reviews that synthesize the findings of comparable studies.

Notably, 94% of these published AI DTA systematic reviews have been conducted in the absence of an AI-specific quality assessment tool2. The most commonly used instrument is the quality assessment of diagnostic accuracy studies (QUADAS-2) tool3, which assesses risk of bias and applicability and whose use is encouraged by current PRISMA 2020 guidance4. However, QUADAS-2 does not accommodate the niche terminology encountered in AI DTA studies, nor does it alert researchers to the sources of bias particular to this class of study. Examples of such biases, framed against the established domains of QUADAS-2 (patient selection; index test; reference standard; and flow and timing), are listed in Table 1.

Table 1 | Examples of bias within AI DTA studies

To tackle these sources of bias, as well as AI-specific issues such as algorithmic bias, we propose an AI-specific extension to QUADAS-2 and to QUADAS-C5, a risk-of-bias tool developed for comparative accuracy studies. This new tool, termed QUADAS-AI, will provide researchers and policy-makers with a dedicated framework for evaluating risk of bias and applicability when conducting reviews of AI DTA studies, as well as reviews of comparative accuracy studies in which at least one index test is AI-centered.

QUADAS-AI will complement ongoing reporting guideline initiatives such as STARD-AI6 and TRIPOD-AI7. It is being coordinated by a global project team and steering committee consisting of clinician scientists, computer scientists, epidemiologists, statisticians, journal editors, representatives of the EQUATOR Network11, regulatory leaders, industry leaders, funders, health policy-makers and bioethicists. Given the reach of AI technologies, we believe that connecting global stakeholders is of the utmost importance for this initiative, and we would welcome contact from any potential new collaborators.