Graduating AI: An AI practitioner's perspective on evaluation for medical devices

By Ben Hachey

As a parent of a nine-year-old, I'm facing a hard decision. At what point can he walk to school on his own? I know he knows how to do it. My job is to make sure he can do it consistently and safely.

Effective learning takes both knowledge and practice. And safety-critical activities require careful observation. Yet, in many situations, AI is deployed with surprisingly little assessment of how it will perform in the real world.

Standalone vs. machine-in-the-loop evaluation

In AI research, standalone evaluation of model performance is standard practice. By comparing model accuracy to human accuracy, it aims to demonstrate that a model can make predictions on par with human experts.
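As a rough illustration of what a standalone comparison involves, here is a minimal sketch that scores a model and a panel of experts against the same ground-truth labels. The case IDs, labels, and names such as `expert_a` are invented toy data, not results from any study, and plain accuracy is an assumption standing in for whatever metric a real evaluation would use.

```python
def accuracy(predictions, truth):
    """Fraction of cases where the prediction matches the ground-truth label."""
    correct = sum(1 for case, label in truth.items() if predictions[case] == label)
    return correct / len(truth)

# Hypothetical toy data: 1 = finding present, 0 = finding absent.
truth = {"case1": 1, "case2": 0, "case3": 1, "case4": 0}
model_preds = {"case1": 1, "case2": 0, "case3": 1, "case4": 1}
expert_preds = {
    "expert_a": {"case1": 1, "case2": 0, "case3": 0, "case4": 0},
    "expert_b": {"case1": 0, "case2": 0, "case3": 0, "case4": 0},
}

# Standalone evaluation: the model and the experts are scored separately.
model_acc = accuracy(model_preds, truth)
expert_acc = sum(accuracy(p, truth) for p in expert_preds.values()) / len(expert_preds)

# A positive gap is the kind of result behind "on par with experts" claims.
gap = model_acc - expert_acc
print(f"model={model_acc:.3f} experts={expert_acc:.3f} gap={gap:+.3f}")
```

Note that nothing in this comparison involves a clinician actually using the model's output.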

Standalone evaluation is often the basis of claims that a model is superhuman. In some jurisdictions, it’s also the basis of regulatory approval. However, standalone evaluation doesn’t reflect utility in a real-world setting.

To measure utility, we can instead look at how an AI model affects the clinical workflow. Here, we hope to demonstrate that doctors are more accurate when they have access to AI predictions.

MRMC measures diagnostic accuracy

We recently published an evaluation of the chest x-ray product. The product uses AI predictions to help radiologists interpret x-rays. The paper reports on a multi-reader, multi-case (MRMC) study measuring the product's effect on interpretation accuracy.

An MRMC study is a standard clinical research design for diagnostic accuracy, e.g., comparing digital versus film mammography. It measures radiologist accuracy, not model accuracy. For an AI-assisted workflow, each radiologist's accuracy is measured twice on the same cases: once interpreting alone and once with the help of the model.
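The paired design described above can be sketched in a few lines. This is a simplified illustration only: the reader names, cases, and labels are invented toy data, and a plain mean of per-reader accuracy differences is an assumption standing in for the formal MRMC statistical analysis a published study would use.

```python
from statistics import mean

def reader_accuracy(reads, truth):
    """Fraction of cases where one reader's call matches ground truth."""
    return mean(1.0 if reads[case] == label else 0.0 for case, label in truth.items())

def accuracy_change(unaided, aided, truth):
    """Per-reader accuracy change (aided minus unaided) and its mean across readers."""
    deltas = {
        reader: reader_accuracy(aided[reader], truth) - reader_accuracy(unaided[reader], truth)
        for reader in unaided
    }
    return deltas, mean(list(deltas.values()))

# Hypothetical toy data: every reader interprets every case in both arms.
truth = {"case1": 1, "case2": 0, "case3": 1, "case4": 0}
unaided = {
    "reader_a": {"case1": 1, "case2": 0, "case3": 0, "case4": 0},
    "reader_b": {"case1": 0, "case2": 0, "case3": 0, "case4": 0},
}
aided = {
    "reader_a": {"case1": 1, "case2": 0, "case3": 1, "case4": 0},
    "reader_b": {"case1": 1, "case2": 0, "case3": 0, "case4": 0},
}

deltas, mean_delta = accuracy_change(unaided, aided, truth)
print(deltas, mean_delta)
```

The quantity of interest is the change in human accuracy, so the model's own predictions never appear in the calculation.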

The results of our study show that an AI assistant can improve human performance:

  • Overall radiologist accuracy improves by 13%.
  • Radiologist accuracy improves on 80% of classes.
  • Critically, there’s no significant decrease for any finding.

Health and AI outcomes

The potential impact is exciting. It promises to reduce errors in patient care, improve radiologist efficiency and increase the accuracy of comprehensive chest x-ray interpretation for the first time in decades.

I don't know yet when my son will walk to school on his own. But I’m convinced that working on AI medical devices is an incredible opportunity to establish best practice for applied AI. Our MRMC study demonstrates one way to measure model impact on machine-in-the-loop workflows.




Seah et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multi-case study. Lancet Digital Health, 3(8), e496-e506, 2021.

Jones et al. Chest radiographs and machine learning – Past, present and future. Journal of Medical Imaging and Radiation Oncology, 65(5), 538-544, 2021.

Agarwal et al. Prediction machines: The simple economics of artificial intelligence. HBR Press, 2018.