Deep learning algorithms hold huge potential for healthcare diagnostics, but they’re not a silver bullet. As adoption of these tools increases, it’s crucial that healthcare professionals and decision makers consider algorithmic success metrics in their proper context.
Beth Israel Deaconess Medical Center and Harvard Medical School recently released a study in which researchers built a deep learning algorithm that uses image-based classification to assess whether a cluster of lymph node cells contains cancer, achieving a 92 percent success rate. While a 92 percent diagnostic success rate is typically more favorable than, say, a 90 percent one, in a field with stakes as high as healthcare's, the nature of those mistakes matters.
For simplicity’s sake, let’s consider a similar (but fictional) study of a 92 percent accurate, algorithmically trained diagnostic applied to exactly 10,000 patients. The researchers’ deep learning algorithm would return a correct diagnosis in 9,200 cases, and an incorrect diagnosis in 800 cases. Now, each of these 800 errors would fall into one of two categories: a false positive, wherein the algorithm reported that a patient had cancer when in reality they did not, or a false negative, wherein the algorithm reported that a patient was cancer-free when in reality they were not.
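The point is easy to make concrete in a few lines of code. In this hedged sketch, the two error splits are invented for illustration; they are chosen only to match the fictional study's totals. Accuracy collapses the four confusion-matrix counts into one number, so two algorithms with very different error profiles can score identically:

```python
# Hypothetical illustration: two diagnostic algorithms, both 92 percent
# accurate on 10,000 patients, with very different error profiles.
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases (true/false positives/negatives) diagnosed correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Scenario A: most of the 800 errors are false positives.
acc_a = accuracy(tp=900, tn=8300, fp=750, fn=50)
# Scenario B: most of the 800 errors are false negatives (missed cancers).
acc_b = accuracy(tp=200, tn=9000, fp=50, fn=750)

print(acc_a, acc_b)  # both 0.92 -- accuracy alone cannot tell them apart
```

Scenario B misses fifteen times as many cancers as Scenario A, yet the headline accuracy figure is the same.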
Needless to say, a return of 800 false negatives would have the potential to be far more catastrophic than a return of 800 false positives. A false positive causes a great deal of unnecessary stress, to be sure, but the error is likely to be spotted fairly quickly by follow-up testing. A false negative, on the other hand, greatly increases the likelihood of a missed diagnosis, raising the risk that a patient will fail to undergo potentially life-saving treatment in a timely manner.
As such, "accuracy" in and of itself, or how frequently a deep learning algorithm makes the correct assessment, is an insufficient indicator of algorithmic success in the context of medical diagnosis. When the consequences of an algorithm's errors matter for real-world decision making (especially in a medical context), we must evaluate not only statistical accuracy but also statistical "sensitivity" and "specificity."
The critical insight provided by sensitivity and specificity
In a diagnostic context, sensitivity — also commonly referred to as the “recall” or the “true positive rate” — measures the percentage of sick patients who are correctly identified as having the disease in question. Specificity — or the “true negative rate” — measures the percentage of healthy patients who are correctly identified as being disease-free.
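The two definitions above reduce to simple ratios over the confusion-matrix counts. A minimal sketch, where the function names and the patient counts are illustrative assumptions rather than figures from any study:

```python
def sensitivity(tp, fn):
    # True positive rate (recall): share of sick patients correctly flagged.
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate: share of healthy patients correctly cleared.
    return tn / (tn + fp)

# Hypothetical counts: 950 sick patients (900 caught, 50 missed) and
# 9,050 healthy patients (8,300 cleared, 750 falsely flagged).
print(round(sensitivity(tp=900, fn=50), 3))   # high sensitivity
print(round(specificity(tn=8300, fp=750), 3))
```

Note that sensitivity ignores healthy patients entirely, and specificity ignores sick ones; the two metrics answer separate questions that accuracy blends together.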
For the reasons discussed above, most medical screening tests are designed to be highly sensitive — that is, less likely to deliver a false negative — with the knowledge that the patient will also have to undergo a diagnostic test (which is usually highly specific). That said, different biases are built into different tests based on both what the test is screening for and what other tests will be conducted thereafter.
Research from Kidney International uses the detection of chronic kidney disease as an example of an instance in which a highly sensitive screening test is preferable. In detecting chronic kidney disease, an inexpensive dipstick test could be preferred as a screening test, allowing many individuals to be tested. In this test, it is important that all patients with chronic kidney disease have a positive test result (high sensitivity), whereas the number of patients with false-positive results (low specificity) is considered somewhat less important, as they would be quickly identified by a subsequent test.
Alternatively, a diagnostic test with lower sensitivity and specificity could be preferable when the alternative test is either invasive or carries a high risk of complications. As the report’s authors suggest, performing renal arteriography to test for renal artery stenosis is an invasive diagnostic method with potential complications. In many cases it might be preferable to replace arteriography with Doppler testing, which has 89 percent sensitivity and 73 percent specificity. In this way, HCPs and researchers are almost always working to contextualize and balance the positive and negative outcomes of different diagnostic tests.
Why context matters
In the end, statistical sensitivity is hugely consequential for some algorithms and entirely inconsequential for others. If we’re told that an image recognition algorithm pinpoints pictures of cats with 92 percent accuracy, the algorithm’s sensitivity should be of little interest. It doesn’t really matter whether the 8 per 100 mis-recognitions are cases of the algorithm identifying dogs as cats, overlooking cats in a picture that contains other items, or a mix of both. In this case, the cost of an incorrect prediction — while somewhat annoying — is trivial with respect to outcomes.
As a less trivial example, consider my field: marketing. In marketing, the ultimate arbiter of success is whether you achieve your stated KPIs. If, for example, a specific ad is served to a member of your target audience 85 percent of the time, you’ll probably be satisfied. Figuring out why the ad was improperly served 15 percent of the time will help you refine future campaigns, and thus spend your budget more wisely down the line. Some fraction of the ads were served to an inherently uninterested audience; exactly which uninterested audience received them doesn’t make your wasted spend any better or worse.
This, clearly, is not the case for deep learning algorithms deployed in a diagnostic setting. A missed diagnosis can quite literally be a matter of life and death, placing a tremendous onus on healthcare stakeholders to carefully consider not just a diagnostic algorithm’s accuracy, but its sensitivity, as well.
Generally speaking, an algorithm that is 90 percent accurate but highly sensitive presents less risk in a healthcare environment than an algorithm that is 95 percent accurate but not sensitive at all. This will be essential to remember as algorithmic diagnostics slowly trickle to market over the coming years and decades, and, hopefully, will be the decisive factor in which tools are exposed as high-risk novelties and which tools go on to make a significant positive impact on the way that diagnosticians approach their craft.
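To make that closing comparison concrete, here is a hypothetical sketch in which every count is invented for illustration: on 10,000 patients, 1,000 of whom actually have the disease, a 90-percent-accurate but highly sensitive algorithm misses far fewer cases than a 95-percent-accurate but poorly sensitive one:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy and sensitivity from hypothetical confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
    }

# Algorithm 1: 90 percent accurate, highly sensitive (misses 20 of 1,000 cases).
m1 = metrics(tp=980, tn=8020, fp=980, fn=20)
# Algorithm 2: 95 percent accurate, poorly sensitive (misses 480 of 1,000 cases).
m2 = metrics(tp=520, tn=8980, fp=20, fn=480)

print(m1)  # accuracy 0.90, sensitivity 0.98
print(m2)  # accuracy 0.95, sensitivity 0.52
```

Despite its lower headline accuracy, Algorithm 1 would miss 20 patients where Algorithm 2 would miss 480, which is exactly the risk profile the paragraph above describes.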