No test, no diagnosis? A cautionary tale


This blog post is adapted from Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning, MLHC 2022.

Where does ground truth come from?

Let’s start with a standard supervised binary classification problem. Since I work in machine learning for healthcare, let’s say that we’re trying to predict whether a patient will develop sepsis in the next day. We’ve collected a lot of data about a patient’s health, such as their medical history, their recent vitals, some demographics, and more – enough to train a nice binary classifier. In healthcare, we often want to perform risk stratification, or “rank” patients by their severity, so assume that this classifier outputs some probability for sepsis ($Y$) given the covariates (input variables; $X$): $P(Y \mid X)$.

If you’ve played a bit with basic machine learning, you probably feel an itch in your fingers: perhaps you’d like to open up a Jupyter notebook, import your favorite thing from scikit-learn, and call .fit() faster than you can say “logistic regression.” But let’s slow down and look at our data. Specifically, I’d like to zoom in on that column for $Y$: the labels for sepsis.
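For concreteness, that reflexive workflow looks roughly like the sketch below (the file name and feature columns are made up for illustration, and logistic regression stands in for whatever model you'd reach for). Notice that the `y` column is taken entirely at face value.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical cohort table: one row per patient, with a "sepsis" label column.
# The file name and column names are illustrative, not from a real dataset.
df = pd.read_csv("cohort.csv")
X = df[["age", "heart_rate", "resp_rate", "temperature", "lactate"]]
y = df["sepsis"]  # <-- the label column, taken entirely at face value

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Risk scores: estimates of P(Y = 1 | X), used to rank patients by severity.
risk_scores = clf.predict_proba(X_test)[:, 1]
```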

What is $Y$?

If your answer was “ground truth, duh!” – congratulations, partial credit to you. But who – or what – is actually responsible for filling in entries in the $Y$ column? Data – including labels – don’t simply fall from the sky.

In some cases, we have to rely on human annotators, and annotation can get noisy – a large body of work has studied that problem. In healthcare, luckily, we often have rule-based definitions, such as the Sepsis-3 criteria for sepsis.

But in this pipeline, there is one critical decision point: whether the patient gets tested at all. Even if we believe that laboratory tests perform perfectly and aren’t biased, that our medical devices for measuring vitals perform perfectly, and everything else – human biases can still enter the decision of who gets tested in the first place. A patient who is never tested can never satisfy a test-based definition, so they get recorded as negative by default: no test, no diagnosis.
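To make that concrete, here's a toy sketch of how an observed label might get written down. The rule is a deliberately simplified stand-in for something like the Sepsis-3 criteria (suspected infection plus an acute SOFA increase of at least 2), and the function name and arguments are my own; the important part is the first branch.

```python
def observed_label(tested: bool, suspected_infection: bool, sofa_increase: int) -> int:
    """Toy stand-in for a rule-based sepsis label (heavily simplified Sepsis-3).

    The clinical rule can only be evaluated if the relevant tests and labs
    were actually ordered; otherwise the record simply shows no sepsis.
    """
    if not tested:
        return 0  # no test, no diagnosis: the label silently defaults to negative
    return int(suspected_infection and sofa_increase >= 2)


# Two patients with identical physiology; only the testing decision differs.
observed_label(tested=True,  suspected_infection=True, sofa_increase=3)   # -> 1
observed_label(tested=False, suspected_infection=True, sofa_increase=3)   # -> 0 (silent false negative)
```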

(A brief disclaimer: while I have collaborated with clinicians on this work, I’m not a doctor, and I don’t have clinical experience. So take this as an informed example that’s reasonably anchored in reality, not a literal account of what happens in healthcare.)

Labels can’t talk: silent label biases

When machine learning goes wrong

If some patients are systematically less likely to be tested than others, their labels are systematically more likely to be silent false negatives – and a model trained on those labels can end up performing worse for exactly those patients. It’s not the algorithm’s fault that it has learned a classifier whose performance differs across groups. From an optimization perspective, the algorithm completed the assignment – fit $X$ to $Y$ – but we’ve given it faulty $Y$.
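Here's a small simulation of that failure mode under assumed numbers: both groups have the same true sepsis rate, but group B is tested half as often as group A. The observed labels make group B look about half as sick as it really is, and any model fit to those labels inherits the distortion.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Purely illustrative parameters: equal true prevalence, unequal testing rates.
true_sepsis_rate = 0.10
test_rates = {"A": 0.8, "B": 0.4}

for group, p_test in test_rates.items():
    y_true = rng.random(n) < true_sepsis_rate   # true condition (unobserved)
    tested = rng.random(n) < p_test             # clinician's decision to test
    y_obs = y_true & tested                     # recorded label: no test, no diagnosis

    print(f"Group {group}: true prevalence {y_true.mean():.3f}, "
          f"observed prevalence {y_obs.mean():.3f}")
```

Under these assumptions, both groups have a true prevalence near 10%, but the observed labels suggest group B has roughly half the sepsis rate of group A.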

The moral of the story is that we need to carefully verify where our labels came from, especially in high-stakes situations.

So what happens to all the datasets that are already out there, with labels that are potentially biased? Let’s find out in the next installment!

Contact: ctrenton at umich dot edu