
ML Error Analysis with a Causal Perspective

 

In most machine learning projects, “error analysis” is treated as a descriptive exercise. We slice metrics by segment, look at confusion matrices, check feature importances, and maybe generate a few SHAP plots. All of that is useful, but it tends to answer only one question:

“Where is the model wrong?”

A causal perspective pushes us to ask a much harder, and much more actionable, question:

“Why is it wrong here, and what would actually change the outcome if we intervened?”

This shift might sound philosophical, but it has very concrete implications in production systems. Once a model is embedded in a decision pipeline such as approving loans, sending retention campaigns or triggering fraud reviews, errors are no longer just misclassified rows. They are business costs, regulatory risks, and unhappy users. Understanding them in a purely correlational way is not enough.

This article outlines how to bring a causal lens into ML error analysis, and how phenomena like Simpson’s paradox can easily mislead us if we don’t.

From “wrong predictions” to “wrong decisions” 

Consider a typical supervised learning setup in a real system:

  1. We observe input features X (customer profile, behavior, channel, device, etc.).
  2. We care about an outcome Y (churn, default, fraud, conversion).
  3. Our model produces a prediction \hat{Y}.
  4. Downstream, a policy or business rule D turns \hat{Y} into an action: approve or reject, send or do not send a campaign, escalate or ignore an alert.
  5. In the background, there are environment variables Z: macroeconomic conditions, country-specific regulations, differences between data sources, seasonality, and so on. 

In practice, many errors we observe are not “pure model mistakes”. They are the result of interactions between these layers:

  • A change in data collection in one country (a Z) silently modifies the distribution of X and the meaning of labels Y.
  • A new marketing policy D changes user behavior, so that the model is now predicting in a world that no longer resembles the training data.
  • A regulatory constraint forbids using some predictive features, pushing the model to rely on weaker proxies.

A classical error analysis, however, mostly looks at relationships such as “error rate as a function of feature value” or “performance by segment”. In probabilistic terms, this is about P\left(\mathrm{error}\mid X,\mathrm{segment}\right). It does not tell us what would happen if we changed the feature, changed the policy, or deployed a different model in this environment.
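To make the contrast concrete, here is a minimal sketch of that descriptive style of analysis in pandas. The dataframe and its column names are illustrative assumptions, not a real evaluation log.

```python
import pandas as pd

# Hypothetical evaluation log: one row per scored case, with the observed
# outcome, the model's prediction, and a segment column.
df = pd.DataFrame({
    "segment": ["web", "web", "mobile", "mobile", "mobile"],
    "y_true":  [1, 0, 1, 1, 0],
    "y_pred":  [1, 0, 0, 1, 1],
})

df["error"] = (df["y_true"] != df["y_pred"]).astype(int)

# P(error | segment): purely descriptive. It tells us *where* the model is
# wrong, not what would change if we intervened on the data, the policy,
# or the model itself.
print(df.groupby("segment")["error"].mean())
```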

A causal perspective explicitly acknowledges these arrows:

  • Some features causally influence the outcome.
  • The model’s prediction causally influences the action.
  • The action may feed back into the outcome we observe (e.g. a retention campaign preventing a churn that would otherwise have happened).

Error analysis then becomes less about describing patterns and more about reasoning through this system: which parts of it are responsible for the errors, and which levers can we pull to reduce them?

Real-World Example: Churn Prediction Under Policy Feedback

Imagine a telecom company with a churn model in production. The model assigns a churn probability to each customer; if the score is above a threshold, the customer is targeted with an aggressive discount campaign.

At first glance, error analysis seems straightforward: compare predicted probabilities with observed churn, compute metrics, break them down by age group, region, tariff plan and channel.

However, very quickly you run into a problem: the outcome you observe is not simply “natural” churn (churn as it would have occurred without any intervention). It is churn given that we may or may not have intervened. The model prediction influences the policy, and the policy influences the label you see.

Some concrete questions immediately become causal:

  • In a high-risk segment, is the low observed churn due to a good model, or due to a generous retention campaign that would have succeeded even with a mediocre model?
  • If we change the threshold for a specific segment (for example, never send discounts to low-profit customers), how would the overall business outcome actually change?
  • If we gather more data for a small but important segment, will this reduce the model error rate there, or are we limited by label noise and inconsistent definitions of churn?

A purely correlational analysis can tell you that “the model’s recall is low in segment X”, but it cannot tell you whether sending more campaigns in segment X would be effective, or whether lowering the threshold is beneficial once costs and feedback effects are considered. A causal analysis aims at those “what if we changed…” questions.
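To see how this plays out, here is a minimal simulation sketch in Python. All numbers (the risk distribution, the 50% campaign effect, the 0.5 threshold) are assumptions chosen for illustration, not estimates from a real system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumption: the model's churn score is a noisy version of the customer's
# true (untreated) churn risk.
true_risk = rng.beta(2, 5, size=n)                          # P(churn) with no intervention
score = np.clip(true_risk + rng.normal(0, 0.1, size=n), 0, 1)  # model's churn score

# Policy: customers above the threshold receive a retention discount,
# which (by assumption) cuts their churn probability in half.
threshold = 0.5
campaign = score > threshold
treated_risk = np.where(campaign, 0.5 * true_risk, true_risk)

# What we actually log is churn *after* the policy has acted.
observed_churn = rng.random(n) < treated_risk

# Naive evaluation on observed labels makes the model look "wrong" on the
# targeted group, precisely because the campaign suppressed churn the model
# had correctly flagged.
print("Observed churn rate, targeted:    ", observed_churn[campaign].mean())
print("Observed churn rate, not targeted:", observed_churn[~campaign].mean())
print("Untreated churn risk, targeted (counterfactual):", true_risk[campaign].mean())
```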

Simpson’s Paradox: When Aggregate Metrics Lie  

One of the clearest illustrations of why causal thinking matters in error analysis is Simpson’s paradox. It arises when performance comparisons flip direction depending on whether you look at aggregate data or at subgroups.

Suppose you are comparing two fraud models, A and B, on historical transactions. Transactions come from two different channels: the web and mobile. You evaluate both models on both channels and obtain the following:

  • Web channel (relatively easy):
    • Model A: 60 correct out of 100 → 60%
    • Model B: 70 correct out of 100 → 70%
  • Mobile channel (more difficult):
    • Model A: 10 correct out of 100 → 10%
    • Model B: 30 correct out of 200 → 15%

Within each channel, model B has the higher accuracy: it is better on web (70% vs 60%) and better on mobile (15% vs 10%). If you only look at the per-channel metrics, the natural conclusion is that “B dominates A”.

Now look at the overall performance of each model, aggregating across channels:

  • Model A in total:
    • Correct predictions: 60 + 10 = 70
    • Total transactions: 100 + 100 = 200
    • Overall accuracy: 70 / 200 = 35%
  • Model B in total:
    • Correct predictions: 70 + 30 = 100
    • Total transactions: 100 + 200 = 300
    • Overall accuracy: 100 / 300 ≈ 33.3%
 

Taken as a whole, model A now appears better (35% vs 33.3%), even though model B was better in both subgroups.

The paradox is not a mathematical trick; it is a warning. What changed between the subgroup view and the aggregate view is weighting. Model B was evaluated on a much larger number of mobile transactions, which are intrinsically harder cases. Because its evaluation set contains proportionally more difficult examples, its aggregated score is pulled down, even though it is locally better on both web and mobile when those channels are compared fairly.
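The arithmetic is easy to reproduce; the short snippet below recomputes the per-channel and pooled accuracies from the counts above.

```python
# (correct, total) per model and channel, as in the example above.
counts = {
    "A": {"web": (60, 100), "mobile": (10, 100)},
    "B": {"web": (70, 100), "mobile": (30, 200)},
}

for model, per_channel in counts.items():
    for channel, (correct, total) in per_channel.items():
        print(f"{model} / {channel}: {correct / total:.1%}")
    correct_all = sum(c for c, _ in per_channel.values())
    total_all = sum(t for _, t in per_channel.values())
    print(f"{model} / pooled: {correct_all / total_all:.1%}")

# B wins within every channel (70% > 60%, 15% > 10%),
# yet A wins on the pooled numbers (35.0% > 33.3%).
```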

This is exactly the kind of situation that can arise in real deployments:

  • Different models are tested at different times, under different mixes of web and mobile traffic.
  • A/B experiments fail to stratify properly by channel or region.
  • Evaluation logs oversample some segments because of routing or sampling rules.

If you look only at the overall metric, you might conclude that “A is safer and more accurate”. If you look only at the per-channel metrics, you might conclude that “B is strictly better”. Both views are incomplete because they ignore how the data were generated: which model saw which mix of channels, under what selection rules.

A causal perspective forces you to ask a different question: “Under a common deployment policy and a common traffic distribution, which model would be better?” That is a counterfactual question. Answering it requires reasoning not just about what happened historically, but about how performance would change if you intervened on the deployment policy, assigning both models to the same kind of traffic, in the same proportions, and then comparing them on equal footing.
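One simple offline approximation of that question is direct standardization: reweight each model’s per-channel accuracy to a single, common channel mix. The 50/50 web/mobile mix below is an assumption for illustration; in practice you would estimate the mix from the traffic the deployed system actually faces.

```python
# Per-channel accuracies from the example above.
accuracy = {
    "A": {"web": 0.60, "mobile": 0.10},
    "B": {"web": 0.70, "mobile": 0.15},
}

# A single target traffic mix that both models would face in production
# (assumed here; estimate it from live traffic in a real system).
traffic_mix = {"web": 0.5, "mobile": 0.5}

for model, acc in accuracy.items():
    standardized = sum(traffic_mix[ch] * acc[ch] for ch in traffic_mix)
    print(f"{model}: standardized accuracy = {standardized:.1%}")

# Under the same channel mix, B comes out ahead (42.5% vs 35.0%),
# matching the per-channel comparison rather than the raw pooled one.
```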

Confounding, Feedback and Label Quality

Beyond Simpson’s paradox, there are three recurring causal themes in real-world error analysis.

First, there is confounding. Some variables influence both features and outcomes. Macroeconomic conditions affect both customers’ financial behavior and the labels you record as “default” or “churn”. A new product line changes customer mix and their propensity to respond to campaigns. If you ignore these confounders, you may attribute an error pattern to the model when it actually arises from a shift in the underlying environment.

Second, feedback. Once a model is used to make decisions, those decisions alter the data that will later be used to evaluate the model. High-risk customers receiving proactive retention offers may no longer churn; accounts flagged as suspicious may be closed or monitored differently. The distribution of the data is now a function of the model’s own behavior. Evaluating “what the model would have done without the intervention” is inherently counterfactual.

Third, label quality. In many business settings, label Y is itself the output of a process: human adjudication, operational rules, external systems. Different teams or countries may apply different criteria for “fraud”, “bad debt”, or “win”. Error patterns that appear as “model bias” can turn out to be inconsistencies in labelling practices, or changes in how events are recorded.

A causal perspective encourages you to systematically ask, for each surprising error pattern:

  • Is this due to the model’s mapping from X to \hat{Y}?
  • Or is it due to how X is generated, how Y is defined, or how the model outputs influence Y through decisions?

The answer determines whether you should retrain, recollect, relabel, redesign the policy, or change the target altogether.

Towards a Causal Style of Error Analysis 

Adopting a causal style of error analysis does not mean that you must immediately implement sophisticated structural causal models or perfect counterfactual estimators. It starts with a change in the questions you ask and the way you document your investigations.

Instead of stopping at “segment S has high error”, you explicitly ask:

  • What upstream changes – in data collection, in the environment, in policy – could plausibly explain this pattern?
  • If we removed or altered a particular feature, or if we changed the decision rule for this segment, do we expect the error to go down, and why?
  • Are we comparing models under the same traffic mix and deployment policy, or are there hidden selection effects that could create a Simpson-like paradox?
  • For critical errors, what would need to change in this specific case for the outcome to be different? Is that change realistic, and does it align with domain knowledge?

From there, you can incorporate more formal tools as needed: stratified evaluations that respect the causal structure of the problem; simple interventions on data and policy in offline simulations; or, when stakes are high enough, explicit causal effect estimation with modern libraries.
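As one concrete illustration of that last option, here is a minimal sketch using the open-source DoWhy library to estimate the effect of a retention campaign on churn while adjusting for assumed confounders. The file name, column names and confounder list are hypothetical; the point is the shape of the workflow, not the specifics.

```python
import pandas as pd
from dowhy import CausalModel

# Hypothetical campaign log: whether a retention offer was sent, whether the
# customer churned, plus variables believed to confound both.
df = pd.read_csv("campaign_log.csv")  # assumed columns: campaign, churn, segment, tenure, plan

model = CausalModel(
    data=df,
    treatment="campaign",
    outcome="churn",
    common_causes=["segment", "tenure", "plan"],  # assumed confounders
)

estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.propensity_score_weighting",
)
print("Estimated effect of the campaign on churn:", estimate.value)
```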

The key is that error analysis ceases to be a static catalogue of plots and metrics and becomes instead a structured investigation into mechanisms: how your model, your data and your business logic combine to produce the behavior you see – and how they would behave under alternative decisions.

In that sense, a causal perspective does not replace classical error analysis; it upgrades it from “where are we wrong?” to “what is making us wrong here, and what would actually fix it?”.

At MDP Group, we evaluate our machine learning solutions not only through standard metrics, but also through this causal lens, carefully examining how data, models and business policies interact to produce the errors we observe.

Further Reading

– Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. John Wiley & Sons.

– Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

– Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC.

 

 
