In most machine learning projects, “error analysis” is treated as a descriptive exercise. We slice metrics by segment, look at confusion matrices, check feature importances, and maybe generate a few SHAP plots. All of that is useful, but it tends to answer only one question:
“Where is the model wrong?”
A causal perspective pushes us to ask a much harder, and much more actionable, question:
“Why is it wrong here, and what would actually change the outcome if we intervened?”
This shift might sound philosophical, but it has very concrete implications in production systems. Once a model is embedded in a decision pipeline (approving loans, sending retention campaigns, triggering fraud reviews), errors are no longer just misclassified rows. They are business costs, regulatory risks, and unhappy users. Understanding them in a purely correlational way is not enough.
This article outlines how to bring a causal lens into ML error analysis, and how phenomena like Simpson’s paradox can easily mislead us if we don’t.
Consider a typical supervised learning setup in a real system: features describing customers or transactions, labels produced by some business process, a model that turns features into scores, and a decision policy that turns those scores into actions, all sitting inside an environment that keeps changing.
In practice, many errors we observe are not “pure model mistakes”. They are the result of interactions between these layers: the environment shifts, the decision policy changes who gets treated and what gets recorded, and the labelling process changes what counts as a positive.
A classical error analysis, however, mostly looks at relationships such as “error rate as a function of feature value” or “performance by segment”. In probabilistic terms, this is about $P(\mathrm{error} \mid X, \mathrm{segment})$. It does not tell us what would happen if we changed the feature, changed the policy, or deployed a different model in this environment.
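To make the contrast concrete, this is roughly what the classical, observational view looks like in code. It is a minimal sketch, assuming an evaluation set in a pandas dataframe with hypothetical columns y_true, y_pred and segment:

```python
import pandas as pd

# Hypothetical evaluation set: true labels, model predictions and a business segment.
df = pd.DataFrame({
    "y_true":  [0, 1, 1, 0, 1, 0, 1, 0],
    "y_pred":  [0, 1, 0, 0, 0, 0, 1, 1],
    "segment": ["A", "A", "A", "B", "B", "B", "B", "A"],
})

# Purely observational view: P(error | segment).
df["error"] = (df["y_true"] != df["y_pred"]).astype(int)
error_by_segment = df.groupby("segment")["error"].mean()
print(error_by_segment)
```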
Error analysis then becomes less about describing patterns and more about reasoning through this system: which parts of it are responsible for the errors, and which levers can we pull to reduce them?
Imagine a telecom company with a churn model in production. The model assigns a churn probability to each customer; if the score is above a threshold, the customer is targeted with an aggressive discount campaign.
At first glance, error analysis seems straightforward: compare predicted probabilities with observed churn, compute metrics, break them down by age group, region, tariff plan and channel.
However, very quickly you run into a problem: the outcome you observe is not simply “churn in nature”. It is “churn given that we may or may not have intervened”. The model prediction influences the policy, and the policy influences the label you see.
Some concrete questions immediately become causal: would this customer have churned if we had not sent the campaign? If we lower the threshold in segment X, does churn actually go down, or do we mostly discount customers who would have stayed anyway? Is the poor recall we measure in a segment a property of the model, or of the campaigns that have already changed behavior in that segment?
A purely correlational analysis can tell you that “the model’s recall is low in segment X”, but it cannot tell you whether sending more campaigns in segment X would be effective, or whether lowering the threshold is beneficial once costs and feedback effects are considered. A causal analysis aims at those “what if we changed…” questions.
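To see why the naive view breaks down, here is a small and entirely synthetic simulation; the 0.7 threshold, the assumed effect of the discount and every number in it are illustrative, not taken from any real system. Customers above the threshold receive a discount that halves their churn probability, and the recall computed on the observed labels no longer reflects how well the model identifies would-be churners:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic "true" churn propensity and an imperfect model score correlated with it.
propensity = rng.uniform(0.0, 1.0, n)
score = np.clip(propensity + rng.normal(0.0, 0.2, n), 0.0, 1.0)

# Deployment policy: customers above the threshold get an aggressive discount.
treated = score > 0.7

# The intervention works: it halves the churn probability of treated customers.
churn_prob = np.where(treated, propensity * 0.5, propensity)
churned = rng.uniform(0.0, 1.0, n) < churn_prob

# Naive recall on observed labels: what fraction of observed churners did the model flag?
naive_recall = treated[churned].mean()

# Counterfactual recall if nobody had been treated (known only because we simulate).
churned_no_treatment = rng.uniform(0.0, 1.0, n) < propensity
counterfactual_recall = treated[churned_no_treatment].mean()

print(f"recall on observed (intervened) labels:  {naive_recall:.2f}")
print(f"recall against untreated counterfactual: {counterfactual_recall:.2f}")
```

In the simulation we can compute the untreated counterfactual directly; in production you cannot observe it, which is exactly why these evaluation questions are causal rather than descriptive.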
One of the clearest illustrations of why causal thinking matters in error analysis is Simpson’s paradox. It arises when performance comparisons flip direction depending on whether you look at aggregate data or at subgroups.
Suppose you are comparing two fraud models, A and B, on historical transactions. Transactions come from two different channels: the web and mobile. You evaluate both models on both channels and obtain the following per-channel accuracies:

Channel    Model A    Model B
Web          60%        70%
Mobile       10%        15%
Within each channel, model B has the higher accuracy: it is better on web (70% vs 60%) and better on mobile (15% vs 10%). If you only look at the per-channel metrics, the natural conclusion is that “B dominates A”.
Now look at the overall performance of each model, aggregating across channels:

Model A overall accuracy: 35%
Model B overall accuracy: 33.3%
Taken as a whole, model A now appears better (35% vs 33.3%), even though model B was better in both subgroups.
The paradox is not a mathematical trick; it is a warning. What changed between the subgroup view and the aggregate view is weighting. Model B was evaluated on a much larger number of mobile transactions, which are intrinsically harder cases. Because its evaluation set contains proportionally more difficult examples, its aggregated score is pulled down, even though it is locally better on both web and mobile when those channels are compared fairly.
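The arithmetic is easy to verify. The counts below are illustrative, chosen only to be consistent with the percentages above rather than taken from a real dataset, with model B evaluated on proportionally more mobile traffic:

```python
# (correct predictions, total transactions) per model and channel; illustrative counts.
results = {
    "A": {"web": (600, 1000), "mobile": (100, 1000)},
    "B": {"web": (350, 500),  "mobile": (150, 1000)},
}

for model, channels in results.items():
    for channel, (correct, total) in channels.items():
        print(f"model {model}, {channel:>6}: {correct / total:.1%}")
    correct_all = sum(c for c, _ in channels.values())
    total_all = sum(t for _, t in channels.values())
    print(f"model {model}, overall: {correct_all / total_all:.1%}")
```

With these counts, two thirds of model B's evaluation set comes from the hard mobile channel, and that weighting alone is enough to flip the aggregate comparison.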
This is exactly the kind of situation that can arise in real deployments, where two models are rarely evaluated on the same traffic: a challenger piloted mostly on the harder mobile channel, a champion that keeps serving the easier web traffic, or two models compared across time periods with very different channel mixes.
If you look only at the overall metric, you might conclude that “A is safer and more accurate”. If you look only at the per-channel metrics, you might conclude that “B is strictly better”. Both views are incomplete because they ignore how the data were generated: which model saw which mix of channels, under what selection rules.
A causal perspective forces you to ask a different question: “Under a common deployment policy and a common traffic distribution, which model would be better?” That is a counterfactual question. Answering it requires reasoning not just about what happened historically, but about how performance would change if you intervened on the deployment policy, assigning both models to the same kind of traffic, in the same proportions, and then comparing them on equal footing.
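One simple way to approximate that comparison offline is to standardize both models to a common traffic distribution: compute each model's per-channel accuracy, then average those accuracies under one reference channel mix. A minimal sketch, reusing the illustrative numbers above; the reference mix is itself an assumption you have to justify, for example the channel shares you expect after deployment:

```python
# Per-channel accuracies for both models (same illustrative numbers as above).
accuracy = {
    "A": {"web": 0.60, "mobile": 0.10},
    "B": {"web": 0.70, "mobile": 0.15},
}

# Reference traffic mix: the channel shares under which both models would actually run.
# This mix is an assumption of the analysis, not something historical data gives you for free.
reference_mix = {"web": 0.5, "mobile": 0.5}

for model, per_channel in accuracy.items():
    standardized = sum(reference_mix[ch] * acc for ch, acc in per_channel.items())
    print(f"model {model}, accuracy under the reference mix: {standardized:.1%}")
```

Under any common mix, model B comes out ahead, which is what the per-channel numbers already suggested; the aggregate flip was entirely an artifact of the two models seeing different traffic. This is the simplest possible adjustment, and it only answers the counterfactual question if the channel is the one relevant difference between the two evaluation sets.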
Beyond Simpson’s paradox, there are three recurring causal themes in real-world error analysis.
First, there is confounding. Some variables influence both features and outcomes. Macroeconomic conditions affect both customers’ financial behavior and the labels you record as “default” or “churn”. A new product line changes customer mix and their propensity to respond to campaigns. If you ignore these confounders, you may attribute an error pattern to the model when it actually arises from a shift in the underlying environment.
Second, feedback. Once a model is used to make decisions, those decisions alter the data that will later be used to evaluate the model. High-risk customers receiving proactive retention offers may no longer churn; accounts flagged as suspicious may be closed or monitored differently. The distribution of the data is now a function of the model’s own behavior. Evaluating “what the model would have done without the intervention” is inherently counterfactual.
Third, label quality. In many business settings, label Y is itself the output of a process: human adjudication, operational rules, external systems. Different teams or countries may apply different criteria for “fraud”, “bad debt”, or “win”. Error patterns that appear as “model bias” can turn out to be inconsistencies in labelling practices, or changes in how events are recorded.
A causal perspective encourages you to systematically ask, for each surprising error pattern: is it produced by the model itself, by a shift in the environment that generates the data, by the decision policy feeding back into what you observe, or by the way the labels are produced?
The answer determines whether you should retrain, recollect, relabel, redesign the policy, or change the target altogether.
Adopting a causal style of error analysis does not mean that you must immediately implement sophisticated structural causal models or perfect counterfactual estimators. It starts with a change in the questions you ask and the way you document your investigations.
Instead of stopping at “segment S has high error”, you explicitly ask: what mechanism produces these errors, would they persist if we changed the policy, the data collection or the labelling process, and which intervention would actually reduce them?
From there, you can incorporate more formal tools as needed: stratified evaluations that respect the causal structure of the problem; simple interventions on data and policy in offline simulations; or, when stakes are high enough, explicit causal effect estimation with modern libraries.
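As an example of that last step, here is a hedged sketch using the open-source DoWhy library, assuming a customer-level dataframe with hypothetical columns campaign (the treatment), churned (the outcome) and a few pre-treatment covariates. The choice of covariates and the identification assumptions behind them are the part that requires real domain scrutiny, not the library calls:

```python
import pandas as pd
from dowhy import CausalModel

# Hypothetical customer-level dataset: campaign (0/1 treatment), churned (0/1 outcome),
# plus pre-treatment covariates believed to confound the treatment-outcome relationship.
df = pd.read_csv("customers.csv")

model = CausalModel(
    data=df,
    treatment="campaign",
    outcome="churned",
    common_causes=["tenure", "monthly_charges", "support_calls"],
)

# Identify the causal effect under the stated assumptions, then estimate it.
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print("estimated effect of the campaign on churn:", estimate.value)

# Refutation checks probe how fragile the estimate is to violated assumptions.
refutation = model.refute_estimate(estimand, estimate, method_name="placebo_treatment_refuter")
print(refutation)
```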
The key is that error analysis ceases to be a static catalogue of plots and metrics and becomes instead a structured investigation into mechanisms: how your model, your data and your business logic combine to produce the behavior you see – and how they would behave under alternative decisions.
In that sense, a causal perspective does not replace classical error analysis; it upgrades it from “where are we wrong?” to “what is making us wrong here, and what would actually fix it?”.
At MDP Group, we evaluate our machine learning solutions not only through standard metrics, but also through this causal lens, carefully examining how data, models and business policies interact to produce the errors we observe.
– Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. John Wiley & Sons.
– Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
– Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC.