Linear Regression – When It Works for Causal Inference
How regression estimates treatment effects, the assumptions it needs, and when those assumptions quietly fail.
Why linear regression shows up in causal inference
Linear regression is one of the most widely used tools in data science. Most teams first meet it as a prediction method, but it also plays a central role in causal inference. When used carefully, regression can estimate the causal effect of a treatment after adjusting for confounders. When used casually, it quietly mixes correlation with causation.
The key question is not whether regression fits the data, but whether the model reflects a valid causal design. A beautiful R² does not protect you from bias if the underlying assumptions are violated.
The basic causal regression setup
To use regression for causal inference, we usually write a model like:

Y = α + β·T + γᵀX + ε

where Y is the outcome, T is a treatment indicator, X is a vector of covariates, and β is interpreted as the treatment effect after adjusting for X.
For this interpretation to be valid, regression must be doing more than prediction—it must be reproducing the comparisons we would see in a well-designed experiment, but within strata of X.
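As a concrete sketch (simulated data; the column names y, t, x1, x2 and the use of the statsmodels formula API are illustrative assumptions, not taken from the article), the adjusted treatment effect is read off as the coefficient on the treatment dummy:

```python
# Illustrative sketch: fit Y ~ T + X on simulated data and read off beta,
# the coefficient on the treatment indicator t.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x1)))               # treatment depends on x1
y = 2.0 * t + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)   # true effect = 2.0

df = pd.DataFrame({"y": y, "t": t, "x1": x1, "x2": x2})
fit = smf.ols("y ~ t + x1 + x2", data=df).fit()
print(fit.params["t"])   # the adjusted estimate of beta, close to 2.0 here
```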
The key assumption: conditional ignorability
Regression delivers a causal estimate only if treatment assignment is as good as random after conditioning on covariates. Formally:

(Y(0), Y(1)) ⊥ T | X

Once we condition on X, treated and control units must be comparable in terms of their potential outcomes.
Intuitively, X must block all backdoor paths from T to Y. If a relevant confounder is missing or poorly measured, the regression has no way to recover the true causal effect. The estimate of β will be biased, even if the model predicts Y very well.
Why regression can remove confounding (when assumptions hold)
When conditional ignorability is plausible and overlap is good, regression behaves like a compact, efficient adjustment procedure:
- Adjustment: regression separates the variation in Y uniquely associated with T from variation explained by X.
- Within-strata comparison: you can think of it as comparing treated and control units with the same X.
- Weighting: the estimator combines many local comparisons into a single average treatment effect.
In this regime, linear regression is not just a prediction tool—it is a valid causal estimator.
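A minimal simulation (illustrative names, not the article's data) shows the difference between a confounded comparison and an adjusted one:

```python
# Naive difference in means vs. covariate-adjusted regression when x
# confounds both treatment assignment and the outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)                                   # confounder
t = rng.binomial(1, 1 / (1 + np.exp(-2.0 * x)))          # treated units tend to have high x
y = 1.0 * t + 3.0 * x + rng.normal(size=n)               # true treatment effect = 1.0

naive = y[t == 1].mean() - y[t == 0].mean()              # mixes treatment with confounding
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit().params[1]
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")   # naive is far above 1.0
```

The adjusted estimate recovers the true effect only because x is observed and included; nothing in the mechanics of OLS checks that assumption for you.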
Classic failure modes for regression in causal work
Regression is not automatically causal. It breaks in specific, repeatable ways:
- Omitted variable bias: a confounder that affects both T and Y is left out of X.
- Bad controls: you condition on variables that are effects of T (mediators) or colliders (see the sketch after this list).
- Functional form misspecification: the relationship between X and Y is nonlinear or involves interactions, but the model is purely linear and additive.
- Poor overlap: treated and control units live in very different regions of X; regression is forced to extrapolate.
- Multicollinearity: T is highly correlated with X, making it hard to distinguish treatment from confounding.
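To make the "bad controls" failure mode concrete, here is a small illustrative simulation (not from the article) in which treatment is randomized but part of its effect runs through a mediator M; controlling for M removes that part and understates the total effect:

```python
# Bad control: conditioning on a mediator M that sits on the path T -> M -> Y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 20_000
t = rng.binomial(1, 0.5, size=n)                 # randomized treatment
m = 1.0 * t + rng.normal(size=n)                 # mediator caused by T
y = 0.5 * t + 2.0 * m + rng.normal(size=n)       # total effect of T on Y = 0.5 + 2.0*1.0 = 2.5

without_m = sm.OLS(y, sm.add_constant(t)).fit().params[1]
with_m = sm.OLS(y, sm.add_constant(np.column_stack([t, m]))).fit().params[1]
print(f"without M: {without_m:.2f}")             # ~2.5, the total causal effect
print(f"with M:    {with_m:.2f}")                # ~0.5, only the direct effect
```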
Omitted variable bias in one line
A simple way to see omitted variable bias is through the two-regressor case. Suppose the true model is:

Y = α + β·T + δ·Z + ε

but you regress Y only on T, leaving out Z. The estimated β picks up both the true treatment effect and the effect of the omitted confounder Z to the extent that Z is correlated with T.
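Written out under this notation (with δ the effect of Z on Y), the standard omitted-variable-bias result is the promised one line:

β̂_short ≈ β + δ · Cov(T, Z) / Var(T)

The bias term vanishes only if δ = 0 (Z does not affect Y) or Cov(T, Z) = 0 (Z is unrelated to treatment); otherwise the short regression is biased no matter how large the sample.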
This is why simply "adding more controls" is not enough. You must add the right controls, and you must avoid controlling for variables that should not be in the model.
How the product supports regression as a causal tool
Linear regression is treated as one causal design among others, not as a default. The product helps data scientists use regression when it is justified, and nudges them away from it when it is not:
- Covariate diagnostics: visualize overlap between treated and control groups across X (a generic version of this check is sketched after this list).
- Design warnings: highlight potential bad controls and obvious post-treatment variables.
- Model robustness: compare simple regression estimates to matching, DiD, or Synthetic Control on the same data.
- Sensitivity thinking: encourage users to think in terms of missing confounders, not just p-values.
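Independent of any particular tool, a rough version of the covariate-diagnostics check can be done by hand by fitting a propensity model and comparing its distribution across groups. The function below is an illustrative sketch using scikit-learn, not the product's API:

```python
# Generic overlap check: estimate P(T = 1 | X) and compare quantiles of the
# propensity score for treated vs. control units.
import numpy as np
from sklearn.linear_model import LogisticRegression

def overlap_summary(X: np.ndarray, t: np.ndarray) -> dict:
    """Quantiles of the estimated propensity score by treatment group."""
    ps = LogisticRegression(max_iter=1_000).fit(X, t).predict_proba(X)[:, 1]
    q = [0.01, 0.50, 0.99]
    return {
        "treated": np.quantile(ps[t == 1], q).round(3).tolist(),
        "control": np.quantile(ps[t == 0], q).round(3).tolist(),
    }
```

If treated units cluster near a propensity of 1 while controls cluster near 0, a regression estimate rests heavily on extrapolation.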
When regression is a good causal choice
Regression is a sensible causal method when:
- You believe all major confounders are observed and included in X.
- There is reasonable overlap in covariates across treated and control units.
- You are not conditioning on mediators or colliders.
- The relationship between X and Y can be captured by a flexible, well-specified model (possibly with interactions and transformations; see the sketch after this list).
- You are comfortable with an estimate that is model-dependent rather than design-based.
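One subtlety behind the flexible-specification point deserves a sketch: once the model includes treatment-by-covariate interactions, the treatment effect is no longer a single coefficient. A common summary, shown here on simulated data with illustrative names, is to average the unit-level effects implied by the fitted model:

```python
# With an interaction term, report the model-implied average treatment effect
# rather than a single coefficient on t.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, size=n)
y = (1.0 + 0.5 * x) * t + x + rng.normal(size=n)   # effect varies with x

df = pd.DataFrame({"y": y, "t": t, "x": x})
fit = smf.ols("y ~ t * x", data=df).fit()          # expands to t + x + t:x

# Predict everyone treated, predict everyone untreated, average the difference.
ate = (fit.predict(df.assign(t=1)) - fit.predict(df.assign(t=0))).mean()
print(round(ate, 2))                               # close to 1.0 because E[x] = 0
```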
When regression should not be your primary design
Consider other methods if:
- Key confounders are missing or poorly measured.
- Treatment assignment is driven by unobserved factors you cannot proxy.
- Covariate overlap is severely limited; treated and control groups barely intersect in X.
- Timing, dynamics, or staggered adoption are central (DiD or event-study may be more appropriate).
- You need a clearly design-based argument for identification (randomization, as-if random shocks, or structured rollouts).
Summary
Linear regression is not a magic causal engine. It is a powerful tool that can estimate treatment effects when treatment is ignorable after conditioning on the right covariates and when overlap is good. Used without a clear causal story, it quietly turns design problems into model assumptions.
In the product, regression sits alongside A/B tests, DiD, PSM, and Synthetic Control. The goal is not to crown a single "best" method, but to help data scientists choose a design where the assumptions are visible, defensible, and aligned with how the data were actually generated.