Choosing a Causal Inference Method
A practical comparison of A/B tests, Difference-in-Differences, Propensity Score Matching, and Synthetic Control, written for data scientists who need to choose the right tool for real-world questions.
Start with the question, then pick the method
There is no single "best" causal inference method. Each design makes different assumptions and works best with specific data structures. A good workflow starts from the question and dataset, then narrows down which methods are reasonable.
The table below summarizes the core methods covered here:
| Method | Best for | Key assumptions |
|---|---|---|
| A/B test (randomized) | Online experiments where randomization is feasible | Random assignment, no interference, consistent measurement |
| Difference-in-Differences – Two Point | Pre/post changes with treated vs control groups and two periods | Parallel trends, no other shocks differentially affecting groups |
| Difference-in-Differences – TWFE | Panel data with multiple periods and staggered rollout | Parallel trends (conditional on fixed effects), no anticipation, careful with heterogeneity |
| Propensity Score Matching | Observational data with selection on observables | Unconfoundedness given covariates, overlap (common support) |
| Synthetic Control | Single treated unit with many controls and rich pre-treatment history | Good pre-treatment fit, no unique shocks, stable relationships over time |
A simple decision flow
If you can randomly assign treatment at the user, session, or geo level without breaking the product, an A/B test is usually the cleanest choice. Randomization removes many identification headaches.
Use an A/B test when:
- You control assignment and can ensure consistent exposure.
- Spillovers or interference are minimal or explicitly designed around.
- You mainly care about short- to medium-term effects on standard metrics.
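As a concrete sketch, the readout for a simple two-variant test can be computed directly from the experiment data. The snippet below assumes a pandas DataFrame with hypothetical `variant` and `converted` columns and uses a normal-approximation confidence interval; it is an illustration, not a full experimentation pipeline.

```python
# Minimal A/B readout sketch. Column names ("variant", "converted") and the
# normal-approximation CI are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy import stats

def ab_readout(df: pd.DataFrame) -> dict:
    ctrl = df.loc[df["variant"] == "control", "converted"]
    trt = df.loc[df["variant"] == "treatment", "converted"]
    lift = trt.mean() - ctrl.mean()
    # Standard error of the difference in means (normal approximation).
    se = np.sqrt(trt.var(ddof=1) / len(trt) + ctrl.var(ddof=1) / len(ctrl))
    # Welch's t-test on the binary outcome; reasonable for large samples.
    _, p_value = stats.ttest_ind(trt, ctrl, equal_var=False)
    return {
        "lift": lift,
        "ci_95": (lift - 1.96 * se, lift + 1.96 * se),
        "p_value": p_value,
    }
```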
When a feature, policy, or price changes at a specific time for some units but not others, Difference-in-Differences is often a good fit.
- If you effectively have one pre period and one post period, use DiD – Two Point.
- If you have multiple periods and staggered rollout, consider DiD – TWFE, with careful attention to heterogeneity and design.
DiD works best when you believe that, without the treatment, treated and control units would have followed similar trends over time.
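Both flavours can be estimated with a plain regression. The sketch below uses statsmodels formulas and assumes a hypothetical panel layout with columns `y` (outcome), `treated` (1 if the unit is ever treated), `post` (1 after the intervention), `unit`, and `period`; these names are assumptions for illustration.

```python
# Minimal DiD sketches: two-point (interaction term) and TWFE (fixed effects).
import pandas as pd
import statsmodels.formula.api as smf

def did_two_point(panel: pd.DataFrame) -> float:
    # One pre and one post period: the coefficient on treated:post is the DiD estimate.
    fit = smf.ols("y ~ treated * post", data=panel).fit(cov_type="HC1")
    return fit.params["treated:post"]

def did_twfe(panel: pd.DataFrame) -> float:
    # Unit and period fixed effects, standard errors clustered by unit.
    panel = panel.assign(treat_post=panel["treated"] * panel["post"])
    fit = smf.ols("y ~ treat_post + C(unit) + C(period)", data=panel).fit(
        cov_type="cluster", cov_kwds={"groups": panel["unit"]}
    )
    return fit.params["treat_post"]
```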
When users or units opt into a feature, campaign, or behavior, randomization is gone and selection bias is a real concern. Propensity Score Matching is useful when you believe that, after conditioning on covariates, treatment is as good as random.
Use PSM when:
- You have rich covariates that plausibly capture the main drivers of selection.
- You can diagnose and enforce overlap between treated and control groups.
- You want a matched sample that feels intuitive and inspectable, not just a regression coefficient.
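A minimal 1:1 nearest-neighbour matching sketch on the estimated propensity score is shown below, using scikit-learn. The `treated` column, the `covariates` list, and the 0.05 caliper are illustrative assumptions rather than recommendations.

```python
# Minimal propensity score matching sketch (1:1, with replacement, caliper 0.05).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_on_propensity(df: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    X, t = df[covariates].to_numpy(), df["treated"].to_numpy()
    # Estimate propensity scores with a simple logistic regression.
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    df = df.assign(pscore=ps)
    treated, control = df[df["treated"] == 1], df[df["treated"] == 0]
    # For each treated unit, find the control with the closest propensity score
    # (matching with replacement: a control unit may be reused).
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    dist, idx = nn.kneighbors(treated[["pscore"]])
    matched_controls = control.iloc[idx.ravel()]
    # Caliper check: drop matches whose scores are too far apart (overlap diagnostic).
    keep = dist.ravel() <= 0.05
    return pd.concat([treated[keep], matched_controls[keep]])
```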
When a single country, platform, or business line is treated, there may be no perfect "twin" to use as a control. Synthetic Control builds a weighted combination of donor units whose pre-treatment path matches the treated unit, then compares their trajectories after the intervention.
Use Synthetic Control when:
- You have a long pre-treatment history for the treated and donor units.
- The treated unit is unique, but the donor pool collectively can approximate it.
- Visualizing treated vs synthetic paths over time is important for communicating results.
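The core computational step is small: find non-negative donor weights that sum to one and minimize the pre-treatment gap between the treated unit and the weighted donors. The sketch below assumes `y_pre` is a length-T pre-treatment outcome vector for the treated unit and `Y_donor_pre` is a T-by-J matrix of donor outcomes; the names and setup are illustrative.

```python
# Minimal synthetic-control weight fitting on pre-treatment outcomes.
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(y_pre: np.ndarray, Y_donor_pre: np.ndarray) -> np.ndarray:
    n_donors = Y_donor_pre.shape[1]

    def pre_period_loss(w: np.ndarray) -> float:
        # Squared pre-treatment gap between treated unit and weighted donors.
        return float(np.sum((y_pre - Y_donor_pre @ w) ** 2))

    # Weights form a convex combination: non-negative and summing to one.
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * n_donors
    w0 = np.full(n_donors, 1.0 / n_donors)
    return minimize(pre_period_loss, w0, bounds=bounds, constraints=constraints).x

# The post-treatment effect is then the gap between the treated unit's observed
# path and the synthetic counterfactual: y_post - Y_donor_post @ weights.
```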
How methods complement each other
In practice, strong causal work rarely relies on a single method. Instead, data scientists often combine designs to cross-check conclusions:
- Run an A/B test where possible, then use DiD to study long-run or secondary outcomes.
- Use PSM to build a balanced sample, then run DiD within that sample.
- Apply Synthetic Control to one focal geo while using DiD across a broader set of regions.
Good tooling is built for this multi-method reality: the same dataset can feed multiple causal views without manual plumbing for each.
Mapping methods to typical product questions
If you can randomize: A/B test. If you launched to a subset of geos or cohorts at a specific time: DiD – Two Point or DiD – TWFE, depending on the data structure.
If only one country was treated: Synthetic Control using other countries as the donor pool, with DiD as a robustness check across groupings.
If self-selection is likely: use PSM to match feature users to non-users with similar histories, then compare retention or run DiD on the matched sample.
If some regions were affected and others were not, and you have multiple time periods: DiD – TWFE, with checks for parallel trends and potential staggered adoption issues.
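One simple way to probe parallel trends is a placebo regression on pre-treatment periods only, sketched below. It reuses the hypothetical panel layout from the DiD sketch above and assumes a numeric `period` index; a large, significant interaction is a warning sign, not a formal proof either way.

```python
# Minimal pre-trends diagnostic: regress the outcome on a treated-by-time
# interaction using only pre-treatment rows.
import pandas as pd
import statsmodels.formula.api as smf

def pre_trend_check(panel: pd.DataFrame) -> tuple[float, float]:
    pre = panel[panel["post"] == 0]
    fit = smf.ols("y ~ treated * period", data=pre).fit(
        cov_type="cluster", cov_kwds={"groups": pre["unit"]}
    )
    # Differential pre-period slope for treated units, with its p-value.
    return fit.params["treated:period"], fit.pvalues["treated:period"]
```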
Where tooling helps
Choosing methods is only half the battle. Implementing them correctly, checking assumptions, and communicating results are where most of the time and risk sit. Good tooling helps by:
- Detecting dataset structure and surfacing which methods are viable.
- Encoding assumptions and diagnostics for each method, not leaving them to chance.
- Providing a unified interface across A/B tests, DiD, PSM, and Synthetic Control.
- Making outputs legible to product and business partners without hiding the details from data scientists.