Choosing a Causal Inference Method
A practical comparison of A/B tests, Difference-in-Differences, Propensity Score Matching, and Synthetic Control, written for data scientists who need to choose the right tool for real-world questions.
Start with the question, then pick the method
There is no single "best" causal inference method. Each design makes different assumptions and works best with specific data structures. A good workflow starts from the question and dataset, then narrows down which methods are reasonable.
The table below summarizes the core methods covered here:
| Method | Best for | Key assumptions |
|---|---|---|
| A/B test (randomized) | Online experiments where randomization is feasible | Random assignment, no interference, consistent measurement |
| Difference-in-Differences – Two Point | Pre/post changes with treated vs control groups and two periods | Parallel trends, no other shocks differentially affecting groups |
| Difference-in-Differences – TWFE | Panel data with multiple periods and staggered rollout | Parallel trends (conditional on fixed effects), no anticipation, careful with heterogeneity |
| Propensity Score Matching | Observational data with selection on observables | Unconfoundedness given covariates, overlap (common support) |
| Synthetic Control | Single treated unit with many controls and rich pre-treatment history | Good pre-treatment fit, no unique shocks, stable relationships over time |
A simple decision flow
If you can randomly assign treatment at the user, session, or geo level without breaking the product, an A/B test is usually the cleanest choice. Randomization removes many identification headaches.
Use an A/B test when:
- You control assignment and can ensure consistent exposure.
- Spillovers or interference are minimal or explicitly designed around.
- You mainly care about short- to medium-term effects on standard metrics.
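As a concrete sketch, the readout for a simple two-variant test can be computed directly from the experiment data. The snippet below assumes a pandas DataFrame with hypothetical `variant` and `converted` columns and uses a normal-approximation confidence interval; it is an illustration, not a full experimentation pipeline.

```python
# Minimal A/B readout sketch. Column names ("variant", "converted") and the
# normal-approximation CI are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy import stats

def ab_readout(df: pd.DataFrame) -> dict:
    ctrl = df.loc[df["variant"] == "control", "converted"]
    trt = df.loc[df["variant"] == "treatment", "converted"]
    lift = trt.mean() - ctrl.mean()
    # Standard error of the difference in means (normal approximation).
    se = np.sqrt(trt.var(ddof=1) / len(trt) + ctrl.var(ddof=1) / len(ctrl))
    # Welch's t-test on the binary outcome; reasonable for large samples.
    _, p_value = stats.ttest_ind(trt, ctrl, equal_var=False)
    return {
        "lift": lift,
        "ci_95": (lift - 1.96 * se, lift + 1.96 * se),
        "p_value": p_value,
    }
```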
When a feature, policy, or price changes at a specific time for some units but not others, Difference-in-Differences is often a good fit.
- If you effectively have one pre period and one post period, use DiD – Two Point.
- If you have multiple periods and staggered rollout, consider DiD – TWFE, with careful attention to heterogeneity and design.
DiD works best when you believe that, without the treatment, treated and control units would have followed similar trends over time.
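Both flavours can be estimated with a plain regression. The sketch below uses statsmodels formulas and assumes a hypothetical panel layout with columns `y` (outcome), `treated` (1 if the unit is ever treated), `post` (1 after the intervention), `unit`, and `period`; these names are assumptions for illustration.

```python
# Minimal DiD sketches: two-point (interaction term) and TWFE (fixed effects).
import pandas as pd
import statsmodels.formula.api as smf

def did_two_point(panel: pd.DataFrame) -> float:
    # One pre and one post period: the coefficient on treated:post is the DiD estimate.
    fit = smf.ols("y ~ treated * post", data=panel).fit(cov_type="HC1")
    return fit.params["treated:post"]

def did_twfe(panel: pd.DataFrame) -> float:
    # Unit and period fixed effects, standard errors clustered by unit.
    panel = panel.assign(treat_post=panel["treated"] * panel["post"])
    fit = smf.ols("y ~ treat_post + C(unit) + C(period)", data=panel).fit(
        cov_type="cluster", cov_kwds={"groups": panel["unit"]}
    )
    return fit.params["treat_post"]
```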
When users or units opt into a feature, campaign, or behavior, randomization is gone and selection bias is a real concern. Propensity Score Matching is useful when you believe that, after conditioning on covariates, treatment is as good as random.
Use PSM when:
- You have rich covariates that plausibly capture the main drivers of selection.
- You can diagnose and enforce overlap between treated and control groups.
- You want a matched sample that feels intuitive and inspectable, not just a regression coefficient.
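A minimal 1:1 nearest-neighbour matching sketch on the estimated propensity score is shown below, using scikit-learn. The `treated` column, the `covariates` list, and the 0.05 caliper are illustrative assumptions rather than recommendations.

```python
# Minimal propensity score matching sketch (1:1, with replacement, caliper 0.05).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_on_propensity(df: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    X, t = df[covariates].to_numpy(), df["treated"].to_numpy()
    # Estimate propensity scores with a simple logistic regression.
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    df = df.assign(pscore=ps)
    treated, control = df[df["treated"] == 1], df[df["treated"] == 0]
    # For each treated unit, find the control with the closest propensity score
    # (matching with replacement: a control unit may be reused).
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    dist, idx = nn.kneighbors(treated[["pscore"]])
    matched_controls = control.iloc[idx.ravel()]
    # Caliper check: drop matches whose scores are too far apart (overlap diagnostic).
    keep = dist.ravel() <= 0.05
    return pd.concat([treated[keep], matched_controls[keep]])
```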
When a single country, platform, or business line is treated, there may be no perfect "twin" to use as a control. Synthetic Control builds a weighted combination of donor units whose pre-treatment path matches the treated unit, then compares their trajectories after the intervention.
Use Synthetic Control when:
- You have a long pre-treatment history for the treated and donor units.
- The treated unit is unique, but the donor pool collectively can approximate it.
- Visualizing treated vs synthetic paths over time is important for communicating results.
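The core computational step is small: find non-negative donor weights that sum to one and minimize the pre-treatment gap between the treated unit and the weighted donors. The sketch below assumes `y_pre` is a length-T pre-treatment outcome vector for the treated unit and `Y_donor_pre` is a T-by-J matrix of donor outcomes; the names and setup are illustrative.

```python
# Minimal synthetic-control weight fitting on pre-treatment outcomes.
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(y_pre: np.ndarray, Y_donor_pre: np.ndarray) -> np.ndarray:
    n_donors = Y_donor_pre.shape[1]

    def pre_period_loss(w: np.ndarray) -> float:
        # Squared pre-treatment gap between treated unit and weighted donors.
        return float(np.sum((y_pre - Y_donor_pre @ w) ** 2))

    # Weights form a convex combination: non-negative and summing to one.
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * n_donors
    w0 = np.full(n_donors, 1.0 / n_donors)
    return minimize(pre_period_loss, w0, bounds=bounds, constraints=constraints).x

# The post-treatment effect is then the gap between the treated unit's observed
# path and the synthetic counterfactual: y_post - Y_donor_post @ weights.
```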
How methods complement each other
In practice, strong causal work rarely relies on a single method. Instead, data scientists often combine designs to cross-check conclusions:
- Run an A/B test where possible, then use DiD to study long-run or secondary outcomes.
- Use PSM to build a balanced sample, then run DiD within that sample.
- Apply Synthetic Control to one focal geo while using DiD across a broader set of regions.
Good tooling is built for this multi-method reality: the same dataset can feed multiple causal views without manual plumbing for each.
Mapping methods to typical product questions
If you can randomize: A/B test. If you launched to a subset of geos or cohorts at a specific time: DiD – Two Point or DiD – TWFE, depending on the data structure.
If only one country was treated: Synthetic Control using other countries as the donor pool, with DiD as a robustness check across groupings.
If self-selection is likely: use PSM to match feature users to non-users with similar histories, then compare retention or run DiD on the matched sample.
If some regions were affected and others were not, and you have multiple time periods: DiD – TWFE, with checks for parallel trends and potential staggered adoption issues.
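One simple way to probe parallel trends is a placebo regression on pre-treatment periods only, sketched below. It reuses the hypothetical panel layout from the DiD sketch above and assumes a numeric `period` index; a large, significant interaction is a warning sign, not a formal proof either way.

```python
# Minimal pre-trends diagnostic: regress the outcome on a treated-by-time
# interaction using only pre-treatment rows.
import pandas as pd
import statsmodels.formula.api as smf

def pre_trend_check(panel: pd.DataFrame) -> tuple[float, float]:
    pre = panel[panel["post"] == 0]
    fit = smf.ols("y ~ treated * period", data=pre).fit(
        cov_type="cluster", cov_kwds={"groups": pre["unit"]}
    )
    # Differential pre-period slope for treated units, with its p-value.
    return fit.params["treated:period"], fit.pvalues["treated:period"]
```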
Where tooling helps
Choosing methods is only half the battle. Implementing them correctly, checking assumptions, and communicating results are where most of the time and risk sit. Good tooling helps by:
- Detecting dataset structure and surfacing which methods are viable.
- Encoding assumptions and diagnostics for each method, not leaving them to chance.
- Providing a unified interface across A/B tests, DiD, PSM, and Synthetic Control.
- Making outputs legible to product and business partners without hiding the details from data scientists.