Evaluating causal inference methods in a scientifically thorough way is a cumbersome and error-prone task. To foster good scientific practice, JustCause provides a framework to easily:

  1. evaluate your method using common data sets like IHDP, IBM ACIC, and others;
  2. create synthetic data sets with a generic but standardized approach;
  3. benchmark your method against several baseline and state-of-the-art methods.

Our cause is to develop a framework that allows you to compare methods for causal inference in a fair and just way. JustCause is a work in progress and new contributors are always welcome.

The reasons for creating a library like JustCause are laid out in the thesis A Systematic Review of Machine Learning Estimators for Causal Effects by Maximilian Franz. Therein, it is shown that many publications on causality:

  • lack reproducibility,
  • use different versions of the seemingly same data set,
  • fail to state that some theoretical conditions in the data set are not met,
  • miss several state-of-the-art methods in their comparison.

A more standardized approach, as offered by JustCause, can improve on these points.


Install JustCause with:

pip install justcause

but consider using conda to set up an isolated environment beforehand. This can be done with:

conda env create -f environment.yaml
conda activate justcause

with the following environment.yaml.
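The file referenced above is not reproduced here; a minimal sketch of what such an environment.yaml could look like (the environment name and Python version are assumptions, adjust them to your needs):

```yaml
# Hypothetical minimal conda environment for JustCause
name: justcause
channels:
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
      - justcause
```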


For a minimal example we are going to load the IHDP (Infant Health and Development Program) data set, do a train/test split, apply a basic learner to each replication, and display some metrics:

>>> from justcause.data import load_ihdp
>>> from justcause.learners import SLearner
>>> from justcause.learners.propensity import estimate_propensities
>>> from justcause.metrics import pehe_score, mean_absolute
>>> from justcause.evaluation import calc_scores

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LinearRegression

>>> import pandas as pd

>>> replications = load_ihdp(select_rep=[0, 1, 2])
>>> slearner = SLearner(LinearRegression())
>>> metrics = [pehe_score, mean_absolute]
>>> scores = []

>>> for rep in replications:
>>>    train, test = train_test_split(rep, train_size=0.8)
>>>    p = estimate_propensities(train.np.X, train.np.t)
>>>    slearner.fit(train.np.X, train.np.t, train.np.y, weights=1/p)
>>>    pred_ite = slearner.predict_ite(test.np.X, test.np.t, test.np.y)
>>>    scores.append(calc_scores(test.np.ite, pred_ite, metrics))

>>> pd.DataFrame(scores)
   pehe_score  mean_absolute
0    0.998388       0.149710
1    0.790441       0.119423
2    0.894113       0.151275
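Since the per-replication scores end up in an ordinary pandas DataFrame, summarizing them across replications is plain pandas. A sketch using the numbers from the output above (the column names follow the metric function names, as shown there):

```python
import pandas as pd

# One dict of scores per replication, as collected in the loop above
scores = [
    {"pehe_score": 0.998388, "mean_absolute": 0.149710},
    {"pehe_score": 0.790441, "mean_absolute": 0.119423},
    {"pehe_score": 0.894113, "mean_absolute": 0.151275},
]

df = pd.DataFrame(scores)
# Mean and standard deviation of each metric across replications
summary = df.agg(["mean", "std"])
print(summary)
```

Reporting mean and standard deviation over replications, rather than a single run, is exactly the kind of practice the framework is meant to encourage.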
