Synthetic Data for Uplift Modeling and Heterogenous Treatment Effect with Known Counterfactuals and ITE

Zhao, Zhenyu

doi:10.5281/zenodo.6342553

Published March 10, 2022 | Version v1

Dataset Open

Zhao, Zhenyu¹

1. Tencent

This dataset is designed and simulated for evaluating uplift modeling. The data generation process is based on a logistic regression model; no real data is included or used to generate this dataset.

This dataset has two distinguishing properties:

Its features follow a variety of association patterns with the outcome variable and with the causal effect (treatment effect), which makes it suitable for evaluating feature importance and model interpretation in uplift modeling.
The true counterfactual outcomes under control and treatment are known for each user, as is the true ITE (individual treatment effect).
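
Because both counterfactual conversion probabilities are provided per row, an ITE can be recomputed directly. The sketch below assumes (the description does not state this explicitly) that 'treatment1_true_effect' equals 'treatment1_conversion_prob' minus 'control_conversion_prob'. The rows are made-up illustration values, not taken from the dataset.

```python
from statistics import mean

def ite_from_probs(row):
    """Difference of the two counterfactual conversion probabilities for one row.

    Assumption: the ITE is treatment probability minus control probability.
    """
    return float(row["treatment1_conversion_prob"]) - float(row["control_conversion_prob"])

def average_treatment_effect(rows):
    """Mean of the per-row ITEs, i.e. the ATE over the given rows."""
    return mean(ite_from_probs(r) for r in rows)

# Hypothetical rows for illustration only (values are invented).
demo_rows = [
    {"control_conversion_prob": "0.10", "treatment1_conversion_prob": "0.15"},
    {"control_conversion_prob": "0.20", "treatment1_conversion_prob": "0.20"},
    {"control_conversion_prob": "0.30", "treatment1_conversion_prob": "0.25"},
]
demo_ites = [ite_from_probs(r) for r in demo_rows]
demo_ate = average_treatment_effect(demo_rows)
```

Comparing the recomputed value with the 'treatment1_true_effect' column on the real file would be a quick way to confirm or refute the assumption.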

This dataset consists of 50 trials (replicates generated with different random seeds), each with 20,000 samples and 36 features. The outcome variable is binary, so the dataset is suited to classification problems. Samples are split equally between the control and treatment groups (10,000 per group in each trial).
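
The trial and group structure described above can be checked without loading the whole file into memory. The following is a minimal sketch using only the standard library; the group labels in the demo ('control', 'treatment1') are assumptions, since the description does not list the values of 'treatment_group_key'.

```python
import csv
import io
from collections import Counter

def count_by_trial_and_group(csv_file):
    """Stream a CSV file object and count rows per (trial_id, treatment_group_key)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["trial_id"], row["treatment_group_key"])] += 1
    return counts

# Tiny in-memory stand-in for the real file (invented rows, assumed group labels).
demo_csv = io.StringIO(
    "trial_id,treatment_group_key,conversion\n"
    "0,control,0\n"
    "0,treatment1,1\n"
    "1,control,1\n"
    "1,treatment1,0\n"
)
demo_counts = count_by_trial_and_group(demo_csv)

# On the real file one might run:
# with open("synthetic_uplift_data.csv", newline="") as f:
#     counts = count_by_trial_and_group(f)
```

If the description holds, the real file should yield 50 distinct trial IDs with 10,000 rows for each trial and group pair.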

The generated data has three types of features: (1) uplift features influencing the treatment effect on the conversion probability; (2) classification features affecting the conversion probability but independent of the treatment effect; and (3) irrelevant features that are independent of both conversion probability and the treatment effect.

To simulate how uplift features relate to the treatment effect, and how classification features relate to the outcome probability, we implement six types of association patterns in the data generation process: linear, quadratic, cubic, ReLU (Rectified Linear Unit), sine, and cosine.
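
To make the six patterns concrete, here is an illustrative sketch of how a logistic model could combine them. It is not the generator used for this dataset: the intercept, coefficients, function names, and the choice to add the uplift term only under treatment are all hypothetical.

```python
import math

# The six association patterns named in the description.
PATTERNS = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "cubic": lambda x: x ** 3,
    "relu": lambda x: max(x, 0.0),
    "sin": math.sin,
    "cos": math.cos,
}

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def conversion_probs(x_class, x_uplift, class_pattern, uplift_pattern,
                     intercept=-2.0, class_coef=0.5, uplift_coef=0.5):
    """Return (control, treatment) conversion probabilities for one sample.

    Hypothetical setup: the classification feature shifts the score in both
    arms, while the uplift feature shifts it only in the treatment arm.
    """
    base = intercept + class_coef * PATTERNS[class_pattern](x_class)
    control = sigmoid(base)
    treatment = sigmoid(base + uplift_coef * PATTERNS[uplift_pattern](x_uplift))
    return control, treatment

# A negative uplift feature under the ReLU pattern contributes nothing,
# so both arms share the same probability in this sketch.
control_p, treatment_p = conversion_probs(1.0, -1.0, "quadratic", "relu")
```

In a setup like this, the true treatment effect for a sample is simply the difference between the two returned probabilities.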

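In this dataset, there are 36 features in total: 10 classification features, 6 uplift features, and 20 irrelevant features. These counts match the column-name suffixes listed below: 10 'informative', 6 'uplift_increase', and 20 'irrelevant'.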

Column names:

    Trial ID: 'trial_id'
    Experiment group label: 'treatment_group_key'
    Outcome variable (classification label): 'conversion'
    Feature names: ['x1_informative',
    'x2_informative',
    'x3_informative',
    'x4_informative',
    'x5_informative',
    'x6_informative',
    'x7_informative',
    'x8_informative',
    'x9_informative',
    'x10_informative',
    'x11_irrelevant',
    'x12_irrelevant',
    'x13_irrelevant',
    'x14_irrelevant',
    'x15_irrelevant',
    'x16_irrelevant',
    'x17_irrelevant',
    'x18_irrelevant',
    'x19_irrelevant',
    'x20_irrelevant',
    'x21_irrelevant',
    'x22_irrelevant',
    'x23_irrelevant',
    'x24_irrelevant',
    'x25_irrelevant',
    'x26_irrelevant',
    'x27_irrelevant',
    'x28_irrelevant',
    'x29_irrelevant',
    'x30_irrelevant',
    'x31_uplift_increase',
    'x32_uplift_increase',
    'x33_uplift_increase',
    'x34_uplift_increase',
    'x35_uplift_increase',
    'x36_uplift_increase']
    True underlying control conversion probability: 'control_conversion_prob'
    True underlying treatment conversion probability: 'treatment1_conversion_prob'
    True treatment effect: 'treatment1_true_effect'
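
Since each feature name encodes its type as a suffix, the columns can be grouped programmatically. The sketch below rebuilds the feature names from the list above and splits them by suffix; the helper name is hypothetical, and the column order in the actual CSV is not assumed.

```python
from collections import defaultdict

def group_feature_columns(columns):
    """Group columns named like 'x<number>_<kind>' by their <kind> suffix.

    Columns that do not follow this naming (IDs, labels, true probabilities)
    are ignored.
    """
    groups = defaultdict(list)
    for name in columns:
        prefix, _, kind = name.partition("_")
        if prefix.startswith("x") and prefix[1:].isdigit():
            groups[kind].append(name)
    return dict(groups)

# Feature names as listed in the description.
feature_names = (
    [f"x{i}_informative" for i in range(1, 11)]
    + [f"x{i}_irrelevant" for i in range(11, 31)]
    + [f"x{i}_uplift_increase" for i in range(31, 37)]
)
other_columns = [
    "trial_id", "treatment_group_key", "conversion",
    "control_conversion_prob", "treatment1_conversion_prob",
    "treatment1_true_effect",
]
feature_groups = group_feature_columns(other_columns + feature_names)
```

When training a model, only the grouped feature columns would normally be used as inputs; the true probability and effect columns are ground truth for evaluation and would leak the answer if included.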

Files

    Name: synthetic_uplift_data.csv
    Size: 785.9 MB
    MD5: 48101d5fa9fac9ea530b81375108afb6
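
After downloading, the file can be checked against the MD5 checksum listed for it. This is a minimal sketch that hashes in chunks so the 785.9 MB file is never held in memory at once; the helper name and the local file path in the comment are assumptions.

```python
import hashlib
import io

# Checksum as published in the file listing.
EXPECTED_MD5 = "48101d5fa9fac9ea530b81375108afb6"

def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Small in-memory demo; a tiny chunk size exercises the chunked loop.
demo_digest = md5_of_stream(io.BytesIO(b"example bytes"), chunk_size=4)

# On the downloaded file (path is an assumption) one might run:
# with open("synthetic_uplift_data.csv", "rb") as f:
#     assert md5_of_stream(f) == EXPECTED_MD5
```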

Additional details

https://arxiv.org/abs/2005.03447
