Published April 12, 2022 | Version v1
Dataset Open

HEPMASS-IMB

  • 1. University of Bologna

Contributors

Contact person:

  • 1. University of Bologna

Description

HEPMASS-IMB is a benchmark dataset for signal-background classification in High-Energy Physics (HEP). It is derived from HEPMASS (Baldi et al.) by imbalancing it twice: once on the class labels and once on the mass labels.

  • It has 27 feature columns (named f0 to f26) and a 28th feature, the mass (named mass).
  • The 27 features are already normalized to have approximately zero mean and unit variance.
  • The mass feature has five unique values: 500, 750, 1000, 1250, and 1500.
  • There are two class labels: 1 (signal), and 0 (background).
  • The dataset describes the decay of a hypothetical particle: \(X \to t\bar{t} \to W^+ b W^- \bar{b}\).
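Since the dataset is imbalanced on both the class and the mass labels, it can help to tabulate the event counts per (mass, class) pair before training. The following is a minimal sketch of such a check, assuming the label column is called `type` (as noted further down); the helper name `imbalance_summary` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# The five mass values listed in the description above.
MASSES = [500.0, 750.0, 1000.0, 1250.0, 1500.0]


def imbalance_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count events per (mass, class) pair.

    Hypothetical helper: assumes columns 'mass' and 'type',
    with type 1 = signal and type 0 = background.
    """
    labels = df["type"].astype(int)
    counts = df.groupby([df["mass"], labels]).size().unstack("type", fill_value=0)
    # Make sure every mass and both classes appear, even with zero events.
    counts = counts.reindex(index=MASSES, columns=[0, 1], fill_value=0)
    counts.columns = ["background", "signal"]
    total = counts["background"] + counts["signal"]
    counts["signal_fraction"] = counts["signal"] / total  # NaN where total is 0
    return counts


# Toy data (not taken from the dataset), just to show the call.
toy = pd.DataFrame(
    {
        "mass": [500.0, 500.0, 500.0, 750.0, 1500.0],
        "type": [0, 0, 1, 1, 0],
    }
)
summary = imbalance_summary(toy)
print(summary)
```

On the real training files, the same call would show how strongly each mass value and each class is under- or over-represented.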

Further details about the original dataset are available here, whereas a description of our modifications is presented in our paper.

NOTE:

  • The files provided here contain only the training set, since it is the only part that differs from the original HEPMASS.
  • The label column has been renamed from "# label" to "type".
  • There are two new columns: name, and weight.
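Given the column layout described above, one way to separate the inputs from the labels is sketched below. This is only an illustration under stated assumptions: the helper `split_columns` is hypothetical, the toy values are made up, and the contents of the name and weight columns are placeholders, since their meaning is described in the paper rather than here.

```python
import pandas as pd

# Column names as given in the description: f0 ... f26, mass, type, name, weight.
FEATURES = [f"f{i}" for i in range(27)]
EXTRAS = ["name", "weight"]


def split_columns(df: pd.DataFrame):
    """Split a frame into features, mass, labels, and the two extra columns.

    Hypothetical helper; raises KeyError if an expected column is missing.
    """
    expected = FEATURES + ["mass", "type"] + EXTRAS
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    x = df[FEATURES].to_numpy(dtype="float32")
    mass = df["mass"].to_numpy(dtype="float32")
    y = df["type"].to_numpy(dtype="int64")
    extras = df[EXTRAS]  # left untouched, no assumption about their contents
    return x, mass, y, extras


# Toy frame with two rows (placeholder values, not real events).
toy = pd.DataFrame({col: [0.0, 1.0] for col in FEATURES})
toy["mass"] = [500.0, 750.0]
toy["type"] = [1, 0]
toy["name"] = ["a", "b"]
toy["weight"] = [1.0, 1.0]

x, mass, y, extras = split_columns(toy)
print(x.shape, mass.shape, y.shape, list(extras.columns))
```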

Steps to adapt `all_test.csv` (from HEPMASS):

import numpy as np
import pandas as pd

# 1. Load csv
df = pd.read_csv('<your-path>/all_test.csv')

# 2. Rename columns
df.rename(columns={'# label': 'type'}, inplace=True)

# 3. Adjust mass column (set the smallest mass value to exactly 500.0)
mass = np.sort(df['mass'].unique())
df.loc[df['mass'] == mass[0], 'mass'] = 500.0

# 4. Finally save the new csv
df.to_csv('<your-path>/test.csv', index=False)
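After running the steps above, it may be worth checking that the adapted test set matches the conventions of the training files. The sketch below is one possible sanity check, not part of the original instructions: the helper `check_adapted` is hypothetical, and the toy frame (including the slightly-off lowest mass value) is made up for illustration.

```python
import numpy as np
import pandas as pd

# The five mass values used by HEPMASS-IMB.
EXPECTED_MASSES = {500.0, 750.0, 1000.0, 1250.0, 1500.0}


def check_adapted(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if "type" not in df.columns:
        problems.append("missing 'type' column")
    if "# label" in df.columns:
        problems.append("old '# label' column still present")
    if "mass" not in df.columns:
        problems.append("missing 'mass' column")
    else:
        found = {float(m) for m in df["mass"].unique()}
        unexpected = sorted(found - EXPECTED_MASSES)
        if unexpected:
            problems.append(f"unexpected mass values: {unexpected}")
    return problems


# Toy stand-in for all_test.csv (illustrative values only).
raw = pd.DataFrame(
    {
        "# label": [1.0, 0.0, 1.0],
        "f0": [0.1, -0.3, 1.2],
        "mass": [499.99, 750.0, 1500.0],
    }
)
before = check_adapted(raw)

# Apply the same rename and mass adjustment as in the steps above.
adapted = raw.rename(columns={"# label": "type"})
lowest = np.sort(adapted["mass"].unique())[0]
adapted.loc[adapted["mass"] == lowest, "mass"] = 500.0
after = check_adapted(adapted)

print(before)
print(after)
```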

 

Files

hepmass-imb.zip (440.5 MB)
md5:3bf71690b1b7684e209b93fceaa06491

Additional details

Related works

Cites
Preprint: arXiv:1601.07913 (arXiv)
Is cited by
Preprint: arXiv:2202.00424 (arXiv)
Is part of
Dataset: http://archive.ics.uci.edu/ml/datasets/hepmass (URL)

References

  • Baldi et al. (2015) HEPMASS
  • Baldi et al. (2016) Parameterized Machine Learning for High-Energy Physics