Published April 3, 2025 | Version v1
Dataset Open

FAIR Universe - HiggsML Uncertainty Challenge Public Dataset

  • 1. Lawrence Berkeley National Laboratory
  • 2. Centre National de la Recherche Scientifique
  • 3. Université Paris-Saclay
  • 4. Ohio State University
  • 5. University of Washington
  • 6. University of California, Berkeley
  • 7. University of California System
  • 8. ChaLearn
  • 9. National Tsing Hua University
  • 10. University of California, San Diego
  • 11. CNRS Délégation Ile-de-France Sud
  • 12. University of California, Davis

Description

HiggsML Uncertainty Challenge Public Dataset

This dataset was created for the HiggsML Uncertainty Challenge, a NeurIPS 2024 competition. Detailed documentation is available in the challenge white paper.

The tabular dataset is created using the particle physics simulation tools Pythia 8.2 and Delphes 3.5.0. The proton-proton collision events are generated with Pythia 8 at a center-of-mass energy of 13 TeV. These events are then passed through Delphes to produce simulated detector measurements. We used an ATLAS-like detector description to make the dataset closer to experimental data. The events are divided into four groups:

  1. Higgs boson signal (H→ττ)
  2. Z boson background (Z→ττ)
  3. Diboson background (VV→ττ)
  4. ttbar background (tt̄)

 

Process    Number Generated    LHC Events    Label
Higgs      52 040 227          1 015         signal
Z Boson    160 383 358         1 002 395     background
Di-Boson   605 118             3 783         background
ttbar      7 070 398           44 192        background


⚠️ Note: "LHC Events" is the average number of events in each category in a pseudo-experiment corresponding to an integrated luminosity of 10 fb⁻¹ at the Large Hadron Collider, that is, approximately 800 billion inelastic proton collisions, or 2 weeks of running in summer 2024 conditions.
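To give a feel for how rare the signal is in one pseudo-experiment, the short sketch below sums the expected yields quoted in the "LHC Events" column. The dictionary keys reuse the detailed label names listed later on this page; the variable names themselves are only illustrative.

```python
# Expected yields per pseudo-experiment, copied from the "LHC Events" column above.
lhc_events = {
    "htautau": 1015,      # Higgs signal
    "ztautau": 1002395,   # Z boson background
    "diboson": 3783,      # diboson background
    "ttbar": 44192,       # ttbar background
}

signal = lhc_events["htautau"]
background = sum(n for name, n in lhc_events.items() if name != "htautau")
total = signal + background

# Fraction of the expected events that are signal.
signal_fraction = signal / total

print(f"expected signal events:     {signal}")
print(f"expected background events: {background}")
print(f"signal fraction:            {signal_fraction:.6f}")
```

The signal makes up roughly one event in a thousand, which is why the event weights described below matter for any analysis of this dataset.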

Higgs Signal:

The Higgs bosons are produced with all possible production modes and decay into two tau leptons. The tau leptons are further allowed to decay into all possible final states, but only final states with one lepton (electron or muon) and one hadronic tau decay are kept.

Z boson Background:

This background consists of events coming from Z bosons. While simulating the process, interference effects between Z bosons and photons are included. As for the signal events, only the tau-tau decay mode of the Z boson is included in the dataset.

⚠️ Note:

The training events have weights.

Event Weights:

Event weights are defined as:

$$w = \frac{\text{Cross-Section} \times \text{Luminosity}}{\text{Total number of generated events}}$$

The challenge considers a scenario in which proton-proton collision data corresponding to an integrated luminosity of 10 fb⁻¹, collected by the ATLAS experiment, is analyzed.
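As a rough illustration of the weight definition, the sketch below derives one weight per process from the numbers in the process table. It assumes that the "LHC Events" column plays the role of cross-section × luminosity and that the "Number Generated" column is the denominator; this page does not confirm either assumption, so the Weight column shipped with the dataset remains the authoritative value.

```python
# Assumption: "LHC Events" stands in for cross-section x luminosity,
# and "Number Generated" is the total number of generated events.
expected_events = {"htautau": 1015, "ztautau": 1002395, "diboson": 3783, "ttbar": 44192}
generated_events = {"htautau": 52040227, "ztautau": 160383358, "diboson": 605118, "ttbar": 7070398}


def event_weight(expected, generated):
    """Per-event weight so that the weights of a process sum to its expected yield."""
    return expected / generated


weights = {
    name: event_weight(expected_events[name], generated_events[name])
    for name in expected_events
}

for name, w in weights.items():
    print(f"{name:8s} weight per event = {w:.3e}")
```

With such weights, summing the weights of all generated events of a process reproduces its expected yield in 10 fb⁻¹.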

Features in the data

Prefix-less variables

Weight, Label and DetailedLabel have a special role and should NOT be used as regular features for the model:

Variable Description
Weight The event weight $w_i$.
Label The event label $y_i \in \{1, 0\}$ (1 for signal, 0 for background).
DetailedLabel The detailed event label, one of htautau, ztautau, diboson, ttbar.
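A minimal sketch of keeping these special columns out of the model inputs is shown below. It represents one event as a plain dictionary and assumes the column names are spelled exactly as in the sentence above; the names used in the actual files may differ, and split_event is a hypothetical helper, not part of the challenge software.

```python
# Assumption: column names spelled as on this page; check the actual files.
SPECIAL_COLUMNS = ("Weight", "Label", "DetailedLabel")


def split_event(row):
    """Split one event (dict of column -> value) into model features and special fields."""
    features = {k: v for k, v in row.items() if k not in SPECIAL_COLUMNS}
    special = {k: row[k] for k in SPECIAL_COLUMNS if k in row}
    return features, special


# Illustrative values only, not taken from the dataset.
row = {
    "PRI_had_pt": 35.2,
    "PRI_lep_pt": 28.1,
    "Weight": 0.00625,
    "Label": 0,
    "DetailedLabel": "ztautau",
}
features, special = split_event(row)
print(sorted(features))  # only PRI_/DER_ columns remain
print(special)
```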

Primary Features

The variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector, essentially parameters of the momenta of particles.

Variable Description
PRI_had_pt The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the hadronic tau.
PRI_had_eta The pseudorapidity η of the hadronic tau.
PRI_had_phi The azimuth angle ϕ of the hadronic tau.
PRI_lep_pt The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the lepton (electron or muon).
PRI_lep_eta The pseudorapidity η of the lepton.
PRI_lep_phi The azimuth angle ϕ of the lepton.
PRI_met The missing transverse energy $E_T^{\text{miss}}$.
PRI_met_phi The azimuth angle ϕ of the missing transverse energy.
PRI_jet_num The number of jets (integer with a value of 0, 1, 2 or 3; possible larger values have been capped at 3).
PRI_jet_leading_pt The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the leading jet, that is, the jet with the largest transverse momentum (undefined if PRI_jet_num = 0).
PRI_jet_leading_eta The pseudorapidity η of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_leading_phi The azimuth angle ϕ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_subleading_pt The transverse momentum $\sqrt{p_x^2 + p_y^2}$ of the subleading jet, that is, the jet with the second largest transverse momentum (undefined if PRI_jet_num ≤ 1).
PRI_jet_subleading_eta The pseudorapidity η of the subleading jet (undefined if PRI_jet_num ≤ 1).
PRI_jet_subleading_phi The azimuth angle ϕ of the subleading jet (undefined if PRI_jet_num ≤ 1).
PRI_jet_all_pt The scalar sum of the transverse momentum of all the jets of the events.
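The (pt, eta, phi) triplets above are the usual collider coordinates. The sketch below converts them to Cartesian momentum components with the standard relations and checks that the transverse momentum is recovered. The function names are illustrative and do not come from the challenge code.

```python
import math


def to_cartesian(pt, eta, phi):
    """Convert (pt, eta, phi) to Cartesian momentum components (px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    return px, py, pz


def transverse_momentum(px, py):
    """Transverse momentum sqrt(px^2 + py^2)."""
    return math.hypot(px, py)


# Illustrative values only (pt in GeV).
px, py, pz = to_cartesian(pt=40.0, eta=0.5, phi=1.2)
print(px, py, pz)
print(transverse_momentum(px, py))
```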

Derived Features

These variables are derived from the primary variables with the help of derived_quantities.py.

Variable Description
DER_mass_transverse_met_lep The transverse mass between the missing transverse energy and the lepton.
DER_mass_vis The invariant mass of the hadronic tau and the lepton.
DER_pt_h The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton and the missing transverse energy vector.
DER_deltaeta_jet_jet The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num ≤ 1).
DER_mass_jet_jet The invariant mass of the two jets (undefined if PRI_jet_num ≤ 1).
DER_prodeta_jet_jet The product of the pseudorapidities of the two jets (undefined if PRI_jet_num ≤ 1).
DER_deltar_had_lep The $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$ separation between the hadronic tau and the lepton.
DER_pt_tot The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets).
DER_sum_pt The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1) and the subleading jet (if PRI_jet_num = 2) and the other jets (if PRI_jet_num = 3).
DER_pt_ratio_lep_tau The ratio of the transverse momenta of the lepton and the hadronic tau.
DER_met_phi_centrality The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton.
DER_lep_eta_centrality The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num ≤ 1).
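To make a few of these definitions concrete, the sketch below computes four derived quantities from primary features using textbook formulas. It treats the lepton and the hadronic tau as massless, which is an assumption, and it is not a copy of derived_quantities.py, whose conventions may differ in detail.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Delta R separation, as in DER_deltar_had_lep."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def visible_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles, as in DER_mass_vis."""
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))


def transverse_mass(lep_pt, lep_phi, met, met_phi):
    """Transverse mass of lepton and missing energy, as in DER_mass_transverse_met_lep."""
    mt2 = 2.0 * lep_pt * met * (1.0 - math.cos(lep_phi - met_phi))
    return math.sqrt(max(mt2, 0.0))


def pt_ratio(lep_pt, had_pt):
    """Ratio of lepton to hadronic tau transverse momentum, as in DER_pt_ratio_lep_tau."""
    return lep_pt / had_pt


# Illustrative back-to-back configuration (momenta in GeV).
print(visible_mass(45.0, 0.0, 0.0, 45.0, 0.0, math.pi))
print(transverse_mass(30.0, 0.0, 30.0, math.pi))
print(delta_r(0.3, 3.0, -0.2, -3.0))
```

Wrapping the azimuthal difference matters for the ΔR separation: two objects at ϕ = 3.0 and ϕ = −3.0 are close in angle, not almost 2π apart.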

Preselection Cuts

Criteria                           Pre-selection cut    Post-selection cut
Number of $\tau_{\text{had}}$      1
Number of $\tau_{\text{lep}}$      1
$p_T^{\tau_{\text{had}}}$          > 20 GeV             > 26 GeV
$p_T^{\tau_{\text{lep}}}$          > 20 GeV             > 20 GeV
$p_T^{\text{leading jet}}$         > 20 GeV             > 26 GeV
$p_T^{\text{subleading jet}}$      > 20 GeV             > 26 GeV
Charge                             Opposite charges

⚠️ Note: The post-selection cuts are the cuts applied after the systematics are applied.
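As a hedged illustration of the post-selection thresholds, the sketch below checks the hadronic tau and lepton cuts for a single event stored as a dictionary. The jet thresholds are deliberately left out, because this page does not say whether a jet failing its cut removes the jet or the whole event. passes_post_selection is a hypothetical helper, not part of the challenge software.

```python
# Post-selection thresholds in GeV for the hadronic tau and the lepton,
# taken from the table above. Jet thresholds are intentionally not handled here.
POST_SELECTION_MIN_PT = {
    "PRI_had_pt": 26.0,
    "PRI_lep_pt": 20.0,
}


def passes_post_selection(event):
    """Return True if the event passes the tau and lepton post-selection cuts."""
    return all(event[col] > threshold for col, threshold in POST_SELECTION_MIN_PT.items())


# Illustrative events only.
kept = passes_post_selection({"PRI_had_pt": 31.0, "PRI_lep_pt": 24.5})
dropped = passes_post_selection({"PRI_had_pt": 23.0, "PRI_lep_pt": 24.5})
print(kept, dropped)
```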
⚠️ Note: The dataset might not be properly shuffled. One could use dataset.py from the dataset repository (see below).
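If the challenge utilities are not used, a reproducible shuffle can also be done by hand. The sketch below is a generic approach using only the Python standard library, not the method implemented in dataset.py: it shuffles a list of row indices once and applies the same order to features, labels and weights so that they stay aligned.

```python
import random


def shuffled_indices(n_events, seed=42):
    """Return a reproducible random permutation of range(n_events)."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    return indices


# Illustrative toy data: one feature per event, plus label and weight.
features = [[35.2], [41.0], [27.9], [60.3]]
labels = [0, 1, 0, 0]
weights = [0.00625, 0.0000195, 0.00625, 0.00625]

order = shuffled_indices(len(labels))
shuffled_features = [features[i] for i in order]
shuffled_labels = [labels[i] for i in order]
shuffled_weights = [weights[i] for i in order]
print(order)
```

Shuffling indices rather than each list separately guarantees that every event keeps its own label and weight.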

 

Utility Software

Alongside the dataset, a GitHub repository with the relevant code for reading and analysing it is made available. This includes a Jupyter notebook starting kit, simple baseline models, and code to run the challenge and generate the score. The repository also has a sample dataset, a subset of the main dataset, to let users experience the challenge software without downloading the much larger dataset.

The code for dataset generation is provided in a dedicated repository: https://github.com/FAIR-Universe/genHEPdata. This repository also contains a Dockerfile, which facilitates the installation of the necessary software dependencies.

Files

FAIR_Universe_HiggsML_data.zip (15.1 GB)
md5:7fb2a4f2b73bb8dcdaa6ffc0fe67e96e
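Given the size of the archive, it can be worth checking the download against the MD5 checksum listed above. The sketch below reads the file in chunks so that it is never loaded into memory at once. The local path in the commented usage lines is an assumption about where the file was saved.

```python
import hashlib

EXPECTED_MD5 = "7fb2a4f2b73bb8dcdaa6ffc0fe67e96e"  # checksum listed on this page


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (the path is an assumption about the local download location):
# ok = md5_of_file("FAIR_Universe_HiggsML_data.zip") == EXPECTED_MD5
# print("checksum OK" if ok else "checksum mismatch, download again")
```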

Additional details

Software

Repository URL
https://github.com/FAIR-Universe/FAIR_Universe_dataset
Programming language
Python
Development Status
Active