Published June 3, 2026 | Version v3pre
Dataset Open

Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons

  • 1. National Institute of Chemical Physics and Biophysics

Contributors

Contact person:

Data manager:

  • 1. National Institute of Chemical Physics and Biophysics

Description

 Data description

MC Simulation


The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. Version 3 of this dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 using the FTFP_BERT physics list and the CLIC-like detector (CLD) setup (CLD_o2_v07). Events are reconstructed using the Marlin reconstruction framework, interfaced with Key4hep (release 2025-05-29). Particle candidates in the reconstructed events are built with the PandoraPF algorithm.

This version of the dataset does not include the γγ -> hadrons background.

 

Samples


This dataset contains e+e- samples with Z->ττ and Z->qq events.

The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 91 GeV:

  • p8_ee_Z_tautau_ecm91 [Z -> ττ events]
  • p8_ee_Z_qq_ecm91 [Z -> qq events]

The .root files produced by the MC simulation chain are then processed with the software available on GitHub to create flat ntuples as the final product.


Features


The ntuples are built from the particle flow (PF) candidates produced by PandoraPF. Each PF candidate has a four-momentum, a charge, and a particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for e+e- collisions, with parameters p = -1 and R = 0.4. The minimum pT is set to 0 GeV for both generator-level and reconstructed jets. The dataset contains the four-momenta of the jets, together with the PF candidates in each jet and their properties listed above.
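The clustering step can be illustrated with a minimal pure-Python sketch of the generalized kt algorithm for e+e- collisions. It assumes the commonly used distance measures d_ij = min(E_i^2p, E_j^2p) (1 - cos θ_ij) / (1 - cos R) and d_iB = E_i^2p, and takes particles as (E, px, py, pz) tuples. The function name and the tuple layout are hypothetical; this is an illustration only, not the code used to produce the dataset.

```python
import math


def ee_genkt_cluster(particles, R=0.4, p=-1.0):
    """Naive generalized-kt clustering for e+e- events (illustrative sketch).

    particles: iterable of (E, px, py, pz) tuples with E > 0 (assumed layout).
    Returns the jets as (E, px, py, pz) tuples, sorted by decreasing energy.
    """
    objs = [tuple(float(c) for c in v) for v in particles]
    jets = []
    norm = 1.0 - math.cos(R)

    def cos_angle(a, b):
        # Cosine of the opening angle between the 3-momenta of a and b.
        pa = math.sqrt(a[1] ** 2 + a[2] ** 2 + a[3] ** 2)
        pb = math.sqrt(b[1] ** 2 + b[2] ** 2 + b[3] ** 2)
        if pa == 0.0 or pb == 0.0:
            return 1.0
        dot = a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
        return max(-1.0, min(1.0, dot / (pa * pb)))

    while objs:
        best = None  # (distance, i, j); j is None for a beam distance
        for i, a in enumerate(objs):
            d_ib = a[0] ** (2 * p)
            if best is None or d_ib < best[0]:
                best = (d_ib, i, None)
            for j in range(i + 1, len(objs)):
                b = objs[j]
                d_ij = min(a[0] ** (2 * p), b[0] ** (2 * p))
                d_ij *= (1.0 - cos_angle(a, b)) / norm
                if d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        if j is None:
            # Beam distance is smallest: promote the object to a final jet.
            jets.append(objs.pop(i))
        else:
            # Pair distance is smallest: merge by summing four-momenta.
            b = objs.pop(j)  # j > i, so remove the later element first
            a = objs.pop(i)
            objs.append(tuple(x + y for x, y in zip(a, b)))
    return sorted(jets, key=lambda v: v[0], reverse=True)
```

With p = -1 the most energetic objects are merged first, which is why two nearby particles end up in the same jet while a back-to-back particle forms its own.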

Additionally, a set of variables describing the tau lifetime is calculated. Because the tau lifetime is short but non-zero, the tau decay products are slightly displaced from the primary vertex, so these variables are sensitive to true tau decays. The lifetime variables are calculated using a linear approximation.
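As an illustration of what a linear approximation can look like, the sketch below treats a track as a straight line through a reference point along its momentum direction and computes transverse and longitudinal impact parameters with respect to a primary vertex. The function name, the argument layout, and the sign convention for dxy are assumptions made for this example; the exact definitions used to fill reco_cand_dxy and reco_cand_dz may differ.

```python
import math


def linear_impact_parameters(ref_point, momentum, primary_vertex=(0.0, 0.0, 0.0)):
    """Straight-line (linear) approximation of track impact parameters.

    ref_point:      (x, y, z) of a point on the track (assumed input).
    momentum:       (px, py, pz) of the track at that point.
    primary_vertex: (x, y, z) of the primary vertex.
    Returns (dxy, dz): signed transverse distance of closest approach and
    the z offset of the track at that transverse point of closest approach.
    """
    x = ref_point[0] - primary_vertex[0]
    y = ref_point[1] - primary_vertex[1]
    z = ref_point[2] - primary_vertex[2]
    px, py, pz = momentum
    pt2 = px * px + py * py
    if pt2 == 0.0:
        raise ValueError("track has no transverse momentum")
    # Line parameter at which the transverse distance to the vertex is minimal.
    t = -(x * px + y * py) / pt2
    # Signed transverse distance from the vertex to the straight line.
    dxy = (x * py - y * px) / math.sqrt(pt2)
    # Longitudinal offset evaluated at the transverse point of closest approach.
    dz = z + t * pz
    return dxy, dz
```

Neutral candidates have no track, which is consistent with the fill values described for the impact parameter columns.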

In summary, the features found in the flat ntuples are:

 

Name Description
reco_cand_p4s 4-momenta per particle in the reco jet.
reco_cand_charges Charge per particle in the jet.
reco_cand_pdgs PDGid per particle in the jet.
reco_jet_p4s RecoJet 4-momenta.
reco_cand_dz Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dz_err Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy_err Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
gen_jet_p4s GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3.
gen_jet_tau_decaymode Decay mode of the associated GenTau. Jets associated with leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to the GenJet within dR < 0.4, a fill value is used.
gen_jet_tau_charge Charge of the GenTau.
gen_jet_tau_p4s Visible 4-momentum of the GenTau. If no GenTau can be matched to the GenJet within dR < 0.4, a fill value is used.
cls_weight Per-jet classification weight used to balance signal and background contributions across bins in jet θ–p space during classifier training.
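To make the idea behind cls_weight concrete, here is one simple way such balancing weights could be built: every jet receives the inverse of the number of jets sharing its class and its bin, so that signal and background carry equal total weight in each bin where both are present. This is a hypothetical recipe for illustration; the record does not state how the stored weights were derived, and the function name and the precomputed bin indices are assumptions.

```python
from collections import Counter


def balancing_weights(labels, bin_ids):
    """Per-jet weights that equalise class totals within each bin.

    labels:  class label per jet (e.g. 1 for signal, 0 for background).
    bin_ids: bin index per jet, e.g. a flattened index in a theta-p grid
             (assumed to be computed beforehand).
    """
    counts = Counter(zip(labels, bin_ids))
    return [1.0 / counts[(label, b)] for label, b in zip(labels, bin_ids)]
```

For example, with two signal jets and three background jets in the same bin, the signal jets get weight 1/2 and the background jets get weight 1/3, so both classes sum to 1 in that bin.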

The ground truth is based on stable generator-level particles, before detector simulation. These particles are clustered into generator-level jets, which are matched to generator-level τ leptons as well as to reconstructed jets. For a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to lie inside a cone of dR = 0.4. The same applies to the reconstructed jet, with the requirement on dR also set to dR = 0.4.
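The cone matching described above can be sketched as a nearest-neighbour search in angular distance. The example assumes dR = sqrt(Δη² + Δφ²) with objects given as (eta, phi) pairs, and uses -1 as a placeholder for "no match"; the dR definition, the function names, and the placeholder are assumptions, and the actual fill value used in the ntuples is not specified here.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def match_nearest(jets, taus, max_dr=0.4):
    """For each jet, return the index of the closest tau with dR < max_dr.

    jets, taus: lists of (eta, phi) pairs (assumed layout).
    Returns -1 (a placeholder chosen for this sketch) when no tau is in the cone.
    """
    matches = []
    for jet_eta, jet_phi in jets:
        best_idx, best_dr = -1, max_dr
        for k, (tau_eta, tau_phi) in enumerate(taus):
            dr = delta_r(jet_eta, jet_phi, tau_eta, tau_phi)
            if dr < best_dr:
                best_idx, best_dr = k, dr
        matches.append(best_idx)
    return matches
```

Wrapping the phi difference matters for objects on either side of the ±π boundary, which would otherwise appear far apart.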

Contents:

  • qq_test.parquet
  • qq_train.parquet
  • z_test.parquet
  • z_train.parquet

Dataset characteristics

 

File                # Jets       Size
z_test.parquet      593 540      139 MB
z_train.parquet     5 341 858    1.3 GB
qq_test.parquet     786 130      150 MB
qq_train.parquet    7 075 163    1.4 GB

The dataset consists of 4 files totalling 2.9 GB.

How can you use these data?

The .parquet files can be directly loaded with the Awkward Array Python library.
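A minimal loading sketch is given below. It assumes that the awkward and pyarrow packages are installed and that the files have been downloaded to the working directory; the helper name load_jets and the choice of columns are illustrative only.

```python
from pathlib import Path

# Column names taken from the feature table of this record.
DEFAULT_COLUMNS = ["reco_jet_p4s", "reco_cand_p4s", "gen_jet_tau_decaymode"]


def load_jets(path, columns=None):
    """Read one of the .parquet files into an Awkward Array.

    columns=None reads every column; passing a list reads only those columns,
    which helps keep memory usage down for the larger training files.
    """
    import awkward as ak  # third-party; reading Parquet also needs pyarrow

    return ak.from_parquet(str(path), columns=columns)


if __name__ == "__main__":
    path = Path("z_test.parquet")  # assumed local download location
    if path.exists():
        jets = load_jets(path, columns=DEFAULT_COLUMNS)
        print(len(jets), jets.fields)
```

Each entry of the resulting array corresponds to one jet, with the per-candidate columns stored as variable-length lists.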

Files

Files (3.0 GB)

Checksum (MD5)                          Size
md5:343a555723deea7060291c8abedf45f5    156.8 MB
md5:cf01f63bcb19801dd4c7564d7d0b365f    1.4 GB
md5:c52be1f443ee314e6b0b5d490f33ce7f    145.7 MB
md5:77d8cb0cabe5ffe6906d154779161844    1.3 GB
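The checksums listed above can be used to verify a download. The sketch below computes the MD5 digest of a local file in chunks and checks whether it is one of the four published values. The helper names are hypothetical, and the file-to-checksum mapping is deliberately not assumed, since it is not given here.

```python
import hashlib

# MD5 digests published with this record (mapping to file names not assumed).
PUBLISHED_MD5 = {
    "343a555723deea7060291c8abedf45f5",
    "cf01f63bcb19801dd4c7564d7d0b365f",
    "c52be1f443ee314e6b0b5d490f33ce7f",
    "77d8cb0cabe5ffe6906d154779161844",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_published_file(path):
    """True if the file's MD5 digest matches one of the published checksums."""
    return md5sum(path) in PUBLISHED_MD5
```

A call such as is_published_file("z_test.parquet") returning False would indicate a corrupted or incomplete download.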

Additional details

Funding

  • Estonian Research Council: Flexible and scalable data reconstruction and analysis using machine learning (PSG864)
  • Estonian Research Council: European Organisation for Nuclear Research (TARISTU24-TK10)
  • Estonian Research Council: AI for data reconstruction in high energy physics experiments (PUTJD1344)
  • Estonian Research Council: Study of Higgs boson pair production and the trilinear Higgs boson self-coupling with the LHC Run 3 and beyond (PRG2502)
  • Ministry of Education and Research: Foundations of the Universe (TK202)
  • European Commission: mPP - machine learning for Particle Physics (772369)

Software

Repository URL
https://doi.org/10.5281/zenodo.20596644
Development Status
Active