Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons
Description
Data description
MC Simulation
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. Version 3 of this dataset is generated with Pythia8, with the full detector simulation performed by Geant4 using the FTFP_BERT physics list and the CLIC-like detector (CLD) setup (CLD_o2_v07). Events are reconstructed using the Marlin reconstruction framework, interfaced with Key4HEP (release 2025-05-29). Particle candidates in the reconstructed events are built with the PandoraPF algorithm.
In this version of the dataset no γγ -> hadrons background is included.
Samples
This dataset contains e+e- samples with Z->ττ and Z->qq events.
The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 91 GeV:
- p8_ee_Z_tautau_ecm91 [Z -> ττ events]
- p8_ee_Z_qq_ecm91 [Z -> qq events]
The .root files from the MC simulation chain are then processed by the software available on GitHub to create flat ntuples as the final product.
Features
The ntuples are based on the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, a charge, and a particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for e+e- collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for both generator-level and reconstructed jets. The dataset contains the four-momenta of the jets, together with the PF candidates in each jet and their properties listed above.
Additionally, a set of variables describing the tau lifetime is calculated. As the tau lifetime is very short, these variables are sensitive to true tau decays. The lifetime variables are calculated using a linear approximation.
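For orientation, the distance measures of the generalized kt algorithm for e+e- collisions, as defined in the FastJet manual, can be sketched as follows. This is only an illustration of the metric, not the clustering code used to produce the dataset. The function names are hypothetical, and the (E, px, py, pz) tuple layout is an assumption that need not match the ntuple storage format.

```python
import math

def cos_opening_angle(p4_a, p4_b):
    """Cosine of the angle between the three-momenta of two particles.

    Each four-momentum is assumed to be an (E, px, py, pz) tuple
    with non-zero three-momentum.
    """
    _, ax, ay, az = p4_a
    _, bx, by, bz = p4_b
    norm_a = math.sqrt(ax * ax + ay * ay + az * az)
    norm_b = math.sqrt(bx * bx + by * by + bz * bz)
    return (ax * bx + ay * by + az * bz) / (norm_a * norm_b)

def d_ij(p4_a, p4_b, p=-1.0, R=0.4):
    """Pairwise distance: min(E_i^2p, E_j^2p) * (1 - cos theta_ij) / (1 - cos R)."""
    energy_factor = min(p4_a[0] ** (2 * p), p4_b[0] ** (2 * p))
    return energy_factor * (1.0 - cos_opening_angle(p4_a, p4_b)) / (1.0 - math.cos(R))

def d_iB(p4, p=-1.0):
    """Beam distance: E_i^2p."""
    return p4[0] ** (2 * p)
```

At each clustering step the smallest of all d_ij and d_iB values is selected: a pair is merged if a d_ij is smallest, and a particle is declared a jet if its d_iB is smallest. With p=-1 the ordering favours high-energy particles, analogous to anti-kt.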
In summary, the features found in the flat ntuples are:
| Name | Description |
|---|---|
| reco_cand_p4s | 4-momenta per particle in the reco jet. |
| reco_cand_charges | Charge per particle in the jet. |
| reco_cand_pdgs | PDGid per particle in the jet. |
| reco_jet_p4s | RecoJet 4-momenta. |
| reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
| gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
| gen_jet_tau_charge | Charge of the genTau. |
| gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR<0.4, a fill value is used. |
| cls_weight | Per-jet classification weight used to balance signal and background contributions across bins in jet θ–p space during classifier training. |
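As an illustration of how a per-jet balancing weight of this kind could be constructed, the sketch below assigns each jet the inverse of the number of same-class jets in its (θ, p) bin, so that signal and background carry equal total weight in every populated bin. This is not necessarily how `cls_weight` was computed for the dataset: the bin edges, the normalization, and all function names are assumptions made for the example.

```python
from collections import Counter

def bin_index(value, edges):
    """Index i of the bin [edges[i], edges[i+1]) containing value, else None."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

def balancing_weights(thetas, momenta, is_signal, theta_edges, p_edges):
    """Weight each jet by 1 / (number of jets of its class in its bin).

    Jets outside the binning share a common overflow bin (index None).
    """
    bins = [
        (bin_index(theta, theta_edges), bin_index(p, p_edges))
        for theta, p in zip(thetas, momenta)
    ]
    counts = Counter(zip(bins, is_signal))
    return [1.0 / counts[(b, s)] for b, s in zip(bins, is_signal)]
```

With one signal jet and two background jets in the same bin, the signal jet gets weight 1.0 and each background jet 0.5.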
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets, which are matched to generator-level τ leptons as well as to reconstructed jets. For a generator-level jet to be matched to a generator-level τ lepton, the τ lepton must lie inside a cone of dR = 0.4 around the jet. The same applies to the reconstructed jet, with the requirement on dR set to dR = 0.4.
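A cone-based matching of this kind can be sketched as below. The sketch assumes dR is defined in the (η, φ) plane, which the description above does not specify, and uses -1 as a hypothetical fill value for unmatched jets; the function names are likewise illustrative.

```python
import math

FILL_INDEX = -1  # hypothetical fill value for "no match"

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(d_eta^2 + d_phi^2), with d_phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def match_nearest(jet, candidates, max_dr=0.4):
    """Index of the candidate closest to the jet within max_dr, else FILL_INDEX.

    jet and each candidate are (eta, phi) pairs.
    """
    best_index, best_dr = FILL_INDEX, max_dr
    for i, (eta, phi) in enumerate(candidates):
        dr = delta_r(jet[0], jet[1], eta, phi)
        if dr < best_dr:
            best_index, best_dr = i, dr
    return best_index
```

Choosing the nearest candidate rather than the first one inside the cone avoids ambiguity when several candidates fall within the matching radius.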
Contents:
- qq_test.parquet
- qq_train.parquet
- z_test.parquet
- z_train.parquet
Dataset characteristics
| File | # Jets | Size |
|---|---|---|
| z_test.parquet | 593 540 | 139 MB |
| z_train.parquet | 5 341 858 | 1.3 GB |
| qq_test.parquet | 786 130 | 150 MB |
| qq_train.parquet | 7 075 163 | 1.4 GB |
The dataset consists of 4 files of 2.9 GB in total.
How can you use these data?
The .parquet files can be directly loaded with the Awkward Array Python library.
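A minimal loading sketch is given below. It assumes the `awkward` and `pyarrow` packages are installed and that the file has been downloaded locally. The helper name `load_jets` and the column selection are illustrative; the column names are taken from the feature table.

```python
# Columns to read; any subset of the features listed in the table can be used.
COLUMNS = ["reco_jet_p4s", "reco_cand_p4s", "gen_jet_tau_decaymode"]

def load_jets(path, columns=None):
    """Read a parquet file into an Awkward Array.

    Passing a list of column names reads only those fields, which keeps
    memory usage down for the larger training files.
    """
    import awkward as ak  # imported lazily; ak.from_parquet also needs pyarrow

    return ak.from_parquet(path, columns=columns)
```

For example, `jets = load_jets("z_test.parquet", columns=COLUMNS)` would return one record per jet, with `jets.fields` listing the loaded columns and `len(jets)` giving the number of jets.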
Files
(3.0 GB)
| Checksum | Size |
|---|---|
| md5:343a555723deea7060291c8abedf45f5 | 156.8 MB |
| md5:cf01f63bcb19801dd4c7564d7d0b365f | 1.4 GB |
| md5:c52be1f443ee314e6b0b5d490f33ce7f | 145.7 MB |
| md5:77d8cb0cabe5ffe6906d154779161844 | 1.3 GB |
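After downloading, a file can be checked against its listed MD5 checksum with a few lines of standard-library Python. The helper names below are illustrative, and the file path and expected checksum are supplied by the user.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    return md5_of_file(path) == expected.removeprefix("md5:").lower()
```

Reading in chunks matters here because the training files are larger than 1 GB.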
Additional details
Funding
- Estonian Research Council
- Flexible and scalable data reconstruction and analysis using machine learning PSG864
- Estonian Research Council
- European Organisation for Nuclear Research TARISTU24-TK10
- Estonian Research Council
- AI for data reconstruction in high energy physics experiments PUTJD1344
- Estonian Research Council
- Study of Higgs boson pair production and the trilinear Higgs boson self-coupling with the LHC Run 3 and beyond PRG2502
- Ministry of Education and Research
- Foundations of the Universe TK202
- European Commission
- mPP - machine learning for Particle Physics 772369
Software
- Repository URL
- https://doi.org/10.5281/zenodo.20596644
- Development Status
- Active