There is a newer version of the record available.

Published July 2, 2024 | Version v1
Dataset Open

Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons

  • 1. National Institute of Chemical Physics and Biophysics

Contributors

Contact person:

Data manager:

  • 1. National Institute of Chemical Physics and Biophysics

Description

 Data description

MC Simulation


The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 using the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework, interfaced with Key4HEP. Particle candidates in the reconstructed events are built using the PandoraPF algorithm.

In this version of the dataset no γγ -> hadrons background is included.

Samples


This dataset contains e+e- samples with Z->ττ, ZH->Zττ and Z->qq events, with approximately 2 million events simulated in each category.

The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:

  • p8_ee_qq_ecm380 [Z -> qq events]
  • p8_ee_ZH_Htautau [ZH -> Ztautau]
  • p8_ee_Z_Ztautau_ecm380 [Z -> tautau]

The .root files from the MC simulation chain are then processed by the software found on GitHub to create flat ntuples as the final product.


Features


The ntuples are built from the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, a charge and a particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for e+e- collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for genJets and 5 GeV for reconstructed jets. The dataset contains the four-momenta of the jets, together with the PF candidates in each jet and their properties listed above.
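
To make the clustering parameters concrete, here is a minimal sketch of the pairwise and beam distances of the generalized kt algorithm for e+e- collisions. It assumes the convention used in the FastJet manual, d_ij = min(E_i^2p, E_j^2p) (1 - cos θ_ij) / (1 - cos R) and d_iB = E_i^2p, which the record itself does not spell out. The function name and the (E, px, py, pz) ordering are illustrative choices, not taken from the dataset software.

```python
import numpy as np


def ee_genkt_distances(p4_i, p4_j, p=-1.0, R=0.4):
    """Return (d_ij, d_iB) for two four-momenta given as (E, px, py, pz).

    Assumed convention (FastJet-style e+e- generalized kt):
        d_ij = min(E_i^(2p), E_j^(2p)) * (1 - cos(theta_ij)) / (1 - cos(R))
        d_iB = E_i^(2p)
    """
    p4_i = np.asarray(p4_i, dtype=float)
    p4_j = np.asarray(p4_j, dtype=float)
    e_i, e_j = p4_i[0], p4_j[0]
    v_i, v_j = p4_i[1:], p4_j[1:]
    # Opening angle between the two three-momenta.
    cos_theta = np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    d_ij = min(e_i ** (2 * p), e_j ** (2 * p)) * (1.0 - cos_theta) / (1.0 - np.cos(R))
    d_ib = e_i ** (2 * p)
    return d_ij, d_ib


# Two massless 10 GeV particles at right angles: d_ij > d_iB, so they are not merged.
d_ij, d_ib = ee_genkt_distances([10.0, 10.0, 0.0, 0.0], [10.0, 0.0, 10.0, 0.0])
```

With p=-1 the energy weighting favours clustering around the hardest particles first, and a pair is only merged while its opening angle is smaller than R.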

Additionally, a set of variables describing the tau lifetime is calculated using the software on GitHub. As the tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.

In summary, the features found in the flat ntuples are:

 

Name Description
reco_cand_p4s 4-momenta per particle in the reco jet.
reco_cand_charge Charge per particle in the jet.
reco_cand_pdg PDGid per particle in the jet.
reco_jet_p4s RecoJet 4-momenta.
reco_cand_dz Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dz_err Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy_err Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
gen_jet_p4s GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3.
gen_jet_tau_decaymode Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used.
gen_jet_tau_p4s Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR<0.4, a fill value is used.

The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets, which are matched to generator-level τ leptons as well as to reconstructed jets. For a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to be within a cone of radius dR = 0.4 around the jet. The same applies to the reconstructed jet, with the requirement set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:

  • a binary flag isTau indicating whether the jet was matched to a generator-level hadronically decaying τ lepton. A gen_jet_tau_decaymode value of -1 indicates no match to a generator-level hadronically decaying τ.
  • the categorical decay mode of the τ, gen_jet_tau_decaymode, in terms of the number of generator-level charged and neutral hadrons. Possible gen_jet_tau_decaymode values are {0, 1, ..., 15}.
  • if matched, the visible (neglecting neutrinos), reconstructable pT of the τ lepton. This is inferred from gen_jet_tau_p4s.
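
The three targets can be derived from the per-jet columns roughly as follows. This is a sketch on toy NumPy arrays, not the authors' code: the -1 "no match" value comes from the description above, while the (px, py, pz, E) layout of gen_jet_tau_p4s and the zero placeholder row for the unmatched jet are assumptions made only for illustration.

```python
import numpy as np

# Toy per-jet values standing in for the real columns of the .parquet files.
# -1 marks a jet without a matched hadronically decaying generator-level tau.
gen_jet_tau_decaymode = np.array([0, -1, 10, 1])

# ASSUMED layout for illustration: one (px, py, pz, E) row per jet.
# The zero row is a placeholder; the real fill value is not stated in the record.
gen_jet_tau_p4s = np.array([
    [30.0, 40.0, 5.0, 50.3],
    [0.0, 0.0, 0.0, 0.0],
    [6.0, 8.0, 1.0, 10.1],
    [3.0, 4.0, 0.0, 5.0],
])

# Target 1: binary flag, True for jets matched to a hadronic tau.
is_tau = gen_jet_tau_decaymode != -1

# Target 2: decay mode, defined only for matched jets.
decay_mode = gen_jet_tau_decaymode[is_tau]

# Target 3: visible transverse momentum, pT = sqrt(px^2 + py^2), for matched jets.
vis_pt = np.hypot(gen_jet_tau_p4s[:, 0], gen_jet_tau_p4s[:, 1])[is_tau]
```

The mask is applied before using the decay mode and pT so that fill values never enter a classification or regression target.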

Contents:

  • qq_test.parquet
  • qq_train.parquet
  • zh_test.parquet
  • zh_train.parquet
  • z_test.parquet
  • z_train.parquet
  • data_intro.ipynb

Dataset characteristics

 

File                 # Jets        Size
z_test.parquet       460 382       101.00 MB
z_train.parquet      1 841 526     404.01 MB
zh_test.parquet      521 977       116.44 MB
zh_train.parquet     2 087 907     466.26 MB
qq_test.parquet      949 958       496.89 MB
qq_train.parquet     3 799 829     1.99 GB

The dataset consists of 6 .parquet files of about 3.6 GB in total.


How can you use these data?

The .parquet files can be loaded directly with the Awkward Array Python library.
An example of how one might use the dataset and its features is given in data_intro.ipynb.
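
As a starting point outside the notebook, a minimal loading sketch could look like the following. The helper names parquet_path and load_jets, and the data_dir argument, are hypothetical and only wrap the file naming scheme listed under Contents. ak.from_parquet is the Awkward Array reader, which needs pyarrow installed.

```python
from pathlib import Path

SAMPLES = ("z", "zh", "qq")
SPLITS = ("train", "test")


def parquet_path(sample, split, data_dir="."):
    """Build the path of one dataset file, e.g. z_train.parquet."""
    if sample not in SAMPLES or split not in SPLITS:
        raise ValueError(f"unknown sample/split: {sample!r}, {split!r}")
    return Path(data_dir) / f"{sample}_{split}.parquet"


def load_jets(sample, split, data_dir=".", columns=None):
    """Load one file as an Awkward Array, optionally reading only some columns."""
    # Imported here so that parquet_path stays usable without awkward installed.
    import awkward as ak

    return ak.from_parquet(str(parquet_path(sample, split, data_dir)), columns=columns)
```

For example, load_jets("z", "test", columns=["reco_jet_p4s", "gen_jet_tau_decaymode"]) would read only two of the features, which keeps memory use low for the larger qq files.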

Files

Files (3.6 GB)

Name                 Size        Checksum
data_intro.ipynb     37.9 kB     md5:3b7084eab4a90f27416dd9a18241eff5
qq_test.parquet      496.9 MB    md5:cff92a8b931dfef5b6e292f72fce5dc0
qq_train.parquet     2.0 GB      md5:47cb965126f657231a3ee6f28e99ea74
z_test.parquet       101.0 MB    md5:3ba6ac8d28545b81aefd3f62e5db337b
z_train.parquet      404.0 MB    md5:8b08452946cb41536a617410957a74e4
zh_test.parquet      116.4 MB    md5:f5ae32584c76f3987ab90386f052c606
zh_train.parquet     466.3 MB    md5:31eec75b8d628f3e7e6e884854f59dad
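
The md5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses only the Python standard library. The function names are illustrative, and the file is read in chunks so that the 2 GB file does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Calling matches_checksum with the path of a downloaded file and its md5 string from the list returns True when the file is intact.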

Additional details

Software

Repository URL
https://github.com/HEP-KBFI/ml-tau-en-reg
Development Status
Active