Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons
Description
Data description
MC Simulation
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 using the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework, interfaced with Key4HEP. Particle candidates in the reconstructed events are built using the PandoraPF algorithm.
In this version of the dataset no γγ -> hadrons background is included.
Samples
This dataset contains e+e- samples with Z->ττ, ZH->Zττ and Z->qq events, with approximately 2 million events simulated in each category.
The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:
- p8_ee_qq_ecm380 [Z -> qq events]
- p8_ee_ZH_Htautau [ZH -> Ztautau]
- p8_ee_Z_Ztautau_ecm380 [Z -> tautau]
The .root files from the MC simulation chain are then processed by the software available on GitHub (see the repository URL under Software) to create flat ntuples as the final product.
Features
The ntuples are based on the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, a charge and a particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for genJets and 5 GeV for reconstructed jets. The dataset contains the four-momenta of the jets, together with the PF candidates in each jet and their properties listed above.
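To make the clustering parameters concrete, the sketch below evaluates the distance measures of the generalized kt algorithm for ee collisions. It assumes the convention documented for FastJet, d_ij = min(E_i^2p, E_j^2p) (1 - cos θ_ij) / (1 - cos R) and d_iB = E_i^2p; the text above does not state which implementation was used, and the function names are hypothetical.

```python
import math


def ee_genkt_dij(e_i, e_j, cos_theta_ij, p=-1.0, radius=0.4):
    """Pairwise distance between particles i and j.

    Assumed convention (FastJet-style ee generalized kt):
    d_ij = min(E_i^2p, E_j^2p) * (1 - cos(theta_ij)) / (1 - cos(R)).
    """
    energy_factor = min(e_i ** (2 * p), e_j ** (2 * p))
    return energy_factor * (1.0 - cos_theta_ij) / (1.0 - math.cos(radius))


def ee_genkt_dib(e_i, p=-1.0):
    """Beam distance of particle i: d_iB = E_i^2p (same assumed convention)."""
    return e_i ** (2 * p)


# Two particles separated by exactly the opening angle R sit on the
# boundary where the pairwise distance equals the smaller beam distance.
d_pair = ee_genkt_dij(10.0, 20.0, math.cos(0.4))
d_beam = min(ee_genkt_dib(10.0), ee_genkt_dib(20.0))
```

With p=-1 the energy factor is set by the more energetic particle of the pair, so clustering grows outwards from hard particles, as in the anti-kt algorithm.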
Additionally, a set of variables describing the tau lifetime is calculated using the software on GitHub. As the tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.
In summary, the features found in the flat ntuples are:
Name | Description |
reco_cand_p4s | 4-momenta per particle in the reco jet. |
reco_cand_charge | Charge per particle in the jet. |
reco_cand_pdg | PDGid per particle in the jet. |
reco_jet_p4s | RecoJet 4-momenta. |
reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR<0.4, a fill value is used. |
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets, which are matched to generator-level τ leptons as well as to reconstructed jets. For a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to lie within a cone of dR < 0.4 around the jet. The same applies to the reconstructed jet, with the requirement set to dR < 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:
- a binary flag isTau, set if the jet was matched to a generator-level hadronically decaying τ lepton. A gen_jet_tau_decaymode value of -1 indicates no match to a generator-level hadronically decaying τ.
- the categorical decay mode of the τ, gen_jet_tau_decaymode, in terms of the number of generator-level charged and neutral hadrons. Possible gen_jet_tau_decaymode values are {0, 1, ..., 15}.
- if matched, the visible (neglecting neutrinos), reconstructable pT of the τ lepton. This is inferred from gen_jet_tau_p4s.
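The three targets listed above can be sketched for a single jet as follows. This is a minimal illustration, not the code used to build the dataset: the helper names are hypothetical, and the visible τ four-momentum is assumed to be available as Cartesian components (px, py), which may differ from how gen_jet_tau_p4s is actually stored in the files.

```python
import math

# Per the text: gen_jet_tau_decaymode == -1 means the jet has no match
# to a generator-level hadronically decaying tau.
NO_MATCH = -1


def is_tau(decay_mode):
    """Binary target: True when the jet is matched to a hadronic tau."""
    return decay_mode != NO_MATCH


def visible_pt(px, py):
    """Transverse momentum from Cartesian components (assumed layout)."""
    return math.hypot(px, py)


def jet_targets(decay_mode, tau_px, tau_py):
    """Collect the three targets for one jet; unmatched jets get None."""
    matched = is_tau(decay_mode)
    return {
        "isTau": matched,
        "decay_mode": decay_mode if matched else None,
        "tau_vis_pt": visible_pt(tau_px, tau_py) if matched else None,
    }


matched_jet = jet_targets(0, 3.0, 4.0)
unmatched_jet = jet_targets(NO_MATCH, 0.0, 0.0)
```

Returning None for unmatched jets keeps the fill values out of any regression or classification loss; a masked array would serve the same purpose when working on whole columns.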
Contents:
- qq_test.parquet
- qq_train.parquet
- zh_test.parquet
- zh_train.parquet
- z_test.parquet
- z_train.parquet
- data_intro.ipynb
Dataset characteristics
File | # Jets | Size |
z_test.parquet | 460 382 | 101.00 MB |
z_train.parquet | 1 841 526 | 404.01 MB |
zh_test.parquet | 521 977 | 116.44 MB |
zh_train.parquet | 2 087 907 | 466.26 MB |
qq_test.parquet | 949 958 | 496.89 MB |
qq_train.parquet | 3 799 829 | 1.99 GB |
The dataset consists of 6 files of 3.4 GB in total.
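The jet counts in the table above imply the train/test split used for each sample. The short check below computes the held-out fraction per sample from those counts; the variable and function names are only illustrative.

```python
# Jet counts copied from the "Dataset characteristics" table.
JET_COUNTS = {
    "z": {"train": 1_841_526, "test": 460_382},
    "zh": {"train": 2_087_907, "test": 521_977},
    "qq": {"train": 3_799_829, "test": 949_958},
}


def held_out_fraction(counts):
    """Fraction of a sample's jets that ended up in the test file."""
    return counts["test"] / (counts["train"] + counts["test"])


fractions = {name: held_out_fraction(c) for name, c in JET_COUNTS.items()}
total_jets = sum(c["train"] + c["test"] for c in JET_COUNTS.values())
```

For all three samples the fraction comes out at 0.200 to three decimal places, which is consistent with an 80/20 train/test split.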
How can you use these data?
The .parquet files can be directly loaded with the Awkward Array Python library.
An example of how one might use the dataset and its features is given in data_intro.ipynb.
Files
(3.6 GB)
Name | Size | MD5 checksum |
data_intro.ipynb | 37.9 kB | 3b7084eab4a90f27416dd9a18241eff5 |
qq_test.parquet | 496.9 MB | cff92a8b931dfef5b6e292f72fce5dc0 |
qq_train.parquet | 2.0 GB | 47cb965126f657231a3ee6f28e99ea74 |
z_test.parquet | 101.0 MB | 3ba6ac8d28545b81aefd3f62e5db337b |
z_train.parquet | 404.0 MB | 8b08452946cb41536a617410957a74e4 |
zh_test.parquet | 116.4 MB | f5ae32584c76f3987ab90386f052c606 |
zh_train.parquet | 466.3 MB | 31eec75b8d628f3e7e6e884854f59dad |
Additional details
Software
- Repository URL: https://github.com/HEP-KBFI/ml-tau-en-reg
- Development Status: Active