Dataset of paper "Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks"
Description
Scalable Multi-Task Learning for Particle Collision Event Reconstruction with HGNNs Dataset
The full description can also be found in README.md.
The dataset was used in the paper “Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks”:
https://arxiv.org/abs/2504.21844
This paper presents a scalable heterogeneous graph neural network (HGNN) with integrated pruning layers, which jointly determines whether tracks originate from the decay of beauty hadrons and associates each track with a proton-proton collision point known as a primary vertex (PV).
For training HGNNs and GNNs on the dataset, see the associated GitHub repository:
https://github.com/willsutcliffe/scalable_mtl_hgnn
Generated events
The events in this dataset are based on simulation generated with PYTHIA and EvtGen, in which the particle-collision conditions expected for the LHC Run 3 are replicated as shown in the table.
| LHCb period | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons |
|---|---|---|---|---|
| Runs 3-4 (Upgrade I) | ~5 | ~150 | < 1 | ~1 |
Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described fully in appendix A of https://arxiv.org/pdf/2304.08610. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.
Datasets
The datasets are divided into three categories:
Inclusive training and validation
The compressed file inclusive_training_validation_dataset.tar.gz contains the training dataset (40,000 events) and the validation dataset (10,000 events) of inclusive decays.
Inclusive test
The inclusive dataset inclusive_test_dataset.tar.gz contains the evaluation events (10,000).
Exclusive test and training
We provide samples of 5,000 events in which one b hadron is required to decay via a specific mode (an exclusive decay). For certain exclusive decay modes, we separate the 5,000 events into a 1,000-event training set and a 4,000-event test set for the training of the HGNN (H2) in the paper.
Exclusive decays include:
- Bd_DD_dataset.tar.gz
- Bd_Kstmumu_dataset.tar.gz
- Bd_Kpi_dataset.tar.gz
- Bu_Kmumu_dataset.tar.gz
- Bu_Kpipimumu_dataset.tar.gz
- Bu_KKpi_dataset.tar.gz
- Lb_Lcpi_dataset.tar.gz
- Lb_pK_dataset.tar.gz
- Lb_pKmumu_dataset.tar.gz
- Bs_Dspi_dataset.tar.gz
- Bs_Jpsiphi_dataset.tar.gz
The datasets provided here overlap in some cases with our previous dataset for the Deep Full Event Interpretation, available at:
https://zenodo.org/records/7799170
which provides several of the datasets in .root format with more information available. Here, we provide the data in a format better suited to training GNNs and HGNNs in PyTorch with our latest framework.
Data format
The relevant features used in the HGNN are described in the following. A Cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented.
Events are stored in a graph format in files named input_.npy, where each numbered input file represents a unique event. The LCAG (Lowest Common Ancestor Generation) edge targets are contained in the files target_.npy.
In the input files, the following graph data is stored in a dictionary format:
node features
Node features are stored under the key 'nodes' as a NumPy array of shape (n_nodes, 13) and include, in index order:
- Ox, Oy, Oz: cartesian coordinates of the origin point of the particle.
- px, py, pz: cartesian coordinates of the three-momentum.
- PVᴵᴾx, PVᴵᴾy, PVᴵᴾz: cartesian coordinates of the position of the associated reconstructed primary vertex based on a minimum impact parameter. This is only used for training of homogeneous GNNs.
- Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.
- PVx, PVy, PVz: cartesian coordinates of the position of the reconstructed primary vertex corresponding to the true associated PV. These reconstructed PVs are used for PV nodes within a heterogeneous graph representation. Additionally, they are used to determine the edge-level target for track-to-PV edges.
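The column layout above can be made explicit with named slices. The following is a minimal sketch that assumes the 13 columns appear exactly in the listed index order; the names `NODE_COLUMNS` and `split_node_features` are illustrative and not part of the dataset or the repository, and a synthetic array stands in for `graph_input_features["nodes"]`.

```python
import numpy as np

# Column layout assumed from the index order listed above (hypothetical names).
NODE_COLUMNS = {
    "origin": slice(0, 3),      # Ox, Oy, Oz
    "momentum": slice(3, 6),    # px, py, pz
    "pv_min_ip": slice(6, 9),   # PV associated by minimum impact parameter
    "charge": slice(9, 10),     # q
    "pv_true": slice(10, 13),   # reconstructed PV matching the true associated PV
}


def split_node_features(nodes):
    """Return a dict of named views into an (n_nodes, 13) node-feature array."""
    nodes = np.asarray(nodes)
    if nodes.ndim != 2 or nodes.shape[1] != 13:
        raise ValueError(f"expected shape (n_nodes, 13), got {nodes.shape}")
    return {name: nodes[:, cols] for name, cols in NODE_COLUMNS.items()}


# Synthetic stand-in for graph_input_features["nodes"]
rng = np.random.default_rng(0)
example_nodes = rng.normal(size=(5, 13))
parts = split_node_features(example_nodes)
print({name: block.shape for name, block in parts.items()})
```

Because the slices return views rather than copies, the split adds no memory overhead per event.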
edge features
Edge features are stored under the key 'edges' as a NumPy array of shape (n_edges, 4) and include, in index order:
- Opening angle (θ): angle between the three-momentum directions of the two particles.
- Momentum-transverse distance (d ⊥ P⃗): distance between the origin points of the two particles, measured in the plane transverse to the combined three-momentum of the two particles.
- Distance along the beam axis (Δz): difference between the z-coordinates of the origin points of the two particles.
- FromSamePV_MinIP: a reconstructed boolean variable indicating whether the two particles share the same associated primary vertex according to the minimum impact parameter.
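The three geometric edge features can be reproduced from the origin points and three-momenta of a pair of particles. The sketch below follows the verbal definitions above; the function names are hypothetical, and the sign convention for Δz (second particle minus first) and any normalisation applied in the stored files are assumptions that should be checked against the data.

```python
import numpy as np


def opening_angle(p1, p2):
    """Angle (radians) between two three-momentum vectors."""
    cos_theta = np.sum(p1 * p2, axis=-1) / (
        np.linalg.norm(p1, axis=-1) * np.linalg.norm(p2, axis=-1)
    )
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))


def transverse_distance(o1, o2, p1, p2):
    """Distance between origin points in the plane transverse to p1 + p2.

    Undefined when p1 + p2 is the zero vector.
    """
    displacement = o2 - o1
    p_sum = p1 + p2
    unit = p_sum / np.linalg.norm(p_sum, axis=-1, keepdims=True)
    # Remove the component of the displacement parallel to the combined momentum.
    parallel = np.sum(displacement * unit, axis=-1, keepdims=True) * unit
    return np.linalg.norm(displacement - parallel, axis=-1)


def delta_z(o1, o2):
    """Difference of origin z-coordinates (assumed order: second minus first)."""
    return o2[..., 2] - o1[..., 2]


# Toy pair: both momenta along z, origins separated by (3, 0, 4).
o1, o2 = np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 4.0])
p1, p2 = np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])
theta = opening_angle(p1, p2)
d_perp = transverse_distance(o1, o2, p1, p2)
dz = delta_z(o1, o2)
```

All three functions broadcast over leading dimensions, so they can be applied to arrays of shape (n_edges, 3) built by indexing the node features with the sender and receiver indices.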
track edge relations
The keys 'senders' and 'receivers' yield the NumPy arrays of sender and receiver node indices for track-to-track edges.
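Many graph libraries expect edges as a single integer array of shape (2, n_edges). The following sketch stacks the two index arrays into that layout and checks that every index refers to an existing node; `build_edge_index` is a hypothetical helper, and the toy arrays stand in for `graph_input_features["senders"]` and `graph_input_features["receivers"]`.

```python
import numpy as np


def build_edge_index(senders, receivers, n_nodes):
    """Stack sender/receiver indices into a (2, n_edges) int64 array."""
    senders = np.asarray(senders, dtype=np.int64)
    receivers = np.asarray(receivers, dtype=np.int64)
    if senders.shape != receivers.shape or senders.ndim != 1:
        raise ValueError("senders and receivers must be 1-D arrays of equal length")
    edge_index = np.stack([senders, receivers])
    if edge_index.size and (edge_index.min() < 0 or edge_index.max() >= n_nodes):
        raise ValueError("edge index refers to a node outside [0, n_nodes)")
    return edge_index


# Toy graph with three nodes and three directed edges.
edge_index = build_edge_index([0, 0, 1], [1, 2, 2], n_nodes=3)
print(edge_index.shape)
```

The resulting array has the layout that PyTorch Geometric uses for `edge_index`, so it can be converted with `torch.from_numpy` if that library is used downstream.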
global features
- The number of unique reconstructed PVs per graph is an additional global feature for the HGNN
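One way to obtain the PV nodes, the PV count, and a track-to-PV target from the node features is to deduplicate the last three node columns. This is only a sketch under stated assumptions: it takes columns 10 to 12 as PVx, PVy, PVz (per the node-feature order above), matches PVs by exact coordinate equality, and uses a hypothetical function name; the repository may derive these quantities differently.

```python
import numpy as np


def pv_nodes_and_track_targets(nodes):
    """Derive unique PV positions and a one-hot track-to-PV target matrix.

    Returns (unique_pvs, track_to_pv, target) where target[i, j] == 1 if
    track i is associated with PV j.
    """
    pv_true = np.asarray(nodes)[:, 10:13]
    unique_pvs, track_to_pv = np.unique(pv_true, axis=0, return_inverse=True)
    track_to_pv = track_to_pv.reshape(-1)  # flatten for consistency across NumPy versions
    target = np.zeros((len(pv_true), len(unique_pvs)), dtype=np.int64)
    target[np.arange(len(pv_true)), track_to_pv] = 1
    return unique_pvs, track_to_pv, target


# Synthetic event: four tracks, two distinct PVs along the beam axis.
toy_nodes = np.zeros((4, 13))
toy_nodes[:, 12] = [1.0, 5.0, 1.0, 5.0]
unique_pvs, track_to_pv, target = pv_nodes_and_track_targets(toy_nodes)
n_pvs = len(unique_pvs)  # candidate value for the global feature
```

Each row of `target` contains exactly one non-zero entry, which marks the track-to-PV edge that would be labelled as true.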
targets
In the target files, the Lowest Common Ancestor Generation (LCAG) edge targets are stored in a one-hot encoded format under the key 'edges', as a NumPy array of shape (n_edges, 4), where 4 refers to the four LCAG classes (0, 1, 2, 3).
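For loss functions or metrics that expect integer labels, the one-hot rows can be converted to class indices. A minimal sketch, with a small synthetic array standing in for `graph_target["edges"]` and a hypothetical helper name:

```python
import numpy as np


def lcag_classes(one_hot_edges):
    """Convert (n_edges, 4) one-hot LCAG targets to integer classes 0..3."""
    one_hot_edges = np.asarray(one_hot_edges)
    if one_hot_edges.ndim != 2 or one_hot_edges.shape[1] != 4:
        raise ValueError(f"expected shape (n_edges, 4), got {one_hot_edges.shape}")
    return one_hot_edges.argmax(axis=1)


# Synthetic stand-in for graph_target["edges"]
example_targets = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
classes = lcag_classes(example_targets)
class_counts = np.bincount(classes, minlength=4)  # edges per LCAG class
```

The per-class counts are a quick way to inspect class imbalance in an event before training.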
Additional truth information for performance
For determining the reconstruction performance metrics in the papers, additional truth information is required, including LCAG mother particle identification numbers (MIDs) and particle identification numbers (IDs).
For the test datasets we include the following information:
- part_ids: particle IDs for every track after loose preselection
- ids: the mother particle IDs for particles belonging to beauty decay chains after loose preselection
- init_part_ids: particle IDs for every track before loose preselection
- init_ids: the mother particle IDs for particles belonging to beauty decay chains before loose preselection
- init_y: LCAG target values before loose preselection
- truth_part_ids: particle IDs only for particles belonging to beauty decay chains
- truth_ids: mother IDs only for particles belonging to beauty decay chains
- truth_senders: sender nodes of edges (only for particles belonging to beauty hadrons)
- truth_receivers: receiver nodes of edges (only for particles belonging to beauty hadrons)
- truth_y: LCAG target values (only for particles belonging to beauty hadrons)
- lca_chain: truth full chain LCA values used only for determining max chain depth
Loading data
```python
import numpy as np

# Load graph features and LCAG edge targets for event 0
graph_input_features = np.load("input_0.npy", allow_pickle=True).item()
graph_target = np.load("target_0.npy", allow_pickle=True).item()
```
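To iterate over many events without the repository's loaders, the input and target files can be paired by their event number. The sketch below assumes that an extracted archive contains files named `input_<n>.npy` and `target_<n>.npy` in one directory, as in the single-event example; the directory layout and the helper `load_events` are assumptions, not something the dataset documents.

```python
import re
from pathlib import Path

import numpy as np

INPUT_NAME = re.compile(r"input_(\d+)\.npy$")


def load_events(directory):
    """Return a list of (event_number, graph_dict, target_dict), sorted by number.

    allow_pickle=True is needed because each file stores a pickled dict;
    only use it on files from a trusted source.
    """
    directory = Path(directory)
    numbered = []
    for path in directory.glob("input_*.npy"):
        match = INPUT_NAME.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    events = []
    for number, input_path in sorted(numbered):
        target_path = directory / f"target_{number}.npy"
        if not target_path.exists():
            continue  # skip inputs that have no matching target file
        graph = np.load(input_path, allow_pickle=True).item()
        target = np.load(target_path, allow_pickle=True).item()
        events.append((number, graph, target))
    return events
```

Sorting by the parsed integer rather than by file name keeps event 2 ahead of event 10.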
We provide functionality to load the datasets with PyTorch Geometric data loaders in the GitHub repository:
https://github.com/willsutcliffe/scalable_mtl_hgnn
Files (19.2 GB)

| MD5 checksum | Size |
|---|---|
| 1dc14e0ffb1deafd21b274dfb43d505b | 919.8 MB |
| 48c9bd04ba91f8d715b65212790a26d9 | 902.9 MB |
| b97620d5c5833caeb8de82c00f204544 | 920.5 MB |
| 1d7346f705a4882cb3466093915b3cb8 | 918.9 MB |
| 2205dc8cc488e5da2deae1c956ae25ea | 910.7 MB |
| 6ef7b411282d762077af36bf37d7270d | 909.8 MB |
| b51ef54924f1b6499db235f0fc2d1557 | 904.7 MB |
| 8d3c243765faff46ff4821c6fa6859da | 934.3 MB |
| 3ea0cd06b74d358784d911ad45495508 | 1.6 GB |
| fc90b8012c6f2be98a22503d306f8655 | 7.5 GB |
| e0532463029cf6ce0a741890d8ea925e | 908.6 MB |
| 80edca4a74c76a580067094e52b1263f | 875.5 MB |
| bc6901997e784a4b3c4a25e316cc68b9 | 920.5 MB |
| b594205a40cc13416d03c823d6fb56ca | 4.9 kB |
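The MD5 checksums listed above can be used to confirm that a download completed without corruption. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the archive path in the commented usage line is a placeholder to be replaced by the file actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 matches `expected` (an 'md5:' prefix is allowed)."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute the checksum listed for that file):
# ok = verify_md5("path/to/downloaded_archive.tar.gz", "md5:<checksum>")
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte archives.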
Additional details
Related works
- Cites
- Publication: 10.1007/s41781-023-00107-8 (DOI)
- Is supplement to
- Publication: arXiv:2504.21844 (arXiv)
References
- García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Comput. Softw. Big Sci. 7 12
- García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions" URL https://doi.org/10.5281/zenodo.7799170