Dataset of paper "Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks"

Sutcliffe, William

doi:10.5281/zenodo.15584745

Published June 10, 2025 | Version v1

Dataset Open

Sutcliffe, William (Contact person)¹

1. University of Zurich

Contributors

Contact person (7):

1. University of Zurich
2. Massachusetts Institute of Technology
3. Syracuse University
4. Istituto Nazionale di Fisica Nucleare, Sezione di Milano Bicocca
5. European Organization for Nuclear Research

Scalable Multi-Task Learning for Particle Collision Event Reconstruction with HGNNs Dataset

The full description can also be found in README.md.

The dataset was used in the paper “Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks”:

https://arxiv.org/abs/2504.21844

This paper presents a scalable heterogeneous graph neural network (HGNN) with integrated pruning layers, which jointly determines whether tracks originate from the decay of beauty hadrons and associates each track with a proton-proton collision point known as a primary vertex (PV).

For training HGNNs and GNNs on the dataset, see the associated GitHub repository:

https://github.com/willsutcliffe/scalable_mtl_hgnn

Generated events

The events in this dataset are based on simulation generated with PYTHIA and EvtGen, which replicates the particle-collision conditions expected for LHC Run 3, as shown in the table.

LHCb period	Num. vis. pp collisions	Num. tracks	Num. b hadrons	Num. c hadrons
Runs 3-4 (Upgrade I)	~5	~150	< 1	~1

Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described fully in appendix A of https://arxiv.org/pdf/2304.08610. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.

Datasets

The datasets are divided into three categories:

Inclusive training and validation

The compressed file inclusive_training_validation_dataset.tar.gz contains the training dataset (40,000 events) and the validation dataset (10,000 events) of inclusive decays.

Inclusive test

The inclusive dataset inclusive_test_dataset.tar.gz contains the evaluation events (10,000).

Exclusive test and training

We provide samples of 5,000 events in which one b hadron is required to decay via a specific mode (an exclusive decay). For certain exclusive decay modes, we separate the 5,000 events into a 1,000-event training set and a 4,000-event test set for the training of the HGNN (H2) in the paper.

Exclusive decays include:

Bd_DD_dataset.tar.gz
Bd_Kstmumu_dataset.tar.gz
Bd_Kpi_dataset.tar.gz
Bu_Kmumu_dataset.tar.gz
Bu_Kpipimumu_dataset.tar.gz
Bu_KKpi_dataset.tar.gz
Lb_Lcpi_dataset.tar.gz
Lb_pK_dataset.tar.gz
Lb_pKmumu_dataset.tar.gz
Bs_Dspi_dataset.tar.gz
Bs_Jpsiphi_dataset.tar.gz

The datasets provided here overlap in some cases with our previous dataset for Deep Full Event Interpretation, available at:

https://zenodo.org/records/7799170

That record provides several of the datasets in .root format with more information available. Here, we provide the data in a format better suited to training GNNs and HGNNs in PyTorch with our latest framework.

Data format

The relevant features used in the HGNN are described below. A right-handed Cartesian coordinate system is used, with the z axis pointing along the beamline, the x axis parallel to the horizontal, and the y axis vertically oriented.

Events are stored in a graph format in files named input_.npy (for example, input_0.npy), where each numbered input file represents a unique event. The LCAG (Lowest Common Ancestor Generation) edge targets are stored in the corresponding files named target_.npy (for example, target_0.npy).

In the input files, the following graph data is stored in a dictionary format:

node features

Node features are stored under the key 'nodes' as a NumPy array of shape (n_nodes, 13) and include, in index order:

O_x, O_y, O_z: Cartesian coordinates of the origin point of the particle.
p_x, p_y, p_z: Cartesian components of the three-momentum.
PVᴵᴾ_x, PVᴵᴾ_y, PVᴵᴾ_z: Cartesian coordinates of the position of the associated reconstructed primary vertex based on a minimum impact parameter. These are only used for training of homogeneous GNNs.
Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.
PV_x, PV_y, PV_z: Cartesian coordinates of the position of the reconstructed primary vertex corresponding to the true associated PV. These reconstructed PVs are used for PV nodes within a heterogeneous graph representation. Additionally, they are used to determine the edge-level target for track-to-PV edges.
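The index order listed above can be captured as named column slices. The following is a minimal sketch rather than code from the associated repository: the names NODE_COLUMNS and split_node_features are hypothetical, and the array used at the end is synthetic stand-in data with the documented shape.

```python
import numpy as np

# Column layout of the (n_nodes, 13) 'nodes' array, following the documented
# index order. The dictionary keys are illustrative names, not dataset fields.
NODE_COLUMNS = {
    "origin": slice(0, 3),      # O_x, O_y, O_z
    "momentum": slice(3, 6),    # p_x, p_y, p_z
    "pv_min_ip": slice(6, 9),   # PV^IP_x, PV^IP_y, PV^IP_z
    "charge": slice(9, 10),     # q
    "pv_true": slice(10, 13),   # PV_x, PV_y, PV_z
}


def split_node_features(nodes):
    """Return a dict of named views into a (n_nodes, 13) node-feature array."""
    nodes = np.asarray(nodes)
    if nodes.ndim != 2 or nodes.shape[1] != 13:
        raise ValueError(f"expected shape (n_nodes, 13), got {nodes.shape}")
    return {name: nodes[:, cols] for name, cols in NODE_COLUMNS.items()}


# Synthetic stand-in for graph_input_features['nodes'] with 4 tracks.
example_nodes = np.arange(4 * 13, dtype=float).reshape(4, 13)
parts = split_node_features(example_nodes)
print({name: block.shape for name, block in parts.items()})
```

With a real event, the same function would be applied to the array stored under the 'nodes' key of an input file.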

edge features

Edge features are stored under the key 'edges' as a NumPy array of shape (n_edges, 4) and include, in index order:

Opening angle (θ): angle between the three-momentum directions of the two particles.
Momentum-transverse distance (d_⊥P⃗): distance between the origin points of the two particles, measured in the plane transverse to the combined three-momentum of the two particles.
Distance along the beam axis (Δ_z): difference between the z coordinates of the origin points of the two particles.
FromSamePV_MinIP: a reconstructed boolean variable indicating whether the two particles share the same associated primary vertex according to the minimum impact parameter.
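To make the three geometric edge definitions concrete, the sketch below computes them for a single pair of particles from their origin points and momenta. It is an illustration of the definitions as worded here, not the code used to produce the dataset: the function name pair_features is hypothetical, and the sign convention of Δ_z and the units are assumptions.

```python
import numpy as np


def pair_features(o1, p1, o2, p2):
    """Opening angle, momentum-transverse distance and delta-z for one pair."""
    o1, p1, o2, p2 = (np.asarray(v, dtype=float) for v in (o1, p1, o2, p2))
    # Opening angle between the two momentum directions.
    cos_theta = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Part of the origin separation that is transverse to the summed momentum.
    # Undefined if the two momenta cancel exactly (zero summed momentum).
    total_p = p1 + p2
    unit_p = total_p / np.linalg.norm(total_p)
    delta_o = o2 - o1
    d_perp = np.linalg.norm(delta_o - np.dot(delta_o, unit_p) * unit_p)
    # Difference of the z coordinates (sign convention is an assumption).
    delta_z = o2[2] - o1[2]
    return theta, d_perp, delta_z


# Two particles whose summed momentum points along z and whose origin
# points are separated by (1, 0, 2).
theta, d_perp, delta_z = pair_features(
    o1=[0.0, 0.0, 0.0], p1=[1.0, 0.0, 1.0],
    o2=[1.0, 0.0, 2.0], p2=[-1.0, 0.0, 1.0],
)
print(theta, d_perp, delta_z)
```

In this example the momenta are perpendicular, the summed momentum lies along the beam axis, and only the x component of the origin separation contributes to the transverse distance.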

track edge relations

The keys 'senders' and 'receivers' yield the NumPy arrays of sender and receiver node indices for track-to-track edges.
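The sender and receiver arrays can be stacked into a single (2, n_edges) index array, which is the layout PyTorch Geometric expects for edge_index. The sketch below uses NumPy only; the helper name build_edge_index is hypothetical and the example indices are synthetic.

```python
import numpy as np


def build_edge_index(senders, receivers, n_nodes):
    """Stack sender/receiver indices into a validated (2, n_edges) array."""
    senders = np.asarray(senders, dtype=np.int64)
    receivers = np.asarray(receivers, dtype=np.int64)
    if senders.shape != receivers.shape or senders.ndim != 1:
        raise ValueError("senders and receivers must be 1-D arrays of equal length")
    edge_index = np.stack([senders, receivers])
    # Every index must refer to an existing node.
    if edge_index.size and (edge_index.min() < 0 or edge_index.max() >= n_nodes):
        raise ValueError("edge index refers to a node outside [0, n_nodes)")
    return edge_index


# Synthetic fully connected directed graph on 3 tracks.
edge_index = build_edge_index(
    senders=[0, 0, 1, 1, 2, 2],
    receivers=[1, 2, 0, 2, 0, 1],
    n_nodes=3,
)
out_degree = np.bincount(edge_index[0], minlength=3)
print(edge_index.shape, out_degree)
```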

global features

The number of unique reconstructed PVs per graph is an additional global feature for the HGNN.
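This description does not say whether the PV count is stored explicitly in the input files. One possible way to derive it, sketched below, is to take the unique rows of the PV position columns of the node array. This assumes that tracks sharing a PV carry exactly identical coordinates and that columns 10 to 12 (PV_x, PV_y, PV_z) are the relevant ones; the function name unique_pvs is hypothetical.

```python
import numpy as np


def unique_pvs(nodes, pv_columns=slice(10, 13)):
    """Return the unique PV positions and, per track, the index of its PV."""
    pv_positions = np.asarray(nodes)[:, pv_columns]
    positions, track_to_pv = np.unique(pv_positions, axis=0, return_inverse=True)
    # Flatten the inverse so the result is 1-D across NumPy versions.
    return positions, track_to_pv.reshape(-1)


# Synthetic node array: 5 tracks associated with 2 distinct PVs.
example_nodes = np.zeros((5, 13))
example_nodes[:, 12] = [1.0, 1.0, 5.0, 1.0, 5.0]  # PV_z of each track
pv_positions, track_to_pv = unique_pvs(example_nodes)
n_pvs = len(pv_positions)
print(n_pvs, track_to_pv)
```

The per-track index returned alongside the positions links each track to one PV, which is the kind of association a track-to-PV edge target encodes.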

targets

In the target files, the Lowest Common Ancestor Generation (LCAG) edge targets are stored in a one-hot encoded format under the key 'edges', as a NumPy array of shape (n_edges, 4), where 4 refers to the four LCAG classes (0, 1, 2, 3).
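The one-hot targets can be converted to integer class labels with an argmax over the class axis. The snippet below is a minimal sketch on synthetic data; the function name lcag_classes is hypothetical.

```python
import numpy as np


def lcag_classes(one_hot_edges):
    """Convert (n_edges, 4) one-hot LCAG targets to class indices 0..3."""
    one_hot = np.asarray(one_hot_edges)
    if one_hot.ndim != 2 or one_hot.shape[1] != 4:
        raise ValueError(f"expected shape (n_edges, 4), got {one_hot.shape}")
    return one_hot.argmax(axis=1)


# Synthetic stand-in for graph_target['edges'] with 5 edges.
example_targets = np.eye(4)[[0, 2, 2, 3, 0]]
classes = lcag_classes(example_targets)
class_counts = np.bincount(classes, minlength=4)
print(classes, class_counts)
```

Counting the classes per event in this way gives a quick view of how the edges are distributed across the four LCAG classes.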

Additional truth information for performance

To determine the reconstruction performance metrics in the papers, additional truth information is required, including LCAG mother particle identification numbers (MIDs) and particle identification numbers (IDs).

For the test datasets we include the following information:

part_ids: particle IDs for every track after loose preselection
ids: the mother particle IDs for particles belonging to beauty decay chains after loose preselection
init_part_ids: particle IDs for every track before loose preselection
init_ids: the mother particle IDs for particles belonging to beauty decay chains before loose preselection
init_y: LCAG target values before loose preselection
truth_part_ids: particle IDs only for particles belonging to beauty decay chains
truth_ids: mother IDs only for particles belonging to beauty decay chains
truth_senders: sender nodes of edges (only for particles belonging to beauty hadrons)
truth_receivers: receiver nodes of edges (only for particles belonging to beauty hadrons)
truth_y: LCAG target values (only for particles belonging to beauty hadrons)
lca_chain: truth full chain LCA values used only for determining max chain depth

Loading data

import numpy as np

# load graph features and LCAG edge targets for event 0 
graph_input_features = np.load("input_0.npy", allow_pickle=True).item()
graph_target = np.load("target_0.npy", allow_pickle=True).item()

We provide functionality to load the datasets with PyTorch Geometric data loaders in the GitHub repository:

https://github.com/willsutcliffe/scalable_mtl_hgnn
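For quick inspection without the repository's data loaders, a directory of extracted events can be iterated with plain NumPy. This is a sketch under stated assumptions: it presumes that the input and target files of a sample sit side by side in one directory and follow the input_0.npy / target_0.npy naming shown above. The layout inside each archive is not described here, and the function name iter_events is hypothetical.

```python
from pathlib import Path

import numpy as np


def iter_events(directory):
    """Yield (index, inputs, targets) for each event with both files present."""
    directory = Path(directory)
    # Sort numerically so that event 10 follows event 9, not event 1.
    indices = sorted(
        int(path.stem.split("_")[1]) for path in directory.glob("input_*.npy")
    )
    for index in indices:
        target_path = directory / f"target_{index}.npy"
        if not target_path.exists():
            continue  # skip events without a matching target file
        inputs = np.load(directory / f"input_{index}.npy", allow_pickle=True).item()
        targets = np.load(target_path, allow_pickle=True).item()
        yield index, inputs, targets
```

A typical use would be a loop such as `for index, inputs, targets in iter_events("path/to/extracted/sample"):`, where the path is a placeholder for wherever the archive was extracted.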

Files

Files (19.2 GB)

Name	MD5	Size
Bd_DD_dataset.tar.gz	1dc14e0ffb1deafd21b274dfb43d505b	919.8 MB
Bd_Kpi_dataset.tar.gz	48c9bd04ba91f8d715b65212790a26d9	902.9 MB
Bd_Kstmumu_dataset.tar.gz	b97620d5c5833caeb8de82c00f204544	920.5 MB
Bs_Dspi_dataset.tar.gz	1d7346f705a4882cb3466093915b3cb8	918.9 MB
Bs_Jpsiphi_dataset.tar.gz	2205dc8cc488e5da2deae1c956ae25ea	910.7 MB
Bu_KKpi_dataset.tar.gz	6ef7b411282d762077af36bf37d7270d	909.8 MB
Bu_Kmumu_dataset.tar.gz	b51ef54924f1b6499db235f0fc2d1557	904.7 MB
Bu_Kpipimumu_dataset.tar.gz	8d3c243765faff46ff4821c6fa6859da	934.3 MB
inclusive_test_dataset.tar.gz	3ea0cd06b74d358784d911ad45495508	1.6 GB
inclusive_training_validation_dataset.tar.gz	fc90b8012c6f2be98a22503d306f8655	7.5 GB
Lb_Lcpi_dataset.tar.gz	e0532463029cf6ce0a741890d8ea925e	908.6 MB
Lb_pK_dataset.tar.gz	80edca4a74c76a580067094e52b1263f	875.5 MB
Lb_pKmumu_dataset.tar.gz	bc6901997e784a4b3c4a25e316cc68b9	920.5 MB
README.md	b594205a40cc13416d03c823d6fb56ca	4.9 kB
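The MD5 checksums listed for each file can be used to check that a download completed intact. The following sketch uses only the Python standard library; the function names md5_of_file and verify_md5 are hypothetical, and str.removeprefix requires Python 3.9 or later.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected digest, with or without 'md5:'."""
    expected = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `verify_md5("Lb_pK_dataset.tar.gz", "md5:80edca4a74c76a580067094e52b1263f")` should return True for an intact copy of that archive, assuming it was downloaded to the current directory.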

Additional details

Cites: Publication: 10.1007/s41781-023-00107-8 (DOI)
Is supplement to: Publication: arXiv:2504.21844 (arXiv)

García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Comput. Softw. Big Sci. 7 12
García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions" URL https://doi.org/10.5281/zenodo.7799170
