Published June 10, 2025 | Version v1
Dataset Open

Dataset of paper "Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks"

  • 1. University of Zurich
  • 2. Massachusetts Institute of Technology
  • 3. Syracuse University
  • 4. Istituto Nazionale di Fisica Nucleare, Sezione di Milano Bicocca
  • 5. European Organization for Nuclear Research

Description

Scalable Multi-Task Learning for Particle Collision Event Reconstruction with HGNNs Dataset

The full description can also be found in README.md.

The dataset was used in the paper “Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks”:

https://arxiv.org/abs/2504.21844

This paper presents a scalable heterogeneous graph neural network (HGNN) with integrated pruning layers, which jointly determines whether tracks originate from the decay of beauty hadrons and associates each track with a proton-proton collision point known as a primary vertex (PV).

For training HGNNs and GNNs on the dataset, see the associated GitHub repository:

https://github.com/willsutcliffe/scalable_mtl_hgnn

 

Generated events

The events in this dataset are based on simulation generated with PYTHIA and EvtGen, in which the particle-collision conditions expected for LHC Run 3 are replicated, as shown in the table.

LHCb period             Num. vis. pp collisions   Num. tracks   Num. b hadrons   Num. c hadrons
Runs 3-4 (Upgrade I)    ~5                        ~150          < 1              ~1

Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described fully in appendix A of https://arxiv.org/pdf/2304.08610. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.

 

Datasets

The datasets are divided into three categories:

Inclusive training and validation

The compressed file inclusive_training_validation_dataset.tar.gz contains the training dataset (40,000 events) and the validation dataset (10,000 events) of inclusive decays.

Inclusive test

The inclusive dataset inclusive_test_dataset.tar.gz contains the evaluation events (10,000).

Exclusive test and training

We provide samples of 5,000 events in which one b-hadron is required to decay via a specific decay mode (an exclusive decay). For certain exclusive decay modes, we separate the 5,000 events into a 1,000-event training set and a 4,000-event test set for the training of the HGNN (H2) in the paper.

Exclusive decays include:

  • Bd_DD_dataset.tar.gz
  • Bd_Kstmumu_dataset.tar.gz
  • Bd_Kpi_dataset.tar.gz
  • Bu_Kmumu_dataset.tar.gz
  • Bu_Kpipimumu_dataset.tar.gz
  • Bu_KKpi_dataset.tar.gz
  • Lb_Lcpi_dataset.tar.gz
  • Lb_pK_dataset.tar.gz
  • Lb_pKmumu_dataset.tar.gz
  • Bs_Dspi_dataset.tar.gz
  • Bs_Jpsiphi_dataset.tar.gz

The datasets provided here overlap in some cases with our previous dataset for the Deep Full Event Interpretation at:

https://zenodo.org/records/7799170

which provides several of the datasets in .root format with more information available. Here, we provide the data in a format more amenable to training GNNs and HGNNs with PyTorch using our latest framework.

Data format

The relevant features used in the HGNN are described in the following. A Cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented.

Events are stored in graph format in files named input_.npy, where each numbered input file represents a unique event. The LCAG (Lowest Common Ancestor Generation) edge targets are stored in the corresponding files target_.npy.

In the input files, the following graph data are stored in a dictionary:

node features

Node features are stored under the key 'nodes' as a NumPy array of shape (n_nodes, 13) and include, in index order:

  • Ox, Oy, Oz: cartesian coordinates of the origin point of the particle.
  • px, py, pz: cartesian coordinates of the three-momentum. 
  • PVᴵᴾx, PVᴵᴾy, PVᴵᴾz: cartesian coordinates of the position of the associated reconstructed primary vertex based on a minimum impact parameter. This is only used for training of homogeneous GNNs. 
  • Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.
  • PVx, PVy, PVz: cartesian coordinates of the position of the reconstructed primary vertex corresponding to the true associated PV. These reconstructed PVs are used for PV nodes within a heterogeneous graph representation. They are also used to determine the edge-level target for track-to-PV edges.
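
As a minimal sketch of the column layout listed above, the snippet below splits a node-feature array into named blocks. The array is synthetic, and the names `NODE_COLUMNS` and `split_node_features` are illustrative only (they are not part of the dataset or the repository). The slices assume that the columns follow the listed order exactly.

```python
import numpy as np

# Column slices assumed from the index order listed above (hypothetical names).
NODE_COLUMNS = {
    "origin": slice(0, 3),      # Ox, Oy, Oz
    "momentum": slice(3, 6),    # px, py, pz
    "pv_min_ip": slice(6, 9),   # PV^IP x, y, z
    "charge": slice(9, 10),     # q
    "pv_true": slice(10, 13),   # PVx, PVy, PVz
}

def split_node_features(nodes):
    """Return a dict of named views into an (n_nodes, 13) node-feature array."""
    nodes = np.asarray(nodes)
    if nodes.ndim != 2 or nodes.shape[1] != 13:
        raise ValueError(f"expected shape (n_nodes, 13), got {nodes.shape}")
    return {name: nodes[:, cols] for name, cols in NODE_COLUMNS.items()}

# Synthetic stand-in for graph_input_features['nodes'] with 5 tracks.
rng = np.random.default_rng(0)
example_nodes = rng.normal(size=(5, 13))
blocks = split_node_features(example_nodes)

# Example derived quantity: transverse momentum from px and py.
pt = np.hypot(blocks["momentum"][:, 0], blocks["momentum"][:, 1])
```

With a real event, `example_nodes` would be replaced by the array stored under the 'nodes' key.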

 

edge features 

Edge features are stored under the key 'edges' as a NumPy array of shape (n_edges, 4) and include, in index order:

  • Opening angle (θ): angle between the three-momentum directions of the two particles.

  • Momentum-transverse distance (d ⊥ P⃗): distance between the origin points of the two particles, defined on a plane transverse to the combined three-momentum of the two particles.

  • Distance along the beam axis (Δz): difference between the z-coordinates of the origin points of the two particles.

  • FromSamePV_MinIP: a reconstructed boolean variable indicating whether the two particles share the same associated primary vertex according to minimum impact parameter.
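
The three geometric edge features can be illustrated with a short worked example. This is a sketch based only on the verbal definitions above. The function name `pair_features`, the use of radians, and the sign convention for Δz (second particle minus first) are assumptions, so the values need not match the stored edge features exactly.

```python
import numpy as np

def pair_features(o1, p1, o2, p2):
    """Opening angle, momentum-transverse distance, and delta-z for one track pair."""
    o1, p1, o2, p2 = (np.asarray(v, dtype=float) for v in (o1, p1, o2, p2))
    # Opening angle between the two three-momentum directions (radians).
    cos_theta = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Separation of the origins projected onto the plane transverse to p1 + p2.
    p_hat = (p1 + p2) / np.linalg.norm(p1 + p2)
    delta = o2 - o1
    d_perp = np.linalg.norm(delta - np.dot(delta, p_hat) * p_hat)
    # Difference of the origin z-coordinates (sign convention assumed).
    delta_z = o2[2] - o1[2]
    return theta, d_perp, delta_z

# Two tracks whose momenta are perpendicular and sum to a vector along z.
theta, d_perp, delta_z = pair_features(
    o1=[0.0, 0.0, 0.0], p1=[1.0, 0.0, 1.0],
    o2=[1.0, 0.0, 2.0], p2=[-1.0, 0.0, 1.0],
)
```

In this example the combined momentum points along z, so only the x-y part of the origin separation contributes to the transverse distance.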

track edge relations

The keys 'senders' and 'receivers' yield the NumPy arrays of sender and receiver node indices for tracks.
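
The sender and receiver arrays can be combined into a single edge-index array, as sketched below with synthetic values. The assumption here, not stated explicitly in this description, is that row k of the 'edges' array describes the pair (senders[k], receivers[k]).

```python
import numpy as np

# Synthetic stand-ins for graph_input_features['senders'] and ['receivers'].
senders = np.array([0, 0, 1, 1, 2, 2])
receivers = np.array([1, 2, 0, 2, 0, 1])

# Stack into a (2, n_edges) array, the COO layout commonly used by GNN libraries.
edge_index = np.stack([senders, receivers], axis=0)

# Positions of the edges whose sender is node 1 (assumed to index rows of 'edges').
edges_from_node_1 = np.flatnonzero(senders == 1)
```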

global features

  • The number of unique reconstructed PVs per graph is an additional global feature for the HGNN
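
One way to obtain such a count, together with a track-to-PV assignment, is sketched below. It assumes that the PVx, PVy, PVz columns (taken here to be columns 10 to 12 of 'nodes') are bit-identical for tracks sharing a PV. Whether the repository derives the global feature this way is not stated here, and the values are synthetic.

```python
import numpy as np

# Synthetic PVx, PVy, PVz columns for 4 tracks.
pv_xyz = np.array([
    [0.0, 0.1, 10.0],
    [0.0, 0.1, 10.0],
    [0.2, 0.0, -35.0],
    [0.0, 0.1, 10.0],
])

# Unique PV positions and, for each track, the index of its PV.
unique_pvs, track_to_pv = np.unique(pv_xyz, axis=0, return_inverse=True)
track_to_pv = track_to_pv.reshape(-1)  # the inverse shape differs between NumPy versions

# Number of unique reconstructed PVs in this graph.
n_pvs = unique_pvs.shape[0]
```

Here `track_to_pv` could serve as the receiver index of track-to-PV edges in a heterogeneous graph, with `unique_pvs` as the PV node positions.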

targets

In the target files, the Lowest Common Ancestor Generation (LCAG) edge targets are stored in one-hot encoded format under the key 'edges' as a NumPy array of shape (n_edges, 4), where 4 refers to the four LCAG classes (0, 1, 2, 3).
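
A one-hot target array can be converted to integer class labels as in the following sketch. The array is synthetic, and the snippet assumes that column k corresponds to LCAG class k.

```python
import numpy as np

# Synthetic stand-in for graph_target['edges'] with 3 edges.
one_hot = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
])

# The column index of the 1 in each row gives the LCAG class (0, 1, 2 or 3).
lcag_class = one_hot.argmax(axis=1)

# Per-class edge counts, useful for checking class imbalance.
class_counts = np.bincount(lcag_class, minlength=4)
```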

 

Additional truth information for performance

To determine the reconstruction performance metrics in the paper, additional truth information is required, including LCAG mother particle identification numbers (MIDs) and particle identification numbers (IDs).

For the test datasets we include the following information:

 

  • part_ids: particle IDs for every track after loose preselection
  • ids: the mother particle IDs for particles belonging to beauty decay chains after loose preselection
  • init_part_ids: particle IDs for every track before loose preselection
  • init_ids: the mother particle IDs for particles belonging to beauty decay chains before loose preselection
  • init_y: LCAG target values before loose preselection
  • truth_part_ids: particle IDs only for particles belonging to beauty decay chains
  • truth_ids: mother IDs only for particles belonging to beauty decay chains
  • truth_senders: sender nodes of edges (only for particles belonging to beauty hadrons)
  • truth_receivers: receiver nodes of edges (only for particles belonging to beauty hadrons)
  • truth_y: LCAG target values (only for particles belonging to beauty hadrons)
  • lca_chain: truth full chain LCA values used only for determining max chain depth

 

Loading data

import numpy as np

# load graph features and LCAG edge targets for event 0 
graph_input_features = np.load("input_0.npy", allow_pickle=True).item()
graph_target = np.load("target_0.npy", allow_pickle=True).item()
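
To iterate over many events rather than a single one, a small generator along the following lines may be convenient. It is a sketch, not the loader from the repository. The function name `load_events` is hypothetical, and the file naming (input_<n>.npy paired with target_<n>.npy in one directory) is assumed from the example above.

```python
import re
from pathlib import Path

import numpy as np

def load_events(directory, max_events=None):
    """Yield (event_number, inputs, targets) for each input file with a matching target."""
    directory = Path(directory)
    pattern = re.compile(r"input_(\d+)\.npy")
    matches = (pattern.fullmatch(p.name) for p in directory.glob("input_*.npy"))
    numbers = sorted(int(m.group(1)) for m in matches if m)
    yielded = 0
    for number in numbers:
        if max_events is not None and yielded >= max_events:
            break
        target_path = directory / f"target_{number}.npy"
        if not target_path.exists():
            continue  # skip events without a matching target file
        inputs = np.load(directory / f"input_{number}.npy", allow_pickle=True).item()
        targets = np.load(target_path, allow_pickle=True).item()
        yield number, inputs, targets
        yielded += 1
```

Note that `allow_pickle=True` unpickles arbitrary Python objects, so it should only be used on files from a trusted source.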

 

We provide functionality to load the datasets with PyTorch Geometric data loaders in the GitHub repository:

https://github.com/willsutcliffe/scalable_mtl_hgnn

 

Files

README.md

Files (19.2 GB)

Checksum                                Size
md5:1dc14e0ffb1deafd21b274dfb43d505b    919.8 MB
md5:48c9bd04ba91f8d715b65212790a26d9    902.9 MB
md5:b97620d5c5833caeb8de82c00f204544    920.5 MB
md5:1d7346f705a4882cb3466093915b3cb8    918.9 MB
md5:2205dc8cc488e5da2deae1c956ae25ea    910.7 MB
md5:6ef7b411282d762077af36bf37d7270d    909.8 MB
md5:b51ef54924f1b6499db235f0fc2d1557    904.7 MB
md5:8d3c243765faff46ff4821c6fa6859da    934.3 MB
md5:3ea0cd06b74d358784d911ad45495508    1.6 GB
md5:fc90b8012c6f2be98a22503d306f8655    7.5 GB
md5:e0532463029cf6ce0a741890d8ea925e    908.6 MB
md5:80edca4a74c76a580067094e52b1263f    875.5 MB
md5:bc6901997e784a4b3c4a25e316cc68b9    920.5 MB
md5:b594205a40cc13416d03c823d6fb56ca    4.9 kB
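
A downloaded archive can be checked against its listed MD5 checksum with the Python standard library, as in the sketch below. The helper names `md5_of_file` and `matches_checksum` are illustrative. The pairing of checksums with file names is not shown in this listing and has to be taken from the record page.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```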

Additional details

Related works

Cites
Publication: 10.1007/s41781-023-00107-8 (DOI)
Is supplement to
Publication: arXiv:2504.21844 (arXiv)

References

  • García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Comput. Softw. Big Sci. 7 12
  • García Pardiñas J, Calvi M, Eschle J, Mauri A, Meloni S, Mozzanica M and Serra N 2023 Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions" URL https://doi.org/10.5281/zenodo.7799170