Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions"
Authors/Creators
- 1. Universita di Milano Bicocca and INFN Sezione di Milano-Bicocca, Experimental Physics Department, European Organization for Nuclear Research (CERN)
- 2. Universita di Milano Bicocca and INFN Sezione di Milano-Bicocca
- 3. University of Zurich
- 4. Nikhef National Institute for Subatomic Physics, Imperial College London, South Kensington Campus
Description
DFEI dataset
The full description can also be found in README.md.
The dataset was used in the paper “GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions”. The project describes a full event interpretation at the LHCb experiment, situated at the Large Hadron Collider at CERN, Geneva. An “event” consists of detector responses that have been converted to tracks, where each track represents a particle.
The aim of the algorithm is to make sense of the tracks: to bundle together tracks coming from the same origin and to interpret their decay hierarchy.
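As a toy illustration of what bundling tracks means, the sketch below groups tracks by taking connected components over pairs that are already flagged as sharing an origin. This is not the DFEI algorithm, which is described in the paper; the function name `group_tracks` and the input layout are hypothetical and chosen only for this example.

```python
def group_tracks(track_keys, same_origin_pairs):
    """Group tracks into clusters, given pairs flagged as sharing an origin.

    Hypothetical helper for illustration only (a plain union-find),
    not the method used in the paper.
    """
    parent = {key: key for key in track_keys}

    def find(key):
        # Follow parent links to the representative, halving the path.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for first, second in same_origin_pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_first] = root_second

    clusters = {}
    for key in track_keys:
        clusters.setdefault(find(key), []).append(key)
    # Sort so that the result does not depend on the merge order.
    return sorted(sorted(cluster) for cluster in clusters.values())


# Five tracks: 0, 1 and 2 are linked pairwise, and so are 3 and 4.
groups = group_tracks([0, 1, 2, 3, 4], [(1, 0), (2, 1), (4, 3)])
print(groups)
```

Tracks that appear in no flagged pair end up in a cluster of their own.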
Generated events
The events in this dataset are based on simulation generated with PYTHIA8 and EvtGen, in which the particle-collision conditions expected for the LHC Run 3 are replicated as shown in the table.
| LHCb period | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons |
|---|---|---|---|---|
| Runs 3-4 (Upgrade I) | ∼ 5 | ∼ 150 | ≪ 1 | ∼ 1 |
Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described in the paper in the appendix “Simulation”. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.
Datasets
The datasets are divided into three categories:
Training and testing
The file Dataset_InclusiveHb_Training.root contains the training dataset (40,000 events) and the test dataset (10,000 events) of inclusive decays.
Evaluation
The inclusive dataset Dataset_InclusiveHb_Evaluation.root contains the evaluation events (50,000).
Exclusive decays
In addition to this inclusive dataset, several other smaller samples (of a few thousand events each) have also been generated, requiring that all the events in each sample contain a specific (exclusive) type of b-hadron decay. The specific modes have been chosen to be representative of the most common classes of decay topologies of physics interest for LHCb. These samples contain only events in which all the particles originating from each of the considered exclusive decays have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region.
The exclusive datasets are:
- Dataset_Bd_DD.root
- Dataset_Bd_Kpi.root
- Dataset_Bd_Kstmumu.root
- Dataset_Bs_Dspi.root
- Dataset_Bs_Jpsiphi.root
- Dataset_Bu_KKpi.root
- Dataset_Lb_Lcpi.root
More information on them can be found in the paper.
Loading the data
The dataset is saved in the binary ROOT format with a key-array mapping. It can be loaded using the uproot Python library to convert it to a pandas DataFrame or similar.
An example snippet is given here:
    import uproot

    # Choose which tree to read: "Particles" (nodes) or "Relations" (edges).
    # treename = "Particles"
    treename = "Relations"

    with uproot.open('/path/to/file.root') as file:
        df = file[treename].arrays(
            # we can specify only a set of branches
            # ['EventNumber', "FromSamePV_true"],
            library='pd')  # 'pd' for pandas
The opened file behaves like a mapping with two entries, Particles and Relations. Particles holds the particles themselves, and Relations holds the relations (edges) between pairs of particles.
Regarding the Relations, only edges connecting two different particles are contained in the dataset. The edges are treated as undirected, so a single edge is stored for each pair of particles.
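The relation between the two tables can be sketched with toy data. The example below assumes that the column names match the variable names listed in this description (EventNumber, ParticleKey, FirstParticleKey, SecondParticleKey); the names in the files may carry suffixes such as “_true” or “_reco”, and all values here are made up.

```python
from itertools import combinations

import pandas as pd

# Toy "Particles" table: one event with four particles (made-up values).
particles = pd.DataFrame({
    "EventNumber": [1, 1, 1, 1],
    "ParticleKey": [0, 1, 2, 3],
    "pT": [1.2, 0.8, 2.5, 0.4],
})

# Toy "Relations" table: one undirected edge per pair of distinct
# particles, stored with FirstParticleKey > SecondParticleKey.
pairs = [
    (max(a, b), min(a, b))
    for a, b in combinations(particles["ParticleKey"], 2)
]
relations = pd.DataFrame(pairs, columns=["FirstParticleKey", "SecondParticleKey"])
relations.insert(0, "EventNumber", 1)

# n particles give n * (n - 1) / 2 edges and no self-loops.
n = len(particles)
assert len(relations) == n * (n - 1) // 2
assert (relations["FirstParticleKey"] > relations["SecondParticleKey"]).all()

# Attach the node features of both endpoints to every edge.
# ParticleKey is only unique within an event, so EventNumber is part of the key.
edges = (
    relations
    .merge(
        particles.rename(columns={"ParticleKey": "FirstParticleKey", "pT": "pT_first"}),
        on=["EventNumber", "FirstParticleKey"],
    )
    .merge(
        particles.rename(columns={"ParticleKey": "SecondParticleKey", "pT": "pT_second"}),
        on=["EventNumber", "SecondParticleKey"],
    )
)
print(edges)
```

With real data, the two toy tables would be replaced by the DataFrames read from the Particles and Relations trees.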
Variables
The relevant features used in the GNN are described in the following. A Cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented. When specified in the name of the variables, the suffix “_true” refers to ground-truth information, and the suffix “_reco” refers to the output of the emulated LHCb reconstruction.
- General:
  - EventNumber: unique number to identify the event that the entry belongs to.
- Node variables:
  - ParticleKey: unique number to identify each particle in a given event.
  - Identity (ID): numerical code identifying the type of particle, following the Monte Carlo Particle Numbering Scheme.
  - FromPrimaryBeautyHadron: boolean variable indicating whether the particle has been produced in a beauty hadron decay or not.
  - Transverse momentum (pT): component of the three-momentum transverse to the beamline, i.e. the x and y components combined.
  - Impact parameter with respect to the associated primary vertex (IP): distance of closest approach between the particle trajectory and its associated primary vertex (proton-proton collision point). The associated primary vertex is the one with the smallest IP for the given particle amongst all the primary vertices in the event.
  - Pseudorapidity (η): spatial coordinate describing the angle of a particle relative to the beam axis, computed as η = arctanh(pz/∥p⃗∥).
  - Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.
  - Ox, Oy, Oz: Cartesian coordinates of the origin point of the particle.
  - px, py, pz: Cartesian components of the three-momentum.
  - PVx, PVy, PVz: Cartesian coordinates of the position of the associated primary vertex.
- Edge variables:
  - FirstParticleKey: ParticleKey of one of the two particles connected by the edge.
  - SecondParticleKey: ParticleKey of the other particle, satisfying FirstParticleKey > SecondParticleKey.
  - FromSamePrimaryBeautyHadron: boolean variable indicating whether the two particles originate from the same beauty hadron decay.
  - Opening angle (θ): angle between the three-momentum directions of the two particles.
  - Momentum-transverse distance (d ⊥ P⃗): distance between the origin points of the two particles, measured in the plane transverse to the combined three-momentum of the two particles.
  - Distance along the beam axis (Δz): difference between the z-coordinates of the origin points of the two particles.
  - FromSamePV: boolean variable indicating whether the two particles share the same associated primary vertex.
  - Order of the “topological” Lowest Common Ancestor (TopoLCAOrder): variable that can take the values 0, 1, 2 or 3, as explained in the paper.
  - Identity of the “topological” Lowest Common Ancestor (TopoLCAID): numerical code identifying the particle type of the ancestor, following the Monte Carlo Particle Numbering Scheme.
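Several of the variables above follow directly from the origin points and momenta. The sketch below shows one possible way to compute them. It rests on assumptions that this description does not state: trajectories are treated as straight lines for the impact parameter, Δz is taken as first minus second, and the momentum-transverse distance is read as the length of the origin-point separation after removing its component along the summed momentum. The function names are hypothetical, and the paper remains the reference for the exact definitions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def transverse_momentum(px, py):
    """pT: magnitude of the momentum component transverse to the z axis."""
    return math.hypot(px, py)


def pseudorapidity(px, py, pz):
    """eta = arctanh(pz / |p|); undefined for a momentum exactly along z."""
    return math.atanh(pz / _norm((px, py, pz)))


def impact_parameter(origin, momentum, pv):
    """Distance of closest approach of a straight line to a vertex (assumption)."""
    d = [v - o for v, o in zip(pv, origin)]
    cross = (
        d[1] * momentum[2] - d[2] * momentum[1],
        d[2] * momentum[0] - d[0] * momentum[2],
        d[0] * momentum[1] - d[1] * momentum[0],
    )
    return _norm(cross) / _norm(momentum)


def opening_angle(p1, p2):
    """Angle between two three-momenta, in radians."""
    cos_theta = _dot(p1, p2) / (_norm(p1) * _norm(p2))
    # Clamp to protect acos against rounding just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def delta_z(origin1, origin2):
    """Difference of the z coordinates (sign convention is an assumption)."""
    return origin1[2] - origin2[2]


def momentum_transverse_distance(origin1, origin2, p1, p2):
    """Separation of the origin points transverse to the summed momentum."""
    total = [a + b for a, b in zip(p1, p2)]
    d = [a - b for a, b in zip(origin1, origin2)]
    scale = _dot(d, total) / _dot(total, total)
    perp = [di - scale * ti for di, ti in zip(d, total)]
    return _norm(perp)


print(transverse_momentum(3.0, 4.0))
print(opening_angle((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

The same formulas can be applied column-wise to the px, py, pz and Ox, Oy, Oz columns of a DataFrame if vectorised evaluation is preferred.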
Files
Total size: 23.1 GB
| MD5 checksum | Size |
|---|---|
| 71b011577212b381bd0f6903eb2f256b | 676.4 MB |
| 87a134ecf41305fd63233dd2f1e31692 | 889.8 MB |
| 70b527f5b9bda16299d0bddffe69751a | 909.1 MB |
| 9565fac5d425eed3c64cb494e30454c7 | 791.3 MB |
| 5f7a268c44d8515379ffe1092b8f5b1c | 865.4 MB |
| ee13fc0597751e872f3b2c09258411f7 | 890.2 MB |
| 670a59c5693dd2419d0415b19b06acdc | 8.9 GB |
| bfd12ec471480f15af67533dd393e0de | 8.8 GB |
| 0a43ab9c06764d8ad5ff37ac20fff7bd | 359.0 MB |
| b28e4d507dcbe024e7dd24a655ff6005 | 7.8 kB |
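The MD5 checksums listed for the files can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `file_md5` and `matches_checksum` are hypothetical, and the path in the usage comment is a placeholder.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file against a checksum given with or without an 'md5:' prefix."""
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return file_md5(path) == expected


# Usage (placeholder path; use the checksum listed for the downloaded file):
# ok = matches_checksum("/path/to/file.root", "md5:<checksum from the table>")
```

Reading in chunks keeps memory use low, which matters for the multi-gigabyte files in this record.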
References
- Bierlich, Christian et al. (2022). A comprehensive guide to the physics and usage of PYTHIA 8.3.
- Ryd et al. (2005). EvtGen: A Monte Carlo Generator for B-Physics.