Published April 4, 2023 | Version v1.0.0
Dataset Open

Dataset of paper "GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions"

  • 1. Universita di Milano Bicocca and INFN Sezione di Milano-Bicocca, Experimental Physics Department, European Organization for Nuclear Research (CERN)
  • 2. Universita di Milano Bicocca and INFN Sezione di Milano-Bicocca
  • 3. University of Zurich
  • 4. Nikhef National Institute for Subatomic Physics, Imperial College London, South Kensington Campus

Description

DFEI dataset

The full description can also be found in README.md.

The dataset was used in the paper “GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions”. The project describes a full event interpretation at the LHCb experiment, located at the Large Hadron Collider at CERN, Geneva. An “event” consists of detector responses that have been converted into tracks, where each track represents a particle.

The aim of the algorithm is to make sense of the tracks: to bundle together tracks coming from the same origin and to interpret their decay hierarchy.

Generated events

The events in this dataset are based on simulation generated with PYTHIA8 and EvtGen, in which the particle-collision conditions expected for the LHC Run 3 are replicated as shown in the table.

LHCb period          | Num. vis. pp collisions | Num. tracks | Num. b hadrons | Num. c hadrons
Runs 3-4 (Upgrade I) | ∼ 5                     | ∼ 150       | ≪ 1            | ∼ 1

Additionally, an approximate emulation of the LHCb detection and reconstruction effects is applied, as described in the paper in the appendix “Simulation”. In the generated dataset, each event is required to contain at least one b-hadron, which is subsequently allowed to decay freely through any of the standard decay modes present in PYTHIA8. On average, 40% of those events contain more than one b-hadron decay, with a maximum b-hadron decay multiplicity of five. Only charged stable particles that have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region (as defined in the paper) are included in the datasets.

Datasets

The datasets are divided into three categories:

Training and testing

The file Dataset_InclusiveHb_Training.root contains the training dataset (40,000 events) and the test dataset (10,000 events) of inclusive decays.

Evaluation

The inclusive dataset Dataset_InclusiveHb_Evaluation.root contains the evaluation events (50,000).

Exclusive decays

In addition to this inclusive dataset, several smaller samples (of a few thousand events each) have also been generated, requiring that all the events in each sample contain a specific (exclusive) type of b-hadron decay. The specific modes have been chosen to be representative of the most common classes of decay topologies of physics interest for LHCb. These samples contain only events in which all the particles originating from each of the considered exclusive decays have been produced inside the LHCb geometrical acceptance and in the Vertex Locator region.

The datasets contained are:

  • Dataset_Bd_DD.root
  • Dataset_Bd_Kpi.root
  • Dataset_Bd_Kstmumu.root
  • Dataset_Bs_Dspi.root
  • Dataset_Bs_Jpsiphi.root
  • Dataset_Bu_KKpi.root
  • Dataset_Lb_Lcpi.root

More information on them can be found in the paper.

Loading the data

The dataset is saved in the binary ROOT format with a key-array mapping. It can be loaded using the uproot Python library to convert it to a pandas DataFrame or similar.

An example snippet is given here:

import uproot

# Pick one of the two trees stored in each file.
# treename = "Particles"
treename = "Relations"

with uproot.open("/path/to/file.root") as file:
    df = file[treename].arrays(
        # Optionally restrict the read to a subset of branches:
        # ["EventNumber", "FromSamePV_true"],
        library="pd",  # "pd" returns a pandas DataFrame
    )

The opened file behaves like a mapping that contains two trees, accessible under the keys Particles and Relations. Particles holds the particles themselves, and Relations holds the relations between pairs of particles.
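Once both the Particles and Relations entries are loaded, they can be regrouped into one graph per event. The following is a minimal sketch in plain Python, assuming that both trees carry an EventNumber branch, that the branch names match the variable names listed in this description, and that each entry has been converted to a dict (for example with the pandas `DataFrame.to_dict("records")` method). The function name `group_by_event` and the toy rows are illustrative only and are not part of the dataset or of uproot.

```python
from collections import defaultdict


def group_by_event(particle_rows, relation_rows):
    """Group node and edge rows by EventNumber.

    Both arguments are iterables of dicts, one dict per tree entry.
    Column names are assumed to follow the variable list of this
    description; check them against the actual branch names.
    """
    events = defaultdict(lambda: {"nodes": {}, "edges": []})
    for row in particle_rows:
        # ParticleKey is unique within an event, so it can index the nodes.
        events[row["EventNumber"]]["nodes"][row["ParticleKey"]] = row
    for row in relation_rows:
        events[row["EventNumber"]]["edges"].append(row)
    return dict(events)


# Toy rows standing in for the content of the two trees.
particles = [
    {"EventNumber": 1, "ParticleKey": 0},
    {"EventNumber": 1, "ParticleKey": 1},
    {"EventNumber": 2, "ParticleKey": 0},
]
relations = [
    {"EventNumber": 1, "FirstParticleKey": 1, "SecondParticleKey": 0},
]
graphs = group_by_event(particles, relations)
```

Each value of `graphs` then holds the nodes of one event indexed by ParticleKey, together with the list of edges of that event.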

The Relations tree contains only edges connecting two different particles, so there are no self-loops. The edges are treated as undirected, so each pair of particles appears as a single edge.
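To make the edge convention concrete, the sketch below enumerates one edge per unordered pair of distinct particles and orders each pair with the larger key first, mirroring the FirstParticleKey > SecondParticleKey convention described under the edge variables. It assumes that every pair of distinct particles in an event is connected, which this description suggests but does not state explicitly. The function names are illustrative.

```python
from itertools import combinations


def undirected_edges(particle_keys):
    """Return one (first, second) pair per unordered pair of distinct keys.

    Pairs are ordered so that first > second.
    """
    keys = sorted(set(particle_keys))
    # combinations() yields (a, b) with a < b, so swap to get first > second.
    return [(b, a) for a, b in combinations(keys, 2)]


def expected_edge_count(n_particles):
    """Edges in a fully connected undirected graph without self-loops."""
    return n_particles * (n_particles - 1) // 2


edges = undirected_edges([3, 0, 7])
n_edges_typical = expected_edge_count(150)
```

Under the same assumption, an event with about 150 tracks would hold 150 · 149 / 2 = 11,175 edges.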

Variables

The relevant features used in the GNN are described in the following. A cartesian right-handed coordinate system is used, with the z axis pointing along the beamline, the x axis being parallel to the horizontal and the y axis being vertically oriented. When specified in the name of the variables, the suffix “_true” refers to ground-truth information, and the suffix “_reco” refers to the output of the emulated LHCb reconstruction.

  • General:

    • EventNumber: unique number to identify the event that the entry belongs to.
  • Node variables:

    • ParticleKey: unique number to identify each particle in a given event.

    • Identity (ID): numerical code identifying the type of particle, following the Monte Carlo Particle Numbering Scheme.

    • FromPrimaryBeautyHadron: boolean variable indicating whether the particle has been produced in a beauty hadron decay or not.

    • Transverse momentum (pT): magnitude of the component of the three-momentum transverse to the beamline, i.e. pT = sqrt(px² + py²).

    • Impact parameter with respect to the associated primary vertex (IP): distance of closest approach between the particle trajectory and its associated primary vertex (proton-proton collision point), defined as the one with the smallest IP for the given particle amongst all the primary vertices in the event.

    • Pseudorapidity (η): spatial coordinate describing the angle of a particle relative to the beam axis, computed as η = arctanh(pz/∥p⃗∥).

    • Charge (q): for the stable particles under consideration, the charge can take the value 1 or -1.

    • Ox, Oy, Oz: cartesian coordinates of the origin point of the particle.

    • px, py, pz: cartesian components of the three-momentum.

    • PVx, PVy, PVz: cartesian coordinates of the position of the associated primary vertex.

  • Edge variables:

    • FirstParticleKey: ParticleKey of one of the two particles connected by the edge.

    • SecondParticleKey: ParticleKey of the other particle, chosen such that FirstParticleKey > SecondParticleKey.

    • FromSamePrimaryBeautyHadron: boolean variable indicating whether the two particles originate from the same beauty hadron decay.

    • Opening angle (θ): angle between the three-momentum directions of the two particles.

    • Momentum-transverse distance (d ⊥ P⃗): distance between the origin points of the two particles, measured in the plane transverse to the combined three-momentum of the two particles.

    • Distance along the beam axis (Δz): difference between the z-coordinates of the origin points of the two particles.

    • FromSamePV: boolean variable indicating whether the two particles share the same associated primary vertex.

    • Order of the “topological” Lowest Common Ancestor (TopoLCAOrder): variable that can take the values 0, 1, 2 or 3, as explained in the paper.

    • Identity of the “topological” Lowest Common Ancestor (TopoLCAID): numerical code identifying the particle type of the ancestor, following the Monte Carlo Particle Numbering Scheme.
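Several of the variables above are simple functions of the momenta and origin points. The following sketch, using only the Python standard library, shows how pT, η and the opening angle follow from their definitions. The momentum-transverse distance is computed by projecting the displacement between the two origin points onto the plane transverse to the summed momentum; this is one plausible reading of the definition above, and the paper should be consulted for the exact prescription used to fill the dataset. All function names are illustrative.

```python
import math


def transverse_momentum(px, py):
    """pT: magnitude of the momentum component transverse to the z axis."""
    return math.hypot(px, py)


def pseudorapidity(px, py, pz):
    """eta = arctanh(pz / |p|); undefined for momenta exactly along z."""
    p = math.sqrt(px * px + py * py + pz * pz)
    return math.atanh(pz / p)


def opening_angle(p1, p2):
    """Angle between two three-momenta given as (px, py, pz) tuples."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm = math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2))
    # Clamp to [-1, 1] to guard against rounding outside the acos domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def momentum_transverse_distance(o1, o2, p1, p2):
    """Distance between two origin points in the plane transverse to
    p1 + p2 (assumed interpretation of the edge variable)."""
    d = [a - b for a, b in zip(o1, o2)]        # displacement between origins
    total = [a + b for a, b in zip(p1, p2)]    # combined three-momentum
    scale = sum(a * b for a, b in zip(d, total)) / sum(c * c for c in total)
    perp = [a - scale * b for a, b in zip(d, total)]  # remove parallel part
    return math.sqrt(sum(c * c for c in perp))


# Toy values, not taken from the dataset.
pt = transverse_momentum(3.0, 4.0)
eta = pseudorapidity(3.0, 4.0, 0.0)
theta = opening_angle((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
d_perp = momentum_transverse_distance(
    (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)
)
```

Comparing such recomputed values with the stored branches on a few entries is a quick way to check the conventions (units, sign of Δz, ordering of the two particles) before training on the data.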

Files (23.1 GB)

Checksum                              Size
md5:71b011577212b381bd0f6903eb2f256b  676.4 MB
md5:87a134ecf41305fd63233dd2f1e31692  889.8 MB
md5:70b527f5b9bda16299d0bddffe69751a  909.1 MB
md5:9565fac5d425eed3c64cb494e30454c7  791.3 MB
md5:5f7a268c44d8515379ffe1092b8f5b1c  865.4 MB
md5:ee13fc0597751e872f3b2c09258411f7  890.2 MB
md5:670a59c5693dd2419d0415b19b06acdc  8.9 GB
md5:bfd12ec471480f15af67533dd393e0de  8.8 GB
md5:0a43ab9c06764d8ad5ff37ac20fff7bd  359.0 MB
md5:b28e4d507dcbe024e7dd24a655ff6005  7.8 kB

Additional details

Funding

Swiss National Science Foundation
Probing the flavour anomalies at LHCb P400P2_191121
European Commission
LHCbDFEI - Design of a Deep Full Event Interpretation for LHCb and application in semitauonic B decays 892683
Swiss National Science Foundation
Understanding the Flavour Anomalies 200020_204238

References

  • Bierlich, Christian et al. (2022). A comprehensive guide to the physics and usage of PYTHIA 8.3.
  • Ryd et al. (2005). EvtGen: A Monte Carlo Generator for B-Physics.