TrackFormers - Collision Event Data Sets
dr. ir. Uraz Odyurt
Introduction
This artefact includes five individual data sets containing particle collision data in virtual detector setups. These data sets are utilised for Machine Learning (ML) model design and training within the publication “TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era”. Three of the data sets are generated using the REDuced VIrtual Detector (REDVID) simulation framework. The other two are reduced versions of the TrackML data set. The full TrackML data set is simulated using the Pythia 8 event generator.
For further information on REDVID, refer to the REDVID website.
For further information on TrackML data, refer to the TrackML paper.
Detector geometry
Both simulations, REDVID and Pythia, incorporate a virtual detector. These detectors are generally inspired by the ATLAS and CMS detector designs. Each has multiple detection layers consisting of cylinders and disks arranged in the detector space. The arrangement differs between REDVID and TrackML.
In the case of REDVID, the following sub-detector categories are present in these data sets:
- Short-strip
- Long-strip
- Barrel
The Barrel category is cylindrical, while both the Short-strip and the Long-strip categories consist of disk-shaped sub-detectors.
Data description
The generated data includes information on geometry, tracks and hits from simulations. While the information on the geometry is included, the bulk of the data set covers data for tracks and hits.
Tracks and hits belonging to REDVID simulations are defined as parameters of line equations and as point coordinates, respectively, both in the cylindrical coordinate system. In the case of TrackML, a right-handed Cartesian coordinate system is used.
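Since the two families of data sets use different coordinate systems, comparing them may require a conversion. The following is a minimal sketch of the standard cylindrical/Cartesian conversion. The function names are illustrative, and it assumes that `theta` is stored in radians and measured in the XY-plane, which should be verified against the REDVID documentation.

```python
import math


def cylindrical_to_cartesian(r, theta, z):
    """Convert a cylindrical (r, theta, z) point to Cartesian (x, y, z).

    Assumption: theta is an angle in radians, measured in the XY-plane.
    """
    return (r * math.cos(theta), r * math.sin(theta), z)


def cartesian_to_cylindrical(x, y, z):
    """Convert a Cartesian (x, y, z) point to cylindrical (r, theta, z).

    The returned theta is in radians, in the range (-pi, pi].
    """
    return (math.hypot(x, y), math.atan2(y, x), z)
```

Note that `atan2` returns angles in (-pi, pi]; if the data uses a different angular range, the result needs to be shifted accordingly.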
Files
The generated data is saved in multiple CSV files and stored as compressed tarballs. These are as follows:
- REDVID data set of 100k events with 10 to 50 linear tracks per event and added noise to the calculated hit coordinates. The track count per event is randomised.
  redvid_3d_noisy-100k-events-10-to-50-tracks.tar.gz
- REDVID data set of 100k events with 10 to 50 helical tracks per event and added noise to the calculated hit coordinates. The track count per event is randomised.
  redvid_3d_noisy-100k-events-10-to-50-helical-tracks.tar.gz
- REDVID data set of 100k events with 50 to 100 helical tracks per event and added noise to the calculated hit coordinates. The track count per event is randomised.
  redvid_3d_noisy-100k-events-50-to-100-helical-tracks.tar.gz
- TrackML reduced data set of 40k events with 10 to 50 tracks per event. The track count per event is randomised.
  trackml_40k-events-10-to-50-tracks.tar.gz
- TrackML reduced data set of 40k events with 200 to 500 tracks per event. The track count per event is randomised.
  trackml_40k-events-200-to-500-tracks.tar.gz
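One possible way to read the CSV files directly from a tarball, without unpacking it to disk, is sketched below using only the Python standard library. The helper name `iter_csv_rows` is hypothetical, and the sketch assumes that the CSV files inside the archive carry a `.csv` extension and are UTF-8 encoded; the internal layout of the archives is not described here, so adjust as needed.

```python
import csv
import io
import tarfile


def iter_csv_rows(archive_path):
    """Yield (member_name, row_dict) for every CSV file in a .tar.gz archive.

    Assumptions: members end in '.csv' and are UTF-8 encoded with a header row.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # DictReader uses the first line of each file as column titles.
            for row in csv.DictReader(text):
                yield member.name, row
```

For example, iterating over `iter_csv_rows("trackml_40k-events-10-to-50-tracks.tar.gz")` would stream the rows one at a time, which keeps memory usage low for the larger archives. All values arrive as strings and need to be cast to the types listed in the header descriptions.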
Associated ML models
The table below maps our data sets to the trained ML model variants. Further detail can be found in the TrackFormers paper.
Data set | Trained ML model(s) |
---|---|
redvid_3d_noisy-100k-events-10-to-50-tracks | EncDec, EncCla, EncReg, U-Net |
redvid_3d_noisy-100k-events-10-to-50-helical-tracks | EncDec, EncCla, EncReg, U-Net |
redvid_3d_noisy-100k-events-50-to-100-helical-tracks | EncDec, EncCla, EncReg, U-Net |
trackml_40k-events-10-to-50-tracks | EncDec, EncCla, EncReg |
trackml_40k-events-200-to-500-tracks | EncCla, EncReg, EncReg-FA |
REDVID data headers
Data headers, i.e., CSV column titles, for the REDVID data sets are as follows:
- `event_id`
  - An incremental identifier for events belonging to an experiment, which is unique within the scope of the experiment.
    Type => `integer`
- `sub_detector_id`
  - An incremental identifier for different sub-detector layers belonging to a geometry, which is unique within the scope of the geometry.
    Type => `integer`
- `sub_detector_type`
  - The type of the sub-detector layer recording a hit, which can be one of three available types: pixel, short-strip, or long-strip.
    Type => `string`
- `track_id`
  - An incremental identifier for tracks belonging to an event, which is unique within the scope of the event.
    Type => `integer`
- `track_type`
  - Indicates the type of function defining the track in terms of polynomial degree. Available types are ‘linear’, ‘helical_uniform’ and ‘helical_expanding’.
    Type => `string`
- `r_0` or `radial_const`
  - The `r` coordinate of the `(r, theta, z)` tuple defining the point `P_0`, used in a track’s parametric set of equations. The value will represent origin smearing for `r`. `r_0` and `radial_const` are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => `float`
- `theta_0` or `azimuthal_const`
  - The `theta` coordinate of the `(r, theta, z)` tuple defining the point `P_0`, used in a track’s parametric set of equations. The value will represent origin smearing for `theta`. `theta_0` and `azimuthal_const` are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => `float`
- `z_0` or `pitch_const`
  - The `z` coordinate of the `(r, theta, z)` tuple defining the point `P_0`, used in a track’s parametric set of equations. The value will represent origin smearing for `z`. `z_0` and `pitch_const` are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => `float`
- `r_d`
  - The `r` coordinate of the `(r, theta, z)` tuple defining the direction vector `V_d`, used in a track’s parametric set of equations. `r_d` is applicable to the ‘linear’ track type.

  OR,

  `radial_coeff`
  - The coefficient affecting the radius rate in the helical track. `radial_coeff` is applied to the free variable in the equation for `r`. `radial_coeff` is applicable to the ‘helical_expanding’ track type.
    Type => `float`
- `theta_d`
  - The `theta` coordinate of the `(r, theta, z)` tuple defining the direction vector `V_d`, used in a track’s parametric set of equations. `theta_d` is applicable to the ‘linear’ track type.

  OR,

  `azimuthal_coeff`
  - The coefficient affecting the clockwise/counter-clockwise extrusion direction of the helical track. `azimuthal_coeff` is applied to the free variable in the equation for `theta`. `azimuthal_coeff` is applicable to the ‘helical_expanding’ track type.
    Type => `float`
- `z_d`
  - The `z` coordinate of the `(r, theta, z)` tuple defining the direction vector `V_d`, used in a track’s parametric set of equations. This value will be `1` or `-1`, depending on which side of the XY-plane the track is being directed to. `z_d` is applicable to the ‘linear’ track type.

  OR,

  `pitch_coeff`
  - The coefficient affecting the pitch rate in the helical track. `pitch_coeff` is applied to the free variable in the equation for `z`. `pitch_coeff` is applicable to the ‘helical_expanding’ track type.
    Type => `integer`
- `hit_id`
  - An incremental identifier for hits belonging to an event, which is unique within the scope of the event.
    Type => `integer`
- `hit_r`
  - The `r` coordinate of the `(r, theta, z)` tuple defining the recorded hit point on the relevant sub-detector.
    Type => `float`
- `hit_theta`
  - The `theta` coordinate of the `(r, theta, z)` tuple defining the recorded hit point on the relevant sub-detector.
    Type => `float`
- `hit_z`
  - The `z` coordinate of the `(r, theta, z)` tuple defining the recorded hit point on the relevant sub-detector.
    Type => `float`
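To illustrate how the track parameters above could be combined, the sketch below evaluates a point on a track for a given value of the free variable `t`. It rests on an assumption that is not confirmed by this description: that each coordinate follows the form `constant + coefficient * t`, i.e. `P_0 + t * V_d` for ‘linear’ tracks and the `*_const`/`*_coeff` pairs for ‘helical_expanding’ tracks. The function name `track_point` and the sample values are hypothetical; consult the REDVID documentation for the exact parametric equations.

```python
def track_point(const, coeff, t):
    """Evaluate an (r, theta, z) point on a track at free-variable value t.

    Assumption (not confirmed by the data set description): every coordinate
    is computed as constant + coefficient * t.
    const: (r_0, theta_0, z_0) or (radial_const, azimuthal_const, pitch_const)
    coeff: (r_d, theta_d, z_d) or (radial_coeff, azimuthal_coeff, pitch_coeff)
    """
    return tuple(c0 + c1 * t for c0, c1 in zip(const, coeff))


# Made-up parameters for a 'linear' track, for illustration only.
p_0 = (0.0, 0.5, 0.25)  # (r_0, theta_0, z_0)
v_d = (1.0, 0.0, 1)     # (r_d, theta_d, z_d), with z_d being 1 or -1
points = [track_point(p_0, v_d, t) for t in (0.0, 1.0, 2.0)]
```

Under this reading, a ‘linear’ track with `theta_d` equal to zero keeps a constant azimuthal angle while `r` and `z` grow with `t`.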
TrackML data headers
Data headers, i.e., CSV column titles, for the TrackML reduced data sets are as follows:
- `x`
  - Measured x-coordinate (in millimetres) of the hit in global coordinates.
    Type => `float`
- `y`
  - Measured y-coordinate (in millimetres) of the hit in global coordinates.
    Type => `float`
- `z`
  - Measured z-coordinate (in millimetres) of the hit in global coordinates.
    Type => `float`
- `volume_id`
  - Numerical identifier of the detector group.
    Type => `integer`
- `vx`
  - The x-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => `float`
- `vy`
  - The y-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => `float`
- `vz`
  - The z-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => `float`
- `px`
  - The x-component of the initial momentum (in GeV/c), along the global x-axis.
    Type => `float`
- `py`
  - The y-component of the initial momentum (in GeV/c), along the global y-axis.
    Type => `float`
- `pz`
  - The z-component of the initial momentum (in GeV/c), along the global z-axis.
    Type => `float`
- `q`
  - Particle charge (as a multiple of the absolute electron charge).
    Type => `float`
- `particle_id`
  - Numerical identifier of the particle inside the event.
    Type => `integer`
- `weight`
  - Per-hit weight used for the scoring metric; the total sum of weights within one event equals one.
    Type => `float`
- `event_id`
  - Numerical identifier of the event.
    Type => `integer`
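As an example of working with these columns, the following sketch groups hits into per-particle tracks keyed by `(event_id, particle_id)`. The function name `group_hits_by_particle` and the sample rows are made up for illustration and are not taken from the data set; the sketch also assumes that both identifiers are stored as plain integers in the CSV files.

```python
import csv
import io
from collections import defaultdict


def group_hits_by_particle(rows):
    """Map (event_id, particle_id) to the list of (x, y, z) hits of that particle."""
    tracks = defaultdict(list)
    for row in rows:
        # Assumption: identifiers are written as plain integers.
        key = (int(row["event_id"]), int(row["particle_id"]))
        tracks[key].append((float(row["x"]), float(row["y"]), float(row["z"])))
    return dict(tracks)


# Made-up sample rows with a subset of the columns, for illustration only.
sample = io.StringIO(
    "x,y,z,particle_id,event_id\n"
    "1.0,2.0,3.0,7,0\n"
    "2.0,4.0,6.0,7,0\n"
    "5.0,5.0,5.0,9,0\n"
)
tracks = group_hits_by_particle(csv.DictReader(sample))
```

Keying on the pair rather than on `particle_id` alone avoids merging particles from different events, since `particle_id` is described as an identifier inside the event.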
Usage and citation
If you use this data set in your research or any publication, we kindly request that you cite the following paper:
@article{Caron:YEAR:TrackFormers,
author = {Caron, Sascha and Dobreva, Nadezhda and Ferrer Sánchez, Antonio and
Martín-Guerrero, José D. and Odyurt, Uraz and Ruiz de Austri Bazan, Roberto and
Wolffs, Zef and Zhao, Yue},
title = {TrackFormers: In Search of Transformer-Based Particle Tracking for the
High-Luminosity LHC Era},
journal = {UPDATE AS APPROPRIATE},
year = {UPDATE AS APPROPRIATE},
doi = {GIVEN IN ZENODO METADATA}
}
Support
Note that this data set is being shared on an “as is” basis, without any express or implied warranties or obligations of support. While we have made efforts to ensure the accuracy and completeness of the data, we cannot guarantee its fitness for any particular purpose or provide any form of ongoing support.
As the creators and sharers of this data set, we are unable to offer any dedicated support or assistance in working with or analysing the data. We do not commit to responding to inquiries, fixing issues, or providing additional documentation or guidance related to this data set. Should you encounter any challenges or have questions, we recommend referring to the existing documentation.
Authors and acknowledgement
The generated data sets were created by:
- REDVID: Uraz Odyurt - University of Twente; Nikhef
- TrackML reductions: Nadezhda Dobreva - Radboud University
The collaborating team includes:
- Sascha Caron - Radboud University; Nikhef
- Antonio Ferrer Sánchez - University of Valencia
- José D. Martín-Guerrero - University of Valencia
- Roberto Ruiz de Austri Bazan - University of Valencia
- Zef Wolffs - University of Amsterdam; Nikhef
- Yue Zhao - SURF
Licence
The data set is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited, as shown above.
If you have any questions regarding the licence or usage of the data set, please contact the authors.
Note: The licence applies only to the data set itself and not to any third-party content or software that may be included with the data set. Please review any licences or terms of use associated with those components separately.