TrackFormers - Collision Event Data Sets

dr. ir. Uraz Odyurt

Introduction

This artefact includes five individual data sets containing particle collision data in virtual detector setups. These data sets are utilised for Machine Learning (ML) model design and training within the publication “TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era”. Three of the data sets are generated using the REDuced VIrtual Detector (REDVID) simulation framework. The other two are reduced versions of the TrackML data set. The full TrackML data set is simulated using the Pythia 8 event generator.

For further information on REDVID, refer to the REDVID website.

For further information on TrackML data, refer to the TrackML paper.

Detector geometry

Both simulations, REDVID and Pythia, incorporate a virtual detector. These detectors are generally inspired by the ATLAS and CMS detector designs. Each has multiple detection layers, consisting of cylinders and disks arranged in the detector space. The arrangement differs between REDVID and TrackML.

In the case of REDVID, the following sub-detector categories are present in these data sets:

The Barrel category is cylindrical in shape, while both the Short-strip and the Long-strip categories consist of disk-shaped sub-detectors.

Data description

The generated data includes information on geometry, tracks and hits from the simulations. Geometry information is included, but track and hit data make up the bulk of each data set.

Tracks and hits belonging to REDVID simulations are defined as parameters of line equations and point coordinates in the cylindrical coordinate system. In the case of TrackML, a right-handed Cartesian coordinate system is used.
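
Because the two families of data sets use different coordinate systems, comparing them usually requires a conversion. The following is a minimal sketch of the standard cylindrical/Cartesian conversion. The function names are hypothetical, and the sketch assumes that theta is expressed in radians, which is not stated in this description.

```python
import math


def cylindrical_to_cartesian(r, theta, z):
    """Convert (r, theta, z) to (x, y, z). Assumes theta is in radians."""
    return r * math.cos(theta), r * math.sin(theta), z


def cartesian_to_cylindrical(x, y, z):
    """Convert (x, y, z) to (r, theta, z), with theta in [-pi, pi]."""
    return math.hypot(x, y), math.atan2(y, x), z


# Round trip for an illustrative point (made-up values, not from the data).
x, y, z = cylindrical_to_cartesian(2.0, math.pi / 2, 5.0)
r, theta, z_back = cartesian_to_cylindrical(x, y, z)
```

Note that atan2 returns angles in [-pi, pi]; if the REDVID files use a different angular range, the values would need to be wrapped accordingly.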

Files

The generated data is saved in multiple CSV files and stored as compressed tarballs. These are as follows:

Associated ML models

The table below maps our data sets to the ML model variants trained on them. Further detail can be found in the TrackFormers paper.

Data set                                              Trained ML model(s)
redvid_3d_noisy-100k-events-10-to-50-tracks           EncDec, EncCla, EncReg, U-Net
redvid_3d_noisy-100k-events-10-to-50-helical-tracks   EncDec, EncCla, EncReg, U-Net
redvid_3d_noisy-100k-events-50-to-100-helical-tracks  EncDec, EncCla, EncReg, U-Net
trackml_40k-events-10-to-50-tracks                    EncDec, EncCla, EncReg
trackml_40k-events-200-to-500-tracks                  EncCla, EncReg, EncReg-FA

REDVID data headers

Data headers, i.e., CSV column titles, for the REDVID data sets are as follows:

  1. event_id - An incremental identifier for events belonging to an experiment, which is unique within the scope of the experiment.
    Type => integer

  2. sub_detector_id - An incremental identifier for different sub-detector layers belonging to a geometry, which is unique within the scope of the geometry.
    Type => integer

  3. sub_detector_type - The type of the sub-detector layer recording a hit, which can be one of three available types, pixel, short-strip, or long-strip.
    Type => string

  4. track_id - An incremental identifier for tracks belonging to an event, which is unique within the scope of the event.
    Type => integer

  5. track_type - Indicates the type of function defining the track in terms of polynomial degree. Available types are ‘linear’, ‘helical_uniform’ and ‘helical_expanding’.
    Type => string

  6. r_0 or radial_const - The r coordinate of the (r, theta, z) tuple defining the point P_0, used in a track’s parametric set of equations. The value will represent origin smearing for r. r_0 and radial_const are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => float

  7. theta_0 or azimuthal_const - The theta coordinate of the (r, theta, z) tuple defining the point P_0, used in a track’s parametric set of equations. The value will represent origin smearing for theta. theta_0 and azimuthal_const are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => float

  8. z_0 or pitch_const - The z coordinate of the (r, theta, z) tuple defining the point P_0, used in a track’s parametric set of equations. The value will represent origin smearing for z. z_0 and pitch_const are applicable to ‘linear’ and ‘helical_expanding’ track types, respectively.
    Type => float

  9. r_d - The r coordinate of the (r, theta, z) tuple defining the direction vector V_d, used in a track’s parametric set of equations. r_d is applicable to the ‘linear’ track type.
    OR,
    radial_coeff - The coefficient affecting the radius rate in the helical track. radial_coeff is applied to the free variable in the equation for r. radial_coeff is applicable to the ‘helical_expanding’ track type.
    Type => float

  10. theta_d - The theta coordinate of the (r, theta, z) tuple defining the direction vector V_d, used in a track’s parametric set of equations. theta_d is applicable to the ‘linear’ track type.
    OR,
    azimuthal_coeff - The coefficient affecting the clockwise/counter-clockwise extrusion direction of the helical track. azimuthal_coeff is applied to the free variable in the equation for theta. azimuthal_coeff is applicable to the ‘helical_expanding’ track type.
    Type => float

  11. z_d - The z coordinate of the (r, theta, z) tuple defining the direction vector V_d, used in a track’s parametric set of equations. This value will be 1 or -1, depending on which side of the XY-plane the track is being directed to. z_d is applicable to the ‘linear’ track type.
    OR,
    pitch_coeff - The coefficient affecting the pitch rate in the helical track. pitch_coeff is applied to the free variable in the equation for z. pitch_coeff is applicable to the ‘helical_expanding’ track type.
    Type => integer

  12. hit_id - An incremental identifier for hits belonging to an event, which is unique within the scope of the event.
    Type => integer

  13. hit_r - The r coordinate of the (r, theta, z) tuple defining the recorded hit point on the relevant sub-detector.
    Type => float

  14. hit_theta - The theta coordinate of the (r, theta, z) tuple defining the recorded hit point on the relevant sub-detector.
    Type => float

  15. hit_z - The z coordinate of the (r, theta, z) tuple defining the recorded hit point on the relevant sub-detector.
    Type => float
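
Since track_id and hit_id are only unique within an event, hits have to be keyed by the pair (event_id, track_id) when collecting them per track. The sketch below illustrates this with Python's standard csv module. The sample rows are made up for illustration and use only a subset of the columns listed above; the function name is hypothetical. How noise hits are labelled is not described here, so the sketch does not treat them specially.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they do not come from the published
# files and contain only a subset of the documented REDVID columns.
SAMPLE_CSV = """event_id,track_id,hit_id,hit_r,hit_theta,hit_z
0,0,0,1.0,0.10,0.5
0,0,1,2.0,0.10,1.0
0,1,2,1.0,2.00,-0.4
1,0,0,1.0,0.70,0.2
"""


def group_hits_by_track(csv_file):
    """Map (event_id, track_id) to a list of (hit_r, hit_theta, hit_z)."""
    tracks = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # track_id alone is ambiguous across events, so use a composite key.
        key = (int(row["event_id"]), int(row["track_id"]))
        hit = (float(row["hit_r"]), float(row["hit_theta"]), float(row["hit_z"]))
        tracks[key].append(hit)
    return dict(tracks)


tracks = group_hits_by_track(io.StringIO(SAMPLE_CSV))
for (event_id, track_id), hits in sorted(tracks.items()):
    print(f"event {event_id}, track {track_id}: {len(hits)} hit(s)")
```

For a real file, the io.StringIO object would be replaced by a file handle opened on one of the extracted CSV files.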

TrackML data headers

Data headers, i.e., CSV column titles, for the TrackML reduced data sets are as follows:

  1. x - Measured x-coordinate (in millimetres) of the hit in global coordinates.
    Type => float

  2. y - Measured y-coordinate (in millimetres) of the hit in global coordinates.
    Type => float

  3. z - Measured z-coordinate (in millimetres) of the hit in global coordinates.
    Type => float

  4. volume_id - Numerical identifier of the detector group.
    Type => integer

  5. vx - The x-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => float

  6. vy - The y-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => float

  7. vz - The z-component of the initial position or vertex (in millimetres) in global coordinates.
    Type => float

  8. px - The x-component of the initial momentum (in GeV/c) in global coordinates.
    Type => float

  9. py - The y-component of the initial momentum (in GeV/c) in global coordinates.
    Type => float

  10. pz - The z-component of the initial momentum (in GeV/c) in global coordinates.
    Type => float

  11. q - Particle charge (as multiple of the absolute electron charge).
    Type => float

  12. particle_id - Numerical identifier of the particle inside the event.
    Type => integer

  13. weight - Per-hit weight used for the scoring metric; the sum of weights within one event equals one.
    Type => float

  14. event_id - Numerical identifier of the event.
    Type => integer
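
Two quantities follow directly from these columns: the transverse momentum of a particle, sqrt(px^2 + py^2), and the per-event sum of hit weights. The sketch below computes both with the standard library. The sample rows are made up for illustration and the function name is hypothetical. Whether the weights still sum to one after the reduction applied to these data sets is not stated here, so the sum is best treated as a check rather than a guarantee.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up rows for illustration only; they do not come from the published
# files and contain only a subset of the documented TrackML columns.
SAMPLE_CSV = """x,y,z,px,py,pz,particle_id,weight,event_id
10.0,0.0,5.0,3.0,4.0,1.0,1,0.5,0
20.0,0.0,9.0,3.0,4.0,1.0,1,0.25,0
0.0,15.0,-3.0,0.6,0.8,2.0,2,0.25,0
"""


def summarise(csv_file):
    """Return (weight sum per event, transverse momentum per particle)."""
    weight_sum = defaultdict(float)
    pt = {}
    for row in csv.DictReader(csv_file):
        event_id = int(row["event_id"])
        weight_sum[event_id] += float(row["weight"])
        # particle_id identifies a particle inside an event, so key on both.
        key = (event_id, int(row["particle_id"]))
        pt[key] = math.hypot(float(row["px"]), float(row["py"]))  # GeV/c
    return dict(weight_sum), pt


weight_sum, pt = summarise(io.StringIO(SAMPLE_CSV))
```

The momentum columns describe the initial momentum of the particle, so every hit of the same particle carries the same value and the last row read simply overwrites an identical entry.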

Usage and citation

If you use this data set in your research or any publication, we kindly ask you to cite the following paper:

@article{Caron:YEAR:TrackFormers,
  author = {Caron, Sascha and Dobreva, Nadezhda and Ferrer Sánchez, Antonio and 
    Martín-Guerrero, José D. and Odyurt, Uraz and Ruiz de Austri Bazan, Roberto and 
    Wolffs, Zef and Zhao, Yue},
  title = {TrackFormers: In Search of Transformer-Based Particle Tracking for the 
    High-Luminosity LHC Era}, 
  journal = {UPDATE AS APPROPRIATE},
  year = {UPDATE AS APPROPRIATE},
  doi = {GIVEN IN ZENODO METADATA}
}

Support

Note that this data set is being shared on an “as is” basis, without any express or implied warranties or obligations of support. While we have made efforts to ensure the accuracy and completeness of the data, we cannot guarantee its fitness for any particular purpose or provide any form of ongoing support.

As the creators and sharers of this data set, we are unable to offer any dedicated support or assistance in working with or analysing the data. We do not commit to responding to inquiries, fixing issues, or providing additional documentation or guidance related to this data set. Should you encounter any challenges or have questions, we recommend referring to the existing documentation.

Authors and acknowledgement

The generated data sets are created by:

The collaborating team includes:

Licence

The data set is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited, as shown above.

If you have any questions regarding the licence or usage of the data set, please contact the authors.

Note: The licence applies only to the data set itself and not to any third-party content or software that may be included with the data set. Please review any licences or terms of use associated with those components separately.