Published April 4, 2019 | Version v5
Dataset Open

R&D Dataset for LHC Olympics 2020 Anomaly Detection Challenge

  • 1. University of Hamburg
  • 2. Lawrence Berkeley National Lab
  • 3. Rutgers University

Description

This is the first R&D dataset for the LHC Olympics 2020 Anomaly Detection Challenge. It consists of 1M QCD dijet events and 100k W'->XY events, with X->qq and Y->qq. The W', X, and Y masses are 3.5 TeV, 500 GeV and 100 GeV, respectively. The events are produced using Pythia8 and Delphes 3.4.1, with no pileup or multiparton interactions (MPI) included. They are selected using a single fat-jet (R=1) trigger with a pT threshold of 1.2 TeV.

The events are randomly shuffled together, but for the purposes of testing and development, we provide the user with a signal/background truth bit for each event. Obviously, the truth bit will not be included in the actual challenge.

These events are stored as pandas DataFrames saved in compressed HDF5 (.h5) format. For each event, all Delphes reconstructed particles are assumed to be massless and are recorded in detector coordinates (pT, eta, phi). More detailed information, such as particle charge, is not included. Each event is zero-padded to a constant-size array of 700 particles (700 x 3 = 2100 values), with the truth bit appended at the end. The array format is therefore (Nevents=1.1M, 2101).
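As an illustration of the layout described above, the following sketch splits a flat (Nevents, 2101) array into per-particle coordinates and truth bits, then converts the massless (pT, eta, phi) entries to Cartesian four-momenta. The function names are hypothetical, and the code assumes that each particle occupies three consecutive columns in the order pT, eta, phi; the example notebook on the challenge webpage is the authoritative reference for the layout.

```python
import numpy as np

N_PARTICLES = 700  # zero-padded particle slots per event (from the description)
N_COORDS = 3       # pT, eta, phi per particle


def split_events(flat):
    """Split a (n_events, 2101) array into particle coordinates and truth bits."""
    flat = np.asarray(flat, dtype=float)
    expected = N_PARTICLES * N_COORDS + 1
    if flat.ndim != 2 or flat.shape[1] != expected:
        raise ValueError(f"expected shape (n_events, {expected}), got {flat.shape}")
    # ASSUMPTION: columns are contiguous (pT, eta, phi) triplets, one per particle.
    particles = flat[:, :-1].reshape(-1, N_PARTICLES, N_COORDS)
    # The last column is the signal/background truth bit.
    labels = flat[:, -1].astype(int)
    return particles, labels


def to_cartesian(particles):
    """Convert massless (pT, eta, phi) particles to (E, px, py, pz)."""
    pt = particles[..., 0]
    eta = particles[..., 1]
    phi = particles[..., 2]
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = pt * np.cosh(eta)  # E = |p| for a massless particle
    # Zero-padded slots have pT = 0 and therefore map to all-zero four-momenta.
    return np.stack([energy, px, py, pz], axis=-1)
```

With the real file, the input array could probably be obtained with something like pandas.read_hdf(path).to_numpy(), where path points to the downloaded raw events file. Loading all 1.1M rows at once needs on the order of 18 GB if the values are held as 64-bit floats, so reading in batches may be preferable.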

For more information, including an example Jupyter notebook illustrating how to read and process the events, see the official LHC Olympics 2020 webpage.

https://lhco2020.github.io/homepage/

UPDATE May 18 2020

We have uploaded a second signal dataset for R&D, consisting of 100k W'->XY with X,Y->qqq (i.e. 3-prong substructure). Everything else about this signal dataset (particle masses, trigger, Pythia configuration, detector simulation) is the same as the previous one described above. 

UPDATE November 23 2020

We now include high-level feature files for the background and 2-prong signal (events_anomalydetection_v2.features.h5) and for the 3-prong signal (events_anomalydetection_Z_XY_qqq.features.h5). To produce the features, we have clustered every event into R=1 jets using the anti-kT algorithm. The features (calculated using fastjet plugins) are the 3-momenta, invariant masses, and N-subjettiness variables tau1, tau2 and tau3 for the highest pT jet (j1) and the second highest pT jet (j2):

'pxj1', 'pyj1', 'pzj1', 'mj1', 'tau1j1', 'tau2j1', 'tau3j1', 'pxj2', 'pyj2', 'pzj2', 'mj2', 'tau1j2', 'tau2j2', 'tau3j2'
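The columns listed above are enough to build a few derived quantities. The sketch below computes the invariant mass of the two leading jets and a tau2/tau1 ratio, which is a commonly used substructure observable and not necessarily what the dataset authors use. The function names are hypothetical, and the input is assumed to be any mapping from column name to array, such as a pandas DataFrame or a dict of NumPy arrays.

```python
import numpy as np


def jet_energy(px, py, pz, m):
    """Jet energy from its 3-momentum and invariant mass."""
    return np.sqrt(px**2 + py**2 + pz**2 + m**2)


def dijet_mass(features):
    """Invariant mass of the (j1, j2) system from the feature columns."""
    col = lambda name: np.asarray(features[name], dtype=float)
    e = (jet_energy(col("pxj1"), col("pyj1"), col("pzj1"), col("mj1"))
         + jet_energy(col("pxj2"), col("pyj2"), col("pzj2"), col("mj2")))
    px = col("pxj1") + col("pxj2")
    py = col("pyj1") + col("pyj2")
    pz = col("pzj1") + col("pzj2")
    m2 = e**2 - px**2 - py**2 - pz**2
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(m2, 0.0, None))


def tau_ratio(numerator, denominator):
    """Elementwise ratio such as tau2/tau1; returns 0 where the denominator is 0."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den > 0)
    return out
```

If the momenta and masses are stored in GeV (an assumption, since the units are not stated here), the dijet mass of signal events would be expected to cluster near the 3.5 TeV W' mass.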

The rows (events) in each feature file should be ordered exactly the same as in their corresponding raw event file. For convenience, we have also included the label (1 for signal and 0 for background) as an additional column in the first feature file (events_anomalydetection_v2.features.h5).
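Because the first feature file carries the label and the raw file carries the truth bit, one cheap sanity check on the row ordering is to compare the two. The sketch below is a hypothetical helper that takes both label sequences as plain arrays. Agreement is a necessary condition for matching row order, not a proof of it.

```python
import numpy as np


def label_mismatches(raw_truth_bits, feature_labels):
    """Return the row indices where the raw truth bit and feature label differ."""
    raw = np.asarray(raw_truth_bits).astype(int)
    feat = np.asarray(feature_labels).astype(int)
    if raw.shape != feat.shape:
        raise ValueError(f"row counts differ: {raw.shape} vs {feat.shape}")
    # An empty result means the labels agree row by row.
    return np.flatnonzero(raw != feat)
```

An empty result from label_mismatches is consistent with both files sharing the same event order; a non-empty result lists the rows worth inspecting.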

UPDATE February 11 2021

We have included the Delphes detector card and the Pythia8 command files used to produce the R&D datasets.

UPDATE April 17 2022

It was brought to our attention that the raw events file events_anomalydetection.h5 had never been updated to v2, which uses a lower generator-level pT threshold (PhaseSpace:pTHatMin = 500) for QCD events to minimize artificial trigger sculpting. Both the features file (events_anomalydetection_v2.features.h5) and the Pythia cmnd file (pythia_RnD_qcd.cmnd) correspond to v2. The raw events file has now been brought up to date as well.

Files (3.2 GB)

  • md5:cb11b729ec10c04ae5250d057fd088b2 (22.4 kB)
  • md5:271cf5e71fc756b2a8d2b32730689bdb (74.3 MB)
  • md5:629789d55813be3860781b084ae7f1de (2.9 GB)
  • md5:1e729f7dff225451182c28afaa4bb411 (5.2 MB)
  • md5:54e123a86143b668f9cb76905152a124 (235.5 MB)
  • md5:19555e76f8a787184ec43fd5ff295465 (1.9 kB)
  • md5:1e9b731c2bf90f4ba549b85996cd7424 (2.0 kB)
  • md5:21472daafd7d54cd10e7548869a41d03 (2.0 kB)
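The MD5 checksums listed above can be used to confirm that a download completed intact. The sketch below uses Python's standard hashlib module and reads the file in chunks so that the multi-gigabyte files need not fit in memory. The function names and the local path passed in are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a listed value such as 'md5:cb11b7...'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listing(local_path, "md5:629789d55813be3860781b084ae7f1de") would be expected to return True for an intact copy of the 2.9 GB file, where local_path is wherever that file was saved.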