Published August 28, 2025 | Version v1
Dataset Open

Pythia8 and Herwig7 Boosted Top & QCD Jet datasets

  • 1. Massachusetts Institute of Technology
  • 2. Brown University

Description

A dataset of labeled top and QCD jets, generated using both Pythia8 and Herwig. 

There are 20 files: 10 generated using Pythia and 10 generated using Herwig (with the prefix `HERWIG`). Each file contains 100k top jets and 100k QCD jets, for a total of 2M events for Pythia and 2M events for Herwig (4M total). Each file contains two arrays:

  • X: (200000, M, 4), a set of 100k top jets and 100k QCD jets, where M is the maximum particle multiplicity among the jets in that file (jets with fewer particles are padded with zero-particles), and the features of each particle are its pT, rapidity, azimuthal angle, and PDG ID.
  • y: (200000,), an array of jet labels, where QCD is 0 and top is 1.
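Because jets are zero-padded up to M particles, most analyses need a mask that separates real particles from padding. The following is a minimal sketch on a tiny synthetic array, not on the real files. It assumes that pT is stored in feature column 0 (following the order listed above) and that padded entries have pT exactly equal to zero; the helper names `particle_mask` and `multiplicities` are illustrative and not part of any package.

```python
import numpy as np

def particle_mask(X):
    """Return a boolean array of shape (n_jets, M): True for real particles."""
    # Assumption: feature column 0 is pT and zero-padded particles have pT == 0.
    return X[:, :, 0] > 0

def multiplicities(X):
    """Number of real (non-padded) particles in each jet."""
    return particle_mask(X).sum(axis=1)

# Tiny synthetic stand-in with the same layout as X and y: 3 jets, M = 4.
X = np.zeros((3, 4, 4))
X[0, :2] = [[100.0, 0.1, 0.2, 211], [50.0, -0.1, 0.3, 22]]  # 2 real particles
X[1, :, 0] = 25.0                                           # 4 real particles
X[2, 0, 0] = 300.0                                          # 1 real particle
y = np.array([0, 1, 1])

mult = multiplicities(X)

# Count jets per class (QCD = 0, top = 1); cast in case labels are stored as floats.
n_qcd, n_top = np.bincount(y.astype(int), minlength=2)
```

On a real file, the same counts give a quick check that each file holds 100k jets of each class.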

The Pythia samples are generated using Pythia 8.331. The top events are generated using the processes `Top:gg2ttbar` and `Top:qqbar2ttbar`, and the W's are forced to decay hadronically. The QCD events are generated using `HardQCD:all`. 

The Herwig samples are generated using Herwig 7.3.0. The top events are generated using `MEHeavyQuark`, and leptonic decays of the W's are discarded. The QCD events are generated using `MEQCD2to2`.

For both datasets, jets are clustered with FastJet 3.3.0 using the anti-kt algorithm with R = 0.8. For top jets, a hard top parton is required to exist within the jet cone. We select jets with a pT between 500 and 550 GeV and a pseudorapidity less than 2.5. If multiple jets in an event meet these criteria, one jet is chosen at random.
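The pT selection above can be checked from the stored constituents by summing their transverse momentum vectors. The following is a minimal sketch on synthetic jets, under several assumptions that are not stated in the description: pT is in column 0 in GeV, the azimuthal angle is in column 2 in radians, and the jet pT is the magnitude of the vector sum of its constituents. The helper names `jet_pt` and `in_pt_window` are illustrative.

```python
import numpy as np

def jet_pt(X):
    """Jet pT as the magnitude of the vector sum of constituent (px, py)."""
    # Assumption: column 0 is pT (GeV) and column 2 is the azimuthal angle (radians).
    pt, phi = X[:, :, 0], X[:, :, 2]
    px = (pt * np.cos(phi)).sum(axis=1)  # zero-padded particles contribute 0
    py = (pt * np.sin(phi)).sum(axis=1)
    return np.hypot(px, py)

def in_pt_window(X, lo=500.0, hi=550.0):
    """True for jets whose reconstructed pT lies in [lo, hi] GeV."""
    jpt = jet_pt(X)
    return (jpt >= lo) & (jpt <= hi)

# Two synthetic jets, M = 3, features (pT, rapidity, phi, PDG ID).
X = np.zeros((2, 3, 4))
X[0, :2] = [[300.0, 0.0, 0.5, 211], [220.0, 0.1, 0.5, 22]]         # collinear particles
X[1, :2] = [[300.0, 0.0, 0.0, 211], [300.0, 0.0, np.pi / 2, 211]]  # perpendicular particles

pts = jet_pt(X)
selected = in_pt_window(X)
```

If the constituents in the files were preprocessed, for example rescaled, the reconstructed values need not fall in the 500 to 550 GeV window, so treat this as a sanity check rather than a guarantee.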

Usage

This dataset can be downloaded automatically using the ParticleLoader Python package. The package downloads the files to a specified cache directory and loads them from the cache if they already exist.

from particleloader import load

# Change this to a working directory on your machine!
# (named cache_path rather than dir to avoid shadowing the Python builtin)
cache_path = "~/.ParticleLoader"
N = 100000

X_pythia, y_pythia = load("topqcd_jets", N, cache_dir=cache_path)
X_herwig, y_herwig = load("topqcd_jets", N, cache_dir=cache_path, generator="herwig")

WARNING: A similar dataset exists for quark/gluon tagging. However, as these events were generated using different versions of Pythia and Herwig, these datasets should not be mixed.

Files

Files (12.8 GB)

Checksum Size
md5:8305392fb9d9439892f2c6889a915f7c 640.8 MB
md5:b40f00a218a10f05381cacf5c18b5f32 640.8 MB
md5:f5a48cfbe662cd4acd8c88aa79e9716a 640.8 MB
md5:9874e1d71835e3398697f5c85ec87d53 640.8 MB
md5:56cf814a6eef31a5e5849d3ed50bd704 640.8 MB
md5:900e552c871a08c496be6ae942e86e33 640.8 MB
md5:9cf33a722f10f3c8aa534d46b9f62b7e 640.8 MB
md5:2790665b1f6446e2ac1e478d144f0f94 640.8 MB
md5:ac2d4776ea69f19a3a20af829b58dfdf 640.8 MB
md5:26e8b3360f425b5f113e656bf1d2f128 640.8 MB
md5:d0b38093b83dfecde6ae5fcd684432fc 640.8 MB
md5:9bf9a008074286ee6ba255af766c6c2c 640.8 MB
md5:b20fc1cf560d868e4dc82f069c9d6fe6 640.8 MB
md5:3c1fcabe3eeb683cf62a61c293d00209 640.8 MB
md5:2721a2955d2c30bdf366b227c4065e54 640.8 MB
md5:df1b6e4a018648dcc99c5a8d172b8790 640.8 MB
md5:df3496b3486995a72ca0b54df43d338e 640.8 MB
md5:90f2c40989ef6b055cffa5fc1bf439f3 640.8 MB
md5:1813136c6a50bf55c5abd4cffe1491d1 640.8 MB
md5:4f036e3576aa0f60dd0d3e88982bb6f2 640.8 MB
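
The MD5 checksums listed for the files can be used to confirm that a manual download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the demonstration runs on a small temporary file rather than on a dataset file, since the file names are not listed here.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex

# Demonstration on a throwaway file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches_checksum(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
```

For a downloaded dataset file, pass its local path and the matching `md5:` string from the list; reading in chunks keeps memory use low for files of this size.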