Published April 17, 2026 | Version v1
Dataset Open

The FlavorClass dataset: Transforming flavor tagging at Belle II

  • 1. University of Göttingen
  • 2. Deutsches Elektronen-Synchrotron DESY

Description

The Belle II experiment in Tsukuba, Japan, creates entangled pairs of neutral $B$ mesons in electron-positron collisions. If one $B$ meson is fully reconstructed, the remaining tracks and clusters originate from the other, tag-side $B$ meson. A Flavor Tagger is a multivariate classifier that inspects these remaining tracks and clusters and classifies them as originating from either a $B^0$ ($q=1$) or a $\bar{B}^0$ ($q=-1$) meson.

The published dataset allows the training of novel Flavor Taggers based on a realistic simulation of the Belle II detector. For example, the software needed to train and test a transformer-based Flavor Tagger (TFlaT) can be found here: https://github.com/BenjaminSchwenker/tflat.

Methods

The dataset consists of 12 million events simulated with the Belle II software framework (basf2). The events originate from the MC16rd simulation campaign and reflect the data-taking conditions during Run 1, from 2019 until the first long shutdown of Belle II (2022). In each event, one $B$ meson decays into a pair of invisible neutrinos, while the other (tag-side) $B$ meson can decay via any physically allowed decay mode.

The dataset is split into three files in parquet format, one each for training, validation and testing. The training sample consists of 10 million events, and the validation and testing samples consist of 1 million events each. Example code for reading the dataset can be found here: https://github.com/BenjaminSchwenker/tflat.
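Since the largest file in this record is 9.6 GB, it can help to inspect a file's metadata before loading it into memory. The following is a minimal sketch, not the authors' reader (the repository linked above provides that). It assumes that the third-party pyarrow package is installed and that each row of a file corresponds to one event. The helper names `summarize_parquet` and `split_fractions` are hypothetical, and the path passed to `summarize_parquet` is whatever local name the downloaded file has.

```python
def summarize_parquet(path):
    """Return (row count, column names) read from the parquet footer only."""
    # Assumption: pyarrow is installed. It is imported lazily so that the
    # rest of this sketch can be used without it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    return parquet_file.metadata.num_rows, parquet_file.schema_arrow.names


def split_fractions(row_counts):
    """Map each split name to its share of the total number of rows."""
    total = sum(row_counts.values())
    if total == 0:
        raise ValueError("no rows in any split")
    return {name: count / total for name, count in row_counts.items()}


# Nominal sizes quoted in the description: 10M / 1M / 1M events.
nominal = {"training": 10_000_000, "validation": 1_000_000, "testing": 1_000_000}
fractions = split_fractions(nominal)
```

Calling `summarize_parquet` on each downloaded file and passing the resulting row counts to `split_fractions` gives a quick check against the nominal split quoted above, without reading any event data.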

The input of the algorithm is based on features from charged tracks, ECL clusters and global rest-of-event properties. The charged tracks are sorted by momentum. If an event has fewer than 10 charged tracks, padding is applied. For each track, a total of 27 features are computed. The ECL clusters are likewise sorted by momentum. If an event has fewer than 20 clusters, padding is applied. For each cluster, a total of 6 features are computed.
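The sorting and padding described above can be illustrated with a short sketch. The published files already contain the padded inputs, so this only shows the idea. Several details are assumptions not stated in the description: the sort is taken to be descending (highest momentum first), events with more objects than the limit are truncated, the padding value is 0.0, and the momentum is placed in feature column 0. The function name `pad_objects` and the returned mask are hypothetical additions.

```python
def pad_objects(objects, max_objects, n_features, sort_feature=0, pad_value=0.0):
    """Sort feature rows and pad or truncate them to a fixed number of rows.

    objects: list of rows, each a list of n_features floats.
    Returns (padded, mask): padded has exactly max_objects rows, and mask[i]
    is True for a real object and False for a padding row.
    """
    for row in objects:
        if len(row) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(row)}")

    # Assumption: descending order, i.e. highest momentum first.
    ordered = sorted(objects, key=lambda row: row[sort_feature], reverse=True)
    # Assumption: objects beyond the limit are dropped.
    kept = ordered[:max_objects]

    n_pad = max_objects - len(kept)
    padded = [list(row) for row in kept]
    padded += [[pad_value] * n_features for _ in range(n_pad)]
    mask = [True] * len(kept) + [False] * n_pad
    return padded, mask


# Example: an event with 3 charged tracks, 27 features each.
# Column 0 holds a made-up momentum; the other features are left at zero.
tracks = [[0.4] + [0.0] * 26, [1.2] + [0.0] * 26, [0.7] + [0.0] * 26]
padded_tracks, track_mask = pad_objects(tracks, max_objects=10, n_features=27)
```

The same function would apply to ECL clusters with `max_objects=20` and `n_features=6`. A mask of this kind is a common way to let a model ignore padding rows, but whether the dataset or TFlaT uses one is not stated here.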

The meaning of the variables can be inferred from the column names in the parquet files. Please use the basf2 documentation at https://software.belle2.org/development/sphinx/analysis/doc/Variables.html to look up the meaning of each variable.

Files

Files (11.6 GB)

Checksum                                Size
md5:a3f5861c37a5192e61d60264a88346bf    982.1 MB
md5:4664d637946cdc7d39292d22f82db405    9.6 GB
md5:db8bdd1ec40b8d17256e581c053dd57d    982.6 MB
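
After downloading, each file can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and because the file names are not listed here, the check only tests whether a file's digest matches any of the three listed values.

```python
import hashlib

# Checksums copied from the file list of this record.
LISTED_MD5 = {
    "a3f5861c37a5192e61d60264a88346bf",
    "4664d637946cdc7d39292d22f82db405",
    "db8bdd1ec40b8d17256e581c053dd57d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against one checksum, with or without an 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


def is_listed(path):
    """Return True if the file's digest equals any checksum in the record."""
    return md5_of_file(path) in LISTED_MD5
```

A file for which `is_listed` returns False was most likely truncated or corrupted during download and should be fetched again.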