Published July 11, 2025 | Version v1
Dataset | Open Access

A Benchmark Dataset for Multilingual Tokenization Energy and Efficiency Across 23 Models and 325 Languages

  • Universidad de Deusto

Description

This repository contains a benchmark dataset and processing scripts for analyzing the energy consumption and processing efficiency of 23 Hugging Face tokenizers applied to 8,212 standardized text chunks across 325 languages.

The dataset includes raw and normalized energy measurements, tokenization times, token counts, and rich structural metadata for each chunk, including script composition, entropy, compression ratios, and character-level features. All experiments were conducted in a controlled environment using PyRAPL on a Linux workstation, with baseline CPU consumption subtracted via linear interpolation.
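
A minimal sketch of one such measurement (not the released script; the function and variable names below are placeholders) wraps a single pyRAPL measurement around one tokenization call:

    import pyRAPL

    pyRAPL.setup()  # needs RAPL counters on Linux, typically with elevated privileges

    def measure_one(tokenizer, chunk_text):
        meter = pyRAPL.Measurement("tokenize")
        meter.begin()
        token_ids = tokenizer.encode(chunk_text)
        meter.end()
        res = meter.result
        return {
            "energy_uj": sum(res.pkg),   # CPU package energy, microjoules
            "time_us": res.duration,     # elapsed time, microseconds
            "n_tokens": len(token_ids),
        }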

The resulting data enables fine-grained comparison of tokenizers across linguistic and script diversity. It also supports downstream tasks such as energy-aware NLP, script-sensitive modeling, and benchmarking of future tokenization methods.

Accompanying R scripts are provided to reproduce data processing, regression models, and clustering analyses. Visualization outputs and cluster-level summaries are also included.

All contents are organized into clearly structured folders to support easy access, interpretability, and reuse by the research community.

 

01_processing_scripts/

Python and R scripts to run the tokenization experiment, transform the raw logs, subtract baseline energy, and produce clean metrics and analyses.

  • multimodel_tokenization_energy.py
    ⤷ Python script used to tokenize all chunks with the 23 models while logging energy and time (a minimal sketch follows this list).
  • adapting_original_dataset.R
    ⤷ Reads raw logs and metadata, computes net energy, and outputs cleaned files.

  • energy_patterns.R
    ⤷ Performs clustering, regression, t-SNE, and generates all visualizations.
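
As a rough illustration of what that logging loop does (the model list, output path, and column names below are assumptions, not the actual script), pyRAPL can be combined with Hugging Face AutoTokenizer like this:

    import csv

    import pyRAPL
    from transformers import AutoTokenizer

    pyRAPL.setup()

    MODELS = ["bert-base-multilingual-cased", "xlm-roberta-base"]  # 23 models in the real run
    CHUNKS = ["example text chunk", "another example chunk"]       # 8,212 chunks in the real run
    N_REPS = 225                                                   # repetitions per chunk

    with open("all_models_tokenization.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "chunk_id", "rep", "energy_uj", "time_us", "n_tokens"])
        for model_name in MODELS:
            tok = AutoTokenizer.from_pretrained(model_name)
            for chunk_id, text in enumerate(CHUNKS):
                for rep in range(N_REPS):
                    meter = pyRAPL.Measurement(model_name)
                    meter.begin()
                    ids = tok.encode(text)
                    meter.end()
                    writer.writerow([model_name, chunk_id, rep,
                                     sum(meter.result.pkg), meter.result.duration, len(ids)])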

 

02_raw_data/

Raw output from the tokenization experiment and baseline profiler.

  • all_models_tokenization.csv
    ⤷ Full log of 42M+ tokenization runs (23 models × 225 reps × 8,212 chunks).

  • baseline.csv
    ⤷ Background CPU energy samples, one per 50 chunks. Used for normalization.
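
The released baseline handling lives in adapting_original_dataset.R; as a Python illustration of the idea (all numbers and names below are invented for the example), the baseline samples taken every 50 chunks are linearly interpolated to each chunk position and subtracted from the measured energy:

    import numpy as np

    # Chunk indices at which background CPU energy was sampled, and the samples (µJ).
    baseline_positions = np.array([0, 50, 100, 150])
    baseline_energy_uj = np.array([1200.0, 1180.0, 1250.0, 1210.0])

    def net_energy(chunk_index, measured_energy_uj):
        """Subtract the interpolated idle-CPU baseline from one measured value."""
        baseline = np.interp(chunk_index, baseline_positions, baseline_energy_uj)
        return measured_energy_uj - baseline

    print(net_energy(75, 4800.0))  # baseline at index 75 is 1215 µJ, so the net value is 3585 µJ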

 

03_clean_data/

Cleaned, enriched, and reshaped datasets ready for analysis.

  • net_energy.csv
    ⤷ Raw tokenization results after baseline energy subtraction (per run).

  • tokenization_long.csv
    ⤷ One row per chunk × tokenizer, with medians + token counts.

  • tokenization_wide.csv
    ⤷ Wide-format matrix: one row per chunk, one column per tokenizer × metric.

  • complete.csv
    ⤷ Fully enriched dataset joining all metrics, metadata, and script distributions.

  • metadata.csv
    ⤷ Structural features and script-based character stats per chunk.
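
For readers working in Python rather than R, the relationship between the long and wide tables can be sketched as a pivot; the column names used here (chunk_id, tokenizer, energy_uj, n_tokens) are assumptions and may differ from the actual headers:

    import pandas as pd

    long_df = pd.read_csv("03_clean_data/tokenization_long.csv")

    # One row per chunk, one column per tokenizer x metric.
    wide_df = long_df.pivot_table(index="chunk_id",
                                  columns="tokenizer",
                                  values=["energy_uj", "n_tokens"])
    wide_df.columns = [f"{tokenizer}_{metric}" for metric, tokenizer in wide_df.columns]
    wide_df = wide_df.reset_index()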

 

04_cluster_outputs/

Outputs from clustering and dimensionality reduction over tokenizer energy profiles.

  • tokenizer_dendrogram.pdf
    ⤷ Hierarchical clustering of 23 tokenizers based on energy profiles.

  • tokenizer_tsne.pdf
    ⤷ t-SNE projection of tokenizers grouped by energy usage.

  • mean_energy_per_cluster.csv
    ⤷ Mean energy consumption (mJ) per language × tokenizer cluster.

  • sd_energy_per_cluster.csv
    ⤷ Standard deviation of energy consumption (mJ) per language × cluster.

  • grid.pdf
    ⤷ Heatmap of script-wise energy deltas (relative to Latin) for all tokenizers.
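
The released analysis is implemented in energy_patterns.R; purely as an illustration of the same kind of clustering and t-SNE steps, a Python sketch could look as follows (the random matrix stands in for the real per-tokenizer energy profiles):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    profiles = rng.gamma(shape=2.0, scale=1.0, size=(23, 325))  # placeholder: 23 tokenizers x 325 languages

    # Hierarchical clustering of the 23 tokenizers by their energy profiles.
    Z = linkage(profiles, method="ward")
    dendrogram(Z)
    plt.savefig("tokenizer_dendrogram_sketch.pdf")

    # 2-D t-SNE projection; perplexity must stay below the number of tokenizers.
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(profiles)
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1])
    plt.savefig("tokenizer_tsne_sketch.pdf")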

Files

  • tokenization_energy_benchmark.zip (3.8 GB)
    ⤷ md5:de62bcadc11d2ec1f91bbae3d09922ab
Additional details

Related works

  • Is supplemented by: Dataset 10.5281/zenodo.15696122 (DOI)