Published December 9, 2025 | Version v1
Dataset Open

TPC-DS Benchmark Dataset Generated for DPJoin Optimization Experiments

  • 1. Izmir Institute of Technology

Description

This dataset contains the TPC-DS benchmark data generated for evaluating DPJoin, a cost-based, replication-aware query optimizer for distributed analytical workloads. The dataset includes three scale factors: SF1 (~0.4 GB), SF10 (~4 GB), and SF50 (~16 GB), each generated using the official TPC-DS data generation toolkit (https://www.tpc.org/tpcds/) without modification.

These datasets were used to run all performance experiments in the DPJoin study, including comparisons with RelJoin, AQE, ShuffleHashJoin, and other distributed join strategies. The provided files are the exact data used in the paper's experiments, which supports transparency, independent verification, and reuse.

All scale-factor directories preserve the original TPC-DS schema and table formats. The data is suitable for research on distributed query optimization, join algorithms, replication strategies, cost modeling, and large-scale analytics.
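As an illustration of working with the table files, the following is a minimal sketch that assumes the archives contain the data generator's default text output: one row per line, fields separated by `|`, a trailing `|` at the end of each row, and empty fields standing for NULL. The helper name `read_dat_rows` is hypothetical and not part of the dataset or the toolkit.

```python
import csv
import io


def read_dat_rows(lines):
    """Yield rows from pipe-delimited text as lists of strings (None for NULL).

    Assumption: every row ends with a trailing '|', so splitting on '|'
    produces one extra empty field that must be dropped.
    """
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if fields[-1] == "":
            fields = fields[:-1]  # drop the artifact of the trailing delimiter
        # Empty strings are treated as NULL values.
        yield [value if value != "" else None for value in fields]


# Small in-memory sample standing in for a table file.
sample = io.StringIO("1|AAAA||\n2|BBBB|x|\n")
rows = list(read_dat_rows(sample))
```

If the files in the archives use a different layout (for example, a columnar format), this parser does not apply and the reader should be swapped for one that matches.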

Files

tpcds_sf1.zip

Files (16.4 GB)

Checksum                                Size
md5:b4ffa1ab8b8fac36494d19a8d44f72cc    304.8 MB
md5:443a831ff1219a85363b4b89b01feeaf    3.0 GB
md5:d784d1d285968f79d6d63ef621674dc2    13.1 GB
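The MD5 values listed for the files can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the function names and the example path are hypothetical, and the pairing of a given checksum with a given archive should be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:<hex digest>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Hypothetical usage; substitute the real local path and the listed checksum:
# matches_checksum("path/to/downloaded_archive.zip", "md5:<hex digest>")
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again before unpacking.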