FastLloyd Clustering Datasets

Diaa, Abdulrahman; Humphries, Thomas; Kerschbaum, Florian

doi:10.5281/zenodo.15530593

Published May 27, 2025 | Version v1

Dataset Open

1. University of Waterloo

This artifact bundles the five dataset archives used in our private federated clustering evaluation. They correspond to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper:

real_datasets.tar.xz: ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7).
scale_datasets.tar.xz: the SynthNew family, generated with the R clusterGeneration package to assess scalability.
ablate_datasets.tar.xz: the AblateSynth sets, which vary cluster separation for the ablation analysis; also generated with clusterGeneration.
g2_datasets.tar.xz: the G2 sets, each with 2048 points in two Gaussian clusters, across dimensions 2–1024; collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7).
timing_datasets.tar.xz: the real s1 and lsun datasets alongside the TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.’s experimental framework.

1. real_datasets.tar.xz

Contains ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark:

iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset of petal/sepal measurements.
lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
birch2.txt: random subset of 25,000 of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

2. scale_datasets.tar.xz

Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

$k \in \{2,4,8,16,32\}$ is the number of clusters,
$d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
$s \in \{1,2,3\}$ are different random seeds.

These are generated with the R clusterGeneration package, with cluster sizes following a $1:2:\dots:k$ ratio. Each dataset also includes a random number (in $[0, 100]$) of randomly sampled outliers, and its cluster separation degree is drawn at random from $[0.16, 0.26]$, spanning partially overlapping to separated clusters.
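As a rough illustration of the parameter grid and the $1:2:\dots:k$ size ratio, the Python sketch below enumerates the file names implied by the naming scheme and splits a point budget proportionally. It is not the generation code (the datasets were produced in R). It assumes every $(k, d, s)$ combination is present in the archive. `ratio_cluster_sizes`, its rounding rule, and the point count used in the example are hypothetical.

```python
from itertools import product

K_VALUES = [2, 4, 8, 16, 32]
D_VALUES = [2, 4, 8, 16, 32, 64, 128, 256, 512]
SEEDS = [1, 2, 3]


def synthnew_filenames():
    """Return every SynthNew_{k}_{d}_{s}.txt name implied by the grid above."""
    return [
        f"SynthNew_{k}_{d}_{s}.txt"
        for k, d, s in product(K_VALUES, D_VALUES, SEEDS)
    ]


def ratio_cluster_sizes(n_points, k):
    """Split n_points into k clusters whose sizes follow 1:2:...:k.

    Hypothetical helper: any rounding leftover goes to the largest cluster.
    """
    total_weight = k * (k + 1) // 2
    sizes = [n_points * i // total_weight for i in range(1, k + 1)]
    sizes[-1] += n_points - sum(sizes)
    return sizes


if __name__ == "__main__":
    names = synthnew_filenames()
    print(len(names), names[0], names[-1])
    # 1000 is an arbitrary example; the record does not state the point counts.
    print(ratio_cluster_sizes(1000, 4))
```

With five values of $k$, nine of $d$, and three seeds, the grid implies 135 files if all combinations exist.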

3. ablate_datasets.tar.xz

Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

$k \in \{2,4,8,16\}$ clusters,
$d \in \{2,4,8,16\}$ dimensions,
$sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.

These sets are also generated with the R clusterGeneration package.

4. g2_datasets.tar.xz

Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the Clustering basic benchmark:

$N=2048$ samples, $k=2$ Gaussian clusters,
Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$
Cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$
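Because the dimension and overlap are encoded in each file name, they can be recovered without opening the file. The sketch below is one way to do that in Python. It assumes the names match g2-{dim}-{var}.txt exactly, with plain decimal integers. `parse_g2_name` is a hypothetical helper, not part of the FastLloyd code.

```python
import re

# Matches names such as g2-16-40.txt and captures dim and var as integers.
G2_PATTERN = re.compile(r"^g2-(\d+)-(\d+)\.txt$")


def parse_g2_name(filename):
    """Return (dim, var) parsed from a g2-{dim}-{var}.txt file name."""
    match = G2_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a G2 file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    dim, var = parse_g2_name("g2-16-40.txt")
    print(dim, var)
```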

5. timing_datasets.tar.xz

Includes:

s1.txt, lsun.txt: two real datasets for baseline timing.
timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, where $N$ is the number of samples (the {n} field of the file name), varying:
- $k \in \{2,5\}$
- $d \in \{2,5\}$
- $N \in \{10000, 100000\}$

Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.
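To make the balanced-size arithmetic concrete, the sketch below lists the eight parameter combinations with the per-cluster size $N/k$. It assumes the {n} field in the file name is written as a plain integer and that all combinations are present. `timesynth_grid` is a hypothetical helper.

```python
from itertools import product


def timesynth_grid():
    """List each (k, d, N) combination with its balanced cluster size N / k."""
    rows = []
    for k, d, n in product([2, 5], [2, 5], [10_000, 100_000]):
        rows.append(
            {
                "file": f"timesynth_{k}_{d}_{n}.txt",  # assumed spelling of {n}
                "k": k,
                "d": d,
                "n": n,
                "c_avg": n // k,  # exact: both values of N are divisible by 2 and 5
            }
        )
    return rows


if __name__ == "__main__":
    for row in timesynth_grid():
        print(row["file"], row["c_avg"])
```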

Usage:

Unpack any archive with tar -xJf <archive>.tar.xz to access the .txt files directly for replication of clustering experiments. Each file contains one data point per line, with features separated by spaces.
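The same steps can be done from Python. The sketch below uses only the standard library. It assumes the archive unpacks to plain .txt files in the format described above. The record does not say whether any column holds labels, so every column is read as a numeric value. `extract_archive` and `load_points` are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Unpack a .tar.xz archive into out_dir (same effect as tar -xJf)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        archive.extractall(out_dir)
    return out_dir


def load_points(txt_path):
    """Read one point per line, with values separated by whitespace."""
    points = []
    with open(txt_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                points.append([float(value) for value in line.split()])
    return points
```

If NumPy is available, `numpy.loadtxt(path)` reads the same whitespace-separated layout directly into an array.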

Files (2.3 GB)

Name	MD5	Size
ablate_datasets.tar.xz	19490b582f982075537806b0c5fc62ed	5.6 MB
g2_datasets.tar.xz	58f299a7d7f1488aad4e1c69dcf7b330	49.9 MB
real_datasets.tar.xz	c65c09450d898d2adfd4b143b14a3fe6	2.1 MB
scale_datasets.tar.xz	90399dd8f052604dcf127baff57e0600	2.2 GB
timing_datasets.tar.xz	39cdd1a09e0ba9358b36a75a73849449	11.8 MB
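A downloaded archive can be checked against the MD5 values listed above before unpacking. The sketch below is one way to do this with Python's hashlib. The checksums are copied from the file listing. `md5_of_file` and `verify_archive` are hypothetical helper names, and the archive is assumed to keep its original file name.

```python
import hashlib
from pathlib import Path

# MD5 checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "ablate_datasets.tar.xz": "19490b582f982075537806b0c5fc62ed",
    "g2_datasets.tar.xz": "58f299a7d7f1488aad4e1c69dcf7b330",
    "real_datasets.tar.xz": "c65c09450d898d2adfd4b143b14a3fe6",
    "scale_datasets.tar.xz": "90399dd8f052604dcf127baff57e0600",
    "timing_datasets.tar.xz": "39cdd1a09e0ba9358b36a75a73849449",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 2.2 GB archive need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the published value for its name."""
    return md5_of_file(path) == EXPECTED_MD5[Path(path).name]
```

MD5 is used here only because it is the checksum the record publishes. It detects corrupted downloads but is not a defense against deliberate tampering.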

Additional details

Is supplement to: Software: 10.5281/zenodo.15530617 (DOI); Software: https://github.com/D-Diaa/FastLloyd/tree/v0.0.1 (URL)

Repository URL: https://github.com/D-Diaa/FastLloyd
Programming language: Python
