Published May 27, 2025 | Version v1
Dataset Open

FastLloyd Clustering Datasets

  • 1. University of Waterloo

Description

This artifact bundles the five dataset archives used in our private federated clustering evaluation, covering the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). scale_datasets.tar.xz contains the SynthNew family, generated with the R clusterGeneration package to assess scalability. ablate_datasets.tar.xz holds the AblateSynth sets, also generated with clusterGeneration, which vary cluster separation for ablation analysis. g2_datasets.tar.xz packages the G2 sets (two Gaussian clusters per set, 2048 samples, dimensions 2–1024), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). timing_datasets.tar.xz includes the real s1 and lsun datasets alongside the TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.’s experimental framework.

Contents

1. real_datasets.tar.xz

Contains ten real-world benchmark datasets, each formatted as one sample per line with space-separated features:

  • iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.

  • lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.

  • s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.

  • house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.

  • adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.

  • wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.

  • breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.

  • yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.

  • mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.

  • birch2.txt: 25,000 samples (a random subset of the original 100,000), 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

2. scale_datasets.tar.xz

Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

  • $k \in \{2,4,8,16,32\}$ is the number of clusters,

  • $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,

  • $s \in \{1,2,3\}$ is the random seed.
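The parameter grid above implies 5 × 9 × 3 = 135 files. As a minimal sketch, the expected file names can be enumerated as follows. The helper name `synthnew_filenames` is hypothetical, and the code assumes that $k$, $d$, and $s$ appear in the file names as plain integers.

```python
from itertools import product

# Parameter grid as listed above.
K_VALUES = [2, 4, 8, 16, 32]
D_VALUES = [2, 4, 8, 16, 32, 64, 128, 256, 512]
SEEDS = [1, 2, 3]


def synthnew_filenames():
    """Return the expected SynthNew file names (hypothetical helper).

    Assumes k, d, and s are written as plain integers in the name.
    """
    return [
        f"SynthNew_{k}_{d}_{s}.txt"
        for k, d, s in product(K_VALUES, D_VALUES, SEEDS)
    ]


if __name__ == "__main__":
    names = synthnew_filenames()
    print(len(names))  # 135
    print(names[0])    # SynthNew_2_2_1.txt
```

Comparing this list against the extracted directory is a quick way to spot a missing or partially unpacked file.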

These are generated with the R clusterGeneration package, with cluster sizes following a $1:2:\dots:k$ ratio. We add a random number (in $[0, 100]$) of randomly sampled outliers and draw the cluster separation degree randomly from $[0.16, 0.26]$, a range that covers partially overlapping to separated clusters.
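To make the size ratio concrete: with $k$ clusters, cluster $i$ receives a share of $i / (k(k+1)/2)$ of the points. The sketch below illustrates that arithmetic only. The helper `ratio_cluster_sizes`, the total point count, and the way leftover points are assigned are all assumptions for illustration; they are not taken from the dataset description or from clusterGeneration.

```python
def ratio_cluster_sizes(n_total, k):
    """Split n_total points into k clusters with sizes proportional to 1:2:...:k.

    Hypothetical helper: n_total and the remainder handling are assumptions,
    not the procedure used to generate the SynthNew files.
    """
    weight_sum = k * (k + 1) // 2  # 1 + 2 + ... + k
    sizes = [n_total * i // weight_sum for i in range(1, k + 1)]
    # Integer division may leave a few points unassigned; give them to the
    # largest cluster so the sizes still add up to n_total.
    sizes[-1] += n_total - sum(sizes)
    return sizes


if __name__ == "__main__":
    print(ratio_cluster_sizes(1000, 4))  # [100, 200, 300, 400]
```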

3. ablate_datasets.tar.xz

Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

  • $k \in \{2,4,8,16\}$ clusters,

  • $d \in \{2,4,8,16\}$ dimensions,

  • $sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.

Also generated via clusterGeneration.
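When iterating over the extracted files, it can help to recover $(k, d, sep)$ from each file name. The following is a minimal sketch; `parse_ablate_name` is a hypothetical helper, and it assumes the separation value is written as a decimal such as 0.25, which the description does not state explicitly.

```python
import re

# Pattern for AblateSynth_{k}_{d}_{sep}.txt; assumes sep is written as a
# decimal such as 0.25 (the exact formatting is an assumption).
_ABLATE_RE = re.compile(r"^AblateSynth_(\d+)_(\d+)_(\d*\.?\d+)\.txt$")


def parse_ablate_name(filename):
    """Return (k, d, sep) parsed from an AblateSynth file name, or None."""
    match = _ABLATE_RE.match(filename)
    if match is None:
        return None
    k, d, sep = match.groups()
    return int(k), int(d), float(sep)


if __name__ == "__main__":
    print(parse_ablate_name("AblateSynth_4_8_0.5.txt"))  # (4, 8, 0.5)
```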

4. g2_datasets.tar.xz

Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:

  • $N=2048$ samples, $k=2$ Gaussian clusters,

  • Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$

  • Cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$
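Since both the sample count and the dimensionality are fixed by the file name, a loaded G2 file can be sanity-checked against them. A minimal sketch, assuming whitespace-separated values with no header line and no label column; `check_g2_file` is a hypothetical helper.

```python
import re

G2_SAMPLES = 2048  # N as stated above


def check_g2_file(path):
    """Check that a g2-{dim}-{var}.txt file has 2048 rows of dim values each.

    Hypothetical helper; assumes whitespace-separated values and no header
    or label column.
    """
    match = re.search(r"g2-(\d+)-(\d+)\.txt$", str(path))
    if match is None:
        raise ValueError(f"not a G2 file name: {path}")
    dim = int(match.group(1))
    with open(path) as handle:
        rows = [line.split() for line in handle if line.strip()]
    return len(rows) == G2_SAMPLES and all(len(row) == dim for row in rows)
```

A `False` result would point to a truncated file or to a layout that differs from the assumptions above.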

5. timing_datasets.tar.xz

Includes:

  • s1.txt, lsun.txt: two real datasets for baseline timing.

  • timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg}=N/k$, where $N$ is the number of samples ({n} in the file name), varying:

    • $k \in \{2,5\}$

    • $d \in \{2,5\}$

    • $N \in \{10000, 100000\}$

Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.
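As a worked example of the balanced cluster size $N/k$, the sketch below evaluates it for every combination of the listed $N$ and $k$ values. The function name `balanced_cluster_size` is hypothetical.

```python
from itertools import product

# Values as listed above.
K_VALUES = [2, 5]
N_VALUES = [10_000, 100_000]


def balanced_cluster_size(n_samples, k):
    """Average cluster size C_avg = N / k for balanced clusters."""
    return n_samples // k


if __name__ == "__main__":
    for n_samples, k in product(N_VALUES, K_VALUES):
        print(n_samples, k, balanced_cluster_size(n_samples, k))
    # 10000 2 5000
    # 10000 5 2000
    # 100000 2 50000
    # 100000 5 20000
```

Every listed $N$ is divisible by every listed $k$, so the balanced sizes are exact integers.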

Usage:

Unpack any archive with tar -xJf <archive>.tar.xz to access the .txt files directly for replication of clustering experiments. Each file contains one data point per line, with features separated by spaces.
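The same steps can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `extract_archive` and `load_points` are hypothetical, and the loader assumes every non-empty line holds only numeric, space-separated features.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Extract a .tar.xz archive into out_dir (equivalent to tar -xJf)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        archive.extractall(out_dir)
    return out_dir


def load_points(txt_path):
    """Read one data point per line, with space-separated float features."""
    points = []
    with open(txt_path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                points.append([float(value) for value in line.split()])
    return points
```

The directory layout inside each archive is not described here, so locating the .txt files after extraction (for example with `Path(out_dir).rglob("*.txt")`) is safer than hard-coding a path.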

Files

Files (2.3 GB)

Checksum and size of each of the five archives:

  • md5:19490b582f982075537806b0c5fc62ed, 5.6 MB

  • md5:58f299a7d7f1488aad4e1c69dcf7b330, 49.9 MB

  • md5:c65c09450d898d2adfd4b143b14a3fe6, 2.1 MB

  • md5:90399dd8f052604dcf127baff57e0600, 2.2 GB

  • md5:39cdd1a09e0ba9358b36a75a73849449, 11.8 MB
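A downloaded archive can be checked against the MD5 digests listed above. The following is a minimal sketch; the helper names are hypothetical, and because this listing does not show which digest belongs to which archive, the pairing has to be taken from the record's download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:1949...'."""
    # Drop the optional 'md5:' prefix and normalise case before comparing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks matters mainly for the 2.2 GB archive, which should not be loaded into memory in one piece.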

Additional details

Software

  • Repository URL: https://github.com/D-Diaa/FastLloyd

  • Programming language: Python