SIGMOD 2024 Programming Contest Datasets

Li, Guoliang; Deng, Dong

doi:10.5281/zenodo.13998879

Published April 12, 2024 | Version v1

Dataset Open

1. Tsinghua University
2. Rutgers, The State University of New Jersey

Our datasets, both the released sets and the evaluation set, are derived from the YFCC100M Dataset. Each dataset comprises vectors encoded from images using the CLIP model, which are then reduced to 100 dimensions using Principal Component Analysis (PCA). Additionally, a categorical attribute and a timestamp attribute are selected from the metadata of the images. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.
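The attribute preprocessing described above can be illustrated with a small sketch. The description does not say how the integer ids were assigned or which normalization was used, so both choices here (ids in order of first appearance, min-max scaling) are assumptions, and the function names discretize and normalize are hypothetical.

```python
def discretize(values):
    """Map each distinct categorical value to an integer id, starting from 0.

    Assumption: ids are assigned in order of first appearance.
    """
    ids = {}
    return [ids.setdefault(v, len(ids)) for v in values]


def normalize(timestamps):
    """Scale timestamps into [0, 1].

    Assumption: min-max scaling; the source only states the target range.
    """
    lo, hi = min(timestamps), max(timestamps)
    if hi == lo:
        # All timestamps are equal; map them all to 0.0.
        return [0.0 for _ in timestamps]
    return [(t - lo) / (hi - lo) for t in timestamps]
```

Under these assumptions, the first distinct category seen becomes C = 0, and the earliest and latest timestamps become T = 0 and T = 1 respectively.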

For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. We then randomly choose two data points from dataset D and use their categorical attribute (C), timestamp attribute (T), and vectors to determine the values of the query. Specifically:

Randomly sample two data points from D.
Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C.
Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r.
Use the vector of the first data point as the query vector.
If the query type does not involve v, l, or r, their values are set to -1.

We ensure that at least 100 data points in D satisfy the constraints of each query.
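The sampling procedure above can be sketched roughly as follows. This is an illustration, not the organizers' generator: the function name make_query, the in-memory layout of data as (C, T, vector) tuples, and the use of rejection sampling to enforce the 100-match guarantee are all assumptions.

```python
import random


def make_query(data, rng, min_matches=100):
    """Build one query as [query_type, v, l, r, *vector].

    data: list of (c, t, vec) tuples (hypothetical in-memory layout).
    Assumption: a candidate with fewer than min_matches qualifying points
    is discarded and a fresh one is drawn (rejection sampling).
    """
    while True:
        query_type = rng.randrange(4)           # one of 0, 1, 2, 3
        first, second = rng.sample(data, 2)     # two distinct data points
        use_c = query_type in (1, 3)
        use_t = query_type in (2, 3)
        v = first[0] if use_c else -1
        if use_t:
            l, r = sorted((first[1], second[1]))
        else:
            l = r = -1
        matches = sum(
            1
            for c, t, _ in data
            if (not use_c or c == v) and (not use_t or l <= t <= r)
        )
        if matches >= min_matches:
            return [query_type, v, l, r, *first[2]]
```

For example, make_query(data, random.Random(0)) returns a list of 104 values laid out in the same order as one record of the query set.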

Dataset Structure

Dataset D is in a binary format, beginning with a 4-byte integer num_vectors (uint32_t) indicating the number of vectors. This is followed by the data for each vector, stored consecutively, with each vector occupying (2 + vector_num_dimension) x sizeof(float32) = 102 x 4 = 408 bytes, for a total of num_vectors x 408 bytes. Of the 102 values stored for each vector, the first is the discretized categorical attribute C and the second is the normalized timestamp attribute T. The remaining 100 values are the vector itself.
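A minimal reader for this layout, using only the Python standard library, might look like the sketch below. The description does not state the byte order, so little-endian is assumed here; the names read_dataset and record are hypothetical. The reader also checks that the file size equals 4 + num_vectors x 102 x 4 bytes.

```python
import os
import struct
import sys
from array import array

DIM = 100                              # vector_num_dimension
VALUES_PER_RECORD = 2 + DIM            # C, T, then the 100-d vector
RECORD_BYTES = VALUES_PER_RECORD * 4   # sizeof(float32) == 4


def read_dataset(path):
    """Return (num_vectors, flat array of num_vectors * 102 float32 values)."""
    with open(path, "rb") as f:
        # Assumption: the uint32_t header and the floats are little-endian.
        (num_vectors,) = struct.unpack("<I", f.read(4))
        expected = 4 + num_vectors * RECORD_BYTES
        actual = os.path.getsize(path)
        if actual != expected:
            raise ValueError(f"expected {expected} bytes, found {actual}")
        flat = array("f")
        flat.frombytes(f.read(num_vectors * RECORD_BYTES))
    if sys.byteorder == "big":
        flat.byteswap()  # convert little-endian file data to native order
    return num_vectors, flat


def record(flat, i):
    """Return (C, T, vector) for the i-th data point."""
    base = i * VALUES_PER_RECORD
    c = int(flat[base])        # categorical attribute, stored as a float
    t = flat[base + 1]         # normalized timestamp
    vec = flat[base + 2 : base + VALUES_PER_RECORD]
    return c, t, vec
```

As a sanity check on the formula, a file holding exactly 1,000,000 vectors would be 4 + 1,000,000 x 408 = 408,000,004 bytes. The flat array keeps the whole file in memory, so the larger release needs roughly 4 GB of RAM with this approach.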

Query Set Structure

Query set Q is in a binary format, beginning with a 4-byte integer num_queries (uint32_t) indicating the number of queries. This is followed by the data for each query, stored consecutively, with each query occupying (4 + vector_num_dimension) x sizeof(float32) = 104 x 4 = 416 bytes, for a total of num_queries x 416 bytes.

The 104-dimensional representation for a query is organized as follows:

The first dimension denotes query_type (takes values from 0, 1, 2, 3).
The second dimension denotes the specific query value v for the categorical attribute (if not queried, takes -1).
The third dimension denotes the specific query value l for the timestamp attribute (if not queried, takes -1).
The fourth dimension denotes the specific query value r for the timestamp attribute (if not queried, takes -1).
The remaining 100 dimensions are the query vector.
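The query layout above can be parsed one record at a time, as in the following sketch. Little-endian byte order is again an assumption, and the Query type and read_queries function are hypothetical names introduced for illustration.

```python
import struct
from typing import NamedTuple, Tuple


class Query(NamedTuple):
    query_type: int             # 0, 1, 2, or 3
    v: int                      # categorical value, or -1 if not queried
    l: float                    # lower timestamp bound, or -1 if not queried
    r: float                    # upper timestamp bound, or -1 if not queried
    vector: Tuple[float, ...]   # the 100-d query vector


def read_queries(path):
    """Yield one Query per 104-float record in the query set file."""
    record = struct.Struct("<104f")  # assumption: little-endian float32
    with open(path, "rb") as f:
        (num_queries,) = struct.unpack("<I", f.read(4))
        for _ in range(num_queries):
            buf = f.read(record.size)
            if len(buf) != record.size:
                raise ValueError("query file is shorter than its header implies")
            vals = record.unpack(buf)
            yield Query(int(vals[0]), int(vals[1]), vals[2], vals[3], vals[4:])
```

Because read_queries is a generator, queries can be streamed without loading the whole file, for example with `for q in read_queries("contest-queries-release-1m.bin"): ...`.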

There are four types of queries, i.e., query_type takes the value 0, 1, 2, or 3. The four types correspond to:

If query_type=0: Vector-only query, i.e., the conventional approximate nearest neighbor (ANN) search query.
If query_type=1: Vector query with categorical attribute constraint, i.e., ANN search for data points satisfying C=v.
If query_type=2: Vector query with timestamp attribute constraint, i.e., ANN search for data points satisfying l≤T≤r.
If query_type=3: Vector query with both categorical and timestamp attribute constraints, i.e., ANN search for data points satisfying C=v and l≤T≤r.

The predicate for the categorical attribute is an equality predicate, i.e., C=v, and the predicate for the timestamp attribute is a range predicate, i.e., l≤T≤r.
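To make the four query types concrete, the sketch below applies the predicates and then runs an exact brute-force search over the qualifying points. It is a reference baseline for checking results, not an ANN index. Squared Euclidean distance and the free parameter k are assumptions, since this description names neither the distance metric nor the number of neighbors to return; the function names are hypothetical.

```python
import heapq


def satisfies(query_type, v, l, r, c, t):
    """Return True if a data point with attributes (c, t) passes the filter."""
    if query_type in (1, 3) and c != v:
        return False                      # equality predicate C = v
    if query_type in (2, 3) and not (l <= t <= r):
        return False                      # range predicate l <= T <= r
    return True                           # query_type 0 has no constraint


def filtered_knn(points, query, k):
    """Exact k nearest neighbors among the points that pass the filter.

    points: sequence of (c, t, vec); query: (query_type, v, l, r, qvec).
    Returns the indices of the matches, nearest first.
    Assumption: squared Euclidean distance.
    """
    query_type, v, l, r, qvec = query
    candidates = (
        (sum((a - b) ** 2 for a, b in zip(vec, qvec)), i)
        for i, (c, t, vec) in enumerate(points)
        if satisfies(query_type, v, l, r, c, t)
    )
    return [i for _, i in heapq.nsmallest(k, candidates)]
```

Filtering before ranking keeps the result consistent with the predicates for every query type; with query_type=0 the filter passes every point and the function reduces to plain nearest-neighbor search.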

Originally provided at https://dbgroup.cs.tsinghua.edu.cn/sigmod2024/task.shtml?content=datasets

Files

Files (6.2 GB)

Name	MD5	Size
contest-data-release-10m.bin	ef068abda86c77f83c3e740be889eb39	4.1 GB
contest-data-release-1m.bin	7fb72efaaaf9aabefb0faadb8d6812cb	408.0 MB
contest-queries-release-10m.bin	66ef1d97501e74a1d1a6afd686ca4a60	1.7 GB
contest-queries-release-1m.bin	37ffbab17c883dde72fbc4b0b71e68ab	4.2 MB
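The MD5 checksums listed with the files can be used to verify a download. The sketch below streams each file through hashlib so that the multi-gigabyte files need not fit in memory; the helper names md5_of and verify are hypothetical, and the files are assumed to be saved under their listed names.

```python
import hashlib

# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "contest-data-release-10m.bin": "ef068abda86c77f83c3e740be889eb39",
    "contest-data-release-1m.bin": "7fb72efaaaf9aabefb0faadb8d6812cb",
    "contest-queries-release-10m.bin": "66ef1d97501e74a1d1a6afd686ca4a60",
    "contest-queries-release-1m.bin": "37ffbab17c883dde72fbc4b0b71e68ab",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at path matches the checksum listed for name."""
    return md5_of(path) == EXPECTED_MD5[name]
```

For example, `verify("contest-data-release-1m.bin", "contest-data-release-1m.bin")` should return True for an intact download.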
