Published November 22, 2025 | Version 2.1
Dataset Open

Polygon Similarity Benchmark Dataset

  • 1. The University of Texas at San Antonio
  • 2. Missouri University of Science and Technology
  • 3. Marquette University

Description

Dataset Description: Polygon Similarity Benchmark Dataset

Overview

This dataset provides a curated collection of polygonal shapes derived from the SpatialHadoop GIS datasets Parks, Water Bodies, and Sports.
It is intended to support research in polygon representation learning, geometric similarity search, and spatial indexing.

The dataset includes raw polygonal geometries, pre-computed similarity ground truth, and supplementary documentation. For each dataset category, 80% of the polygons were used to build the similarity index, while the remaining 20% were reserved exclusively for evaluation.

Dataset Contents

The distributed ZIP package contains the following files:

1. ShapeToVecResults2.pdf

A supplementary document containing additional experimental results referenced in the publication.

2. poly_data.zip

A collection of polygonal GIS datasets extracted from SpatialHadoop. These represent the input geometries used for similarity computation.

3. Ground Truth Files

These archives contain precomputed shape similarity results for each domain:

  • parks.tar

  • water_bodies.tar

  • sports_all-query.tar.gz

Each ground-truth archive consists of multiple text files, in which each line records the result of one similarity query.

How to Extract the Archives

You can extract any of the above `.tar` or `.tar.gz` archives using the methods below, depending on your operating system.

Linux

Use the tar command in the Terminal:

tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz

This extracts the files into the current directory.

macOS

Extraction commands are the same as Linux. Open the Terminal and run:

tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz

You may also double-click the archive in Finder to extract it automatically.

Windows

Open PowerShell (Windows 10 and later) and run:

tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz

You may also use any archive utility for Windows (e.g., 7-Zip, WinRAR, or PeaZip) to extract both .tar and .tar.gz archives.
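
For scripted pipelines, the same extraction can also be done from Python on any of the operating systems above. The following is a minimal sketch using the standard-library tarfile module; the helper name extract_archive and the destination directory are illustrative assumptions, not part of the dataset.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar or .tar.gz archive into dest_dir; return member names.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip compression automatically.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters: reject unsafe members
            # such as absolute paths or links pointing outside dest.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

For example, extract_archive("parks.tar", "ground_truth/parks") would unpack the Parks ground truth into a separate folder (the folder name is an arbitrary choice).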

Ground Truth Format

Each line in a ground-truth file encodes:

<input_polygon_id> <similar_polygon_id_1> <similar_polygon_id_2> ... <similar_polygon_id_k>
 
  • The first value is the ID of the input polygon.

  • The subsequent values are the IDs of polygons determined to be most similar based on geometric shape similarity.

  • The list of similar polygons is sorted in decreasing order of similarity, with the most similar polygon appearing first.
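
The line format described above can be read with a few lines of Python. This is a minimal sketch that assumes the IDs are separated by whitespace, as the format line suggests, and keeps them as strings; the function names are hypothetical.

```python
def parse_ground_truth_line(line):
    """Split one line into (query_id, ranked_neighbor_ids).

    Assumes whitespace-separated IDs. Returns None for blank lines.
    """
    tokens = line.split()
    if not tokens:
        return None
    # First token is the input polygon; the rest are ordered most similar first.
    return tokens[0], tokens[1:]


def load_ground_truth(path):
    """Read a ground-truth text file into {query_id: [neighbor_id, ...]}."""
    ground_truth = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parsed = parse_ground_truth_line(line)
            if parsed is None:
                continue
            query_id, neighbors = parsed
            ground_truth[query_id] = neighbors
    return ground_truth
```

IDs are kept as strings so that no assumption is made about their numeric range or format; convert them to integers only if the files you extract confirm that this is safe.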

These ground-truth lists were generated using a geometric similarity metric (Jaccard similarity) and are intended for evaluating and benchmarking vector-based polygon encodings.
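
One common way to benchmark an encoding against such ranked lists is recall@k: the fraction of the true top-k neighbors that a method also returns in its own top k. The sketch below is an illustrative assumption about how the ground truth might be used, not the evaluation protocol of the publication; the function name and arguments are hypothetical.

```python
def recall_at_k(true_neighbors, retrieved_neighbors, k):
    """Fraction of the first k ground-truth IDs found in the first k retrieved IDs.

    Both arguments are lists ordered from most to least similar.
    """
    true_top = true_neighbors[:k]
    if not true_top:
        return 0.0
    hits = len(set(true_top) & set(retrieved_neighbors[:k]))
    return hits / len(true_top)


# Ground truth ranks "a" and "b" first; the method returned "b" and "x".
print(recall_at_k(["a", "b", "c", "d"], ["b", "x", "a", "y"], k=2))  # 0.5
```

Averaging this value over all query polygons gives a single score per method and per k.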

Intended Use

This dataset is primarily designed for:

  • Research on polygon representation learning, embedding models, and shape encoders.

  • Benchmarking approximate nearest-neighbor (ANN) algorithms on spatial shape data.

  • Studying spatial indexing, vector search strategies, and geometric similarity measures.

  • GIS analytics, spatial data mining, and machine learning applications involving polygonal geometries.

 

Files

Files (76.2 GB)

Checksum Size
md5:19f9193c18d246c3cdaa2f91732db082 661.9 MB
md5:ed2ec9e5bd0745e515b6c796bf128538 560.6 MB
md5:22f1ae01b852eb30bd2a63731b3a4435 435.2 kB
md5:0d44cf36916d3bee04b0263d5fc4c1f1 71.8 GB
md5:2234044a467d6ad5a322db8b2e599fb5 3.1 GB
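
Because the largest file is tens of gigabytes, it can be worth checking a download against its listed MD5 checksum before extracting it. The following is a minimal sketch using the standard-library hashlib module; the helper names are hypothetical, and the file is read in chunks so that it does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as matches_checksum("downloaded_file", "md5:...") returns True only when the local copy is byte-identical to the published one; the file name here is a placeholder for whichever file you downloaded.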

Additional details

Funding

U.S. National Science Foundation
Collaborative Research: OAC: Approximate Nearest Neighbor Similarity Search for Large Polygonal and Trajectory Datasets (Award 2313039)
U.S. National Science Foundation
Collaborative Research: OAC: Approximate Nearest Neighbor Similarity Search for Large Polygonal and Trajectory Datasets (Award 2344585)