Polygon Similarity Benchmark Dataset
Description
Dataset Description: Polygon Similarity Benchmark Dataset
Overview
This dataset provides a curated collection of polygonal shapes derived from three categories of the SpatialHadoop GIS datasets: Parks, Water Bodies, and Sports.
It is intended to support research in polygon representation learning, geometric similarity search, and spatial indexing.
The dataset includes raw polygonal geometries, pre-computed similarity ground truth, and supplementary documentation. For each dataset category, 80% of the polygons were used to build the similarity index, while the remaining 20% were reserved exclusively for evaluation.
Dataset Contents
The distributed ZIP package contains the following files:
1. ShapeToVecResults2.pdf
A supplementary document containing additional experimental results referenced in the publication.
2. poly_data.zip
A collection of polygonal GIS datasets extracted from SpatialHadoop. These represent the input geometries used for similarity computation.
3. Ground Truth Files
These archives contain precomputed shape similarity results for each domain:
- parks.tar
- water_bodies.tar
- sports_all-query.tar.gz
Each ground-truth archive consists of multiple text files, where each line represents a similarity query result.
How to Extract the Archives
You can extract any of the above `.tar` or `.tar.gz` archives using the methods below, depending on your operating system.
Linux
Use the tar command in a terminal:
tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz
This extracts the files into the current directory.
macOS
The extraction commands are the same as on Linux. Open the Terminal and run:
tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz
You may also double-click the archive in Finder to extract it automatically.
Windows
Open PowerShell (Windows 10 and later) and run:
tar -xvf archive_name.tar
tar -xzvf archive_name.tar.gz
You may also use an archive utility of your choice on Windows (e.g., 7-Zip, WinRAR, or PeaZip) to extract both .tar and .tar.gz archives.
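For scripted, cross-platform extraction, Python's standard `tarfile` module reads both `.tar` and `.tar.gz` archives. The following is a minimal sketch; the function name `extract_archive` and the output directory are illustrative choices, not part of the dataset.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir="."):
    """Extract a .tar or .tar.gz archive into out_dir and return the member names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (none, gzip, ...) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases provide a filter that rejects unsafe member paths.
            tar.extractall(out, filter="data")
        else:
            tar.extractall(out)
    return names
```

For example, `extract_archive("parks.tar", "ground_truth/parks")` would unpack the Parks ground truth into a subdirectory (the target path here is hypothetical).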
Ground Truth Format
Each line in a ground-truth file is structured as follows:
- The first value is the ID of the input (query) polygon.
- The subsequent values are the IDs of the polygons determined to be most similar to it, based on geometric shape similarity.
- The list of similar polygons is sorted in decreasing order of similarity, with the most similar polygon appearing first.
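The line layout described above can be read with a few lines of Python. This is a sketch under one assumption that the description does not confirm: values on a line are separated by whitespace and/or commas. IDs are kept as strings so that no ID format is presumed; the function names are illustrative.

```python
import re


def parse_ground_truth_line(line):
    """Split one line into (query_id, [neighbor_ids in decreasing similarity])."""
    # Assumption: values are separated by whitespace and/or commas.
    tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
    if not tokens:
        return None  # blank line
    return tokens[0], tokens[1:]


def load_ground_truth(path):
    """Read one ground-truth text file into {query_id: [neighbor_ids]}."""
    truth = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parsed = parse_ground_truth_line(line)
            if parsed is not None:
                query_id, neighbors = parsed
                truth[query_id] = neighbors
    return truth
```

If the files turn out to use a different delimiter, only the regular expression in `parse_ground_truth_line` needs to change.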
These ground-truth lists were generated using the Jaccard similarity metric, a geometric similarity measure, and are intended for evaluating and benchmarking vector-based polygon encodings.
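For two shapes A and B, Jaccard similarity is the area of their intersection divided by the area of their union: J(A, B) = area(A ∩ B) / area(A ∪ B). The sketch below illustrates the measure on axis-aligned rectangles only, so that it needs no geometry library. It is not the pipeline used to produce the ground truth; any preprocessing applied to the polygons before comparison (for example, alignment or scaling) is not described here and is not modeled.

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax); 0 if degenerate."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def jaccard_rect(a, b):
    """Jaccard similarity of two axis-aligned rectangles: |A ∩ B| / |A ∪ B|."""
    # The intersection of two axis-aligned rectangles is itself a rectangle
    # (possibly empty, in which case rect_area returns 0).
    inter = rect_area((max(a[0], b[0]), max(a[1], b[1]),
                       min(a[2], b[2]), min(a[3], b[3])))
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0
```

For instance, the rectangles `(0, 0, 2, 2)` and `(1, 0, 3, 2)` overlap in an area of 2 and together cover an area of 6, giving a similarity of 1/3. For general polygons, the intersection area would have to come from a polygon-clipping routine instead.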
Intended Use
This dataset is primarily designed for:
- Research on polygon representation learning, embedding models, and shape encoders.
- Benchmarking approximate nearest-neighbor (ANN) algorithms on spatial shape data.
- Studying spatial indexing, vector search strategies, and geometric similarity measures.
- GIS analytics, spatial data mining, and machine learning applications involving polygonal geometries.
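One common way to score an ANN method against the ground-truth lists is recall@k: the fraction of the true top-k neighbors that the method also returns in its own top k. The description does not prescribe an evaluation metric, so the choice of recall@k, like the function names below, is an assumption made for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the top-k ground-truth IDs found among the top-k retrieved IDs."""
    truth_top = set(relevant[:k])
    if not truth_top:
        return 0.0
    hits = len(set(retrieved[:k]) & truth_top)
    return hits / len(truth_top)


def mean_recall_at_k(results, truth, k):
    """Average recall@k over the query IDs present in both mappings."""
    shared = [q for q in results if q in truth]
    if not shared:
        return 0.0
    return sum(recall_at_k(results[q], truth[q], k) for q in shared) / len(shared)
```

Here `results` and `truth` are both dictionaries mapping a query polygon ID to a ranked list of neighbor IDs, with `truth` taken from the ground-truth files and `results` from the method under test.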
Files
Total size: 76.2 GB
| MD5 checksum | Size |
|---|---|
| 19f9193c18d246c3cdaa2f91732db082 | 661.9 MB |
| ed2ec9e5bd0745e515b6c796bf128538 | 560.6 MB |
| 22f1ae01b852eb30bd2a63731b3a4435 | 435.2 kB |
| 0d44cf36916d3bee04b0263d5fc4c1f1 | 71.8 GB |
| 2234044a467d6ad5a322db8b2e599fb5 | 3.1 GB |
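Because the largest file is tens of gigabytes, it is worth verifying each download against its MD5 checksum. The sketch below streams a file through `hashlib` in fixed-size chunks so that it never has to fit in memory; the function names and chunk size are illustrative. This listing does not show which checksum belongs to which file name, so the expected value should be taken from the entry shown next to the file on the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without an 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected
```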
Additional details
Funding
- U.S. National Science Foundation, award 2313039: Collaborative Research: OAC: Approximate Nearest Neighbor Similarity Search for Large Polygonal and Trajectory Datasets
- U.S. National Science Foundation, award 2344585: Collaborative Research: OAC: Approximate Nearest Neighbor Similarity Search for Large Polygonal and Trajectory Datasets