Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

anonymous

doi:10.5281/zenodo.4468783

Published January 26, 2021 | Version v1

Dataset Open

Three corpora in different domains extracted from Wikipedia.

For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

The article structure, particularly the sub-titles and paragraphs, is kept in these datasets.

Wines

The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are:

Dom Pérignon - Moët & Chandon
Pinot Meunier - Chardonnay

Movies

The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of the ~12 most similar movies.
Examples of ground-truth expert-based recommendations are:

Schindler's List - The Pianist
Lion King - The Jungle Book

Video games

The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are:

Grand Theft Auto - Mafia
Burnout Paradise - Forza Horizon 3
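
The ground-truth examples listed for each corpus pair a source article with recommended articles. The following is a minimal sketch of how such pairs might be used to score a ranked list of candidates. It assumes a hit-rate-at-k metric, which this record does not prescribe; the function name `hit_rate_at_k` and the `ranked` model output are hypothetical and invented purely for illustration.

```python
from typing import Dict, List, Set


def hit_rate_at_k(ranked: Dict[str, List[str]],
                  ground_truth: Dict[str, Set[str]],
                  k: int = 10) -> float:
    """Fraction of source articles whose top-k list contains a ground-truth title.

    Assumption: hit rate at k is used here only as an example metric.
    """
    # Only score sources that have both annotations and a ranking.
    sources = [s for s in ground_truth if s in ranked]
    if not sources:
        return 0.0
    hits = sum(1 for s in sources if ground_truth[s] & set(ranked[s][:k]))
    return hits / len(sources)


# Ground-truth pairs taken from the video games examples in this description.
ground_truth = {
    "Grand Theft Auto": {"Mafia"},
    "Burnout Paradise": {"Forza Horizon 3"},
}

# Hypothetical model output; not part of the dataset.
ranked = {
    "Grand Theft Auto": ["Mafia", "Forza Horizon 3"],
    "Burnout Paradise": ["Mafia", "Grand Theft Auto"],
}

score = hit_rate_at_k(ranked, ground_truth, k=2)
```

In this toy setup only the first source has its ground-truth title in the top 2, so half of the sources count as hits.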

Files

Files (854.3 MB)

Name	MD5	Size
movies.csv	e4714cd328fc98c4919a66c77bc1198f	664.3 MB
video_games.csv	a1db0c906d2d31fc359c234ec57c6929	178.6 MB
wines.csv	617fa73f2a331ab160d7653c81daf17e	11.4 MB
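
The MD5 digests listed for the three files can be used to check that a download completed intact. The following is a minimal sketch, assuming the files were saved under their listed names in a single local directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of this record.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed in the file table of this record.
EXPECTED_MD5 = {
    "movies.csv": "e4714cd328fc98c4919a66c77bc1198f",
    "video_games.csv": "a1db0c906d2d31fc359c234ec57c6929",
    "wines.csv": "617fa73f2a331ab160d7653c81daf17e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True, False, or None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results
```

Calling `verify_downloads("path/to/downloads")` returns one entry per file, so a mismatch (`False`) can be told apart from a file that has not been downloaded yet (`None`).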

	All versions	This version
Views	1,528	1,527
Downloads	632	632
Data volume	401.5 GB	401.5 GB
