Published January 26, 2021 | Version v1
Dataset Open

Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

Creators

Description

Three corpora in different domains extracted from Wikipedia.

For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

The article structure, in particular the sub-titles and paragraphs, is kept in these datasets.

 

Wines

The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset is a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are:

  • Dom Pérignon - Moët & Chandon
  • Pinot Meunier - Chardonnay

Movies

The Wikipedia movies dataset consists of 100,385 articles describing different movies. A movie article may include text passages describing the plot, cast, production, reception, soundtrack, and more.
For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of the ~12 most similar movies.
Examples of ground-truth expert-based recommendations are:

  • Schindler's List - The Pianist
  • Lion King - The Jungle Book

Video games

The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:

  • Grand Theft Auto - Mafia
  • Burnout Paradise - Forza Horizon 3
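The record does not describe how the ground-truth annotations are stored or how they are meant to be scored, so the following is only a minimal sketch of one way such source-to-recommendation lists could be used. It assumes the annotations have been loaded into a dictionary mapping each source title to a set of recommended titles, and it uses a simple hit rate at k as an illustrative metric. The function name `hit_rate_at_k`, the dictionary layout, and the `ranked` model output are all hypothetical.

```python
def hit_rate_at_k(ranked, ground_truth, k=10):
    """Fraction of source articles whose top-k ranked list contains
    at least one ground-truth recommendation.

    ranked:       dict mapping source title -> ordered list of candidate titles
                  (hypothetical output of some similarity model)
    ground_truth: dict mapping source title -> set of annotated titles
                  (assumed in-memory layout, not the dataset's file format)
    """
    if not ground_truth:
        return 0.0
    hits = 0
    for source, relevant in ground_truth.items():
        top_k = ranked.get(source, [])[:k]
        if any(title in relevant for title in top_k):
            hits += 1
    return hits / len(ground_truth)


# Ground-truth pairs taken from the examples in this description.
ground_truth = {
    "Dom Pérignon": {"Moët & Chandon"},
    "Pinot Meunier": {"Chardonnay"},
}

# Made-up rankings, for illustration only.
ranked = {
    "Dom Pérignon": ["Moët & Chandon", "Chardonnay"],
    "Pinot Meunier": ["Dom Pérignon"],
}

score = hit_rate_at_k(ranked, ground_truth, k=10)
```

Other ranking metrics could be substituted in the same place; the only point here is that each annotated source article carries a short list of expected neighbours.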

Files

movies.csv

Files (854.3 MB)

Checksum Size
md5:e4714cd328fc98c4919a66c77bc1198f 664.3 MB
md5:a1db0c906d2d31fc359c234ec57c6929 178.6 MB
md5:617fa73f2a331ab160d7653c81daf17e 11.4 MB
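The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the listing does not state which checksum belongs to which file name, so the local path has to be supplied by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    large files (hundreds of MB) do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 with a listed value, with or without
    the 'md5:' prefix used in the file listing."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:e4714cd328fc98c4919a66c77bc1198f")` would return `True` only if the file at `local_path` (a placeholder) has exactly that digest.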