Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections
Creators
Description
Three corpora in different domains extracted from Wikipedia.
For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, and particularly the sub-titles and paragraphs are kept in these datasets
Wines
Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, which annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples for ground-truth expert-based recommendations are
- Dom Pérignon - Moët & Chandon
- Pinot Meunier - Chardonnay
Movies
The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
For this dataset, we have extracted a test set of ground truth annotations for 50 source articles using the "BestSimilar" database. Each source articles is associated with a list of ${\scriptsize \sim}12$ most similar movies.
Examples for ground-truth expert-based recommendations are
- Schindler's List - The Pianist
- Lion King - The Jungle Book
Video games
The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:
- Grand Theft Auto - Mafia
- Burnout Paradise - Forza Horizon 3