Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code)
Affiliations:
1. University of Konstanz
2. University of Wuppertal
3. DFKI GmbH
Description
Many digital libraries recommend literature to their users based on the similarity between a query document and the documents in their repository. However, they often fail to distinguish which relationship makes two documents similar. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT to be the best-performing system, with an F1-score of 0.93, which we examine manually to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussion in this paper serves as a first step in the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
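To make the pairwise setup concrete, here is a minimal sketch of the "vanilla" (joint) variant: both articles are encoded by a single Transformer as one sequence pair, and the pair is classified into one of the relation classes. This is an illustration with Hugging Face transformers, not the exact training code; the label count and article texts are placeholders.

```python
# Minimal sketch of the joint ("vanilla") pairwise setup: both articles go
# into one BERT forward pass as a sequence pair. The number of labels is a
# placeholder; the actual classes are the Wikidata-based relations.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=7  # placeholder label count
)
model.eval()

doc_a = "Plain text of the first Wikipedia article ..."
doc_b = "Plain text of the second Wikipedia article ..."

# Encode the pair jointly; the tokenizer inserts [SEP] between the segments
# and truncates the combined input to the 512-token limit.
inputs = tokenizer(doc_a, doc_b, truncation=True, max_length=512,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class id:", logits.argmax(dim=-1).item())
```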
The following data is supplemental to the experiments described in our research paper. The data consists of:
- Datasets (articles, class labels, cross-validation splits)
- Pretrained models (Transformers, GloVe, Doc2vec)
- Model outputs (predictions) of the best-performing models
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that appear in the training or test data.
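For reference, the extraction step can be reproduced with gensim's segment_wiki script, and the provided archive can be read line by line. The command and field names below follow gensim's documented output schema (one JSON object per article with title, section_titles, and section_texts); treat this as a sketch rather than the exact preprocessing pipeline.

```python
# The plain-text extraction can be reproduced from the shell, e.g.:
#   python -m gensim.scripts.segment_wiki \
#       -f enwiki-20191101-pages-articles.xml.bz2 \
#       -o enwiki-20191101-pages-articles.jsonl.bz2
#
# Reading the provided archive: one JSON object per line, with gensim's
# standard fields 'title', 'section_titles', and 'section_texts'.
import bz2
import json

path = "enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2"
with bz2.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        full_text = "\n".join(article["section_texts"])
        print(article["title"], "-", len(full_text), "characters")
        break  # inspect only the first article
```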
The actual dataset, as used in the stratified k-fold cross-validation (k=4), is provided in train_testdata__4folds.tar.gz and is organized as follows (a loading sketch follows the listing):
├── 1
│ ├── test.csv
│ └── train.csv
├── 2
│ ├── test.csv
│ └── train.csv
├── 3
│ ├── test.csv
│ └── train.csv
└── 4
├── test.csv
└── train.csv
4 directories, 8 files
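A possible way to iterate over the folds is sketched below. The directory name assumes the archive extracts to a folder named like the tarball, and the CSV column layout is defined by the dataset itself, so no column access is assumed here.

```python
# Sketch: load the four stratified train/test splits. The base directory is
# an assumption about the extraction target of train_testdata__4folds.tar.gz;
# the CSV schema (pair and label columns) is defined by the dataset.
import pandas as pd

base_dir = "train_testdata__4folds"  # assumed extraction directory
for fold in (1, 2, 3, 4):
    train = pd.read_csv(f"{base_dir}/{fold}/train.csv")
    test = pd.read_csv(f"{base_dir}/{fold}/test.csv")
    print(f"fold {fold}: {len(train)} training pairs, {len(test)} test pairs")
```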
Pretrained models
PyTorch: vanilla and Siamese BERT + XLNet
The pretrained models for each fold are available in the corresponding model archives:
# Vanilla
model_wiki.bert_base__joint__seq512.tar.gz
model_wiki.xlnet_base__joint__seq512.tar.gz
# Siamese
model_wiki.bert_base__siamese__seq512__4d.tar.gz
model_wiki.xlnet_base__siamese__seq512__4d.tar.gz
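Assuming an archive unpacks into a directory that Hugging Face transformers can load directly (which depends on how the checkpoint was saved), a vanilla model could be restored along these lines; the directory name is an assumption based on the archive file name.

```python
# Sketch: restore an unpacked vanilla BERT checkpoint for pairwise
# classification. Whether from_pretrained() works out of the box depends
# on the archive layout; the path below mirrors the archive name.
from transformers import BertForSequenceClassification, BertTokenizer

model_dir = "model_wiki.bert_base__joint__seq512"  # assumed extraction dir
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(model_dir)
model.eval()
```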
Model output
The predictions of the best-performing model (vanilla BERT) are provided in predictions_4folds_2__bert_base__joint__seq512.csv.
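The prediction file can be compared against the matching fold's test split, e.g. with scikit-learn. Note that the label and prediction column names below are assumptions about the CSV layout, not documented fields, and the sketch assumes the prediction rows align with the test rows.

```python
# Sketch: score the released predictions against fold 2's test labels.
# The 'label' and 'prediction' column names are assumptions; adjust them
# to the actual CSV headers.
import pandas as pd
from sklearn.metrics import classification_report

preds = pd.read_csv("predictions_4folds_2__bert_base__joint__seq512.csv")
test = pd.read_csv("train_testdata__4folds/2/test.csv")

print(classification_report(test["label"], preds["prediction"]))
```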
Files
(10.3 GB in total)

MD5 checksum | Size
---|---
e2cbd67c30fc099d8b5b10569afeb002 | 56.4 MB
deff6bf082b1cdb8f4d9a6e7ca88c82d | 23.9 MB
64fc65cd2f7dbbac765f76344c2e9543 | 43.8 MB
24db636977db959dc78cc3367a194891 | 1.6 GB
0756d8e0847317e6dbfd65e9a56d2eb4 | 1.6 GB
7dd75f938b2b66c255839e0f6ea2effe | 1.6 GB
2363fc6e69bd7634aa5f78a244743733 | 1.7 GB
b3ab4c8ca24cc1e1dde714191750c6e8 | 1.7 GB
862f1f706625732841ee66f333ef25bb | 1.8 GB
bf4e27b205570e43da190f2922d1c84e | 873.0 kB
801c865aa4286e7c0a4f74f37d9358d5 | 879.6 kB
21f5a2567bb9a4a1652879f5a22f704e | 37.6 kB
982a042e461363e7082850fd109f690b | 3.3 MB