Detecting Cross-Language Plagiarism using Open Knowledge Graphs

doi:10.5281/zenodo.5159398

Published August 6, 2021 | Version 1.2

Dataset Open

1. University of Wuppertal
2. University of Konstanz
3. FIZ Karlsruhe

Corresponding authors: Norman Meuschke, Terry Ruas
Venue: 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021)
at the ACM/IEEE Joint Conference on Digital Libraries 2021 (JCDL2021)

==========================================================================

Source code: https://github.com/ag-gipp/cl-osa

==========================================================================

Dataset Details

ASPEC. The Asian Scientific Paper Excerpt Corpus comprises excerpts of scientific papers in Japanese that have been manually translated to English and Chinese. We use both subsets of the ASPEC corpus.

ASPEC-JC contains abstracts and paragraphs from the main text of research papers that were translated manually from Japanese to Chinese.
ASPEC-JE contains abstracts of approx. two million research papers that were translated manually from Japanese to English.

JRC-Acquis. The corpus consists of legislative texts in 22 languages, which the European Union's Joint Research Centre (JRC) selected from the cumulative body of EU laws (the so-called Acquis communautaire). We sampled our test cases from the 10,000 document pairs in the English-French subset of the corpus.

Europarl. The corpus contains transcripts of European Parliament proceedings in 21 European languages. We exclusively sampled test cases from the 9,443 document pairs in the English-French subset of the corpus.

PAN-PC-11. The corpus contains instances of simulated monolingual and cross-language plagiarism that were used for evaluating plagiarism detection methods as part of the workshop series Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN). Most of the 26,939 documents in the corpus were created by extracting text from openly available books. The documents are partially interspersed with instances of simulated plagiarism that were created and obfuscated automatically or by crowdsourced workers. We exclusively sampled test cases from the 2,921 Spanish-English aligned document pairs in the corpus, for which simulated plagiarism instances were either machine-generated or created manually by crowdsourced workers.

==========================================================================

File Structure

[corpus_documents] folder: Corpora of translation-aligned documents used in our experiments, composed of the following sub-corpora:

aspec: Japanese and English
aspecx: Japanese and Chinese
jrc: English and French
europarl: English and French
pan: English and Spanish

Each sub-corpus consists of 4,000 translation-aligned files (2,000 per language); the entire corpus thus comprises 20,000 files.
Each set of translation-aligned documents was randomly selected from the original datasets (details in the paper).
The Japanese files in aspec and aspecx do not necessarily overlap even though they are from the same dataset.
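
After extracting corpus_documents.zip, the stated file counts can be checked with a short script. The following is a minimal sketch only: it assumes that each sub-corpus is a directory named after its label directly under the extraction root, which this record does not confirm. The names SUBCORPORA, FILES_PER_SUBCORPUS, count_files, and unexpected_counts are hypothetical helpers, not part of the dataset or its source code.

```python
from pathlib import Path

# Sub-corpus labels and the file count stated in this record.
SUBCORPORA = ("aspec", "aspecx", "jrc", "europarl", "pan")
FILES_PER_SUBCORPUS = 4000  # 2,000 per language


def count_files(root):
    """Count regular files below each sub-corpus directory.

    Assumption: the archive extracts to <root>/<label>/... ; a missing
    directory is reported as 0 files.
    """
    root = Path(root)
    counts = {}
    for name in SUBCORPORA:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[name] = 0
    return counts


def unexpected_counts(counts):
    """Return the sub-corpora whose file count differs from the stated 4,000."""
    return {n: c for n, c in counts.items() if c != FILES_PER_SUBCORPUS}
```

An empty result from unexpected_counts(count_files(root)) would indicate that all five sub-corpora hold 4,000 files each, that is, 20,000 files in total.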

[vectors_documents] folder: Averaged vector representations of the documents in each dataset, computed with two pre-trained models:

Universal Sentence Encoder - Multilingual (USE-ML)
ConceptNet Numberbatch
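
This record does not describe the vector file format or the exact averaging procedure. Purely as an illustration of what an averaged representation means, the sketch below mean-pools a list of equally sized vectors (for example, word or sentence embeddings) into a single document vector. The function name average_vectors and the toy input are hypothetical; this is not the authors' implementation.

```python
def average_vectors(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    vectors = [list(v) for v in vectors]
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy example with two 2-dimensional vectors (made-up values).
document_vector = average_vectors([[1.0, 2.0], [3.0, 4.0]])
```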

Naming convention: <model>_<dataset>_<language>

Example: cn_jrc_fr:
- model: ConceptNet Numberbatch
- dataset: JRC-Acquis
- language: French

Labels:
- <model>:
  cn - ConceptNet Numberbatch
  um - USE-ML
- <dataset>:
  aspec - ASPEC (Asian Scientific Paper Excerpt Corpus) - Japanese and English
  aspecx - ASPEC (Asian Scientific Paper Excerpt Corpus) - Japanese and Chinese
  jrc - JRC-Acquis
  europarl - Europarl
  pan - PAN-PC-11
- <language>:
  en - English
  es - Spanish
  fr - French
  ja - Japanese
  zh - Chinese
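
The naming convention and labels above can be decoded programmatically. The following is a minimal sketch, assuming that names consist of exactly three underscore-separated labels and that any file extension follows the first dot; the function parse_vector_name and the lookup tables are hypothetical helpers built from the label list, not part of the dataset.

```python
# Lookup tables copied from the label list in this record.
MODELS = {"cn": "ConceptNet Numberbatch", "um": "USE-ML"}
DATASETS = {
    "aspec": "ASPEC (Japanese and English)",
    "aspecx": "ASPEC (Japanese and Chinese)",
    "jrc": "JRC-Acquis",
    "europarl": "Europarl",
    "pan": "PAN-PC-11",
}
LANGUAGES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "ja": "Japanese",
    "zh": "Chinese",
}


def parse_vector_name(name):
    """Expand '<model>_<dataset>_<language>' into human-readable labels."""
    stem = name.split(".", 1)[0]  # assumption: drop a file extension, if present
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected <model>_<dataset>_<language>, got {name!r}")
    model, dataset, language = parts
    try:
        return {
            "model": MODELS[model],
            "dataset": DATASETS[dataset],
            "language": LANGUAGES[language],
        }
    except KeyError as exc:
        raise ValueError(f"unknown label {exc.args[0]!r} in {name!r}") from None


# Example: USE-ML vectors for the Spanish side of PAN-PC-11.
info = parse_vector_name("um_pan_es")
```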

Files (174.5 MB)

Name                    Size      MD5
corpus_documents.zip    85.4 MB   f21d10abd8451eb5cba6a447c8e15dc2
vectors_documents.zip   89.1 MB   968e092e215d05f265042918dfb2df97
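
The MD5 checksums listed for the two archives can be used to confirm that a download is intact. The sketch below uses Python's standard hashlib module; the helper names md5_of_file and matches_published_md5 are hypothetical, and the local path of the downloaded archive is whatever location you saved it to.

```python
import hashlib

# Checksums as published in the file list of this record.
EXPECTED_MD5 = {
    "corpus_documents.zip": "f21d10abd8451eb5cba6a447c8e15dc2",
    "vectors_documents.zip": "968e092e215d05f265042918dfb2df97",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, archive_name):
    """Return True if the file at 'path' has the checksum listed for 'archive_name'."""
    return md5_of_file(path) == EXPECTED_MD5[archive_name]
```

For example, matches_published_md5("corpus_documents.zip", "corpus_documents.zip") should return True for an uncorrupted download saved in the current directory.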
