Lexica corpus

Hewett, Freya; Stede, Manfred

doi:10.5281/zenodo.5196030

Published August 13, 2021 | Version v1.0

Dataset Open

1. Humboldt Institut für Internet & Gesellschaft
2. Universität Potsdam

First release of the lexica corpus: a corpus for German text simplification.

The corpus consists of approximately 3000 texts from three wiki-based lexica in German: MiniKlexikon, Klexikon and Wikipedia. Articles in these wikis are written, discussed and improved collaboratively by volunteers. Klexikon is aimed specifically at children aged 6 to 12; MiniKlexikon is designed for children who are beginner readers and is therefore an even simpler version of Klexikon. We assume that the three sub-corpora represent three levels of conceptual complexity, corresponding to their target groups: younger children, children and adults. Because Wikipedia articles can be far longer than articles in the other two lexica, only the introduction or abstract of each Wikipedia article was included in this corpus.

This repository contains the corpora from the original study (295 texts per sub-corpus, in the orig_files folder), extended versions with approximately 1000 texts per sub-corpus (as of August 2021), and a script for updating the extended versions as new articles are added to Klexikon and MiniKlexikon.

Files

Files (2.3 MB)

Name	Size
fhewett/lexica-corpus-v1.0.zip (md5:e30572238ec7f951b02142dfd7e42d96)	2.3 MB
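
The MD5 checksum in the file listing can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, not part of the repository: it assumes the archive was saved in the current directory under the name lexica-corpus-v1.0.zip, and the helper name md5_of_file is purely illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as given in the file listing of this record.
EXPECTED_MD5 = "e30572238ec7f951b02142dfd7e42d96"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed local filename; adjust to wherever the archive was saved.
    archive = Path("lexica-corpus-v1.0.zip")
    if archive.exists():
        ok = md5_of_file(archive) == EXPECTED_MD5
        print("checksum OK" if ok else "checksum mismatch")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use constant regardless of file size; for a 2.3 MB archive a single read would work equally well.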

Additional details

Is supplement to: https://github.com/fhewett/lexica-corpus/tree/v1.0 (URL)

	All versions	This version
Views	877	548
Downloads	104	48
Data volume	253.9 MB	109.5 MB
