Published June 19, 2025 | Version v1
Dataset · Open

Wikipedia Multilingual Fragments (WMF)

  • Universidad de Deusto

Description

Wikipedia Multilingual Fragments (WMF) is a multilingual corpus of clean text fragments extracted from 340 different Wikipedia language editions using the official MediaWiki API.

The dataset contains over 373,000 plain-text fragments, each between 500 and 1,500 characters long, collected by randomly sampling Wikipedia articles in encyclopedic namespaces. For each language, up to 2 million characters were collected; articles that were too short or contained excessive formatting were discarded. Fragments were cleaned by removing section headers, citations, LaTeX expressions, and redundant whitespace.
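For illustration, the minimal sketch below retrieves plain-text extracts of random main-namespace articles through the MediaWiki API. It is not the published collection script (generate_fragments_from_wikipedia.py); the request parameters, length thresholds, and truncation rule shown here are simplifying assumptions.

```python
import requests

def sample_plaintext_extracts(lang, n_pages=5, min_chars=500, max_chars=1500):
    """Fetch plain-text extracts of random main-namespace articles from one
    Wikipedia language edition (illustrative sketch, not the published script)."""
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "generator": "random",
        "grnnamespace": 0,           # encyclopedic (article) namespace only
        "grnlimit": n_pages,
        "prop": "extracts",
        "explaintext": 1,            # strip wiki markup, return plain text
        "exsectionformat": "plain",
        "exlimit": "max",
    }
    pages = requests.get(api, params=params, timeout=30).json()["query"]["pages"]
    fragments = []
    for page in pages.values():
        text = page.get("extract", "").strip()
        if len(text) < min_chars:    # discard articles that are too short
            continue
        fragments.append(text[:max_chars])  # truncate overly long extracts
    return fragments

print(sample_plaintext_extracts("eu", n_pages=3))
```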

Contents

The repository includes the following (short, illustrative usage sketches follow this list):

  • wmf.csv: a unified CSV file with one row per text fragment. Each row includes metadata such as:

    • language code

    • article ID and title

    • character count of the excerpt

    • timestamp of extraction

    • a flag indicating whether the excerpt was truncated

    • the cleaned text itself

  • chunks/: 8,212 plain-text files, each containing exactly 50,000 characters. These were generated by concatenating all excerpts per language and splitting them into fixed-length, non-overlapping segments. Only complete chunks were retained; languages that could not produce at least one full chunk were excluded, leaving 325 languages at this stage.

  • analysis/n_grams/: character-level n-gram frequency tables (n=1 to 20) for each language, based on the cleaned text.

  • analysis/tfidf/: TF, IDF, and TF-IDF scores per language, computed at the word or character level depending on the script (e.g., word-based for English, character-based for Chinese and Japanese).

  • scripts/: Python scripts to reproduce the dataset and analyses:

    • generate_fragments_from_wikipedia.py: collects and filters plain-text excerpts from Wikipedia.

    • merge_language_csvs.py: merges individual per-language CSVs into a unified dataset (wmf.csv).

    • chunk_creation.py: creates fixed-length (50,000 characters) text chunks from the full dataset, removing invisible Unicode characters and discarding incomplete segments.

    • n_grams_calculator.py: computes character-level n-gram frequencies (1 to 20) per language.

    • tfidf_calculator.py: calculates TF, IDF, and TF-IDF scores per language using language-aware tokenization.
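A minimal way to inspect wmf.csv with pandas is sketched below. The column names used (language, char_count) are assumptions for illustration; the actual header of wmf.csv should be checked first. The flat file layout assumed for chunks/ is likewise illustrative.

```python
import pandas as pd
from pathlib import Path

df = pd.read_csv("wmf.csv")
print(df.columns.tolist())          # inspect the real schema first

# Example: fragment counts and mean fragment length per language,
# assuming columns named "language" and "char_count" exist.
summary = (
    df.groupby("language")["char_count"]
      .agg(fragments="count", mean_chars="mean")
      .sort_values("fragments", ascending=False)
)
print(summary.head(10))

# Chunk files are fixed-length 50,000-character plain-text segments.
for chunk_path in sorted(Path("chunks").iterdir())[:3]:
    text = chunk_path.read_text(encoding="utf-8")
    print(chunk_path.name, len(text))   # expected: exactly 50,000 characters
```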
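The chunking step can be sketched as follows; the exact set of invisible Unicode characters removed by chunk_creation.py is an assumption here.

```python
import re

CHUNK_SIZE = 50_000

def make_chunks(excerpts):
    """Concatenate a language's excerpts and cut them into fixed-length,
    non-overlapping 50,000-character segments, keeping complete chunks only
    (simplified sketch of the chunk-creation step)."""
    text = "".join(excerpts)
    # Drop zero-width / invisible characters; the exact character set removed
    # by chunk_creation.py is an assumption.
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    n_full = len(text) // CHUNK_SIZE
    return [text[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(n_full)]

chunks = make_chunks(["example excerpt " * 10_000])
print(len(chunks), all(len(c) == CHUNK_SIZE for c in chunks))
```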
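The character n-gram tables can be reproduced in spirit with a short counter; this is a generic sketch, not the exact computation or output format of n_grams_calculator.py.

```python
from collections import Counter

def char_ngram_frequencies(text, n):
    """Count overlapping character n-grams and return relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

sample = "wikipedia multilingual fragments"
print(sorted(char_ngram_frequencies(sample, 2).items(),
             key=lambda kv: kv[1], reverse=True)[:5])
```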
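A rough analogue of the TF-IDF analysis using scikit-learn is shown below, with word-level features for space-delimited languages and character-level features for scripts without whitespace word boundaries. tfidf_calculator.py uses its own language-aware tokenization, so this is only an approximation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for the cleaned excerpts of two languages.
word_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a lazy dog sleeps in the sun",
]
char_corpus = ["敏捷的棕色狐狸跳过懒狗", "懒狗在太阳下睡觉"]

# Word-level features for space-delimited scripts.
word_vec = TfidfVectorizer(analyzer="word")
word_tfidf = word_vec.fit_transform(word_corpus)
print(word_vec.get_feature_names_out())

# Character-level features for scripts without whitespace word boundaries.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 1))
char_tfidf = char_vec.fit_transform(char_corpus)
print(char_vec.get_feature_names_out())
```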

Applications

This dataset is designed for:

  • multilingual language modeling

  • benchmarking tokenization strategies

  • cross-linguistic comparison

  • low-resource NLP

  • statistical and linguistic analysis

Files

wmf.zip (972.2 MB)
md5:6ba93c0e627383187eda4f0e0cfde5ac

Additional details

Related works

Is supplement to
Dataset: 10.5281/zenodo.15866739 (DOI)

Dates

Created
2024-12-30