Published November 29, 2025 | Version v1.0
Dataset Open

Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research

  • 1. Hyper Efficient System LLC, Sheridan, WY, USA

Description

This record provides a cleaned, genre-annotated corpus of contemporary Bosnian, designed for quantitative analysis of language entropy and "language energy" and for modern NLP tasks.

The corpus is built from three publicly available resources released in the CLARIN.SI repository:
(1) The Sarajevo Corpus of SMS Messages in Bosnian 1.1,
(2) Bosnian web corpus bsWaC 1.1, and
(3) Bosnian web corpus CLASSLA-web.bs 1.0.
All sources were converted to plain text, cleaned, normalised, partially deduplicated, and merged into a single consistent dataset.

The final corpus contains approximately 6.18 GB of text (6,182,905,888 bytes), 46,258,935 lines and 942,515,845 tokens.
The web portion is organised into several “super-genres” (News, Opinion, Forum/Chat, Info/HowTo, Legal/Admin, Literature, Ads/Promo, Mix/Other).
For each super-genre a separate text file is provided, together with one global file that concatenates all genres for entropy estimation and language-model training.
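
As an illustration of the entropy estimation mentioned above, the following is a minimal sketch of a unigram Shannon entropy computation over a text file, streamed line by line. It is not the record's own script (see the GitHub repository for that). The function name `unigram_entropy` and the whitespace tokenisation are assumptions made for this example; the tokenisation behind the reported token count may differ.

```python
import math
from collections import Counter
from typing import Iterable


def unigram_entropy(lines: Iterable[str]) -> float:
    """Shannon entropy, in bits per token, of the unigram distribution.

    Assumption: tokens are whitespace-separated strings. This is a
    simplification for illustration, not the record's tokeniser.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over all observed token types.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Usage (the path is wherever the archive was unpacked):
#   with open("bosnian_corpus_all.txt", encoding="utf-8") as fh:
#       print(f"{unigram_entropy(fh):.4f} bits/token")
```

Because a file object is itself an iterable of lines, the file is never read into memory as a whole; only the token counts are held, which matters for a file of this size. The same function can be applied to each per-genre file to compare genres.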

Cleaning focuses on removing technical noise that would bias frequency distributions and entropy estimates, while preserving the linguistic signal:
– Unicode normalisation (UTF-8, NFC),
– correction of common mojibake artefacts,
– removal of URLs, e-mail addresses, file names, boilerplate and CMS/navigation lines,
– filtering of lines with a high proportion of non-letter characters,
– optional digit normalisation and lowercasing,
– language filtering to keep primarily Bosnian text.
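
Some of the cleaning steps listed above can be sketched with the Python standard library alone. This is a hedged illustration, not the pipeline's actual code: the function name `clean_line`, the regular expressions and the `MIN_LETTER_RATIO` threshold are all assumptions chosen for the example. Mojibake repair, boilerplate detection and language filtering are left out because they need heuristics or external models that this record does not specify.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical patterns and threshold; the real pipeline may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DIGITS_RE = re.compile(r"\d+")
MIN_LETTER_RATIO = 0.6


def clean_line(
    line: str, lowercase: bool = False, normalise_digits: bool = False
) -> Optional[str]:
    """Return a cleaned line, or None if the line should be dropped."""
    # Unicode normalisation to NFC (e.g. "c" + combining caron becomes "č").
    text = unicodedata.normalize("NFC", line)
    # Remove URLs and e-mail addresses.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Collapse runs of whitespace left behind by the removals.
    text = " ".join(text.split())
    if not text:
        return None
    # Drop lines with a high proportion of non-letter characters.
    non_space = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in non_space)
    if letters / len(non_space) < MIN_LETTER_RATIO:
        return None
    # Optional steps: map every digit run to "0", then lowercase.
    if normalise_digits:
        text = DIGITS_RE.sub("0", text)
    if lowercase:
        text = text.lower()
    return text
```

Returning `None` for rejected lines lets a caller filter a stream with a simple `if cleaned is not None` check while keeping the per-line logic free of I/O.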

Files in this record:
– bosnian_corpus_all.txt (full corpus, all genres),
– per-genre text files (news, forum, opinion, info/howto, legal/admin, literature, ads/promo, mix/other),
– README.txt with dataset description,
– two accompanying research papers (Bosnian and English), uploaded separately as PDF files.

Code availability:
Preprocessing, cleaning and entropy-calculation scripts are publicly available on GitHub:
https://github.com/H4sK0/bosnian-corpus-pipeline

Licence:
This corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence.
Users must credit this Zenodo record and the original source corpora (Sarajevo SMS 1.1, bsWaC 1.1, CLASSLA-web.bs 1.0), and must distribute derivative corpora under the same or a compatible licence.

Suggested citation:
Hasan Kahrimanović (2025). Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research. Zenodo. DOI: [assigned by Zenodo].

Files

bosnian-corpus-1.0.zip

Files (4.8 GB)

Size       Checksum
4.8 GB     md5:ee07e92e8170937cca7a94cba1dbbe28
398.2 kB   md5:5ca4c42443fc8d8f8b49a2172c5052b3
397.6 kB   md5:fe671fb6c25fc8d0afdb9b8d3d319c1a
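
A downloaded file can be checked against the MD5 checksums listed above. The sketch below uses only `hashlib` from the Python standard library; the helper names `md5_of_file` and `matches_checksum` are introduced for this example. Pairing the 4.8 GB checksum with `bosnian-corpus-1.0.zip` in the usage comment is an assumption, since the listing does not state which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (assumes the 4.8 GB entry is the zip archive):
#   ok = matches_checksum(
#       "bosnian-corpus-1.0.zip",
#       "md5:ee07e92e8170937cca7a94cba1dbbe28",
#   )
```

Reading in chunks keeps memory use constant regardless of file size, which is relevant for a multi-gigabyte archive.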

Additional details

Dates

Issued: 2025-11-29
First public release of the corpus

Software

Repository URL: https://github.com/H4sK0/bosnian-corpus-pipeline
Programming language: Python
Development status: Active