Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research
Authors/Creators
- 1. Hyper Efficient System LLC, Sheridan, WY, USA
Description
This record provides a cleaned and genre-annotated corpus of contemporary Bosnian, designed for quantitative analysis of language entropy and "language energy", and for modern NLP tasks.
The corpus is built from three publicly available resources released in the CLARIN.SI repository:
(1) The Sarajevo Corpus of SMS Messages in Bosnian 1.1,
(2) Bosnian web corpus bsWaC 1.1, and
(3) Bosnian web corpus CLASSLA-web.bs 1.0.
All sources were converted to plain text, cleaned, normalised, partially deduplicated, and merged into a single consistent dataset.
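The record does not specify how the partial deduplication was performed. As a minimal sketch, assuming duplicates are detected at line level after whitespace and case folding (the function name `dedupe_lines` is hypothetical and not taken from the published pipeline), it could look like this:

```python
import hashlib


def dedupe_lines(lines):
    """Yield each line once, keeping the first occurrence.

    Assumption: two lines count as duplicates when they are identical
    after collapsing whitespace and case folding. The published pipeline
    may use a different criterion.
    """
    seen = set()
    for line in lines:
        # Normalise only for comparison; the original line is what we keep.
        key_text = " ".join(line.split()).casefold()
        key = hashlib.sha1(key_text.encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line
```

Storing fixed-size digests instead of the lines themselves keeps memory use bounded per unique line, which matters for a corpus of this size.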
The final corpus contains approximately 6.18 GB of text (6,182,905,888 bytes), 46,258,935 lines and 942,515,845 tokens.
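Figures of this kind can be recomputed with a short script. The sketch below assumes that a token is a whitespace-separated string and that bytes are counted in UTF-8; the record does not state the tokenisation used for the reported counts, so results may differ. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(lines):
    """Return (bytes, lines, tokens) for an iterable of text lines.

    Assumptions: bytes are counted in UTF-8 and tokens are
    whitespace-separated strings.
    """
    n_bytes = n_lines = n_tokens = 0
    for line in lines:
        n_bytes += len(line.encode("utf-8"))
        n_lines += 1
        n_tokens += len(line.split())
    return n_bytes, n_lines, n_tokens
```

For a file on disk, the function can be given an open file handle, for example `open("bosnian_corpus_all.txt", encoding="utf-8", newline="")`, so that the corpus is streamed line by line rather than loaded into memory.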
The web portion is organised into several “super-genres” (News, Opinion, Forum/Chat, Info/HowTo, Legal/Admin, Literature, Ads/Promo, Mix/Other).
For each super-genre a separate text file is provided, together with one global file that concatenates all genres for entropy estimation and language-model training.
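As an illustration of entropy estimation on such a file, the following sketch computes the plug-in (maximum-likelihood) estimate of unigram Shannon entropy, H = -Σ p(w) log2 p(w), over whitespace-separated tokens. This is only one possible estimator; the scripts in the accompanying repository may use a different unit (characters, n-grams) or estimator, and the function name `unigram_entropy` is hypothetical.

```python
import math
from collections import Counter


def unigram_entropy(lines):
    """Plug-in estimate of unigram entropy in bits per token.

    Assumption: tokens are whitespace-separated strings.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum over types of p * log2(p), with p = count / total
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Running the same function on each per-genre file and on the global file gives directly comparable values, although the plug-in estimate is biased downwards for smaller samples.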
Cleaning focuses on removing technical noise that would bias frequency distributions and entropy estimates, while preserving the linguistic signal:
– Unicode normalisation (UTF-8, NFC),
– correction of common mojibake artefacts,
– removal of URLs, e-mail addresses, file names, boilerplate and CMS/navigation lines,
– filtering of lines with a high proportion of non-letter characters,
– optional digit normalisation and lowercasing,
– language filtering to keep primarily Bosnian text.
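Several of the steps listed above can be sketched per line as follows. This is an illustrative sketch and not the code from the published pipeline: the regular expressions, the 0.6 letter-ratio threshold, the replacement of digit runs with "0", and the function names are all assumptions. Mojibake repair, boilerplate detection and language filtering are omitted.

```python
import re
import unicodedata

# Assumed patterns; the actual pipeline may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DIGITS_RE = re.compile(r"\d+")


def letter_ratio(line):
    """Share of alphabetic characters among non-whitespace characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def clean_line(line, min_letter_ratio=0.6, lowercase=False,
               normalise_digits=False):
    """Return the cleaned line, or None if the line should be dropped."""
    line = unicodedata.normalize("NFC", line)   # Unicode normalisation
    line = URL_RE.sub(" ", line)                # remove URLs
    line = EMAIL_RE.sub(" ", line)              # remove e-mail addresses
    line = " ".join(line.split())               # collapse whitespace
    if not line or letter_ratio(line) < min_letter_ratio:
        return None                             # mostly non-letter noise
    if normalise_digits:
        line = DIGITS_RE.sub("0", line)         # optional digit normalisation
    if lowercase:
        line = line.lower()                     # optional lowercasing
    return line
```

For example, `clean_line("Posjetite https://example.com danas")` returns `"Posjetite danas"`, while a line consisting only of digits and punctuation returns `None`.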
Files in this record:
– bosnian_corpus_all.txt (full corpus, all genres),
– per-genre text files (news, forum, opinion, info/howto, legal/admin, literature, ads/promo, mix/other),
– README.txt with dataset description,
– two accompanying research papers (Bosnian and English), uploaded separately as PDF files.
Code availability:
Preprocessing, cleaning and entropy-calculation scripts are publicly available on GitHub:
https://github.com/H4sK0/bosnian-corpus-pipeline
Licence:
This corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence.
Users must credit this Zenodo record and the original source corpora (Sarajevo SMS 1.1, bsWaC 1.1, CLASSLA-web.bs 1.0), and must distribute derivative corpora under the same or a compatible licence.
Suggested citation:
Hasan Kahrimanović (2025). Bosnian Corpus (v1.0): Cleaned Web and SMS Text for Entropy and NLP Research. Zenodo. DOI: [assigned by Zenodo].
Files
bosnian-corpus-1.0.zip
Additional details
Dates
- Issued
- 2025-11-29: First public release of the corpus
Software
- Repository URL
- https://github.com/H4sK0/bosnian-corpus-pipeline
- Programming language
- Python
- Development Status
- Active