Bosnian CORE NLP Standard (BCS-compatible): Text Normalization & Tokenization — v1.0-LTS

Kahrimanovic, Hasan

doi:10.5281/zenodo.18570562

Published February 10, 2026 | Version 1.0

Software documentation | Open access

Kahrimanovic, Hasan (Data curator)

Bosnian CORE NLP Standard v1.0-LTS defines a deterministic, loss-controlled normalization, segmentation, and tokenization specification for Bosnian/BCS text intended for reproducible corpus statistics and comparable NLP experiments across heterogeneous sources (web, PDFs, subtitles, social media, OCR). The standard fixes rule ordering, artifact contracts, token typing, offset conventions, and export schemas, while allowing a small set of explicitly logged policy switches (case/number/emoji/newline).

The specification introduces a three-level normalization stack (text_raw → text_nfc → text_clean), a strict run directory layout with immutable outputs, and machine-readable metadata (run_metadata.json), manifests, and SHA-256 checksums. Tokenization outputs are delivered as JSONL streams (tokens_core.jsonl, segments_core.jsonl) designed for scalable processing and stable downstream ingestion (Python/R/SQL).
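As a rough illustration of the three-level stack and the checksum/JSONL artifacts described above, the following Python sketch derives text_nfc and text_clean from text_raw, serializes records as JSONL, and hashes the result with SHA-256. The NFC step uses the standard library. The text_clean rule shown here (collapsing whitespace runs) and the helper names normalize_stack, to_jsonl, and sha256_hex are assumptions made for illustration only; the normative cleaning rules and their ordering are defined in the specification PDF.

```python
import hashlib
import json
import unicodedata


def normalize_stack(text_raw: str) -> dict:
    """Return the three normalization levels for one input string."""
    # Level 2: Unicode NFC composition (standard library).
    text_nfc = unicodedata.normalize("NFC", text_raw)
    # Level 3: ASSUMPTION - only collapse whitespace runs and trim the ends.
    # The real text_clean rules live in the specification, not in this sketch.
    text_clean = " ".join(text_nfc.split())
    return {"text_raw": text_raw, "text_nfc": text_nfc, "text_clean": text_clean}


def to_jsonl(records: list) -> str:
    """Serialize records one per line with sorted keys for stable bytes."""
    return "".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n"
        for record in records
    )


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest, as one might store in a manifest."""
    return hashlib.sha256(data).hexdigest()


# "C" + combining caron and "c" + combining acute compose to "Č" and "ć".
levels = normalize_stack("C\u030caj  je\tvruc\u0301.")
payload = to_jsonl([levels]).encode("utf-8")
checksum = sha256_hex(payload)
```

Sorting keys and fixing the encoding to UTF-8 is one way to make the serialized bytes, and therefore the checksum, repeatable across runs; the specification may prescribe a different canonical form.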

Key comparability contract:

Canonical token inclusion sets for metrics: Set A (lexical), Set B (lexical+punctuation), Set C (full stream), with explicit URL/EMAIL handling.
Normative definitions of denominators N and type counts V to prevent cross-study mismatch.
Normative frequency and n-gram export formats and deterministic sorting rules.
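
The comparability contract above can be sketched in Python as follows. This is a minimal illustration, not the normative definition: the token type labels ("WORD", "PUNCT", "URL"), the record fields ("text", "type"), and the sort order (descending count, then ascending token text) are assumptions, and the specification's own URL/EMAIL handling is not reproduced here.

```python
from collections import Counter

# ASSUMPTION: hypothetical type labels; the specification defines the real ones.
SET_A = {"WORD"}             # lexical tokens only
SET_B = SET_A | {"PUNCT"}    # lexical + punctuation
SET_C = None                 # full stream: no filtering


def select(tokens, inclusion_set):
    """Keep tokens whose type is in the inclusion set (all tokens for Set C)."""
    return [t for t in tokens if inclusion_set is None or t["type"] in inclusion_set]


def frequency_table(tokens):
    """Return (N, V, rows): token count, type count, and sorted frequencies."""
    counts = Counter(t["text"] for t in tokens)
    n = sum(counts.values())   # denominator N: tokens in the chosen set
    v = len(counts)            # V: distinct types in the chosen set
    # Ties are broken by token text so the output order is reproducible.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return n, v, rows


tokens = [
    {"text": "kuća", "type": "WORD"},
    {"text": ",", "type": "PUNCT"},
    {"text": "kuća", "type": "WORD"},
    {"text": "more", "type": "WORD"},
    {"text": "https://example.org", "type": "URL"},
]

n_a, v_a, rows_a = frequency_table(select(tokens, SET_A))  # N=3, V=2
n_b, v_b, rows_b = frequency_table(select(tokens, SET_B))  # N=4, V=3
n_c, v_c, rows_c = frequency_table(select(tokens, SET_C))  # N=5, V=4
```

The same stream yields three different values of N and V depending on the inclusion set, which is why a study reporting relative frequencies needs to state which set it used.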

Included in this release:

Specification PDF (compile-ready, referenceable).
LaTeX source bundle (sections/, bib/, assets/ test files).
Citation metadata (CITATION.cff) and licensing guidance for spec/code/assets.

Files (655.8 kB)

Name	MD5	Size
bcs-core-nlp-standard-latex.zip	13942a9bdc6976ae26a9b8d3cb39ef74	88.3 kB
Bosnian_CORE_NLP_Standard_v1.0-LTS.pdf	791c8dec5a00230aa177c494fe45b0ed	567.5 kB
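
A downloaded copy can be checked against the MD5 digests in the file listing with a few lines of Python. This is a sketch under the assumption that the files sit in the current working directory under their listed names; the helper names md5_hex and verify are illustrative and not part of the release.

```python
import hashlib

# Digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "bcs-core-nlp-standard-latex.zip": "13942a9bdc6976ae26a9b8d3cb39ef74",
    "Bosnian_CORE_NLP_Standard_v1.0-LTS.pdf": "791c8dec5a00230aa177c494fe45b0ed",
}


def md5_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Return True when the file's digest matches the expected value."""
    return md5_hex(path) == expected_md5.lower()
```

For example, verify("bcs-core-nlp-standard-latex.zip", EXPECTED_MD5["bcs-core-nlp-standard-latex.zip"]) returns True only when the local archive is byte-identical to the published one. MD5 is suitable here for detecting transfer corruption, not for defending against deliberate tampering.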

Additional details

Development Status: Active

	All versions	This version
Views	74	74
Downloads	1	1
Data volume	567.5 kB	567.5 kB
