Published February 10, 2026 | Version V 1.0
Software documentation | Open

Bosnian CORE NLP Standard (BCS-compatible): Text Normalization & Tokenization — v1.0-LTS

Description

Bosnian CORE NLP Standard v1.0-LTS defines a deterministic, loss-controlled normalization, segmentation, and tokenization specification for Bosnian/BCS text intended for reproducible corpus statistics and comparable NLP experiments across heterogeneous sources (web, PDFs, subtitles, social media, OCR). The standard fixes rule ordering, artifact contracts, token typing, offset conventions, and export schemas, while allowing a small set of explicitly logged policy switches (case/number/emoji/newline).

The specification introduces a three-level normalization stack (text_raw → text_nfc → text_clean), a strict run directory layout with immutable outputs, and machine-readable metadata (run_metadata.json), manifests, and SHA-256 checksums. Tokenization outputs are delivered as JSONL streams (tokens_core.jsonl, segments_core.jsonl) designed for scalable processing and stable downstream ingestion (Python/R/SQL).
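The three-level stack and the checksum bookkeeping can be illustrated with a minimal Python sketch. It is not the normative pipeline: the `Policy` field names, the cleaning rules inside `to_clean`, and the layout of the metadata dictionary are assumptions made for illustration. Only the level names (text_raw, text_nfc, text_clean), the use of NFC, and the use of SHA-256 come from the description above.

```python
import hashlib
import json
import re
import unicodedata
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Policy:
    # Hypothetical switch names; the description only lists the
    # switch families (case/number/emoji/newline).
    lowercase: bool = False
    collapse_newlines: bool = True


def to_nfc(text_raw: str) -> str:
    """Level 2: Unicode NFC composition of the raw text."""
    return unicodedata.normalize("NFC", text_raw)


def to_clean(text_nfc: str, policy: Policy) -> str:
    """Level 3: illustrative cleaning rules (assumed, not normative)."""
    text = text_nfc.replace("\r\n", "\n").replace("\r", "\n")
    if policy.collapse_newlines:
        text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    if policy.lowercase:
        text = text.lower()
    return text.strip()


def sha256_hex(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of one normalization level."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_stack(text_raw: str, policy: Policy):
    """Produce all three levels plus a metadata record that logs the policy."""
    text_nfc = to_nfc(text_raw)
    text_clean = to_clean(text_nfc, policy)
    levels = {"text_raw": text_raw, "text_nfc": text_nfc, "text_clean": text_clean}
    metadata = {
        "policy": asdict(policy),
        "sha256": {name: sha256_hex(value) for name, value in levels.items()},
    }
    return levels, metadata


if __name__ == "__main__":
    # "c" followed by a combining caron composes to a single "č" under NFC.
    levels, metadata = run_stack("c\u030cas\r\n\r\n  je", Policy())
    print(json.dumps(metadata, indent=2, sort_keys=True))
```

Keeping every level and hashing each one separately means a later run can show which stage introduced a difference, which is the kind of reproducibility check the immutable run directory is meant to support.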

Key comparability contract:

  • Canonical token inclusion sets for metrics: Set A (lexical), Set B (lexical+punctuation), Set C (full stream), with explicit URL/EMAIL handling.

  • Normative definitions of denominators N and type counts V to prevent cross-study mismatch.

  • Normative frequency and n-gram export formats and deterministic sorting rules.
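The comparability contract above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the JSONL field names (`text`, `type`), the type labels (`WORD`, `PUNCT`, `URL`), the choice to leave URL/EMAIL tokens out of Sets A and B, and the tie-breaking sort order. The specification itself is the authority on all of these. The sketch treats N as the number of tokens in the chosen set and V as the number of distinct surface forms in it.

```python
import json
from collections import Counter

# Hypothetical type labels; the real label inventory is defined by the spec.
SET_A = frozenset({"WORD"})            # lexical
SET_B = frozenset({"WORD", "PUNCT"})   # lexical + punctuation
SET_C = None                           # full stream, no filtering


def load_tokens(lines):
    """Parse a JSONL stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def select(tokens, inclusion):
    """Keep only tokens whose type belongs to the inclusion set."""
    if inclusion is None:
        return list(tokens)
    return [tok for tok in tokens if tok["type"] in inclusion]


def frequency_table(tokens, inclusion):
    """Return (N, V, rows) with rows sorted deterministically.

    Assumed sort: frequency descending, then surface form ascending
    by code point, so ties never depend on input order.
    """
    selected = select(tokens, inclusion)
    freq = Counter(tok["text"] for tok in selected)
    rows = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return len(selected), len(freq), rows


if __name__ == "__main__":
    sample = [
        '{"text": "kuća", "type": "WORD"}',
        '{"text": ",", "type": "PUNCT"}',
        '{"text": "kuća", "type": "WORD"}',
        '{"text": "i", "type": "WORD"}',
        '{"text": "https://example.org", "type": "URL"}',
        '{"text": ".", "type": "PUNCT"}',
    ]
    tokens = load_tokens(sample)
    for name, inclusion in (("A", SET_A), ("B", SET_B), ("C", SET_C)):
        n, v, rows = frequency_table(tokens, inclusion)
        print(name, n, v, rows)
```

The same token stream yields a different N and V for each set, which is why a reported type/token figure is only comparable across studies when the inclusion set is stated alongside it.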

Included in this release:

  • Specification PDF (compile-ready, referenceable).

  • LaTeX source bundle (sections/, bib/, assets/ test files).

  • Citation metadata (CITATION.cff) and licensing guidance for spec/code/assets.

Files

bcs-core-nlp-standard-latex.zip

Files (655.8 kB)

Checksum | Size
md5:13942a9bdc6976ae26a9b8d3cb39ef74 | 88.3 kB
md5:791c8dec5a00230aa177c494fe45b0ed | 567.5 kB

Additional details

Software

Development Status
Active