Bosnian CORE NLP Standard (BCS-compatible): Text Normalization & Tokenization — v1.0-LTS
Authors/Creators
Description
Bosnian CORE NLP Standard v1.0-LTS defines a deterministic, loss-controlled normalization, segmentation, and tokenization specification for Bosnian/BCS text intended for reproducible corpus statistics and comparable NLP experiments across heterogeneous sources (web, PDFs, subtitles, social media, OCR). The standard fixes rule ordering, artifact contracts, token typing, offset conventions, and export schemas, while allowing a small set of explicitly logged policy switches (case/number/emoji/newline).
The specification introduces a three-level normalization stack (text_raw → text_nfc → text_clean), a strict run directory layout with immutable outputs, and machine-readable metadata (run_metadata.json), manifests, and SHA-256 checksums. Tokenization outputs are delivered as JSONL streams (tokens_core.jsonl, segments_core.jsonl) designed for scalable processing and stable downstream ingestion (Python/R/SQL).
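The three-level stack and checksum step can be illustrated with a minimal sketch. Only the level names (text_raw, text_nfc, text_clean) and the use of SHA-256 come from the description above. The function names `normalize_stack` and `sha256_hex`, and the whitespace-collapsing rule used for text_clean, are assumptions for illustration; the actual cleaning rules are defined in the specification PDF.

```python
import hashlib
import re
import unicodedata


def normalize_stack(text_raw: str) -> dict:
    """Return all three levels so that no earlier level is overwritten."""
    # Level 2: Unicode NFC, so that c + combining caron becomes a single 'č'.
    text_nfc = unicodedata.normalize("NFC", text_raw)
    # Level 3: ASSUMED placeholder rule (collapse whitespace runs, trim ends).
    # The real text_clean rules are those fixed by the standard, not this line.
    text_clean = re.sub(r"\s+", " ", text_nfc).strip()
    return {"text_raw": text_raw, "text_nfc": text_nfc, "text_clean": text_clean}


def sha256_hex(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of a text level (encoding is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Decomposed input: 'c' + U+030C, 's' + U+030C, 'c' + U+0301.
sample = "  c\u030caj   s\u030cec\u0301er "
levels = normalize_stack(sample)
checksums = {name: sha256_hex(value) for name, value in levels.items()}
```

Keeping every level in the returned mapping mirrors the idea of immutable outputs: a later stage can always be recomputed or audited against the earlier one, and each level can carry its own checksum.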
Key comparability contract:
- Canonical token inclusion sets for metrics: Set A (lexical), Set B (lexical+punctuation), Set C (full stream), with explicit URL/EMAIL handling.
- Normative definitions of denominators N and type counts V to prevent cross-study mismatch.
- Normative frequency and n-gram export formats and deterministic sorting rules.
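As a rough illustration of why the inclusion set matters, the sketch below computes N, V, and a deterministically sorted frequency list from a JSONL token stream. The field names (`text`, `type`), the type labels (`WORD`, `PUNCT`, `URL`), the mapping of labels to Sets A/B/C, and the tie-breaking rule (count descending, then token ascending by code point) are all assumptions made for this example; the normative schema and sorting rules are those given in the specification.

```python
import json
from collections import Counter

# ASSUMED mapping of hypothetical type labels to the inclusion sets.
INCLUSION_SETS = {
    "A": {"WORD"},            # lexical tokens only
    "B": {"WORD", "PUNCT"},   # lexical + punctuation
    "C": None,                # full stream: no filtering
}


def counts_for(jsonl_lines, set_name):
    """Return (N, V, rows) for one inclusion set.

    N is the number of included tokens, V the number of distinct types,
    rows the frequency list in a fixed, reproducible order.
    """
    allowed = INCLUSION_SETS[set_name]
    freq = Counter()
    for line in jsonl_lines:
        token = json.loads(line)
        if allowed is None or token["type"] in allowed:
            freq[token["text"]] += 1
    n = sum(freq.values())
    v = len(freq)
    # Deterministic order: highest count first, ties broken by code point.
    rows = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return n, v, rows


# Tiny in-memory stand-in for a tokens_core.jsonl stream.
stream = [
    json.dumps({"text": text, "type": kind}, ensure_ascii=False)
    for text, kind in [
        ("kuća", "WORD"),
        ("je", "WORD"),
        ("kuća", "WORD"),
        (".", "PUNCT"),
        ("https://example.org", "URL"),
    ]
]

result_a = counts_for(stream, "A")
result_b = counts_for(stream, "B")
result_c = counts_for(stream, "C")
```

On this toy stream the same text yields N = 3, 4, or 5 depending on the set, which is exactly the kind of denominator mismatch the contract is meant to rule out when results from different studies are compared.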
Included in this release:
- Specification PDF (compile-ready, referenceable).
- LaTeX source bundle (sections/, bib/, assets/ test files).
- Citation metadata (CITATION.cff) and licensing guidance for spec/code/assets.
Files
bcs-core-nlp-standard-latex.zip
Total size: 655.8 kB
| Checksum | Size |
|---|---|
| md5:13942a9bdc6976ae26a9b8d3cb39ef74 | 88.3 kB |
| md5:791c8dec5a00230aa177c494fe45b0ed | 567.5 kB |
Additional details
Software
- Development Status: Active