CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024) — PostgreSQL edition
Authors/Creators
- 1. Université de Sherbrooke / Université de Montréal
- 2. Université de Montréal
Description
The Canadian Climate Framing (CCF) Database is a comprehensive, machine-learning-annotated corpus of climate-change media coverage in Canada. It comprises 266,271 articles from 20 major Canadian newspapers (1978–2024) processed into 9,198,958 two-sentence analytical units (82.9% English, 17.1% French). Each unit is annotated across 65 hierarchical categories by 128 BERT and CamemBERT classifiers, with a macro F1 of 0.866 on a 1,000-sentence gold standard double-coded by an independent annotator (Gwet's AC1 = 0.894, Krippendorff's α = 0.698, Cohen's κ = 0.596 on the 400 blind sentences). Each category receives an A/B/C reliability tier summarising annotation quality from classifier performance and inter-coder agreement. The deposit ships six relational tables (bibliographic metadata, sentence-level annotations, named-entity rollups, article-level aggregates, per-category reliability tiers, and 9,462,845 BAAI/bge-m3 sentence-and-title embeddings). Raw newspaper text is excluded for copyright reasons; bibliographic coordinates (media, date, title, author, page_number) are sufficient for any researcher with institutional access to Factiva, Eureka.cc, or ProQuest Canadian Major Dailies to recover the original sentences. This deposit accompanies a methodology paper currently under revision at Scientific Data (Nature Portfolio).
This deposit is the canonical PostgreSQL edition. It contains a pg_dump -Fd directory archive (compressed into a single .tar file) of the six relational tables, including the pgvector extension and HNSW cosine indexes for sub-second semantic-similarity search. Restoration is a one-liner:
tar -xf CCF_Database.tar && createdb CCF_Database && psql -d CCF_Database -c 'CREATE EXTENSION IF NOT EXISTS vector;' && pg_restore -d CCF_Database --no-owner --no-privileges -j 8 CCF_Database_dump
A column-oriented Apache Parquet mirror of the same six tables is available as the sister deposit on Zenodo (cross-referenced in Related identifiers). The Parquet mirror is recommended for users without PostgreSQL access (it is directly readable by pandas, polars, R/arrow, DuckDB, and Spark).
The full annotation pipeline, training data, manual-annotation JSONL, intercoder-reliability benchmark, methodology manuscript (LaTeX sources + PDF), and reproducibility scripts are bundled with this deposit as ccf_code_and_paper.tar.gz. The same materials are also available on the project's OSF companion deposit (10.17605/OSF.IO/Q5W47) and on the development mirror at GitHub.
Requirements: PostgreSQL 16 or 17 with pgvector ≥ 0.8.2 (for halfvec(1024) storage of the sentence embeddings).
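As an illustration of how the halfvec(1024) embeddings and the HNSW cosine indexes might be queried once the dump is restored, the following is a minimal sketch that builds a parameterised nearest-neighbour query. The table and column names (sentence_embeddings, embedding, sentence_id) are hypothetical placeholders, not the deposit's documented schema, and should be replaced after inspecting the restored database. The `<=>` operator is pgvector's cosine-distance operator.

```python
from typing import Sequence

# Hypothetical names: the actual table and column names are not given in this
# description and must be checked against the restored schema.
TABLE = "sentence_embeddings"
VECTOR_COLUMN = "embedding"
ID_COLUMN = "sentence_id"
DIM = 1024  # embedding width implied by halfvec(1024)


def to_vector_literal(values: Sequence[float]) -> str:
    """Format a sequence of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    if len(values) != DIM:
        raise ValueError(f"expected {DIM} dimensions, got {len(values)}")
    return "[" + ",".join(f"{v:.6g}" for v in values) + "]"


def nearest_neighbour_sql(k: int = 10) -> str:
    """Build a cosine nearest-neighbour query with a named parameter `q`.

    Ordering by the `<=>` distance with a LIMIT is the query shape that an
    HNSW cosine index can serve.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    distance = f"{VECTOR_COLUMN} <=> %(q)s::halfvec({DIM})"
    return (
        f"SELECT {ID_COLUMN}, {distance} AS cosine_distance "
        f"FROM {TABLE} "
        f"ORDER BY {distance} "
        f"LIMIT {int(k)}"
    )
```

With a PostgreSQL driver that accepts named parameters in the `%(q)s` style, the query string would be executed with `{"q": to_vector_literal(query_embedding)}`, where `query_embedding` is a 1,024-dimensional vector produced by the same BAAI/bge-m3 model.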
Notes
Methods
Climate-related newspaper articles published between 1978 and 2024 in 20 major Canadian outlets (national, regional, and French-language) were retrieved from Factiva, Eureka.cc, and ProQuest Canadian Major Dailies through the institutional subscriptions of Université de Montréal and Université de Sherbrooke. After language detection, deduplication, and a 100-word minimum-length filter, the 266,271 articles were segmented into 9,198,958 two-sentence analytical units with spaCy.
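The length filter and the grouping into two-sentence units can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: sentence splitting (done with spaCy in the corpus) is taken as already performed, words are counted by whitespace, and units are assumed to be non-overlapping pairs of consecutive sentences with a trailing odd sentence kept as its own unit. The description does not say whether the actual units overlap.

```python
from typing import List

MIN_WORDS = 100  # minimum article length stated in the text


def passes_length_filter(text: str, min_words: int = MIN_WORDS) -> bool:
    """Keep an article only if it has at least `min_words` whitespace-separated words.

    Assumption: whitespace tokenisation; the counting rule actually used is not specified.
    """
    return len(text.split()) >= min_words


def two_sentence_units(sentences: List[str]) -> List[str]:
    """Group already-split sentences into units of two consecutive sentences.

    Assumption: pairs do not overlap, and an odd final sentence forms a
    one-sentence unit.
    """
    return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
```

For example, `two_sentence_units(["A.", "B.", "C."])` yields two units, the first joining "A." and "B." and the second holding "C." alone.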
Each unit was annotated across 65 hierarchical binary categories covering eight main frames (Economic, Health, Security, Justice, Political, Scientific, Environmental, Cultural), actors/messengers, events, solutions, emotional tone, geographic focus and urgency. Annotation relies on 128 transformer-based classifiers (BERT for English, CamemBERT for French) trained on more than 4,000 expert-coded sentences, with a reinforced-training phase triggered for low-F1 categories. Models reach a macro F1 of 0.866 on a stratified 1,000-sentence gold standard double-coded by an independent annotator (Gwet's AC1 = 0.894, Krippendorff's α = 0.698, Cohen's κ = 0.596 on the 400 blind sentences).
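The three agreement coefficients reported above correct for chance agreement in different ways, which is why they can differ on the same double-coded sample. The sketch below computes observed agreement, Cohen's κ, and Gwet's AC1 for a single binary category using their standard two-coder formulas. The toy labels are invented for illustration and are not drawn from the CCF gold standard.

```python
from typing import Sequence, Tuple


def binary_agreement(coder_a: Sequence[int], coder_b: Sequence[int]) -> Tuple[float, float, float]:
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for 0/1 labels."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # observed agreement
    p_a = sum(coder_a) / n  # share of positives for coder A
    p_b = sum(coder_b) / n  # share of positives for coder B

    # Cohen's kappa: chance agreement from each coder's own marginals.
    pe_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = float("nan") if pe_kappa == 1 else (p_o - pe_kappa) / (1 - pe_kappa)

    # Gwet's AC1 (two categories): chance agreement from the mean prevalence.
    pi = (p_a + p_b) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return p_o, kappa, ac1


# Invented example of a rare category: 2 joint positives, 8 disagreements,
# 90 joint negatives out of 100 items.
coder_a = [1] * 2 + [1] * 4 + [0] * 4 + [0] * 90
coder_b = [1] * 2 + [0] * 4 + [1] * 4 + [0] * 90
p_o, kappa, ac1 = binary_agreement(coder_a, coder_b)
```

In this invented example the coders agree on 92% of items, yet κ is far lower than AC1 because the category is rare. Skewed prevalence is one well-known reason for such gaps; whether it accounts for the gap in the CCF figures is not stated in this description.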
Per-sentence named entities (persons, organisations, locations) are extracted with a hybrid spaCy + BERT pipeline. Article-level rollups (top frame, framing entropy, deduplicated named-entity arrays) are materialised on PostgreSQL together with 9,462,845 BAAI/bge-m3 sentence-and-title embeddings stored as halfvec(1024) with HNSW cosine indexing through pgvector. A per-category A/B/C reliability tier, jointly determined by classifier macro F1 and inter-coder agreement, accompanies every annotation.
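To make the article-level rollups concrete, the following sketch derives a top frame and a framing entropy from per-article frame counts. It assumes that framing entropy is the Shannon entropy (in bits) of the distribution of frame-positive units across the main frames. The description does not give the exact definition, normalisation, or logarithm base used in the database, so this is one plausible reading rather than the deposit's formula.

```python
import math
from typing import Dict, Optional, Tuple


def framing_summary(frame_counts: Dict[str, int]) -> Tuple[Optional[str], float]:
    """Return (top frame, Shannon entropy in bits) for one article.

    `frame_counts` maps a frame name to the number of units in the article
    annotated with that frame. Assumption: unnormalised entropy, log base 2.
    """
    total = sum(frame_counts.values())
    if total == 0:
        return None, 0.0  # no frame detected in the article
    # On ties, max() keeps the first frame in insertion order.
    top = max(frame_counts, key=frame_counts.get)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in frame_counts.values()
        if count > 0
    )
    return top, entropy
```

Under this reading, an article covered by a single frame has entropy 0, while an article split evenly between two frames has entropy 1 bit.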
Files
CCF_Methodology.pdf
Total deposit size: 39.8 GB
Additional details
Related works
- Is source of: 10.5281/zenodo.20667154 (DOI)
- Is supplement to: 10.17605/OSF.IO/Q5W47 (DOI)
Dates
- Collected: 1978-01-01/2024-12-31. Publication dates of the 266,271 climate-related newspaper articles in the CCF corpus.
References
- Lemor, A., Pillod, A., Taylor, M., & Nadeau, R. (2026). The Canadian Climate Framing (CCF) database: a sentence-level annotated corpus for the analysis of climate-change discourse in the Canadian press. Scientific Data (under revision).