CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024) — PostgreSQL edition

Lemor, Antoine; Pillod, Alizée; Taylor, Matthew; Nadeau, Richard

doi:10.5281/zenodo.20667151

Published June 12, 2026 | Version 1.1.0

Dataset Open

1. Université de Sherbrooke / Université de Montréal
2. Université de Montréal

Contributors

Data curator:

Rouyer, Maëlle¹

1. Université de Montréal

The Canadian Climate Framing (CCF) Database is a comprehensive, machine-learning-annotated corpus of climate-change media coverage in Canada. It comprises 266,271 articles from 20 major Canadian newspapers (1978-2024) processed into 9,198,958 two-sentence analytical units (82.9% English, 17.1% French). Each unit is annotated across 65 hierarchical categories by 128 BERT and CamemBERT classifiers, with a macro F1 of 0.866 on a 1,000-sentence gold standard double-coded by an independent annotator (Gwet's AC1 = 0.894, Krippendorff's α = 0.698, Cohen's κ = 0.596 on the 400 blind sentences). Each category receives an A/B/C reliability tier summarising annotation quality from classifier performance and inter-coder agreement. The deposit ships six relational tables (bibliographic metadata, sentence-level annotations, named-entity rollups, article-level aggregates, per-category reliability tiers, and 9,462,845 BAAI/bge-m3 sentence-and-title embeddings). Raw newspaper text is excluded for copyright reasons; bibliographic coordinates (media, date, title, author, page_number) are sufficient for any researcher with institutional access to Factiva, Eureka.cc or ProQuest Canadian Major Dailies to recover the original sentences. This deposit accompanies a methodology paper currently under revision at Scientific Data (Nature Portfolio).

This deposit is the canonical PostgreSQL edition. It contains a pg_dump -Fd directory archive of the six relational tables, packaged as a single .tar file, including the pgvector extension and HNSW cosine indexes for sub-second semantic-similarity search. Restoration is a one-liner:

tar -xf CCF_Database.tar && createdb CCF_Database && psql -d CCF_Database -c 'CREATE EXTENSION IF NOT EXISTS vector;' && pg_restore -d CCF_Database --no-owner --no-privileges -j 8 CCF_Database_dump

A column-oriented Apache Parquet mirror of the same six tables is available as the sister deposit on Zenodo (cross-referenced in Related identifiers). The Parquet mirror is recommended for users without PostgreSQL access (it is directly readable by pandas, polars, R/arrow, DuckDB, and Spark).

The full annotation pipeline, training data, manual-annotation JSONL, intercoder-reliability benchmark, methodology manuscript (LaTeX sources + PDF), and reproducibility scripts are bundled with this deposit as ccf_code_and_paper.tar.gz. The same materials are also available on the project's OSF companion deposit (10.17605/OSF.IO/Q5W47) and on the development mirror at GitHub.

Requirements: PostgreSQL 16 or 17 with pgvector ≥ 0.8.2 (for halfvec(1024) storage of the sentence embeddings).
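The pgvector requirement above can be checked after a restore. The following is a minimal sketch, not part of the deposit: version_ge and check_pgvector are hypothetical helper names, the comparison assumes GNU sort (for the -V flag), and the database name CCF_Database is taken from the restore command.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B.
# Assumes GNU `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_pgvector [DB]: reads the installed pgvector version from the
# pg_extension catalog and compares it with the stated 0.8.2 minimum.
# Hypothetical helper; it is defined here but not called automatically.
check_pgvector() {
    local db="${1:-CCF_Database}" installed
    installed="$(psql -d "$db" -At \
        -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';")"
    if [ -n "$installed" ] && version_ge "$installed" "0.8.2"; then
        echo "pgvector $installed OK"
    else
        echo "pgvector ${installed:-not installed} does not meet >= 0.8.2" >&2
        return 1
    fi
}

# Usage, once the database has been restored:
#   check_pgvector CCF_Database
```

A plain string comparison would rank 0.10.0 below 0.8.2, which is why the sketch uses a version sort.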

Notes

Version 1.1.0 (2026-06-12): schema-level update to v1.0.0.

Column rename in CCF_article_aggregates: dominant_frame → top_frame; dominant_frame_prop → top_frame_prop. The semantics are unchanged (the frame with the highest prop_X in the article); the rename avoids a collision with the meaning that 'dominant frame' carries in a separate paper by the same authors.
Throughout the manuscript and documentation, 'thematic frames' is replaced by 'main frames' to remove the collision with the narrower definition of 'thematic frame' in part of the framing literature. The eight main frames are unchanged.
Reproducible tables (Supplementary Tables S3–S12, Data Overview tables) and the CODEBOOK were regenerated from the renamed schema; the 'Thematic frames' section header in Supplementary Table S3 was renamed to 'Main frames'.
Manuscript abstract revised for clarity; new Acknowledgements section added.
Row counts and statistical results are unchanged, and the trained models are byte-identical to v1.0.0.

The original v1.0.0 DOIs (10.5281/zenodo.20346364 and 10.5281/zenodo.20346373) remain valid for v1.0.0 of the deposit (with the original dominant_frame column).
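Queries written against v1.0.0 need only the two column names updated. As a rough sketch, assuming GNU sed (for the \b word-boundary extension), existing SQL text could be rewritten as follows. migrate_frame_columns is a hypothetical helper name, and the double-quoted table name in the example is an assumption about how the mixed-case identifier is stored.

```shell
#!/usr/bin/env bash
# migrate_frame_columns: hypothetical helper that rewrites the v1.0.0 column
# names (dominant_frame, dominant_frame_prop) to their v1.1.0 equivalents
# (top_frame, top_frame_prop) in SQL text read from stdin.
# Relies on GNU sed for the \b word-boundary extension.
migrate_frame_columns() {
    sed -E 's/\bdominant_frame(_prop)?\b/top_frame\1/g'
}

# Example: rewrite a v1.0.0-era query so it matches the v1.1.0 schema.
# The quoted table name is an assumption, not confirmed by the deposit.
echo 'SELECT dominant_frame, dominant_frame_prop FROM "CCF_article_aggregates";' \
    | migrate_frame_columns
```

The word boundaries keep the substitution from touching longer identifiers that merely begin with dominant_frame.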

Methods

Climate-related newspaper articles published between 1978 and 2024 in 20 major Canadian outlets (national, regional, and French-language) were retrieved from Factiva, Eureka.cc, and ProQuest Canadian Major Dailies through the institutional subscriptions of Université de Montréal and Université de Sherbrooke. After language detection, deduplication, and a 100-word minimum-length filter, the 266,271 articles were segmented into 9,198,958 two-sentence analytical units with spaCy.

Each unit was annotated across 65 hierarchical binary categories covering eight main frames (Economic, Health, Security, Justice, Political, Scientific, Environmental, Cultural), actors/messengers, events, solutions, emotional tone, geographic focus and urgency. Annotation relies on 128 transformer-based classifiers (BERT for English, CamemBERT for French) trained on more than 4,000 expert-coded sentences, with a reinforced-training phase triggered for low-F1 categories. Models reach a macro F1 of 0.866 on a stratified 1,000-sentence gold standard double-coded by an independent annotator (Gwet's AC1 = 0.894, Krippendorff's α = 0.698, Cohen's κ = 0.596 on the 400 blind sentences).

Per-sentence named entities (persons, organisations, locations) are extracted with a hybrid spaCy + BERT pipeline. Article-level rollups (top frame, framing entropy, deduplicated named-entity arrays) are materialised in PostgreSQL together with 9,462,845 BAAI/bge-m3 sentence-and-title embeddings stored as halfvec(1024) with HNSW cosine indexing through pgvector. A per-category A/B/C reliability tier, jointly determined by classifier macro F1 and inter-coder agreement, accompanies every annotation.

Files (39.8 GB)

Name	MD5	Size
ccf_code_and_paper.tar.gz	4188d5411f2995cd054af9095b8bdbbf	27.2 MB
CCF_Database.tar	c4a3ad65c220d0a9d832b7857799cc23	39.7 GB
CCF_Database.tar.sha256	9f1d4383dbc2c1c777f814c497237b4b	83 Bytes
CCF_Methodology.pdf	33f7fc9b9304d2d4355694b918d293f1	2.3 MB
CCF_Methodology_SI.pdf	ef81a9dad21c871212b7511432abaad3	686.4 kB
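
Given the size of the archive, it may be worth checking the download against the MD5 listed above before restoring. The following is a minimal sketch, assuming GNU coreutils (md5sum, sha256sum). verify_md5 is a hypothetical helper name, and the sidecar check assumes CCF_Database.tar.sha256 is in the standard sha256sum format, which its 83-byte size is consistent with but which the record does not state.

```shell
#!/usr/bin/env bash
# verify_md5 FILE EXPECTED: hypothetical helper that compares the MD5 digest
# of FILE with the EXPECTED hex string. Uses md5sum from GNU coreutils.
verify_md5() {
    local file="$1" expected="$2" actual
    actual="$(md5sum "$file" | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Runs only when the archive is present in the current directory.
if [ -f CCF_Database.tar ]; then
    # MD5 copied from the file listing of this record.
    verify_md5 CCF_Database.tar c4a3ad65c220d0a9d832b7857799cc23
    # Assumption: the sidecar is in standard `sha256sum` output format.
    if [ -f CCF_Database.tar.sha256 ]; then
        sha256sum -c CCF_Database.tar.sha256
    fi
fi
```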

Additional details

Related works

Is source of: 10.5281/zenodo.20667154 (DOI)
Is supplement to: 10.17605/OSF.IO/Q5W47 (DOI)

Dates

Collected: 1978-01-01/2024-12-31

Publication dates of the 266,271 climate-related newspaper articles in the CCF corpus.

References

Lemor, A., Pillod, A., Taylor, M., & Nadeau, R. (2026). The Canadian Climate Framing (CCF) database: a sentence-level annotated corpus for the analysis of climate-change discourse in the Canadian press. Scientific Data (under revision).
