CLIC24 – Climate Change Multilingual Media Corpus 2024
Authors/Creators
- Kubát, Miroslav (Contact person)
- Nogolová, Michaela (Data collector)
- Mostýn, Martin (Data collector)
- Místecký, Michal (Data collector)
- Beneš Kováčová, Dominika (Data collector)
- Šlechta, Petr (Data collector)
- Pišl, Milan (Data collector)
- Lukl, Jiří (Data collector)
- Chen, Xinying (Data collector)
- Vankova, Lenka (Data collector)
Description
The CLIC24 corpus is a multilingual collection of journalistic texts on climate-change-related topics. The corpus contains texts in Czech, German, English, and Spanish, collected automatically using a custom extraction script.
For each language, a predefined set of topic-specific keywords related to climate change was used. An article was included in the corpus if at least one keyword occurred either in the article title or in the article body.
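The inclusion rule above can be illustrated with a minimal sketch. This is not the project's extraction script: the function name `matches_keywords`, the sample keywords, and the choice of case-insensitive substring matching are all assumptions made for illustration. The actual keyword lists are given in the corpus documentation.

```python
def matches_keywords(title: str, body: str, keywords: list[str]) -> bool:
    """Return True if at least one keyword occurs in the title or the body.

    Assumption: matching is a case-insensitive substring test; the real
    script may tokenize or lemmatize instead.
    """
    title_cf = title.casefold()
    body_cf = body.casefold()
    return any(
        kw.casefold() in title_cf or kw.casefold() in body_cf
        for kw in keywords
    )


# Hypothetical English keywords, not the project's actual list.
sample_keywords = ["climate change", "global warming"]

included = matches_keywords(
    "Global warming accelerates", "Scientists report new data.", sample_keywords
)
excluded = matches_keywords(
    "Football results", "The home team won.", sample_keywords
)
```

Substring matching is the simplest reading of "at least one keyword occurred"; for inflected languages such as Czech, a real pipeline would probably need keyword variants or lemmatization.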
During data extraction, texts were cleaned of navigation elements, advertisements, and other typical website noise. Articles identified as incomplete (e.g., due to paywalls) were excluded. Duplicate texts were removed based on URL comparison.
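As a rough sketch of URL-based duplicate removal, the snippet below keeps the first record seen for each URL. The record layout (a dict with a `url` key) and the function name `drop_duplicate_urls` are hypothetical, and exact string comparison is an assumption; the description does not say whether URLs were normalized before comparison.

```python
def drop_duplicate_urls(records: list[dict]) -> list[dict]:
    """Keep only the first record for each URL, preserving input order.

    Assumption: URLs are compared as exact strings, without normalization.
    """
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        url = record["url"]
        if url in seen:
            continue  # same URL already kept, so skip this record
        seen.add(url)
        unique.append(record)
    return unique


# Hypothetical records for illustration only.
sample_records = [
    {"url": "https://example.org/a", "title": "First"},
    {"url": "https://example.org/b", "title": "Second"},
    {"url": "https://example.org/a", "title": "First (repeat)"},
]
deduplicated = drop_duplicate_urls(sample_records)
```

Note that URL comparison alone does not catch the same article republished under a different address.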
Only articles published between January 1, 2024 and December 31, 2024 were included. Records without a reliably traceable publication date were excluded.
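The date criterion can be sketched as follows. The function name `in_collection_window` and the use of `None` to represent a missing or unreliable publication date are assumptions for illustration, not details of the original script.

```python
from datetime import date
from typing import Optional

# Bounds stated in the description, both inclusive.
WINDOW_START = date(2024, 1, 1)
WINDOW_END = date(2024, 12, 31)


def in_collection_window(published: Optional[date]) -> bool:
    """Return True only for records dated within 2024.

    Assumption: a record without a reliable date is represented as None
    and is therefore excluded.
    """
    if published is None:
        return False
    return WINDOW_START <= published <= WINDOW_END


kept = in_collection_window(date(2024, 6, 15))
too_early = in_collection_window(date(2023, 12, 31))
undated = in_collection_window(None)
```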
The corpus documentation includes an overview of individual media sources, the number of included texts, and the list of keywords used for data collection.
The dataset was created for linguistic and interdisciplinary research on media discourse, disinformation, and climate change communication.
Access and Licensing Conditions:
Metadata-only dataset.
Due to copyright restrictions, the original full-text articles included in the CLIC24 dataset cannot be redistributed.
Use of the dataset outside the scope of the grant project requires permission from the authors.
Files
CLIC24_Metadata_Czech_Media.csv
Additional details
Funding
Dates
- Collected: 2024 (time range of published articles)