Published July 11, 2025
| Version v1
Dataset
Open
2020-2022 Health-related GDELT news classified using ICD9-CM taxonomy
Authors/Creators
Description
This dataset contains health-related news data from the GDELT project with ICD9-CM annotations, covering January 2020 to December 2022. Each CSV file represents one month of data with the following fields:
- country_code: code of the country where the news originated
- news_datetime: timestamp of news publication
- json_col: a JSON object containing additional metadata from GDELT, including the field "quotes"
- icd9_code: list of the top 3 ICD9-CM codes obtained by zero-shot classification of the field "quotes"
- icd9_annotation: descriptions associated with the ICD9-CM codes in the field icd9_code
Files are named using YYYY_MM format (e.g., 2020_01.csv for January 2020).
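A minimal sketch of how one monthly file might be read, using only the Python standard library. The helper names (`month_filename`, `all_month_filenames`, `parse_list`, `read_month`) are hypothetical, and the sample row is made up for illustration. The sketch assumes that json_col is stored as a JSON string and that icd9_code is stored as a serialized list; the actual on-disk encoding should be checked against a real file.

```python
import ast
import csv
import io
import json

# Column names as listed in the description above.
FIELDS = ["country_code", "news_datetime", "json_col", "icd9_code", "icd9_annotation"]


def month_filename(year, month):
    """Build a file name following the YYYY_MM convention, e.g. 2020_01.csv."""
    return f"{year:04d}_{month:02d}.csv"


def all_month_filenames():
    """List the 36 expected file names, January 2020 to December 2022."""
    return [month_filename(y, m) for y in range(2020, 2023) for m in range(1, 13)]


def parse_list(text):
    """Parse a serialized list (assumption: JSON or Python-literal syntax)."""
    try:
        return json.loads(text)
    except ValueError:
        return ast.literal_eval(text)


def read_month(handle):
    """Read one monthly CSV from an open text handle and decode nested fields."""
    rows = []
    for row in csv.DictReader(handle):
        row["json_col"] = json.loads(row["json_col"])   # assumed JSON string
        row["icd9_code"] = parse_list(row["icd9_code"])  # assumed list string
        rows.append(row)
    return rows


# Made-up sample row standing in for a real file such as 2020_01.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "country_code": "US",
    "news_datetime": "2020-01-15 08:30:00",  # timestamp format is an assumption
    "json_col": json.dumps({"quotes": "example quote about flu season"}),
    "icd9_code": str(["487", "486", "780.6"]),
    "icd9_annotation": str(["description 1", "description 2", "description 3"]),
})
buffer.seek(0)

rows = read_month(buffer)
print(rows[0]["json_col"]["quotes"], rows[0]["icd9_code"])
```

With a real extract, the in-memory buffer would be replaced by `open(month_filename(2020, 1), newline="", encoding="utf-8")`.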
Files
| Name | Size | MD5 |
|---|---|---|
| ICD9_GDELT_dataset.zip | 1.8 GB | f804b83d4a1196ae1853a03779b90dbe |
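The MD5 checksum listed for the archive can be used to check that a download completed intact. The following is a sketch only: the function names `md5_of_file` and `matches_published_checksum` are hypothetical, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum published on this record for ICD9_GDELT_dataset.zip.
EXPECTED_MD5 = "f804b83d4a1196ae1853a03779b90dbe"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that a 1.8 GB archive does not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file at `path` has the published MD5 checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_published_checksum("ICD9_GDELT_dataset.zip")` would return `True` for an uncorrupted copy saved under that name in the working directory.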
Additional details
Additional titles
- Subtitle
- Zero-shot ICD9-CM classification of health news using MPNet transformer