Published July 11, 2025 | Version v1
Dataset Open

2020-2022 Health-related GDELT news classified using ICD9-CM taxonomy

Description

This dataset contains health-related news data from the GDELT project, annotated with ICD9-CM codes and covering January 2020 to December 2022. Each CSV file holds one month of data with the following fields:

  • country_code: code of the country where the news originated
  • news_datetime: timestamp of news publication
  • json_col: a JSON object containing additional metadata from GDELT, including the field "quotes"
  • icd9_code: list of the top 3 ICD9-CM codes obtained by zero-shot classification of the field "quotes"
  • icd9_annotation: descriptions associated with the ICD9-CM codes in the field icd9_code
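
The field layout above could be read roughly as follows. This is a minimal sketch using only the Python standard library, not the authors' loading code. It assumes that json_col is stored as a JSON string and that icd9_code and icd9_annotation are serialized either as JSON arrays or as Python list literals; the record does not state the exact serialization, so both are tried. The helper names parse_list_cell and read_month are hypothetical, and the sample row is invented for illustration.

```python
import ast
import csv
import io
import json


def parse_list_cell(cell):
    """Decode a cell holding a serialized list (assumed: JSON array or Python literal)."""
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        value = ast.literal_eval(cell)
    return list(value)


def read_month(handle):
    """Yield one dict per news row, with json_col and the list columns decoded."""
    for row in csv.DictReader(handle):
        metadata = json.loads(row["json_col"])
        yield {
            "country_code": row["country_code"],
            "news_datetime": row["news_datetime"],
            "quotes": metadata.get("quotes"),
            "icd9_code": parse_list_cell(row["icd9_code"]),
            "icd9_annotation": parse_list_cell(row["icd9_annotation"]),
        }


# Invented sample row, for illustration only (not taken from the dataset).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["country_code", "news_datetime", "json_col",
                 "icd9_code", "icd9_annotation"])
writer.writerow(["US", "2020-01-15 08:30:00",
                 json.dumps({"quotes": "example quote"}),
                 "['487.1', '486', '780.6']",
                 "['label a', 'label b', 'label c']"])
buffer.seek(0)

rows = list(read_month(buffer))
print(rows[0]["icd9_code"])
```

For a real monthly file, the same function could be called on a file handle, e.g. `open("2020_01.csv", newline="", encoding="utf-8")`; the encoding is an assumption.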

Files are named using YYYY_MM format (e.g., 2020_01.csv for January 2020).
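
As a small illustration of the naming scheme, the expected file names for the covered period can be generated as below. The function name monthly_filenames is hypothetical, and the sketch assumes that every month from January 2020 to December 2022 has a file in the archive.

```python
def monthly_filenames(first_year=2020, last_year=2022):
    """Build YYYY_MM.csv names for every month in the inclusive year range."""
    return [
        f"{year}_{month:02d}.csv"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


names = monthly_filenames()
print(len(names), names[0], names[-1])  # 36 2020_01.csv 2022_12.csv
```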

Files

ICD9_GDELT_dataset.zip

Files (1.8 GB)

ICD9_GDELT_dataset.zip (1.8 GB)
md5:f804b83d4a1196ae1853a03779b90dbe
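
The MD5 checksum listed for the archive can be used to check a download for corruption. The following is a minimal sketch with Python's hashlib; the function names and the default local path are assumptions, and the file is read in chunks so the 1.8 GB archive does not need to fit in memory.

```python
import hashlib

# Checksum as published on the record page.
EXPECTED_MD5 = "f804b83d4a1196ae1853a03779b90dbe"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path="ICD9_GDELT_dataset.zip"):
    """True if the local file (assumed path) matches the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `verify_archive()` after the download should return True for an intact copy saved under the default name.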

Additional details

Additional titles

Subtitle
Zero-shot ICD9-CM classification of health news using MPNet transformer