Published February 2, 2024 | Version v1
Dataset Open

Multilingual news article similarity dataset

  • 1. University of Massachusetts Amherst
  • 2. Sapienza University of Rome
  • 3. University of Michigan–Ann Arbor


This dataset is an extended version of the authors' earlier work. Pairs of news articles drawn from the first half of 2020 are annotated for the seven aspects of similarity used in the original version, plus an additional FRAME aspect:

  • GEO: How similar is the geographic focus (places, cities, countries, etc.) of the two articles?
  • ENT: How similar are the named entities (e.g., people, companies, organizations, products, named living beings) appearing in the two articles, excluding previously considered locations?
  • TIME: Are the two articles relevant to similar time periods or describing similar time periods?
  • NAR: How similar are the narrative schemas presented in the two articles?
  • OVERALL: Overall, are the two articles covering the same substantive news story (excluding style, framing, and tone)?
  • STYLE: Do the articles have similar writing styles?
  • TONE: Do the articles have similar tones?
  • FRAME: Do the articles have similar framing and express similar opinions?
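
As a rough illustration of how the eight aspects above might be handled in code, the following is a minimal Python sketch. Only the aspect names come from this description. The `PairAnnotation` class, the `pair_id` field, the `mean_scores` helper, and the use of numeric ratings are all assumptions made for illustration; the actual file layout and rating scale are defined by the dataset files and the codebook, not by this sketch.

```python
from dataclasses import dataclass
from statistics import mean

# Aspect names taken from the list above; everything else here is hypothetical.
ASPECTS = ("GEO", "ENT", "TIME", "NAR", "OVERALL", "STYLE", "TONE", "FRAME")


@dataclass(frozen=True)
class PairAnnotation:
    """One annotator's ratings for one article pair (hypothetical layout)."""
    pair_id: str   # assumed identifier for the article pair
    scores: dict   # aspect name -> numeric rating (scale not specified here)

    def __post_init__(self):
        # Reject records that do not rate every one of the eight aspects.
        missing = set(ASPECTS) - set(self.scores)
        if missing:
            raise ValueError(f"missing aspects: {sorted(missing)}")


def mean_scores(annotations):
    """Average each aspect over several annotations of the same pair."""
    return {a: mean(ann.scores[a] for ann in annotations) for a in ASPECTS}
```

Averaging per aspect is only one possible way to combine multiple annotators; the sketch does not claim this is how the authors aggregate ratings.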


Codebook for text similarity annotations - Google Docs.pdf

Files (13.6 MB)

Size
402.6 kB
13.2 MB

Additional details

Related works

Is version of
Conference proceeding: (URL)




  • Chen et al. (2024). Multilingual news article similarity dataset. doi: 10.5281/zenodo.10611923