Published February 2, 2024 | Version v1
Dataset Open

Multilingual news article similarity dataset

  • 1. University of Massachusetts Amherst
  • 2. Sapienza University of Rome
  • 3. University of Michigan–Ann Arbor

Description

This dataset is an extended version of the authors' earlier release (https://zenodo.org/records/6507872). Pairs of news articles drawn from the first half of 2020 are annotated for the seven aspects of similarity used in the original version, plus an additional FRAME aspect:

  • GEO: How similar is the geographic focus (places, cities, countries, etc.) of the two articles?
  • ENT: How similar are the named entities (e.g., people, companies, organizations, products, named living beings) appearing in the two articles, excluding previously considered locations?
  • TIME: Are the two articles relevant to, or describing, similar time periods?
  • NAR: How similar are the narrative schemas presented in the two articles?
  • OVERALL: Overall, are the two articles covering the same substantive news story (excluding style, framing, and tone)?
  • STYLE: Do the articles have similar writing styles?
  • TONE: Do the articles have similar tones?
  • FRAME: Do the articles have similar framing and express similar opinions?

Files

Codebook for text similarity annotations - Google Docs.pdf

Files (13.6 MB)

Name Size
md5:9007e1014065d65c20690c6ee54270ce 402.6 kB
md5:f013b39fcd3359c20daf4b9c7c9604c2 13.2 MB
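
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch, not part of the dataset itself: the function names are illustrative, and because the listing does not show which file name belongs to which checksum, the check only tests whether a file matches either of the two listed values.

```python
import hashlib

# Checksums copied from the file listing; the corresponding file names
# are not shown there, so they are identified here only by size.
LISTED_MD5 = {
    "9007e1014065d65c20690c6ee54270ce",  # 402.6 kB file
    "f013b39fcd3359c20daf4b9c7c9604c2",  # 13.2 MB file
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

For example, calling matches_listed_checksum on a locally saved copy (the path is whatever the file was saved as) should return True if the download is complete and unmodified.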

Additional details

Related works

Is version of
Conference proceeding: https://zenodo.org/records/6507872 (URL)

Dates

Collected
2024-01-15

References

  • Chen et al. (2024). Multilingual news article similarity dataset. doi: 10.5281/zenodo.10611923