Published May 1, 2022 | Version 0.01
Dataset Open

SemEval-2022 Task 8: Multilingual news article similarity

  • 1. UMass Amherst
  • 2. University of Exeter
  • 3. GESIS
  • 4. Meedan
  • 5. Meedan and University of Oxford
  • 6. University of Michigan

Description

This dataset contains pairs of news articles drawn from the first half of 2020 and annotated for seven aspects of similarity:

  • GEO: How similar is the geographic focus (places, cities, countries, etc.) of the two articles?
  • ENT: How similar are the named entities (e.g., people, companies, organizations, products, named living beings) appearing in the two articles, excluding previously considered locations?
  • TIME: Are the two articles relevant to similar time periods or describing similar time periods?
  • NAR: How similar are the narrative schemas presented in the two articles?
  • OVERALL: Overall, are the two articles covering the same substantive news story (excluding style, framing, and tone)?
  • STYLE: Do the articles have similar writing styles?
  • TONE: Do the articles have similar tones?

Further details are provided in

Chen et al. (2022). SemEval-2022 Task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). https://aclanthology.org/2022.semeval-1.155/

The data in this repository include pairs of URLs and annotations. The text of the webpages is generally available via the Internet Archive in this special collection: https://archive.org/details/2020-multilingual-news-article-similarity . A script to download and process the webpages is available at https://github.com/euagendas/semeval_8_2022_ia_downloader .

Notes

This research has received funding through the Volkswagen Foundation. We thank Media Cloud for access to increased API volume. We thank the Internet Archive, which made it possible for all participants to have access to the same data. We are deeply grateful to the annotators and task participants: thank you. The webpages annotated in this dataset are mostly available via the Internet Archive in this special collection: https://archive.org/details/2020-multilingual-news-article-similarity

Files

Codebook for text similarity annotations.pdf

Files (14.3 MB)

Checksum Size
md5:2765dc6580b589319fd7125da726be78 274.0 kB
md5:2faa787b553ffb2bcc2794cd87681ca6 357.6 kB
md5:0f73b405f5a69bf72926133c3e7baa42 18.0 kB
md5:9f00f9192dec6a78915fdbb40bf46767 23.9 kB
md5:b1acdfeafd230d186f7a59b0d9085e67 110.8 kB
md5:49c4dbb8c9db263bbfb7bd502df9349b 2.5 MB
md5:f514b277ea50f1d082258a4631274ee5 3.0 MB
md5:2ba095d53f51142a12375e7ccb22fdab 5.3 MB
md5:c3f2cf2be0460bb338d6d16f232f5992 2.7 MB
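
The MD5 values listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch, not part of the dataset's own tooling: the function names `md5_of_file` and `matches_listed_checksum` are illustrative, and because the listing does not show file names, the path in the usage comment is a hypothetical placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's digest with a listed value such as 'md5:2765dc...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (the path is a hypothetical placeholder for a downloaded file):
# matches_listed_checksum("downloaded_file", "md5:2765dc6580b589319fd7125da726be78")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.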

Additional details

References

  • Chen et al. (2022). SemEval-2022 Task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)