Published May 11, 2022 | Version v1
Dataset | Open Access

Finding Reliable Sources: Evaluation Datasets

  • Institute of Computer Science, Polish Academy of Sciences

Description

Finding Reliable Sources (FRS) is an NLP/ML task in which, given a single-sentence claim, we aim to automatically find reliable sources that could be used to confirm or refute it. This repository includes evaluation datasets that can be used to train and test solutions to this problem. The datasets are organised as large collections of individual records, each consisting of (1) a claim, expressed as a short text fragment, and (2) a list of identifiers (DOI, ISBN, arXiv ID or URL) of reliable sources associated with it.
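For illustration, such a record could be represented as follows (a minimal Python sketch; the class and field names are ours, and the example values are hypothetical rather than taken from the data):

    from dataclasses import dataclass

    @dataclass
    class FRSRecord:
        claim: str          # single-sentence claim (a short text fragment)
        sources: list[str]  # reliable-source identifiers: DOI, ISBN, arXiv ID or URL

    # Hypothetical example, not an actual entry from the datasets:
    record = FRSRecord(
        claim="Water boils at 100 degrees Celsius at sea level.",
        sources=["10.1000/example-doi", "arXiv:0000.00000"],
    )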

Two datasets are included:

  • Claim-Source Pairing Dataset (CSP) contains 32 million textual contexts paired with 24 million source identifiers mined from English Wikipedia,
  • FEVER-FRS contains 15,798 claims gathered for the FEVER shared task, supplemented with source identifiers using Wikipedia.

The data were gathered within a study described in the article "Countering Disinformation by Finding Reliable Sources: a Citation-Based Approach", presented at the 2022 International Joint Conference on Neural Networks (IJCNN 2022). Please refer to the article (in conference proceedings or authors' version) for more information on the FRS task, data collection procedures and evaluation results. The research was done within the HOMADOS project at the Institute of Computer Science, Polish Academy of Sciences.

The datasets described here are based on the Wikipedia Complete Citation Corpus (WCCC), which also includes the train/test split for CSP.

The CSP dataset is uploaded in three variants, using different text fragments as 'context'. In 's', the context consists of a single sentence preceding (or containing) a citation. The 'ts' variant additionally includes the title of the Wikipedia article, while 'ss' contains two sentences preceding a citation. Each variant consists of sequentially numbered TSV files, whose lines contain the following tab-separated values (see the reading sketch after the list):

  • document ID (see WCCC for full document metadata),
  • textual context,
  • a list of (at least one, possibly many) source identifiers.
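A minimal sketch of reading such a file could look like this (we assume here, without having verified it, that each source identifier occupies its own tab-separated field after the context; consult the files themselves or the generation code for the exact encoding):

    import csv

    def read_csp(path):
        # Yields (document_id, context, identifiers) tuples from one CSP TSV file.
        # Assumes each line holds: document ID, textual context, then one or
        # more source identifiers, all tab-separated (an assumption to verify).
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                doc_id, context, *identifiers = row
                yield doc_id, context, identifiers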

The FEVER-FRS dataset was created based on the FEVER shared task data. In the original dataset, each claim is matched with a set of 'evidence' sentences from Wikipedia that either support or refute it. For the purpose of evaluating FRS solutions, we replaced each evidence sentence with the identifiers of the external sources cited alongside it. This was not always possible (e.g. for Wikipedia sentences with no sources; see the paper for details), but we managed to find 4,414 claims refuted and 11,384 claims supported by the provided sources.

Each line of the FEVER-FRS TSV files contains the following values (an evaluation sketch follows the list):

  • text of a short claim,
  • a list of (at least one, possibly many) source identifiers.
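As an example of how these files might be used, the following sketch computes a simple hit rate at k for an FRS system (find_sources is a hypothetical function standing in for the system under test, and we make the same tab-separated-fields assumption as above; this is an illustrative metric, not the evaluation protocol from the paper):

    import csv

    def hit_rate_at_k(path, find_sources, k=5):
        # Fraction of claims for which any gold identifier appears among the
        # top-k identifiers returned by the (hypothetical) find_sources function.
        hits = total = 0
        with open(path, encoding="utf-8", newline="") as f:
            for claim, *gold in csv.reader(f, delimiter="\t"):
                total += 1
                if set(find_sources(claim)[:k]) & set(gold):
                    hits += 1
        return hits / total if total else 0.0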

The source code for generating both datasets (CSP from Wikipedia and FEVER-FRS from FEVER) can be obtained from a GitHub repository.

Files (9.5 GB)

CSP-2021-s.zip

  • md5:fe0e557c8460b37544ec3315b6c03672 (2.9 GB)
  • md5:346a32680be06f44cdb343b1cd91ddad (3.7 GB)
  • md5:430aa1786ece5163021b039dc342cd2d (2.9 GB)
  • md5:697a66aa0d8a6aada71a17ef36bdc8d2 (828.4 kB)
  • md5:a5381f20b002c8c5d93e4fa6d237add5 (2.0 MB)