Published January 28, 2020 | Version 1.0
Dataset Open

Crowdsourcing Document Similarity Judgements

  • 1. King's College London

Description

This dataset contains the results of crowdsourcing tasks that asked workers to judge the similarity of pairs of documents. Each document, as well as each pair, has a unique ID. We presented the pairs to crowd workers through three task variations:

  • Variation 1: We showed workers 5 pairs of documents and, for each pair, asked them to rate its similarity on a 4-level Likert scale (None, Low, Medium, High), to report how confident they were in that rating (from 0 to 4), and to give a written reason for the similarity level they chose. For quality control, 2 of the 5 pairs were gold-standard pairs whose ratings we already knew, and we checked the workers' responses against them. Workers had to give the gold pair with the higher similarity a higher score than the other gold pair; otherwise, their answer was rejected.
  • Variation 2: We repeated variation 1 with one alteration: instead of a Likert scale for the similarity score, we asked for a magnitude estimation, which is any number above 0. It could be 1, 0.0001, 1000, or 42, as long as the scores were coherent, meaning that a more similar pair received a higher score than a less similar pair.
  • Variation 3: We showed workers 5 rankings. Each ranking had a main document and 3 auxiliary documents to be compared against the main one. Workers also had to report a confidence score and give a short written reason, just as in variation 1. The first ranking was a gold standard: we knew the similarity values for its 3 pairs (the main document paired with each of the 3 auxiliary documents), and workers had to give the gold pair with the highest similarity a higher rank than the one with the lower similarity.
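
The gold-standard checks described in the three variations can be sketched as follows. This is only an illustration, not the code used to run the tasks: the function names, the numeric coding of the Likert labels (None = 0 up to High = 3), and the convention that a ranking is a list ordered from most to least similar are all assumptions made for the example.

```python
# Assumed numeric coding of the 4-level Likert scale (not taken from the dataset).
LIKERT_LEVELS = {"None": 0, "Low": 1, "Medium": 2, "High": 3}


def likert_gold_ok(high_gold_label: str, low_gold_label: str) -> bool:
    """Variation 1: the more similar gold pair must get the strictly higher rating."""
    return LIKERT_LEVELS[high_gold_label] > LIKERT_LEVELS[low_gold_label]


def magnitude_gold_ok(high_gold_score: float, low_gold_score: float) -> bool:
    """Variation 2: scores must be above 0 and ordered like the gold pairs."""
    if high_gold_score <= 0 or low_gold_score <= 0:
        return False
    return high_gold_score > low_gold_score


def ranking_gold_ok(ranking: list, high_gold_doc: str, low_gold_doc: str) -> bool:
    """Variation 3: `ranking` lists auxiliary document IDs, most similar first
    (assumed convention), so the more similar gold document must appear earlier."""
    return ranking.index(high_gold_doc) < ranking.index(low_gold_doc)
```

A response that fails the relevant check would be rejected, as described above.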

The raw results from the tasks are recorded in the JSON file CrowdResults.json. For a description of its contents, please read the file CrowdResults_README.md.

These raw crowd annotations were then parsed into three CSV files, each containing the aggregated results from one of the task variations.

  • final_scores_likert.csv contains the resulting scores for each pair from the variation 1 tasks;
    • pair_id is a unique identifier for each pair;
    • similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
    • relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
    • similarity_crowd_simple_maj stores the simple majority result from the crowd's annotations;
    • similarity_crowd_simple_mean stores the mean of the crowd's annotations;
    • similarity_crowd_simple_median stores the median of the crowd's annotations;
  • final_scores_magnitude.csv contains the resulting scores for each pair from the variation 2 tasks;
    • pair_id is a unique identifier for each pair;
    • similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
    • relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
    • scaled_similarity_worker is the magnitude score scaled based on the worker's behaviour;
    • scaled_similarity_worker_docset is the magnitude score scaled based on both the worker's behaviour and the pair;
  • final_scores_ranking.csv contains the resulting scores for each pair from the variation 3 tasks;
    • pair_id is a unique identifier for each pair;
    • similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
    • relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
    • mean_similarity is the mean of the rankings the crowd gave to that pair.
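
To make the three aggregate columns of final_scores_likert.csv concrete, here is a minimal sketch of how a simple majority, mean and median could be computed from one pair's annotations. It assumes the Likert labels have already been coded as integers, and it breaks majority ties in favour of the lower level; both choices are assumptions for illustration, and the function name `aggregate_likert` is hypothetical rather than part of the dataset's tooling.

```python
from collections import Counter
from statistics import mean, median


def aggregate_likert(annotations: list) -> dict:
    """Aggregate one pair's integer-coded Likert annotations."""
    counts = Counter(annotations)
    top_count = max(counts.values())
    # Assumed tie-break: among equally frequent levels, keep the lowest one.
    majority = min(level for level, n in counts.items() if n == top_count)
    return {
        "similarity_crowd_simple_maj": majority,
        "similarity_crowd_simple_mean": mean(annotations),
        "similarity_crowd_simple_median": median(annotations),
    }
```

For example, five annotations coded as 3, 3, 2, 1 and 3 would give a majority of 3, a mean of 2.4 and a median of 3.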

This dataset was built and used as part of the TheyBuyForYou project.

Files

CrowdResults.json

Files (24.6 MB)

Size Checksum
24.6 MB md5:212a7820839ba6621a62e46d89740263
11.4 kB md5:c33c1622eced21583a3eff6737c9b919
11.2 kB md5:58aeaaabdf22eb5c88499869b385b172
12.9 kB md5:4af9b64fe89c7f2d1f445f1d08557fbd
10.9 kB md5:d7347d751923a56ee43cf275863bc12e
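
After downloading, a file can be checked against its listed MD5 checksum. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which file name has to be read from the record itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Judging by its size, the 24.6 MB entry is presumably CrowdResults.json, in which case `matches_checksum("CrowdResults.json", "md5:212a7820839ba6621a62e46d89740263")` should return True for an intact download.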

Additional details

Funding

European Commission
TheyBuyForYou - Enabling procurement data value chains for economic development, demand management, competitive markets and vendor intelligence (grant 780247)