Crowdsourcing Document Similarity Judgements

Gabriel Maia Rocha Amaral

doi:10.5281/zenodo.4298976

Published January 28, 2020 | Version 1.0

Dataset Open

Gabriel Maia Rocha Amaral¹

1. King's College London

This is the data obtained from crowdsourcing tasks that asked workers to judge the similarity of pairs of documents. Each document, and each pair, has a unique ID. We showed crowd workers the pairs through three task variations:

Variation 1: We showed workers 5 pairs of documents and, for each pair, asked them to rate its similarity on a 4-level Likert scale (None, Low, Medium, High), to report how confident they were in that rating (from 0 to 4), and to give a written reason for the similarity level they chose. For quality control, two of the 5 pairs were gold-standard pairs whose ratings we already knew, and we checked the workers' responses against them: a worker had to give the gold pair with the higher similarity a higher score than the other gold pair, otherwise the answer was rejected.
Variation 2: We repeated variation 1 with one change: instead of a Likert scale for the similarity score, we asked for a Magnitude Estimation, which can be any number above 0 (for example 1, 0.0001, 1000, or 42), as long as the scores were coherent, meaning that a more similar pair received a higher score than a less similar pair.
Variation 3: We showed workers 5 rankings. Each ranking had a main document and 3 auxiliary documents to be compared against the main one. Workers also had to report a confidence score and give a short written reason, as in variation 1. The first ranking was a gold standard for which we knew the values of its 3 pairs (the main document paired with each of the 3 auxiliary documents), and workers had to give the gold pair with the highest similarity a higher rank than the one with the lower similarity.
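The gold-standard checks described for the three variations can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function names, the numeric coding of the Likert labels (0 to 3), and the convention that rank 1 is the most similar position are all assumptions made for the example.

```python
# Assumed numeric coding of the 4-level Likert scale (not stated in the record).
LIKERT_LEVELS = {"None": 0, "Low": 1, "Medium": 2, "High": 3}


def likert_passes(label_high_gold, label_low_gold):
    """Variation 1: the more similar gold pair needs the strictly higher rating."""
    return LIKERT_LEVELS[label_high_gold] > LIKERT_LEVELS[label_low_gold]


def magnitude_passes(score_high_gold, score_low_gold):
    """Variation 2: any positive numbers are allowed, but their order must be coherent."""
    if score_high_gold <= 0 or score_low_gold <= 0:
        return False  # magnitude estimates must be above 0
    return score_high_gold > score_low_gold


def ranking_passes(rank_high_gold, rank_low_gold):
    """Variation 3: assumes rank 1 is the top (most similar) position."""
    return rank_high_gold < rank_low_gold
```

Ties fail all three checks, because the description asks for a strictly higher score or rank on the more similar gold pair.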

The raw results from the tasks are recorded in the JSON file CrowdResults.json. For a description of its contents, please read the file CrowdResults_README.md.

These raw annotations from the crowd were then parsed into three CSV files, each holding the aggregated results from one of the task variations.

final_scores_likert.csv contains the resulting scores for each pair from the variation 1 tasks:
- pair_id is a unique identifier for each pair;
- similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
- relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
- similarity_crowd_simple_maj stores the simple majority result from the crowd's annotations;
- similarity_crowd_simple_mean stores the mean of the crowd's annotations;
- similarity_crowd_simple_median stores the median of the crowd's annotations;
final_scores_magnitude.csv contains the resulting scores for each pair from the variation 2 tasks:
- pair_id is a unique identifier for each pair;
- similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
- relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
- scaled_similarity_worker is the magnitude score scaled based on the worker's behaviour;
- scaled_similarity_worker_docset is the magnitude score scaled based on both the worker's behaviour and the pair;
final_scores_ranking.csv contains the resulting scores for each pair from the variation 3 tasks:
- pair_id is a unique identifier for each pair;
- similarity_alg is the similarity assigned to the pair of documents from an automated similarity algorithm;
- relation is the type of relationship shown by the pair, where smaller values indicate more similar pairs;
- mean_similarity is the mean of the rankings the crowd gave to the pair.
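As an illustration of the three aggregate columns in final_scores_likert.csv, the sketch below computes a simple majority, mean, and median for one pair. It is a sketch under stated assumptions rather than the dataset's actual pipeline: the function name aggregate_likert, the integer coding of the Likert levels (None = 0 to High = 3), and the tie-breaking rule for the majority (lowest value wins) are hypothetical choices.

```python
from collections import Counter
from statistics import mean, median


def aggregate_likert(ratings):
    """Aggregate one pair's crowd ratings (assumed coded as integers 0..3)."""
    counts = Counter(ratings)
    top_count = max(counts.values())
    # Assumption: when several values tie for most frequent, take the lowest.
    majority = min(value for value, n in counts.items() if n == top_count)
    return {
        "similarity_crowd_simple_maj": majority,
        "similarity_crowd_simple_mean": mean(ratings),
        "similarity_crowd_simple_median": median(ratings),
    }


# Five hypothetical worker ratings for a single pair.
example = aggregate_likert([3, 3, 2, 1, 3])
```

The mean is sensitive to a single outlying rating, while the majority and median are not, which is one reason to report all three side by side.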

This dataset was built and used as part of the TheyBuyForYou project.

Files

Files (24.6 MB)

Name	MD5	Size
CrowdResults.json	212a7820839ba6621a62e46d89740263	24.6 MB
CrowdResults_README.md	c33c1622eced21583a3eff6737c9b919	11.4 kB
final_scores_likert.csv	58aeaaabdf22eb5c88499869b385b172	11.2 kB
final_scores_magnitude.csv	4af9b64fe89c7f2d1f445f1d08557fbd	12.9 kB
final_scores_ranking.csv	d7347d751923a56ee43cf275863bc12e	10.9 kB
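After downloading, the MD5 checksums listed above can be used to confirm that the files arrived intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify_download are hypothetical, and the example assumes the files sit in the current directory under their listed names.

```python
import hashlib

# Checksums copied from the file listing for this record.
EXPECTED_MD5 = {
    "CrowdResults.json": "212a7820839ba6621a62e46d89740263",
    "CrowdResults_README.md": "c33c1622eced21583a3eff6737c9b919",
    "final_scores_likert.csv": "58aeaaabdf22eb5c88499869b385b172",
    "final_scores_magnitude.csv": "4af9b64fe89c7f2d1f445f1d08557fbd",
    "final_scores_ranking.csv": "d7347d751923a56ee43cf275863bc12e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 24.6 MB JSON need not be held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(name, path=None):
    """Return True when the file's MD5 matches the listed checksum for `name`."""
    return md5_of_file(path or name) == EXPECTED_MD5[name]
```

For example, verify_download("final_scores_likert.csv") returns True only if the local copy matches the published checksum.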

Additional details

Funding: European Commission
TheyBuyForYou - Enabling procurement data value chains for economic development, demand management, competitive markets and vendor intelligence (grant 780247)
