Webis Wikipedia-IPC

Gohsen, Marcel; Hagen, Matthias; Potthast, Martin; Stein, Benno

doi:10.5281/zenodo.7621320

Published February 8, 2023 | Version v1

Dataset Open

1. Bauhaus-Universität Weimar
2. Friedrich-Schiller-Universität Jena
3. Leipzig University and ScaDS.AI

When an image is reused on the Web, it is often given a new, original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyzed captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The result is the Wikipedia-IPC (Image caption Paraphrase Corpus) dataset, which consists of pairs of captions for the same image that are paraphrases of each other. It contains 30,237 gold, 229,877 silver, and 656,560 bronze quality paraphrase pairs.

Notes

The bronze quality pairs will be released soon.

Files

Files (50.2 MB)

Name	Size	Checksum
wikipedia-ipc.zip	50.2 MB	md5:7a2b7ecb9bd13546eb9be67787e8939c
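
The MD5 checksum listed above can be used to confirm that a download of the archive is intact. The following is a minimal sketch, not part of the dataset itself: the function names `md5_of_file` and `verify_archive` are illustrative, and the local path `wikipedia-ipc.zip` is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for wikipedia-ipc.zip.
EXPECTED_MD5 = "7a2b7ecb9bd13546eb9be67787e8939c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local download location (hypothetical; adjust as needed).
    archive = Path("wikipedia-ipc.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.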

	All versions	This version
Views	292	292
Downloads	73	73
Data volume	3.8 GB	3.8 GB

DOI: 10.5281/zenodo.7621320
Resource type: Dataset
Publisher: Zenodo
Languages: English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: April 28, 2023
Modified: April 29, 2023
