Published February 11, 2021 | Version 1
Journal article Restricted

Dataset belonging to the article 'Does transnational contention lead to transnational memory? The online visual memory of the February 2003 anti-Iraq War protests'

  • 1. Utrecht University
  • 2. Luxembourg Centre for Contemporary and Digital History

Description

## The Dataset

This dataset contains the material collected for the article 'Does transnational contention lead to transnational memory? The online visual memory of the February 2003 anti-Iraq War protests.' We identified 25 often-used photographs of the 2003 worldwide anti-war protests. The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a copy of one of these images. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
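The exact collection code lives in that repository; purely as an illustration, the snippet below is a minimal sketch of a single web-detection request with the google-cloud-vision Python client (the file, function, and variable names are ours, not the repository's):

```python
# Minimal sketch of one Google Cloud Vision web-detection request.
# Requires the google-cloud-vision package and application credentials;
# the image file name is illustrative.
from google.cloud import vision

def find_circulations(image_path):
    """Return (page_url, page_title) pairs for webpages that reproduce the image."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.web_detection(image=image)
    pages = response.web_detection.pages_with_matching_images
    return [(page.url, page.page_title) for page in pages]

if __name__ == "__main__":
    for url, title in find_circulations("iconic_photo.jpg"):
        print(url, title)
```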

The dataset consists of a .tsv file with the URLs that refer to the webpages. The file also contains other metadata provided by the GCV API, as well as manually generated metadata. This includes:

-The URL that refers to the webpage.
-The URL that refers specifically to the image. This can be a URL of either a full match or a partial match.
-The title of the page.
-The iteration number. Because the GCV API puts a limit on its output, we had to re-upload the identified images to the API to extend our search. We continued these iterations until no new unique URLs were found.
-The language found by the ``langid`` Python module (https://github.com/saffsd/langid.py), along with the normalized score.
-The labels associated with the image by Google.
-The scrape date.
-The top-level domain of the circulation.
-The date when the webpage was published, extracted from HTML time tags with the htmldate library (https://github.com/adbar/htmldate); see the sketch after this list.
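As an illustration of how the language and publication-date fields could be produced, the sketch below combines the langid and htmldate packages linked above; the function and variable names are illustrative and not taken from the dataset code:

```python
# Minimal sketch of deriving the language and publication-date fields
# with langid and htmldate; names are illustrative.
from htmldate import find_date
from langid.langid import LanguageIdentifier, model

# norm_probs=True makes classify() return a normalized confidence score
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def enrich(page_url, page_text):
    language, score = identifier.classify(page_text)
    publication_date = find_date(page_url)  # inspects <time> tags and other date markup
    return {"language": language, "language_score": score, "date": publication_date}
```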

## Data Cleaning and Curation

Our pipeline contained several interventions to reduce noise in the data. First, between iterations we manually checked the scraped photographs for relevance, because re-uploading an iconic image that is paired with an irrelevant one yields reproductions of the irrelevant image in the next iteration. Another issue was the parsing of webpage texts. After experimenting with several webpage parsers that aim to extract 'relevant' text, it proved too difficult to use a single solution for all our webpages. We therefore simply parsed all text contained in commonly used HTML tags, such as `<p>` and `<h1>`.
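As an illustration of this tag-based extraction, the following is a minimal sketch using BeautifulSoup; the concrete parser and tag selection in the actual pipeline may differ:

```python
# Minimal sketch of extracting text from commonly used HTML tags;
# the tag selection below is illustrative.
from bs4 import BeautifulSoup

TEXT_TAGS = ["p", "h1", "h2", "h3", "li", "blockquote"]

def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    chunks = (tag.get_text(separator=" ", strip=True) for tag in soup.find_all(TEXT_TAGS))
    return "\n".join(chunk for chunk in chunks if chunk)
```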

More info in our paper:

Smits T and Ros R (2020) Quantifying Iconicity in 940K Online Circulations of 26 Iconic Photographs. In: Proceedings of the Workshop on Computational Humanities Research (CHR 2020) (eds F Karsdorp, B McGillivray, A Nerghes, et al.), Amsterdam, 18 November 2020, pp. 375–384. CEUR-WS. Available at: http://ceur-ws.org/Vol-2723/short34.pdf.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

Access to the files can be requested via the form on the record page. Requests will be accepted only after publication of the article.


Additional details

Funding

ReAct – Remembering Activism: The Cultural Memory of Protest in Europe (grant no. 788572)
European Commission