Dataset belonging to the article 'Does transnational contention lead to transnational memory? The online visual memory of the February 2003 anti-Iraq War protests'
Creators
- 1. Utrecht University
- 2. Luxembourg Centre for Contemporary and Digital History
Description
## The Dataset
This dataset contains the material collected for the article 'Does transnational contention lead to transnational memory? The online visual memory of the February 2003 anti-Iraq War protests.' We identified 25 often-used photographs of the 2003 worldwide anti-war protests. The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a copy of one of these images. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
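To make the shape of the API output more concrete, the following is a minimal sketch of how one web-detection response could be flattened into rows of page URL, page title, image URL and match type. It is not the code from the repository linked above. It assumes the JSON layout of the GCV REST response (`webDetection.pagesWithMatchingImages`, with `fullMatchingImages` and `partialMatchingImages` per page); the function name, the output keys and the example response are hypothetical.

```python
from typing import Any, Dict, List


def flatten_web_detection(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return one row per (page, matching image) pair in a web-detection response."""
    rows: List[Dict[str, str]] = []
    pages = response.get("webDetection", {}).get("pagesWithMatchingImages", [])
    for page in pages:
        # A page can list full matches, partial matches, or both.
        for match_type, key in (("full", "fullMatchingImages"),
                                ("partial", "partialMatchingImages")):
            for image in page.get(key, []):
                rows.append({
                    "page_url": page.get("url", ""),      # hypothetical column name
                    "page_title": page.get("pageTitle", ""),
                    "image_url": image.get("url", ""),
                    "match_type": match_type,
                })
    return rows


# Hand-written example response for illustration; not data from the dataset.
example_response = {
    "webDetection": {
        "pagesWithMatchingImages": [
            {
                "url": "https://news.example/2003-protests",
                "pageTitle": "Looking back at 2003",
                "fullMatchingImages": [{"url": "https://news.example/img/march.jpg"}],
            },
            {
                "url": "https://blog.example/post",
                "partialMatchingImages": [{"url": "https://blog.example/crop.jpg"}],
            },
        ]
    }
}

rows = flatten_web_detection(example_response)
```

Each row keeps the page and the image URL together, so the match type stays attached to the specific image rather than to the page as a whole.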
The dataset consists of a .tsv file with the URLs that refer to the webpages. The file also contains other metadata provided by the GCV API, as well as manually generated metadata. This includes:
-The URL that refers to the webpage.
-The URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match.
-The title of the page.
-The iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found.
-The language found by the ``langid`` Python module (https://github.com/saffsd/langid.py), along with the normalized score.
-The labels associated with the image by Google.
-The scrape date.
-The top-level domain of the circulation.
-The date when the webpage was published, extracted from the HTML time tags using ``htmldate`` (https://github.com/adbar/htmldate).
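The iteration number listed above can be illustrated with a small fixed-point loop: search with the current images, record newly found pages with the current iteration number, reuse newly found image URLs as input for the next round, and stop once a round yields no new unique page URLs. This is only a sketch under stated assumptions. The `search` callable stands in for the API call, `fake_index` is invented toy data, and any filtering of images between rounds is left out.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A search function takes an image reference and returns (page_url, image_url) pairs.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def iterate_until_stable(seed_images: List[str], search: SearchFn,
                         max_iterations: int = 50) -> Dict[str, int]:
    """Map each discovered page URL to the iteration in which it was first found."""
    first_seen: Dict[str, int] = {}
    seen_images: Set[str] = set(seed_images)
    frontier = list(seed_images)
    iteration = 0
    while frontier and iteration < max_iterations:
        iteration += 1
        next_frontier: List[str] = []
        new_pages = 0
        for image in frontier:
            for page_url, image_url in search(image):
                if page_url not in first_seen:
                    first_seen[page_url] = iteration
                    new_pages += 1
                if image_url not in seen_images:
                    seen_images.add(image_url)
                    next_frontier.append(image_url)
        if new_pages == 0:
            break  # no new unique page URLs: stop iterating
        frontier = next_frontier
    return first_seen


# Toy stand-in for the API (hypothetical data).
fake_index = {
    "seed.jpg": [("https://a.example/post", "https://a.example/copy.jpg")],
    "https://a.example/copy.jpg": [
        ("https://a.example/post", "https://a.example/copy.jpg"),
        ("https://b.example/news", "https://b.example/img.jpg"),
    ],
    "https://b.example/img.jpg": [("https://b.example/news", "https://b.example/img.jpg")],
}

found = iterate_until_stable(["seed.jpg"], lambda img: fake_index.get(img, []))
```

In this toy run the first page is found in iteration 1 and the second in iteration 2; a third round finds nothing new, so the loop ends.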
## Data Cleaning and Curation
Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, we found it too difficult to use one solution for all our webpages. We therefore simply parsed all the text contained in commonly used HTML tags, such as ``<p>`` and ``<h1>``.
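As an illustration of the tag-based approach, here is a minimal sketch that collects the text inside a fixed set of tags using Python's standard `html.parser` module. The tag list, class name and function name are assumptions made for this example; the parser and the exact tags used to build the dataset are not specified here.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of "commonly used" text tags; adjust as needed.
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the tags in TEXT_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0              # how many TEXT_TAGS elements are currently open
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one selected tag.
        text = data.strip()
        if self._depth > 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text found inside the selected tags, joined by spaces."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Ignored</title></head><body>"
    "<h1>Protest</h1><div>menu</div>"
    "<p>Millions <b>marched</b> in 2003.</p>"
    "</body></html>"
)
text = extract_text(sample)
```

One limitation of this sketch is that it relies on closing tags: a page that leaves `<p>` elements unclosed would cause all following text to be captured as well.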
More info in our paper:
Smits T and Ros R (2020) Quantifying Iconicity in 940K Online Circulations of 26 Iconic Photographs. In: Proceedings of the Workshop on Computational Humanities Research (CHR 2020) (eds F Karsdorp, B McGillivray, A Nerghes, et al.), Amsterdam, 18 November 2020, pp. 375–384. CEUR-WS. Available at: http://ceur-ws.org/Vol-2723/short34.pdf.