Published November 5, 2020 | Version 1
Dataset Open

Dataset and trained models belonging to the article 'Distant reading patterns of iconicity in 940.000 online circulations of 26 iconic photographs'

  • 1. Utrecht University
  • 2. Luxembourg Centre for Contemporary and Digital History


# Quantifying Iconicity

## The Dataset
This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of each iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data is available on GitHub.

The core dataset consists of .tsv-files with the URLs that refer to the webpages. The files also contain further metadata provided by the GCV API, as well as manually generated metadata. This includes:
- the URL that refers specifically to the image; this can be a URL for a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API limits its output, we reuploaded the identified images to the API to extend our search, and continued these iterations until no new unique URLs were found
- the language detected by the ``langid`` Python module, along with the normalized score
- the labels associated with the image by Google
- the scrape date
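A minimal sketch of reading one of these .tsv-files with Python's standard `csv` module. The column names and the sample row below are assumptions for illustration only; they are not the dataset's actual header:

```python
import csv
from io import StringIO

# Hypothetical sample mirroring the described columns (names are assumptions).
SAMPLE = (
    "page_url\timage_url\tmatch_type\ttitle\titeration\tlanguage\tlang_score\tscrape_date\n"
    "https://example.org/post\thttps://example.org/img.jpg\tfull\tExample page\t1\ten\t0.97\t2020-11-05\n"
)

def load_rows(handle, min_lang_score=0.5):
    """Parse a tab-separated results file, keeping rows whose
    normalized langid score meets a confidence threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["lang_score"]) >= min_lang_score]

rows = load_rows(StringIO(SAMPLE))
```

Filtering on the normalized ``langid`` score is one way to restrict an analysis to confidently identified languages.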

Alongside the .tsv-files, there are several other elements in the following folder structure:

├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv

1. The ```/embeddings``` folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser could not find dates for all webpages, so not all training texts have associated metadata.
2. The ```/evaluation``` folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
3. The ```/results``` folder contains the top words associated with the clusters and the diachronic cluster prominence plots.
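The model-selection step behind the ```/evaluation``` folder can be sketched as follows, assuming scikit-learn's `GaussianMixture` and using synthetic 2-D points as a stand-in for the shipped UMAP embeddings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the 2-D UMAP embeddings: two well-separated blobs.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),
])

def score_cluster_counts(X, k_values):
    """Fit a GMM per candidate cluster count and record AIC/BIC."""
    scores = {}
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        scores[k] = {"aic": gmm.aic(X), "bic": gmm.bic(X)}
    return scores

scores = score_cluster_counts(X, range(1, 5))
best_k = min(scores, key=lambda k: scores[k]["bic"])  # lowest BIC wins
```

Lower AIC/BIC values indicate a better trade-off between fit and model complexity, which is how a cluster count can be chosen from the tabulated scores.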

## Data Cleaning and Curation
Our pipeline contained several interventions to prevent noise in the data. First, between iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant, one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to find one solution that worked for all our webpages. We therefore simply parsed all text contained in commonly used HTML tags, such as ```<p>```, ```<h1>```, etc.
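The tag-based text extraction can be sketched with Python's standard `html.parser`. The exact tag list is an assumption beyond the `<p>` and `<h1>` examples given above:

```python
from html.parser import HTMLParser

# Tags assumed to carry content text; the article names <p> and <h1> as examples.
TEXT_TAGS = {"p", "h1", "h2", "h3", "li"}

class TagTextExtractor(HTMLParser):
    """Collect text only from commonly used content tags,
    ignoring everything else (scripts, navigation, etc.)."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # >0 while inside a content tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the concatenated text of all content tags in an HTML string."""
    parser = TagTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Text outside the listed tags, such as inline scripts, is dropped, which approximates the "parse only common content tags" compromise described above.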


Files (966.9 MB)


Additional details


Funding: ReAct – Remembering Activism: The Cultural Memory of Protest in Europe (European Commission, grant 788572)


  • Van der Hoeven (2019)