Published November 8, 2023 | Version 1.0
Dataset Open

The Curated Courier: Digital Text Corpora from the UNESCO Courier (1948–2020)

  • 1. Uppsala University
  • 2. Malmö University
  • 3. Umeå University
  • 4. University of Lausanne

Description

Founded in 1948 as the official magazine of the United Nations Educational, Scientific and Cultural Organization, The UNESCO Courier represents an extraordinary resource for research on global themes in the humanities. The complete archive of the magazine is available in PDF form through UNESCO. These files make it possible for users anywhere to read individual issues, but they do not allow for full-text searching, much less any of the computational text analysis methods that have recently made important advances in humanities research.

The Curated Courier 1.0 is a package of digital text corpora, text analysis tools, and supplementary materials that makes the complete archive of The UNESCO Courier from 1948 to 2020 machine-readable, accessible, and reusable for digital text analysis. 

Here on Zenodo we publish two Courier corpora. The first corpus (curated_courier_article_corpus) consists of the texts of all articles published in the English-language edition of The UNESCO Courier between 1948 and 2020. For this corpus we have extracted and reconstructed the complete text of all articles, for example by pulling together non-contiguous pages where necessary and by removing non-article text (masthead, photo captions, letters to the editor, and so on). We have linked each article to a comprehensive curated metadata index, included in the download (document_index.csv).
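As a starting point for working with the article corpus, the metadata index can be read directly from the downloaded archive. The following is a minimal sketch, not the project's own tooling: the function name read_document_index is hypothetical, and the location of document_index.csv inside the zip file, its delimiter, and its encoding are assumptions that should be checked against the actual download. No column names are assumed; the rows are returned as dictionaries keyed by whatever header the file contains.

```python
import csv
import io
import zipfile


def read_document_index(zip_path, index_name="document_index.csv", delimiter=","):
    """Return the rows of the metadata index stored inside a corpus archive.

    Assumptions (not confirmed by the dataset description): the index sits
    somewhere inside the zip under a path ending in `index_name`, is encoded
    as UTF-8, and uses `delimiter` as its field separator.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Accept the index at the top level or inside a subfolder.
        matches = [name for name in archive.namelist() if name.endswith(index_name)]
        if not matches:
            raise FileNotFoundError(f"{index_name} not found in archive")
        with archive.open(matches[0]) as raw:
            # "utf-8-sig" also tolerates a leading byte-order mark.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text, delimiter=delimiter))
```

Calling read_document_index("curated_courier_article_corpus.zip") on a local copy of the archive would then return one dictionary per indexed article. If the file turns out to be tab- or semicolon-separated, pass the appropriate delimiter.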

The second corpus (curated_issues) compiles the complete text of all Courier issues (English-language edition), 1948–2020. To prepare this corpus we extracted text from the PDFs that UNESCO has made available, used multiple modes of OCR, and rendered each issue as a simple text file. Our test of the OCR quality finds an average error rate of 0.7%, which should be considered good quality.
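The description does not state how the error rate was measured. One common way to quantify OCR quality is the character error rate: the edit distance between a hand-corrected reference transcription and the OCR output, divided by the length of the reference. The sketch below illustrates that measure under this assumption; the function names are hypothetical and it is not necessarily the procedure used for the figure reported above.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this measure, a rate of 0.7% would correspond to roughly seven character-level errors per thousand characters of reference text.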

Working data from the process can be found in our GitHub repository "tagged Courier." The products, text analysis tools, and additional documentation are in the repository "Curated Courier."

The text of The UNESCO Courier is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. This dataset is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

These datasets were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet), 2020-2024. For more information, see: https://inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.

Files

curated_courier_article_corpus.zip

Files (76.9 MB)

Name Size
md5:af4b0b22321ae417cad0760e9eedca11 36.8 MB
md5:8739f27120d512cea4140524f29fafca 40.1 MB
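The MD5 values in the file listing can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and the path to the local copy of the archive is up to the user. The listing above does not show which checksum belongs to which archive, so compare against the value displayed next to the file that was actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For a local download, matches_checksum(path_to_archive, "md5:...") returns True when the file on disk matches the published value.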

Additional details

Funding

International Ideas at UNESCO: Digital Approaches to Global Conceptual History 2019-03278
Swedish Research Council