MAPE: A Dataset of Correspondence from the Portuguese Empire

Błoch, Agata; Vasques Filho, Demival; Bojanowski, Michał; Santana, Clodomir; Hussain, Saddam

doi:10.5281/zenodo.15481608

Published May 21, 2025 | Version v2

Dataset Open


1. Tadeusz Manteuffel Institute of History
2. Jagiellonian University in Krakow/Poland
3. University of Luxembourg
4. Universitat Autònoma de Barcelona
5. Kozminski University

We present MAPE (Mapping the Atlantic Portuguese Empire), a large-scale historical dataset curated from archival material. The dataset is made available in several versions, each with a detailed description:

MAPE Dataset: Raw Archival Materials (version 1)
MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)
MAPE Dataset: Bilingual Version with Senders and Recipients (version 3) - in progress

The MAPE dataset comprises 182,491 historical correspondence records from the Arquivo Histórico Ultramarino de Lisboa (Portuguese Overseas Archives of Lisbon, hereafter AHU), in particular from the collection of the Conselho Ultramarino (Overseas Council), covering the period from 1581 to 1859.

The AHU holds an extensive archive of correspondence covering the administrative, diplomatic and commercial activities of the Portuguese Empire. The collection of the “Conselho Ultramarino”, a council created in 1642, records formal bureaucratic communication between Lisbon and its overseas dominions and covers topics such as colonial administration, trade, diplomacy and social developments.

Originally, these materials were only available as unstructured PDF files, which posed a major challenge for data analysis and large-scale retrieval. These PDF documents contained not only the core correspondence registers, but also a variety of non-essential metadata, such as cataloging details, pagination markers, section headings, and record summaries. The mixing of primary records with additional metadata hindered effective content analysis, searching and visualization.

To overcome these challenges, we converted the PDFs into a structured format (CSV) that isolates the main data elements, improving searchability, navigation and analytical potential. This restructuring allows researchers to work directly with the primary correspondence records without the noise of the surrounding archival metadata.

The data for this study were obtained from the Arquivo Histórico Ultramarino, where historical documents were preserved primarily in unstructured formats, predominantly as PDF files. These documents were publicly available for free download at https://actd.iict.pt/collection/actd:CU. The collections were originally divided into large sections such as Portugal, Africa, Brazil, etc. Within each section there are further subdivisions corresponding to the different colonies of the Portuguese empire at that time. The correspondence register of each colony is stored in individual PDF files, which are organized chronologically. However, these files also contain extraneous metadata such as headings, page numbers, cataloging details and document summaries, which make it difficult to extract the relevant content. As a rule, a header is followed by a short summary of the correspondence, with each document being provided with details of the archiving source.

Our research focuses primarily on specific collections within the AHU, arranged chronologically and geographically, which include the following:

Africa:

The Angola Collection (Série Angola), whose cataloging was financially supported by the Portuguese Fundação para a Ciência e Tecnologia as part of the project África Atlântica: da documentação ao conhecimento, sécs. XVII-XIX (Atlantic Africa: from documentation to knowledge, seventeenth to nineteenth centuries).
The Cabo Verde and Guinea Collection (Série Cabo Verde, Série Guiné), which was cataloged as part of two separate projects: the aforementioned África Atlântica and the Resgate do acervo histórico de Cabo Verde em Portugal (Rescue the historical collection of Cape Verde in Portugal) funded by Camões, Instituto da Cooperação e da Língua (ICL).
The São Tomé Collection (Série S. Tomé e Príncipe), also cataloged within the África Atlântica project.

Brazil:

The “Barão do Rio Branco” Historical Documentation Rescue Project, known as Projeto Resgate (Bertoletti et al. 2022; Boschi 2018), includes 26 catalogues of documents referring to Brazilian regions, cataloged at different times and by different researchers. The Projeto Resgate collection is currently managed by the National Library of Rio de Janeiro in Brazil, but the documents themselves are housed in the AHU.

Portugal:

Madeira-CA
Madeira

Rio da Prata:

Nova Colónia do Sacramento
Montevideu
Buenos Aires
Paraguai

Oriente:

Macau
Timor

1. MAPE Dataset: Raw Archival Materials (version 1)

Column	Type	Description
doc_id	Integer	Unique identifier for each record.
doc_source	String	Archival origin (e.g. ALAGOAS, BAHIA, Cabo Verde).
doc_box	String	Physical box code within the archive (e.g. Cx.1).
doc_number	String	Document number within the box (zero-padded, e.g. 00001).
doc_type	String	Type of register (e.g. INFORMACAO, CONSULTA, CARTA, PROPOSTA, REQUERIMENTO, PARECER).
year	Integer	Four-digit year of the correspondence (e.g. 1690).
month	Integer	Month of the register (1–12). Blank if not recorded in the original.
day	Integer	Day of the month (1–31). Blank if not recorded.
reference_code	String	Unique reference code for each record.
doc_link	URL	Direct link to the AHU catalog entry for the document.
Doc_Text	String	Original Portuguese summary of the correspondence, as transcribed from the archival register.

The MAPE dataset is provided as a single CSV file in the repository root. It consolidates all correspondence registers extracted from the AHU PDFs into a uniform tabular structure.
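As a minimal sketch of how the table above might be read, the following uses only the Python standard library. The column names come from the table; the sample rows, reference codes and links are invented for illustration, and the real file's delimiter and encoding are assumptions that should be checked against the download. Blank `month` and `day` cells are mapped to `None`, and `doc_number` is left as a string so its zero-padding survives.

```python
import csv
import io
from collections import Counter
from typing import Iterator, Optional, TextIO


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a CSV cell to int; blank cells (unrecorded month/day) become None."""
    value = (value or "").strip()
    # int(float(...)) also tolerates cells exported as "1690.0".
    return int(float(value)) if value else None


def read_mape(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per record, with the integer columns converted."""
    for row in csv.DictReader(handle):
        for column in ("doc_id", "year", "month", "day"):
            row[column] = to_int(row.get(column))
        yield row


# Tiny in-memory sample with invented values, standing in for the real file.
SAMPLE = (
    "doc_id,doc_source,doc_box,doc_number,doc_type,year,month,day,"
    "reference_code,doc_link,Doc_Text\n"
    "1,BAHIA,Cx.1,00001,CARTA,1690,3,,REF-1,https://example.org/1,Texto A\n"
    "2,BAHIA,Cx.1,00002,CONSULTA,1691,,,REF-2,https://example.org/2,Texto B\n"
    "3,ALAGOAS,Cx.1,00001,CARTA,1702,11,5,REF-3,https://example.org/3,Texto C\n"
)

records = list(read_mape(io.StringIO(SAMPLE)))

# Example query: how many records of each register type are present.
type_counts = Counter(record["doc_type"] for record in records)
print(type_counts.most_common())
```

To read the downloaded file instead of the sample, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption here.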

2. MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)

This is an updated version of the MAPE Dataset: Raw Archival Materials (version 1).

Multilingual Adaptation of Consolidated Data Files

The consolidated dataset originally contained correspondence only in Portuguese, which was a significant barrier for a global audience. To overcome this limitation, we translated the original content into English using Google Gemini 1.5 Flash, a lightweight transformer-based model optimized for multilingual text processing and translation. Gemini 1.5 Flash supports over 100 languages and is designed to balance speed, computational efficiency and high-quality text generation. With a context window of up to 1 million tokens, it can process large volumes of text in a single prompt and is therefore well suited to translating historical documents. Because our dataset consists of colonial-era correspondence, it was important to maintain historical accuracy and linguistic integrity. To this end, we crafted the following translation prompt:

Prompt:
"You are a skilled historical linguist and translator with deep knowledge of both colonial-era Portuguese and archaic/historical English usage. Your task is to translate the following Portuguese text into an English style that reflects the era in which it was originally written. Please:

Maintain the historical tone.
Avoid modern terms and slang.
Capture the nuanced formality of the original text."

The translated dataset keeps the same structure as the original. It is intended to preserve linguistic and historical authenticity while making the correspondence accessible to a wider audience.
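The description does not say how each record was attached to the prompt or whether records were batched, so the following is only a sketch under stated assumptions. It pairs the quoted instructions with one Portuguese summary at a time; the `Text:` delimiter, the function names, and the `translate` callable (a stand-in for whatever client actually calls the model) are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional

# Instruction text quoted from the description above.
PROMPT = (
    "You are a skilled historical linguist and translator with deep knowledge "
    "of both colonial-era Portuguese and archaic/historical English usage. "
    "Your task is to translate the following Portuguese text into an English "
    "style that reflects the era in which it was originally written. Please:\n\n"
    "Maintain the historical tone.\n"
    "Avoid modern terms and slang.\n"
    "Capture the nuanced formality of the original text."
)


def build_request(portuguese_text: str) -> str:
    """Combine the fixed instructions with one record's Portuguese summary.

    The "Text:" delimiter is an assumption, not the authors' documented format.
    """
    return f"{PROMPT}\n\nText:\n{portuguese_text.strip()}"


def translate_all(
    texts: Iterable[Optional[str]],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate each non-empty text; blank or missing texts yield ''.

    `translate` is a hypothetical callable that sends a request string to a
    model and returns the model's reply.
    """
    return [
        translate(build_request(text)) if text and text.strip() else ""
        for text in texts
    ]
```

Passing the model call in as a function keeps the sketch independent of any particular client library and makes it easy to substitute a stub when testing.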

Files

Name	Size	Checksum
MAPE Dataset Bilingual Version Portuguese-English version 2.csv	142.2 MB	md5:64b66549e8b8e0205ce9bce75718d173
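Since the record lists an MD5 checksum for the CSV, a downloaded copy can be checked against it before use. The sketch below uses Python's standard `hashlib`; the helper names and the local path are illustrative, not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for the version 2 CSV.
EXPECTED_MD5 = "64b66549e8b8e0205ce9bce75718d173"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's MD5 matches the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact(Path("MAPE Dataset Bilingual Version Portuguese-English version 2.csv"))` should return `True` for an uncorrupted download saved under that name in the current directory.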

Additional details

Related works

Is metadata for: Journal article: 10.1016/j.socnet.2020.08.008 (DOI)

Funding

National Science Centre
Imperial Commoners of Brazil and West Africa (1640-1822): Global History from a Correspondence Network Perspective (grant 2022/45/B/HS3/00473)

