MAPE: A Dataset of Correspondence from the Portuguese Empire
Description
We present the MAPE dataset (Mapping the Atlantic Portuguese Empire), a large-scale historical resource curated from archival material. The dataset is available in several versions, each with a detailed description:
- MAPE Dataset: Raw Archival Materials (version 1)
- MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)
- MAPE Dataset: Bilingual Version with Senders and Recipients (version 3) - in progress
The MAPE dataset comprises 182,491 historical correspondence records from the Arquivo Histórico Ultramarino de Lisboa (Portuguese Overseas Archives of Lisbon, hereafter AHU), in particular from the collection of the Conselho Ultramarino (Overseas Council), covering the period from 1581 to 1859.
The AHU holds an extensive archive of correspondence covering the administrative, diplomatic and commercial activities of the Portuguese Empire. The collection of the “Conselho Ultramarino”, a council created in 1642, records the formal bureaucratic communication between Lisbon and its overseas dominions and covers topics such as colonial administration, trade, diplomacy and social developments.
Originally, these materials were only available as unstructured PDF files, which posed a major challenge for data analysis and large-scale retrieval. These PDF documents contained not only the core correspondence registers, but also a variety of non-essential metadata, such as cataloging details, pagination markers, section headings, and record summaries. The mixing of primary records with additional metadata hindered effective content analysis, searching and visualization.
To overcome these challenges, we converted the PDFs into a structured format (CSV) that isolates the main data elements, improving searchability, navigation and analytical potential. This restructuring allows researchers to work directly with the primary correspondence records without the noise of the surrounding archival metadata.
The data for this study were obtained from the Arquivo Histórico Ultramarino, where historical documents were preserved primarily in unstructured formats, predominantly as PDF files. These documents were publicly available for free download at https://actd.iict.pt/collection/actd:CU. The collections were originally divided into large sections such as Portugal, Africa, Brazil, etc. Within each section there are further subdivisions corresponding to the different colonies of the Portuguese empire at that time. The correspondence register of each colony is stored in individual PDF files, which are organized chronologically. However, these files also contain extraneous metadata such as headings, page numbers, cataloging details and document summaries, which make it difficult to extract the relevant content. As a rule, a header is followed by a short summary of the correspondence, with each document being provided with details of the archiving source.
Our research focuses primarily on specific collections within the AHU, arranged chronologically and geographically, which include the following:
- Africa:
  - The Angola Collection (Série Angola), whose cataloging was financially supported by the Portuguese Fundação para a Ciência e Tecnologia as part of the project África Atlântica: da documentação ao conhecimento, sécs. XVII-XIX (Atlantic Africa: from documentation to knowledge, seventeenth to nineteenth centuries).
  - The Cabo Verde and Guinea Collection (Série Cabo Verde, Série Guiné), which was cataloged as part of two separate projects: the aforementioned África Atlântica and the Resgate do acervo histórico de Cabo Verde em Portugal (Rescue the historical collection of Cape Verde in Portugal) funded by Camões, Instituto da Cooperação e da Língua (ICL).
  - The São Tomé Collection (Série S. Tomé e Príncipe), also cataloged within the África Atlântica project.
- Brazil:
  - The “Barão do Rio Branco” Historical Documentation Rescue Project, known as Projeto Resgate (Bertoletti et al. 2022; Boschi 2018), includes 26 catalogues of documents referring to Brazilian regions, cataloged at different times and by different researchers. The Projeto Resgate collection is currently managed by the National Library of Rio de Janeiro in Brazil, but is housed in the AHU.
- Portugal: Madeira-CA and Madeira.
- Rio da Prata:
  - Nova Colónia do Sacramento
  - Montevideu
  - Buenos Aires
  - Paraguai
- Oriente:
  - Macau
  - Timor
1. MAPE Dataset: Raw Archival Materials (version 1)
Column | Type | Description |
---|---|---|
doc_id | Integer | Unique identifier for each record. |
doc_source | String | Archival origin (e.g. ALAGOAS, BAHIA, Cabo Verde). |
doc_box | String | Physical box code within the archive (e.g. Cx.1). |
doc_number | String | Document number within the box (zero-padded, e.g. 00001). |
doc_type | String | Type of register (e.g. INFORMACAO, CONSULTA, CARTA, PROPOSTA, REQUERIMENTO, PARECER). |
year | Integer | Four-digit year of the correspondence (e.g. 1690). |
month | Integer | Month of the register (1–12). Blank if not recorded in the original. |
day | Integer | Day of the month (1–31). Blank if not recorded. |
reference_code | String | Unique identifier for each record. |
doc_link | URL | Direct link to the AHU catalog entry for the document. |
Doc_Text | String | Original Portuguese summary of the correspondence, as transcribed from the archival register. |
The MAPE dataset is provided as a single CSV file in the repository root. It consolidates all correspondence registers extracted from the AHU PDFs into a uniform tabular structure.
2. MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)
This is an updated version of MAPE Dataset: Raw Archival Materials (version 1).
Multilingual Adaptation of Consolidated Data Files
The consolidated dataset originally contained correspondence only in Portuguese, which was a significant barrier for a global audience. To overcome this limitation, we translated the original content into English using Google Gemini 1.5 Flash, a lightweight transformer-based model optimized for multilingual text processing and translation. Google Gemini 1.5 Flash supports over 100 languages and is designed to strike a balance between speed, computational efficiency and high-quality text generation. With a context window of up to 1 million tokens, it can process large volumes of text in a single prompt and is therefore well suited to the translation of historical documents. As our dataset consists of colonial-era correspondence, it was important to maintain historical accuracy and linguistic integrity. To achieve this, we carefully crafted the following translation prompt:
Prompt:
"You are a skilled historical linguist and translator with deep knowledge of both colonial-era Portuguese and archaic/historical English usage. Your task is to translate the following Portuguese text into an English style that reflects the era in which it was originally written. Please:
- Maintain the historical tone.
- Avoid modern terms and slang.
- Capture the nuanced formality of the original text."
The translated dataset keeps the same structure as the original and aims to preserve linguistic and historical authenticity while making the correspondence accessible to a wider audience.
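A minimal sketch of how the prompt above could be combined with each Portuguese summary before it is sent to a translation model. It assumes the source text is simply appended after the instructions, which the description does not state, and it deliberately leaves out the model call: `translate` is a hypothetical caller-supplied function, and `fake_translate` is a stand-in so the sketch runs without network access.

```python
from typing import Callable, Iterable, List

# Prompt text as quoted in the description above.
TRANSLATION_PROMPT = (
    "You are a skilled historical linguist and translator with deep knowledge "
    "of both colonial-era Portuguese and archaic/historical English usage. "
    "Your task is to translate the following Portuguese text into an English "
    "style that reflects the era in which it was originally written. Please:\n"
    "- Maintain the historical tone.\n"
    "- Avoid modern terms and slang.\n"
    "- Capture the nuanced formality of the original text."
)


def build_prompt(portuguese_text: str) -> str:
    """Append one record's text to the instructions (assumed layout)."""
    return f"{TRANSLATION_PROMPT}\n\n{portuguese_text.strip()}"


def translate_all(texts: Iterable[str],
                  translate: Callable[[str], str]) -> List[str]:
    """Translate each non-blank text; blank inputs map to empty strings."""
    return [translate(build_prompt(t)) if t.strip() else "" for t in texts]


def fake_translate(prompt: str) -> str:
    # Stand-in for a real model call; returns a marker instead of a translation.
    return f"<translation requested for {len(prompt)} prompt characters>"


outputs = translate_all(["texto de exemplo", "   "], fake_translate)
print(outputs)
```

Passing the model call in as a function keeps the prompt handling testable on its own and avoids tying the sketch to any particular client library.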
Files
Name | Size | Checksum |
---|---|---|
MAPE Dataset Bilingual Version Portuguese-English version 2.csv | 142.2 MB | md5:64b66549e8b8e0205ce9bce75718d173 |
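After downloading, the MD5 checksum listed for the file can be used to check that the copy is complete. The following is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the file is read in chunks so that the whole 142.2 MB does not need to be held in memory at once.

```python
import hashlib

# Checksum as published in the file listing.
EXPECTED_MD5 = "64b66549e8b8e0205ce9bce75718d173"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a local file's digest with the published checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_published_checksum` with the local path of the downloaded CSV returns `True` only if the file is byte-for-byte identical to the published one.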
Additional details
Related works
- Is metadata for
- Journal article: 10.1016/j.socnet.2020.08.008 (DOI)
Funding
- National Science Center
- Imperial Commoners of Brazil and West Africa (1640-1822): Global History from a Correspondence Network Perspective 2022/45/B/HS3/00473