A Ground-Truth Dataset for Article Separation in Historical Newspapers: A Shenbao corpus centered on 華美協進社/華美協進會 (1926–1949)

Armand, Cécile

doi:10.5281/zenodo.17801322

Published December 3, 2025 | Version 0.0.1

Dataset Open

Armand, Cécile (Data curator)^{1, 2}

1. French National Center for Scientific Research (head office)
2. Institut d'Asie Orientale

Overview

This ground-truth dataset contains carefully curated and properly segmented documents derived from an original corpus of news articles focused on 華美協進社/華美協進會 (China Institute in America), drawn from the Shenbao newspaper. The dataset includes 70 articles published between 1926 and 1949.

The ground-truth data contains the following fields:

DocId: Unique identifier as stored in the Modern China Textual Database (MCTB).
Date: Original date of publication
Title: Title as provided by the data supplier
Source: Shenbao
Text: Original, unsegmented text
text_seg: Historian-curated segmented text, produced by GPT-assisted segmentation followed by close reading
length: Character/word length of the original text
length_seg: Character/word length after re-segmentation
diff: Length difference between the original and the segmented text
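
As an illustration of the fields listed above, the following minimal Python sketch reads the table with the standard csv module and recomputes the two length values. It assumes that the CSV header uses the field names exactly as listed, that lengths are counted in characters, and that diff is the original length minus the segmented length; none of these conventions is confirmed by the record. The sample rows are invented placeholders, not data from the corpus.

```python
import csv
import io


def summarize_rows(csv_file):
    """Return (DocId, original length, segmented length, difference) per row.

    Assumption: lengths are character counts and the difference is
    original minus segmented; the record does not state either convention.
    """
    summary = []
    for row in csv.DictReader(csv_file):
        n_orig = len(row["Text"])
        n_seg = len(row["text_seg"])
        summary.append((row["DocId"], n_orig, n_seg, n_orig - n_seg))
    return summary


# Toy stand-in for huamei_shenbao_gold.csv (hypothetical rows, not real data).
sample = io.StringIO(
    "DocId,Date,Title,Source,Text,text_seg\n"
    "DOC1,1930-01-01,t1,Shenbao,abcdefghij,cdefg\n"
    "DOC2,1948-01-01,t2,Shenbao,xyz,xyz\n"
)
rows = summarize_rows(sample)
for doc_id, n_orig, n_seg, delta in rows:
    print(doc_id, n_orig, n_seg, delta)
```

To run the same function on the published file, one could pass `open("huamei_shenbao_gold.csv", newline="", encoding="utf-8")` in place of the toy sample, assuming the file is UTF-8 encoded.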

The segmentation process uses a hybrid human–AI workflow: an automated step with a GPT-based “Historical Text Segmenter,” followed by detailed historian-guided verification and correction. The result is a high-quality ground-truth dataset suitable for OCR benchmarking, segmentation modeling, historical text analysis, and digital humanities research. Additional documentation on the configuration of the GPT “Historical Text Segmenter” is available here.

Use Cases

This dataset is intended for:

Historical research on Sino-American cultural institutions
Media and discourse analysis of Shenbao
Training/evaluating segmentation and OCR models
Digital humanities projects requiring high-quality ground truth corpora
Studies of textual reuse and viral news circulation in Republican-era newspapers
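
For the segmentation-evaluation use case, one possible way to score a model's output against the text_seg field is a character-level overlap measure. The sketch below is only an illustration under stated assumptions: the function name segment_overlap is hypothetical, the metric is not one prescribed by the dataset, and it relies on Python's difflib to count matching characters.

```python
from difflib import SequenceMatcher


def segment_overlap(predicted, gold):
    """Return (precision, recall) of a predicted segment against the gold text.

    Matching characters are counted from difflib's matching blocks.
    This is an illustrative metric, not one defined by the dataset.
    """
    matcher = SequenceMatcher(None, predicted, gold, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Toy strings: the prediction keeps two extra leading characters.
precision, recall = segment_overlap("abcdef", "cdef")
print(round(precision, 3), recall)
```

A prediction that includes unrelated neighbouring text lowers precision, while one that cuts the article short lowers recall.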

Lessons from Curation

On the biases of digital avatars of historical newspapers

Certain news sections such as “教育新聞” (Education News) or “敎育消息” (Education Notices) are more likely to bundle unrelated items into a single document, and therefore require closer attention.
Articles are generally better segmented toward the end of the period (post-WWII), where they more systematically correspond to meaningful, autonomous semantic units. However, some poorly segmented texts remain (e.g., SPSP194803300401).
Some “documents” (listed as such in the digital record) are in fact incomplete: the full text is clearly only a fragment of a longer piece. In some cases, the beginning of the article is missing from the “text” column but appears in the “title” column.
Some news items are repeated across successive issues of the newspaper (e.g., calls for scholarship applications, SPSP194901200201, SPSP194901200221). In some cases, they are properly segmented; in messy cases, viral-text detection could help identify the correctly bounded passage.
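
The idea of using repeated items to recover article boundaries can be sketched as follows. This is a deliberately simple stand-in for viral-text detection, assuming that the passage shared by two reprints is contiguous and nearly identical; the function name longest_shared_passage and the toy strings are hypothetical, and real OCR noise would call for fuzzier matching.

```python
from difflib import SequenceMatcher


def longest_shared_passage(text_a, text_b):
    """Return the longest contiguous passage common to both texts."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    return text_a[match.a:match.a + match.size]


# Invented examples: the same notice embedded in different surrounding text.
issue_1 = "AAA call for scholarship applications BBB"
issue_2 = "CCC call for scholarship applications DDD"
shared = longest_shared_passage(issue_1, issue_2)
print(repr(shared))
```

When one of the two reprints is already well segmented, the shared passage gives a candidate boundary for the messy one, which a historian can then verify.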

On the nature of references to the research focus

The China Institute may appear with different statuses or degrees of relevance in the articles:

Central focus of the article (rare): e.g., announcement of a conference series (SPSP194802030620)
Mentioned in connection with the China Foundation (most common): as a subsidiary institution or grant recipient
Mentioned in passing through reference to its leadership: Guo Bingwen 郭秉文 (before WWII), Meng Zhi 孟治 (after WWII)
Referenced as a meeting venue or hosting site (e.g., SPSP194701270117)

Files

Name	Size
huamei_shenbao_gold.csv (md5:b2dbd7d4602e3e758b02f4f9e45097df)	925.9 kB
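
The MD5 checksum listed for the file can be used to confirm that a download is intact. The sketch below uses Python's standard hashlib module; the local path passed to the function is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed in the record for huamei_shenbao_gold.csv.
EXPECTED_MD5 = "b2dbd7d4602e3e758b02f4f9e45097df"


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's digest matches the expected checksum."""
    return file_md5(path) == expected
```

For example, `is_intact("huamei_shenbao_gold.csv")` should return True for an unmodified copy saved in the current directory.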

Additional details

Funding

European Commission: ENPMUC - Elites, networks, and power in modern urban China (1830-1949), grant 788476

Dates

Submitted: 2025-12-03
