A Ground-Truth Dataset for Article Separation in Historical Newspapers: A Shenbao corpus centered on 華美協進社/華美協進會 (1926–1949)
Description
Overview
This ground-truth dataset contains 70 curated, re-segmented articles published in the Shenbao newspaper between 1926 and 1949. They are derived from an original corpus of news articles focused on 華美協進社/華美協進會 (China Institute in America).
The ground-truth data contains the following fields:
- DocId: Unique identifier as stored in the Modern China Textual Database (MCTB).
- Date: Original date of publication
- Title: Title as provided by the data supplier
- Source: Shenbao
- Text: Original, unsegmented text
- text_seg: Historian-curated segmented text, produced with a GPT-based segmenter and then corrected through close reading
- length: Character/word length of the original text
- length_seg: Character/word length after re-segmentation
- diff: Difference in length between the original and the segmented text
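The fields above can be sanity-checked after loading. The following is a minimal sketch, assuming the CSV header uses exactly the field names listed above and that `length` and `length_seg` count characters (the description says "character/word", so this is a guess). The `diff` column is not checked because its sign convention is not documented here. The two sample rows, including their IDs and date format, are invented placeholders and not records from the dataset.

```python
import csv
import io


def load_rows(handle):
    """Read CSV rows into dictionaries keyed by the header names."""
    return list(csv.DictReader(handle))


def check_lengths(rows):
    """Return the DocIds whose stored lengths differ from recomputed ones.

    Assumption: lengths are plain character counts, i.e. len() of the string.
    """
    mismatches = []
    for row in rows:
        recomputed = (len(row["Text"]), len(row["text_seg"]))
        stored = (int(row["length"]), int(row["length_seg"]))
        if stored != recomputed:
            mismatches.append(row["DocId"])
    return mismatches


# Invented stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    "DocId,Date,Title,Source,Text,text_seg,length,length_seg,diff\n"
    "DEMO0001,1948-03-30,demo,Shenbao,abcdefgh,abcde,8,5,3\n"
    "DEMO0002,1949-01-20,demo,Shenbao,abcd,abcd,4,3,1\n"
)
rows = load_rows(sample)
mismatches = check_lengths(rows)
print(mismatches)
```

With the real file, the same functions could be called on `open("huamei_shenbao_gold.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is also an assumption.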
The segmentation process uses a hybrid human–AI workflow: an automated step with a GPT-based “Historical Text Segmenter,” followed by detailed historian-guided verification and correction. The result is a high-quality ground-truth dataset suitable for OCR benchmarking, segmentation modeling, historical text analysis, and digital humanities research. Additional documentation on the configuration of the GPT “Historical Text Segmenter” is available here.
Use Cases
This dataset is intended for:
- Historical research on Sino-American cultural institutions
- Media and discourse analysis of Shenbao
- Training/evaluating segmentation and OCR models
- Digital humanities projects requiring high-quality ground truth corpora
- Studies of textual reuse and viral news circulation in Republican-era newspapers
Lessons from Curation
On the biases of digital avatars of historical newspapers
- Certain news sections such as “教育新聞” or “敎育消息” are more likely to contain unrelated items, and therefore require more careful attention.
- Articles are generally better segmented toward the end of the period (post-WWII); they more systematically correspond to meaningful and autonomous semantic units. However, some messy texts remain (e.g., SPSP194803300401).
- Some so-called “documents” (listed as such in the digital record) are in fact incomplete: the text is clearly only a fragment of a longer piece. In some cases, the beginning of the article is missing from the “text” column but appears in the “title” column.
- Some news items are repeated across successive issues of the newspaper (e.g., calls for scholarship applications, SPSP194901200201, SPSP194901200221). In some cases, they are properly segmented; in messy cases, viral-text detection could help identify the correctly bounded passage.
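As an illustration of the viral-text idea mentioned in the last point, repeated items can be flagged by comparing overlapping character n-grams ("shingles") between documents. This is only a sketch of one common approach, not the method used to curate this dataset; the shingle size `n=5` and the `threshold=0.5` are arbitrary assumptions, and the three sample texts are invented.

```python
def shingles(text, n=5):
    """Set of overlapping character n-grams, ignoring whitespace."""
    cleaned = "".join(text.split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_pairs(docs, n=5, threshold=0.5):
    """Return (id_a, id_b, score) for document pairs at or above threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], n) for doc_id in ids}
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Invented examples: A and B are near-duplicates, C is unrelated.
docs = {
    "A": "scholarship applications are invited from students",
    "B": "scholarship applications are invited from all students",
    "C": "weather report for the coming week",
}
pairs = reuse_pairs(docs)
print([(left, right) for left, right, _ in pairs])
```

Character shingles work on unsegmented Chinese text without a tokenizer, which is why they are used here rather than word n-grams. Pairwise comparison is quadratic in the number of documents, which is acceptable for a 70-article corpus.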
On the nature of references to the research focus
The China Institute may appear with different statuses or degrees of relevance in the articles:
- Central focus of the article (rare): e.g., announcement of a conference series (SPSP194802030620)
- Mentioned in connection with the China Foundation (most common): a subsidiary institution or grant recipient
- Mentioned in passing through reference to its leadership: Guo Bingwen 郭秉文 (before WWII), Meng Zhi 孟治 (after WWII)
- Referenced as a meeting venue or hosting site (e.g., SPSP194701270117)
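A first pass at sorting articles into the categories above could record which of the names given in this description occur in each text. The sketch below assumes simple substring matching is good enough (it ignores OCR variants and character variants), and the grouping labels `institute` and `leadership` are hypothetical. It does not detect mentions of the China Foundation, since its Chinese name is not given here, and the sample string is an invented placeholder.

```python
# Search terms taken from the description; the grouping labels are hypothetical.
TERMS = {
    "institute": ("華美協進社", "華美協進會"),
    "leadership": ("郭秉文", "孟治"),
}


def mention_profile(text):
    """Map each label to the list of its terms that occur in the text."""
    return {
        label: [term for term in terms if term in text]
        for label, terms in TERMS.items()
    }


# Invented placeholder text, not a quotation from the corpus.
sample_text = "某某 華美協進社 某某 孟治"
profile = mention_profile(sample_text)
print(profile)
```

An article where only the `leadership` list is non-empty would be a candidate for the "mentioned in passing" category, but the final assignment still calls for close reading.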
Files
- huamei_shenbao_gold.csv (925.9 kB), md5:b2dbd7d4602e3e758b02f4f9e45097df
Additional details
Dates
- Submitted: 2025-12-03