Published December 3, 2025 | Version 0.0.1
Dataset Open

A Ground-Truth Dataset for Article Separation in Historical Newspapers: A Shenbao corpus centered on 華美協進社/華美協進會 (1926–1949)

  • 1. French National Center for Scientific Research (head office)
  • 2. Institut d'Asie Orientale

Description

Overview

This ground-truth dataset contains curated, re-segmented documents derived from a corpus of Shenbao news articles relating to 華美協進社/華美協進會 (China Institute in America). It includes 70 articles published between 1926 and 1949.

The ground-truth data contains the following fields:

  • DocId: Unique identifier as stored in the Modern China Textual Database (MCTB)
  • Date: Original date of publication
  • Title: Title as provided by the data supplier
  • Source: Shenbao
  • Text: Original, unsegmented text
  • text_seg: Historian-curated segmented text, produced with a GPT-based segmenter followed by close reading
  • length: Character/word length of the original text
  • length_seg: Character/word length after re-segmentation
  • diff: Length difference between original and segmented text
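As a minimal sketch of how these fields might be read and sanity-checked, the snippet below parses CSV rows with the Python standard library and counts characters in the original and segmented text. The exact header casing in the file is an assumption, so keys are lower-cased before checking; the helper names `read_rows` and `char_lengths` are hypothetical, and the sample row uses made-up values only to show the shape of the data.

```python
import csv
import io

# Field names as listed above, lower-cased.
# The exact header casing in the CSV is an assumption.
EXPECTED_FIELDS = ["docid", "date", "title", "source", "text",
                   "text_seg", "length", "length_seg", "diff"]


def read_rows(handle):
    """Read CSV rows as dicts with lower-cased keys and check the documented fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({k.strip().lower(): v for k, v in row.items() if k is not None})
    if rows:
        missing = [f for f in EXPECTED_FIELDS if f not in rows[0]]
        if missing:
            raise ValueError(f"missing columns: {missing}")
    return rows


def char_lengths(row):
    """Return (characters in Text, characters in text_seg) for one row."""
    return len(row["text"]), len(row["text_seg"])


# Toy row with made-up values, not taken from the dataset.
sample = io.StringIO(
    "DocId,Date,Title,Source,Text,text_seg,length,length_seg,diff\n"
    "X1,1930-01-01,示例,Shenbao,甲乙丙丁戊,乙丙丁,5,3,2\n"
)
rows = read_rows(sample)
print(char_lengths(rows[0]))  # (5, 3)
```

For the actual file, the same function could be called on `open("huamei_shenbao_gold.csv", encoding="utf-8-sig", newline="")`; the encoding is an assumption to be checked against the download.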

The segmentation process uses a hybrid human–AI workflow: an automated step with a GPT-based “Historical Text Segmenter,” followed by detailed historian-guided verification and correction. The result is a high-quality ground-truth dataset suitable for OCR benchmarking, segmentation modeling, historical text analysis, and digital humanities research. Additional documentation on the configuration of the GPT “Historical Text Segmenter” is available here.

Use Cases

This dataset is intended for:

  • Historical research on Sino-American cultural institutions
  • Media and discourse analysis of Shenbao
  • Training/evaluating segmentation and OCR models
  • Digital humanities projects requiring high-quality ground truth corpora
  • Studies of textual reuse and viral news circulation in Republican-era newspapers
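For the segmentation-evaluation use case, one possible way to score a model's output against the `text_seg` field is a character-level overlap measure. The sketch below is an illustration under stated assumptions, not a metric prescribed by the dataset: it uses `difflib.SequenceMatcher` from the Python standard library, and the function name `overlap_scores` and the toy strings are hypothetical.

```python
from difflib import SequenceMatcher


def overlap_scores(predicted, gold):
    """Character-level precision and recall of a predicted segment against gold text.

    Matched characters are those in the matching blocks found by SequenceMatcher.
    """
    matcher = SequenceMatcher(None, predicted, gold, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Toy strings: the prediction starts one character too early and stops one too soon.
print(overlap_scores("甲乙丙丁", "乙丙丁戊"))  # (0.75, 0.75)
```

Low precision would suggest that the predicted segment still includes unrelated neighbouring items; low recall would suggest that part of the article was cut off.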

Lessons from Curation

On the biases of digital avatars of historical newspapers

  • Certain news sections such as “教育新聞” or “敎育消息” are more likely to contain unrelated items, and therefore require more careful attention.
  • Articles are generally better segmented toward the end of the period (post-WWII); they more systematically correspond to meaningful and autonomous semantic units. However, some messy texts remain (e.g., SPSP194803300401).
  • Some so-called “documents” (listed as such in the digital record) are in fact incomplete: the text is visibly only a fragment of a longer piece. In some cases, the beginning of the article is missing from the “text” column but appears in the “title” column.
  • Some news items are repeated across successive issues of the newspaper (e.g., calls for scholarship applications, SPSP194901200201, SPSP194901200221). In some cases, they are properly segmented; in messy cases, viral-text detection could help identify the correctly bounded passage.
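The last point can be illustrated with a very simple stand-in for viral-text detection: when the same item is reprinted in two issues, the longest passage shared by the two documents approximates the boundaries of the repeated item. The sketch below assumes exact character-level repetition (real reprints may differ through OCR noise or editing) and uses `difflib.SequenceMatcher`; the function name, the `min_chars` threshold, and the toy strings are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_passage(text_a, text_b, min_chars=20):
    """Return the longest passage common to both texts, or None if shorter than min_chars."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    if match.size < min_chars:
        return None
    return text_a[match.a:match.a + match.size]


# Toy strings, not taken from the dataset: one shared notice inside different surroundings.
shared = "獎學金申請辦法公告"
doc_a = "本埠消息" + shared + "另一則新聞"
doc_b = "教育新聞" + shared + "完"
print(longest_shared_passage(doc_a, doc_b, min_chars=5))
```

Dedicated text-reuse tools tolerate small differences between reprints; this exact-match version is only meant to show the idea.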

On the nature of references to the research focus

The China Institute may appear with different statuses or degrees of relevance in the articles:

  • Central focus of the article (rare): e.g., announcement of a conference series (SPSP194802030620)
  • Mentioned in connection with the China Foundation (most common): a subsidiary institution or grant recipient
  • Mentioned in passing through reference to its leadership: Guo Bingwen 郭秉文 (before WWII), Meng Zhi 孟治 (after WWII)
  • Referenced as a meeting venue or hosting site (e.g., SPSP194701270117)
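As a first-pass filter before close reading, one could flag which name variants or leadership names occur in a text. The sketch below is only a plain string match using the names given above; it cannot by itself tell a central focus from a passing mention, and the function name `mention_flags` and the toy sentence are hypothetical.

```python
# Name variants and leadership names taken from the description above.
INSTITUTE_NAMES = ("華美協進社", "華美協進會")
LEADER_NAMES = ("郭秉文", "孟治")


def mention_flags(text):
    """List which institute name variants and leader names occur in the text."""
    return {
        "institute": [name for name in INSTITUTE_NAMES if name in text],
        "leaders": [name for name in LEADER_NAMES if name in text],
    }


# Toy sentence, not taken from the dataset.
flags = mention_flags("華美協進社社長郭秉文昨日抵滬")
print(flags)
```

A text flagged only under "leaders" would be a candidate for the "mentioned in passing" category, to be confirmed by reading.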

Files

huamei_shenbao_gold.csv (925.9 kB)
md5:b2dbd7d4602e3e758b02f4f9e45097df
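The MD5 checksum listed for huamei_shenbao_gold.csv can be used to confirm that a download is intact. The following is a minimal sketch using Python's `hashlib`; the helper name `md5_of_file` is hypothetical, and the demonstration runs on a throwaway temporary file rather than on the dataset itself.

```python
import hashlib
import os
import tempfile

# Checksum as published on this record.
EXPECTED_MD5 = "b2dbd7d4602e3e758b02f4f9e45097df"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-check on a throwaway file (not the dataset).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_digest = md5_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest == hashlib.md5(b"hello").hexdigest())  # True
```

After downloading, `md5_of_file("huamei_shenbao_gold.csv") == EXPECTED_MD5` should evaluate to `True` if the local file matches the published one (the local path is an assumption).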

Additional details

Funding

European Commission
ENPMUC - Elites, networks, and power in modern urban China (1830-1949), grant no. 788476

Dates

Submitted
2025-12-03