Published January 15, 2026 | Version 1.0
Dataset Open

InciText — Incident-Centric Text Dataset for Attribute Extraction

  • 1. University of Maryland, Baltimore County
  • 2. Purdue University West Lafayette

Description

InciText v1.0 is an incident-centric text dataset for attribute extraction, released for research and academic use.

 

Contents

The dataset includes three components packaged together:

  • Raw and processed text corpora (DataSet_Attribute_Extraction/)

  • Structured annotation files (Annotation/)

  • Derived, normalized datasets used in experiments (Derived_Datasets/)

The Derived_Datasets/ folder contains the paper-faithful PostgreSQL exports and is the recommended entry point for reproducing reported results.
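As a quick orientation after downloading, the sketch below counts how many files in the archive fall under each of the three component folders. The folder names are taken from the list above; how they are nested inside the zip is not stated on this page, so the code searches every path segment rather than assuming a particular layout. The function names are illustrative only.

```python
import zipfile
from collections import Counter

# Folder names as listed on this page. Their nesting inside the archive is an
# assumption, so every path segment is checked rather than only the first.
COMPONENTS = ("DataSet_Attribute_Extraction", "Annotation", "Derived_Datasets")


def count_files_by_component(names):
    """Count file paths per component folder; unmatched paths go to 'other'."""
    counts = Counter()
    for name in names:
        if name.endswith("/"):  # directory entries carry no data
            continue
        parts = name.split("/")
        component = next((p for p in parts if p in COMPONENTS), "other")
        counts[component] += 1
    return counts


def summarize_archive(zip_path):
    """Open a zip archive and count its files per component folder."""
    with zipfile.ZipFile(zip_path) as archive:
        return count_files_by_component(archive.namelist())
```

Calling summarize_archive("InciText.zip") on the downloaded file returns a Counter keyed by folder name. A large "other" count would suggest that the archive layout differs from the assumption made here.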

 

 

Scope

InciText includes incident reports, press releases, newspaper articles, and synthetic or generated narratives used for attribute extraction and retrieval research.

Some documents were provided in privacy-reviewed or historical form.
Users should follow applicable ethical and institutional guidelines.

Dataset Composition

The number of documents of each type in the FemmIR-text corpus is:

  • Newspaper articles: 300
  • Officer narratives: 40
  • Press releases: 13
  • Dispatch reports: 5
  • Synthetic narratives: 1500
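The counts above sum to 1,858 documents, of which 358 are non-synthetic. The short sketch below reproduces that arithmetic and computes each type's share of the corpus. The dictionary keys simply mirror the labels in the list and are not identifiers from the dataset itself.

```python
# Document counts copied from the composition list on this page.
counts = {
    "Newspaper articles": 300,
    "Officer narratives": 40,
    "Press releases": 13,
    "Dispatch reports": 5,
    "Synthetic narratives": 1500,
}

total = sum(counts.values())
non_synthetic = total - counts["Synthetic narratives"]

# Fraction of the corpus contributed by each document type.
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<22} {n:>5}  {shares[label]:6.1%}")
print(f"{'Total':<22} {total:>5}")
print(f"{'Non-synthetic':<22} {non_synthetic:>5}")
```

The synthetic narratives make up most of the corpus, which may matter when splitting the data or when reporting results per document type.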

 

Citation

If you use this dataset, please cite:

@misc{solaiman2025modularunsupervisedframeworkattribute,
  title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text},
  author={KMA Solaiman},
  year={2025},
  eprint={2507.03949},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.03949},
}

Files

InciText.zip (2.9 MB)
md5:77a9618e9102fe54c29d9a2543bc0002
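To confirm that a download is intact, the MD5 checksum listed above can be compared against one computed locally. The following is a minimal sketch using Python's standard hashlib module. The function names are illustrative, and the file is assumed to have been saved locally as InciText.zip.

```python
import hashlib

# Checksum copied from the file listing on this page.
EXPECTED_MD5 = "77a9618e9102fe54c29d9a2543bc0002"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file's MD5 digest equals the checksum listed on this page."""
    return md5_of_file(path) == EXPECTED_MD5
```

If matches_published_checksum("InciText.zip") returns False, the download is likely incomplete or corrupted and should be repeated.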