InciText — Incident-Centric Text Dataset for Attribute Extraction
Description
InciText v1.0 is an incident-centric text dataset released for research and academic use.
Contents
The dataset includes three components packaged together:
- Raw and processed text corpora (DataSet_Attribute_Extraction/)
- Structured annotation files (Annotation/)
- Derived, normalized datasets used in experiments (Derived_Datasets/)
The Derived_Datasets/ folder contains the paper-faithful PostgreSQL exports and is the recommended entry point for reproducing reported results.
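After downloading, it can help to confirm that the three component folders are present before working with the data. The following is a minimal sketch, not part of the release: the function name `find_component_folders` is hypothetical, and because the internal layout of the archive is not documented here, the code assumes only that the folder names listed above appear somewhere in the member paths (with or without a wrapping top-level directory).

```python
import zipfile
from pathlib import PurePosixPath

# Folder names as listed in the Contents section.
EXPECTED_FOLDERS = (
    "DataSet_Attribute_Extraction",
    "Annotation",
    "Derived_Datasets",
)


def find_component_folders(archive_path):
    """Map each expected folder name to the archive members located under it.

    Hypothetical helper: the archive's real layout is an assumption, so a
    folder is matched at any depth of the member path.
    """
    found = {name: [] for name in EXPECTED_FOLDERS}
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            parts = PurePosixPath(member).parts
            for name in EXPECTED_FOLDERS:
                if name in parts:
                    found[name].append(member)
    return found
```

Calling `find_component_folders("InciText.zip")` returns a dictionary keyed by folder name; an empty list for any key suggests that the component is missing or that the archive uses a different layout than assumed here.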
Scope
InciText includes incident reports, press releases, newspaper articles, and synthetic or generated narratives used for attribute extraction and retrieval research.
Some documents were provided in privacy-reviewed or historical form.
Users should follow applicable ethical and institutional guidelines.
The number of items of each data type in the FemmIR-text corpus is:
- Newspaper articles: 300
- Officer narratives: 40
- Press releases: 13
- Dispatch reports: 5
- Synthetic narratives: 1500
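To see how the corpus is balanced across sources, the counts above can be totalled and converted to shares. This is a small illustrative sketch; it assumes the figures are per-document counts and that the five categories do not overlap, neither of which is stated explicitly. The names `DOCUMENT_COUNTS` and `composition` are hypothetical.

```python
# Counts copied from the list above (assumed to be per-document counts).
DOCUMENT_COUNTS = {
    "Newspaper articles": 300,
    "Officer narratives": 40,
    "Press releases": 13,
    "Dispatch reports": 5,
    "Synthetic narratives": 1500,
}


def composition(counts):
    """Return the total count and each category's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = composition(DOCUMENT_COUNTS)
    print(f"Total: {total}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Under those assumptions the total is 1,858 items, of which synthetic narratives account for roughly four fifths, so results computed over the full corpus are likely to be dominated by generated text unless evaluation is stratified by source type.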
Citation
If you use this dataset, please cite:
@misc{solaiman2025modularunsupervisedframeworkattribute,
title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text},
author={KMA Solaiman},
year={2025},
eprint={2507.03949},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.03949},
}
Files
| Name | Size | Checksum |
|---|---|---|
| InciText.zip | 2.9 MB | md5:77a9618e9102fe54c29d9a2543bc0002 |
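The MD5 value in the Files listing can be used to check that a downloaded copy of InciText.zip is intact. The sketch below uses only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_published_checksum` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as published in the Files listing for InciText.zip.
PUBLISHED_MD5 = "77a9618e9102fe54c29d9a2543bc0002"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=PUBLISHED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_published_checksum("InciText.zip")` should return `True` for an unmodified download. MD5 is suitable here for detecting transfer corruption, not for guarding against deliberate tampering.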