# InciText: Incident-Centric Text Corpus for Attribute Extraction

InciText is a research corpus designed to support **attribute extraction, structured representation, and multimodal information retrieval** from unstructured incident-related text.

The dataset is used in the development and evaluation of the **FemmIR** framework and related work on weakly supervised, property-centric retrieval.

InciText brings together raw text corpora, structured annotations, and derived relational datasets to enable transparent and reproducible research in low-label settings.

---


## Repository Structure

The repository (historically referred to as **CESA**) is organized into three main components:

- `DataSet_Attribute_Extraction/`  
  Raw and lightly processed text documents (incident reports, press releases, newspaper articles, synthetic and generated narratives).  
  This folder contains the **source text corpus** used for annotation and downstream dataset construction.

- `Annotation/`  
  Structured annotation files (`annotations.json`, `report_metadata.json`) provided on a per-subset basis.  
  These files represent the **original annotation representations** used to extract person- and property-level attributes.

- `Derived_Datasets/`  
  Released, normalized datasets derived from the raw corpus and annotations.  
  This folder contains the **paper-faithful snapshot** of the data used to compute the reported statistics and experimental results. It is the **recommended entry point** for reproducing results.

---

## Recommended Usage

- To **reproduce results reported in the paper**, start with the datasets in `Derived_Datasets/`.
- To **inspect source text or trace provenance**, refer to `DataSet_Attribute_Extraction/`.
- To **examine annotation structure and schema**, see `Annotation/`.
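
As a starting point for examining the annotations, the sketch below locates the annotation files and reports their top-level shape. It is a minimal illustration, not part of the released tooling. Only the `Annotation/` folder and the two file names come from this README. The helper names `find_annotation_files` and `load_json` are hypothetical, and because the exact per-subset layout and JSON schema are not described here, the code searches recursively and makes no assumption about field names.

```python
import json
from pathlib import Path

# File names taken from the README; everything else in this sketch is an assumption.
ANNOTATION_FILES = ("annotations.json", "report_metadata.json")


def find_annotation_files(repo_root):
    """Map each folder under Annotation/ to the annotation files found in it.

    The search is recursive, so it does not depend on how subsets are nested.
    """
    annotation_dir = Path(repo_root) / "Annotation"
    found = {}
    for name in ANNOTATION_FILES:
        for path in sorted(annotation_dir.rglob(name)):
            # Key by the containing folder, relative to Annotation/.
            subset = path.parent.relative_to(annotation_dir).as_posix()
            found.setdefault(subset, {})[name] = path
    return found


def load_json(path):
    """Parse one JSON file without assuming anything about its schema."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Run from the repository root.
    for subset, files in find_annotation_files(".").items():
        for name, path in files.items():
            data = load_json(path)
            # Report only the top-level type and size, since the schema is not assumed.
            size = len(data) if isinstance(data, (list, dict)) else 1
            print(f"{subset}/{name}: {type(data).__name__} with {size} top-level item(s)")
```

Inspecting the printed types first (list versus object) is a cheap way to learn the annotation layout before writing any field-specific parsing.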

---

## Reference

If you use this dataset, please cite:
```bibtex
@misc{solaiman2025modularunsupervisedframeworkattribute,
      title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text}, 
      author={KMA Solaiman},
      year={2025},
      eprint={2507.03949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.03949}, 
}
@article{Solaiman_2023,
      title={Feature Centric Multi-modal Information Retrieval in Open World Environment (FemmIR)},
      author={Solaiman, KMA and Bhargava, Bharat},
      year={2023},
      month=feb,
      publisher={Institute of Electrical and Electronics Engineers (IEEE)},
      doi={10.36227/techrxiv.21990284.v1},
      url={http://dx.doi.org/10.36227/techrxiv.21990284.v1}
}
```
---

## License & Use

Released for research and academic use.

Portions of the incident report data were provided in redacted or privacy-reviewed form by the data source, or originate from historical records.

Users should follow applicable ethical and institutional guidelines.

The data is provided as-is, and its release does not imply endorsement of any downstream use.


