# InciText Derived Dataset (PostgreSQL Export)

This folder contains the **derived text-side dataset (InciText)** used in the FemmIR experiments, exported from PostgreSQL in CSV format.

Together, these files are the **complete and consistent snapshot of person-level attributes** used in the paper, taken after annotations were normalized and consolidated across all text subsets.

---

## What This Dataset Represents

- A relational representation of the InciText corpus, including incident, generated, and synthetic reports
- Person-level attribute records used for analysis and retrieval
- Consolidated attributes (e.g., upper-wear / lower-wear color) derived from finer-grained descriptors
- The **exact data used to compute the reported statistics and results**

This dataset was generated by loading raw documents from
`DataSet_Attribute_Extraction/` together with structured annotations from
`Annotation/`, applying normalization rules in PostgreSQL, and exporting the
resulting tables.

---

## Files

### `incident_report_record.csv`
- One row per report or narrative
- Contains report identifiers and text-related fields

### `incident_report_detected_people.csv`
- One or more rows per report
- Contains **1530+ detected person records**
- Attributes include:
  - gender
  - upper-wear color
  - lower-wear color
  - other person-related properties (as available)
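
The two files can be read with Python's standard `csv` module and linked per report. The sketch below is illustrative only: the join column name `report_id` and the sample column names are assumptions, not taken from the export, so check the actual CSV headers before adapting it. It runs on a small in-memory stand-in rather than on the real files.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: "report_id" is a hypothetical name for the column that links
# detected people to their report. Replace it with the real header name.
REPORT_KEY = "report_id"


def read_csv_rows(handle):
    """Return every row of an open CSV file as a dict keyed by header name."""
    return list(csv.DictReader(handle))


def group_people_by_report(people_rows, key=REPORT_KEY):
    """Group detected-person rows by the report they belong to."""
    grouped = defaultdict(list)
    for row in people_rows:
        grouped[row[key]].append(row)
    return dict(grouped)


# In-memory stand-in for incident_report_detected_people.csv.
# The column names and values here are made up for illustration.
sample_people = io.StringIO(
    "report_id,gender,upper_wear_color\n"
    "r1,male,blue\n"
    "r1,female,red\n"
    "r2,male,black\n"
)

people_by_report = group_people_by_report(read_csv_rows(sample_people))
for report_id, people in people_by_report.items():
    print(report_id, len(people))
```

For the real export, pass a file opened with `open("incident_report_detected_people.csv", newline="", encoding="utf-8")` to `read_csv_rows`; the UTF-8 encoding is also an assumption about how the tables were exported.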

---

## Relationship to Other Folders

- Raw and generated text documents are located in `DataSet_Attribute_Extraction/`.
- Structured annotation files are provided in `Annotation/` on a per-subset basis.
- Some generated subsets (e.g., `Generated_flash_2`) do not have complete standalone
  annotation JSON files in the repository.

However, annotations for these documents were available during experimentation and
are **fully reflected in this derived dataset**, which should be treated as the
authoritative representation for reproducing reported results.

---

## Notes on Completeness

- This derived dataset is the **recommended entry point** for reproducing text-side
  experiments in the paper.
- Users working directly from raw documents should note that annotation coverage may
  differ across subsets.
- Attribute values may contain noise due to automatic extraction and normalization.
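
One quick way to gauge this noise is to tally the distinct values of an attribute column before relying on it. The following is a minimal sketch, not the normalization applied in PostgreSQL: the column name `gender` is assumed to match the CSV header, and the trimming and lowercasing are only there to surface near-duplicate spellings.

```python
from collections import Counter


def tally_values(rows, column):
    """Count values in one column after trimming whitespace and lowercasing.

    Empty or absent values are counted under the "<missing>" label.
    """
    counts = Counter()
    for row in rows:
        value = (row.get(column) or "").strip().lower()
        counts[value or "<missing>"] += 1
    return counts


# Made-up rows standing in for records read from the detected-people CSV.
sample_rows = [
    {"gender": "Male"},
    {"gender": "male "},
    {"gender": ""},
    {"gender": "female"},
]

gender_counts = tally_values(sample_rows, "gender")
for value, count in gender_counts.most_common():
    print(f"{value}: {count}")
```

Unexpected labels or a large `<missing>` count in the tally point to columns that may need manual review.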

---

## Reference

If you use this dataset, please cite:
```bibtex
@misc{solaiman2025modularunsupervisedframeworkattribute,
      title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text}, 
      author={KMA Solaiman},
      year={2025},
      eprint={2507.03949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.03949}, 
}
@article{Solaiman_2023,
   title={Feature Centric Multi-modal Information Retrieval in Open World Environment (FemmIR)},
   url={http://dx.doi.org/10.36227/techrxiv.21990284.v1},
   DOI={10.36227/techrxiv.21990284.v1},
   publisher={Institute of Electrical and Electronics Engineers (IEEE)},
   author={Solaiman, KMA and Bhargava, Bharat},
   year={2023},
   month=feb }
```

---

## License & Use

Released for research and academic use.

Portions of the incident report data were provided in redacted or privacy-reviewed form by the data source, or originate from historical records.

<!-- This dataset may contain sensitive or incident-related text.  
Users are responsible for complying with applicable privacy, ethical, and institutional guidelines when using this data. -->
Users should follow applicable ethical and institutional guidelines.

The data is provided as-is, and its release does not imply endorsement of any downstream use.


