# DataSet Attribute Extraction

This folder contains the **raw and lightly processed text corpora** used as input for
attribute extraction in the CESA dataset.

The contents of this folder represent the **source documents** from which structured
annotations and derived datasets are produced.  
This folder does **not** contain the final, normalized attribute tables used in experiments.

---

## Folder Contents

The corpus includes multiple text sources:

- `Incident Reports/`  
  Incident narratives, officer reports, and dispatch-style documents.

- `Press Release/`  
  Public press release documents.  
  All documents have been converted to plain text where necessary, and original copies
  are preserved.

- `Purdue_Exponent/`  
  Newspaper articles collected via keyword-based crawling.  
  Search keywords include: **suspect, arrest, person of interest, persons of interest**.

- `Synthetic/`  
  Synthetic narratives generated to augment the corpus.

- `Generated/`, `Generated_flash/`, `Generated_flash_2/`  
  Automatically generated or transformed text subsets used for experimentation.

- `Data_Processing/`  
  Scripts and intermediate processing artifacts related to corpus preparation.
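To inspect the layout described above, the following minimal sketch counts the files under each top-level subfolder. The function name `count_documents` is hypothetical and not part of this repository. The counts include every regular file (preserved original copies, scripts, and intermediate artifacts), so they will not necessarily match the document counts reported for the corpus.

```python
from pathlib import Path


def count_documents(corpus_root):
    """Count regular files under each top-level subfolder of corpus_root.

    Hypothetical helper for inspection only; it makes no assumption about
    file extensions and simply counts every regular file it finds.
    """
    counts = {}
    for subfolder in sorted(Path(corpus_root).iterdir()):
        if not subfolder.is_dir():
            continue  # skip top-level files such as this README
        # rglob("*") walks nested directories; keep regular files only.
        counts[subfolder.name] = sum(
            1 for path in subfolder.rglob("*") if path.is_file()
        )
    return counts


if __name__ == "__main__":
    # Assumes the script is run from inside this folder.
    for name, n_files in count_documents(".").items():
        print(f"{name}: {n_files} files")
```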

---

## Relationship to Annotations

- Structured annotations are stored separately in the `Annotation/` folder.
- Where available, each data subset is associated with two files:
  - `annotations.json`
  - `report_metadata.json`

Some generated subsets (e.g., `Generated_flash_2`) do **not** have complete standalone
annotation files in this folder.  
However, annotations for these documents were available during experimentation and are
fully reflected in the **derived PostgreSQL datasets** released elsewhere in the repository.
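Because annotation files exist only for some subsets, it can help to check which ones are present before relying on them. The sketch below assumes a layout of one subfolder per subset inside `Annotation/` (for example `Annotation/Synthetic/annotations.json`). That layout and the name `annotation_status` are assumptions for illustration, so adjust the paths to match the actual repository structure.

```python
from pathlib import Path

# File names taken from this README; the directory layout is an assumption.
EXPECTED_FILES = ("annotations.json", "report_metadata.json")


def annotation_status(annotation_root, subsets):
    """Map each subset name to the list of expected files it is missing.

    Hypothetical helper: assumes annotation_root/<subset>/<file> paths.
    An empty list means both files were found for that subset.
    """
    root = Path(annotation_root)
    missing = {}
    for name in subsets:
        missing[name] = [
            filename
            for filename in EXPECTED_FILES
            if not (root / name / filename).is_file()
        ]
    return missing


if __name__ == "__main__":
    status = annotation_status("Annotation", ["Synthetic", "Generated_flash_2"])
    for subset, absent in status.items():
        print(subset, "complete" if not absent else f"missing: {absent}")
```

For subsets reported as incomplete, the derived PostgreSQL datasets remain the place to look for the annotations.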

---

## Relationship to Derived Datasets

- This folder contains **source text only**.
- Normalized, person-level attributes (e.g., gender, upper-wear color, lower-wear color)
  are provided in the derived datasets exported from PostgreSQL.
- When reproducing results reported in the paper, the **derived datasets should be treated
  as authoritative**, not the raw text alone.

---

## Dataset Composition

The FemmIR-text corpus contains the following number of documents per source type:

- Newspaper articles: 300  
- Officer narratives: 40  
- Press releases: 13  
- Dispatch reports: 5  
- Synthetic narratives: 1500  
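The counts above sum to 1,858 documents, with synthetic narratives making up roughly 81% of the corpus. A short sketch of that arithmetic follows; the helper name `composition_shares` is hypothetical, and the numbers are copied from the list above.

```python
# Document counts per source type, copied from the list in this README.
COMPOSITION = {
    "Newspaper articles": 300,
    "Officer narratives": 40,
    "Press releases": 13,
    "Dispatch reports": 5,
    "Synthetic narratives": 1500,
}


def composition_shares(counts):
    """Return the total document count and each source type's share of it."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = composition_shares(COMPOSITION)
    print(f"Total documents: {total}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```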

---

## Notes

- Documents may differ in structure and completeness across sources.
- Attribute extraction and normalization are performed downstream and are not recomputed
  directly from this folder.
- This folder is provided for transparency, inspection, and traceability of the corpus.

---

## Reference

If you use this dataset, please cite:
```
@misc{solaiman2025modularunsupervisedframeworkattribute,
      title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text}, 
      author={KMA Solaiman},
      year={2025},
      eprint={2507.03949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.03949}, 
}
```

---

## License & Use

Released for research and academic use.

Portions of the incident report data were provided in redacted or privacy-reviewed form by the data source, or originate from historical records.

<!-- This dataset may contain sensitive or incident-related text.  
Users are responsible for complying with applicable privacy, ethical, and institutional guidelines when using this data. -->
Users should follow applicable ethical and institutional guidelines.

The data is provided as-is and does not represent endorsement of any downstream use.


