# Annotations

This folder contains the **structured annotation files** used in the CESA dataset for text-based attribute extraction.

Annotations are stored **per data subset**, with each subset containing the same two core files:

- `annotations.json` — extracted person-level and property-level attributes
- `report_metadata.json` — document identifiers and metadata

Together, these files define the **annotation view over the raw text documents**.

---

## Annotation Structure

Each subset-level `annotations.json` follows a consistent structure.  
Annotations are keyed by report/document ID and may contain one or more detected persons per document.

A typical person-level annotation includes attributes such as:

- gender
- race
- height / weight
- hair color / hair length
- facial hair
- build
- posture
- accessories (e.g., backpack, headphones)
- clothing descriptors

### Clothing Representation

Clothing attributes may appear in two forms:

**Version 1 (free-form):**
```json
"wearing_v1": ["light colored jeans", "T-Shirt"]
```

**Version 2 (structured):**
```json
"wearing": {
  "jeans": "light colored",
  "T-Shirt": "Yes",
  "Shoes": null
}
```
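
Because both forms can occur, downstream code may want a single representation. The following is a minimal sketch that flattens a Version 2 dictionary into Version 1 style phrases. The function name `wearing_v2_to_phrases` is hypothetical, and the reading of `"Yes"` as "present, no descriptor" and `null` as "not mentioned" is an assumption drawn from the example above, not a documented rule.

```python
from typing import Dict, List, Optional


def wearing_v2_to_phrases(wearing: Dict[str, Optional[str]]) -> List[str]:
    """Flatten a structured (Version 2) clothing dict into free-form phrases."""
    phrases: List[str] = []
    for item, value in wearing.items():
        if value is None:
            # Assumption: null means the item was not mentioned, so skip it.
            continue
        if value.strip().lower() == "yes":
            # Assumption: "Yes" marks presence without any descriptor.
            phrases.append(item)
        else:
            # Otherwise treat the value as a descriptor that precedes the item.
            phrases.append(f"{value} {item}")
    return phrases


example = {"jeans": "light colored", "T-Shirt": "Yes", "Shoes": None}
print(wearing_v2_to_phrases(example))  # ['light colored jeans', 'T-Shirt']
```

Going in this direction avoids parsing free text; converting Version 1 phrases into the structured form would need a vocabulary of garment names, which this folder does not define.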

## Metadata

Each subset-level `report_metadata.json` contains:

- document identifiers  
- source information  
- index mappings used to align annotations with raw text  

These identifiers are used to link annotations back to documents in  
`DataSet_Attribute_Extraction/`.
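
A quick consistency check between the two files might look like the sketch below. The internal layout of `report_metadata.json` is not described here, so the helper (hypothetically named `compare_ids`) takes the metadata identifiers as a plain iterable that you extract yourself; only the fact that annotations are keyed by document ID comes from this README.

```python
from typing import Dict, Iterable, Set, Tuple


def compare_ids(
    annotations: Dict[str, object], metadata_ids: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (IDs only in annotations, IDs only in metadata)."""
    ann_ids = set(annotations)      # annotation keys are document IDs
    meta_ids = set(metadata_ids)    # caller-supplied IDs from the metadata file
    return ann_ids - meta_ids, meta_ids - ann_ids


# Toy identifiers for illustration only; they are not real dataset IDs.
only_in_annotations, only_in_metadata = compare_ids(
    {"doc_a": [], "doc_b": []}, ["doc_b", "doc_c"]
)
print(only_in_annotations, only_in_metadata)
```

Two empty sets indicate that every annotated document has a metadata entry and vice versa.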

---

## Subset Coverage and Completeness

- The top-level `annotations.json` and `report_metadata.json` contain the annotations for the non-generated documents that were collected in collaboration with, or originate from, the West Lafayette Police Department (WLPD). These documents include incident reports, synthetic reports written by police officers, event reports, and similar records.
- Most subsets contain both `annotations.json` and `report_metadata.json`.
- Some generated or synthetic subsets (e.g., `generated_flash_2`) do **not** have complete standalone annotation files in this folder.

However, the annotations for these documents **were available during experimentation**: they were loaded into PostgreSQL, normalized, and used to generate the derived person-level datasets.

As a result, annotations for such documents are **fully reflected in the derived PostgreSQL exports** (e.g., upper-wear color, lower-wear color, gender), even if they are not present here as raw JSON files.
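
To see which subsets ship both core files, a small scan such as the following can help. It is a sketch that assumes each subset is an immediate subdirectory of this folder; the function name `missing_core_files` is hypothetical, and the top-level files are not inspected.

```python
from pathlib import Path
from typing import Dict, List

# The two core files named in this README.
CORE_FILES = ("annotations.json", "report_metadata.json")


def missing_core_files(root: str) -> Dict[str, List[str]]:
    """Map each subset folder under `root` to the core files it lacks."""
    report: Dict[str, List[str]] = {}
    # Assumption: every immediate subdirectory of `root` is one subset.
    for subset in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        missing = [name for name in CORE_FILES if not (subset / name).is_file()]
        if missing:
            report[subset.name] = missing
    return report
```

Calling `missing_core_files(".")` from this folder returns an empty dictionary when every subset is complete. A subset listed in the result is not necessarily missing from the experiments, for the reasons given above.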

---

## Relationship to Derived Datasets

- This folder contains the **original annotation representations**.
- The folder `Derived_Datasets/FemmIR_Text_Postgres_Export/` contains the  
  **paper-faithful, normalized annotation snapshot** used in experiments.
- When discrepancies exist, the **derived dataset should be treated as authoritative** for reproducing reported results.

---

## Example Usage

Example code for loading annotations is provided in `annotation_load_example.py`.

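```python
import json

# Load the person-level and property-level annotations for one subset.
with open('annotations.json') as f:
    annotations = json.load(f)

# Load the matching document identifiers and metadata.
with open('report_metadata.json') as f:
    meta = json.load(f)
```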
Annotations can be accessed by document ID and iterated per detected person.
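
The access pattern just described can be sketched as follows. The sketch assumes that each document ID maps to a list of per-person dictionaries; the actual nesting in `annotations.json` may differ, so check it against `annotation_load_example.py`. The helper name `iter_persons` and the toy IDs and values are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple

Person = Dict[str, Any]


def iter_persons(
    annotations: Dict[str, List[Person]]
) -> Iterator[Tuple[str, Person]]:
    """Yield (document ID, person annotation) pairs."""
    for doc_id, persons in annotations.items():
        # Assumption: the value for each document is a list of person dicts.
        for person in persons:
            yield doc_id, person


# Toy data for illustration only; not taken from the dataset.
toy = {
    "doc_001": [{"gender": "male"}, {"gender": "female"}],
    "doc_002": [{"gender": "male"}],
}
for doc_id, person in iter_persons(toy):
    print(doc_id, person.get("gender"))
```

Using `person.get(...)` rather than indexing keeps the loop tolerant of persons for whom an attribute was not extracted.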

## Notes
- Annotations for the generated documents were automatically rendered.
- Attribute values may contain noise.
- Absence of an annotation file does not imply absence of extracted attributes in the experiments.
- Attribute normalization (e.g., upper-/lower-wear color) was performed prior to database export and is not recomputed from the raw JSON annotations.

## Reference

If you use this dataset, please cite:
```bibtex
@misc{solaiman2025modularunsupervisedframeworkattribute,
      title={A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text},
      author={KMA Solaiman},
      year={2025},
      eprint={2507.03949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.03949},
}
```

---

## License & Use

Released for research and academic use.

Portions of the incident report data were provided in redacted or privacy-reviewed form by the data source, or originate from historical records.

<!-- This dataset may contain sensitive or incident-related text.  
Users are responsible for complying with applicable privacy, ethical, and institutional guidelines when using this data. -->
Users should follow applicable ethical and institutional guidelines.

The data is provided as-is and does not represent endorsement of any downstream use.


