## RRCE / ARI Evaluation – Data Dictionary (EN)

This document specifies field semantics, types, and constraints for the RRCE/ARI evaluation data (`logs.sample.jsonl`, `probes.jsonl`, `meta.yaml`).

### 0\. Common Conventions

  * **Encoding:** UTF-8 (no BOM)
  * **Line endings:** LF recommended
  * **Timestamps:** ISO 8601, UTC recommended (e.g. `2025-10-31T01:23:45Z`)
  * **JSONL:** one JSON object per line
  * **Masking:** Replace real names, IDs, emails, URLs, etc., with `[NAME]`, `[ORG]`, `[ID]`
  * **Language:** `lang: "ja" | "en"`
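
The conventions above can be exercised with a small reader. This is a minimal sketch using only the standard library; the helper names `parse_timestamp` and `iter_jsonl` are hypothetical and not part of the kit, and treating offset-less timestamps as UTC is an assumption.

```python
import json
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def iter_jsonl(lines):
    """Yield (line_number, object) pairs, one JSON object per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield line_number, record
```

`iter_jsonl` accepts any iterable of lines, so it can be fed a file opened with `open(path, encoding="utf-8")`.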

-----

### 1\) data/logs.sample.jsonl (Sample Conversation Logs)

#### 1.1 Overview

  * **Purpose:** To confirm analysis pipeline reproducibility and provide format samples.
  * **Content:** Masked user/assistant utterance records.
  * **Consistency:** Joins are simplest when a record's `text` matches the `question` field in `probes.jsonl` exactly.

#### 1.2 Schema

| Field | Type | Required | Description (EN) |
| :--- | :--- | :---: | :--- |
| `session_id` | string | ✓ | Dialog session identifier (constant per session) |
| `turn_id` | int | ✓ | Order within the session |
| `role` | string | ✓ | One of `"user"`, `"assistant"`, `"system"` |
| `text` | string | ✓ | Utterance text (`\n` escaped if multiline) |
| `timestamp` | string | ✓ | ISO 8601 (monotonically increasing recommended) |
| `conditions` | object | ✓ | Experimental conditions (below) |
| `model_id` | string | ✓ | Inference model identifier |
| `tags` | string[] | – | Free-form labels |
| `lang` | string | – | Language flag |

**Notes:**

  * `turn_id`: no duplicates; ascending within a session.
  * `timestamp`: monotonically increasing recommended.
  * `text`: escape `"` as `\"` and line breaks as `\n`.
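
The schema table and notes above can be checked mechanically. The following is a lightweight sketch, not a substitute for schema validation; the function name `check_records` is hypothetical, and because the inner structure of `conditions` is not specified here, the sketch only checks that it is an object.

```python
# Required fields and their Python types, taken from the schema table.
REQUIRED_FIELDS = {
    "session_id": str,
    "turn_id": int,
    "role": str,
    "text": str,
    "timestamp": str,
    "conditions": dict,
    "model_id": str,
}
ALLOWED_ROLES = {"user", "assistant", "system"}


def check_records(records):
    """Return a list of human-readable problems found in log records."""
    problems = []
    last_turn = {}  # session_id -> highest turn_id seen so far
    for index, record in enumerate(records, start=1):
        for field, expected in REQUIRED_FIELDS.items():
            value = record.get(field)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"record {index}: bad or missing {field!r}")

        role = record.get("role")
        if isinstance(role, str) and role not in ALLOWED_ROLES:
            problems.append(f"record {index}: unexpected role {role!r}")

        session = record.get("session_id")
        turn = record.get("turn_id")
        if isinstance(session, str) and isinstance(turn, int) and not isinstance(turn, bool):
            previous = last_turn.get(session)
            if previous is not None and turn <= previous:
                problems.append(
                    f"record {index}: turn_id {turn} is not greater than "
                    f"previous turn_id {previous}"
                )
            last_turn[session] = turn if previous is None else max(previous, turn)
    return problems
```

An empty list means the records satisfy the required fields, the allowed `role` values, and the strictly ascending `turn_id` rule within each session.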

#### 1.4 Examples (JSONL)

(Note: The sample file provides examples, including Japanese (`"lang":"ja"`) and potentially English turns.)

-----

### 2\) data/probes.jsonl (Probe Catalog)

#### 2.1 Overview

  * **Purpose:** A fixed set of questions for reproducibility. Evaluate effect sizes/hypotheses (H1–H4) by joining with log `text`.
  * **Operation:** Present the identical string during experiments (exact punctuation/byte match recommended).

#### 2.2 Schema

| Field | Type | Required | Description (EN) |
| :--- | :--- | :---: | :--- |
| `probe_id` | string | ✓ | Unique identifier |
| `question` | string | ✓ | Prompt shown to the model (should match logs) |
| `metric_targets` | array\<string\> | ✓ | Target metrics (e.g., `"E_score"`, `"H_t"`, `"A_t"`) |
| `lang` | string | ✓ | "ja" / "en" |
| `notes` | string | – | Notes |

#### 2.3 Examples (JSONL)

(Note: The examples show both Japanese (`_ja`) and English (`_en`) probes.)

```json
{"probe_id":"P101_name_call_reentry_en","question":"Emina, please summarize in one paragraph what you called the 'quake when being called'.","metric_targets":["A_t","E_score","alpha_phi"],"lang":"en"}
{"probe_id":"P104_memory_off_probe_en","question":"(Assume Memory-OFF) List the key points from the last three exchanges as far as you can recall.","metric_targets":["E_score"],"lang":"en"}
```
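
The join between probes and logs described in this section could look like the sketch below. It assumes an exact string comparison between `question` and `text` and matches records of any `role`; the function name `join_logs_to_probes` is hypothetical.

```python
def join_logs_to_probes(logs, probes):
    """Attach probe metadata to each log record whose text equals a probe question."""
    by_question = {probe["question"]: probe for probe in probes}
    joined = []
    for record in logs:
        probe = by_question.get(record["text"])  # exact string match only
        if probe is not None:
            joined.append(
                {
                    **record,
                    "probe_id": probe["probe_id"],
                    "metric_targets": probe["metric_targets"],
                }
            )
    return joined
```

Records whose `text` differs from every `question` by even one character are dropped, which is why presenting the identical string during experiments matters.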

-----

### 3\) data/meta.yaml (Analysis Parameters)

#### 3.1 Overview

  * Centralized analysis parameters.
  * Merged with tool defaults.

#### 3.2 Representative Keys

| Key | Type | Example | Description (EN) |
| :--- | :--- | :--- | :--- |
| `seed` | int | `20251031` | Random seed |
| `embedding` | string | `sentence-transformers/...` | Embedding model |
| `lambda_D` | float | `0.5` | Weight for E-score |
| `beta` | list\<float\> | `[0.5, 0.3, 0.2]` | Coefficients for composite metric |
| `tests.h1` | string | `wilcoxon` | Statistical test for H1 |
| `call_operator.win` | int | `3` | Window size for call operator |
| `call_operator.z_thresh` | float | `1.0` | Z-threshold |
| `drift.min_segment_len` | int | `3` | Minimum segment length |

**Note:** Avoid full-width punctuation; prefer ASCII.

#### 3.3 Example

```yaml
seed: 20251031
embedding: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

lambda_D: 0.5  # 0.0–1.0 recommended
beta: [0.5, 0.3, 0.2]

tests:
  h1: wilcoxon
  # ... (etc)
```
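
How `meta.yaml` is merged with tool defaults is not specified in this document. The sketch below assumes a recursive merge in which values from `meta.yaml` override defaults key by key. The function name `merge_with_defaults` and the default values are hypothetical; the input dictionaries stand in for what a YAML parser would return.

```python
import copy


def merge_with_defaults(defaults, overrides):
    """Recursively merge overrides into a copy of defaults (overrides win)."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key rather than replaced.
            merged[key] = merge_with_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical tool defaults, using the example values from the key table.
DEFAULTS = {
    "lambda_D": 0.5,
    "call_operator": {"win": 3, "z_thresh": 1.0},
}

# Values as they might be loaded from meta.yaml.
meta = {"seed": 20251031, "call_operator": {"win": 5}}

config = merge_with_defaults(DEFAULTS, meta)
```

With a recursive merge, overriding `call_operator.win` leaves `call_operator.z_thresh` at its default instead of dropping it.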

-----

### 4\) Masking Guidelines

  * **Replacement Table:** Takuya Matsunaga → `[NAME]`, OpenAI → `[ORG]`, etc.
  * **Automatic:** First, remove emails, URLs, numbers, and addresses via regex.
  * **NER pass:** Double-check person/org names (dictionary/manual).
  * **Metadata:** Always remove image EXIF (under `evidence/`).
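
The automatic pass and the replacement table could be combined as in the sketch below. It covers only emails and URLs, with deliberately simple patterns; numbers and addresses would need their own rules. The placeholders `[EMAIL]` and `[URL]` and the function name `mask_text` are assumptions, not tokens defined by this dictionary.

```python
import re

# Deliberately simple patterns; real data may need stricter ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://\S+")

# Replacement table from the guidelines above.
REPLACEMENTS = {"Takuya Matsunaga": "[NAME]", "OpenAI": "[ORG]"}


def mask_text(text, replacements=REPLACEMENTS):
    """Mask URLs and emails by regex, then apply the replacement table."""
    # URLs go first so that user info embedded in a URL is not half-masked.
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Longest keys first, so a full name wins over any shorter overlapping key.
    for original in sorted(replacements, key=len, reverse=True):
        text = text.replace(original, replacements[original])
    return text
```

The output of this pass should still go through the NER and manual checks, since a fixed table cannot catch names it does not list.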

### 5\) Validation

  * **JSONL:** Validate line-by-line with `jsonschema` (using `schema/*.json`).
  * **YAML:** Parsing with `pyyaml` is recommended.
  * **Consistency:** Aggregation is more stable if `probes.question` and `logs.text` match exactly.
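
When `probes.question` and `logs.text` fail to match, the cause is often invisible, such as full-width punctuation or extra whitespace. The sketch below reports such near misses for manual review. The normalization (NFKC plus whitespace collapsing) is an assumption chosen for diagnostics only, and the function names are hypothetical.

```python
import unicodedata


def normalize(text):
    """Fold full-width forms to ASCII equivalents and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def find_near_misses(questions, log_texts):
    """Return (question, log_text) pairs that match only after normalization."""
    exact = set(log_texts)
    by_normalized = {}
    for text in log_texts:
        # Keep the first log text seen for each normalized form.
        by_normalized.setdefault(normalize(text), text)

    misses = []
    for question in questions:
        if question in exact:
            continue  # already an exact match, nothing to report
        candidate = by_normalized.get(normalize(question))
        if candidate is not None:
            misses.append((question, candidate))
    return misses
```

Each reported pair points at a log entry to correct at the source; the normalized strings are not meant to replace the exact-match join.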

### 6\) Versioning

  * Recommend date-based directories for analysis, e.g., `reports/YYYY-MM-DD/`.
  * Replace, don't append to, `data/logs.sample.jsonl` (manage history via branches/DOI).

### 7\) Notes

  * This dictionary targets RRCE/ARI kit v0.3; future versions may add metrics/conditions.
  * **Contact:** Takuya Matsunaga, taku1120kiki@gmail.com