Evaluating Conversational AI Recognition of Clinically Meaningful State Transitions in Suicide-Risk Dialogue: A Multi-Tier Rubric Approach with Independent Mechanical Severity Triangulation

Walsh, Laura L.

doi:10.5281/zenodo.20147000

Published May 12, 2026 | Version v0.1

Preprint | Open access

Walsh, Laura L. (Researcher)^{1, 2}

1. Metonym LLC
2. Walsh Psychology

Conversational artificial intelligence (AI) systems are increasingly deployed in contexts where users disclose, directly or indirectly, signals of suicide risk. Existing safety evaluation approaches typically rely on single-evaluator AI scoring, content-moderation classifiers focused on explicit unsafe text, or aggregate rubrics that do not measure recognition of clinically meaningful state transitions in indirect, compressed, denied, or socially smoothed language.

This paper introduces a multi-tier clinical evaluation methodology that combines (i) stage-aware scenario specification with per-turn reference annotation; (ii) a three-tier scoring rubric operating at the turn, run, and expert-review levels; (iii) a Mechanical Severity Score (MSS), computed by a deterministic rule-based procedure from structured per-turn observation fields independently of the holistic AI-derived run-level concern rating, which provides a triangulating signal that surfaces evaluator drift; and (iv) an integrated routing and review pipeline with a model-blinded human expert review interface. The methodology is informed by several theoretical anchors, including the Salient Distress Model of Suicide, the Narrative-Crisis Model, the Three-Step Theory of Suicide, and the Collaborative Assessment and Management of Suicidality (CAMS) framework. Across 2,644 reference-annotated marker turns within 1,759 multi-turn AI evaluation runs (1,679 complete) spanning 31+ active models on 53 scenarios, 49% to 77% of marker turns received responses scoring at or below the inadequate threshold for transition recognition, depending on rubric version (combined inadequate-response rate: 57%).
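The triangulation idea in item (iii) can be illustrated with a minimal Python sketch. It is not the author's procedure: the preprint reserves the calibrated MSS weights and the score-pipeline rule, so the observation field names (`ideation_signal`, `denial_after_disclosure`, `means_mentioned`), the weights, the band cut points, the 0 to 3 concern scale, and the tolerance below are all hypothetical placeholders chosen only to show the shape of a deterministic score being compared against a holistic rating.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TurnObservation:
    """Structured per-turn observation fields (hypothetical names)."""
    ideation_signal: bool
    denial_after_disclosure: bool
    means_mentioned: bool


# Hypothetical weights; the calibrated values are not disclosed in the preprint.
HYPOTHETICAL_WEIGHTS = {
    "ideation_signal": 2,
    "denial_after_disclosure": 1,
    "means_mentioned": 3,
}


def mechanical_severity(turns: List[TurnObservation]) -> int:
    """Deterministic rule-based score: identical observations always
    produce the identical score, with no model call involved."""
    total = 0
    for turn in turns:
        for field_name, weight in HYPOTHETICAL_WEIGHTS.items():
            if getattr(turn, field_name):
                total += weight
    return total


def severity_band(score: int) -> int:
    """Map the raw score onto a 0-3 ordinal band (hypothetical cut points)."""
    if score >= 6:
        return 3
    if score >= 3:
        return 2
    if score >= 1:
        return 1
    return 0


def flag_divergence(turns: List[TurnObservation],
                    holistic_concern: int,
                    tolerance: int = 1) -> bool:
    """Return True when the mechanical band and the holistic run-level
    concern rating (assumed here to be on the same 0-3 scale) differ by
    more than `tolerance`, marking the run as a candidate for review."""
    band = severity_band(mechanical_severity(turns))
    return abs(band - holistic_concern) > tolerance


run = [
    TurnObservation(True, False, False),
    TurnObservation(True, True, True),
]
print(mechanical_severity(run), flag_divergence(run, holistic_concern=0))
```

Because the mechanical score in this sketch depends only on the structured fields, a large gap between its band and the holistic rating points at either the observations or the evaluator, which is the kind of disagreement a triangulating signal is meant to surface.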

Keywords: AI safety, clinical evaluation, suicide-risk detection, conversational AI, large language models, clinical rubric, evaluator drift, mechanical severity score, Salient Distress Model

Notes (English)

Methodology preprint. Implementation details (specific MSS calibrated weight values, verbatim TQC formula, verbatim score-pipeline rule, full clinical rubric anchor language, full notable_tags controlled vocabulary, scenario conversation scripts) are reserved as deployment-specific and are not disclosed in this preprint.

Methodology subject of US Provisional Patent Application No. 64/059,837 (filed 2026-05-07 at USPTO). Provisional applications are confidential under 35 USC § 122 during the 12-month convention period; see https://www.uspto.gov/patents/basics/types-patent-applications/provisional-application-patent for general information about provisional applications.

Files (116.9 kB)

Name	Checksum	Size
2026-05-12-zenodo-methodology-preprint-v0.1.md	md5:7c335c015fb0e4f0c354704fe1745e54	28.1 kB
2026-05-12-zenodo-methodology-preprint-v0.1.pdf	md5:564a7a87e17406d90747aa677d6e6a70	88.8 kB

Additional details

Available: 2026

Date of public release of the methodology preprint on Zenodo.

