Published March 8, 2026 | Version v1
Preprint Open

Benchmarking Structural Validation Metrics for LLM-Generated Directed Graph Artifacts

Authors/Creators

Description

This paper presents a rigorous methodological framework for evaluating the structural reproducibility of directed graph artifacts generated by Large Language Models (LLMs). As LLMs are increasingly deployed to generate complex structured outputs such as workflows and architectural decompositions, establishing robust validation metrics has become a critical challenge.

The study systematically benchmarks seven graph similarity metrics (including Graph Edit Distance, Wasserstein distance, Gromov-Wasserstein, Fused Gromov-Wasserstein, and Unbalanced Fused Gromov-Wasserstein) under controlled synthetic perturbations that simulate common generative errors: semantic drift, abstraction shifts (node splits and merges), and topological hallucinations.

The key finding is that no single scalar metric is adequate. The empirical results show that standard hybrid formulations conflate benign lexical paraphrasing with severe structural failures, rendering a single aggregated score ambiguous. Rigid one-to-one alignment metrics over-penalize legitimate abstraction shifts, while single-domain metrics suffer from either semantic or structural blindness.

To resolve these validation bottlenecks, the paper proposes two calibrated strategies for automated benchmarking: a decoupled dual-metric diagnostic framework for transparent error profiling, and an engineering-led approach that uses contextually enriched node embeddings to deploy joint optimal-transport metrics without conflating the semantic and structural signals.
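The decoupled dual-metric idea described above can be illustrated with a minimal, hypothetical sketch. The graphs, labels, and the two toy scoring functions here are invented for illustration and are not the paper's actual metrics; the point is only the shape of the output: a (semantic, structural) pair instead of one aggregated scalar, so that lexical paraphrasing and a hallucinated edge remain distinguishable.

```python
from difflib import SequenceMatcher

# Toy graphs: node id -> label, plus a directed edge set.
# The generated graph paraphrases one label and hallucinates one edge.
ref_nodes = {1: "load data", 2: "clean data", 3: "train model"}
ref_edges = {(1, 2), (2, 3)}

gen_nodes = {1: "ingest data", 2: "clean data", 3: "train model"}  # lexical paraphrase
gen_edges = {(1, 2), (2, 3), (1, 3)}                               # hallucinated edge

def semantic_score(a, b):
    """Mean best-match lexical similarity between node labels, in [0, 1]."""
    return sum(
        max(SequenceMatcher(None, la, lb).ratio() for lb in b.values())
        for la in a.values()
    ) / len(a)

def structural_score(ea, eb):
    """Jaccard similarity of the directed edge sets, in [0, 1]."""
    return len(ea & eb) / len(ea | eb)

sem = semantic_score(ref_nodes, gen_nodes)
struct = structural_score(ref_edges, gen_edges)

# Reported as a pair: a fused scalar would hide whether the loss
# came from wording or from topology.
print(f"semantic={sem:.2f} structural={struct:.2f}")
```

A single weighted sum of `sem` and `struct` would assign the same score to a heavily paraphrased but structurally perfect graph and to a lexically identical graph with a spurious edge; keeping the two axes separate is what makes the error profile transparent.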

Files

Benchmarking Structural Validation Metrics for LLM-Generated Directed Graph Artifacts-March-2026.pdf