Published May 7, 2026 | Version v1
Conference paper Open

Valid JSON, Wrong Answer: Per-Role Regression Detection and Linguistic Presupposition Labeling for Structured Output

Description

Generating reliable structured output from large language models (e.g., JSON of the form "apply_brake":"True", schema-conformant tool calls, or database queries) remains difficult in production. One possible fix pairs grammar-constrained decoding ('strict-json' mode, to address syntax) with LoRA fine-tuning (to improve semantics), evaluated by aggregate loss. We observe counter-current fields: per-grammar-role components whose loss rises under fine-tuning even as aggregate loss falls.
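
As a rough illustration of how a counter-current field could be detected, the sketch below groups per-token losses by grammar role and flags roles whose mean loss rises after fine-tuning. The function names, the role labels, and the (role, loss) input format are assumptions made for this example; they are not taken from the released code.

```python
from collections import defaultdict


def mean_loss_by_role(token_losses):
    """Average loss per grammar role from an iterable of (role, loss) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for role, loss in token_losses:
        sums[role] += loss
        counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}


def aggregate_change(base, tuned):
    """Relative change in mean loss over all tokens (negative means improvement)."""
    base_mean = sum(loss for _, loss in base) / len(base)
    tuned_mean = sum(loss for _, loss in tuned) / len(tuned)
    return (tuned_mean - base_mean) / base_mean


def counter_current_roles(base, tuned):
    """Roles whose mean loss rose after fine-tuning, with their relative change."""
    base_by_role = mean_loss_by_role(base)
    tuned_by_role = mean_loss_by_role(tuned)
    flagged = {}
    for role, base_loss in base_by_role.items():
        if role in tuned_by_role and base_loss > 0:
            change = (tuned_by_role[role] - base_loss) / base_loss
            if change > 0:
                flagged[role] = change
    return flagged
```

A role is reported only when its own mean loss increases, so an aggregate improvement dominated by frequent roles cannot hide a regression on a rare one.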

Across Qwen 2.5 Instruct at three scales (0.5B, 7B, 32B) on Schema-Guided Dialogue (SGD) and the Contract Understanding Atticus Dataset (CUAD), we reproduce three signatures for counter-current fields:

  • Aggregate loss falls 44–81% under fine-tuning, while loss on the boolean and enum-value roles rises on individual schemas (boolean on Flights, enum-value on CUAD).

  • The effect is scale-dependent: on the boolean refundable field of a flight-booking schema, fine-tuning’s gain shrinks from 55% at 0.5B to 3% at 7B and crosses into a +12% regression at 32B.

  • Counter-currency tracks closely with what we call a uniqueness presupposition: training assumes each field’s value is uniquely determined by the input, but in practice many examples lack the supporting evidence, and fine-tuning’s gradient pulls the model toward the marginal distribution regardless of input.

    We propose two mitigations rooted in this presupposition view:

  • Margin gating, an inference-time technique that emits an abstain value when the model’s top two predictions differ by less than a threshold θ. It improves precision for both baseline and fine-tuned models, with varied impact on recall.

  • Presupposition labeling, a training-time technique that extends the schema with an ambiguous value and relabels training examples that lack supporting evidence. It reduces constrained-content loss by 21–58% across scales and eliminates the boolean regression at 7B and 32B.
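
Margin gating can be sketched in a few lines. The version below assumes the margin is measured between the top two softmax probabilities over a field's candidate values; the abstract does not say whether the paper measures it on probabilities or on logits, and the ABSTAIN sentinel and function name are hypothetical.

```python
import math

ABSTAIN = "abstain"  # hypothetical sentinel; the actual abstain value may differ


def margin_gate(scores, theta):
    """Return the top candidate, or ABSTAIN when the top-two margin is below theta.

    scores maps each candidate value to a logit or log-probability.
    The margin is computed on softmax probabilities (an assumption).
    """
    if not scores:
        raise ValueError("scores must contain at least one candidate")
    if len(scores) == 1:
        return next(iter(scores))
    top_score = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {value: math.exp(s - top_score) for value, s in scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    (best, w1), (_, w2) = ranked[0], ranked[1]
    margin = (w1 - w2) / total
    return ABSTAIN if margin < theta else best
```

Raising θ makes the gate abstain more often, which is consistent with the reported trade-off: precision on emitted values improves while the effect on recall varies.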

    We release code and data for reproduction. 
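
For readers who want a concrete picture of presupposition labeling, the following is a minimal sketch under stated assumptions: the schema is a JSON-Schema-like dict whose field carries an enum of string values, and evidence is judged by a caller-supplied predicate has_evidence. The abstract does not describe how missing evidence is identified, so the predicate, the AMBIGUOUS constant, and both function names are hypothetical.

```python
import copy

AMBIGUOUS = "ambiguous"  # hypothetical name for the added schema value


def extend_schema(schema, field):
    """Return a copy of the schema whose enum for `field` also admits AMBIGUOUS."""
    extended = copy.deepcopy(schema)
    enum = extended["properties"][field]["enum"]
    if AMBIGUOUS not in enum:
        enum.append(AMBIGUOUS)
    return extended


def relabel(examples, field, has_evidence):
    """Relabel `field` as AMBIGUOUS wherever the input lacks supporting evidence.

    has_evidence(input_text, field) is supplied by the caller; how evidence is
    judged is an assumption of this sketch, not the paper's procedure.
    """
    relabeled = []
    for example in examples:
        target = dict(example["target"])  # copy so the original is not mutated
        if not has_evidence(example["input"], field):
            target[field] = AMBIGUOUS
        relabeled.append({"input": example["input"], "target": target})
    return relabeled
```

With a toy predicate such as `lambda text, field: "refund" in text.lower()`, an input like "Book a flight to Oslo" would have its refundable target relabeled as ambiguous, while "Book a refundable flight to Oslo" would keep its original label.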

Files

main.pdf (285.4 kB)
md5:af053a810b3422098c95730a285c672a

Additional details

Software

Repository URL
https://github.com/validjson/valid-json-wrong-answer
Programming language
Python
Development Status
Active