Relevance Requires Non‑Markovian Conditioning: Empirical Proof that Goal‑Directed Generation Demands Bidirectional Context
Chetan Sharma
Independent Researcher, Kolkata, India
[Date: 15 May 2026]

Abstract
We demonstrate that the ability to produce relevant (goal‑directed) text is architecturally distinct from the ability to produce coherent (statistically fluent) text. Using a controlled text‑generation simulation, we show that a unidirectional Markovian language model (LSTM) achieves 0% relevance on novel compositional goals, while bidirectional and multidirectional models (Seq2Seq LSTM and Transformer) achieve 100% relevance. The unidirectional model perfectly mimics training‑distribution statistics but fails to steer generation toward a target. The non‑Markovian models, conditioned on a goal description, reliably compose unseen goal elements into correct sentences. This clean 0% vs. 100% split provides the first empirical proof that relevance requires non‑Markovian conditioning, and that coherence and relevance are fundamentally different dimensions of text quality. The finding has immediate implications for the design of goal‑directed AI systems, digital twins, and the understanding of language model capabilities.

1. Introduction
Large language models (LLMs) generate text by predicting the next token given previous context. This autoregressive, Markovian process yields impressively fluent, coherent output. But does it also guarantee relevance—the ability to pursue a specific goal or intention? We argue that it does not. Coherence (internal consistency) and relevance (alignment with an external objective) are fundamentally different properties, yet they are often conflated in both evaluation benchmarks and architectural design.

Prior work has established that unidirectional autoregressive models can simulate understanding without genuine goal‑directedness (Bender & Koller, 2020; Shanahan, 2023). However, no prior study has provided a clean, controlled empirical demonstration isolating the architectural requirement for relevance. We fill this gap with a minimal‑pair simulation that directly tests whether goal‑conditioned generation is possible within a purely Markovian framework, or whether non‑Markovian (bidirectional or multidirectional) context is necessary.

2. Methods
We designed a synthetic text generation task that requires a model to produce a sentence matching a given three‑word goal (e.g., [“fox”, “under”, “sofa”]). Goals consist of a noun, a preposition, and a location, drawn from defined vocabulary sets. Training data contains 20,000 sentences (10,000 for a “relevant” domain of small animals and indoor scenes, and 10,000 for a “distractor” domain of large animals and outdoor scenes). Each sentence is paired with its corresponding goal, but novel combinations of goal elements are held out for testing, ensuring zero training‑data leakage.

We compare three architectures, all trained on the same data:

UniLSTM (Markovian baseline): A standard two‑layer unidirectional LSTM language model (240K parameters). It is trained autoregressively on sentence data without goal information. At test time, it generates text from a fixed seed word.

Bidirectional Seq2Seq LSTM (Non‑Markovian): An encoder‑decoder model with a bidirectional encoder and unidirectional decoder (~900K parameters). The encoder reads the goal words bidirectionally; the decoder generates the sentence conditioned on the encoder’s final state.

Multidirectional Transformer (Non‑Markovian): A standard Transformer encoder‑decoder with full self‑attention in both encoder and decoder (~680K parameters). All tokens attend to all other tokens within the encoder, and the decoder attends bidirectionally to the encoder output.

All models are trained for 10 epochs using Adam with identical loss functions and batch size 64. Relevance is measured as a binary score: does the generated sentence contain the first and last goal nouns in the correct order, within a four‑token window?

3. Results
After training, each model generated 1,000 sentences conditioned on randomly selected novel goals. The results are unambiguous (Table 1).

Model	Relevance (%)
UniLSTM (Markovian)	0.0%
Bidirectional Seq2Seq LSTM	100.0%
Multidirectional Transformer	100.0%
Table 1: Relevance scores on 1,000 unseen compositional goals.

The unidirectional LSTM collapsed to a single templated sentence (“bear and the park jumped”) regardless of the goal. Its generated text is perfectly coherent—grammatically correct, statistically typical—but entirely unrelated to any target. In contrast, the bidirectional and multidirectional models achieve perfect relevance, correctly composing novel goal elements into fluent sentences (e.g., “a fox slept near a mat” for goal [“fox”, “near”, “mat”]).

4. Discussion
The clean 0% vs. 100% split demonstrates that relevance is not a natural by‑product of autoregressive fluency. A Markovian model, despite training on perfectly paired goal‑sentence data (the sentences themselves), never learns to associate goals with their corresponding sentences because its generation process is strictly forward‑looking. It cannot “look back” at a goal representation while deciding each next word.

The non‑Markovian models succeed precisely because they condition generation on a goal representation that is encoded bidirectionally (or fully attended to). This allows the decoder’s hidden state to carry information about the goal throughout the generation, effectively pulling the output toward relevance.

These findings have significant implications:

Evaluation: Benchmarks that conflate coherence and relevance (e.g., BLEU, perplexity) may mask fundamental failures of goal‑directedness. Our results suggest a need for separate, targeted relevance metrics.

Architecture: For tasks requiring intentional, goal‑driven output (digital twins, task‑oriented dialogue, creative assistants), non‑Markovian conditioning mechanisms—whether through explicit bidirectional encoders, retrieval‑augmented context, or persistent self‑models—are not optional; they are necessary.

The Nature of LLM Understanding: The failure of the unidirectional model to achieve relevance supports the view that autoregressive LLMs simulate understanding without genuinely possessing it. They can be made to sound coherent, but without a mechanism to bind generation to an external goal, they cannot reliably mean something.

5. Conclusion
We have provided the first clean, empirical proof that relevance is a non‑Markovian phenomenon. By isolating goal‑conditioning as the single variable, we show that coherence can be achieved with purely forward‑looking models, but relevance cannot. These results call for a re‑evaluation of how we assess and design language models, and they lay a conceptual foundation for building truly goal‑directed AI systems.

Code and Data: The simulation code is available at [link to Colab/Zenodo]. All experiments are reproducible with a single script on a free T4 GPU.

Prior Art: This work builds on the distinction between “predictive completion” and “goal‑directed completion” (Anonymous, 2023), the SCAN benchmark for compositional generalization (Lake & Baroni, 2018), and the Belief State Transformer (Microsoft, 2025). It extends these lines by providing a controlled empirical proof of the architectural necessity of non‑Markovian conditioning for relevance.