Long-Context Isn't All You Need: Impact of Retrieval and Chunking on Finance RAG

As very long-context large language models (LLMs) have arrived on the scene, some have claimed that retrieval is no longer needed for a high-quality AI answer generation pipeline — rather, simply fit all documents into a large context window and let the LLM pick out the relevant pieces. Through a case study on financial document analysis, we disprove this theory and show that retrieval and chunking strategies are still large determinants of AI answer generation quality, even more important than the quality of the generating model itself.

Retrieval-augmented generation (RAG) systems face unique challenges when processing long, complex documents — particularly financial filings such as SEC 10-K and 10-Q forms. For example, consider these queries:

“What are the key factors contributing to the $2.1 trillion increase in assets under management (AUM) at BNY as of the third quarter of 2024?”

“Which filings mention regulatory changes that impact capital requirements?”

In these scenarios, accurately pinpointing relevant information is critical. Overlooking key details can lead to costly missteps. Financial teams need a RAG system that doesn't just retrieve information but retrieves relevant information with precision from thousands of PDFs, each spanning hundreds of pages.

A typical RAG pipeline, in a nutshell, involves the following steps:

Parsing documents: Extracting text via file parsers or OCR models for a high-fidelity retrieval across varied document formats.

Chunking: Segmenting text into meaningful units that balance retrieval effectiveness with contextual integrity.

Retrieval: Searching and returning the most relevant chunks using advanced techniques such as vector search.

Generation: Synthesizing retrieved chunks into a coherent, factually consistent response.

Figure 1. A standard RAG pipeline over a document corpus.
Each of these components can be tuned individually with the overall goal of improving the quality of the output text generated by a chatbot. Our experiments on a hand-curated data set of SEC filings focus on the chunking, retrieval and generation components of the pipeline. Our results are powerful and somewhat counterintuitive — even with the advent of long-context models, chunking and retrieval strategies are far more impactful on output quality than the raw computational power of the generative language model.

Specifically, based on our high-quality, hand-curated data set of SEC filings, we discovered that:

Global document context is critical: Appending LLM‑generated global document context (for example, company name or filing date) boosts response accuracy, as seen in Figure 2. This simple strategy outperforms LLM‑generated chunk-specific context.

Optimal chunking matters: Using moderate chunk sizes (~1,800 characters) and retrieving more chunks improves accuracy, while using overly large chunks (for example, 14,400 characters) can dilute relevance and drop performance by ~10%-20%.

Optimized retrieval narrows the gap: A robust retrieval pipeline with moderate 1,800‑character chunks and top‑50 retrieval significantly elevates performance — narrowing the quality gap between generation models (for example, bringing Llama 3.3 70B quality close to that of Claude 3.5 Sonnet).

Markdown-aware chunking has benefits: Without document context, markdown‑aware chunking (using section headers) can boost accuracy by 5%-10% over fixed splits; however, its advantage lessens when global context is added.

Snowflake Cortex Search — and the broader Snowflake Cortex AI ecosystem — is engineered to tackle these challenges head-on, offering a flexible, production-ready solution that extracts precisely what you need from even the most extensive financial filings.

Figure 2. Different strategies improve RAG accuracy. Using more chunks as contexts, adding contexts to chunks and employing structure-aware chunking all push performance further. All settings (except “No RAG”) use Arctic-Embed 2.0 M for retrieval.
For all these plots, the y-axis is LLM-judged accuracy.

Chunking strategies explored
Our SEC filings experiments focused on a select set of chunking techniques that significantly boosted retrieval accuracy — without overcomplicating the pipeline. We concentrated on three "vanilla" approaches and enhanced them with LLM-based metadata injection.

“Vanilla” chunking strategies
Recursive chunking: This strategy uses a hierarchical approach using predefined delimiters. It begins by splitting text at larger, semantically meaningful boundaries (for example, double newlines for paragraphs) and then recurses into finer splits (for example, single newlines or spaces) as needed. This method is available as a user-defined function (UDF) in Snowflake and LangChain. In our experiments, we convert parsed documents from markdown to plaintext before applying recursive split text chunking.

Semantic chunking: This strategy finds natural breakpoints in the provided text by embedding each sentence and calculating the cosine distance between embeddings of each consecutive sentence. If the distance between two sentence embedding exceeds a specified threshold, that point is considered a good place to split. The breakpoint type (for example, percentile or standard deviation) and threshold can be specified in the input. The LangChain implementation can be found here. For these experiments, we use the plaintext version of the parsed documents and we use the percentile breakpoint type.

Markdown-header-based chunking: This approach splits text based on markdown headers, preserving structural cues that help reconstruct document context (here is a LangChain example). For example, headers such as “Risk Factors” or “MD&A” serve as valuable context during retrieval. The markdown headers can be used as context for the chunk. We plan to offer this chunking strategy soon as a UDF in snowflake.

LLM-enhanced context-based chunking
LLM-based document-level metadata: Here, a language model extracts or generates a concise summary of the entire document — covering key details such as company name, filing date, form type, financial metrics and section highlights. This global metadata is then prepended to every chunk, providing a consistent context that improves retrieval precision. See the appendix for more details and a simple SQL to add this context to all chunks.

LLM-based chunk-level metadata: This strategy improves retrieval accuracy by constructing chunk-specific contextual explanations prior to indexing (as exemplified here) and prepending them to the chunk. This approach ensures that each chunk not only carries its immediate content but also benefits from additional context derived from the broader document. See appendix for more details.

Comparison of chunking strategies
This table shows the pros and cons of each of the abovementioned chunking strategies.

Strategy

Pros

Cons

Recursive Chunking

- Preserves natural language boundaries

- Adapts flexibly to document structure

- May produce very small or uneven chunks

- Each chunk has no global context

Semantic Chunking

- Determines natural splitting points based on semantic similarity based on sentence embeddings

- May produce uneven chunks

- Embedding the whole corpus can be slow and costly

Markdown-Header-Based Chunking

- Aligns with the author’s logical organization

- Maintains coherent context within sections

- Can miss content spanning multiple headers

- Employs inconsistent header usage across filings

LLM-Based Document-Level Metadata

- Embeds global context with every chunk

- Improves retrieval precision

- Reduces redundancy

- Requires extra computation

- Has more complex implementation

LLM-Based Chunk-Level Metadata

- Provides tailored, context-rich summaries for each chunk

- Enhances disambiguation in complex filings


- Has significant computational overhead

- Introduces risk of LLM “hallucinations” and increased storage needs

Results
Figure 3. Different strategies improve RAG accuracy. Using more chunks as contexts, adding contexts to chunks and employing structure-aware chunking all push performance further. All settings (except “No RAG”) use Arctic-Embed 2.0 M for retrieval.
Document context outperforms chunk-level context
One of the most notable findings is that appending document-level context to each chunk provides consistent gains across both generation models (Claude 3.5 Sonnet and Llama 70B). For instance, with 1,800‑character chunks, injecting global metadata (such as company name, filing date and form type) boosts QA accuracy from around 50%-60% to the 72%-75% range.

By contrast, heavily augmenting each chunk with a unique LLM-generated summary (i.e., chunk-level context), though recommended by some sources such as Anthropic's "contextual retrieval" approach, is not quite as effective. This method resulted in 5.8- and 0.4-point losses on Llama 3 and Claude Sonnet, respectively, despite requiring much more computation. Simply including global document context — once per chunk — proved more robust and efficient.

Choice of retrieval and chunking can matter more than the generative model
A striking outcome in our experiments is how powerful retrieval and chunking can sometimes outweigh the strength of the generative model itself. While Claude 3.5 Sonnet does outperform Llama 70B in nearly every setting, the gap narrows significantly when we introduce higher‑quality retrieval pipelines (for example, top‑50 ranked chunks) and well‑structured chunking strategies (for example, 1,800‑character splits with appended doc metadata).

No RAG baseline: Both models perform poorly on the query sets without RAG, scoring merely 5%-10% accuracy.

High‑quality retrieval and chunking: Llama 70B’s accuracy can climb from 40-50% to the 70%+ range — nearly matching Claude 3.5 Sonnet in some configurations.

This highlights a crucial insight for finance RAG:

Even a strong generative model can yield subpar results if the retrieval stage is weak (for example, with large or unfocused chunks or with few relevant chunks retrieved).

Conversely, a carefully tuned retrieval pipeline (with moderate chunk size, a robust chunking strategy and appended doc metadata) can substantially boost a weaker model’s performance, in some cases closing most of the gap to a more powerful model.

In other words, the interplay of retrieval and chunking can be as important — if not more — than the raw capacity of the LLM, particularly when dealing with large, specialized documents such as SEC filings.

Figure 4. Effect of “more contexts” — through more chunks or larger chunks — without and with document contexts. All settings use Arctic-Embed 2.0 M for retrieval.
Adding large chunks is worse than adding more granular chunks
Many have argued recently that in the era of powerful long context models, we can simply throw entire documents or pages at the generative model and let it take care of extracting relevant information. However, our experiments show that bigger chunks (for example, ~14,400 characters) can degrade retrieval performance by bundling too much text, making it harder for vector searches to pinpoint the most relevant sections. Even Claude 3.5 Sonnet, with a 200,000-token window, sometimes suffers from “context confusion” if large chunks contain too many irrelevant details. On close examination of additional errors, we found that the extra noise added by larger chunks confuses the generation model:

In one case, the question was about “driver compensation,” but the model got confused and included details of “driver commission.”

In another case, it picked up information from the wrong year (2023 vs. 2024).

By contrast, a moderate chunk size (1,800 characters) is generally more effective, especially when retrieving multiple chunks (for example, top‑50). However, “more is not always better”: Returning too many chunks or using overly large chunk sizes can saturate the context window and muddy the model’s focus. In practice, balancing chunk size and retrieval depth is key to maximizing QA accuracy.

Figure 5. Three "vanilla" chunking approaches and how LLM-based metadata injection enhances them. All settings use Arctic-Embed 2.0 M for retrieval and use top-10 retrieved chunks for RAG.
Of all the “vanilla” chunking strategies, markdown-aware chunking performs the best
When document-level metadata is not appended, markdown-aware chunking (i.e., splitting on section headers or structured boundaries) tends to outperform naive fixed‑size splits as well as semantic chunking by 5-10 percentage points. This is because it preserves natural thematic breaks — important for financial documents that have standard sections such as “Management Discussion and Analysis” or “Risk Factors.”

However, once document-level contexts are appended, the differences between markdown and plaintext chunking strategies shrink. Having a global “this is from XYZ Corp 10-Q” label on every chunk helps the model keep track of the bigger picture, reducing the need for chunk‑level structural awareness. Hence, teams looking to avoid the overhead of generating doc contexts might consider markdown‑aware chunking. Those who can afford to generate or store doc metadata often find the chunking approach becomes less critical once global context is injected.



Appendix
Details of metrics and data overview
Metrics
To rigorously assess RAG performance, we compare the final generated outputs against a gold standard using two key metrics:

Average Normalized Levenshtein Similarity (ANLS): A soft-matching metric that quantifies similarity between the generated answer and the gold standard using Levenshtein distance.

LLM-based quality scores: An approach where a second LLM (or the same LLM in a separate pass) is used to judge the correctness and completeness of the generated response, with a rubric-like prompt. If the generated response is deemed to cover the golden answer for the query, then it is considered accurate; otherwise, if the generated response states "no answer" or fails to cover the golden answer, it is considered inaccurate.

In a future post, we will correlate metrics at document level, chunk level and final generation level. Next, we detail our data set's structure and annotation scheme, designed to support rigorous evaluation.

A closer look at our data
Document corpus
We compiled five years of SEC 10-K and 10-Q filings for the top 1,000 Fortune companies, amounting to approximately 23,000 PDFs. These files were parsed with Snowflake Cortex Parse_Document in ‘LAYOUT’ mode to yield markdown text. Subsequent chunking (using SPLIT_TEXT_RECURSIVE_CHARACTER with an 1,800-token size and 300-token overlap) produced roughly 3.2 million chunks.

Annotations
We worked with an annotation agency to collect very high-quality annotations for around 500 queries. This is by far the best finance RAG data set in both the extensiveness of our annotations and the realistic and difficult nature of the queries involved (the closest other alternative is Financebench, but the information-seeking queries in that dataset are highly artificial often specifying which table or section the query should be answered from).

Designed to capture every nuance of financial filings, each annotated record includes:

Category: The type of query (e.g., contextual, fact-based)

Query: The actual question (e.g., “What are the key factors contributing to the $2.1 trillion increase in AUM at BNY?”)

Answer: A synthesized answer based on the aggregated evidence

Document identifiers (Doc Ids): Unique IDs linking to specific SEC filings

Span texts: The specific segments of text (or “spans”) where relevant information is located

Evidence metadata: For each evidence snippet, we record the page number, a detailed excerpt of the evidence text and a qualitative rating indicating the confidence or relevance

Types: The type of evidence (e.g., text or table)

For instance, one annotated example in our data set might include the following elements:

Category: Contextual

Query: What are the key factors contributing to the $2.1 trillion increase in assets under management (AUM) at BNY as of the third quarter of 2024?

Answer: The AUM of $2.1 trillion reflects an 18% YoY growth driven by higher market values and the favorable impact of a weaker U.S. dollar.

Doc Id 1: Bank of New York Mellon Corp_10-Q_2024-11-01

Span 1: "higher market values and the favorable impact of a weaker U.S. dollar"

Evidence Page Number 1: 5

Evidence Text 1: "AUM of $2.1trn up 18% YoY, primarily reflecting higher market values and the favorable impact of a weaker U.S. dollar"

Rating 1: 2

Type: Text

... [further evidence fields follow] ...

This comprehensive annotation scheme enables evaluation at multiple levels:

Generation level: Assessing the overall accuracy and coherence of the generated answer.

Document level: Ensuring the right document (by Doc Id) is retrieved.

Chunk level: Verifying that the specific relevant snippets of text are accurately captured, despite challenges such as split spans or redundant information.

Details of adding LLM-based context
Adding document-level context to each document
Metadata generation: An LLM is used to extract or generate a concise summary of the document. For SEC filings, the generated metadata might include1:

Company name: "Bank of New York Mellon Corp"

Filing date: "2024-11-01"

Form type: "10-Q" or "10-K"

Key financial metrics: Brief notes such as "18% YoY AUM growth driven by higher market values and favorable currency trends"

Section highlights: Identifiers for sections such as "Risk Factors" or "Management Discussion and Analysis"

Context injection into chunks: The generated metadata is then prepended to each chunk produced via a vanilla strategy, ensuring that every chunk carries both its immediate content and a snapshot of the document’s broader context.

-- STEP 1: GENERATE METADATA WITH LLM BASED ON PROVIDED KEYS

CREATE OR REPLACE TABLE MY_DOC_METADATA AS (

    SELECT

        DOC_ID,

        TEXT,

        SNOWFLAKE.CORTEX.COMPLETE(

        'llama3.3-70b',

        'I am going to provide a document which will be indexed by a retrieval system containing many similar documents. I want you to provide key information associated with this document that can help differentiate this document in the index. Follow these instructions:

    1. Do not dwell on low level details. Only provide key high level information that a human might be expected to provide when searching for this doc.

    2. Do not use any formatting, just provide keys and values using a colon to separate key and value. Have each key and value be on a new line.

    3. Only extract at most the following information. If you are not confident with pulling any one of these keys, then do not include that key:\n'

    ||

    ARRAY_TO_STRING(

        ARRAY_CONSTRUCT(<INSERT KEYS AS LIST OF STRINGS HERE>),

        '\t\t* ')

    ||

    '\n\nDoc starts here:\n' || SUBSTR(TEXT, 0, 4000) || '\nDoc ends here\n\n') METADATA,

    FROM

        MY_DOC_TABLE

)

;



-- STEP 2: GENERATE CHUNKS AND PREPEND CONTEXT TO CHUNKb

CREATE OR REPLACE TABLE MY_CONTEXTUALIZED_CHUNKS AS (

    WITH SPLIT_TEXT_CHUNKS AS (

        SELECT

            DOC_ID,

            C.VALUE AS CHUNK,

        FROM

           MY_DOC_METADATA,

           LATERAL FLATTEN( input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER (

              TEXT,

              'none',

              1800, -- SET CHUNK SIZE

              300 -- SET CHUNK OVERLAP

           )) C

    )

    SELECT

        M.DOC_ID,

        CONCAT(M.METADATA, '\n\n', C.CHUNK) AS CONTEXTUALIZED_CHUNK,

    FROM

        SPLIT_TEXT_CHUNKS C

    JOIN

        MY_DOC_METADATA M ON C.DOC_ID = M.DOC_ID

)

;

Adding a per-chunk context
This method consists of two key steps:

Context generation: A language model creates a concise explanation for each chunk, situating it within the broader document context. For instance, for SEC filings, this metadata may include:

Document reference: Identifies the broader document or section it belongs to (e.g., "This chunk is from the ‘Management Discussion and Analysis’ section of XYZ Corp’s Q3 2023 10-Q filing").

Historical context: Adds reference points from the document (e.g., "In Q2 2023, operating expenses were reported at $150 million").

Key details: Extracts crucial information that might not be immediately present in the chunk itself.

Context injection into chunks: The generated contextual metadata is prepended to each chunk before embedding and indexing, ensuring that the retrieval model understands how the chunk relates to the entire document.
Illustrative example for SEC filings
Consider a chunk from an SEC filing that originally states: "The company reported a 12% increase in operating expenses compared to the prior quarter."

Using contextual retrieval, this chunk would be transformed into: "This chunk is from the ‘Management Discussion and Analysis’ section of XYZ Corp’s Q3 2023 10-Q filing. In Q2 2023, operating expenses were reported at $150 million. The company reported a 12% increase in operating expenses compared to the prior quarter."

1 In some cases, metadata might be readily available even without using LLMs.

Arctic Agentic RAG Episode 1: Agentic Query Clarification for Grounded and Speedy Responses

Welcome to the first deep-dive episode of our Arctic Agentic RAG series!

In our overview blog post, we explored the limitations of traditional retrieval-augmented generation (RAG) systems in enterprise environments. These systems struggle with ambiguous queries, multimodal data and complex reasoning — making it difficult for businesses to extract the right information reliably. This is where Arctic Agentic RAG comes in, transforming enterprise AI with intelligent, adaptive and verifiable retrieval.

Now, in this first deep-dive episode, we dive into one of the most fundamental challenges: handling ambiguous queries.

The first challenge: Addressing ambiguous queries
In traditional search, ambiguous queries are typically addressed by presenting users with a diverse set of related links, allowing them to explore different perspectives and manually extract the most relevant information. While this approach offers flexibility, it shifts the responsibility onto the user and lacks the precision required for enterprise AI applications, where speed, trust and accuracy are paramount.

Unlike traditional search, which relies on users to refine their own queries, standard RAG pipelines sometimes attempt to generate direct answers. However, these answers may be incorrect, incomplete or misleading if the query lacks sufficient specificity. Other times, they provide possible clarification questions. However, these clarification questions can often be irrelevant or unanswerable based on the user's repository, leading to further confusion and inefficiency. This is especially problematic in enterprises where accuracy and trust are paramount.

Arctic Agentic RAG takes a different approach: It clarifies the query first, ensuring that any clarification questions generated are both relevant and answerable within the user's repository, ensuring both questions and answers are grounded — meaning that responses can be verified against retrieved passages. Besides enhancing accuracy, our approach also ensures that responses are generated efficiently and cost effectively.

In this episode, we dive deep into how Arctic Agentic RAG tackles query ambiguity with grounded and speedy responses. We’ll also provide an overview of the Arctic Agentic RAG open source framework, which enables researchers and developers to explore and implement these techniques themselves.

Why ambiguous queries matter in enterprise RAG
RAG is designed to complement LLMs by retrieving evolving, domain-specific enterprise information from the corpus that is possibly absent from an LLM’s static training data. For RAG to be effective, it must ensure both diversity in covering diverse user intents and grounding responses to the retrieved passages.

Current methods for clarification tend to prioritize diversity, often leading to interpretations that RAG cannot effectively answer.

Pitfalls of existing methods
Most state-of-the-art approaches follow a "diversify then verify" (DtV) strategy. An LLM first generates multiple possible meanings for a query, then retrieves documents for all interpretations and finally prunes irrelevant results.

For example, the query "What is HP?" could refer to Hewlett-Packard, horsepower or Harry Potter. A general LLM might suggest all three, but in an enterprise-specific corpus, only one may be relevant. Despite this, DtV retrieves documents for all interpretations, adding noise and increasing computational cost.

Figure 1a illustrates this limitation: Verification happens too late, after retrieval has already been influenced by irrelevant interpretations. This inefficiency makes enterprise retrieval less precise and more resource intensive.

Comparison of (a) DtV and (b) Snowflake’s VD workflows for handling ambiguous questions in RAG. VD avoids generating ungrounded interpretations and thus does not attempt to answer those.
Figure 1: Comparison of (a) DtV and (b) Snowflake’s VerDICT workflows for handling ambiguous questions in RAG. VerDICT avoids generating ungrounded interpretations and thus does not attempt to answer those.
How Arctic Agentic RAG addresses query ambiguity: Verified DIversification with ConsolidaTion (VerDICT)
Figure 1b contrasts Snowflake’s improved workflow, Verified DIversification with ConsolidaTion (VerDICT), integrating verification directly into the diversification step. Rather than generating all possible interpretations up front, our approach first relaxes user queries to retrieve passages with diverse interpretations, represented as relevance feedback in the figure. We then extract grounded interpretations from the retrieved passages and also ensure that each can be answered from the retrieved passages, using answerability feedback. These two feedback types are elaborated below.

Retriever: relevance feedback: Unlike DtV, which diversifies into all possible interpretations, our approach first checks which interpretations are supported by the retrieved passages. A single retriever call with a relaxed query identifies top-k search results representing diverse interpretations, from which interpretations are extracted to avoid extracting an ungrounded interpretation such as Harry Potter in Figure 1a.

Generator: answerability feedback: Even if a document is relevant to the interpretation grounded to this document, it may not answer the query. Thus retrieval alone is insufficient for feedback — we introduce a generator feedback, to ensure that an answer can be generated before retraining an interpretation. To motivate, Figure 1b describes how we filter out relevant but unanswerable passage p2. It is relevant to the Hewlett-Packard interpretation, describing its products, but it cannot answer what HP is. To address this, we prompt the generator LLM with the question and its grounded passage, verifying whether a valid question-answer pair can be formed. If not, the interpretation is discarded.

A consolidation phase using clustering then follows to enhance robustness against noise in retriever and generator feedback: Question-answer pairs obtained from verification are clustered to keep those consistently supported by relevant passages, while filtering out outliers from noisy passages.

Performance Results
VerDICT is fast and resource efficient
By verifying interpretations up front, VerDICT eliminates noise and reduces unnecessary computations. Traditional DtV methods repeatedly call the retriever per each interpretation, while VerDICT significantly reduces retrieval overhead. For example, when there are three interpretations — Hewlett-Packard, horsepower or Harry Potter — the number of interpretations is |Q| = 3 and top-k results are retrieved per each interpretation. VerDICT, on the other hand, by verifying grounded interpretations up front, cuts down on retrieval and processing costs, as Table 1 shows.




Retriever

LLM


# of calls

# of calls × Input length

DtV

O(|Q|)

O(|Q|) × O(k)

VerDICT

O(1)

O(k) × O(1)

Table 1: Comparison of number of calls made to the retriever and LLM generator per each question between DtV and VerDICT. |Q| is the number of interpretations, and k is the size of retrieval.

VerDICT generates correct and grounded interpretations
Efficiency alone isn’t enough — accuracy is critical. In our evaluations (see Figure 2), 93% of VerDICT-generated interpretations led to correct and grounded answers, compared to just 56% with DtV. Even human-generated interpretations scored only 65%, proving that VerDICT is both accurate and reliable.

Figure 2: The ratio of grounded interpretations from DtV (orange) and VerDICT (blue), with Llama 3.3 70B (left) and GPT-4o (right) as backbone LLMs.
Summing up, these results demonstrate that VerDICT enhances accuracy, minimizes wasted resources and improves the user experience. For a deeper technical and empirical analysis, check out our paper “Agentic Verification for Ambiguous Query Disambiguation,” available at arXiv.

Applications in Snowflake’s Cortex Agents API
Arctic Agentic RAG is integrated into the Snowflake Cortex Agents API, providing Snowflake customers with a more intelligent, efficient and precise retrieval experience. This integration enhances enterprise search, knowledge management and automated analytics workflows. When a user provides a query that is vague and ambiguous in nature, instead of providing a single answer, Cortex Agents API follows up with related queries with VerDICT. This is anchored on:

Accurate retrieval: Clarifies queries dynamically for precise, context-aware responses through Snowflake Cortex Search with a proven record of search quality compared to competitors.

Optimized efficiency: Reduces computational overhead, speeding up analytics workflows.

Enterprise-grade applications: Supports customer support, compliance and R&D with domain-specific, verifiable insights of diverse nature.

This feature enhances various enterprise scenarios, such as:

Customer support automation: Clarifies vague queries such as "issue with my account" into actionable support topics.

Financial and legal compliance: Directs compliance officers to precise policy sections for regulations such as GDPR.

Internal knowledge management: Helps employees find specific HR and IT policies from vague search terms such as “work from home.”

Ecommerce analytics: Refines broad queries into segmented insights on sales, trends and customer behavior.

Healthcare and pharma research: Guides medical professionals to precise drug interactions and treatment protocols.

Figure 3 illustrates how our work helps users clarify ambiguous queries by generating relevant follow-up questions and providing grounded answers.

Query clarification in Snowflake’s Cortex Agents API setup with tool access to a series of synthetically generated insurance documents retrieved via Cortex Search services. The system refines vague queries by suggesting specific related questions, improving retrieval accuracy and user experience.
Figure 3: Query clarification in Snowflake’s Cortex Agents API setup with tool access to a series of synthetically generated insurance documents retrieved via Cortex Search services. The system refines vague queries by suggesting specific related questions, improving retrieval accuracy and user experience.
Open sourcing Arctic Agentic RAG
Beyond Snowflake’s own offerings, we have an open source Arctic Agentic RAG framework for researchers and practitioners. Unlike other agentic RAG frameworks that aim to be feature-complete, our Arctic Agentic RAG framework prioritizes lightweight, efficient components for fast development and research exploration. The key components include:

LLM backend: Supports cloud providers such as Snowflake Cortex Completion and Azure OpenAI, as well as local inference via vLLM.

Template format: Standardizes input-output structures for defining agentic factories.

Agentic factory: Defines agent functionality, including input parsing, retrieval and response generation.

These modular components allow researchers to easily build and customize functional agents while maintaining efficiency.

For this initial release, we provide all components related to VerDICT for handling ambiguous queries, including retrieval, answering and clustering modules for disambiguation. Additionally, we offer easy-to-use examples for building a simple RAG framework using Cortex Search and Completion functionalities, streamlining retrieval and large model deployment.

As we continue developing Arctic Agentic RAG, we plan to open source more features from our RAG innovations, helping the community reproduce research results and accelerate the adoption of advanced RAG techniques. By leveraging our framework, researchers and practitioners can rapidly prototype and iterate on novel ideas without the overhead of a full-fledged implementation. This fosters innovation and promotes shared advancements to benefit the broader AI community.

Explore Arctic Agentic RAG to learn more, and start experimenting with it today!



This concludes Episode 1 of our Arctic Agentic RAG series. Stay tuned for Episode 2, where we tackle the next major challenge: handling multimodal enterprise data — bringing images, tables, structured databases and text together in a seamless, intelligent retrieval process. Interested in this discussion? Visit the AI Research & Development Community forum on Snowflake.

Our great collaborators: We would like to extend our gratitude to our academic collaborators — Seung-won Hwang and Youngwon Lee from Seoul National University, and Feng Yan and Ruofan Wu from the University of Houston — for their valuable contributions.

Authors
Snowflake AI Research
Snowflake AI Research
Share Article
Just For You

Gen AI
Introducing Arctic Agentic RAG: Smarter, Faster and More Reliable AI for Enterprise
Yuxiong He
|
Zhewei Yao
FEB 18, 2025
|
5 min read

Gen AI
ArcticTraining: Simplifying and Accelerating Post-Training for LLMs
Snowflake AI Research
JAN 16, 2025
|
10 min read
Subscribe to our blog newsletter

Get the best, coolest and latest delivered to your inbox each week

josh.reini@snowflake.com
Subscribe Now
By submitting this form, I understand Snowflake will process my personal information in accordance with their Privacy Notice.

Benchmarking LLM-as-a-Judge for the RAG Triad Metrics

In 2023, as part of the TruLens open source project developed by the TruEra team, we introduced the RAG Triad.1, 2, 3 This triad comprises a set of three metrics — context relevance, groundedness and answer relevance — that measure how well each step of a retrieval-augmented generation (RAG) system is performing.

Each of these metrics is automatically computed using an LLM-as-a-Judge (a carefully prompted LLM), thus providing a scalable evaluation method for the common case in enterprises where ground truth data sets are often limited in scope. This kind of LLM-as-a-Judge can also be thought of as an agent that reviews and reasons about the quality of the retrieval and generation steps of a RAG.

The RAG Triad introduced a modular approach to specifying requirements for each step of a RAG and either verifying that these requirements are met or enabling debuggability by localizing errors. As this approach has gotten widespread adoption with RAGs increasingly moving into production in enterprises, we have consistently heard one question from our users: How can we evaluate an LLM-as-a-Judge and build confidence in its trustworthiness?

In this blog post, we report on our recent work that addresses this question. Specifically, we share the results of benchmarking the three LLM Judges on standard ground truth data sets — TREC-DL4 for context relevance, LLM-AggreFact for groundedness and HotpotQA4 for answer relevance — reporting precision, recal, F1 scores and Cohen's Kappa5. Our benchmarking results indicate that our LLM Judges are comparable to or exceed the existing state of the art for groundedness and MLflow for the other two metrics in the RAG Triad.  The Cohen Kappa results range from high moderate to substantial agreement with human annotators for all our results, providing additional support for the trustworthiness of the LLM Judges.

The LLM Judge for groundedness has an F1 score of 81%, a precision of 78%, a recall of 85% and a Cohen’s Kappa of 0.54 on the LLM-AggreFact data set. This placed it above the SOTA fine-tuned, proprietary Bespoke-MiniCheck-7B model on precision, recall and F1 score, as well as above the related LLM Judge from MLflow with respect to precision, F1 score and Cohen’s Kappa.

The LLM Judge for context relevance has an F1 score of 64%, a precision of 51% and a recall of 87% on the TREC-DL data set. This placed it above the LLM Judge with the UMBRELA prompt and the corresponding MLflow Judge on the F1 and recall metrics.

The LLM Judge for answer relevance has an F1 score of 79%, a precision of 99% and a recall of 66% on the HotpotQA data set. This makes it comparable to the MLflow Judge for the related metric.

We compare our results to SOTA baselines from the literature, such as the Bespoke-MiniCheck-7B model for groundedness and the UMBRELA prompt-based LLM Judge for context relevance, in addition to open source library MLflow, which provides some metrics related to the RAG Triad. We used GPT-4o as the default LLM in all our experiments. Note also that while we worked with an internal benchmark on which we could evaluate all three Judges of the RAG Triad, it would be useful to have public benchmarks for RAGs that enable this form of evaluation.

As a key part of this work, we introduced a new method for eval-guided optimization and applied it to improve the prompts for the LLM Judges. We will discuss this method and its application in a follow-up blog post.

We are also open sourcing the updated prompts in TruLens (see here). We encourage you to try them out as you build and evaluate RAGs. Here’s a notebook to get you started!

The RAG Triad
Before we consider a system of evaluations, it's useful to first discuss the architecture of the RAG system itself. In a simple RAG, there are three primary artifacts we can use to evaluate quality: query, retrieved context and generated output. Common failure modes of RAGs, including poor retrieval quality, hallucination and irrelevant answers, can all be traced back to the interactions between those three artifacts.

We proposed the RAG Triad of metrics — context relevance, groundedness and answer relevance — as a system of reference-free evaluations to identify and root-cause these common failure modes of RAG systems.

diagram of the RAG Triad
Figure 1. The RAG Triad of metrics — context relevance, groundedness and answer relevance — measure how well each step of a retrieval-augmented generation (RAG) system is performing
Let us examine each of the LLM Judges for the RAG Triad in detail below. Prompting for these LLM Judges is composed of a few key parts: the system prompt, judging criteria, few-shot examples, output scale and a user prompt containing the text to be evaluated. TruLens provides an easy way to configure the judging criteria, few-shot examples and output scale. In this blog post, we will focus on the default prompts and capabilities for each metric, including the default output scale of 0-3. The LLM Judge also makes use of an LLM. We used GPT-4o as the default LLM in all our experiments.

Context relevance
Context relevance addresses the retrieval step by evaluating the relevance of the context to the query. While this failure mode can lead to inaccurate or irrelevant application output, low-context relevance scores allow us to root-cause this output as poor retrieval performance and give the AI engineer a starting point to improve the application performance by focusing on the retriever.

As mentioned above, LLM Judge prompting is composed of the system prompt, criteria, few-shot examples and a user prompt containing the text to be evaluated. Context relevance utilizes each of these components and is evaluated on each piece of retrieved context independently. The prompt below is an optimized version of an LLM Judge for context relevance created in collaboration with the Snowflake Cortex Search team. We share each component of the prompt below, deferring the specific few-shot examples to the appendix.

Context relevance system prompt
You are an EXPERT SEARCH RESULT RATER. You are given a USER QUERY and a SEARCH RESULT.

The context relevance system prompt provides an overall description of the evaluation task and sets the role of the LLM. It also emphasizes the exact terminology to be used later in the user prompt (“question” and “context”).

Context relevance criteria
Your task is to rate the search result based on its relevance to the user query. You should rate the search result on a scale of 0 to 3, where:

    0: The search result has no relevance to the user query.

    1: The search result has low relevance to the user query. It may contain some information that is very slightly related to the user query but not enough to answer it. The search result contains some references or very limited information about some entities present in the user query. In case the query is a statement on a topic, the search result should be tangentially related to it.

    2: The search result has medium relevance to the user query. If the user query is a question, the search result may contain some information that is relevant to the user query but not enough to answer it. If the user query is a search phrase/sentence, either the search result is centered around most but not all entities present in the user query, or if all the entities are present in the result, the search result while not being centered around it has medium level of relevance. In case the query is a statement on a topic, the search result should be related to the topic.

    3: The search result has high relevance to the user query. If the user query is a question, the search result contains information that can answer the user query. Otherwise, if the search query is a search phrase/sentence, it provides relevant information about all entities that are present in the user query and the search result is centered around the entities mentioned in the query. In case the query is a statement on a topic, the search result should be either directly addressing it or be on the same topic.

    You should think step by step about the user query and the search result and rate the search result. Be critical and strict with your ratings to ensure accuracy.

    Think step by step about the user query and the search result and rate the search result. Provide a reasoning for your rating.

For context relevance, we provide qualitative descriptions for each possible score. Additionally, we provide additional guidance for the LLM to be “critical and strict” with ratings to avoid inflated relevance scores.

Context relevance user prompt
Now given the USER QUERY and SEARCH RESULT below, rate the search result based on its relevance to the user query and provide a reasoning for your rating.


USER QUERY: <user query>

SEARCH RESULT: <search result>

RELEVANCE:

The context relevance user prompt is where the text to be evaluated is inserted and ties in consistent terminology with the system prompt — maintaining clear instructions for the LLM’s evaluation task.

Groundedness
Groundedness looks at the generation step, considering the adherence of the LLM response to the retrieved context. Importantly, groundedness considers the adherence of each claim made in the LLM’s response to the entire set of retrieved contexts. This allows us to identify unfaithful segments of the response and, in turn, allows the engineer to iterate on the generation step as the root cause. This could mean testing different models, engineering prompts, testing model parameters and more.

Preprocessing
Prior to the groundedness LLM Judge evaluation, the feedback function performs two preprocessing steps. The first step is claim decomposition. Here we break down the full LLM response into its composite claims or facts so that each can be evaluated independently. This step is performed either with a sentence tokenizer or with an LLM.

Second, any trivial claims are removed from the set prior to the evaluation. These trivial claims are claims that do not contain fact and therefore should not be considered in the evaluation. This step is performed with an LLM.

After the LLM response is processed into individual, nontrivial claims, we proceed with the evaluation.

The groundedness evaluation prompt includes a system prompt, criteria and a user prompt but does not make use of few-shot examples.

Groundedness system prompt
You are an INFORMATION OVERLAP classifier; providing the overlap of information (entailment or groundedness) between the source and statement.

The groundedness system prompt provides an overall description of the evaluation task and sets the role of the LLM to evaluate the information overlap between the source and statement. It also uses terminology (“source” and “statement”) consistent with the user prompt.

Groundedness criteria
    - Statements that are directly supported by the source should be considered grounded and should get a high score.

    - Statements that are not directly supported by the source should be considered not grounded and should get a low score.

    - Statements of doubt, that admissions of uncertainty or not knowing the answer are considered abstention, and should be counted as the most overlap and therefore get a max score of 3.

    - Consider indirect or implicit evidence, or the context of the statement, to avoid penalizing potentially factual claims due to lack of explicit support.

    - Be cautious of false positives; ensure that high scores are only given when there is clear supporting evidence.

    - Pay special attention to ensure that indirect evidence is not mistaken for direct support.

The groundedness criteria accomplish two primary objectives. The first three bullets provide a qualitative description for low, high and maximum scores. The remaining three bullets in the prompt caution the LLM against common evaluation failure modes including both false positives and false negatives.

Groundedness user prompt
Source: {all retrieved context}

Statement: {individual claim from LLM response}

Please meticulously answer with the template below for ALL statement sentences:

Criteria: <individual claim from LLM response>

Supporting Evidence: <Identify and describe the location in the source where the information matches the statement. Provide a detailed, human-readable summary indicating the path or key details. if nothing matches, say NOTHING FOUND. For the case where the statement is an abstention, say ABSTENTION>

Score: <Output a number based on the scoring output space / range>

First, the user prompt provides the source (retrieved) context along with an individual claim, using terminology (“source” and “statement”) consistent with the system prompt. Second, we provide chain-of-thought instructions to the LLM to first state the evaluation criteria and reasoning before the score. This is useful for ensuring both the quality and the interpretability of the groundedness evaluation. While chain-of-thought reasoning is an option for other TruLens evaluators, it is mandatory for groundedness due to the difficulty of the task.

Answer relevance
Answer relevance takes a second look at the generation step but focuses on the relevance of the LLM response to the user query. This allows us to identify cases where the response may be accurate to the limit of the retrieved context but still fails to answer the user question.

Answer relevance system prompt
You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.

Similar to context relevance, the answer relevance system prompt provides an overall description of the evaluation task and sets the role of the LLM. It also places emphasis on the exact terminology to be used later in the user prompt.

Answer relevance criteria
- RESPONSE must be relevant to the entire PROMPT to get a maximum score of 3.

- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.

- RESPONSE that is RELEVANT to none of the PROMPT should get a minimum score of 0.

- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 3.

- RESPONSE that is confidently FALSE should get a score of 0.

- RESPONSE that is only seemingly RELEVANT should get a score of 0.

- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the least RELEVANT and get a score of 0.

The answer relevance criteria provide qualitative descriptions of the minimum and maximum scores, along with an instruction for the direction of middle scores: “RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.” Last, the criteria address key evaluation failure modes, including confident falsehoods and seeming relevance, and provide direct instructions for scoring abstention.

Answer relevance user prompt
QUESTION: <question>

CONTEXT: <context>

RELEVANCE:

The novelty of the RAG Triad comes with its direct tie to the RAG architecture, with evaluations targeting each step of the RAG. Additionally, the RAG Triad operates reference-free, without requiring ground truth, which in most cases is expensive or impossible to collect in generative AI systems. By decomposing a composite RAG system into its components — query, context and response — our evaluation framework can triage the failure points, provide a clearer understanding of where improvements are needed in the RAG system and guide targeted optimization.

Benchmarking results
In this section, we report the results of benchmarking the three LLM Judges discussed above on ground truth data sets: TREC-DL for context relevance, LLM-AggreFact for groundedness and HotpotQA for answer relevance. We compare our results to SOTA baselines from the literature, such as the Bespoke-MiniCheck-7B model for groundedness and the UMBRELA prompt-based LLM Judge for context relevance, in addition to open source libraries such as MLflow, which provide some metrics related to the RAG Triad. Our benchmarking results indicate that our LLM Judges are comparable to or exceed the existing state of the art for groundedness and MLflow for the other two metrics in the RAG Triad.

Groundedness
We use LLM-AggreFact as the benchmark data set to evaluate the TruLens LLM Judge for groundedness using GPT-4o and compare it against the SOTA Bespoke-MiniCheck-7B model.

LLM-AggreFact is chosen, given that it is the largest annotated collection of 11 claim verification data sets to our knowledge. All labels are unified to either 0 (not grounded) or 1, thus making metrics used in binary classification easily interpretable in the benchmarking of our groundedness LLM Judge.

On the LLM-AggreFact holdout set of 11,000 examples, the TruLens groundedness evaluator beats the SOTA Bespoke-MiniCheck-7B model on all three metrics. Doing so shows that optimized prompting can match and surpass the performance of task-specific fine-tuned models, such as Bespoke-MiniCheck-7B, on a large-scale benchmark.

Evaluator

Precision

Recall

F1 Score

Cohen’s Kappa

Bespoke-MiniCheck-7B

0.7610

0.8038

0.7771

0.5525

MLflow faithfulness

0.6693

0.8902

0.7545

0.4155

TruLens groundedness



0.7830

0.8515

0.8082

0.5358

Context relevance
 Context relevance is closely related to the task of relevance prediction in information retrieval. For the benchmark data set, we used a sample of TREC-DL passage retrieval data sets with human annotations from the years 2021 and 2022 with a fair distribution of labels from each relevance score {0, 1, 2, 3}.

The original relevance scores are then unified to binary labels {0, 1}, where {2,3} are converted to 1 (relevant) and {0, 1} are converted to 0 (nonrelevant), following the instructions from the original TREC passage retrieval challenge.

The TruLens LLM Judge using GPT-4o context relevance scores makes a positive trade-off of recall at the expense of a smaller drop in precision compared to the UMBRELA evaluator. This results in the TruLens context relevance evaluator landing a higher F1 score than UMBRELA on the TREC-DL data set, while keeping a comparable off-by-one accuracy.

We include MLflow relevance metrics for comparison’s sake, but note that we are not reporting off-by-one accuracy for MLflow as the output scores are all between 0.0 and 1.0 instead of being directly prompted to generate relevance scores {0, 1, 2, 3} like UMBRELA’s and TruLens’s adapted context relevance feedback function. MLflow relevance also takes into account the generated responses from the LLM, whereas the other two require only query and context, and we substitute the retrieved responses given the query for the missing LLM-generated responses on our TREC-DL data set.

Evaluator

Precision

Recall

F1 Score

Off-by-1 Accuracy

Cohen’s Kappa

UMBRELA

0.6000

0.6449

0.6216

0.8945

0.4529

MLflow relevance

0.5973

0.6885

0.6396

N/A

0.4873

TruLens context relevance

0.5129

0.8660
0.6443

0.8902

0.4769

Answer relevance
Among all three metrics of the RAG Triad, answer relevance is the most straightforward task. Most high-quality, established benchmarks of intelligent question-answering (QA) systems can be used for benchmarking the answer relevance feedback function, with the expected golden answers in the QA benchmark being treated as relevant (ground truth label 1).

We report our answer relevance meta-evaluation results on HotpotQA samples. The benchmark examples are sampled with both classes balanced, where ground truth answers are assumed to be relevant and we shuffle answers to queries to create negative examples (ground truth label 0).

In our answer relevance benchmarks, the TruLens answer relevance LLM Judge using GPT-4o achieves near-perfect precision and identical recall to the MLflow implementation. This is partly due to the simpler nature of the HotpotQA data set, where the ground truth answers to questions are likely to be considered perfectly relevant by the LLM Judges and negative examples are mined by pairing questions with answers from another disjoint set of questions.

Evaluator

Precision

Recall

F1 Score

Cohen’s Kappa

MLflow answer relevance

1.0000

0.6650

0.7988

0.665

TruLens answer relevance

0.9924

0.6550

0.7892

0.605

It is important to note that TruLens’s definition of answer relevance does not encompass answer correctness. That is, while a correct answer, whenever available, should always be judged as relevant, an incorrect answer can still be scored as perfectly relevant.

Conclusions
In this blog post, we shared the results of benchmarking the LLM Judges of the RAG Triad — a modular approach to specifying requirements for each step of a RAG and either verifying that these requirements are met or enabling debuggability by localizing errors. Specifically, we shared the results of benchmarking the three LLM Judges on standard ground truth data sets: TREC-DL for context relevance, LLM-AggreFact for groundedness and HotpotQA for answer relevance. Our benchmarking results indicate that our LLM Judges are comparable to or exceed the existing state of the art for groundedness (the Bespoke-MiniCheck-7B model) and popular open source libraries (MLflow) for the other two metrics in the RAG Triad. The Cohen Kappa results range from high moderate to substantial agreement with human annotators for all our results, providing additional support for the trustworthiness of the LLM Judges.

While we introduced the concept of the RAG Triad back in July 2023,1 as part of this work, we have improved the prompts for the LLM Judges significantly by leveraging a new method for eval-guided optimization. Stay tuned for a follow-up blog post on that next week.

We are also open sourcing the updated prompts in TruLens (see here). We encourage you to try them out as you build and evaluate RAGs. Here’s a notebook to get you started!

1 Shayak Sen, LLMs: Consider Hallucinatory Unless Proven Otherwise, AI Transformation Summit – Pinecone, July 2023.

2 Anupam Datta, Jerry Liu with Andrew Ng, Building and Evaluating Advanced RAG.

3 TruLens docs, The RAG Triad.

4 https://creativecommons.org/licenses/by/4.0/legalcode

5 Cohen's Kappa measures inter-rater reliability between humans and LLM judges tasked in the same experiment. Cohen’s Kappa, ranging from -1 to 1, takes into account agreement by chance between human and LLM judges.

Appendix: Context relevance few-shot examples
Use the following format:

    Rating: Example Rating

    Reasoning: Example Reasoning



    ### Examples

    Example:

    Example 1:

    INPUT:

    User Query: What is the definition of an accordion?

    Search Result: Accordion definition, Also called piano accordion. a portable wind instrument having a large bellows for forcing air through small metal reeds, a keyboard for the right hand, and buttons for sounding single bass notes or chords for the left hand. a similar instrument having single-note buttons instead of a keyboard.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a question. The search result directly answers the user question for the definition of an accordion, hence it has high relevance to the user query.



    Example 2:

    INPUT:

    User Query: dark horse

    Search Result: Darkhorse is a person who everyone expects to be last in a race. Think of it this way. The person who looks like he can never get laid defies the odds and gets any girl he can by being sly,shy and cunning. Although he\'s not a player, he can really charm the ladies.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a search phrase mentioning \'dark horse\'. The search result contains information about the term \'dark horse\' and provides a definition for it and is centered around it. Hence it has high relevance to the user query.



    Example 3:

    INPUT:

    User Query: Global warming and polar bears

    Search Result: Polar bear The polar bear is a carnivorous bear whose native range lies largely within the Arctic Circle, encompassing the Arctic Ocean, its surrounding seas and surrounding land masses. It is a large bear, approximately the same size as the omnivorous Kodiak bear (Ursus arctos middendorffi).

    OUTPUT:

    Rating: 2

    Reasoning: In this case the search query is a search phrase mentioning two entities \'Global warming\' and \'polar bears\'. The search result contains is centered around the polar bear which is one of the two entities in the search query. Therefore it addresses most of the entities present and hence has medium relevance.



    Example 4:

    INPUT:

    User Query: Snowflake synapse private link

    Search Result: "This site can\'t be reached" error when connecting to Snowflake via Private Connectivity\nThis KB article addresses an issue that prevents connections to Snowflake failing with: "This site can\'t be reached" ISSUE: Attempting to reach Snowflake via Private Connectivity fails with the "This site can\'t be reached" error

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search result is a search query mentioning \'Snowflake synapse private link\'. However the search result doesn\'t contain information about it. However it shows an error message for a generic private link which is tangentially related to the query, since snowflake synapse private link is a type of private link. Hence it has low relevance to the user query.



    Example 5:

    INPUT:

    User Query: The Punisher is American.

    Search Result: The Rev(Samuel Smith) is a fictional character, a supervillain appearing in American comic books published by Marvel Comics. Created by Mike Baron and Klaus Janson, the character made his first appearance in The Punisher Vol. 2, #4 (November 1987). He is an enemy of the Punisher.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query is a statement concerning the Punisher. However the search result is about a character called Rev, who is an enemy of the Punisher. The search result is tangentially related to the user query but does not address topic about Punisher being an American. Hence it has low relevance to the user query.



    Example 6:

    INPUT:

    User Query: query_history

    Search Result: The function task_history() is not enough for the purposes when the required result set is more than 10k.If we perform UNION between information_schema and account_usage , then we will get more than 10k records along with recent records as from information_schema.query_history to snowflake.account_usage.query_history is 45 mins behind.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query mentioning one entity \'query_history\'. The search result is neither centered around it and neither has medium relevance, it only contains an unimportant reference to it. Hence it has low relevance to the user query.



    Example 7:

    INPUT:

    User Query: Who directed pulp fiction?

    Search Result: Life on Earth first appeared as early as 4.28 billion years ago, soon after ocean formation 4.41 billion years ago, and not long after the formation of the Earth 4.54 billion years ago.

    OUTPUT:

    Rating: 0

    Reasoning: In the case the search query is a question. However the search result is completely unrelated to it. Hence the search result is completely irrelevant to the movie pulp fiction.

    ###

Evaluating and Optimizing Search Quality: Seamlessly Tune Your Snowflake Cortex Search Service with an LLM Judge

Recently, we showed that Snowflake’s Cortex Search has unmatched out-of-the-box retrieval quality, but what do you do if your out-of-the-box experience isn’t good enough? You tune your retrieval and ranking algorithms, of course. But how do you know what to tune?  Large tech companies have entire teams dedicated to tuning specific features for numerous query types so that the perfect results are displayed to the user. For example, a navigational, single-word query will probably use more lexical features than a long, complete sentence query, which is more likely to use semantic features. This process of manual tuning is painstaking and costly; most businesses don’t have the resources to devote to this endeavor.

What if there was a way to serve high-quality results to the user without spending days and resources tuning each parameter? What if you could rely on an autonomous pipeline that could do this for you? In this investigation, we find that, with the use of LLMs, you can build high-quality evaluation data sets for iteration without spending precious human hours to improve your search system — all with Snowflake.

Search quality 101: Beyond the search bar
We all use search engines every day, typing in queries and expecting relevant results. But have you ever stopped to think about how "relevance" is measured and how search engines are constantly improved? This is where the concept of search quality comes in. Essentially, search quality boils down to this: How well does a search system retrieve the most relevant documents from a given collection for a specific query? While striving for a perfect score is a natural goal, the absolute number is less important than the relative improvement between different search systems or modifications to a single system. We're looking for progress, not necessarily perfection.

Measuring search quality: The power of goldens
So how do we quantify "good" search results? The key lies in query relevance labels, often called "goldens." Think of these as the gold standard (pun intended!) for evaluating search performance. Goldens provide a structured way to express how relevant a document is to a specific query. They assign a score that represents this relevance.

These scores can be binary (0 = irrelevant, 1 = relevant), offering a simple yes/no assessment. However, they are more often scaled, providing a more nuanced evaluation. A typical scaled system might look like this:

0 = Irrelevant

1 = Fair

2 = Good

3 = Perfect

Having these goldens is crucial. They allow us to evaluate different retrieval systems or tweaks to existing ones. By comparing the results against the goldens, we can get numerical metrics that reflect the relative retrieval quality of each system. These metrics provide a quantifiable way to track progress and make informed decisions about system tuning. An example of this metric is NDCG, or Normalized Discounted Cumulative Gain. NDCG measures how well search results are ranked by rewarding relevant results that appear higher on the list while giving less credit to those ranked lower. The result's relevance is defined by the golden set the system is being evaluated on. The process is iterative: Goldens go in, and a better search system comes out. Goldens can be formatted like the following:

Query	Document	Relevancy
What are apples?	Apples are sweet, crisp fruits that grow on trees and come in various colors like red, green and yellow.	3
What are apples?	Discover the best hiking trails in Yosemite.	0
What are apples?	Many people enjoy eating apples as a snack.	1
What are apples?	The apple tree produces round fruits that can be eaten raw or used in desserts, like pies.	2
Table 1. An example of a golden set and how relevancy labels can be assigned. A 3 represents something perfectly relevant; 2 is good relevance; 1 is partially relevant; and 0 is irrelevant.

The million-dollar question: Where do goldens come from?
If goldens are so essential, you might wonder where they come from. Is there some magical Golden Search Goose? Do they magically appear? Unfortunately, no. Creating high-quality goldens is a significant undertaking. Goldens are where businesses encode their core logic into a search system, a subjective representation of what is relevant for their use case. Goldens typically involve trained human assessors evaluating the relevance of documents to a wide range of queries, based on a well-defined and explicit labeling guidance, such as Google’s (181 pages long!). Maintaining a good set of goldens is an ongoing effort, as document collections and user search behavior evolve. Goldens are not about randomly tuning knobs until your performance on a public benchmark goes up. Instead, goldens are about creating representative samples of what you care about so that your search system is optimized to behave the way your business needs.

The rise of LLMs: A new era for golden creation?
Given the effort involved in creating goldens, researchers are exploring ways to automate or augment the process. This is where large language models (LLMs) enter the picture. LLMs, with their ability to understand and generate natural language text, hold the potential to revolutionize how we create and use goldens. Could LLMs act as judges, automatically assessing the relevance of documents to queries? This is a hot area of research with exciting possibilities. Imagine a future where LLMs assist in generating goldens, making the process faster, cheaper and more scalable. This could lead to even more rapid improvements in search quality, benefiting users worldwide.

To infinity and beyond: How Snowflake Cortex Search improves your retrieval quality automatically
In the past few paragraphs, we have been talking at a fairly high level, but what does that mean for our Cortex Search users? Luckily Cortex Search retrieval performance is easily tunable if you have a useful signal. Cortex Search uses a variety of search features, such as topicality, semantic similarity, popularity and more, to determine whether a result is relevant to a query. These features interact differently with one another depending on what corpus has been indexed. With our Streamlit app eval tool, customers can evaluate how well their search service does on its indexed contents. Additionally, customers can autotune their search service, and the tool provides different sets of parameters that are more optimal than the default.

Figure 1. Snowflake’s Streamlit app eval tool for Cortex Search.
Although the eval tool utilizes customer-provided query doc pairs to evaluate the search service, the autotune pipeline uses an LLM to suggest hyperparameters for features used to rank search results. Using an LLM judge provides flexibility with customers: If goldens are not provided, customers can still tune their system. However, to help ensure that the system autotunes correctly, the LLM judge needs to be at a similar judgment level as a human. With the following experiments, we show that our LLM judge is reliable in tuning a search system in the correct direction.

The results
Let’s examine the results of our experiments to prove this claim. First, we show that our LLM-based relevance assessor (commonly referred to as LLM as Judge) produces high-quality labels. Next, we show that with their high-quality labels LLM-based judges are reliable enough to autotune.

Our first experiment compares our in-house LLM judge labels with labels from NIST’s TREC DL21, TREC DL22 and TREC DL23 data sets. TREC (Text REtrieval Conference) data sets are benchmarks for passage and document ranking tasks, designed to evaluate deep learning-based retrieval models using human-annotated relevance labels. We also compared the labels generated by UMBRELA, which is an open source toolkit that uses GPT-4o to perform document-relevance assessments, replicating human judgment for information retrieval evaluation. These labels were generated using the SNOWFLAKE.CORTEX.COMPLETE function with the llama3.1-405b model. Each label is on a scale of 0 to 3:

0 = the doc is not relevant to the query at all

1 = the doc is slightly relevant to the query

2 = the doc is somewhat relevant to the query

3 = the doc is very relevant to the query

For this experiment, we measured an off-by-1 metric. Off-by-1 measures the percentage of predictions within ±1 of the ground truth. When prompting an LLM to generate a label for a query doc pair, the reasoning of the LLM could cause a query doc pair to be rated a 1 when the true label is a 0. When humans assign labels, being off by 1 is expected given different reasoning and interpretation of the query doc pair. In other words, off-by-1 lets us know whether there is a disagreement in magnitude or due to a complete concept misunderstanding.

Data set	Judge	off-by-1
TREC DL21	In-house	0.915
UMBRELA	0.883
TREC DL22	In-house	0.660
UMBRELA	0.678
TREC DL23	In-house	0.888
UMBRELA	0.875
Average	In-House	0.821
UMBRELLA	0.812
Table 2. Exploration on performance of LLM as Judge as compared to professional human rates from NIST’s TREC.

From these numbers, we do see that there is a general agreement in query doc pair relevancies; that is, the LLM labels in both methods tend to agree with the ground-truth human labels.

From analyzing human and synthetic labels for query doc pairs, we saw that the majority either agreed or were one off. For TREC DL22 specifically, we saw that the synthetic labels were more accurate.

Data Set	Query	Document	Human Rating	In-house Synthetic Rating
TREC DL 21	What classes do I need to take to go to vet school?	…
Therefore, begin developing your competitive advantage during high school. Get good grades, complete the prerequisites, and gain experience working with animals. Math and science are important for getting into veterinary school, so join high school and college math and science clubs, as well as taking advanced classes in these areas.
…	3	3
TREC DL 21	What is a Kanima?	…
So, what is Kanikama (Surimi)? Kanikama is the imitation or fake crab meat produced from surimi paste that is made by grinding various species of white fish (mostly Alaska Pollock).
…	1	0
TREC DL 22	How many people watch NBA basketball?	…
Fans are looking forward to the return of HBO’s Succession for Season 3 after the series’ critically acclaimed sophomore season ended in October 2019.
…	3	0
TREC DL 23	How to cook thin sliced home fries?	…
For a healthy alternative to potato chips, you can dehydrate sweet potatoes into crunchy snacks using your oven or a dehydrator.
…	2	1
Table 3. Examples of human and synthetic ratings (generated with our in-house judge) for query-doc pairs. While model disagreement in absolute terms can differ when compared side-by-side, the model is within 1 point of the human judgment more than 82% of the time.

LLM as Judge works! How can I use it for Snowflake Cortex Search?
The second experiment dives into exploring the directionality of the LLM judge. We want to show that if a Cortex Search service is autotuned on synthetic labels, the search quality of that system will improve. In this experiment, we generate synthetic labels for a query set that already has human labels. First, we evaluate the performance of a Cortex Search service on the human labels to get a baseline metric. Then, we autotune the Cortex Search service on the synthetic labels. Autotuning lets us know what coefficients of the search features need to be tuned up or down. Finally, we set the new coefficients for our Cortex Search service that the autotune job suggested, and reevaluate the performance of this Cortex Search service on the human labels. This experiment also used the TREC data sets and generated labels using Snowflake’s COMPLETE API with Llama 3.1 405b as the LLM.

Figure 2. NDCG@10 of a search service with two different parameter coefficients.
As shown in Figure 2, we see a 3% gain in NDCG@10. What this shows is that if you had human labels for a query set that was autotuned using synthetic labels, you would see an improvement in search quality with those human labels. This means that our synthetically generated labels improve search quality in the same, positive direction as human labels.

Cortex Search autotuning: All the gains, none of the pain
In this post we have gone deep into describing why you should try out autotuning in Cortex and why it works, but what does this mean for you? Simple, the search quality of your Cortex Search service can be improved with just a click of a button. Not only does your search performance improve, you can do so without increasing the cost of your deployment. No need to finetune a model, make sure the finetuning parameters are right, wait for your search service to re-embed your entire corpus, or worry about how to serve it at scale. Your deployment’s relevance gains are all done without any new model, and this approach will continue to work even if you change your underlying embedding model. Regardless of size and resources available to our customers, they can iterate and build the best search service for their use case. This functionality is available on our Cortex Search eval tool offering, allowing customers to evaluate and tune their search services.

Appendix
Below is the in-house prompt we used for evaluation.

You are an expert search result rater. You are given a user query and a search result. Your task is to rate the search result based on its relevance to the user query. You should rate the search result on a scale of 0 to 3, where:

    0: The search result has no relevance to the user query.

    1: The search result has low relevance to the user query. In this case the search result may contain some information which seems very slightly related to the user query but not enough information to answer the user query. The search result contains some references or very limited information about some entities present in the user query. In case the query is a statement on a topic, the search result should be tangentially related to it.

    2: The search result has medium relevance to the user query. If the user query is a question, the search result may contain some information that is relevant to the user query but not enough information to answer the user query. If the user query is a search phrase/sentence, either the search result is centered around about most but not all entities present in the user query, or if all the entities are present in the result, the search result while not being centered around it has medium level of relevance. In case the query is a statement on a topic, the search result should be related to the topic.

    3: The search result has high relevance to the user query. If the user query is a question, the search result contains information that can answer the user query. Otherwise if the search query is a search phrase/sentence, it provides relevant information about all entities that are present in the user query and the search result is centered around the entities mentioned in the query. In case the query is a statement on a topic, the search result should be either be directly addressing it or be on the same topic.



    You should think step by step about the user query and the search result and rate the search result. You should also provide a reasoning for your rating.



    Use the following format:

    Rating: Example Rating

    Reasoning: Example Reasoning



    ### Examples

    Example:

    Example 1:

    INPUT:

    User Query: What is the definition of an accordion?

    Search Result: Accordion definition, Also called piano accordion. a portable wind instrument having a large bellows for forcing air through small metal reeds, a keyboard for the right hand, and buttons for sounding single bass notes or chords for the left hand. a similar instrument having single-note buttons instead of a keyboard.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a question. The search result directly answers the user question for the definition of an accordion, hence it has high relevance to the user query.



    Example 2:

    INPUT:

    User Query: dark horse

    Search Result: Darkhorse is a person who everyone expects to be last in a race. Think of it this way. The person who looks like he can never get laid defies the odds and gets any girl he can by being sly,shy and cunning. Although he\'s not a player, he can really charm the ladies.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a search phrase mentioning \'dark horse\'. The search result contains information about the term \'dark horse\' and provides a definition for it and is centered around it. Hence it has high relevance to the user query.



    Example 3:

    INPUT:

    User Query: Global warming and polar bears

    Search Result: Polar bear The polar bear is a carnivorous bear whose native range lies largely within the Arctic Circle, encompassing the Arctic Ocean, its surrounding seas and surrounding land masses. It is a large bear, approximately the same size as the omnivorous Kodiak bear (Ursus arctos middendorffi).

    OUTPUT:

    Rating: 2

    Reasoning: In this case the search query is a search phrase mentioning two entities \'Global warming\' and \'polar bears\'. The search result contains is centered around the polar bear which is one of the two entities in the search query. Therefore it addresses most of the entities present and hence has medium relevance.



    Example 4:

    INPUT:

    User Query: Snowflake synapse private link

    Search Result: "This site can\'t be reached" error when connecting to Snowflake via Private Connectivity\nThis KB article addresses an issue that prevents connections to Snowflake failing with: "This site can\'t be reached" ISSUE: Attempting to reach Snowflake via Private Connectivity fails with the "This site can\'t be reached" error

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search result is a search query mentioning \'Snowflake synapse private link\'. However the search result doesn\'t contain information about it. However it shows an error message for a generic private link which is tangentially related to the query, since snowflake synapse private link is a type of private link. Hence it has low relevance to the user query.



    Example 5:

    INPUT:

    User Query: The Punisher is American.

    Search Result: The Rev(Samuel Smith) is a fictional character, a supervillain appearing in American comic books published by Marvel Comics. Created by Mike Baron and Klaus Janson, the character made his first appearance in The Punisher Vol. 2, #4 (November 1987). He is an enemy of the Punisher.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query is a statement concerning the Punisher. However the search result is about a character called Rev, who is an enemy of the Punisher. The search result is tangentially related to the user query but does not address topic about Punisher being an American. Hence it has low relevance to the user query.



    Example 6:

    INPUT:

    User Query: query_history

    Search Result: The function task_history() is not enough for the purposes when the required result set is more than 10k.If we perform UNION between information_schema and account_usage , then we will get more than 10k records along with recent records as from information_schema.query_history to snowflake.account_usage.query_history is 45 mins behind.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query mentioning one entity \'query_history\'. The search result is neither centered around it and neither has medium relevance, it only contains an unimportant reference to it. Hence it has low relevance to the user query.



    Example 7:

    INPUT:

    User Query: Who directed pulp fiction?

    Search Result: Life on Earth first appeared as early as 4.28 billion years ago, soon after ocean formation 4.41 billion years ago, and not long after the formation of the Earth 4.54 billion years ago.

    OUTPUT:

    Rating: 0

    Reasoning: In the case the search query is a question. However the search result does is completely unrelated to it. Hence the search result is completely irrelevant to the movie pulp fiction.

    ###



    Now given the user query and search result below, rate the search result based on its relevance to the user query and provide a reasoning for your rating.

    INPUT:

    User Query: {query}

    Search Result: {passage}

    OUTPUT:\n


    Eval-Guided Optimization of LLM Judges for the RAG Triad

In 2023, as part of the TruLens open source project developed by the TruEra team, we introduced the RAG Triad.1, 2, 3 This triad comprises a set of three metrics — context relevance, groundedness and answer relevance — that measure how well each step of a retrieval-augmented generation (RAG) system is performing.

Each of these metrics is automatically computed using an LLM-as-a-Judge (a carefully prompted LLM), thus providing a scalable evaluation method for the common case in enterprises where ground truth data sets are often limited in scope. This kind of LLM-as-a-Judge can also be thought of as an agent that reviews and reasons about the quality of the retrieval and generation steps of a RAG.

The RAG Triad introduced a modular approach to specifying requirements for each step of a RAG and either verifying that these requirements are met or enabling debuggability by localizing errors. As this approach has gotten widespread adoption with RAGs increasingly moving into production in enterprises, we have consistently heard two related questions from our users:

How can we evaluate an LLM-as-a-Judge and build confidence in its trustworthiness?

How can we optimize an LLM-as-a-Judge to further improve it?

In a companion blog post, we address the first question. Specifically, we share the results of benchmarking the three LLM Judges of the RAG Triad on standard ground truth data sets — TREC-DL for context relevance, LLM-AggreFact for groundedness and HotpotQA for answer relevance — reporting precision, recall, F1 scores and Cohen’s Kappa4. Our benchmarking results indicate that our LLM Judges are comparable to or exceed the existing state of the art for groundedness and MLflow for the other two metrics in the RAG Triad.

In this blog post, we focus on the second question: How can we optimize an LLM-as-a-Judge to further improve it? We address this question by developing a new method for Eval-Guided Optimization that leverages the results of the benchmarking to guide an end-to-end agentic prompt optimizer with an appropriate choice of a loss function and a data slice. We implemented this method using TextGrad as the prompt optimizer and observed significant improvements.

For the LLM Judge for groundedness, precision increased by roughly 16% with a 2.5% drop in recall, leading to an F1 score increase of  8%  on the LLMAggreFact data set. This placed it above the SOTA fine-tuned, proprietary Bespoke-MiniCheck-7B model on precision, recall and F1 score, as well as above the related LLM Judge from MLflow with respect to precision and F1 score.

For the LLM Judge for context relevance, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of 2.4%  on the TREC-DL data set. This placed it above the LLM Judge with the UMBRELA prompt and the corresponding MLflow Judge on the F1 and recall metrics.

For the LLM Judge for answer relevance, recall increased by 5% with a 0.76% drop in precision, leading to an F1 score increase of 3.5% on the HotpotQA data set. This makes it comparable to the MLflow Judge for the related metric.

We have released the updated prompts in TruLens (see here). We encourage you to try them out as you build and evaluate RAGs. Here’s a notebook to get you started!

Leveraging evals to guide the TextGrad prompt optimizer with an appropriate choice of a loss function and a data slice was essential to see these improvements. Without this guidance, TextGrad failed to improve the LLM Judges.

The rest of the blog post is organized as follows. We begin with a quick overview of the RAG Triad, with more details in a companion blog post. Then we describe our Eval-Guided Optimization method, point out how it addresses observed challenges with TextGrad, and illustrate the method using an effective optimization of an LLM Judge for groundedness. Finally, we summarize our experimental results, including comparison with the LLM Judges from related projects.

The RAG Triad
In a simple RAG, there are three primary artifacts we can use to evaluate quality: query, retrieved context and generated output. Common failure modes of RAGs, including poor retrieval quality, hallucination and irrelevant answers, can all be traced back to the interactions between those three artifacts. We proposed the RAG Triad of metrics — context relevance, groundedness and answer relevance — as a system of reference-free evaluations to identify and root-cause these common failure modes of RAG systems.

The RAG Triad of metrics — context relevance, groundedness and answer relevance — measure how well each step of a retrieval-augmented generation (RAG) system is performing.
Figure 1. The RAG Triad of metrics — context relevance, groundedness and answer relevance — measure how well each step of a retrieval-augmented generation (RAG) system is performing.
The LLM Judges for the RAG Triad are described in detail in a previous blog post. Prompting for these LLM Judges is composed of a few key parts: the system prompt, judging criteria, few-shot examples, output scale and a user prompt containing the text to be evaluated. TruLens provides an easy way to configure the judging criteria, few-shot examples and output scale. The LLM Judge also makes use of an LLM. We used GPT-4o as the default LLM in all our experiments. In this blog post, we focus on the method for optimizing the prompts for these judges, which we describe next.

Eval-Guided Optimization
End-to-end prompt optimization frameworks, such as DSPy and TextGrad, became a natural starting point for our work. However, we observed that out-of-the-box optimizers from DSPy and TextGrad failed to improve the LLM Judges. For the DSPy optimizers, it appeared that augmenting prompts with few-shot demonstrations did not generalize well to new inputs. With TextGrad, a challenge we observed was that textual gradients over a set of training data weren’t producing consistent feedback to improve the prompts; a second challenge was to determine what loss function to use in the optimization.

These observations motivated us to develop our Eval-Guided Optimization method, which consists of the following steps:

Evaluate an LLM Judge on a benchmark data set.

Identify a data slice — a subset of the full data set — on which a metric of interest is performing poorly.

Automatically construct the loss function for the prompt optimizer from a textual description of the objective.

Run the prompt optimizer on the identified data slice from Step 2 with the loss function from Step 3 to produce an optimized prompt for the LLM Judge.

Re-evaluate the LLM Judge on the entire benchmark data set and report the results of the optimization.

Figure 2. The workflow of Eval-Guided Optimization.
In our experiments, we use TextGrad as the prompt optimizer. Note that Step 2 addresses the first challenge with TextGrad by focusing it on a data slice with a low-performing metric that offers an opportunity for consistent improvement. Further, Step 3 addresses the second challenge by automatically constructing the loss function from a textual description of it.

We will now illustrate our Eval-Guided Optimization method by walking through its application to an LLM Judge for groundedness.

Step 1: Evaluate an LLM Judge on a benchmark data set
We use LLMAggreFact as the benchmark data set to evaluate the TruLens LLM Judge for groundedness and compare it against the SOTA Bespoke-Minicheck-7B model.



Evaluator

Precision

Recall

F1 score

Bespoke-MiniCheck-7B

0.7610

0.8038

0.7771

TruLens groundedness (un-optimized)



0.6238

0.8779

0.7232

We notice that the precision of the TruLens LLM Judge for groundedness on the benchmark falls behind the SOTA model. Precision is an important metric for the LLM Judge to excel at since we want to avoid situations where the LLM Judge says that a sentence is well-grounded when in fact it is not.  Thus, precision became our primary optimization target.

Step 2: Identify a data slice on which a metric of interest is performing poorly
Since the LLMAggreFact data set consisted of 11 smaller data sets, it was easy for us to observe the precision of the LLM Judge on these 11 data slices. We selected the RAGTruth subset as our slice to perform prompt optimization on, since the LLM Judge exhibited low precision on it (0.57 vs. overall precision of 0.62) and it had a relatively large number of samples.

As noted earlier, we have found that slice selection is critical to the success and generalization ability of the optimization steps.  Our hypothesis is that by identifying a slice with a consistent error trend (e.g., low precision) we increase the chance that the textual gradients are similar and thus provide consistent feedback to improve the prompts.  Indeed, using the entire benchmark data set for optimization (or a random sample thereof) — as one would normally use with TextGrad — did not work well in our experiments.

Step 3: Automatically construct the loss function for the prompt optimizer from a textual description of it
In the prompt optimization with TextGrad, analogous to the back propagation in deep learning, we define TextLoss = ∂L/∂x = ∇LLM(x, y, ∂L/∂y ) ≜ “Here is a conversation with an LLM-judge: {x|y}.” + LLM(Here is a conversation with an LLM-judge: {x|y}. Below are the criticisms on {y}:  ∂L/∂y  Explain how to improve {x}.),

        where x = accuracy of LLM judge and y = LLM-judge prediction.

Since both the overall precision and slice precision are low, this translates to a high false-positive rate or LLM judges being too lenient for failing examples. We draft a separate prompt with documentation of TextGrad API references, asking to specifically design a function that returns a numerical loss [0,1] to penalize false positives, and use GPT-4o  to generate a weighted groundedness loss function to serve as the TextLoss function for the optimizer. This function places greater weight on false positives, thus steering TextGrad to optimize for precision improvements.

Step 4: Run the prompt optimizer on the identified data slice from Step 2 with the loss function from Step 3 to produce an optimized prompt for the LLM Judge
The diff below shows the prompt edits added by the auto prompt optimizer using the previously defined loss function after 15 iterations. Note the addition of the sentence — “Be cautious of false positives; ensure that high scores are only given when there is clear supporting evidence” — which was added through the Eval-Guided Optimization process that sought to reduce false positives. Interestingly, the optimization process also added in a sentence to correct false negatives: “Consider indirect or implicit evidence, or the context of the statement, to avoid penalizing potentially factual claims due to lack of explicit support.”

You are an INFORMATION OVERLAP classifier; providing the overlap of information (entailment or groundedness) between the source and statement.

Respond only as a number from 0 to 3, where 0 is the lowest score according to the criteria and 3 is the highest possible score.

You should score the groundedness of the statement based on the following criteria:

- Statements that are directly supported by the source should be considered grounded and should get a high score.

+

+ - Consider indirect or implicit evidence, or the context of the statement, to avoid penalizing potentially factual claims due to lack of explicit support.

+ - Be cautious of false positives; ensure that high scores are only given when there is clear supporting evidence.

Step 5: Re-evaluate the LLM Judge on the entire benchmark data set and report the results of the optimization
On the LLM-AggreFact holdout set of 11,000 examples, we observed a significant improvement in overall precision (+15%) and F1 score (+8%), with a smaller decrease in the overall recall (-3%). The optimized LLM Judge beats the SOTA Bespoke-Minicheck-7B model on all three metrics, as well as the related LLM Judge from MLflow on precision and F1 score.



Evaluator

Precision

Recall

F1 score

Bespoke-MiniCheck-7B

0.7610

0.8038

0.7771

MLflow faithfulness

0.6693

0.8902

0.7545

TruLens groundedness (un-optimized)



0.6238

0.8779

0.7232

TruLens groundedness (optimized)



0.7830

0.8515

0.8082

Confusion matrix changes. Figure 3 shows the confusion matrices of the TruLens LLM Judge for groundedness before and after the optimization. They show the significant reduction in false positives (bottom left cell changes from 2504 to 1088) accompanied by a smaller increase in false negatives (top right cell changes from 561 to 624). This leads to higher precision after optimization, a smaller drop in recall and an overall increase in the F1 score on the LLM-AggreFact data set.

Figure 3. The confusion matrices before and after the optimization show the significant reduction in false positives, accompanied by a smaller increase in false negatives.
Data splits. We employ a data splitting strategy to generate a 40/20/20 or 30/30/40 training/validation/testing split for all experiments. We find that for prompt optimization, we don't usually need the train split to be more than a few hundred high-quality examples. Newly proposed prompts/prompt edits are accepted only if we see improvements on the test splits, and we re-evaluate on the entire data set to report final results. Thanks to the large-scale annotation we have in LLM-AggreFact, we use the original dev split as our holdout set for evaluating pre- vs. post-optimization, and we sample from the original test split (29,000) to generate train/val/test splits and perform prompt optimization.

Results for context and answer relevance LLM Judges
In this section, we summarize our experimental results for the two other RAG Triad LLM Judges and include a comparison with the LLM Judges from related projects.

Context relevance.  Context relevance is closely related to the task of relevance prediction in information retrieval. For the benchmark data set, we used a sample of TREC-DL passage retrieval data sets with human annotations from the years 2021 and 2022 with a fair distribution of labels from each relevance score {0, 1, 2, 3}.

The original relevance scores are then unified to binary labels {0, 1}, where {2,3} are converted to 1 (relevant) and {0, 1} are converted to 0 (nonrelevant), following the instructions from the original TREC passage retrieval challenge.

As shown in the table below, we see a similar and even more obvious low-precision/high-recall phenomenon. We also see that despite having low precision, the off-by-1 accuracy score of LLM Judges is high, highlighting an opportunity for prompt optimization, in particular, by aligning criteria between human labels and evaluation prompts.

For the LLM Judge for context relevance, after prompt optimization, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of  2.4%  on the TREC-DL data set. This placed it above the UMBRELA Judge on the recall and F1 metrics and above the MLflow Judge on recall and F1 score.



Evaluator

Precision

Recall

F1 score

Off-by-1 acc

UMBRELA

0.6000

0.6449

0.6216

0.8945

MLflow relevance

0.5973

0.6885

0.6396

N/A

TruLens context relevance

(unoptimized)

0.4723

0.9034

0.6203

0.8634

TruLens context relevance

(optimized)

0.5129

0.8660

0.6443

0.8902

You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.

Respond only as a number from 0 to 3, where 0 is the lowest score according to the criteria and 3 is the highest possible score.

A few additional scoring guidelines:\n\n- Long RESPONSES should score equally well as short RESPONSES.

- RESPONSE must be relevant to the entire PROMPT to get a maximum score of 3.

- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.

- RESPONSE that is RELEVANT to none of the PROMPT should get a minimum score of 0.

- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 3.

- RESPONSE that confidently FALSE should get a score of 0.\n- RESPONSE that is only seemingly RELEVANT should get a score of 0.

- Answers that intentionally do not answer the question, such as "I don't know"" and model refusals, should also be counted as the least RELEVANT and get a score of 0.

+ - Be cautious of false negatives, as they are heavily penalized. Ensure that relevant responses are not mistakenly classified as irrelevant.

Answer relevance. We also include answer relevance evaluation results on HotpotQA samples and provide comparisons with MLflow. The benchmark examples are sampled with both classes balanced, where ground truth answers are assumed to be relevant and we shuffle answers to queries to create negative examples. Both TruLens and MLflow achieve strong precision on the benchmark but weaker recall numbers, where we see higher false negatives. Comparing the evaluation prompt side by side with TruLens’, MLflow’s answer-relevance instructions mention more aspects (appropriateness, applicability) than TruLens, where only relevancy of the answer with respect to the query is specified.  Our hypothesis is that the higher specificity makes MLflow’s eval more strict, resulting in lower recall but higher precision.

For completeness and showcasing generalization ability, we include prompt optimization results on TruLens’ answer-relevance metric where we are able to improve and match the performance of MLflow via eval-guided optimization with interpretable prompt edits, shown below as a diff at the end of the evaluation prompt.



Evaluator

Precision

Recall

F1 score

MLflow answer relevance

1.0000

0.6650

0.7988

TruLens answer relevance (unoptimized)

1.000

0.6050

0.7539

TruLens answer relevance (optimized)

0.9924

0.6550

0.7892



Conclusions and future work
We addressed the challenge of building trust in LLM Judges by benchmarking the quality of the baseline TruLens LLM Judges for the RAG Triad against standard ground truth data sets — TREC-DL for context relevance, LLMAggreFact for groundedness and HotPotQA for answer relevance — reporting precision, recall and F1 scores. We developed a new method for Eval-Guided Optimization that leverages the results of the benchmarking to guide an end-to-end prompt optimizer with an appropriate choice of a loss function and a data slice. We implemented this method using TextGrad as the agentic prompt optimizer and observed significant improvements.

For the LLM Judge for groundedness, precision increased by roughly 16% with a 2.5% drop in recall, leading to an F1 score increase of  8% on the LLMAggreFact data set. This placed it above the SOTA fine-tuned, proprietary Bespoke-MiniCheck-7B model on precision, recall and F1 score, as well as above the related LLM Judge from MLflow with respect to precision and F1 score.

For the LLM Judge for context relevance, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of 2.4%  on the TREC-DL data set. This placed it above the LLM Judge with the UMBRELA prompt and the corresponding MLflow Judge on the F1 and recall metrics.

For the LLM Judge for answer relevance, recall increased by 5% with a 0.76% drop in precision, leading to an F1 score increase of 3.5% on the HotpotQA data set. This makes it comparable to the MLflow Judge for the related metric.

We also identified common trends across evaluation tasks where LLM Judges tend to be more liberal or lenient in their judgments than human counterparts.

In future work, we plan to develop the methodology further and apply it to tasks beyond RAGs, including but not limited to agentic workflows and multimodal use cases. In addition, as we strive to push the frontier of novel evaluation frameworks and make enterprise AI enabled by Snowflake more trustworthy, we will explore incorporating eval-guided optimization into Snowflake’s upcoming product offerings.

1 Shayak Sen, LLMs: Consider Hallucinatory Unless Proven Otherwise, AI Transformation Summit – Pinecone, July 2023.

2 Anupam Datta, Jerry Liu with Andrew Ng, Building and Evaluating Advanced RAG.

3 TruLens docs, The RAG Triad.

4 Cohen's Kappa measures inter-rater reliability between humans and LLM judges tasked in the same experiment. Cohen’s Kappa, ranging from -1 to 1, takes into account agreement by chance between human and LLM judges.

Arctic Agentic RAG Episode 1: Agentic Query Clarification for Grounded and Speedy Responses

Welcome to the first deep-dive episode of our Arctic Agentic RAG series!

In our overview blog post, we explored the limitations of traditional retrieval-augmented generation (RAG) systems in enterprise environments. These systems struggle with ambiguous queries, multimodal data and complex reasoning — making it difficult for businesses to extract the right information reliably. This is where Arctic Agentic RAG comes in, transforming enterprise AI with intelligent, adaptive and verifiable retrieval.

Now, in this first deep-dive episode, we dive into one of the most fundamental challenges: handling ambiguous queries.

The first challenge: Addressing ambiguous queries
In traditional search, ambiguous queries are typically addressed by presenting users with a diverse set of related links, allowing them to explore different perspectives and manually extract the most relevant information. While this approach offers flexibility, it shifts the responsibility onto the user and lacks the precision required for enterprise AI applications, where speed, trust and accuracy are paramount.

Unlike traditional search, which relies on users to refine their own queries, standard RAG pipelines sometimes attempt to generate direct answers. However, these answers may be incorrect, incomplete or misleading if the query lacks sufficient specificity. Other times, they provide possible clarification questions. However, these clarification questions can often be irrelevant or unanswerable based on the user's repository, leading to further confusion and inefficiency. This is especially problematic in enterprises where accuracy and trust are paramount.

Arctic Agentic RAG takes a different approach: It clarifies the query first, ensuring that any clarification questions generated are both relevant and answerable within the user's repository, ensuring both questions and answers are grounded — meaning that responses can be verified against retrieved passages. Besides enhancing accuracy, our approach also ensures that responses are generated efficiently and cost effectively.

In this episode, we dive deep into how Arctic Agentic RAG tackles query ambiguity with grounded and speedy responses. We’ll also provide an overview of the Arctic Agentic RAG open source framework, which enables researchers and developers to explore and implement these techniques themselves.

Why ambiguous queries matter in enterprise RAG
RAG is designed to complement LLMs by retrieving evolving, domain-specific enterprise information from the corpus that is possibly absent from an LLM’s static training data. For RAG to be effective, it must ensure both diversity in covering diverse user intents and grounding responses to the retrieved passages.

Current methods for clarification tend to prioritize diversity, often leading to interpretations that RAG cannot effectively answer.

Pitfalls of existing methods
Most state-of-the-art approaches follow a "diversify then verify" (DtV) strategy. An LLM first generates multiple possible meanings for a query, then retrieves documents for all interpretations and finally prunes irrelevant results.

For example, the query "What is HP?" could refer to Hewlett-Packard, horsepower or Harry Potter. A general LLM might suggest all three, but in an enterprise-specific corpus, only one may be relevant. Despite this, DtV retrieves documents for all interpretations, adding noise and increasing computational cost.

Figure 1a illustrates this limitation: Verification happens too late, after retrieval has already been influenced by irrelevant interpretations. This inefficiency makes enterprise retrieval less precise and more resource intensive.

Comparison of (a) DtV and (b) Snowflake’s VD workflows for handling ambiguous questions in RAG. VD avoids generating ungrounded interpretations and thus does not attempt to answer those.
Figure 1: Comparison of (a) DtV and (b) Snowflake’s VerDICT workflows for handling ambiguous questions in RAG. VerDICT avoids generating ungrounded interpretations and thus does not attempt to answer those.
How Arctic Agentic RAG addresses query ambiguity: Verified DIversification with ConsolidaTion (VerDICT)
Figure 1b contrasts Snowflake’s improved workflow, Verified DIversification with ConsolidaTion (VerDICT), integrating verification directly into the diversification step. Rather than generating all possible interpretations up front, our approach first relaxes user queries to retrieve passages with diverse interpretations, represented as relevance feedback in the figure. We then extract grounded interpretations from the retrieved passages and also ensure that each can be answered from the retrieved passages, using answerability feedback. These two feedback types are elaborated below.

Retriever: relevance feedback: Unlike DtV, which diversifies into all possible interpretations, our approach first checks which interpretations are supported by the retrieved passages. A single retriever call with a relaxed query identifies top-k search results representing diverse interpretations, from which interpretations are extracted to avoid extracting an ungrounded interpretation such as Harry Potter in Figure 1a.

Generator: answerability feedback: Even if a document is relevant to the interpretation grounded to this document, it may not answer the query. Thus retrieval alone is insufficient for feedback — we introduce a generator feedback, to ensure that an answer can be generated before retraining an interpretation. To motivate, Figure 1b describes how we filter out relevant but unanswerable passage p2. It is relevant to the Hewlett-Packard interpretation, describing its products, but it cannot answer what HP is. To address this, we prompt the generator LLM with the question and its grounded passage, verifying whether a valid question-answer pair can be formed. If not, the interpretation is discarded.

A consolidation phase using clustering then follows to enhance robustness against noise in retriever and generator feedback: Question-answer pairs obtained from verification are clustered to keep those consistently supported by relevant passages, while filtering out outliers from noisy passages.

Performance Results
VerDICT is fast and resource efficient
By verifying interpretations up front, VerDICT eliminates noise and reduces unnecessary computations. Traditional DtV methods repeatedly call the retriever per each interpretation, while VerDICT significantly reduces retrieval overhead. For example, when there are three interpretations — Hewlett-Packard, horsepower or Harry Potter — the number of interpretations is |Q| = 3 and top-k results are retrieved per each interpretation. VerDICT, on the other hand, by verifying grounded interpretations up front, cuts down on retrieval and processing costs, as Table 1 shows.




Retriever

LLM


# of calls

# of calls × Input length

DtV

O(|Q|)

O(|Q|) × O(k)

VerDICT

O(1)

O(k) × O(1)

Table 1: Comparison of number of calls made to the retriever and LLM generator per each question between DtV and VerDICT. |Q| is the number of interpretations, and k is the size of retrieval.

VerDICT generates correct and grounded interpretations
Efficiency alone isn’t enough — accuracy is critical. In our evaluations (see Figure 2), 93% of VerDICT-generated interpretations led to correct and grounded answers, compared to just 56% with DtV. Even human-generated interpretations scored only 65%, proving that VerDICT is both accurate and reliable.

Figure 2: The ratio of grounded interpretations from DtV (orange) and VerDICT (blue), with Llama 3.3 70B (left) and GPT-4o (right) as backbone LLMs.
Summing up, these results demonstrate that VerDICT enhances accuracy, minimizes wasted resources and improves the user experience. For a deeper technical and empirical analysis, check out our paper “Agentic Verification for Ambiguous Query Disambiguation,” available at arXiv.

Applications in Snowflake’s Cortex Agents API
Arctic Agentic RAG is integrated into the Snowflake Cortex Agents API, providing Snowflake customers with a more intelligent, efficient and precise retrieval experience. This integration enhances enterprise search, knowledge management and automated analytics workflows. When a user provides a query that is vague and ambiguous in nature, instead of providing a single answer, Cortex Agents API follows up with related queries with VerDICT. This is anchored on:

Accurate retrieval: Clarifies queries dynamically for precise, context-aware responses through Snowflake Cortex Search with a proven record of search quality compared to competitors.

Optimized efficiency: Reduces computational overhead, speeding up analytics workflows.

Enterprise-grade applications: Supports customer support, compliance and R&D with domain-specific, verifiable insights of diverse nature.

This feature enhances various enterprise scenarios, such as:

Customer support automation: Clarifies vague queries such as "issue with my account" into actionable support topics.

Financial and legal compliance: Directs compliance officers to precise policy sections for regulations such as GDPR.

Internal knowledge management: Helps employees find specific HR and IT policies from vague search terms such as “work from home.”

Ecommerce analytics: Refines broad queries into segmented insights on sales, trends and customer behavior.

Healthcare and pharma research: Guides medical professionals to precise drug interactions and treatment protocols.

Figure 3 illustrates how our work helps users clarify ambiguous queries by generating relevant follow-up questions and providing grounded answers.

Query clarification in Snowflake’s Cortex Agents API setup with tool access to a series of synthetically generated insurance documents retrieved via Cortex Search services. The system refines vague queries by suggesting specific related questions, improving retrieval accuracy and user experience.
Figure 3: Query clarification in Snowflake’s Cortex Agents API setup with tool access to a series of synthetically generated insurance documents retrieved via Cortex Search services. The system refines vague queries by suggesting specific related questions, improving retrieval accuracy and user experience.
Open sourcing Arctic Agentic RAG
Beyond Snowflake’s own offerings, we have an open source Arctic Agentic RAG framework for researchers and practitioners. Unlike other agentic RAG frameworks that aim to be feature-complete, our Arctic Agentic RAG framework prioritizes lightweight, efficient components for fast development and research exploration. The key components include:

LLM backend: Supports cloud providers such as Snowflake Cortex Completion and Azure OpenAI, as well as local inference via vLLM.

Template format: Standardizes input-output structures for defining agentic factories.

Agentic factory: Defines agent functionality, including input parsing, retrieval and response generation.

These modular components allow researchers to easily build and customize functional agents while maintaining efficiency.

For this initial release, we provide all components related to VerDICT for handling ambiguous queries, including retrieval, answering and clustering modules for disambiguation. Additionally, we offer easy-to-use examples for building a simple RAG framework using Cortex Search and Completion functionalities, streamlining retrieval and large model deployment.

As we continue developing Arctic Agentic RAG, we plan to open source more features from our RAG innovations, helping the community reproduce research results and accelerate the adoption of advanced RAG techniques. By leveraging our framework, researchers and practitioners can rapidly prototype and iterate on novel ideas without the overhead of a full-fledged implementation. This fosters innovation and promotes shared advancements to benefit the broader AI community.

Explore Arctic Agentic RAG to learn more, and start experimenting with it today!



This concludes Episode 1 of our Arctic Agentic RAG series. Stay tuned for Episode 2, where we tackle the next major challenge: handling multimodal enterprise data — bringing images, tables, structured databases and text together in a seamless, intelligent retrieval process. Interested in this discussion? Visit the AI Research & Development Community forum on Snowflake.

Our great collaborators: We would like to extend our gratitude to our academic collaborators — Seung-won Hwang and Youngwon Lee from Seoul National University, and Feng Yan and Ruofan Wu from the University of Houston — for their valuable contributions.

Cortex Agents: Unifying Data Insights with Snowflake

In the modern enterprise environment, data exists across structured databases and unstructured textual sources — often distributed, siloed and challenging to query in unison. Users increasingly expect a single seamless experience that integrates structured analytics and unstructured search, allowing complex questions to be answered quickly and reliably.

In this blog, we will talk about some of the technical challenges in building Snowflake’s Cortex Agents, our powerful new agentic system that is built to combine Snowflake’s existing products for conversing with data. We do this by combining three separate services:

Cortex Search
Snowflake’s proven, low-latency hybrid (vector + keyword) search service that allows users to quickly set up reliable retrieval of text documents stored in Snowflake, currently in general availability. Enterprises primarily use Cortex Search to:

Provide contextual information to large language model (LLM) chatbots (retrieval-augmented generation).

Power robust enterprise search solutions without infrastructure maintenance and tuning.

Cortex Analyst
This API allows users to ask business questions about structured data in Snowflake in natural language, automatically translating these requests into SQL queries executed inside Snowflake, currently in general availability. With Cortex Analyst:

Nontechnical users can easily get reliable analytics from structured data.

API-first integration allows embedding natural language analytics directly into existing workflows.

Semantic models optionally ensure high precision by bridging user language with database structures.

Cortex Agents
Our tool (currently in private preview) combines these two robust tools into a single, streamlined, multi-turn conversational interface, empowering users to ask sophisticated questions involving both structured and unstructured data in a unified manner.

Background: Unified multitool conversations
Consider a customer service manager who needs comprehensive insights into ongoing issues:

Conversation example
User: “Show me recent support cases mentioning internet outages in North America.”

Behind the scenes: Cortex Agents recognizes the user is requesting unstructured content and routes to Cortex Search to retrieve relevant support transcripts.

Agent: Here are recent support transcripts mentioning internet outages from North America:
Transcript 102: 'My internet has been down since yesterday.'
Transcript 109: 'Experiencing intermittent internet outages all week.'
Now, suppose the user follows up with a structured data analytics request:

User: “What percentage of our North America customers reported internet outages last month?”

Behind the scenes: Here, Cortex Agents automatically switches context, recognizing this query is about structured data analytics:

It uses Cortex Analyst to generate a precise SQL query.

The query is executed securely in Snowflake, respecting data governance and access controls.

Agent: “In the last month, 8.7% of North American customers reported an internet outage.”

By integrating both Cortex Search and Cortex Analyst into a single conversational flow, Cortex Agents delivers richer, context-aware responses across multiple queries seamlessly.

Why data tool calling isn’t just API calling
Calling into specialized data tools like Cortex Search and Cortex Analyst differs significantly from standard API function calls. While typical API functions have small, well-defined parameters, data tools can involve much larger, more complex specifications:

Semantic richness: when working with data tools with rich underlying structure, the user is not just calling a function named getData. The user’s question may reference domain concepts that involve advanced metrics, relationships, and business logic. For example, consider a semantic model used by Cortex Analyst:
name: economics_indicators
description: >-
  This semantic model represents economic indicators and FOMC policy decisions.
  It enables querying and analysis of key macroeconomic metrics, monetary policy
  changes, and their contextual relationships...
tables:
  - name: economics_indicators_fomc_demo
    description: Economic indicators and FOMC policy decisions.

    base_table:
      database: CORTEX_SEARCH_DB
      schema: ANALYST
      table: ECONOMICS_INDICATORS_FOMC_DEMO

    dimensions:
      - name: DATE
        synonyms: ["record date"]
        description: The date associated with the economic indicators...
        expr: DATE
        data_type: date

      - name: REAL_GDP_GROWTH
        synonyms: ["gdp growth", "economic growth"]
        description: Real GDP growth rate as a percentage...
        expr: REAL_GDP_GROWTH
        data_type: number

      ...

    time_dimensions:
      - name: DATE
        synonyms: ["record date", "event date"]
        description: The date when the economic indicators and policy decisions...
        expr: DATE
        data_type: date

    filters:
      - name: DATE
        synonyms: ["record date"]
        description: Filter by the date of the record.
        expr: DATE
      - name: UNEMPLOYMENT_RATE
        synonyms: ["jobless rate filter"]
        description: Filter records by unemployment rate.
        expr: UNEMPLOYMENT_RATE
This single YAML snippet already shows how complex and robust the specification is. The system needs to translate business-friendly queries into reliable references to UNEMPLOYMENT_RATE, REAL_GDP_GROWTH, or PCE_INFLATION, often with synonyms like “jobless rate” or “price index.” This is more than a simple function call with five parameters — it’s a rich semantic layer of information.

2. Large specifications: These semantic models can be thousands of tokens long, containing columns, filters, relationships, synonyms, default aggregations and more. A typical function-call approach (like a simple JSON schema) often hits token or size limits.

3. Precision demands: A question such as, “How do FOMC policy changes correlate with inflation when GDP is above 3% growth?” requires:

Understanding “policy changes” as a dimension in the semantic model (KEY_POLICY_CHANGE).
Correctly filtering on REAL_GDP_GROWTH > 3.0.

Possibly aggregating or joining across time dimensions to find correlation with inflation.

Misinterpreting these steps leads to incorrect or misleading answers without the end user even knowing!

Hence, calling data services is less about “calling an endpoint” and more about combining advanced reasoning with semantic knowledge about what’s in a data source. This is highly non-trivial and something that standard function-call approaches do not address at scale. We will go so far as to say

No AI data agent can work accurately without a semantic specification of the underlying data.

Multi-semantic model routing
Adding semantic models for the data sources presents a new set of challenges. When organizations maintain multiple semantic models for different domains (e.g., one for economic indicators, another for customer service analytics and yet another for supply chain data), determining which semantic model best fits a user query can be a major challenge. Typical issues encountered in this context include:

Context limits: Simply dumping all model details into an LLM prompt risks exceeding token capacity — especially for LLMs that have strict prompt-size constraints.

Performance considerations: Analyzing large semantic models can be resource-intensive. Naïvely passing everything into an LLM might result in slow, expensive queries.

Accuracy: If the system picks the wrong semantic model, the user may receive an irrelevant or incorrect answer. For example, if the system used a marketing analytics model responding to a question about Fed funds rate changes rather than the economics_indicators_fomc_demo model, the user would likely get information about marketing campaign performance or customer segmentation instead of data on interest rate fluctuations.

To address these problems, we came up with a multistage routing approach that determines the relevant semantic model to call, using an efficient retrieval-backed pipeline.

Our multistage routing approach:

Pruning: Each semantic model is scanned for key columns or verified queries that might align with the user’s prompt. Irrelevant details are filtered out.
Retrieval: The pruned representations of all candidate models are embedded, and  a similarity-based retrieval is performed to find the Top-K likely model matches.
Final selection: A powerful LLM then “reads” these Top-K pruned models in more detail, deciding which single model (or combination of models) best fits the user’s query.
agent multistage routing
Snowflake’s multistage routing approach helps ensure an optimal balance of accuracy and speed:

Accuracy by focusing on relevant tables, columns and past queries.

Speed by filtering out irrelevant models before the final LLM call.

Scalability by structuring the entire routing pipeline around modern embedding and retrieval techniques.

Governance and security built in
Enterprise-grade governance is integral to Cortex Agents:

Data governance: All operations, including structured SQL queries and unstructured document retrieval, are performed subject to data governance standards you implemented in Snowflake, enabling secure handling of your data at all times.

Access control: Cortex Agents adheres strictly to Snowflake’s role-based access policies, helping protect sensitive data.

Auditing: Complete visibility into query execution and usage patterns enables transparent operations and easy debugging.

Conclusion
Snowflake’s Cortex Agents represents a leap forward in enterprise conversational intelligence, blending Snowflake’s proven Cortex Search and Cortex Analyst functionality into a single unified experience. Cortex Agents simplifies user interactions with complex data environments, enabling both structured analytics and unstructured search in one governed and intuitive API.

By handling intricate multi-turn and multi-tool conversations seamlessly, Cortex Agents empowers business teams to extract deep, reliable insights faster, enhancing productivity and decision-making without compromising governance or security standards.

And we’re not done yet. Cortex Search and Cortex Analyst are just two of the many data tools Snowflake supports. As we continue evolving Cortex Agents, we’re exploring ways to bring even more Snowflake native capabilities into a unified, conversational experience.

Visit these quickstart guides to explore Cortex Agents and begin to transform how your organization leverages data intelligence today:

Snowflake Arctic Embed Joins ArcticTraining: Simple And Scalable Embedding Model Training

We are pleased to announce that the Arctic Embed project has migrated its core training code into the open source ArcticTraining project. Though Snowflake has released all of its AI models under a permissive Apache 2.0 license since Day 1, today we’re going a step further to enable a broader community to engage directly with the tools behind our embedding models' development. This new open code release adds to the rich body of detailed technical publications on our model training processes, making it now easier than ever to reproduce, extend and remix our work to create the world’s most efficient frontier embedding models.

Climbing onto the shoulders of giants: DeepSpeed and ArcticTraining
Our newly migrated training codebase allows Arctic Embed to build upon ArcticTraining and  DeepSpeed (which serves as the backend for ArcticTraining). DeepSpeed is a mainstay in AI model development and is used for many embedding model training workloads, including those of Alibaba’s GTE team, Nomic’s team, and Jina AI’s embedding team. As such, we are excited to join the list of teams leveraging the same easy-to-use ZeRO stage 1 memory efficiency optimizations that have made DeepSpeed the tool of choice for so many others. Notably, this transition brings us greater ease of scaling up to larger model sizes.

DeepSpeed’s unopinionated CLI entry point and laissez-faire approach to training script management leave it up to end users to maintain clean pipelines for experimentation. While simple for smaller projects, things can become complex in long-term projects and projects shared between multiple collaborators who may have different styles of organizing their work. This is where ArcticTraining comes into the picture for Arctic Embed.

ArcticTraining is the front door to DeepSpeed, providing much-appreciated ease-of-use abstractions, including automatic validation of job configurations, to make it easier to organize training code and experimental workflows. While migrating from our previous pure-PyTorch implementation to standard DeepSpeed patterns would have been a major paradigm shift, ArcticTraining implements a clean, pydantic-validated, and config-driven parameterization of training quite similar to what we built into our previous implementation. This made for a smooth transition of our training codebase.

Thanks to DeepSpeed and ArcticTraining, the new Arctic Embed training codebase is now more powerful and simpler to use than the original implementation used to train Arctic Embed 1.0, 1.5, and 2.0, and we are thrilled to be sharing it openly with the world.

Key feature: Generalized contrastive data representation
Figure 1. Arctic Embed represents contrastive training data labels as a (possibly sparse) matrix, capturing all NxM relations between the queries and documents within a batch of data. This generalizes the common cases of pair and triplet data while allowing for more sophisticated patterns as well.
Of all the details we will discuss, Arctic Embed’s most distinctive design decision is its highly generalized formulation of a contrastive data batch. Several other codebases have specialized data formatting conventions, data loading code, and loss functions for large-scale semisupervised contrastive training (often called “pair data”) and for smaller-scale training with explicitly labeled negatives (often termed “triplet data”). Arctic Embed’s training code, on the other hand, centers around a single formulation which not only supports both of these special cases but can also represent other more complicated patterns with ease as well, as seen in Figure 1.

Not only is this formulation highly flexible, but it also makes certain common data mistakes impossible. For example, when we have two labeled positive examples for the same query, the common approach of grouping data into independent triplet examples can accidentally label a known positive as a negative when in-batch negatives are used (e.g., as they are used in the BGE recipe). With the Arctic Embed data representation, it is impossible to use a known-positive document as a negative example for the same query since there is only a single spot in the label matrix for each unique query-document relationship.

Arctic Embed’s data representation is not just flexible and conceptually approachable, it is also efficient, since it deduplicates any possible redundant computation incurred by queries and documents that serve multiple roles within a batch (e.g., some queries have multiple positive documents which lead to duplications in triplet-form data). As the label matrix can be quite large for a large batch of queries and documents, our implementation stays efficient even at large batch sizes by utilizing PyTorch’s built-in sparse matrix support to reduce memory overhead. We also avoid storing unknown query-document relationships on disk in our data storage format, leveraging analogous sparsity to minimize data storage footprint as well.

This new convention comes with batteries included
The power of this generalized data format isn’t completely free. For example, getting our code for data loading, embedding, and loss calculation to cleanly model the underlying mathematical model of contrastive training took more work upfront and required some extra custom code (e.g., we had to implement a custom loss function to avoid memory overhead from performing matrix reindexing inside PyTorch’s autograd computational graph). Luckily for users of Arctic Embed, the heavy lifting has already been done for them.

We also include simple tooling for data preparation to make it easy for users to dive into the Arctic Embed way of handling contrastive training data. While Arctic Embed increases the requirements of data preprocessing, in which all training data must be converted into a batched format prior to running the training loop, there is minimal overhead on the user because this conversion is provided as a native function of the codebase to ensure you can get up and running quickly.

In practice, we have found that our distinctive approach to data set preparation notably improves debuggability and iteration speed by simplifying the training code, so we encourage you to take our approach for a spin.

Getting the details right
Figure 2. Arctic Embed natively is integrated with Weights & Biases to track the correctly rescaled gradient norm, learning rate, training loss, and evaluation metrics.
Thanks to a collaboration with core DeepSpeed maintainers, Arctic Embed has gone deep to iron out a few details that make for subtle but appreciable quality-of-life improvements. For example, our implementation hooks deep into DeepSpeed to properly rescale gradients in distributed training, adjusting for the unique wrinkles of data-parallel backpropagation of contrastive loss. While in practice improper rescaling does not generally have a large impact on training trajectory, because the ubiquitous Adam optimizer is invariant to gradient rescaling, it does allow us to log and clip gradients accurately, improving visibility into the training process. Other polished details include native Weights & Biases integration, easy toggles for gradient checkpointing, our signature simple-and-quantization-friendly Matryoshka Representation Learning loss function, and support for in-training validation on multiple dev data sets.

Walkthrough: Rejuvenating the original E5 models with modern fine-tuning
In late 2022, the paper Text Embeddings by Weakly-Supervised Contrastive Pre-training introduced the E5 family of retrieval models and established the modern paradigm of large-scale contrastive training for high-quality retrieval models. Though the benchmark scores of the E5 models were quickly eclipsed by follow-up works, like the GTE models from Alibaba’s team, these benchmark comparisons are somewhat skewed. The original E5 models only used two retrieval data sets in their fine-tuning (MSMARCO and NQ), while follow-up works frequently extend to half a dozen or more, drawing upon the training splits of data sets used in common benchmarks like BEIR, making for an unfair comparison. Adding another wrinkle to the story, the recent paper NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models has demonstrated that these extra fine-tuning data sets can, in addition to driving “unfair” in-domain performance gains, also deliver impressive improvements out of domain as well.

Given these observations about the generalizable benefits of diverse and extensive fine-tuning for retrieval, we thought it would make for an interesting demonstration to investigate how well the original E5 models can do when fine-tuned with a more modern data recipe in an explanatory walkthrough included within the Arctic Embed code release. Since we have provided all the code for this example in our release (as well as extensive documentation in the corresponding README file), we will focus on the high-level details in this post.

Spoiler alert: The results are great
We find that after upgrading the fine-tuning data, we can use the Arctic Embed training code to fine-tune ourselves an e5-base-arctic-finetune that keeps pace with other leading ~100M-parameter models on the BEIR leaderboard. We also find that when evaluating an out-of-domain news retrieval task (CLEF English), the model continues to do well, scoring nearly as well as Arctic Embed M 2.0 and outperforming a bevy of other BEIR leaderboard-leading models.

Model	BEIR Score (nDCG@10)	CLEF English Score (nDCG@10)
e5-base-v2 (original)	50.19	45.38
e5-base-arctic-finetune (new)	54.70	52.77
gte-base-en-v1.5	54.02	47.91
arctic-embed-m-v1.0	54.89	47.62
arctic-embed-m-v2.0	55.38	54.06
Table 1. Experimental results. Since we have run evaluation at bfloat16 precision, there may be slight discrepancies between these figures and official MTEB leaderboard numbers.

Modern fine-tuning data
The core idea of this example is to pull together numerous open high-quality retrieval training data sets, mine for hard negatives using a high-quality teacher model, and replace the original E5 fine-tuning step with our own training code. We begin with high-quality retrieval data from five sources: FEVER, HotpotQA, MSMARCO, NQ, and StackExchange. We then use the Arctic-embed-l-v2.0 model to mine high-quality, hard-negative examples, using thresholding to discard potential false-negative examples during mining (see NV-Retriever: Improving text embedding models with effective hard-negative mining for extended discussion of effective hard-negative mining). We end up with a fine-tuning data set of just over 1 million queries, each paired with 1-3 positive documents and 10 negative documents.

We are releasing the full data set, including all intermediate states, on Huggingface. We have also released all the code to run the data processing as part of the walkthrough example in the Arctic Embed codebase.

Simple, scalable training
The core training script drives one epoch of training through our high-quality fine-tuning data set using the industry-standard InfoNCE loss. On eight H100 GPUs, we accomplish the full fine-tuning run in less than an hour. Under the hood, this training script leverages not only ArcticTraining’s configuration validation and ease-of-use figures but also the powerful ZeRO Stage 1 memory optimizations provided by DeepSpeed.

Since all data preparation code, training data, training code, the base model and our final model are openly available, it is easy to get started on your training experiments. If you take the data and/or code for a spin, please share your experience on the Snowflake Community Forums AI Research And Development Community!

The future
Though Arctic Embed has historically focused on a pragmatic tradeoff between efficiency and quality, with the unsurpassed memory efficiency of DeepSpeed under our feet we are now excited to scale up and explore the 1B+ parameter range to explore more quality-focused tradeoffs between compute and retrieval accuracy. The DRAMA models have recently shown some lovely scaling trends in the 0.1B to 0.3B to 1.0B parameter range, and we are excited to explore this frontier ourselves. For both the Arctic Embed team and the broader community, we look forward to continuing our efforts toward putting out great models for search — further empowering the community with the tools needed to fine-tune these models to meet their needs and opening up channels for deeper collaboration.

With that said, we are also excited to see the future endeavors of the open source community, as you all begin to build your own embedding models. To get started, explore the Arctic Embed codebase today and share your experiences in the Snowflake Community Forums AI Research and Development Community.



Contributors

Snowflake AI Research: Michael Wyatt, Jeff Rasley, Stas Bekman, Puxuan Yu, Gaurav Nuti
