RetrievalGroundedness judge

The RetrievalGroundedness judge assesses whether your application's response is factually supported by the provided context (whether retrieved by a RAG system or generated by a tool call), helping you detect hallucinations: statements not backed by that context.

This built-in LLM judge is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

Prerequisites for running the examples

  1. Install MLflow and required packages

    bash
    pip install --upgrade mlflow
  2. Create an MLflow experiment by following the Set up your environment quickstart.

  3. (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    python
    import mlflow
    import os
    import openai

    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()

    # Create an OpenAI client
    client = openai.OpenAI()

    # Select an LLM
    model_name = "gpt-4o-mini"

Usage examples

The RetrievalGroundedness judge can be invoked directly for single-trace assessment or used with MLflow's evaluation framework for batch evaluation; both patterns are shown below.

Trace requirements:

  • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  • inputs and outputs must be on the Trace's root span
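
If you don't yet have such a trace, here is a minimal sketch of an app that produces one, assuming the OpenAI client from the prerequisites. The retrieve_docs and answer_question functions, the in-memory corpus, and the document format are hypothetical stand-ins for a real retriever:

python
import mlflow
import openai
from mlflow.entities import SpanType

client = openai.OpenAI()

# Hypothetical in-memory corpus standing in for a real vector store
docs = [
    {
        "page_content": "MLflow Tracing captures each step of a GenAI application as a span.",
        "metadata": {"doc_uri": "docs/tracing"},
    }
]

# span_type=SpanType.RETRIEVER marks this span so the judge can find the retrieved context
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[dict]:
    return docs  # a real implementation would query a vector store here

# The root span records the question as inputs and the answer as outputs
@mlflow.trace
def answer_question(question: str) -> str:
    context = "\n".join(d["page_content"] for d in retrieve_docs(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

answer_question("What does MLflow Tracing capture?")
trace_id = mlflow.get_last_active_trace_id()

With a qualifying trace in hand, you can invoke the judge on it directly: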
python
from mlflow.genai.scorers import RetrievalGroundedness
import mlflow

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess if the response is grounded in the retrieved context
feedback = RetrievalGroundedness()(trace=trace)
print(feedback)
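
For batch evaluation, pass the judge as a scorer to mlflow.genai.evaluate. A minimal sketch, assuming the traces in your active experiment already contain retriever spans:

python
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

# Collect existing traces from the active experiment as a DataFrame
traces = mlflow.search_traces()

# Score every trace; results are logged to an MLflow evaluation run
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalGroundedness()],
)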
tip

For a complete RAG application example with these judges, see the RAG Evaluation guide.

Interpret results

The RetrievalGroundedness judge evaluates each retriever span in your trace separately, returning one Feedback object per span. Each Feedback object contains:

  • value: "yes" if the response is grounded in the retrieved context, "no" if it contains hallucinations
  • rationale: Detailed explanation identifying:
    • Which statements are supported by the context
    • Which statements lack support (hallucinations)
    • Specific quotes from context that support or contradict claims
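
For example, you can inspect these fields on the feedback returned above. This sketch assumes a trace with multiple retriever spans may yield a list of Feedback objects, per the behavior described above:

python
# Normalize to a list: traces with several retriever spans return one Feedback each
feedbacks = feedback if isinstance(feedback, list) else [feedback]

for fb in feedbacks:
    print(fb.value)      # "yes" or "no"
    print(fb.rationale)  # explanation of supported and unsupported statements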

Select the LLM that powers the judge

You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider.

For a list of supported models, see selecting judge models.
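
For example, to power the judge with an OpenAI-hosted model (assumes OPENAI_API_KEY is set in your environment; the model name shown is just one option):

python
from mlflow.genai.scorers import RetrievalGroundedness

# Use an OpenAI-hosted model as the judge instead of the default
grounded_judge = RetrievalGroundedness(model="openai:/gpt-4o-mini")
feedback = grounded_judge(trace=trace)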

Next steps