RetrievalGroundedness judge
The RetrievalGroundedness judge assesses whether your application's response is factually supported by the provided context (retrieved by a RAG system or returned by a tool call), helping you detect hallucinations: statements not backed by that context.
This built-in LLM judge is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.
Prerequisites for running the examples
- Install MLflow and required packages:

  ```bash
  pip install --upgrade mlflow
  ```

- Create an MLflow experiment by following the Set up your environment quickstart.

- (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models:
  ```python
  import mlflow
  import os
  import openai

  # Ensure your OPENAI_API_KEY is set in your environment
  # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

  # Enable auto-tracing for OpenAI
  mlflow.openai.autolog()

  # Create an OpenAI client
  client = openai.OpenAI()

  # Select an LLM
  model_name = "gpt-4o-mini"
  ```
Usage examples
The RetrievalGroundedness judge can be invoked directly for single-trace assessment or used with MLflow's evaluation framework for batch evaluation.
Trace requirements:

- The MLflow Trace must contain at least one span with `span_type` set to `RETRIEVER`
- `inputs` and `outputs` must be present on the Trace's root span
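To make these requirements concrete, here is a minimal sketch of a traced application the judge can evaluate. The function names (retrieve_docs, answer_question) and the returned document are illustrative, not part of the MLflow API:

```python
import mlflow
from mlflow.entities import SpanType

# Marking the span as RETRIEVER is what lets the judge find the context
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[str]:
    # A real app would query a vector store here
    return ["MLflow Tracing captures spans from GenAI applications."]

@mlflow.trace
def answer_question(query: str) -> str:
    docs = retrieve_docs(query)
    # A real app would call an LLM with `docs` as context
    return f"Based on the docs: {docs[0]}"

# The root span records the question as inputs and the answer as outputs
answer_question("What does MLflow Tracing do?")
```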
Invoke directly:

```python
from mlflow.genai.scorers import RetrievalGroundedness
import mlflow

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess whether the response is grounded in the retrieved context
feedback = RetrievalGroundedness()(trace=trace)
print(feedback)
```
Invoke with evaluate():

```python
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

# Collect traces from previous runs; search_traces returns a DataFrame of traces
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalGroundedness()],
)
```
For a complete RAG application example with these judges, see the RAG Evaluation guide.
Interpret results
The RetrievalGroundedness judge evaluates each retriever span in your trace separately, returning one Feedback object per span. Each Feedback object contains:

- value: "yes" if the response is grounded in the retrieved context, "no" if it contains hallucinations
- rationale: Detailed explanation identifying:
  - Which statements are supported by the context
  - Which statements lack support (hallucinations)
  - Specific quotes from the context that support or contradict claims
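As a rough sketch of how you might inspect these fields (assuming a trace ID from an earlier run, and allowing for a trace with multiple retriever spans to yield a list of Feedback objects):

```python
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

trace = mlflow.get_trace("<your-trace-id>")
result = RetrievalGroundedness()(trace=trace)

# Normalize to a list: one Feedback per retriever span in the trace
feedbacks = result if isinstance(result, list) else [result]
for fb in feedbacks:
    print(fb.value)      # "yes" or "no"
    print(fb.rationale)  # which statements are (un)supported, with quotes
```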
Select the LLM that powers the judge
You can change the judge model via the `model` argument when defining the judge. The model must be specified in the format `<provider>:/<model-name>`, where `<provider>` is a LiteLLM-compatible model provider.
For a list of supported models, see selecting judge models.
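For example, to power the judge with an OpenAI-hosted model (gpt-4o-mini here is just an example choice):

```python
from mlflow.genai.scorers import RetrievalGroundedness

# "openai" is the provider prefix; swap in any supported provider/model
judge = RetrievalGroundedness(model="openai:/gpt-4o-mini")
```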