RetrievalGroundedness judge

The RetrievalGroundedness judge assesses whether your application's response is factually supported by the provided context (whether retrieved by a RAG system or generated by a tool call), helping you detect hallucinations: statements not backed by that context.

This built-in LLM judge is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

Prerequisites for running the examples

  1. Install MLflow and required packages

    bash
    pip install --upgrade mlflow
  2. Create an MLflow experiment by following the Set up your environment quickstart.

  3. (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    python
    import mlflow
    import os
    import openai

    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()

    # Create an OpenAI client
    client = openai.OpenAI()

    # Select an LLM
    model_name = "gpt-4o-mini"

Usage examples

The RetrievalGroundedness judge can be invoked directly for single-trace assessment or used with MLflow's evaluation framework for batch evaluation; both patterns are shown below.

Trace requirements:

  • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  • inputs and outputs must be on the Trace's root span
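
If you don't yet have such a trace, here is a minimal sketch of an app that produces one, assuming the OpenAI client from the prerequisites. The retrieve_docs and answer_question functions, the in-memory corpus, and the document format are hypothetical stand-ins for a real retriever:

python
import mlflow
import openai
from mlflow.entities import SpanType

client = openai.OpenAI()

# Hypothetical in-memory corpus standing in for a real vector store
docs = [
    {
        "page_content": "MLflow Tracing captures each step of a GenAI application as a span.",
        "metadata": {"doc_uri": "docs/tracing"},
    }
]

# span_type=SpanType.RETRIEVER marks this span so the judge can find the retrieved context
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[dict]:
    return docs  # a real implementation would query a vector store here

# The root span records the question as inputs and the answer as outputs
@mlflow.trace
def answer_question(question: str) -> str:
    context = "\n".join(d["page_content"] for d in retrieve_docs(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

answer_question("What does MLflow Tracing capture?")
trace_id = mlflow.get_last_active_trace_id()

With a qualifying trace in hand, you can invoke the judge on it directly: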
python
from mlflow.genai.scorers import RetrievalGroundedness
import mlflow

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess if the response is grounded in the retrieved context
feedback = RetrievalGroundedness()(trace=trace)
print(feedback)
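
For batch evaluation, pass the judge as a scorer to mlflow.genai.evaluate. A minimal sketch, assuming the traces in your active experiment already contain retriever spans:

python
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

# Collect existing traces from the active experiment as a DataFrame
traces = mlflow.search_traces()

# Score every trace; results are logged to an MLflow evaluation run
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalGroundedness()],
)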
tip

For a complete RAG application example with these judges, see the RAG Evaluation guide.

Interpret results

The RetrievalGroundedness judge evaluates each retriever span in your trace separately, returning one Feedback object per span. Each Feedback object contains:

  • value: "yes" if the response is grounded in the retrieved context, "no" if it contains hallucinations
  • rationale: Detailed explanation identifying:
    • Which statements are supported by the context
    • Which statements lack support (hallucinations)
    • Specific quotes from context that support or contradict claims
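
For example, you can inspect these fields on the feedback returned above. This sketch assumes a trace with multiple retriever spans may yield a list of Feedback objects, per the behavior described above:

python
# Normalize to a list: traces with several retriever spans return one Feedback each
feedbacks = feedback if isinstance(feedback, list) else [feedback]

for fb in feedbacks:
    print(fb.value)      # "yes" or "no"
    print(fb.rationale)  # explanation of supported and unsupported statements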

Select the LLM that powers the judge

You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider.

For a list of supported models, see selecting judge models.
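
For example, to power the judge with an OpenAI-hosted model (assumes OPENAI_API_KEY is set in your environment; the model name shown is just one option):

python
from mlflow.genai.scorers import RetrievalGroundedness

# Use an OpenAI-hosted model as the judge instead of the default
grounded_judge = RetrievalGroundedness(model="openai:/gpt-4o-mini")
feedback = grounded_judge(trace=trace)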

Next steps