# RAG Evaluation with Built-in Judges
Retrieval-Augmented Generation (RAG) systems combine retrieval and generation to provide contextually relevant responses. Evaluating RAG applications requires assessing both the retrieval quality (are the right documents retrieved?) and the generation quality (is the response grounded in those documents?).
MLflow provides built-in judges designed specifically for evaluating RAG applications. These judges help you identify common issues in your RAG pipeline:
- Poor retrieval: Documents aren't relevant to the query
- Insufficient context: Retrieved documents don't contain all necessary information
- Hallucinations: Generated responses include information not present in the retrieved context
## Available RAG Judges
| Judge | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| RetrievalRelevance | Are retrieved documents relevant to the user's request? | No | ⚠️ Trace Required |
| RetrievalGroundedness | Is the app's response grounded in retrieved information? | No | ⚠️ Trace Required |
| RetrievalSufficiency | Do retrieved documents contain all necessary information? | Yes | ⚠️ Trace Required |
All RAG judges require MLflow Traces with at least one span marked as `span_type="RETRIEVER"`. Use the `@mlflow.trace` decorator with `span_type="RETRIEVER"` on your retrieval functions.
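For example, a minimal traced retriever looks like the following sketch (the retrieval logic is a placeholder; the complete example later in this page shows a full pipeline):

```python
import mlflow
from mlflow.entities import Document
from typing import List


@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Placeholder retrieval - a real application would query a vector store here
    return [
        Document(
            id="doc_1",
            page_content="MLflow is an open-source platform for managing the ML lifecycle.",
            metadata={"source": "mlflow_overview.txt"},
        )
    ]
```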
## Prerequisites for running the examples

- Install MLflow and required packages:

  ```bash
  pip install --upgrade mlflow
  ```

- Create an MLflow experiment by following the setup your environment quickstart.

- (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

  ```python
  import mlflow
  import os
  import openai

  # Ensure your OPENAI_API_KEY is set in your environment
  # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

  # Enable auto-tracing for OpenAI
  mlflow.openai.autolog()

  # Create an OpenAI client
  client = openai.OpenAI()

  # Select an LLM
  model_name = "gpt-4o-mini"
  ```
## Complete RAG Example
Here's a complete example showing how to build a RAG application and evaluate it with the judges:
```python
import mlflow
import openai
from mlflow.genai.scorers import (
    RetrievalRelevance,
    RetrievalGroundedness,
    RetrievalSufficiency,
)
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

client = openai.OpenAI()
model_name = "gpt-4o-mini"  # LLM selected in the prerequisites

# Sample knowledge base - in practice, this would be a vector database
KNOWLEDGE_BASE = {
    "mlflow": [
        Document(
            id="doc_1",
            page_content="MLflow is an open-source platform for managing the ML lifecycle. It has four main components: Tracking, Projects, Models, and Registry.",
            metadata={"source": "mlflow_overview.txt"},
        ),
    ],
}


# Define a retriever function with the proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - in practice, this would query a vector database
    # using embeddings to find semantically similar documents
    query_lower = query.lower()
    if "mlflow" in query_lower:
        return KNOWLEDGE_BASE["mlflow"]
    else:
        return []  # No relevant documents found


# Define your RAG agent
@mlflow.trace
def rag_agent(query: str):
    docs = retrieve_docs(query)
    if not docs:
        return {"response": "I don't have information to answer that question."}

    context = "\n\n".join([doc.page_content for doc in docs])

    # Pass retrieved context to your LLM
    messages = [
        {
            "role": "system",
            "content": f"Answer the user's question based only on the following context:\n\n{context}",
        },
        {"role": "user", "content": query},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
    )
    return {"response": response.choices[0].message.content}


# Create evaluation dataset with ground truth expectations
eval_dataset = [
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "The components are Tracking, Projects, Models, and Registry",
            ]
        },
    }
]

# Run evaluation with multiple RAG judges
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_agent,
    scorers=[
        RetrievalRelevance(model="openai:/gpt-4o-mini"),
        RetrievalGroundedness(model="openai:/gpt-4o-mini"),
        RetrievalSufficiency(model="openai:/gpt-4o-mini"),
    ],
)
```
## Understanding the Results
Each RAG judge evaluates retriever spans separately:
- RetrievalRelevance: Returns feedback for each retrieved document, indicating whether it's relevant to the query
- RetrievalGroundedness: Checks if the generated response is factually supported by the retrieved context
- RetrievalSufficiency: Verifies that the retrieved documents contain all of the information needed to answer the query, as defined by the expected facts
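Beyond browsing the feedback in the MLflow UI, you can inspect the results programmatically. The sketch below assumes the result object exposes `metrics` and `run_id`, and that `mlflow.search_traces` returns an `assessments` column, which is the case in recent MLflow 3.x releases; check the API reference for your version:

```python
# Aggregated scores per judge for the evaluation run
print(eval_results.metrics)

# Per-trace feedback: each judge attaches assessments (value plus rationale)
# to the traces generated during evaluation
traces = mlflow.search_traces(run_id=eval_results.run_id)
for assessments in traces["assessments"]:
    for assessment in assessments:
        print(assessment)
```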
## Select the LLM that powers the judge
You can change the judge model by using the `model` argument in the judge definition. The model must be specified in the format `<provider>:/<model-name>`, where `<provider>` is a LiteLLM-compatible model provider.
For a list of supported models, see selecting judge models.
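For example, pointing a judge at another LiteLLM-compatible provider is just a change to the `model` string. In the sketch below, the Anthropic model name is illustrative, and the corresponding provider API key is assumed to be set in your environment:

```python
from mlflow.genai.scorers import RetrievalGroundedness

# OpenAI-hosted judge, as used in the example above
groundedness = RetrievalGroundedness(model="openai:/gpt-4o-mini")

# Same judge powered by a different provider (illustrative model name)
groundedness_anthropic = RetrievalGroundedness(model="anthropic:/claude-3-7-sonnet-latest")
```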