Answer and Context Relevance Judges
MLflow provides two built-in LLM judges to assess relevance in your GenAI applications. These judges help diagnose quality issues: if the retrieved context isn't relevant, the generation step cannot produce a helpful response.
- RelevanceToQuery: Evaluates whether your app's response directly addresses the user's input
- RetrievalRelevance: Evaluates whether each document returned by your app's retriever(s) is relevant to the input request
Prerequisites for running the examples
- Install MLflow and required packages:

  ```bash
  pip install --upgrade mlflow
  ```

- Create an MLflow experiment by following the setup your environment quickstart (see the sketch after this list).

- (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
  ```python
  import mlflow
  import os
  import openai

  # Ensure your OPENAI_API_KEY is set in your environment
  # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

  # Enable auto-tracing for OpenAI
  mlflow.openai.autolog()

  # Create an OpenAI client
  client = openai.OpenAI()

  # Select an LLM
  model_name = "gpt-4o-mini"
  ```
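If you have not yet set up an experiment, a minimal sketch of that step looks like this (the experiment name and the commented-out tracking URI are illustrative; adjust them for your environment):

```python
import mlflow

# Point MLflow at your tracking server if it is not the default
# (illustrative URI; adjust for your setup)
# mlflow.set_tracking_uri("http://localhost:5000")

# Creates the experiment if it does not exist and makes it active
# for subsequent traces and evaluation runs
mlflow.set_experiment("relevance-judges-quickstart")
```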
Usage Examples
RelevanceToQuery Judge
This judge evaluates whether your app's response directly addresses the user's input without deviating into unrelated topics.
You can invoke the judge directly with a single input for quick testing, or pass it to mlflow.genai.evaluate to run a full evaluation over a dataset.
Requirements:
- Trace requirements: `inputs` and `outputs` must be on the Trace's root span
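When you evaluate traces instead of passing inputs and outputs explicitly, the simplest way to satisfy this requirement is to trace your app's entry point: with the @mlflow.trace decorator, the function's arguments and return value are recorded as the root span's inputs and outputs. A minimal sketch (the app function and model name are illustrative):

```python
import mlflow
import openai

client = openai.OpenAI()

@mlflow.trace  # the decorated function becomes the trace's root span
def answer_question(question: str) -> str:
    # Arguments are captured as the root span's inputs,
    # and the return value as its outputs.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What is the capital of France?")
```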
Invoke directly:

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

assessment = RelevanceToQuery(name="my_relevance_to_query")(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
```
Invoke with evaluate():

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[RelevanceToQuery()])
```
RetrievalRelevance Judge
This judge evaluates whether each document returned by your app's retriever(s) is relevant to the input request. It evaluates each retriever span in your trace separately and returns a separate Feedback object for each one.
Requirements:
- Trace requirements: The MLflow Trace must contain at least one span with `span_type` set to `RETRIEVER`
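If your retriever is a plain Python function, one way to produce such a span (a sketch; the function body and hard-coded documents are stand-ins for a real vector-store lookup) is to set the span type when tracing it:

```python
import mlflow
from mlflow.entities import SpanType

@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_docs(query: str) -> list[str]:
    # Illustrative stand-in for a real retrieval step; the returned
    # documents become the outputs of this RETRIEVER span.
    return [
        "Paris is the capital of France.",
        "France is a country in Western Europe.",
    ]
```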
Invoke directly:

```python
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess whether each retrieved document is relevant
feedbacks = RetrievalRelevance()(trace=trace)
print(feedbacks)
```
Invoke with evaluate():

```python
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

# Evaluate traces from previous runs
results = mlflow.genai.evaluate(
    data=traces,  # DataFrame or list containing trace data
    scorers=[RetrievalRelevance()],
)
```
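The traces variable above is assumed to hold previously logged traces. One way to collect them (a sketch, assuming your traced app already ran in the active experiment) is mlflow.search_traces, which returns a pandas DataFrame:

```python
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

# Fetch recent traces from the active experiment
traces = mlflow.search_traces(max_results=20)

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalRelevance()],
)
```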
For a complete RAG application example with these judges, see the RAG Evaluation guide.
Select the LLM that powers the judge
You can change the judge model by using the `model` argument in the judge definition. The model must be specified in the format `<provider>:/<model-name>`, where `<provider>` is a LiteLLM-compatible model provider.
For a list of supported models, see selecting judge models.
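For example, to power either judge with an OpenAI-hosted model (a sketch; the model name is illustrative and assumes OPENAI_API_KEY is set):

```python
from mlflow.genai.scorers import RelevanceToQuery, RetrievalRelevance

# <provider>:/<model-name>, where the provider is LiteLLM-compatible
relevance_judge = RelevanceToQuery(model="openai:/gpt-4o-mini")
retrieval_judge = RetrievalRelevance(model="openai:/gpt-4o-mini")
```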
Interpret results
The judge returns a Feedback object containing:
- `value`: "yes" if the response or retrieved context is relevant, "no" if not
- `rationale`: Explanation of why the judge found it relevant or irrelevant
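For example, when invoking RelevanceToQuery directly as shown above, you can read these fields off the returned object (a sketch):

```python
from mlflow.genai.scorers import RelevanceToQuery

feedback = RelevanceToQuery()(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)

print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # the judge's explanation
```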