Create a custom judge using make_judge()

Custom judges are LLM-based judges that evaluate your GenAI agents against specific quality criteria. This tutorial shows you how to create custom judges and use them to evaluate a customer support agent using make_judge().

You will:

  1. Create a sample agent to evaluate
  2. Define three custom judges to evaluate different criteria
  3. Build an evaluation dataset with test cases
  4. Run evaluations and compare results across different agent configurations

Step 1: Create an agent to evaluate

Create a GenAI agent that responds to customer support questions. The agent has a (fake) knob that controls the system prompt so you can easily compare the judge's outputs between "good" and "bad" conversations.

  1. Initialize an OpenAI client to connect to OpenAI-hosted models using the native OpenAI SDK, and select a model from the available OpenAI models:
python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Create an OpenAI client
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
  2. Define a customer support agent:
python
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# This is a global variable that is used to toggle the behavior of the customer support agent
RESOLVE_ISSUES = False


@mlflow.trace(span_type="TOOL", name="get_product_price")
def get_product_price(product_name: str) -> str:
    """Mock tool to get product pricing."""
    return f"${45.99}"


@mlflow.trace(span_type="TOOL", name="check_return_policy")
def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool to check return policy."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):
    # We use this toggle to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue. I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    # Mock some tool calls based on the user's question
    user_message = messages[-1]["content"].lower()
    tool_results = []

    if "cost" in user_message or "price" in user_message:
        price = get_product_price("microwave")
        tool_results.append(f"Price: {price}")

    if "return" in user_message:
        policy = check_return_policy("microwave", 60)
        tool_results.append(f"Return policy: {policy}")

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
        },
        *messages,
    ]

    if tool_results:
        messages_for_llm.append(
            {"role": "system", "content": f"Tool results: {', '.join(tool_results)}"}
        )

    # Call LLM to generate a response
    output = client.chat.completions.create(
        model=model_name,
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }

Step 2: Define custom judges

Define three custom judges:

  • A judge that evaluates issue resolution using inputs and outputs.
  • A judge that checks expected behaviors.
  • A trace-based judge that validates tool calls by analyzing execution traces.

You can define these judges in code with make_judge() (a sketch follows the UI steps below), or use the visual Judge Builder in the MLflow UI to create custom LLM judges without writing code.

  1. Install and start MLflow:
bash
pip install 'mlflow[genai]'
mlflow server
  2. Navigate to your experiment and select the Judges tab, then click New LLM judge.
Judges Tab
  3. Select scope: Choose what you want the judge to evaluate:

    • Traces: Evaluate individual traces for quality and correctness
    • Sessions: Evaluate entire multi-turn conversations for conversation quality and outcomes
  4. Configure the judge:

    • LLM judge: Select a built-in judge or "Custom judge" to create your own. Selecting a built-in judge pre-populates the instructions, which you can then modify to customize the evaluation criteria.
    • Name: A unique identifier for your judge
    • Instructions: Define your evaluation criteria using template variables. Use the Add variable button to insert variables into your prompt.
    • Output type: Select the return type
    • Model: Select an endpoint from the dropdown (recommended) or click "enter model manually" to access a model directly without AI Gateway. Endpoints are configured through AI Gateway, which centralizes API key management; judges that use direct model access require local API keys and cannot be run directly from the UI. See Supported Models for details.
Judge Builder Dialog
  5. Test your judge (optional): Click the trace selector dropdown, choose Select traces to pick specific traces, then click Run judge to preview the evaluation result.
Test Judge Output
  6. Schedule automatic evaluation (optional):

    • Automatically evaluate future traces: Enable to run this judge on new traces automatically
    • Sample rate: Percentage of traces to evaluate (0-100%)
    • Filter string: Only evaluate traces matching this filter (see the filter string syntax documentation)
  7. Click Create judge to save your new LLM judge.
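
The judges used in Step 4 can also be defined in code with make_judge(). The sketch below is illustrative rather than definitive: it assumes the make_judge() signature shown in the Advanced Examples section (name, instructions with {{ inputs }}, {{ outputs }}, {{ expectations }}, and {{ trace }} template variables, feedback_value_type, and model), and the instruction text and model URI are placeholders you should adapt to your own criteria and provider.

python
from typing import Literal

from mlflow.genai.judges import make_judge

# Example model URI; swap in any supported provider/model
judge_model = "openai:/gpt-4o-mini"

# Judge 1: rates issue resolution from the request/response pair
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions=(
        "Evaluate whether the agent's response in {{ outputs }} resolved the "
        "customer's issue described in {{ inputs }}.\n\n"
        "Rate the conversation as 'fully_resolved', 'partially_resolved', or "
        "'needs_follow_up'."
    ),
    feedback_value_type=Literal["fully_resolved", "partially_resolved", "needs_follow_up"],
    model=judge_model,
)

# Judge 2: checks the response against the per-example expectations
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions=(
        "Given the request in {{ inputs }} and the response in {{ outputs }}, "
        "check whether the response exhibits the behaviors listed in {{ expectations }}.\n\n"
        "Rate as 'meets_expectations', 'partially_meets', or 'does_not_meet'."
    ),
    feedback_value_type=Literal["meets_expectations", "partially_meets", "does_not_meet"],
    model=judge_model,
)

# Judge 3: trace-based judge that inspects spans to validate tool usage
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions=(
        "Analyze the execution {{ trace }} and determine whether the agent called "
        "the appropriate tools for the user's request. "
        "Return True if the right tools were called, False otherwise."
    ),
    feedback_value_type=bool,
    model=judge_model,
)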

Step 3: Create a sample evaluation dataset

Each inputs entry is passed to the agent by mlflow.genai.evaluate(). You can optionally include expectations so the judges can check the response against expected behaviors.

python
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]

Step 4: Evaluate your agent using the judges

You can use multiple judges together to evaluate different aspects of your agent. Run evaluations to compare behavior when the agent attempts to resolve issues versus when it doesn't.

python
import mlflow

# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,  # Checks inputs/outputs
        expected_behaviors_judge,  # Checks expected behaviors
        tool_call_judge,  # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge rates the agent:

  • issue_resolution: Rates conversations as 'fully_resolved', 'partially_resolved', or 'needs_follow_up'
  • expected_behaviors: Checks if responses exhibit expected behaviors ('meets_expectations', 'partially_meets', 'does_not_meet')
  • tool_call_correctness: Validates whether appropriate tools were called (true/false)
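
To compare the two configurations, inspect the aggregated metrics on each result and in the MLflow UI. A minimal sketch, assuming the object returned by mlflow.genai.evaluate() exposes metrics and run_id attributes (check the API reference for your MLflow version):

python
# Aggregate judge scores for each configuration (attribute names assumed; see note above)
print("Agent instructed NOT to resolve issues:", result_unresolved.metrics)
print("Agent instructed to resolve issues:", result_resolved.metrics)

# Each evaluation is logged as an MLflow run; open the corresponding runs in the
# MLflow UI to review per-row judge feedback and rationales.
print("Run IDs:", result_unresolved.run_id, result_resolved.run_id)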

Advanced Examples

The following example defines a trace-based judge that analyzes tool usage patterns across an entire execution trace and suggests optimizations:

python
from typing import Literal

from mlflow.genai.judges import make_judge

tool_optimization_judge = make_judge(
    name="tool_optimizer",
    instructions=(
        "Analyze tool usage patterns in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Unnecessary tool calls (could be answered without tools)\n"
        "2. Wrong tool selection (better tool available)\n"
        "3. Inefficient sequencing (could parallelize or reorder)\n"
        "4. Missing tool usage (should have used a tool)\n\n"
        "Provide specific optimization suggestions.\n"
        "Rate efficiency as: 'optimal', 'good', 'suboptimal', or 'poor'"
    ),
    feedback_value_type=Literal["optimal", "good", "suboptimal", "poor"],
    model="anthropic:/claude-opus-4-1-20250805",
)
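
Trace-based judges can also be invoked directly on a single trace, outside of mlflow.genai.evaluate(). A brief usage sketch, assuming you have the ID of a previously logged trace (the trace ID below is a placeholder) and that the returned Feedback object exposes value and rationale:

python
# Fetch a previously logged trace by ID (placeholder ID shown)
trace = mlflow.get_trace("<your-trace-id>")

# Call the trace-based judge directly and inspect its feedback
feedback = tool_optimization_judge(trace=trace)
print(feedback.value)
print(feedback.rationale)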

Debugging Agent Judges

To see the actual MCP tool calls that the Agent-as-a-Judge makes while analyzing your trace, enable debug logging:

python
import logging

# Enable debug logging to see agent tool calls
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("mlflow.genai.judges")
logger.setLevel(logging.DEBUG)

# Now when you run the judge, you'll see detailed tool usage
feedback = performance_judge(trace=trace)

With debug logging enabled, you'll see output like:

text
DEBUG:mlflow.genai.judges:Calling tool: GetTraceInfo
DEBUG:mlflow.genai.judges:Tool response: {"trace_id": "abc123", "duration_ms": 4000, ...}
DEBUG:mlflow.genai.judges:Calling tool: ListSpans
DEBUG:mlflow.genai.judges:Tool response: [{"span_id": "def456", "name": "fetch_data", ...}]
DEBUG:mlflow.genai.judges:Calling tool: GetSpan with span_id=def456
DEBUG:mlflow.genai.judges:Tool response: {"duration_ms": 2500, "inputs": {"query": "SELECT * FROM users"}, ...}

Next steps