Custom Judges
Custom LLM judges let you define complex and nuanced judging guidelines for GenAI applications using natural language.
While MLflow's built-in LLM judges offer excellent starting points for common quality dimensions, custom judges created with make_judge() give you full control over evaluation criteria. You can create these judges using either the UI or the SDK.
The fastest way to create LLM judges is through the Judge Builder UI - no code required. Navigate to your experiment's Judges tab to create and test judges visually. See the Create a Custom Judge page for details.
- UI: The Judge Builder UI requires MLflow >= 3.9.0.
- SDK: The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.
Prompts and template variables
To create a judge, you provide a prompt with natural language instructions on how to assess the quality of your agent. make_judge() accepts template variables to access the agent's inputs, outputs, expected outputs or behaviors, and even complete traces.
Your instructions must include at least one template variable, but you don't need to use all of them.
- {{ inputs }} - Input data provided to the agent
- {{ outputs }} - Output data generated by your agent
- {{ expectations }} - Ground truths or expected outcomes
- {{ trace }} - The complete execution trace of your agent
- {{ conversation }} - The conversation history (see the note below)
You can only use the reserved template variables shown above (inputs, outputs, expectations, conversation, trace). Custom variables like {{ question }} will cause validation errors. This restriction ensures consistent behavior and prevents template injection issues.
Note on the conversation variable: The {{ conversation }} template variable can be used together with {{ expectations }}, but it cannot be combined with {{ inputs }}, {{ outputs }}, or {{ trace }}. This is because the conversation history provides complete context, making individual turn data redundant.
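For example, here is a minimal sketch of a field-based judge that grades an answer against ground truth. The judge name, instruction wording, and feedback values are illustrative, not part of the MLflow API:

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Illustrative judge that compares the agent's answer to the expected answer.
# It references only reserved template variables: inputs, outputs, expectations.
answer_judge = make_judge(
    name="answer_correctness",
    instructions=(
        "Compare the agent's response in {{ outputs }} to the ground truth "
        "in {{ expectations }}, given the user request in {{ inputs }}. "
        "Return 'correct' if the response matches the ground truth, "
        "otherwise return 'incorrect'."
    ),
    feedback_value_type=Literal["correct", "incorrect"],
    model="openai:/gpt-4o-mini",
)
```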
Trace-based judges
Trace-based judges analyze execution traces to understand what happened during agent execution. They autonomously explore traces using Model Context Protocol (MCP) tools and can:
- Validate tool usage patterns
- Identify performance bottlenecks
- Investigate execution failures
- Verify multi-step workflows
The following example defines a judge that assesses tool calling correctness by analyzing traces:
from mlflow.genai.judges import make_judge
from typing import Literal

# Agent judge for tool calling correctness
tool_usage_judge = make_judge(
    name="tool_usage_validator",
    instructions=(
        "Analyze the {{ trace }} to verify correct tool usage.\n\n"
        "Check that the agent selected appropriate tools for the user's request "
        "and called them with correct parameters."
    ),
    feedback_value_type=Literal["correct", "incorrect"],
    model="openai:/gpt-5-mini",  # Required for trace-based judges
)
For trace-based judges to analyze the full trace, the model argument must be specified in make_judge().
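As a rough sketch of how such a judge might be invoked, assuming the judge object is callable with a trace keyword argument and that the trace is retrieved with mlflow.get_trace() (the trace ID below is hypothetical):

```python
import mlflow

# Retrieve a previously logged trace by its ID (hypothetical ID shown here).
trace = mlflow.get_trace("tr-1234567890abcdef")

# Invoke the judge on the trace; it returns a Feedback containing the judged
# value ("correct" or "incorrect") and a rationale.
feedback = tool_usage_judge(trace=trace)
print(feedback.value, feedback.rationale)
```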
For a complete tutorial, see Create a custom judge using make_judge().
Selecting Judge Models
By default, MLflow uses OpenAI's GPT-4o-mini model as the judge model. You can change the judge model by passing the model argument in the scorer definition. The model must be specified in the format <provider>:/<model-name>.
from mlflow.genai.scorers import Correctness

Correctness(model="openai:/gpt-4o-mini")
Correctness(model="anthropic:/claude-4-opus")
Correctness(model="google:/gemini-2.0-flash")
For detailed information on supported models, AI Gateway configuration, and guidance on choosing the right model for your use case, see Supported Judge Models.
Best practices for writing judge instructions
Be specific about expected output format. Your instructions should clearly specify what format the judge should return, as shown in the sketch after this list:
- Categorical responses: List specific values (for example, 'fully_resolved', 'partially_resolved', 'needs_follow_up')
- Boolean responses: Explicitly state the judge should return true or false
- Numeric scores: Specify the scoring range and what each score means
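For instance, here is a sketch of two judges whose instructions match their declared return format. The judge names and instruction wording are illustrative, and the numeric variant assumes feedback_value_type also accepts built-in types such as int:

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Categorical: the instructions list the exact values the judge may return.
resolution_judge = make_judge(
    name="resolution_status",
    instructions=(
        "Review {{ inputs }} and {{ outputs }} and classify the outcome as one of "
        "'fully_resolved', 'partially_resolved', or 'needs_follow_up'."
    ),
    feedback_value_type=Literal[
        "fully_resolved", "partially_resolved", "needs_follow_up"
    ],
)

# Numeric: the instructions state the scoring range and what each end means.
# Assumes feedback_value_type accepts int for numeric scores.
clarity_judge = make_judge(
    name="clarity_score",
    instructions=(
        "Rate the clarity of {{ outputs }} on a scale of 1 to 5, where 1 is "
        "incomprehensible and 5 is perfectly clear. Return only the integer score."
    ),
    feedback_value_type=int,
)
```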
Break down complex evaluations. For complex evaluation tasks, structure your instructions into clear sections, as shown in the sketch after this list:
- What to evaluate
- What information to examine
- How to make the judgment
- What format to return
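A sketch of one way to lay out these sections inside the instructions (the judge name, section wording, and feedback values are illustrative):

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Instructions broken into the four sections described above.
groundedness_judge = make_judge(
    name="groundedness_check",
    instructions=(
        "What to evaluate: whether the response in {{ outputs }} is grounded in "
        "the retrieved context provided in {{ inputs }}.\n\n"
        "What to examine: every factual claim made in the response.\n\n"
        "How to judge: a claim is grounded only if it is directly supported by "
        "the context; ignore stylistic issues.\n\n"
        "What to return: 'grounded' if all claims are supported, otherwise 'ungrounded'."
    ),
    feedback_value_type=Literal["grounded", "ungrounded"],
)
```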
Align judges with human experts
The base judge is a starting point. As you gather expert feedback on your application's outputs, you can align your LLM judges with that feedback to further improve their accuracy. See Align judges with humans.
Next steps
Create a custom judge
Get a hands-on tutorial that demonstrates both standard and trace-based judges.
Collect Human Feedback
Learn how to collect human feedback for evaluation.
Aligning Judges with Human Feedback
Learn how to align your scorer with human feedback.