Custom Judges
Custom LLM judges let you define complex and nuanced judging guidelines for GenAI applications using natural language.
While MLflow's built-in LLM judges offer excellent starting points for common quality dimensions, custom judges created with make_judge() give you full control over evaluation criteria. You can create these judges using either the UI or the SDK.
The fastest way to create LLM judges is through the Judge Builder UI - no code required. Navigate to your experiment's Judges tab to create and test judges visually. See the Create a Custom Judge page for details.
- UI: The Judge Builder UI requires MLflow >= 3.9.0.
- SDK: The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.
Prompts and template variables
To create a judge, you provide a prompt with natural language instructions on how to assess the quality of your agent. make_judge() accepts template variables to access the agent's inputs, outputs, expected outputs or behaviors, and even complete traces.
Your instructions must include at least one template variable, but you don't need to use all of them.
- {{ inputs }} - Input data provided to the agent
- {{ outputs }} - Output data generated by your agent
- {{ expectations }} - Ground truths or expected outcomes
- {{ trace }} - The complete execution trace of your agent
- {{ conversation }} - The conversation history (see the note below)
You can only use the reserved template variables shown above (inputs, outputs, expectations, conversation, trace). Custom variables like {{ question }} will cause validation errors. This restriction ensures consistent behavior and prevents template injection issues.
Note on the conversation variable: The {{ conversation }} template variable can be used together with {{ expectations }}, but it cannot be combined with {{ inputs }}, {{ outputs }}, or {{ trace }}. This is because the conversation history provides complete context, making individual turn data redundant.
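For example, here is a minimal sketch of a field-based judge that grades an answer against ground truth. The judge name, instruction wording, and feedback values are illustrative, not part of the MLflow API:

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Illustrative judge that compares the agent's answer to the expected answer.
# It references only reserved template variables: inputs, outputs, expectations.
answer_judge = make_judge(
    name="answer_correctness",
    instructions=(
        "Compare the agent's response in {{ outputs }} to the ground truth "
        "in {{ expectations }}, given the user request in {{ inputs }}. "
        "Return 'correct' if the response matches the ground truth, "
        "otherwise return 'incorrect'."
    ),
    feedback_value_type=Literal["correct", "incorrect"],
    model="openai:/gpt-4o-mini",
)
```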
Trace-based judges
Trace-based judges analyze execution traces to understand what happened during agent execution. They autonomously explore traces using Model Context Protocol (MCP) tools and can:
- Validate tool usage patterns
- Identify performance bottlenecks
- Investigate execution failures
- Verify multi-step workflows
The following example defines a judge that assesses tool calling correctness by analyzing traces:
from mlflow.genai.judges import make_judge
from typing import Literal

# Agent judge for tool calling correctness
tool_usage_judge = make_judge(
    name="tool_usage_validator",
    instructions=(
        "Analyze the {{ trace }} to verify correct tool usage.\n\n"
        "Check that the agent selected appropriate tools for the user's request "
        "and called them with correct parameters."
    ),
    feedback_value_type=Literal["correct", "incorrect"],
    model="openai:/gpt-5-mini",  # Required for trace-based judges
)
For trace-based judges to analyze the full trace, the model argument must be specified in make_judge().
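As a rough sketch of how such a judge might be invoked, assuming the judge object is callable with a trace keyword argument and that the trace is retrieved with mlflow.get_trace() (the trace ID below is hypothetical):

```python
import mlflow

# Retrieve a previously logged trace by its ID (hypothetical ID shown here).
trace = mlflow.get_trace("tr-1234567890abcdef")

# Invoke the judge on the trace; it returns a Feedback containing the judged
# value ("correct" or "incorrect") and a rationale.
feedback = tool_usage_judge(trace=trace)
print(feedback.value, feedback.rationale)
```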
For a complete tutorial, see Create a custom judge using make_judge().
Selecting Judge Models
By default, MLflow uses OpenAI's GPT-4o-mini model as the judge model. You can change the judge model by passing the model argument in the scorer definition. The model must be specified in the format <provider>:/<model-name>.
from mlflow.genai.scorers import Correctness

Correctness(model="openai:/gpt-4o-mini")
Correctness(model="anthropic:/claude-4-opus")
Correctness(model="google:/gemini-2.0-flash")
For detailed information on supported models, AI Gateway configuration, and guidance on choosing the right model for your use case, see Supported Judge Models.
Best practices for writing judge instructions
Be specific about expected output format. Your instructions should clearly specify what format the judge should return, as shown in the sketch after this list:
- Categorical responses: List specific values (for example, 'fully_resolved', 'partially_resolved', 'needs_follow_up')
- Boolean responses: Explicitly state the judge should return true or false
- Numeric scores: Specify the scoring range and what each score means
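For instance, here is a sketch of two judges whose instructions match their declared return format. The judge names and instruction wording are illustrative, and the numeric variant assumes feedback_value_type also accepts built-in types such as int:

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Categorical: the instructions list the exact values the judge may return.
resolution_judge = make_judge(
    name="resolution_status",
    instructions=(
        "Review {{ inputs }} and {{ outputs }} and classify the outcome as one of "
        "'fully_resolved', 'partially_resolved', or 'needs_follow_up'."
    ),
    feedback_value_type=Literal[
        "fully_resolved", "partially_resolved", "needs_follow_up"
    ],
)

# Numeric: the instructions state the scoring range and what each end means.
# Assumes feedback_value_type accepts int for numeric scores.
clarity_judge = make_judge(
    name="clarity_score",
    instructions=(
        "Rate the clarity of {{ outputs }} on a scale of 1 to 5, where 1 is "
        "incomprehensible and 5 is perfectly clear. Return only the integer score."
    ),
    feedback_value_type=int,
)
```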
Break down complex evaluations. For complex evaluation tasks, structure your instructions into clear sections, as shown in the sketch after this list:
- What to evaluate
- What information to examine
- How to make the judgment
- What format to return
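A sketch of one way to lay out these sections inside the instructions (the judge name, section wording, and feedback values are illustrative):

```python
from mlflow.genai.judges import make_judge
from typing import Literal

# Instructions broken into the four sections described above.
groundedness_judge = make_judge(
    name="groundedness_check",
    instructions=(
        "What to evaluate: whether the response in {{ outputs }} is grounded in "
        "the retrieved context provided in {{ inputs }}.\n\n"
        "What to examine: every factual claim made in the response.\n\n"
        "How to judge: a claim is grounded only if it is directly supported by "
        "the context; ignore stylistic issues.\n\n"
        "What to return: 'grounded' if all claims are supported, otherwise 'ungrounded'."
    ),
    feedback_value_type=Literal["grounded", "ungrounded"],
)
```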
Align judges with human experts
The base judge is a starting point. As you gather expert feedback on your application's outputs, you can align your LLM judges with that feedback to further improve their accuracy. See Align judges with humans.
Next steps
Create a custom judge
Get a hands-on tutorial that demonstrates both standard and trace-based judges.
Collect Human Feedback
Learn how to collect human feedback for evaluation.
Aligning Judges with Human Feedback
Learn how to align your scorer with human feedback.