Agent-as-a-Judge: Autonomous Trace Analysis

Agent-as-a-Judge takes a different approach to LLM evaluation. Instead of assessing only inputs and outputs, these judges act as autonomous agents equipped with tools that let them investigate your application's execution in depth.

What Is Agent-as-a-Judge in MLflow?

When you use the {{ trace }} template variable in your judge instructions, MLflow transforms your judge from a passive evaluator into an active investigator. The judge gains access to MCP (Model Context Protocol) tools that interface with MLflow's APIs, enabling it to:

Explore Execution Flows

Navigate through traces, fetching specific spans as needed to understand the application's behavior.

Analyze Performance

Examine timing, latency, and resource usage across different components of your application.

Detect Patterns

Identify bottlenecks, redundancies, circular logic, and inefficient execution paths.

Verify Behavior

Check tool usage, error handling, retry logic, and compliance with expected patterns.

Field-Based vs Agent-as-a-Judge Evaluation

Which approach to use depends on where you are in your development lifecycle:

| Aspect | Agent-as-a-Judge | Field-Based Judges (LLM-as-a-Judge) |
|---|---|---|
| Development stage | Early development, iteration, and refinement | Near-production validation and production monitoring |
| Primary purpose | Investigation and debugging | Quality assurance and monitoring |
| Ease of setup | Simple: just describe what to investigate | Requires careful prompt engineering and refinement |
| What they evaluate | Complete execution traces | Specific inputs and outputs |
| Focus | Complete end-to-end trajectory | Specific fields in the data |
| Template variables | {{ trace }} | {{ inputs }}, {{ outputs }}, {{ expectations }} |
| When to use | Getting started with a new application; revising and refining your agent; identifying failure patterns; understanding unexpected behavior | Final validation before deployment; production monitoring; regression testing; meeting specific quality expectations |
| Capabilities | Deep investigation of execution flow; root cause analysis; performance bottleneck detection; tool usage patterns | Fast pass/fail checks; quality scoring; output validation; compliance verification |
| Performance | Slower (explores trace in detail) | Fast execution |
| Cost | Higher (more context and tool usage) | Lower (less context) |

When to Use Each Approach

Use Agent-as-a-Judge During Development

Agent-as-a-Judge is your investigative tool during the development and refinement phases. It is also easier to get started with, because you simply describe what you want to investigate rather than carefully engineering a prompt:

  • Getting started: When building a new agent or application and need to understand its behavior
  • Rapid iteration: During development cycles when you're making frequent changes
  • Debugging: When failures occur and you need to understand root causes
  • Optimization: When identifying performance bottlenecks or inefficient patterns
  • Refinement: When improving agent reasoning and decision-making logic

The higher cost and slower execution are justified by the deep insights you gain during these critical development phases, and the quick setup time means you can start evaluating immediately.

Switch to Field-Based Judges for Production

Field-Based Judges become essential as you approach production:

  • Pre-deployment validation: Ensure outputs meet quality standards
  • Regression testing: Verify that changes don't break existing functionality
  • Production monitoring: Continuously assess output quality at scale
  • SLA compliance: Verify responses meet defined expectations
  • A/B testing: Compare different model versions based on output quality

The fast execution and lower cost make field-based judges ideal for high-volume production evaluation.
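For comparison, a field-based judge uses the {{ inputs }} and {{ outputs }} template variables instead of {{ trace }}. The sketch below is a minimal, illustrative example: the judge name, instruction text, and sample data are placeholders, and it assumes the judge accepts keyword arguments matching its template variables, just as the trace-based judge later on this page is called with trace=.

from mlflow.genai.judges import make_judge

# A minimal field-based judge: it sees only the inputs and outputs you pass in,
# not the full execution trace.
quality_judge = make_judge(
    name="answer_quality",
    instructions=(
        "Evaluate whether the response in {{ outputs }} correctly and completely "
        "answers the question in {{ inputs }}.\n\n"
        "Rate as: 'pass' or 'fail'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

feedback = quality_judge(
    inputs={"question": "What is MLflow Tracing?"},
    outputs={"response": "MLflow Tracing records your app's execution as spans."},
)
print(feedback.value, feedback.rationale)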

Creating an Agent-as-a-Judge

To create an Agent-as-a-Judge, simply include {{ trace }} in your instructions:

from mlflow.genai.judges import make_judge
import mlflow
import time

performance_judge = make_judge(
    name="performance_analyzer",
    instructions=(
        "Analyze the {{ trace }} for performance issues.\n\n"
        "Check for:\n"
        "- Operations taking longer than 2 seconds\n"
        "- Redundant API calls or database queries\n"
        "- Inefficient data processing patterns\n"
        "- Proper use of caching mechanisms\n\n"
        "Rate as: 'optimal', 'acceptable', or 'needs_improvement'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)


@mlflow.trace
def slow_data_processor(query: str):
    """Example application with performance bottlenecks."""
    with mlflow.start_span("fetch_data") as span:
        # Simulated slow database query: 2.5 s exceeds the judge's 2-second threshold
        time.sleep(2.5)
        span.set_inputs({"query": query})
        span.set_outputs({"data": ["item1", "item2", "item3"]})

    with mlflow.start_span("process_data") as span:
        # Three redundant API calls that could be batched or cached
        for i in range(3):
            with mlflow.start_span(f"redundant_api_call_{i}"):
                time.sleep(0.5)
        span.set_outputs({"processed": "results"})

    return "Processing complete"


result = slow_data_processor("SELECT * FROM users")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = performance_judge(trace=trace)

print(f"Performance Rating: {feedback.value}")
print(f"Analysis: {feedback.rationale}")

This example creates an Agent-as-a-Judge that analyzes performance issues, then runs it against a sample application with intentional bottlenecks:

  • A database query taking 2.5 seconds (exceeding the 2-second threshold)
  • Three redundant API calls that could be optimized

The judge will output something like:

Performance Rating: needs_improvement
Analysis: Found critical performance issues:
1. The 'fetch_data' span took 2.5 seconds, exceeding the 2-second threshold
2. Detected 3 redundant API calls (redundant_api_call_0, redundant_api_call_1,
   redundant_api_call_2) that appear to be duplicate operations
3. Total execution time of 4 seconds could be optimized by parallelizing
   the redundant operations or implementing caching
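Because the judge returns a structured feedback object, you can act on the rating programmatically, for example as a lightweight gate in a development check. A minimal sketch, reusing the performance_judge and trace from above; the rating strings are the ones requested in the judge's instructions:

# Fail fast during development if the judge flags the trace.
feedback = performance_judge(trace=trace)
if feedback.value == "needs_improvement":
    print("Performance issue detected:")
    print(feedback.rationale)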

Viewing Agent Tool Calls with Debug Logs

To see the actual MCP tool calls that the Agent-as-a-Judge makes while analyzing your trace, enable debug logging:

import logging

# Enable debug logging to see agent tool calls
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("mlflow.genai.judges")
logger.setLevel(logging.DEBUG)

# Now when you run the judge, you'll see detailed tool usage
feedback = performance_judge(trace=trace)

With debug logging enabled, you'll see output like:

DEBUG:mlflow.genai.judges:Calling tool: GetTraceInfo
DEBUG:mlflow.genai.judges:Tool response: {"trace_id": "abc123", "duration_ms": 4000, ...}
DEBUG:mlflow.genai.judges:Calling tool: ListSpans
DEBUG:mlflow.genai.judges:Tool response: [{"span_id": "def456", "name": "fetch_data", ...}]
DEBUG:mlflow.genai.judges:Calling tool: GetSpan with span_id=def456
DEBUG:mlflow.genai.judges:Tool response: {"duration_ms": 2500, "inputs": {"query": "SELECT * FROM users"}, ...}

This visibility helps you understand:

  • Which tools the judge uses to investigate your traces
  • What information it extracts at each step
  • How it arrives at its final assessment

When invoked with a trace, this judge will:

  1. Explore the trace structure - Navigate through all spans to understand the execution flow
  2. Analyze timing data - Calculate durations and identify operations exceeding thresholds
  3. Detect patterns - Identify redundant operations like the repeated API calls
  4. Provide actionable feedback - Return specific recommendations for optimization

Key Capabilities

Agent-as-a-Judge can access:

  • Span hierarchy: Understanding parent-child relationships
  • Timing data: Start times, end times, and durations
  • Input/output data: What each component received and produced
  • Attributes: Custom metadata and tags
  • Error information: Exceptions, stack traces, and error messages
  • Tool calls: Which tools were used and their results
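The same information is available programmatically through MLflow's Python trace API, which is useful for spot-checking what the judge sees. The following is a minimal sketch; exact span attribute names (such as start_time_ns and end_time_ns) may vary slightly across MLflow versions:

import mlflow

trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# Walk the span tree much like the judge's tools do: names, timing,
# parent-child links, and recorded inputs/outputs.
for span in trace.data.spans:
    duration_ms = (span.end_time_ns - span.start_time_ns) / 1e6
    print(f"{span.name}: {duration_ms:.0f} ms, parent={span.parent_id}")
    print(f"  inputs={span.inputs}, outputs={span.outputs}")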

Next Steps