Judge Alignment: Teaching AI to Match Human Preferences

Transform Generic Judges into Domain Experts

Judge alignment is the process of refining LLM judges to match human evaluation standards. Through systematic learning from human feedback, judges evolve from generic evaluators to domain-specific experts that understand your unique quality criteria.

Why Alignment Matters

Even the most sophisticated LLMs need calibration to match your specific evaluation standards. What constitutes "good" customer service varies by industry. Medical accuracy requirements differ from general health advice. Alignment bridges this gap, teaching judges your specific quality standards through example.

Learn from Expert Feedback

Judges improve by learning from your domain experts' assessments, capturing nuanced quality criteria that generic prompts miss.

Consistent Standards at Scale

Once aligned, judges apply your exact quality standards consistently across millions of evaluations.

Continuous Improvement

As your standards evolve, judges can be re-aligned with new feedback, maintaining relevance over time.

Reduced Evaluation Errors

Aligned judges show a 30-50% reduction in false positives and false negatives compared to generic evaluation prompts.

How Judge Alignment Works

Alignment Lifecycle

Alignment is a continuous refinement loop:

  1. Create Initial Judge
  2. Collect Human Feedback
  3. Run Alignment
  4. Validate Accuracy
  5. Monitor & Iterate

Quick Start: Align Your First Judge

Critical Requirement for Alignment

For alignment to work, each trace must have BOTH judge assessments AND human feedback with the same assessment name. The alignment process learns by comparing judge assessments with human feedback on the same traces.

The assessment name must exactly match the judge name: if your judge is named "product_quality", both the judge's assessments and the human feedback must use the name "product_quality".

The order doesn't matter; humans can provide feedback before or after the judge evaluates.

Note: Alignment is currently supported only for field-based evaluation using {{ inputs }} and {{ outputs }} templates. Support for Agent-as-a-Judge evaluation ({{ trace }}) and expectations ({{ expectations }}) is not yet available.

Step 1: Setup and Generate Traces

First, create your judge and generate traces with initial assessments:

from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer
from mlflow.entities import AssessmentSource, AssessmentSourceType
import mlflow

# Create experiment and initial judge
experiment_id = mlflow.create_experiment("product-quality-alignment")
mlflow.set_experiment(experiment_id=experiment_id)

initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Generate traces from your application (minimum 10 required)
traces = []
for i in range(15):  # Generate 15 traces (more than minimum of 10)
    with mlflow.start_span(f"product_description_{i}") as span:
        # Your application logic
        query = f"Product query {i}"
        description = f"Product description for query {i}"
        span.set_inputs({"query": query})
        span.set_outputs({"description": description})
        traces.append(span.trace_id)

# Run the judge on these traces to get initial assessments
for trace_id in traces:
    trace = mlflow.get_trace(trace_id)

    # Extract inputs and outputs from the trace for field-based evaluation
    inputs = trace.data.spans[0].inputs  # Get inputs from trace
    outputs = trace.data.spans[0].outputs  # Get outputs from trace

    # Judge evaluates using field-based approach (inputs/outputs)
    judge_result = initial_judge(inputs=inputs, outputs=outputs)
    # Judge's assessment is automatically logged when called

Step 2: Collect Human Feedback

After running your judge on traces, you need to collect human feedback. You can either:

  • Use the MLflow UI (recommended): Review traces and add feedback through the intuitive interface
  • Log programmatically: If you already have ground truth labels

For detailed instructions on collecting feedback, see Collecting Feedback for Alignment below.

Step 3: Align and Register

After collecting feedback, align your judge and register it:

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces required)
if len(traces_for_alignment) >= 10:
    optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    aligned_judge = initial_judge.align(optimizer, traces_for_alignment)

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")

The SIMBA Alignment Optimizer

MLflow provides the SIMBA (Simplified Multi-Bootstrap Aggregation) optimizer for aligning judges with human feedback:

from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

# Create SIMBA optimizer
optimizer = SIMBAAlignmentOptimizer(
    model="anthropic:/claude-opus-4-1-20250805"  # Model used for optimization
)

# Requirements for alignment:
# - Minimum 10 traces with BOTH judge assessments and human feedback
# - Both assessments must use the same name (matching the judge name)
# - Order doesn't matter - humans can assess before or after judge
# - Mix of agreements and disagreements between judge and human recommended

aligned_judge = initial_judge.align(optimizer, traces_with_feedback)
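Before calling align(), it can help to confirm that each trace actually satisfies these requirements. The helper below is a rough sketch that reuses the traces_with_feedback list from the snippet above, assumes the judge is named "product_quality", and relies on the trace search_assessments API used elsewhere on this page:

def has_judge_and_human_feedback(trace, judge_name="product_quality"):
    """Return True if the trace has both a judge and a human assessment under judge_name."""
    feedbacks = [f for f in trace.search_assessments(type="feedback") if f.name == judge_name]
    has_human = any(f.source.source_type == "HUMAN" for f in feedbacks)
    has_judge = any(f.source.source_type != "HUMAN" for f in feedbacks)
    return has_human and has_judge


usable_traces = [t for t in traces_with_feedback if has_judge_and_human_feedback(t)]
print(f"{len(usable_traces)} of {len(traces_with_feedback)} traces are usable for alignment")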

Collecting Feedback for Alignment

The quality of alignment depends on the quality and quantity of feedback. Choose the approach that best fits your situation:

Feedback Collection Approaches

Using the MLflow UI (Recommended)

When to use: You don't have existing ground truth labels and need to collect human feedback.

The MLflow UI provides an intuitive interface for reviewing traces and adding feedback:

  1. Navigate to the Traces tab in your experiment
  2. Click on individual traces to review inputs, outputs, and any existing judge assessments
  3. Add feedback by clicking the "Add Feedback" button
  4. Select the assessment name that matches your judge name (e.g., "product_quality")
  5. Provide your rating according to your evaluation criteria

Tips for effective feedback collection:

  • If you're not a domain expert: Distribute traces among team members or domain experts for review
  • If you are the domain expert: Create a rubric or guidelines document to ensure consistency
  • For multiple reviewers: Organize feedback sessions where reviewers can work through batches together
  • For consistency: Document your evaluation criteria clearly before starting

The UI automatically logs feedback in the correct format for alignment.

MLflow UI Feedback Interface
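Logging Feedback Programmatically

When to use: You already have ground truth labels, for example from a previous labeling effort.

A minimal sketch, assuming your labels are stored in a dict keyed by trace ID and using mlflow.log_feedback together with the AssessmentSource imports from Step 1; the dict contents and reviewer ID are placeholders:

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Hypothetical ground truth labels: {trace_id: rating}
ground_truth = {"<trace_id>": "good"}  # Replace with your own trace IDs and ratings

for trace_id, rating in ground_truth.items():
    mlflow.log_feedback(
        trace_id=trace_id,
        name="product_quality",  # Must match the judge name exactly
        value=rating,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="expert@example.com",  # Identifies the reviewer
        ),
    )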

Diverse Reviewers

Include feedback from multiple experts to capture different perspectives and reduce individual bias.

Balanced Examples

Include both positive and negative examples. Aim for at least 30% of each to help the judge learn boundaries (a quick balance check is sketched below).

Sufficient Volume

Collect at least 10 feedback examples (minimum for SIMBA), but 50-100 examples typically yield better results.

Consistent Standards

Ensure reviewers use consistent criteria. Provide guidelines or rubrics to standardize assessments.
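To see whether your collected feedback meets the balance and volume guidance above, you can count the human labels before running alignment. A hedged sketch, assuming the judge name "product_quality" and the experiment_id from the quick start:

from collections import Counter

import mlflow

traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=100, return_type="list"
)

# Count the human ratings attached under the judge's assessment name
label_counts = Counter()
for trace in traces:
    for f in trace.search_assessments(type="feedback"):
        if f.name == "product_quality" and f.source.source_type == "HUMAN":
            label_counts[f.value] += 1

total = sum(label_counts.values())
print(f"{total} human labels collected: {dict(label_counts)}")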

Testing Alignment Effectiveness

Validate that alignment improved your judge:

def test_alignment_improvement(original_judge, aligned_judge, test_traces: list) -> dict:
    """Compare judge agreement with human feedback before and after alignment."""
    original_correct = 0
    aligned_correct = 0
    evaluated = 0

    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )

        if not human_feedback:
            continue  # Skip traces without human ground truth
        evaluated += 1

        # Get judge evaluations
        original_eval = original_judge(trace=trace)
        aligned_eval = aligned_judge(trace=trace)

        # Check agreement with the human rating
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1

    # Only traces with human feedback count toward accuracy
    return {
        "original_accuracy": original_correct / evaluated if evaluated else 0.0,
        "aligned_accuracy": aligned_correct / evaluated if evaluated else 0.0,
        "improvement": (aligned_correct - original_correct) / evaluated if evaluated else 0.0,
    }
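One possible way to run this check, reusing search_traces from Step 3 (the max_results value is arbitrary):

test_traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=50, return_type="list"
)

results = test_alignment_improvement(initial_judge, aligned_judge, test_traces)
print(
    f"Accuracy before: {results['original_accuracy']:.0%}, "
    f"after: {results['aligned_accuracy']:.0%} "
    f"(improvement: {results['improvement']:+.0%})"
)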

Next Steps