Judge Alignment: Teaching AI to Match Human Preferences
Transform Generic Judges into Domain Experts
Judge alignment is the process of refining LLM judges to match human evaluation standards. Through systematic learning from human feedback, judges evolve from generic evaluators to domain-specific experts that understand your unique quality criteria.
Why Alignment Matters
Even the most sophisticated LLMs need calibration to match your evaluation standards. What constitutes "good" customer service varies by industry, and medical accuracy requirements differ from those for general health advice. Alignment bridges this gap by teaching judges your specific quality standards through examples.
Learn from Expert Feedback
Judges improve by learning from your domain experts' assessments, capturing nuanced quality criteria that generic prompts miss.
Consistent Standards at Scale
Once aligned, judges apply your exact quality standards consistently across millions of evaluations.
Continuous Improvement
As your standards evolve, judges can be re-aligned with new feedback, maintaining relevance over time.
Reduced Evaluation Errors
Aligned judges show a 30-50% reduction in false positives and negatives compared to generic evaluation prompts.
How Judge Alignment Works
Alignment Lifecycle
Quick Start: Align Your First Judge
For alignment to work, each trace must have BOTH judge assessments AND human feedback with the same assessment name. The alignment process learns by comparing judge assessments with human feedback on the same traces.
The assessment name must exactly match the judge name - if your judge is named "product_quality", both the judge's assessments and human feedback must use the name "product_quality".
The order doesn't matter - humans can provide feedback before or after the judge evaluates.
Note: Alignment is currently supported only for field-based evaluation using {{ inputs }} and {{ outputs }} templates. Support for Agent-as-a-Judge evaluation ({{ trace }}) and expectations ({{ expectations }}) is not yet available.
Step 1: Setup and Generate Traces
First, create your judge and generate traces with initial assessments:
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer
from mlflow.entities import AssessmentSource, AssessmentSourceType
import mlflow

# Create experiment and initial judge
experiment_id = mlflow.create_experiment("product-quality-alignment")
mlflow.set_experiment(experiment_id=experiment_id)

initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)

# Generate traces from your application (minimum 10 required)
traces = []
for i in range(15):  # Generate 15 traces (more than minimum of 10)
    with mlflow.start_span(f"product_description_{i}") as span:
        # Your application logic
        query = f"Product query {i}"
        description = f"Product description for query {i}"
        span.set_inputs({"query": query})
        span.set_outputs({"description": description})
        traces.append(span.trace_id)

# Run the judge on these traces to get initial assessments
for trace_id in traces:
    trace = mlflow.get_trace(trace_id)
    # Extract inputs and outputs from the trace for field-based evaluation
    inputs = trace.data.spans[0].inputs  # Get inputs from trace
    outputs = trace.data.spans[0].outputs  # Get outputs from trace
    # Judge evaluates using field-based approach (inputs/outputs)
    judge_result = initial_judge(inputs=inputs, outputs=outputs)
    # Judge's assessment is automatically logged when called
Step 2: Collect Human Feedback
After running your judge on traces, you need to collect human feedback. You can either:
- Use the MLflow UI (recommended): Review traces and add feedback through the intuitive interface
- Log programmatically: If you already have ground truth labels
For detailed instructions on collecting feedback, see Collecting Feedback for Alignment below.
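If you already have corrections for the traces generated in Step 1, you can skip the UI and log them programmatically with mlflow.log_feedback. The snippet below is a minimal sketch; human_label_for is a hypothetical stand-in for however your experts produce a rating for each trace.

from mlflow.entities import AssessmentSource, AssessmentSourceType


def human_label_for(trace_id: str) -> str:
    """Hypothetical placeholder: return the expert's rating for this trace."""
    return "good"


for trace_id in traces:  # Trace IDs collected in Step 1
    mlflow.log_feedback(
        trace_id=trace_id,
        name="product_quality",  # Must match the judge name exactly
        value=human_label_for(trace_id),
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id="domain_expert"
        ),
    )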
Step 3: Align and Register
After collecting feedback, align your judge and register it:
# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces required)
if len(traces_for_alignment) >= 10:
    optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    aligned_judge = initial_judge.align(optimizer, traces_for_alignment)

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")
The SIMBA Alignment Optimizer
MLflow provides the SIMBA (Simplified Multi-Bootstrap Aggregation) optimizer for aligning judges with human feedback:
from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

# Create SIMBA optimizer
optimizer = SIMBAAlignmentOptimizer(
    model="anthropic:/claude-opus-4-1-20250805"  # Model used for optimization
)

# Requirements for alignment:
# - Minimum 10 traces with BOTH judge assessments and human feedback
# - Both assessments must use the same name (matching the judge name)
# - Order doesn't matter - humans can assess before or after the judge
# - Mix of agreements and disagreements between judge and human recommended
aligned_judge = initial_judge.align(optimizer, traces_with_feedback)
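Before calling align(), you can check that your traces actually satisfy these requirements. The helper below is a rough, hypothetical sketch; it assumes assessment source types compare as the strings "HUMAN" and "LLM_JUDGE" (the same convention as the test code later on this page) and counts traces that carry both kinds of assessments under the judge's name.

def count_alignment_ready_traces(traces, judge_name="product_quality"):
    """Hypothetical helper: count traces that have BOTH a judge assessment and
    human feedback logged under the same name as the judge."""
    ready = 0
    for trace in traces:
        feedbacks = [
            f for f in trace.search_assessments(type="feedback") if f.name == judge_name
        ]
        sources = {f.source.source_type for f in feedbacks}
        if "HUMAN" in sources and "LLM_JUDGE" in sources:
            ready += 1
    return ready


ready = count_alignment_ready_traces(traces_with_feedback)
print(f"{ready} traces are ready for alignment (need at least 10)")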
Collecting Feedback for Alignment
The quality of alignment depends on the quality and quantity of feedback. Choose the approach that best fits your situation:
Feedback Collection Approaches
- MLflow UI (Recommended)
- Programmatic (Ground Truth)
When to use: You don't have existing ground truth labels and need to collect human feedback.
The MLflow UI provides an intuitive interface for reviewing traces and adding feedback:
- Navigate to the Traces tab in your experiment
- Click on individual traces to review inputs, outputs, and any existing judge assessments
- Add feedback by clicking the "Add Feedback" button
- Select the assessment name that matches your judge name (e.g., "product_quality")
- Provide your rating according to your evaluation criteria
Tips for effective feedback collection:
- If you're not a domain expert: Distribute traces among team members or domain experts for review
- If you are the domain expert: Create a rubric or guidelines document to ensure consistency
- For multiple reviewers: Organize feedback sessions where reviewers can work through batches together
- For consistency: Document your evaluation criteria clearly before starting
The UI automatically logs feedback in the correct format for alignment.

When to use: You have existing ground truth labels from your data.
If you already have labeled data, you can programmatically log it as feedback:
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Your existing ground truth dataset
ground_truth_data = [
    {"trace_id": "trace1", "label": "excellent", "query": "What is MLflow?"},
    {"trace_id": "trace2", "label": "poor", "query": "How to use tracking?"},
    {"trace_id": "trace3", "label": "good", "query": "How to log models?"},
]

# Log ground truth as feedback for alignment
for item in ground_truth_data:
    mlflow.log_feedback(
        trace_id=item["trace_id"],
        name="product_quality",  # Must match your judge name
        value=item["label"],
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id="ground_truth_dataset"
        ),
    )

print(f"Logged {len(ground_truth_data)} ground truth labels for alignment")
This approach is efficient when you have pre-labeled data from:
- Previous manual labeling efforts
- Expert annotations
- Production feedback systems
- Test datasets with known correct answers
Diverse Reviewers
Include feedback from multiple experts to capture different perspectives and reduce individual bias.
Balanced Examples
Include both positive and negative examples. Aim for at least 30% of each to help the judge learn boundaries.
Sufficient Volume
Collect at least 10 feedback examples (the minimum for SIMBA); 50-100 examples typically yield better results. A quick check of feedback volume and label balance is sketched below.
Consistent Standards
Ensure reviewers use consistent criteria. Provide guidelines or rubrics to standardize assessments.
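Before running alignment, it can be worth checking that your collected feedback meets these guidelines. The snippet below is a sketch, assuming the experiment and judge name from the quick start; it tallies human labels so you can see how much feedback you have and how balanced it is.

from collections import Counter

feedback_traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=100, return_type="list"
)

label_counts = Counter()
for trace in feedback_traces:
    for feedback in trace.search_assessments(type="feedback"):
        # Count only human feedback logged under the judge's name
        if feedback.name == "product_quality" and feedback.source.source_type == "HUMAN":
            label_counts[feedback.value] += 1

total = sum(label_counts.values())
print(f"{total} human labels collected: {dict(label_counts)}")
if total < 10:
    print("Collect more feedback before aligning (SIMBA needs at least 10 traces).")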
Testing Alignment Effectiveness
Validate that alignment improved your judge:
def test_alignment_improvement(
    original_judge, aligned_judge, test_traces: list
) -> dict:
    """Compare judge performance before and after alignment."""
    original_correct = 0
    aligned_correct = 0
    evaluated = 0

    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )
        if not human_feedback:
            continue  # Skip traces without human ground truth
        evaluated += 1

        # Get judge evaluations
        original_eval = original_judge(trace=trace)
        aligned_eval = aligned_judge(trace=trace)

        # Check agreement with human
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1

    # Compute accuracy only over traces that had human feedback
    if evaluated == 0:
        return {"original_accuracy": 0.0, "aligned_accuracy": 0.0, "improvement": 0.0}
    return {
        "original_accuracy": original_correct / evaluated,
        "aligned_accuracy": aligned_correct / evaluated,
        "improvement": (aligned_correct - original_correct) / evaluated,
    }
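A brief usage sketch, assuming the held-out traces already carry human feedback and come from the same experiment as above:

test_traces = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=20, return_type="list"
)

results = test_alignment_improvement(initial_judge, aligned_judge, test_traces)
print(f"Accuracy before alignment: {results['original_accuracy']:.0%}")
print(f"Accuracy after alignment:  {results['aligned_accuracy']:.0%}")
print(f"Improvement:               {results['improvement']:+.0%}")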