
Built-in LLM Judges

MLflow provides several pre-configured LLM judges optimized for common evaluation scenarios.

Example Usage
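
Built-in judges can be run from code as well as from the Judge Builder UI described below. The snippet is a minimal sketch, assuming MLflow 3.x's `mlflow.genai.evaluate` API and the built-in scorers in `mlflow.genai.scorers`; field names such as `expected_facts` may vary between versions, so check the API reference for your installation.

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# A small evaluation dataset. "expectations" supplies the ground truth that
# Correctness needs; RelevanceToQuery only looks at inputs and outputs.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "outputs": (
            "MLflow Tracking is the component that logs parameters, metrics, "
            "and artifacts for each run."
        ),
        "expectations": {
            "expected_facts": ["MLflow Tracking logs parameters, metrics, and artifacts."]
        },
    },
]

# Run the built-in judges over the dataset; results are logged to the active experiment.
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[RelevanceToQuery(), Correctness()],
)
print(results.metrics)
```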

Version Requirements

The Judge Builder UI requires MLflow >= 3.9.0.

The MLflow UI provides a visual Judge Builder that lets you create LLM judges without writing code.

  1. Navigate to your experiment and select the Judges tab, then click New LLM judge.

  2. LLM judge: Select a built-in judge. We're using the RelevanceToQuery and Correctness judges in this example.

[Screenshot: RelevanceToQuery Judge UI]

  3. Click Create judge to save your new LLM judge.

[Screenshot: Built-in judges result]

Available Judges

Response Quality

| Judge | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| RelevanceToQuery | Does the app's response directly address the user's input? | No | No |
| Correctness | Are the expected facts supported by the app's response? | Yes* | No |
| Completeness** | Does the agent address all questions in a single user prompt? | No | No |
| Fluency | Is the response grammatically correct and naturally flowing? | No | No |
| Safety | Does the app's response avoid harmful or toxic content? | No | No |
| Equivalence | Is the app's response equivalent to the expected output? | Yes | No |
| Guidelines | Does the response adhere to provided guidelines? | Yes* | No |
| ExpectationsGuidelines | Does the response meet specific expectations and guidelines? | Yes* | No |
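
Judges that take configuration, such as Guidelines, are parameterized when instantiated. The snippet below is a minimal sketch; the `name` and `guidelines` constructor arguments are assumptions based on the other built-in scorers, so verify the exact signature against your MLflow version.

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# A plain-language rule the judge checks each response against.
# The `name` and `guidelines` arguments are assumed; verify against your version.
tone_judge = Guidelines(
    name="professional_tone",
    guidelines="Responses must be professional and must not include internal project names.",
)

eval_data = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "outputs": "Go to Settings > Security and click 'Reset password'.",
    }
]

results = mlflow.genai.evaluate(data=eval_data, scorers=[tone_judge])
```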

RAG

| Judge | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| RetrievalRelevance | Are retrieved documents relevant to the user's request? | No | ⚠️ Trace Required |
| RetrievalGroundedness | Is the app's response grounded in retrieved information? | No | ⚠️ Trace Required |
| RetrievalSufficiency | Do retrieved documents contain all necessary information? | Yes | ⚠️ Trace Required |
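
Because these judges read the retrieved documents out of the trace, the evaluation data is a set of traces rather than input/output rows; the same pattern applies to the tool-call judges below. A minimal sketch, assuming your retriever is instrumented with MLflow Tracing and that the DataFrame returned by `mlflow.search_traces` is accepted by `mlflow.genai.evaluate` as `data`:

```python
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

# Collect previously logged traces; the retriever spans inside each trace
# provide the documents that the judge inspects.
traces = mlflow.search_traces(experiment_ids=["<your_experiment_id>"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalGroundedness()],
)
```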

Tool Call

| Judge | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| ToolCallCorrectness** | Are the tool calls and arguments correct for the user query? | No | ⚠️ Trace Required |
| ToolCallEfficiency** | Are the tool calls efficient without redundancy? | No | ⚠️ Trace Required |

*Can extract expectations from trace assessments if available.

**Indicates experimental features that may change in future releases.
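
For judges marked with *, ground truth can be attached to a trace as an expectation assessment instead of being passed in the evaluation dataset. A minimal sketch, assuming MLflow 3.x's `mlflow.log_expectation` API; the `expected_facts` name and list value mirror the earlier Correctness example and are an assumption.

```python
import mlflow

# Attach ground truth to an existing trace; judges such as Correctness can
# read it from the trace's assessments during evaluation.
mlflow.log_expectation(
    trace_id="<trace_id>",
    name="expected_facts",
    value=["MLflow Tracking logs parameters, metrics, and artifacts."],
)
```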

Multi-Turn

Multi-turn judges evaluate entire conversation sessions rather than individual turns. They require traces tagged with session IDs and are experimental as of MLflow 3.7.0. See Track Users and Sessions for how to add session IDs to your traces.

Multi-Turn Evaluation Requirements

Multi-turn judges require:

  1. Session IDs: Traces must have mlflow.trace.session metadata
  2. List or DataFrame input: Currently only supports pre-collected traces (no predict_fn support yet)

| Judge | What does it evaluate? | Requires Session? |
|---|---|---|
| ConversationCompleteness** | Does the agent address all user questions throughout the conversation? | Yes |
| ConversationalGuidelines** | Do the assistant's responses comply with provided guidelines? | Yes |
| ConversationalRoleAdherence** | Does the assistant maintain its assigned role throughout the conversation? | Yes |
| ConversationalSafety** | Are the assistant's responses safe and free of harmful content? | Yes |
| ConversationalToolCallEfficiency** | Was tool usage across the conversation efficient and appropriate? | Yes |
| KnowledgeRetention** | Does the assistant correctly retain information from earlier user inputs? | Yes |
| UserFrustration** | Is the user frustrated? Was the frustration resolved? | Yes |
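
A minimal sketch of the multi-turn flow: the application tags each trace with a session ID, the traces are collected with `mlflow.search_traces`, and the session-level judges run over them. The import path for the multi-turn judges is an assumption modeled on the other built-in scorers, and these judges are experimental, so verify names against your version's API reference.

```python
import mlflow
from mlflow.genai.scorers import ConversationCompleteness  # assumed import path (experimental)


@mlflow.trace
def answer(question: str, session_id: str) -> str:
    # Tag the trace with the session so multi-turn judges can group all
    # turns of the same conversation together.
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    # ... call your model or agent here; a canned reply keeps the sketch self-contained
    return f"(placeholder answer to: {question})"


# Multi-turn judges take pre-collected traces (no predict_fn support yet).
traces = mlflow.search_traces(experiment_ids=["<your_experiment_id>"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ConversationCompleteness()],
)
```
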
Availability

Safety and RetrievalRelevance judges are currently only available in Databricks managed MLflow and will be open-sourced soon.

tip

You can typically get started with evaluation using the built-in judges. However, every AI application is unique and has domain-specific quality criteria, so at some point you'll need to create your own custom LLM judges, for example when:

  • Your application has complex inputs/outputs that built-in judges can't parse
  • You need to evaluate specific business logic or domain-specific criteria
  • You want to combine multiple evaluation aspects into a single judge

See the custom LLM judges guide for detailed examples.

Next Steps