Evaluations
Evaluation to measure and improve quality
Confidently evaluate quality in development and production to identify issues and iteratively test improvements.
Accurately evaluate free-form language outputs with LLM judges
Pre-built LLM judges
Quickly start with built-in LLM judges for safety, hallucination, retrieval quality, and relevance. Our research-backed judges provide accurate, reliable quality evaluation aligned with human expertise.
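A minimal sketch, assuming MLflow 3's mlflow.genai evaluation API and its built-in LLM-judge scorers (exact scorer names and judge backends may vary by MLflow version and configuration); the dataset and predict function below are illustrative placeholders.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Small evaluation dataset: each record carries the inputs passed to the app.
eval_data = [
    {"inputs": {"question": "What is MLflow Tracking?"}},
    {"inputs": {"question": "How do I register a model?"}},
]

def predict_fn(question: str) -> str:
    # Placeholder for your real application (RAG chain, agent, plain prompt, ...).
    return f"Stub answer about: {question}"

# Each built-in judge scores every record; results are logged to an MLflow run.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Safety(), RelevanceToQuery()],
)
```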
Customized LLM judges
Adapt our base model to create custom LLM judges tailored to your business needs and aligned with your human experts' judgment.
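A sketch of one way to customize a judge, assuming the guideline-based Guidelines scorer in mlflow.genai.scorers; the guideline text and name are hypothetical examples.

```python
from mlflow.genai.scorers import Guidelines

# Encode your experts' criteria as natural-language guidelines the judge enforces.
support_tone = Guidelines(
    name="support_tone",
    guidelines=(
        "The response must be empathetic, must not blame the customer, "
        "and must end with a concrete next step."
    ),
)

# The custom judge drops into the same scorers list as the built-in ones:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn,
#                       scorers=[support_tone])
```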
Iteratively improve quality through evaluation
Test new app / prompt variants
MLflow's GenAI evaluation API lets you test new application variants (prompts, models, code) against evaluation and regression datasets. Each variant is linked to its evaluation results, enabling tracking of improvements over time.
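A sketch of evaluating two prompt variants against the same dataset, assuming evaluation results attach to the active MLflow run; the prompts and stubbed app call are hypothetical, and eval_data is the dataset from the earlier sketch.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

PROMPTS = {
    "v1-concise": "Answer in one sentence: {question}",
    "v2-detailed": "Answer thoroughly, citing the product docs: {question}",
}

for variant, template in PROMPTS.items():
    # One run per variant, so each variant's results can be compared over time.
    with mlflow.start_run(run_name=variant):
        mlflow.genai.evaluate(
            data=eval_data,  # same dataset for every variant
            predict_fn=lambda question, t=template: f"Stub answer for: {t.format(question=question)}",
            scorers=[RelevanceToQuery()],
        )
```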
Customize with code-based metrics
Customize evaluation to measure any aspect of your app's quality or performance using our custom metrics API. Convert any Python function—from regex to custom logic—into a metric.
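A sketch of code-based metrics, assuming the @scorer decorator in mlflow.genai.scorers; the metric names and checks below are illustrative.

```python
import re
from mlflow.genai.scorers import scorer

@scorer
def no_email_leak(outputs: str) -> bool:
    """Regex check: the response must not contain an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", outputs) is None

@scorer
def is_concise(outputs: str) -> bool:
    """Custom logic: flag responses longer than 100 words."""
    return len(outputs.split()) <= 100

# Custom metrics drop into the same scorers list as the LLM judges:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn,
#                       scorers=[no_email_leak, is_concise])
```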
Identify root causes with evaluation review UIs
Use MLflow's Evaluation UI to view a summary of your evaluation results and drill into individual records to quickly identify root causes and further improvement opportunities.
Compare versions side-by-side
Compare evaluations of two app variants to understand whether your changes improved or regressed quality. Review individual questions side by side in the Trace Comparison UI to spot differences, debug regressions, and inform your next version.
Get started with MLflow
Choose from two options depending on your needs

Self-hosted Open Source

Apache-2.0 license
Full control over your own infrastructure
Community support

Managed hosting

Free and fully managed — experience MLflow without the setup hassle
Built and maintained by the original creators of MLflow
Full OSS compatibility
GET INVOLVED
Connect with the open source community
Join millions of MLflow users