Evaluations
Evaluation to measure and improve quality
Confidently evaluate quality in development and production to identify issues and iteratively test improvements.
Accurately evaluate free-form language outputs with LLM judges
Pre-built LLM judges
Quickly start with built-in LLM judges for safety, hallucination, retrieval quality, and relevance. Our research-backed judges provide accurate, reliable quality evaluation aligned with human expertise.
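A minimal sketch, assuming MLflow 3's mlflow.genai evaluation API and its built-in LLM-judge scorers (exact scorer names and judge backends may vary by MLflow version and configuration); the dataset and predict function below are illustrative placeholders.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Small evaluation dataset: each record carries the inputs passed to the app.
eval_data = [
    {"inputs": {"question": "What is MLflow Tracking?"}},
    {"inputs": {"question": "How do I register a model?"}},
]

def predict_fn(question: str) -> str:
    # Placeholder for your real application (RAG chain, agent, plain prompt, ...).
    return f"Stub answer about: {question}"

# Each built-in judge scores every record; results are logged to an MLflow run.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Safety(), RelevanceToQuery()],
)
```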
Customized LLM judges
Adapt our base model to create custom LLM judges tailored to your business needs and aligned with your human experts' judgment.
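A sketch of one way to customize a judge, assuming the guideline-based Guidelines scorer in mlflow.genai.scorers; the guideline text and name are hypothetical examples.

```python
from mlflow.genai.scorers import Guidelines

# Encode your experts' criteria as natural-language guidelines the judge enforces.
support_tone = Guidelines(
    name="support_tone",
    guidelines=(
        "The response must be empathetic, must not blame the customer, "
        "and must end with a concrete next step."
    ),
)

# The custom judge drops into the same scorers list as the built-in ones:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn,
#                       scorers=[support_tone])
```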
Iteratively improve quality through evaluation
Test new app / prompt variants
MLflow's GenAI evaluation API lets you test new application variants (prompts, models, code) against evaluation and regression datasets. Each variant is linked to its evaluation results, enabling tracking of improvements over time.
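A sketch of evaluating two prompt variants against the same dataset, assuming evaluation results attach to the active MLflow run; the prompts and stubbed app call are hypothetical, and eval_data is the dataset from the earlier sketch.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

PROMPTS = {
    "v1-concise": "Answer in one sentence: {question}",
    "v2-detailed": "Answer thoroughly, citing the product docs: {question}",
}

for variant, template in PROMPTS.items():
    # One run per variant, so each variant's results can be compared over time.
    with mlflow.start_run(run_name=variant):
        mlflow.genai.evaluate(
            data=eval_data,  # same dataset for every variant
            predict_fn=lambda question, t=template: f"Stub answer for: {t.format(question=question)}",
            scorers=[RelevanceToQuery()],
        )
```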
Customize with code-based metrics
Customize evaluation to measure any aspect of your app's quality or performance using our custom metrics API. Convert any Python function—from regex to custom logic—into a metric.
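A sketch of code-based metrics, assuming the @scorer decorator in mlflow.genai.scorers; the metric names and checks below are illustrative.

```python
import re
from mlflow.genai.scorers import scorer

@scorer
def no_email_leak(outputs: str) -> bool:
    """Regex check: the response must not contain an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", outputs) is None

@scorer
def is_concise(outputs: str) -> bool:
    """Custom logic: flag responses longer than 100 words."""
    return len(outputs.split()) <= 100

# Custom metrics drop into the same scorers list as the LLM judges:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn,
#                       scorers=[no_email_leak, is_concise])
```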
Identify root causes with evaluation review UIs
Use MLflow's Evaluation UI to view a summary of your evaluation results and drill into individual records to quickly identify root causes and further improvement opportunities.
Compare versions side-by-side
Compare evaluations of two app variants to understand whether your changes improved or regressed quality. Review individual questions side by side in the Trace Comparison UI to spot differences, debug regressions, and inform your next version.
Get started with MLflow
Choose from two options depending on your needs

Self-hosted Open Source

Apache-2.0 license
Full control over your own infrastructure
Community support

Managed hosting

Free and fully managed — experience MLflow without the setup hassle
Built and maintained by the original creators of MLflow
Full OSS compatibility
GET INVOLVED
Connect with the open source community
Join millions of MLflow users