
Evaluation Datasets

Transform Your GenAI Testing with Structured Evaluation Data

Evaluation datasets are the foundation of systematic GenAI application testing. They provide a centralized way to manage test data, ground truth expectations, and evaluation results—enabling you to measure and improve the quality of your AI applications with confidence.

SQL Backend Required

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available with FileStore (local file system-based tracking). If you need a simple local configuration, start MLflow with a SQLite backend store, for example: mlflow server --backend-store-uri sqlite:///mlflow.db
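As a minimal sketch of the client side of that setup, the snippet below builds a SQLite URI and points MLflow at it. The file name mlflow.db and both helper functions are illustrative assumptions, not part of MLflow; mlflow.set_tracking_uri is the standard call, and it is imported lazily so the URI helper works on its own.

```python
from pathlib import Path


def sqlite_tracking_uri(db_path: str = "mlflow.db") -> str:
    """Build a SQLite URI for a local database file (hypothetical helper)."""
    # SQLAlchemy-style SQLite URIs use three slashes followed by the file path.
    return f"sqlite:///{Path(db_path).as_posix()}"


def use_sqlite_backend(db_path: str = "mlflow.db") -> str:
    """Point the MLflow client at a SQLite-backed store (requires mlflow)."""
    import mlflow  # imported lazily so the helper above has no dependencies

    uri = sqlite_tracking_uri(db_path)
    mlflow.set_tracking_uri(uri)
    return uri
```

Calling use_sqlite_backend() once at the top of a script should be enough for the dataset APIs shown on this page to find a SQL store, assuming your MLflow version supports a direct SQLite tracking URI.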

Quickstart: Build Your First Evaluation Dataset

There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process.

The simplest way to create one is through the MLflow UI. Navigate to the experiment you want the evaluation dataset to be associated with and create a new dataset by supplying a unique name. After adding records to it, you can view the dataset's entries in the UI.


At their core, evaluation datasets consist of inputs and expectations. Outputs are optional and can be added for post-hoc evaluation with scorers. You can add records from traces, from dictionaries, or from a pandas DataFrame.

python
import mlflow
from mlflow.genai.datasets import create_dataset, set_dataset_tags

# Create your evaluation dataset
dataset = create_dataset(
    name="production_validation_set",
    experiment_id=["0"],  # "0" is the default experiment
    tags={"team": "ml-platform", "stage": "validation"},
)

# Optionally, add additional tags to your dataset.
# Tags can be used to search for datasets with search_datasets API
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={"environment": "dev", "validation_version": "1.3"},
)

# First, retrieve traces that will become the basis of the dataset
traces = mlflow.search_traces(
    experiment_ids=["0"],
    max_results=20,
    filter_string="attributes.name = 'chat_completion'",
    return_type="list",  # Returns list[Trace]
)

# Add expectations to the traces
for trace in traces:
    mlflow.log_expectation(
        trace_id=trace.info.trace_id,
        name="expected_answer",
        value=(
            "The correct answer should include step-by-step instructions "
            "for password reset with email verification"
        ),
    )

# Retrieve the traces with added expectations
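annotated_traces = mlflow.search_traces(
    experiment_ids=["0"],
    max_results=20,
    # Same filter as above, so only the traces annotated in the loop are fetched
    filter_string="attributes.name = 'chat_completion'",
    return_type="list",
)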

# Merge the list of Trace objects directly into your dataset
dataset.merge_records(annotated_traces)
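Records do not have to come from traces. As a sketch of the dictionary and DataFrame paths, the snippet below builds records by hand and validates their shape before merging. The question and expected_answer field names and all helper functions are made up for illustration. It assumes that merge_records accepts a list of dictionaries or a pandas DataFrame with inputs and expectations columns, so check this against your MLflow version.

```python
from typing import Any

Record = dict[str, Any]


def build_records() -> list[Record]:
    """Hand-written records: each has inputs plus ground-truth expectations."""
    return [
        {
            "inputs": {"question": "How do I reset my password?"},
            "expectations": {
                "expected_answer": "Step-by-step reset with email verification"
            },
        },
        {
            "inputs": {"question": "How do I change my email address?"},
            "expectations": {
                "expected_answer": "Update the address under account settings"
            },
        },
    ]


def check_records(records: list[Record]) -> list[Record]:
    """Fail early if a record does not follow the inputs/expectations layout."""
    for i, record in enumerate(records):
        if not isinstance(record.get("inputs"), dict):
            raise ValueError(f"record {i} needs an 'inputs' dict")
        if not isinstance(record.get("expectations", {}), dict):
            raise ValueError(f"record {i} has non-dict 'expectations'")
    return records


def merge_dicts(dataset, records: list[Record]) -> None:
    """Merge a list of dictionaries (assumed to be accepted by merge_records)."""
    dataset.merge_records(check_records(records))


def merge_dataframe(dataset, records: list[Record]) -> None:
    """Merge the same records via a pandas DataFrame (requires pandas)."""
    import pandas as pd  # imported lazily

    dataset.merge_records(pd.DataFrame(check_records(records)))
```

With the dataset created above, merge_dicts(dataset, build_records()) would add both records in one call.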

Understanding Source Types

Every record in an evaluation dataset has a source type that tracks its provenance. This enables you to analyze model performance by data origin and understand which types of test data are most valuable.

TRACE

Records from production traces - automatically assigned when adding traces via mlflow.search_traces()

HUMAN

Subject matter expert annotations - automatically inferred for records with expectations (ground truth)

CODE

Programmatically generated test cases - automatically inferred for records without expectations

DOCUMENT

Test cases extracted from documentation or specs - must be explicitly specified with source metadata

Source types are automatically inferred based on record characteristics but can be explicitly overridden when needed. See the SDK Guide for detailed inference rules and examples.
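The inference rules listed above can be restated as a small function. This is an illustrative sketch of those rules only, not MLflow's implementation. The infer_source_type helper is hypothetical, and the layout of the explicit override (a "source" dictionary with a "source_type" key) is an assumption; the SDK Guide is the authority on the real format.

```python
from typing import Any


def infer_source_type(record: dict[str, Any], from_trace: bool = False) -> str:
    """Return TRACE, HUMAN, CODE, or an explicitly specified source type."""
    # An explicitly specified source type overrides inference (assumed layout).
    explicit = record.get("source", {}).get("source_type")
    if explicit:
        return str(explicit).upper()
    # Records that originate from traces are tagged TRACE.
    if from_trace:
        return "TRACE"
    # Ground-truth expectations suggest a human annotation.
    if record.get("expectations"):
        return "HUMAN"
    # Otherwise the record is treated as programmatically generated.
    return "CODE"
```

DOCUMENT is never inferred in this sketch; it only appears when the record names it explicitly, which matches the description above.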

Why Evaluation Datasets?

Centralized Test Management

Store all your test cases, expected outputs, and evaluation criteria in one place. No more scattered CSV files or hardcoded test data.

Consistent Evaluation Source

Maintain a concrete representation of test data that can be used repeatedly as your project evolves. Eliminate manual testing and avoid repeatedly assembling evaluation data for each iteration.

Systematic Testing

Move beyond ad-hoc testing to systematic evaluation. Define clear expectations and measure performance consistently across deployments.

Collaborative Improvement

Enable your entire team to contribute test cases and expectations. Share evaluation datasets across projects and teams.

The Evaluation Loop

Evaluation datasets bridge the critical gap between trace generation and evaluation execution in the GenAI development lifecycle. As you test your application and capture traces with expectations, evaluation datasets transform these individual test cases into a materialized, reusable evaluation suite. This creates a consistent and evolving collection of evaluation records that grows with your application—each iteration adds new test cases while preserving the historical test coverage. Rather than losing valuable test scenarios after each development cycle, you build a comprehensive evaluation asset that can immediately assess the quality of changes and improvements to your implementation.

Figure: The Evaluation Loop. Iterate on Code → Test App → Collect Traces → Add Expectations → Create Dataset → Run Evaluation → Analyze Results → Iterate & Improve (back to Iterate on Code).
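The "Run Evaluation" step of the loop might look like the sketch below. It assumes that mlflow.genai.evaluate accepts an evaluation dataset as its data argument, that predict_fn receives each record's inputs as keyword arguments, and that a plain function can be wrapped with the scorer decorator from mlflow.genai.scorers. The exact_match scorer, the stub predict_fn, and the question and expected_answer field names are hypothetical.

```python
def exact_match(outputs, expectations) -> bool:
    """Hypothetical scorer: pass when the output equals the expected answer."""
    expected = expectations.get("expected_answer")
    if expected is None:
        return False
    # Compare case-insensitively and ignore surrounding whitespace.
    return str(outputs).strip().lower() == str(expected).strip().lower()


def predict_fn(question: str) -> str:
    """Placeholder for your application; replace with a real model call."""
    return "stub answer for: " + question


def run_evaluation(dataset):
    """Score predict_fn against every record in the dataset (requires mlflow)."""
    import mlflow
    from mlflow.genai.scorers import scorer  # imported lazily

    return mlflow.genai.evaluate(
        data=dataset,
        predict_fn=predict_fn,
        scorers=[scorer(exact_match)],
    )
```

Because the dataset is stored rather than rebuilt, rerunning run_evaluation(dataset) after each code change measures every iteration against the same records.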

Key Features

Ground Truth Management

Define and maintain expected outputs for your test cases. Capture expert knowledge about what constitutes correct behavior for your AI system.

Schema Evolution

Automatically track the structure of your test data as it evolves. Add new fields and test dimensions without breaking existing evaluations.

Incremental Updates

Continuously improve your test suite by adding new cases from production. Update expectations as your understanding of correct behavior evolves.

Flexible Tagging

Organize datasets with tags for easy discovery and filtering. Track metadata like data sources, annotation guidelines, and quality levels.

Performance Tracking

Monitor how your application performs against the same test data over time. Identify regressions and improvements across deployments.

Experiment Integration

Link datasets to MLflow experiments for complete traceability. Understand which test data was used for each model evaluation.
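To illustrate tag-based discovery, the sketch below turns a dictionary of tags into a filter string and passes it to search_datasets. The tag_filter and find_datasets helpers are hypothetical, and the tags.key = 'value' filter syntax and the experiment_ids and filter_string parameters are assumptions to verify against the search_datasets reference. The helper does not escape quotes or special characters in keys or values.

```python
from typing import Optional


def tag_filter(tags: dict[str, str]) -> str:
    """Build a filter like "tags.stage = 'validation' AND tags.team = 'x'"."""
    # Sort keys so the same tags always produce the same filter string.
    clauses = [f"tags.{key} = '{value}'" for key, value in sorted(tags.items())]
    return " AND ".join(clauses)


def find_datasets(tags: dict[str, str], experiment_ids: Optional[list[str]] = None):
    """Search for datasets carrying all of the given tags (requires mlflow)."""
    from mlflow.genai.datasets import search_datasets  # imported lazily

    return search_datasets(
        experiment_ids=experiment_ids,
        filter_string=tag_filter(tags),
    )
```

For the dataset created in the quickstart, find_datasets({"team": "ml-platform"}, experiment_ids=["0"]) would be the corresponding lookup.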

Next Steps

Ready to improve your GenAI testing? Start with these resources: