
Evaluation Datasets SDK Reference

Complete API reference for creating, managing, and querying evaluation datasets programmatically.

SQL Backend Required

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available with FileStore (local file system-based tracking).
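If you are unsure which store your tracking URI points at, the backend type can be read off the URI scheme. A minimal stdlib sketch (the `is_sql_backend` helper is illustrative, not part of MLflow):

```python
from urllib.parse import urlparse

# URI schemes that correspond to a SQL-backed tracking store
SQL_SCHEMES = {"postgresql", "mysql", "sqlite", "mssql"}


def is_sql_backend(tracking_uri: str) -> bool:
    """Return True if the tracking URI points at a SQL backend."""
    # Schemes like "mysql+pymysql" carry a driver suffix after "+"
    scheme = urlparse(tracking_uri).scheme.split("+")[0]
    return scheme in SQL_SCHEMES


print(is_sql_backend("sqlite:///mlflow.db"))  # SQL backend -> True
print(is_sql_backend("file:///tmp/mlruns"))   # FileStore -> False
```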

Creating a Dataset

Use mlflow.genai.datasets.create_dataset() to create a new evaluation dataset:

python
from mlflow.genai.datasets import create_dataset

# Create a new dataset
dataset = create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],  # Link to experiments
    tags={"version": "1.0", "team": "ml-platform", "status": "active"},
)

print(f"Created dataset: {dataset.dataset_id}")

You can also use the mlflow.tracking.MlflowClient() API:

python
from mlflow import MlflowClient

client = MlflowClient()
dataset = client.create_dataset(
    name="customer_support_qa",
    experiment_id=["0"],
    tags={"version": "1.0"},
)

Adding Records to a Dataset

Use the mlflow.entities.EvaluationDataset.merge_records() method to add new records to your dataset. Records can be added from dictionaries, DataFrames, or traces:

Add records directly from Python dictionaries:

python
# Add records with inputs and expectations (ground truth)
new_records = [
    {
        "inputs": {"question": "What are your business hours?"},
        "expectations": {
            "expected_answer": "We're open Monday-Friday 9am-5pm EST",
            "must_mention_hours": True,
            "must_include_timezone": True,
        },
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {
            "expected_answer": (
                "Click 'Forgot Password' and follow the email instructions"
            ),
            "must_include_steps": True,
        },
    },
]

dataset.merge_records(new_records)
print(f"Dataset now has {len(dataset.records)} records")

Updating Existing Records

The mlflow.entities.EvaluationDataset.merge_records() method handles updates intelligently. Records are matched by a hash of their inputs: if a record with identical inputs already exists, its expectations and tags are merged into it rather than a duplicate being created:

python
# Initial record
dataset.merge_records(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "expectations": {
                "expected_answer": "MLflow is a platform for ML",
                "must_mention_tracking": True,
            },
        }
    ]
)

# Update with the same inputs but enhanced expectations
dataset.merge_records(
    [
        {
            "inputs": {"question": "What is MLflow?"},  # Same inputs = update
            "expectations": {
                # Updates the existing value
                "expected_answer": (
                    "MLflow is an open-source platform for managing the ML lifecycle"
                ),
                "must_mention_models": True,  # Adds a new expectation
                # Note: "must_mention_tracking": True is preserved
            },
        }
    ]
)

# Result: one record with merged expectations
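The merge-by-input-hash behavior can be pictured in plain Python. This is a sketch of the general idea only; MLflow's actual input hashing and merge logic are internal, and the `input_key` and `merge` helpers below are hypothetical:

```python
import hashlib
import json


def input_key(record: dict) -> str:
    """Canonical hash of a record's inputs (illustrative only)."""
    canonical = json.dumps(record["inputs"], sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def merge(existing: list, incoming: list) -> list:
    """Merge records, updating expectations when inputs already exist."""
    by_key = {input_key(r): r for r in existing}
    for record in incoming:
        key = input_key(record)
        if key in by_key:
            # Same inputs: merge expectations instead of duplicating
            by_key[key]["expectations"].update(record.get("expectations", {}))
        else:
            by_key[key] = record
    return list(by_key.values())


records = merge([], [{"inputs": {"q": "What is MLflow?"},
                      "expectations": {"must_mention_tracking": True}}])
records = merge(records, [{"inputs": {"q": "What is MLflow?"},
                           "expectations": {"must_mention_models": True}}])
print(len(records))  # 1 record, with both expectations preserved
```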

Retrieving Datasets

Retrieve existing datasets by ID or search for them:

python
from mlflow.genai.datasets import get_dataset

# Get a specific dataset by ID
dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f")

# Access dataset properties
print(f"Name: {dataset.name}")
print(f"Records: {len(dataset.records)}")
print(f"Schema: {dataset.schema}")
print(f"Tags: {dataset.tags}")

Managing Tags

Add, update, or remove tags from datasets:

python
from mlflow.genai.datasets import set_dataset_tags, delete_dataset_tag

# Set or update tags
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={"status": "production", "validated": "true", "version": "2.0"},
)

# Delete a specific tag
delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

Deleting a Dataset

Permanently delete a dataset and all its records:

python
from mlflow.genai.datasets import delete_dataset

# Delete the entire dataset
delete_dataset(dataset_id="d-1a2b3c4d5e6f7890")

Warning

Dataset deletion is permanent and cannot be undone. All records will be deleted.

Working with Dataset Records

The mlflow.entities.EvaluationDataset object provides several ways to access and analyze records:

python
# Access all records
all_records = dataset.records

# Convert to DataFrame for analysis
df = dataset.to_df()
print(df.head())

# View dataset schema (auto-computed from records)
print(dataset.schema)

# View dataset profile (statistics)
print(dataset.profile)

# Get record count
print(f"Total records: {len(dataset.records)}")

Advanced Topics

Understanding Input Uniqueness

Records are considered unique based on their entire inputs dictionary. Even small differences create separate records:

python
# These are treated as different records due to different inputs
record_a = {
"inputs": {"question": "What is MLflow?", "temperature": 0.7},
"expectations": {"expected_answer": "MLflow is an ML platform"},
}

record_b = {
"inputs": {
"question": "What is MLflow?",
"temperature": 0.8,
}, # Different temperature
"expectations": {"expected_answer": "MLflow is an ML platform"},
}

dataset.merge_records([record_a, record_b])
# Results in 2 separate records due to different temperature values
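One way to picture input uniqueness (MLflow's exact hashing scheme is an internal detail; this sketch assumes a canonical-JSON digest, and `fingerprint` is a hypothetical helper):

```python
import hashlib
import json


def fingerprint(inputs: dict) -> str:
    # Sort keys so semantically equal dicts hash identically
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


a = fingerprint({"question": "What is MLflow?", "temperature": 0.7})
b = fingerprint({"question": "What is MLflow?", "temperature": 0.8})
c = fingerprint({"temperature": 0.7, "question": "What is MLflow?"})

print(a == b)  # False: different temperature -> separate record
print(a == c)  # True: key order does not matter
```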

Source Type Inference

MLflow automatically assigns source types before sending records to the backend using these rules:

Automatic Inference

MLflow automatically infers source types based on record characteristics when no explicit source is provided.

Client-Side Processing

Source type inference happens in merge_records() before records are sent to the tracking backend.

Manual Override

You can always specify explicit source information to override automatic inference.

Inference Rules

Records from MLflow traces are automatically assigned the TRACE source type:

python
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
dataset.merge_records(traces)

# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
# Automatically detects traces and assigns TRACE source
dataset.merge_records(traces_df)

Manual Source Override

You can explicitly specify the source type and metadata for any record:

python
# Specify HUMAN source with metadata
human_curated = {
    "inputs": {"question": "What are your business hours?"},
    "expectations": {
        "expected_answer": "We're open Monday-Friday 9am-5pm EST",
        "must_include_timezone": True,
    },
    "source": {
        "source_type": "HUMAN",
        "source_data": {"curator": "support_team", "date": "2024-11-01"},
    },
}

# Specify DOCUMENT source
from_docs = {
    "inputs": {"question": "How to install MLflow?"},
    "expectations": {
        "expected_answer": "pip install mlflow",
        "must_mention_pip": True,
    },
    "source": {
        "source_type": "DOCUMENT",
        "source_data": {"document_id": "install_guide", "page": 1},
    },
}

dataset.merge_records([human_curated, from_docs])

Available Source Types

TRACE

Production data captured via MLflow tracing - automatically assigned when adding traces

HUMAN

Subject matter expert annotations - inferred for records with expectations

CODE

Programmatically generated tests - inferred for records without expectations

DOCUMENT

Test cases from documentation or specs - must be explicitly specified

UNSPECIFIED

Source unknown or not provided - for legacy or imported data
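The inference rules above can be summarized as a small function. This is a sketch of the documented behavior, not MLflow's internal code; in particular, the `is_trace` flag stands in for MLflow's actual detection of Trace objects:

```python
def infer_source_type(record: dict) -> str:
    """Sketch of the documented source-type inference rules."""
    if "source" in record:               # explicit source always wins
        return record["source"]["source_type"]
    if record.get("is_trace"):           # came from MLflow tracing
        return "TRACE"
    if record.get("expectations"):       # has ground truth -> human-annotated
        return "HUMAN"
    return "CODE"                        # generated test without expectations


print(infer_source_type({"inputs": {"q": "hi"}, "expectations": {"a": 1}}))  # HUMAN
print(infer_source_type({"inputs": {"q": "hi"}}))                            # CODE
```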

Search Filter Reference

Searchable Fields

Field              Type       Example
name               string     name = 'production_tests'
tags.<key>         string     tags.status = 'validated'
created_by         string     created_by = 'alice@company.com'
last_updated_by    string     last_updated_by = 'bob@company.com'
created_time       timestamp  created_time > 1698800000000
last_update_time   timestamp  last_update_time > 1698800000000

Filter Operators

  • =, !=: Exact match
  • LIKE, ILIKE: Pattern matching with % wildcard (ILIKE is case-insensitive)
  • >, <, >=, <=: Numeric/timestamp comparison
  • AND: Combine conditions (OR is not currently supported)
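Since only AND is supported, conjunctive filters can be assembled with a tiny helper (hypothetical, not part of the MLflow API):

```python
def build_filter(*conditions: str) -> str:
    """Join filter conditions with AND (OR is not supported)."""
    return " AND ".join(conditions)


filter_string = build_filter(
    "tags.status = 'validated'",
    "name LIKE '%test%'",
    "created_time > 1698800000000",
)
print(filter_string)
# tags.status = 'validated' AND name LIKE '%test%' AND created_time > 1698800000000
```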

Common Filter Examples

Filter Expression                           Description              Use Case
name = 'production_qa'                      Exact name match         Find a specific dataset
name LIKE '%test%'                          Pattern matching         Find all test datasets
tags.status = 'validated'                   Tag equality             Find production-ready datasets
tags.version = '2.0' AND tags.team = 'ml'   Multiple tag conditions  Find team-specific versions
created_by = 'alice@company.com'            Creator filter           Find datasets by author
created_time > 1698800000000                Time-based filter        Find recent datasets
python
from mlflow.genai.datasets import search_datasets

# Complex filter example
datasets = search_datasets(
    filter_string=(
        "tags.status = 'production' "
        "AND name LIKE '%customer%' "
        "AND created_time > 1698800000000"
    ),
    order_by=["last_update_time DESC"],
)

Next Steps