Evaluation Dataset Concepts

SQL Backend Required

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available in FileStore (local mode) due to the relational data requirements for managing dataset records, associations, and schema evolution.

What are Evaluation Datasets?

Evaluation Datasets in MLflow provide a structured way to organize and manage test data for GenAI applications. They serve as centralized repositories for test inputs, optional test outputs, expected outputs (expectations), and evaluation results, enabling systematic quality assessment across your AI development lifecycle.

Unlike static test files, evaluation datasets are living validation collections designed to grow and evolve with your application. Records can be continuously added from production traces, manual curation, or programmatic generation.

They can be viewed directly within the MLflow UI.


Core Components

Evaluation datasets are composed of several key elements that work together to provide comprehensive test management:

Dataset Records

Individual test cases containing inputs (what goes into your model), expectations (what should come out), optional outputs (what your application actually returned), and metadata about the record's source, plus tags for organization.

Schema & Profile

Automatically computed structure and statistics of your dataset. Schema tracks field names and types across records, while profile provides statistical summaries.

Expectations

Ground truth values and quality criteria that define correct behavior. These are the standards against which your model's outputs are evaluated.

Experiment Association

Links to MLflow experiments enable tracking which datasets were used for which model evaluations, providing full lineage and organizational control.

Dataset Object Schema

The mlflow.entities.EvaluationDataset object contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `dataset_id` | `str` | Unique identifier for the dataset (format: `d-{32 hex chars}`) |
| `name` | `str` | Human-readable name for the dataset |
| `digest` | `str` | Content hash for data integrity verification |
| `records` | `list[DatasetRecord]` | The actual test data records containing inputs and expectations |
| `schema` | `Optional[str]` | JSON string describing the structure of records (automatically computed) |
| `profile` | `Optional[str]` | JSON string containing statistical information about the dataset |
| `tags` | `dict[str, str]` | Key-value pairs for organizing and categorizing datasets |
| `experiment_ids` | `list[str]` | List of MLflow experiment IDs this dataset is associated with |
| `created_time` | `int` | Timestamp when the dataset was created (milliseconds) |
| `last_update_time` | `int` | Timestamp of the last modification (milliseconds) |
| `created_by` | `Optional[str]` | User who created the dataset (auto-detected from tags) |
| `last_updated_by` | `Optional[str]` | User who last modified the dataset |

Record Structure

Each record in an evaluation dataset represents a single test case with the following structure:

```json
{
  "inputs": {
    "question": "What is the capital of France?",
    "context": "France is a country in Western Europe",
    "temperature": 0.7
  },
  "outputs": {
    "answer": "The capital of France is Paris."
  },
  "expectations": {
    "name": "expected_answer",
    "value": "Paris"
  },
  "source": {
    "source_type": "HUMAN",
    "source_data": {
      "annotator": "geography_expert@company.com",
      "annotation_date": "2024-08-07"
    }
  },
  "tags": {
    "category": "geography",
    "difficulty": "easy",
    "validated": "true"
  }
}
```

Record Fields

  • inputs (required): The test input data that will be passed to your model or application
  • outputs (optional): The actual outputs generated by your model (typically used for post-hoc evaluation)
  • expectations (optional): The expected outputs or quality criteria that define correct behavior
  • source (optional): Provenance information about how this record was created (automatically inferred if not provided)
  • tags (optional): Metadata specific to this individual record for organization and filtering
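The required/optional split above can be sketched as a minimal validator. This is an illustrative sketch only, not MLflow's actual validation logic, and the `validate_record` helper is hypothetical:

```python
from typing import Any

# Only "inputs" is required; the rest are optional per the record structure above.
OPTIONAL_FIELDS = {"outputs", "expectations", "source", "tags"}

def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems with a candidate dataset record (empty if valid)."""
    problems = []
    if "inputs" not in record or not isinstance(record.get("inputs"), dict):
        problems.append("'inputs' is required and must be a dict")
    unknown = set(record) - OPTIONAL_FIELDS - {"inputs"}
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    return problems

record = {
    "inputs": {"question": "What is the capital of France?"},
    "expectations": {"expected_answer": "Paris"},
}
print(validate_record(record))  # [] — inputs present, all other fields optional
```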

Record Identity and Deduplication

Records are uniquely identified by a hash of their inputs. When merging records with mlflow.entities.EvaluationDataset.merge_records(), if a record with identical inputs already exists, its expectations and tags are merged rather than creating a duplicate. This enables iterative refinement of test cases without data duplication.
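The merge behavior can be illustrated with a small, self-contained sketch. This mirrors the documented behavior (hash the inputs, merge expectations and tags on collision) but is not MLflow's actual implementation:

```python
import hashlib
import json

def input_hash(inputs: dict) -> str:
    # Stable hash of the inputs dict: identical inputs yield an identical key.
    return hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()

def merge_records(existing: dict, new_records: list[dict]) -> dict:
    """Merge records keyed by input hash; on collision, merge expectations/tags."""
    for rec in new_records:
        key = input_hash(rec["inputs"])
        if key in existing:
            existing[key].setdefault("expectations", {}).update(rec.get("expectations", {}))
            existing[key].setdefault("tags", {}).update(rec.get("tags", {}))
        else:
            existing[key] = rec
    return existing

ds = {}
merge_records(ds, [{"inputs": {"q": "capital of France?"},
                    "expectations": {"answer": "Paris"}}])
# Same inputs again: no duplicate is created; the tags are merged in.
merge_records(ds, [{"inputs": {"q": "capital of France?"},
                    "tags": {"validated": "true"}}])
print(len(ds))  # 1
```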

Schema Evolution

Dataset schemas automatically evolve as you add records with new fields. The schema property tracks all field names and types encountered across records, while profile maintains statistical summaries. This automatic adaptation means you can start with simple test cases and progressively add complexity without manual schema migrations.

When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
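Conceptually, schema evolution amounts to taking the union of field names and value types seen so far. A simplified sketch (not MLflow's actual schema computation, which is stored as a JSON string on the dataset):

```python
def evolve_schema(schema: dict[str, set], records: list[dict]) -> dict[str, set]:
    """Track every input field name and the Python types seen for it across records."""
    for rec in records:
        for field, value in rec.get("inputs", {}).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

schema = {}
evolve_schema(schema, [{"inputs": {"question": "What is MLflow?"}}])
# A later record introduces a new field; the schema grows without a migration.
evolve_schema(schema, [{"inputs": {"question": "What is MLflow?", "temperature": 0.7}}])
print(sorted(schema))  # ['question', 'temperature']
```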

Next Steps