MLflow Dataset Tracking
The mlflow.data module is a comprehensive solution for dataset management throughout the machine learning lifecycle. It enables you to track, version, and manage datasets used in training, validation, and evaluation, providing complete lineage from raw data to model predictions.
Why Dataset Tracking Matters
Dataset tracking is essential for reproducible machine learning and provides several key benefits:
- Data Lineage: Track the complete journey from raw data sources to model inputs
- Reproducibility: Ensure experiments can be reproduced with identical datasets
- Version Control: Manage different versions of datasets as they evolve
- Collaboration: Share datasets and their metadata across teams
- Evaluation Integration: Seamlessly integrate with MLflow's evaluation capabilities
- Production Monitoring: Track datasets used in production inference and evaluation
Core Components
MLflow's dataset tracking revolves around two main abstractions:
Dataset
The Dataset abstraction is a metadata tracking object that holds comprehensive information about a logged dataset. The information stored within a Dataset object includes:
Core Properties:
- Name: Descriptive identifier for the dataset (defaults to "dataset" if not specified)
- Digest: Unique hash/fingerprint for dataset identification (automatically computed)
- Source: DatasetSource containing lineage information to the original data location
- Schema: Optional dataset schema (implementation-specific, e.g., MLflow Schema)
- Profile: Optional summary statistics (implementation-specific, e.g., row count, column stats)
Supported Dataset Types:
- PandasDataset: For Pandas DataFrames
- SparkDataset: For Apache Spark DataFrames
- NumpyDataset: For NumPy arrays
- PolarsDataset: For Polars DataFrames
- HuggingFaceDataset: For Hugging Face datasets
- TensorFlowDataset: For TensorFlow datasets
- MetaDataset: For metadata-only datasets (no actual data storage)
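Each of these types has a corresponding constructor in mlflow.data. As a minimal sketch for a NumPy dataset (the arrays, the source string, and the dataset name below are placeholder values for illustration):
import numpy as np
import mlflow.data

# Placeholder arrays standing in for real features and labels
features = np.random.rand(100, 4)
labels = np.random.randint(0, 2, size=100)

numpy_dataset = mlflow.data.from_numpy(
    features, targets=labels, source="synthetic_data.npy", name="numpy-demo"
)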
Special Dataset Types:
- EvaluationDataset: Internal dataset type used specifically with mlflow.evaluate() for model evaluation workflows
DatasetSource
The DatasetSource component provides linked lineage to the original source of the data, whether it's a file URL, S3 bucket, database table, or any other data source. This ensures you can always trace back to where your data originated.
The DatasetSource can be retrieved using the mlflow.data.get_source() API, which accepts instances of Dataset, DatasetEntity, or DatasetInput.
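A minimal sketch, assuming dataset is a Dataset object such as the one built in the quick start below; for file-backed sources (HTTP, S3, local files), the source can also be materialized locally:
import mlflow.data

# Retrieve the DatasetSource from a previously created Dataset object
source = mlflow.data.get_source(dataset)

# load() downloads file-backed data and returns a local filesystem path
local_path = source.load()
print(f"Dataset downloaded to: {local_path}")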
Quick Start: Basic Dataset Tracking
The examples below cover four patterns: a simple example, metadata-only datasets, data splits, and datasets with predictions.
Here's how to get started with basic dataset tracking:
import mlflow.data
import pandas as pd

# Load your data
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create a Dataset object
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine-quality-white", targets="quality"
)

# Log the dataset to an MLflow run
with mlflow.start_run():
    mlflow.log_input(dataset, context="training")

    # Your training code here
    # model = train_model(raw_data)
    # mlflow.sklearn.log_model(model, "model")
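After the run finishes, the logged dataset metadata can be read back from the tracking server. A minimal sketch, assuming run_id holds the ID of the run created above:
# run_id is assumed to be the ID of the run logged above
logged_run = mlflow.get_run(run_id)
dataset_input = logged_run.inputs.dataset_inputs[0]

print(f"Name: {dataset_input.dataset.name}")
print(f"Digest: {dataset_input.dataset.digest}")
print(f"Source type: {dataset_input.dataset.source_type}")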
For cases where you only want to log dataset metadata without the actual data:
import mlflow.data
from mlflow.data.meta_dataset import MetaDataset
from mlflow.data.http_dataset_source import HTTPDatasetSource
from mlflow.types import Schema, ColSpec, DataType

# Create a metadata-only dataset for a remote data source
source = HTTPDatasetSource(
    url="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
)

# Option 1: Simple metadata dataset
meta_dataset = MetaDataset(source=source, name="imdb-sentiment-dataset")

# Option 2: With schema information
schema = Schema(
    [
        ColSpec(type=DataType.string, name="text"),
        ColSpec(type=DataType.integer, name="label"),
    ]
)
meta_dataset_with_schema = MetaDataset(
    source=source, name="imdb-sentiment-dataset-with-schema", schema=schema
)

with mlflow.start_run():
    # Log metadata-only dataset (no actual data stored)
    mlflow.log_input(meta_dataset_with_schema, context="external_data")

    # The dataset reference and schema are logged, but not the data itself
    print(f"Logged dataset: {meta_dataset_with_schema.name}")
    print(f"Data source: {meta_dataset_with_schema.source}")
Use Cases for MetaDataset:
- Reference datasets hosted on external servers or cloud storage
- Large datasets where you only want to track metadata and lineage
- Datasets with restricted access where the actual data cannot be stored
- Public datasets available via URLs that don't need to be duplicated
Track training, validation, and test splits separately:
import mlflow.data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load and split your data
data = pd.read_csv("your_dataset.csv")
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Create dataset objects for each split
train_data = pd.concat([X_train, y_train], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

train_dataset = mlflow.data.from_pandas(
    train_data, source="your_dataset.csv", name="wine-quality-train", targets="target"
)
val_dataset = mlflow.data.from_pandas(
    val_data, source="your_dataset.csv", name="wine-quality-val", targets="target"
)
test_dataset = mlflow.data.from_pandas(
    test_data, source="your_dataset.csv", name="wine-quality-test", targets="target"
)

with mlflow.start_run():
    # Log all dataset splits
    mlflow.log_input(train_dataset, context="training")
    mlflow.log_input(val_dataset, context="validation")
    mlflow.log_input(test_dataset, context="testing")
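Logging each split under its own context also makes runs searchable by the data they used. A hedged sketch (it assumes dataset filter keys such as dataset.name and dataset.context, available in recent MLflow versions; verify against your version's search syntax):
# Find runs whose training input was the wine-quality-train split
runs = mlflow.search_runs(
    filter_string="dataset.name = 'wine-quality-train' AND dataset.context = 'training'"
)
print(runs[["run_id", "status"]])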
Track datasets that include model predictions for evaluation:
import mlflow.data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train a model (X_train, y_train, X_test, y_test come from the data-split example above)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Generate predictions
predictions = model.predict(X_test)
prediction_probs = model.predict_proba(X_test)[:, 1]

# Create evaluation dataset with predictions
eval_data = X_test.copy()
eval_data["target"] = y_test
eval_data["prediction"] = predictions
eval_data["prediction_proba"] = prediction_probs

# Create dataset with predictions specified
eval_dataset = mlflow.data.from_pandas(
    eval_data,
    source="your_dataset.csv",
    name="wine-quality-evaluation",
    targets="target",
    predictions="prediction",
)

with mlflow.start_run():
    mlflow.log_input(eval_dataset, context="evaluation")

    # This dataset can now be used directly with mlflow.evaluate()
    result = mlflow.evaluate(data=eval_dataset, model_type="classifier")
Dataset Information and Metadata
When you create a dataset, MLflow automatically captures rich metadata:
# Access dataset metadata
print(f"Dataset name: {dataset.name}")  # Defaults to "dataset" if not specified
print(f"Dataset digest: {dataset.digest}")  # Unique hash identifier (computed automatically)
print(f"Dataset source: {dataset.source}")  # DatasetSource object
print(f"Dataset profile: {dataset.profile}")  # Optional: implementation-specific statistics
print(f"Dataset schema: {dataset.schema}")  # Optional: implementation-specific schema
Example output:
Dataset name: wine-quality-white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [
{"type": "double", "name": "fixed acidity"},
{"type": "double", "name": "volatile acidity"},
...
{"type": "long", "name": "quality"}
]}
Dataset source: <DatasetSource object>
The profile and schema properties are implementation-specific and may vary depending on the dataset type (PandasDataset, SparkDataset, etc.). Some dataset types may return None for these properties.
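When code must handle arbitrary dataset types, guard against missing metadata before using it. A small illustrative helper (describe_dataset is a hypothetical name, not an MLflow API):
def describe_dataset(ds):
    # Print whatever metadata this dataset type provides; schema and profile may be None
    print(f"{ds.name} ({ds.digest})")
    if ds.profile is not None:
        print(f"Profile: {ds.profile}")
    if ds.schema is not None:
        print(f"Schema: {ds.schema}")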