MLflow Dataset Tracking Tutorial
The mlflow.data
module is an integral part of the MLflow ecosystem, designed to enhance your machine learning workflow.
This module enables you to record and retrieve dataset information during model training and evaluation, leveraging MLflow’s tracking capabilities.
Key Interfaces
There are two main abstract components associated with the mlflow.data
module, Dataset
and DatasetSource
:
Dataset
The Dataset
abstraction is a metadata tracking object that holds the information about a given logged dataset.
The information stored within a Dataset
object includes features, targets, and predictions, along with
metadata like the dataset’s name, digest (hash), schema, and profile. You can log this metadata using the mlflow.log_input()
API.
The module provides functions to construct mlflow.data.dataset.Dataset
objects from various data types.
There are a number of concrete implementations of this abstract class, including:
The following example demonstrates how to construct a mlflow.data.pandas_dataset.PandasDataset
object from a Pandas DataFrame:
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")
# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)
DatasetSource
The DatasetSource
is a component of a given Dataset object, providing a linked lineage to the original source of the data.
The DatasetSource
component of a Dataset
represents the source of a dataset, such as a directory in S3, a Delta Table, or a URL.
It is referenced in the Dataset
for understanding the origin of the data. The DatasetSource
of a logged
dataset can be retrieved either by accessing the source
property of the Dataset
object, or through using the mlflow.data.get_source()
API.
Tip
Many of the supported autologging-enabled flavors within MLflow will automatically log the source of the dataset when logging the dataset itself.
Note
The example shown below is purely for instructive purposes, as logging a dataset outside of a training run is not a common practice.
Example Usage
The following example demonstrates how to use the log_inputs
API to log a training dataset, retrieve its information, and fetch the data source:
import mlflow
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")
# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)
# Log the Dataset to an MLflow run by using the `log_input` API
with mlflow.start_run() as run:
mlflow.log_input(dataset, context="training")
# Retrieve the run information
logged_run = mlflow.get_run(run.info.run_id)
# Retrieve the Dataset object
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset
# View some of the recorded Dataset information
print(f"Dataset name: {logged_dataset.name}")
print(f"Dataset digest: {logged_dataset.digest}")
print(f"Dataset profile: {logged_dataset.profile}")
print(f"Dataset schema: {logged_dataset.schema}")
The stdout results of the above code snippet are as follows:
Dataset name: wine quality - white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [
{"type": "double", "name": "fixed acidity"},
{"type": "double", "name": "volatile acidity"},
{"type": "double", "name": "citric acid"},
{"type": "double", "name": "residual sugar"},
{"type": "double", "name": "chlorides"},
{"type": "double", "name": "free sulfur dioxide"},
{"type": "double", "name": "total sulfur dioxide"},
{"type": "double", "name": "density"},
{"type": "double", "name": "pH"},
{"type": "double", "name": "sulphates"},
{"type": "double", "name": "alcohol"},
{"type": "long", "name": "quality"}
]}
We can navigate to the MLflow UI to see what this looks like for a logged Dataset as well.
When we want to load the dataset back from the location that it’s stored (calling load
will download the data locally), we
access the Dataset’s source via the following API:
# Loading the dataset's source
dataset_source = mlflow.data.get_source(logged_dataset)
local_dataset = dataset_source.load()
print(f"The local file where the data has been downloaded to: {local_dataset}")
# Load the data again
loaded_data = pd.read_csv(local_dataset, delimiter=";")
The print statement from above resolves to the local file that was created when calling load
.
The local file where the data has been downloaded to:
/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/tmpuxwtrul1/winequality-white.csv
Using Datasets with other MLflow Features
The mlflow.data
module serves the crucial role of associating datasets with MLflow runs. Aside from the obvious utility of having a record
associated with an MLflow run to the dataset that was used during training, there are some integrations within MLflow that allow for direct
usage of Datasets that have been logged with the mlflow.log_input()
API.
How to use a Dataset with MLflow evaluate
Note
The integration of Datasets with MLflow evaluate was introduced in MLflow 2.8.0. Previous versions do not have this functionality.
To see how this integration functions, let’s take a look at a fairly simple and typical classification task.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import xgboost
import mlflow
from mlflow.data.pandas_dataset import PandasDataset
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")
# Extract the features and target data separately
y = raw_data["quality"]
X = raw_data.drop("quality", axis=1)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=17
)
# Create a label encoder object
le = LabelEncoder()
# Fit and transform the target variable
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)
# Fit an XGBoost binary classifier on the training data split
model = xgboost.XGBClassifier().fit(X_train, y_train_encoded)
# Build the Evaluation Dataset from the test set
y_test_pred = model.predict(X=X_test)
eval_data = X_test
eval_data["label"] = y_test
# Assign the decoded predictions to the Evaluation Dataset
eval_data["predictions"] = le.inverse_transform(y_test_pred)
# Create the PandasDataset for use in mlflow evaluate
pd_dataset = mlflow.data.from_pandas(
eval_data, predictions="predictions", targets="label"
)
mlflow.set_experiment("White Wine Quality")
# Log the Dataset, model, and execute an evaluation run using the configured Dataset
with mlflow.start_run() as run:
mlflow.log_input(pd_dataset, context="training")
mlflow.xgboost.log_model(
artifact_path="white-wine-xgb", xgb_model=model, input_example=X_test
)
result = mlflow.evaluate(data=pd_dataset, predictions=None, model_type="classifier")
Note
Using the mlflow.evaluate()
API will automatically log the dataset used for the evaluation to the MLflow run. An explicit call to
log the input is not required.
Navigating to the MLflow UI, we can see how the Dataset, model, metrics, and a classification-specific confusion matrix are all logged to the run.