mlflow.data

The mlflow.data module helps you record your model training and evaluation datasets to runs with MLflow Tracking, as well as retrieve dataset information from runs. It provides the following important interfaces:

Dataset: Represents a dataset used in model training or evaluation, including features, targets, predictions, and metadata such as the dataset’s name, digest (hash), schema, profile, and source. You can log this metadata to a run in MLflow Tracking using the mlflow.log_input() API. mlflow.data provides APIs for constructing Datasets from a variety of Python data objects, including Pandas DataFrames (mlflow.data.from_pandas()), NumPy arrays (mlflow.data.from_numpy()), Spark DataFrames (mlflow.data.from_spark() / mlflow.data.load_delta()), and more.
DatasetSource: Represents the source of a dataset. For example, this may be a directory of files stored in S3, a Delta Table, or a web URL. Each Dataset references the source from which it was derived. A Dataset’s features and targets may differ from the source if transformations and filtering were applied. You can get the DatasetSource of a dataset logged to a run in MLflow Tracking using the mlflow.data.get_source() API.

The following example demonstrates how to use mlflow.data to log a training dataset to a run, retrieve information about the dataset from the run, and load the dataset’s source.

import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# Construct a Pandas DataFrame using wine quality data from a web URL
dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# The UCI wine quality CSV is semicolon-delimited
df = pd.read_csv(dataset_source_url, sep=";")
# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the web URL
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)

with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()

class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)[source]

Bases: object

Represents a dataset for use with MLflow Tracking, including the name, digest (hash), schema, and profile of the dataset as well as source information (e.g. the S3 bucket or managed Delta table from which the dataset was derived). Most datasets expose features and targets for training and evaluation as well.

property digest: A unique hash or fingerprint of the dataset, e.g. "498c7496".

property name: The name of the dataset, e.g. "iris_data", "myschema.mycatalog.mytable@v1", etc.

abstract property profile: Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.

abstract property schema: Optional dataset schema, such as an instance of mlflow.types.Schema representing the features and targets of the dataset.

property source: Information about the dataset’s source, represented as an instance of DatasetSource. For example, this may be the S3 location or the name of the managed Delta Table from which the dataset was derived.

to_dict() → dict[source]

Create config dictionary for the dataset.

Subclasses should override this method to provide additional fields in the config dict, e.g., schema, profile, etc.

Returns a string dictionary containing the following fields: name, digest, source, source type.

to_json() → str[source]

Obtains a JSON string representation of the Dataset.

Returns: A JSON string representation of the Dataset.

class mlflow.data.dataset_source.DatasetSource[source]

Bases: object

Represents the source of a dataset used in MLflow Tracking, providing information such as cloud storage location, delta table name / version, etc.

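classmethod from_json(source_json: str) → mlflow.data.dataset_source.DatasetSource [source]

Constructs an instance of the DatasetSource from a JSON string representation.

Parameters: source_json – A JSON string representation of the DatasetSource.
Returns: A DatasetSource instance.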

abstract classmethod from_dict(source_dict: dict) → mlflow.data.dataset_source.DatasetSource [source]

Constructs an instance of the DatasetSource from a dictionary representation.

Parameters: source_dict – A dictionary representation of the DatasetSource.
Returns: A DatasetSource instance.

abstract load() → Any[source]

Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.

Returns: The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.

abstract to_dict() → dict[source]

Obtains a JSON-compatible dictionary representation of the DatasetSource.

Returns: A JSON-compatible dictionary representation of the DatasetSource.

to_json() → str[source]

Obtains a JSON string representation of the DatasetSource.

Returns: A JSON string representation of the DatasetSource.
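The relationship between to_dict(), to_json(), from_dict(), and from_json() can be sketched without MLflow at all. The class below is a hypothetical stand-in, not an MLflow class: InMemorySource and its rows field are invented for illustration, and the sketch assumes that the JSON form is simply the JSON encoding of the dictionary form.

```python
import json
from typing import Any


class InMemorySource:
    """Hypothetical stand-in that mimics the DatasetSource serialization contract."""

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows

    def load(self) -> list[dict[str, Any]]:
        # A real DatasetSource would download files or read a table here.
        return self._rows

    def to_dict(self) -> dict[str, Any]:
        # Only JSON-compatible values belong in this dictionary.
        return {"rows": self._rows}

    def to_json(self) -> str:
        # Assumption: the JSON string is the encoded dictionary form.
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, source_dict: dict[str, Any]) -> "InMemorySource":
        return cls(rows=source_dict["rows"])

    @classmethod
    def from_json(cls, source_json: str) -> "InMemorySource":
        return cls.from_dict(json.loads(source_json))


source = InMemorySource([{"x": 1, "y": 0}, {"x": 2, "y": 1}])
restored = InMemorySource.from_json(source.to_json())
print(restored.load() == source.load())
```

The round trip only works if to_dict() emits JSON-compatible values, which is why the documented contract calls that out explicitly.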

mlflow.data.get_source(dataset: Union[mlflow.entities.Dataset, mlflow.entities.DatasetInput, mlflow.data.dataset.Dataset]) → mlflow.data.dataset_source.DatasetSource [source]

Obtains the source of the specified dataset or dataset input.

Parameters: dataset – An instance of mlflow.data.dataset.Dataset, mlflow.entities.Dataset, or mlflow.entities.DatasetInput.
Returns: An instance of DatasetSource.

pandas

mlflow.data.from_pandas(df: pandas.core.frame.DataFrame, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None, predictions: Optional[str] = None) → mlflow.data.pandas_dataset.PandasDataset [source]

Constructs a PandasDataset instance from a Pandas DataFrame, optional targets, optional predictions, and source.

Parameters

df – A Pandas DataFrame.
source – The source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_pandas is being called.
targets – An optional target column name for supervised training. This column must be present in the dataframe (df).
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
predictions – An optional predictions column name for model evaluation. This column must be present in the dataframe (df).

Example

import mlflow
import pandas as pd

x = pd.DataFrame(
    [["tom", 10, 1, 1], ["nick", 15, 0, 1], ["july", 14, 1, 1]],
    columns=["Name", "Age", "Label", "ModelOutput"],
)
dataset = mlflow.data.from_pandas(x, targets="Label", predictions="ModelOutput")

class mlflow.data.pandas_dataset.PandasDataset[source]

Represents a Pandas DataFrame for use with MLflow Tracking.

property df: The underlying pandas DataFrame.

property predictions: The name of the predictions column. May be None if no predictions column is available.

property profile: A profile of the dataset. May be None if a profile cannot be computed.

property schema: An instance of mlflow.types.Schema representing the tabular dataset. May be None if the schema cannot be inferred from the dataset.

property source: The source of the dataset.

property targets: The name of the target column. May be None if no target column is available.

to_dict() → dict[source]

Create config dictionary for the dataset.

Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.

NumPy

mlflow.data.from_numpy(features: Union[numpy.ndarray, dict], source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[Union[numpy.ndarray, dict]] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.numpy_dataset.NumpyDataset [source]

Constructs a NumpyDataset object from NumPy features, optional targets, and source. If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.

Parameters

features – NumPy features, represented as an np.ndarray or dictionary of named np.ndarrays.
source – The source from which the NumPy data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_numpy is being called.
targets – Optional NumPy targets, represented as an np.ndarray or dictionary of named np.ndarrays.
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.

Basic Example

import mlflow
import numpy as np

x = np.random.uniform(size=[2, 5, 4])
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)

Dict Example

import mlflow
import numpy as np

x = {
    "feature_1": np.random.uniform(size=[2, 5, 4]),
    "feature_2": np.random.uniform(size=[2, 5, 4]),
}
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)

class mlflow.data.numpy_dataset.NumpyDataset[source]

Represents a NumPy dataset for use with MLflow Tracking.

property features: The features of the dataset.

property profile: A profile of the dataset. May be None if a profile cannot be computed.

property schema: MLflow TensorSpec schema representing the dataset features and targets (optional).

property source: The source of the dataset.

property targets: The targets of the dataset. May be None if no targets are available.

to_dict() → dict[source]

Create config dictionary for the dataset.

Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.

Spark

mlflow.data.load_delta(path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset [source]

Loads a SparkDataset from a Delta table for use with MLflow Tracking.

Parameters

path – The path to the Delta table. Either path or table_name must be specified.
table_name – The name of the Delta table. Either path or table_name must be specified.
version – The Delta table version. If not specified, the version will be inferred.
targets – Optional. The name of the Delta table column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.

Returns

An instance of SparkDataset.

mlflow.data.from_spark(df: pyspark.sql.DataFrame, path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, sql: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None, predictions: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset [source]

Given a Spark DataFrame, constructs a SparkDataset object for use with MLflow Tracking.

Parameters

df – The Spark DataFrame from which to construct a SparkDataset.
path – The path of the Spark or Delta source that the DataFrame originally came from. Note that the path does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
table_name – The name of the Spark or Delta table that the DataFrame originally came from. Note that the table does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
version – If the DataFrame originally came from a Delta table, specifies the version of the Delta table. This is used to reload the dataset upon request via SparkDataset.source.load(). version cannot be specified if sql is specified.
sql – The Spark SQL statement that was originally used to construct the DataFrame. Note that the Spark SQL statement does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
targets – Optional. The name of the DataFrame column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
predictions – Optional. The name of the column containing model predictions, if the dataset contains model predictions. If specified, this column must be present in the dataframe (df).

Returns

An instance of SparkDataset.

class mlflow.data.spark_dataset.SparkDataset[source]

Represents a Spark dataset (e.g. data derived from a Spark Table / file directory or Delta Table) for use with MLflow Tracking.

property df

The Spark DataFrame instance.

Returns: The Spark DataFrame instance.

property predictions: The name of the predictions column. May be None if no predictions column was specified when the dataset was created.

property profile: A profile of the dataset. May be None if no profile is available.

property schema: The MLflow ColSpec schema of the Spark dataset.

property source

Spark dataset source information.

Returns: An instance of SparkDatasetSource or DeltaDatasetSource.

property targets

The name of the Spark DataFrame column containing targets (labels) for supervised learning.

Returns: The string name of the Spark DataFrame column containing targets.

to_dict() → dict[source]

Create config dictionary for the dataset.

Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.

Hugging Face

mlflow.data.huggingface_dataset.from_huggingface(ds, path: Optional[str] = None, targets: Optional[str] = None, data_dir: Optional[str] = None, data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None, revision=None, name: Optional[str] = None, digest: Optional[str] = None, trust_remote_code: Optional[bool] = None) → mlflow.data.huggingface_dataset.HuggingFaceDataset [source]

Create a mlflow.data.huggingface_dataset.HuggingFaceDataset from a Hugging Face dataset.

Parameters

ds – A Hugging Face dataset. Must be an instance of datasets.Dataset. Other types, such as datasets.DatasetDict, are not supported.
path – The path of the Hugging Face dataset used to construct the source. This is the same argument as path in the datasets.load_dataset() function. To be able to reload the dataset via MLflow, path must match the path of the dataset on the hub, e.g., “databricks/databricks-dolly-15k”. If no path is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – The name of the Hugging Face datasets.Dataset column containing targets (labels) for supervised learning.
data_dir – The data_dir of the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
data_files – Paths to source data file(s) for the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
revision – Version of the dataset script to load. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
trust_remote_code – Whether to trust remote code from the dataset repo.

class mlflow.data.huggingface_dataset.HuggingFaceDataset[source]

Represents a HuggingFace dataset for use with MLflow Tracking.

property ds

The Hugging Face datasets.Dataset instance.

Returns: The Hugging Face datasets.Dataset instance.

property profile: Summary statistics for the Hugging Face dataset, including the number of rows, size, and size in bytes.

property schema: The MLflow ColSpec schema of the Hugging Face dataset.

property source

Hugging Face dataset source information.

Returns: A mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource

property targets

The name of the Hugging Face dataset column containing targets (labels) for supervised learning.

Returns: The string name of the Hugging Face dataset column containing targets.

to_dict() → dict[source]

Create config dictionary for the dataset.

Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.

to_evaluation_dataset(path=None, feature_names=None) → mlflow.data.evaluation_dataset.EvaluationDataset [source]: Converts the dataset to an EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().

TensorFlow

mlflow.data.tensorflow_dataset.from_tensorflow(features, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.tensorflow_dataset.TensorFlowDataset [source]

Constructs a TensorFlowDataset object from TensorFlow data, optional targets, and source.

If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.

Parameters

features – A TensorFlow dataset or tensor of features.
source – The source from which the data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, a Spark table, etc. If source is not a path-like string, pass in a DatasetSource object directly. If no source is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – A TensorFlow dataset or tensor of targets. Optional.
name – The name of the dataset. If unspecified, a name is generated.
digest – A dataset digest (hash). If unspecified, a digest is computed automatically.

class mlflow.data.tensorflow_dataset.TensorFlowDataset[source]

Represents a TensorFlow dataset for use with MLflow Tracking.

property data: The underlying TensorFlow data.

property profile: A profile of the dataset. May be None if no profile is available.

property schema: An MLflow TensorSpec schema representing the tensor dataset.

property source: The source of the dataset.

property targets: The targets of the dataset.

to_dict() → dict[source]

Create config dictionary for the dataset.

Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.

to_evaluation_dataset(path=None, feature_names=None) → mlflow.data.evaluation_dataset.EvaluationDataset [source]: Converts the dataset to an EvaluationDataset for model evaluation. Only supported if the dataset is a Tensor. Required for use with mlflow.evaluate().

class mlflow.data.evaluation_dataset.EvaluationDataset[source]

An input dataset for model evaluation. This is intended for use with the mlflow.models.evaluate() API.

NUM_SAMPLE_ROWS_FOR_HASH = 5

SPARK_DATAFRAME_LIMIT = 10000

property feature_names

property features_data: Returns the features data as a numpy array or a pandas DataFrame.

property has_predictions: Returns True if the dataset has predictions, False otherwise.

property has_targets: Returns True if the dataset has targets, False otherwise.

property hash: Dataset hash, includes hash on first 20 rows and last 20 rows.

property labels_data: Returns the labels data as a numpy array.

property name: Dataset name, which is the user-specified name or, if no name was specified, the dataset hash.

property path: Dataset path.

property predictions_data: Returns the predictions data as a numpy array.

property predictions_name: Returns the predictions name.

property targets_name: Returns the targets name.

Dataset Sources

class mlflow.data.filesystem_dataset_source.FileSystemDatasetSource[source]

Represents the source of a dataset stored on a filesystem, e.g. a local UNIX filesystem, blob storage services like S3, etc.

abstract classmethod from_dict(source_dict: dict) → mlflow.data.filesystem_dataset_source.FileSystemDatasetSource [source]

Parameters: source_dict – A dictionary representation of the FileSystemDatasetSource.

abstract load(dst_path=None) → str[source]

Downloads the dataset source to the local filesystem.

Parameters: dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem, unless the dataset source already exists on the local filesystem, in which case its local path is returned directly.
Returns: The path to the downloaded dataset source on the local filesystem.

abstract to_dict() → dict[source]

Returns: A JSON-compatible dictionary representation of the FileSystemDatasetSource.

abstract property uri

The URI referring to the dataset source filesystem location.

Returns: The URI referring to the dataset source filesystem location, e.g. “s3://mybucket/path/to/mydataset”, “/tmp/path/to/my/dataset”, etc.

class mlflow.data.http_dataset_source.HTTPDatasetSource[source]

Represents the source of a dataset stored at a web location and referred to by an HTTP or HTTPS URL.

classmethod from_dict(source_dict: dict) → mlflow.data.http_dataset_source.HTTPDatasetSource [source]

Parameters: source_dict – A dictionary representation of the HTTPDatasetSource.

load(dst_path=None) → str[source]

Downloads the dataset source to the local filesystem.

Parameters: dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem.
Returns: The path to the downloaded dataset source on the local filesystem.

to_dict() → dict[source]

Returns: A JSON-compatible dictionary representation of the HTTPDatasetSource.

property url

The HTTP/S URL referring to the dataset source location.

Returns: The HTTP/S URL referring to the dataset source location.

class mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource[source]

Represents the source of a Hugging Face dataset used in MLflow Tracking.

classmethod from_dict(source_dict: dict) → mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource [source]

Constructs an instance of the DatasetSource from a dictionary representation.

Parameters: source_dict – A dictionary representation of the DatasetSource.
Returns: A DatasetSource instance.

load(**kwargs)[source]

Load the Hugging Face dataset based on HuggingFaceDatasetSource.

Parameters: kwargs – Additional keyword arguments used for loading the dataset with the Hugging Face datasets.load_dataset() method.
Returns: An instance of datasets.Dataset.

to_dict() → dict[source]

Obtains a JSON-compatible dictionary representation of the DatasetSource.

Returns: A JSON-compatible dictionary representation of the DatasetSource.

class mlflow.data.delta_dataset_source.DeltaDatasetSource[source]

Represents the source of a dataset stored in a Delta table.

property delta_table_id

property delta_table_name

property delta_table_version

classmethod from_dict(source_dict: dict) → mlflow.data.delta_dataset_source.DeltaDatasetSource [source]

Constructs an instance of the DatasetSource from a dictionary representation.

Parameters: source_dict – A dictionary representation of the DatasetSource.
Returns: A DatasetSource instance.

load(**kwargs)[source]

Loads the Delta table referenced by the dataset source as a Spark DataFrame.

Returns: An instance of pyspark.sql.DataFrame.

property path

to_dict() → dict[source]

Obtains a JSON-compatible dictionary representation of the DatasetSource.

Returns: A JSON-compatible dictionary representation of the DatasetSource.

class mlflow.data.spark_dataset_source.SparkDatasetSource[source]

Represents the source of a dataset stored in a Spark table.

classmethod from_dict(source_dict: dict) → mlflow.data.spark_dataset_source.SparkDatasetSource [source]

Constructs an instance of the DatasetSource from a dictionary representation.

Parameters: source_dict – A dictionary representation of the DatasetSource.
Returns: A DatasetSource instance.

load(**kwargs)[source]

Loads the data referenced by the dataset source as a Spark DataFrame.

Returns: An instance of pyspark.sql.DataFrame.

to_dict() → dict[source]

Obtains a JSON-compatible dictionary representation of the DatasetSource.

Returns: A JSON-compatible dictionary representation of the DatasetSource.