mlflow.data
The mlflow.data module helps you record your model training and evaluation datasets to runs with MLflow Tracking, as well as retrieve dataset information from runs. It provides the following important interfaces:
- Dataset: Represents a dataset used in model training or evaluation, including features, targets, predictions, and metadata such as the dataset’s name, digest (hash), schema, profile, and source. You can log this metadata to a run in MLflow Tracking using the mlflow.log_input() API. mlflow.data provides APIs for constructing Datasets from a variety of Python data objects, including Pandas DataFrames (mlflow.data.from_pandas()), NumPy arrays (mlflow.data.from_numpy()), Spark DataFrames (mlflow.data.from_spark() / mlflow.data.load_delta()), and more.
- DatasetSource: Represents the source of a dataset. For example, this may be a directory of files stored in S3, a Delta Table, or a web URL. Each Dataset references the source from which it was derived. A Dataset’s features and targets may differ from the source if transformations and filtering were applied. You can get the DatasetSource of a dataset logged to a run in MLflow Tracking using the mlflow.data.get_source() API.
The following example demonstrates how to use mlflow.data to log a training dataset to a run, retrieve information about the dataset from the run, and load the dataset’s source.
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# Construct a Pandas DataFrame using red wine quality data from a web URL.
# The file is semicolon-delimited, so sep=";" is needed to split the columns
dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(dataset_source_url, sep=";")
# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the web URL
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(df, source=dataset_source_url)

with mlflow.start_run():
    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

# Load the dataset's source, which downloads the content from the source URL to the local
# filesystem
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.load()
- class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)
Bases: object
Represents a dataset for use with MLflow Tracking, including the name, digest (hash), schema, and profile of the dataset as well as source information (e.g. the S3 bucket or managed Delta table from which the dataset was derived). Most datasets expose features and targets for training and evaluation as well.
- abstract property profile
Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.
- abstract property schema
Optional dataset schema, such as an instance of mlflow.types.Schema representing the features and targets of the dataset.
- property source
Information about the dataset’s source, represented as an instance of DatasetSource. For example, this may be the S3 location or the name of the managed Delta Table from which the dataset was derived.
- to_dict() → dict
Create config dictionary for the dataset.
Subclasses should override this method to provide additional fields in the config dict, e.g., schema, profile, etc.
Returns a string dictionary containing the following fields: name, digest, source, source type.
- class mlflow.data.dataset_source.DatasetSource
Bases: object
Represents the source of a dataset used in MLflow Tracking, providing information such as cloud storage location, delta table name / version, etc.
- from_json(cls, source_json: str) → DatasetSource
- abstract classmethod from_dict(source_dict: dict) → mlflow.data.dataset_source.DatasetSource
Constructs an instance of the DatasetSource from a dictionary representation.
- Parameters
source_dict – A dictionary representation of the DatasetSource.
- Returns
A DatasetSource instance.
- abstract load() → Any
Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.
- Returns
The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.
- abstract to_dict() → dict
Obtains a JSON-compatible dictionary representation of the DatasetSource.
- Returns
A JSON-compatible dictionary representation of the DatasetSource.
- to_json() → str
Obtains a JSON string representation of the DatasetSource.
- Returns
A JSON string representation of the DatasetSource.
- mlflow.data.get_source(dataset: Union[Dataset, DatasetInput, mlflow.data.dataset.Dataset]) → mlflow.data.dataset_source.DatasetSource
Obtains the source of the specified dataset or dataset input.
- Parameters
dataset – An instance of mlflow.data.dataset.Dataset, mlflow.entities.Dataset, or mlflow.entities.DatasetInput.
- Returns
An instance of DatasetSource.
pandas
- mlflow.data.from_pandas(df: pandas.core.frame.DataFrame, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None, predictions: Optional[str] = None) → mlflow.data.pandas_dataset.PandasDataset
Constructs a PandasDataset instance from a Pandas DataFrame, optional targets, optional predictions, and source.
- Parameters
df – A Pandas DataFrame.
source – The source from which the DataFrame was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_pandas is being called.
targets – An optional target column name for supervised training. This column must be present in the dataframe (df).
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
predictions – An optional predictions column name for model evaluation. This column must be present in the dataframe (df).
import mlflow
import pandas as pd

x = pd.DataFrame(
    [["tom", 10, 1, 1], ["nick", 15, 0, 1], ["july", 14, 1, 1]],
    columns=["Name", "Age", "Label", "ModelOutput"],
)
dataset = mlflow.data.from_pandas(x, targets="Label", predictions="ModelOutput")
- class mlflow.data.pandas_dataset.PandasDataset
Represents a Pandas DataFrame for use with MLflow Tracking.
- property predictions
The name of the predictions column. May be None if no predictions column is available.
- property schema
An instance of mlflow.types.Schema representing the tabular dataset. May be None if the schema cannot be inferred from the dataset.
- to_dict() → dict
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
NumPy
- mlflow.data.from_numpy(features: Union[numpy.ndarray, dict], source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets: Optional[Union[numpy.ndarray, dict]] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.numpy_dataset.NumpyDataset
Constructs a NumpyDataset object from NumPy features, optional targets, and source. If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.
- Parameters
features – NumPy features, represented as an np.ndarray or dictionary of named np.ndarrays.
source – The source from which the NumPy data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. source may be specified as a URI, a path-like string, or an instance of DatasetSource. If unspecified, the source is assumed to be the code location (e.g. notebook cell, script, etc.) where from_numpy is being called.
targets – Optional NumPy targets, represented as an np.ndarray or dictionary of named np.ndarrays.
name – The name of the dataset. If unspecified, a name is generated.
digest – The dataset digest (hash). If unspecified, a digest is computed automatically.
import mlflow
import numpy as np

x = np.random.uniform(size=[2, 5, 4])
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)

import mlflow
import numpy as np

x = {
    "feature_1": np.random.uniform(size=[2, 5, 4]),
    "feature_2": np.random.uniform(size=[2, 5, 4]),
}
y = np.random.randint(2, size=[2])
dataset = mlflow.data.from_numpy(x, targets=y)
Spark
- mlflow.data.load_delta(path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset
Loads a SparkDataset from a Delta table for use with MLflow Tracking.
- Parameters
path – The path to the Delta table. Either path or table_name must be specified.
table_name – The name of the Delta table. Either path or table_name must be specified.
version – The Delta table version. If not specified, the version will be inferred.
targets – Optional. The name of the Delta table column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
- Returns
An instance of SparkDataset.
- mlflow.data.from_spark(df: pyspark.sql.DataFrame, path: Optional[str] = None, table_name: Optional[str] = None, version: Optional[str] = None, sql: Optional[str] = None, targets: Optional[str] = None, name: Optional[str] = None, digest: Optional[str] = None, predictions: Optional[str] = None) → mlflow.data.spark_dataset.SparkDataset
Given a Spark DataFrame, constructs a SparkDataset object for use with MLflow Tracking.
- Parameters
df – The Spark DataFrame from which to construct a SparkDataset.
path – The path of the Spark or Delta source that the DataFrame originally came from. Note that the path does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
table_name – The name of the Spark or Delta table that the DataFrame originally came from. Note that the table does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
version – If the DataFrame originally came from a Delta table, specifies the version of the Delta table. This is used to reload the dataset upon request via SparkDataset.source.load(). version cannot be specified if sql is specified.
sql – The Spark SQL statement that was originally used to construct the DataFrame. Note that the Spark SQL statement does not have to match the DataFrame exactly, since the DataFrame may have been modified by Spark operations. This is used to reload the dataset upon request via SparkDataset.source.load(). If none of path, table_name, or sql are specified, a CodeDatasetSource is used, which will source information from the run context.
targets – Optional. The name of the DataFrame column containing targets (labels) for supervised learning.
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
predictions – Optional. The name of the column containing model predictions, if the dataset contains model predictions. If specified, this column must be present in the dataframe (df).
- Returns
An instance of SparkDataset.
- class mlflow.data.spark_dataset.SparkDataset
Represents a Spark dataset (e.g. data derived from a Spark Table / file directory or Delta Table) for use with MLflow Tracking.
- property predictions
The name of the predictions column. May be None if no predictions column was specified when the dataset was created.
- property source
Spark dataset source information.
- Returns
An instance of SparkDatasetSource or DeltaDatasetSource.
- property targets
The name of the Spark DataFrame column containing targets (labels) for supervised learning.
- Returns
The string name of the Spark DataFrame column containing targets.
- to_dict() → dict
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
Hugging Face
- mlflow.data.huggingface_dataset.from_huggingface(ds, path: Optional[str] = None, targets: Optional[str] = None, data_dir: Optional[str] = None, data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None, revision=None, name: Optional[str] = None, digest: Optional[str] = None, trust_remote_code: Optional[bool] = None) → mlflow.data.huggingface_dataset.HuggingFaceDataset
Creates a mlflow.data.huggingface_dataset.HuggingFaceDataset from a Hugging Face dataset.
- Parameters
ds – A Hugging Face dataset. Must be an instance of datasets.Dataset. Other types, such as datasets.DatasetDict, are not supported.
path – The path of the Hugging Face dataset used to construct the source. This is the same argument as path in the datasets.load_dataset() function. To be able to reload the dataset via MLflow, path must match the path of the dataset on the hub, e.g., “databricks/databricks-dolly-15k”. If no path is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – The name of the Hugging Face datasets.Dataset column containing targets (labels) for supervised learning.
data_dir – The data_dir of the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
data_files – Paths to source data file(s) for the Hugging Face dataset configuration. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
revision – Version of the dataset script to load. This is used by the datasets.load_dataset() function to reload the dataset upon request via HuggingFaceDataset.source.load().
name – The name of the dataset. E.g. “wiki_train”. If unspecified, a name is automatically generated.
digest – The digest (hash, fingerprint) of the dataset. If unspecified, a digest is automatically computed.
trust_remote_code – Whether to trust remote code from the dataset repo.
- class mlflow.data.huggingface_dataset.HuggingFaceDataset
Represents a Hugging Face dataset for use with MLflow Tracking.
- property ds
The Hugging Face datasets.Dataset instance.
- Returns
The Hugging Face datasets.Dataset instance.
- property profile
Summary statistics for the Hugging Face dataset, including the number of rows, size, and size in bytes.
- property targets
The name of the Hugging Face dataset column containing targets (labels) for supervised learning.
- Returns
The string name of the Hugging Face dataset column containing targets.
- to_dict() → dict
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
- to_evaluation_dataset(path=None, feature_names=None) → mlflow.data.evaluation_dataset.EvaluationDataset
Converts the dataset to an EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().
TensorFlow
- mlflow.data.tensorflow_dataset.from_tensorflow(features, source: Optional[Union[str, mlflow.data.dataset_source.DatasetSource]] = None, targets=None, name: Optional[str] = None, digest: Optional[str] = None) → mlflow.data.tensorflow_dataset.TensorFlowDataset
Constructs a TensorFlowDataset object from TensorFlow data, optional targets, and source.
If the source is path-like, then this will construct a DatasetSource object from the source path. Otherwise, the source is assumed to be a DatasetSource object.
- Parameters
features – A TensorFlow dataset or tensor of features.
source – The source from which the data was derived, e.g. a filesystem path, an S3 URI, an HTTPS URL, a Delta table name with version, or a Spark table. If source is not a path-like string, pass in a DatasetSource object directly. If no source is specified, a CodeDatasetSource is used, which will source information from the run context.
targets – A TensorFlow dataset or tensor of targets. Optional.
name – The name of the dataset. If unspecified, a name is generated.
digest – A dataset digest (hash). If unspecified, a digest is computed automatically.
- class mlflow.data.tensorflow_dataset.TensorFlowDataset
Represents a TensorFlow dataset for use with MLflow Tracking.
- to_dict() → dict
Create config dictionary for the dataset.
Returns a string dictionary containing the following fields: name, digest, source, source type, schema, and profile.
- to_evaluation_dataset(path=None, feature_names=None) → mlflow.data.evaluation_dataset.EvaluationDataset
Converts the dataset to an EvaluationDataset for model evaluation. Only supported if the dataset is a Tensor. Required for use with mlflow.evaluate().
- class mlflow.data.evaluation_dataset.EvaluationDataset
An input dataset for model evaluation. This is intended for use with the mlflow.models.evaluate() API.
Dataset Sources
- class mlflow.data.filesystem_dataset_source.FileSystemDatasetSource
Represents the source of a dataset stored on a filesystem, e.g. a local UNIX filesystem, blob storage services like S3, etc.
- abstract classmethod from_dict(source_dict: dict) → mlflow.data.filesystem_dataset_source.FileSystemDatasetSource
- Parameters
source_dict – A dictionary representation of the FileSystemDatasetSource.
- abstract load(dst_path=None) → str
Downloads the dataset source to the local filesystem.
- Parameters
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem, unless the dataset source already exists on the local filesystem, in which case its local path is returned directly.
- Returns
The path to the downloaded dataset source on the local filesystem.
- abstract to_dict() → dict
- Returns
A JSON-compatible dictionary representation of the FileSystemDatasetSource.
- class mlflow.data.http_dataset_source.HTTPDatasetSource
Represents the source of a dataset stored at a web location and referred to by an HTTP or HTTPS URL.
- classmethod from_dict(source_dict: dict) → mlflow.data.http_dataset_source.HTTPDatasetSource
- Parameters
source_dict – A dictionary representation of the HTTPDatasetSource.
- load(dst_path=None) → str
Downloads the dataset source to the local filesystem.
- Parameters
dst_path – Path of the local filesystem destination directory to which to download the dataset source. If the directory does not exist, it is created. If unspecified, the dataset source is downloaded to a new uniquely-named directory on the local filesystem.
- Returns
The path to the downloaded dataset source on the local filesystem.
- to_dict() → dict
- Returns
A JSON-compatible dictionary representation of the HTTPDatasetSource.
- class mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource
Represents the source of a Hugging Face dataset used in MLflow Tracking.
- classmethod from_dict(source_dict: dict) → mlflow.data.huggingface_dataset_source.HuggingFaceDatasetSource
Constructs an instance of the DatasetSource from a dictionary representation.
- Parameters
source_dict – A dictionary representation of the DatasetSource.
- Returns
A DatasetSource instance.
- load(**kwargs)
Load the Hugging Face dataset based on HuggingFaceDatasetSource.
- Parameters
kwargs – Additional keyword arguments used for loading the dataset with the Hugging Face datasets.load_dataset() method.
- Returns
An instance of datasets.Dataset.
- to_dict() → dict
Obtains a JSON-compatible dictionary representation of the DatasetSource.
- Returns
A JSON-compatible dictionary representation of the DatasetSource.
- class mlflow.data.delta_dataset_source.DeltaDatasetSource
Represents the source of a dataset stored in a Delta table.
- classmethod from_dict(source_dict: dict) → mlflow.data.delta_dataset_source.DeltaDatasetSource
Constructs an instance of the DatasetSource from a dictionary representation.
- Parameters
source_dict – A dictionary representation of the DatasetSource.
- Returns
A DatasetSource instance.
- load(**kwargs)
Loads the dataset source as a Delta Dataset Source.
- Returns
An instance of pyspark.sql.DataFrame.
- to_dict() → dict
Obtains a JSON-compatible dictionary representation of the DatasetSource.
- Returns
A JSON-compatible dictionary representation of the DatasetSource.
- class mlflow.data.spark_dataset_source.SparkDatasetSource
Represents the source of a dataset stored in a Spark table.
- classmethod from_dict(source_dict: dict) → mlflow.data.spark_dataset_source.SparkDatasetSource
Constructs an instance of the DatasetSource from a dictionary representation.
- Parameters
source_dict – A dictionary representation of the DatasetSource.
- Returns
A DatasetSource instance.
- load(**kwargs)
Loads the dataset source as a Spark Dataset Source.
- Returns
An instance of pyspark.sql.DataFrame.
- to_dict() → dict
Obtains a JSON-compatible dictionary representation of the DatasetSource.
- Returns
A JSON-compatible dictionary representation of the DatasetSource.