Managing Dependencies in MLflow Models
An MLflow Model is a standard format that packages a machine learning model with its dependencies and other metadata. Building a model with its dependencies allows for reproducibility and portability across a variety of platforms and tools.
When you create an MLflow Model using the MLflow Tracking APIs, for instance mlflow.pytorch.log_model(), MLflow automatically infers the required dependencies for the model flavor you're using and records them as part of the model metadata. Then, when you serve the model for prediction, MLflow automatically installs those dependencies in the environment. Therefore, you normally won't need to worry about managing dependencies in an MLflow Model.
However, in some cases, you may need to add or modify some dependencies. This page provides a high-level description of how MLflow manages dependencies and guidance for how to customize dependencies for your use case.
One tip for improving MLflow's dependency inference accuracy is to add an input_example when saving your model. This enables MLflow to perform a model prediction before saving the model, thereby capturing the dependencies used during the prediction. Please refer to Model Input Example for additional, detailed usage of this parameter.
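For instance, a minimal sketch of passing this parameter when logging a custom Python model (the Echo class and the sample data here are illustrative, not part of MLflow):

import mlflow
import pandas as pd


class Echo(mlflow.pyfunc.PythonModel):
    # Trivial model used only to illustrate the input_example parameter
    def predict(self, context, model_input, params=None):
        return model_input


# A small sample of the data the model will see at inference time.
# MLflow runs a prediction with it before saving, capturing the
# dependencies (e.g., pandas) actually used during that prediction.
input_example = pd.DataFrame({"sepal_length": [5.1], "sepal_width": [3.5]})

mlflow.pyfunc.log_model(
    python_model=Echo(),
    name="model",
    input_example=input_example,
)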
- How MLflow Records Model Dependencies
- Adding Extra Dependencies to an MLflow Model
- Defining All Dependencies by Yourself
- Saving Extra Code Dependencies with an MLflow Model - Automatic Inference
- Saving Extra Code with an MLflow Model - Manual Declaration
- Validating Environment for Prediction
- Troubleshooting
How MLflow Records Model Dependencies
An MLflow Model is saved within a specified directory with the following structure:
my_model/
├── MLmodel
├── model.pkl
├── conda.yaml
├── python_env.yaml
└── requirements.txt
Model dependencies are defined by the following files (for other files, please refer to the guidance provided in the section discussing Storage Format):
- python_env.yaml - Contains the information required to restore the model environment using virtualenv: (1) the Python version, (2) build tools such as pip, setuptools, and wheel, and (3) the pip requirements of the model (a reference to requirements.txt).
- requirements.txt - Defines the set of pip dependencies required to run the model.
- conda.yaml - Defines the conda environment required to run the model. This is used when you specify conda as the environment manager for restoring the model environment.
Please note that it is not recommended to edit these files manually to add or remove dependencies. They are automatically generated by MLflow and any change you make manually will be overwritten when you save the model again. Instead, you should use one of the recommended methods described in the following sections.
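If you want to inspect the recorded dependencies programmatically rather than opening these files by hand, one option is the mlflow.pyfunc.get_model_dependencies() API, sketched below with a placeholder model URI:

import mlflow

model_uri = "runs:/<run_id>/model"  # placeholder URI of a logged model

# Returns a local path to the model's requirements.txt
reqs_path = mlflow.pyfunc.get_model_dependencies(model_uri)
with open(reqs_path) as f:
    print(f.read())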
Locking Model Dependencies with MLFLOW_LOCK_MODEL_DEPENDENCIES
Available since MLflow 2.16.0
MLflow can automatically lock both direct and transitive model dependencies to their exact versions when logging a model. This ensures reproducibility by capturing the complete dependency tree at the time of model creation.
To enable dependency locking, set the MLFLOW_LOCK_MODEL_DEPENDENCIES environment variable:
export MLFLOW_LOCK_MODEL_DEPENDENCIES=true
When enabled, MLflow uses uv (if installed) to resolve and lock all dependencies, including their transitive dependencies. The locked requirements are then saved in the model's requirements.txt file.
Example without locking (default):
mlflow==2.9.2
scikit-learn==1.3.2
cloudpickle==3.0.0
Example with locking enabled:
mlflow==2.9.2
scikit-learn==1.3.2
cloudpickle==3.0.0
numpy==1.24.3
scipy==1.11.4
joblib==1.3.2
threadpoolctl==3.2.0
# ... other transitive dependencies
Dependency locking requires uv to be installed. If uv is not available, MLflow will skip the locking step and use standard dependency inference. Install uv with:
pip install uv
Locking dependencies can significantly increase the size of the requirements.txt file, as it includes all transitive dependencies. This provides better reproducibility but may make the environment more rigid.
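In environments where exporting a shell variable is inconvenient (e.g., a notebook), the same switch can be flipped from Python before logging; a minimal sketch:

import os

# Must be set before the model is logged so MLflow reads it at log time
os.environ["MLFLOW_LOCK_MODEL_DEPENDENCIES"] = "true"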
Example
The following shows an example of the environment files generated by MLflow when logging a model with mlflow.sklearn.log_model:
python_env.yaml
python: 3.9.8
build_dependencies:
- pip==23.3.2
- setuptools==69.0.3
- wheel==0.42.0
dependencies:
- -r requirements.txt
requirements.txt
mlflow==2.9.2
scikit-learn==1.3.2
cloudpickle==3.0.0
conda.yaml
name: mlflow-env
channels:
- conda-forge
dependencies:
- python=3.9.8
- pip
- pip:
- mlflow==2.9.2
- scikit-learn==1.3.2
- cloudpickle==3.0.0
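For reference, a minimal sketch of a logging call that could produce files like those above (the exact pinned versions depend on your local environment):

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    # MLflow infers mlflow, scikit-learn, and cloudpickle as requirements
    # and writes python_env.yaml, requirements.txt, and conda.yaml.
    mlflow.sklearn.log_model(sk_model=model, name="model")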
Adding Extra Dependencies to an MLflow Model
MLflow infers the dependencies required for the model flavor library, but your model may depend on other libraries, e.g., for data preprocessing. In this case, you can add extra dependencies to the model by specifying the extra_pip_requirements param when logging the model. For example,
import mlflow


class CustomModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # your model depends on pandas
        import pandas as pd

        ...
        return prediction


# Log the model
mlflow.pyfunc.log_model(
    python_model=CustomModel(),
    name="model",
    extra_pip_requirements=["pandas==2.0.3"],
    input_example=input_data,
)
The extra dependencies will be added to requirements.txt as follows (and similarly to conda.yaml):
mlflow==2.9.2
cloudpickle==3.0.0
pandas==2.0.3 # added
In this case, MLflow will install Pandas 2.0.3 in addition to the inferred dependencies when serving the model for prediction.
Once you log the model with dependencies, it is advisable to test it in a sandbox environment to avoid any dependency issues when deploying the model to production. Since MLflow 2.10.0, you can use the mlflow.models.predict() API to quickly test your model in a virtual environment. Please refer to Validating Environment for Prediction for more details.
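A minimal sketch of that validation step, assuming a model URI obtained from the log_model call above (the URI and input payload are placeholders):

import mlflow

model_uri = "runs:/<run_id>/model"  # placeholder

# Restores the model's recorded dependencies in a fresh virtual
# environment and runs a prediction there, surfacing dependency
# issues before deployment.
mlflow.models.predict(
    model_uri=model_uri,
    input_data=input_data,  # same illustrative payload as above
    env_manager="virtualenv",
)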
Defining All Dependencies by Yourself
Alternatively, you can define all dependencies from scratch rather than adding extra ones. To do so, specify pip_requirements when logging the model. For example,
import mlflow

# Log the model
mlflow.sklearn.log_model(
    sk_model=model,
    name="model",
    pip_requirements=[
        "mlflow-skinny==2.9.2",
        "cloudpickle==2.5.8",
        "scikit-learn==1.3.1",
    ],
)
The manually defined dependencies will override the default ones MLflow detects from the model flavor library:
mlflow-skinny==2.9.2
cloudpickle==2.5.8
scikit-learn==1.3.1
Please be careful when declaring dependencies that differ from those used during training, as doing so is dangerous and prone to unexpected behavior. The safest way to ensure consistency is to rely on the default dependencies inferred by MLflow.
Once you log the model with dependencies, it is advisable to test it in a sandbox environment to avoid any dependency issues when deploying the model to production. Since MLflow 2.10.0, you can use the mlflow.models.predict() API to quickly test your model in a virtual environment. Please refer to Validating Environment for Prediction for more details.
Saving Extra Code Dependencies with an MLflow Model - Automatic Inference
Automatic code dependency inference is currently supported for Python Function Models only. Support for additional named model flavors will be coming in future releases of MLflow.
In the MLflow 2.13.0 release, a new method of including custom dependent code was introduced that expands on the existing feature of declaring code_paths when saving or logging a model. This new feature utilizes import dependency analysis to automatically infer the code dependencies required by the model by checking which modules are imported within the references of a Python Model's definition.
To use this new feature, simply set the argument infer_code_paths (default False) to True when logging. When utilizing this method of dependency inference, you do not have to explicitly define file locations by declaring code_paths directory locations, as was required prior to MLflow 2.13.0.
An example of using this feature is shown below, where we log a model that contains an external dependency. In the first section, we define an external module named custom_code that exists in a different file than our model definition.
from typing import List

iris_types = ["setosa", "versicolor", "virginica"]


def map_iris_types(predictions: List[int]) -> List[str]:
    return [iris_types[pred] for pred in predictions]
With this custom_code.py module defined, it is ready for use in our Python Model:
from typing import Any, Dict, List, Optional

from custom_code import map_iris_types  # import the external reference

import mlflow


class FlowerMapping(mlflow.pyfunc.PythonModel):
    """Custom model with an external dependency"""

    def predict(
        self, context, model_input, params: Optional[Dict[str, Any]] = None
    ) -> List[str]:
        predictions = [pred % 3 for pred in model_input]

        # Call the external function
        return map_iris_types(predictions)


with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        name="flowers",
        python_model=FlowerMapping(),
        infer_code_paths=True,  # Enabling automatic code dependency inference
    )
With infer_code_paths set to True, the dependency on map_iris_types will be analyzed, its source declaration detected as originating in the custom_code.py module, and the code within custom_code.py will be stored along with the model artifact. Note that defining the external code dependency via the code_paths argument (discussed in the next section) is not needed.
Only modules that are within the current working directory are accessible. Dependency inference will not work across module boundaries or if your custom code is defined in an entirely different library. If your code base is structured in such a way that common modules are entirely external to the path that your model logging code is executing within, the original code_paths option is required in order to log these dependencies, as infer_code_paths dependency inference will not capture those requirements.
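In that scenario, as a brief preview of the manual declaration covered in the next section, the external files can be listed explicitly; a sketch reusing the example above:

import mlflow

mlflow.pyfunc.log_model(
    name="flowers",
    python_model=FlowerMapping(),  # class from the example above
    # Explicitly package the external module with the model artifact
    code_paths=["custom_code.py"],
)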
Restrictions with infer_code_paths
Before using dependency inference via infer_code_paths, ensure that your dependent code modules do not have sensitive data hard-coded within them (e.g., passwords, access tokens, or secrets). Code inference does not obfuscate sensitive information and will capture and log (save) the module regardless of what it contains.
An important aspect to note about code structure when using infer_code_paths is to avoid defining dependencies within a main entry point to your code. When a Python code file is loaded as the __main__ module, it cannot be inferred as a code path file. This means that if you run your script directly (e.g., using python script.py), the functions and classes defined in that script will be part of the __main__ module and not easily accessible by other modules.
If your model depends on these classes or functions, this can pose a problem because they are not part of the standard module namespace and are thus not straightforward to serialize. To handle this situation, you should use cloudpickle to serialize your model instance. cloudpickle is an extended version of Python's pickle module that can serialize a wider range of Python objects, including functions and classes defined in the __main__ module.
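A minimal sketch of that situation (a hypothetical script run directly with python script.py):

import cloudpickle
import mlflow


class MainModel(mlflow.pyfunc.PythonModel):
    # Defined in __main__ when this script runs directly, so it cannot
    # be captured as a code path file by infer_code_paths
    def predict(self, context, model_input, params=None):
        return model_input


if __name__ == "__main__":
    # cloudpickle serializes the class definition itself along with the
    # instance; the standard pickle module would only store a reference
    # to __main__.MainModel, which fails to load elsewhere.
    with open("model.pkl", "wb") as f:
        cloudpickle.dump(MainModel(), f)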