Announcing MLflow Enhancements - Deep Learning with MLflow (Part 1)

March 5, 2024 · 6 min read

Abe Omorogbe

Product Manager, ML at Databricks

Hubert Zub

Software Engineer at Databricks

Yun Park

Software Engineer at Databricks

Chen Qian

Software Engineer at Databricks

Jesse Chan

Software Engineer at Databricks

As generative AI takes center stage, the machine learning landscape is changing at an unprecedented pace. There has been a surge in the use of cutting-edge deep learning (DL) libraries like Transformers, TensorFlow, and PyTorch to fine-tune generative AI models for better performance. As this trend accelerates, the tools used to build these models must evolve just as quickly, particularly when it comes to managing and optimizing deep learning workloads. MLflow offers a practical solution for managing the complexity of these machine learning projects.

In collaboration with MosaicML and the broader ML community, MLflow is thrilled to unveil a set of eagerly awaited enhancements. This latest release (MLflow 2.11) introduces updated tracking UI capabilities in direct response to the feedback and needs of MLflow enthusiasts. These updates are not just incremental; they represent a leap forward in addressing the needs of MLflow users doing Deep Learning.

The evolution of enhanced Deep Learning capabilities is a testament to MLflow's commitment to serving the open-source community, ensuring that its offerings are not just keeping pace, but setting the pace in the rapidly evolving domain of machine learning.

Deep Learning API Improvements

Drawing on feedback from our user community, we've made two key improvements to logging: metric logging now scales further, with support for more iterations per run, and system metrics can be logged alongside your training metrics.

System Metrics

This feature allows you to monitor system metrics and identify hardware issues that might be impacting performance. Metrics such as CPU utilization, memory usage, and disk usage from all nodes in your cluster can now be logged and visualized within the MLflow UI.

Improved Logging Performance

We recently introduced both asynchronous and batch logging, making it easier to log both parallel and distributed DL training sessions. Additionally, the MLflow Client now supports up to 1 million steps (iterations) when logging metrics, allowing users to log more steps during long-running DL jobs.

Checkpointing for Deep Learning

MLflow autologging for TensorFlow and PyTorch now supports model weight checkpointing.
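
A hedged sketch of what enabling this can look like for PyTorch follows. It assumes MLflow 2.11 or later and a Lightning-based training loop, since the PyTorch autologging integration hooks into Lightning trainers. The helper name enable_checkpointing and the choice of val_loss as the monitored metric are illustrative, not part of the announcement.

```python
import importlib.util

# Checkpoint-related keyword arguments for mlflow.pytorch.autolog()
# (assumes MLflow 2.11 or later).
CHECKPOINT_OPTIONS = {
    "checkpoint": True,                    # save checkpoints during training
    "checkpoint_monitor": "val_loss",      # metric used to rank checkpoints
    "checkpoint_mode": "min",              # lower val_loss is better
    "checkpoint_save_best_only": True,     # keep only the best checkpoint
    "checkpoint_save_weights_only": False, # save the full model state
    "checkpoint_save_freq": "epoch",       # evaluate once per epoch
}


def enable_checkpointing() -> bool:
    """Hypothetical helper: turn on checkpointing if Lightning is installed."""
    has_lightning = any(
        importlib.util.find_spec(name) is not None
        for name in ("lightning", "pytorch_lightning")
    )
    if not has_lightning:
        return False

    import mlflow.pytorch

    mlflow.pytorch.autolog(**CHECKPOINT_OPTIONS)
    return True
```

Call enable_checkpointing() before trainer.fit(...). mlflow.tensorflow.autolog() accepts the same checkpoint options for Keras models.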

User Experience and Productivity Enhancements

We have made substantial improvements to user experience and feature organization. These include an intuitive redesign of the run details page, chart groups, and metric aggregation, all aimed at simplifying navigation and improving productivity, especially for Deep Learning use cases.

Metric Aggregation

We've enhanced the UI with metric aggregation, enabling you to aggregate metrics across multiple runs based on datasets, tags, or parameters. This significantly reduces the time it takes to understand training results when working with large DL models, and enables more nuanced analysis of overall trends in model performance across multiple dimensions.

Chart Grouping Functionality

You can now easily categorize and organize your metrics, such as training, testing, and system metrics, into named groups within the MLflow UI. This organization gives a comprehensive overview of all metrics, with quicker access and better management, particularly when handling experiments with many metrics.

Slash ("/") Logging Syntax

To further streamline metric organization, we've implemented a new logging syntax that uses slashes ("/") to group metrics. For example, using mlflow.log_metric("x/y/score", 100) helps in structuring and segregating different types of data or metrics into hierarchical groups, making it easier to navigate and interpret the logs, especially when dealing with complex models and experiments.

mlflow.log_metric('SVR/val_MAPE', mean_absolute_percentage_error(test_y, pred_y))

Chart Searching

We've significantly enhanced the search functionality within our platform, enabling more robust and intuitive searching across charts, parameters, and metrics. This upgrade allows for quicker and more precise retrieval of specific data points, streamlining the process of analyzing and comparing different aspects of your experiments.

Run Details Redesign

We reorganized the Run Details UI into a modular tabbed layout and added new drag-and-drop functionality, so you can now render logged tables. This enhancement makes it easier to organize your runs and experiments.

Getting Started Updates

Following extensive feedback from our user community, we've introduced significant updates to enhance the getting started documentation within MLflow. These updates include a comprehensive overhaul of our documentation for easier navigation and enriched guidance, along with a streamlined login API. These enhancements, reflecting our commitment to improving the user experience and workflow, aim to empower our users to achieve more with greater speed and ease.

New Tutorials and Docs

We've overhauled our documentation to offer a more comprehensive, user-friendly experience with practical examples to support both newcomers and experienced practitioners with the information they need to start a Deep Learning project.

Seamless login with mlflow.login()

We've streamlined our authentication process. The mlflow.login() method provides a simple way to connect MLflow to your tracking server without having to leave your development environment. Try it out today.

Get Started Today

Dive into the latest MLflow updates today and enhance the way you manage your machine learning projects! With our newest enhancements, including advanced metric aggregation, automatic capturing of system metrics, intuitive feature grouping, and streamlined search capabilities, MLflow is here to elevate your data science workflow to new heights. Get started now with MLflow's cutting-edge tools and features.

pip install mlflow==2.11

mlflow ui --port 8080

import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

mlflow.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
# MLflow triggers logging automatically upon model fitting
rf.fit(X_train, y_train)

Feedback

We value your input! Our feature roadmap prioritization is guided by feedback from the MLflow late 2023 survey, GitHub Issues, and Slack. Look out for our next survey later this year; by participating, you can help ensure that the features you want are implemented in MLflow. You can also create an issue on GitHub or join our Slack.
