
MLflow Model Serving

Transform your trained models into production-ready inference servers with MLflow's comprehensive serving capabilities. Deploy locally, in the cloud, or through managed endpoints with standardized REST APIs.

REST API Endpoints

Automatic generation of standardized REST endpoints for model inference with consistent request/response formats.

Multi-Framework Support

Serve models from any ML framework through MLflow's flavor system with unified deployment patterns.

Custom Applications

Build sophisticated serving applications with custom logic, preprocessing, and business rules.

Scalable Deployment

Deploy to various targets from local development servers to cloud platforms and Kubernetes clusters.

Quick Start

Get your model serving in minutes with these simple steps:

Choose your serving approach:

# Serve a logged model
mlflow models serve -m "models:/<model-id>" -p 5000

# Serve a registered model
mlflow models serve -m "models:/<model-name>/<model-version>" -p 5000

# Serve a model from local path
mlflow models serve -m ./path/to/model -p 5000

Your model will be available at http://localhost:5000
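
With the server running, you can send a prediction request straight to the /invocations endpoint. The snippet below is a minimal sketch using Python's requests library; the feature names and values are placeholders that you would replace with your model's actual inputs:

import requests

# Placeholder payload; replace columns/data with your model's actual features
payload = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": [[1.0, 2.0, 3.0]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    json=payload,  # sent as application/json
    timeout=30,
)
print(response.json())  # e.g. {"predictions": [...]}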

How Model Serving Works

MLflow transforms your trained models into production-ready HTTP servers through a carefully orchestrated process that handles everything from model loading to request processing.

Server Startup and Model Loading

When you run mlflow models serve, MLflow begins by analyzing your model's metadata to determine how to load it. Each model contains an MLmodel file that specifies which "flavor" it uses - whether it's scikit-learn, PyTorch, TensorFlow, or a custom PyFunc model.

MLflow downloads the model artifacts to a local directory and creates a FastAPI server with standardized endpoints. The server loads your model using the appropriate flavor-specific loading logic. For example, a scikit-learn model is loaded using pickle, while a PyTorch model loads its state dictionary and model class.
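
For illustration, an abridged MLmodel file for a scikit-learn model looks roughly like this; the exact fields vary by MLflow version and flavor:

# Abridged, illustrative MLmodel file; exact fields vary by version and flavor
artifact_path: model
flavors:
  python_function:              # generic PyFunc interface used by the serving layer
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.10.12
  sklearn:                      # framework-specific flavor
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.4.2
signature:
  inputs: '[{"type": "double", "name": "feature1"}, ...]'
  outputs: '[{"type": "long"}]'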

The server exposes four key endpoints:

  • POST /invocations - The main prediction endpoint
  • GET /ping and GET /health - Health checks for monitoring
  • GET /version - Returns server and model information
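
Once the server is running, the health and metadata endpoints make convenient sanity checks. A minimal sketch using the requests library; the exact response bodies can vary between MLflow versions:

import requests

# Liveness check: a 200 status means the server is up and ready to serve
print(requests.get("http://localhost:5000/ping").status_code)

# Server and model information (payload varies by MLflow version)
print(requests.get("http://localhost:5000/version").text)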

Request Processing Pipeline

When a prediction request arrives at /invocations, MLflow processes it through several validation and transformation steps:

Input Format Detection: MLflow automatically detects which input format you're using. It supports multiple formats to accommodate different use cases:

  • dataframe_split: A pandas DataFrame serialized as separate columns and data arrays
  • dataframe_records: List of dictionaries representing rows
  • instances: TensorFlow Serving format for individual predictions
  • inputs: Named tensor format for more complex inputs

Schema Validation: If your model includes a signature (input/output schema), MLflow validates the incoming data against it. This catches type mismatches and missing columns before they reach your model.

Parameter Extraction: MLflow separates prediction data from optional parameters. Parameters like temperature for language models or threshold for classifiers are extracted and passed separately to models that support them.
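
For example, a single request body can carry both the prediction data and a params object; the parameter names below are purely illustrative and are only honored if the model's signature declares them:

// prediction inputs plus optional params (names shown here are illustrative)
{
  "inputs": ["Summarize the attached report in two sentences."],
  "params": {"temperature": 0.2, "max_tokens": 128}
}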

Model Prediction and Response

Once the input is validated and formatted, MLflow calls your model's predict() method. The framework automatically detects whether your model accepts parameters and calls it appropriately:

# For models that accept parameters
raw_predictions = model.predict(data, params=params)

# For traditional models
raw_predictions = model.predict(data)

MLflow then serializes the predictions back to JSON, handling various data types including NumPy arrays, pandas DataFrames, and Python lists. The response format depends on your input format - traditional requests get wrapped in a predictions object, while LLM-style requests return unwrapped results.
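
For instance, a request sent in one of the DataFrame-based or instances formats comes back wrapped in a predictions key:

// typical response for dataframe_split / dataframe_records / instances requests
{
  "predictions": [0, 1]
}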

The Flavor System

MLflow's flavor system is what makes serving work consistently across different ML frameworks. Each flavor implements framework-specific loading and prediction logic while exposing a unified interface.

When you log a model using mlflow.sklearn.log_model() or mlflow.pytorch.log_model(), MLflow creates both a flavor-specific representation and a PyFunc wrapper. The PyFunc wrapper provides the standardized predict() interface that the serving layer expects, while the flavor handles the framework-specific details like tensor operations or data preprocessing.

This architecture means you can serve scikit-learn, PyTorch, TensorFlow, and custom models using identical serving commands and APIs.
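
As a sketch of that unified interface, a custom model only needs to subclass mlflow.pyfunc.PythonModel and implement predict(). The wrapper below is hypothetical; it assumes a pickled scikit-learn classifier was logged as an artifact under the key "sk_model":

import mlflow.pyfunc


class ThresholdedClassifier(mlflow.pyfunc.PythonModel):
    """Hypothetical PyFunc wrapper that applies a configurable decision threshold."""

    def load_context(self, context):
        import joblib

        # "sk_model" is an assumed artifact key chosen when the model was logged
        self._model = joblib.load(context.artifacts["sk_model"])

    def predict(self, context, model_input, params=None):
        # Optional params are passed through from the /invocations payload
        threshold = (params or {}).get("threshold", 0.5)
        proba = self._model.predict_proba(model_input)[:, 1]
        return (proba >= threshold).astype(int)

A model like this is logged with mlflow.pyfunc.log_model() and then served through exactly the same mlflow models serve command and /invocations API as any built-in flavor.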

Error Handling and Debugging

MLflow's serving infrastructure includes comprehensive error handling to help you debug issues:

  • Schema Errors: Detailed messages about data type mismatches or missing columns
  • Format Errors: Clear guidance when input format is incorrect or ambiguous
  • Model Errors: Full stack traces from your model's prediction code
  • Server Errors: Timeout and resource-related error handling

The server logs all requests and errors, making it easier to diagnose production issues.

Input Format Examples

Here are the main input formats MLflow accepts:

// dataframe_split format
{
  "dataframe_split": {
    "columns": ["feature1", "feature2", "feature3"],
    "data": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
  }
}

// dataframe_records format
{
  "dataframe_records": [
    {"feature1": 1.0, "feature2": 2.0, "feature3": 3.0},
    {"feature1": 4.0, "feature2": 5.0, "feature3": 6.0}
  ]
}

// instances format (for simple models)
{
  "instances": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
}

All formats return a consistent response structure with your predictions and any additional metadata your model provides.

Key Implementation Concepts

Prepare your models for successful serving:

  • Model Signatures: Define input/output schemas for automatic request validation
  • Environment Management: Capture dependencies to ensure reproducible deployments
  • Model Registry: Use aliases for seamless production updates
  • Metadata: Include relevant context for debugging and monitoring

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.tracking import MlflowClient

# Log model with comprehensive serving metadata
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(
    sk_model=model,
    name="my_model",
    signature=signature,
    registered_model_name="production_model",
    input_example=X_train[:5],  # Visible example for the MLflow UI
)

# Use aliases for production deployment
client = MlflowClient()
client.set_registered_model_alias(
    name="production_model", alias="production", version="1"
)
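
Because the serving command accepts alias-based model URIs, a deployment can point at the alias instead of a pinned version, so promoting a new version only requires moving the alias. A sketch, assuming the alias set above:

# Serve whichever version the "production" alias currently points to
mlflow models serve -m "models:/production_model@production" -p 5000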

Complete Example: Train to Production

Follow this step-by-step guide to go from model training to a deployed REST API:

Train a simple model with automatic logging:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load sample data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Enable sklearn autologging with model registration
mlflow.sklearn.autolog(registered_model_name="iris_classifier")

# Train model - MLflow automatically logs everything
with mlflow.start_run() as run:
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X_train, y_train)

    # Autologging automatically captures:
    # - Model artifacts
    # - Training parameters (n_estimators, random_state, etc.)
    # - Training metrics (score on training data)
    # - Model signature (inferred from training data)
    # - Input example

    # Optional: Log additional custom metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("test_accuracy", accuracy)

print(f"Run ID: {run.info.run_id}")
print("Model automatically logged and registered!")

Next Steps

Ready to build more advanced serving applications? Explore the specialized serving topics covered elsewhere in the documentation.

Get Started

The examples in each section are designed to be practical and ready-to-use. Start with the Quick Start above, then explore the use cases that match your deployment needs.