MLflow Model Serving
Transform your trained models into production-ready inference servers with MLflow's comprehensive serving capabilities. Deploy locally, in the cloud, or through managed endpoints with standardized REST APIs.
REST API Endpoints
Automatic generation of standardized REST endpoints for model inference with consistent request/response formats.
Multi-Framework Support
Serve models from any ML framework through MLflow's flavor system with unified deployment patterns.
Custom Applications
Build sophisticated serving applications with custom logic, preprocessing, and business rules.
Scalable Deployment
Deploy to various targets from local development servers to cloud platforms and Kubernetes clusters.
Quick Start
Get your model serving in minutes with these simple steps:
1. Serve Model
2. Make Predictions
Choose your serving approach:
# Serve a logged model
mlflow models serve -m "models:/<model-id>" -p 5000
# Serve a registered model
mlflow models serve -m "models:/<model-name>/<model-version>" -p 5000
# Serve a model from local path
mlflow models serve -m ./path/to/model -p 5000
Your model will be available at http://localhost:5000
Send prediction requests via HTTP:
curl -X POST http://localhost:5000/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [[1, 2, 3, 4]]}'
Using Python:
import requests
import json
data = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3", "feature4"],
        "data": [[1, 2, 3, 4]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)
print(response.json())
How Model Serving Works
MLflow transforms your trained models into production-ready HTTP servers through a carefully orchestrated process that handles everything from model loading to request processing.
Server Startup and Model Loading
When you run mlflow models serve, MLflow begins by analyzing your model's metadata to determine how to load it. Each model contains an MLmodel file that specifies which "flavor" it uses, whether that's scikit-learn, PyTorch, TensorFlow, or a custom PyFunc model.
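To see which flavors a model declares, you can read its MLmodel metadata directly. The snippet below is a minimal sketch; the model URI is a placeholder, and mlflow.models.get_model_info is used here as one convenient way to read that metadata.
from mlflow.models import get_model_info

# Read the MLmodel metadata for a model (placeholder URI; substitute your own)
info = get_model_info("models:/<model-name>/<model-version>")

# The flavors dict lists every loader the model supports,
# e.g. {"python_function": {...}, "sklearn": {...}}
print(info.flavors)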
MLflow downloads the model artifacts to a local directory and creates a FastAPI server with standardized endpoints. The server loads your model using the appropriate flavor-specific loading logic. For example, a scikit-learn model is loaded using pickle, while a PyTorch model loads its state dictionary and model class.
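Whatever the underlying flavor, the loaded model is exposed through the generic PyFunc interface with a single predict() method. A minimal sketch of this loading step, reusing the placeholder URI and the feature names from the Quick Start example:
import mlflow.pyfunc
import pandas as pd

# Roughly what the server does at startup: download the artifacts and
# load the model through its python_function flavor.
model = mlflow.pyfunc.load_model("models:/<model-name>/<model-version>")

# Each /invocations payload is ultimately routed to model.predict()
sample = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["feature1", "feature2", "feature3", "feature4"],
)
print(model.predict(sample))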
The server exposes four key endpoints:
- POST /invocations - The main prediction endpoint
- GET /ping and GET /health - Health checks for monitoring
- GET /version - Returns server and model information
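Beyond /invocations, which the Quick Start already exercises, the auxiliary endpoints can be called the same way. A minimal sketch, assuming the server from the Quick Start is still running on port 5000:
import requests

# Liveness check; returns HTTP 200 once the model is loaded and ready
print(requests.get("http://localhost:5000/ping").status_code)

# Server and model version information
print(requests.get("http://localhost:5000/version").text)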