Production Tracing and Monitoring
When you deploy an agent or LLM application to production, real traffic looks different from your test data—users find edge cases, ask unexpected questions, and expose issues you didn't anticipate. This guide covers how to configure MLflow Tracing for production environments—including automatic (online) quality evaluation—so you can catch these issues early and continuously improve your application.
Production Checklist
We recommend the following steps before deploying to production. Each topic is covered in more detail below.
- Use a production-grade SQL database — Use PostgreSQL, MySQL, or similar for reliability at scale (see the example after this list)
- Enable async trace logging — Upload traces in the background to avoid adding latency to your app
- [Optional] Use the production tracing SDK — Faster startup and smaller footprint than the full mlflow package
- [Optional] Configure trace sampling — Control costs by logging only a percentage of traces for high-volume applications
- [Optional] Enable automatic quality evaluation — Use LLM judges to monitor quality on production traffic
- [Optional] Collect end-user feedback — Capture ratings and comments to identify issues and improve quality
- [Optional] Add application context to traces — Track user IDs, sessions, and metadata to debug and analyze behavior
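For the first checklist item, the following is a minimal sketch of running the MLflow Tracking Server against a PostgreSQL backend store; the host, credentials, and database name are placeholders you would replace with your own.

# Run the tracking server with a PostgreSQL backend store (placeholder credentials)
mlflow server \
  --backend-store-uri postgresql://mlflow_user:mlflow_password@db-host:5432/mlflow \
  --host 0.0.0.0 \
  --port 5000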
Setting Up Tracing for Production Endpoints
For production deployments, we recommend the Production Tracing SDK to minimize library dependencies and reduce startup time, and async logging with sampling for better performance and cost control at scale.
Using the Production Tracing SDK
The Production Tracing SDK (mlflow-tracing) is a lightweight package that includes only the minimum set of dependencies needed to instrument your code, models, and agents with MLflow Tracing.
⚡️ Faster Deployment: Significantly smaller package size and fewer dependencies enable quicker deployments in containers and serverless environments
📦 Enhanced Portability: Easily deploy across different platforms with minimal compatibility concerns
🚀 Performance Optimizations: Optimized for high-volume tracing in production environments
When installing the MLflow Tracing SDK, make sure the environment does not have the full MLflow package installed. Having both packages in the same environment might cause conflicts and unexpected behaviors.
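For example, a production container image would install the tracing SDK instead of the full package:

# Install the lightweight tracing SDK (not the full mlflow package)
pip install mlflow-tracing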
Automatic (Online) Quality Evaluation
MLflow's automatic evaluation enables continuous quality monitoring of production traffic using LLM judges. Judges run asynchronously on incoming traces without blocking your application, evaluating for issues like:
- Hallucinations and factual accuracy
- PII leakage and safety violations
- User frustration in multi-turn conversations
- Response relevance and completeness
Setting Up Production Judges
You can configure LLM judges to automatically evaluate a sample of your production traces using the UI or SDK. Judges can be set up with sampling rates to control costs and filter strings to target specific traces. For detailed setup instructions and configuration options, see Automatic Evaluation.
import mlflow
from mlflow.genai.scorers import Guidelines, ScorerSamplingConfig

mlflow.set_experiment("production-genai-app")

# Create a judge for detecting potential issues
safety_judge = Guidelines(
    name="safety_check",
    guidelines="The response must not contain PII, harmful content, or hallucinated information.",
    model="gateway:/my-llm-endpoint",
)

# Register and start automatic evaluation
registered_judge = safety_judge.register(name="production_safety_check")
registered_judge.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.1,  # Evaluate 10% of traces
        filter_string="metadata.environment = 'production'",  # Only production traces
    ),
)
Production Tracing Configurations
For production deployments, we recommend enabling asynchronous trace logging to avoid blocking your application, and configuring trace sampling to control costs for high-volume traffic.
Example configuration:
# Required: Set MLflow Tracking URI
export MLFLOW_TRACKING_URI="http://your-mlflow-server:5000"
# Optional: Configure the experiment name for organizing traces
export MLFLOW_EXPERIMENT_NAME="production-genai-app"
# Optional: Configure async logging (recommended for production)
export MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=true
export MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS=10
export MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE=1000
# Optional: Configure trace sampling ratio (default is 1.0)
export MLFLOW_TRACE_SAMPLING_RATIO=0.1
Asynchronous Trace Logging
For production applications, MLflow logs traces asynchronously by default to prevent blocking your application:
| Environment Variable | Description | Default Value |
|---|---|---|
| MLFLOW_ENABLE_ASYNC_TRACE_LOGGING | Whether to log traces asynchronously. When set to False, traces are logged in a blocking manner. | True |
| MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS | The maximum number of worker threads to use for async trace logging per process. Increasing this allows higher trace-logging throughput, but also increases CPU usage and memory consumption. | 10 |
| MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE | The maximum number of traces that can be queued before being logged to the backend by the worker threads. When the queue is full, new traces are discarded. Increasing this improves the durability of trace logging, but also increases memory consumption. | 1000 |
| MLFLOW_ASYNC_TRACE_LOGGING_RETRY_TIMEOUT | The timeout in seconds for retrying failed trace logging. When trace logging fails, it is retried with backoff up to this timeout, after which the trace is discarded. | 500 |
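Because traces are queued and uploaded in the background, any traces still in the queue when the process exits could be lost. The sketch below assumes mlflow.flush_trace_async_logging() is available in your installed MLflow version and registers it to run at shutdown:

import atexit
import mlflow

# Drain the async trace queue before the process exits so queued traces are not lost.
# Assumes mlflow.flush_trace_async_logging() exists in your MLflow version.
atexit.register(mlflow.flush_trace_async_logging)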
Sampling Traces
For high-volume applications, you may want to export only a subset of traces to the backend. You can configure a sampling ratio to control what fraction of traces is exported.
| Environment Variable | Description | Default Value |
|---|---|---|
| MLFLOW_TRACE_SAMPLING_RATIO | The sampling ratio for traces. When set to 0.0, no traces are exported. When set to 1.0, all traces are exported. | 1.0 |
The default value is 1.0, meaning all traces are exported. When set to a value below 1.0, say 0.1, only 10% of traces are exported. Sampling is applied at the trace level, so all spans within a trace are exported or discarded together.
Adding Context to Production Traces
Adding user IDs, session IDs, and environment metadata to your traces makes it easier to debug issues for specific users and analyze behavior across different segments.
Tracking Request, Session, and User Context
Production applications need to track multiple pieces of context simultaneously. For detailed guidance, see Track Users & Sessions. The following example demonstrates how to track all of these in a FastAPI application.
import mlflow
import os
from fastapi import FastAPI, Request
from pydantic import BaseModel

# Initialize FastAPI app
app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")  # FastAPI decorator should be outermost
@mlflow.trace  # Ensure @mlflow.trace is the inner decorator
def handle_chat(request: Request, chat_request: ChatRequest):
    # Retrieve all context from request headers
    client_request_id = request.headers.get("X-Request-ID")
    session_id = request.headers.get("X-Session-ID")
    user_id = request.headers.get("X-User-ID")

    # Update the current trace with all context and environment metadata
    mlflow.update_current_trace(
        client_request_id=client_request_id,
        tags={
            # Session context - groups traces from multi-turn conversations
            "mlflow.trace.session": session_id,
            # User context - associates traces with specific users
            "mlflow.trace.user": user_id,
            # Environment metadata - tracks deployment context
            "environment": "production",
            "app_version": os.getenv("APP_VERSION", "1.0.0"),
            "deployment_id": os.getenv("DEPLOYMENT_ID", "unknown"),
            "region": os.getenv("REGION", "us-east-1"),
        },
    )

    # Your application logic for processing the chat message
    response_text = f"Processed message: '{chat_request.message}'"

    return {"response": response_text}
Feedback Collection
Capturing user feedback on specific interactions is essential for understanding quality and improving your GenAI application. For detailed guidance, see Collect User Feedback. The following example demonstrates how to collect feedback in a FastAPI application.
import mlflow
from mlflow.client import MlflowClient
from fastapi import FastAPI, HTTPException, Query, Request
from pydantic import BaseModel
from typing import Optional
from mlflow.entities import AssessmentSource

app = FastAPI()


class FeedbackRequest(BaseModel):
    is_correct: bool  # True for correct, False for incorrect
    comment: Optional[str] = None


@app.post("/chat_feedback")
def handle_chat_feedback(
    request: Request,
    client_request_id: str = Query(
        ..., description="The client request ID from the original chat request"
    ),
    feedback: FeedbackRequest = ...,
):
    """
    Collect user feedback for a specific chat interaction identified by client_request_id.
    """
    # Search for the trace with the matching client_request_id
    client = MlflowClient()
    experiment = client.get_experiment_by_name("production-genai-app")
    traces = client.search_traces(locations=[experiment.experiment_id])
    traces = [
        trace for trace in traces if trace.info.client_request_id == client_request_id
    ][:1]

    if not traces:
        raise HTTPException(
            status_code=500,
            detail=f"Unable to find data for client request ID: {client_request_id}",
        )

    # Log feedback using MLflow's log_feedback API
    mlflow.log_feedback(
        trace_id=traces[0].info.trace_id,
        name="response_is_correct",
        value=feedback.is_correct,
        source=AssessmentSource(
            source_type="HUMAN", source_id=request.headers.get("X-User-ID")
        ),
        rationale=feedback.comment,
    )

    return {
        "status": "success",
        "message": "Feedback recorded successfully",
        "trace_id": traces[0].info.trace_id,
    }
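Once a trace exists for a given request, the feedback endpoint can be called with the same client request ID. A hypothetical client call, with placeholder URL and values:

import requests

# Hypothetical feedback submission; the URL and values are placeholders
response = requests.post(
    "http://localhost:8000/chat_feedback",
    params={"client_request_id": "req-abc-123"},
    json={"is_correct": True, "comment": "Accurate and helpful answer"},
    headers={"X-User-ID": "user-jane-doe-12345"},
)
print(response.json())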
Querying Traces with Context
Once you've enriched traces with user, session, and environment context, you can query them to debug issues for specific users, analyze conversation flows within sessions, or compare behavior across deployments. For detailed guidance, see Search Traces. The following example demonstrates how to query traces by user, session, and environment.
import mlflow

mlflow.set_experiment("production-genai-app")

# Query traces by user
user_traces = mlflow.search_traces(
    filter_string="tags.`mlflow.trace.user` = 'user-jane-doe-12345'",
    max_results=100,
)

# Query traces by session
session_traces = mlflow.search_traces(
    filter_string="tags.`mlflow.trace.session` = 'session-def-456'",
    max_results=100,
)

# Query traces by environment
production_traces = mlflow.search_traces(
    filter_string="tags.environment = 'production'",
    max_results=100,
)
Next Steps
Automatic Evaluation
Set up LLM judges to automatically monitor quality on production traffic.
Searching for Traces
Understand how to access trace data for analysis using the UI or API.
Track Users & Sessions
Implement user and session context tracking for better monitoring insights.