AI Gateway Usage

Learn how to query your AI Gateway endpoints, integrate with applications, and leverage different APIs and tools.

Basic Querying

REST API Requests

The gateway exposes REST endpoints that follow OpenAI-compatible patterns. Each endpoint accepts JSON payloads and returns structured responses. Use these when integrating with applications that don't have MLflow client libraries:

# Chat completions
curl -X POST http://localhost:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# Text completions
curl -X POST http://localhost:5000/gateway/completions/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 100
  }'

# Embeddings
curl -X POST http://localhost:5000/gateway/embeddings/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Text to embed"
  }'

Query Parameters

These parameters control model behavior and are supported across most providers. Different models may support different subsets of these parameters:

Chat Completions

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "stream": false
}

Text Completions

{
  "prompt": "Once upon a time",
  "temperature": 0.8,
  "max_tokens": 100,
  "top_p": 1.0,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": [".", "!"],
  "stream": false
}

Embeddings

{
  "input": ["Text to embed", "Another text"],
  "encoding_format": "float"
}

Streaming Responses

For long-form content generation, enable streaming to receive partial responses as they're generated instead of waiting for the complete response:

curl -X POST http://localhost:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a story"}],
    "stream": true
  }'
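
With "stream": true, the response arrives as a sequence of chunks rather than a single JSON body. Below is a minimal sketch of consuming the stream over plain HTTP with Python's requests library, assuming OpenAI-style server-sent-event lines ("data: {...}" terminated by "data: [DONE]"); verify the exact chunk framing against your gateway version:

import json

import requests

# Stream a chat response and print content as it arrives.
# Assumption: chunks are SSE lines of the form "data: {...}" ending with "data: [DONE]".
with requests.post(
    "http://localhost:5000/gateway/chat/invocations",
    json={"messages": [{"role": "user", "content": "Write a story"}], "stream": True},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            print(delta["content"], end="", flush=True)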

Python Client Integration

MLflow Deployments Client

The MLflow deployments client provides a Python interface that handles authentication, error handling, and response parsing. Use this when building Python applications:

from mlflow.deployments import get_deploy_client

# Create a client for the gateway
client = get_deploy_client("http://localhost:5000")

# Query a chat endpoint
response = client.predict(
    endpoint="chat",
    inputs={"messages": [{"role": "user", "content": "What is MLflow?"}]},
)

print(response["choices"][0]["message"]["content"])

Advanced Client Usage

Build reusable functions for common operations like streaming responses and batch embedding generation:

from mlflow.deployments import get_deploy_client

# Initialize client
client = get_deploy_client("http://localhost:5000")


# Chat with streaming
def stream_chat(prompt):
    response = client.predict(
        endpoint="chat",
        inputs={
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "temperature": 0.7,
        },
    )

    for chunk in response:
        if chunk["choices"][0]["delta"].get("content"):
            print(chunk["choices"][0]["delta"]["content"], end="")


# Generate embeddings
def get_embeddings(texts):
    response = client.predict(endpoint="embeddings", inputs={"input": texts})
    return [item["embedding"] for item in response["data"]]


# Example usage
stream_chat("Explain quantum computing")
embeddings = get_embeddings(["Hello world", "MLflow AI Gateway"])
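
For large batches, you may want to chunk the input rather than send everything in one request, since providers typically cap the number of inputs per call. A sketch that builds on the client and response shape above (the chunk size of 16 is an arbitrary illustration, not a provider limit):

def get_embeddings_batched(texts, chunk_size=16):
    # Send the texts in fixed-size chunks and concatenate the results
    embeddings = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start : start + chunk_size]
        response = client.predict(endpoint="embeddings", inputs={"input": chunk})
        embeddings.extend(item["embedding"] for item in response["data"])
    return embeddings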

Error Handling

Proper error handling helps you distinguish between network issues, authentication problems, and model-specific errors:

from mlflow.deployments import get_deploy_client
from mlflow.exceptions import MlflowException

client = get_deploy_client("http://localhost:5000")

try:
    response = client.predict(
        endpoint="chat", inputs={"messages": [{"role": "user", "content": "Hello"}]}
    )
    print(response)
except MlflowException as e:
    print(f"MLflow error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

API Reference

Gateway Management

Query the gateway's current configuration and available endpoints programmatically:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("http://localhost:5000")

# List available endpoints
endpoints = client.list_endpoints()
for endpoint in endpoints:
    print(f"Endpoint: {endpoint['name']}")

# Get endpoint details
endpoint_info = client.get_endpoint("chat")
print(f"Model: {endpoint_info.get('model', {}).get('name', 'N/A')}")
print(f"Provider: {endpoint_info.get('model', {}).get('provider', 'N/A')}")

# Note: Route creation, updates, and deletion are typically done
# through configuration file changes, not programmatically
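
Because routes are managed through the configuration file, it can be useful to verify at application startup that the endpoints your code depends on actually exist. A small sketch using the client above (the required endpoint names are examples):

def assert_endpoints_exist(client, required=("chat", "embeddings")):
    # Fail fast if a route the application relies on is missing
    available = {endpoint["name"] for endpoint in client.list_endpoints()}
    missing = set(required) - available
    if missing:
        raise RuntimeError(f"Missing gateway endpoints: {sorted(missing)}")


assert_endpoints_exist(client)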

Health Monitoring

Monitor gateway availability and responsiveness for production deployments:

import requests

try:
    response = requests.get("http://localhost:5000/health")
    print(f"Status: {response.status_code}")
    if response.status_code == 200:
        print("Gateway is healthy")
except requests.RequestException as e:
    print(f"Health check failed: {e}")

Next Steps