Optimize Prompts (Experimental)
The simple way to continuously improve your AI agents and prompts.
MLflow's prompt optimization lets you systematically enhance your AI applications with minimal code changes. Whether you're building with LangChain, OpenAI Agent, CrewAI, or your own custom implementation, MLflow provides a universal path from initial prototyping to steady improvement.
Minimal rewrites, no lock-in, just better prompts.
Currently, MLflow supports the GEPA optimization algorithm through the GepaPromptOptimizer. GEPA iteratively refines prompts using LLM-driven reflection and automated feedback, leading to systematic and data-driven improvements.
- Zero Framework Lock-in: Works with ANY agent framework—LangChain, OpenAI Agent, CrewAI, or custom solutions
- Minimal Code Changes: Add a few lines to start optimizing; no architectural rewrites needed
- Data-Driven Improvement: Automatically learn from your evaluation data and custom metrics
- Multi-Prompt Optimization: Jointly optimize multiple prompts for complex agent workflows
- Granular Control: Optimize single prompts or entire multi-prompt workflows—you decide what to improve
- Production-Ready: Built-in version control and registry for seamless deployment
- Extensible: Bring your own optimization algorithms with simple base class extension
The optimize_prompts API requires MLflow >= 3.5.0.
Quick Start
Here's a realistic example of optimizing a prompt for medical paper section classification:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
# Register initial prompt for classifying medical paper sections
prompt = mlflow.genai.register_prompt(
    name="medical_section_classifier",
    template="Classify this medical research paper sentence into one of these sections: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND.\n\nSentence: {{sentence}}",
)


# Define your prediction function
def predict_fn(sentence: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/medical_section_classifier/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-nano",
        # load prompt template using PromptVersion.format()
        messages=[{"role": "user", "content": prompt.format(sentence=sentence)}],
    )
    return completion.choices[0].message.content
# Training data with medical paper sentences and ground truth labels
# fmt: off
raw_data = [
("The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .", "BACKGROUND"),
("This paper describes the design and evaluation of Positive Outlook , an online program aiming to enhance the self-management skills of gay men living with HIV .", "BACKGROUND"),
("This study is designed as a randomised controlled trial in which men living with HIV in Australia will be assigned to either an intervention group or usual care control group .", "METHODS"),
("The intervention group will participate in the online group program ` Positive Outlook ' .", "METHODS"),
("The program is based on self-efficacy theory and uses a self-management approach to enhance skills , confidence and abilities to manage the psychosocial issues associated with HIV in daily life .", "METHODS"),
("Participants will access the program for a minimum of 90 minutes per week over seven weeks .", "METHODS"),
("Primary outcomes are domain specific self-efficacy , HIV related quality of life , and outcomes of health education .", "METHODS"),
("Secondary outcomes include : depression , anxiety and stress ; general health and quality of life ; adjustment to HIV ; and social support .", "METHODS"),
("Data collection will take place at baseline , completion of the intervention ( or eight weeks post randomisation ) and at 12 week follow-up .", "METHODS"),
("Results of the Positive Outlook study will provide information regarding the effectiveness of online group programs improving health related outcomes for men living with HIV .", "CONCLUSIONS"),
("The aim of this study was to evaluate the efficacy , safety and complications of orbital steroid injection versus oral steroid therapy in the management of thyroid-related ophthalmopathy .", "OBJECTIVE"),
("A total of 29 patients suffering from thyroid ophthalmopathy were included in this study .", "METHODS"),
("Patients were randomized into two groups : group I included 15 patients treated with oral prednisolone and group II included 14 patients treated with peribulbar triamcinolone orbital injection .", "METHODS"),
("Both groups showed improvement in symptoms and in clinical evidence of inflammation with improvement of eye movement and proptosis in most cases .", "RESULTS"),
("Mean exophthalmometry value before treatment was 22.6 1.98 mm that decreased to 18.6 0.996 mm in group I , compared with 23 1.86 mm that decreased to 19.08 1.16 mm in group II .", "RESULTS"),
("There was no change in the best-corrected visual acuity in both groups .", "RESULTS"),
("There was an increase in body weight , blood sugar , blood pressure and gastritis in group I in 66.7 % , 33.3 % , 50 % and 75 % , respectively , compared with 0 % , 0 % , 8.3 % and 8.3 % in group II .", "RESULTS"),
("Orbital steroid injection for thyroid-related ophthalmopathy is effective and safe .", "CONCLUSIONS"),
("It eliminates the adverse reactions associated with oral corticosteroid use .", "CONCLUSIONS"),
("The aim of this prospective randomized study was to examine whether active counseling and more liberal oral fluid intake decrease postoperative pain , nausea and vomiting in pediatric ambulatory tonsillectomy .", "OBJECTIVE"),
]
# fmt: on
# Format dataset for optimization
dataset = [
    {
        "inputs": {"sentence": sentence},
        "expectations": {"expected_response": label},
    }
    for sentence, label in raw_data
]
# Optimize the prompt
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="openai:/gpt-5", max_metric_calls=300
    ),
    scorers=[Correctness(model="openai:/gpt-5-mini")],
)
# Use the optimized prompt
optimized_prompt = result.optimized_prompts[0]
print(f"Optimized template: {optimized_prompt.template}")
The API will automatically improve the prompt to better classify medical paper sections by learning from the training examples.
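Because each optimized prompt is returned as a registered prompt version, your application can pick it up by loading its URI. Here is a minimal sketch that reuses the Quick Start setup (the function name predict_fn_optimized is just for illustration):
# Serve predictions with the optimized prompt version
def predict_fn_optimized(sentence: str) -> str:
    prompt = mlflow.genai.load_prompt(optimized_prompt.uri)
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt.format(sentence=sentence)}],
    )
    return completion.choices[0].message.content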
Example: Simple Prompt → Optimized Prompt
Before Optimization:
Classify this medical research paper sentence
into one of these sections: CONCLUSIONS, RESULTS,
METHODS, OBJECTIVE, BACKGROUND.
Sentence: {{sentence}}
After Optimization:
You are a single-sentence classifier for medical research abstracts. For each input sentence, decide which abstract section it belongs to and output exactly one label in UPPERCASE with no extra words, punctuation, or explanation.
Allowed labels: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND
Input format:
- The prompt will be:
"Classify this medical research paper sentence into one of these sections: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND.
Sentence: {{sentence}}"
Core rules:
- Use only the information in the single sentence.
- Classify by the sentence's function: context-setting vs aim vs procedure vs findings vs interpretation.
- Return exactly one uppercase label from the allowed set.
Decision guide and lexical cues:
1) RESULTS
- Reports observed findings/outcomes tied to data.
- Common cues: past-tense result verbs and outcome terms: "showed," "was/were associated with," "increased/decreased," "improved," "reduced," "significant," "p < …," "odds ratio," "risk ratio," "95% CI," percentages, rates, counts or numbers tied to effects/adverse events.
- If it explicitly states changes, associations, statistical significance, or quantified outcomes, choose RESULTS.
2) CONCLUSIONS
- Interpretation, implications, recommendations, or high-level takeaways.
- Common cues: "In conclusion," "These findings suggest/indicate," "We conclude," statements about practice/policy/clinical implications, benefit–risk judgments, feasibility statements.
- Sentences that forecast the significance/utility of the study's results ("Results will provide insight/information," "Findings will inform/guide practice") are CONCLUSIONS.
- Tie-break with RESULTS: If a sentence describes an outcome as a general claim without specific observed data/statistics, prefer CONCLUSIONS over RESULTS.
3) METHODS
- How the study was conducted: design, participants, interventions/programs, measurements/outcomes lists, timelines, procedures, or analyses.
- Common cues: design terms ("randomized," "double-blind," "cross-sectional," "cohort," "case-control"), "participants," "n =," inclusion/exclusion criteria, instruments/scales, dosing/protocols, schedules/timelines, statistical tests/analysis plans ("multivariate regression," "Kaplan–Meier," "ANOVA," "we will compare"), trial registration, ethics approval.
- Measurement/outcome lists are METHODS (e.g., "Secondary outcomes include: …"; "Primary outcome was …").
- Numbers specifying sample size (e.g., "n = 200") → METHODS; numbers tied to effects → RESULTS.
- Program/intervention descriptions, components, theoretical basis, and mechanisms are METHODS, even if written in present tense and even if they contain purpose phrases. Examples: "The program is based on self-efficacy theory…," "The intervention uses a self-management approach to enhance skills…," "The device is designed to…"
- Important: An infinitive "to [verb] …" inside a program/intervention description (e.g., "uses X to improve Y") is METHODS, not OBJECTIVE, because it describes how the intervention works, not the study's aim.
4) OBJECTIVE
- The aim/purpose/hypothesis of the study.
- Common cues: "Objective(s):" "Aim/Purpose was," "We aimed/sought/intended to," "We hypothesized that …"
- Infinitive purpose phrases indicating the study's aim without procedures or results: "To determine/evaluate/assess/investigate whether …" → OBJECTIVE.
- Phrases like "The aim of this study was to evaluate the efficacy/safety of X vs Y …" → OBJECTIVE.
- If "We evaluated/assessed …" is clearly used as a purpose statement (not describing methods or results), label OBJECTIVE.
5) BACKGROUND
- Context, rationale, prior knowledge, unmet need; introduces topic without specific aims, procedures, or results.
- Common cues: burden/prevalence statements, "X is common," "X remains poorly understood," prior work summaries, general descriptions.
- If a sentence merely states that a paper describes/reports a program/design/evaluation without concrete procedures/analyses, label as BACKGROUND.
Important tie-break rules:
- RESULTS vs CONCLUSIONS: Observed data/findings → RESULTS; interpretation/generalization/recommendation → CONCLUSIONS.
- OBJECTIVE vs METHODS: Purpose/aim of the study → OBJECTIVE; concrete design/intervention details/measurements/analysis steps → METHODS.
- BACKGROUND vs OBJECTIVE: Context/motivation without an explicit study aim → BACKGROUND.
- BACKGROUND vs METHODS: General description without concrete procedures/analyses → BACKGROUND.
- The word "Results" at the start does not guarantee RESULTS; e.g., "Results will provide information …" → CONCLUSIONS.
Output constraint:
- Return exactly one uppercase label: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, or BACKGROUND. No extra text or punctuation.
Components
The mlflow.genai.optimize_prompts() API requires the following components:
| Component | Description |
|---|---|
| Target Prompt URIs | List of prompt URIs to optimize (e.g., ["prompts:/qa/1"]) |
| Predict Function | A callable that takes inputs as keyword arguments and returns outputs. Must load templates from MLflow prompt versions (e.g., call PromptVersion.format()). |
| Training Data | Dataset with inputs (dict) and expectations (expected results). Supports pandas DataFrame, list of dicts, or MLflow EvaluationDataset. |
| Optimizer | Prompt optimizer instance (e.g., GepaPromptOptimizer) |
1. Target Prompt URIs
Specify which prompts to optimize using their URIs from MLflow Prompt Registry:
prompt_uris = [
    "prompts:/qa/1",  # Specific version
    "prompts:/instruction@latest",  # Latest version
]
You can reference prompts by:
- Specific version: "prompts:/qa/1" - Optimize a particular version
- Latest version: "prompts:/qa@latest" - Optimize the most recent version
- Alias: "prompts:/qa@champion" - Optimize a version with a specific alias
2. Predict Function
Your predict_fn must:
- Accept inputs as keyword arguments matching the inputs field of the dataset
- Load the template from an MLflow prompt version during execution (e.g., via PromptVersion.format() or PromptVersion.to_single_brace_format())
- Return outputs in the same format as your training data (e.g., outputs = {"answer": "xxx"} if expectations = {"expected_response": {"answer": "xxx"}})
def predict_fn(question: str) -> str:
    # Load prompt from registry
    prompt = mlflow.genai.load_prompt("prompts:/qa/1")

    # Format the prompt with input variables
    formatted_prompt = prompt.format(question=question)

    # Call your LLM
    response = your_llm_call(formatted_prompt)
    return response
3. Training Data
Provide a dataset with inputs and expectations columns, both containing dictionary values. The inputs values are passed to the predict function as keyword arguments. Refer to Predefined LLM Scorers for the expectations format that each built-in scorer expects.
# List of dictionaries - Example: Medical paper classification
dataset = [
    {
        "inputs": {
            "sentence": "The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility..."
        },
        "expectations": {"expected_response": "BACKGROUND"},
    },
    {
        "inputs": {
            "sentence": "This study is designed as a randomised controlled trial in which men living with HIV..."
        },
        "expectations": {"expected_response": "METHODS"},
    },
    {
        "inputs": {
            "sentence": "Both groups showed improvement in symptoms and in clinical evidence of inflammation..."
        },
        "expectations": {"expected_response": "RESULTS"},
    },
    {
        "inputs": {
            "sentence": "Orbital steroid injection for thyroid-related ophthalmopathy is effective and safe."
        },
        "expectations": {"expected_response": "CONCLUSIONS"},
    },
    {
        "inputs": {
            "sentence": "The aim of this study was to evaluate the efficacy, safety and complications..."
        },
        "expectations": {"expected_response": "OBJECTIVE"},
    },
]
# Or pandas DataFrame
import pandas as pd
dataset = pd.DataFrame(
    {
        "inputs": [
            {"sentence": "The emergence of HIV as a chronic condition..."},
            {"sentence": "This study is designed as a randomised controlled trial..."},
            {"sentence": "Both groups showed improvement in symptoms..."},
        ],
        "expectations": [
            {"expected_response": "BACKGROUND"},
            {"expected_response": "METHODS"},
            {"expected_response": "RESULTS"},
        ],
    }
)
4. Optimizer
Create an optimizer instance that implements the optimization algorithm. Currently, GepaPromptOptimizer is the only natively supported optimizer.
from mlflow.genai.optimize import GepaPromptOptimizer
optimizer = GepaPromptOptimizer(
    reflection_model="openai:/gpt-5",  # Powerful model for optimization
    max_metric_calls=100,
    display_progress_bar=False,
)
Advanced Usage
Using Custom Scorers
Define custom evaluation metrics to guide optimization:
from typing import Any

from mlflow.genai.scorers import scorer


@scorer
def accuracy_scorer(outputs: Any, expectations: dict[str, Any]):
    """Check if output matches expected value."""
    # expectations is a dict, so compare against its "expected_response" value
    return 1.0 if outputs.lower() == expectations["expected_response"].lower() else 0.0


@scorer
def brevity_scorer(outputs: Any):
    """Prefer shorter outputs (max 50 chars)."""
    return min(1.0, 50 / max(len(outputs), 1))


# Combine scorers with a weighted objective
def weighted_objective(scores: dict[str, Any]):
    return 0.7 * scores["accuracy_scorer"] + 0.3 * scores["brevity_scorer"]
# Use custom scorers
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[accuracy_scorer, brevity_scorer],
    aggregation=weighted_objective,
)
Custom Optimization Algorithm
Implement your own optimizer by extending BasePromptOptimizer:
from mlflow.genai.optimize import BasePromptOptimizer, PromptOptimizerOutput
from mlflow.genai.scorers import Correctness
class MyCustomOptimizer(BasePromptOptimizer):
    def __init__(self, model_name: str):
        self.model_name = model_name

    def optimize(self, eval_fn, train_data, target_prompts, enable_tracking):
        # Your custom optimization logic
        optimized_prompts = {}
        for prompt_name, prompt_template in target_prompts.items():
            # Implement your algorithm
            optimized_prompts[prompt_name] = your_optimization_algorithm(
                prompt_template, train_data, self.model_name
            )
        return PromptOptimizerOutput(optimized_prompts=optimized_prompts)
# Use custom optimizer
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=MyCustomOptimizer(model_name="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Multi-Prompt Optimization
Optimize multiple prompts together:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
# Register multiple prompts
plan_prompt = mlflow.genai.register_prompt(
    name="plan",
    template="Make a plan to answer {{question}}.",
)
answer_prompt = mlflow.genai.register_prompt(
    name="answer",
    template="Answer {{question}} following the plan: {{plan}}",
)
def predict_fn(question: str) -> str:
    plan_prompt = mlflow.genai.load_prompt("prompts:/plan/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5",  # strong model
        messages=[{"role": "user", "content": plan_prompt.format(question=question)}],
    )
    plan = completion.choices[0].message.content

    answer_prompt = mlflow.genai.load_prompt("prompts:/answer/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-mini",  # cost efficient model
        messages=[
            {
                "role": "user",
                "content": answer_prompt.format(question=question, plan=plan),
            }
        ],
    )
    return completion.choices[0].message.content
# Optimize both
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[plan_prompt.uri, answer_prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
# Access optimized prompts
optimized_plan = result.optimized_prompts[0]
optimized_answer = result.optimized_prompts[1]
Works with Any Agent Framework
MLflow's optimization is framework-agnostic—it works seamlessly with LangChain, DSPy, CrewAI, AutoGen, or any custom framework. No need to rewrite your existing agents or switch frameworks.
The example below registers a simple translation prompt and shows how to optimize it in a LangChain workflow with minimal changes. Note that we call PromptVersion.to_single_brace_format() instead of PromptVersion.format() inside predict_fn, because LangChain's PromptTemplate fills in the variables itself and expects single-brace placeholders.
import mlflow
from mlflow.genai.scorers import Correctness
from mlflow.genai.optimize import GepaPromptOptimizer
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

# Register the translation prompt used in this example
prompt = mlflow.genai.register_prompt(
    name="translator",
    template="Translate the following text from {{input_language}} to {{output_language}}:\n\n{{text}}",
)


def predict_fn(input_language, output_language, text):
    # Load the registered prompt and convert it to LangChain's single-brace format
    prompt_version = mlflow.genai.load_prompt("prompts:/translator/1")
    template = PromptTemplate(
        input_variables=["input_language", "output_language", "text"],
        template=prompt_version.to_single_brace_format(),  # call to_single_brace_format
    )
    llm = OpenAI(temperature=0.7)
    chain = LLMChain(llm=llm, prompt=template)
    result = chain.run(
        input_language=input_language, output_language=output_language, text=text
    )
    return result
dataset = [
    {
        "inputs": {
            "input_language": "English",
            "output_language": "French",
            "text": "Hello, how are you?",
        },
        "expectations": {"expected_response": "Bonjour, comment allez-vous?"},
    },
    {
        "inputs": {
            "input_language": "English",
            "output_language": "Spanish",
            "text": "Good morning",
        },
        "expectations": {"expected_response": "Buenos días"},
    },
    {
        "inputs": {
            "input_language": "English",
            "output_language": "German",
            "text": "Thank you very much",
        },
        "expectations": {"expected_response": "Vielen Dank"},
    },
]
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Result Object
The API returns a PromptOptimizationResult object:
result = mlflow.genai.optimize_prompts(...)
# Access optimized prompts
for prompt in result.optimized_prompts:
print(f"Name: {prompt.name}")
print(f"Version: {prompt.version}")
print(f"Template: {prompt.template}")
print(f"URI: {prompt.uri}")
# Check optimizer used
print(f"Optimizer: {result.optimizer_name}")
# View evaluation scores (if available)
print(f"Initial score: {result.initial_eval_score}")
print(f"Final score: {result.final_eval_score}")
Common Use Cases
Improving Accuracy
Optimize prompts to produce more accurate outputs:
from mlflow.genai.scorers import Correctness
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Optimizing for Safety
Ensure outputs are safe:
from mlflow.genai.scorers import Safety
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Safety(model="openai:/gpt-5")],
)
Model Switching and Migration
When switching between different language models (e.g., migrating from gpt-5 to gpt-5-mini for cost reduction), you may need to rewrite your prompts to maintain output quality with the new model. The mlflow.genai.optimize_prompts() API can help adapt prompts automatically using your existing application outputs as training data.
See the Auto-rewrite Prompts for New Models guide for a complete model migration workflow.
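For illustration, here is a rough sketch of that workflow; the prompt name, question, and captured output below are hypothetical placeholders:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# Hypothetical example: outputs previously produced by gpt-5 become the
# expectations used to adapt the prompt for gpt-5-mini.
migration_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source platform for managing the machine learning lifecycle."},
    },
    # ... more (input, captured gpt-5 output) pairs
]


def predict_fn(question: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/qa@latest")  # illustrative prompt name
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-mini",  # the new, cheaper model
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content


result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=migration_dataset,
    prompt_uris=["prompts:/qa@latest"],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)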
Troubleshooting
Issue: Optimization Takes Too Long
Solution: Reduce the dataset size or the optimizer budget:
# Use fewer examples
small_dataset = dataset[:20]
# Use a faster reflection model and a smaller metric-call budget
optimizer = GepaPromptOptimizer(
    reflection_model="openai:/gpt-5-mini", max_metric_calls=100
)
Issue: No Improvement Observed
Solution: Check your evaluation metrics and increase dataset diversity:
- Ensure your scorers accurately measure what you care about (see the sanity-check sketch below)
- Increase the size and diversity of your training data
- Adjust the optimizer configuration (e.g., allow a larger max_metric_calls budget)
- Verify that the output format of predict_fn matches the expectations in your training data
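As a quick sanity check before optimizing, you can score the current prompt on the training data and confirm the scorers behave as intended; a minimal sketch, assuming the predict_fn and dataset from the Quick Start:
import mlflow
from mlflow.genai.scorers import Correctness

# Evaluate the un-optimized prompt to verify the scorer measures what you expect
eval_result = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,
    scorers=[Correctness(model="openai:/gpt-5-mini")],
)
print(eval_result.metrics)  # aggregate scores, also visible in the MLflow UI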
Issue: Prompts Not Being Used
Solution: Ensure predict_fn loads the prompt from the registry and calls PromptVersion.format() during execution:
# ✅ Correct - loads from registry
def predict_fn(question: str):
    prompt = mlflow.genai.load_prompt("prompts:/qa@latest")
    return llm_call(prompt.format(question=question))


# ❌ Incorrect - hardcoded prompt
def predict_fn(question: str):
    return llm_call(f"Answer: {question}")
See Also
- Auto-rewrite Prompts for New Models: Adapt prompts when switching between language models
- Create and Edit Prompts: Basic Prompt Registry usage
- Evaluate Prompts: Evaluate prompt performance