Optimize Prompts (Experimental)
The simple way to continuously improve your AI agents and prompts.
MLflow's prompt optimization lets you systematically enhance your AI applications with minimal code changes. Whether you're building with LangChain, OpenAI Agent, CrewAI, or your own custom implementation, MLflow provides a universal path from initial prototyping to steady improvement.
Minimal rewrites, no lock-in, just better prompts.
Currently, MLflow supports the GEPA optimization algorithm through the GepaPromptOptimizer. GEPA iteratively refines prompts using LLM-driven reflection and automated feedback, leading to systematic and data-driven improvements.
- Zero Framework Lock-in: Works with ANY agent framework—LangChain, OpenAI Agent, CrewAI, or custom solutions
- Minimal Code Changes: Add a few lines to start optimizing; no architectural rewrites needed
- Data-Driven Improvement: Automatically learn from your evaluation data and custom metrics
- Multi-Prompt Optimization: Jointly optimize multiple prompts for complex agent workflows
- Granular Control: Optimize single prompts or entire multi-prompt workflows—you decide what to improve
- Production-Ready: Built-in version control and registry for seamless deployment
- Extensible: Bring your own optimization algorithms with simple base class extension
The optimize_prompts API requires MLflow >= 3.5.0.
Quick Start
Here's a realistic example of optimizing a prompt for medical paper section classification:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
# Register initial prompt for classifying medical paper sections
prompt = mlflow.genai.register_prompt(
    name="medical_section_classifier",
    template="Classify this medical research paper sentence into one of these sections: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND.\n\nSentence: {{sentence}}",
)


# Define your prediction function
def predict_fn(sentence: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/medical_section_classifier/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-nano",
        # load prompt template using PromptVersion.format()
        messages=[{"role": "user", "content": prompt.format(sentence=sentence)}],
    )
    return completion.choices[0].message.content
# Training data with medical paper sentences and ground truth labels
# fmt: off
raw_data = [
("The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments .", "BACKGROUND"),
("This paper describes the design and evaluation of Positive Outlook , an online program aiming to enhance the self-management skills of gay men living with HIV .", "BACKGROUND"),
("This study is designed as a randomised controlled trial in which men living with HIV in Australia will be assigned to either an intervention group or usual care control group .", "METHODS"),
("The intervention group will participate in the online group program ` Positive Outlook ' .", "METHODS"),
("The program is based on self-efficacy theory and uses a self-management approach to enhance skills , confidence and abilities to manage the psychosocial issues associated with HIV in daily life .", "METHODS"),
("Participants will access the program for a minimum of 90 minutes per week over seven weeks .", "METHODS"),
("Primary outcomes are domain specific self-efficacy , HIV related quality of life , and outcomes of health education .", "METHODS"),
("Secondary outcomes include : depression , anxiety and stress ; general health and quality of life ; adjustment to HIV ; and social support .", "METHODS"),
("Data collection will take place at baseline , completion of the intervention ( or eight weeks post randomisation ) and at 12 week follow-up .", "METHODS"),
("Results of the Positive Outlook study will provide information regarding the effectiveness of online group programs improving health related outcomes for men living with HIV .", "CONCLUSIONS"),
("The aim of this study was to evaluate the efficacy , safety and complications of orbital steroid injection versus oral steroid therapy in the management of thyroid-related ophthalmopathy .", "OBJECTIVE"),
("A total of 29 patients suffering from thyroid ophthalmopathy were included in this study .", "METHODS"),
("Patients were randomized into two groups : group I included 15 patients treated with oral prednisolone and group II included 14 patients treated with peribulbar triamcinolone orbital injection .", "METHODS"),
("Both groups showed improvement in symptoms and in clinical evidence of inflammation with improvement of eye movement and proptosis in most cases .", "RESULTS"),
("Mean exophthalmometry value before treatment was 22.6 1.98 mm that decreased to 18.6 0.996 mm in group I , compared with 23 1.86 mm that decreased to 19.08 1.16 mm in group II .", "RESULTS"),
("There was no change in the best-corrected visual acuity in both groups .", "RESULTS"),
("There was an increase in body weight , blood sugar , blood pressure and gastritis in group I in 66.7 % , 33.3 % , 50 % and 75 % , respectively , compared with 0 % , 0 % , 8.3 % and 8.3 % in group II .", "RESULTS"),
("Orbital steroid injection for thyroid-related ophthalmopathy is effective and safe .", "CONCLUSIONS"),
("It eliminates the adverse reactions associated with oral corticosteroid use .", "CONCLUSIONS"),
("The aim of this prospective randomized study was to examine whether active counseling and more liberal oral fluid intake decrease postoperative pain , nausea and vomiting in pediatric ambulatory tonsillectomy .", "OBJECTIVE"),
]
# fmt: on
# Format dataset for optimization
dataset = [
    {
        "inputs": {"sentence": sentence},
        "expectations": {"expected_response": label},
    }
    for sentence, label in raw_data
]
# Optimize the prompt
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="openai:/gpt-5", max_metric_calls=300
    ),
    scorers=[Correctness(model="openai:/gpt-5-mini")],
)
# Use the optimized prompt
optimized_prompt = result.optimized_prompts[0]
print(f"Optimized template: {optimized_prompt.template}")
The API will automatically improve the prompt to better classify medical paper sections by learning from the training examples.
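Because each optimized prompt is returned as a registered prompt version, your application can pick it up by loading its URI. Here is a minimal sketch that reuses the Quick Start setup (the function name predict_fn_optimized is just for illustration):
# Serve predictions with the optimized prompt version
def predict_fn_optimized(sentence: str) -> str:
    prompt = mlflow.genai.load_prompt(optimized_prompt.uri)
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt.format(sentence=sentence)}],
    )
    return completion.choices[0].message.content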
Example: Simple Prompt → Optimized Prompt
Before Optimization:
Classify this medical research paper sentence
into one of these sections: CONCLUSIONS, RESULTS,
METHODS, OBJECTIVE, BACKGROUND.
Sentence: {{sentence}}
After Optimization:
You are a single-sentence classifier for medical research abstracts. For each input sentence, decide which abstract section it belongs to and output exactly one label in UPPERCASE with no extra words, punctuation, or explanation.
Allowed labels: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND
Input format:
- The prompt will be:
"Classify this medical research paper sentence into one of these sections: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND.
Sentence: {{sentence}}"
Core rules:
- Use only the information in the single sentence.
- Classify by the sentence's function: context-setting vs aim vs procedure vs findings vs interpretation.
- Return exactly one uppercase label from the allowed set.
Decision guide and lexical cues:
1) RESULTS
- Reports observed findings/outcomes tied to data.
- Common cues: past-tense result verbs and outcome terms: "showed," "was/were associated with," "increased/decreased," "improved," "reduced," "significant," "p < …," "odds ratio," "risk ratio," "95% CI," percentages, rates, counts or numbers tied to effects/adverse events.
- If it explicitly states changes, associations, statistical significance, or quantified outcomes, choose RESULTS.
2) CONCLUSIONS
- Interpretation, implications, recommendations, or high-level takeaways.
- Common cues: "In conclusion," "These findings suggest/indicate," "We conclude," statements about practice/policy/clinical implications, benefit–risk judgments, feasibility statements.
- Sentences that forecast the significance/utility of the study's results ("Results will provide insight/information," "Findings will inform/guide practice") are CONCLUSIONS.
- Tie-break with RESULTS: If a sentence describes an outcome as a general claim without specific observed data/statistics, prefer CONCLUSIONS over RESULTS.
3) METHODS
- How the study was conducted: design, participants, interventions/programs, measurements/outcomes lists, timelines, procedures, or analyses.
- Common cues: design terms ("randomized," "double-blind," "cross-sectional," "cohort," "case-control"), "participants," "n =," inclusion/exclusion criteria, instruments/scales, dosing/protocols, schedules/timelines, statistical tests/analysis plans ("multivariate regression," "Kaplan–Meier," "ANOVA," "we will compare"), trial registration, ethics approval.
- Measurement/outcome lists are METHODS (e.g., "Secondary outcomes include: …"; "Primary outcome was …").
- Numbers specifying sample size (e.g., "n = 200") → METHODS; numbers tied to effects → RESULTS.
- Program/intervention descriptions, components, theoretical basis, and mechanisms are METHODS, even if written in present tense and even if they contain purpose phrases. Examples: "The program is based on self-efficacy theory…," "The intervention uses a self-management approach to enhance skills…," "The device is designed to…"
- Important: An infinitive "to [verb] …" inside a program/intervention description (e.g., "uses X to improve Y") is METHODS, not OBJECTIVE, because it describes how the intervention works, not the study's aim.
4) OBJECTIVE
- The aim/purpose/hypothesis of the study.
- Common cues: "Objective(s):" "Aim/Purpose was," "We aimed/sought/intended to," "We hypothesized that …"
- Infinitive purpose phrases indicating the study's aim without procedures or results: "To determine/evaluate/assess/investigate whether …" → OBJECTIVE.
- Phrases like "The aim of this study was to evaluate the efficacy/safety of X vs Y …" → OBJECTIVE.
- If "We evaluated/assessed …" is clearly used as a purpose statement (not describing methods or results), label OBJECTIVE.
5) BACKGROUND
- Context, rationale, prior knowledge, unmet need; introduces topic without specific aims, procedures, or results.
- Common cues: burden/prevalence statements, "X is common," "X remains poorly understood," prior work summaries, general descriptions.
- If a sentence merely states that a paper describes/reports a program/design/evaluation without concrete procedures/analyses, label as BACKGROUND.
Important tie-break rules:
- RESULTS vs CONCLUSIONS: Observed data/findings → RESULTS; interpretation/generalization/recommendation → CONCLUSIONS.
- OBJECTIVE vs METHODS: Purpose/aim of the study → OBJECTIVE; concrete design/intervention details/measurements/analysis steps → METHODS.
- BACKGROUND vs OBJECTIVE: Context/motivation without an explicit study aim → BACKGROUND.
- BACKGROUND vs METHODS: General description without concrete procedures/analyses → BACKGROUND.
- The word "Results" at the start does not guarantee RESULTS; e.g., "Results will provide information …" → CONCLUSIONS.
Output constraint:
- Return exactly one uppercase label: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, or BACKGROUND. No extra text or punctuation.
Components
The mlflow.genai.optimize_prompts() API requires the following components:
| Component | Description |
|---|---|
| Target Prompt URIs | List of prompt URIs to optimize (e.g., ["prompts:/qa/1"]) |
| Predict Function | A callable that takes inputs as keyword arguments and returns outputs. Must load templates from MLflow prompt versions (e.g., call PromptVersion.format()). |
| Training Data | Dataset with inputs (dict) and expectations (expected results). Supports pandas DataFrame, list of dicts, or MLflow EvaluationDataset. |
| Optimizer | Prompt optimizer instance (e.g., GepaPromptOptimizer) |
1. Target Prompt URIs
Specify which prompts to optimize using their URIs from MLflow Prompt Registry:
prompt_uris = [
    "prompts:/qa/1",  # Specific version
    "prompts:/instruction@latest",  # Latest version
]
You can reference prompts by:
- Specific version: "prompts:/qa/1" - Optimize a particular version
- Latest version: "prompts:/qa@latest" - Optimize the most recent version
- Alias: "prompts:/qa@champion" - Optimize a version with a specific alias
2. Predict Function
Your predict_fn must:
- Accept inputs as keyword arguments matching the inputs field of the dataset
- Load the template from an MLflow prompt version during execution (e.g., via PromptVersion.format() or PromptVersion.to_single_brace_format())
- Return outputs in the same format as your training data (e.g., outputs = {"answer": "xxx"} if expectations = {"expected_response": {"answer": "xxx"}})
def predict_fn(question: str) -> str:
    # Load prompt from registry
    prompt = mlflow.genai.load_prompt("prompts:/qa/1")

    # Format the prompt with input variables
    formatted_prompt = prompt.format(question=question)

    # Call your LLM
    response = your_llm_call(formatted_prompt)
    return response
3. Training Data
Provide a dataset with inputs and expectations columns, both containing dictionary values. The inputs values are passed to the predict function as keyword arguments. Refer to Predefined LLM Scorers for the expectations format that each built-in scorer expects.
# List of dictionaries - Example: Medical paper classification
dataset = [
    {
        "inputs": {
            "sentence": "The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility..."
        },
        "expectations": {"expected_response": "BACKGROUND"},
    },
    {
        "inputs": {
            "sentence": "This study is designed as a randomised controlled trial in which men living with HIV..."
        },
        "expectations": {"expected_response": "METHODS"},
    },
    {
        "inputs": {
            "sentence": "Both groups showed improvement in symptoms and in clinical evidence of inflammation..."
        },
        "expectations": {"expected_response": "RESULTS"},
    },
    {
        "inputs": {
            "sentence": "Orbital steroid injection for thyroid-related ophthalmopathy is effective and safe."
        },
        "expectations": {"expected_response": "CONCLUSIONS"},
    },
    {
        "inputs": {
            "sentence": "The aim of this study was to evaluate the efficacy, safety and complications..."
        },
        "expectations": {"expected_response": "OBJECTIVE"},
    },
]
# Or pandas DataFrame
import pandas as pd
dataset = pd.DataFrame(
    {
        "inputs": [
            {"sentence": "The emergence of HIV as a chronic condition..."},
            {"sentence": "This study is designed as a randomised controlled trial..."},
            {"sentence": "Both groups showed improvement in symptoms..."},
        ],
        "expectations": [
            {"expected_response": "BACKGROUND"},
            {"expected_response": "METHODS"},
            {"expected_response": "RESULTS"},
        ],
    }
)
4. Optimizer
Create an optimizer instance that implements the optimization algorithm. Currently, GepaPromptOptimizer is the only natively supported optimizer.
from mlflow.genai.optimize import GepaPromptOptimizer
optimizer = GepaPromptOptimizer(
    reflection_model="openai:/gpt-5",  # Powerful model for optimization
    max_metric_calls=100,
    display_progress_bar=False,
)
Advanced Usage
Using Custom Scorers
Define custom evaluation metrics to guide optimization:
from typing import Any

from mlflow.genai.scorers import scorer


@scorer
def accuracy_scorer(outputs: Any, expectations: dict[str, Any]):
    """Check if output matches expected value."""
    # expectations is a dict, so compare against its "expected_response" value
    return 1.0 if outputs.lower() == expectations["expected_response"].lower() else 0.0


@scorer
def brevity_scorer(outputs: Any):
    """Prefer shorter outputs (max 50 chars)."""
    return min(1.0, 50 / max(len(outputs), 1))


# Combine scorers with a weighted objective
def weighted_objective(scores: dict[str, Any]):
    return 0.7 * scores["accuracy_scorer"] + 0.3 * scores["brevity_scorer"]
# Use custom scorers
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[accuracy_scorer, brevity_scorer],
    aggregation=weighted_objective,
)
Custom Optimization Algorithm
Implement your own optimizer by extending BasePromptOptimizer:
from mlflow.genai.optimize import BasePromptOptimizer, PromptOptimizerOutput
from mlflow.genai.scorers import Correctness
class MyCustomOptimizer(BasePromptOptimizer):
    def __init__(self, model_name: str):
        self.model_name = model_name

    def optimize(self, eval_fn, train_data, target_prompts, enable_tracking):
        # Your custom optimization logic
        optimized_prompts = {}
        for prompt_name, prompt_template in target_prompts.items():
            # Implement your algorithm
            optimized_prompts[prompt_name] = your_optimization_algorithm(
                prompt_template, train_data, self.model_name
            )
        return PromptOptimizerOutput(optimized_prompts=optimized_prompts)
# Use custom optimizer
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=MyCustomOptimizer(model_name="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Multi-Prompt Optimization
Optimize multiple prompts together:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
# Register multiple prompts
plan_prompt = mlflow.genai.register_prompt(
    name="plan",
    template="Make a plan to answer {{question}}.",
)
answer_prompt = mlflow.genai.register_prompt(
    name="answer",
    template="Answer {{question}} following the plan: {{plan}}",
)
def predict_fn(question: str) -> str:
    plan_prompt = mlflow.genai.load_prompt("prompts:/plan/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5",  # strong model
        messages=[{"role": "user", "content": plan_prompt.format(question=question)}],
    )
    plan = completion.choices[0].message.content

    answer_prompt = mlflow.genai.load_prompt("prompts:/answer/1")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-mini",  # cost efficient model
        messages=[
            {
                "role": "user",
                "content": answer_prompt.format(question=question, plan=plan),
            }
        ],
    )
    return completion.choices[0].message.content
# Optimize both
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[plan_prompt.uri, answer_prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
# Access optimized prompts
optimized_plan = result.optimized_prompts[0]
optimized_answer = result.optimized_prompts[1]
Works with Any Agent Framework
MLflow's optimization is framework-agnostic—it works seamlessly with LangChain, DSPy, CrewAI, AutoGen, or any custom framework. No need to rewrite your existing agents or switch frameworks.
The example below registers a simple translation prompt and shows how to optimize it in a LangChain workflow with minimal changes. Note that we call PromptVersion.to_single_brace_format() instead of PromptVersion.format() inside predict_fn, because LangChain's PromptTemplate fills in the variables itself and expects single-brace placeholders.
import mlflow
from mlflow.genai.scorers import Correctness
from mlflow.genai.optimize import GepaPromptOptimizer
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

# Register the translation prompt used in this example
prompt = mlflow.genai.register_prompt(
    name="translator",
    template="Translate the following text from {{input_language}} to {{output_language}}:\n\n{{text}}",
)


def predict_fn(input_language, output_language, text):
    # Load the registered prompt and convert it to LangChain's single-brace format
    prompt_version = mlflow.genai.load_prompt("prompts:/translator/1")
    template = PromptTemplate(
        input_variables=["input_language", "output_language", "text"],
        template=prompt_version.to_single_brace_format(),  # call to_single_brace_format
    )
    llm = OpenAI(temperature=0.7)
    chain = LLMChain(llm=llm, prompt=template)
    result = chain.run(
        input_language=input_language, output_language=output_language, text=text
    )
    return result
dataset = [
    {
        "inputs": {
            "input_language": "English",
            "output_language": "French",
            "text": "Hello, how are you?",
        },
        "expectations": {"expected_response": "Bonjour, comment allez-vous?"},
    },
    {
        "inputs": {
            "input_language": "English",
            "output_language": "Spanish",
            "text": "Good morning",
        },
        "expectations": {"expected_response": "Buenos días"},
    },
    {
        "inputs": {
            "input_language": "English",
            "output_language": "German",
            "text": "Thank you very much",
        },
        "expectations": {"expected_response": "Vielen Dank"},
    },
]
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Result Object
The API returns a PromptOptimizationResult object:
result = mlflow.genai.optimize_prompts(...)
# Access optimized prompts
for prompt in result.optimized_prompts:
print(f"Name: {prompt.name}")
print(f"Version: {prompt.version}")
print(f"Template: {prompt.template}")
print(f"URI: {prompt.uri}")
# Check optimizer used
print(f"Optimizer: {result.optimizer_name}")
# View evaluation scores (if available)
print(f"Initial score: {result.initial_eval_score}")
print(f"Final score: {result.final_eval_score}")
Common Use Cases
Improving Accuracy
Optimize prompts to produce more accurate outputs:
from mlflow.genai.scorers import Correctness
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)
Optimizing for Safety
Ensure outputs are safe:
from mlflow.genai.scorers import Safety
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Safety(model="openai:/gpt-5")],
)
Model Switching and Migration
When switching between different language models (e.g., migrating from gpt-5 to gpt-5-mini for cost reduction), you may need to rewrite your prompts to maintain output quality with the new model. The mlflow.genai.optimize_prompts() API can help adapt prompts automatically using your existing application outputs as training data.
See the Auto-rewrite Prompts for New Models guide for a complete model migration workflow.
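For illustration, here is a rough sketch of that workflow; the prompt name, question, and captured output below are hypothetical placeholders:
import mlflow
import openai
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# Hypothetical example: outputs previously produced by gpt-5 become the
# expectations used to adapt the prompt for gpt-5-mini.
migration_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source platform for managing the machine learning lifecycle."},
    },
    # ... more (input, captured gpt-5 output) pairs
]


def predict_fn(question: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/qa@latest")  # illustrative prompt name
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-5-mini",  # the new, cheaper model
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content


result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=migration_dataset,
    prompt_uris=["prompts:/qa@latest"],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-5"),
    scorers=[Correctness(model="openai:/gpt-5")],
)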
Troubleshooting
Issue: Optimization Takes Too Long
Solution: Reduce the dataset size or the optimizer budget:
# Use fewer examples
small_dataset = dataset[:20]
# Use a faster reflection model and a smaller metric-call budget
optimizer = GepaPromptOptimizer(
    reflection_model="openai:/gpt-5-mini", max_metric_calls=100
)
Issue: No Improvement Observed
Solution: Check your evaluation metrics and increase dataset diversity:
- Ensure your scorers accurately measure what you care about (see the sanity-check sketch below)
- Increase the size and diversity of your training data
- Adjust the optimizer configuration (e.g., allow a larger max_metric_calls budget)
- Verify that the output format of predict_fn matches the expectations in your training data
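As a quick sanity check before optimizing, you can score the current prompt on the training data and confirm the scorers behave as intended; a minimal sketch, assuming the predict_fn and dataset from the Quick Start:
import mlflow
from mlflow.genai.scorers import Correctness

# Evaluate the un-optimized prompt to verify the scorer measures what you expect
eval_result = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,
    scorers=[Correctness(model="openai:/gpt-5-mini")],
)
print(eval_result.metrics)  # aggregate scores, also visible in the MLflow UI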
Issue: Prompts Not Being Used
Solution: Ensure predict_fn loads the prompt from the registry and calls PromptVersion.format() during execution:
# ✅ Correct - loads from registry
def predict_fn(question: str):
    prompt = mlflow.genai.load_prompt("prompts:/qa@latest")
    return llm_call(prompt.format(question=question))


# ❌ Incorrect - hardcoded prompt
def predict_fn(question: str):
    return llm_call(f"Answer: {question}")
See Also
- Auto-rewrite Prompts for New Models: Adapt prompts when switching between language models
- Create and Edit Prompts: Basic Prompt Registry usage
- Evaluate Prompts: Evaluate prompt performance