GenAI Evaluation Quickstart
This quickstart guide will walk you through evaluating your GenAI applications with MLflow's comprehensive evaluation framework. In less than 5 minutes, you'll learn how to evaluate LLM outputs, use built-in and custom evaluation criteria, and analyze results in the MLflow UI.

Prerequisites
Install the required packages by running the following command:
pip install --upgrade "mlflow>=3.3" openai
The code examples in this guide use the OpenAI SDK; however, MLflow's evaluation framework works with any LLM provider, including Anthropic, Google, Bedrock, and more.
Step 1: Set up your environment
Connect to MLflow
MLflow stores evaluation results in a tracking server. Connect your local environment to the tracking server using one of the following methods:
- Local
- Remote MLflow Server
- Databricks
For the fastest setup, you can run MLflow locally:
# Start MLflow tracking server locally
mlflow ui --backend-store-uri sqlite:///mlflow.db
# This will start the server at http://127.0.0.1:5000
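If you start the server this way, point your Python session at it before logging anything. This is a minimal sketch assuming the default address shown above; you can skip it if you rely on MLflow's default local file store.
import mlflow
# Point the client at the locally running tracking server started above
mlflow.set_tracking_uri("http://127.0.0.1:5000")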
If you have a remote MLflow tracking server, configure the connection:
import os
import mlflow
# Set your MLflow tracking URI
os.environ["MLFLOW_TRACKING_URI"] = "http://your-mlflow-server:5000"
# Or directly in code
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
If you have a Databricks account, configure the connection:
import mlflow
mlflow.login()
This will prompt you for your configuration details (the Databricks host URL and a personal access token).
If you are unsure about how to set up an MLflow tracking server, you can start with the cloud-based MLflow powered by Databricks: Sign up for free →
Create a new MLflow Experiment
import mlflow
# This will create a new experiment called "GenAI Evaluation Quickstart" and set it as active
mlflow.set_experiment("GenAI Evaluation Quickstart")
Configure OpenAI API Key (or other LLM providers)
import os
# Use a different env variable when using a different LLM provider
os.environ["OPENAI_API_KEY"] = "your-api-key-here" # Replace with your actual API key
Step 2: Create a simple QA function
First, we need to create a prediction function that takes a question and returns an answer. Here we use OpenAI's gpt-4o-mini model to generate the answer, but you can use any other LLM provider if you prefer.
from openai import OpenAI
client = OpenAI()
def qa_predict_fn(question: str) -> str:
    """Simple Q&A prediction function using OpenAI"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer questions concisely.",
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
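Before wiring the function into the evaluation, you can optionally sanity-check it on its own. A minimal check, assuming the OPENAI_API_KEY configured above is valid:
# Quick manual check of the prediction function (optional)
print(qa_predict_fn("What is the capital of France?"))
# Expect a short answer mentioning "Paris"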
Step 3: Prepare an evaluation dataset
The evaluation dataset is a list of samples, each with an inputs field and an expectations field.
- inputs: The input to the predict_fn function above. The key(s) must match the parameter name(s) of the predict_fn function.
- expectations: The expected output from the predict_fn function, i.e., the ground truth for the answer.
The dataset can be a list of dictionaries, a pandas DataFrame, or a Spark DataFrame. Here we use a list of dictionaries for simplicity.
# Define a simple Q&A dataset with questions and expected answers
eval_dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_response": "Paris"},
    },
    {
        "inputs": {"question": "Who was the first person to build an airplane?"},
        "expectations": {"expected_response": "Wright Brothers"},
    },
    {
        "inputs": {"question": "Who wrote Romeo and Juliet?"},
        "expectations": {"expected_response": "William Shakespeare"},
    },
]
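If you prefer a DataFrame, the same dataset can be expressed with pandas. This is an equivalent sketch of the data above; the DataFrame simply needs inputs and expectations columns holding the same dictionaries:
import pandas as pd

# The same dataset as a pandas DataFrame with "inputs" and "expectations" columns
eval_df = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is the capital of France?"},
            "expectations": {"expected_response": "Paris"},
        },
        {
            "inputs": {"question": "Who wrote Romeo and Juliet?"},
            "expectations": {"expected_response": "William Shakespeare"},
        },
    ]
)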
Step 4: Define evaluation criteria using Scorers
A scorer is a function that computes a score for a given input-output pair against an evaluation criterion. You can use the built-in scorers that MLflow provides for common criteria, or create your own custom scorers.
from mlflow.genai import scorer
from mlflow.genai.scorers import Correctness, Guidelines
@scorer
def is_concise(outputs: str) -> bool:
    """Evaluate if the answer is concise (5 words or fewer)"""
    return len(outputs.split()) <= 5


scorers = [
    Correctness(),
    Guidelines(name="is_english", guidelines="The answer must be in English"),
    is_concise,
]
Here we use three scorers:
- Correctness: Evaluates whether the answer is factually correct, using the "expected_response" field in the dataset.
- Guidelines: Evaluates whether the answer meets the given natural-language guidelines.
- is_concise: A custom scorer, defined with the scorer decorator, that checks whether the answer is concise (5 words or fewer).
The first two scorers use an LLM to evaluate the response, a technique known as LLM-as-a-Judge. It is a powerful way to assess response quality because it provides human-like judgment on complex language tasks while being more scalable and cost-effective than human evaluation.
The Scorer interface lets you define many kinds of quality metrics for your application in a simple way, from a plain natural-language guideline to a code function with full control over the evaluation logic.
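For example, a custom scorer can also receive the expectations field from the dataset, giving you full programmatic control. The exact_match scorer below is our own illustrative sketch, not a built-in MLflow scorer; it assumes the expected_response key defined in the dataset above:
from mlflow.genai import scorer


@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
    """Check whether the expected answer appears verbatim in the model output"""
    return expectations["expected_response"].lower() in outputs.lower()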
The default judge model for LLM-as-a-Judge scorers such as Correctness and Guidelines is OpenAI's gpt-4o-mini. MLflow supports all major LLM providers, such as Anthropic, Bedrock, Google, xAI, and more, through built-in adapters and LiteLLM.
Example of using different model providers for the judge model
# Anthropic
Correctness(model="anthropic:/claude-sonnet-4-20250514")
# Bedrock
Correctness(model="bedrock:/anthropic.claude-sonnet-4-20250514")
# Google
# Run `pip install litellm` to use Google as the judge model
Correctness(model="gemini/gemini-2.5-flash")
# xAI
# Run `pip install litellm` to use xAI as the judge model
Correctness(model="xai/grok-2-latest")
Step 5: Run the evaluation
Now we have all three components of the evaluation: dataset, prediction function, and scorers. Let's run the evaluation!
import mlflow
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=qa_predict_fn,
    scorers=scorers,
)
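You can also inspect the results programmatically from the returned object. As a rough sketch (attribute names may vary slightly across MLflow versions):
# Aggregate scores across the dataset, keyed by scorer/metric name
print(results.metrics)
# The MLflow run that stores the detailed per-row results and traces
print(results.run_id)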
After running the code above, go to the MLflow UI and navigate to your experiment. You'll see the evaluation results with detailed metrics for each scorer.

By clicking on each row in the table, you can see the detailed rationale behind the score and the trace of the prediction.

Summary
Congratulations! You've successfully:
- ✅ Set up MLflow GenAI Evaluation for your applications
- ✅ Evaluated a Q&A application with built-in scorers
- ✅ Created custom evaluation guidelines
- ✅ Learned to analyze results in the MLflow UI
MLflow's evaluation framework provides comprehensive tools for assessing GenAI application quality, helping you build more reliable and effective AI systems.