# Evaluate & Monitor FAQ
This page addresses frequently asked questions about MLflow's GenAI evaluation.
## Where can I find the evaluation results in the MLflow UI?
After an evaluation completes, you can find the resulting runs on the experiment page. Click the run name to view aggregated metrics and metadata in the overview pane.
To inspect per-row evaluation results, open the Traces tab on the run overview page.

## How do I change the concurrency of evaluation?

MLflow uses a thread pool to run the predict function and scorers in parallel. Configure the number of worker threads by setting the `MLFLOW_GENAI_EVAL_MAX_WORKERS` environment variable (default: 10).

```bash
export MLFLOW_GENAI_EVAL_MAX_WORKERS=5
```
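If you prefer to configure this from Python rather than the shell (for example, in a notebook), you can set the environment variable before the evaluation starts. A minimal sketch:

```python
import os

# Limit evaluation to 5 concurrent workers. Set this before calling
# mlflow.genai.evaluate so the worker pool is created with the new size.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "5"
```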
## Why does MLflow make N+1 predictions during evaluation?

MLflow requires the predict function passed through the `predict_fn` parameter to emit a single trace per call. To ensure the function produces a trace, MLflow first runs one additional prediction on a single input.

If you are confident that the predict function already generates traces, skip this validation by setting the `MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION` environment variable to true.

```bash
export MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION=true
```
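For example, if your predict function is decorated with `@mlflow.trace`, every call already emits a trace and the extra validation call adds no value. The sketch below assumes a hypothetical `answer_question` function standing in for your application:

```python
import os

import mlflow

# Safe to skip only because answer_question is decorated with @mlflow.trace
# and therefore always produces a trace.
os.environ["MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION"] = "true"


@mlflow.trace
def answer_question(question: str) -> str:
    # Hypothetical application logic; replace with your real model call.
    return f"Answer to: {question}"
```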
## How do I change the name of the evaluation run?

By default, `mlflow.genai.evaluate` generates a random run name. Set a custom name by wrapping the call with `mlflow.start_run`:

```python
import mlflow

with mlflow.start_run(run_name="My Evaluation Run") as run:
    mlflow.genai.evaluate(...)
```
## How do I use Databricks Model Serving endpoints as the predict function?

MLflow provides `mlflow.genai.to_predict_fn()`, which wraps a Databricks Model Serving endpoint so it behaves like a predict function compatible with GenAI evaluation.

The wrapper:

- Translates each input sample into the request payload expected by the endpoint.
- Injects `{"databricks_options": {"return_trace": True}}` so the endpoint returns a model-generated trace.
- Copies the trace into the current experiment so it appears in the MLflow UI.
```python
import mlflow
from mlflow.genai.scorers import Correctness

mlflow.genai.evaluate(
    # The {"messages": ...} part must be compatible with the request schema of the endpoint
    data=[{"inputs": {"messages": [{"role": "user", "content": "What is MLflow?"}]}}],
    # Your Databricks Model Serving endpoint URI
    predict_fn=mlflow.genai.to_predict_fn("endpoints:/chat"),
    scorers=[Correctness()],
)
```
## How do I migrate from MLflow 2 LLM Evaluation?
See the Migrating from MLflow 2 LLM Evaluation guide.
## How do I track the cost of LLM judges?
MLflow visualizes the cost of LLM judges in the assessment pane of the trace details page. When you open an assessment logged by an LLM judge, you can see the cost incurred for running the judge model. This feature is available only when you have the LiteLLM library installed.
Balancing cost and accuracy is important. To use a more cost-effective judge model while maintaining accuracy, you can leverage the LLM Judge Alignment feature.
## How do I debug my scorers?

To debug your scorers, you can enable tracing for the scorer functions by setting the `MLFLOW_GENAI_EVAL_ENABLE_SCORER_TRACING` environment variable to true.

```bash
export MLFLOW_GENAI_EVAL_ENABLE_SCORER_TRACING=true
```
When this is set to true, MLflow traces scorer executions during evaluation, letting you inspect each scorer's inputs, outputs, and internal steps.

To view a scorer trace, open the assessment pane of the trace details page and click the "View trace" link.
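Putting this together, here is a minimal sketch of debugging a custom scorer. The `exact_match` scorer and the inline dataset with precomputed outputs are illustrative, not part of MLflow's API:

```python
import os

import mlflow
from mlflow.genai.scorers import scorer

# Trace scorer executions so their inputs, outputs, and internal steps
# can be inspected from the assessment pane after evaluation.
os.environ["MLFLOW_GENAI_EVAL_ENABLE_SCORER_TRACING"] = "true"


@scorer
def exact_match(outputs, expectations) -> bool:
    # Each call to this scorer is traced during evaluation.
    return outputs == expectations.get("expected_response")


mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an open source MLOps platform.",
            "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
        }
    ],
    scorers=[exact_match],
)
```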