MLflow LLM Evaluation

Since the emergence of ChatGPT, LLMs have shown their power in text generation across various fields, such as question answering, translation, and text summarization. Evaluating an LLM’s performance differs from evaluating traditional ML models, because there is often no single ground truth to compare against. MLflow provides the mlflow.evaluate() API to help evaluate your LLMs.

MLflow’s LLM evaluation functionality consists of 3 main components:

  1. A model to evaluate: it can be an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable that represents your model, e.g., a HuggingFace text summarization pipeline.

  2. Metrics: the metrics to compute. LLM evaluation uses LLM-specific metrics.

  3. Evaluation data: the data your model is evaluated on. It can be a pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset() instance.

Full Notebook Guides and Examples

If you’re interested in thorough use-case oriented guides that showcase the simplicity and power of MLflow’s evaluate functionality for LLMs, please navigate to the notebook collection below:

View the Notebook Guides

Quickstart

Below is a simple example that gives a quick overview of how MLflow LLM evaluation works. The example builds a simple question-answering model by wrapping “openai/gpt-4” with a custom prompt. You can paste it into IPython or your local editor and execute it, installing any missing dependencies as prompted. Running the code requires an OpenAI API key; if you don’t have one, you can set it up by following the OpenAI guide.

export OPENAI_API_KEY='your-api-key-here'

import mlflow
import openai
import os
import pandas as pd
from getpass import getpass

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

LLM Evaluation Metrics

There are two types of LLM evaluation metrics in MLflow:

  1. Metrics that rely on a SaaS model (e.g., OpenAI) for scoring, such as mlflow.metrics.genai.answer_relevance(). These metrics are created with the mlflow.metrics.genai.make_genai_metric() method. For each data record, these metrics send one prompt containing the following information to the SaaS model and extract the score from the model response:

    • Metrics definition.

    • Metrics grading criteria.

    • Reference examples.

    • Input data/context.

    • Model output.

    • [optional] Ground truth.

    More details on how these fields are set can be found in the section “Creating Custom LLM-evaluation Metrics”.

  2. Function-based per-row metrics. These metrics calculate a score for each data record (row in terms of Pandas/Spark dataframe), based on certain functions, like Rouge (mlflow.metrics.rougeL()) or Flesch Kincaid (mlflow.metrics.flesch_kincaid_grade_level()). These metrics are similar to traditional metrics.
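To make the second category concrete, here is a minimal, library-free sketch of a function-based per-row metric. The names word_count_scores and aggregate are hypothetical and are not part of MLflow; the sketch only illustrates that one score is produced per record and then summarized across the dataset.

```python
import statistics


def word_count_scores(predictions):
    """Hypothetical per-row metric: one score (the word count) per model output."""
    return [len(text.split()) for text in predictions]


def aggregate(scores):
    """Summarize per-row scores across the dataset (population variance)."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),
    }


predictions = [
    "MLflow tracks experiments.",
    "Spark processes big data in parallel.",
]
scores = word_count_scores(predictions)
print(scores)
print(aggregate(scores))
```

A real metric such as ROUGE or Flesch-Kincaid replaces the word count with a more meaningful function, but the per-row then aggregate shape stays the same.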

Select Metrics to Evaluate

There are two ways to select metrics to evaluate your model:

  • Use default metrics for pre-defined model types.

  • Use a custom list of metrics.

Use Default Metrics for Pre-defined Model Types

MLflow LLM evaluation includes default collections of metrics for pre-selected tasks, e.g., “question-answering”. Depending on the LLM use case that you are evaluating, these pre-defined collections can greatly simplify the process of running evaluations. To use the default metrics for a pre-selected task, specify the model_type argument in mlflow.evaluate(), as shown in the example below:

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

The supported LLM model types and associated metrics are listed below:

1 Requires packages evaluate, torch, and transformers

2 Requires package textstat

3 Requires packages evaluate, nltk, and rouge-score

4 All retriever metrics have a default retriever_k value of 3 that can be overridden by specifying retriever_k in the evaluator_config argument.
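As an illustration of what a cutoff such as retriever_k controls, the following sketch computes precision at k, one common retriever metric, in plain Python. The function name and the exact definition (dividing by the number of documents actually inspected) are assumptions made for this example and may differ from MLflow's implementation.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


retrieved = ["d1", "d7", "d3", "d9"]
relevant = ["d1", "d3", "d9"]
print(precision_at_k(retrieved, relevant))       # default cutoff of 3
print(precision_at_k(retrieved, relevant, k=4))  # overridden cutoff
```

Raising the cutoff lets lower-ranked documents count toward the score, which is why the same retriever can score differently under different k values.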

Use a Custom List of Metrics

Using the pre-defined metrics associated with a given model type is not the only way to generate scoring metrics for LLM evaluation in MLflow. You can specify a custom list of metrics in the extra_metrics argument in mlflow.evaluate:

  • To add additional metrics to the default metrics list of pre-defined model type, keep the model_type and add your metrics to extra_metrics:

    results = mlflow.evaluate(
        model,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[mlflow.metrics.latency()],
    )
    

    The above code will evaluate your model using all metrics for the “question-answering” model type plus mlflow.metrics.latency().

  • To disable default metric calculation and only calculate your selected metrics, remove the model_type argument and define the desired metrics.

    results = mlflow.evaluate(
        model,
        eval_data,
        targets="ground_truth",
        extra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],
    )
    

The full reference for supported evaluation metrics can be found here.

Metrics with LLM as the Judge

MLflow offers a few pre-canned metrics that use an LLM as the judge. Despite the difference under the hood, the usage is the same: put these metrics in the extra_metrics argument in mlflow.evaluate(). Here is the list of pre-canned metrics:

  • mlflow.metrics.genai.answer_similarity(): Use this metric when you want to evaluate how similar the model generated output is compared to the information in the ground_truth. High scores mean that your model outputs contain similar information as the ground_truth, while low scores mean that outputs may disagree with the ground_truth.

  • mlflow.metrics.genai.answer_correctness(): Use this metric when you want to evaluate how factually correct the model generated output is based on the information in the ground_truth. High scores mean that your model outputs contain similar information as the ground_truth and that this information is correct, while low scores mean that outputs may disagree with the ground_truth or that the information in the output is incorrect. Note that this builds onto answer_similarity.

  • mlflow.metrics.genai.answer_relevance(): Use this metric when you want to evaluate how relevant the model generated output is to the input (context is ignored). High scores mean that your model outputs are about the same subject as the input, while low scores mean that outputs may be non-topical.

  • mlflow.metrics.genai.relevance(): Use this metric when you want to evaluate how relevant the model generated output is with respect to both the input and the context. High scores mean that the model has understood the context and correctly extracted relevant information from it, while low scores mean that the output has completely ignored the question and the context and could be hallucinating.

  • mlflow.metrics.genai.faithfulness(): Use this metric when you want to evaluate how faithful the model generated output is based on the context provided. High scores mean that the outputs contain information that is in line with the context, while low scores mean that outputs may disagree with the context (input is ignored).

Selecting the LLM-as-judge Model

By default, llm-as-judge metrics use openai:/gpt-4 as the judge. You can change the default judge model by passing an override to the model argument within the metric definition, as shown below. In addition to OpenAI models, you can also use any endpoint via MLflow Deployments. Use mlflow.deployments.set_deployments_target() to set the target deployment client.

To use an endpoint hosted by a local MLflow AI Gateway, you can use the following code.

from mlflow.deployments import set_deployments_target

set_deployments_target("http://localhost:5000")
my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/my-endpoint"
)

To use an endpoint hosted on Databricks, you can use the following code.

from mlflow.deployments import set_deployments_target

set_deployments_target("databricks")
llama2_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/databricks-llama-2-70b-chat"
)

For more information about how various models perform as judges, please refer to this blog.
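The judge identifiers used above follow a "<provider>:/<name>" shape, such as "openai:/gpt-4" or "endpoints:/my-endpoint". The helper below is a hypothetical sketch for validating such strings before passing them to a metric; it is not MLflow's own parsing logic.

```python
def parse_judge_uri(uri):
    """Split a judge identifier such as "openai:/gpt-4" into (provider, name)."""
    provider, separator, name = uri.partition(":/")
    if not separator or not provider or not name:
        raise ValueError(f"Expected '<provider>:/<name>', got {uri!r}")
    return provider, name


print(parse_judge_uri("openai:/gpt-4"))
print(parse_judge_uri("endpoints:/my-endpoint"))
```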

Creating Custom LLM-evaluation Metrics

Create LLM-as-judge Evaluation Metrics (Category 1)

You can also create your own SaaS LLM evaluation metrics with the MLflow API mlflow.metrics.genai.make_genai_metric(), which needs the following information:

  • name: the name of your custom metric.

  • definition: describes what the metric measures.

  • grading_prompt: describes the scoring criteria.

  • examples: a few input/output examples with scores, used as a reference by the LLM judge.

  • model: the identifier of LLM judge, in the format of “openai:/gpt-4” or “endpoints:/databricks-llama-2-70b-chat”.

  • parameters: the extra parameters to send to LLM judge, e.g., temperature for "openai:/gpt-4o-mini".

  • aggregations: The list of options to aggregate the per-row scores using numpy functions.

  • greater_is_better: indicates if a higher score means your model is better.

Under the hood, definition, grading_prompt, and examples, together with the evaluation data and model output, are composed into a long prompt and sent to the LLM. If you are familiar with the concept of prompt engineering, a SaaS LLM evaluation metric is basically an attempt to compose the “right” prompt containing instructions, data, and model output so that an LLM, e.g., GPT-4, can output the information we want.

Now let’s create a custom GenAI metric called “professionalism”, which measures how professional our model output is.

Let’s first create a few examples with scores; these will be the reference samples the LLM judge uses. To create such examples, we will use the mlflow.metrics.genai.EvaluationExample() class, which has 4 fields:

  • input: input text.

  • output: output text.

  • score: the score for output in the context of input.

  • justification: the reason the output received this score.

professionalism_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps "
        "you track experiments, package your code and models, and collaborate with your team, making the whole ML "
        "workflow smoother. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as 'like', and "
        "exclamation points, which make it sound less professional. "
    ),
)
professionalism_example_score_4 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was "
        "developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning engineers face when "
        "developing, training, and deploying machine learning models."
    ),
    score=4,
    justification=("The response is written in a formal language and a neutral tone. "),
)

Now let’s define the professionalism metric. You will see how each field is set up.

professionalism = mlflow.metrics.genai.make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 0: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts. "
        "- Score 1: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings. "
        "- Score 2: Language is overall formal but still has casual words/phrases. Borderline for professional contexts. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    examples=[professionalism_example_score_2, professionalism_example_score_4],
    model="openai:/gpt-4o-mini",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
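To show how fields like the ones above could end up in a single judge prompt, here is a library-free sketch. The functions build_judge_prompt, extract_score, and fake_judge are hypothetical, the prompt layout and the "score: <integer>" reply format are assumptions made for this example, and MLflow's actual prompt template may differ.

```python
import re


def build_judge_prompt(definition, grading_prompt, examples, question, answer):
    """Assemble one prompt string from the metric fields and a single record."""
    example_text = "\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}\n"
        f"Score: {ex['score']}\nJustification: {ex['justification']}"
        for ex in examples
    )
    return (
        f"Metric definition:\n{definition}\n\n"
        f"Grading criteria:\n{grading_prompt}\n\n"
        f"Reference examples:\n{example_text}\n\n"
        f"Input:\n{question}\n\n"
        f"Model output:\n{answer}\n\n"
        "Reply in the form 'score: <integer>' followed by a justification."
    )


def extract_score(judge_response):
    """Pull the integer score out of the judge's reply, or None if absent."""
    match = re.search(r"score:\s*(\d+)", judge_response, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def fake_judge(prompt):
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "score: 4\nThe answer uses formal, neutral language."


prompt = build_judge_prompt(
    definition="Professionalism refers to a formal, respectful style.",
    grading_prompt="Score 0 is extremely casual; score 4 is noticeably formal.",
    examples=[
        {
            "input": "What is MLflow?",
            "output": "MLflow is like your friendly neighborhood toolkit!",
            "score": 2,
            "justification": "Casual tone with filler words.",
        }
    ],
    question="What is Spark?",
    answer="Apache Spark is an open-source distributed computing system.",
)
print(extract_score(fake_judge(prompt)))
```

One prompt is built per data record, so the number of judge calls grows with the size of the evaluation dataset.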

Create heuristic-based LLM Evaluation Metrics (Category 2)

This is very similar to creating custom traditional metrics, with the exception of returning a mlflow.metrics.MetricValue() instance. Basically you need to:

  1. Implement an eval_fn that defines your scoring logic. It must take two arguments, predictions and targets, and must return a mlflow.metrics.MetricValue() instance.

  2. Pass eval_fn and other arguments to mlflow.metrics.make_metric API to create the metric.

The following code creates a dummy per-row metric called "over_10_chars": if the model output is longer than 10 characters, the score is “yes”; otherwise it is “no”.

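from mlflow.metrics import MetricValue, make_metric


def eval_fn(predictions, targets):
    # Score each row "yes" if the output is longer than 10 characters, else "no".
    scores = []
    for prediction in predictions:
        if len(prediction) > 10:
            scores.append("yes")
        else:
            scores.append("no")
    # The scores are strings, so numeric aggregations (mean, variance, p90)
    # cannot be computed and aggregate_results is left unset.
    return MetricValue(scores=scores)


# Create an EvaluationMetric object.
over_10_chars_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="over_10_chars"
)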

To create a custom metric that is dependent on other metrics, include those other metrics’ names as an argument after predictions and targets. This can be the name of a builtin metric or another custom metric. Ensure that you do not accidentally have any circular dependencies in your metrics, or the evaluation will fail.

The following code creates a dummy per-row metric called "toxic_or_over_10_chars": if the model output is longer than 10 characters or the toxicity score is greater than 0.5, the score is “yes”; otherwise it is “no”.

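from mlflow.metrics import MetricValue, make_metric


def eval_fn(predictions, targets, toxicity, over_10_chars):
    # toxicity and over_10_chars are the MetricValue results of the metrics with those names.
    scores = []
    for i in range(len(predictions)):
        # over_10_chars scores are the strings "yes"/"no", so compare explicitly;
        # a non-empty string is always truthy.
        if toxicity.scores[i] > 0.5 or over_10_chars.scores[i] == "yes":
            scores.append("yes")
        else:
            scores.append("no")
    return MetricValue(scores=scores)


# Create an EvaluationMetric object.
toxic_or_over_10_chars_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="toxic_or_over_10_chars"
)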
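The dependency rule described above (extra eval_fn parameters name other metrics, and cycles are not allowed) can be pictured with the following sketch. It is a hypothetical illustration using only the standard library, not MLflow's resolution code: dependencies are read from each function's signature, metrics are ordered so that dependencies run first, and a circular dependency raises an error.

```python
import inspect


def dependencies(eval_fn):
    """Metric names an eval_fn depends on: every parameter after the first two."""
    return list(inspect.signature(eval_fn).parameters)[2:]


def evaluation_order(metrics):
    """Order metrics so dependencies run first; raise on a circular dependency."""
    ordered, done, in_progress = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"Circular dependency involving {name!r}")
        in_progress.add(name)
        for dep in dependencies(metrics[name]):
            visit(dep)
        in_progress.discard(name)
        done.add(name)
        ordered.append(name)

    for name in metrics:
        visit(name)
    return ordered


# Placeholder eval functions; only their signatures matter for this sketch.
def toxicity(predictions, targets):
    return None


def over_10_chars(predictions, targets):
    return None


def toxic_or_over_10_chars(predictions, targets, toxicity, over_10_chars):
    return None


metrics = {
    "toxic_or_over_10_chars": toxic_or_over_10_chars,
    "toxicity": toxicity,
    "over_10_chars": over_10_chars,
}
print(evaluation_order(metrics))
```

In this sketch a dependency that is missing from the metrics dictionary surfaces as a KeyError, which is a simplification chosen for brevity.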

Prepare Your LLM for Evaluating

In order to evaluate your LLM with mlflow.evaluate(), your LLM has to be one of the following types:

  1. A mlflow.pyfunc.PyFuncModel() instance or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model. In general we call this an MLflow model.

  2. A Python function that takes in string inputs and outputs a single string. Your callable must match the signature of mlflow.pyfunc.PyFuncModel.predict() (without the params argument). Briefly, it should:

    • Have data as the only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or scipy matrix.

    • Return one of pandas.DataFrame, pandas.Series, numpy.ndarray, or list.

  3. An MLflow Deployments endpoint URI pointing to a local MLflow AI Gateway, Databricks Foundation Models API, or External Models in Databricks Model Serving.

  4. Set model=None and put the model outputs in data. This is only applicable when the data is a pandas DataFrame.
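For option 2, the callable contract (a single data argument in, one prediction per record out) can be exercised offline with a stub. The function canned_qa below is hypothetical and is shown with a plain Python list; when the evaluation data is a pandas DataFrame, the function would instead iterate over its rows.

```python
def canned_qa(data):
    """Hypothetical stand-in model: a list of questions in, a list of answers out."""
    answers = {
        "What is MLflow?": "MLflow is an open-source platform for the ML lifecycle.",
        "What is Spark?": "Apache Spark is a distributed computing system.",
    }
    # Return exactly one prediction per input record, in the same order.
    return [answers.get(question, "I do not know.") for question in data]


questions = ["What is MLflow?", "What is Spark?", "What is Flink?"]
predictions = canned_qa(questions)
print(predictions)
```

A stub like this is handy for checking the evaluation wiring before spending API credits on a real model.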

Evaluating with an MLflow Model

For detailed instructions on how to convert your model into a mlflow.pyfunc.PyFuncModel instance, please read this doc. In short, to evaluate your model as an MLflow model, we recommend following the steps below:

  1. Package your LLM as an MLflow model and log it to the MLflow server with log_model. Each flavor (openai, pytorch, …) has its own log_model API, e.g., mlflow.openai.log_model():

    with mlflow.start_run():
        system_prompt = "Answer the following question in two sentences"
        # Wrap "gpt-4o-mini" as an MLflow model.
        logged_model_info = mlflow.openai.log_model(
            model="gpt-4o-mini",
            task=openai.chat.completions,
            artifact_path="model",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "{question}"},
            ],
        )
    
  2. Use the URI of logged model as the model instance in mlflow.evaluate():

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    

Evaluating with a Custom Function

As of MLflow 2.8.0, mlflow.evaluate() supports evaluating a python function without requiring logging the model to MLflow. This is useful when you don’t want to log the model and just want to evaluate it. The following example uses mlflow.evaluate() to evaluate a function. You also need to set up OpenAI authentication to run the code below.

import mlflow
import openai
import pandas as pd
from typing import List

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)


def openai_qa(inputs: pd.DataFrame) -> List[str]:
    predictions = []
    system_prompt = "Please answer the following question in formal language."

    for _, row in inputs.iterrows():
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row["inputs"]},
            ],
        )
        predictions.append(completion.choices[0].message.content)

    return predictions


with mlflow.start_run():
    results = mlflow.evaluate(
        model=openai_qa,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

print(results.metrics)

Output

{
    "flesch_kincaid_grade_level/v1/mean": 14.75,
    "flesch_kincaid_grade_level/v1/variance": 0.5625,
    "flesch_kincaid_grade_level/v1/p90": 15.35,
    "ari_grade_level/v1/mean": 18.15,
    "ari_grade_level/v1/variance": 0.5625,
    "ari_grade_level/v1/p90": 18.75,
    "exact_match/v1": 0.0,
}
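The keys in the output above appear to follow a name/version/aggregation pattern, with the aggregation part absent for metrics such as exact_match. Assuming that pattern holds, the hypothetical helpers below regroup a flat metrics dictionary by metric name, which can make results easier to scan.

```python
def split_metric_key(key):
    """Split "name/version/aggregation" into parts; aggregation may be absent."""
    parts = key.split("/")
    name, version = parts[0], parts[1]
    aggregation = parts[2] if len(parts) > 2 else None
    return name, version, aggregation


def group_by_metric(metrics):
    """Group a flat metrics dictionary by metric name."""
    grouped = {}
    for key, value in metrics.items():
        name, _, aggregation = split_metric_key(key)
        grouped.setdefault(name, {})[aggregation or "value"] = value
    return grouped


metrics = {
    "flesch_kincaid_grade_level/v1/mean": 14.75,
    "flesch_kincaid_grade_level/v1/variance": 0.5625,
    "exact_match/v1": 0.0,
}
print(group_by_metric(metrics))
```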

Evaluating with an MLflow Deployments Endpoint

For MLflow >= 2.11.0, mlflow.evaluate() supports evaluating a model endpoint by directly passing the MLflow Deployments endpoint URI to the model argument. This is particularly useful when you want to evaluate a deployed model hosted by a local MLflow AI Gateway, Databricks Foundation Models API, or External Models in Databricks Model Serving, without implementing custom prediction logic to wrap it as an MLflow model or a Python function.

Please don’t forget to set the target deployment client by using mlflow.deployments.set_deployments_target() before calling mlflow.evaluate() with the endpoint URI, as shown in the example below. Otherwise, you will see an error message like MlflowException: No deployments target has been set....

Hint

When you want to use an endpoint not hosted by an MLflow AI Gateway or Databricks, you can create a custom Python function following the Evaluating with a Custom Function guide and use it as the model argument.

Supported Input Data Formats

The input data can be in any of the following formats when using the URI of an MLflow Deployments endpoint as the model:

Each entry below gives the data format, an example, and additional notes.

A pandas DataFrame with a string column.

pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ]
    }
)

For this input format, MLflow will construct the appropriate request payload to the model endpoint type. For example, if your model is a chat endpoint (llm/v1/chat), MLflow will wrap your input string with the chat messages format like {"messages": [{"role": "user", "content": "What is MLflow?"}]}. If you want to customize the request payload e.g. including system prompt, please use the next format.

A pandas DataFrame with a dictionary column.

pd.DataFrame(
    {
        "inputs": [
            {
                "messages": [
                    {"role": "system", "content": "Please answer."},
                    {"role": "user", "content": "What is MLflow?"},
                ],
                "max_tokens": 100,
            },
            # ... more dictionary records
        ]
    }
)

In this format, the dictionary should have the correct request format for your model endpoint. Please refer to the MLflow Deployments documentation for more information about the request format for different model endpoint types.

A list of input strings.

[
    "What is MLflow?",
    "What is Spark?",
]

mlflow.evaluate() also accepts a list as input.

A list of request payloads (dictionaries).

[
    {
        "messages": [
            {"role": "system", "content": "Please answer."},
            {"role": "user", "content": "What is MLflow?"},
        ],
        "max_tokens": 100,
    },
    # ... more dictionary records
]

As with the pandas DataFrame input, each dictionary should have the correct request format for your model endpoint.
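The difference between string records and dictionary records can be summarized with a small sketch. The function to_chat_payload is hypothetical and only mirrors the behavior described above for a chat endpoint: a plain string is wrapped as a single user message, while a dictionary is treated as a ready-made request payload.

```python
def to_chat_payload(record):
    """Wrap a plain string as a single user message; pass dictionaries through."""
    if isinstance(record, str):
        return {"messages": [{"role": "user", "content": record}]}
    if isinstance(record, dict):
        return record
    raise TypeError(f"Unsupported input record type: {type(record).__name__}")


custom = {
    "messages": [
        {"role": "system", "content": "Please answer."},
        {"role": "user", "content": "What is MLflow?"},
    ],
    "max_tokens": 100,
}
print(to_chat_payload("What is MLflow?"))
print(to_chat_payload(custom))
```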

Passing Inference Parameters

You can pass additional inference parameters, such as max_tokens, temperature, and n, to the model endpoint by setting the inference_params argument in mlflow.evaluate(). The inference_params argument is a dictionary that contains the parameters to be passed to the model endpoint. The specified parameters are used for all the input records in the evaluation dataset.

Note

When your input is a dictionary format that represents request payload, it can also include the parameters like max_tokens. If there are overlapping parameters in both the inference_params and the input data, the values in the inference_params will take precedence.
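The precedence rule in the note above behaves like a dictionary merge in which inference_params is applied last. The helper apply_inference_params is a hypothetical illustration of that rule, not an MLflow function.

```python
def apply_inference_params(payload, inference_params):
    """Return a new payload in which inference_params override overlapping keys."""
    return {**payload, **inference_params}


payload = {
    "messages": [{"role": "user", "content": "What is MLflow?"}],
    "max_tokens": 100,
}
inference_params = {"max_tokens": 50, "temperature": 0.0}

merged = apply_inference_params(payload, inference_params)
print(merged["max_tokens"], merged["temperature"])
```

Because the merge builds a new dictionary, the original payload keeps its own max_tokens value and can be reused with different parameters.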

Examples

Chat Endpoint hosted by a local MLflow AI Gateway

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to the local MLflow AI Gateway
set_deployments_target("http://localhost:5000")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        # Additional ground truth data for evaluating the answer
        "ground_truth": [
            "MLflow is an open-source platform ....",
            "Apache Spark is an open-source, ...",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/my-chat-endpoint",
        data=eval_data,
        targets="ground_truth",
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="question-answering",
    )

Completion Endpoint hosted on Databricks Foundation Models API

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to Databricks Foundation Models API
set_deployments_target("databricks")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "Write 3 reasons why you should use MLflow?",
            "Can you explain the difference between classification and regression?",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-mpt-7b-instruct",
        data=eval_data,
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="text",
    )

Evaluating External Models in Databricks Model Serving can be done in the same way; you just need to specify a different URI that points to the serving endpoint, like "endpoints:/your-chat-endpoint".

Evaluating with a Static Dataset

For MLflow >= 2.8.0, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you save the model output to a column in a Pandas DataFrame or an MLflow PandasDataset, and want to evaluate the static dataset without re-running the model.

If you are using a Pandas DataFrame, you must specify the column name that contains the model output using the top-level predictions parameter in mlflow.evaluate():

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. "
            "It was developed by Databricks, a company that specializes in big data and machine learning solutions. "
            "MLflow is designed to address the challenges that data scientists and machine learning engineers "
            "face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and "
            "analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, "
            "offering improvements in speed and ease of use. Spark provides libraries for various tasks such as "
            "data ingestion, processing, and analysis through its components like Spark SQL for structured data, "
            "Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
        "predictions": [
            "MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow "
            "lifecycle in a simple way",
            "Spark is a popular open-source distributed computing system designed for big data processing and analytics.",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

Viewing Evaluation Results

View Evaluation Results via Code

mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult() instance. To see the score on selected metrics, you can check:

  • metrics: stores the aggregated results, like average/variance across the evaluation dataset. Let’s take a second pass on the code example above and focus on printing out the aggregated results.

    with mlflow.start_run() as run:
        results = mlflow.evaluate(
            data=eval_data,
            targets="ground_truth",
            predictions="predictions",
            extra_metrics=[mlflow.metrics.genai.answer_similarity()],
            evaluators="default",
        )
        print(f"See aggregated evaluation results below: \n{results.metrics}")
    
  • tables["eval_results_table"]: stores the per-row evaluation results.

    with mlflow.start_run() as run:
        results = mlflow.evaluate(
            data=eval_data,
            targets="ground_truth",
            predictions="predictions",
            extra_metrics=[mlflow.metrics.genai.answer_similarity()],
            evaluators="default",
        )
        print(
            f"See per-data evaluation results below: \n{results.tables['eval_results_table']}"
        )
    

View Evaluation Results via the MLflow UI

Your evaluation results are automatically logged to the MLflow server, so you can view them directly in the MLflow UI. To view the evaluation results in the MLflow UI, follow the steps below:

  1. Go to the experiment view of your MLflow experiment.

  2. Select the “Evaluation” tab.

  3. Select the runs whose evaluation results you want to check.

  4. Select the metrics from the dropdown menu on the right side.

Please see the screenshot below for clarity:

Demo UI of MLflow evaluate