mlflow.metrics

The mlflow.metrics module helps you quantitatively and qualitatively measure your models.

class mlflow.metrics.EvaluationMetric(eval_fn, name, greater_is_better, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]

An evaluation metric.

Parameters
  • eval_fn

    A function that computes the metric with the following signature:

    def eval_fn(
        predictions: pandas.Series,
        targets: pandas.Series,
        metrics: Dict[str, MetricValue],
        **kwargs,
    ) -> Union[float, MetricValue]:
        """
        Args:
            predictions: A pandas Series containing the predictions made by the model.
            targets: (Optional) A pandas Series containing the corresponding labels
                for the predictions made on that input.
            metrics: (Optional) A dictionary containing the metrics calculated by the
                default evaluator.  The keys are the names of the metrics and the values
                are the metric values.  To access the MetricValue for the metrics
                calculated by the system, make sure to specify the type hint for this
                parameter as Dict[str, MetricValue].  Refer to the DefaultEvaluator
                behavior section for what metrics will be returned based on the type of
                model (i.e. classifier or regressor).
            kwargs: Includes a list of args that are used to compute the metric. These
                args could be information coming from input data, model outputs,
                other metrics, or parameters specified in the `evaluator_config`
                argument of the `mlflow.evaluate` API.
    
        Returns: MetricValue with per-row scores, per-row justifications, and aggregate
            results.
        """
        ...
    

  • name – The name of the metric.

  • greater_is_better – Whether a higher value of the metric is better.

  • long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".

  • version – (Optional) The metric version. For example v1.

  • metric_details – (Optional) A description of the metric and how it is calculated.

  • metric_metadata – (Optional) A dictionary containing metadata for the metric.

  • genai_metric_args – (Optional) A dictionary containing arguments specified by users when calling make_genai_metric or make_genai_metric_from_prompt. Those args are persisted so that we can deserialize the same metric object later.

EvaluationMetric objects are used by the mlflow.evaluate() API. They are either computed automatically depending on the model_type or specified via the extra_metrics parameter.

The following code demonstrates how to use mlflow.evaluate() with an EvaluationMetric.

import mlflow
import pandas as pd
from mlflow.metrics.genai import EvaluationExample, answer_similarity

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
        ],
    }
)

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
answer_similarity_metric = answer_similarity(examples=[example])
# `logged_model` is assumed to be the ModelInfo returned by an earlier log_model() call.
results = mlflow.evaluate(
    logged_model.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[answer_similarity_metric],
)

Information about how an EvaluationMetric is calculated, such as the grading prompt used, is available via the metric_details property.

import mlflow
from mlflow.metrics.genai import relevance

my_relevance_metric = relevance()
print(my_relevance_metric.metric_details)

Evaluation results are stored as MetricValue. Aggregate results are logged to the MLflow run as metrics, while per-example results are logged to the MLflow run as artifacts in the form of an evaluation table.

class mlflow.metrics.MetricValue(scores: Optional[Union[list, list]] = None, justifications: Optional[list] = None, aggregate_results: Optional[dict] = None)[source]

Note

Experimental: This class may change or be removed in a future release without warning.

The value of a metric.

Parameters
  • scores – The value of the metric per row

  • justifications – The justification (if applicable) for the respective score

  • aggregate_results – A dictionary mapping the name of the aggregation to its value
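
To make the three fields concrete, the sketch below derives aggregate results from a list of per-row scores using only the Python standard library. The scores and justifications are made-up illustration data. The use of population variance and a linearly interpolated 90th percentile is an assumption about how such aggregations are commonly defined, not a statement about MLflow's internals.

```python
import statistics

# Hypothetical per-row results for five evaluated rows.
scores = [1, 2, 3, 4, 5]
justifications = [f"row {i} graded by a made-up rule" for i in range(len(scores))]

# Aggregations keyed by name, mirroring the shape of `aggregate_results`.
aggregate_results = {
    "mean": statistics.mean(scores),
    "variance": statistics.pvariance(scores),  # population variance (assumption)
    # 9th decile cut point with linear interpolation (assumption)
    "p90": statistics.quantiles(scores, n=10, method="inclusive")[8],
}

print(aggregate_results)
```

With MLflow installed, these values would be passed as MetricValue(scores=scores, justifications=justifications, aggregate_results=aggregate_results).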

We provide the following built-in factory functions to create EvaluationMetric objects for evaluating models. These metrics are computed automatically depending on the model_type. For more information on the model_type parameter, see the mlflow.evaluate() API.

Regressor Metrics

mlflow.metrics.mae() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mae.

This metric computes an aggregate score for the mean absolute error for regression.

mlflow.metrics.mape() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mape.

This metric computes an aggregate score for the mean absolute percentage error for regression.

mlflow.metrics.max_error() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating max_error.

This metric computes an aggregate score for the maximum residual error for regression.

mlflow.metrics.mse() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mse.

This metric computes an aggregate score for the mean squared error for regression.

mlflow.metrics.rmse() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating the square root of mse.

This metric computes an aggregate score for the root mean squared error for regression.

mlflow.metrics.r2_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating r2_score.

This metric computes an aggregate score for the coefficient of determination. R2 ranges from negative infinity to 1, and measures the percentage of variance explained by the predictor variables in a regression.
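
For reference, the six aggregate scores above can be written out from their standard definitions. The following is a minimal pure-Python sketch that is independent of MLflow's implementation. The helper name regression_scores and the sample data are hypothetical, and the MAPE term assumes that no target value is zero.

```python
import math


def regression_scores(y_true, y_pred):
    """Return the six aggregate regression scores as a dict (illustrative only)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no target is zero; otherwise this division fails.
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "max_error": max(abs(e) for e in errors),
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2_score": 1 - ss_res / ss_tot,
    }


scores = regression_scores([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(scores)
```

For this sample, mae is 0.5, mse is 0.375, and max_error is 1.0. Lower is better for every score except r2_score.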

Classifier Metrics

mlflow.metrics.precision_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating precision for classification.

This metric computes an aggregate score between 0 and 1 for the precision of a classification task.

mlflow.metrics.recall_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating recall for classification.

This metric computes an aggregate score between 0 and 1 for the recall of a classification task.

mlflow.metrics.f1_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating f1_score for binary classification.

This metric computes an aggregate score between 0 and 1 for the F1 score (F-measure) of a classification task. F1 score is defined as 2 * (precision * recall) / (precision + recall).
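
The relationship between the three classifier scores can be illustrated for the binary case. This is a minimal sketch from the textbook definitions, not MLflow's implementation. The helper name, the sample labels, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def binary_classification_scores(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class (illustrative only)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    # Returning 0.0 on an empty denominator is an assumption of this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall else 0.0
    return {"precision_score": precision, "recall_score": recall, "f1_score": f1}


scores = binary_classification_scores(
    y_true=[1, 0, 1, 1, 0, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 1, 1, 1, 0],
)
print(scores)
```

Here there are 3 true positives, 2 false positives, and 1 false negative, so precision is 0.6, recall is 0.75, and F1 is about 0.667.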

Text Metrics

mlflow.metrics.ari_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating the automated readability index using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).

Aggregations calculated for this metric:
  • mean
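
As a rough illustration of what the score measures, the sketch below applies the standard automated readability index formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43, with naive regular-expression counting. The tokenization rules here are assumptions made for this example; textstat counts characters, words, and sentences with its own rules, so its output may differ.

```python
import re


def automated_readability_index(text):
    """Standard ARI formula with naive counting (illustrative only)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Treat runs of ., !, or ? as sentence boundaries (naive assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


ari = automated_readability_index("The cat sat. The dog ran.")
print(round(ari, 2))
```

This sample has 18 characters, 6 words, and 2 sentences, which gives a score of about -5.8. Very short sentences made of short words can therefore fall below 0, consistent with the range not being strictly bounded.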

mlflow.metrics.flesch_kincaid_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating the Flesch-Kincaid grade level using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).

Aggregations calculated for this metric:
  • mean

Question Answering Metrics

Includes all of the above Text Metrics as well as the following:

mlflow.metrics.exact_match() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating accuracy using sklearn.

This metric only computes an aggregate score which ranges from 0 to 1.

mlflow.metrics.rouge1() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rouge1.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge1 uses unigram based scoring to calculate similarity.

Aggregations calculated for this metric:
  • mean

mlflow.metrics.rouge2() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rouge2.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge2 uses bigram based scoring to calculate similarity.

Aggregations calculated for this metric:
  • mean
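
The difference between unigram (rouge1) and bigram (rouge2) scoring can be sketched as an n-gram overlap F-measure. This is a simplified illustration rather than the implementation MLflow uses: the function name rouge_n is hypothetical, tokenization is plain lowercased whitespace splitting, and no stemming is applied, so real ROUGE packages may return different values.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """F-measure of clipped n-gram overlap between two texts (illustrative only)."""

    def ngrams(text):
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # each n-gram counted at most as often as in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


unigram_score = rouge_n("the cat lay on the mat", "the cat sat on the mat", n=1)
bigram_score = rouge_n("the cat lay on the mat", "the cat sat on the mat", n=2)
print(unigram_score, bigram_score)
```

The two sentences share 5 of 6 unigrams but only 3 of 5 bigrams, so the bigram score (0.6) is lower than the unigram score (about 0.833).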

mlflow.metrics.rougeL() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rougeL.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeL uses longest common subsequence based scoring to calculate similarity.

Aggregations calculated for this metric:
  • mean

mlflow.metrics.rougeLsum() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rougeLsum.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeLsum uses longest common subsequence based scoring to calculate similarity.

Aggregations calculated for this metric:
  • mean

mlflow.metrics.toxicity() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating toxicity using the model roberta-hate-speech-dynabench-r4, which defines hate as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.”

The score ranges from 0 to 1, where scores closer to 1 are more toxic. The default threshold for a text to be considered “toxic” is 0.5.

Aggregations calculated for this metric:
  • ratio (of toxic input texts)

mlflow.metrics.token_count() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating token_count. Token count is calculated using tiktoken by using the cl100k_base tokenizer.

mlflow.metrics.latency() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating latency. Latency is determined by the time it takes to generate a prediction for a given input. Note that computing latency requires each row to be predicted sequentially, which will likely slow down the evaluation process.
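
The reason sequential prediction is needed can be seen in a small sketch of per-row timing. The predict function below is a hypothetical stand-in for a model call, and the timing loop is an illustration of the idea rather than MLflow's implementation.

```python
import time


def predict(row):
    """Hypothetical stand-in for a model's single-row prediction call."""
    return row.upper()


def timed_predictions(rows):
    """Predict one row at a time so each call can be timed on its own."""
    outputs, latencies = [], []
    for row in rows:
        start = time.perf_counter()
        outputs.append(predict(row))
        latencies.append(time.perf_counter() - start)  # seconds for this row only
    return outputs, latencies


outputs, latencies = timed_predictions(["what is mlflow?", "what is a run?"])
print(outputs, latencies)
```

A batched call would yield only one elapsed time for all rows, which is why per-row latency gives up batching.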

mlflow.metrics.bleu() → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating bleu.

The BLEU scores range from 0 to 1, with higher scores indicating greater similarity to reference texts. BLEU considers n-gram precision and brevity penalty. While adding more references can boost the score, perfect scores are rare and not essential for effective evaluation.

Aggregations calculated for this metric:
  • mean

  • variance

  • p90

Retriever Metrics

The following metrics are built-in metrics for the 'retriever' model type, meaning they will be automatically calculated with a default retriever_k value of 3.

To evaluate document retrieval models, it is recommended to use a dataset with the following columns:

  • Input queries

  • Retrieved relevant doc IDs

  • Ground-truth doc IDs

Alternatively, you can also provide a function through the model parameter to represent your retrieval model. The function should take a Pandas DataFrame containing input queries and ground-truth relevant doc IDs, and return a DataFrame with a column of retrieved relevant doc IDs.

A “doc ID” is a string or integer that uniquely identifies a document. Each row of the retrieved and ground-truth doc ID columns should consist of a list or numpy array of doc IDs.

Parameters:

  • targets: A string specifying the column name of the ground-truth relevant doc IDs

  • predictions: A string specifying the column name of the retrieved relevant doc IDs in either the static dataset or the DataFrame returned by the model function

  • retriever_k: A positive integer specifying the number of retrieved doc IDs to consider for each input query. retriever_k defaults to 3. You can change retriever_k by using the mlflow.evaluate() API:

    1. # with a model and using `evaluator_config`
      mlflow.evaluate(
          model=retriever_function,
          data=data,
          targets="ground_truth",
          model_type="retriever",
          evaluators="default",
          evaluator_config={"retriever_k": 5}
      )
      
    2. # with a static dataset and using `extra_metrics`
      mlflow.evaluate(
          data=data,
          predictions="predictions_param",
          targets="targets_param",
          model_type="retriever",
          extra_metrics = [
              mlflow.metrics.precision_at_k(5),
              mlflow.metrics.precision_at_k(6),
              mlflow.metrics.recall_at_k(5),
              mlflow.metrics.ndcg_at_k(5)
          ]
      )
      

    NOTE: In the 2nd method, it is recommended to omit the model_type as well, or else precision@3 and recall@3 will be calculated in addition to precision@5, precision@6, recall@5, and ndcg@5.

mlflow.metrics.precision_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating precision_at_k for retriever models.

This metric computes a score between 0 and 1 for each row representing the precision of the retriever model at the given k value. If no relevant docs are retrieved, the score is 0. Otherwise, let x = min(k, # of retrieved doc IDs); the precision at k is calculated as follows:

precision_at_k = (# of relevant retrieved doc IDs in top-x ranked docs) / x.
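
The formula above can be sketched for a single row as follows. This is a minimal illustration of the stated definition, not MLflow's source code. The function name and the sample doc IDs are hypothetical.

```python
def precision_at_k(retrieved, ground_truth, k):
    """Precision over the top-x retrieved doc IDs, where x = min(k, len(retrieved))."""
    top = list(retrieved)[:k]  # slicing yields exactly x items
    if not top:
        return 0.0  # nothing retrieved, so nothing relevant was retrieved
    relevant = set(ground_truth)
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / len(top)


full_row = precision_at_k(["a", "b", "c", "d"], ["a", "c", "e"], k=3)
short_row = precision_at_k(["a", "b"], ["a", "c", "e"], k=3)
empty_row = precision_at_k([], ["a"], k=3)
print(full_row, short_row, empty_row)
```

In the second call only two docs were retrieved, so x is 2 and the score is 1 / 2 rather than 1 / 3.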

mlflow.metrics.recall_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating recall_at_k for retriever models.

This metric computes a score between 0 and 1 for each row representing the recall ability of the retriever model at the given k value. If no ground truth doc IDs are provided and no documents are retrieved, the score is 1. However, if no ground truth doc IDs are provided and documents are retrieved, the score is 0. In all other cases, the recall at k is calculated as follows:

recall_at_k = (# of unique relevant retrieved doc IDs in top-k ranked docs) / (# of ground truth doc IDs)
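
The definition and its two edge cases can be sketched for a single row as follows. This is a minimal illustration of the rules stated above, not MLflow's source code. The function name and sample doc IDs are hypothetical, and the ground-truth doc IDs are assumed to be unique.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Share of ground-truth doc IDs found among the top-k retrieved doc IDs."""
    relevant = set(ground_truth)       # assumes ground-truth IDs are unique
    top = set(list(retrieved)[:k])     # duplicates in the top k count once
    if not relevant:
        # No ground truth: perfect only if nothing was retrieved either.
        return 1.0 if not top else 0.0
    return len(relevant & top) / len(relevant)


top3 = recall_at_k(["a", "b", "a", "c"], ["a", "c", "e"], k=3)
top4 = recall_at_k(["a", "b", "a", "c"], ["a", "c", "e"], k=4)
print(top3, top4)
```

With k=3 the duplicate "a" is counted once, giving 1 / 3. Raising k to 4 brings in "c" and the score rises to 2 / 3.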

mlflow.metrics.ndcg_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating NDCG@k for retriever models.

NDCG score is capable of handling non-binary notions of relevance. However, for simplicity, we use binary relevance here. The relevance score for documents in the ground truth is 1, and the relevance score for documents not in the ground truth is 0.

The NDCG score is calculated using sklearn.metrics.ndcg_score with the following edge cases on top of the sklearn implementation:

  1. If no ground truth doc IDs are provided and no documents are retrieved, the score is 1.

  2. If no ground truth doc IDs are provided and documents are retrieved, the score is 0.

  3. If ground truth doc IDs are provided and no documents are retrieved, the score is 0.

  4. If duplicate doc IDs are retrieved and the duplicate doc IDs are in the ground truth, they will be treated as different docs. For example, if the ground truth doc IDs are [1, 2] and the retrieved doc IDs are [1, 1, 1, 3], the score will be equivalent to ground truth doc IDs [10, 11, 12, 2] and retrieved doc IDs [10, 11, 12, 3].

Users can create their own EvaluationMetric using the make_metric factory function.

mlflow.metrics.make_metric(*, eval_fn, greater_is_better, name=None, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]

A factory function to create an EvaluationMetric object.

Parameters
  • eval_fn

    A function that computes the metric with the following signature:

    def eval_fn(
        predictions: pandas.Series,
        targets: pandas.Series,
        metrics: Dict[str, MetricValue],
        **kwargs,
    ) -> Union[float, MetricValue]:
        """
        Args:
            predictions: A pandas Series containing the predictions made by the model.
            targets: (Optional) A pandas Series containing the corresponding labels
                for the predictions made on that input.
            metrics: (Optional) A dictionary containing the metrics calculated by the
                default evaluator.  The keys are the names of the metrics and the values
                are the metric values.  To access the MetricValue for the metrics
                calculated by the system, make sure to specify the type hint for this
                parameter as Dict[str, MetricValue].  Refer to the DefaultEvaluator
                behavior section for what metrics will be returned based on the type of
                model (i.e. classifier or regressor).
            kwargs: Includes a list of args that are used to compute the metric. These
                args could be information coming from input data, model outputs,
                other metrics, or parameters specified in the `evaluator_config`
                argument of the `mlflow.evaluate` API.
    
        Returns: MetricValue with per-row scores, per-row justifications, and aggregate
            results.
        """
        ...
    

  • greater_is_better – Whether a higher value of the metric is better.

  • name – The name of the metric. This argument must be specified if eval_fn is a lambda function or the eval_fn.__name__ attribute is not available.

  • long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".

  • version – (Optional) The metric version. For example v1.

  • metric_details – (Optional) A description of the metric and how it is calculated.

  • metric_metadata – (Optional) A dictionary containing metadata for the metric.

  • genai_metric_args – (Optional) A dictionary containing arguments specified by users when calling make_genai_metric or make_genai_metric_from_prompt. Those args are persisted so that we can deserialize the same metric object later.
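
As a minimal sketch of how these pieces fit together, the example below defines a small eval_fn and wraps it with make_metric. The function squared_diff_plus_one and its sample inputs are hypothetical and exist only for illustration. The import is guarded so that the metric logic can be tried without MLflow installed.

```python
from statistics import mean


def squared_diff_plus_one(predictions, targets, metrics=None):
    """Hypothetical eval_fn: mean squared difference, shifted by one."""
    return mean((p - t) ** 2 for p, t in zip(predictions, targets)) + 1


# Works on any pair of equal-length numeric sequences, including pandas Series.
value = squared_diff_plus_one([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(value)

try:
    # MLflow is optional here so that the sketch still runs without it.
    from mlflow.metrics import make_metric
except ImportError:
    custom_metric = None
else:
    custom_metric = make_metric(
        eval_fn=squared_diff_plus_one,
        greater_is_better=False,
        name="squared_diff_plus_one",
    )
```

The resulting custom_metric would then be supplied to mlflow.evaluate() through the extra_metrics parameter.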

Generative AI Metrics

We also provide generative AI (“genai”) EvaluationMetrics for evaluating text models. These metrics use an LLM to evaluate the quality of a model’s output text. Note that your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use. The following factory functions help you customize the intelligent metric to your use case.

mlflow.metrics.genai.answer_correctness(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[list] = None, metric_metadata: Optional[dict] = None, parameters: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer correctness of an LLM using the model provided. Answer correctness will be assessed by the accuracy of the provided output based on the ground_truth, which should be specified in the targets column. High scores mean that your model outputs contain similar information as the ground_truth and that this information is correct, while low scores mean that outputs may disagree with the ground_truth or that the information in the output is incorrect. Note that this builds onto answer_similarity.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters
  • model – (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • metric_version – (Optional) The version of the answer correctness metric to use. Defaults to the latest version.

  • examples – (Optional) Provide a list of examples to help the judge model evaluate the answer correctness. It is highly recommended to add examples to be used as a reference to evaluate the new results.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters will override the default parameters defined in the metric implementation.

  • extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object

mlflow.metrics.genai.answer_relevance(model: Optional[str] = None, metric_version: Optional[str] = 'v1', examples: Optional[list] = None, metric_metadata: Optional[dict] = None, parameters: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer relevance of an LLM using the model provided. Answer relevance will be assessed based on the appropriateness and applicability of the output with respect to the input. High scores mean that your model outputs are about the same subject as the input, while low scores mean that outputs may be non-topical.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters
  • model – (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • metric_version – (Optional) The version of the answer relevance metric to use. Defaults to the latest version.

  • examples – (Optional) Provide a list of examples to help the judge model evaluate the answer relevance. It is highly recommended to add examples to be used as a reference to evaluate the new results.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters will override the default parameters defined in the metric implementation.

  • extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object

mlflow.metrics.genai.answer_similarity(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[list] = None, metric_metadata: Optional[dict] = None, parameters: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer similarity of an LLM using the model provided. Answer similarity will be assessed by the semantic similarity of the output to the ground_truth, which should be specified in the targets column. High scores mean that your model outputs contain similar information as the ground_truth, while low scores mean that outputs may disagree with the ground_truth.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters
  • model – (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • metric_version – (Optional) The version of the answer similarity metric to use. Defaults to the latest version.

  • examples – (Optional) Provide a list of examples to help the judge model evaluate the answer similarity. It is highly recommended to add examples to be used as a reference to evaluate the new results.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters will override the default parameters defined in the metric implementation.

  • extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object

mlflow.metrics.genai.faithfulness(model: Optional[str] = None, metric_version: Optional[str] = 'v1', examples: Optional[list] = None, metric_metadata: Optional[dict] = None, parameters: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the faithfulness of an LLM using the model provided. Faithfulness will be assessed based on how factually consistent the output is to the context. High scores mean that the outputs contain information that is in line with the context, while low scores mean that outputs may disagree with the context (input is ignored).

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.
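
As a small sketch of such a mapping, the configuration below points the context eval_arg at a differently named column. The column name source_documents is a hypothetical example, not something the metric requires.

```python
# Hypothetical dataset whose context lives in a column named "source_documents".
evaluator_config = {
    "col_mapping": {
        "context": "source_documents",  # eval_arg name -> actual column name
    }
}

print(evaluator_config)
```

This dictionary would be passed to mlflow.evaluate() as the evaluator_config argument.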

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters
  • model – (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • metric_version – (Optional) The version of the faithfulness metric to use. Defaults to the latest version.

  • examples – (Optional) Provide a list of examples to help the judge model evaluate the faithfulness. It is highly recommended to add examples to be used as a reference to evaluate the new results.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters will override the default parameters defined in the metric implementation.

  • extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object.
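
The lookup described above, where the context eval_arg is found either under its own name or through col_mapping, can be pictured with a small stand-alone sketch. The helper resolve_eval_arg and the column name retrieved_docs are hypothetical and only illustrate the idea; this is not MLflow's implementation.

```python
from typing import Dict, Optional


def resolve_eval_arg(
    arg_name: str,
    row: Dict[str, str],
    col_mapping: Optional[Dict[str, str]] = None,
) -> str:
    """Return the value for an eval_arg, honoring an optional column mapping."""
    # col_mapping maps the name the metric expects to the actual column name.
    column = (col_mapping or {}).get(arg_name, arg_name)
    if column not in row:
        raise KeyError(f"eval_arg '{arg_name}' not found (looked for column '{column}')")
    return row[column]


# Shape of the evaluator_config described above; "retrieved_docs" is a made-up column.
evaluator_config = {"col_mapping": {"context": "retrieved_docs"}}

row = {
    "inputs": "What is MLflow?",
    "retrieved_docs": "MLflow is an open-source platform for the ML lifecycle.",
}

context = resolve_eval_arg("context", row, evaluator_config["col_mapping"])
print(context)
```

Without the mapping, the same call raises a KeyError because the row has no column literally named context.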

mlflow.metrics.genai.make_genai_metric_from_prompt(name: str, judge_prompt: Optional[str] = None, model: Optional[str] = 'openai:/gpt-4', parameters: Optional[dict] = None, aggregations: Optional[list] = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Create a genai metric that evaluates an LLM using an LLM as a judge in MLflow. The metric is built from only the supplied judge prompt, without any pre-written system prompt. This is useful for use cases that are not covered by the full grading prompt in any EvaluationModel version.

Parameters
  • name – Name of the metric.

  • judge_prompt – The entire prompt to be used for the judge model. The prompt will be minimally wrapped in formatting instructions to ensure scores can be parsed. The prompt may include variables in f-string-style curly braces, e.g., {input}. Each variable must be passed as a keyword argument to the resulting metric’s eval function.

  • model

    (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.

  • aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.

  • greater_is_better – (Optional) Whether the metric is better when it is greater.

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • extra_headers – (Optional) Additional headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

Returns

A metric object.

Example for creating a genai metric
from mlflow.metrics.genai import make_genai_metric_from_prompt

metric = make_genai_metric_from_prompt(
    name="ease_of_understanding",
    judge_prompt=(
        "You must evaluate the output of a bot based on how easy it is to "
        "understand its outputs. "
        "Evaluate the bot's output from the perspective of a layperson. "
        "The bot was provided with this input: {input} and this output: {output}."
    ),
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
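
To see how the {input} and {output} placeholders in a judge prompt relate to the keyword arguments of the resulting metric's eval function, here is a minimal stand-alone sketch built on Python's str.format and string.Formatter. The helper names prompt_variables and fill_prompt are hypothetical, and the formatting instructions that MLflow wraps around the prompt are not reproduced.

```python
from string import Formatter
from typing import List


def prompt_variables(template: str) -> List[str]:
    """List the placeholder names that appear in a prompt template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


def fill_prompt(template: str, **kwargs: str) -> str:
    """Substitute placeholders, failing loudly when a variable is missing."""
    missing = [name for name in prompt_variables(template) if name not in kwargs]
    if missing:
        raise ValueError(f"Missing prompt variables: {missing}")
    return template.format(**kwargs)


judge_prompt = (
    "Evaluate the bot's output from the perspective of a layperson. "
    "The bot was provided with this input: {input} and this output: {output}."
)

print(prompt_variables(judge_prompt))  # ['input', 'output']
print(fill_prompt(judge_prompt, input="What is MLflow?", output="An ML platform."))
```

Checking for missing variables up front gives a clearer error than the bare KeyError that str.format would raise.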

mlflow.metrics.genai.relevance(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[list] = None, metric_metadata: Optional[dict] = None, parameters: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

This function creates a genai metric used to evaluate the relevance of an LLM using the model provided. Relevance is assessed by the appropriateness, significance, and applicability of the output with respect to the input and context. High scores mean that the model has understood the context and correctly extracted relevant information from it, while low scores mean that the output has completely ignored the question and the context and could be hallucinating.

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters
  • model

    (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • metric_version – (Optional) The version of the relevance metric to use. Defaults to the latest version.

  • examples – (Optional) Provide a list of examples to help the judge model evaluate the relevance. It is highly recommended to add examples to be used as a reference to evaluate the new results.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters will override the default parameters defined in the metric implementation.

  • extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object.
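
The max_workers parameter bounds how many rows are scored by the judge at the same time. The following sketch illustrates that idea with concurrent.futures.ThreadPoolExecutor. Here fake_judge is a hypothetical stand-in for a real judge-model call, and MLflow's actual scoring loop may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_judge(row: Tuple[str, str]) -> int:
    """Hypothetical judge: 5 if the output mentions the question's last word, else 1."""
    question, output = row
    keyword = question.rstrip("?").split()[-1].lower()
    return 5 if keyword in output.lower() else 1


def score_rows(rows: List[Tuple[str, str]], max_workers: int = 10) -> List[int]:
    """Score every row with at most `max_workers` concurrent judge calls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with rows.
        return list(pool.map(fake_judge, rows))


rows = [
    ("What is MLflow?", "MLflow is an open-source ML platform."),
    ("What is Spark?", "I do not know."),
]
print(score_rows(rows, max_workers=2))  # [5, 1]
```

Because real judge calls are network-bound, a lower max_workers value is one way to stay under a provider's rate limits.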

mlflow.metrics.genai.retrieve_custom_metrics(run_id: str, name: Optional[str] = None, version: Optional[str] = None) → list[source]

Retrieve the custom metrics created by users through make_genai_metric() or make_genai_metric_from_prompt() that are associated with a particular evaluation run.

Parameters
  • run_id – The unique identifier for the run.

  • name – (Optional) The name of the custom metric to retrieve. If None, retrieve all metrics.

  • version – (Optional) The version of the custom metric to retrieve. If None, retrieve all metrics.

Returns

A list of EvaluationMetric objects that match the retrieval criteria.

Example for retrieving a custom genai metric
import pandas as pd

import mlflow
from mlflow.metrics.genai.genai_metric import (
    make_genai_metric_from_prompt,
    retrieve_custom_metrics,
)

eval_df = pd.DataFrame(
    {
        "inputs": ["foo"],
        "ground_truth": ["bar"],
    }
)
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task="chat.completions",
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    custom_metric = make_genai_metric_from_prompt(
        name="custom llm judge",
        judge_prompt="This is a custom judge prompt.",
        greater_is_better=False,
        parameters={"temperature": 0.0},
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[custom_metric],
    )
metrics = retrieve_custom_metrics(
    run_id=run.info.run_id,
    name="custom llm judge",
)
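
The name and version arguments act as optional filters: leaving either as None places no restriction on that field. The sketch below shows that selection logic over plain dictionaries. The function filter_metrics and the sample records are hypothetical, the sketch assumes the two filters are combined with a logical AND, and it does not contact a tracking server.

```python
from typing import Dict, List, Optional


def filter_metrics(
    metrics: List[Dict[str, str]],
    name: Optional[str] = None,
    version: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep the records that match every filter that is not None."""
    return [
        m
        for m in metrics
        if (name is None or m["name"] == name)
        and (version is None or m["version"] == version)
    ]


# Made-up records standing in for metrics logged with a run.
logged = [
    {"name": "custom llm judge", "version": "v1"},
    {"name": "ease_of_understanding", "version": "v1"},
]

print(filter_metrics(logged, name="custom llm judge"))
print(len(filter_metrics(logged)))  # 2
```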

You can also create your own generative AI EvaluationMetrics using the make_genai_metric factory function.

mlflow.metrics.genai.make_genai_metric(name: str, definition: str, grading_prompt: str, examples: Optional[list] = None, version: Optional[str] = 'v1', model: Optional[str] = 'openai:/gpt-4', grading_context_columns: Optional[Union[list, str]] = None, include_input: bool = True, parameters: Optional[dict] = None, aggregations: Optional[list] = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Optional[dict] = None, extra_headers: Optional[dict] = None, proxy_url: Optional[str] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Create a genai metric that evaluates an LLM using an LLM as a judge in MLflow. The full grading prompt is stored in the metric_details field of the EvaluationMetric object.

Parameters
  • name – Name of the metric.

  • definition – Definition of the metric.

  • grading_prompt – Grading criteria of the metric.

  • examples – (Optional) Examples of the metric.

  • version – (Optional) Version of the metric. Currently supported versions are: v1.

  • model

    (Optional) Model uri of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.

  • grading_context_columns – (Optional) The name of the grading context column, or a list of grading context column names, required to compute the metric. The judge LLM uses these columns as additional information when computing the metric. The columns are extracted from the input dataset or output predictions based on col_mapping in the evaluator_config passed to mlflow.evaluate(). They can also be the names of other evaluated metrics.

  • include_input – (Optional) Whether to include the input when computing the metric.

  • parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.

  • aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.

  • greater_is_better – (Optional) Whether the metric is better when it is greater.

  • max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

  • metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

  • extra_headers – (Optional) Additional headers to be passed to the judge model.

  • proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint, not directly via LLM provider services. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

Returns

A metric object.

Example for creating a genai metric
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine "
        "learning workflows, including experiment tracking, model packaging, "
        "versioning, and deployment, simplifying the ML lifecycle."
    ),
    score=4,
    # No trailing comma inside the parentheses: justification must be a str, not a tuple.
    justification=(
        "The definition effectively explains what MLflow is "
        "its purpose, and its developer. It could be more concise for a 5-score."
    ),
    grading_context={
        "targets": (
            "MLflow is an open-source platform for managing "
            "the end-to-end machine learning (ML) lifecycle. It was developed by "
            "Databricks, a company that specializes in big data and machine learning "
            "solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, "
            "and deploying machine learning models."
        )
    },
)
metric = make_genai_metric(
    name="answer_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth. Scores can be assigned based on "
        "the degree of semantic similarity and factual correctness of the provided output "
        "to the provided targets, where a higher score indicates higher degree of accuracy."
    ),
    grading_prompt=(
        "Answer correctness: Below are the details for different scores:\n"
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets.\n"
        "- Score 2: The output demonstrates some degree of semantic similarity and "
        "includes partially correct information. However, the output still has significant "
        "discrepancies with the provided targets or inaccuracies.\n"
        "- Score 3: The output addresses a couple of aspects of the input accurately, "
        "aligning with the provided targets. However, there are still omissions or minor "
        "inaccuracies.\n"
        "- Score 4: The output is mostly correct. It provides mostly accurate information, "
        "but there may be one or more minor omissions or inaccuracies.\n"
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[example],
    version="v1",
    model="openai:/gpt-4",
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
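
The aggregations option turns the judge's per-row scores into summary values. The sketch below computes the six documented options with the standard library. It assumes population variance and a linearly interpolated 90th percentile, and the helpers percentile and aggregate are hypothetical; MLflow's own implementation may differ in these details.

```python
import statistics
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def aggregate(scores: List[float], options: List[str]) -> Dict[str, float]:
    """Compute the requested aggregations over per-row scores."""
    functions = {
        "min": min,
        "max": max,
        "mean": statistics.mean,
        "median": statistics.median,
        "variance": statistics.pvariance,  # assumption: population variance
        "p90": lambda xs: percentile(xs, 90),
    }
    return {name: float(functions[name](scores)) for name in options}


scores = [1, 2, 3, 4, 5]
print(aggregate(scores, ["mean", "variance", "p90"]))
```

For the five scores above, the mean is 3.0, the population variance is 2.0, and the interpolated p90 is approximately 4.6.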

When using generative AI EvaluationMetrics, it is important to pass in an EvaluationExample.

class mlflow.metrics.genai.EvaluationExample(output: str, score: float, justification: str, input: Optional[str] = None, grading_context: Optional[Union[dict, str]] = None)[source]

Note

Experimental: This class may change or be removed in a future release without warning.

Stores a sample example used for few-shot learning during LLM evaluation.

Parameters
  • input – (Optional) The input provided to the model.

  • output – The output generated by the model.

  • score – The score given by the evaluator.

  • justification – The justification given by the evaluator.

  • grading_context – (Optional) The grading_context provided to the evaluator for evaluation. Either a dictionary of grading context column names and grading context strings or a single grading context string.

Example for creating an EvaluationExample
from mlflow.metrics.genai import EvaluationExample

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
print(str(example))
Output
Input: What is MLflow?
Provided output: "MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle."
Provided ground_truth: "MLflow is an open-source platform for managing "
    "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
    "a company that specializes in big data and machine learning solutions. MLflow is "
    "designed to address the challenges that data scientists and machine learning "
    "engineers face when developing, training, and deploying machine learning models."
Score: 4
Justification: "The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score."
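
The grading_context argument accepts either a dictionary or a single string. One way to picture how such an example could be normalized and rendered into few-shot prompt text is sketched below. FewShotExample is a hypothetical stand-in written for illustration: its field names mirror the parameters listed above and its layout loosely follows the Output shown above, but the key used for a bare string (grading_context) is an assumption, and none of this is the format MLflow is guaranteed to produce.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class FewShotExample:
    output: str
    score: float
    justification: str
    input: Optional[str] = None
    grading_context: Optional[Union[Dict[str, str], str]] = None

    def context_items(self) -> Dict[str, str]:
        """Normalize grading_context to a dict, whatever form it was given in."""
        if self.grading_context is None:
            return {}
        if isinstance(self.grading_context, str):
            # Assumption: a bare string is filed under a generic key.
            return {"grading_context": self.grading_context}
        return dict(self.grading_context)

    def render(self) -> str:
        """Render the example as plain text suitable for a few-shot prompt."""
        lines = []
        if self.input is not None:
            lines.append(f"Input: {self.input}")
        lines.append(f"Provided output: {self.output}")
        for key, value in self.context_items().items():
            lines.append(f"Provided {key}: {value}")
        lines.append(f"Score: {self.score}")
        lines.append(f"Justification: {self.justification}")
        return "\n".join(lines)


example = FewShotExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing ML workflows.",
    score=4,
    justification="Accurate, but could be more concise.",
    grading_context={"ground_truth": "MLflow manages the end-to-end ML lifecycle."},
)
print(example.render())
```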

Users must set the appropriate environment variables for the LLM service they are using for evaluation. For example, if you are using OpenAI’s API, you must set the OPENAI_API_KEY environment variable. If using Azure OpenAI, you must also set the OPENAI_API_TYPE, OPENAI_API_VERSION, OPENAI_API_BASE, and OPENAI_DEPLOYMENT_NAME environment variables. See the Azure OpenAI documentation for details. Users do not need to set these environment variables if they are using a gateway route.
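
A quick pre-flight check can catch a missing variable before an evaluation run starts. The sketch below uses only the variable names listed above. The helper missing_env_vars and the two list constants are hypothetical conveniences and are not part of MLflow.

```python
import os
from typing import Iterable, List, Mapping

OPENAI_VARS = ["OPENAI_API_KEY"]
AZURE_OPENAI_VARS = OPENAI_VARS + [
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_API_BASE",
    "OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(
    required: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# An explicit mapping keeps the example independent of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-placeholder", "OPENAI_API_TYPE": "azure"}
print(missing_env_vars(OPENAI_VARS, fake_env))  # []
print(missing_env_vars(AZURE_OPENAI_VARS, fake_env))
```

Calling missing_env_vars(OPENAI_VARS) with no second argument checks the real process environment instead.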