mlflow.metrics
The mlflow.metrics module helps you quantitatively and qualitatively measure your models.
class mlflow.metrics.EvaluationMetric(eval_fn, name, greater_is_better, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]
An evaluation metric.
- Parameters
eval_fn –
A function that computes the metric with the following signature:

def eval_fn(
    predictions: pandas.Series,
    targets: pandas.Series,
    metrics: Dict[str, MetricValue],
    **kwargs,
) -> Union[float, MetricValue]:
    """
    Args:
        predictions: A pandas Series containing the predictions made by the model.
        targets: (Optional) A pandas Series containing the corresponding labels
            for the predictions made on that input.
        metrics: (Optional) A dictionary containing the metrics calculated by the
            default evaluator. The keys are the names of the metrics and the values
            are the metric values. To access the MetricValue for the metrics
            calculated by the system, make sure to specify the type hint for this
            parameter as Dict[str, MetricValue]. Refer to the DefaultEvaluator
            behavior section for what metrics will be returned based on the type of
            model (i.e. classifier or regressor).
        kwargs: Includes a list of args that are used to compute the metric. These
            args could be information coming from input data, model outputs, other
            metrics, or parameters specified in the `evaluator_config` argument of
            the `mlflow.evaluate` API.

    Returns:
        MetricValue with per-row scores, per-row justifications, and aggregate
        results.
    """
    ...
name – The name of the metric.
greater_is_better – Whether a higher value of the metric is better.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary containing arguments specified by users when calling make_genai_metric or make_genai_metric_from_prompt. Those args are persisted so that we can deserialize the same metric object later.
These EvaluationMetrics are used by the mlflow.evaluate() API, either computed automatically depending on the model_type or specified via the extra_metrics parameter.
The following code demonstrates how to use mlflow.evaluate() with an EvaluationMetric.
import mlflow
import pandas as pd

from mlflow.metrics.genai import EvaluationExample, answer_similarity

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
        ],
    }
)

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)

answer_similarity_metric = answer_similarity(examples=[example])

# `logged_model` refers to a model previously logged to MLflow (placeholder).
results = mlflow.evaluate(
    logged_model.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[answer_similarity_metric],
)
Information about how an EvaluationMetric is calculated, such as the grading prompt used, is available via the metric_details property.
import mlflow
from mlflow.metrics.genai import relevance
my_relevance_metric = relevance()
print(my_relevance_metric.metric_details)
Evaluation results are stored as MetricValue. Aggregate results are logged to the MLflow run as metrics, while per-example results are logged to the MLflow run as artifacts in the form of an evaluation table.
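For example, after calling mlflow.evaluate() as shown above, the aggregate values and the per-row evaluation table can be read back from the returned object. This is a minimal sketch; the "eval_results_table" key assumes the default evaluator's artifact naming:

# Continuing from a `results = mlflow.evaluate(...)` call.
print(results.metrics)  # dictionary of aggregate metric name -> value

# Per-row scores and justifications live in the evaluation table artifact.
eval_table = results.tables["eval_results_table"]
print(eval_table.head())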
class mlflow.metrics.MetricValue(scores=None, justifications=None, aggregate_results=None)[source]
Note
Experimental: This class may change or be removed in a future release without warning.
The value of a metric.
- Parameters
scores – The value of the metric per row
justifications – The justification (if applicable) for the respective score
aggregate_results – A dictionary mapping the name of the aggregation to its value
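For instance, a custom eval_fn might construct a MetricValue as follows (a minimal sketch; the scoring logic and aggregation names are illustrative only):

import numpy as np

from mlflow.metrics import MetricValue


def eval_fn(predictions, targets, metrics, **kwargs):
    # Illustrative per-row score: number of whitespace-separated tokens per prediction.
    scores = [float(len(str(p).split())) for p in predictions]
    return MetricValue(
        scores=scores,
        justifications=[f"{int(s)} whitespace-separated tokens" for s in scores],
        aggregate_results={"mean": float(np.mean(scores)), "max": float(np.max(scores))},
    )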
We provide the following built-in factory functions to create EvaluationMetrics for evaluating models. These metrics are computed automatically depending on the model_type. For more information on the model_type parameter, see the mlflow.evaluate() API.
Regressor Metrics
mlflow.metrics.mae() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating mae.
This metric computes an aggregate score for the mean absolute error for regression.
mlflow.metrics.mape() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating mape.
This metric computes an aggregate score for the mean absolute percentage error for regression.
mlflow.metrics.max_error() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating max_error.
This metric computes an aggregate score for the maximum residual error for regression.
mlflow.metrics.mse() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating mse.
This metric computes an aggregate score for the mean squared error for regression.
mlflow.metrics.rmse() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating the square root of mse.
This metric computes an aggregate score for the root mean squared error for regression.
mlflow.metrics.r2_score() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating r2_score.
This metric computes an aggregate score for the coefficient of determination. R2 ranges from negative infinity to 1, and measures the percentage of variance explained by the predictor variables in a regression.
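As an illustration, evaluating a logged scikit-learn regressor with model_type="regressor" computes these metrics automatically. A minimal sketch with illustrative data and model; the exact set of reported metrics may vary by MLflow version:

import mlflow
import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
y = [2.0, 4.1, 6.2, 7.9]

with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(LinearRegression().fit(X, y), "model")
    results = mlflow.evaluate(
        model_info.model_uri,
        X.assign(label=y),
        targets="label",
        model_type="regressor",  # triggers mae, mape, max_error, mse, rmse, r2_score
    )
    print(results.metrics)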
Classifier Metrics
mlflow.metrics.precision_score() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating precision for classification.
This metric computes an aggregate score between 0 and 1 for the precision of a classification task.
mlflow.metrics.recall_score() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating recall for classification.
This metric computes an aggregate score between 0 and 1 for the recall of a classification task.
mlflow.metrics.f1_score() → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a metric for evaluating f1_score for binary classification.
This metric computes an aggregate score between 0 and 1 for the F1 score (F-measure) of a classification task. F1 score is defined as 2 * (precision * recall) / (precision + recall).
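Similarly, model_type="classifier" computes these classification metrics automatically. A hedged sketch with illustrative data; additional metrics may also be reported depending on the MLflow version:

import mlflow
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"x": [0.1, 0.3, 0.7, 0.9]})
y = [0, 0, 1, 1]

with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(LogisticRegression().fit(X, y), "model")
    results = mlflow.evaluate(
        model_info.model_uri,
        X.assign(label=y),
        targets="label",
        model_type="classifier",  # triggers precision_score, recall_score, f1_score
    )
    print(results.metrics)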
Text Metrics
mlflow.metrics.ari_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating automated readability index using textstat.
This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).
- Aggregations calculated for this metric:
mean
mlflow.metrics.flesch_kincaid_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating the Flesch-Kincaid grade level using textstat.
This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).
- Aggregations calculated for this metric:
mean
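These text metrics can also be computed over a static dataset of already-generated outputs, without re-running a model. A minimal sketch; the column names are illustrative, the textstat package must be installed, and model_type="text" may compute additional metrics (such as toxicity) depending on the MLflow version:

import mlflow
import pandas as pd

data = pd.DataFrame(
    {
        "inputs": ["Explain MLflow briefly."],
        "outputs": ["MLflow is an open-source platform for managing the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    data=data,
    predictions="outputs",  # column containing the text to score
    model_type="text",
)
print(results.metrics)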
Question Answering Metrics
Includes all of the above Text Metrics as well as the following:
mlflow.metrics.exact_match() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating accuracy using sklearn.
This metric only computes an aggregate score which ranges from 0 to 1.
mlflow.metrics.rouge1() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating rouge1.
The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge1 uses unigram based scoring to calculate similarity.
- Aggregations calculated for this metric:
mean
mlflow.metrics.rouge2() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating rouge2.
The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge2 uses bigram based scoring to calculate similarity.
- Aggregations calculated for this metric:
mean
mlflow.metrics.rougeL() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating rougeL.
The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeL uses longest common subsequence based scoring to calculate similarity.
- Aggregations calculated for this metric:
mean
mlflow.metrics.rougeLsum() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating rougeLsum.
The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeLsum uses longest common subsequence based scoring to calculate similarity.
- Aggregations calculated for this metric:
mean
mlflow.metrics.toxicity() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating toxicity using the model roberta-hate-speech-dynabench-r4, which defines hate as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.”
The score ranges from 0 to 1, where scores closer to 1 are more toxic. The default threshold for a text to be considered “toxic” is 0.5.
- Aggregations calculated for this metric:
ratio (of toxic input texts)
mlflow.metrics.token_count() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating token_count. Token count is calculated using tiktoken with the cl100k_base tokenizer.
mlflow.metrics.latency() → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating latency. Latency is determined by the time it takes to generate a prediction for a given input. Note that computing latency requires each row to be predicted sequentially, which will likely slow down the evaluation process.
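The latency and token_count metrics can also be requested explicitly through extra_metrics when they are not already reported for the chosen model_type. A minimal sketch; qa_model_uri is a placeholder for a previously logged model:

import mlflow
import pandas as pd

eval_data = pd.DataFrame({"inputs": ["What is MLflow?"]})

results = mlflow.evaluate(
    qa_model_uri,  # placeholder URI of a previously logged model
    eval_data,
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.latency(), mlflow.metrics.token_count()],
)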
Retriever Metrics
The following metrics are built-in metrics for the 'retriever' model type, meaning they will be automatically calculated with a default retriever_k value of 3.
To evaluate document retrieval models, it is recommended to use a dataset with the following columns:
Input queries
Retrieved relevant doc IDs
Ground-truth doc IDs
Alternatively, you can also provide a function through the model parameter to represent your retrieval model. The function should take a Pandas DataFrame containing input queries and ground-truth relevant doc IDs, and return a DataFrame with a column of retrieved relevant doc IDs.
A “doc ID” is a string or integer that uniquely identifies a document. Each row of the retrieved and ground-truth doc ID columns should consist of a list or numpy array of doc IDs.
Parameters:
targets: A string specifying the column name of the ground-truth relevant doc IDs.
predictions: A string specifying the column name of the retrieved relevant doc IDs in either the static dataset or the DataFrame returned by the model function.
retriever_k: A positive integer specifying the number of retrieved doc IDs to consider for each input query. retriever_k defaults to 3. You can change retriever_k by using the mlflow.evaluate() API:

# with a model and using `evaluator_config`
mlflow.evaluate(
    model=retriever_function,
    data=data,
    targets="ground_truth",
    model_type="retriever",
    evaluators="default",
    evaluator_config={"retriever_k": 5},
)

# with a static dataset and using `extra_metrics`
mlflow.evaluate(
    data=data,
    predictions="predictions_param",
    targets="targets_param",
    model_type="retriever",
    extra_metrics=[
        mlflow.metrics.precision_at_k(5),
        mlflow.metrics.precision_at_k(6),
        mlflow.metrics.recall_at_k(5),
        mlflow.metrics.ndcg_at_k(5),
    ],
)
NOTE: In the 2nd method, it is recommended to omit the model_type as well, or else precision@3 and recall@3 will be calculated in addition to precision@5, precision@6, recall@5, and ndcg_at_k@5.
mlflow.metrics.precision_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating precision_at_k for retriever models.
This metric computes a score between 0 and 1 for each row representing the precision of the retriever model at the given k value. If no relevant documents are retrieved, the score is 0. Let x = min(k, # of retrieved doc IDs). Then, in all other cases, the precision at k is calculated as follows: precision_at_k = (# of relevant retrieved doc IDs in top-x ranked docs) / x.
mlflow.metrics.recall_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for calculating recall_at_k for retriever models.
This metric computes a score between 0 and 1 for each row representing the recall ability of the retriever model at the given k value. If no ground truth doc IDs are provided and no documents are retrieved, the score is 1. However, if no ground truth doc IDs are provided and documents are retrieved, the score is 0. In all other cases, the recall at k is calculated as follows: recall_at_k = (# of unique relevant retrieved doc IDs in top-k ranked docs) / (# of ground truth doc IDs).
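To make the formulas above concrete, here is a hand computation of precision@k and recall@k for a single query (illustrative only, not MLflow's internal implementation):

k = 3
retrieved = ["doc1", "doc4", "doc5", "doc2"]  # ranked retrieved doc IDs
ground_truth = {"doc1", "doc2", "doc3"}       # relevant doc IDs

top_x = retrieved[: min(k, len(retrieved))]   # ["doc1", "doc4", "doc5"]
relevant_retrieved = [d for d in top_x if d in ground_truth]

precision_at_k = len(relevant_retrieved) / len(top_x)            # 1 / 3
recall_at_k = len(set(relevant_retrieved)) / len(ground_truth)   # 1 / 3
print(precision_at_k, recall_at_k)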
mlflow.metrics.ndcg_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a metric for evaluating NDCG@k for retriever models.
NDCG score is capable of handling non-binary notions of relevance. However, for simplicity, we use binary relevance here. The relevance score for documents in the ground truth is 1, and the relevance score for documents not in the ground truth is 0.
The NDCG score is calculated using sklearn.metrics.ndcg_score with the following edge cases on top of the sklearn implementation:
If no ground truth doc IDs are provided and no documents are retrieved, the score is 1.
If no ground truth doc IDs are provided and documents are retrieved, the score is 0.
If ground truth doc IDs are provided and no documents are retrieved, the score is 0.
If duplicate doc IDs are retrieved and the duplicate doc IDs are in the ground truth, they will be treated as different docs. For example, if the ground truth doc IDs are [1, 2] and the retrieved doc IDs are [1, 1, 1, 3], the score will be equivalent to ground truth doc IDs [10, 11, 12, 2] and retrieved doc IDs [10, 11, 12, 3].
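The binary-relevance framing can be illustrated directly with sklearn.metrics.ndcg_score (a sketch only; MLflow applies the edge-case handling listed above on top of this):

from sklearn.metrics import ndcg_score

retrieved = [1, 3, 2]      # ranked retrieved doc IDs
ground_truth = {1, 2}      # relevant doc IDs

# Binary relevance of each retrieved doc, in rank order -> [[1, 0, 1]]
relevance = [[1 if doc in ground_truth else 0 for doc in retrieved]]
# Higher score means earlier rank -> [[3, 2, 1]]
rank_scores = [[len(retrieved) - i for i in range(len(retrieved))]]

print(ndcg_score(relevance, rank_scores, k=3))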
Users can create their own EvaluationMetric using the make_metric factory function.
mlflow.metrics.make_metric(*, eval_fn, greater_is_better, name=None, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]
A factory function to create an EvaluationMetric object.
- Parameters
eval_fn –
A function that computes the metric with the following signature:

def eval_fn(
    predictions: pandas.Series,
    targets: pandas.Series,
    metrics: Dict[str, MetricValue],
    **kwargs,
) -> Union[float, MetricValue]:
    """
    Args:
        predictions: A pandas Series containing the predictions made by the model.
        targets: (Optional) A pandas Series containing the corresponding labels
            for the predictions made on that input.
        metrics: (Optional) A dictionary containing the metrics calculated by the
            default evaluator. The keys are the names of the metrics and the values
            are the metric values. To access the MetricValue for the metrics
            calculated by the system, make sure to specify the type hint for this
            parameter as Dict[str, MetricValue]. Refer to the DefaultEvaluator
            behavior section for what metrics will be returned based on the type of
            model (i.e. classifier or regressor).
        kwargs: Includes a list of args that are used to compute the metric. These
            args could be information coming from input data, model outputs, other
            metrics, or parameters specified in the `evaluator_config` argument of
            the `mlflow.evaluate` API.

    Returns:
        MetricValue with per-row scores, per-row justifications, and aggregate
        results.
    """
    ...
greater_is_better – Whether a higher value of the metric is better.
name – The name of the metric. This argument must be specified if eval_fn is a lambda function or the eval_fn.__name__ attribute is not available.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary containing arguments specified by users when calling make_genai_metric or make_genai_metric_from_prompt. Those args are persisted so that we can deserialize the same metric object later.
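For example, a simple custom metric could be defined as follows (a minimal sketch; the metric name, scoring logic, and aggregations are illustrative only):

import numpy as np

from mlflow.metrics import MetricValue, make_metric


def prediction_length(predictions, targets, metrics, **kwargs):
    # Per-row score: number of whitespace-separated tokens in each prediction.
    scores = [float(len(str(p).split())) for p in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": float(np.mean(scores))},
    )


# Whether longer outputs are "better" is metric-specific; True is chosen arbitrarily here.
length_metric = make_metric(
    eval_fn=prediction_length,
    greater_is_better=True,
    name="prediction_length",
)

The resulting metric can then be passed to mlflow.evaluate() via the extra_metrics parameter.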
Generative AI Metrics
We also provide generative AI (“genai”) EvaluationMetrics for evaluating text models. These metrics use an LLM to evaluate the quality of a model’s output text. Note that your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use. The following factory functions help you customize the intelligent metric to your use case.
mlflow.metrics.genai.answer_correctness(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a genai metric used to evaluate the answer correctness of an LLM using the model provided. Answer correctness will be assessed by the accuracy of the provided output based on the ground_truth, which should be specified in the targets column.
The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().
An MlflowException will be raised if the specified version for this metric does not exist.
- Parameters
model – Model uri of an openai or gateway judge model in the format of “openai:/gpt-4” or “gateway:/my-route”. Defaults to “openai:/gpt-4”. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
metric_version – The version of the answer correctness metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the answer correctness. It is highly recommended to add examples to be used as a reference to evaluate the new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object
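A minimal usage sketch with a static dataset (column names are illustrative; an OpenAI API key must be configured for the default judge model):

import mlflow
import pandas as pd

from mlflow.metrics.genai import answer_correctness

data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "outputs": ["MLflow is an open-source MLOps platform."],
        "ground_truth": ["MLflow is an open-source platform for the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    data=data,
    predictions="outputs",
    targets="ground_truth",  # supplies the targets eval_arg required by this metric
    extra_metrics=[answer_correctness()],
)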
mlflow.metrics.genai.answer_relevance(model: Optional[str] = None, metric_version: Optional[str] = 'v1', examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a genai metric used to evaluate the answer relevance of an LLM using the model provided. Answer relevance will be assessed based on the appropriateness and applicability of the output with respect to the input.
An MlflowException will be raised if the specified version for this metric does not exist.
- Parameters
model – Model uri of an openai or gateway judge model in the format of “openai:/gpt-4” or “gateway:/my-route”. Defaults to “openai:/gpt-4”. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
metric_version – The version of the answer relevance metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the answer relevance. It is highly recommended to add examples to be used as a reference to evaluate the new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object
mlflow.metrics.genai.answer_similarity(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a genai metric used to evaluate the answer similarity of an LLM using the model provided. Answer similarity will be assessed by the semantic similarity of the output to the ground_truth, which should be specified in the targets column.
The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().
An MlflowException will be raised if the specified version for this metric does not exist.
- Parameters
model – (Optional) Model uri of an openai or gateway judge model in the format of “openai:/gpt-4” or “gateway:/my-route”. Defaults to “openai:/gpt-4”. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
metric_version – (Optional) The version of the answer similarity metric to use. Defaults to the latest version.
examples – (Optional) Provide a list of examples to help the judge model evaluate the answer similarity. It is highly recommended to add examples to be used as a reference to evaluate the new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object
mlflow.metrics.genai.faithfulness(model: Optional[str] = None, metric_version: Optional[str] = 'v1', examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
This function will create a genai metric used to evaluate the faithfulness of an LLM using the model provided. Faithfulness will be assessed based on how factually consistent the output is to the context.
The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.
An MlflowException will be raised if the specified version for this metric does not exist.
- Parameters
model – Model uri of an openai or gateway judge model in the format of “openai:/gpt-4” or “gateway:/my-route”. Defaults to “openai:/gpt-4”. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
metric_version – The version of the faithfulness metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the faithfulness. It is highly recommended to add examples to be used as a reference to evaluate the new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object
mlflow.metrics.genai.make_genai_metric_from_prompt(name: str, judge_prompt: Optional[str] = None, model: Optional[str] = 'openai:/gpt-4', parameters: Optional[Dict[str, Any]] = None, aggregations: Optional[List[str]] = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Create a genai metric used to evaluate an LLM using an LLM as a judge in MLflow. This produces a metric using only the supplied judge prompt without any pre-written system prompt. This can be useful for use cases that are not covered by the full grading prompt in any EvaluationModel version.
- Parameters
name – Name of the metric.
judge_prompt – The entire prompt to be used for the judge model. The prompt will be minimally wrapped in formatting instructions to ensure scores can be parsed. The prompt may use f-string formatting to include variables. Corresponding variables must be passed as keyword arguments into the resulting metric’s eval function.
model – (Optional) Model uri of an openai, gateway, or deployments judge model in the format of “openai:/gpt-4”, “gateway:/my-route”, or “endpoints:/databricks-llama-2-70b-chat”. Defaults to “openai:/gpt-4”. If using Azure OpenAI, the OPENAI_DEPLOYMENT_NAME environment variable will take precedence. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether the metric is better when it is greater.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object.
from mlflow.metrics.genai import make_genai_metric_from_prompt

metric = make_genai_metric_from_prompt(
    name="ease_of_understanding",
    judge_prompt=(
        "You must evaluate the output of a bot based on how easy it is to "
        "understand its outputs."
        "Evaluate the bot's output from the perspective of a layperson."
        "The bot was provided with this input: {input} and this output: {output}."
    ),
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
mlflow.metrics.genai.relevance(model: Optional[str] = None, metric_version: Optional[str] = None, examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
This function will create a genai metric used to evaluate the relevance of an LLM using the model provided. Relevance will be assessed by the appropriateness, significance, and applicability of the output with respect to the input and context.
The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.
An MlflowException will be raised if the specified version for this metric does not exist.
- Parameters
model – (Optional) Model uri of an openai or gateway judge model in the format of “openai:/gpt-4” or “gateway:/my-route”. Defaults to “openai:/gpt-4”. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
metric_version – (Optional) The version of the relevance metric to use. Defaults to the latest version.
examples – (Optional) Provide a list of examples to help the judge model evaluate the relevance. It is highly recommended to add examples to be used as a reference to evaluate the new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object
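A sketch of mapping the required context eval_arg to a differently named column via col_mapping (column names are illustrative; an OpenAI API key must be configured for the default judge model):

import mlflow
import pandas as pd

from mlflow.metrics.genai import relevance

data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "outputs": ["MLflow manages the ML lifecycle."],
        "retrieved_docs": ["MLflow is an open-source platform for the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    data=data,
    predictions="outputs",
    extra_metrics=[relevance()],
    # Map the metric's `context` eval_arg to the "retrieved_docs" column.
    evaluator_config={"col_mapping": {"context": "retrieved_docs"}},
)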
mlflow.metrics.genai.retrieve_custom_metrics(run_id: str, name: Optional[str] = None, version: Optional[str] = None) → List[mlflow.models.evaluation.base.EvaluationMetric][source]
Retrieve the custom metrics created by users through make_genai_metric() or make_genai_metric_from_prompt() that are associated with a particular evaluation run.
- Parameters
run_id – The unique identifier for the run.
name – (Optional) The name of the custom metric to retrieve. If None, retrieve all metrics.
version – (Optional) The version of the custom metric to retrieve. If None, retrieve all metrics.
- Returns
A list of EvaluationMetric objects that match the retrieval criteria.
import pandas as pd

import mlflow
from mlflow.metrics.genai.genai_metric import (
    make_genai_metric_from_prompt,
    retrieve_custom_metrics,
)

eval_df = pd.DataFrame(
    {
        "inputs": ["foo"],
        "ground_truth": ["bar"],
    }
)
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task="chat.completions",
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    custom_metric = make_genai_metric_from_prompt(
        name="custom llm judge",
        judge_prompt="This is a custom judge prompt.",
        greater_is_better=False,
        parameters={"temperature": 0.0},
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[custom_metric],
    )
metrics = retrieve_custom_metrics(
    run_id=run.info.run_id,
    name="custom llm judge",
)
You can also create your own generative AI EvaluationMetrics using the make_genai_metric factory function.
mlflow.metrics.genai.make_genai_metric(name: str, definition: str, grading_prompt: str, examples: Optional[List[mlflow.metrics.genai.base.EvaluationExample]] = None, version: Optional[str] = 'v1', model: Optional[str] = 'openai:/gpt-4', grading_context_columns: Optional[Union[List[str], str]] = None, include_input: bool = True, parameters: Optional[Dict[str, Any]] = None, aggregations: Optional[List[str]] = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Optional[Dict[str, Any]] = None) → mlflow.models.evaluation.base.EvaluationMetric[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Create a genai metric used to evaluate an LLM using an LLM as a judge in MLflow. The full grading prompt is stored in the metric_details field of the EvaluationMetric object.
- Parameters
name – Name of the metric.
definition – Definition of the metric.
grading_prompt – Grading criteria of the metric.
examples – (Optional) Examples of the metric.
version – (Optional) Version of the metric. Currently supported versions are: v1.
model – (Optional) Model uri of an openai, gateway, or deployments judge model in the format of “openai:/gpt-4”, “gateway:/my-route”, or “endpoints:/databricks-llama-2-70b-chat”. Defaults to “openai:/gpt-4”. If using Azure OpenAI, the OPENAI_DEPLOYMENT_NAME environment variable will take precedence. Your use of a third party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service’s terms of use.
grading_context_columns – (Optional) The name of the grading context column, or a list of grading context column names, required to compute the metric. The grading_context_columns are used by the LLM as a judge as additional information to compute the metric. The columns are extracted from the input dataset or output predictions based on col_mapping in the evaluator_config passed to mlflow.evaluate(). They can also be the name of other evaluated metrics.
include_input – (Optional) Whether to include the input when computing the metric.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether the metric is better when it is greater.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
- Returns
A metric object.
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine "
        "learning workflows, including experiment tracking, model packaging, "
        "versioning, and deployment, simplifying the ML lifecycle."
    ),
    score=4,
    justification=(
        "The definition effectively explains what MLflow is, "
        "its purpose, and its developer. It could be more concise for a 5-score."
    ),
    grading_context={
        "targets": (
            "MLflow is an open-source platform for managing "
            "the end-to-end machine learning (ML) lifecycle. It was developed by "
            "Databricks, a company that specializes in big data and machine learning "
            "solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, "
            "and deploying machine learning models."
        )
    },
)

metric = make_genai_metric(
    name="answer_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth. Scores can be assigned based on "
        "the degree of semantic similarity and factual correctness of the provided output "
        "to the provided targets, where a higher score indicates higher degree of accuracy."
    ),
    grading_prompt=(
        "Answer correctness: Below are the details for different scores:"
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets."
        "- Score 2: The output demonstrates some degree of semantic similarity and "
        "includes partially correct information. However, the output still has significant "
        "discrepancies with the provided targets or inaccuracies."
        "- Score 3: The output addresses a couple of aspects of the input accurately, "
        "aligning with the provided targets. However, there are still omissions or minor "
        "inaccuracies."
        "- Score 4: The output is mostly correct. It provides mostly accurate information, "
        "but there may be one or more minor omissions or inaccuracies."
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[example],
    version="v1",
    model="openai:/gpt-4",
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
When using generative AI EvaluationMetrics, it is important to pass in an EvaluationExample.
class mlflow.metrics.genai.EvaluationExample(output: str, score: float, justification: str, input: Optional[str] = None, grading_context: Optional[Union[Dict[str, str], str]] = None)[source]
Note
Experimental: This class may change or be removed in a future release without warning.
Stores a sample example used for few-shot learning during LLM evaluation.
- Parameters
input – The input provided to the model
output – The output generated by the model
score – The score given by the evaluator
justification – The justification given by the evaluator
grading_context – The grading_context provided to the evaluator for evaluation. Either a dictionary of grading context column names and grading context strings or a single grading context string.
from mlflow.metrics.genai import EvaluationExample

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
print(str(example))
Input: What is MLflow?
Provided output: "MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment, simplifying the ML lifecycle."
Provided ground_truth: "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models."
Score: 4
Justification: "The definition effectively explains what MLflow is, its purpose, and its developer. It could be more concise for a 5-score."
Users must set the appropriate environment variables for the LLM service they are using for evaluation. For example, if you are using OpenAI’s API, you must set the OPENAI_API_KEY environment variable. If using Azure OpenAI, you must also set the OPENAI_API_TYPE, OPENAI_API_VERSION, OPENAI_API_BASE, and OPENAI_DEPLOYMENT_NAME environment variables. See the Azure OpenAI documentation for details.
Users do not need to set these environment variables if they are using a gateway route.
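For example (all values are placeholders; the Azure API version shown is only illustrative):

import os

# OpenAI
os.environ["OPENAI_API_KEY"] = "<your-api-key>"

# Azure OpenAI additionally requires the following (placeholder values):
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = "https://<your-resource>.openai.azure.com/"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "<your-deployment-name>"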