# Tool Call Evaluation with Built-in Judges
AI agents often use tools (functions) to complete tasks - from fetching data to performing calculations. Evaluating tool-calling applications requires assessing whether agents select appropriate tools and provide correct arguments to fulfill user requests.
MLflow provides built-in judges designed specifically for evaluating tool-calling agents:
## Available Tool Call Judges
| Judge | What does it evaluate? | Requires ground-truth? | Requires traces? |
|---|---|---|---|
| ToolCallCorrectness | Are the tool calls and arguments correct for the user query? | No | ⚠️ Trace Required |
| ToolCallEfficiency | Are the tool calls efficient without redundancy? | No | ⚠️ Trace Required |
All tool call judges require MLflow Traces with at least one span marked as `span_type="TOOL"`. Use the `@mlflow.trace` decorator with `span_type="TOOL"` on your tool functions.
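For example, a tool function instrumented like this produces the TOOL span the judges look for (the function and its return value are placeholders; the complete agent example below uses the same pattern):

```python
import mlflow

# span_type="TOOL" marks this call as a tool span that the judges can evaluate
@mlflow.trace(span_type="TOOL")
def lookup_order_status(order_id: str) -> dict:
    # Placeholder implementation - a real tool would query your order system
    return {"order_id": order_id, "status": "shipped"}
```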
## Prerequisites for running the examples

- Install MLflow and required packages:

  ```bash
  pip install --upgrade mlflow
  ```

- Create an MLflow experiment by following the Set up your environment quickstart (or see the programmatic sketch after this list).

- (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

  ```python
  import mlflow
  import os
  import openai

  # Ensure your OPENAI_API_KEY is set in your environment
  # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

  # Enable auto-tracing for OpenAI
  mlflow.openai.autolog()

  # Create an OpenAI client
  client = openai.OpenAI()

  # Select an LLM
  model_name = "gpt-4o-mini"
  ```
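If you prefer to create the experiment programmatically rather than through the quickstart's UI flow, a minimal sketch (the experiment name is illustrative, and this assumes your tracking URI is already configured):

```python
import mlflow

# Creates the experiment if it does not already exist and makes it the active one
mlflow.set_experiment("tool-call-evaluation")
```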
## Complete Agent Example
Here's a complete example showing how to build a tool-calling agent and evaluate it with the judges:
```python
import json
import mlflow
import openai
from mlflow.genai.scorers import ToolCallCorrectness, ToolCallEfficiency

mlflow.openai.autolog()
client = openai.OpenAI()

# Define the tool schema for the LLM
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"},
                },
                "required": ["location"],
            },
        },
    },
]


# Define the tool function with proper span type
@mlflow.trace(span_type="TOOL")
def get_weather(location: str) -> dict:
    # Simulated weather data - in practice, this would call a weather API
    return {"temperature": 72, "condition": "sunny", "location": location}


# Define your agent
@mlflow.trace
def agent(query: str):
    # Call the LLM with tools
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
        tools=tools,
    )
    message = response.choices[0].message

    responses = []
    if message.tool_calls:
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = get_weather(**args)
            responses.append(
                {
                    "response": f"Weather in {result['location']}: {result['condition']}, {result['temperature']}°F"
                }
            )

    return {"response": responses if responses else message.content}


# Create evaluation dataset
eval_dataset = [
    {"inputs": {"query": "What's the weather like in Paris?"}},
    {"inputs": {"query": "How's the weather in Tokyo?"}},
]

# Run evaluation with tool call judges
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=agent,
    scorers=[
        ToolCallCorrectness(model="openai:/gpt-4o-mini"),
        ToolCallEfficiency(model="openai:/gpt-4o-mini"),
    ],
)
```
## Understanding the Results
Each tool call judge evaluates tool spans separately:
- ToolCallCorrectness: Assesses whether the agent selected appropriate tools and provided correct arguments
- ToolCallEfficiency: Evaluates whether the agent made redundant or unnecessary tool calls
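To inspect the aggregated judge scores after the run completes, you can read them off the returned result object; the snippet below assumes it exposes `metrics` and `run_id` attributes, which may vary by MLflow version (per-trace assessments are easiest to browse in the MLflow UI on the evaluation run):

```python
# Aggregated scores from ToolCallCorrectness and ToolCallEfficiency (assumed attribute names)
print(eval_results.metrics)

# The run backing the evaluation - open it in the MLflow UI for per-trace assessments
print(eval_results.run_id)
```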
## Select the LLM that powers the judge
You can change the judge model by using the `model` argument in the judge definition. The model must be specified in the format `<provider>:/<model-name>`, where `<provider>` is a LiteLLM-compatible model provider.
For a list of supported models, see selecting judge models.
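For example, to power both judges with an Anthropic-hosted model instead of OpenAI (the model name below is illustrative; any LiteLLM-compatible provider works as long as its API key is configured):

```python
from mlflow.genai.scorers import ToolCallCorrectness, ToolCallEfficiency

# Any LiteLLM-compatible provider can back the judges; this model name is illustrative
scorers = [
    ToolCallCorrectness(model="anthropic:/claude-3-7-sonnet-20250219"),
    ToolCallEfficiency(model="anthropic:/claude-3-7-sonnet-20250219"),
]
```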