ToolCallEfficiency Judge
The ToolCallEfficiency judge evaluates the agent's trajectory for redundancy in tool usage, such as tool calls with the same or similar arguments.
This built-in LLM judge is designed for evaluating AI agents and tool-calling applications where you need to ensure the agent operates efficiently without making unnecessary or duplicate tool calls.
Prerequisites for running the examples
- Install MLflow and required packages:

  ```bash
  pip install --upgrade mlflow
  ```
- Create an MLflow experiment by following the "Set up your environment" quickstart.
- (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
  ```python
  import os

  import openai

  import mlflow

  # Ensure your OPENAI_API_KEY is set in your environment
  # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

  # Enable auto-tracing for OpenAI
  mlflow.openai.autolog()

  # Create an OpenAI client
  client = openai.OpenAI()

  # Select an LLM
  model_name = "gpt-4o-mini"
  ```
Usage examples
The ToolCallEfficiency judge can be invoked directly for single trace assessment or used with MLflow's evaluation framework for batch evaluation.
Requirements:
- Trace requirements: The MLflow Trace must contain at least one span with `span_type` set to `TOOL` (a minimal example of producing such a trace is shown below)
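If you do not already have such a trace, the sketch below shows one way to produce it, using the `@mlflow.trace` decorator to record a hypothetical `get_weather` tool as a `TOOL` span inside a hypothetical `run_agent` function. In practice, auto-instrumentation such as `mlflow.openai.autolog()` typically creates these spans for you; the names here are purely illustrative.

```python
import mlflow
from mlflow.entities import SpanType


# Hypothetical tool; the decorator records its execution as a TOOL span,
# which is what the judge inspects for redundancy.
@mlflow.trace(span_type=SpanType.TOOL)
def get_weather(city: str) -> str:
    return f"It is sunny in {city}."


# Hypothetical agent entry point; calling the tool twice with identical
# arguments is exactly the kind of redundancy the judge flags.
@mlflow.trace(span_type=SpanType.AGENT)
def run_agent(question: str) -> str:
    get_weather("Tokyo")
    get_weather("Tokyo")
    return "It is sunny in Tokyo."


run_agent("What is the weather in Tokyo?")
```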
Invoke directly:
```python
import mlflow
from mlflow.genai.scorers import ToolCallEfficiency

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess whether the tool calls in the trace are efficient
feedback = ToolCallEfficiency(name="my_tool_call_efficiency")(trace=trace)
print(feedback)
```
Invoke with evaluate():

```python
import mlflow
from mlflow.genai.scorers import ToolCallEfficiency

# Evaluate traces from previous runs.
# For example, `traces` can be a DataFrame returned by mlflow.search_traces().
results = mlflow.genai.evaluate(
    data=traces,  # DataFrame or list containing trace data
    scorers=[ToolCallEfficiency()],
)
```
For a complete agent example with this judge, see the Tool Call Evaluation guide.
Select the LLM that powers the judge
You can change the judge model by using the `model` argument in the judge definition. The model must be specified in the format `<provider>:/<model-name>`, where `<provider>` is a LiteLLM-compatible model provider.
For a list of supported models, see selecting judge models.
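For example, here is a minimal sketch that points the judge at an OpenAI-hosted model; the specific model name is illustrative, and any supported provider/model pair can be substituted.

```python
from mlflow.genai.scorers import ToolCallEfficiency

# "openai:/gpt-4o-mini" is an illustrative <provider>:/<model-name> value;
# substitute any LiteLLM-compatible provider and model.
judge = ToolCallEfficiency(model="openai:/gpt-4o-mini")
feedback = judge(trace=trace)  # trace obtained as in the examples above
```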
Interpret results
The judge returns a Feedback object with:
- `value`: "yes" if the tool calls are efficient, "no" otherwise
- `rationale`: Detailed explanation identifying:
  - Which specific tool calls are redundant, if any
  - Why certain calls are considered duplicates or could be consolidated
  - Why the tool usage is considered efficient, if no redundancy is found
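As a minimal sketch, both fields can be read directly from the returned object after a single assessment:

```python
from mlflow.genai.scorers import ToolCallEfficiency

feedback = ToolCallEfficiency()(trace=trace)  # trace obtained as in the examples above

print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # why the tool usage is (or is not) efficient
```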