ToolCallEfficiency Judge

The ToolCallEfficiency judge evaluates the agent's trajectory for redundancy in tool usage, such as tool calls with the same or similar arguments.

This built-in LLM judge is designed for evaluating AI agents and tool-calling applications where you need to ensure the agent operates efficiently without making unnecessary or duplicate tool calls.

Prerequisites for running the examples

  1. Install MLflow and required packages

    bash
    pip install --upgrade mlflow
  2. Create an MLflow experiment by following the set up your environment quickstart.

  3. (Optional, if using OpenAI models) Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    python
    import mlflow
    import os
    import openai

    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()

    # Create an OpenAI client
    client = openai.OpenAI()

    # Select an LLM
    model_name = "gpt-4o-mini"

Usage examples

The ToolCallEfficiency judge can be invoked directly for single trace assessment or used with MLflow's evaluation framework for batch evaluation.

Requirements:

  • Trace requirements: The MLflow Trace must contain at least one span with span_type set to TOOL
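If you don't already have such a trace, the hedged sketch below shows one way to produce one by decorating a tool function with mlflow.trace(span_type="TOOL"); the function names and logic are purely illustrative.

python
import mlflow

# Hypothetical tool -- span_type="TOOL" gives the trace the TOOL span the judge requires
@mlflow.trace(span_type="TOOL")
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Hypothetical agent entry point that calls the tool twice with identical
# arguments, so the judge has some redundancy to flag
@mlflow.trace
def run_agent(question: str) -> str:
    get_weather("Paris")
    get_weather("Paris")  # duplicate call with the same arguments
    return "It is sunny in Paris."

run_agent("What's the weather in Paris?")
trace_id = mlflow.get_last_active_trace_id()  # assumption: use this ID in the example below

With a trace in hand, invoke the judge on it directly:
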
python
from mlflow.genai.scorers import ToolCallEfficiency
import mlflow

# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")

# Assess if tool calls are efficient
feedback = ToolCallEfficiency(name="my_tool_call_efficiency")(trace=trace)
print(feedback)
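The same judge can also run in batch through MLflow's evaluation framework. A minimal sketch, assuming the traces to score can be fetched with mlflow.search_traces and passed directly to mlflow.genai.evaluate:

python
import mlflow
from mlflow.genai.scorers import ToolCallEfficiency

# Assumption: the current experiment already contains agent traces with TOOL spans
traces = mlflow.search_traces(max_results=10)

# Score every trace in the batch with the judge
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[ToolCallEfficiency()],
)
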
tip

For a complete agent example with this judge, see the Tool Call Evaluation guide.

Select the LLM that powers the judge

You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider.

For a list of supported models, see selecting judge models.
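For example, a judge definition pointing at an OpenAI-hosted model might look like the following (the model name is illustrative; any supported provider/model pair works):

python
from mlflow.genai.scorers import ToolCallEfficiency

# "openai:/gpt-4o-mini" follows the <provider>:/<model-name> format described above
judge = ToolCallEfficiency(model="openai:/gpt-4o-mini")
feedback = judge(trace=trace)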

Interpret results

The judge returns a Feedback object with:

  • value: "yes" if tool calls are efficient, "no" otherwise
  • rationale: Detailed explanation identifying:
    • Which specific tool calls are redundant (if any)
    • Why certain calls are considered duplicates or could be consolidated
    • Or, when no redundancy is found, why the tool usage is efficient
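For example, you can branch on value and surface rationale (a minimal sketch, reusing the feedback returned by the call shown earlier):

python
feedback = ToolCallEfficiency()(trace=trace)

if feedback.value == "yes":
    print("Tool usage is efficient")
else:
    print("Redundant tool calls detected")

# The rationale explains which calls were flagged and why
print(feedback.rationale)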

Next steps