Guidelines-based LLM Scorers

Guidelines is a scorer class that lets you customize evaluation by defining natural-language criteria framed as pass/fail conditions. It is well suited to checking compliance with rules, style guides, or requirements to include or exclude information.

Guidelines are easy to explain to business stakeholders ("we are evaluating whether the app follows this set of rules"), so domain experts can often write them directly.

Example usage

First, define each guideline as a plain string:

python
tone = "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns."
easy_to_understand = "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."

Then pass each guideline to the Guidelines class to create a scorer, and run the evaluation:

python
import mlflow
from mlflow.genai.scorers import Guidelines

eval_dataset = [
    {
        "inputs": {"question": "I'm having trouble with my account.  I can't log in."},
        "outputs": "I'm sorry to hear that you're having trouble logging in. Please provide me with your username and the specific issue you're experiencing, and I'll be happy to help you resolve it.",
    },
    {
        "inputs": {"question": "How much does a microwave cost?"},
        "outputs": "The microwave costs $100.",
    },
    {
        "inputs": {"question": "How does a refrigerator work?"},
        "outputs": "A refrigerator operates via thermodynamic vapor-compression cycles utilizing refrigerant phase transitions. The compressor pressurizes vapor which condenses externally, then expands through evaporator coils to absorb internal heat through endothermic vaporization.",
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # Create a scorer for each guideline
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="easy_to_understand", guidelines=easy_to_understand),
        Guidelines(name="banned_topics", guidelines=banned_topics),
    ],
)
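
As the number of guidelines grows, it can help to keep them in one mapping and build the scorer list from it. The sketch below is one possible way to do that, not part of the MLflow API: `GUIDELINES` and `build_scorers` are hypothetical names, and the scorer class is passed in as an argument so the helper can be exercised without MLflow installed. It assumes only that the class accepts `name=` and `guidelines=` keyword arguments, as in the example above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: scorer name -> guideline text (a pass/fail condition).
GUIDELINES: Dict[str, str] = {
    "tone": (
        "The response must maintain a courteous, respectful tone throughout. "
        "It must show empathy for customer concerns."
    ),
    "banned_topics": (
        "If the request is a question about product pricing, the response must "
        "politely decline to answer and refer the user to the pricing page."
    ),
}


def build_scorers(
    guidelines: Dict[str, str],
    scorer_factory: Callable[..., Any],
) -> List[Any]:
    """Create one scorer per guideline using the supplied factory.

    `scorer_factory` is assumed to accept `name=` and `guidelines=` keyword
    arguments, matching how Guidelines is called in the example above.
    """
    scorers = []
    for name, text in guidelines.items():
        # Reject blank guidelines early rather than evaluating against nothing.
        if not text.strip():
            raise ValueError(f"Guideline {name!r} is empty")
        scorers.append(scorer_factory(name=name, guidelines=text))
    return scorers
```

With MLflow installed, `build_scorers(GUIDELINES, Guidelines)` would produce a list that can be passed as the `scorers` argument of `mlflow.genai.evaluate`.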