# RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed for LLM applications. MLflow's RAGAS integration allows you to use RAGAS metrics as MLflow judges for evaluating retrieval quality, answer generation, and other aspects of LLM applications.
## Prerequisites

RAGAS judges require the `ragas` package:

```bash
pip install ragas
```
## Quick Start

You can call RAGAS judges directly on a trace:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="openai:/gpt-4")

# `trace` is an MLflow trace captured from your application (see the sketch below)
feedback = scorer(trace=trace)

print(feedback.value)      # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```
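If you do not already have a trace in hand, one way to produce and fetch one is to instrument your application with MLflow Tracing. The sketch below is illustrative only: `answer_question` is a hypothetical app, and the trace-retrieval calls (`mlflow.get_last_active_trace_id`, `mlflow.get_trace`) should be checked against your MLflow version.

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness

# Hypothetical RAG app; @mlflow.trace records each call as a trace,
# including any nested retrieval / LLM spans.
@mlflow.trace
def answer_question(question: str) -> str:
    docs = ["MLflow is an open source MLOps platform."]  # stand-in retrieval
    return f"According to the docs: {docs[0]}"

answer_question("What is MLflow?")

# Fetch the trace that was just logged and score it
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
feedback = Faithfulness(model="openai:/gpt-4")(trace=trace)
```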
Or use them in `mlflow.genai.evaluate`:

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        ContextPrecision(model="openai:/gpt-4"),
    ],
)
```
## Available RAGAS Judges
RAGAS judges are organized into categories based on their evaluation focus:
### RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| ContextUtilization | How effectively is the retrieved context being utilized in the answer? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision, compared against reference contexts | Link |
| ContextRecall | Does retrieval context contain all information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM version of context recall, compared against reference contexts | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| AnswerRelevancy | How relevant is the generated answer to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
| AnswerAccuracy | How accurate is the answer compared to ground truth? | Link |
| ContextRelevance | How relevant is the retrieved context to the input query? | Link |
| ResponseGroundedness | Is the response grounded in the provided context? | Link |
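Some of these metrics (for example ContextRecall, ContextEntityRecall, and AnswerAccuracy) compare against a ground-truth answer in addition to the trace contents. A minimal sketch, assuming the dataset follows `mlflow.genai.evaluate`'s `inputs`/`outputs`/`expectations` convention and that the integration forwards the expectation to the RAGAS reference field:

```python
import mlflow
from mlflow.genai.scorers.ragas import AnswerAccuracy

# Hypothetical inline dataset; the `expected_response` key is an assumption
data = [
    {
        "inputs": {"question": "What does MLflow Tracking record?"},
        "outputs": "MLflow Tracking records parameters, metrics, and artifacts.",
        "expectations": {"expected_response": "Parameters, metrics, and artifacts."},
    }
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[AnswerAccuracy(model="openai:/gpt-4")],
)
```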
### Agent and Tool Use Metrics
Evaluate AI agents and tool usage:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| TopicAdherence | Does the agent stay on topic during conversation? | Link |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? | Link |
| ToolCallF1 | F1 score for tool call prediction | Link |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal (with reference answer)? | Link |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal (without reference answer)? | Link |
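These judges read the agent's behavior (tool calls, conversation turns) from the trace; most also need reference information such as expected tool calls or allowed topics. A minimal sketch, assuming agent traces with the needed references attached as expectations, and that ToolCallAccuracy, a non-LLM metric in RAGAS, needs no judge model:

```python
import mlflow
from mlflow.genai.scorers.ragas import ToolCallAccuracy

# Hypothetical: traces previously logged by an MLflow-instrumented agent,
# with the expected tool calls recorded as expectations
agent_traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=agent_traces,
    scorers=[ToolCallAccuracy()],
)
```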
### Natural Language Comparison
Evaluate answer quality through natural language comparison:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| FactualCorrectness | Is the output factually correct compared to expected answer? | Link |
| SemanticSimilarity | Semantic similarity between output and expected answer | Link |
| NonLLMStringSimilarity | String similarity between output and expected answer | Link |
| BleuScore | BLEU score for text comparison | Link |
| ChrfScore | CHRF score for text comparison | Link |
| RougeScore | ROUGE score for text comparison | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does output exactly match expected output? | Link |
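Several of these metrics (NonLLMStringSimilarity, BleuScore, ChrfScore, RougeScore, StringPresence, ExactMatch) are deterministic and need no judge model; they only compare the output text against the expected answer. A minimal sketch, assuming the `inputs`/`outputs`/`expectations` dataset convention used above:

```python
import mlflow
from mlflow.genai.scorers.ragas import ExactMatch, RougeScore

# Hypothetical inline dataset; the `expected_response` key is an assumption
data = [
    {
        "inputs": {"question": "Which command installs the ragas package?"},
        "outputs": "pip install ragas",
        "expectations": {"expected_response": "pip install ragas"},
    }
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[ExactMatch(), RougeScore()],
)
```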
### General Purpose
Flexible evaluation metrics for various use cases:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| AspectCritic | Evaluates specific aspects of the output using an LLM | Link |
| DiscreteMetric | Custom discrete metric with flexible scoring logic | Link |
| RubricsScore | Scores output based on predefined rubrics | Link |
| InstanceSpecificRubrics | Scores output based on instance-specific rubrics | Link |
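AspectCritic and the rubric-based judges are configured with your own criteria, which are passed through to the RAGAS constructor (see Configuration below). A minimal sketch, assuming RAGAS's `name` and `definition` arguments for AspectCritic are forwarded unchanged:

```python
from mlflow.genai.scorers.ragas import AspectCritic

# Hypothetical aspect; `name` and `definition` are RAGAS constructor
# arguments assumed to be forwarded by the MLflow wrapper
conciseness = AspectCritic(
    model="openai:/gpt-4",
    name="conciseness",
    definition="Is the answer free of unnecessary repetition and filler?",
)

# `trace` is an MLflow trace, as in the Quick Start example
feedback = conciseness(trace=trace)
```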
### Other Tasks
Specialized metrics for specific tasks:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| SummarizationScore | Quality of text summarization | Link |
## Creating Judges by Name

You can also create RAGAS judges dynamically using `get_scorer`:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="Faithfulness",
    model="openai:/gpt-4",
)
feedback = scorer(trace=trace)
```
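Because the metric is selected by name, this works well when the judge suite is driven by configuration, for example a list of metric names read from a config file:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Hypothetical configuration-driven judge suite; names come from the tables above
metric_names = ["Faithfulness", "AnswerRelevancy", "ContextPrecision"]
scorers = [
    get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names
]
```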
## Configuration
RAGAS judges accept metric-specific parameters. Any additional keyword arguments are passed directly to the RAGAS metric constructor:
```python
from mlflow.genai.scorers.ragas import ExactMatch, Faithfulness

# LLM-based metric: requires a judge model
scorer = Faithfulness(model="openai:/gpt-4")

# Deterministic (non-LLM) metric: no model required
deterministic_scorer = ExactMatch()
```
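For example, RAGAS's FactualCorrectness metric accepts a `mode` argument; a minimal sketch, assuming the wrapper forwards it to the RAGAS constructor unchanged:

```python
from mlflow.genai.scorers.ragas import FactualCorrectness

# `mode` is a RAGAS constructor argument ("precision", "recall", or "f1"),
# assumed here to be passed straight through to the underlying metric
precision_scorer = FactualCorrectness(model="openai:/gpt-4", mode="precision")
```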
Refer to the RAGAS documentation for metric-specific parameters and advanced usage.