RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed for LLM applications. MLflow's RAGAS integration lets you use RAGAS metrics as MLflow judges to evaluate retrieval quality, answer generation, agent behavior, and other aspects of your application.

Prerequisites

RAGAS judges require the ragas package:

```bash
pip install ragas
```

Quick Start

You can call RAGAS judges directly:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="openai:/gpt-4")

# "trace" is an MLflow Trace object captured from your application
feedback = scorer(trace=trace)

print(feedback.value)  # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```

Or use them in mlflow.genai.evaluate:

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        ContextPrecision(model="openai:/gpt-4"),
    ],
)
```

Available RAGAS Judges

RAGAS judges are organized into categories based on their evaluation focus:

RAG (Retrieval-Augmented Generation) Metrics

Evaluate retrieval quality and answer generation in RAG systems:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| ContextUtilization | How effectively is the retrieved context being utilized in the answer? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision using reference answers | Link |
| ContextRecall | Does retrieval context contain all information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM version of context recall using reference answers | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| AnswerRelevancy | How relevant is the generated answer to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
| AnswerAccuracy | How accurate is the answer compared to ground truth? | Link |
| ContextRelevance | How relevant is the retrieved context to the input query? | Link |
| ResponseGroundedness | Is the response grounded in the provided context? | Link |
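
For example, several of the reference-free RAG judges above can run together in a single mlflow.genai.evaluate call. A minimal sketch, assuming your traces capture the user query, the retrieved context, and the generated answer (the model URI is illustrative):

```python
import mlflow
from mlflow.genai.scorers.ragas import AnswerRelevancy, ContextRelevance, Faithfulness

# Assumes the traces contain retrieval spans and final answers for the judges to inspect
traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        AnswerRelevancy(model="openai:/gpt-4"),
        ContextRelevance(model="openai:/gpt-4"),
    ],
)
```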

Agents or Tool Use Metrics

Evaluate AI agents and tool usage:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| TopicAdherence | Does the agent stay on topic during conversation? | Link |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? | Link |
| ToolCallF1 | F1 score for tool call prediction | Link |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal (with reference answer)? | Link |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal (without reference answer)? | Link |
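
Agent judges follow the same calling convention as the Quick Start example: construct the judge (passing a model for LLM-based metrics) and score a trace of the agent run. Below is a minimal sketch using one of the reference-free metrics above; which additional fields each underlying RAGAS metric needs (reference tool calls, goals, topics) varies per metric, so consult the linked RAGAS docs.

```python
from mlflow.genai.scorers.ragas import AgentGoalAccuracyWithoutReference

# "trace" is an MLflow Trace of an agent run, including its tool-call spans
goal_judge = AgentGoalAccuracyWithoutReference(model="openai:/gpt-4")

feedback = goal_judge(trace=trace)
print(feedback.value)
print(feedback.rationale)
```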

Natural Language Comparison

Evaluate answer quality through natural language comparison:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| FactualCorrectness | Is the output factually correct compared to expected answer? | Link |
| SemanticSimilarity | Semantic similarity between output and expected answer | Link |
| NonLLMStringSimilarity | String similarity between output and expected answer | Link |
| BleuScore | BLEU score for text comparison | Link |
| ChrfScore | CHRF score for text comparison | Link |
| RougeScore | ROUGE score for text comparison | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does output exactly match expected output? | Link |
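
These metrics compare the output against an expected answer, so they need ground truth in addition to the model output. Below is a hedged sketch using a static evaluation dataset instead of traces; the data keys shown (inputs, outputs, expectations) and the expectation field name are assumptions about the evaluation data format, so check the MLflow evaluation docs for the schema your version expects.

```python
import mlflow
from mlflow.genai.scorers.ragas import BleuScore, ExactMatch

# Hypothetical static dataset; the field names here are assumptions, not a guaranteed schema
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open source platform for managing the ML lifecycle.",
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[ExactMatch(), BleuScore()],
)
```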

General Purpose

Flexible evaluation metrics for various use cases:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| AspectCritic | Evaluates specific aspects of the output using LLM | Link |
| DiscreteMetric | Custom discrete metric with flexible scoring logic | Link |
| RubricsScore | Scores output based on predefined rubrics | Link |
| InstanceSpecificRubrics | Scores output based on instance-specific rubrics | Link |

Other Tasks

Specialized metrics for specific tasks:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| SummarizationScore | Quality of text summarization | Link |

Creating Judges by Name

You can also create RAGAS judges dynamically using get_scorer:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="Faithfulness",
    model="openai:/gpt-4",
)

feedback = scorer(trace=trace)
```
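
Because get_scorer takes the metric name as a string, you can assemble judges from configuration rather than hard-coded imports. A small sketch, assuming the metric names match the class names listed above:

```python
import mlflow
from mlflow.genai.scorers.ragas import get_scorer

# Build a list of judges from plain strings, then pass them to evaluate
metric_names = ["Faithfulness", "AnswerRelevancy", "ContextPrecision"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]

results = mlflow.genai.evaluate(
    data=mlflow.search_traces(),
    scorers=scorers,
)
```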

Configuration

RAGAS judges accept metric-specific parameters. Any additional keyword arguments are passed directly to the RAGAS metric constructor:

```python
from mlflow.genai.scorers.ragas import ExactMatch, Faithfulness

# LLM-based metric with model specification
scorer = Faithfulness(model="openai:/gpt-4")

# Non-LLM metric (no model required)
deterministic_scorer = ExactMatch()
```
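
For example, RAGAS's AspectCritic is configured through its constructor arguments, which the MLflow wrapper forwards unchanged. A minimal sketch, assuming the underlying AspectCritic constructor accepts name and definition as described in the RAGAS documentation:

```python
from mlflow.genai.scorers.ragas import AspectCritic

# "name" and "definition" are forwarded to the RAGAS AspectCritic constructor
conciseness = AspectCritic(
    model="openai:/gpt-4",
    name="conciseness",
    definition="Is the response concise and free of unnecessary detail?",
)
```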

Refer to the RAGAS documentation for metric-specific parameters and advanced usage.

Next Steps