# RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed for LLM applications. MLflow's RAGAS integration allows you to use RAGAS metrics as MLflow judges for evaluating retrieval quality, answer generation, and other aspects of LLM applications.
## Prerequisites

RAGAS judges require the `ragas` package:

```bash
pip install ragas
```
## Quick Start

You can call RAGAS judges directly on a trace:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="openai:/gpt-4")

# `trace` is an MLflow trace captured from your application (see the sketch below)
feedback = scorer(trace=trace)

print(feedback.value)      # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```
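If you do not already have a trace in hand, one way to produce and fetch one is to instrument your application with MLflow Tracing. The sketch below is illustrative only: `answer_question` is a hypothetical app, and the trace-retrieval calls (`mlflow.get_last_active_trace_id`, `mlflow.get_trace`) should be checked against your MLflow version.

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness

# Hypothetical RAG app; @mlflow.trace records each call as a trace,
# including any nested retrieval / LLM spans.
@mlflow.trace
def answer_question(question: str) -> str:
    docs = ["MLflow is an open source MLOps platform."]  # stand-in retrieval
    return f"According to the docs: {docs[0]}"

answer_question("What is MLflow?")

# Fetch the trace that was just logged and score it
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
feedback = Faithfulness(model="openai:/gpt-4")(trace=trace)
```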
Or use them in `mlflow.genai.evaluate`:

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        ContextPrecision(model="openai:/gpt-4"),
    ],
)
```
## Available RAGAS Judges
RAGAS judges are organized into categories based on their evaluation focus:
### RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| ContextUtilization | How effectively is the retrieved context being utilized in the answer? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision, compared against reference contexts | Link |
| ContextRecall | Does retrieval context contain all information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM version of context recall, compared against reference contexts | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| AnswerRelevancy | How relevant is the generated answer to the input query? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
| AnswerAccuracy | How accurate is the answer compared to ground truth? | Link |
| ContextRelevance | How relevant is the retrieved context to the input query? | Link |
| ResponseGroundedness | Is the response grounded in the provided context? | Link |
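Some of these metrics (for example ContextRecall, ContextEntityRecall, and AnswerAccuracy) compare against a ground-truth answer in addition to the trace contents. A minimal sketch, assuming the dataset follows `mlflow.genai.evaluate`'s `inputs`/`outputs`/`expectations` convention and that the integration forwards the expectation to the RAGAS reference field:

```python
import mlflow
from mlflow.genai.scorers.ragas import AnswerAccuracy

# Hypothetical inline dataset; the `expected_response` key is an assumption
data = [
    {
        "inputs": {"question": "What does MLflow Tracking record?"},
        "outputs": "MLflow Tracking records parameters, metrics, and artifacts.",
        "expectations": {"expected_response": "Parameters, metrics, and artifacts."},
    }
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[AnswerAccuracy(model="openai:/gpt-4")],
)
```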
### Agent and Tool Use Metrics
Evaluate AI agents and tool usage:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| TopicAdherence | Does the agent stay on topic during conversation? | Link |
| ToolCallAccuracy | Are the correct tools called with appropriate parameters? | Link |
| ToolCallF1 | F1 score for tool call prediction | Link |
| AgentGoalAccuracyWithReference | Does the agent achieve its goal (with reference answer)? | Link |
| AgentGoalAccuracyWithoutReference | Does the agent achieve its goal (without reference answer)? | Link |
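These judges read the agent's behavior (tool calls, conversation turns) from the trace; most also need reference information such as expected tool calls or allowed topics. A minimal sketch, assuming agent traces with the needed references attached as expectations, and that ToolCallAccuracy, a non-LLM metric in RAGAS, needs no judge model:

```python
import mlflow
from mlflow.genai.scorers.ragas import ToolCallAccuracy

# Hypothetical: traces previously logged by an MLflow-instrumented agent,
# with the expected tool calls recorded as expectations
agent_traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=agent_traces,
    scorers=[ToolCallAccuracy()],
)
```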
### Natural Language Comparison
Evaluate answer quality through natural language comparison:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| FactualCorrectness | Is the output factually correct compared to expected answer? | Link |
| SemanticSimilarity | Semantic similarity between output and expected answer | Link |
| NonLLMStringSimilarity | String similarity between output and expected answer | Link |
| BleuScore | BLEU score for text comparison | Link |
| ChrfScore | CHRF score for text comparison | Link |
| RougeScore | ROUGE score for text comparison | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does output exactly match expected output? | Link |
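Several of these metrics (NonLLMStringSimilarity, BleuScore, ChrfScore, RougeScore, StringPresence, ExactMatch) are deterministic and need no judge model; they only compare the output text against the expected answer. A minimal sketch, assuming the `inputs`/`outputs`/`expectations` dataset convention used above:

```python
import mlflow
from mlflow.genai.scorers.ragas import ExactMatch, RougeScore

# Hypothetical inline dataset; the `expected_response` key is an assumption
data = [
    {
        "inputs": {"question": "Which command installs the ragas package?"},
        "outputs": "pip install ragas",
        "expectations": {"expected_response": "pip install ragas"},
    }
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[ExactMatch(), RougeScore()],
)
```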
### General Purpose
Flexible evaluation metrics for various use cases:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| AspectCritic | Evaluates specific aspects of the output using an LLM | Link |
| DiscreteMetric | Custom discrete metric with flexible scoring logic | Link |
| RubricsScore | Scores output based on predefined rubrics | Link |
| InstanceSpecificRubrics | Scores output based on instance-specific rubrics | Link |
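AspectCritic and the rubric-based judges are configured with your own criteria, which are passed through to the RAGAS constructor (see Configuration below). A minimal sketch, assuming RAGAS's `name` and `definition` arguments for AspectCritic are forwarded unchanged:

```python
from mlflow.genai.scorers.ragas import AspectCritic

# Hypothetical aspect; `name` and `definition` are RAGAS constructor
# arguments assumed to be forwarded by the MLflow wrapper
conciseness = AspectCritic(
    model="openai:/gpt-4",
    name="conciseness",
    definition="Is the answer free of unnecessary repetition and filler?",
)

# `trace` is an MLflow trace, as in the Quick Start example
feedback = conciseness(trace=trace)
```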
### Other Tasks
Specialized metrics for specific tasks:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| SummarizationScore | Quality of text summarization | Link |
## Creating Judges by Name

You can also create RAGAS judges dynamically using `get_scorer`:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="Faithfulness",
    model="openai:/gpt-4",
)
feedback = scorer(trace=trace)
```
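Because the metric is selected by name, this works well when the judge suite is driven by configuration, for example a list of metric names read from a config file:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Hypothetical configuration-driven judge suite; names come from the tables above
metric_names = ["Faithfulness", "AnswerRelevancy", "ContextPrecision"]
scorers = [
    get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names
]
```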
## Configuration
RAGAS judges accept metric-specific parameters. Any additional keyword arguments are passed directly to the RAGAS metric constructor:
```python
from mlflow.genai.scorers.ragas import ExactMatch, Faithfulness

# LLM-based metric: requires a judge model
scorer = Faithfulness(model="openai:/gpt-4")

# Deterministic (non-LLM) metric: no model required
deterministic_scorer = ExactMatch()
```
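For example, RAGAS's FactualCorrectness metric accepts a `mode` argument; a minimal sketch, assuming the wrapper forwards it to the RAGAS constructor unchanged:

```python
from mlflow.genai.scorers.ragas import FactualCorrectness

# `mode` is a RAGAS constructor argument ("precision", "recall", or "f1"),
# assumed here to be passed straight through to the underlying metric
precision_scorer = FactualCorrectness(model="openai:/gpt-4", mode="precision")
```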
Refer to the RAGAS documentation for metric-specific parameters and advanced usage.