5 posts tagged with "mlops"

· 16 min read
Yuki Watanabe


Augmenting LLMs with various data sources is a strong strategy to build LLM applications. However, as the system grows more complex, it becomes challenging to prototype and iteratively build improvements to these more complex systems.

LlamaIndex Workflow is a great framework to build such compound systems. Combined with MLflow, the Workflow API brings efficiency and robustness in the development cycle, enabling easy debugging, experiment tracking, and evaluation for continuous improvement.

In this blog, we will go through the journey of building a sophisticated chatbot with LlamaIndex's Workflow API and MLflow.

What is LlamaIndex Workflow?

LlamaIndex Workflow is an event-driven orchestration framework for designing dynamic AI applications. The core of LlamaIndex Workflow consists of:

  • Steps are units of execution, representing distinct actions in the workflow.

  • Events trigger these steps, acting as signals that control the workflow’s flow.

  • Workflow connects these two as a Python class. Each step is implemented as a method of the workflow class, defined with input and output events.

This simple yet powerful abstraction allows you to break down complex tasks into manageable steps, enabling greater flexibility and scalability. As a framework embodying event-driven design, the Workflow APIs make it intuitive to design parallel and asynchronous execution flows, significantly enhancing the efficiency of long-running tasks and providing production-ready scalability.
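To make these pieces concrete, here is a minimal sketch of a two-step workflow (illustrative only; it assumes the llama_index.core.workflow API described above and is not part of the blog's sample repository):

from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step

class GreetingEvent(Event):
    # A custom event is just a Pydantic model carrying data between steps
    name: str

class HelloWorkflow(Workflow):
    @step
    async def make_greeting(self, ev: StartEvent) -> GreetingEvent:
        # StartEvent exposes the keyword arguments passed to workflow.run()
        return GreetingEvent(name=ev.name)

    @step
    async def finish(self, ev: GreetingEvent) -> StopEvent:
        # Returning StopEvent ends the workflow and sets its result
        return StopEvent(result=f"Hello, {ev.name}!")

# result = await HelloWorkflow(timeout=10).run(name="MLflow")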

Why Use MLflow with LlamaIndex Workflow?

Workflow provides great flexibility to design nearly arbitrary execution flows. However, with this great power comes a great responsibility. Without managing your changes properly, it can become a chaotic mess of indeterminate states and confusing configurations. After a few dozen changes, you may be asking yourself, "how did my workflow even work?".

MLflow brings a powerful MLOps harness to LlamaIndex Workflows throughout the end-to-end development cycle.

  • Experiment Tracking: MLflow allows you to record various components like steps, prompts, LLMs, and tools, making it easy to improve the system iteratively.

  • Reproducibility: MLflow packages environment information such as global configurations (Settings), library versions, and metadata to ensure consistent deployment across different stages of the ML lifecycle.

  • Tracing: Debugging issues in a complex event-driven workflow is cumbersome. MLflow Tracing is a production-ready observability solution that natively integrates with LlamaIndex, giving you observability into each internal stage within your Workflow.

  • Evaluation: Measuring is a crucial task for improving your model. MLflow Evaluation is a great tool to evaluate the quality, speed, and cost of your LLM application. It is tightly integrated with MLflow's experiment tracking capabilities, streamlining the process of making iterative improvements.

Let's Build!🛠️

Strategy: Hybrid Approach Using Multiple Retrieval Methods

Retrieval-Augmented Generation (RAG) is a powerful framework, but the retrieval step can often become a bottleneck, because embedding-based retrieval may not always capture the most relevant context. While many techniques exist to improve retrieval quality, no single solution works universally. Therefore, an effective strategy is to combine multiple retrieval approaches.

The concept we will explore here is to run several retrieval methods in parallel: (1) standard vector search, (2) keyword-based search (BM25), and (3) web search. The retrieved contexts are then merged, with irrelevant data filtered out to enhance the overall quality.

Hybrid RAG Concept

How do we bring this concept to life? Let’s dive in and build this hybrid RAG using LlamaIndex Workflow and MLflow.

1. Set Up Repository

The sample code, including the environment setup script, is available in the GitHub repository. It contains a complete workflow definition, a hands-on notebook, and a sample dataset for running experiments. To clone it to your working environment, use the following command:

git clone https://github.com/mlflow/mlflow.git

After cloning the repository, set up the virtual environment by running:

cd mlflow/examples/llama_index/workflow
chmod +x install.sh
./install.sh

Once the installation is complete, start Jupyter Notebook within the Poetry environment using:

poetry run jupyter notebook

Next, open the Tutorial.ipynb notebook located in the root directory. Throughout this blog, we will walk through this notebook to guide you through the development process.

2. Start an MLflow Experiment

An MLflow Experiment is where you track all aspects of model development, including model definitions, configurations, parameters, dependency versions, and more. Let’s start by creating a new MLflow experiment called "LlamaIndex Workflow RAG":

import mlflow

mlflow.set_experiment("LlamaIndex Workflow RAG")

At this point, the experiment doesn't have any recorded data yet. To view the experiment in the MLflow UI, open a new terminal and run the mlflow ui command, then navigate to the provided URL in your browser:

poetry run mlflow ui

Empty MLflow Experiment

3. Choose your LLM and Embeddings

Now, set up your preferred LLM and embeddings models to LlamaIndex's Settings object. These models will be used throughout the LlamaIndex components.

For this demonstration, we’ll use OpenAI models, but you can easily switch to different LLM providers or local models by following the instructions in the notebook.

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key")

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# LlamaIndex by default uses OpenAI APIs for LLMs and embeddings models. You can use the default
# models (`gpt-3.5-turbo` and `text-embedding-ada-002` as of Oct 2024), but we recommend using the
# latest efficient models to get better results at lower cost.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
Settings.llm = OpenAI(model="gpt-4o-mini")

💡 MLflow will automatically log the Settings configuration into your MLflow Experiment when logging models, ensuring reproducibility and reducing the risk of discrepancies between environments.

4. Set Up Web Search API

Later in this blog, we will add a web search capability to the QA bot. We will use Tavily AI, a search API optimized for LLM applications and natively integrated with LlamaIndex. Visit their website to get an API key for free-tier use, or use a different search engine integrated with LlamaIndex, e.g., GoogleSearchToolSpec.

Once you have the API key, set it as an environment variable:

os.environ["TAVILY_AI_API_KEY"] = getpass.getpass("Enter Tavily AI API Key")

5. Set Up Document Indices for Retrieval

The next step is to build a document index for retrieval from MLflow documentation. The urls.txt file in the data directory contains a list of MLflow documentation pages. These pages can be loaded as document objects using the web page reader utility.

from llama_index.readers.web import SimpleWebPageReader

with open("data/urls.txt", "r") as file:
urls = [line.strip() for line in file if line.strip()]

documents = SimpleWebPageReader(html_to_text=True).load_data(urls)

Next, ingest these documents into a vector database. In this tutorial, we’ll use the Qdrant vector store, which is free if self-hosted. If Docker is installed on your machine, you can start the Qdrant database by running the official Docker container:

$ docker pull qdrant/qdrant
$ docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/.qdrant_storage:/qdrant/storage:z \
qdrant/qdrant

Once the container is running, you can create an index object that connects to the Qdrant database:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="mlflow_doc")

from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
)

Of course, you can use your preferred vector store here. LlamaIndex supports a variety of vector databases, such as FAISS, Chroma, and Databricks Vector Search. If you choose an alternative, follow the relevant LlamaIndex documentation and update the workflow/workflow.py file accordingly.
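For example, swapping Qdrant for Chroma might look roughly like this (a hedged sketch; it assumes the llama-index-vector-stores-chroma and chromadb packages and is not part of the sample repository):

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persist the collection locally instead of connecting to a Qdrant server
chroma_client = chromadb.PersistentClient(path="./.chroma_storage")
collection = chroma_client.get_or_create_collection("mlflow_doc")
vector_store = ChromaVectorStore(chroma_collection=collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)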

In addition to evaluating the vector search retrieval, we will assess the keyword-based retriever (BM25) later. Let's set up local document storage to enable BM25 retrieval in the workflow.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)
bm25_retriever.persist(".bm25_retriever")

6. Define a Workflow

Now that the environment and data sources are ready, we can build the workflow and experiment with it. The complete workflow code is defined in the workflow directory. Let's explore some key components of the implementation.

Events

The workflow/events.py file defines all the events used within the workflow. These are simple Pydantic models that carry information between workflow steps. For example, the VectorSearchRetrieveEvent triggers the vector search step by passing the user's query.

class VectorSearchRetrieveEvent(Event):
    """Event for triggering VectorStore index retrieval step."""
    query: str

Prompts

Throughout the workflow execution, we call LLMs multiple times. The prompt templates for these LLM calls are defined in the workflow/prompts.py file.
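For instance, the final answer-generation template might look roughly like the following (a hypothetical sketch of the contents of workflow/prompts.py; the actual wording in the repository may differ):

# Hypothetical example of a prompt template used by the final query step
FINAL_QUERY_TEMPLATE = """\
Answer the question below using only the provided context.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {query}
Answer:"""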

Workflow Class

The main workflow class is defined in workflow/workflow.py. Let's break down how it works.

The constructor accepts a retrievers argument, which specifies the retrieval methods to be used in the workflow. For instance, if ["vector_search", "bm25"] is passed, the workflow performs vector search and keyword-based search, skipping web search.

💡 Deciding which retrievers to utilize dynamically allows us to experiment with different retrieval strategies without needing to replicate nearly identical model code.

class HybridRAGWorkflow(Workflow):

    VALID_RETRIEVERS = {"vector_search", "bm25", "web_search"}

    def __init__(self, retrievers=None, **kwargs):
        super().__init__(**kwargs)
        self.llm = Settings.llm
        self.retrievers = retrievers or []

        if invalid_retrievers := set(self.retrievers) - self.VALID_RETRIEVERS:
            raise ValueError(f"Invalid retrievers specified: {invalid_retrievers}")

        self._use_vs_retriever = "vector_search" in self.retrievers
        self._use_bm25_retriever = "bm25" in self.retrievers
        self._use_web_search = "web_search" in self.retrievers

        if self._use_vs_retriever:
            qd_client = qdrant_client.QdrantClient(host=_QDRANT_HOST, port=_QDRANT_PORT)
            vector_store = QdrantVectorStore(client=qd_client, collection_name=_QDRANT_COLLECTION_NAME)
            index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
            self.vs_retriever = index.as_retriever()

        if self._use_bm25_retriever:
            self.bm25_retriever = BM25Retriever.from_persist_dir(_BM25_PERSIST_DIR)

        if self._use_web_search:
            self.tavily_tool = TavilyToolSpec(api_key=os.environ.get("TAVILY_AI_API_KEY"))
The workflow begins by executing a step that takes the StartEvent as input, which is the route_retrieval step in this case. This step inspects the retrievers parameter and triggers the necessary retrieval steps. By using the send_event() method of the context object, multiple events can be dispatched in parallel from this single step.

    # If no retriever is specified, proceed directly to the final query step with an empty context
    if len(self.retrievers) == 0:
        return QueryEvent(context="")

    # Trigger the retrieval steps based on the configuration
    if self._use_vs_retriever:
        ctx.send_event(VectorSearchRetrieveEvent(query=query))
    if self._use_bm25_retriever:
        ctx.send_event(BM25RetrieveEvent(query=query))
    if self._use_web_search:
        ctx.send_event(TransformQueryEvent(query=query))

The retrieval steps are straightforward. However, the web search step is more advanced as it includes an additional step to transform the user's question into a search-friendly query using an LLM.
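That query-transformation step might look roughly like this (a hedged sketch; TRANSFORM_QUERY_TEMPLATE and QueryTransformedEvent are illustrative names rather than the exact ones used in workflow/workflow.py):

    @step
    async def transform_query(self, ctx: Context, ev: TransformQueryEvent) -> QueryTransformedEvent:
        """Rewrite the user's question into a search-friendly query."""
        prompt = TRANSFORM_QUERY_TEMPLATE.format(query=ev.query)
        search_query = self.llm.complete(prompt).text.strip()
        return QueryTransformedEvent(query=search_query)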

The results from all the retrieval steps are aggregated in the gather_retrieval_results step. Here, the ctx.collect_events() method is used to poll for the results of the asynchronously executed steps.

    results = ctx.collect_events(ev, [RetrievalResultEvent] * len(self.retrievers))

Passing all results from multiple retrievers often leads to a large context with unrelated or duplicate content. To address this, we need to filter and select the most relevant results. While a score-based approach is common, web search results do not return similarity scores. Therefore, we use an LLM to sort and filter out irrelevant results. The rerank step achieves this by leveraging the built-in reranker integration with RankGPT.

    reranker = RankGPTRerank(llm=self.llm, top_n=5)
    reranked_nodes = reranker.postprocess_nodes(ev.nodes, query_str=query)
    reranked_context = "\n".join(node.text for node in reranked_nodes)

Finally, the reranked context is passed to the LLM along with the user query to generate the final answer. The result is returned as a StopEvent with the result key.

    @step
    async def query_result(self, ctx: Context, ev: QueryEvent) -> StopEvent:
        """Get result with relevant text."""
        query = await ctx.get("query")

        prompt = FINAL_QUERY_TEMPLATE.format(context=ev.context, query=query)
        response = self.llm.complete(prompt).text
        return StopEvent(result=response)

Now, let's instantiate the workflow and run it.

# Workflow with VS + BM25 retrieval
from workflow.workflow import HybridRAGWorkflow

workflow = HybridRAGWorkflow(retrievers=["vector_search", "bm25"], timeout=60)
response = await workflow.run(query="Why use MLflow with LlamaIndex?")
print(response)

7. Log the Workflow in an MLflow Experiment

Now we want to run the workflow with various different retrieval strategies and evaluate the performance of each. However, before running the evaluation, we'll log the model in MLflow to track both the model and its performance within an MLflow Experiment.

For the LlamaIndex Workflow, we use the new Model-from-code method, which logs models as standalone Python scripts. This approach avoids the risks and instability associated with serialization methods like pickle, relying instead on code as the single source of truth for the model definition. When combined with MLflow's environment-freezing capability, it provides a reliable way to persist models. For more details, refer to the MLflow documentation.

💡 In the workflow directory, there's a model.py script that imports the HybridRAGWorkflow and instantiates it with dynamic configurations passed via the model_config parameter during logging. This design allows you to track models with different configurations without duplicating the model definition.
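A plausible shape for that script is sketched below (illustrative only; it uses mlflow.models.ModelConfig to read the retrievers passed via model_config, and the actual model.py in the repository may differ):

# workflow/model.py (hypothetical sketch)
import mlflow
from mlflow.models import ModelConfig

from workflow.workflow import HybridRAGWorkflow

# Read the retriever list supplied via the model_config argument at logging time
config = ModelConfig()
retrievers = config.get("retrievers") or []

# Tell MLflow which object in this script is the model
mlflow.models.set_model(HybridRAGWorkflow(retrievers=retrievers, timeout=60))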

We'll start an MLflow Run and log the model script model.py with different configurations using the mlflow.llama_index.log_model() API.

# Different configurations we will evaluate. We don't run evaluation for every
# permutation for demonstration purposes, but you can add as many patterns as you want.
run_name_to_retrievers = {
    # 1. No retrievers (prior knowledge in LLM).
    "none": [],
    # 2. Vector search retrieval only.
    "vs": ["vector_search"],
    # 3. Vector search and keyword search (BM25)
    "vs + bm25": ["vector_search", "bm25"],
    # 4. All retrieval methods including web search.
    "vs + bm25 + web": ["vector_search", "bm25", "web_search"],
}

# Create an MLflow Run and log model with each configuration.
models = []
for run_name, retrievers in run_name_to_retrievers.items():
    with mlflow.start_run(run_name=run_name):
        model_info = mlflow.llama_index.log_model(
            # Specify the model Python script.
            llama_index_model="workflow/model.py",
            # Specify retrievers to use.
            model_config={"retrievers": retrievers},
            # Define dependency files to save along with the model
            code_paths=["workflow"],
            # Subdirectory to save artifacts (not important)
            artifact_path="model",
        )
        models.append(model_info)

Now open the MLflow UI again; this time it should show four MLflow Runs recorded with different retrievers parameter values. By clicking each Run name and navigating to the "Artifacts" tab, you can see that MLflow records the model and various metadata, such as dependency versions and settings.

MLflow Runs

8. Enable MLflow Tracing

Before running the evaluation, there's one final step: enabling MLflow Tracing. We'll dive into this feature and why it matters later, but for now, you can enable it with a simple one-line command. MLflow will automatically trace every LlamaIndex execution.

mlflow.llama_index.autolog()

9. Evaluate the Workflow with Different Retriever Strategies

The example repository includes a sample evaluation dataset, mlflow_qa_dataset.csv, containing 30 question-answer pairs related to MLflow.

import pandas as pd

eval_df = pd.read_csv("data/mlflow_qa_dataset.csv")
display(eval_df.head(3))

To evaluate the workflow, use the mlflow.evaluate() API, which requires (1) your dataset, (2) the logged model, and (3) the metrics you want to compute.

from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness


for model_info in models:
    with mlflow.start_run(run_id=model_info.run_id):
        result = mlflow.evaluate(
            # Pass the URI of the logged model above
            model=model_info.model_uri,
            data=eval_df,
            # Specify the column for ground truth answers.
            targets="ground_truth",
            # Define the metrics to compute.
            extra_metrics=[
                latency(),
                answer_correctness("openai:/gpt-4o-mini"),
            ],
            # The answer_correctness metric requires "inputs" column to be
            # present in the dataset. We have "query" instead so need to
            # specify the mapping in `evaluator_config` parameter.
            evaluator_config={"col_mapping": {"inputs": "query"}},
        )

In this example, we evaluate the model with two metrics:

  1. Latency: Measures the time taken to execute a workflow for a single query.
  2. Answer Correctness: Evaluates the accuracy of answers against the ground truth, scored by the OpenAI GPT-4o-mini model on a 1–5 scale.

These metrics are just for demonstration purposes—you can add additional metrics like toxicity or faithfulness, or even create your own. See the MLflow documentation for the full set of built-in metrics and how to define custom metrics.
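For example, a simple custom metric can be defined with mlflow.metrics.make_metric and passed through extra_metrics alongside the built-in ones (a hedged sketch; the answer_length metric below is purely illustrative):

from mlflow.metrics import MetricValue, make_metric

def _answer_length(predictions, targets=None, metrics=None):
    # Score each answer by its word count (illustrative only)
    scores = [len(str(p).split()) for p in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )

answer_length = make_metric(
    eval_fn=_answer_length,
    greater_is_better=False,
    name="answer_length",
)

# Then include it in the evaluation call, e.g.
# extra_metrics=[latency(), answer_correctness("openai:/gpt-4o-mini"), answer_length]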

The evaluation process will take a few minutes. Once completed, you can view the results in the MLflow UI. Open the Experiment page and click on the chart icon 📈 above the Run list.

Evaluation Result

💡 The evaluation results may differ depending on the model setup and inherent randomness.

The first row shows bar charts for the answer correctness metrics, while the second row displays latency results. The best-performing combination is "Vector Search + BM25". Interestingly, adding web search not only increases latency significantly but also decreases answer correctness.

Why does this happen? It appears some answers from the web-search-enabled model are off-topic. For example, in response to a question about starting the Model Registry, the web-search model provides an unrelated answer about model deployment, while the "vs + bm25" model offers a correct response.

Answer Comparison

Where did this incorrect answer come from? This seems to be a retriever issue, as we only changed the retrieval strategy. However, it's difficult to see what each retriever returned from the final result. To gain deeper insights into what's happening behind the scenes, MLflow Tracing is the perfect solution.

10. Inspecting Quality Issues with MLflow Trace

MLflow Tracing is a new feature that brings observability to LLM applications. It integrates seamlessly with LlamaIndex, recording all inputs, outputs, and metadata about intermediate steps during workflow execution. Since we called mlflow.llama_index.autolog() at the start, every LlamaIndex operation has been traced and recorded in the MLflow Experiment.

To inspect the trace for a specific question from the evaluation, navigate to the "Traces" tab on the experiment page. Look for the row with the particular question in the request column and the run name "vs + bm25 + web." Clicking the request ID link opens the Trace UI, where you can view detailed information about each step in the execution, including inputs, outputs, metadata, and latency.

Trace

In this case, we identified the issue by examining the reranker step. The web search retriever returned irrelevant context related to model serving, and the reranker incorrectly ranked it as the most relevant. With this insight, we can determine potential improvements, such as refining the reranker to better understand MLflow topics, improving web search precision, or even removing the web search retriever altogether.

Conclusion

In this blog, we explored how the combination of LlamaIndex and MLflow can elevate the development of Retrieval-Augmented Generation (RAG) workflows, bringing together powerful model management and observability capabilities. By integrating multiple retrieval strategies (such as vector search, BM25, and web search) we demonstrated how flexible retrieval can enhance the performance of LLM-driven applications.

  • Experiment Tracking allowed us to organize and log different workflow configurations, ensuring reproducibility and enabling us to track model performance across multiple runs.
  • MLflow Evaluate enabled us to easily log and evaluate the workflow with different retriever strategies, using key metrics like latency and answer correctness to compare performance.
  • MLflow UI gave us a clear visualization of how various retrieval strategies impacted both accuracy and latency, helping us identify the most effective configurations.
  • MLflow Tracing, integrated with LlamaIndex, provided detailed observability into each step of the workflow for diagnosing quality issues, such as incorrect reranking of search results.

With these tools, you have a complete framework for building, logging, and optimizing RAG workflows. As LLM technology continues to evolve, the ability to track, evaluate, and fine-tune every aspect of model performance will be essential. We highly encourage you to experiment further and see how these tools can be tailored to your own applications.

To continue learning, explore the following resources:

· 12 min read
Awadelrahman M. A. Ahmed

We all (well, most of us) remember November 2022 when the public release of ChatGPT by OpenAI marked a significant turning point in the world of AI. While generative artificial intelligence (GenAI) had been evolving for some time, ChatGPT, built on OpenAI's GPT-3.5 architecture, quickly captured the public’s imagination. This led to an explosion of interest in GenAI, both within the tech industry and among the general public.

On the tools side, MLflow continues to solidify its position as the favorite tool for machine learning operations (MLOps) among the ML community. However, the rise of GenAI has introduced new needs in how we use MLflow. One of these new challenges is how we log models in MLflow. If you’ve used MLflow before (and I bet you have), you’re probably familiar with the mlflow.log_model() function and how it efficiently pickles model artifacts.

Particularly with GenAI, there’s a new requirement: logging models "from code" instead of serializing them into a pickle file! And guess what? This need isn’t limited to GenAI models! So, in this post I will explore this concept and how MLflow has adapted to meet this new requirement.

You will notice that this feature is implemented at a very abstract level, allowing you to log any model "as code", whether it’s GenAI or not! I like to think of it as a generic approach, with GenAI models being just one of its use cases. So, in this post, I’ll explore this new feature, "Models from Code logging".

By the end of this post, you should be able to answer the three main questions: 'What,' 'Why,' and 'How' to use Models from Code logging.

What Is Models from Code Logging?

In fact, when MLflow announced this feature, it got me thinking in a more abstract way about the concept of a "model"! You might find it interesting as well, if you zoom out and consider a model as a mathematical representation or function that describes the relationship between input and output variables. At this level of abstraction, a model can be many things!

One might even recognize that a model, as an object or artifact, represents just one form of what a model can be, even if it’s the most popular in the ML community. If you think about it, a model can also be as simple as a piece of code for a mapping function or a code that sends API requests to external services such as OpenAI's APIs.

I'll explain the detailed workflow of how to log models from code later in the post, but for now, let's consider it at a high level with two main steps: first, writing your model code, and second, logging your model from code. This will look like the following figure:

High Level Models from Code Logging Workflow:

High Level Models-from-Code Logging Workflow

🔴 It's important to note that when we refer to "model code," we're talking about code that can be treated as a model itself. This means it's not your training code that generates a trained model object, but rather the step-by-step code that is executed as a model itself.

How Does Models from Code Logging Differ from Object-Based Logging?

In the previous section, we discussed the concept of Models from Code logging. However, concepts often become clearer when contrasted with their alternatives; a technique known as contrast learning. In our case, the alternative is Object-Based logging, which is the commonly used approach for logging models in MLflow.

Object-Based logging treats a trained model as an object that can be stored and reused. After training, the model is saved as an object and can be easily loaded for deployment. For example, this process can be initiated by calling mlflow.log_model(), where MLflow handles the serialization, often using Pickle or similar methods.

Object-Based logging can be broken down into three high-level steps as in the following figure: first, creating the model object (whether by training it or acquiring it), second, serializing it (usually with Pickle or a similar tool), and third, logging it as an object.

High Level Object-Based Logging Workflow:

High Level Object-Based Logging Workflow

💡The main distinction between the popular Object-Based logging and Models from Code logging is that in the former, we log the model object itself, whether it's a model you've trained or a pre-trained model you've acquired. In the latter, however, we log the code that represents your model.
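To make the contrast concrete, here is a minimal sketch of the Object-Based path (assuming scikit-learn is installed; the toy model and data are illustrative):

import mlflow
from sklearn.linear_model import LinearRegression

# Step 1: create the model object by training it
X, y = [[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0]
model = LinearRegression().fit(X, y)

# Steps 2 and 3: MLflow serializes the fitted object and logs it as an artifact
with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="linear_model")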

When Do You Need Models from Code Logging?

By now, I hope you have a clear understanding of what Models from Code logging is! You might still be wondering, though, about the specific use cases where this feature can be applied. This section will cover exactly that—the why!

While we mentioned GenAI as a motivational use case in the introduction, we also highlighted that MLflow has approached Models from Code logging in a more generic way and we will see that in the next section. This means you can leverage the generalizability of the Models from Code feature for a wide range of scenarios. I’ve identified three key usage patterns that I believe are particularly relevant:

1️⃣ When Your Model Relies on External Services:

This is one of the obvious and common use cases, especially with the rise of modern AI applications. It’s becoming increasingly clear that we are shifting from building AI at the "model" granularity to the "system" granularity.

In other words, AI is no longer just about individual models; it’s about how those models interact within a broader ecosystem. As we become more dependent on external AI services and APIs, the need for Models from Code logging becomes more pronounced.

For instance, frameworks like LangChain allow developers to build applications that chain together various AI models and services to perform complex tasks, such as language understanding and information retrieval. In such scenarios, the "model" is not just a set of trained parameters that can be pickled but a "system" of interconnected services, often orchestrated by code that makes API calls to external platforms.

Models from Code logging in these situations ensures that the entire workflow, including the logic and dependencies, is preserved. It offers the ability to maintain the same model-like experience by capturing the code, making it possible to faithfully recreate the model’s behavior even when the actual computational work is performed outside your domain.

2️⃣ When You’re Combining Multiple Models to Calculate a Complex Metric:

Apart from GenAI, you can still benefit from the Models from Code feature in various other domains. There are many situations where multiple specialized models are combined to produce a comprehensive output. Note that we are not just referring to traditional ensemble modeling (predicting the same variable); often, you need to combine multiple models to predict different components of a complex inferential task.

One concrete example could be Customer Lifetime Value (CLV) in customer analytics. In the context of CLV, you might have separate models for:

  • Customer Retention: Forecasting how long a customer will continue to engage with the business.
  • Purchase Frequency: Predicting how often a customer will make a purchase.
  • Average Order Value: Estimating the typical value of each transaction.

Each of these models might already be logged and tracked properly using MLflow. Now, you need to "combine" these models into a single "system" that calculates CLV. We refer to it as a "system" because it contains multiple components.

The beauty of MLflow's Models from Code logging is that it allows you to treat this "CLV system" as a "CLV model". It enables you to leverage MLflow's capabilities, maintaining the MLflow-like model structure with all the advantages of tracking, versioning, and deploying your CLV model as a cohesive unit, even though it's built on top of other models. While such a complex model system could be built using a custom MLflow PythonModel, the Models from Code feature dramatically simplifies the serialization process, reducing the friction of building your solution.
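A hedged sketch of what such a CLV model-from-code script could look like is shown below (the model URIs and the CLV formula are illustrative assumptions, not taken from a real project):

import mlflow

class CLVModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Load the previously logged component models (illustrative URIs)
        self.retention = mlflow.pyfunc.load_model("models:/customer_retention/1")
        self.frequency = mlflow.pyfunc.load_model("models:/purchase_frequency/1")
        self.order_value = mlflow.pyfunc.load_model("models:/average_order_value/1")

    def predict(self, context, model_input, params=None):
        # CLV is approximated as lifetime * purchase frequency * average order value
        return (
            self.retention.predict(model_input)
            * self.frequency.predict(model_input)
            * self.order_value.predict(model_input)
        )

mlflow.models.set_model(CLVModel())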

3️⃣ When You Don’t Have Serialization at All:

Despite the rise of deep learning, industries still rely on rule-based algorithms that don’t produce serialized models. In these cases, Models from Code logging can be beneficial for integrating these processes into the MLflow ecosystem.

One example is in industrial quality control, where the Canny edge detection algorithm is often used to identify defects. This rule-based algorithm doesn’t involve serialization but is defined by specific steps.

Another example, which is gaining attention nowadays, is Causal AI. Constraint-based causal discovery algorithms, like the PC (Peter-Clark) algorithm, discover causal relationships in data but are implemented as code rather than as model objects.

In either case, with the Models from Code feature, you can log the entire process as a "model" in MLflow, preserving the logic and parameters while benefiting from MLflow’s tracking and versioning features.

How To Implement Models from Code Logging?

I hope that by this point, you have a clear understanding of the "What" and "Why" of Models from Code, and now you might be eager to get hands-on and focus on the How!

In this section, I'll provide a generic workflow for implementing MLflow's Models from Code logging, followed by a basic yet broadly applicable example. I hope the workflow provides a broad understanding that allows you to address a wide range of scenarios. I will also include links at the end to resources that cover more specific use cases (e.g., AI models).

Models from Code Workflow:

A key "ingredient" of the implementation is MLflow's component pyfunc. If you're not familiar with it, think of pyfunc as a universal interface in MLflow that lets you turn any model, from any framework, into an MLflow model by defining a custom Python function. You can also refer to this earlier post if you wish to gain a deeper understanding.

For our Models from Code logging, we’ll particularly use the PythonModel class within pyfunc. This class in the MLflow Python client library allows us to create and manage Python functions as MLflow models. It enables us to define a custom function that processes input data and returns predictions or results. This model can then be deployed, tracked, and shared using MLflow's features.

It seems to be exactly what we're looking for—we have some code that serves as our model, and we want to log it! That's why you'll soon see mlflow.pyfunc.PythonModel in our code example!

Now, each time we need to implement Models from Code, we create two separate Python files:

  1. The first contains our model code (let's call it model_code.py). This file contains a class that inherits from the mlflow.pyfunc.PythonModel class. The class we're defining contains our model logic. It could be our calls to OpenAI APIs, our CLV (Customer Lifetime Value) model, or our causal discovery code. We'll see a very simple 101 example soon.

    📌 But wait! IMPORTANT:

    • Our model_code.py script needs to call (i.e., include) mlflow.models.set_model() to set the model, which is crucial for loading the model back using load_model() for inference. You will notice this in the example.
  2. The second file logs our class (that we defined in model_code.py). Think of it as the driver code; it can be either a notebook or a Python script (let's call it driver.py). In this file, we'll include the code that is responsible for logging our model code (essentially, providing the path to model_code.py).

Then we can deploy our model. Later, when the serving environment is loaded, model_code.py is executed, and when a serving request comes in, PyFuncClass.predict() is called.

This figure gives a generic template of these two files.

Models from Code files

A 101 Example of Models from Code Logging:

Let’s consider a straightforward example: a simple function to calculate the area of a circle based on its radius. With Models from Code, we can log this calculation as a model! I like to think of it as framing the calculation as a prediction problem, allowing us to write our model code with a predict method.

1. Our model_code.py file:

import mlflow
import math

class CircleAreaModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params=None):
        return [math.pi * (r ** 2) for r in model_input]

# It's important to call set_model() so it can be loaded for inference
# Also, note that it is set to an instance of the class, not the class itself.
mlflow.models.set_model(model=CircleAreaModel())

2. Our driver.py file:

This can be defined within a notebook as well. Here are its essential contents:

import mlflow

code_path = "model_code.py" # make sure that you put the correct path

with mlflow.start_run():
    logged_model_info = mlflow.pyfunc.log_model(
        python_model=code_path,
        artifact_path="test_code_logging"
    )

# We can print some info about the logged model
print(f"MLflow Run: {logged_model_info.run_id}")
print(f"Model URI: {logged_model_info.model_uri}")

How this looks in MLflow:

Executing driver.py will start an MLflow run and log our model as code. The files can be seen as demonstrated below:

Models from Code files

Conclusion and Further Learning

I hope that by this point, I have fulfilled the promises I made earlier! You should now have a clearer understanding of What Models from Code is and how it differs from the popular Object-Based approach which logs models as serialized objects. You should also have a solid foundation of Why and when to use it, as well as an understanding of How to implement it through our general example.

As we mentioned in the introduction and throughout the post, there are various use cases where Models from Code can be beneficial. Our 101 example is just the beginning—there is much more to explore. Below is a list of code examples that you may find helpful:

  1. Logging models from code using the Pyfunc log model API (model code | driver code)
  2. Logging models from code using the LangChain log model API (model code | driver code)

· 22 min read
Michael Berk
MLflow maintainers

In this blog, we'll guide you through creating an AutoGen agent framework within an MLflow custom PyFunc. By combining MLflow with AutoGen's ability to create multi-agent frameworks, we are able to create scalable and stable GenAI applications.

Agent Frameworks

Agent frameworks enable autonomous agents to handle complex, multi-turn tasks by integrating discrete logic at each step. These frameworks are crucial for LLM-driven workflows, where agents manage dynamic interactions across multiple stages. Each agent operates based on specific logic, enabling precise task automation, decision-making, and coordination. This is ideal for applications like workflow orchestration, customer support, and multi-agent systems, where LLMs must interpret evolving context and respond accordingly.

· 8 min read
Michael Berk
MLflow maintainers

In this blog, we'll guide you through creating a LangGraph chatbot using MLflow. By combining MLflow with LangGraph's ability to create and manage cyclical graphs, you can create powerful stateful, multi-actor applications in a scalable fashion.

Throughout this post we will demonstrate how to leverage MLflow's capabilities to create a serializable and servable MLflow model which can easily be tracked, versioned, and deployed on a variety of servers. We'll be using the langchain flavor combined with MLflow's model from code feature.

What is LangGraph?

LangGraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. Compared to other LLM frameworks, it offers these core benefits:

  • Cycles and Branching: Implement loops and conditionals in your apps.
  • Persistence: Automatically save state after each step in the graph. Pause and resume the graph execution at any point to support error recovery, human-in-the-loop workflows, time travel and more.
  • Human-in-the-Loop: Interrupt graph execution to approve or edit next action planned by the agent.
  • Streaming Support: Stream outputs as they are produced by each node (including token streaming).
  • Integration with LangChain: LangGraph integrates seamlessly with LangChain.

LangGraph allows you to define flows that involve cycles, essential for most agentic architectures, differentiating it from DAG-based solutions. As a very low-level framework, it provides fine-grained control over both the flow and state of your application, crucial for creating reliable agents. Additionally, LangGraph includes built-in persistence, enabling advanced human-in-the-loop and memory features.

LangGraph is inspired by Pregel and Apache Beam. The public interface draws inspiration from NetworkX. LangGraph is built by LangChain Inc, the creators of LangChain, but can be used without LangChain.

For a full walkthrough, check out the LangGraph Quickstart and for more on the fundamentals of design with LangGraph, check out the conceptual guides.

1 - Setup

First, we must install the required dependencies. We will use OpenAI for our LLM in this example, but using LangChain with LangGraph makes it easy to substitute any alternative supported LLM or LLM provider.

%%capture
%pip install langchain_openai==0.2.0 langchain==0.3.0 langgraph==0.2.27
%pip install -U mlflow

Next, let's get our relevant secrets. getpass, as demonstrated in the LangGraph quickstart, is a great way to insert your keys into an interactive Jupyter environment.

import os

# Set required environment variables for authenticating to OpenAI
# Check additional MLflow tutorials for examples of authentication if needed
# https://mlflow.org/docs/latest/llms/openai/guide/index.html#direct-openai-service-usage
assert "OPENAI_API_KEY" in os.environ, "Please set the OPENAI_API_KEY environment variable."

2 - Custom Utilities

While this is a demo, it's good practice to separate reusable utilities into a separate file/directory. Below we create three general utilities that would theoretically be valuable when building additional MLflow + LangGraph implementations.

Note that we use the magic %%writefile command to create a new file in a Jupyter notebook context. If you're running this outside of an interactive notebook, simply create the file below, omitting the %%writefile {FILE_NAME}.py line.

%%writefile langgraph_utils.py
# omit this line if directly creating this file; this command is purely for running within Jupyter

import os
from typing import Union
from langgraph.pregel.io import AddableValuesDict

def _langgraph_message_to_mlflow_message(
    langgraph_message: AddableValuesDict,
) -> dict:
    langgraph_type_to_mlflow_role = {
        "human": "user",
        "ai": "assistant",
        "system": "system",
    }

    if type_clean := langgraph_type_to_mlflow_role.get(langgraph_message.type):
        return {"role": type_clean, "content": langgraph_message.content}
    else:
        raise ValueError(f"Incorrect role specified: {langgraph_message.type}")


def get_most_recent_message(response: AddableValuesDict) -> dict:
    most_recent_message = response.get("messages")[-1]
    return _langgraph_message_to_mlflow_message(most_recent_message)["content"]


def increment_message_history(
    response: AddableValuesDict, new_message: Union[dict, AddableValuesDict]
) -> list[dict]:
    if isinstance(new_message, AddableValuesDict):
        new_message = _langgraph_message_to_mlflow_message(new_message)

    message_history = [
        _langgraph_message_to_mlflow_message(message)
        for message in response.get("messages")
    ]

    return message_history + [new_message]

By the end of this step, you should see a new file in your current directory with the name langgraph_utils.py.

Note that it's best practice to add unit tests and properly organize your project into logically structured directories.
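For instance, a minimal test for these utilities might look like the following (a hedged sketch; the test file name and the use of langchain_core message classes are assumptions):

# test_langgraph_utils.py (hypothetical)
from langchain_core.messages import AIMessage, HumanMessage

from langgraph_utils import get_most_recent_message, increment_message_history

def test_get_most_recent_message():
    response = {"messages": [HumanMessage(content="hi"), AIMessage(content="hello!")]}
    assert get_most_recent_message(response) == "hello!"

def test_increment_message_history():
    response = {"messages": [HumanMessage(content="hi")]}
    new_message = {"role": "user", "content": "bye"}
    assert increment_message_history(response, new_message) == [
        {"role": "user", "content": "hi"},
        {"role": "user", "content": "bye"},
    ]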

3 - Log the LangGraph Model

Great! Now that we have some reusable utilities located in ./langgraph_utils.py, we are ready to log the model with MLflow's langchain flavor.

3.1 - Create our Model-From-Code File

Quickly, some background. MLflow looks to serialize model artifacts to the MLflow tracking server. Many popular ML packages don't have robust serialization and deserialization support, so MLflow looks to augment this functionality via the models from code feature. With models from code, we're able to leverage Python as the serialization format, instead of popular alternatives such as JSON or pkl. This opens up tons of flexibility and stability.

To create a Python file with models from code, we must perform the following steps:

  1. Create a new Python file. Let's call it graph.py.
  2. Define our LangGraph graph.
  3. Leverage mlflow.models.set_model to indicate to MLflow which object in the Python script is our model of interest.

That's it!

%%writefile graph.py
# omit this line if directly creating this file; this command is purely for running within Jupyter

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.graph.state import CompiledStateGraph

import mlflow

import os
from typing import TypedDict, Annotated

def load_graph() -> CompiledStateGraph:
    """Create example chatbot from LangGraph Quickstart."""

    assert "OPENAI_API_KEY" in os.environ, "Please set the OPENAI_API_KEY environment variable."

    class State(TypedDict):
        messages: Annotated[list, add_messages]

    graph_builder = StateGraph(State)
    llm = ChatOpenAI()

    def chatbot(state: State):
        return {"messages": [llm.invoke(state["messages"])]}

    graph_builder.add_node("chatbot", chatbot)
    graph_builder.add_edge(START, "chatbot")
    graph_builder.add_edge("chatbot", END)
    graph = graph_builder.compile()
    return graph

# Set our model to be leveraged via model from code
mlflow.models.set_model(load_graph())

3.2 - Log with "Model from Code"

After creating this implementation, we can leverage the standard MLflow APIs to log the model.

import mlflow

with mlflow.start_run() as run_id:
    model_info = mlflow.langchain.log_model(
        lc_model="graph.py",  # Path to our model Python file
        artifact_path="langgraph",
    )

model_uri = model_info.model_uri

4 - Use the Logged Model

Now that we have successfully logged a model, we can load it and leverage it for inference.

In the code below, we demonstrate that our chain has chatbot functionality!

import mlflow

# Custom utilities for handling chat history
from langgraph_utils import (
    increment_message_history,
    get_most_recent_message,
)

# Enable tracing
mlflow.set_experiment("Tracing example") # In Databricks, use an absolute path. Visit Databricks docs for more.
mlflow.langchain.autolog()

# Load the model
loaded_model = mlflow.langchain.load_model(model_uri)

# Show inference and message history functionality
print("-------- Message 1 -----------")
message = "What's my name?"
payload = {"messages": [{"role": "user", "content": message}]}
response = loaded_model.invoke(payload)

print(f"User: {message}")
print(f"Agent: {get_most_recent_message(response)}")

print("\n-------- Message 2 -----------")
message = "My name is Morpheus."
new_messages = increment_message_history(response, {"role": "user", "content": message})
payload = {"messages": new_messages}
response = loaded_model.invoke(payload)

print(f"User: {message}")
print(f"Agent: {get_most_recent_message(response)}")

print("\n-------- Message 3 -----------")
message = "What is my name?"
new_messages = increment_message_history(response, {"role": "user", "content": message})
payload = {"messages": new_messages}
response = loaded_model.invoke(payload)

print(f"User: {message}")
print(f"Agent: {get_most_recent_message(response)}")

Output:

-------- Message 1 -----------
User: What's my name?
Agent: I'm sorry, I cannot guess your name as I do not have access to that information. If you would like to share your name with me, feel free to do so.

-------- Message 2 -----------
User: My name is Morpheus.
Agent: Nice to meet you, Morpheus! How can I assist you today?

-------- Message 3 -----------
User: What is my name?
Agent: Your name is Morpheus.

4.1 - MLflow Tracing

Before concluding, let's demonstrate MLflow tracing.

MLflow Tracing is a feature that enhances LLM observability in your Generative AI (GenAI) applications by capturing detailed information about the execution of your application’s services. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.

Start the MLflow server as outlined in the tracking server docs. After entering the MLflow UI, we can see our experiment and corresponding traces.

MLflow UI Experiment Traces

As you can see, we've logged our traces and can easily see them by clicking our experiment of interest and then the "Tracing" tab.

MLflow UI Trace

After clicking on one of the traces, we can now see run execution for a single query. Notice that we log inputs, outputs, and lots of great metadata such as usage and invocation parameters. As we scale our application both from a usage and complexity perspective, this thread-safe and highly-performant tracking system will ensure robust monitoring of the app.

5 - Summary

There are many logical extensions of this tutorial; however, the MLflow components can remain largely unchanged. Some examples include persisting chat history to a database, implementing a more complex LangGraph object, productionizing this solution, and much more!

To summarize, here's what was covered in this tutorial:

  • Creating a simple LangGraph chain.
  • Leveraging MLflow model from code functionality to log our graph.
  • Loading the model via the standard MLflow APIs.
  • Leveraging MLflow tracing to view graph execution.

Happy coding!

· 4 min read
MLflow maintainers

We're excited to announce the release of a powerful new feature in MLflow: MLflow Tracing. This feature brings comprehensive instrumentation capabilities to your GenAI applications, enabling you to gain deep insights into the execution of your models and workflows, from simple chat interfaces to complex multi-stage Retrieval Augmented Generation (RAG) applications.

NOTE: MLflow Tracing has been released in MLflow 2.14.0 and is not available in previous versions.

Introducing MLflow Tracing

Tracing is a critical aspect of understanding and optimizing complex applications, especially in the realm of machine learning and artificial intelligence. With the release of MLflow Tracing, you can now easily capture, visualize, and analyze detailed execution traces of your GenAI applications. This new feature aims to provide greater visibility and control over your applications' performance and behavior, aiding in everything from fine-tuning to debugging.

What is MLflow Tracing?

MLflow Tracing offers a variety of methods to enable tracing in your applications:

  • Automated Tracing with LangChain: A fully automated integration with LangChain allows you to activate tracing simply by enabling mlflow.langchain.autolog().
  • Manual Trace Instrumentation with High-Level Fluent APIs: Use decorators, function wrappers, and context managers via the fluent API to add tracing functionality with minimal code modifications.
  • Low-Level Client APIs for Tracing: The MLflow client API provides a thread-safe way to handle trace implementations for fine-grained control of what and when data is recorded.

Getting Started with MLflow Tracing

LangChain Automatic Tracing

The easiest way to get started with MLflow Tracing is through the built-in integration with LangChain. By enabling autologging, traces are automatically logged to the active MLflow experiment when calling invocation APIs on chains. Here’s a quick example:

import os
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI
import mlflow

assert "OPENAI_API_KEY" in os.environ, "Please set your OPENAI_API_KEY environment variable."

mlflow.set_experiment("LangChain Tracing")
mlflow.langchain.autolog(log_models=True, log_input_examples=True)

llm = OpenAI(temperature=0.7, max_tokens=1000)
prompt_template = "Imagine you are {person}, and you are answering a question: {question}"
chain = prompt_template | llm

chain.invoke({"person": "Richard Feynman", "question": "Why should we colonize Mars?"})
chain.invoke({"person": "Linus Torvalds", "question": "Can I set everyone's access to sudo?"})

And this is what you will see after invoking the chains when navigating to the LangChain Tracing experiment in the MLflow UI:

Traces in UI

Fluent APIs for Manual Tracing

For more control, you can use MLflow’s fluent APIs to manually instrument your code. This approach allows you to capture detailed trace data with minimal changes to your existing code.

Trace Decorator

The trace decorator captures the inputs and outputs of a function:

import mlflow

mlflow.set_experiment("Tracing Demo")

@mlflow.trace
def some_function(x, y, z=2):
    return x + (y - z)

some_function(2, 4)

Context Handler

The context handler is ideal for supplementing span information with additional data at the point of information generation:

import mlflow

@mlflow.trace
def first_func(x, y=2):
    return x + y

@mlflow.trace
def second_func(a, b=3):
    return a * b

def do_math(a, x, operation="add"):
    with mlflow.start_span(name="Math") as span:
        span.set_inputs({"a": a, "x": x})
        span.set_attributes({"mode": operation})
        first = first_func(x)
        second = second_func(a)
        result = first + second if operation == "add" else first - second
        span.set_outputs({"result": result})
        return result

do_math(8, 3, "add")

Comprehensive Tracing with Client APIs

For advanced use cases, the MLflow client API offers fine-grained control over trace management. These APIs allow you to create, manipulate, and retrieve traces programmatically, albeit with additional complexity throughout the implementation.

Starting and Managing Traces with the Client APIs

from mlflow import MlflowClient

client = MlflowClient()

# Start a new trace
root_span = client.start_trace("my_trace")
request_id = root_span.request_id

# Create a child span
child_span = client.start_span(
    name="child_span",
    request_id=request_id,
    parent_id=root_span.span_id,
    inputs={"input_key": "input_value"},
    attributes={"attribute_key": "attribute_value"},
)

# End the child span
client.end_span(
    request_id=child_span.request_id,
    span_id=child_span.span_id,
    outputs={"output_key": "output_value"},
    attributes={"custom_attribute": "value"},
)

# End the root span (trace)
client.end_trace(
    request_id=request_id,
    outputs={"final_output_key": "final_output_value"},
    attributes={"token_usage": "1174"},
)

Diving Deeper into Tracing

MLflow Tracing is designed to be flexible and powerful, supporting various use cases from simple function tracing to complex, asynchronous workflows.

To learn more about this feature, read the guide, review the API Docs and get started with the LangChain integration today!

Join Us on This Journey

The introduction of MLflow Tracing marks a significant milestone in our mission to provide comprehensive tools for managing machine learning workflows. We’re excited about the possibilities this new feature opens up and look forward to your feedback and contributions.

For those in our community with a passion for sharing knowledge, we invite you to collaborate. Whether it’s writing tutorials, sharing use-cases, or providing feedback, every contribution enriches the MLflow community.

Stay tuned for more updates, and as always, happy coding!