You have created a nice Langchain application using the latest best practices (that you read on this blog, right? Of course!) and you are wondering how you are going to add monitoring to it?
You have come to the right place.
Introduction
In this blog post, you will learn:
- what monitoring is and why it is so important
- how to use Langfuse
- how to integrate Langfuse into a Langchain chain that uses LCEL
- the current limits of the integration
As usual, we will be as straightforward and simple as possible so let’s get to work!
Pre-requisites
- A functional Python coding environment with Python and an IDE (VSCode for example).
- An OpenAI account with some credits (this is different from ChatGPT as we will use the LLM service API instead of using ChatGPT).
- Having done the previous tutorial on creating a real-world RAG chat app with Langchain LCEL (link). We will use the code from that tutorial as the basis.
- All the code can be found here.
Monitoring definition
Here’s a definition of monitoring:
Monitoring in software development refers to the process of continuously observing and tracking the performance, behavior, and health of software applications and systems.
The goal is to detect and diagnose issues, ensure optimal performance, and maintain the reliability and availability of the software.
So the goal is to see at all times what is happening inside your application and detect any problems. It goes from simple log printing in the code to a monitoring dashboard where you can see all the metrics of your application, like CPU/memory usage or network load.
You can argue that monitoring is not useful when you are just beginning or you are building an application just for yourself. But the moment you want to share it with people, it becomes a necessity, even more so in LLM applications.
Monitoring for LLM applications
Monitoring for LLM applications that are open to the public is an absolute must because:
- An LLM by itself can be a money black hole: If you have launched a few of the tutorials in this blog and looked at the cost, you should have seen that it is pretty negligible (a few cents). But that is only because we were testing it (and not using it the whole day) and only one person was using it (you). If you create an application that becomes popular (which I hope for everyone), it will be used a lot more. Even worse, with RAG (link), or even just the history of your chat application, the cost of each usage will go up the more it is used. This results in a potential cost explosion that you need to monitor at all costs.
- An LLM by itself is not a deterministic process: Unlike code in general, where you should get the same results whenever you give the same inputs, an LLM process can always give an unexpected result. Of course you can minimise this, but the possibility is far bigger than in classical programming. For example, in your RAG, there is always the possibility that you will receive a result that is not formatted correctly and your whole pipeline throws an error.
- An LLM by itself is a hype machine: LLMs are all the rage right now, which is very nice but, at the same time, can be extremely frightening. Imagine you have created a nice application and you have just launched it. For the occasion, you shared a nice post on whatever social media you are using and you went to sleep. 8 hours later, you have either 1M users and the corresponding bill, or your application has just crashed. In either case, monitoring (and also not sleeping) would have helped you a lot.
That is why, even more than with traditional systems, you need some serious monitoring for your LLM applications.
Multiple types of monitoring
When monitoring an LLM application, you will need to add two types of monitoring:
- the classical monitoring: these are the logs and metrics that show how your application fares. For example, you will have the logs for the instance or cluster where your application runs, the CPU or memory usage, or even the network load. This is very useful to detect failures or slowness, for example because you have too many users and you need to scale your infrastructure.
- the LLM monitoring: this is specific to LLM usage and it shows the token usage, the cost for each call, and the metrics on your vector store usage. This is where you will detect that your application has hit the LLM provider's quotas and your queries don't return anything, or that your LLM provider bills go through the roof. It can also be used to monitor how your application is used.
You always need to combine the two, as the LLM monitoring only works when the Langchain chain or the call works. The only way to know why your chain doesn't work is by using the classical monitoring.
In our case, we will focus on the LLM monitoring part, as the classical monitoring heavily depends on what infrastructure you use to host your application.
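Still, to make the distinction concrete, here is a minimal sketch of what the classical side can look like: plain Python logging and timing around a chain call. The answer_question wrapper and the "question" input key are assumptions for the example, not part of the tutorial code.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("rag-app")

def answer_question(chain, question: str) -> str:
    """Hypothetical wrapper: classical, log-based monitoring around a chain call."""
    start = time.perf_counter()
    try:
        # The input key depends on how your chain is built.
        result = chain.invoke({"question": question})
        logger.info("Chain succeeded in %.2fs", time.perf_counter() - start)
        return result
    except Exception:
        logger.exception("Chain failed after %.2fs", time.perf_counter() - start)
        raise
Logs like these live next to the infrastructure metrics (CPU, memory, network) that your hosting platform already collects.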
LLM monitoring with Langfuse
For the LLM monitoring, we are going to use Langfuse. Here’s the official description:
Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications.
It is an open-source solution that you can deploy yourself (for free), or you can use the cloud version directly (for a fixed monthly price).
It has the following features:
- Track and debug: Helps you see everything happening in your LLM app, like API calls and user interactions, so you can find and fix problems easily.
- Manage prompts: Lets you organize, update, and deploy your prompts quickly, making it easier to test and improve them.
- Check output quality: Provides tools to score and get feedback on your model’s outputs, helping you understand and improve the quality of your application.
- Easy Integration: Works with popular tools like OpenAI, Langchain, and LlamaIndex, and provides SDKs for Python and JavaScript, making it easy to integrate.
- Analytics: Tracks important metrics like cost and speed, giving you insights into how well your app is performing.
- Security: Ensures your data is protected with top-notch security standards and compliance certifications.
So this is actually a great tool that you can use in any LLM application to monitor it. You will get the token count and the usage, you can manage prompts and use them inside your code, and you can easily integrate the tool into Langchain. But it also has some drawbacks:
- Setup: If you do not have experience deploying this kind of tool, it is not advised to try to deploy it yourself; you should just use the cloud version (which is not free).
- Integration with LCEL could be improved: integration with Langchain LCEL is basic and some features could be improved.
- Tokens and cost monitoring not always implemented: the token and cost monitoring does not work with every LLM provider. For example, it works flawlessly with OpenAI and Anthropic directly, but if you use AWS Bedrock, you will not see any token usage. This is actually a limitation in the implementation of the Bedrock client in Langchain rather than in Langfuse, but it is something that needs to be said.
So now that we have presented everything, we need to implement this monitoring to see what it can do.
Initialize the work environment
We will use the same setup as the previous posts, meaning Streamlit, Langchain, a FAISS vector store and Pipenv for managing the virtual env. For better readability, we will create a new folder called LLM-monitoring-with-complexe-chain and copy into it all the files needed from the tutorial create-complex-webapp-with-langchain-best-practices.
mkdir LLM-monitoring-with-complexe-chain
cp -R create-complex-webapp-with-langchain-best-practices/* LLM-monitoring-with-complexe-chain/
cp create-complex-webapp-with-langchain-best-practices/.env LLM-monitoring-with-complexe-chain/
This will create the folder with the code from the previous post. We also made sure to copy the .env file that contains the OpenAI credentials.
Now let’s install the project:
cd LLM-monitoring-with-complexe-chain
pipenv install
And make sure it works:
pipenv run streamlit run app.py
Setup of Langfuse
For the sake of simplicity, we are going to use the Langfuse cloud. So first, go to the link and create an account. You will see the following screen, in which you will need to create a project (with whatever name you like):
Then, you will arrive here:
Now let’s create some keys (which are credentials to connect your application to Langfuse). You can do that by clicking on the Create new API keys button in the API keys section.
Then you can use these values to create the following environment variables in your .env file. Of course, the values in the screenshots are already invalid.
LANGFUSE_PUBLIC_KEY=pk-xxx
LANGFUSE_SECRET_KEY=sk-xxx
LANGFUSE_HOST=https:xxx
We also need to add Langfuse to the dependencies of the project:
pipenv install langfuse
Ok, now you are set up and we can integrate Langfuse directly inside Langchain.
Langfuse integration
The code that we will use is the code from the tutorial on creating a complex Langchain chain. That means it is a chain using LCEL, so you will need a specific method to integrate Langfuse. This method is called callbacks.
LangChain provides a callbacks system that allows you to hook into the various stages of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
What happens is that, whenever your chain or a task of your chain finishes, Langchain calls the callback with all the information about the chain.
In this case, we will use the Langfuse callback that will save these chain data into Langfuse.
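To give you an idea of what a callback looks like under the hood, here is a minimal, hypothetical handler (you do not need it for this tutorial, Langfuse ships its own) that simply prints when a chain starts and ends:
from langchain_core.callbacks import BaseCallbackHandler

class PrintHandler(BaseCallbackHandler):
    """Hypothetical handler that prints when a chain starts and ends."""

    def on_chain_start(self, serialized, inputs, **kwargs):
        print("Chain started with inputs:", inputs)

    def on_chain_end(self, outputs, **kwargs):
        print("Chain finished with outputs:", outputs)
The Langfuse handler implements these same hooks, but sends the data to Langfuse instead of printing it.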
Let’s see how to do that:
from langfuse.callback import CallbackHandler
langfuse_handler = CallbackHandler()
Here’s what’s happening:
- We import the callback from the langfuse library
- We set up the langfuse_handler that will be used in the code. This handler will use the environment variables LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST to connect to Langfuse (see the sketch right after this list for how they can be loaded). We could pass these values directly to the callback, but that is terrible from a security point of view because you do not want to save these keys inside your repository.
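If your app does not already load the .env file (the code from the previous tutorial may already handle this), here is a small sketch of how the handler can pick up these variables using python-dotenv; the assertion is just an illustrative safeguard:
import os

from dotenv import load_dotenv
from langfuse.callback import CallbackHandler

# Load the .env file so the LANGFUSE_* variables end up in the environment.
load_dotenv()

# Fail early if a key is missing instead of silently getting no traces.
assert os.getenv("LANGFUSE_PUBLIC_KEY"), "LANGFUSE_PUBLIC_KEY is not set"

# The handler reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST itself.
langfuse_handler = CallbackHandler()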
Now we just need to set this callback in the chain:
chain = (
inputs | context | configurable_prompt | model | StrOutputParser()
).with_config(RunnableConfig(callbacks=[langfuse_handler]))
Here’s what’s happening:
- Every chain has a with_config method that allows us to pass configuration. In this case, we are creating a RunnableConfig with the callbacks option, where we pass our Langfuse callback. You can also pass callbacks for a single invocation instead, as shown in the sketch right after this list.
- You are not limited to only one callback. As you can see, you need to give a list, so you can give as many as you want, but be warned that it will slow your process down a little.
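For completeness, here is a sketch of the alternative mentioned above: passing the callback only for a specific call instead of binding it to the chain with with_config. The "question" input key is an assumption, use whatever inputs your chain expects.
answer = chain.invoke(
    {"question": "What is this document about?"},  # input keys depend on your chain
    config={"callbacks": [langfuse_handler]},      # callbacks apply to this call only
)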
Ok, it is done! Very simple, right? Now let’s test it.
Test the monitoring
Now it is time to test the monitoring. Let’s upload the vector store files and ask a question.
And here’s what you have in Langfuse:
As you can see, you have one trace that shows the query you just made. You also have the cost of the query, the latency, the score if you use a compatible vector store, and even the full input and output. This is really great, right?
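One small note if you ever test from a short-lived script instead of the long-running Streamlit app: Langfuse sends events in the background, so the process can exit before the trace is delivered. Depending on your version of the Langfuse SDK, the callback handler exposes a flush method for exactly that; a hedged sketch:
# Flush pending events before a short-lived script exits.
# (The Streamlit app runs continuously, so it does not need this.)
langfuse_handler.flush()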
Conclusion
As you have seen, monitoring is actually a very important part of an LLM application that allows us to be confident in how it works and avoid any pitfalls that would slow down your application's growth. It is a necessary piece of any application that will be open to users. It is also simple to integrate, so there is no excuse not to add it to your application. This is really the part that will allow you to sleep at night!
Afterword
I hope this tutorial helped you and taught you many things. I will update this post with more nuggets from time to time. Don’t forget to check my other posts, as I write a lot of cool posts on practical stuff in AI.
Cheers !