Learn 6 pragmatic strategies to lower your LLM costs, from the simplest to the most complex, and start slashing those bills.
Introduction
Have you created a new AI SaaS product, or are you considering incorporating LLMs into your work? Do you already have a popular SaaS product, or are you planning to launch the next big thing? If you’re here, it likely means you’re concerned about the costs associated with LLMs, which can indeed become a financial burden.
So let me explain why LLMs cost this much and give you 6 pragmatic strategies you can use to reduce your costs, in order of difficulty. Let’s go!
Why are LLMs so expensive?
The answer is actually simple: LLMs are incredibly GPU-hungry. You need lots of GPUs just to run one, and even more to train it (not to mention the data centers needed to host all of this and the cost of electricity).
This means that LLM service providers (OpenAI, AWS, Google, Azure, …) need to buy GPUs by the truckload (thousands and thousands), and not the consumer-grade cards you might be familiar with. We’re talking about GPUs priced in the tens of thousands of dollars (the latest Nvidia GPU, the H100, costs around $40K).
Let’s use the example of OpenAI’s GPT-4 to better understand this:
- GPT-4 is an LLM with 1.8 trillion parameters, clearly one of the biggest on the market.
- To train it, OpenAI used around 25,000 Nvidia A100 GPUs (the predecessor of the H100) for 90 to 100 days, which works out to roughly $66 million in training costs (and they have a deal with Microsoft to get cheap GPUs…)
- To run it behind ChatGPT, OpenAI uses 8 Nvidia A100 GPUs 24/7. The reason they don’t need more is that each request only accesses a specific part of the model.
- Even OpenAI doesn’t own any GPUs: through their partnership with Microsoft, they rent them (at a big discount). They pay about $1/hour for an A100, where the normal price would be at least $30/hour.
Now let’s use the example of Mixtral 8x7B (a state-of-the-art open-source model from Mistral AI):
- Mixtral 8x7B has 46 billion parameters (a small fraction of GPT-4’s size) but very good performance
- You need at least 100GB of GPU memory to run it at decent speed
- On AWS, that would mean at least a g5.8xlarge instance at about $2.5/hour, which is roughly $1,800/month
- With 100GB of memory, the model will be barely fast enough for a chat
With these two examples, you can see how heavily LLMs currently rely on GPUs, and specifically on Nvidia, and also why everything costs so much.
What about tokens?
You’re right, we haven’t talked about tokens yet. Tokens are words or parts of words and are the actual input to an LLM. When using an LLM service, you’re charged per token.
When you chat with an LLM, your message is tokenized so that the LLM can process it better. For example, “unhappily” becomes 3 tokens: “un”, “happi”, “ly”. The goal is to extract as much of the meaning behind each word as possible. Another example: “Hello, how are you doing?” becomes 7 tokens: “Hello”, “,”, “how”, “are”, “you”, “doing”, “?”.
This means that if I send “Hello, how are you doing?”, I pay for 7 tokens, which is quite affordable since prices are generally a few dollars per million tokens (GPT-4 costs $10.00 per million input tokens).
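If you want to check this for yourself, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer to count tokens and estimate input cost. The $10 per million tokens is the GPT-4 price quoted above; exact token counts vary by model.

```python
# pip install tiktoken
import tiktoken

PRICE_PER_MILLION_INPUT_TOKENS = 10.00  # GPT-4 price quoted above, in dollars

def estimate_input_cost(text: str, model: str = "gpt-4") -> float:
    """Count the tokens of a prompt and estimate its input cost in dollars."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    print(f"{len(tokens)} tokens: {tokens}")
    return len(tokens) / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(estimate_input_cost("Hello, how are you doing?"))
# A handful of tokens -> a tiny fraction of a cent for this prompt
```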
So why are costs skyrocketing, or about to? Because most real-world usage is token-hungry.
Token-hungry usage
Token-hungry usage refers to use cases that consume a significant number of tokens, either in the input or output. Let’s see some examples:
- RAG: RAG is inherently token-hungry because you put all the relevant information from your documents into the prompt. That means even the simplest question produces a prompt with thousands of tokens of input and output.
- Chat: a chat is also very token-hungry because the longer you talk with the LLM, the bigger your chat history gets. Since the whole conversation has to be added to each prompt, token usage keeps growing (a quick sketch of this is shown after this list).
- Tool usage: tools are a way to let an LLM use an external tool (such as an API). You give the API documentation to the LLM along with your request, which means every single prompt can be enormous.
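To make the chat case concrete, here is a rough sketch of how re-sending the whole history on every turn makes the billed tokens grow much faster than the conversation itself (token counts are approximated by word counts, not a real tokenizer):

```python
# Rough illustration of how chat history inflates token usage.
# Token counts are approximated as word counts for simplicity.

def approx_tokens(text: str) -> int:
    return len(text.split())

history: list[str] = []
total_billed = 0

for turn in range(1, 6):
    user_message = f"User message number {turn} with a dozen or so words of content here."
    history.append(user_message)
    # Every request re-sends the full history, so you pay for it again each turn.
    prompt_tokens = sum(approx_tokens(m) for m in history)
    total_billed += prompt_tokens
    print(f"Turn {turn}: prompt ≈ {prompt_tokens} tokens, billed so far ≈ {total_billed}")
```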
The more users you have and the more LLM interactions you make, the more your bill will grow. That is why you need strategies to reduce your costs.
Strategy 1: Unlimited Chat is not the ultimate solution
Unlimited chat has become part of our lives, mostly through ChatGPT, where you can have a conversation as long as you want, nearly without limits. It is a very good way to interact with an LLM, but it is not de facto the best way. You do not need to give unlimited chat to all your users for every use case; you need to think about the most cost-efficient solution.
Let’s use some examples to better explain:
- I want to create an app that loads documents (PDFs, for example) and creates summaries.
  - In this case, you don’t need to offer a chat feature; what’s necessary is generating a high-quality summary that is easy to understand and use.
  - You do not need chat-with-your-documents here, or at most a capped version, or one reserved for premium users.
- I want to create a RAG app where the user can load documents and analyse them.
  - Here, a chat feature may be necessary, but you can impose a size limit. Without it, your users could quickly accumulate 100k-token conversations, significantly increasing costs.
  - You could also build dashboards and graphics that replace some of the chat usage.
- I want to create an app that integrates with all the tools in my ecosystem so that the user can launch processes from the chat.
  - This is actually a very nice use case, and you will need a chat for it.
  - With the simplest method, you would include the API specification of every tool in each prompt, which makes for very big prompts.
  - Instead, you can create a selector component where the user chooses the tools they want to use for each prompt (a sketch of this follows the list).
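Here is a minimal sketch of that selector idea, assuming an OpenAI-style chat completion call with function calling; the tool names and schemas (search_invoices, create_ticket) are made up for illustration:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

# Full catalogue of tool schemas (hypothetical examples).
ALL_TOOLS = {
    "search_invoices": {
        "type": "function",
        "function": {
            "name": "search_invoices",
            "description": "Search invoices by customer and date range.",
            "parameters": {"type": "object", "properties": {"customer": {"type": "string"}}},
        },
    },
    "create_ticket": {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Open a support ticket.",
            "parameters": {"type": "object", "properties": {"title": {"type": "string"}}},
        },
    },
}

def ask(prompt: str, selected_tools: list[str]):
    # Only the tools the user ticked in the UI go into the request,
    # instead of the whole catalogue on every prompt.
    tools = [ALL_TOOLS[name] for name in selected_tools]
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
    )

response = ask("Find last month's invoices for ACME", selected_tools=["search_invoices"])
```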
As you can see, you do not need unlimited chat every time, and you can use clever design to avoid it. This is the perfect introduction to the next strategy.
Strategy 2: UX is the key
By using UX, meaning the design of your app, you can limit, simplify, and guide the behaviour of your users. UX looks simple, but it is actually very difficult: you need to sit down in front of a sheet of paper and draw how you want your app to be used. The goal here is to use UX “tricks” to limit token usage.
Again, let’s use an example to illustrate: I want to create a podcast generator that crawls the web on a specific topic.
- The ChatGPT-like solution would be a chat with internet-crawling capabilities where you can explore some topics and, when ready, launch the creation with a tool.
- This setup will be incredibly costly, as your prompts will contain crawled internet pages + chat history + tool documentation.
- We can change this by not having a chat (at least not for regular users) and instead having a component where the user adds a list of webpages to crawl, plus a few selector fields like sub-topics, tone, … This will cost you far less (a sketch follows below).
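As a sketch of what that component could send to the LLM, here is a hypothetical prompt builder that turns the form fields (pages to crawl, sub-topics, tone) into one bounded prompt instead of an open-ended chat; the function name and fields are illustrative:

```python
def build_podcast_prompt(page_texts: list[str], subtopics: list[str], tone: str) -> str:
    """Build one bounded prompt from the crawled pages and the user's form fields."""
    # Cap how much crawled content goes into the prompt instead of
    # letting a chat history grow without limit.
    sources = "\n\n".join(text[:2000] for text in page_texts[:5])
    return (
        f"Write a podcast script in a {tone} tone.\n"
        f"Focus on these sub-topics: {', '.join(subtopics)}.\n"
        f"Use only the following source material:\n{sources}"
    )

prompt = build_podcast_prompt(
    page_texts=["...crawled article text...", "...another article..."],
    subtopics=["open-source LLMs", "GPU pricing"],
    tone="casual",
)
```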
UX is a simple but powerful lever to lower costs in general, because it lets you really control how your application is used. Now let’s look at more technical solutions.
Strategy 3: The best model is not the best model for you
The gist of this strategy is that you do not always need the best (and often most costly) model for every use case. You can either use a smaller model with a more effective prompt, or only deploy the best model for specific cases.
Let’s use an example to illustrate: take the document summarizer example from before:
- You could have a feature that shows the main topics in the text and then a way to deep dive into a topic.
- For the main topics, you could use a smaller model (like GPT-3.5) and then use a bigger model (GPT-4) for the deep dive.
- You could completely ditch OpenAI and use other, cheaper models like Mistral’s (which are 10-20x cheaper).
- A more sophisticated approach would be dynamic routing to the best-fit model depending on the use case (with this, for example); a toy version of the idea is sketched after this list.
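Here is a toy version of that routing idea: the cheap model handles the broad “main topics” pass and the expensive model is only called for the deep dive. It assumes the OpenAI Python client; the routing rule itself is just a placeholder:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

def summarize(text: str, deep_dive: bool = False) -> str:
    # Cheap model for the broad "main topics" pass,
    # expensive model only when the user asks for a deep dive.
    model = "gpt-4" if deep_dive else "gpt-3.5-turbo"
    instruction = (
        "Write a detailed analysis of this document."
        if deep_dive
        else "List the main topics of this document in a few bullet points."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return response.choices[0].message.content

document_text = "…your document text…"
topics = summarize(document_text)                     # cheap call
analysis = summarize(document_text, deep_dive=True)   # expensive call, only on demand
```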
So using the best model does not guarantee the best results for you, and sometimes you can accept a small loss of performance if it is much cheaper. This is a good introduction to the next part.
Strategy 4: Fine-tuned models to the rescue
A big model like GPT-4 is a generalist model. It can handle whatever topic you give it and is usable for any use case. But that also makes it cost-inefficient for your use case: you will never use 100% of it, only the relevant fraction.
That’s where fine-tuning comes in. By customizing an LLM for your specific use case, a smaller model can become as effective as a larger, general-purpose one.
Let’s take the previous example of the document summarizer, but say that it is only used by lawyers:
- If the app is only used by lawyers, we do not need all the complexity of a generalist GPT-4 model. We could fine-tune a GPT-3 model with all the prompts and results you have generated (you did keep them somewhere, right?).
- The fine-tuned model could even outperform GPT-4 on this task, for a fraction of the cost.
- You could even keep training the model to make it better and better
That is why fine-tuning is so efficient for cost optimization: you can literally drop an order of magnitude in model size.
Also, fine-tuning is not a full retraining of the LLM: you only train the last layers, so it is not that costly (tens of dollars).
The only real prerequisite is that you need enough recorded usage to build a training dataset, so this is not something you do at the very beginning.
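Once you have that usage recorded, turning it into a training file is mostly a formatting job. Here is a minimal sketch that writes logged prompt/answer pairs into the JSONL chat format expected by OpenAI’s fine-tuning API; the logged_examples data is hypothetical:

```python
import json

# Hypothetical log of past interactions: (user prompt, model answer you were happy with)
logged_examples = [
    ("Summarize this NDA clause: ...", "This clause states that ..."),
    ("Summarize this licensing agreement: ...", "The agreement grants ..."),
]

with open("training_data.jsonl", "w") as f:
    for prompt, answer in logged_examples:
        record = {
            "messages": [
                {"role": "system", "content": "You summarize legal documents for lawyers."},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```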
Strategy 5: Provision for your usage
This strategy is really for those who use LLM models heavily, and when I say heavily, I’m not joking.
With most cloud providers, you can provision an LLM model for your usage for a specific amount of time. The cloud provider reserves a dedicated LLM cluster for you, ensuring guaranteed token throughput at a fixed price.
But this is not cheap at all. For example, on AWS Bedrock, provisioning a Claude 3 Sonnet model for 1 month will cost you a whopping $58K for the month. You get a full cluster just for you, and of course it is only cost-efficient if you already pay more than that per month in LLM usage.
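To check whether provisioning is worth it for you, compare it against what the same volume would cost on-demand. A back-of-the-envelope sketch, using the $58K/month figure above and assumed on-demand prices of $3 per million input tokens and $15 per million output tokens (check your provider’s current pricing):

```python
# Back-of-the-envelope break-even for provisioned throughput.
provisioned_monthly_cost = 58_000   # dollars/month, the figure quoted above
input_price_per_million = 3.0       # assumed on-demand price, dollars per million input tokens
output_price_per_million = 15.0     # assumed on-demand price, dollars per million output tokens

# Suppose your traffic is roughly 75% input tokens and 25% output tokens.
blended_price_per_million = 0.75 * input_price_per_million + 0.25 * output_price_per_million

break_even_millions = provisioned_monthly_cost / blended_price_per_million
print(f"Provisioning pays off above ~{break_even_millions:,.0f} million tokens per month")
```

If your monthly on-demand bill is well below that break-even point, stick with pay-per-token pricing.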
Strategy 6: Deploy your own LLM service
This is the most difficult solution, and I recommend it only in specific cases. But you can deploy your own LLM service on a cloud provider.
It takes time to build, human resources to manage, and money for the GPUs, but it can be cheaper than using a managed LLM service like OpenAI, AWS Bedrock, …
The only constraint is that you will be limited to open-source models, so mostly Llama, Mistral and others.
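To give you a taste of what this looks like, here is a minimal sketch of running an open-source model yourself with the vLLM library on a GPU machine; the model name is just an example, and you still have to size the instance for the model’s memory needs:

```python
# pip install vllm  (requires a machine with enough GPU memory for the model)
from vllm import LLM, SamplingParams

# Example open-source model; swap in Llama, Mixtral, ... depending on your GPUs.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the following contract clause: ..."], params)
print(outputs[0].outputs[0].text)
```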
You can check this link if you want to learn more about how to deploy your own LLM models.
Conclusion
I have tried to give you some strategies to help manage your LLM costs, from the simplest to the most complex. The key is to truly understand where your app’s value lies and strive to deliver it to your users or company in the most cost-efficient way. Keep this in mind whenever you think about your LLM app architecture, and it will help you converge on the most cost-efficient solution.
Afterword
I hope this tutorial helped you and taught you a few things. I will update this post with more nuggets from time to time. Don’t forget to check out my other posts, as I write a lot of cool, practical content on AI.
Cheers!