
6 pragmatic LLM cost optimization strategies


Learn 6 pragmatic strategies to lower your LLM costs, from the simplest to the most complex, and start slashing those bills.

Introduction

Have you created a new AI SaaS product, or are you considering incorporating LLMs into your work? Do you already have a popular SaaS product, or are you planning to launch the next big thing? If you’re here, it likely means you’re concerned about the costs associated with LLMs, which can indeed become a financial burden.

So let me explain why LLMs cost this much and give you 6 pragmatic strategies you can use to reduce your costs, in order of difficulty. Let’s go!

Why are LLMs so expensive?

The answer is actually simple: LLMs are incredibly GPU hungry. You need lots of GPUs just to run one, and even more to train it (not to mention the data centers needed to handle all this and the cost of electricity).
This means that LLM service providers (OpenAI, AWS, Google, Azure, …) need to buy GPUs by the truckload (thousands and thousands), and not the consumer-grade cards you might be familiar with. We’re talking about GPUs priced in the tens of thousands of dollars (Nvidia’s latest GPU, the H100, costs around $40K).

Let’s use the example of OpenAI’s GPT-4 to better understand this: GPT-4 was reportedly trained on tens of thousands of Nvidia A100 GPUs running for months, and serving it to millions of users requires large clusters of those same GPUs running around the clock.

Now let’s take Mixtral 8x7B (a state-of-the-art open-source model from Mistral AI): even this comparatively small model has around 47 billion parameters, which is roughly 90 GB of memory in half precision, so you need at least two 80 GB GPUs (A100 or H100) just to run inference.

With these two examples, you can see how heavily LLMs rely on GPUs, and specifically on Nvidia, and why everything costs so much.

What about tokens?

You are completely right, we haven’t talked about tokens yet. Tokens are words or parts of words and are the actual input to an LLM. When using an LLM service, you’re charged per token.

When you chat with an LLM, your message is tokenized so that it is better processed by the LLM. For example, “unhappily” becomes 3 tokens: “un”, “happi”, “ly”. The goal is to extract most of the meaning behind each word. Another example: “Hello, how are you doing?” becomes 7 tokens: “Hello”, “,”, “how”, “are”, “you”, “doing”, “?”.

This means that if I send “Hello, how are you doing?”, I will pay for 7 tokens, which is quite affordable since prices are generally a few dollars per million tokens (GPT-4 costs $10.00 per million input tokens).
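
If you want to check token counts yourself, OpenAI’s tiktoken library exposes the tokenizer the models use. A minimal sketch, using the $10 per million figure quoted above:

```python
# pip install tiktoken
import tiktoken

# Tokenizer used by GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

def estimate_input_cost(text: str, price_per_million: float = 10.00) -> float:
    """Count tokens in `text` and estimate the input cost in dollars."""
    n_tokens = len(enc.encode(text))
    return n_tokens * price_per_million / 1_000_000

message = "Hello, how are you doing?"
print(len(enc.encode(message)), "tokens")      # the exact count depends on the tokenizer
print(f"${estimate_input_cost(message):.8f}")  # a tiny fraction of a cent
```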

So why are the costs skyrocketing, or about to? Because most use cases are token hungry.

Token-hungry usage

Token-hungry usage refers to use cases that consume a significant number of tokens, either in the input or the output. Think of document summarization (the whole document goes in as input), web-crawling pipelines that feed pages into the model (like the podcast generator we will see later), or long conversations where the full history is resent with every message.

The more users you have and the more LLM interactions you run, the more your bill will grow. And that is why you need strategies to reduce your costs.
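
To make this concrete, here is a rough back-of-the-envelope sketch of a chat’s cumulative cost. The message size is an assumed average, not a measurement; the price is the GPT-4 figure from above:

```python
# A chat resends the full history on every turn, so cumulative input
# tokens grow roughly quadratically with the number of turns.
TOKENS_PER_MESSAGE = 200    # assumed average message size
PRICE_PER_MILLION = 10.00   # GPT-4 input price used earlier

def conversation_cost(turns: int) -> float:
    total_input_tokens = 0
    history = 0
    for _ in range(turns):
        history += 2 * TOKENS_PER_MESSAGE  # one user message + one model reply
        total_input_tokens += history      # the whole history is sent again
    return total_input_tokens * PRICE_PER_MILLION / 1_000_000

for turns in (5, 20, 50):
    print(f"{turns} turns -> ${conversation_cost(turns):.2f} per conversation")
# 50 turns already costs about $5 in input tokens alone, per conversation
```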

Strategy 1: Unlimited Chat is not the ultimate solution

Unlimited chat entered our lives mostly through ChatGPT, where you can hold a conversation for as long as you want. It is a very good way to interact with an LLM, but it is not de facto the best way. You do not need to give unlimited chat to all your users and all use cases; you need to think about the most cost-efficient solution.

Let’s use an example to better explain: say you offer a document summarizer. The user uploads a document and gets a summary back. There is no need for an open-ended chat here: one well-crafted prompt per document does the job, and users cannot burn tokens on endless back-and-forth. And when you do need chat, you can still bound it, as the sketch below shows.
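
Here is a minimal sketch of two chat guardrails, a daily message cap and history trimming, assuming an OpenAI-style messages list; the limits are illustrative:

```python
MAX_MESSAGES_PER_DAY = 20  # hard cap per user (illustrative)
MAX_HISTORY_TURNS = 5      # only the last N exchanges are resent (illustrative)

def build_request(history: list[dict], user_message: str, used_today: int) -> list[dict]:
    """Enforce a daily cap and trim history before calling the LLM."""
    if used_today >= MAX_MESSAGES_PER_DAY:
        raise RuntimeError("Daily message limit reached")
    # Keep only the most recent turns instead of resending everything
    trimmed = history[-(2 * MAX_HISTORY_TURNS):]
    return trimmed + [{"role": "user", "content": user_message}]
```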

As you can see, you do not need unlimited chat every time, and you can use clever design to avoid it. This is the best introduction to the next strategy.

Strategy 2: UX is the key

UX, meaning the design of your app, lets you limit, simplify, and steer the behaviour of your users. UX looks simple, but it is actually very difficult: you need to sit in front of a sheet of paper and draw how you want your app to be used. The goal here is to use UX “tricks” to limit token usage.

Again, let’s use an example to illustrate: I want to create a podcast generator that crawls the web on a specific topic. Instead of offering a free-text prompt, I can offer a dropdown of supported topics, a fixed set of episode lengths, and a single “generate” button. The user still gets what they came for, but every generation consumes a bounded, predictable number of tokens.
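
In code, these UX “tricks” often reduce to validating constrained inputs before anything reaches the model. A sketch, where the topics and lengths are hypothetical:

```python
ALLOWED_TOPICS = {"ai", "climate", "space"}          # hypothetical dropdown values
EPISODE_LENGTHS = {"short": 500, "standard": 1000}   # max output tokens per length

def build_podcast_request(topic: str, length: str) -> tuple[str, int]:
    """Turn constrained form inputs into a prompt plus an output-token cap."""
    if topic not in ALLOWED_TOPICS:
        raise ValueError(f"Unsupported topic: {topic}")
    max_tokens = EPISODE_LENGTHS[length]
    prompt = f"Write a podcast script about this week's {topic} news."
    return prompt, max_tokens  # pass max_tokens to the LLM call to cap the output
```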

UX is actually a very simple but powerful tool to lower costs in general, because it allows you to really control how your application is used. Now let’s look at more technical solutions.

Strategy 3: The best model is not the best model for you

The gist of this strategy is that you do not always need the best (and often most costly) model for every use case. You can either use a smaller model with a more effective prompt, or only deploy the best model in specific cases.

Let’s use an example to illustrate: take the document summarizer from before. GPT-4 will give you an excellent summary at $10.00 per million input tokens, but a smaller model (GPT-3.5 Turbo, Mistral 7B, …) with a well-crafted summarization prompt often produces a summary that is just as usable, at a price an order of magnitude lower.
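
A common implementation is a simple router that sends easy requests to the cheap model and reserves the expensive one for the hard cases. A sketch using the OpenAI client; the character-length threshold is a crude, assumed proxy for difficulty:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

def summarize(document: str) -> str:
    # Cheap model for short documents, expensive model only when needed.
    # Character length is a crude proxy; adapt the routing rule to your data.
    model = "gpt-3.5-turbo" if len(document) < 8_000 else "gpt-4"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize this document:\n\n{document}"}],
        max_tokens=300,  # also caps the output cost
    )
    return response.choices[0].message.content
```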

So using the best model does not guarantee you the best outcome, and maybe you can accept a small loss of performance if it is much cheaper. This is a good introduction to the next part.

Strategy 4: Fine-tuned models to the rescue

A big model like GPT-4 is actually a generalist model. It can handle whatever topic you give it and is usable for any use case. But that also makes it cost-inefficient for your use case, because you will never use 100% of it; you will always use only the relevant fraction.

That’s where fine-tuning comes in. By customizing an LLM for your specific use case, a smaller model can become as effective as a larger, general-purpose one.

Let’s take the previous example of the document summarizer, but say it is used only by lawyers: instead of paying for GPT-4’s general knowledge, you can fine-tune a much smaller model on your recorded legal summaries. On that narrow task, it can match the big model’s quality at a fraction of the per-token price.

That is why fine-tuning is so efficient for cost optimization: you can literally change the order of magnitude of the model you use.
Also, fine-tuning is not fully training the LLM; you are only training the last layers (or small adapter weights), so it is not that costly (tens of dollars).
The only real prerequisite is that you need sufficient recorded usage to generate the training dataset, so this is not something you do at the very beginning.
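
On a managed platform, this can be as simple as exporting your logs and launching a job. A sketch using OpenAI’s fine-tuning API; the log reader and file name are placeholders for your own data:

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()

def load_recorded_usage():
    # Placeholder: replace with a reader over your own logged
    # (prompt, accepted summary) pairs.
    yield ("Summarize this contract: ...", "The contract states ...")

# 1. Export logged pairs as chat-format JSONL
with open("training.jsonl", "w") as f:
    for prompt, completion in load_recorded_usage():
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}) + "\n")

# 2. Upload the dataset and launch a fine-tuning job on a smaller model
training_file = client.files.create(file=open("training.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)
```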

Strategy 5: Provision for your usage

This strategy is really for those who are heavy LLM users, and when I say heavy, I’m not joking.

Most cloud providers let you provision an LLM model for a specific amount of time. What happens is that the cloud provider reserves a dedicated LLM cluster for you, ensuring a guaranteed token throughput at a fixed price.

But this is not cheap at all. For example, on AWS Bedrock, provisioning a Claude 3 Sonnet model for one month will cost you a whopping $58K. You get a full cluster just for you, and of course it is only cost-efficient if you already pay more than that per month in LLM usage.
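
Before committing, do the break-even math. A tiny sketch using the $58K figure above; the on-demand price is illustrative, so plug in your provider’s actual rates and your real input/output mix:

```python
PROVISIONED_MONTHLY = 58_000   # $ per month, from the Bedrock example above
ON_DEMAND_PER_MILLION = 3.00   # assumed on-demand $ per million input tokens

break_even = PROVISIONED_MONTHLY / ON_DEMAND_PER_MILLION * 1_000_000
print(f"Break-even: {break_even / 1e9:.1f} billion tokens per month")
# ~19.3 billion tokens/month: below that, stay on-demand.
```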

Strategy 6: Deploy your own LLM service

This is actually the most difficult solution, and I recommend it only in specific cases. But you can deploy your own LLM service at a cloud provider.

It takes time to build, human resources to manage, and money for the GPUs, but it will be cheaper than using a managed LLM service like OpenAI, AWS Bedrock, and so on.
The only constraint is that you will be limited to open-source models, so mostly Llama, Mistral, and others.
You can check this link if you want to learn more about how to deploy your own LLM models.
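
To give a taste of what self-hosting looks like, here is a minimal sketch using vLLM, a popular open-source inference engine. It assumes a machine with a suitable GPU and access to the model weights on Hugging Face:

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Load an open-source model (downloads the weights on first run)
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(max_tokens=200, temperature=0.7)

outputs = llm.generate(["Summarize the benefits of self-hosting an LLM."], params)
print(outputs[0].outputs[0].text)
```

In production you would typically run vLLM’s OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model …) instead, so your existing client code keeps working.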

Conclusion

I tried to give you some strategies to help you manage your LLM costs, from the simplest to the most complex. The key is to truly understand where your app’s value lies and strive to deliver the most cost-efficient value to your users or company. Keep this in mind whenever you think about your LLM app architecture, as it will help you converge toward the most cost-efficient solution.

Afterword

I hope this tutorial helped you and taught you many things. I will update this post with more nuggets from time to time. Don’t forget to check my other posts, as I write a lot of cool, practical content on AI.

Cheers!
