How to make AWS Bedrock LangChain chains faster using Bedrock cross-region inference

You have built an application with LangChain and AWS Bedrock, and you are wondering how to get better performance and more resilience? Say no more: in this blog post we will look at one of AWS Bedrock's newest features, the cross-region inference endpoint.

Introduction

In this post, you will learn:

  • How foundation models in AWS Bedrock work
  • What the Bedrock cross-region inference endpoint is
  • How to implement a LangChain chain using the Bedrock cross-region inference endpoint

As usual, we will keep things as straightforward and simple as possible, so let's get to work!

Foundational models in AWS Bedrock

AWS Bedrock is a fully managed service designed to simplify the integration and use of foundation models (LLMs). Bedrock provides access to a variety of LLMs from top AI providers like Anthropic (Claude models), Meta (Llama models), Cohere (Embed models) and Mistral AI (Mistral models), among others.

The strength of Bedrock lies in its unified API and managed service, which lets users easily switch between these models based on their needs, without having to interface with and manage each LLM provider separately.
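To make that unified API concrete, here is a minimal sketch using boto3's bedrock-runtime converse operation (the model IDs below are examples; use the ones enabled in your account). Switching providers is just a matter of changing the model ID string:

import boto3

# One client, one API, several providers behind it
client = boto3.client("bedrock-runtime", region_name="eu-west-1")

for model_id in (
    "anthropic.claude-3-5-sonnet-20240620-v1:0",  # Anthropic
    "mistral.mistral-large-2402-v1:0",            # Mistral AI
):
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": "what is an LLM ?"}]}],
    )
    print(model_id, "->", response["output"]["message"]["content"][0]["text"][:80])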

Serverless and Multi-Tenant Architecture

AWS Bedrock operates on a serverless architecture, which means developers don't need to manage or scale the infrastructure themselves. AWS handles the scaling dynamically, so your applications scale automatically with demand, and you pay only for the inference you use.

Bedrock uses a multi-tenant architecture, where multiple customers (or AWS accounts) share the same underlying LLM clusters and infrastructure. Despite sharing the same clusters, each tenant's data is securely isolated. This setup improves resource efficiency, letting AWS manage and optimize server load across many users and scale depending on usage, while giving each user the illusion of a private environment.

Bedrock cross-region inference endpoint

Intelligent routing

AWS Bedrock has introduced a new feature called the cross-region inference endpoint. It lets you invoke LLMs through a single endpoint while the underlying inference can run in any of several AWS Regions.

As mentioned before, Bedrock uses a multi-tenant architecture where multiple accounts share the same underlying cluster. But these clusters are regional, meaning there is one infrastructure per AWS Region: when you choose an AWS Region, you use the LLM cluster of that specific Region.

And here’s why cross-region inference endpoint is so powerful:

  • You get a single endpoint in a specific Region
  • For each Bedrock LLM call, a routing system routes your call to the least used cluster among the Regions
  • You get the same quotas as if you were using a cluster twice as big. For example, for Claude 3.5 Sonnet the quota is 250 calls per minute, but with 2 AWS Regions you get 500 calls per minute.
  • Your latency improves because you can run more inferences in parallel, thanks to the intelligent routing (see the sketch after this list).
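Here is a hypothetical sketch of that last point (assuming boto3's converse operation and an EU inference profile): firing several requests in parallel against the cross-region endpoint, and letting the routing layer spread them across the regional clusters.

from concurrent.futures import ThreadPoolExecutor

import boto3

client = boto3.client("bedrock-runtime", region_name="eu-west-1")
# Cross-region inference profile ID (not a plain model ID)
profile_id = "eu.anthropic.claude-3-5-sonnet-20240620-v1:0"

def ask(question: str) -> str:
    response = client.converse(
        modelId=profile_id,
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Ten requests in flight at once; routing spreads them over the clusters
questions = [f"Give me one fact about the number {i}." for i in range(10)]
with ThreadPoolExecutor(max_workers=10) as pool:
    for answer in pool.map(ask, questions):
        print(answer[:80])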

How to activate and use cross-region inference

[Screenshot: Bedrock model access page]

To activate the cross-region endpoint, you need to request model access (for the compatible models) in all the Regions used by the cross-region inference profile. Below you can see the compatible models and the required Regions.

For example, for the EU cross-region endpoint, you need access in eu-west-1, eu-west-3, and eu-central-1 for the model in question.

[Screenshot: AWS cross-region inference endpoint console]

For each compatible model, Bedrock automatically creates what is called an inference profile. To use one, you just replace the normal model ID with the inference profile ID. The naming is simple: the inference profile ID is "eu." or "us." followed by the model ID. For example, anthropic.claude-3-5-sonnet-20240620-v1:0 becomes eu.anthropic.claude-3-5-sonnet-20240620-v1:0.
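If you want to see which inference profiles exist in your account, here is a short sketch (assuming the ListInferenceProfiles API, available on the bedrock control-plane client in recent boto3 versions):

import boto3

bedrock = boto3.client("bedrock", region_name="eu-west-1")

# Each summary carries the profile ID to use in place of the model ID
for profile in bedrock.list_inference_profiles()["inferenceProfileSummaries"]:
    print(profile["inferenceProfileId"])
# e.g. eu.anthropic.claude-3-5-sonnet-20240620-v1:0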

How to use LangChain with the Bedrock cross-region inference endpoint

First, you will need a recent version of boto3, langchain at least 0.3.0, and langchain-aws at least 0.2.1. Newer versions should work too, but these are the ones tested here.

poetry add langchain==0.3.0 langchain-aws==0.2.1 boto3==1.35.22

Now we just need to set up our Bedrock chat model and invoke it (don't forget to configure your AWS profile or credentials if needed):

from langchain_aws import ChatBedrock

# Use the cross-region inference profile ID instead of the plain model ID
inference_profile = "eu.anthropic.claude-3-5-sonnet-20240620-v1:0"
# Region where you call the endpoint; routing then picks the actual cluster
region_name = "eu-west-1"

model_chat_bedrock = ChatBedrock(
    model_id=inference_profile,
    region_name=region_name,
)

response = model_chat_bedrock.invoke("what is an LLM ?")
print(response.content)

Simple, right? This is why this feature is so good: there is nothing to change in your code apart from the model ID. You only need to request the accesses in Bedrock.
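And since this post is about chains, here is a minimal LCEL chain sketch on top of the same inference profile (the prompt wording is illustrative):

from langchain_aws import ChatBedrock
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

model_chat_bedrock = ChatBedrock(
    model_id="eu.anthropic.claude-3-5-sonnet-20240620-v1:0",
    region_name="eu-west-1",
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise AWS expert."),
    ("human", "{question}"),
])

# prompt -> model -> plain string: the classic LCEL pipeline
chain = prompt | model_chat_bedrock | StrOutputParser()
print(chain.invoke({"question": "what is an LLM ?"}))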

Conclusion

The AWS Bedrock cross-region inference endpoint is a powerful feature that lets you literally double your throughput and quotas when using LLMs on Bedrock, while minimizing code changes (just the model ID to change, plus library updates). It works by creating a new endpoint that automatically routes incoming inference requests depending on the load of each cluster.
If you are using AWS Bedrock for heavy or even medium usage, this is something to add to your stack whenever you can!

Afterwards

I hope you really enjoyed this post. Don't forget to check out my other posts, as I write a lot about practical topics in AI.
Follow me on LinkedIn and X, and please leave a comment.
