
Deploy a RAG app with Langchain in minutes

Representation of a rag app

Easily deploy a Langchain RAG application with Streamlit and begin prototyping right now.

Introduction

In this blog post, we are going to build a RAG application using Langchain, GPT-4 (OpenAI), Streamlit, and Python in just 10 minutes. We will define all these concepts and get you started with one more tool in your belt.

More precisely, you will learn what a RAG pipeline is, what embeddings, vector stores, and chunking strategies are, what Langchain and Streamlit bring to the table, and how to put all of it together into a working RAG application.

Let’s begin then.

PS: If you do not want to use OpenAI, check this link to deploy your own LLM in 5 min.

What is a RAG pipeline?

A Retrieval-Augmented Generation (RAG) pipeline is a powerful tool that combines the best of two worlds: retrieval (finding relevant information) and generation (creating responses) in large language models (LLMs). Here’s how it works:

  1. Retrieval: When a question or prompt is given to the model, the RAG pipeline first searches through its knowledge base to find the most relevant pieces of information. This is like having an instant lookup feature that finds key details related to the query.
  2. Augmentation: The information retrieved is then added to the original query to provide a meaningful context to the language model.
  3. Generation: With this meaningful context, the language model generates a response or output that’s not just based on its pre-trained knowledge but also on the relevant information retrieved in the first step.

By integrating retrieval into the generation process, a RAG pipeline allows LLMs to produce more accurate, relevant, and contextually informed responses. This makes LLMs far more useful, especially in scenarios where up-to-date or specific information is crucial.
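To make these three steps concrete, here is a minimal, hypothetical sketch in plain Python. The similarity scoring below is a toy placeholder; real pipelines use embeddings, as explained next:

def similarity(question: str, chunk: str) -> int:
    # Toy relevance score: count of shared words (real pipelines use embeddings)
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def retrieve(question: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    # 1. Retrieval: keep the k chunks most related to the question
    return sorted(knowledge_base, key=lambda c: similarity(question, c), reverse=True)[:k]

def answer(question: str, knowledge_base: list[str], llm) -> str:
    # 2. Augmentation: put the retrieved chunks in front of the question
    context = "\n".join(retrieve(question, knowledge_base))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generation: any LLM callable produces the final, context-aware answer
    return llm(prompt)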

But, how does the RAG pipeline find the most relevant pieces of information? It uses embeddings.

What are embeddings?

Embeddings are a fundamental concept in machine learning, particularly in natural language processing (NLP). Basically, they are representations of words, sentences, or even entire documents converted into numerical form, which computers can process much more easily. They are generated by dedicated embedding models (similar to LLMs but smaller).

Here’s a simple breakdown:

  1. Transformation: Imagine you have a word like “apple.” In the world of embeddings, “apple” is converted into a list of numbers (a vector) that captures its meaning based on how it’s used in language.
  2. Context Capture: These numerical representations are designed so that words with similar meanings have similar embeddings. For example, “apple” and “fruit” would be closer in the numerical space than “apple” and “car.”
  3. Usage: Once text is converted into embeddings, we can perform various operations on it, like comparing word similarity, feeding it into machine learning models, or using it in applications like search engines and recommendation systems.

In the context of a RAG pipeline, embeddings allow the system to quickly retrieve information related to a query by comparing the numerical representations of the query and the data it has, finding the most relevant matches.
This is closely linked to one of the most complex parts of RAG pipelines in general: how to chunk your text, and why the chunking strategy is so important.
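As a quick illustration of this "closer in numerical space" idea, here is a small sketch (assuming the langchain-openai package we install later and an OPENAI_KEY environment variable) that embeds a few words and compares them with cosine similarity:

import os

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_KEY"])

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

apple, fruit, car = (embeddings.embed_query(word) for word in ("apple", "fruit", "car"))

print(cosine(apple, fruit))  # higher score: related meanings
print(cosine(apple, car))    # lower score: unrelated meanings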

What is a vector store?

A vector store for large language models (LLMs) is essentially a database or storage system designed to efficiently store and retrieve vectors. Basically, it stores the generated embeddings (as vectors) in a way that makes them easy to use. It should be able to store the embeddings, search them by similarity, and return the closest matches quickly.

Why is a Chunking strategy so important?

Chunking involves breaking down large texts or datasets into smaller, more manageable pieces or “chunks.” It is an essential part of RAG:

  1. Improved Efficiency and Scalability: By breaking down data into smaller pieces, LLMs can handle information more quickly and efficiently. It also makes the system more scalable, since we only retrieve the most relevant pieces of information, whether you have 5 MB or 5 TB of data.
  2. Enhanced Accuracy: Chunking allows the model to focus on smaller sections of text at a time, which can improve the accuracy of the retrieval and generation process. It ensures that the model isn’t overwhelmed with too much information at once, which can lead to a loss of relevant details.
  3. Contextual Relevance: In RAG systems, chunking ensures that the retrieval component fetches the most relevant information by focusing on specific segments of text. This targeted approach leads to more contextually relevant and coherent outputs from the generation phase.

Here’s a concrete example based on a customer review of a smartphone:

Overall, I’m quite pleased with my purchase of the XYZ Smartphone. The screen clarity is outstanding, offering vibrant colors and sharp details, making my viewing experience enjoyable. However, I’ve encountered issues with the battery life, which doesn’t last as long as advertised, often needing a recharge by midday. The camera quality is impressive, capturing high-resolution images even in low-light conditions. The phone’s design is sleek and comfortable to hold, but I find the placement of the power button awkward, often pressing it by accident. On the software side, the XYZ Smartphone runs smoothly, with no lag when switching between apps. The audio quality during calls is clear, but the speaker could be louder for media playback.

Question 1: How is the screen quality of the XYZ Smartphone?

Question 2: What are the customer’s thoughts on the smartphone’s battery life?

Without Chunking (Processing the Entire Text): the whole review is sent to the model for every question. That works for a short review like this one, but with large amounts of text the details about the screen or the battery get drowned in unrelated information.

With Chunking (Splitting the Review into Relevant Sections): the review is split into smaller pieces, for example one chunk about the screen and display and another about the battery, and only the chunk that matches the question is retrieved.

Answer to Question 1 (Using Chunk 1): the screen clarity is outstanding, offering vibrant colors and sharp details.

Answer to Question 2 (Using Chunk 2): the battery life doesn’t last as long as advertised, often needing a recharge by midday.

As you can see, chunking lets you find the most relevant piece of context to inject into the prompt so the model can give the best possible answer. This is just a simple strategy based on embeddings; more advanced setups combine several chunking strategies at the same time for maximum relevance.
This is where you will spend the most time, in the nitty-gritty of chunking. It is literally the make-or-break part of your use case.
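The tutorial below uses the simplest possible split, but on real documents you would typically reach for one of Langchain’s text splitters. Here is a sketch using RecursiveCharacterTextSplitter, which cuts on paragraphs, then sentences, then characters while keeping an overlap between chunks (depending on your Langchain version, the import may live in the langchain_text_splitters package instead):

from langchain.text_splitter import RecursiveCharacterTextSplitter

review = (
    "Overall, I'm quite pleased with my purchase of the XYZ Smartphone. "
    "The screen clarity is outstanding, offering vibrant colors and sharp details. "
    "However, I've encountered issues with the battery life, which doesn't last as long as advertised."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,    # maximum characters per chunk
    chunk_overlap=20,  # shared characters between consecutive chunks to preserve context
)

chunks = splitter.split_text(review)
print(chunks)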

What is Langchain and why is it so useful?

Langchain is a framework designed to facilitate the creation and deployment of applications powered by large language models (LLMs). It provides a modular set of tools and integrations that make it easier to connect LLMs to the rest of your ecosystem. Basically, it allows you to create chains (or pipelines), which are sequences of tasks and connections to a wide variety of systems.
Here are some possible use cases with Langchain: question answering over your own documents (exactly what we build in this post), connecting an LLM to external data sources or APIs, and orchestrating multi-step pipelines that combine several models and tools.

The use cases I wrote about are only the tip of the iceberg, and they are possible only because of Langchain’s wide range of integrations and its chain-like system.
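To give you a feel for what a chain looks like before we build the full app, here is a minimal sketch (using the same packages we install later and an OPENAI_KEY environment variable) that pipes a prompt into a model and an output parser:

import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
model = ChatOpenAI(temperature=0, model_name="gpt-4", openai_api_key=os.environ["OPENAI_KEY"])

# The | operator pipes the output of one step into the next one
chain = prompt | model | StrOutputParser()

print(chain.invoke({"text": "Langchain is a framework for building LLM applications."}))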

What is Streamlit?

Streamlit is a popular open-source application framework specifically designed for machine learning and data science (and therefore LLMs). It allows developers to create beautiful, interactive web applications quickly without writing any frontend code, only Python.
You can literally have an app running in two minutes, and there are many ready-made components, in particular a chat component.
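For example, the chat component mentioned above takes only a few lines. Here is a sketch that simply echoes back whatever the user types:

import streamlit as st

# Renders a chat input box at the bottom of the page
user_message = st.chat_input("Ask me anything")

if user_message:
    # Renders chat bubbles for the user and the assistant
    with st.chat_message("user"):
        st.write(user_message)
    with st.chat_message("assistant"):
        st.write(f"You said: {user_message}")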

Now let’s implement our own RAG application.

Pre-requisites

Initialize the work environment

Setup the OpenAI API key

To use the OpenAI GPT models, we need an API key: a secret token that identifies you to the service (in this case OpenAI) and authorizes your requests.

So if you don’t have one, create an API key for this use case and save it somewhere (don’t worry, we will use it in the next part and store it securely).
Pro tip: don’t hesitate to create one API key per use case, as it makes it easier to monitor what is happening.

Prepare the python environment

We are going to use Python for the coding. It is a safe choice for any LLM-related use case, and all the major LLM frameworks support it.

We will use pipenv to create a virtualenv for this project. Creating a virtualenv is very important, as it ensures this project runs in isolation without breaking your other ones.

Also, please remember to always version your code, for example with GitHub.

Install pipenv

Let’s begin by installing pipenv, which will handle virtual environments and Python packages for us (there are alternatives like Poetry, but pipenv is simpler to start with):

pip install pipenv

Create a folder and add dependencies

Let’s create a folder that will contain the code:

mkdir RAG-pipeline-langchain-openai
cd RAG-pipeline-langchain-openai

And add all the required dependencies: langchain for the chains, streamlit for the web app, langchain-openai for the GPT-4 and embeddings integrations, python-dotenv to load the API key from a .env file, and faiss-cpu for the in-memory vector store:

pipenv install langchain streamlit langchain-openai python-dotenv faiss-cpu

Create the smallest Streamlit app

Let’s begin writing some code. We will work iteratively, feature by feature, with some testing in between, starting with the simplest possible Streamlit application.
Pro tip: iterating is the way to go in any software project. If we compare it to constructing a building, it means assembling small parts from the bottom up and checking that what you just built is robust enough before moving on. Simple, right?

Create a file called app.py and put this inside:

# Import the Streamlit library
import streamlit as st

# Write a simple message to the app's webpage
st.write('Hello, Metadocs readers!')

Now let’s run the code to make sure it works fine. Open a terminal in the folder containing the app.py file and type this:

pipenv run streamlit run app.py

This will open a browser tab with your application, and it only took two lines of code!

Add the OpenAI API key

Now we need to add the OpenAI API key so that we can actually use the OpenAI GPT-4 model.
To do that, you need to store the key in a .env file and load it from the app.
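Create a .env file at the root of the project (never commit this file to git). OPENAI_KEY is the variable name the rest of the code expects; replace the value with your own key:

OPENAI_KEY=your-openai-api-key-here

Then load it at the top of app.py: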

from dotenv import load_dotenv
import streamlit as st # This is from before

# Load the variables from .env, particularly OPENAI_KEY
load_dotenv()

Add GPT-4 inference

Now let’s add some GPT-4 inference to the app. We will begin by asking a question and displaying the output in the app.

import os

from dotenv import load_dotenv
import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Load the variables from .env
load_dotenv()

st.write("Hello, Metadocs readers!")

template = """Answer the following question
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0, model_name="gpt-4", openai_api_key=os.environ["OPENAI_KEY"])

result = model.invoke("What is langchain framework in the context of Large Language Models ?")

st.write("What is the langchain framework in the context of Large Language Models ?")
st.write(result.content)

Okay, this is a pretty big step, so I will explain everything block by block.

import os

from dotenv import load_dotenv
import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

This is the import section: os lets us read environment variables, load_dotenv loads the variables defined in the .env file, streamlit powers the web app, ChatOpenAI is the Langchain wrapper around the OpenAI chat models (GPT-4 here), and ChatPromptTemplate lets us build prompts from templates.

# Load the variables from .env
load_dotenv()

We load the variables defined in .env as environment variables. Environment variables are a good way to pass values to the code without hardcoding them.

template = """Answer the following question
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

Here’s the interesting part. We create the template that will be used to generate the prompt. In this case, I only ask the model to answer my question.
Pro tip: do you see how descriptive I am in the prompt about what I want the LLM to do? This is how to do good prompting: imagine you are talking to someone who knows nothing about the subject.
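If you are curious about what this template produces, you can render it yourself with the prompt defined above (just a quick check, not needed for the app):

messages = prompt.format_messages(question="What is the Langchain framework?")
print(messages[0].content)
# Answer the following question
# Question: What is the Langchain framework?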

model = ChatOpenAI(temperature=0, model_name="gpt-4", openai_api_key=os.environ["OPENAI_KEY"])

result = model.invoke("What is langchain framework in the context of Large Language Models ?")

st.write("What is the langchain framework in the context of Large Language Models ?")
st.write(result.content)

And finally we ask our question: we instantiate the GPT-4 chat model with a temperature of 0 (deterministic answers), call invoke with the question, and display both the question and the model’s answer with st.write.

Here’s what you will have:

You can pat yourself on the back: you already have an application connected to GPT-4. Now we will add the RAG part.

Add a streamlit upload component

Now we will add a component to upload a text file with Streamlit. We can do it with just one line. Fantastic, right?

# Add a file picker that accepts .txt files
uploaded_file = st.file_uploader("Choose a text file", type="txt")

if uploaded_file is not None:
    # Read the uploaded bytes and decode them into a Python string
    string_data = uploaded_file.getvalue().decode("utf-8")

Chunk the text with the simplest strategy

Now that we can upload a text file, we need to split it so that we can use it. This part is actually the most complex, because the right chunking strategy depends entirely on your data and your use case, and a poor one will sink the quality of the answers no matter how good the model is.

Now that I have warned you enough, let’s implement the chunking strategy:

splitted_data = string_data.split("\n\n")

Yes this is super simple. And yes, it will never work on actual data. This is just for the tutorial.

Use the simplest vector store

We will use the FAISS vector store, which was created by Meta and stores embeddings in memory. If we were not using it, we would need to deploy or pay for a dedicated vector store.
Pro tip: use FAISS this way only for tutorials and demos. For a real application, an in-memory index is not enough by itself and you should use a proper vector store.

Let’s add a FAISS vector store then:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embedding = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_KEY"])

# ... rest of the code

vectorstore = FAISS.from_texts(
    splitted_data,
    embedding=embedding)
retriever = vectorstore.as_retriever()

Here’s the breakdown: OpenAIEmbeddings is the model that turns text into vectors, FAISS.from_texts embeds each chunk of splitted_data and stores the resulting vectors in an in-memory index, and as_retriever wraps that index as a retriever the chain can query to fetch the chunks most relevant to a question.
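If you want to check that retrieval works before wiring the full chain, you can query the retriever directly (a quick sanity check; on recent Langchain versions retrievers expose invoke, older ones use get_relevant_documents):

# Returns the chunks most similar to the query as Document objects
docs = retriever.invoke("your question about the uploaded document")
for doc in docs:
    print(doc.page_content)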

Now let’s modify the prompt template.

Modify the prompt template for RAG

In a RAG use case, the prompt will not only contain the question you want to ask but also the context that semantically matches your question.
Here’s the corresponding prompt:

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

Create a langchain chain to connect all components

Now let’s connect all the components with a Langchain chain. A chain is a sequence of steps in which the output of each step is piped into the next one:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ... rest of the code

chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
)

This is another mouthful, so let’s explain it too: the dictionary builds the inputs of the prompt, with "context" filled by the retriever (the chunks most relevant to the question) and "question" passed through unchanged by RunnablePassthrough; the result is piped into the prompt template, then into the GPT-4 model, and finally StrOutputParser extracts the answer as a plain string.

Add components to ask a question

Now that we have a working chain, let’s wire everything up so we can ask a question directly in the app. Again, we will rely on the sheer simplicity of Streamlit:

# Text box where the user types the question
question = st.text_input("Input your question for the uploaded document")

if question:
    # Run the chain only once the user has typed a question
    result = chain.invoke(question)
    st.write(result)

Let’s explain this: st.text_input displays a text box for the question, and once the user has typed something we call chain.invoke with it; the chain retrieves the relevant chunks, builds the prompt, queries GPT-4, and st.write displays the answer.

Look at all this code

OK, now all the code has been written. Here’s what it should look like:

import os

from dotenv import load_dotenv
import streamlit as st
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load the variables from .env
load_dotenv()

st.write("Hello, Metadocs readers!")

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model_name="gpt-4", openai_api_key=os.environ["OPENAI_KEY"])
embedding = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_KEY"])

uploaded_file = st.file_uploader("Choose a text file", type="txt")

if uploaded_file is not None:
    string_data = uploaded_file.getvalue().decode("utf-8")

    splitted_data = string_data.split("\n\n")

    vectorstore = FAISS.from_texts(
        splitted_data,
        embedding=embedding)
    retriever = vectorstore.as_retriever()

    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )

    question = st.text_input("Input your question for the uploaded document")

    if question:
        result = chain.invoke(question)
        st.write(result)

With just 50 lines of code, you have a working RAG application. Let’s test it, shall we?

Test our beautiful DIY RAG application

Here’s some advice for testing the application:

Let’s begin then. Launch the application:

pipenv run streamlit run app.py

Now load the given text file and ask the question. TADAAA:

Conclusion

Congratulations, you created your own RAG application in record time. You are really good, you know!
But I need to warn you about a few things before you continue playing with the app: the chunking strategy used here (splitting on blank lines) will not hold up on real data, and the in-memory FAISS index is only suitable for tutorials and demos. For a real application, you will need a proper chunking strategy and a production-grade vector store.

Afterward

I hope this tutorial helped you and taught you many things. I will update this post with more nuggets from time to time. Don’t forget to check my other posts, as I write a lot about practical AI topics.

Cheers!
