Building RAG with Llama 3 open-source and Elastic

Learn how to build a RAG system with Llama3 open source and Elastic. This blog provides practical examples of RAG using Llama3 as an LLM.

Elasticsearch has native integrations to industry-leading Gen AI tools and providers. Check out our webinars on going beyond RAG basics, or on building production-ready apps with the Elastic Vector Database.

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.

This blog will walk through implementing RAG using two approaches.

  1. Elastic, LlamaIndex, and the Llama 3 (8B) version running locally using Ollama.
  2. Elastic, LangChain, ELSER v2, and the Llama 3 (8B) version running locally using Ollama.

The notebooks are available at this GitHub location.

Before we get started, let's take a quick dive into Llama 3.

Llama 3 overview

Llama 3 is an open source large language model recently launched by Meta. It is the successor to Llama 2 and, based on published metrics, a significant improvement. It shows good evaluation metrics when compared to recently published models such as Gemma 7B Instruct and Mistral 7B Instruct. The model comes in two variants, with 8 billion and 70 billion parameters. It is worth noting that, at the time of writing, Meta was still in the process of training a 400B+ variant of Llama 3.

Meta Llama 3 Instruct Model Performance. (from https://ai.meta.com/blog/meta-llama-3/)

The figure above compares Llama 3's performance across different benchmark datasets against other models. To ensure it is optimized for real-world scenarios, Llama 3 was also evaluated on a high-quality human evaluation set.

Aggregated results of Human Evaluations across multiple categories and prompts (from https://ai.meta.com/blog/meta-llama-3/)

How to build RAG with Llama 3 open-source and Elastic

Dataset

For the dataset, we will use a fictional organization policy document in JSON format, available at this location.

Configure Ollama and Llama3

As we are using the Llama 3 8B parameter model, we will run it using Ollama. Follow the steps below to install Ollama.

  1. Browse to the URL https://ollama.com/download to download the Ollama installer based on your platform.

Note: The Windows version is in preview at the moment.

  2. Follow the instructions to install and run Ollama for your OS.
  3. Once installed, follow the commands below to download the Llama 3 model.
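The download command is not shown in this post's extracted text; with Ollama installed, pulling the 8B model looks like the following (at the time of writing, the default `llama3` tag maps to the 8B instruct variant):

```shell
# Download the Llama 3 8B model weights from the Ollama registry
ollama pull llama3
```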

This may take some time depending on your network bandwidth. Once the download completes, you should see the interface below.

To test Llama 3, run the following command from a new terminal, or enter the text at the prompt itself.
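For example (the exact prompt is up to you; the REST call below uses Ollama's local API on its default port):

```shell
# One-off prompt via the CLI
ollama run llama3 "Why is the sky blue?"

# Or via Ollama's local REST API
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'
```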

At the prompt, the output looks like the following.

We now have Llama3 running locally using Ollama.

Elasticsearch setup

We will use an Elastic Cloud setup for this. Please follow the instructions here. Once the deployment is ready, note the API key and the Cloud ID; we will need them as part of our setup.

Application setup

There are two notebooks: one for RAG implemented with LlamaIndex and Llama 3, and the other with LangChain, ELSER v2, and Llama 3. In the first notebook, Llama 3 serves as the local LLM and also provides the embeddings. In the second notebook, ELSER v2 provides the embeddings and Llama 3 serves as the local LLM.

Method 1: Elastic, LlamaIndex, Llama 3 (8B) version running locally using Ollama.

Step 1: Install required dependencies

This section installs the required LlamaIndex packages.
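The package list below is an assumption based on LlamaIndex's split-package layout and the integrations used later in this method (Ollama and Elasticsearch); check the notebook for the exact set:

```shell
pip install llama-index \
            llama-index-llms-ollama \
            llama-index-embeddings-ollama \
            llama-index-vector-stores-elasticsearch
```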

Step 2: Import required dependencies

We start by importing the required packages and classes for the app.

Next, we prompt the user for the Cloud ID and API key values.

If you are not familiar with obtaining the Cloud ID and API key, please follow the links in the code snippet above to guide you through the process.
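A minimal sketch of that credential prompt (the helper and environment-variable names here are illustrative, not from the notebook):

```python
import os
from getpass import getpass

def read_credential(env_var: str, prompt: str) -> str:
    """Return a credential from the environment, prompting interactively only if absent."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(prompt)

# cloud_id = read_credential("ELASTIC_CLOUD_ID", "Elastic Cloud ID: ")
# api_key = read_credential("ELASTIC_API_KEY", "Elastic API Key: ")
```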

Step 3: Document processing

We begin by downloading the JSON document and building Document objects from the payload.
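The shape of that step can be sketched with a tiny inline sample standing in for the downloaded file (the field names are illustrative; the real dataset's may differ):

```python
import json

# Inline sample standing in for the downloaded JSON policy document
raw = """[
  {"name": "Work From Home Policy",
   "content": "Employees may work remotely up to two days per week.",
   "category": "policy"}
]"""

records = json.loads(raw)

# Build text/metadata payloads; with LlamaIndex installed, each entry would
# become a Document(text=..., metadata=...) object
documents = [
    {"text": r["content"], "metadata": {"name": r["name"], "category": r["category"]}}
    for r in records
]
```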

We now define the Elasticsearch vector store (ElasticsearchStore), the embedding model backed by Llama 3, and a pipeline to process the payload constructed above and ingest it into Elasticsearch.

The ingestion pipeline allows us to compose pipelines using different components, one of which allows us to generate embeddings using Llama3.

ElasticsearchStore is defined with the name of the index to be created, the vector field, and the content field. This index is created when we run the pipeline.

The index mapping created is as below:

The pipeline is executed using the step below. Once the pipeline run completes, the index workplace_index is available for querying. Note that the vector field content_vector is created as a dense vector with 4096 dimensions; the dimension size comes from the size of the embeddings generated by Llama 3.

Step 4: LLM configuration

We now set up LlamaIndex to use Llama 3 as the LLM. As covered earlier, this is done with the help of Ollama.

Step 5: Semantic search

We now configure Elasticsearch as the vector store for the Llamaindex query engine. The query engine is then used to answer your questions with contextually relevant data from Elasticsearch.

The response I received with Llama 3 as the LLM and Elasticsearch as the vector database is shown below.

This concludes the RAG setup using Llama 3 both as a local LLM and to generate embeddings.

Let's now move to the second method, which still uses Llama 3 as the local LLM but relies on Elastic's ELSER v2 to generate embeddings and power semantic search.

Method 2: Elastic, LangChain, ELSER v2, Llama 3 (8B) version running locally using Ollama.

Step 1: Install required dependencies

This section installs the required LangChain packages.
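A likely package set, assuming the LangChain integrations used in this method (the Elasticsearch vector store, the community Ollama LLM, and `jq` for the JSON loader); check the notebook for the exact list:

```shell
pip install langchain langchain-community langchain-elasticsearch jq
```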

Step 2: Import required dependencies

We start with importing the required packages and classes for the app. This step is similar to Step 2 in Method 1 above.

Next, provide a prompt to the user to capture the Cloud ID and API Key values.

Step 3: Document processing

Next, we download the JSON document and build the payload.

This step differs from Method 1, where we used the LlamaIndex-provided pipeline to process the document. Here, we use the RecursiveCharacterTextSplitter to generate the chunks.

We now define the Elasticsearch vector store ElasticsearchStore.

The vector store is defined with the index to be created and the model to be used for embedding and retrieval. You can retrieve the model_id by navigating to Trained Models under Machine Learning.

This also results in the creation of an ingest pipeline in Elastic, which generates and stores the embeddings as the documents are ingested into Elastic.

We now add the documents processed above.

Step 4: LLM configuration

We set up the LLM with the following. This again differs from Method 1, where we also used Llama 3 for embeddings.

Step 5: Semantic search

The necessary building blocks are now all in place. We tie them together to perform semantic search using ELSER v2, with Llama 3 as the LLM. Essentially, ELSER v2 retrieves contextually relevant passages for the user's question using Elasticsearch's semantic search capabilities. The question and the retrieved context are then structured using a prompt template and passed to Llama 3, which generates the final response.

The response with Llama3 as the LLM and ELSER v2 for semantic search is as below:

This concludes the RAG setup using Llama 3 as the local LLM and ELSER v2 for semantic search.

Conclusion

In this blog, we looked at two approaches to RAG with Llama 3 and Elastic. First, we used Llama 3 both as the LLM and to generate embeddings. Next, we used Llama 3 as the local LLM with ELSER for embeddings and semantic search. We used two different frameworks, LlamaIndex and LangChain; you could implement either method with either framework. The notebooks were tested with the Llama 3 8B parameter version, and both are available at this GitHub location.

Related content

Ready to build state of the art search experiences?

Sufficiently advanced search isn't achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops engineers, and many more who are just as passionate about search as you are. Let's connect and work together to build the magical search experience that will get you the results you want.

Try it yourself