RAG without “AG”?

Learn how to leverage semantic search and ELSER to build a powerful and visually appealing Q&A experience, without using LLMs.

You’ve probably heard about RAG (Retrieval Augmented Generation), the go-to strategy to combine your own documents with an LLM to produce human-like answers to user questions. In this article, we’ll explore how far you can get by taking the LLM out of the equation and using just semantic search and ELSER.

If you’re new to RAG, you can read this article about RAG on Azure and AI, this one about RAG with Mistral, or this article about building a RAG application using Amazon Bedrock.

Cons of LLMs

While LLMs offer human-language answers instead of just returning documents, there are several considerations when implementing them, including:

  • Cost: Using an LLM service has a cost per token or needs specialized hardware if you want to run the model locally.
  • Latency: Adding an LLM step increases the response time.
  • Privacy: In cloud models, you’ll be sending your information to a third party, with everything that entails.
  • Management: Adding an LLM means that now you need to deal with a new stack component, different model providers and versions, prompt engineering, hallucinations, etc.

ELSER is all you need

If you think about it, a RAG system is only as good as the search engine behind it. While it’s nice to ask a question and read a proper answer instead of getting a list of results, the real value lies in:

  1. Being able to query with a question instead of keywords: you don’t need the exact same words that appear in the documents, because the system “gets” the meaning.
  2. Not having to read the entire text to find the information you want: the LLM locates the answer within the context you send it and makes it visible.

With this in mind, we can get very far by using ELSER to retrieve relevant information via semantic search, and by structuring our documents so that the user types a question and the interface leads them directly to the answer, without reading the entire document.

To get these benefits, we’ll use the semantic_text field, text chunking, and semantic highlighting. To learn about the latest semantic_text features, I recommend reading this article.

We’ll create an app using Streamlit to put everything together. It should look something like this:

The goal is to ask a question, get the sentence that answers it from the source documents, and then have a button to see the sentence in context. To improve the user experience, we’ll also add some metadata, like the article’s thumbnail, title, and link to the source. This way, depending on the article, the card with the answer will be different.

Requirements:

  • Elastic Serverless instance. Start your trial here
  • Python

Regarding the document structure, we’ll index Wikipedia pages so that each article’s section becomes an Elasticsearch document, and then, each document will be chunked into sentences. This way, we’ll get precision for the sentences that answer our question and a reference to see the sentence in context.

These are the steps to create and test the app:

  1. Configure Inference endpoint
  2. Configure mappings
  3. Upload documents
  4. Create App
  5. Final test

I’ve only included the main blocks in this article, but you can access the full repository here.

Configure Inference endpoint

Import dependencies
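
Here’s a minimal setup, assuming you connect to your Serverless instance with an endpoint URL and an API key (the environment variable names are just examples):

```python
import os

from elasticsearch import Elasticsearch

# Connect to the Elastic Serverless instance. The environment variable
# names are illustrative; use whatever configuration you prefer.
es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_ENDPOINT"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)
```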

To begin, we’ll configure the inference endpoint where we will define ELSER as our model and establish the chunking settings:

  • strategy: It can be “sentence” or “word.” We’ll choose “sentence” so that every chunk contains complete sentences; that way the highlighting works on phrases rather than single words, and the answers read more fluidly.
  • max_chunk_size: Defines the maximum number of words in a chunk.
  • sentence_overlap: Number of overlapping sentences between chunks. It can be 0 or 1. We’ll set it to 0 for better highlighting precision; 1 is recommended when you want to capture adjoining content.
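
Putting those settings together, the endpoint creation looks roughly like this (the endpoint ID elser-with-chunking and the allocation settings are illustrative; adjust them to your deployment):

```python
# Create an ELSER sparse_embedding inference endpoint with
# sentence-based chunking and no sentence overlap.
es_client.inference.put(
    task_type="sparse_embedding",
    inference_id="elser-with-chunking",
    body={
        "service": "elasticsearch",
        "service_settings": {
            "model_id": ".elser_model_2",
            "adaptive_allocations": {"enabled": True},
            "num_threads": 1,
        },
        "chunking_settings": {
            "strategy": "sentence",
            "max_chunk_size": 250,
            "sentence_overlap": 0,
        },
    },
)
```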

You can read this article to learn more about chunking strategies.

Configure mappings

We’ll configure the following fields to define our mappings:
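
A sketch of the index creation, assuming an index called wikipedia and the metadata fields (title, url, thumbnail) mentioned earlier:

```python
es_client.indices.create(
    index="wikipedia",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "url": {"type": "keyword"},
            "thumbnail": {"type": "keyword"},
            # copy_to sends the raw text into the semantic_text field,
            # where the inference endpoint chunks and embeds it at ingest time.
            "content": {"type": "text", "copy_to": "semantic_content"},
            "semantic_content": {
                "type": "semantic_text",
                "inference_id": "elser-with-chunking",
            },
        }
    },
)
```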

It’s crucial to copy the content field into the semantic_content field (via copy_to) so that semantic searches run against it.

Upload documents

We’ll use the following script to upload the documents from this Wikipedia page about Lionel Messi.
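
Here’s a sketch of that script, reusing the es_client from the setup above. It uses the wikipedia Python package to fetch the page and splits the plain-text content on section headings; the library choice and the thumbnail selection are assumptions, not necessarily what the full repository does:

```python
import re

import wikipedia  # pip install wikipedia
from elasticsearch import helpers

INDEX = "wikipedia"

page = wikipedia.page("Lionel Messi", auto_suggest=False)

# The plain-text content marks headings as "== Heading ==".
# Split it so each section becomes its own Elasticsearch document.
parts = re.split(r"\n=+\s*(.*?)\s*=+\n", page.content)
sections = [("Introduction", parts[0])] + list(zip(parts[1::2], parts[2::2]))

actions = [
    {
        "_index": INDEX,
        "_source": {
            "title": f"{page.title} - {heading}",
            "url": page.url,
            # Picking the first page image as a thumbnail is an assumption.
            "thumbnail": page.images[0] if page.images else None,
            "content": text,
        },
    }
    for heading, text in sections
    if text.strip()
]

helpers.bulk(es_client, actions)
```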

Create App

We’ll create an app that takes a question, searches Elasticsearch for the most relevant sentences, and then builds an answer using highlighting to showcase the most relevant sentence individually while also showing the section it came from. This way, the user can read the answer right away and then go deeper, much like with an LLM-generated answer that includes citations.

Install dependencies
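
The app needs Streamlit, the Elasticsearch client, and the annotated_text component (published on PyPI as st-annotated-text):

```bash
pip install streamlit elasticsearch st-annotated-text
```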

Let’s begin by creating the function that runs the question’s semantic query:
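
A sketch of that function, reusing the es_client and field names from the previous steps; it runs a semantic query against semantic_content and asks for highlighted fragments ordered by score:

```python
def ask(question, index="wikipedia"):
    """Return the best-matching chunk (answer) plus its full section (context)."""
    response = es_client.search(
        index=index,
        query={
            "semantic": {
                "field": "semantic_content",
                "query": question,
            }
        },
        highlight={
            "fields": {
                "semantic_content": {
                    # The semantic highlighter returns the chunks that best
                    # match the question, ordered by score.
                    "order": "score",
                    "number_of_fragments": 1,
                }
            }
        },
        size=1,
    )

    hits = response["hits"]["hits"]
    if not hits:
        return None

    best = hits[0]
    return {
        "answer": best["highlight"]["semantic_content"][0],
        "context": best["_source"]["content"],
        "title": best["_source"]["title"],
        "url": best["_source"]["url"],
        "thumbnail": best["_source"].get("thumbnail"),
    }
```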

In the ask function, we return the first document (the complete section) as the full context, and the first chunk from the highlight section as the answer; both are ordered by _score, that is, by relevance to the question.

Now, we put everything together in a Streamlit app. To highlight the answer, we’ll use annotated_text, which is a component that makes it easier to color and identify the highlighted text.
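
A minimal version of the app, assuming the ask function above lives in the same file; annotated_text marks the answer so it stands out from the surrounding section:

```python
import streamlit as st
from annotated_text import annotated_text

st.title("RAG without AG")

question = st.text_input("Ask a question about Lionel Messi")

if question:
    result = ask(question)

    if result is None:
        st.write("No answer found.")
    else:
        # Card with the article metadata.
        if result["thumbnail"]:
            st.image(result["thumbnail"], width=200)
        st.subheader(result["title"])
        st.markdown(f"[Go to source]({result['url']})")

        # The highlighted chunk is the answer.
        annotated_text((result["answer"], "answer"))

        # Show the whole section with the answer highlighted in place.
        if st.button("See in context"):
            before, found, after = result["context"].partition(result["answer"])
            if found:
                annotated_text(before, (result["answer"], "answer"), after)
            else:
                st.write(result["context"])
```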

Final test

To test, we only need to run the code and ask our question:
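
Assuming the app is saved as app.py:

```bash
streamlit run app.py
```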

Not bad! The quality of the answers will be as good as our data and the correlation between the question and the available sentences.

Conclusion

With a good document structure and a good semantic search model like ELSER, it’s possible to build a Q&A experience with no need for an LLM. Though it has limitations, it’s an option worth trying to better understand your data, rather than just sending everything to an LLM and hoping for the best.

In this article, we showed that using semantic search, semantic highlighting, and a bit of Python code, you can get close to the results from a RAG system without its cons like cost, latency, privacy, and management, among others.

Elements like document structure and the user experience in the UI help compensate for the loss of the “human” feel of LLM-synthesized answers, putting the focus on the vector database’s ability to find the exact sentence that answers the question.

A possible next step would be to complement answers with other data sources to create a richer experience where Wikipedia is only one of the sources used to answer the question, like you get in Perplexity:

This way, we can create an app that uses different platforms to provide a 360° view of the person or entity you searched for.

Are you up for trying?
