Jina-embeddings-v2 is the first open-source 8K context-length embeddings model.

Late chunking in Elasticsearch with Jina Embeddings v2

Q: Late Chunking is a method that consists of chunking the text after embeddings (instead of chunking the text first) and then creating the embeddings for each isolated chunk.

Late Chunking is a method that consists of chunking the text after embeddings (instead of chunking the text first) and then creating the embeddings for each isolated chunk.

Elasticsearch has native integrations to industry leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building prod-ready apps Elastic Vector Database.

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.

In this article, we will configure and use jina-embeddings-v2, the first open-source 8K context-length embeddings model, starting with an OOTB implementation using semantic_text and then implementing Late Chunking.

Long-context models

We generally see embedding models with a context length of 512 tokens, which means that if we try to create a longer embedding, only the first 512 tokens will be added to the vector field. The problem with these short contexts is that the chunks will not be aware of the entire context but only the text within the chunks:

As you can see in the image, in Chunk 1 we know we are talking about Sarah Johnson, but in Chunk 2 we lost the direct reference. So, as the document gets longer, it might miss the dependencies to when Sarah Johnson was first mentioned and don’t connect that “Sarah Johnson”, “She”, and “Her” refer to the same person. This of course can get even more complex if there’s more than one person addressed as she/her, but for now let’s see the first way to address this issue.

Traditional long-context models that aim at generating text only care about dependencies on previous words so that the last tokens in the input matter more than earlier ones because a text generator’s task is to produce the next word after the input. However, Jina Embeddings 2 model is trained through three key stages: initially, it undergoes masked word pre-training using the 170 billion-word English C4 dataset. Next, it employs pairwise contrastive training with text pairs known to be similar or dissimilar, using a new corpus from Jina AI to refine embeddings so similar texts are closer and dissimilar ones are further apart. Finally, it is fine-tuned with text triplets and negative mining, incorporating a dataset with sentences of opposite grammatical polarity to improve handling of cases where embeddings might otherwise be too close for sentences with opposite meanings.

So, let’s see how this works: a longer context length allows us to keep in the same chunk the references to the first time Sarah Johnson was mentioned:

However, this also has its drawbacks. The fact that the context is larger means you are putting more information within the same dimensional space. This compression may dilute the context, removing potentially important information from the embeddings. Another drawback is that generating longer embeddings takes more computing resources. Finally, in a RAG system the size of the text chunk defines how much information you are sending to the LLM, which will affect precision, cost, and latency. The good news is that you don't have to use the whole 8K tokens, you can find a sweet spot based on your use case.

Jina, in an effort to bring together the best of both worlds, proposes an approach called Late Chunking. Late Chunking consists of chunking the text after embeddings, instead of chunking the text first, and then creating the embeddings for each isolated chunk. For this, you need a model capable of creating context-aware embeddings, and then you can chunk the generated embeddings while keeping the context, i.e. dependencies and relations between chunks.

We are going to set up the jina-embeddings-v2 model in Elasticsearch and use it with semantic_text, and then create a custom setup for Late Chunking.

Steps

Creating endpoint

With our HuggingFace Open Inference Service integration, running HuggingFace models is very simple. You just have to open the model web page, click View Code under Inference API, and grab the API URL from there. In that same screen you can Manage your tokens to create the API Key.

For more details about creating security tokens you can visit this.For the purpose of this article setting it as a read token is ok.

Once you have the url and api_key, go ahead and create the inference endpoint:

If you get this error "Model jinaai/jina-embeddings-v2-base-en is currently loading", it means that the model is warming up. Wait a couple of seconds and try again.

Creating index

We are going to use semantic_text field type. It will take care of inferring the embedding mappings and configurations, and doing the passage chunking for you! If you want to read more about it, you can go to this great article.

This approach will give us a great starting point by handling vector configuration and documents chunking for us. It will create 250-word chunks with a 100-word overlap. For customizations like increasing the chunk size to leverage the 8K context size, we have to go through a longer process we will explore in the Late Chunking section.

Indexing data

When using semantic_text we are covered. We just index data as usual.

Asking questions

Now we can use the semantic search query to asks questions to our data:

The first result will look like this:

Late Chunking implementation

Now that we have configured the embeddings model, we can create our own Late Chunking implementation in Elasticsearch. The process will require the following steps:

1. Create mappings

2. Load data

You can find the full implementation in the supporting Notebook.

We are not using the ingest pipeline approach here because we want to create special embeddings, instead we are going to use a python script which key role is getting annotations for the positions of the chunk tokens, generating embeddings for the whole document, and then chunking the embeddings based on the length we provide:

With this code you can define the text chunks size by splitting by sentence and getting chunk positions.

This second function will receive the annotations and the embeddings of the whole input to generate embedding chunks.

This is the part that puts it all together; tokenize the entire text input, and then pass it to the late_chunking function to chunk the pooled embeddings.

After this process, we can index our documents:

You can find the notebook with the full example step by step here.

Feel free to experiment with different values in the input_text variable.

3. Running queries

You can now run semantic search against the new data index:

The result will look like this:

Conclusion

Although still experimental, late-chunking has potentially many benefits, especially in RAG , since it allows you to keep key context information when you chunk your texts. Additionally, Jina embedding model helps to store shorter vectors, thus taking less space in memory and storage, and speeding up search retrieval. So, both of these features together with Elasticsearch enhance both efficiency and effectiveness in managing and retrieving information when using vector search.

Report an issue

Late chunking in Elasticsearch with Jina Embeddings v2

Long-context models