The process described above works well when users can only ask a single question. But this application allows follow-up questions as well, and this introduces a few additional complications. For example, there is a need to store all previous questions and answers, so that they can be included as additional context when sending the new question to the LLM.
The chat history in this application is managed through the `ElasticsearchChatMessageHistory` class, another class provided by the Elasticsearch integration with LangChain. Each group of related questions and answers is written to an Elasticsearch index with a reference to the session ID that was used.
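As a rough sketch of how this class can be used (the Elasticsearch URL and index name below are assumptions, not the application's actual configuration), the history for a session might be set up like this:

```python
from langchain_elasticsearch import ElasticsearchChatMessageHistory

def get_chat_history(session_id: str) -> ElasticsearchChatMessageHistory:
    """Return the chat history store for a given session ID."""
    return ElasticsearchChatMessageHistory(
        es_url="http://localhost:9200",  # assumed local Elasticsearch instance
        index="chat-history",            # assumed index name
        session_id=session_id,           # groups all messages of one conversation
    )

# example usage
history = get_chat_history("session-1234")
history.add_user_message("What is the vacation policy?")
history.add_ai_message("Full-time employees receive 20 days of paid vacation per year.")
print(history.messages)  # list of HumanMessage / AIMessage objects for this session
```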
You may have noticed in the previous section that even though the response from the LLM is streamed out to the client in chunks, an `answer` variable is built up with the full response. This is so that the response, along with its question, can be added to the history after each interaction.
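A sketch of that pattern is shown below, assuming a LangChain chat model and the `get_chat_history()` helper from the previous example; names and structure are illustrative rather than the application's exact code:

```python
def ask_question(llm, prompt: str, question: str, session_id: str):
    """Stream the LLM response to the caller while saving the full exchange."""
    history = get_chat_history(session_id)
    answer = ""
    for chunk in llm.stream(prompt):  # stream chunks out to the client...
        answer += chunk.content       # ...while accumulating the complete answer
        yield chunk.content
    # once streaming finishes, record both sides of the exchange in the history
    history.add_user_message(question)
    history.add_ai_message(answer)
```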
If the client sends a `session_id` argument in the query string of the request URL, then the question is assumed to be made in the context of any previous questions under that same session.
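A minimal sketch of how that argument might be read, assuming a Flask endpoint (the route path and request payload are assumptions):

```python
from uuid import uuid4
from flask import Flask, request

app = Flask(__name__)

@app.route("/api/chat", methods=["POST"])  # hypothetical route
def chat():
    question = request.json["question"]
    # reuse the session from the query string, or start a new one
    session_id = request.args.get("session_id") or str(uuid4())
    # ... run the retrieval and generation logic for this session ...
    return {"session_id": session_id}
```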
The approach taken by this application for follow-up questions is to use the LLM to create a condensed question that summarizes the entire conversation, to be used for the retrieval phase. The purpose of this is to avoid running a vector search on a potentially large history of questions and answers. The logic that performs this task is sketched below.
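The sketch assumes the `get_chat_history()` helper from earlier and Flask's `render_template()` for loading the prompt file; the application's actual implementation may differ:

```python
from flask import render_template

def condense_question(llm, question: str, session_id: str) -> str:
    """Rewrite a follow-up question as a standalone question using the chat history."""
    history = get_chat_history(session_id)
    if not history.messages:
        return question  # first question of the session, nothing to condense
    condense_prompt = render_template(
        "condense_question_prompt.txt",
        chat_history=history.messages,
        question=question,
    )
    return llm.invoke(condense_prompt).content
```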
This has a lot of similarities with how the main questions are handled, but in this case there is no need to use the streaming interface of the LLM, so the `invoke()` method is used instead.
To condense the question, a different prompt is used, stored in the file `api/templates/condense_question_prompt.txt`.
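The file's exact contents are not reproduced here; a hypothetical template of this kind, with Jinja syntax and variable names assumed, could look like the following:

```
Given the conversation below and a follow-up question, rephrase the follow-up
question as a single standalone question that includes all relevant context.

Chat history:
{% for message in chat_history %}
{{ message.type }}: {{ message.content }}
{% endfor %}

Follow-up question: {{ question }}

Standalone question:
```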
This prompt renders all the questions and responses from the session, plus the new follow-up question at the end. The LLM is instructed to provide a simplified question that summarizes all the information.
To give the LLM as much context as possible in the generation phase, the complete history of the conversation is added to the main prompt, along with the retrieved documents and the follow-up question. The final version of the prompt used in the example application combines all three elements.
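The exact prompt text is not reproduced here; a hypothetical template along those lines (again with Jinja syntax and variable names assumed) might look like this:

```
Use the following passages and the conversation history to answer the user's
question at the end. If the answer is not contained in the passages, say that
you don't know rather than making one up.

Passages:
{% for doc in docs %}
{{ doc.page_content }}
{% endfor %}

Chat history:
{% for message in chat_history %}
{{ message.type }}: {{ message.content }}
{% endfor %}

Question: {{ question }}

Answer:
```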
Note that the way the condensed question is used can be adapted to your needs. You may find that for some applications sending the condensed question in the generation phase as well works better, while also reducing the token count. Or perhaps not using a condensed question at all and always sending the entire chat history gives better results. Hopefully you now have a good understanding of how this application works and can experiment with different prompts to find what works best for your use case.