Real-time Semi-Agentic RAG: A tutorial on how to implement RAG
September 27, 2025
What is RAG
RAG, or Retrieval Augmented Generation, is a way to include external content in a chatbot or agentic system so that it can answer with specific knowledge.
To answer the user’s message:
- A chatbot without RAG uses knowledge that the LLM learned during training.
- A chatbot with RAG first searches for information relevant to the user’s message and injects the results into the final prompt, so that the context contains the knowledge needed to answer.
flowchart TD
A[User Query] --> B{RAG or Traditional?}
B -->|Traditional| C[LLM with Training Data Only]
C --> D[Response]
B -->|RAG| E[Query Processing]
E --> F[Vector Database Search]
F --> G[Knowledge Retrieval]
G --> H[LLM + Retrieved Knowledge]
H --> I[Enhanced Response]
style C fill:#4A4A4A,stroke:#E74C3C,stroke-width:3px,rx:15,ry:15
style H fill:#2E7D32,stroke:#4CAF50,stroke-width:3px,rx:15,ry:15
style F fill:#1E3A8A,stroke:#3B82F6,stroke-width:3px,rx:15,ry:15
style A fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
style B fill:#7C2D12,stroke:#F97316,stroke-width:2px,rx:12,ry:12
style D fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
style E fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
style G fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
style I fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
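In code, the “inject the result into the final prompt” step is just string formatting. Here is a minimal sketch (the tags and wording are one possible choice, and the retrieved chunks are hard-coded for illustration):

```python
# Retrieved knowledge would normally come out of the search pipeline described below.
retrieved_chunks = [
    "All apples cost $4 per kg, but there is different pricing for big quantities.",
    "All fruit bought in 100kg quantity or more is discounted 30%.",
]
user_message = "How much do apples cost per kg?"

knowledge = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
final_prompt = (
    "Answer the user's message using the knowledge below.\n"
    f"<knowledge>\n{knowledge}\n</knowledge>\n"
    f"<message>\n{user_message}\n</message>"
)
print(final_prompt)
```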
RAG Implementation Overview
A RAG pipeline can be implemented in many ways, but most use a vector database (Qdrant, Pinecone, Weaviate, Milvus, …), a full-text search database (Elasticsearch, PostgreSQL, Meilisearch, …), or a combination of both.
I have been using Qdrant so far because it can combine vector search and full-text search in a single query, which is called hybrid search.
Embeddings and how text is searched
To implement text search, databases use something called an embedding of the text: a “summary” of the contents of a piece of text. Embeddings come in two types:
- Dense embeddings: They use a dense vector of numbers to encode the meaning of text
- Sparse embeddings: They use a sparse vector (usually a key-value map) of numbers to encode the words contained in text.
Dense embeddings (semantics)
Dense embeddings are a list of numbers that represent the semantics of some text. In concrete terms, you can see them as a fixed-length vector of numbers, generated by an AI model that, given the text, returns a vector encoding its semantics.
Imagine that each number in the vector represents some characteristic of the text, ranging from 0 to 1. For example, the first index could indicate how serious the text is, the second could indicate how likely the text is to be talking about happy things, and so on.
An analogy is the way colors can be represented as HSL (Hue, Saturation, Lightness). You can see the vector [h, s, l] as a representation of the color, with each index representing a characteristic of the color.
You might ask, “what do we do with these embeddings?” Well, you can compare them!
Let’s use the color analogy again. Say a store has a database of paints they sell, with each paint saved as an HSL color. I want to paint my house with my favorite color, so I give the store the HSL values for it.
The store now has the job of finding a paint that is as close as possible to the one that I want. To do this, they need to compare the HSL values of their paints with the HSL values of my color. The one that has the most similar characteristics will be the most similar color.
To do this, they can use something called a ranking function, like cosine similarity or dot product, which, given two vectors, returns how similar they are. At this point, we can compare my favorite color to every paint in the store and select the one with the highest similarity score.
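To make the comparison concrete, here is a minimal sketch in plain Python that scores the store’s paints against my favorite color with cosine similarity (the HSL values are normalized to 0-1 and made up; the scores in the diagram below are illustrative, not the exact output of this snippet):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return how similar two vectors are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# My favorite color and the store's paints as [h, s, l] vectors (normalized to 0-1).
favorite = [0.33, 0.60, 0.50]
paints = {
    "Paint A": [0.33, 0.67, 0.55],
    "Paint B": [0.33, 0.70, 0.41],
    "Paint C": [0.56, 0.59, 0.64],
    "Paint D": [0.17, 0.26, 0.50],
}

# Rank the paints by similarity to my favorite color, best match first.
ranked = sorted(paints.items(), key=lambda kv: cosine_similarity(favorite, kv[1]), reverse=True)
for name, vector in ranked:
    print(name, round(cosine_similarity(favorite, vector), 3))
```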
flowchart TD
A["Customer's Favorite Color H=120° S=60% L=50%"] --> B[Paint Store Database]
B --> C["Paint A<br />H=120° S=67% L=55%"]
B --> D["Paint B<br />H=120° S=70% L=41%"]
B --> E["Paint C<br />H=201° S=59% L=64%"]
B --> F["Paint D<br />H=60° S=26% L=50%"]
G[Similarity with Cosine Similarity] --> H["🏆 Paint A: Score 0.92"]
G --> I["Paint B: Score 0.89"]
G --> J["Paint C: Score 0.45"]
G --> K["Paint D: Score 0.21"]
C --> G
D --> G
E --> G
F --> G
style A fill:#40bf40,stroke:#2E7D32,stroke-width:3px,rx:15,ry:15,color:#ffffff
style B fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12,color:#ffffff
style C fill:#40d940,stroke:#2E7D32,stroke-width:2px,rx:12,ry:12,color:#000000
style D fill:#1fb01f,stroke:#1B5E20,stroke-width:2px,rx:12,ry:12,color:#ffffff
style E fill:#6bb3d9,stroke:#1565C0,stroke-width:2px,rx:12,ry:12,color:#000000
style F fill:#a1a160,stroke:#7a7a48,stroke-width:2px,rx:12,ry:12,color:#000000
style G fill:#7C2D12,stroke:#F97316,stroke-width:2px,rx:12,ry:12,color:#ffffff
style H fill:#40d940,stroke:#ffd700,stroke-width:4px,rx:12,ry:12,color:#000000
style I fill:#1fb01f,stroke:#1B5E20,stroke-width:2px,rx:12,ry:12,color:#ffffff
style J fill:#6bb3d9,stroke:#1565C0,stroke-width:2px,rx:12,ry:12,color:#000000
style K fill:#a1a160,stroke:#7a7a48,stroke-width:2px,rx:12,ry:12,color:#000000
When using a vector database, you provide the embedding vector and whatever information you want to store alongside it; the combination of the two is called a point. The vector database’s job is to search points using the ranking function you prefer, and optionally to filter points based on some criteria (for example, you might want to search only the colors the store has in stock). It then returns a list of points, together with the similarity score of each one.
If you want to use a vector database, I suggest looking at Qdrant, Milvus, Weaviate, pgvector.
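With Qdrant, for example, storing and searching points looks roughly like this. This is a minimal sketch using the official Python client; the collection name, payload fields, and values are made up, and the exact API may differ slightly between client versions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(":memory:")  # in-memory instance, handy for experiments

# A point = an embedding vector + whatever payload you want to store with it.
client.create_collection(
    collection_name="paints",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE),
)
client.upsert(
    collection_name="paints",
    points=[
        PointStruct(id=1, vector=[0.33, 0.67, 0.55], payload={"name": "Paint A", "in_stock": True}),
        PointStruct(id=2, vector=[0.56, 0.59, 0.64], payload={"name": "Paint C", "in_stock": False}),
    ],
)

# Search only the paints that are in stock, ranked by cosine similarity.
hits = client.query_points(
    collection_name="paints",
    query=[0.33, 0.60, 0.50],
    query_filter=Filter(must=[FieldCondition(key="in_stock", match=MatchValue(value=True))]),
    limit=3,
).points
for hit in hits:
    print(hit.payload["name"], hit.score)
```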
Dense embeddings excel at generality but fail at specificity, as they will match with anything similar to them.
Sparse embeddings (syntax)
Sparse embeddings are implemented differently depending on which algorithm you use; I’ll explain with BM25 in mind.
Sparse vectors are usually also called “bag of words” because, as the name implies, they use words and their frequency within a piece of text as a way to represent the contents. They are stored as key-value (dictionary) pairs, with the key being the “id” of the word and the value being the “weight” or, simply, how commonly the word appears in the document.
When looking at a long piece of text, words like “the”, “I”, and “is” (stopwords) appear very often, so we should ignore them or give them little weight. On the other hand, if a word appears rarely, we want to give it a high weight.
When searching through our database using some text, if this text includes the rare word, then we likely want to find other texts that also include that word.
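As a rough illustration, here is how a piece of text can be turned into a sparse “bag of words” vector. This is a simplified sketch, not real BM25, which additionally down-weights words that are common across the whole collection and normalizes by document length:

```python
from collections import Counter

STOPWORDS = {"the", "i", "is", "a", "an", "of", "to", "do"}

def sparse_embed(text: str, vocabulary: dict[str, int]) -> dict[int, float]:
    """Map text to {word_id: weight}, ignoring stopwords (weight = term frequency)."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    return {
        vocabulary.setdefault(word, len(vocabulary)): count / len(words)
        for word, count in counts.items()
    }

vocabulary: dict[str, int] = {}
document = sparse_embed("All apples cost $4 per kg", vocabulary)
query = sparse_embed("How much do apples cost", vocabulary)

# The two sparse vectors only interact on the word ids they share ("apples", "cost").
score = sum(document[i] * query[i] for i in document.keys() & query.keys())
print(score)
```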
The difference between sparse and dense vectors is that dense vectors excel at searching for the meaning behind words:
- If we search for “wardrobe items”, dense vector search will match pants, shirts, and hoodies, while sparse vector search will only match items that literally contain “wardrobe items” in their description
- If we search for “Nike shoes”, dense vector search will match any shoe, while sparse vector search will match specifically with Nike shoes.
Sparse embeddings fail at generality but excel at specificity, as they match on specific keywords. Watch out for this, as a typo or a different way of saying the same thing will not match.
Components of RAG
As we’ve seen, embeddings are a way to represent a piece of text so that it can be compared to a query to decide if that text is similar or not. That’s one part of RAG. Let’s look at all of them:
- Query generation: You need to decide which query (a piece of text) to use in the search. This part is surprisingly hard, as we will see later.
- Embedding generation: We convert the text of the query into its embedding representation so it can be compared to the existing vectors in the vector database. This can be either dense, sparse, or both.
- Embedding search: We use a vector database to compare the embedding of the query with the embeddings of the saved points. This will return a list of points that are similar to the embedding of the query, together with the similarity score of each point. We can then decide to filter out ones that have low relevance scores.
flowchart LR
A[👤 User Query] --> B[🔄 Query Generation]
B --> C[🧠 Embedding Generation]
C --> D[🔍 Database Search]
D --> E[📄 Results]
style A fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style B fill:#7C2D12,stroke:#F97316,stroke-width:2px,rx:12,ry:12
style C fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:12,ry:12
style D fill:#7C2D12,stroke:#FBBF24,stroke-width:2px,rx:12,ry:12
style E fill:#831843,stroke:#EC4899,stroke-width:2px,rx:12,ry:12
Different implementations of RAG
Now that we know the different parts of RAG, let’s try to integrate it into a chatbot. You can implement it at different levels of difficulty; as you go up in level, cost and latency increase, but so does retrieval quality.
Latency might not be a problem for 90% of chatbots, as getting an answer after 5 seconds compared to 2 seconds does not make a huge difference, but in certain scenarios like voice calls, even a difference of one second is enough to make the experience worse or unusable.
Costs can vary from free (aside from hosting) to a few cents per retrieval.
Let’s take the same scenario for all levels: a user is having a conversation with a chatbot and has just sent a new message that an LLM needs to answer.
Level 1 - No logic
You use the user’s message as the query. The upside is that it requires no extra logic, meaning no added cost or latency; the downside is that it fails when the message does not contain what you actually want to search for. Imagine a conversation where a user asks a follow-up question whose subject is implied by the previous messages. For example, imagine the last two messages being “What kind of apples do you sell?” and “How much do they cost?”
We obviously mean to search for the price of apples, but the RAG system will receive just “How much do they cost?”
flowchart LR
A[User Message] --> B[Direct Embedding]
B --> C[Vector Database]
C --> D[Results]
style A fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:12,ry:12
style B fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:12,ry:12
style C fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:12,ry:12
style D fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:12,ry:12
E[❌ Context Lost] --> A
style E fill:#991B1B,stroke:#EF4444,stroke-width:2px,rx:12,ry:12
You can create a heuristic where you include the last 1-2 messages, hoping they contain the necessary context, but you risk diluting the query with unnecessary information. Say you ask two completely different questions: “What kind of apples do you sell?” followed by “Do you also sell bread? If yes, how much does it cost?” There might be more knowledge about apples than about bread, so when you pick the top 5 results, there might not be any about bread.
This solution can be implemented for free by using a locally running embedding model like Qwen3 Embedding; new ones pop up every few weeks, so keep an eye on them.
The total latency is the embedding generation for the query (10-20ms) plus the database query itself (20-50ms).
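Putting Level 1 together, here is a minimal sketch assuming a locally running embedding model and an existing Qdrant collection (the model id, collection name, and payload fields are placeholders; swap in whatever you actually use):

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # any local embedding model
client = QdrantClient("http://localhost:6333")

def retrieve_level_1(user_message: str, top_k: int = 5):
    # Level 1: the user's message *is* the query, no extra logic.
    query_vector = model.encode(user_message).tolist()
    return client.query_points(
        collection_name="knowledge",
        query=query_vector,
        limit=top_k,
    ).points

# The follow-up question alone loses the "apples" context from earlier messages.
for hit in retrieve_level_1("How much do they cost?"):
    print(hit.score, hit.payload)
```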
Level 2 - LLM to generate query
We saw that the main issue with Level 1 is keeping the context of the conversation for indirect questions. Including multiple messages causes other issues, so instead we can use another LLM (a simple, cheap, and fast one, as we don’t need much intelligence) to convert the conversation plus the latest message into a query that is passed to the next step.
flowchart LR
A[User Message + Context] --> B[🤖 LLM Query Generation]
B --> C[Embedding Generation]
C --> D[Vector Database]
D --> E[Results]
style A fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style B fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style C fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style D fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style E fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
We can ask the LLM something along the lines of:
Given those messages sent by a user to a chatbot:
<messages>
{messages}
</messages>
And their last message:
<last_message>
{last_message}
</last_message>
Generate a query that will be given to a RAG system to search for information on how to answer this message.
After the LLM has generated the query, we use it to search the vector database.
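A sketch of the query-generation step, assuming an OpenAI-compatible client and reusing the retrieve_level_1 helper from the previous level (the model id is a placeholder for whatever cheap, fast model you point it at):

```python
from openai import OpenAI

llm = OpenAI()  # or any OpenAI-compatible endpoint serving a cheap, fast model

QUERY_PROMPT = """Given those messages sent by a user to a chatbot:
<messages>
{messages}
</messages>
And their last message:
<last_message>
{last_message}
</last_message>
Generate a query that will be given to a RAG system to search for information on how to answer this message."""

def generate_query(messages: list[str], last_message: str) -> str:
    # Compress the conversation + last message into a standalone search query.
    completion = llm.chat.completions.create(
        model="cheap-fast-model",  # placeholder id, e.g. a Gemini Flash Lite class model
        messages=[{"role": "user", "content": QUERY_PROMPT.format(
            messages="\n".join(messages), last_message=last_message)}],
    )
    return completion.choices[0].message.content.strip()

# "How much do they cost?" becomes something like "price of apples per kg",
# which we then embed and search exactly as in Level 1.
query = generate_query(["What kind of apples do you sell?"], "How much do they cost?")
hits = retrieve_level_1(query)
```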
This solution can also be implemented for free, but with higher effort, as you need to self-host an LLM, which can itself be expensive. Alternatively, you can use cheap and fast LLMs like Gemini Flash Lite. This will not cost much, as you can include only the last 10 messages, which might be around 2,000 tokens; with a model like Gemini Flash Lite, that costs around $0.0002 per call. The total latency is the same as Level 1 plus the LLM generation, which in my testing adds around 500-800ms.
Level 3 - Agentic RAG
Level 2 works but has a reliability issue. It works only if every question the chatbot can be asked is either answerable as-is or can be covered by a single search query. The moment your question requires a little bit of logic about how to search the data, or needs follow-up queries, it won’t work as reliably, even if your knowledge base contains enough information to answer.
Say we have a conversation between a person who owns a fruit store and a fruit reseller; the chatbot belongs to the fruit reseller company.
During the conversation, the store owner asks questions about buying products for their store and has sent these last two messages: “What kind of apples do you sell?” and “How much do they cost per kg?”
Let’s use the Level 2 logic with the query “How much do apples cost per kg?”, which is correct, and the vector database returns a piece of text that mentions the price of apples: “All apples cost $4 per kg, but there is different pricing if you buy big quantities”.
A different piece of text in the database says: “All fruit bought in 100kg quantity or more is discounted 30%”.
The store owner had mentioned earlier in the chat that they are looking to buy a lot of fruit, at least 100kg of each, so it would have been helpful for the chatbot to answer: “They cost $4 per kg, but since you are buying more than 100kg, they will be $2.80 per kg” without the user having to ask again about the discounted price.
This is because there might be relationships inside the knowledge base that cannot be known prior to searching it.
Another easy example might be a math professor chatbot explaining a theorem to a student. The text containing the proof of the theorem might rely on a prior theorem that the student does not yet know, so to understand the initial theorem, both need to be explained.
To solve this issue, we can use Agentic RAG. It is similar to Level 2, but instead of using the LLM once, we use it in an agentic way, so we can search through the knowledge base many times until we have all the information that we need.
The steps are:
- Generate the initial query given the last message of the user and the current conversation by using an LLM
- Query the database with the generated query to get the most relevant knowledge
- Ask another LLM to grade the results of this query; it can optionally pick which knowledge to include and which to exclude, so that we keep only the really relevant pieces and drop the useless ones. We also ask it whether the knowledge found so far is enough to answer the initial message and, if not, to generate another question that will be used to search again
- If the LLM generated a follow-up question, repeat from step 2 with this new query (while keeping track of the history of prior queries so that it does not regenerate the same things and stays consistent)
- When the LLM thinks it does not need to search more, or it reached a limit that we set, get all the most relevant pieces of knowledge (or the ones that the LLM picked), and return them
This ideally is the “best” way to search as you use the logic encoded in the knowledge to search the knowledge base. You can also do an optional final step to summarize the picked knowledge into a more concise form that will then be used by the main LLM of the chatbot to answer the user.
flowchart TD
A[User Message + Context] --> B[🤖 Generate Initial Query]
B --> C[🔍 Search Vector DB]
C --> D[📄 Retrieve Knowledge]
D --> E{🧠 Evaluate Results}
E -->|Insufficient| F[🔄 Generate Follow-up Query]
F --> C
E -->|Sufficient| G[✅ Return Final Results]
H[Iteration Counter 1, 2, 3...] -.-> E
style A fill:#7C2D12,stroke:#FBBF24,stroke-width:2px,rx:15,ry:15
style B fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:15,ry:15
style C fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:15,ry:15
style D fill:#7C2D12,stroke:#F97316,stroke-width:2px,rx:15,ry:15
style E fill:#991B1B,stroke:#F87171,stroke-width:2px,rx:15,ry:15
style F fill:#92400E,stroke:#FCD34D,stroke-width:2px,rx:15,ry:15
style G fill:#064E3B,stroke:#10B981,stroke-width:2px,rx:15,ry:15
style H fill:#374151,stroke:#9CA3AF,stroke-width:2px,rx:12,ry:12
This solution (if not implemented with a local LLM) can be much more expensive than Level 2, depending on how many iterations the agent does. Since this step also does not need high intelligence, we can use the same cheap and fast model as Level 2. Roughly, the price goes from $0.0002 to $0.002 per call.
The main issue is latency. Since each iteration takes around 800-1000ms to complete, if the agent does 3 iterations, we might have to wait 3 seconds just for the RAG to find the knowledge! Adding the main LLM and other steps in between, we might get to 5 seconds of latency. In an age where reasoning models are getting more and more common, we might not care that much, but for latency-sensitive applications this is way too much.
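A condensed sketch of the loop, reusing generate_query and retrieve_level_1 from the previous levels. The grader prompt, its JSON shape, and the payload field are assumptions for illustration, not a fixed format:

```python
import json
from openai import OpenAI

llm = OpenAI()  # again, a cheap and fast model is enough for the grading step
MAX_ITERATIONS = 3

GRADER_PROMPT = """The user asked: {last_message}
Queries tried so far: {past_queries}
Knowledge kept so far: {kept}
New search results: {hits}

Pick which of the new results are relevant, decide whether we now have enough
knowledge to answer, and if not, propose one follow-up query. Reply as JSON:
{{"keep": ["..."], "enough": true/false, "next_query": "..." or null}}"""

def agentic_retrieve(messages: list[str], last_message: str) -> list[str]:
    kept: list[str] = []
    past_queries: list[str] = []
    query = generate_query(messages, last_message)  # step 1: initial query (as in Level 2)

    for _ in range(MAX_ITERATIONS):
        past_queries.append(query)
        # Step 2: search the knowledge base with the current query.
        hits = [hit.payload["text"] for hit in retrieve_level_1(query)]
        # Step 3: grade the results and decide whether to keep searching.
        completion = llm.chat.completions.create(
            model="cheap-fast-model",  # placeholder id
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": GRADER_PROMPT.format(
                last_message=last_message, past_queries=past_queries,
                kept=kept, hits=hits)}],
        )
        verdict = json.loads(completion.choices[0].message.content)
        kept.extend(verdict["keep"])
        if verdict["enough"] or not verdict.get("next_query"):
            break
        query = verdict["next_query"]  # step 4: repeat with the follow-up query

    return kept  # step 5: everything the grader considered relevant
```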
Level 2.5 - One-shot agent simulation RAG
Let’s take a step back. We can try to simulate the same logic that the agent uses by speculating on what follow-up questions it might need.
flowchart TD
A[User Message + Context] --> B[🤖 LLM Generates]
B --> C[Main Query]
B --> D[Follow-up Query 1]
B --> E[Follow-up Query 2]
B --> F[Follow-up Query 3]
C --> G[🔍 Search 1]
D --> H[🔍 Search 2]
E --> I[🔍 Search 3]
F --> J[🔍 Search 4]
G --> K[📊 Rank & Select Results]
H --> K
I --> K
J --> K
K --> L[✅ Final Results]
style A fill:#7C2D12,stroke:#FBBF24,stroke-width:2px,rx:15,ry:15
style B fill:#831843,stroke:#EC4899,stroke-width:2px,rx:15,ry:15
style C fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:15,ry:15
style D fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:15,ry:15
style E fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:15,ry:15
style F fill:#2E7D32,stroke:#4CAF50,stroke-width:2px,rx:15,ry:15
style G fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style H fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style I fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style J fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:12,ry:12
style K fill:#1E3A8A,stroke:#3B82F6,stroke-width:2px,rx:15,ry:15
style L fill:#064E3B,stroke:#10B981,stroke-width:2px,rx:15,ry:15
We ask the LLM to generate two things:
- Goal: A summary of what information is needed to answer the user. It should be somewhat broad.
- Queries: A set of queries that will be executed in parallel to search the vector database. They should include a direct question, like the one generated in Level 2, and a few other different questions that might be useful to search. These other questions are obviously speculative. The LLM guesses what it might need to find based on the conversation and prior knowledge.
Say the LLM generated the goal and 3 queries, then:
- We run 4 searches against the vector database in parallel: one for the goal and one for each of the 3 queries.
- We remove duplicates that were found and pick which documents to return.
The hard part is step 2! How do we pick which documents to include?
- Surely the ones from the goal and query 1 should be included, but how many? We want an upper limit on how much knowledge to include (either by number of documents or number of tokens), as too much context makes things more difficult (and more expensive) for the LLM that will answer. If we pick too many, the ones from queries 2 and 3 might be left out, even though they might contain useful information.
- We could pick the best-scoring ones out of all of them, but this has the same issue as Level 1 when we included more than one message: the follow-up queries might crowd out the main, more important results from the goal and the first query.
For this reason, I thought of two ways we could solve it:
- Reranking: This is yet another AI step that takes some time. A reranker is made to reorder (rerank) pieces of text given a query; common options are VoyageAI, ColBERT, and BGE reranker. We give the reranker all the knowledge we found and use the goal as the query, then select the highest-scoring reranked pieces of text and return them. This step adds more latency, around 300ms with a hosted solution. Cost is very low but still needs to be considered.
- Round robin and exponential decay: This does not involve another AI call and uses heuristics, hoping to include the best results of each query. The goal and query 1 are the most important, so we allocate 50% of the available space to them (say we have a limit of 20 pieces of text, then we pick 10 from the results of the goal and query 1). For the remaining queries, we have 10 slots left to fill. In our example we had queries 2 and 3, so we could pick the best 5 from each, or we can use exponential decay to pick fewer items the more queries were generated: for example, 6 from query 2 and 4 from query 3, and so on (a sketch of this heuristic follows after this list).
Or… you can combine the two, first doing the round-robin selection and then reranking the result. You can mix and match every level to increase retrieval quality.
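Here is a minimal sketch of the round-robin plus exponential-decay heuristic in plain Python. The exact split of the budget (half to the goal and the direct query, decaying shares for the speculative ones) is just one possible choice:

```python
def merge_results(goal_hits: list[str], query_hits: list[str],
                  speculative_hits: list[list[str]], limit: int = 20) -> list[str]:
    """Merge the parallel search results into one bounded, deduplicated list."""
    seen: set[str] = set()

    def take(hits: list[str], n: int) -> list[str]:
        picked = []
        for hit in hits:
            if hit in seen:
                continue  # skip duplicates already found by another query
            seen.add(hit)
            picked.append(hit)
            if len(picked) == n:
                break
        return picked

    # 50% of the budget goes to the goal and the direct query.
    half = limit // 2
    selected = take(goal_hits, half - half // 2) + take(query_hits, half // 2)

    # The rest is split across the speculative queries with exponentially decaying
    # shares (with 10 slots left and two follow-up queries: roughly 7 and 3).
    remaining = limit - len(selected)
    weights = [2 ** -i for i in range(len(speculative_hits))]
    for hits, weight in zip(speculative_hits, weights):
        selected += take(hits, round(remaining * weight / sum(weights)))

    return selected[:limit]

goal = [f"goal-{i}" for i in range(15)]
direct = [f"direct-{i}" for i in range(15)]
followups = [[f"q{j}-{i}" for i in range(15)] for j in range(2)]
print(merge_results(goal, direct, followups, limit=20))
```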
Final thoughts
At discerns.ai, I decided to go with Level 2.5 for use cases that require low latency, like voice calls, and Level 3 for textual conversations and any use case that needs more in-depth knowledge. Since the initial steps for Level 2.5 and Level 3 are similar, you can let the LLM decide whether the question needs a more in-depth search or can be answered easily. This is yet another heuristic, but it can improve the experience!