
⚡ Understanding RAG by Building a ChatPDF App: Better Chunking & Smarter Context
In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting 😅
📌 Recap from Parts 1 & 2
In the previous parts:
👉 Part 1
- Built a basic RAG pipeline using NumPy
- Understood embeddings + similarity search
👉 Part 2
- Switched to FAISS for fast retrieval ⚡
- Added persistence + re-ranking
At this point, everything looked solid.
😅 But Something Still Felt Off
I started testing with real questions…
```python
query = "What is FAISS indexing?"
```
And sometimes the answer would:
- Talk about embeddings instead
- Miss key details
- Or feel… slightly off
🤔 The weird part?
The answer was actually present in the document.
But we weren’t retrieving the right chunk.
🧠 The Real Problem Was Not Search
FAISS was doing its job.
The issue was earlier in the pipeline:
We were feeding it bad chunks.
🔍 Let’s Look at the Old Chunking Logic
```python
def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]
        if end < len(text):
            # Avoid cutting a word in half: back up to the last space.
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]
        chunks.append({"text": chunk, "page": page_num})
        if end >= len(text):
            break  # done; stepping back by the overlap here would loop forever
        i = end - OVERLAP_SIZE
    return chunks
```
🧠 What This Was Doing
- Split text using fixed size
- Avoid breaking words
- Add overlap
Looks reasonable… right?
🚨 Where It Breaks
Let’s take a simple example:
Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."
Now imagine this gets chunked like:
Chunk 1:
"FAISS is a library for efficient similarity"
Chunk 2:
"search. It is widely used in RAG systems"
💥 What just happened?
- The sentence got split
- Meaning got split
- Embeddings lost context
Embeddings don’t understand fragments
They understand complete ideas
🔍 Let’s Visualize This
Let’s reproduce what was actually happening 👇
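Here’s a quick sketch that runs the old chunker on that sentence (`CHUNK_SIZE` and `OVERLAP_SIZE` are small, illustrative values here, not the app’s real config):
```python
CHUNK_SIZE = 48      # deliberately small so the problem shows up
OVERLAP_SIZE = 8

text = ("FAISS is a library for efficient similarity search. "
        "It is widely used in RAG systems.")

for chunk in generate_chunks(text, page_num=1):
    print(repr(chunk["text"]))
# 'FAISS is a library for efficient similarity'
# 'ilarity search. It is widely used in RAG'
# 'd in RAG systems.'
```
👉 Notice how sentences are broken across chunks: this is exactly what degrades retrieval quality.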
💡 The Shift in Thinking
Instead of:
“Split text by size”
We need:
“Split text by meaning”
🚀 Step 1: Recursive Chunking (Respect Structure)
✅ New Approach
```python
def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk_slice = text[start:end]
        if end >= len(text):
            # Last piece of the page: take everything that's left.
            last_break = len(chunk_slice)
        else:
            # Try separators from most to least meaningful:
            # paragraph -> line -> sentence -> word.
            for separator in ["\n\n", "\n", ". ", " "]:
                last_break = chunk_slice.rfind(separator)
                if last_break != -1:
                    if separator == ". ":
                        last_break += 1  # keep the period with its sentence
                    break
            else:
                # No separator at all: fall back to a hard cut.
                last_break = len(chunk_slice)
        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        if final_chunk:
            chunks.append({"text": final_chunk, "page": page_num})
        if actual_end >= len(text):
            break
        # Step back by the overlap, but always make forward progress.
        start = max(actual_end - overlap_size, start + 1)
    return chunks
```
🧠 What Changed Here?
Instead of blindly splitting, we now:
```python
for separator in ["\n\n", "\n", ". ", " "]:
```
We try:
- Paragraph
- Line
- Sentence
- Word
👉 This is a priority-based splitting strategy
💡 Why This Works Better
- Paragraphs stay intact
- Sentences stay intact
- Meaning stays intact
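Here’s the same sample sentence run through the recursive splitter (a sketch with illustrative sizes; the window is slightly bigger so a whole sentence fits):
```python
chunks = generate_chunks_recursive(text, page_num=1, chunk_size=60, overlap_size=10)
for chunk in chunks:
    print(repr(chunk["text"]))
# 'FAISS is a library for efficient similarity search.'
# 'ty search. It is widely used in RAG systems.'
```
The first chunk now ends exactly at the sentence boundary, and the second carries the complete second sentence (plus a small overlap tail).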
✅ Micro Summary
- What changed: Structure-aware chunking
- Why it matters: Better embeddings → better retrieval
🔁 Step 2: Overlap Still Matters
We still keep overlap:
[Chunk 1] "RAG systems work by retrieving relevant context"
[Chunk 2] "retrieving relevant context from documents"
🧠 Why This Is Important
- Prevents context gaps
- Keeps continuity between chunks
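A quick sanity check on the chunks from the sketch above: the tail of each chunk should reappear at the start of the next one (the check is approximate, since `.strip()` can trim whitespace at a boundary):
```python
for prev, nxt in zip(chunks, chunks[1:]):
    tail = prev["text"][-10:]   # last overlap_size characters
    print(tail in nxt["text"])  # True for this sample
```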
🔥 Step 3: Storing Full Context (Big Upgrade)
```python
def generate_advanced_chunks(page_content, page_num):
    search_chunks = generate_chunks_recursive(
        page_content, page_num, CHUNK_SIZE, OVERLAP_SIZE
    )
    for chunk in search_chunks:
        # Tag each chunk with its page, and keep the whole page for later.
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content
    return search_chunks
```
🧠 Why This Matters
Earlier:
👉 We only stored chunk text
Now:
👉 We also store the entire page
💡 What This Enables
- Better answer generation
- Flexibility for later processing
- Smarter context selection
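In use, the two fields do different jobs. A rough sketch (`embedder` is assumed to be a SentenceTransformer-style model, and `page_text` is one page extracted from the PDF):
```python
chunks = generate_advanced_chunks(page_text, page_num=3)

# Small, precise chunk text -> embeddings for search
vectors = embedder.encode([c["text"] for c in chunks])

# The whole page -> richer context for answer generation
page_for_llm = chunks[0]["full_context"]
```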
🚨 New Problem Introduced
Now that we store full pages…
We started sending too much data to the LLM.
❌ Problem
- Large token usage
- Slower responses
- Irrelevant information
🚀 Step 4: Context Compression
```python
import re

def compress_context(query, full_text):
    # Split the page into sentences.
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    # Use word overlap with the query as a cheap relevance score
    # (\w+ strips punctuation, so "indexing?" still matches "indexing").
    query_words = set(re.findall(r"\w+", query.lower()))
    scored_sentences = []
    for s in sentences:
        score = sum(1 for word in re.findall(r"\w+", s.lower()) if word in query_words)
        scored_sentences.append((score, s))
    # Keep only the MAX_SENTENCES most relevant sentences.
    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join([s for _, s in top_sentences])
```
🧠 What’s Happening Here?
- Break into sentences
- Score relevance using query
- Keep only top sentences
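A tiny illustrative run (`MAX_SENTENCES` is a config constant; 2 here just for the demo):
```python
MAX_SENTENCES = 2

page = ("FAISS builds an index over embedding vectors. "
        "Lunch options are listed in the appendix. "
        "Indexing lets FAISS search millions of vectors quickly. "
        "The office closes at 6pm.")

print(compress_context("What is FAISS indexing?", page))
# The two FAISS/indexing sentences score highest and survive;
# the off-topic sentences are dropped.
```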
🔍 Visualize It
Before:
Full page → 1000+ tokens ❌
After:
Relevant sentences → smaller context ✅
💡 Why This Is Powerful
- Faster responses
- Better relevance
- Lower token usage
✅ Micro Summary
- What changed: Context filtering
- Why it matters: Less noise → better answers
🔍 Retrieval Still Uses FAISS + Re-ranking
```python
# FAISS narrows the whole index down to the 10 nearest chunks...
distances, indices = index.search(query_vector.reshape(1, -1), k=10)
# ...and the re-ranker reorders those candidates by relevance to the query.
results = ranker.rerank(rerank_request)
```
🧠 Flow
- FAISS → fast retrieval
- Re-ranker → improves relevance
- Top results → passed to LLM
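Here’s a minimal end-to-end sketch of that flow. It assumes a SentenceTransformer-style `embedder` and the flashrank library for re-ranking (which matches the `RerankRequest`-style call above); the function name and defaults are illustrative:
```python
import numpy as np
from flashrank import Ranker, RerankRequest

def retrieve_chunks(query, embedder, index, chunks, k=10, top_n=3):
    # 1) FAISS: fast nearest-neighbour search over chunk embeddings
    query_vector = embedder.encode([query]).astype(np.float32)
    distances, indices = index.search(query_vector, k)

    # 2) Re-rank the k candidates by actual relevance to the query
    passages = [
        {"id": int(i), "text": chunks[i]["text"],
         "full_context": chunks[i]["full_context"]}
        for i in indices[0]
    ]
    ranker = Ranker()  # small default cross-encoder model
    results = ranker.rerank(RerankRequest(query=query, passages=passages))

    # 3) Only the best few go on to the LLM
    return results[:top_n]
```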
💬 Smarter Answer Generation
```python
compressed_text = compress_context(query, res['full_context'])
```
👉 Instead of raw chunks, we now send:
Focused, relevant context
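Continuing the sketch above, the final prompt might be assembled like this (the prompt wording is illustrative):
```python
def build_prompt(query, results):
    # Compress each retrieved page down to its query-relevant sentences.
    parts = [compress_context(query, res["full_context"]) for res in results]
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```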
🔁 Final System: What Changed Overall
| Feature | Part 1 | Part 2 | Part 3 |
|---|---|---|---|
| Search | NumPy | FAISS | FAISS + Rerank |
| Chunking | Basic | Basic | Recursive 🧠 |
| Context | Raw | Raw | Compressed 🔥 |
| Accuracy | Low | Medium | High |
🧠 Final Thought
This is where things clicked for me.
I kept thinking better models would fix my system…
But the real issue was:
Bad context in → bad answers out
Most people focus on:
- Models ❌
- Vector DB ❌
But the real gains came from:
👉 Chunking
👉 Context handling
🔜 What’s Next?
Now things get even more interesting.
In Part 4:
👉 We’ll move beyond basic retrieval and make the system smarter
- Token-aware chunking
- Better query understanding
- More intelligent retrieval
💬 Let’s Connect
If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts 👇