
Introduction to RAGs

June 7, 2025

Tags: python · vector-database · rag · chaiaurcode

Understanding Indexing, Vectorization, Chunking and Why RAGs Matter

In the world of Generative AI, you might’ve noticed a recurring problem: LLMs hallucinate. They confidently generate answers that sound right — but are often wrong or outdated.

That’s where RAG (Retrieval-Augmented Generation) comes in. It’s a powerful technique that enhances large language models (LLMs) by combining information retrieval with text generation — so your model doesn't just guess; it refers.

In this article, we’ll cover the foundational building blocks of RAG:

  * Indexing
  * Vectorization
  * Why RAGs exist
  * Chunking
  * Overlapping


🔍 What is Indexing?

Indexing is the process of organizing and storing data in a way that makes it efficiently retrievable later.

Think of it like creating a searchable catalog of your documents — so when the user asks a question, the system can find the most relevant content before generating a response.

In RAG pipelines, indexing typically involves:

  * splitting documents into smaller chunks
  * converting each chunk into an embedding (vectorization)
  * storing those embeddings in a vector database for fast lookup

📌 In simple terms: Indexing = Preparing knowledge to be found later.
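
To make this concrete, here’s a minimal indexing sketch in Python. It assumes the `sentence-transformers` and `faiss-cpu` packages are installed; the documents and the model choice are illustrative placeholders, not a prescription.

```python
# Minimal indexing sketch: embed documents and store them in a vector index.
# Assumes `pip install sentence-transformers faiss-cpu`.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "RAG combines retrieval with generation.",
    "Embeddings capture the semantic meaning of text.",
    "Chunking splits long documents into smaller pieces.",
]

# 1. Vectorize each document (more on embeddings below)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)  # numpy array: (num_docs, embedding_dim)

# 2. Store the vectors in an index built for fast similarity search
index = faiss.IndexFlatL2(int(embeddings.shape[1]))
index.add(embeddings)
# The index can now answer: "which documents are closest to this query vector?"
```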


🧠 Why Do We Perform Vectorization?

LLMs can't "understand" raw text like humans do. They work with numbers — specifically, vectors.

Vectorization is the process of converting a piece of text into a numerical format using embeddings.

These embeddings capture the semantic meaning of the text. For example, the vectors for "How do I reset my password?" and "Steps to recover account access" end up close together, while an unrelated sentence about pizza toppings lands far away.

We use pre-trained models like `text-embedding-ada-002` from OpenAI, or open-source models like `sentence-transformers`, for this step.

📌 Vectorization is what enables semantic search — the backbone of RAG retrieval.
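
Here’s a small sketch of what that looks like in practice, using an open-source sentence-transformers model (the sentences and the model name are just examples):

```python
# Vectorization sketch: similar meanings produce nearby vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

vec_a = model.encode("How do I reset my password?")
vec_b = model.encode("Steps to recover account access")
vec_c = model.encode("Best pizza toppings")

print(util.cos_sim(vec_a, vec_b))  # higher score: semantically related
print(util.cos_sim(vec_a, vec_c))  # lower score: unrelated
```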


🤔 Why Do RAGs Exist?

Great question.

LLMs are powerful but limited by their training data and context window. They:

  * can’t know anything that happened after their training cutoff
  * can only fit a limited amount of text into a single prompt
  * hallucinate when asked about things they were never trained on

RAG solves this.
By separating knowledge retrieval from text generation, RAG allows:

  * answers grounded in your own, up-to-date documents
  * knowledge updates without retraining or fine-tuning the model
  * responses that can point back to the sources they were built from

RAG = Retrieval + Generation

Here’s a simplified flow:

  1. User asks a question

  2. System retrieves relevant documents from a knowledge base (retrieval)

  3. The retrieved context is passed into the LLM (generation)

  4. The LLM responds using both its internal knowledge and external context

📌 RAG lets you “plug in” your own knowledge into an LLM — without fine-tuning.
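
Putting it together, here’s a simplified sketch of that flow, reusing the `model`, `documents`, and `index` from the indexing example above. The `call_llm` function is a hypothetical stand-in for whichever LLM API you use.

```python
# Simplified retrieve-then-generate flow (continues the indexing sketch above).
question = "What does chunking do?"

# 1-2. Retrieval: embed the question and fetch the closest chunks
query_vec = model.encode([question])
_, ids = index.search(query_vec, 2)  # top-2 nearest documents
context = "\n".join(documents[i] for i in ids[0])

# 3-4. Generation: hand the retrieved context to the LLM along with the question
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# answer = call_llm(prompt)  # hypothetical: swap in OpenAI, Anthropic, or a local model
```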


📦 Why Do We Perform Chunking?

Imagine you upload a 100-page PDF. Should the system treat that entire document as one searchable block?

Of course not.

That’s where chunking comes in.

Chunking means breaking down long documents into smaller, more manageable pieces (e.g., paragraphs, sections).

This:

  * keeps each piece small enough to embed and to fit into the LLM’s context window
  * makes retrieval more precise, since each chunk covers a focused idea
  * reduces noise in the context passed to the model

Typically, chunks are split by sentence, paragraph or a fixed number of tokens (e.g., 300–500 tokens).
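
Here’s a minimal chunking sketch, splitting on whitespace as a rough stand-in for real tokenization (a production pipeline would typically count tokens with something like tiktoken):

```python
# Fixed-size chunking sketch: split a document into pieces of ~chunk_size words.
def chunk_text(text: str, chunk_size: int = 300) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

long_document = "word " * 1000      # placeholder for a real document's text
chunks = chunk_text(long_document)  # -> 4 chunks of up to 300 words each
```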


🔁 Why Perform Overlapping on Top of Chunking?

Sometimes, chunking by itself loses context at the edges of each chunk.

Example: a sentence or explanation begins at the end of one chunk and continues at the start of the next.

Without overlap, the model may not understand the connection between the two.

Overlapping solves this by ensuring chunks share a few lines/tokens from their neighbors.

This:

  * preserves context that spans chunk boundaries
  * improves retrieval for ideas that straddle two chunks
  * reduces the chance of cutting a key sentence in half

A typical overlap might be 10–20% of the chunk size.

📌 Think of overlapping like giving your model a little memory across the boundaries.
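
Extending the chunking sketch from above, overlap just means stepping forward by less than a full chunk (the 50-word overlap here is roughly 17% of a 300-word chunk):

```python
# Overlapping chunks: each chunk repeats the last `overlap` words of its predecessor.
def chunk_with_overlap(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # advance by less than a full chunk
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), step)
    ]
```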


✅ Summary: Tying It All Together

| Concept | Purpose |
| --- | --- |
| Indexing | Organizes data for efficient retrieval |
| Vectorization | Converts text into semantic numerical format |
| RAG | Enhances LLMs by retrieving external data |
| Chunking | Breaks documents into smaller, searchable parts |
| Overlapping | Preserves context between chunks |


🧠 Final Thoughts

RAG is becoming a foundational pattern in GenAI applications, from enterprise chatbots to coding agents and document assistants.

If you’re building AI products — don’t just prompt LLMs blindly.
Structure your knowledge. Retrieve smart. Generate responsibly.

Understanding how indexing, vectorization and chunking work under the hood gives you a strong edge in designing scalable, trustworthy AI systems.

Happy building! 🧱⚡