RAG Explained: Build an AI That Knows Your Own Data

Retrieval-Augmented Generation lets you connect any LLM to your own documents, databases, and knowledge bases — no fine-tuning required.

Author: AICredits Team
Published: 20 Mar 2026
Reading time: 12 min

The core problem: LLMs are frozen in time

When you call GPT-4o or Claude Sonnet, you are querying a model that was trained on a static snapshot of the internet, cut off at some point in the past. Ask it about your internal product documentation, a PDF your team wrote last month, or last quarter's sales figures — it will hallucinate or say it does not know.

There are two common solutions to this problem:

  1. Fine-tuning — retrain the model on your data. Expensive, slow, requires ML expertise, and the model still will not remember data added after the fine-tune.
  2. Retrieval-Augmented Generation (RAG) — dynamically fetch the relevant pieces of your data at query time and inject them into the prompt. Cheap, fast, works with any LLM, and updates as your data updates.

RAG is not new (it was introduced in a 2020 paper from Facebook AI Research, now part of Meta), but it has become the default architecture for production AI features built on top of external data.

This guide walks you through every component, explains the real tradeoffs developers encounter, and gives you a full working example using LangChain and the AICredits API so you can run it today.


What RAG actually is

The name breaks down cleanly:

  • Retrieval — given the user's question, find the most relevant chunks of text from your private data.
  • Augmented — add those chunks to the prompt as additional context.
  • Generation — send the augmented prompt to the LLM and return its answer.

Here is the pipeline end to end:

Offline (run once, or on data change)
─────────────────────────────────────────────────────────────
  Raw documents (PDFs, Notion pages, Markdown, SQL rows…)
      │
      ▼
  Document Loader        ← reads and normalises raw text
      │
      ▼
  Text Splitter          ← breaks long docs into chunks
      │
      ▼
  Embedding Model        ← turns each chunk into a dense vector
      │
      ▼
  Vector Store           ← stores vectors + original text


Online (every user query)
─────────────────────────────────────────────────────────────
  User question
      │
      ▼
  Embed question         ← same embedding model as above
      │
      ▼
  Similarity search      ← find top-K nearest chunks in vector store
      │
      ▼
  Build augmented prompt ← system prompt + retrieved chunks + question
      │
      ▼
  LLM API call           ← GPT-4o, Claude, Gemini, etc.
      │
      ▼
  Answer to user

The offline path is your ingestion pipeline. The online path runs on every request.
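To make the two paths concrete, here is a toy sketch in plain Python. The `embed` function is a stand-in (a bag-of-words counter, not a real embedding model), and the sample documents and function names are hypothetical; only the shape of the offline and online stages is meant to carry over.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Offline: chunk (here, one chunk per document) -> embed -> store.
documents = [
    "Annual plans can be refunded in full within 30 days.",
    "Our office is closed on public holidays.",
]
store = [(embed(doc), doc) for doc in documents]

# Online: embed the question -> similarity search -> augmented prompt.
def build_prompt(question: str, k: int = 1) -> str:
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    context = "\n---\n".join(text for _, text in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Can annual plans be refunded?")
```

The resulting `prompt` string is what would be sent to the LLM in the final step; swapping `embed` for a real embedding model and `store` for a vector store gives the pipeline described in the rest of this guide.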


The 4 components

1. Document loader

Responsible for reading your source data and converting it to plain text. LangChain ships loaders for PDFs (PyPDFLoader), HTML (WebBaseLoader), CSV (CSVLoader), Notion, Confluence, Google Drive, S3, and dozens more. For a database you would write a simple SQL query and treat each row as a document.
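As a rough illustration of the database case, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `tickets` table, its columns, and the `tickets/<id>` source label are all hypothetical; the output mimics the page-content-plus-metadata shape that loaders produce.

```python
import sqlite3

# Hypothetical schema: a "tickets" table with id, title and body columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tickets (title, body) VALUES (?, ?)",
    [
        ("Login fails", "Users see a 500 error after password reset."),
        ("Invoice missing", "March invoice was not emailed to the customer."),
    ],
)

def load_rows_as_documents(conn: sqlite3.Connection) -> list[dict]:
    """Turn each row into a plain-text document plus metadata."""
    rows = conn.execute("SELECT id, title, body FROM tickets ORDER BY id")
    return [
        {
            "page_content": f"{title}\n{body}",
            "metadata": {"source": f"tickets/{row_id}"},
        }
        for row_id, title, body in rows
    ]

docs = load_rows_as_documents(conn)
```

Keeping a stable `source` value per row makes it possible to cite the originating record in answers later on.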

2. Text splitter

LLMs have finite context windows. Even if you could fit your entire knowledge base into one prompt, you would be sending all of it on every query: expensive, slow, and noisy. The text splitter breaks documents into small, semantically coherent chunks so retrieval returns precise, relevant excerpts rather than entire documents.

The most common splitter is the recursive character splitter, which splits on \n\n, then \n, then spaces, working down the hierarchy until chunks are within the size limit. This preserves paragraph structure better than a naive character slice.
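The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it has no overlap, drops the separators it splits on, and the function name is made up for this example.

```python
def recursive_split(text: str, max_len: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Split text into pieces of at most max_len characters, preferring the
    coarsest separator that works. Simplified: no overlap, separators dropped."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard character slice.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_len:
            current = candidate          # keep merging small parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This part is still too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph here.\n\nSecond paragraph is a little longer than the first."
pieces = recursive_split(sample, max_len=40)
```

In this sample the first paragraph survives as one chunk, while the second is too long for the limit and gets split on spaces instead.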

3. Embedding model

An embedding model maps text to a dense numerical vector (e.g., 1536 dimensions for OpenAI's text-embedding-3-small). Texts with similar meaning end up with similar vectors in this high-dimensional space. Cosine similarity between two vectors measures how semantically close they are.

You embed every chunk during ingestion, and you embed the user's question at query time. The retrieval step is just a nearest-neighbour search in this vector space.
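For intuition, cosine similarity and the nearest-neighbour step fit in a few lines. The three-dimensional vectors below are made up for illustration; a real embedding model returns hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for two chunks and one question.
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "office hours": [0.0, 0.2, 0.9],
}
question_vector = [0.8, 0.3, 0.1]

# Retrieval is picking the chunk whose vector is closest to the question's.
best_chunk = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]),
)
```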

4. Vector store

A database that stores your vectors and supports fast similarity search (approximate nearest neighbour, or ANN). Options range from an in-memory FAISS index to managed cloud services like Pinecone. The vector store also stores the original chunk text alongside each vector so you can retrieve the text once you find the matching vectors.
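A minimal in-memory sketch shows the two responsibilities: keeping the text next to its vector, and returning the top-K matches. The class and field names are hypothetical, the search is exact brute force rather than ANN, and zero-length vectors are assumed not to occur.

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Entry:
    vector: list[float]
    text: str
    metadata: dict = field(default_factory=dict)

class TinyVectorStore:
    """Brute-force (exact) search; real stores use ANN indexes to scale."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def add(self, vector: list[float], text: str, **metadata) -> None:
        # Normalise once at insert time so search is a plain dot product.
        norm = math.sqrt(sum(x * x for x in vector))
        self.entries.append(Entry([x / norm for x in vector], text, metadata))

    def search(self, query: list[float], k: int = 2) -> list[tuple[float, str]]:
        norm = math.sqrt(sum(x * x for x in query))
        q = [x / norm for x in query]
        scored = ((sum(a * b for a, b in zip(q, e.vector)), e.text) for e in self.entries)
        return heapq.nlargest(k, scored)

store = TinyVectorStore()
store.add([1.0, 0.0], "Refunds are issued within 30 days.", source="policies.txt")
store.add([0.0, 1.0], "The office opens at 9 am.", source="handbook.txt")
store.add([0.7, 0.7], "Refund requests go through the office.", source="faq.txt")

hits = store.search([0.9, 0.1], k=2)
```

Each hit is a `(score, text)` pair, highest score first. Brute force like this is fine for a few thousand chunks; ANN indexes trade a little accuracy for speed once the corpus grows.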


Step-by-step: ingestion to answer

Step 1 — Load documents

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
 
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages from {len(set(d.metadata['source'] for d in raw_docs))} files")

Step 2 — Split into chunks

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # characters, not tokens
    chunk_overlap=64,      # overlap so context is not lost at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(raw_docs)
print(f"Split into {len(chunks)} chunks")

Step 3 — Embed and store

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
 
# Use AICredits as the base URL — same OpenAI interface, INR billing
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key="sk-your-aicredits-key",
    openai_api_base="https://api.aicredits.in/v1",
)
 
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("./faiss_index")
print("Index saved to disk")

Step 4 — Retrieve at query time

# load_local unpickles the stored docstore, so only load index files you created yourself
vectorstore = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,
)
 
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},   # top 4 chunks
)
 
query = "What is our refund policy for annual subscriptions?"
relevant_chunks = retriever.invoke(query)
 
for i, doc in enumerate(relevant_chunks):
    print(f"--- Chunk {i+1} (source: {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:200])

Step 5 — Build the augmented prompt and call the LLM

from openai import OpenAI
 
client = OpenAI(
    api_key="sk-your-aicredits-key",
    base_url="https://api.aicredits.in/v1",
)
 
context = "\n\n---\n\n".join(doc.page_content for doc in relevant_chunks)
 
system_prompt = """You are a helpful assistant. Answer the user's question using ONLY
the context provided below. If the answer is not in the context, say
"I don't have that information in the provided documents."
 
Context:
{context}""".format(context=context)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ],
    temperature=0.2,   # lower temperature for factual retrieval tasks
)
 
print(response.choices[0].message.content)

Choosing chunk size: the recall-precision tradeoff

Chunk size is the most important tuning parameter in a RAG pipeline and the one most developers get wrong the first time.

| Chunk size | Effect on recall | Effect on precision | Good for |
|------------|------------------|---------------------|----------|
| Very small (128–256 chars) | Low: a concept may span multiple chunks | High: retrieved text is tightly focused | FAQ-style Q&A, short facts |
| Medium (512–1024 chars) | Balanced | Balanced | General-purpose documentation |
| Large (2048+ chars) | High: full paragraphs retrieved | Low: retrieves irrelevant surrounding text | Long-form reasoning, narrative docs |

Practical starting point: 512 characters with 64-character overlap. Adjust based on your evaluation: if the model says it cannot find the answer but you know the answer is in the docs, increase chunk size or overlap. If the model returns inaccurate answers from tangentially related chunks, decrease chunk size.

Chunk overlap prevents information loss at boundaries. Without it, a sentence split across two chunks may never be fully retrieved. A 10–15% overlap is usually sufficient.
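The effect of overlap is easiest to see with plain fixed-size windows. The helper below is an illustrative sketch (its name and the tiny sizes are chosen for the example, not taken from any library): without overlap the phrase "Annual plans" is cut in two, with overlap one chunk contains it whole.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; each window starts size - overlap
    characters after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "Refunds are available. Annual plans qualify within 30 days."
no_overlap = chunk_with_overlap(text, size=30, overlap=0)
with_overlap = chunk_with_overlap(text, size=30, overlap=10)
```

The price of overlap is duplication: every overlapping character is embedded and stored twice, which is why a modest ratio is usually preferred.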


Embedding models: what to use and what it costs

All embedding models serve the same purpose — map text to vectors — but they differ in dimension count, quality, speed, and cost.

| Model | Provider | Dimensions | Cost (per 1M tokens) | Notes |
|-------|----------|------------|----------------------|-------|
| text-embedding-3-small | OpenAI | 1536 | ~$0.02 | Best cost/quality balance, widely supported |
| text-embedding-3-large | OpenAI | 3072 | ~$0.13 | Higher quality, higher cost |
| text-embedding-ada-002 | OpenAI | 1536 | ~$0.10 | Older, outperformed by 3-small |
| nomic-embed-text | Nomic (open source) | 768 | Free (self-hosted) | Strong open-source alternative |
| all-MiniLM-L6-v2 | Hugging Face | 384 | Free (self-hosted) | Fast, small, decent quality |

Recommendation for most projects: start with text-embedding-3-small via AICredits. At typical corpus sizes (a few thousand chunks), embedding costs are negligible — often less than ₹10 for the initial ingestion of an entire documentation site. At query time, embedding a single question costs a fraction of a paisa.

Important: once you choose an embedding model, you are locked in. The vectors in your store must come from the same model as the vectors you create at query time. Mixing models produces nonsense similarity scores.
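One way to guard against accidental mixing is to record the model name next to the index and check it before querying. This is a sketch under assumptions: the `manifest.json` file name and both helper functions are hypothetical, not part of FAISS or LangChain.

```python
import json
import tempfile
from pathlib import Path

def save_manifest(index_dir: Path, model: str, dimensions: int) -> None:
    """Record which embedding model produced the vectors in this index."""
    index_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"embedding_model": model, "dimensions": dimensions}
    (index_dir / "manifest.json").write_text(json.dumps(manifest))

def check_manifest(index_dir: Path, model: str) -> None:
    """Refuse to query an index built with a different embedding model."""
    manifest = json.loads((index_dir / "manifest.json").read_text())
    if manifest["embedding_model"] != model:
        raise ValueError(
            f"index was built with {manifest['embedding_model']!r}, "
            f"but queries would use {model!r}; re-embed the corpus first"
        )

index_dir = Path(tempfile.mkdtemp()) / "faiss_index"
save_manifest(index_dir, "text-embedding-3-small", 1536)
check_manifest(index_dir, "text-embedding-3-small")   # same model: passes silently
```

Calling `check_manifest` with a different model name raises a `ValueError`, which is a much clearer failure than silently wrong search results.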


Vector databases: FAISS vs pgvector vs Pinecone

FAISS (local, in-process)

Facebook AI Similarity Search is a C++ library with Python bindings. It runs entirely in memory on your machine — no network, no infrastructure, no cost. Indices can be serialised to disk and reloaded.

Use when: your corpus fits in memory (up to a few million small chunks), you are prototyping, or you want zero infrastructure overhead. Not suitable for multi-process or multi-server setups.

pgvector (PostgreSQL extension)

pgvector adds a vector column type and the <=> cosine distance operator to Postgres. Your vectors live in the same database as your application data. You can join vector search results with SQL filters — for example, "find the 5 most similar chunks from documents owned by this user."

Use when: you already run Postgres (very common), you need metadata filtering alongside vector search, or you want transactional consistency between your data and your vectors. AICredits itself uses pgvector for semantic caching. LangChain's PGVector class wraps this cleanly.

from langchain_postgres import PGVector
 
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="my_rag_docs",
    connection="postgresql://user:pass@localhost:5432/mydb",
)
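For the "chunks owned by this user" case, the underlying SQL might look like the query built below. This is a sketch, assuming a hypothetical `chunks` table with `content`, `owner_id`, and `embedding` columns and a psycopg-style driver that uses `%s` placeholders; only the `<=>` operator and the `::vector` cast come from pgvector itself.

```python
def build_similarity_query(query_vector: list[float], user_id: int, k: int = 5):
    """Return (sql, params) for a psycopg-style cursor.execute call.

    Table and column names (chunks, owner_id, embedding, content) are
    hypothetical; adapt them to your schema.
    """
    # pgvector accepts a vector literal written as '[x1,x2,...]'.
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    sql = (
        "SELECT content, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "WHERE owner_id = %s "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vector_literal, user_id, k)

sql, params = build_similarity_query([0.1, 0.2, 0.3], user_id=42)
```

Passing the values as parameters rather than formatting them into the SQL string keeps the query safe from injection, and the `WHERE` clause is an ordinary SQL filter applied alongside the vector ordering.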

Pinecone (managed cloud)

Pinecone is a purpose-built managed vector database. It handles sharding, replication, and index management for you. It supports metadata filtering, namespaces, and scales to billions of vectors.

Use when: your corpus is very large (tens of millions of vectors), you need sub-10ms query latency at high QPS, or you do not want to manage infrastructure. Pinecone has a free tier (1 index, 100K vectors) and paid plans starting around $70/month.


Writing the retrieval prompt correctly

How you inject retrieved context into the prompt matters. These patterns work well in production:

Pattern 1 — Strict grounding (recommended for factual Q&A):

You are a helpful assistant. Answer using ONLY the context below.
If the answer is not present, say: "I don't have that information."

Context:
[CHUNK 1]
---
[CHUNK 2]
---
[CHUNK 3]

Question: {user_question}

Pattern 2 — Soft grounding with citations:

You are a helpful assistant. Use the context provided to answer the question.
When you use information from the context, cite the source document name.
If the context is insufficient, say so and answer from general knowledge.

Context:
[SOURCE: pricing.pdf, page 3]
[CHUNK TEXT]

[SOURCE: faq.md]
[CHUNK TEXT]

Question: {user_question}

Key rules:

  • Place the context before the question, not after. Models attend better to context that precedes the question.
  • Use clear delimiters (---, XML-style <context> tags, or numbered sections) so the model can distinguish chunks.
  • Keep the injected context under 3,000 tokens if using a small context window model. For GPT-4o (128K) or Claude (200K) you have far more room, but signal quality drops if you inject too many low-relevance chunks.
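These rules can be folded into a small prompt builder. The sketch below is illustrative: the function names and the chunk dictionaries are hypothetical, and the token count is approximated as roughly four characters per token, which is only a rule of thumb for English text.

```python
def build_context(chunks: list[dict], max_tokens: int = 3000) -> str:
    """Wrap chunks in <context> tags, most relevant first, until the budget is hit.

    Token counts are approximated as len(text) / 4; use a real tokenizer
    when the limit is tight.
    """
    parts, used = [], 0
    for chunk in chunks:                      # assumed sorted by relevance
        cost = len(chunk["text"]) // 4 + 1
        if used + cost > max_tokens:
            break
        parts.append(f'<context source="{chunk["source"]}">\n{chunk["text"]}\n</context>')
        used += cost
    return "\n".join(parts)

def build_messages(chunks: list[dict], question: str) -> list[dict]:
    system = (
        "Answer using ONLY the context below. "
        'If the answer is not present, say: "I don\'t have that information."\n\n'
        + build_context(chunks)
    )
    # Context goes in the system message, ahead of the user's question.
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

chunks = [
    {"source": "pricing.pdf", "text": "Annual plans are billed once per year."},
    {"source": "faq.md", "text": "x" * 20000},   # too large for the budget, so dropped
]
messages = build_messages(chunks, "How are annual plans billed?")
```

The returned list has the same shape as the `messages` argument used in Step 5, so it can be passed to the chat completions call directly.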

Handling retrieval failures

Retrieval will sometimes fail to find relevant context — the user asked about something not in your documents, the chunk was split badly, or the query phrasing was unusual. Your pipeline needs to handle this gracefully.

from openai import OpenAI

def rag_answer(query: str, retriever, client: OpenAI, min_score: float = 0.75) -> str:
    # FAISS's similarity_search_with_score returns a raw distance (lower = closer),
    # so thresholding it with ">=" would keep the worst matches.
    # similarity_search_with_relevance_scores returns normalised scores instead:
    # higher = more similar, about 1.0 for a near-identical match.
    scored = retriever.vectorstore.similarity_search_with_relevance_scores(query, k=4)

    # The 0.75 default is a starting point; tune it against your own data.
    good_chunks = [doc for doc, score in scored if score >= min_score]

    if not good_chunks:
        # Graceful fallback: skip the LLM call rather than risk a hallucinated answer
        return "I could not find relevant information in the knowledge base for your question."

    context = "\n\n---\n\n".join(doc.page_content for doc in good_chunks)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Answer using only this context:\n\n{context}",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

Common failure modes and fixes:

| Failure | Symptom | Fix |
|---------|---------|-----|
| No relevant chunks found | Model says "not in documents" for things that are | Decrease similarity threshold, increase chunk overlap |
| Wrong chunks retrieved | Model answers a different question | Check if your embedding model handles domain vocabulary; try larger chunks |
| Answer ignores context | Model uses training knowledge instead | Make the grounding instruction more explicit; lower temperature |
| Truncated chunk misses key fact | Fact split across boundary | Increase chunk overlap |


Hybrid search: BM25 + vector similarity

Pure vector search excels at semantic similarity but can miss exact keyword matches. "PCI-DSS compliance" and "payment card industry standard" are semantically close, but a user searching for the exact acronym "PCI-DSS" may get better results from a keyword search.

Hybrid search combines BM25 (classic TF-IDF based keyword ranking) with vector similarity, typically via a weighted linear combination or Reciprocal Rank Fusion (RRF).

# BM25Retriever needs the rank_bm25 package: pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
 
# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
 
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
 
# Ensemble: 40% keyword weight, 60% semantic weight
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],
)
 
results = ensemble_retriever.invoke("PCI-DSS compliance requirements")
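The Reciprocal Rank Fusion mentioned above is simple enough to write by hand. The sketch below is the plain, unweighted form with made-up document IDs; `k = 60` is the constant commonly used for RRF and is an assumption here, not a value taken from this pipeline.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_pci", "doc_audit", "doc_refund"]      # keyword order
vector_ranking = ["doc_cards", "doc_pci", "doc_audit"]     # semantic order
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists rise to the top, while RRF never has to compare BM25 scores with cosine similarities, which live on different scales.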

When to use hybrid search: any domain with lots of proper nouns, product names, version numbers, codes, or acronyms. Legal, medical, financial, and technical documentation all benefit significantly from hybrid retrieval.


Full working example

Below is a self-contained script that ingests a folder of text files, builds a FAISS index, and answers questions using the AICredits API.

"""
rag_demo.py — Full RAG pipeline using LangChain + AICredits API
 
Install dependencies:
  pip install langchain langchain-openai langchain-community faiss-cpu openai pypdf
 
Set your AICredits API key:
  export AICREDITS_API_KEY="sk-your-key-here"
"""
 
import os
from pathlib import Path
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
 
# ── Config ────────────────────────────────────────────────────────────────────
 
API_KEY = os.environ["AICREDITS_API_KEY"]
BASE_URL = "https://api.aicredits.in/v1"
DOCS_DIR = "./my_docs"          # folder containing your .txt files
INDEX_PATH = "./faiss_index"
EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
CHUNK_SIZE = 512
CHUNK_OVERLAP = 64
TOP_K = 4
 
# ── Embeddings and LLM (both via AICredits) ───────────────────────────────────
 
embeddings = OpenAIEmbeddings(
    model=EMBED_MODEL,
    openai_api_key=API_KEY,
    openai_api_base=BASE_URL,
)
 
llm = ChatOpenAI(
    model=CHAT_MODEL,
    openai_api_key=API_KEY,
    openai_api_base=BASE_URL,
    temperature=0.2,
)
 
# ── Ingestion pipeline ────────────────────────────────────────────────────────
 
def build_index():
    print("Loading documents...")
    loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.txt",
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"},
    )
    raw_docs = loader.load()
    print(f"  Loaded {len(raw_docs)} documents")
 
    print("Splitting into chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
    )
    chunks = splitter.split_documents(raw_docs)
    print(f"  Created {len(chunks)} chunks")
 
    print("Embedding chunks and building FAISS index...")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local(INDEX_PATH)
    print(f"  Index saved to {INDEX_PATH}")
    return vectorstore
 
 
def load_index():
    return FAISS.load_local(
        INDEX_PATH,
        embeddings,
        allow_dangerous_deserialization=True,
    )
 
 
# ── RAG chain ─────────────────────────────────────────────────────────────────
 
RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful assistant. Use only the context below to answer.
If the answer is not in the context, say "I don't have that information."
 
Context:
{context}
 
Question: {question}
Answer:""",
)
 
 
def build_rag_chain(vectorstore):
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": TOP_K},
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",                  # "stuff" = inject all chunks at once
        retriever=retriever,
        chain_type_kwargs={"prompt": RAG_PROMPT},
        return_source_documents=True,
    )
    return chain
 
 
# ── Main ──────────────────────────────────────────────────────────────────────
 
def main():
    # Build index if it does not exist
    index_exists = Path(INDEX_PATH).exists()
    if not index_exists:
        vectorstore = build_index()
    else:
        print(f"Loading existing index from {INDEX_PATH}...")
        vectorstore = load_index()
 
    chain = build_rag_chain(vectorstore)
 
    print("\nRAG system ready. Type 'quit' to exit.\n")
    while True:
        query = input("Your question: ").strip()
        if query.lower() in ("quit", "exit", "q"):
            break
        if not query:
            continue
 
        result = chain.invoke({"query": query})
        print(f"\nAnswer: {result['result']}")
 
        print("\nSources:")
        seen = set()
        for doc in result["source_documents"]:
            src = doc.metadata.get("source", "unknown")
            if src not in seen:
                print(f"  - {src}")
                seen.add(src)
        print()
 
 
if __name__ == "__main__":
    main()

To run it:

mkdir my_docs
echo "Our refund policy allows full refunds within 30 days for annual plans." > my_docs/policies.txt
echo "Monthly plans can be cancelled at any time with no refund for the current period." >> my_docs/policies.txt
 
export AICREDITS_API_KEY="sk-your-key-here"
python rag_demo.py

Cost breakdown: keeping RAG cheap with AICredits

RAG has three cost centres: embedding, retrieval (infrastructure), and generation. Here is a realistic estimate for a small internal documentation bot (500 documents, ~1,000 queries/day):

| Operation | Model | Tokens per call | Calls/day | Daily cost (USD) |
|-----------|-------|-----------------|-----------|------------------|
| Initial ingestion (once) | text-embedding-3-small | ~50K total | 1 | ~$0.001 |
| Query embedding | text-embedding-3-small | ~50 | 1,000 | ~$0.001 |
| Generation (4 chunks + question) | gpt-4o-mini | ~800 input + 300 output | 1,000 | ~$0.18 |
| Total daily | | | | ~$0.18 |
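The embedding rows follow from one line of arithmetic: tokens per call, times calls per day, times the per-million-token rate. The helper below is a sketch with a made-up name; the $0.02 rate is the approximate text-embedding-3-small figure from the embedding model table earlier in this guide.

```python
def daily_cost_usd(tokens_per_call: int, calls_per_day: int, usd_per_million_tokens: float) -> float:
    """Cost of one token stream (input or output) for one day."""
    return tokens_per_call * calls_per_day / 1_000_000 * usd_per_million_tokens

EMBED_PRICE = 0.02   # approx. USD per 1M tokens for text-embedding-3-small

ingestion = daily_cost_usd(50_000, 1, EMBED_PRICE)        # one-off, ~50K tokens in total
query_embedding = daily_cost_usd(50, 1_000, EMBED_PRICE)  # ~50 tokens x 1,000 queries
```

Generation is billed the same way, once for input tokens and once for output tokens, each at its own rate, so plug in the current rates from your provider's price list before relying on an estimate.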

At AICredits' INR pricing, that is roughly ₹17/day — less than a cup of chai — for a fully functioning private knowledge base AI. Via AICredits you pay in INR directly, with no USD card required, and the forex conversion is handled transparently on every request.

Tips to reduce cost further:

  • Cache repeated questions. AICredits includes semantic caching — if two users ask similar questions, the second call is served from cache at near-zero cost.
  • Use gpt-4o-mini for the generation step. The bottleneck in RAG quality is usually retrieval, not the LLM. A small model reading good context outperforms a large model with no context.
  • Limit top-K. Retrieve 3–4 chunks, not 10. More chunks = more tokens = higher cost, and beyond a threshold they add noise rather than signal.
  • Filter by metadata before embedding search. If your documents have user_id or category metadata, filter to the relevant subset before running the ANN search. Fewer candidates = cheaper.

What to build next

Once you have a basic RAG pipeline working, common next steps are:

  • Evaluation — measure retrieval precision and answer accuracy using a small labelled test set. Tools like ragas automate this.
  • Re-ranking — after retrieving top-K chunks by vector similarity, run a cross-encoder re-ranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) to reorder them by relevance before passing to the LLM.
  • Multi-hop retrieval — for questions that require combining information from multiple documents, use an agent loop that issues multiple retrieval queries.
  • Streaming: stream the LLM response back token by token so the user sees output as soon as generation starts. Retrieval and prompt construction still have to finish before the first token, so keep them fast.
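As a starting point for the evaluation step, retrieval quality can be measured with a hit rate over a handful of labelled questions. This is a hand-rolled sketch, not the ragas API: the test cases, file names, and the canned retriever standing in for the real one are all hypothetical.

```python
def hit_rate_at_k(test_set: list[dict], retrieve, k: int = 4) -> float:
    """Fraction of questions for which at least one expected source
    appears among the top-k retrieved sources."""
    hits = 0
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        if set(retrieved) & set(case["expected_sources"]):
            hits += 1
    return hits / len(test_set)

# Hypothetical labelled cases and a canned retriever standing in for the real one.
test_set = [
    {"question": "What is the refund window?", "expected_sources": ["policies.txt"]},
    {"question": "Who approves expenses?", "expected_sources": ["finance.md"]},
]
canned = {
    "What is the refund window?": ["policies.txt", "faq.md"],
    "Who approves expenses?": ["handbook.md", "faq.md"],
}
score = hit_rate_at_k(test_set, retrieve=lambda q: canned[q])
```

Re-running the same test set after each change to chunk size, overlap, or top-K turns tuning into a measurable comparison rather than guesswork.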

RAG is one of the highest-leverage techniques available to developers building on LLMs today. You do not need a GPU, a data science team, or a model training budget: just your documents, an embedding model, a vector store, and an LLM API. The embedding model and the LLM are both available through a single AICredits key, billed in INR, via the standard OpenAI interface.
