How to Build a RAG Application: Search Your Own Data With AI
TL;DR: Building a RAG application means connecting a large language model to your documents, databases, and knowledge bases so it answers questions from your actual data instead of its training data. This guide covers chunking, embedding, vector storage, retrieval, and generation with working code.
What Is RAG?
RAG (Retrieval-Augmented Generation) is a two-step pattern:
- Retrieve: Find the most relevant documents or passages from your data
- Generate: Send those documents plus the user's question to an LLM, which generates an answer grounded in your actual content
Without RAG, an LLM answers from its training data. With RAG, it answers from your data. This is the difference between a generic AI assistant and one that knows your products, policies, and customers.
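The two-step pattern can be sketched in a few lines. Both helpers here are stand-ins for the real implementations built in Steps 4 and 5; the document text and source name are made up for illustration.

```typescript
// Minimal RAG flow: retrieve, then generate.
type Doc = { text: string; source: string };

async function retrieveRelevant(question: string): Promise<Doc[]> {
  // Stand-in: a real version does vector similarity search (Step 4).
  return [{ text: "Refunds are issued within 14 days.", source: "policy.md" }];
}

async function generateAnswer(question: string, docs: Doc[]): Promise<string> {
  // Stand-in: a real version sends the docs to an LLM as context (Step 5).
  const context = docs.map((d) => d.text).join("\n");
  return `Answer based on: ${context}`;
}

async function rag(question: string): Promise<string> {
  const docs = await retrieveRelevant(question); // 1. Retrieve
  return generateAnswer(question, docs);         // 2. Generate
}
```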
Step 1: Prepare Your Data
RAG quality depends entirely on input data quality. Spend more time here than anywhere else.
Supported data sources
| Source | Extraction Method |
|---|---|
| Markdown files | Read directly (already text) |
| PDF documents | pdf-parse or Apache Tika |
| Word documents | mammoth.js |
| Web pages | Cheerio (HTML to text) |
| Database records | SQL query, format as text |
| CSV/Excel | Papa Parse or SheetJS |
| Confluence/Notion | API export |
Data cleaning
Before chunking, clean your raw content:
- Remove headers, footers, and navigation elements from web pages
- Strip formatting artifacts from PDF extraction
- Remove duplicate content (same FAQ appearing on 10 different pages)
- Fix encoding issues (special characters, smart quotes)
- Remove content that is outdated or no longer accurate
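A cleaning pass for the first few items might look like this sketch; the function names and the exact rules are illustrative, and you should tune them to your own corpus.

```typescript
// Normalize whitespace and fix common encoding artifacts.
function cleanText(raw: string): string {
  return raw
    .replace(/[\u2018\u2019]/g, "'")  // smart single quotes -> ASCII
    .replace(/[\u201C\u201D]/g, '"')  // smart double quotes -> ASCII
    .replace(/\u00A0/g, " ")          // non-breaking spaces
    .replace(/[ \t]+/g, " ")          // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, "\n\n")       // collapse runs of blank lines
    .trim();
}

// Drop exact duplicates (the same FAQ pasted on many pages),
// comparing case- and whitespace-insensitively.
function dedupe(texts: string[]): string[] {
  const seen = new Set<string>();
  return texts.filter((t) => {
    const key = t.toLowerCase().replace(/\s+/g, " ").trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```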
Step 2: Chunk Your Documents
Chunking means splitting documents into smaller pieces that the retrieval system can search and the LLM can process.
Chunking strategies
Fixed-size chunks: Split every 300 to 500 words with a 50-word overlap. Simple to implement. Works well for uniform content like documentation.
Semantic chunks: Split at natural boundaries (paragraphs, sections, headings). Produces more coherent chunks. Better for structured documents.
Recursive splitting: Try to split at paragraph breaks first. If paragraphs are too long, split at sentence breaks. If sentences are too long, split at word boundaries.
Implementation
```typescript
interface Chunk {
  text: string;
  source: string;
  metadata: {
    title?: string;
    section?: string;
    page?: number;
    url?: string;
  };
}

function chunkDocument(
  text: string,
  source: string,
  maxChunkSize: number = 500,
  overlap: number = 50
): Chunk[] {
  const words = text.split(/\s+/);
  const chunks: Chunk[] = [];
  for (let i = 0; i < words.length; i += maxChunkSize - overlap) {
    const chunkWords = words.slice(i, i + maxChunkSize);
    if (chunkWords.length < 20) continue; // Skip tiny trailing chunks
    chunks.push({
      text: chunkWords.join(" "),
      source,
      metadata: {},
    });
  }
  return chunks;
}
```
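The fixed-size splitter covers uniform content; the recursive strategy described above can be sketched like this. The separator order and limits are illustrative, not a standard implementation.

```typescript
// Recursive splitting: try paragraph breaks first, then sentence
// breaks, then fall back to plain word windows.
function recursiveSplit(text: string, maxWords: number = 500): string[] {
  const words = (t: string) => t.split(/\s+/).filter(Boolean).length;
  if (words(text) <= maxWords) return text.trim() ? [text.trim()] : [];

  // Try progressively finer separators: paragraphs, then sentences.
  for (const sep of [/\n\s*\n/, /(?<=[.!?])\s+/]) {
    const parts = text.split(sep).filter((p) => p.trim().length > 0);
    if (parts.length > 1) {
      // Greedily pack parts into chunks under the limit, recursing
      // into any accumulated run that is still too long.
      const chunks: string[] = [];
      let current = "";
      for (const part of parts) {
        if (current && words(current + " " + part) > maxWords) {
          chunks.push(...recursiveSplit(current, maxWords));
          current = part;
        } else {
          current = current ? current + " " + part : part;
        }
      }
      if (current) chunks.push(...recursiveSplit(current, maxWords));
      return chunks;
    }
  }

  // Last resort: fixed-size word windows.
  const all = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < all.length; i += maxWords) {
    chunks.push(all.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}
```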
Chunk size guidelines
| Content Type | Recommended Chunk Size | Overlap |
|---|---|---|
| FAQ answers | 100 to 200 words | 0 (each FAQ is one chunk) |
| Documentation | 300 to 500 words | 50 words |
| Long articles | 400 to 600 words | 75 words |
| Legal/compliance | 200 to 300 words | 50 words |
Smaller chunks mean more precise retrieval but less context per chunk. Larger chunks provide more context but may include irrelevant information. Test both sizes and compare answer quality.
Step 3: Generate Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search. These vectors are stored in a vector database that supports fast similarity queries.
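"Similar texts produce similar vectors" is usually measured with cosine similarity. A toy version of the math, using 3-dimensional vectors instead of the 1,536+ dimensions of a real embedding model, with made-up numbers:

```typescript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings:
const refund = [0.9, 0.1, 0.2];    // "how do I get a refund?"
const cancel = [0.85, 0.15, 0.25]; // "cancel my order"
const weather = [0.1, 0.9, 0.1];   // "what's the weather?"

cosineSimilarity(refund, cancel);  // close to 1: related meanings
cosineSimilarity(refund, weather); // much lower: unrelated
```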
Embedding models
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Most applications |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Higher accuracy needs |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual content |
text-embedding-3-small is the default choice. It is cheap, fast, and accurate enough for most RAG applications. For deciding between RAG and custom model training, see our RAG vs fine-tuning comparison.
Embedding implementation
```typescript
import { OpenAI } from "openai";

const openai = new OpenAI();

async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

async function embedChunks(chunks: Chunk[]): Promise<void> {
  // Process in batches of 100 to respect rate limits
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const texts = batch.map((c) => c.text);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: texts,
    });
    // "db" and "documents" are your database client and table
    // (e.g. Drizzle ORM); the schema is created in Step 4.
    for (let j = 0; j < batch.length; j++) {
      await db.insert(documents).values({
        text: batch[j].text,
        source: batch[j].source,
        metadata: batch[j].metadata,
        embedding: response.data[j].embedding,
      });
    }
  }
}
```
At roughly 500 words (about 650 tokens) per chunk, embedding 1,000 chunks is about 650,000 tokens, which costs about $0.01 with text-embedding-3-small. Even a large knowledge base of 10,000 chunks costs well under $1 to embed.
Step 4: Set Up Vector Storage
Store embeddings in a database that supports similarity search.
PostgreSQL with pgvector
The simplest option if you already use PostgreSQL. Add the pgvector extension and create a table for your chunks. Our AI integration services include RAG pipeline setup for teams that need this live quickly.
```sql
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT now()
);

-- Create an index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
Retrieval query
```typescript
interface RetrievedChunk extends Chunk {
  similarity: number;
}

async function retrieveRelevant(
  query: string,
  limit: number = 5
): Promise<RetrievedChunk[]> {
  const queryEmbedding = await embedText(query);
  // pgvector expects the vector as a string literal like '[0.1,0.2,...]',
  // which JSON.stringify produces for a number array.
  const embeddingLiteral = JSON.stringify(queryEmbedding);
  const results = await db.execute(sql`
    SELECT text, source, metadata,
           1 - (embedding <=> ${embeddingLiteral}::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> ${embeddingLiteral}::vector
    LIMIT ${limit}
  `);
  return results.rows.map((r) => ({
    text: r.text,
    source: r.source,
    metadata: r.metadata,
    similarity: r.similarity,
  }));
}
```
The <=> operator computes cosine distance. Lower distance means higher similarity. The query returns the 5 most similar chunks to the user's question.
Step 5: Build the Generation Step
Now combine retrieval with generation. The user's question and the retrieved chunks go to the LLM together.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function ragQuery(question: string): Promise<{
  answer: string;
  sources: string[];
}> {
  // 1. Retrieve relevant chunks
  const chunks = await retrieveRelevant(question, 5);

  // 2. Format context
  const context = chunks
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.text}`)
    .join("\n\n---\n\n");

  // 3. Generate answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6-20250514",
    max_tokens: 1024,
    system: `You are a helpful assistant that answers questions using the provided context.

Rules:
- Only answer from the provided context. Do not use outside knowledge.
- If the context does not contain the answer, say "I could not find this information in the available documents."
- Cite sources by number [Source 1], [Source 2] when referencing specific information.
- Be concise. Answer in 1 to 3 paragraphs unless the question requires more detail.

Context:
${context}`,
    messages: [{ role: "user", content: question }],
  });

  const answer =
    response.content[0].type === "text" ? response.content[0].text : "";

  return {
    answer,
    sources: [...new Set(chunks.map((c) => c.source))],
  };
}
```
Generation tips
- Temperature 0 to 0.3 for factual Q&A. Higher temperature introduces creativity you do not want.
- Include source citations in the system prompt. This forces the model to ground its answer in specific documents.
- Set a "not found" instruction so the model says it does not know instead of guessing.
- Limit context size to 3 to 5 chunks. More context is not always better; irrelevant chunks confuse the model.
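One way to act on the last tip is to drop low-scoring chunks before building the context. The helper name and the 0.3 similarity floor are illustrative starting points, not fixed values; tune the threshold on your own data.

```typescript
interface Retrieved {
  text: string;
  source: string;
  similarity: number; // cosine similarity from the retrieval query
}

// Keep only chunks above a similarity floor, capped at maxChunks.
// Assumes the input is already sorted best-first by the retrieval query.
function filterByRelevance(
  chunks: Retrieved[],
  minSimilarity: number = 0.3,
  maxChunks: number = 5
): Retrieved[] {
  return chunks
    .filter((c) => c.similarity >= minSimilarity)
    .slice(0, maxChunks);
}
```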
Step 6: Improve Retrieval Quality
Basic vector similarity search works for 80% of queries. For the remaining 20%, use these techniques.
Hybrid search
Combine vector similarity with keyword search. Some queries are better served by exact term matching (product names, error codes, ID numbers).
```sql
-- Hybrid: combine vector similarity with text search.
-- query_embedding and query_text are placeholders for your parameters.
SELECT text, source,
  (0.7 * (1 - (embedding <=> query_embedding::vector))) +
  (0.3 * ts_rank(to_tsvector('english', text), plainto_tsquery('english', query_text)))
  AS combined_score
FROM documents
ORDER BY combined_score DESC
LIMIT 5;
```
Query expansion
Rephrase the user's question before searching. A question like "how do I cancel?" might miss documents about "subscription management" or "account deletion."
```typescript
async function expandQuery(question: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256,
    system:
      "Generate 3 alternative phrasings of this search query. Return as a JSON array of strings.",
    messages: [{ role: "user", content: question }],
  });
  const text =
    response.content[0].type === "text" ? response.content[0].text : "[]";
  try {
    return JSON.parse(text);
  } catch {
    return []; // Model returned something other than a JSON array
  }
}
```
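To use the expanded phrasings, search with each one and merge the results, deduplicating by chunk text. The merge helper below is an illustrative sketch; `expandQuery` and `retrieveRelevant` are the functions defined earlier in this guide.

```typescript
interface Hit {
  text: string;
  source: string;
}

// Merge result sets from multiple query phrasings, deduplicating by
// text and preserving first-seen order (each set arrives best-first).
function mergeResults(resultSets: Hit[][], limit: number = 5): Hit[] {
  const seen = new Set<string>();
  const merged: Hit[] = [];
  for (const hits of resultSets) {
    for (const h of hits) {
      if (seen.has(h.text)) continue;
      seen.add(h.text);
      merged.push(h);
    }
  }
  return merged.slice(0, limit);
}

// Usage with the guide's functions (sketch, not runnable standalone):
// const queries = [question, ...(await expandQuery(question))];
// const sets = await Promise.all(queries.map((q) => retrieveRelevant(q, 5)));
// const chunks = mergeResults(sets, 5);
```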
Metadata filtering
Add metadata to chunks during ingestion (category, date, department) and let users filter retrieval.
```typescript
// Only search product documentation, not policies
const chunks = await db.execute(sql`
  SELECT text, source
  FROM documents
  WHERE metadata->>'category' = 'product-docs'
  ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
  LIMIT 5
`);
```
Step 7: Build the API and UI
Wrap your RAG pipeline in an API endpoint and connect it to a chat interface.
```typescript
import { Hono } from "hono";

const app = new Hono();

app.post("/api/ask", async (c) => {
  const { question } = await c.req.json();
  if (!question || question.length > 1000) {
    return c.json({ error: "Invalid question" }, 400);
  }
  const result = await ragQuery(question);
  // Log for evaluation ("queryLog" is a logging table you define)
  await db.insert(queryLog).values({
    question,
    answer: result.answer,
    sources: result.sources,
  });
  return c.json(result);
});
```
The UI should display:
- The AI-generated answer
- Source documents with links (so users can verify)
- A feedback mechanism (helpful / not helpful)
- Suggested follow-up questions
Step 8: Evaluate and Iterate
Building an evaluation set
Create 50 to 100 question and answer pairs from your documents. For each:
- Write the question a user would ask
- Write the ideal answer
- Note which document(s) contain the answer
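Two entries in such a set might look like this; the questions, answers, and source file names are made up for illustration, and the field names match the `TestCase` shape consumed by the evaluation code below.

```typescript
interface TestCase {
  question: string;       // what a user would actually ask
  expectedAnswer: string; // the ideal answer
  expectedSource: string; // the document that contains it
}

const testSet: TestCase[] = [
  {
    question: "How long do refunds take to process?",
    expectedAnswer: "Refunds are processed within 5 to 7 business days.",
    expectedSource: "refund-policy.md",
  },
  {
    question: "Can I change my subscription plan mid-cycle?",
    expectedAnswer: "Plan changes take effect at the next billing cycle.",
    expectedSource: "billing-faq.md",
  },
];
```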
Automated evaluation
```typescript
// TestCase holds { question, expectedAnswer, expectedSource };
// EvalResults holds the two accuracy ratios; judgeAccuracy is an
// LLM-as-judge helper you define that returns "correct" or "incorrect".
async function evaluateRAG(testSet: TestCase[]): Promise<EvalResults> {
  let retrievalHits = 0;
  let accurateAnswers = 0;
  for (const test of testSet) {
    const chunks = await retrieveRelevant(test.question, 5);
    const result = await ragQuery(test.question);
    // Check if the right source was retrieved
    if (chunks.some((c) => c.source === test.expectedSource)) {
      retrievalHits++;
    }
    // Use an LLM to judge answer accuracy
    const judgment = await judgeAccuracy(
      test.question,
      test.expectedAnswer,
      result.answer
    );
    if (judgment === "correct") accurateAnswers++;
  }
  return {
    retrievalAccuracy: retrievalHits / testSet.length,
    answerAccuracy: accurateAnswers / testSet.length,
  };
}
```
Run this evaluation after every change to chunking strategy, embedding model, or system prompt.
DIY vs Hire an Agency
Build it yourself when:
- Your knowledge base is small (under 200 documents)
- You are comfortable with TypeScript or Python and SQL
- Basic Q&A over documents is the primary use case
- You have time to iterate on retrieval quality
Hire an agency when:
- The RAG system is customer-facing (accuracy matters significantly)
- You need advanced features (hybrid search, multimodal retrieval, real-time ingestion)
- Your data includes PDFs, scanned documents, or multimedia
- You need integration with existing systems (CRM, support tools, databases)
At HouseofMVPs, we build RAG applications with production-grade retrieval, evaluation pipelines, and source citation. Starting at $5,000 with 14-day delivery. See our RAG case study for a real example.
Common Mistakes
Skipping the data cleaning step. Dirty data produces wrong answers. Spend 30% of project time on data preparation.
Using chunks that are too large. Large chunks retrieve well but include too much irrelevant context. Stay under 500 words per chunk.
Not testing retrieval separately from generation. If the wrong documents are retrieved, the best LLM cannot generate a correct answer. Test retrieval accuracy independently.
Embedding the question differently than the documents. Use the same embedding model for both queries and documents. Mixing models produces poor similarity scores.
Not updating the knowledge base. A RAG system with stale data gives stale answers. Automate the ingestion pipeline to run on a schedule. Use the AI Readiness Assessment to evaluate whether your data is ready for a production RAG implementation before you build.
For building an AI agent that uses RAG alongside tool use, read how to build an AI agent. For business level AI integration strategy, see how to integrate AI into your business.
