
How to Build a RAG Application: Search Your Own Data With AI

TL;DR: Building a RAG application means connecting a large language model to your documents, databases, and knowledge bases so it answers questions from your actual data instead of its training data. This guide covers chunking, embedding, vector storage, retrieval, and generation with working code.

HouseofMVPs · 8 min read

What Is RAG

RAG (Retrieval Augmented Generation) is a two step pattern:

  1. Retrieve: Find the most relevant documents or passages from your data
  2. Generate: Send those documents plus the user's question to an LLM, which generates an answer grounded in your actual content

Without RAG, an LLM answers from its training data. With RAG, it answers from your data. This is the difference between a generic AI assistant and one that knows your products, policies, and customers.

Step 1: Prepare Your Data

RAG quality depends entirely on input data quality. Spend more time here than anywhere else.

Supported data sources

Source | Extraction Method
Markdown files | Read directly (already text)
PDF documents | pdf-parse or Apache Tika
Word documents | mammoth.js
Web pages | Cheerio (HTML to text)
Database records | SQL query, format as text
CSV/Excel | Papa Parse or SheetJS
Confluence/Notion | API export

Data cleaning

Before chunking, clean your raw content:

  • Remove headers, footers, and navigation elements from web pages
  • Strip formatting artifacts from PDF extraction
  • Remove duplicate content (same FAQ appearing on 10 different pages)
  • Fix encoding issues (special characters, smart quotes)
  • Remove content that is outdated or no longer accurate
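The bullet points above translate into a small normalization pass. A minimal sketch (the `cleanText` and `dedupe` names and the specific regexes are illustrative, not from any library; adapt the rules to what your sources actually produce):

```typescript
function cleanText(raw: string): string {
  return raw
    // Normalize smart quotes and dashes left over from PDF/word-processor exports
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"')
    .replace(/[\u2013\u2014]/g, "-")
    // Replace non-breaking spaces and collapse runs of spaces/tabs
    .replace(/\u00A0/g, " ")
    .replace(/[ \t]+/g, " ")
    // Collapse 3+ newlines (a common PDF extraction artifact) to paragraph breaks
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Drop duplicate content (e.g. the same FAQ appearing on 10 different pages),
// comparing on a whitespace- and case-normalized key
function dedupe(texts: string[]): string[] {
  const seen = new Set<string>();
  return texts.filter((t) => {
    const key = t.toLowerCase().replace(/\s+/g, " ").trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```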

Step 2: Chunk Your Documents

Chunking means splitting documents into smaller pieces that the retrieval system can search and the LLM can process.

Chunking strategies

Fixed size chunks: Split every 300 to 500 words with 50 word overlap. Simple to implement. Works well for uniform content like documentation.

Semantic chunks: Split at natural boundaries (paragraphs, sections, headings). Produces more coherent chunks. Better for structured documents.

Recursive splitting: Try to split at paragraph breaks first. If paragraphs are too long, split at sentence breaks. If sentences are too long, split at word boundaries.

Implementation

interface Chunk {
  text: string;
  source: string;
  metadata: {
    title?: string;
    section?: string;
    page?: number;
    url?: string;
  };
}

function chunkDocument(
  text: string,
  source: string,
  maxChunkSize: number = 500,
  overlap: number = 50
): Chunk[] {
  const words = text.split(/\s+/);
  const chunks: Chunk[] = [];

  for (let i = 0; i < words.length; i += maxChunkSize - overlap) {
    const chunkWords = words.slice(i, i + maxChunkSize);
    if (chunkWords.length < 20) continue; // Skip tiny trailing chunks

    chunks.push({
      text: chunkWords.join(" "),
      source,
      metadata: {},
    });
  }

  return chunks;
}
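The recursive strategy described above can be sketched the same way. This version (the `recursiveSplit` name and thresholds are illustrative) tries paragraph breaks first, then sentence breaks, then falls back to plain word-boundary splitting:

```typescript
function recursiveSplit(text: string, maxWords: number = 500): string[] {
  const wordCount = (s: string) => s.split(/\s+/).filter(Boolean).length;
  if (wordCount(text) <= maxWords) return text.trim() ? [text.trim()] : [];

  // 1. Try paragraph breaks first
  const paragraphs = text.split(/\n\s*\n/);
  if (paragraphs.length > 1) {
    return paragraphs.flatMap((p) => recursiveSplit(p, maxWords));
  }

  // 2. Then sentence breaks, packing sentences greedily up to maxWords
  const sentences = text.split(/(?<=[.!?])\s+/);
  if (sentences.length > 1) {
    const chunks: string[] = [];
    let current = "";
    for (const s of sentences) {
      if (current && wordCount(current + " " + s) > maxWords) {
        chunks.push(current.trim());
        current = s;
      } else {
        current = current ? current + " " + s : s;
      }
    }
    if (current.trim()) chunks.push(current.trim());
    // A single very long sentence can still exceed the limit; recurse on those
    return chunks.flatMap((c) =>
      wordCount(c) > maxWords ? recursiveSplit(c, maxWords) : [c]
    );
  }

  // 3. Fall back to word-boundary splitting
  const words = text.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    out.push(words.slice(i, i + maxWords).join(" "));
  }
  return out;
}
```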

Chunk size guidelines

Content Type | Recommended Chunk Size | Overlap
FAQ answers | 100 to 200 words | 0 (each FAQ is one chunk)
Documentation | 300 to 500 words | 50 words
Long articles | 400 to 600 words | 75 words
Legal/compliance | 200 to 300 words | 50 words

Smaller chunks mean more precise retrieval but less context per chunk. Larger chunks provide more context but may include irrelevant information. Test both sizes and compare answer quality.

Step 3: Generate Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search. These vectors are stored in a vector database that supports fast similarity queries.
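To build intuition for what "similar vectors" means: retrieval typically ranks by cosine similarity, the normalized dot product of two embedding vectors. A small sketch (the function name is illustrative; in production the database computes this for you, as in Step 4):

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // 1 = same direction, 0 = unrelated, -1 = opposite
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```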

Embedding models

Model | Dimensions | Cost | Best For
text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Most applications
text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Higher accuracy needs
Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual content

text-embedding-3-small is the default choice. It is cheap, fast, and accurate enough for most RAG applications. For deciding between RAG and custom model training, see our RAG vs fine tuning comparison.

Embedding implementation

import { OpenAI } from "openai";

const openai = new OpenAI();

async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

async function embedChunks(chunks: Chunk[]): Promise<void> {
  // Process in batches of 100 to respect rate limits.
  // `db` and `documents` are your database client and table (set up in Step 4).
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const texts = batch.map((c) => c.text);

    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: texts,
    });

    for (let j = 0; j < batch.length; j++) {
      await db.insert(documents).values({
        text: batch[j].text,
        source: batch[j].source,
        metadata: batch[j].metadata,
        embedding: response.data[j].embedding,
      });
    }
  }
}

Embedding 1,000 chunks costs about $0.01. Even large knowledge bases with 10,000 chunks cost under $1 to embed.

Step 4: Set Up Vector Storage

Store embeddings in a database that supports similarity search.

PostgreSQL with pgvector

The simplest option if you already use PostgreSQL. Add the pgvector extension and create a table for your chunks. Our AI integration services include RAG pipeline setup for teams that need this live quickly.

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT now()
);

-- Create an index for fast similarity search.
-- Build this after bulk loading: ivfflat derives its clusters from existing
-- rows, so an index built on an empty table gives poor recall.
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Retrieval query

async function retrieveRelevant(
  query: string,
  limit: number = 5
): Promise<(Chunk & { similarity: number })[]> {
  const queryEmbedding = await embedText(query);

  // pgvector expects the vector as a literal like '[0.1,0.2,...]';
  // JSON.stringify produces exactly that format for a number array
  const vector = JSON.stringify(queryEmbedding);

  const results = await db.execute(sql`
    SELECT text, source, metadata,
           1 - (embedding <=> ${vector}::vector) as similarity
    FROM documents
    ORDER BY embedding <=> ${vector}::vector
    LIMIT ${limit}
  `);

  return results.rows.map((r) => ({
    text: r.text,
    source: r.source,
    metadata: r.metadata,
    similarity: r.similarity,
  }));
}

The <=> operator computes cosine distance. Lower distance means higher similarity. The query returns the 5 most similar chunks to the user's question.

Step 5: Build the Generation Step

Now combine retrieval with generation. The user's question and the retrieved chunks go to the LLM together.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function ragQuery(question: string): Promise<{
  answer: string;
  sources: string[];
}> {
  // 1. Retrieve relevant chunks
  const chunks = await retrieveRelevant(question, 5);

  // 2. Format context
  const context = chunks
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.text}`)
    .join("\n\n---\n\n");

  // 3. Generate answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5-20250929",
    max_tokens: 1024,
    system: `You are a helpful assistant that answers questions using the provided context.

Rules:
- Only answer from the provided context. Do not use outside knowledge.
- If the context does not contain the answer, say "I could not find this information in the available documents."
- Cite sources by number [Source 1], [Source 2] when referencing specific information.
- Be concise. Answer in 1 to 3 paragraphs unless the question requires more detail.

Context:
${context}`,
    messages: [{ role: "user", content: question }],
  });

  const answer =
    response.content[0].type === "text" ? response.content[0].text : "";

  return {
    answer,
    sources: [...new Set(chunks.map((c) => c.source))],
  };
}

Generation tips

  • Temperature 0 to 0.3 for factual Q&A. Higher temperature introduces creativity you do not want.
  • Include source citations in the system prompt. This forces the model to ground its answer in specific documents.
  • Set a "not found" instruction so the model says it does not know instead of guessing.
  • Limit context size to 3 to 5 chunks. More context is not always better; irrelevant chunks confuse the model.

Step 6: Improve Retrieval Quality

Basic vector similarity search works for 80% of queries. For the remaining 20%, use these techniques.

Hybrid search

Combine vector similarity with keyword search. Some queries are better served by exact term matching (product names, error codes, ID numbers).

-- Hybrid: combine vector similarity with text search
SELECT text, source,
  (0.7 * (1 - (embedding <=> query_embedding::vector))) +
  (0.3 * ts_rank(to_tsvector('english', text), plainto_tsquery('english', query_text)))
  as combined_score
FROM documents
ORDER BY combined_score DESC
LIMIT 5;
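An alternative to tuning the 0.7/0.3 weights is reciprocal rank fusion (RRF), which merges the two ranked result lists by rank position alone, so the vector and keyword scores never need to be on the same scale. A sketch, assuming each result carries a stable id (the `Ranked` and `rrfMerge` names are illustrative):

```typescript
interface Ranked {
  id: string;
  text: string;
}

// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank).
// k = 60 is the conventional constant; larger k flattens rank differences.
function rrfMerge(lists: Ranked[][], k: number = 60): Ranked[] {
  const scores = new Map<string, { item: Ranked; score: number }>();
  for (const list of lists) {
    list.forEach((item, rank) => {
      const entry = scores.get(item.id) ?? { item, score: 0 };
      entry.score += 1 / (k + rank + 1);
      scores.set(item.id, entry);
    });
  }
  return [...scores.values()]
    .sort((x, y) => y.score - x.score)
    .map((e) => e.item);
}
```

A document that appears in both the vector list and the keyword list accumulates score from each, so it naturally rises to the top.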

Query expansion

Rephrase the user's question before searching. A question like "how do I cancel?" might miss documents about "subscription management" or "account deletion."

async function expandQuery(question: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256,
    system:
      "Generate 3 alternative phrasings of this search query. Return as a JSON array of strings.",
    messages: [{ role: "user", content: question }],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "[]";
  try {
    return JSON.parse(text);
  } catch {
    return []; // The model occasionally wraps the array in prose; treat as no expansions
  }
}
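Once you have the expanded phrasings, run retrieval for each one and deduplicate before building the context. The merge step is pure and easy to test; a sketch (the `mergeRetrieved` name is illustrative), keeping the best similarity score per chunk:

```typescript
// Deduplicate chunks from multiple retrieval passes, keeping the
// highest-similarity copy of each chunk and re-sorting the merged list
function mergeRetrieved<T extends { text: string; similarity: number }>(
  lists: T[][],
  limit: number = 5
): T[] {
  const best = new Map<string, T>();
  for (const list of lists) {
    for (const chunk of list) {
      const existing = best.get(chunk.text);
      if (!existing || chunk.similarity > existing.similarity) {
        best.set(chunk.text, chunk);
      }
    }
  }
  return [...best.values()]
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```

In use, you would call `retrieveRelevant` (from Step 4) once for the original question and once per expansion, then pass all the result lists to this function.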

Metadata filtering

Add metadata to chunks during ingestion (category, date, department) and let users filter retrieval.

// Only search product documentation, not policies
const chunks = await db.execute(sql`
  SELECT text, source
  FROM documents
  WHERE metadata->>'category' = 'product-docs'
  ORDER BY embedding <=> ${queryEmbedding}::vector
  LIMIT 5
`);

Step 7: Build the API and UI

Wrap your RAG pipeline in an API endpoint and connect it to a chat interface.

import { Hono } from "hono";

const app = new Hono();

app.post("/api/ask", async (c) => {
  const { question } = await c.req.json();

  if (!question || question.length > 1000) {
    return c.json({ error: "Invalid question" }, 400);
  }

  const result = await ragQuery(question);

  // Log for evaluation
  await db.insert(queryLog).values({
    question,
    answer: result.answer,
    sources: result.sources,
  });

  return c.json(result);
});

The UI should display:

  • The AI generated answer
  • Source documents with links (so users can verify)
  • A feedback mechanism (helpful / not helpful)
  • Suggested follow up questions

Step 8: Evaluate and Iterate

Building an evaluation set

Create 50 to 100 question and answer pairs from your documents. For each:

  • Write the question a user would ask
  • Write the ideal answer
  • Note which document(s) contain the answer
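A minimal shape for those entries, matching what the evaluation code in the next section consumes. The two sample cases are invented placeholders; replace them with questions and answers drawn from your own documents:

```typescript
interface TestCase {
  question: string;
  expectedAnswer: string;
  expectedSource: string; // The document that should be retrieved
}

interface EvalResults {
  retrievalAccuracy: number; // Fraction of queries where the right source was retrieved
  answerAccuracy: number; // Fraction of answers the judge marked correct
}

const testSet: TestCase[] = [
  {
    question: "What is the refund window?",
    expectedAnswer: "Refunds are available within 30 days of purchase.",
    expectedSource: "policies/refunds.md",
  },
  {
    question: "Which plans include SSO?",
    expectedAnswer: "SSO is available on the Business and Enterprise plans.",
    expectedSource: "docs/authentication.md",
  },
];
```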

Automated evaluation

async function evaluateRAG(testSet: TestCase[]): Promise<EvalResults> {
  let retrievalHits = 0;
  let accurateAnswers = 0;

  for (const test of testSet) {
    const chunks = await retrieveRelevant(test.question, 5);
    const result = await ragQuery(test.question);

    // Check if the right source was retrieved
    if (chunks.some((c) => c.source === test.expectedSource)) {
      retrievalHits++;
    }

    // Use an LLM to judge answer accuracy
    const judgment = await judgeAccuracy(
      test.question,
      test.expectedAnswer,
      result.answer
    );
    if (judgment === "correct") accurateAnswers++;
  }

  return {
    retrievalAccuracy: retrievalHits / testSet.length,
    answerAccuracy: accurateAnswers / testSet.length,
  };
}

Run this evaluation after every change to chunking strategy, embedding model, or system prompt.
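The `judgeAccuracy` helper used above is an LLM-as-judge call. A sketch with the model call injected as a parameter so the grading logic stays testable; the prompt wording, the `Judgment` type, and the `parseJudgment` helper are all illustrative:

```typescript
type Judgment = "correct" | "incorrect";

// Normalize a free-form judge response into a verdict.
// Defaults to "incorrect" so a confused judge never inflates scores.
function parseJudgment(raw: string): Judgment {
  return raw.trim().toLowerCase().startsWith("correct") ? "correct" : "incorrect";
}

async function judgeAccuracy(
  question: string,
  expected: string,
  actual: string,
  llm: (prompt: string) => Promise<string>
): Promise<Judgment> {
  const prompt = [
    "You are grading a RAG system's answer against a reference answer.",
    'Reply with exactly one word: "correct" or "incorrect".',
    `Question: ${question}`,
    `Reference answer: ${expected}`,
    `System answer: ${actual}`,
  ].join("\n\n");
  return parseJudgment(await llm(prompt));
}
```

In practice, `llm` is a thin wrapper around your Anthropic or OpenAI client that sends the prompt and returns the response text.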

DIY vs Hire an Agency

Build it yourself when:

  • Your knowledge base is small (under 200 documents)
  • You are comfortable with TypeScript or Python and SQL
  • Basic Q&A over documents is the primary use case
  • You have time to iterate on retrieval quality

Hire an agency when:

  • The RAG system is customer facing (accuracy matters significantly)
  • You need advanced features (hybrid search, multi modal, real time ingestion)
  • Your data includes PDFs, scanned documents, or multimedia
  • You need integration with existing systems (CRM, support tools, databases)

At HouseofMVPs, we build RAG applications with production grade retrieval, evaluation pipelines, and source citation. Starting at $5,000 with 14 day delivery. See our RAG case study for a real example.

Common Mistakes

Skipping the data cleaning step. Dirty data produces wrong answers. Spend 30% of project time on data preparation.

Using chunks that are too large. Large chunks retrieve well but include too much irrelevant context. Stay under 500 words per chunk.

Not testing retrieval separately from generation. If the wrong documents are retrieved, the best LLM cannot generate a correct answer. Test retrieval accuracy independently.

Embedding the question differently than the documents. Use the same embedding model for both queries and documents. Mixing models produces poor similarity scores.

Not updating the knowledge base. A RAG system with stale data gives stale answers. Automate the ingestion pipeline to run on a schedule. Use the AI Readiness Assessment to evaluate whether your data is ready for a production RAG implementation before you build.
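A cheap way to automate that pipeline is to hash each document's cleaned text and re-embed only when the hash changes. A sketch using Node's built-in crypto module (the `contentHash` and `needsReingest` names are illustrative):

```typescript
import { createHash } from "node:crypto";

// Stable fingerprint of a document's text content
function contentHash(text: string): string {
  return createHash("sha256").update(text.trim()).digest("hex");
}

// Compare the hash stored at the last ingestion run with the fresh content.
// Returns true when the document needs re-chunking and re-embedding.
function needsReingest(storedHash: string | null, freshText: string): boolean {
  return storedHash === null || storedHash !== contentHash(freshText);
}
```

Run this check on a schedule (cron, or a queue triggered by source-system webhooks) and you only pay embedding costs for documents that actually changed.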

For building an AI agent that uses RAG alongside tool use, read how to build an AI agent. For business level AI integration strategy, see how to integrate AI into your business.
