How to Build a RAG Application: Search Your Own Data With AI
TL;DR: Building a RAG application means connecting a large language model to your documents, databases, and knowledge bases so it answers questions from your actual data instead of its training data. This guide covers chunking, embedding, vector storage, retrieval, and generation with working code.
What Is RAG?
RAG (Retrieval-Augmented Generation) is a two-step pattern:
- Retrieve: Find the most relevant documents or passages from your data
- Generate: Send those documents plus the user's question to an LLM, which generates an answer grounded in your actual content
Without RAG, an LLM answers from its training data. With RAG, it answers from your data. This is the difference between a generic AI assistant and one that knows your products, policies, and customers.
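The two-step pattern can be sketched in a few lines. Both helpers here are stand-ins for the real implementations built in Steps 4 and 5; the document text and source name are made up for illustration.

```typescript
// Minimal RAG flow: retrieve, then generate.
type Doc = { text: string; source: string };

async function retrieveRelevant(question: string): Promise<Doc[]> {
  // Stand-in: a real version does vector similarity search (Step 4).
  return [{ text: "Refunds are issued within 14 days.", source: "policy.md" }];
}

async function generateAnswer(question: string, docs: Doc[]): Promise<string> {
  // Stand-in: a real version sends the docs to an LLM as context (Step 5).
  const context = docs.map((d) => d.text).join("\n");
  return `Answer based on: ${context}`;
}

async function rag(question: string): Promise<string> {
  const docs = await retrieveRelevant(question); // 1. Retrieve
  return generateAnswer(question, docs);         // 2. Generate
}
```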
Step 1: Prepare Your Data
RAG quality depends entirely on input data quality. Spend more time here than anywhere else.
Supported data sources
| Source | Extraction Method |
|---|---|
| Markdown files | Read directly (already text) |
| PDF documents | pdf-parse or Apache Tika |
| Word documents | mammoth.js |
| Web pages | Cheerio (HTML to text) |
| Database records | SQL query, format as text |
| CSV/Excel | Papa Parse or SheetJS |
| Confluence/Notion | API export |
Data cleaning
Before chunking, clean your raw content:
- Remove headers, footers, and navigation elements from web pages
- Strip formatting artifacts from PDF extraction
- Remove duplicate content (same FAQ appearing on 10 different pages)
- Fix encoding issues (special characters, smart quotes)
- Remove content that is outdated or no longer accurate
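A cleaning pass for the first few items might look like this sketch; the function names and the exact rules are illustrative, and you should tune them to your own corpus.

```typescript
// Normalize whitespace and fix common encoding artifacts.
function cleanText(raw: string): string {
  return raw
    .replace(/[\u2018\u2019]/g, "'")  // smart single quotes -> ASCII
    .replace(/[\u201C\u201D]/g, '"')  // smart double quotes -> ASCII
    .replace(/\u00A0/g, " ")          // non-breaking spaces
    .replace(/[ \t]+/g, " ")          // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, "\n\n")       // collapse runs of blank lines
    .trim();
}

// Drop exact duplicates (the same FAQ pasted on many pages),
// comparing case- and whitespace-insensitively.
function dedupe(texts: string[]): string[] {
  const seen = new Set<string>();
  return texts.filter((t) => {
    const key = t.toLowerCase().replace(/\s+/g, " ").trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```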
Step 2: Chunk Your Documents
Chunking means splitting documents into smaller pieces that the retrieval system can search and the LLM can process.
Chunking strategies
Fixed-size chunks: Split every 300 to 500 words with a 50-word overlap. Simple to implement. Works well for uniform content like documentation.
Semantic chunks: Split at natural boundaries (paragraphs, sections, headings). Produces more coherent chunks. Better for structured documents.
Recursive splitting: Try to split at paragraph breaks first. If paragraphs are too long, split at sentence breaks. If sentences are too long, split at word boundaries.
Implementation
```typescript
interface Chunk {
  text: string;
  source: string;
  metadata: {
    title?: string;
    section?: string;
    page?: number;
    url?: string;
  };
}

function chunkDocument(
  text: string,
  source: string,
  maxChunkSize: number = 500,
  overlap: number = 50
): Chunk[] {
  const words = text.split(/\s+/);
  const chunks: Chunk[] = [];
  for (let i = 0; i < words.length; i += maxChunkSize - overlap) {
    const chunkWords = words.slice(i, i + maxChunkSize);
    if (chunkWords.length < 20) continue; // Skip tiny trailing chunks
    chunks.push({
      text: chunkWords.join(" "),
      source,
      metadata: {},
    });
  }
  return chunks;
}
```
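The fixed-size splitter covers uniform content; the recursive strategy described above can be sketched like this. The separator order and limits are illustrative, not a standard implementation.

```typescript
// Recursive splitting: try paragraph breaks first, then sentence
// breaks, then fall back to plain word windows.
function recursiveSplit(text: string, maxWords: number = 500): string[] {
  const words = (t: string) => t.split(/\s+/).filter(Boolean).length;
  if (words(text) <= maxWords) return text.trim() ? [text.trim()] : [];

  // Try progressively finer separators: paragraphs, then sentences.
  for (const sep of [/\n\s*\n/, /(?<=[.!?])\s+/]) {
    const parts = text.split(sep).filter((p) => p.trim().length > 0);
    if (parts.length > 1) {
      // Greedily pack parts into chunks under the limit, recursing
      // into any accumulated run that is still too long.
      const chunks: string[] = [];
      let current = "";
      for (const part of parts) {
        if (current && words(current + " " + part) > maxWords) {
          chunks.push(...recursiveSplit(current, maxWords));
          current = part;
        } else {
          current = current ? current + " " + part : part;
        }
      }
      if (current) chunks.push(...recursiveSplit(current, maxWords));
      return chunks;
    }
  }

  // Last resort: fixed-size word windows.
  const all = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < all.length; i += maxWords) {
    chunks.push(all.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}
```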
Chunk size guidelines
| Content Type | Recommended Chunk Size | Overlap |
|---|---|---|
| FAQ answers | 100 to 200 words | 0 (each FAQ is one chunk) |
| Documentation | 300 to 500 words | 50 words |
| Long articles | 400 to 600 words | 75 words |
| Legal/compliance | 200 to 300 words | 50 words |
Smaller chunks mean more precise retrieval but less context per chunk. Larger chunks provide more context but may include irrelevant information. Test both sizes and compare answer quality.
Step 3: Generate Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search. These vectors are stored in a vector database that supports fast similarity queries.
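"Similar texts produce similar vectors" is usually measured with cosine similarity. A toy version of the math, using 3-dimensional vectors instead of the 1,536+ dimensions of a real embedding model, with made-up numbers:

```typescript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings:
const refund = [0.9, 0.1, 0.2];    // "how do I get a refund?"
const cancel = [0.85, 0.15, 0.25]; // "cancel my order"
const weather = [0.1, 0.9, 0.1];   // "what's the weather?"

cosineSimilarity(refund, cancel);  // close to 1: related meanings
cosineSimilarity(refund, weather); // much lower: unrelated
```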
Embedding models
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | Most applications |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Higher accuracy needs |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual content |
text-embedding-3-small is the default choice. It is cheap, fast, and accurate enough for most RAG applications. For deciding between RAG and custom model training, see our RAG vs fine-tuning comparison.
Embedding implementation
```typescript
import { OpenAI } from "openai";

const openai = new OpenAI();

async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

async function embedChunks(chunks: Chunk[]): Promise<void> {
  // Process in batches of 100 to respect rate limits
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const texts = batch.map((c) => c.text);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: texts,
    });
    // "db" and "documents" are your database client and table
    // (e.g. Drizzle ORM); the schema is created in Step 4.
    for (let j = 0; j < batch.length; j++) {
      await db.insert(documents).values({
        text: batch[j].text,
        source: batch[j].source,
        metadata: batch[j].metadata,
        embedding: response.data[j].embedding,
      });
    }
  }
}
```
At roughly 500 words (about 650 tokens) per chunk, embedding 1,000 chunks is about 650,000 tokens, which costs about $0.01 with text-embedding-3-small. Even a large knowledge base of 10,000 chunks costs well under $1 to embed.
Step 4: Set Up Vector Storage
Store embeddings in a database that supports similarity search.
PostgreSQL with pgvector
The simplest option if you already use PostgreSQL. Add the pgvector extension and create a table for your chunks. Our AI integration services include RAG pipeline setup for teams that need this live quickly.
```sql
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  text TEXT NOT NULL,
  source TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT now()
);

-- Create an index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
Retrieval query
```typescript
interface RetrievedChunk extends Chunk {
  similarity: number;
}

async function retrieveRelevant(
  query: string,
  limit: number = 5
): Promise<RetrievedChunk[]> {
  const queryEmbedding = await embedText(query);
  // pgvector expects the vector as a string literal like '[0.1,0.2,...]',
  // which JSON.stringify produces for a number array.
  const embeddingLiteral = JSON.stringify(queryEmbedding);
  const results = await db.execute(sql`
    SELECT text, source, metadata,
           1 - (embedding <=> ${embeddingLiteral}::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> ${embeddingLiteral}::vector
    LIMIT ${limit}
  `);
  return results.rows.map((r) => ({
    text: r.text,
    source: r.source,
    metadata: r.metadata,
    similarity: r.similarity,
  }));
}
```
The <=> operator computes cosine distance. Lower distance means higher similarity. The query returns the 5 most similar chunks to the user's question.
Step 5: Build the Generation Step
Now combine retrieval with generation. The user's question and the retrieved chunks go to the LLM together.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function ragQuery(question: string): Promise<{
  answer: string;
  sources: string[];
}> {
  // 1. Retrieve relevant chunks
  const chunks = await retrieveRelevant(question, 5);

  // 2. Format context
  const context = chunks
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.text}`)
    .join("\n\n---\n\n");

  // 3. Generate answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6-20250514",
    max_tokens: 1024,
    system: `You are a helpful assistant that answers questions using the provided context.

Rules:
- Only answer from the provided context. Do not use outside knowledge.
- If the context does not contain the answer, say "I could not find this information in the available documents."
- Cite sources by number [Source 1], [Source 2] when referencing specific information.
- Be concise. Answer in 1 to 3 paragraphs unless the question requires more detail.

Context:
${context}`,
    messages: [{ role: "user", content: question }],
  });

  const answer =
    response.content[0].type === "text" ? response.content[0].text : "";

  return {
    answer,
    sources: [...new Set(chunks.map((c) => c.source))],
  };
}
```
Generation tips
- Temperature 0 to 0.3 for factual Q&A. Higher temperature introduces creativity you do not want.
- Include source citations in the system prompt. This forces the model to ground its answer in specific documents.
- Set a "not found" instruction so the model says it does not know instead of guessing.
- Limit context size to 3 to 5 chunks. More context is not always better; irrelevant chunks confuse the model.
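One way to act on the last tip is to drop low-scoring chunks before building the context. The helper name and the 0.3 similarity floor are illustrative starting points, not fixed values; tune the threshold on your own data.

```typescript
interface Retrieved {
  text: string;
  source: string;
  similarity: number; // cosine similarity from the retrieval query
}

// Keep only chunks above a similarity floor, capped at maxChunks.
// Assumes the input is already sorted best-first by the retrieval query.
function filterByRelevance(
  chunks: Retrieved[],
  minSimilarity: number = 0.3,
  maxChunks: number = 5
): Retrieved[] {
  return chunks
    .filter((c) => c.similarity >= minSimilarity)
    .slice(0, maxChunks);
}
```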
Step 6: Improve Retrieval Quality
Basic vector similarity search works for 80% of queries. For the remaining 20%, use these techniques.
Hybrid search
Combine vector similarity with keyword search. Some queries are better served by exact term matching (product names, error codes, ID numbers).
```sql
-- Hybrid: combine vector similarity with text search.
-- query_embedding and query_text are placeholders for your parameters.
SELECT text, source,
  (0.7 * (1 - (embedding <=> query_embedding::vector))) +
  (0.3 * ts_rank(to_tsvector('english', text), plainto_tsquery('english', query_text)))
  AS combined_score
FROM documents
ORDER BY combined_score DESC
LIMIT 5;
```
Query expansion
Rephrase the user's question before searching. A question like "how do I cancel?" might miss documents about "subscription management" or "account deletion."
```typescript
async function expandQuery(question: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256,
    system:
      "Generate 3 alternative phrasings of this search query. Return as a JSON array of strings.",
    messages: [{ role: "user", content: question }],
  });
  const text =
    response.content[0].type === "text" ? response.content[0].text : "[]";
  try {
    return JSON.parse(text);
  } catch {
    return []; // Model returned something other than a JSON array
  }
}
```
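To use the expanded phrasings, search with each one and merge the results, deduplicating by chunk text. The merge helper below is an illustrative sketch; `expandQuery` and `retrieveRelevant` are the functions defined earlier in this guide.

```typescript
interface Hit {
  text: string;
  source: string;
}

// Merge result sets from multiple query phrasings, deduplicating by
// text and preserving first-seen order (each set arrives best-first).
function mergeResults(resultSets: Hit[][], limit: number = 5): Hit[] {
  const seen = new Set<string>();
  const merged: Hit[] = [];
  for (const hits of resultSets) {
    for (const h of hits) {
      if (seen.has(h.text)) continue;
      seen.add(h.text);
      merged.push(h);
    }
  }
  return merged.slice(0, limit);
}

// Usage with the guide's functions (sketch, not runnable standalone):
// const queries = [question, ...(await expandQuery(question))];
// const sets = await Promise.all(queries.map((q) => retrieveRelevant(q, 5)));
// const chunks = mergeResults(sets, 5);
```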
Metadata filtering
Add metadata to chunks during ingestion (category, date, department) and let users filter retrieval.
```typescript
// Only search product documentation, not policies
const chunks = await db.execute(sql`
  SELECT text, source
  FROM documents
  WHERE metadata->>'category' = 'product-docs'
  ORDER BY embedding <=> ${JSON.stringify(queryEmbedding)}::vector
  LIMIT 5
`);
```
Step 7: Build the API and UI
Wrap your RAG pipeline in an API endpoint and connect it to a chat interface.
```typescript
import { Hono } from "hono";

const app = new Hono();

app.post("/api/ask", async (c) => {
  const { question } = await c.req.json();
  if (!question || question.length > 1000) {
    return c.json({ error: "Invalid question" }, 400);
  }
  const result = await ragQuery(question);
  // Log for evaluation ("queryLog" is a logging table you define)
  await db.insert(queryLog).values({
    question,
    answer: result.answer,
    sources: result.sources,
  });
  return c.json(result);
});
```
The UI should display:
- The AI-generated answer
- Source documents with links (so users can verify)
- A feedback mechanism (helpful / not helpful)
- Suggested follow-up questions
Step 8: Evaluate and Iterate
Building an evaluation set
Create 50 to 100 question and answer pairs from your documents. For each:
- Write the question a user would ask
- Write the ideal answer
- Note which document(s) contain the answer
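Two entries in such a set might look like this; the questions, answers, and source file names are made up for illustration, and the field names match the `TestCase` shape consumed by the evaluation code below.

```typescript
interface TestCase {
  question: string;       // what a user would actually ask
  expectedAnswer: string; // the ideal answer
  expectedSource: string; // the document that contains it
}

const testSet: TestCase[] = [
  {
    question: "How long do refunds take to process?",
    expectedAnswer: "Refunds are processed within 5 to 7 business days.",
    expectedSource: "refund-policy.md",
  },
  {
    question: "Can I change my subscription plan mid-cycle?",
    expectedAnswer: "Plan changes take effect at the next billing cycle.",
    expectedSource: "billing-faq.md",
  },
];
```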
Automated evaluation
```typescript
// TestCase holds { question, expectedAnswer, expectedSource };
// EvalResults holds the two accuracy ratios; judgeAccuracy is an
// LLM-as-judge helper you define that returns "correct" or "incorrect".
async function evaluateRAG(testSet: TestCase[]): Promise<EvalResults> {
  let retrievalHits = 0;
  let accurateAnswers = 0;
  for (const test of testSet) {
    const chunks = await retrieveRelevant(test.question, 5);
    const result = await ragQuery(test.question);
    // Check if the right source was retrieved
    if (chunks.some((c) => c.source === test.expectedSource)) {
      retrievalHits++;
    }
    // Use an LLM to judge answer accuracy
    const judgment = await judgeAccuracy(
      test.question,
      test.expectedAnswer,
      result.answer
    );
    if (judgment === "correct") accurateAnswers++;
  }
  return {
    retrievalAccuracy: retrievalHits / testSet.length,
    answerAccuracy: accurateAnswers / testSet.length,
  };
}
```
Run this evaluation after every change to chunking strategy, embedding model, or system prompt.
DIY vs Hire an Agency
Build it yourself when:
- Your knowledge base is small (under 200 documents)
- You are comfortable with TypeScript or Python and SQL
- Basic Q&A over documents is the primary use case
- You have time to iterate on retrieval quality
Hire an agency when:
- The RAG system is customer-facing (accuracy matters significantly)
- You need advanced features (hybrid search, multimodal retrieval, real-time ingestion)
- Your data includes PDFs, scanned documents, or multimedia
- You need integration with existing systems (CRM, support tools, databases)
At HouseofMVPs, we build RAG applications with production-grade retrieval, evaluation pipelines, and source citation. Starting at $5,000 with 14-day delivery. See our RAG case study for a real example.
Common Mistakes
Skipping the data cleaning step. Dirty data produces wrong answers. Spend 30% of project time on data preparation.
Using chunks that are too large. Large chunks retrieve well but include too much irrelevant context. Stay under 500 words per chunk.
Not testing retrieval separately from generation. If the wrong documents are retrieved, the best LLM cannot generate a correct answer. Test retrieval accuracy independently.
Embedding the question differently than the documents. Use the same embedding model for both queries and documents. Mixing models produces poor similarity scores.
Not updating the knowledge base. A RAG system with stale data gives stale answers. Automate the ingestion pipeline to run on a schedule. Use the AI Readiness Assessment to evaluate whether your data is ready for a production RAG implementation before you build.
For building an AI agent that uses RAG alongside tool use, read how to build an AI agent. For business level AI integration strategy, see how to integrate AI into your business.
