Six days. That's how long it took to go from a stalled prototype to a production RAG system handling real user queries. The team had already spent 11 weeks trying to do it themselves before they called Boundev. They had a vector database. They had an LLM. They even had a working demo. What they did not have was a system that could survive actual usage — concurrent users, messy documents, latency requirements, and the kind of edge cases that never show up in a sandbox. This post is the full story: what we found when we looked under the hood, the architectural decisions we made on each day, the specific code patterns that made the difference, and the metrics when it all went live.
The System That Was "Almost Ready"
The customer — a B2B SaaS company selling workflow automation to mid-market operations teams — came to Boundev with a RAG prototype they'd built over two months. Their use case was solid: a document QA system that let customers query their own uploaded SOPs, contracts, and runbooks using natural language. The prototype worked in demos. It broke in staging.
The core problem was architectural. They had built a single-pipeline RAG system — one Python script that handled ingestion, embedding, retrieval, and generation in sequence. In a demo with one PDF and one user, this looked fine. Under five concurrent users with 50+ documents each, p95 latency hit 14 seconds. That's not a tuning problem. That's a structural one.
Their embedding model was text-embedding-ada-002, which they were re-calling at query time on every document chunk in the store — not just on the query. This is a classic misunderstanding of how retrieval works. You embed documents once during ingestion and store those vectors. At query time, you embed only the query, then run vector similarity search. Their system was doing the opposite.
There were three other issues we found in the first two hours:
- No reranker. The top-k retrieval returned chunks by cosine similarity alone, which consistently surfaced outdated document versions over newer ones.
- Chunk size fixed at 2,000 tokens. Too large for operational documents with dense bullet-point structure. Key facts were buried inside oversized chunks.
- No eval framework. No way to measure whether answers were actually correct — just whether they returned something.
This is not a failure of the team. It's the standard gap between a RAG proof-of-concept and a production RAG system. Most RAG applications that stall on the way to production stall precisely because of these architectural gaps.
Day 1–2: Architecture Rewrite
We did not patch the existing system. We rebuilt the pipeline architecture first, because every downstream decision depends on getting this right. The new structure separated ingestion and query into two completely independent services.
Ingestion pipeline (async, runs when documents are uploaded):
- Document parsing with the unstructured library for PDFs and .docx files
- Semantic chunking at 512 tokens with 64-token overlap
- Embedding via text-embedding-3-large — stored once per chunk
- Storage in Pinecone with metadata: doc_id, version, uploaded_at, customer_id
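The ingestion side doesn't appear in code later in this post, so here is a minimal sketch of what a pipeline in that shape looks like. The chunk_text helper and module layout are illustrative rather than the customer's actual code, and the fixed-size splitter is a stand-in for the semantic chunker covered on Day 3.

```python
# ingest_pipeline.py -- illustrative sketch of the async ingestion service
from unstructured.partition.auto import partition
from pinecone import Pinecone
import openai

pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("customer-docs")


def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Naive word-count splitter standing in for the structure-aware chunker (Day 3)
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def ingest_document(path: str, doc_id: str, version: str,
                    uploaded_at: str, customer_id: str) -> None:
    # 1. Parse the PDF / .docx into text elements
    elements = partition(filename=path)
    text = "\n".join(el.text for el in elements if el.text)

    # 2. Chunk, embed each chunk once, and upsert with metadata
    #    (one embedding call per chunk for clarity; production batches these)
    vectors = []
    for i, chunk in enumerate(chunk_text(text)):
        emb = openai.embeddings.create(
            input=chunk, model="text-embedding-3-large"
        ).data[0].embedding
        vectors.append({
            "id": f"{doc_id}-{version}-{i}",
            "values": emb,
            "metadata": {
                "text": chunk, "doc_id": doc_id, "version": version,
                "uploaded_at": uploaded_at, "customer_id": customer_id,
            },
        })
    index.upsert(vectors=vectors)
```

The vectors are written once, at upload time; nothing in the query path ever re-embeds stored documents.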
Query pipeline (real-time, per user request):
- Embed the query string only
- Hybrid retrieval: vector search (dense) + BM25 keyword search (sparse)
- Cross-encoder reranking from top-50 to top-5
- LLM generation with retrieved chunks as context
Here is the core retrieval function we built on Day 1:
```python
# query_pipeline.py
from pinecone import Pinecone
from sentence_transformers import CrossEncoder
import openai

pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("customer-docs")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def hybrid_retrieve(
    query: str,
    customer_id: str,
    top_k: int = 5,
) -> list[dict]:
    # 1. Dense vector retrieval: embed the query only, then search Pinecone
    #    scoped to this customer's documents
    q_emb = openai.embeddings.create(
        input=query,
        model="text-embedding-3-large",
    ).data[0].embedding

    dense = index.query(
        vector=q_emb,
        top_k=50,
        filter={"customer_id": {"$eq": customer_id}},
        include_metadata=True,
    )
    candidates = [
        {"text": r.metadata["text"], "score": r.score}
        for r in dense.matches
    ]

    # 2. Cross-encoder rerank to top_k
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [item[0] for item in ranked[:top_k]]
```
Notice that the cross-encoder reranker runs on only 50 candidates — not the full corpus. This is the standard production pattern: retrieve broadly with fast approximate search, then rerank precisely with a heavier model. Cross-encoder reranking consistently adds 5–15% accuracy improvement on top of hybrid retrieval.
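The snippet above shows only the dense leg; the sparse BM25 leg feeds the same candidate pool before reranking. Below is a minimal sketch of that leg, assuming the rank_bm25 package and a locally available list of the customer's chunk texts; both are illustrative choices, not confirmed details of the build.

```python
# bm25_leg.py -- illustrative sketch of the sparse retrieval leg
from rank_bm25 import BM25Okapi


def bm25_retrieve(
    query: str,
    chunks: list[dict],   # assumed shape: [{"text": ..., "doc_id": ...}, ...]
    top_n: int = 50,
) -> list[dict]:
    # Whitespace tokenization keeps the sketch short; production would use
    # a proper tokenizer
    corpus = [c["text"].lower().split() for c in chunks]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]


def merge_candidates(dense: list[dict], sparse: list[dict]) -> list[dict]:
    # Deduplicate by text so the cross-encoder scores each chunk only once
    seen, merged = set(), []
    for c in dense + sparse:
        if c["text"] not in seen:
            seen.add(c["text"])
            merged.append(c)
    return merged
```

The merged, deduplicated pool then goes through the same cross-encoder rerank shown above.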
Day 3: Chunking Strategy and Metadata Design
Chunking is the part most teams underinvest in, and it's where retrieval quality is actually won or lost.
The customer's SOP documents had a specific structure: numbered steps, nested sub-steps, and version headers at the top. Fixed-size chunking at 2,000 tokens was splitting numbered steps across chunks, meaning a retrieved chunk would contain steps 4–7 of a procedure but not step 1 (the setup step), which was in the prior chunk. The LLM then generated answers that assumed context the user couldn't see. This is exactly the kind of failure that never appears in a demo with clean documents.
We switched to structure-aware semantic chunking using document headers and list boundaries as natural split points. We also added section_title and step_range to the metadata schema in Pinecone, so the reranker could weight chunks that contained step 1 of a procedure higher than chunks from the middle.
The chunk size landed at 512 tokens with 64-token overlap — a standard production sweet spot that keeps retrieval precision high without over-fragmenting semantic units.
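A simplified version of that structure-aware splitter is sketched below. The header pattern, the tiktoken tokenizer, and the coarse overlap handling are illustrative stand-ins for the customer's actual document conventions.

```python
# structure_chunker.py -- illustrative sketch of structure-aware chunking
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Assumed split points: markdown-style headers and numbered top-level steps
SPLIT_RE = re.compile(r"\n(?=#{1,3} |\d+\. )")


def chunk_structured(text: str, max_tokens: int = 512) -> list[dict]:
    pieces = [p for p in SPLIT_RE.split(text) if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    count = 0
    title = None
    for piece in pieces:
        n = len(enc.encode(piece))
        if current and count + n > max_tokens:
            chunks.append({"text": "\n".join(current), "section_title": title})
            # Coarse overlap: carry the last piece into the next chunk rather
            # than counting out exactly 64 tokens
            current = current[-1:]
            count = len(enc.encode(current[0]))
        first_line = piece.strip().splitlines()[0]
        if first_line.startswith("#"):
            title = first_line.lstrip("# ").strip()
        current.append(piece)
        count += n
    if current:
        chunks.append({"text": "\n".join(current), "section_title": title})
    return chunks
```

Each chunk carries its section_title into the Pinecone metadata alongside doc_id and version.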
The prototype broke because it was built for demos. Production RAG is a different class of engineering problem.
Day 4: Observability and Eval Framework
You cannot improve what you cannot measure. On Day 4, we built the evaluation and monitoring layer that the original system completely lacked.
We instrumented every step of the query pipeline with latency tracking using OpenTelemetry. The production targets we set match current industry benchmarks:
| Metric | Target | Alert Threshold |
|---|---|---|
| Query latency (p95) | < 2 seconds | > 5 seconds |
| Embedding latency (p95) | < 200ms | > 500ms |
| Vector search recall@10 | > 90% | < 80% |
| Cache hit rate | > 60% | < 40% |
| Error rate | < 0.1% | > 1% |
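Wiring those measurements in is mostly span boilerplate around the existing pipeline steps. A minimal sketch, assuming the standard opentelemetry-sdk packages with a console exporter standing in for the real collector:

```python
# tracing.py -- minimal OpenTelemetry sketch; exporter config is illustrative
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

from query_pipeline import hybrid_retrieve  # the Day 1 retrieval function

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("query_pipeline")


def answer_query(query: str, customer_id: str) -> str:
    with tracer.start_as_current_span("query") as span:
        span.set_attribute("customer_id", customer_id)
        with tracer.start_as_current_span("retrieve_and_rerank"):
            chunks = hybrid_retrieve(query, customer_id)
        with tracer.start_as_current_span("generate"):
            # LLM call with the retrieved chunks as context (omitted here)
            answer = "..."
        return answer
```

Per-step span latencies are what feed the p95 targets in the table above.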
For RAG-specific eval, we used RAGAS — an open-source evaluation framework that scores answers on faithfulness (does the answer stay within the retrieved context?), answer relevancy (does it actually address the query?), and context precision (did we retrieve the right chunks?). We built a 50-question golden dataset from the customer's real documents before going live. Baseline score: faithfulness 0.71, answer relevancy 0.68. After the architectural changes: faithfulness 0.91, answer relevancy 0.86. That is a measurable, reproducible improvement — not a vibe.
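Scoring the golden dataset with RAGAS takes only a few lines. The sketch below assumes the ragas 0.1-style API and the Hugging Face datasets package; the sample row is a placeholder, not the customer's data.

```python
# eval_ragas.py -- sketch assuming the ragas 0.1-style API
# Note: these metrics are LLM-scored, so an LLM API key must be set in the environment
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

golden = Dataset.from_dict({
    "question":     ["What is the approval process for vendor contracts?"],
    "answer":       ["Submit the contract via the procurement form, then ..."],
    "contexts":     [["Step 1: Submit the contract via the procurement form. ..."]],
    "ground_truth": ["Contracts are submitted via the procurement form and ..."],
})

result = evaluate(
    golden,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores, compared against the golden-dataset baseline
```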
If this is research for a task on your roadmap — we ship features like this in 5–7 days.
See pricing →
Day 5: Caching, Cost Control, and Security
A production RAG system has real operating costs, and those costs compound fast at scale.
We added semantic caching via Redis on Day 5. The idea is straightforward: if two users ask semantically similar questions (e.g., "what's the approval process for vendor contracts?" and "how do I get vendor contracts approved?"), the second query hits the cache rather than running a full retrieval-and-generation cycle. At production scale, semantic caching cuts LLM costs by up to 68.8% in typical workloads.
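The cache lookup can be sketched as below. This version stores cached query embeddings as plain Redis keys and scans them in Python, which is fine as an illustration; at real scale you would lean on Redis's native vector search instead. The key names and similarity threshold are illustrative.

```python
# semantic_cache.py -- illustrative sketch of semantic caching with Redis
import hashlib
import json

import numpy as np
import openai
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SIMILARITY_THRESHOLD = 0.92  # assumed; tuned against the false-positive rate


def embed(text: str) -> np.ndarray:
    emb = openai.embeddings.create(
        input=text, model="text-embedding-3-large"
    ).data[0].embedding
    return np.array(emb)


def cached_answer(customer_id: str, query: str) -> str | None:
    q = embed(query)
    # Scan only this customer's cache entries (multi-tenant isolation)
    for key in r.scan_iter(f"semcache:{customer_id}:*"):
        entry = json.loads(r.get(key))
        v = np.array(entry["embedding"])
        cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        if cos >= SIMILARITY_THRESHOLD:
            return entry["answer"]  # cache hit: skip retrieval and generation
    return None


def store_answer(customer_id: str, query: str, answer: str) -> None:
    digest = hashlib.sha1(query.encode()).hexdigest()[:16]
    r.set(
        f"semcache:{customer_id}:{digest}",
        json.dumps({"embedding": embed(query).tolist(), "answer": answer}),
        ex=60 * 60 * 24,  # 24h TTL so stale answers age out
    )
```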
For security, the customer's multi-tenant architecture required strict namespace isolation in Pinecone. Every query is filtered by customer_id at the vector database level — not at the application layer. Filtering at the application layer means every customer's documents are retrieved and then discarded, which is both a latency problem and a data exposure risk. Filtering at the vector database level means retrieval is scoped to the right customer from the start.
We also added rate limiting per customer account using a token-bucket algorithm in FastAPI middleware. This prevents one high-volume customer from degrading the experience for others during peak usage. Check the how-it-works page to see how our subscription model handles exactly this type of production build.
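A compact version of that per-customer token bucket as FastAPI middleware might look like the following. The bucket capacity, refill rate, and header name are illustrative defaults rather than the customer's actual limits, and a real deployment would keep the bucket state in Redis so it survives multiple replicas.

```python
# rate_limit.py -- illustrative token-bucket middleware for FastAPI
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
CAPACITY, REFILL_PER_SEC = 30, 0.5  # assumed: 30-request burst, ~30 requests/minute
# In-memory state (single process); production would back this with Redis
buckets: dict[str, tuple[float, float]] = {}  # customer_id -> (tokens, last_ts)


@app.middleware("http")
async def rate_limit(request: Request, call_next):
    customer_id = request.headers.get("x-customer-id", "anonymous")
    tokens, last = buckets.get(customer_id, (CAPACITY, time.monotonic()))
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at bucket capacity
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_PER_SEC)
    if tokens < 1:
        return JSONResponse({"detail": "rate limit exceeded"}, status_code=429)
    buckets[customer_id] = (tokens - 1, now)
    return await call_next(request)
```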
Day 6: Deployment, Load Testing, and Go-Live
On Day 6, the system shipped. Here's what the deployment looked like:
- Containerization: FastAPI app in Docker, deployed to AWS ECS Fargate
- Orchestration: Kubernetes-ready but deployed on ECS for simplicity at the customer's scale
- Load balancing: AWS ALB in front of the query service
- Autoscaling: triggered when p90 TTFT (Time to First Token) exceeds 2 seconds
- Monitoring: Prometheus + Grafana dashboard tracking the metrics from Day 4
We ran a 30-minute load test before go-live: 50 concurrent users, 200 queries each, drawn from the golden dataset.
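A load profile along those lines can be expressed in a few lines of Locust; the endpoint path, header, and sample queries below are placeholders.

```python
# load_test.py -- illustrative Locust profile; endpoint, header, and host are assumptions
# Run: locust -f load_test.py --users 50 --spawn-rate 10 --run-time 30m --host https://staging.example.com
import random

from locust import HttpUser, between, task

GOLDEN_QUERIES = [
    "What is the approval process for vendor contracts?",
    "How do I escalate a failed payment run?",
]  # placeholders; in practice, loaded from the 50-question golden dataset


class QueryUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def ask(self):
        self.client.post(
            "/query",
            json={"query": random.choice(GOLDEN_QUERIES)},
            headers={"x-customer-id": "load-test"},
        )
```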
The system went live that afternoon. The customer's ops team onboarded their first 20 users the same day.
Frequently Asked Questions
What is a production RAG system and how is it different from a prototype?
A production RAG system handles real concurrent users, enforces multi-tenant data isolation, maintains sub-2-second p95 latency, and includes eval and monitoring frameworks. A prototype works with one user and clean data in a controlled environment.
How long does it realistically take to ship production RAG?
With an experienced team and a repeatable architecture, 5–7 days is achievable for an existing use case with data already available. Without prior production RAG experience, most teams take 8–16 weeks.
What is hybrid retrieval and why does it matter for RAG?
Hybrid retrieval combines dense vector search (semantic similarity) with sparse BM25 keyword search. Vector search alone misses exact-match queries; BM25 alone misses semantic paraphrases. Production systems use both and merge results before reranking.
What does a cross-encoder reranker do in a RAG pipeline?
A cross-encoder takes the top-N candidates from initial retrieval (typically 50) and scores each one against the query using full attention. It runs on a small candidate set and re-ranks to the final top-k, adding 5–15% accuracy on top of hybrid retrieval.
How do you measure RAG accuracy in production?
Use RAGAS scoring: faithfulness (answers stay within retrieved context), answer relevancy (answers address the query), and context precision (right chunks retrieved). Build a golden dataset before go-live. Scores above 0.85 on faithfulness and relevancy are the production target.
What This Means
This build took 6 days because Boundev came in with a repeatable architecture — dual pipelines, hybrid retrieval, structured chunking, RAGAS eval, semantic caching — already proven across prior deployments. The 11 weeks the customer spent before calling us were not wasted; they proved the use case was real and worth building properly.
The pattern is consistent: teams stall not because the use case is wrong, but because the jump from POC to production requires a fundamentally different architecture, one most engineering teams are encountering for the first time. A prototype is a proof of concept. Production RAG is infrastructure.
If your team has a RAG prototype that works in staging but breaks under real load — or if you're planning a RAG feature and want to skip the prototype-to-production learning curve — check our what-we-build page for the specific feature types we ship, or go straight to pricing to see which tier fits.



