
Building AI Features With Real Usage Data — Not Demo Prompts

Your AI feature isn't broken because your model is wrong. It's broken because you built it against prompts you invented, not the ones your users actually type. Here's the 4-step framework for fixing that.

Mayur Domadiya
Jun 03, 2026 · 9 min read

Most SaaS teams build their first AI feature the same way. Someone writes 15 sample prompts in a Notion doc. The team tests the LLM against those prompts. It works. They ship. Then three weeks after launch, support tickets start piling up — "the AI gave me garbage," "it didn't understand what I meant," "why does it suggest that?"

The prompts you imagined are not the prompts your users send. That is the whole problem. Building AI features on demo data is the same mistake as designing a checkout flow without looking at where users actually drop off. You get a product that works in your head but fails in production. This post covers how to collect real usage signals before you build, how to structure them into eval sets, and the specific framework we use at Boundev to ship AI features that hold up in production — not just in demos.

  • 92%: internal benchmark score on demo prompts for one client's search feature
  • 61%: the same feature scored on real user inputs from live traffic
  • 72x: cost difference between overengineered and right-sized model selection

Why Demo Prompts Fail in Production

When your team invents test prompts, they carry invisible biases. You write what makes sense to you. Your vocabulary, your mental model of the product, your assumptions about user intent.

Real users do not share any of that.

A B2B SaaS with a natural language search feature learned this the hard way. Their internal test set had 40 prompts like "show me all open invoices from Q1." Their real users sent "stuff I haven't paid yet", "overdue things", and "the ones from that vendor." Same intent. Completely different surface form. The model that scored 92% on the internal benchmark scored 61% on live traffic.

Demo prompts also miss the long tail entirely. The top 10 use cases might cover 40% of volume. The other 60% is a thousand edge cases you did not anticipate. If your eval set does not represent the tail, your AI feature will embarrass you at scale.

The fix is not a better LLM. It is better data before you build.

The 4-Step Framework: From Zero Usage Data to Production-Ready AI

If you are building a new AI feature with no existing users, you are not starting from zero. You have logs, support tickets, search queries, and user interviews. Here is how to use them.

Step 1: Mine your existing product data

Before writing a single system prompt, spend one week in your existing data. Look for:

  • Search queries: the exact strings users type into your current search box. These are verbatim intent signals.
  • Support tickets: every ticket that describes a task the user wanted to do but could not is a use case your AI feature needs to handle.
  • Feature requests: the raw request text, in the user's own words, not the feature title your team gave it.
  • Session recordings: watch where users get stuck. Each point of friction is a candidate AI interaction.

A project management SaaS we worked with pulled 1,200 support tickets over 6 months. After removing duplicates and off-topic threads, they had 340 distinct user intents. That became the seed dataset for their AI assistant's eval set — before a single line of model code was written.
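As a rough illustration of the deduplication pass described above, the sketch below normalizes raw text and counts exact repeats. The function names `normalize` and `seed_intents` and the sample inputs are assumptions for illustration. This only collapses near-identical strings; grouping different phrasings into one intent still needs manual tagging or a clustering step.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def seed_intents(raw_inputs: list[str], min_count: int = 1) -> list[tuple[str, int]]:
    """Count normalized inputs; return those seen at least min_count times,
    most frequent first."""
    counts = Counter(normalize(t) for t in raw_inputs if t.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Hypothetical sample inputs, echoing the search example earlier in the post.
tickets = [
    "Stuff I haven't paid yet",
    "stuff i haven't paid yet!",
    "overdue things",
    "the ones from that vendor",
]

for text, n in seed_intents(tickets):
    print(n, text)
```

Raising `min_count` keeps only the inputs that recur, which is one way to separate confirmed use cases from one-off noise.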

Step 2: Run a shadow collection phase

If you can deploy any version of the feature — even a basic one — behind a feature flag, do it. The goal is not to serve users well yet. The goal is to collect real inputs.

Set up logging that captures:

  • The raw user input (exact text, not sanitized)
  • The timestamp and user segment
  • What the model returned
  • Whether the user accepted, edited, or rejected the response

Even 2 weeks of shadow data from 50 beta users will tell you more than 200 invented prompts. You will see patterns you would never have guessed. You will also see failure modes that would have made it to production.
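One possible shape for the log record listed above, written as one JSON line per interaction. The class name `ShadowRecord`, the field names, and the sample values are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Literal

Outcome = Literal["accepted", "edited", "rejected"]


@dataclass
class ShadowRecord:
    raw_input: str      # exact user text, not sanitized
    user_segment: str   # hypothetical segment label, e.g. a plan tier
    model_output: str   # what the model returned
    outcome: Outcome    # how the user responded to the output
    timestamp: str = ""

    def __post_init__(self) -> None:
        # Fill in a UTC timestamp when the caller does not supply one.
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()


def to_jsonl(record: ShadowRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


line = to_jsonl(
    ShadowRecord("overdue things", "beta-smb", "Found 3 overdue invoices", "edited")
)
print(line)
```

Keeping the outcome in the same record as the raw input means the eval set can later be filtered by how users actually responded.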

Step 3: Build your eval set from real data, not imagination

An eval set is the test suite for your AI feature. Most teams build it once, with their own prompts, and never update it. That is the wrong approach.

Your eval set should follow this composition:

Data Source                                 % of Eval Set   Why
Real user inputs (top 20% by frequency)     40%             High-volume cases must always work
Real user inputs (long tail / edge cases)   30%             This is where production failures happen
Support ticket inputs                       20%             These represent real frustration moments
Invented adversarial prompts                10%             Test the boundaries, but not as the primary source

The 40-30-20-10 split is not arbitrary: 90% of the set comes from real sources, and only 10% from prompts you invented. In our experience, teams that invert this, putting invented prompts at the top, end up chasing a false benchmark while production quality degrades.
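A minimal sketch of sampling an eval set to that composition. The pool names, `EVAL_MIX`, and `build_eval_set` are illustrative assumptions; it also assumes the inputs have already been sorted into the four pools.

```python
import random
from collections import Counter

# Shares taken from the composition above; the pool names are hypothetical.
EVAL_MIX = {
    "real_head": 0.40,        # real inputs, top 20% by frequency
    "real_tail": 0.30,        # real inputs, long tail / edge cases
    "support_tickets": 0.20,
    "adversarial": 0.10,      # invented adversarial prompts
}


def build_eval_set(
    pools: dict[str, list[str]], size: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Sample (source, text) pairs from each pool according to EVAL_MIX.

    Quotas are rounded per pool, so the total can differ slightly from
    `size` when size is not a multiple of 10.
    """
    rng = random.Random(seed)
    eval_set: list[tuple[str, str]] = []
    for source, share in EVAL_MIX.items():
        quota = round(size * share)
        pool = pools[source]
        if len(pool) < quota:
            raise ValueError(f"{source}: need {quota} examples, have {len(pool)}")
        eval_set.extend((source, text) for text in rng.sample(pool, quota))
    return eval_set


# Placeholder pools; real ones would come from logs and tickets.
pools = {name: [f"{name}-{i}" for i in range(50)] for name in EVAL_MIX}
eval_set = build_eval_set(pools, size=100)
print(Counter(source for source, _ in eval_set))
```

Failing loudly when a pool is too small is a deliberate choice here: silently backfilling with invented prompts would recreate the problem the split is meant to prevent.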


Step 4: Build a continuous feedback loop

A static eval set is a snapshot. Your users' behavior changes as your product evolves. The eval set needs to evolve too.

Set up a lightweight feedback mechanism at the point of AI output. Even a thumbs-up/thumbs-down is enough. After 30 days:

  • Any prompt category with a thumbs-down rate above 15% goes into the eval set immediately
  • Any new prompt type that appears more than 20 times in logs gets manually reviewed for addition
  • The eval set grows by 50–100 examples per month in the first year

This is the difference between an AI feature that improves over time and one that silently degrades as usage patterns drift.
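The first two rules above can be sketched as a small review function. It assumes each feedback event already carries a prompt category assigned upstream (by hand or by a classifier); `review_feedback`, the category names, and the sample counts are hypothetical.

```python
from collections import defaultdict

THUMBS_DOWN_THRESHOLD = 0.15  # categories above this rate go into the eval set
NEW_TYPE_MIN_COUNT = 20       # unseen types above this count get manual review


def review_feedback(events, known_categories):
    """events: iterable of (category, thumbs_up) pairs from the last 30 days.

    Returns (add_now, needs_review) as sorted lists of category names.
    """
    totals = defaultdict(int)
    downs = defaultdict(int)
    for category, thumbs_up in events:
        totals[category] += 1
        if not thumbs_up:
            downs[category] += 1

    add_now = sorted(
        c for c in totals if downs[c] / totals[c] > THUMBS_DOWN_THRESHOLD
    )
    needs_review = sorted(
        c for c in totals
        if c not in known_categories and totals[c] > NEW_TYPE_MIN_COUNT
    )
    return add_now, needs_review


# Hypothetical month of feedback: one known category, one new one.
events = (
    [("invoice_search", True)] * 90
    + [("invoice_search", False)] * 10
    + [("vendor_lookup", True)] * 15
    + [("vendor_lookup", False)] * 10
)
add_now, needs_review = review_feedback(events, known_categories={"invoice_search"})
print(add_now, needs_review)
```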

What Real Usage Data Changes in Your Build Decisions

Once you have real data, it changes three things that directly affect cost and architecture.

Model selection

Demo prompts are usually clean, well-formed, and short. Real prompts are messy, ambiguous, and sometimes 400 words long. A model that works on demo data might be overengineered — or underengineered — for what real users actually send.

We have seen teams pay $0.018/1K tokens for a frontier model when their actual use case — once tested on real prompts — worked just as well on a smaller model at $0.00025/1K tokens. That is a 72x cost difference. Real data lets you make that call with confidence instead of guessing.
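The arithmetic behind that ratio, with a hypothetical monthly token volume added to show what it means in dollars. The two per-1K prices are the ones quoted above; the 50 million tokens per month figure is an assumption for illustration only.

```python
FRONTIER_PRICE_PER_1K = 0.018   # USD per 1K tokens, from the example above
SMALL_PRICE_PER_1K = 0.00025    # USD per 1K tokens, from the example above


def monthly_cost(price_per_1k: float, tokens_per_month: int) -> float:
    """Cost in USD for a given monthly token volume."""
    return price_per_1k * tokens_per_month / 1000


ratio = FRONTIER_PRICE_PER_1K / SMALL_PRICE_PER_1K
print(f"{ratio:.0f}x")  # 72x

TOKENS_PER_MONTH = 50_000_000  # hypothetical volume
print(f"frontier: ${monthly_cost(FRONTIER_PRICE_PER_1K, TOKENS_PER_MONTH):.2f}")
print(f"small:    ${monthly_cost(SMALL_PRICE_PER_1K, TOKENS_PER_MONTH):.2f}")
```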

Context window and chunking strategy

Real users write prompts that assume context your system does not have. "Fix the same issue as before" — what issue? "Use the format from the last report" — which report? Real usage data surfaces how much context your AI needs to carry, which determines whether you need a simple stateless prompt, a session memory layer, or a full RAG system.

Skipping this step and defaulting to RAG because it sounds right is how teams end up with unnecessary infrastructure costs and p95 latency above 6 seconds.

Guardrails and fallbacks

Real users will try to use your AI feature for things you did not intend. Not necessarily maliciously — they just explore. Your real usage data will show you exactly what the boundary cases are. You can build explicit handling for the top 10 off-topic input types before you ship, instead of discovering them in a support ticket at 2 AM.
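Explicit handling for known off-topic input types can be as simple as a pre-check that runs before the model call. The sketch below uses keyword patterns; the category names, patterns, and messages are invented for illustration, and a real list would come from the off-topic inputs found in your logs.

```python
import re
from typing import Optional

# Hypothetical off-topic types: (name, pattern, fallback message).
OFF_TOPIC_RULES = [
    (
        "refund_request",
        re.compile(r"\b(refund|chargeback)\b", re.IGNORECASE),
        "I can't process refunds here. Please contact billing support.",
    ),
    (
        "account_deletion",
        re.compile(r"\bdelete (my )?account\b", re.IGNORECASE),
        "Account deletion is handled in Settings, not by this assistant.",
    ),
]


def route(user_input: str) -> tuple[str, Optional[str]]:
    """Return (category, fallback_message).

    The message is None when the input is in scope and should go to the model.
    """
    for name, pattern, message in OFF_TOPIC_RULES:
        if pattern.search(user_input):
            return name, message
    return "in_scope", None


print(route("Can I get a refund?"))
print(route("overdue things"))
```

Pattern matching is the cheapest option and will miss paraphrases; a lightweight classifier is a reasonable next step once the logs show which off-topic types matter.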

The prompts you invent are hypotheses. The prompts your users send are facts. Build against facts.

The Difference Between Good and Bad Behavioral Baselines

Before you improve your AI feature, you need a baseline. Most teams skip this. They add a new model, redeploy, and assume it is better because the new model is bigger or cheaper.

A behavioral baseline means you know, exactly, how your feature behaves on a defined set of inputs before any changes. It has three components:

  1. Accuracy score — what percentage of eval set inputs does the model handle correctly? Define "correct" per input type before running the eval. Do not define it after.
  2. Latency profile — what are your p50, p90, and p99 response times under realistic load? Not on a laptop with a single request.
  3. Rejection rate — how often does the model refuse, produce a fallback, or return a hallucination? Measure this separately from accuracy.
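The three components can be computed from per-input eval results along these lines. `EvalResult`, its fields, and the sample data are assumptions for illustration; the percentile uses the nearest-rank method, and latencies should come from runs under realistic load.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResult:
    correct: bool     # judged against a definition of "correct" fixed in advance
    latency_s: float  # response time in seconds
    rejected: bool    # refusal or fallback, tracked separately from accuracy


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(results: list[EvalResult]) -> dict[str, float]:
    """Summarize accuracy, rejection rate, and latency percentiles."""
    n = len(results)
    latencies = [r.latency_s for r in results]
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "rejection_rate": sum(r.rejected for r in results) / n,
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p99": percentile(latencies, 99),
    }


# Hypothetical results for a 10-input eval run.
results = [
    EvalResult(correct=i % 5 != 0, latency_s=float(i + 1), rejected=i % 10 == 0)
    for i in range(10)
]
print(baseline(results))
```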

Once you have this baseline from week one, every subsequent change (new model, new prompt, new context strategy) gets measured against it. You stop saying "it feels better" and start saying "p90 latency dropped from 3.8s to 1.4s and accuracy on tier-1 queries improved from 78% to 89%."

That is how you make AI engineering decisions instead of AI engineering guesses.

What to Do This Week

If you have an AI feature in planning, one in development, or one already live, here is where to start:

If you are pre-build: Pull your last 90 days of search queries and support tickets. Tag them by user intent. Any intent that appears more than 5 times is a confirmed use case your AI feature must handle on day one.

If you are mid-build: Stop. Write down your current eval set. Count how many inputs came from real users versus how many you invented. If fewer than 40% are real, your benchmark is telling you a story about your own assumptions, not your users.
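That count is a one-liner once each eval input is labeled with where it came from. The `source` field and its values are assumed labels, not an established format.

```python
def real_share(eval_set: list[dict]) -> float:
    """Fraction of eval inputs that came from real users rather than the team."""
    real = sum(1 for item in eval_set if item["source"] != "invented")
    return real / len(eval_set)


# Hypothetical eval set: 3 real inputs, 7 invented ones.
current_eval_set = (
    [{"text": "overdue things", "source": "live_traffic"}] * 2
    + [{"text": "can't find unpaid bills", "source": "support_ticket"}]
    + [{"text": "show me all open invoices from Q1", "source": "invented"}] * 7
)

share = real_share(current_eval_set)
print(f"{share:.0%} real")
if share < 0.40:
    print("Benchmark mostly reflects the team's assumptions")
```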

If you are post-launch with declining satisfaction: Export the last 1,000 AI interactions. Segment by user response (edited, accepted, abandoned). The abandoned cohort is your real eval set. It tells you exactly where your feature breaks.

Every AI feature that is underperforming is underperforming on inputs the team did not anticipate. The fix is systematic, not magical — get the real data, build against it, measure correctly, iterate. Which cohort are you starting with?


Mayur Domadiya

Founder & CEO, Boundev AI

Mayur builds Boundev AI, the AI engineering subscription for US SaaS companies. Connect on Twitter or LinkedIn.

Tags: #ai-engineering #llm-evals #ai-workflows #for-founders #for-ctos