
Run a 2-day feasibility spike before you promise an AI feature

Most AI features do not fail in production. They fail earlier, in a planning meeting where someone promised a capability nobody had tested against real data. The roadmap slot gets booked, two engineers start building, and three weeks in the team discovers the model cannot reliably do the one thing the whole feature depends on.

A feasibility spike is the cheapest insurance against that outcome. It is a short, time-boxed experiment that answers one question: can a current model do this task, on our actual data, well enough to ship? You run it before the feature enters a sprint, not after the demo falls apart.

The short version, for anyone skimming:

  • A feasibility spike is a two-day experiment that tests whether an AI feature is possible on your real data before you commit a sprint to it.
  • Day one wires the thinnest prompt to 30 to 50 real examples. Day two scores the outputs and sets a go or no-go pass rate.
  • One public benchmark ran 570 API calls across 15 models and 38 tasks for $2.29. A spike is one task on a few dozen examples, so its API spend is smaller still.
  • If the spike clears your bar, it seeds your acceptance criteria and evals. If it does not, you saved a sprint and a credibility hit.

What a feasibility spike actually answers

Scoping decides which feature is worth building. A spike decides whether the one you picked is even possible right now. Those are different questions, and teams that skip the second one tend to learn the answer the expensive way. If you have not done the first step yet, start with scoping which AI feature to build first, then spike the winner.

The spike is deliberately narrow. It does not test your UI, your auth, your billing, or your latency budget. It tests the model-shaped risk: given a representative input, does the model produce an output you would be comfortable showing a customer, often enough to matter? Everything else is normal engineering you already know how to do. The model behavior is the part nobody can promise from a slide.

A good spike produces three things: a number (the pass rate on real examples), a short list of the failure modes you saw, and a recommendation that is either go, go-with-changes, or no-go. That is it. Resist the urge to polish a prototype. The deliverable is a decision, not a feature.

How to run the spike in two days

Two days is enough because you are testing a hypothesis, not building a product. The constraint is the point: it forces you to cut everything that is not the core risk.

Day one: wire the thinnest prompt to real data

Pull 30 to 50 real examples from your own product. Not synthetic samples, not the three clean cases from the pitch deck. Real customer records, real support tickets, real documents, with the messiness intact. The spike is worthless if the inputs are cleaner than production.

Write the simplest prompt that could work and call the model API directly in a notebook or a small script. Skip frameworks, skip retrieval if you can, skip fine-tuning. You want the floor: how well does a plain prompt on a current model handle your task? If the floor is already close, the feature is probably feasible and the rest is tuning. If the floor is far off, you have learned something important on day one.

Log every input and output to a file. You will need them tomorrow, and you will want the failures more than the successes.
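
Day one can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: `call_model` is a hypothetical function you supply that wraps whichever model API you use, the prompt template describes a made-up ticket-classification task, and each example is assumed to be a dict with `id`, `input`, and an optional `expected` answer.

```python
import json
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical task wording; replace with the simplest prompt for your feature.
PROMPT_TEMPLATE = (
    "Classify the support ticket below into one category.\n"
    "Ticket:\n{ticket}\n"
    "Category:"
)


def run_spike(
    examples: Iterable[dict],
    call_model: Callable[[str], str],
    log_path: str = "spike_log.jsonl",
) -> int:
    """Send each example through one plain prompt and log every input and output."""
    count = 0
    with Path(log_path).open("w", encoding="utf-8") as log:
        for example in examples:
            prompt = PROMPT_TEMPLATE.format(ticket=example["input"])
            try:
                output, error = call_model(prompt), None
            except Exception as exc:
                # Keep going: an API error on a messy input is a finding too.
                output, error = None, repr(exc)
            record = {
                "id": example["id"],
                "input": example["input"],
                "expected": example.get("expected"),
                "output": output,
                "error": error,
            }
            # One JSON object per line, so day two can read the log back easily.
            log.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the model call in as a function keeps the harness independent of any one provider, and it lets you dry-run the loop with a stub before you spend anything.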

Day two: score the outputs and set a bar

Now grade what you produced. Pick the scoring method that fits the task. For constrained outputs (a category, a field, a structured object) you can compare against a known answer and get an exact pass-fail. For open-ended outputs (a summary, a draft, an explanation) use a model-as-judge pass with a clear rubric, then spot-check a sample by hand to make sure the judge agrees with a person. Correctness, faithfulness, relevance, and safety are separate dimensions; do not collapse them into one vibe-based thumbs up.
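
The scoring step might look like the sketch below. It assumes records shaped like a day-one log (dicts with `expected` and `output`), and it assumes your judge pass has already produced one true or false verdict per dimension for each example. The function names and the whitespace and case normalization are illustrative choices, not a standard.

```python
DIMENSIONS = ("correctness", "faithfulness", "relevance", "safety")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not fail a match."""
    return " ".join(text.strip().lower().split())


def exact_match_pass_rate(records: list[dict]) -> float:
    """Pass rate for constrained outputs that have a known answer."""
    graded = [r for r in records if r.get("expected") is not None]
    if not graded:
        raise ValueError("no records with an expected answer")
    passes = sum(
        1
        for r in graded
        if r.get("output") is not None
        and normalize(r["output"]) == normalize(r["expected"])
    )
    return passes / len(graded)


def per_dimension_pass_rates(judgements: list[dict]) -> dict:
    """Keep each rubric dimension as its own number instead of one blended score."""
    if not judgements:
        raise ValueError("no judgements to score")
    return {
        dim: sum(1 for j in judgements if j[dim]) / len(judgements)
        for dim in DIMENSIONS
    }


def judge_human_agreement(judge: dict, human: dict) -> float:
    """Share of hand-checked examples where the judge verdict matches the person's."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no examples were graded by both judge and human")
    return sum(1 for key in shared if judge[key] == human[key]) / len(shared)
```

If the agreement number on your hand-checked sample is low, fix the rubric before you trust any judge-derived pass rate.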

The cost of this is small. A widely cited 2026 benchmark scored 15 models across 38 tasks in 570 API calls for $2.29. Your spike is one task on a few dozen examples, so the API bill is a rounding error. The expensive resource is the two engineer-days, which is exactly why you cap it at two.
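
To make the arithmetic concrete, here is a rough back-of-envelope estimate. It assumes the benchmark's average cost per call carries over to your task, which it may not: that average blends 15 differently priced models, and your documents may be longer than benchmark prompts. The default of two calls per example (one generation plus one judge call) is also an assumption.

```python
BENCHMARK_COST_USD = 2.29
BENCHMARK_MODELS = 15
BENCHMARK_TASKS = 38
BENCHMARK_CALLS = BENCHMARK_MODELS * BENCHMARK_TASKS  # one call per model per task

# Average across all models in the benchmark, roughly $0.004 per call.
COST_PER_CALL_USD = BENCHMARK_COST_USD / BENCHMARK_CALLS


def estimate_spike_cost(
    n_examples: int,
    calls_per_example: int = 2,  # assumption: one generation plus one judge call
    cost_per_call_usd: float = COST_PER_CALL_USD,
) -> float:
    """Rough API spend for a spike, in US dollars."""
    return n_examples * calls_per_example * cost_per_call_usd


if __name__ == "__main__":
    print(f"per call: ${COST_PER_CALL_USD:.4f}")
    print(f"50-example spike: ${estimate_spike_cost(50):.2f}")
```

Even if your real per-call cost is ten times this average, a 50-example spike stays in the range of a few dollars.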

The pass-rate bar that decides go or no-go

Before you look at the score, write down the bar the feature needs to clear. Setting it after you see the number is how teams talk themselves into shipping something that does not work.

The bar depends on what happens when the model is wrong. If a wrong output is cheap to ignore, such as a suggested tag the user can dismiss, a 70 to 80 percent pass rate may be plenty. If a wrong output is costly or hard to reverse, such as an automated action or a customer-facing claim, you need a much higher bar plus a human in the loop, or the feature is not ready. The same raw pass rate can be a clear go for one feature and a clear no-go for another.
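
One way to keep yourself honest is to write the bar down as data before scoring, then let a small function apply it. The sketch below is illustrative only: the `Bar` class, the 0.95 figure for a high-stakes feature, and the rule that a miss within ten points counts as go-with-changes are all assumptions you should replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """The bar a feature must clear, written down before anyone sees the score."""
    min_pass_rate: float
    needs_human_review: bool
    near_miss_margin: float = 0.10  # assumption: within 10 points is worth a second spike


def decide(pass_rate: float, bar: Bar) -> str:
    """Map a measured pass rate onto go, go-with-changes, or no-go."""
    if pass_rate >= bar.min_pass_rate:
        return "go"
    if pass_rate >= bar.min_pass_rate - bar.near_miss_margin:
        return "go-with-changes"
    return "no-go"


# Hypothetical bars for two features with very different costs of being wrong.
DISMISSIBLE_TAG = Bar(min_pass_rate=0.75, needs_human_review=False)
AUTOMATED_ACTION = Bar(min_pass_rate=0.95, needs_human_review=True)

if __name__ == "__main__":
    measured = 0.78
    print(decide(measured, DISMISSIBLE_TAG))   # go
    print(decide(measured, AUTOMATED_ACTION))  # no-go
```

The same measured 0.78 clears the dismissible-tag bar and falls well short of the automated-action bar, which is the point of fixing the bar per feature rather than per score.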

When the spike lands in the middle, the failure log tells you what to do next. If the misses cluster around a few input types, a second short spike with retrieval or a better prompt often moves the number. If the misses are scattered and unpredictable, that is a sign the task is genuinely hard for current models, and prompt tuning alone is unlikely to save it. This is also where unrealistic promises quietly become the reason AI features stay stuck in the backlog: a vague yes with no evidence is impossible to plan around.
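
A quick way to see whether misses cluster is to tag each scored example with an input type by hand and compare miss rates per type. The sketch below assumes each record carries an `input_type` label and a `passed` flag; both field names are hypothetical.

```python
from collections import Counter


def miss_rate_by_type(records: list[dict]) -> list[tuple]:
    """Return (input_type, misses, total, miss_rate) rows, worst type first."""
    totals, misses = Counter(), Counter()
    for record in records:
        totals[record["input_type"]] += 1
        if not record["passed"]:
            misses[record["input_type"]] += 1
    rows = [
        (kind, misses[kind], totals[kind], misses[kind] / totals[kind])
        for kind in totals
    ]
    # Sort by miss rate so the most fragile input type is at the top.
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    sample = [
        {"input_type": "scanned_pdf", "passed": False},
        {"input_type": "scanned_pdf", "passed": False},
        {"input_type": "scanned_pdf", "passed": True},
        {"input_type": "email", "passed": True},
        {"input_type": "email", "passed": True},
    ]
    for kind, missed, total, rate in miss_rate_by_type(sample):
        print(f"{kind}: {missed}/{total} missed ({rate:.0%})")
```

With only a few dozen examples, each type will have a handful of rows, so treat these rates as a pointer to where to look, not as a measurement.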

What a feasibility spike is not

A spike is not a prototype you keep. The code is throwaway by design, and treating it as a head start tempts you to build production on top of notebook glue. Keep the data, the prompt, and the scores; discard the rest.

A spike is also not a one-time gate. The model that fails today may pass in three months, and the model that passes today may regress on a quiet update. The pass rate and the scored examples you produced become the first version of a permanent eval. Once the feature is greenlit, grow that set and turn the spike into acceptance criteria and evals so you can catch regressions before users do.
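
As a sketch of how the spike's scored examples can become a regression check, the code below compares a new run against the stored baseline. It assumes both runs are dicts mapping an example id to whether that example passed; the five-point tolerance is an arbitrary placeholder.

```python
def pass_rate(results: dict) -> float:
    """Fraction of examples that passed in one run."""
    if not results:
        raise ValueError("no results to score")
    return sum(1 for passed in results.values() if passed) / len(results)


def find_regressions(baseline: dict, current: dict) -> list:
    """Ids that passed in the baseline run but fail, or are missing, in the current run."""
    return sorted(
        example_id
        for example_id, passed in baseline.items()
        if passed and not current.get(example_id, False)
    )


def regression_report(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Summarize a rerun of the spike set against the stored baseline."""
    base, now = pass_rate(baseline), pass_rate(current)
    return {
        "baseline": base,
        "current": now,
        "regressed_ids": find_regressions(baseline, current),
        # The overall rate can hold steady while individual examples flip,
        # so report both the aggregate and the per-example regressions.
        "ok": now >= base - tolerance,
    }
```

Rerun this whenever the model version changes, and read the regressed ids even when the aggregate check passes.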

Finally, a spike is not a substitute for shipping. It de-risks the decision; it does not replace the work. If you want a sense of the kinds of features that clear the bar and reach production, our notes on what we build for SaaS teams are a useful reference for scoping the next one.

Common questions

How many examples do I need for a feasibility spike?

Thirty to fifty real examples are enough to see the dominant failure modes and get a rough pass rate. You are not proving statistical significance; you are deciding whether to invest a sprint. If the result is borderline, expand the set before you build, not after.
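
To see how rough "rough" is, you can put an interval around the pass rate. The sketch below uses the Wilson score interval, a standard choice for small samples; it is one reasonable option and not something the spike process itself prescribes.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple:
    """Approximate 95% interval (for z = 1.96) around an observed pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(passes=32, total=40)
    print(f"observed 80%, plausible range {low:.0%} to {high:.0%}")
```

For 32 passes out of 40, the interval runs from roughly 65 to 90 percent. If your bar sits inside that range, the result is borderline and the set needs to grow before you commit.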

Should I use a framework or retrieval in the spike?

Start without them. The goal of day one is to find the floor with a plain prompt on a current model. If the floor is close to your bar, you have your answer. Add retrieval or tooling only when the bare prompt misses in a pattern that retrieval would plausibly fix, and treat that as a second, separate spike.

What if the spike says no-go?

That is a win, not a failure. You spent two engineer-days and a few dollars instead of a three-week sprint and a broken demo. Write down the failure modes, note the model versions you tested, and re-run the spike when models improve or the scope narrows. A clear no-go protects the roadmap.

Who should run the spike?

One engineer who can call an API and read outputs critically, ideally with the product owner reviewing the scored examples on day two. You do not need a dedicated ML team to run a feasibility spike, and waiting until you hire one is often the reason the feature never gets tested at all.

Get shipped

Would you rather we just build it?

Book a free scoping call and we'll ship your production-safe AI feature this week.