How to Choose an AI MVP Development Agency in 2026 (25-Question Evaluation Framework)
TL;DR
Choosing an AI MVP development agency in 2026 requires evaluating five categories: production AI experience, pricing transparency, scoping rigor, code ownership, and post-launch support. The 25 questions in this framework separate real production AI teams from demo shops. Real agencies answer concretely with examples. Demo shops answer with marketing language. If a vendor fails on any of the five hard red flags (no published pricing, no recent production AI examples, no answer on output validation, no day-one code ownership, no evaluation infrastructure in scope), walk away. A production AI MVP in 2026 should cost $3,999 to $25,000 fixed-price and ship in 1 to 4 weeks.
Get a fixed-price scope from HouseofMVPs in 24 hours →
Why this matters
AI MVP development in 2026 is not the same as web MVP development in 2024. The core differences are real and engineering-relevant: prompt engineering, output validation, evaluation infrastructure, cost monitoring, and observability. Agencies that have shipped production AI know these patterns from experience. Agencies that have not are running a web-MVP playbook under an AI label.
The cost of picking the wrong agency is not just the agency fee. It is the time-to-revenue penalty (every week you do not ship is a week you cannot validate the idea), the rework cost (broken AI code costs more to fix than to rewrite), and the trust cost with investors or customers when the product breaks in production. The cheapest agency is rarely the cheapest outcome.
This framework is what we tell founders when they ask how to evaluate vendors. It is the same set of questions we would want a founder to ask us before signing, because anyone who passes this filter is the kind of customer who will get a production AI MVP regardless of which agency they pick.
The 25 questions, in five categories
The framework covers five evaluation categories, weighted by how strongly each predicts project success:
| Category | Weight | Why |
|---|---|---|
| Production AI experience | 35% | The single biggest predictor of AI MVP success |
| Pricing structure | 25% | Aligns incentives between agency and customer |
| Scoping rigor | 20% | Determines whether the project ships on time |
| Code ownership & lock-in | 10% | Determines what you actually own at the end |
| Post-launch support | 10% | Determines whether the AI MVP survives in production |
Score each question on a 1-5 scale. Any score below 3 is a red flag for that question. A category average below 3 on production AI experience or pricing structure is a walk.
Category 1: Production AI Experience (35%)
Q1: Show me a production AI product you shipped in the last 90 days.
Good answer: Live URL, GitHub repo (private but viewable), architecture walkthrough, before/after metrics. Real agencies show, not tell.
Bad answer: "We have shipped many AI products" without specifics. Or showing a 2024 case study as if it represents current capability. AI moved fast; a 2024 case study is archaeological evidence, not proof of current skill.
Q2: What model would you use for my use case, and why?
Good answer: Reasoning about cost profile, latency, quality requirements, and tradeoffs. Mentions specific models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) and explains the fit.
Bad answer: "GPT-5.5 because it is best." or "Whatever the latest model is." Defaulting to one model regardless of use case is a competence signal.
Q3: How do you handle cases where the model returns unexpected output formats?
Good answer: Output schema validation with libraries like Zod or Pydantic, retry logic with backoff, fallback paths to simpler prompts, alerting when validation fails. Real production AI has all four.
Bad answer: "The model is good enough that this won't happen." Production AI users will, with certainty, produce inputs that cause unexpected outputs.
Q4: How do you monitor model performance over time?
Good answer: Token-level logging, cost monitoring per request, regression tests against fixed prompt sets, latency tracking, alerting on quality drift. Real agencies have this built into their template.
Bad answer: "We use cloud monitoring." Generic monitoring without LLM-specific signals is hope, not engineering.
Q5: What is your evaluation infrastructure?
Good answer: Eval harness with fixed prompts and expected outputs, A/B testing of prompt versions, golden dataset for regression detection. Examples of when eval caught a bug.
Bad answer: "We test as we go." This means no eval infrastructure, which means quality regressions when models update will go undetected.
Q6: What is your cost-control architecture?
Good answer: Prompt caching strategies, token budget per user/session, fallback to cheaper models for simple queries, alerting on cost spikes. Specific dollar examples.
Bad answer: "We use the API directly." This means no cost guardrails, which means a runaway loop could burn $5,000 in a week without anyone noticing.
Q7: Can you walk me through your most complex AI deployment?
Good answer: Specific case study with architecture decisions, technical challenges, how they were solved. Numbers where applicable.
Bad answer: Hand-waving about "complex AI projects" without specifics. Real production engineering has specific stories.
Category 2: Pricing Structure (25%)
Q8: Is your pricing fixed or hourly?
Good answer: Fixed-price with a clear scope document, payment schedule (typically 50% upfront, 50% on delivery), and acceptance criteria.
Bad answer: Hourly. Hourly billing for a defined MVP misaligns incentives: the longer the project runs, the more the agency earns. The exception is genuine ongoing work, which is not an MVP.
Q9: Can you show me your scope document template before we sign?
Good answer: Yes, here it is. Includes feature list, exclusions, payment terms, change-order process, and acceptance criteria.
Bad answer: "We will create one once we start." This means scope is undefined, which means scope creep is the business model.
Q10: What is included at the base price?
Good answer: Explicit feature list, infrastructure (deployment, monitoring), code ownership, support period, and exclusions list.
Bad answer: "Everything you need." Vague answers turn into change orders.
Q11: What is excluded from the base price?
Good answer: Specific list. Mobile app, additional integrations, custom AI fine-tuning, etc. Each excluded item has a fixed add-on price.
Bad answer: "We will figure it out as we go." Code for "we will bill you for it."
Q12: How do you handle scope changes mid-build?
Good answer: Written change-order process. Small additions get rolled in. Large additions get re-scoped at fixed price with a signed change order.
Bad answer: "We will let you know if it adds cost." Discretionary change orders are how 2-week builds become 4-week builds.
Q13: What is your payment schedule?
Good answer: 50% upfront, 50% on delivery. Or milestones for larger projects. Clear terms before any work starts.
Bad answer: 100% upfront (you lose all leverage), or "we will work it out" (you have no leverage anyway).
Q14: Can you show me a case study where the original quote matched the final invoice?
Good answer: Yes, with specifics. Real fixed-price agencies stick to their quotes.
Bad answer: "Most of our projects come in close to the quote." Translation: "We have change orders on every project."
Category 3: Scoping Rigor (20%)
Q15: How long is your discovery process?
Good answer: 1 to 3 days for an MVP. Anything longer is rate-card billing for what should be focused work.
Bad answer: 4 to 6 weeks of discovery for a 2 to 4 week build. When the discovery phase is longer than the build phase, you are paying for a profit-extraction model, not a delivery model.
Q16: What goes into your scope document?
Good answer: Feature list with acceptance criteria, technical architecture, tech stack, integrations, exclusions, payment terms, change-order process, delivery date.
Bad answer: Bullet points without acceptance criteria. Acceptance criteria are what define "done."
Q17: How do you decide what features make the MVP?
Good answer: Reasoning about the hypothesis being validated. Features that prove the hypothesis are in. Features that do not are explicitly out.
Bad answer: "Whatever you want." Founders without engineering judgment will scope features that should not be in an MVP.
Q18: What is the acceptance criterion for "done"?
Good answer: Specific, measurable criteria per feature. Deployed to production. Real users can complete the workflow.
Bad answer: "When you are happy." Subjective acceptance criteria become indefinite revision cycles.
Q19: How do you handle dependencies on third-party APIs that may change?
Good answer: Identifies critical dependencies in the scope, builds adapter patterns to abstract them, monitors deprecation announcements.
Bad answer: "We use them directly." Tight coupling to third-party APIs is how MVPs break six months after launch.
Category 4: Code Ownership & Lock-in (10%)
Q20: When does the customer get access to the code?
Good answer: Day one. Code in the customer's GitHub from the first commit. Customer is the owner; agency has collaborator access.
Bad answer: "On launch day." Lock-in via delayed code transfer is a leverage move.
Q21: Who owns the design files, the database schema, the API keys?
Good answer: Customer owns all of it. Design files in customer Figma/Sketch. Database in customer-owned cloud account. API keys provisioned by customer.
Bad answer: Agency holds any of these. Even temporary holding creates dependency.
Q22: What happens if we want to take development in-house?
Good answer: We will hand off the codebase with documentation, do a knowledge transfer call, and you are off to the races.
Bad answer: Long migration process or proprietary tooling that does not transfer. Lock-in dressed up as helpful infrastructure.
Q23: Is any part of the architecture proprietary to your agency?
Good answer: No. Standard open-source frameworks. The customer can hire any other engineer to continue the work.
Bad answer: Proprietary internal frameworks, internal libraries, or platform-specific patterns. This is a moat against the customer.
Category 5: Post-Launch Support (10%)
Q24: What is included in post-launch support?
Good answer: Specific number of days (30 to 60 typical for fixed-price MVPs). Specific types of issues covered (bugs, minor adjustments, but not new feature work).
Bad answer: Vague "we are always here" without specifics. Or no included support, with everything billed hourly.
Q25: What does ongoing maintenance look like?
Good answer: Optional monthly retainer with defined scope (security updates, dependency updates, minor feature work, monitoring) at a fixed monthly rate. Customer can cancel anytime.
Bad answer: Required ongoing contract for some period, or unbounded hourly billing for ongoing work.
Scoring the framework
For each of the 25 questions, score 1-5:
- 1: Bad answer or refused to answer
- 3: Acceptable answer
- 5: Strong concrete answer with specifics
Calculate weighted score:
- Category 1 (Production AI): average × 35
- Category 2 (Pricing): average × 25
- Category 3 (Scoping): average × 20
- Category 4 (Code ownership): average × 10
- Category 5 (Post-launch): average × 10
Total out of 500.
- Above 400: Strong candidate. Sign with confidence.
- 300 to 400: Acceptable, but probe weak categories before signing.
- Below 300: Walk.
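If you want the arithmetic spelled out, the weighted score is just this; the category averages in the usage example are made up.

```typescript
// Category averages on the 1-5 scale described above.
interface CategoryAverages {
  productionAi: number; // Q1-Q7
  pricing: number;      // Q8-Q14
  scoping: number;      // Q15-Q19
  ownership: number;    // Q20-Q23
  postLaunch: number;   // Q24-Q25
}

// Weights mirror the table above; a perfect 5 in every category totals 500.
function weightedScore(a: CategoryAverages): number {
  return (
    a.productionAi * 35 +
    a.pricing * 25 +
    a.scoping * 20 +
    a.ownership * 10 +
    a.postLaunch * 10
  );
}

// Made-up example: 4.5*35 + 4*25 + 3.5*20 + 5*10 + 3*10 = 407.5 → strong candidate.
const example = weightedScore({
  productionAi: 4.5,
  pricing: 4,
  scoping: 3.5,
  ownership: 5,
  postLaunch: 3,
});
```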
The five hard red flags (any one is enough to walk)
If a vendor hits any of these, the score does not matter. Walk regardless:
- No published pricing on their website. "Request a quote" is hourly billing in disguise.
- Cannot show a production AI product shipped in the last 90 days. Without recent production work, the team does not know current patterns.
- No answer for how they handle unexpected LLM output formats in production. Means they have not shipped production AI at all.
- Will not commit to code ownership on day one. Lock-in dressed as helpful infrastructure.
- Evaluation infrastructure is not included in the base build. Means quality will silently degrade in production.
How HouseofMVPs scores on this framework
We answer every question concretely so prospects can evaluate us against the framework. The full pricing page is at /#pricing with explicit tier inclusions. Our scoping document is shared before signing. Code goes to the customer's GitHub on day one. Evaluation infrastructure (output validation, regression tests, cost monitoring) is included in the base price for every AI MVP at the Launch tier ($7,499) and above.
We pass our own framework with a score of 470/500. We dock ourselves the remaining 30 points because:
- We do not have public case studies for every category (some clients require NDAs)
- Post-launch support is 30 days standard; some agencies offer 60-90 days standard
If you find an agency that scores higher concretely (not in marketing language), they may be the right pick. The framework is what matters, not which agency you pick.
Related guides
- Top AI MVP Development Agencies in 2026 — ranked comparison of major agencies
- Best AI Agent Development Companies for Startups — agent-specific agency ranking
- What Is Fixed-Price Development? — pricing model explained
- Claude Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — model selection guide
Ready to evaluate HouseofMVPs against this framework? Get a fixed-price scope and answers to every question in 24 hours →