Dependable AI is an operating discipline, not a model choice.
The same five principles run every engagement I take — and every AI-first business I operate with my own P&L. They’re not a methodology trademark. They’re what survives contact with production.
Evals before features
If you can't measure quality, you can't ship changes — you can only gamble. Before I touch a feature, we define together what “good” means for it in numbers and human judgment, then I build the harness that measures it. Every release after that is gated by evidence.
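As a minimal sketch of what "gated by evidence" could look like in code: each eval carries a score and an agreed minimum, and a release goes out only when every eval clears its minimum. The names `EvalResult` and `release_gate`, the eval names, and the numbers are all hypothetical, chosen for illustration and not taken from any real engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One eval's outcome (hypothetical structure)."""
    name: str
    score: float      # measured quality, 0.0 to 1.0
    threshold: float  # minimum agreed with the team before release


def release_gate(results):
    """Return (ok, failures): ok is True only if every eval meets its threshold."""
    failures = [r.name for r in results if r.score < r.threshold]
    return (not failures, failures)


# Illustrative numbers only: one automated metric, one human-judgment metric.
results = [
    EvalResult("answer_accuracy", score=0.91, threshold=0.90),
    EvalResult("human_rated_helpfulness", score=0.78, threshold=0.80),
]
ok, failures = release_gate(results)
print(ok, failures)  # False ['human_rated_helpfulness']
```

In a CI pipeline, a check like this would typically fail the build when `ok` is false, so a regression in either the automated or the human-rated metric blocks the release.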
Cost is a product requirement
AI-first products run 50–60% gross margins when inference goes unmanaged; traditional SaaS ran 70–90%. I treat cost like latency: every AI call has a price, every feature gets a ceiling and an owner, and the dashboard makes drift visible in days, not quarters.
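A rough sketch of the "price, ceiling, owner" idea, assuming token-based pricing: compute the cost of each call from its token counts, then compare a feature's average against its ceiling. The per-million-token prices, the feature name, the owner, and the ceiling are placeholders, not real vendor rates or client figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureBudget:
    """Cost ceiling for one feature (hypothetical structure)."""
    feature: str
    owner: str
    ceiling_usd_per_call: float


def cost_per_call(input_tokens, output_tokens, in_usd_per_m, out_usd_per_m):
    """Cost in USD of one call, given prices per million tokens."""
    return (input_tokens * in_usd_per_m + output_tokens * out_usd_per_m) / 1_000_000


def over_ceiling(budget, call_costs):
    """True when the average cost per call exceeds the feature's ceiling."""
    average = sum(call_costs) / len(call_costs)
    return average > budget.ceiling_usd_per_call


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
budget = FeatureBudget("ticket_summary", owner="support-eng", ceiling_usd_per_call=0.01)
costs = [cost_per_call(2000, 500, 3.0, 15.0), cost_per_call(1000, 200, 3.0, 15.0)]
print(over_ceiling(budget, costs))  # False: the average is about $0.0098 per call
```

Running this check on each day's calls is one way to surface drift in days: the owner hears about it the first time the average crosses the ceiling.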
Humans in the loop, by design
The dependable systems are Human+AI systems. I design the review loop as part of the product: low-confidence outputs route to people, their judgments feed the evals, and the loop tightens every week. Not unsupervised magic — supervised leverage.
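The review loop described above can be sketched roughly as follows, assuming the model (or a separate scorer) produces a confidence value between 0 and 1. The `ReviewLoop` class, the 0.8 threshold, and the sample outputs are hypothetical; a real threshold would come from the evals.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewLoop:
    """Routes low-confidence outputs to people and keeps their judgments."""
    threshold: float = 0.8  # assumed cutoff for illustration
    review_queue: list = field(default_factory=list)
    eval_cases: list = field(default_factory=list)

    def route(self, output, confidence):
        """Send low-confidence outputs to human review; pass the rest through."""
        if confidence < self.threshold:
            self.review_queue.append(output)
            return "human_review"
        return "auto"

    def record_judgment(self, output, accepted):
        """Store a reviewer's verdict as a labeled case for future evals."""
        self.review_queue.remove(output)
        self.eval_cases.append({"output": output, "accepted": accepted})


loop = ReviewLoop()
print(loop.route("Refund approved for order A", confidence=0.95))  # auto
print(loop.route("Refund approved for order B", confidence=0.55))  # human_review
loop.record_judgment("Refund approved for order B", accepted=False)
print(len(loop.eval_cases))  # 1
```

Each recorded judgment becomes a labeled example, which is how reviewer decisions end up feeding the evals.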
Narrow first, then wide
One initiative, shipped and measured, beats five pilots in flight. I take a single stuck feature to dependable production, install the discipline along the way, and let the second and third initiatives ride on rails the first one built.
Your team owns everything
Code, evals, dashboards, decisions, runbooks. I work inside your repo, your engineers shadow every step, and handover is designed in from day one, not bolted on at the end. The measure of my work is what keeps running after I leave.
You see progress weekly. Not at the end.
Stalled AI initiatives die in silence — months of work, then a reveal that disappoints. I run the opposite cadence.
Weekly demo
Working software, in front of your decision-maker, every week. Not status decks — running systems.
Weekly numbers
Eval scores, cost per call, incident count. The same three numbers, every week, each either trending in the right direction or explained.
Production from week two
Behind flags, with controlled blast radius — because real traffic teaches what staging can't.
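One common way to keep the blast radius controlled is a percentage rollout: hash the flag name and user ID into a stable bucket from 0 to 99, and enable the feature only for buckets below the rollout percentage. The sketch below assumes that approach; the function name, flag name, and user IDs are made up for illustration, and a team already using a feature-flag service would rely on that service instead.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Deterministically decide whether a user is inside a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# The same user always lands in the same bucket, so exposure only grows
# as the percentage is raised; nobody flips in and out between requests.
users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if in_rollout("ai_summary_v2", u, percent=5)]
print(len(exposed))  # roughly 5% of the 1000 users
```

Setting `percent` to 0 acts as a kill switch, and raising it step by step widens exposure without reshuffling who already has the feature.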
Plain-language readouts
Every finding lands in language your board understands. The jargon stays in the code comments.
Model-agnostic, by principle
Vendor loyalty is a cost center. Every model and tool decision is made against your evals and your unit economics — and revisited when the numbers say so.
- Models: OpenAI, Anthropic, Gemini, open-weights — chosen by eval scores and cost per outcome, not by habit.
- Infrastructure: deep hands-on experience with GCP (Cloud Run, BigQuery); comfortable across AWS and Azure. I build in your stack, not mine.
- Patterns: multi-agent workflows, RAG with managed memory, custom decision systems, human-in-the-loop evaluation frameworks — all running today in systems I operate.
- No lock-in: everything I build runs without me. That's a design requirement, not a promise.
See the discipline before you buy it.
Bring one stuck initiative to a 30-minute call with me. You'll get an honest read — including "don't hire me" if that's the truth.
Every note is read by me, not a funnel. — Mayur Sethi