Production-grade AI, not demos.
This is where I share what actually ships. Not the LangChain quickstart that ran perfectly in the YouTube tutorial. Not the OpenAI demo that worked on a 50-row CSV but melted on the real corpus. This category is for engineers building AI features that real users hit thousands of times a day — the patterns, the failures, the cost numbers, and the architectural decisions that nobody talks about until 3am on launch night.
I've spent the last two years putting RAG pipelines into production, integrating multiple LLM providers, debugging hallucinations under real traffic, and figuring out which "AI engineering" advice on Twitter is real and which is hype. Most of it is hype. The signal is small, but where it exists, it's worth writing down. That's what this category is for.
Why this category exists
Most AI content online falls into two buckets. The first is hype — vague posts about how "AI changes everything" without a single concrete line of code. The second is academic — papers and benchmarks disconnected from any product context. Both are useless if you're trying to ship an AI feature on a deadline with a budget that matters and a customer base that will complain when the model says something stupid.
This category is the third thing. It's written for the engineer who has been told by a founder "add some AI" and now has to figure out what RAG even means, which LLM to pick, how much it'll cost at 10K users, and how to handle the inevitable case where the model returns garbage. It's grounded, it's specific, and it assumes you've shipped real software before. There's no introductory "what is an LLM" filler. There's a lot of "here's how I solved this specific production problem and what it cost me to learn the lesson."
What you'll find here
The posts in this category cover four broad areas: production RAG systems, AI agents and tool-use, LLM provider comparisons with real cost numbers, and the operational realities of running AI in production.
- Production RAG — chunking strategies that actually matter, embedding model comparisons, vector database benchmarks under real load, hybrid search, reranking, and the moment your "naive RAG" stops being good enough
- AI agents — when agents are worth the complexity, tool-use patterns, retry and observability, multi-step planning, and what "agentic" actually means once the demo is over
- LLM provider write-ups — OpenAI vs Claude vs Gemini for specific use cases, real latency numbers, real cost analysis, fallback strategies, and how to design for switching providers without rewriting your stack
- AI in real products — debugging hallucinations, prompt regression testing, evals that aren't theatre, handling user privacy, and the operational maturity nobody warns you about until you need it
Why RAG keeps breaking in production
The first RAG tutorial you read makes it look easy: chunk your documents, embed them, store them in a vector DB, query top-K, stuff the results into the prompt, done. And it works, beautifully, on the demo. Then you put it in front of real users and the wheels come off.
The user asks a multi-hop question and your top-K retrieval grabs the wrong chunk. The query is too short to embed meaningfully and you retrieve nothing useful. The user's question contains a typo of a product name and your semantic search misses it entirely — while a basic keyword search would have nailed it. The chunks are 500 tokens but the model needs the surrounding context to answer. You're paying for embeddings on documents that get queried twice a month. The cost numbers don't add up.
Production RAG is a stack of techniques layered on top of each other to handle these failure modes: hybrid retrieval (semantic + BM25), reranking with a cross-encoder, query expansion with the LLM, chunk overlap and structural awareness, metadata filtering, and observability so you can see what was retrieved when an answer was wrong. None of this is glamorous. All of it matters. Most of the depth in this category is on the realities of making each of these pieces work without exploding your cost structure or your latency budget.
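As a rough illustration of the hybrid retrieval step, here is a minimal sketch that merges a semantic ranking and a BM25 ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here for illustration; the text above does not say which fusion method is used. The chunk IDs, the function name, and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores 1 / (k + rank) in every list it appears in;
    chunks ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a vector search and a keyword (BM25) search.
semantic_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A cross-encoder reranker would then run over the top of the fused list.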
The LLM provider question, honestly
OpenAI vs Claude vs Gemini is the most asked question in AI engineering, and the answer is always "it depends" — but it depends on specific things that nobody lays out clearly. Here's the actual framework I use.
If you need the absolute best reasoning on complex tasks, Claude 3.5 Sonnet and GPT-4 trade blows. Claude is often better at following nuanced instructions and writing code; GPT-4 is more consistent on edge cases. Gemini is cheaper and has a larger context window, but it gives up some reliability on complex prompts in ways that show up only subtly during evals.
For RAG-style "answer based on context", any of the frontier models work — pick on cost. For agentic workflows with tool-use, OpenAI's function calling has the most mature ecosystem and the most predictable behaviour. For long-context tasks (legal docs, full codebases, multi-document analysis), Gemini 1.5 Pro's 1M context window is genuinely in a different category. For latency-sensitive applications, GPT-4o-mini and Claude 3.5 Haiku are an order of magnitude faster than the flagship models and "good enough" for the vast majority of tasks.
The biggest mistake teams make is treating LLM choice as a permanent architectural decision. It isn't. Build a provider abstraction from day one, even if you ship with just one. Three months in, you will want to test alternatives — for cost, for quality, for redundancy when OpenAI has an outage at 4pm IST. Switching shouldn't require rewriting your whole prompt layer. The posts in this category cover the exact shape of provider abstractions that have held up across rewrites and migrations.
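One possible shape for such an abstraction, as a minimal sketch: business logic depends on a small interface, and each vendor SDK gets wrapped in an adapter that satisfies it. The names `LLMProvider`, `Completion`, `EchoProvider`, and `answer_question` are hypothetical, and the stand-in provider below calls no real SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    provider: str


class LLMProvider(Protocol):
    """The only surface the rest of the application is allowed to see."""

    name: str

    def complete(self, system: str, user: str) -> Completion: ...


class EchoProvider:
    """Stand-in provider for tests; a real adapter would wrap a vendor SDK."""

    name = "echo"

    def complete(self, system: str, user: str) -> Completion:
        return Completion(text=f"[{system}] {user}", provider=self.name)


def answer_question(provider: LLMProvider, question: str) -> Completion:
    # Prompt assembly lives here, not inside any vendor-specific adapter.
    return provider.complete(system="Answer briefly.", user=question)


result = answer_question(EchoProvider(), "What is our refund window?")
print(result.provider, result.text)
```

Swapping providers then means writing one new adapter class and changing which instance gets passed in, rather than touching every call site.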
Hallucinations and how to actually handle them
"Just engineer the prompt better" is the most useless advice in AI engineering. Models hallucinate. They will hallucinate. The question is whether your system catches it.
For factual Q&A, the most reliable defence is grounded generation: force the model to cite specific chunks from its context, then verify the citations programmatically. If a citation doesn't match anything in the retrieved context, you know something went wrong before the user sees it. This single technique catches most fabrications and is dramatically more reliable than "be careful" prompts.
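A minimal sketch of the programmatic citation check, assuming the prompt instructs the model to cite chunks with markers like `[chunk:ID]`. That marker format, the regex, and the function name are illustrative assumptions, not a standard.

```python
import re

# Assumed citation format: the prompt asks the model to write [chunk:<id>].
CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def verify_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Check that every cited chunk ID was actually in the retrieved context."""
    cited = set(CITATION_RE.findall(answer))
    unknown = cited - retrieved_ids
    return {
        "cited": cited,
        "unknown": unknown,
        # An answer with no citations at all also fails the check.
        "ok": bool(cited) and not unknown,
    }


retrieved = {"a1", "b2"}
good = verify_citations("Refunds take 5 days [chunk:a1].", retrieved)
bad = verify_citations("Refunds take 30 days [chunk:zz].", retrieved)
uncited = verify_citations("Refunds take 5 days.", retrieved)
print(good["ok"], bad["unknown"], uncited["ok"])
```

This only verifies that the cited IDs exist in the retrieved set. Checking that the cited chunk actually supports the claim needs a further step, such as a string-overlap check or a separate judge call.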
For longer-form generation, regression evals are essential. Pick 100 representative prompts. Generate outputs. Label them good or bad. Now every time you change a prompt or upgrade a model, run them through and compare. This isn't sexy infrastructure but it's the difference between "we hope it still works" and "we know it still works." Most teams skip this step and discover, three months in, that an innocuous-looking prompt tweak silently broke 12% of their outputs.
For high-stakes flows (medical advice, legal interpretation, financial decisions), you need a human-in-the-loop checkpoint. Period. The maths on confidence intervals doesn't save you when wrong outputs cause real damage. Build the workflow assuming the model will be wrong sometimes; design for that case explicitly. The posts in this category go into specific UX patterns for human-in-the-loop systems that don't feel slow to users.
The tools and frameworks I actually use
I've cycled through a lot of AI tooling in the last two years. Here's what stuck.
LangChain — useful for prototyping, hard to maintain in production. I use it for quick experiments and chains where the value is in the abstractions; for production code I usually drop down to the raw provider SDKs. LangSmith for tracing and observability is genuinely good though, and is the part of the LangChain ecosystem I rely on most.
LlamaIndex — better than LangChain specifically for RAG. The data ingestion and query engine abstractions are well-thought-out, and the team has been iterating on hard problems like reranking and structured outputs. If your project is mostly retrieval-augmented, start here instead of LangChain.
Vector databases — ChromaDB for prototyping and self-hosted small-scale deployments; Pinecone when you need fully managed and can pay; pgvector when you already have Postgres and want to keep your stack small. I rarely use FAISS in production anymore — the operational story (persistence, replication, sharding) isn't worth it for most teams when ChromaDB or pgvector solve the same problem with one fewer service to monitor.
Embedding models — OpenAI's text-embedding-3-large for English-heavy use cases; Cohere's multilingual models for international; bge-large-en for self-hosted scenarios. The choice matters more than people realise — switching embedding models means reindexing everything, so make this decision deliberately.
Hugging Face Transformers — when you need to fine-tune or run smaller models locally. Increasingly important as the open-source models close the gap with frontier proprietary ones. The category covers fine-tuning workflows that are realistic for small teams without GPU clusters.
Evals: the unglamorous work that decides whether your AI ships
Every AI team I've talked to has eventually arrived at the same realisation: without evals, you're flying blind. With evals, you have the only thing that lets you change prompts, swap models, or upgrade providers without breaking production. The teams that succeed treat evals like unit tests — boring, mandatory, run in CI. The teams that struggle treat them as something they'll "set up properly next quarter."
Practical eval design is much simpler than the AI eval literature makes it look. Start with 50 representative prompts. For each, write down the expected behaviour in plain English ("must cite a source", "must refuse to give legal advice", "must format the answer as JSON"). Then write an automated checker per behaviour — sometimes a regex, sometimes a string match, sometimes a small LLM-as-judge call. Run the whole set against your current configuration. That's your baseline.
Now every prompt change runs against the baseline before merge. Pass rate drops below threshold, the PR gets blocked. The first time this catches a regression that would have shipped to production, the team converts to evals for life. Until that moment, evals will feel like overhead. There's no shortcut — you have to feel the pain of an unnoticed regression once before the investment makes sense, which is exactly why this category covers the topic in depth, with the specific eval harnesses I've used.
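The baseline-and-threshold idea can be sketched in a few lines. This is a toy harness under stated assumptions: the checkers, the `[source:...]` citation convention, the three prompts, and `fake_generate` (which stands in for a real model call) are all hypothetical.

```python
import json
import re


def cites_source(output: str) -> bool:
    """Behaviour check: the answer must contain a [source:...] marker."""
    return re.search(r"\[source:[^\]]+\]", output) is not None


def is_json(output: str) -> bool:
    """Behaviour check: the answer must parse as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


EVAL_CASES = [
    {"prompt": "refund policy?", "checks": [cites_source]},
    {"prompt": "list plans as JSON", "checks": [is_json]},
    {"prompt": "cancel?", "checks": [cites_source]},
]


def fake_generate(prompt: str) -> str:
    # Stand-in for the real prompt + model configuration under test.
    canned = {
        "refund policy?": "Refunds take 5 days [source:policy.md]",
        "list plans as JSON": '{"plans": ["free", "pro"]}',
        "cancel?": "You can cancel any time.",
    }
    return canned[prompt]


def gate(cases, generate, threshold=0.95):
    """Return (passed, pass_rate); a CI job would fail when passed is False."""
    passed = sum(
        all(check(generate(case["prompt"])) for check in case["checks"])
        for case in cases
    )
    rate = passed / len(cases)
    return rate >= threshold, rate


ok, rate = gate(EVAL_CASES, fake_generate)
print(ok, round(rate, 2))
```

In CI, the script would exit non-zero when `ok` is false, which is what blocks the merge.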
The advanced layer — LLM-as-judge, pairwise comparison, calibrated rubrics — gets a separate post. Don't reach for it until the basics are in place. Skipping the basics for the fancy version is the most common pattern of failure I see.
Streaming, retries, and the user-perceived latency tax
Frontier model responses are slow. A long answer from GPT-4 or Claude Sonnet can take 8–15 seconds to fully generate. Build a synchronous API endpoint that waits for the full response and your users will think your product is broken. Stream the response token-by-token and the same 12-second wait feels fast because the user sees progress immediately.
Streaming isn't polish; it's load-bearing UX. The architectural implications matter: streaming pushes you toward Server-Sent Events or WebSockets instead of plain REST, your client needs to handle partial JSON, and your error handling has to deal with mid-stream failures. Posts in this category cover the specific patterns I've used in production for streaming RAG answers, agent reasoning steps, and structured outputs.
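To make the mid-stream failure case concrete, here is a minimal sketch of a generator that turns a token stream into Server-Sent Events frames and reports an upstream failure as an explicit `error` event instead of silently cutting the response. The event names `done` and `error` and the payload shapes are assumptions; `failing_stream` simulates a provider dropping the connection.

```python
import json


def sse_events(token_stream):
    """Wrap an iterator of tokens as SSE frames (text/event-stream)."""
    try:
        for token in token_stream:
            payload = json.dumps({"token": token})
            yield f"data: {payload}\n\n"
        # Tell the client the answer is complete, not just paused.
        yield "event: done\ndata: {}\n\n"
    except Exception as exc:
        # The client already rendered partial text, so send a typed event
        # it can use to show a retry option.
        payload = json.dumps({"message": str(exc)})
        yield f"event: error\ndata: {payload}\n\n"


def failing_stream():
    yield "Hel"
    raise RuntimeError("upstream closed")


ok_frames = list(sse_events(iter(["Hel", "lo"])))
failed_frames = list(sse_events(failing_stream()))
print(ok_frames[-1], failed_frames[-1])
```

A web framework's streaming response type would consume this generator directly; the framing logic stays independent of the framework.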
Retries are the other half of the operational story. Frontier models occasionally return rate-limit errors, transient failures, or just refuse to follow the format you asked for. Naive retries pile up cost and latency. Smart retries use exponential backoff, swap to a fallback provider after N failures, and surface a useful error to the user instead of an opaque timeout. The pattern I've converged on after many production incidents lives in a dedicated post in this category.
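A minimal sketch of that retry shape, assuming each provider is a plain callable that raises a hypothetical `TransientError` on retryable failures. The attempt count, the delays, and the jitter range are illustrative values, and this is not the pattern from the dedicated post.

```python
import random
import time


class TransientError(Exception):
    """Rate limits, timeouts, and other failures worth retrying."""


def call_with_fallback(providers, prompt, max_attempts=3, base_delay=0.5,
                       sleep=time.sleep):
    """Try each provider in order, backing off exponentially between attempts."""
    last_error = None
    for call in providers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except TransientError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    delay = base_delay * (2 ** attempt)
                    # Jitter keeps many clients from retrying in lockstep.
                    sleep(delay + random.uniform(0, delay / 2))
        # This provider used up its attempts; move on to the next one.
    raise RuntimeError(
        "The assistant is temporarily unavailable, please retry."
    ) from last_error


def always_rate_limited(prompt):
    raise TransientError("429 from primary")


def fallback_ok(prompt):
    return f"fallback: {prompt}"


delays = []  # record the sleeps instead of actually waiting
result = call_with_fallback([always_rate_limited, fallback_ok], "hi",
                            sleep=delays.append)
print(result, len(delays))
```

Injecting `sleep` keeps the function testable without real waiting. Format violations (the model ignoring the requested schema) would need their own exception type, since retrying those on the same provider with the same prompt often fails the same way.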
The latency budget calculation that very few teams do: list every step in your AI pipeline (embedding query, vector search, reranking, LLM call, post-processing), assign a realistic p95 latency to each, and sum them. If the steps that run before the LLM call already add up to more than two seconds, you have a budget problem. Fix it before going wider on retries and observability: both sit on top of the baseline latency and won't help you if that baseline is already slow.
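The budget arithmetic is simple enough to keep in a script next to the pipeline. In this sketch every millisecond figure is a made-up placeholder rather than a measurement, and the step names are assumptions.

```python
# Hypothetical p95 latencies in milliseconds, listed in pipeline order.
PIPELINE_P95_MS = {
    "embed_query": 120,
    "vector_search": 80,
    "rerank": 350,
    "llm_call": 6000,
    "post_process": 40,
}

PRE_LLM_BUDGET_MS = 2000  # the two-second threshold from the text


def pre_llm_latency(steps, llm_step="llm_call"):
    """Sum the p95s of every step that runs before the LLM call."""
    total = 0
    for name, p95_ms in steps.items():  # dicts preserve insertion order
        if name == llm_step:
            break
        total += p95_ms
    return total


pre_llm_ms = pre_llm_latency(PIPELINE_P95_MS)
total_ms = sum(PIPELINE_P95_MS.values())
over_budget = pre_llm_ms > PRE_LLM_BUDGET_MS
print(pre_llm_ms, total_ms, over_budget)
```

Summing per-step p95s is only a rough, usually pessimistic estimate, because the slow tails of different steps rarely coincide on the same request. It is still a useful first pass before measuring end-to-end p95 directly.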
Mistakes I see (and have made) repeatedly
The same patterns of failure show up in almost every AI project I review:
- No evals — shipping prompt changes by gut feeling, breaking subtle things nobody notices until users complain on Twitter
- Hardcoded provider — coupling business logic directly to OpenAI's specific quirks, making any migration a multi-week project
- No cost monitoring — discovering at the end of the month that one user's session burned $400 in tokens because nobody capped the context
- Bloated context — stuffing 100K tokens into every prompt because chunking is hard, paying 10x what you need to and degrading the model's focus
- Streaming as an afterthought — building synchronous responses first and trying to retrofit streaming later, which is functionally a rewrite of your transport layer
- No fallback for downtime — when OpenAI has an outage (and it will), your entire product stops working until they're back
- No prompt versioning — production prompts that get edited in a hurry by whoever has access, with no record of what changed when something breaks
Each of these gets a dedicated post in this category when I have time to write it up properly. The pattern is always the same: the bug is obvious in hindsight, the fix is straightforward once you've seen it, and the cost of not having addressed it earlier is significantly bigger than you'd estimate up front.
What's coming next in this category
The next few posts on the docket: a deep-dive into the RAG pipeline I built for a legal-tech project (with architecture diagrams, real production cost numbers, and the three failure modes that forced rewrites), a head-to-head benchmark of GPT-4o-mini vs Claude 3.5 Haiku on production-shaped traffic, a piece on how I structure evals so they actually run in CI without becoming a nuisance, a write-up of the agent-orchestration pattern I've converged on after three failed attempts at building "AI agents that just work," and a deep look at vector DB cost/performance tradeoffs at the 1M-document scale.
If there's a specific topic you want me to write about, the contact form on the homepage works. I read every message and the queue here is driven by what readers actually want to know — not what's trending on AI Twitter this week.
