What's the difference between RAG and fine-tuning, and when should I use each?

RAG injects relevant context into the prompt at query time, so the model can reference fresh or private data without retraining. Fine-tuning bakes new behaviour into the model weights themselves, which is permanent and expensive. For 90% of production use cases — internal documents, customer support, code Q&A — RAG is the right choice because the data changes and you want to update it without retraining. Fine-tuning makes sense for style/persona changes, very domain-specific reasoning that prompting can't reach, or extreme cost optimisation at huge scale.

Which LLM provider should I pick: OpenAI, Claude, or Gemini?

For RAG and standard tasks, all three are comparable, so pick on cost or latency. For complex reasoning and code, Claude 3.5 Sonnet and GPT-4 trade blows. For long-context tasks (full codebases, legal docs), Gemini 1.5 Pro's 1M-token context window is in a different category. For latency-sensitive apps, GPT-4o-mini and Claude 3.5 Haiku are roughly an order of magnitude faster than the flagship models. Most important: build a provider abstraction from day one so you can switch. Treating your LLM choice as permanent is the most common mistake I see.

How do I keep AI costs predictable in production?

Three controls: cap context size per request (don't let one user stuff 100K tokens into a chat history), cap requests per user per day, and monitor cost-per-user as a metric. The horror story is always the same — one user's session burns $400 because nobody capped anything. Set up cost alerts on your OpenAI/Anthropic dashboard. Use the cheaper models (4o-mini, Haiku) for any task where they're good enough — which is most of them.
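Here is a minimal sketch of those three controls in one place. Everything named here (CostGuard, estimate_tokens, the limits, the price) is a hypothetical placeholder, and the four-characters-per-token estimate is a crude heuristic; a real system would use the provider's tokenizer and current pricing.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


@dataclass
class CostGuard:
    """Per-user limits and spend tracking. All defaults are illustrative."""
    max_context_tokens: int = 8_000    # hypothetical per-request context cap
    max_requests_per_day: int = 200    # hypothetical per-user daily cap
    usd_per_1k_tokens: float = 0.002   # placeholder price, not a real rate
    requests: dict = field(default_factory=lambda: defaultdict(int))
    spend_usd: dict = field(default_factory=lambda: defaultdict(float))

    def trim_history(self, messages: list[str]) -> list[str]:
        """Keep the most recent messages that fit inside the context cap."""
        kept, used = [], 0
        for msg in reversed(messages):
            tokens = estimate_tokens(msg)
            if used + tokens > self.max_context_tokens:
                break
            kept.append(msg)
            used += tokens
        return list(reversed(kept))

    def allow(self, user_id: str, day: date) -> bool:
        """Count the request and return False once the daily cap is reached."""
        key = (user_id, day)
        if self.requests[key] >= self.max_requests_per_day:
            return False
        self.requests[key] += 1
        return True

    def record(self, user_id: str, tokens_used: int) -> None:
        """Accumulate cost-per-user so it can be exported as a metric."""
        self.spend_usd[user_id] += tokens_used / 1000 * self.usd_per_1k_tokens
```

In a request handler the order would be: call allow() first, trim_history() before building the prompt, then record() with the token usage the provider reports.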

Do I need LangChain or can I just call the OpenAI SDK directly?

For prototyping and complex chains, LangChain saves time. For production, I usually drop down to raw SDKs — LangChain's abstractions add overhead and the maintenance burden grows with the project. The hybrid pattern that works: use LangChain (or LlamaIndex) for the parts where its abstractions earn their place (document loaders, retrievers, prompt templates), use raw SDKs for the rest. LangSmith for tracing is genuinely good regardless of which framework you pick.

How do I evaluate AI outputs without it becoming theatre?

Start small. Pick 50 representative prompts. Write down expected behaviour in plain English per prompt ('must cite a source', 'must format as JSON'). Build automated checkers (regex, string match, or small LLM-as-judge calls) for each. Run the set on every prompt change before merging. If the pass rate drops below the threshold, the PR is blocked. The first time this catches a regression that would've shipped, the team converts forever. Don't reach for fancier eval frameworks until the basics are in place.

What vector database should I use for production RAG?

ChromaDB if you want self-hosted and small-scale. Pinecone if you want fully managed and can pay the per-month cost. pgvector if you already have Postgres and want to keep your stack small (and have under ~10M vectors). FAISS is fine for in-memory or batch jobs but the operational story (persistence, sharding, replication) usually isn't worth it. Don't over-optimise here — start with whichever is easiest, switch if you hit real scale limits. Most products never do.

AI & LLMs — RAG, LangChain, OpenAI Tutorials

Production-grade AI, not demos.

This is where I share what actually ships. Not the LangChain quickstart that ran perfectly in the YouTube tutorial. Not the OpenAI demo that worked on a 50-row CSV but melted on the real corpus. This category is for engineers building AI features that real users hit thousands of times a day — the patterns, the failures, the cost numbers, and the architectural decisions that nobody talks about until 3am on launch night.

I've spent the last two years putting RAG pipelines into production, integrating multiple LLM providers, debugging hallucinations under real traffic, and figuring out which "AI engineering" advice on Twitter is real and which is hype. Most of it is hype. The signal is small, but where it exists, it's worth writing down. That's what this category is for.

Why this category exists

Most AI content online falls into two buckets. The first is hype — vague posts about how "AI changes everything" without a single concrete line of code. The second is academic — papers and benchmarks disconnected from any product context. Both are useless if you're trying to ship an AI feature on a deadline with a budget that matters and a customer base that will complain when the model says something stupid.

This category is the third thing. It's written for the engineer who has been told by a founder "add some AI" and now has to figure out what RAG even means, which LLM to pick, how much it'll cost at 10K users, and how to handle the inevitable case where the model returns garbage. It's grounded, it's specific, and it assumes you've shipped real software before. There's no introductory "what is an LLM" filler. There's a lot of "here's how I solved this specific production problem and what it cost me to learn the lesson."

What you'll find here

The posts in this category cover four broad areas: production RAG systems, AI agents and tool-use, LLM provider comparisons with real cost numbers, and the operational realities of running AI in production.

Production RAG — chunking strategies that actually matter, embedding model comparisons, vector database benchmarks under real load, hybrid search, reranking, and the moment your "naive RAG" stops being good enough
AI agents — when agents are worth the complexity, tool-use patterns, retry and observability, multi-step planning, and what "agentic" actually means once the demo is over
LLM provider write-ups — OpenAI vs Claude vs Gemini for specific use cases, real latency numbers, real cost analysis, fallback strategies, and how to design for switching providers without rewriting your stack
AI in real products — debugging hallucinations, prompt regression testing, evals that aren't theatre, handling user privacy, and the operational maturity nobody warns you about until you need it

Why RAG keeps breaking in production

The first RAG tutorial you read makes it look easy: chunk your documents, embed them, store them in a vector DB, query top-K, stuff the results into the prompt, done. And it works, beautifully, on the demo. Then you put it in front of real users and the wheels come off.

The user asks a multi-hop question and your top-K retrieval grabs the wrong chunk. The query is too short to embed meaningfully and you retrieve nothing useful. The user's question contains a typo of a product name and your semantic search misses it entirely — while a basic keyword search would have nailed it. The chunks are 500 tokens but the model needs the surrounding context to answer. You're paying for embeddings on documents that get queried twice a month. The cost numbers don't add up.

Production RAG is a stack of techniques layered on top of each other to handle these failure modes: hybrid retrieval (semantic + BM25), reranking with a cross-encoder, query expansion with the LLM, chunk overlap and structural awareness, metadata filtering, and observability so you can see what was retrieved when an answer was wrong. None of this is glamorous. All of it matters. Most of the depth in this category is on the realities of making each of these pieces work without exploding your cost structure or your latency budget.
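To make the hybrid retrieval step concrete, here is a small sketch of one common way to merge a semantic ranking with a BM25 ranking: reciprocal rank fusion (RRF). The paragraph above doesn't say which fusion method to use, so treat RRF as my choice for illustration. The function name is hypothetical, k=60 is a commonly used default, and the inputs are assumed to be lists of document IDs already ordered best-first by each retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several best-first rankings into one list of document IDs.

    Each document scores 1 / (k + rank) for every ranking it appears in,
    so documents that rank well in more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


semantic_hits = ["a", "b", "c"]   # e.g. from the vector DB
keyword_hits = ["c", "a", "d"]    # e.g. from BM25
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Because RRF only looks at ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list is what you would hand to the reranker.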

The LLM provider question, honestly

OpenAI vs Claude vs Gemini is the most asked question in AI engineering, and the answer is always "it depends" — but it depends on specific things that nobody lays out clearly. Here's the actual framework I use.

If you need the absolute best reasoning on complex tasks, Claude 3.5 Sonnet and GPT-4 trade blows. Claude is often better at following nuanced instructions and writing code; GPT-4 is more consistent on edge cases. Gemini is cheaper and has a larger context window but trades off reliability on complex prompts in ways that show up subtly during evals.

For RAG-style "answer based on context", any of the frontier models work — pick on cost. For agentic workflows with tool-use, OpenAI's function calling has the most mature ecosystem and the most predictable behaviour. For long-context tasks (legal docs, full codebases, multi-document analysis), Gemini 1.5 Pro's 1M context window is genuinely in a different category. For latency-sensitive applications, GPT-4o-mini and Claude 3.5 Haiku are an order of magnitude faster than the flagship models and "good enough" for the vast majority of tasks.

The biggest mistake teams make is treating LLM choice as a permanent architectural decision. It isn't. Build a provider abstraction from day one, even if you ship with just one. Three months in, you will want to test alternatives — for cost, for quality, for redundancy when OpenAI has an outage at 4pm IST. Switching shouldn't require rewriting your whole prompt layer. The posts in this category cover the exact shape of provider abstractions that have held up across rewrites and migrations.
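As a rough illustration of what a provider abstraction can look like (a sketch, not the specific shape covered in the posts), the idea is a small interface that business logic depends on, with one adapter per vendor behind it. LLMProvider, Completion, EchoProvider, and Router are all hypothetical names; EchoProvider stands in for a real adapter that would wrap a vendor SDK call.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    provider: str


class LLMProvider(Protocol):
    """The only surface business logic is allowed to depend on."""
    name: str

    def complete(self, system: str, user: str) -> Completion: ...


class EchoProvider:
    """Stand-in adapter. A real one would call the vendor SDK inside complete()."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    def complete(self, system: str, user: str) -> Completion:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return Completion(text=f"[{self.name}] {user}", provider=self.name)


class Router:
    """Tries providers in order, so an outage falls through to the next one."""

    def __init__(self, providers: list[LLMProvider]):
        self.providers = providers

    def complete(self, system: str, user: str) -> Completion:
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(system, user)
            except Exception as exc:  # a real adapter would raise narrower errors
                errors.append(f"{provider.name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


router = Router([EchoProvider("primary", fail=True), EchoProvider("backup")])
result = router.complete("You are a helpful assistant.", "hi")
```

The point of the design is that prompts and call sites only ever see complete(); vendor-specific message formats and quirks stay inside the adapters.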

Hallucinations and how to actually handle them

"Just engineer the prompt better" is the most useless advice in AI engineering. Models hallucinate. They will hallucinate. The question is whether your system catches it.

For factual Q&A, the most reliable defence is grounded generation: force the model to cite specific chunks from its context, then verify the citations programmatically. If a citation doesn't match anything in the retrieved context, you know something went wrong before the user sees it. This single technique catches most fabrications and is dramatically more reliable than "be careful" prompts.
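A sketch of the programmatic check, under some assumptions of mine: the model is prompted to return JSON with a "citations" list, each citation carries a chunk_id and a verbatim quote, and the retrieved context is available as a dict of chunk ID to text. That output schema and the function name verify_citations are hypothetical; adapt them to whatever citation format you actually enforce.

```python
import json
import re


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(model_output: str, retrieved: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every citation checks out."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    citations = payload.get("citations") or []
    if not citations:
        return ["answer has no citations"]

    problems = []
    for citation in citations:
        if not isinstance(citation, dict):
            problems.append("malformed citation")
            continue
        chunk_id = citation.get("chunk_id")
        quote = str(citation.get("quote", ""))
        if chunk_id not in retrieved:
            problems.append(f"unknown chunk id: {chunk_id}")
        elif not quote.strip() or _normalise(quote) not in _normalise(retrieved[chunk_id]):
            problems.append(f"quote not found in {chunk_id}")
    return problems
```

If the returned list is non-empty, the answer can be regenerated, downgraded to "I couldn't find that", or logged for review before anything reaches the user. Note that exact-substring matching is strict: a paraphrased quote will be flagged, which is usually the behaviour you want for factual Q&A.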

For longer-form generation, regression evals are essential. Pick 100 representative prompts. Generate outputs. Label them good or bad. Now every time you change a prompt or upgrade a model, run them through and compare. This isn't sexy infrastructure but it's the difference between "we hope it still works" and "we know it still works." Most teams skip this step and discover, three months in, that an innocuous-looking prompt tweak silently broke 12% of their outputs.

For high-stakes flows (medical advice, legal interpretation, financial decisions), you need a human-in-the-loop checkpoint. Period. The maths on confidence intervals doesn't save you when wrong outputs cause real damage. Build the workflow assuming the model will be wrong sometimes; design for that case explicitly. The posts in this category go into specific UX patterns for human-in-the-loop systems that don't feel slow to users.

The tools and frameworks I actually use

I've cycled through a lot of AI tooling in the last two years. Here's what stuck.

LangChain — useful for prototyping, hard to maintain in production. I use it for quick experiments and chains where the value is in the abstractions; for production code I usually drop down to the raw provider SDKs. LangSmith for tracing and observability is genuinely good though, and is the part of the LangChain ecosystem I rely on most.

LlamaIndex — better than LangChain specifically for RAG. The data ingestion and query engine abstractions are well-thought-out, and the team has been iterating on hard problems like reranking and structured outputs. If your project is mostly retrieval-augmented, start here instead of LangChain.

Vector databases — ChromaDB for prototyping and self-hosted small-scale deployments; Pinecone when you need fully managed and can pay; pgvector when you already have Postgres and want to keep your stack small. I rarely use FAISS in production anymore — the operational story (persistence, replication, sharding) isn't worth it for most teams when ChromaDB or pgvector solve the same problem with one fewer service to monitor.

Embedding models — OpenAI's text-embedding-3-large for English-heavy use cases; Cohere's multilingual models for international; bge-large-en for self-hosted scenarios. The choice matters more than people realise — switching embedding models means reindexing everything, so make this decision deliberately.

Hugging Face Transformers — when you need to fine-tune or run smaller models locally. Increasingly important as the open-source models close the gap with frontier proprietary ones. The category covers fine-tuning workflows that are realistic for small teams without GPU clusters.

Evals: the unglamorous work that decides whether your AI ships

Every AI team I've talked to has eventually arrived at the same realisation: without evals, you're flying blind. With evals, you have the only thing that lets you change prompts, swap models, or upgrade providers without breaking production. The teams that succeed treat evals like unit tests — boring, mandatory, run in CI. The teams that struggle treat them as something they'll "set up properly next quarter."

Practical eval design is much simpler than the AI eval literature makes it look. Start with 50 representative prompts. For each, write down the expected behaviour in plain English ("must cite a source", "must refuse to give legal advice", "must format the answer as JSON"). Then write an automated checker per behaviour — sometimes a regex, sometimes a string match, sometimes a small LLM-as-judge call. Run the whole set against your current configuration. That's your baseline.

Now every prompt change runs against the baseline before merge. If the pass rate drops below the threshold, the PR gets blocked. The first time this catches a regression that would have shipped to production, the team converts to evals for life. Until that moment, evals will feel like overhead. There's no shortcut: you have to feel the pain of an unnoticed regression once before the investment makes sense. That's exactly why this category covers the topic in depth, with the specific eval harnesses I've used.
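The baseline-and-gate loop described above fits in a few dozen lines. This is an illustrative sketch rather than the harness from the posts: the checker functions, the "[source: ...]" marker format, the three cases, the 0.9 threshold, and fake_model (a stand-in for your real generation call) are all assumptions.

```python
import json
import re
from typing import Callable

Checker = Callable[[str], bool]


def cites_source(output: str) -> bool:
    """Passes if the output contains a marker like [source: handbook.pdf]."""
    return re.search(r"\[source:\s*[^\]]+\]", output) is not None


def is_json(output: str) -> bool:
    """Passes if the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


# Each case pairs a prompt with the checkers that encode its expected behaviour.
CASES: list[tuple[str, list[Checker]]] = [
    ("What is the refund window?", [cites_source]),
    ("Return the order status as JSON.", [is_json]),
    ("Summarise the SLA.", [cites_source]),
]


def run_evals(generate: Callable[[str], str], cases) -> float:
    """Return the fraction of cases where every checker passes."""
    passed = 0
    for prompt, checkers in cases:
        output = generate(prompt)
        if all(check(output) for check in checkers):
            passed += 1
    return passed / len(cases)


def gate(pass_rate: float, threshold: float = 0.9) -> int:
    """Exit code for CI: 0 lets the PR through, 1 blocks it."""
    return 0 if pass_rate >= threshold else 1


def fake_model(prompt: str) -> str:
    """Canned outputs standing in for a real model call."""
    canned = {
        "What is the refund window?": "14 days [source: refunds.md]",
        "Return the order status as JSON.": '{"status": "shipped"}',
        "Summarise the SLA.": "99.9% uptime, probably.",
    }
    return canned[prompt]


pass_rate = run_evals(fake_model, CASES)
```

In CI, the script would end by passing gate(pass_rate) to sys.exit so that a drop below the threshold fails the job. Here the third case fails its checker, so the run would be blocked at the default threshold.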

The advanced layer — LLM-as-judge, pairwise comparison, calibrated rubrics — gets a separate post. Don't reach for it until the basics are in place. Skipping the basics for the fancy version is the most common pattern of failure I see.

Streaming, retries, and the user-perceived latency tax

Frontier model responses are slow. A long answer from GPT-4 or Claude Sonnet can take 8–15 seconds to fully generate. Build a synchronous API endpoint that waits for the full response and your users will think your product is broken. Stream the response token-by-token and the same 12-second wait feels fast because the user sees progress immediately.

Streaming isn't polish; it's load-bearing UX. The architectural implications matter: streaming pushes you toward Server-Sent Events or WebSockets instead of plain request/response REST, your client needs to handle partial JSON, and your error handling has to deal with mid-stream failures. Posts in this category cover the specific patterns I've used in production for streaming RAG answers, agent reasoning steps, and structured outputs.
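To show what dealing with mid-stream failures can mean on the server side, here is a small sketch that wraps any token iterator in Server-Sent Events frames. It is framework-agnostic on purpose: sse_frames and flaky_stream are hypothetical names, and the "done" and "error" event names are my own convention, not part of the SSE format itself. Tokens are JSON-encoded so that newlines inside a token can't break the frame.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a token stream into SSE frames, with explicit done/error events.

    The client always receives a terminal event, so it can tell a finished
    answer apart from a connection that died halfway through.
    """
    try:
        for token in tokens:
            yield f"data: {json.dumps({'token': token})}\n\n"
    except Exception as exc:
        # Upstream failed after we already sent partial output: tell the client.
        yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"
        return
    yield "event: done\ndata: {}\n\n"


def flaky_stream() -> Iterator[str]:
    """Simulates a provider stream that breaks after two tokens."""
    yield "Hel"
    yield "lo"
    raise ConnectionError("upstream reset")


frames = list(sse_frames(flaky_stream()))
```

On the client, the error event is the cue to keep the partial text visible and offer a retry, rather than silently leaving a truncated answer on screen.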

Retries are the other half of the operational story. Frontier models occasionally return rate-limit errors, transient failures, or just refuse to follow the format you asked for. Naive retries pile up cost and latency. Smart retries use exponential backoff, swap to a fallback provider after N failures, and surface a useful error to the user instead of an opaque timeout. The pattern I've converged on after many production incidents lives in a dedicated post in this category.
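A minimal sketch of that retry shape, not the exact pattern from the dedicated post: exponential backoff with a little jitter on the primary provider, a switch to the fallback after max_attempts failures, and a readable error if both are down. TransientError, call_with_retries, and the default delays are hypothetical; a real version would map the SDK's rate-limit and timeout exceptions onto TransientError, and would handle format violations separately.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Stand-in for rate limits, timeouts, and other retryable failures."""


def call_with_retries(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests don't wait
) -> str:
    for attempt in range(max_attempts):
        try:
            return primary(prompt)
        except TransientError:
            if attempt < max_attempts - 1:
                delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
                sleep(delay + random.uniform(0, delay * 0.1))  # small jitter
    # Primary exhausted its attempts: try the fallback provider once.
    try:
        return fallback(prompt)
    except TransientError as exc:
        raise RuntimeError(
            "The assistant is temporarily unavailable. Please try again shortly."
        ) from exc
```

Capping the attempts is what keeps cost and latency bounded: with these defaults the worst case adds under two seconds of waiting before the fallback is tried.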

The latency budget calculation that very few teams do: list every step in your AI pipeline (embedding query, vector search, reranking, LLM call, post-processing), assign a realistic p95 latency to each, and sum them. If the steps that run before the LLM call already add up to more than two seconds, you have a budget problem. Fix it before investing further in retries and observability: both layer on top of latency and won't help you if the baseline is already slow.
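The arithmetic is simple enough to keep in the repo next to the pipeline. In this sketch the p95 numbers are made up for illustration, and latency_budget and PIPELINE_P95_MS are hypothetical names; plug in figures from your own tracing. Summing per-step p95s gives a rough, usually pessimistic estimate rather than the true end-to-end p95, which is fine for a budget check.

```python
# Hypothetical p95 latencies in milliseconds, listed in pipeline order.
PIPELINE_P95_MS = {
    "embed_query": 80,
    "vector_search": 120,
    "rerank": 250,
    "llm_call": 6000,
    "post_process": 40,
}


def latency_budget(
    p95_ms: dict[str, int],
    llm_step: str = "llm_call",
    pre_llm_limit_ms: int = 2000,
) -> dict:
    """Sum the p95s and flag the pipeline if the pre-LLM steps exceed the limit."""
    steps = list(p95_ms)  # dicts preserve insertion order, i.e. pipeline order
    llm_index = steps.index(llm_step)
    pre_llm = sum(p95_ms[step] for step in steps[:llm_index])
    return {
        "pre_llm_ms": pre_llm,
        "total_ms": sum(p95_ms.values()),
        "over_budget": pre_llm > pre_llm_limit_ms,
    }


report = latency_budget(PIPELINE_P95_MS)
```

With these example numbers the pre-LLM steps total 450 ms, well inside the two-second limit; a slow reranker is typically the step that pushes a pipeline over.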

Mistakes I see (and have made) repeatedly

The same patterns of failure show up in almost every AI project I review:

No evals — shipping prompt changes by gut feeling, breaking subtle things nobody notices until users complain on Twitter
Hardcoded provider — coupling business logic directly to OpenAI's specific quirks, making any migration a multi-week project
No cost monitoring — discovering at the end of the month that one user's session burned $400 in tokens because nobody capped the context
Bloated context — stuffing 100K tokens into every prompt because chunking is hard, paying 10x what you need to and degrading the model's focus
Streaming as an afterthought — building synchronous responses first and trying to retrofit streaming later, which is functionally a rewrite of your transport layer
No fallback for downtime — when OpenAI has an outage (and it will), your entire product stops working until they're back
No prompt versioning — production prompts that get edited in a hurry by whoever has access, with no record of what changed when something breaks

Each of these gets a dedicated post in this category when I have time to write it up properly. The pattern is always the same: the bug is obvious in hindsight, the fix is straightforward once you've seen it, and the cost of not having addressed it earlier is significantly bigger than you'd estimate up front.

What's coming next in this category

The next few posts on the docket: a deep-dive into the RAG pipeline I built for a legal-tech project (with architecture diagrams, real production cost numbers, and the three failure modes that forced rewrites), a head-to-head benchmark of GPT-4o-mini vs Claude 3.5 Haiku on production-shaped traffic, a piece on how I structure evals so they actually run in CI without becoming a nuisance, a write-up of the agent-orchestration pattern I've converged on after three failed attempts at building "AI agents that just work," and a deep look at vector DB cost/performance tradeoffs at the 1M-document scale.

If there's a specific topic you want me to write about, the contact form on the homepage works. I read every message and the queue here is driven by what readers actually want to know — not what's trending on AI Twitter this week.
