The Prompt Isn't the Product

Most teams I see building agents are spending their time on prompts and tools. That's where the obvious levers are, so it makes sense. Tweak the prompt, adjust a tool description, run it once, eyeball the result. Ship when it looks good.

The teams whose agents keep getting better are doing something different. They're investing in the feedback loop around the prompts and tools: tracing, eval, iteration. The prompt is where you start. The loop is where the compounding happens.

Why Agents Are Hard to Iterate On

Agents are non-deterministic by design. Same input, same prompt, same tools, ten different tool call sequences. Most of them work. A few are slow or expensive. One or two are subtly wrong in ways that pass a basic output check. You can't see any of that from a single run, and you can't fix it by reading one transcript.

Traditional software handles non-determinism by eliminating it. With agents, you can't, because the non-determinism is the feature. The model picking the path is the whole point.

So the job shifts. You're not trying to make the agent deterministic, you're trying to constrain the distribution of behaviors to the ones you'd actually ship. That starts with being able to see the distribution, which most teams cannot.
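
A minimal sketch of what "seeing the distribution" can mean in practice, assuming you have already collected the ordered tool names from several runs of the same input. The `runs` data and tool names below are made up for illustration.

```python
from collections import Counter

# Hypothetical data: each run is the ordered list of tool names the agent
# called for the same input. In practice this would come from your traces.
runs = [
    ["search", "read", "answer"],
    ["search", "read", "answer"],
    ["search", "search", "read", "answer"],
    ["read", "answer"],
    ["search", "read", "answer"],
]


def trajectory_distribution(runs):
    """Return (trajectory, share of runs) pairs, most common first."""
    counts = Counter(tuple(run) for run in runs)
    total = len(runs)
    return [(traj, n / total) for traj, n in counts.most_common()]


for traj, share in trajectory_distribution(runs):
    print(f"{share:.0%}  {' -> '.join(traj)}")
```

Even a table this crude answers the first question: how many distinct paths exist for one input, and how much of the traffic takes each.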

Tracing With OTel

The highest-leverage thing you can do is instrument the agent with OpenTelemetry. Every tool call, every model response, every retry. The GenAI semantic conventions are maturing and most major observability vendors support them in some form, so the investment is in the spans you emit, not in any one vendor.

What traces give you, in aggregate, is the shape of what your agent actually does. A few patterns worth looking for:

→ Tool call thrash. The same tool called repeatedly in a single trajectory. Usually a sign the model didn't trust the response or couldn't parse it.

→ Order inversions. The agent acts before it verifies, or writes before it reads. The prompt didn't make the dependency clear, or the tools should be combined.

→ Tool confusion. Two tools with overlapping descriptions getting picked roughly 50/50. This is a tool description problem, not a model problem. The model is doing its best with the menu it was given.

→ Wrong-tool-for-the-job patterns. Your agent reaches for the most general tool when a more specific one exists. A common version: an agent given both raw SQL and a semantic layer will sometimes write SQL when it should have called the semantic layer, and the SQL is plausible but subtly wrong. You only catch this by comparing trajectories across many runs.

None of these are visible in a single log. They show up when you aggregate.
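
As a rough sketch, three of the patterns above can be measured once trajectories are reduced to ordered lists of tool names. Everything here is an assumption for illustration: the tool names, the thresholds, and the idea that your trace export can be flattened this way.

```python
from collections import Counter


def thrash_rate(trajectories, max_repeats=3):
    """Share of trajectories where any one tool is called more than max_repeats times."""
    flagged = sum(
        1 for t in trajectories if t and max(Counter(t).values()) > max_repeats
    )
    return flagged / len(trajectories)


def inversion_rate(trajectories, first, then):
    """Share of trajectories where `then` is called before any call to `first`."""
    def inverted(t):
        if then not in t:
            return False
        return first not in t[: t.index(then)]

    return sum(inverted(t) for t in trajectories) / len(trajectories)


def split_between(trajectories, tool_a, tool_b):
    """Of the runs that use either tool, the share that reach for each one first."""
    picks = Counter()
    for t in trajectories:
        for name in t:
            if name in (tool_a, tool_b):
                picks[name] += 1
                break
    total = sum(picks.values())
    if not total:
        return {}
    return {name: picks[name] / total for name in (tool_a, tool_b)}


# Hypothetical trajectories for the same task.
trajectories = [
    ["read_record", "write_record"],
    ["write_record", "read_record"],  # writes before it reads
    ["search_docs", "search_docs", "search_docs", "search_docs", "read_record"],
    ["lookup_kb", "read_record", "write_record"],
    ["search_docs", "read_record"],
    ["lookup_kb", "read_record"],
]

print(thrash_rate(trajectories))
print(inversion_rate(trajectories, first="read_record", then="write_record"))
print(split_between(trajectories, "search_docs", "lookup_kb"))
```

A split near 50/50 between two tools is the tool-confusion signal; the thresholds are judgment calls you would tune against your own traffic.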

Constraining the Distribution

A few levers help narrow agent behavior without trying to eliminate the randomness.

→ Structured outputs. If the final answer is free-form prose, you're stuck evaluating prose. If it's a JSON schema with required fields, half your eval becomes schema validation. Same idea works on intermediate steps: have the agent emit a plan in structured form before it acts, and you've created a checkpoint you can inspect.

→ Tool design. A tool with a narrow surface, validated inputs, and a predictable return shape narrows model behavior more than prompt tuning does. The semantic-layer-versus-raw-SQL pattern is the clearest version of this. Give an agent raw SQL access and it has to invent joins, remember which table holds which metric, and not hallucinate columns. Give it a semantic layer that already encodes the metric definitions, and (assuming the layer covers the question being asked) the job collapses from "write SQL" to "pick a metric and a dimension." The first is open-ended generation. The second is closer to multiple choice. That's not a prompt engineering win, it's a problem-shape win.

→ Stop conditions. Most agent failures in production are loops. Hard caps on tool calls, retries, and token budgets are guardrails the model can't ignore.

→ Scope. The narrower the task, the tighter the distribution. A general-purpose agent is a probability cloud. A "summarize this ticket into these fields" agent is a tight distribution with predictable failure modes.
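
Two of these levers are easy to sketch in code: a narrow tool with validated inputs, and a hard cap enforced by the harness rather than the prompt. The metric names, dimension names, and default limits below are hypothetical, and the tool returns an empty result instead of querying a real semantic layer.

```python
# Hypothetical semantic layer menus; a real layer would define these.
METRICS = {"revenue", "active_users"}
DIMENSIONS = {"region", "month"}


def query_metric(metric, dimension):
    """A narrow tool: the model picks from fixed menus instead of writing SQL."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric {metric!r}; choose from {sorted(METRICS)}")
    if dimension not in DIMENSIONS:
        raise ValueError(
            f"unknown dimension {dimension!r}; choose from {sorted(DIMENSIONS)}"
        )
    # A real implementation would query the semantic layer here.
    return {"metric": metric, "dimension": dimension, "rows": []}


class BudgetExceeded(RuntimeError):
    """Raised by the harness, not the model, once a hard cap is passed."""


class Budget:
    """Counts tool calls and tokens for one agent run and enforces caps."""

    def __init__(self, max_tool_calls=20, max_tokens=50_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tokens=0):
        """Record one tool call and its token cost; raise once a cap is passed."""
        self.tool_calls += 1
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool call cap of {self.max_tool_calls} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap of {self.max_tokens} exceeded")
```

The agent loop would call `budget.charge(...)` before dispatching each tool call. A rejected `query_metric` call comes back to the model as an error listing the valid choices, which is the "multiple choice" shape in practice.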

Evals Without Ground Truth

If you change a prompt without an eval set, you have no real way to know what you broke. The only way to iterate seriously is to have a set of cases that runs every time something changes. Fifty cases minimum, a few hundred is better. Each case has an input, optional setup state, and assertions.

Three kinds of assertions tend to be useful:

1. Deterministic checks. Did the output have the right JSON fields? Did the agent stay under 20 tool calls? Did it avoid calling delete_record? Cheap, fast, catches the obvious regressions.

2. LLM-as-judge scores. For anything qualitative (did the response address the question, is it consistent with the retrieved docs), judges are the only practical option. They're noisier than deterministic checks. Run them with a different model than the agent uses, and run them multiple times so you can measure variance.

3. Trajectory assertions. Did the agent take a reasonable path, not just produce a reasonable answer? This is where you encode the patterns you found in traces. A correct response arrived at via 30 redundant tool calls is still a bug.
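
Here is one way the three kinds of assertions might look for a single case. This is a sketch under assumptions: the required fields, the repeat threshold, and the judge scores are invented, and the judge call itself is left out, so `judge_summary` only aggregates scores you have already collected from repeated judge runs.

```python
import statistics
from collections import Counter

REQUIRED_FIELDS = {"summary", "priority"}  # hypothetical output schema
FORBIDDEN_TOOLS = {"delete_record"}
MAX_TOOL_CALLS = 20


def deterministic_checks(output, tool_calls):
    """Cheap checks on the final output and the list of tool names called."""
    failures = []
    missing = REQUIRED_FIELDS - set(output)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if len(tool_calls) > MAX_TOOL_CALLS:
        failures.append(f"{len(tool_calls)} tool calls exceeds cap of {MAX_TOOL_CALLS}")
    forbidden = FORBIDDEN_TOOLS & set(tool_calls)
    if forbidden:
        failures.append(f"forbidden tools called: {sorted(forbidden)}")
    return failures


def trajectory_checks(tool_calls, max_repeats=3):
    """Flag any tool called more than max_repeats times in one run."""
    return [
        f"{tool} called {n} times"
        for tool, n in Counter(tool_calls).items()
        if n > max_repeats
    ]


def judge_summary(scores):
    """Mean and spread of repeated judge scores for one case."""
    return {"mean": statistics.mean(scores), "stdev": statistics.pstdev(scores)}


output = {"summary": "Customer cannot log in"}
tool_calls = ["read_ticket"] * 5 + ["delete_record"]

print(deterministic_checks(output, tool_calls))
print(trajectory_checks(tool_calls))
print(judge_summary([4, 5, 3, 4]))
```

Returning a list of failure strings rather than a single boolean keeps the report readable when one case fails for several reasons at once.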

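When the eval shifts after a change, look at which cases regressed, not just at the aggregate score. Net-positive isn't enough: a change that fixes five cases and breaks two still broke two, and those regressions are where the next iteration comes from.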
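
A small sketch of a per-case comparison that treats any regression as blocking, even when the net change is positive. The case ids and results are hypothetical, and results are simplified to pass/fail booleans keyed by case id.

```python
def compare_runs(baseline, candidate):
    """Compare per-case pass/fail results from two eval runs."""
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(c for c in shared if baseline[c] and not candidate[c])
    fixes = sorted(c for c in shared if not baseline[c] and candidate[c])
    return {
        "regressions": regressions,
        "fixes": fixes,
        "net": len(fixes) - len(regressions),
        # Any regression blocks, regardless of how many cases were fixed.
        "deployable": not regressions,
    }


baseline = {"case-01": True, "case-02": False, "case-03": False, "case-04": True}
candidate = {"case-01": True, "case-02": True, "case-03": True, "case-04": False}

report = compare_runs(baseline, candidate)
print(report)
```

In this example the candidate fixes two cases and breaks one, so the net is positive and the gate still says no.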

Closing the Loop

The eval set isn't something you write once. Production traces have to feed it. Sample traces from prod, both at random and via heuristics (high tool count, low judge score, user thumbs-down), triage them, and add the interesting ones to the eval as new cases or regression tests.
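
The sampling step could look something like this. It is a sketch, assuming each trace has been summarized into a dict with a tool call count, an optional judge score between 0 and 1, and a thumbs-down flag; those field names and the thresholds are hypothetical.

```python
import random


def sample_for_triage(traces, random_n=2, max_tool_calls=15,
                      min_judge_score=0.5, seed=None):
    """Pick traces for human triage: heuristic hits plus a random sample of the rest."""
    flagged, rest = [], []
    for t in traces:
        reasons = []
        if t["tool_calls"] > max_tool_calls:
            reasons.append("high tool count")
        score = t.get("judge_score")
        if score is not None and score < min_judge_score:
            reasons.append("low judge score")
        if t.get("thumbs_down"):
            reasons.append("user thumbs-down")
        if reasons:
            flagged.append({"trace_id": t["trace_id"], "reasons": reasons})
        else:
            rest.append(t)
    # The random sample catches failures the heuristics don't know about yet.
    rng = random.Random(seed)
    for t in rng.sample(rest, min(random_n, len(rest))):
        flagged.append({"trace_id": t["trace_id"], "reasons": ["random sample"]})
    return flagged


# Hypothetical trace summaries.
traces = [
    {"trace_id": "a", "tool_calls": 3, "judge_score": 0.9, "thumbs_down": False},
    {"trace_id": "b", "tool_calls": 22, "judge_score": 0.8, "thumbs_down": False},
    {"trace_id": "c", "tool_calls": 4, "judge_score": 0.2, "thumbs_down": True},
    {"trace_id": "d", "tool_calls": 5, "judge_score": None, "thumbs_down": False},
    {"trace_id": "e", "tool_calls": 2, "judge_score": 0.7, "thumbs_down": False},
]

for item in sample_for_triage(traces, random_n=1, seed=0):
    print(item)
```

Recording the reason alongside each trace id makes triage faster, since the reviewer knows why a trace was surfaced before opening it.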

That's the loop:

→ Production emits OTel traces. A sampling job surfaces interesting ones. Those become eval cases. Eval runs on every prompt, tool, or model change. Regressions block deploy. New failures in prod become new cases.

Vendors will sell you pieces of this (Langfuse, Braintrust, LangSmith, and whichever startup pitched you this week), and the tools are useful. What they're selling is plumbing. Deciding what to assert and what counts as a regression is still your work.

Against My Own Argument

The strongest counterpoint to all of this is cost. Instrumenting OTel, building a sampling pipeline, maintaining an eval harness, triaging traces, keeping cases current as the product evolves. That's real engineering time. For an early-stage team with three users and a clear product hypothesis, that investment can easily exceed the cost of the bugs it would catch. An agent that works 80% of the time and ships next week is often better than a 95% agent that ships in a quarter.

The honest answer is that the loop scales with the stakes. If your agent is a demo, skip it. If it's processing financial transactions, run more cases than you think you need. Most production agents sit between those poles, and the right amount of investment is more than teams typically do and less than the full pipeline I described.

The other counterpoint, which I'd push back on, is that the models keep getting better and most of this scaffolding becomes unnecessary. Frontier models are clearly improving at planning and tool use, and that trend will continue. But surface area expands faster than capability. More tools, agents calling other agents, longer-horizon tasks. The teams who do well in that world will be the ones who can see what their systems are doing, regardless of how good the base model is.

The prompt is the easy part. The loop around it is the part that compounds.