Capabilities, not conversations
We start by listing the actions an agent should be able to take — create a ticket, draft a refund, open a pull request, reschedule a shipment — and treat each one as a typed tool with explicit inputs, permissions, and side effects. The conversation is just the interface; the tools are the product.
Framing it this way forces the useful questions early. Who is allowed to trigger this action? What does 'undo' look like? Which steps need a human in the loop before anything is committed? Answer those and the agent stops being a demo and starts being software.
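To make that concrete, here is a minimal Python sketch of one action modelled as a typed tool. Every name in it (`Tool`, `CreateTicketInput`, `invoke`, the `support_agent` role, the in-memory `TICKETS` store) is a hypothetical chosen for illustration, not the API of any particular agent framework.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, FrozenSet, Optional


class SideEffect(Enum):
    NONE = "none"                  # read-only
    REVERSIBLE = "reversible"      # can be undone after the fact
    IRREVERSIBLE = "irreversible"  # cannot be undone once committed


@dataclass(frozen=True)
class CreateTicketInput:
    """Explicit, validated inputs for the hypothetical create_ticket tool."""
    title: str
    body: str
    priority: str = "normal"

    def __post_init__(self) -> None:
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if self.priority not in {"low", "normal", "high"}:
            raise ValueError(f"unknown priority: {self.priority!r}")


@dataclass(frozen=True)
class Tool:
    name: str
    input_type: type                     # who defines the shape of the arguments
    allowed_roles: FrozenSet[str]        # who may trigger the action
    side_effect: SideEffect              # what the action does to the world
    needs_human_approval: bool           # must a person sign off first?
    run: Callable[[object], dict]
    undo: Optional[Callable[[dict], None]] = None  # what 'undo' looks like


# In-memory stand-in for a real ticketing system (assumption for the sketch).
TICKETS: Dict[int, CreateTicketInput] = {}
_next_id = itertools.count(1)


def _create_ticket(args: CreateTicketInput) -> dict:
    ticket_id = next(_next_id)
    TICKETS[ticket_id] = args
    return {"ticket_id": ticket_id}


def _delete_ticket(result: dict) -> None:
    TICKETS.pop(result["ticket_id"], None)


CREATE_TICKET = Tool(
    name="create_ticket",
    input_type=CreateTicketInput,
    allowed_roles=frozenset({"support_agent"}),
    side_effect=SideEffect.REVERSIBLE,
    needs_human_approval=False,
    run=_create_ticket,
    undo=_delete_ticket,
)


def invoke(tool: Tool, role: str, raw_args: dict, approved: bool = False) -> dict:
    """Check permissions and approval, parse the arguments, then run the tool."""
    if role not in tool.allowed_roles:
        raise PermissionError(f"{role!r} may not call {tool.name}")
    if tool.needs_human_approval and not approved:
        raise PermissionError(f"{tool.name} requires human approval")
    args = tool.input_type(**raw_args)  # raises on missing or invalid fields
    return tool.run(args)
```

With this shape, the three questions above each map to a field: `allowed_roles` answers who may trigger the action, `undo` answers what reversal looks like, and `needs_human_approval` marks the steps that wait for a person.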
Guardrails are a design surface, not an afterthought
Every action runs behind validation, rate limits, and an audit trail. High-impact or irreversible tools require explicit confirmation; low-risk, reversible ones run without it. We log the model's reasoning alongside the action so a human can always reconstruct why something happened.
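One way these guardrails might fit together is sketched below in Python. It is an assumption-laden illustration: `GuardedTool`, `RateLimiter`, `execute`, and the list-backed `AUDIT_LOG` are hypothetical names, and a production system would write the audit trail to durable storage rather than a list in memory.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


class RateLimiter:
    """Sliding window: at most max_calls within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._calls: Deque[float] = deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


@dataclass
class GuardedTool:
    name: str
    run: Callable[[dict], dict]
    validate: Callable[[dict], None]  # raises ValueError on bad input
    high_impact: bool                 # True means a human must confirm
    limiter: RateLimiter


# Hypothetical in-memory audit trail; one entry per attempted action.
AUDIT_LOG: List[dict] = []


def execute(tool: GuardedTool, args: dict, reasoning: str,
            confirmed: bool = False) -> dict:
    """Run a tool behind validation, confirmation, and rate limiting."""
    entry = {"tool": tool.name, "args": args, "reasoning": reasoning,
             "timestamp": time.time(), "status": "pending"}
    AUDIT_LOG.append(entry)  # logged even when the action is refused

    try:
        tool.validate(args)
    except ValueError as exc:
        entry["status"] = "rejected"
        return {"status": "rejected", "error": str(exc)}

    if tool.high_impact and not confirmed:
        entry["status"] = "awaiting_confirmation"
        return {"status": "awaiting_confirmation"}

    if not tool.limiter.allow():
        entry["status"] = "rate_limited"
        return {"status": "rate_limited"}

    result = tool.run(args)
    entry["status"] = "executed"
    entry["result"] = result
    return {"status": "executed", "result": result}
```

The ordering is a design choice: refused attempts are still recorded with the model's stated reasoning, and only calls that actually reach `run` count against the rate limit.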
Evals come with the build, not after it. We assemble a suite of real scenarios — the messy, ambiguous ones — and run them on every prompt or model change so quality is measured, not vibed.
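A scenario suite like this can start very small. The Python sketch below assumes an agent can be treated as a function from a user message to the name of the tool it picks; `Scenario`, `run_evals`, `gate`, and the keyword-matching `keyword_agent` are hypothetical stand-ins, and real evals would usually score arguments and outcomes as well as tool choice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Scenario:
    name: str
    user_message: str
    expected_tool: str  # the tool a correct agent should choose


@dataclass
class EvalReport:
    passed: int
    total: int
    failures: List[str]

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def run_evals(agent: Callable[[str], str], scenarios: List[Scenario]) -> EvalReport:
    """Run every scenario through the agent and collect the ones it gets wrong."""
    failures = [s.name for s in scenarios
                if agent(s.user_message) != s.expected_tool]
    return EvalReport(passed=len(scenarios) - len(failures),
                      total=len(scenarios),
                      failures=failures)


def gate(report: EvalReport, min_pass_rate: float) -> bool:
    """Return True if a prompt or model change is good enough to ship."""
    return report.pass_rate >= min_pass_rate


def keyword_agent(message: str) -> str:
    """Toy stand-in for a real model call, used only to exercise the harness."""
    text = message.lower()
    if "refund" in text:
        return "draft_refund"
    if "delivery" in text:
        return "reschedule_shipment"
    return "create_ticket"
```

Wiring `gate` into CI so that a prompt or model change fails the build when the pass rate drops is one way to keep quality measured on every change; the threshold itself is a judgment call for each team.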
Where it pays off
The wins are rarely the flashy ones. They look like a support queue that drains itself overnight, an ops workflow that no longer needs a spreadsheet, or a copilot that drafts the boring 80% so your team owns the important 20%.
Built well, an agent is just a very fast, very tireless colleague that follows your rules exactly. Built badly, it's a liability. The difference is entirely in the engineering around the model.