Agentic AI architecture defines how your AI system thinks, acts, and recovers from mistakes. That sentence sounds clean. But in production, the implications are anything but simple.
Welcome to our 5-part deep dive into Agentic AI. Over this series, we are stripping away the hype to focus on the core engineering principles required to build systems that actually work at scale. From individual reasoning loops to the safety layers missing in most enterprise deployments, you will learn the fundamental components of professional-grade AI agents.
This first article focuses on single-agent architecture: one LLM, one reasoning loop, one mission. Before you consider multi-agent systems or hierarchical orchestration, you must get the single-agent foundation right.
Architecture selection determines your system's reliability profile, cost behavior under load, latency at scale, failure recovery strategy, and compliance posture — long before you write a single prompt. Most teams treat it as a late-stage implementation detail. It isn't. It's a foundational engineering decision.
Imagine you're building an AI agent for a financial services company. The agent's job is to answer customer questions like "Why was I charged a late fee last month?" That sounds simple. But answering it correctly requires the agent to pull transaction history, interpret billing rules, cross-reference account status, and communicate a regulated explanation. All without hallucinating a policy that doesn't exist.
The architecture you choose for that agent determines whether it reasons through the answer iteratively, executes a precise computation, or follows a controlled sequence of predefined steps. Each choice carries a completely different set of operational consequences.
Three patterns dominate production single-agent systems today: ReAct, Code-Act, and Tool-Use loops. They didn't emerge simultaneously or arbitrarily. Each was a response to the shortcomings of what came before. Understanding why each pattern exists is as important as understanding what it does.
Before we dive in, if you'd prefer to watch rather than read, we've put together a video walkthrough — you can check it out here.
Before comparing patterns, you need a shared language. Five dimensions define the architectural decision space. Every tradeoff maps back to them. Throughout this article, every pattern will be explicitly evaluated against all five.
Autonomy is the agent's freedom to determine its own reasoning path. Controllability is your system's ability to constrain, audit, and override that path.
These are inversely correlated. More autonomy enables exploratory reasoning across unknown territory. More controllability enables predictable, auditable behavior, at the cost of flexibility. In regulated industries, controllability isn't a preference. It's a compliance requirement. Guardrails, policy engines, tool permission controls, and human-in-the-loop checkpoints are architectural mandates.
Each reasoning loop carries a latency cost. ReAct iterates: every cycle adds a full LLM call. Tool-Use is more bounded: structured function calls complete in one round-trip. Code-Act sits in the middle: one generation pass followed by variable sandbox execution time.
The engineering question isn't average latency. It's the latency variance. A system with a 2-second average and a 15-second p95 tail is a reliability problem regardless of its mean.
Token usage is the primary cost driver in ReAct — reasoning traces, observations, and re-reasoning accumulate per loop. In high-traffic systems with long reasoning chains, this becomes uncontrolled spending. Code-Act shifts cost from tokens to compute — you trade LLM spend for execution infrastructure. Tool-Use offers the most predictable cost profile: invocations are schema-bounded, and there is no open-ended token accumulation.
This is about infrastructure complexity, not conceptual complexity. ReAct needs only a loop controller and a logging layer. Code-Act requires sandboxed execution environments, resource limiters, timeout managers, and security filters. Tool-Use requires a maintained tool registry with versioned schemas, permission controls, and result normalization. The complexity is front-loaded into design in Tool-Use, distributed across operations in Code-Act, and relatively minimal in ReAct, until it isn't.
Enterprise adoption hinges on this dimension more than any other. Determinism means: given the same input, the system produces a predictable, auditable, replay-capable output. Free-form architectures like ReAct can produce different reasoning paths for identical inputs. That's valuable for exploration. It's a liability for audit trails, compliance reviews, and production debugging.
Hold these five dimensions in mind as you read. They are the lens through which every architectural decision should be evaluated.
Early LLM deployments were stateless: you gave the model a question, and it provided an answer. If the answer required retrieving external information, such as checking a database, calling an API, or reading a document, you had to do that retrieval manually, stitch the results together, and re-prompt.
ReAct (Reasoning + Acting) was a response to this limitation. It gave the LLM the ability to decide, mid-reasoning, what information it needed and go get it. The insight was simple but powerful: treat the reasoning process itself as a loop, not a single pass.
At each cycle, the LLM generates a reasoning step, its internal chain of thought made explicit. It then selects and invokes a tool, receives an observation from that tool, and uses that observation to inform the next reasoning step. The loop terminates when the LLM determines the task is complete, or when an external controller forces termination.
A concrete example: A user asks a research copilot: "What are the main risks of deploying transformer models in real-time trading systems?"
The agent doesn't know the answer in one pass. Its first reasoning step might be: "I need to find recent literature on transformer latency in financial systems." It calls a search tool. The observation returns three papers. Its next reasoning step: "One paper mentions inference latency variance — I should look specifically at p99 latency benchmarks." It calls the tool again with a refined query. After two more cycles, it has enough material to synthesize a grounded answer.
That multi-hop, self-directed information gathering is exactly what ReAct is designed for. No other architecture handles it as naturally.
The loop controller answers three questions at every cycle: Have we hit the iteration cap? Have we exceeded the token budget? Is the last observation a valid basis for continued reasoning? Without explicit answers to all three at runtime, the loop has no principled termination condition.
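The three runtime checks above can be sketched as a minimal loop controller. This is a hypothetical Python sketch, not any specific framework's API; the caps and the `is_valid_observation` heuristic are illustrative defaults.

```python
from dataclasses import dataclass

@dataclass
class LoopController:
    """Answers the three termination questions at every ReAct cycle."""
    max_iterations: int = 8        # a realistic production cap, not a ceiling
    token_budget: int = 20_000     # per-query token budget

    def should_continue(self, iteration: int, tokens_used: int, observation: str) -> bool:
        if iteration >= self.max_iterations:            # 1. iteration cap hit?
            return False
        if tokens_used >= self.token_budget:            # 2. token budget exceeded?
            return False
        if not self.is_valid_observation(observation):  # 3. observation usable?
            return False
        return True

    @staticmethod
    def is_valid_observation(observation: str) -> bool:
        # Minimal quality check: non-empty and not an error payload.
        return bool(observation.strip()) and not observation.startswith("ERROR")
```

The point of the sketch is that all three answers exist as explicit code paths; a loop whose only stop condition is "the model said it's done" has no principled termination.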
| Dimension | ReAct Position | Engineering Implication |
|---|---|---|
| Autonomy vs. Controllability | High autonomy | Requires loop guardrails and governance layering |
| Latency vs. Thoroughness | Variable, compounding | p95 tail latency grows with iteration count |
| Cost vs. Quality | Token-heavy | Cost is correlated with reasoning depth, not just traffic |
| Simplicity vs. Specialization | Low infra complexity | Operational risk is hidden in loop behavior |
| Deterministic vs. Free-Form | Low determinism | Audit trails require explicit reasoning trace logging |
A customer support platform deployed a ReAct agent to handle billing inquiries. In testing, it worked well — 3 to 4 loop iterations per query, clean answers.
In production, a subset of queries came in phrased ambiguously: "Why did my bill change?" — without account context. The agent's first reasoning step was to retrieve account history. The result was incomplete without a date range. Its next step was to query for recent changes. That result referenced a rate adjustment. The agent then queried for the rate adjustment policy. That document referenced an earlier amendment. The agent chased the amendment.
Forty-two tool calls. Ninety seconds. Eighteen dollars in API cost. One unanswered query. No circuit breaker had fired because each individual call succeeded. The loop controller had an iteration cap of 50 — never reached in testing.
This is what token explosion and latency compounding look like in practice. The reasoning was coherent. The failure was operational: no cost guardrail per query, no observation quality check, and an iteration cap set to a ceiling rather than a realistic production limit.
Other ReAct failure modes to instrument for:

- Observation poisoning: a malformed or low-quality tool result that the next reasoning step treats as ground truth, sending the loop down a coherent but wrong path.
- Retry storms: a single failed tool call that triggers uncontrolled retries, multiplying latency and cost simultaneously.
- Silent context truncation: long reasoning traces that exhaust the token budget and drop earlier steps without signaling degradation.
Production insight: Most ReAct failures are operational, not conceptual. The reasoning logic is often sound. The system fails because the loop wasn't constrained, the observations weren't validated, or the instrumentation wasn't in place to catch drift before it compounded.
ReAct is effective for reasoning through ambiguous, open-ended tasks. But for structured computation, such as calculating amortization schedules, running statistical analysis, or transforming datasets, it is unnecessarily expensive and error-prone.
A ReAct agent asked to compute the compound interest on a loan over 10 years might reason through the formula in natural language, call a calculator tool, observe the result, re-reason to verify it, and call the tool again with adjusted parameters. Three loops. Multiple LLM calls. Token accumulation. For a task that is entirely deterministic.
Code-Act emerged as a response to this inefficiency. The insight: if the task is computational, don't reason through it in natural language loops — write code and execute it.
The LLM generates executable code, such as Python, SQL, or shell commands, based on the user's request. That code runs in a sandboxed environment. The result returns to the orchestrator as structured output. There is no iterative reasoning loop. The LLM generates once; the code executes deterministically.
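A minimal sketch of the execution side, assuming a subprocess-based sandbox with a hard timeout. A real deployment layers on resource limits, network isolation, and filesystem restrictions; this shows only the generate-once, execute-deterministically skeleton.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: int = 10) -> dict:
    """Execute generated code in a separate process with a hard timeout.

    Assumption for this sketch: a fresh interpreter in isolated mode is
    'sandboxed enough' for illustration. Production sandboxes are not
    this simple.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env/site
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timeout after {timeout_s}s"}
    finally:
        os.unlink(path)
```

The timeout is the non-negotiable part: without it, one piece of generated code with an infinite loop holds an executor slot indefinitely.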
A concrete example: A financial analyst asks: "What was the month-over-month revenue variance across all product lines in Q3?"
A ReAct agent would query each product line separately, observe results, reason about differences, and potentially re-query for clarification. A Code-Act agent writes a Pandas script that loads the dataset, groups by product line and month, calculates variance, and returns a structured table — in a single execution pass.
Same answer. Fraction of the token cost. Fully deterministic output. Fully auditable code artifact.
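Assuming a flat dataset with `product_line`, `month`, and `revenue` columns (hypothetical names for this sketch), the generated script might look like this:

```python
import pandas as pd

def mom_revenue_variance(df: pd.DataFrame) -> pd.DataFrame:
    """Month-over-month revenue change per product line, in one execution pass.

    Expects columns: product_line, month (sortable strings), revenue.
    """
    monthly = (
        df.groupby(["product_line", "month"], as_index=False)["revenue"].sum()
          .sort_values(["product_line", "month"])
    )
    # diff() within each product line gives the month-over-month change.
    monthly["mom_variance"] = monthly.groupby("product_line")["revenue"].diff()
    return monthly
```

The entire answer is one deterministic pass over the data, and the script itself is the audit artifact.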
The pre-execution security layer is as important as the sandbox itself. A prompt injection that escapes into code generation — and then executes in a poorly isolated environment — is a severe security incident. Static analysis before sandbox entry is not optional.
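A minimal static-analysis gate, as a sketch: reject generated code containing blocklisted imports or calls before it ever reaches the sandbox. The blocklists here are illustrative, not exhaustive.

```python
import ast

# Illustrative blocklists; a production filter would be policy-driven.
FORBIDDEN_IMPORTS = {"os", "subprocess", "socket", "shutil"}
FORBIDDEN_CALLS = {"eval", "exec", "open", "__import__"}

def static_check(code: str) -> list:
    """Return a list of violations; empty means the code may enter the sandbox."""
    violations = []
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append(f"forbidden import: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"forbidden call: {node.func.id}")
    return violations
```

AST-level checks catch what string matching misses (aliased imports, nested calls), but they are still a gate, not a sandbox; defense in depth means running both.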
| Dimension | Code-Act Position | Engineering Implication |
|---|---|---|
| Autonomy vs. Controllability | Medium | Code generation is flexible; sandbox enforces hard limits |
| Latency vs. Thoroughness | Medium, execution-bounded | Predictable once code quality is controlled |
| Cost vs. Quality | Compute-heavy | Token savings shift to infrastructure spend |
| Simplicity vs. Specialization | High infra complexity | Sandbox ops is a non-trivial operational surface |
| Deterministic vs. Free-Form | Medium-high | Code output is deterministic; generation is not |
A data team deployed a Code-Act agent for ad hoc analytics. A data scientist asked: "Clean the dataset and remove outliers before running the regression."
The LLM generated a cleaning script that dropped rows where any column value exceeded 3 standard deviations. This was statistically reasonable. However, the dataset contained intentionally high-value records, large enterprise contracts, that were legitimate outliers by business logic. The code ran cleanly, the regression completed, and the result validated numerically.
The business insight was wrong. The agent had no way to distinguish statistical outliers from significant data points without the domain context that wasn't given. The code was correct. The computation was precise. The answer was misleading.
This is the Code-Act failure mode that is hardest to catch: not a crash, not an injection, a logically sound execution of a subtly incorrect instruction. Result validation must include business logic checks, not just schema checks.
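To make that concrete, here is a hedged sketch of both versions: the statistically reasonable script the agent might have generated, and the variant with the domain guard that was missing. The `enterprise` flag stands in for business context the agent never had.

```python
import pandas as pd

def drop_statistical_outliers(df: pd.DataFrame, col: str, z: float = 3.0) -> pd.DataFrame:
    """What the agent generated: drop rows beyond z standard deviations."""
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] - mean).abs() <= z * std]

def drop_outliers_with_domain_guard(df, col, z=3.0, keep_mask=None):
    """What was needed: exempt business-legitimate records
    (e.g. enterprise contracts) from the statistical filter."""
    stat_ok = (df[col] - df[col].mean()).abs() <= z * df[col].std()
    if keep_mask is not None:
        return df[stat_ok | keep_mask]
    return df[stat_ok]
```

Both functions run cleanly and return valid dataframes; only the second preserves the records the business actually cares about, which is why the check has to live in validation logic, not in the code's exit status.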
Other Code-Act failure modes:

- Prompt injection escaping into generated code, which is exactly what the pre-execution security layer exists to catch.
- Resource exhaustion: generated code with unbounded loops or runaway memory use that the sandbox's limiters and timeout managers must cut off.
- Non-reproducible generation: the same request producing structurally different code across invocations, complicating audits and debugging.
Production insight: Code-Act systems behave more like controlled compute pipelines than conversational agents. Treat sandbox operations, resource management, and execution audit trails with the same rigor you'd apply to any production data pipeline.
Both ReAct and Code-Act give the agent significant latitude. ReAct chooses its own reasoning path. Code-Act writes its own logic. For many enterprise contexts — regulated workflows, customer-facing systems, audit-sensitive processes — this latitude is a liability.
Tool-Use emerged as a response to a specific enterprise requirement: predictable, auditable agent behavior that can be governed, versioned, and restricted without constraining the underlying model's intelligence. The agent is smart. But what it can do is strictly defined.
The LLM operates in function-calling mode. It selects from a predefined registry of tools — each with a strictly typed schema — and provides structured arguments. It cannot act outside the registry. It cannot write code. It cannot chain open-ended reasoning across arbitrary observations.
A concrete example: A compliance agent at a bank is asked, "Flag all transactions over $10,000 from accounts opened in the last 30 days."
The tool registry includes:

- `query_transactions(filter, date_range)`
- `check_account_age(account_id)`
- `flag_transaction(transaction_id, reason)`
The agent calls each in sequence, with validated arguments, receives structured responses, and flags qualifying records. Every invocation is logged with its exact input and output. The audit trail is complete and reproducible.
No open-ended reasoning. No code generation. No surprises. That's exactly what this context requires.
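The registry from the example above might be declared in the JSON-schema style that most function-calling APIs use. This is an illustrative sketch; everything beyond the three tool names is an assumption.

```python
# Hypothetical registry entries in JSON-schema style.
TOOL_REGISTRY = {
    "query_transactions": {
        "description": "Return transactions matching a filter within a date range.",
        "parameters": {
            "type": "object",
            "properties": {
                "filter": {"type": "string"},
                "date_range": {"type": "array", "items": {"type": "string"},
                               "minItems": 2, "maxItems": 2},
            },
            "required": ["filter", "date_range"],
        },
    },
    "check_account_age": {
        "description": "Return the age in days of an account.",
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    },
    "flag_transaction": {
        "description": "Flag a transaction for compliance review.",
        "parameters": {
            "type": "object",
            "properties": {"transaction_id": {"type": "string"},
                           "reason": {"type": "string"}},
            "required": ["transaction_id", "reason"],
        },
    },
}

def validate_call(tool: str, args: dict) -> bool:
    """Minimal gate: tool exists in the registry and required args are present."""
    spec = TOOL_REGISTRY.get(tool)
    if spec is None:
        return False
    return all(k in args for k in spec["parameters"]["required"])
```

The schemas are the contract: every invocation the agent proposes is validated against them before anything executes, which is what makes the audit trail reproducible.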
The policy engine and circuit breaker are what separate a production Tool-Use system from a prototype. The policy engine enforces what tools each user role or tenant can invoke, under what conditions, and at what volume. The circuit breaker prevents cascading failures when a downstream service degrades — if the query_transactions tool starts returning 500s, the circuit breaker opens before the agent retries itself into a latency incident.
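A minimal circuit breaker for that scenario might look like this (a sketch; the threshold and reset window are illustrative):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; while open,
    calls are rejected immediately instead of retried into an incident."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            # Half-open: allow a single probe call through.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
```

The agent checks `allow()` before every invocation of a tool and calls `record()` after; once the downstream service starts failing, the breaker converts a retry storm into fast, explicit rejections.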
| Dimension | Tool-Use Position | Engineering Implication |
|---|---|---|
| Autonomy vs. Controllability | Low autonomy, high controllability | Natively governance-friendly; restricted reasoning surface |
| Latency vs. Thoroughness | Predictable, bounded | Consistent p95 behavior; SLA-safe |
| Cost vs. Quality | Controlled, invocation-bounded | No token accumulation; cost correlates with call volume |
| Simplicity vs. Specialization | Medium infra complexity | Design cost is front-loaded into tool schema engineering |
| Deterministic vs. Free-Form | High determinism | Full audit trail; schema-validated inputs and outputs |
An HR platform deployed a Tool-Use agent for employee self-service. The tool registry included:
- `get_payslip`
- `get_leave_balance`
- `submit_leave_request`
- `get_policy_document`
An employee asked: "I took emergency leave last week for a family hospitalization. Does that count against my regular leave balance or is there a separate provision?"
The agent had no `query_emergency_leave_policy` tool. It had `get_policy_document`, but only with a hardcoded list of policy IDs, and emergency leave provisions were in a document the registry didn't index. The agent returned a generic response pointing to the standard leave policy, which didn't address the question.
The failure wasn't a crash. It was a silent capability gap. The tool registry hadn't anticipated the full reasoning surface of the task. The employee escalated to HR anyway.
This is Tool-Use's signature failure: not a logic error, not a runaway loop — an under-designed registry that over-constrains legitimate reasoning. Tool design must be as thorough as the task surface it serves.
Other Tool-Use failure modes:

- Schema drift: tool schemas that fall out of sync with the services behind them, producing calls that validate but misbehave.
- Semantically wrong arguments: the model supplying schema-valid inputs that are nonetheless incorrect for the task, which schema validation alone cannot catch.
- Downstream degradation: a failing backing service that, without a circuit breaker, turns agent retries into a latency incident.
Production insight: Tool-Use architectures resemble microservice orchestration more than autonomous agents. If your team has distributed systems engineering experience, the mental model transfers directly — including the discipline required to maintain, version, and monitor a service registry at scale.
| Dimension | ReAct | Code-Act | Tool-Use |
|---|---|---|---|
| Autonomy | High | Medium | Low |
| Latency Profile | Variable, compounding | Medium, execution-bounded | Predictable, SLA-safe |
| Cost Behavior | Token-heavy, loop-dependent | Compute-heavy, infra-dependent | Controlled, invocation-bounded |
| Determinism | Low | Medium | High |
| Primary Failure Mode | Loop explosion, cascading reasoning errors | Sandbox risk, subtle logic errors | Schema drift, over-constrained registry |
| Infra Complexity | Low | High | Medium |
| Enterprise Readiness | Requires governance layering | Requires sandbox ops maturity | Natively governance-friendly |
| Best Suited For | Open-ended reasoning, research | Structured computation, analytics | Governed workflows, compliance |
No row here cleanly favors one pattern. Every entry reflects a contextual tradeoff. A team choosing ReAct for a research copilot is making a correct decision. A team choosing ReAct for a payment processing workflow is making a dangerous one.
Consider a legal research assistant that must: (1) explore case law across multiple jurisdictions (open-ended, multi-hop), (2) compute statistical frequency of specific rulings (structured computation), and (3) retrieve and cite official court records (governed, auditable).
No single pattern handles all three requirements well. ReAct alone would try to reason through the computation. Code-Act alone would need to generate code for the case law exploration. Tool-Use alone would need a registry large enough to cover every possible research path.
The right design composes them:

- A ReAct module for the open-ended, multi-hop case law exploration.
- A Code-Act module for the statistical computation over rulings.
- A Tool-Use module for governed retrieval and citation of official court records.
The orchestrator classifies intent and routes to the appropriate module. Each module operates within its own architectural constraints. Results converge at a synthesis layer before the final response.
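A sketch of the orchestrator's routing layer, with keyword rules standing in for what would normally be an LLM-based intent classifier (the module names are hypothetical):

```python
# Hypothetical router for the legal research assistant described above.
ROUTES = {
    "explore":  "react_module",     # open-ended, multi-hop case law search
    "compute":  "code_act_module",  # statistical frequency of rulings
    "retrieve": "tool_use_module",  # governed access to official records
}

def classify_intent(query: str) -> str:
    """Toy keyword classifier; production systems classify with a model."""
    q = query.lower()
    if any(w in q for w in ("how often", "frequency", "rate", "percentage")):
        return "compute"
    if any(w in q for w in ("cite", "official record", "court record", "docket")):
        return "retrieve"
    return "explore"

def route(query: str) -> str:
    return ROUTES[classify_intent(query)]
```

The design choice worth noting: each module keeps its own guardrails (loop caps, sandbox limits, registry permissions), so the router only decides where a request goes, never what an individual module is allowed to do.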
Hybrid architectures carry real overhead: three separate systems to instrument, three separate failure modes to monitor, and an orchestration layer that itself becomes a reliability surface. Composition is the right answer when the task surface genuinely requires it — not as a default, and not as a way to avoid committing to a pattern.
The rule of thumb: start with the simplest pattern that handles your core use case. Add a second pattern only when you encounter a specific capability gap that cannot be solved within the first. Add a third only under the same discipline.
The right pattern depends on answering these questions honestly, in order:

1. Is deterministic, auditable, governed behavior a hard requirement? If yes, start from Tool-Use and constrain everything else around it.
2. Is the task fundamentally computational? If yes, start from Code-Act.
3. Does the task require open-ended, multi-hop reasoning that cannot be anticipated in advance? If yes, start from ReAct.
4. Can your team operate the infrastructure the pattern demands: loop governance for ReAct, sandbox operations for Code-Act, registry maintenance for Tool-Use?
5. What latency variance and cost profile can your SLAs and budget actually absorb?
The framework above works for clear cases. Real engineering decisions often land in the grey:
"Our task is computation-heavy, but the computation requirements aren't fully known upfront." This is common in exploratory analytics. Consider Code-Act with a ReAct wrapper — the outer loop reasons about what computation to run; the inner Code-Act module executes it. The hybrid section above covers this pattern.
"We're in a regulated industry, but our task is inherently ambiguous." This is the hardest grey zone. The honest answer is: constrain what you can with Tool-Use, and document the residual reasoning surface that falls outside schema control. Regulators generally care more about what you can demonstrate about your system's behavior than about the architecture label you apply to it.
"We want to start simple and iterate." Start with Tool-Use. It's the easiest to reason about, the easiest to audit, and the easiest to extend. Add ReAct capability when you hit a specific reasoning task that the registry can't handle. Add Code-Act when you hit a specific computation task that neither handles efficiently.
Demos are controlled environments. Production is not. Here is what every architect encounters in the real world, mapped to the pattern where each concern is most acute.
State persistence: Agents often need to remember prior actions across invocations. This is most complex in ReAct, where the reasoning state can grow large and requires careful scoping per session or user. In Tool-Use, state is simpler because each tool call is bounded. In Code-Act, generated code may reference prior outputs that need to be stored between execution passes. Where state is persisted, how it ages, and how it's isolated per tenant are engineering decisions that cannot be deferred.
Observability: ReAct requires loop-level tracing: what did the agent reason at each step, which tool did it choose, and what did the observation contain? Tool-Use requires invocation-level logging: which tool, which arguments, which result, which user. Code-Act requires both execution logs and code artifact storage — you need to know not just that the code ran, but exactly what code ran. Without pattern-specific instrumentation, debugging a production failure is archaeology.
Retry semantics: ReAct is the most dangerous pattern for uncontrolled retries. A single failed tool call that triggers a retry loop multiplies latency and cost simultaneously. All retries must be explicit, bounded by backoff, and counted against the loop controller's iteration budget.
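Those constraints can be sketched as a retry wrapper that charges every attempt, including retries, against the controller's iteration budget (the names are illustrative):

```python
import time

def call_with_bounded_retries(tool_fn, args, budget, max_retries=2, base_delay_s=0.0):
    """Retries are explicit, bounded, backed off, and charged against the
    loop controller's iteration budget (`budget` is a mutable counter dict)."""
    delay = base_delay_s
    last_error = None
    for attempt in range(max_retries + 1):
        if budget["iterations_left"] <= 0:
            raise RuntimeError("iteration budget exhausted")
        budget["iterations_left"] -= 1          # every attempt costs budget
        try:
            return tool_fn(**args)
        except Exception as e:
            last_error = e
            time.sleep(delay)                   # exponential backoff between attempts
            delay = delay * 2 if delay else 0.1
    raise RuntimeError(f"tool failed after {max_retries + 1} attempts: {last_error}")
```

Charging retries against the same budget as reasoning iterations is the key move: a flaky tool can no longer extend the loop's effective length invisibly.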
Idempotent tool design: If an agent retries a Tool-Use call that already partially succeeded — a payment processed, an email sent, a record written — you need idempotency guarantees at the tool layer. This is a distributed systems requirement. Treat every tool with side effects as a service that must handle duplicate calls safely.
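One common implementation is an idempotency key derived from the tool name and arguments: a duplicate call replays the stored result instead of re-executing the side effect. A sketch, with a process-local store standing in for the shared database a production system would use:

```python
import hashlib
import json

class IdempotentToolWrapper:
    """Deduplicates side-effecting tool calls by idempotency key, so a retried
    payment, email, or record write is not executed twice."""

    def __init__(self, tool_fn):
        self.tool_fn = tool_fn
        self._results = {}  # production: a shared, durable store

    @staticmethod
    def key_for(tool_name: str, args: dict) -> str:
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool_name: str, args: dict):
        key = self.key_for(tool_name, args)
        if key in self._results:
            return self._results[key]   # duplicate: replay the stored result
        result = self.tool_fn(**args)
        self._results[key] = result
        return result
```

The caller, including an agent retrying after a timeout, can invoke the tool any number of times with the same arguments and the side effect happens at most once.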
Token budgeting: Most important in ReAct. Each invocation should have an explicit token budget. Approaching the limit should trigger graceful degradation with a partial result and a clear signal, not silent truncation. Uncapped token budgets in high-traffic ReAct systems are the most common source of unexpected infrastructure bills.
Multi-tenant isolation: In SaaS deployments, one tenant's agent workload must not affect another's. Tool registries must be scoped per tenant. Code-Act sandboxes must be isolated per execution. ReAct loop controllers must enforce per-tenant iteration and budget caps. This is infrastructure design, not application design — and it must be in scope from the first production deployment, not retrofitted later.
There is no universally superior pattern.
ReAct, Code-Act, and Tool-Use each represent a coherent, principled response to a specific set of production requirements. ReAct exists because some tasks require open-ended, multi-hop reasoning that can't be anticipated in advance. Code-Act exists because some tasks are fundamentally computational and don't benefit from iterative natural language reasoning. Tool-Use exists because some contexts require deterministic, auditable, governed behavior that the other patterns can't natively provide.
What separates teams that build reliable production agents from those that don't isn't access to better models or frameworks. It's the discipline to treat architecture selection as a systems decision, one made with explicit awareness of cost behavior, failure modes, infrastructure requirements, and governance constraints.
Understanding the patterns is step one. Designing for production, with observability, guardrails, token budgets, idempotent tools, and multi-tenant isolation, is step two. And recognizing that these patterns are composable, not competitive, is step three.
Most importantly, the architecture decision you make before your first production deployment is the hardest one to change after it. Make it deliberately.
Agent architecture selection determines cost behavior, reliability, compliance posture, and scalability long before multi-agent sophistication becomes relevant. Get this right first. The patterns you layer on top of a solid single-agent foundation are powerful. The patterns you layer on top of a brittle one are just complexity.
Now that we’ve established how a single agent "thinks" and "acts," the next question is: how do these agents work together? In our next post, we move from individual reasoning loops to system-wide orchestration.
Join us for Centralized vs. Distributed Intelligence for Multi-Agent AI Systems, where we explore whether your enterprise needs one "Master Orchestrator" or a decentralized swarm of specialists.