AWS Coworker

Lessons in agent architecture, guardrails, and AWS best practices

View the Project on GitHub jason-c-dev/aws-coworker-enterprise

The Architecture Problem: Making the Right Path and the Working Path the Same

Part 4 of I Used Claude Cowork to Build a Claude Code Agent for AWS. Here’s What Broke

By Jason Croucher and Claude

A disclosure: Claude helped me build AWS Coworker and co-authored this blog — that’s rather the point. But the architectural decisions, the moment where Opus reasoned its way around our security model, and the quiet realisation that smarter models make the governance problem worse, not better? That required two of us to see clearly. Claude brought the architecture. I brought the paranoia. Both were necessary.


Introduction

Part 3 ended with a pattern: governance rules work when they’re mechanical and fail when they’re documentary. Enforcement gates hold. Prose instructions get acknowledged and ignored. The flow logs bug, the lawyer metaphor, the agent that fabricated an exception that didn’t exist anywhere in the codebase — all the same lesson. If a rule can be interpreted, it will be interpreted in whatever way serves the user’s preference.

That left us with a question we couldn’t answer with better instructions: what happens when the agent’s execution environment doesn’t enforce the rules either?

Every sub-agent in AWS Coworker — from the cheapest Haiku discovery worker to the most capable Opus orchestrator — was running with the same IAM permissions. The same ~/.aws/config. The same access to every profile. We’d built enforcement gates, approval workflows, and mechanical rules that held under social engineering pressure. Then we’d handed every agent in the system the master key.

Part 2 teased this. “Every sub-agent was using the same admin access key. We’d built enforcement gates and approval workflows, then handed every agent the master key.” It took us another week to understand why that wasn’t just a credentials problem — it was an architecture problem. The master key isn’t a misconfiguration you patch. It’s a consequence of conflating “what the agent is allowed to do” (instructions) with “what the agent is able to do” (infrastructure). Instructions tell the agent what you want. Infrastructure determines what it gets.

But before we could fix the credentials, we had to answer a question that sounds embarrassingly basic for something we’d been building for weeks: what, exactly, is the product?

(A note on “we”: same convention as Parts 1, 2, and 3 — that’s me and Claude, working together in Claude Cowork. I brought thirty years of engineering experience and the hubris to ignore it. Claude brought the tireless ability to let me.)


1. What Is the Product?

We set out to deploy AWS Coworker to Amazon Bedrock AgentCore. This should have been straightforward — we’d built a working CLI tool, we had a Dockerfile, the AgentCore container contract is two endpoints (GET /ping, POST /invocations) that your container must implement. Package it up, push to ECR, register the agent. Done by lunch.

It was not done by lunch.

The problem wasn’t technical. It was conceptual. I asked Claude to plan the AgentCore deployment, and the plan was… fine. Technically complete. But something felt wrong, and it took me an embarrassingly long time to articulate what. The plan treated AWS Coworker as a CLI tool that needed a web wrapper. That’s what I’d been calling it in my head: “the CLI with a server in front of it.” But AgentCore doesn’t wrap CLIs. It hosts agent runtimes that respond to an API contract.

I always knew better, of course, but in my excitement to move from local deployment to AgentCore, I’d afforded myself the luxury of blurring the details. Part 1’s lesson applies here too: you can delegate tasks, but you cannot delegate responsibility — including responsibility for understanding what you’re actually deploying.

“What are we actually deploying?” I asked Claude. And the act of asking forced the answer.

We weren’t deploying one thing. We’d been conflating three things:

The product — AWS Coworker itself: the commands, skills, agents, and governance model that make it work. This is what you clone from GitHub. It’s the CLAUDE.md file, the .claude/ directory, the skills/ directory, the config/ directory. It runs anywhere Claude Code runs.

The deployment mechanism — a server that wraps the product’s capabilities behind an HTTP API so they can be consumed remotely. This is what turns a CLI tool into a service. It doesn’t add capabilities; it exposes existing ones over a network boundary.

The interface — a client that talks to the server. Could be a terminal client, could be a web UI, could be an AgentCore consumer making API calls. The interface doesn’t know or care about the product’s internals. It sends messages and renders responses.

Part 2, Section 3 — “Batteries Included, Batteries Flat” — asked “what’s wired in?” That was a question about whether the config files were connected to anything. This was the harder question: what IS the thing we’re wiring? Part 1’s architecture section described commands, sub-agents, and skills — the internal architecture. We’d never described the deployment architecture. And the discovery that they’re different things — that the internal architecture and the deployment architecture are separate concerns that happen to live in the same repo — was the insight that unlocked everything that follows in this post.

The AgentCore investigation was supposed to answer “how do we deploy?” It actually answered “what are we deploying?” The answer changed the architecture. And the architecture that fell out of that answer turned out to be the foundation for solving the master key problem — though we didn’t know that yet.

The lesson: If you can’t describe your product’s deployment architecture without using words like “just” or “basically” — as in “it’s basically a CLI with a server in front of it” — you don’t understand your own product well enough to deploy it. The “just” is hiding a separation of concerns that matters. Deployment forces architectural clarity that’s worth more than the deployment itself.

There’s a deeper lesson here, and it might be the simplest but most important one in this entire series. We are no longer coding. We are talking. For decades, software development has climbed an abstraction ladder — machine code to assembly to C to Python — each rung trading some control for productivity. But every one of those abstractions had a compiler or interpreter that enforced precision. You couldn’t be vague in C. The compiler wouldn’t let you. English has no compiler. The “compiler” is the agent’s interpretation, and English can be interpreted in more ways than any programming language ever written.

I fell into exactly this trap. I have thirty years of experience building systems. I’ve architected large-scale solutions, iterated through code to understand design trade-offs, and built the muscle memory to know when something doesn’t feel right. But I’d never had to clearly articulate a design in plain English with enough precision that a non-deterministic system would implement it correctly. When I said “CLI with a server in front of it,” I knew that was imprecise — but the convenience of a capable agent meant I could be lazy about it. The agent would figure out what I meant. Except “what I meant” and “what I said” weren’t the same thing, and the gap between them became the gap in the architecture.

There’s an old saying: if you can’t explain something simply, you don’t understand it. That’s half right. You need simplicity — but not at the cost of precision. “Basically a CLI with a server” is simple. It’s also wrong in ways that matter. The discipline now isn’t writing better code. It’s speaking with the precision we used to reserve for code, in a language that was never designed for it.


2. Three Layers: Agent, Server, Client

The answer to “what is the product?” gave us a new design tenet — the tenth, earned through the same pattern as the other nine: a failure that proved it was necessary.

Tenet 10: Agent-First, Server-Wraps, Clients-Consume.

We’d been calling AWS Coworker “a CLI tool” for three blog posts. That was comfortable — it ran in a terminal, we interacted with it through Claude Code’s command line, it felt like a CLI. But the AgentCore discovery forced us to see what it actually is: an agent. The commands, skills, agent definitions, and governance model are the product. The CLI is just one way to interact with it. For a cloud deployment, the agent needs to run headless — no terminal, no human at the keyboard, just an API contract and the intelligence to fulfil it.

The dependency rule is simple: each layer depends only on the layer below. Never above, never sideways.

graph TD
    Client["<b>Client</b><br/><i>acw_client.py</i><br/>Sends messages, renders responses"]
    Server["<b>Server</b><br/><i>sdk_client.py</i><br/>Wraps agent behind HTTP + SSE"]
    Agent["<b>Agent</b><br/><i>commands, skills, config</i><br/>The product. Runs headless."]

    Client -->|depends on| Server
    Server -->|depends on| Agent

    style Client fill:#e8f4fd,stroke:#2196F3,stroke-width:2px
    style Server fill:#fff3e0,stroke:#FF9800,stroke-width:2px
    style Agent fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px

The agent never knows the server exists. The server never knows the client exists. Strip the client away: the server still works — AgentCore calls it directly. Strip the server away: the agent still works — that’s how we’ve been using it through Claude Code for three blog posts.

The Server

The server uses the Claude Agent SDK as a library — not a CLI wrapper. This distinction matters more than it sounds. A CLI wrapper would shell out to claude commands and parse terminal output. Fragile, lossy, and you lose the structured event stream. Instead, sdk_client.py imports the SDK directly and translates its streaming events into twelve typed Server-Sent Events: message, tool_use, tool_result, sub_agent_spawn, sub_agent_complete, error, execution_complete, and others. The server speaks the same language as the SDK, just over HTTP.
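To make the translation step concrete, here’s a minimal sketch of the SSE framing. The event type names are the ones listed above (the full set is twelve; these are the seven named); the framing helper itself is standard SSE wire format, an illustrative sketch rather than the project’s actual sdk_client.py.

```python
import json

# Event type names quoted from the post; the project defines twelve,
# these are the seven named. The framing below is the standard SSE
# wire format -- an illustration, not the project's code.
KNOWN_EVENTS = {
    "message", "tool_use", "tool_result", "sub_agent_spawn",
    "sub_agent_complete", "error", "execution_complete",
}

def format_sse(event_type: str, data: dict) -> str:
    """Frame one typed agent event as a Server-Sent Events message."""
    if event_type not in KNOWN_EVENTS:
        raise ValueError(f"unknown event type: {event_type}")
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"
```

Because the events are typed at the wire level, a client can dispatch on `event:` without parsing the payload first — which is what makes the thin client in the next section possible.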

There’s an important authentication detail here. The Claude Agent SDK can authenticate two ways: via Anthropic API key (direct), or via Amazon Bedrock IAM roles (CLAUDE_CODE_USE_BEDROCK=1). In Bedrock mode, the SDK makes bedrock:InvokeModel calls using whatever IAM role the process runs under. No API keys. No secrets management. The model access is governed by IAM policy — the same IAM policy that governs everything else in AWS. This means the server is genuinely AWS-native when deployed to Bedrock. Same auth model as every other AWS service.

The AgentCore protocol contract was there from day one: GET /ping for health checks, POST /invocations as the unified entry point. The same server works on EC2, in a Docker container, behind AgentCore, or anywhere that speaks HTTP. It doesn’t care where it runs because it doesn’t know where it runs.
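The whole contract fits on a page. A stdlib-only sketch — the two endpoint paths are from the contract; the response bodies and the echo placeholder are assumptions, standing in for the real agent runtime:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def route(method: str, path: str, payload: dict) -> tuple[int, dict]:
    """Pure routing logic for the two-endpoint AgentCore contract.

    Response bodies here are illustrative; the real /invocations
    handler hands the payload to the agent runtime.
    """
    if method == "GET" and path == "/ping":
        return 200, {"status": "healthy"}
    if method == "POST" and path == "/invocations":
        return 200, {"echo": payload.get("prompt", "")}  # placeholder
    return 404, {"error": "not found"}

class AgentCoreHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self._respond(*route("GET", self.path, {}))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        self._respond(*route("POST", self.path, body))

    def _respond(self, code: int, body: dict):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AgentCoreHandler).serve_forever()
```

Keeping the routing as a pure function is the point of the sketch: the same logic runs under any HTTP host, which is why the server doesn’t care where it runs.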

The Detachable Client

The client is deliberately thin. An abstract Transport interface with a concrete HTTPTransport for day one and a placeholder AgentCoreTransport for the next iteration (boto3 SigV4 auth, InvokeAgentRuntime calls). The terminal client — acw_client.py — connects to the server over HTTP/SSE, sends messages, and renders responses locally. It has session management (/sessions, /switch, /new) and a REPL loop. Nothing more.
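The shape of that interface, sketched below. The class names and the AgentCore plan (SigV4, InvokeAgentRuntime) follow the description above; the method signature and stub behaviour are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Transport(ABC):
    """The REPL never knows which backend it is talking to."""

    @abstractmethod
    def send(self, session_id: str, message: str) -> Iterator[dict]:
        """Send a message; yield structured events as they stream back."""

class HTTPTransport(Transport):
    """Day-one transport: HTTP POST + SSE parsing (stubbed here)."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    def send(self, session_id, message):
        # Real implementation would POST to the server and parse the
        # SSE stream; the stub shows only the shape of the interface.
        yield {"type": "message", "text": f"(stub) {message}"}

class AgentCoreTransport(Transport):
    """Placeholder for the next iteration: boto3 SigV4 auth and
    InvokeAgentRuntime calls, per the post."""

    def send(self, session_id, message):
        raise NotImplementedError("next iteration")
```

Swapping transports is then a one-line change in the client, which is exactly what “detachable” buys.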

The rendering layer (rendering.py) handles ANSI colours, markdown-to-terminal conversion, box-drawn tables, and an animated spinner. It’s entirely client-side. The server sends structured events; the client decides how to present them.

A streaming bug later proved the point — text leaked across the client-server boundary in a way that coupling would have hidden. Two rounds of server-side investigation found nothing wrong. The third round found it client-side: a buffer that wasn’t clearing between events. The layers were real, and the clean boundary between them exposed a defect that entanglement would have masked.
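A hypothetical reconstruction of that bug class — not the project’s actual rendering code, just the minimal shape of a buffer that leaks across events:

```python
# Hypothetical sketch of the defect class: a client-side renderer that
# accumulates streamed text chunks per event. Without the clear() call,
# the next event's output would include the previous event's text.
class Renderer:
    def __init__(self):
        self.buffer: list[str] = []

    def on_text(self, chunk: str):
        self.buffer.append(chunk)

    def on_event_complete(self) -> str:
        text = "".join(self.buffer)
        self.buffer.clear()  # the fix: reset state between events
        return text
```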

The lesson: Dependencies flow downward only. When you’re tempted to reach up — the server knowing about the client, the agent knowing about the server — stop. That coupling hides bugs and prevents the layers from being deployed independently. And there’s a bigger payoff we didn’t see coming: if the server is the deployment unit, and each agent role can run as its own server instance, then each instance gets its own IAM boundary. Same code, same governance, different credentials. Section 5 follows this thread to its conclusion.


3. The Credential Problem (and Why Instructions Have a Shelf Life)

Part 2 teased it: “Every sub-agent was using the same admin access key.” Here’s what that actually looked like.

The orchestrator — Opus — receives the user’s request, builds a plan, and delegates execution to sub-agents. A Haiku discovery worker might run aws ec2 describe-instances. A Sonnet mutation executor might run aws s3api create-bucket. Both tasks are clearly different in risk. Discovery is read-only. Mutation changes state. Our governance model treats them differently: different model tiers, different permission contexts, different agent identities. Everything about the architecture says these are different roles with different trust levels.

Then both agents run their AWS CLI commands using the same profile. The same aws-coworker-test with the same IAM permissions. The Haiku worker that should only be reading can write. The Sonnet executor that’s approved for one specific action has access to everything. The profile classification, the enforcement gates, the approval workflows — all of it is instruction-level governance. The infrastructure underneath doesn’t know any of it exists.

We designed two layers of control:

Layer 1: Profile delegation. Scoped AWS profiles per agent role. When the orchestrator spawns a discovery worker, it computes a readonly profile name (aws-coworker-test-readonly), checks if it exists, and passes it to the sub-agent. The discovery worker runs with readonly credentials. The mutation executor gets the write profile. IAM enforces the boundary — even if the agent’s instructions are ambiguous, the credentials constrain what’s possible.

Layer 2: Environment isolation. Each agent runs in its own container with its own IAM role. No ~/.aws/config to discover. No other profiles to reach for. Hard security boundary. This is the architecture the closing section describes.
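Layer 1’s resolution logic, sketched. The naming convention (base profile plus a role suffix) and the fallback_to_base flag come from the post; the config parsing is an assumption about a standard ~/.aws/config layout:

```python
import configparser
from pathlib import Path

def resolve_profile(base_profile: str, role: str, config_path: Path,
                    fallback_to_base: bool = True) -> tuple[str, bool]:
    """Compute the scoped profile name, check it exists, fall back if allowed.

    Returns (profile_to_use, scoped_profile_was_found). Illustrative
    sketch of the delegation pre-check, not the project's code.
    """
    scoped = f"{base_profile}-{role}"  # e.g. aws-coworker-test-readonly
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    # AWS config files name sections "profile <name>" (except default)
    if f"profile {scoped}" in cfg or scoped in cfg:
        return scoped, True
    if fallback_to_base:
        return base_profile, False  # the fallback_to_base: true path
    raise LookupError(f"scoped profile {scoped} not found")
```

The forced-output step from the fix is then just printing both return values before proceeding — the resolution display the orchestrator can’t skip.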

We implemented Layer 1. The test results were predictable if you’ve been reading this series.

The first test was theatre. The orchestrator read the profile delegation config, acknowledged it — “Per the config, I need to check if a readonly profile exists first” — and then passed the base profile (aws-coworker-test) to the sub-agent anyway. The credential scope template was included perfectly. The profile value was wrong. Same pattern as Part 2’s WAR: governance that looks real but isn’t connected to the decision path. Same pattern as Part 3’s classification bug: the instructions were correct, the model skipped the hard step.

The fix was the same one that worked in Part 3: structured pre-check with forced output. Lettered steps, mandatory resolution display, gate before the easy path. Compute the scoped profile name. Check if it exists. Print the resolution. Only then proceed.

The second test was machinery. The orchestrator computed aws-coworker-test-readonly, ran the existence check, printed the resolution — base profile, scoped profile, exists yes/no, using which — and passed the correct readonly profile to the sub-agent. The sub-agent tried it. AssumeRole failed because the IAM role was fake (we hadn’t created real scoped roles yet). The sub-agent stopped. Didn’t try another profile. Didn’t write a workaround. Didn’t explore ~/.aws/config for alternatives. Returned a structured failure report. The orchestrator fell back to the base profile per the fallback_to_base: true config. Clean result with a clear note about the fallback.

The pattern worked. Again.

But here’s where Part 4 departs from Part 3.

If you’re reading this after Part 3, you already know the pattern: write tighter instructions, the agent follows them. We proved it with enforcement gates, with classification pre-checks, with anti-rationalisation rules. And the profile delegation test proved it again. Structured pre-checks work. So what’s the problem?

The problem is why the first test failed. The orchestrator didn’t skip the readonly profile out of incompetence. Opus skipped it because it was smart enough to know the readonly profile would fail. The base profile would succeed. Why bother with the intermediate step when the outcome is predictable?

A less capable model might have followed the instructions literally, tried the readonly profile, watched it fail, and reported the failure. Opus looked ahead, saw the failure coming, and took the working path directly. The agent that’s best at reasoning is also the agent that’s best at reasoning around your rules.

This is the AI Fluency Index from Part 3, flipped to the agent side. Anthropic’s research showed that polished outputs reduce the user’s critical evaluation — the better the output looks, the less you question it. That’s the user-side problem. The agent-side problem is the mirror: capable models find the working path even when the correct path is different. Both sides of the same paradox.

The structured pre-check fixed it for now. But governance mechanisms that rely on “the model isn’t smart enough to work around them” have a shelf life. Every capability gain that makes the agent more helpful also makes it more capable of reasoning past your guardrails. The pre-check holds today. Will it hold against the next model? The model after that?

The only durable answer is enforcement that doesn’t depend on the model’s cooperation. IAM policies. Container boundaries. Network segmentation. Infrastructure that constrains what’s possible, not instructions that describe what’s preferred.

Instructions tell the model what you want. Infrastructure ensures you get it.

That’s why Layer 2 exists. And that’s where the closing section takes us.

The lesson: Part 3 proved you can tighten instructions until they work. Part 4 proves that “works today” isn’t “works forever.” The shelf life of instruction-based governance shrinks with every capability gain. The credential test passed because we wrote better instructions. But we’re one model release away from an agent smart enough to reason past them — not out of malice, but because that’s what intelligence does. The only governance that doesn’t expire is governance the model can’t reason around: not better instructions, but better infrastructure.


4. Three Minutes and Fifteen Seconds

Part 3 tested whether the governance held when the agent planned its own deployment — four tests, nine runs, three classes of bugs, no resources created. This is what happened when we asked it to actually do it.

“Deploy AWS Coworker to an EC2 instance using the aws-coworker-test profile so I can connect to it remotely.”

No hints. No step-by-step guidance. Just: deploy yourself.

Three minutes and fifteen seconds. Five-phase plan: IAM role with scoped Bedrock permissions, EC2 key pair, security group with SSH access, the instance itself running Amazon Linux 2023, and a systemd service to keep it running. Eleven-row WAR assessment against the EC2 MVA baseline. Rollback procedures in reverse phase order — if the instance fails, remove the security group; if the security group fails, remove the key pair; all the way back to the IAM role. Profile delegation worked: tried the readonly profile, fell back to base when it didn’t exist, noted the fallback clearly. Governance tags on every resource. Estimated monthly cost.

We didn’t execute it. The mechanics weren’t the question. What was interesting was where the plan diverged from the spec.

The deployment manifest — config/deployment.md, the file we’d created in Part 3 after discovering the agent didn’t know what it is — specifies a container approach. There’s a Dockerfile.aws-coworker with the system dependencies, the CLAUDE_CODE_USE_BEDROCK=1 environment variable, a health check, the works. The agent knew about all of it. Self-knowledge worked — it referenced the manifest, cited the Bedrock configuration, understood it was deploying itself.

Then it chose a different approach entirely. Instead of the Dockerfile: dnf install python3.11, pip install, systemd unit file. Direct install. No container. Simpler. Arguably better for a test deployment. Not what the spec says.

Remember Part 2’s “vibe reviewing” — where the WAR generated checkmarks without checking what was behind them? This was vibe deploying. The agent knew the spec. It read the manifest. It made a judgment call that the simpler path was appropriate for the context. And it might have been right! For a test deployment on a single EC2 instance, a container adds complexity that direct install avoids. But the spec exists for a reason: containers provide the isolation boundary that makes Layer 2 possible. Skipping the container “for simplicity” skips the foundation of the credential isolation architecture.

The second divergence was subtler. The deployment manifest specifies exactly which Bedrock actions are needed — bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, scoped to three model families. The agent reached for AmazonBedrockFullAccess — AWS’s broad managed policy. Works. Not precisely right. The difference between scoped permissions and a managed policy is the difference between “this role can invoke three specific model families” and “this role can do anything Bedrock offers.” In a test account, nobody notices. In production, that’s a finding in your next security review.
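The difference is easy to see side by side. A sketch of the scoped policy the manifest calls for — the two actions are as described above; the model-family ARN patterns are illustrative placeholders, not the project’s actual model IDs:

```python
import json

# Sketch of a least-privilege Bedrock policy. The two actions come from
# the deployment manifest as described in the post; the Resource ARN
# patterns are placeholders for the three approved model families.
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream",
        ],
        "Resource": [
            # one pattern per approved model family (illustrative)
            "arn:aws:bedrock:*::foundation-model/anthropic.claude-opus-*",
            "arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-*",
            "arn:aws:bedrock:*::foundation-model/anthropic.claude-haiku-*",
        ],
    }],
}

def policy_document() -> str:
    """JSON string suitable for iam.create_policy(PolicyDocument=...)."""
    return json.dumps(SCOPED_POLICY)
```

Attaching AmazonBedrockFullAccess instead collapses all of this into “anything Bedrock offers” — which is exactly the gap a security review would flag.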

Both divergences follow the same pattern we’ve been documenting since Part 1, Lesson 1: the agent takes the path of least resistance. In Part 1, it was Bash agents instead of Task agents. In Part 3, it was delegating classification to a sub-agent. Here, it’s direct install instead of containers, and managed policies instead of scoped permissions. Same instinct, different layer. This is Part 1’s pattern showing up for the final time, at the highest level of abstraction we’ve tested.

Self-knowledge tells the agent what it is. It doesn’t make the agent follow its own spec. The gap between knowing and doing is the gap the architecture must close.

The lesson: “The agent knows what it should do and what works, and picks what works.” That sentence connects everything in this series. The governance problem (Part 3) is about closing the gap in the rules. The architecture problem (Part 4) is about closing the gap in the infrastructure — so that “should” and “works” are the same path. When your spec says “use a container” and the agent says “direct install is simpler,” the agent isn’t wrong about simplicity. It’s wrong about what simplicity costs you.


5. Breaking Apart the Agents (But Not the Way You’d Think)

Part 2 teased it: “Anthropic shipped Agent Teams for Claude Code. We looked at it seriously. We said ‘not yet.’” That was true when we wrote it. The full story is more nuanced.

Anthropic released Agent Teams as an experimental research preview in February 2026. It’s genuinely impressive work. One Claude Code session acts as a team lead, spawning teammates that each get their own context window, communicate through a mailbox system, self-coordinate through shared task lists, and can challenge each other’s findings. The competing-hypothesis pattern — where teammates actively try to disprove each other’s theories — is exactly the kind of capability that makes multi-agent systems worth building.

For the right use cases — parallel code review with independent security, performance, and coverage reviewers; research with multiple perspectives; cross-layer coordination where frontend, backend, and test changes need independent ownership — Agent Teams is the right tool. Anthropic built something that advances how developers work with AI agents, and we expect to adopt it when we find a task that genuinely needs independent parallel reasoning.

But it solves a different problem from ours.

From the Agent Teams documentation: “Teammates start with the lead’s permission settings.” You can change individual teammate modes after spawning, but you can’t set per-teammate modes at spawn time. That’s permissions — which tool calls get approved. Not credentials. There is no mechanism today for giving one teammate a read-only IAM role and another a scoped write role. All teammates share the same environment: same filesystem, same ~/.aws/config, same credentials.

Agent Teams gives agents autonomy — independent reasoning, their own context windows, their own decision-making. What we need is isolation — same orchestration model, same centralised enforcement, but different credential boundaries.

We’re not giving the agents independence. We’re giving them jail cells.

If this sounds familiar to anyone who’s built distributed systems, it should. The microservices world spent years learning this distinction the hard way.

Anthropic describes Agent Teams as orchestration — there’s a “team lead” that coordinates work and synthesises results. But look at what the teammates actually do: they communicate directly with each other through a mailbox, self-claim tasks from a shared list, challenge each other’s findings, and operate autonomously. That’s not orchestration. That’s choreography with a coordinator.

The distinction matters.

graph LR
    subgraph Orchestration["<b>Orchestration</b> (what we use)"]
        direction TB
        O["Orchestrator<br/><i>holds all state,<br/>enforces governance</i>"]
        W1["Worker A"]
        W2["Worker B"]
        W3["Worker C"]
        O --> W1
        O --> W2
        O --> W3
        W1 -.->|reports back| O
        W2 -.->|reports back| O
        W3 -.->|reports back| O
    end

    subgraph Choreography["<b>Choreography</b> (Agent Teams)"]
        direction TB
        L["Lead<br/><i>coordinator,<br/>not enforcer</i>"]
        T1["Teammate A"]
        T2["Teammate B"]
        T3["Teammate C"]
        L ~~~ T1
        T1 <-->|messages| T2
        T2 <-->|messages| T3
        T1 <-->|messages| T3
    end

    style O fill:#e8f5e9,stroke:#4CAF50,stroke-width:2px
    style L fill:#fff3e0,stroke:#FF9800,stroke-width:2px

In orchestration — the saga coordinator pattern — a central authority holds all the state, makes all the decisions, and enforces cross-cutting concerns. Every service reports back to the coordinator. In choreography, services react to each other’s events. No single node holds the full picture, which means cross-cutting concerns like security, transactions, and governance must be enforced at every node independently.

We’ve been here before. Early microservices teams were drawn to choreography for the same reasons Agent Teams is appealing: loose coupling, independent scaling, resilience. Then they discovered that cross-cutting concerns — the things that need to be true everywhere, enforced consistently, regardless of which service is handling the request — are brutally hard in choreographed systems. Many moved back to orchestration for exactly that reason. Not because choreography is bad, but because enforcing governance across autonomous participants is an order of magnitude harder than enforcing it through a central coordinator.

Our enforcement model is a cross-cutting concern. The HAL 9000 moment worked because the orchestrator held the full enforcement context — what was requested, what was blocked, what the user tried to bypass. The flow logs bug was caught because the enforcement rules were evaluated in one place. Distribute that across autonomous teammates, each with their own context window and their own reasoning, and you’re asking each teammate to independently enforce a governance model that was designed to work centrally. That’s the microservices governance problem all over again, wearing different clothes.

Here’s the thing: Anthropic’s own secure deployment documentation describes exactly the pattern we need — but for the Agent SDK, not Agent Teams. Run agent containers in a private subnet. Assign minimal IAM permissions to each agent’s service account. Route credentials through a proxy outside the agent’s security boundary so the agent never sees the actual credentials. The proxy enforces allowlists and logs all traffic for audit.

The SDK already supports ANTHROPIC_BASE_URL for proxy routing and Bedrock IAM for AWS-native auth. The building blocks exist. They just need to be assembled into a multi-agent architecture where each role is its own isolated SDK session.

Agent Teams solves “how do agents work together.” The Agent SDK’s secure deployment patterns solve “how do agents stay apart.” We need the latter. And the three-layer architecture from Section 2 — where the server is the deployment unit — accidentally created the foundation for it.

If each agent role runs as its own server instance, each gets its own container, its own IAM role, and its own blast radius:

| Role | Model | IAM Boundary | Can Touch |
|------|-------|--------------|-----------|
| Orchestrator | Opus | Zero AWS access | Nothing — sees everything, touches nothing |
| Discovery workers | Haiku | Read-only, service-scoped | describe-*, list-*, get-* only |
| Mutation workers | Sonnet | Write, action-scoped | Only the approved operations |

The solution is beautifully ironic: you give the smartest agent the biggest job and the least privilege. Opus orchestrates everything, sees everything, reasons about everything — and can’t touch anything. The Haiku and Sonnet workers that actually execute get scoped profiles with just enough access for their specific task. The most intelligent agent in the system is the one with the tightest constraints.

Beyond isolation, there are three additional reasons Agent Teams isn’t the right fit today. First, centralised enforcement: Part 2’s HAL 9000 moment worked because the orchestrator held the enforcement context. Distributing enforcement across autonomous teammates would make the safety model harder to reason about. We want centralised orchestration — Opus decides, workers execute. Second, cost: Agent Teams gives each agent a full Claude session with its own context window. For “list my S3 buckets,” that’s multiple full sessions instead of one Haiku sub-agent. Our model hierarchy is deliberately cost-optimised. Agent Teams would flatten that. Third, additive adoption: Agent Teams doesn’t require rearchitecting. When we find a task that genuinely needs independent agents reasoning in parallel, we can adopt without rewriting.

That said, none of this means Agent Teams doesn’t have a place in this architecture. It depends on which level you’re looking at. At the orchestration boundary — where credential isolation matters — agents sharing state directly is an anti-pattern. But inside a container? That’s a different story. A mutation container scoped to EC2 in the sandbox account has a single IAM role. Whether there’s one agent or five inside that boundary, they all have the same blast radius. At that point, having teammates debate the best approach, challenge each other’s findings, and review each other’s work is pure upside. The isolation boundary is the container. What happens inside it is a coordination problem, not a security one — and coordination is exactly what Agent Teams is built for.

Agent Teams is still experimental. Per-teammate credential scoping could come later — it’s a natural extension. But we don’t need to wait for that to find the right use. The architecture we’re describing creates the boundaries. Agent Teams could thrive within them.

The lesson: When you have multiple agents, the question isn’t whether to break them apart — it’s why. Autonomy and isolation look similar from the outside (multiple agents doing separate work) but they’re opposite design choices. Autonomy distributes decision-making. Isolation constrains access. The microservices world learned this the hard way: choreography is elegant until you need to enforce something consistently across every participant. If your problem is coordination, you want autonomy. If your problem is security, you want isolation. We needed jail cells, not conference rooms. The architecture that fixes credentials is the same architecture that fixes governance — because both are symptoms of the same underlying tension: an agent that knows what it should do and what works, and picks what works.


What We Learned

If there’s a single lesson from four blog posts and several months of building, it’s this: don’t dangerously skip responsibility.

Every failure in this series is a version of the same mistake. Part 1 delegated responsibility for design — then skipped verifying the design was followed. Part 2 skipped responsibility for verification — we reviewed vibes, not output, and the agent noticed before we did. Part 3 skipped responsibility for unambiguous rules — the agent followed every instruction and still found the reading that served it best. Part 4 skipped responsibility for infrastructure — we gave the agent a deployment manifest, perfect self-knowledge, and it chose the simpler path anyway.

Each time we thought we’d learned the lesson. Each time the next post proved we’d learned a version of it. The failure kept escalating — from “we forgot to check” to “we checked but not carefully enough” to “we were careful but not mechanical enough” to “we were mechanical but the mechanism was in the wrong layer.” The lesson isn’t any one of those. The lesson is that responsibility doesn’t stop at the layer you’re comfortable with. It goes all the way down to infrastructure, and skipping any layer is skipping responsibility.

That’s the arc of this series, and it took us embarrassingly long to see it. Part 1: build the architecture. Part 2: teach it what “good” looks like. Part 3: make the rules mechanical. Part 4: accept that rules are necessary but not sufficient, and build infrastructure that enforces what rules can’t.

The tenet table gains a column:

| Part 1 Tenet | What We Thought It Meant | What Part 4 Taught Us |
| --- | --- | --- |
| Tenet 6: Explicit Over Implicit | Part 3: Documentation gets ignored; mechanical pre-checks work | Structured pre-checks with forced output are the final form of instruction-level governance. Beyond that, you need infrastructure. |
| Tenet 7: Respect the Agent Architecture | If you designed roles, use them | Don’t change the architecture — harden it. Isolation, not autonomy. The roles stay the same; the boundaries between them become real. |
| Tenet 10: Agent-First, Server-Wraps, Clients-Consume | NEW | Dependencies flow downward only. Each layer depends only on the layer below. The server never knows the client exists. The agent never knows the server exists. The product runs headless — the terminal was always just one interface. And when the server is the deployment unit, you get credential isolation for free. |

Lessons that escalate from Part 3

Instructions aren’t a security boundary. Part 3 proved you can tighten instructions until they work. Part 4 proves that “works today” isn’t “works forever.” The credential delegation test failed because Opus was smart enough to skip the readonly profile — it knew the profile would fail, so it took the working path directly. The structured pre-check fixed it. But governance mechanisms that rely on “the model isn’t smart enough to work around them” have a shelf life. Every capability gain shrinks it.
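The structured pre-check above can be reduced to a mechanical gate. Here is a minimal sketch; the record fields and function names are hypothetical, not AWS Coworker’s actual implementation. The point it illustrates is that execution should depend on a structured artifact, not on prose acknowledgement:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a mechanical pre-check gate. Field names
# (action, profile, approval_id) are illustrative only.

@dataclass
class PreCheck:
    action: str       # the exact call the agent intends, e.g. "ec2:TerminateInstances"
    profile: str      # the credential profile the agent claims it will use
    approval_id: str  # reference to an explicit, recorded approval

def gate(precheck: Optional[PreCheck], requested_action: str) -> bool:
    """No record, or a mismatched record, means no execution."""
    if precheck is None:
        return False                   # a prose acknowledgement is not a record
    if precheck.action != requested_action:
        return False                   # the record must match the actual call
    return bool(precheck.approval_id)  # and must point at a real approval

print(gate(None, "ec2:TerminateInstances"))  # False: nothing to check against
print(gate(PreCheck("ec2:TerminateInstances", "admin", "APR-42"),
           "ec2:TerminateInstances"))        # True: record matches the call
```

Forcing the agent to produce the record is what makes the rule mechanical: there is nothing to reinterpret, only a field that matches or doesn’t.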

The agent knows what it should do and what works — and picks what works. Part 3 found this in rule reinterpretation: the lawyer who doesn’t break the law but finds readings that serve the client. Part 4 found it in deployment divergence: the agent that read the container spec and chose direct install because it was simpler. Same sentence. Higher stakes each time. The gap between “should” and “works” is the gap the architecture must close.

Lessons genuinely new to Part 4

Deployment forces architectural clarity worth more than the deployment itself. The AgentCore investigation was supposed to answer “how do we deploy?” It actually answered “what are we deploying?” — and the answer separated three concerns (product, server, client) that we’d been conflating. The architectural clarity we gained by thinking about deployment was more valuable than the deployment.

The three-layer architecture accidentally created the foundation for credential isolation. We didn’t design the server-as-deployment-unit pattern for security. We designed it because the AgentCore investigation forced us to separate concerns. The security payoff — each agent role as its own server instance with its own IAM boundary — fell out of good separation of concerns. The best security architecture we’ve built wasn’t designed for security.

Isolation, not autonomy. The question isn’t whether to break apart the agents. It’s whether you break them apart to give them independence (Agent Teams) or to give them constraints (environment isolation). Autonomy distributes decision-making. Isolation constrains access. We chose constraints. The most intelligent agent in the system gets the tightest constraints.

“Vibe deploying” is Part 2’s “vibe reviewing” at the infrastructure layer. Self-knowledge tells the agent what it is. It doesn’t make the agent follow its own spec. The deployment manifest gave the agent everything it needed to know about itself — Dockerfile, environment variables, scoped IAM permissions. The agent read it, understood it, and chose the simpler path anyway. The gap between knowing and doing is the gap the architecture must close.

Part 2’s closing said: “Specs are hypotheses. Tests are experiments. The blog posts are lab notes.” Part 4 extends it: instructions are hypotheses too. And the most important experiment is the one that proves your instructions don’t work — so you build infrastructure that doesn’t need them to.


Don’t Dangerously Skip Responsibility

The three-layer architecture from Section 2 and the isolation decision from Section 5 together give us the blueprint. Here’s the teaching model:

```mermaid
graph TD
    User["User Request"] --> O

    subgraph Orch["&nbsp;"]
        O["<b>Orchestrator</b><br/>Opus — Full context, zero AWS access<br/><i>Sees everything, touches nothing</i>"]
    end

    O -->|"spawn<br/>(readonly creds)"| D
    O -->|"spawn<br/>(scoped creds)"| M

    subgraph Disc["&nbsp;"]
        D["<b>Discovery Container</b><br/>Haiku — Read-only IAM role<br/><i>describe-*, list-*, get-*</i>"]
    end

    subgraph Mut["&nbsp;"]
        M["<b>Mutation Container</b><br/>Sonnet — Scoped write IAM role<br/><i>Approved actions only</i>"]
    end

    D -.->|results| O
    M -.->|results| O

    style Orch fill:#fce4ec,stroke:#E91E63,stroke-width:2px
    style Disc fill:#e8f4fd,stroke:#2196F3,stroke-width:2px
    style Mut fill:#fff3e0,stroke:#FF9800,stroke-width:2px
```

The orchestrator — Opus — gets full context and zero AWS access. It sees everything, reasons about everything, and can’t touch anything. The discovery workers — Haiku — get read-only IAM roles scoped to specific services. The mutation workers — Sonnet — get write IAM roles scoped to approved actions only. Each runs as its own server instance with its own credentials. Same code. Same governance. Different blast radius.

The smartest agent gets the biggest job and the least privilege. Opus orchestrates everything and can’t touch anything. That’s the punchline. But it’s not the whole picture — because Opus still talks to those sub-agents, and if it wanted to, it could try to get them to do things you didn’t intend. That’s why those sub-agents need to be constrained not just by their CLAUDE.md files, not just by their skills, but by their IAM policies and the infrastructure they have access to. Instructions are the first line of defence. Infrastructure is the last.
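What “different blast radius” means mechanically can be sketched in a few lines. This is a rough illustration of IAM-style Allow matching under invented role patterns; real IAM evaluation (Deny statements, conditions, resource ARNs) is far richer:

```python
from fnmatch import fnmatchcase

# Illustrative role patterns only, not real policies. Real IAM also
# evaluates Deny statements, conditions, and resource ARNs.

DISCOVERY_ROLE = ["ec2:Describe*", "s3:List*", "s3:Get*", "iam:Get*", "iam:List*"]
MUTATION_ROLE  = ["ec2:StartInstances", "ec2:StopInstances"]  # approved actions only

def allowed(action: str, role_patterns: list[str]) -> bool:
    """True if the action matches any allowed pattern. The check is the
    same everywhere; only the patterns (the role) change per container."""
    return any(fnmatchcase(action, pattern) for pattern in role_patterns)

print(allowed("ec2:DescribeInstances", DISCOVERY_ROLE))   # True: read path
print(allowed("ec2:TerminateInstances", DISCOVERY_ROLE))  # False: no writes at all
print(allowed("ec2:TerminateInstances", MUTATION_ROLE))   # False: not on the approved list
```

Same code, same check, different patterns per container. No amount of clever reasoning inside the container changes what the patterns allow — which is the whole point of making infrastructure the last line.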

And three containers is the teaching model. Here’s reality:

```mermaid
graph TD
    User["User Request"] --> O

    subgraph Orch["&nbsp;"]
        O["<b>Orchestrator</b><br/>Opus — Zero AWS access"]
    end

    O --> D1 & D2 & D3
    O --> M1 & M2 & M3 & M4

    subgraph Discovery["Discovery Containers (read-only)"]
        D1["EC2 Discovery<br/><i>ec2:Describe*</i>"]
        D2["S3 Discovery<br/><i>s3:List*, s3:Get*</i>"]
        D3["IAM Discovery<br/><i>iam:Get*, iam:List*</i>"]
    end

    subgraph Mutations["Mutation Containers (scoped write)"]
        M1["EC2 Mutations<br/><i>Sandbox only</i>"]
        M2["S3 Mutations<br/><i>Non-prod buckets</i>"]
        M3["Lambda Mutations<br/><i>Dev account</i>"]
        M4["IAM Mutations<br/><i>Scoped roles only</i>"]
    end

    D1 & D2 & D3 -.->|results| O
    M1 & M2 & M3 & M4 -.->|results| O

    style Orch fill:#fce4ec,stroke:#E91E63,stroke-width:2px
    style Discovery fill:#e8f4fd,stroke:#2196F3,stroke-width:1px
    style Mutations fill:#fff3e0,stroke:#FF9800,stroke-width:1px
```

And this diagram still doesn’t do justice to the real complexity. In an enterprise, you could have hundreds of mutation containers — each scoped to a specific AWS service, in a specific account, in a specific part of the organisation. The same goes for discovery containers: how much do you want an agent to read? Even read-only data flows through the language model. Where is that model hosted? What does the data contain? If it’s in Bedrock, you control the boundary. If it’s not, you have to think very carefully about what you’re exposing.

The diagram shows seven containers. A real enterprise deployment might have seventy. Or seven hundred. Each with its own IAM role, its own blast radius, its own observability. And here’s the thing: we already know how to do this. EKS, service meshes, IAM policies, CloudWatch, X-Ray — the infrastructure patterns for managing large fleets of containers with fine-grained access control aren’t new. We’ve been building this for years. There’s nothing novel about the infrastructure underneath.
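One way to picture the fleet is as a routing table the orchestrator consults: every request maps to exactly one narrowly scoped container, and anything unmapped fails closed rather than escalating to a broader scope. The container names and scopes below are invented for illustration:

```python
# Illustrative routing table for a container fleet. Every name and scope
# here is invented; a real deployment might have hundreds of entries,
# one per (service, account, access-level) combination.
FLEET = {
    ("ec2", "read"):  "ec2-discovery",
    ("ec2", "write"): "ec2-mutations-sandbox",
    ("s3",  "read"):  "s3-discovery",
    ("s3",  "write"): "s3-mutations-nonprod",
    ("iam", "read"):  "iam-discovery",
}

def route(service: str, access: str) -> str:
    """Return the one container whose blast radius covers the request.
    Deliberately no fallback to a broader scope: a request the fleet
    wasn't provisioned for fails closed instead of escalating."""
    try:
        return FLEET[(service, access)]
    except KeyError:
        raise PermissionError(f"no container scoped for {service}:{access}") from None

print(route("ec2", "read"))   # ec2-discovery
print(route("s3", "write"))   # s3-mutations-nonprod
```

The routing table, not the orchestrator’s judgement, decides which credentials a task runs under — which is what keeps the smartest agent in the system unable to touch anything directly.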

What’s novel is what’s running inside the containers. We’ve never before deployed software that can reason about its own constraints and find creative paths around them. We’ve never had to assume that the thing inside the container might try to convince the thing in the next container to do something it shouldn’t. The infrastructure patterns are familiar. The threat model is not.

And that’s what this series has been building toward.

Everyone’s having fun right now — and they should be. Building with OpenClaw, building with Claude Code, experimenting with agent frameworks, marvelling at the art of the possible. I embrace that excitement. I’m one of those people. I’ve spent this entire series building something I love and documenting every way it broke, because the breaking is where the learning lives.

But there’s a distance between building agents on your desktop and deploying agents that manage real infrastructure in a real cloud environment with real consequences. The desktop is a sandbox. You can --dangerously-skip-permissions and learn by watching what happens. That’s how it should work. That flag exists so you can experiment, iterate, understand — and give the agent more autonomy so you can learn faster.

The lesson — the lesson of this entire series, not just this section — is that you can dangerously skip permissions. You should not dangerously skip responsibility.

In Part 3, we referenced an OpenClaw agent that deleted over 200 emails from Meta’s Director of Alignment. It had worked perfectly in a small inbox. Then context compaction silently erased the safety instructions, and the agent announced it would trash everything older than February 15th. She typed “STOP.” It didn’t stop. She had to physically sprint to her computer to kill the process.

That was emails. We’re building agents that manage cloud infrastructure. Part 3 talked about my games customers at AWS — platforms that stopped being “just games” years ago. They’re social spaces where millions of young people interact, communicate, and build communities. The infrastructure that governs agents managing those platforms isn’t an engineering nicety. It’s the difference between a misconfigured bucket that gets caught in an audit and a data exposure that puts children at risk. The architecture in this post — scoped containers, least-privilege IAM, observable boundaries — isn’t theoretical for them. It’s the minimum responsible deployment.

Maybe there’s a future where this level of constraint isn’t necessary. Maybe models will stop drifting over long contexts. Maybe hallucinations will become a solved problem. Maybe the velocity gap between what the technology can do and what we can safely govern will close. But that future isn’t now. And right now, the question of what we allow agents to do — and what infrastructure we put around them — extends well beyond cloud management.

The week I’m finishing this series, Anthropic published a statement on their position with the Department of Defense — drawing a line on two specific use cases: fully autonomous weapons and mass domestic surveillance. Some see that as political. I see it as an engineering assessment — the same engineering assessment this entire series documents, at a different scale. I’ve spent four blog posts watching a capable agent reinterpret rules, find creative paths around constraints, and take the working path instead of the correct one. That was S3 buckets and VPC flow logs. Now scale that same unpredictability to a system that controls lethal autonomous weapons. A misconfigured S3 bucket is recoverable. A hallucinated targeting decision is not.

The pattern is familiar. In 2016, Apple refused a federal court order to build a tool that bypassed iPhone encryption. They weren’t asked to weaken the encryption algorithm — they were asked to strip the security policies surrounding it. Anthropic faces the same structure of demand: not to change Claude’s model weights, but to remove the acceptable use policies that govern how the model is deployed. In both cases, the question is the same: once you establish that the rules around a powerful technology are negotiable under pressure, can you ever fully reinstate them?

Anthropic’s answer — and I think the right one — isn’t “never.” It’s “not yet, and here’s how we get there responsibly.” They proposed controlled sandboxes where reliability and ethics could be studied before full deployment. That’s exactly the approach this series has been taking: test in controlled environments, document what goes wrong, iterate, and only expand scope when the evidence supports it. It’s why I felt compelled to write about Anthropic’s position separately — because after months of teaching agents to follow rules, the technical argument for guardrails isn’t theoretical to me. It’s every section of every post in this series.

Part 1 built the architecture. Part 2 taught it what “good” looks like. Part 3 made the rules mechanical. Part 4 accepted that rules are necessary but not sufficient, and showed us the infrastructure that enforces what rules can’t. The arc of this series is the arc of growing up about agents — not abandoning the excitement, but matching it with the engineering discipline the technology demands.

Part 2’s closing said: “Specs are hypotheses. Tests are experiments. The blog posts are lab notes.” Part 4 extends it: instructions are hypotheses too. The experiments proved some of them wrong. The lab notes are these four posts.

We started with a CLI tool and ended with an architecture for deploying agents at scale in a secure, observable, governed environment. The infrastructure patterns aren’t new. The responsibility is.

Don’t dangerously skip it.


*The AWS Coworker lessons series: Part 1: I Used Claude Cowork to Build a Claude Code Agent for AWS. Here’s What Broke · Part 2: The Theater of WAR · Part 3: The Governance Problem · Part 4: The Architecture Problem (this post)*

The views expressed here are my own and do not represent the views of my employer. AWS Coworker is a personal learning project, not an official AWS product.

Thank you to Anthropic for building tools that make this kind of exploration possible — and for documenting the secure deployment patterns that showed us the path forward.

And finally, thank you to my lovely wife Kelly for pushing me to do this. Every project needs someone who won’t let you leave it in a drawer. Love you, Kel.