Why AI Agents Need Code Execution (Not Just Bigger Context Windows)
Understanding the Four Concepts
Skills
Skills are packaged execution units. They can be:
- Instructional markdown (prompt engineering)
- Deterministic code (Python/JS functions)
- Composable workflows (skills calling other skills)
The key property of skills is that they're persistent and reusable. Skills are versioned, immutable artifacts that can be shared and composed.
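As a concrete sketch, a skill could be modeled as an immutable record pairing a versioned name with a deterministic function. The `Skill` type and `total_revenue` function here are purely illustrative, not a real platform API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Skill:
    """A versioned, immutable execution unit (hypothetical shape)."""
    name: str
    version: str
    run: Callable[..., Any]

def total_revenue(charges: list[dict]) -> float:
    """Deterministic code: sum charge amounts (cents) and return dollars."""
    return sum(c["amount"] for c in charges) / 100

# Frozen dataclass = immutable artifact; the version string makes it shareable.
revenue_skill = Skill(name="total_revenue", version="1.0.0", run=total_revenue)
```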
RLM (Recursive Language Models) Pattern
RLM is an inference-time technique for context decomposition, not a special model type.
Instead of:

```
LLM(10M token context) → answer
```

you do:

```python
def solve(problem):
    if simple(problem):
        return LLM(problem)
    else:
        sub_problems = decompose(problem)
        results = [solve(p) for p in sub_problems]
        return LLM(combine(results))
```
The pattern is recursive—the LLM breaks down problems and calls itself on smaller chunks.
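A runnable toy version of the pattern, with a stub `LLM` that sums numbers in a string standing in for a real model call (every function here is an illustrative placeholder, not a real API):

```python
def LLM(prompt: str) -> str:
    """Stub for a model call: 'answers' by summing the numbers in the prompt."""
    return str(sum(int(tok) for tok in prompt.split()))

def simple(problem: str) -> bool:
    return len(problem.split()) <= 4          # small enough to answer directly

def decompose(problem: str) -> list[str]:
    words = problem.split()
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]

def combine(results: list[str]) -> str:
    return " ".join(results)

def solve(problem: str) -> str:
    if simple(problem):
        return LLM(problem)
    sub_problems = decompose(problem)
    results = [solve(p) for p in sub_problems]  # recurse on smaller chunks
    return LLM(combine(results))
```

Calling `solve("1 2 3 4 5 6 7 8")` splits the input in half, sums each half, then sums the partial answers, mirroring how an RLM decomposes a context too large to answer in one shot.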
Workflows
Workflows are collections of skills with data flow: Skill A → Skill B → Skill C, where outputs become inputs.
For example, a reconciliation bot with access to Stripe, QuickBooks, and Shopify:
- Fetch Stripe payments for the month (Skill A)
- Pull QuickBooks invoices and Shopify orders (Skill B)
- Match transactions across all three platforms (Skill C)
- Flag discrepancies and generate reconciliation report (Skill D)
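The data flow above can be sketched as plain function composition, where each skill's output becomes the next skill's input. The skill bodies here are hypothetical stubs; real versions would call the Stripe, QuickBooks, and Shopify APIs:

```python
# Hypothetical skill stubs: real versions would hit the live APIs.
def fetch_stripe_payments(month: str) -> list[dict]:        # Skill A
    return [{"id": "pay_1", "amount": 5000}]

def fetch_invoices_and_orders(month: str) -> dict:          # Skill B
    return {"invoices": [{"id": "inv_1", "amount": 5000}], "orders": []}

def match_transactions(payments: list[dict], books: dict) -> list[dict]:  # Skill C
    matched = []
    for p in payments:
        hit = any(i["amount"] == p["amount"] for i in books["invoices"])
        matched.append({**p, "matched": hit})
    return matched

def report_discrepancies(matches: list[dict]) -> list[dict]:  # Skill D
    return [m for m in matches if not m["matched"]]

# Output of each skill feeds the next: A → B → C → D.
payments = fetch_stripe_payments("2024-05")
books = fetch_invoices_and_orders("2024-05")
flags = report_discrepancies(match_transactions(payments, books))
```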
If an agent creates a skill but the platform doesn't guide it to specify:
- What inputs the skill expects
- What outputs it produces
- What types those are
...then the vision of composable workflows isn't achievable.
The infrastructure needs to enforce this. At APIFunnel, when you compile a skill, you specify inputs and outputs explicitly. This makes skills truly composable and enables the agent to chain them together intelligently.
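A minimal sketch of what explicitly declared inputs and outputs might look like, using `TypedDict` contracts plus a spec record; the shapes and the `SKILL_SPEC` format are assumptions for illustration, not APIFunnel's actual compilation format:

```python
from typing import TypedDict

class FlagInput(TypedDict):
    matches: list[dict]       # expected input: output of a matching skill

class FlagOutput(TypedDict):
    discrepancies: list[dict]  # produced output: unmatched transactions

def flag_discrepancies(inp: FlagInput) -> FlagOutput:
    return {"discrepancies": [m for m in inp["matches"] if not m["matched"]]}

# Explicit I/O declaration lets another skill (or the agent) wire this
# skill into a chain without guessing its signature.
SKILL_SPEC = {
    "name": "flag_discrepancies",
    "inputs": {"matches": "list[dict]"},
    "outputs": {"discrepancies": "list[dict]"},
}
```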
Code Execution
Code execution is the runtime environment for arbitrary logic. It's the substrate beneath everything else.
Sandboxed Python/JavaScript containers with:
- API access (hundreds of pre-wired integrations)
- Persistent state (session-based file storage)
- Scheduling (cron, webhooks, callbacks)
The key property is unrestricted execution—you write the logic, it runs.
How They Relate
RLM, skills, and workflows are different abstractions over code execution.
Code execution is the primitive. Everything else is a pattern built on top of it.
The fine print, unfortunately, is that code execution cannot reach its full potential without the right infrastructure: infrastructure built with a code-execution-first mindset, not bolted on as an afterthought. CE frameworks must be architected to give the agent guidance while staying out of its way by design, rather than boxing it into the agent's current abilities. That principle should be embedded in any CE-as-a-service platform you choose.
Progressive Discovery: The Infrastructure Pattern That Scales
Traditional MCP (Model Context Protocol) is essentially serverless for agents:
- Tools exposed directly to the LLM
- Stateless inline calls
- Minimal infrastructure
It's simple and clean, but it breaks down once you move beyond basic use cases.
The Three-Step Progressive Discovery Flow
Every agent interaction follows this pattern:
```python
servers = await list_api_servers()
# Returns: service IDs, descriptions, tool counts
```
Called once per task (typically one conversation turn). This gives the agent a map of available integrations.
```python
tools = await tools_search(
    server_name="stripe_api",
    query="list charges",
)
# Returns: tool names like stripe_api.list_charges
```
Semantic search across tool descriptions. The agent narrows down to relevant capabilities.
```python
info = await get_tool_info(node_id="tool:stripe_api.list_charges")
# Returns: exact parameter names, types, required vs optional
```
This is the critical step—you get the exact schema. No guessing parameter names.
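The three steps chain together into a single discovery routine. This sketch uses stub async functions standing in for the platform's real discovery calls, with hard-coded return values so it runs as-is:

```python
import asyncio

# Stubs standing in for the platform's real discovery calls.
async def list_api_servers():
    return [{"id": "stripe_api", "description": "Payments", "tool_count": 42}]

async def tools_search(server_name, query):
    return ["stripe_api.list_charges"]

async def get_tool_info(node_id):
    return {"params": {"limit": "int (optional)", "created": "dict (optional)"}}

async def discover(query: str) -> dict:
    servers = await list_api_servers()                        # step 1: map integrations
    tools = await tools_search(server_name=servers[0]["id"],  # step 2: narrow down
                               query=query)
    return await get_tool_info(node_id=f"tool:{tools[0]}")    # step 3: exact schema

schema = asyncio.run(discover("list charges"))
```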
Why Progressive Discovery Matters
Any deviation from the three-step flow signals that context engineering is required.
If you see:
- Agent calls `list_api_servers()` multiple times in one task → context gap
- Agent guesses tool parameters without calling `get_tool_info()` → missing schema context
- Agent retries code execution 5+ times → discovery context needs work
Coding errors are expected—retries for syntax bugs are normal.
But missing context that forces signature guessing? That's an infrastructure problem, not an agent problem.
Traditional MCP vs Code Execution with Progressive Discovery
| Feature | Traditional MCP | Code Execution + Discovery |
|---|---|---|
| Tool surface | Fixed (compile time) | Dynamic (runtime) |
| Schema discovery | Pre-loaded in context | Progressive (3-step flow) |
| State | Stateless | Session-persistent |
| Context management | All tools loaded upfront | On-demand discovery |
| Recursive workflows | Not supported | Native |
| Error recovery | Manual retry | Automatic with context |
Traditional MCP loads all tool schemas into context upfront. With 5-10 servers, this works fine. With 50+ APIs, you hit context limits. Progressive discovery solves this by fetching schemas on-demand.
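A back-of-envelope calculation makes the scaling problem concrete. The per-server and per-schema numbers here are assumptions for illustration, not measured figures:

```python
# Assumed: 20 tools per server, ~500 tokens per tool schema.
TOOLS_PER_SERVER = 20
TOKENS_PER_SCHEMA = 500

def upfront_tokens(n_servers: int) -> int:
    """Context cost of pre-loading every tool schema, MCP-style."""
    return n_servers * TOOLS_PER_SERVER * TOKENS_PER_SCHEMA

small = upfront_tokens(10)   # 100_000 tokens: tight but workable
large = upfront_tokens(50)   # 500_000 tokens: past most context windows
```

Under these assumptions, the jump from 10 to 50 servers is the difference between fitting in context and not; progressive discovery keeps only the handful of schemas a task actually needs.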
Code Execution Infrastructure: Six Key Components
Reliable code execution requires:
- Discovery Layer — `list_api_servers()`, `tools_search()`, `get_tool_info()` for schema inspection
- Sandboxed Execution — Isolated containers with resource limits and network policies
- Session Persistence — State survives across executions (`/tmp/artifacts/`, session IDs)
- Seamless Authentication — OAuth flows, token refresh, and credential injection
- Skill Composition — Explicit input/output wiring enables skills to call other skills
- Scheduling & Durability — Cron, webhooks, and human-in-the-loop patterns
In our experience, code execution represents the purest form of agency we have seen to date.
Related Reading
- Task-Specific AI Agents — Why focused, stateless workers beat general-purpose assistants