Why AI Agents Need Code Execution (Not Just Bigger Context Windows)
Understanding the Four Concepts
Skills
Skills are packaged execution units. They can be:
- Instructional markdown (prompt engineering)
- Deterministic code (Python/JS functions)
- Composable workflows (skills calling other skills)
The key property of skills is that they're persistent and reusable. Skills are versioned, immutable artifacts that can be shared and composed.
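As a concrete sketch, a skill could be modeled as an immutable record pairing a versioned name with a deterministic function. The `Skill` type and `total_revenue` function here are purely illustrative, not a real platform API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Skill:
    """A versioned, immutable execution unit (hypothetical shape)."""
    name: str
    version: str
    run: Callable[..., Any]

def total_revenue(charges: list[dict]) -> float:
    """Deterministic code: sum charge amounts (cents) and return dollars."""
    return sum(c["amount"] for c in charges) / 100

# Frozen dataclass = immutable artifact; the version string makes it shareable.
revenue_skill = Skill(name="total_revenue", version="1.0.0", run=total_revenue)
```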
RLM (Recursive Language Models) Pattern
RLM is an inference-time technique for context decomposition, not a special model type.
Instead of:

```
LLM(10M token context) → answer
```

you do:

```python
def solve(problem):
    if simple(problem):
        return LLM(problem)
    else:
        sub_problems = decompose(problem)
        results = [solve(p) for p in sub_problems]
        return LLM(combine(results))
```
The pattern is recursive—the LLM breaks down problems and calls itself on smaller chunks.
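A runnable toy version of the pattern, with a stub `LLM` that sums numbers in a string standing in for a real model call (every function here is an illustrative placeholder, not a real API):

```python
def LLM(prompt: str) -> str:
    """Stub for a model call: 'answers' by summing the numbers in the prompt."""
    return str(sum(int(tok) for tok in prompt.split()))

def simple(problem: str) -> bool:
    return len(problem.split()) <= 4          # small enough to answer directly

def decompose(problem: str) -> list[str]:
    words = problem.split()
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]

def combine(results: list[str]) -> str:
    return " ".join(results)

def solve(problem: str) -> str:
    if simple(problem):
        return LLM(problem)
    sub_problems = decompose(problem)
    results = [solve(p) for p in sub_problems]  # recurse on smaller chunks
    return LLM(combine(results))
```

Calling `solve("1 2 3 4 5 6 7 8")` splits the input in half, sums each half, then sums the partial answers, mirroring how an RLM decomposes a context too large to answer in one shot.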
Workflows
Workflows are collections of skills with data flow: Skill A → Skill B → Skill C, where outputs become inputs.
For example, a reconciliation bot with access to Stripe, QuickBooks, and Shopify:
- Fetch Stripe payments for the month (Skill A)
- Pull QuickBooks invoices and Shopify orders (Skill B)
- Match transactions across all three platforms (Skill C)
- Flag discrepancies and generate reconciliation report (Skill D)
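The data flow above can be sketched as plain function composition, where each skill's output becomes the next skill's input. The skill bodies here are hypothetical stubs; real versions would call the Stripe, QuickBooks, and Shopify APIs:

```python
# Hypothetical skill stubs: real versions would hit the live APIs.
def fetch_stripe_payments(month: str) -> list[dict]:        # Skill A
    return [{"id": "pay_1", "amount": 5000}]

def fetch_invoices_and_orders(month: str) -> dict:          # Skill B
    return {"invoices": [{"id": "inv_1", "amount": 5000}], "orders": []}

def match_transactions(payments: list[dict], books: dict) -> list[dict]:  # Skill C
    matched = []
    for p in payments:
        hit = any(i["amount"] == p["amount"] for i in books["invoices"])
        matched.append({**p, "matched": hit})
    return matched

def report_discrepancies(matches: list[dict]) -> list[dict]:  # Skill D
    return [m for m in matches if not m["matched"]]

# Output of each skill feeds the next: A → B → C → D.
payments = fetch_stripe_payments("2024-05")
books = fetch_invoices_and_orders("2024-05")
flags = report_discrepancies(match_transactions(payments, books))
```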
If an agent creates a skill but the platform doesn't guide it to specify:
- What inputs the skill expects
- What outputs it produces
- What types those are
...then the vision of composable workflows isn't achievable.
The infrastructure needs to enforce this. At APIFunnel, when you compile a skill, you specify inputs and outputs explicitly. This makes skills truly composable and enables the agent to chain them together intelligently.
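A minimal sketch of what explicitly declared inputs and outputs might look like, using `TypedDict` contracts plus a spec record; the shapes and the `SKILL_SPEC` format are assumptions for illustration, not APIFunnel's actual compilation format:

```python
from typing import TypedDict

class FlagInput(TypedDict):
    matches: list[dict]       # expected input: output of a matching skill

class FlagOutput(TypedDict):
    discrepancies: list[dict]  # produced output: unmatched transactions

def flag_discrepancies(inp: FlagInput) -> FlagOutput:
    return {"discrepancies": [m for m in inp["matches"] if not m["matched"]]}

# Explicit I/O declaration lets another skill (or the agent) wire this
# skill into a chain without guessing its signature.
SKILL_SPEC = {
    "name": "flag_discrepancies",
    "inputs": {"matches": "list[dict]"},
    "outputs": {"discrepancies": "list[dict]"},
}
```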
Code Execution
Code execution is the runtime environment for arbitrary logic. It's the substrate beneath everything else.
Sandboxed Python/JavaScript containers with:
- API access (hundreds of pre-wired integrations)
- Persistent state (session-based file storage)
- Scheduling (cron, webhooks, callbacks)
The key property is unrestricted execution—you write the logic, it runs.
How They Relate
RLM, skills, and workflows are different abstractions over code execution.
Code execution is the primitive. Everything else is a pattern built on top of it.
The fine print, unfortunately, is that code execution cannot reach its full potential without the right infrastructure: infrastructure built with a code-execution-first mindset, not bolted on as an afterthought. CE frameworks must be architected to give the agent guidance while staying out of its way by design, rather than boxing it into the agent's current abilities. That principle should be embedded in any CE-as-a-service platform you choose.
Progressive Discovery: The Infrastructure Pattern That Scales
Traditional MCP (Model Context Protocol) is essentially serverless for agents:
- Tools exposed directly to the LLM
- Stateless inline calls
- Minimal infrastructure
It's simple and clean, but it breaks down once you move beyond basic use cases.
The Three-Step Progressive Discovery Flow
Every agent interaction follows this pattern:
```python
servers = await list_api_servers()
# Returns: service IDs, descriptions, tool counts
```
Called once per task (typically one conversation turn). This gives the agent a map of available integrations.
```python
tools = await tools_search(
    server_name="stripe_api",
    query="list charges",
)
# Returns: tool names like stripe_api.list_charges
```
Semantic search across tool descriptions. The agent narrows down to relevant capabilities.
```python
info = await get_tool_info(node_id="tool:stripe_api.list_charges")
# Returns: exact parameter names, types, required vs optional
```
This is the critical step—you get the exact schema. No guessing parameter names.
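The three steps chain together into a single discovery routine. This sketch uses stub async functions standing in for the platform's real discovery calls, with hard-coded return values so it runs as-is:

```python
import asyncio

# Stubs standing in for the platform's real discovery calls.
async def list_api_servers():
    return [{"id": "stripe_api", "description": "Payments", "tool_count": 42}]

async def tools_search(server_name, query):
    return ["stripe_api.list_charges"]

async def get_tool_info(node_id):
    return {"params": {"limit": "int (optional)", "created": "dict (optional)"}}

async def discover(query: str) -> dict:
    servers = await list_api_servers()                        # step 1: map integrations
    tools = await tools_search(server_name=servers[0]["id"],  # step 2: narrow down
                               query=query)
    return await get_tool_info(node_id=f"tool:{tools[0]}")    # step 3: exact schema

schema = asyncio.run(discover("list charges"))
```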
Why Progressive Discovery Matters
Any deviation from the three-step flow signals that context engineering is required.
If you see:
- Agent calls `list_api_servers()` multiple times in one task → context gap
- Agent guesses tool parameters without calling `get_tool_info()` → missing schema context
- Agent retries code execution 5+ times → discovery context needs work
Coding errors are expected—retries for syntax bugs are normal.
But missing context that forces signature guessing? That's an infrastructure problem, not an agent problem.
Traditional MCP vs Code Execution with Progressive Discovery
| Feature | Traditional MCP | Code Execution + Discovery |
|---|---|---|
| Tool surface | Fixed (compile time) | Dynamic (runtime) |
| Schema discovery | Pre-loaded in context | Progressive (3-step flow) |
| State | Stateless | Session-persistent |
| Context management | All tools loaded upfront | On-demand discovery |
| Recursive workflows | Not supported | Native |
| Error recovery | Manual retry | Automatic with context |
Traditional MCP loads all tool schemas into context upfront. With 5-10 servers, this works fine. With 50+ APIs, you hit context limits. Progressive discovery solves this by fetching schemas on-demand.
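A back-of-envelope calculation makes the scaling problem concrete. The per-server and per-schema numbers here are assumptions for illustration, not measured figures:

```python
# Assumed: 20 tools per server, ~500 tokens per tool schema.
TOOLS_PER_SERVER = 20
TOKENS_PER_SCHEMA = 500

def upfront_tokens(n_servers: int) -> int:
    """Context cost of pre-loading every tool schema, MCP-style."""
    return n_servers * TOOLS_PER_SERVER * TOKENS_PER_SCHEMA

small = upfront_tokens(10)   # 100_000 tokens: tight but workable
large = upfront_tokens(50)   # 500_000 tokens: past most context windows
```

Under these assumptions, the jump from 10 to 50 servers is the difference between fitting in context and not; progressive discovery keeps only the handful of schemas a task actually needs.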
Code Execution Infrastructure: Six Key Components
Reliable code execution requires:
- Discovery Layer — `list_api_servers()`, `tools_search()`, `get_tool_info()` for schema inspection
- Sandboxed Execution — Isolated containers with resource limits and network policies
- Session Persistence — State survives across executions (`/tmp/artifacts/`, session IDs)
- Seamless Authentication — OAuth flows, token refresh, and credential injection
- Skill Composition — Explicit input/output wiring enables skills to call other skills
- Scheduling & Durability — Cron, webhooks, and human-in-the-loop patterns
In our experience, code execution represents the purest form of agency we have seen to date.
Related Reading
- Task-Specific AI Agents — Why focused, stateless workers beat general-purpose assistants