Procedures: Auditable Execution Pipelines That Eliminate Agent Hallucination
We run a social media agent on MCPWorks that posts platform updates to Bluesky. During testing, we told the agent: "Post this announcement to Bluesky." The agent responded with a Bluesky post URI, confirmed the task was complete, and moved on to the next item.
The post did not exist. The agent hallucinated the entire execution — it generated a plausible-looking at:// URI, reported success, and never called the posting function. From the AI's perspective, it had done the work. From every other perspective, nothing happened.
This is not a rare edge case. When an AI agent has access to tools but is not structurally required to use them, it will sometimes generate a text-only response that mimics a successful tool call. The completion looks correct. The function never ran.
The Problem: Text Is Not Execution
Standard agent orchestration works like this: give the AI a goal, give it tools, let it decide what to call and in what order. This is flexible. It is also unverifiable at the orchestration level. If the AI says "I called post-to-bluesky and got back URI at://did:plc:abc/app.bsky.feed.post/xyz" — how do you know it actually did?
You can check logs after the fact. You can build monitoring. But the orchestrator itself has no enforcement mechanism. It accepts whatever the AI returns, text or tool call alike.
For tasks where correctness matters — financial transactions, social media posts, API integrations, anything with external side effects — this gap is a liability.
Procedures
A procedure defines an ordered sequence of steps. Each step names a specific function that must be called. The orchestrator enforces the sequence: it only advances to the next step when the actual function backend returns a result. Text-only responses are rejected. Calling the wrong function is rejected.
Here is the Bluesky posting procedure:
make_procedure(
    service="social",
    name="post-to-bluesky",
    steps=[
        {
            "name": "authenticate",
            "function_ref": "social.bluesky-auth",
            "instructions": "Authenticate with Bluesky",
            "failure_policy": "required"
        },
        {
            "name": "create-post",
            "function_ref": "social.post-to-bluesky",
            "instructions": "Post the message text",
            "failure_policy": "required"
        },
        {
            "name": "verify",
            "function_ref": "social.get-post",
            "instructions": "Verify the post exists using the URI from step 2",
            "failure_policy": "allowed"
        }
    ]
)
Three steps, each bound to a real function. The orchestrator walks through them in order. Step 1 authenticates. Step 2 posts. Step 3 verifies the post exists. The AI cannot skip step 2 and claim it worked. It cannot hallucinate a URI because the orchestrator only advances when social.post-to-bluesky actually returns one.
Step Anatomy
Each step in a procedure has:
- name — a human-readable identifier for logging and audit trails
- function_ref — the function that must be called, in service.function format
- instructions — natural language guidance for the AI on what to do with this step's function
- failure_policy — what happens when the step fails
- max_retries — how many times to retry on failure (default: 0)
- validation — optional rules for checking the function's output before advancing
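Putting the full field set together, a single step might look like the sketch below. The validation rule shape shown is an assumption for illustration, not the documented schema:

```python
# A step using every available field. The "validation" rule shape here is
# an illustrative assumption; consult the platform docs for the exact schema.
step = {
    "name": "create-post",
    "function_ref": "social.post-to-bluesky",  # enforced: only this call advances
    "instructions": "Post the message text",
    "failure_policy": "required",              # halt the procedure on failure
    "max_retries": 2,                          # retry twice before applying the policy
    "validation": {"field": "uri", "matches": "^at://"},  # assumed rule shape
}
```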
The function_ref is the enforcement mechanism. The orchestrator matches the AI's tool call against the expected function. If the AI calls social.delete-post when the step expects social.post-to-bluesky, the call is rejected and the AI is told to call the correct function.
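The enforcement check can be sketched in a few lines. This is a simplified illustration of the idea, not MCPWorks internals:

```python
# Minimal sketch of function_ref enforcement (simplified, not MCPWorks code).
# The orchestrator only advances when the AI's tool call names the expected
# function; text-only replies and wrong functions are both rejected.
def check_step(expected_ref: str, ai_response: dict) -> str:
    if "tool_call" not in ai_response:
        return "rejected: text-only response, no function was called"
    if ai_response["tool_call"] != expected_ref:
        return f"rejected: expected {expected_ref}, got {ai_response['tool_call']}"
    return "advance"

check_step("social.post-to-bluesky", {"text": "Done! Posted it."})        # rejected
check_step("social.post-to-bluesky", {"tool_call": "social.delete-post"})  # rejected
check_step("social.post-to-bluesky", {"tool_call": "social.post-to-bluesky"})  # "advance"
```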
Failure Policies
Three policies control what happens when a step fails or exhausts its retries:
required — the procedure halts. No subsequent steps execute. This is the right default for steps where failure means the rest of the pipeline is meaningless. If authentication fails, there is no point attempting to post.
allowed — the procedure continues with a data gap. The step's result is recorded as failed, subsequent steps receive that context, and the pipeline completes. The verification step in the Bluesky example uses this — if we cannot verify the post, the post itself still happened. The audit trail shows the gap.
skip — the step is not retried. Failure is recorded and the procedure moves on immediately. Useful for optional enrichment steps where even a single retry is not worth the latency.
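In control-flow terms, the three policies could dispatch like this (a simplified sketch of the semantics described above):

```python
# Sketch of how the three failure policies drive control flow (simplified).
def on_step_failure(policy: str) -> str:
    if policy == "required":
        return "halt"               # stop the procedure; later steps never run
    if policy == "allowed":
        return "continue"           # record the data gap, pass context onward
    if policy == "skip":
        return "continue-no-retry"  # no retries at all, move on immediately
    raise ValueError(f"unknown policy: {policy}")
```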
Data Forwarding
Each step receives the accumulated context from all prior steps. When the verification step runs, it has access to the authentication result from step 1 and the post URI from step 2. The AI uses this context — plus the step's instructions — to construct the correct function call.
This is sequential by design. Step 3 cannot run before step 2 because it needs step 2's output. The orchestrator enforces this ordering, and the accumulated context makes each step aware of everything that happened before it.
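The accumulation can be sketched as a fold over the steps, with a fake backend standing in for the real functions (illustrative only; the actual result shapes are assumptions):

```python
# Sketch of sequential context accumulation (simplified, not MCPWorks internals).
def run_steps(steps, call_fn):
    context = {}
    for step in steps:
        # Each step sees everything earlier steps produced.
        context[step["name"]] = call_fn(step["function_ref"], context)
    return context

def fake_backend(ref, context):
    # Stand-in results; real output shapes will differ.
    if ref == "social.bluesky-auth":
        return {"session": "token-123"}
    if ref == "social.post-to-bluesky":
        return {"uri": "at://did:plc:abc/app.bsky.feed.post/xyz"}
    if ref == "social.get-post":
        # Verification reads the URI produced by the previous step.
        return {"exists": True, "uri": context["create-post"]["uri"]}

steps = [
    {"name": "authenticate", "function_ref": "social.bluesky-auth"},
    {"name": "create-post", "function_ref": "social.post-to-bluesky"},
    {"name": "verify", "function_ref": "social.get-post"},
]
ctx = run_steps(steps, fake_backend)
```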
Audit Trail
Every procedure execution produces a complete record:
- Procedure name and version
- Each step's result: success, failure, or skipped
- The actual function output from each step
- Timestamps for step start and completion
- Retry counts per step
- The accumulated context at each stage
This is not log aggregation. It is structured execution history, queryable and exportable. When someone asks "did the Bluesky post go out on Tuesday?" the answer is in the procedure execution record — along with the exact function outputs, the authentication response, the post URI, and whether verification passed.
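As an illustration, a record with the fields listed above might look like this, and the "did it go out?" question becomes a lookup rather than a log grep. The exact field names are assumptions; get_procedure_execution returns the real schema:

```python
# Illustrative shape of an execution record (field names are assumptions;
# see get_procedure_execution for the real schema).
execution = {
    "procedure": "social.post-to-bluesky",
    "version": 3,
    "steps": [
        {"name": "authenticate", "status": "success", "retries": 0},
        {"name": "create-post", "status": "success", "retries": 0,
         "output": {"uri": "at://did:plc:abc/app.bsky.feed.post/xyz"}},
        {"name": "verify", "status": "failed", "retries": 0},
    ],
}

# "Did the Bluesky post go out?" -- answered from structured history.
posted = any(s["name"] == "create-post" and s["status"] == "success"
             for s in execution["steps"])
```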
Immutable Versioning
When you update a procedure, the platform creates a new version. The old version is preserved. Existing scheduled executions continue running the version they were created with. New executions use the latest version.
This matters for audit. If you change the posting procedure to add a fourth step (say, cross-posting to Mastodon), the audit trail for last week's executions still references the three-step version that was active at the time. Versions are never mutated or deleted.
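The pinning behavior can be modeled with an append-only store (a toy sketch of the semantics, not the platform's storage layer):

```python
# Toy model of immutable versioning (not the platform's storage layer).
class ProcedureStore:
    def __init__(self):
        self.versions = []            # append-only; never mutated or deleted

    def publish(self, steps):
        self.versions.append(list(steps))
        return len(self.versions)     # 1-based version number

    def get(self, version):
        return self.versions[version - 1]

store = ProcedureStore()
v1 = store.publish(["authenticate", "create-post", "verify"])
pinned = v1  # existing schedules keep the version they were created with
v2 = store.publish(["authenticate", "create-post", "verify", "cross-post-mastodon"])

# Last week's runs still resolve the three-step definition;
# new executions pick up the four-step latest version.
old_steps = store.get(pinned)
new_steps = store.get(v2)
```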
MCP Tools
Eight tools manage the full procedure lifecycle:
- make_procedure — create a new procedure with ordered steps
- update_procedure — modify steps, creating a new version
- delete_procedure — remove a procedure
- describe_procedure — view procedure definition and version history
- list_procedures — list procedures in a service
- run_procedure — execute a procedure with input parameters
- get_procedure_execution — retrieve execution audit trail
- list_procedure_executions — list execution history with filtering
Trigger Integration
Procedures work with existing trigger infrastructure. Schedules and webhooks can target a procedure instead of a single function:
add_schedule(
    agent_name="social-agent",
    procedure_name="social.post-to-bluesky",
    cron_expression="0 14 * * 1-5",
    parameters={"message": "Daily platform update"}
)
The schedule fires at 2pm on weekdays, and the orchestrator walks through all three steps. If step 1 fails, it halts. The execution record shows exactly where and why.
Security: Why Agents Cannot Author Procedures
Procedure management tools — make_procedure, update_procedure, delete_procedure — are not exposed to the agent AI, consistent with the security hardening already in place for function authoring. An agent can run a procedure via run_procedure, but it cannot create or modify one.
This is the same principle as function locking: the entity that defines what code runs should not be the same entity that processes untrusted external data. A compromised or hallucinating AI should never be able to rewrite its own execution pipeline.
Procedures are authored by namespace owners through the management endpoint. The agent executes them as defined.
When to Use Procedures
Multi-step integrations with external side effects. Posting to social media, sending emails, making payments, updating CRM records. Any sequence where a hallucinated "success" has real consequences.
Compliance-sensitive workflows. Financial reporting, regulatory submissions, audit-required processes. The immutable execution record provides the trail.
Complex orchestration with dependencies. When step N depends on step N-1's output and the ordering cannot be left to AI discretion.
If your agent runs a single function with no dependencies and hallucination risk is low, you do not need a procedure. A procedure adds structure where structure prevents failure.
Try It
If you are self-hosting:
git pull origin main
docker compose -f docker-compose.self-hosted.yml build api
docker compose -f docker-compose.self-hosted.yml up -d api
If you are on MCPWorks Cloud, Procedures are live now.
GitHub: MCPWorks-Technologies-Inc/mcpworks-api | MCPWorks Cloud | Bluesky: @mcpworks.io
MCPWorks is open source.
Self-host free forever, or try MCPWorks Cloud — 14-day Pro trial, no credit card.