Monitor + debug runs
Routines fail quietly by default. They run at 3am, something goes wrong, nobody notices until Friday when the weekly report is missing. Monitoring is what turns an agent from a liability into an asset.
This lesson is the observability layer. Set it up once per Routine and you'll catch failures before they compound.
What "failed" actually looks like
Not every bad run throws an obvious error. The 5 failure modes:
- Hard failure
The run errors out. MCP auth rejected, API rate-limited, context file missing. The Routine status is failed. Easiest to catch.
- Silent success
The run completes. Status is success. Output is wrong or empty - wrong account, missing data, truncated report. Status lies.
- Drift
The run works for weeks, then slowly degrades. Account structure changes, a campaign is renamed, a column moves in a sheet. Output gets worse each run without failing.
- Cost blowout
The run succeeds but burns 10x the expected tokens. Something in the prompt or context is bloating the context window. Bill surprises you at month-end.
- Skipped runs
The Routine is paused, rate-limited, or the cron is wrong. It doesn't run at all. Nothing visible.
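Skipped runs are the hardest mode to see because there is no run to inspect. One way to surface them is to compare the start times you do have against the expected schedule. A minimal sketch, assuming you can export run start times as datetimes; the function name, the 30-minute tolerance, and the daily interval are illustrative assumptions, not part of any Routines API:

```python
from datetime import datetime, timedelta


def find_skipped_runs(run_times, expected_interval, tolerance=timedelta(minutes=30)):
    """Return (previous_run, next_run) pairs whose gap is longer than expected.

    run_times: datetimes of runs that actually started (any order).
    expected_interval: how far apart runs should be, e.g. timedelta(days=1).
    tolerance: slack allowed before a gap counts as a skip (assumed value).
    """
    ordered = sorted(run_times)
    gaps = []
    for previous_run, next_run in zip(ordered, ordered[1:]):
        if next_run - previous_run > expected_interval + tolerance:
            gaps.append((previous_run, next_run))
    return gaps


# Hypothetical history: a daily 3am Routine that did not run on Jan 4.
history = [datetime(2025, 1, day, 3, 0) for day in (1, 2, 3, 5, 6)]
for previous_run, next_run in find_skipped_runs(history, timedelta(days=1)):
    print(f"No run between {previous_run} and {next_run}")
```

This only finds gaps between runs that exist. If the most recent run is the one that was skipped, you would also need to compare the last start time against the current time.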
The 3-layer monitoring setup
Layer 1: Run history in the Routines dashboard
Tell Claude:
Show me the last 10 runs of the daily-gads-healthcheck-demo Routine. Include status, duration, and token cost.
You get a table. Look for patterns - a run that's suddenly 3x longer, a status that went from success to failure, a cost spike.
Check this weekly for every Routine you own. 60 seconds of reading catches 80% of problems.
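If you would rather not eyeball the table, the same patterns can be checked mechanically. A rough sketch, assuming you have copied the run history into a list of records yourself; the `Run` fields and the 3x threshold mirror the table columns and the "3x longer" example above, and are not an official export format:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    status: str        # "success" or "failed"
    duration_s: float  # run duration in seconds
    token_cost: int    # tokens consumed by the run


def flag_anomalies(runs, factor=3.0):
    """Return (index, reason) pairs for runs that stand out.

    A run is flagged if its status is not "success", or if its duration or
    token cost is at least `factor` times the median across all given runs.
    """
    flags = []
    if not runs:
        return flags
    median_duration = median(run.duration_s for run in runs)
    median_cost = median(run.token_cost for run in runs)
    for index, run in enumerate(runs):
        if run.status != "success":
            flags.append((index, f"status is {run.status}"))
        if median_duration > 0 and run.duration_s >= factor * median_duration:
            flags.append((index, "duration spike"))
        if median_cost > 0 and run.token_cost >= factor * median_cost:
            flags.append((index, "token cost spike"))
    return flags
```

The median is used as the baseline so that a single bad run does not drag the comparison value up with it.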
Layer 2: Output validation in the Routine itself
Bake validation into the prompt. Before marking itself complete, the agent checks its own output:
Before saving the report, verify:
- Row count for campaigns is between 5 and 50
- Total spend is within 20% of the 7-day average
- Every campaign has non-null CPA
- The report covers the correct date range
If any check fails, write a fail summary to the errors folder and do NOT save the main report.
This catches silent successes. The agent refuses to ship bad output.
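The four checks in that prompt are simple enough to write down as code, which is a useful way to confirm they say what you mean. A sketch only: the agent runs these checks from the prompt, not from this function, and the campaign dict shape (`name`, `cpa`) and parameter names are assumptions for illustration:

```python
from datetime import date


def validate_report(campaigns, total_spend, avg_spend_7d,
                    report_range, expected_range):
    """Return a list of failed checks; an empty list means the report may be saved.

    campaigns: list of dicts with at least "name" and "cpa" keys (assumed shape).
    report_range / expected_range: (start_date, end_date) tuples.
    """
    failures = []

    # Row count for campaigns is between 5 and 50.
    if not 5 <= len(campaigns) <= 50:
        failures.append(f"campaign row count {len(campaigns)} is outside 5-50")

    # Total spend is within 20% of the 7-day average.
    if avg_spend_7d <= 0:
        failures.append("7-day average spend is missing or zero")
    elif abs(total_spend - avg_spend_7d) / avg_spend_7d > 0.20:
        failures.append("total spend deviates more than 20% from 7-day average")

    # Every campaign has non-null CPA.
    missing_cpa = [c["name"] for c in campaigns if c.get("cpa") is None]
    if missing_cpa:
        failures.append("null CPA for: " + ", ".join(missing_cpa))

    # The report covers the correct date range.
    if report_range != expected_range:
        failures.append("report date range does not match the expected range")

    return failures
```

If the list is non-empty, it doubles as the fail summary to write to the errors folder.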
Layer 3: External notification on failure
For any Routine you rely on, add a "ping me if this breaks" step:
If the run fails for any reason, post a message to Slack channel [ID] with:
- Routine name
- Error message
- Link to the run log
- Last successful run timestamp
Now a hard failure puts a notification in front of you the same day, not 3 weeks later.
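For reference, this is roughly what that message looks like once assembled. A sketch that only builds the payload and does not send it; the `channel` and `text` keys follow the shape Slack's `chat.postMessage` method accepts, while the function name, channel ID, and log URL are hypothetical placeholders:

```python
def build_failure_alert(channel_id, routine_name, error_message,
                        run_log_url, last_success):
    """Assemble a Slack message payload containing the four fields listed above."""
    lines = [
        f"Routine failed: {routine_name}",
        f"Error: {error_message}",
        f"Run log: {run_log_url}",
        f"Last successful run: {last_success}",
    ]
    return {"channel": channel_id, "text": "\n".join(lines)}


# Hypothetical values for illustration only.
payload = build_failure_alert(
    channel_id="C0000000",
    routine_name="daily-gads-healthcheck-demo",
    error_message="MCP auth rejected",
    run_log_url="https://example.com/runs/123",
    last_success="2025-01-05 03:15",
)
print(payload["text"])
```

Including the last successful run timestamp tells you at a glance how much output you have missed.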
Debugging a failed run
When a run fails, Claude can pull the full log for you:
Show me the full log of the last run of daily-gads-healthcheck-demo. Highlight the point it failed.
You get the prompt, the context loaded, the tool calls made, and the error. Read it top to bottom.
The 5 most common causes
- Expired credential - OAuth token rotated, API key revoked, MCP disconnected. Fix: refresh the secret on the Routine.
- Missing context file - a file you attached was deleted or renamed. Fix: reattach.
- Rate limit - the Routine fires at the same moment as 10 other Routines. Fix: stagger the cron (3:15, 3:30, 3:45 instead of 3:00 on the dot).
- Prompt ambiguity - the prompt works on Mondays and breaks on Saturdays because of a weekday assumption. Fix: test the prompt across edge cases.
- Upstream change - the platform renamed a field, the sheet structure changed, a column moved. Fix: update the prompt to match the new structure.
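The stagger fix for rate limits is easy to generate instead of typing by hand. A small sketch that produces standard five-field cron expressions (minute, hour, day of month, month, day of week); the function name and the second and third Routine names are made up for illustration, and whether your scheduler takes this exact cron format is an assumption:

```python
def staggered_crons(routine_names, hour=3, start_minute=15, step_minutes=15):
    """Map each Routine name to a daily cron expression, spaced step_minutes apart."""
    schedule = {}
    for position, name in enumerate(routine_names):
        total_minutes = start_minute + position * step_minutes
        run_hour = (hour + total_minutes // 60) % 24  # roll over into the next hour
        run_minute = total_minutes % 60
        schedule[name] = f"{run_minute} {run_hour} * * *"
    return schedule


# Hypothetical Routine names, apart from the demo Routine used in this lesson.
names = ["daily-gads-healthcheck-demo", "routine-b", "routine-c"]
for name, cron in staggered_crons(names).items():
    print(f"{name}: {cron}")
```

With the defaults this yields 3:15, 3:30, and 3:45, matching the example in the list above.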
When to pause vs fix vs delete
Pause when
The Routine is fine but the target account is down, the template is being redesigned, or you're on leave. Pausing keeps everything ready to resume.
Fix when
The Routine works 9 out of 10 runs. One intermittent failure doesn't warrant a rebuild: identify the cause, patch the prompt, move on.
Delete when
You haven't read the output in 2 months. The Routine has outlived its purpose. Cheaper to rebuild later than to leave dead agents running.
The weekly Routine review
Add a 10-minute block to your calendar every Friday:
- List all active Routines
- For each: last run status, last successful output, token cost trend
- Pause anything that's failed 3+ times in a row
- Delete anything whose output you haven't read in 4+ weeks
- Note any cost trends worth investigating
Tell Claude to generate this review for you and paste it into the review block.
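The pause and delete rules in that checklist are mechanical, so they can be written as a small decision function. A sketch under assumptions: the `RoutineSummary` fields are hypothetical, and checking "delete" before "pause" is a choice made here, not something the checklist specifies:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RoutineSummary:
    name: str
    recent_statuses: list   # oldest to newest, e.g. ["success", "failed"]
    output_last_read: date  # when you last actually read the output


def consecutive_failures(statuses):
    """Count failed runs at the end of the list (the current failure streak)."""
    count = 0
    for status in reversed(statuses):
        if status != "failed":
            break
        count += 1
    return count


def review_action(routine, today):
    """Return "delete", "pause", or "keep" using the weekly review rules."""
    if today - routine.output_last_read >= timedelta(weeks=4):
        return "delete"  # output unread for 4+ weeks
    if consecutive_failures(routine.recent_statuses) >= 3:
        return "pause"   # failed 3+ times in a row
    return "keep"
```

Delete is checked first because a Routine nobody reads is not worth pausing and fixing.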
Power-user tips
- Log every run to Supabase - routine_id, started_at, ended_at, status, token_cost, output_hash. Gives you forever history beyond what the dashboard keeps.
- Use a single "routine-health" Slack channel - every failure across every Routine posts there. One place to check, not 15.
- Version control the prompts - keep Routine prompts in GitHub as .md files. When you update, diff the change before pushing live.
- Add a "why this matters" line to each Routine - future-you will thank you when you can't remember why this Routine exists.
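To make the first tip concrete, here is what a run log with those six columns could look like. This sketch uses Python's built-in `sqlite3` as a local stand-in so it runs anywhere; it is not Supabase client code, and the table name `routine_runs` and the use of SHA-256 for `output_hash` are assumptions:

```python
import hashlib
import sqlite3


def output_hash(output_text):
    """Hash the run output so you can tell when two runs produced identical text."""
    return hashlib.sha256(output_text.encode("utf-8")).hexdigest()


def create_run_log(conn):
    """Create the run log table with the columns named in the tip above."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS routine_runs (
               routine_id  TEXT NOT NULL,
               started_at  TEXT NOT NULL,
               ended_at    TEXT NOT NULL,
               status      TEXT NOT NULL,
               token_cost  INTEGER NOT NULL,
               output_hash TEXT NOT NULL
           )"""
    )


def log_run(conn, routine_id, started_at, ended_at, status, token_cost, output_text):
    """Append one run to the log."""
    conn.execute(
        "INSERT INTO routine_runs VALUES (?, ?, ?, ?, ?, ?)",
        (routine_id, started_at, ended_at, status, token_cost,
         output_hash(output_text)),
    )
    conn.commit()


# In-memory database for illustration; the values are made up.
conn = sqlite3.connect(":memory:")
create_run_log(conn)
log_run(conn, "daily-gads-healthcheck-demo", "2025-01-06T03:15:00",
        "2025-01-06T03:17:30", "success", 4200, "report body")
```

Storing a hash instead of the full output keeps the table small while still letting you spot a Routine that has produced the same report several days running.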
Action items
☐ Add output validation to your first Routine's prompt
☐ Add a Slack or email notification on failure
☐ Block 10 minutes every Friday for the weekly Routine review
☐ Check the last 10 runs of your first Routine - confirm success rate and token cost
Next lesson: Routines + Skills + MCPs.
Exercises
- Review the concepts covered in this lesson: Monitor + debug runs.
- Write down your key takeaway from this lesson.
- Practice running any commands or prompts mentioned above inside your terminal.