What Running 64 AI Agent Sessions Taught Me About Memory

I run 5 AI agents on a single EC2 box. Junior orchestrates. Jesse trades. Xavier researches. Builder handles infra. Food tracks meals. All Claude-powered, all talking through Telegram, all managed by OpenClaw.

Last week I ran a full audit. 64 active sessions. 34 cron jobs. 2,654 memory chunks indexed. 317MB of vector embeddings built from less than 3MB of actual notes. 179 orphaned transcript files eating 63MB of disk for nothing.

The system works. It also taught me things about agent memory and sessions that nobody's talking about.


Sessions Fracture Into Types — Each One Breaks Differently

When most people think "session," they think "one conversation." In a real multi-agent system, sessions fracture into types with their own failure modes.

I found four:

Interactive sessions fire per agent per Telegram topic — topic 78 spawns 5 sessions just from the bots seeing a mention.

Heartbeat sessions are periodic pings that silently stack tokens in the main context.

Cron sessions mint fresh per run and auto-prune — the clean ones.

Ghost sessions are leftovers from resets and slash commands, sitting at 0 tokens, never cleaned up.

The lesson: if you haven't audited your session topology, you have bloat you don't know about.
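A quick way to run that audit is to bucket every live session by how it was spawned and count. A minimal sketch, assuming a hypothetical session list with `cron`, `heartbeat`, and `tokens` fields — OpenClaw's real session records will look different:

```python
from collections import Counter

def session_topology(sessions: list[dict]) -> Counter:
    """Group sessions by type so bloat shows up as a number, not a surprise."""
    def kind(s: dict) -> str:
        if s.get("cron"):
            return "cron"          # fresh per run, auto-pruned
        if s.get("heartbeat"):
            return "heartbeat"     # periodic pings
        if s.get("tokens", 0) == 0:
            return "ghost"         # leftover from a reset or slash command
        return "interactive"       # spawned by a real conversation
    return Counter(kind(s) for s in sessions)
```

Ghost sessions fall out for free: anything sitting at 0 tokens that isn't a cron or heartbeat session is a leftover.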


Heartbeats Will Quietly Destroy Your Context

This one caught me off guard. Every agent runs a periodic heartbeat — Jesse (the trader) every 15 minutes, Xavier every 30, the rest hourly. With heartbeat.target set to "none", each ping appended to the main session. Jesse pings 96 times a day. After a few days, its main session hit 125k tokens — 63% full — and almost none of it was actual work.

380k tokens across 5 agents, just heartbeat accumulation. One config change (heartbeat.target: "heartbeat") isolates them into disposable sessions. Five minutes of editing freed all of it.
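For reference, the shape of the change — a sketch, assuming JSON-style agent config; only the heartbeat.target key comes from my notes, the surrounding structure is illustrative:

```json
{
  "heartbeat": {
    "target": "heartbeat"
  }
}
```

With "none", pings land in the main session and accumulate; with "heartbeat", each ping runs in an isolated session that can be thrown away.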

If you're running long-lived sessions with any periodic task and haven't isolated them, you have this problem. You just haven't noticed because the growth is slow and compaction masks it.
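The math on why it creeps rather than crashes is simple. A sketch, with tokens_per_ping as an assumed figure — your heartbeat payload will vary:

```python
import math

def days_until_full(context_limit: int, pings_per_day: int, tokens_per_ping: int) -> int:
    """How long heartbeat pings alone take to fill a context window."""
    return math.ceil(context_limit / (pings_per_day * tokens_per_ping))

# e.g. days_until_full(200_000, 96, 300) -> 7
```

At 96 pings a day and a few hundred tokens per ping, a 200k window fills in about a week from heartbeats alone — slow enough to blame on normal use, fast enough to matter.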


The Session Boundary Is Where Everything Leaks

The biggest information loss wasn't during tasks — it was between them. Agent finishes a scan, session ends, next session starts blank. Everything it learned: gone.

The fix was three layers of capture. flush_session() runs at task completion — scans working notes for action words, extracts structured facts, pushes them to memory automatically. Pre-compaction flush triggers at ~80% capacity — the system silently writes durable notes before compressing context. Nightly cron does a final sweep — extracts anything the first two missed, deduplicates, reindexes everything fresh.
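A minimal sketch of the first layer, assuming flush_session scans plain-text working notes for trigger words — the word list and output format here are illustrative, not OpenClaw's:

```python
ACTION_WORDS = ("decided", "learned", "fixed", "found", "blocked")  # illustrative triggers

def flush_session(notes: str) -> list[str]:
    """Extract lines worth persisting to memory before the session ends."""
    facts = []
    for line in notes.splitlines():
        # keep any line containing an action word; filter duplicates later
        if any(word in line.lower() for word in ACTION_WORDS):
            facts.append(line.strip())
    return facts
```

The deliberately loose match is the point: over-capture here, let the nightly sweep deduplicate.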

Over-capture now, filter later. That inversion changed everything.


You Need Two Types of Memory

One layer is never enough.

Explicit memory stores what agents actively choose to remember — structured facts, tagged, queryable, high precision. Passive memory indexes everything agents ever wrote — every .md file auto-chunked and embedded by a background daemon. 2,654 chunks and growing. Agents don't know it's happening.

Explicit gives you "what did we decide." Passive gives you "what did we write." Neither alone is the full picture. Together they are.
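A lookup that uses both layers can be sketched with toy in-memory stores — the real stack is a tagged fact store plus a vector index; naive substring match stands in for embedding search here:

```python
def recall(query: str, explicit: dict[str, list[str]], passive: list[str], k: int = 5) -> list[str]:
    """Explicit facts first (high precision), then passive chunks (high recall)."""
    hits = list(explicit.get(query, []))
    for chunk in passive:
        # stand-in for vector similarity: naive substring match
        if query.lower() in chunk.lower() and chunk not in hits:
            hits.append(chunk)
        if len(hits) >= len(explicit.get(query, [])) + k:
            break
    return hits
```

Ordering matters: decisions the agents chose to record outrank whatever the background indexer happened to chunk.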

My audit also flagged that I'm running three overlapping memory systems — native OpenClaw search (hybrid BM25 + vector), MCP Memory Service, and memsearch via Milvus. ~400MB of RAM for what might be redundant coverage. Consolidation is on the list.


Broken Crons Are Silent Failures

The audit caught something I'd missed for weeks: three Xavier handoff crons had timeoutSeconds: 0. Fired three times daily, timed out instantly, completed zero work. Every brief that was supposed to include Xavier intelligence was running on incomplete data. No errors. No alerts. They just finished immediately with nothing.

"Did it run?" is the wrong question. "Did it produce anything?" is the right one.


The Fix Priority

After mapping everything, the order was obvious:

1. Fix the three broken Xavier crons (2 minutes, immediate brief-quality impact).
2. Isolate heartbeats (config patch, stops 380k tokens of drift).
3. Add an idle session reset at 12 hours.
4. Disable dormant Food crons.
5. Clean up orphaned transcripts.

Total time for all five: about 15 minutes. No downtime. All reversible. The kind of maintenance that should be routine but isn't because the system doesn't tell you it's degrading.


The Takeaway

Memory and sessions are the infrastructure that makes everything else work. But "having memory" isn't the same as "having memory that works well."

The real lessons were all operational: sessions fracture into types you didn't plan for, heartbeats bloat silently, crons fail without telling you, transcripts orphan themselves, and three memory systems might be doing the job of one.

The agents don't maintain themselves. You have to audit the system that maintains them. Build observability before you need it. The problems that kill your system aren't the ones that crash — they're the ones that degrade quietly for weeks until you finally look.