Q-01 Concurrency Assessment — Real Caps for Parallel Execution
Status: ACTIVE (analysis) Agent: opencode/ext-agent (sandshrew) Timestamp UTC: 2026-05-12T03:00:00Z Session: What concurrent load can the Pi 4 actually handle? Where's the real cap?
Target Concurrency Profile
| Scenario | What's Running | Count | Hermes Instances |
|---|---|---|---|
| Typical | 3 units at 3 different nodes. Poller active on one unit's output. | 4 concurrent | 4 |
| Peak | 3 units. Poller on unit A. Curation agent routing context from unit B → unit C. | 5 concurrent | 5 |
| Light | 1 unit working, nothing else. | 1 | 1 |
The cap question: can the Pi 4 run 5 concurrent Hermes calls?
Memory Budget (Worst Case: 5 Concurrent)
| Component | Est. RAM | Notes |
|---|---|---|
| OS + system services | ~500MB | Current Pi baseline (678MB used, includes Docker) |
| Bun runtime | ~50MB | Lightweight, single process |
| LangGraph graph (compiled) | ~10MB | 36 nodes, ~180 edges. Minimal memory. |
| HTTP server (Hono/Bun.serve) | ~10MB | Trivial |
| Hermes base (shared runtime) | ~50MB | Shared across all instances |
| Hermes instance × 5 | 5 × 150MB = 750MB | Qwen 3.6+ context. 4K-8K tokens per call. |
| Total estimated | ~1.4GB (~870MB above the OS baseline) | |
| Pi 4 available | 3.0GB (3.7GB - 678MB used) | |
| Headroom | ~2.1GB | Plenty |
Memory is not the bottleneck. The stack above the OS baseline is ~870MB, 750MB of it the 5 Hermes instances, leaving ~2.1GB of headroom on the 3.0GB available.
What Actually Caps Concurrency
Not the Pi 4
- Bun handles 5 concurrent invocations trivially (uWebSockets-based HTTP server, designed for high concurrency)
- LangGraph uses `thread_id` to isolate state per invocation — 5 concurrent `invoke()` calls with different thread_ids is standard usage
- SqliteSaver handles concurrent reads cleanly. Concurrent writes may serialize at the DB level, but for turn-based gameplay (a few writes per second) this is negligible.
The Real Bottleneck: Hermes Portal Qwen 3.6+ Rate Limits
| What | Likely Cap | Impact |
|---|---|---|
| Hermes Portal OAuth | Unknown — depends on plan tier | Need to verify. If rate-limited per OAuth token, concurrent calls may queue or fail. |
| Qwen 3.6+ API | May have per-account rate limit | Self-hosted or portal-proxied? Need to confirm architecture. |
| Network latency | 10-50ms per call to Qwen endpoint | Negligible for turn-based gameplay |
| Token throughput | Each Hermes call may be 1-5 seconds | The player waits for agent responses, not for concurrency |
The cap is almost certainly Hermes Portal's rate limit, not the Pi 4's hardware. Until we know the OAuth plan's concurrency limit, we can't say definitively that 5 concurrent calls work. But the Pi can handle it.
LangGraph Concurrency: How It Works
Each unit gets its own thread_id. Concurrent invocations are independent:
```ts
// Rif moves to hex 17, Echo moves to hex 23, Sherpa moves to hex 09.
// `graph` is the compiled StateGraph (see the checkpointer sketch below).
// All three can run in parallel:
const configRif = { configurable: { thread_id: "unit-rif" } };
const configEcho = { configurable: { thread_id: "unit-echo" } };
const configSherpa = { configurable: { thread_id: "unit-sherpa" } };

await Promise.all([
  graph.invoke({ unit: "rif", action: "move", target: "17" }, configRif),
  graph.invoke({ unit: "echo", action: "move", target: "23" }, configEcho),
  graph.invoke({ unit: "sherpa", action: "move", target: "09" }, configSherpa),
]);
```
Poller and curation agents use their own thread_ids or run as subgraphs from the parent unit's thread. Either way, they're independent LangGraph invocations.
SqliteSaver write contention: the only real serialization point. Multiple concurrent writes to the same checkpointer may queue at the SQLite level (single-writer design). For turn-based gameplay with 5 concurrent agents making a few writes per second, this is imperceptible.
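A minimal wiring sketch, assuming the JS checkpointer package `@langchain/langgraph-checkpoint-sqlite` (which wraps better-sqlite3; Bun compatibility should be verified), with a toy one-node graph standing in for the real 36-node StateGraph:
```ts
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";
import { SqliteSaver } from "@langchain/langgraph-checkpoint-sqlite";

// Toy state: just the unit's last action. The real graph has 36 nodes.
const State = Annotation.Root({
  action: Annotation<string>(),
});

// One shared checkpointer; SQLite's single-writer design serializes writes.
const checkpointer = SqliteSaver.fromConnString("checkpoints.db");

const graph = new StateGraph(State)
  .addNode("act", async (s) => ({ action: `did:${s.action}` }))
  .addEdge(START, "act")
  .addEdge("act", END)
  .compile({ checkpointer });
```
Each invoke() with a distinct thread_id checkpoints independently, which is what makes the Promise.all pattern above safe.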
Recommendation
Allow up to 5 concurrent Hermes calls. The Pi 4 has the headroom. The real cap is Hermes Portal's rate limit — which can't be determined until we test with the OAuth token.
If Hermes Portal limits concurrency to fewer than 5, implement a queue: excess calls wait until a slot opens. The player sees "Agent queued — waiting for available slot" on the RG status bar. This is a graceful degradation, not a failure.
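A minimal sketch of that queue, assuming a hypothetical `callHermes()` wrapper around the actual Hermes call and a stubbed `setStatus()` for the RG status bar:
```ts
const MAX_CONCURRENT = 5; // drop to Hermes Portal's real limit once measured

let inFlight = 0;
const waiters: Array<() => void> = [];

// Stub for the RG status bar; replace with the real hook.
const setStatus = (msg: string) => console.log(msg);

async function withHermesSlot<T>(fn: () => Promise<T>): Promise<T> {
  if (inFlight >= MAX_CONCURRENT) {
    setStatus("Agent queued — waiting for available slot");
    // Wait for a finishing call to hand us its slot directly.
    await new Promise<void>((resolve) => waiters.push(resolve));
  } else {
    inFlight++;
  }
  try {
    return await fn();
  } finally {
    const next = waiters.shift();
    if (next) next(); // hand our slot to the next queued caller
    else inFlight--;
  }
}
```
Every agent call then goes through `withHermesSlot(() => callHermes(prompt))`: excess callers queue in FIFO order and the player sees the status message, never an error.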
Decision Needed
- Test Hermes Portal OAuth concurrency limits once the token is available. Fire 5 simultaneous calls and observe: do all 5 succeed? Do some queue? Do some fail? A probe sketch follows this list.
- If rate-limited: implement a call queue with visible status on the RG.
- If unlimited: confirm by testing, not assuming. Document the actual limit.
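A probe sketch for that test (`callHermes()` is the same hypothetical wrapper as above; the interpretation comments are assumptions to confirm against the portal's real behavior):
```ts
// Replace with the real Hermes wrapper once the OAuth token is available.
declare function callHermes(prompt: string): Promise<string>;

async function probeConcurrency(n = 5): Promise<void> {
  const started = Date.now();
  const results = await Promise.allSettled(
    Array.from({ length: n }, (_, i) => callHermes(`probe-${i}`)),
  );
  for (const [i, r] of results.entries()) {
    if (r.status === "fulfilled") console.log(`call ${i}: ok`);
    else console.log(`call ${i}: failed:`, r.reason); // a 429 points at a portal rate limit
  }
  console.log(`wall time: ${Date.now() - started}ms`);
  // all ok, wall time ≈ one call's latency  → true concurrency
  // all ok, wall time ≈ n × one call's time → server-side queueing
}
```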
Live RAM Audit (2026-05-12)
| Process | RAM | What |
|---|---|---|
| Forgejo | 170MB | Forgejo binary; runs in Docker, but the process is visible to the host |
| Docker daemon | 95MB | dockerd |
| Hermes Gateway | 80MB | ⚠️ Already running on Pi! Python Hermes at /mnt/kitchen/private/hermes/venv/ |
| Probe server (ours) | 67MB | ⚠️ Still running from earlier test — must kill |
| Tailscale | 57MB | tailscaled |
| containerd | 42MB | Docker container runtime |
| SMB | 26MB | File sharing |
| Python HTTP (port 8080) | 20MB | ⚠️ Unknown process — investigate |
| NetworkManager | 20MB | Networking |
| Other system | ~67MB | Misc services |
| Total | ~644MB | (not 678MB — earlier free -h rounded up) |
Adjusted Baseline (After Cleanup)
| Action | RAM Freed | New Baseline |
|---|---|---|
| Kill probe server | -67MB | 577MB |
| Kill unknown HTTP (8080) | -20MB | 557MB |
| Prune Docker images | (disk only) | 557MB |
| Game-ready baseline | (net -87MB) | ~550MB, with Docker + Forgejo + Hermes gateway + Tailscale |
Hermes Already Running
Hermes Gateway is already active on the Pi, running as /mnt/kitchen/private/hermes/venv/bin/python -m hermes_cli.main gateway run --replace. This is a Python-based Hermes. Two paths:
A) Use existing Hermes Gateway. LangGraph on Bun calls Hermes Gateway via HTTP (port?). No new install needed. But reintroduces the HTTP boundary between Bun and Python.
B) Install Hermes on Bun. Stops the Python gateway. One Bun process for everything. Zero boundary. But needs Hermes Bun install + OAuth config.
The decision depends on what's simpler: piping through the existing gateway (no install, but a boundary) vs. installing Hermes on Bun (install work, but zero boundary). Given the zero-boundary philosophy we've settled on, Option B is consistent.
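For scale, the Option A boundary is a single HTTP hop. In this sketch, `HERMES_GATEWAY_PORT` and the `/v1/chat` route are placeholders; the gateway's real port and route still need to be read from its config:
```ts
// Option A: LangGraph (Bun) → existing Python Hermes Gateway over HTTP.
// Port and route are placeholders — verify against the gateway's config.
const GATEWAY_PORT = Number(process.env.HERMES_GATEWAY_PORT ?? 0);

async function callGateway(prompt: string): Promise<string> {
  const res = await fetch(`http://127.0.0.1:${GATEWAY_PORT}/v1/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok) throw new Error(`gateway responded ${res.status}`);
  return res.text();
}
```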
Hermes Fallback Rules
Qwen 3.6+ is primary (free, unlimited via OAuth). If rate-limited or failing:
```ts
const HERMES_MODELS = [
  { model: "qwen-3.6-plus", provider: "hermes-portal", auth: "oauth" },   // primary — free
  { model: "kimi-k2.6",     provider: "kimi",          auth: "api_key" }, // fallback 1
  { model: "minimax",       provider: "minimax",       auth: "api_key" }, // fallback 2
];
// Hermes configured to try primary first, fall back on rate limit or failure
```
Kimi and MiniMax keys are already available (visible in the d3-tui container env). Fallback is automatic: Hermes tries Qwen; on rate limit or error it falls back to Kimi, then MiniMax. The RG shows which model is active in the unit status view.
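If Hermes doesn't provide that chain natively, a minimal fallback loop over `HERMES_MODELS` might look like this (`callModel()` is a hypothetical per-provider wrapper; "rate limit or failure" is approximated as any thrown error):
```ts
// Hypothetical per-provider wrapper; the real one wraps each provider's API.
declare function callModel(
  entry: { model: string; provider: string; auth: string },
  prompt: string,
): Promise<string>;

// Walk HERMES_MODELS in order; return the first success plus the model used,
// so the RG can show which model is active in the unit status view.
async function invokeWithFallback(prompt: string): Promise<{ model: string; output: string }> {
  let lastError: unknown;
  for (const entry of HERMES_MODELS) {
    try {
      return { model: entry.model, output: await callModel(entry, prompt) };
    } catch (err) {
      lastError = err; // rate limit or failure: fall through to the next model
    }
  }
  throw new Error(`All models failed: ${String(lastError)}`);
}
```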
Updated Concurrency Budget
| Component | Est. RAM (after cleanup) |
|---|---|
| OS + system services (after pruning) | ~550MB |
| Bun runtime | ~50MB |
| LangGraph + HTTP server | ~20MB |
| Hermes base (shared) | ~50MB |
| Hermes instance × 5 | 5 × 150MB = 750MB |
| Total at 5 concurrent | ~1.4GB (~870MB above baseline) |
| Pi 4 available | ~3.2GB (3.7GB - 550MB baseline) |
| Headroom | ~2.3GB — even more than estimated |
More headroom than the original estimate. The Pi 4 is not the bottleneck at any reasonable concurrent load.