Clara Voice — Swarm Acceptance Criteria
Set by: Amen Ra (CTO)
Deadline: Friday March 28, 2026 (demo)
Repo: claraagents/desktop/
Business AC (A. Philip)
- Voice IS the prompt — Mo speaks into Clara Desktop, transcript lands in the Claude Code session via Channels MCP. No typing required.
- Opus IS the brain — This Claude Code session (with full vault context) responds. NOT Groq, NOT a separate LLM. The SAME session.
- Agent responds in cloned voice — Reply goes through speak.py → MiniMax TTS → Mac speakers in the agent’s cloned voice.
- Clara Desktop IS the product — replaces Slack/Zoom/Teams. Not a developer tool.
- Huddle feature — Stream-powered voice/video calls with human + AI agent participants.
Technical AC (Granville)
- Clara Desktop mic → Deepgram STT → POST to MCP channel (port 8789)
- Channel delivers transcript to warm Claude Code session
- Opus responds in-session (1-3 sentences, conversational)
- Reply tool → speak.py → MiniMax TTS → speakers (cloned voice)
- Latency (updated March 24 by Mo):
  - STT + POST to 8789: <5s (already achieved ~1.5-2s)
  - Session turn time: minimize via watcher/notify/blocking-wait; no hard <5s guarantee while Claude Code turn model is request-response
  - NO fake instant replies via Messages API or separate LLM; this session IS the brain
  - Metric: median time-to-first-listen improved via discipline, not by changing who the brain is
  - If product later needs marketing-grade “always <5s voice,” re-open as separate AC decision (hybrid A+B)
- Locked build for Quik (CLARA_LOCKED=true, .dmg packaged)
- Huddle: Stream call, multiple humans + agents in same room
- Desktop/ committed to claraagents repo on GitHub
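The STT-to-channel hop in the first Technical AC item can be sketched as a plain HTTP POST. This is a minimal sketch, not the voice server's actual code: only the port (8789) comes from this document, while the host, the URL path, and the JSON payload shape (`{"text": ...}`) are assumptions, and the function names are hypothetical.

```python
import json
import urllib.request

# Port 8789 is from the AC; host and path are assumptions.
CHANNEL_URL = "http://127.0.0.1:8789/"


def build_transcript_request(transcript: str, url: str = CHANNEL_URL) -> urllib.request.Request:
    """Build (but do not send) the POST that hands a transcript to the MCP channel."""
    text = transcript.strip()
    if not text:
        raise ValueError("empty transcript")
    # Payload shape is hypothetical; match whatever voice-channel.ts expects.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_transcript(transcript: str, timeout: float = 5.0) -> int:
    """Send the transcript and return the HTTP status code."""
    request = build_transcript_request(transcript)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload checkable without a running channel server.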
Decision Record: Real-time voice in Claude Code NOT POSSIBLE (March 24, 2026)
Set by: Amen Ra
Finding: Claude Code’s turn model (request-response) prevents real-time voice conversation. MCP channel notifications don’t reliably trigger turns. 4 days spent confirming this (Sessions 19-23).
Resolution: Real-time voice requires Messages API (same Opus model, vault injected as system prompt). Claude Code stays the command center for coding, tools, swarm, and vault, not for voice conversation.
Feature request to Anthropic: Event-driven turns from MCP channel notifications would solve this. Third-party integrations (Telegram, Slack) already do real-time with Claude via API; Claude Code should too.
What’s Already Done (Session 19)
- Clara Desktop moved to claraagents/desktop/
- Stream SDK integrated (replaced Daily.co)
- IPC voice pipeline (mic → main process → voice server)
- Agent registration via /api/agent
- Voice server: PNA, vault optimization, timeouts
- Groq paid ($20 cap)
- Quik Huddle → Next.js 15.5.9
- voice-channel.ts MCP server running on 8789
- 30 transcripts confirmed received by MCP channel
- Product vision: Clara Desktop = communication platform
What’s Left for Swarm
- AC #1-4: Wire Clara Desktop → STT only → POST to 8789 → Opus responds → reply tool → speak.py
- AC #5: Test round trip timing
- AC #6: Package locked .dmg
- AC #7: Stream Huddle with multiple participants
- AC #8: Push desktop/ to claraagents GitHub
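For the round-trip timing test (AC #5), one possible way to compute the latency numbers is sketched below. It assumes "time-to-first-listen" means the gap between the end of speech and the start of reply audio, and that three timestamps are logged per turn; the `Turn` fields and the `summarize` function are hypothetical names, not part of the existing code.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Turn:
    speech_end: float   # seconds: when the speaker stopped talking
    posted: float       # seconds: when the transcript POST to 8789 completed
    audio_start: float  # seconds: when the reply started playing on the speakers


def summarize(turns: list[Turn], stt_budget_s: float = 5.0) -> dict:
    """Report median latencies and whether every STT + POST hop met the budget."""
    stt_post = [t.posted - t.speech_end for t in turns]
    first_listen = [t.audio_start - t.speech_end for t in turns]
    return {
        "median_stt_post_s": median(stt_post),
        "median_time_to_first_listen_s": median(first_listen),
        "stt_post_within_budget": all(d < stt_budget_s for d in stt_post),
    }
```

Only the STT + POST hop is checked against a hard budget here, since the latency AC sets no hard limit on session turn time.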
Architecture (CORRECT — from vault)
Clara Desktop mic → Deepgram STT → voice server /voice-channel
→ POST to MCP channel (port 8789) → notifications/claude/channel
→ THIS Claude Code session responds → reply tool → speak.py → speakers
Voice server does STT + TTS ONLY. Opus is the brain. NOT Groq.
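The separation of roles in the diagram above can be sketched as three pluggable stages, with the brain injected rather than owned by the voice server. This is an illustrative, synchronous simplification: in the real flow the brain step happens inside the Claude Code session via the MCP channel, and the type aliases and `run_turn` function below are hypothetical.

```python
from typing import Callable

Stt = Callable[[bytes], str]    # audio in, transcript out (Deepgram in the diagram)
Brain = Callable[[str], str]    # transcript in, reply text out (the Claude Code session)
Tts = Callable[[str], bytes]    # reply text in, audio out (speak.py / MiniMax TTS)


def run_turn(audio: bytes, stt: Stt, brain: Brain, tts: Tts) -> bytes:
    """Run one voice turn; the voice-server side only ever calls stt and tts."""
    transcript = stt(audio).strip()
    if not transcript:
        return b""  # nothing was said, so the brain is never invoked
    reply = brain(transcript)
    return tts(reply)
```

Because the brain is passed in as a parameter, swapping in a different model would require changing the caller, which makes the "NOT Groq" rule visible at the call site.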