Clara Voice — Swarm Acceptance Criteria

Set by: Amen Ra (CTO) Deadline: Friday March 28, 2026 (demo) Repo: claraagents/desktop/

Business AC (A. Philip)

  1. Voice IS the prompt — Mo speaks into Clara Desktop, transcript lands in the Claude Code session via Channels MCP. No typing required.
  2. Opus IS the brain — This Claude Code session (with full vault context) responds. NOT Groq, NOT a separate LLM. The SAME session.
  3. Agent responds in cloned voice — Reply goes through speak.py → MiniMax TTS → Mac speakers in the agent’s cloned voice.
  4. Clara Desktop IS the product — replaces Slack/Zoom/Teams. Not a developer tool.
  5. Huddle feature — Stream-powered voice/video calls with human + AI agent participants.

Technical AC (Granville)

  1. Clara Desktop mic → Deepgram STT → POST to MCP channel (port 8789)
  2. Channel delivers transcript to warm Claude Code session
  3. Opus responds in-session (1-3 sentences, conversational)
  4. Reply tool → speak.py → MiniMax TTS → speakers (cloned voice)
  5. Latency (updated March 24 by Mo):
    • STT + POST to 8789: <5s (already achieved ~1.5-2s)
    • Session turn time: minimize via watcher/notify/blocking-wait — no hard <5s guarantee while Claude Code turn model is request-response
    • NO fake instant replies via Messages API or separate LLM — this session IS the brain
    • Metric: median time-to-first-listen improved via discipline, not by changing who the brain is
    • If product later needs marketing-grade “always <5s voice,” re-open as separate AC decision (hybrid A+B)
  6. Locked build for Quik (CLARA_LOCKED=true, .dmg packaged)
  7. Huddle: Stream call, multiple humans + agents in same room
  8. Desktop/ committed to claraagents repo on GitHub

Decision Record: Real-time voice in Claude Code NOT POSSIBLE (March 24, 2026)

Set by: Amen Ra Finding: Claude Code’s turn model (request-response) prevents real-time voice conversation. MCP channel notifications don’t reliably trigger turns. 4 days spent confirming this (Sessions 19-23). Resolution: Real-time voice requires Messages API (same Opus model, vault injected as system prompt). Claude Code stays the command center for coding, tools, swarm, vault — not for voice conversation. Feature request to Anthropic: Event-driven turns from MCP channel notifications would solve this. Third-party integrations (Telegram, Slack) already do real-time with Claude via API — Claude Code should too.

What’s Already Done (Session 19)

  • Clara Desktop moved to claraagents/desktop/
  • Stream SDK integrated (replaced Daily.co)
  • IPC voice pipeline (mic → main process → voice server)
  • Agent registration via /api/agent
  • Voice server: PNA, vault optimization, timeouts
  • Groq paid ($20 cap)
  • Quik Huddle → Next.js 15.5.9
  • voice-channel.ts MCP server running on 8789
  • 30 transcripts confirmed received by MCP channel
  • Product vision: Clara Desktop = communication platform

What’s Left for Swarm

  • AC 1-4: Wire Clara Desktop → STT only → POST to 8789 → Opus responds → reply tool → speak.py
  • AC #5: Test round trip timing
  • AC #6: Package locked .dmg
  • AC #7: Stream Huddle with multiple participants
  • AC #8: Push desktop/ to claraagents GitHub

Architecture (CORRECT — from vault)

Clara Desktop mic → Deepgram STT → voice server /voice-channel
  → POST to MCP channel (port 8789) → notifications/claude/channel
  → THIS Claude Code session responds → reply tool → speak.py → speakers

Voice server does STT + TTS ONLY. Opus is the brain. NOT Groq.