Clara’s Own Model — Fine-Tuned LLM on AWS

Decision (2026-03-10)

Amen Ra decided Clara needs her own fine-tuned LLM — not a ChatGPT wrapper. “There have been so many fake Black ChatGPTs — the demand is there, we should meet it.”

Why This Is Different

  • Every “Black ChatGPT” failed because they were skins on OpenAI/Anthropic APIs
  • Clara has real infrastructure: Yapit payments, 42+ Herus, Auset Platform
  • Clara has real stories: named after Clara Villarosa, Mary McLeod Bethune, Maya Angelou, Nikki Giovanni
  • Clara does things (Agent Teams) — not just answers questions

Technical Architecture (ALL AWS, lean)

  • Base model: Llama 3.3 70B (Meta; open weights, commercial use permitted under the Llama community license)
  • Fine-tune method: QLoRA (4-bit quantized LoRA) — keeps GPU cost low
  • Training data: Extracted from all Heru codebases + cultural data (Kemetic, AAVE, Black business)
  • Fine-tune platform: AWS SageMaker (ml.g5.2xlarge, ~$15-30 per run)
  • Inference: SageMaker real-time endpoint (ml.g5.2xlarge, scale-to-zero when idle)
  • Integration: PRIMARY model in both Cloudflare AI Gateway AND build farm agentic loop
  • Monthly retrain: Automated pipeline collects new code patterns, retrains, validates, deploys
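The fine-tune step above can be sketched as the parameter object a retrain pipeline would hand to SageMaker's CreateTrainingJob API. This is a sketch under assumptions: the ECR image URI, IAM role ARN, S3 bucket names, and LoRA hyperparameters are illustrative placeholders, not values from this plan.

```javascript
// Sketch of a SageMaker QLoRA fine-tune job for Clara. The object matches
// the input shape of CreateTrainingJobCommand (@aws-sdk/client-sagemaker).
// All ARNs, URIs, and hyperparameter values are placeholders.
function buildFineTuneJobParams(runId) {
  return {
    TrainingJobName: `clara-qlora-${runId}`,
    AlgorithmSpecification: {
      // Placeholder: a custom training image with the QLoRA stack baked in.
      TrainingImage: '<account>.dkr.ecr.us-east-1.amazonaws.com/clara-qlora:latest',
      TrainingInputMode: 'File',
    },
    RoleArn: 'arn:aws:iam::<account>:role/ClaraSageMakerRole', // placeholder
    HyperParameters: {
      base_model: 'meta-llama/Llama-3.3-70B-Instruct',
      load_in_4bit: 'true', // QLoRA: 4-bit quantized base + LoRA adapters
      lora_r: '16',         // illustrative LoRA rank/alpha
      lora_alpha: '32',
    },
    InputDataConfig: [{
      ChannelName: 'train',
      DataSource: {
        S3DataSource: {
          S3DataType: 'S3Prefix',
          S3Uri: 's3://clara-training-data/train/', // placeholder bucket
          S3DataDistributionType: 'FullyReplicated',
        },
      },
    }],
    OutputDataConfig: { S3OutputPath: 's3://clara-model-artifacts/' }, // placeholder
    ResourceConfig: { InstanceType: 'ml.g5.2xlarge', InstanceCount: 1, VolumeSizeInGB: 200 },
    StoppingCondition: { MaxRuntimeInSeconds: 4 * 3600 }, // cap at the ~4 hr budget
  };
}
```

The monthly retrain pipeline would build these params with a fresh runId, send them via CreateTrainingJobCommand, and poll the job status before validating and deploying the new adapter.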

Cost Estimate

| Resource | Cost | Notes |
| --- | --- | --- |
| Fine-tune run | ~$15-30 | 2-4 hrs on g5.2xlarge |
| Inference endpoint | $0-110/mo | Scale-to-zero when idle |
| S3 storage | ~$2/mo | Weights + training data |
| Monthly retrain | ~$15-30/mo | Automated |
| Total | ~$30-170/mo | Our own model |
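As a sanity check, the total row is just the sum of the line items; the exact endpoints come out to $32 and $172, which round to the quoted ~$30-170:

```javascript
// Monthly cost range as the straight sum of the cost table's line items.
const items = [
  { name: 'fine-tune run',      low: 15, high: 30  },
  { name: 'inference endpoint', low: 0,  high: 110 },
  { name: 's3 storage',         low: 2,  high: 2   },
  { name: 'monthly retrain',    low: 15, high: 30  },
];
const low  = items.reduce((sum, i) => sum + i.low, 0);  // 32, rounded to ~$30
const high = items.reduce((sum, i) => sum + i.high, 0); // 172, rounded to ~$170
```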

Workqueue Priorities

  • P64: Training data pipeline (READY NOW)
  • P65: SageMaker fine-tune job (after P64)
  • P66: Cloudflare AI Gateway integration (after P65)
  • P67: Build farm agentic-loop integration (after P65)
  • P68: Monthly retrain CI pipeline (after P64 + P65)

Build Farm — Local Ollama + API Providers (Updated 2026-03-10 afternoon)

Local Models (OPERATIONAL — March 10, 2026)

  • Ollama installed on BOTH EC2s (farm-1 + farm-2)
  • Primary: Qwen 2.5 Coder 3B (qwen2.5-coder:3b) — code-specialized, Tier 0
  • Fallback: Llama 3.2 3B (llama3.2:3b) — general-purpose, Tier 0.5
  • Config: OLLAMA_MAX_LOADED_MODELS=1 (only one model in memory at a time, 8GB RAM limit)
  • Config: OLLAMA_NUM_PARALLEL=1 (one request at a time per EC2)
  • Endpoint: http://localhost:11434/v1/chat/completions (OpenAI-compatible)
  • Speed: ~2-3 min per complex response with tools on t3.large CPU
  • Rate limits: NONE — unlimited, runs on our hardware
  • Concurrency: 1 agent per EC2 (RAM + CPU constraint)
  • Qwen behavior: Returns tool calls as JSON in content field (not structured tool_calls). Agentic loop has tryParseContentToolCall() to handle this.
  • Fetch timeout: 3 minutes via AbortController. Catch-all error handler — never crashes on fetch failure.
  • API keys: Deployed to /home/ec2-user/.agentic-env — must source before running agentic loop
  • This IS Clara v0 — same architecture that becomes Nikki (free tier) for public users
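The Qwen behavior noted above (tool calls arriving as JSON in the content field) can be handled by a parser along these lines. This is an illustrative reimplementation modeled on, but not copied from, the tryParseContentToolCall() in agentic-loop.js:

```javascript
// Qwen 2.5 Coder often emits tool calls as raw JSON in message.content
// instead of the structured tool_calls array. This sketch extracts such a
// call so the agentic loop can dispatch it like a normal tool_calls entry.
function tryParseContentToolCall(content) {
  if (!content) return null;
  // Strip a ```json ... ``` fence if the model wrapped its output in one.
  const fenced = content.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : content).trim();
  try {
    const obj = JSON.parse(candidate);
    // Accept only {"name": "...", "arguments": {...}}-shaped objects.
    if (obj && typeof obj.name === 'string') {
      return { name: obj.name, arguments: obj.arguments ?? {} };
    }
  } catch {
    // Not valid JSON: treat it as ordinary assistant prose.
  }
  return null;
}
```

A plain-prose response returns null and flows through as a normal chat message; a JSON payload like `{"name":"read_file","arguments":{"path":"a.txt"}}` comes back as a dispatchable tool call.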

Complexity-Based Routing (PLANNED — in requirements doc)

  • COMPLEX tasks (refactor, debug, investigate) → prefer API models (Groq 17B, SambaNova 70B)
  • SIMPLE tasks (add, edit, commit) → prefer local Qwen Coder
  • Iterations 1-3 (planning) → prefer API. Iterations 4+ (execution) → prefer local.
  • Saves API tokens for when they matter most. ~50-100 API calls/day across 5 providers.
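The routing rules above could be sketched as a small selector. Assumptions to flag: the verb lists, the precedence of task complexity over iteration number, and the model names in each tier are illustrative; R1-R3 in the requirements doc are the source of truth.

```javascript
// Complexity-based routing sketch: complex tasks and early (planning)
// iterations prefer API models; simple tasks and later (execution)
// iterations prefer the local Qwen Coder. Precedence here (task verb
// beats iteration number) is an assumption, not from the requirements doc.
const COMPLEX = new Set(['refactor', 'debug', 'investigate']);
const SIMPLE  = new Set(['add', 'edit', 'commit']);

function pickModel(taskVerb, iteration) {
  const wantsApi =
    COMPLEX.has(taskVerb) ||                    // complex tasks always go to API models
    (iteration <= 3 && !SIMPLE.has(taskVerb));  // early planning iterations too
  return wantsApi
    ? { tier: 'api',   model: 'groq/llama-4-scout-17b' } // SambaNova 70B as next in chain
    : { tier: 'local', model: 'qwen2.5-coder:3b' };
}
```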

API Providers — Multi-Account Key Rotation (Updated 2026-03-11)

ALL keys are from SEPARATE email accounts (independent daily quotas). Rotation code exists in agentic-loop.js (lines 201-246).

| Provider | Keys | Accounts | Status |
| --- | --- | --- | --- |
| Groq (Llama 4 Scout 17B) | 4 | quikinfluence, quikcarry, quiknation, quikcarrental | WORKING |
| Cerebras (Llama 3.1 8B) | 5 | quikinfluence, quiknation, quikcarrental, quikevents, quikhuddle | WORKING |
| SambaNova (Llama 3.3 70B) | 2 | quikinfluence, quikcarrental | WORKING |
| Google AI (Gemini 2.0 Flash) | 1 | quikinfluence | WORKING (1,500 req/day limit) |
| OpenRouter (Qwen3 Coder) | 1 | quikinfluence | WORKING |
| Together | 1 | quikinfluence | BROKEN: 401 invalid key, needs new key from dashboard |
| DeepSeek | 1 | quikinfluence | BROKEN: 402 no balance, needs credit added |
  • Total working: 13 API keys across 5 providers, plus 2 local Ollama models (15 entries in the rotation pool)
  • Total broken: 2 providers (Together, DeepSeek) — NOT in agentic-loop.js MODELS array, NOT in .agentic-env
  • Should sustain 24 hours of operation given the separate accounts. If keys exhaust early, the bug is in the agentic loop (retry storms or failure to rotate), NOT in the key setup.
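The per-provider rotation described above can be sketched as follows. This is illustrative only (the real implementation lives in agentic-loop.js); the idea is that each key carries its own daily quota, so a 429 burns one key, not the provider.

```javascript
// Multi-account key rotation sketch: on a quota error the current key is
// marked exhausted and the next live key for the same provider is tried.
// Only when every key for the provider is exhausted does the caller fall
// through to the next provider in the fallback chain.
function makeRotator(keys) {
  let idx = 0;
  const exhausted = new Set();
  return {
    current: () => (exhausted.size === keys.length ? null : keys[idx]),
    markExhausted() {
      exhausted.add(keys[idx]);
      // Advance to the next non-exhausted key, if any remain.
      for (let i = 0; i < keys.length; i++) {
        idx = (idx + 1) % keys.length;
        if (!exhausted.has(keys[idx])) return keys[idx];
      }
      return null; // all keys burned: caller moves to the next provider
    },
  };
}
```

With Groq's four keys, for example, four separate daily quotas get consumed in sequence before the loop ever touches Cerebras.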

All API keys are stored in SSM Parameter Store at /quik-nation/build-farm/{provider}-api-key. Keys are deployed to the EC2s at /home/ec2-user/.agentic-env.

Requirements Doc for Sonnet

.claude/plans/clara-model-routing-architecture.md (also in .cursor/plans/)

  • 5 requirements (R1-R5): smart routing, concurrency, iteration-based selection, fallback chain, metrics
  • 6 deliverables (D1-D6): routing logic, dispatch update, supervisor, registry, subdomains, Clara chat-assisted bug reporting
  • 5-phase growth: $0 → $30-170/mo → $150-300/mo