Clara’s Own Model — Fine-Tuned LLM on AWS
Decision (2026-03-10)
Amen Ra decided Clara needs her own fine-tuned LLM — not a ChatGPT wrapper. “There have been so many fake Black ChatGPTs — the demand is there, we should meet it.”
Why This Is Different
- Every “Black ChatGPT” failed because they were skins on OpenAI/Anthropic APIs
- Clara has real infrastructure: Yapit payments, 42+ Herus, Auset Platform
- Clara has real stories: named after Clara Villarosa, Mary McLeod Bethune, Maya Angelou, Nikki Giovanni
- Clara does things (Agent Teams) — not just answers questions
Technical Architecture (ALL AWS, lean)
- Base model: Llama 3.3 70B (Meta, open-source, commercial-use OK)
- Fine-tune method: QLoRA (4-bit quantized LoRA) — keeps GPU cost low
- Training data: Extracted from all Heru codebases + cultural data (Kemetic, AAVE, Black business)
- Fine-tune platform: AWS SageMaker (ml.g5.2xlarge, ~$15-30 per run)
- Inference: SageMaker real-time endpoint (ml.g5.2xlarge, scale-to-zero when idle)
- Integration: PRIMARY model in both Cloudflare AI Gateway AND build farm agentic loop
- Monthly retrain: Automated pipeline collects new code patterns, retrains, validates, deploys
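Making the fine-tuned model PRIMARY in both the gateway and the agentic loop implies a fallback chain: try the Clara endpoint first, then fall through to API providers. A minimal JavaScript sketch with mocked call functions; the model names and the `callWithFallback` helper are illustrative assumptions, not code from agentic-loop.js:

```javascript
// Try each model in order; return the first success, collect failures.
async function callWithFallback(prompt, models) {
  const errors = [];
  for (const model of models) {
    try {
      return { model: model.name, reply: await model.call(prompt) };
    } catch (err) {
      errors.push(`${model.name}: ${err.message}`); // remember why it failed
    }
  }
  throw new Error(`all models failed: ${errors.join("; ")}`);
}

// Hypothetical chain: Clara (SageMaker endpoint) as primary, an API provider after.
const chain = [
  { name: "clara-llama-70b", call: async () => { throw new Error("endpoint scaled to zero"); } },
  { name: "groq-llama4-scout", call: async (p) => `ok: ${p}` },
];

callWithFallback("hello", chain).then((r) => console.log(r.model, r.reply));
```

The same chain shape works for both integration points: the Cloudflare AI Gateway config and the build farm loop only differ in which callers sit behind each entry.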
Cost Estimate
| Resource | Cost | Notes |
|---|---|---|
| Fine-tune run | ~$15-30 | 2-4 hrs on g5.2xlarge |
| Inference endpoint | $0-110/mo | Scale-to-zero when idle |
| S3 storage | ~$2/mo | Weights + training data |
| Monthly retrain | ~$15-30/mo | Automated |
| Total | ~$30-170/mo | Our own model |
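The Total row can be sanity-checked by summing the four line items, which is what the table's ~$30-170/mo appears to do (it folds the one-time fine-tune run into the monthly figure alongside the automated retrain):

```javascript
// Low/high cost ranges per table row, in USD. The fine-tune run is
// included because the stated total only matches when it is summed in.
const rows = {
  fineTuneRun: [15, 30], // one 2-4 hr run on ml.g5.2xlarge
  inference: [0, 110],   // scale-to-zero endpoint
  s3: [2, 2],            // weights + training data
  retrain: [15, 30],     // automated monthly retrain
};
const low = Object.values(rows).reduce((s, [lo]) => s + lo, 0);
const high = Object.values(rows).reduce((s, [, hi]) => s + hi, 0);
console.log(`$${low}-${high}/mo`); // $32-172/mo, i.e. ~$30-170 after rounding
```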
Workqueue Priorities
- P64: Training data pipeline (READY NOW)
- P65: SageMaker fine-tune job (after P64)
- P66: Cloudflare AI Gateway integration (after P65)
- P67: Build farm agentic-loop integration (after P65)
- P68: Monthly retrain CI pipeline (after P64 + P65)
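The dependency chain above can be expressed as a small readiness check (a task is workable once all of its prerequisites are done). Task IDs are from the list; the helper itself is illustrative:

```javascript
// Prerequisite map taken directly from the workqueue priorities.
const deps = {
  P64: [],
  P65: ["P64"],
  P66: ["P65"],
  P67: ["P65"],
  P68: ["P64", "P65"],
};

// A task is ready when it is not yet done and every dependency is done.
const ready = (done) =>
  Object.keys(deps).filter(
    (t) => !done.has(t) && deps[t].every((d) => done.has(d))
  );

console.log(ready(new Set()));               // ["P64"]
console.log(ready(new Set(["P64"])));        // ["P65"]
console.log(ready(new Set(["P64", "P65"]))); // ["P66", "P67", "P68"]
```

Note that P66, P67, and P68 unblock simultaneously once P65 lands, so they can run in parallel.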
Build Farm — Local Ollama + API Providers (Updated 2026-03-10 afternoon)
Local Models (OPERATIONAL — March 10, 2026)
- Ollama installed on BOTH EC2s (farm-1 + farm-2)
- Primary: Qwen 2.5 Coder 3B (`qwen2.5-coder:3b`) — code-specialized, Tier 0
- Fallback: Llama 3.2 3B (`llama3.2:3b`) — general-purpose, Tier 0.5
- Config: `OLLAMA_MAX_LOADED_MODELS=1` (only one model in memory at a time, 8GB RAM limit)
- Config: `OLLAMA_NUM_PARALLEL=1` (one request at a time per EC2)
- Endpoint: `http://localhost:11434/v1/chat/completions` (OpenAI-compatible)
- Speed: ~2-3 min per complex response with tools on t3.large CPU
- Rate limits: NONE — unlimited, runs on our hardware
- Concurrency: 1 agent per EC2 (RAM + CPU constraint)
- Qwen behavior: Returns tool calls as JSON in the content field (not structured `tool_calls`). The agentic loop has `tryParseContentToolCall()` to handle this.
- Fetch timeout: 3 minutes via AbortController. Catch-all error handler — never crashes on a fetch failure.
- API keys: Deployed to `/home/ec2-user/.agentic-env` — must `source` it before running the agentic loop
- This IS Clara v0 — same architecture that becomes Nikki (free tier) for public users
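The actual `tryParseContentToolCall()` lives in agentic-loop.js and is not reproduced here; the following is a hedged sketch of what such a parser typically does. The fence-stripping regex and the `{name, arguments}` field shape are assumptions about Qwen's output, not the real implementation:

```javascript
// Recover a tool call that the model emitted as plain JSON text in the
// content field (sometimes wrapped in a ```json fence) instead of a
// structured tool_calls array.
function tryParseContentToolCall(content) {
  if (!content) return null;
  // Strip an optional markdown code fence around the JSON.
  const fenced = content.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : content).trim();
  try {
    const obj = JSON.parse(candidate);
    if (obj && typeof obj.name === "string" && obj.arguments !== undefined) {
      return { name: obj.name, arguments: obj.arguments };
    }
  } catch {
    // Not JSON: treat as a normal text response.
  }
  return null;
}

const call = tryParseContentToolCall(
  '```json\n{"name":"read_file","arguments":{"path":"a.js"}}\n```'
);
console.log(call); // { name: "read_file", arguments: { path: "a.js" } }
```

Returning `null` for non-JSON content lets the loop fall through to treating the message as ordinary assistant text.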
Complexity-Based Routing (PLANNED — in requirements doc)
- COMPLEX tasks (refactor, debug, investigate) → prefer API models (Groq 17B, SambaNova 70B)
- SIMPLE tasks (add, edit, commit) → prefer local Qwen Coder
- Iterations 1-3 (planning) → prefer API. Iterations 4+ (execution) → prefer local.
- Saves API tokens for when they matter most. ~50-100 API calls/day across 5 providers.
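Since this routing is still at the requirements stage, here is only a sketch of the rule as described above; `pickTier`, its task-type list, and the iteration threshold are illustrative names, not implemented code:

```javascript
// Task types the plan routes to API models regardless of iteration.
const COMPLEX = new Set(["refactor", "debug", "investigate"]);

// Planning iterations (1-3) and complex tasks prefer API models;
// simple execution work stays on local Qwen Coder.
function pickTier(taskType, iteration) {
  if (COMPLEX.has(taskType) || iteration <= 3) return "api";
  return "local";
}

console.log(pickTier("refactor", 5)); // "api"   (complex task)
console.log(pickTier("edit", 2));     // "api"   (planning iteration)
console.log(pickTier("commit", 6));   // "local" (simple, execution phase)
```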
API Providers — Multi-Account Key Rotation (Updated 2026-03-11)
ALL keys are from SEPARATE email accounts (independent daily quotas). Rotation code exists in agentic-loop.js (lines 201-246).
| Provider | Keys | Accounts | Status |
|---|---|---|---|
| Groq (Llama 4 Scout 17B) | 4 | quikinfluence, quikcarry, quiknation, quikcarrental | WORKING |
| Cerebras (Llama 3.1 8B) | 5 | quikinfluence, quiknation, quikcarrental, quikevents, quikhuddle | WORKING |
| SambaNova (Llama 3.3 70B) | 2 | quikinfluence, quikcarrental | WORKING |
| Google AI (Gemini 2.0 Flash) | 1 | quikinfluence | WORKING (1,500 req/day limit) |
| OpenRouter (Qwen3 Coder) | 1 | quikinfluence | WORKING |
| Together | 1 | quikinfluence | BROKEN — 401 invalid key, needs new key from dashboard |
| DeepSeek | 1 | quikinfluence | BROKEN — 402 no balance, needs credit added |
- Total working: 13 API keys across 5 providers, plus 2 local Ollama models = 15 entries in the model pool
- Total broken: 2 providers (Together, DeepSeek) — NOT in agentic-loop.js MODELS array, NOT in .agentic-env
- Should sustain 24 hours with separate accounts — if keys exhaust early, the bug is in the agentic loop (retry storms or broken rotation), NOT in the key setup.
All API keys stored in SSM at `/quik-nation/build-farm/{provider}-api-key`
Keys deployed to EC2s at `/home/ec2-user/.agentic-env`
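The rotation code itself is in agentic-loop.js (lines 201-246) and is not reproduced here; what follows is a hedged sketch of the general pattern, round-robin over per-account keys with a skip for quota-exhausted accounts. All names (`makeRotator`, `markExhausted`) are illustrative:

```javascript
// Round-robin key rotator: each provider gets one rotator over its
// per-account keys; a key flagged as exhausted (e.g. on HTTP 429) is
// skipped until the set is reset for the next quota day.
function makeRotator(keys) {
  let i = 0;
  const exhausted = new Set();
  return {
    next() {
      for (let tried = 0; tried < keys.length; tried++) {
        const key = keys[i % keys.length];
        i++;
        if (!exhausted.has(key)) return key;
      }
      return null; // every account hit its daily quota
    },
    markExhausted(key) { exhausted.add(key); },
  };
}

const groq = makeRotator(["key-a", "key-b", "key-c"]);
console.log(groq.next()); // "key-a"
console.log(groq.next()); // "key-b"
groq.markExhausted("key-c");
console.log(groq.next()); // skips key-c, wraps back to "key-a"
```

Returning `null` rather than throwing lets the caller fall through to the next provider in the chain, which matters for the "should sustain 24 hours" claim above.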
Requirements Doc for Sonnet
.claude/plans/clara-model-routing-architecture.md (also in .cursor/plans/)
- 5 requirements (R1-R5): smart routing, concurrency, iteration-based selection, fallback chain, metrics
- 6 deliverables (D1-D6): routing logic, dispatch update, supervisor, registry, subdomains, Clara chat-assisted bug reporting
- 5-phase growth: $0 → $30-170/mo → $150-300/mo