Clara’s Own Model — Fine-Tuned LLM on AWS
Decision (2026-03-10)
Amen Ra decided Clara needs her own fine-tuned LLM — not a ChatGPT wrapper. “There have been so many fake Black ChatGPTs — the demand is there, we should meet it.”
Why This Is Different
- Every “Black ChatGPT” failed because they were skins on OpenAI/Anthropic APIs
- Clara has real infrastructure: Yapit payments, 42+ Herus, Auset Platform
- Clara has real stories: named after Clara Villarosa, Mary McLeod Bethune, Maya Angelou, Nikki Giovanni
- Clara does things (Agent Teams) — not just answers questions
Technical Architecture (ALL AWS, lean)
- Base model: Llama 3.3 70B (Meta, open-source, commercial-use OK)
- Fine-tune method: QLoRA (4-bit quantized LoRA) — keeps GPU cost low
- Training data: Extracted from all Heru codebases + cultural data (Kemetic, AAVE, Black business)
- Fine-tune platform: AWS SageMaker (ml.g5.2xlarge, ~$15-30 per run)
- Inference: SageMaker real-time endpoint (ml.g5.2xlarge, scale-to-zero when idle)
- Integration: PRIMARY model in both Cloudflare AI Gateway AND build farm agentic loop
- Monthly retrain: Automated pipeline collects new code patterns, retrains, validates, deploys
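Making the fine-tuned model PRIMARY in both the gateway and the agentic loop implies a fallback chain: try the Clara endpoint first, then fall through to API providers. A minimal JavaScript sketch with mocked call functions; the model names and the `callWithFallback` helper are illustrative assumptions, not code from agentic-loop.js:

```javascript
// Try each model in order; return the first success, collect failures.
async function callWithFallback(prompt, models) {
  const errors = [];
  for (const model of models) {
    try {
      return { model: model.name, reply: await model.call(prompt) };
    } catch (err) {
      errors.push(`${model.name}: ${err.message}`); // remember why it failed
    }
  }
  throw new Error(`all models failed: ${errors.join("; ")}`);
}

// Hypothetical chain: Clara (SageMaker endpoint) as primary, an API provider after.
const chain = [
  { name: "clara-llama-70b", call: async () => { throw new Error("endpoint scaled to zero"); } },
  { name: "groq-llama4-scout", call: async (p) => `ok: ${p}` },
];

callWithFallback("hello", chain).then((r) => console.log(r.model, r.reply));
```

The same chain shape works for both integration points: the Cloudflare AI Gateway config and the build farm loop only differ in which callers sit behind each entry.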
Cost Estimate
| Resource | Cost | Notes |
|---|---|---|
| Fine-tune run | ~$15-30 | 2-4 hrs on g5.2xlarge |
| Inference endpoint | $0-110/mo | Scale-to-zero when idle |
| S3 storage | ~$2/mo | Weights + training data |
| Monthly retrain | ~$15-30/mo | Automated |
| Total | ~$30-170/mo | Our own model |
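The Total row can be sanity-checked by summing the four line items, which is what the table's ~$30-170/mo appears to do (it folds the one-time fine-tune run into the monthly figure alongside the automated retrain):

```javascript
// Low/high cost ranges per table row, in USD. The fine-tune run is
// included because the stated total only matches when it is summed in.
const rows = {
  fineTuneRun: [15, 30], // one 2-4 hr run on ml.g5.2xlarge
  inference: [0, 110],   // scale-to-zero endpoint
  s3: [2, 2],            // weights + training data
  retrain: [15, 30],     // automated monthly retrain
};
const low = Object.values(rows).reduce((s, [lo]) => s + lo, 0);
const high = Object.values(rows).reduce((s, [, hi]) => s + hi, 0);
console.log(`$${low}-${high}/mo`); // $32-172/mo, i.e. ~$30-170 after rounding
```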
Workqueue Priorities
- P64: Training data pipeline (READY NOW)
- P65: SageMaker fine-tune job (after P64)
- P66: Cloudflare AI Gateway integration (after P65)
- P67: Build farm agentic-loop integration (after P65)
- P68: Monthly retrain CI pipeline (after P64 + P65)
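The dependency chain above can be expressed as a small readiness check (a task is workable once all of its prerequisites are done). Task IDs are from the list; the helper itself is illustrative:

```javascript
// Prerequisite map taken directly from the workqueue priorities.
const deps = {
  P64: [],
  P65: ["P64"],
  P66: ["P65"],
  P67: ["P65"],
  P68: ["P64", "P65"],
};

// A task is ready when it is not yet done and every dependency is done.
const ready = (done) =>
  Object.keys(deps).filter(
    (t) => !done.has(t) && deps[t].every((d) => done.has(d))
  );

console.log(ready(new Set()));               // ["P64"]
console.log(ready(new Set(["P64"])));        // ["P65"]
console.log(ready(new Set(["P64", "P65"]))); // ["P66", "P67", "P68"]
```

Note that P66, P67, and P68 unblock simultaneously once P65 lands, so they can run in parallel.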
Build Farm — Local Ollama + API Providers (Updated 2026-03-10 afternoon)
Local Models (OPERATIONAL — March 10, 2026)
- Ollama installed on BOTH EC2s (farm-1 + farm-2)
- Primary: Qwen 2.5 Coder 3B (`qwen2.5-coder:3b`) — code-specialized, Tier 0
- Fallback: Llama 3.2 3B (`llama3.2:3b`) — general-purpose, Tier 0.5
- Config: `OLLAMA_MAX_LOADED_MODELS=1` (only one model in memory at a time, 8GB RAM limit)
- Config: `OLLAMA_NUM_PARALLEL=1` (one request at a time per EC2)
- Endpoint: `http://localhost:11434/v1/chat/completions` (OpenAI-compatible)
- Speed: ~2-3 min per complex response with tools on t3.large CPU
- Rate limits: NONE — unlimited, runs on our hardware
- Concurrency: 1 agent per EC2 (RAM + CPU constraint)
- Qwen behavior: Returns tool calls as JSON in the content field (not structured `tool_calls`). The agentic loop has `tryParseContentToolCall()` to handle this.
- Fetch timeout: 3 minutes via AbortController. Catch-all error handler — never crashes on a fetch failure.
- API keys: Deployed to `/home/ec2-user/.agentic-env` — must `source` it before running the agentic loop
- This IS Clara v0 — same architecture that becomes Nikki (free tier) for public users
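The actual `tryParseContentToolCall()` lives in agentic-loop.js and is not reproduced here; the following is a hedged sketch of what such a parser typically does. The fence-stripping regex and the `{name, arguments}` field shape are assumptions about Qwen's output, not the real implementation:

```javascript
// Recover a tool call that the model emitted as plain JSON text in the
// content field (sometimes wrapped in a ```json fence) instead of a
// structured tool_calls array.
function tryParseContentToolCall(content) {
  if (!content) return null;
  // Strip an optional markdown code fence around the JSON.
  const fenced = content.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : content).trim();
  try {
    const obj = JSON.parse(candidate);
    if (obj && typeof obj.name === "string" && obj.arguments !== undefined) {
      return { name: obj.name, arguments: obj.arguments };
    }
  } catch {
    // Not JSON: treat as a normal text response.
  }
  return null;
}

const call = tryParseContentToolCall(
  '```json\n{"name":"read_file","arguments":{"path":"a.js"}}\n```'
);
console.log(call); // { name: "read_file", arguments: { path: "a.js" } }
```

Returning `null` for non-JSON content lets the loop fall through to treating the message as ordinary assistant text.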
Complexity-Based Routing (PLANNED — in requirements doc)
- COMPLEX tasks (refactor, debug, investigate) → prefer API models (Groq 17B, SambaNova 70B)
- SIMPLE tasks (add, edit, commit) → prefer local Qwen Coder
- Iterations 1-3 (planning) → prefer API. Iterations 4+ (execution) → prefer local.
- Saves API tokens for when they matter most. ~50-100 API calls/day across 5 providers.
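Since this routing is still at the requirements stage, here is only a sketch of the rule as described above; `pickTier`, its task-type list, and the iteration threshold are illustrative names, not implemented code:

```javascript
// Task types the plan routes to API models regardless of iteration.
const COMPLEX = new Set(["refactor", "debug", "investigate"]);

// Planning iterations (1-3) and complex tasks prefer API models;
// simple execution work stays on local Qwen Coder.
function pickTier(taskType, iteration) {
  if (COMPLEX.has(taskType) || iteration <= 3) return "api";
  return "local";
}

console.log(pickTier("refactor", 5)); // "api"   (complex task)
console.log(pickTier("edit", 2));     // "api"   (planning iteration)
console.log(pickTier("commit", 6));   // "local" (simple, execution phase)
```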
API Providers — Multi-Account Key Rotation (Updated 2026-03-11)
ALL keys are from SEPARATE email accounts (independent daily quotas). Rotation code exists in agentic-loop.js (lines 201-246).
| Provider | Keys | Accounts | Status |
|---|---|---|---|
| Groq (Llama 4 Scout 17B) | 4 | quikinfluence, quikcarry, quiknation, quikcarrental | WORKING |
| Cerebras (Llama 3.1 8B) | 5 | quikinfluence, quiknation, quikcarrental, quikevents, quikhuddle | WORKING |
| SambaNova (Llama 3.3 70B) | 2 | quikinfluence, quikcarrental | WORKING |
| Google AI (Gemini 2.0 Flash) | 1 | quikinfluence | WORKING (1,500 req/day limit) |
| OpenRouter (Qwen3 Coder) | 1 | quikinfluence | WORKING |
| Together | 1 | quikinfluence | BROKEN — 401 invalid key, needs new key from dashboard |
| DeepSeek | 1 | quikinfluence | BROKEN — 402 no balance, needs credit added |
- Total working: 13 API keys across 5 providers, plus 2 local Ollama models = 15 entries in the model pool
- Total broken: 2 providers (Together, DeepSeek) — NOT in agentic-loop.js MODELS array, NOT in .agentic-env
- Should sustain 24 hours with separate accounts — if keys exhaust early, the bug is in the agentic loop (retry storms or broken rotation), NOT in the key setup.
All API keys stored in SSM at `/quik-nation/build-farm/{provider}-api-key`
Keys deployed to EC2s at `/home/ec2-user/.agentic-env`
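The rotation code itself is in agentic-loop.js (lines 201-246) and is not reproduced here; what follows is a hedged sketch of the general pattern, round-robin over per-account keys with a skip for quota-exhausted accounts. All names (`makeRotator`, `markExhausted`) are illustrative:

```javascript
// Round-robin key rotator: each provider gets one rotator over its
// per-account keys; a key flagged as exhausted (e.g. on HTTP 429) is
// skipped until the set is reset for the next quota day.
function makeRotator(keys) {
  let i = 0;
  const exhausted = new Set();
  return {
    next() {
      for (let tried = 0; tried < keys.length; tried++) {
        const key = keys[i % keys.length];
        i++;
        if (!exhausted.has(key)) return key;
      }
      return null; // every account hit its daily quota
    },
    markExhausted(key) { exhausted.add(key); },
  };
}

const groq = makeRotator(["key-a", "key-b", "key-c"]);
console.log(groq.next()); // "key-a"
console.log(groq.next()); // "key-b"
groq.markExhausted("key-c");
console.log(groq.next()); // skips key-c, wraps back to "key-a"
```

Returning `null` rather than throwing lets the caller fall through to the next provider in the chain, which matters for the "should sustain 24 hours" claim above.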
Requirements Doc for Sonnet
.claude/plans/clara-model-routing-architecture.md (also in .cursor/plans/)
- 5 requirements (R1-R5): smart routing, concurrency, iteration-based selection, fallback chain, metrics
- 6 deliverables (D1-D6): routing logic, dispatch update, supervisor, registry, subdomains, Clara chat-assisted bug reporting
- 5-phase growth: $0 → $30-170/mo → $150-300/mo