ML-Powered Intelligence: 500x Faster, 92% Cheaper
November 22, 2025
Today we’re announcing embedded machine learning in Chuchu. Two lightweight ML models now power instant decision-making with zero external dependencies and zero API costs.
The bigger picture: Commercial AI copilots charge $20-30/month per user or $200 per 50K requests ($4,000/1M requests). Chuchu with Groq costs ~$316/1M requests (92% cheaper) or $0 with Ollama. ML routing is one piece of how we achieve this.
The Problem with LLM-Only Routing
Every time you interact with an AI coding assistant, there’s a hidden cost:
Intent classification - deciding whether you want to read code, edit it, or search documentation - requires an LLM call:
- Latency: ~500ms per request
- Cost: ~$0.000005 per classification (using llama-3.1-8b-instant)
- Adds up: routing happens on every interaction, so even small costs are pure overhead on top of your actual coding work
For simple decisions like “is this a query or an edit?”, calling a 70B parameter model is overkill.
The Solution: Hybrid ML + LLM
Chuchu now embeds two ML models that run locally in pure Go:
1. Intent Classifier
Routes user requests to the right agent (query, editor, research, review):
| Metric | ML Classifier | LLM Router |
|---|---|---|
| Latency | ~1ms | ~500ms |
| Cost | $0 | $0.000005 |
| Accuracy | 85-90% | 95%+ |
Training data: The model is trained on 1,200+ examples including the NL2Bash corpus[^1], which provides natural language descriptions of shell commands. This enables accurate classification of shell/git command requests like “run git diff and create commit message”.
Smart fallback: When the ML model is uncertain (confidence < threshold), it falls back to the LLM. You get speed when possible, accuracy when needed.
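In code terms, the fast-path/fallback decision is just a confidence check. Here's a minimal Go sketch of the pattern; the type and method names are illustrative, not Chuchu's actual internals:

```go
// Minimal sketch of hybrid ML + LLM routing (names are illustrative,
// not Chuchu's actual API). The ML classifier answers first; if its
// confidence is below the configured threshold, we fall back to the LLM.
package router

import "context"

type Classifier interface {
	// Classify returns the predicted intent and a confidence in [0, 1].
	Classify(input string) (intent string, confidence float64)
}

type LLMRouter interface {
	// Route asks the LLM to pick an intent; slower (~500ms) but more accurate.
	Route(ctx context.Context, input string) (string, error)
}

type HybridRouter struct {
	ML        Classifier
	LLM       LLMRouter
	Threshold float64 // e.g. 0.7 (defaults.ml_intent_threshold)
}

func (h *HybridRouter) Route(ctx context.Context, input string) (string, error) {
	if intent, conf := h.ML.Classify(input); conf >= h.Threshold {
		return intent, nil // fast path: ~1ms, $0
	}
	return h.LLM.Route(ctx, input) // fallback: ~500ms, ~$0.000005
}
```

The only tunable here is the threshold: raising it shifts more traffic to the LLM fallback, trading speed for accuracy.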
2. Complexity Classifier
Auto-detects if a task is simple or complex:
chu chat "fix typo in readme"
# → Simple task, stays in chat mode
chu chat "implement oauth2 with jwt"
# → Complex task, automatically activates Guided Mode
No more manual mode switching. The system adapts to task complexity automatically.
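The complexity gate follows the same shape as the intent router. A hypothetical Go sketch, assuming a classifier that returns a label and a confidence:

```go
// Hypothetical sketch: gate Guided Mode on the complexity classifier's
// confidence. The threshold mirrors defaults.ml_complex_threshold (0.55);
// uncertain predictions simply stay in plain chat here for brevity.
func pickMode(classify func(string) (label string, conf float64), prompt string) string {
	if label, conf := classify(prompt); label == "complex" && conf >= 0.55 {
		return "guided" // complex task → Guided Mode
	}
	return "chat" // simple or uncertain → stay in chat mode
}
```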
Real-World Impact
For 1000 requests per day (typical heavy usage):
Without ML (LLM only)
- Latency: 500ms × 1000 = 8.3 minutes of waiting
- Routing cost: $0.000005 × 1000 = $0.15/month
With ML (80% ML, 20% LLM fallback)
- Latency: (1ms × 800) + (500ms × 200) = 1.7 minutes
- Routing cost: ($0 × 800) + ($0.000005 × 200) = $0.03/month
Result:
- 80% faster (8.3 min → 1.7 min)
- 80% cheaper routing ($0.15 → $0.03)
- Same accuracy (smart fallback maintains quality)
Note: These are routing overhead costs only. Your actual LLM costs for coding tasks (queries, edits, research) are separate and unchanged.
How It Works
Both models use the same architecture:
```
User Input
    ↓
TF-IDF Vectorization (1-3 grams)
    ↓
Logistic Regression
    ↓
Confidence Score
    ↓
High confidence → ML result (~1ms)
Low confidence  → LLM fallback (~500ms)
```
Model specs:
- Size: 19-66KB (tiny!)
- Runtime: Pure Go, zero Python dependencies
- Training: scikit-learn, exports to JSON
- Inference: <1ms per prediction
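To make the pipeline concrete, here's a minimal Go sketch of loading a JSON-exported model and running TF-IDF + logistic regression inference. The struct fields and JSON layout are assumptions for illustration (mirroring what scikit-learn's TfidfVectorizer and LogisticRegression expose), not Chuchu's actual model format, and it only uses word 1-grams for brevity:

```go
// Sketch of the inference pipeline: JSON model → TF-IDF features →
// logistic regression scores → softmax confidence. Field names and
// layout are hypothetical.
package mlmodel

import (
	"encoding/json"
	"math"
	"os"
	"strings"
)

type Model struct {
	Vocab   map[string]int `json:"vocab"`   // n-gram → feature index
	IDF     []float64      `json:"idf"`     // inverse document frequencies
	Classes []string       `json:"classes"` // e.g. query, editor, research, review
	Coef    [][]float64    `json:"coef"`    // one weight vector per class
	Bias    []float64      `json:"bias"`    // one intercept per class
}

func Load(path string) (*Model, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var m Model
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

// Predict returns the best class and a softmax confidence score.
func (m *Model) Predict(input string) (string, float64) {
	// Term frequencies over word 1-grams (the real model also uses
	// 2- and 3-grams; omitted here for brevity).
	tf := map[int]float64{}
	for _, tok := range strings.Fields(strings.ToLower(input)) {
		if idx, ok := m.Vocab[tok]; ok {
			tf[idx]++
		}
	}
	// Logistic regression: one linear score per class.
	scores := make([]float64, len(m.Classes))
	for c := range m.Classes {
		s := m.Bias[c]
		for idx, count := range tf {
			s += m.Coef[c][idx] * count * m.IDF[idx]
		}
		scores[c] = s
	}
	// Pick the best class; softmax gives its confidence.
	best := 0
	for c, s := range scores {
		if s > scores[best] {
			best = c
		}
	}
	sum := 0.0
	for _, s := range scores {
		sum += math.Exp(s - scores[best]) // subtract max for numerical stability
	}
	return m.Classes[best], 1.0 / sum // softmax probability of the best class
}
```

The returned confidence is exactly what the hybrid router compares against the threshold: below it, the request goes to the LLM instead.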
Configuration
Both models have configurable confidence thresholds:
```bash
# Intent classification (default: 0.7)
chu config get defaults.ml_intent_threshold
chu config set defaults.ml_intent_threshold 0.8   # More conservative

# Complexity detection (default: 0.55)
chu config get defaults.ml_complex_threshold
chu config set defaults.ml_complex_threshold 0.6  # Less Guided Mode
```
Lower threshold = more ML (faster but riskier). Higher threshold = more LLM fallback (slower but safer).
CLI Commands
Test the models interactively
```bash
# Test intent classification
chu ml test intent
> explain this code
Prediction: query (confidence: 0.89)

# Test complexity detection
chu ml test complexity
> implement oauth2 authentication
Prediction: complex (confidence: 0.91) → Would trigger Guided Mode
```
Evaluate accuracy
```bash
chu ml eval intent
Overall Accuracy: 89.1%

chu ml eval complexity
Overall Accuracy: 92.3%
```
Train your own models
```bash
# Retrain with your own examples
chu ml train intent
chu ml train complexity
```
Training data is in ml/{model}/data/training_data.csv - add your own examples and retrain to customize the models for your workflow.
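For reference, here's what a few rows might look like. The exact column names are hypothetical (check the CSV header in your checkout), but a text/label pair per row is the typical shape for this kind of classifier:

```csv
text,label
"explain this function",query
"rename parseConfig to loadConfig in utils.go",editor
"what are best practices for rate limiting?",research
"review this diff for security issues",review
```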
Total Cost: Chuchu vs Commercial Copilots
Let’s compare total costs (routing + actual work) for 1M coding requests:
Commercial AI Copilots
GitHub Copilot / Cursor / Others:
- Typical pricing: $20-30/month per user (flat rate)
- Or: $200 per 50K requests for API access
- 1M requests = $4,000
Chuchu with Groq (Recommended Budget Setup)
Configuration:
- Router: llama-3.1-8b-instant (ML handles 80%, LLM 20%)
- Query: gpt-oss-120b (free tier on OpenRouter)
- Editor: llama-3.3-70b-versatile
- Research: groq/compound (free)
Cost breakdown for 1M requests:
```
Routing (20% LLM, 80% ML):
  200K × $0.000005 = $1

Query agent (300K requests, ~500 tokens avg):
  150M tokens × $0/M = $0 (free tier)

Editor agent (500K requests, ~1000 tokens avg):
  500M tokens:
    Input  (400M) × $0.59/M = $236
    Output (100M) × $0.79/M = $79
    Subtotal: $315

Research agent (200K requests):
  Free (groq/compound)

Total: ~$316 for 1M requests (vs $4,000 commercial)
```
Savings: $3,684 (92% cheaper)
Chuchu with Ollama (Zero Cost)
Configuration:
- All agents: qwen2.5-coder:32b (local)
- Hardware: Needs 32GB RAM (~$500 one-time)
Cost for 1M requests: $0 (after hardware)
Electricity: ~$5-10/month for 24/7 usage
The Real Difference
ML routing saves about $4 in routing overhead per 1M requests, but that’s not the story.
The story is:
- Chuchu with Groq: $316/1M requests (92% cheaper than commercial)
- Chuchu with Ollama: $0/1M requests (100% cheaper)
- Commercial copilots: $4,000/1M requests
ML routing is just one piece. The real savings come from using affordable/free models with multi-agent architecture.
Routing Cost at Scale
Now let’s zoom in on just the routing overhead:
Routing costs scale with usage. For a codebase processing 100M tokens in actual coding work:
Typical usage pattern:
- 100M tokens of actual work (queries, edits, research)
- Estimated ~1.33M user interactions (averaging 75 tokens per request)
- Each interaction needs routing decision (85 tokens: 75 input + 10 output)
Without ML (LLM routing):
- 1.33M routing calls × $0.000005 = $6.65 routing overhead
- Plus your $100+ in actual LLM work costs
- Total: $106.65+
With ML (80% confidence, 20% LLM fallback):
- ML calls: 1.06M requests (80%) = $0
- LLM fallback: 266K requests (20%) × $0.000005 = $1.33
- Plus your $100+ in actual LLM work costs (unchanged)
- Total: $101.33+
- Routing savings: $5.32 (80% reduction on routing overhead)
Important: The $100+ in actual LLM usage for your coding tasks is separate and unchanged. ML only reduces the routing overhead (~5% of total costs).
Impact on Free Tier Profiles
Many OpenRouter models have rate limits (free tier). ML routing helps you stay under limits:
Example: Grok 4.1 Fast (free tier)
- Limit: ~100 requests/minute
- Without ML: Router uses 100 req/min = hitting limit
- With ML: Router uses 20 req/min = 80% capacity freed
This means you can handle 5x more chat interactions before hitting rate limits on free models.
Why This Matters
Speed: Fast routing means the assistant feels instant. No more waiting 500ms just to figure out what you want to do.
Cost: Routing happens on every interaction. At scale, ML routing saves real money without sacrificing quality.
Privacy: ML models run locally. Your task descriptions never leave your machine for simple routing decisions.
Offline: No internet? No problem. ML models work completely offline.
Customizable: Don’t like the default behavior? Adjust thresholds or retrain with your own data.
Real Usage Example
```bash
$ chu chat
> show me the auth code
# ML: "show" → query agent (1ms)
# Reads relevant files, explains auth flow

> add rate limiting to login endpoint
# ML: "add" → editor agent (1ms)
# Opens editor, implements rate limiting

> how do other projects handle rate limiting?
# ML: uncertain (confidence: 0.65 < 0.7)
# Fallback: LLM router → research agent (500ms)
# Searches web, summarizes best practices

> fix the test that's failing
# ML: "fix" + "test" → editor agent (1ms)
# Runs tests, identifies issue, fixes it
```
Result: 3 out of 4 requests used ML (1ms), only 1 used LLM (500ms). Total routing time: 503ms instead of 2000ms.
The Hybrid Advantage
Pure ML would be faster but less accurate. Pure LLM would be more accurate but slower and more expensive.
Hybrid ML + LLM[^2] gives you the best of both worlds:
- Fast path for confident decisions (80-90% of requests)
- Smart fallback for edge cases
- Configurable balance between speed and accuracy
Technical Details
For implementation details, training data format, and hyperparameter tuning, see our ML Features documentation.
For performance benchmarks and ROI analysis, check the full metrics in the docs.
Try It Now
The ML models are embedded in Chuchu by default - no setup required. They “just work.”
Want to see them in action?
```bash
# Interactive testing
chu ml test intent
chu ml test complexity

# Check your current config
chu config get defaults.ml_intent_threshold
chu config get defaults.ml_complex_threshold

# View training data
cat ml/intent/data/training_data.csv
cat ml/complexity_detection/data/training_data.csv
```
What’s Next
We’re exploring:
- Confidence calibration: Even better fallback decisions
- Active learning: Learn from your corrections
- Code-specific features: Leverage AST patterns, not just text
- Personalization: Models that adapt to your coding style
But the foundation is here today: fast, cheap, accurate routing powered by embedded ML.
Summary
- 500x faster routing: ~1ms vs ~500ms routing latency
- 92% total cost savings: ~$316 vs $4,000 per 1M requests (Groq setup)
- Or 100% savings: $0 with Ollama (local setup)
- Zero-dependency runtime: inference in pure Go
- Smart fallback: LLM when uncertain
- Routing optimization: saves ~$4 per 1M requests on overhead
- Rate-limit friendly: 5x more requests on free-tier profiles
Have questions about the ML system? Check out the full documentation or ask in GitHub Discussions!
References

[^1]: Lin, X. V., Wang, C., Zettlemoyer, L., & Ernst, M. D. (2018). NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. LREC 2018. https://victorialin.org/pubs/nl2bash.pdf

[^2]: Teerapittayanon, S., McDanel, B., & Kung, H. T. (2016). BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. ICPR 2016. https://arxiv.org/abs/1709.01686

See Also

- Full ML Features Documentation - Technical deep dive
- Groq Optimal Configurations - Combine ML routing with cheap Groq models
- Cost Optimization - Additional cost-saving strategies