ML-Powered Intelligence: 500x Faster, 92% Cheaper
November 22, 2025
Today we’re announcing embedded machine learning in Chuchu. Two lightweight ML models now power instant decision-making with zero external dependencies and zero API costs.
The bigger picture: Commercial AI copilots charge $20-30/month per user or $200 per 50K requests ($4,000/1M requests). Chuchu with Groq costs ~$316/1M requests (92% cheaper) or $0 with Ollama. ML routing is one piece of how we achieve this.
The Problem with LLM-Only Routing
Every time you interact with an AI coding assistant, there’s a hidden cost:
Intent classification - deciding whether you want to read code, edit it, or search documentation - requires an LLM call:
- Latency: ~500ms per request
- Cost: ~$0.000005 per classification (using llama-3.1-8b-instant)
- Adds up: routing happens on every interaction, so even small costs are pure overhead on top of your actual coding work
For simple decisions like “is this a query or an edit?”, calling a 70B parameter model is overkill.
The Solution: Hybrid ML + LLM
Chuchu now embeds two ML models that run locally in pure Go:
1. Intent Classifier
Routes user requests to the right agent (query, editor, research, review):
| Metric | ML Classifier | LLM Router |
|---|---|---|
| Latency | ~1ms | ~500ms |
| Cost | $0 | $0.000005 |
| Accuracy | 85-90% | 95%+ |
Training data: The model is trained on 1,200+ examples including the NL2Bash corpus[^1], which provides natural language descriptions of shell commands. This enables accurate classification of shell/git command requests like “run git diff and create commit message”.
Smart fallback: When the ML model is uncertain (confidence < threshold), it falls back to the LLM. You get speed when possible, accuracy when needed.
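In code terms, the fast-path/fallback decision is just a confidence check. Here's a minimal Go sketch of the pattern; the type and method names are illustrative, not Chuchu's actual internals:

```go
// Minimal sketch of hybrid ML + LLM routing (names are illustrative,
// not Chuchu's actual API). The ML classifier answers first; if its
// confidence is below the configured threshold, we fall back to the LLM.
package router

import "context"

type Classifier interface {
	// Classify returns the predicted intent and a confidence in [0, 1].
	Classify(input string) (intent string, confidence float64)
}

type LLMRouter interface {
	// Route asks the LLM to pick an intent; slower (~500ms) but more accurate.
	Route(ctx context.Context, input string) (string, error)
}

type HybridRouter struct {
	ML        Classifier
	LLM       LLMRouter
	Threshold float64 // e.g. 0.7 (defaults.ml_intent_threshold)
}

func (h *HybridRouter) Route(ctx context.Context, input string) (string, error) {
	if intent, conf := h.ML.Classify(input); conf >= h.Threshold {
		return intent, nil // fast path: ~1ms, $0
	}
	return h.LLM.Route(ctx, input) // fallback: ~500ms, ~$0.000005
}
```

The only tunable here is the threshold: raising it shifts more traffic to the LLM fallback, trading speed for accuracy.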
2. Complexity Classifier
Auto-detects if a task is simple or complex:
chu chat "fix typo in readme"
# → Simple task, stays in chat mode
chu chat "implement oauth2 with jwt"
# → Complex task, automatically activates Guided Mode
No more manual mode switching. The system adapts to task complexity automatically.
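The complexity gate follows the same shape as the intent router. A hypothetical Go sketch, assuming a classifier that returns a label and a confidence:

```go
// Hypothetical sketch: gate Guided Mode on the complexity classifier's
// confidence. The threshold mirrors defaults.ml_complex_threshold (0.55);
// uncertain predictions simply stay in plain chat here for brevity.
func pickMode(classify func(string) (label string, conf float64), prompt string) string {
	if label, conf := classify(prompt); label == "complex" && conf >= 0.55 {
		return "guided" // complex task → Guided Mode
	}
	return "chat" // simple or uncertain → stay in chat mode
}
```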
Real-World Impact
For 1000 requests per day (typical heavy usage):
Without ML (LLM only)
- Latency: 500ms × 1000 = 8.3 minutes of waiting
- Routing cost: $0.000005 × 1000 = $0.15/month
With ML (80% ML, 20% LLM fallback)
- Latency: (1ms × 800) + (500ms × 200) = 1.7 minutes
- Routing cost: ($0 × 800) + ($0.000005 × 200) = $0.03/month
Result:
- 80% faster (8.3 min → 1.7 min)
- 80% cheaper routing ($0.15 → $0.03)
- Same accuracy (smart fallback maintains quality)
Note: These are routing overhead costs only. Your actual LLM costs for coding tasks (queries, edits, research) are separate and unchanged.
How It Works
Both models use the same architecture:
```
User Input
    ↓
TF-IDF Vectorization (1-3 grams)
    ↓
Logistic Regression
    ↓
Confidence Score
    ↓
High confidence → ML result (~1ms)
Low confidence  → LLM fallback (~500ms)
```
Model specs:
- Size: 19-66KB (tiny!)
- Runtime: Pure Go, zero Python dependencies
- Training: scikit-learn, exports to JSON
- Inference: <1ms per prediction
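To make the pipeline concrete, here's a minimal Go sketch of loading a JSON-exported model and running TF-IDF + logistic regression inference. The struct fields and JSON layout are assumptions for illustration (mirroring what scikit-learn's TfidfVectorizer and LogisticRegression expose), not Chuchu's actual model format, and it only uses word 1-grams for brevity:

```go
// Sketch of the inference pipeline: JSON model → TF-IDF features →
// logistic regression scores → softmax confidence. Field names and
// layout are hypothetical.
package mlmodel

import (
	"encoding/json"
	"math"
	"os"
	"strings"
)

type Model struct {
	Vocab   map[string]int `json:"vocab"`   // n-gram → feature index
	IDF     []float64      `json:"idf"`     // inverse document frequencies
	Classes []string       `json:"classes"` // e.g. query, editor, research, review
	Coef    [][]float64    `json:"coef"`    // one weight vector per class
	Bias    []float64      `json:"bias"`    // one intercept per class
}

func Load(path string) (*Model, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var m Model
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

// Predict returns the best class and a softmax confidence score.
func (m *Model) Predict(input string) (string, float64) {
	// Term frequencies over word 1-grams (the real model also uses
	// 2- and 3-grams; omitted here for brevity).
	tf := map[int]float64{}
	for _, tok := range strings.Fields(strings.ToLower(input)) {
		if idx, ok := m.Vocab[tok]; ok {
			tf[idx]++
		}
	}
	// Logistic regression: one linear score per class.
	scores := make([]float64, len(m.Classes))
	for c := range m.Classes {
		s := m.Bias[c]
		for idx, count := range tf {
			s += m.Coef[c][idx] * count * m.IDF[idx]
		}
		scores[c] = s
	}
	// Pick the best class; softmax gives its confidence.
	best := 0
	for c, s := range scores {
		if s > scores[best] {
			best = c
		}
	}
	sum := 0.0
	for _, s := range scores {
		sum += math.Exp(s - scores[best]) // subtract max for numerical stability
	}
	return m.Classes[best], 1.0 / sum // softmax probability of the best class
}
```

The returned confidence is exactly what the hybrid router compares against the threshold: below it, the request goes to the LLM instead.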
Configuration
Both models have configurable confidence thresholds:
```bash
# Intent classification (default: 0.7)
chu config get defaults.ml_intent_threshold
chu config set defaults.ml_intent_threshold 0.8   # More conservative

# Complexity detection (default: 0.55)
chu config get defaults.ml_complex_threshold
chu config set defaults.ml_complex_threshold 0.6  # Less Guided Mode
```
Lower threshold = more ML (faster but riskier). Higher threshold = more LLM fallback (slower but safer).
CLI Commands
Test the models interactively
```bash
# Test intent classification
chu ml test intent
> explain this code
Prediction: query (confidence: 0.89)

# Test complexity detection
chu ml test complexity
> implement oauth2 authentication
Prediction: complex (confidence: 0.91) → Would trigger Guided Mode
```
Evaluate accuracy
```bash
chu ml eval intent
Overall Accuracy: 89.1%

chu ml eval complexity
Overall Accuracy: 92.3%
```
Train your own models
```bash
# Retrain with your own examples
chu ml train intent
chu ml train complexity
```
Training data is in ml/{model}/data/training_data.csv - add your own examples and retrain to customize the models for your workflow.
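For reference, here's what a few rows might look like. The exact column names are hypothetical (check the CSV header in your checkout), but a text/label pair per row is the typical shape for this kind of classifier:

```csv
text,label
"explain this function",query
"rename parseConfig to loadConfig in utils.go",editor
"what are best practices for rate limiting?",research
"review this diff for security issues",review
```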
Total Cost: Chuchu vs Commercial Copilots
Let’s compare total costs (routing + actual work) for 1M coding requests:
Commercial AI Copilots
GitHub Copilot / Cursor / Others:
- Typical pricing: $20-30/month per user (flat rate)
- Or: $200 per 50K requests for API access
- 1M requests = $4,000
Chuchu with Groq (Recommended Budget Setup)
Configuration:
- Router: llama-3.1-8b-instant (ML handles 80%, LLM 20%)
- Query: gpt-oss-120b (free tier on OpenRouter)
- Editor: llama-3.3-70b-versatile
- Research: groq/compound (free)
Cost breakdown for 1M requests:
```
Routing (20% LLM, 80% ML):
  200K × $0.000005 = $1

Query agent (300K requests, ~500 tokens avg):
  150M tokens × $0/M = $0 (free tier)

Editor agent (500K requests, ~1000 tokens avg):
  500M tokens:
    Input  (400M) × $0.59/M = $236
    Output (100M) × $0.79/M = $79
    Subtotal: $315

Research agent (200K requests):
  Free (groq/compound)

Total: ~$316 for 1M requests (vs $4,000 commercial)
```
Savings: $3,684 (92% cheaper)
Chuchu with Ollama (Zero Cost)
Configuration:
- All agents: qwen2.5-coder:32b (local)
- Hardware: Needs 32GB RAM (~$500 one-time)
Cost for 1M requests: $0 (after hardware)
Electricity: ~$5-10/month for 24/7 usage
The Real Difference
ML routing saves about $4 in routing overhead per 1M requests, but that’s not the story.
The story is:
- Chuchu with Groq: $316/1M requests (92% cheaper than commercial)
- Chuchu with Ollama: $0/1M requests (100% cheaper)
- Commercial copilots: $4,000/1M requests
ML routing is just one piece. The real savings come from using affordable/free models with multi-agent architecture.
Routing Cost at Scale
Now let’s zoom in on just the routing overhead:
Routing costs scale with usage. For a codebase processing 100M tokens in actual coding work:
Typical usage pattern:
- 100M tokens of actual work (queries, edits, research)
- Estimated ~1.33M user interactions (averaging 75 tokens per request)
- Each interaction needs routing decision (85 tokens: 75 input + 10 output)
Without ML (LLM routing):
- 1.33M routing calls × $0.000005 = $6.65 routing overhead
- Plus your $100+ in actual LLM work costs
- Total: $106.65+
With ML (80% confidence, 20% LLM fallback):
- ML calls: 1.06M requests (80%) = $0
- LLM fallback: 266K requests (20%) × $0.000005 = $1.33
- Plus your $100+ in actual LLM work costs (unchanged)
- Total: $101.33+
- Routing savings: $5.32 (80% reduction on routing overhead)
Important: The $100+ in actual LLM usage for your coding tasks is separate and unchanged. ML only reduces the routing overhead (~5% of total costs).
Impact on Free Tier Profiles
Many OpenRouter models have rate limits (free tier). ML routing helps you stay under limits:
Example: Grok 4.1 Fast (free tier)
- Limit: ~100 requests/minute
- Without ML: Router uses 100 req/min = hitting limit
- With ML: Router uses 20 req/min = 80% capacity freed
This means you can handle 5x more chat interactions before hitting rate limits on free models.
Why This Matters
Speed: Fast routing means the assistant feels instant. No more waiting 500ms just to figure out what you want to do.
Cost: Routing happens on every interaction. At scale, ML routing saves real money without sacrificing quality.
Privacy: ML models run locally. Your task descriptions never leave your machine for simple routing decisions.
Offline: No internet? No problem. ML models work completely offline.
Customizable: Don’t like the default behavior? Adjust thresholds or retrain with your own data.
Real Usage Example
```bash
$ chu chat
> show me the auth code
# ML: "show" → query agent (1ms)
# Reads relevant files, explains auth flow

> add rate limiting to login endpoint
# ML: "add" → editor agent (1ms)
# Opens editor, implements rate limiting

> how do other projects handle rate limiting?
# ML: uncertain (confidence: 0.65 < 0.7)
# Fallback: LLM router → research agent (500ms)
# Searches web, summarizes best practices

> fix the test that's failing
# ML: "fix" + "test" → editor agent (1ms)
# Runs tests, identifies issue, fixes it
```
Result: 3 out of 4 requests used ML (1ms), only 1 used LLM (500ms). Total routing time: 503ms instead of 2000ms.
The Hybrid Advantage
Pure ML would be faster but less accurate. Pure LLM would be more accurate but slower and more expensive.
Hybrid ML + LLM[^2] gives you the best of both worlds:
- Fast path for confident decisions (80-90% of requests)
- Smart fallback for edge cases
- Configurable balance between speed and accuracy
Technical Details
For implementation details, training data format, and hyperparameter tuning, see our ML Features documentation.
For performance benchmarks and ROI analysis, check the full metrics in the docs.
Try It Now
The ML models are embedded in Chuchu by default - no setup required. They “just work.”
Want to see them in action?
```bash
# Interactive testing
chu ml test intent
chu ml test complexity

# Check your current config
chu config get defaults.ml_intent_threshold
chu config get defaults.ml_complex_threshold

# View training data
cat ml/intent/data/training_data.csv
cat ml/complexity_detection/data/training_data.csv
```
What’s Next
We’re exploring:
- Confidence calibration: Even better fallback decisions
- Active learning: Learn from your corrections
- Code-specific features: Leverage AST patterns, not just text
- Personalization: Models that adapt to your coding style
But the foundation is here today: fast, cheap, accurate routing powered by embedded ML.
Summary
- 500x faster routing: ~1ms vs ~500ms routing latency
- 92% total cost savings: ~$316 vs $4,000 per 1M requests (Groq setup)
- Or 100% savings: $0 with Ollama (local setup)
- Zero-dependency runtime: inference in pure Go
- Smart fallback: LLM when uncertain
- Routing optimization: saves ~$4 per 1M requests on overhead
- Rate-limit friendly: 5x more requests on free-tier profiles
Have questions about the ML system? Check out the full documentation or ask in GitHub Discussions!
References

[^1]: Lin, X. V., Wang, C., Zettlemoyer, L., & Ernst, M. D. (2018). NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. LREC 2018. https://victorialin.org/pubs/nl2bash.pdf

[^2]: Teerapittayanon, S., McDanel, B., & Kung, H. T. (2016). BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. ICPR 2016. https://arxiv.org/abs/1709.01686

See Also

- Full ML Features Documentation - Technical deep dive
- Groq Optimal Configurations - Combine ML routing with cheap Groq models
- Cost Optimization - Additional cost-saving strategies