# Running Chuchu Completely Offline with Ollama
Want to use Chuchu without sending your code to the cloud? Ollama lets you run powerful LLMs locally on your machine, completely free and private.
## Why Run Local Models?
- Privacy: Your code never leaves your machine
- Cost: $0 per token - run unlimited queries
- Speed: No network latency for small models
- Offline: Work anywhere, even without internet
- Control: Full control over model versions and updates
## Prerequisites
- Install Ollama: https://ollama.com/download
- Have at least 8GB RAM (16GB+ recommended for larger models)
- SSD storage (models can be 4-30GB each)
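Once Ollama is installed, it helps to confirm the server is actually reachable before configuring Chuchu. These are standard Ollama commands; port 11434 is Ollama's default:

```shell
# Check the CLI is on your PATH
ollama --version

# The local API answers on port 11434 by default;
# this lists the models you have pulled so far
curl -s http://localhost:11434/api/tags
```

If the `curl` call fails, start the server with `ollama serve` (on most desktop installs it runs automatically in the background).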
## Recommended Model Configurations

### Balanced Setup (16GB RAM)

Best all-around configuration for most machines:

```yaml
backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b          # Fast routing
      query: gpt-oss:latest        # 20B for comprehension
      editor: qwen3-coder:latest   # Specialized for code
      research: gpt-oss:latest     # Good at synthesis
```

Required models:

```bash
ollama pull llama3.1:8b          # ~4.7GB
ollama pull gpt-oss:latest       # ~13GB
ollama pull qwen3-coder:latest   # ~18GB
```

Total storage: ~36GB
### Performance Setup (32GB+ RAM)

For powerful machines that can run larger models:

```yaml
backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: deepseek-r1:32b
    agent_models:
      router: llama3.1:8b          # Fast routing
      query: deepseek-r1:32b       # Strong reasoning
      editor: qwen3-coder:latest   # Code specialist
      research: deepseek-r1:32b    # Excellent research
```

Required models:

```bash
ollama pull llama3.1:8b
ollama pull deepseek-r1:32b      # ~20GB
ollama pull qwen3-coder:latest
```

Total storage: ~43GB
### Minimal Setup (8GB RAM)

For resource-constrained machines:

```yaml
backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b
      query: llama3.1:8b
      editor: llama3.1:8b
      research: llama3.1:8b
```

Required models:

```bash
ollama pull llama3.1:8b
```

Total storage: ~4.7GB
## Model Recommendations by Task

### Router (Intent Classification)

- Best: `llama3.1:8b` - Fast and accurate enough
- Alternative: `phi3:mini` - Even smaller/faster

### Query (Code Analysis)

- Best: `deepseek-r1:32b` - Excellent reasoning
- Good: `gpt-oss:latest` - Strong comprehension
- Budget: `llama3.1:8b` - Decent understanding

### Editor (Code Generation)

- Best: `qwen3-coder:latest` - Specialized for code (30B MoE)
- Good: `deepseek-coder-v2:latest` - Strong coding model
- Budget: `codellama:13b` - Decent code generation

### Research (Information Synthesis)

- Best: `deepseek-r1:32b` - Excellent at reasoning
- Good: `gpt-oss:latest` - Good synthesis
- Budget: `llama3.1:8b` - Basic research
## Popular Ollama Models for Development
| Model | Size | Min RAM | Best For | Quantization |
|---|---|---|---|---|
| llama3.1:8b | 4.7GB | 8GB | Fast, general use | Q4_K_M |
| phi3:mini | 2.3GB | 4GB | Extremely fast routing | Q4_K_M |
| gpt-oss:latest | 13GB | 16GB | Comprehension, analysis | MXFP4 |
| qwen3-coder:latest | 18GB | 20GB | Code generation | Q4_K_M |
| deepseek-r1:32b | 20GB | 24GB | Reasoning, research | Q4_K_M |
| deepseek-coder-v2 | 16GB | 18GB | Code-focused tasks | Q4_K_M |
| codellama:13b | 7.4GB | 12GB | Budget code generation | Q4_K_M |
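If you want to confirm a model's parameter count, context length, and quantization before committing disk space, `ollama show` prints those details for any pulled model:

```shell
# Print architecture, parameter count, context length,
# and quantization for an installed model
ollama show llama3.1:8b
```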
## Setting Up

1. Pull your chosen models:

   ```bash
   ollama pull llama3.1:8b
   ollama pull gpt-oss:latest
   ollama pull qwen3-coder:latest
   ```

2. Update Chuchu's model catalog:

   ```bash
   chu models update
   ```

   This will automatically detect installed Ollama models.

3. Configure in Neovim: press `Ctrl+X` in the chat buffer, select the `ollama` backend, and configure agent models. Or edit `~/.chuchu/setup.yaml` directly.
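Before running the catalog update, it's worth double-checking that every model your config references has actually been pulled:

```shell
# List locally installed models with their sizes and modification times
ollama list
```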
## Performance Tips

### Speed Up Inference
- Use quantized models: Ollama defaults to Q4_K_M (good balance)
- Keep models in RAM: First run loads model, subsequent runs are fast
- Use smaller models for router: Router is called most frequently
- SSD storage: Models load much faster from SSD
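One way to avoid the first-request load delay is to warm a model before you start working. Per Ollama's API behavior, a generate request with no prompt loads the model into memory without producing a completion:

```shell
# An empty generate request loads the model into memory
# so your first real query doesn't pay the load cost
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b"}'
```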
### Memory Management

```bash
# Check running models
ollama ps

# Stop a model to free memory
ollama stop llama3.1:8b

# Models auto-unload after 5 minutes of inactivity
```
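The five-minute unload window is configurable. Ollama reads the `OLLAMA_KEEP_ALIVE` environment variable at server start, and the API also accepts a per-request `keep_alive` value:

```shell
# Keep loaded models resident for 30 minutes of inactivity
OLLAMA_KEEP_ALIVE=30m ollama serve

# Or control it per request: -1 keeps the model loaded indefinitely
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "keep_alive": -1}'
```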
### Parallel Model Loading

Ollama can run multiple models simultaneously if you have enough RAM:

```bash
# In separate terminals
ollama run llama3.1:8b
ollama run qwen3-coder:latest
```
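How many models the server keeps resident at once is governed by environment variables on the server side, for example:

```shell
# Allow up to 3 models in memory at once,
# each serving up to 4 requests in parallel
OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_NUM_PARALLEL=4 ollama serve
```

With the default limits, pulling a third model into a busy session can evict one you are still using, so raise `OLLAMA_MAX_LOADED_MODELS` only if your RAM budget covers all of them.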
## Switching Between Local and Cloud
You can configure multiple backends and switch between them as needed:
Example setup with both Ollama and Groq (`~/.chuchu/setup.yaml`):

```yaml
defaults:
  backend: ollama  # Currently active backend

backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b
      query: qwen3-coder:latest
      editor: qwen3-coder:latest
      research: llama3.1:8b
  groq:
    type: openai
    base_url: https://api.groq.com/openai/v1
    default_model: gpt-oss-120b-128k
    agent_models:
      router: llama-3.1-8b-instant
      query: gpt-oss-120b-128k
      editor: deepseek-r1-distill-qwen-32b
      research: gpt-oss-120b-128k
```
To switch backends:
- In Neovim: press `Ctrl+X` in the chat buffer and select a different backend
- Manually: edit the `defaults.backend` value and restart your session

**Important:** Only one backend is active at a time. Each backend has its own set of `agent_models`; you cannot mix models from different backends in the same session.
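If you prefer the command line over Neovim, a YAML editor can flip the active backend in place. This sketch assumes you have the mikefarah `yq` tool installed; it is not bundled with Chuchu:

```shell
# Switch the active backend to groq (edits the file in place)
yq -i '.defaults.backend = "groq"' ~/.chuchu/setup.yaml

# And back to local
yq -i '.defaults.backend = "ollama"' ~/.chuchu/setup.yaml
```

Remember to restart your session after the edit, as noted above.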
When to use which:
- Ollama: Privacy-sensitive code, unlimited usage, working offline
- Groq: Need faster inference, larger context, better quality for critical tasks
See our Hybrid Cloud/Local guide for detailed switching strategies.
## Troubleshooting

### Model Loading Slowly

- Check if you have enough RAM: `ollama ps`
- Ensure the model is on an SSD, not an HDD
- Close other applications to free memory
### Out of Memory
- Use smaller models or quantizations
- Run one model at a time
- Increase swap space (not recommended for performance)
### Poor Quality Responses
- Try larger models (requires more RAM)
- Use specialized models (e.g., qwen3-coder for code)
- Check model quantization (Q4_K_M is good balance)
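Many models are published under multiple quantization tags, so you can trade RAM for quality without switching models. Exact tag names vary per model on ollama.com/library, so treat the tag below as an example and check the model's library page first:

```shell
# Higher-precision quantization: better quality,
# roughly double the RAM of the default Q4_K_M build
ollama pull llama3.1:8b-instruct-q8_0
```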
## Comparing Local vs Cloud
| Aspect | Ollama (Local) | Groq (Cloud) |
|---|---|---|
| Privacy | Complete | Code sent to API |
| Cost | Free | Pay per token |
| Speed (first run) | Model load time | Instant |
| Speed (loaded) | No network | Very fast |
| Model quality | Limited by RAM | Largest models |
| Offline | Works offline | Requires internet |
| Setup | Download models | Just API key |
## Model Discovery and Installation

Chuchu includes built-in model discovery and installation for Ollama.

### Search for Models

```bash
# Search all ollama models
chu models search -b ollama

# Search with filters (ANDed together)
chu models search ollama coding fast
chu models search ollama llama3
```
The search results include an `installed` field showing which models are already available:

```json
{
  "id": "llama3.1:8b",
  "name": "llama3.1:8b",
  "tags": ["free", "fast", "versatile"],
  "context_window": 8192,
  "installed": true
}
```
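Assuming the search command emits JSON records shaped like the one above (an assumption; check your version's actual output), you can pipe it through `jq` to list only the models you haven't pulled yet:

```shell
# List catalog models that are not yet installed locally.
# Assumes one JSON object per result, as shown above.
chu models search -b ollama | jq -r 'select(.installed | not) | .id'
```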
### Install Models

```bash
# Install a specific model
chu models install llama3.1:8b

# If already installed, you'll see:
# ✓ Model llama3.1:8b already installed
```
### Discover New Models

For the full catalog of available models, visit [ollama.com/library](https://ollama.com/library).

Update Chuchu's model catalog periodically:

```bash
chu models update
```
## Community Recommendations
Share your Ollama configuration on GitHub Discussions and help others find the best setup for their hardware!
Running into issues? Ask in GitHub Discussions