Running Chuchu Completely Offline with Ollama

Want to use Chuchu without sending your code to the cloud? Ollama lets you run powerful LLMs locally on your machine, completely free and private.

Why Run Local Models?

  • Privacy: Your code never leaves your machine
  • Cost: $0 per token - run unlimited queries
  • Speed: No network latency for small models
  • Offline: Work anywhere, even without internet
  • Control: Full control over model versions and updates

Prerequisites

  1. Install Ollama: https://ollama.com/download
  2. Have at least 8GB RAM (16GB+ recommended for larger models)
  3. SSD storage (models can be 4-30GB each)
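
Before configuring Chuchu, confirm that Ollama is installed and serving. A quick sanity check, assuming the default port of 11434:

# Check the installed CLI version
ollama --version

# Start the server if it isn't already running
ollama serve

# The server replies "Ollama is running" on its root endpoint
curl http://localhost:11434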

Balanced Setup (16GB RAM)

Best all-around configuration for most machines:

backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b         # Fast routing
      query: gpt-oss:latest       # 20B for comprehension
      editor: qwen3-coder:latest  # Specialized for code
      research: gpt-oss:latest    # Good at synthesis

Required models:

ollama pull llama3.1:8b        # ~4.7GB
ollama pull gpt-oss:latest     # ~13GB
ollama pull qwen3-coder:latest # ~18GB

Total storage: ~36GB
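
After pulling, you can confirm what's installed and how much space it uses. On Linux and macOS, model weights live under ~/.ollama/models by default:

# List installed models with their sizes
ollama list

# Check total disk usage of the model store
du -sh ~/.ollama/models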

Performance Setup (32GB+ RAM)

For powerful machines that can run larger models:

backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: deepseek-r1:32b
    agent_models:
      router: llama3.1:8b         # Fast routing
      query: deepseek-r1:32b      # Strong reasoning
      editor: qwen3-coder:latest  # Code specialist
      research: deepseek-r1:32b   # Excellent research

Required models:

ollama pull llama3.1:8b
ollama pull deepseek-r1:32b    # ~20GB
ollama pull qwen3-coder:latest

Total storage: ~43GB

Minimal Setup (8GB RAM)

For resource-constrained machines:

backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b
      query: llama3.1:8b
      editor: llama3.1:8b
      research: llama3.1:8b

Required models:

ollama pull llama3.1:8b

Total storage: ~4.7GB
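
Once the model is pulled, a quick smoke test (independent of Chuchu) is to call Ollama's chat API directly:

# One-shot, non-streaming chat request against the local server
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "stream": false
}'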

Model Recommendations by Task

Router (Intent Classification)

  • Best: llama3.1:8b - Fast and accurate enough
  • Alternative: phi3:mini - Even smaller/faster

Query (Code Analysis)

  • Best: deepseek-r1:32b - Excellent reasoning
  • Good: gpt-oss:latest - Strong comprehension
  • Budget: llama3.1:8b - Decent understanding

Editor (Code Generation)

  • Best: qwen3-coder:latest - Specialized for code (30B MoE)
  • Good: deepseek-coder-v2:latest - Strong coding model
  • Budget: codellama:13b - Decent code generation

Research (Information Synthesis)

  • Best: deepseek-r1:32b - Excellent at reasoning
  • Good: gpt-oss:latest - Good synthesis
  • Budget: llama3.1:8b - Basic research

Model Comparison

Model                Size    RAM     Best For                  Quantization
llama3.1:8b          4.7GB   8GB     Fast, general use         Q4_K_M
phi3:mini            2.3GB   4GB     Extremely fast routing    Q4_K_M
gpt-oss:latest       13GB    16GB    Comprehension, analysis   MXFP4
qwen3-coder:latest   18GB    20GB    Code generation           Q4_K_M
deepseek-r1:32b      20GB    24GB    Reasoning, research       Q4_K_M
deepseek-coder-v2    16GB    18GB    Code-focused tasks        Q4_K_M
codellama:13b        7.4GB   12GB    Budget code generation    Q4_K_M
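
To verify the quantization and parameter count of a model you've pulled, ollama show prints its metadata:

# Inspect model details (architecture, parameters, quantization, context length)
ollama show llama3.1:8b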

Setting Up

  1. Pull your chosen models:
    ollama pull llama3.1:8b
    ollama pull gpt-oss:latest
    ollama pull qwen3-coder:latest
    
  2. Update Chuchu’s model catalog (this automatically detects installed Ollama models):
    chu models update
    
  3. Configure in Neovim:
    Ctrl+X (in chat buffer)
    Select ollama backend
    Configure agent models
    
  4. Or edit ~/.chuchu/setup.yaml directly

Performance Tips

Speed Up Inference

  1. Use quantized models: Ollama defaults to Q4_K_M (good balance)
  2. Keep models in RAM: The first run loads the model; subsequent runs are fast (see the preload example after this list)
  3. Use smaller models for the router: The router is called most frequently
  4. SSD storage: Models load much faster from SSD
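
For tip 2, you can warm a model into memory before your first real query by sending an empty generate request; Ollama loads the model and returns immediately:

# Preload llama3.1:8b into memory without generating any text
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b"}'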

Memory Management

# Check running models
ollama ps

# Stop a model to free memory
ollama stop llama3.1:8b

# By default, models auto-unload after 5 minutes of inactivity
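
The five-minute window is Ollama's default keep_alive. You can extend it per request, or globally with the OLLAMA_KEEP_ALIVE environment variable when starting the server:

# Keep this model loaded for an hour after the request completes
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "keep_alive": "1h"}'

# Or set a global default before starting the server
OLLAMA_KEEP_ALIVE=30m ollama serve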

Parallel Model Loading

Ollama can run multiple models simultaneously if you have enough RAM:

# In separate terminals
ollama run llama3.1:8b
ollama run qwen3-coder:latest
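
Concurrency is governed by server-side environment variables: OLLAMA_MAX_LOADED_MODELS caps how many models stay resident at once, and OLLAMA_NUM_PARALLEL controls concurrent requests per model:

# Allow two resident models and two parallel requests per model
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=2 ollama serve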

Switching Between Local and Cloud

You can configure multiple backends and switch between them as needed:

Example setup with both Ollama and Groq (~/.chuchu/setup.yaml):

defaults:
  backend: ollama  # Currently active backend

backend:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    agent_models:
      router: llama3.1:8b
      query: qwen3-coder:latest
      editor: qwen3-coder:latest
      research: llama3.1:8b
  
  groq:
    type: openai
    base_url: https://api.groq.com/openai/v1
    default_model: gpt-oss-120b-128k
    agent_models:
      router: llama-3.1-8b-instant
      query: gpt-oss-120b-128k
      editor: deepseek-r1-distill-qwen-32b
      research: gpt-oss-120b-128k

To switch backends:

  1. In Neovim: Press Ctrl+X in the chat buffer and select a different backend
  2. Manually: Edit the defaults.backend value and restart your session
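
If you have yq (v4) installed, the manual switch is a one-liner; this is a convenience sketch, not a Chuchu command:

# Flip the active backend in Chuchu’s config
yq -i '.defaults.backend = "groq"' ~/.chuchu/setup.yaml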

Important: Only one backend is active at a time. Each backend has its own set of agent_models. You cannot mix models from different backends in the same session.

When to use which:

  • Ollama: Privacy-sensitive code, unlimited usage, working offline
  • Groq: Faster inference, larger context windows, better quality for critical tasks

See our Hybrid Cloud/Local guide for detailed switching strategies.

Troubleshooting

Model Loading Slowly

  • Check if you have enough RAM: ollama ps
  • Ensure model is on SSD, not HDD
  • Close other applications to free memory

Out of Memory

  • Use smaller models or quantizations
  • Run one model at a time
  • Increase swap space (not recommended for performance)

Poor Quality Responses

  • Try larger models (requires more RAM)
  • Use specialized models (e.g., qwen3-coder for code)
  • Check model quantization (Q4_K_M is good balance)
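
If a Q4 quantization feels too lossy, many models on ollama.com/library also publish higher-precision tags. Exact tag names vary per model, so treat the one below as illustrative and check the library page first:

# Pull a higher-precision quantization (example tag; verify on ollama.com/library)
ollama pull llama3.1:8b-instruct-q8_0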

Comparing Local vs Cloud

Aspect              Ollama (Local)      Groq (Cloud)
Privacy             Complete            Code sent to API
Cost                Free                Pay per token
Speed (first run)   Model load time     Instant
Speed (loaded)      No network latency  Very fast
Model quality       Limited by RAM      Largest models
Offline             Works offline       Requires internet
Setup               Download models     Just an API key

Model Discovery and Installation

Chuchu includes built-in model discovery and installation for Ollama:

Search for Models

# Search all ollama models
chu models search -b ollama

# Search with filters (ANDed together)
chu models search ollama coding fast
chu models search ollama llama3

The search results include an installed field showing which models are already available:

{
  "id": "llama3.1:8b",
  "name": "llama3.1:8b",
  "tags": ["free", "fast", "versatile"],
  "context_window": 8192,
  "installed": true
}

Install Models

# Install a specific model
chu models install llama3.1:8b

# If already installed, you'll see:
# ✓ Model llama3.1:8b already installed

Discover New Models

For the full catalog of available models, visit ollama.com/library.

Update Chuchu’s model catalog periodically:

chu models update

Community Recommendations

Share your Ollama configuration on GitHub Discussions and help others find the best setup for their hardware!


Running into issues? Ask in GitHub Discussions