Ollama's model library has grown to hundreds of models. The paradox of choice is real — scroll through the list and you'll find dozens of models in every size category, each claiming to be the best at something.
We tested the models that matter for practical AI agent use. Not benchmark leaderboard performance — real-world usefulness. Can this model answer questions helpfully? Can it write functional code? Does it follow instructions reliably? How often does it hallucinate?
Here are the ten models worth running, ordered roughly by parameter count.
1. Qwen2.5:1.5b — The Survival Model
Size: 1.1GB | RAM needed: 2GB | Speed on CPU: 15-25 tok/s
When you have almost no resources — a Pi Zero, an old laptop, a container with 2GB RAM — this is the model that still works. It's not smart. It's not creative. But it follows simple instructions, answers basic factual questions, and can do elementary text processing.
Best for: IoT devices, ultra-lightweight agents, devices with severe memory constraints. Think of it as a slightly intelligent command parser rather than a conversational assistant.
Verdict: Surprisingly functional for its size. Won't replace a real conversation partner, but handles "turn this text into a JSON object" and "summarize this paragraph" reliably.
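The "turn this text into a JSON object" task above is a good fit for Ollama's built-in JSON mode. A minimal sketch of that workflow, assuming a local Ollama server on its default port (11434); the prompt wording and the `summary`/`keywords` key names are illustrative, not part of any model's API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(text: str, model: str = "qwen2.5:1.5b") -> dict:
    """Build a non-streaming request that constrains output to valid JSON."""
    return {
        "model": model,
        "prompt": (
            "Turn this text into a JSON object with keys "
            f"'summary' and 'keywords':\n\n{text}"
        ),
        "format": "json",  # Ollama forces the model to emit parseable JSON
        "stream": False,
    }


def extract_json(text: str) -> dict:
    """POST the request to a running Ollama server and parse the reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # The generated text lives in the "response" field; JSON mode makes it parseable.
    return json.loads(reply["response"])
```

With `format: "json"` set, even a 1.5B model returns machine-readable output reliably, which is exactly the "slightly intelligent command parser" role described above.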
2. Gemma 3 4B — The Speed Demon
Size: 2.8GB | RAM needed: 4GB | Speed on GPU: 40-60 tok/s
Google's smallest competitive model. On a Pi 5 + AI HAT+ 2, it generates at 22-28 tok/s — fast enough that responses feel instant. On any modern GPU, it's essentially zero-latency.
Best for: Quick queries where response time matters more than depth. Great as a "fast model" in a dual-model ZeroClaw setup, handling simple questions while the larger model handles complex ones.
Verdict: The best model under 4GB for speed-sensitive deployments. Quality is a step below 8B models but adequate for 80% of daily assistant tasks.
3. Llama 3.1 8B Instruct — The All-Rounder
Size: 4.7GB | RAM needed: 6GB | Speed on GPU: 30-45 tok/s
The default recommendation for a reason. Meta's instruction-tuned 8B model hits a quality/size balance that's hard to beat. It follows instructions well, produces coherent long-form text, handles multi-turn conversations without losing context, and has solid factual knowledge.
Best for: General-purpose AI assistants, home setups, personal productivity. This is the model that makes self-hosted AI feel viable for daily use.
Verdict: If you only download one model, make it this one. Reliable, fast enough on modest hardware, good enough for most tasks.
4. Qwen3-8B — The Reasoning Upgrade
Size: 4.9GB | RAM needed: 6GB | Speed on GPU: 28-40 tok/s
Alibaba's latest 8B model with improved reasoning capabilities. In head-to-head comparisons with Llama 3.1 8B, Qwen3-8B produces better output on math problems, logical reasoning, and structured analysis. Slightly worse on creative writing and conversational flow.
Best for: Tasks that require thinking — data analysis, code review, problem decomposition. Choose this over Llama 3.1 8B when accuracy matters more than naturalness.
Verdict: The best 8B model for analytical tasks. Pair it with Llama 3.1 8B in a routing setup for the best of both worlds.
5. Qwen2.5-Coder 7B — The Code Specialist
Size: 4.4GB | RAM needed: 6GB | Speed on GPU: 30-45 tok/s
Purpose-built for code generation and understanding. Outperforms models twice its size on coding benchmarks — it writes functional Python, JavaScript, Rust, Go, and SQL more reliably than any general-purpose model in its weight class.
Best for: Developer assistants, code review agents, programming tutors. If your primary use case is coding help, this model gives you better results per parameter than anything else.
Verdict: Must-have for developers. Use it as a dedicated coding model alongside a general-purpose model for other tasks.
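Running a coder model alongside a generalist only pays off if requests actually reach the right one. A minimal sketch of that dispatch, assuming the model tags shown; the keyword heuristic is illustrative and would be tuned for real traffic:

```python
import re

# Signals that a prompt is code-flavored: fences, common keywords, dev verbs.
CODE_HINTS = re.compile(
    r"```|\bdef\b|\bfunction\b|\bSELECT\b|\bimport\b|"
    r"\b(fix|debug|refactor|implement)\b",
    re.IGNORECASE,
)


def choose_model(prompt: str) -> str:
    """Send code-flavored prompts to the coder model, the rest to the generalist."""
    if CODE_HINTS.search(prompt):
        return "qwen2.5-coder:7b"
    return "llama3.1:8b"
```

Even a crude filter like this keeps the specialist doing what it's best at per parameter, while general conversation stays with the all-rounder.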
6. GLM-4-9B-0414 — The Multilingual Champion
Size: 5.5GB | RAM needed: 7GB | Speed on GPU: 25-35 tok/s
THUDM's GLM-4 excels at multilingual tasks. If your agent needs to handle Chinese, Japanese, Korean, or other CJK languages alongside English, GLM-4 provides the best multilingual quality at the 8-9B parameter scale.
Best for: Multilingual agents, translation tasks, agents serving users in multiple languages. ZeroClaw's multi-channel setup often serves users in different languages — GLM-4 handles the switching naturally.
Verdict: The default choice for multilingual deployments. Solid English performance with genuinely good CJK support.
7. DeepSeek V3.2 32B — The Quality Leap
Size: 18GB (Q4) | RAM needed: 20GB | Speed on GPU: 15-25 tok/s
This is where model quality jumps noticeably. DeepSeek V3.2 at 32B produces output that feels qualitatively different from 8B models — longer context awareness, fewer hallucinations, better reasoning chains, more nuanced writing.
Best for: Users with 24GB+ VRAM (RTX 3090, RTX 4090) who want the best local quality available. Power users, professional workflows, team-shared inference servers.
Verdict: The best model for hardware that can run it. If you have the VRAM, this is the local model that makes you stop missing cloud APIs.
8. Qwen2.5-Coder 32B — The Code Powerhouse
Size: 18GB (Q4) | RAM needed: 20GB | Speed on GPU: 15-25 tok/s
The 32B version of the Qwen coder. At this size, code generation quality approaches frontier cloud models. It handles complex codebases, multi-file changes, and architectural reasoning that smaller models can't manage.
Best for: Professional software development. Teams using AI-assisted coding where output quality directly affects productivity.
Verdict: If coding is your primary use case and you have the hardware, this is the model to run.
9. Llama 3.1 70B — The Cloud Killer
Size: 40GB (Q4) | RAM needed: 44GB | Speed on RTX 4090: 12-18 tok/s
The model that makes people cancel their ChatGPT subscriptions. At 70B parameters (even quantized to Q4), this model produces output that's competitive with cloud APIs on most tasks. Reasoning, writing, code, analysis — it handles them all at a level that feels professional.
Best for: Users with high-end hardware (RTX 4090 with 24GB, dual GPUs, or Apple Silicon with 64GB+ unified memory) who want to eliminate cloud AI dependency entirely.
Verdict: If your hardware can run it, you may not need a cloud API for anything except the absolute cutting edge of AI reasoning.
10. Mistral Large 123B — The Local Frontier
Size: 70GB (Q4) | RAM needed: 75GB | Speed: 5-10 tok/s on high-end setups
The largest model that's practically runnable locally, given dual RTX 4090s or an Apple M3/M4 Ultra with 192GB of unified memory. At 123B parameters, it matches or exceeds many cloud API models.
Best for: Research labs, AI companies, organizations with dedicated inference hardware. Not practical for personal use.
Verdict: Proof that "local" and "frontier" aren't mutually exclusive anymore — if you're willing to invest in the hardware.
The Practical Stack
For most users, the optimal Ollama setup is two models:
1. **Fast model** for simple queries: gemma3:4b or llama3.1:8b
2. **Quality model** for complex tasks: deepseek-v3.2:32b or llama3.1:70b (hardware permitting)
ZeroClaw can route between them automatically based on query complexity. Simple questions go to the fast model (instant response). Complex reasoning goes to the quality model (better output). You get speed when it doesn't matter and quality when it does.
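ZeroClaw's own routing logic isn't shown here, but the idea can be sketched with a simple heuristic: long or reasoning-heavy queries go to the quality model, everything else to the fast one. The model names, cue words, and length threshold below are all illustrative assumptions:

```python
def pick_model(query: str,
               fast: str = "gemma3:4b",
               quality: str = "deepseek-v3.2:32b",
               threshold: int = 120) -> str:
    """Crude complexity router for a two-model Ollama stack."""
    # Words that usually signal multi-step reasoning rather than a lookup.
    reasoning_cues = ("why", "explain", "analyze", "compare", "design", "prove")
    if len(query) > threshold:
        return quality  # long prompts tend to need the bigger model
    if any(cue in query.lower() for cue in reasoning_cues):
        return quality
    return fast
```

A production router would likely score queries with the fast model itself or a classifier, but even this heuristic captures the speed-when-it-doesn't-matter, quality-when-it-does split.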
Download both. Configure the routing threshold. Let the agent decide. That's the 2026 local AI experience at its best.