There's a growing category of users who don't just prefer local AI — they require it. Lawyers handling privileged communications. Healthcare providers processing patient data. Government contractors working in classified environments. Companies whose compliance teams have vetoed cloud AI services. And a large number of individuals who simply don't want their conversations stored on someone else's servers.
For all of these users, "mostly local" isn't enough. They need a stack that works with the network cable unplugged. Here's how to build one.
## The Architecture
Three components, each handling a distinct responsibility:
- **Ollama** manages model downloads, quantization, and serves an OpenAI-compatible API on localhost
- **ZeroClaw** provides the agent runtime — memory, channels, tools, personality, and orchestration
- **Open WebUI** delivers a polished browser-based chat interface, similar to ChatGPT's UI
All three run locally. Ollama and ZeroClaw communicate over localhost. Open WebUI connects to Ollama's local API. Nothing touches the internet after initial setup.
## Step 1: Install Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On Windows, download the installer from ollama.com/download. On macOS, `brew install ollama` works.
Pull your models while you still have internet:
```bash
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:7b
ollama pull gemma3:4b
```
These models download once and are stored locally. After this step, Ollama never needs internet access again.
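To confirm the local server is actually serving, you can hit Ollama's OpenAI-compatible endpoint directly (11434 is Ollama's default port; this assumes the daemon is running and `llama3.1:8b` has been pulled):

```shell
# Quick sanity check against Ollama's local OpenAI-compatible API.
# Requires a running Ollama daemon with llama3.1:8b already pulled.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```

If you get JSON back with a `choices` array, the API side of the stack is working and anything that speaks the OpenAI protocol can use it.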
**Choosing your model:** If you have 16GB+ RAM or a GPU with 8GB+ VRAM, start with llama3.1:8b — it's the best all-around model for its size. If you're on 8GB RAM without a GPU, use gemma3:4b. For coding tasks specifically, qwen2.5-coder:7b outperforms models twice its size on code generation benchmarks.
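If you're not sure which bucket your machine falls into, a quick check (Linux; `nvidia-smi` exists only if NVIDIA drivers are installed):

```shell
# Report total RAM; model choice hinges on this number.
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'

# Report GPU VRAM if an NVIDIA GPU is present; otherwise note its absence.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader
else
  echo "No NVIDIA GPU detected (CPU-only inference)"
fi
```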
## Step 2: Install ZeroClaw
```bash
curl -fsSL https://raw.githubusercontent.com/zeroclaw-labs/zeroclaw/main/scripts/bootstrap.sh | bash
```
Configure ZeroClaw to use Ollama as its provider. Edit `~/.zeroclaw/config.toml`:
```toml
[provider]
type = "openai-compatible"
base_url = "http://localhost:11434/v1"
model = "llama3.1:8b"
api_key = "not-needed"

[agent]
name = "LocalAssistant"
personality = "Helpful, concise assistant running entirely on local hardware. Prioritize accuracy over speed."

[memory]
type = "sqlite"
path = "~/.zeroclaw/memory.db"
```
Start ZeroClaw:
```bash
zeroclaw start
```
Test it:
```bash
zeroclaw chat "What can you help me with?"
```
## Step 3: Install Open WebUI
Open WebUI provides a familiar chat interface. Deploy it with Docker:
```bash
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Open http://localhost:3000 in your browser. Create a local admin account (this is stored in the Docker volume, not sent anywhere). Open WebUI automatically detects your local Ollama instance and lists your downloaded models.
You now have a ChatGPT-style interface running entirely on your hardware.
## Step 4: Disconnect from the Internet
This is the test. Unplug the ethernet cable or disable WiFi. Open WebUI should still load at localhost:3000. Type a message. You should get a response from your local model within a few seconds.
If it works disconnected, your stack is fully offline-capable. Reconnecting to the internet doesn't change anything — the stack continues to operate locally regardless of network state.
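Beyond the unplug test, you can verify that the services only listen on loopback. A sketch assuming Linux with iproute2's `ss`, and the default ports from this guide (11434 for Ollama, 3000 for Open WebUI):

```shell
# Show the bind addresses for the stack's listening ports.
# Each printed address should start with 127.0.0.1 or [::1];
# a 0.0.0.0 entry means the service is reachable from the network.
ss -tln 2>/dev/null | grep -E ':(11434|3000)\b' | awk '{print $4}'
```

Note that Open WebUI published via Docker's default `-p 3000:8080` binds to all interfaces; the hardening section below covers restricting it.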
## Model Selection by Hardware
The model you can run depends on your RAM and GPU:
**8GB RAM, CPU only:**

- gemma3:4b — fast, capable for basic tasks
- qwen2.5:3b — smaller but surprisingly competent

**16GB+ RAM, CPU only:**

- llama3.1:8b — best general-purpose model at this tier
- qwen2.5-coder:7b — for development work

**Add a GPU with 8GB+ VRAM:**

- Same models as above, but 3-5x faster inference
- Response times drop from 5-8 seconds to 1-2 seconds

**32GB+ RAM or 24GB+ VRAM:**

- deepseek-v3.2:32b — substantial quality jump over 8B models
- qwen2.5-coder:32b — excellent for software development
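As a rough rule of thumb for deciding whether a model fits your hardware: a 4-bit quantized model needs on the order of 0.65 GB per billion parameters, plus overhead for the context cache. The exact factors below are approximations for back-of-envelope math, not published Ollama figures:

```shell
# Estimate memory needed for a 4-bit quantized model of N billion parameters.
# Approximation: ~0.65 GB per billion params + ~1.5 GB of context overhead.
estimate_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f GB\n", b * 0.65 + 1.5 }'
}

estimate_gb 8    # → 6.7 GB  (llama3.1:8b fits in 8-16GB RAM)
estimate_gb 32   # → 22.3 GB (32B-class models want 24GB+ VRAM or 32GB+ RAM)
```

The estimates line up with the tiers above: 8B models fit comfortably on a 16GB machine, while 32B models demand the top tier.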
## Adding ZeroClaw's Agent Features
The basic Ollama + Open WebUI stack gives you a chat interface. Adding ZeroClaw gives you an agent — something that remembers previous conversations, uses tools, and connects to messaging platforms.
**Memory persistence:** ZeroClaw stores conversations in SQLite with hybrid search (FTS5 + vector). Your agent remembers what you discussed yesterday, last week, or last month. Open WebUI alone doesn't provide cross-session memory.
**Tool use:** Configure ZeroClaw tools for file operations, calculations, or custom scripts. The agent can read files from your system, execute whitelisted commands, and interact with local services — all offline.
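As a sketch, a tool configuration in `~/.zeroclaw/config.toml` might look like the following. The key names here are illustrative assumptions, not ZeroClaw's documented schema — check the project's docs for the real one:

```toml
# Illustrative only: these keys are assumptions, not ZeroClaw's documented schema.
[tools]
enabled = ["files", "shell"]

[tools.files]
allowed_paths = ["~/projects", "~/notes"]   # agent may read only these trees

[tools.shell]
allowed_commands = ["ls", "grep", "wc"]     # whitelisted commands only
```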
**Multi-channel access:** Connect Telegram, Discord, or other messaging platforms when you're online, while the actual inference stays local. The messaging platform only sees the text messages; the AI processing never leaves your machine.
## Security Hardening
For users running this stack for compliance or security reasons, take these additional hardening steps:
1. **Bind to localhost only.** Both Ollama and ZeroClaw should listen on 127.0.0.1, not 0.0.0.0. This is the default for ZeroClaw; verify for Ollama with `OLLAMA_HOST=127.0.0.1 ollama serve`.
2. **Disable telemetry.** Open WebUI has optional telemetry that should be disabled for air-gapped deployments: set `ENABLE_TELEMETRY=false` in the Docker environment.
3. **Use ZeroClaw's allowlist.** Configure file path and network allowlists to restrict what the agent can access, even on your local machine.
4. **Encrypt the memory database.** For sensitive conversations, use SQLite encryption (SQLCipher) for ZeroClaw's memory file.
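Points 1 and 2 can be applied to the Open WebUI launch command from Step 3: publishing the port as `127.0.0.1:3000` keeps the UI off the network, and the environment variable disables telemetry.

```shell
# Hardened variant of the Step 3 command: loopback-only port binding
# plus telemetry disabled via the environment.
docker run -d \
  -p 127.0.0.1:3000:8080 \
  -e ENABLE_TELEMETRY=false \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```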
## What You Give Up
Honesty about trade-offs matters. Running fully offline means:
- **Model quality ceiling.** Local models (8B-32B parameters) are good but not frontier-level. Cloud models like Claude 3.5 Opus or GPT-5 will produce better output on complex reasoning tasks.
- **No internet-connected tools.** Web search, live data fetching, and external API calls don't work offline. The agent is limited to local knowledge and tools.
- **Hardware investment.** You need dedicated hardware. The ongoing cost is electricity instead of API fees, but the upfront cost is real.
For most users, the practical approach is hybrid: run locally by default, route to the cloud when you need frontier-quality output, and keep the offline capability as a fallback for when the network is unavailable or the task is sensitive.
ZeroClaw supports this natively — configure both a local provider (Ollama) and a cloud provider, set routing rules, and the agent automatically picks the right backend per query. Your private conversations stay local. Your complex research questions go to the cloud. You control the boundary.
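A hybrid setup along those lines might look like this in `~/.zeroclaw/config.toml`. The provider and routing key names are illustrative assumptions, not ZeroClaw's documented schema — consult the project's docs for the actual one:

```toml
# Illustrative only: these keys are assumptions, not ZeroClaw's documented schema.
[provider.local]
type = "openai-compatible"
base_url = "http://localhost:11434/v1"
model = "llama3.1:8b"

[provider.cloud]
type = "openai-compatible"
base_url = "https://example-cloud-provider/v1"   # hypothetical endpoint
api_key_env = "CLOUD_API_KEY"

[routing]
default = "local"              # private conversations stay on-box
fallback = "local"             # offline capability when the network drops
```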