How to Set Up Ollama: Run Any LLM Locally in 5 Minutes (2026 Guide)
Step-by-step guide to installing Ollama and running AI models locally on your PC or Mac. From installation to your first conversation in under 5 minutes — no cloud, no API keys, completely private.
Compute Market Team
Last updated: March 3, 2026. Tested with Ollama 0.6.x on Windows 11, macOS 15, and Ubuntu 24.04.
Local AI in 5 Minutes — No Cloud Required
Ollama is the simplest way to run AI models on your own computer. No API keys, no cloud subscriptions, no data leaving your machine. Install it, pull a model, and start chatting. The entire process takes under 5 minutes.
This guide covers everything: installation on every platform, choosing the right model for your hardware, GPU optimization, connecting a web UI, and building on top of Ollama's local API. Whether you are on a $500 laptop or a $1,000 AI PC, this guide gets you running.
What You Need
Ollama runs on almost any modern computer. Here is what determines your experience:
| Hardware | Minimum (Small Models) | Recommended (7B–13B Models) | Optimal (30B+ Models) |
|---|---|---|---|
| RAM | 8GB | 16GB | 32GB+ |
| GPU VRAM | None (CPU mode) | 8–16GB | 24GB+ |
| Storage | 10GB free | 50GB free | 100GB+ free |
| CPU | Any modern 4-core | Any modern 6-core | Any modern 6-core+ |
GPU matters most. An NVIDIA GPU with 8GB+ VRAM makes inference 5–10x faster than CPU-only mode. The RTX 4060 Ti 16GB ($399–$449) is the sweet spot for most users — it fits 13B models comfortably. For the best experience, an RTX 4090 with 24GB handles even 30B models with ease. Apple Silicon Macs use unified memory, which Ollama leverages automatically via Metal — a Mac Mini M4 Pro ($1,399) runs 7B–13B models silently out of the box. For GPU recommendations by budget, see our budget GPU guide.
No GPU? No Problem.
Ollama works on CPU-only machines. It will be slower (5–10 tokens/sec vs 50–100+ tokens/sec with a GPU), but small 3B–7B models are still usable for chat. If you find yourself wanting more speed, that is when a GPU upgrade makes the most impact. See our GPU buyer's guide.
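As a rough sketch of that sizing question (a back-of-the-envelope approximation, not Ollama's own accounting), model weights take roughly params × quantization bits / 8 bytes, so you can estimate whether a model will fit in a given amount of VRAM:

```python
def model_fits_in_vram(params_billions: float, quant_bits: float,
                       vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Rough estimate: weights take params * bits / 8 bytes.

    A back-of-the-envelope check, not Ollama's own logic; it ignores
    KV cache and runtime buffers beyond the fixed `headroom_gb`.
    """
    weight_gb = params_billions * quant_bits / 8  # 1B params at 4 bits ~ 0.5 GB
    return weight_gb + headroom_gb <= vram_gb

# An 8B model at Q4 (~4.5 bits effective) fits in 8GB of VRAM:
print(model_fits_in_vram(8, 4.5, 8))    # True: ~4.5GB of weights
# A 32B model at Q4 does not fit in 16GB:
print(model_fits_in_vram(32, 4.5, 16))  # False: ~18GB of weights
```

The 1.5GB headroom figure is a guess that covers context and runtime buffers for typical chat; long contexts need more (see the performance tips below).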
Step 1: Install Ollama
macOS
Download from ollama.com/download or use Homebrew:
brew install ollama
That is it. Ollama automatically detects Apple Silicon and uses Metal GPU acceleration. No drivers, no configuration.
Linux
One command:
curl -fsSL https://ollama.com/install.sh | sh
The installer detects your GPU (NVIDIA with CUDA, AMD with ROCm) and configures acceleration automatically. For NVIDIA GPUs, ensure you have the latest driver installed (nvidia-smi should show driver 560.x or newer).
Windows
Download the installer from ollama.com/download. Run it, accept defaults, done. Ollama runs as a system service and detects NVIDIA GPUs automatically.
Verify Installation
Open a terminal (or PowerShell on Windows) and run:
ollama --version
You should see something like ollama version 0.6.x. If it works, you are ready to run models.
Step 2: Run Your First Model
Pull and run a model in one command:
ollama run llama3.1:8b
The first time you run this, Ollama downloads the model (~4.7GB for the 8B Q4 version). Subsequent runs start instantly because the model is cached locally.
Once loaded, you are in a chat interface. Type a message, press Enter, and the model responds. That is it — you are running AI locally.
>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France
and serves as the country's political, economic, and cultural center...
Press Ctrl+D or type /bye to exit.
Step 3: Choose the Right Model for Your Hardware
Ollama hosts hundreds of models. Here are our recommendations by hardware tier:
| Your GPU VRAM | Best Model | Size | Speed | Best For |
|---|---|---|---|---|
| No GPU (CPU only) | phi3:mini | 2.3GB | ~5–10 t/s | Basic chat, simple tasks |
| 8GB | llama3.1:8b | 4.7GB | ~30–50 t/s | General assistant, coding, writing |
| 12GB | llama3.1:8b | 4.7GB | ~40–60 t/s | Same, with more context headroom |
| 16GB | qwen2.5:14b | 9GB | ~35–45 t/s | More capable reasoning, multilingual |
| 24GB | qwen2.5:32b | 20GB | ~28–35 t/s | Near-GPT-4 quality for many tasks |
| 32GB+ | llama3.1:70b | ~40GB | ~15–20 t/s | Frontier-class local inference (Q4 by default) |
To try a specific model:
ollama run qwen2.5:14b
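The tiers above collapse into a small helper if you want to pick programmatically. The model names come straight from the table; the function itself is just an illustration:

```python
# (minimum VRAM in GB, recommended model) pairs from the table above
TIERS = [
    (0, "phi3:mini"),       # CPU only
    (8, "llama3.1:8b"),
    (16, "qwen2.5:14b"),
    (24, "qwen2.5:32b"),
    (32, "llama3.1:70b"),
]

def recommend_model(vram_gb: float) -> str:
    """Return the largest recommended model whose VRAM tier you meet."""
    best = TIERS[0][1]
    for threshold, model in TIERS:
        if vram_gb >= threshold:
            best = model
    return best

print(recommend_model(0))   # phi3:mini
print(recommend_model(12))  # llama3.1:8b
print(recommend_model(24))  # qwen2.5:32b
```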
Specialized Models Worth Trying
- For coding: ollama run deepseek-coder-v2:16b — excellent for code generation, debugging, and explanation
- For reasoning: ollama run deepseek-r1:14b — chain-of-thought reasoning, math, logic
- For writing: ollama run mistral:7b — fast, articulate, great for prose
- For uncensored/unfiltered: ollama run dolphin-mistral:7b — no refusals, useful for creative writing
Step 4: Verify GPU Acceleration
Make sure Ollama is using your GPU. Run a model and check GPU utilization:
NVIDIA GPUs
nvidia-smi
You should see Ollama processes using GPU memory. If the "GPU Memory Usage" shows 0MB while a model is loaded, GPU acceleration is not working — check your NVIDIA driver installation.
Apple Silicon
Open Activity Monitor → GPU tab. You should see "ollama_llama_server" using GPU. Metal acceleration is automatic on all Apple Silicon Macs.
AMD GPUs
rocm-smi
Verify GPU memory usage while a model is running. AMD GPU support requires ROCm 7.0+ on Linux. Windows AMD GPU support is limited — check Ollama's documentation for the latest compatibility.
Step 5: Add a Web Interface (Optional)
The command line works, but a web UI makes the experience much better. Open WebUI is the most popular choice — it provides a ChatGPT-like interface that connects to your local Ollama instance.
Quick Install with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an account (local only), and you have a full ChatGPT-style interface connected to your local models. Conversations are stored locally, you can switch between models, and there is built-in RAG (retrieval-augmented generation) for chatting with your documents.
Without Docker
If you do not want Docker, try Enchanted (macOS native app), Chatbox (cross-platform desktop app), or simply use the Ollama API directly from your code.
Step 6: Use the Local API
Ollama exposes a local REST API on http://localhost:11434. Its native endpoints live under /api, and it also serves an OpenAI-compatible endpoint under /v1, which means you can use it with any tool or library that supports the OpenAI API.
Quick Test
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain quantum computing in one paragraph",
"stream": false
}'
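The same request can be built from Python with nothing but the standard library. Actually sending it assumes a local Ollama server is running on the default port, so the send step below is left commented out:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1:8b",
                             "Explain quantum computing in one paragraph")
# With Ollama running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
print(json.loads(req.data)["model"])  # llama3.1:8b
```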
Use with Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but not used
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
This OpenAI-compatible API means you can swap cloud models for local models in your applications by changing only the base_url. Your code, prompts, and logic stay the same.
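One simple way to make that swap configurable is to resolve the base URL from the environment at startup. The variable names here (LLM_BASE_URL, LLM_MODEL) are our own convention, not part of any SDK:

```python
import os

def resolve_llm_endpoint() -> tuple[str, str]:
    """Pick (base_url, model) from the environment.

    LLM_BASE_URL defaults to the local Ollama endpoint; point it at a
    cloud provider's OpenAI-compatible URL to switch backends without
    touching application code.
    """
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
    model = os.environ.get("LLM_MODEL", "llama3.1:8b")
    return base_url, model

base_url, model = resolve_llm_endpoint()
# Pass the result straight into the OpenAI client shown above:
# client = OpenAI(base_url=base_url, api_key="ollama")
```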
Performance Tips
1. Match Model Size to Your VRAM
The model should fit entirely in GPU VRAM for maximum speed. If a model exceeds your VRAM, Ollama automatically offloads layers to CPU — which works but is 5–10x slower for those layers. Check VRAM usage with nvidia-smi and choose a model that fits. For exact VRAM requirements by model, see our VRAM guide.
2. Use Q4_K_M Quantization (the Default)
Ollama defaults to Q4_K_M quantization for most models, which is the best balance of quality and VRAM usage. Avoid downloading FP16 versions unless you have VRAM to spare — the quality difference from Q4 to FP16 is minimal for most chat use cases.
3. Set Context Length Appropriately
Longer context windows use more VRAM. The default 2048 tokens is fine for most conversations. If you need longer context (for analyzing documents), set num_ctx inside a chat session:
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
Or pass "options": {"num_ctx": 8192} in API requests. Be aware that doubling context roughly doubles the KV cache VRAM usage.
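To make that concrete, here is a back-of-the-envelope KV cache estimate using the standard formula (2 × layers × KV heads × head dimension × bytes per element, per token) and Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: keys + values for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 1024**3

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at fp16:
print(round(kv_cache_gb(32, 8, 128, 2048), 2))  # 0.25 GB at the default context
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # 1.0 GB: 4x the context, 4x the cache
```

Note this estimates only the cache itself; Ollama may also quantize the KV cache or add runtime overhead, so treat it as a lower bound.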
4. Keep Ollama Updated
Ollama's team ships performance improvements frequently. Update regularly:
# macOS (Homebrew)
brew upgrade ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Common Issues & Fixes
"Model too large for GPU"
The model exceeds your VRAM. Switch to a smaller model or a more aggressively quantized version. For example, if qwen2.5:32b does not fit in 16GB, try qwen2.5:14b instead.
Slow inference (< 10 tokens/sec with a GPU)
Ollama may be running in CPU mode. Verify GPU detection:
ollama ps
If the "Processor" column shows "CPU" instead of "GPU", your GPU is not being used. Check driver installation and restart the Ollama service.
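If you script this health check, a small sketch like the following can flag CPU fallback. The sample output is illustrative of the ollama ps table layout, not captured from a real run:

```python
def gpu_fallback_warning(ollama_ps_output: str) -> list[str]:
    """Return names of loaded models whose PROCESSOR column mentions CPU."""
    warnings = []
    for line in ollama_ps_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if fields and "CPU" in line:  # matches "100% CPU" and mixed "CPU/GPU"
            warnings.append(fields[0])
    return warnings

sample = """NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.2 GB    100% CPU     4 minutes from now"""
print(gpu_fallback_warning(sample))  # ['llama3.1:8b']
```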
"Connection refused" from API
Ensure the Ollama service is running:
# Linux
systemctl status ollama
# macOS — Ollama should be in your menu bar
# Windows — check if Ollama is in the system tray
What to Do Next
Once Ollama is running, the possibilities expand quickly:
- Try different models — experiment with coding models, reasoning models, and different sizes to find what works best for your use case
- Set up Open WebUI — the ChatGPT-like interface makes daily use much more pleasant
- Build local AI tools — use the API to integrate AI into your own scripts, workflows, and applications
- Upgrade your hardware — if you find yourself wanting to run larger models, see our hardware guides
Local AI is real, it is fast, and it is free. The only cost is the hardware you already own (or the hardware you are about to buy). Stop paying for API calls and start running models on your own machine.