How to Set Up Ollama: Run Any LLM Locally in 5 Minutes (2026 Guide)
Step-by-step guide to installing Ollama and running AI models locally on your PC or Mac. From installation to your first conversation in under 5 minutes — no cloud, no API keys, completely private.
Compute Market Team
Last updated: March 3, 2026. Tested with Ollama 0.6.x on Windows 11, macOS 15, and Ubuntu 24.04.
Local AI in 5 Minutes — No Cloud Required
Ollama is the simplest way to run AI models on your own computer. No API keys, no cloud subscriptions, no data leaving your machine. Install it, pull a model, and start chatting. The entire process takes under 5 minutes.
This guide covers everything: installation on every platform, choosing the right model for your hardware, GPU optimization, connecting a web UI, and building on top of Ollama's local API. Whether you are on a $500 laptop or a $1,000 AI PC, this guide gets you running.
What You Need
Ollama runs on almost any modern computer. Here is what determines your experience:
| Hardware | Minimum (Small Models) | Recommended (7B–13B Models) | Optimal (30B+ Models) |
|---|---|---|---|
| RAM | 8GB | 16GB | 32GB+ |
| GPU VRAM | None (CPU mode) | 8–16GB | 24GB+ |
| Storage | 10GB free | 50GB free | 100GB+ free |
| CPU | Any modern 4-core | Any modern 6-core | Any modern 6-core+ |
GPU matters most. An NVIDIA GPU with 8GB+ VRAM makes inference 5–10x faster than CPU-only mode. The RTX 4060 Ti 16GB ($399–$449) is the sweet spot for most users — it fits 13B models comfortably. For the best experience, an RTX 4090 with 24GB handles even 30B models with ease. Apple Silicon Macs use unified memory, which Ollama leverages automatically via Metal — a Mac Mini M4 Pro ($1,399) runs 7B–13B models silently out of the box. For GPU recommendations by budget, see our budget GPU guide.
No GPU? No Problem.
Ollama works on CPU-only machines. It will be slower (5–10 tokens/sec vs 50–100+ tokens/sec with a GPU), but small 3B–7B models are still usable for chat. If you find yourself wanting more speed, that is when a GPU upgrade makes the most impact. See our GPU buyer's guide.
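As a rough sketch of that sizing question (a back-of-the-envelope approximation, not Ollama's own accounting), model weights take roughly params × quantization bits / 8 bytes, so you can estimate whether a model will fit in a given amount of VRAM:

```python
def model_fits_in_vram(params_billions: float, quant_bits: float,
                       vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Rough estimate: weights take params * bits / 8 bytes.

    A back-of-the-envelope check, not Ollama's own logic; it ignores
    KV cache and runtime buffers beyond the fixed `headroom_gb`.
    """
    weight_gb = params_billions * quant_bits / 8  # 1B params at 4 bits ~ 0.5 GB
    return weight_gb + headroom_gb <= vram_gb

# An 8B model at Q4 (~4.5 bits effective) fits in 8GB of VRAM:
print(model_fits_in_vram(8, 4.5, 8))    # True: ~4.5GB of weights
# A 32B model at Q4 does not fit in 16GB:
print(model_fits_in_vram(32, 4.5, 16))  # False: ~18GB of weights
```

The 1.5GB headroom figure is a guess that covers context and runtime buffers for typical chat; long contexts need more (see the performance tips below).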
Step 1: Install Ollama
macOS
Download from ollama.com/download or use Homebrew:
brew install ollama
That is it. Ollama automatically detects Apple Silicon and uses Metal GPU acceleration. No drivers, no configuration.
Linux
One command:
curl -fsSL https://ollama.com/install.sh | sh
The installer detects your GPU (NVIDIA with CUDA, AMD with ROCm) and configures acceleration automatically. For NVIDIA GPUs, ensure you have the latest driver installed (nvidia-smi should show driver 560.x or newer).
Windows
Download the installer from ollama.com/download. Run it, accept defaults, done. Ollama runs as a system service and detects NVIDIA GPUs automatically.
Verify Installation
Open a terminal (or PowerShell on Windows) and run:
ollama --version
You should see something like ollama version 0.6.x. If it works, you are ready to run models.
Step 2: Run Your First Model
Pull and run a model in one command:
ollama run llama3.1:8b
The first time you run this, Ollama downloads the model (~4.7GB for the 8B Q4 version). Subsequent runs start instantly because the model is cached locally.
Once loaded, you are in a chat interface. Type a message, press Enter, and the model responds. That is it — you are running AI locally.
>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France
and serves as the country's political, economic, and cultural center...
Press Ctrl+D or type /bye to exit.
Step 3: Choose the Right Model for Your Hardware
Ollama hosts hundreds of models. Here are our recommendations by hardware tier:
| Your GPU VRAM | Best Model | Size | Speed | Best For |
|---|---|---|---|---|
| No GPU (CPU only) | phi3:mini | 2.3GB | ~5–10 t/s | Basic chat, simple tasks |
| 8GB | llama3.1:8b | 4.7GB | ~30–50 t/s | General assistant, coding, writing |
| 12GB | llama3.1:8b | 4.7GB | ~40–60 t/s | Same, with more context headroom |
| 16GB | qwen2.5:14b | 9GB | ~35–45 t/s | More capable reasoning, multilingual |
| 24GB | qwen2.5:32b | 20GB | ~28–35 t/s | Near-GPT-4 quality for many tasks |
| 32GB+ | llama3.1:70b | ~40GB | ~15–20 t/s | Frontier-class local inference (Q4 by default) |
To try a specific model:
ollama run qwen2.5:14b
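The tiers above collapse into a small helper if you want to pick programmatically. The model names come straight from the table; the function itself is just an illustration:

```python
# (minimum VRAM in GB, recommended model) pairs from the table above
TIERS = [
    (0, "phi3:mini"),       # CPU only
    (8, "llama3.1:8b"),
    (16, "qwen2.5:14b"),
    (24, "qwen2.5:32b"),
    (32, "llama3.1:70b"),
]

def recommend_model(vram_gb: float) -> str:
    """Return the largest recommended model whose VRAM tier you meet."""
    best = TIERS[0][1]
    for threshold, model in TIERS:
        if vram_gb >= threshold:
            best = model
    return best

print(recommend_model(0))   # phi3:mini
print(recommend_model(12))  # llama3.1:8b
print(recommend_model(24))  # qwen2.5:32b
```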
Specialized Models Worth Trying
- For coding: ollama run deepseek-coder-v2:16b — excellent for code generation, debugging, and explanation
- For reasoning: ollama run deepseek-r1:14b — chain-of-thought reasoning, math, logic
- For writing: ollama run mistral:7b — fast, articulate, great for prose
- For uncensored/unfiltered: ollama run dolphin-mistral:7b — no refusals, useful for creative writing
Step 4: Verify GPU Acceleration
Make sure Ollama is using your GPU. Run a model and check GPU utilization:
NVIDIA GPUs
nvidia-smi
You should see Ollama processes using GPU memory. If the "GPU Memory Usage" shows 0MB while a model is loaded, GPU acceleration is not working — check your NVIDIA driver installation.
Apple Silicon
Open Activity Monitor → GPU tab. You should see "ollama_llama_server" using GPU. Metal acceleration is automatic on all Apple Silicon Macs.
AMD GPUs
rocm-smi
Verify GPU memory usage while a model is running. AMD GPU support requires ROCm 7.0+ on Linux. Windows AMD GPU support is limited — check Ollama's documentation for the latest compatibility.
Step 5: Add a Web Interface (Optional)
The command line works, but a web UI makes the experience much better. Open WebUI is the most popular choice — it provides a ChatGPT-like interface that connects to your local Ollama instance.
Quick Install with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an account (local only), and you have a full ChatGPT-style interface connected to your local models. Conversations are stored locally, you can switch between models, and there is built-in RAG (retrieval-augmented generation) for chatting with your documents.
Without Docker
If you do not want Docker, try Enchanted (macOS native app), Chatbox (cross-platform desktop app), or simply use the Ollama API directly from your code.
Step 6: Use the Local API
Ollama exposes a local REST API on http://localhost:11434. Its native endpoints live under /api, and it also serves an OpenAI-compatible endpoint under /v1, which means you can use it with any tool or library that supports the OpenAI API.
Quick Test
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain quantum computing in one paragraph",
"stream": false
}'
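The same request can be built from Python with nothing but the standard library. Actually sending it assumes a local Ollama server is running on the default port, so the send step below is left commented out:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1:8b",
                             "Explain quantum computing in one paragraph")
# With Ollama running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
print(json.loads(req.data)["model"])  # llama3.1:8b
```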
Use with Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but not used
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
This OpenAI-compatible API means you can swap cloud models for local models in your applications by changing only the base_url. Your code, prompts, and logic stay the same.
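One simple way to make that swap configurable is to resolve the base URL from the environment at startup. The variable names here (LLM_BASE_URL, LLM_MODEL) are our own convention, not part of any SDK:

```python
import os

def resolve_llm_endpoint() -> tuple[str, str]:
    """Pick (base_url, model) from the environment.

    LLM_BASE_URL defaults to the local Ollama endpoint; point it at a
    cloud provider's OpenAI-compatible URL to switch backends without
    touching application code.
    """
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
    model = os.environ.get("LLM_MODEL", "llama3.1:8b")
    return base_url, model

base_url, model = resolve_llm_endpoint()
# Pass the result straight into the OpenAI client shown above:
# client = OpenAI(base_url=base_url, api_key="ollama")
```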
Performance Tips
1. Match Model Size to Your VRAM
The model should fit entirely in GPU VRAM for maximum speed. If a model exceeds your VRAM, Ollama automatically offloads layers to CPU — which works but is 5–10x slower for those layers. Check VRAM usage with nvidia-smi and choose a model that fits. For exact VRAM requirements by model, see our VRAM guide.
2. Use Q4_K_M Quantization (the Default)
Ollama defaults to Q4_K_M quantization for most models, which is the best balance of quality and VRAM usage. Avoid downloading FP16 versions unless you have VRAM to spare — the quality difference from Q4 to FP16 is minimal for most chat use cases.
3. Set Context Length Appropriately
Longer context windows use more VRAM. The default 2048 tokens is fine for most conversations. If you need longer context (for analyzing documents), set num_ctx inside a chat session:
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
Or pass "options": {"num_ctx": 8192} in API requests. Be aware that doubling context roughly doubles the KV cache VRAM usage.
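To make that concrete, here is a back-of-the-envelope KV cache estimate using the standard formula (2 × layers × KV heads × head dimension × bytes per element, per token) and Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: keys + values for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 1024**3

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at fp16:
print(round(kv_cache_gb(32, 8, 128, 2048), 2))  # 0.25 GB at the default context
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # 1.0 GB: 4x the context, 4x the cache
```

Note this estimates only the cache itself; Ollama may also quantize the KV cache or add runtime overhead, so treat it as a lower bound.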
4. Keep Ollama Updated
Ollama's team ships performance improvements frequently. Update regularly:
# macOS (Homebrew)
brew upgrade ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Common Issues & Fixes
"Model too large for GPU"
The model exceeds your VRAM. Switch to a smaller model or a more aggressively quantized version. For example, if qwen2.5:32b does not fit in 16GB, try qwen2.5:14b instead.
Slow inference (< 10 tokens/sec with a GPU)
Ollama may be running in CPU mode. Verify GPU detection:
ollama ps
If the "Processor" column shows "CPU" instead of "GPU", your GPU is not being used. Check driver installation and restart the Ollama service.
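If you script this health check, a small sketch like the following can flag CPU fallback. The sample output is illustrative of the ollama ps table layout, not captured from a real run:

```python
def gpu_fallback_warning(ollama_ps_output: str) -> list[str]:
    """Return names of loaded models whose PROCESSOR column mentions CPU."""
    warnings = []
    for line in ollama_ps_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if fields and "CPU" in line:  # matches "100% CPU" and mixed "CPU/GPU"
            warnings.append(fields[0])
    return warnings

sample = """NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.2 GB    100% CPU     4 minutes from now"""
print(gpu_fallback_warning(sample))  # ['llama3.1:8b']
```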
"Connection refused" from API
Ensure the Ollama service is running:
# Linux
systemctl status ollama
# macOS — Ollama should be in your menu bar
# Windows — check if Ollama is in the system tray
What to Do Next
Once Ollama is running, the possibilities expand quickly:
- Try different models — experiment with coding models, reasoning models, and different sizes to find what works best for your use case
- Set up Open WebUI — the ChatGPT-like interface makes daily use much more pleasant
- Build local AI tools — use the API to integrate AI into your own scripts, workflows, and applications
- Upgrade your hardware — if you find yourself wanting to run larger models, see our hardware guides
Local AI is real, it is fast, and it is free. The only cost is the hardware you already own (or the hardware you are about to buy). Stop paying for API calls and start running models on your own machine.