How to Download Ollama and Run Your First Local AI Model
No cloud. No API fees. No internet required. Here is how to install Ollama, pick the right model for your hardware, and start running AI locally in under 10 minutes.
Ollama is the easiest way to run large language models on your own hardware. No cloud account, no API fees, no internet required once the model is downloaded. This guide walks you through getting set up, choosing the right model for your machine, and actually doing something useful with it.
What Is Ollama?
Ollama is a free, open-source tool that lets you download and run AI language models locally. It handles all the heavy lifting — model downloads, quantization, hardware acceleration — so you just run one command and start chatting. It works on Mac (Apple Silicon and Intel), Windows, and Linux.
Under the hood it uses llama.cpp to run models efficiently on consumer hardware, with Metal GPU acceleration on Mac and CUDA on Nvidia GPUs. The result is that models that would require expensive cloud infrastructure can run on a decent laptop.
Step 1: Download and Install Ollama
Go to ollama.com and download the installer for your platform. It is a straightforward install — drag to Applications on Mac, run the .exe on Windows, or use the one-line install script on Linux:
curl -fsSL https://ollama.com/install.sh | sh
After installing, Ollama runs as a background service. You can verify it is working by opening a terminal and running:
ollama --version
You should see a version number. That means the Ollama server is running and ready to pull models.
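The CLI talks to a local server that Ollama runs in the background (on port 11434 by default). If you want a script-friendly health check, you can probe that port directly. This is a small sketch assuming the default host and port; the `ollama_up` helper name is our own:

```shell
#!/bin/sh
# Probe the local Ollama server. Assumes the default address
# (http://localhost:11434); the root endpoint returns a short
# status string when the server is up.
ollama_up() {
  if curl -s http://localhost:11434/ | grep -q "Ollama is running"; then
    echo "server ok"
  else
    echo "server not reachable"
  fi
}
```

Run `ollama_up` before scripting against the server; if it reports not reachable, launch the Ollama app (or run `ollama serve` on Linux).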
Step 2: Pull and Run Your First Model
The fastest model to start with is Llama 3.2 3B — it is small enough to run on almost any machine and smart enough to be genuinely useful:
ollama run llama3.2
The first run will download the model (around 2GB). Once it is done, you will get a chat prompt directly in your terminal. Type anything and press Enter. To exit, type /bye.
That is it. You are running AI locally.
Choosing the Right Model for Your Hardware
The main constraint is VRAM (on Nvidia/AMD GPUs) or unified memory (on Apple Silicon). If a model does not fit entirely in GPU memory, it falls back to RAM or CPU, which is much slower. Here is a practical guide:
4GB VRAM or Unified Memory
- llama3.2:3b — Fast, capable, great for chat and simple tasks
- phi3:mini — Microsoft's compact model, surprisingly good at reasoning
- gemma2:2b — Google's smallest Gemma, good for Q&A and summarization
- qwen2.5:3b — Strong multilingual performance in a tiny footprint
8GB VRAM or Unified Memory
- llama3.1:8b — The sweet spot for most users. Fast and genuinely capable
- mistral:7b — Excellent instruction following, very fast inference
- gemma2:9b — Google's mid-size model, strong on reasoning tasks
- deepseek-r1:8b — Strong reasoning model, good for logic and code problems
- codellama:7b — Optimized specifically for code generation and explanation
16GB VRAM or Unified Memory
- qwen2.5:14b — Noticeably smarter than the 8B models, still fast on good hardware
- mistral-nemo:12b — Mistral's 12B model, great all-rounder
- deepseek-r1:14b — Excellent reasoning and coding at this size
- phi4:14b — Microsoft's latest, punches well above its weight
24GB+ VRAM or 32GB+ Unified Memory
- llama3.1:70b — Near GPT-4 quality for many tasks, though its default 4-bit build is roughly 40GB, so it needs the top end of this range or beyond
- deepseek-r1:32b — Serious reasoning model, competes with frontier models on benchmarks
- qwen2.5:32b — Outstanding coding and multilingual model
- mixtral:8x7b — Mixture-of-experts architecture, fast and smart
A practical note on Apple Silicon: the M1/M2/M3/M4 chips share memory between CPU and GPU, so a MacBook Pro with 16GB of RAM can hold a quantized 13B-class model entirely in GPU-accessible memory. This is a major advantage over discrete GPUs, where only dedicated VRAM counts.
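If you want a quick sanity check before pulling a model, a rough rule of thumb is that a 4-bit quantized model needs about half a gigabyte of memory per billion parameters, plus headroom for the context cache. The 20% headroom factor below is our own approximation; actual usage varies with context length and quantization:

```shell
#!/bin/sh
# Rough memory estimate for a 4-bit (Q4) quantized model:
# ~0.5 GB per billion parameters for the weights, plus ~20%
# headroom for the KV cache and runtime overhead. An
# approximation only -- real usage varies.
estimate_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.5 * 1.2 }'
}

estimate_gb 8    # an 8B model: prints 4.8
estimate_gb 70   # a 70B model: prints 42.0
```

Compare the result against your VRAM or unified memory before pulling; if the estimate exceeds it, expect a slow fallback to CPU.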
Useful Ollama Commands
A few commands worth knowing:
- ollama list — see all models you have downloaded
- ollama pull modelname — download a model without running it
- ollama rm modelname — delete a model to free disk space
- ollama ps — see what models are currently loaded in memory
- ollama run modelname 'your prompt here' — run a one-shot prompt from the terminal
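These commands compose nicely in scripts. As a small illustrative example (the `ollama_status` name is ours), you can bundle the two inspection commands into one housekeeping check:

```shell
#!/bin/sh
# Print what is on disk and what is currently loaded in
# memory, using the ollama CLI commands described above.
ollama_status() {
  echo "== downloaded models =="
  ollama list
  echo "== loaded in memory =="
  ollama ps
}
```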
You can also pass a file as context. For example, to summarize a document:
ollama run llama3.1 'Summarize this:' < my-document.txt
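The same redirection pattern can be wrapped in a small helper so any file is one word away from a summary. A sketch, with our own `summarize` name; it assumes llama3.1 has already been pulled:

```shell
#!/bin/sh
# Summarize any text file with a local model. Assumes the
# llama3.1 model is already downloaded via `ollama pull`.
summarize() {
  ollama run llama3.1 "Summarize this document in three bullet points:" < "$1"
}
```

Then `summarize report.txt` works from any directory.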
Ideas for What to Actually Do With It
Once Ollama is running, here are some genuinely useful things you can do:
Private Document Q&A
Paste in a contract, report, or article and ask questions about it. Nothing leaves your machine. This is particularly useful for anything you would not want to send to a cloud API — legal documents, internal business data, personal notes.
Code Review and Explanation
Paste a function and ask 'what does this do' or 'how would you improve this'. Models like CodeLlama and DeepSeek-R1 are specifically tuned for code and will often catch issues or suggest cleaner approaches. Again, your codebase stays local.
Writing Assistance
Draft emails, rewrite paragraphs, brainstorm headings, or ask it to make a piece of writing more concise. The 8B+ models handle this well enough for everyday writing tasks.
Learning and Research
Ask it to explain concepts, quiz you on a topic, or break down something complex into simpler terms. Models like Phi-4 and Mistral are particularly good at clear explanations.
Automating Terminal Tasks
Combine Ollama with shell scripts to build simple automation. Pipe text through a model, generate summaries of log files, or create a local 'ask me anything' script you can call from anywhere:
echo 'Explain this error: segmentation fault in malloc' | ollama run llama3.1
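Extending that one-liner, here is a sketch of a log-watching helper. The function name, model choice, and line count are our own examples; point it at whichever log file you care about:

```shell
#!/bin/sh
# Feed the tail of a log file to a local model and ask it to
# flag anything that looks like an error. Model name and line
# count are placeholders -- adjust to taste.
summarize_log() {
  tail -n 100 "$1" |
    ollama run llama3.2 "Summarize these log lines and flag anything that looks like an error:"
}
```

Usage: `summarize_log /var/log/system.log`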
Running It Offline
On a plane, in a basement, in a location with spotty internet — once the model is downloaded, Ollama works with zero connectivity. This is one of its biggest practical advantages over web-based AI tools.
Going Further: Open WebUI
If you want a browser-based chat interface instead of the terminal, install Open WebUI. It connects to your local Ollama instance and gives you a ChatGPT-style UI with conversation history, model switching, and file uploads:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
Then open localhost:3000 in your browser. You will see all your Ollama models available as options. This makes local AI accessible to anyone who finds the terminal intimidating.
The Bottom Line
Ollama removes almost all of the friction from running AI locally. The install takes two minutes, the first model takes a few minutes to download, and you are up and running. For anyone who has hesitated because local AI seemed technically daunting, this is the tool that changes that.
Start with llama3.2 on whatever hardware you have, see how it performs, then move up to a larger model if you want more capability. The difference between a 3B and an 8B model is noticeable, and the difference between 8B and the 13–14B class is noticeable again. Find the one that balances speed and quality for your machine and stick with it.