How to Download Ollama and Run Your First Local AI Model
No cloud. No API fees. No internet required. Here is how to install Ollama, pick the right model for your hardware, and start running AI locally in under 10 minutes.
Ollama is the easiest way to run large language models on your own hardware. No cloud account, no API fees, no internet required once the model is downloaded. This guide walks you through getting set up, choosing the right model for your machine, and actually doing something useful with it.
What Is Ollama?
Ollama is a free, open-source tool that lets you download and run AI language models locally. It handles all the heavy lifting — model downloads, quantization, hardware acceleration — so you just run one command and start chatting. It works on Mac (Apple Silicon and Intel), Windows, and Linux.
Under the hood it uses llama.cpp to run models efficiently on consumer hardware, with Metal GPU acceleration on Mac and CUDA on Nvidia GPUs. The result is that models that would require expensive cloud infrastructure can run on a decent laptop.
Step 1: Download and Install Ollama
Go to ollama.com and download the installer for your platform. It is a straightforward install — drag to Applications on Mac, run the .exe on Windows, or use the one-line install script on Linux:
curl -fsSL https://ollama.com/install.sh | sh
After installing, Ollama runs as a background service. You can verify it is working by opening a terminal and running:
ollama --version
You should see a version number. That means the Ollama server is running and ready to pull models.
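The CLI talks to a local server that Ollama runs in the background (on port 11434 by default). If you want a script-friendly health check, you can probe that port directly. This is a small sketch assuming the default host and port; the `ollama_up` helper name is our own:

```shell
#!/bin/sh
# Probe the local Ollama server. Assumes the default address
# (http://localhost:11434); the root endpoint returns a short
# status string when the server is up.
ollama_up() {
  if curl -s http://localhost:11434/ | grep -q "Ollama is running"; then
    echo "server ok"
  else
    echo "server not reachable"
  fi
}
```

Run `ollama_up` before scripting against the server; if it reports not reachable, launch the Ollama app (or run `ollama serve` on Linux).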
Step 2: Pull and Run Your First Model
The fastest model to start with is Llama 3.2 3B — it is small enough to run on almost any machine and smart enough to be genuinely useful:
ollama run llama3.2
The first run will download the model (around 2GB). Once it is done, you will get a chat prompt directly in your terminal. Type anything and press Enter. To exit, type /bye.
That is it. You are running AI locally.
Choosing the Right Model for Your Hardware
The main constraint is VRAM (on Nvidia/AMD GPUs) or unified memory (on Apple Silicon). If a model does not fit entirely in GPU memory, it falls back to RAM or CPU, which is much slower. Here is a practical guide:
4GB VRAM or Unified Memory
- llama3.2:3b — Fast, capable, great for chat and simple tasks
- phi3:mini — Microsoft's compact model, surprisingly good at reasoning
- gemma2:2b — Google's smallest Gemma, good for Q&A and summarization
- qwen2.5:3b — Strong multilingual performance in a tiny footprint
8GB VRAM or Unified Memory
- llama3.1:8b — The sweet spot for most users. Fast and genuinely capable
- mistral:7b — Excellent instruction following, very fast inference
- gemma2:9b — Google's mid-size model, strong on reasoning tasks
- deepseek-r1:8b — Strong reasoning model, good for logic and code problems
- codellama:7b — Optimized specifically for code generation and explanation
16GB VRAM or Unified Memory
- qwen2.5:14b — Noticeably smarter than the 8B models, still fast on good hardware
- mistral-nemo:12b — Mistral's 12B model, great all-rounder
- deepseek-r1:14b — Excellent reasoning and coding at this size
- phi4:14b — Microsoft's latest, punches well above its weight
24GB+ VRAM or 32GB+ Unified Memory
- llama3.1:70b — Near GPT-4 quality for many tasks, though its default 4-bit build is roughly 40GB, so it needs the top end of this range or beyond
- deepseek-r1:32b — Serious reasoning model, competes with frontier models on benchmarks
- qwen2.5:32b — Outstanding coding and multilingual model
- mixtral:8x7b — Mixture-of-experts architecture, fast and smart
A practical note on Apple Silicon: the M1/M2/M3/M4 chips share memory between CPU and GPU, so a MacBook Pro with 16GB of RAM can hold a quantized 13B-class model entirely in GPU-accessible memory. This is a major advantage over discrete GPUs, where only dedicated VRAM counts.
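If you want a quick sanity check before pulling a model, a rough rule of thumb is that a 4-bit quantized model needs about half a gigabyte of memory per billion parameters, plus headroom for the context cache. The 20% headroom factor below is our own approximation; actual usage varies with context length and quantization:

```shell
#!/bin/sh
# Rough memory estimate for a 4-bit (Q4) quantized model:
# ~0.5 GB per billion parameters for the weights, plus ~20%
# headroom for the KV cache and runtime overhead. An
# approximation only -- real usage varies.
estimate_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.5 * 1.2 }'
}

estimate_gb 8    # an 8B model: prints 4.8
estimate_gb 70   # a 70B model: prints 42.0
```

Compare the result against your VRAM or unified memory before pulling; if the estimate exceeds it, expect a slow fallback to CPU.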
Useful Ollama Commands
A few commands worth knowing:
- ollama list — see all models you have downloaded
- ollama pull modelname — download a model without running it
- ollama rm modelname — delete a model to free disk space
- ollama ps — see what models are currently loaded in memory
- ollama run modelname 'your prompt here' — run a one-shot prompt from the terminal
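These commands compose nicely in scripts. As a small illustrative example (the `ollama_status` name is ours), you can bundle the two inspection commands into one housekeeping check:

```shell
#!/bin/sh
# Print what is on disk and what is currently loaded in
# memory, using the ollama CLI commands described above.
ollama_status() {
  echo "== downloaded models =="
  ollama list
  echo "== loaded in memory =="
  ollama ps
}
```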
You can also pass a file as context. For example, to summarize a document:
ollama run llama3.1 'Summarize this:' < my-document.txt
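The same redirection pattern can be wrapped in a small helper so any file is one word away from a summary. A sketch, with our own `summarize` name; it assumes llama3.1 has already been pulled:

```shell
#!/bin/sh
# Summarize any text file with a local model. Assumes the
# llama3.1 model is already downloaded via `ollama pull`.
summarize() {
  ollama run llama3.1 "Summarize this document in three bullet points:" < "$1"
}
```

Then `summarize report.txt` works from any directory.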
Ideas for What to Actually Do With It
Once Ollama is running, here are some genuinely useful things you can do:
Private Document Q&A
Paste in a contract, report, or article and ask questions about it. Nothing leaves your machine. This is particularly useful for anything you would not want to send to a cloud API — legal documents, internal business data, personal notes.
Code Review and Explanation
Paste a function and ask 'what does this do' or 'how would you improve this'. Models like CodeLlama and DeepSeek-R1 are specifically tuned for code and will often catch issues or suggest cleaner approaches. Again, your codebase stays local.
Writing Assistance
Draft emails, rewrite paragraphs, brainstorm headings, or ask it to make a piece of writing more concise. The 8B+ models handle this well enough for everyday writing tasks.
Learning and Research
Ask it to explain concepts, quiz you on a topic, or break down something complex into simpler terms. Models like Phi-4 and Mistral are particularly good at clear explanations.
Automating Terminal Tasks
Combine Ollama with shell scripts to build simple automation. Pipe text through a model, generate summaries of log files, or create a local 'ask me anything' script you can call from anywhere:
echo 'Explain this error: segmentation fault in malloc' | ollama run llama3.1
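Extending that one-liner, here is a sketch of a log-watching helper. The function name, model choice, and line count are our own examples; point it at whichever log file you care about:

```shell
#!/bin/sh
# Feed the tail of a log file to a local model and ask it to
# flag anything that looks like an error. Model name and line
# count are placeholders -- adjust to taste.
summarize_log() {
  tail -n 100 "$1" |
    ollama run llama3.2 "Summarize these log lines and flag anything that looks like an error:"
}
```

Usage: `summarize_log /var/log/system.log`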
Running It Offline
On a plane, in a basement, in a location with spotty internet — once the model is downloaded, Ollama works with zero connectivity. This is one of its biggest practical advantages over web-based AI tools.
Going Further: Open WebUI
If you want a browser-based chat interface instead of the terminal, install Open WebUI. It connects to your local Ollama instance and gives you a ChatGPT-style UI with conversation history, model switching, and file uploads:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
Then open localhost:3000 in your browser. You will see all your Ollama models available as options. This makes local AI accessible to anyone who finds the terminal intimidating.
The Bottom Line
Ollama removes almost all of the friction from running AI locally. The install takes two minutes, the first model takes a few minutes to download, and you are up and running. For anyone who has hesitated because local AI seemed technically daunting, this is the tool that changes that.
Start with llama3.2 on whatever hardware you have, see how it performs, then move up to a larger model if you want more capability. The difference between a 3B and an 8B model is noticeable, and the difference between 8B and the 13–14B class is noticeable again. Find the one that balances speed and quality for your machine and stick with it.