
Running Local LLMs on a Mac with Ollama and Open WebUI

Install Ollama, pick the right models for your hardware, and set up Open WebUI as a private ChatGPT alternative. Complete setup guide for Apple Silicon Macs.

This guide covers everything from installing Ollama to chatting with a local language model through a browser interface, step by step. We’ll install the runtime, pull your first model, understand which models work well for what, set up Open WebUI as a chat frontend for the whole family, and go over the commands you’ll use day to day. Each step includes background on what’s happening and what to expect. You should have already completed the Mac preparation guide before starting here.

Install Ollama

brew install ollama

Homebrew downloads the Ollama binary, puts it on your PATH, and registers a LaunchAgent so Ollama starts automatically every time you log in. After the install finishes, Ollama is already running in the background and listening on port 11434.

Verify that it’s up:

curl http://localhost:11434/api/version

You should see something like:

{"version":"0.6.2"}

If you get “connection refused”, Ollama isn’t running yet. Start it manually with brew services start ollama and try the curl again.
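If you script against Ollama (for automations that fire right after login, for example), it helps to wait for the API instead of assuming it's up. A minimal sketch in Python, using only the standard library; the retry counts and timeout are arbitrary choices, not Ollama defaults:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_ollama(base_url="http://localhost:11434",
                    attempts=10, delay=1.0, fetch=None):
    """Poll /api/version until Ollama answers; return the version or None.

    fetch is injectable so the logic can be tested without a server.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=2).read()
    for _ in range(attempts):
        try:
            return json.loads(fetch(f"{base_url}/api/version"))["version"]
        except (urllib.error.URLError, OSError, ValueError, KeyError):
            time.sleep(delay)
    return None
```

Call `wait_for_ollama()` at the top of a script and bail out cleanly if it returns None, rather than letting the first real request fail with a connection error.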

Why native and not Docker?

Ollama needs to run directly on macOS, not inside a container. The native version gets Metal GPU acceleration through Apple’s unified memory. The Docker version runs inside a Linux VM (OrbStack or Docker Desktop) and has zero GPU access. On a 64GB M1 Max, that’s the difference between a 32B model running at a comfortable speed and the same model crawling through CPU-only inference.

Auto-start after reboot

The Homebrew install creates a LaunchAgent that starts Ollama on login. You can confirm the file exists:

ls ~/Library/LaunchAgents/ | grep ollama

This should print something like homebrew.mxcl.ollama.plist. If you see it, Ollama will come back on its own after a reboot. No manual intervention needed.

If you installed Ollama from the .dmg download instead of Homebrew, it handles auto-start differently (through a Login Item rather than a LaunchAgent) but should still come back after a reboot. Either way, verify after your next restart:

curl http://localhost:11434/api/version

Pull and run your first model

Ollama is running, but it doesn’t have any models yet. You need to download one first. Models are identified by name and size tag, like llama3.1:8b (Llama 3.1, 8 billion parameters).

ollama pull llama3.1:8b

This downloads about 4.7GB from Ollama’s model registry. You’ll see a progress bar:

pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
pulling 73b313b5552d... 100% ▕████████████████▏ 1.4 KB
...
success

Once it’s done, start a conversation right in the terminal:

ollama run llama3.1:8b

This loads the model into memory (takes a second or two on first load) and drops you into an interactive chat. Type a question, get a response. Type /bye to exit. I remember the first time I did this and couldn’t stop grinning. A language model, running on my desk, no internet required.

The first time you load a model, watch your memory usage in Activity Monitor. Llama 3.1 8B uses about 8GB of unified memory. On a 64GB machine, that’s nothing. On a 32GB machine, it matters more when you have other services running.
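If you want a rough feel for how a model will fit before pulling it, the arithmetic is simple: a 4-bit-quantized model needs a bit over half a byte per parameter, plus some headroom for the KV cache and runtime buffers. A back-of-the-envelope sketch; both constants are ballpark assumptions, not measurements, and real usage grows with context length:

```python
def approx_ram_gb(params_billion, bytes_per_weight=0.55, overhead_gb=1.5):
    """Very rough unified-memory estimate for a Q4-quantized model.

    ~0.55 bytes per weight for 4-bit quantization, plus a fixed
    allowance for KV cache and runtime buffers (both assumptions).
    """
    return params_billion * bytes_per_weight + overhead_gb
```

For an 8B model this lands near 6 GB of weights plus overhead; the ~8 GB you see in Activity Monitor includes larger context buffers. The point is the scaling: a 32B model needs roughly four times the memory of an 8B one.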

Model recommendations

Not all models are created equal, and what works depends on your hardware. Here’s what I’ve tested on a Mac Studio M1 Max with 64GB unified memory.

| Model | Size on disk | RAM needed | Good for | Speed (M1 Max 64GB) |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B | ~2 GB | ~4 GB | Quick tasks, testing, low overhead | Very fast, near-instant |
| Llama 3.1 8B | ~4.7 GB | ~8 GB | General purpose, daily driver | Fast, comfortable for chat |
| Qwen 2.5 32B | ~20 GB | ~24 GB | Best quality at reasonable speed | Medium, ~15 tok/s |
| Qwen 2.5 Coder 7B | ~4.7 GB | ~8 GB | Code generation, review | Fast |
| Qwen 2.5 Coder 32B | ~20 GB | ~24 GB | Complex coding tasks | Medium, ~15 tok/s |
| Mistral 7B | ~4.1 GB | ~8 GB | Compact, good European languages | Fast |
| DeepSeek Coder V2 | ~8.9 GB | ~12 GB | Code specialist, fill-in-the-middle | Moderate |

To pull any of these:

ollama pull llama3.2:3b
ollama pull qwen2.5:32b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b
ollama pull mistral:7b
ollama pull deepseek-coder-v2:latest

A few notes on picking models:

Start with Llama 3.1 8B. It’s the safest default. Fast, capable enough for most things, leaves plenty of RAM for your other services.

Qwen 2.5 32B is the sweet spot for 64GB machines. Noticeably better answers than the 8B models. Uses a big chunk of memory but leaves enough room for Docker services running alongside it. This is what I reach for when quality matters.

3B models are useful, not just toys. Llama 3.2 3B handles summarization, simple Q&A, and text reformatting well enough. If you’re building automations that make many small requests, the speed advantage matters more than the quality gap.

Ollama loads models into memory on first request and unloads them after 5 minutes of inactivity (configurable). You don’t need to worry about memory management. Pull several models and switch between them as needed.
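The unload timeout is controlled by the `keep_alive` field that Ollama's API accepts per request (the `OLLAMA_KEEP_ALIVE` environment variable changes the default). A small sketch of building such a request body; the model name and prompt are placeholders:

```python
import json

def generate_payload(model, prompt, keep_alive="5m"):
    """Build a request body for Ollama's /api/generate endpoint.

    keep_alive accepts durations like "10m", "0" (unload immediately
    after the response), or -1 (stay loaded until Ollama restarts).
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    })
```

Setting a longer `keep_alive` is handy for automations that hit the same model repeatedly: you pay the load time once instead of every five minutes.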

The toybox

The table above covers the safe defaults. Once you start looking around, you’ll find that the number of available models is staggering. There are specialist models for coding, translation, medical text, legal documents, SQL generation, function calling, roleplay, math, image description, and dozens of other niches. New ones appear weekly. It’s a toybox, and half the fun is pulling a model you’ve never heard of and seeing what it can do.

Browsing models

Ollama’s library at ollama.com/library is the simplest starting point. Models are tagged by category and sorted by popularity. Search for a task (“code”, “math”, “vision”) and you’ll find specialized models. Each model page shows sizes, RAM requirements, and a one-line pull command.

HuggingFace at huggingface.co/models is the bigger catalog. Filter by task (text generation, summarization, translation), sort by trending or most downloaded, and filter by format (GGUF for Ollama). Many models on HuggingFace can be imported into Ollama with a Modelfile, though the ones already in Ollama’s library are easier to get started with.

Open WebUI’s model browser lets you pull models directly from the chat interface without touching the terminal. Convenient when you want to try something mid-conversation.

Things worth trying

A few categories that are easy to miss if you only stick with the general-purpose models:

Vision models like llava or llama3.2-vision can describe images, read text from screenshots, and answer questions about photos. Pull one and drag an image into Open WebUI.

Embedding models like nomic-embed-text turn text into vectors for search and retrieval. Not for chatting, but useful if you want to build semantic search over your documents later.

Small specialist models often punch above their weight for narrow tasks. A 3B model fine-tuned for SQL generation can outperform a general 32B model at writing queries. Worth exploring if you have a specific use case.

The models are free. Disk space is the only cost. Pull ten, delete the ones that don’t click, keep the ones that surprise you. ollama rm cleans up in seconds.

What local models are good at (and not)

Be realistic about what these can do. Local models in the 7-32B parameter range are not ChatGPT or Claude replacements. What works depends heavily on which model and how many parameters you throw at it.

Works well, even with smaller models (3-8B): summarizing text, drafting emails and replies, simple Q&A, reformatting and rewriting, and translation between common languages. The everyday tasks, and an 8B model handles them quickly.

Works with larger models (32B+): code generation and review, multi-step reasoning, working through longer documents, and nuanced writing. Expect good results, just slower than the cloud.

Not there yet: current events (local models only know their training data and can't browse), obscure factual knowledge without hallucinating, and the frontier-level reasoning you get from the latest cloud models.

The quality gap compared to cloud models is real. What makes up for it is privacy, which we’ll get to once Open WebUI is running.

Install Open WebUI

Open WebUI gives you a browser-based chat interface that talks to Ollama. Think of it as a self-hosted ChatGPT frontend.

Create a directory for the stack and a docker-compose.yml inside it:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3002:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - WEBUI_AUTH=false

volumes:
  open-webui-data:

A quick breakdown of what each piece does:

ports: "3002:8080" exposes the web interface on port 3002 of your Mac (the container listens on 8080 internally).

extra_hosts with host-gateway lets the container reach services running natively on your Mac, which is how it finds Ollama.

The open-webui-data volume persists accounts, chat history, and settings across container restarts.

OLLAMA_BASE_URL points Open WebUI at the Ollama API on the host.

WEBUI_AUTH=false skips the login screen entirely (more on that below).

restart: unless-stopped brings the container back automatically after a reboot or crash.

Start it:

docker compose up -d

You should see Docker pull the image (first time only, about 2GB) and start the container:

[+] Running 1/1
 ✔ Container open-webui  Started

Open http://localhost:3002 in your browser. On first visit, create an admin account (this is local, the account only exists on your machine). Pick a model from the dropdown at the top, and start chatting. If the dropdown is empty, Open WebUI can’t reach Ollama. Check that Ollama is running (curl http://localhost:11434/api/version) and that the OLLAMA_BASE_URL in your compose file is correct.

Making it available on your network

Because we mapped port 3002, Open WebUI is accessible from any device on your local network at http://<your-mac-ip>:3002. Your family can bookmark it on their phones and laptops. It looks and feels like ChatGPT, so there’s no learning curve.

With WEBUI_AUTH=false, there’s no login screen. Anyone on your network can use it. For a home network behind a router, that’s fine. If you want per-user chat history or access control, remove that line and let each family member create an account on first visit.

Things to try

Once Open WebUI is running, paste in a school newsletter and ask for a summary. Paste a recipe and ask it to scale from 2 to 6 portions. Ask it to draft a reply to your landlord about the utility bill.

The interesting part is what happens when the whole family starts using it. My wife pastes in letters from the school or the insurance company and asks what they actually mean. I’ve checked employment contracts for notice periods. Asked about my kid’s rash at 10 PM. The kind of questions you wouldn’t type into Google or ChatGPT because they’re too personal, too specific to your family. With a local model, there’s nobody on the other end. No account, no history, no profile being built. It’s just your Mac.

One more thing worth knowing: the model you downloaded today will behave the same way in six months. No silent updates that change how it responds. If you’ve used ChatGPT long enough to notice a model getting worse after an update, you know how annoying that is. Local models are frozen. You update when you choose to.

What about LM Studio?

LM Studio is a popular alternative. It has a GUI, a CLI, can run as a headless daemon, and supports Apple’s MLX framework, which runs 20-30% faster than Ollama’s GGUF backend on Apple Silicon. Worth looking at, especially if inference speed matters to you. We’ll cover LM Studio in a separate guide and compare the two in detail.

This guide focuses on Ollama because it has the broader ecosystem today. Open WebUI, n8n, Continue (VS Code), LangChain, and most other tools that integrate with local LLMs expect an Ollama endpoint.

Useful Ollama commands

Once you’ve been running Ollama for a while, you’ll accumulate models and want to manage them. Here’s what you’ll reach for.

See what you have downloaded:

ollama list

Shows all models on disk with their size and when they were last modified. Useful for checking how much disk space your models are using.

Check what’s loaded in memory right now:

ollama ps

Shows models that are currently in RAM, how much memory they’re using, and when they’ll be unloaded. If nothing shows up, no model is active. Ollama loads models on demand and unloads them after 5 minutes of inactivity.

Download or update a model:

ollama pull qwen2.5:32b

If you already have the model, this checks for updates and only downloads changed layers. Safe to run anytime.

Start an interactive chat:

ollama run llama3.1:8b

Loads the model (if not already loaded) and drops you into a terminal chat session. If the model isn’t downloaded yet, run will pull it first automatically.

Inspect a model’s details:

ollama show qwen2.5:32b

Prints the model’s parameters, prompt template, license, and system prompt. Helpful when you want to understand how a model expects to be prompted, or to check what quantization you’re running.

Delete a model you no longer need:

ollama rm mistral:7b

Removes the model files from disk immediately. If you pulled a bunch of models to experiment and want to reclaim space, this is how.

API endpoints

Ollama exposes an OpenAI-compatible API on port 11434. This is how other tools talk to your local models. Any software that supports OpenAI’s API can point at http://localhost:11434/v1 and it works without modification.

Chat completion (the same format ChatGPT’s API uses):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Returns a JSON response with the model’s reply. This is the endpoint that Open WebUI, n8n, Continue (VS Code extension), LangChain, and most other integrations use.
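The same call from Python, sketched with only the standard library (the URL and model name match the curl example above; splitting out the request builder keeps the logic testable without a running server):

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, messages, url=CHAT_URL):
    """Assemble the OpenAI-style request without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

def chat(model, prompt):
    """Send one user message and return the model's reply text."""
    req = build_chat_request(model, [{"role": "user", "content": prompt}])
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With Ollama running, `chat("llama3.1:8b", "Hello")` returns the reply as a plain string, same shape as the curl output.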

List available models (useful for tools that need to discover what’s installed):

curl http://localhost:11434/v1/models

Single-prompt generation (Ollama’s native endpoint, simpler than the chat format):

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'

The stream: false flag makes it return the complete response in one go. Without it, Ollama streams tokens as they’re generated, which is useful for real-time UIs but harder to work with in scripts.
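When you do stream, each line Ollama sends is a small JSON object carrying a fragment of the answer in its "response" field, with "done": true on the last one. A minimal sketch of reassembling those lines into the full text:

```python
import json

def join_stream(lines):
    """Reassemble a streamed /api/generate response.

    Each line is a JSON object with a "response" fragment;
    the final line has "done": true.
    """
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

Feed it the response lines as they arrive and you get the complete answer, which is essentially what chat UIs do token by token.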

Checklist

Ollama installed via Homebrew and answering on port 11434

LaunchAgent in place so Ollama comes back after a reboot

At least one model pulled (llama3.1:8b is the safe default)

ollama run gives you a working terminal chat

Open WebUI running in Docker and reachable at http://localhost:3002

A model shows up in the Open WebUI dropdown and responds

Open WebUI reachable from another device at http://<your-mac-ip>:3002

Try it with your local LLM

Copy this guide and paste it into Open WebUI or any local chat interface as a new conversation. Your local model becomes a setup assistant that walks you through each step, explains commands, and helps troubleshoot errors.
