
Running Local LLMs on a Mac with Ollama and Open WebUI

Install Ollama, pick the right models for your hardware, and set up Open WebUI as a private ChatGPT alternative. Complete setup guide for Apple Silicon Macs.

This guide covers everything from installing Ollama to chatting with a local language model through a browser interface, step by step. We’ll install the runtime, pull your first model, understand which models work well for what, set up Open WebUI as a chat frontend for the whole family, and go over the commands you’ll use day to day. Each step includes background on what’s happening and what to expect. You should have already completed the Mac preparation guide before starting here.

Install Ollama

brew install ollama

Homebrew downloads the Ollama binary, puts it on your PATH, and registers a LaunchAgent so Ollama starts automatically every time you log in. After the install finishes, Ollama is already running in the background and listening on port 11434.

Verify that it’s up:

curl http://localhost:11434/api/version

You should see something like:

{"version":"0.6.2"}

If you get “connection refused”, Ollama isn’t running yet. Start it manually with brew services start ollama and try the curl again.
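If you script against Ollama (for automations that fire right after login, for example), it helps to wait for the API instead of assuming it's up. A minimal sketch in Python, using only the standard library; the retry counts and timeout are arbitrary choices, not Ollama defaults:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_ollama(base_url="http://localhost:11434",
                    attempts=10, delay=1.0, fetch=None):
    """Poll /api/version until Ollama answers; return the version or None.

    fetch is injectable so the logic can be tested without a server.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=2).read()
    for _ in range(attempts):
        try:
            return json.loads(fetch(f"{base_url}/api/version"))["version"]
        except (urllib.error.URLError, OSError, ValueError, KeyError):
            time.sleep(delay)
    return None
```

Call `wait_for_ollama()` at the top of a script and bail out cleanly if it returns None, rather than letting the first real request fail with a connection error.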

Why native and not Docker?

Ollama needs to run directly on macOS, not inside a container. The native version gets Metal GPU acceleration through Apple’s unified memory. The Docker version runs inside a Linux VM (OrbStack or Docker Desktop) and has zero GPU access. On a 64GB M1 Max, that’s the difference between a 32B model running at a comfortable speed and the same model crawling through CPU-only inference.

Auto-start after reboot

The Homebrew install creates a LaunchAgent that starts Ollama on login. You can confirm the file exists:

ls ~/Library/LaunchAgents/ | grep ollama

This should print something like homebrew.mxcl.ollama.plist. If you see it, Ollama will come back on its own after a reboot. No manual intervention needed.

If you installed Ollama from the .dmg download instead of Homebrew, it handles auto-start differently (through a Login Item rather than a LaunchAgent) but should still come back after a reboot. Either way, verify after your next restart:

curl http://localhost:11434/api/version

Pull and run your first model

Ollama is running, but it doesn’t have any models yet. You need to download one first. Models are identified by name and size tag, like llama3.1:8b (Llama 3.1, 8 billion parameters).

ollama pull llama3.1:8b

This downloads about 4.7GB from Ollama’s model registry. You’ll see a progress bar:

pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
pulling 73b313b5552d... 100% ▕████████████████▏ 1.4 KB
...
success

Once it’s done, start a conversation right in the terminal:

ollama run llama3.1:8b

This loads the model into memory (takes a second or two on first load) and drops you into an interactive chat. Type a question, get a response. Type /bye to exit. I remember the first time I did this and couldn’t stop grinning. A language model, running on my desk, no internet required.

The first time you load a model, watch your memory usage in Activity Monitor. Llama 3.1 8B uses about 8GB of unified memory. On a 64GB machine, that’s nothing. On a 32GB machine, it matters more when you have other services running.
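If you want a rough feel for how a model will fit before pulling it, the arithmetic is simple: a 4-bit-quantized model needs a bit over half a byte per parameter, plus some headroom for the KV cache and runtime buffers. A back-of-the-envelope sketch; both constants are ballpark assumptions, not measurements, and real usage grows with context length:

```python
def approx_ram_gb(params_billion, bytes_per_weight=0.55, overhead_gb=1.5):
    """Very rough unified-memory estimate for a Q4-quantized model.

    ~0.55 bytes per weight for 4-bit quantization, plus a fixed
    allowance for KV cache and runtime buffers (both assumptions).
    """
    return params_billion * bytes_per_weight + overhead_gb
```

For an 8B model this lands near 6 GB of weights plus overhead; the ~8 GB you see in Activity Monitor includes larger context buffers. The point is the scaling: a 32B model needs roughly four times the memory of an 8B one.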

Model recommendations

Not all models are created equal, and what works depends on your hardware. Here’s what I’ve tested on a Mac Studio M1 Max with 64GB unified memory.

| Model | Size on disk | RAM needed | Good for | Speed (M1 Max 64GB) |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B | ~2 GB | ~4 GB | Quick tasks, testing, low overhead | Very fast, near-instant |
| Llama 3.1 8B | ~4.7 GB | ~8 GB | General purpose, daily driver | Fast, comfortable for chat |
| Qwen 2.5 32B | ~20 GB | ~24 GB | Best quality at reasonable speed | Medium, ~15 tok/s |
| Qwen 2.5 Coder 7B | ~4.7 GB | ~8 GB | Code generation, review | Fast |
| Qwen 2.5 Coder 32B | ~20 GB | ~24 GB | Complex coding tasks | Medium, ~15 tok/s |
| Mistral 7B | ~4.1 GB | ~8 GB | Compact, good European languages | Fast |
| DeepSeek Coder V2 | ~8.9 GB | ~12 GB | Code specialist, fill-in-the-middle | Moderate |

To pull any of these:

ollama pull llama3.2:3b
ollama pull qwen2.5:32b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b
ollama pull mistral:7b
ollama pull deepseek-coder-v2:latest

A few notes on picking models:

Start with Llama 3.1 8B. It’s the safest default. Fast, capable enough for most things, leaves plenty of RAM for your other services.

Qwen 2.5 32B is the sweet spot for 64GB machines. Noticeably better answers than the 8B models. Uses a big chunk of memory but leaves enough room for Docker services running alongside it. This is what I reach for when quality matters.

3B models are useful, not just toys. Llama 3.2 3B handles summarization, simple Q&A, and text reformatting well enough. If you’re building automations that make many small requests, the speed advantage matters more than the quality gap.

Ollama loads models into memory on first request and unloads them after 5 minutes of inactivity (configurable). You don’t need to worry about memory management. Pull several models and switch between them as needed.
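The unload timeout is controlled by the `keep_alive` field that Ollama's API accepts per request (the `OLLAMA_KEEP_ALIVE` environment variable changes the default). A small sketch of building such a request body; the model name and prompt are placeholders:

```python
import json

def generate_payload(model, prompt, keep_alive="5m"):
    """Build a request body for Ollama's /api/generate endpoint.

    keep_alive accepts durations like "10m", "0" (unload immediately
    after the response), or -1 (stay loaded until Ollama restarts).
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    })
```

Setting a longer `keep_alive` is handy for automations that hit the same model repeatedly: you pay the load time once instead of every five minutes.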

The toybox

The table above covers the safe defaults. Once you start looking around, you’ll find that the number of available models is staggering. There are specialist models for coding, translation, medical text, legal documents, SQL generation, function calling, roleplay, math, image description, and dozens of other niches. New ones appear weekly. It’s a toybox, and half the fun is pulling a model you’ve never heard of and seeing what it can do.

Browsing models

Ollama’s library at ollama.com/library is the simplest starting point. Models are tagged by category and sorted by popularity. Search for a task (“code”, “math”, “vision”) and you’ll find specialized models. Each model page shows sizes, RAM requirements, and a one-line pull command.

HuggingFace at huggingface.co/models is the bigger catalog. Filter by task (text generation, summarization, translation), sort by trending or most downloaded, and filter by format (GGUF for Ollama). Many models on HuggingFace can be imported into Ollama with a Modelfile, though the ones already in Ollama’s library are easier to get started with.

Open WebUI’s model browser lets you pull models directly from the chat interface without touching the terminal. Convenient when you want to try something mid-conversation.

Things worth trying

A few categories that are easy to miss if you only stick with the general-purpose models:

Vision models like llava or llama3.2-vision can describe images, read text from screenshots, and answer questions about photos. Pull one and drag an image into Open WebUI.

Embedding models like nomic-embed-text turn text into vectors for search and retrieval. Not for chatting, but useful if you want to build semantic search over your documents later.

Small specialist models often punch above their weight for narrow tasks. A 3B model fine-tuned for SQL generation can outperform a general 32B model at writing queries. Worth exploring if you have a specific use case.

The models are free. Disk space is the only cost. Pull ten, delete the ones that don’t click, keep the ones that surprise you. ollama rm cleans up in seconds.

What local models are good at (and not)

Be realistic about what these can do. Local models in the 7-32B parameter range are not ChatGPT or Claude replacements. What works depends heavily on which model and how many parameters you throw at it.

Works well, even with smaller models (3-8B): summarizing text, drafting emails and replies, simple Q&A, reformatting and rewriting, and translation between common languages. The everyday tasks, and an 8B model handles them quickly.

Works with larger models (32B+): code generation and review, multi-step reasoning, working through longer documents, and nuanced writing. Expect good results, just slower than the cloud.

Not there yet: current events (local models only know their training data and can't browse), obscure factual knowledge without hallucinating, and the frontier-level reasoning you get from the latest cloud models.

The quality gap compared to cloud models is real. What makes up for it is privacy, which we’ll get to once Open WebUI is running.

Install Open WebUI

Open WebUI gives you a browser-based chat interface that talks to Ollama. Think of it as a self-hosted ChatGPT frontend.

Create a directory for the stack and a docker-compose.yml inside it:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3002:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - WEBUI_AUTH=false

volumes:
  open-webui-data:

A quick breakdown of what each piece does:

ports: "3002:8080" exposes the web interface on port 3002 of your Mac (the container listens on 8080 internally).

extra_hosts with host-gateway lets the container reach services running natively on your Mac, which is how it finds Ollama.

The open-webui-data volume persists accounts, chat history, and settings across container restarts.

OLLAMA_BASE_URL points Open WebUI at the Ollama API on the host.

WEBUI_AUTH=false skips the login screen entirely (more on that below).

restart: unless-stopped brings the container back automatically after a reboot or crash.

Start it:

docker compose up -d

You should see Docker pull the image (first time only, about 2GB) and start the container:

[+] Running 1/1
 ✔ Container open-webui  Started

Open http://localhost:3002 in your browser. On first visit, create an admin account (this is local, the account only exists on your machine). Pick a model from the dropdown at the top, and start chatting. If the dropdown is empty, Open WebUI can’t reach Ollama. Check that Ollama is running (curl http://localhost:11434/api/version) and that the OLLAMA_BASE_URL in your compose file is correct.

Making it available on your network

Because we mapped port 3002, Open WebUI is accessible from any device on your local network at http://<your-mac-ip>:3002. Your family can bookmark it on their phones and laptops. It looks and feels like ChatGPT, so there’s no learning curve.

With WEBUI_AUTH=false, there’s no login screen. Anyone on your network can use it. For a home network behind a router, that’s fine. If you want per-user chat history or access control, remove that line and let each family member create an account on first visit.

Things to try

Once Open WebUI is running, paste in a school newsletter and ask for a summary. Paste a recipe and ask it to scale from 2 to 6 portions. Ask it to draft a reply to your landlord about the utility bill.

The interesting part is what happens when the whole family starts using it. My wife pastes in letters from the school or the insurance company and asks what they actually mean. I’ve checked employment contracts for notice periods. Asked about my kid’s rash at 10 PM. The kind of questions you wouldn’t type into Google or ChatGPT because they’re too personal, too specific to your family. With a local model, there’s nobody on the other end. No account, no history, no profile being built. It’s just your Mac.

One more thing worth knowing: the model you downloaded today will behave the same way in six months. No silent updates that change how it responds. If you’ve used ChatGPT long enough to notice a model getting worse after an update, you know how annoying that is. Local models are frozen. You update when you choose to.

What about LM Studio?

LM Studio is a popular alternative. It has a GUI, a CLI, can run as a headless daemon, and supports Apple’s MLX framework, which runs 20-30% faster than Ollama’s GGUF backend on Apple Silicon. Worth looking at, especially if inference speed matters to you. We’ll cover LM Studio in a separate guide and compare the two in detail.

This guide focuses on Ollama because it has the broader ecosystem today. Open WebUI, n8n, Continue (VS Code), LangChain, and most other tools that integrate with local LLMs expect an Ollama endpoint.

Useful Ollama commands

Once you’ve been running Ollama for a while, you’ll accumulate models and want to manage them. Here’s what you’ll reach for.

See what you have downloaded:

ollama list

Shows all models on disk with their size and when they were last modified. Useful for checking how much disk space your models are using.

Check what’s loaded in memory right now:

ollama ps

Shows models that are currently in RAM, how much memory they’re using, and when they’ll be unloaded. If nothing shows up, no model is active. Ollama loads models on demand and unloads them after 5 minutes of inactivity.

Download or update a model:

ollama pull qwen2.5:32b

If you already have the model, this checks for updates and only downloads changed layers. Safe to run anytime.

Start an interactive chat:

ollama run llama3.1:8b

Loads the model (if not already loaded) and drops you into a terminal chat session. If the model isn’t downloaded yet, run will pull it first automatically.

Inspect a model’s details:

ollama show qwen2.5:32b

Prints the model’s parameters, prompt template, license, and system prompt. Helpful when you want to understand how a model expects to be prompted, or to check what quantization you’re running.

Delete a model you no longer need:

ollama rm mistral:7b

Removes the model files from disk immediately. If you pulled a bunch of models to experiment and want to reclaim space, this is how.

API endpoints

Ollama exposes an OpenAI-compatible API on port 11434. This is how other tools talk to your local models. Any software that supports OpenAI’s API can point at http://localhost:11434/v1 and it works without modification.

Chat completion (the same format ChatGPT’s API uses):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Returns a JSON response with the model’s reply. This is the endpoint that Open WebUI, n8n, Continue (VS Code extension), LangChain, and most other integrations use.
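The same call from Python, sketched with only the standard library (the URL and model name match the curl example above; splitting out the request builder keeps the logic testable without a running server):

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, messages, url=CHAT_URL):
    """Assemble the OpenAI-style request without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

def chat(model, prompt):
    """Send one user message and return the model's reply text."""
    req = build_chat_request(model, [{"role": "user", "content": prompt}])
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With Ollama running, `chat("llama3.1:8b", "Hello")` returns the reply as a plain string, same shape as the curl output.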

List available models (useful for tools that need to discover what’s installed):

curl http://localhost:11434/v1/models

Single-prompt generation (Ollama’s native endpoint, simpler than the chat format):

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'

The stream: false flag makes it return the complete response in one go. Without it, Ollama streams tokens as they’re generated, which is useful for real-time UIs but harder to work with in scripts.
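When you do stream, each line Ollama sends is a small JSON object carrying a fragment of the answer in its "response" field, with "done": true on the last one. A minimal sketch of reassembling those lines into the full text:

```python
import json

def join_stream(lines):
    """Reassemble a streamed /api/generate response.

    Each line is a JSON object with a "response" fragment;
    the final line has "done": true.
    """
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

Feed it the response lines as they arrive and you get the complete answer, which is essentially what chat UIs do token by token.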

Checklist

Ollama installed via Homebrew and answering on port 11434

LaunchAgent in place so Ollama comes back after a reboot

At least one model pulled (llama3.1:8b is the safe default)

ollama run gives you a working terminal chat

Open WebUI running in Docker and reachable at http://localhost:3002

A model shows up in the Open WebUI dropdown and responds

Open WebUI reachable from another device at http://<your-mac-ip>:3002

Try it with your local LLM

Copy this guide and paste it into Open WebUI or any local chat interface as a new conversation. Your local model becomes a setup assistant that walks you through each step, explains commands, and helps troubleshoot errors.
