# Local LLM Server — Reference

Personal Ollama-based LLM server running on this Mac Studio (M4 Max, 128GB unified memory). This document is the single source of truth for Claude Code agents that need to query, manage, or troubleshoot the server.

## TL;DR

There are two ways to reach the API: **Tailnet** (auth-free, requires being on the user's tailnet) and **Public Internet** (HTTPS, requires API key).

| What | Tailnet (no key) | Public (HTTPS, key required) |
|---|---|---|
| API base | `http://kevin-macstudio.tail91f148.ts.net:11434/v1` | `https://kevin-macstudio.tail91f148.ts.net/v1` |
| Models list | `http://...:11434/api/tags` | `https://kevin-macstudio.tail91f148.ts.net/api/tags` |
| Web UI | `http://kevin-macstudio.tail91f148.ts.net:8080` | (not exposed publicly) |
| Docs (this file) | `http://kevin-macstudio.tail91f148.ts.net:8081/` | `https://kevin-macstudio.tail91f148.ts.net/` |
| Raw markdown | `http://...:8081/llm-server.md` | `https://kevin-macstudio.tail91f148.ts.net/llm-server.md` |

- **Models**: `gemma-fast`, `gemma-max`, `qwen-uncensored`
- **Tailnet auth**: none (Tailscale ACL is the boundary). Pass `api_key="ollama"` as a dummy when SDKs require one.
- **Public auth**: send `Authorization: Bearer <key>` on every `/v1/*` and `/api/*` request. Docs (`/`, `/llm-server.md`) are open. The key is stored on the Mac at `/Users/kevin/.config/llm-api-key` (chmod 600). Treat it like a secret — anyone with it can use the GPU until the key is rotated.

## Models

| Alias | Underlying model | Size | When to use |
|---|---|---|---|
| `gemma-fast` | `gemma4:26b-a4b-it-q8_0` (MoE 26B / 4B active) | 28GB | Default. Fast responses, good general quality. |
| `gemma-max` | `gemma4:31b-it-q8_0` (dense 31B) | 33GB | Complex reasoning, coding, longer-context tasks. Slower. |
| `qwen-uncensored` | `tripolskypetr/qwen3.5-uncensored-aggressive:35b` (Q4_K_M, 34.7B) | 21GB | Refusals removed (0/465). Use when the censored models block a legitimate request. |

All three have **thinking mode enabled by default** — the model spends tokens reasoning before answering. The OpenAI-compatible response splits this into:

- `choices[0].message.content` → the final answer
- `choices[0].message.reasoning` → the chain-of-thought

To get a fast, direct answer, set `"reasoning_effort": "none"` (or `"low"`) in the request body, and budget `max_tokens` generously (300+) when reasoning is on.
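For example, a minimal sketch (assuming the `requests` package and the response layout described above) that captures both the final answer and the reasoning trace over the tailnet endpoint:

```python
# Sketch: read both the answer and the chain-of-thought from the tailnet
# endpoint (no key needed). Field names follow the layout documented above;
# the "reasoning" field is read best-effort and may be absent.
import requests

resp = requests.post(
    "http://kevin-macstudio.tail91f148.ts.net:11434/v1/chat/completions",
    json={
        "model": "gemma-fast",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 800,  # generous budget so the trace doesn't eat the answer
    },
    timeout=120,
)
resp.raise_for_status()
msg = resp.json()["choices"][0]["message"]
print("ANSWER:\n", msg["content"])
print("REASONING:\n", msg.get("reasoning"))  # None if the model skipped reasoning
```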
### Recommended sampling params for `qwen-uncensored`

```json
{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}
```

## API usage

### curl — Tailnet (no key)

```bash
curl http://kevin-macstudio.tail91f148.ts.net:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-fast",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 500,
    "reasoning_effort": "low"
  }'
```

### curl — Public Internet (with key)

```bash
curl https://kevin-macstudio.tail91f148.ts.net/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-fast",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 500,
    "reasoning_effort": "low"
  }'
```

### Python (OpenAI SDK)

```python
import os
from openai import OpenAI

# Tailnet (no key):
client = OpenAI(
    base_url="http://kevin-macstudio.tail91f148.ts.net:11434/v1",
    api_key="ollama",  # dummy
)

# OR Public (with key):
client = OpenAI(
    base_url="https://kevin-macstudio.tail91f148.ts.net/v1",
    api_key=os.environ["LLM_API_KEY"],
)

resp = client.chat.completions.create(
    model="gemma-max",
    messages=[{"role": "user", "content": "Write a Python fibonacci generator."}],
    max_tokens=2000,
    reasoning_effort="low",
)
print(resp.choices[0].message.content)
```

### Streaming

Add `"stream": true` to the curl body, or `stream=True` to the Python call. Standard OpenAI SSE format.

### Listing models

```bash
curl http://localhost:11434/api/tags   # Ollama native
curl http://localhost:11434/v1/models  # OpenAI-compat
```

## Web UI

Open WebUI runs on port **8080** with **auth disabled** (`WEBUI_AUTH=False`). Chat history is shared across all devices that hit the URL.

- Visit `http://localhost:8080` (this Mac) or `http://kevin-macstudio.tail91f148.ts.net:8080` (any tailnet device) — opens straight to the chat, no login.
- The model dropdown in the top-left lets you switch between `gemma-fast` / `gemma-max` / `qwen-uncensored` per chat.
- The Web UI is **not** exposed via Tailscale Funnel — only Caddy on :10000 (the API gateway) is publicly reachable.

To re-enable login, set `WEBUI_AUTH=True` in `~/Library/LaunchAgents/com.openwebui.server.plist` and reload.

## Remote access

Two modes:

### A. Tailnet (preferred — auth-free, full Web UI access)

1. Install Tailscale on the client device:
   - macOS / Windows: [tailscale.com/download](https://tailscale.com/download)
   - iOS / Android: App Store / Play Store
   - Linux: `curl -fsSL https://tailscale.com/install.sh | sh`
2. Sign in with the same Tailscale account that owns this Mac (tailnet hostname: `kevin-macstudio.tail91f148.ts.net`, tailnet IP: `100.106.249.39`).
3. Reach all services:
   - API: `http://kevin-macstudio.tail91f148.ts.net:11434`
   - Web UI: `http://kevin-macstudio.tail91f148.ts.net:8080`
   - Docs: `http://kevin-macstudio.tail91f148.ts.net:8081`

### B. Public Internet (HTTPS + API key — for clients that can't run Tailscale)

The Mac runs **Tailscale Funnel**, which proxies port 443 of `kevin-macstudio.tail91f148.ts.net` from the public internet to a local Caddy reverse proxy on port 10000. Caddy enforces `Authorization: Bearer <key>` on `/v1/*` and `/api/*`, and serves the docs at `/` without auth.

1. Get the API key from the Mac: `cat /Users/kevin/.config/llm-api-key`
2. Use it on every API request:

   ```bash
   curl https://kevin-macstudio.tail91f148.ts.net/v1/models \
     -H "Authorization: Bearer YOUR_KEY_HERE"
   ```

3. The Mac must be powered on and online. Tailscale Funnel registers the public DNS automatically.
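Before wiring the public endpoint into a client, a quick end-to-end check is useful. A sketch, assuming the key is exported as `LLM_API_KEY` and the models list follows the standard OpenAI `{"data": [{"id": ...}]}` shape:

```python
# Sketch: verify public HTTPS access end to end by listing models through Caddy.
# Assumes the key from /Users/kevin/.config/llm-api-key is exported as LLM_API_KEY.
import os
import requests

key = os.environ["LLM_API_KEY"]
r = requests.get(
    "https://kevin-macstudio.tail91f148.ts.net/v1/models",
    headers={"Authorization": f"Bearer {key}"},
    timeout=30,
)
r.raise_for_status()  # 401 means a wrong or rotated key; connection errors mean the Mac is offline
print([m["id"] for m in r.json()["data"]])  # expect gemma-fast, gemma-max, qwen-uncensored
```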
**Public mode does not expose** the Web UI (8080) or the no-auth tailnet docs server (8081). Only Caddy (which fronts Ollama) is reachable from the public internet.

### Rotating the API key

```bash
# generate new
openssl rand -hex 32 > ~/.config/llm-api-key
chmod 600 ~/.config/llm-api-key

# update the Caddyfile (replace the old hex string in /Users/kevin/.config/caddy/Caddyfile)

launchctl unload ~/Library/LaunchAgents/com.caddy.llmproxy.plist
launchctl load ~/Library/LaunchAgents/com.caddy.llmproxy.plist
```

## Service management

All services run as user-level LaunchAgents and auto-start on login.

### Plist locations

- `/Users/kevin/Library/LaunchAgents/com.ollama.server.plist` — Ollama on :11434
- `/Users/kevin/Library/LaunchAgents/com.openwebui.server.plist` — Open WebUI on :8080
- `/Users/kevin/Library/LaunchAgents/com.llmdocs.server.plist` — tailnet-only docs on :8081
- `/Users/kevin/Library/LaunchAgents/com.caddy.llmproxy.plist` — Caddy reverse proxy on :10000 (gateway for the Funnel-exposed public API + docs)

### Tailscale Funnel

Funnel exposes Caddy (:10000) to the public internet at `https://kevin-macstudio.tail91f148.ts.net`. Manage with:

```bash
tailscale funnel status      # show current config
tailscale funnel --bg 10000  # re-enable if disabled
tailscale funnel reset       # disable Funnel entirely
```

### Restart

```bash
launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

launchctl unload ~/Library/LaunchAgents/com.openwebui.server.plist
launchctl load ~/Library/LaunchAgents/com.openwebui.server.plist
```

### Logs

- Ollama: `~/Library/Logs/ollama.{out,err}.log`
- Open WebUI: `~/Library/Logs/openwebui.{out,err}.log`
- Tailnet docs server: `~/Library/Logs/llmdocs.{out,err}.log`
- Caddy: `~/Library/Logs/caddy.{out,err}.log`, plus a JSON access log at `~/Library/Logs/caddy.access.log`
- Funnel watchdog: `~/Library/Logs/funnel-watchdog.log`
- caffeinate: `~/Library/Logs/caffeinate.{out,err}.log`

### Sleep prevention (Mac always-on)

The Mac is configured to never sleep, so Funnel and all services stay reachable 24/7:

- **Kernel-level**: `pmset disablesleep 1` (verified via `ioreg -l -k IOPMrootDomain | grep SleepDisabled` → Yes)
- **User-level**: `caffeinate -i -m -s` running as a LaunchAgent
- **Auto-recovery**: `funnel-watchdog` re-enables Tailscale Funnel every 5 minutes if it ever drops

Inspect with:

```bash
pmset -g | grep -E "sleep|tcpkeepalive|womp"     # current power settings
pmset -g assertions                              # what's preventing sleep
ioreg -l -k IOPMrootDomain | grep SleepDisabled  # kernel-level confirmation
```

To re-enable sleep (if ever needed): `sudo pmset -a disablesleep 0`

### Updating these docs

The source file lives at `/Users/kevin/llm-server.md` and a copy is served from `/Users/kevin/llm-docs/llm-server.md`. After editing the source, sync with:

```bash
cp /Users/kevin/llm-server.md /Users/kevin/llm-docs/llm-server.md
```

No restart needed — the static server reads the file on every request.
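After restarting services (see the Restart section above), it can be handy to confirm everything came back before digging into logs. A small check sketch, assuming the ports and paths listed in this document:

```python
# Sketch: ping each local service after a restart. Ports and paths are the
# ones documented above; a non-200 or connection error points at the matching
# LaunchAgent and its log file.
import requests

CHECKS = {
    "Ollama API": "http://localhost:11434/api/tags",
    "Open WebUI": "http://localhost:8080/",
    "Docs server": "http://localhost:8081/llm-server.md",
}

for name, url in CHECKS.items():
    try:
        code = requests.get(url, timeout=5).status_code
        print(f"{name:12s} {'OK' if code == 200 else f'HTTP {code}'}")
    except requests.RequestException as exc:
        print(f"{name:12s} DOWN ({exc})")
```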
### Common ops

```bash
ollama list            # installed models
ollama ps              # currently loaded models + memory usage
ollama run gemma-fast  # interactive REPL in terminal
ollama pull <model>    # download a new model
ollama rm <model>      # delete a model

tailscale status       # tailnet device list
tailscale ip -4        # this Mac's tailnet IP
```

### Tuning knobs (in the Ollama plist's `EnvironmentVariables`)

- `OLLAMA_HOST=0.0.0.0:11434` — bind all interfaces (Tailscale + localhost)
- `OLLAMA_KEEP_ALIVE=30m` — unload a model from VRAM after 30 min idle
- `OLLAMA_MAX_LOADED_MODELS=2` — hold up to 2 models in memory simultaneously
- `OLLAMA_FLASH_ATTENTION=1` — Metal flash attention
- `OLLAMA_KV_CACHE_TYPE=q8_0` — quantized KV cache to save VRAM
- `OLLAMA_NEW_ENGINE=1` — native Go engine (required for Gemma 4 / Qwen 3.5)

## Storage

- **Model blobs**: `~/.ollama/models/blobs` (~82GB)
- **Open WebUI data** (users, chats, prompts): `~/.open-webui` (SQLite DB, configs, RAG indexes)

## Troubleshooting

### `unable to load model: ... unknown model architecture: 'gemma4'`

Ollama's bundled engine is too old. Upgrade (`brew upgrade ollama`), then reload its LaunchAgent. Custom GGUFs from Hugging Face that use the `gemma4` / `qwen35moe` architectures only work with `OLLAMA_NEW_ENGINE=1` plus official-library models or a recent enough llama.cpp.

### Model loaded but `content` is empty

You hit `max_tokens` while still inside the reasoning trace. Either raise `max_tokens` to 500+ or set `"reasoning_effort": "none"`.

### Web UI shows "no models"

Open WebUI couldn't reach Ollama. Check the `OLLAMA_BASE_URL` env var in the Open WebUI plist (default `http://localhost:11434`), then restart Open WebUI.

### Service won't start after plist edits

Validate the plist:

```bash
plutil -lint ~/Library/LaunchAgents/com.ollama.server.plist
```

Then unload + load. Errors land in the corresponding `~/Library/Logs/<service>.err.log`.

### Out of memory / slow

- `ollama ps` shows loaded models. Lower `OLLAMA_MAX_LOADED_MODELS` to 1 if running both gemma-max and qwen-uncensored simultaneously becomes painful.
- Activity Monitor → GPU History shows Metal usage.

## Adding a new model

```bash
# Pull from the Ollama library:
ollama pull gemma4:e2b-it-q4_K_M

# Pull from Hugging Face (only works for architectures the running Ollama supports):
ollama pull hf.co/USER/REPO:QUANT_TAG

# Create a short alias (no extra disk; refers to the same blobs):
ollama cp gemma4:e2b-it-q4_K_M gemma-tiny
```
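Once a model or alias is pulled or created, it should be visible through the API right away. A quick verification sketch using the OpenAI SDK over the tailnet; `gemma-tiny` here is just the example alias from the block above, substitute your own:

```python
# Sketch: confirm a freshly pulled/aliased model is listed and answers.
# "gemma-tiny" is the example alias created above; replace with your model name.
from openai import OpenAI

client = OpenAI(
    base_url="http://kevin-macstudio.tail91f148.ts.net:11434/v1",
    api_key="ollama",  # dummy; tailnet access needs no real key
)

print([m.id for m in client.models.list().data])  # the new alias should appear here

reply = client.chat.completions.create(
    model="gemma-tiny",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=300,  # generous, since thinking mode may spend tokens before answering
)
print(reply.choices[0].message.content)
```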