# Local LLM Server — Reference

Personal Ollama-based LLM server running on this Mac Studio (M4 Max, 128GB unified memory). This document is the single source of truth for Claude Code agents that need to query, manage, or troubleshoot the server.

## TL;DR

There are two ways to reach the API: **Tailnet** (auth-free, requires being on the user's tailnet) and **Public Internet** (HTTPS, requires API key).

| What | Tailnet (no key) | Public (HTTPS, key required) |
|---|---|---|
| API base | `http://kevin-macstudio.tail91f148.ts.net:11434/v1` | `https://kevin-macstudio.tail91f148.ts.net/v1` |
| Models list | `http://...:11434/api/tags` | `https://kevin-macstudio.tail91f148.ts.net/api/tags` |
| Web UI | `http://kevin-macstudio.tail91f148.ts.net:8080` | (not exposed publicly) |
| Docs (this file) | `http://kevin-macstudio.tail91f148.ts.net:8081/` | `https://kevin-macstudio.tail91f148.ts.net/` |
| Raw markdown | `http://...:8081/llm-server.md` | `https://kevin-macstudio.tail91f148.ts.net/llm-server.md` |

- **Models**: `gemma-fast`, `gemma-max`, `qwen-uncensored`
- **Tailnet auth**: none (Tailscale ACL is the boundary). Pass `api_key="ollama"` as a dummy when SDKs require one.
- **Public auth**: send `Authorization: Bearer <key>` on every `/v1/*` and `/api/*` request. Docs (`/`, `/llm-server.md`) are open. The key is stored on the Mac at `/Users/kevin/.config/llm-api-key` (chmod 600). Treat it like a secret — anyone with it can use the GPU until the key is rotated.

## Models

| Alias | Underlying model | Size | When to use |
|---|---|---|---|
| `gemma-fast` | `gemma4:26b-a4b-it-q8_0` (MoE 26B / 4B active) | 28GB | Default. Fast responses, good general quality. |
| `gemma-max` | `gemma4:31b-it-q8_0` (dense 31B) | 33GB | Complex reasoning, coding, longer-context tasks. Slower. |
| `qwen-uncensored` | `tripolskypetr/qwen3.5-uncensored-aggressive:35b` (Q4_K_M, 34.7B) | 21GB | Refusals removed (0/465). Use when the censored models block a legitimate request. |

All three have **thinking mode enabled by default** — the model spends tokens reasoning before answering. The OpenAI-compatible response splits this into:

- `choices[0].message.content` → the final answer
- `choices[0].message.reasoning` → the chain-of-thought

To get a fast, direct answer, set `"reasoning_effort": "none"` (or `"low"`) in the request body, and budget `max_tokens` generously (300+) when reasoning is on.
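For example, a minimal sketch (assuming the `requests` package and the response layout described above) that captures both the final answer and the reasoning trace over the tailnet endpoint:

```python
# Sketch: read both the answer and the chain-of-thought from the tailnet
# endpoint (no key needed). Field names follow the layout documented above;
# the "reasoning" field is read best-effort and may be absent.
import requests

resp = requests.post(
    "http://kevin-macstudio.tail91f148.ts.net:11434/v1/chat/completions",
    json={
        "model": "gemma-fast",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 800,  # generous budget so the trace doesn't eat the answer
    },
    timeout=120,
)
resp.raise_for_status()
msg = resp.json()["choices"][0]["message"]
print("ANSWER:\n", msg["content"])
print("REASONING:\n", msg.get("reasoning"))  # None if the model skipped reasoning
```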
### Recommended sampling params for `qwen-uncensored`

```json
{"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}
```

## API usage

### curl — Tailnet (no key)

```bash
curl http://kevin-macstudio.tail91f148.ts.net:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-fast",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 500,
    "reasoning_effort": "low"
  }'
```

### curl — Public Internet (with key)

```bash
curl https://kevin-macstudio.tail91f148.ts.net/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-fast",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 500,
    "reasoning_effort": "low"
  }'
```

### Python (OpenAI SDK)

```python
import os
from openai import OpenAI

# Tailnet (no key):
client = OpenAI(
    base_url="http://kevin-macstudio.tail91f148.ts.net:11434/v1",
    api_key="ollama",  # dummy
)

# OR Public (with key):
client = OpenAI(
    base_url="https://kevin-macstudio.tail91f148.ts.net/v1",
    api_key=os.environ["LLM_API_KEY"],
)

resp = client.chat.completions.create(
    model="gemma-max",
    messages=[{"role": "user", "content": "Write a Python fibonacci generator."}],
    max_tokens=2000,
    reasoning_effort="low",
)
print(resp.choices[0].message.content)
```

### Streaming

Add `"stream": true` to the curl body, or `stream=True` to the Python call. Standard OpenAI SSE format.

### Listing models

```bash
curl http://localhost:11434/api/tags   # Ollama native
curl http://localhost:11434/v1/models  # OpenAI-compat
```

## Web UI

Open WebUI runs on port **8080** with **auth disabled** (`WEBUI_AUTH=False`). Chat history is shared across all devices that hit the URL.

- Visit `http://localhost:8080` (this Mac) or `http://kevin-macstudio.tail91f148.ts.net:8080` (any tailnet device) — opens straight to the chat, no login.
- The model dropdown in the top-left lets you switch between `gemma-fast` / `gemma-max` / `qwen-uncensored` per chat.
- The Web UI is **not** exposed via Tailscale Funnel — only Caddy on :10000 (the API gateway) is publicly reachable.

To re-enable login, set `WEBUI_AUTH=True` in `~/Library/LaunchAgents/com.openwebui.server.plist` and reload.

## Remote access

Two modes:

### A. Tailnet (preferred — auth-free, full Web UI access)

1. Install Tailscale on the client device:
   - macOS / Windows: [tailscale.com/download](https://tailscale.com/download)
   - iOS / Android: App Store / Play Store
   - Linux: `curl -fsSL https://tailscale.com/install.sh | sh`
2. Sign in with the same Tailscale account that owns this Mac (tailnet hostname: `kevin-macstudio.tail91f148.ts.net`, tailnet IP: `100.106.249.39`).
3. Reach all services:
   - API: `http://kevin-macstudio.tail91f148.ts.net:11434`
   - Web UI: `http://kevin-macstudio.tail91f148.ts.net:8080`
   - Docs: `http://kevin-macstudio.tail91f148.ts.net:8081`

### B. Public Internet (HTTPS + API key — for clients that can't run Tailscale)

The Mac runs **Tailscale Funnel**, which proxies port 443 of `kevin-macstudio.tail91f148.ts.net` from the public internet to a local Caddy reverse proxy on port 10000. Caddy enforces `Authorization: Bearer <key>` on `/v1/*` and `/api/*`, and serves the docs at `/` without auth.

1. Get the API key from the Mac: `cat /Users/kevin/.config/llm-api-key`
2. Use it on every API request:

   ```bash
   curl https://kevin-macstudio.tail91f148.ts.net/v1/models \
     -H "Authorization: Bearer YOUR_KEY_HERE"
   ```

3. The Mac must be powered on and online. Tailscale Funnel registers the public DNS automatically.
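Before wiring the public endpoint into a client, a quick end-to-end check is useful. A sketch, assuming the key is exported as `LLM_API_KEY` and the models list follows the standard OpenAI `{"data": [{"id": ...}]}` shape:

```python
# Sketch: verify public HTTPS access end to end by listing models through Caddy.
# Assumes the key from /Users/kevin/.config/llm-api-key is exported as LLM_API_KEY.
import os
import requests

key = os.environ["LLM_API_KEY"]
r = requests.get(
    "https://kevin-macstudio.tail91f148.ts.net/v1/models",
    headers={"Authorization": f"Bearer {key}"},
    timeout=30,
)
r.raise_for_status()  # 401 means a wrong or rotated key; connection errors mean the Mac is offline
print([m["id"] for m in r.json()["data"]])  # expect gemma-fast, gemma-max, qwen-uncensored
```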
**Public mode does not expose** the Web UI (8080) or the no-auth tailnet docs server (8081). Only Caddy (which fronts Ollama) is reachable from the public internet.

### Rotating the API key

```bash
# generate new
openssl rand -hex 32 > ~/.config/llm-api-key
chmod 600 ~/.config/llm-api-key

# update the Caddyfile (replace the old hex string in /Users/kevin/.config/caddy/Caddyfile)

launchctl unload ~/Library/LaunchAgents/com.caddy.llmproxy.plist
launchctl load ~/Library/LaunchAgents/com.caddy.llmproxy.plist
```

## Service management

All services run as user-level LaunchAgents and auto-start on login.

### Plist locations

- `/Users/kevin/Library/LaunchAgents/com.ollama.server.plist` — Ollama on :11434
- `/Users/kevin/Library/LaunchAgents/com.openwebui.server.plist` — Open WebUI on :8080
- `/Users/kevin/Library/LaunchAgents/com.llmdocs.server.plist` — tailnet-only docs on :8081
- `/Users/kevin/Library/LaunchAgents/com.caddy.llmproxy.plist` — Caddy reverse proxy on :10000 (gateway for the Funnel-exposed public API + docs)

### Tailscale Funnel

Funnel exposes Caddy (:10000) to the public internet at `https://kevin-macstudio.tail91f148.ts.net`. Manage with:

```bash
tailscale funnel status      # show current config
tailscale funnel --bg 10000  # re-enable if disabled
tailscale funnel reset       # disable Funnel entirely
```

### Restart

```bash
launchctl unload ~/Library/LaunchAgents/com.ollama.server.plist
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

launchctl unload ~/Library/LaunchAgents/com.openwebui.server.plist
launchctl load ~/Library/LaunchAgents/com.openwebui.server.plist
```

### Logs

- Ollama: `~/Library/Logs/ollama.{out,err}.log`
- Open WebUI: `~/Library/Logs/openwebui.{out,err}.log`
- Tailnet docs server: `~/Library/Logs/llmdocs.{out,err}.log`
- Caddy: `~/Library/Logs/caddy.{out,err}.log`, plus a JSON access log at `~/Library/Logs/caddy.access.log`
- Funnel watchdog: `~/Library/Logs/funnel-watchdog.log`
- caffeinate: `~/Library/Logs/caffeinate.{out,err}.log`

### Sleep prevention (Mac always-on)

The Mac is configured to never sleep, so Funnel and all services stay reachable 24/7:

- **Kernel-level**: `pmset disablesleep 1` (verified via `ioreg -l -k IOPMrootDomain | grep SleepDisabled` → Yes)
- **User-level**: `caffeinate -i -m -s` running as a LaunchAgent
- **Auto-recovery**: `funnel-watchdog` re-enables Tailscale Funnel every 5 minutes if it ever drops

Inspect with:

```bash
pmset -g | grep -E "sleep|tcpkeepalive|womp"     # current power settings
pmset -g assertions                              # what's preventing sleep
ioreg -l -k IOPMrootDomain | grep SleepDisabled  # kernel-level confirmation
```

To re-enable sleep (if ever needed): `sudo pmset -a disablesleep 0`

### Updating these docs

The source file lives at `/Users/kevin/llm-server.md` and a copy is served from `/Users/kevin/llm-docs/llm-server.md`. After editing the source, sync with:

```bash
cp /Users/kevin/llm-server.md /Users/kevin/llm-docs/llm-server.md
```

No restart needed — the static server reads the file on every request.
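After restarting services (see the Restart section above), it can be handy to confirm everything came back before digging into logs. A small check sketch, assuming the ports and paths listed in this document:

```python
# Sketch: ping each local service after a restart. Ports and paths are the
# ones documented above; a non-200 or connection error points at the matching
# LaunchAgent and its log file.
import requests

CHECKS = {
    "Ollama API": "http://localhost:11434/api/tags",
    "Open WebUI": "http://localhost:8080/",
    "Docs server": "http://localhost:8081/llm-server.md",
}

for name, url in CHECKS.items():
    try:
        code = requests.get(url, timeout=5).status_code
        print(f"{name:12s} {'OK' if code == 200 else f'HTTP {code}'}")
    except requests.RequestException as exc:
        print(f"{name:12s} DOWN ({exc})")
```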
### Common ops

```bash
ollama list            # installed models
ollama ps              # currently loaded models + memory usage
ollama run gemma-fast  # interactive REPL in terminal
ollama pull <model>    # download a new model
ollama rm <model>      # delete a model

tailscale status       # tailnet device list
tailscale ip -4        # this Mac's tailnet IP
```

### Tuning knobs (in the Ollama plist's `EnvironmentVariables`)

- `OLLAMA_HOST=0.0.0.0:11434` — bind all interfaces (Tailscale + localhost)
- `OLLAMA_KEEP_ALIVE=30m` — unload a model from VRAM after 30 min idle
- `OLLAMA_MAX_LOADED_MODELS=2` — hold up to 2 models in memory simultaneously
- `OLLAMA_FLASH_ATTENTION=1` — Metal flash attention
- `OLLAMA_KV_CACHE_TYPE=q8_0` — quantized KV cache to save VRAM
- `OLLAMA_NEW_ENGINE=1` — native Go engine (required for Gemma 4 / Qwen 3.5)

## Storage

- **Model blobs**: `~/.ollama/models/blobs` (~82GB)
- **Open WebUI data** (users, chats, prompts): `~/.open-webui` (SQLite DB, configs, RAG indexes)

## Troubleshooting

### `unable to load model: ... unknown model architecture: 'gemma4'`

Ollama's bundled engine is too old. Upgrade (`brew upgrade ollama`), then reload its LaunchAgent. Custom GGUFs from Hugging Face that use the `gemma4` / `qwen35moe` architectures only work with `OLLAMA_NEW_ENGINE=1` plus official-library models or a recent enough llama.cpp.

### Model loaded but `content` is empty

You hit `max_tokens` while still inside the reasoning trace. Either raise `max_tokens` to 500+ or set `"reasoning_effort": "none"`.

### Web UI shows "no models"

Open WebUI couldn't reach Ollama. Check the `OLLAMA_BASE_URL` env var in the Open WebUI plist (default `http://localhost:11434`), then restart Open WebUI.

### Service won't start after plist edits

Validate the plist:

```bash
plutil -lint ~/Library/LaunchAgents/com.ollama.server.plist
```

Then unload + load. Errors land in the corresponding `~/Library/Logs/<service>.err.log`.

### Out of memory / slow

- `ollama ps` shows loaded models. Lower `OLLAMA_MAX_LOADED_MODELS` to 1 if running both gemma-max and qwen-uncensored simultaneously becomes painful.
- Activity Monitor → GPU History shows Metal usage.

## Adding a new model

```bash
# Pull from the Ollama library:
ollama pull gemma4:e2b-it-q4_K_M

# Pull from Hugging Face (only works for architectures the running Ollama supports):
ollama pull hf.co/USER/REPO:QUANT_TAG

# Create a short alias (no extra disk; refers to the same blobs):
ollama cp gemma4:e2b-it-q4_K_M gemma-tiny
```
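Once a model or alias is pulled or created, it should be visible through the API right away. A quick verification sketch using the OpenAI SDK over the tailnet; `gemma-tiny` here is just the example alias from the block above, substitute your own:

```python
# Sketch: confirm a freshly pulled/aliased model is listed and answers.
# "gemma-tiny" is the example alias created above; replace with your model name.
from openai import OpenAI

client = OpenAI(
    base_url="http://kevin-macstudio.tail91f148.ts.net:11434/v1",
    api_key="ollama",  # dummy; tailnet access needs no real key
)

print([m.id for m in client.models.list().data])  # the new alias should appear here

reply = client.chat.completions.create(
    model="gemma-tiny",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=300,  # generous, since thinking mode may spend tokens before answering
)
print(reply.choices[0].message.content)
```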