
Mac Studio#

What Was Established#

Mac Studio M1 Max 64GB purchased 2026-04-17 for $2,299 (used/refurbished market). Deployed 2026-04-24 as the primary AI inference node for the homelab, replacing the Pavilion as the fast interactive reasoning machine.

Hardware#

Detail              Value
Model               Mac Studio (2022, M1 Max)
Hostname            Legolas
IP                  192.168.1.45
Memory              64GB Unified Memory
Memory bandwidth    ~400 GB/s
Purchase price      $2,299 (used)
Status              Deployed, operational 2026-04-24

Why This Hardware#

64GB unified memory is the key spec for LLM inference on Apple Silicon. Decode speed is memory-bound: every generated token streams the model’s active weights from memory, so bandwidth sets the ceiling on tokens/second. The M1 Max at ~400 GB/s delivers ~25+ t/s for 27-31B models versus the Pavilion’s ~15 t/s CPU-only for E4B. At the time of purchase, a new 64GB Mac Studio/Mac Mini from Apple was backordered into late June; the used M1 Max at $2,299 was judged a reasonable buy given the scarcity.
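
A back-of-envelope check of that ceiling, assuming a ~31B model quantized to roughly 4 bits/weight, i.e. about 16 GB of weights read per generated token (both figures are assumptions, not measurements):

echo "scale=1; 400 / 16" | bc   # ~25.0 t/s upper bound at ~400 GB/s

The observed ~25 t/s sits right at that bound, which is what a memory-bound decode should look like.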

Role#

  • Primary LLM inference: all wiki pipeline LLM calls route to Legolas via Ollama at port 11434
  • Multiple simultaneous models: 64GB allows multiple models resident at once with no reload wait (see the keep_alive sketch after this list)
  • Monitoring pipeline: will take over final synthesis call from Pavilion (not yet migrated)
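
A minimal sketch of keeping a model resident between calls; keep_alive is a standard Ollama generate option, and the 24h value here is illustrative:

curl http://192.168.1.45:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Reply OK","stream":false,"keep_alive":"24h"}'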

Models (active as of 2026-04-26)#

Model                           Use
gemma4:e2b                      Text cleaning / Markdown cleanup (step 1 of wiki pipeline)
qwen3.6:35b-a3b-coding-nvfp4    JSON crystallization + wiki page generation (step 2)
minicpm-v:8b                    PDF visual pages + image OCR (multimodal)
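
For the multimodal tier, a hedged sketch of a single OCR call: page-01.png is a hypothetical rendered PDF page, and Ollama’s /api/generate accepts base64-encoded image data in an images array.

IMG=$(base64 -w0 page-01.png)   # GNU syntax; on macOS use: base64 -i page-01.png
curl http://192.168.1.45:11434/api/generate \
  -d "{\"model\":\"minicpm-v:8b\",\"prompt\":\"Transcribe the text on this page.\",\"images\":[\"$IMG\"],\"stream\":false}"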

Workflow Split#

  • Legolas (Mac Studio): real-time inference for wiki pipeline, all three model tiers (see the routing sketch after this list)
  • Pavilion (nk-celebrimbor): long-context batch jobs, nomic-embed-text embeddings, scheduled Gemma pipeline
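
A minimal routing sketch, assuming jobs select a node via the standard OLLAMA_HOST environment variable (only Legolas is shown, since the Pavilion's address isn't listed on this page):

export OLLAMA_HOST=192.168.1.45:11434   # pin this shell's ollama calls to Legolas
ollama run gemma4:e2b "Reply OK"        # served by the Mac Studio, not a local daemon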

MLX Notes#

As of April 2026, MLX support for Gemma 4 31B is rough: mlx-community 4-bit models fail to load, the LM Studio MLX backend doesn’t support Gemma 4, and chat templates need manual handling. Ollama is the safer path initially.

Known Issues#

qwen3.6:35b Ollama freeze (macOS)#

Symptom: heartbeat stops; gemma4:e2b responds fine; a curl to http://192.168.1.45:11434/api/generate with qwen3.6:35b returns nothing (silent hang, not an error); ollama ps still shows the model as loaded.
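
A bounded probe makes the hang detectable from a script; --max-time is standard curl, and the 30-second budget is an assumption:

curl --max-time 30 http://192.168.1.45:11434/api/generate \
  -d '{"model":"qwen3.6:35b-a3b-coding-nvfp4","prompt":"Reply OK","stream":false}' \
  || echo "no reply within 30s: restart Ollama"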

Cause: Ollama process gets into a bad state on macOS — model appears loaded but accepts no new requests.

Fix (macOS — NOT systemctl, Legolas is not Linux):

pkill -f ollama
# or via brew:
brew services restart ollama

After restart, verify:

curl http://192.168.1.45:11434/api/generate \
  -d '{"model":"qwen3.6:35b-a3b-coding-nvfp4","prompt":"Reply OK","stream":false}'

Then clear stale lock files on wiki-llm before relaunching:

rm -f /opt/wiki/work/raw/**/*.ingest.lock
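
Note: the ** glob needs recursive globbing (zsh expands it by default; bash needs shopt -s globstar). If the shell on wiki-llm doesn’t, an equivalent find:

find /opt/wiki/work/raw -name '*.ingest.lock' -delete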

10s settle delay after consecutive vision calls#

After consecutive minicpm-v:8b requests, gemma4:e2b’s stream closes without done=true. Root cause not diagnosed. Workaround: a 10s sleep between the final minicpm-v call and the gemma call in convert_to_md.py (shape sketched below).
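
The shape of that workaround, sketched in shell; convert_to_md.py does the equivalent in Python, and the helper names here are hypothetical:

run_vision_pass     # final minicpm-v:8b request
sleep 10            # settle delay; without it the next gemma4:e2b stream can close early
run_gemma_cleanup   # gemma4:e2b cleanup request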

Related#

  • AI Infrastructure Overview
  • Pavilion (AI PC) Configuration
  • AI-Driven Monitoring Pipeline
  • Wiki Pipeline Scripts

Sources#

  • Homelab AI - 2026-04-17 · ingested/chats/Homelab-AI---2026-04-17.md
  • Work wiki session crystallization 2026-04-26 · wiki/ai/wiki-pipeline-scripts.md