
Mac Studio#

What Was Established#

Mac Studio M1 Max 64GB purchased 2026-04-17 for $2,299 (used/refurbished market). Deployed 2026-04-24 as the primary AI inference node for the homelab, replacing the Pavilion as the fast interactive reasoning machine.

Hardware#

Detail              Value
Model               Mac Studio (2022, M1 Max)
Hostname            Legolas
IP                  192.168.1.45
Memory              64GB Unified Memory
Memory bandwidth    ~400 GB/s
Purchase price      $2,299 (used)
Status              Deployed, operational 2026-04-24

Why This Hardware#

64GB unified memory is the key spec for LLM inference on Apple Silicon. Decode speed is memory-bound: every generated token streams the model’s active weights from memory, so bandwidth sets the ceiling on tokens/second. The M1 Max at ~400 GB/s delivers ~25+ t/s for 27-31B models versus the Pavilion’s ~15 t/s CPU-only for E4B. At the time of purchase, a new 64GB Mac Studio/Mac Mini from Apple was backordered into late June; the used M1 Max at $2,299 was judged a reasonable buy given the scarcity.
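
A back-of-envelope check of that ceiling, assuming a ~31B model quantized to roughly 4 bits/weight, i.e. about 16 GB of weights read per generated token (both figures are assumptions, not measurements):

echo "scale=1; 400 / 16" | bc   # ~25.0 t/s upper bound at ~400 GB/s

The observed ~25 t/s sits right at that bound, which is what a memory-bound decode should look like.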

Role#

  • Primary LLM inference: all wiki pipeline LLM calls route to Legolas via Ollama at port 11434
  • Multiple simultaneous models: 64GB allows multiple models resident at once with no reload wait (see the keep_alive sketch after this list)
  • Monitoring pipeline: will take over final synthesis call from Pavilion (not yet migrated)
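
A minimal sketch of keeping a model resident between calls; keep_alive is a standard Ollama generate option, and the 24h value here is illustrative:

curl http://192.168.1.45:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Reply OK","stream":false,"keep_alive":"24h"}'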

Models (active as of 2026-04-26)#

Model                           Use
gemma4:e2b                      Text cleaning / Markdown cleanup (step 1 of wiki pipeline)
qwen3.6:35b-a3b-coding-nvfp4    JSON crystallization + wiki page generation (step 2)
minicpm-v:8b                    PDF visual pages + image OCR (multimodal)
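
For the multimodal tier, a hedged sketch of a single OCR call: page-01.png is a hypothetical rendered PDF page, and Ollama’s /api/generate accepts base64-encoded image data in an images array.

IMG=$(base64 -w0 page-01.png)   # GNU syntax; on macOS use: base64 -i page-01.png
curl http://192.168.1.45:11434/api/generate \
  -d "{\"model\":\"minicpm-v:8b\",\"prompt\":\"Transcribe the text on this page.\",\"images\":[\"$IMG\"],\"stream\":false}"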

Workflow Split#

  • Legolas (Mac Studio): real-time inference for wiki pipeline, all three model tiers (see the routing sketch after this list)
  • Pavilion (nk-celebrimbor): long-context batch jobs, nomic-embed-text embeddings, scheduled Gemma pipeline
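
A minimal routing sketch, assuming jobs select a node via the standard OLLAMA_HOST environment variable (only Legolas is shown, since the Pavilion's address isn't listed on this page):

export OLLAMA_HOST=192.168.1.45:11434   # pin this shell's ollama calls to Legolas
ollama run gemma4:e2b "Reply OK"        # served by the Mac Studio, not a local daemon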

MLX Notes#

As of April 2026, MLX support for Gemma 4 31B is rough: mlx-community 4-bit models fail to load, the LM Studio MLX backend doesn’t support Gemma 4, and chat templates need manual handling. Ollama is the safer path initially.

Known Issues#

qwen3.6:35b Ollama freeze (macOS)#

Symptom: heartbeat stops; gemma4:e2b responds fine; a curl to http://192.168.1.45:11434/api/generate with qwen3.6:35b returns nothing (silent hang, not an error); ollama ps still shows the model as loaded.
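
A bounded probe makes the hang detectable from a script; --max-time is standard curl, and the 30-second budget is an assumption:

curl --max-time 30 http://192.168.1.45:11434/api/generate \
  -d '{"model":"qwen3.6:35b-a3b-coding-nvfp4","prompt":"Reply OK","stream":false}' \
  || echo "no reply within 30s: restart Ollama"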

Cause: Ollama process gets into a bad state on macOS — model appears loaded but accepts no new requests.

Fix (macOS — NOT systemctl, Legolas is not Linux):

pkill -f ollama
# or via brew:
brew services restart ollama

After restart, verify:

curl http://192.168.1.45:11434/api/generate \
  -d '{"model":"qwen3.6:35b-a3b-coding-nvfp4","prompt":"Reply OK","stream":false}'

Then clear stale lock files on wiki-llm before relaunching:

rm -f /opt/wiki/work/raw/**/*.ingest.lock
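
Note: the ** glob needs recursive globbing (zsh expands it by default; bash needs shopt -s globstar). If the shell on wiki-llm doesn’t, an equivalent find:

find /opt/wiki/work/raw -name '*.ingest.lock' -delete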

10s settle delay after consecutive vision calls#

After consecutive minicpm-v:8b requests, gemma4:e2b’s stream closes without done=true. Root cause not diagnosed. Workaround: a 10s sleep between the final minicpm-v call and the gemma call in convert_to_md.py (shape sketched below).
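
The shape of that workaround, sketched in shell; convert_to_md.py does the equivalent in Python, and the helper names here are hypothetical:

run_vision_pass     # final minicpm-v:8b request
sleep 10            # settle delay; without it the next gemma4:e2b stream can close early
run_gemma_cleanup   # gemma4:e2b cleanup request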

Related#

  • AI Infrastructure Overview
  • Pavilion (AI PC) Configuration
  • AI-Driven Monitoring Pipeline
  • Wiki Pipeline Scripts

Sources#

  • Homelab AI - 2026-04-17 · ingested/chats/Homelab-AI---2026-04-17.md
  • Work wiki session crystallization 2026-04-26 · wiki/ai/wiki-pipeline-scripts.md