Wiki Pipeline Scripts#
What Was Established#
Eight Python scripts in /opt/wiki/homelab/scripts/ implement the full wiki pipeline: file conversion, document ingestion, conversation crystallization (standard, DeepSeek, and Claude formats), shared LLM infrastructure, wiki health-checking, and knowledge-graph integration. All scripts were ported from the work wiki pipeline (itself developed 2026-04-21 → 2026-04-26) with homelab-specific infrastructure baked in.
crystallize.py (Claude format) uses a two-step LLM approach: gemma4:e2b cleans, qwen3.6:35b crystallizes. crystallize_deepseek.py skips gemma — JSON parsing is handled deterministically in Python (load_conversation + _clean_text), so only qwen is needed.
Scripts#
| Script | Purpose |
|---|---|
| `_llm_client.py` | Shared LLM/embedding/pgvector infrastructure — imported by all other scripts |
| `perform_full_integrate.py` | Shared mechanical layer — importable primitives for wiki skills: `parse_page()`, `search_similar()`, `embed_and_store()`, `add_wikilink()`, `delete_embedding()`, `upsert_index_entry()`, `append_log()`. Also a CLI `main()` for the single-page integrate workflow. |
| `lint_all.py` | Wiki health check — 9-step lint: broken wikilinks, orphans, stale index, pgvector sync, orphaned sources/assets, conceptual flags. Produces a dated report in `archive/`. |
| `crystallize.py` | Process Claude Code session transcripts (`## USER` / `## ASSISTANT` format) |
| `crystallize_deepseek.py` | Process DeepSeek export JSON (mapping tree or split-file format) |
| `convert_deepseek_to_md.py` | Convert DeepSeek JSON exports to clean Markdown transcripts |
| `convert_claude_to_md.py` | Convert Claude Code JSON exports (single or multi-conversation) to clean Markdown transcripts |
| `ingest.py` | Process Markdown documents into wiki pages |
| `convert_to_md.py` | Convert raw files (.docx, .pdf, .pptx, images) to Markdown |
Pipeline Architecture#
raw file (.pdf/.docx/.pptx/image)
↓ convert_to_md.py (PyMuPDF + minicpm-v:8b or python-docx)
↓
.md file or Claude JSON
↓ ingest.py / crystallize.py
↓
Step 1: gemma4:e2b → clean/strip noise (Legolas, 192.168.1.45)
Step 2: qwen3.6:35b → crystallize into pages (Legolas, 192.168.1.45)
Step 3: nomic-embed-text → 768-dim vector (Celebrimbor, 192.168.2.192)
Step 4: store in pgvector (PostgreSQL, 192.168.1.57)
↓
source moved to ingested/<subdir>/ (atomic commit — only on full success)
DeepSeek JSON ← different path: no gemma step
↓ crystallize_deepseek.py
↓
Step 1: load_conversation() → parse mapping tree → ## USER/## ASSISTANT text (Python)
Step 2: _clean_text() → emoji strip, whitespace normalise (Python)
Step 3: qwen3.6:35b → crystallize into pages (Legolas, 192.168.1.45)
Step 4: nomic-embed-text → store in pgvector (Celebrimbor / PostgreSQL)
↓
source moved to ingested/chats/

Key Design Decisions#
Streaming with hang detection (_llm_client.py): Uses Ollama streaming API rather than blocking POST. A background thread fires a heartbeat every 30s showing elapsed time and token count. If no tokens arrive within LLM_IDLE_TIMEOUT (default 180s), the stream is aborted and the call retried. This distinguishes a busy model from a stuck one.
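The idle-timeout mechanism can be sketched as follows (a minimal illustration; `stream_with_idle_timeout` and `IdleTimeout` are hypothetical names, not the actual `_llm_client.py` API). A reader thread feeds tokens into a queue, and the consumer aborts once the queue stays empty past the idle window:

```python
import queue
import threading

class IdleTimeout(Exception):
    """Raised when no token arrives within the idle window."""

def stream_with_idle_timeout(token_iter, idle_timeout=180.0):
    """Consume tokens from token_iter, aborting if the gap between
    consecutive tokens exceeds idle_timeout seconds."""
    q = queue.Queue()
    sentinel = object()

    def reader():
        for tok in token_iter:
            q.put(tok)
        q.put(sentinel)

    threading.Thread(target=reader, daemon=True).start()
    tokens = []
    while True:
        try:
            item = q.get(timeout=idle_timeout)
        except queue.Empty:
            # The model is stuck, not merely busy: abort so the caller can retry.
            raise IdleTimeout(f"no token for {idle_timeout}s, aborting stream")
        if item is sentinel:
            return "".join(tokens)
        tokens.append(item)
```

The same queue can also drive the 30s heartbeat thread, which only reports elapsed time and token count without affecting the timeout logic.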
Exponential backoff retry: All LLM, embedding, and pgvector calls retry up to LLM_MAX_RETRIES (default 3) with exponential backoff. On final give-up, the source file is preserved in raw/ and the script exits non-zero — the next run retries automatically.
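A minimal sketch of the retry wrapper (`with_retries` is a hypothetical helper; the real `_llm_client.py` code may differ):

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt and retry,
    re-raising after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # final give-up: caller exits non-zero, source stays in raw/
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` keeps the helper testable without real delays.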
Lockfiles: A <source>.ingest.lock file sits beside each source during processing. A second invocation on the same file aborts with “Lock held”. Stale locks (older than LLM_HARD_TIMEOUT + 600s) are auto-cleared.
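The lockfile behaviour can be sketched as below (names and internals are assumed; only the `.ingest.lock` suffix, the "Lock held" abort, and the `LLM_HARD_TIMEOUT + 600s` stale threshold come from the description above):

```python
import os
import time

STALE_AFTER = 3600 + 600  # LLM_HARD_TIMEOUT + 600s

def acquire_lock(source_path):
    """Create <source>.ingest.lock; abort if a fresh lock already exists,
    auto-clear it if it is older than the stale threshold."""
    lock = source_path + ".ingest.lock"
    if os.path.exists(lock):
        if time.time() - os.path.getmtime(lock) < STALE_AFTER:
            raise SystemExit("Lock held")
        os.remove(lock)  # stale lock from a crashed run
    with open(lock, "w") as fh:
        fh.write(str(os.getpid()))
    return lock
```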
Source preservation on failure (exit code 2): If any LLM call exhausts retries, the source file stays in raw/. Source only moves to ingested/ on full pipeline success.
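A sketch of the atomic-commit rule under the assumptions above (hypothetical `commit_source` helper):

```python
import shutil
import sys
from pathlib import Path

def commit_source(source: Path, ingested_dir: Path, success: bool):
    """Atomic commit: the source leaves raw/ only on full pipeline success;
    on failure the script exits 2 so the next run retries automatically."""
    if not success:
        sys.exit(2)  # source stays in raw/
    ingested_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(ingested_dir / source.name))
```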
pgvector cross-reference before crystallization: Before calling qwen, the pipeline embeds the input and queries pgvector for similar existing pages (threshold 0.4). These are passed to qwen as context so it can decide whether to update an existing page or create a new one.
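The real lookup runs as a pgvector similarity query against PostgreSQL; this pure-Python sketch (hypothetical names) only illustrates the 0.4-threshold filtering and ordering that decide which pages become context for qwen:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_pages(query_vec, pages, threshold=0.4):
    """pages: {path: embedding}. Return [(path, similarity)] above the
    threshold, most similar first; these become context for qwen."""
    hits = [(path, cosine_similarity(query_vec, vec)) for path, vec in pages.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```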
Chunking for long inputs (crystallize.py): Splits at ## USER / ## ASSISTANT boundaries, MAX_CHARS=100000 per chunk. Each chunk goes through the full clean→crystallize pipeline; pages are deduplicated by title at the end.
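The boundary-aware chunking can be sketched as follows (a simplification: a single turn longer than `max_chars` becomes its own oversized chunk):

```python
import re

def chunk_transcript(text, max_chars=100_000):
    """Split a transcript at ## USER / ## ASSISTANT turn boundaries,
    packing whole turns greedily into chunks of at most max_chars."""
    turns = re.split(r'(?=^## (?:USER|ASSISTANT)\b)', text, flags=re.M)
    chunks, current = [], ""
    for turn in turns:
        if current and len(current) + len(turn) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current += turn
    if current:
        chunks.append(current)
    return chunks
```

Splitting on a lookahead keeps the `## USER` / `## ASSISTANT` markers attached to their turns, so no content is lost when the chunks are rejoined.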
Shared Mechanical Layer Pattern#
perform_full_integrate.py establishes a pattern also used by lint_all.py: wiki skills are split into a mechanical layer (Python script — embed, pgvector query, index CRUD, wikilink edits, log appends) and a judgment layer (LLM agent — reading candidate pages, confirming genuine connections, deciding whether to merge or cross-link). The mechanical-layer functions are importable primitives; the LLM agent imports them rather than each skill duplicating its own `requests.post` and `subprocess.run(["psql", ...])` calls.
Key primitives shared across integrate, restructure, and lint:
- `parse_page()` — reads a page, returns `{title, content (body-only), frontmatter}`. Frontmatter is stripped from content before returning — this is the embedding fix (see Wiki System - Architecture).
- `search_similar()` — embed query, pgvector similarity search, returns `[{path, similarity}]`
- `embed_and_store()` — embed + upsert to pgvector via `_llm_client`
- `add_wikilink()` — insert `Target` into a page’s named section
- `upsert_index_entry()` / `append_log()` — index and log management
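A sketch of `parse_page()` consistent with the description above (the actual implementation in perform_full_integrate.py may differ in details); the key point is that frontmatter never reaches the embedded `content`:

```python
import re

def parse_page(path):
    """Read a wiki page; return title, body-only content, and frontmatter.
    Stripping frontmatter from content is the embedding fix."""
    with open(path, encoding="utf-8") as fh:
        raw = fh.read()
    frontmatter, body = "", raw
    m = re.match(r'^---\n(.*?)\n---\n', raw, flags=re.S)
    if m:
        frontmatter, body = m.group(1), raw[m.end():]
    titles = re.findall(r'^# (.+)$', body, flags=re.M)
    title = titles[0] if titles else ""
    return {"title": title, "content": body.strip(), "frontmatter": frontmatter}
```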
Infrastructure#
| Service | Host | Address | Role |
|---|---|---|---|
| gemma4:e2b | Legolas | 192.168.1.45:11434 | Step 1 cleaning |
| qwen3.6:35b-a3b-coding-nvfp4 | Legolas | 192.168.1.45:11434 | Step 2 crystallization |
| minicpm-v:8b | Legolas | 192.168.1.45:11434 | PDF/image OCR |
| nomic-embed-text | Celebrimbor | 192.168.2.192:11434 | 768-dim embeddings |
| PostgreSQL | pgvector LXC | 192.168.1.57:5432 | pgvector storage, db homelab |
convert_to_md.py — Notable Improvements#
Two improvements over the original approach:
- Dropped the `and not has_images` condition: old logic sent pages with embedded images to minicpm-v even when the text layer was perfectly good (e.g. Scribe-generated PDFs with header logos). New condition: use the text layer if `len(text) >= 100`, regardless of images. Only sparse pages go to minicpm-v.
- Conditional 10s settle sleep: the 10s Ollama settle delay after vision calls now only fires when minicpm-v was actually invoked on at least one page. Text-only PDFs no longer pay the unnecessary delay. `extract_from_pdf()` returns a `(text, used_minicpm)` tuple.
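The per-page decision can be sketched as below (`choose_extraction` is a hypothetical helper mirroring the new condition; `run_vision` stands in for the minicpm-v call):

```python
MIN_TEXT_CHARS = 100  # threshold from the new condition

def choose_extraction(page_text, run_vision):
    """Per page: keep the text layer if it is substantial, regardless of
    embedded images; only sparse pages fall back to the vision model.
    Returns (text, used_minicpm) so the caller can skip the settle sleep."""
    if len(page_text) >= MIN_TEXT_CHARS:
        return page_text, False
    return run_vision(), True
```

The caller ORs the `used_minicpm` flags across pages and only sleeps when at least one page actually hit the vision model.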
crystallize_deepseek.py — Historical Mode & Python Cleaning#
DeepSeek conversations are older (2024–2025) and treated as historical context:
- `historical: true` added to frontmatter
- Qwen prompt includes the note: “These are older DeepSeek conversations from 2024–2025. What is already in the wiki is more authoritative.”
- `MAX_CHARS=100000` — historical conversations are shorter
- Accepts both native DeepSeek mapping tree format and split-file format (`{"messages": [...], "title": ..., "inserted_at": ...}`)
Cleaning is done in Python, not by gemma. The rationale: load_conversation() already traverses the JSON mapping tree and emits clean ## USER / ## ASSISTANT blocks — there are no JSON artifacts left for an LLM to remove. _clean_text() handles the rest deterministically:
```python
import re

def _clean_text(text: str) -> str:
    # Remove emoji / pictograph characters
    text = re.sub(r'[\U0001F300-\U0001F9FF\U0001FA00-\U0001FA9F\U00002600-\U000027BF\u2300-\u23FF]+', '', text, flags=re.UNICODE)
    # Collapse 3+ blank lines to two
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip trailing whitespace per line
    text = '\n'.join(line.rstrip() for line in text.split('\n'))
    return text.strip()
```

Tunable Env Vars (_llm_client.py)#
| Variable | Default | Purpose |
|---|---|---|
| `LLM_FIRST_TOKEN_TIMEOUT` | 1200s | Max wait for first token (prompt eval + model load) |
| `LLM_IDLE_TIMEOUT` | 180s | Max silence between tokens |
| `LLM_HARD_TIMEOUT` | 3600s | Absolute cap per attempt |
| `LLM_MAX_RETRIES` | 3 | Retries with exponential backoff |
| `EMBED_CHUNK_SIZE` | 5000 | Chars per embedding chunk (nomic-embed-text ~11KB ceiling) |
| `EMBED_MAX_CHUNKS` | 8 | Max chunks before sampling beginning+middle+end |
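The `EMBED_MAX_CHUNKS` behaviour can be sketched as follows (the exact beginning/middle/end split is an assumption; only the "sample rather than truncate" idea comes from the table above):

```python
def sample_chunks(chunks, max_chunks=8):
    """When a document yields more than max_chunks embedding chunks,
    keep a spread from beginning, middle and end instead of truncating."""
    if len(chunks) <= max_chunks:
        return chunks
    n, third = len(chunks), max_chunks // 3
    head = chunks[:third]
    mid = n // 2 - third // 2
    middle = chunks[mid:mid + third]
    tail = chunks[-(max_chunks - 2 * third):]
    return head + middle + tail
```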
Open Questions#
- tesseract-ocr not yet installed on wiki-llm — zero-text PDFs (vector-rendered, no text layer) cannot be reliably extracted by minicpm-v. Deferred until the wiki-llm VM is resized (currently 2 cores / 4GB RAM). When installed: add a `_try_tesseract()` helper in convert_to_md.py as a fallback for pages where both the text layer and minicpm-v return empty.
- hugoboss (192.168.1.237) hosts an empty Hugo skeleton for help.hinterflix.com — no dedicated wiki page yet. GitHub repo `NK-Iluvatar/blog` exists but has zero commits pushed. See hugoboss for the full machine audit.
Related Pages#
Wiki System - Architecture, wiki-llm, Pavilion (AI PC) Configuration, PostgreSQL, AI Infrastructure Overview, AI-Driven Monitoring Pipeline
Sources#
- ingested/chats/2026-05-02-restructure-lint-crystallize-session.md
- ingested/chats/2026-05-01-deepseek-second-pass-claude-code-crystallize.md
- ingested/chats/039-Translating Questions into German Poetry.md
- Claude Code session 2026-04-26 · ingested/chats/2026-04-26-wiki-pipeline-scripts-homelab.md
- Claude Code session 2026-04-30 · crystallize_deepseek.py Python cleaning refactor