Wiki Pipeline Scripts#
What Was Established#
Eight Python scripts in /opt/wiki/homelab/scripts/ implement the full wiki pipeline: file conversion, document ingestion, conversation crystallization (standard, DeepSeek, and Claude formats), shared LLM infrastructure, wiki health-checking, and knowledge-graph integration. All scripts were ported from the work wiki pipeline (itself developed 2026-04-21 → 2026-04-26) with homelab-specific infrastructure baked in.
crystallize.py (Claude format) uses a two-step LLM approach: gemma4:e2b cleans, qwen3.6:35b crystallizes. crystallize_deepseek.py skips gemma — JSON parsing is handled deterministically in Python (load_conversation + _clean_text), so only qwen is needed.
Scripts#
| Script | Purpose |
|---|---|
| `_llm_client.py` | Shared LLM/embedding/pgvector infrastructure — imported by all other scripts |
| `perform_full_integrate.py` | Shared mechanical layer — importable primitives for wiki skills: `parse_page()`, `search_similar()`, `embed_and_store()`, `add_wikilink()`, `delete_embedding()`, `upsert_index_entry()`, `append_log()`. Also a CLI `main()` for the single-page integrate workflow. |
| `lint_all.py` | Wiki health check — 9-step lint: broken wikilinks, orphans, stale index, pgvector sync, orphaned sources/assets, conceptual flags. Produces a dated report in `archive/`. |
| `crystallize.py` | Process Claude Code session transcripts (`## USER` / `## ASSISTANT` format) |
| `crystallize_deepseek.py` | Process DeepSeek export JSON (mapping tree or split-file format) |
| `convert_deepseek_to_md.py` | Convert DeepSeek JSON exports to clean Markdown transcripts |
| `convert_claude_to_md.py` | Convert Claude Code JSON exports (single or multi-conversation) to clean Markdown transcripts |
| `ingest.py` | Process Markdown documents into wiki pages |
| `convert_to_md.py` | Convert raw files (.docx, .pdf, .pptx, images) to Markdown |
Pipeline Architecture#
raw file (.pdf/.docx/.pptx/image)
↓ convert_to_md.py (PyMuPDF + minicpm-v:8b or python-docx)
↓
.md file or Claude JSON
↓ ingest.py / crystallize.py
↓
Step 1: gemma4:e2b → clean/strip noise (Legolas, 192.168.1.45)
Step 2: qwen3.6:35b → crystallize into pages (Legolas, 192.168.1.45)
Step 3: nomic-embed-text → 768-dim vector (Celebrimbor, 192.168.2.192)
Step 4: store in pgvector (PostgreSQL, 192.168.1.57)
↓
source moved to ingested/<subdir>/ (atomic commit — only on full success)
DeepSeek JSON ← different path: no gemma step
↓ crystallize_deepseek.py
↓
Step 1: load_conversation() → parse mapping tree → ## USER/## ASSISTANT text (Python)
Step 2: _clean_text() → emoji strip, whitespace normalise (Python)
Step 3: qwen3.6:35b → crystallize into pages (Legolas, 192.168.1.45)
Step 4: nomic-embed-text → store in pgvector (Celebrimbor / PostgreSQL)
↓
source moved to ingested/chats/

Key Design Decisions#
Streaming with hang detection (_llm_client.py): Uses Ollama streaming API rather than blocking POST. A background thread fires a heartbeat every 30s showing elapsed time and token count. If no tokens arrive within LLM_IDLE_TIMEOUT (default 180s), the stream is aborted and the call retried. This distinguishes a busy model from a stuck one.
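The idle-timeout mechanism can be sketched as follows (a minimal illustration; `stream_with_idle_timeout` and `IdleTimeout` are hypothetical names, not the actual `_llm_client.py` API). A reader thread feeds tokens into a queue, and the consumer aborts once the queue stays empty past the idle window:

```python
import queue
import threading

class IdleTimeout(Exception):
    """Raised when no token arrives within the idle window."""

def stream_with_idle_timeout(token_iter, idle_timeout=180.0):
    """Consume tokens from token_iter, aborting if the gap between
    consecutive tokens exceeds idle_timeout seconds."""
    q = queue.Queue()
    sentinel = object()

    def reader():
        for tok in token_iter:
            q.put(tok)
        q.put(sentinel)

    threading.Thread(target=reader, daemon=True).start()
    tokens = []
    while True:
        try:
            item = q.get(timeout=idle_timeout)
        except queue.Empty:
            # The model is stuck, not merely busy: abort so the caller can retry.
            raise IdleTimeout(f"no token for {idle_timeout}s, aborting stream")
        if item is sentinel:
            return "".join(tokens)
        tokens.append(item)
```

The same queue can also drive the 30s heartbeat thread, which only reports elapsed time and token count without affecting the timeout logic.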
Exponential backoff retry: All LLM, embedding, and pgvector calls retry up to LLM_MAX_RETRIES (default 3) with exponential backoff. On final give-up, the source file is preserved in raw/ and the script exits non-zero — the next run retries automatically.
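A minimal sketch of the retry wrapper (`with_retries` is a hypothetical helper; the real `_llm_client.py` code may differ):

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt and retry,
    re-raising after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # final give-up: caller exits non-zero, source stays in raw/
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` keeps the helper testable without real delays.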
Lockfiles: A <source>.ingest.lock file sits beside each source during processing. A second invocation on the same file aborts with “Lock held”. Stale locks (older than LLM_HARD_TIMEOUT + 600s) are auto-cleared.
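The lockfile behaviour can be sketched as below (names and internals are assumed; only the `.ingest.lock` suffix, the "Lock held" abort, and the `LLM_HARD_TIMEOUT + 600s` stale threshold come from the description above):

```python
import os
import time

STALE_AFTER = 3600 + 600  # LLM_HARD_TIMEOUT + 600s

def acquire_lock(source_path):
    """Create <source>.ingest.lock; abort if a fresh lock already exists,
    auto-clear it if it is older than the stale threshold."""
    lock = source_path + ".ingest.lock"
    if os.path.exists(lock):
        if time.time() - os.path.getmtime(lock) < STALE_AFTER:
            raise SystemExit("Lock held")
        os.remove(lock)  # stale lock from a crashed run
    with open(lock, "w") as fh:
        fh.write(str(os.getpid()))
    return lock
```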
Source preservation on failure (exit code 2): If any LLM call exhausts retries, the source file stays in raw/. Source only moves to ingested/ on full pipeline success.
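A sketch of the atomic-commit rule under the assumptions above (hypothetical `commit_source` helper):

```python
import shutil
import sys
from pathlib import Path

def commit_source(source: Path, ingested_dir: Path, success: bool):
    """Atomic commit: the source leaves raw/ only on full pipeline success;
    on failure the script exits 2 so the next run retries automatically."""
    if not success:
        sys.exit(2)  # source stays in raw/
    ingested_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(ingested_dir / source.name))
```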
pgvector cross-reference before crystallization: Before calling qwen, the pipeline embeds the input and queries pgvector for similar existing pages (threshold 0.4). These are passed to qwen as context so it can decide whether to update an existing page or create a new one.
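The real lookup runs as a pgvector similarity query against PostgreSQL; this pure-Python sketch (hypothetical names) only illustrates the 0.4-threshold filtering and ordering that decide which pages become context for qwen:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_pages(query_vec, pages, threshold=0.4):
    """pages: {path: embedding}. Return [(path, similarity)] above the
    threshold, most similar first; these become context for qwen."""
    hits = [(path, cosine_similarity(query_vec, vec)) for path, vec in pages.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```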
Chunking for long inputs (crystallize.py): Splits at ## USER / ## ASSISTANT boundaries, MAX_CHARS=100000 per chunk. Each chunk goes through the full clean→crystallize pipeline; pages are deduplicated by title at the end.
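The boundary-aware chunking can be sketched as follows (a simplification: a single turn longer than `max_chars` becomes its own oversized chunk):

```python
import re

def chunk_transcript(text, max_chars=100_000):
    """Split a transcript at ## USER / ## ASSISTANT turn boundaries,
    packing whole turns greedily into chunks of at most max_chars."""
    turns = re.split(r'(?=^## (?:USER|ASSISTANT)\b)', text, flags=re.M)
    chunks, current = [], ""
    for turn in turns:
        if current and len(current) + len(turn) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current += turn
    if current:
        chunks.append(current)
    return chunks
```

Splitting on a lookahead keeps the `## USER` / `## ASSISTANT` markers attached to their turns, so no content is lost when the chunks are rejoined.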
Shared Mechanical Layer Pattern#
perform_full_integrate.py establishes a pattern also used by lint_all.py: wiki skills are split into a mechanical layer (Python script — embed, pgvector query, index CRUD, wikilink edits, log appends) and a judgment layer (LLM agent — reading candidate pages, confirming genuine connections, deciding whether to merge or cross-link). The mechanical-layer functions are importable primitives; the LLM agent imports them rather than each skill duplicating its own `requests.post` and `subprocess.run(["psql", ...])` calls.
Key primitives shared across integrate, restructure, and lint:
- `parse_page()` — reads a page, returns `{title, content (body-only), frontmatter}`. Frontmatter is stripped from content before returning — this is the embedding fix (see Wiki System - Architecture).
- `search_similar()` — embed query, pgvector similarity search, returns `[{path, similarity}]`
- `embed_and_store()` — embed + upsert to pgvector via `_llm_client`
- `add_wikilink()` — insert `Target` into a page’s named section
- `upsert_index_entry()` / `append_log()` — index and log management
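A sketch of `parse_page()` consistent with the description above (the actual implementation in perform_full_integrate.py may differ in details); the key point is that frontmatter never reaches the embedded `content`:

```python
import re

def parse_page(path):
    """Read a wiki page; return title, body-only content, and frontmatter.
    Stripping frontmatter from content is the embedding fix."""
    with open(path, encoding="utf-8") as fh:
        raw = fh.read()
    frontmatter, body = "", raw
    m = re.match(r'^---\n(.*?)\n---\n', raw, flags=re.S)
    if m:
        frontmatter, body = m.group(1), raw[m.end():]
    titles = re.findall(r'^# (.+)$', body, flags=re.M)
    title = titles[0] if titles else ""
    return {"title": title, "content": body.strip(), "frontmatter": frontmatter}
```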
Infrastructure#
| Service | Host | Address | Role |
|---|---|---|---|
| gemma4:e2b | Legolas | 192.168.1.45:11434 | Step 1 cleaning |
| qwen3.6:35b-a3b-coding-nvfp4 | Legolas | 192.168.1.45:11434 | Step 2 crystallization |
| minicpm-v:8b | Legolas | 192.168.1.45:11434 | PDF/image OCR |
| nomic-embed-text | Celebrimbor | 192.168.2.192:11434 | 768-dim embeddings |
| PostgreSQL | pgvector LXC | 192.168.1.57:5432 | pgvector storage, db homelab |
convert_to_md.py — Notable Improvements#
Two improvements over the original approach:
- Dropped the `and not has_images` condition: old logic sent pages with embedded images to minicpm-v even when the text layer was perfectly good (e.g. Scribe-generated PDFs with header logos). New condition: use the text layer if `len(text) >= 100`, regardless of images. Only sparse pages go to minicpm-v.
- Conditional 10s settle sleep: the 10s Ollama settle delay after vision calls now only fires when minicpm-v was actually invoked on at least one page. Text-only PDFs no longer pay the unnecessary delay. `extract_from_pdf()` returns a `(text, used_minicpm)` tuple.
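The per-page decision can be sketched as below (`choose_extraction` is a hypothetical helper mirroring the new condition; `run_vision` stands in for the minicpm-v call):

```python
MIN_TEXT_CHARS = 100  # threshold from the new condition

def choose_extraction(page_text, run_vision):
    """Per page: keep the text layer if it is substantial, regardless of
    embedded images; only sparse pages fall back to the vision model.
    Returns (text, used_minicpm) so the caller can skip the settle sleep."""
    if len(page_text) >= MIN_TEXT_CHARS:
        return page_text, False
    return run_vision(), True
```

The caller ORs the `used_minicpm` flags across pages and only sleeps when at least one page actually hit the vision model.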
crystallize_deepseek.py — Historical Mode & Python Cleaning#
DeepSeek conversations are older (2024–2025) and treated as historical context:
- `historical: true` added to frontmatter
- Qwen prompt includes the note: “These are older DeepSeek conversations from 2024–2025. What is already in the wiki is more authoritative.”
- `MAX_CHARS=100000` — historical conversations are shorter
- Accepts both native DeepSeek mapping tree format and split-file format (`{"messages": [...], "title": ..., "inserted_at": ...}`)
Cleaning is done in Python, not by gemma. The rationale: load_conversation() already traverses the JSON mapping tree and emits clean ## USER / ## ASSISTANT blocks — there are no JSON artifacts left for an LLM to remove. _clean_text() handles the rest deterministically:
```python
import re

def _clean_text(text: str) -> str:
    # Remove emoji / pictograph characters
    text = re.sub(r'[\U0001F300-\U0001F9FF\U0001FA00-\U0001FA9F\U00002600-\U000027BF\u2300-\u23FF]+', '', text, flags=re.UNICODE)
    # Collapse 3+ blank lines to two
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip trailing whitespace per line
    text = '\n'.join(line.rstrip() for line in text.split('\n'))
    return text.strip()
```

Tunable Env Vars (_llm_client.py)#
| Variable | Default | Purpose |
|---|---|---|
| `LLM_FIRST_TOKEN_TIMEOUT` | 1200s | Max wait for first token (prompt eval + model load) |
| `LLM_IDLE_TIMEOUT` | 180s | Max silence between tokens |
| `LLM_HARD_TIMEOUT` | 3600s | Absolute cap per attempt |
| `LLM_MAX_RETRIES` | 3 | Retries with exponential backoff |
| `EMBED_CHUNK_SIZE` | 5000 | Chars per embedding chunk (nomic-embed-text ~11KB ceiling) |
| `EMBED_MAX_CHUNKS` | 8 | Max chunks before sampling beginning+middle+end |
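The `EMBED_MAX_CHUNKS` behaviour can be sketched as follows (the exact beginning/middle/end split is an assumption; only the "sample rather than truncate" idea comes from the table above):

```python
def sample_chunks(chunks, max_chunks=8):
    """When a document yields more than max_chunks embedding chunks,
    keep a spread from beginning, middle and end instead of truncating."""
    if len(chunks) <= max_chunks:
        return chunks
    n, third = len(chunks), max_chunks // 3
    head = chunks[:third]
    mid = n // 2 - third // 2
    middle = chunks[mid:mid + third]
    tail = chunks[-(max_chunks - 2 * third):]
    return head + middle + tail
```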
Open Questions#
- tesseract-ocr not yet installed on wiki-llm — zero-text PDFs (vector-rendered, no text layer) cannot be reliably extracted by minicpm-v. Deferred until the wiki-llm VM is resized (currently 2 cores / 4GB RAM). When installed: add a `_try_tesseract()` helper in convert_to_md.py as a fallback for pages where both the text layer and minicpm-v return empty.
- hugoboss (192.168.1.237) hosts an empty Hugo skeleton for help.hinterflix.com — no dedicated wiki page yet. GitHub repo `NK-Iluvatar/blog` exists but has zero commits pushed. See hugoboss for the full machine audit.
Related Pages#
Wiki System - Architecture, wiki-llm, Pavilion (AI PC) Configuration, PostgreSQL, AI Infrastructure Overview, AI-Driven Monitoring Pipeline
Sources#
- ingested/chats/2026-05-02-restructure-lint-crystallize-session.md
- ingested/chats/2026-05-01-deepseek-second-pass-claude-code-crystallize.md
- ingested/chats/039-Translating Questions into German Poetry.md
- Claude Code session 2026-04-26 · ingested/chats/2026-04-26-wiki-pipeline-scripts-homelab.md
- Claude Code session 2026-04-30 · crystallize_deepseek.py Python cleaning refactor