CtrlK
BlogDocsLog inGet started
Tessl Logo

pdf-brain-ingest

Ingest PDF/Markdown/TXT files into joelclaw's docs memory pipeline with Inngest durability, durable NAS artifacts, and OTEL verification. Use when adding docs, running batch reindex, reconciling coverage, or recovering stuck runs. Triggers on: 'ingest pdf', 'ingest markdown', 'docs add', 'pdf-brain ingest', 'backfill books', 'docs reconcile', 'reindex docs', 'batch reindex'.

90

Quality

88%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

SecuritybySnyk

Passed

No known issues

SKILL.md
Quality
Evals
Security

PDF Brain Ingest v2 (ADR-0234)

Staged artifact-chain pipeline with durable NAS storage, nomic embeddings, and workload queue orchestration.

Pipeline v2 Architecture

PDF (source, immutable)
  → Stage 1: CONVERT — opendataloader-pdf → {docId}.md (NAS artifact)
  → Stage 2: CLASSIFY + SUMMARIZE — taxonomy + LLM summary → {docId}.meta.json (NAS artifact)
  → Stage 3: CHUNK — markdown-native headings, no overlap → {docId}.chunks.jsonl (NAS artifact)
  → Stage 4: INDEX — upsert to docs + docs_chunks_v2 (nomic-embed-text-v1.5, 768-dim)

Key properties:

  • Durable: artifacts on NAS RAID5, survive reboots/crashes
  • Resumable: each stage checks for existing artifacts, skips if present
  • Recoverable: re-run any stage from existing artifacts without re-extracting
  • Observable: OTEL event per stage per book

Artifacts dir: /Volumes/three-body/docs-artifacts/{docId}/

  • {docId}.md — structured markdown extraction
  • {docId}.meta.json — taxonomy, summary, metadata
  • {docId}.chunks.jsonl — chunk records, one per line

Core Workflow

1) Preflight

joelclaw status
joelclaw docs status

Status now shows both v1 and v2 collection stats plus artifact availability.

2) Single File Ingest (v1 pipeline)

joelclaw docs add "/absolute/path/to/file.pdf"
joelclaw docs add "/absolute/path/to/file.pdf" --title "Title" --tags "tag1,tag2" --category programming

3) Single File v2 Reindex (artifact pipeline)

joelclaw docs reindex-v2 "/absolute/path/to/file.pdf"
joelclaw docs reindex-v2 "/absolute/path/to/file.pdf" --title "Title" --skip-existing

Fires docs/reindex-v2.requested → 4-stage artifact pipeline → NAS artifacts + docs_chunks_v2.

4) Batch Reindex (full library)

# Reindex all PDFs from NAS /Volumes/three-body/books/
joelclaw docs batch-reindex --skip-existing

# Reindex from existing Typesense docs collection
joelclaw docs batch-reindex --from-collection --skip-existing

Fires docs/reindex-batch.requested → scans NAS/collection → dispatches individual reindex-v2 events in batches of 10, concurrency 3.

--skip-existing skips books that already have all 3 artifacts on NAS (default true).

5) Monitor Progress

# Artifact count on NAS
ls /Volumes/three-body/docs-artifacts/ | wc -l

# v2 collection chunk count
joelclaw docs status

# OTEL events from pipeline
joelclaw otel search "docs.reindex" --hours 4
joelclaw o11y session system-bus --hours 4

# Individual run trace
joelclaw runs --count 10
joelclaw run <run-id>

6) Inspect Artifacts

# Read the extracted markdown
joelclaw docs markdown <doc-id>

# Read the summary + taxonomy metadata
joelclaw docs summary <doc-id>

# Or directly on NAS
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.md
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.meta.json | jq
wc -l /Volumes/three-body/docs-artifacts/<docId>/<docId>.chunks.jsonl

7) Retrieval from v2

# Search uses docs_chunks_v2 (nomic 768-dim) by default
joelclaw docs search "distributed consensus" --limit 8

# Context expansion
joelclaw docs context <chunk-id> --mode snippet-window
joelclaw docs context <chunk-id> --mode parent-section
joelclaw docs context <chunk-id> --mode section-neighborhood

8) Coverage Reconcile

joelclaw docs reconcile --sample 20

9) Recovery

If the batch stalls or books fail:

  • Inngest retries each step automatically (default retry policy)
  • --skip-existing means re-firing the batch only processes unfinished books
  • Check OTEL for errors: joelclaw otel list --level error --hours 4
  • Individual retry: joelclaw docs reindex-v2 "/path/to/failed.pdf"

Extraction Details

  • Primary: opendataloader-pdf v2.0.0 (Java-based, #1 in benchmarks, 0.90 accuracy)
  • Fallback: pypdf (basic text extraction if Java unavailable)
  • Requires: Java 11+ (OpenJDK 25 installed on panda, PATH configured in worker start.sh)
  • Speed: ~2.5s per book on M4 Pro

Embedding Model

  • v2: nomic-embed-text via ollama (GPU-accelerated on M4 Pro, 768-dim, retrieval-tuned). Pre-computed at ingest time, stored as raw float[] vectors. ~150x faster than Typesense CPU auto-embed.
  • v1: ts/all-MiniLM-L12-v2 — 384-dim, general-purpose, Typesense auto-embed (legacy, still in docs_chunks)
  • Ollama runs on panda at localhost:11434. System-bus-worker (host process) embeds at ingest time.

Chunking Strategy (ADR-0234)

  • Markdown-native heading detection (# markers, not heuristics)
  • Recursive splitting within sections exceeding target tokens
  • No overlap (arxiv R100-0 finding: 45% higher precision)
  • Two-level hierarchy: section chunks + snippet sub-chunks
  • heading_path derived from actual markdown heading levels
  • Context inheritance: retrieval_text includes [DOC: title] [SUMMARY: ...] [PATH: heading > path] [CONCEPTS: ...]

Acquisition Pipeline (aa-book → ingest)

joelclaw send pipeline/book.download -d '{
  "query": "designing data-intensive applications",
  "format": "pdf",
  "reason": "library expansion"
}'

Downloads via aa-book → NAS backup → fires docs/ingest.requested for immediate processing.

Inngest Events

EventFunctionPurpose
docs/ingest.requesteddocs-ingestv1 pipeline (single file)
docs/reindex-v2.requesteddocs-reindex-v2v2 artifact pipeline (single file)
docs/reindex-batch.requesteddocs-reindex-batchBatch orchestrator (all PDFs)
docs/backlog.requesteddocs-backlogLegacy manifest-based backfill
docs/enrich.requesteddocs-enrichRe-enrich metadata for existing doc
pipeline/book.downloadbook-downloadAcquire + ingest new book
Repository
joelhooks/joelclaw
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.