Ingest PDF/Markdown/TXT files into joelclaw's docs memory pipeline with Inngest durability, durable NAS artifacts, and OTEL verification. Use when adding docs, running batch reindex, reconciling coverage, or recovering stuck runs. Triggers on: 'ingest pdf', 'ingest markdown', 'docs add', 'pdf-brain ingest', 'backfill books', 'docs reconcile', 'reindex docs', 'batch reindex'.
Staged artifact-chain pipeline with durable NAS storage, nomic embeddings, and workload queue orchestration.
PDF (source, immutable)
→ Stage 1: CONVERT — opendataloader-pdf → {docId}.md (NAS artifact)
→ Stage 2: CLASSIFY + SUMMARIZE — taxonomy + LLM summary → {docId}.meta.json (NAS artifact)
→ Stage 3: CHUNK — markdown-native headings, no overlap → {docId}.chunks.jsonl (NAS artifact)
→ Stage 4: INDEX — upsert to docs + docs_chunks_v2 (nomic-embed-text-v1.5, 768-dim)

Key properties:
Artifacts dir: /Volumes/three-body/docs-artifacts/{docId}/
{docId}.md: structured markdown extraction
{docId}.meta.json: taxonomy, summary, metadata
{docId}.chunks.jsonl: chunk records, one per line

joelclaw status
joelclaw docs status
Status now shows both v1 and v2 collection stats plus artifact availability.
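Stage 3 of the pipeline above chunks on markdown headings with no overlap and stores one chunk record per line. The following is a minimal Python sketch of that idea, not the pipeline's actual implementation: the function names and the record fields (`doc_id`, `chunk_index`, `heading_path`, `text`) are assumptions made for illustration.

```python
import json
import re

# ATX headings only ("# Title" through "###### Title"); an assumption of this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, so "#" lines inside code are not read as headings


def chunk_markdown(doc_id, markdown):
    """Split markdown into non-overlapping chunks, one per heading section."""
    chunks = []
    path = []   # current heading path, e.g. ["Title", "Sub"]
    body = []   # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "doc_id": doc_id,                  # hypothetical field names
                "chunk_index": len(chunks),
                "heading_path": " > ".join(path),
                "text": text,
            })
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                      # close the previous section
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


def to_jsonl(chunks):
    """Serialize chunk records as JSON Lines, one record per line."""
    return "".join(json.dumps(chunk) + "\n" for chunk in chunks)
```

Because each line belongs to exactly one section, the chunks never overlap, and the heading path travels with every record so it can be used later for context expansion.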
joelclaw docs add "/absolute/path/to/file.pdf"
joelclaw docs add "/absolute/path/to/file.pdf" --title "Title" --tags "tag1,tag2" --category programming

joelclaw docs reindex-v2 "/absolute/path/to/file.pdf"
joelclaw docs reindex-v2 "/absolute/path/to/file.pdf" --title "Title" --skip-existing
Fires docs/reindex-v2.requested → 4-stage artifact pipeline → NAS artifacts + docs_chunks_v2.
# Reindex all PDFs from NAS /Volumes/three-body/books/
joelclaw docs batch-reindex --skip-existing
# Reindex from existing Typesense docs collection
joelclaw docs batch-reindex --from-collection --skip-existing
Fires docs/reindex-batch.requested → scans NAS/collection → dispatches individual reindex-v2 events in batches of 10, concurrency 3.
--skip-existing skips books that already have all 3 artifacts on NAS (default true).
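To make the skip rule and the batch size concrete, here is a rough Python sketch of how a batch could be planned. It assumes the artifact layout described earlier (three files per docId directory); `has_all_artifacts` and `plan_batches` are hypothetical helper names and are not part of joelclaw. The concurrency limit of 3 is enforced by the orchestrator and is not modeled here.

```python
from pathlib import Path

# The three artifacts a finished book is expected to have on the NAS.
ARTIFACT_SUFFIXES = (".md", ".meta.json", ".chunks.jsonl")


def has_all_artifacts(artifacts_root, doc_id):
    """Return True only if all three artifact files exist for doc_id."""
    doc_dir = Path(artifacts_root) / doc_id
    return all((doc_dir / f"{doc_id}{suffix}").is_file()
               for suffix in ARTIFACT_SUFFIXES)


def plan_batches(doc_ids, artifacts_root, skip_existing=True, batch_size=10):
    """Drop finished docs (when skip_existing is set), then group the rest."""
    pending = [
        doc_id for doc_id in doc_ids
        if not (skip_existing and has_all_artifacts(artifacts_root, doc_id))
    ]
    return [pending[i:i + batch_size]
            for i in range(0, len(pending), batch_size)]
```

A book with only one or two artifacts still counts as unfinished under this rule, which is why re-firing a batch picks up partially processed books.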
# Artifact count on NAS
ls /Volumes/three-body/docs-artifacts/ | wc -l
# v2 collection chunk count
joelclaw docs status
# OTEL events from pipeline
joelclaw otel search "docs.reindex" --hours 4
joelclaw o11y session system-bus --hours 4
# Individual run trace
joelclaw runs --count 10
joelclaw run <run-id>

# Read the extracted markdown
joelclaw docs markdown <doc-id>
# Read the summary + taxonomy metadata
joelclaw docs summary <doc-id>
# Or directly on NAS
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.md
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.meta.json | jq
wc -l /Volumes/three-body/docs-artifacts/<docId>/<docId>.chunks.jsonl

# Search uses docs_chunks_v2 (nomic 768-dim) by default
joelclaw docs search "distributed consensus" --limit 8
# Context expansion
joelclaw docs context <chunk-id> --mode snippet-window
joelclaw docs context <chunk-id> --mode parent-section
joelclaw docs context <chunk-id> --mode section-neighborhood

joelclaw docs reconcile --sample 20

If the batch stalls or books fail:
--skip-existing means re-firing the batch only processes unfinished books
joelclaw otel list --level error --hours 4
joelclaw docs reindex-v2 "/path/to/failed.pdf"

ts/all-MiniLM-L12-v2: 384-dim, general-purpose, Typesense auto-embed (legacy, still in docs_chunks)
localhost:11434. System-bus-worker (host process) embeds at ingest time.
Chunks follow markdown-native headings (# markers, not heuristics)
[DOC: title] [SUMMARY: ...] [PATH: heading > path] [CONCEPTS: ...]

joelclaw send pipeline/book.download -d '{
"query": "designing data-intensive applications",
"format": "pdf",
"reason": "library expansion"
}'
Downloads via aa-book → NAS backup → fires docs/ingest.requested for immediate processing.
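When the download request is issued from a script, hand-quoting the JSON payload is easy to get wrong. The sketch below builds the same command as an argument list in Python, so no shell quoting is needed. It only reuses the command and payload keys shown above; `book_download_command` is a hypothetical helper, and the default values simply mirror the example.

```python
import json
import shlex


def book_download_command(query, fmt="pdf", reason="library expansion"):
    """Build the argv for `joelclaw send pipeline/book.download -d <json>`."""
    payload = json.dumps({"query": query, "format": fmt, "reason": reason})
    return ["joelclaw", "send", "pipeline/book.download", "-d", payload]


args = book_download_command("designing data-intensive applications")
# Shell-safe rendering, handy for logging or copy-paste.
print(shlex.join(args))
```

The list can be passed directly to `subprocess.run(args, check=True)` on a host where joelclaw is installed.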
| Event | Function | Purpose |
|---|---|---|
| docs/ingest.requested | docs-ingest | v1 pipeline (single file) |
| docs/reindex-v2.requested | docs-reindex-v2 | v2 artifact pipeline (single file) |
| docs/reindex-batch.requested | docs-reindex-batch | Batch orchestrator (all PDFs) |
| docs/backlog.requested | docs-backlog | Legacy manifest-based backfill |
| docs/enrich.requested | docs-enrich | Re-enrich metadata for existing doc |
| pipeline/book.download | book-download | Acquire + ingest new book |