Research toolkit for triaging academic papers and GitHub projects. Triage papers and tools, reproduce benchmark claims, search Google Scholar, Semantic Scholar, PubMed, or Sci-Hub, and extract structured data from scientific PDFs.
AI-powered tool for extracting structured data from scientific literature PDFs. Converts tables, figures, and text from research papers into Markdown tables or CSV files using PyMuPDF (free) or Mathpix OCR (high-precision) plus an LLM extraction layer.
For paper search, use semantic-scholar-search instead.
Data extraction from PDFs is a pipeline with multiple failure points, so validate each stage.
See setup-and-troubleshooting.md.
export EXTRACTOR_API_KEY="your-api-key-here"
export EXTRACTOR_BASE_URL="https://api.anthropic.com"
export EXTRACTOR_MODEL="claude-sonnet-4-5-20250929"
# Optionally for Mathpix OCR:
export MATHPIX_APP_ID="your-mathpix-app-id"
export MATHPIX_APP_KEY="your-mathpix-app-key"

# Enzyme kinetics data: extracts Km, Kcat, organism, mutant, etc.
./scripts/extractor.py paper.pdf --template enzyme -o results.md
# Experimental results: extracts conditions, values, p-values, sample size
./scripts/extractor.py paper.pdf --template experiment -o results.md
# Literature review: extracts authors, year, DOI, key findings
./scripts/extractor.py paper.pdf --template review -o references.md

# Custom extraction prompt
./scripts/extractor.py paper.pdf \
  -p "Extract all protein structure data: PDB ID, resolution, R-value, R_free" \
  -o custom.md

# CSV output
./scripts/extractor.py paper.pdf --template enzyme -o results.csv --format csv

# Mathpix OCR
./scripts/extractor.py paper.pdf --template enzyme -o results.md --ocr mathpix

# Batch extraction over a directory of PDFs
./scripts/batch_extract.py ./literature/ ./output/ --template enzyme --format csv

# Print results for review
./scripts/extractor.py paper.pdf --template review -o references.md --print

WHY: AI extraction silently produces incorrect values when table headers are ambiguous, units are merged into cells, or multi-row headers span columns. A common production mistake is writing raw LLM output to a database without review.
BAD Pipe output directly to a database insert script. → GOOD Print results with --print, spot-check at least 3 rows against the source PDF, then save.
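One way to make the spot-check routine is to pull a few data rows out of the saved Markdown output and compare them with the PDF by hand. The sketch below assumes the output is a pipe-delimited Markdown table whose first two table lines are the header and the separator row. `sample_rows` is a hypothetical helper, not part of the toolkit.

```shell
# sample_rows FILE [N]: print the first N data rows of a Markdown table (default 3).
# Hypothetical helper; assumes table lines start with "|" and that the first
# two table lines are the header and the |---| separator.
sample_rows() {
  file=$1
  n=${2:-3}
  grep '^|' "$file" | tail -n +3 | head -n "$n"
}

# Usage:
#   sample_rows results.md      # three rows to compare against the source PDF
#   sample_rows results.md 5    # a larger sample
```

Taking the first rows keeps the helper deterministic. For a long table it may be worth checking rows from the middle and the end as well.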
WHY: PyMuPDF extracts embedded text; it cannot OCR pixel images. A pitfall is getting empty or near-empty extraction with no error, leading to silently wrong results.
BAD Run ./scripts/extractor.py scanned.pdf --template enzyme -o out.md and trust the output. → GOOD Check that extracted text length is reasonable; if content is empty or very short, switch to --ocr mathpix.
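The length check can be scripted as a rough guard. The sketch below uses the size of the saved output file as a proxy for "extracted text length". The helper name `needs_ocr` and the 500-byte default threshold are assumptions chosen for illustration, so tune the threshold to your own papers.

```shell
# needs_ocr FILE [MIN_BYTES]: succeed (exit 0) when FILE is missing, empty, or
# shorter than MIN_BYTES, which suggests a scanned PDF with no embedded text.
# Hypothetical helper; the 500-byte default is an arbitrary assumption.
needs_ocr() {
  file=$1
  min=${2:-500}
  [ -s "$file" ] || return 0
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -lt "$min" ]
}

# Usage:
#   ./scripts/extractor.py scanned.pdf --template enzyme -o out.md
#   if needs_ocr out.md; then
#     ./scripts/extractor.py scanned.pdf --template enzyme -o out.md --ocr mathpix
#   fi
```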
WHY: Without explicit format instructions, the LLM may return prose summaries instead of tables, breaking downstream CSV conversion.
BAD -p "Extract protein data" → GOOD -p "Extract protein data into a Markdown table with columns: Name | PDB ID | Resolution | R-value. Use | as delimiter."
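Before passing output to CSV conversion, it may help to confirm that the LLM actually returned a table. The sketch below looks for a Markdown separator row. `has_markdown_table` is a hypothetical helper and only checks that a table is present, not that its columns are correct.

```shell
# has_markdown_table FILE: succeed when FILE contains a Markdown table separator
# row such as |---|---| or | :--- | ---: |. Hypothetical helper.
has_markdown_table() {
  grep -Eq '^[[:space:]]*\|?[[:space:]]*:?-{3,}:?[[:space:]]*\|' "$1"
}

# Usage:
#   has_markdown_table custom.md || echo "no table found; tighten the -p prompt" >&2
```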
WHY: batch_extract.py continues on errors and reports failures at the end. A production pitfall is assuming a completed batch run means 100% extraction success.
BAD Count output files to verify completeness. → GOOD ALWAYS read the final summary printed by batch_extract.py; re-run failed files individually with --print to diagnose issues.
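Re-running the failed files one at a time can be wrapped in a small loop. The sketch below takes the failed PDF paths as arguments, copied by hand from the batch summary, because the summary's exact format is not documented here. `rerun_failed`, `EXTRACTOR`, and `OUT_DIR` are hypothetical names introduced for illustration.

```shell
# rerun_failed TEMPLATE FILE...: re-run the single-file extractor with --print on
# each PDF that the batch summary listed as failed.
# EXTRACTOR and OUT_DIR are hypothetical overrides, not documented settings.
rerun_failed() {
  template=$1
  shift
  for pdf in "$@"; do
    out="${OUT_DIR:-./output}/$(basename "$pdf" .pdf).md"
    "${EXTRACTOR:-./scripts/extractor.py}" "$pdf" --template "$template" -o "$out" --print \
      || echo "still failing: $pdf" >&2
  done
}

# Usage:
#   rerun_failed enzyme ./literature/paper1.pdf ./literature/paper2.pdf
```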
NEVER store EXTRACTOR_API_KEY in plain text in the script or a committed file.
WHY: API keys in source control are a security risk, because they stay exposed in git history even if later removed.
BAD Hardcode api_key = "sk-ant-..." in the script. → GOOD ALWAYS use environment variables or a .env file excluded from version control.
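A minimal sketch of the environment-variable approach follows, assuming a dotenv-style file that contains only plain `VAR=value` shell assignments. `load_env` and `require_api_key` are hypothetical helper names, not part of the toolkit.

```shell
# load_env [FILE]: export every VAR=value assignment from a dotenv-style file
# (default ./.env). Assumes the file contains only plain shell assignments.
load_env() {
  env_file=${1:-./.env}
  case $env_file in */*) ;; *) env_file=./$env_file ;; esac  # avoid a PATH lookup by "."
  [ -f "$env_file" ] || return 0
  set -a            # auto-export variables assigned while sourcing
  . "$env_file"
  set +a
}

# require_api_key: fail early with a message when EXTRACTOR_API_KEY is unset or empty.
require_api_key() {
  if [ -z "${EXTRACTOR_API_KEY:-}" ]; then
    echo "EXTRACTOR_API_KEY is not set" >&2
    return 1
  fi
}

# Usage:
#   grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore
#   load_env && require_api_key && ./scripts/extractor.py paper.pdf --template enzyme -o results.md
```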
WHY: Mathpix requires paid credentials. Attempting OCR without MATHPIX_APP_ID and MATHPIX_APP_KEY set causes a hard failure mid-pipeline, wasting API quota already consumed for PDF upload.
BAD Use --ocr mathpix without verifying credentials. → GOOD ALWAYS confirm both MATHPIX_APP_ID and MATHPIX_APP_KEY are set before invoking Mathpix; optionally fall back to PyMuPDF.
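The credential check and the optional fallback can be combined in one helper. The sketch below assumes that omitting `--ocr` makes the extractor use PyMuPDF, which the basic commands suggest but this document does not state outright. `ocr_flags` is a hypothetical name.

```shell
# ocr_flags: print "--ocr mathpix" when both Mathpix credentials are set;
# otherwise print nothing (assumed to select PyMuPDF) and warn on stderr.
ocr_flags() {
  if [ -n "${MATHPIX_APP_ID:-}" ] && [ -n "${MATHPIX_APP_KEY:-}" ]; then
    echo "--ocr mathpix"
  else
    echo "Mathpix credentials missing; falling back to PyMuPDF" >&2
  fi
}

# Usage (left unquoted on purpose so the two words become separate arguments):
#   ./scripts/extractor.py paper.pdf --template enzyme -o results.md $(ocr_flags)
```

For scanned PDFs, where PyMuPDF returns little or no text, stopping with an error may be a better choice than falling back.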
google-scholar-search
pubmed-search
reproduce-benchmark
sci-data-extractor
sci-hub-search
semantic-scholar-search
triage-paper
triage-tool