Research toolkit for triaging academic papers and GitHub projects. Triage papers and tools, reproduce benchmark claims, search Google Scholar, Semantic Scholar, PubMed, or Sci-Hub, and extract structured data from scientific PDFs.
cd skills/documentation/research/sci-data-extractor
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Alternatively, with the standard venv module and pip:

cd skills/documentation/research/sci-data-extractor
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Verify the installation:

python3 scripts/extractor.py --help

Create a .env file in the skill directory (excluded from git):
EXTRACTOR_API_KEY=your-api-key-here
EXTRACTOR_BASE_URL=https://api.anthropic.com
EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
EXTRACTOR_TEMPERATURE=0.1
EXTRACTOR_MAX_TOKENS=16384
# Optional: Mathpix OCR credentials
MATHPIX_APP_ID=your-mathpix-app-id
MATHPIX_APP_KEY=your-mathpix-app-key

Load with dotenvx run -- before any command, or export variables directly.
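The way these variables could be consumed can be sketched in Python. This is only an illustration, not the code in extractor.py: the function name `load_extractor_config` is hypothetical, the defaults simply reuse the example values above, and the priority of EXTRACTOR_API_KEY over API_KEY is an assumption.

```python
import os


def load_extractor_config(env=None):
    """Collect extractor settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: EXTRACTOR_API_KEY wins and API_KEY is the fallback.
    api_key = env.get("EXTRACTOR_API_KEY") or env.get("API_KEY")
    if not api_key:
        raise RuntimeError("Set EXTRACTOR_API_KEY (or API_KEY) in .env or the shell")
    # Assumption: the defaults below mirror the example .env, not the script's real defaults.
    return {
        "api_key": api_key,
        "base_url": env.get("EXTRACTOR_BASE_URL", "https://api.anthropic.com"),
        "model": env.get("EXTRACTOR_MODEL", "claude-sonnet-4-5-20250929"),
        "temperature": float(env.get("EXTRACTOR_TEMPERATURE", "0.1")),
        "max_tokens": int(env.get("EXTRACTOR_MAX_TOKENS", "16384")),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to try without touching the real environment.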
Every session requires activating the venv:
source skills/documentation/research/sci-data-extractor/.venv/bin/activate

PyMuPDF extracts embedded text only. If the PDF contains scanned images rather than embedded text, the extracted content will be empty or very short. In that case, switch to Mathpix OCR:
./scripts/extractor.py paper.pdf --template enzyme -o results.md --ocr mathpix

Ensure EXTRACTOR_API_KEY is exported or present in .env. The script reads both EXTRACTOR_API_KEY and API_KEY environment variables.
Verify that both MATHPIX_APP_ID and MATHPIX_APP_KEY are set and match your Mathpix account credentials at api.mathpix.com.
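For the scanned-PDF case described above, it can help to check whether a PDF has an embedded text layer before deciding on OCR. The following is a rough heuristic sketch, not part of the toolkit: `looks_scanned`, `extract_page_texts`, and the 50-characters-per-page threshold are all assumptions chosen for illustration.

```python
def looks_scanned(page_texts, min_chars_per_page=50):
    """Return True when the extracted text is too sparse to be an embedded text layer."""
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    # Assumption: fewer than ~50 characters per page on average means "scanned".
    return total / len(page_texts) < min_chars_per_page


def extract_page_texts(pdf_path):
    """Read the embedded text of each page with PyMuPDF (pip install pymupdf)."""
    import fitz  # imported lazily so looks_scanned works without PyMuPDF installed

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

If `looks_scanned(extract_page_texts("paper.pdf"))` returns True, rerunning the extractor with `--ocr mathpix` is the likely next step.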
Add explicit output format instructions to your custom prompt. For example:
-p "Extract data into a Markdown table using | as delimiter. Include columns: Name | Value | Unit"

batch_extract.py imports fail

The batch script imports from extractor using a relative import. Always run it from the scripts/ directory or ensure extractor.py is on the Python path:
cd skills/documentation/research/sci-data-extractor
python3 scripts/batch_extract.py ./literature/ ./output/ --template enzyme

Increase the token limit via the environment variable:
export EXTRACTOR_MAX_TOKENS=32768

The script automatically segments long documents; increasing EXTRACTOR_MAX_TOKENS reduces the chance of truncated table output.
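One way to notice truncated table output is to parse the pipe-delimited Markdown and flag rows with fewer cells than the header. This is a minimal sketch under that assumption; `parse_markdown_table` and `find_short_rows` are hypothetical helpers, not functions provided by the toolkit.

```python
def parse_markdown_table(text):
    """Split pipe-delimited Markdown table lines into lists of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. | --- | :---: |
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def find_short_rows(rows):
    """Return the indexes of rows that have fewer cells than the header row."""
    if not rows:
        return []
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) < width]
```

A non-empty result from `find_short_rows` suggests the output was cut off mid-row, which is the situation where raising EXTRACTOR_MAX_TOKENS may help.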
fitz (PyMuPDF)

pip install pymupdf

Note: the package name is pymupdf but the import is fitz.
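Because the import name and the package name differ, a small preflight check can turn a missing-module error into an install hint. This sketch uses only the standard library; the helper name `missing_install_hints` and the mapping are illustrative, and only the fitz/pymupdf pair comes from the note above.

```python
import importlib.util

# Import name -> pip package name. Only the fitz/pymupdf pair is taken from
# the note above; any other entries would be your own additions.
IMPORT_TO_PACKAGE = {"fitz": "pymupdf"}


def missing_install_hints(mapping=IMPORT_TO_PACKAGE):
    """Return one install hint per top-level module that cannot be imported."""
    hints = []
    for module, package in mapping.items():
        if importlib.util.find_spec(module) is None:
            hints.append(f"{module} not importable: pip install {package}")
    return hints
```

Calling `missing_install_hints()` inside the activated venv returns an empty list when PyMuPDF is installed.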
google-scholar-search
pubmed-search
reproduce-benchmark
sci-data-extractor
sci-hub-search
semantic-scholar-search
triage-paper
triage-tool