Research toolkit for triaging academic papers and GitHub projects. Triage papers and tools, reproduce benchmark claims, search Google Scholar, Semantic Scholar, PubMed, or Sci-Hub, and extract structured data from scientific PDFs.
Search Google Scholar for academic papers and author profiles to build a candidate list before triaging.
If the semantic-scholar MCP is configured, optionally prefer it; it returns structured data with no rate-limit risk.
If a candidate list already exists at /tmp/<topic>-candidates.json, reuse it rather than re-running the search.
Search is discovery, not triage. The goal is a candidate list, not a finished reference.
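The reuse rule can be sketched as a small cache check. This is a minimal sketch, not part of the toolkit: `load_cached_candidates` and its `slug` and `directory` parameters are hypothetical names, and it assumes the saved file holds a JSON list of paper records.

```python
import json
from pathlib import Path
from typing import Optional


def load_cached_candidates(slug: str, directory: str = "/tmp") -> Optional[list]:
    """Return the saved candidate list for `slug`, or None if there is none.

    Hypothetical helper: assumes the file is named <slug>-candidates.json
    and contains a JSON list of paper records.
    """
    path = Path(directory) / f"{slug}-candidates.json"
    if not path.is_file():
        return None  # nothing cached, so a fresh search is needed
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Anything other than a list is treated as unusable.
    return data if isinstance(data, list) else None
```

A caller would run a new search only when this returns `None`.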
Check first whether a semantic-scholar or google-scholar MCP server is available. MCPs are faster, return structured data, and avoid rate-limit risk. If either is configured and reachable, prefer it over this script.
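The preference order can be sketched as a simple selection function. This is an illustrative sketch: `choose_backend` is a hypothetical name, how reachability is detected is left to the caller, and ranking semantic-scholar ahead of google-scholar is an assumption based only on the order in which the text names them.

```python
from typing import Iterable

# Assumed preference order: MCP servers first, the scraping script last.
PREFERRED_MCPS = ("semantic-scholar", "google-scholar")
SCRIPT_FALLBACK = "google-scholar-search.py"


def choose_backend(reachable_mcps: Iterable[str]) -> str:
    """Pick the first preferred MCP that is reachable, else the script."""
    reachable = set(reachable_mcps)
    for name in PREFERRED_MCPS:
        if name in reachable:
            return name
    return SCRIPT_FALLBACK
```

For example, `choose_backend(["google-scholar"])` returns `"google-scholar"`, and an empty list falls back to the script.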
See setup-and-troubleshooting.md for venv creation and dependency installation.
Activate the venv, then choose the appropriate subcommand:
    # Basic keyword search
    ./scripts/google-scholar-search.py search --query "retrieval-augmented generation" --results 10

    # Advanced: filter by author and year range
    ./scripts/google-scholar-search.py advanced \
      --query "LLM reasoning" --author "Yann LeCun" \
      --year-start 2020 --year-end 2024 --results 10

    # Author profile
    ./scripts/google-scholar-search.py author --name "Geoffrey Hinton"

    # Export candidate list to JSON
    ./scripts/google-scholar-search.py search \
      --query "prompt compression" --format json --output /tmp/candidates.json

    # Preview titles from a saved candidate list
    python3 -c "import json; [print(p['title']) for p in json.load(open('/tmp/candidates.json'))]"

If the script returns HTTP 429 or an unusually short HTML body, wait 60 seconds and retry once:

    sleep 60 && ./scripts/google-scholar-search.py search --query "<topic>" --results 10

NEVER retry in a tight loop. If still blocked, offer the semantic-scholar MCP as fallback and document the blocker.
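The retry-once rule can be expressed in code so that a loop is impossible by construction. This is a minimal sketch under stated assumptions: `search_with_single_retry`, `run_search`, and `is_blocked` are hypothetical names, and the caller supplies both the search call and the test for what counts as blocked.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def search_with_single_retry(
    run_search: Callable[[], T],
    is_blocked: Callable[[T], bool],
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run the search; if it looks blocked, wait once and retry once.

    Returns the result, or None when the retry is also blocked. On None
    the caller should offer the MCP fallback and document the blocker.
    """
    result = run_search()
    if not is_blocked(result):
        return result
    sleep(wait_seconds)  # a single pause, never a loop
    result = run_search()
    return None if is_blocked(result) else result
```

The `sleep` parameter is injectable only so the function can be exercised without a real 60-second wait.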
Show the result list. For each promising paper, offer:
    Found N results. Would you like to triage any of these with triage-paper?
NEVER triage automatically — always confirm with the user first.
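The presentation step might look like the sketch below, which renders the list and ends with the confirmation prompt instead of calling triage-paper. `format_candidates` is a hypothetical helper; it assumes each record has a `title` key, as the title-preview one-liner does.

```python
from typing import Dict, List


def format_candidates(candidates: List[Dict[str, str]]) -> str:
    """Render a numbered title list followed by the confirmation prompt."""
    lines = [f"{i}. {paper['title']}" for i, paper in enumerate(candidates, 1)]
    lines.append(
        f"Found {len(candidates)} results. "
        "Would you like to triage any of these with triage-paper?"
    )
    return "\n".join(lines)
```

The function only builds text; the decision to triage stays with the user.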
WHY: Discovery and triage are separate quality gates. Auto-triaging bypasses user review.
BAD Run triage-paper on every result. → GOOD Present the list; wait for the user to choose.
WHY: Repeated retries worsen the block and may trigger a longer ban.
BAD Retry in a tight loop. → GOOD Retry once after 60 s; switch to MCP fallback on second failure.
WHY: Scraped text is truncated and misformatted. Passing it to triage-paper corrupts the research record.
BAD Use the scraped abstract as the paper summary. → GOOD Use results for discovery only; re-fetch from arXiv or DOI during triage.
WHY: The script is the fragile fallback. Skipping the MCP check needlessly risks rate-limiting.
BAD Invoke the script without checking for an MCP first. → GOOD Check for semantic-scholar/google-scholar MCP; only fall back to the script if neither is available.
WHY: Google Scholar's year filter is a ranking hint, not a strict predicate. Papers outside the range may appear; papers inside may be missing.
BAD Claim "all 2020–2024 papers on X". → GOOD Qualify results as "a sample in the requested range, as returned by Google Scholar."
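Because the year filter is only a hint, results can be checked on the client side before they are described. The sketch below is an assumption-laden illustration: `split_by_year` is a hypothetical helper, and the `year` field is assumed rather than confirmed by the script's JSON output.

```python
from typing import Dict, List, Tuple


def split_by_year(
    candidates: List[Dict], year_start: int, year_end: int
) -> Tuple[List[Dict], List[Dict]]:
    """Split candidates into (inside range, outside range or unknown year)."""
    inside: List[Dict] = []
    outside: List[Dict] = []
    for paper in candidates:
        year = paper.get("year")  # assumed field; may be absent
        if isinstance(year, int) and year_start <= year <= year_end:
            inside.append(paper)
        else:
            outside.append(paper)
    return inside, outside
```

This catches out-of-range papers that slipped in, but it cannot recover in-range papers that Google Scholar omitted, so the "sample" qualifier still applies.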
WHY: A 200 response with a near-empty body is often a CAPTCHA redirect, not a genuine empty result set.
BAD Report "no papers found" when the HTML body is unusually short. → GOOD Check response length; if under ~1 KB, treat it as a potential block and retry once.
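The short-body check can be sketched as a small heuristic. `looks_blocked` is a hypothetical helper, and the 1024-byte default simply mirrors the "~1 KB" guideline above; it is not a documented Google Scholar limit.

```python
def looks_blocked(status: int, body: str, min_body_bytes: int = 1024) -> bool:
    """Heuristic: treat HTTP 429 or a suspiciously short body as a block."""
    if status == 429:
        return True
    # A short body on an otherwise successful response may be a CAPTCHA page.
    return len(body.encode("utf-8")) < min_body_bytes
```

A `True` result means "possibly blocked, retry once", not "no papers found".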
See triage-paper for semantic-scholar MCP configuration.