Research toolkit for triaging academic papers and GitHub projects: triage papers and tools, reproduce benchmark claims, search Google Scholar, Semantic Scholar, PubMed, or Sci-Hub, and extract structured data from scientific PDFs.
Run a tool's or paper's benchmark harness and record what could and could not be verified.
Before starting:

- Run triage-tool or triage-paper first.
- Check benchmarks/sources/ for an existing repro — if one exists, report it and offer to re-run instead.
- If the harness cannot be run, mark the outcome unverified and document the blocker.

If the benchmark harness or reported figures need to be looked up from the original paper, prefer these MCPs over WebFetch:
```json
{
  "mcpServers": {
    "semantic-scholar": {
      "type": "stdio",
      "command": "uvx",
      "args": ["semantic-scholar-fastmcp"]
    },
    "google-scholar": {
      "type": "stdio",
      "command": "uvx",
      "args": ["google_scholar_mcp_server"]
    }
  }
}
```

See triage-paper for full setup guidance.
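Both servers above are launched through `uvx`, so a quick sanity check before wiring them in can save a confusing MCP startup failure. A minimal sketch (the install hint is illustrative):

```shell
# Check whether uvx (part of the uv toolchain) is on PATH before configuring the MCP servers
if command -v uvx >/dev/null 2>&1; then
  msg="uvx found: $(command -v uvx)"
else
  msg="uvx not found; install uv first"
fi
echo "$msg"
```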
Reproduction is a verification exercise, not a sales pitch.
Target a specific tool or paper by slug (e.g. context-mode, jiang-llmlingua):

1. Read the reference file (references/<slug>.md) and analysis file (analysis/ANALYSIS-<slug>.md) if they exist.
2. Check benchmarks/sources/ for an existing <slug>-repro.md.
3. Locate the vendored harness in tools/<slug>/ and install its dependencies (e.g. npm install in the submodule).
4. Run the benchmark. If errors prevent completion, set the outcome to failed.
5. Compare actual output against reported figures. Categorise each metric using the outcome values below, and label figures taken only from the paper "(as reported)". Never blend reported and verified numbers without distinguishing them.
Read `assets/templates/REPRO-benchmark.yaml` to get the required frontmatter fields and section structure. Create `benchmarks/sources/<slug>-repro.md` with a YAML frontmatter block (all `required_fields` from the template) followed by the required sections.
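As a sketch of the shape such a file takes (the frontmatter fields below are illustrative only; the authoritative list is the template's `required_fields`):

```shell
# Sketch: scaffold a repro file with a YAML frontmatter block, then real sections.
# Field names here are placeholders, not the template's actual schema.
slug="example-tool"
cat > "/tmp/${slug}-repro.md" <<EOF
---
slug: ${slug}
date: 2025-01-01
outcome: unverified
---

## Reproduction attempt

Harness present but not run (missing infrastructure).
EOF
head -n 5 "/tmp/${slug}-repro.md"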
Set outcome to one of:
| Value | Meaning |
|---|---|
verified | All reported figures reproduced within tolerance |
partially-verified | Some figures reproduced; others could not be confirmed |
unverified | Harness exists but was not run (missing infra, etc.) |
failed | Harness exists but errors prevented completion |
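A guard like the following (a hypothetical helper, not part of the repo's validator) keeps `outcome` within the four allowed values:

```shell
# Reject any outcome value outside the allowed set
outcome="partially-verified"
case "$outcome" in
  verified|partially-verified|unverified|failed)
    echo "ok: $outcome" ;;
  *)
    echo "invalid outcome: $outcome" >&2
    exit 1 ;;
esac
# prints "ok: partially-verified"
```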
If an ANALYSIS file exists and Stage 2.2 ("Independent verification") does not yet reference the repro file, add a pointer:
See [`benchmarks/sources/<slug>-repro.md`](../benchmarks/sources/<slug>-repro.md) — outcome: <outcome>.

Run the validator and summarise findings:
```shell
./scripts/validate-repro-benchmark.sh benchmarks/sources/<slug>-repro.md
```

Report: what was verified, what was not, and any caveats about fixture quality or environment.
```shell
# Check for existing repro
ls benchmarks/sources/<slug>-repro.md 2>/dev/null || echo "not found"

# Install harness deps (vendored submodule)
cd tools/<slug> && npm install

# Run benchmark and capture output
npm run benchmark 2>&1 | tee /tmp/<slug>-benchmark-out.txt

# Validate the completed repro file
./scripts/validate-repro-benchmark.sh benchmarks/sources/<slug>-repro.md
```

WHY: Files without frontmatter fail schema validation and break indexing tools that rely on structured metadata.
BAD Start the file with `# <slug> — Benchmark Reproduction` followed by bold-text fields. → GOOD Open with a `---` YAML frontmatter block containing all required fields before any prose.
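The GOOD form is cheap to check mechanically. A minimal sketch using a temp file:

```shell
# A file passes this check only if its very first line opens a YAML frontmatter block
f="$(mktemp)"
printf -- '---\nslug: example\noutcome: unverified\n---\n# example — Benchmark Reproduction\n' > "$f"
if [ "$(head -n 1 "$f")" = "---" ]; then
  echo "frontmatter present"
else
  echo "missing frontmatter"
fi
# prints "frontmatter present"
```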
WHY: Mixing unverified claims with observed results corrupts the research record.
BAD "Achieves 96% savings." when the harness was not run. → GOOD "Reports 96% savings (as reported, BENCHMARK.md). Verified: 100% on 4 of 21 scenarios; remainder not run."
WHY: A silent failure looks like a clean run. Errors are data.
BAD Redirect stderr to /dev/null or omit error output from the repro file. → GOOD Quote the error verbatim in the Reproduction attempt section and set outcome to failed or partially-verified.
WHY: Re-running an expensive benchmark without checking for a prior run wastes time and creates conflicting records.
BAD Write a new repro file without checking benchmarks/sources/. → GOOD Check first; if a file exists, report it and ask whether to re-run.
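The check-first rule can be sketched as follows (paths sit under a temp dir here purely for illustration; the real check targets benchmarks/sources/):

```shell
# Refuse to silently overwrite an existing repro file; surface it instead
base="$(mktemp -d)"
mkdir -p "$base/benchmarks/sources"
repro="$base/benchmarks/sources/example-tool-repro.md"
if [ -e "$repro" ]; then
  echo "existing repro found: $repro — ask before re-running"
else
  echo "no existing repro; safe to create $repro"
fi
```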
google-scholar-search
pubmed-search
reproduce-benchmark
sci-data-extractor
sci-hub-search
semantic-scholar-search
triage-paper
triage-tool