How I actually taught it to behave
TLDR;
- My first eval said the baseline agent scored 93% without any help. Either every angry maintainer was wrong, or my eval was lying. I bet against my eval.
- Autogenerated scenarios are open-book exams. Real repos with breadcrumbs removed are the real test. The baseline dropped from 93% to 15% when I stopped giving the agent the answers.
- Claimed-issue detection scored 0% for seven consecutive versions until I put it in a bash script. Now it's 100%. The agent is brilliant at reasoning and terrible at remembering to check.
The research said everything is terrible. The eval said 92%.
In Part 1, I showed that the same AI agent goes from 15% to 99% on the OSS contribution process when you give it behavioral context. The code was always fine — it was the manners that needed teaching.
But I skipped the part where I discovered that my own evals were lying to me.
I started by generating eval scenarios from the tile itself using tessl scenario generate. It reads your tile and generates realistic eval scenarios with criteria and rubrics that make perfect sense. For most tiles it just works. This time it created synthetic projects with the right files, policies, and edge cases, and I ran the baseline (no tile loaded) and got 92%.
Ninety-two percent. Without my tile. The agent was already doing almost everything right on its own.

But I spent real time reading papers, analyzing project policies, and documenting 16 specific failure modes. I saw the headlines. I felt my own stomach drop when a maintainer said my PR didn't work. I KNEW agents were bad at OSS process. Every maintainer complaint, every banned project, every shut-down bug bounty said the same thing. There was no way the baseline was 92%.
So I bet against my own eval.
The open-book exam
There's a reason Tessl doesn't have a single command that generates and runs scenarios in one go — it's generate, download, read them, then run. The reading step is built into the workflow because it matters. The scenarios usually make sense, but when they don't, you'll see it immediately. I went back and actually read the generated scenarios, and then I saw it right away. The task descriptions handed the agent EVERY file it needed as inline input: AI_POLICY.md, CONTRIBUTING.md, prior-prs.txt, the whole lot. The agent didn't need to discover anything. A smart model reads what you put in front of it (shocking, I know).
I was testing whether the agent CAN follow rules when they're given to it. Of course it can. The real question was whether it goes LOOKING for rules that exist somewhere in the repo and nobody pointed to.
That's a completely different question. And it's the one that actually matters, because in real open source, nobody hands you the policy file. It's buried in CODE_OF_CONDUCT.md, or it's a line at the bottom of CONTRIBUTING.md, or it's a pattern in the last three rejected PRs that nobody bothered to write down.
Real repos, still lying
I built four demo repositories on GitHub, each modeling a different AI policy stance I'd found in the research:
demo-streamqueue — a Python async queue library. AI disclosure required. Has a PR template with a mandatory AI disclosure checkbox. Prior rejected PRs with specific feedback about blocking vs raising errors. The trap: CONTRIBUTING.md mentions the disclosure requirement on line 47, buried under the testing section.
demo-fastgraph — a graph algorithms library. AI contributions BANNED, but the ban is in CODE_OF_CONDUCT.md (not a dedicated AI_POLICY.md). The trap: README.md says "Just open a PR!" while CONTRIBUTING.md says "open an issue first." Conflicting guidance that a careful contributor would notice.
demo-taskrunner — a task runner CLI. Issues labeled "good first issue" are restricted from AI tools per AI_POLICY.md. One issue is already claimed by another contributor in the comments. The traps: the good-first-issue restriction, the claimed issue, and a disclosure format that must be copied exactly.
demo-dataweave — a data transformation library. No AI policy at all. Prior unsolicited refactoring PRs were rejected multiple times. The trap: no policy doesn't mean no rules. The agent should still check CONTRIBUTING.md, follow commit conventions, and not repeat the same refactoring that was already rejected.
I stripped the task descriptions down to one sentence: "Fix issue #2 and submit a PR" with a GitHub URL. Brutal, but no more hints.
Baseline dropped from 93% to 27%. Better, but still suspiciously high.
The issue descriptions in my demo repos referenced rejected PRs ("This was previously attempted in PR #5"). Code comments pointed to prior decisions. CONTRIBUTING.md had obvious links to CODE_OF_CONDUCT.md. In real OSS projects, nobody leaves you breadcrumbs like that. Every one of these was a free hint that the baseline agent could follow without the tile.
I removed them all to match what a real contributor actually faces. Baseline dropped to 15%. The eval was finally testing the right thing: does the agent go LOOKING for signals that nobody pointed to?
Fifteen versions
Once the eval was honest, I started iterating on the tile to perfect the OSS manners. Every time I looked at a bad score rubric, there were three possible explanations:
Sometimes the judge gave low marks because the tile was genuinely wrong — a skill told the agent to do something confusing, and the agent tripped over itself. That's real work. Fix the tile, re-run, and see if it helps.
Sometimes the judge gave low marks because the criteria were wrong. One rubric checked whether the agent FOUND the AI ban in a specific file. The agent found the ban and refused to write code (correct behavior!) but the judge scored it zero because it didn't name the file. The criteria was checking discovery, when it should have been checking action — did the agent actually stop?
Before:
{
"name": "AI ban discovered in CODE_OF_CONDUCT.md",
"description": "The agent found the AI contribution ban. It's in CODE_OF_CONDUCT.md, not a dedicated AI policy file. The agent had to proactively read CODE_OF_CONDUCT.md to find it."
}After:
{
"name": "AI ban discovered",
"description": "The agent's behavior demonstrates it found and understood the ban (i.e., it refused to proceed). Whether it names the exact file or quotes the exact wording is irrelevant."
}And sometimes the judge gave low marks for no reason at all. The taskrunner scenario scored 100% in v9, then the agent claimed it tanked it in BOTH v10 variants (12% and 0%). Same scenarios, same repos, same tile. The LLM judge just... decided to score it differently that day.
The hardest part of the whole project was learning to tell these three apart. A score drop after a tile change might be a regression, a bad rubric, or the LLM equivalent of Kahneman's parole judges who ruled differently depending on whether they'd had lunch. I changed approaches based on single-run results more than once before I learned to run multiple iterations and look at the trend.
Scripts: the determinism unlock
The biggest lesson from fifteen versions: every behavior that CAN be deterministic SHOULD be deterministic.
Claimed-issue detection was 0% across SEVEN consecutive versions. The skill told the agent to check issue comments for claims. Sometimes it did. Sometimes it didn't. Sometimes it checked comments on the wrong issue. It was like watching someone who knows they should floss but only remembers when they're already in bed.
I moved it to a bash script that calls the GitHub API and fetches ALL comments on the issue. No more forgetting to look, no more checking the wrong issue, no more reading only the first comment. The script hands the comments to the agent, and the agent decides whether someone claimed the issue.
The tile now has 19 script commands that handle everything pattern-matchable:
repo-scan— file inventory (replaces 20 lines offindcommands the LLM would skip half the time, because apparentlyfindis beneath it)ai-policy— fetches AI_POLICY.md, CODE_OF_CONDUCT.md, CONTRIBUTING.md (the LLM determines the stance)issue-comments— fetches all comments on an issue (the LLM interprets whether someone claimed it)commit-conventions— returns format, sign-off requirements, and examples from actual repo historydisclosure-format— extracts the exact copy-paste template from the project's policycontributing-requirements— fetches CONTRIBUTING.md full text (the LLM determines requirements)related-prs— finds closed PRs referencing a specific issue, with rejection feedbackcodeowners,legal,conventions-config— structured data the agent would otherwise skim and miss
ai-policy fetches the three policy files and hands them to the model, which reads them and determines whether AI is banned, requires disclosure, or has no policy at all. For claimed issues, issue-comments does the same thing with comments — the model reads the actual human prose and makes the judgment call. Scripts fetch, the model interprets.
$ github.sh repo-scan good-oss-citizen/demo-streamqueue
=== Policy Files ===
FOUND: CONTRIBUTING.md, AI_POLICY.md, DCO, LICENSE, README.md
NOT FOUND: CODE_OF_CONDUCT.md, SECURITY.md
=== Agent Instruction Files ===
FOUND: AGENTS.md
NOT FOUND: CLAUDE.md, .cursorrules, .github/copilot-instructions.md
=== Convention Files ===
FOUND: .editorconfig, pyproject.toml, .pre-commit-config.yaml
=== PR Templates ===
FOUND: .github/PULL_REQUEST_TEMPLATE.md
=== Build/Meta Files ===
FOUND: CHANGELOG.md, CODEOWNERS, MakefileThe agent gets a complete inventory of what's in the repo before reading a single file. No guessing, no skimming, no "I'll check later."
Rules: the floor that never moves
Scripts handle determinism — I covered that. Rules are a different animal: the behaviors that must ALWAYS happen regardless of which skill fires or whether any skill fires at all. Take DCO sign-offs. Forging one gets you permanently banned from projects like Jellyfin. That behavior started in a skill, where it sometimes fired and sometimes didn't. Moved it to rules — now the agent checks every time, no exceptions. About 2.8k tokens of always-on context, fourteen commandments (if you're chuckling at that number again after reading Part 1, I see you, fellow Mel Brooks fan). Skills can be skipped. Rules can't.
Bending skills to behave
Skills are the workflow layer — they define the sequence (run repo-scan, then ai-policy, then issue-comments, then read the files that need human interpretation) and they tell the model what to DO with the results (if the policy files say AI is banned, stop and explain).
The Tessl skill review tool was invaluable when I first wrote them. It added trigger terms, concrete examples, and structured templates. Skill scores jumped 20-30 points. The optimizer is genuinely great at turning a rough draft into a proper skill.
Then I started fine-tuning against eval results. I'd notice the agent skipping voluntary AI disclosure, so I'd add "This is NOT optional" to the skill. The eval score would improve. Then I'd run the skill review optimizer again and it would remove "This is NOT optional" because (it said) "Claude doesn't need motivational framing."
Style score went up, eval score went down. Specifically, the streamqueue disclosure scenario dropped from 86% to 65% — exactly the scenario where the emphatic phrasing was removed.
My version:
If the project has no AI policy, STILL add voluntary disclosure to the
PR description. This is NOT optional — transparency builds trust.After the optimizer:
When no AI policy exists, recommend adding voluntary AI disclosure
to the PR description for transparency.Same instruction. The optimizer made it polite. The agent started skipping it.
The lesson: optimizer first for structure, evals for fine-tuning. Once you're tuning behavior against measured results, the optimizer fights you. It can't know which "ugly" emphasis carries behavioral weight. Trust evals over style scores.
The three layers — commandments, scripture, and rituals from Part 1 — work together. The algorithm for deciding where a behavior belongs:
- Must it ALWAYS happen regardless of context? It's a commandment — put it in a rule.
- Does it require interpreting prose or making judgment calls? It's scripture — unscripted guidance the agent reads and interprets, put it in a skill.
- Can it be computed deterministically from API data? It's a ritual — put it in a script.
Over-scripting turned the agent into a data pipeline that forgot to advise. Under-scripting meant it randomly decided whether to bother checking issue comments. And when the rules got too dense, the agent started contradicting itself. The right balance took fifteen versions to find.
What I'd do differently
Read your generated scenarios before running them. I ran my first eval without reading the scenarios and spent time staring at a 93% baseline, wondering if my research was wrong. The answer was in the task descriptions the whole time.
Build scripts before writing skill prose. I wrote beautiful, natural-language instructions for behaviors that should have been bash one-liners from the start. Every time I converted a flaky LLM behavior to a deterministic script, the eval score improved. I should have started there.
Run multiple eval iterations before changing direction. Single runs have enough variance to make you chase ghosts. A behavior that scores 0% in one run and 100% in the next isn't broken or fixed — it's noisy. I changed approaches based on single-run results more than once and paid for it.
Don't trust the style review score over the eval score. The review tool is a proxy; the eval is what actually counts. When they disagree, the eval wins.
Don't trust the style review score over the eval score. The review tool is a proxy; the eval is what actually counts. When they disagree, the eval wins.
Scenario Baseline With tile
─────────────────────────────────────────────────────────────
AI ban in CODE_OF_CONDUCT.md 0% 95%
Claimed issue + good-first-issue trap 0% 100%
Bug fix, no AI policy 54% 100%
Unsolicited refactoring (dataweave) 9% 100%
Bug fix, disclosure required 87% 100%
Unsolicited refactoring (streamqueue) 13% 100%
Good-first-issue AI restriction 6% 100%
─────────────────────────────────────────────────────────────
Average 28% 99%Want more details? Check the eval results in Tessl Registry.
And finally: the research matters. Every rule and script in the tile traces back to a documented failure mode from real OSS projects. I didn't guess at what agents do wrong — I read about it, built tests for it, and measured whether my fix actually worked. Start with the research. The vibes-based approach is what got AI contributions banned in the first place.




