```bash
tessl install github:muratcankoylan/Agent-Skills-for-Context-Engineering --skill book-sft-pipeline
```

This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
| Score | Value |
|---|---|
| Review | 71% |
| Validation | 13/16 |
| Implementation | 73% |
| Activation | 55% |
A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.
Activate this skill when the task involves fine-tuning on books, building an SFT dataset from prose, or replicating an author's voice.

Core principles:
1. Intelligent Segmentation: Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target 150-400 words per chunk, always at natural boundaries.
2. Diverse Instruction Generation: Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization; use 15+ prompt templates with 5+ system prompts.
3. Style Over Content: The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.
```
┌─────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR AGENT                        │
│  Coordinates pipeline phases, manages state, handles failures   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
       ┌────────────────┬────────┴──────┬────────────────┐
       ▼                ▼               ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │ INSTRUCTION  │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │   Chunks →   │ │   Pairs →    │
│              │ │ 150-400 words│ │   Prompts    │ │    JSONL     │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                                 │
                        ┌────────┴───────┐
                        ▼                ▼
                 ┌──────────────┐ ┌──────────────┐
                 │   TRAINING   │ │  VALIDATION  │
                 │    AGENT     │ │    AGENT     │
                 │   LoRA on    │ │ AI detector  │
                 │    Tinker    │ │ Originality  │
                 └──────────────┘ └──────────────┘
```

The extraction agent pulls text from the ePub's `<p>` tags to preserve paragraph breaks.

```python
# Extract text from ePub paragraphs
import ebooklib
from bs4 import BeautifulSoup
from ebooklib import epub


def extract_epub(path):
    """Pull plain text from every document item in the ePub, paragraph by paragraph."""
    # Note: the original snippet imported `epub2`; this version uses EbookLib,
    # which provides the same flow via read_epub / get_items_of_type.
    book = epub.read_epub(path)
    chapters = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(c for c in chapters if c)
```

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).

```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words
    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```

For an 86,000-word book, this typically yields a few hundred chunks, which become 500-1,000 training examples after prompt-variant expansion.
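A quick sanity check of the two steps so far (a sketch; `book.epub` is a placeholder path):

```python
text = extract_epub("book.epub")   # placeholder path
chunks = segment(text)
avg = sum(len(c.split()) for c in chunks) // max(len(chunks), 1)
print(f"{len(chunks)} chunks, {avg} average words per chunk")
```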
Using a single prompt template causes memorization. Diverse templates teach the underlying style.

```python
SYSTEM_PROMPTS = [
"You are an expert creative writer capable of emulating specific literary styles.",
"You are a literary writer with deep knowledge of classic prose styles.",
"You are a creative writer skilled at emulating distinctive authorial voices.",
"You write prose that captures the essence of modernist literature.",
"You are a talented writer who can channel classic American authors.",
]
PROMPT_TEMPLATES = [
"Write a passage in the style of {author}: {desc}",
"Channel {author}'s voice to write about: {desc}",
"In {author}'s distinctive prose style, describe: {desc}",
"Write this scene as {author} would have: {desc}",
"Using {author}'s repetitive technique, describe: {desc}",
"Capture the rhythm of {author} in this passage: {desc}",
"Write like {author}: {desc}",
"In the voice of {author}, write: {desc}",
"This is a literary exercise. Write like {author}: {desc}",
"Can you write in {author}'s style? {desc}",
]
```

Each chunk then gets a synthetic instruction: a short description of what happens in the excerpt, generated without quoting it.

```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.
Excerpt:
{text}
"""
# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```

Each training example pairs the synthetic instruction with the original chunk text in chat format:

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```

build_examples rotates system prompts and templates so each chunk yields several distinct examples:

```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text},
        ]})
    return examples
```
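The dataset builder then streams every example to a JSONL file. A minimal sketch, assuming `chunks` is a list of chunk objects with `.id` and `.text`, and `instructions` maps chunk ids to the generated descriptions (both names are illustrative):

```python
import json

def write_dataset(chunks, instructions, author, path="dataset.jsonl"):
    """Write one chat-format example per line, ready for SFT."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            for example in build_examples(chunk, instructions[chunk.id], author):
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
```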
"model_name": "Qwen/Qwen3-8B-Base", # Base, not instruct
"lora_rank": 32, # 352MB adapter
"learning_rate": 5e-4, # Higher for LoRA
"batch_size": 4,
"epochs": 3,
}
```

Use base (pretrained) models, not instruction-tuned versions:

```python
import tinker
from tinker import types

# Create the service client and a LoRA training client (run inside an async context).
service_client = tinker.ServiceClient()
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32,
)

for epoch in range(3):
    for batch in batches:  # batches: tokenized examples from dataset.jsonl
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```

Test with scenarios that couldn't exist in the original book:

```python
TEST_PROMPTS = [
"Write about a barista making lattes",
"Describe lovers communicating through text messages",
"Write about someone anxious about climate change",
]
```

If the model applies style markers to modern scenarios, it learned style, not content.
To check for verbatim memorization, search the training data for distinctive phrases from the model's output:

```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```

Test outputs with GPTZero, Pangram, or ZeroGPT.
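A more automated version of the same check is to look for long n-grams shared between an output and the assistant turns in the training set (a sketch; the 8-word window is an arbitrary threshold):

```python
import json

def shares_long_ngram(output, dataset_path="dataset.jsonl", n=8):
    """True if any n consecutive words of the output appear verbatim in the training data."""
    words = output.split()
    ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            assistant_text = json.loads(line)["messages"][-1]["content"]
            if any(ng in assistant_text for ng in ngrams):
                return True
    return False
```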
- Symptom: Model uses original character names in new scenarios. Cause: limited name diversity from one book. Solution: train on multiple books or add synthetic examples.
- Symptom: Outputs contain exact sentences from training data. Cause: too few prompt variations or too many epochs. Solution: use 15+ templates and limit training to 3 epochs.
- Symptom: Sentences feel incomplete. Cause: poor segmentation breaking mid-thought. Solution: always break at paragraph boundaries.

| Metric | Value |
|---|---|
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% of outputs rated perfect |

| Component | Cost |
|---|---|
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| Total | ~$2.00 |
This example applies several skills from the Agent Skills for Context Engineering collection:
The pipeline follows the staged, idempotent architecture pattern:
Each phase is resumable and produces intermediate artifacts for debugging.
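A minimal sketch of what staged and idempotent can look like here (file names and the `run_stage` helper are illustrative, not taken from the actual pipeline):

```python
import json
from pathlib import Path

STAGES = [
    ("extracted.txt", lambda _: extract_epub("book.epub")),
    ("chunks.json", lambda text: json.dumps(segment(text))),
]

def run_stage(out_file, fn, prev_artifact):
    """Reuse the artifact if it exists; otherwise compute it and persist it for debugging."""
    path = Path(out_file)
    if path.exists():  # idempotent: re-running the pipeline skips finished phases
        return path.read_text(encoding="utf-8")
    result = fn(prev_artifact)
    path.write_text(result, encoding="utf-8")
    return result

artifact = None
for out_file, fn in STAGES:
    artifact = run_stage(out_file, fn, artifact)
```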
Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.
The two-tier strategy mirrors context compression evaluation.
The pipeline uses the supervisor/orchestrator pattern.
This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.
Validation follows the end-state evaluation pattern.
The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.
Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.
Created: 2025-12-26 | Last Updated: 2025-12-28 | Author: Muratcan Koylan | Version: 2.0.0 | Standalone: Yes (separate from main context-engineering collection)