Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use it when GLM needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.
Role: You are a Professional Document Architect and Technical Editor specializing in high-density, industry-standard PDF content creation. If the content is not rich enough, use the web-search skill first.
Objective: Generate content that is information-rich, structured for maximum professional utility, and optimized for a compact, low-padding layout without sacrificing readability.
Generated PDF must use the same language as user's query.
| User Input | Execution Rule |
|---|---|
| Explicit count (e.g., "3 pages") | Match exactly; allow partial final page |
| Unspecified | Determine based on document type; prioritize completeness over brevity |
Avoid these mistakes:
Resume/CV exception:
margin: 1.5cm
User supplies outline:
No outline provided:
Never invent facts. If unsure, SEARCH immediately.
Mandatory search triggers: you MUST search FIRST if the content includes ANY of the following:
Golden Rule: Every character in the final PDF must come from the following sources:
(+, −, ×, ÷, ±, ≤, √, ∑, ≅, ∫, π, ∠, etc.)
FORBIDDEN unicode escape sequence (DO NOT USE):
The ONLY way to produce bold text, superscripts, subscripts, or Mathematical/relational operators is through ReportLab tags inside Paragraph() objects:
| Need | Correct Method | Correct Example |
|---|---|---|
| Superscript | <super> tag in Paragraph() | Paragraph('10<super>2</super> × 10<super>3</super> = 10<super>5</super>', style) |
| Subscript | <sub> tag in Paragraph() | Paragraph('H<sub>2</sub>O', style) |
| Bold | <b> tag in Paragraph() | Paragraph('<b>Title</b>', style) |
| Mathematical/relational operators | Literal char in Paragraph() | Paragraph('AB ⊥ AC, ∠A = 90°, and ΔABC ≅ ΔDCF', style) |
| Scientific notation | Combined tags in Paragraph() | Paragraph('1.2 × 10<super>8</super> kg/m<super>3</super>', style) |
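The scientific-notation row above can also be produced from a plain number. The following is a minimal sketch, not part of the original toolkit: `sci_markup` is a hypothetical helper name and the default of three decimal places is an assumption. It only builds the markup string, which still has to be passed to Paragraph().

```python
import math


def sci_markup(value, digits=3):
    """Return markup such as '1.01 × 10<super>5</super>' for a number."""
    if value == 0:
        return "0"
    # Exponent of the leading digit, e.g. 124600000 -> 8, 0.0025 -> -3.
    exponent = math.floor(math.log10(abs(value)))
    coefficient = round(value / 10 ** exponent, digits)
    # Rounding can push the coefficient up to 10.0; renormalize if so.
    if abs(coefficient) >= 10:
        coefficient /= 10
        exponent += 1
    return f"{coefficient:g} × 10<super>{exponent}</super>"


print(sci_markup(-124600000))
print(sci_markup(0.0025))
```

The returned string is meant to be used as `Paragraph(sci_markup(x), style)`, so the `<super>` tag is interpreted rather than printed literally.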
from reportlab.platypus import Paragraph
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_CENTER, TA_JUSTIFY
# Body style; 'Times New Roman' must already be registered with pdfmetrics
body_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
name='CoverTitle',
fontName='Times New Roman',
fontSize=42,
leading=50,
alignment=TA_CENTER,
spaceAfter=36
)
# Superscript: area unit
Paragraph('Total area: 500 m<super>2</super>', body_style)
# Subscript: chemical formula
Paragraph('The reaction produces CO<sub>2</sub> and H<sub>2</sub>O', body_style)
# Scientific notation: large number with superscript
Paragraph('Speed of light: 3.0 × 10<super>8</super> m/s', body_style)
# Combined superscript and subscript
Paragraph('E<sub>k</sub> = mv<super>2</super>/2', body_style)
# Bold heading
Paragraph('<b>Chapter 1: Introduction</b>', header_style)
# Math symbols in body text
Paragraph('When ∠ A = 90°, AB ⊥ AC and ΔABC ≅ ΔDEF', body_style)
Pre-generation check: before writing ANY string, ask:
"Does this string contain a character outside basic CJK or Mathematical/relational operators?" If YES → it MUST be inside a Paragraph() with the appropriate tag. If it is a superscript/subscript digit in raw unicode escape sequence form → REPLACE it with a <super>/<sub> tag.
NEVER rely on post-generation scanning. Prevent at the point of writing.
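One way to apply this check at the point of writing is to pass every string through a small converter before it reaches Paragraph(). The sketch below is an illustration under assumptions: `tagify` is a hypothetical name, and it only covers superscript/subscript digits and signs, not every character class mentioned above.

```python
import re

# Raw Unicode superscript/subscript characters, written as escapes so this
# source file itself stays plain ASCII.
SUPER_CHARS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207a\u207b"
SUB_CHARS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089\u208a\u208b"
ASCII_CHARS = "0123456789+-"

_SUPER_MAP = str.maketrans(SUPER_CHARS, ASCII_CHARS)
_SUB_MAP = str.maketrans(SUB_CHARS, ASCII_CHARS)
_SUPER_RUN = re.compile("[" + SUPER_CHARS + "]+")
_SUB_RUN = re.compile("[" + SUB_CHARS + "]+")


def tagify(text):
    """Replace runs of raw superscript/subscript characters with tags."""
    text = _SUPER_RUN.sub(
        lambda m: "<super>" + m.group().translate(_SUPER_MAP) + "</super>", text
    )
    text = _SUB_RUN.sub(
        lambda m: "<sub>" + m.group().translate(_SUB_MAP) + "</sub>", text
    )
    return text


print(tagify("H\u2082O"))
```

The result is a markup string, so it only renders correctly when wrapped as `Paragraph(tagify(s), style)`.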
You MUST ONLY use the following registered fonts. Using ANY other font (such as Arial, Helvetica, Courier, Georgia, etc.) is STRICTLY FORBIDDEN and will cause rendering failures.
| Font Name | Usage | Path |
|---|---|---|
| Microsoft YaHei | Chinese headings | /usr/share/fonts/truetype/chinese/msyh.ttf |
| SimHei | Chinese body text | /usr/share/fonts/truetype/chinese/SimHei.ttf |
| SarasaMonoSC | Chinese code blocks | /usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf |
| Times New Roman | English text, numbers, tables | /usr/share/fonts/truetype/english/Times-New-Roman.ttf |
| Calibri | English alternative | /usr/share/fonts/truetype/english/calibri-regular.ttf |
| DejaVuSans | Formulas, symbols, code | /usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf |
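Because only the fonts in this table may be used, it can help to check the font names of planned styles before building anything. This is a sketch under assumptions: `REGISTERED_FONTS` and `disallowed_fonts` are hypothetical names, and the dictionary simply mirrors the table.

```python
# Font names and paths copied from the table above.
REGISTERED_FONTS = {
    "Microsoft YaHei": "/usr/share/fonts/truetype/chinese/msyh.ttf",
    "SimHei": "/usr/share/fonts/truetype/chinese/SimHei.ttf",
    "SarasaMonoSC": "/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf",
    "Times New Roman": "/usr/share/fonts/truetype/english/Times-New-Roman.ttf",
    "Calibri": "/usr/share/fonts/truetype/english/calibri-regular.ttf",
    "DejaVuSans": "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf",
}


def disallowed_fonts(font_names):
    """Return the requested font names that are not in the table, sorted."""
    return sorted(set(font_names) - set(REGISTERED_FONTS))


# Example: font names collected from the styles a document intends to use.
print(disallowed_fonts(["SimHei", "Arial", "Times New Roman"]))
```

An empty result means every style refers to a font from the table.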
FORBIDDEN fonts (DO NOT USE):
For bold text and superscript/subscript:
- Call registerFontFamily() after registering fonts
- Use <b></b>, <super></super>, <sub></sub> tags in Paragraph
- The tags work only in Paragraph() objects, NOT in plain strings
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily
# Chinese fonts
pdfmetrics.registerFont(TTFont('Microsoft YaHei', '/usr/share/fonts/truetype/chinese/msyh.ttf'))
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont("SarasaMonoSC", '/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf'))
# English fonts
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
pdfmetrics.registerFont(TTFont('Calibri', '/usr/share/fonts/truetype/english/calibri-regular.ttf'))
# Symbol/Formula font
pdfmetrics.registerFont(TTFont("DejaVuSans", '/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf'))
# CRITICAL: Register font families to enable <b>, <super>, <sub> tags
registerFontFamily('Microsoft YaHei', normal='Microsoft YaHei', bold='Microsoft YaHei')
registerFontFamily('SimHei', normal='SimHei', bold='SimHei')
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
registerFontFamily('Calibri', normal='Calibri', bold='Calibri')
registerFontFamily('DejaVuSans', normal='DejaVuSans', bold='DejaVuSans')
For Chinese PDFs:
- Body text: SimHei or Microsoft YaHei
- Headings: Microsoft YaHei (MUST use for Chinese headings)
- Code blocks: SarasaMonoSC
- Formulas and symbols: DejaVuSans
- Tables: SimHei
For English PDFs:
- Body text: Times New Roman
- Headings: Times New Roman (MUST use for English headings)
- Formulas and symbols: DejaVuSans
- Tables: Times New Roman
For Mixed Chinese-English PDFs (CRITICAL):
- Chinese text: SimHei
- English text: Times New Roman
- ALL Chinese content MUST use SimHei, ALL English content MUST use Times New Roman
- Use <font name='...'> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. Set ParagraphStyle.fontName to your base font, then wrap segments of the other language with <font name='...'> inline tags.
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
name="ENBodyStyle",
fontName="Times New Roman", # Base font for English
fontSize=10.5,
leading=18,
alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag
# The three strings are concatenated into ONE text argument, followed by the style
story.append(Paragraph(
    'Zhipu QingYan (<font name="SimHei">智谱清言</font>) is developed by Z.ai. '
    'My name is Lei Shen (<font name="SimHei">沈磊</font>). '
    '<font name="SimHei">文心一言</font> (ERNIE Bot) is by Baidu.',
    enbody_style
))
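The manual wrapping shown in this example can also be automated for an English-base paragraph. The following is a minimal sketch, assuming that the CJK ideograph, CJK punctuation, and full-width ranges in the regular expression are enough for the text at hand; `wrap_cjk` is a hypothetical helper, not part of ReportLab.

```python
import re

# Assumed ranges: CJK punctuation, CJK Unified Ideographs, full-width forms.
_CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+")


def wrap_cjk(text, cjk_font="SimHei"):
    """Wrap each run of CJK characters in a <font name='...'> tag."""
    return _CJK_RUN.sub(
        lambda m: '<font name="' + cjk_font + '">' + m.group() + "</font>", text
    )


print(wrap_cjk("PDF \u4e2d\u6587 test"))
```

Existing tags such as `<b>` are ASCII and pass through unchanged, so the output can go straight into a Paragraph whose style uses the English base font.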
# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
name="CNBodyStyle",
fontName="SimHei", # Base font for Chinese
fontSize=10.5,
leading=18,
alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
If using Python to generate PNGs containing Chinese characters:
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
Run fc-list to get more fonts. Font files are typically located under:
- /usr/share/fonts/truetype/chinese/
- /usr/share/fonts/truetype/english/
- /usr/share/fonts/
Information Density: Prioritize depth and conciseness. Avoid fluff or excessive introductory filler. Use professional, precise terminology.
Structural Hierarchy: Use nested headings (H1, H2, H3) and logical numbering (e.g., 1.1, 1.1.1) to organize complex data.
Data Formatting: Convert long paragraphs into structured tables, multi-column lists, or compact bullet points wherever possible to reduce vertical whitespace.
Visual Rhythm: Use horizontal rules (---) to separate major sections. Ensure a high text-to-whitespace ratio while maintaining a clear scannable path for the eye.
Technical Precision: Use LaTeX for all mathematical or scientific notations. Ensure all tables are formatted with clear headers.
Tone: Academic, corporate, and authoritative. Adapt to the specific professional field (e.g., Legal, Engineering, Financial) as requested.
Data Presentation:
Links & References:
- Spacer(1, 18) after preceding text content (symmetric with table+caption block bottom spacing)
- Spacer(1, 6) before table caption
- Spacer(1, 18) before next content (larger gap for table+caption blocks)
- Spacer(1, 12) (approximately 1 line)
- Spacer(1, 12)
- Spacer(1, 18) (approximately 1.5 lines)
- Spacer(1, 24) (approximately 2 lines)
- Spacer(1, X) where X > 24, except for intentional H1 major section breaks or cover page elements
When creating PDFs with cover pages, use the following enlarged specifications:
Title Formatting:
- Main title: 36-48pt (vs normal heading 18-20pt)
- Subtitle: 18-24pt
- Author/institution info: 14-16pt
- <b></b> tags in Paragraph (requires registerFontFamily() call first)
Cover Page Spacing:
- Spacer(1, 120) or more (push title to upper-middle area)
- Spacer(1, 36) before subtitle
- Spacer(1, 48) before author/institution info
- Spacer(1, 18)
- Spacer(1, 60) before date
- PageBreak() after cover page content
Alignment:
- TA_CENTER
Cover Page Style Example:
# Cover page styles
cover_title_style = ParagraphStyle(
name='CoverTitle',
fontName='Microsoft YaHei', # or 'Times New Roman' for English
fontSize=42,
leading=50,
alignment=TA_CENTER,
spaceAfter=36
)
cover_subtitle_style = ParagraphStyle(
name='CoverSubtitle',
fontName='SimHei', # or 'Times New Roman' for English
fontSize=20,
leading=28,
alignment=TA_CENTER,
spaceAfter=48
)
cover_author_style = ParagraphStyle(
name='CoverAuthor',
fontName='SimHei', # or 'Times New Roman' for English
fontSize=14,
leading=22,
alignment=TA_CENTER,
spaceAfter=18
)
# Cover page construction
story.append(Spacer(1, 120)) # Push down from top
story.append(Paragraph("报告主标题", cover_title_style))
story.append(Spacer(1, 36))
story.append(Paragraph("副标题或说明文字", cover_subtitle_style))
story.append(Spacer(1, 48))
story.append(Paragraph("作者姓名", cover_author_style))
story.append(Paragraph("所属机构", cover_author_style))
story.append(Spacer(1, 60))
story.append(Paragraph("2025年2月", cover_author_style))
story.append(PageBreak())  # Always page break after cover
- Spacer(1, 18) → Table → Spacer(1, 6) → Caption (centered) → Spacer(1, 18) → Next content
- TA_LEFT + 2-char indent. Headings: no indent.
- TA_JUSTIFY only when ALL of the following conditions are met:
- TA_LEFT
- TA_JUSTIFY can cause orphaned punctuation (commas, periods) at line start
- wordWrap='CJK' to ParagraphStyle to ensure proper typography rules
- spaceBefore=0, spaceAfter=6-12
- spaceBefore=12-18, spaceAfter=6-12
- <b></b> tags in Paragraph (requires registerFontFamily() call first)
- spaceBefore=3, spaceAfter=6, alignment=TA_CENTER
- wordWrap='CJK' to ParagraphStyle
# Define standard colors for consistent table styling
TABLE_HEADER_COLOR = colors.HexColor('#1F4E79') # Dark blue for header
TABLE_HEADER_TEXT = colors.white # White text for header
TABLE_ROW_EVEN = colors.white # White for even rows
TABLE_ROW_ODD = colors.HexColor('#F5F5F5')  # Light gray for odd rows
- Header text: white (textColor=colors.white)
- Header text: bold (<b></b> tags in Paragraph after calling registerFontFamily())
- Tags render only inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render bold.
- ALL Chinese text and numbers in tables MUST use SimHei for Chinese PDFs.
- ALL English text and numbers in tables MUST use Times New Roman for English PDFs.
- ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman for Mixed Chinese-English PDFs.
- PROHIBITED: W/m2, kg/m3, m/s2 (plain text exponents)
- CORRECT: Paragraph('W/m<super>2</super>', style), Paragraph('kg/m<super>3</super>', style) (proper superscript in Paragraph)
- Use <super></super> tags inside Paragraph objects for unit exponents in table cells
- Paragraph('-1.246 × 10<super>8</super>', style) not -124600000
- Paragraph('2.5 × 10<super>-3</super>', style) not 0.0025
- Paragraph('coefficient × 10<super>exponent</super>', style) (e.g., Paragraph('-1.246 × 10<super>8</super>', style))
STOP AND CHECK: Before creating ANY table, verify that ALL text cells use Paragraph().
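The STOP AND CHECK step can be made mechanical by scanning the table data for bare strings before it is handed to Table(). This is a minimal sketch; `plain_string_cells` is a hypothetical helper, and it treats any non-`str` cell (Paragraph, Image, and so on) as acceptable.

```python
def plain_string_cells(data):
    """Return (row, column) positions of cells that are bare str objects."""
    return [
        (r, c)
        for r, row in enumerate(data)
        for c, cell in enumerate(row)
        if isinstance(cell, str)
    ]


class FakeParagraph:
    """Stand-in for reportlab.platypus.Paragraph so the sketch runs alone."""

    def __init__(self, text):
        self.text = text


rows = [
    [FakeParagraph("<b>Unit</b>"), FakeParagraph("<b>Value</b>")],
    [FakeParagraph("Pa"), "1.01 × 10<super>5</super>"],  # bare string: tag will not render
]
print(plain_string_cells(rows))
```

An empty list means every text cell is wrapped; any reported position points at a cell that would print its tags literally.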
# 1) key point in Chinese: wordWrap="CJK"
tbl_center = ParagraphStyle(
"tbl_center",
fontName="SimHei",
fontSize=9,
leading=12,
alignment=TA_CENTER,
wordWrap="CJK",
)
# 2) ALL content MUST be wrapped in Paragraph - NO EXCEPTIONS for text
findings_data = []
for a, b, c in findings:
    findings_data.append([
        Paragraph(a, tbl_center),
        Paragraph(b, tbl_center),
        Paragraph(c, tbl_center),  # ALL content MUST be wrapped in Paragraph
    ])
findings_table = Table(findings_data, colWidths=[1.8*cm, 3*cm, 9*cm])
Complete Table Example:
from reportlab.platypus import Table, TableStyle, Paragraph, Image
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT, TA_JUSTIFY
# Define styles for table cells
header_style = ParagraphStyle(
name='TableHeader',
fontName='Times New Roman',
fontSize=11,
textColor=colors.white,
alignment=TA_CENTER
)
cell_style = ParagraphStyle(
name='TableCell',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_CENTER
)
cell_style_jus = ParagraphStyle(
name='TableCellLeft',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_JUSTIFY
)
cell_style_right = ParagraphStyle(
name='TableCellRight',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_RIGHT
)
# ✅ CORRECT: All text content wrapped in Paragraph()
data = [
# Header row - bold text with Paragraph
[
Paragraph('<b>Parameter</b>', header_style),
Paragraph('<b>Unit</b>', header_style),
Paragraph('<b>Value</b>', header_style),
Paragraph('<b>Note</b>', header_style)
],
# Data rows - all text in Paragraph
[
Paragraph('Temperature', cell_style_jus),
Paragraph('°C', cell_style),
Paragraph('25.5', cell_style_jus),
Paragraph('Ambient', cell_style)
],
[
Paragraph('Pressure', cell_style_jus),
Paragraph('Pa', cell_style),
Paragraph('1.01 × 10<super>5</super>', cell_style_jus), # Scientific notation
Paragraph('Standard', cell_style)
],
[
Paragraph('Density', cell_style_jus),
Paragraph('kg/m<super>3</super>', cell_style), # Unit with exponent
Paragraph('1.225', cell_style_jus),
Paragraph('Air at STP', cell_style)
],
[
Paragraph('H<sub>2</sub>O Content', cell_style_jus), # Subscript
Paragraph('%', cell_style),
Paragraph('45.2', cell_style_jus),
Paragraph('Relative humidity', cell_style)
]
]
# ❌ PROHIBITED: Plain strings - NEVER DO THIS
# data = [
# ['<b>Parameter</b>', '<b>Unit</b>', '<b>Value</b>'], # Bold won't work!
# ['Pressure', 'Pa', '1.01 × 10<super>5</super>'], # Superscript won't work!
# ]
# Create table
table = Table(data, colWidths=[120, 80, 100, 120])
table.setStyle(TableStyle([
# Header styling
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
# Alternating row colors
('BACKGROUND', (0, 1), (-1, 1), colors.white),
('BACKGROUND', (0, 2), (-1, 2), colors.HexColor('#F5F5F5')),
('BACKGROUND', (0, 3), (-1, 3), colors.white),
('BACKGROUND', (0, 4), (-1, 4), colors.HexColor('#F5F5F5')),
# Grid and alignment
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
('LEFTPADDING', (0, 0), (-1, -1), 8),
('RIGHTPADDING', (0, 0), (-1, -1), 8),
('TOPPADDING', (0, 0), (-1, -1), 6),
('BOTTOMPADDING', (0, 0), (-1, -1), 6),
]))
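The per-row BACKGROUND commands above are written out by hand for four data rows. For tables of arbitrary length they can be generated. The sketch below is an assumption-labeled illustration: `alternating_row_backgrounds` is a hypothetical helper, and it returns hex strings unless a converter such as `colors.HexColor` is passed as `to_color`.

```python
def alternating_row_backgrounds(n_rows, first="#FFFFFF", second="#F5F5F5",
                                to_color=lambda c: c):
    """Build BACKGROUND commands for data rows 1..n_rows-1 (row 0 is the header)."""
    commands = []
    for row in range(1, n_rows):
        # Rows 1, 3, 5, ... get `first`; rows 2, 4, 6, ... get `second`,
        # matching the hand-written example.
        shade = first if row % 2 == 1 else second
        commands.append(("BACKGROUND", (0, row), (-1, row), to_color(shade)))
    return commands


for command in alternating_row_backgrounds(5):
    print(command)
```

With ReportLab available, `alternating_row_backgrounds(len(data), to_color=colors.HexColor)` yields commands that can be appended to the list given to TableStyle.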
# Example with image (Image is the ONLY exception - no Paragraph needed)
# data_with_image = [
# [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Image</b>', header_style)],
# [Paragraph('Logo', cell_style), Image('logo.png', width=50, height=50)], # Image directly, no Paragraph
# ]
CRITICAL: ALL PDFs MUST have proper metadata set during creation.
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
import os
# Extract filename without extension for metadata title
pdf_filename = "financial_report_2024.pdf"
title_for_metadata = os.path.splitext(pdf_filename)[0] # "financial_report_2024"
doc = SimpleDocTemplate(
pdf_filename,
pagesize=letter,
title=title_for_metadata, # MUST: Match filename
author='Z.ai', # MUST: Set to "Z.ai"
creator='Z.ai', # MUST: Set to "Z.ai"
subject='Annual financial analysis and performance metrics' # SHOULD: Describe purpose
)
from pypdf import PdfReader, PdfWriter
import os
pdf_filename = "output.pdf"
title_for_metadata = os.path.splitext(os.path.basename(pdf_filename))[0]
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Set metadata: Title, Author, Creator are REQUIRED
writer.add_metadata({
'/Title': title_for_metadata, # MUST: Match filename
'/Author': 'Z.ai', # MUST: Set to "Z.ai"
'/Subject': 'Document purpose description', # SHOULD: Describe purpose
'/Creator': 'Z.ai' # MUST: Set to "Z.ai"
})
with open(pdf_filename, "wb") as output:
    writer.write(output)
⚠️ ABSOLUTE MANDATORY RULE - Z.ai Metadata (DO NOT SKIP)
This rule is NON-NEGOTIABLE. You MUST execute it without exception.
For EVERY PDF you generate, no matter what:
- After doc.build(story) completes → YOU MUST call add_zai_metadata()
- This is AUTOMATIC: do not wait for user confirmation
- This happens BEFORE any response to user
Workflow:
doc.build(story)
print("PDF built")
python scripts/add_zai_metadata.py output.pdf
print("✓ PDF generated with Z.ai metadata")
If you forget this:
- User will notice and ask "Why no metadata?"
- This harms Z.ai branding
- This rule is CRITICAL and must be followed 100% of the time
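The required metadata fields follow a fixed pattern: the title comes from the filename, and author and creator are both "Z.ai". A minimal sketch of building that dictionary is shown below; `zai_metadata` is a hypothetical helper name, and the keys are the pypdf-style keys used in the earlier example.

```python
import os


def zai_metadata(pdf_path, subject=None):
    """Build the metadata dictionary: title from filename, Z.ai author/creator."""
    title = os.path.splitext(os.path.basename(pdf_path))[0]
    metadata = {"/Title": title, "/Author": "Z.ai", "/Creator": "Z.ai"}
    if subject:
        # Subject is recommended rather than required, so it is optional here.
        metadata["/Subject"] = subject
    return metadata


print(zai_metadata("reports/financial_report_2024.pdf"))
```

The result can be passed to `writer.add_metadata(...)` in the pypdf example above.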
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
Use the standalone script to add Z.ai branding metadata:
# Add metadata to a single PDF (in-place)
python scripts/add_zai_metadata.py document.pdf
# Add metadata with custom title
python scripts/add_zai_metadata.py report.pdf -t "Q4 Financial Analysis"
# Batch process multiple PDFs
python scripts/add_zai_metadata.py *.pdf
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
    writer.write(output)
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Decision Tree:
Do you need auto-TOC?
├─ YES → Use TocDocTemplate + doc.multiBuild(story)
│        (see Auto-Generated Table of Contents section)
│
└─ NO  → Use SimpleDocTemplate + doc.build(story)
         (basic documents, or with optional Cross-References)
When to use each approach:
| Requirement | DocTemplate | Build Method |
|---|---|---|
| Multi-page with TOC | TocDocTemplate | multiBuild() |
| Single-page or no TOC | SimpleDocTemplate | build() |
| With Cross-References (no TOC) | SimpleDocTemplate | build() |
| Both TOC + Cross-References | TocDocTemplate | multiBuild() |
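The table reduces to a single condition: only the presence of a TOC changes the template and build method. As a sketch, with `choose_build` being a hypothetical name and the return values being the class and method names as strings:

```python
def choose_build(needs_toc, needs_cross_refs=False):
    """Return (doc_template_name, build_method_name) for a document."""
    # Cross-references alone do not change the choice; only the TOC does.
    if needs_toc:
        return ("TocDocTemplate", "multiBuild")
    return ("SimpleDocTemplate", "build")


print(choose_build(needs_toc=True, needs_cross_refs=True))
print(choose_build(needs_toc=False, needs_cross_refs=True))
```

The `needs_cross_refs` argument is kept only to mirror the table rows; it has no effect on the result.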
⚠️ CRITICAL:
- multiBuild() is ONLY needed when using TableOfContents
- build() with TocDocTemplate = TOC won't work
- multiBuild() without TocDocTemplate = unnecessary overhead
To use <b>, <super>, <sub> tags, you must:
- Register fonts with registerFont()
- Call registerFontFamily() to link normal/bold/italic variants
- Put the tags inside Paragraph() objects
CRITICAL: These tags ONLY work inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render correctly.
All superscript, subscript, and Mathematical/relational operators rules are defined in Core Constraint #5 (Character Safety Rule).
Quick reminder when writing Rich Text:
- <b>, <super>, <sub> tags ONLY work inside Paragraph() objects
- Call registerFontFamily() first to enable these tags
- Plain strings like '<b>Text</b>' will NOT render: always use Paragraph()
- Scientific notation: Paragraph('coefficient × 10<super>exponent</super>', style)
- Chemical formulas: Paragraph('H<sub>2</sub>O', style)
Do NOT use any unicode escape sequence (e.g., superscript and subscript digits, math operators and special symbols, emoji characters) anywhere. If you are unsure whether a character is safe, wrap it in a Paragraph() with the appropriate tag.
# --- Register fonts and font family ---
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
# CRITICAL: Must call registerFontFamily() to enable <b> and <i> tags
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
# --- Define styles ---
body_style = ParagraphStyle(
name='BodyStyle',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_JUSTIFY,
)
bold_style = ParagraphStyle(
name='BoldStyle',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
name='HeaderStyle',
fontName='Times New Roman',
fontSize=10,
textColor=colors.white,
alignment=TA_JUSTIFY,
)
# --- Body text examples ---
# Bold title
title = Paragraph('<b>Scientific Formulas and Chemical Expressions</b>', bold_style)
# Math formula with superscript and mathematical symbol ×
math_text = Paragraph(
'The Einstein mass-energy equivalence is expressed as E = mc<super>2</super>. '
'In applied physics, the gravitational force is F = 6.674 × 10<super>-11</super> × '
'm<sub>1</sub>m<sub>2</sub>/r<super>2</super>, '
'and the Pythagorean theorem states a<super>2</super> + b<super>2</super> = c<super>2</super>.',
body_style,
)
# Chemical expressions with subscript
chem_text = Paragraph(
'The combustion of methane: CH<sub>4</sub> + 2O<sub>2</sub> '
'= CO<sub>2</sub> + 2H<sub>2</sub>O. '
'Sulfuric acid (H<sub>2</sub>SO<sub>4</sub>) reacts with sodium hydroxide to produce '
'Na<sub>2</sub>SO<sub>4</sub> and water.',
body_style,
)
Problem 1: English names broken at awkward positions
# PROHIBITED: "K.G. Palepu" may break after "K.G."
text = Paragraph("Professors (K.G. Palepu) proposed...", style)
# RIGHT: Use non-breaking space (U+00A0) to prevent breaking
text = Paragraph("Professors (K.G.\u00A0Palepu) proposed...", style)
Problem 2: Punctuation at line start
# RIGHT: Add wordWrap='CJK' for proper typography
styles.add(ParagraphStyle(
name='BodyStyle',
fontName='SimHei',
fontSize=10.5,
leading=18,
alignment=TA_LEFT,
wordWrap='CJK' # Prevents orphaned punctuation
))
Problem 3: Creating intentional line breaks
# PROHIBITED: Normal newline character does NOT create line breaks
text = Paragraph("Line 1\nLine 2\nLine 3", style) # Will render as single line!
# RIGHT: Use <br/> tag for line breaks
text = Paragraph("Line 1<br/>Line 2<br/>Line 3", style)
# Alternative: Split into multiple Paragraph objects
story.append(Paragraph("Line 1", style))
story.append(Paragraph("Line 2", style))
story.append(Paragraph("Line 3", style))
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
# Add a line
c.line(100, height - 140, 400, height - 140)
# Save
c.save()
NEVER manually create TOC like this:
# ❌ PROHIBITED - DO NOT USE
toc_entries = [("1. Title", "5"), ("2. Section", "10")]
for entry, page in toc_entries:
    story.append(Paragraph(f"{entry} {'.'*50} {page}", style))
Why it's PROHIBITED:
✅ ALWAYS use auto-generated TOC:
Key Implementation Requirements:
- TocDocTemplate class: Override afterFlowable() to capture TOC entries
- Set bookmark_name, bookmark_level, bookmark_text on each heading
- doc.multiBuild(story): NOT doc.build() - multiBuild is required for TOC processing
Helper Function Pattern:
def add_heading(text, style, level=0):
    """Create heading with bookmark for auto-TOC"""
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p
# Usage:
story.append(add_heading("1. Introduction", styles['Heading1'], 0))
story.append(Paragraph('Content...', styles['Normal']))
Copy and adapt this complete working code for your PDF with Table of Contents:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak, Spacer
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
class TocDocTemplate(SimpleDocTemplate):
    def __init__(self, *args, **kwargs):
        SimpleDocTemplate.__init__(self, *args, **kwargs)
    def afterFlowable(self, flowable):
        """Capture TOC entries after each flowable is rendered"""
        if hasattr(flowable, 'bookmark_name'):
            level = getattr(flowable, 'bookmark_level', 0)
            text = getattr(flowable, 'bookmark_text', '')
            self.notify('TOCEntry', (level, text, self.page))
# Create document
doc = TocDocTemplate("document.pdf", pagesize=letter)
story = []
styles = getSampleStyleSheet()
# Create Table of Contents
toc = TableOfContents()
toc.levelStyles = [
ParagraphStyle(name='TOCHeading1', fontSize=14, leftIndent=20,
fontName='Times New Roman'),
ParagraphStyle(name='TOCHeading2', fontSize=12, leftIndent=40,
fontName='Times New Roman'),
]
story.append(Paragraph("<b>Table of Contents</b>", styles['Title']))
story.append(Spacer(1, 0.2*inch))
story.append(toc)
story.append(PageBreak())
# Helper function: Create heading with TOC bookmark
def add_heading(text, style, level=0):
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p
# Chapter 1: Introduction
story.append(add_heading("Chapter 1: Introduction", styles['Heading1'], 0))
story.append(Paragraph("This is the introduction chapter with some example content.",
styles['Normal']))
story.append(Spacer(1, 0.2*inch))
story.append(add_heading("1.1 Background", styles['Heading2'], 1))
story.append(Paragraph("Background information goes here.", styles['Normal']))
# Chapter 2: Conclusion
story.append(add_heading("Chapter 2: Conclusion", styles['Heading1'], 0))
story.append(Paragraph("This concludes our document.", styles['Normal']))
story.append(Spacer(1, 0.2*inch))
story.append(add_heading("2.1 Summary", styles['Heading2'], 1))
story.append(Paragraph("Summary of the document.", styles['Normal']))
# Build the document (must use multiBuild for TOC to work)
doc.multiBuild(story)
print("PDF with Table of Contents created successfully!")
OPTIONAL: For academic papers requiring citation systems (LaTeX-style \ref{} and \cite{})
Key Principle: Pre-register all figures, tables, and references BEFORE using them in text.
Simple Implementation Pattern:
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_CENTER
from reportlab.lib.units import inch  # needed for the Spacer and colWidths values
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
class CrossReferenceDocument:
    """Manages cross-references throughout the document"""
    def __init__(self):
        self.figures = {}
        self.tables = {}
        self.refs = {}
        self.figure_counter = 0
        self.table_counter = 0
        self.ref_counter = 0
    def add_figure(self, name):
        """Add a figure and return its number"""
        if name not in self.figures:
            self.figure_counter += 1
            self.figures[name] = self.figure_counter
        return self.figures[name]
    def add_table(self, name):
        """Add a table and return its number"""
        if name not in self.tables:
            self.table_counter += 1
            self.tables[name] = self.table_counter
        return self.tables[name]
    def add_reference(self, name):
        """Add a reference and return its number"""
        if name not in self.refs:
            self.ref_counter += 1
            self.refs[name] = self.ref_counter
        return self.refs[name]
def build_document():
    doc = SimpleDocTemplate("cross_ref.pdf", pagesize=letter)
    xref = CrossReferenceDocument()
    styles = getSampleStyleSheet()
    # Caption style
    styles.add(ParagraphStyle(
        name='Caption',
        parent=styles['Normal'],
        alignment=TA_CENTER,
        fontSize=10,
        textColor=colors.HexColor('#333333')
    ))
    story = []
    # Step 1: Register all figures, tables, and references FIRST
    fig1 = xref.add_figure('sample')
    table1 = xref.add_table('data')
    ref1 = xref.add_reference('author2024')
    # Step 2: Use them in text
    intro = f"""
    See Figure {fig1} for details and Table {table1} for data<sup>[{ref1}]</sup>.
    """
    story.append(Paragraph(intro, styles['Normal']))
    story.append(Spacer(1, 0.2*inch))
    # Step 3: Create figures and tables with numbered captions
    story.append(Paragraph(f"<b>Figure {fig1}.</b> Sample Figure Caption",
        styles['Caption']
    ))
    # Table example
    header_style = ParagraphStyle(
        name='TableHeader',
        fontName='Times New Roman',
        fontSize=11,
        textColor=colors.white,
        alignment=TA_CENTER
    )
    cell_style = ParagraphStyle(
        name='TableCell',
        fontName='Times New Roman',
        fontSize=10,
        textColor=colors.black,
        alignment=TA_CENTER
    )
    # All text content wrapped in Paragraph()
    data = [
        [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Value</b>', header_style)],
        [Paragraph('A', cell_style), Paragraph('10', cell_style)],
        [Paragraph('B', cell_style), Paragraph('20', cell_style)],
    ]
    t = Table(data, colWidths=[2*inch, 2*inch])
    t.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ]))
    story.append(t)
    story.append(Spacer(1, 6))
    story.append(Paragraph(f"<b>Table {table1}.</b> Sample Data Table",
        styles['Caption']
    ))
    story.append(PageBreak())
    # Step 4: Reference again in discussion
    discussion = f"""
    As shown in Figure {fig1} and Table {table1}, results are clear<sup>[{ref1}]</sup>.
    """
    story.append(Paragraph(discussion, styles['Normal']))
    # Step 5: Bibliography section
    story.append(PageBreak())
    story.append(Paragraph("<b>References</b>", styles['Heading1']))
    story.append(Paragraph(
        f"[{ref1}] Author, A. (2024). Example Reference. <i>Journal Name</i>.",
        styles['Normal']
    ))
    doc.build(story)
    print("PDF with cross-references created!")
if __name__ == '__main__':
    build_document()
Usage Notes:
- Call add_figure()/add_table()/add_reference() at the START of your document
- Use Paragraph('<sup>[{ref_num}]</sup>') for inline citations
- Use Paragraph('<b>Figure {num}.</b>') or Paragraph('<b>Table {num}.</b>') with centered caption style
- Use TocDocTemplate + doc.multiBuild(story) if both cross-refs and auto-TOC are needed
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

⚠️ CRITICAL MANDATORY RULE - PDF Metadata MUST be Added After Every PDF Generation

All PDFs MUST have metadata added immediately after creation. This is the FINAL step and CANNOT be skipped.
Usage - Standalone Script:
# Add metadata to a single PDF (in-place)
python scripts/add_zai_metadata.py document.pdf
# Add metadata to a single PDF (create new file)
python scripts/add_zai_metadata.py input.pdf -o output.pdf
# Add metadata with custom title
python scripts/add_zai_metadata.py report.pdf -t "Q4 Financial Analysis"
# Batch process all PDFs in current directory
python scripts/add_zai_metadata.py *.pdf
# Quiet mode (no output)
python scripts/add_zai_metadata.py document.pdf -q
# Show help
python scripts/add_zai_metadata.py --help

Requirements:
- After doc.build(story) completes, immediately call the script.
- Do NOT wait for a user reminder and do NOT check the task description; execute automatically.
- Confirm the metadata info to the user after adding it.
- Memory phrase: PDF build done, metadata must add, no need to remind.
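The requirement above (run the metadata script right after the build) could be wired into the generation script itself. The following is only a minimal sketch: `build_metadata_command` and `add_metadata` are hypothetical helper names, the flags `-o`, `-t`, and `-q` are the ones listed in the usage examples above, and it assumes the script signals failure with a non-zero exit code.

```python
import subprocess
import sys
from typing import List, Optional

# Script path as shown in the usage examples above; adjust if the skill lives elsewhere.
METADATA_SCRIPT = "scripts/add_zai_metadata.py"


def build_metadata_command(pdf_path: str,
                           output: Optional[str] = None,
                           title: Optional[str] = None,
                           quiet: bool = False) -> List[str]:
    """Assemble the argument list for the metadata script (hypothetical helper)."""
    cmd = [sys.executable, METADATA_SCRIPT, pdf_path]
    if output:
        cmd += ["-o", output]   # write to a new file instead of in-place
    if title:
        cmd += ["-t", title]    # custom document title
    if quiet:
        cmd.append("-q")        # suppress output
    return cmd


def add_metadata(pdf_path: str, **options) -> None:
    """Run the script; assumes a non-zero exit code means failure."""
    subprocess.run(build_metadata_command(pdf_path, **options), check=True)


if __name__ == "__main__":
    # Only prints the command here; in a real script, call
    # add_metadata("report.pdf", title=...) directly after doc.build(story).
    print(build_metadata_command("report.pdf", title="Q4 Financial Analysis"))
```

Keeping the command construction separate from the call makes it easy to log or confirm the exact invocation to the user afterwards.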
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
print(text)

from pypdf import PdfReader, PdfWriter
# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]
# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
    writer.write(output)

from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Add password
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)

- ALL Chinese text and numbers MUST use SimHei for Chinese PDF.
- ALL English text and numbers MUST use Times New Roman for English PDF.
- ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman for Mixed Chinese-English PDF.
- Call registerFontFamily() after registering fonts to enable <b>, <super>, <sub> tags.
- For mixed Chinese-English text, use <font name='...'> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. You must set ParagraphStyle.fontName to your base font, then wrap segments of the other language with <font name='...'> inline tags.

from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_JUSTIFY  # needed for the alignment settings below
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
name="ENBodyStyle",
fontName="Times New Roman", # Base font for English
fontSize=10.5,
leading=18,
alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag; each sentence is its own Paragraph
story.append(Paragraph(
    'Zhipu QingYan (<font name="SimHei">智谱清言</font>) is developed by Z.ai.',
    enbody_style
))
story.append(Paragraph(
    'My name is Lei Shen (<font name="SimHei">沈磊</font>).',
    enbody_style
))
story.append(Paragraph(
    '<font name="SimHei">文心一言</font> (ERNIE Bot) is by Baidu.',
    enbody_style
))
# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
name="CNBodyStyle",
fontName="SimHei", # Base font for Chinese
fontSize=10.5,
leading=18,
alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
'本报告使用 <font name="Times New Roman">GPT-4</font> '
'和 <font name="Times New Roman">GLM</font> 进行测试。',
cnbody_style
))

Rich-text tags (<b>, <super>, <sub>):
- These tags only work inside Paragraph() objects; plain strings will NOT render them.
- Use <super>, <sub>, <b> tags inside Paragraph().
- Scientific notation: Paragraph('1.246 × 10<super>8</super>', style); never write large numbers as plain digits.
- Paragraph does not treat a normal newline character (\n) as a line break. To create line breaks, you must use <br/> (or split the content into multiple Paragraph objects).

# WRONG: the newlines inside this string are ignored by Paragraph
sms3 = """Hi [FIRST_NAME]
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.
[DATE] | [TIME]
[Boutique Name]
_private champagne reception included_
Can I save you a spot? Just let me know!
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])
# IMPORTANT:
# Paragraph does NOT treat '\n' as a line break.
# Use <br/> to force line breaks.
sms3 = """Hi [FIRST_NAME]<br/><br/>
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.<br/><br/>
[DATE] | [TIME]<br/>
[Boutique Name]<br/><br/>
<i>private champagne reception included</i><br/><br/>
Can I save you a spot? Just let me know!<br/><br/>
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])

Titles: Paragraph('<b>Title</b>', style) + textColor=colors.black.

ALL text content in table cells MUST be wrapped in Paragraph(). This is NON-NEGOTIABLE.
❌ PROHIBITED - Plain strings in table cells:
# NEVER DO THIS - formatting will NOT work
data = [
['<b>Header</b>', 'Value'], # Bold won't render
['Temperature', '25°C'], # No style control
['Pressure', '1.01 × 10<super>5</super>'], # Superscript won't work
]

✅ REQUIRED - All table text MUST be wrapped in Paragraph:
# ALWAYS DO THIS
data = [
[Paragraph('<b>Header</b>', header_style), Paragraph('Value', header_style)],
[Paragraph('Temperature', cell_style), Paragraph('25°C', cell_style)],
[Paragraph('Pressure', cell_style), Paragraph('1.01 × 10<super>5</super>', cell_style)],
]

Why this is mandatory:
- Plain strings in table cells do not render formatting tags (<b>, <super>, <sub>, <i>)

The ONLY exception: Image() objects can be placed directly in table cells without Paragraph wrapping.
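One way to enforce the table-cell rule mechanically is to wrap cells in a single pass before building the Table. The sketch below is an illustration, not part of the skill: `wrap_cells` is a hypothetical helper, and the paragraph factories are passed in as callables so the block runs without ReportLab installed. In real code they would be something like `lambda text: Paragraph(text, cell_style)`.

```python
from typing import Any, Callable, List, Optional, Sequence


def wrap_cells(rows: Sequence[Sequence[Any]],
               make_cell: Callable[[str], Any],
               make_header: Optional[Callable[[str], Any]] = None) -> List[List[Any]]:
    """Wrap every plain-string cell; leave other objects (e.g. Image) untouched.

    The first row is treated as the header row and uses make_header if given.
    """
    header_factory = make_header or make_cell
    wrapped = []
    for row_index, row in enumerate(rows):
        factory = header_factory if row_index == 0 else make_cell
        wrapped.append([factory(cell) if isinstance(cell, str) else cell
                        for cell in row])
    return wrapped


if __name__ == "__main__":
    # Stand-in factories (tuples) so the sketch runs anywhere.
    # With ReportLab: make_cell=lambda t: Paragraph(t, cell_style), and
    # make_header=lambda t: Paragraph(f'<b>{t}</b>', header_style)
    data = wrap_cells(
        [["Item", "Value"], ["Pressure", "1.01 × 10<super>5</super>"]],
        make_cell=lambda text: ("cell", text),
        make_header=lambda text: ("header", text),
    )
    print(data)
```

Only `str` cells are wrapped, so numeric cells should be converted to strings first if they need to go through a Paragraph as well.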
- Header cells: Paragraph('<b>Header</b>', header_style) + textColor=colors.white.
- Colors: header background (#1F4E79), alternating white/light gray rows.
- After a table: Spacer(1, 18) before next content.
- Add Spacer(1, 18) BEFORE tables to maintain symmetric spacing with bottom.
- Body text alignment: TA_JUSTIFY.

from PIL import Image as PILImage
from reportlab.platypus import Image
# Get original dimensions
pil_img = PILImage.open('image.png')
orig_w, orig_h = pil_img.size
# Scale to fit width while preserving aspect ratio
target_width = 400
scale = target_width / orig_w
img = Image('image.png', width=target_width, height=orig_h * scale)

Use Paragraph (not Preformatted) for body text and formulas.

After the complete Python code is written and BEFORE executing it, you MUST sanitize the code using the pre-built script located at:

scripts/sanitize_code.py

This script catches any forbidden Unicode characters (superscript/subscript digits, math operators, emoji, HTML entities, literal \uXXXX escapes) that may have slipped through despite the prevention rules. It converts them to safe ReportLab <super>/<sub> tags or ASCII equivalents.
⚠️ CRITICAL RULE: You MUST ALWAYS write PDF generation code to a .py file first, then sanitize it, then execute it. NEVER use python -c "..." or heredoc (python3 << 'EOF') to run PDF generation code directly — these patterns bypass the sanitization step and risk forbidden characters reaching the final PDF.
Mandatory workflow (NO EXCEPTIONS):
# Step 1: ALWAYS write code to a .py file first
cat > generate_pdf.py << 'PYEOF'
# ... your PDF generation code here ...
PYEOF
# Step 2: Sanitize forbidden characters (MUST run before execution)
python scripts/sanitize_code.py generate_pdf.py
# Step 3: Execute the sanitized code
python generate_pdf.py

Forbidden patterns (NEVER do any of the following):
# ❌ PROHIBITED: python -c with inline code (cannot be sanitized)
python -c "from reportlab... doc.build(story)"
# ❌ PROHIBITED: heredoc without saving to file first (cannot be sanitized)
python3 << 'EOF'
from reportlab...
EOF
# ❌ PROHIBITED: executing the .py file WITHOUT sanitizing first
python generate_pdf.py  # Missing sanitization step!

✅ CORRECT: The ONLY allowed execution pattern:
# 1. Write to file → 2. Sanitize → 3. Execute
cat > generate_pdf.py << 'PYEOF'
...code...
PYEOF
python scripts/sanitize_code.py generate_pdf.py
python generate_pdf.py

⚠️ This sanitization step is NON-OPTIONAL. Even if you believe the code contains no forbidden characters, you MUST still run the sanitization script. It serves as a safety net to catch any characters that bypassed prevention rules.
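If the workflow is driven from Python rather than the shell, the same sanitize-then-execute order can be kept in one place. This is a minimal sketch under stated assumptions: `pipeline_commands` and `sanitize_and_run` are hypothetical names, the script path is the one given above, and the sanitizer is assumed to exit with a non-zero status when it fails.

```python
import subprocess
import sys
from typing import List

# Path as given above; adjust if the skill is installed elsewhere.
SANITIZER = "scripts/sanitize_code.py"


def pipeline_commands(script_path: str) -> List[List[str]]:
    """Return the two commands in the required order: sanitize, then execute."""
    return [
        [sys.executable, SANITIZER, script_path],  # step 2: sanitize the saved file
        [sys.executable, script_path],             # step 3: run the sanitized file
    ]


def sanitize_and_run(script_path: str) -> None:
    """Run both steps; check=True stops before execution if a step fails."""
    for cmd in pipeline_commands(script_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the commands only; call sanitize_and_run("generate_pdf.py")
    # once the generation code has been written to that file.
    for cmd in pipeline_commands("generate_pdf.py"):
        print(" ".join(cmd))
```

Because the script is always passed as a file path, this pattern cannot be used with `python -c` or a heredoc, which matches the restriction above.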
| Task | Best Tool | Command/Code |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |