# Document Transformation

Utilities for transforming document elements before conversion. A transform is a function that receives the document tree and returns a modified tree; Mammoth provides helpers for building transforms that target paragraphs, runs, or any other element type.

## Capabilities

### Element Type Transforms

Create transformations that target specific document element types.

```python { .api }
def paragraph(transform_paragraph):
    """
    Create transform that applies to paragraph elements.

    Parameters:
    - transform_paragraph: function, transforms paragraph elements

    Returns:
    Transform function that processes the entire document
    """

def run(transform_run):
    """
    Create transform that applies to run elements.

    Parameters:
    - transform_run: function, transforms run elements

    Returns:
    Transform function that processes the entire document
    """

def element_of_type(element_type, transform):
    """
    Create transform for specific element types.

    Parameters:
    - element_type: class/type to match
    - transform: function to apply to matching elements

    Returns:
    Transform function that processes the entire document
    """
```
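
To make the behaviour of these helpers concrete, here is a minimal sketch of how an element-type transform could be built. It is not Mammoth's source code: `Node`, `Para`, `TextNode`, and `make_element_transform` are hypothetical stand-ins, and the bottom-up traversal order is an assumption made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a document element that has children."""
    children: List["Node"] = field(default_factory=list)


@dataclass(frozen=True)
class Para(Node):
    style_name: str = ""


@dataclass(frozen=True)
class TextNode(Node):
    value: str = ""


def make_element_transform(element_type: type,
                           transform: Callable[[Node], Node]) -> Callable[[Node], Node]:
    """Return a function that rewrites a whole tree, applying `transform`
    to every node that is an instance of `element_type`."""
    def apply(node: Node) -> Node:
        # Rebuild the children first, so `transform` sees processed children.
        new_children = [apply(child) for child in node.children]
        node = replace(node, children=new_children)
        return transform(node) if isinstance(node, element_type) else node
    return apply


def rename_style(para: Para) -> Para:
    # Only paragraphs styled "Title" are changed; others pass through.
    if para.style_name == "Title":
        return replace(para, style_name="Heading 1")
    return para


tree = Node(children=[
    Para(style_name="Title", children=[TextNode(value="Hi")]),
    Para(style_name="Body"),
])
result = make_element_transform(Para, rename_style)(tree)
styles = [child.style_name for child in result.children]
```

The point of the sketch is that the per-element function only ever sees one matching element, while the returned function accepts and returns the whole tree, which is why it can be passed directly as a document-level transform.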

### Document Traversal

Functions for finding and extracting specific elements from the document tree.

```python { .api }
def get_descendants_of_type(element, element_type):
    """
    Get all descendant elements of specified type.

    Parameters:
    - element: Root element to search from
    - element_type: Type/class to filter for

    Returns:
    List of matching descendant elements
    """

def get_descendants(element):
    """
    Get all descendant elements.

    Parameters:
    - element: Root element to search from

    Returns:
    List of all descendant elements
    """
```
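
The traversal functions can be pictured as a recursive walk over `children`. The sketch below uses hypothetical stand-in classes (`Node`, `Para`, `TextNode`) and hypothetical function names rather than Mammoth's own, and the pre-order result ordering is an assumption; the order Mammoth returns may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a document element that has children."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class Para(Node):
    pass


@dataclass
class TextNode(Node):
    value: str = ""


def descendants(element: Node) -> List[Node]:
    """Collect every node below `element`, excluding `element` itself."""
    found: List[Node] = []
    for child in element.children:
        found.append(child)
        found.extend(descendants(child))
    return found


def descendants_of_type(element: Node, element_type: type) -> List[Node]:
    """Keep only the descendants that are instances of `element_type`."""
    return [node for node in descendants(element) if isinstance(node, element_type)]


doc = Node(children=[
    Para(children=[TextNode(value="one"), TextNode(value="two")]),
    Para(children=[TextNode(value="three")]),
])
texts = [node.value for node in descendants_of_type(doc, TextNode)]
```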

## Document Element Types

When creating transforms, you'll work with these document element types:

```python { .api }
class Document:
    """Root document container."""
    children: list  # Child elements
    notes: list  # Footnotes and endnotes
    comments: list  # Document comments

class Paragraph:
    """Paragraph element with styling information."""
    children: list  # Child elements (runs, hyperlinks, etc.)
    style_id: str  # Word style ID
    style_name: str  # Word style name
    numbering: object  # List numbering information
    alignment: str  # Text alignment
    indent: object  # Indentation settings

class Run:
    """Text run with formatting."""
    children: list  # Child elements (text, breaks, etc.)
    style_id: str  # Word style ID
    style_name: str  # Word style name
    is_bold: bool  # Bold formatting
    is_italic: bool  # Italic formatting
    is_underline: bool  # Underline formatting
    is_strikethrough: bool  # Strikethrough formatting
    is_all_caps: bool  # All caps formatting
    is_small_caps: bool  # Small caps formatting
    vertical_alignment: str  # Superscript/subscript
    font: str  # Font name
    font_size: int  # Font size in half-points
    highlight: str  # Highlight color

class Text:
    """Plain text node."""
    value: str  # Text content

class Hyperlink:
    """Hyperlink element."""
    children: list  # Child elements
    href: str  # Link URL
    anchor: str  # Internal anchor
    target_frame: str  # Target frame

class Image:
    """Image element."""
    alt_text: str  # Alternative text
    content_type: str  # MIME type

    def open(self):
        """Open image data for reading."""

class Table:
    """Table element."""
    children: list  # TableRow elements
    style_id: str  # Word style ID
    style_name: str  # Word style name

class TableRow:
    """Table row element."""
    children: list  # TableCell elements

class TableCell:
    """Table cell element."""
    children: list  # Cell content elements
    colspan: int  # Column span
    rowspan: int  # Row span

class Break:
    """Line, page, or column break."""
    break_type: str  # "line", "page", "column"
```

## Transform Examples

### Remove Empty Paragraphs

```python
import mammoth

def remove_empty_paragraphs(paragraph):
    # Check if paragraph has no text content
    has_text = any(
        isinstance(child, mammoth.documents.Text) and child.value.strip()
        for child in mammoth.transforms.get_descendants(paragraph)
    )

    if not has_text:
        return None  # Remove the paragraph
    return paragraph

# Create the transform
transform = mammoth.transforms.paragraph(remove_empty_paragraphs)

# Apply during conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Convert Custom Styles

```python
import mammoth

def convert_custom_headings(paragraph):
    # Convert custom heading styles to standard ones
    if paragraph.style_name == "CustomHeading1":
        paragraph = paragraph.copy(style_name="Heading 1")
    elif paragraph.style_name == "CustomHeading2":
        paragraph = paragraph.copy(style_name="Heading 2")

    return paragraph

transform = mammoth.transforms.paragraph(convert_custom_headings)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Modify Text Content

```python
import mammoth

def uppercase_bold_text(run):
    if run.is_bold:
        # Transform all text children to uppercase
        new_children = []
        for child in run.children:
            if isinstance(child, mammoth.documents.Text):
                new_children.append(
                    mammoth.documents.text(child.value.upper())
                )
            else:
                new_children.append(child)

        return run.copy(children=new_children)

    return run

transform = mammoth.transforms.run(uppercase_bold_text)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Complex Document Analysis

```python
import mammoth

def analyze_and_transform(document):
    # Find all headings in the document
    headings = []
    for paragraph in mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Paragraph
    ):
        if paragraph.style_name and "Heading" in paragraph.style_name:
            headings.append(paragraph)

    print(f"Found {len(headings)} headings")

    # Find all images
    images = mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Image
    )
    print(f"Found {len(images)} images")

    # Return unchanged document
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=analyze_and_transform
    )
```
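
The substring test `"Heading" in paragraph.style_name` also matches any style whose name merely contains the word. If only numbered heading styles should count, a stricter check might look like the sketch below. `heading_level` is a hypothetical helper, not part of Mammoth, and it assumes heading styles are named `Heading 1` through `Heading 9`.

```python
import re
from typing import Optional

# Assumption: heading styles are named "Heading N" with a single digit N.
_HEADING_PATTERN = re.compile(r"^heading (\d)$", re.IGNORECASE)


def heading_level(style_name: Optional[str]) -> Optional[int]:
    """Return the heading level for names like "Heading 2", else None."""
    if not style_name:
        return None
    match = _HEADING_PATTERN.match(style_name.strip())
    return int(match.group(1)) if match else None


levels = [
    heading_level(name)
    for name in ["Heading 1", "heading 2", "Heading 1 Char", None, "Normal"]
]
```

Inside `analyze_and_transform`, the condition could then become `heading_level(paragraph.style_name) is not None`.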

### Combining Transforms

```python
import mammoth

def remove_comments(paragraph):
    # Remove comment references that are direct children of the paragraph;
    # references nested deeper (for example inside runs) are left untouched
    new_children = []
    for child in paragraph.children:
        if not isinstance(child, mammoth.documents.CommentReference):
            new_children.append(child)

    return paragraph.copy(children=new_children)

def normalize_whitespace(run):
    new_children = []
    for child in run.children:
        if isinstance(child, mammoth.documents.Text):
            # Collapse whitespace runs to single spaces and trim both ends
            normalized = " ".join(child.value.split())
            new_children.append(mammoth.documents.text(normalized))
        else:
            new_children.append(child)

    return run.copy(children=new_children)

def combined_transform(document):
    # Apply multiple transforms in sequence
    comment_transform = mammoth.transforms.paragraph(remove_comments)
    whitespace_transform = mammoth.transforms.run(normalize_whitespace)

    document = comment_transform(document)
    document = whitespace_transform(document)

    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=combined_transform
    )
```
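
When more than two transforms are chained, the sequencing in `combined_transform` can be factored into a small helper. The sketch below is plain Python with no Mammoth dependency; `compose_transforms` is a hypothetical name, and it assumes each transform takes one value and returns a value of the same kind.

```python
from functools import reduce
from typing import Callable, TypeVar

T = TypeVar("T")


def compose_transforms(*transforms: Callable[[T], T]) -> Callable[[T], T]:
    """Return a function that applies `transforms` from left to right."""
    def combined(document: T) -> T:
        # Feed the output of each transform into the next one.
        return reduce(lambda doc, transform: transform(doc), transforms, document)
    return combined


# Demonstration on plain strings rather than a document tree.
strip_then_upper = compose_transforms(str.strip, str.upper)
demo = strip_then_upper("  hello ")
```

With the functions from the example above, the same idea would read `compose_transforms(mammoth.transforms.paragraph(remove_comments), mammoth.transforms.run(normalize_whitespace))`, passed as `transform_document`.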

## Factory Functions

Mammoth provides factory functions for creating document elements:

```python { .api }
def document(children, notes=None, comments=None):
    """Create Document instance."""

def paragraph(children, style_id=None, style_name=None,
              numbering=None, alignment=None, indent=None):
    """Create Paragraph instance."""

def run(children, style_id=None, style_name=None,
        is_bold=None, is_italic=None, **kwargs):
    """Create Run instance with normalized boolean fields."""

def text(value):
    """Create Text instance."""

def hyperlink(children, href=None, anchor=None, target_frame=None):
    """Create Hyperlink instance."""

def table(children, style_id=None, style_name=None):
    """Create Table instance."""
```

Use these factory functions to create new document elements inside a transform.