# Document Transformation

Utilities for transforming document elements before conversion. Mammoth's transformation system lets you apply custom processing to paragraphs, runs, and other document elements, so you can adjust the document model before it is converted.

## Capabilities

### Element Type Transforms

Create transformations that target specific document element types.

```python { .api }
def paragraph(transform_paragraph):
    """
    Create transform that applies to paragraph elements.

    Parameters:
    - transform_paragraph: function, transforms paragraph elements

    Returns:
    Transform function that processes the entire document
    """

def run(transform_run):
    """
    Create transform that applies to run elements.

    Parameters:
    - transform_run: function, transforms run elements

    Returns:
    Transform function that processes the entire document
    """

def element_of_type(element_type, transform):
    """
    Create transform for specific element types.

    Parameters:
    - element_type: class/type to match
    - transform: function to apply to matching elements

    Returns:
    Transform function that processes the entire document
    """
```
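To make "processes the entire document" concrete, here is a minimal, self-contained sketch of how a per-element function can be lifted into a whole-tree transform. It does not use Mammoth: `Node`, `lift`, and `shout` are hypothetical stand-ins, and the bottom-up order (children before their parent) is an assumption made for illustration rather than a description of Mammoth's internals.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a document element."""
    kind: str
    text: str = ""
    children: tuple = ()


def lift(kind, transform):
    """Turn a per-node function into a function over a whole tree."""
    def apply(node):
        # Transform the children first, then rebuild the parent around them.
        new_children = tuple(apply(child) for child in node.children)
        node = replace(node, children=new_children)
        # Only nodes of the requested kind are passed to the callback.
        return transform(node) if node.kind == kind else node
    return apply


def shout(node):
    """Example callback: uppercase the text of a single node."""
    return replace(node, text=node.text.upper())


tree = Node("document", children=(
    Node("paragraph", children=(Node("run", text="hello"),)),
    Node("table", children=(Node("run", text="world"),)),
))

# Every "run" node is transformed, wherever it sits in the tree.
result = lift("run", shout)(tree)
```

The callback only ever sees one matching node at a time; the wrapper is responsible for walking the tree and returning a new root, which is why the wrapped function can be handed a whole document.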

### Document Traversal

Functions for finding and extracting specific elements from the document tree.

```python { .api }
def get_descendants_of_type(element, element_type):
    """
    Get all descendant elements of specified type.

    Parameters:
    - element: Root element to search from
    - element_type: Type/class to filter for

    Returns:
    List of matching descendant elements
    """

def get_descendants(element):
    """
    Get all descendant elements.

    Parameters:
    - element: Root element to search from

    Returns:
    List of all descendant elements
    """
```
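The relationship between the two functions can be sketched without Mammoth. The classes below are local stand-ins that only borrow Mammoth's names, and the sketch assumes that the root element itself is excluded from the result and that the typed variant is a plain `isinstance` filter over the untyped one; the ordering shown is one possible choice.

```python
class Element:
    """Local stand-in for an element that may have children."""
    def __init__(self, *children):
        self.children = list(children)


class Paragraph(Element):
    pass


class Run(Element):
    pass


class Text(Element):
    def __init__(self, value):
        super().__init__()
        self.value = value


def descendants(element):
    """Collect every element below `element`, excluding `element` itself."""
    found = []
    for child in element.children:
        found.extend(descendants(child))
        found.append(child)
    return found


def descendants_of_type(element, element_type):
    """Keep only the descendants that are instances of `element_type`."""
    return [d for d in descendants(element) if isinstance(d, element_type)]


root = Paragraph(Run(Text("a"), Text("b")), Run(Text("c")))
all_nodes = descendants(root)
texts = descendants_of_type(root, Text)
```

Because the result is a flat list, it is convenient for counting or inspecting elements, but it carries no information about where each element sits in the tree.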

## Document Element Types

When creating transforms, you'll work with these document element types:

```python { .api }
class Document:
    """Root document container."""
    children: list  # Child elements
    notes: list  # Footnotes and endnotes
    comments: list  # Document comments

class Paragraph:
    """Paragraph element with styling information."""
    children: list  # Child elements (runs, hyperlinks, etc.)
    style_id: str  # Word style ID
    style_name: str  # Word style name
    numbering: object  # List numbering information
    alignment: str  # Text alignment
    indent: object  # Indentation settings

class Run:
    """Text run with formatting."""
    children: list  # Child elements (text, breaks, etc.)
    style_id: str  # Word style ID
    style_name: str  # Word style name
    is_bold: bool  # Bold formatting
    is_italic: bool  # Italic formatting
    is_underline: bool  # Underline formatting
    is_strikethrough: bool  # Strikethrough formatting
    is_all_caps: bool  # All caps formatting
    is_small_caps: bool  # Small caps formatting
    vertical_alignment: str  # Superscript/subscript
    font: str  # Font name
    font_size: int  # Font size in half-points
    highlight: str  # Highlight color

class Text:
    """Plain text node."""
    value: str  # Text content

class Hyperlink:
    """Hyperlink element."""
    children: list  # Child elements
    href: str  # Link URL
    anchor: str  # Internal anchor
    target_frame: str  # Target frame

class Image:
    """Image element."""
    alt_text: str  # Alternative text
    content_type: str  # MIME type

    def open(self):
        """Open image data for reading."""

class Table:
    """Table element."""
    children: list  # TableRow elements
    style_id: str  # Word style ID
    style_name: str  # Word style name

class TableRow:
    """Table row element."""
    children: list  # TableCell elements

class TableCell:
    """Table cell element."""
    children: list  # Cell content elements
    colspan: int  # Column span
    rowspan: int  # Row span

class Break:
    """Line, page, or column break."""
    break_type: str  # "line", "page", "column"
```

## Transform Examples

### Remove Empty Paragraphs

```python
import mammoth

def is_empty_paragraph(element):
    # True for paragraphs that contain no non-whitespace text
    if not isinstance(element, mammoth.documents.Paragraph):
        return False
    return not any(
        isinstance(descendant, mammoth.documents.Text) and descendant.value.strip()
        for descendant in mammoth.transforms.get_descendants(element)
    )

def remove_empty_paragraphs(document):
    # A transform must return an element; returning None for a paragraph
    # would leave None among its parent's children. Filter the parent's
    # children instead. This only handles top-level paragraphs.
    children = [
        child for child in document.children
        if not is_empty_paragraph(child)
    ]
    return document.copy(children=children)

# Apply during conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=remove_empty_paragraphs
    )
```

### Convert Custom Styles

```python
import mammoth

def convert_custom_headings(paragraph):
    # Convert custom heading styles to standard ones
    if paragraph.style_name == "CustomHeading1":
        paragraph = paragraph.copy(style_name="Heading 1")
    elif paragraph.style_name == "CustomHeading2":
        paragraph = paragraph.copy(style_name="Heading 2")

    return paragraph

transform = mammoth.transforms.paragraph(convert_custom_headings)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Modify Text Content

```python
import mammoth

def uppercase_bold_text(run):
    if run.is_bold:
        # Transform all text children to uppercase
        new_children = []
        for child in run.children:
            if isinstance(child, mammoth.documents.Text):
                new_children.append(
                    mammoth.documents.text(child.value.upper())
                )
            else:
                new_children.append(child)

        return run.copy(children=new_children)

    return run

transform = mammoth.transforms.run(uppercase_bold_text)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Complex Document Analysis

```python
import mammoth

def analyze_and_transform(document):
    # Find all headings in the document
    headings = []
    for paragraph in mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Paragraph
    ):
        if paragraph.style_name and "Heading" in paragraph.style_name:
            headings.append(paragraph)

    print(f"Found {len(headings)} headings")

    # Find all images
    images = mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Image
    )
    print(f"Found {len(images)} images")

    # Return unchanged document
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=analyze_and_transform
    )
```
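The substring test `"Heading" in paragraph.style_name` also matches names such as "TOC Heading". A stricter check can be sketched as follows, assuming the document uses Word's built-in English style names ("Heading 1", "Heading 2", and so on). `heading_level` is a hypothetical helper for illustration, not part of Mammoth.

```python
import re

# Matches "Heading 1", "heading 2", "Heading3", but not "TOC Heading".
_HEADING_PATTERN = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def heading_level(style_name):
    """Return the level for names like "Heading 2", or None otherwise."""
    if not style_name:
        return None
    match = _HEADING_PATTERN.match(style_name.strip())
    return int(match.group(1)) if match else None
```

Inside an analysis function, `heading_level(paragraph.style_name) is not None` can replace the substring test, and the returned level is available for building an outline.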

### Combining Transforms

```python
import re

import mammoth

def remove_comments(run):
    # Remove comment references. In .docx files these normally sit
    # inside runs, so they are filtered at the run level.
    new_children = []
    for child in run.children:
        if not isinstance(child, mammoth.documents.CommentReference):
            new_children.append(child)

    return run.copy(children=new_children)

def normalize_whitespace(run):
    new_children = []
    for child in run.children:
        if isinstance(child, mammoth.documents.Text):
            # Collapse each run of whitespace to a single space, keeping
            # spaces at the edges so adjacent runs stay separated
            normalized = re.sub(r"\s+", " ", child.value)
            new_children.append(mammoth.documents.text(normalized))
        else:
            new_children.append(child)

    return run.copy(children=new_children)

def combined_transform(document):
    # Apply multiple transforms in sequence
    comment_transform = mammoth.transforms.run(remove_comments)
    whitespace_transform = mammoth.transforms.run(normalize_whitespace)

    document = comment_transform(document)
    document = whitespace_transform(document)

    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=combined_transform
    )
```
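When more than two transforms are chained, the sequencing can be factored into a small helper. The sketch below is plain Python and not part of Mammoth; `compose_transforms` is a hypothetical name, and ordinary string functions stand in for document transforms so the behavior is easy to see.

```python
from functools import reduce


def compose_transforms(*transforms):
    """Return one transform that applies `transforms` from left to right."""
    def combined(document):
        # Feed the output of each transform into the next one.
        return reduce(lambda doc, transform: transform(doc), transforms, document)
    return combined


# Plain strings stand in for a document here.
pipeline = compose_transforms(str.strip, str.upper)
cleaned = pipeline("  hello  ")
```

Since each wrapped transform takes a document and returns a document, the composed function has the same shape and could be passed as `transform_document` in place of a hand-written combining function.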

## Factory Functions

Mammoth provides factory functions for creating document elements:

```python { .api }
def document(children, notes=None, comments=None):
    """Create Document instance."""

def paragraph(children, style_id=None, style_name=None,
              numbering=None, alignment=None, indent=None):
    """Create Paragraph instance."""

def run(children, style_id=None, style_name=None,
        is_bold=None, is_italic=None, **kwargs):
    """Create Run instance with normalized boolean fields."""

def text(value):
    """Create Text instance."""

def hyperlink(children, href=None, anchor=None, target_frame=None):
    """Create Hyperlink instance."""

def table(children, style_id=None, style_name=None):
    """Create Table instance."""
```

Use these factory functions when a transform needs to create new document elements. They are accessed through `mammoth.documents` (for example `mammoth.documents.text(...)`, as in the examples above), so the `paragraph` and `run` factories are distinct from the transform helpers `mammoth.transforms.paragraph` and `mammoth.transforms.run`.