Tessl Tile for pypi/mammoth@1.10.0

# Document Transformation

Utilities for transforming document elements before conversion. The function passed as `transform_document` receives the parsed document and returns the document to convert. The helpers below build such functions from callbacks that handle paragraphs, runs, or other element types, so individual elements can be rewritten before the output is generated.

## Capabilities

### Element Type Transforms

Create transformations that target specific document element types.

```python { .api }
def paragraph(transform_paragraph):
    """
    Create transform that applies to paragraph elements.

    Parameters:
    - transform_paragraph: function, transforms paragraph elements

    Returns:
    Transform function that processes the entire document
    """

def run(transform_run):
    """
    Create transform that applies to run elements.

    Parameters:
    - transform_run: function, transforms run elements

    Returns:
    Transform function that processes the entire document
    """

def element_of_type(element_type, transform):
    """
    Create transform for specific element types.

    Parameters:
    - element_type: class/type to match
    - transform: function to apply to matching elements

    Returns:
    Transform function that processes the entire document
    """
```

### Document Traversal

Functions for finding and extracting specific elements from the document tree.

```python { .api }
def get_descendants_of_type(element, element_type):
    """
    Get all descendant elements of specified type.

    Parameters:
    - element: Root element to search from
    - element_type: Type/class to filter for

    Returns:
    List of matching descendant elements
    """

def get_descendants(element):
    """
    Get all descendant elements.

    Parameters:
    - element: Root element to search from

    Returns:
    List of all descendant elements
    """
```

## Document Element Types

When creating transforms, you'll work with these document element types:

```python { .api }
class Document:
    """Root document container."""
    children: list  # Child elements
    notes: object   # Footnotes and endnotes
    comments: list  # Document comments

class Paragraph:
    """Paragraph element with styling information."""
    children: list      # Child elements (runs, hyperlinks, etc.)
    style_id: str       # Word style ID
    style_name: str     # Word style name
    numbering: object   # List numbering information
    alignment: str      # Text alignment
    indent: object      # Indentation settings

class Run:
    """Text run with formatting."""
    children: list           # Child elements (text, breaks, etc.)
    style_id: str           # Word style ID
    style_name: str         # Word style name
    is_bold: bool           # Bold formatting
    is_italic: bool         # Italic formatting
    is_underline: bool      # Underline formatting
    is_strikethrough: bool  # Strikethrough formatting
    is_all_caps: bool       # All caps formatting
    is_small_caps: bool     # Small caps formatting
    vertical_alignment: str # Superscript/subscript
    font: str               # Font name
    font_size: float        # Font size in points (Word stores half-points)
    highlight: str          # Highlight color

class Text:
    """Plain text node."""
    value: str  # Text content

class Hyperlink:
    """Hyperlink element."""
    children: list      # Child elements
    href: str           # Link URL
    anchor: str         # Internal anchor
    target_frame: str   # Target frame

class Image:
    """Image element."""
    alt_text: str      # Alternative text
    content_type: str  # MIME type

    def open(self):
        """Open image data for reading."""

class Table:
    """Table element."""
    children: list    # TableRow elements
    style_id: str     # Word style ID
    style_name: str   # Word style name

class TableRow:
    """Table row element."""
    children: list  # TableCell elements

class TableCell:
    """Table cell element."""
    children: list  # Cell content elements
    colspan: int    # Column span
    rowspan: int    # Row span

class Break:
    """Line, page, or column break."""
    break_type: str  # "line", "page", "column"
```

## Transform Examples

### Remove Empty Paragraphs

```python
import mammoth

def is_empty_paragraph(element):
    # A paragraph counts as empty if it has no non-blank text descendants
    # (a paragraph that only contains an image is also treated as empty)
    if not isinstance(element, mammoth.documents.Paragraph):
        return False
    return not any(
        isinstance(child, mammoth.documents.Text) and child.value.strip()
        for child in mammoth.transforms.get_descendants(element)
    )

def remove_empty_paragraphs(element):
    # Returning None from a paragraph transform would leave None in the
    # parent's children, so filter the parent's children instead
    children = [
        child for child in element.children
        if not is_empty_paragraph(child)
    ]
    return element.copy(children=children)

# Create the transform: it runs on every element that has children
transform = mammoth.transforms.element_of_type(
    mammoth.documents.HasChildren, remove_empty_paragraphs
)

# Apply during conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Convert Custom Styles

```python
import mammoth

def convert_custom_headings(paragraph):
    # Convert custom heading styles to standard ones
    if paragraph.style_name == "CustomHeading1":
        paragraph = paragraph.copy(style_name="Heading 1")
    elif paragraph.style_name == "CustomHeading2":
        paragraph = paragraph.copy(style_name="Heading 2")

    return paragraph

transform = mammoth.transforms.paragraph(convert_custom_headings)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Modify Text Content

```python
import mammoth

def uppercase_bold_text(run):
    if run.is_bold:
        # Transform all text children to uppercase
        new_children = []
        for child in run.children:
            if isinstance(child, mammoth.documents.Text):
                new_children.append(
                    mammoth.documents.text(child.value.upper())
                )
            else:
                new_children.append(child)

        return run.copy(children=new_children)

    return run

transform = mammoth.transforms.run(uppercase_bold_text)

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=transform
    )
```

### Complex Document Analysis

```python
import mammoth

def analyze_and_transform(document):
    # Find all headings in the document
    headings = []
    for paragraph in mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Paragraph
    ):
        if paragraph.style_name and "Heading" in paragraph.style_name:
            headings.append(paragraph)

    print(f"Found {len(headings)} headings")

    # Find all images
    images = mammoth.transforms.get_descendants_of_type(
        document, mammoth.documents.Image
    )
    print(f"Found {len(images)} images")

    # Return unchanged document
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=analyze_and_transform
    )
```

### Combining Transforms

```python
import mammoth

def remove_comment_references(element):
    # Comment references usually sit inside runs rather than directly
    # under paragraphs, so filter the children of any parent element
    new_children = []
    for child in element.children:
        if not isinstance(child, mammoth.documents.CommentReference):
            new_children.append(child)

    return element.copy(children=new_children)

def normalize_whitespace(run):
    new_children = []
    for child in run.children:
        if isinstance(child, mammoth.documents.Text):
            # Collapse runs of whitespace (this also trims both ends)
            normalized = " ".join(child.value.split())
            new_children.append(mammoth.documents.text(normalized))
        else:
            new_children.append(child)

    return run.copy(children=new_children)

def combined_transform(document):
    # Apply multiple transforms in sequence
    comment_transform = mammoth.transforms.element_of_type(
        mammoth.documents.HasChildren, remove_comment_references
    )
    whitespace_transform = mammoth.transforms.run(normalize_whitespace)

    document = comment_transform(document)
    document = whitespace_transform(document)

    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        transform_document=combined_transform
    )
```

## Factory Functions

Mammoth provides factory functions for creating document elements:

```python { .api }
def document(children, notes=None, comments=None):
    """Create Document instance."""

def paragraph(children, style_id=None, style_name=None,
              numbering=None, alignment=None, indent=None):
    """Create Paragraph instance."""

def run(children, style_id=None, style_name=None,
        is_bold=None, is_italic=None, **kwargs):
    """Create Run instance with normalized boolean fields."""

def text(value):
    """Create Text instance."""

def hyperlink(children, href=None, anchor=None, target_frame=None):
    """Create Hyperlink instance."""

def table(children, style_id=None, style_name=None):
    """Create Table instance."""
```

These factory functions can be used when creating new document elements in transforms.
