Tessl Tile for pypi/mammoth@1.10.0

tessl/pypi-mammoth

Convert Word documents from docx to simple and clean HTML and Markdown

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/mammoth@1.10.x

To install, run

npx @tessl/cli install tessl/pypi-mammoth@1.10.0

# Mammoth

A Python library that converts Microsoft Word .docx documents into clean, semantic HTML and Markdown. Mammoth preserves a document's semantic structure by mapping styled elements (such as headings, lists, and tables) to the appropriate HTML tags rather than trying to replicate exact visual formatting.

## Package Information

- **Package Name**: mammoth
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install mammoth`
- **Version**: 1.10.0
- **Python Requirements**: >= 3.7

## Core Imports

```python
import mammoth
```

Access to main conversion functions:

```python
from mammoth import convert_to_html, convert_to_markdown, extract_raw_text
```

Access to styling and transformation utilities:

```python
from mammoth import images, transforms, underline
```

Access to writers and HTML generation:

```python
from mammoth.writers import writer, formats, HtmlWriter, MarkdownWriter
from mammoth.html import text, element, tag, collapsible_element, strip_empty, collapse, write
```

## Basic Usage

```python
import mammoth

# Convert DOCX to HTML
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any conversion warnings

print(html)

# Convert DOCX to Markdown
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value  # The generated Markdown

# Extract plain text only
with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value  # Plain text content
```

## Architecture

Mammoth processes DOCX documents through a well-defined pipeline:

- **Document Reading**: Parses DOCX files into an internal document object model
- **Style Mapping**: Applies style mappings to convert Word styles to HTML elements
- **Transformation**: Applies document transformations before conversion
- **HTML/Markdown Generation**: Renders final output using specialized writers
- **Result Handling**: Returns structured Result objects with content and messages

The library can be customized through style maps, image handlers, and document transformers while still producing clean, semantic output.

## Capabilities

### Document Conversion

Core conversion functions for transforming DOCX files to HTML and Markdown formats with comprehensive options for customization and style mapping.

```python { .api }
def convert_to_html(fileobj, **kwargs):
    """Convert DOCX file to HTML format."""

def convert_to_markdown(fileobj, **kwargs):
    """Convert DOCX file to Markdown format."""

def convert(fileobj, transform_document=None, id_prefix=None,
            include_embedded_style_map=True, **kwargs):
    """Core conversion function with full parameter control."""

def extract_raw_text(fileobj):
    """Extract plain text from DOCX file."""
```

[Document Conversion](./conversion.md)

### Writers and Output Generation

Writer system for generating HTML and Markdown output with flexible interfaces for custom rendering and output format creation.

```python { .api }
def writer(output_format=None):
    """Create writer instance for specified output format."""

def formats():
    """Get available output format keys."""

class HtmlWriter:
    """HTML writer for generating HTML output."""

class MarkdownWriter:
    """Markdown writer for generating Markdown output."""
```

[Writers and Output Generation](./writers.md)

### Image Handling

Functions for processing and converting images embedded in DOCX documents, including data URI conversion and custom image handling.

```python { .api }
def img_element(func):
    """Decorator for creating image conversion functions."""

def data_uri(image):
    """Convert images to base64 data URIs."""
```

[Image Handling](./images.md)

### Document Transformation

Utilities for transforming document elements before conversion, allowing for custom processing of paragraphs, runs, and other document components.

```python { .api }
def paragraph(transform_paragraph):
    """Create transform for paragraph elements."""

def run(transform_run):
    """Create transform for run elements."""

def element_of_type(element_type, transform):
    """Create transform for specific element types."""
```

[Document Transformation](./transforms.md)

### Style System

Comprehensive style mapping system for converting Word document styles to HTML elements, including parsers and matchers for complex styling rules.

```python { .api }
def embed_style_map(fileobj, style_map):
    """Embed style map into DOCX file."""

def read_embedded_style_map(fileobj):
    """Read embedded style map from DOCX file."""
```

[Style System](./styles.md)

### HTML Element Creation

Core functions for creating and manipulating HTML elements, nodes, and structures during document conversion.

```python { .api }
def text(value):
    """Create a text node with specified value."""

def element(tag_names, attributes=None, children=None, collapsible=None, separator=None):
    """Create HTML element with tag, attributes, and children."""

def tag(tag_names, attributes=None, collapsible=None, separator=None):
    """Create HTML tag definition."""

def collapsible_element(tag_names, attributes=None, children=None):
    """Create collapsible HTML element."""

def strip_empty(nodes):
    """Remove empty nodes from node list."""

def collapse(nodes):
    """Collapse adjacent similar nodes."""

def write(writer, nodes):
    """Write nodes using specified writer."""
```

### Underline Handling

Functions for converting underline formatting to custom HTML elements.

```python { .api }
def element(name):
    """Create underline converter that wraps content in specified HTML element."""
```

### Command-Line Interface

Command-line tool for converting DOCX files with support for various output formats and options.

```python { .api }
def main():
    """Command-line interface entry point."""

class ImageWriter:
    """Handles writing images to separate files in output directory."""

    def __init__(self, output_dir):
        """Initialize with output directory path."""

    def __call__(self, element):
        """Write image element to file and return attributes."""
```

Console command: `mammoth <docx-path> [output-path] [options]`

Arguments:
- `docx-path`: Path to the .docx file to convert
- `output-path`: Optional output path for generated document (writes to stdout if not specified)

Options:
- `--output-dir`: Output directory for generated HTML and images (mutually exclusive with output-path)
- `--output-format`: Output format (choices: html, markdown)
- `--style-map`: File containing a style map

## Types

```python { .api }
from typing import Any

class Result:
    """Container for operation results with messages."""
    value: Any  # The result value
    messages: list  # List of warning/error messages

    def map(self, func):
        """Transform the value."""

    def bind(self, func):
        """Chain operations that return Results."""

class Message:
    """Warning/error message structure."""
    type: str  # Message type
    message: str  # Message content

def warning(message):
    """Create a warning message."""

def success(value):
    """Create a successful Result with no messages."""

def combine(results):
    """Combine multiple Results into one."""

# HTML Node Types

class Node:
    """Base class for all HTML nodes."""

class TextNode(Node):
    """Text content node."""
    value: str  # Text content

class Tag:
    """HTML tag definition."""
    tag_names: list  # List of tag names
    attributes: dict  # HTML attributes
    collapsible: bool  # Whether tag can be collapsed
    separator: str  # Separator for multiple tags

    @property
    def tag_name(self):
        """Get primary tag name."""

class Element(Node):
    """HTML element node with tag and children."""
    tag: Tag  # Tag definition
    children: list  # Child nodes

    @property
    def tag_name(self):
        """Get primary tag name."""

    @property
    def tag_names(self):
        """Get all tag names."""

    @property
    def attributes(self):
        """Get HTML attributes."""

    @property
    def collapsible(self):
        """Check if element is collapsible."""

    def is_void(self):
        """Check if element is void (self-closing)."""

class ForceWrite(Node):
    """Special node that forces writing even if empty."""

class NodeVisitor:
    """Base class for visiting HTML nodes."""
```