# Mammoth

A Python library that converts Microsoft Word .docx documents into clean, semantic HTML and Markdown. Mammoth preserves the semantic structure of a document by mapping styled elements (such as headings, lists, and tables) to appropriate HTML tags rather than trying to replicate the exact visual formatting.

## Package Information

- **Package Name**: mammoth
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install mammoth`
- **Version**: 1.10.0
- **Python Requirements**: >= 3.7

## Core Imports

```python
import mammoth
```

Access to main conversion functions:

```python
from mammoth import convert_to_html, convert_to_markdown, extract_raw_text
```

Access to styling and transformation utilities:

```python
from mammoth import images, transforms, underline
```

Access to writers and HTML generation:

```python
from mammoth.writers import writer, formats, HtmlWriter, MarkdownWriter
from mammoth.html import text, element, tag, collapsible_element, strip_empty, collapse, write
```

## Basic Usage

```python
import mammoth

# Convert DOCX to HTML
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any conversion warnings

print(html)

# Convert DOCX to Markdown
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value  # The generated Markdown

# Extract plain text only
with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value  # Plain text content
```

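A conversion can succeed and still report problems in `result.messages`. The sketch below groups messages by type so a caller can decide whether to log them or fail. It assumes only that each message exposes the `type` and `message` attributes described under Types; `group_messages` and `format_report` are hypothetical helper names, not part of mammoth.

```python
from collections import defaultdict


def group_messages(messages):
    """Group message texts by message type (for example "warning")."""
    grouped = defaultdict(list)
    for message in messages:
        # Each message is expected to expose .type and .message.
        grouped[message.type].append(message.message)
    return dict(grouped)


def format_report(messages):
    """Render one line per message, e.g. for logging after a conversion."""
    return "\n".join(
        "[{0}] {1}".format(message.type, message.message) for message in messages
    )
```

With a `result` from one of the conversions above, `group_messages(result.messages).get("warning", [])` would give the warning texts.
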
## Architecture

Mammoth processes DOCX documents through a pipeline of five stages:

- **Document Reading**: Parses DOCX files into an internal document object model
- **Style Mapping**: Applies style mappings to convert Word styles to HTML elements
- **Transformation**: Applies document transformations before conversion
- **HTML/Markdown Generation**: Renders final output using specialized writers
- **Result Handling**: Returns structured Result objects with content and messages

Style maps, image handlers, and document transformers let callers customize these stages while keeping the output clean and semantic.

## Capabilities

### Document Conversion

Core conversion functions for transforming DOCX files to HTML and Markdown, with options for customization and style mapping.

```python { .api }
def convert_to_html(fileobj, **kwargs):
    """Convert DOCX file to HTML format."""

def convert_to_markdown(fileobj, **kwargs):
    """Convert DOCX file to Markdown format."""

def convert(fileobj, transform_document=None, id_prefix=None,
            include_embedded_style_map=True, **kwargs):
    """Core conversion function with full parameter control."""

def extract_raw_text(fileobj):
    """Extract plain text from DOCX file."""
```

[Document Conversion](./conversion.md)

### Writers and Output Generation

Writer system in `mammoth.writers` for generating HTML and Markdown output, with interfaces for custom rendering and output formats.

```python { .api }
def writer(output_format=None):
    """Create writer instance for specified output format."""

def formats():
    """Get available output format keys."""

class HtmlWriter:
    """HTML writer for generating HTML output."""

class MarkdownWriter:
    """Markdown writer for generating Markdown output."""
```

[Writers and Output Generation](./writers.md)

### Image Handling

Functions in `mammoth.images` for processing and converting images embedded in DOCX documents, including data URI conversion and custom image handling.

```python { .api }
def img_element(func):
    """Decorator for creating image conversion functions."""

def data_uri(image):
    """Convert images to base64 data URIs."""
```

[Image Handling](./images.md)

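As a sketch of custom image handling: `img_element` wraps a function that receives an image and returns a dictionary of attributes for the generated `<img>` element. The example assumes, following the pattern in mammoth's README, that the image object offers `open()` and `content_type` and that `convert_to_html` accepts a `convert_image` keyword. `image_to_attributes` and `convert_with_inline_images` are names chosen here for illustration.

```python
import base64


def image_to_attributes(image):
    """Return <img> attributes that inline the image as a base64 data URI."""
    # image.open() is assumed to return a binary file-like context manager.
    with image.open() as image_bytes:
        encoded = base64.b64encode(image_bytes.read()).decode("ascii")
    return {"src": "data:{0};base64,{1}".format(image.content_type, encoded)}


def convert_with_inline_images(docx_path):
    """Convert a .docx file to HTML using the handler above."""
    import mammoth  # imported lazily so image_to_attributes works without it

    with open(docx_path, "rb") as docx_file:
        return mammoth.convert_to_html(
            docx_file,
            convert_image=mammoth.images.img_element(image_to_attributes),
        )
```

This handler produces the same kind of output that `data_uri` is described as producing. A real handler might instead write the bytes to disk and return a file path as `src`.
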
### Document Transformation

Utilities in `mammoth.transforms` for transforming document elements before conversion, allowing custom processing of paragraphs, runs, and other document components.

```python { .api }
def paragraph(transform_paragraph):
    """Create transform for paragraph elements."""

def run(transform_run):
    """Create transform for run elements."""

def element_of_type(element_type, transform):
    """Create transform for specific element types."""
```

[Document Transformation](./transforms.md)

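For instance, a paragraph transform can promote center-aligned, unstyled paragraphs to headings before style mapping runs. This follows the pattern shown in mammoth's README and assumes that paragraph elements expose `alignment`, `style_id`, and a `copy()` method accepting replacement fields. `promote_centered`, `convert_with_transform`, and the `Heading2` style ID are illustrative.

```python
def promote_centered(paragraph):
    """Give center-aligned paragraphs without a style the Heading2 style ID."""
    if paragraph.alignment == "center" and not paragraph.style_id:
        # copy() is assumed to return a new element with the given fields replaced.
        return paragraph.copy(style_id="Heading2")
    return paragraph


def convert_with_transform(docx_path):
    """Convert a .docx file to HTML, applying promote_centered to every paragraph."""
    import mammoth  # imported lazily so promote_centered can be tested on its own

    transform_document = mammoth.transforms.paragraph(promote_centered)
    with open(docx_path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file, transform_document=transform_document)
```

Returning the original element unchanged when no rule applies keeps the transform safe to run over every paragraph in the document.
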
### Style System

Style mapping system for converting Word document styles to HTML elements, including parsers and matchers for complex styling rules.

```python { .api }
def embed_style_map(fileobj, style_map):
    """Embed style map into DOCX file."""

def read_embedded_style_map(fileobj):
    """Read embedded style map from DOCX file."""
```

[Style System](./styles.md)

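A style map is plain text with one rule per line, each mapping a document matcher to an HTML path. The sketch below assembles one and passes it to `convert_to_html` through the `style_map` keyword. The rule syntax follows mammoth's README; the style names (`Section Title`, `Subsection Title`) and the helper names are made up for illustration.

```python
def build_style_map(rules):
    """Join (document matcher, HTML path) pairs into style map text."""
    return "\n".join("{0} => {1}".format(matcher, path) for matcher, path in rules)


STYLE_MAP = build_style_map([
    ("p[style-name='Section Title']", "h1:fresh"),
    ("p[style-name='Subsection Title']", "h2:fresh"),
])


def convert_with_style_map(docx_path, style_map=STYLE_MAP):
    """Convert a .docx file to HTML using a custom style map."""
    import mammoth  # imported lazily so build_style_map has no dependencies

    with open(docx_path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file, style_map=style_map)
```

The same text could presumably be stored in the document itself with `embed_style_map`, so that later conversions pick it up without passing `style_map` explicitly.
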
### HTML Element Creation

Core functions in `mammoth.html` for creating and manipulating HTML elements, nodes, and structures during document conversion.

```python { .api }
def text(value):
    """Create a text node with specified value."""

def element(tag_names, attributes=None, children=None, collapsible=None, separator=None):
    """Create HTML element with tag, attributes, and children."""

def tag(tag_names, attributes=None, collapsible=None, separator=None):
    """Create HTML tag definition."""

def collapsible_element(tag_names, attributes=None, children=None):
    """Create collapsible HTML element."""

def strip_empty(nodes):
    """Remove empty nodes from node list."""

def collapse(nodes):
    """Collapse adjacent similar nodes."""

def write(writer, nodes):
    """Write nodes using specified writer."""
```

### Underline Handling

Functions in `mammoth.underline` for converting underline formatting to custom HTML elements. Note that this `element` is distinct from `mammoth.html.element`.

```python { .api }
def element(name):
    """Create underline converter that wraps content in specified HTML element."""
```

### Command-Line Interface

Command-line tool for converting DOCX files with support for various output formats and options.

```python { .api }
def main():
    """Command-line interface entry point."""

class ImageWriter:
    """Handles writing images to separate files in output directory."""

    def __init__(self, output_dir):
        """Initialize with output directory path."""

    def __call__(self, element):
        """Write image element to file and return attributes."""
```

Console command: `mammoth <docx-path> [output-path] [options]`

Arguments:
- `docx-path`: Path to the .docx file to convert
- `output-path`: Optional output path for generated document (writes to stdout if not specified)

Options:
- `--output-dir`: Output directory for generated HTML and images (mutually exclusive with output-path)
- `--output-format`: Output format (choices: html, markdown)
- `--style-map`: File containing a style map

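The argument rules above can be sketched as a small helper that builds the command line for `subprocess`. `build_mammoth_command` and `run_mammoth` are hypothetical wrappers; only the `mammoth` command, its positional arguments, and the three options listed above come from this document.

```python
import subprocess


def build_mammoth_command(docx_path, output_path=None, output_dir=None,
                          output_format=None, style_map=None):
    """Return the argv list for the mammoth console command."""
    if output_path is not None and output_dir is not None:
        # The CLI treats output-path and --output-dir as mutually exclusive.
        raise ValueError("output_path and output_dir are mutually exclusive")
    command = ["mammoth", docx_path]
    if output_path is not None:
        command.append(output_path)
    if output_dir is not None:
        command.append("--output-dir={0}".format(output_dir))
    if output_format is not None:
        command.append("--output-format={0}".format(output_format))
    if style_map is not None:
        command.append("--style-map={0}".format(style_map))
    return command


def run_mammoth(*args, **kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_mammoth_command(*args, **kwargs), check=True)
```

For example, `build_mammoth_command("report.docx", output_dir="out", output_format="markdown")` yields the arguments for a Markdown conversion that writes into `out`.
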
## Types

```python { .api }
from typing import Any

class Result:
    """Container for operation results with messages."""
    value: Any  # The result value
    messages: list  # List of warning/error messages

    def map(self, func):
        """Transform the value."""

    def bind(self, func):
        """Chain operations that return Results."""

class Message:
    """Warning/error message structure."""
    type: str  # Message type
    message: str  # Message content

def warning(message):
    """Create a warning message."""

def success(value):
    """Create a successful Result with no messages."""

def combine(results):
    """Combine multiple Results into one."""

# HTML Node Types

class Node:
    """Base class for all HTML nodes."""

class TextNode(Node):
    """Text content node."""
    value: str  # Text content

class Tag:
    """HTML tag definition."""
    tag_names: list  # List of tag names
    attributes: dict  # HTML attributes
    collapsible: bool  # Whether tag can be collapsed
    separator: str  # Separator for multiple tags

    @property
    def tag_name(self):
        """Get primary tag name."""

class Element(Node):
    """HTML element node with tag and children."""
    tag: Tag  # Tag definition
    children: list  # Child nodes

    @property
    def tag_name(self):
        """Get primary tag name."""

    @property
    def tag_names(self):
        """Get all tag names."""

    @property
    def attributes(self):
        """Get HTML attributes."""

    @property
    def collapsible(self):
        """Check if element is collapsible."""

    def is_void(self):
        """Check if element is void (self-closing)."""

class ForceWrite(Node):
    """Special node that forces writing even if empty."""

class NodeVisitor:
    """Base class for visiting HTML nodes."""
```
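
To make the `map`/`bind` contract concrete, here is a simplified stand-in written for this document, not mammoth's own code. It assumes that `map` transforms the value and keeps the messages, and that `bind` concatenates the messages of both results. The `SketchResult`, `SketchMessage`, and `strip_and_warn` names are hypothetical.

```python
from collections import namedtuple

SketchMessage = namedtuple("SketchMessage", ["type", "message"])


class SketchResult:
    """Simplified stand-in for a Result: a value plus accumulated messages."""

    def __init__(self, value, messages=()):
        self.value = value
        self.messages = list(messages)

    def map(self, func):
        # Transform the value; messages are carried over unchanged.
        return SketchResult(func(self.value), self.messages)

    def bind(self, func):
        # func returns another SketchResult; messages from both are kept, in order.
        inner = func(self.value)
        return SketchResult(inner.value, self.messages + inner.messages)


def strip_and_warn(text):
    """Example step that returns a result and may add a warning."""
    stripped = text.strip()
    messages = []
    if stripped != text:
        messages.append(SketchMessage("warning", "whitespace trimmed"))
    return SketchResult(stripped, messages)


result = SketchResult(" <p>Hi</p> ").bind(strip_and_warn).map(len)
print(result.value)     # length of the stripped string
print(result.messages)  # the warning added by strip_and_warn
```

Chaining through `bind` means warnings from every step travel with the final value, so no step needs to raise or log on its own.
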