# Document Conversion

Core conversion functions for transforming DOCX files to HTML and Markdown formats. These functions provide comprehensive options for customization, style mapping, and output control.

## Capabilities

### HTML Conversion

Converts DOCX documents to clean, semantic HTML with support for headings, lists, tables, images, and extensive formatting options.

```python { .api }
def convert_to_html(fileobj, **kwargs):
    """
    Convert DOCX file to HTML format.

    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)
    - style_map: str, custom style mapping rules
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, whether to skip empty paragraphs (default: True)
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, use embedded style maps (default: True)
    - include_default_style_map: bool, use built-in style mappings (default: True)

    Returns:
    Result object with .value (HTML string) and .messages (list of warnings)
    """
```

Usage example:

```python
import mammoth

# Basic HTML conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

# HTML conversion with custom options
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        style_map="p.Heading1 => h1.custom-heading",
        id_prefix="doc-",
        ignore_empty_paragraphs=False
    )
```
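
A style map can hold several rules, one per line. As a minimal sketch, the helper below assembles such a string from a mapping before it is passed as `style_map`. The names `build_style_map` and `convert_with_rules` and the example rules are illustrative assumptions, not part of mammoth; `mammoth` is imported inside the function so that the string-building part can be tried without the library installed.

```python
def build_style_map(rules):
    """Join {docx_matcher: html_path} pairs into a newline-separated style map."""
    return "\n".join(f"{source} => {target}" for source, target in rules.items())


def convert_with_rules(path, rules):
    """Convert the DOCX at `path` to HTML using `rules` (hypothetical helper)."""
    import mammoth  # imported lazily; requires the mammoth package

    with open(path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file, style_map=build_style_map(rules))


# Example rules (assumed style names; adjust to the styles in your document)
RULES = {
    "p.Heading1": "h1.custom-heading",
    "p.Quote": "blockquote",
}
print(build_style_map(RULES))
```

Keeping the rules in a mapping makes them easier to review and extend than one long string literal.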

### Markdown Conversion

Converts DOCX documents to clean Markdown format, preserving document structure and formatting in Markdown syntax.

```python { .api }
def convert_to_markdown(fileobj, **kwargs):
    """
    Convert DOCX file to Markdown format.

    Parameters: Same as convert_to_html()

    Returns:
    Result object with .value (Markdown string) and .messages (list of warnings)
    """
```

Usage example:

```python
import mammoth

# Basic Markdown conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value

# Check for conversion warnings
if result.messages:
    for message in result.messages:
        print(f"{message.type}: {message.message}")
```

### Core Conversion Function

The underlying conversion function with full parameter control, supporting both HTML and Markdown output formats.

```python { .api }
def convert(fileobj, transform_document=None, id_prefix=None,
            include_embedded_style_map=True, **kwargs):
    """
    Core conversion function with full parameter control.

    Parameters:
    - fileobj: File object containing DOCX data
    - transform_document: function, transforms document before conversion
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, whether to use embedded style maps
    - output_format: str, "html" or "markdown"
    - style_map: str, custom style mapping string
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, skip empty paragraphs (default: True)
    - include_default_style_map: bool, use built-in styles (default: True)

    Returns:
    Result object with converted content and messages
    """
```

Usage example:

```python
import mammoth

def custom_transform(document):
    # Custom document transformation
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert(
        docx_file,
        output_format="html",
        transform_document=custom_transform,
        style_map="p.CustomStyle => div.special"
    )
```
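
Because `convert()` takes the format as an argument, one code path can serve both outputs. The sketch below picks `output_format` from the target file's extension. The extension table and the names `output_format_for` and `convert_file` are assumptions made for illustration; `mammoth` is imported lazily inside `convert_file`.

```python
from pathlib import Path

# Assumed mapping from target extension to the output_format values listed above
FORMATS = {".html": "html", ".htm": "html", ".md": "markdown"}


def output_format_for(target):
    """Return "html" or "markdown" for a target path, based on its extension."""
    suffix = Path(target).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported target extension: {suffix!r}")
    return FORMATS[suffix]


def convert_file(source, target):
    """Convert `source` (DOCX) and write the result to `target` (hypothetical helper)."""
    import mammoth  # imported lazily; requires the mammoth package

    with open(source, "rb") as docx_file:
        result = mammoth.convert(docx_file, output_format=output_format_for(target))
    Path(target).write_text(result.value, encoding="utf-8")
    return result.messages
```

Calling `convert_file("document.docx", "document.md")` would then write Markdown, while a `.html` target would write HTML.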

### Text Extraction

Extracts plain text content from DOCX documents without formatting, useful for text analysis and processing.

```python { .api }
def extract_raw_text(fileobj):
    """
    Extract plain text from DOCX file.

    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)

    Returns:
    Result object with .value (plain text string) and .messages (list)
    """
```

Usage example:

```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value
    print(text)  # Plain text content
```
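
As one example of the text analysis mentioned above, the extracted string can be fed straight into standard-library tools. This is a rough sketch: `word_frequencies` and `top_words_in_docx` are hypothetical helpers, and the `\w+` tokenization is a deliberately simple assumption rather than a proper tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text, top=5):
    """Return the `top` most common lowercase words in `text` as (word, count) pairs."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top)


def top_words_in_docx(path, top=5):
    """Extract raw text from the DOCX at `path` and count its words (hypothetical helper)."""
    import mammoth  # imported lazily; requires the mammoth package

    with open(path, "rb") as docx_file:
        result = mammoth.extract_raw_text(docx_file)
    return word_frequencies(result.value, top)
```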

## Supported Options

`convert()`, `convert_to_html()`, and `convert_to_markdown()` accept these common options; `extract_raw_text()` takes only the file object:

- **style_map**: Custom style mapping rules as a string
- **embedded_style_map**: Style map extracted from the DOCX file itself
- **include_default_style_map**: Whether to include built-in style mappings (default: True)
- **ignore_empty_paragraphs**: Whether to skip empty paragraph elements (default: True)
- **convert_image**: Custom function for handling image conversion
- **output_format**: Target format ("html" or "markdown"); pass it to `convert()`, since the format-specific functions choose it for you
- **id_prefix**: Prefix for generated HTML element IDs
165
## Error Handling
166
167
All conversion functions return Result objects that contain both the converted content and any warnings or errors encountered during processing:
168
169
```python
170
result = mammoth.convert_to_html(docx_file)
171
172
# Access the converted content
173
html = result.value
174
175
# Check for warnings or errors
176
for message in result.messages:
177
if message.type == "error":
178
print(f"Error: {message.message}")
179
elif message.type == "warning":
180
print(f"Warning: {message.message}")
181
```
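
When conversions run unattended, it can help to group messages by type and fail fast instead of printing them. The sketch below assumes only what is described above, namely that each message has `.type` and `.message`. `group_messages` and `require_clean` are hypothetical helpers, and the `Message` and `Result` namedtuples are stand-ins for mammoth's objects, used here purely for demonstration.

```python
from collections import defaultdict, namedtuple


def group_messages(messages):
    """Group message texts by their type, e.g. {"warning": [...], "error": [...]}."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message.type].append(message.message)
    return dict(grouped)


def require_clean(result, allow_warnings=True):
    """Return result.value, raising ValueError on errors (and on warnings if disallowed)."""
    grouped = group_messages(result.messages)
    if grouped.get("error") or (grouped.get("warning") and not allow_warnings):
        raise ValueError(f"conversion reported problems: {grouped}")
    return result.value


# Stand-ins for the result and message objects, for demonstration only
Message = namedtuple("Message", ["type", "message"])
Result = namedtuple("Result", ["value", "messages"])

demo = Result("<p>Hi</p>", [Message("warning", "Unrecognised paragraph style")])
print(group_messages(demo.messages))
print(require_clean(demo))
```

With `allow_warnings=False`, the same `demo` result would raise, which suits pipelines where unmapped styles should block publication.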