tessl/pypi-mammoth

Convert Word documents from docx to simple and clean HTML and Markdown

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pypipkg:pypi/mammoth@1.10.x

To install, run

npx @tessl/cli install tessl/pypi-mammoth@1.10.0

# Mammoth

Mammoth is a Python library that converts Microsoft Word .docx documents into clean, semantic HTML and Markdown. It focuses on preserving the semantic structure of a document: styled elements such as headings, lists, and tables are converted to the appropriate HTML tags rather than reproducing the exact visual formatting.

## Package Information

- **Package Name**: mammoth
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install mammoth`
- **Version**: 1.10.0
- **Python Requirements**: >= 3.7

## Core Imports

```python
import mammoth
```

Access to main conversion functions:

```python
from mammoth import convert_to_html, convert_to_markdown, extract_raw_text
```

Access to styling and transformation utilities:

```python
from mammoth import images, transforms, underline
```

Access to writers and HTML generation:

```python
from mammoth.writers import writer, formats, HtmlWriter, MarkdownWriter
from mammoth.html import text, element, tag, collapsible_element, strip_empty, collapse, write
```

## Basic Usage

```python
import mammoth

# Convert DOCX to HTML
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any conversion warnings

print(html)

# Convert DOCX to Markdown
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value  # The generated Markdown

# Extract plain text only
with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value  # Plain text content
```
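
Each entry in `result.messages` carries a `type` and a `message` (see the Types section). The following is a minimal sketch of separating warnings from other messages. `FakeMessage` and `split_messages` are hypothetical names introduced here so the example runs without a .docx file, and the sample message texts and the `"error"` type are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the message objects in result.messages,
# which expose `type` and `message` attributes.
FakeMessage = namedtuple("FakeMessage", ["type", "message"])


def split_messages(messages):
    """Return (warnings, others) as two lists of message strings."""
    warnings = [m.message for m in messages if m.type == "warning"]
    others = [m.message for m in messages if m.type != "warning"]
    return warnings, others


# Made-up sample data; real messages come from a conversion result.
sample = [
    FakeMessage("warning", "Unrecognised paragraph style: Fancy"),
    FakeMessage("error", "Something else went wrong"),
]
warnings, others = split_messages(sample)
print(warnings)
print(others)
```

In real use, `split_messages(result.messages)` would be called on the result of any of the conversion functions above.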

## Architecture

Mammoth processes DOCX documents through a well-defined pipeline:

- **Document Reading**: Parses DOCX files into an internal document object model
- **Style Mapping**: Applies style mappings to convert Word styles to HTML elements
- **Transformation**: Applies document transformations before conversion
- **HTML/Markdown Generation**: Renders final output using specialized writers
- **Result Handling**: Returns structured Result objects with content and messages

The library supports extensive customization through style maps, image handlers, and document transformers, making it highly adaptable for different use cases while maintaining clean, semantic output.

## Capabilities

### Document Conversion

Core conversion functions for transforming DOCX files to HTML and Markdown formats with comprehensive options for customization and style mapping.

```python { .api }
def convert_to_html(fileobj, **kwargs):
    """Convert DOCX file to HTML format."""

def convert_to_markdown(fileobj, **kwargs):
    """Convert DOCX file to Markdown format."""

def convert(fileobj, transform_document=None, id_prefix=None,
            include_embedded_style_map=True, **kwargs):
    """Core conversion function with full parameter control."""

def extract_raw_text(fileobj):
    """Extract plain text from DOCX file."""
```

[Document Conversion](./conversion.md)

### Writers and Output Generation

Writer system for generating HTML and Markdown output with flexible interfaces for custom rendering and output format creation.

```python { .api }
def writer(output_format=None):
    """Create writer instance for specified output format."""

def formats():
    """Get available output format keys."""

class HtmlWriter:
    """HTML writer for generating HTML output."""

class MarkdownWriter:
    """Markdown writer for generating Markdown output."""
```

[Writers and Output Generation](./writers.md)

### Image Handling

Functions for processing and converting images embedded in DOCX documents, including data URI conversion and custom image handling.

```python { .api }
def img_element(func):
    """Decorator for creating image conversion functions."""

def data_uri(image):
    """Convert images to base64 data URIs."""
```

[Image Handling](./images.md)
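
As an illustration of a custom image handler, here is a minimal sketch of the kind of function that `img_element` wraps: it receives an image object and returns a dictionary of `<img>` attributes. It assumes the image object provides `open()` and `content_type`, as in mammoth's README; `image_to_attributes` and `FakeImage` are hypothetical names, and `FakeImage` exists only so the sketch runs without a .docx file.

```python
import base64
import io


def image_to_attributes(image):
    """Return <img> attributes with the image inlined as a base64 data URI."""
    with image.open() as image_bytes:
        encoded = base64.b64encode(image_bytes.read()).decode("ascii")
    return {"src": "data:{0};base64,{1}".format(image.content_type, encoded)}


class FakeImage:
    """Hypothetical stand-in for the image object passed to the handler."""

    content_type = "image/png"

    def open(self):
        # Three placeholder bytes instead of real image data.
        return io.BytesIO(b"abc")


attributes = image_to_attributes(FakeImage())
print(attributes["src"])

# With mammoth installed, the handler would be registered roughly like this:
#     mammoth.convert_to_html(
#         docx_file,
#         convert_image=mammoth.images.img_element(image_to_attributes),
#     )
```

Returning a different `src` (for example, a path to a file written to disk) follows the same shape: the handler only has to return the attributes for the generated `<img>` element.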

### Document Transformation

Utilities for transforming document elements before conversion, allowing for custom processing of paragraphs, runs, and other document components.

```python { .api }
def paragraph(transform_paragraph):
    """Create transform for paragraph elements."""

def run(transform_run):
    """Create transform for run elements."""

def element_of_type(element_type, transform):
    """Create transform for specific element types."""
```

[Document Transformation](./transforms.md)
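
To make the idea of a paragraph transform concrete, the sketch below shows a function of the kind that would be passed to `paragraph(...)`: it takes a paragraph element and returns either the same element or a modified copy. It assumes paragraphs expose `alignment`, `style_id`, and `copy(...)`, as in mammoth's README; `FakeParagraph` and `promote_centered_paragraph` are hypothetical names, and `FakeParagraph` only stands in for a real document element.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FakeParagraph:
    """Hypothetical stand-in for a mammoth paragraph element."""

    alignment: Optional[str] = None
    style_id: Optional[str] = None

    def copy(self, **changes):
        # Return a new paragraph with the given fields replaced.
        return replace(self, **changes)


def promote_centered_paragraph(paragraph):
    """Treat centre-aligned, unstyled paragraphs as second-level headings."""
    if paragraph.alignment == "center" and not paragraph.style_id:
        return paragraph.copy(style_id="Heading2")
    return paragraph


centered = promote_centered_paragraph(FakeParagraph(alignment="center"))
styled = promote_centered_paragraph(FakeParagraph(alignment="center", style_id="Quote"))
print(centered.style_id, styled.style_id)

# With mammoth installed, the transform would be applied roughly like this:
#     transform = mammoth.transforms.paragraph(promote_centered_paragraph)
#     mammoth.convert_to_html(docx_file, transform_document=transform)
```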

### Style System

Comprehensive style mapping system for converting Word document styles to HTML elements, including parsers and matchers for complex styling rules.

```python { .api }
def embed_style_map(fileobj, style_map):
    """Embed style map into DOCX file."""

def read_embedded_style_map(fileobj):
    """Read embedded style map from DOCX file."""
```

[Style System](./styles.md)
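
A style map is a plain string with one mapping per line, such as `p[style-name='Section Title'] => h1:fresh`. The following sketch builds such a string from a dictionary of paragraph style names; `build_style_map` is a hypothetical helper, not part of mammoth, and the style names are examples only.

```python
def build_style_map(paragraph_styles):
    """Build a style map string from {Word style name: HTML target} pairs."""
    lines = [
        "p[style-name='{0}'] => {1}".format(style_name, target)
        for style_name, target in paragraph_styles.items()
    ]
    return "\n".join(lines)


style_map = build_style_map({
    "Section Title": "h1:fresh",
    "Subsection Title": "h2:fresh",
})
print(style_map)

# With mammoth installed, the string could be used for a single conversion:
#     mammoth.convert_to_html(docx_file, style_map=style_map)
# or stored in the document itself with embed_style_map(fileobj, style_map).
```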

### HTML Element Creation

Core functions for creating and manipulating HTML elements, nodes, and structures during document conversion.

```python { .api }
def text(value):
    """Create a text node with specified value."""

def element(tag_names, attributes=None, children=None, collapsible=None, separator=None):
    """Create HTML element with tag, attributes, and children."""

def tag(tag_names, attributes=None, collapsible=None, separator=None):
    """Create HTML tag definition."""

def collapsible_element(tag_names, attributes=None, children=None):
    """Create collapsible HTML element."""

def strip_empty(nodes):
    """Remove empty nodes from node list."""

def collapse(nodes):
    """Collapse adjacent similar nodes."""

def write(writer, nodes):
    """Write nodes using specified writer."""
```

### Underline Handling

Functions for converting underline formatting to custom HTML elements.

```python { .api }
def element(name):
    """Create underline converter that wraps content in specified HTML element."""
```

### Command-Line Interface

Command-line tool for converting DOCX files with support for various output formats and options.

```python { .api }
def main():
    """Command-line interface entry point."""

class ImageWriter:
    """Handles writing images to separate files in output directory."""

    def __init__(self, output_dir):
        """Initialize with output directory path."""

    def __call__(self, element):
        """Write image element to file and return attributes."""
```

Console command: `mammoth <docx-path> [output-path] [options]`

Arguments:

- `docx-path`: Path to the .docx file to convert
- `output-path`: Optional output path for generated document (writes to stdout if not specified)

Options:

- `--output-dir`: Output directory for generated HTML and images (mutually exclusive with output-path)
- `--output-format`: Output format (choices: html, markdown)
- `--style-map`: File containing a style map
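
For scripted use, the console command can also be driven from Python. The sketch below assembles the argument list from the arguments and options listed above and runs it with `subprocess`. `build_mammoth_command` and `run_mammoth` are hypothetical helpers, and actually running the command assumes the `mammoth` executable is on `PATH`.

```python
import subprocess


def build_mammoth_command(docx_path, output_path=None, output_dir=None,
                          output_format=None, style_map=None):
    """Assemble the argument list for the `mammoth` console command."""
    if output_path is not None and output_dir is not None:
        raise ValueError("output_path and output_dir are mutually exclusive")
    command = ["mammoth", docx_path]
    if output_path is not None:
        command.append(output_path)
    if output_dir is not None:
        command.append("--output-dir={0}".format(output_dir))
    if output_format is not None:
        command.append("--output-format={0}".format(output_format))
    if style_map is not None:
        command.append("--style-map={0}".format(style_map))
    return command


def run_mammoth(*args, **kwargs):
    """Run the command; requires the `mammoth` executable on PATH."""
    return subprocess.run(build_mammoth_command(*args, **kwargs), check=True)


# Only build the command here; nothing is executed.
command = build_mammoth_command("document.docx", output_format="markdown")
print(command)
```

Keeping command construction separate from execution makes the mutual-exclusion rule easy to check without touching the file system.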

## Types

```python { .api }
class Result:
    """Container for operation results with messages."""
    value: any  # The result value
    messages: list  # List of warning/error messages

    def map(self, func):
        """Transform the value."""

    def bind(self, func):
        """Chain operations that return Results."""

class Message:
    """Warning/error message structure."""
    type: str  # Message type
    message: str  # Message content

def warning(message):
    """Create a warning message."""

def success(value):
    """Create a successful Result with no messages."""

def combine(results):
    """Combine multiple Results into one."""

# HTML Node Types

class Node:
    """Base class for all HTML nodes."""

class TextNode(Node):
    """Text content node."""
    value: str  # Text content

class Tag:
    """HTML tag definition."""
    tag_names: list  # List of tag names
    attributes: dict  # HTML attributes
    collapsible: bool  # Whether tag can be collapsed
    separator: str  # Separator for multiple tags

    @property
    def tag_name(self):
        """Get primary tag name."""

class Element(Node):
    """HTML element node with tag and children."""
    tag: Tag  # Tag definition
    children: list  # Child nodes

    @property
    def tag_name(self):
        """Get primary tag name."""

    @property
    def tag_names(self):
        """Get all tag names."""

    @property
    def attributes(self):
        """Get HTML attributes."""

    @property
    def collapsible(self):
        """Check if element is collapsible."""

    def is_void(self):
        """Check if element is void (self-closing)."""

class ForceWrite(Node):
    """Special node that forces writing even if empty."""

class NodeVisitor:
    """Base class for visiting HTML nodes."""
```
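
To illustrate how `map` and `bind` differ, here is a small self-contained sketch. `SketchResult` is a hypothetical stand-in written for this example and is not mammoth's implementation; it assumes that `map` transforms the value while keeping the messages, and that `bind` chains a function returning another result and concatenates the messages.

```python
class SketchResult:
    """Illustrative stand-in for Result; not mammoth's own implementation."""

    def __init__(self, value, messages=()):
        self.value = value
        self.messages = list(messages)

    def map(self, func):
        # Transform the value and keep the messages unchanged.
        return SketchResult(func(self.value), self.messages)

    def bind(self, func):
        # func returns another result; its messages are appended to ours.
        inner = func(self.value)
        return SketchResult(inner.value, self.messages + inner.messages)


def strip_text(text):
    """Strip whitespace and report, as a message, whether anything changed."""
    stripped = text.strip()
    notes = ["removed surrounding whitespace"] if stripped != text else []
    return SketchResult(stripped, notes)


result = SketchResult("  <p>Hello</p> ", ["first warning"]).bind(strip_text).map(str.upper)
print(result.value)
print(result.messages)
```

The point of the pattern is that messages accumulate through a chain of steps, so a caller only has to inspect `messages` once at the end.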