# Document Conversion

Core functions for converting DOCX files to HTML or Markdown and for extracting plain text. The conversion functions accept options for style mapping, image handling, and output control.

## Capabilities

### HTML Conversion

Converts DOCX documents to clean, semantic HTML with support for headings, lists, tables, images, and extensive formatting options.

```python { .api }
def convert_to_html(fileobj, **kwargs):
    """
    Convert DOCX file to HTML format.

    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)
    - style_map: str, custom style mapping rules
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, whether to skip empty paragraphs (default: True)
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, use embedded style maps (default: True)
    - include_default_style_map: bool, use built-in style mappings (default: True)

    Returns:
    Result object with .value (HTML string) and .messages (list of warnings)
    """
```

Usage example:

```python
import mammoth

# Basic HTML conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

# HTML conversion with custom options
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        style_map="p.Heading1 => h1.custom-heading",
        id_prefix="doc-",
        ignore_empty_paragraphs=False
    )
```
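
Longer style maps are easier to maintain when assembled from individual rules. The sketch below assumes that a style map string may hold several rules separated by newlines; `build_style_map` is a hypothetical helper, not part of mammoth, and `p.Quote` is an illustrative style name.

```python
def build_style_map(rules):
    """Join individual style-mapping rules into one newline-separated string.

    `rules` is an iterable of strings such as "p.Heading1 => h1.custom-heading".
    Surrounding whitespace is trimmed and blank entries are dropped.
    """
    cleaned = (rule.strip() for rule in rules)
    return "\n".join(rule for rule in cleaned if rule)


style_map = build_style_map([
    "p.Heading1 => h1.custom-heading",
    "  p.Quote => blockquote  ",  # hypothetical style name, for illustration
    "",
])
```

The resulting string can then be passed as `style_map=style_map` to `convert_to_html()`.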

### Markdown Conversion

Converts DOCX documents to clean Markdown format, preserving document structure and formatting in Markdown syntax.

```python { .api }
def convert_to_markdown(fileobj, **kwargs):
    """
    Convert DOCX file to Markdown format.

    Parameters: Same as convert_to_html()

    Returns:
    Result object with .value (Markdown string) and .messages (list of warnings)
    """
```

Usage example:

```python
import mammoth

# Basic Markdown conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value

# Check for conversion warnings
if result.messages:
    for message in result.messages:
        print(f"{message.type}: {message.message}")
```

### Core Conversion Function

The underlying conversion function with full parameter control, supporting both HTML and Markdown output formats.

```python { .api }
def convert(fileobj, transform_document=None, id_prefix=None,
            include_embedded_style_map=True, **kwargs):
    """
    Core conversion function with full parameter control.

    Parameters:
    - fileobj: File object containing DOCX data
    - transform_document: function, transforms document before conversion
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, whether to use embedded style maps
    - output_format: str, "html" or "markdown"
    - style_map: str, custom style mapping string
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, skip empty paragraphs (default: True)
    - include_default_style_map: bool, use built-in styles (default: True)

    Returns:
    Result object with converted content and messages
    """
```

Usage example:

```python
import mammoth

def custom_transform(document):
    # Custom document transformation
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert(
        docx_file,
        output_format="html",
        transform_document=custom_transform,
        style_map="p.CustomStyle => div.special"
    )
```
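
Because `convert()` takes the target format as a parameter, one wrapper can serve both outputs. The following is a minimal sketch, not part of mammoth: `output_format_for` and `convert_file` are hypothetical helpers, and choosing the format from the output file's extension is an assumption made for illustration.

```python
import os


def output_format_for(path):
    """Pick an output_format value from the target file's extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension in (".md", ".markdown"):
        return "markdown"
    return "html"


def convert_file(docx_path, output_path, convert=None, **options):
    """Convert docx_path, write the result to output_path, return the messages.

    `convert` defaults to mammoth.convert; it is a parameter so the helper
    can be exercised with a stand-in function.
    """
    if convert is None:
        import mammoth  # imported lazily so the helper has no hard dependency
        convert = mammoth.convert

    with open(docx_path, "rb") as docx_file:
        result = convert(
            docx_file,
            output_format=output_format_for(output_path),
            **options
        )

    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(result.value)
    return result.messages
```

A call such as `convert_file("document.docx", "document.md")` would then request Markdown output, while any other extension falls back to HTML.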

### Text Extraction

Extracts plain text content from DOCX documents without formatting, useful for text analysis and processing.

```python { .api }
def extract_raw_text(fileobj):
    """
    Extract plain text from DOCX file.

    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)

    Returns:
    Result object with .value (plain text string) and .messages (list)
    """
```

Usage example:

```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value
    print(text)  # Plain text content
```
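
For text analysis it is often useful to recover individual paragraphs from the extracted string. The sketch below assumes that the raw text separates paragraphs with a blank line (two newlines); check that against your own output before relying on it. `split_paragraphs` is a hypothetical helper, not part of mammoth.

```python
def split_paragraphs(raw_text):
    """Split raw text into paragraphs, assuming a blank line separates them."""
    return [part.strip() for part in raw_text.split("\n\n") if part.strip()]


# Sample string standing in for result.value
sample = "First paragraph.\n\nSecond paragraph.\n\n"
paragraphs = split_paragraphs(sample)
```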

## Supported Options

`convert()`, `convert_to_html()`, and `convert_to_markdown()` accept these common options (`extract_raw_text()` takes only the file object):

- **style_map**: Custom style mapping rules as a string
- **include_embedded_style_map**: Whether to apply the style map embedded in the DOCX file itself (default: True)
- **include_default_style_map**: Whether to include built-in style mappings (default: True)
- **ignore_empty_paragraphs**: Whether to skip empty paragraph elements (default: True)
- **convert_image**: Custom function for handling image conversion
- **output_format**: Target format ("html" or "markdown"); relevant when calling `convert()` directly, since the two wrappers each target a fixed format
- **id_prefix**: Prefix for generated HTML element IDs
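
Since the options are passed as keyword arguments, a small guard can catch a misspelled name before any file is opened. This is a sketch under stated assumptions: `check_options` is a hypothetical helper, and `KNOWN_OPTIONS` is collected from the signatures documented in this section, so mammoth itself may accept names that are not listed here.

```python
# Option names collected from the signatures documented in this section.
KNOWN_OPTIONS = frozenset([
    "style_map",
    "convert_image",
    "ignore_empty_paragraphs",
    "id_prefix",
    "include_embedded_style_map",
    "include_default_style_map",
    "output_format",
    "transform_document",
])


def check_options(**options):
    """Return the options unchanged, or raise ValueError on unknown names."""
    unknown = sorted(set(options) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError("Unknown conversion option(s): " + ", ".join(unknown))
    return options


options = check_options(id_prefix="doc-", ignore_empty_paragraphs=False)
```

The checked dictionary can then be unpacked into a call, for example `mammoth.convert_to_html(docx_file, **options)`.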

## Error Handling

All conversion functions return Result objects that contain both the converted content and any warnings or errors encountered during processing:

```python
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

# Access the converted content
html = result.value

# Check for warnings or errors
for message in result.messages:
    if message.type == "error":
        print(f"Error: {message.message}")
    elif message.type == "warning":
        print(f"Warning: {message.message}")
```
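
When a document produces many messages, grouping them by type gives a quicker overview than printing each one. The sketch below relies only on the `.type` and `.message` attributes used above; the `Message` namedtuple is a stand-in so the example runs without a DOCX file, and `group_messages` and the sample message texts are hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same two attributes (.type, .message) that the
# messages in result.messages expose in the example above.
Message = namedtuple("Message", ["type", "message"])


def group_messages(messages):
    """Return a dict mapping each message type to a list of message texts."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message.type].append(message.message)
    return dict(grouped)


sample_messages = [
    Message("warning", "first sample warning"),
    Message("warning", "second sample warning"),
    Message("error", "sample error"),
]
summary = group_messages(sample_messages)
```

With a real conversion, `group_messages(result.messages)` would be called in place of the sample list.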