tessl/pypi-mammoth

Convert Word documents from docx to simple and clean HTML and Markdown

Document Conversion

Name: tessl/pypi-mammoth
Author: tessl

Core functions for converting DOCX files to HTML and Markdown, plus a helper for extracting plain text. The conversion functions accept options for style mapping, image handling, and output control.

Capabilities

HTML Conversion

Converts DOCX documents to clean, semantic HTML with support for headings, lists, tables, images, and extensive formatting options.

def convert_to_html(fileobj, **kwargs):
    """
    Convert DOCX file to HTML format.
    
    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)
    - style_map: str, custom style mapping rules
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, whether to skip empty paragraphs (default: True)
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, use embedded style maps (default: True)
    - include_default_style_map: bool, use built-in style mappings (default: True)
    
    Returns:
    Result object with .value (HTML string) and .messages (list of warnings)
    """

Usage example:

import mammoth

# Basic HTML conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
    
# HTML conversion with custom options
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        style_map="p.Heading1 => h1.custom-heading",
        id_prefix="doc-",
        ignore_empty_paragraphs=False
    )
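The style_map option takes a single string, so several rules have to be joined before they are passed in. The sketch below assumes that mammoth reads one rule per line; build_style_map is a hypothetical helper, not part of mammoth, and the rules after the first are illustrative only.

```python
def build_style_map(rules):
    """Join (document matcher, HTML path) pairs into one style map string.

    Hypothetical helper: assumes one "matcher => path" rule per line.
    """
    return "\n".join(f"{matcher} => {html_path}" for matcher, html_path in rules)


# Illustrative rules; adjust the style names to match the document.
rules = [
    ("p.Heading1", "h1.custom-heading"),
    ("p[style-name='Section Title']", "h2:fresh"),
    ("r[style-name='Code']", "code"),
]

style_map = build_style_map(rules)
print(style_map)
```

The resulting string can be passed as style_map=style_map to mammoth.convert_to_html().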

Markdown Conversion

Converts DOCX documents to clean Markdown format, preserving document structure and formatting in Markdown syntax.

def convert_to_markdown(fileobj, **kwargs):
    """
    Convert DOCX file to Markdown format.
    
    Parameters: Same as convert_to_html()
    
    Returns:
    Result object with .value (Markdown string) and .messages (list of warnings)
    """

Usage example:

import mammoth

# Basic Markdown conversion
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
    markdown = result.value
    
# Check for conversion warnings
if result.messages:
    for message in result.messages:
        print(f"{message.type}: {message.message}")

Core Conversion Function

The underlying conversion function with full parameter control, supporting both HTML and Markdown output formats.

def convert(fileobj, transform_document=None, id_prefix=None, 
           include_embedded_style_map=True, **kwargs):
    """
    Core conversion function with full parameter control.
    
    Parameters:
    - fileobj: File object containing DOCX data
    - transform_document: function, transforms document before conversion
    - id_prefix: str, prefix for HTML element IDs
    - include_embedded_style_map: bool, whether to use embedded style maps
    - output_format: str, "html" or "markdown"
    - style_map: str, custom style mapping string
    - convert_image: function, custom image conversion function
    - ignore_empty_paragraphs: bool, skip empty paragraphs (default: True)
    - include_default_style_map: bool, use built-in styles (default: True)
    
    Returns:
    Result object with converted content and messages
    """

Usage example:

import mammoth

def custom_transform(document):
    # Receives the parsed document and must return a document.
    # This placeholder returns it unchanged.
    return document

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert(
        docx_file,
        output_format="html",
        transform_document=custom_transform,
        style_map="p.CustomStyle => div.special"
    )
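A transform that does real work might look like the sketch below, which gives centered paragraphs without a style the style ID "Heading2". It assumes that paragraph elements expose alignment, style_id, and copy(), and that mammoth.transforms.paragraph() applies a function to every paragraph. Both function names are hypothetical, and mammoth is imported inside the wrapper so the paragraph rule can be tested on its own.

```python
def promote_centered_paragraph(paragraph):
    """Return a copy styled as Heading2 if the paragraph is centered and unstyled."""
    if paragraph.alignment == "center" and not paragraph.style_id:
        return paragraph.copy(style_id="Heading2")
    return paragraph


def convert_with_promotion(path):
    """Convert the DOCX file at path to HTML, applying the rule to each paragraph."""
    # Imported here so the rule above has no import-time dependency on mammoth.
    import mammoth
    import mammoth.transforms

    # Assumption: transforms.paragraph() wraps a per-paragraph function
    # into a whole-document transform.
    transform = mammoth.transforms.paragraph(promote_centered_paragraph)
    with open(path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file, transform_document=transform)
```

Calling convert_with_promotion("document.docx") returns the usual Result object; a style map rule for p.Heading2 then controls how those paragraphs are rendered.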

Text Extraction

Extracts plain text content from DOCX documents without formatting, useful for text analysis and processing.

def extract_raw_text(fileobj):
    """
    Extract plain text from DOCX file.
    
    Parameters:
    - fileobj: File object (opened DOCX file in binary mode)
    
    Returns:
    Result object with .value (plain text string) and .messages (list)
    """

Usage example:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value
    print(text)  # Plain text content
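For text analysis, the extracted string usually needs to be split back into paragraphs. The sketch below assumes that extract_raw_text() ends each paragraph with two newlines; split_paragraphs and summarize are hypothetical helpers that work on any string.

```python
def split_paragraphs(raw_text):
    """Split raw text on blank lines, dropping empty blocks."""
    return [block.strip() for block in raw_text.split("\n\n") if block.strip()]


def summarize(raw_text):
    """Return paragraph and word counts for a block of raw text."""
    paragraphs = split_paragraphs(raw_text)
    return {
        "paragraphs": len(paragraphs),
        "words": sum(len(paragraph.split()) for paragraph in paragraphs),
    }


# Stand-in for result.value from mammoth.extract_raw_text().
sample = "First paragraph here.\n\nSecond one.\n\n"
print(summarize(sample))  # {'paragraphs': 2, 'words': 5}
```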

Supported Options

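convert(), convert_to_html(), and convert_to_markdown() accept the options below. extract_raw_text() takes only the file object.

style_map: Custom style mapping rules as a string
embedded_style_map: Style map read from the DOCX file itself; normally supplied by convert() when include_embedded_style_map is True
include_embedded_style_map: Whether to use a style map embedded in the DOCX file (default: True)
include_default_style_map: Whether to include built-in style mappings (default: True)
ignore_empty_paragraphs: Whether to skip empty paragraph elements (default: True)
convert_image: Custom function for handling image conversion
output_format: Target format ("html" or "markdown"); convert_to_html() and convert_to_markdown() set this themselves, so pass it only to convert()
id_prefix: Prefix for generated HTML element IDs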
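Of these options, convert_image is the least self-explanatory. The sketch below shows one possible handler that inlines each image as a base64 data URI. It assumes the image object passed to the handler offers open() and content_type, and that mammoth.images.img_element() turns a function returning an attribute dict into a converter. The function names are hypothetical.

```python
import base64


def image_to_data_uri(image):
    """Return <img> attributes with the image bytes inlined as a data URI."""
    with image.open() as image_bytes:
        encoded = base64.b64encode(image_bytes.read()).decode("ascii")
    return {"src": f"data:{image.content_type};base64,{encoded}"}


def convert_with_inline_images(path):
    """Convert the DOCX file at path to HTML using the handler above."""
    # Imported here so the handler can be exercised without mammoth installed.
    import mammoth
    import mammoth.images

    # Assumption: img_element() wraps an attribute-dict function into a converter.
    converter = mammoth.images.img_element(image_to_data_uri)
    with open(path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file, convert_image=converter)
```

A handler that writes images to disk and returns a file path in src would follow the same shape.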

Error Handling

All conversion functions return Result objects that contain both the converted content and any warnings or errors encountered during processing:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

# Access the converted content
html = result.value

# Check for warnings or errors
for message in result.messages:
    if message.type == "error":
        print(f"Error: {message.message}")
    elif message.type == "warning":
        print(f"Warning: {message.message}")
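When a document produces many messages, grouping them by type can be easier to read than printing them one by one. The sketch below relies only on the .type and .message attributes used above. group_messages is a hypothetical helper, and the Message namedtuple is a stand-in with made-up text so the example runs without a DOCX file.

```python
from collections import defaultdict, namedtuple


def group_messages(messages):
    """Group message texts by their type, preserving the original order."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message.type].append(message.message)
    return dict(grouped)


# Stand-in for result.messages; the texts are illustrative, not real mammoth output.
Message = namedtuple("Message", ["type", "message"])
sample_messages = [
    Message("warning", "first illustrative warning"),
    Message("error", "illustrative error"),
    Message("warning", "second illustrative warning"),
]

for message_type, texts in group_messages(sample_messages).items():
    print(f"{message_type}: {len(texts)} message(s)")
```

With a real conversion, group_messages(result.messages) gives the same dictionary shape.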
