or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

docs

conversion.mdimages.mdindex.mdstyle-maps.mdstyles.mdtransforms.md
tile.json

tessl/npm-mammoth

Convert Word documents from docx to simple HTML and Markdown

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
npmpkg:npm/mammoth@1.10.x

To install, run

npx @tessl/cli install tessl/npm-mammoth@1.10.0

index.mddocs/

Mammoth

Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, Google Docs and LibreOffice, to HTML and Markdown formats. It focuses on semantic markup preservation rather than visual formatting, converting document styles (like Heading 1) to appropriate HTML elements (like h1 tags) while ignoring font styling details.

Package Information

  • Package Name: mammoth
  • Package Type: npm
  • Language: JavaScript with TypeScript definitions
  • Installation: npm install mammoth

Core Imports

const mammoth = require("mammoth");

TypeScript:

import mammoth = require("mammoth");
// or
const mammoth = require("mammoth");

Browser (standalone):

// Include mammoth.browser.js or mammoth.browser.min.js
const mammoth = window.mammoth;

Basic Usage

const mammoth = require("mammoth");

// Convert DOCX to HTML
mammoth.convertToHtml({path: "document.docx"})
    .then(function(result){
        const html = result.value; // The generated HTML
        const messages = result.messages; // Any messages, such as warnings
    })
    .catch(function(error) {
        console.error(error);
    });

// Extract raw text
mammoth.extractRawText({path: "document.docx"})
    .then(function(result){
        const text = result.value; // The raw text
        const messages = result.messages;
    });

CLI Usage

Mammoth also provides a command-line interface:

# Convert DOCX to HTML
mammoth document.docx output.html

# Convert with style map
mammoth document.docx output.html --style-map=custom-style-map

# Convert to Markdown (deprecated)
mammoth document.docx --output-format=markdown

# Extract images to directory
mammoth document.docx --output-dir=output-dir

Architecture

Mammoth is built around several key components:

  • Document Conversion: Core DOCX to HTML/Markdown conversion with customizable style mappings
  • Image Processing: Flexible image handling with built-in and custom converters
  • Document Transformation: Pre-conversion document modification and element transforms
  • Style Mapping: Custom styling rules for converting Word styles to HTML elements

Capabilities

Document Conversion

Core functionality for converting DOCX documents to HTML and Markdown formats, with support for custom style mappings and conversion options.

function convertToHtml(input: Input, options?: Options): Promise<Result>;
function convertToMarkdown(input: Input, options?: Options): Promise<Result>;
function extractRawText(input: Input): Promise<Result>;

Document Conversion

Image Handling

Image conversion utilities for customizing how images in DOCX documents are processed and included in the output.

const images: {
  dataUri: ImageConverter;
  imgElement: (func: (image: Image) => Promise<ImageAttributes>) => ImageConverter;
};

Image Handling

Document Transforms

Document transformation utilities for modifying document elements before conversion, enabling custom preprocessing of document structure.

const transforms: {
  paragraph: (transform: (element: any) => any) => (element: any) => any;
  run: (transform: (element: any) => any) => (element: any) => any;
  getDescendants: (element: any) => any[];
  getDescendantsOfType: (element: any, type: string) => any[];
};

Document Transforms

Style Utilities

Utilities for handling underline and other styling elements in document conversion.

const underline: {
  element: (name: string) => (html: any) => any;
};

Style Utilities

Style Map Management

Functions for embedding and reading custom style maps in DOCX documents.

function embedStyleMap(input: Input, styleMap: string): Promise<{
  toArrayBuffer: () => ArrayBuffer;
  toBuffer: () => Buffer;
}>;
function readEmbeddedStyleMap(input: Input): Promise<string>;

Style Map Management

Types

type Input = PathInput | BufferInput | ArrayBufferInput;

interface PathInput {
  path: string;
}

interface BufferInput {
  buffer: Buffer;
}

interface ArrayBufferInput {
  arrayBuffer: ArrayBuffer;
}

interface Options {
  styleMap?: string | string[];
  includeEmbeddedStyleMap?: boolean;
  includeDefaultStyleMap?: boolean;
  convertImage?: ImageConverter;
  ignoreEmptyParagraphs?: boolean;
  idPrefix?: string;
  transformDocument?: (element: any) => any;
}

interface Result {
  value: string;
  messages: Message[];
}

type Message = Warning | Error;

interface Warning {
  type: "warning";
  message: string;
}

interface Error {
  type: "error";
  message: string;
  error: unknown;
}

interface Image {
  contentType: string;
  readAsArrayBuffer(): Promise<ArrayBuffer>;
  readAsBase64String(): Promise<string>;
  readAsBuffer(): Promise<Buffer>;
  read(): Promise<Buffer>;
  read(encoding: string): Promise<string>;
}

interface ImageConverter {
  __mammothBrand: "ImageConverter";
}

interface ImageAttributes {
  src: string;
  [key: string]: string;
}