Convert Word documents from docx to simple HTML and Markdown
Core functionality for converting DOCX documents to HTML and Markdown formats, with support for custom style mappings and conversion options.
Converts the source document to HTML.
function convertToHtml(input: Input, options?: Options): Promise<Result>;

input: Document input - can be a file path, Buffer, or ArrayBuffer
  {path: string} - Path to the .docx file (Node.js)
  {buffer: Buffer} - Buffer containing .docx file (Node.js)
  {arrayBuffer: ArrayBuffer} - ArrayBuffer containing .docx file (Browser)
options (optional): Conversion options
  styleMap: Custom style mappings (string or string array)
  includeEmbeddedStyleMap: Include embedded style maps (default: true)
  includeDefaultStyleMap: Include default style mappings (default: true)
  convertImage: Custom image converter function
  ignoreEmptyParagraphs: Ignore empty paragraphs (default: true)
  idPrefix: Prefix for generated IDs (default: "")
  transformDocument: Document transformation function

Promise resolving to a Result object:
  value: The generated HTML string
  messages: Array of warnings/errors during conversion

const mammoth = require("mammoth");
mammoth.convertToHtml({path: "document.docx"})
    .then(function(result){
        const html = result.value;
        const messages = result.messages;
        console.log(html);
    })
    .catch(function(error) {
        console.error(error);
    });

const options = {
    styleMap: [
        "p[style-name='Section Title'] => h1:fresh",
        "p[style-name='Subsection Title'] => h2:fresh"
    ]
};
mammoth.convertToHtml({path: "document.docx"}, options);

const options = {
    convertImage: mammoth.images.imgElement(function(image) {
        return image.readAsBase64String().then(function(imageBuffer) {
            return {
                src: "data:" + image.contentType + ";base64," + imageBuffer
            };
        });
    })
};
mammoth.convertToHtml({buffer: docxBuffer}, options);

Converts the source document to Markdown. Note: Markdown support is deprecated.
function convertToMarkdown(input: Input, options?: Options): Promise<Result>;

Same as convertToHtml, but returns Markdown instead of HTML.

Promise resolving to a Result object:
  value: The generated Markdown string
  messages: Array of warnings/errors during conversion

mammoth.convertToMarkdown({path: "document.docx"})
    .then(function(result){
        const markdown = result.value;
        console.log(markdown);
    });

Extract the raw text of the document, ignoring all formatting. Each paragraph is followed by two newlines.
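Because each paragraph is followed by two newlines, the raw text can be split back into paragraphs. The following is a minimal sketch: `splitParagraphs` is a hypothetical helper (not part of mammoth), and the sample string stands in for `result.value`.

```javascript
// Hypothetical helper, not part of mammoth: split raw text into paragraphs.
// Assumes the "two newlines after each paragraph" convention described above.
function splitParagraphs(rawText) {
    return rawText
        .split("\n\n")
        // Drop the empty trailing entry and any empty paragraphs
        .filter(function(paragraph) { return paragraph.length > 0; });
}

// Stand-in for result.value from extractRawText
const sampleRawText = "First paragraph\n\nSecond paragraph\n\n";
console.log(splitParagraphs(sampleRawText));
```

Empty paragraphs in the source document are dropped by the filter; remove it if their positions matter.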
function extractRawText(input: Input): Promise<Result>;

input: Document input (same format as convertToHtml)

Promise resolving to a Result object:
  value: The raw text string
  messages: Array of warnings/errors during extraction

mammoth.extractRawText({path: "document.docx"})
    .then(function(result){
        const text = result.value;
        console.log(text);
    });

Style mappings control how Word styles are converted to HTML elements:
// Basic style mapping
"p[style-name='Heading 1'] => h1"
// With CSS classes
"p[style-name='Warning'] => p.warning"
// Fresh elements (always start a new element instead of reusing an open one)
"p[style-name='Title'] => h1:fresh"
// Character styles
"r[style-name='Code'] => code"
// Bold/italic/underline
"b => strong"
"i => em"
"u => span.underline"

Mammoth performs no sanitization of the source document and should be used extremely carefully with untrusted user input. Source documents can contain:
  links with javascript: targets

Always sanitize the output HTML when embedding in web pages.
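One way to make sanitization hard to forget is to wrap conversion so the HTML always passes through a sanitizer before it is returned. The sketch below is illustrative only: `convertToSafeHtml` is a hypothetical wrapper, `converter` is assumed to expose `convertToHtml` like the mammoth module, and `sanitizeHtml` is any sanitizer function you supply. The stand-ins let the sketch run without mammoth installed.

```javascript
// Hypothetical wrapper: convert, then sanitize before returning the result.
// `converter` is assumed to expose convertToHtml like the mammoth module.
// `sanitizeHtml` is a caller-supplied function (string in, string out).
function convertToSafeHtml(converter, sanitizeHtml, input, options) {
    return converter.convertToHtml(input, options).then(function(result) {
        return {
            value: sanitizeHtml(result.value),
            messages: result.messages
        };
    });
}

// Stand-in converter so the sketch runs on its own
const fakeConverter = {
    convertToHtml: function() {
        return Promise.resolve({
            value: '<p><a href="javascript:alert(1)">click</a></p>',
            messages: []
        });
    }
};

// Deliberately crude stand-in, NOT a real sanitizer: use a maintained
// HTML sanitization library in practice.
function fakeSanitizer(html) {
    return html.replace(/href="javascript:[^"]*"/gi, 'href="#"');
}

const safeResult = convertToSafeHtml(fakeConverter, fakeSanitizer, {path: "document.docx"});
safeResult.then(function(result) {
    console.log(result.value);
});
```

In real use, pass the mammoth module as `converter` and a function backed by a maintained sanitization library as `sanitizeHtml`; the regular expression above exists only to make the example observable.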