Convert Word documents from docx to simple HTML and Markdown
Document transformation utilities for modifying document elements before conversion, enabling custom preprocessing of document structure.
Note: The API for document transforms should be considered unstable and may change between versions. Pin to a specific version if you rely on this behavior.
Apply a transformation to paragraph elements in the document.
function paragraph(transform: (element: any) => any): (element: any) => any;

Parameters:
- transform: Function that takes a paragraph element and returns the modified element

Returns: A transformation function that can be used with the transformDocument option.
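To make the behaviour of such a returned function more concrete, here is a minimal sketch of a type-targeted transform written against plain objects. It is not mammoth's source: the helper name `transformElementsOfType` and the sample tree are hypothetical, and the sketch assumes children are transformed before their parent.

```javascript
// Hypothetical helper: builds a function that walks an element tree and
// applies `transform` only to elements whose `type` matches.
function transformElementsOfType(type, transform) {
    return function transformElement(element) {
        let result = element;
        if (element.children) {
            // Transform children first so the parent sees the updated children.
            result = {...element, children: element.children.map(transformElement)};
        }
        return result.type === type ? transform(result) : result;
    };
}

// Hand-built stand-in for a parsed document (not real mammoth output).
const sampleDocument = {
    type: "document",
    children: [
        {type: "paragraph", alignment: "center", styleId: null, children: []},
        {type: "paragraph", alignment: "left", styleId: null, children: []}
    ]
};

const centerToHeading = transformElementsOfType("paragraph", function(paragraph) {
    if (paragraph.alignment === "center" && !paragraph.styleId) {
        return {...paragraph, styleId: "Heading2"};
    }
    return paragraph;
});

const transformed = centerToHeading(sampleDocument);
console.log(transformed.children.map(child => child.styleId));
```

The sketch returns new objects instead of mutating its input, which keeps the original tree available for comparison while debugging a transform.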
Example:

const mammoth = require("mammoth");

function transformParagraph(element) {
    // Convert center-aligned paragraphs to headings
    if (element.alignment === "center" && !element.styleId) {
        return {...element, styleId: "Heading2"};
    }
    return element;
}

const options = {
    transformDocument: mammoth.transforms.paragraph(transformParagraph)
};

mammoth.convertToHtml({path: "document.docx"}, options);

Apply a transformation to run elements (text runs) in the document.
function run(transform: (element: any) => any): (element: any) => any;

Parameters:
- transform: Function that takes a run element and returns the modified element

Returns: A transformation function that can be used with the transformDocument option.
Example:

function transformRun(element) {
    // Convert runs set in a monospace font to code.
    // The run's font property holds the font name as a string.
    if (element.font === "Courier New") {
        return {...element, styleId: "Code"};
    }
    return element;
}

const options = {
    transformDocument: mammoth.transforms.run(transformRun)
};

Get all descendant elements from a document element.
function getDescendants(element: any): any[];

Parameters:
- element: The document element to traverse

Returns: Array of all descendant elements found in the element tree.
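As a rough illustration of what "all descendants" means, the sketch below flattens a hand-built tree with a depth-first walk. The helper `collectDescendants` and the sample tree are hypothetical stand-ins, not mammoth's implementation, and the ordering shown (each child listed after its own descendants) is an assumption that may not match the library.

```javascript
// Hypothetical stand-in for a descendant traversal over plain objects.
function collectDescendants(element) {
    const descendants = [];
    for (const child of element.children || []) {
        // Visit the child's subtree first, then the child itself.
        descendants.push(...collectDescendants(child), child);
    }
    return descendants;
}

// Hand-built paragraph: two runs, each wrapping one text node.
const sampleParagraph = {
    type: "paragraph",
    children: [
        {type: "run", children: [{type: "text", value: "Hello"}]},
        {type: "run", children: [{type: "text", value: " world"}]}
    ]
};

const types = collectDescendants(sampleParagraph).map(d => d.type);
console.log(types.join(", "));
```

Note that the element passed in is not part of the result; only the elements beneath it are collected.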
Example:

function analyzeDocument(documentElement) {
    const allDescendants = mammoth.transforms.getDescendants(documentElement);
    console.log(`Document contains ${allDescendants.length} elements`);
    allDescendants.forEach(function(descendant) {
        console.log(`Element type: ${descendant.type}`);
    });
}

Get all descendant elements of a specific type from a document element.
function getDescendantsOfType(element: any, type: string): any[];

Parameters:
- element: The document element to traverse
- type: The element type to filter for (e.g., "paragraph", "run", "table")

Returns: Array of descendant elements matching the specified type.
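One common use of a type filter is gathering a paragraph's text, since text nodes normally sit inside runs and not directly under the paragraph. The following is a minimal sketch on plain objects; `descendantsOfType` and `paragraphText` are hypothetical helpers written for illustration, and the sample paragraph is hand-built.

```javascript
// Hypothetical stand-in for a type-filtered traversal (document order).
function descendantsOfType(element, type) {
    const matches = [];
    for (const child of element.children || []) {
        if (child.type === type) {
            matches.push(child);
        }
        matches.push(...descendantsOfType(child, type));
    }
    return matches;
}

// Hypothetical helper: concatenates the values of all text nodes in a paragraph.
function paragraphText(paragraph) {
    return descendantsOfType(paragraph, "text").map(text => text.value).join("");
}

const todoParagraph = {
    type: "paragraph",
    children: [
        {type: "run", isBold: true, children: [{type: "text", value: "TODO:"}]},
        {type: "run", children: [{type: "text", value: " update figures"}]}
    ]
};

console.log(paragraphText(todoParagraph));
```

Inside a real transform, the same idea would presumably use mammoth.transforms.getDescendantsOfType in place of the hypothetical helper.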
Example:

function countParagraphs(documentElement) {
    const paragraphs = mammoth.transforms.getDescendantsOfType(documentElement, "paragraph");
    console.log(`Document contains ${paragraphs.length} paragraphs`);
    return paragraphs;
}

function findTables(documentElement) {
    const tables = mammoth.transforms.getDescendantsOfType(documentElement, "table");
    return tables;
}

For more complex transformations, you can write your own recursive transformation function:
function transformElement(element: any): any {
    if (element.children) {
        const children = element.children.map(transformElement);
        element = {...element, children: children};
    }
    // Apply specific transformations based on element type
    if (element.type === "paragraph") {
        return transformParagraph(element);
    } else if (element.type === "run") {
        return transformRun(element);
    }
    return element;
}

The same approach with the paragraph and run rules written inline, plus a text-based rule:

function transformElement(element) {
    // Recursively transform children first
    if (element.children) {
        const children = element.children.map(transformElement);
        element = {...element, children: children};
    }
    // Transform paragraphs
    if (element.type === "paragraph") {
        // Convert center-aligned paragraphs to headings
        if (element.alignment === "center" && !element.styleId) {
            return {...element, styleId: "Heading2"};
        }
        // Convert paragraphs with specific text patterns.
        // Text nodes sit inside runs, so search all descendants, not just direct children.
        const text = mammoth.transforms.getDescendantsOfType(element, "text")
            .map(child => child.value)
            .join("");
        if (text.startsWith("TODO:")) {
            return {...element, styleId: "TodoItem"};
        }
    }
    // Transform runs
    if (element.type === "run") {
        // Convert monospace font runs to code (font holds the font name as a string)
        if (element.font === "Courier New") {
            return {...element, styleId: "Code"};
        }
    }
    return element;
}

const options = {
    transformDocument: transformElement
};

mammoth.convertToHtml({path: "document.docx"}, options);

Document elements you might encounter during transformation:

- "paragraph": Paragraph elements
- "run": Text runs within paragraphs
- "text": Text content
- "table": Table elements
- "tableRow": Table row elements
- "tableCell": Table cell elements
- "hyperlink": Link elements
- "image": Image elements
- "break": Line break elements
- "noteReference": Footnote and endnote references

Common properties found on document elements:

type: "paragraph"
- styleId: Style identifier from the document
- styleName: Human-readable style name
- alignment: Text alignment ("left", "center", "right", "justify")
- children: Array of child elements

type: "run"
- font: Font name as a string, when one is set
- isBold: Boolean indicating bold formatting
- isItalic: Boolean indicating italic formatting
- isUnderline: Boolean indicating underline formatting
- isStrikethrough: Boolean indicating strikethrough formatting
- verticalAlignment: "superscript" or "subscript"
- children: Array of child elements (usually text)

type: "text"
- value: The actual text content
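
As an illustration of how these properties combine, the sketch below promotes paragraphs whose runs are all bold to a heading style. It runs on a hand-built tree; the helper names and the "Heading3" style ID are hypothetical, and it assumes runs are direct children of the paragraph, which real documents do not always guarantee.

```javascript
// Hypothetical helper: true when a paragraph has at least one run and all are bold.
function isAllBold(paragraph) {
    const runs = (paragraph.children || []).filter(child => child.type === "run");
    return runs.length > 0 && runs.every(run => run.isBold === true);
}

// Hypothetical transform: gives unstyled, fully bold paragraphs a heading style.
function promoteBoldParagraph(element) {
    if (element.type === "paragraph" && !element.styleId && isAllBold(element)) {
        return {...element, styleId: "Heading3"};
    }
    return element;
}

// Hand-built elements using the property names listed above.
const boldParagraph = {
    type: "paragraph", styleId: null, alignment: "left",
    children: [{type: "run", isBold: true, children: [{type: "text", value: "Summary"}]}]
};
const mixedParagraph = {
    type: "paragraph", styleId: null, alignment: "left",
    children: [
        {type: "run", isBold: true, children: [{type: "text", value: "Note: "}]},
        {type: "run", isBold: false, children: [{type: "text", value: "details follow"}]}
    ]
};

console.log(promoteBoldParagraph(boldParagraph).styleId);
console.log(promoteBoldParagraph(mixedParagraph).styleId);
```

A function like promoteBoldParagraph could be passed to mammoth.transforms.paragraph in the same way as transformParagraph above.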