Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines
Specialized splitters optimized for specific document formats like Markdown and LaTeX. Designed to preserve document structure and formatting semantics while providing intelligent chunking.
Specialized splitter for Markdown documents that preserves heading hierarchy and structural elements.
/**
 * Text splitter optimized for Markdown documents
 * Preserves heading structure and code blocks
 */
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter implements MarkdownTextSplitterParams {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;

Usage Examples:
import { MarkdownTextSplitter } from "@langchain/textsplitters";

// Basic Markdown splitting
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const markdownContent = `# Main Title
This is the introduction paragraph with some **bold text** and *italic text*.
## Section One
Here's content in section one with a [link](https://example.com).
### Subsection
More detailed content here.
\`\`\`javascript
// Code block that should be preserved
function example() {
  return "Hello World";
}
\`\`\`
## Section Two
Final section with a list:
- Item one
- Item two
- Item three
> This is a blockquote that should be preserved.
`;
const chunks = await markdownSplitter.splitText(markdownContent);
// Preserves heading boundaries, code blocks, and list structure

Specialized splitter for LaTeX documents that understands document structure and mathematical environments.
/**
 * Text splitter optimized for LaTeX documents
 * Preserves document structure, sections, and math environments
 */
class LatexTextSplitter extends RecursiveCharacterTextSplitter implements LatexTextSplitterParams {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type LatexTextSplitterParams = TextSplitterParams;

Usage Examples:
import { LatexTextSplitter } from "@langchain/textsplitters";

// Basic LaTeX splitting
const latexSplitter = new LatexTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});
const latexContent = `\\documentclass{article}
\\usepackage{amsmath}
\\title{Research Paper Title}
\\author{Author Name}
\\date{}
\\begin{document}
\\maketitle
\\section{Introduction}
This is the introduction section with some mathematical notation: $E = mc^2$.
\\subsection{Background}
Some background information with an equation:
\\begin{equation}
f(x) = \\int_{-\\infty}^{\\infty} g(t) e^{-2\\pi i x t} dt
\\end{equation}
\\section{Methodology}
The methodology section describes our approach.
\\begin{itemize}
\\item First step of the process
\\item Second step with more details
\\item Final step and conclusions
\\end{itemize}
\\section{Results}
Results are presented in this section.
\\begin{align}
y &= mx + b \\\\
z &= ax^2 + bx + c
\\end{align}
\\section{Conclusion}
Final conclusions and future work.
\\end{document}`;
const latexChunks = await latexSplitter.splitText(latexContent);
// Preserves section boundaries, equation environments, and document structure

The Markdown splitter uses intelligent separators that prioritize document structure:
Markdown Separator Hierarchy:
// Internal separator order used by MarkdownTextSplitter
const markdownSeparators = [
  "\n## ",       // H2 headings
  "\n### ",      // H3 headings
  "\n#### ",     // H4 headings
  "\n##### ",    // H5 headings
  "\n###### ",   // H6 headings
  "```\n\n",     // End of code blocks
  "\n\n***\n\n", // Horizontal rules (asterisk)
  "\n\n---\n\n", // Horizontal rules (dash)
  "\n\n___\n\n", // Horizontal rules (underscore)
  "\n\n",        // Paragraph breaks
  "\n",          // Line breaks
  " ",           // Spaces
  ""             // Characters
];

Advanced Markdown Usage:
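The cascade above can be illustrated without LangChain at all. The sketch below (`splitBySeparators` is a hypothetical helper written for this doc) tries each separator in priority order and only recurses into pieces that are still too long; the real splitter additionally merges adjacent pieces up to chunkSize and applies chunkOverlap:

```typescript
// Simplified sketch of the separator cascade: try separators in priority
// order, recursing with the remaining separators only for oversized pieces.
// Unlike the real splitter, this does no merging and applies no overlap.
function splitBySeparators(text: string, separators: string[], maxLen = 50): string[] {
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    return [text]; // no separators left: return the piece as-is
  }
  const parts = text.split(sep).filter((p) => p.length > 0);
  return parts.flatMap((p) =>
    p.length > maxLen ? splitBySeparators(p, rest, maxLen) : [p]
  );
}

const doc = "# Title\n\nIntro paragraph.\n## Section One\nShort body.\n## Section Two\nAnother short body.";
const pieces = splitBySeparators(doc, ["\n## ", "\n\n", "\n", " "]);
// Heading separators are tried first, so each section survives as one piece
```

Because higher-priority separators are consumed first, section boundaries dominate paragraph and line breaks, which is exactly why Markdown-aware chunks tend to align with headings.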
// Custom configuration for documentation
const docSplitter = new MarkdownTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 150,
  keepSeparator: true, // Keep headings with content
});
// Process technical documentation
const technicalDoc = `# API Reference
## Authentication
All API requests require authentication using Bearer tokens.
\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/users
\`\`\`
## Endpoints
### GET /users
Retrieves a list of users.
**Parameters:**
- \`limit\` (optional): Maximum number of users to return
- \`offset\` (optional): Number of users to skip
**Response:**
\`\`\`json
{
  "users": [...],
  "total": 100,
  "limit": 20,
  "offset": 0
}
\`\`\`
### POST /users
Creates a new user.`;
const docChunks = await docSplitter.splitText(technicalDoc);
// Create structured documents
const docSections = await docSplitter.createDocuments(
  [technicalDoc],
  [{ type: "api_docs", version: "1.0" }],
  {
    chunkHeader: "=== API Documentation Section ===\n",
    appendChunkOverlapHeader: true
  }
);

The LaTeX splitter uses separators that understand academic document structure:
LaTeX Separator Hierarchy:
// Internal separator order used by LatexTextSplitter
const latexSeparators = [
  "\n\\chapter{",           // Chapter divisions
  "\n\\section{",           // Section divisions
  "\n\\subsection{",        // Subsection divisions
  "\n\\subsubsection{",     // Subsubsection divisions
  "\n\\begin{enumerate}",   // List environments
  "\n\\begin{itemize}",     // List environments
  "\n\\begin{description}", // Description lists
  "\n\\begin{list}",        // Generic lists
  "\n\\begin{quote}",       // Quote environments
  "\n\\begin{quotation}",   // Quotation environments
  "\n\\begin{verse}",       // Verse environments
  "\n\\begin{verbatim}",    // Verbatim environments
  "\n\\begin{align}",       // Math environments
  "$$",                     // Display math
  "$",                      // Inline math
  "\n\n",                   // Paragraph breaks
  "\n",                     // Line breaks
  " ",                      // Spaces
  ""                        // Characters
];

Advanced LaTeX Usage:
// Configuration for academic papers
const academicSplitter = new LatexTextSplitter({
  chunkSize: 2000,     // Longer chunks for academic content
  chunkOverlap: 200,   // Good overlap for context
  keepSeparator: true, // Preserve LaTeX commands
});
// Process research paper
const researchPaper = `\\section{Literature Review}
Previous work in this area includes studies by \\cite{smith2020} and \\cite{jones2021}.
\\subsection{Theoretical Framework}
The theoretical framework is based on the following principles:
\\begin{enumerate}
  \\item First principle with mathematical foundation
  \\item Second principle involving:
    \\begin{equation}
      \\mathbf{X} = \\mathbf{A}\\mathbf{B} + \\mathbf{C}
    \\end{equation}
  \\item Third principle with experimental validation
\\end{enumerate}
\\subsection{Experimental Design}
Our experimental approach follows established protocols.`;
const paperChunks = await academicSplitter.splitText(researchPaper);
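A cheap sanity check on chunks like these is to verify that no chunk was cut in the middle of a LaTeX environment. The helper below is a sketch written for this doc (`hasBalancedEnvironments` is hypothetical, regex-based, and nothing like a full LaTeX parser), counting `\begin{...}` / `\end{...}` pairs:

```typescript
// Returns true when every \begin{...} in the chunk has a matching \end{...}.
// A raw count is a rough heuristic: it ignores nesting order and comments.
function hasBalancedEnvironments(chunk: string): boolean {
  const begins = (chunk.match(/\\begin\{[^}]+\}/g) ?? []).length;
  const ends = (chunk.match(/\\end\{[^}]+\}/g) ?? []).length;
  return begins === ends;
}

const intact = hasBalancedEnvironments("\\begin{align}\ny = mx + b\n\\end{align}"); // true
const cut = hasBalancedEnvironments("\\begin{equation}\nf(x) = x^2");               // false
```

Chunks that fail such a check are candidates for re-splitting with a larger chunkSize or for merging with a neighboring chunk before embedding.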
// Process with metadata for citation tracking
const paperSections = await academicSplitter.createDocuments(
  [researchPaper],
  [{
    paper_id: "smith2023_ml_approach",
    authors: ["Smith, J.", "Doe, A."],
    journal: "AI Research Quarterly"
  }]
);

Both format splitters work seamlessly with LangChain's document processing:
import { MarkdownTextSplitter, LatexTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
// Process mixed document types (readmeContent and paperContent are
// assumed to be loaded elsewhere, e.g. read from disk)
const markdownDocs = [
  new Document({
    pageContent: readmeContent,
    metadata: { type: "readme", language: "markdown" }
  })
];
const latexDocs = [
  new Document({
    pageContent: paperContent,
    metadata: { type: "paper", language: "latex" }
  })
];
// Split with appropriate splitters
const markdownSplitter = new MarkdownTextSplitter({ chunkSize: 1000 });
const latexSplitter = new LatexTextSplitter({ chunkSize: 1500 });
const [splitMarkdown, splitLatex] = await Promise.all([
  markdownSplitter.transformDocuments(markdownDocs),
  latexSplitter.transformDocuments(latexDocs)
]);
// Combine results maintaining document type information
const allSplitDocs = [...splitMarkdown, ...splitLatex];

Automatic format detection and processing workflow:
import {
  MarkdownTextSplitter,
  LatexTextSplitter,
  RecursiveCharacterTextSplitter,
} from "@langchain/textsplitters";

function createFormatSplitter(content: string, options = {}) {
  // Simple format detection
  if (content.includes('\\documentclass') || content.includes('\\begin{document}')) {
    return new LatexTextSplitter(options);
  } else if (content.includes('# ') || content.includes('## ') || content.includes('```')) {
    return new MarkdownTextSplitter(options);
  } else {
    // Fall back to recursive character splitter
    return new RecursiveCharacterTextSplitter(options);
  }
}
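The branch logic can also be factored out and unit-tested on its own, without constructing splitter instances. `detectFormat` below is a hypothetical helper written for this doc that mirrors the same heuristics:

```typescript
// Same detection heuristics as createFormatSplitter, returning a label
// instead of a splitter instance so the heuristic can be tested in isolation.
function detectFormat(content: string): "latex" | "markdown" | "plain" {
  if (content.includes("\\documentclass") || content.includes("\\begin{document}")) {
    return "latex";
  }
  if (content.includes("# ") || content.includes("## ") || content.includes("```")) {
    return "markdown";
  }
  return "plain";
}

detectFormat("\\documentclass{article}");  // "latex"
detectFormat("# Title\n\nSome body text"); // "markdown"
detectFormat("Just plain prose.");         // "plain"
```

Note the LaTeX check runs first: a LaTeX source that happens to contain `# ` inside verbatim text would otherwise be misclassified as Markdown.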
// Process documents with automatic format detection
async function processDocuments(documents: Array<{ content: string, metadata: any }>) {
  const results = [];
  for (const doc of documents) {
    const splitter = createFormatSplitter(doc.content, {
      chunkSize: 1000,
      chunkOverlap: 100
    });
    const chunks = await splitter.createDocuments(
      [doc.content],
      [{ ...doc.metadata, detected_format: splitter.constructor.name }]
    );
    results.push(...chunks);
  }
  return results;
}
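When choosing chunkSize and chunkOverlap for a pipeline like this, a back-of-the-envelope chunk-count estimate helps with embedding cost planning. The helper below is a deliberately crude sketch: it pretends every split lands exactly at chunkSize, which separator-aware splitting never quite does, so treat the result as an upper bound:

```typescript
// Upper-bound estimate of chunk count for a text of n characters:
// the first chunk covers chunkSize characters; each subsequent chunk
// contributes (chunkSize - chunkOverlap) fresh characters.
function estimateChunkCount(n: number, chunkSize: number, chunkOverlap: number): number {
  if (n <= chunkSize) return 1;
  const stride = chunkSize - chunkOverlap; // fresh characters per chunk after the first
  return 1 + Math.ceil((n - chunkSize) / stride);
}

estimateChunkCount(1000, 1000, 200); // 1: the text fits in a single chunk
estimateChunkCount(5000, 1000, 200); // 6: 1 + (5000 - 1000) / 800
```

The formula also makes the overlap trade-off concrete: raising chunkOverlap shrinks the stride, so the same corpus produces more chunks and therefore more embedding calls.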