Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines
—
Pending
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Pending
The risk profile of this skill
Advanced recursive splitting using a hierarchy of separators. Perfect for intelligent document chunking that preserves semantic structure and supports code-aware splitting for 18 programming languages.
Recursively splits text using a hierarchy of separators, trying larger structural separators first before falling back to smaller ones.
/**
* Text splitter that recursively tries different separators to preserve semantic structure
*/
class RecursiveCharacterTextSplitter extends TextSplitter implements RecursiveCharacterTextSplitterParams {
separators: string[];
constructor(fields?: Partial<RecursiveCharacterTextSplitterParams>);
splitText(text: string): Promise<string[]>;
static lc_name(): string;
static fromLanguage(
language: SupportedTextSplitterLanguage,
options?: Partial<RecursiveCharacterTextSplitterParams>
): RecursiveCharacterTextSplitter;
static getSeparatorsForLanguage(language: SupportedTextSplitterLanguage): string[];
}
interface RecursiveCharacterTextSplitterParams extends TextSplitterParams {
/** Array of separators to try in order of preference (default: ["\n\n", "\n", " ", ""]) */
separators: string[];
}Usage Examples:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Basic recursive splitting with default separators
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 100,
chunkOverlap: 20,
});
const text = `# Title
This is a paragraph.
This is another paragraph.
Final paragraph here.`;
const chunks = await splitter.splitText(text);
// Tries paragraph breaks (\n\n) first, then line breaks (\n), then spaces, then characters
// Custom separator hierarchy
const customSplitter = new RecursiveCharacterTextSplitter({
separators: ["\n### ", "\n## ", "\n# ", "\n\n", "\n", " ", ""],
chunkSize: 200,
chunkOverlap: 50,
});
const markdown = `# Main Title
Content under main title.
## Section One
Content in section one.
### Subsection
Content in subsection.`;
const markdownChunks = await customSplitter.splitText(markdown);Create language-aware splitters that understand code structure and preserve semantic boundaries.
/**
* Create a recursive splitter optimized for a specific programming language
* @param language - The programming language to optimize for
* @param options - Additional configuration options
* @returns Configured RecursiveCharacterTextSplitter
*/
static fromLanguage(
language: SupportedTextSplitterLanguage,
options?: Partial<RecursiveCharacterTextSplitterParams>
): RecursiveCharacterTextSplitter;
/**
* Get the separator hierarchy for a specific programming language
* @param language - The programming language
* @returns Array of separators in order of preference
*/
static getSeparatorsForLanguage(language: SupportedTextSplitterLanguage): string[];
const SupportedTextSplitterLanguages = [
"cpp", "go", "java", "js", "php", "proto", "python", "rst",
"ruby", "rust", "scala", "swift", "markdown", "latex", "html", "sol"
] as const;
type SupportedTextSplitterLanguage = (typeof SupportedTextSplitterLanguages)[number];Language-Specific Examples:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Python code splitting
const pythonSplitter = RecursiveCharacterTextSplitter.fromLanguage("python", {
chunkSize: 500,
chunkOverlap: 50,
});
const pythonCode = `class DataProcessor:
def __init__(self, config):
self.config = config
def process(self, data):
return self.transform(data)
def transform(self, data):
# Transform logic here
return processed_data`;
const pythonChunks = await pythonSplitter.splitText(pythonCode);
// Preserves class and function boundaries
// JavaScript code splitting
const jsSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
chunkSize: 300,
chunkOverlap: 30,
});
const jsCode = `function processData(input) {
const transformed = input.map(item => {
return {
...item,
processed: true
};
});
return transformed.filter(item => item.valid);
}
const config = {
timeout: 5000,
retries: 3
};`;
const jsChunks = await jsSplitter.splitText(jsCode);
// Get separators for any supported language
const rustSeparators = RecursiveCharacterTextSplitter.getSeparatorsForLanguage("rust");
console.log(rustSeparators);
// ["\nfn ", "\nconst ", "\nlet ", "\nif ", "\nwhile ", "\nfor ", "\nloop ", "\nmatch ", "\nconst ", "\n\n", "\n", " ", ""]The recursive text splitter supports intelligent splitting for these languages:
const SupportedTextSplitterLanguages = [
"cpp", // C++
"go", // Go
"java", // Java
"js", // JavaScript/TypeScript
"php", // PHP
"proto", // Protocol Buffers
"python", // Python
"rst", // reStructuredText
"ruby", // Ruby
"rust", // Rust
"scala", // Scala
"swift", // Swift
"markdown", // Markdown
"latex", // LaTeX
"html", // HTML
"sol" // Solidity
] as const;Each language has optimized separators that prioritize structural elements:
Python separators: ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]
JavaScript separators: ["\nfunction ", "\nconst ", "\nlet ", "\nvar ", "\nclass ", "\nif ", "\nfor ", "\nwhile ", "\nswitch ", "\ncase ", "\ndefault ", "\n\n", "\n", " ", ""]
Java separators: ["\nclass ", "\npublic ", "\nprotected ", "\nprivate ", "\nstatic ", "\nif ", "\nfor ", "\nwhile ", "\nswitch ", "\ncase ", "\n\n", "\n", " ", ""]
interface RecursiveCharacterTextSplitterParams extends TextSplitterParams {
/** Array of separators to try in order of preference */
separators: string[];
}Advanced Usage Examples:
// HTML-aware splitting with custom configuration
const htmlSplitter = RecursiveCharacterTextSplitter.fromLanguage("html", {
chunkSize: 1000,
chunkOverlap: 100,
keepSeparator: true, // Keep HTML tags
});
const htmlContent = `<html>
<head>
<title>Example Page</title>
</head>
<body>
<div class="content">
<h1>Main Heading</h1>
<p>First paragraph content.</p>
<p>Second paragraph content.</p>
</div>
</body>
</html>`;
const htmlChunks = await htmlSplitter.splitText(htmlContent);
// Markdown with custom separators prioritizing headings
const markdownSplitter = new RecursiveCharacterTextSplitter({
separators: [
"\n# ", // H1 headings
"\n## ", // H2 headings
"\n### ", // H3 headings
"\n\n", // Paragraph breaks
"\n", // Line breaks
" ", // Spaces
"" // Characters
],
chunkSize: 500,
chunkOverlap: 50,
keepSeparator: true
});
// Document splitting with overlap headers
const documents = await markdownSplitter.createDocuments(
[markdownContent],
[{ source: "guide.md" }],
{
chunkHeader: "--- Document Chunk ---\n",
chunkOverlapHeader: "(Continued) ",
appendChunkOverlapHeader: true
}
);Recursive text splitters work seamlessly with LangChain's document processing pipeline:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
// Transform existing documents
const splitter = RecursiveCharacterTextSplitter.fromLanguage("python");
const docs = [
new Document({
pageContent: pythonCode,
metadata: { filename: "processor.py", author: "dev" }
})
];
// Use as document transformer
const splitDocs = await splitter.transformDocuments(docs);
// All documents maintain original metadata plus line location information
splitDocs.forEach(doc => {
console.log(doc.metadata);
// { filename: "processor.py", author: "dev", loc: { lines: { from: 1, to: 5 } } }
});