Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines
Token-aware splitting that uses tiktoken encodings to chunk text on actual token boundaries rather than character counts. Essential for applications that need precise token-based chunking for language models.
/**
 * Text splitter that splits text based on token count using tiktoken encoding
 */
class TokenTextSplitter extends TextSplitter implements TokenTextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
  private tokenizer: tiktoken.Tiktoken;

  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}
interface TokenTextSplitterParams extends TextSplitterParams {
  /** The tiktoken encoding to use (default: "gpt2") */
  encodingName: tiktoken.TiktokenEncoding;
  /** Special tokens that are allowed in the text (default: []) */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered (default: "all") */
  disallowedSpecial: "all" | Array<string>;
}
// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}

Usage Examples:
import { TokenTextSplitter } from "@langchain/textsplitters";
// Basic token-based splitting with GPT-2 encoding
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 100, // 100 tokens per chunk
  chunkOverlap: 20, // 20-token overlap
});
const text = `This is a sample text that will be split based on actual token boundaries
rather than character count. This ensures more accurate chunking for language model applications.`;
const chunks = await splitter.splitText(text);
// Each chunk contains approximately 100 tokens with 20 token overlap
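To verify the result, you can count tokens per chunk with js-tiktoken, the tokenizer library the splitter uses internally (a minimal sketch):

import { getEncoding } from "js-tiktoken";

// Count tokens in each chunk to confirm the configured limit holds
const enc = getEncoding("gpt2");
for (const chunk of chunks) {
  console.log(enc.encode(chunk).length); // expected: at most 100 tokens
}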
// GPT-3.5/GPT-4 compatible splitting
const gpt4Splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // Used by GPT-3.5 and GPT-4
  chunkSize: 500,
  chunkOverlap: 50,
});
const longText = `Your long content here that needs to be split into chunks
that respect the actual token boundaries used by modern language models...`;
const tokenChunks = await gpt4Splitter.splitText(longText);
// Different encoding options
const r50kSplitter = new TokenTextSplitter({
  encodingName: "r50k_base", // Used by the original GPT-3 base models (e.g. davinci)
  chunkSize: 200,
  chunkOverlap: 30,
});

Token text splitters support various tiktoken encodings for different language models:
/**
 * Supported tiktoken encodings for different language models
 */
type TiktokenEncoding =
  | "gpt2" // GPT-2, used by older models
  | "r50k_base" // Used by the original GPT-3 base models (davinci, curie, etc.)
  | "p50k_base" // Used by text-davinci-002/-003 and the Codex models
  | "cl100k_base"; // Used by GPT-3.5 and GPT-4 models

Encoding Examples:
// For GPT-3.5-turbo and GPT-4 applications
const modernSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000, // Approximately 1000 tokens
  chunkOverlap: 100,
});
// For legacy GPT-3 models
const legacySplitter = new TokenTextSplitter({
  encodingName: "r50k_base",
  chunkSize: 2048, // Matches the 2,048-token context window of the original GPT-3 models
  chunkOverlap: 200,
});
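Rather than hard-coding an encoding name, it can be derived from a model name; a sketch assuming js-tiktoken's getEncodingNameForModel helper:

import { getEncodingNameForModel } from "js-tiktoken";

// Resolve the encoding for a model, e.g. "gpt-4" -> "cl100k_base"
const modelSplitter = new TokenTextSplitter({
  encodingName: getEncodingNameForModel("gpt-4"),
  chunkSize: 1000,
  chunkOverlap: 100,
});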
// For GPT-2 applications or research
const gpt2Splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 512,
  chunkOverlap: 50,
});

Control how special tokens are handled during tokenization:
interface TokenTextSplitterParams extends TextSplitterParams {
  /** Special tokens that are allowed in the text */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered */
  disallowedSpecial: "all" | Array<string>;
}

Special Token Examples:
// Allow all special tokens
const permissiveSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: "all",
  disallowedSpecial: [],
});
// Strict special token handling - error on any special tokens
const strictSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: [],
  disallowedSpecial: "all",
});
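With disallowedSpecial: "all", splitText rejects when the text contains a special token, so strict splitting is typically wrapped in a try/catch (a sketch):

try {
  await strictSplitter.splitText("Some text containing <|endoftext|>");
} catch (err) {
  // The underlying tiktoken encoder throws when it sees a disallowed special token
  console.error("Special token rejected:", err);
}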
// Allow specific special tokens
const customSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: ["<|endoftext|>", "<|endofprompt|>"], // both are cl100k_base special tokens
  disallowedSpecial: "all",
});
const textWithSpecialTokens = "Regular text <|endoftext|> More text <|endofprompt|> Final text";
const chunks = await customSplitter.splitText(textWithSpecialTokens);

Token text splitters integrate with LangChain's document processing pipeline:
import { TokenTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
// Create documents with precise token management
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 300,
  chunkOverlap: 30,
});
// Split documents for RAG applications
const documents = [
  new Document({
    pageContent: "Long article content that needs token-based splitting...",
    metadata: { source: "article.txt", tokens: 1500 },
  }),
];
const splitDocs = await splitter.splitDocuments(documents);
// Each split document maintains metadata and adds line location
splitDocs.forEach(doc => {
  console.log(`Chunk: ${doc.pageContent.substring(0, 50)}...`);
  console.log(`Metadata:`, doc.metadata);
  // Includes original metadata plus { loc: { lines: { from: X, to: Y } } }
});
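The loc metadata can be turned into human-readable source citations; a sketch (the output shown is illustrative):

// Build a citation string from each chunk's source and line range
const citations = splitDocs.map(doc => {
  const lines = doc.metadata.loc?.lines;
  return lines
    ? `${doc.metadata.source} (lines ${lines.from}-${lines.to})`
    : doc.metadata.source;
});
console.log(citations); // e.g. ["article.txt (lines 1-12)", ...]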
// Create documents with chunk headers for context
const longArticleText = "Long article content to be split...";
const docsWithHeaders = await splitter.createDocuments(
  [longArticleText],
  [{ source: "research.pdf", page: 1 }],
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "[CONTINUED] ",
    appendChunkOverlapHeader: true,
  }
);

Token-based splitting is particularly useful for:
RAG Systems with Token Limits:
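A chunk size like the 500 tokens used below can be derived by budgeting the model's context window; a sketch in which every number is an illustrative assumption:

// Work backwards from a 4,096-token context window (illustrative numbers)
const contextWindow = 4096;
const systemPromptTokens = 200; // assumed system prompt size
const queryTokens = 100;        // assumed user query size
const answerTokens = 1000;      // reserved for the model's response
const retrievedChunks = 5;      // chunks stuffed into each prompt

const chunkBudget = Math.floor(
  (contextWindow - systemPromptTokens - queryTokens - answerTokens) / retrievedChunks
);
// (4096 - 200 - 100 - 1000) / 5 = 559, so chunkSize: 500 leaves headroom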
// Ensure chunks fit within model context windows
const ragSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500, // Leave room for query + system prompt
  chunkOverlap: 50, // Maintain context between chunks
});
const knowledge = await ragSplitter.createDocuments(documentTexts, metadatas);
// Each chunk is guaranteed to fit within the token budget

Prompt Engineering:
// Split prompts to fit model limits while preserving structure
const promptSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 2000, // Under a 4K context limit
  chunkOverlap: 100, // Maintain instruction context
});
const longPrompt = "Complex multi-part instructions...";
const promptChunks = await promptSplitter.splitText(longPrompt);

Content Processing Pipelines:
// Process large documents with precise token accounting
const pipelineSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,
  chunkOverlap: 100,
});
// Note: TokenTextSplitter measures chunk length with its own tokenizer, so a
// custom lengthFunction is not needed; chunk boundaries come from the token ids.
const processedDocs = await pipelineSplitter.transformDocuments(inputDocs);

Token-based splitting requires tokenization, which has performance implications:
// Reuse splitter instances to avoid repeated tokenizer initialization
const sharedSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
});
// Process multiple texts with the same splitter
const allChunks = await Promise.all(
  texts.map(text => sharedSplitter.splitText(text))
);
// For high-volume processing, consider batching
const batchSize = 10;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchChunks = await Promise.all(
    batch.map(text => sharedSplitter.splitText(text))
  );
  // Process batch results
}
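When several encodings are in play, the reuse pattern above can be extended by memoizing one splitter per encoding (a sketch):

// Cache one splitter per encoding so each tokenizer is initialized only once
type Encoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";
const splitterCache = new Map<Encoding, TokenTextSplitter>();

function getSplitter(encodingName: Encoding): TokenTextSplitter {
  let cached = splitterCache.get(encodingName);
  if (!cached) {
    cached = new TokenTextSplitter({ encodingName, chunkSize: 500, chunkOverlap: 50 });
    splitterCache.set(encodingName, cached);
  }
  return cached;
}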