tessl/npm-langchain--textsplitters

Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines

Token-Based Splitting

Token-aware splitting that uses tiktoken encodings to measure chunk size in tokens rather than characters. Essential for applications whose chunks must fit within a language model's token budget.

Capabilities

TokenTextSplitter Class

Splits text based on actual token boundaries using tiktoken encoding, providing accurate token-based chunking for language model applications.

/**
 * Text splitter that splits text based on token count using tiktoken encoding
 */
class TokenTextSplitter extends TextSplitter implements TokenTextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
  private tokenizer: tiktoken.Tiktoken;
  
  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  /** The tiktoken encoding to use (default: "gpt2") */
  encodingName: tiktoken.TiktokenEncoding;
  /** Special tokens that are allowed in the text (default: []) */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered (default: "all") */
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";
  
  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
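Under the hood, splitText can be sketched as a sliding window over token ids: encode the text, take chunkSize tokens at a time, and step forward by chunkSize - chunkOverlap. The sketch below is a simplified re-implementation, not the library code, and uses a toy whitespace tokenizer as a stand-in for tiktoken (an assumption for illustration; real tiktoken encodings produce BPE subword tokens):

```typescript
type Tokenizer = {
  encode(text: string): number[];
  decode(tokens: number[]): string;
};

// Toy tokenizer: one token per whitespace-separated word (illustration only;
// tiktoken uses BPE subword tokens).
function makeWordTokenizer(): Tokenizer {
  const vocab: string[] = [];
  const ids = new Map<string, number>();
  return {
    encode: (text) =>
      text.split(/\s+/).filter(Boolean).map((w) => {
        if (!ids.has(w)) { ids.set(w, vocab.length); vocab.push(w); }
        return ids.get(w)!;
      }),
    decode: (tokens) => tokens.map((t) => vocab[t]).join(" "),
  };
}

// Slide a window of chunkSize tokens, stepping by chunkSize - chunkOverlap,
// and decode each window back to text.
function splitByTokens(
  text: string,
  tok: Tokenizer,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  const ids = tok.encode(text);
  const chunks: string[] = [];
  for (let start = 0; start < ids.length; start += chunkSize - chunkOverlap) {
    chunks.push(tok.decode(ids.slice(start, start + chunkSize)));
    if (start + chunkSize >= ids.length) break; // last window reached the end
  }
  return chunks;
}

const tok = makeWordTokenizer();
const chunks = splitByTokens("a b c d e f g h", tok, 4, 1);
console.log(chunks); // [ 'a b c d', 'd e f g', 'g h' ]
```

Note how the last token of each chunk ("d", "g") reappears at the start of the next one: that is the chunkOverlap carrying context across chunk boundaries.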

Usage Examples:

import { TokenTextSplitter } from "@langchain/textsplitters";

// Basic token-based splitting with GPT-2 encoding
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 100,  // 100 tokens per chunk
  chunkOverlap: 20, // 20 token overlap
});

const text = `This is a sample text that will be split based on actual token boundaries 
rather than character count. This ensures more accurate chunking for language model applications.`;

const chunks = await splitter.splitText(text);
// Each chunk contains at most 100 tokens, with a 20-token overlap between consecutive chunks

// GPT-3.5/GPT-4 compatible splitting
const gpt4Splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // Used by GPT-3.5 and GPT-4
  chunkSize: 500,
  chunkOverlap: 50,
});

const longText = `Your long content here that needs to be split into chunks 
that respect the actual token boundaries used by modern language models...`;

const tokenChunks = await gpt4Splitter.splitText(longText);

// Different encoding options
const r50kSplitter = new TokenTextSplitter({
  encodingName: "r50k_base", // Used by the original GPT-3 models (davinci, curie, babbage, ada)
  chunkSize: 200,
  chunkOverlap: 30,
});

Encoding Options

Token text splitters support various tiktoken encodings for different language models:

/**
 * Supported tiktoken encodings for different language models
 */
type TiktokenEncoding = 
  | "gpt2"           // GPT-2
  | "r50k_base"      // Original GPT-3 models (davinci, curie, babbage, ada)
  | "p50k_base"      // text-davinci-002, text-davinci-003, and Codex models
  | "cl100k_base";   // GPT-3.5 and GPT-4 models

Encoding Examples:

// For GPT-3.5-turbo and GPT-4 applications
const modernSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,   // Approximately 1000 tokens
  chunkOverlap: 100,
});

// For legacy GPT-3 models
const legacySplitter = new TokenTextSplitter({
  encodingName: "r50k_base", 
  chunkSize: 2048,   // Common context window size
  chunkOverlap: 200,
});

// For GPT-2 applications or research
const gpt2Splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 512,
  chunkOverlap: 50,
});
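Rather than hard-coding an encoding name per call site, the model-to-encoding choice can be centralized in a lookup. The sketch below is a hand-written subset for illustration (js-tiktoken ships its own, fuller model-to-encoding helper):

```typescript
type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

// Simplified model -> encoding table (assumption: a small hand-picked subset,
// not the authoritative mapping).
const MODEL_ENCODINGS: Record<string, TiktokenEncoding> = {
  "gpt-4": "cl100k_base",
  "gpt-3.5-turbo": "cl100k_base",
  "text-davinci-003": "p50k_base",
  "text-davinci-002": "p50k_base",
  "davinci": "r50k_base",
  "gpt2": "gpt2",
};

// Resolve the encoding for a model name, failing loudly on unknown models.
function encodingForModel(model: string): TiktokenEncoding {
  const enc = MODEL_ENCODINGS[model];
  if (!enc) throw new Error(`Unknown model: ${model}`);
  return enc;
}

console.log(encodingForModel("gpt-4")); // cl100k_base
```

The resolved name can then be passed straight to `new TokenTextSplitter({ encodingName: ... })`.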

Special Token Handling

Control how special tokens are handled during tokenization:

interface TokenTextSplitterParams extends TextSplitterParams {
  /** Special tokens that are allowed in the text */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered */
  disallowedSpecial: "all" | Array<string>;
}

Special Token Examples:

// Allow all special tokens
const permissiveSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: "all",
  disallowedSpecial: [],
});

// Strict special token handling - error on any special tokens
const strictSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: [],
  disallowedSpecial: "all",
});

// Allow specific special tokens
const customSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: ["<|endoftext|>", "<|startoftext|>"],
  disallowedSpecial: "all",
});

const textWithSpecialTokens = "Regular text <|endoftext|> More text <|startoftext|> Final text";
const chunks = await customSplitter.splitText(textWithSpecialTokens);
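Conceptually, the check tiktoken performs can be sketched as follows (a simplified re-implementation, not the actual library code): with disallowedSpecial set to "all", every special token that is not explicitly allowed raises an error when it appears in the text.

```typescript
// Simplified sketch of tiktoken's special-token validation (assumption:
// re-implemented here for illustration; the real check lives inside
// the tiktoken encoder).
function checkDisallowedSpecial(
  text: string,
  specialTokens: string[],          // all special tokens the encoding knows
  allowedSpecial: "all" | string[],
  disallowedSpecial: "all" | string[],
): void {
  const allowed =
    allowedSpecial === "all" ? new Set(specialTokens) : new Set(allowedSpecial);
  // "all" means: every special token except the explicitly allowed ones.
  const disallowed =
    disallowedSpecial === "all"
      ? specialTokens.filter((t) => !allowed.has(t))
      : disallowedSpecial;
  for (const token of disallowed) {
    if (text.includes(token)) {
      throw new Error(`Text contains a disallowed special token: ${token}`);
    }
  }
}
```

So the `strictSplitter` configuration above throws as soon as any special token appears, while `customSplitter` tolerates exactly the two tokens it lists.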

Document Processing

Token text splitters integrate with LangChain's document processing pipeline:

import { TokenTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create documents with precise token management
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 300,
  chunkOverlap: 30,
});

// Split documents for RAG applications
const documents = [
  new Document({
    pageContent: "Long article content that needs token-based splitting...",
    metadata: { source: "article.txt", tokens: 1500 }
  })
];

const splitDocs = await splitter.splitDocuments(documents);

// Each split document maintains metadata and adds line location
splitDocs.forEach(doc => {
  console.log(`Chunk: ${doc.pageContent.substring(0, 50)}...`);
  console.log(`Metadata:`, doc.metadata);
  // Includes original metadata plus { loc: { lines: { from: X, to: Y } } }
});

// Create documents with chunk headers for context
const docsWithHeaders = await splitter.createDocuments(
  [longArticleText],
  [{ source: "research.pdf", page: 1 }],
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "[CONTINUED] ",
    appendChunkOverlapHeader: true
  }
);

Practical Applications

Token-based splitting is particularly useful for:

RAG Systems with Token Limits:

// Ensure chunks fit within model context windows
const ragSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,    // Leave room for query + system prompt
  chunkOverlap: 50,  // Maintain context between chunks
});

const knowledge = await ragSplitter.createDocuments(documentTexts, metadatas);
// Each chunk guaranteed to fit within token budget

Prompt Engineering:

// Split prompts to fit model limits while preserving structure
const promptSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base", 
  chunkSize: 2000,   // Under 4K context limit
  chunkOverlap: 100, // Maintain instruction context
});

const longPrompt = "Complex multi-part instructions...";
const promptChunks = await promptSplitter.splitText(longPrompt);

Content Processing Pipelines:

// Process large documents with precise token accounting
const pipelineSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,
  chunkOverlap: 100,
  lengthFunction: async (text: string) => {
    // countTokens is a hypothetical helper shown for illustration;
    // note that a custom length function adds per-call overhead to splitting
    const tokenCount = await countTokens(text);
    return tokenCount;
  }
});

const processedDocs = await pipelineSplitter.transformDocuments(inputDocs);

Performance Considerations

Token-based splitting requires tokenizing the full input, which has performance implications:

// Reuse splitter instances to avoid repeated tokenizer initialization
const sharedSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
});

// Process multiple texts with same splitter
const allChunks = await Promise.all(
  texts.map(text => sharedSplitter.splitText(text))
);

// For high-volume processing, consider batching
const batchSize = 10;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchChunks = await Promise.all(
    batch.map(text => sharedSplitter.splitText(text))
  );
  // Process batch results
}
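The fixed-batch loop above waits for the slowest text in each batch before starting the next batch. A small worker pool keeps all slots busy instead, starting the next text as soon as any worker frees up. The sketch below uses a placeholder splitFn standing in for sharedSplitter.splitText:

```typescript
// Run fn over items with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker repeatedly claims the next unclaimed index.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Usage (splitFn is a placeholder for sharedSplitter.splitText):
const splitFn = async (text: string) => text.split(" ");
mapWithConcurrency(["a b", "c d e"], 2, splitFn).then((out) => console.log(out));
```

This keeps throughput steady under uneven text lengths without unbounded concurrency.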
