tessl/npm-langchain--textsplitters

Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines

Character Text Splitting

Basic text splitting on a single, fixed separator string. It suits simple document chunking where the separator pattern is predictable, such as line breaks or paragraph markers.

Capabilities

CharacterTextSplitter Class

Splits text on a single separator string (one or more characters), then merges the resulting pieces into chunks bounded by a configurable chunk size and overlap.

/**
 * Text splitter that splits text based on a single character separator
 */
class CharacterTextSplitter extends TextSplitter implements CharacterTextSplitterParams {
  separator: string;
  
  constructor(fields?: Partial<CharacterTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}

interface CharacterTextSplitterParams extends TextSplitterParams {
  /** The character(s) to split text on (default: "\n\n") */
  separator: string;
}
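
To make the split-then-merge idea concrete, here is a minimal sketch. It is a simplified illustration, not the library's implementation: `splitAndMerge` is a hypothetical helper, lengths are plain character counts, and overlap and `keepSeparator` are ignored.

```typescript
// Simplified sketch of "split on a separator, then merge up to chunkSize".
// Hypothetical helper: overlap and keepSeparator are deliberately left out.
function splitAndMerge(text: string, separator: string, chunkSize: number): string[] {
  // Step 1: naive split on the separator; drop empty pieces.
  const pieces = text.split(separator).filter((p) => p !== "");

  // Step 2: greedily join pieces while the joined length stays within chunkSize.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize || current === "") {
      // Either it fits, or the piece starts a new chunk (even if oversized on its own).
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample =
  "Paragraph one content here.\n\nParagraph two content here.\n\nParagraph three content here.";
const oneChunk = splitAndMerge(sample, "\n\n", 1000); // everything fits in one chunk
const twoChunks = splitAndMerge(sample, "\n\n", 60); // the third paragraph no longer fits
console.log(oneChunk.length, twoChunks.length);
```

The point of the sketch is that the separator only decides where text may be cut; `chunkSize` decides how many pieces end up together in a chunk.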

Usage Examples:

import { CharacterTextSplitter } from "@langchain/textsplitters";

// Basic paragraph splitting
const splitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});

const text = `Paragraph one content here.

Paragraph two content here.

Paragraph three content here.`;

const chunks = await splitter.splitText(text);
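// All three paragraphs (87 characters once joined) fit within chunkSize: 1000,
// so the pieces are merged back into a single chunk:
// Result: ["Paragraph one content here.\n\nParagraph two content here.\n\nParagraph three content here."]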

// Custom separator splitting
const csvSplitter = new CharacterTextSplitter({
  separator: ",",
  chunkSize: 50,
  chunkOverlap: 10,
});

const csvData = "apple,banana,cherry,date,elderberry,fig";
const csvChunks = await csvSplitter.splitText(csvData);

// Word-level splitting
const wordSplitter = new CharacterTextSplitter({
  separator: " ",
  chunkSize: 20,
  chunkOverlap: 5,
});

const sentence = "The quick brown fox jumps over the lazy dog";
const wordChunks = await wordSplitter.splitText(sentence);
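
The word-level example above uses `chunkOverlap: 5`. As a rough illustration of what overlap means, the sketch below carries trailing pieces of a finished chunk into the next one, as long as they fit within the overlap budget. `mergeWithOverlap` is a hypothetical helper written for this page under simplifying assumptions (character lengths, synchronous code); the library's own merging logic may differ in its details.

```typescript
// Hypothetical sketch of greedy merging with overlap; not the library's code.
function mergeWithOverlap(
  pieces: string[],
  separator: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  const joinedLength = (parts: string[]): number => parts.join(separator).length;
  const chunks: string[] = [];
  const current: string[] = [];

  for (const piece of pieces) {
    if (current.length > 0 && joinedLength([...current, piece]) > chunkSize) {
      // The incoming piece does not fit: emit the current chunk.
      chunks.push(current.join(separator));
      // Drop leading pieces until the carried-over tail fits in chunkOverlap
      // and still leaves room for the incoming piece.
      while (
        current.length > 0 &&
        (joinedLength(current) > chunkOverlap || joinedLength([...current, piece]) > chunkSize)
      ) {
        current.shift();
      }
    }
    current.push(piece);
  }
  if (current.length > 0) chunks.push(current.join(separator));
  return chunks;
}

const words = "The quick brown fox jumps over the lazy dog".split(" ");
const wordChunksSketch = mergeWithOverlap(words, " ", 20, 5);
console.log(wordChunksSketch);
// → ["The quick brown fox", "fox jumps over the", "the lazy dog"]
```

In this sketch the words "fox" and "the" are repeated at chunk boundaries because each is at most 5 characters long, which is the overlap budget.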

Document Creation

Create Document objects from split text with metadata preservation.

/**
 * Create Document objects from split text
 * @param texts - Array of texts to split and convert to documents
 * @param metadatas - Optional metadata for each text
 * @param chunkHeaderOptions - Optional chunk header configuration
 * @returns Array of Document objects with split text and metadata
 */
createDocuments(
  texts: string[],
  metadatas?: Record<string, any>[],
  chunkHeaderOptions?: TextSplitterChunkHeaderOptions
): Promise<Document[]>;

Usage Examples:

import { CharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 100,
  chunkOverlap: 20,
});

// Create documents with metadata
const texts = ["First document text", "Second document text"];
const metadatas = [
  { source: "doc1.txt", author: "Alice" },
  { source: "doc2.txt", author: "Bob" }
];

const documents = await splitter.createDocuments(texts, metadatas);
// Each Document's pageContent holds one chunk; its metadata is the matching entry from metadatas plus line location info

// Create documents with chunk headers
const documentsWithHeaders = await splitter.createDocuments(
  texts,
  metadatas,
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "(continued from previous chunk) ",
    appendChunkOverlapHeader: true
  }
);
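
The sketch below illustrates how the three header options are assumed to combine into each chunk's content: the chunk header comes first, then the overlap header (only on chunks that continue an earlier chunk of the same text), then the chunk itself. `applyChunkHeaders` and `HeaderOptions` are hypothetical names used for illustration; the defaults mirror the ones documented for `TextSplitterChunkHeaderOptions`.

```typescript
// Hypothetical helper showing the assumed header composition order.
interface HeaderOptions {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
  appendChunkOverlapHeader?: boolean;
}

function applyChunkHeaders(chunks: string[], options: HeaderOptions = {}): string[] {
  const {
    chunkHeader = "",
    chunkOverlapHeader = "(cont'd) ",
    appendChunkOverlapHeader = false,
  } = options;

  return chunks.map((chunk, index) => {
    // Only chunks after the first one of a given text count as continuations.
    const overlapHeader = appendChunkOverlapHeader && index > 0 ? chunkOverlapHeader : "";
    return chunkHeader + overlapHeader + chunk;
  });
}

const withHeaders = applyChunkHeaders(["first part", "second part"], {
  chunkHeader: "=== DOCUMENT CHUNK ===\n",
  chunkOverlapHeader: "(continued from previous chunk) ",
  appendChunkOverlapHeader: true,
});
console.log(withHeaders);
```

Note that a text short enough to fit in a single chunk, like the two sample texts above, would never receive the overlap header under this model, because it has no continuation chunks.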

Document Splitting

Split existing Document objects while preserving their metadata.

/**
 * Split existing Document objects
 * @param documents - Array of documents to split
 * @param chunkHeaderOptions - Optional chunk header configuration
 * @returns Array of split Document objects
 */
splitDocuments(
  documents: Document[],
  chunkHeaderOptions?: TextSplitterChunkHeaderOptions
): Promise<Document[]>;

Usage Examples:

import { CharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const splitter = new CharacterTextSplitter({
  separator: "\n",
  chunkSize: 50,
  chunkOverlap: 10,
});

const originalDocs = [
  new Document({
    pageContent: "Line one\nLine two\nLine three\nLine four",
    metadata: { source: "example.txt", type: "text" }
  })
];

const splitDocs = await splitter.splitDocuments(originalDocs);
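// Each resulting document keeps the original metadata plus line location info.
// This 38-character input fits within chunkSize: 50, so it comes back as a single document;
// longer inputs produce several documents.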
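
As an illustration of the "line location info" mentioned above, the sketch below computes a 1-based line range for each chunk and stores it next to a copy of the original metadata. The `loc.lines.from` / `loc.lines.to` shape is an assumption about how the location is recorded, and `attachLineLocations` and `ChunkDoc` are hypothetical names, not part of the package.

```typescript
// Hypothetical types and helper for illustration; not the library's code.
interface ChunkDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function countNewlines(s: string): number {
  return s.split("\n").length - 1;
}

function attachLineLocations(
  source: string,
  chunks: string[],
  metadata: Record<string, unknown>,
): ChunkDoc[] {
  const docs: ChunkDoc[] = [];
  let searchFrom = 0;
  for (const chunk of chunks) {
    // Locate the chunk in the source, scanning forward from the previous match.
    const start = source.indexOf(chunk, searchFrom);
    if (start === -1) throw new Error("chunk not found in source text");
    const from = 1 + countNewlines(source.slice(0, start));
    const to = from + countNewlines(chunk);
    // Copy the original metadata so every chunk gets its own object.
    docs.push({ pageContent: chunk, metadata: { ...metadata, loc: { lines: { from, to } } } });
    searchFrom = start + 1;
  }
  return docs;
}

const sourceText = "Line one\nLine two\nLine three\nLine four";
const located = attachLineLocations(
  sourceText,
  ["Line one\nLine two", "Line three\nLine four"],
  { source: "example.txt", type: "text" },
);
console.log(JSON.stringify(located, null, 2));
```

Here the first chunk is tagged with lines 1 to 2 and the second with lines 3 to 4, while `source` and `type` are carried over unchanged.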

Configuration Options

CharacterTextSplitter accepts the base TextSplitterParams configuration in addition to its own separator option.

interface TextSplitterParams {
  /** Maximum size of each chunk in characters (default: 1000) */
  chunkSize: number;
  /** Number of characters to overlap between chunks (default: 200) */
  chunkOverlap: number;
  /** Whether to keep the separator in the split text (default: false) */
  keepSeparator: boolean;
  /** Custom function to calculate text length (default: text.length) */
  lengthFunction?: ((text: string) => number) | ((text: string) => Promise<number>);
}

type TextSplitterChunkHeaderOptions = {
  /** Header text to prepend to each chunk */
  chunkHeader?: string;
  /** Header text for chunks that continue from previous (default: "(cont'd) ") */
  chunkOverlapHeader?: string;
  /** Whether to append overlap header to continuing chunks (default: false) */
  appendChunkOverlapHeader?: boolean;
};
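
Because `chunkSize` and `chunkOverlap` are measured with `lengthFunction`, changing that function changes the unit of the budget. The small sketch below shows the same text failing a 20-unit budget when measured in characters but passing it when measured in words. `fitsBudget`, `byCharacters`, and `byWords` are hypothetical helpers for illustration, and only the synchronous form of the length function is covered.

```typescript
// Hypothetical helpers illustrating that the length function defines the unit of chunkSize.
type LengthFn = (text: string) => number;

const byCharacters: LengthFn = (t) => t.length;
const byWords: LengthFn = (t) => t.split(/\s+/).filter((w) => w !== "").length;

function fitsBudget(candidate: string, chunkSize: number, lengthFn: LengthFn): boolean {
  return lengthFn(candidate) <= chunkSize;
}

const candidateText = "The quick brown fox jumps over the lazy dog";
const fitsAsCharacters = fitsBudget(candidateText, 20, byCharacters); // 43 characters
const fitsAsWords = fitsBudget(candidateText, 20, byWords); // 9 words
console.log(fitsAsCharacters, fitsAsWords);
// → false true
```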

Configuration Examples:

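import { CharacterTextSplitter } from "@langchain/textsplitters";

// Custom length function: measure chunks in approximate tokens instead of characters
const tokenBasedSplitter = new CharacterTextSplitter({
  separator: "\n",
  chunkSize: 100, // roughly 100 whitespace-separated words per chunk, not 100 characters
  chunkOverlap: 20, // measured with the same lengthFunction
  lengthFunction: (text: string) => {
    // Rough estimate: count whitespace-separated words (a real tokenizer would be more accurate)
    return text.split(/\s+/).length;
  }
});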

// Keep separators in output
const separatorKeepingSplitter = new CharacterTextSplitter({
  separator: "\n---\n",
  chunkSize: 500,
  chunkOverlap: 0,
  keepSeparator: true // Separators will be included in the chunks
});
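
To show the difference `keepSeparator` makes at the splitting step, here is a minimal sketch. It assumes a kept separator stays attached to the start of the piece that follows it, implemented here with a regular-expression lookahead; `splitOnSeparatorSketch` is a hypothetical helper, and the library's exact behavior should be checked against its source.

```typescript
// Hypothetical helper; assumes a kept separator is attached to the following piece.
function splitOnSeparatorSketch(
  text: string,
  separator: string,
  keepSeparator: boolean,
): string[] {
  // An empty separator falls back to splitting into single characters.
  if (separator === "") return text.split("");

  let pieces: string[];
  if (keepSeparator) {
    // Escape regex metacharacters, then split at positions followed by the separator.
    const escaped = separator.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    pieces = text.split(new RegExp(`(?=${escaped})`));
  } else {
    pieces = text.split(separator);
  }
  return pieces.filter((p) => p !== "");
}

const sections = "intro\n---\nbody\n---\nend";
const dropped = splitOnSeparatorSketch(sections, "\n---\n", false);
const kept = splitOnSeparatorSketch(sections, "\n---\n", true);
console.log(dropped); // → ["intro", "body", "end"]
console.log(kept); // → ["intro", "\n---\nbody", "\n---\nend"]
```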
