tessl/npm-langchain--textsplitters

Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: npmpkg:npm/@langchain/textsplitters@0.1.x

To install, run

npx @tessl/cli install tessl/npm-langchain--textsplitters@0.1.0

# LangChain Text Splitters

LangChain Text Splitters provides text splitter implementations for LangChain.js, most commonly used in retrieval-augmented generation (RAG) pipelines. The library offers an abstract base class and concrete implementations that split text documents into smaller chunks with a configurable chunk size, overlap, and length function.

## Package Information

- **Package Name**: @langchain/textsplitters
- **Package Type**: npm
- **Language**: TypeScript
- **Installation**: `npm install @langchain/textsplitters @langchain/core js-tiktoken`

## Core Imports

```typescript
import {
  // Classes
  TextSplitter,
  CharacterTextSplitter,
  RecursiveCharacterTextSplitter,
  TokenTextSplitter,
  MarkdownTextSplitter,
  LatexTextSplitter,

  // Interfaces and Types
  TextSplitterParams,
  TextSplitterChunkHeaderOptions,
  CharacterTextSplitterParams,
  RecursiveCharacterTextSplitterParams,
  TokenTextSplitterParams,
  MarkdownTextSplitterParams,
  LatexTextSplitterParams,
  SupportedTextSplitterLanguage,

  // Runtime constant listing the supported language identifiers
  SupportedTextSplitterLanguages
} from "@langchain/textsplitters";

// Type-only import for the tiktoken encoding types
import type * as tiktoken from "js-tiktoken";

// Required for document processing
import { Document } from "@langchain/core/documents";
```

For CommonJS:

```javascript
const {
  // Classes
  TextSplitter,
  CharacterTextSplitter,
  RecursiveCharacterTextSplitter,
  TokenTextSplitter,
  MarkdownTextSplitter,
  LatexTextSplitter,

  // Runtime constant listing the supported language identifiers
  SupportedTextSplitterLanguages
} = require("@langchain/textsplitters");

// Interfaces and type aliases such as TextSplitterParams and
// SupportedTextSplitterLanguage exist only at compile time in TypeScript,
// so they have no runtime value to destructure from require().

// Required for document processing
const { Document } = require("@langchain/core/documents");
```

## Basic Usage

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create a text splitter with custom configuration
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// Split text into chunks
const text = "Your long text content here...";
const chunks = await splitter.splitText(text);

// Create documents from text with metadata
const docs = await splitter.createDocuments(
  [text],
  [{ source: "example.txt" }]
);

// Split existing documents
const existingDocs = [
  new Document({ pageContent: text, metadata: { source: "doc1" } })
];
const splitDocs = await splitter.splitDocuments(existingDocs);
```

## Architecture

LangChain Text Splitters is built around several key components:

- **Abstract Base Class**: `TextSplitter` provides the core splitting functionality and the document transformation interface
- **Concrete Implementations**: Specialized splitters for different splitting strategies (character-based, recursive, token-based)
- **Document Integration**: Integration with LangChain's `Document` ecosystem via `BaseDocumentTransformer`
- **Metadata Preservation**: Automatic line number tracking and metadata propagation through splitting operations
- **Language-Aware Splitting**: Language-specific separators for the 16 identifiers listed in `SupportedTextSplitterLanguages`, which cover programming languages and markup formats

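To make the line number tracking idea concrete, here is a minimal, self-contained sketch of how the lines spanned by a chunk could be derived from the original text. It is an illustration only and not the library's implementation; `LineRange`, `countNewlines`, and `lineRangeOf` are hypothetical names introduced for this example.

```typescript
// Simplified illustration; not the library's implementation.
interface LineRange {
  from: number; // 1-based line on which the chunk starts
  to: number; // 1-based line on which the chunk ends
}

// Counts newline characters in a string.
function countNewlines(text: string): number {
  return text.split("\n").length - 1;
}

// Locates `chunk` inside `source` and reports the lines it spans.
function lineRangeOf(source: string, chunk: string): LineRange | null {
  const start = source.indexOf(chunk);
  if (start === -1) return null; // chunk does not occur verbatim in source
  const firstLine = countNewlines(source.slice(0, start)) + 1;
  return { from: firstLine, to: firstLine + countNewlines(chunk) };
}

const source = "line one\nline two\nline three\nline four";
const range = lineRangeOf(source, "line two\nline three");
// range is { from: 2, to: 3 }
```

Counting newlines before and inside the chunk keeps the sketch independent of how the chunk was produced, at the cost of assuming the chunk appears verbatim in the source text.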

## Capabilities

### Basic Text Splitting

Splits text on a single configurable separator string. Suited to basic document chunking where the separator pattern is predictable.

```typescript { .api }
class CharacterTextSplitter extends TextSplitter {
  constructor(fields?: Partial<CharacterTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
}

interface CharacterTextSplitterParams extends TextSplitterParams {
  separator: string;
}
```

[Character Text Splitting](./character-splitting.md)

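As a rough illustration of separator-based chunking, the sketch below splits text on one separator and greedily merges adjacent pieces while they fit within a chunk size. It is a simplified stand-in, not the library's algorithm: it ignores overlap and custom length functions, and `splitBySeparator` is a hypothetical helper name.

```typescript
// Simplified illustration; not the library's implementation.
function splitBySeparator(text: string, separator: string, chunkSize: number): string[] {
  const pieces = text.split(separator).filter((piece) => piece.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // The piece fits, or it is a lone oversized piece that this
      // single-separator strategy cannot split any further.
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const paragraphs = "alpha beta\n\ngamma\n\ndelta epsilon zeta";
const chunks = splitBySeparator(paragraphs, "\n\n", 20);
// chunks is ["alpha beta\n\ngamma", "delta epsilon zeta"]
```

A piece longer than the chunk size passes through unchanged here, which is the main limitation that a hierarchy of separators is meant to address.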

### Recursive Text Splitting

Recursive splitting that works through an ordered list of separators, falling back to the next separator when a piece is still larger than the chunk size. Suited to chunking that keeps paragraphs, sentences, or code constructs together where possible; `fromLanguage` supplies language-specific separators.

```typescript { .api }
class RecursiveCharacterTextSplitter extends TextSplitter {
  constructor(fields?: Partial<RecursiveCharacterTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static fromLanguage(
    language: SupportedTextSplitterLanguage,
    options?: Partial<RecursiveCharacterTextSplitterParams>
  ): RecursiveCharacterTextSplitter;
  static getSeparatorsForLanguage(language: SupportedTextSplitterLanguage): string[];
}

interface RecursiveCharacterTextSplitterParams extends TextSplitterParams {
  separators: string[];
}

const SupportedTextSplitterLanguages = [
  "cpp", "go", "java", "js", "php", "proto", "python", "rst",
  "ruby", "rust", "scala", "swift", "markdown", "latex", "html", "sol"
] as const;

type SupportedTextSplitterLanguage = (typeof SupportedTextSplitterLanguages)[number];
```

[Recursive Text Splitting](./recursive-splitting.md)

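The separator hierarchy can be pictured with a small, self-contained sketch: a piece that is still too large is split again with the next separator, and fixed-width slicing is the last resort. This is a simplification under stated assumptions rather than the library's code. It does not merge small pieces back together or apply overlap, and `recursiveSplit` is a hypothetical name.

```typescript
// Simplified illustration; not the library's implementation.
function recursiveSplit(text: string, separators: string[], chunkSize: number): string[] {
  if (text.length === 0) return [];
  if (text.length <= chunkSize) return [text];

  if (separators.length === 0) {
    // No separators left: fall back to fixed-width slices.
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  // Split on the coarsest remaining separator and recurse into each piece
  // with the finer separators that follow it.
  const [separator, ...remaining] = separators;
  const result: string[] = [];
  for (const piece of text.split(separator)) {
    result.push(...recursiveSplit(piece, remaining, chunkSize));
  }
  return result;
}

const pieces = recursiveSplit("one two\n\nthree four five six", ["\n\n", " "], 10);
// pieces is ["one two", "three", "four", "five", "six"]
```

The first paragraph fits within 10 characters and stays intact, while the second is broken down to words because it does not.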

### Token-Based Splitting

Token-aware splitting that measures chunks in tiktoken tokens rather than characters. Useful when chunks must fit within a language model's token limits.

```typescript { .api }
class TokenTextSplitter extends TextSplitter {
  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
// (the installed js-tiktoken version may define further encoding names)
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
```

[Token-Based Splitting](./token-splitting.md)

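To show what chunking by token count with overlap looks like, the following sketch slides a fixed-size window over a token sequence. It assumes whitespace-separated words as a stand-in for tokens, which is not how tiktoken tokenizes, and `slidingTokenWindows` is a hypothetical helper rather than part of the package.

```typescript
// Simplified illustration; words stand in for real tiktoken tokens.
function slidingTokenWindows(tokens: string[], chunkSize: number, chunkOverlap: number): string[][] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const stride = chunkSize - chunkOverlap; // how far each window advances
  const windows: string[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    windows.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return windows;
}

const words = "the quick brown fox jumps over the lazy dog".split(" ");
const tokenChunks = slidingTokenWindows(words, 4, 1).map((w) => w.join(" "));
// tokenChunks is ["the quick brown fox", "fox jumps over the", "the lazy dog"]
```

Each window repeats the last token of the previous one, which is the effect `chunkOverlap` is meant to have on neighbouring chunks.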

### Document Format Splitting

Specialized splitters for Markdown and LaTeX documents. Both extend `RecursiveCharacterTextSplitter` and are designed to split along the format's structural boundaries.

```typescript { .api }
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

class LatexTextSplitter extends RecursiveCharacterTextSplitter {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;
type LatexTextSplitterParams = TextSplitterParams;
```

[Document Format Splitting](./format-splitting.md)

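As a loose illustration of structure-preserving splitting, the sketch below cuts a Markdown string at level-2 headings so that each heading stays with its own section. It is an assumption-laden simplification: the package's format splitters rely on separator lists rather than this line scan, and `splitOnHeadings` is a hypothetical name.

```typescript
// Simplified illustration; not the library's implementation.
function splitOnHeadings(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];

  for (const line of markdown.split("\n")) {
    // A level-2 heading starts a new section, unless nothing precedes it.
    if (line.indexOf("## ") === 0 && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n"));
  return sections;
}

const markdown = "# Title\nintro\n## A\ntext a\n## B\ntext b";
const sections = splitOnHeadings(markdown);
// sections is ["# Title\nintro", "## A\ntext a", "## B\ntext b"]
```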

## Core Types

```typescript { .api }
interface TextSplitterParams {
  // Maximum size of a chunk, as measured by lengthFunction
  chunkSize: number;
  // Overlap between consecutive chunks, in the same units as chunkSize
  chunkOverlap: number;
  // Whether separators are kept in the resulting chunks
  keepSeparator: boolean;
  // Measures the length of a piece of text; may be synchronous or asynchronous
  lengthFunction?: ((text: string) => number) | ((text: string) => Promise<number>);
}

type TextSplitterChunkHeaderOptions = {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
  appendChunkOverlapHeader?: boolean;
};

// Base class from @langchain/core/documents
abstract class BaseDocumentTransformer {
  abstract transformDocuments(documents: Document[], ...args: any[]): Promise<Document[]>;
}

abstract class TextSplitter extends BaseDocumentTransformer implements TextSplitterParams {
  lc_namespace: string[];
  chunkSize: number;
  chunkOverlap: number;
  keepSeparator: boolean;
  lengthFunction: ((text: string) => number) | ((text: string) => Promise<number>);

  constructor(fields?: Partial<TextSplitterParams>);
  abstract splitText(text: string): Promise<string[]>;
  transformDocuments(
    documents: Document[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  createDocuments(
    texts: string[],
    metadatas?: Record<string, any>[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  splitDocuments(
    documents: Document[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  protected splitOnSeparator(text: string, separator: string): string[];
  mergeSplits(splits: string[], separator: string): Promise<string[]>;
  private numberOfNewLines(text: string, start?: number, end?: number): number;
  private joinDocs(docs: string[], separator: string): string | null;
}
```
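The chunk header options are easiest to read with an example. The sketch below shows one plausible way the three fields of `TextSplitterChunkHeaderOptions` could combine when chunk text is assembled. It is an assumption about the behaviour rather than the library's code: `applyChunkHeaders` and `ChunkHeaderOptions` are hypothetical names, and the `"(cont'd) "` default is assumed for illustration.

```typescript
// Local copy of the option shape, so the sketch is self-contained.
type ChunkHeaderOptions = {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
  appendChunkOverlapHeader?: boolean;
};

// Hypothetical helper: prefixes every chunk with the header and, when
// requested, marks each chunk after the first with the overlap header.
function applyChunkHeaders(chunks: string[], options: ChunkHeaderOptions = {}): string[] {
  const {
    chunkHeader = "",
    chunkOverlapHeader = "(cont'd) ", // assumed default, for illustration only
    appendChunkOverlapHeader = false,
  } = options;

  return chunks.map((chunk, index) => {
    const overlapPrefix = appendChunkOverlapHeader && index > 0 ? chunkOverlapHeader : "";
    return chunkHeader + overlapPrefix + chunk;
  });
}

const withHeaders = applyChunkHeaders(["first part", "second part"], {
  chunkHeader: "DOC: report.txt\n",
  appendChunkOverlapHeader: true,
});
// withHeaders[0] is "DOC: report.txt\nfirst part"
// withHeaders[1] is "DOC: report.txt\n(cont'd) second part"
```

Repeating a short header on every chunk costs a few characters per chunk but lets each chunk identify its source document when it is retrieved in isolation.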