# Core Base Classes and Utilities

The core base classes and utilities provide the fundamental interfaces, enums, and utility functions that form the foundation of all text splitting functionality in langchain-text-splitters. These components define the common patterns and contracts used throughout the library.

## Capabilities

### TextSplitter Abstract Base Class

The core abstract interface that all text splitters implement, providing common functionality and defining the splitting contract.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: Union[bool, Literal["start", "end"]] = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True
    ) -> None: ...

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> Self: ...
```

**Constructor Parameters:**

- `chunk_size`: Maximum size of chunks to return (default: `4000`)
- `chunk_overlap`: Overlap in characters between chunks (default: `200`)
- `length_function`: Function that measures the length of given chunks (default: `len`)
- `keep_separator`: Whether to keep the separator and where to place it (default: `False`)
- `add_start_index`: If `True`, includes chunk's start index in metadata (default: `False`)
- `strip_whitespace`: If `True`, strips whitespace from start and end of documents (default: `True`)

**Abstract Methods:**

- `split_text()`: Must be implemented by all concrete splitters

**Concrete Methods:**

- `create_documents()`: Create Document objects from text list with optional metadata
- `split_documents()`: Split existing Document objects into smaller chunks

**Factory Methods:**

- `from_huggingface_tokenizer()`: Create splitter from HuggingFace tokenizer
- `from_tiktoken_encoder()`: Create splitter from tiktoken encoder

**Usage:**

```python
from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

# Example concrete implementation (normally you'd use CharacterTextSplitter)
class SimpleTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Simple implementation that splits on periods
        sentences = text.split('.')
        return [s.strip() + '.' for s in sentences if s.strip()]

# Using the splitter
splitter = SimpleTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    strip_whitespace=True
)

# Split text
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split_text(text)

# Create documents with metadata
documents = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "example.txt", "author": "unknown"}]
)

# Split existing documents
existing_docs = [Document(page_content=text, metadata={"page": 1})]
split_docs = splitter.split_documents(existing_docs)
```

### Language Enumeration

Enumeration defining supported programming languages for language-specific text splitting.

```python { .api }
class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"
```

**Usage:**

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use with language-specific splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000
)

# Get separators for a language
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

# Compare language values
some_language = "python"  # e.g. a language tag obtained elsewhere
if some_language == Language.PYTHON.value:  # "python"
    print("This is Python code")
```

### Tokenizer Configuration

Configuration dataclass for token-based text splitting operations.

```python { .api }
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]
```

**Fields:**

- `chunk_overlap`: Number of tokens to overlap between chunks
- `tokens_per_chunk`: Maximum number of tokens per chunk
- `decode`: Function to decode token IDs back to text
- `encode`: Function to encode text to token IDs

**Usage:**

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
tokenizer_config = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=512,
    decode=encoding.decode,
    encode=encoding.encode
)

# Use with splitting function
text = "Long text to be tokenized and split..."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```

### Token-Based Splitting Utility

Utility function for splitting text using a tokenizer configuration.

```python { .api }
def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...
```

**Parameters:**

- `text`: The text to split
- `tokenizer`: Tokenizer configuration object

**Returns:**

- List of text chunks split according to token boundaries

**Usage:**

```python
from langchain_text_splitters import split_text_on_tokens, Tokenizer
from transformers import AutoTokenizer

# Using HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config = Tokenizer(
    chunk_overlap=25,
    tokens_per_chunk=256,
    decode=lambda tokens: hf_tokenizer.decode(tokens, skip_special_tokens=True),
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False)
)

text = "This is a sample text that will be tokenized and split into chunks."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```
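Conceptually, this kind of token-based splitting slides a window of `tokens_per_chunk` token IDs across the encoded text, stepping by `tokens_per_chunk - chunk_overlap` so consecutive chunks share tokens. A self-contained sketch of that idea, using a toy word-level "tokenizer" rather than the library's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    # Sliding token window: each chunk holds up to tokens_per_chunk tokens,
    # and consecutive chunks share chunk_overlap tokens.
    input_ids = tokenizer.encode(text)
    step = tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    chunks = []
    start = 0
    while start < len(input_ids):
        window = input_ids[start : start + tokenizer.tokens_per_chunk]
        chunks.append(tokenizer.decode(window))
        if start + tokenizer.tokens_per_chunk >= len(input_ids):
            break
        start += step
    return chunks

# Toy "tokenizer": one token per word, token IDs are indices into a vocabulary
words = "one two three four five six".split()
toy = Tokenizer(
    chunk_overlap=1,
    tokens_per_chunk=3,
    decode=lambda ids: " ".join(words[i] for i in ids),
    encode=lambda s: [words.index(w) for w in s.split()],
)
print(split_text_on_tokens(text="one two three four five six", tokenizer=toy))
# ['one two three', 'three four five', 'five six']
```

The real utility works on tokenizer IDs from tiktoken or HuggingFace, but the window arithmetic is the same.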

### Type Definitions

The text splitters package provides several TypedDict definitions for structured data used across various splitters.

```python { .api }
class ElementType(TypedDict):
    """Element type as typed dict for HTML elements."""
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    """Header type as typed dict for markdown headers."""
    level: int
    name: str
    data: str

class LineType(TypedDict):
    """Line type as typed dict for text lines with metadata."""
    metadata: dict[str, str]
    content: str
```

These types are used by:

- `ElementType`: HTML-based splitters for structured element data
- `HeaderType`: Markdown splitters for header information
- `LineType`: Markdown splitters for line-based processing
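Since TypedDicts are plain dicts at runtime, the shapes can be illustrated directly; the values below are made up for illustration, not library output:

```python
from typing import TypedDict

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str

# A markdown header splitter might record a "## Setup" heading as:
header: HeaderType = {"level": 2, "name": "Header 2", "data": "Setup"}

# ...and attach the accumulated header context to each line of content:
line: LineType = {
    "metadata": {"Header 2": "Setup"},
    "content": "Install the package with pip.",
}
print(header["level"], line["metadata"])
```

Because these are TypedDicts rather than classes, the splitters can build and merge them with ordinary dict operations while still getting static type checking.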

### Base Document Transformer Integration

All text splitters inherit from LangChain's `BaseDocumentTransformer`, providing consistent integration with the LangChain ecosystem.

```python { .api }
# Inherited from langchain_core
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...

    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...
```
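`TextSplitter` satisfies this contract by delegating `transform_documents` to `split_documents`, so any splitter can stand in wherever a document transformer is expected. A minimal stdlib sketch of that wiring, with a toy `Document` and splitter standing in for the library's classes (illustrative, not the library's code):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Iterable, Sequence

@dataclass
class Document:
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)

class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]: ...

class TextSplitter(BaseDocumentTransformer, ABC):
    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]:
        # Each chunk inherits a copy of its source document's metadata
        return [
            Document(page_content=chunk, metadata=dict(doc.metadata))
            for doc in documents
            for chunk in self.split_text(doc.page_content)
        ]

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # The transformer contract is satisfied by delegating to split_documents
        return self.split_documents(list(documents))

class PeriodSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        return [s.strip() + "." for s in text.split(".") if s.strip()]

docs = [Document("One. Two.", {"page": 1})]
out = PeriodSplitter().transform_documents(docs)
print([d.page_content for d in out])  # ['One.', 'Two.']
```

This is why splitters compose with any LangChain component that accepts a `BaseDocumentTransformer`.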

## Error Handling and Validation

The base `TextSplitter` class includes built-in validation for common configuration errors:

```python
from langchain_text_splitters import CharacterTextSplitter

# TextSplitter is abstract, so validation is triggered through any concrete
# subclass, e.g. CharacterTextSplitter. These will raise ValueError:
CharacterTextSplitter(chunk_size=0)  # chunk_size must be > 0
CharacterTextSplitter(chunk_overlap=-1)  # chunk_overlap must be >= 0
CharacterTextSplitter(chunk_size=100, chunk_overlap=200)  # overlap > chunk_size
```

## Design Principles

### Inheritance Hierarchy

The library follows a clear inheritance pattern:

1. `BaseDocumentTransformer` (from LangChain Core)
2. `TextSplitter` (abstract base class)
3. Concrete implementations (`CharacterTextSplitter`, `TokenTextSplitter`, etc.)

### Factory Pattern

Many splitters provide factory methods for convenient initialization:

- `from_language()` for language-specific splitting
- `from_huggingface_tokenizer()` for HuggingFace integration
- `from_tiktoken_encoder()` for OpenAI tokenizer integration
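The pattern itself is ordinary Python: a classmethod adapts an external object into base-class constructor parameters. A minimal stdlib sketch with a hypothetical `from_token_counter` factory (not a real library method):

```python
from typing import Any, Callable

class Splitter:
    def __init__(self, length_function: Callable[[str], int] = len) -> None:
        self.length_function = length_function

    @classmethod
    def from_token_counter(
        cls, count_tokens: Callable[[str], int], **kwargs: Any
    ) -> "Splitter":
        # Factory: adapt an external token counter into the
        # length_function parameter expected by the constructor
        return cls(length_function=count_tokens, **kwargs)

# A word counter stands in for a real tokenizer's token count
s = Splitter.from_token_counter(lambda text: len(text.split()))
print(s.length_function("three word chunk"))  # 3
```

The library's real factories do the same adaptation: `from_tiktoken_encoder()`, for instance, wraps the encoder's `encode` call in a counting function before passing it to `__init__`.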

### Configuration Flexibility

All splitters accept common configuration parameters through the base class while allowing specific customization through their own parameters.

## Best Practices

1. **Extend TextSplitter**: When creating custom splitters, extend `TextSplitter` and implement `split_text()`
2. **Use factory methods**: Leverage factory methods for common initialization patterns
3. **Validate parameters**: The base class provides validation; add custom validation in subclasses
4. **Preserve metadata**: Use `create_documents()` and `split_documents()` to maintain document metadata
5. **Handle edge cases**: Consider empty strings, very short texts, and texts smaller than `chunk_size`
6. **Choose appropriate length functions**: For token-based splitting, use token counting functions
7. **Test with real data**: Validate your splitter configuration with representative data
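As a quick illustration of practice 5, even a naive period-based `split_text` should be checked against empty input and input shorter than any chunk size:

```python
def naive_split(text: str) -> list[str]:
    # Same splitting rule as the SimpleTextSplitter example above
    return [s.strip() + "." for s in text.split(".") if s.strip()]

# Edge cases: empty string yields no chunks; a short text survives intact
print(naive_split(""))        # []
print(naive_split("Short."))  # ['Short.']
```

A splitter that crashes or emits empty chunks on these inputs will surface the failure deep inside a document pipeline, where it is much harder to diagnose.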