# Core Base Classes and Utilities

The core base classes and utilities provide the fundamental interfaces, enums, and utility functions that underpin all text splitting functionality in langchain-text-splitters. These components define the common patterns and contracts used throughout the library.

## Capabilities

### TextSplitter Abstract Base Class

The core abstract interface that all text splitters implement, providing shared functionality and defining the splitting contract.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: Union[bool, Literal["start", "end"]] = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True,
    ) -> None: ...

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None,
    ) -> list[Document]: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any,
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any,
    ) -> Self: ...
```

**Constructor Parameters:**

- `chunk_size`: Maximum size of chunks to return (default: `4000`)
- `chunk_overlap`: Overlap between chunks, measured by `length_function` (default: `200`)
- `length_function`: Function used to measure the length of a chunk (default: `len`)
- `keep_separator`: Whether to keep the separator in the chunks and, if so, whether to place it at the start or end (default: `False`)
- `add_start_index`: If `True`, includes each chunk's start index in its metadata (default: `False`)
- `strip_whitespace`: If `True`, strips whitespace from the start and end of every document (default: `True`)

**Abstract Methods:**

- `split_text()`: Must be implemented by all concrete splitters

**Concrete Methods:**

- `create_documents()`: Create `Document` objects from a list of texts with optional metadata
- `split_documents()`: Split existing `Document` objects into smaller chunks

**Factory Methods:**

- `from_huggingface_tokenizer()`: Create a splitter that measures length with a HuggingFace tokenizer
- `from_tiktoken_encoder()`: Create a splitter that measures length with a tiktoken encoder

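Conceptually, both tokenizer factories do the same thing: derive a `length_function` from the tokenizer's encoder so that `chunk_size` and `chunk_overlap` are measured in tokens rather than characters. A minimal sketch of that idea (the helper name and toy encoder are hypothetical, not library API):

```python
from typing import Callable

def make_token_length_function(encode: Callable[[str], list]) -> Callable[[str], int]:
    """Wrap an encoder so chunk sizes are measured in tokens, not characters."""
    def token_length(text: str) -> int:
        return len(encode(text))
    return token_length

# Toy encoder for illustration: one "token" per whitespace-separated word.
length_fn = make_token_length_function(lambda text: text.split())
print(length_fn("three word chunk"))  # 3
```

A real tokenizer's `encode` returns integer token IDs, but only the count matters here.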
**Usage:**

```python
from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

# Example concrete implementation (normally you'd use CharacterTextSplitter)
class SimpleTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Simple implementation that splits on periods
        sentences = text.split('.')
        return [s.strip() + '.' for s in sentences if s.strip()]

# Using the splitter
splitter = SimpleTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    strip_whitespace=True,
)

# Split text
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split_text(text)

# Create documents with metadata
documents = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "example.txt", "author": "unknown"}],
)

# Split existing documents
existing_docs = [Document(page_content=text, metadata={"page": 1})]
split_docs = splitter.split_documents(existing_docs)
```

### Language Enumeration

Enumeration of the programming and markup languages supported for language-specific text splitting.

```python { .api }
class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"
```

**Usage:**

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use with language-specific splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
)

# Get separators for a language
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

# Compare language values
if some_language == Language.PYTHON.value:  # "python"
    print("This is Python code")
```

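Under the hood, language-specific splitting works by choosing an ordered list of separators per language, tried from most to least structural. A sketch of the idea, with illustrative separator lists (not the library's exact ones):

```python
# Illustrative per-language separator lists, ordered from coarse to fine.
SEPARATORS = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
    "markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
}

def first_effective_separator(text: str, language: str) -> str:
    """Return the first separator in the language's list that occurs in the text."""
    for sep in SEPARATORS[language]:
        if sep == "" or sep in text:
            return sep
    return ""

code = "def a():\n    pass\n\ndef b():\n    pass\n"
print(repr(first_effective_separator(code, "python")))  # '\ndef '
```

A recursive splitter then splits on that separator and recurses on oversized pieces with the finer separators that follow it.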
### Tokenizer Configuration

Configuration dataclass for token-based text splitting operations.

```python { .api }
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]
```

**Fields:**

- `chunk_overlap`: Number of tokens to overlap between chunks
- `tokens_per_chunk`: Maximum number of tokens per chunk
- `decode`: Function to decode token IDs back to text
- `encode`: Function to encode text into token IDs

**Usage:**

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
tokenizer_config = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=512,
    decode=encoding.decode,
    encode=encoding.encode,
)

# Use with splitting function
text = "Long text to be tokenized and split..."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```

### Token-Based Splitting Utility

Utility function for splitting text using a tokenizer configuration.

```python { .api }
def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...
```

**Parameters:**

- `text`: The text to split
- `tokenizer`: Tokenizer configuration object

**Returns:**

- List of text chunks split on token boundaries

**Usage:**

```python
from langchain_text_splitters import split_text_on_tokens, Tokenizer
from transformers import AutoTokenizer

# Using a HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config = Tokenizer(
    chunk_overlap=25,
    tokens_per_chunk=256,
    decode=lambda tokens: hf_tokenizer.decode(tokens, skip_special_tokens=True),
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False),
)

text = "This is a sample text that will be tokenized and split into chunks."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```

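The splitting itself is a sliding window over the token IDs: each chunk holds up to `tokens_per_chunk` tokens, and the window advances by `tokens_per_chunk - chunk_overlap` so consecutive chunks share `chunk_overlap` tokens. A self-contained sketch of that algorithm, using a toy whitespace "tokenizer" in place of real integer token IDs:

```python
# Sketch of the sliding-window logic behind split_text_on_tokens.
def split_on_tokens(text, encode, decode, tokens_per_chunk, chunk_overlap):
    ids = encode(text)
    stride = tokens_per_chunk - chunk_overlap  # how far the window advances
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(decode(ids[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(ids):
            break  # last window reached the end of the token sequence
        start += stride
    return chunks

chunks = split_on_tokens(
    "one two three four five six",
    encode=str.split,      # toy: one "token" per word
    decode=" ".join,       # toy: rejoin tokens with spaces
    tokens_per_chunk=3,
    chunk_overlap=1,
)
print(chunks)  # ['one two three', 'three four five', 'five six']
```

Note the overlap: "three" and "five" each appear in two adjacent chunks.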
### Type Definitions

The text splitters package provides several `TypedDict` definitions for structured data used across various splitters.

```python { .api }
class ElementType(TypedDict):
    """Element type as typed dict for HTML elements."""

    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    """Header type as typed dict for markdown headers."""

    level: int
    name: str
    data: str

class LineType(TypedDict):
    """Line type as typed dict for text lines with metadata."""

    metadata: dict[str, str]
    content: str
```

These types are used by:

- `ElementType`: HTML-based splitters, for structured element data
- `HeaderType`: Markdown splitters, for header information
- `LineType`: Markdown splitters, for line-based processing

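As a quick illustration with hypothetical values, a markdown header splitter might populate these structures like so (the field names match the definitions above; the values are invented):

```python
from typing import TypedDict

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str

# A "## Getting Started" heading and a line of body text beneath it.
header: HeaderType = {"level": 2, "name": "Header 2", "data": "Getting Started"}
line: LineType = {"metadata": {"Header 2": "Getting Started"}, "content": "Some text."}
print(header["level"], line["content"])  # 2 Some text.
```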
### Base Document Transformer Integration

All text splitters inherit from LangChain's `BaseDocumentTransformer`, providing consistent integration with the LangChain ecosystem.

```python { .api }
# Inherited from langchain_core
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any,
    ) -> Sequence[Document]: ...

    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any,
    ) -> Sequence[Document]: ...
```

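This integration works because the splitter can satisfy `transform_documents()` by delegating to its own `split_documents()`. A minimal stand-in sketch of that wiring (assumed behavior, with toy classes rather than the real ones):

```python
# Toy sketch: a "splitter" satisfying the transformer interface by delegation.
class MiniSplitter:
    def split_documents(self, documents):
        # Stand-in for real chunking: split each "document" string on ". "
        return [chunk for doc in documents for chunk in doc.split(". ") if chunk]

    def transform_documents(self, documents, **kwargs):
        # The transformer entry point just forwards to the splitting logic.
        return self.split_documents(documents)

print(MiniSplitter().transform_documents(["A. B", "C"]))  # ['A', 'B', 'C']
```

Any pipeline that accepts a document transformer can therefore accept a text splitter unchanged.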
## Error Handling and Validation

The base `TextSplitter` class validates its configuration at construction time. Since `TextSplitter` itself is abstract, the examples below use the concrete `CharacterTextSplitter`:

```python
from langchain_text_splitters import CharacterTextSplitter

# These will raise ValueError
CharacterTextSplitter(chunk_size=0)                       # chunk_size must be > 0
CharacterTextSplitter(chunk_overlap=-1)                   # chunk_overlap must be >= 0
CharacterTextSplitter(chunk_size=100, chunk_overlap=200)  # overlap > chunk_size
```

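The checks amount to simple invariants on the two size parameters; a sketch of the kind of validation involved (error messages here are assumptions, not the library's exact wording):

```python
def validate_chunk_config(chunk_size: int, chunk_overlap: int) -> None:
    """Raise ValueError for configurations the base class would reject."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must not exceed "
            f"chunk_size ({chunk_size})"
        )

validate_chunk_config(100, 20)  # OK: overlap fits inside the chunk
```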
## Design Principles

### Inheritance Hierarchy

The library follows a clear inheritance pattern:

1. `BaseDocumentTransformer` (from LangChain Core)
2. `TextSplitter` (abstract base class)
3. Concrete implementations (`CharacterTextSplitter`, `TokenTextSplitter`, etc.)

### Factory Pattern

Many splitters provide factory methods for convenient initialization:

- `from_language()` for language-specific splitting
- `from_huggingface_tokenizer()` for HuggingFace integration
- `from_tiktoken_encoder()` for OpenAI tokenizer integration

### Configuration Flexibility

All splitters accept common configuration parameters through the base class while allowing splitter-specific customization through their own parameters.

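The usual mechanism for this flexibility is plain `**kwargs` forwarding: a subclass keeps its own options and passes everything else up to the base constructor. A toy sketch of the pattern (class names here are stand-ins, not the library's):

```python
# Toy base class holding the options shared by every splitter.
class Base:
    def __init__(self, chunk_size=4000, chunk_overlap=200, **kwargs):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

# A subclass adds its own option and forwards the rest to the base.
class SeparatorSplitter(Base):
    def __init__(self, separator="\n\n", **kwargs):
        super().__init__(**kwargs)
        self.separator = separator

s = SeparatorSplitter(separator="\n", chunk_size=500)
print(s.chunk_size, s.chunk_overlap, repr(s.separator))  # 500 200 '\n'
```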
## Best Practices

1. **Extend TextSplitter**: When creating custom splitters, extend `TextSplitter` and implement `split_text()`
2. **Use factory methods**: Leverage factory methods for common initialization patterns
3. **Validate parameters**: The base class provides validation; add custom validation in subclasses
4. **Preserve metadata**: Use `create_documents()` and `split_documents()` to maintain document metadata
5. **Handle edge cases**: Consider empty strings, very short texts, and texts smaller than `chunk_size`
6. **Choose an appropriate length function**: For token-based splitting, use a token-counting function
7. **Test with real data**: Validate your splitter configuration with representative data