# Core Base Classes and Utilities

The core base classes and utilities provide the fundamental interfaces, enums, and utility functions that underpin all text splitting functionality in langchain-text-splitters. These components define the common patterns and contracts used throughout the library.

## Capabilities

### TextSplitter Abstract Base Class

The core abstract interface that all text splitters implement, providing shared functionality and defining the splitting contract.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: Union[bool, Literal["start", "end"]] = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True,
    ) -> None: ...

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None,
    ) -> list[Document]: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any,
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any,
    ) -> Self: ...
```

**Constructor Parameters:**

- `chunk_size`: Maximum size of chunks to return (default: `4000`)
- `chunk_overlap`: Overlap between chunks, measured by `length_function` (default: `200`)
- `length_function`: Function used to measure the length of a chunk (default: `len`)
- `keep_separator`: Whether to keep the separator in the chunks and, if so, whether to place it at the start or end (default: `False`)
- `add_start_index`: If `True`, includes each chunk's start index in its metadata (default: `False`)
- `strip_whitespace`: If `True`, strips whitespace from the start and end of every document (default: `True`)

**Abstract Methods:**

- `split_text()`: Must be implemented by all concrete splitters

**Concrete Methods:**

- `create_documents()`: Create `Document` objects from a list of texts with optional metadata
- `split_documents()`: Split existing `Document` objects into smaller chunks

**Factory Methods:**

- `from_huggingface_tokenizer()`: Create a splitter that measures length with a HuggingFace tokenizer
- `from_tiktoken_encoder()`: Create a splitter that measures length with a tiktoken encoder

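Conceptually, both tokenizer factories do the same thing: derive a `length_function` from the tokenizer's encoder so that `chunk_size` and `chunk_overlap` are measured in tokens rather than characters. A minimal sketch of that idea (the helper name and toy encoder are hypothetical, not library API):

```python
from typing import Callable

def make_token_length_function(encode: Callable[[str], list]) -> Callable[[str], int]:
    """Wrap an encoder so chunk sizes are measured in tokens, not characters."""
    def token_length(text: str) -> int:
        return len(encode(text))
    return token_length

# Toy encoder for illustration: one "token" per whitespace-separated word.
length_fn = make_token_length_function(lambda text: text.split())
print(length_fn("three word chunk"))  # 3
```

A real tokenizer's `encode` returns integer token IDs, but only the count matters here.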
**Usage:**

```python
from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

# Example concrete implementation (normally you'd use CharacterTextSplitter)
class SimpleTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Simple implementation that splits on periods
        sentences = text.split('.')
        return [s.strip() + '.' for s in sentences if s.strip()]

# Using the splitter
splitter = SimpleTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    strip_whitespace=True,
)

# Split text
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split_text(text)

# Create documents with metadata
documents = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "example.txt", "author": "unknown"}],
)

# Split existing documents
existing_docs = [Document(page_content=text, metadata={"page": 1})]
split_docs = splitter.split_documents(existing_docs)
```

### Language Enumeration

Enumeration of the programming and markup languages supported for language-specific text splitting.

```python { .api }
class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"
```

**Usage:**

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use with language-specific splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
)

# Get separators for a language
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

# Compare language values
if some_language == Language.PYTHON.value:  # "python"
    print("This is Python code")
```

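Under the hood, language-specific splitting works by choosing an ordered list of separators per language, tried from most to least structural. A sketch of the idea, with illustrative separator lists (not the library's exact ones):

```python
# Illustrative per-language separator lists, ordered from coarse to fine.
SEPARATORS = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
    "markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
}

def first_effective_separator(text: str, language: str) -> str:
    """Return the first separator in the language's list that occurs in the text."""
    for sep in SEPARATORS[language]:
        if sep == "" or sep in text:
            return sep
    return ""

code = "def a():\n    pass\n\ndef b():\n    pass\n"
print(repr(first_effective_separator(code, "python")))  # '\ndef '
```

A recursive splitter then splits on that separator and recurses on oversized pieces with the finer separators that follow it.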
### Tokenizer Configuration

Configuration dataclass for token-based text splitting operations.

```python { .api }
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]
```

**Fields:**

- `chunk_overlap`: Number of tokens to overlap between chunks
- `tokens_per_chunk`: Maximum number of tokens per chunk
- `decode`: Function to decode token IDs back to text
- `encode`: Function to encode text into token IDs

**Usage:**

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
tokenizer_config = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=512,
    decode=encoding.decode,
    encode=encoding.encode,
)

# Use with splitting function
text = "Long text to be tokenized and split..."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```

### Token-Based Splitting Utility

Utility function for splitting text using a tokenizer configuration.

```python { .api }
def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...
```

**Parameters:**

- `text`: The text to split
- `tokenizer`: Tokenizer configuration object

**Returns:**

- List of text chunks split on token boundaries

**Usage:**

```python
from langchain_text_splitters import split_text_on_tokens, Tokenizer
from transformers import AutoTokenizer

# Using a HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config = Tokenizer(
    chunk_overlap=25,
    tokens_per_chunk=256,
    decode=lambda tokens: hf_tokenizer.decode(tokens, skip_special_tokens=True),
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False),
)

text = "This is a sample text that will be tokenized and split into chunks."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
```

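The splitting itself is a sliding window over the token IDs: each chunk holds up to `tokens_per_chunk` tokens, and the window advances by `tokens_per_chunk - chunk_overlap` so consecutive chunks share `chunk_overlap` tokens. A self-contained sketch of that algorithm, using a toy whitespace "tokenizer" in place of real integer token IDs:

```python
# Sketch of the sliding-window logic behind split_text_on_tokens.
def split_on_tokens(text, encode, decode, tokens_per_chunk, chunk_overlap):
    ids = encode(text)
    stride = tokens_per_chunk - chunk_overlap  # how far the window advances
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(decode(ids[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(ids):
            break  # last window reached the end of the token sequence
        start += stride
    return chunks

chunks = split_on_tokens(
    "one two three four five six",
    encode=str.split,      # toy: one "token" per word
    decode=" ".join,       # toy: rejoin tokens with spaces
    tokens_per_chunk=3,
    chunk_overlap=1,
)
print(chunks)  # ['one two three', 'three four five', 'five six']
```

Note the overlap: "three" and "five" each appear in two adjacent chunks.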
### Type Definitions

The text splitters package provides several `TypedDict` definitions for structured data used across various splitters.

```python { .api }
class ElementType(TypedDict):
    """Element type as typed dict for HTML elements."""

    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    """Header type as typed dict for markdown headers."""

    level: int
    name: str
    data: str

class LineType(TypedDict):
    """Line type as typed dict for text lines with metadata."""

    metadata: dict[str, str]
    content: str
```

These types are used by:

- `ElementType`: HTML-based splitters, for structured element data
- `HeaderType`: Markdown splitters, for header information
- `LineType`: Markdown splitters, for line-based processing

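As a quick illustration with hypothetical values, a markdown header splitter might populate these structures like so (the field names match the definitions above; the values are invented):

```python
from typing import TypedDict

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str

# A "## Getting Started" heading and a line of body text beneath it.
header: HeaderType = {"level": 2, "name": "Header 2", "data": "Getting Started"}
line: LineType = {"metadata": {"Header 2": "Getting Started"}, "content": "Some text."}
print(header["level"], line["content"])  # 2 Some text.
```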
### Base Document Transformer Integration

All text splitters inherit from LangChain's `BaseDocumentTransformer`, providing consistent integration with the LangChain ecosystem.

```python { .api }
# Inherited from langchain_core
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any,
    ) -> Sequence[Document]: ...

    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any,
    ) -> Sequence[Document]: ...
```

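This integration works because the splitter can satisfy `transform_documents()` by delegating to its own `split_documents()`. A minimal stand-in sketch of that wiring (assumed behavior, with toy classes rather than the real ones):

```python
# Toy sketch: a "splitter" satisfying the transformer interface by delegation.
class MiniSplitter:
    def split_documents(self, documents):
        # Stand-in for real chunking: split each "document" string on ". "
        return [chunk for doc in documents for chunk in doc.split(". ") if chunk]

    def transform_documents(self, documents, **kwargs):
        # The transformer entry point just forwards to the splitting logic.
        return self.split_documents(documents)

print(MiniSplitter().transform_documents(["A. B", "C"]))  # ['A', 'B', 'C']
```

Any pipeline that accepts a document transformer can therefore accept a text splitter unchanged.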
## Error Handling and Validation

The base `TextSplitter` class validates its configuration at construction time. Since `TextSplitter` itself is abstract, the examples below use the concrete `CharacterTextSplitter`:

```python
from langchain_text_splitters import CharacterTextSplitter

# These will raise ValueError
CharacterTextSplitter(chunk_size=0)                       # chunk_size must be > 0
CharacterTextSplitter(chunk_overlap=-1)                   # chunk_overlap must be >= 0
CharacterTextSplitter(chunk_size=100, chunk_overlap=200)  # overlap > chunk_size
```

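The checks amount to simple invariants on the two size parameters; a sketch of the kind of validation involved (error messages here are assumptions, not the library's exact wording):

```python
def validate_chunk_config(chunk_size: int, chunk_overlap: int) -> None:
    """Raise ValueError for configurations the base class would reject."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must not exceed "
            f"chunk_size ({chunk_size})"
        )

validate_chunk_config(100, 20)  # OK: overlap fits inside the chunk
```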
## Design Principles

### Inheritance Hierarchy

The library follows a clear inheritance pattern:

1. `BaseDocumentTransformer` (from LangChain Core)
2. `TextSplitter` (abstract base class)
3. Concrete implementations (`CharacterTextSplitter`, `TokenTextSplitter`, etc.)

### Factory Pattern

Many splitters provide factory methods for convenient initialization:

- `from_language()` for language-specific splitting
- `from_huggingface_tokenizer()` for HuggingFace integration
- `from_tiktoken_encoder()` for OpenAI tokenizer integration

### Configuration Flexibility

All splitters accept common configuration parameters through the base class while allowing splitter-specific customization through their own parameters.

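The usual mechanism for this flexibility is plain `**kwargs` forwarding: a subclass keeps its own options and passes everything else up to the base constructor. A toy sketch of the pattern (class names here are stand-ins, not the library's):

```python
# Toy base class holding the options shared by every splitter.
class Base:
    def __init__(self, chunk_size=4000, chunk_overlap=200, **kwargs):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

# A subclass adds its own option and forwards the rest to the base.
class SeparatorSplitter(Base):
    def __init__(self, separator="\n\n", **kwargs):
        super().__init__(**kwargs)
        self.separator = separator

s = SeparatorSplitter(separator="\n", chunk_size=500)
print(s.chunk_size, s.chunk_overlap, repr(s.separator))  # 500 200 '\n'
```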
## Best Practices

1. **Extend TextSplitter**: When creating custom splitters, extend `TextSplitter` and implement `split_text()`
2. **Use factory methods**: Leverage factory methods for common initialization patterns
3. **Validate parameters**: The base class provides validation; add custom validation in subclasses
4. **Preserve metadata**: Use `create_documents()` and `split_documents()` to maintain document metadata
5. **Handle edge cases**: Consider empty strings, very short texts, and texts smaller than `chunk_size`
6. **Choose an appropriate length function**: For token-based splitting, use a token-counting function
7. **Test with real data**: Validate your splitter configuration with representative data