# semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. semchunk uses an efficient algorithm that prioritizes semantic boundaries over simple character or token-based splitting, making it ideal for RAG applications, document processing pipelines, and any system requiring intelligent text segmentation.

## Package Information

- **Package Name**: semchunk
- **Language**: Python
- **Installation**: `pip install semchunk`
- **Alternative**: `conda install -c conda-forge semchunk`

## Core Imports

```python
import semchunk
```

Common usage patterns:

```python
from semchunk import chunk, Chunker, chunkerify
```
## Basic Usage

```python
import semchunk
import tiktoken

# Basic chunking with OpenAI tokenizer
# Note: Consider deducting special tokens from chunk_size if your tokenizer adds them
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
text = "The quick brown fox jumps over the lazy dog. This is a test sentence."
chunks = chunker(text)

# Chunking with offsets
chunks, offsets = chunker(text, offsets=True)

# Chunking with overlap
overlapped_chunks = chunker(text, overlap=0.1)  # 10% overlap

# Using the chunk function directly
encoding = tiktoken.encoding_for_model('gpt-4')
def count_tokens(text):
    return len(encoding.encode(text))

chunks = semchunk.chunk(
    text=text,
    chunk_size=512,
    token_counter=count_tokens
)
```
## Architecture

semchunk uses a hierarchical splitting strategy that preserves semantic boundaries through a five-step algorithm:

### Algorithm Steps

1. **Split text using the most semantically meaningful splitter possible**
2. **Recursively split resulting chunks until all are ≤ the specified chunk size**
3. **Merge under-sized chunks back together until the chunk size is reached**
4. **Reattach non-whitespace splitters to chunk ends (if within size limits)**
5. **Exclude chunks consisting entirely of whitespace characters** (since v3.0.0)
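The split-recurse-merge flow above can be sketched with a toy, whitespace-only splitter hierarchy. This is an illustrative simplification, not semchunk's actual implementation: `toy_chunk` and its three-level hierarchy are invented here, and step 4 (splitter reattachment) is omitted for brevity.

```python
def toy_chunk(text: str, chunk_size: int, count_tokens) -> list[str]:
    """Toy sketch of steps 1-3 and 5: split at the most meaningful
    boundary, recurse on oversized pieces, then merge small neighbours."""
    if count_tokens(text) <= chunk_size:
        return [text]

    # Step 1: split on the most semantic boundary present (paragraphs,
    # then lines, then words in this simplified hierarchy).
    for splitter in ('\n\n', '\n', ' '):
        if splitter in text:
            pieces = text.split(splitter)
            break
    else:
        pieces = list(text)  # character-level fallback

    # Step 2: recursively split any piece that is still too large.
    split_pieces = []
    for piece in pieces:
        split_pieces.extend(toy_chunk(piece, chunk_size, count_tokens))

    # Step 3: merge consecutive under-sized pieces back together.
    chunks, current = [], ''
    for piece in split_pieces:
        candidate = (current + ' ' + piece).strip() if current else piece
        if count_tokens(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)

    # Step 5: exclude whitespace-only chunks.
    return [c for c in chunks if c.strip()]
```

With a whitespace token counter such as `lambda t: len(t.split())` and `chunk_size=3`, a six-word sentence splits into two three-word chunks.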
### Semantic Splitter Hierarchy

semchunk uses the following splitters in order of semantic preference:

1. **Paragraph breaks**: Largest sequence of newlines (`\n`) and/or carriage returns (`\r`)
2. **Section breaks**: Largest sequence of tabs (`\t`)
3. **Whitespace boundaries**: Largest sequence of whitespace characters, with smart targeting of whitespace preceded by meaningful punctuation (since v3.2.0)
4. **Sentence terminators**: `.`, `?`, `!`, `*`
5. **Clause separators**: `;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"`, `` ` ``
6. **Sentence interrupters**: `:`, `—`, `…`
7. **Word joiners**: `/`, `\`, `–`, `&`, `-`
8. **Character-level**: All other characters (fallback)
### Key Features

- **Token-Aware Chunking**: Respects token limits while maintaining semantic coherence
- **Recursive Processing**: Handles oversized segments by recursively applying the same semantic rules
- **Offset Tracking**: Optional character-level tracking for precise text reconstruction
- **Overlap Support**: Configurable chunk overlap for better context preservation
- **Performance Optimization**: 85% faster than alternatives like semantic-text-splitter through efficient caching and text length heuristics
## Capabilities

### Core Chunking Function

Direct text chunking with full control over parameters and caching options.

```python { .api }
def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable[[str], int],
    memoize: bool = True,
    offsets: bool = False,
    overlap: float | int | None = None,
    cache_maxsize: int | None = None,
) -> list[str] | tuple[list[str], list[tuple[int, int]]]:
    """
    Split a text into semantically meaningful chunks of a specified size.

    Parameters:
    - text: The text to be chunked
    - chunk_size: The maximum number of tokens a chunk may contain
    - token_counter: A callable that takes a string and returns the number of tokens in it
    - memoize: Whether to memoize the token counter for performance. Defaults to True
    - offsets: Whether to return the start and end offsets of each chunk. Defaults to False
    - overlap: The proportion of the chunk size (if <1) or number of tokens (if >=1)
      by which chunks should overlap. Defaults to None
    - cache_maxsize: The maximum number of text-token count pairs that can be stored
      in the token counter's cache. Defaults to None (unbounded)

    Returns:
    - If offsets=False: list[str] - List of chunks up to chunk_size tokens long
    - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and their
      (start, end) character offsets in the original text
    """
```
### Chunker Factory Function

Create configured chunkers from tokenizers or token counters with automatic optimization.

```python { .api }
def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int | None = None,
    max_token_chars: int | None = None,
    memoize: bool = True,
    cache_maxsize: int | None = None,
) -> Chunker:
    """
    Construct a chunker that splits texts into semantically meaningful chunks.

    Parameters:
    - tokenizer_or_token_counter: Either:
      * Name of a tiktoken or transformers tokenizer (e.g., 'gpt-4', 'cl100k_base')
      * A tokenizer object with an encode() method (tiktoken, transformers, tokenizers)
      * A token counter function that returns the number of tokens in input text
    - chunk_size: Maximum number of tokens per chunk. Defaults to the tokenizer's
      model_max_length if available, otherwise raises ValueError
    - max_token_chars: Maximum number of characters a token may contain. Used to
      significantly speed up token counting for long inputs by using heuristics
      to avoid tokenizing texts that would exceed chunk_size. Auto-detected from
      the tokenizer vocabulary if possible
    - memoize: Whether to memoize the token counter. Defaults to True
    - cache_maxsize: Maximum number of text-token count pairs in cache. Defaults to None

    Returns:
    - Chunker: A configured chunker instance that can process single texts or sequences

    Raises:
    - ValueError: If tokenizer_or_token_counter is a string that doesn't match any
      known tokenizer, or if chunk_size is None and the tokenizer lacks a
      model_max_length attribute, or if required libraries are not installed
    """
```
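The normalisation of these input forms into a single token counter might look roughly like the sketch below. This is an assumption-laden illustration, not semchunk's code: `to_token_counter` is invented here, and it assumes tokenizer objects expose an `encode()` method returning a sequence of tokens (as tiktoken, transformers, and tokenizers objects do).

```python
from typing import Callable

def to_token_counter(tokenizer_or_token_counter) -> Callable[[str], int]:
    """Sketch: normalise a tokenizer object or counter into a token counter."""
    if callable(tokenizer_or_token_counter) and not hasattr(tokenizer_or_token_counter, 'encode'):
        # Already a plain token counter function.
        return tokenizer_or_token_counter
    if hasattr(tokenizer_or_token_counter, 'encode'):
        # tiktoken / transformers / tokenizers style object.
        return lambda text: len(tokenizer_or_token_counter.encode(text))
    raise ValueError('Unsupported tokenizer or token counter.')
```

Dispatching on capability (`encode()` vs. plain callable) keeps the rest of the pipeline working with one uniform `Callable[[str], int]` interface.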
### Chunker Class

High-performance chunker for processing single texts or sequences with multiprocessing support.

```python { .api }
class Chunker:
    def __init__(self, chunk_size: int, token_counter: Callable[[str], int]) -> None:
        """
        Initialize a chunker with the specified chunk size and token counter.

        Parameters:
        - chunk_size: Maximum number of tokens per chunk
        - token_counter: Function that takes a string and returns its token count
        """

    def __call__(
        self,
        text_or_texts: str | Sequence[str],
        processes: int = 1,
        progress: bool = False,
        offsets: bool = False,
        overlap: int | float | None = None,
    ) -> list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]:
        """
        Split a text or texts into semantically meaningful chunks.

        Parameters:
        - text_or_texts: Single text string or sequence of text strings to chunk
        - processes: Number of processes for multiprocessing when processing multiple texts.
          Defaults to 1 (single process)
        - progress: Whether to display a progress bar when processing multiple texts.
          Defaults to False
        - offsets: Whether to return start and end character offsets for each chunk.
          Defaults to False
        - overlap: Proportion of the chunk size (if <1) or number of tokens (if >=1)
          by which chunks should overlap. Defaults to None

        Returns:
        For single text input:
        - If offsets=False: list[str] - List of chunks
        - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and offsets

        For multiple text input:
        - If offsets=False: list[list[str]] - List of chunk lists, one per input text
        - If offsets=True: tuple[list[list[str]], list[list[tuple[int, int]]]] -
          Chunk lists and offset lists for each input text
        """
```
## Performance Optimization

semchunk includes several performance optimizations to handle large texts efficiently:

### Token Counter Memoization

Enabled by default (`memoize=True`), this caches token counts for repeated text segments, significantly speeding up processing of documents with repeated content.
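Roughly equivalent behaviour can be reproduced manually with `functools.lru_cache`; this is an illustration of the caching effect, not semchunk's internals (semchunk applies the wrapping for you when `memoize=True`). The `calls` counter exists only to make the cache hit observable.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)  # None mirrors the unbounded default cache_maxsize
def word_count(text: str) -> int:
    global calls
    calls += 1
    return len(text.split())

word_count("repeated segment")
word_count("repeated segment")  # served from the cache; no recount
```

After both calls, the underlying counter has run only once, which is exactly the saving memoization delivers on documents with repeated segments.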
### Max Token Characters Heuristic

The `max_token_chars` parameter enables a smart optimization that avoids tokenizing very long texts when they would obviously exceed the chunk size. The algorithm:

1. Uses a heuristic based on `chunk_size * 6` to identify potentially long texts
2. For texts longer than this heuristic, tokenizes only a prefix of length `heuristic + max_token_chars`
3. If this prefix already exceeds `chunk_size`, returns `chunk_size + 1` without full tokenization
4. This can provide significant speedups (up to 85% faster) for very long documents
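The steps above can be sketched as follows; `heuristic_token_count` is a simplified reconstruction from the described behaviour, not semchunk's exact code.

```python
def heuristic_token_count(text, chunk_size, max_token_chars, count_tokens):
    """Avoid tokenizing a very long text in full when a prefix already
    proves it exceeds chunk_size."""
    heuristic = chunk_size * 6  # rough character budget for chunk_size tokens
    if max_token_chars is not None and len(text) > heuristic:
        prefix = text[: heuristic + max_token_chars]
        if count_tokens(prefix) > chunk_size:
            # The prefix alone is already too big, so the exact total
            # doesn't matter: any value above chunk_size triggers a split.
            return chunk_size + 1
    return count_tokens(text)
```

Returning `chunk_size + 1` is sufficient because the chunking algorithm only needs to know a segment is over the limit, not by how much.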
### Multiprocessing Support

The `Chunker` class supports parallel processing of multiple texts via the `processes` parameter, using the `mpire` library with dill serialization for robust multiprocessing.

### Special Token Handling

When using tokenizers that add special tokens (such as BOS/EOS tokens), semchunk automatically:

1. Detects whether the tokenizer supports the `add_special_tokens` parameter and disables it during chunking
2. Reduces the effective `chunk_size` by the number of special tokens when auto-detecting the chunk size from `model_max_length`

When specifying `chunk_size` manually, consider deducting the number of special tokens your tokenizer adds so that chunks don't exceed your intended limits.
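The deduction in step 2 amounts to simple arithmetic, sketched below. `DummyTokenizer` and `effective_chunk_size` are invented stand-ins: a real tokenizer would report its own `model_max_length` and special-token count.

```python
class DummyTokenizer:
    """Stand-in for a tokenizer that adds BOS and EOS special tokens."""
    model_max_length = 512
    num_special_tokens = 2  # e.g. one BOS and one EOS token

def effective_chunk_size(tokenizer) -> int:
    # Deduct special tokens so chunk tokens plus special tokens
    # still fit within the model's context limit.
    return tokenizer.model_max_length - tokenizer.num_special_tokens
```

Here the effective chunk size becomes 510, leaving room for the two special tokens the tokenizer will add around each chunk.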
## Types

```python { .api }
# Core imports
from typing import Callable, Sequence

# Type annotations used in the API
TokenCounter = Callable[[str], int]

# Offset tuple type (start, end character positions)
OffsetTuple = tuple[int, int]

# When TYPE_CHECKING is True, these imports are available for type hints:
# import tiktoken
# import tokenizers
# import transformers

# The tokenizer_or_token_counter parameter accepts any of:
# - str: Model name or encoding name (e.g., 'gpt-4', 'cl100k_base')
# - tiktoken.Encoding: tiktoken encoder object
# - transformers.PreTrainedTokenizer: Hugging Face tokenizer
# - tokenizers.Tokenizer: Fast tokenizer from the tokenizers library
# - Callable[[str], int]: Custom token counter function
```
## Usage Examples

### Working with Different Tokenizers

```python
import semchunk

# OpenAI tiktoken models
chunker_gpt4 = semchunk.chunkerify('gpt-4', chunk_size=1000)
chunker_gpt35 = semchunk.chunkerify('gpt-3.5-turbo', chunk_size=1000)

# tiktoken encodings
chunker_cl100k = semchunk.chunkerify('cl100k_base', chunk_size=1000)

# Hugging Face transformers
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
chunker_bert = semchunk.chunkerify(hf_tokenizer, chunk_size=512)

# Custom token counter
def simple_word_counter(text: str) -> int:
    return len(text.split())

chunker_words = semchunk.chunkerify(simple_word_counter, chunk_size=100)
```
### Processing Multiple Texts

```python
import semchunk

# Prepare chunker and texts
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
documents = [
    "First document text...",
    "Second document text...",
    "Third document text..."
]

# Process with multiprocessing
chunks_per_doc = chunker(documents, processes=4, progress=True)

# Process with offsets
chunks_per_doc, offsets_per_doc = chunker(
    documents,
    processes=4,
    progress=True,
    offsets=True
)

# With overlap for better context preservation
overlapped_chunks = chunker(
    documents,
    overlap=0.2,  # 20% overlap
    processes=4
)
```
### Advanced Configuration

```python
import semchunk
from functools import lru_cache

# Custom token counter with caching
@lru_cache(maxsize=1000)
def cached_word_counter(text: str) -> int:
    return len(text.split())

# Direct chunk function usage with custom settings
text = "Long document text..."
chunks = semchunk.chunk(
    text=text,
    chunk_size=200,
    token_counter=cached_word_counter,
    memoize=False,  # already cached manually
    offsets=True,
    overlap=50,  # 50-token overlap
    cache_maxsize=500
)

# Chunker with performance optimization
chunker = semchunk.chunkerify(
    'gpt-4',
    chunk_size=1000,
    max_token_chars=10,  # optimize for typical token lengths
    cache_maxsize=2000  # large cache for repeated texts
)
```