# semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. semchunk uses an efficient algorithm that prioritizes semantic boundaries over simple character or token-based splitting, making it ideal for RAG applications, document processing pipelines, and any system requiring intelligent text segmentation.

## Package Information

- **Package Name**: semchunk
- **Language**: Python
- **Installation**: `pip install semchunk`
- **Alternative**: `conda install -c conda-forge semchunk`

## Core Imports

```python
import semchunk
```

Common usage patterns:

```python
from semchunk import chunk, Chunker, chunkerify
```
## Basic Usage

```python
import semchunk
import tiktoken

# Basic chunking with OpenAI tokenizer
# Note: Consider deducting special tokens from chunk_size if your tokenizer adds them
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
text = "The quick brown fox jumps over the lazy dog. This is a test sentence."
chunks = chunker(text)

# Chunking with offsets
chunks, offsets = chunker(text, offsets=True)

# Chunking with overlap
overlapped_chunks = chunker(text, overlap=0.1)  # 10% overlap

# Using the chunk function directly
encoding = tiktoken.encoding_for_model('gpt-4')
def count_tokens(text):
    return len(encoding.encode(text))

chunks = semchunk.chunk(
    text=text,
    chunk_size=512,
    token_counter=count_tokens
)
```
## Architecture

semchunk uses a hierarchical splitting strategy that preserves semantic boundaries through a five-step algorithm:

### Algorithm Steps

1. **Split text using the most semantically meaningful splitter possible**
2. **Recursively split resulting chunks until all are ≤ the specified chunk size**
3. **Merge under-sized chunks back together until the chunk size is reached**
4. **Reattach non-whitespace splitters to chunk ends (if within size limits)**
5. **Exclude chunks consisting entirely of whitespace characters** (since v3.0.0)
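The split-recurse-merge flow above can be sketched with a toy, whitespace-only splitter hierarchy. This is an illustrative simplification, not semchunk's actual implementation: `toy_chunk` and its three-level hierarchy are invented here, and step 4 (splitter reattachment) is omitted for brevity.

```python
def toy_chunk(text: str, chunk_size: int, count_tokens) -> list[str]:
    """Toy sketch of steps 1-3 and 5: split at the most meaningful
    boundary, recurse on oversized pieces, then merge small neighbours."""
    if count_tokens(text) <= chunk_size:
        return [text]

    # Step 1: split on the most semantic boundary present (paragraphs,
    # then lines, then words in this simplified hierarchy).
    for splitter in ('\n\n', '\n', ' '):
        if splitter in text:
            pieces = text.split(splitter)
            break
    else:
        pieces = list(text)  # character-level fallback

    # Step 2: recursively split any piece that is still too large.
    split_pieces = []
    for piece in pieces:
        split_pieces.extend(toy_chunk(piece, chunk_size, count_tokens))

    # Step 3: merge consecutive under-sized pieces back together.
    chunks, current = [], ''
    for piece in split_pieces:
        candidate = (current + ' ' + piece).strip() if current else piece
        if count_tokens(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)

    # Step 5: exclude whitespace-only chunks.
    return [c for c in chunks if c.strip()]
```

With a whitespace token counter such as `lambda t: len(t.split())` and `chunk_size=3`, a six-word sentence splits into two three-word chunks.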
### Semantic Splitter Hierarchy

semchunk uses the following splitters in order of semantic preference:

1. **Paragraph breaks**: Largest sequence of newlines (`\n`) and/or carriage returns (`\r`)
2. **Section breaks**: Largest sequence of tabs (`\t`)
3. **Whitespace boundaries**: Largest sequence of whitespace characters, with smart targeting of whitespace preceded by meaningful punctuation (since v3.2.0)
4. **Sentence terminators**: `.`, `?`, `!`, `*`
5. **Clause separators**: `;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"`, `` ` ``
6. **Sentence interrupters**: `:`, `—`, `…`
7. **Word joiners**: `/`, `\`, `–`, `&`, `-`
8. **Character-level**: All other characters (fallback)
### Key Features

- **Token-Aware Chunking**: Respects token limits while maintaining semantic coherence
- **Recursive Processing**: Handles oversized segments by recursively applying the same semantic rules
- **Offset Tracking**: Optional character-level tracking for precise text reconstruction
- **Overlap Support**: Configurable chunk overlap for better context preservation
- **Performance Optimization**: 85% faster than alternatives like semantic-text-splitter through efficient caching and text length heuristics
## Capabilities

### Core Chunking Function

Direct text chunking with full control over parameters and caching options.

```python { .api }
def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable[[str], int],
    memoize: bool = True,
    offsets: bool = False,
    overlap: float | int | None = None,
    cache_maxsize: int | None = None,
) -> list[str] | tuple[list[str], list[tuple[int, int]]]:
    """
    Split a text into semantically meaningful chunks of a specified size.

    Parameters:
    - text: The text to be chunked
    - chunk_size: The maximum number of tokens a chunk may contain
    - token_counter: A callable that takes a string and returns the number of tokens in it
    - memoize: Whether to memoize the token counter for performance. Defaults to True
    - offsets: Whether to return the start and end offsets of each chunk. Defaults to False
    - overlap: The proportion of the chunk size (if <1) or number of tokens (if >=1)
      by which chunks should overlap. Defaults to None
    - cache_maxsize: The maximum number of text-token count pairs that can be stored
      in the token counter's cache. Defaults to None (unbounded)

    Returns:
    - If offsets=False: list[str] - List of chunks up to chunk_size tokens long
    - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and their
      (start, end) character offsets in the original text
    """
```
### Chunker Factory Function

Create configured chunkers from tokenizers or token counters with automatic optimization.

```python { .api }
def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int | None = None,
    max_token_chars: int | None = None,
    memoize: bool = True,
    cache_maxsize: int | None = None,
) -> Chunker:
    """
    Construct a chunker that splits texts into semantically meaningful chunks.

    Parameters:
    - tokenizer_or_token_counter: Either:
      * Name of a tiktoken or transformers tokenizer (e.g., 'gpt-4', 'cl100k_base')
      * A tokenizer object with an encode() method (tiktoken, transformers, tokenizers)
      * A token counter function that returns the number of tokens in input text
    - chunk_size: Maximum number of tokens per chunk. Defaults to the tokenizer's
      model_max_length if available, otherwise raises ValueError
    - max_token_chars: Maximum number of characters a token may contain. Used to
      significantly speed up token counting for long inputs by using heuristics
      to avoid tokenizing texts that would exceed chunk_size. Auto-detected from
      the tokenizer vocabulary if possible
    - memoize: Whether to memoize the token counter. Defaults to True
    - cache_maxsize: Maximum number of text-token count pairs in cache. Defaults to None

    Returns:
    - Chunker: A configured chunker instance that can process single texts or sequences

    Raises:
    - ValueError: If tokenizer_or_token_counter is a string that doesn't match any
      known tokenizer, or if chunk_size is None and the tokenizer lacks a
      model_max_length attribute, or if required libraries are not installed
    """
```
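The normalisation of these input forms into a single token counter might look roughly like the sketch below. This is an assumption-laden illustration, not semchunk's code: `to_token_counter` is invented here, and it assumes tokenizer objects expose an `encode()` method returning a sequence of tokens (as tiktoken, transformers, and tokenizers objects do).

```python
from typing import Callable

def to_token_counter(tokenizer_or_token_counter) -> Callable[[str], int]:
    """Sketch: normalise a tokenizer object or counter into a token counter."""
    if callable(tokenizer_or_token_counter) and not hasattr(tokenizer_or_token_counter, 'encode'):
        # Already a plain token counter function.
        return tokenizer_or_token_counter
    if hasattr(tokenizer_or_token_counter, 'encode'):
        # tiktoken / transformers / tokenizers style object.
        return lambda text: len(tokenizer_or_token_counter.encode(text))
    raise ValueError('Unsupported tokenizer or token counter.')
```

Dispatching on capability (`encode()` vs. plain callable) keeps the rest of the pipeline working with one uniform `Callable[[str], int]` interface.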
### Chunker Class

High-performance chunker for processing single texts or sequences with multiprocessing support.

```python { .api }
class Chunker:
    def __init__(self, chunk_size: int, token_counter: Callable[[str], int]) -> None:
        """
        Initialize a chunker with the specified chunk size and token counter.

        Parameters:
        - chunk_size: Maximum number of tokens per chunk
        - token_counter: Function that takes a string and returns its token count
        """

    def __call__(
        self,
        text_or_texts: str | Sequence[str],
        processes: int = 1,
        progress: bool = False,
        offsets: bool = False,
        overlap: int | float | None = None,
    ) -> list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]:
        """
        Split a text or texts into semantically meaningful chunks.

        Parameters:
        - text_or_texts: Single text string or sequence of text strings to chunk
        - processes: Number of processes for multiprocessing when processing multiple texts.
          Defaults to 1 (single process)
        - progress: Whether to display a progress bar when processing multiple texts.
          Defaults to False
        - offsets: Whether to return start and end character offsets for each chunk.
          Defaults to False
        - overlap: Proportion of the chunk size (if <1) or number of tokens (if >=1)
          by which chunks should overlap. Defaults to None

        Returns:
        For single text input:
        - If offsets=False: list[str] - List of chunks
        - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and offsets

        For multiple text input:
        - If offsets=False: list[list[str]] - List of chunk lists, one per input text
        - If offsets=True: tuple[list[list[str]], list[list[tuple[int, int]]]] -
          Chunk lists and offset lists for each input text
        """
```
## Performance Optimization

semchunk includes several performance optimizations to handle large texts efficiently:

### Token Counter Memoization

Enabled by default (`memoize=True`), this caches token counts for repeated text segments, significantly speeding up processing of documents with repeated content.
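Roughly equivalent behaviour can be reproduced manually with `functools.lru_cache`; this is an illustration of the caching effect, not semchunk's internals (semchunk applies the wrapping for you when `memoize=True`). The `calls` counter exists only to make the cache hit observable.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)  # None mirrors the unbounded default cache_maxsize
def word_count(text: str) -> int:
    global calls
    calls += 1
    return len(text.split())

word_count("repeated segment")
word_count("repeated segment")  # served from the cache; no recount
```

After both calls, the underlying counter has run only once, which is exactly the saving memoization delivers on documents with repeated segments.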
### Max Token Characters Heuristic

The `max_token_chars` parameter enables a smart optimization that avoids tokenizing very long texts when they would obviously exceed the chunk size. The algorithm:

1. Uses a heuristic based on `chunk_size * 6` to identify potentially long texts
2. For texts longer than this heuristic, tokenizes only a prefix of length `heuristic + max_token_chars`
3. If this prefix already exceeds `chunk_size`, returns `chunk_size + 1` without full tokenization
4. This can provide significant speedups (up to 85% faster) for very long documents
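The steps above can be sketched as follows; `heuristic_token_count` is a simplified reconstruction from the described behaviour, not semchunk's exact code.

```python
def heuristic_token_count(text, chunk_size, max_token_chars, count_tokens):
    """Avoid tokenizing a very long text in full when a prefix already
    proves it exceeds chunk_size."""
    heuristic = chunk_size * 6  # rough character budget for chunk_size tokens
    if max_token_chars is not None and len(text) > heuristic:
        prefix = text[: heuristic + max_token_chars]
        if count_tokens(prefix) > chunk_size:
            # The prefix alone is already too big, so the exact total
            # doesn't matter: any value above chunk_size triggers a split.
            return chunk_size + 1
    return count_tokens(text)
```

Returning `chunk_size + 1` is sufficient because the chunking algorithm only needs to know a segment is over the limit, not by how much.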
### Multiprocessing Support

The `Chunker` class supports parallel processing of multiple texts via the `processes` parameter, using the `mpire` library with dill serialization for robust multiprocessing.

### Special Token Handling

When using tokenizers that add special tokens (such as BOS/EOS tokens), semchunk automatically:

1. Detects whether the tokenizer supports the `add_special_tokens` parameter and disables it during chunking
2. Reduces the effective `chunk_size` by the number of special tokens when auto-detecting the chunk size from `model_max_length`

When specifying `chunk_size` manually, consider deducting the number of special tokens your tokenizer adds so that chunks don't exceed your intended limits.
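The deduction in step 2 amounts to simple arithmetic, sketched below. `DummyTokenizer` and `effective_chunk_size` are invented stand-ins: a real tokenizer would report its own `model_max_length` and special-token count.

```python
class DummyTokenizer:
    """Stand-in for a tokenizer that adds BOS and EOS special tokens."""
    model_max_length = 512
    num_special_tokens = 2  # e.g. one BOS and one EOS token

def effective_chunk_size(tokenizer) -> int:
    # Deduct special tokens so chunk tokens plus special tokens
    # still fit within the model's context limit.
    return tokenizer.model_max_length - tokenizer.num_special_tokens
```

Here the effective chunk size becomes 510, leaving room for the two special tokens the tokenizer will add around each chunk.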
## Types

```python { .api }
# Core imports
from typing import Callable, Sequence

# Type annotations used in the API
TokenCounter = Callable[[str], int]

# Offset tuple type (start, end character positions)
OffsetTuple = tuple[int, int]

# When TYPE_CHECKING is True, these imports are available for type hints:
# import tiktoken
# import tokenizers
# import transformers

# The tokenizer_or_token_counter parameter accepts any of:
# - str: Model name or encoding name (e.g., 'gpt-4', 'cl100k_base')
# - tiktoken.Encoding: tiktoken encoder object
# - transformers.PreTrainedTokenizer: Hugging Face tokenizer
# - tokenizers.Tokenizer: Fast tokenizer from the tokenizers library
# - Callable[[str], int]: Custom token counter function
```
## Usage Examples

### Working with Different Tokenizers

```python
import semchunk

# OpenAI tiktoken models
chunker_gpt4 = semchunk.chunkerify('gpt-4', chunk_size=1000)
chunker_gpt35 = semchunk.chunkerify('gpt-3.5-turbo', chunk_size=1000)

# tiktoken encodings
chunker_cl100k = semchunk.chunkerify('cl100k_base', chunk_size=1000)

# Hugging Face transformers
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
chunker_bert = semchunk.chunkerify(hf_tokenizer, chunk_size=512)

# Custom token counter
def simple_word_counter(text: str) -> int:
    return len(text.split())

chunker_words = semchunk.chunkerify(simple_word_counter, chunk_size=100)
```
### Processing Multiple Texts

```python
import semchunk

# Prepare chunker and texts
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
documents = [
    "First document text...",
    "Second document text...",
    "Third document text..."
]

# Process with multiprocessing
chunks_per_doc = chunker(documents, processes=4, progress=True)

# Process with offsets
chunks_per_doc, offsets_per_doc = chunker(
    documents,
    processes=4,
    progress=True,
    offsets=True
)

# With overlap for better context preservation
overlapped_chunks = chunker(
    documents,
    overlap=0.2,  # 20% overlap
    processes=4
)
```
### Advanced Configuration

```python
import semchunk
from functools import lru_cache

# Custom token counter with caching
@lru_cache(maxsize=1000)
def cached_word_counter(text: str) -> int:
    return len(text.split())

# Direct chunk function usage with custom settings
text = "Long document text..."
chunks = semchunk.chunk(
    text=text,
    chunk_size=200,
    token_counter=cached_word_counter,
    memoize=False,  # already cached manually
    offsets=True,
    overlap=50,  # 50-token overlap
    cache_maxsize=500
)

# Chunker with performance optimization
chunker = semchunk.chunkerify(
    'gpt-4',
    chunk_size=1000,
    max_token_chars=10,  # optimize for typical token lengths
    cache_maxsize=2000  # large cache for repeated texts
)
```