
tessl/pypi-semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

- Workspace: tessl
- Visibility: Public
- Describes: pypipkg:pypi/semchunk@3.2.x

To install, run `npx @tessl/cli install tessl/pypi-semchunk@3.2.0`

# semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. semchunk uses an efficient algorithm that prioritizes semantic boundaries over simple character- or token-based splitting, making it ideal for RAG applications, document processing pipelines, and any system requiring intelligent text segmentation.

## Package Information

- **Package Name**: semchunk
- **Language**: Python
- **Installation**: `pip install semchunk`
- **Alternative**: `conda install -c conda-forge semchunk`

## Core Imports

```python
import semchunk
```

Common usage patterns:

```python
from semchunk import chunk, Chunker, chunkerify
```

## Basic Usage

```python
import semchunk
import tiktoken

# Basic chunking with OpenAI tokenizer
# Note: consider deducting special tokens from chunk_size if your tokenizer adds them
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
text = "The quick brown fox jumps over the lazy dog. This is a test sentence."
chunks = chunker(text)

# Chunking with offsets
chunks, offsets = chunker(text, offsets=True)

# Chunking with overlap
overlapped_chunks = chunker(text, overlap=0.1)  # 10% overlap

# Using the chunk function directly
encoding = tiktoken.encoding_for_model('gpt-4')

def count_tokens(text):
    return len(encoding.encode(text))

chunks = semchunk.chunk(
    text=text,
    chunk_size=512,
    token_counter=count_tokens,
)
```

## Architecture

semchunk uses a hierarchical splitting strategy that preserves semantic boundaries through a five-step algorithm:

### Algorithm Steps

1. **Split text using the most semantically meaningful splitter possible**
2. **Recursively split the resulting chunks until all are ≤ the specified chunk size**
3. **Merge under-sized chunks back together until the chunk size is reached**
4. **Reattach non-whitespace splitters to chunk ends (if within size limits)**
5. **Exclude chunks consisting entirely of whitespace characters** (since v3.0.0)
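The split/recurse/merge idea behind steps 1–3 can be sketched in miniature. This is a simplified illustration, not semchunk's actual implementation: `word_count` stands in for a real token counter, and only three splitters are considered.

```python
def word_count(text: str) -> int:
    return len(text.split())

def simple_chunk(text: str, chunk_size: int, splitters=("\n\n", "\n", " ")) -> list[str]:
    """Toy version of the split/recurse/merge strategy."""
    if word_count(text) <= chunk_size:
        return [text] if text.strip() else []  # step 5: drop whitespace-only chunks
    # Step 1: split on the most semantic splitter present in the text.
    splitter = next((s for s in splitters if s in text), None)
    if splitter is None:
        return [text]  # fallback: this sketch cannot split further
    parts = text.split(splitter)
    # Step 2: recursively split oversized parts.
    pieces = [p for part in parts for p in simple_chunk(part, chunk_size, splitters)]
    # Step 3: merge under-sized neighbours back together up to chunk_size.
    merged: list[str] = []
    for piece in pieces:
        if merged and word_count(merged[-1] + " " + piece) <= chunk_size:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged

chunks = simple_chunk("one two three\n\nfour five\nsix seven eight nine", chunk_size=4)
```

The paragraph break is tried first, and only the oversized second half is split again on the single newline, so every chunk respects the size limit while breaking at the most meaningful available boundary.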

### Semantic Splitter Hierarchy

semchunk uses the following splitters in order of semantic preference:

1. **Paragraph breaks**: Largest sequence of newlines (`\n`) and/or carriage returns (`\r`)
2. **Section breaks**: Largest sequence of tabs (`\t`)
3. **Whitespace boundaries**: Largest sequence of whitespace characters, with smart targeting of whitespace preceded by meaningful punctuation (since v3.2.0)
4. **Sentence terminators**: `.`, `?`, `!`, `*`
5. **Clause separators**: `;`, `,`, `(`, `)`, `[`, `]`, `"`, `"`, `'`, `'`, `'`, `"`, `` ` ``
6. **Sentence interrupters**: `:`, `—`, `…`
7. **Word joiners**: `/`, `\`, `–`, `&`, `-`
8. **Character-level**: All other characters (fallback)
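A preference order like this can be applied by scanning the patterns in rank order and taking the first match. The sketch below is illustrative only and covers a subset of the hierarchy; semchunk's real splitter detection also considers run lengths and punctuation context.

```python
import re

# Simplified preference order, most to least semantic (illustrative subset).
SPLITTER_PATTERNS = [
    r"[\r\n]+",     # 1. paragraph breaks: runs of newlines / carriage returns
    r"\t+",         # 2. section breaks: runs of tabs
    r"\s+",         # 3. other whitespace runs
    r"[.?!*]",      # 4. sentence terminators
    r"[;,()\[\]]",  # 5. clause separators (subset)
]

def most_semantic_splitter(text: str):
    """Return the first (most preferred) pattern that occurs in the text."""
    for pattern in SPLITTER_PATTERNS:
        if re.search(pattern, text):
            return pattern
    return None  # nothing matched: fall back to character-level splitting
```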

### Key Features

- **Token-Aware Chunking**: Respects token limits while maintaining semantic coherence
- **Recursive Processing**: Handles oversized segments by recursively applying the same semantic rules
- **Offset Tracking**: Optional character-level tracking for precise text reconstruction
- **Overlap Support**: Configurable chunk overlap for better context preservation
- **Performance Optimization**: Up to 85% faster than alternatives like semantic-text-splitter, through efficient caching and text length heuristics

## Capabilities

### Core Chunking Function

Direct text chunking with full control over parameters and caching options.

```python { .api }
def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable[[str], int],
    memoize: bool = True,
    offsets: bool = False,
    overlap: float | int | None = None,
    cache_maxsize: int | None = None,
) -> list[str] | tuple[list[str], list[tuple[int, int]]]:
    """
    Split a text into semantically meaningful chunks of a specified size.

    Parameters:
    - text: The text to be chunked
    - chunk_size: The maximum number of tokens a chunk may contain
    - token_counter: A callable that takes a string and returns the number of tokens in it
    - memoize: Whether to memoize the token counter for performance. Defaults to True
    - offsets: Whether to return the start and end offsets of each chunk. Defaults to False
    - overlap: The proportion of the chunk size (if <1) or number of tokens (if >=1)
      by which chunks should overlap. Defaults to None
    - cache_maxsize: The maximum number of text-token count pairs that can be stored
      in the token counter's cache. Defaults to None (unbounded)

    Returns:
    - If offsets=False: list[str] - List of chunks up to chunk_size tokens long
    - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and their
      (start, end) character offsets in the original text

    Raises:
    - ValueError: If chunk_size is not provided and the tokenizer lacks model_max_length
    """
```

### Chunker Factory Function

Create configured chunkers from tokenizers or token counters with automatic optimization.

```python { .api }
def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int | None = None,
    max_token_chars: int | None = None,
    memoize: bool = True,
    cache_maxsize: int | None = None,
) -> Chunker:
    """
    Construct a chunker that splits texts into semantically meaningful chunks.

    Parameters:
    - tokenizer_or_token_counter: Either:
      * Name of a tiktoken or transformers tokenizer (e.g., 'gpt-4', 'cl100k_base')
      * A tokenizer object with an encode() method (tiktoken, transformers, tokenizers)
      * A token counter function that returns the number of tokens in the input text
    - chunk_size: Maximum number of tokens per chunk. Defaults to the tokenizer's
      model_max_length if available, otherwise raises ValueError
    - max_token_chars: Maximum number of characters a token may contain. Used to
      significantly speed up token counting for long inputs by using heuristics
      to avoid tokenizing texts that would exceed chunk_size. Auto-detected from
      the tokenizer vocabulary if possible
    - memoize: Whether to memoize the token counter. Defaults to True
    - cache_maxsize: Maximum number of text-token count pairs in the cache. Defaults to None

    Returns:
    - Chunker: A configured chunker instance that can process single texts or sequences

    Raises:
    - ValueError: If tokenizer_or_token_counter is a string that doesn't match any
      known tokenizer, or if chunk_size is None and the tokenizer lacks a
      model_max_length attribute, or if required libraries are not installed
    """
```

### Chunker Class

High-performance chunker for processing single texts or sequences with multiprocessing support.

```python { .api }
class Chunker:
    def __init__(self, chunk_size: int, token_counter: Callable[[str], int]) -> None:
        """
        Initialize a chunker with the specified chunk size and token counter.

        Parameters:
        - chunk_size: Maximum number of tokens per chunk
        - token_counter: Function that takes a string and returns its token count
        """

    def __call__(
        self,
        text_or_texts: str | Sequence[str],
        processes: int = 1,
        progress: bool = False,
        offsets: bool = False,
        overlap: int | float | None = None,
    ) -> list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]]:
        """
        Split a text or texts into semantically meaningful chunks.

        Parameters:
        - text_or_texts: Single text string or sequence of text strings to chunk
        - processes: Number of processes for multiprocessing when processing
          multiple texts. Defaults to 1 (single process)
        - progress: Whether to display a progress bar when processing multiple texts.
          Defaults to False
        - offsets: Whether to return start and end character offsets for each chunk.
          Defaults to False
        - overlap: Proportion of the chunk size (if <1) or number of tokens (if >=1)
          by which chunks should overlap. Defaults to None

        Returns:
        For single text input:
        - If offsets=False: list[str] - List of chunks
        - If offsets=True: tuple[list[str], list[tuple[int, int]]] - Chunks and offsets

        For multiple text input:
        - If offsets=False: list[list[str]] - List of chunk lists, one per input text
        - If offsets=True: tuple[list[list[str]], list[list[tuple[int, int]]]] -
          Chunk lists and offset lists for each input text
        """
```

## Performance Optimization

semchunk includes several performance optimizations to handle large texts efficiently:

### Token Counter Memoization

Enabled by default (`memoize=True`), this caches token counts for repeated text segments, significantly speeding up processing of documents with repeated content.

### Max Token Characters Heuristic

The `max_token_chars` parameter enables a smart optimization that avoids tokenizing very long texts when they would obviously exceed the chunk size. The algorithm:

1. Uses a heuristic based on `chunk_size * 6` to identify potentially long texts
2. For texts longer than this heuristic, tokenizes only a prefix of length `heuristic + max_token_chars`
3. If this prefix already exceeds `chunk_size`, returns `chunk_size + 1` without full tokenization
4. This can provide significant speedups (up to 85% faster) for very long documents
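The heuristic can be sketched as a wrapper around a token counter. This is a simplified illustration of the idea, not semchunk's exact code; `make_fast_counter` and `word_counter` are hypothetical names for this sketch.

```python
def make_fast_counter(token_counter, chunk_size: int, max_token_chars: int):
    """Wrap a token counter to skip full tokenization of clearly-too-long texts."""
    heuristic = chunk_size * 6  # rough character-length threshold

    def fast_counter(text: str) -> int:
        if len(text) > heuristic:
            # Count only a prefix; if even the prefix exceeds chunk_size,
            # the exact total no longer matters for chunking decisions.
            prefix = text[: heuristic + max_token_chars]
            if token_counter(prefix) > chunk_size:
                return chunk_size + 1
        return token_counter(text)

    return fast_counter

def word_counter(text: str) -> int:
    return len(text.split())

fast = make_fast_counter(word_counter, chunk_size=4, max_token_chars=5)
```

For a text of thousands of words, `fast` inspects only a 29-character prefix before short-circuiting, while short texts are still counted exactly.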

230

231

### Multiprocessing Support

232

233

The `Chunker` class supports parallel processing of multiple texts via the `processes` parameter, using the `mpire` library with dill serialization for robust multiprocessing.

234

235

### Special Token Handling

236

237

When using tokenizers that add special tokens (like BOS/EOS tokens), semchunk automatically:

238

239

1. Detects if the tokenizer supports the `add_special_tokens` parameter and disables it during chunking

240

2. Attempts to reduce the effective `chunk_size` by the number of special tokens when auto-detecting chunk size from `model_max_length`

241

3. For manual chunk size specification, consider deducting the number of special tokens your tokenizer adds to ensure chunks don't exceed your intended limits
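Deducting special tokens from a manually chosen chunk size is simple arithmetic. A sketch, with hypothetical names; `num_special_tokens` would come from your own tokenizer's documentation or configuration.

```python
def effective_chunk_size(model_max_length: int, num_special_tokens: int) -> int:
    """Deduct special tokens so chunk + special tokens fits the model limit."""
    return model_max_length - num_special_tokens

# e.g. a BERT-style tokenizer with a 512-token limit that wraps each
# sequence in [CLS] ... [SEP] (two special tokens)
size = effective_chunk_size(512, 2)
```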

## Types

```python { .api }
# Core imports
from typing import Callable, Sequence

# Type annotations used in the API
TokenCounter = Callable[[str], int]

# Offset tuple type (start, end character positions)
OffsetTuple = tuple[int, int]

# When TYPE_CHECKING is True, these imports are available for type hints:
# import tiktoken
# import tokenizers
# import transformers

# The tokenizer_or_token_counter parameter accepts any of:
# - str: Model name or encoding name (e.g., 'gpt-4', 'cl100k_base')
# - tiktoken.Encoding: tiktoken encoder object
# - transformers.PreTrainedTokenizer: Hugging Face tokenizer
# - tokenizers.Tokenizer: Fast tokenizer from the tokenizers library
# - Callable[[str], int]: Custom token counter function
```

## Usage Examples

### Working with Different Tokenizers

```python
import semchunk

# OpenAI tiktoken models
chunker_gpt4 = semchunk.chunkerify('gpt-4', chunk_size=1000)
chunker_gpt35 = semchunk.chunkerify('gpt-3.5-turbo', chunk_size=1000)

# tiktoken encodings
chunker_cl100k = semchunk.chunkerify('cl100k_base', chunk_size=1000)

# Hugging Face transformers
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
chunker_bert = semchunk.chunkerify(hf_tokenizer, chunk_size=512)

# Custom token counter
def simple_word_counter(text: str) -> int:
    return len(text.split())

chunker_words = semchunk.chunkerify(simple_word_counter, chunk_size=100)
```

### Processing Multiple Texts

```python
import semchunk

# Prepare chunker and texts
chunker = semchunk.chunkerify('gpt-4', chunk_size=512)
documents = [
    "First document text...",
    "Second document text...",
    "Third document text..."
]

# Process with multiprocessing
chunks_per_doc = chunker(documents, processes=4, progress=True)

# Process with offsets
chunks_per_doc, offsets_per_doc = chunker(
    documents,
    processes=4,
    progress=True,
    offsets=True
)

# With overlap for better context preservation
overlapped_chunks = chunker(
    documents,
    overlap=0.2,  # 20% overlap
    processes=4
)
```

### Advanced Configuration

```python
import semchunk
from functools import lru_cache

# Custom token counter with caching
@lru_cache(maxsize=1000)
def cached_word_counter(text: str) -> int:
    return len(text.split())

# Direct chunk function usage with custom settings
text = "Long document text..."
chunks, offsets = semchunk.chunk(
    text=text,
    chunk_size=200,
    token_counter=cached_word_counter,
    memoize=False,  # already cached via lru_cache
    offsets=True,
    overlap=50,  # 50-token overlap
    cache_maxsize=500
)

# Chunker with performance optimization
chunker = semchunk.chunkerify(
    'gpt-4',
    chunk_size=1000,
    max_token_chars=10,  # optimize for typical token lengths
    cache_maxsize=2000   # large cache for repeated texts
)
```