# NLP-Based Text Splitting

NLP-based text splitting provides intelligent text segmentation using natural language processing libraries. These splitters understand linguistic boundaries such as sentences and phrases, making them ideal for processing natural language text while preserving semantic coherence.

## Capabilities

### NLTK Text Splitting

Text splitting using NLTK's sentence tokenization, supporting multiple languages and tokenization approaches.

```python { .api }
class NLTKTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        language: str = "english",
        *,
        use_span_tokenize: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `language`: Language for NLTK sentence tokenization (default: `"english"`)
- `use_span_tokenize`: Whether to use NLTK's span tokenization, which preserves the original whitespace between sentences; when enabled, `separator` must be `""` (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`
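
The `chunk_size` and `chunk_overlap` parameters inherited from `TextSplitter` control how tokenized sentences are merged back into chunks. As a rough sketch of the idea (not the library's actual implementation), a greedy merge that re-seeds each new chunk with trailing sentences for overlap can look like:

```python
def chunks_with_overlap(
    sentences: list[str],
    chunk_size: int,
    chunk_overlap: int,
    separator: str = " ",
) -> list[str]:
    """Greedily merge sentences into chunks; carry trailing sentences
    whose joined length fits within chunk_overlap into the next chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        if current and len(separator.join(current + [sent])) > chunk_size:
            chunks.append(separator.join(current))
            # Collect trailing sentences that fit in the overlap budget.
            tail: list[str] = []
            for prev in reversed(current):
                if len(separator.join([prev] + tail)) <= chunk_overlap:
                    tail.insert(0, prev)
                else:
                    break
            current = tail
        current.append(sent)
    if current:
        chunks.append(separator.join(current))
    return chunks

sents = ["aaaa.", "bbbb.", "cccc.", "dddd."]
result = chunks_with_overlap(sents, chunk_size=11, chunk_overlap=5)
# result: ["aaaa. bbbb.", "bbbb. cccc.", "cccc. dddd."]
```

Note how each chunk repeats the last sentence of the previous one, which is the effect `chunk_overlap` is designed to produce.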

**Usage:**

```python
from langchain_text_splitters import NLTKTextSplitter

# Basic NLTK splitting
nltk_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    language="english"
)

text = """
Natural language processing is a fascinating field. It combines computer science and linguistics.
Machine learning has revolutionized how we approach NLP tasks. Today's models can understand
context and generate human-like text. However, challenges remain in areas like common sense
reasoning and multilingual understanding.
"""

chunks = nltk_splitter.split_text(text)

# Multi-language support
spanish_splitter = NLTKTextSplitter(
    language="spanish",
    chunk_size=800,
    separator="\n"
)

spanish_text = """
El procesamiento de lenguaje natural es un campo fascinante. Combina ciencias de la computación
y lingüística. El aprendizaje automático ha revolucionado cómo abordamos las tareas de PLN.
"""

spanish_chunks = spanish_splitter.split_text(spanish_text)

# Span tokenization preserves original whitespace (requires separator="")
span_splitter = NLTKTextSplitter(
    use_span_tokenize=True,
    separator="",
    chunk_size=1200,
    language="english"
)
```

**Supported Languages:**
NLTK supports sentence tokenization for multiple languages, including:
- English, Spanish, French, German, Italian, Portuguese
- Dutch, Russian, Czech, Polish, Turkish
- Many others, depending on which Punkt models are available in your NLTK data
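
Under the hood, all of these splitters follow the same split-then-merge pattern: tokenize the text into sentences, then greedily pack sentences into chunks joined by `separator`. The stdlib-only sketch below illustrates that pattern; a crude regex stands in for `nltk.sent_tokenize`, so it is illustrative only, not equivalent to the library:

```python
import re

def naive_sentence_chunks(text: str, chunk_size: int = 100, separator: str = " ") -> list[str]:
    # Crude sentence-boundary regex standing in for a real NLP tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        candidate = sent if not current else current + separator + sent
        if current and len(candidate) > chunk_size:
            chunks.append(current)   # close the current chunk
            current = sent           # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

demo = naive_sentence_chunks(
    "One sentence here. Another sentence follows. A third one ends it.",
    chunk_size=40,
)
# demo: ["One sentence here.", "Another sentence follows.", "A third one ends it."]
```

Because chunk boundaries only ever fall between sentences, no sentence is cut in half, which is the semantic-coherence property the section describes.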

### spaCy Text Splitting

Text splitting using spaCy's advanced NLP pipeline with sentence segmentation and linguistic analysis.

```python { .api }
class SpacyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1000000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `pipeline`: spaCy pipeline/model name (default: `"en_core_web_sm"`)
- `max_length`: Maximum number of characters spaCy will process in a single document (default: `1000000`)
- `strip_whitespace`: Whether to strip whitespace from chunks (default: `True`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import SpacyTextSplitter

# Basic spaCy splitting
spacy_splitter = SpacyTextSplitter(
    pipeline="en_core_web_sm",
    chunk_size=1000,
    chunk_overlap=100
)

text = """
The field of artificial intelligence has seen remarkable progress in recent years. Deep learning
models have achieved human-level performance on many tasks. Computer vision systems can now
recognize objects with incredible accuracy. Natural language models can generate coherent text
and engage in meaningful conversations.
"""

chunks = spacy_splitter.split_text(text)

# Different language models
german_splitter = SpacyTextSplitter(
    pipeline="de_core_news_sm",  # German model
    chunk_size=800,
    separator="\n"
)

# Larger models for better accuracy
large_splitter = SpacyTextSplitter(
    pipeline="en_core_web_lg",  # Large English model
    chunk_size=1500,
    max_length=2000000  # Handle longer texts
)

# Custom separator and settings
custom_splitter = SpacyTextSplitter(
    pipeline="en_core_web_md",
    separator=" | ",  # Custom separator
    strip_whitespace=False,
    chunk_size=600
)
```

**Popular spaCy Models:**
- **English**: `en_core_web_sm`, `en_core_web_md`, `en_core_web_lg`
- **German**: `de_core_news_sm`, `de_core_news_md`, `de_core_news_lg`
- **French**: `fr_core_news_sm`, `fr_core_news_md`, `fr_core_news_lg`
- **Spanish**: `es_core_news_sm`, `es_core_news_md`, `es_core_news_lg`
- **Chinese**: `zh_core_web_sm`, `zh_core_web_md`, `zh_core_web_lg`
- **Japanese**: `ja_core_news_sm`, `ja_core_news_md`, `ja_core_news_lg`

### Korean Language Text Splitting

Specialized text splitting for Korean using KoNLPy with the Kkma morphological analyzer.

```python { .api }
class KonlpyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import KonlpyTextSplitter

korean_splitter = KonlpyTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

korean_text = """
자연어 처리는 컴퓨터 과학과 언어학을 결합한 흥미로운 분야입니다. 기계 학습이 자연어 처리 작업에
접근하는 방식을 혁신했습니다. 오늘날의 모델들은 맥락을 이해하고 인간과 같은 텍스트를 생성할 수
있습니다. 그러나 상식적 추론과 다국어 이해와 같은 영역에서는 여전히 과제가 남아 있습니다.
"""

chunks = korean_splitter.split_text(korean_text)

# Custom separator for Korean text
korean_custom_splitter = KonlpyTextSplitter(
    separator="\n",
    chunk_size=600,
    chunk_overlap=50
)
```

The Korean splitter uses KoNLPy's Kkma analyzer, which provides:
- Morphological analysis
- Sentence boundary detection
- Support for Korean linguistic structures
- Proper handling of Korean punctuation and spacing
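
For contrast, a purely rule-based approach shows how little a naive heuristic captures. The sketch below splits after a couple of common sentence-final endings (a hypothetical regex, illustration only); Kkma's morphological analysis handles far more of Korean's sentence-ending variety:

```python
import re

def rough_korean_sentences(text: str) -> list[str]:
    # Crude heuristic: split only after the sentence-final endings
    # "다." and "까?". Real Korean has many more; Kkma covers them
    # via morphological analysis.
    parts = re.split(r"(?<=다\.)\s+|(?<=까\?)\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

sample = "자연어 처리는 흥미로운 분야입니다. 기계 학습이 접근 방식을 혁신했습니다. 여전히 과제가 남아 있습니다."
sentences = rough_korean_sentences(sample)
# sentences has 3 entries, each ending in "다."
```

A heuristic like this breaks on quoted speech, abbreviations, and informal endings, which is why a dedicated analyzer is worth the extra dependency.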

## Installation Requirements

Each NLP-based splitter requires specific dependencies:

### NLTK Text Splitter
```bash
pip install nltk
```

Download the required NLTK data:
```python
import nltk

nltk.download("punkt")      # Punkt sentence tokenizer models
nltk.download("punkt_tab")  # Required by newer NLTK versions
```

### spaCy Text Splitter
```bash
pip install spacy
```

Download language models:
```bash
# English
python -m spacy download en_core_web_sm

# Other languages
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish
```

### KoNLPy Text Splitter
```bash
pip install konlpy
```

Note: KoNLPy may require additional system dependencies depending on your platform; the Kkma backend in particular runs on the JVM, so a Java runtime must be installed.

## Comparison of NLP Splitters

| Splitter | Strengths | Best Use Cases | Performance |
|----------|-----------|----------------|-------------|
| **NLTK** | Lightweight, many languages, fast setup | Simple sentence splitting, multilingual text | Fast |
| **spaCy** | Advanced NLP, high accuracy, robust models | High-quality text processing, complex documents | Medium-fast |
| **KoNLPy** | Korean language expertise, morphological analysis | Korean text processing, Korean NLP tasks | Medium |
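
One way to act on this comparison is to route documents to a splitter by language code. The helper below is purely illustrative (the mapping and function are assumptions, not part of `langchain_text_splitters`); it returns a class name and keyword arguments that you would use to instantiate the real splitter:

```python
# Illustrative language -> splitter routing table. The class names refer
# to langchain_text_splitters; the table and helper are not library API.
SPLITTER_BY_LANGUAGE: dict[str, tuple[str, dict]] = {
    "en": ("SpacyTextSplitter", {"pipeline": "en_core_web_sm"}),
    "de": ("SpacyTextSplitter", {"pipeline": "de_core_news_sm"}),
    "es": ("NLTKTextSplitter", {"language": "spanish"}),
    "ko": ("KonlpyTextSplitter", {}),
}

def splitter_config(lang_code: str) -> tuple[str, dict]:
    # Fall back to NLTK's English tokenizer for unmapped languages.
    return SPLITTER_BY_LANGUAGE.get(
        lang_code, ("NLTKTextSplitter", {"language": "english"})
    )

name, kwargs = splitter_config("ko")
# name: "KonlpyTextSplitter"
```

In real code you would look the class up on the module (e.g. via `getattr`) and pass `chunk_size` / `chunk_overlap` alongside these kwargs.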

## Best Practices

1. **Choose the right tool**: Use NLTK for simple sentence splitting, spaCy for advanced analysis, KoNLPy for Korean
2. **Model selection**: Choose model size based on accuracy vs. speed trade-offs
3. **Language matching**: Use language-specific models for non-English text
4. **Memory considerations**: Larger spaCy models require more memory
5. **Preprocessing**: Clean text before NLP processing for better results
6. **Sentence coherence**: NLP splitters maintain sentence boundaries, preserving semantic coherence
7. **Domain context**: For specialized domains, consider domain-specific models
8. **Performance testing**: Benchmark different splitters with your specific text types