# NLP-Based Text Splitting

NLP-based text splitting provides intelligent text segmentation using natural language processing libraries. These splitters understand linguistic boundaries such as sentences and phrases, making them ideal for processing natural language text while preserving semantic coherence.

## Capabilities

### NLTK Text Splitting

Text splitting using NLTK's sentence tokenization, supporting multiple languages and tokenization approaches.

```python { .api }
class NLTKTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        language: str = "english",
        *,
        use_span_tokenize: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `language`: Language for NLTK sentence tokenization (default: `"english"`)
- `use_span_tokenize`: Whether to use NLTK's span tokenization, which preserves the original whitespace between sentences; when enabled, `separator` must be `""` (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`
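
The `chunk_size` and `chunk_overlap` parameters inherited from `TextSplitter` control how tokenized sentences are merged back into chunks. As a rough sketch of the idea (not the library's actual implementation), a greedy merge that re-seeds each new chunk with trailing sentences for overlap can look like:

```python
def chunks_with_overlap(
    sentences: list[str],
    chunk_size: int,
    chunk_overlap: int,
    separator: str = " ",
) -> list[str]:
    """Greedily merge sentences into chunks; carry trailing sentences
    whose joined length fits within chunk_overlap into the next chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        if current and len(separator.join(current + [sent])) > chunk_size:
            chunks.append(separator.join(current))
            # Collect trailing sentences that fit in the overlap budget.
            tail: list[str] = []
            for prev in reversed(current):
                if len(separator.join([prev] + tail)) <= chunk_overlap:
                    tail.insert(0, prev)
                else:
                    break
            current = tail
        current.append(sent)
    if current:
        chunks.append(separator.join(current))
    return chunks

sents = ["aaaa.", "bbbb.", "cccc.", "dddd."]
result = chunks_with_overlap(sents, chunk_size=11, chunk_overlap=5)
# result: ["aaaa. bbbb.", "bbbb. cccc.", "cccc. dddd."]
```

Note how each chunk repeats the last sentence of the previous one, which is the effect `chunk_overlap` is designed to produce.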

**Usage:**

```python
from langchain_text_splitters import NLTKTextSplitter

# Basic NLTK splitting
nltk_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    language="english"
)

text = """
Natural language processing is a fascinating field. It combines computer science and linguistics.
Machine learning has revolutionized how we approach NLP tasks. Today's models can understand
context and generate human-like text. However, challenges remain in areas like common sense
reasoning and multilingual understanding.
"""

chunks = nltk_splitter.split_text(text)

# Multi-language support
spanish_splitter = NLTKTextSplitter(
    language="spanish",
    chunk_size=800,
    separator="\n"
)

spanish_text = """
El procesamiento de lenguaje natural es un campo fascinante. Combina ciencias de la computación
y lingüística. El aprendizaje automático ha revolucionado cómo abordamos las tareas de PLN.
"""

spanish_chunks = spanish_splitter.split_text(spanish_text)

# Span tokenization preserves original whitespace (requires separator="")
span_splitter = NLTKTextSplitter(
    use_span_tokenize=True,
    separator="",
    chunk_size=1200,
    language="english"
)
```

**Supported Languages:**
NLTK supports sentence tokenization for multiple languages, including:
- English, Spanish, French, German, Italian, Portuguese
- Dutch, Russian, Czech, Polish, Turkish
- Many others, depending on which Punkt models are available in your NLTK data
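
Under the hood, all of these splitters follow the same split-then-merge pattern: tokenize the text into sentences, then greedily pack sentences into chunks joined by `separator`. The stdlib-only sketch below illustrates that pattern; a crude regex stands in for `nltk.sent_tokenize`, so it is illustrative only, not equivalent to the library:

```python
import re

def naive_sentence_chunks(text: str, chunk_size: int = 100, separator: str = " ") -> list[str]:
    # Crude sentence-boundary regex standing in for a real NLP tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        candidate = sent if not current else current + separator + sent
        if current and len(candidate) > chunk_size:
            chunks.append(current)   # close the current chunk
            current = sent           # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

demo = naive_sentence_chunks(
    "One sentence here. Another sentence follows. A third one ends it.",
    chunk_size=40,
)
# demo: ["One sentence here.", "Another sentence follows.", "A third one ends it."]
```

Because chunk boundaries only ever fall between sentences, no sentence is cut in half, which is the semantic-coherence property the section describes.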

### spaCy Text Splitting

Text splitting using spaCy's advanced NLP pipeline with sentence segmentation and linguistic analysis.

```python { .api }
class SpacyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1000000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `pipeline`: spaCy pipeline/model name (default: `"en_core_web_sm"`)
- `max_length`: Maximum number of characters spaCy will process in a single document (default: `1000000`)
- `strip_whitespace`: Whether to strip whitespace from chunks (default: `True`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import SpacyTextSplitter

# Basic spaCy splitting
spacy_splitter = SpacyTextSplitter(
    pipeline="en_core_web_sm",
    chunk_size=1000,
    chunk_overlap=100
)

text = """
The field of artificial intelligence has seen remarkable progress in recent years. Deep learning
models have achieved human-level performance on many tasks. Computer vision systems can now
recognize objects with incredible accuracy. Natural language models can generate coherent text
and engage in meaningful conversations.
"""

chunks = spacy_splitter.split_text(text)

# Different language models
german_splitter = SpacyTextSplitter(
    pipeline="de_core_news_sm",  # German model
    chunk_size=800,
    separator="\n"
)

# Larger models for better accuracy
large_splitter = SpacyTextSplitter(
    pipeline="en_core_web_lg",  # Large English model
    chunk_size=1500,
    max_length=2000000  # Handle longer texts
)

# Custom separator and settings
custom_splitter = SpacyTextSplitter(
    pipeline="en_core_web_md",
    separator=" | ",  # Custom separator
    strip_whitespace=False,
    chunk_size=600
)
```

**Popular spaCy Models:**
- **English**: `en_core_web_sm`, `en_core_web_md`, `en_core_web_lg`
- **German**: `de_core_news_sm`, `de_core_news_md`, `de_core_news_lg`
- **French**: `fr_core_news_sm`, `fr_core_news_md`, `fr_core_news_lg`
- **Spanish**: `es_core_news_sm`, `es_core_news_md`, `es_core_news_lg`
- **Chinese**: `zh_core_web_sm`, `zh_core_web_md`, `zh_core_web_lg`
- **Japanese**: `ja_core_news_sm`, `ja_core_news_md`, `ja_core_news_lg`

### Korean Language Text Splitting

Specialized text splitting for Korean using KoNLPy with the Kkma morphological analyzer.

```python { .api }
class KonlpyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import KonlpyTextSplitter

korean_splitter = KonlpyTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

korean_text = """
자연어 처리는 컴퓨터 과학과 언어학을 결합한 흥미로운 분야입니다. 기계 학습이 자연어 처리 작업에
접근하는 방식을 혁신했습니다. 오늘날의 모델들은 맥락을 이해하고 인간과 같은 텍스트를 생성할 수
있습니다. 그러나 상식적 추론과 다국어 이해와 같은 영역에서는 여전히 과제가 남아 있습니다.
"""

chunks = korean_splitter.split_text(korean_text)

# Custom separator for Korean text
korean_custom_splitter = KonlpyTextSplitter(
    separator="\n",
    chunk_size=600,
    chunk_overlap=50
)
```

The Korean splitter uses KoNLPy's Kkma analyzer, which provides:
- Morphological analysis
- Sentence boundary detection
- Support for Korean linguistic structures
- Proper handling of Korean punctuation and spacing
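
For contrast, a purely rule-based approach shows how little a naive heuristic captures. The sketch below splits after a couple of common sentence-final endings (a hypothetical regex, illustration only); Kkma's morphological analysis handles far more of Korean's sentence-ending variety:

```python
import re

def rough_korean_sentences(text: str) -> list[str]:
    # Crude heuristic: split only after the sentence-final endings
    # "다." and "까?". Real Korean has many more; Kkma covers them
    # via morphological analysis.
    parts = re.split(r"(?<=다\.)\s+|(?<=까\?)\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

sample = "자연어 처리는 흥미로운 분야입니다. 기계 학습이 접근 방식을 혁신했습니다. 여전히 과제가 남아 있습니다."
sentences = rough_korean_sentences(sample)
# sentences has 3 entries, each ending in "다."
```

A heuristic like this breaks on quoted speech, abbreviations, and informal endings, which is why a dedicated analyzer is worth the extra dependency.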

## Installation Requirements

Each NLP-based splitter requires specific dependencies:

### NLTK Text Splitter
```bash
pip install nltk
```

Download the required NLTK data:
```python
import nltk

nltk.download("punkt")      # Punkt sentence tokenizer models
nltk.download("punkt_tab")  # Required by newer NLTK versions
```

### spaCy Text Splitter
```bash
pip install spacy
```

Download language models:
```bash
# English
python -m spacy download en_core_web_sm

# Other languages
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish
```

### KoNLPy Text Splitter
```bash
pip install konlpy
```

Note: KoNLPy may require additional system dependencies depending on your platform; the Kkma backend in particular runs on the JVM, so a Java runtime must be installed.

## Comparison of NLP Splitters

| Splitter | Strengths | Best Use Cases | Performance |
|----------|-----------|----------------|-------------|
| **NLTK** | Lightweight, many languages, fast setup | Simple sentence splitting, multilingual text | Fast |
| **spaCy** | Advanced NLP, high accuracy, robust models | High-quality text processing, complex documents | Medium-fast |
| **KoNLPy** | Korean language expertise, morphological analysis | Korean text processing, Korean NLP tasks | Medium |
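
One way to act on this comparison is to route documents to a splitter by language code. The helper below is purely illustrative (the mapping and function are assumptions, not part of `langchain_text_splitters`); it returns a class name and keyword arguments that you would use to instantiate the real splitter:

```python
# Illustrative language -> splitter routing table. The class names refer
# to langchain_text_splitters; the table and helper are not library API.
SPLITTER_BY_LANGUAGE: dict[str, tuple[str, dict]] = {
    "en": ("SpacyTextSplitter", {"pipeline": "en_core_web_sm"}),
    "de": ("SpacyTextSplitter", {"pipeline": "de_core_news_sm"}),
    "es": ("NLTKTextSplitter", {"language": "spanish"}),
    "ko": ("KonlpyTextSplitter", {}),
}

def splitter_config(lang_code: str) -> tuple[str, dict]:
    # Fall back to NLTK's English tokenizer for unmapped languages.
    return SPLITTER_BY_LANGUAGE.get(
        lang_code, ("NLTKTextSplitter", {"language": "english"})
    )

name, kwargs = splitter_config("ko")
# name: "KonlpyTextSplitter"
```

In real code you would look the class up on the module (e.g. via `getattr`) and pass `chunk_size` / `chunk_overlap` alongside these kwargs.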

## Best Practices

1. **Choose the right tool**: Use NLTK for simple sentence splitting, spaCy for advanced analysis, KoNLPy for Korean
2. **Model selection**: Choose model size based on accuracy vs. speed trade-offs
3. **Language matching**: Use language-specific models for non-English text
4. **Memory considerations**: Larger spaCy models require more memory
5. **Preprocessing**: Clean text before NLP processing for better results
6. **Sentence coherence**: NLP splitters maintain sentence boundaries, preserving semantic coherence
7. **Domain context**: For specialized domains, consider domain-specific models
8. **Performance testing**: Benchmark different splitters with your specific text types