
# NLP-Based Text Splitting

NLP-based text splitting provides intelligent text segmentation using natural language processing libraries. These splitters understand linguistic boundaries such as sentences and phrases, making them ideal for processing natural language text while preserving semantic coherence.
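All three splitters share the same two-step pattern: detect sentence boundaries with an NLP tokenizer, then greedily pack whole sentences into chunks of at most `chunk_size` characters, joined by a separator. The dependency-free sketch below illustrates that pattern with a naive regex standing in for a real tokenizer; the function name and logic are illustrative, not part of any library API.

```python
import re

def split_into_chunks(text: str, chunk_size: int = 100, separator: str = " ") -> list[str]:
    """Split at sentence boundaries, then pack sentences into chunks."""
    # Crude sentence detection: split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        # Joining cost includes the separator once the chunk is non-empty.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and length + extra > chunk_size:
            chunks.append(separator.join(current))
            current, length = [], 0
            extra = len(sentence)
        current.append(sentence)
        length += extra
    if current:
        chunks.append(separator.join(current))
    return chunks

print(split_into_chunks("One sentence here. Another one follows! A third? Yes.", chunk_size=25))
```

Note that a single sentence longer than `chunk_size` is kept whole rather than cut mid-sentence, which is exactly why these splitters preserve semantic coherence better than character-based splitting.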

## Capabilities

### NLTK Text Splitting

Text splitting using NLTK's sentence tokenization, supporting multiple languages and tokenization approaches.

```python { .api }
class NLTKTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        language: str = "english",
        *,
        use_span_tokenize: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**

- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `language`: Language for NLTK sentence tokenization (default: `"english"`)
- `use_span_tokenize`: Whether to use NLTK span tokenization, which preserves the original whitespace between sentences; requires `separator=""` (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import NLTKTextSplitter

# Basic NLTK splitting
nltk_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    language="english"
)

text = """
Natural language processing is a fascinating field. It combines computer science and linguistics.
Machine learning has revolutionized how we approach NLP tasks. Today's models can understand
context and generate human-like text. However, challenges remain in areas like common sense
reasoning and multilingual understanding.
"""

chunks = nltk_splitter.split_text(text)

# Multi-language support
spanish_splitter = NLTKTextSplitter(
    language="spanish",
    chunk_size=800,
    separator="\n"
)

spanish_text = """
El procesamiento de lenguaje natural es un campo fascinante. Combina ciencias de la computación
y lingüística. El aprendizaje automático ha revolucionado cómo abordamos las tareas de PLN.
"""

spanish_chunks = spanish_splitter.split_text(spanish_text)

# Span tokenization preserves the original whitespace between sentences
# (requires separator="")
span_splitter = NLTKTextSplitter(
    use_span_tokenize=True,
    separator="",
    chunk_size=1200,
    language="english"
)
```

**Supported Languages:**

NLTK supports sentence tokenization for many languages, including:

- English, Spanish, French, German, Italian, Portuguese
- Dutch, Russian, Czech, Polish, Turkish
- And many others, depending on NLTK data availability

### spaCy Text Splitting

Text splitting using spaCy's advanced NLP pipeline with sentence segmentation and linguistic analysis.

```python { .api }
class SpacyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1000000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**

- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `pipeline`: spaCy pipeline/model name (default: `"en_core_web_sm"`)
- `max_length`: Maximum text length for spaCy processing (default: `1000000`)
- `strip_whitespace`: Whether to strip whitespace from chunks (default: `True`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import SpacyTextSplitter

# Basic spaCy splitting
spacy_splitter = SpacyTextSplitter(
    pipeline="en_core_web_sm",
    chunk_size=1000,
    chunk_overlap=100
)

text = """
The field of artificial intelligence has seen remarkable progress in recent years. Deep learning
models have achieved human-level performance on many tasks. Computer vision systems can now
recognize objects with incredible accuracy. Natural language models can generate coherent text
and engage in meaningful conversations.
"""

chunks = spacy_splitter.split_text(text)

# Different language models
german_splitter = SpacyTextSplitter(
    pipeline="de_core_news_sm",  # German model
    chunk_size=800,
    separator="\n"
)

# Larger models for better accuracy
large_splitter = SpacyTextSplitter(
    pipeline="en_core_web_lg",  # Large English model
    chunk_size=1500,
    max_length=2000000  # Handle longer texts
)

# Custom separator and settings
custom_splitter = SpacyTextSplitter(
    pipeline="en_core_web_md",
    separator=" | ",  # Custom separator
    strip_whitespace=False,
    chunk_size=600
)
```

**Popular spaCy Models:**

- **English**: `en_core_web_sm`, `en_core_web_md`, `en_core_web_lg`
- **German**: `de_core_news_sm`, `de_core_news_md`, `de_core_news_lg`
- **French**: `fr_core_news_sm`, `fr_core_news_md`, `fr_core_news_lg`
- **Spanish**: `es_core_news_sm`, `es_core_news_md`, `es_core_news_lg`
- **Chinese**: `zh_core_web_sm`, `zh_core_web_md`, `zh_core_web_lg`
- **Japanese**: `ja_core_news_sm`, `ja_core_news_md`, `ja_core_news_lg`
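A practical pitfall is passing a `pipeline` name that is not installed, which makes `spacy.load()` raise `OSError`. Below is a hedged sketch of a loader that checks availability first and falls back to spaCy's blank pipeline with the rule-based `sentencizer`; the helper name is illustrative, not part of spaCy or the splitter API.

```python
import spacy

def load_pipeline(name: str = "en_core_web_sm"):
    # spacy.util.is_package reports whether the trained model is installed.
    if spacy.util.is_package(name):
        return spacy.load(name)
    # Fallback: blank English pipeline with rule-based sentence segmentation.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return nlp
```

The fallback loses the trained model's accuracy but keeps sentence segmentation working, which may be acceptable for splitting.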

### Korean Language Text Splitting

Specialized text splitting for Korean using Konlpy with the Kkma tokenizer.

```python { .api }
class KonlpyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**

- `separator`: Separator used to join sentences into chunks (default: `"\n\n"`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import KonlpyTextSplitter

korean_splitter = KonlpyTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

korean_text = """
자연어 처리는 컴퓨터 과학과 언어학을 결합한 흥미로운 분야입니다. 기계 학습이 자연어 처리 작업에
접근하는 방식을 혁신했습니다. 오늘날의 모델들은 맥락을 이해하고 인간과 같은 텍스트를 생성할 수
있습니다. 그러나 상식적 추론과 다국어 이해와 같은 영역에서는 여전히 과제가 남아 있습니다.
"""

chunks = korean_splitter.split_text(korean_text)

# Custom separator for Korean text
korean_custom_splitter = KonlpyTextSplitter(
    separator="\n",
    chunk_size=600,
    chunk_overlap=50
)
```

The Korean splitter uses Konlpy's Kkma tokenizer, which provides:

- Morphological analysis
- Sentence boundary detection
- Support for Korean linguistic structures
- Proper handling of Korean punctuation and spacing

## Installation Requirements

Each NLP-based splitter requires specific dependencies:

### NLTK Text Splitter

```bash
pip install nltk
```

Download the required NLTK data:

```python
import nltk

nltk.download('punkt')      # For sentence tokenization
nltk.download('punkt_tab')  # For newer NLTK versions
```

### spaCy Text Splitter

```bash
pip install spacy
```

Download language models:

```bash
# English
python -m spacy download en_core_web_sm

# Other languages
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish
```

### Konlpy Text Splitter

```bash
pip install konlpy
```

Note: Konlpy may require additional system dependencies, such as a Java runtime, depending on your platform.

## Comparison of NLP Splitters

| Splitter | Strengths | Best Use Cases | Performance |
|----------|-----------|----------------|-------------|
| **NLTK** | Lightweight, many languages, fast setup | Simple sentence splitting, multilingual text | Fast |
| **spaCy** | Advanced NLP, high accuracy, robust models | High-quality text processing, complex documents | Medium-fast |
| **Konlpy** | Korean language expertise, morphological analysis | Korean text processing, Korean NLP tasks | Medium |

## Best Practices

1. **Choose the right tool**: Use NLTK for simple sentence splitting, spaCy for advanced analysis, and Konlpy for Korean
2. **Model selection**: Choose model size based on the accuracy vs. speed trade-off
3. **Language matching**: Use language-specific models for non-English text
4. **Memory considerations**: Larger spaCy models require more memory
5. **Preprocessing**: Clean text before NLP processing for better results
6. **Sentence coherence**: NLP splitters maintain sentence boundaries, preserving semantic coherence
7. **Domain context**: For specialized domains, consider domain-specific models
8. **Performance testing**: Benchmark different splitters with your specific text types