LangChain text splitting utilities for breaking documents into manageable chunks for AI processing
npx @tessl/cli install tessl/pypi-langchain-text-splitters@0.3.0
# LangChain Text Splitters
LangChain Text Splitters provides text splitting utilities for breaking documents of many types into manageable chunks for language models and other AI systems. The library offers specialized splitters for different content types and preserves document structure and context through intelligent chunking strategies.
## Package Information

- **Package Name**: langchain-text-splitters
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install langchain-text-splitters`

## Core Imports

```python
from langchain_text_splitters import (
    TextSplitter,
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)
```

For specific splitter types:

```python
from langchain_text_splitters import (
    # HTML splitters
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
    # Markdown splitters
    MarkdownHeaderTextSplitter,
    MarkdownTextSplitter,
    ExperimentalMarkdownSyntaxTextSplitter,
    # Other specialized splitters
    RecursiveJsonSplitter,
    PythonCodeTextSplitter,
    NLTKTextSplitter,
    SpacyTextSplitter
)
```

For type definitions:

```python
from langchain_text_splitters import (
    ElementType,
    HeaderType,
    LineType,
    Language
)
```

## Basic Usage

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter with custom configuration
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# Split text into chunks
text = "Your long document text here..."
chunks = text_splitter.split_text(text)

# Create Document objects with metadata
documents = text_splitter.create_documents([text], [{"source": "example.txt"}])

# Split existing Document objects
existing_docs = [Document(page_content="Text content", metadata={"page": 1})]
split_docs = text_splitter.split_documents(existing_docs)
```

## Architecture

The package follows a well-defined inheritance hierarchy:

- **BaseDocumentTransformer**: Core LangChain interface for document transformation
- **TextSplitter**: Abstract base class defining the splitting interface
- **Specific splitters**: Concrete implementations for different content types and strategies

Key design patterns:

- **Inheritance-based**: Most splitters extend the abstract `TextSplitter` class
- **Factory methods**: Classes provide `from_*` methods for convenient initialization
- **Language support**: Extensive programming-language support via the `Language` enum
- **Document integration**: Seamless integration with LangChain's `Document` class for metadata preservation

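The pattern can be illustrated with a toy stand-in (the hypothetical `MiniSplitter` and `MiniCharacterSplitter` below are illustrations only, not the library's classes): the abstract base supplies shared document handling, and each concrete splitter only implements `split_text`.

```python
from abc import ABC, abstractmethod

class MiniSplitter(ABC):
    """Toy stand-in for TextSplitter (illustration only)."""

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def split_documents(self, docs: list[dict]) -> list[dict]:
        # Shared behavior every concrete splitter inherits: split each
        # document's text and carry its metadata onto every chunk.
        return [
            {"page_content": chunk, "metadata": dict(d["metadata"])}
            for d in docs
            for chunk in self.split_text(d["page_content"])
        ]

class MiniCharacterSplitter(MiniSplitter):
    def __init__(self, separator: str = "\n\n"):
        self.separator = separator

    def split_text(self, text: str) -> list[str]:
        return [p for p in text.split(self.separator) if p]

docs = [{"page_content": "part one\n\npart two", "metadata": {"page": 1}}]
split = MiniCharacterSplitter().split_documents(docs)
print(split)  # both chunks carry the original metadata
```

The real `TextSplitter` adds chunk sizing, overlap, and merging on top of this skeleton, but the division of labor is the same.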
## Capabilities

### Character-Based Text Splitting

Basic and advanced character-based text splitting strategies, from simple separator-based splitting to recursive multi-separator splitting with language-specific support.

```python { .api }
class CharacterTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, keep_separator: bool = True, is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
    @classmethod
    def from_language(cls, language: Language, **kwargs) -> "RecursiveCharacterTextSplitter": ...
    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...
```

[Character-Based Splitting](./character-splitting.md)
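The recursive strategy can be sketched in plain Python (the simplified, hypothetical `recursive_split` below is not the library's implementation, which also merges small pieces and handles separator retention): try the coarsest separator first and fall back to finer ones only for pieces that are still too large.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    # Split on the coarsest separator, then recurse with finer
    # separators on any piece still longer than chunk_size.
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    out: list[str] = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            out.extend(recursive_split(piece, rest, chunk_size))
        elif piece:
            out.append(piece)
    return out

text = "first paragraph.\n\nsecond paragraph. it has two sentences."
chunks = recursive_split(text, ["\n\n", ". ", " "], chunk_size=30)
print(chunks)
```

The paragraph that fits is kept whole; only the oversized one is broken at the next-finer boundary, which is why recursive splitting tends to preserve semantically coherent units.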

### Token-Based Text Splitting

Token-aware splitting using popular tokenizers, including OpenAI's tiktoken, Hugging Face transformers, and sentence-transformer models.

```python { .api }
class TokenTextSplitter(TextSplitter):
    def __init__(self, encoding_name: str = "gpt2", model_name: Optional[str] = None, allowed_special: Union[Literal["all"], set[str]] = set(), disallowed_special: Union[Literal["all"], Collection[str]] = "all", **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class SentenceTransformersTokenTextSplitter(TextSplitter):
    def __init__(self, chunk_overlap: int = 50, model_name: str = "sentence-transformers/all-mpnet-base-v2", tokens_per_chunk: Optional[int] = None, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
    def count_tokens(self, text: str) -> int: ...
```

[Token-Based Splitting](./token-splitting.md)
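The windowing logic behind token-based splitting can be sketched without a real tokenizer (the hypothetical `split_on_tokens` below uses whitespace words as stand-in tokens; the actual classes delegate tokenization to tiktoken or a Hugging Face tokenizer):

```python
def split_on_tokens(tokens: list[str], tokens_per_chunk: int, overlap: int) -> list[list[str]]:
    # Emit fixed-size token windows, each overlapping the previous
    # window by `overlap` tokens.
    chunks = []
    step = tokens_per_chunk - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks

tokens = "one two three four five six seven".split()
windows = split_on_tokens(tokens, tokens_per_chunk=4, overlap=2)
print(windows)
```

Counting in tokens rather than characters is what keeps chunks within a model's context budget regardless of how verbose the text is per token.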

### Document Structure-Aware Splitting

Specialized splitters that understand and preserve document structure for HTML, Markdown, JSON, and LaTeX documents while maintaining semantic context.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False): ...
    def split_text(self, text: str) -> list[Document]: ...
    def split_text_from_url(self, url: str, timeout: int = 10, **kwargs) -> list[Document]: ...

class HTMLSectionSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], **kwargs: Any): ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    def split_text(self, text: str) -> list[Document]: ...

class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(self, headers_to_split_on: list[tuple[str, str]], *, max_chunk_size: int = 1000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[Document]: ...
    def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> list[Document]: ...

class MarkdownHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_line: bool = False, strip_headers: bool = True, custom_header_patterns: Optional[dict[int, str]] = None): ...
    def split_text(self, text: str) -> list[Document]: ...

class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...

class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(self, headers_to_split_on: Optional[list[tuple[str, str]]] = None, return_each_line: bool = False, strip_headers: bool = True): ...
    def split_text(self, text: str) -> list[Document]: ...

class RecursiveJsonSplitter:
    def __init__(self, max_chunk_size: int = 2000, min_chunk_size: Optional[int] = None): ...
    def split_json(self, json_data: dict, convert_lists: bool = False) -> list[dict]: ...
    def split_text(self, json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]: ...
```

[Document Structure Splitting](./document-structure.md)
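The core idea of header-based splitting can be sketched in a few lines (the hypothetical `split_on_headers` below is far simpler than `MarkdownHeaderTextSplitter`, which also handles header nesting resets, code fences, and stripping options): each chunk carries the headers above it as metadata.

```python
def split_on_headers(md: str, headers: dict[str, str]) -> list[dict]:
    # Accumulate body lines; on each header, flush the current chunk
    # and record the header text under its metadata key.
    chunks: list[dict] = []
    meta: dict[str, str] = {}
    lines: list[str] = []

    def flush():
        if lines:
            chunks.append({"page_content": "\n".join(lines).strip(), "metadata": dict(meta)})
            lines.clear()

    for line in md.splitlines():
        for prefix, name in headers.items():
            if line.startswith(prefix + " "):
                flush()
                meta[name] = line[len(prefix) + 1:]
                break
        else:
            lines.append(line)
    flush()
    return chunks

md = "# Intro\nHello.\n## Detail\nMore text."
sections = split_on_headers(md, {"#": "h1", "##": "h2"})
print(sections)
```

Keeping the header path in metadata is what lets downstream retrieval answer "which section did this chunk come from?" without re-parsing the document.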

### Code-Aware Text Splitting

Programming-language-aware splitters that understand code syntax and structure for Python, JavaScript/TypeScript frameworks, and other programming languages.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...

class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, chunk_size: int = 2000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...
```

[Code-Aware Splitting](./code-splitting.md)
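Under the hood, a code-aware splitter is a recursive splitter seeded with language-specific separators. As a sketch (the `PYTHON_SEPARATORS` list below only approximates what `from_language(Language.PYTHON)` would seed, and `first_effective_separator` is a hypothetical helper mimicking how the recursive splitter picks its separator):

```python
# Coarse structural boundaries first, character-level splitting last.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

def first_effective_separator(text: str, separators: list[str]) -> str:
    # Pick the coarsest separator actually present in the text,
    # as the recursive splitter does at each level.
    for sep in separators:
        if sep and sep in text:
            return sep
    return ""

source = "class A:\n    pass\n\nclass B:\n    pass"
sep = first_effective_separator(source, PYTHON_SEPARATORS)
print(sep)  # "\nclass " is chosen, so chunks break at class boundaries
```

Because class and function boundaries outrank blank lines and spaces, definitions tend to stay intact inside a chunk; the real splitter also keeps the matched separator attached (`keep_separator`) so the `class`/`def` keywords are not lost.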

### Natural Language Processing Splitters

NLP-powered text splitters using NLTK, spaCy, and KoNLPy for sentence-aware splitting, with support for multiple languages including Korean.

```python { .api }
class NLTKTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", language: str = "english", use_span_tokenize: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class SpacyTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", pipeline: str = "en_core_web_sm", max_length: int = 1000000, strip_whitespace: bool = True, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class KonlpyTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
```

[NLP-Based Splitting](./nlp-splitting.md)
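The sentence-aware strategy these splitters share can be sketched with a naive regex in place of a real sentence tokenizer (the hypothetical `naive_sentence_split` below; NLTK and spaCy do the sentence detection properly): tokenize into sentences, then pack whole sentences into chunks so no sentence is cut mid-way.

```python
import re

def naive_sentence_split(text: str, chunk_size: int = 80) -> list[str]:
    # Crude sentence boundary: whitespace after ., !, or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        # Start a new chunk rather than splitting a sentence.
        if current and len(current) + 1 + len(s) > chunk_size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. Second one follows! A third, longer sentence ends the demo?"
packed = naive_sentence_split(text, chunk_size=45)
print(packed)
```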

### Core Base Classes and Utilities

Core interfaces, enums, and utility functions that provide the foundation for all text splitting functionality.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200, length_function: Callable[[str], int] = len, keep_separator: Union[bool, Literal["start", "end"]] = False, add_start_index: bool = False, strip_whitespace: bool = True): ...
    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...
    def create_documents(self, texts: list[str], metadatas: Optional[list[dict[Any, Any]]] = None) -> list[Document]: ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"

def split_text_on_tokens(*, text: str, tokenizer: "Tokenizer") -> list[str]: ...
```

[Core Base Classes](./core-base.md)
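The behavior of `create_documents` with `add_start_index` can be sketched as follows (a simplified, hypothetical stand-alone function, not the base class's actual code, and it uses a plug-in `splitter` callable for brevity): each text is paired with its metadata dict, and each chunk optionally records its character offset.

```python
def create_documents(texts, metadatas=None, *, splitter, add_start_index=False):
    # Pair each text with its metadata, split it, and optionally
    # record each chunk's character offset in the source text.
    metadatas = metadatas or [{} for _ in texts]
    out_docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in splitter(text):
            m = dict(meta)
            if add_start_index:
                m["start_index"] = text.index(chunk)  # naive offset lookup
            out_docs.append({"page_content": chunk, "metadata": m})
    return out_docs

out_docs = create_documents(
    ["alpha beta", "gamma"],
    [{"source": "a.txt"}, {"source": "b.txt"}],
    splitter=str.split,  # toy splitter: whitespace words
    add_start_index=True,
)
print(out_docs)
```

The real base class is more careful (it tracks offsets incrementally so repeated chunks resolve correctly), but the metadata-fan-out shape is the same, which is why every splitter in the package produces `Document` objects with their source metadata intact.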