LangChain text splitting utilities for breaking documents into manageable chunks for AI processing
npx @tessl/cli install tessl/pypi-langchain-text-splitters@0.3.0
# LangChain Text Splitters
LangChain Text Splitters provides text splitting utilities for breaking documents of many types into manageable chunks for language models and other AI systems. The library offers specialized splitters for different content types and preserves document structure and context through intelligent chunking strategies.
## Package Information

- **Package Name**: langchain-text-splitters
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install langchain-text-splitters`

## Core Imports

```python
from langchain_text_splitters import (
    TextSplitter,
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)
```

For specific splitter types:

```python
from langchain_text_splitters import (
    # HTML splitters
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
    # Markdown splitters
    MarkdownHeaderTextSplitter,
    MarkdownTextSplitter,
    ExperimentalMarkdownSyntaxTextSplitter,
    # Other specialized splitters
    RecursiveJsonSplitter,
    PythonCodeTextSplitter,
    NLTKTextSplitter,
    SpacyTextSplitter
)
```

For type definitions:

```python
from langchain_text_splitters import (
    ElementType,
    HeaderType,
    LineType,
    Language
)
```

## Basic Usage

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter with custom configuration
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# Split text into chunks
text = "Your long document text here..."
chunks = text_splitter.split_text(text)

# Create Document objects with metadata
documents = text_splitter.create_documents([text], [{"source": "example.txt"}])

# Split existing Document objects
existing_docs = [Document(page_content="Text content", metadata={"page": 1})]
split_docs = text_splitter.split_documents(existing_docs)
```

## Architecture

The package follows a well-defined inheritance hierarchy:

- **BaseDocumentTransformer**: Core LangChain interface for document transformation
- **TextSplitter**: Abstract base class defining the splitting interface
- **Specific splitters**: Concrete implementations for different content types and strategies

Key design patterns:

- **Inheritance-based**: Most splitters extend the abstract `TextSplitter` class
- **Factory methods**: Classes provide `from_*` methods for convenient initialization
- **Language support**: Extensive programming-language support via the `Language` enum
- **Document integration**: Seamless integration with LangChain's `Document` class for metadata preservation

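The pattern can be illustrated with a toy stand-in (the hypothetical `MiniSplitter` and `MiniCharacterSplitter` below are illustrations only, not the library's classes): the abstract base supplies shared document handling, and each concrete splitter only implements `split_text`.

```python
from abc import ABC, abstractmethod

class MiniSplitter(ABC):
    """Toy stand-in for TextSplitter (illustration only)."""

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def split_documents(self, docs: list[dict]) -> list[dict]:
        # Shared behavior every concrete splitter inherits: split each
        # document's text and carry its metadata onto every chunk.
        return [
            {"page_content": chunk, "metadata": dict(d["metadata"])}
            for d in docs
            for chunk in self.split_text(d["page_content"])
        ]

class MiniCharacterSplitter(MiniSplitter):
    def __init__(self, separator: str = "\n\n"):
        self.separator = separator

    def split_text(self, text: str) -> list[str]:
        return [p for p in text.split(self.separator) if p]

docs = [{"page_content": "part one\n\npart two", "metadata": {"page": 1}}]
split = MiniCharacterSplitter().split_documents(docs)
print(split)  # both chunks carry the original metadata
```

The real `TextSplitter` adds chunk sizing, overlap, and merging on top of this skeleton, but the division of labor is the same.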
## Capabilities

### Character-Based Text Splitting

Basic and advanced character-based text splitting strategies, from simple separator-based splitting to recursive multi-separator splitting with language-specific support.

```python { .api }
class CharacterTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, keep_separator: bool = True, is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
    @classmethod
    def from_language(cls, language: Language, **kwargs) -> "RecursiveCharacterTextSplitter": ...
    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...
```

[Character-Based Splitting](./character-splitting.md)
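The recursive strategy can be sketched in plain Python (the simplified, hypothetical `recursive_split` below is not the library's implementation, which also merges small pieces and handles separator retention): try the coarsest separator first and fall back to finer ones only for pieces that are still too large.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    # Split on the coarsest separator, then recurse with finer
    # separators on any piece still longer than chunk_size.
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    out: list[str] = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            out.extend(recursive_split(piece, rest, chunk_size))
        elif piece:
            out.append(piece)
    return out

text = "first paragraph.\n\nsecond paragraph. it has two sentences."
chunks = recursive_split(text, ["\n\n", ". ", " "], chunk_size=30)
print(chunks)
```

The paragraph that fits is kept whole; only the oversized one is broken at the next-finer boundary, which is why recursive splitting tends to preserve semantically coherent units.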

### Token-Based Text Splitting

Token-aware splitting using popular tokenizers, including OpenAI's tiktoken, Hugging Face transformers, and sentence-transformer models.

```python { .api }
class TokenTextSplitter(TextSplitter):
    def __init__(self, encoding_name: str = "gpt2", model_name: Optional[str] = None, allowed_special: Union[Literal["all"], set[str]] = set(), disallowed_special: Union[Literal["all"], Collection[str]] = "all", **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class SentenceTransformersTokenTextSplitter(TextSplitter):
    def __init__(self, chunk_overlap: int = 50, model_name: str = "sentence-transformers/all-mpnet-base-v2", tokens_per_chunk: Optional[int] = None, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
    def count_tokens(self, text: str) -> int: ...
```

[Token-Based Splitting](./token-splitting.md)
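The windowing logic behind token-based splitting can be sketched without a real tokenizer (the hypothetical `split_on_tokens` below uses whitespace words as stand-in tokens; the actual classes delegate tokenization to tiktoken or a Hugging Face tokenizer):

```python
def split_on_tokens(tokens: list[str], tokens_per_chunk: int, overlap: int) -> list[list[str]]:
    # Emit fixed-size token windows, each overlapping the previous
    # window by `overlap` tokens.
    chunks = []
    step = tokens_per_chunk - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks

tokens = "one two three four five six seven".split()
windows = split_on_tokens(tokens, tokens_per_chunk=4, overlap=2)
print(windows)
```

Counting in tokens rather than characters is what keeps chunks within a model's context budget regardless of how verbose the text is per token.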

### Document Structure-Aware Splitting

Specialized splitters that understand and preserve document structure for HTML, Markdown, JSON, and LaTeX documents while maintaining semantic context.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False): ...
    def split_text(self, text: str) -> list[Document]: ...
    def split_text_from_url(self, url: str, timeout: int = 10, **kwargs) -> list[Document]: ...

class HTMLSectionSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], **kwargs: Any): ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    def split_text(self, text: str) -> list[Document]: ...

class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(self, headers_to_split_on: list[tuple[str, str]], *, max_chunk_size: int = 1000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[Document]: ...
    def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> list[Document]: ...

class MarkdownHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_line: bool = False, strip_headers: bool = True, custom_header_patterns: Optional[dict[int, str]] = None): ...
    def split_text(self, text: str) -> list[Document]: ...

class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...

class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(self, headers_to_split_on: Optional[list[tuple[str, str]]] = None, return_each_line: bool = False, strip_headers: bool = True): ...
    def split_text(self, text: str) -> list[Document]: ...

class RecursiveJsonSplitter:
    def __init__(self, max_chunk_size: int = 2000, min_chunk_size: Optional[int] = None): ...
    def split_json(self, json_data: dict, convert_lists: bool = False) -> list[dict]: ...
    def split_text(self, json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]: ...
```

[Document Structure Splitting](./document-structure.md)
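The core idea of header-based splitting can be sketched in a few lines (the hypothetical `split_on_headers` below is far simpler than `MarkdownHeaderTextSplitter`, which also handles header nesting resets, code fences, and stripping options): each chunk carries the headers above it as metadata.

```python
def split_on_headers(md: str, headers: dict[str, str]) -> list[dict]:
    # Accumulate body lines; on each header, flush the current chunk
    # and record the header text under its metadata key.
    chunks: list[dict] = []
    meta: dict[str, str] = {}
    lines: list[str] = []

    def flush():
        if lines:
            chunks.append({"page_content": "\n".join(lines).strip(), "metadata": dict(meta)})
            lines.clear()

    for line in md.splitlines():
        for prefix, name in headers.items():
            if line.startswith(prefix + " "):
                flush()
                meta[name] = line[len(prefix) + 1:]
                break
        else:
            lines.append(line)
    flush()
    return chunks

md = "# Intro\nHello.\n## Detail\nMore text."
sections = split_on_headers(md, {"#": "h1", "##": "h2"})
print(sections)
```

Keeping the header path in metadata is what lets downstream retrieval answer "which section did this chunk come from?" without re-parsing the document.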

### Code-Aware Text Splitting

Programming-language-aware splitters that understand code syntax and structure for Python, JavaScript/TypeScript frameworks, and other programming languages.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...

class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, chunk_size: int = 2000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...
```

[Code-Aware Splitting](./code-splitting.md)
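Under the hood, a code-aware splitter is a recursive splitter seeded with language-specific separators. As a sketch (the `PYTHON_SEPARATORS` list below only approximates what `from_language(Language.PYTHON)` would seed, and `first_effective_separator` is a hypothetical helper mimicking how the recursive splitter picks its separator):

```python
# Coarse structural boundaries first, character-level splitting last.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

def first_effective_separator(text: str, separators: list[str]) -> str:
    # Pick the coarsest separator actually present in the text,
    # as the recursive splitter does at each level.
    for sep in separators:
        if sep and sep in text:
            return sep
    return ""

source = "class A:\n    pass\n\nclass B:\n    pass"
sep = first_effective_separator(source, PYTHON_SEPARATORS)
print(sep)  # "\nclass " is chosen, so chunks break at class boundaries
```

Because class and function boundaries outrank blank lines and spaces, definitions tend to stay intact inside a chunk; the real splitter also keeps the matched separator attached (`keep_separator`) so the `class`/`def` keywords are not lost.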

### Natural Language Processing Splitters

NLP-powered text splitters using NLTK, spaCy, and KoNLPy for sentence-aware splitting, with support for multiple languages including Korean.

```python { .api }
class NLTKTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", language: str = "english", use_span_tokenize: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class SpacyTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", pipeline: str = "en_core_web_sm", max_length: int = 1000000, strip_whitespace: bool = True, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class KonlpyTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
```

[NLP-Based Splitting](./nlp-splitting.md)
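The sentence-aware strategy these splitters share can be sketched with a naive regex in place of a real sentence tokenizer (the hypothetical `naive_sentence_split` below; NLTK and spaCy do the sentence detection properly): tokenize into sentences, then pack whole sentences into chunks so no sentence is cut mid-way.

```python
import re

def naive_sentence_split(text: str, chunk_size: int = 80) -> list[str]:
    # Crude sentence boundary: whitespace after ., !, or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        # Start a new chunk rather than splitting a sentence.
        if current and len(current) + 1 + len(s) > chunk_size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. Second one follows! A third, longer sentence ends the demo?"
packed = naive_sentence_split(text, chunk_size=45)
print(packed)
```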

### Core Base Classes and Utilities

Core interfaces, enums, and utility functions that provide the foundation for all text splitting functionality.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200, length_function: Callable[[str], int] = len, keep_separator: Union[bool, Literal["start", "end"]] = False, add_start_index: bool = False, strip_whitespace: bool = True): ...
    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...
    def create_documents(self, texts: list[str], metadatas: Optional[list[dict[Any, Any]]] = None) -> list[Document]: ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"

def split_text_on_tokens(*, text: str, tokenizer: "Tokenizer") -> list[str]: ...
```

[Core Base Classes](./core-base.md)
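The behavior of `create_documents` with `add_start_index` can be sketched as follows (a simplified, hypothetical stand-alone function, not the base class's actual code, and it uses a plug-in `splitter` callable for brevity): each text is paired with its metadata dict, and each chunk optionally records its character offset.

```python
def create_documents(texts, metadatas=None, *, splitter, add_start_index=False):
    # Pair each text with its metadata, split it, and optionally
    # record each chunk's character offset in the source text.
    metadatas = metadatas or [{} for _ in texts]
    out_docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in splitter(text):
            m = dict(meta)
            if add_start_index:
                m["start_index"] = text.index(chunk)  # naive offset lookup
            out_docs.append({"page_content": chunk, "metadata": m})
    return out_docs

out_docs = create_documents(
    ["alpha beta", "gamma"],
    [{"source": "a.txt"}, {"source": "b.txt"}],
    splitter=str.split,  # toy splitter: whitespace words
    add_start_index=True,
)
print(out_docs)
```

The real base class is more careful (it tracks offsets incrementally so repeated chunks resolve correctly), but the metadata-fan-out shape is the same, which is why every splitter in the package produces `Document` objects with their source metadata intact.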