tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing.

- Workspace: tessl
- Visibility: Public
- Describes: pkg:pypi/langchain-text-splitters@0.3.x

To install, run:

npx @tessl/cli install tessl/pypi-langchain-text-splitters@0.3.0

# LangChain Text Splitters

LangChain Text Splitters provides text splitting utilities for breaking various types of documents into manageable chunks for processing by language models and other AI systems. The library offers specialized splitters for different content types and preserves document structure and context through intelligent chunking strategies.

## Package Information

- **Package Name**: langchain-text-splitters
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install langchain-text-splitters`

## Core Imports

```python
from langchain_text_splitters import (
    TextSplitter,
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
```

For specific splitter types:

```python
from langchain_text_splitters import (
    # HTML splitters
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
    # Markdown splitters
    MarkdownHeaderTextSplitter,
    MarkdownTextSplitter,
    ExperimentalMarkdownSyntaxTextSplitter,
    # Other specialized splitters
    RecursiveJsonSplitter,
    PythonCodeTextSplitter,
    NLTKTextSplitter,
    SpacyTextSplitter,
)
```

For type definitions:

```python
from langchain_text_splitters import (
    ElementType,
    HeaderType,
    LineType,
    Language,
)
```

## Basic Usage

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter with custom configuration
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# Split text into chunks
text = "Your long document text here..."
chunks = text_splitter.split_text(text)

# Create Document objects with metadata
documents = text_splitter.create_documents([text], [{"source": "example.txt"}])

# Split existing Document objects
existing_docs = [Document(page_content="Text content", metadata={"page": 1})]
split_docs = text_splitter.split_documents(existing_docs)
```

## Architecture

The package follows a well-defined inheritance hierarchy:

- **BaseDocumentTransformer**: Core LangChain interface for document transformation
- **TextSplitter**: Abstract base class defining the splitting interface
- **Specific splitters**: Concrete implementations for different content types and strategies

Key design patterns:

- **Inheritance-based**: Most splitters extend the abstract `TextSplitter` class
- **Factory methods**: Classes provide `from_*` methods for convenient initialization
- **Language support**: Extensive programming-language support via the `Language` enum
- **Document integration**: Seamless integration with LangChain's `Document` class for metadata preservation

## Capabilities

### Character-Based Text Splitting

Basic and advanced character-based splitting strategies, including simple separator-based splitting and recursive multi-separator splitting with language-specific support.

```python { .api }
class CharacterTextSplitter(TextSplitter):
    def __init__(self, separator: str = "\n\n", is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, keep_separator: bool = True, is_separator_regex: bool = False, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

    @classmethod
    def from_language(cls, language: Language, **kwargs) -> "RecursiveCharacterTextSplitter": ...

    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...
```

[Character-Based Splitting](./character-splitting.md)

### Token-Based Text Splitting

Token-aware splitting using popular tokenizers, including OpenAI's tiktoken, Hugging Face transformers, and sentence-transformer models.

```python { .api }
class TokenTextSplitter(TextSplitter):
    def __init__(self, encoding_name: str = "gpt2", model_name: Optional[str] = None, allowed_special: Union[Literal["all"], set[str]] = set(), disallowed_special: Union[Literal["all"], Collection[str]] = "all", **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class SentenceTransformersTokenTextSplitter(TextSplitter):
    def __init__(self, chunk_overlap: int = 50, model_name: str = "sentence-transformers/all-mpnet-base-v2", tokens_per_chunk: Optional[int] = None, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...
    def count_tokens(self, text: str) -> int: ...
```

[Token-Based Splitting](./token-splitting.md)

### Document Structure-Aware Splitting

Specialized splitters that understand and preserve document structure for HTML, Markdown, JSON, and LaTeX documents while maintaining semantic context.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False): ...
    def split_text(self, text: str) -> list[Document]: ...
    def split_text_from_url(self, url: str, timeout: int = 10, **kwargs) -> list[Document]: ...

class HTMLSectionSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], **kwargs: Any): ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    def split_text(self, text: str) -> list[Document]: ...

class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(self, headers_to_split_on: list[tuple[str, str]], *, max_chunk_size: int = 1000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[Document]: ...
    def transform_documents(self, documents: Sequence[Document], **kwargs: Any) -> list[Document]: ...

class MarkdownHeaderTextSplitter:
    def __init__(self, headers_to_split_on: list[tuple[str, str]], return_each_line: bool = False, strip_headers: bool = True, custom_header_patterns: Optional[dict[int, str]] = None): ...
    def split_text(self, text: str) -> list[Document]: ...

class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...

class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(self, headers_to_split_on: Optional[list[tuple[str, str]]] = None, return_each_line: bool = False, strip_headers: bool = True): ...
    def split_text(self, text: str) -> list[Document]: ...

class RecursiveJsonSplitter:
    def __init__(self, max_chunk_size: int = 2000, min_chunk_size: Optional[int] = None): ...
    def split_json(self, json_data: dict, convert_lists: bool = False) -> list[dict]: ...
    def split_text(self, json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]: ...
```

[Document Structure Splitting](./document-structure.md)

### Code-Aware Text Splitting

Programming-language-aware splitters that understand code syntax and structure for Python, JavaScript/TypeScript frameworks, and other languages.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...

class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, separators: Optional[list[str]] = None, chunk_size: int = 2000, chunk_overlap: int = 0, **kwargs): ...
    def split_text(self, text: str) -> list[str]: ...

class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs): ...
```

[Code-Aware Splitting](./code-splitting.md)

187

188

### Natural Language Processing Splitters

189

190

NLP-powered text splitters using NLTK, spaCy, and Konlpy for sentence-aware splitting with support for multiple languages including Korean.

191

192

```python { .api }

193

class NLTKTextSplitter(TextSplitter):

194

def __init__(self, separator: str = "\n\n", language: str = "english", use_span_tokenize: bool = False, **kwargs): ...

195

def split_text(self, text: str) -> list[str]: ...

196

197

class SpacyTextSplitter(TextSplitter):

198

def __init__(self, separator: str = "\n\n", pipeline: str = "en_core_web_sm", max_length: int = 1000000, strip_whitespace: bool = True, **kwargs): ...

199

def split_text(self, text: str) -> list[str]: ...

200

201

class KonlpyTextSplitter(TextSplitter):

202

def __init__(self, separator: str = "\n\n", **kwargs): ...

203

def split_text(self, text: str) -> list[str]: ...

204

```

205

206

[NLP-Based Splitting](./nlp-splitting.md)

### Core Base Classes and Utilities

Core interfaces, enums, and utility functions that provide the foundation for all text splitting functionality.

```python { .api }
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200, length_function: Callable[[str], int] = len, keep_separator: Union[bool, Literal["start", "end"]] = False, add_start_index: bool = False, strip_whitespace: bool = True): ...

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def create_documents(self, texts: list[str], metadatas: Optional[list[dict[Any, Any]]] = None) -> list[Document]: ...
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"

def split_text_on_tokens(*, text: str, tokenizer: "Tokenizer") -> list[str]: ...
```

[Core Base Classes](./core-base.md)