# Character-Based Text Splitting

Character-based splitting provides fundamental text segmentation based on specific character separators. This includes simple separator-based splitting and advanced recursive splitting strategies that try multiple separators in order of preference.

## Capabilities

### Basic Character Splitting

Simple text splitting based on a single separator string or regex pattern.

```python { .api }
class CharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: String or regex pattern to split on (default: `"\n\n"`)
- `is_separator_regex`: Whether `separator` is interpreted as a regex rather than a literal string (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`, such as `chunk_size`

**Usage:**

```python
from langchain_text_splitters import CharacterTextSplitter

# Split on double newlines.
# Pieces shorter than chunk_size are merged back together,
# so this short input comes back as a single chunk.
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)
chunks = splitter.split_text("Paragraph 1\n\nParagraph 2\n\nParagraph 3")

# Split using regex
regex_splitter = CharacterTextSplitter(
    separator=r"\s+",  # Split on any run of whitespace
    is_separator_regex=True,
    chunk_size=500
)
chunks = regex_splitter.split_text("Word1 Word2 Word3\tWord4\nWord5")
```
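
The difference between a literal separator and a regex separator can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: `split_on_separator` is a hypothetical helper that mirrors the idea of escaping the separator unless `is_separator_regex` is set.

```python
import re


def split_on_separator(
    text: str, separator: str, is_separator_regex: bool = False
) -> list[str]:
    """Split text on a separator, escaping it unless it is meant as a regex."""
    pattern = separator if is_separator_regex else re.escape(separator)
    # Drop empty pieces produced by leading, trailing, or adjacent separators.
    return [piece for piece in re.split(pattern, text) if piece]


text = "Word1 Word2  Word3\tWord4\nWord5"

# A literal space only splits on spaces; the tab and newline stay in place.
literal_pieces = split_on_separator(text, " ")

# The regex \s+ splits on any run of whitespace.
regex_pieces = split_on_separator(text, r"\s+", is_separator_regex=True)

# A literal "." splits on dots, while the regex "." matches every character.
dot_literal = split_on_separator("a.b.c", ".")
dot_regex = split_on_separator("a.b.c", ".", is_separator_regex=True)
```

The last pair shows why the flag matters: a separator containing regex metacharacters behaves very differently once it is interpreted as a pattern.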

### Recursive Character Splitting

Advanced splitting that tries multiple separators in order of preference, recursively splitting chunks that are still too large.

```python { .api }
class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        keep_separator: Union[bool, Literal["start", "end"]] = True,
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

    @classmethod
    def from_language(
        cls,
        language: Language,
        **kwargs: Any
    ) -> "RecursiveCharacterTextSplitter": ...

    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...
```

**Parameters:**
- `separators`: List of separators to try in order (default: `None`, which falls back to `["\n\n", "\n", " ", ""]`)
- `keep_separator`: Whether to keep the separator in the output and, if so, whether to attach it to the `"start"` or `"end"` of a piece (default: `True`)
- `is_separator_regex`: Whether separators should be treated as regex (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Class Methods:**
- `from_language()`: Create a splitter preconfigured with the separators for a specific language
- `get_separators_for_language()`: Get the separator list for a specific language
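
To make the `"start"` and `"end"` placements of `keep_separator` concrete, here is a small standard-library sketch. It is an illustration under assumptions, not the library's code: `split_keeping_separator` is a hypothetical helper, and it only handles literal separators.

```python
import re


def split_keeping_separator(text: str, separator: str, keep: str = "start") -> list[str]:
    """Split on a literal separator, attaching it to the start or end of each piece."""
    # A capturing group makes re.split return the separators too:
    # [piece, sep, piece, sep, ..., piece]
    parts = re.split(f"({re.escape(separator)})", text)
    if keep == "end":
        # Pair each piece with the separator that follows it.
        pieces = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        pieces.append(parts[-1])
    elif keep == "start":
        # Pair each separator with the piece that follows it.
        pieces = [parts[0]]
        pieces += [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
    else:
        raise ValueError("keep must be 'start' or 'end'")
    # Drop empty pieces, e.g. when the text begins or ends with the separator.
    return [piece for piece in pieces if piece]


at_start = split_keeping_separator("a\nb\nc", "\n", keep="start")
at_end = split_keeping_separator("a\nb\nc", "\n", keep="end")
```

In both cases joining the pieces reproduces the original text, which is the practical reason to keep separators when the exact source must be recoverable.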

**Usage:**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text = "Long document with multiple paragraphs and sections..."
chunks = splitter.split_text(text)

# Language-specific splitting for Python code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=100
)
python_code = """
def function1():
    pass

class MyClass:
    def method(self):
        return "result"
"""
code_chunks = python_splitter.split_text(python_code)

# Custom separators
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["###", "##", "#", "\n\n", "\n", " ", ""],
    chunk_size=500,
    keep_separator=True
)

# Get separators for different languages
python_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
js_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)
```
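
The recursive strategy itself (try the first separator, then recurse into oversized pieces with the remaining ones) can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `recursive_split` is a hypothetical function, and it deliberately omits the merging of small pieces, chunk overlap, and separator retention.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Return pieces of at most chunk_size characters, preferring earlier separators."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator that occurs in the text; "" means "cut anywhere".
    for index, separator in enumerate(separators):
        if separator == "":
            break
        if separator in text:
            remaining = separators[index + 1:]
            pieces: list[str] = []
            for part in text.split(separator):
                # Oversized parts are split again with the finer separators.
                pieces.extend(recursive_split(part, remaining, chunk_size))
            return pieces

    # No usable separator left: fall back to fixed-width slices.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


default_separators = ["\n\n", "\n", " ", ""]
sample = "alpha beta\n\ngamma delta epsilon\nzeta"
pieces = recursive_split(sample, default_separators, chunk_size=12)
no_separator_pieces = recursive_split("abcdefghij", default_separators, chunk_size=4)
```

Here the first paragraph fits in one piece, while the second is broken down by newline and then by space because it exceeds the 12-character limit.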

## Language Support

The `Language` enum supports the following programming and markup languages, each with its own separator list:

- **CPP, C**: C/C++ code splitting
- **CSHARP**: C# code splitting
- **GO**: Go code splitting
- **JAVA, KOTLIN, SCALA**: JVM language splitting
- **JS, TS**: JavaScript/TypeScript splitting
- **PHP**: PHP code splitting
- **PROTO**: Protocol Buffer definition splitting
- **PYTHON**: Python code splitting
- **RST**: reStructuredText splitting
- **RUBY**: Ruby code splitting
- **RUST**: Rust code splitting
- **SWIFT**: Swift code splitting
- **MARKDOWN**: Markdown document splitting
- **LATEX**: LaTeX document splitting
- **HTML**: HTML document splitting
- **SOL**: Solidity smart contract splitting
- **COBOL**: COBOL code splitting
- **LUA**: Lua script splitting
- **PERL**: Perl script splitting
- **HASKELL**: Haskell code splitting
- **ELIXIR**: Elixir code splitting
- **POWERSHELL**: PowerShell script splitting
- **VISUALBASIC6**: Visual Basic 6 code splitting

Each language's separator list follows the syntax and structure of that language, so splits tend to fall on structural boundaries (such as class and function definitions, or headings in markup) before falling back to generic separators.
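
As an illustration of why syntax-aware separators help, the sketch below splits Python source just before top-level `class` and `def` statements using only the standard library. The boundary patterns are assumptions chosen for this example, not the separator list the library ships, and `split_at_definitions` is a hypothetical helper.

```python
import re

# Hypothetical boundary patterns in the spirit of a Python-specific separator
# list; the library's actual list may differ.
PYTHON_LIKE_BOUNDARIES = [r"\nclass ", r"\ndef "]


def split_at_definitions(code: str) -> list[str]:
    """Split source code right before each top-level class or function definition."""
    # A zero-width lookahead splits *before* the boundary, so the keyword
    # stays attached to the block it introduces.
    pattern = "(?=" + "|".join(PYTHON_LIKE_BOUNDARIES) + ")"
    return [block.strip("\n") for block in re.split(pattern, code) if block.strip()]


source = (
    "import os\n"
    "\n"
    "def function1():\n"
    "    pass\n"
    "\n"
    "class MyClass:\n"
    "    def method(self):\n"
    "        return 'result'\n"
)
blocks = split_at_definitions(source)
```

The indented `def method` is not treated as a boundary, so the class and its method stay in the same block.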

## Best Practices

1. **Choose appropriate separators**: Use natural break points such as paragraphs (`\n\n`) for text, or language-specific patterns for code
2. **Configure chunk overlap**: Set a moderate overlap (10-20% of chunk size) to maintain context across chunks
3. **Use language-specific splitting**: For code, use the `from_language()` method so that chunks break at syntactic boundaries
4. **Consider regex patterns**: Set `is_separator_regex=True` when a literal separator cannot express the split points you need
5. **Test chunk sizes**: Validate that the resulting chunks fit within your model's context window, keeping in mind that `chunk_size` is measured by `length_function` (characters when it is `len`), not model tokens
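
The overlap and chunk-size recommendations above can be checked with a few lines of plain Python. This is a minimal sketch under assumptions: `overlap_for` and `oversized_chunks` are hypothetical helpers, and the 15% default is simply the midpoint of the suggested 10-20% range.

```python
from typing import Callable


def overlap_for(chunk_size: int, fraction: float = 0.15) -> int:
    """Return an overlap that is a given fraction of the chunk size."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return round(chunk_size * fraction)


def oversized_chunks(
    chunks: list[str],
    limit: int,
    length_function: Callable[[str], int] = len,
) -> list[int]:
    """Return the indexes of chunks whose measured length exceeds the limit."""
    return [i for i, chunk in enumerate(chunks) if length_function(chunk) > limit]


default_overlap = overlap_for(1000)       # 15% of the chunk size
upper_overlap = overlap_for(1000, 0.2)    # upper end of the suggested range
too_long = oversized_chunks(["abc", "abcdef", "a"], limit=4)
```

Passing a token-counting function as `length_function` (an assumption about your setup) lets the same check run against a model's token limit instead of a character count.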