# Character-Based Text Splitting

Character-based splitting segments text at specific separator strings. It covers simple single-separator splitting as well as recursive splitting, which tries multiple separators in order of preference.

## Capabilities

### Basic Character Splitting

Simple text splitting based on a single separator string or regex pattern.

```python { .api }
class CharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separator`: String or regex pattern to split on (default: `"\n\n"`)
- `is_separator_regex`: Whether separator should be treated as regex (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import CharacterTextSplitter

# Split on double newlines
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)
chunks = splitter.split_text("Paragraph 1\n\nParagraph 2\n\nParagraph 3")

# Split using regex
regex_splitter = CharacterTextSplitter(
    separator=r"\s+",  # Split on any whitespace
    is_separator_regex=True,
    chunk_size=500
)
chunks = regex_splitter.split_text("Word1 Word2 Word3\tWord4\nWord5")
```
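
As a rough illustration of what `is_separator_regex` changes, the sketch below splits text with Python's standard `re` module: a literal separator is escaped before matching, while a regex separator is used as written. The helper `split_on_separator` is hypothetical and not part of `langchain_text_splitters`. It covers only the splitting step and ignores `chunk_size` entirely.

```python
import re


def split_on_separator(text: str, separator: str, is_separator_regex: bool = False) -> list[str]:
    """Split text on one separator, dropping the separator and empty pieces.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # A literal separator is escaped so regex metacharacters match themselves.
    pattern = separator if is_separator_regex else re.escape(separator)
    return [piece for piece in re.split(pattern, text) if piece]


# Literal separator: split on blank lines.
paragraphs = split_on_separator("Paragraph 1\n\nParagraph 2", "\n\n")

# Regex separator: split on any run of whitespace.
words = split_on_separator("Word1 Word2\tWord3\nWord4", r"\s+", is_separator_regex=True)

# A literal "." is not treated as the regex wildcard.
fields = split_on_separator("a.b.c", ".")
```

Escaping the literal case is what keeps a separator such as `.` or `|` from being interpreted as a regex metacharacter.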

### Recursive Character Splitting

Advanced splitting that tries multiple separators in order of preference, recursively splitting chunks that are still too large.

```python { .api }
class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        keep_separator: Union[bool, Literal["start", "end"]] = True,
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

    @classmethod
    def from_language(
        cls,
        language: Language,
        **kwargs: Any
    ) -> "RecursiveCharacterTextSplitter": ...

    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...
```

**Parameters:**
- `separators`: List of separators to try in order (default: `None`, which falls back to `["\n\n", "\n", " ", ""]`)
- `keep_separator`: Whether to keep the separator in the output and, if so, where to place it: `True`, `"start"`, or `"end"` (default: `True`)
- `is_separator_regex`: Whether separators should be treated as regex (default: `False`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`
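
To show what the `"start"` and `"end"` placements mean in practice, here is a minimal sketch built on the standard `re` module. The helper `split_keeping_separator` is hypothetical and only mimics the idea; it does not reproduce the library's code, and treating `True` as equivalent to `"start"` is an assumption of this sketch.

```python
import re


def split_keeping_separator(text: str, separator: str, keep: str = "start") -> list[str]:
    """Split on a literal separator and attach it to a neighbouring piece.

    keep="start": the separator begins the piece that follows it.
    keep="end":   the separator ends the piece that precedes it.
    Hypothetical helper for illustration only.
    """
    # The capturing group makes re.split return the separators too:
    # [piece0, sep, piece1, sep, piece2, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    if keep == "end":
        # Pair each piece with the separator that comes after it.
        pieces = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        pieces.append(parts[-1])
    elif keep == "start":
        # Pair each separator with the piece that comes after it.
        pieces = [parts[0]]
        pieces += [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    else:
        raise ValueError("keep must be 'start' or 'end'")
    return [piece for piece in pieces if piece]


at_start = split_keeping_separator("a-b-c", "-", keep="start")
at_end = split_keeping_separator("a-b-c", "-", keep="end")
```

Keeping the separator at the start suits heading-like markers such as `#`, where the marker belongs to the text after it.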

**Class Methods:**
- `from_language()`: Create splitter optimized for specific programming language
- `get_separators_for_language()`: Get separator list for programming language

**Usage:**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text = "Long document with multiple paragraphs and sections..."
chunks = splitter.split_text(text)

# Language-specific splitting for Python code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=100
)
python_code = """
def function1():
    pass

class MyClass:
    def method(self):
        return "result"
"""
code_chunks = python_splitter.split_text(python_code)

# Custom separators
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["###", "##", "#", "\n\n", "\n", " ", ""],
    chunk_size=500,
    keep_separator=True
)

# Get separators for different languages
python_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
js_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)
```
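
The recursive strategy described in this section can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function `recursive_split` is hypothetical, it measures length with `len`, drops the separators, and leaves out `chunk_overlap`, `keep_separator`, and regex separators.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too large is
    split again with the remaining, finer separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, finer, chunk_size)

    chunks: list[str] = []
    current = ""
    for part in text.split(separator):
        if not part:
            continue
        # Greedily merge neighbouring parts while they still fit.
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo = recursive_split("aaaa bbbb\n\ncccc dddd eeee", ["\n\n", "\n", " ", ""], chunk_size=10)
hard_cut = recursive_split("abcdefghijkl", ["\n\n", "\n", " ", ""], chunk_size=5)
```

In `demo`, the first paragraph fits as a whole, while the second is too long and is split again on spaces. In `hard_cut`, no separator occurs in the text, so the empty-string fallback cuts at fixed positions.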

## Language Support

The `Language` enum supports the following programming languages with optimized separator patterns:

- **CPP, C**: C/C++ code splitting
- **CSHARP**: C# code splitting
- **GO**: Go code splitting
- **JAVA, KOTLIN, SCALA**: JVM language splitting
- **JS, TS**: JavaScript/TypeScript splitting
- **PHP**: PHP code splitting
- **PROTO**: Protocol Buffer definition splitting
- **PYTHON**: Python code splitting
- **RST**: reStructuredText splitting
- **RUBY**: Ruby code splitting
- **RUST**: Rust code splitting
- **SWIFT**: Swift code splitting
- **MARKDOWN**: Markdown document splitting
- **LATEX**: LaTeX document splitting
- **HTML**: HTML document splitting
- **SOL**: Solidity smart contract splitting
- **COBOL**: COBOL code splitting
- **LUA**: Lua script splitting
- **PERL**: Perl script splitting
- **HASKELL**: Haskell code splitting
- **ELIXIR**: Elixir code splitting
- **POWERSHELL**: PowerShell script splitting
- **VISUALBASIC6**: Visual Basic 6 code splitting

Each language has its own separator list that follows the language's syntax and structure, so splits tend to fall on structural boundaries rather than in the middle of a construct.
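
The idea of preferring structural boundaries can be sketched without the library. The separator list below is illustrative only, an assumption about what a Python-oriented list might look like, and `first_matching_separator` is a hypothetical helper. Call `get_separators_for_language()` to see the real values for a given language.

```python
# Illustrative separator list for Python-like code, ordered from coarse
# structural boundaries to fine ones. This is an assumption for the sketch,
# not the list returned by get_separators_for_language().
ILLUSTRATIVE_PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]


def first_matching_separator(text: str, separators: list[str]) -> str:
    """Return the first separator that occurs in text.

    The empty string is treated as a fallback that always matches.
    """
    for separator in separators:
        if separator == "" or separator in text:
            return separator
    raise ValueError("no separator matched; include '' as a final fallback")


module_source = "import os\n\nclass A:\n    pass\n\ndef helper():\n    return 1\n"
plain_snippet = "x = 1\ny = 2\n"

# Code with a class definition is split at the class boundary first.
coarse = first_matching_separator(module_source, ILLUSTRATIVE_PYTHON_SEPARATORS)

# Code without definitions or blank lines falls through to single newlines.
fine = first_matching_separator(plain_snippet, ILLUSTRATIVE_PYTHON_SEPARATORS)
```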

## Best Practices

1. **Choose appropriate separators**: Use natural break points such as paragraphs (`\n\n`) for text, or language-specific patterns for code
2. **Configure chunk overlap**: Set a reasonable overlap (10-20% of chunk size) to maintain context across chunks
3. **Use language-specific splitting**: For code, use the `from_language()` method for better results
4. **Consider regex patterns**: Use `is_separator_regex=True` for complex splitting patterns
5. **Test chunk sizes**: Validate that resulting chunks fit within your model's context window
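
Two of these practices, sizing the overlap and validating chunk sizes, are easy to check mechanically. The helpers below are hypothetical and independent of the library. They assume length is measured the same way as the splitter's `length_function`, which is `len` (characters) in the examples above; a token-based limit would need a token-counting function instead.

```python
from typing import Callable


def overlap_for(chunk_size: int, percent: int = 15) -> int:
    """Return an overlap equal to the given percentage of chunk_size."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range 0-99")
    return chunk_size * percent // 100


def oversized_chunks(
    chunks: list[str],
    limit: int,
    length_function: Callable[[str], int] = len,
) -> list[tuple[int, int]]:
    """Return (index, length) for every chunk longer than limit."""
    lengths = [length_function(chunk) for chunk in chunks]
    return [(i, n) for i, n in enumerate(lengths) if n > limit]


suggested_overlap = overlap_for(1000)          # 15% of 1000
problems = oversized_chunks(["short", "x" * 12, "ok"], limit=10)
```

An empty result from `oversized_chunks` means every chunk fits the limit. Any entries point at the chunks to re-split with a smaller `chunk_size` or finer separators.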