# Code-Aware Text Splitting


Code-aware text splitting segments source code using separator lists tuned to each programming language's syntax and structure. These splitters aim to keep code intact by breaking at logical boundaries such as function definitions, class declarations, and block structures.


## Capabilities


### Python Code Splitting


Specialized splitting for Python source code that respects Python syntax and structure.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import PythonCodeTextSplitter

python_splitter = PythonCodeTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

python_code = """
import os
import sys

def calculate_sum(a, b):
    '''Calculate the sum of two numbers.'''
    return a + b

class Calculator:
    def __init__(self):
        self.history = []

    def add(self, x, y):
        result = x + y
        self.history.append(f"{x} + {y} = {result}")
        return result

    def get_history(self):
        return self.history

if __name__ == "__main__":
    calc = Calculator()
    print(calc.add(5, 3))
"""

chunks = python_splitter.split_text(python_code)
```

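The Python splitter tries separators in order, from the largest structural unit down to single characters:
- Class definitions (`\nclass `)
- Function definitions (`\ndef `, `\n\tdef `)
- Standard separators (`\n\n`, `\n`, a single space, and finally the empty string)

Call `RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)` to inspect the exact list in your installed version.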
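
The ordered-separator idea can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: `recursive_split` and `PY_SEPARATORS` are hypothetical names, and the sketch skips the step where small pieces are merged back together and overlap is applied.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing structural left to split on: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator to the start of every piece after the first,
    # so a chunk still begins with "class ..." or "def ...".
    restored = [pieces[0]] + [sep + piece for piece in pieces[1:]]

    chunks: list[str] = []
    for piece in restored:
        if not piece.strip():
            continue  # drop whitespace-only pieces
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


# Illustrative separator list, ordered from coarse to fine (an assumption for this sketch).
PY_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

source = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "def b():\n    return 2\n\n"
    "class C:\n    pass\n"
)
chunks = recursive_split(source, PY_SEPARATORS, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Because coarse separators are tried first, each function and the class land in their own chunk before the splitter ever needs to break on blank lines or spaces.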


### JavaScript/TypeScript Framework Splitting


Specialized splitting for React/JSX, Vue, and Svelte code that understands component boundaries and framework-specific syntax.


```python { .api }
class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        chunk_size: int = 2000,
        chunk_overlap: int = 0,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separators`: Custom separator list (default: framework-optimized separators)
- `chunk_size`: Maximum chunk size (default: `2000`)
- `chunk_overlap`: Overlap between chunks (default: `0`)

**Usage:**

```python
from langchain_text_splitters import JSFrameworkTextSplitter

jsx_splitter = JSFrameworkTextSplitter(
    chunk_size=1500,
    chunk_overlap=100
)

react_code = """
import React, { useState, useEffect } from 'react';

const UserProfile = ({ userId }) => {
  const [user, setUser] = useState(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    fetchUser(userId)
      .then(userData => {
        setUser(userData);
        setLoading(false);
      })
      .catch(error => {
        console.error('Error fetching user:', error);
        setLoading(false);
      });
  }, [userId]);

  if (loading) {
    return <LoadingSpinner />;
  }

  return (
    <div className="user-profile">
      <h1>{user.name}</h1>
      <p>{user.email}</p>
    </div>
  );
};

export default UserProfile;
"""

chunks = jsx_splitter.split_text(react_code)
```

The JSX splitter recognizes:
- Component definitions and exports
- Hook declarations (`useState`, `useEffect`, etc.)
- JSX elements and fragments
- Import/export statements
- Function and arrow function boundaries
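
One plausible way to find element boundaries is to scan the source for opening tags and turn each distinct tag into a separator. The sketch below assumes that approach; `component_tag_separators` is a hypothetical helper, and the library's own separator derivation may differ.

```python
import re


def component_tag_separators(source: str) -> list[str]:
    """Return one '<tag' separator per distinct opening tag, in order of first appearance."""
    # Closing tags such as </div> do not match: '/' is not a letter.
    tags = re.findall(r"<\s*([A-Za-z][A-Za-z0-9]*)", source)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [f"<{tag}" for tag in dict.fromkeys(tags)]


snippet = """
return (
  <div className="user-profile">
    <h1>{user.name}</h1>
    <p>{user.email}</p>
  </div>
);
"""

separators = component_tag_separators(snippet)
print(separators)
```

A pattern this simple also matches comparisons such as `a < b` and TypeScript generics, so a real splitter would need extra filtering before trusting the result.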


### LaTeX Document Splitting


Specialized splitting for LaTeX documents that respects LaTeX structure and formatting commands.

```python { .api }
class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import LatexTextSplitter

latex_splitter = LatexTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

latex_document = r"""
\documentclass{article}
\usepackage{amsmath}

\title{Mathematical Analysis}
\author{Author Name}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
This document presents a mathematical analysis of...

\subsection{Preliminaries}
Let us define the following concepts:

\begin{definition}
A function $f: \mathbb{R} \to \mathbb{R}$ is continuous at point $a$ if...
\end{definition}

\begin{theorem}
If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then...
\end{theorem}

\section{Main Results}
The main theorem can be stated as follows:

\begin{align}
\int_a^b f(x) dx &= F(b) - F(a) \\
&= \lim_{n \to \infty} \sum_{i=1}^n f(x_i) \Delta x
\end{align}

\end{document}
"""

chunks = latex_splitter.split_text(latex_document)
```

The LaTeX splitter uses separators that respect:
- Document structure (`\section`, `\subsection`, `\chapter`)
- Environment boundaries (`\begin{}`, `\end{}`)
- Mathematical expressions and equations
- Standard paragraph breaks
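
To make the document-structure boundary concrete, here is a small standard-library sketch that cuts a LaTeX string in front of each sectioning command. It only illustrates the idea; `split_on_sections` and `SECTION_BOUNDARY` are hypothetical names, and the sketch ignores chunk size, overlap, and the other separator levels.

```python
import re

# Zero-width lookahead: split *before* the command so each part keeps its heading.
SECTION_BOUNDARY = re.compile(r"(?=\\(?:chapter|section|subsection)\{)")


def split_on_sections(latex: str) -> list[str]:
    """Split LaTeX source in front of each \\chapter, \\section, or \\subsection command."""
    parts = SECTION_BOUNDARY.split(latex)
    return [part for part in parts if part.strip()]


sample = r"""\section{Introduction}
Intro text.

\subsection{Preliminaries}
Definitions go here.

\section{Main Results}
The main theorem.
"""

parts = split_on_sections(sample)
for part in parts:
    print(repr(part))
```

Splitting on a lookahead rather than on the command itself keeps every heading attached to the text that follows it, which is usually what retrieval over LaTeX sources needs.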


### Language-Specific Splitting via RecursiveCharacterTextSplitter


For other programming languages, use the `RecursiveCharacterTextSplitter.from_language()` method with the appropriate `Language` enum value.

```python { .api }
# Available through RecursiveCharacterTextSplitter
@classmethod
def from_language(
    cls,
    language: Language,
    **kwargs: Any
) -> "RecursiveCharacterTextSplitter": ...

@staticmethod
def get_separators_for_language(language: Language) -> list[str]: ...
```

**Usage:**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Java code splitting
java_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JAVA,
    chunk_size=2000,
    chunk_overlap=200
)

java_code = """
public class Calculator {
    private double result;

    public Calculator() {
        this.result = 0.0;
    }

    public double add(double a, double b) {
        result = a + b;
        return result;
    }

    public static void main(String[] args) {
        Calculator calc = new Calculator();
        System.out.println(calc.add(5.0, 3.0));
    }
}
"""

java_chunks = java_splitter.split_text(java_code)

# C++ code splitting
cpp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CPP,
    chunk_size=1500
)

# Get separators for inspection
cpp_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.CPP)
```

## Supported Languages


The `Language` enum provides optimized separators for:


### Popular Languages
- **PYTHON**: Python source code
- **JS**, **TS**: JavaScript and TypeScript
- **JAVA**: Java source code
- **CPP**, **C**: C++ and C source code
- **CSHARP**: C# source code
- **GO**: Go source code
- **RUST**: Rust source code
- **RUBY**: Ruby source code
- **PHP**: PHP source code

### Additional Languages
- **KOTLIN**: Kotlin source code
- **SCALA**: Scala source code
- **SWIFT**: Swift source code
- **PROTO**: Protocol Buffer definitions
- **SOL**: Solidity smart contracts
- **COBOL**: COBOL source code
- **LUA**: Lua scripts
- **PERL**: Perl scripts
- **HASKELL**: Haskell source code
- **ELIXIR**: Elixir source code
- **POWERSHELL**: PowerShell scripts
- **VISUALBASIC6**: Visual Basic 6 source code

### Document Formats
- **MARKDOWN**: Markdown documents
- **LATEX**: LaTeX documents
- **HTML**: HTML documents
- **RST**: reStructuredText documents
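
When splitting a mixed repository, it can help to pick the enum member from the file extension. The sketch below is one way to do that with the standard library only; `EXTENSION_TO_LANGUAGE` and `language_name_for` are hypothetical names, and the mapping covers only some of the members listed above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to the *name* of a Language member.
EXTENSION_TO_LANGUAGE = {
    ".py": "PYTHON",
    ".js": "JS",
    ".ts": "TS",
    ".java": "JAVA",
    ".cpp": "CPP",
    ".c": "C",
    ".cs": "CSHARP",
    ".go": "GO",
    ".rs": "RUST",
    ".rb": "RUBY",
    ".php": "PHP",
    ".md": "MARKDOWN",
    ".tex": "LATEX",
    ".html": "HTML",
    ".rst": "RST",
}


def language_name_for(path: str) -> Optional[str]:
    """Return the Language member name for a file path, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


print(language_name_for("src/Main.java"))
print(language_name_for("notes/README.MD"))
print(language_name_for("data.bin"))
```

With the library installed, the returned name can be resolved through standard enum lookup (`Language[name]`) and passed to `RecursiveCharacterTextSplitter.from_language()`; files that return `None` can fall back to a plain character splitter.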


## Best Practices


1. **Use language-specific splitters**: Always use the appropriate language splitter for better code structure preservation
2. **Configure appropriate chunk sizes**: Balance between preserving complete functions/classes and staying within token limits
3. **Consider minimal overlap**: Code chunks often need less overlap than prose text
4. **Test with your codebase**: Different coding styles may require different chunk sizes
5. **Preserve imports**: Ensure import/include statements are properly handled in your chunking strategy
6. **Maintain syntax validity**: Verify that code chunks maintain valid syntax boundaries
7. **Handle comments appropriately**: Consider how code comments should be distributed across chunks
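
For Python sources, the syntax-validity check in practice 6 can be approximated with the standard `ast` module. This is a rough sketch under stated assumptions: `invalid_python_chunks` is a hypothetical helper, and the sample chunks are written by hand rather than produced by a splitter.

```python
import ast
import textwrap


def invalid_python_chunks(chunks: list[str]) -> list[int]:
    """Return the indexes of chunks that do not parse as standalone Python."""
    bad = []
    for index, chunk in enumerate(chunks):
        try:
            # Dedent first so an indented method body can still be parsed on its own.
            ast.parse(textwrap.dedent(chunk))
        except SyntaxError:  # IndentationError is a subclass of SyntaxError
            bad.append(index)
    return bad


sample_chunks = [
    "def add(a, b):\n    return a + b\n",
    "    def get_history(self):\n        return self.history\n",
    "class Calculator:\n    def __init__(self",  # cut off mid-signature
]

print(invalid_python_chunks(sample_chunks))
```

A high share of unparseable chunks suggests the chunk size is too small for the codebase's typical function length, which ties back to practices 2 and 4.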