# Code-Aware Text Splitting

Code-aware text splitting segments source code along the syntax and structure of its programming language. These splitters preserve code integrity by preferring logical boundaries such as function definitions, class declarations, and block structures.

## Capabilities

### Python Code Splitting

Specialized splitting for Python source code that respects Python syntax and structure.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import PythonCodeTextSplitter

python_splitter = PythonCodeTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

python_code = """
import os
import sys

def calculate_sum(a, b):
    '''Calculate the sum of two numbers.'''
    return a + b

class Calculator:
    def __init__(self):
        self.history = []

    def add(self, x, y):
        result = x + y
        self.history.append(f"{x} + {y} = {result}")
        return result

    def get_history(self):
        return self.history

if __name__ == "__main__":
    calc = Calculator()
    print(calc.add(5, 3))
"""

chunks = python_splitter.split_text(python_code)
```

The Python splitter tries the following separators in order, moving to the next one only when a piece is still larger than `chunk_size`:
- Class definitions (`"\nclass "`)
- Function definitions (`"\ndef "`, `"\n\tdef "`)
- Standard separators (`"\n\n"`, `"\n"`, `" "`, `""`)
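
The priority order described above can be illustrated with a small standard-library sketch. It is not the library's implementation: `split_by_priority` is a hypothetical helper, and unlike the real splitter it neither merges small pieces back up to `chunk_size` nor applies `chunk_overlap`.

```python
def split_by_priority(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Recursively split text, trying higher-priority separators first."""
    # Small enough already: keep it (dropping whitespace-only pieces).
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep and sep in text:
            pieces = text.split(sep)
            # Re-attach the separator to every piece after the first,
            # so keywords such as "def " stay with the code they introduce.
            pieces = [pieces[0]] + [sep + p for p in pieces[1:]]
            chunks: list[str] = []
            for piece in pieces:
                # Oversized pieces are retried with lower-priority separators only.
                chunks.extend(split_by_priority(piece, separators[i + 1:], chunk_size))
            return chunks

    # No separator matched: fall back to fixed-width slices.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


sample = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
sample_chunks = split_by_priority(
    sample, ["\nclass ", "\ndef ", "\n\n", "\n", " "], chunk_size=30
)
```

Here each function ends up in its own chunk because `"\ndef "` is the highest-priority separator that occurs in the sample.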

### JavaScript/TypeScript Framework Splitting

Specialized splitting for React/JSX, Vue, and Svelte code that understands component boundaries and framework-specific syntax.

```python { .api }
class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        chunk_size: int = 2000,
        chunk_overlap: int = 0,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separators`: Custom separator list (default: framework-optimized separators)
- `chunk_size`: Maximum chunk size (default: `2000`)
- `chunk_overlap`: Overlap between chunks (default: `0`)

**Usage:**

```python
from langchain_text_splitters import JSFrameworkTextSplitter

jsx_splitter = JSFrameworkTextSplitter(
    chunk_size=1500,
    chunk_overlap=100
)

react_code = """
import React, { useState, useEffect } from 'react';

const UserProfile = ({ userId }) => {
  const [user, setUser] = useState(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    fetchUser(userId)
      .then(userData => {
        setUser(userData);
        setLoading(false);
      })
      .catch(error => {
        console.error('Error fetching user:', error);
        setLoading(false);
      });
  }, [userId]);

  if (loading) {
    return <LoadingSpinner />;
  }

  return (
    <div className="user-profile">
      <h1>{user.name}</h1>
      <p>{user.email}</p>
    </div>
  );
};

export default UserProfile;
"""

chunks = jsx_splitter.split_text(react_code)
```

The JSX splitter recognizes:
- Component definitions and exports
- Hook declarations (`useState`, `useEffect`, etc.)
- JSX elements and fragments
- Import/export statements
- Function and arrow function boundaries
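
One plausible way to find the element boundaries listed above is to scan the source for opening tags and turn each distinct tag into a candidate separator. The sketch below shows that idea with the standard library only. `component_separators` is a hypothetical helper, not part of `langchain_text_splitters`, and the regular expression is a simplification that can misfire on comparisons such as `a < b`.

```python
import re

# Matches an opening tag such as <div ...> or <LoadingSpinner />, capturing the tag name.
OPENING_TAG = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def component_separators(code: str) -> list[str]:
    """Return one "<Tag" separator per distinct opening tag, in order of first appearance."""
    seen: dict[str, None] = {}  # dict keeps insertion order and removes duplicates
    for tag in OPENING_TAG.findall(code):
        seen.setdefault(tag, None)
    return [f"<{tag}" for tag in seen]


snippet = (
    '<div className="user-profile">\n'
    "  <h1>{user.name}</h1>\n"
    "  <p>{user.email}</p>\n"
    "</div>"
)
jsx_separators = component_separators(snippet)
```

Closing tags such as `</div>` are skipped because the pattern requires a letter right after `<`, so only places where an element starts are treated as split points.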

### LaTeX Document Splitting

Specialized splitting for LaTeX documents that respects LaTeX structure and formatting commands.

```python { .api }
class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import LatexTextSplitter

latex_splitter = LatexTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

latex_document = r"""
\documentclass{article}
\usepackage{amsmath}

\title{Mathematical Analysis}
\author{Author Name}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
This document presents a mathematical analysis of...

\subsection{Preliminaries}
Let us define the following concepts:

\begin{definition}
A function $f: \mathbb{R} \to \mathbb{R}$ is continuous at point $a$ if...
\end{definition}

\begin{theorem}
If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then...
\end{theorem}

\section{Main Results}
The main theorem can be stated as follows:

\begin{align}
\int_a^b f(x) dx &= F(b) - F(a) \\
&= \lim_{n \to \infty} \sum_{i=1}^n f(x_i) \Delta x
\end{align}

\end{document}
"""

chunks = latex_splitter.split_text(latex_document)
```

The LaTeX splitter uses separators that respect:
- Document structure (`\section`, `\subsection`, `\chapter`)
- Environment boundaries (`\begin{}`, `\end{}`)
- Mathematical expressions and equations
- Standard paragraph breaks
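
To make the document-structure boundary concrete, the sketch below splits a LaTeX source in front of each sectioning command using a zero-width lookahead, so every command stays attached to the text it introduces. It uses only the standard library (Python 3.7 or later, which allows splitting on empty matches). `split_at_sections` is a hypothetical helper for illustration and covers just three commands.

```python
import re

# Zero-width lookahead: matches the position just before a sectioning command.
SECTION_BOUNDARY = re.compile(r"(?=\\(?:chapter|section|subsection)\{)")


def split_at_sections(source: str) -> list[str]:
    """Split LaTeX source before each \\chapter, \\section, or \\subsection command."""
    parts = SECTION_BOUNDARY.split(source)
    # Drop whitespace-only fragments.
    return [part for part in parts if part.strip()]


latex_sample = (
    "\\title{T}\n"
    "\\section{Intro}\nText.\n"
    "\\subsection{Prelim}\nMore.\n"
    "\\section{Results}\nEnd.\n"
)
latex_parts = split_at_sections(latex_sample)
```

The preamble before the first `\section` comes back as its own fragment, which is usually what you want when chunking a whole document.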

### Language-Specific Splitting via RecursiveCharacterTextSplitter

For other programming languages, use the `RecursiveCharacterTextSplitter.from_language()` method with the appropriate `Language` enum value.

```python { .api }
# Available through RecursiveCharacterTextSplitter
@classmethod
def from_language(
    cls,
    language: Language,
    **kwargs: Any
) -> "RecursiveCharacterTextSplitter": ...

@staticmethod
def get_separators_for_language(language: Language) -> list[str]: ...
```

**Usage:**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Java code splitting
java_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JAVA,
    chunk_size=2000,
    chunk_overlap=200
)

java_code = """
public class Calculator {
    private double result;

    public Calculator() {
        this.result = 0.0;
    }

    public double add(double a, double b) {
        result = a + b;
        return result;
    }

    public static void main(String[] args) {
        Calculator calc = new Calculator();
        System.out.println(calc.add(5.0, 3.0));
    }
}
"""

java_chunks = java_splitter.split_text(java_code)

# C++ code splitting
cpp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CPP,
    chunk_size=1500
)

# Get separators for inspection
cpp_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.CPP)
```

## Supported Languages

The `Language` enum provides optimized separators for:

### Popular Languages
- **PYTHON**: Python source code
- **JS**, **TS**: JavaScript and TypeScript
- **JAVA**: Java source code
- **CPP**, **C**: C and C++ source code
- **CSHARP**: C# source code
- **GO**: Go source code
- **RUST**: Rust source code
- **RUBY**: Ruby source code
- **PHP**: PHP source code

### Additional Languages
- **KOTLIN**: Kotlin source code
- **SCALA**: Scala source code
- **SWIFT**: Swift source code
- **PROTO**: Protocol Buffer definitions
- **SOL**: Solidity smart contracts
- **COBOL**: COBOL source code
- **LUA**: Lua scripts
- **PERL**: Perl scripts
- **HASKELL**: Haskell source code
- **ELIXIR**: Elixir source code
- **POWERSHELL**: PowerShell scripts
- **VISUALBASIC6**: Visual Basic 6 source code

### Document Formats
- **MARKDOWN**: Markdown documents
- **LATEX**: LaTeX documents
- **HTML**: HTML documents
- **RST**: reStructuredText documents
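
When chunking a mixed repository, the enum member usually has to be chosen per file. The sketch below maps file suffixes to the member names listed above. The mapping is a hypothetical, deliberately partial assumption rather than something the library ships, and the helper returns only the name string so it runs without the library installed.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, partial mapping from file suffix to a `Language` member name.
SUFFIX_TO_LANGUAGE_NAME = {
    ".py": "PYTHON",
    ".js": "JS",
    ".ts": "TS",
    ".java": "JAVA",
    ".cpp": "CPP",
    ".c": "C",
    ".cs": "CSHARP",
    ".go": "GO",
    ".rs": "RUST",
    ".rb": "RUBY",
    ".php": "PHP",
    ".md": "MARKDOWN",
    ".tex": "LATEX",
    ".html": "HTML",
    ".rst": "RST",
}


def language_name_for(path: str) -> Optional[str]:
    """Return the `Language` member name for a file path, or None if unmapped."""
    return SUFFIX_TO_LANGUAGE_NAME.get(Path(path).suffix.lower())
```

With the library installed, a returned name can be converted with `Language[name]` and passed to `RecursiveCharacterTextSplitter.from_language()`. Files that return `None` could fall back to a plain `RecursiveCharacterTextSplitter`.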

## Best Practices

1. **Use language-specific splitters**: Always use the appropriate language splitter for better code structure preservation
2. **Configure appropriate chunk sizes**: Balance between preserving complete functions/classes and staying within token limits
3. **Consider minimal overlap**: Code chunks often need less overlap than prose text
4. **Test with your codebase**: Different coding styles may require different chunk sizes
5. **Preserve imports**: Ensure import/include statements are properly handled in your chunking strategy
6. **Maintain syntax validity**: Verify that code chunks maintain valid syntax boundaries
7. **Handle comments appropriately**: Consider how code comments should be distributed across chunks
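
For Python sources, the syntax-validity check in item 6 can be approximated with the standard `ast` module. This is a rough sketch with hypothetical helper names: it reports which chunks fail to parse on their own. A chunk that starts inside an indented block (for example, a lone method body) will be flagged even though it is a sensible fragment, so treat the result as a hint for tuning `chunk_size`, not a hard failure.

```python
import ast


def parses_as_python(chunk: str) -> bool:
    """Return True if the chunk is syntactically valid Python on its own."""
    try:
        ast.parse(chunk)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


def invalid_chunk_indices(chunks: list[str]) -> list[int]:
    """Return the positions of chunks that do not parse in isolation."""
    return [i for i, chunk in enumerate(chunks) if not parses_as_python(chunk)]


example_chunks = [
    "def f():\n    return 1\n",  # complete function: parses
    "    return 2\n",            # orphaned, indented body: does not parse
    "x = (",                     # cut mid-expression: does not parse
]
bad_indices = invalid_chunk_indices(example_chunks)
```

A high share of flagged chunks suggests that `chunk_size` is too small for the typical function or class in your codebase.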