Tessl Tile for pypi/langchain-text-splitters@0.3.0

# Code-Aware Text Splitting

Code-aware text splitting provides specialized text segmentation that understands programming language syntax and structure. These splitters are designed to maintain code integrity by respecting logical boundaries such as function definitions, class declarations, and block structures.

## Capabilities

### Python Code Splitting

Specialized splitting for Python source code that respects Python syntax and structure.

```python { .api }
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import PythonCodeTextSplitter

python_splitter = PythonCodeTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

python_code = """
import os
import sys

def calculate_sum(a, b):
    '''Calculate the sum of two numbers.'''
    return a + b

class Calculator:
    def __init__(self):
        self.history = []

    def add(self, x, y):
        result = x + y
        self.history.append(f"{x} + {y} = {result}")
        return result

    def get_history(self):
        return self.history

if __name__ == "__main__":
    calc = Calculator()
    print(calc.add(5, 3))
"""

chunks = python_splitter.split_text(python_code)
```

The Python splitter tries separators in priority order, moving to the next one only when a piece is still larger than `chunk_size`:
- Class definitions (`\nclass `)
- Function and method definitions (`\ndef `, `\n\tdef `)
- Standard separators (`\n\n`, `\n`, a single space, and finally the empty string)
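
The priority-ordered separator idea can be illustrated with a small standalone sketch. This is a simplified stand-in, not the library's implementation: `recursive_split` is a hypothetical helper, the separator list is illustrative, and the sketch omits the merging of small pieces and the `chunk_overlap` handling that the real splitter performs.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the highest-priority separator present in the text, then
    recurse with the remaining separators on pieces that are still too long.

    Simplified sketch: no merging of small pieces, no overlap.
    """
    if not text:
        return []
    if len(text) <= chunk_size or not separators:
        return [text]

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    # Keep the separator at the start of each following piece so that
    # "\nclass " or "\ndef " stays attached to the definition it introduces.
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + part for part in parts[1:]]

    chunks: list[str] = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


# Illustrative separator list (an assumption for this sketch).
PYTHON_LIKE_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

source = "import os\n\nclass A:\n    pass\n\ndef f():\n    return 1\n\ndef g():\n    return 2\n"
for chunk in recursive_split(source, PYTHON_LIKE_SEPARATORS, chunk_size=30):
    print(repr(chunk))
```

Because each separator is re-attached to the piece that follows it, concatenating the chunks reproduces the input exactly, which makes the sketch easy to check.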

### JavaScript/TypeScript Framework Splitting

Specialized splitting for React/JSX, Vue, and Svelte code that understands component boundaries and framework-specific syntax.

```python { .api }
class JSFrameworkTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        chunk_size: int = 2000,
        chunk_overlap: int = 0,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `separators`: Custom separator list (default: framework-optimized separators)
- `chunk_size`: Maximum chunk size (default: `2000`)
- `chunk_overlap`: Overlap between chunks (default: `0`)

**Usage:**

```python
from langchain_text_splitters import JSFrameworkTextSplitter

jsx_splitter = JSFrameworkTextSplitter(
    chunk_size=1500,
    chunk_overlap=100
)

react_code = """
import React, { useState, useEffect } from 'react';

const UserProfile = ({ userId }) => {
    const [user, setUser] = useState(null);
    const [loading, setLoading] = useState(true);

    useEffect(() => {
        fetchUser(userId)
            .then(userData => {
                setUser(userData);
                setLoading(false);
            })
            .catch(error => {
                console.error('Error fetching user:', error);
                setLoading(false);
            });
    }, [userId]);

    if (loading) {
        return <LoadingSpinner />;
    }

    return (
        <div className="user-profile">
            <h1>{user.name}</h1>
            <p>{user.email}</p>
        </div>
    );
};

export default UserProfile;
"""

chunks = jsx_splitter.split_text(react_code)
```

The JSX splitter recognizes:
- Component definitions and exports
- Hook declarations (`useState`, `useEffect`, etc.)
- JSX elements and fragments
- Import/export statements
- Function and arrow function boundaries
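
One plausible way to locate JSX element boundaries is to scan the source for opening tags and turn each distinct tag name into a separator. The sketch below illustrates that idea only; `component_separators` is a hypothetical helper, and nothing here should be read as the exact logic inside `JSFrameworkTextSplitter`.

```python
import re

# Matches an opening tag name such as "<div" or "<LoadingSpinner".
# Closing tags ("</div>") are skipped because "/" is not a letter.
OPENING_TAG = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")


def component_separators(source: str) -> list[str]:
    """Return one '<Tag' separator per distinct tag name, in order of first use."""
    seen: list[str] = []
    for tag in OPENING_TAG.findall(source):
        if tag not in seen:
            seen.append(tag)
    return [f"<{tag}" for tag in seen]


snippet = '<div className="user-profile"><h1>{user.name}</h1><LoadingSpinner /></div>'
print(component_separators(snippet))
```

A regex this simple also matches TypeScript generics (`Array<string>`) and comparisons written without spaces (`a<b`), so a real implementation would need stricter matching.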

### LaTeX Document Splitting

Specialized splitting for LaTeX documents that respects LaTeX structure and formatting commands.

```python { .api }
class LatexTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

**Usage:**

```python
from langchain_text_splitters import LatexTextSplitter

latex_splitter = LatexTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

latex_document = r"""
\documentclass{article}
\usepackage{amsmath}

\title{Mathematical Analysis}
\author{Author Name}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
This document presents a mathematical analysis of...

\subsection{Preliminaries}
Let us define the following concepts:

\begin{definition}
A function $f: \mathbb{R} \to \mathbb{R}$ is continuous at point $a$ if...
\end{definition}

\begin{theorem}
If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then...
\end{theorem}

\section{Main Results}
The main theorem can be stated as follows:

\begin{align}
\int_a^b f(x) dx &= F(b) - F(a) \\
&= \lim_{n \to \infty} \sum_{i=1}^n f(x_i) \Delta x
\end{align}

\end{document}
"""

chunks = latex_splitter.split_text(latex_document)
```

The LaTeX splitter uses separators that respect:
- Document structure (`\section`, `\subsection`, `\chapter`)
- Environment boundaries (`\begin{}`, `\end{}`)
- Mathematical expressions and equations
- Standard paragraph breaks
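
Splitting on document structure can be sketched with a zero-width regular expression that cuts just before each sectioning command. This is a standalone illustration using only the standard library; `split_at_sections` is a hypothetical helper, and it covers only the sectioning commands, not environments or math.

```python
import re

# Zero-width lookahead: split *before* a newline followed by a sectioning
# command, so each command stays at the start of its own piece.
SECTION_BOUNDARY = re.compile(r"(?=\n\\(?:chapter|section|subsection)\{)")


def split_at_sections(document: str) -> list[str]:
    """Return the pieces of a LaTeX document, one per sectioning command."""
    return [piece for piece in SECTION_BOUNDARY.split(document) if piece.strip()]


sample = (
    "\\title{Mathematical Analysis}\n"
    "\\section{Introduction}\nThis document presents...\n"
    "\\subsection{Preliminaries}\nLet us define...\n"
    "\\section{Main Results}\nThe main theorem...\n"
)
for piece in split_at_sections(sample):
    print(repr(piece))
```

Splitting on an empty match requires Python 3.7 or later.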

### Language-Specific Splitting via RecursiveCharacterTextSplitter

For other programming languages, use the `RecursiveCharacterTextSplitter.from_language()` method with the appropriate `Language` enum value.

```python { .api }
# Available through RecursiveCharacterTextSplitter
@classmethod
def from_language(
    cls,
    language: Language,
    **kwargs: Any
) -> "RecursiveCharacterTextSplitter": ...

@staticmethod
def get_separators_for_language(language: Language) -> list[str]: ...
```

**Usage:**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Java code splitting
java_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JAVA,
    chunk_size=2000,
    chunk_overlap=200
)

java_code = """
public class Calculator {
    private double result;

    public Calculator() {
        this.result = 0.0;
    }

    public double add(double a, double b) {
        result = a + b;
        return result;
    }

    public static void main(String[] args) {
        Calculator calc = new Calculator();
        System.out.println(calc.add(5.0, 3.0));
    }
}
"""

java_chunks = java_splitter.split_text(java_code)

# C++ code splitting
cpp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CPP,
    chunk_size=1500
)

# Get separators for inspection
cpp_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.CPP)
```
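
When splitting a whole repository, the language usually has to be chosen per file. A minimal sketch of that lookup follows, using only the standard library. Both `EXTENSION_TO_LANGUAGE` and `language_for_path` are hypothetical names introduced for illustration, and the mapping is deliberately partial.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, partial mapping from file extension to a language name.
# The names are assumed to line up with `Language` enum values; check them
# against the installed version before relying on this.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".js": "js",
    ".ts": "ts",
    ".go": "go",
    ".rs": "rust",
    ".md": "markdown",
}


def language_for_path(path: str) -> Optional[str]:
    """Return the language name for a file path, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


print(language_for_path("src/Main.java"))
print(language_for_path("README"))
```

If the returned name matches a `Language` member's value (an assumption to verify), it can be converted with `Language(name)` and passed to `from_language()`. Files with no match can fall back to a plain `RecursiveCharacterTextSplitter`.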

## Supported Languages

The `Language` enum provides optimized separators for:

### Popular Languages
- **PYTHON**: Python source code
- **JS**, **TS**: JavaScript and TypeScript
- **JAVA**: Java source code
- **CPP**, **C**: C and C++ source code
- **CSHARP**: C# source code
- **GO**: Go source code
- **RUST**: Rust source code
- **RUBY**: Ruby source code
- **PHP**: PHP source code

### Additional Languages
- **KOTLIN**: Kotlin source code
- **SCALA**: Scala source code
- **SWIFT**: Swift source code
- **PROTO**: Protocol Buffer definitions
- **SOL**: Solidity smart contracts
- **COBOL**: COBOL source code
- **LUA**: Lua scripts
- **PERL**: Perl scripts
- **HASKELL**: Haskell source code
- **ELIXIR**: Elixir source code
- **POWERSHELL**: PowerShell scripts
- **VISUALBASIC6**: Visual Basic 6 source code

### Document Formats
- **MARKDOWN**: Markdown documents
- **LATEX**: LaTeX documents
- **HTML**: HTML documents
- **RST**: reStructuredText documents

## Best Practices

1. **Use language-specific splitters**: Always use the appropriate language splitter for better code structure preservation
2. **Configure appropriate chunk sizes**: Balance between preserving complete functions/classes and staying within token limits
3. **Consider minimal overlap**: Code chunks often need less overlap than prose text
4. **Test with your codebase**: Different coding styles may require different chunk sizes
5. **Preserve imports**: Ensure import/include statements are properly handled in your chunking strategy
6. **Maintain syntax validity**: Verify that code chunks maintain valid syntax boundaries
7. **Handle comments appropriately**: Consider how code comments should be distributed across chunks
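
For Python sources, the syntax-validity check in point 6 can be approximated with the standard library's `ast` module. The sketch below assumes chunks are plain strings such as those returned by `split_text`. `is_valid_python` and `invalid_chunk_indices` are hypothetical helper names, and parsing succeeds only for chunks that happen to be complete statements, so a failure is a hint to adjust `chunk_size` rather than an error in itself.

```python
import ast
import textwrap


def is_valid_python(chunk: str) -> bool:
    """Return True if the chunk parses as standalone Python.

    The chunk is dedented first so that an indented method body cut out of
    a class is not rejected purely because of its leading whitespace.
    """
    try:
        ast.parse(textwrap.dedent(chunk))
    except SyntaxError:  # also covers IndentationError
        return False
    return True


def invalid_chunk_indices(chunks: list[str]) -> list[int]:
    """Return the positions of chunks that do not parse on their own."""
    return [i for i, chunk in enumerate(chunks) if not is_valid_python(chunk)]


example_chunks = [
    "import os\n",
    "    def add(self, x, y):\n        return x + y\n",
    "def broken(:\n",
]
print(invalid_chunk_indices(example_chunks))
```

A high share of unparseable chunks usually means `chunk_size` is too small for the typical function or class in the codebase.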
