tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

Character-Based Text Splitting

Character-based splitters segment text on separator strings. This covers simple splitting on a single separator and recursive splitting, which tries a list of separators in order of preference.

Capabilities

Basic Character Splitting

Splits text on a single separator string or regex pattern, then merges adjacent pieces back together for as long as the result stays within chunk_size.

class CharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: String or regex pattern to split on (default: "\n\n")
  • is_separator_regex: Whether separator should be treated as regex (default: False)
  • **kwargs: Additional parameters passed to TextSplitter.__init__(), such as chunk_size, chunk_overlap, and keep_separator

Usage:

from langchain_text_splitters import CharacterTextSplitter

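# Split on double newlines.
# Pieces are merged back together while they fit within chunk_size,
# so this short input comes back as a single chunk.
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)
chunks = splitter.split_text("Paragraph 1\n\nParagraph 2\n\nParagraph 3")

# Split using a regex separator
regex_splitter = CharacterTextSplitter(
    separator=r"\s+",  # Split on any run of whitespace
    is_separator_regex=True,  # Without this flag the pattern is matched literally
    chunk_size=500
)
chunks = regex_splitter.split_text("Word1 Word2    Word3\tWord4\nWord5")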
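
The split-then-merge behavior described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper name split_and_merge is hypothetical, it handles only literal separators, and it ignores chunk_overlap.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on a literal separator, then greedily re-join pieces up to chunk_size.

    Hypothetical helper for illustration only; chunk overlap is not modeled.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        # Accept the candidate if it fits, or if it is the first piece of a
        # new chunk (a single oversized piece is emitted as-is).
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Paragraph 1\n\nParagraph 2\n\nParagraph 3"
merged = split_and_merge(text, "\n\n", chunk_size=25)
# merged == ["Paragraph 1\n\nParagraph 2", "Paragraph 3"]
```

With a chunk_size large enough to hold the whole input, the same helper returns the text as one chunk, which mirrors why short inputs often appear "unsplit".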

Recursive Character Splitting

Tries a list of separators in order of preference and recursively re-splits any piece that is still larger than chunk_size using the remaining separators.

class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        keep_separator: Union[bool, Literal["start", "end"]] = True,
        is_separator_regex: bool = False,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...
    
    @classmethod
    def from_language(
        cls,
        language: Language,
        **kwargs: Any
    ) -> "RecursiveCharacterTextSplitter": ...
    
    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...

Parameters:

  • separators: List of separators to try in order (default: ["\n\n", "\n", " ", ""])
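  • keep_separator: Whether to keep the separator and where to place it: True or "start" attaches it to the start of the following piece, "end" attaches it to the end of the preceding piece, False drops it (default: True)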
  • is_separator_regex: Whether separators should be treated as regex (default: False)
  • **kwargs: Additional parameters passed to TextSplitter.__init__()
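
To make the "start" and "end" placements concrete, here is a minimal pure-Python sketch. The function split_keeping_separator is a hypothetical name used only for illustration; it assumes a literal separator and is not the library's code.

```python
import re


def split_keeping_separator(text: str, separator: str, where: str) -> list[str]:
    """Split on a literal separator and keep it at the 'start' or 'end' of pieces.

    Hypothetical helper for illustration only.
    """
    # A capturing group makes re.split return: text, sep, text, sep, ..., text
    parts = re.split(f"({re.escape(separator)})", text)
    if where == "end":
        # Attach each separator to the piece that precedes it.
        pieces = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        pieces.append(parts[-1])
    elif where == "start":
        # Attach each separator to the piece that follows it.
        pieces = [parts[0]]
        pieces += [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    else:
        raise ValueError("where must be 'start' or 'end'")
    return [p for p in pieces if p]


at_end = split_keeping_separator("a. b. c", ". ", "end")
# at_end == ["a. ", "b. ", "c"]
at_start = split_keeping_separator("a. b. c", ". ", "start")
# at_start == ["a", ". b", ". c"]
```

Keeping the separator at the end tends to suit sentence-like text, while keeping it at the start suits markers that introduce a block, such as headings.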

Methods:

  • from_language(): Class method that creates a splitter preconfigured with the separators for a given language
  • get_separators_for_language(): Static method that returns the separator list for a given language

Usage:

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text = "Long document with multiple paragraphs and sections..."
chunks = splitter.split_text(text)

# Language-specific splitting for Python code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=100
)
python_code = """
def function1():
    pass

class MyClass:
    def method(self):
        return "result"
"""
code_chunks = python_splitter.split_text(python_code)

# Custom separators
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["###", "##", "#", "\n\n", "\n", " ", ""],
    chunk_size=500,
    keep_separator=True
)

# Get separators for different languages
python_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
js_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)
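
The recursive strategy itself can be sketched without the library. The following is a deliberately simplified illustration under stated assumptions: recursive_split is a hypothetical name, separators are literal and are dropped from the output, and the merging and overlap steps of the real splitter are left out.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator, recursing with the rest on oversized pieces.

    Hypothetical helper for illustration only; no merging, no overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: return the oversized piece unchanged.
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-size slices of characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    result: list[str] = []
    for piece in text.split(sep):
        # Pieces that still exceed chunk_size are re-split with finer separators.
        result.extend(recursive_split(piece, rest, chunk_size))
    return result


pieces = recursive_split("aaaa bbbb\n\ncccccccccc", ["\n\n", " ", ""], chunk_size=5)
# pieces == ["aaaa", "bbbb", "ccccc", "ccccc"]
```

The first paragraph is small enough to be handled by the space separator, while the second contains no spaces and falls through to character-level slicing.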

Language Support

The Language enum supports the following programming and markup languages, each with its own separator list:

  • CPP, C: C/C++ code splitting
  • CSHARP: C# code splitting
  • GO: Go code splitting
  • JAVA, KOTLIN, SCALA: JVM language splitting
  • JS, TS: JavaScript/TypeScript splitting
  • PHP: PHP code splitting
  • PROTO: Protocol Buffer definition splitting
  • PYTHON: Python code splitting
  • RST: reStructuredText splitting
  • RUBY: Ruby code splitting
  • RUST: Rust code splitting
  • SWIFT: Swift code splitting
  • MARKDOWN: Markdown document splitting
  • LATEX: LaTeX document splitting
  • HTML: HTML document splitting
  • SOL: Solidity smart contract splitting
  • COBOL: COBOL code splitting
  • LUA: Lua script splitting
  • PERL: Perl script splitting
  • HASKELL: Haskell code splitting
  • ELIXIR: Elixir code splitting
  • POWERSHELL: PowerShell script splitting
  • VISUALBASIC6: Visual Basic 6 code splitting

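Each language maps to a separator list ordered from coarse to fine, so text is typically split at structural boundaries (such as class and function definitions in code) before the splitter falls back to blank lines, newlines, and spaces.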
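
As a rough illustration of how a language-oriented separator list can be applied, the sketch below picks the coarsest separator that occurs in a piece of code and splits in front of it. The list PYTHON_LIKE_SEPARATORS and both function names are assumptions made for this example; the actual lists are whatever get_separators_for_language() returns.

```python
import re

# Illustrative separator list for Python source, ordered coarse to fine.
# This is an assumption for the example, not copied from the library.
PYTHON_LIKE_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]


def pick_separator(text: str, separators: list[str]) -> str:
    """Return the first separator that occurs in text ("" always matches)."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return separators[-1]


def split_before(text: str, separator: str) -> list[str]:
    """Split so each separator stays at the start of the piece it introduces."""
    # A zero-width lookahead splits in front of the separator without consuming it.
    pieces = re.split(f"(?={re.escape(separator)})", text)
    return [p for p in pieces if p]


code = "import os\n\ndef a():\n    pass\n\nclass B:\n    pass\n"
sep = pick_separator(code, PYTHON_LIKE_SEPARATORS)
code_pieces = split_before(code, sep)
# sep == "\nclass "
# code_pieces == ["import os\n\ndef a():\n    pass\n", "\nclass B:\n    pass\n"]
```

Because the separator is kept, joining the pieces reproduces the original source exactly, which matters when chunks are later shown to a model as code.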

Best Practices

  1. Choose appropriate separators: Use natural break points such as paragraphs (\n\n) for prose, or language-specific separators for code
  2. Configure chunk overlap: Set chunk_overlap to roughly 10-20% of chunk_size to carry context across chunk boundaries
  3. Use language-specific splitting: For code, use from_language() so chunks tend to break at class and function boundaries
  4. Consider regex patterns: Set is_separator_regex=True when a separator is a regular expression; otherwise it is matched literally
  5. Test chunk sizes: chunk_size is measured by length_function, which counts characters by default, so validate that the resulting chunks fit within your model's context window
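
Points 2 and 5 can be checked with a couple of small helpers. This is a minimal sketch: overlap_for and oversized_chunks are hypothetical names, and the word-count length function stands in for a real tokenizer.

```python
from typing import Callable


def overlap_for(chunk_size: int, percent: int = 15) -> int:
    """Return an overlap equal to the given integer percentage of chunk_size."""
    return chunk_size * percent // 100


def oversized_chunks(
    chunks: list[str],
    max_len: int,
    length_function: Callable[[str], int] = len,
) -> list[tuple[int, int]]:
    """Return (index, length) for every chunk longer than max_len."""
    report = []
    for index, chunk in enumerate(chunks):
        length = length_function(chunk)
        if length > max_len:
            report.append((index, length))
    return report


overlap = overlap_for(1000)  # 150
too_long = oversized_chunks(["abc", "abcdefgh", "ab"], max_len=5)  # [(1, 8)]

# The same check with a crude word count instead of a character count.
by_words = oversized_chunks(
    ["one two three", "four"], max_len=2, length_function=lambda s: len(s.split())
)  # [(0, 3)]
```

Running a check like this on real splitter output is a cheap way to catch single oversized pieces, which can occur when no separator matches inside a long run of text.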

Install with Tessl CLI

npx tessl i tessl/pypi-langchain-text-splitters
