LangChain text splitting utilities for breaking documents into manageable chunks for AI processing
—
Character-based splitting provides fundamental text segmentation based on specific character separators. This includes simple separator-based splitting and advanced recursive splitting strategies that try multiple separators in order of preference.
Simple text splitting based on a single separator string or regex pattern.
class CharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        is_separator_regex: bool = False,
        **kwargs: Any,
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

Parameters:

separator: String or regex pattern to split on (default: "\n\n")
is_separator_regex: Whether the separator should be treated as a regex (default: False)
**kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:
from langchain_text_splitters import CharacterTextSplitter
# Split on double newlines
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000)
chunks = splitter.split_text("Paragraph 1\n\nParagraph 2\n\nParagraph 3")
# Split using regex
regex_splitter = CharacterTextSplitter(
    separator=r"\s+",  # Split on any whitespace
    is_separator_regex=True,
    chunk_size=500,
)
chunks = regex_splitter.split_text("Word1 Word2 Word3\tWord4\nWord5")

Advanced splitting that tries multiple separators in order of preference, recursively splitting chunks that are still too large.
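The recursive fallback can be sketched in plain Python. This is a simplified illustration only, not LangChain's actual implementation, which additionally merges adjacent pieces back together up to chunk_size, applies chunk_overlap, and handles separator retention and regex separators:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Illustrative sketch of recursive splitting: try each separator in order,
    and recurse with finer separators on any piece that is still too large."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard-cut the text at chunk_size boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    # The empty-string separator "" means "split into individual characters".
    parts = text.split(sep) if sep else list(text)
    chunks = []
    for part in parts:
        if len(part) > chunk_size:
            # Piece is still too large: fall back to the next separator.
            chunks.extend(recursive_split(part, rest, chunk_size))
        elif part:
            chunks.append(part)
    return chunks

chunks = recursive_split(
    "Para one.\n\nA much, much longer second paragraph.",
    ["\n\n", " ", ""],
    20,
)
# The short first paragraph survives intact; the long second paragraph
# falls back to word-level splitting.
```

The real splitter improves on this by re-merging consecutive small pieces into chunks as close to chunk_size as possible, so output is not as fragmented as this sketch suggests.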
class RecursiveCharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        separators: Optional[list[str]] = None,
        keep_separator: Union[bool, Literal["start", "end"]] = True,
        is_separator_regex: bool = False,
        **kwargs: Any,
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

    @classmethod
    def from_language(
        cls,
        language: Language,
        **kwargs: Any,
    ) -> "RecursiveCharacterTextSplitter": ...

    @staticmethod
    def get_separators_for_language(language: Language) -> list[str]: ...

Parameters:

separators: List of separators to try in order (default: ["\n\n", "\n", " ", ""])
keep_separator: Whether to keep the separator and where to place it (default: True)
is_separator_regex: Whether separators should be treated as regex (default: False)
**kwargs: Additional parameters passed to TextSplitter.__init__()

Class Methods:

from_language(): Create a splitter optimized for a specific programming language
get_separators_for_language(): Get the separator list for a programming language

Usage:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
text = "Long document with multiple paragraphs and sections..."
chunks = splitter.split_text(text)
# Language-specific splitting for Python code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=100,
)
python_code = """
def function1():
pass
class MyClass:
def method(self):
return "result"
"""
code_chunks = python_splitter.split_text(python_code)
# Custom separators
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["###", "##", "#", "\n\n", "\n", " ", ""],
    chunk_size=500,
    keep_separator=True,
)
# Get separators for different languages
python_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
js_seps = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

The Language enum covers a wide range of programming languages, each with optimized separator patterns. Each language's separators are carefully tuned to respect that language's syntax and structure for optimal code splitting.
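To illustrate why language-aware separators matter, consider splitting Python source on a definition-boundary pattern such as "\ndef " (a pattern of the kind get_separators_for_language(Language.PYTHON) returns; the exact separator list is an assumption here, and this stdlib-only sketch does not use LangChain itself):

```python
# Hypothetical illustration: split Python source at function boundaries
# so each function body stays attached to its own "def" line.
code = "def a():\n    return 1\n\ndef b():\n    return 2\n"

# Splitting on "\ndef " cuts between functions, not inside them.
parts = code.split("\ndef ")
# Re-attach the "def " prefix that str.split() consumed (keep_separator-style).
chunks = [parts[0]] + ["def " + p for p in parts[1:]]
```

A plain blank-line split could instead separate a long function's body from its signature, which is exactly what the tuned separator lists avoid.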
Tips:

Use the default separators (starting with "\n\n") for plain text, or language-specific patterns for code
When splitting code, use the from_language() method for better results
Set is_separator_regex=True for complex splitting patterns

Install with Tessl CLI
npx tessl i tessl/pypi-langchain-text-splitters