LangChain text splitting utilities for breaking documents into manageable chunks for AI processing
---
The core base classes and utilities provide the fundamental interfaces, enums, and utility functions that form the foundation of all text splitting functionality in langchain-text-splitters. These components define the common patterns and contracts used throughout the library.
The core abstract interface that all text splitters implement, providing common functionality and defining the splitting contract.
class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: Union[bool, Literal["start", "end"]] = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True
    ) -> None: ...

    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> Self: ...

Constructor Parameters:
- chunk_size: Maximum size of chunks to return (default: 4000)
- chunk_overlap: Overlap in characters between chunks (default: 200)
- length_function: Function that measures the length of given chunks (default: len)
- keep_separator: Whether to keep the separator and where to place it (default: False)
- add_start_index: If True, includes the chunk's start index in its metadata (default: False)
- strip_whitespace: If True, strips whitespace from the start and end of every document (default: True)

Abstract Methods:
- split_text(): Must be implemented by all concrete splitters

Concrete Methods:
- create_documents(): Create Document objects from a list of texts, with optional per-text metadata
- split_documents(): Split existing Document objects into smaller chunks

Factory Methods:
- from_huggingface_tokenizer(): Create a splitter from a HuggingFace tokenizer
- from_tiktoken_encoder(): Create a splitter from a tiktoken encoder

Usage:
from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

# Example concrete implementation (normally you'd use CharacterTextSplitter)
class SimpleTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Simple implementation that splits on periods
        sentences = text.split('.')
        return [s.strip() + '.' for s in sentences if s.strip()]

# Using the splitter
splitter = SimpleTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    strip_whitespace=True
)

# Split text
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split_text(text)

# Create documents with metadata
documents = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "example.txt", "author": "unknown"}]
)

# Split existing documents
existing_docs = [Document(page_content=text, metadata={"page": 1})]
split_docs = splitter.split_documents(existing_docs)

Enumeration defining supported programming languages for language-specific text splitting.
class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"

Usage:
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use with language-specific splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000
)

# Get separators for a language
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

# Compare language values
if some_language == Language.PYTHON.value:  # "python"
    print("This is Python code")

Configuration dataclass for token-based text splitting operations.
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

Fields:
- chunk_overlap: Number of tokens to overlap between chunks
- tokens_per_chunk: Maximum number of tokens per chunk
- decode: Function to decode token IDs back to text
- encode: Function to encode text to token IDs

Usage:
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
tokenizer_config = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=512,
    decode=encoding.decode,
    encode=encoding.encode
)

# Use with splitting function
text = "Long text to be tokenized and split..."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)

Utility function for splitting text using a tokenizer configuration.
def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...

Parameters:
- text: The text to split
- tokenizer: Tokenizer configuration object

Returns:
A list of text chunks (list[str]).

Usage:
from langchain_text_splitters import split_text_on_tokens, Tokenizer
from transformers import AutoTokenizer

# Using a HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer_config = Tokenizer(
    chunk_overlap=25,
    tokens_per_chunk=256,
    decode=lambda tokens: hf_tokenizer.decode(tokens, skip_special_tokens=True),
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False)
)

text = "This is a sample text that will be tokenized and split into chunks."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)

The text splitters package provides several TypedDict definitions for structured data used across various splitters.
class ElementType(TypedDict):
    """Element type as typed dict for HTML elements."""
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    """Header type as typed dict for markdown headers."""
    level: int
    name: str
    data: str

class LineType(TypedDict):
    """Line type as typed dict for text lines with metadata."""
    metadata: dict[str, str]
    content: str

These types are used by:
- ElementType: HTML-based splitters, for structured element data
- HeaderType: Markdown splitters, for header information
- LineType: Markdown splitters, for line-based processing

All text splitters inherit from LangChain's BaseDocumentTransformer, providing consistent integration with the LangChain ecosystem.
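This inheritance means a splitter can be dropped anywhere a document transformer is expected: transforming a batch of documents simply splits them. The pattern can be sketched without the library installed; the Document and BaseDocumentTransformer below are simplified illustrative stand-ins, not the real langchain_core classes:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Sequence

# Illustrative stand-in for langchain_core.documents.Document
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Illustrative stand-in for the transformer interface
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]: ...

# A splitter is just a transformer whose transform step splits documents
class PeriodSplitter(BaseDocumentTransformer):
    def transform_documents(self, documents, **kwargs):
        out = []
        for doc in documents:
            for part in doc.page_content.split("."):
                part = part.strip()
                if part:
                    # each chunk keeps a copy of the source document's metadata
                    out.append(Document(part + ".", dict(doc.metadata)))
        return out

docs = [Document("First sentence. Second sentence.", {"source": "a.txt"})]
chunks = PeriodSplitter().transform_documents(docs)
# two Documents, each carrying {"source": "a.txt"}
```

The key design point mirrored here is that metadata is copied onto every chunk, which is what lets downstream retrieval trace a chunk back to its source.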
# Inherited from langchain_core
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...

    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...

The base TextSplitter class includes built-in validation for common configuration errors:
# These will raise ValueError (shown on the base class for brevity;
# in practice, use a concrete subclass such as CharacterTextSplitter)
TextSplitter(chunk_size=0)                       # chunk_size must be > 0
TextSplitter(chunk_overlap=-1)                   # chunk_overlap must be >= 0
TextSplitter(chunk_size=100, chunk_overlap=200)  # chunk_overlap larger than chunk_size

The library follows a clear inheritance pattern:
- BaseDocumentTransformer (from LangChain Core)
  - TextSplitter (abstract base class)
    - Concrete splitters (CharacterTextSplitter, TokenTextSplitter, etc.)

Many splitters provide factory methods for convenient initialization:
- from_language() for language-specific splitting
- from_huggingface_tokenizer() for HuggingFace integration
- from_tiktoken_encoder() for OpenAI tokenizer integration

All splitters accept common configuration parameters through the base class while allowing specific customization through their own parameters.
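The tokenizer-based factory methods essentially plug an encode/decode pair into the chunking loop, the same shape as the Tokenizer dataclass and split_text_on_tokens above. The sliding-window idea behind that loop can be sketched without any tokenizer dependency; window_split is a hypothetical helper, and the whitespace "tokenizer" is a stand-in, not what the library uses:

```python
from typing import Callable

def window_split(text: str,
                 encode: Callable[[str], list],
                 decode: Callable[[list], str],
                 tokens_per_chunk: int,
                 chunk_overlap: int) -> list[str]:
    # Encode once, then walk a window of tokens_per_chunk tokens,
    # stepping forward by (tokens_per_chunk - chunk_overlap) each time
    # so consecutive chunks share chunk_overlap tokens.
    tokens = encode(text)
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(decode(tokens[start:start + tokens_per_chunk]))
        start += step
    return chunks

# Stand-in tokenizer: whitespace-separated words play the role of tokens
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

chunks = window_split("one two three four five six",
                      encode, decode,
                      tokens_per_chunk=3, chunk_overlap=1)
# windows of 3 "tokens" with 1 token of overlap between neighbors
```

Measuring and overlapping in token space rather than character space is what makes these splitters line up with model context limits.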
To create a custom splitter, subclass TextSplitter and implement split_text(). Use create_documents() and split_documents() to maintain document metadata across splits.

Install with Tessl CLI
npx tessl i tessl/pypi-langchain-text-splitters