tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

Core Base Classes and Utilities

The core base classes and utilities define the fundamental interfaces, enums, and utility functions on which all text splitting in langchain-text-splitters is built. These components establish the common patterns and contracts used throughout the library.

Capabilities

TextSplitter Abstract Base Class

The core abstract interface that all text splitters implement, providing common functionality and defining the splitting contract.

class TextSplitter(BaseDocumentTransformer, ABC):
    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: Union[bool, Literal["start", "end"]] = False,
        add_start_index: bool = False,
        strip_whitespace: bool = True
    ) -> None: ...
    
    @abstractmethod
    def split_text(self, text: str) -> list[str]: ...
    
    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
    
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    
    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any
    ) -> "TextSplitter": ...
    
    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> Self: ...

Constructor Parameters:

  • chunk_size: Maximum size of chunks to return, as measured by length_function (default: 4000)
  • chunk_overlap: Overlap between consecutive chunks, as measured by length_function (default: 200)
  • length_function: Function used to measure chunk length (default: len)
  • keep_separator: Whether to keep the separator in the chunks and, if so, whether to place it at the "start" or "end" (default: False)
  • add_start_index: If True, includes each chunk's start index in its metadata (default: False)
  • strip_whitespace: If True, strips whitespace from the start and end of every document (default: True)
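The interaction of chunk_size and chunk_overlap can be illustrated with a minimal, dependency-free sketch (this is not the library's actual implementation, which merges variable-sized splits; it only shows the windowing semantics):

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window splitter illustrating chunk_size/chunk_overlap."""
    # Guard against a zero (or negative) step, which would loop forever.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - chunk_overlap
    start = 0
    while start < len(text):
        # The real TextSplitter measures with length_function; here chunks
        # are measured in characters for simplicity.
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

Each chunk begins chunk_overlap characters before the previous chunk ended, so adjacent chunks share context.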

Abstract Methods:

  • split_text(): Must be implemented by all concrete splitters

Concrete Methods:

  • create_documents(): Create Document objects from text list with optional metadata
  • split_documents(): Split existing Document objects into smaller chunks

Factory Methods:

  • from_huggingface_tokenizer(): Create splitter from HuggingFace tokenizer
  • from_tiktoken_encoder(): Create splitter from tiktoken encoder

Usage:

from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

# Example concrete implementation (normally you'd use CharacterTextSplitter)
class SimpleTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Naive implementation that splits on periods
        # (ignores chunk_size for brevity)
        sentences = text.split('.')
        return [s.strip() + '.' for s in sentences if s.strip()]

# Using the splitter
splitter = SimpleTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    strip_whitespace=True
)

# Split text
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split_text(text)

# Create documents with metadata
documents = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "example.txt", "author": "unknown"}]
)

# Split existing documents
existing_docs = [Document(page_content=text, metadata={"page": 1})]
split_docs = splitter.split_documents(existing_docs)
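The tokenizer factory methods boil down to measuring length in tokens rather than characters. A rough sketch of the idea, using a hypothetical stand-in tokenizer (the names FakeTokenizer and token_length are illustrative, not part of the library):

```python
# Hypothetical stand-in for a real tokenizer object: "encodes" on whitespace.
class FakeTokenizer:
    def encode(self, text: str) -> list[str]:
        return text.split()

tokenizer = FakeTokenizer()

# from_huggingface_tokenizer() effectively builds a length_function like this,
# so chunk_size and chunk_overlap are then measured in tokens, not characters.
def token_length(text: str) -> int:
    return len(tokenizer.encode(text))

print(token_length("First sentence. Second sentence."))  # 4
```

A splitter constructed with `length_function=token_length` would then enforce chunk_size as a token budget.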

Language Enumeration

Enumeration defining supported programming languages for language-specific text splitting.

class Language(Enum):
    CPP = "cpp"
    GO = "go"
    JAVA = "java"
    KOTLIN = "kotlin"
    JS = "js"
    TS = "ts"
    PHP = "php"
    PROTO = "proto"
    PYTHON = "python"
    RST = "rst"
    RUBY = "ruby"
    RUST = "rust"
    SCALA = "scala"
    SWIFT = "swift"
    MARKDOWN = "markdown"
    LATEX = "latex"
    HTML = "html"
    SOL = "sol"
    CSHARP = "csharp"
    COBOL = "cobol"
    C = "c"
    LUA = "lua"
    PERL = "perl"
    HASKELL = "haskell"
    ELIXIR = "elixir"
    POWERSHELL = "powershell"
    VISUALBASIC6 = "visualbasic6"

Usage:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use with language-specific splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000
)

# Get separators for a language
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)

# Compare a raw string against an enum value
if some_language_string == Language.PYTHON.value:  # Language.PYTHON.value == "python"
    print("This is Python code")
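Because Language is a standard Python Enum of string values, members can also be recovered from their raw strings. Illustrated here with a small local stand-in enum so the example has no dependencies:

```python
from enum import Enum

# Local stand-in mirroring a few members of the real Language enum.
class Language(Enum):
    PYTHON = "python"
    JS = "js"
    MARKDOWN = "markdown"

# Enum members can be looked up from their raw string values...
lang = Language("python")
print(lang is Language.PYTHON)  # True
# ...and raw values read back directly.
print(Language.JS.value)  # js
```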

Tokenizer Configuration

Configuration dataclass for token-based text splitting operations.

@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

Fields:

  • chunk_overlap: Number of tokens to overlap between chunks
  • tokens_per_chunk: Maximum number of tokens per chunk
  • decode: Function to decode token IDs back to text
  • encode: Function to encode text to token IDs

Usage:

from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
tokenizer_config = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=512,
    decode=encoding.decode,
    encode=encoding.encode
)

# Use with splitting function
text = "Long text to be tokenized and split..."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)

Token-Based Splitting Utility

Utility function for splitting text using a tokenizer configuration.

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...

Parameters:

  • text: The text to split
  • tokenizer: Tokenizer configuration object

Returns:

  • List of text chunks split according to token boundaries

Usage:

from langchain_text_splitters import split_text_on_tokens, Tokenizer
from transformers import AutoTokenizer

# Using HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config = Tokenizer(
    chunk_overlap=25,
    tokens_per_chunk=256,
    decode=lambda tokens: hf_tokenizer.decode(tokens, skip_special_tokens=True),
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False)
)

text = "This is a sample text that will be tokenized and split into chunks."
chunks = split_text_on_tokens(text=text, tokenizer=tokenizer_config)
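Under the hood, this utility slides a fixed-size token window across the encoded text. The algorithm can be sketched in a self-contained, simplified form (a mirror of the behavior, not the library's exact code), using a toy character-level tokenizer:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Slide a window of tokens_per_chunk tokens, stepping back chunk_overlap."""
    input_ids = tokenizer.encode(text)
    splits = []
    start = 0
    while start < len(input_ids):
        window = input_ids[start:start + tokenizer.tokens_per_chunk]
        splits.append(tokenizer.decode(window))
        start += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    return splits

# Toy tokenizer: each character is one "token" (its ordinal value).
toy = Tokenizer(
    chunk_overlap=2,
    tokens_per_chunk=5,
    decode=lambda ids: "".join(chr(i) for i in ids),
    encode=lambda s: [ord(c) for c in s],
)
print(split_text_on_tokens(text="abcdefgh", tokenizer=toy))
# ['abcde', 'defgh', 'gh']
```

Note that consecutive chunks share chunk_overlap tokens, which is why splitting in token space preserves context across chunk boundaries.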

Type Definitions

The text splitters package provides several TypedDict definitions for structured data used across various splitters.

class ElementType(TypedDict):
    """Element type as typed dict for HTML elements."""
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    """Header type as typed dict for markdown headers."""
    level: int
    name: str
    data: str

class LineType(TypedDict):
    """Line type as typed dict for text lines with metadata."""
    metadata: dict[str, str]
    content: str

These types are used by:

  • ElementType: HTML-based splitters for structured element data
  • HeaderType: Markdown splitters for header information
  • LineType: Markdown splitters for line-based processing
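As an illustration of the HeaderType shape, here is a hypothetical helper (parse_headers is not a library function) that scans markdown for ATX-style headers and emits HeaderType-shaped dicts:

```python
from typing import TypedDict

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

def parse_headers(markdown: str) -> list[HeaderType]:
    """Collect ATX-style markdown headers ('#', '##', ...) as HeaderType dicts."""
    headers: list[HeaderType] = []
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            data = stripped[level:].strip()
            headers.append({"level": level, "name": f"Header {level}", "data": data})
    return headers

doc = "# Title\nBody text\n## Section\nMore text\n"
print(parse_headers(doc))
# [{'level': 1, 'name': 'Header 1', 'data': 'Title'},
#  {'level': 2, 'name': 'Header 2', 'data': 'Section'}]
```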

Base Document Transformer Integration

All text splitters inherit from LangChain's BaseDocumentTransformer, providing consistent integration with the LangChain ecosystem.

# Inherited from langchain_core
class BaseDocumentTransformer(ABC):
    @abstractmethod
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...
    
    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> Sequence[Document]: ...

Error Handling and Validation

The base TextSplitter class includes built-in validation for common configuration errors:

# These will raise ValueError (shown with a concrete subclass,
# since TextSplitter itself is abstract and cannot be instantiated)
from langchain_text_splitters import CharacterTextSplitter

CharacterTextSplitter(chunk_size=0)           # chunk_size must be > 0
CharacterTextSplitter(chunk_overlap=-1)       # chunk_overlap must be >= 0
CharacterTextSplitter(chunk_size=100, chunk_overlap=200)  # overlap > chunk_size
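The checks can be sketched in plain Python as a simplified mirror of the base-class validation (not the library's exact code):

```python
def validate_config(chunk_size: int, chunk_overlap: int) -> None:
    """Simplified mirror of the TextSplitter configuration checks."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) cannot exceed chunk_size ({chunk_size})"
        )

validate_config(1000, 200)  # a valid configuration passes silently
```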

Design Principles

Inheritance Hierarchy

The library follows a clear inheritance pattern:

  1. BaseDocumentTransformer (from LangChain Core)
  2. TextSplitter (abstract base class)
  3. Concrete implementations (CharacterTextSplitter, TokenTextSplitter, etc.)

Factory Pattern

Many splitters provide factory methods for convenient initialization:

  • from_language() for language-specific splitting
  • from_huggingface_tokenizer() for HuggingFace integration
  • from_tiktoken_encoder() for OpenAI tokenizer integration

Configuration Flexibility

All splitters accept common configuration parameters through the base class while allowing specific customization through their own parameters.

Best Practices

  1. Extend TextSplitter: When creating custom splitters, extend TextSplitter and implement split_text()
  2. Use factory methods: Leverage factory methods for common initialization patterns
  3. Validate parameters: The base class provides validation; add custom validation in subclasses
  4. Preserve metadata: Use create_documents() and split_documents() to maintain document metadata
  5. Handle edge cases: Consider empty strings, very short texts, and texts smaller than chunk_size
  6. Choose appropriate length functions: For token-based splitting, use token counting functions
  7. Test with real data: Validate your splitter configuration with representative data

Install with Tessl CLI

npx tessl i tessl/pypi-langchain-text-splitters
