tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

Document Structure-Aware Splitting

Document structure-aware splitting provides specialized text segmentation that understands and preserves the structural elements of various document formats. These splitters maintain semantic context by respecting document hierarchy, headers, and formatting while creating appropriately sized chunks.

Capabilities

HTML Document Splitting

Specialized splitters for HTML content that preserve document structure and semantic elements.

HTML Header Text Splitter

Splits HTML content based on header tags while preserving document hierarchy and metadata.

class HTMLHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def split_text_from_url(
        self,
        url: str,
        timeout: int = 10,
        **kwargs: Any
    ) -> list[Document]: ...
    
    def split_text_from_file(self, file: Any) -> list[Document]: ...

Parameters:

  • headers_to_split_on: List of tuples (header_tag, header_name) defining split points
  • return_each_element: Whether to return each element separately (default: False)

Usage:

from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)

# Split HTML from URL
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)

# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)

HTML Section Splitter

Splits HTML into sections based on specified tags and font sizes. Requires lxml for its XSLT-based processing.

class HTMLSectionSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any
    ) -> None: ...
    
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
    
    def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...
    
    def convert_possible_tags_to_header(self, html_content: str) -> str: ...
    
    def split_text_from_file(self, file: Any) -> list[Document]: ...

HTML Semantic Preserving Splitter

Beta-stage advanced HTML splitter that preserves semantic structure with media handling capabilities.

class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: Optional[list[str]] = None,
        elements_to_preserve: Optional[list[str]] = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: Optional[dict[str, str]] = None,
        allowlist_tags: Optional[list[str]] = None,
        denylist_tags: Optional[list[str]] = None,
        preserve_parent_metadata: bool = False,
        keep_separator: Union[bool, Literal["start", "end"]] = True
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> list[Document]: ...

Parameters:

  • max_chunk_size: Maximum size of each chunk (default: 1000)
  • chunk_overlap: Number of characters to overlap between chunks (default: 0)
  • separators: Delimiters used by RecursiveCharacterTextSplitter for further splitting
  • elements_to_preserve: HTML tags to remain intact during splitting
  • preserve_links: Whether to convert <a> tags to Markdown links (default: False)
  • preserve_images: Whether to convert <img> tags to Markdown images (default: False)
  • preserve_videos: Whether to convert <video> tags to Markdown video links (default: False)
  • preserve_audio: Whether to convert <audio> tags to Markdown audio links (default: False)
  • custom_handlers: Custom element handlers for specific tags
  • stopword_removal: Whether to remove stopwords from text (default: False)
  • stopword_lang: Language for stopword removal (default: "english")
  • normalize_text: Whether to normalize text during processing (default: False)
  • external_metadata: Additional metadata to include in all documents
  • allowlist_tags: HTML tags to specifically include in processing
  • denylist_tags: HTML tags to exclude from processing
  • preserve_parent_metadata: Whether to preserve metadata from parent elements (default: False)
  • keep_separator: Whether to keep separators and where to place them (default: True)

Markdown Document Splitting

Specialized splitters for Markdown content that understand heading hierarchy and structure.

Markdown Text Splitter

Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.

class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...

Markdown Header Text Splitter

Splits Markdown content based on header levels while preserving document structure.

class MarkdownHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,
        strip_headers: bool = True,
        custom_header_patterns: Optional[dict[int, str]] = None
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...

Parameters:

  • headers_to_split_on: List of tuples (header_level, header_name)
  • return_each_line: Whether to return each line as a separate document (default: False)
  • strip_headers: Whether to remove header text from content
  • custom_header_patterns: Custom regex patterns for header detection

Usage:

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_content = """
# Chapter 1
Content of chapter 1...

## Section 1.1
Content of section 1.1...

### Subsection 1.1.1
Content of subsection...
"""

documents = markdown_splitter.split_text(markdown_content)

Experimental Markdown Syntax Text Splitter

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.

class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(
        self,
        headers_to_split_on: Optional[list[tuple[str, str]]] = None,
        return_each_line: bool = False,
        strip_headers: bool = True
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...

JSON Data Splitting

Specialized splitter for JSON data that preserves structure while creating manageable chunks.

class RecursiveJsonSplitter:
    def __init__(
        self,
        max_chunk_size: int = 2000,
        min_chunk_size: Optional[int] = None
    ) -> None: ...
    
    def split_json(
        self,
        json_data: dict,
        convert_lists: bool = False
    ) -> list[dict]: ...
    
    def split_text(
        self,
        json_data: dict,
        convert_lists: bool = False,
        ensure_ascii: bool = True
    ) -> list[str]: ...
    
    def create_documents(
        self,
        texts: list[dict],
        convert_lists: bool = False,
        ensure_ascii: bool = True,
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

Parameters:

  • max_chunk_size: Maximum size of each chunk, measured over the serialized JSON (default: 2000)
  • min_chunk_size: Minimum chunk size; when not set, defaults to max_chunk_size - 200 (with a floor of 50)

Methods:

  • split_json(): Split JSON into dictionary chunks
  • split_text(): Split JSON into string chunks
  • create_documents(): Create Document objects from JSON

Usage:

from langchain_text_splitters import RecursiveJsonSplitter
import json

json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)

# Large JSON data
large_json = {
    "users": [
        {"id": 1, "name": "Alice", "data": {...}},
        {"id": 2, "name": "Bob", "data": {...}},
        # ... many more users
    ],
    "metadata": {"version": "1.0", "created": "2023-01-01"}
}

# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)

# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)

# Create Document objects
documents = json_splitter.create_documents([large_json])

Type Definitions

Document structure splitters use several type definitions for metadata and configuration:

class ElementType(TypedDict):
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str
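
These are plain TypedDicts, so they describe dictionary shapes rather than runtime classes. A minimal, self-contained illustration of the LineType shape (the header values are made up):

```python
from typing import TypedDict

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str

# A line record of the kind aggregate_lines_to_chunks() consumes:
# metadata holds the active header hierarchy, content holds the line text
line: LineType = {
    "metadata": {"Header 1": "Chapter 1", "Header 2": "Section 1.1"},
    "content": "Content of section 1.1...",
}
```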

Best Practices

  1. Choose appropriate headers: Select header levels that represent logical document divisions
  2. Preserve metadata: Document structure splitters maintain hierarchical metadata for context
  3. Handle nested structures: JSON splitter respects nested object and array boundaries
  4. Configure chunk sizes: Balance between context preservation and manageable chunk sizes
  5. Test with your documents: Different document structures may require different splitting strategies
  6. Use semantic preservation: For HTML, consider using the semantic preserving splitter for better structure retention
