tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

Document Structure-Aware Splitting

Document structure-aware splitting provides specialized text segmentation that understands and preserves the structural elements of various document formats. These splitters maintain semantic context by respecting document hierarchy, headers, and formatting while creating appropriately sized chunks.

Capabilities

HTML Document Splitting

Specialized splitters for HTML content that preserve document structure and semantic elements.

HTML Header Text Splitter

Splits HTML content based on header tags while preserving document hierarchy and metadata.

class HTMLHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def split_text_from_url(
        self,
        url: str,
        timeout: int = 10,
        **kwargs: Any
    ) -> list[Document]: ...
    
    def split_text_from_file(self, file: Any) -> list[Document]: ...

Parameters:

  • headers_to_split_on: List of tuples (header_tag, header_name) defining split points
  • return_each_element: Whether to return each element separately (default: False)

Usage:

from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)

# Split HTML from URL
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)

# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)

HTML Section Splitter

Splits HTML into sections based on specified tags and font sizes. Requires lxml for its XSLT-based processing.

class HTMLSectionSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any
    ) -> None: ...
    
    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
    
    def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...
    
    def convert_possible_tags_to_header(self, html_content: str) -> str: ...
    
    def split_text_from_file(self, file: Any) -> list[Document]: ...

HTML Semantic Preserving Splitter

Beta-stage advanced HTML splitter that preserves semantic structure with media handling capabilities.

class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: Optional[list[str]] = None,
        elements_to_preserve: Optional[list[str]] = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: Optional[dict[str, str]] = None,
        allowlist_tags: Optional[list[str]] = None,
        denylist_tags: Optional[list[str]] = None,
        preserve_parent_metadata: bool = False,
        keep_separator: Union[bool, Literal["start", "end"]] = True
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> list[Document]: ...

Parameters:

  • max_chunk_size: Maximum size of each chunk (default: 1000)
  • chunk_overlap: Number of characters to overlap between chunks (default: 0)
  • separators: Delimiters used by RecursiveCharacterTextSplitter for further splitting
  • elements_to_preserve: HTML tags to remain intact during splitting
  • preserve_links: Whether to convert <a> tags to Markdown links (default: False)
  • preserve_images: Whether to convert <img> tags to Markdown images (default: False)
  • preserve_videos: Whether to convert <video> tags to Markdown video links (default: False)
  • preserve_audio: Whether to convert <audio> tags to Markdown audio links (default: False)
  • custom_handlers: Custom element handlers for specific tags
  • stopword_removal: Whether to remove stopwords from text (default: False)
  • stopword_lang: Language for stopword removal (default: "english")
  • normalize_text: Whether to normalize text during processing (default: False)
  • external_metadata: Additional metadata to include in all documents
  • allowlist_tags: HTML tags to specifically include in processing
  • denylist_tags: HTML tags to exclude from processing
  • preserve_parent_metadata: Whether to preserve metadata from parent elements (default: False)
  • keep_separator: Whether to keep separators and where to place them (default: True)

Markdown Document Splitting

Specialized splitters for Markdown content that understand heading hierarchy and structure.

Markdown Text Splitter

Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.

class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...

Markdown Header Text Splitter

Splits Markdown content based on header levels while preserving document structure.

class MarkdownHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,
        strip_headers: bool = True,
        custom_header_patterns: Optional[dict[int, str]] = None
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...
    
    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...

Parameters:

  • headers_to_split_on: List of tuples (header_level, header_name)
  • return_each_line: Whether to return each line as a separate document (default: False)
  • strip_headers: Whether to remove header text from content
  • custom_header_patterns: Custom regex patterns for header detection

Usage:

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_content = """
# Chapter 1
Content of chapter 1...

## Section 1.1
Content of section 1.1...

### Subsection 1.1.1
Content of subsection...
"""

documents = markdown_splitter.split_text(markdown_content)

Experimental Markdown Syntax Text Splitter

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.

class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(
        self,
        headers_to_split_on: Optional[list[tuple[str, str]]] = None,
        return_each_line: bool = False,
        strip_headers: bool = True
    ) -> None: ...
    
    def split_text(self, text: str) -> list[Document]: ...

JSON Data Splitting

Specialized splitter for JSON data that preserves structure while creating manageable chunks.

class RecursiveJsonSplitter:
    def __init__(
        self,
        max_chunk_size: int = 2000,
        min_chunk_size: Optional[int] = None
    ) -> None: ...
    
    def split_json(
        self,
        json_data: dict,
        convert_lists: bool = False
    ) -> list[dict]: ...
    
    def split_text(
        self,
        json_data: dict,
        convert_lists: bool = False,
        ensure_ascii: bool = True
    ) -> list[str]: ...
    
    def create_documents(
        self,
        texts: list[dict],
        convert_lists: bool = False,
        ensure_ascii: bool = True,
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

Parameters:

  • max_chunk_size: Maximum size of each chunk, measured over the serialized JSON (default: 2000)
  • min_chunk_size: Minimum chunk size; when not set, defaults to max_chunk_size - 200 (with a floor of 50)

Methods:

  • split_json(): Split JSON into dictionary chunks
  • split_text(): Split JSON into string chunks
  • create_documents(): Create Document objects from JSON

Usage:

from langchain_text_splitters import RecursiveJsonSplitter
import json

json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)

# Large JSON data
large_json = {
    "users": [
        {"id": 1, "name": "Alice", "data": {...}},
        {"id": 2, "name": "Bob", "data": {...}},
        # ... many more users
    ],
    "metadata": {"version": "1.0", "created": "2023-01-01"}
}

# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)

# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)

# Create Document objects
documents = json_splitter.create_documents([large_json])

Type Definitions

Document structure splitters use several type definitions for metadata and configuration:

class ElementType(TypedDict):
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str
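
These are plain TypedDicts, so they describe dictionary shapes rather than runtime classes. A minimal, self-contained illustration of the LineType shape (the header values are made up):

```python
from typing import TypedDict

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str

# A line record of the kind aggregate_lines_to_chunks() consumes:
# metadata holds the active header hierarchy, content holds the line text
line: LineType = {
    "metadata": {"Header 1": "Chapter 1", "Header 2": "Section 1.1"},
    "content": "Content of section 1.1...",
}
```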

Best Practices

  1. Choose appropriate headers: Select header levels that represent logical document divisions
  2. Preserve metadata: Document structure splitters maintain hierarchical metadata for context
  3. Handle nested structures: JSON splitter respects nested object and array boundaries
  4. Configure chunk sizes: Balance between context preservation and manageable chunk sizes
  5. Test with your documents: Different document structures may require different splitting strategies
  6. Use semantic preservation: For HTML, consider using the semantic preserving splitter for better structure retention
