LangChain text splitting utilities for breaking documents into manageable chunks for AI processing
---
Document structure-aware splitting provides specialized text segmentation that understands and preserves the structural elements of various document formats. These splitters maintain semantic context by respecting document hierarchy, headers, and formatting while creating appropriately sized chunks.
Specialized splitters for HTML content that preserve document structure and semantic elements.
Splits HTML content based on header tags while preserving document hierarchy and metadata.
class HTMLHeaderTextSplitter:
def __init__(
self,
headers_to_split_on: list[tuple[str, str]],
return_each_element: bool = False
) -> None: ...
def split_text(self, text: str) -> list[Document]: ...
def split_text_from_url(
self,
url: str,
timeout: int = 10,
**kwargs: Any
) -> list[Document]: ...
def split_text_from_file(self, file: Any) -> list[Document]: ...

Parameters:
headers_to_split_on: List of tuples (header_tag, header_name) defining split points
return_each_element: Whether to return each element separately (default: False)

Usage:
from langchain_text_splitters import HTMLHeaderTextSplitter
# Define headers to split on
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)
# Split HTML from URL
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)
# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)

Advanced HTML splitting based on tags and font sizes, requiring lxml for enhanced processing.
class HTMLSectionSplitter:
def __init__(
self,
headers_to_split_on: list[tuple[str, str]],
**kwargs: Any
) -> None: ...
def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...
def split_text(self, text: str) -> list[Document]: ...
def create_documents(
self,
texts: list[str],
metadatas: Optional[list[dict[Any, Any]]] = None
) -> list[Document]: ...
def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...
def convert_possible_tags_to_header(self, html_content: str) -> str: ...
def split_text_from_file(self, file: Any) -> list[Document]: ...

Beta-stage advanced HTML splitter that preserves semantic structure with media handling capabilities.
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
def __init__(
self,
headers_to_split_on: list[tuple[str, str]],
*,
max_chunk_size: int = 1000,
chunk_overlap: int = 0,
separators: Optional[list[str]] = None,
elements_to_preserve: Optional[list[str]] = None,
preserve_links: bool = False,
preserve_images: bool = False,
preserve_videos: bool = False,
preserve_audio: bool = False,
custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
stopword_removal: bool = False,
stopword_lang: str = "english",
normalize_text: bool = False,
external_metadata: Optional[dict[str, str]] = None,
allowlist_tags: Optional[list[str]] = None,
denylist_tags: Optional[list[str]] = None,
preserve_parent_metadata: bool = False,
keep_separator: Union[bool, Literal["start", "end"]] = True
) -> None: ...
def split_text(self, text: str) -> list[Document]: ...
def transform_documents(
self,
documents: Sequence[Document],
**kwargs: Any
) -> list[Document]: ...

Parameters:
max_chunk_size: Maximum size of each chunk (default: 1000)
chunk_overlap: Number of characters to overlap between chunks (default: 0)
separators: Delimiters used by RecursiveCharacterTextSplitter for further splitting
elements_to_preserve: HTML tags to remain intact during splitting
preserve_links: Whether to convert <a> tags to Markdown links (default: False)
preserve_images: Whether to convert <img> tags to Markdown images (default: False)
preserve_videos: Whether to convert <video> tags to Markdown video links (default: False)
preserve_audio: Whether to convert <audio> tags to Markdown audio links (default: False)
custom_handlers: Custom element handlers for specific tags
stopword_removal: Whether to remove stopwords from text (default: False)
stopword_lang: Language for stopword removal (default: "english")
normalize_text: Whether to normalize text during processing (default: False)
external_metadata: Additional metadata to include in all documents
allowlist_tags: HTML tags to specifically include in processing
denylist_tags: HTML tags to exclude from processing
preserve_parent_metadata: Whether to preserve metadata from parent elements (default: False)
keep_separator: Whether to keep separators and where to place them (default: True)

Specialized splitters for Markdown content that understand heading hierarchy and structure.
Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.
class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
def __init__(self, **kwargs: Any) -> None: ...

Splits Markdown content based on header levels while preserving document structure.
class MarkdownHeaderTextSplitter:
def __init__(
self,
headers_to_split_on: list[tuple[str, str]],
return_each_line: bool = False,
strip_headers: bool = True,
custom_header_patterns: Optional[dict[int, str]] = None
) -> None: ...
def split_text(self, text: str) -> list[Document]: ...
def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...

Parameters:
headers_to_split_on: List of tuples (header_level, header_name)
return_each_line: Whether to return each line as a separate document
strip_headers: Whether to remove header text from content
custom_header_patterns: Custom regex patterns for header detection

Usage:
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Define headers to split on
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
markdown_content = """
# Chapter 1
Content of chapter 1...
## Section 1.1
Content of section 1.1...
### Subsection 1.1.1
Content of subsection...
"""
documents = markdown_splitter.split_text(markdown_content)

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.
class ExperimentalMarkdownSyntaxTextSplitter:
def __init__(
self,
headers_to_split_on: Optional[list[tuple[str, str]]] = None,
return_each_line: bool = False,
strip_headers: bool = True
) -> None: ...
def split_text(self, text: str) -> list[Document]: ...

Specialized splitter for JSON data that preserves structure while creating manageable chunks.
class RecursiveJsonSplitter:
def __init__(
self,
max_chunk_size: int = 2000,
min_chunk_size: Optional[int] = None
) -> None: ...
def split_json(
self,
json_data: dict,
convert_lists: bool = False
) -> list[dict]: ...
def split_text(
self,
json_data: dict,
convert_lists: bool = False,
ensure_ascii: bool = True
) -> list[str]: ...
def create_documents(
self,
texts: list[dict],
convert_lists: bool = False,
ensure_ascii: bool = True,
metadatas: Optional[list[dict[Any, Any]]] = None
) -> list[Document]: ...

Parameters:
max_chunk_size: Maximum size of JSON chunks
min_chunk_size: Minimum size for chunk splitting

Methods:
split_json(): Split JSON into dictionary chunks
split_text(): Split JSON into string chunks
create_documents(): Create Document objects from JSON

Usage:
from langchain_text_splitters import RecursiveJsonSplitter
import json
json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)
# Large JSON data
large_json = {
"users": [
{"id": 1, "name": "Alice", "data": {...}},
{"id": 2, "name": "Bob", "data": {...}},
# ... many more users
],
"metadata": {"version": "1.0", "created": "2023-01-01"}
}
# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)
# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)
# Create Document objects
documents = json_splitter.create_documents([large_json])

Document structure splitters use several type definitions for metadata and configuration:
class ElementType(TypedDict):
url: str
xpath: str
content: str
metadata: dict[str, str]
class HeaderType(TypedDict):
level: int
name: str
data: str
class LineType(TypedDict):
metadata: dict[str, str]
content: str

Install with Tessl CLI
npx tessl i tessl/pypi-langchain-text-splitters