tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

NLP-Based Text Splitting

NLP-based text splitting provides intelligent text segmentation using natural language processing libraries. These splitters understand linguistic boundaries such as sentences and phrases, making them ideal for processing natural language text while preserving semantic coherence.
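To see why linguistic boundaries matter, here is a deliberately naive, library-free sentence splitter. The regex is a toy stand-in for a trained tokenizer such as NLTK's Punkt, and its failure mode motivates the splitters below:

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Toy splitter: break on '.', '!' or '?' followed by whitespace.
    No knowledge of abbreviations, quotes, or decimals."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

parts = naive_sentence_split("Dr. Smith arrived. He was late! Why? Nobody knew.")
print(parts)
```

The naive regex wrongly splits after the abbreviation "Dr.", producing five fragments instead of four sentences. Trained NLP tokenizers avoid exactly this class of error, which is what makes the splitters in this document preferable for natural language text.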

Capabilities

NLTK Text Splitting

Text splitting using NLTK's sentence tokenization, supporting multiple languages and tokenization approaches.

class NLTKTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        language: str = "english",
        *,
        use_span_tokenize: bool = False,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • language: Language for NLTK sentence tokenization (default: "english")
  • use_span_tokenize: Whether to use NLTK's span_tokenize so the original whitespace between sentences is preserved; requires separator="" (default: False)
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import NLTKTextSplitter

# Basic NLTK splitting
nltk_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    language="english"
)

text = """
Natural language processing is a fascinating field. It combines computer science and linguistics. 
Machine learning has revolutionized how we approach NLP tasks. Today's models can understand 
context and generate human-like text. However, challenges remain in areas like common sense 
reasoning and multilingual understanding.
"""

chunks = nltk_splitter.split_text(text)

# Multi-language support
spanish_splitter = NLTKTextSplitter(
    language="spanish",
    chunk_size=800,
    separator="\n"
)

spanish_text = """
El procesamiento de lenguaje natural es un campo fascinante. Combina ciencias de la computación 
y lingüística. El aprendizaje automático ha revolucionado cómo abordamos las tareas de PLN.
"""

spanish_chunks = spanish_splitter.split_text(spanish_text)

# Span tokenization preserves the original inter-sentence whitespace
span_splitter = NLTKTextSplitter(
    use_span_tokenize=True,
    separator="",  # span tokenization requires an empty separator
    chunk_size=1200,
    language="english"
)

Supported Languages: NLTK supports sentence tokenization for multiple languages including:

  • English, Spanish, French, German, Italian, Portuguese
  • Dutch, Russian, Czech, Polish, Turkish
  • And many others depending on NLTK data availability
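After tokenizing into sentences, these splitters pack sentences into chunks of at most chunk_size characters, joined by separator. A simplified, library-free sketch of that merge step (the real TextSplitter base class also handles chunk_overlap and custom length functions):

```python
def merge_sentences(sentences: list[str], chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sent in sentences:
        # Joining a new sentence costs its length plus one separator
        extra = len(sent) + (len(separator) if current else 0)
        if current and length + extra > chunk_size:
            chunks.append(separator.join(current))
            current, length = [], 0
            extra = len(sent)
        current.append(sent)
        length += extra
    if current:
        chunks.append(separator.join(current))
    return chunks

sents = ["One sentence.", "Another one.", "A third sentence here."]
result = merge_sentences(sents, chunk_size=30, separator=" ")
print(result)
```

With chunk_size=30 the first two sentences fit together (26 characters joined) while the third starts a new chunk; sentence boundaries are never broken, only the grouping changes.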

spaCy Text Splitting

Text splitting using spaCy's advanced NLP pipeline with sentence segmentation and linguistic analysis.

class SpacyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1000000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • pipeline: spaCy pipeline/model name; pass "sentencizer" for a fast rule-based sentence splitter that needs no model download (default: "en_core_web_sm")
  • max_length: Maximum number of characters spaCy will accept in a single text before raising an error (default: 1000000)
  • strip_whitespace: Whether to strip whitespace from chunks (default: True)
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import SpacyTextSplitter

# Basic spaCy splitting
spacy_splitter = SpacyTextSplitter(
    pipeline="en_core_web_sm",
    chunk_size=1000,
    chunk_overlap=100
)

text = """
The field of artificial intelligence has seen remarkable progress in recent years. Deep learning 
models have achieved human-level performance on many tasks. Computer vision systems can now 
recognize objects with incredible accuracy. Natural language models can generate coherent text 
and engage in meaningful conversations.
"""

chunks = spacy_splitter.split_text(text)

# Different language models
german_splitter = SpacyTextSplitter(
    pipeline="de_core_news_sm",  # German model
    chunk_size=800,
    separator="\n"
)

# Larger models for better accuracy
large_splitter = SpacyTextSplitter(
    pipeline="en_core_web_lg",  # Large English model
    chunk_size=1500,
    max_length=2000000  # Handle longer texts
)

# Custom separator and settings
custom_splitter = SpacyTextSplitter(
    pipeline="en_core_web_md",
    separator=" | ",  # Custom separator
    strip_whitespace=False,
    chunk_size=600
)

Popular spaCy Models:

  • English: en_core_web_sm, en_core_web_md, en_core_web_lg
  • German: de_core_news_sm, de_core_news_md, de_core_news_lg
  • French: fr_core_news_sm, fr_core_news_md, fr_core_news_lg
  • Spanish: es_core_news_sm, es_core_news_md, es_core_news_lg
  • Chinese: zh_core_web_sm, zh_core_web_md, zh_core_web_lg
  • Japanese: ja_core_news_sm, ja_core_news_md, ja_core_news_lg
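The naming convention above is regular (`<lang>_core_<web|news>_<size>`), so choosing a default pipeline per language can be table-driven. A small illustrative helper, covering only the models listed above (the mapping and function name are this document's own, not part of spaCy or LangChain):

```python
# Small ("sm") pipeline per ISO language code, from the list above.
SMALL_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
    "zh": "zh_core_web_sm",
    "ja": "ja_core_news_sm",
}

def small_pipeline(lang_code: str) -> str:
    """Return the small spaCy pipeline name for a language code."""
    try:
        return SMALL_MODELS[lang_code]
    except KeyError:
        raise ValueError(f"No small model listed for language {lang_code!r}")

print(small_pipeline("de"))
```

The returned name can be passed directly as the `pipeline` argument to `SpacyTextSplitter`, provided the model has been downloaded first.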

Korean Language Text Splitting

Specialized text splitting for Korean using KoNLPy's Kkma morphological analyzer.

class KonlpyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import KonlpyTextSplitter

korean_splitter = KonlpyTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

korean_text = """
자연어 처리는 컴퓨터 과학과 언어학을 결합한 흥미로운 분야입니다. 기계 학습이 자연어 처리 작업에 
접근하는 방식을 혁신했습니다. 오늘날의 모델들은 맥락을 이해하고 인간과 같은 텍스트를 생성할 수 
있습니다. 그러나 상식적 추론과 다국어 이해와 같은 영역에서는 여전히 과제가 남아 있습니다.
"""

chunks = korean_splitter.split_text(korean_text)

# Custom separator for Korean text
korean_custom_splitter = KonlpyTextSplitter(
    separator="\n",
    chunk_size=600,
    chunk_overlap=50
)

The Korean splitter uses KoNLPy's Kkma analyzer, which provides:

  • Morphological analysis
  • Sentence boundary detection
  • Support for Korean linguistic structures
  • Proper handling of Korean punctuation and spacing

Installation Requirements

Each NLP-based splitter requires specific dependencies:

NLTK Text Splitter

pip install nltk

Download required NLTK data:

import nltk

# Download the Punkt sentence models if not already present
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt")      # classic Punkt models
    nltk.download("punkt_tab")  # required by newer NLTK releases

spaCy Text Splitter

pip install spacy

Download language models:

# English
python -m spacy download en_core_web_sm

# Other languages
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish

KoNLPy Text Splitter

pip install konlpy

Note: KoNLPy's Kkma analyzer runs on the JVM, so a Java runtime (JDK) is required in addition to the pip package; see the KoNLPy installation guide for platform-specific steps.

Comparison of NLP Splitters

| Splitter | Strengths | Best Use Cases | Performance |
| --- | --- | --- | --- |
| NLTK | Lightweight, many languages, fast setup | Simple sentence splitting, multilingual text | Fast |
| spaCy | Advanced NLP, high accuracy, robust models | High-quality text processing, complex documents | Medium-Fast |
| KoNLPy | Korean language expertise, morphological analysis | Korean text processing, Korean NLP tasks | Medium |

Best Practices

  1. Choose the right tool: Use NLTK for simple sentence splitting, spaCy for advanced analysis, KoNLPy for Korean
  2. Model selection: Choose model size based on accuracy vs. speed trade-offs
  3. Language matching: Use language-specific models for non-English text
  4. Memory considerations: Larger spaCy models require more memory
  5. Preprocessing: Clean text before NLP processing for better results
  6. Sentence coherence: NLP splitters maintain sentence boundaries, preserving semantic coherence
  7. Cultural context: For specialized domains or cultures, consider domain-specific models
  8. Performance testing: Benchmark different splitters with your specific text types
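The preprocessing recommended in point 5 can be as simple as rejoining lines that were hard-wrapped mid-sentence, so stray newlines do not mislead sentence boundary detection. A minimal stdlib sketch (the function name is illustrative, not part of the library):

```python
import re

def clean_text(text: str) -> str:
    """Replace each line break (and surrounding spaces/tabs)
    with a single space, then trim the ends."""
    return re.sub(r"[ \t]*\n[ \t]*", " ", text).strip()

raw = "Sentence one\ncontinues here.  \nSentence two."
print(clean_text(raw))
```

Note that this also collapses paragraph breaks ("\n\n"); if paragraph structure should be preserved as a splitting hint, restrict the pattern to single newlines instead.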

Install with Tessl CLI

npx tessl i tessl/pypi-langchain-text-splitters
