tessl/pypi-langchain-text-splitters

LangChain text splitting utilities for breaking documents into manageable chunks for AI processing

NLP-Based Text Splitting

NLP-based text splitting provides intelligent text segmentation using natural language processing libraries. These splitters understand linguistic boundaries such as sentences and phrases, making them ideal for processing natural language text while preserving semantic coherence.
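To see why linguistic boundaries matter, here is a deliberately naive, library-free sentence splitter. The regex is a toy stand-in for a trained tokenizer such as NLTK's Punkt, and its failure mode motivates the splitters below:

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Toy splitter: break on '.', '!' or '?' followed by whitespace.
    No knowledge of abbreviations, quotes, or decimals."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

parts = naive_sentence_split("Dr. Smith arrived. He was late! Why? Nobody knew.")
print(parts)
```

The naive regex wrongly splits after the abbreviation "Dr.", producing five fragments instead of four sentences. Trained NLP tokenizers avoid exactly this class of error, which is what makes the splitters in this document preferable for natural language text.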

Capabilities

NLTK Text Splitting

Text splitting using NLTK's sentence tokenization, supporting multiple languages and tokenization approaches.

class NLTKTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        language: str = "english",
        *,
        use_span_tokenize: bool = False,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • language: Language for NLTK sentence tokenization (default: "english")
  • use_span_tokenize: Whether to use NLTK's span_tokenize so the original whitespace between sentences is preserved; requires separator="" (default: False)
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import NLTKTextSplitter

# Basic NLTK splitting
nltk_splitter = NLTKTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    language="english"
)

text = """
Natural language processing is a fascinating field. It combines computer science and linguistics. 
Machine learning has revolutionized how we approach NLP tasks. Today's models can understand 
context and generate human-like text. However, challenges remain in areas like common sense 
reasoning and multilingual understanding.
"""

chunks = nltk_splitter.split_text(text)

# Multi-language support
spanish_splitter = NLTKTextSplitter(
    language="spanish",
    chunk_size=800,
    separator="\n"
)

spanish_text = """
El procesamiento de lenguaje natural es un campo fascinante. Combina ciencias de la computación 
y lingüística. El aprendizaje automático ha revolucionado cómo abordamos las tareas de PLN.
"""

spanish_chunks = spanish_splitter.split_text(spanish_text)

# Span tokenization preserves the original inter-sentence whitespace
span_splitter = NLTKTextSplitter(
    use_span_tokenize=True,
    separator="",  # span tokenization requires an empty separator
    chunk_size=1200,
    language="english"
)

Supported Languages: NLTK supports sentence tokenization for multiple languages including:

  • English, Spanish, French, German, Italian, Portuguese
  • Dutch, Russian, Czech, Polish, Turkish
  • And many others depending on NLTK data availability
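After tokenizing into sentences, these splitters pack sentences into chunks of at most chunk_size characters, joined by separator. A simplified, library-free sketch of that merge step (the real TextSplitter base class also handles chunk_overlap and custom length functions):

```python
def merge_sentences(sentences: list[str], chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sent in sentences:
        # Joining a new sentence costs its length plus one separator
        extra = len(sent) + (len(separator) if current else 0)
        if current and length + extra > chunk_size:
            chunks.append(separator.join(current))
            current, length = [], 0
            extra = len(sent)
        current.append(sent)
        length += extra
    if current:
        chunks.append(separator.join(current))
    return chunks

sents = ["One sentence.", "Another one.", "A third sentence here."]
result = merge_sentences(sents, chunk_size=30, separator=" ")
print(result)
```

With chunk_size=30 the first two sentences fit together (26 characters joined) while the third starts a new chunk; sentence boundaries are never broken, only the grouping changes.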

spaCy Text Splitting

Text splitting using spaCy's advanced NLP pipeline with sentence segmentation and linguistic analysis.

class SpacyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1000000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • pipeline: spaCy pipeline/model name; pass "sentencizer" for a fast rule-based sentence splitter that needs no model download (default: "en_core_web_sm")
  • max_length: Maximum number of characters spaCy will accept in a single text before raising an error (default: 1000000)
  • strip_whitespace: Whether to strip whitespace from chunks (default: True)
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import SpacyTextSplitter

# Basic spaCy splitting
spacy_splitter = SpacyTextSplitter(
    pipeline="en_core_web_sm",
    chunk_size=1000,
    chunk_overlap=100
)

text = """
The field of artificial intelligence has seen remarkable progress in recent years. Deep learning 
models have achieved human-level performance on many tasks. Computer vision systems can now 
recognize objects with incredible accuracy. Natural language models can generate coherent text 
and engage in meaningful conversations.
"""

chunks = spacy_splitter.split_text(text)

# Different language models
german_splitter = SpacyTextSplitter(
    pipeline="de_core_news_sm",  # German model
    chunk_size=800,
    separator="\n"
)

# Larger models for better accuracy
large_splitter = SpacyTextSplitter(
    pipeline="en_core_web_lg",  # Large English model
    chunk_size=1500,
    max_length=2000000  # Handle longer texts
)

# Custom separator and settings
custom_splitter = SpacyTextSplitter(
    pipeline="en_core_web_md",
    separator=" | ",  # Custom separator
    strip_whitespace=False,
    chunk_size=600
)

Popular spaCy Models:

  • English: en_core_web_sm, en_core_web_md, en_core_web_lg
  • German: de_core_news_sm, de_core_news_md, de_core_news_lg
  • French: fr_core_news_sm, fr_core_news_md, fr_core_news_lg
  • Spanish: es_core_news_sm, es_core_news_md, es_core_news_lg
  • Chinese: zh_core_web_sm, zh_core_web_md, zh_core_web_lg
  • Japanese: ja_core_news_sm, ja_core_news_md, ja_core_news_lg
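The naming convention above is regular (`<lang>_core_<web|news>_<size>`), so choosing a default pipeline per language can be table-driven. A small illustrative helper, covering only the models listed above (the mapping and function name are this document's own, not part of spaCy or LangChain):

```python
# Small ("sm") pipeline per ISO language code, from the list above.
SMALL_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
    "zh": "zh_core_web_sm",
    "ja": "ja_core_news_sm",
}

def small_pipeline(lang_code: str) -> str:
    """Return the small spaCy pipeline name for a language code."""
    try:
        return SMALL_MODELS[lang_code]
    except KeyError:
        raise ValueError(f"No small model listed for language {lang_code!r}")

print(small_pipeline("de"))
```

The returned name can be passed directly as the `pipeline` argument to `SpacyTextSplitter`, provided the model has been downloaded first.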

Korean Language Text Splitting

Specialized text splitting for Korean using KoNLPy's Kkma morphological analyzer.

class KonlpyTextSplitter(TextSplitter):
    def __init__(
        self,
        separator: str = "\n\n",
        **kwargs: Any
    ) -> None: ...
    
    def split_text(self, text: str) -> list[str]: ...

Parameters:

  • separator: Separator used to join sentences into chunks (default: "\n\n")
  • **kwargs: Additional parameters passed to TextSplitter.__init__()

Usage:

from langchain_text_splitters import KonlpyTextSplitter

korean_splitter = KonlpyTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

korean_text = """
자연어 처리는 컴퓨터 과학과 언어학을 결합한 흥미로운 분야입니다. 기계 학습이 자연어 처리 작업에 
접근하는 방식을 혁신했습니다. 오늘날의 모델들은 맥락을 이해하고 인간과 같은 텍스트를 생성할 수 
있습니다. 그러나 상식적 추론과 다국어 이해와 같은 영역에서는 여전히 과제가 남아 있습니다.
"""

chunks = korean_splitter.split_text(korean_text)

# Custom separator for Korean text
korean_custom_splitter = KonlpyTextSplitter(
    separator="\n",
    chunk_size=600,
    chunk_overlap=50
)

The Korean splitter uses KoNLPy's Kkma analyzer, which provides:

  • Morphological analysis
  • Sentence boundary detection
  • Support for Korean linguistic structures
  • Proper handling of Korean punctuation and spacing

Installation Requirements

Each NLP-based splitter requires specific dependencies:

NLTK Text Splitter

pip install nltk

Download required NLTK data:

import nltk

# Download the Punkt sentence models if not already present
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt")      # classic Punkt models
    nltk.download("punkt_tab")  # required by newer NLTK releases

spaCy Text Splitter

pip install spacy

Download language models:

# English
python -m spacy download en_core_web_sm

# Other languages
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish

KoNLPy Text Splitter

pip install konlpy

Note: KoNLPy's Kkma analyzer runs on the JVM, so a Java runtime (JDK) is required in addition to the pip package; see the KoNLPy installation guide for platform-specific steps.

Comparison of NLP Splitters

| Splitter | Strengths | Best Use Cases | Performance |
| --- | --- | --- | --- |
| NLTK | Lightweight, many languages, fast setup | Simple sentence splitting, multilingual text | Fast |
| spaCy | Advanced NLP, high accuracy, robust models | High-quality text processing, complex documents | Medium-Fast |
| KoNLPy | Korean language expertise, morphological analysis | Korean text processing, Korean NLP tasks | Medium |

Best Practices

  1. Choose the right tool: Use NLTK for simple sentence splitting, spaCy for advanced analysis, KoNLPy for Korean
  2. Model selection: Choose model size based on accuracy vs. speed trade-offs
  3. Language matching: Use language-specific models for non-English text
  4. Memory considerations: Larger spaCy models require more memory
  5. Preprocessing: Clean text before NLP processing for better results
  6. Sentence coherence: NLP splitters maintain sentence boundaries, preserving semantic coherence
  7. Cultural context: For specialized domains or cultures, consider domain-specific models
  8. Performance testing: Benchmark different splitters with your specific text types
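The preprocessing recommended in point 5 can be as simple as rejoining lines that were hard-wrapped mid-sentence, so stray newlines do not mislead sentence boundary detection. A minimal stdlib sketch (the function name is illustrative, not part of the library):

```python
import re

def clean_text(text: str) -> str:
    """Replace each line break (and surrounding spaces/tabs)
    with a single space, then trim the ends."""
    return re.sub(r"[ \t]*\n[ \t]*", " ", text).strip()

raw = "Sentence one\ncontinues here.  \nSentence two."
print(clean_text(raw))
```

Note that this also collapses paragraph breaks ("\n\n"); if paragraph structure should be preserved as a splitting hint, restrict the pattern to single newlines instead.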

Install with Tessl CLI

npx tessl i tessl/pypi-langchain-text-splitters
