giuseppe-trisciuoglio/developer-kit

Comprehensive developer toolkit providing reusable skills for Java/Spring Boot, TypeScript/NestJS/React/Next.js, Python, PHP, AWS CloudFormation, AI/RAG, DevOps, and more.

Document Chunking Strategies

Name: giuseppe-trisciuoglio/developer-kit
Rating: 90.85333333333334 (1 reviews)
Author: giuseppe-trisciuoglio

Overview

Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.

Chunking Strategies

1. Recursive Character Text Splitter

Method: Split text based on character count, trying separators in order
Use Case: General-purpose text splitting
Advantages: Preserves sentence and paragraph boundaries when possible

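from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Maximum chunk length as measured by length_function (characters here)
    chunk_overlap=200,    # Length shared between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
chunks = splitter.split_documents(documents)  # documents: list of LangChain Document objects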
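
To make the "trying separators in order" idea concrete, here is a minimal standard-library sketch of recursive splitting. It is an illustration only, not the library's implementation: the function name `recursive_split` is hypothetical, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; a finer separator is used only
    for pieces that are still too long after the coarser one.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
        if len(part) <= chunk_size:
            current = part
        else:
            # Part is still too long: retry with the finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(recursive_split(text, 12, ["\n\n", "\n", " ", ""]))
# ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

The paragraph break is honored first; only the middle paragraph, which exceeds the limit, is split again on spaces.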

2. Token-Based Splitting

Method: Split based on token count rather than characters
Use Case: When working with token limits of language models
Advantages: Better control over context window usage

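from langchain_text_splitters import TokenTextSplitter

# Sizes are counted in tokens (requires the tiktoken package)
splitter = TokenTextSplitter(
    chunk_size=512,     # Maximum tokens per chunk
    chunk_overlap=50    # Tokens shared between consecutive chunks
)
chunks = splitter.split_documents(documents)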
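
The windowing behind token-based splitting can be sketched as follows. This is an assumption-laden illustration: `split_by_tokens` is a hypothetical helper, and whitespace splitting stands in for a real model tokenizer, which would produce different token counts.

```python
def split_by_tokens(text, chunk_size, chunk_overlap,
                    tokenize=str.split, detokenize=" ".join):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


print(split_by_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last token of the previous one, which is the overlap at work.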

3. Semantic Chunking

Method: Split based on semantic similarity
Use Case: When maintaining semantic coherence is important
Advantages: Chunks are more semantically meaningful

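# SemanticChunker ships in the langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),  # Needs an OpenAI API key in the environment
    breakpoint_threshold_type="percentile"  # Split where neighbor distance is unusually high
)
chunks = splitter.split_documents(documents)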
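
As a rough illustration of percentile-based breakpoints, the sketch below measures the distance between neighboring sentences and starts a new chunk wherever that distance exceeds a chosen percentile of all distances. Everything here is an assumption made for the example: `embed` is a toy bag-of-words stand-in for a real embedding model, the percentile uses a simple nearest-rank rule, and the library's own algorithm may differ in its details.

```python
import math
import re
from collections import Counter


def embed(sentence):
    # Toy stand-in for an embedding model: lowercase word counts
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, percentile):
    """Group sentences, breaking where neighbor distance exceeds the percentile threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    ranked = sorted(distances)
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))  # nearest-rank percentile
    threshold = ranked[rank - 1]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr when happy.",
    "Cats purr when fed.",
    "Stocks fell on Monday.",
    "Stocks fell on Tuesday.",
]
for chunk in semantic_chunks(sentences, percentile=50):
    print(chunk)
```

With these four sentences the only distance above the threshold is the one between the second and third, so the cat sentences and the stock sentences end up in separate chunks.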

4. Markdown Header Splitter

Method: Split based on Markdown headers
Use Case: Structured documents with clear hierarchical organization
Advantages: Maintains document structure and context

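from langchain_text_splitters import MarkdownHeaderTextSplitter

# (header prefix, metadata key) pairs; matched header text is stored in each chunk's metadata
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# This splitter takes a raw Markdown string, not a list of Document objects
chunks = splitter.split_text(markdown_text)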
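
The idea of carrying header context along with each chunk can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `split_markdown_by_headers` is a hypothetical helper, it recognizes only `#`-style headers, and it does not skip `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown, max_level=3):
    """Return sections as dicts holding the header trail and the body text."""
    sections = []
    trail = {}   # header level -> title, e.g. {1: "Guide", 2: "Setup"}
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": dict(trail), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for stale in [k for k in trail if k >= level]:
                del trail[stale]
            trail[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# FAQ\nAsk away."
for section in split_markdown_by_headers(doc):
    print(section)
```

Each section keeps the titles of its parent headers, so a chunk such as "Install it." can still be traced back to "Guide" and "Setup" at retrieval time.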

5. HTML Splitter

Method: Split based on HTML tags
Use Case: Web pages and HTML documents
Advantages: Preserves HTML structure and metadata

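from langchain_text_splitters import HTMLHeaderTextSplitter

# (tag name, metadata key) pairs; matched header text is stored in each chunk's metadata
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# This splitter takes a raw HTML string, not a list of Document objects
chunks = splitter.split_text(html_string)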

Parameter Tuning

Chunk Size

Small chunks (200-400 tokens): More precise retrieval, but may lose context
Medium chunks (500-1000 tokens): Good balance of precision and context
Large chunks (1000-2000 tokens): More context, but less precise retrieval

Chunk Overlap

Purpose: Preserve context at chunk boundaries
Typical range: 10-20% of chunk size
Higher overlap: Better context preservation, but more redundancy
Lower overlap: Less redundancy, but may lose important context
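
The redundancy cost of overlap is easy to estimate with a little arithmetic. The sketch below assumes a plain sliding window over a token sequence; `chunk_stats` is a hypothetical helper, and the 10,000-token document and 500-token chunk size are made-up figures for illustration.

```python
import math


def chunk_stats(total_tokens, chunk_size, overlap_ratio):
    """Return (number of chunks, total tokens stored) for a sliding window."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0, 0
    count = 1 + math.ceil(max(0, total_tokens - chunk_size) / step)
    # Every chunk after the first repeats `overlap` tokens of its predecessor
    stored = total_tokens + (count - 1) * overlap
    return count, stored


for ratio in (0.0, 0.1, 0.2):
    count, stored = chunk_stats(10_000, 500, ratio)
    print(f"overlap {ratio:.0%}: {count} chunks, {stored} tokens stored")
# overlap 0%: 20 chunks, 10000 tokens stored
# overlap 10%: 23 chunks, 11100 tokens stored
# overlap 20%: 25 chunks, 12400 tokens stored
```

Under these assumptions, moving from no overlap to 20% overlap stores about 24% more tokens, which is the redundancy that has to be weighed against better context at chunk boundaries.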

Separators

Hierarchical separators: Start with larger boundaries (paragraphs), then smaller (sentences)
Custom separators: Add domain-specific separators for better results
Language-specific: Adjust for different languages and writing styles
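
As an example of a domain-specific separator, the sketch below splits a contract-style text in front of each "Article N." marker so that the marker stays attached to the text it introduces. The helper name `split_on_markers` and the sample text are assumptions made for illustration.

```python
import re


def split_on_markers(text, marker_pattern):
    """Split before every line that starts with marker_pattern, keeping the marker."""
    # The zero-width lookahead splits in front of the marker without consuming it
    pieces = re.split(rf"^(?={marker_pattern})", text, flags=re.MULTILINE)
    return [piece.strip() for piece in pieces if piece.strip()]


contract = (
    "Preamble text.\n"
    "Article 1. Scope\nApplies to all.\n"
    "Article 2. Terms\nDefined below."
)
for piece in split_on_markers(contract, r"Article \d+\."):
    print(repr(piece))
# 'Preamble text.'
# 'Article 1. Scope\nApplies to all.'
# 'Article 2. Terms\nDefined below.'
```

Pieces produced this way can still be passed to a size-based splitter when a single article exceeds the chunk size.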

Best Practices

Preserve Context: Ensure chunks contain enough surrounding context
Maintain Coherence: Keep semantically related content together
Respect Boundaries: Avoid breaking sentences or important phrases
Consider Query Types: Adapt chunking strategy to typical user queries
Test and Iterate: Evaluate different chunking strategies for your specific use case
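
One crude way to compare strategies is to check how often a known answer phrase survives intact inside a single chunk. The sketch below is only a proxy for real retrieval evaluation: the `answer_coverage` metric, the sample text, and the two splitting functions are all assumptions made for illustration.

```python
def answer_coverage(chunks, expected_answers):
    """Fraction of expected answer phrases found whole inside at least one chunk."""
    if not expected_answers:
        return 0.0
    hits = sum(1 for answer in expected_answers
               if any(answer in chunk for chunk in chunks))
    return hits / len(expected_answers)


def fixed_width(text, size):
    # Cuts every `size` characters, ignoring sentence boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]


def by_sentence(text):
    # Naive sentence split on ". "
    return [part for part in text.split(". ") if part]


text = "The capital of France is Paris. The capital of Italy is Rome."
answers = ["France is Paris", "Italy is Rome"]

print("fixed width:", answer_coverage(fixed_width(text, 20), answers))  # 0.5
print("by sentence:", answer_coverage(by_sentence(text), answers))      # 1.0
```

Here the fixed-width split cuts "France" in half, so only one of the two answers stays intact, while the sentence-based split keeps both.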