
Recommended Libraries and Frameworks

This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.

Core Chunking Libraries

LangChain

Overview: A comprehensive framework for building applications with large language models; it includes robust text-splitting utilities.

Installation:

pip install langchain langchain-text-splitters

Key Features:

  • Multiple text splitting strategies
  • Integration with various document loaders
  • Support for different content types (code, markdown, etc.)
  • Customizable separators and parameters

Example Usage:

# In recent versions the splitters live in the langchain-text-splitters package
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter
)

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(large_text)

# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
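
The specialized splitters above are applied the same way as the recursive splitter. A minimal sketch, assuming markdown_text and python_source hold the raw file contents (both names are illustrative):

# Split markdown along heading/paragraph boundaries
markdown_chunks = markdown_splitter.split_text(markdown_text)

# Split Python source along class/function boundaries
code_chunks = code_splitter.split_text(python_source)

# create_documents wraps the chunks in Document objects and attaches metadata
docs = markdown_splitter.create_documents(
    [markdown_text],
    metadatas=[{"source": "README.md"}]
)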

Pros:

  • Well-maintained and actively developed
  • Extensive documentation and examples
  • Integrates well with other LangChain components
  • Supports multiple document types

Cons:

  • Can be a heavy dependency for simple use cases
  • Some advanced features require LangChain ecosystem

LlamaIndex

Overview: Data framework for LLM applications with advanced indexing and retrieval capabilities.

Installation:

pip install llama-index

Key Features:

  • Advanced semantic chunking
  • Hierarchical indexing
  • Context-aware retrieval
  • Integration with vector databases

Example Usage:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Basic sentence splitting
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)

Pros:

  • Excellent semantic chunking capabilities
  • Built for production RAG systems
  • Strong vector database integration
  • Active community support

Cons:

  • More complex setup for basic use cases
  • Semantic chunking requires an embedding model to be configured (a local alternative is sketched below)
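
If the OpenAI dependency is undesirable, a local embedding model can usually be swapped in. A minimal sketch, assuming the llama-index-embeddings-huggingface package is installed and reusing SemanticSplitterNodeParser from the example above:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local embedding model (no API key required); the model name is an example
local_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=local_embed_model
)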

Unstructured

Overview: Open-source library for processing unstructured documents, especially strong with multi-modal content.

Installation:

pip install "unstructured[pdf,png,jpg]"

Key Features:

  • Multi-modal document processing
  • Support for PDFs, images, and various formats
  • Structure preservation
  • Table extraction and processing

Example Usage:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition document by type
elements = partition(filename="document.pdf")

# Chunk by title/heading structure
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=2000,
    max_characters=10000,
    new_after_n_chars=1500,
    multipage_sections=True
)

# Access chunked content
for chunk in chunks:
    print(f"Category: {chunk.category}")
    print(f"Content: {chunk.text[:200]}...")

Pros:

  • Excellent for PDF and image processing
  • Preserves document structure
  • Handles tables and figures well
  • Strong multi-modal capabilities

Cons:

  • Can be slower for large documents
  • Requires additional dependencies for some formats

Text Processing Libraries

NLTK (Natural Language Toolkit)

Installation:

pip install nltk

Key Features:

  • Sentence tokenization
  • Language detection
  • Text preprocessing
  • Linguistic analysis

Example Usage:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download required tokenizer and corpus data (newer NLTK releases also need punkt_tab)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
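
Sentence tokenization is typically the first step of sentence-aware chunking. A minimal sketch that packs sentences into chunks under a character budget (the budget value is arbitrary):

def chunk_by_sentences(text, max_chars=1000):
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks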

spaCy

Installation:

pip install spacy
python -m spacy download en_core_web_sm

Key Features:

  • Industrial-strength NLP
  • Named entity recognition
  • Dependency parsing
  • Sentence boundary detection

Example Usage:

import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("This is a sample sentence. This is another sentence.")

# Extract sentences
sentences = [sent.text for sent in doc.sents]

# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Dependency parsing for better chunking
for token in doc:
    print(f"{token.text}: {token.dep_} (head: {token.head.text})")

Sentence Transformers

Installation:

pip install sentence-transformers

Key Features:

  • Pre-trained sentence embeddings
  • Semantic similarity calculation
  • Multi-lingual support
  • Custom model training

Example Usage:

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)

# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])

# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    embeddings = model.encode(sentences)

    boundaries = [0]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            boundaries.append(i)

    return boundaries
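
The boundary indices can then be turned into chunk strings. A small sketch reusing find_semantic_boundaries (it re-splits on periods, matching the naive splitting above):

def chunk_by_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    boundaries = find_semantic_boundaries(text, model, threshold)
    boundaries.append(len(sentences))  # Close the final chunk

    chunks = []
    for start, end in zip(boundaries, boundaries[1:]):
        chunks.append(". ".join(sentences[start:end]) + ".")
    return chunks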

Vector Databases and Search

ChromaDB

Installation:

pip install chromadb

Key Features:

  • In-memory and persistent storage
  • Built-in embedding functions
  • Similarity search
  • Metadata filtering

Example Usage:

import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add chunks
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    metadatas=[chunk.get("metadata", {}) for chunk in chunks],
    ids=[chunk["id"] for chunk in chunks]
)

# Search
results = collection.query(
    query_texts=["What is chunking?"],
    n_results=5
)
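
The client above is in-memory only. For the persistent storage mentioned in the feature list, a PersistentClient writes to disk (the path is an example):

# Persist the collection to a local directory
persistent_client = chromadb.PersistentClient(path="./chroma_db")
persistent_collection = persistent_client.get_or_create_collection(
    name="document_chunks"
)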

Pinecone

Installation:

pip install pinecone

Key Features:

  • Managed vector database service
  • High-performance similarity search
  • Metadata filtering
  • Scalable infrastructure

Example Usage:

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

# Initialize client (pinecone v3+ API; the legacy pinecone.init() has been removed)
pc = Pinecone(api_key="your-api-key")
index_name = "document-chunks"

# Create index if it doesn't exist (serverless spec values are examples)
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Must match the embedding model's output dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
    embedding = model.encode(chunk["content"])
    index.upsert(
        vectors=[{
            "id": chunk["id"],
            "values": embedding.tolist(),
            "metadata": chunk.get("metadata", {})
        }]
    )

# Search
query_embedding = model.encode("search query")
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)

Weaviate

Installation:

pip install weaviate-client

Key Features:

  • GraphQL API
  • Hybrid search (dense + sparse)
  • Real-time updates
  • Schema validation

Example Usage:

import weaviate

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
client.schema.create_class({
    "class": "DocumentChunk",
    "description": "A chunk of document content",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
})

# Add data
for chunk in chunks:
    client.data_object.create(
        data_object={
            "content": chunk["content"],
            "source": chunk.get("source", "unknown")
        },
        class_name="DocumentChunk"
    )

# Search
results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_near_text({
    "concepts": ["search query"]
}).with_limit(5).do()
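
The hybrid search listed in the features combines vector and keyword (BM25) scoring. A sketch using the same v3-style client; alpha is an example value that weights dense versus sparse scoring:

hybrid_results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_hybrid(
    query="search query",
    alpha=0.5  # 1.0 = pure vector search, 0.0 = pure keyword search
).with_limit(5).do()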

Evaluation and Testing

RAGAS

Installation:

pip install ragas

Key Features:

  • RAG evaluation metrics
  • Answer quality assessment
  • Context relevance measurement
  • Faithfulness evaluation

Example Usage:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
dataset = Dataset.from_dict({
    "question": ["What is chunking?"],
    "answer": ["Chunking is the process of breaking large documents into smaller segments"],
    "contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
    "ground_truth": ["Chunking is a document processing technique"]
})

# Evaluate
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)

TruEra (TruLens)

Installation:

pip install trulens trulens-apps-custom

Key Features:

  • LLM application evaluation
  • Feedback functions
  • Hallucination detection
  • Performance monitoring

Example Usage:

from trulens.core import TruSession
from trulens.apps.custom import TruCustomApp, instrument
from trulens.feedback import GroundTruthAgreement

# Initialize session (manages the evaluation database)
session = TruSession()

# Define feedback functions; ground_truth is a placeholder for your
# list of expected question/answer pairs
f_groundedness = GroundTruthAgreement(ground_truth)

# Wrap the pipeline in a class so its methods can be instrumented and recorded;
# chunk_function, search_function, and generate_function are placeholders
class ChunkingPipeline:
    @instrument
    def chunk_and_query(self, text, query):
        chunks = chunk_function(text)
        relevant_chunks = search_function(chunks, query)
        answer = generate_function(relevant_chunks, query)
        return answer

pipeline = ChunkingPipeline()
tru_pipeline = TruCustomApp(pipeline, app_name="chunking-evaluation")

# Record an evaluation run
with tru_pipeline as recording:
    pipeline.chunk_and_query("large document text", "what is the main topic?")

Document Processing

PyPDF2

Installation:

pip install PyPDF2

Key Features:

  • PDF text extraction
  • Page manipulation
  • Metadata extraction
  • Form field processing

Example Usage:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Extract text by page for better chunking
def extract_pages(pdf_path):
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            pages.append({
                "page_number": i + 1,
                "content": page.extract_text()
            })
    return pages
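
The metadata extraction mentioned in the feature list is exposed on the reader. A small sketch (fields may be None depending on the PDF):

def extract_metadata(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        info = reader.metadata
        return {
            "title": info.title if info else None,
            "author": info.author if info else None,
            "page_count": len(reader.pages)
        }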

python-docx

Installation:

pip install python-docx

Key Features:

  • Microsoft Word document processing
  • Paragraph and table extraction
  • Style preservation
  • Metadata access

Example Usage:

from docx import Document

def extract_from_docx(docx_path):
    doc = Document(docx_path)
    content = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            content.append({
                "type": "paragraph",
                "text": paragraph.text,
                "style": paragraph.style.name
            })

    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_text = [cell.text for cell in row.cells]
            table_text.append(" | ".join(row_text))

        content.append({
            "type": "table",
            "text": "\n".join(table_text)
        })

    return content
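
Document metadata, noted in the feature list, is available through core_properties. A small sketch:

def extract_docx_metadata(docx_path):
    doc = Document(docx_path)
    props = doc.core_properties
    return {
        "title": props.title,
        "author": props.author,
        "created": props.created
    }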

Specialized Libraries

tiktoken (OpenAI)

Installation:

pip install tiktoken

Key Features:

  • Accurate token counting for OpenAI models
  • Fast encoding/decoding
  • Multiple model support
  • Language model specific tokenization

Example Usage:

import tiktoken

# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")

# Decode tokens
text = encoding.decode(tokens)

# Count tokens without full encoding
def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    return chunks
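
The same approach extends to overlapping token windows, which most chunking strategies rely on. A minimal sketch (overlap must be smaller than max_tokens):

def chunk_by_tokens_with_overlap(text, max_tokens=1000, overlap=100):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[i:i + max_tokens]))
        if i + max_tokens >= len(tokens):
            break

    return chunks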

PDFMiner

Installation:

pip install pdfminer.six

Key Features:

  • Detailed PDF analysis
  • Layout preservation
  • Font and style information
  • High-precision text extraction

Example Usage:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_structured_text(pdf_path):
    structured_content = []

    for page_layout in extract_pages(pdf_path):
        page_content = []

        for element in page_layout:
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                # Font details live on individual characters (LTChar),
                # not on the text container itself
                chars = [
                    obj
                    for line in element if isinstance(line, LTTextLine)
                    for obj in line if isinstance(obj, LTChar)
                ]
                font_info = {
                    "font_size": chars[0].size if chars else None,
                    "is_bold": any("Bold" in c.fontname for c in chars),
                    "x0": element.x0,
                    "y0": element.y0
                }
                page_content.append({
                    "text": text.strip(),
                    "font_info": font_info
                })

        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })

    return structured_content

Performance and Optimization

Dask

Installation:

pip install dask[complete]

Key Features:

  • Parallel processing
  • Out-of-core computation
  • Distributed computing
  • Integration with pandas

Example Usage:

import dask.bag as db
from dask.distributed import Client

# Setup distributed client
client = Client(n_workers=4)

# Parallel chunking of multiple documents
def chunk_document(document):
    # Your chunking logic here
    return chunk_function(document)

# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...]  # List of document contents
document_bag = db.from_sequence(documents)

# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)

# Compute results
results = chunked_documents.compute()

Ray

Installation:

pip install ray

Key Features:

  • Distributed computing
  • Actor model
  • Autoscaling
  • ML pipeline integration

Example Usage:

import ray

# Initialize Ray
ray.init()

@ray.remote
class ChunkingWorker:
    def __init__(self, strategy):
        self.strategy = strategy

    def chunk_documents(self, documents):
        results = []
        for doc in documents:
            chunks = self.strategy.chunk(doc)
            results.append(chunks)
        return results

# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]

# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
           for worker, batch in zip(workers, documents_batch)]

# Get results
results = ray.get(futures)

Development and Testing

pytest

Installation:

pip install pytest pytest-asyncio

Example Tests:

import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker

class TestFixedSizeChunker:
    def test_chunk_size_respect(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 50 words

        chunks = chunker.chunk(text)

        for chunk in chunks:
            assert len(chunk.split()) <= 100  # Account for word boundaries

    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = "word " * 30

        chunks = chunker.chunk(text)

        # Check overlap between consecutive chunks
        for i in range(1, len(chunks)):
            chunk1_words = set(chunks[i-1].split()[-10:])
            chunk2_words = set(chunks[i].split()[:10])
            overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 5  # Allow some tolerance

@pytest.mark.asyncio
async def test_semantic_chunker():
    chunker = SemanticChunker()
    text = "First topic sentence. Another sentence about first topic. " \
           "Now switching to second topic. More about second topic."

    chunks = await chunker.chunk_async(text)

    # Should detect topic change and create boundary
    assert len(chunks) >= 2
    assert "first topic" in chunks[0].lower()
    assert "second topic" in chunks[1].lower()

Memory Profiler

Installation:

pip install memory-profiler

Example Usage:

from memory_profiler import profile

@profile
def chunk_large_document():
    chunker = FixedSizeChunker(chunk_size=1000)
    large_text = "word " * 100000  # Large document

    chunks = chunker.chunk(large_text)
    return chunks

# Run with: python -m memory_profiler your_script.py

These tools cover the core needs for implementing, testing, and optimizing chunking strategies, from simple text processing to production-grade RAG systems.
