Content Ingestion API Reference

Parse, chunk, and ingest documents into the RAG system with support for hierarchical content structures, directory parsing, and configurable chunking strategies.

Content Reading

Parse content from various sources into hierarchical document structures.

HierarchicalContentReader

Interface for reading and parsing content into NavigableDocument structures.

interface HierarchicalContentReader {
    /**
     * Parse content from a URL
     * @param url URL to fetch and parse
     * @return Parsed navigable document
     */
    fun parseUrl(url: String): NavigableDocument

    /**
     * Parse content from a resource path
     * @param resourcePath Classpath resource path
     * @return Parsed navigable document
     */
    fun parseResource(resourcePath: String): NavigableDocument

    /**
     * Parse content from a file
     * @param file File to parse
     * @param url Optional URL to associate with the document
     * @return Parsed navigable document
     */
    fun parseFile(file: File, url: String? = null): NavigableDocument

    /**
     * Parse content from an input stream
     * @param inputStream Stream containing content
     * @param uri URI to associate with the document
     * @return Parsed navigable document
     */
    fun parseContent(inputStream: InputStream, uri: String): NavigableDocument

    /**
     * Parse multiple files from a directory
     * @param fileTools File reading tools
     * @param config Directory parsing configuration
     * @return Parsing result with statistics and parsed documents
     */
    fun parseFromDirectory(
        fileTools: FileReadTools,
        config: DirectoryParsingConfig
    ): DirectoryParsingResult
}

Methods:

  • parseUrl(): Fetch and parse content from an HTTP(S) URL

    • Parameters: url - URL to fetch
    • Returns: Parsed document structure
    • Throws: IOException if the fetch fails; parsing exceptions may also propagate
  • parseResource(): Load and parse a classpath resource

    • Parameters: resourcePath - Classpath resource path
    • Returns: Parsed document
    • Throws: IOException if the resource is not found
  • parseFile(): Parse a local file

    • Parameters:
      • file - File to parse
      • url - Optional URL to use as the document URI
    • Returns: Parsed document
    • Throws: IOException if the file is not readable
  • parseContent(): Parse content from an input stream

    • Parameters:
      • inputStream - Content stream
      • uri - URI to associate with the document
    • Returns: Parsed document
  • parseFromDirectory(): Batch-parse a directory

    • Parameters:
      • fileTools - File system utilities
      • config - Parsing configuration
    • Returns: Parsing result with statistics and parsed documents
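
A minimal sketch of single-document parsing; the concrete reader implementation is an assumption, and full walkthroughs appear under Usage Examples below.

import java.io.File

val reader: HierarchicalContentReader = // implementation

// Parse a local Markdown file and inspect the resulting hierarchy
val doc = reader.parseFile(File("docs/guide.md"))
println("${doc.title}: ${doc.children.count()} top-level sections")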

Directory Parsing

Parse multiple documents from directory structures.

DirectoryParsingConfig

Configuration for directory parsing operations.

data class DirectoryParsingConfig(
    /**
     * File extensions to include (e.g., "md", "txt")
     */
    val includedExtensions: Set<String>,

    /**
     * Directory names to exclude from traversal
     */
    val excludedDirectories: Set<String>,

    /**
     * Relative path from base directory
     */
    val relativePath: String = ".",

    /**
     * Maximum file size in bytes
     */
    val maxFileSize: Long = 10_485_760, // 10 MB

    /**
     * Whether to follow symbolic links
     */
    val followSymlinks: Boolean = false,

    /**
     * Maximum directory depth to traverse
     */
    val maxDepth: Int = Int.MAX_VALUE
) {
    fun withRelativePath(newRelativePath: String): DirectoryParsingConfig
    fun withMaxFileSize(newMaxFileSize: Long): DirectoryParsingConfig
    fun withFollowSymlinks(newFollowSymlinks: Boolean): DirectoryParsingConfig
    fun withMaxDepth(newMaxDepth: Int): DirectoryParsingConfig
}

Properties:

  • includedExtensions: File extensions to process (without the leading dot)
  • excludedDirectories: Directory names to skip
  • relativePath: Starting path relative to base
  • maxFileSize: Maximum file size in bytes (default 10 MB)
  • followSymlinks: Follow symbolic links (default false)
  • maxDepth: Maximum traversal depth (default unlimited)

Builder Methods:

  • withRelativePath(): Create copy with new path
  • withMaxFileSize(): Create copy with new size limit
  • withFollowSymlinks(): Create copy with new symlink setting
  • withMaxDepth(): Create copy with new depth limit
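
A brief sketch combining the constructor with the copy-style builder methods (the values shown are illustrative):

val config = DirectoryParsingConfig(
    includedExtensions = setOf("md"),
    excludedDirectories = setOf(".git")
)
    .withRelativePath("docs")  // each with* call returns a modified copy
    .withMaxDepth(3)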

DirectoryParsingResult

Result of directory parsing operation with statistics.

data class DirectoryParsingResult(
    /**
     * Total files found matching criteria
     */
    val totalFilesFound: Int,

    /**
     * Number of files successfully processed
     */
    val filesProcessed: Int,

    /**
     * Number of files skipped
     */
    val filesSkipped: Int,

    /**
     * Number of files that errored during processing
     */
    val filesErrored: Int,

    /**
     * Parsed content roots
     */
    val contentRoots: List<NavigableDocument>,

    /**
     * Time taken for processing
     */
    val processingTime: Duration,

    /**
     * Error messages from failed files
     */
    val errors: List<String>,

    /**
     * Whether parsing succeeded overall
     */
    val success: Boolean,

    /**
     * Total sections extracted across all documents
     */
    val totalSectionsExtracted: Int
)

Properties:

  • totalFilesFound: Files matching criteria
  • filesProcessed: Successfully parsed files
  • filesSkipped: Files skipped (e.g., due to size limits or permissions)
  • filesErrored: Files with parsing errors
  • contentRoots: Parsed documents
  • processingTime: Total processing duration
  • errors: Error messages for failed files
  • success: True if at least one file was processed successfully
  • totalSectionsExtracted: Sum of sections across all documents
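
A small sketch of inspecting a parsing result; reader, fileTools, and config are assumed to be set up as in the Directory Parsing example below:

val result = reader.parseFromDirectory(fileTools, config)
if (result.success) {
    println("Processed ${result.filesProcessed}/${result.totalFilesFound} files in ${result.processingTime}")
} else {
    result.errors.forEach { println("Failed: $it") }
}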

Content Chunking

Convert hierarchical documents into indexed chunks.

ContentChunker

Interface for chunking document sections.

interface ContentChunker {
    /**
     * Chunk transformer to apply to generated chunks
     */
    val chunkTransformer: ChunkTransformer

    /**
     * Convert a container section into chunks
     * @param section Container section to chunk
     * @return Iterable of chunks with metadata
     */
    fun chunk(section: NavigableContainerSection): Iterable<Chunk>

    /**
     * Configuration for chunking behavior
     */
    data class Config(
        /**
         * Maximum size of each chunk in characters
         */
        val maxChunkSize: Int = 1500,

        /**
         * Overlap between consecutive chunks in characters
         */
        val overlapSize: Int = 200,

        /**
         * Batch size for embedding generation
         */
        val embeddingBatchSize: Int = 100
    )

    companion object {
        /**
         * Standard metadata keys for chunks
         */
        const val CHUNK_INDEX = "chunk_index"
        const val TOTAL_CHUNKS = "total_chunks"
        const val SEQUENCE_NUMBER = "sequence_number"
        const val ROOT_DOCUMENT_ID = "root_document_id"
        const val CONTAINER_SECTION_ID = "container_section_id"
        const val CONTAINER_SECTION_TITLE = "container_section_title"
        const val CONTAINER_SECTION_URL = "container_section_url"
        const val LEAF_SECTION_ID = "leaf_section_id"
        const val LEAF_SECTION_TITLE = "leaf_section_title"
        const val LEAF_SECTION_URL = "leaf_section_url"

        /**
         * Create a content chunker with config and transformer
         */
        operator fun invoke(
            config: Config,
            chunkTransformer: ChunkTransformer
        ): InMemoryContentChunker
    }
}

Properties:

  • chunkTransformer: Transformer applied to each chunk

Methods:

  • chunk(): Convert section to chunks
    • Parameters: section - Container section to chunk
    • Returns: Iterable of chunks with metadata
    • Behavior: Splits long sections into overlapping chunks

Configuration:

  • maxChunkSize: Maximum chunk size in characters (default 1500)
  • overlapSize: Overlap between chunks in characters (default 200)
  • embeddingBatchSize: Batch size for embedding generation (default 100)

Metadata Keys:

  • CHUNK_INDEX: Index within section (0-based)
  • TOTAL_CHUNKS: Total chunks from section
  • SEQUENCE_NUMBER: Global sequence across document
  • ROOT_DOCUMENT_ID: Document root ID
  • CONTAINER_SECTION_ID: Parent container section ID
  • CONTAINER_SECTION_TITLE: Parent container section title
  • CONTAINER_SECTION_URL: Parent container section URL
  • LEAF_SECTION_ID: Source leaf section ID
  • LEAF_SECTION_TITLE: Source leaf section title
  • LEAF_SECTION_URL: Source leaf section URL
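
A quick sketch of reading the standard keys from a chunk's metadata; map-style access on chunk.metadata mirrors the later metadata example and is otherwise an assumption about the Chunk model:

val index = chunk.metadata[ContentChunker.CHUNK_INDEX]
val total = chunk.metadata[ContentChunker.TOTAL_CHUNKS]
val leaf = chunk.metadata[ContentChunker.LEAF_SECTION_TITLE]
println("Chunk $index of $total from leaf section '$leaf'")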

InMemoryContentChunker

In-memory implementation of ContentChunker.

class InMemoryContentChunker(
    /**
     * Chunking configuration
     */
    val config: ContentChunker.Config,

    /**
     * Transformer to apply to chunks
     */
    override val chunkTransformer: ChunkTransformer
) : ContentChunker {

    /**
     * Chunk a single container section
     * @param section Container section to chunk
     * @return List of chunks
     */
    override fun chunk(section: NavigableContainerSection): List<Chunk>

    /**
     * Chunk multiple sections
     * @param sections List of container sections
     * @return List of all chunks from all sections
     */
    fun splitSections(sections: List<NavigableContainerSection>): List<Chunk>
}

Constructor Parameters:

  • config: Chunking configuration
  • chunkTransformer: Transformer for chunks

Methods:

  • chunk(): Chunk single section
  • splitSections(): Batch chunk multiple sections

Behavior:

  • Splits sections into chunks of at most maxChunkSize characters
  • Creates an overlapSize-character overlap between consecutive chunks
  • Applies the chunk transformer to each chunk
  • Adds standard metadata to all chunks
  • Maintains sequence numbers across the document

Ingestion Pipeline

Ingest content into storage repositories.

Ingester

Interface for ingesting resources into stores.

interface Ingester : HasInfoString {
    /**
     * Target stores for ingestion
     */
    val stores: List<ChunkingContentElementRepository>

    /**
     * Check if ingester is active and ready
     * @return true if ingester can accept requests
     */
    fun active(): Boolean

    /**
     * Ingest a resource by path or URL
     * @param resourcePath Path or URL to ingest
     * @return Ingestion result with statistics
     */
    fun ingest(resourcePath: String): IngestionResult
}

Properties:

  • stores: List of target repositories

Methods:

  • active(): Check if ingester is ready

    • Returns: True if ingester can process requests
  • ingest(): Ingest resource

    • Parameters: resourcePath - File path, URL, or resource path
    • Returns: Ingestion result with statistics
    • Behavior: Parses the content, chunks it, and stores the chunks in all target repositories
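
A minimal sketch of driving an ingester; the concrete implementation is an assumption:

val ingester: Ingester = // implementation
if (ingester.active()) {
    val result = ingester.ingest("https://example.com/docs/guide.html")
    println("Wrote ${result.chunkIds.size} chunks to stores: ${result.storesWrittenTo}")
}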

IngestionResult

Result of an ingestion operation.

data class IngestionResult(
    /**
     * Names of stores that received content
     */
    val storesWrittenTo: Set<String>,

    /**
     * IDs of chunks created
     */
    val chunkIds: List<String>,

    /**
     * Number of documents written
     */
    val documentsWritten: Int
) {
    /**
     * Check if ingestion succeeded
     * @return true if any content was written
     */
    fun success(): Boolean
}

Properties:

  • storesWrittenTo: Names of stores that received content
  • chunkIds: IDs of created chunks
  • documentsWritten: Number of documents ingested

Methods:

  • success(): Returns true if any content was written

Content Refresh Policies

Control when content should be re-ingested.

ContentRefreshPolicy

Interface for determining whether content needs refreshing.

interface ContentRefreshPolicy {
    /**
     * Check if content at URI should be re-read
     * @param repository Repository to check
     * @param rootUri URI of content root
     * @return true if content should be refreshed
     */
    fun shouldReread(
        repository: ChunkingContentElementRepository,
        rootUri: String
    ): Boolean

    /**
     * Check if document should be refreshed
     * @param repository Repository to check
     * @param root Document to potentially refresh
     * @return true if document should be refreshed
     */
    fun shouldRefreshDocument(
        repository: ChunkingContentElementRepository,
        root: NavigableDocument
    ): Boolean

    /**
     * Ingest URI if refresh policy determines it's needed
     * @param repository Target repository
     * @param hierarchicalContentReader Reader for parsing content
     * @param rootUri URI to potentially ingest
     * @return Parsed document if ingested, null if skipped
     */
    fun ingestUriIfNeeded(
        repository: ChunkingContentElementRepository,
        hierarchicalContentReader: HierarchicalContentReader,
        rootUri: String
    ): NavigableDocument?
}

Methods:

  • shouldReread(): Check if URI needs re-ingestion

    • Parameters:
      • repository - Repository to check
      • rootUri - URI of content
    • Returns: True if content should be re-read
  • shouldRefreshDocument(): Check if document needs refresh

    • Parameters:
      • repository - Repository to check
      • root - Document to evaluate
    • Returns: True if document should be refreshed
  • ingestUriIfNeeded(): Conditionally ingest URI

    • Parameters:
      • repository - Target repository
      • hierarchicalContentReader - Parser
      • rootUri - URI to potentially ingest
    • Returns: Document if ingested, null if skipped
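
A short sketch of the conditional-ingestion flow; a concrete policy (such as the time-based one under Usage Examples), repository, and reader are assumed:

val refreshed = policy.ingestUriIfNeeded(repository, reader, "https://example.com/docs")
if (refreshed != null) {
    println("Re-ingested: ${refreshed.title}")
} else {
    println("Content still fresh, skipped")
}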

Retrievable Enhancement

Enhance retrievables during storage.

RetrievableEnhancer

Interface for enhancing retrievables with additional data.

interface RetrievableEnhancer {
    /**
     * Enhance a retrievable before storage
     * @param retrievable Retrievable to enhance
     * @return Enhanced retrievable
     */
    fun <T : Retrievable> enhance(retrievable: T): T
}

Methods:

  • enhance(): Add data to retrievable
    • Parameters: retrievable - Item to enhance
    • Returns: Enhanced item (may be the same instance or a new copy)
    • Use Cases: Add metadata, enrich properties, transform content
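
A tiny sketch of an enhancer that stamps chunks with their ingestion source; withAdditionalMetadata mirrors its use in the enhancer examples below and is otherwise an assumption about the Chunk model:

class SourceTagEnhancer(private val source: String) : RetrievableEnhancer {
    override fun <T : Retrievable> enhance(retrievable: T): T {
        if (retrievable is Chunk) {
            @Suppress("UNCHECKED_CAST")
            return retrievable.withAdditionalMetadata(mapOf("source" to source)) as T
        }
        return retrievable
    }
}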

Usage Examples

Basic Content Reading

import com.embabel.agent.rag.ingestion.*
import java.io.File

val reader: HierarchicalContentReader = // implementation

// Parse from URL
val doc1 = reader.parseUrl("https://example.com/docs/guide.html")
println("Parsed: ${doc1.title}")
println("Sections: ${doc1.children.count()}")

// Parse from file
val file = File("/path/to/document.md")
val doc2 = reader.parseFile(file, url = "file:///path/to/document.md")
println("Parsed: ${doc2.title}")

// Parse from classpath resource
val doc3 = reader.parseResource("docs/readme.md")
println("Parsed: ${doc3.title}")

// Parse from input stream
val inputStream = File("document.txt").inputStream()
val doc4 = reader.parseContent(inputStream, uri = "file:///document.txt")
println("Parsed: ${doc4.title}")

Directory Parsing

import com.embabel.agent.rag.ingestion.*

val reader: HierarchicalContentReader = // implementation
val fileTools: FileReadTools = // implementation

// Configure directory parsing
val config = DirectoryParsingConfig(
    includedExtensions = setOf("md", "txt", "html"),
    excludedDirectories = setOf("node_modules", ".git", "build", "target"),
    relativePath = "docs",
    maxFileSize = 5_242_880, // 5 MB
    followSymlinks = false,
    maxDepth = 10
)

// Parse directory
val result = reader.parseFromDirectory(fileTools, config)

// Check results
println("=== Parsing Results ===")
println("Files found: ${result.totalFilesFound}")
println("Files processed: ${result.filesProcessed}")
println("Files skipped: ${result.filesSkipped}")
println("Files errored: ${result.filesErrored}")
println("Documents created: ${result.contentRoots.size}")
println("Sections extracted: ${result.totalSectionsExtracted}")
println("Processing time: ${result.processingTime}")
println("Success: ${result.success}")

// Process errors
if (result.errors.isNotEmpty()) {
    println("\n=== Errors ===")
    result.errors.forEach { error ->
        println("  - $error")
    }
}

// Process parsed documents
result.contentRoots.forEach { doc ->
    println("\nDocument: ${doc.title}")
    println("  URI: ${doc.uri}")
    println("  Children: ${doc.children.count()}")
    println("  Leaves: ${doc.leaves().count()}")
}

Configuring Directory Parsing

import com.embabel.agent.rag.ingestion.*

// Start with base config
val baseConfig = DirectoryParsingConfig(
    includedExtensions = setOf("md", "adoc"),
    excludedDirectories = setOf(".git", "node_modules")
)

// Customize with builder methods
val customConfig = baseConfig
    .withRelativePath("src/docs")
    .withMaxFileSize(10_485_760) // 10 MB
    .withFollowSymlinks(true)
    .withMaxDepth(5)

// Use custom config (reader and fileTools as in the previous example)
val result = reader.parseFromDirectory(fileTools, customConfig)

Content Chunking

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.model.*

// Configure chunker
val config = ContentChunker.Config(
    maxChunkSize = 1500,
    overlapSize = 200,
    embeddingBatchSize = 100
)

// Create chunker with transformer
val chunker = ContentChunker(
    config = config,
    chunkTransformer = ChunkTransformer.NO_OP
)

// Parse a document
val reader: HierarchicalContentReader = // implementation
val document = reader.parseUrl("https://example.com/docs")

// Chunk the document
val chunks = chunker.chunk(document).toList()
println("Created ${chunks.size} chunks")

// Examine chunks
chunks.forEach { chunk ->
    val index = chunk.metadata[ContentChunker.CHUNK_INDEX]
    val total = chunk.metadata[ContentChunker.TOTAL_CHUNKS]
    val sequence = chunk.metadata[ContentChunker.SEQUENCE_NUMBER]
    val sectionTitle = chunk.metadata[ContentChunker.CONTAINER_SECTION_TITLE]

    println("Chunk $index of $total (sequence: $sequence)")
    println("  From section: $sectionTitle")
    println("  Text length: ${chunk.text.length}")
    println("  Parent: ${chunk.parentId}")
}

InMemoryContentChunker

import com.embabel.agent.rag.ingestion.*

val config = ContentChunker.Config(
    maxChunkSize = 2000,
    overlapSize = 300
)

val chunker = InMemoryContentChunker(
    config = config,
    chunkTransformer = ChunkTransformer.NO_OP
)

// Chunk multiple sections
val reader: HierarchicalContentReader = // implementation
val doc1 = reader.parseUrl("https://example.com/doc1")
val doc2 = reader.parseUrl("https://example.com/doc2")

val allChunks = chunker.splitSections(listOf(doc1, doc2))
println("Total chunks: ${allChunks.size}")

// Process chunks
allChunks.forEach { chunk ->
    println("${chunk.id}: ${chunk.text.take(50)}...")
}

Accessing Chunk Metadata

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.model.*

val chunker: ContentChunker = // implementation
val document: NavigableDocument = // parsed document

val chunks = chunker.chunk(document).toList()

chunks.forEach { chunk ->
    // Standard metadata
    val chunkIndex = chunk.metadata[ContentChunker.CHUNK_INDEX] as? Int
    val totalChunks = chunk.metadata[ContentChunker.TOTAL_CHUNKS] as? Int
    val sequenceNumber = chunk.metadata[ContentChunker.SEQUENCE_NUMBER] as? Int
    val rootDocId = chunk.metadata[ContentChunker.ROOT_DOCUMENT_ID] as? String

    // Section metadata
    val containerSectionId = chunk.metadata[ContentChunker.CONTAINER_SECTION_ID] as? String
    val containerSectionTitle = chunk.metadata[ContentChunker.CONTAINER_SECTION_TITLE] as? String
    val leafSectionTitle = chunk.metadata[ContentChunker.LEAF_SECTION_TITLE] as? String

    println("Chunk $chunkIndex/$totalChunks (sequence: $sequenceNumber)")
    println("  Root: $rootDocId")
    println("  Container: $containerSectionTitle")
    println("  Leaf: $leafSectionTitle")
}

Ingestion Pipeline

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.store.*

val ingester: Ingester = // implementation

// Check if ingester is ready
if (!ingester.active()) {
    error("Ingester not active")
}

// Ingest a resource
val result = ingester.ingest("https://example.com/docs/guide.html")

// Check results
if (result.success()) {
    println("Ingestion succeeded!")
    println("Stores written to: ${result.storesWrittenTo.joinToString()}")
    println("Chunks created: ${result.chunkIds.size}")
    println("Documents written: ${result.documentsWritten}")

    // Access chunk IDs
    result.chunkIds.take(5).forEach { chunkId ->
        println("  Created chunk: $chunkId")
    }
} else {
    println("Ingestion failed - no content written")
}

Custom Ingester Implementation

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.store.*
import java.io.File

class CustomIngester(
    override val stores: List<ChunkingContentElementRepository>,
    private val reader: HierarchicalContentReader,
    private val chunker: ContentChunker
) : Ingester {

    override fun active(): Boolean {
        return stores.all { it.info().isPersistent }
    }

    override fun ingest(resourcePath: String): IngestionResult {
        // Parse document
        val document = when {
            resourcePath.startsWith("http") -> reader.parseUrl(resourcePath)
            resourcePath.startsWith("classpath:") ->
                reader.parseResource(resourcePath.removePrefix("classpath:"))
            else -> reader.parseFile(File(resourcePath))
        }

        // Store in all repositories
        val storesWritten = mutableSetOf<String>()
        val allChunkIds = mutableListOf<String>()

        stores.forEach { store ->
            val chunkIds = store.writeAndChunkDocument(document)
            storesWritten.add(store.name)
            allChunkIds.addAll(chunkIds)
        }

        return IngestionResult(
            storesWrittenTo = storesWritten,
            chunkIds = allChunkIds,
            documentsWritten = 1
        )
    }

    override fun infoString(verbose: Boolean?, indent: Int): String {
        return "CustomIngester with ${stores.size} stores"
    }
}

// Use the custom ingester (reader and chunker as in earlier examples)
val store1: ChunkingContentElementRepository = // implementation
val store2: ChunkingContentElementRepository = // implementation

val ingester = CustomIngester(
    stores = listOf(store1, store2),
    reader = reader,
    chunker = chunker
)

val result = ingester.ingest("https://example.com/docs")
println("Wrote to ${result.storesWrittenTo.size} stores")

Content Refresh Policies

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.store.*
import java.time.Duration
import java.time.Instant

// Simple time-based policy
class TimeBasedRefreshPolicy(
    private val maxAge: Duration
) : ContentRefreshPolicy {

    override fun shouldReread(
        repository: ChunkingContentElementRepository,
        rootUri: String
    ): Boolean {
        val existing = repository.findContentRootByUri(rootUri)
        if (existing == null) return true

        val age = Duration.between(existing.ingestionTimestamp, Instant.now())
        return age > maxAge
    }

    override fun shouldRefreshDocument(
        repository: ChunkingContentElementRepository,
        root: NavigableDocument
    ): Boolean {
        val age = Duration.between(root.ingestionTimestamp, Instant.now())
        return age > maxAge
    }

    override fun ingestUriIfNeeded(
        repository: ChunkingContentElementRepository,
        hierarchicalContentReader: HierarchicalContentReader,
        rootUri: String
    ): NavigableDocument? {
        if (!shouldReread(repository, rootUri)) {
            return null
        }

        // Delete old version
        repository.deleteRootAndDescendants(rootUri)

        // Ingest new version
        val document = hierarchicalContentReader.parseUrl(rootUri)
        repository.writeAndChunkDocument(document)

        return document
    }
}

// Use refresh policy
val policy = TimeBasedRefreshPolicy(maxAge = Duration.ofDays(7))
val repository: ChunkingContentElementRepository = // implementation
val reader: HierarchicalContentReader = // implementation

// Check if refresh needed
val uri = "https://example.com/docs"
if (policy.shouldReread(repository, uri)) {
    println("Content needs refreshing")

    val document = policy.ingestUriIfNeeded(repository, reader, uri)
    if (document != null) {
        println("Refreshed: ${document.title}")
    }
} else {
    println("Content is up to date")
}

Custom Retrievable Enhancer

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.model.*

// Sentiment analysis enhancer
class SentimentEnhancer : RetrievableEnhancer {
    override fun <T : Retrievable> enhance(retrievable: T): T {
        if (retrievable is Chunk) {
            val sentiment = analyzeSentiment(retrievable.text)
            val enhanced = retrievable.withAdditionalMetadata(
                mapOf(
                    "sentiment" to sentiment.name,
                    "sentiment_score" to sentiment.score
                )
            )
            @Suppress("UNCHECKED_CAST")
            return enhanced as T
        }
        return retrievable
    }

    private fun analyzeSentiment(text: String): Sentiment {
        val positiveWords = listOf("good", "great", "excellent", "success")
        val negativeWords = listOf("bad", "error", "fail", "problem")

        val lowerText = text.lowercase()
        val positiveCount = positiveWords.count { lowerText.contains(it) }
        val negativeCount = negativeWords.count { lowerText.contains(it) }

        return when {
            positiveCount > negativeCount -> Sentiment("positive", 0.7)
            negativeCount > positiveCount -> Sentiment("negative", -0.7)
            else -> Sentiment("neutral", 0.0)
        }
    }

    data class Sentiment(val name: String, val score: Double)
}

// Language detection enhancer
class LanguageEnhancer : RetrievableEnhancer {
    override fun <T : Retrievable> enhance(retrievable: T): T {
        if (retrievable is Chunk) {
            val language = detectLanguage(retrievable.text)
            val enhanced = retrievable.withAdditionalMetadata(
                mapOf("language" to language)
            )
            @Suppress("UNCHECKED_CAST")
            return enhanced as T
        }
        return retrievable
    }

    private fun detectLanguage(text: String): String {
        return when {
            text.contains(Regex("[\\p{IsHan}]")) -> "zh"
            text.contains(Regex("[\\p{IsHiragana}\\p{IsKatakana}]")) -> "ja"
            text.contains(Regex("[\\p{IsHangul}]")) -> "ko"
            else -> "en"
        }
    }
}

// Use enhancers with repository
val enhancers = listOf(
    SentimentEnhancer(),
    LanguageEnhancer()
)

// Repository will apply enhancers during ingestion
val repository: ChunkingContentElementRepository = // with enhancers
val document: NavigableDocument = // parsed document
val chunkIds = repository.writeAndChunkDocument(document)

// Chunks now have sentiment and language metadata

Complete Ingestion Workflow

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.store.*
import java.time.Duration

// 1. Set up components
val reader: HierarchicalContentReader = // implementation

val chunker = ContentChunker(
    config = ContentChunker.Config(
        maxChunkSize = 1500,
        overlapSize = 200,
        embeddingBatchSize = 100
    ),
    chunkTransformer = ChunkTransformer.NO_OP
)

val repository: ChunkingContentElementRepository = // implementation

val policy = TimeBasedRefreshPolicy(maxAge = Duration.ofDays(7)) // defined in the refresh policy example above

// 2. Parse document
val uri = "https://example.com/docs/guide.html"
val document = reader.parseUrl(uri)
println("Parsed: ${document.title}")
println("Sections: ${document.descendants().count()}")

// 3. Check if refresh needed
if (policy.shouldRefreshDocument(repository, document)) {
    println("Refreshing content...")

    // Delete existing
    repository.deleteRootAndDescendants(uri)

    // Store new version
    val chunkIds = repository.writeAndChunkDocument(document)
    println("Stored ${chunkIds.size} chunks")

    // Verify storage
    val info = repository.info()
    println("Repository now has ${info.chunkCount} chunks")
} else {
    println("Content is up to date, skipping ingestion")
}

Batch Directory Ingestion

import com.embabel.agent.rag.ingestion.*
import com.embabel.agent.rag.store.*

val reader: HierarchicalContentReader = // implementation
val fileTools: FileReadTools = // implementation
val repository: ChunkingContentElementRepository = // implementation

// Configure directory parsing
val config = DirectoryParsingConfig(
    includedExtensions = setOf("md", "txt"),
    excludedDirectories = setOf(".git", "node_modules"),
    relativePath = "docs",
    maxFileSize = 5_242_880 // 5 MB
)

// Parse all documents
val result = reader.parseFromDirectory(fileTools, config)

if (result.success) {
    println("Parsed ${result.filesProcessed} files")

    // Ingest all documents
    var totalChunks = 0
    result.contentRoots.forEach { document ->
        val chunkIds = repository.writeAndChunkDocument(document)
        totalChunks += chunkIds.size
        println("Ingested: ${document.title} (${chunkIds.size} chunks)")
    }

    println("Total chunks created: $totalChunks")

    // Check final state
    val info = repository.info()
    println("Repository state:")
    println("  Documents: ${info.documentCount}")
    println("  Chunks: ${info.chunkCount}")
} else {
    println("Parsing failed")
    result.errors.forEach { println("Error: $it") }
}