tessl/maven-com-embabel-agent--embabel-agent-rag-core

RAG (Retrieval-Augmented Generation) framework for the Embabel Agent platform providing content ingestion, chunking, hierarchical navigation, and semantic search capabilities

System Architecture

Name: tessl/maven-com-embabel-agent--embabel-agent-rag-core
Author: tessl

Deep dive into the architectural design, extensibility points, and design principles of the Embabel Agent RAG Core framework.

Overview

The Embabel Agent RAG Core framework uses a layered architecture that separates concerns while staying flexible and extensible. It covers the full path from document ingestion to retrieval that combines several search modalities (vector, full-text, and regex).

Architectural Layers

┌─────────────────────────────────────────────────────────┐
│                   Application Layer                      │
│  (LLM Tools, Search Services, Domain Logic)             │
└───────────────────┬─────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────┐
│                  Service Layer                           │
│  (Search Operations, Filtering, Entity Management)      │
└───────────────────┬─────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────┐
│                 Ingestion Layer                          │
│  (Content Reading, Chunking, Transformation)            │
└───────────────────┬─────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────┐
│                  Storage Layer                           │
│  (Repositories, Vector Stores, Embeddings)              │
└───────────────────┬─────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────┐
│              Infrastructure Layer                        │
│  (Embedding Services, Spring AI Integration)            │
└─────────────────────────────────────────────────────────┘

Core Domain Model

Document Hierarchy

Documents are represented as hierarchical structures that preserve organization and context.

NavigableDocument (Root)
    ├─ ContentRoot (metadata, URI, timestamp)
    │
    ├─ NavigableContainerSection
    │   ├─ LeafSection (actual content)
    │   ├─ NavigableContainerSection
    │   │   ├─ LeafSection
    │   │   └─ LeafSection
    │   └─ LeafSection
    │
    └─ NavigableContainerSection
        └─ LeafSection

Key Characteristics:

ContentRoot: Top-level document with URI and ingestion metadata
NavigableContainerSection: Intermediate sections with children
LeafSection: Terminal nodes containing actual text content
Chunk: Indexed segments derived from leaf sections

This hierarchy enables:

Preservation of document structure
Context-aware retrieval
Hierarchical navigation
Flexible chunking strategies
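
As a rough illustration of how a tree like this supports context-aware retrieval, the sketch below walks a small hierarchy and pairs each leaf with the titles of its ancestors. The `Node`, `Leaf`, and `Container` types and the `leavesWithPath` function are hypothetical stand-ins written for this example; they are not the framework's `LeafSection` or `NavigableContainerSection` types.

```kotlin
// Hypothetical stand-ins for the framework's section types, reduced to what the example needs.
sealed interface Node {
    val id: String
    val title: String
}

data class Leaf(override val id: String, override val title: String, val text: String) : Node

data class Container(
    override val id: String,
    override val title: String,
    val children: List<Node>,
) : Node

// Depth-first walk that pairs every leaf with the titles on the path from the root down to it.
fun leavesWithPath(node: Node, path: List<String> = emptyList()): List<Pair<List<String>, Leaf>> =
    when (node) {
        is Leaf -> listOf(path + node.title to node)
        is Container -> node.children.flatMap { leavesWithPath(it, path + node.title) }
    }

fun main() {
    val doc = Container(
        "doc", "User Guide",
        listOf(
            Leaf("intro", "Introduction", "Welcome."),
            Container("setup", "Setup", listOf(Leaf("install", "Install", "Run the installer."))),
        ),
    )
    for ((path, leaf) in leavesWithPath(doc)) {
        println(path.joinToString(" > ") + ": " + leaf.text)
    }
}
```

Keeping the ancestor path next to each leaf is one way a chunk could later be shown with its surrounding headings.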

Data Abstractions

Base Interfaces

// Foundation for all data objects
sealed interface Datum {
    val id: String
    val uri: String?
    val metadata: Map<String, Any?>

    fun propertiesToPersist(): Map<String, Any?>
    fun labels(): Set<String>
}

// Objects that can be embedded
interface Embeddable {
    fun embeddableValue(): String
}

// Objects with embeddings
interface Embedded {
    val embedding: Embedding?
}

// RAG-retrievable objects
interface Retrievable : HasInfoString, Datum, Embeddable

// Hierarchical content
interface HierarchicalContentElement : ContentElement {
    val parentId: String?
}

Type Hierarchy

Datum (base)
  │
  ├─ ContentElement
  │   ├─ HierarchicalContentElement
  │   │   ├─ ContentRoot
  │   │   │   └─ NavigableDocument
  │   │   ├─ Section
  │   │   │   ├─ ContainerSection
  │   │   │   │   └─ NavigableContainerSection
  │   │   │   └─ LeafSection
  │   │   └─ Chunk
  │   │
  │   └─ Retrievable
  │       ├─ Source
  │       │   ├─ Chunk
  │       │   └─ Fact
  │       └─ NamedEntity
  │           └─ NamedEntityData
  │
  └─ [Custom Domain Types]

This hierarchy provides:

Type safety across the system
Clear separation of concerns
Extensibility for custom types
Consistent interface for all data objects

Ingestion Pipeline Architecture

Pipeline Flow

Input → Reader → Document → Chunker → Chunks → Transformer → Enhanced Chunks → Repository
 │         │         │          │         │          │              │              │
 │         │         │          │         │          │              │              └─ Embedding
 │         │         │          │         │          │              │                 Generation
 │         │         │          │         │          │              │
 │         │         │          │         │          │              └─ Metadata
 │         │         │          │         │          │                 Enrichment
 │         │         │          │         │          │
 │         │         │          │         │          └─ Text
 │         │         │          │         │             Modification
 │         │         │          │         │
 │         │         │          │         └─ Chunk
 │         │         │          │            Creation
 │         │         │          │
 │         │         │          └─ Content
 │         │         │             Chunking
 │         │         │
 │         │         └─ Hierarchical
 │         │            Document
 │         │
 │         └─ Content
 │            Parsing
 │
 └─ Source
    (URL, File, Stream, Directory)

Component Responsibilities

1. HierarchicalContentReader

Purpose: Parse raw content into structured NavigableDocument objects

interface HierarchicalContentReader {
    fun parseUrl(url: String): NavigableDocument
    fun parseFile(file: File): NavigableDocument
    fun parseStream(stream: InputStream, uri: String): NavigableDocument
    fun parseDirectory(directory: File, recursive: Boolean): List<NavigableDocument>
}

Responsibilities:

Content format detection
Structure extraction (headings, sections)
Metadata extraction
URI management

Extensibility:

Custom parsers for proprietary formats
Format-specific metadata extraction
Custom structure inference

2. ContentRefreshPolicy

Purpose: Determine when documents should be re-ingested

interface ContentRefreshPolicy {
    fun shouldReread(
        repository: ChunkingContentElementRepository,
        rootUri: String
    ): Boolean

    fun shouldRefreshDocument(
        repository: ChunkingContentElementRepository,
        root: NavigableDocument
    ): Boolean

    fun ingestUriIfNeeded(
        repository: ChunkingContentElementRepository,
        hierarchicalContentReader: HierarchicalContentReader,
        rootUri: String
    ): NavigableDocument?
}

Strategies:

Time-based (TTL)
Always/Never refresh
URL-specific policies
Metadata-driven
External trigger-based

Design Rationale:

Separates refresh logic from ingestion
Supports complex refresh strategies
Enables cost optimization
Maintains data freshness
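
The time-based (TTL) strategy listed above could look roughly like the following. This is a self-contained sketch, not an implementation of `ContentRefreshPolicy`: `IngestionTimes` and `TtlRefreshPolicy` are hypothetical names, and the lookup of a last-ingestion timestamp is an assumption about what a repository could provide.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in: the only repository capability this sketch needs.
fun interface IngestionTimes {
    fun lastIngestedAt(rootUri: String): Instant?
}

// Time-based (TTL) policy: re-read when the root is unknown or older than the TTL.
class TtlRefreshPolicy(
    private val ttl: Duration,
    private val now: () -> Instant = { Instant.now() },
) {
    fun shouldReread(times: IngestionTimes, rootUri: String): Boolean {
        val last = times.lastIngestedAt(rootUri) ?: return true
        return Duration.between(last, now()) > ttl
    }
}

fun main() {
    val fixedNow = Instant.parse("2024-01-02T00:00:00Z")
    val ingested = mapOf("https://example.com/a" to Instant.parse("2024-01-01T00:00:00Z"))
    val times = IngestionTimes { ingested[it] }
    val policy = TtlRefreshPolicy(Duration.ofHours(12)) { fixedNow }

    println(policy.shouldReread(times, "https://example.com/a")) // ingested 24h ago, TTL is 12h
    println(policy.shouldReread(times, "https://example.com/b")) // never ingested
}
```

Injecting the clock as a function keeps the decision deterministic in tests.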

3. ContentChunker

Purpose: Break documents into Chunk objects for indexing

class ContentChunker(
    val config: Config,
    val chunkTransformer: ChunkTransformer = ChunkTransformer.NO_OP
) {
    fun chunk(document: NavigableDocument): Sequence<Chunk>
    fun chunk(section: Section): Sequence<Chunk>

    data class Config(
        val maxChunkSize: Int = 1500,
        val overlapSize: Int = 200,
        val respectSentenceBoundaries: Boolean = true
    )
}

Strategy:

Sliding window with overlap
Respect sentence boundaries
Maintain parent references
Generate positional metadata

Design Rationale:

Overlap improves retrieval quality
Sentence boundaries preserve semantic coherence
Metadata enables hierarchical reconstruction
Configurable for domain-specific needs
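
To make the sliding-window idea concrete, here is a simplified sketch of overlap chunking that prefers to cut at sentence ends. It works on plain strings and is not the framework's `ContentChunker`; the `chunkText` function and its character-based sentence detection are assumptions made for illustration, with defaults borrowed from the `Config` values shown above.

```kotlin
// Simplified sketch of overlap chunking on plain text; not the framework's ContentChunker.
fun chunkText(text: String, maxChunkSize: Int = 1500, overlapSize: Int = 200): List<String> {
    require(maxChunkSize > 0) { "maxChunkSize must be positive" }
    require(overlapSize in 0 until maxChunkSize) { "overlapSize must be smaller than maxChunkSize" }

    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val hardEnd = minOf(start + maxChunkSize, text.length)
        var end = hardEnd
        if (hardEnd < text.length) {
            // Prefer to cut just after the last sentence terminator inside the window,
            // but only if the chunk stays longer than the overlap so the window keeps advancing.
            val cut = text.substring(start, hardEnd).indexOfLast { it == '.' || it == '!' || it == '?' }
            if (cut + 1 > overlapSize) end = start + cut + 1
        }
        chunks += text.substring(start, end).trim()
        if (end == text.length) break
        start = end - overlapSize // step back so consecutive chunks share context
    }
    return chunks
}

fun main() {
    val text = "One two. Three four. Five six."
    chunkText(text, maxChunkSize = 20, overlapSize = 5).forEach { println("[$it]") }
}
```

The overlap is measured in characters here; a token-based measure would follow the same structure.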

4. ChunkTransformer

Purpose: Enrich and modify chunks before storage

interface ChunkTransformer {
    val name: String
    fun transform(chunk: Chunk, context: ChunkTransformationContext): Chunk
}

abstract class AbstractChunkTransformer : ChunkTransformer {
    open fun additionalMetadata(
        chunk: Chunk,
        context: ChunkTransformationContext
    ): Map<String, Any> = emptyMap()

    open fun newText(
        chunk: Chunk,
        context: ChunkTransformationContext
    ): String = chunk.text
}

Capabilities:

Text modification (add titles, clean formatting)
Metadata enrichment (language, sentiment, complexity)
Transformation chaining
Conditional transformation

Design Rationale:

Separation of chunking and enrichment
Composable transformers
Context-aware transformations
Extensible for custom logic
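
Transformation chaining can be pictured as a fold over a list of transformers. The sketch below uses reduced, hypothetical types (`SimpleChunk`, `SimpleTransformer`, `chain`) rather than the framework's `Chunk` and `ChunkTransformer`, and it omits the transformation context.

```kotlin
// Hypothetical, reduced versions of Chunk and ChunkTransformer for illustration only.
data class SimpleChunk(val text: String, val metadata: Map<String, Any> = emptyMap())

fun interface SimpleTransformer {
    fun transform(chunk: SimpleChunk): SimpleChunk
}

// Runs transformers left to right, feeding each one the previous result.
fun chain(vararg steps: SimpleTransformer) = SimpleTransformer { chunk ->
    steps.fold(chunk) { acc, step -> step.transform(acc) }
}

// Text modification: prepend the title stored in metadata, if there is one.
val prependTitle = SimpleTransformer { c ->
    val title = c.metadata["title"] as? String
    if (title == null) c else c.copy(text = "$title\n${c.text}")
}

// Metadata enrichment: record the number of whitespace-separated words.
val addWordCount = SimpleTransformer { c ->
    val words = c.text.split(Regex("\\s+")).count { it.isNotBlank() }
    c.copy(metadata = c.metadata + ("wordCount" to words))
}

fun main() {
    val pipeline = chain(prependTitle, addWordCount)
    val result = pipeline.transform(SimpleChunk("Run the installer.", mapOf("title" to "Setup")))
    println(result.text)
    println(result.metadata["wordCount"])
}
```

Order matters in a chain like this: because the title is prepended first, the word count includes it.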

5. ChunkingContentElementRepository

Purpose: Persist documents, sections, and chunks with embeddings

interface ChunkingContentElementRepository : ContentElementRepository {
    val enhancers: List<RetrievableEnhancer>

    fun writeAndChunkDocument(root: NavigableDocument): List<String>
    fun deleteRootAndDescendants(uri: String): DocumentDeletionResult?
    fun findContentRootByUri(uri: String): ContentRoot?
    fun existsRootWithUri(uri: String): Boolean
    fun <T : Retrievable> enhance(retrievable: T): T
    fun onNewRetrievables(retrievables: List<Retrievable>)
}

Implementation Strategy:

Batch embedding generation
Transaction management
Relationship creation
Index maintenance

Design Rationale:

Unified document lifecycle
Automatic embedding management
Extensible enhancement pipeline
Backend-agnostic interface

Search Architecture

Search Abstractions

// Basic vector search
interface VectorSearch {
    fun <T : Retrievable> vectorSearch(
        request: TextSimilaritySearchRequest,
        clazz: Class<T>
    ): List<SimilarityResult<T>>
}

// Vector search with filtering
interface FilteringVectorSearch : VectorSearch {
    fun <T : Retrievable> vectorSearchWithFilter(
        request: TextSimilaritySearchRequest,
        clazz: Class<T>,
        metadataFilter: PropertyFilter?,
        entityFilter: EntityFilter?
    ): List<SimilarityResult<T>>
}

// Full-text search
interface TextSearch {
    fun <T : Retrievable> textSearch(
        request: TextSearchRequest,
        clazz: Class<T>
    ): List<SimilarityResult<T>>
}

// Regex search
interface RegexSearch {
    fun <T : Retrievable> regexSearch(
        request: RegexSearchRequest,
        clazz: Class<T>
    ): List<SimilarityResult<T>>
}

Search Modalities

Vector Search

Mechanism: Semantic similarity using embeddings

Query Text → Embedding → Vector Space → Nearest Neighbors → Results
                                ↓
                          Cosine Similarity
                                ↓
                          Similarity Scores

Characteristics:

Captures semantic meaning
Language-agnostic when the embedding model is multilingual
Handles synonyms and paraphrasing
Computationally intensive

Use Cases:

Natural language queries
Concept-based retrieval
Cross-lingual search (requires a multilingual embedding model)
Fuzzy matching
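
The ranking step in the flow above can be sketched as an exhaustive cosine-similarity scan. This is a minimal in-memory illustration, assuming embeddings are already available as `FloatArray` values; the `cosine` and `topK` functions are hypothetical helpers, and a production vector store would normally use an index instead of scanning every vector.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two vectors of equal dimension; 0.0 if either has zero length.
fun cosine(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (sqrt(normA) * sqrt(normB))
}

// Exhaustive nearest-neighbour scan: score every item, sort by score, keep the best k.
fun <T> topK(query: FloatArray, corpus: Map<T, FloatArray>, k: Int): List<Pair<T, Double>> =
    corpus.map { (item, vector) -> item to cosine(query, vector) }
        .sortedByDescending { it.second }
        .take(k)

fun main() {
    val corpus = mapOf(
        "a" to floatArrayOf(1f, 0f),
        "b" to floatArrayOf(0.7f, 0.7f),
        "c" to floatArrayOf(0f, 1f),
    )
    topK(floatArrayOf(1f, 0f), corpus, k = 2).forEach { (id, score) -> println("$id: $score") }
}
```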

Text Search

Mechanism: Full-text search with Lucene-like syntax

Characteristics:

Exact keyword matching
Boolean operators (AND, OR, NOT)
Phrase matching
Wildcard support
Fast execution

Use Cases:

Precise keyword queries
Technical documentation search
Code search
Structured queries

Regex Search

Mechanism: Pattern-based matching

Characteristics:

Deterministic matching
Complex patterns
Field-specific search
No ranking

Use Cases:

Email/phone number extraction
Identifier search
Format validation
Pattern-based filtering
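
For the email extraction use case, a pattern-based search over plain text might be sketched as follows. `RegexHit` and this `regexSearch` function are hypothetical and operate on an in-memory map of document text; they do not use the framework's `RegexSearchRequest`, and the email pattern is deliberately simple rather than standards-complete.

```kotlin
// Hypothetical result type: which document matched and the matched text.
data class RegexHit(val docId: String, val match: String)

// Deterministic, unranked matching: every occurrence in every document is returned in order.
fun regexSearch(pattern: Regex, docs: Map<String, String>): List<RegexHit> =
    docs.flatMap { (id, text) -> pattern.findAll(text).map { RegexHit(id, it.value) }.toList() }

fun main() {
    // Deliberately simple email pattern for illustration.
    val email = Regex("""[\w.+-]+@[\w-]+(\.[\w-]+)+""")
    val docs = mapOf("readme" to "Contact ops@example.com or dev@example.org.")
    regexSearch(email, docs).forEach { println("${it.docId}: ${it.match}") }
}
```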

Filtering Architecture

PropertyFilter DSL

Composable filter expressions for metadata and properties.

sealed interface PropertyFilter {
    operator fun not(): PropertyFilter
    infix fun and(other: PropertyFilter): PropertyFilter
    infix fun or(other: PropertyFilter): PropertyFilter
}

// Comparison filters
data class Eq(val key: String, val value: Any) : PropertyFilter
data class Ne(val key: String, val value: Any) : PropertyFilter
data class Gt(val key: String, val value: Number) : PropertyFilter
data class Gte(val key: String, val value: Number) : PropertyFilter
data class Lt(val key: String, val value: Number) : PropertyFilter
data class Lte(val key: String, val value: Number) : PropertyFilter

// Collection filters
data class In(val key: String, val values: List<Any>) : PropertyFilter
data class Nin(val key: String, val values: List<Any>) : PropertyFilter

// String filters
data class Contains(val key: String, val value: String) : PropertyFilter
data class StartsWith(val key: String, val value: String) : PropertyFilter
data class EndsWith(val key: String, val value: String) : PropertyFilter

// Logical operators
data class And(val filters: List<PropertyFilter>) : PropertyFilter
data class Or(val filters: List<PropertyFilter>) : PropertyFilter
data class Not(val filter: PropertyFilter) : PropertyFilter

Design Rationale:

Type-safe filter construction
Composable expressions
Backend-agnostic DSL
Support for complex boolean logic
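
To show how composable filter expressions behave, the sketch below defines a reduced mirror of the DSL and evaluates it against a metadata map in memory. The `Filter` hierarchy and the `matches` function are hypothetical and cover only a few node types; default bodies for `not`, `and`, and `or` are an assumption, and a real backend would more likely translate the tree into its own query language.

```kotlin
// Reduced, hypothetical mirror of the DSL; only a few node types are modelled.
sealed interface Filter {
    operator fun not(): Filter = Not(this)
    infix fun and(other: Filter): Filter = And(listOf(this, other))
    infix fun or(other: Filter): Filter = Or(listOf(this, other))
}

data class Eq(val key: String, val value: Any) : Filter
data class Gte(val key: String, val value: Number) : Filter
data class And(val filters: List<Filter>) : Filter
data class Or(val filters: List<Filter>) : Filter
data class Not(val filter: Filter) : Filter

// In-memory evaluation against a metadata map; missing or non-numeric values never satisfy Gte.
fun Filter.matches(metadata: Map<String, Any?>): Boolean = when (this) {
    is Eq -> metadata[key] == value
    is Gte -> (metadata[key] as? Number)?.let { it.toDouble() >= value.toDouble() } ?: false
    is And -> filters.all { it.matches(metadata) }
    is Or -> filters.any { it.matches(metadata) }
    is Not -> !filter.matches(metadata)
}

fun main() {
    val filter = (Eq("lang", "en") and Gte("year", 2020)) or !Eq("draft", true)
    println(filter.matches(mapOf("lang" to "en", "year" to 2023, "draft" to true)))
    println(filter.matches(mapOf("lang" to "de", "year" to 2019, "draft" to true)))
}
```

Because the hierarchy is sealed, the `when` expression is checked for exhaustiveness at compile time, which is one practical benefit of a sealed filter type.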

Entity Architecture

Named Entity Model

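interface NamedEntity : Retrievable, NamedAndDescribed {
    override val id: String
    override val name: String
    override val description: String

    // uri, metadata, and labels() are inherited from Datum via Retrievable,
    // so redeclaring them requires the override modifier
    override val uri: String?
    override val metadata: Map<String, Any?>

    override fun labels(): Set<String>
}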

interface NamedEntityData : NamedEntity {
    val properties: Map<String, Any>
    val linkedDomainType: DomainType?

    fun <T : NamedEntity> toTypedInstance(objectMapper: ObjectMapper): T?
    fun <T : NamedEntity> toInstance(
        vararg interfaces: Class<out NamedEntity>
    ): T
}

Relationship Model

@Target(AnnotationTarget.FUNCTION)
@Retention(AnnotationRetention.RUNTIME)
annotation class Relationship(
    val name: String = "",
    val direction: RelationshipDirection = RelationshipDirection.OUTGOING
)

enum class RelationshipDirection {
    OUTGOING, INCOMING, BOTH
}

interface RelationshipNavigator {
    fun findRelated(
        source: RetrievableIdentifier,
        relationshipName: String,
        direction: RelationshipDirection
    ): List<NamedEntityData>
}

Capabilities:

Dynamic proxy generation
Interface-based entity definition
Relationship navigation
Type conversion

Design Rationale:

Flexible entity modeling
No code generation required
Graph-like relationships
Type-safe access
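
Interface-based entities without code generation are commonly built on JDK dynamic proxies. The sketch below shows that general technique, assuming a property map as the backing store; whether the framework uses `java.lang.reflect.Proxy` internally is not stated here, and `Person` and `proxyFor` are hypothetical names.

```kotlin
import java.lang.reflect.InvocationHandler
import java.lang.reflect.Proxy

// Hypothetical entity interface; the framework's own NamedEntity has more members.
interface Person {
    val name: String
    val department: String
}

// Backs an interface with a property map by mapping getter names to keys (getName -> "name").
inline fun <reified T : Any> proxyFor(properties: Map<String, Any?>): T {
    val handler = InvocationHandler { _, method, _ ->
        when {
            method.name == "toString" -> "${T::class.simpleName}$properties"
            method.name.startsWith("get") && method.parameterCount == 0 ->
                properties[method.name.removePrefix("get").replaceFirstChar { it.lowercase() }]
            else -> throw UnsupportedOperationException(method.name)
        }
    }
    return Proxy.newProxyInstance(T::class.java.classLoader, arrayOf(T::class.java), handler) as T
}

fun main() {
    val person = proxyFor<Person>(mapOf("name" to "Ada", "department" to "Research"))
    println("${person.name} works in ${person.department}")
}
```

A fuller version would also handle `equals`, `hashCode`, and relationship-annotated methods; this sketch rejects anything that is not a simple getter.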

Integration Architecture

Spring AI Integration

class SpringVectorStoreVectorSearch(
    val vectorStore: VectorStore
) : FilteringVectorSearch, TypeRetrievalOperations {
    // Adapts Spring AI VectorStore to RAG interface
}

fun PropertyFilter.toSpringAiExpression(): Filter.Expression {
    // Converts RAG filters to Spring AI expressions
}

Integration Points:

VectorStore adapter
Filter expression conversion
Document mapping
Embedding service integration

Benefits:

Leverage Spring AI ecosystem
Multiple vector store backends
Consistent RAG interface
Minimal adaptation code

Extensibility Points

1. Custom Content Readers

Implement HierarchicalContentReader for custom formats:

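class CustomFormatReader : HierarchicalContentReader {
    override fun parseUrl(url: String): NavigableDocument =
        TODO("Custom parsing logic")

    // parseFile, parseStream, and parseDirectory also need overrides
    // unless the interface supplies defaults for them
}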

Use Cases:

Proprietary document formats
Custom metadata extraction
Domain-specific structure inference

2. Custom Refresh Policies

Implement ContentRefreshPolicy for custom refresh logic:

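class CustomRefreshPolicy : ContentRefreshPolicy {
    override fun shouldReread(
        repository: ChunkingContentElementRepository,
        rootUri: String
    ): Boolean = TODO("Custom refresh decision logic")

    // shouldRefreshDocument and ingestUriIfNeeded also need overrides
    // unless the interface supplies defaults for them
}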

Use Cases:

Business-specific refresh rules
External event triggers
Cost optimization strategies
Compliance requirements

3. Custom Chunk Transformers

Extend AbstractChunkTransformer for custom enrichment:

class CustomTransformer : AbstractChunkTransformer() {
    override val name = "custom-transformer"

    // Custom metadata generation goes here; this placeholder adds nothing
    override fun additionalMetadata(
        chunk: Chunk,
        context: ChunkTransformationContext
    ): Map<String, Any> = emptyMap()

    // Custom text transformation goes here; this placeholder keeps the text unchanged
    override fun newText(
        chunk: Chunk,
        context: ChunkTransformationContext
    ): String = chunk.text
}

Use Cases:

Domain-specific metadata
Custom NLP pipelines
Integration with external services
Business rule enforcement

4. Custom Repositories

Extend AbstractChunkingContentElementRepository for custom backends:

class CustomRepository : AbstractChunkingContentElementRepository() {
    override fun persistChunksWithEmbeddings(
        chunks: List<Chunk>,
        embeddings: Map<String, FloatArray>
    ) {
        // Custom persistence logic
    }

    override fun createInternalRelationships(root: NavigableDocument) {
        // Custom relationship creation
    }

    override fun commit() {
        // Custom transaction management
    }
}

Use Cases:

Custom database backends
Distributed storage systems
Caching strategies
Audit logging

5. Custom Entity Types

Define entity interfaces with relationships:

interface Project : NamedEntity {
    @Relationship(name = "HAS_CONTRIBUTOR")
    fun getContributors(): List<Employee>

    @Relationship(name = "DEPENDS_ON")
    fun getDependencies(): List<Project>
}

interface Employee : NamedEntity {
    val department: String
    val role: String

    @Relationship(name = "WORKS_ON")
    fun getProjects(): List<Project>
}

Use Cases:

Domain modeling
Graph-based queries
Relationship navigation
Type-safe entity access

Design Principles

1. Separation of Concerns

Each component has a single, well-defined responsibility:

Reading: Parse content, extract structure
Chunking: Break into retrievable units
Transformation: Enrich and modify
Storage: Persist and index
Search: Retrieve and rank

Benefits:

Easy to understand
Simple to test
Flexible composition
Clear interfaces

2. Interface-Based Design

Dependencies defined through interfaces, not implementations:

Enables multiple implementations
Facilitates testing with mocks
Supports dependency injection
Promotes loose coupling

Benefits:

Testability
Flexibility
Extensibility
Clear contracts

3. Composability

Components can be combined flexibly:

Chain transformers
Combine filters
Layer policies
Compose search strategies

Benefits:

Reusable components
Complex behavior from simple parts
Easy customization
Minimal code duplication

4. Performance Optimization

Optimizations throughout the pipeline:

Batch embedding generation
Lazy evaluation
Caching expensive operations
Efficient vector algorithms
Connection pooling

Benefits:

Scalable to large datasets
Responsive queries
Cost-effective
Resource-efficient

5. Extensibility by Default

Easy to extend without modifying core:

Abstract base classes
Template method pattern
Strategy pattern
Plugin architecture

Benefits:

Customizable behavior
Domain-specific adaptations
Integration with external services
Future-proof design

Performance Characteristics

Ingestion Performance

Time Complexity:

Document parsing: O(n) where n = document size
Chunking: O(n) where n = document size
Embedding generation: O(c * d) where c = chunk count, d = embedding dimension
Storage: O(c) where c = chunk count

Space Complexity:

In-memory document: O(n) where n = document size
Chunks: O(c * s) where c = chunk count, s = chunk size
Embeddings: O(c * d) where c = chunk count, d = embedding dimension

Optimization Strategies:

Batch embedding generation (reduce API calls)
Streaming processing for large documents
Parallel chunking for multiple documents
Incremental updates (only refresh changed content)
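
Batch embedding generation amounts to grouping texts before calling the embedding service. The following sketch assumes a hypothetical `BatchEmbedder` function type that embeds a list of texts in one call; it is not the framework's embedding API, and the fake embedder in `main` exists only to count calls.

```kotlin
// Hypothetical embedding function type: one call embeds a whole batch of texts.
typealias BatchEmbedder = (List<String>) -> List<FloatArray>

// Splits the input into batches so the number of embedder calls is ceil(size / batchSize).
fun embedInBatches(texts: List<String>, batchSize: Int, embedder: BatchEmbedder): List<FloatArray> {
    require(batchSize > 0) { "batchSize must be positive" }
    return texts.chunked(batchSize).flatMap { batch ->
        val vectors = embedder(batch)
        check(vectors.size == batch.size) {
            "embedder returned ${vectors.size} vectors for ${batch.size} texts"
        }
        vectors
    }
}

fun main() {
    var calls = 0
    // Fake embedder that counts calls and encodes each text as its length.
    val fake: BatchEmbedder = { batch ->
        calls++
        batch.map { floatArrayOf(it.length.toFloat()) }
    }
    val vectors = embedInBatches(List(10) { "chunk $it" }, batchSize = 4, embedder = fake)
    println("${vectors.size} vectors from $calls calls")
}
```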

Search Performance

Vector Search:

Embedding generation: O(q) where q = query length
Vector similarity (exhaustive scan): O(n * d) where n = corpus size, d = embedding dimension
With ANN (Approximate Nearest Neighbor): roughly O(log n * d) for graph-based indexes such as HNSW

Text Search:

Index lookup: O(log n) where n = corpus size
Post-filtering: O(k) where k = result count

Optimization Strategies:

Use ANN algorithms (HNSW, IVF)
Implement result caching
Optimize filter execution order
Pre-compute common queries

Deployment Patterns

Standalone Application

┌─────────────────┐
│   Application   │
├─────────────────┤
│   RAG Core      │
├─────────────────┤
│  Embeddings API │
├─────────────────┤
│  Vector Store   │
└─────────────────┘

Characteristics:

Simple deployment
All-in-one process
Lower latency
Limited to the resources of a single process

Microservices Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Ingestion  │    │    Search    │    │   Entities   │
│   Service    │    │   Service    │    │   Service    │
└───────┬──────┘    └───────┬──────┘    └───────┬──────┘
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                    ┌───────┴────────┐
                    │  Vector Store  │
                    │ (Shared State) │
                    └────────────────┘

Characteristics:

Scalable components
Independent deployment
Service isolation
Network overhead

Serverless Architecture

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Lambda:    │    │   Lambda:    │    │   Lambda:    │
│   Ingest     │    │   Search     │    │   Query      │
└───────┬──────┘    └───────┬──────┘    └───────┬──────┘
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                    ┌───────┴────────┐
                    │  Managed Store │
                    │  (Pinecone,    │
                    │   Weaviate)    │
                    └────────────────┘

Characteristics:

Auto-scaling
Pay-per-use
Cold start latency
Managed infrastructure
