Common AI framework utilities for the Embabel Agent system, including LLM configuration, output converters, prompt contributors, and embedding service abstractions.
Optimization strategies and best practices for Embabel Agent Common.
Development: Use cheaper models for iteration.

```kotlin
val devOptions = LlmOptions.withModel("gpt-3.5-turbo")
    .withMaxTokens(500)
// ~10x cheaper than GPT-4
```

Production: Use a model appropriate to the task.

```kotlin
// Complexity is an illustrative enum (LOW, MEDIUM, HIGH)
val taskOptions = when (complexity) {
    Complexity.LOW -> LlmOptions.withModel("gpt-3.5-turbo")
    Complexity.MEDIUM -> LlmOptions.withModel("gpt-4")
    Complexity.HIGH -> LlmOptions.withModel("gpt-4-turbo")
}
```

Deterministic tasks (data extraction, classification): use a low temperature.

```kotlin
val extractionOptions = LlmOptions.withModel("gpt-4")
    .withTemperature(0.2)
// More consistent, faster inference
```

Creative tasks (writing, brainstorming): use a higher temperature.

```kotlin
val creativeOptions = LlmOptions.withModel("gpt-4")
    .withTemperature(0.8)
```

Minimize tokens for faster responses and lower costs:
```kotlin
val options = LlmOptions.withModel("gpt-4")
    .withMaxTokens(500) // Not 2000 if you only need brief responses
```

Calculate actual needs:

```kotlin
fun calculateTokenBudget(expectedWords: Int): Int {
    // English averages roughly 1.3 tokens per word
    return (expectedWords * 1.3).toInt()
}

val options = LlmOptions.withModel("gpt-4")
    .withMaxTokens(calculateTokenBudget(300))
```
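For structured output the budget also has to cover JSON syntax and field names, so pad the word-based estimate. A sketch; the 20% headroom is an assumption to tune per task:

```kotlin
// Hypothetical variant: word estimate plus headroom for JSON overhead
fun calculateJsonTokenBudget(expectedWords: Int, headroom: Double = 1.2): Int =
    (expectedWords * 1.3 * headroom).toInt()
```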
DON'T - Multiple individual calls:

```kotlin
val embeddings = texts.map { text ->
    embeddingService.embed(text) // N network calls
}
```

DO - Single batch call:

```kotlin
val embeddings = embeddingService.embed(texts) // 1 network call
```

Improvement: 5-10x faster for large batches.
Process multiple items efficiently:

```kotlin
fun processBatch(items: List<String>): List<Result> {
    return items.chunked(10).flatMap { batch ->
        // Chunking alone doesn't add parallelism; it bounds memory
        // and gives natural batch boundaries for the variants below
        batch.map { item -> process(item) }
    }
}
```

Parallel batching:

```kotlin
fun processBatchParallel(items: List<String>): List<Result> {
    return items.chunked(10).flatMap { batch ->
        batch.parallelStream()
            .map { item -> process(item) }
            .toList()
    }
}
```
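If kotlinx-coroutines is already on the classpath, the same pattern reads more idiomatically with structured concurrency. A sketch, assuming `process` is safe to call concurrently:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

suspend fun processBatchConcurrent(items: List<String>): List<Result> =
    items.chunked(10).flatMap { batch ->
        coroutineScope {
            // Launch each item in the chunk concurrently, then wait;
            // chunk size (10) bounds the parallelism
            batch.map { item -> async(Dispatchers.IO) { process(item) } }
                .awaitAll()
        }
    }
```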
Simple cache:

```kotlin
// Note: grows without bound; fine for short-lived processes,
// risky for long-running services
private val cache = ConcurrentHashMap<String, String>()

fun callWithCache(prompt: String): String {
    return cache.getOrPut(prompt) {
        llmClient.call(prompt)
    }
}
```
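A bounded cache avoids that growth. If Caffeine happens to be available, a sketch; the size and TTL values are assumptions to tune:

```kotlin
import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration

// Size- and time-bounded cache; eviction is handled by the library
private val boundedCache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(30))
    .build<String, String>()

fun callWithBoundedCache(prompt: String): String =
    boundedCache.get(prompt) { p -> llmClient.call(p) }
```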
Time-based cache:

```kotlin
data class CachedResponse(val content: String, val timestamp: Instant)

private val cache = ConcurrentHashMap<String, CachedResponse>()

fun callWithTimedCache(prompt: String, ttl: Duration): String {
    val cached = cache[prompt]
    if (cached != null) {
        val age = Duration.between(cached.timestamp, Instant.now())
        if (age < ttl) {
            return cached.content
        }
    }
    val response = llmClient.call(prompt)
    cache[prompt] = CachedResponse(response, Instant.now())
    return response
}
```
Cache by content hash:

```kotlin
class CachedEmbeddingService(
    private val delegate: EmbeddingService
) : EmbeddingService by delegate {
    // Keyed by exact text; hash the text first if inputs are very large
    private val cache = ConcurrentHashMap<String, FloatArray>()

    override fun embed(text: String): FloatArray {
        return cache.getOrPut(text) {
            delegate.embed(text)
        }
    }

    override fun embed(texts: List<String>): List<FloatArray> {
        // Only call the delegate for texts we haven't seen before
        val toEmbed = texts.filter { it !in cache }.distinct()
        if (toEmbed.isNotEmpty()) {
            val newEmbeddings = delegate.embed(toEmbed)
            toEmbed.zip(newEmbeddings).forEach { (text, embedding) ->
                cache[text] = embedding
            }
        }
        // Read results back from the cache to preserve input order
        // (appending cached and fresh results separately would reorder them)
        return texts.map { cache.getValue(it) }
    }
}
```

Improvement: Near-instant retrieval for repeated content.
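Because the decorator implements the same interface, wiring it in is one line wherever an `EmbeddingService` is available (the `openAiEmbeddings` name below is illustrative):

```kotlin
// Decorate the real service once; callers don't change
val embeddingService: EmbeddingService = CachedEmbeddingService(openAiEmbeddings)
```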
Handle slow consumers (these snippets use Project Reactor and assume `stream` is a `Flux` of events):

```kotlin
stream
    .onBackpressureBuffer(1000) // Buffer up to 1000 items
    .subscribe { event ->
        slowProcess(event)
    }
```

Drop excess items:

```kotlin
stream
    .onBackpressureDrop() // Drop if the consumer is too slow
    .subscribe { event ->
        processQuickly(event)
    }
```
CPU-bound work:

```kotlin
stream
    .parallel()
    .runOn(Schedulers.parallel())
    .map { event ->
        cpuIntensiveOperation(event)
    }
    .sequential()
    .subscribe { result -> save(result) }
```

I/O-bound work:

```kotlin
stream
    .parallel()
    .runOn(Schedulers.boundedElastic())
    .flatMap { event ->
        Mono.fromCallable { databaseOperation(event) }
    }
    .sequential()
    .subscribe { result -> process(result) }
```
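For I/O, `Flux.flatMap` also accepts a concurrency argument, which caps in-flight work without the parallel/sequential pair. A sketch; the limit of 16 is an assumption:

```kotlin
stream
    .flatMap({ event ->
        Mono.fromCallable { databaseOperation(event) }
            .subscribeOn(Schedulers.boundedElastic())
    }, 16) // At most 16 concurrent database calls
    .subscribe { result -> process(result) }
```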
Don't process more than needed:

```kotlin
stream
    .take(100) // Only the first 100 items
    .subscribe { event -> process(event) }
```

Stop on a condition:

```kotlin
stream
    .takeWhile { event -> event.timestamp < cutoff }
    .subscribe { event -> process(event) }
```
DON'T - Create new converter each time:

```kotlin
fun convert(response: String): Person? {
    val converter = JacksonOutputConverter(Person::class.java, objectMapper)
    return converter.convert(response)
}
```

DO - Reuse converter instance:

```kotlin
private val personConverter = JacksonOutputConverter(Person::class.java, objectMapper)

fun convert(response: String): Person? {
    return personConverter.convert(response)
}
```

Reason: Schema generation is expensive and happens once at creation.
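When many target types need converters, one cached instance per class keeps the schema cost amortized. A hypothetical helper, assuming `JacksonOutputConverter<T>` is generic in its target type:

```kotlin
class ConverterRegistry(private val objectMapper: ObjectMapper) {
    private val converters = ConcurrentHashMap<Class<*>, JacksonOutputConverter<*>>()

    @Suppress("UNCHECKED_CAST")
    fun <T> forType(type: Class<T>): JacksonOutputConverter<T> =
        converters.getOrPut(type) {
            JacksonOutputConverter(type, objectMapper)
        } as JacksonOutputConverter<T>
}
```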
DON'T - Create new mapper each time:

```kotlin
val mapper = ObjectMapper().registerKotlinModule()
val converter = JacksonOutputConverter(Person::class.java, mapper)
```

DO - Shared mapper instance:

```kotlin
@Configuration
class JacksonConfig {
    @Bean
    fun objectMapper(): ObjectMapper {
        return ObjectMapper()
            .registerKotlinModule()
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    }
}

@Service
class MyService(private val objectMapper: ObjectMapper) {
    private val converter = JacksonOutputConverter(Person::class.java, objectMapper)
}
```

Reason: ObjectMapper is thread-safe and expensive to create.
DON'T - Load all into memory:

```kotlin
val events = converter.convertStream(hugeJsonl).collectList().block()
// OutOfMemoryError for large data
```

DO - Process incrementally:

```kotlin
converter.convertStream(hugeJsonl)
    .buffer(100) // Process in batches
    .subscribe { batch ->
        processBatch(batch)
    }
```
Optimize FloatArray storage:

```kotlin
// For many embeddings, one packed primitive array avoids per-array
// object overhead and improves cache locality
class EmbeddingStore(numDocs: Int, private val dimensions: Int) {
    private val embeddings = FloatArray(numDocs * dimensions)

    fun setEmbedding(docId: Int, embedding: FloatArray) {
        embedding.copyInto(embeddings, destinationOffset = docId * dimensions)
    }

    fun getEmbedding(docId: Int): FloatArray {
        val start = docId * dimensions
        return embeddings.copyOfRange(start, start + dimensions)
    }
}
// More memory-efficient than List<FloatArray>
```
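A packed store like this typically feeds similarity search; the standard cosine similarity over `FloatArray` for reference:

```kotlin
import kotlin.math.sqrt

// cos(a, b) = (a . b) / (|a| * |b|)
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Dimension mismatch" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```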
val prompt = """
You are an AI assistant. Please be helpful and courteous.
Extract the person's name, age, and email from the text.
Make sure to format it as JSON. Be careful to get all details.
The JSON should have fields for name, age, and email.
${converter.jsonSchema}
Text: $text
"""DO - Be concise:
val prompt = """
Extract person info as JSON:
${converter.jsonSchema}
$text
"""Improvement: Lower token costs, faster responses
Only include schema when needed:

```kotlin
// For structured output
val prompt = "Extract data: ${converter.jsonSchema}\n$text"

// For simple tasks, skip the schema
val prompt = "Summarize: $text"
```
Tune the underlying HTTP client:

```kotlin
@Configuration
class HttpClientConfig {
    @Bean
    fun httpClient(): HttpClient {
        return HttpClient.create()
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10000)
            .option(ChannelOption.SO_KEEPALIVE, true)
            .option(ChannelOption.TCP_NODELAY, true)
    }
}
```
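The connect timeout only bounds connection establishment. On recent Reactor Netty versions, a response timeout bounds the whole exchange; a sketch, with the 30-second value as an assumption:

```kotlin
@Bean
fun httpClient(): HttpClient {
    return HttpClient.create()
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10000)
        .responseTimeout(Duration.ofSeconds(30)) // Bound the full request/response
}
```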
Track request performance:

```kotlin
// Note: not thread-safe as written; guard with synchronization
// or a concurrent collection if requests are recorded in parallel
class PerformanceMetrics {
    private val requestDurations = mutableListOf<Long>()
    private val tokenCounts = mutableListOf<Int>()

    fun recordRequest(durationMs: Long, tokens: Int) {
        requestDurations.add(durationMs)
        tokenCounts.add(tokens)
    }

    fun getStats(): Stats {
        // Assumes at least one request has been recorded
        return Stats(
            avgDuration = requestDurations.average(),
            p95Duration = requestDurations.sorted()[(requestDurations.size * 0.95).toInt()],
            avgTokens = tokenCounts.average(),
            totalRequests = requestDurations.size
        )
    }
}

data class Stats(
    val avgDuration: Double,
    val p95Duration: Long,
    val avgTokens: Double,
    val totalRequests: Int
)
```

Track LLM costs by model:

```kotlin
class CostMonitor(private val pricing: PricingModel) {
    private val costByModel = ConcurrentHashMap<String, Double>()

    // PricingModel is a collaborator assumed to expose costOf(inputTokens, outputTokens)
    fun recordUsage(model: String, inputTokens: Int, outputTokens: Int) {
        val cost = pricing.costOf(inputTokens, outputTokens)
        costByModel.merge(model, cost) { old, new -> old + new }
    }

    fun getTotalCost(): Double = costByModel.values.sum()

    fun getCostByModel(): Map<String, Double> = costByModel.toMap()
}
```
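`PricingModel` itself isn't defined in these snippets; a minimal assumed shape, with per-token rates taken from your provider's price sheet:

```kotlin
// Hypothetical: rates are per single token (price sheets usually quote per 1M)
class PricingModel(
    private val inputCostPerToken: Double,
    private val outputCostPerToken: Double
) {
    fun costOf(inputTokens: Int, outputTokens: Int): Double =
        inputTokens * inputCostPerToken + outputTokens * outputCostPerToken
}
```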
Time individual operations:

```kotlin
inline fun <T> measureTime(operation: String, block: () -> T): T {
    val start = System.currentTimeMillis()
    return try {
        block()
    } finally {
        val duration = System.currentTimeMillis() - start
        logger.info("$operation took ${duration}ms")
    }
}

// Usage
val result = measureTime("LLM call") {
    llmClient.call(prompt)
}
```
Benchmark alternative strategies:

```kotlin
@Test
fun `benchmark embedding strategies`() {
    val texts = (1..1000).map { "Document $it" }

    // Individual calls
    val time1 = measureTimeMillis {
        texts.forEach { embeddingService.embed(it) }
    }

    // Batch call
    val time2 = measureTimeMillis {
        embeddingService.embed(texts)
    }

    println("Individual: ${time1}ms")
    println("Batch: ${time2}ms")
    println("Speedup: ${time1.toDouble() / time2}x")
}
```

Performance targets:

| Operation | Target | Notes |
|---|---|---|
| Embedding (batch 100) | < 2s | Using text-embedding-ada-002 |
| GPT-3.5 call (500 tokens) | < 3s | With low temperature |
| GPT-4 call (500 tokens) | < 8s | With low temperature |
| Streaming JSONL (1000 lines) | < 1s | Parsing only, no LLM |
| Schema generation | < 50ms | Cached after first use |
| Conversion (valid JSON) | < 10ms | Per object |
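Budgets like these can be spot-checked in tests. A sketch using the conversion target from the table, with `personConverter` as defined earlier and `sampleJson` as a placeholder fixture:

```kotlin
// assertTrue from org.junit.jupiter.api.Assertions
@Test
fun `conversion stays within budget`() {
    val runs = 100
    val elapsed = measureTimeMillis {
        repeat(runs) { personConverter.convert(sampleJson) }
    }
    // Target from the table: < 10ms per object
    assertTrue(elapsed / runs < 10, "Conversion averaged ${elapsed / runs}ms per object")
}
```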
Reduce input tokens: keep prompts concise and include the JSON schema only when structured output is required.

Reduce output tokens: set withMaxTokens to the smallest budget that fits the expected response.

Model pricing changes frequently; compare your provider's current per-1M-token rates.

Use cheaper models when possible:

```kotlin
val simpleTask = LlmOptions.withModel("gpt-3.5-turbo")
// 60x cheaper than GPT-4
```
Install:

```
tessl i tessl/maven-com-embabel-agent--embabel-agent-common@0.3.1
```