Spring AI integration for Azure OpenAI services, providing chat completion, text embeddings, image generation, and audio transcription with GPT, DALL-E, and Whisper models.

Optimization strategies and best practices for production deployments.
Critical: always reuse model instances. Creating new instances is expensive.

```java
@Configuration
public class AIConfiguration {

    @Value("${spring.ai.azure.openai.api-key}")
    private String apiKey;

    @Value("${spring.ai.azure.openai.endpoint}")
    private String endpoint;

    // Spring beans are singleton-scoped by default, so this one instance is shared
    @Bean
    public AzureOpenAiChatModel chatModel() {
        return AzureOpenAiChatModel.builder()
            .openAIClientBuilder(new OpenAIClientBuilder()
                .credential(new AzureKeyCredential(apiKey))
                .endpoint(endpoint))
            .defaultOptions(AzureOpenAiChatOptions.builder()
                .deploymentName("gpt-4o")
                .temperature(0.7)
                .build())
            .build();
    }
}
```
```java
@Service
public class ChatService {

    private final AzureOpenAiChatModel chatModel; // Reused across requests

    public ChatService(AzureOpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(String message) {
        return chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getText();
    }
}
```

```java
// DON'T DO THIS: builds a new model instance on every request
public String chat(String message) {
    AzureOpenAiChatModel model = AzureOpenAiChatModel.builder()...build();
    return model.call(new Prompt(message))...; // Wasteful
}
```

Batch inputs where the API supports it. Efficient (1 API call):

```java
List<String> texts = List.of("text1", "text2", "text3", ...);
EmbeddingResponse response = embeddingModel.call(
    new EmbeddingRequest(texts, null)
);
```

Inefficient (N API calls):

```java
for (String text : texts) {
    embeddingModel.call(new EmbeddingRequest(List.of(text), null));
}
```

Recommended batch sizes:

| Operation | Recommended Batch Size | Max Batch Size |
|---|---|---|
| Embeddings | 100-500 | 2048 |
| Chat (parallel) | 5-10 concurrent | Depends on quota |
| Images (DALL-E 2) | 1-4 per request | 10 |
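To stay within these limits when embedding a large corpus, chunk the inputs. A minimal sketch; the `embeddingModel` field and the `batchSize` parameter (pick a value from the table, e.g. 100-500) are assumptions for illustration:

```java
// Embed a large corpus in chunks that respect the recommended batch size
public List<float[]> embedInBatches(List<String> texts, int batchSize) {
    List<float[]> embeddings = new ArrayList<>();
    for (int i = 0; i < texts.size(); i += batchSize) {
        List<String> batch = texts.subList(i, Math.min(i + batchSize, texts.size()));
        EmbeddingResponse response = embeddingModel.call(new EmbeddingRequest(batch, null));
        for (Embedding e : response.getResults()) {
            embeddings.add(e.getOutput());
        }
    }
    return embeddings;
}
```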
For many independent prompts, issue parallel requests with a bounded thread pool:

```java
ExecutorService executor = Executors.newFixedThreadPool(10);

List<CompletableFuture<ChatResponse>> futures = new ArrayList<>();
for (String prompt : prompts) {
    CompletableFuture<ChatResponse> future = CompletableFuture.supplyAsync(
        () -> chatModel.call(new Prompt(prompt)),
        executor
    );
    futures.add(future);
}

List<ChatResponse> responses = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
// Shut down the executor when it is no longer needed
```

Limit concurrency to stay within your deployment's quota:

```java
Semaphore rateLimiter = new Semaphore(10); // Max 10 concurrent
public ChatResponse rateLimitedCall(Prompt prompt) {
    try {
        rateLimiter.acquire();
        try {
            return chatModel.call(prompt);
        } finally {
            rateLimiter.release();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
```

Cache responses to repeated prompts:

```java
@Service
public class CachedChatService {

    private final LoadingCache<String, String> cache;
    private final AzureOpenAiChatModel chatModel;

    public CachedChatService(AzureOpenAiChatModel chatModel) {
        this.chatModel = chatModel;
        this.cache = Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(1, TimeUnit.HOURS)
            .build(this::generateResponse);
    }

    private String generateResponse(String prompt) {
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getText();
    }

    public String getCachedResponse(String prompt) {
        return cache.get(prompt);
    }
}
```

Embeddings are deterministic for a given input, so they cache especially well:

```java
@Service
public class CachedEmbeddingService {

    // Note: unbounded; prefer a bounded cache (e.g., Caffeine) if the key space is large
    private final Map<String, float[]> embeddingCache = new ConcurrentHashMap<>();
    private final AzureOpenAiEmbeddingModel embeddingModel;

    public CachedEmbeddingService(AzureOpenAiEmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] getEmbedding(String text) {
        return embeddingCache.computeIfAbsent(text, t -> {
            EmbeddingResponse response = embeddingModel.call(
                new EmbeddingRequest(List.of(t), null)
            );
            return response.getResults().get(0).getOutput();
        });
    }
}
```

Reduce token usage to cut both cost and latency:

```java
// Use shorter prompts
String verbose = "I would like you to please provide me with information about...";
String concise = "Explain..."; // Better

// Limit response length
AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .maxTokens(500) // Limit response size
    .build();

// Use an appropriate temperature
options = AzureOpenAiChatOptions.builder()
    .temperature(0.0) // Deterministic, often shorter responses
    .build();
```

Trim prompts that exceed a token budget:

```java
public class TokenOptimizer {
    // TokenCounter is an application-provided abstraction (e.g., backed by a tokenizer library)
    private final TokenCounter tokenCounter;

    public String optimizePrompt(String prompt, int maxTokens) {
        int tokens = tokenCounter.count(prompt);
        if (tokens <= maxTokens) {
            return prompt;
        }
        // Truncate to fit
        return truncateToTokens(prompt, maxTokens);
    }
}
```
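The `truncateToTokens` helper is left to the application. One rough sketch below approximates tokens at about four characters each for English text; this ratio is an approximation only, and a tokenizer library (e.g., jtokkit) gives exact counts:

```java
// Approximate truncation: ~4 characters per token for typical English text.
// Swap in a real tokenizer for exact budgeting.
private String truncateToTokens(String prompt, int maxTokens) {
    int approxMaxChars = maxTokens * 4;
    return prompt.length() <= approxMaxChars
        ? prompt
        : prompt.substring(0, approxMaxChars);
}
```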
Use streaming for long responses to improve perceived latency:

```java
public void streamResponse(String prompt, Consumer<String> onToken) {
    Flux<ChatResponse> stream = chatModel.stream(new Prompt(prompt));
    stream.subscribe(
        chunk -> {
            String token = chunk.getResult().getOutput().getText();
            if (token != null) {
                onToken.accept(token); // Update UI immediately
            }
        }
    );
}
```

The Azure SDK handles connection pooling automatically, but you can tune it:

```java
OpenAIClient client = new OpenAIClientBuilder()
    .credential(new AzureKeyCredential(apiKey))
    .endpoint(endpoint)
    .httpClient(new NettyAsyncHttpClientBuilder()
        .connectionProvider(ConnectionProvider.builder("custom")
            .maxConnections(100)
            .maxIdleTime(Duration.ofSeconds(30))
            .build())
        .build())
    .buildClient();
```

Reduce storage and computation by using smaller embedding dimensions:

```java
// Full dimensions (best quality)
AzureOpenAiEmbeddingOptions fullOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(1536)
    .build();

// Reduced dimensions (3x faster search, 67% less storage)
AzureOpenAiEmbeddingOptions reducedOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(512)
    .build();
```

Dimension trade-offs: fewer dimensions mean faster similarity search and less storage, at some cost in retrieval quality.

Chat models:

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| gpt-35-turbo | Fastest | Lowest | Good | Simple tasks, high volume |
| gpt-4o | Fast | Medium | Excellent | General purpose |
| gpt-4 | Slow | High | Excellent | Complex reasoning |
| o1/o3 | Slowest | Highest | Best | Advanced reasoning |
Embedding models:

| Model | Speed | Cost | Dimensions | Use Case |
|---|---|---|---|---|
| ada-002 | Fast | Low | 1536 | General purpose |
| 3-small | Fast | Low | 512-1536 | Configurable, efficient |
| 3-large | Medium | Medium | 1024-3072 | Best quality |
Image models:

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| DALL-E 2 | Fast | Low | Good | Multiple variations, cost-sensitive |
| DALL-E 3 | Slow | High | Excellent | High-quality, single images |
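One way to act on these trade-offs is to pick the deployment per request through runtime options. A sketch; the deployment names and the `simple` flag are illustrative assumptions:

```java
// Route cheap/simple tasks to gpt-35-turbo, harder ones to gpt-4o
public ChatResponse route(String message, boolean simple) {
    AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
        .deploymentName(simple ? "gpt-35-turbo" : "gpt-4o")
        .build();
    return chatModel.call(new Prompt(message, options));
}
```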
Typical latencies:

| Operation | Typical Latency | Factors |
|---|---|---|
| Chat (gpt-4o, 500 tokens) | 2-5s | Token count, complexity |
| Chat (streaming, first token) | 200-500ms | Network, load |
| Embeddings (100 texts) | 500-1500ms | Batch size, dimensions |
| Image (DALL-E 3) | 10-30s | Size, quality, complexity |
| Audio (1 min, Whisper) | 2-5s | Audio quality, format |
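These numbers also suggest per-operation timeout budgets. A sketch using plain `CompletableFuture.orTimeout`; the budget value is an assumption to be derived from the table, not a library default:

```java
// Guard a blocking chat call with a latency budget
public ChatResponse callWithBudget(Prompt prompt, Duration budget) {
    return CompletableFuture
        .supplyAsync(() -> chatModel.call(prompt))
        .orTimeout(budget.toMillis(), TimeUnit.MILLISECONDS)
        .join(); // throws CompletionException wrapping TimeoutException on overrun
}
```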
Bound conversation history to control prompt size:

```java
public class ConversationManager {

    private static final int MAX_HISTORY = 20;

    public List<Message> trimHistory(List<Message> history) {
        if (history.size() <= MAX_HISTORY) {
            return history;
        }
        // Keep system message + most recent messages
        List<Message> trimmed = new ArrayList<>();
        trimmed.add(history.get(0)); // System message
        trimmed.addAll(history.subList(
            history.size() - (MAX_HISTORY - 1),
            history.size()
        ));
        return trimmed;
    }
}
```

Embedding storage options:

```java
// Store as float[] (4 bytes per dimension)
float[] embedding = new float[1536]; // 6 KB

// Or quantize to int8 (1 byte per dimension)
byte[] quantized = quantizeToInt8(embedding); // 1.5 KB (75% reduction)

// Or use reduced dimensions
float[] reduced = new float[512]; // 2 KB (67% reduction)
```
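`quantizeToInt8` above is not a library call. A minimal symmetric linear quantization sketch; note the scale factor would need to be stored alongside the bytes to reconstruct approximate floats later:

```java
// Map [-maxAbs, maxAbs] linearly onto the int8 range [-127, 127]
public static byte[] quantizeToInt8(float[] embedding) {
    float maxAbs = 0f;
    for (float v : embedding) {
        maxAbs = Math.max(maxAbs, Math.abs(v));
    }
    float scale = (maxAbs == 0f) ? 1f : maxAbs / 127f;
    byte[] quantized = new byte[embedding.length];
    for (int i = 0; i < embedding.length; i++) {
        quantized[i] = (byte) Math.round(embedding[i] / scale);
    }
    return quantized;
}
```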
Relative cost by operation:

| Operation | Cost (Relative) |
|---|---|
| gpt-35-turbo | 1x |
| gpt-4o | 10x |
| gpt-4 | 30x |
| o1 | 50x |
| Embeddings (ada-002) | 0.1x |
| DALL-E 2 (512x512) | 5x |
| DALL-E 3 (1024x1024) | 20x |
| DALL-E 3 HD | 40x |
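These multipliers can back a rough spend tracker. A sketch where the map simply mirrors the table above; use your actual Azure pricing for real accounting:

```java
// Relative cost units, keyed by the table's multipliers
private static final Map<String, Double> RELATIVE_COST = Map.of(
    "gpt-35-turbo", 1.0,
    "gpt-4o", 10.0,
    "gpt-4", 30.0,
    "o1", 50.0
);

public double relativeCostUnits(String model, int totalTokens) {
    // Scale by tokens so longer conversations count proportionally more
    return RELATIVE_COST.getOrDefault(model, 1.0) * totalTokens / 1000.0;
}
```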
Track call counts, latency, and token usage with Micrometer:

```java
@Service
public class MetricsService {

    private final MeterRegistry registry;

    public MetricsService(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordApiCall(String model, long durationMs, boolean success) {
        registry.counter("ai.api.calls",
            "model", model,
            "success", String.valueOf(success)
        ).increment();
        registry.timer("ai.api.duration",
            "model", model
        ).record(durationMs, TimeUnit.MILLISECONDS);
    }

    public void recordTokens(String model, int tokens) {
        registry.counter("ai.tokens.used",
            "model", model
        ).increment(tokens);
    }
}
```

A simple wall-clock comparison of deployments:

```java
public class PerformanceBenchmark {
    public void benchmarkChatModels() {
        String prompt = "Explain quantum computing";

        // Benchmark gpt-35-turbo (turboOptions/gpt4Options: AzureOpenAiChatOptions
        // pre-built with the respective deployment names)
        long start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, turboOptions));
        long turboTime = System.currentTimeMillis() - start;

        // Benchmark gpt-4o
        start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, gpt4Options));
        long gpt4Time = System.currentTimeMillis() - start;

        // Single-run wall-clock timing; average several warmed-up runs for stable numbers
        System.out.println("gpt-35-turbo: " + turboTime + "ms");
        System.out.println("gpt-4o: " + gpt4Time + "ms");
    }
}
```

`tessl i tessl/maven-org-springframework-ai--spring-ai-azure-openai@1.1.1`