Performance Guide

Optimization strategies and best practices for production deployments.

Model Instance Reuse

Critical: Always reuse model instances. Creating new instances is expensive.

✅ Correct - Singleton Pattern

@Configuration
public class AIConfiguration {

    @Value("${azure.openai.api-key}")
    private String apiKey;

    @Value("${azure.openai.endpoint}")
    private String endpoint;

    // Spring @Bean methods are singleton-scoped by default, so this model
    // is created once and shared across the whole application.
    @Bean
    public AzureOpenAiChatModel chatModel() {
        return AzureOpenAiChatModel.builder()
            .openAIClientBuilder(new OpenAIClientBuilder()
                .credential(new AzureKeyCredential(apiKey))
                .endpoint(endpoint))
            .defaultOptions(AzureOpenAiChatOptions.builder()
                .deploymentName("gpt-4o")
                .temperature(0.7)
                .build())
            .build();
    }
}

@Service
public class ChatService {
    @Autowired
    private AzureOpenAiChatModel chatModel;  // Reuse across requests
    
    public String chat(String message) {
        return chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getText();
    }
}

❌ Incorrect - Creating Per Request

// DON'T DO THIS
public String chat(String message) {
    AzureOpenAiChatModel model = AzureOpenAiChatModel.builder()...build();
    return model.call(new Prompt(message))...;  // Wasteful
}

Batch Processing

Embeddings - Batch Multiple Texts

Efficient (1 API call):

List<String> texts = List.of("text1", "text2", "text3", ...);
EmbeddingResponse response = embeddingModel.call(
    new EmbeddingRequest(texts, null)
);

Inefficient (N API calls):

for (String text : texts) {
    embeddingModel.call(new EmbeddingRequest(List.of(text), null));
}

Optimal Batch Sizes

| Operation | Recommended Batch Size | Max Batch Size |
|---|---|---|
| Embeddings | 100-500 | 2048 |
| Chat (parallel) | 5-10 concurrent | Depends on quota |
| Images (DALL-E 2) | 1-4 per request | 10 |
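
If an input list can exceed these limits, one approach is to split it into chunks before calling the model. A minimal sketch, assuming an injected embeddingModel as in the examples above; the embedTexts helper name and the chunk size of 500 are illustrative, not part of the API:

// Split a large list of texts into chunks below the service's batch limit,
// then call the embedding model once per chunk and collect the vectors.
public List<float[]> embedTexts(List<String> texts) {
    final int batchSize = 500;  // illustrative; stay under the 2048 limit
    List<float[]> embeddings = new ArrayList<>();
    for (int i = 0; i < texts.size(); i += batchSize) {
        List<String> batch = texts.subList(i, Math.min(i + batchSize, texts.size()));
        EmbeddingResponse response = embeddingModel.call(
            new EmbeddingRequest(batch, null)
        );
        response.getResults().forEach(result -> embeddings.add(result.getOutput()));
    }
    return embeddings;
}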

Parallel Processing

Thread-Safe Concurrent Requests

ExecutorService executor = Executors.newFixedThreadPool(10);
List<CompletableFuture<ChatResponse>> futures = new ArrayList<>();

for (String prompt : prompts) {
    CompletableFuture<ChatResponse> future = CompletableFuture.supplyAsync(
        () -> chatModel.call(new Prompt(prompt)),
        executor
    );
    futures.add(future);
}

List<ChatResponse> responses = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());

Rate Limiting with Semaphore

Semaphore rateLimiter = new Semaphore(10);  // Max 10 concurrent

public ChatResponse rateLimitedCall(Prompt prompt) {
    try {
        rateLimiter.acquire();
        try {
            return chatModel.call(prompt);
        } finally {
            rateLimiter.release();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}

Caching Strategies

Response Caching

@Service
public class CachedChatService {
    private final LoadingCache<String, String> cache;
    private final AzureOpenAiChatModel chatModel;
    
    public CachedChatService(AzureOpenAiChatModel chatModel) {
        this.chatModel = chatModel;
        this.cache = Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(1, TimeUnit.HOURS)
            .build(this::generateResponse);
    }
    
    private String generateResponse(String prompt) {
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getText();
    }
    
    public String getCachedResponse(String prompt) {
        return cache.get(prompt);
    }
}

Embedding Caching

@Service
public class CachedEmbeddingService {
    private final Map<String, float[]> embeddingCache = new ConcurrentHashMap<>();
    private final AzureOpenAiEmbeddingModel embeddingModel;
    
    public float[] getEmbedding(String text) {
        return embeddingCache.computeIfAbsent(text, t -> {
            EmbeddingResponse response = embeddingModel.call(
                new EmbeddingRequest(List.of(t), null)
            );
            return response.getResults().get(0).getOutput();
        });
    }
}

Token Optimization

Reduce Token Usage

// Use shorter prompts
String verbose = "I would like you to please provide me with information about...";
String concise = "Explain...";  // Better

// Limit response length
AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .maxTokens(500)  // Limit response size
    .build();

// Use appropriate temperature
options = AzureOpenAiChatOptions.builder()
    .temperature(0.0)  // Deterministic, often shorter responses
    .build();

Token Counting

public class TokenOptimizer {
    // TokenCounter and truncateToTokens are placeholders for an
    // application-provided tokenizer (e.g. a tiktoken-compatible library).
    private final TokenCounter tokenCounter;
    
    public String optimizePrompt(String prompt, int maxTokens) {
        int tokens = tokenCounter.count(prompt);
        
        if (tokens <= maxTokens) {
            return prompt;
        }
        
        // Truncate to fit, e.g. drop the oldest context or summarize it
        return truncateToTokens(prompt, maxTokens);
    }
}

Streaming for Better UX

Use streaming for long responses to improve perceived latency:

public void streamResponse(String prompt, Consumer<String> onToken) {
    Flux<ChatResponse> stream = chatModel.stream(new Prompt(prompt));
    
    stream.subscribe(
        chunk -> {
            String token = chunk.getResult().getOutput().getText();
            if (token != null) {
                onToken.accept(token);  // Update UI immediately
            }
        }
    );
}

Connection Pooling

Azure SDK handles connection pooling automatically, but you can tune it:

OpenAIClient client = new OpenAIClientBuilder()
    .credential(new AzureKeyCredential(apiKey))
    .endpoint(endpoint)
    .httpClient(new NettyAsyncHttpClientBuilder()
        .connectionProvider(ConnectionProvider.builder("custom")
            .maxConnections(100)
            .maxIdleTime(Duration.ofSeconds(30))
            .build())
        .build())
    .buildClient();

Embedding Dimension Optimization

Reduce storage and computation by using smaller dimensions:

// Full dimensions (best quality)
AzureOpenAiEmbeddingOptions fullOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(1536)
    .build();

// Reduced dimensions (3x faster search, 67% less storage)
AzureOpenAiEmbeddingOptions reducedOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(512)
    .build();

Dimension Trade-offs:

  • 1536 → 512: ~5% accuracy loss, 3x faster search
  • 1536 → 768: ~2% accuracy loss, 2x faster search
  • 1536 → 1024: ~1% accuracy loss, 1.5x faster search

Model Selection

Chat Models

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| gpt-35-turbo | Fastest | Lowest | Good | Simple tasks, high volume |
| gpt-4o | Fast | Medium | Excellent | General purpose |
| gpt-4 | Slow | High | Excellent | Complex reasoning |
| o1/o3 | Slowest | Highest | Best | Advanced reasoning |
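
Because options can be passed per request, simple prompts can be routed to a cheaper deployment while larger models are reserved for complex ones. A minimal sketch, assuming a shared chatModel as above; the deployment names, userMessage, and isSimpleTask check are illustrative assumptions:

// Per-request options override the model's default options, so one shared
// model instance can serve different deployments.
AzureOpenAiChatOptions cheapOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("gpt-35-turbo")   // must match a deployment configured in Azure
    .maxTokens(300)
    .build();

AzureOpenAiChatOptions qualityOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("gpt-4o")
    .build();

ChatResponse response = chatModel.call(
    new Prompt(userMessage, isSimpleTask ? cheapOptions : qualityOptions)
);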

Embedding Models

| Model | Speed | Cost | Dimensions | Use Case |
|---|---|---|---|---|
| ada-002 | Fast | Low | 1536 | General purpose |
| 3-small | Fast | Low | 512-1536 | Configurable, efficient |
| 3-large | Medium | Medium | 1024-3072 | Best quality |

Image Models

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| DALL-E 2 | Fast | Low | Good | Multiple variations, cost-sensitive |
| DALL-E 3 | Slow | High | Excellent | High-quality, single images |

Latency Optimization

Reduce Network Latency

  1. Choose Closest Region: Deploy in same region as Azure OpenAI
  2. Use Streaming: Start displaying results immediately
  3. Parallel Requests: Process multiple requests concurrently
  4. Cache Aggressively: Cache common queries

Typical Latencies

| Operation | Typical Latency | Factors |
|---|---|---|
| Chat (gpt-4o, 500 tokens) | 2-5 s | Token count, complexity |
| Chat (streaming, first token) | 200-500 ms | Network, load |
| Embeddings (100 texts) | 500-1500 ms | Batch size, dimensions |
| Image (DALL-E 3) | 10-30 s | Size, quality, complexity |
| Audio (1 min, Whisper) | 2-5 s | Audio quality, format |

Memory Optimization

Conversation History Management

public class ConversationManager {
    private static final int MAX_HISTORY = 20;
    
    public List<Message> trimHistory(List<Message> history) {
        if (history.size() <= MAX_HISTORY) {
            return history;
        }
        
        // Keep system message + recent messages
        List<Message> trimmed = new ArrayList<>();
        trimmed.add(history.get(0));  // System message
        trimmed.addAll(history.subList(
            history.size() - (MAX_HISTORY - 1),
            history.size()
        ));
        
        return trimmed;
    }
}

Embedding Storage Optimization

// Store as float[] (4 bytes per dimension)
float[] embedding = new float[1536];  // 6 KB

// Or quantize to int8 (1 byte per dimension)
byte[] quantized = quantizeToInt8(embedding);  // 1.5 KB (75% reduction)

// Or use reduced dimensions
float[] reduced = new float[512];  // 2 KB (67% reduction)
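
quantizeToInt8 above is not provided by the library. A minimal per-vector min-max quantization sketch, one of several possible schemes:

// Simple per-vector min-max quantization to signed bytes. If you need to
// dequantize later, store the scale and minimum alongside the vector;
// this sketch only shows the forward direction.
public static byte[] quantizeToInt8(float[] embedding) {
    float min = Float.MAX_VALUE;
    float max = -Float.MAX_VALUE;
    for (float v : embedding) {
        min = Math.min(min, v);
        max = Math.max(max, v);
    }
    float scale = (max - min) == 0 ? 1f : (max - min) / 255f;
    byte[] quantized = new byte[embedding.length];
    for (int i = 0; i < embedding.length; i++) {
        // Map each value into [-128, 127]
        quantized[i] = (byte) (Math.round((embedding[i] - min) / scale) - 128);
    }
    return quantized;
}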

Cost Optimization

Strategies

  1. Use Cheaper Models: gpt-35-turbo instead of gpt-4 when possible
  2. Reduce Token Usage: Shorter prompts and responses
  3. Cache Responses: Avoid redundant API calls
  4. Batch Operations: Embeddings batch processing
  5. Reduce Dimensions: Use smaller embedding dimensions
  6. Use DALL-E 2: For non-critical image generation

Cost Comparison (Approximate)

| Operation | Cost (Relative) |
|---|---|
| gpt-35-turbo | 1x |
| gpt-4o | 10x |
| gpt-4 | 30x |
| o1 | 50x |
| Embeddings (ada-002) | 0.1x |
| DALL-E 2 (512x512) | 5x |
| DALL-E 3 (1024x1024) | 20x |
| DALL-E 3 HD | 40x |

Monitoring and Profiling

Key Metrics to Track

@Service
public class MetricsService {
    private final MeterRegistry registry;
    
    public void recordApiCall(String model, long durationMs, boolean success) {
        registry.counter("ai.api.calls",
            "model", model,
            "success", String.valueOf(success)
        ).increment();
        
        registry.timer("ai.api.duration",
            "model", model
        ).record(durationMs, TimeUnit.MILLISECONDS);
    }
    
    public void recordTokens(String model, int tokens) {
        registry.counter("ai.tokens.used",
            "model", model
        ).increment(tokens);
    }
}

Performance Benchmarking

public class PerformanceBenchmark {
    
    // chatModel is the shared AzureOpenAiChatModel; turboOptions and gpt4Options
    // are AzureOpenAiChatOptions whose deploymentName points at the gpt-35-turbo
    // and gpt-4o deployments respectively.
    public void benchmarkChatModels() {
        String prompt = "Explain quantum computing";
        
        // Benchmark gpt-35-turbo
        long start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, turboOptions));
        long turboTime = System.currentTimeMillis() - start;
        
        // Benchmark gpt-4o
        start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, gpt4Options));
        long gpt4Time = System.currentTimeMillis() - start;
        
        System.out.println("gpt-35-turbo: " + turboTime + "ms");
        System.out.println("gpt-4o: " + gpt4Time + "ms");
    }
}

Best Practices Summary

  1. Reuse model instances - Create once, use many times
  2. Batch operations - Process multiple items in one request
  3. Use caching - Cache responses and embeddings
  4. Parallel processing - Handle concurrent requests efficiently
  5. Choose appropriate models - Balance cost, speed, and quality
  6. Optimize tokens - Use concise prompts and limit responses
  7. Stream when possible - Improve perceived latency
  8. Monitor metrics - Track performance and costs
  9. Use reduced dimensions - For embeddings when appropriate
  10. Implement rate limiting - Prevent quota exhaustion

See Also

  • Error Handling - Handle failures efficiently
  • Chat API Reference - Chat optimization details
  • Embeddings API Reference - Embedding optimization
  • Real-World Scenarios - Production examples