Spring AI integration for Azure OpenAI services, providing chat completion, text embeddings, image generation, and audio transcription with GPT, DALL-E, and Whisper models.

Optimization strategies and best practices for production deployments.
Critical: always reuse model instances. Creating new instances is expensive.

```java
@Configuration
public class AIConfiguration {

    @Value("${spring.ai.azure.openai.api-key}")
    private String apiKey;

    @Value("${spring.ai.azure.openai.endpoint}")
    private String endpoint;

    // Spring beans are singleton-scoped by default, so this one instance is shared
    @Bean
    public AzureOpenAiChatModel chatModel() {
        return AzureOpenAiChatModel.builder()
            .openAIClientBuilder(new OpenAIClientBuilder()
                .credential(new AzureKeyCredential(apiKey))
                .endpoint(endpoint))
            .defaultOptions(AzureOpenAiChatOptions.builder()
                .deploymentName("gpt-4o")
                .temperature(0.7)
                .build())
            .build();
    }
}
```
```java
@Service
public class ChatService {

    private final AzureOpenAiChatModel chatModel; // Reused across requests

    public ChatService(AzureOpenAiChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String chat(String message) {
        return chatModel.call(new Prompt(message))
            .getResult()
            .getOutput()
            .getText();
    }
}
```

```java
// DON'T DO THIS: builds a new model instance on every request
public String chat(String message) {
    AzureOpenAiChatModel model = AzureOpenAiChatModel.builder()...build();
    return model.call(new Prompt(message))...; // Wasteful
}
```

Batch inputs where the API supports it. Efficient (1 API call):

```java
List<String> texts = List.of("text1", "text2", "text3", ...);
EmbeddingResponse response = embeddingModel.call(
    new EmbeddingRequest(texts, null)
);
```

Inefficient (N API calls):

```java
for (String text : texts) {
    embeddingModel.call(new EmbeddingRequest(List.of(text), null));
}
```

Recommended batch sizes:

| Operation | Recommended Batch Size | Max Batch Size |
|---|---|---|
| Embeddings | 100-500 | 2048 |
| Chat (parallel) | 5-10 concurrent | Depends on quota |
| Images (DALL-E 2) | 1-4 per request | 10 |
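To stay within these limits when embedding a large corpus, chunk the inputs. A minimal sketch; the `embeddingModel` field and the `batchSize` parameter (pick a value from the table, e.g. 100-500) are assumptions for illustration:

```java
// Embed a large corpus in chunks that respect the recommended batch size
public List<float[]> embedInBatches(List<String> texts, int batchSize) {
    List<float[]> embeddings = new ArrayList<>();
    for (int i = 0; i < texts.size(); i += batchSize) {
        List<String> batch = texts.subList(i, Math.min(i + batchSize, texts.size()));
        EmbeddingResponse response = embeddingModel.call(new EmbeddingRequest(batch, null));
        for (Embedding e : response.getResults()) {
            embeddings.add(e.getOutput());
        }
    }
    return embeddings;
}
```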
For many independent prompts, issue parallel requests with a bounded thread pool:

```java
ExecutorService executor = Executors.newFixedThreadPool(10);

List<CompletableFuture<ChatResponse>> futures = new ArrayList<>();
for (String prompt : prompts) {
    CompletableFuture<ChatResponse> future = CompletableFuture.supplyAsync(
        () -> chatModel.call(new Prompt(prompt)),
        executor
    );
    futures.add(future);
}

List<ChatResponse> responses = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
// Shut down the executor when it is no longer needed
```

Limit concurrency to stay within your deployment's quota:

```java
Semaphore rateLimiter = new Semaphore(10); // Max 10 concurrent
public ChatResponse rateLimitedCall(Prompt prompt) {
    try {
        rateLimiter.acquire();
        try {
            return chatModel.call(prompt);
        } finally {
            rateLimiter.release();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
```

Cache responses to repeated prompts:

```java
@Service
public class CachedChatService {

    private final LoadingCache<String, String> cache;
    private final AzureOpenAiChatModel chatModel;

    public CachedChatService(AzureOpenAiChatModel chatModel) {
        this.chatModel = chatModel;
        this.cache = Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(1, TimeUnit.HOURS)
            .build(this::generateResponse);
    }

    private String generateResponse(String prompt) {
        return chatModel.call(new Prompt(prompt))
            .getResult()
            .getOutput()
            .getText();
    }

    public String getCachedResponse(String prompt) {
        return cache.get(prompt);
    }
}
```

Embeddings are deterministic for a given input, so they cache especially well:

```java
@Service
public class CachedEmbeddingService {

    // Note: unbounded; prefer a bounded cache (e.g., Caffeine) if the key space is large
    private final Map<String, float[]> embeddingCache = new ConcurrentHashMap<>();
    private final AzureOpenAiEmbeddingModel embeddingModel;

    public CachedEmbeddingService(AzureOpenAiEmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] getEmbedding(String text) {
        return embeddingCache.computeIfAbsent(text, t -> {
            EmbeddingResponse response = embeddingModel.call(
                new EmbeddingRequest(List.of(t), null)
            );
            return response.getResults().get(0).getOutput();
        });
    }
}
```

Reduce token usage to cut both cost and latency:

```java
// Use shorter prompts
String verbose = "I would like you to please provide me with information about...";
String concise = "Explain..."; // Better

// Limit response length
AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .maxTokens(500) // Limit response size
    .build();

// Use an appropriate temperature
options = AzureOpenAiChatOptions.builder()
    .temperature(0.0) // Deterministic, often shorter responses
    .build();
```

Trim prompts that exceed a token budget:

```java
public class TokenOptimizer {
    // TokenCounter is an application-provided abstraction (e.g., backed by a tokenizer library)
    private final TokenCounter tokenCounter;

    public String optimizePrompt(String prompt, int maxTokens) {
        int tokens = tokenCounter.count(prompt);
        if (tokens <= maxTokens) {
            return prompt;
        }
        // Truncate to fit
        return truncateToTokens(prompt, maxTokens);
    }
}
```
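The `truncateToTokens` helper is left to the application. One rough sketch below approximates tokens at about four characters each for English text; this ratio is an approximation only, and a tokenizer library (e.g., jtokkit) gives exact counts:

```java
// Approximate truncation: ~4 characters per token for typical English text.
// Swap in a real tokenizer for exact budgeting.
private String truncateToTokens(String prompt, int maxTokens) {
    int approxMaxChars = maxTokens * 4;
    return prompt.length() <= approxMaxChars
        ? prompt
        : prompt.substring(0, approxMaxChars);
}
```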
Use streaming for long responses to improve perceived latency:

```java
public void streamResponse(String prompt, Consumer<String> onToken) {
    Flux<ChatResponse> stream = chatModel.stream(new Prompt(prompt));
    stream.subscribe(
        chunk -> {
            String token = chunk.getResult().getOutput().getText();
            if (token != null) {
                onToken.accept(token); // Update UI immediately
            }
        }
    );
}
```

The Azure SDK handles connection pooling automatically, but you can tune it:

```java
OpenAIClient client = new OpenAIClientBuilder()
    .credential(new AzureKeyCredential(apiKey))
    .endpoint(endpoint)
    .httpClient(new NettyAsyncHttpClientBuilder()
        .connectionProvider(ConnectionProvider.builder("custom")
            .maxConnections(100)
            .maxIdleTime(Duration.ofSeconds(30))
            .build())
        .build())
    .buildClient();
```

Reduce storage and computation by using smaller embedding dimensions:

```java
// Full dimensions (best quality)
AzureOpenAiEmbeddingOptions fullOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(1536)
    .build();

// Reduced dimensions (3x faster search, 67% less storage)
AzureOpenAiEmbeddingOptions reducedOptions = AzureOpenAiEmbeddingOptions.builder()
    .deploymentName("text-embedding-3-small")
    .dimensions(512)
    .build();
```

Dimension trade-offs: fewer dimensions mean faster similarity search and less storage, at some cost in retrieval quality.

Chat models:

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| gpt-35-turbo | Fastest | Lowest | Good | Simple tasks, high volume |
| gpt-4o | Fast | Medium | Excellent | General purpose |
| gpt-4 | Slow | High | Excellent | Complex reasoning |
| o1/o3 | Slowest | Highest | Best | Advanced reasoning |
Embedding models:

| Model | Speed | Cost | Dimensions | Use Case |
|---|---|---|---|---|
| ada-002 | Fast | Low | 1536 | General purpose |
| 3-small | Fast | Low | 512-1536 | Configurable, efficient |
| 3-large | Medium | Medium | 1024-3072 | Best quality |
Image models:

| Model | Speed | Cost | Quality | Use Case |
|---|---|---|---|---|
| DALL-E 2 | Fast | Low | Good | Multiple variations, cost-sensitive |
| DALL-E 3 | Slow | High | Excellent | High-quality, single images |
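One way to act on these trade-offs is to pick the deployment per request through runtime options. A sketch; the deployment names and the `simple` flag are illustrative assumptions:

```java
// Route cheap/simple tasks to gpt-35-turbo, harder ones to gpt-4o
public ChatResponse route(String message, boolean simple) {
    AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
        .deploymentName(simple ? "gpt-35-turbo" : "gpt-4o")
        .build();
    return chatModel.call(new Prompt(message, options));
}
```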
Typical latencies:

| Operation | Typical Latency | Factors |
|---|---|---|
| Chat (gpt-4o, 500 tokens) | 2-5s | Token count, complexity |
| Chat (streaming, first token) | 200-500ms | Network, load |
| Embeddings (100 texts) | 500-1500ms | Batch size, dimensions |
| Image (DALL-E 3) | 10-30s | Size, quality, complexity |
| Audio (1 min, Whisper) | 2-5s | Audio quality, format |
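These numbers also suggest per-operation timeout budgets. A sketch using plain `CompletableFuture.orTimeout`; the budget value is an assumption to be derived from the table, not a library default:

```java
// Guard a blocking chat call with a latency budget
public ChatResponse callWithBudget(Prompt prompt, Duration budget) {
    return CompletableFuture
        .supplyAsync(() -> chatModel.call(prompt))
        .orTimeout(budget.toMillis(), TimeUnit.MILLISECONDS)
        .join(); // throws CompletionException wrapping TimeoutException on overrun
}
```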
Bound conversation history to control prompt size:

```java
public class ConversationManager {

    private static final int MAX_HISTORY = 20;

    public List<Message> trimHistory(List<Message> history) {
        if (history.size() <= MAX_HISTORY) {
            return history;
        }
        // Keep system message + most recent messages
        List<Message> trimmed = new ArrayList<>();
        trimmed.add(history.get(0)); // System message
        trimmed.addAll(history.subList(
            history.size() - (MAX_HISTORY - 1),
            history.size()
        ));
        return trimmed;
    }
}
```

Embedding storage options:

```java
// Store as float[] (4 bytes per dimension)
float[] embedding = new float[1536]; // 6 KB

// Or quantize to int8 (1 byte per dimension)
byte[] quantized = quantizeToInt8(embedding); // 1.5 KB (75% reduction)

// Or use reduced dimensions
float[] reduced = new float[512]; // 2 KB (67% reduction)
```
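`quantizeToInt8` above is not a library call. A minimal symmetric linear quantization sketch; note the scale factor would need to be stored alongside the bytes to reconstruct approximate floats later:

```java
// Map [-maxAbs, maxAbs] linearly onto the int8 range [-127, 127]
public static byte[] quantizeToInt8(float[] embedding) {
    float maxAbs = 0f;
    for (float v : embedding) {
        maxAbs = Math.max(maxAbs, Math.abs(v));
    }
    float scale = (maxAbs == 0f) ? 1f : maxAbs / 127f;
    byte[] quantized = new byte[embedding.length];
    for (int i = 0; i < embedding.length; i++) {
        quantized[i] = (byte) Math.round(embedding[i] / scale);
    }
    return quantized;
}
```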
Relative cost by operation:

| Operation | Cost (Relative) |
|---|---|
| gpt-35-turbo | 1x |
| gpt-4o | 10x |
| gpt-4 | 30x |
| o1 | 50x |
| Embeddings (ada-002) | 0.1x |
| DALL-E 2 (512x512) | 5x |
| DALL-E 3 (1024x1024) | 20x |
| DALL-E 3 HD | 40x |
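These multipliers can back a rough spend tracker. A sketch where the map simply mirrors the table above; use your actual Azure pricing for real accounting:

```java
// Relative cost units, keyed by the table's multipliers
private static final Map<String, Double> RELATIVE_COST = Map.of(
    "gpt-35-turbo", 1.0,
    "gpt-4o", 10.0,
    "gpt-4", 30.0,
    "o1", 50.0
);

public double relativeCostUnits(String model, int totalTokens) {
    // Scale by tokens so longer conversations count proportionally more
    return RELATIVE_COST.getOrDefault(model, 1.0) * totalTokens / 1000.0;
}
```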
Track call counts, latency, and token usage with Micrometer:

```java
@Service
public class MetricsService {

    private final MeterRegistry registry;

    public MetricsService(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordApiCall(String model, long durationMs, boolean success) {
        registry.counter("ai.api.calls",
            "model", model,
            "success", String.valueOf(success)
        ).increment();
        registry.timer("ai.api.duration",
            "model", model
        ).record(durationMs, TimeUnit.MILLISECONDS);
    }

    public void recordTokens(String model, int tokens) {
        registry.counter("ai.tokens.used",
            "model", model
        ).increment(tokens);
    }
}
```

A simple wall-clock comparison of deployments:

```java
public class PerformanceBenchmark {
    public void benchmarkChatModels() {
        String prompt = "Explain quantum computing";

        // Benchmark gpt-35-turbo (turboOptions/gpt4Options: AzureOpenAiChatOptions
        // pre-built with the respective deployment names)
        long start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, turboOptions));
        long turboTime = System.currentTimeMillis() - start;

        // Benchmark gpt-4o
        start = System.currentTimeMillis();
        chatModel.call(new Prompt(prompt, gpt4Options));
        long gpt4Time = System.currentTimeMillis() - start;

        // Single-run wall-clock timing; average several warmed-up runs for stable numbers
        System.out.println("gpt-35-turbo: " + turboTime + "ms");
        System.out.println("gpt-4o: " + gpt4Time + "ms");
    }
}
```

`tessl i tessl/maven-org-springframework-ai--spring-ai-azure-openai@1.1.1`