
tessl/maven-org-springframework-ai--spring-ai-azure-openai

Spring AI integration for Azure OpenAI services providing chat completion, text embeddings, image generation, and audio transcription with GPT, DALL-E, and Whisper models

docs/reference/chat-api.md

Chat Completion

The chat completion API provides conversational AI capabilities using Azure OpenAI's GPT models. It supports both synchronous and streaming responses, tool calling, structured outputs, and comprehensive configuration options.

Imports

import org.springframework.ai.azure.openai.AzureOpenAiChatModel;
import org.springframework.ai.azure.openai.AzureOpenAiChatOptions;
import org.springframework.ai.azure.openai.AzureOpenAiResponseFormat;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.metadata.ChatResponseMetadata;
import org.springframework.ai.chat.metadata.Usage;
import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.core.credential.AzureKeyCredential;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.time.Duration;

AzureOpenAiChatModel

The main class for chat completion operations.

Thread Safety

Thread-Safe: AzureOpenAiChatModel is fully thread-safe and can be safely used across multiple threads concurrently. A single instance can handle multiple concurrent requests.

Recommendation: Create one instance and reuse it across your application rather than creating new instances for each request.
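Because the model is thread-safe, one shared instance can serve a whole worker pool. The sketch below shows the pattern with a plain `UnaryOperator<String>` standing in for the shared `AzureOpenAiChatModel` (the stand-in and the helper method are hypothetical; a real `chatModel.call(...)` would go where `apply` is):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

public class SharedModelSketch {
    // One shared, thread-safe model instance serves all workers; the
    // UnaryOperator is a hypothetical stand-in for AzureOpenAiChatModel.
    static List<String> askConcurrently(UnaryOperator<String> model, List<String> prompts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = pool.invokeAll(prompts.stream()
                .map(p -> (Callable<String>) () -> model.apply(p))
                .toList());
            List<String> answers = new ArrayList<>();
            for (Future<String> f : futures) {
                answers.add(f.get()); // results come back in prompt order
            }
            return answers;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        UnaryOperator<String> sharedModel = prompt -> "answer:" + prompt;
        System.out.println(askConcurrently(sharedModel, List.of("q1", "q2", "q3")));
    }
}
```

The key point is that no per-request construction happens inside the workers: the single instance is captured once and reused.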

Construction

Using Builder Pattern

class AzureOpenAiChatModel {
    static Builder builder();

    class Builder {
        Builder openAIClientBuilder(OpenAIClientBuilder openAIClientBuilder);
        Builder defaultOptions(AzureOpenAiChatOptions defaultOptions);
        Builder toolCallingManager(ToolCallingManager toolCallingManager);
        Builder toolExecutionEligibilityPredicate(ToolExecutionEligibilityPredicate predicate);
        Builder observationRegistry(ObservationRegistry observationRegistry);
        AzureOpenAiChatModel build();
    }
}

Builder Parameters:

  • openAIClientBuilder: Azure OpenAI client builder for authentication and connection (required, non-null)
  • defaultOptions: Default chat options applied to all requests (optional, can be overridden per-request)
  • toolCallingManager: Manages tool/function calling capabilities (optional, null disables tool calling)
  • toolExecutionEligibilityPredicate: Predicate to determine if a tool should be executed (optional, for advanced tool control)
  • observationRegistry: Micrometer observation registry for metrics and distributed tracing (optional, null disables observability)

Example - Basic Builder:

AzureOpenAiChatModel chatModel = AzureOpenAiChatModel.builder()
    .openAIClientBuilder(new OpenAIClientBuilder()
        .credential(new AzureKeyCredential(apiKey))
        .endpoint(endpoint))
    .defaultOptions(AzureOpenAiChatOptions.builder()
        .deploymentName("gpt-4o")
        .temperature(0.7)
        .build())
    .observationRegistry(observationRegistry)
    .build();

Example - With Tool Execution Control:

// Define custom predicate for tool execution eligibility.
// The predicate receives the prompt's ChatOptions and the model's ChatResponse,
// and returns true if the returned tool calls should be executed.
Set<String> allowedTools = Set.of("get_weather", "search_database");
ToolExecutionEligibilityPredicate toolPredicate = (promptOptions, chatResponse) ->
    chatResponse.hasToolCalls()
        && chatResponse.getResult().getOutput().getToolCalls().stream()
            .allMatch(toolCall -> allowedTools.contains(toolCall.name()));

AzureOpenAiChatModel chatModel = AzureOpenAiChatModel.builder()
    .openAIClientBuilder(new OpenAIClientBuilder()
        .credential(new AzureKeyCredential(apiKey))
        .endpoint(endpoint))
    .defaultOptions(defaultOptions)
    .toolCallingManager(toolCallingManager)
    .toolExecutionEligibilityPredicate(toolPredicate)
    .build();

Using Constructor

class AzureOpenAiChatModel {
    AzureOpenAiChatModel(
        OpenAIClientBuilder openAIClientBuilder,
        AzureOpenAiChatOptions defaultOptions,
        ToolCallingManager toolCallingManager,
        ObservationRegistry observationRegistry
    );

    AzureOpenAiChatModel(
        OpenAIClientBuilder openAIClientBuilder,
        AzureOpenAiChatOptions defaultOptions,
        ToolCallingManager toolCallingManager,
        ObservationRegistry observationRegistry,
        ToolExecutionEligibilityPredicate toolExecutionEligibilityPredicate
    );
}

Constructor Parameters:

  • openAIClientBuilder: Required, throws NullPointerException if null; all other parameters are nullable
  • defaultOptions: Optional, uses model defaults if null
  • toolCallingManager: Optional, disables tool calling if null
  • observationRegistry: Optional, disables observability if null
  • toolExecutionEligibilityPredicate: Optional, all tools eligible if null

Core Methods

Synchronous Chat Completion

ChatResponse call(Prompt prompt);

Generate a chat response synchronously. Blocks until the complete response is received.

Parameters:

  • prompt: The prompt containing messages and optional options (non-null, throws NullPointerException if null)

Returns: ChatResponse containing the generated response, metadata, and usage information (never null)

Throws:

  • HttpResponseException: HTTP errors from Azure API (400, 401, 403, 429, 500)
  • ResourceNotFoundException: Deployment not found (404)
  • NonTransientAiException: Permanent failures (invalid parameters, auth errors)
  • TransientAiException: Temporary failures (rate limits, timeouts)
  • NullPointerException: If prompt is null

Example:

Prompt prompt = new Prompt("What is the capital of France?");
ChatResponse response = chatModel.call(prompt);
String answer = response.getResult().getOutput().getText();

With Options Override:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .temperature(0.9)
    .maxTokens(100)
    .build();

Prompt prompt = new Prompt("Write a creative story", options);
ChatResponse response = chatModel.call(prompt);

Multi-turn Conversation:

List<Message> messages = List.of(
    new UserMessage("What is machine learning?"),
    new AssistantMessage("Machine learning is..."),
    new UserMessage("Can you give an example?")
);
Prompt prompt = new Prompt(messages);
ChatResponse response = chatModel.call(prompt);

Error Handling:

try {
    ChatResponse response = chatModel.call(prompt);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 429) {
        // Rate limit - implement retry with backoff
        throw new RateLimitException("Rate limit exceeded", e);
    } else if (e.getResponse().getStatusCode() == 401) {
        // Auth error - check credentials
        throw new AuthenticationException("Invalid credentials", e);
    }
} catch (ResourceNotFoundException e) {
    // Deployment not found
    throw new ConfigurationException("Invalid deployment name", e);
}

Streaming Chat Completion

Flux<ChatResponse> stream(Prompt prompt);

Generate a chat response as a reactive stream for real-time token delivery. Returns immediately with a Flux that emits tokens as they become available.

Parameters:

  • prompt: The prompt containing messages and optional options (non-null)

Returns: Flux<ChatResponse> emitting partial responses as tokens arrive (never null; on failure the Flux terminates with an error signal rather than throwing)

Throws (via Flux error signal):

  • Same exceptions as call() method, signaled through Flux error channel

Stream Behavior:

  • Each emitted ChatResponse contains one or more new tokens
  • Token text is in response.getResult().getOutput().getText()
  • Final chunk may contain usage metadata if streamUsage(true) is set
  • Stream completes normally when full response is generated
  • Stream errors if API call fails

Example:

Prompt prompt = new Prompt("Explain quantum physics");
Flux<ChatResponse> responseStream = chatModel.stream(prompt);

responseStream.subscribe(
    chatResponse -> {
        String token = chatResponse.getResult().getOutput().getText();
        System.out.print(token);
    },
    error -> System.err.println("Error: " + error),
    () -> System.out.println("\nComplete")
);

Collecting Full Response:

String fullResponse = chatModel.stream(prompt)
    .map(response -> response.getResult().getOutput().getText())
    .collectList()
    .map(tokens -> String.join("", tokens))
    .block();

Streaming with Usage Tracking:

AzureOpenAiChatOptions optionsWithUsage = AzureOpenAiChatOptions.builder()
    .streamUsage(true)  // Enable usage reporting in stream
    .build();

Prompt prompt = new Prompt("Tell me a story", optionsWithUsage);
Flux<ChatResponse> stream = chatModel.stream(prompt);

stream.subscribe(
    chatResponse -> {
        String token = chatResponse.getResult().getOutput().getText();
        if (token != null) {
            System.out.print(token);
        }

        // Check for usage metadata in final chunk
        if (chatResponse.getMetadata() != null) {
            Usage usage = chatResponse.getMetadata().getUsage();
            if (usage != null) {
                System.out.println("\nTotal tokens used: " + usage.getTotalTokens());
            }
        }
    }
);

Error Handling in Streams:

chatModel.stream(prompt)
    .onErrorResume(throwable -> {
        if (throwable instanceof HttpResponseException) {
            HttpResponseException httpEx = (HttpResponseException) throwable;
            if (httpEx.getResponse().getStatusCode() == 429) {
                // Retry after delay
                return Mono.delay(Duration.ofSeconds(1))
                    .flatMapMany(tick -> chatModel.stream(prompt));
            }
        }
        return Flux.error(throwable);
    })
    .subscribe(/* ... */);

Configuration Methods

AzureOpenAiChatOptions getDefaultOptions();
void setObservationConvention(ChatModelObservationConvention observationConvention);

getDefaultOptions():

  • Returns the default options configured for this model instance
  • Returns null if no default options were provided
  • Changes to returned object do not affect the model

setObservationConvention():

  • Sets custom observation convention for metrics/tracing
  • Parameter can be null to use default convention
  • Thread-safe, can be called while model is in use

Metadata Methods

Static factory methods for creating metadata from Azure responses:

static ChatResponseMetadata from(ChatCompletions chatCompletions, PromptMetadata promptFilterMetadata);
static ChatResponseMetadata from(ChatCompletions chatCompletions, PromptMetadata promptFilterMetadata, Usage usage);
static ChatResponseMetadata from(ChatCompletions chatCompletions, PromptMetadata promptFilterMetadata, CompletionsUsage usage);
static ChatResponseMetadata from(ChatResponse chatResponse, Usage usage);

These methods convert Azure SDK response objects into Spring AI metadata objects.

Method Descriptions:

  • from(ChatCompletions, PromptMetadata): Create metadata from Azure chat completions with prompt filter metadata (returns non-null ChatResponseMetadata)
  • from(ChatCompletions, PromptMetadata, Usage): Create metadata with Spring AI Usage object (returns non-null ChatResponseMetadata)
  • from(ChatCompletions, PromptMetadata, CompletionsUsage): Create metadata with Azure CompletionsUsage object (returns non-null ChatResponseMetadata)
  • from(ChatResponse, Usage): Create metadata from existing ChatResponse (returns non-null ChatResponseMetadata)

Example - Accessing Response Metadata:

ChatResponse response = chatModel.call(prompt);
ChatResponseMetadata metadata = response.getMetadata();

// Access usage information
Usage usage = metadata.getUsage();
System.out.println("Prompt tokens: " + usage.getPromptTokens());
System.out.println("Completion tokens: " + usage.getCompletionTokens());
System.out.println("Total tokens: " + usage.getTotalTokens());

// Access finish reason
String finishReason = response.getResult().getMetadata().getFinishReason();
System.out.println("Finish reason: " + finishReason);

AzureOpenAiChatOptions

Configuration class for chat completion requests.

Construction

class AzureOpenAiChatOptions {
    static Builder builder();
    static AzureOpenAiChatOptions fromOptions(AzureOpenAiChatOptions fromOptions);
    AzureOpenAiChatOptions copy();
}

fromOptions(): Creates new instance copying all settings from another instance (parameter non-null, returns non-null)

copy(): Creates deep copy of this instance (returns non-null)

Builder

class Builder {
    Builder deploymentName(String deploymentName);
    Builder temperature(Double temperature);
    Builder topP(Double topP);
    Builder maxTokens(Integer maxTokens);
    Builder maxCompletionTokens(Integer maxCompletionTokens);
    Builder N(Integer n);
    Builder frequencyPenalty(Double frequencyPenalty);
    Builder presencePenalty(Double presencePenalty);
    Builder logitBias(Map<String, Integer> logitBias);
    Builder stop(List<String> stop);
    Builder user(String user);
    Builder seed(Long seed);
    Builder responseFormat(AzureOpenAiResponseFormat responseFormat);
    Builder logprobs(Boolean logprobs);
    Builder topLogprobs(Integer topLogprobs);
    Builder reasoningEffort(String reasoningEffort);
    Builder enhancements(AzureChatEnhancementConfiguration enhancements);
    Builder streamOptions(ChatCompletionStreamOptions streamOptions);
    Builder streamUsage(Boolean enableStreamUsage);
    Builder toolCallbacks(List<ToolCallback> toolCallbacks);
    Builder toolCallbacks(ToolCallback... toolCallbacks);
    Builder toolNames(Set<String> toolNames);
    Builder toolNames(String... toolNames);
    Builder internalToolExecutionEnabled(Boolean internalToolExecutionEnabled);
    Builder toolContext(Map<String, Object> toolContext);
    AzureOpenAiChatOptions build();
}

Builder Methods:

  • All builder methods return this for fluent chaining (never null)
  • All parameters are optional (can be null)
  • build() returns non-null AzureOpenAiChatOptions instance

Model Configuration

String getDeploymentName();
void setDeploymentName(String deploymentName);
String getModel();
void setModel(String model);

The deployment name specifies which Azure OpenAI deployment to use (e.g., "gpt-4o", "gpt-4", "gpt-35-turbo").

Constraints:

  • Cannot be null or empty string
  • Must match an existing deployment in your Azure OpenAI resource
  • Throws IllegalArgumentException if invalid

Common Deployments:

  • Standard models: "gpt-4o", "gpt-4", "gpt-4-32k", "gpt-35-turbo", "gpt-35-turbo-16k"
  • Reasoning models: "o1", "o3", "o4-mini"

Example:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .deploymentName("gpt-4o")
    .build();

Token Limits

Integer getMaxTokens();
void setMaxTokens(Integer maxTokens);
Integer getMaxCompletionTokens();
void setMaxCompletionTokens(Integer maxCompletionTokens);

Important: These two parameters are mutually exclusive and serve different purposes:

  • maxTokens: Maximum number of tokens to generate in the completion; prompt tokens plus maxTokens must fit within the model's context window. Used by non-reasoning models (GPT-4, GPT-3.5, etc.)
  • maxCompletionTokens: Maximum completion tokens, including any internal reasoning tokens. Required for reasoning models (o1, o3, o4-mini)

Constraints:

  • Both must be > 0 if set
  • Cannot exceed model's maximum context length
  • Cannot use both parameters together (throws IllegalArgumentException)
  • maxTokens for standard models, maxCompletionTokens for reasoning models

Model Context Limits:

  • gpt-4o: 128,000 tokens
  • gpt-4: 8,192 tokens
  • gpt-4-32k: 32,768 tokens
  • gpt-35-turbo: 4,096 tokens
  • gpt-35-turbo-16k: 16,384 tokens
  • o1, o3, o4-mini: varies (use maxCompletionTokens)
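Before sending a request it can be useful to sanity-check a requested token limit against the documented context windows above. A minimal sketch, using a hypothetical `clampMaxTokens` helper (not part of the library) and the limits listed in this section:

```java
import java.util.Map;

public class ContextLimits {
    // Documented context window sizes from the table above.
    static final Map<String, Integer> LIMITS = Map.of(
        "gpt-4o", 128_000,
        "gpt-4", 8_192,
        "gpt-4-32k", 32_768,
        "gpt-35-turbo", 4_096,
        "gpt-35-turbo-16k", 16_384
    );

    // Clamp a requested maxTokens to the deployment's context window;
    // unknown deployments are passed through unchanged.
    static int clampMaxTokens(String deployment, int requested) {
        return Math.min(requested, LIMITS.getOrDefault(deployment, requested));
    }

    public static void main(String[] args) {
        System.out.println(clampMaxTokens("gpt-4", 10_000)); // clamped to 8192
        System.out.println(clampMaxTokens("gpt-4o", 1_000)); // unchanged
    }
}
```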

For Standard Models (GPT-4, GPT-3.5, etc.):

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .deploymentName("gpt-4o")
    .maxTokens(1000)  // Upper bound on generated completion tokens
    .build();

For Reasoning Models (o1, o3, o4-mini):

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(2000)  // Completion tokens only
    .build();

Do not use both parameters together - use maxTokens for standard models and maxCompletionTokens for reasoning models.

Sampling Parameters

Double getTemperature();
void setTemperature(Double temperature);
Double getTopP();
void setTopP(Double topP);
Integer getN();
void setN(Integer n);
  • temperature: Controls randomness (0.0 = deterministic, 2.0 = very random). Default: 0.7
  • topP: Nucleus sampling threshold (0.0-1.0)
  • n: Number of completions to generate per prompt

Constraints:

  • temperature: Must be 0.0-2.0 (throws IllegalArgumentException if out of range)
  • topP: Must be 0.0-1.0 (throws IllegalArgumentException if out of range)
  • n: Must be >= 1 (throws IllegalArgumentException if < 1)

Temperature Guidelines:

  • 0.0-0.3: Factual, deterministic responses (best for Q&A, data extraction)
  • 0.4-0.7: Balanced creativity and consistency (general purpose)
  • 0.8-1.2: Creative responses (stories, brainstorming)
  • 1.3-2.0: Highly random, experimental (rare use cases)

Example:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .temperature(0.8)
    .topP(0.9)
    .N(1)
    .build();

Penalties and Bias

Double getFrequencyPenalty();
void setFrequencyPenalty(Double frequencyPenalty);
Double getPresencePenalty();
void setPresencePenalty(Double presencePenalty);
Map<String, Integer> getLogitBias();
void setLogitBias(Map<String, Integer> logitBias);
  • frequencyPenalty: Reduce repetition based on frequency (-2.0 to 2.0)
  • presencePenalty: Reduce repetition based on presence (-2.0 to 2.0)
  • logitBias: Modify likelihood of specific tokens

Constraints:

  • frequencyPenalty: Must be -2.0 to 2.0
  • presencePenalty: Must be -2.0 to 2.0
  • logitBias: Map keys are token IDs (as strings), values are -100 to 100
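The logit bias constraints above can be enforced before building options. A small sketch with a hypothetical `requireValidLogitBias` helper (not part of the library), checking that keys look like token IDs and values stay in [-100, 100]:

```java
import java.util.Map;

public class LogitBiasCheck {
    // Validate a logit bias map against the documented constraints:
    // keys are token IDs rendered as strings, values in [-100, 100].
    static Map<String, Integer> requireValidLogitBias(Map<String, Integer> bias) {
        for (Map.Entry<String, Integer> e : bias.entrySet()) {
            if (e.getKey().isEmpty() || !e.getKey().chars().allMatch(Character::isDigit)) {
                throw new IllegalArgumentException("key is not a token ID: " + e.getKey());
            }
            int v = e.getValue();
            if (v < -100 || v > 100) {
                throw new IllegalArgumentException("bias out of range [-100, 100]: " + v);
            }
        }
        return bias;
    }
}
```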

Penalty Guidelines:

  • Positive values (0.0 to 2.0): Discourage repetition
  • Negative values (-2.0 to 0.0): Encourage repetition (rarely used)
  • 0.0: No penalty (default)

Example:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .frequencyPenalty(0.5)
    .presencePenalty(0.5)
    .build();

Logit Bias Example:

// Discourage specific words (get token IDs from tokenizer)
Map<String, Integer> bias = Map.of(
    "1234", -100,  // Completely ban token 1234
    "5678", 50     // Boost likelihood of token 5678
);

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .logitBias(bias)
    .build();

Stop Sequences

List<String> getStop();
void setStop(List<String> stop);
List<String> getStopSequences();
void setStopSequences(List<String> stopSequences);

Define sequences where the model will stop generating tokens.

Constraints:

  • Maximum 4 stop sequences
  • Each sequence max 20 characters
  • Case-sensitive matching
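The stop-sequence limits above (at most 4 sequences, each at most 20 characters) can be checked up front. A minimal sketch with a hypothetical `requireValidStop` helper (not part of the library):

```java
import java.util.List;

public class StopSequenceCheck {
    // Enforce the documented limits before sending a request:
    // at most 4 stop sequences, each at most 20 characters.
    static List<String> requireValidStop(List<String> stop) {
        if (stop.size() > 4) {
            throw new IllegalArgumentException("at most 4 stop sequences, got " + stop.size());
        }
        for (String s : stop) {
            if (s.length() > 20) {
                throw new IllegalArgumentException("stop sequence longer than 20 chars: " + s);
            }
        }
        return stop;
    }
}
```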

Use Cases:

  • Stop at specific delimiters
  • Prevent generating unwanted content
  • Control output structure

Example:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .stop(List.of("\n", "END", "###"))
    .build();

Response Format

AzureOpenAiResponseFormat getResponseFormat();
void setResponseFormat(AzureOpenAiResponseFormat responseFormat);

Control the format of the model's output (text, JSON object, or JSON schema).

Format Types:

  • TEXT: Plain text response (default)
  • JSON_OBJECT: Valid JSON object (no schema enforcement)
  • JSON_SCHEMA: JSON conforming to specific schema (strict validation)

Constraints:

  • JSON_SCHEMA only supported on GPT-4o, GPT-4-turbo and later
  • Must include "JSON" or similar in prompt when using JSON formats
  • Schema must be valid JSON Schema Draft 7

Example - JSON Object:

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_OBJECT)
    .build();

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .responseFormat(format)
    .build();

// Prompt must request JSON output
Prompt prompt = new Prompt("List 3 colors in JSON format", options);

Example - JSON Schema:

Map<String, Object> schema = Map.of(
    "type", "object",
    "properties", Map.of(
        "name", Map.of("type", "string"),
        "age", Map.of("type", "number")
    ),
    "required", List.of("name")
);

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_SCHEMA)
    .jsonSchema(AzureOpenAiResponseFormat.JsonSchema.builder()
        .name("PersonSchema")
        .schema(schema)
        .strict(true)
        .build())
    .build();

Advanced Options

Long getSeed();
void setSeed(Long seed);
Boolean isLogprobs();
void setLogprobs(Boolean logprobs);
Integer getTopLogProbs();
void setTopLogProbs(Integer topLogProbs);
String getReasoningEffort();
void setReasoningEffort(String reasoningEffort);
AzureChatEnhancementConfiguration getEnhancements();
void setEnhancements(AzureChatEnhancementConfiguration enhancements);
ChatCompletionStreamOptions getStreamOptions();
void setStreamOptions(ChatCompletionStreamOptions streamOptions);
Boolean getStreamUsage();
void setStreamUsage(Boolean enableStreamUsage);
String getUser();
void setUser(String user);
  • seed: Integer seed for deterministic sampling
  • logprobs: Return log probabilities for tokens
  • topLogProbs: Number of top log probabilities to return (1-20)
  • reasoningEffort: Control reasoning effort for reasoning models (o1, o3, o4-mini). Valid values: "low", "medium", "high"
  • enhancements: Azure-specific enhancements (e.g., grounding, OCR)
  • streamOptions: Azure ChatCompletionStreamOptions for fine-grained streaming control
  • streamUsage: Include usage token counts in streaming responses (convenience alternative to streamOptions)
  • user: Identifier for the end-user (for abuse monitoring)

Constraints:

  • seed: Any long value, null for non-deterministic
  • logprobs: Boolean flag
  • topLogProbs: 1-20 if logprobs is true, null otherwise
  • reasoningEffort: Must be "low", "medium", or "high" (reasoning models only)
  • user: Max 256 characters

Note: streamOptions and streamUsage are related but serve different purposes:

  • streamUsage is a boolean convenience flag to enable usage reporting in streams
  • streamOptions provides more detailed control via Azure's ChatCompletionStreamOptions object

Example - Standard Options:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .seed(12345L)
    .logprobs(true)
    .topLogprobs(5)
    .user("user-123")
    .build();

Example - Reasoning Models:

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(5000)
    .reasoningEffort("high")  // Increases thinking time for better results
    .build();

Reasoning Effort Levels

The reasoningEffort parameter controls how much computational effort reasoning models invest in problem-solving:

  • "low": Faster responses with less reasoning depth
  • "medium": Balanced speed and reasoning quality (default)
  • "high": Maximum reasoning depth, slower but more thorough

Only applicable to reasoning models: o1, o3, o4-mini. This parameter is ignored by standard models like GPT-4 and GPT-3.5.
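The valid effort values and the "medium" default can be captured in a small guard before building options. A sketch with a hypothetical `effortOrDefault` helper (not part of the library):

```java
import java.util.Set;

public class ReasoningEffortCheck {
    private static final Set<String> VALID = Set.of("low", "medium", "high");

    // Normalize and validate a reasoningEffort value; null falls back to the
    // documented default of "medium".
    static String effortOrDefault(String effort) {
        if (effort == null) {
            return "medium";
        }
        String normalized = effort.toLowerCase();
        if (!VALID.contains(normalized)) {
            throw new IllegalArgumentException(
                "reasoningEffort must be low, medium, or high: " + effort);
        }
        return normalized;
    }
}
```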

Tool Calling

List<ToolCallback> getToolCallbacks();
void setToolCallbacks(List<ToolCallback> toolCallbacks);
Set<String> getToolNames();
void setToolNames(Set<String> toolNames);
void setFunctions(Set<String> functions);
Boolean getInternalToolExecutionEnabled();
void setInternalToolExecutionEnabled(Boolean internalToolExecutionEnabled);
Map<String, Object> getToolContext();
void setToolContext(Map<String, Object> toolContext);

Configure tool/function calling for the model.

Parameters:

  • toolCallbacks: List of tool implementations (can be empty or null)
  • toolNames: Set of tool names to make available (must match callback names)
  • internalToolExecutionEnabled: Whether to automatically execute tools (default: true)
  • toolContext: Additional context passed to tool callbacks (can be null or empty)

Constraints:

  • Tool names in toolNames must have corresponding callbacks in toolCallbacks
  • Tool callbacks must have unique names
  • Tool context keys are case-sensitive

Example:

// Define a tool callback. ToolCallback is an interface; FunctionToolCallback
// is Spring AI's function-based implementation.
record WeatherRequest(String location) {}

ToolCallback weatherTool = FunctionToolCallback
    .builder("get_weather", (WeatherRequest request) -> getWeatherData(request.location()))
    .description("Get current weather for a location")
    .inputType(WeatherRequest.class)
    .build();

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .toolCallbacks(weatherTool)
    .toolNames("get_weather")
    .internalToolExecutionEnabled(true)
    .toolContext(Map.of("api_key", weatherApiKey))
    .build();

AzureOpenAiResponseFormat

Defines the structure of the model's response.

Construction

class AzureOpenAiResponseFormat {
    static Builder builder();

    Type getType();
    void setType(Type type);
    JsonSchema getJsonSchema();
    void setJsonSchema(JsonSchema jsonSchema);
    String getSchema();
    void setSchema(String schema);
}

Builder

class Builder {
    Builder type(Type type);
    Builder jsonSchema(JsonSchema jsonSchema);
    Builder jsonSchema(String jsonSchema);
    AzureOpenAiResponseFormat build();
}

Type Enum

enum Type {
    TEXT,
    JSON_OBJECT,
    JSON_SCHEMA
}
  • TEXT: Plain text response (default)
  • JSON_OBJECT: Valid JSON object
  • JSON_SCHEMA: JSON conforming to a specific schema

JsonSchema

class JsonSchema {
    static Builder builder();

    String getName();
    Map<String, Object> getSchema();
    Boolean getStrict();

    class Builder {
        Builder name(String name);
        Builder schema(Map<String, Object> schema);
        Builder schema(String schema);
        Builder strict(Boolean strict);
        JsonSchema build();
    }
}

JsonSchema Properties:

  • name: Schema name; defaults to "custom_schema" when not specified. If provided, must not be null or empty
  • schema: The JSON schema definition as a Map or JSON string (non-null, must be valid JSON Schema Draft 7)
  • strict: Whether to enforce strict schema matching (default: true)

Strict Mode:

  • When true: Model output must exactly match schema (validation errors if mismatch)
  • When false: Model attempts to match schema but may deviate

Builder Methods:

  • name(String): Set the schema name (parameter non-null)
  • schema(Map<String, Object>): Set schema from Map (parameter non-null)
  • schema(String): Set schema from JSON string (convenience method, parameter non-null)
  • strict(Boolean): Enable or disable strict schema enforcement
  • build(): Build the JsonSchema instance (returns non-null)

Example - Using Map:

AzureOpenAiResponseFormat.JsonSchema schema =
    AzureOpenAiResponseFormat.JsonSchema.builder()
        .name("ProductInfo")
        .schema(Map.of(
            "type", "object",
            "properties", Map.of(
                "product_name", Map.of("type", "string"),
                "price", Map.of("type", "number"),
                "in_stock", Map.of("type", "boolean")
            ),
            "required", List.of("product_name", "price")
        ))
        .strict(true)
        .build();

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_SCHEMA)
    .jsonSchema(schema)
    .build();

Example - Using JSON String:

String schemaJson = """
{
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["product_name", "price"]
}
""";

AzureOpenAiResponseFormat.JsonSchema schema =
    AzureOpenAiResponseFormat.JsonSchema.builder()
        .name("ProductInfo")
        .schema(schemaJson)  // Convenience method accepts JSON string
        .strict(true)
        .build();

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_SCHEMA)
    .jsonSchema(schema)
    .build();

Example - Using setSchema Convenience Method:

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_SCHEMA)
    .build();

// Set schema as JSON string directly
String schemaJson = "{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"}}}";
format.setSchema(schemaJson);

// The schema will be automatically parsed and set

Usage Examples

Basic Chat

AzureOpenAiChatModel chatModel = AzureOpenAiChatModel.builder()
    .openAIClientBuilder(new OpenAIClientBuilder()
        .credential(new AzureKeyCredential(apiKey))
        .endpoint(endpoint))
    .defaultOptions(AzureOpenAiChatOptions.builder()
        .deploymentName("gpt-4o")
        .build())
    .build();

ChatResponse response = chatModel.call(new Prompt("Hello, how are you?"));
System.out.println(response.getResult().getOutput().getText());

Streaming Response

Flux<ChatResponse> stream = chatModel.stream(new Prompt("Tell me a story"));

stream.subscribe(chunk -> {
    String content = chunk.getResult().getOutput().getText();
    if (content != null) {
        System.out.print(content);
    }
});

Structured Output with JSON Schema

Map<String, Object> schema = Map.of(
    "type", "object",
    "properties", Map.of(
        "city", Map.of("type", "string"),
        "country", Map.of("type", "string"),
        "population", Map.of("type", "integer")
    ),
    "required", List.of("city", "country")
);

AzureOpenAiResponseFormat format = AzureOpenAiResponseFormat.builder()
    .type(AzureOpenAiResponseFormat.Type.JSON_SCHEMA)
    .jsonSchema(AzureOpenAiResponseFormat.JsonSchema.builder()
        .name("CityInfo")
        .schema(schema)
        .strict(true)
        .build())
    .build();

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .responseFormat(format)
    .build();

Prompt prompt = new Prompt("Tell me about Paris", options);
ChatResponse response = chatModel.call(prompt);
String jsonResponse = response.getResult().getOutput().getText();
// jsonResponse will be valid JSON matching the schema

Multi-turn Conversation

List<Message> conversation = new ArrayList<>();
conversation.add(new UserMessage("What is the weather in Seattle?"));

// Get response
ChatResponse response1 = chatModel.call(new Prompt(conversation));
conversation.add(new AssistantMessage(response1.getResult().getOutput().getText()));

// Continue conversation
conversation.add(new UserMessage("What about tomorrow?"));
ChatResponse response2 = chatModel.call(new Prompt(conversation));

Temperature Control

// Creative writing (high temperature)
AzureOpenAiChatOptions creativeOptions = AzureOpenAiChatOptions.builder()
    .temperature(1.5)
    .build();

Prompt creativePrompt = new Prompt("Write a creative story", creativeOptions);
ChatResponse creativeResponse = chatModel.call(creativePrompt);

// Factual response (low temperature)
AzureOpenAiChatOptions factualOptions = AzureOpenAiChatOptions.builder()
    .temperature(0.1)
    .build();

Prompt factualPrompt = new Prompt("What is 2+2?", factualOptions);
ChatResponse factualResponse = chatModel.call(factualPrompt);

Token Limit Control

AzureOpenAiChatOptions options = AzureOpenAiChatOptions.builder()
    .maxTokens(100)
    .build();

Prompt prompt = new Prompt("Explain quantum physics", options);
ChatResponse response = chatModel.call(prompt);
// Completion will be limited to at most 100 generated tokens

Using Reasoning Models

Reasoning models (o1, o3, o4-mini) are specialized models that spend more time "thinking" before responding, making them ideal for complex problem-solving, mathematical reasoning, and coding tasks.

Key Differences from Standard Models:

  • Use maxCompletionTokens instead of maxTokens
  • Support reasoningEffort parameter to control thinking depth
  • Typically have longer latency due to extended reasoning process
  • Excel at complex multi-step problems

Basic Reasoning Model Usage:

AzureOpenAiChatOptions reasoningOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(2000)
    .build();

Prompt prompt = new Prompt(
    "Solve this complex math problem: Find the derivative of f(x) = x^3 * ln(x)",
    reasoningOptions
);
ChatResponse response = chatModel.call(prompt);

With Reasoning Effort Control:

// High effort for complex problems
AzureOpenAiChatOptions highEffortOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(5000)
    .reasoningEffort("high")  // More thorough reasoning
    .build();

Prompt complexPrompt = new Prompt(
    "Design a distributed system architecture for a global e-commerce platform with 100M+ users",
    highEffortOptions
);
ChatResponse detailedResponse = chatModel.call(complexPrompt);

// Low effort for simpler questions
AzureOpenAiChatOptions lowEffortOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(1000)
    .reasoningEffort("low")  // Faster responses
    .build();

Prompt simplePrompt = new Prompt("What is 15 * 24?", lowEffortOptions);
ChatResponse quickResponse = chatModel.call(simplePrompt);

Code Generation with Reasoning Models:

AzureOpenAiChatOptions codeOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(3000)
    .reasoningEffort("high")
    .build();

Prompt codePrompt = new Prompt(
    "Write a Java implementation of a thread-safe LRU cache with O(1) operations",
    codeOptions
);
ChatResponse codeResponse = chatModel.call(codePrompt);

Mathematical Problem Solving:

AzureOpenAiChatOptions mathOptions = AzureOpenAiChatOptions.builder()
    .deploymentName("o1")
    .maxCompletionTokens(2000)
    .reasoningEffort("medium")
    .build();

Prompt mathPrompt = new Prompt(
    "Prove that the square root of 2 is irrational using contradiction",
    mathOptions
);
ChatResponse proofResponse = chatModel.call(mathPrompt);

Error Handling

Common Exceptions

// Azure SDK exceptions
com.azure.core.exception.HttpResponseException  // HTTP errors (400, 401, 403, 429, 500)
com.azure.core.exception.ResourceNotFoundException  // Deployment not found (404)

// Spring AI exceptions  
org.springframework.ai.retry.NonTransientAiException  // Permanent failures
org.springframework.ai.retry.TransientAiException  // Temporary failures (retry-able)

Exception Scenarios

HttpResponseException Status Codes:

  • 400 Bad Request: Invalid parameters, malformed request, incompatible options
  • 401 Unauthorized: Invalid API key, expired credentials
  • 403 Forbidden: Insufficient permissions, quota exceeded, content filter triggered
  • 404 Not Found: Handled separately by ResourceNotFoundException
  • 429 Too Many Requests: Rate limit exceeded, need to retry with backoff
  • 500 Internal Server Error: Azure service error, transient failure
  • 503 Service Unavailable: Service temporarily down

Common Error Scenarios (the custom exception types thrown below, such as ConfigurationException and TokenLimitException, are application-defined, not Spring AI classes):

  1. Rate Limiting (429):
try {
    response = chatModel.call(prompt);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 429) {
        // Wait and retry with exponential backoff
        int retryAfterSeconds = extractRetryAfter(e.getResponse());
        Thread.sleep(retryAfterSeconds * 1000);
        response = chatModel.call(prompt);
    }
}
  2. Invalid Deployment (404):
try {
    response = chatModel.call(prompt);
} catch (ResourceNotFoundException e) {
    // Deployment name is incorrect or doesn't exist
    throw new ConfigurationException(
        "Deployment '" + options.getDeploymentName() + "' not found. " +
        "Check Azure portal for valid deployment names.", e
    );
}
  3. Authentication Failure (401):
try {
    response = chatModel.call(prompt);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 401) {
        // API key is invalid or expired
        throw new AuthenticationException(
            "Invalid Azure OpenAI credentials. Check API key and endpoint.", e
        );
    }
}
  4. Content Filter (403):
try {
    response = chatModel.call(prompt);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 403) {
        // Content filtered or quota exceeded
        String errorBody = e.getResponse().getBodyAsString().block();
        if (errorBody.contains("content_filter")) {
            throw new ContentFilterException(
                "Request blocked by content filter. Review prompt content.", e
            );
        } else {
            throw new QuotaException("Quota exceeded. Check Azure usage limits.", e);
        }
    }
}
  5. Token Limit Exceeded:
try {
    response = chatModel.call(prompt);
} catch (HttpResponseException e) {
    if (e.getResponse().getStatusCode() == 400) {
        String errorMessage = e.getMessage();
        if (errorMessage.contains("maximum context length")) {
            // Prompt + max_tokens exceeds model limit
            throw new TokenLimitException(
                "Total tokens exceed model limit. Reduce prompt length or maxTokens.", e
            );
        }
    }
}

Retry Logic

Exponential Backoff with Jitter:

public ChatResponse callWithRetry(AzureOpenAiChatModel model, Prompt prompt)
        throws InterruptedException {
    int maxRetries = 3;
    int baseDelayMs = 1000;
    
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return model.call(prompt);
        } catch (HttpResponseException e) {
            int statusCode = e.getResponse().getStatusCode();
            
            // Only retry on transient errors
            if (statusCode == 429 || statusCode == 500 || statusCode == 503) {
                if (attempt < maxRetries - 1) {
                    // Exponential backoff with jitter
                    int delayMs = baseDelayMs * (1 << attempt);
                    int jitter = ThreadLocalRandom.current().nextInt(0, delayMs / 2);
                    Thread.sleep(delayMs + jitter);
                    continue;
                }
            }
            
            // Non-retryable error or max retries exceeded
            throw e;
        }
    }
    
    throw new RuntimeException("Max retries exceeded");
}
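The backoff schedule inside the loop above can be factored into a pure function, which makes it unit-testable without sleeping. A minimal sketch (the `Backoff` class and the injected jitter source are illustrative, not part of Spring AI):

```java
import java.util.function.IntUnaryOperator;

// Sketch: deterministic backoff-delay calculation extracted from the retry loop.
// The jitter source is injected so tests can pin it to a fixed value.
public class Backoff {

    /**
     * Delay before retry number `attempt` (0-based):
     * baseDelayMs * 2^attempt, plus a jitter drawn from [0, delay/2).
     * `jitterBelow` receives the exclusive upper bound and returns the jitter.
     */
    public static int delayMs(int attempt, int baseDelayMs, IntUnaryOperator jitterBelow) {
        int delay = baseDelayMs * (1 << attempt);          // exponential growth
        return delay + jitterBelow.applyAsInt(delay / 2);  // add jitter in [0, delay/2)
    }
}
```

In production, pass `bound -> ThreadLocalRandom.current().nextInt(0, Math.max(1, bound))`; in tests, pass a constant.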

Validation Rules

Parameter Constraints Summary

Deployment Name:

  • Required: Yes (throws NullPointerException if null)
  • Format: Non-empty string
  • Must match existing Azure deployment

Temperature:

  • Range: 0.0 to 2.0 (throws IllegalArgumentException if out of range)
  • Default: 0.7
  • Type: Double (nullable)

Top P:

  • Range: 0.0 to 1.0 (throws IllegalArgumentException if out of range)
  • Default: 1.0
  • Type: Double (nullable)

Max Tokens:

  • Range: > 0, <= model's max context length
  • Mutually exclusive with maxCompletionTokens
  • Type: Integer (nullable)

Max Completion Tokens:

  • Range: > 0
  • Required for reasoning models (o1, o3, o4-mini)
  • Mutually exclusive with maxTokens
  • Type: Integer (nullable)

N (Number of Completions):

  • Range: >= 1
  • Default: 1
  • Type: Integer (nullable)

Frequency Penalty:

  • Range: -2.0 to 2.0
  • Default: 0.0
  • Type: Double (nullable)

Presence Penalty:

  • Range: -2.0 to 2.0
  • Default: 0.0
  • Type: Double (nullable)

Logit Bias:

  • Keys: Token IDs as strings
  • Values: -100 to 100
  • Type: Map<String, Integer> (nullable)

Stop Sequences:

  • Maximum: 4 sequences
  • Max length per sequence: 20 characters
  • Type: List<String> (nullable)

Seed:

  • Range: Any long value
  • For deterministic sampling
  • Type: Long (nullable)

Top Log Probs:

  • Range: 1 to 20
  • Requires logprobs = true
  • Type: Integer (nullable)

Reasoning Effort:

  • Values: "low", "medium", "high"
  • Only for reasoning models (o1, o3, o4-mini)
  • Type: String (nullable)

User Identifier:

  • Max length: 256 characters
  • For abuse monitoring
  • Type: String (nullable)
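The constraints above can also be checked client-side before a request is sent, turning a 400 round-trip into an immediate local failure. A minimal sketch covering a few of the rules (the helper class and its name are illustrative, not part of Spring AI):

```java
import java.util.List;

// Sketch: client-side pre-flight checks mirroring the documented constraints.
public class ChatOptionsValidator {

    public static void validate(Double temperature, Double topP, List<String> stop) {
        if (temperature != null && (temperature < 0.0 || temperature > 2.0)) {
            throw new IllegalArgumentException(
                "temperature must be in [0.0, 2.0], got " + temperature);
        }
        if (topP != null && (topP < 0.0 || topP > 1.0)) {
            throw new IllegalArgumentException(
                "topP must be in [0.0, 1.0], got " + topP);
        }
        if (stop != null && stop.size() > 4) {
            throw new IllegalArgumentException(
                "at most 4 stop sequences allowed, got " + stop.size());
        }
    }
}
```

Extending the same pattern to the remaining parameters (penalties, logit bias, top log probs) is mechanical.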

Troubleshooting

Issue: "Invalid parameter" errors

Symptom: 400 Bad Request with parameter validation error

Common Causes:

  1. Using maxTokens with reasoning models (use maxCompletionTokens instead)
  2. Using both maxTokens and maxCompletionTokens together
  3. Temperature out of range (0.0-2.0)
  4. TopP out of range (0.0-1.0)
  5. More than 4 stop sequences

Solution: Validate parameters match the constraints documented above.

Issue: Streaming stops unexpectedly

Symptom: Flux completes before full response generated

Common Causes:

  1. Stop sequence encountered
  2. Token limit reached
  3. Content filter triggered
  4. Network timeout

Solution:

chatModel.stream(prompt)
    .doOnComplete(() -> System.out.println("Stream completed"))
    .doOnError(error -> System.err.println("Stream error: " + error))
    .subscribe(/* ... */);

Issue: JSON schema validation failures

Symptom: Model doesn't return valid JSON matching schema

Solutions:

  1. Ensure the model supports strict JSON schema output (gpt-4o and newer; older GPT-4 Turbo deployments support JSON mode but not strict schemas)
  2. Set strict(true) in JsonSchema builder
  3. Include "respond in JSON format" in prompt
  4. Validate schema is valid JSON Schema Draft 7
  5. Check for complex nested structures (may need simplification)

Issue: Slow response times

Symptom: Requests take longer than expected

Causes & Solutions:

  1. Model type: Reasoning models (o1, o3) are intentionally slower
    • Use standard models (gpt-4o, gpt-4) for faster responses
  2. Reasoning effort: Set to "low" for faster responses
    • reasoningEffort("low")
  3. Token limits: Higher maxTokens = longer generation time
    • Reduce maxTokens for faster responses
  4. Network latency: Distance from Azure region
    • Choose closest Azure region
  5. Concurrent requests: Too many simultaneous requests
    • Implement request queuing/throttling
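For cause 5, a simple client-side throttle caps in-flight requests without an external queue. A sketch using a plain `java.util.concurrent.Semaphore` (the wrapper class is illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: limit concurrent calls to the model with a counting semaphore.
public class ThrottledCaller {

    private final Semaphore permits;

    public ThrottledCaller(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Blocks until a permit is free, runs the request, then releases the permit. */
    public <T> T call(Supplier<T> request) {
        permits.acquireUninterruptibly();
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```

Wrap the model call as `throttle.call(() -> chatModel.call(prompt))`; at most `maxConcurrent` requests reach the service at once.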

Performance Optimization

Model Instance Reuse

Recommended:

// Create once at application startup
@Bean
public AzureOpenAiChatModel chatModel() {
    return AzureOpenAiChatModel.builder()
        .openAIClientBuilder(clientBuilder)
        .defaultOptions(defaultOptions)
        .build();
}

// Inject and reuse
@Autowired
private AzureOpenAiChatModel chatModel;

public void handleRequest() {
    chatModel.call(prompt);  // Reuse same instance
}

Avoid:

// Don't create new instance per request
public void handleRequest() {
    AzureOpenAiChatModel model = AzureOpenAiChatModel.builder()...build();
    model.call(prompt);  // Inefficient
}

Streaming for Better UX

Use streaming for:

  • Long-form content generation
  • Interactive chat interfaces
  • Real-time user feedback
  • Responses > 500 tokens

Flux<ChatResponse> stream = chatModel.stream(prompt);
stream.subscribe(
    chunk -> updateUI(chunk.getResult().getOutput().getText()),
    error -> handleError(error),
    () -> markComplete()
);

Parallel Requests

Models are thread-safe and can handle concurrent requests:

ExecutorService executor = Executors.newFixedThreadPool(10);
List<CompletableFuture<ChatResponse>> futures = new ArrayList<>();

for (Prompt prompt : prompts) {
    CompletableFuture<ChatResponse> future = CompletableFuture.supplyAsync(
        () -> chatModel.call(prompt),
        executor
    );
    futures.add(future);
}

// Wait for all to complete
List<ChatResponse> responses = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());

Caching Strategy

Implement caching for repeated identical requests:

private final Map<String, ChatResponse> responseCache = new ConcurrentHashMap<>();

public ChatResponse getCachedResponse(String promptText) {
    return responseCache.computeIfAbsent(promptText, key -> {
        Prompt prompt = new Prompt(key);
        return chatModel.call(prompt);
    });
}
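Note that the `ConcurrentHashMap` above grows without bound. If memory matters, a size-bounded LRU map built on the JDK's access-ordered `LinkedHashMap` can back the cache instead; a sketch (synchronize externally, e.g. via `Collections.synchronizedMap`, if accessed from multiple threads):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: LRU-evicting cache backed by an access-ordered LinkedHashMap.
public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true -> iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict least-recently-used entry past capacity
    }
}
```

With `maxEntries` set to, say, 1000, the cache holds the 1000 most recently used prompt/response pairs and silently drops the rest.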

Default Values

  • Deployment Name: "gpt-4o"
  • Temperature: 0.7
  • Response Format: TEXT
  • Reasoning Effort: "medium" (for reasoning models only)
  • N: 1
  • Frequency Penalty: 0.0
  • Presence Penalty: 0.0
  • Top P: 1.0
  • Logprobs: false
  • Stream Usage: false
  • Internal Tool Execution: true