# Sentence Boundary Detection

Identify sentence boundaries in text, with automatic language detection and script-specific processing. The service determines where each sentence in the input begins and ends and returns the length of every sentence, which callers can use to segment the text.

## Capabilities

### Find Sentence Boundaries

Analyzes input text and returns the length of each detected sentence. When no language is specified, the response also includes the detected language.

```python { .api }
def find_sentence_boundaries(
    body: Union[List[str], List[InputTextItem], IO[bytes]],
    *,
    client_trace_id: Optional[str] = None,
    language: Optional[str] = None,
    script: Optional[str] = None,
    **kwargs: Any
) -> List[BreakSentenceItem]
```

**Parameters:**
- `body`: Text to analyze, given as a list of strings, a list of `InputTextItem` objects, or a binary stream
- `client_trace_id`: Client-generated GUID that identifies the request for tracking
- `language`: Language code of the input text (auto-detected if omitted)
- `script`: Script identifier of the input text (the language's default script is assumed if omitted)

**Returns:** A list of `BreakSentenceItem` results, one per input text

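A client trace ID can be any freshly generated GUID. The sketch below uses Python's standard `uuid` module. `new_trace_id` and `find_boundaries_traced` are hypothetical helper names, not part of the SDK, and the wrapper assumes a `client` object that exposes `find_sentence_boundaries` with the signature shown above.

```python
import uuid
from typing import Any, List


def new_trace_id() -> str:
    """Return a random GUID string suitable for client_trace_id."""
    return str(uuid.uuid4())


def find_boundaries_traced(client: Any, texts: List[str]) -> Any:
    """Call find_sentence_boundaries with a fresh trace ID (hypothetical wrapper)."""
    trace_id = new_trace_id()
    # Log the ID so the request can be correlated with service-side logs later.
    print(f"client_trace_id: {trace_id}")
    return client.find_sentence_boundaries(body=texts, client_trace_id=trace_id)
```

Generating the ID on the caller's side keeps it available even when the request fails before a response is returned.
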
### Usage Examples

```python
from azure.ai.translation.text import TextTranslationClient
from azure.core.credentials import AzureKeyCredential

client = TextTranslationClient(
    credential=AzureKeyCredential("your-api-key"),
    region="your-region"
)

# Basic sentence boundary detection with language auto-detection
response = client.find_sentence_boundaries(
    body=["The answer lies in machine translation. This is a test. How are you?"]
)

result = response[0]
print(f"Detected language: {result.detected_language.language}")
print(f"Detection confidence: {result.detected_language.score}")
print(f"Sentence lengths: {result.sent_len}")
# If the text is split after each sentence, the lengths are [40, 16, 12]
# (characters per sentence, counting the space that follows it)

# Multi-text analysis: one result per input text
multi_response = client.find_sentence_boundaries(
    body=[
        "First text with multiple sentences. This is sentence two.",
        "Second text. Also has multiple parts. Three sentences total."
    ]
)

for i, result in enumerate(multi_response):
    print(f"\nText {i+1}:")
    print(f"  Language: {result.detected_language.language}")
    print(f"  Sentence lengths: {result.sent_len}")

# Specify language and script explicitly (no auto-detection is performed)
explicit_response = client.find_sentence_boundaries(
    body=["¡Hola mundo! ¿Cómo estás hoy? Me alegro de verte."],
    language="es",
    script="Latn"
)

# Complex punctuation handling (abbreviations, quoted exclamations)
complex_response = client.find_sentence_boundaries(
    body=["Dr. Smith went to the U.S.A. yesterday. He said 'Hello!' to everyone."]
)

# Mixed language content (relies on auto-detection)
mixed_response = client.find_sentence_boundaries(
    body=["English sentence. Sentence en français. Back to English."]
)
```

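The lengths in `sent_len` can be turned back into sentence strings by slicing the input. The following is a minimal sketch, not part of the SDK: `split_by_lengths` is a hypothetical helper, and it assumes the lengths count Python string characters (code points), which may not hold for text containing characters outside the Basic Multilingual Plane.

```python
from itertools import accumulate
from typing import List


def split_by_lengths(text: str, lengths: List[int]) -> List[str]:
    """Slice text into consecutive pieces whose sizes are given by lengths."""
    ends = list(accumulate(lengths))  # running end offset of each sentence
    starts = [0] + ends[:-1]          # each sentence starts where the previous one ended
    return [text[s:e] for s, e in zip(starts, ends)]


text = "The answer lies in machine translation. This is a test. How are you?"
# Sample lengths, as they would be if the text is split after each sentence.
for sentence in split_by_lengths(text, [40, 16, 12]):
    print(repr(sentence.strip()))
# 'The answer lies in machine translation.'
# 'This is a test.'
# 'How are you?'
```

The slices keep any trailing whitespace that the lengths cover, so the example strips each piece before printing.
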
## Input Types

### Text Input Models

```python { .api }
class InputTextItem:
    text: str  # Text content to analyze for sentence boundaries
```

## Response Types

### Sentence Boundary Results

```python { .api }
class BreakSentenceItem:
    sent_len: List[int]  # Character lengths of each detected sentence
    detected_language: Optional[DetectedLanguage]  # Auto-detected language info
```

### Language Detection Information

```python { .api }
class DetectedLanguage:
    language: str  # Detected language code (ISO 639-1/639-3)
    score: float  # Detection confidence score (0.0 to 1.0)
```

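Because `detected_language` is optional, callers may want to guard against a missing or low-confidence detection before relying on it. A minimal sketch follows. The `DetectedLanguage` dataclass here is a local stand-in that mirrors the fields above so the example runs without the SDK, and both `confident_language` and the 0.8 threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetectedLanguage:  # local stand-in for the SDK model, for illustration only
    language: str
    score: float


def confident_language(
    detected: Optional[DetectedLanguage], threshold: float = 0.8
) -> Optional[str]:
    """Return the language code only if a detection exists and meets the threshold."""
    if detected is None or detected.score < threshold:
        return None
    return detected.language


print(confident_language(DetectedLanguage("en", 0.98)))  # en
print(confident_language(DetectedLanguage("fr", 0.41)))  # None
print(confident_language(None))                          # None
```
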
## Sentence Segmentation Rules

The service applies language-specific and script-specific rules when detecting sentence boundaries:

### General Rules
- Periods, exclamation marks, and question marks typically end sentences
- Abbreviations (Dr., Mr., U.S.A.) are handled contextually, so their periods do not necessarily end a sentence
- Quotation marks and parentheses are taken into account when placing boundaries
- Multiple consecutive punctuation marks are processed appropriately

### Language-Specific Processing
- **English**: Handles abbreviations, contractions, and decimal numbers
- **Spanish**: Processes inverted punctuation marks (¡¿)
- **Chinese/Japanese**: Recognizes full-width punctuation (。!?)
- **Arabic**: Handles right-to-left text directionality
- **German**: Manages compound words and capitalization rules

### Script Considerations
- **Latin scripts**: Standard punctuation processing
- **CJK scripts**: Full-width punctuation mark recognition
- **Arabic script**: Right-to-left text flow handling
- **Devanagari**: Script-specific sentence ending markers

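To see why contextual handling of abbreviations matters, compare with a naive splitter that breaks after every sentence-ending mark. This is an illustration only, not the service's algorithm, and `naive_split` is a hypothetical function.

```python
import re
from typing import List

# A chunk is a run of non-terminator characters followed by any terminators.
# Both ASCII and full-width terminators are included.
_CHUNK = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def naive_split(text: str) -> List[str]:
    """Split after every terminator, with no knowledge of abbreviations."""
    chunks = (chunk.strip() for chunk in _CHUNK.findall(text))
    return [chunk for chunk in chunks if chunk]


print(naive_split("Dr. Smith went to the U.S.A. yesterday."))
# ['Dr.', 'Smith went to the U.', 'S.', 'A.', 'yesterday.']

print(naive_split("今日は晴れです。元気ですか?"))
# ['今日は晴れです。', '元気ですか?']
```

The first input is a single sentence, yet the naive approach produces five fragments, which is the kind of error that abbreviation-aware rules are meant to avoid.
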
## Integration with Translation

Sentence lengths are also available from translation requests. Passing `include_sentence_length=True` to `translate` returns sentence boundaries for both the source text and the translation:

```python
# Translation with automatic sentence boundary detection
translation_response = client.translate(
    body=["First sentence. Second sentence. Third sentence."],
    to_language=["es"],
    include_sentence_length=True
)

translation = translation_response[0].translations[0]
if translation.sent_len:
    print(f"Source sentence lengths: {translation.sent_len.src_sent_len}")
    print(f"Target sentence lengths: {translation.sent_len.trans_sent_len}")
```

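With both length lists available, source and translated sentences can be lined up side by side. The sketch below is not part of the SDK: `pair_sentences` is a hypothetical helper, the Spanish text and all lengths are hand-written sample values rather than service output, and it assumes both lists describe the same number of sentences.

```python
from itertools import accumulate
from typing import List, Tuple


def _slice(text: str, lengths: List[int]) -> List[str]:
    """Cut text into consecutive pieces of the given lengths, trimming whitespace."""
    ends = list(accumulate(lengths))
    return [text[s:e].strip() for s, e in zip([0] + ends[:-1], ends)]


def pair_sentences(
    src: str, tgt: str, src_lens: List[int], tgt_lens: List[int]
) -> List[Tuple[str, str]]:
    """Pair each source sentence with the translated sentence at the same index."""
    if len(src_lens) != len(tgt_lens):
        raise ValueError("source and target sentence counts differ")
    return list(zip(_slice(src, src_lens), _slice(tgt, tgt_lens)))


source = "First sentence. Second sentence. Third sentence."
target = "Primera frase. Segunda frase. Tercera frase."  # sample translation
for src_sentence, tgt_sentence in pair_sentences(
    source, target, [16, 17, 15], [15, 15, 14]
):
    print(f"{src_sentence} -> {tgt_sentence}")
```

Raising on mismatched counts is a deliberate choice here, since a translation may merge or split sentences and index-based pairing would then be misleading.
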
## Error Handling

```python
from azure.core.exceptions import HttpResponseError

try:
    response = client.find_sentence_boundaries(
        body=["Text to analyze"],
        language="invalid-code"  # Invalid language code
    )
except HttpResponseError as error:
    if error.error:
        print(f"Error Code: {error.error.code}")
        print(f"Message: {error.error.message}")
```