# Sentence Boundary Detection

Identify sentence boundaries in text with automatic language detection and script-specific processing. This service determines where sentences begin and end in input text, providing length information for proper text segmentation and analysis.

## Capabilities

### Find Sentence Boundaries

Analyzes input text to identify sentence boundaries and returns length information for each detected sentence with optional language detection.

```python { .api }
def find_sentence_boundaries(
    body: Union[List[str], List[InputTextItem], IO[bytes]],
    *,
    client_trace_id: Optional[str] = None,
    language: Optional[str] = None,
    script: Optional[str] = None,
    **kwargs: Any
) -> List[BreakSentenceItem]
```

**Parameters:**

- `body`: Text to analyze (strings, InputTextItem objects, or binary data)
- `client_trace_id`: Client-generated GUID for request tracking
- `language`: Language code for the text (auto-detected if omitted)
- `script`: Script identifier for the text (default script assumed if omitted)

**Returns:** List of sentence boundary analysis results
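
The `client_trace_id` parameter expects a client-generated GUID. The following is a minimal sketch of producing one with the standard library, assuming any UUID4 string is acceptable as a trace ID; `new_trace_id` is a hypothetical helper name, not part of the SDK.

```python
import uuid


def new_trace_id() -> str:
    """Return a fresh GUID string to use as a client trace ID (hypothetical helper)."""
    return str(uuid.uuid4())


trace_id = new_trace_id()
print(trace_id)

# The value would then be passed along with the request, for example:
# client.find_sentence_boundaries(body=["Some text."], client_trace_id=trace_id)
```

Generating a new ID per request makes it easier to correlate a single call with service-side logs.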

### Usage Examples

```python
from azure.ai.translation.text import TextTranslationClient
from azure.core.credentials import AzureKeyCredential

client = TextTranslationClient(
    credential=AzureKeyCredential("your-api-key"),
    region="your-region"
)

# Basic sentence boundary detection with automatic language detection
response = client.find_sentence_boundaries(
    body=["The answer lies in machine translation. This is a test. How are you?"]
)

result = response[0]
# detected_language is optional, so guard before reading it
if result.detected_language:
    print(f"Detected language: {result.detected_language.language}")
    print(f"Detection confidence: {result.detected_language.score}")
print(f"Sentence lengths: {result.sent_len}")
# Example output: [40, 16, 12] (character counts; each includes the trailing space)

# Multi-text analysis: one result per input string, in order
multi_response = client.find_sentence_boundaries(
    body=[
        "First text with multiple sentences. This is sentence two.",
        "Second text. Also has multiple parts. Three sentences total."
    ]
)

for i, result in enumerate(multi_response):
    print(f"\nText {i+1}:")
    if result.detected_language:
        print(f"  Language: {result.detected_language.language}")
    print(f"  Sentence lengths: {result.sent_len}")

# Specify language and script explicitly (skips auto-detection)
explicit_response = client.find_sentence_boundaries(
    body=["¡Hola mundo! ¿Cómo estás hoy? Me alegro de verte."],
    language="es",
    script="Latn"
)

# Complex punctuation handling
complex_response = client.find_sentence_boundaries(
    body=["Dr. Smith went to the U.S.A. yesterday. He said 'Hello!' to everyone."]
)

# Mixed language content (relies on auto-detection)
mixed_response = client.find_sentence_boundaries(
    body=["English sentence. Sentence en français. Back to English."]
)
```
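
The lengths in `sent_len` can be turned back into sentence strings by slicing the original text. The sketch below assumes the lengths are consecutive, include any trailing whitespace, and count Python string characters (this may not hold for text containing characters outside the Basic Multilingual Plane). `split_by_lengths` is a hypothetical helper, and the lengths are illustrative values rather than a live service response.

```python
from itertools import accumulate
from typing import List


def split_by_lengths(text: str, lengths: List[int]) -> List[str]:
    """Slice `text` into consecutive pieces whose sizes are given by `lengths`."""
    ends = list(accumulate(lengths))   # running end offsets
    starts = [0] + ends[:-1]           # each piece starts where the previous ended
    return [text[start:end] for start, end in zip(starts, ends)]


text = "How are you? I am fine. What did you do today?"
lengths = [13, 11, 22]  # illustrative values, not fetched from the service

sentences = split_by_lengths(text, lengths)
for sentence in sentences:
    print(repr(sentence))
```

Calling `.strip()` on each piece removes the trailing space if only the sentence text is needed.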

## Input Types

### Text Input Models

```python { .api }
class InputTextItem:
    text: str  # Text content to analyze for sentence boundaries
```

## Response Types

### Sentence Boundary Results

```python { .api }
class BreakSentenceItem:
    sent_len: List[int]  # Character lengths of each detected sentence
    detected_language: Optional[DetectedLanguage]  # Auto-detected language info
```

### Language Detection Information

```python { .api }
class DetectedLanguage:
    language: str  # Detected language code (ISO 639-1/639-3)
    score: float  # Detection confidence score (0.0 to 1.0)
```
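
Because `detected_language` is optional and carries a confidence score, callers may want to guard against both a missing value and a low score. The sketch below uses stand-in dataclasses that mirror the fields documented above (they are not the SDK classes); `confident_language` and the `0.8` threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectedLanguage:  # stand-in mirroring the documented fields
    language: str
    score: float


@dataclass
class BreakSentenceItem:  # stand-in mirroring the documented fields
    sent_len: List[int]
    detected_language: Optional[DetectedLanguage] = None


def confident_language(item: BreakSentenceItem, threshold: float = 0.8) -> Optional[str]:
    """Return the detected language code, or None if absent or below `threshold`."""
    detected = item.detected_language
    if detected is None or detected.score < threshold:
        return None
    return detected.language


high = BreakSentenceItem([13, 11], DetectedLanguage("en", 0.98))
low = BreakSentenceItem([13, 11], DetectedLanguage("fr", 0.42))
missing = BreakSentenceItem([13, 11])

print(confident_language(high))
print(confident_language(low))
print(confident_language(missing))
```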

## Sentence Segmentation Rules

The service applies language-specific and script-specific rules for sentence boundary detection:

### General Rules

- Periods, exclamation marks, and question marks typically end sentences
- Abbreviations (Dr., Mr., U.S.A.) are handled contextually
- Quotation marks and parentheses are considered in boundary detection
- Multiple consecutive punctuation marks are processed appropriately
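
To see why abbreviations need contextual handling, consider a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace. This sketch is purely illustrative and is not how the service segments text; `naive_split` is a hypothetical helper.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Split after '.', '!' or '?' whenever whitespace follows (no context awareness)."""
    return re.split(r"(?<=[.!?])\s+", text)


text = "Dr. Smith went to the U.S.A. yesterday. He said 'Hello!' to everyone."
parts = naive_split(text)
for part in parts:
    print(repr(part))
# The naive rule yields 4 pieces, wrongly breaking after "Dr." and "U.S.A."
```

A rule this simple over-splits on abbreviations, which is the kind of case the contextual rules above are meant to cover.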

### Language-Specific Processing

- **English**: Handles abbreviations, contractions, and decimal numbers
- **Spanish**: Processes inverted punctuation marks (¡¿)
- **Chinese/Japanese**: Recognizes full-width punctuation (。!?)
- **Arabic**: Handles right-to-left text directionality
- **German**: Manages compound words and capitalization rules

### Script Considerations

- **Latin scripts**: Standard punctuation processing
- **CJK scripts**: Full-width punctuation mark recognition
- **Arabic script**: Right-to-left text flow handling
- **Devanagari**: Script-specific sentence ending markers

## Integration with Translation

Sentence boundary detection runs automatically when a translation request sets `include_sentence_length=True`:

```python
# Translation with automatic sentence boundary detection
translation_response = client.translate(
    body=["First sentence. Second sentence. Third sentence."],
    to_language=["es"],
    include_sentence_length=True
)

translation = translation_response[0].translations[0]
if translation.sent_len:
    print(f"Source sentence lengths: {translation.sent_len.src_sent_len}")
    print(f"Target sentence lengths: {translation.sent_len.trans_sent_len}")
```
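
The two length lists can be used to line up source sentences with their translations. The sketch below assumes both lists describe the same number of sentences in the same order, which the service may not guarantee for every input. `pair_sentences` is a hypothetical helper, and the Spanish text and all lengths are illustrative values rather than real service output.

```python
from itertools import accumulate
from typing import List, Tuple


def slice_by_lengths(text: str, lengths: List[int]) -> List[str]:
    """Cut `text` into consecutive pieces with the given lengths."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [text[s:e] for s, e in zip(starts, ends)]


def pair_sentences(
    source: str, target: str, src_lens: List[int], trans_lens: List[int]
) -> List[Tuple[str, str]]:
    """Pair each source sentence with the translated sentence at the same index."""
    if len(src_lens) != len(trans_lens):
        raise ValueError("source and target sentence counts differ")
    return list(zip(slice_by_lengths(source, src_lens),
                    slice_by_lengths(target, trans_lens)))


source = "First sentence. Second sentence. Third sentence."
target = "Primera oración. Segunda oración. Tercera oración."  # illustrative translation
pairs = pair_sentences(source, target, [16, 17, 15], [17, 17, 16])

for src, tgt in pairs:
    print(f"{src.strip()!r} -> {tgt.strip()!r}")
```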

## Error Handling

```python
from azure.core.exceptions import HttpResponseError

try:
    response = client.find_sentence_boundaries(
        body=["Text to analyze"],
        language="invalid-code"  # Invalid language code
    )
except HttpResponseError as error:
    # error.error holds the structured service error, when one is returned
    if error.error is not None:
        print(f"Error Code: {error.error.code}")
        print(f"Message: {error.error.message}")
```