tessl/pypi-azure-ai-translation-text

Azure Text Translation client library for Python, providing neural machine translation for fast, accurate, real-time text translation across all supported languages.

docs/sentence-boundaries.md

Sentence Boundary Detection

Identify sentence boundaries in text with automatic language detection and script-specific processing. This service determines where sentences begin and end in input text, providing length information for proper text segmentation and analysis.

Capabilities

Find Sentence Boundaries

Analyzes input text to identify sentence boundaries and returns length information for each detected sentence with optional language detection.

def find_sentence_boundaries(
    body: Union[List[str], List[InputTextItem], IO[bytes]],
    *,
    client_trace_id: Optional[str] = None,
    language: Optional[str] = None,
    script: Optional[str] = None,
    **kwargs: Any
) -> List[BreakSentenceItem]

Parameters:

  • body: Text to analyze (strings, InputTextItem objects, or binary data)
  • client_trace_id: Client-generated GUID for request tracking
  • language: Language code for the text (auto-detected if omitted)
  • script: Script identifier for the text (default script assumed if omitted)

Returns: List of sentence boundary analysis results

Usage Examples

from azure.ai.translation.text import TextTranslationClient
from azure.core.credentials import AzureKeyCredential

client = TextTranslationClient(
    credential=AzureKeyCredential("your-api-key"),
    region="your-region"
)

# Basic sentence boundary detection with auto-detection
response = client.find_sentence_boundaries(
    body=["The answer lies in machine translation. This is a test. How are you?"]
)

result = response[0]
if result.detected_language:
    print(f"Detected language: {result.detected_language.language}")
    print(f"Detection confidence: {result.detected_language.score}")
print(f"Sentence lengths: {result.sent_len}")
# sent_len lists the character count of each detected sentence;
# each count includes the whitespace that follows the sentence

# Multi-text analysis
multi_response = client.find_sentence_boundaries(
    body=[
        "First text with multiple sentences. This is sentence two.",
        "Second text. Also has multiple parts. Three sentences total."
    ]
)

for i, result in enumerate(multi_response):
    print(f"\nText {i+1}:")
    print(f"  Language: {result.detected_language.language}")
    print(f"  Sentence lengths: {result.sent_len}")

# Specify language and script explicitly
explicit_response = client.find_sentence_boundaries(
    body=["¡Hola mundo! ¿Cómo estás hoy? Me alegro de verte."],
    language="es",
    script="Latn"
)

# Complex punctuation handling
complex_response = client.find_sentence_boundaries(
    body=["Dr. Smith went to the U.S.A. yesterday. He said 'Hello!' to everyone."]
)

# Mixed language content (relies on auto-detection)
mixed_response = client.find_sentence_boundaries(
    body=["English sentence. Sentence en français. Back to English."]
)

Input Types

Text Input Models

class InputTextItem:
    text: str  # Text content to analyze for sentence boundaries

Response Types

Sentence Boundary Results

class BreakSentenceItem:
    sent_len: List[int]  # Character lengths of each detected sentence
    detected_language: Optional[DetectedLanguage]  # Auto-detected language info

Language Detection Information

class DetectedLanguage:
    language: str  # Detected language code (ISO 639-1/639-3)
    score: float   # Detection confidence score (0.0 to 1.0)
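
Because sent_len reports only lengths, recovering the sentence strings themselves is a simple local computation. A minimal sketch (split_by_sent_len is a helper defined here, not part of the SDK):

```python
def split_by_sent_len(text: str, sent_len: list) -> list:
    """Slice text into its sentences using the character lengths
    from BreakSentenceItem.sent_len. Each count includes the
    whitespace that follows the sentence, so the slices partition
    the input exactly."""
    sentences, offset = [], 0
    for length in sent_len:
        sentences.append(text[offset:offset + length])
        offset += length
    return sentences

text = "This is a test. How are you?"
print(split_by_sent_len(text, [16, 12]))
# → ['This is a test. ', 'How are you?']
```

The length values here are hard-coded for illustration; in practice they come from the service response.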

Sentence Segmentation Rules

The service applies language-specific and script-specific rules for sentence boundary detection:

General Rules

  • Periods, exclamation marks, and question marks typically end sentences
  • Abbreviations (Dr., Mr., U.S.A.) are handled contextually
  • Quotation marks and parentheses are considered in boundary detection
  • Multiple consecutive punctuation marks are processed appropriately
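
To see why contextual handling matters, compare with a naive regex splitter (a local illustration only, not the service's algorithm):

```python
import re

def naive_split(text: str) -> list:
    # Split after . ! ? when followed by whitespace and a sentence
    # start (upper-case letter or inverted punctuation). Deliberately
    # naive: it would wrongly break on abbreviations like "Dr." that
    # the service resolves contextually.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z¡¿])", text)

print(naive_split("This is a test. How are you? Fine!"))
# → ['This is a test.', 'How are you?', 'Fine!']
```

Fed "Dr. Smith went to the U.S.A. yesterday.", this splitter produces spurious breaks after "Dr." and "U.S.A.", which is exactly the failure mode the service's contextual rules avoid.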

Language-Specific Processing

  • English: Handles abbreviations, contractions, and decimal numbers
  • Spanish: Processes inverted punctuation marks (¡¿)
  • Chinese/Japanese: Recognizes full-width punctuation (。！？)
  • Arabic: Handles right-to-left text directionality
  • German: Manages compound words and capitalization rules
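
The CJK case differs from Latin scripts in that full-width terminators end a sentence with no following space. A naive local sketch of that rule (illustration only; the service's segmentation is more sophisticated):

```python
import re

def naive_cjk_split(text: str) -> list:
    # Split immediately after a full-width terminator 。！？,
    # keeping the terminator with its sentence. No whitespace
    # is required between CJK sentences.
    return [p for p in re.split(r"(?<=[。！？])", text) if p]

print(naive_cjk_split("你好。今天怎么样？很好！"))
# → ['你好。', '今天怎么样？', '很好！']
```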

Script Considerations

  • Latin scripts: Standard punctuation processing
  • CJK scripts: Full-width punctuation mark recognition
  • Arabic script: Right-to-left text flow handling
  • Devanagari: Script-specific sentence ending markers

Integration with Translation

Sentence boundary detection is automatically used when include_sentence_length=True in translation requests:

# Translation with automatic sentence boundary detection
translation_response = client.translate(
    body=["First sentence. Second sentence. Third sentence."],
    to_language=["es"],
    include_sentence_length=True
)

translation = translation_response[0].translations[0]
if translation.sent_len:
    print(f"Source sentence lengths: {translation.sent_len.src_sent_len}")
    print(f"Target sentence lengths: {translation.sent_len.trans_sent_len}")
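
Since src_sent_len and trans_sent_len are parallel lists, the source and translated texts can be sliced and zipped into aligned sentence pairs. A sketch (pair_sentences is a helper defined here, and the length values are made up for illustration):

```python
def pair_sentences(src_text, trans_text, src_sent_len, trans_sent_len):
    # Slice each text by its own length list, then zip the results
    # into (source sentence, translated sentence) pairs.
    def slices(text, lengths):
        out, pos = [], 0
        for n in lengths:
            out.append(text[pos:pos + n])
            pos += n
        return out
    return list(zip(slices(src_text, src_sent_len),
                    slices(trans_text, trans_sent_len)))

print(pair_sentences("Hi. Bye.", "Hola. Adiós.", [4, 4], [6, 6]))
# → [('Hi. ', 'Hola. '), ('Bye.', 'Adiós.')]
```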

Error Handling

from azure.core.exceptions import HttpResponseError

try:
    response = client.find_sentence_boundaries(
        body=["Text to analyze"],
        language="invalid-code"  # Invalid language code
    )
except HttpResponseError as error:
    if error.error:
        print(f"Error Code: {error.error.code}")
        print(f"Message: {error.error.message}")

Install with Tessl CLI

npx tessl i tessl/pypi-azure-ai-translation-text
