
tessl/pypi-tencentcloud-sdk-python-tmt

Tencent Cloud Machine Translation (TMT) SDK for Python providing comprehensive text, file, image, and speech translation capabilities


docs/speech-translation.md

Speech Translation

Audio translation combines speech recognition with machine translation for bidirectional Chinese-English processing. Both streaming and batch modes are supported, across several compatible audio formats.

Capabilities

Audio Translation

Recognizes speech in audio files and translates the recognized text to the target language. Supports real-time streaming and batch processing modes.

def SpeechTranslate(self, request: models.SpeechTranslateRequest) -> models.SpeechTranslateResponse:
    """
    Translate speech audio to text in target language.
    
    Args:
        request: SpeechTranslateRequest with audio data and parameters
        
    Returns:
        SpeechTranslateResponse with translated text result
        
    Raises:
        TencentCloudSDKException: For various error conditions
    """

Usage Example (Single Audio File):

import base64
from tencentcloud.common import credential
from tencentcloud.tmt.v20180321.tmt_client import TmtClient
from tencentcloud.tmt.v20180321 import models

# Initialize client
cred = credential.Credential("SecretId", "SecretKey")
client = TmtClient(cred, "ap-beijing")

# Read and encode audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Create speech translation request
req = models.SpeechTranslateRequest()
req.SessionUuid = "unique-session-id"
req.Source = "zh"  # Chinese input
req.Target = "en"  # English output
req.AudioFormat = 1  # PCM format
req.Data = audio_data
req.Seq = 0  # Sequence number
req.IsEnd = 1  # Single file, mark as end
req.ProjectId = 0

# Perform speech translation
resp = client.SpeechTranslate(req)
print(f"Session: {resp.SessionUuid}")
print(f"Translation: {resp.Source} -> {resp.Target}")
print(f"Original: {resp.SourceText}")
print(f"Translated: {resp.TargetText}")
print(f"Recognition status: {resp.RecognizeStatus}")

Usage Example (Streaming Audio):

def stream_audio_translation(client, audio_chunks, session_uuid):
    """
    Process streaming audio chunks for real-time translation.
    
    Args:
        client: TmtClient instance
        audio_chunks: List of audio data chunks (200-500ms each)
        session_uuid: Unique session identifier
    
    Returns:
        List of translation results
    """
    results = []
    
    for i, chunk in enumerate(audio_chunks):
        req = models.SpeechTranslateRequest()
        req.SessionUuid = session_uuid
        req.Source = "en"
        req.Target = "zh"
        req.AudioFormat = 1  # PCM only for streaming
        req.Data = base64.b64encode(chunk).decode()
        req.Seq = i
        req.IsEnd = 1 if i == len(audio_chunks) - 1 else 0
        req.ProjectId = 0
        
        try:
            resp = client.SpeechTranslate(req)
            if resp.TargetText:
                results.append(resp.TargetText)
                print(f"Chunk {i}: {resp.SourceText} -> {resp.TargetText}")
        except Exception as e:
            print(f"Error processing chunk {i}: {e}")
            
    return results

# Example usage
session_id = "streaming-session-001"
# audio_chunks would be your segmented audio data
# results = stream_audio_translation(client, audio_chunks, session_id)

Request/Response Models

SpeechTranslateRequest

class SpeechTranslateRequest:
    """
    Request parameters for speech translation.
    
    Attributes:
        SessionUuid (str): Unique session identifier for tracking
        Source (str): Source language code (zh, en)
        Target (str): Target language code (zh, en)
        AudioFormat (int): Audio format (1: PCM, 2: MP3, 3: SPEEX)
        Data (str): Base64 encoded audio data
        Seq (int): Sequence number for streaming (starts from 0)
        IsEnd (int): End flag (0: more chunks, 1: final chunk)
        ProjectId (int): Project ID (default: 0)
    """

SpeechTranslateResponse

class SpeechTranslateResponse:
    """
    Response from speech translation.
    
    Attributes:
        SessionUuid (str): Session identifier from request
        RecognizeStatus (int): Speech recognition status (1=processing, 0=complete)
        SourceText (str): Recognized original text
        TargetText (str): Translated text result
        Seq (int): Audio fragment sequence number
        Source (str): Source language
        Target (str): Target language
        VadSeq (int): Voice activity detection sequence number
        RequestId (str): Unique request identifier
    """

Supported Audio Formats

Format Specifications

PCM (Format ID: 1)

  • Sampling Rate: 16kHz
  • Bit Depth: 16-bit
  • Channels: Mono (single channel)
  • Streaming Support: Yes (required for real-time)
  • Chunk Duration: 200-500ms per chunk
  • Use Case: Real-time streaming translation

MP3 (Format ID: 2)

  • Streaming Support: No (batch only)
  • Max Duration: 8 seconds
  • Use Case: Pre-recorded audio files
  • Quality: Variable bitrate supported

SPEEX (Format ID: 3)

  • Streaming Support: No (batch only)
  • Max Duration: 8 seconds
  • Use Case: Compressed voice recordings
  • Quality: Optimized for speech
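The PCM requirements above can be checked before submitting audio. A minimal sketch using Python's standard wave module; the helper name is ours, not part of the SDK:

```python
import wave

def is_valid_pcm_wav(path: str) -> bool:
    """Check that a WAV file matches the streaming PCM requirements:
    16 kHz sampling rate, 16-bit samples, mono."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getframerate() == 16000   # 16 kHz sampling rate
            and wf.getsampwidth() == 2   # 16-bit samples (2 bytes)
            and wf.getnchannels() == 1   # mono (single channel)
        )
```

Validating locally avoids a round trip to the API for audio that would be rejected anyway.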

Language Support

Speech translation currently supports Chinese-English bidirectional translation:

Supported Language Pairs

  • Chinese to English: zh → en
  • English to Chinese: en → zh

Language Codes

  • zh: Simplified Chinese (Mandarin)
  • en: English

Processing Modes

Streaming Mode (PCM only)

  • Real-time processing of audio chunks
  • 200-500ms chunk duration recommended
  • Sequential processing with Seq numbering
  • IsEnd=1 for final chunk
  • Immediate translation results

Batch Mode (All formats)

  • Single audio file processing
  • Maximum 8 seconds duration (MP3, SPEEX)
  • No duration limit for PCM
  • IsEnd=1, Seq=0 for single file
  • Complete translation after processing
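When choosing between modes, the audio duration matters because MP3 and SPEEX batch requests are capped at 8 seconds. For WAV input, the duration can be measured with only the standard library (wav_duration_seconds is our helper name, not part of the SDK):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds
    (frame count divided by sampling rate)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()
```

A file longer than the batch limit can then be routed to the streaming path instead of failing with a duration error.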

Audio Quality Requirements

Clear Speech

  • Minimal background noise
  • Clear pronunciation
  • Avoid overlapping speakers
  • Consistent volume levels

Technical Requirements

  • Proper sampling rate (16kHz for PCM)
  • Adequate bit depth (16-bit minimum)
  • Stable audio stream without dropouts
  • Proper audio encoding

Session Management

Session UUID

  • Unique identifier for each translation session
  • Required for tracking streaming sessions
  • Use consistent UUID across all chunks in a session
  • Helps correlate results with audio input

Sequence Numbers

  • Start from 0 for first chunk
  • Increment by 1 for each subsequent chunk
  • Used for proper ordering in streaming mode
  • Critical for maintaining audio continuity
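The session and sequencing rules above can be wrapped in a small helper that owns both pieces of state. A sketch (StreamingSession is illustrative, not part of the SDK):

```python
import uuid

class StreamingSession:
    """Tracks one streaming session: a single UUID shared by all chunks,
    and a sequence counter that starts at 0 and increments by 1."""

    def __init__(self):
        self.session_uuid = str(uuid.uuid4())  # one UUID per session
        self._seq = 0                          # first chunk is Seq=0

    def next_seq(self) -> int:
        """Return the sequence number for the next chunk, then advance."""
        seq = self._seq
        self._seq += 1
        return seq
```

Assigning req.SessionUuid and req.Seq from one object per session makes duplicate or out-of-order numbering much harder to introduce.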

Error Handling

Common error scenarios for speech translation:

  • UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED: Audio exceeds maximum duration
  • UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE: Language pair not supported
  • FAILEDOPERATION_REQUESTAILABERR: Audio processing failure
  • INVALIDPARAMETER_SEQINTERVALTOOLARGE: Invalid sequence numbering
  • INVALIDPARAMETER_DUPLICATEDSESSIONIDANDSEQ: Duplicate session/sequence

Example error handling:

def safe_speech_translate(client, request):
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException

    """Safely perform speech translation with error handling."""
    try:
        response = client.SpeechTranslate(request)
        return response.TargetText
    except TencentCloudSDKException as e:
        if e.code == "UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED":
            print("Audio file too long, split into smaller chunks")
        elif e.code == "UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE":
            print("Language pair not supported, use zh<->en only")
        elif e.code == "FAILEDOPERATION_REQUESTAILABERR":
            print("Audio processing failed, check audio quality")
        else:
            print(f"Speech translation error: {e.code} - {e.message}")
        return None

# Usage
result = safe_speech_translate(client, req)
if result:
    print(f"Translation: {result}")

Best Practices

Audio Preparation

  • Use high-quality recording equipment
  • Record in quiet environments
  • Maintain consistent speaking pace
  • Avoid background music or noise

Streaming Implementation

  • Buffer audio in 200-500ms chunks
  • Implement proper sequence numbering
  • Handle network interruptions gracefully
  • Process results as they arrive

Error Recovery

  • Implement retry logic for transient errors
  • Validate audio format before submission
  • Monitor session state across chunks
  • Provide user feedback for processing status
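The retry recommendation can be sketched as a wrapper with exponential backoff. This is an illustrative pattern, not SDK functionality; which failures are safely retryable is an assumption you should confirm against the error codes you actually observe:

```python
import time

def translate_with_retry(call, request, retries=3, base_delay=0.5):
    """Invoke call(request), retrying transient failures with
    exponential backoff (base_delay, 2*base_delay, 4*base_delay, ...).
    In real use, catch TencentCloudSDKException and inspect e.code
    rather than retrying every exception."""
    for attempt in range(retries):
        try:
            return call(request)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

For streaming, retry only the failed chunk and keep the same SessionUuid and Seq so the server sees a consistent sequence.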

Install with Tessl CLI

npx tessl i tessl/pypi-tencentcloud-sdk-python-tmt@3.0.1
