# Speech Translation

Audio translation that combines speech recognition and machine translation for bidirectional Chinese-English processing. Supports both streaming and batch audio processing across multiple audio formats.

## Capabilities

### Audio Translation

Recognizes speech in audio files and translates the recognized text into the target language. Supports real-time streaming and batch processing modes.

```python { .api }
def SpeechTranslate(self, request: models.SpeechTranslateRequest) -> models.SpeechTranslateResponse:
    """
    Translate speech audio to text in the target language.

    Args:
        request: SpeechTranslateRequest with audio data and parameters

    Returns:
        SpeechTranslateResponse with the translated text result

    Raises:
        TencentCloudSDKException: For various error conditions
    """
```

**Usage Example (Single Audio File):**

```python
import base64
from tencentcloud.common import credential
from tencentcloud.tmt.v20180321.tmt_client import TmtClient
from tencentcloud.tmt.v20180321 import models

# Initialize the client
cred = credential.Credential("SecretId", "SecretKey")
client = TmtClient(cred, "ap-beijing")

# Read and base64-encode the audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Create the speech translation request
req = models.SpeechTranslateRequest()
req.SessionUuid = "unique-session-id"
req.Source = "zh"    # Chinese input
req.Target = "en"    # English output
req.AudioFormat = 1  # PCM format
req.Data = audio_data
req.Seq = 0          # Sequence number
req.IsEnd = 1        # Single file, mark as end
req.ProjectId = 0

# Perform speech translation
resp = client.SpeechTranslate(req)
print(f"Session: {resp.SessionUuid}")
print(f"Translation: {resp.Source} -> {resp.Target}")
print(f"Original: {resp.SourceText}")
print(f"Translated: {resp.TargetText}")
print(f"Recognition status: {resp.RecognizeStatus}")
```

**Usage Example (Streaming Audio):**

```python
import base64
from tencentcloud.tmt.v20180321 import models

def stream_audio_translation(client, audio_chunks, session_uuid):
    """
    Process streaming audio chunks for real-time translation.

    Args:
        client: TmtClient instance
        audio_chunks: List of audio data chunks (200-500ms each)
        session_uuid: Unique session identifier

    Returns:
        List of translation results
    """
    results = []

    for i, chunk in enumerate(audio_chunks):
        req = models.SpeechTranslateRequest()
        req.SessionUuid = session_uuid
        req.Source = "en"
        req.Target = "zh"
        req.AudioFormat = 1  # PCM only for streaming
        req.Data = base64.b64encode(chunk).decode()
        req.Seq = i
        req.IsEnd = 1 if i == len(audio_chunks) - 1 else 0
        req.ProjectId = 0

        try:
            resp = client.SpeechTranslate(req)
            if resp.TargetText:
                results.append(resp.TargetText)
                print(f"Chunk {i}: {resp.SourceText} -> {resp.TargetText}")
        except Exception as e:
            print(f"Error processing chunk {i}: {e}")

    return results

# Example usage
session_id = "streaming-session-001"
# audio_chunks would be your segmented audio data
# results = stream_audio_translation(client, audio_chunks, session_id)
```

## Request/Response Models

### SpeechTranslateRequest

```python { .api }
class SpeechTranslateRequest:
    """
    Request parameters for speech translation.

    Attributes:
        SessionUuid (str): Unique session identifier for tracking
        Source (str): Source language code (zh, en)
        Target (str): Target language code (zh, en)
        AudioFormat (int): Audio format (1: PCM, 2: MP3, 3: SPEEX)
        Data (str): Base64-encoded audio data
        Seq (int): Sequence number for streaming (starts from 0)
        IsEnd (int): End flag (0: more chunks, 1: final chunk)
        ProjectId (int): Project ID (default: 0)
    """
```

### SpeechTranslateResponse

```python { .api }
class SpeechTranslateResponse:
    """
    Response from speech translation.

    Attributes:
        SessionUuid (str): Session identifier from the request
        RecognizeStatus (int): Speech recognition status (1=processing, 0=complete)
        SourceText (str): Recognized original text
        TargetText (str): Translated text result
        Seq (int): Audio fragment sequence number
        Source (str): Source language
        Target (str): Target language
        VadSeq (int): Voice activity detection sequence number
        RequestId (str): Unique request identifier
    """
```

## Supported Audio Formats

### Format Specifications

**PCM (Format ID: 1)**
- **Sampling Rate**: 16kHz
- **Bit Depth**: 16-bit
- **Channels**: Mono (single channel)
- **Streaming Support**: Yes (required for real-time)
- **Chunk Duration**: 200-500ms per chunk
- **Use Case**: Real-time streaming translation
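
For 16kHz, 16-bit mono PCM, the recommended 200-500ms chunk duration maps to a fixed byte count. A small helper makes the arithmetic explicit (the function name is illustrative, not part of the SDK):

```python
def pcm_chunk_size(duration_ms, sample_rate=16000, sample_width=2, channels=1):
    """Bytes of raw PCM audio for a chunk of the given duration.

    Defaults match the streaming requirement: 16 kHz, 16-bit (2 bytes), mono.
    """
    return sample_rate * sample_width * channels * duration_ms // 1000

print(pcm_chunk_size(200))  # 200 ms -> 6400 bytes
print(pcm_chunk_size(500))  # 500 ms -> 16000 bytes
```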

**MP3 (Format ID: 2)**
- **Streaming Support**: No (batch only)
- **Max Duration**: 8 seconds
- **Use Case**: Pre-recorded audio files
- **Quality**: Variable bitrate supported

**SPEEX (Format ID: 3)**
- **Streaming Support**: No (batch only)
- **Max Duration**: 8 seconds
- **Use Case**: Compressed voice recordings
- **Quality**: Optimized for speech

## Language Support

Speech translation currently supports **Chinese-English bidirectional translation**:

### Supported Language Pairs
- **Chinese to English**: zh → en
- **English to Chinese**: en → zh

### Language Codes
- **zh**: Simplified Chinese (Mandarin)
- **en**: English
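
A minimal guard for the supported pairs, useful before issuing a request (this helper is a sketch, not part of the SDK):

```python
# The only pairs the service accepts, per the table above
SUPPORTED_PAIRS = {("zh", "en"), ("en", "zh")}

def is_supported_pair(source, target):
    """Return True if (source, target) is a supported language pair."""
    return (source, target) in SUPPORTED_PAIRS
```

Checking locally avoids a round trip that would fail with `UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE`.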

## Processing Modes

### Streaming Mode (PCM only)
- Real-time processing of audio chunks
- 200-500ms chunk duration recommended
- Sequential processing with Seq numbering
- IsEnd=1 for final chunk
- Immediate translation results

### Batch Mode (All formats)
- Single audio file processing
- Maximum 8 seconds duration (MP3, SPEEX)
- No duration limit for PCM
- IsEnd=1, Seq=0 for single file
- Complete translation after processing
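
The streaming rules above (Seq starting at 0, IsEnd=1 only on the final chunk) can be captured in a small generator. The helper name and the 6400-byte default (200 ms of 16 kHz, 16-bit mono PCM) are illustrative choices:

```python
def iter_pcm_chunks(pcm_bytes, chunk_bytes=6400):
    """Yield (seq, is_end, chunk) triples for streaming requests.

    seq counts from 0; is_end is 1 only for the last chunk, matching
    the Seq/IsEnd fields of SpeechTranslateRequest.
    """
    chunks = [pcm_bytes[i:i + chunk_bytes]
              for i in range(0, len(pcm_bytes), chunk_bytes)]
    for seq, chunk in enumerate(chunks):
        yield seq, 1 if seq == len(chunks) - 1 else 0, chunk
```

Feeding each triple into a `SpeechTranslateRequest` keeps the sequence numbering and end flag consistent across the session.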

## Audio Quality Requirements

### Clear Speech
- Minimal background noise
- Clear pronunciation
- Avoid overlapping speakers
- Consistent volume levels

### Technical Requirements
- Proper sampling rate (16kHz for PCM)
- Adequate bit depth (16-bit minimum)
- Stable audio stream without dropouts
- Proper audio encoding
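
When the source is a WAV file, the technical requirements above can be checked locally with the standard-library `wave` module before any data is sent (the helper name is illustrative):

```python
import wave

def validate_wav_for_pcm(path):
    """Check a WAV file against the PCM streaming requirements:
    16 kHz sample rate, 16-bit samples, mono. Returns a list of
    problem descriptions; an empty list means the file passes."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            problems.append(f"sample rate {w.getframerate()} Hz, expected 16000")
        if w.getsampwidth() != 2:
            problems.append(f"{w.getsampwidth() * 8}-bit samples, expected 16-bit")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
    return problems
```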

## Session Management

### Session UUID
- Unique identifier for each translation session
- Required for tracking streaming sessions
- Use a consistent UUID across all chunks in a session
- Helps correlate results with audio input

### Sequence Numbers
- Start from 0 for the first chunk
- Increment by 1 for each subsequent chunk
- Used for proper ordering in streaming mode
- Critical for maintaining audio continuity
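
Both rules can be bundled in a tiny bookkeeping class. The class is a sketch; the API only requires a unique string for SessionUuid, and using `uuid.uuid4()` for it is a convenient assumption, not a requirement:

```python
import uuid

class StreamingSession:
    """Track the SessionUuid and Seq counter for one streaming session."""

    def __init__(self):
        # Any unique string works; a UUID4 is a convenient choice
        self.session_uuid = str(uuid.uuid4())
        self._next_seq = 0

    def next_seq(self):
        """Return the next sequence number: 0 for the first chunk, then 1, 2, ..."""
        seq = self._next_seq
        self._next_seq += 1
        return seq
```

Creating one `StreamingSession` per audio stream guarantees a consistent UUID and gap-free sequence numbers across all chunks.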

## Error Handling

Common error scenarios for speech translation:

- **UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED**: Audio exceeds the maximum duration
- **UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE**: Language pair not supported
- **FAILEDOPERATION_REQUESTAILABERR**: Audio processing failure
- **INVALIDPARAMETER_SEQINTERVALTOOLARGE**: Invalid sequence numbering
- **INVALIDPARAMETER_DUPLICATEDSESSIONIDANDSEQ**: Duplicate session/sequence combination

Example error handling:

```python
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException

def safe_speech_translate(client, request):
    """Safely perform speech translation with error handling."""
    try:
        response = client.SpeechTranslate(request)
        return response.TargetText
    except TencentCloudSDKException as e:
        if e.code == "UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED":
            print("Audio file too long, split into smaller chunks")
        elif e.code == "UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE":
            print("Language pair not supported, use zh<->en only")
        elif e.code == "FAILEDOPERATION_REQUESTAILABERR":
            print("Audio processing failed, check audio quality")
        else:
            print(f"Speech translation error: {e.code} - {e.message}")
        return None

# Usage
result = safe_speech_translate(client, req)
if result:
    print(f"Translation: {result}")
```

## Best Practices

### Audio Preparation
- Use high-quality recording equipment
- Record in quiet environments
- Maintain a consistent speaking pace
- Avoid background music or noise

### Streaming Implementation
- Buffer audio in 200-500ms chunks
- Implement proper sequence numbering
- Handle network interruptions gracefully
- Process results as they arrive

### Error Recovery
- Implement retry logic for transient errors
- Validate audio format before submission
- Monitor session state across chunks
- Provide user feedback on processing status
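
The retry recommendation above can be sketched as a small wrapper with exponential backoff. Here `send_chunk` is a placeholder for whatever zero-argument callable issues your SpeechTranslate request; the wrapper itself is not part of the SDK:

```python
import time

def send_with_retry(send_chunk, max_attempts=3, base_delay=0.5):
    """Retry a chunk upload on failure with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted, so
    permanent errors (e.g. unsupported language) still surface.
    """
    for attempt in range(max_attempts):
        try:
            return send_chunk()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In practice you would retry only on errors you believe are transient (network failures, timeouts) rather than on every exception.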