
# Speech Translation

Audio translation combines speech recognition and machine translation for bidirectional Chinese-English processing. It supports both streaming and batch audio input across multiple audio formats.

## Capabilities

### Audio Translation

Recognizes speech in audio files and translates the recognized text into the target language. Supports both real-time streaming and batch processing modes.

```python { .api }
def SpeechTranslate(self, request: models.SpeechTranslateRequest) -> models.SpeechTranslateResponse:
    """
    Translate speech audio to text in the target language.

    Args:
        request: SpeechTranslateRequest with audio data and parameters

    Returns:
        SpeechTranslateResponse with the translated text result

    Raises:
        TencentCloudSDKException: For various error conditions
    """
```

**Usage Example (Single Audio File):**

```python
import base64

from tencentcloud.common import credential
from tencentcloud.tmt.v20180321.tmt_client import TmtClient
from tencentcloud.tmt.v20180321 import models

# Initialize the client
cred = credential.Credential("SecretId", "SecretKey")
client = TmtClient(cred, "ap-beijing")

# Read and Base64-encode the audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Create the speech translation request
req = models.SpeechTranslateRequest()
req.SessionUuid = "unique-session-id"
req.Source = "zh"    # Chinese input
req.Target = "en"    # English output
req.AudioFormat = 1  # PCM format
req.Data = audio_data
req.Seq = 0          # Sequence number
req.IsEnd = 1        # Single file, mark as end
req.ProjectId = 0

# Perform the speech translation
resp = client.SpeechTranslate(req)
print(f"Session: {resp.SessionUuid}")
print(f"Translation: {resp.Source} -> {resp.Target}")
print(f"Original: {resp.SourceText}")
print(f"Translated: {resp.TargetText}")
print(f"Recognition status: {resp.RecognizeStatus}")
```

**Usage Example (Streaming Audio):**

```python
import base64

from tencentcloud.tmt.v20180321 import models

def stream_audio_translation(client, audio_chunks, session_uuid):
    """
    Process streaming audio chunks for real-time translation.

    Args:
        client: TmtClient instance
        audio_chunks: List of audio data chunks (200-500ms each)
        session_uuid: Unique session identifier

    Returns:
        List of translation results
    """
    results = []

    for i, chunk in enumerate(audio_chunks):
        req = models.SpeechTranslateRequest()
        req.SessionUuid = session_uuid
        req.Source = "en"
        req.Target = "zh"
        req.AudioFormat = 1  # PCM only for streaming
        req.Data = base64.b64encode(chunk).decode()
        req.Seq = i
        req.IsEnd = 1 if i == len(audio_chunks) - 1 else 0
        req.ProjectId = 0

        try:
            resp = client.SpeechTranslate(req)
            if resp.TargetText:
                results.append(resp.TargetText)
                print(f"Chunk {i}: {resp.SourceText} -> {resp.TargetText}")
        except Exception as e:
            print(f"Error processing chunk {i}: {e}")

    return results

# Example usage
session_id = "streaming-session-001"
# audio_chunks would be your segmented audio data
# results = stream_audio_translation(client, audio_chunks, session_id)
```

## Request/Response Models

### SpeechTranslateRequest

```python { .api }
class SpeechTranslateRequest:
    """
    Request parameters for speech translation.

    Attributes:
        SessionUuid (str): Unique session identifier for tracking
        Source (str): Source language code (zh, en)
        Target (str): Target language code (zh, en)
        AudioFormat (int): Audio format (1: PCM, 2: MP3, 3: SPEEX)
        Data (str): Base64 encoded audio data
        Seq (int): Sequence number for streaming (starts from 0)
        IsEnd (int): End flag (0: more chunks, 1: final chunk)
        ProjectId (int): Project ID (default: 0)
    """
```

### SpeechTranslateResponse

```python { .api }
class SpeechTranslateResponse:
    """
    Response from speech translation.

    Attributes:
        SessionUuid (str): Session identifier from the request
        RecognizeStatus (int): Speech recognition status (1=processing, 0=complete)
        SourceText (str): Recognized original text
        TargetText (str): Translated text result
        Seq (int): Audio fragment sequence number
        Source (str): Source language
        Target (str): Target language
        VadSeq (int): Voice activity detection sequence number
        RequestId (str): Unique request identifier
    """
```

## Supported Audio Formats

### Format Specifications

**PCM (Format ID: 1)**

- **Sampling Rate**: 16 kHz
- **Bit Depth**: 16-bit
- **Channels**: Mono (single channel)
- **Streaming Support**: Yes (required for real-time)
- **Chunk Duration**: 200-500 ms per chunk
- **Use Case**: Real-time streaming translation

**MP3 (Format ID: 2)**

- **Streaming Support**: No (batch only)
- **Max Duration**: 8 seconds
- **Use Case**: Pre-recorded audio files
- **Quality**: Variable bitrate supported

**SPEEX (Format ID: 3)**

- **Streaming Support**: No (batch only)
- **Max Duration**: 8 seconds
- **Use Case**: Compressed voice recordings
- **Quality**: Optimized for speech
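Since streaming requires 16 kHz, 16-bit, mono PCM, it can be worth checking a WAV file's parameters before submitting it. The helper below is a minimal sketch using Python's standard `wave` module; it is not part of the SDK.

```python
import wave

def pcm_issues(path):
    """List any WAV parameters that don't match the streaming
    requirements (16 kHz sampling rate, 16-bit samples, mono)."""
    with wave.open(path, "rb") as w:
        params = w.getparams()
    issues = []
    if params.nchannels != 1:
        issues.append(f"expected mono, got {params.nchannels} channels")
    if params.sampwidth != 2:
        issues.append(f"expected 16-bit samples, got {params.sampwidth * 8}-bit")
    if params.framerate != 16000:
        issues.append(f"expected 16000 Hz, got {params.framerate} Hz")
    return issues
```

An empty list means the file matches the PCM requirements; otherwise each entry names one mismatch to fix before upload.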

170

171

## Language Support

172

173

Speech translation currently supports **Chinese-English bidirectional translation**:

174

175

### Supported Language Pairs

176

- **Chinese to English**: zh → en

177

- **English to Chinese**: en → zh

178

179

### Language Codes

180

- **zh**: Simplified Chinese (Mandarin)

181

- **en**: English
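Because only the zh↔en pair is accepted, a quick client-side check can fail fast before any audio is uploaded. This is a hypothetical helper for illustration, not an SDK function:

```python
# Language pairs accepted by speech translation (zh<->en only)
SUPPORTED_PAIRS = {("zh", "en"), ("en", "zh")}

def check_language_pair(source, target):
    """Raise ValueError if the pair isn't supported by speech translation."""
    if (source, target) not in SUPPORTED_PAIRS:
        raise ValueError(
            f"Unsupported pair {source}->{target}; "
            "speech translation only supports zh<->en"
        )
```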

## Processing Modes

### Streaming Mode (PCM only)

- Real-time processing of audio chunks
- 200-500 ms chunk duration recommended
- Sequential processing with Seq numbering
- IsEnd=1 for the final chunk
- Immediate translation results

### Batch Mode (all formats)

- Single audio file processing
- Maximum 8-second duration (MP3, SPEEX)
- No duration limit for PCM
- IsEnd=1, Seq=0 for a single file
- Complete translation after processing

## Audio Quality Requirements

### Clear Speech

- Minimal background noise
- Clear pronunciation
- Avoid overlapping speakers
- Consistent volume levels

### Technical Requirements

- Proper sampling rate (16 kHz for PCM)
- Adequate bit depth (16-bit minimum)
- Stable audio stream without dropouts
- Proper audio encoding

## Session Management

### Session UUID

- Unique identifier for each translation session
- Required for tracking streaming sessions
- Use a consistent UUID across all chunks in a session
- Helps correlate results with audio input

### Sequence Numbers

- Start from 0 for the first chunk
- Increment by 1 for each subsequent chunk
- Used for proper ordering in streaming mode
- Critical for maintaining audio continuity
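These two rules can be wrapped in a small helper that pairs a fresh SessionUuid with a Seq counter starting at 0. This is an illustrative sketch, not part of the SDK:

```python
import itertools
import uuid

class TranslationSession:
    """Holds one streaming session's SessionUuid and its Seq counter."""

    def __init__(self):
        self.uuid = str(uuid.uuid4())   # reuse this UUID for every chunk
        self._seq = itertools.count(0)  # Seq starts at 0, increments by 1

    def next_seq(self):
        """Return the next sequence number (0, 1, 2, ...)."""
        return next(self._seq)
```

In a streaming loop, assign `session.uuid` to `req.SessionUuid` and `session.next_seq()` to `req.Seq` for each chunk.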

## Error Handling

Common error scenarios for speech translation:

- **UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED**: Audio exceeds the maximum duration
- **UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE**: Language pair not supported
- **FAILEDOPERATION_REQUESTAILABERR**: Audio processing failure
- **INVALIDPARAMETER_SEQINTERVALTOOLARGE**: Invalid sequence numbering
- **INVALIDPARAMETER_DUPLICATEDSESSIONIDANDSEQ**: Duplicate session/sequence combination

Example error handling:

```python
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException

def safe_speech_translate(client, request):
    """Safely perform speech translation with error handling."""
    try:
        response = client.SpeechTranslate(request)
        return response.TargetText
    except TencentCloudSDKException as e:
        if e.code == "UNSUPPORTEDOPERATION_AUDIODURATIONEXCEED":
            print("Audio file too long, split into smaller chunks")
        elif e.code == "UNSUPPORTEDOPERATION_UNSUPPORTEDLANGUAGE":
            print("Language pair not supported, use zh<->en only")
        elif e.code == "FAILEDOPERATION_REQUESTAILABERR":
            print("Audio processing failed, check audio quality")
        else:
            print(f"Speech translation error: {e.code} - {e.message}")
        return None

# Usage
result = safe_speech_translate(client, req)
if result:
    print(f"Translation: {result}")
```

## Best Practices

### Audio Preparation

- Use high-quality recording equipment
- Record in quiet environments
- Maintain a consistent speaking pace
- Avoid background music or noise

### Streaming Implementation

- Buffer audio in 200-500 ms chunks
- Implement proper sequence numbering
- Handle network interruptions gracefully
- Process results as they arrive
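For 16 kHz, 16-bit, mono PCM, a 200-500 ms chunk corresponds to a fixed byte count, so buffering can be done with simple slicing. A minimal sketch, assuming the raw PCM bytes are already in memory:

```python
def pcm_chunk_bytes(duration_ms, sample_rate=16000, bytes_per_sample=2):
    """Byte length of `duration_ms` of 16-bit mono PCM audio."""
    return sample_rate * bytes_per_sample * duration_ms // 1000

def split_pcm(data, chunk_ms=200):
    """Split raw PCM bytes into chunks of `chunk_ms` milliseconds
    (the last chunk may be shorter)."""
    size = pcm_chunk_bytes(chunk_ms)
    return [data[i:i + size] for i in range(0, len(data), size)]
```

At 16 kHz with 2 bytes per sample, 200 ms works out to 6400 bytes per chunk; the resulting list can be fed to a streaming loop such as `stream_audio_translation` above.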

### Error Recovery

- Implement retry logic for transient errors
- Validate the audio format before submission
- Monitor session state across chunks
- Provide user feedback on processing status
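Retry logic for transient errors can be sketched generically: retry only when the raised exception carries a retryable error code, with exponential backoff between attempts. The code names in `retryable` below are placeholders; substitute the transient codes you actually observe from the SDK.

```python
import time

def call_with_retries(call, max_attempts=3, base_delay=0.5,
                      retryable=("InternalError", "RequestLimitExceeded")):
    """Invoke call(); retry with exponential backoff while the raised
    exception's `code` attribute is in `retryable`. Re-raises on
    non-retryable errors or once max_attempts is exhausted."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if getattr(exc, "code", None) not in retryable or attempt == max_attempts:
                raise
            time.sleep(delay)
            delay *= 2
```

Usage would look like `call_with_retries(lambda: client.SpeechTranslate(req))`; duplicate-session/sequence errors should not be retried as-is, since resending the same Seq triggers INVALIDPARAMETER_DUPLICATEDSESSIONIDANDSEQ.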