
# Types and Configuration

Core data types, configuration objects, and enums for speech recognition setup and result processing across all API versions.

## Core Configuration Types

### RecognitionConfig

Main configuration object for speech recognition requests.

```python { .api }
class RecognitionConfig:
    """Configuration for speech recognition."""
    encoding: AudioEncoding
    sample_rate_hertz: int
    audio_channel_count: int
    enable_separate_recognition_per_channel: bool
    language_code: str
    alternative_language_codes: Sequence[str]
    max_alternatives: int
    profanity_filter: bool
    speech_contexts: Sequence[SpeechContext]
    enable_word_time_offsets: bool
    enable_word_confidence: bool
    enable_automatic_punctuation: bool
    enable_spoken_punctuation: bool
    enable_spoken_emojis: bool
    enable_speaker_diarization: bool
    diarization_config: SpeakerDiarizationConfig
    metadata: RecognitionMetadata
    model: str
    use_enhanced: bool
    adaptation: SpeechAdaptation
    transcript_normalization: TranscriptNormalization
    enable_voice_activity_events: bool
```

### RecognitionAudio

Specifies the audio input for recognition.

```python { .api }
class RecognitionAudio:
    """Audio input specification."""
    content: bytes  # Raw audio bytes
    uri: str        # Cloud Storage URI (gs://bucket/file)
```
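
Note that `content` and `uri` are mutually exclusive: a request must supply exactly one of them. A minimal sketch of that rule in plain Python (the helper name `make_audio_kwargs` is illustrative, not part of the library):

```python
def make_audio_kwargs(content=None, uri=None):
    """Build kwargs for RecognitionAudio, enforcing the exactly-one-of rule."""
    # Supplying both or neither of content/uri is rejected by the API,
    # so fail fast on the client side.
    if (content is None) == (uri is None):
        raise ValueError("Provide exactly one of `content` or `uri`")
    if uri is not None and not uri.startswith("gs://"):
        raise ValueError("uri must be a Cloud Storage path (gs://bucket/file)")
    return {"content": content} if content is not None else {"uri": uri}
```

With the helper in place, `speech.RecognitionAudio(**make_audio_kwargs(uri="gs://bucket/file.flac"))` constructs the audio specification safely.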

### SpeakerDiarizationConfig

Configuration for speaker diarization (identifying different speakers).

```python { .api }
class SpeakerDiarizationConfig:
    """Configuration for speaker diarization."""
    enable_speaker_diarization: bool
    min_speaker_count: int
    max_speaker_count: int
    speaker_tag: int
```

### SpeechContext

Provides hints to improve recognition accuracy.

```python { .api }
class SpeechContext:
    """Context hints for speech recognition."""
    phrases: Sequence[str]
    boost: float
    speech_adaptation: SpeechAdaptation
```

### RecognitionMetadata

Metadata about the recognition request for analytics and optimization.

```python { .api }
class RecognitionMetadata:
    """Metadata for recognition requests."""
    interaction_type: InteractionType
    industry_naics_code_of_audio: int
    microphone_distance: MicrophoneDistance
    original_media_type: OriginalMediaType
    recording_device_type: RecordingDeviceType
    recording_device_name: str
    original_mime_type: str
    audio_topic: str
```

## Result Types

### SpeechRecognitionResult

Container for recognition results.

```python { .api }
class SpeechRecognitionResult:
    """Container for speech recognition results."""
    alternatives: Sequence[SpeechRecognitionAlternative]
    channel_tag: int
    result_end_time: Duration
    language_code: str
```

### SpeechRecognitionAlternative

Individual recognition hypothesis with confidence score.

```python { .api }
class SpeechRecognitionAlternative:
    """Individual recognition alternative."""
    transcript: str
    confidence: float
    words: Sequence[WordInfo]
```

### WordInfo

Word-level information including timing and confidence.

```python { .api }
class WordInfo:
    """Word-level recognition information."""
    start_time: Duration
    end_time: Duration
    word: str
    confidence: float
    speaker_tag: int
    speaker_label: str
```

### SpeechAdaptationInfo

Information about applied speech adaptations.

```python { .api }
class SpeechAdaptationInfo:
    """Information about applied speech adaptations."""
    adaptation_timeout: bool
    timeout_message: str
```

## Enumeration Types

### AudioEncoding

Supported audio encoding formats.

```python { .api }
class AudioEncoding:
    """Audio encoding formats."""
    ENCODING_UNSPECIFIED = 0
    LINEAR16 = 1                # 16-bit linear PCM
    FLAC = 2                    # FLAC lossless
    MULAW = 3                   # 8-bit mu-law
    AMR = 4                     # AMR narrowband
    AMR_WB = 5                  # AMR wideband
    OGG_OPUS = 6                # Ogg Opus
    SPEEX_WITH_HEADER_BYTE = 7  # Speex with header
    MP3 = 8                     # MP3
    WEBM_OPUS = 9               # WebM Opus
```
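
For self-describing formats, the file extension is often enough to pick an encoding. A heuristic sketch (the helper and mapping are illustrative, not part of the library; container formats such as WAV carry the codec in the file header and must be inspected instead):

```python
# Heuristic mapping from file extension to an AudioEncoding member name.
_EXTENSION_TO_ENCODING = {
    ".flac": "FLAC",
    ".mp3": "MP3",
    ".opus": "OGG_OPUS",
    ".webm": "WEBM_OPUS",
    ".amr": "AMR",
    ".awb": "AMR_WB",
}

def guess_encoding(filename):
    """Return an AudioEncoding member name for a filename, or None if unknown."""
    for ext, encoding in _EXTENSION_TO_ENCODING.items():
        if filename.lower().endswith(ext):
            return encoding
    return None
```

A `None` result means the caller should fall back to inspecting the audio itself rather than guessing.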

### InteractionType

Types of user interactions for recognition optimization.

```python { .api }
class InteractionType:
    """Interaction types for recognition optimization."""
    INTERACTION_TYPE_UNSPECIFIED = 0
    DISCUSSION = 1               # Multi-participant discussion
    PRESENTATION = 2             # Single speaker presentation
    PHONE_CALL = 3               # Phone conversation
    VOICEMAIL = 4                # Voicemail message
    PROFESSIONALLY_PRODUCED = 5  # Professional audio content
    VOICE_SEARCH = 6             # Voice search queries
    VOICE_COMMAND = 7            # Voice commands
    DICTATION = 8                # Dictation use case
```

### MicrophoneDistance

Microphone distance from the audio source.

```python { .api }
class MicrophoneDistance:
    """Microphone distance categories."""
    MICROPHONE_DISTANCE_UNSPECIFIED = 0
    NEARFIELD = 1  # 0-1 meter from source
    MIDFIELD = 2   # 1-3 meters from source
    FARFIELD = 3   # 3+ meters from source
```
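
The documented ranges map directly to a simple chooser; a sketch under those ranges (the function name is illustrative):

```python
def microphone_distance_for(meters):
    """Pick a MicrophoneDistance member name for a source distance in meters,
    following the documented ranges (0-1 m, 1-3 m, 3+ m)."""
    if meters < 0:
        raise ValueError("distance must be non-negative")
    if meters <= 1:
        return "NEARFIELD"
    if meters <= 3:
        return "MIDFIELD"
    return "FARFIELD"
```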

### OriginalMediaType

Original media type of the audio.

```python { .api }
class OriginalMediaType:
    """Original media type categories."""
    ORIGINAL_MEDIA_TYPE_UNSPECIFIED = 0
    AUDIO = 1  # Audio-only content
    VIDEO = 2  # Video content with audio track
```

### RecordingDeviceType

Type of device used for recording.

```python { .api }
class RecordingDeviceType:
    """Recording device types."""
    RECORDING_DEVICE_TYPE_UNSPECIFIED = 0
    SMARTPHONE = 1            # Mobile phone
    PC = 2                    # Personal computer
    PHONE_LINE = 3            # Traditional phone line
    VEHICLE = 4               # In-vehicle system
    OTHER_OUTDOOR_DEVICE = 5  # Other outdoor recording
    OTHER_INDOOR_DEVICE = 6   # Other indoor recording
```

## Usage Examples

### Basic Configuration

```python
from google.cloud import speech

# Simple configuration for high-quality audio
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

# Audio from file content
with open("audio.flac", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)
```

### Advanced Configuration

```python
from google.cloud import speech

# Comprehensive configuration with all features
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    language_code="en-US",
    alternative_language_codes=["en-GB", "en-AU"],
    max_alternatives=3,
    profanity_filter=True,
    enable_word_time_offsets=True,
    enable_word_confidence=True,
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=6,
    ),
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.DISCUSSION,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        original_media_type=speech.RecognitionMetadata.OriginalMediaType.AUDIO,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE,
    ),
    speech_contexts=[
        speech.SpeechContext(
            phrases=["technical", "terminology", "API", "cloud computing"],
            boost=10.0,
        )
    ],
    use_enhanced=True,  # Use enhanced model
)

# Cloud Storage audio
audio = speech.RecognitionAudio(
    uri="gs://your-bucket/meeting-recording.wav"
)
```

### Processing Results

```python
# Process comprehensive results.
# Assumes an initialized client: client = speech.SpeechClient()
response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    print(f"Result {i + 1}:")

    # Process alternatives
    for j, alternative in enumerate(result.alternatives):
        print(f"  Alternative {j + 1} (confidence: {alternative.confidence:.2f}):")
        print(f"    Transcript: {alternative.transcript}")

        # Process word-level information
        if alternative.words:
            print("    Word details:")
            for word in alternative.words[:5]:  # Show first 5 words
                print(f"      '{word.word}': "
                      f"{word.start_time.total_seconds():.1f}s-"
                      f"{word.end_time.total_seconds():.1f}s "
                      f"(confidence: {word.confidence:.2f})")
                if word.speaker_tag:
                    print(f"      Speaker: {word.speaker_tag}")

# Access adaptation metadata
if response.speech_adaptation_info:
    if response.speech_adaptation_info.adaptation_timeout:
        print("Warning: Speech adaptation timed out")
```

## Configuration Best Practices

### Audio Quality Settings

```python
# Optimal settings for different audio sources

# Telephone audio (8 kHz mu-law)
phone_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
    sample_rate_hertz=8000,
    language_code="en-US",
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PHONE_CALL,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.PHONE_LINE,
    ),
)

# High-quality studio recording
studio_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,
    language_code="en-US",
    use_enhanced=True,
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PROFESSIONALLY_PRODUCED,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        original_media_type=speech.RecognitionMetadata.OriginalMediaType.AUDIO,
    ),
)

# Mobile app recording
mobile_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.VOICE_COMMAND,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE,
    ),
)
```

### Language Configuration

```python
# Multi-language support
multilingual_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # Primary language
    alternative_language_codes=[
        "es-ES",  # Spanish
        "fr-FR",  # French
        "de-DE",  # German
    ],
    max_alternatives=2,  # Get alternatives for uncertain regions
)
```

### Performance Optimization

```python
# Optimized for speed vs. accuracy trade-offs
fast_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,              # Single alternative
    enable_word_time_offsets=False,  # Skip word timing
    enable_word_confidence=False,    # Skip word confidence
    # Keep automatic punctuation for readability
    enable_automatic_punctuation=True,
)

# Optimized for maximum accuracy
accurate_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,
    language_code="en-US",
    use_enhanced=True,              # Enhanced model
    max_alternatives=3,             # Multiple alternatives
    enable_word_time_offsets=True,  # Word-level timing
    enable_word_confidence=True,    # Word-level confidence
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=1,
        max_speaker_count=10,
    ),
)
```

## Common Data Patterns

### Duration and Timestamp Handling

```python
# start_time and end_time are surfaced as datetime.timedelta values,
# so standard timedelta methods such as total_seconds() apply.
for word in alternative.words:
    # Convert to seconds
    start_seconds = word.start_time.total_seconds()
    end_seconds = word.end_time.total_seconds()
    duration = end_seconds - start_seconds

    print(f"Word '{word.word}': {start_seconds:.2f}s - {end_seconds:.2f}s ({duration:.2f}s)")
```

### Error Handling with Type Information

```python
from google.api_core import exceptions
from google.cloud import speech

try:
    response = client.recognize(config=config, audio=audio)

    # Check for empty results
    if not response.results:
        print("No speech detected in audio")

    # Validate result structure
    for result in response.results:
        if not result.alternatives:
            print("No alternatives found for this result")
            continue

        best_alternative = result.alternatives[0]
        if best_alternative.confidence < 0.5:
            print(f"Low confidence result: {best_alternative.confidence}")

except exceptions.InvalidArgument as e:
    print(f"Invalid configuration: {e}")
except exceptions.OutOfRange as e:
    print(f"Audio too long or other limit exceeded: {e}")
except exceptions.DeadlineExceeded as e:
    print(f"Request timed out: {e}")
```