# Types and Configuration

Core data types, configuration objects, and enums for speech recognition setup and result processing across all API versions.

## Core Configuration Types

### RecognitionConfig

Main configuration object for speech recognition requests.

```python { .api }
class RecognitionConfig:
    """Configuration for speech recognition."""
    encoding: AudioEncoding
    sample_rate_hertz: int
    audio_channel_count: int
    enable_separate_recognition_per_channel: bool
    language_code: str
    alternative_language_codes: Sequence[str]
    max_alternatives: int
    profanity_filter: bool
    speech_contexts: Sequence[SpeechContext]
    enable_word_time_offsets: bool
    enable_word_confidence: bool
    enable_automatic_punctuation: bool
    enable_spoken_punctuation: bool
    enable_spoken_emojis: bool
    enable_speaker_diarization: bool
    diarization_config: SpeakerDiarizationConfig
    metadata: RecognitionMetadata
    model: str
    use_enhanced: bool
    adaptation: SpeechAdaptation
    transcript_normalization: TranscriptNormalization
    enable_voice_activity_events: bool
```

### RecognitionAudio

Specifies the audio input for recognition. Set exactly one of `content` or `uri`.

```python { .api }
class RecognitionAudio:
    """Audio input specification."""
    content: bytes  # Raw audio bytes
    uri: str        # Cloud Storage URI (gs://bucket/file)
```
47
48
### SpeakerDiarizationConfig
49
50
Configuration for speaker diarization (identifying different speakers).
51
52
```python { .api }
53
class SpeakerDiarizationConfig:
54
"""Configuration for speaker diarization."""
55
enable_speaker_diarization: bool
56
min_speaker_count: int
57
max_speaker_count: int
58
speaker_tag: int
59
```
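
Diarization results arrive as a flat word stream in which each `WordInfo` carries a `speaker_tag`. A small helper can fold that stream into per-speaker segments. The sketch below uses `SimpleNamespace` stand-ins in place of real `WordInfo` messages, and `group_by_speaker` is an illustrative name, not part of the library:

```python
from itertools import groupby
from types import SimpleNamespace

def group_by_speaker(words):
    """Fold a flat word sequence into (speaker_tag, text) segments.

    Works on any objects exposing `.word` and `.speaker_tag`, such as
    the WordInfo messages from the final result of a diarization request.
    """
    return [
        (tag, " ".join(w.word for w in run))
        for tag, run in groupby(words, key=lambda w: w.speaker_tag)
    ]

# Stand-in words mimicking WordInfo, for demonstration only
words = [
    SimpleNamespace(word="hello", speaker_tag=1),
    SimpleNamespace(word="there", speaker_tag=1),
    SimpleNamespace(word="hi", speaker_tag=2),
]
print(group_by_speaker(words))  # [(1, 'hello there'), (2, 'hi')]
```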

### SpeechContext

Provides hints to improve recognition accuracy.

```python { .api }
class SpeechContext:
    """Context hints for speech recognition."""
    phrases: Sequence[str]
    boost: float
```

### RecognitionMetadata

Metadata about the recognition request for analytics and optimization.

```python { .api }
class RecognitionMetadata:
    """Metadata for recognition requests."""
    interaction_type: InteractionType
    industry_naics_code_of_audio: int
    microphone_distance: MicrophoneDistance
    original_media_type: OriginalMediaType
    recording_device_type: RecordingDeviceType
    recording_device_name: str
    original_mime_type: str
    audio_topic: str
```

## Result Types

### SpeechRecognitionResult

Container for recognition results.

```python { .api }
class SpeechRecognitionResult:
    """Container for speech recognition results."""
    alternatives: Sequence[SpeechRecognitionAlternative]
    channel_tag: int
    result_end_time: Duration
    language_code: str
```

### SpeechRecognitionAlternative

Individual recognition hypothesis with confidence score.

```python { .api }
class SpeechRecognitionAlternative:
    """Individual recognition alternative."""
    transcript: str
    confidence: float
    words: Sequence[WordInfo]
```
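
Because alternatives are returned ordered best-first, a common pattern is to take the first one that clears a confidence threshold. The helper below is a sketch; `best_transcript` and the `SimpleNamespace` stand-in are illustrative, not library API:

```python
from types import SimpleNamespace

def best_transcript(result, min_confidence=0.0):
    """Return the first transcript meeting the confidence threshold.

    Alternatives are ordered best-first, so this yields the top
    hypothesis that clears `min_confidence`, or None if none does.
    Works on any object exposing `.alternatives` whose items have
    `.transcript` and `.confidence`, such as SpeechRecognitionResult.
    """
    for alt in result.alternatives:
        if alt.confidence >= min_confidence:
            return alt.transcript
    return None

# Stand-in result mimicking SpeechRecognitionResult
result = SimpleNamespace(alternatives=[
    SimpleNamespace(transcript="hello world", confidence=0.92),
    SimpleNamespace(transcript="hollow world", confidence=0.41),
])
print(best_transcript(result, min_confidence=0.5))  # hello world
```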

### WordInfo

Word-level information including timing and confidence.

```python { .api }
class WordInfo:
    """Word-level recognition information."""
    start_time: Duration
    end_time: Duration
    word: str
    confidence: float
    speaker_tag: int
    speaker_label: str
```

### SpeechAdaptationInfo

Information about applied speech adaptations.

```python { .api }
class SpeechAdaptationInfo:
    """Information about applied speech adaptations."""
    adaptation_timeout: bool
    timeout_message: str
```

## Enumeration Types

### AudioEncoding

Supported audio encoding formats.

```python { .api }
class AudioEncoding:
    """Audio encoding formats."""
    ENCODING_UNSPECIFIED = 0
    LINEAR16 = 1                # 16-bit linear PCM
    FLAC = 2                    # FLAC lossless
    MULAW = 3                   # 8-bit mu-law
    AMR = 4                     # AMR narrowband
    AMR_WB = 5                  # AMR wideband
    OGG_OPUS = 6                # Ogg Opus
    SPEEX_WITH_HEADER_BYTE = 7  # Speex with header
    MP3 = 8                     # MP3
    WEBM_OPUS = 9               # WebM Opus
```
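
When the encoding is not known up front, a file's extension can suggest a starting point. The mapping below is a rough, illustrative convention, not an official table; extensions are not authoritative (a `.wav` container, for instance, can wrap several codecs), so verify the actual format of your audio:

```python
import os

# Illustrative extension-to-encoding-name mapping; verify against your
# actual audio, since extensions do not guarantee the codec inside.
EXTENSION_TO_ENCODING = {
    ".flac": "FLAC",
    ".wav": "LINEAR16",   # assumes 16-bit PCM WAV
    ".mp3": "MP3",
    ".opus": "OGG_OPUS",
    ".webm": "WEBM_OPUS",
    ".amr": "AMR",
    ".awb": "AMR_WB",
}

def guess_encoding(filename):
    """Guess an AudioEncoding enum *name* from a filename extension."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_ENCODING.get(ext, "ENCODING_UNSPECIFIED")

print(guess_encoding("meeting.flac"))  # FLAC
```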

### InteractionType

Types of user interactions for recognition optimization.

```python { .api }
class InteractionType:
    """Interaction types for recognition optimization."""
    INTERACTION_TYPE_UNSPECIFIED = 0
    DISCUSSION = 1                # Multi-participant discussion
    PRESENTATION = 2              # Single speaker presentation
    PHONE_CALL = 3                # Phone conversation
    VOICEMAIL = 4                 # Voicemail message
    PROFESSIONALLY_PRODUCED = 5   # Professional audio content
    VOICE_SEARCH = 6              # Voice search queries
    VOICE_COMMAND = 7             # Voice commands
    DICTATION = 8                 # Dictation use case
```

### MicrophoneDistance

Microphone distance from the audio source.

```python { .api }
class MicrophoneDistance:
    """Microphone distance categories."""
    MICROPHONE_DISTANCE_UNSPECIFIED = 0
    NEARFIELD = 1  # 0-1 meter from source
    MIDFIELD = 2   # 1-3 meters from source
    FARFIELD = 3   # 3+ meters from source
```

### OriginalMediaType

Original media type of the audio.

```python { .api }
class OriginalMediaType:
    """Original media type categories."""
    ORIGINAL_MEDIA_TYPE_UNSPECIFIED = 0
    AUDIO = 1  # Audio-only content
    VIDEO = 2  # Video content with audio track
```

### RecordingDeviceType

Type of device used for recording.

```python { .api }
class RecordingDeviceType:
    """Recording device types."""
    RECORDING_DEVICE_TYPE_UNSPECIFIED = 0
    SMARTPHONE = 1            # Mobile phone
    PC = 2                    # Personal computer
    PHONE_LINE = 3            # Traditional phone line
    VEHICLE = 4               # In-vehicle system
    OTHER_OUTDOOR_DEVICE = 5  # Other outdoor recording
    OTHER_INDOOR_DEVICE = 6   # Other indoor recording
```

## Usage Examples

### Basic Configuration

```python
from google.cloud import speech

# Simple configuration for high-quality audio
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

# Audio from file content
with open("audio.flac", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)
```

### Advanced Configuration

```python
from google.cloud import speech

# Comprehensive configuration with all features
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    language_code="en-US",
    alternative_language_codes=["en-GB", "en-AU"],
    max_alternatives=3,
    profanity_filter=True,
    enable_word_time_offsets=True,
    enable_word_confidence=True,
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=6,
    ),
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.DISCUSSION,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        original_media_type=speech.RecognitionMetadata.OriginalMediaType.AUDIO,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE,
    ),
    speech_contexts=[
        speech.SpeechContext(
            phrases=["technical", "terminology", "API", "cloud computing"],
            boost=10.0,
        )
    ],
    use_enhanced=True,  # Use enhanced model
)

# Cloud Storage audio
audio = speech.RecognitionAudio(
    uri="gs://your-bucket/meeting-recording.wav"
)
```

### Processing Results

```python
from google.cloud import speech

client = speech.SpeechClient()

# Process comprehensive results
response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    print(f"Result {i + 1}:")

    # Process alternatives
    for j, alternative in enumerate(result.alternatives):
        print(f"  Alternative {j + 1} (confidence: {alternative.confidence:.2f}):")
        print(f"    Transcript: {alternative.transcript}")

        # Process word-level information
        if alternative.words:
            print("    Word details:")
            for word in alternative.words[:5]:  # Show first 5 words
                print(f"      '{word.word}': "
                      f"{word.start_time.total_seconds():.1f}s-"
                      f"{word.end_time.total_seconds():.1f}s "
                      f"(confidence: {word.confidence:.2f})")
                if word.speaker_tag:
                    print(f"        Speaker: {word.speaker_tag}")

# Access metadata
if response.speech_adaptation_info:
    if response.speech_adaptation_info.adaptation_timeout:
        print("Warning: Speech adaptation timed out")
```

## Configuration Best Practices

### Audio Quality Settings

```python
from google.cloud import speech

# Optimal settings for different audio sources
phone_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
    sample_rate_hertz=8000,
    language_code="en-US",
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PHONE_CALL,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.PHONE_LINE,
    ),
)

# High-quality studio recording
studio_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,
    language_code="en-US",
    use_enhanced=True,
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PROFESSIONALLY_PRODUCED,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        original_media_type=speech.RecognitionMetadata.OriginalMediaType.AUDIO,
    ),
)

# Mobile app recording
mobile_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.VOICE_COMMAND,
        microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
        recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE,
    ),
)
```

### Language Configuration

```python
from google.cloud import speech

# Multi-language support
multilingual_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # Primary language
    alternative_language_codes=[
        "es-ES",  # Spanish
        "fr-FR",  # French
        "de-DE",  # German
    ],
    max_alternatives=2,  # Get alternatives for uncertain regions
)
```
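
When alternative language codes are supplied, each result's `language_code` field reports the language the service actually chose. A quick way to see the spread across a transcript, sketched here with `SimpleNamespace` stand-ins rather than real result messages:

```python
from collections import Counter
from types import SimpleNamespace

def language_breakdown(results):
    """Count how many results were recognized in each detected language.

    Works on any iterable of objects exposing `.language_code`, such as
    the results of a request that set alternative_language_codes.
    """
    return Counter(r.language_code for r in results)

# Stand-in results mimicking SpeechRecognitionResult
results = [
    SimpleNamespace(language_code="en-us"),
    SimpleNamespace(language_code="es-es"),
    SimpleNamespace(language_code="en-us"),
]
print(language_breakdown(results))  # Counter({'en-us': 2, 'es-es': 1})
```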

### Performance Optimization

```python
from google.cloud import speech

# Optimized for speed vs accuracy trade-offs
fast_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,              # Single alternative
    enable_word_time_offsets=False,  # Skip word timing
    enable_word_confidence=False,    # Skip word confidence
    # Keep automatic punctuation for readability
    enable_automatic_punctuation=True,
)

# Optimized for maximum accuracy
accurate_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,
    language_code="en-US",
    use_enhanced=True,              # Enhanced model
    max_alternatives=3,             # Multiple alternatives
    enable_word_time_offsets=True,  # Word-level timing
    enable_word_confidence=True,    # Word-level confidence
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=1,
        max_speaker_count=10,
    ),
)
```

## Common Data Patterns

### Duration and Timestamp Handling

```python
# Recent google-cloud-speech releases surface word timings as
# datetime.timedelta objects, so total_seconds() is available directly.
for word in alternative.words:
    # Convert to seconds
    start_seconds = word.start_time.total_seconds()
    end_seconds = word.end_time.total_seconds()
    duration = end_seconds - start_seconds

    print(f"Word '{word.word}': {start_seconds:.2f}s - {end_seconds:.2f}s ({duration:.2f}s)")
```
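
Because word timings behave like `datetime.timedelta` values, they convert cleanly to subtitle timestamps. Below is a minimal SRT formatter; an illustrative sketch, not part of the library:

```python
from datetime import timedelta

def to_srt_timestamp(td):
    """Format a timedelta (e.g. a word's start_time or end_time) as an
    SRT subtitle timestamp: HH:MM:SS,mmm."""
    total_ms = int(td.total_seconds() * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

print(to_srt_timestamp(timedelta(seconds=61.25)))  # 00:01:01,250
```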

### Error Handling with Type Information

```python
from google.api_core import exceptions
from google.cloud import speech

try:
    response = client.recognize(config=config, audio=audio)

    # Check for empty results
    if not response.results:
        print("No speech detected in audio")

    # Validate result structure
    for result in response.results:
        if not result.alternatives:
            print("No alternatives found for this result")
            continue

        best_alternative = result.alternatives[0]
        if best_alternative.confidence < 0.5:
            print(f"Low confidence result: {best_alternative.confidence}")

except exceptions.InvalidArgument as e:
    print(f"Invalid configuration: {e}")
except exceptions.OutOfRange as e:
    print(f"Audio too long or other limit exceeded: {e}")
except exceptions.DeadlineExceeded as e:
    print(f"Request timed out: {e}")
```