# Specialized Models

Domain-specific model classes for speech recognition and audio processing tasks. CTranslate2 provides optimized implementations for Whisper (speech-to-text), Wav2Vec2 (speech representation learning), and Wav2Vec2Bert (enhanced speech processing) with the same performance optimizations as the core inference classes.
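
These classes load models in the CTranslate2 format, so a source checkpoint must be converted first. A minimal conversion sketch, assuming the `transformers` package is installed and using `openai/whisper-tiny` purely as an example checkpoint:

```python
import ctranslate2

# Convert a Hugging Face checkpoint into a CTranslate2 model directory;
# the output directory is what the classes below take as `model_path`.
# Quantizing at conversion time is optional.
converter = ctranslate2.converters.TransformersConverter("openai/whisper-tiny")
converter.convert("whisper-tiny-ct2", quantization="int8")
```
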
## Capabilities

### Whisper Speech Recognition

The `Whisper` class provides optimized inference for OpenAI's Whisper automatic speech recognition models, supporting transcription, translation, and language detection.

```python { .api }
class Whisper:
    def __init__(self, model_path: str, device: str = "auto",
                 device_index: int = 0, compute_type: str = "default",
                 inter_threads: int = 1, intra_threads: int = 0,
                 max_queued_batches: int = 0, files: dict = None):
        """
        Initialize a Whisper model for speech recognition.

        Args:
            model_path (str): Path to the CTranslate2 Whisper model directory
            device (str): Device to run on ("cpu", "cuda", "auto")
            device_index (int): Device index for multi-GPU setups
            compute_type (str): Computation precision ("default", "float32", "float16", "int8")
            inter_threads (int): Number of inter-op threads
            intra_threads (int): Number of intra-op threads (0 for auto)
            max_queued_batches (int): Maximum number of batches in the queue
            files (dict): Mapping of additional model file names to file objects
        """

    def transcribe(self, features: list, language: str = None,
                   task: str = "transcribe", beam_size: int = 5,
                   patience: float = 1.0, length_penalty: float = 1.0,
                   repetition_penalty: float = 1.0, no_repeat_ngram_size: int = 0,
                   temperature: float = 1.0, compression_ratio_threshold: float = 2.4,
                   log_prob_threshold: float = -1.0, no_speech_threshold: float = 0.6,
                   condition_on_previous_text: bool = True,
                   prompt_reset_on_temperature: float = 0.5,
                   initial_prompt: str = None, prefix: str = None,
                   suppress_blank: bool = True, suppress_tokens: list = None,
                   without_timestamps: bool = False, max_initial_timestamp: float = 1.0,
                   word_timestamps: bool = False, prepend_punctuations: str = "\"'“¿([{-",
                   append_punctuations: str = "\"'.。,,!!??::”)]}、",
                   vad_filter: bool = False, vad_parameters: dict = None,
                   max_new_tokens: int = None, clip_timestamps: list = None,
                   hallucination_silence_threshold: float = None,
                   hotwords: str = None, language_detection_threshold: float = None,
                   language_detection_segments: int = 1, **kwargs) -> list:
        """
        Transcribe audio features to text.

        Args:
            features (list): List of mel-spectrogram features
            language (str): Language code (e.g., "en", "fr", "de"); detected if omitted
            task (str): Task type ("transcribe" or "translate")
            beam_size (int): Beam search size
            patience (float): Beam search patience factor
            length_penalty (float): Length penalty applied during beam search
            repetition_penalty (float): Penalty applied to repeated tokens
            no_repeat_ngram_size (int): Prevent repetition of n-grams of this size
            temperature (float): Sampling temperature
            compression_ratio_threshold (float): Treat a segment as failed if its gzip compression ratio exceeds this value
            log_prob_threshold (float): Treat a segment as failed if its average token log probability falls below this value
            no_speech_threshold (float): Consider a segment silent if the no-speech probability exceeds this value
            condition_on_previous_text (bool): Use previously generated text as context for the next window
            prompt_reset_on_temperature (float): Reset the prompt if a segment was sampled above this temperature
            initial_prompt (str): Initial prompt text for the first window
            prefix (str): Prefix forced at the start of the generated text
            suppress_blank (bool): Suppress blank outputs at the start of sampling
            suppress_tokens (list): List of token IDs to suppress
            without_timestamps (bool): Generate text without timestamp tokens
            max_initial_timestamp (float): Maximum allowed value for the first timestamp
            word_timestamps (bool): Generate word-level timestamps
            prepend_punctuations (str): Punctuation attached to the following word in word timestamps
            append_punctuations (str): Punctuation attached to the preceding word in word timestamps
            vad_filter (bool): Apply a voice activity detection filter before transcription
            vad_parameters (dict): VAD configuration parameters
            max_new_tokens (int): Maximum number of new tokens to generate per segment
            clip_timestamps (list): Start/end timestamps (in seconds) restricting which audio ranges are processed
            hallucination_silence_threshold (float): Skip silences longer than this many seconds when a possible hallucination is detected (requires word_timestamps)
            hotwords (str): Hotword phrases used to bias generation
            language_detection_threshold (float): Probability threshold for accepting a detected language
            language_detection_segments (int): Number of segments used for language detection

        Returns:
            list: List of WhisperGenerationResult objects
        """

    def detect_language(self, features: list, **kwargs) -> list:
        """
        Detect the spoken language from audio features.

        Args:
            features (list): List of mel-spectrogram features
            **kwargs: Additional detection parameters

        Returns:
            list: For each input, a list of (language, probability) pairs sorted by decreasing probability
        """

    def generate(self, features: list, prompts: list = None, **kwargs) -> list:
        """
        Generate text from audio features with optional decoder prompts.

        Args:
            features (list): List of mel-spectrogram features
            prompts (list): List of decoder prompt token sequences (e.g., Whisper special tokens)
            **kwargs: Additional generation parameters

        Returns:
            list: List of generation results
        """
```
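
Depending on the CTranslate2 version, `features` must be wrapped in a `ctranslate2.StorageView` rather than passed as plain arrays. A minimal feature-preparation sketch following the upstream CTranslate2 Whisper example, assuming `transformers` and `librosa` are installed and `audio.wav` is a placeholder audio file:

```python
import ctranslate2
import librosa
import transformers

# Compute log-mel features with the processor matching the checkpoint
# (whisper-tiny uses 80 mel bins; large-v3 expects 128).
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)
inputs = processor(audio, return_tensors="np", sampling_rate=16000)

# Wrap the (batch, n_mels, frames) array for the CTranslate2 model.
features = ctranslate2.StorageView.from_array(inputs.input_features)
```
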
### Wav2Vec2 Speech Representation

The `Wav2Vec2` class provides inference for Facebook's Wav2Vec2 models for speech representation learning and feature extraction.

```python { .api }
class Wav2Vec2:
    def __init__(self, model_path: str, device: str = "auto",
                 device_index: int = 0, compute_type: str = "default",
                 inter_threads: int = 1, intra_threads: int = 0,
                 max_queued_batches: int = 0, files: dict = None):
        """
        Initialize a Wav2Vec2 model for speech processing.

        Args:
            model_path (str): Path to the CTranslate2 Wav2Vec2 model directory
            device (str): Device to run on ("cpu", "cuda", "auto")
            device_index (int): Device index for multi-GPU setups
            compute_type (str): Computation precision ("default", "float32", "float16", "int8")
            inter_threads (int): Number of inter-op threads
            intra_threads (int): Number of intra-op threads (0 for auto)
            max_queued_batches (int): Maximum number of batches in the queue
            files (dict): Mapping of additional model file names to file objects
        """

    def encode(self, features: list, normalize: bool = False,
               return_hidden: bool = False, **kwargs) -> list:
        """
        Encode audio features with Wav2Vec2.

        Args:
            features (list): List of raw audio waveforms or precomputed features
            normalize (bool): Whether to normalize the output representations
            return_hidden (bool): Whether to also return hidden states
            **kwargs: Additional encoding parameters

        Returns:
            list: List of encoded representations
        """

    def forward_batch(self, inputs: list, **kwargs) -> list:
        """
        Run a forward pass on a batch of audio inputs.

        Args:
            inputs (list): List of audio input sequences
            **kwargs: Additional forward pass parameters

        Returns:
            list: List of forward pass outputs
        """
```
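
Wav2Vec2 checkpoints are trained on 16 kHz audio that is usually normalized per utterance. A minimal preprocessing sketch, assuming NumPy input; `transformers`' `Wav2Vec2Processor` performs the same zero-mean/unit-variance step and can be used instead:

```python
import numpy as np

def normalize_waveform(waveform: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of a mono 16 kHz waveform."""
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-7)
```
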
### Wav2Vec2Bert Enhanced Speech Processing

The `Wav2Vec2Bert` class provides inference for the enhanced Wav2Vec2-BERT models that combine speech representation learning with BERT-style pretraining.

```python { .api }
class Wav2Vec2Bert:
    def __init__(self, model_path: str, device: str = "auto",
                 device_index: int = 0, compute_type: str = "default",
                 inter_threads: int = 1, intra_threads: int = 0,
                 max_queued_batches: int = 0, files: dict = None):
        """
        Initialize a Wav2Vec2Bert model for enhanced speech processing.

        Args:
            model_path (str): Path to the CTranslate2 Wav2Vec2Bert model directory
            device (str): Device to run on ("cpu", "cuda", "auto")
            device_index (int): Device index for multi-GPU setups
            compute_type (str): Computation precision ("default", "float32", "float16", "int8")
            inter_threads (int): Number of inter-op threads
            intra_threads (int): Number of intra-op threads (0 for auto)
            max_queued_batches (int): Maximum number of batches in the queue
            files (dict): Mapping of additional model file names to file objects
        """

    def encode(self, features: list, normalize: bool = False,
               return_hidden: bool = False, **kwargs) -> list:
        """
        Encode audio features with Wav2Vec2Bert.

        Args:
            features (list): List of raw audio waveforms or precomputed features
            normalize (bool): Whether to normalize the output representations
            return_hidden (bool): Whether to also return hidden states
            **kwargs: Additional encoding parameters

        Returns:
            list: List of encoded representations
        """

    def forward_batch(self, inputs: list, **kwargs) -> list:
        """
        Run a forward pass on a batch of audio inputs.

        Args:
            inputs (list): List of audio input sequences
            **kwargs: Additional forward pass parameters

        Returns:
            list: List of forward pass outputs
        """
```

## Usage Examples

### Whisper Speech-to-Text

```python
import ctranslate2

# Load the converted Whisper model
whisper = ctranslate2.models.Whisper("path/to/whisper_ct2_model", device="cpu")

# Prepare audio features: each entry is a mel-spectrogram with shape
# (80, time_steps); mel_spectrogram_1 and mel_spectrogram_2 are placeholders.
audio_features = [mel_spectrogram_1, mel_spectrogram_2]  # list of numpy arrays

# Transcribe audio
results = whisper.transcribe(audio_features, language="en", task="transcribe")

for result in results:
    print("Transcription:", result.sequences[0])
    if getattr(result, "timestamps", None):
        print("Timestamps:", result.timestamps)

# Transcribe with word-level timestamps
results = whisper.transcribe(
    audio_features,
    language="en",
    word_timestamps=True,
    without_timestamps=False,
)

for result in results:
    print("Text:", result.sequences[0])
    for word_info in result.word_timestamps:
        print(f"Word: {word_info['word']}, "
              f"Start: {word_info['start']:.2f}s, End: {word_info['end']:.2f}s")
```

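The upstream CTranslate2 documentation also exposes a lower-level path through `generate()`, where the decoder prompt is built from Whisper special tokens. A sketch reusing the `processor` and `features` objects from the feature-preparation example above; note that the `sequences_ids` field can depend on the installed version:

```python
# Build the decoder prompt from Whisper special tokens.
prompt = processor.tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
)

results = whisper.generate(features, [prompt])

# sequences_ids holds generated token IDs; decode them back to text.
transcription = processor.decode(results[0].sequences_ids[0])
print("Transcription:", transcription)
```
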
### Language Detection

```python
import ctranslate2

whisper = ctranslate2.models.Whisper("path/to/whisper_ct2_model")

# Detect the language: for each input, detect_language returns
# (language, probability) pairs sorted by decreasing probability.
results = whisper.detect_language(audio_features)

for pairs in results:
    detected_language, confidence = pairs[0]
    print(f"Detected language: {detected_language} (confidence: {confidence:.3f})")
```

### Translation Task

```python
import ctranslate2

whisper = ctranslate2.models.Whisper("path/to/whisper_ct2_model")

# Translate foreign speech to English (Whisper only translates into English)
results = whisper.transcribe(
    audio_features,
    task="translate",  # translate to English
    language="fr",     # source language is French
)

for result in results:
    print("English translation:", result.sequences[0])
```

### Wav2Vec2 Feature Extraction

```python
import ctranslate2
import numpy as np

# Load the converted Wav2Vec2 model
wav2vec2 = ctranslate2.models.Wav2Vec2("path/to/wav2vec2_ct2_model", device="cpu")

# Prepare raw audio: 16 kHz mono waveforms as numpy arrays
# (waveform_1 and waveform_2 are placeholders).
audio_waveforms = [waveform_1, waveform_2]

# Extract speech representations
representations = wav2vec2.encode(audio_waveforms, normalize=True)

for rep in representations:
    print("Representation shape:", rep.shape)
    # Use the representations for downstream tasks such as speaker
    # recognition or emotion detection, or as features for other models.
```

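For checkpoints fine-tuned for ASR with a CTC head, the frame-level outputs are scores over a CTC vocabulary and can be decoded greedily. A minimal sketch; `id_to_token` is an illustrative mapping (e.g., taken from the checkpoint's tokenizer), not part of CTranslate2:

```python
import numpy as np

def greedy_ctc_decode(logits: np.ndarray, id_to_token: dict, blank_id: int = 0) -> str:
    """Collapse repeats and drop blanks from frame-wise argmax predictions."""
    ids = logits.argmax(axis=-1).tolist()
    output, previous = [], None
    for token_id in ids:
        if token_id != previous and token_id != blank_id:
            output.append(id_to_token[token_id])
        previous = token_id
    return "".join(output)
```
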
### Wav2Vec2Bert Processing

```python
import ctranslate2

# Load the converted Wav2Vec2Bert model
wav2vec2bert = ctranslate2.models.Wav2Vec2Bert("path/to/wav2vec2bert_ct2_model")

# Extract enhanced representations (audio_waveforms as in the previous example)
enhanced_representations = wav2vec2bert.encode(
    audio_waveforms,
    normalize=True,
    return_hidden=True,
)

for rep in enhanced_representations:
    print("Enhanced representation shape:", rep.shape)
    # These representations combine speech and language understanding.
```

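The representations are frame-level; utterance-level tasks usually pool them first. A short sketch (mean pooling plus cosine similarity, assuming the outputs convert to NumPy arrays):

```python
import numpy as np

# Mean-pool frame-level features into a single utterance embedding.
frames = np.asarray(enhanced_representations[0])  # shape (time, dim)
embedding = frames.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two utterance embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```
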
### Batch Processing for Efficiency

```python
import ctranslate2

whisper = ctranslate2.models.Whisper("path/to/whisper_ct2_model", device="cuda")

# Process multiple audio files in one call
# (features_1 ... features_4 are placeholder mel-spectrograms)
batch_features = [features_1, features_2, features_3, features_4]

# Batch transcription
batch_results = whisper.transcribe(
    batch_features,
    language="en",
    beam_size=5,
    temperature=0.0,  # deterministic output
)

for i, result in enumerate(batch_results):
    print(f"Audio {i + 1}: {result.sequences[0]}")
```

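Whisper models consume fixed 30-second windows, so longer recordings are chunked before feature extraction. A simple sketch that splits at fixed boundaries (no overlap or silence-aware segmentation):

```python
import numpy as np

def chunk_audio(waveform: np.ndarray, sample_rate: int = 16000,
                window_seconds: int = 30) -> np.ndarray:
    """Split a 1-D waveform into fixed windows, zero-padding the last one."""
    window = sample_rate * window_seconds
    n_chunks = int(np.ceil(len(waveform) / window))
    padded = np.pad(waveform, (0, n_chunks * window - len(waveform)))
    return padded.reshape(n_chunks, window)
```
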
### Advanced Whisper Configuration

```python
import ctranslate2

whisper = ctranslate2.models.Whisper("path/to/whisper_ct2_model")

# Advanced transcription with custom parameters
results = whisper.transcribe(
    audio_features,
    language="en",
    task="transcribe",
    beam_size=10,
    temperature=0.2,
    compression_ratio_threshold=2.4,
    log_prob_threshold=-1.0,
    no_speech_threshold=0.6,
    condition_on_previous_text=True,
    initial_prompt="This is a technical presentation about machine learning.",
    suppress_tokens=[50256, 50257],  # suppress specific token IDs
    word_timestamps=True,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 30,
    },
)
```

## Types

```python { .api }
class WhisperGenerationResult:
    """Result of a Whisper transcription/translation."""
    sequences: list[list[str]]      # Generated text sequences
    sequences_ids: list[list[int]]  # Generated token IDs
    scores: list[float]             # Generation scores
    language: str                   # Detected/specified language
    language_probability: float     # Language detection confidence
    timestamps: list[dict]          # Segment-level timestamps
    word_timestamps: list[dict]     # Word-level timestamps (if requested)
    avg_logprob: float              # Average log probability
    compression_ratio: float        # Compression ratio metric
    no_speech_prob: float           # No-speech probability

class WhisperGenerationResultAsync:
    """Future-like wrapper around an asynchronous Whisper result."""
    def result(self) -> WhisperGenerationResult: ...
    def done(self) -> bool: ...

# Whisper-specific result structures
class WhisperTimestamp:
    """Word or segment timestamp information."""
    start: float        # Start time in seconds
    end: float          # End time in seconds
    word: str           # Word text (for word timestamps)
    probability: float  # Confidence score

class WhisperSegment:
    """Transcription segment with metadata."""
    text: str                 # Segment text
    start: float              # Start time in seconds
    end: float                # End time in seconds
    tokens: list[int]         # Token IDs
    temperature: float        # Generation temperature used
    avg_logprob: float        # Average log probability
    compression_ratio: float  # Compression ratio
    no_speech_prob: float     # No-speech probability

# Wav2Vec2 result types
class Wav2Vec2Output:
    """Output of Wav2Vec2 encoding."""
    representations: StorageView  # Learned speech representations
    hidden_states: list           # Hidden states (if requested)
    attention_weights: list       # Attention weights (if available)

class Wav2Vec2BertOutput:
    """Output of Wav2Vec2Bert encoding."""
    representations: StorageView  # Enhanced speech representations
    hidden_states: list           # Hidden states from all layers
    attention_weights: list       # Attention weights
    adapter_outputs: list         # Adapter layer outputs
```
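
Since the async wrapper behaves like a `concurrent.futures.Future`, results can be polled with `done()` and collected with `result()`. A sketch assuming the installed version's `generate()` accepts an `asynchronous` flag, reusing `whisper`, `features`, and `prompt` from the earlier examples:

```python
# Submit generation without blocking; each entry is a future-like object.
async_results = whisper.generate(features, [prompt], asynchronous=True)

for async_result in async_results:
    print("finished already?", async_result.done())
    result = async_result.result()  # blocks until this input is done
    print(result.sequences_ids[0])
```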