# Faster Whisper

A reimplementation of OpenAI's Whisper automatic speech recognition model on top of CTranslate2, a fast inference engine for Transformer models. Faster Whisper transcribes up to 4x faster than the original openai/whisper implementation at the same accuracy while using less memory, and supports reduced-precision computation (FP16, INT8) on both CPU and GPU.
## Package Information

- **Package Name**: faster-whisper
- **Language**: Python
- **Installation**: `pip install faster-whisper`
- **Requirements**: Python 3.9+

## Core Imports

```python
from faster_whisper import WhisperModel
```

Common additional imports:

```python
from faster_whisper import (
    WhisperModel,
    BatchedInferencePipeline,
    decode_audio,
    available_models,
    download_model,
    format_timestamp,
)
```

## Basic Usage
31
32
```python
33
from faster_whisper import WhisperModel
34
35
# Initialize model
36
model = WhisperModel("base", device="cpu", compute_type="int8")
37
38
# Transcribe audio file
39
segments, info = model.transcribe("audio.mp3", beam_size=5)
40
41
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
42
43
# Process transcription segments
44
for segment in segments:
45
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
46
```
47
48
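
Note that `transcribe` returns the segments as a lazy generator: the call itself returns almost immediately, and the actual decoding happens while you iterate. If you need the full result up front, wrap the generator in `list(...)`. A stdlib-only illustration of the pattern (`fake_transcribe` is a stand-in, not a library function):

```python
def fake_transcribe(chunks):
    """Stand-in for model.transcribe(): yields one segment per chunk, lazily."""
    for start, end, text in chunks:
        # In the real library, the expensive decoding happens here, per iteration.
        yield (start, end, text)

segments = fake_transcribe([(0.0, 2.0, "Hello"), (2.0, 4.0, "world")])
# Nothing has been decoded yet; force the full pass with list():
segments = list(segments)
```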
## Architecture

The library is built around several key components:

- **WhisperModel**: Main interface for speech recognition, providing transcription and language detection
- **BatchedInferencePipeline**: Batched decoding of multiple audio chunks at once for higher throughput
- **Audio Processing**: PyAV-based audio decoding with automatic format conversion and resampling
- **VAD Integration**: Silero VAD for automatic voice activity detection and silence filtering
- **CTranslate2 Backend**: Optimized inference engine with support for multiple compute types and devices

This design enables efficient speech-to-text processing with extensive customization options for different deployment scenarios.

## Capabilities

### Core Speech Recognition

Primary speech recognition functionality including transcription, language detection, and model management. These are the main operations for converting audio to text.

```python { .api }
class WhisperModel:
    def __init__(self, model_size_or_path, device="auto", compute_type="default", **kwargs): ...
    def transcribe(self, audio, language=None, task="transcribe", **kwargs): ...
    def detect_language(self, audio=None, features=None, **kwargs): ...

def available_models(): ...
def download_model(size_or_id, output_dir=None, **kwargs): ...
```

[Core Speech Recognition](./core-speech-recognition.md)
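
The `device` and `compute_type` arguments interact: INT8 quantization is the usual choice on CPU, FP16 on CUDA GPUs. A small helper that encodes this common convention (hypothetical, not part of the faster-whisper API):

```python
def default_compute_type(device: str) -> str:
    """Pick a sensible compute_type for a device (hypothetical helper).

    Mirrors common faster-whisper usage: int8 quantization on CPU,
    float16 on CUDA GPUs.
    """
    return "float16" if device == "cuda" else "int8"

# Example: WhisperModel("base", device="cpu", compute_type=default_compute_type("cpu"))
```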

### Batched Processing

High-throughput transcription that batches multiple audio chunks into a single forward pass for efficient inference.

```python { .api }
class BatchedInferencePipeline:
    def __init__(self, model): ...
    def forward(self, features, tokenizer, chunks_metadata, options): ...
```

[Batched Processing](./batched-processing.md)
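
The core batching idea is to split long audio into fixed-length chunks and decode several of them per forward pass. A stdlib-only sketch of the chunking step (illustrative only; the pipeline's actual chunking is VAD-aware):

```python
def make_batches(num_samples, chunk_len, batch_size):
    """Split a sample count into (start, end) chunk ranges, grouped into batches."""
    chunks = [(s, min(s + chunk_len, num_samples))
              for s in range(0, num_samples, chunk_len)]
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

# 10 s of 16 kHz audio, 3 s chunks, decoded two chunks at a time:
batches = make_batches(num_samples=160_000, chunk_len=48_000, batch_size=2)
```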

### Audio Processing

Audio decoding, format conversion, and preprocessing utilities for preparing audio data for transcription.

```python { .api }
def decode_audio(input_file, sampling_rate=16000, split_stereo=False): ...
def pad_or_trim(array, length=3000, *, axis=-1): ...
```

[Audio Processing](./audio-processing.md)
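
`pad_or_trim` forces the feature array to a fixed length (3000 mel frames, i.e. 30 seconds) by zero-padding or truncating. A 1-D sketch of the same behavior in plain Python (the real function operates on arrays along an arbitrary axis):

```python
def pad_or_trim_1d(values, length=3000):
    """Zero-pad or truncate a 1-D sequence to exactly `length` items."""
    if len(values) >= length:
        return values[:length]
    return values + [0.0] * (length - len(values))
```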

### Voice Activity Detection

Voice activity detection functionality using Silero VAD for automatic silence detection and audio segmentation.

```python { .api }
@dataclass
class VadOptions:
    threshold: float = 0.5
    min_speech_duration_ms: int = 0
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    speech_pad_ms: int = 400

def get_speech_timestamps(audio, vad_options=None, sampling_rate=16000, **kwargs): ...
```

[Voice Activity Detection](./voice-activity-detection.md)
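
A sketch of stitching the detected speech regions back together, assuming timestamps of the form `{"start": int, "end": int}` in sample indices (the shape Silero-style VADs return; `keep_speech` itself is a hypothetical helper):

```python
def keep_speech(audio, speech_timestamps):
    """Concatenate only the speech regions of a sample list, dropping silence."""
    out = []
    for ts in speech_timestamps:
        out.extend(audio[ts["start"]:ts["end"]])
    return out

audio = [0.0] * 100
timestamps = [{"start": 10, "end": 20}, {"start": 50, "end": 70}]
speech_only = keep_speech(audio, timestamps)  # 30 samples of speech survive
```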

### Utilities

Helper functions for timestamp formatting, model information, and other utility operations.

```python { .api }
def format_timestamp(seconds, always_include_hours=False, decimal_marker="."): ...
def get_logger(): ...
def get_assets_path(): ...
```

[Utilities](./utilities.md)
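
A local sketch of the SRT/VTT-style `[HH:]MM:SS<marker>mmm` format this kind of timestamp helper produces (assumed behavior, implemented here from scratch; prefer the library's `format_timestamp` in practice):

```python
def fmt_ts(seconds, always_include_hours=False, decimal_marker="."):
    """Sketch of subtitle-style timestamp formatting: [HH:]MM:SS<marker>mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"

fmt_ts(65.5)                                                  # "01:05.500"
fmt_ts(65.5, always_include_hours=True, decimal_marker=",")   # "00:01:05,500"
```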

## Core Types


```python { .api }
@dataclass
class Word:
    start: float
    end: float
    word: str
    probability: float

@dataclass
class Segment:
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: list[int]
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float
    words: list[Word] | None
    temperature: float | None

@dataclass
class TranscriptionInfo:
    language: str
    language_probability: float
    duration: float
    duration_after_vad: float
    all_language_probs: list[tuple[str, float]] | None
    transcription_options: TranscriptionOptions
    vad_options: VadOptions

@dataclass
class TranscriptionOptions:
    beam_size: int
    best_of: int
    patience: float
    length_penalty: float
    repetition_penalty: float
    no_repeat_ngram_size: int
    log_prob_threshold: float | None
    no_speech_threshold: float | None
    compression_ratio_threshold: float | None
    condition_on_previous_text: bool
    prompt_reset_on_temperature: float
    temperatures: list[float]
    initial_prompt: str | list[int] | None
    prefix: str | None
    suppress_blank: bool
    suppress_tokens: list[int] | None
    without_timestamps: bool
    max_initial_timestamp: float
    word_timestamps: bool
    prepend_punctuations: str
    append_punctuations: str
    multilingual: bool
    max_new_tokens: int | None
    clip_timestamps: str | list[float]
    hallucination_silence_threshold: float | None
    hotwords: str | None
```
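
Since segments expose plain `start`/`end`/`text` fields, downstream formatting is straightforward. A sketch that renders segments as SRT cues, using a minimal local stand-in for `Segment` with only the fields the formatter needs (`Seg` and `to_srt` are illustrative, not library APIs):

```python
from dataclasses import dataclass

@dataclass
class Seg:
    """Minimal stand-in for faster_whisper's Segment (subset of fields)."""
    start: float
    end: float
    text: str

def to_srt(segments):
    """Render segments as SRT cues: index, HH:MM:SS,mmm --> HH:MM:SS,mmm, text."""
    def ts(t):
        ms = round(t * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(cues)

srt = to_srt([Seg(0.0, 2.5, " Hello world."), Seg(2.5, 5.0, " Second line.")])
```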