The official TypeScript library for the OpenAI API
Core audio capabilities: text-to-speech generation, speech-to-text transcription (including speaker diarization), and audio translation to English.
The Audio resource is organized into three sub-resources, each serving distinct audio processing needs:
Generate natural-sounding audio from text input with configurable voices and audio formats.
client.audio.speech.create(params: SpeechCreateParams): Promise<Response>;

Convert audio to text with support for multiple languages, speaker diarization, streaming, and detailed metadata including timestamps and confidence scores.

client.audio.transcriptions.create(params: TranscriptionCreateParams): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent>>;

Translate audio in any language to English text with optional detailed segment information.

client.audio.translations.create(params: TranslationCreateParams): Promise<Translation | TranslationVerbose>;

Text-to-speech audio generation with multiple voice options and configurable audio formats.
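The parameter docs below state two hard limits: `input` is capped at 4096 characters and `speed` must fall in 0.25 to 4.0. Both can be enforced client-side before a request is made. The sketch below is illustrative and not part of the SDK; the constants simply restate the documented limits:

```typescript
// Illustrative client-side guards for the documented speech limits.
// Not part of the openai SDK; the values come from the parameter docs below.
const MAX_INPUT_CHARS = 4096; // documented input cap
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Clamp speed into the supported 0.25-4.0 range.
function clampSpeed(speed: number): number {
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

// Split long text into chunks that fit the input cap, preferring sentence
// boundaries so each chunk can be sent as its own speech request.
function chunkInput(text: string, maxChars: number = MAX_INPUT_CHARS): string[] {
  const chunks: string[] = [];
  let remaining = text.trim();
  while (remaining.length > maxChars) {
    const window = remaining.slice(0, maxChars);
    // Break after the last sentence end inside the window, if any.
    const breakAt = Math.max(
      window.lastIndexOf('. '),
      window.lastIndexOf('! '),
      window.lastIndexOf('? '),
    );
    const cut = breakAt > 0 ? breakAt + 1 : maxChars;
    chunks.push(remaining.slice(0, cut).trim());
    remaining = remaining.slice(cut).trim();
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}

console.log(clampSpeed(9)); // 4
```

Each chunk returned by `chunkInput` can then be passed as `input` to a separate `speech.create` call and the resulting audio concatenated.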
Generates audio from text input with configurable voice, format, speed, and model selection.
/**
* Generates audio from the input text
* @param params - Configuration for audio generation
* @returns Response containing audio data as binary stream
*/
speech.create(params: SpeechCreateParams): Promise<Response>;

Parameters:
interface SpeechCreateParams {
/** The text to generate audio for (max 4096 characters) */
input: string;
/** TTS model: 'tts-1', 'tts-1-hd', or 'gpt-4o-mini-tts' */
model: SpeechModel;
/** Voice to use: 'alloy', 'ash', 'ballad', 'cedar', 'coral', 'echo', 'fable', 'marin', 'nova', 'onyx', 'sage', 'shimmer', 'verse' */
voice: 'alloy' | 'ash' | 'ballad' | 'cedar' | 'coral' | 'echo' | 'fable' | 'marin' | 'nova' | 'onyx' | 'sage' | 'shimmer' | 'verse';
/** Audio format: 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm' (default: 'mp3') */
response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';
/** Speed from 0.25 to 4.0 (default: 1.0) */
speed?: number;
/** Voice control instructions (not supported with tts-1 or tts-1-hd) */
instructions?: string;
/** Stream format: 'sse' or 'audio' ('sse' not supported for tts-1/tts-1-hd) */
stream_format?: 'sse' | 'audio';
}

Union type for available text-to-speech models:

type SpeechModel = 'tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts';

tts-1 - Low latency, natural sounding (default for real-time applications)
tts-1-hd - Higher quality audio with increased latency
gpt-4o-mini-tts - Latest TTS model with advanced voice control

Generate audio in MP3 format with the default voice:
import fs from 'fs';
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'alloy',
input: 'The quick brown fox jumps over the lazy dog.',
});
const audioBuffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', audioBuffer);

Generate the same text with different voices to find the best fit:
const text = 'Welcome to our audio service.';
const voices = ['alloy', 'echo', 'sage', 'shimmer', 'nova'] as const;
for (const voice of voices) {
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: voice,
input: text,
response_format: 'mp3',
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(`voice_${voice}.mp3`, buffer);
}

Generate high-fidelity audio at a slower pace:
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'sage',
input: 'This is a carefully paced announcement.',
response_format: 'flac', // Lossless format for best quality
speed: 0.8, // Slower than normal
});
const audioFile = await response.arrayBuffer();
fs.writeFileSync('announcement.flac', Buffer.from(audioFile));

Generate audio in different formats for various use cases:
const formats = ['mp3', 'opus', 'aac', 'wav'] as const;
const input = 'Testing different audio formats.';
for (const format of formats) {
const response = await client.audio.speech.create({
model: 'tts-1',
voice: 'shimmer',
input: input,
response_format: format, // no cast needed: the const tuple narrows to valid formats
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(`output.${format}`, buffer);
}

Use advanced voice control (requires gpt-4o-mini-tts):
const response = await client.audio.speech.create({
model: 'gpt-4o-mini-tts',
voice: 'sage',
input: 'This announcement should sound urgent and professional.',
instructions: 'Speak with urgency and authority, using a professional tone.',
speed: 1.1,
});
const buffer = Buffer.from(await response.arrayBuffer());

Convert audio to text with support for speaker diarization, streaming, and detailed metadata.
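The verbose and diarized responses documented below expose per-segment timing (`start`, `end`, `text`), which is enough to build subtitles locally without asking the API for `srt` output. The sketch below is illustrative, not part of the SDK; `TimedSegment` merely mirrors the timing fields of the documented segment shapes:

```typescript
// Mirrors the timing fields of the documented TranscriptionSegment shape.
// This local interface is for illustration only; it is not an SDK export.
interface TimedSegment {
  start: number; // start time in seconds
  end: number;   // end time in seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w: number) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(rem, 3)}`;
}

// Build an SRT document from ordered segments.
function segmentsToSrt(segments: TimedSegment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

console.log(srtTimestamp(3725.5)); // "01:02:05,500"
```

The same helper works for `TranscriptionVerbose.segments`, and for diarized segments if the `speaker` label is folded into `text` first.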
Transcribes audio to text with options for verbose output, diarization, and real-time streaming.
/**
* Transcribes audio into the input language
* @param params - Transcription configuration
* @returns Transcribed text or detailed transcription object, optionally streamed
*/
transcriptions.create(
params: TranscriptionCreateParams
): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;

Parameters:
interface TranscriptionCreateParamsBase {
/** Audio file to transcribe (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
file: Uploadable;
/** Model: 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe', 'whisper-1', or 'gpt-4o-transcribe-diarize' */
model: AudioModel;
/** Response format: 'json', 'verbose_json', 'diarized_json', 'text', 'srt', or 'vtt' */
response_format?: AudioResponseFormat;
/** Enable streaming (not supported for whisper-1) */
stream?: boolean;
/** Language code in ISO-639-1 format (e.g., 'en', 'fr', 'es') */
language?: string;
/** Text to guide style or continue previous segment */
prompt?: string;
/** Sampling temperature 0-1 (default: 0, uses log probability) */
temperature?: number;
/** Chunking strategy: 'auto' or manual VAD configuration */
chunking_strategy?: 'auto' | VadConfig | null;
/** Include additional information: 'logprobs' */
include?: Array<'logprobs'>;
/** Timestamp granularities: 'word', 'segment', or both */
timestamp_granularities?: Array<'word' | 'segment'>;
/** Speaker names for diarization (up to 4 speakers) */
known_speaker_names?: Array<string>;
/** Audio samples of known speakers (2-10 seconds each) */
known_speaker_references?: Array<string>;
}
interface TranscriptionCreateParamsNonStreaming extends TranscriptionCreateParamsBase {
stream?: false | null;
}
interface TranscriptionCreateParamsStreaming extends TranscriptionCreateParamsBase {
stream: true;
}

VAD Configuration:
interface VadConfig {
type: 'server_vad';
prefix_padding_ms?: number; // Audio to include before VAD detection (default: 300ms)
silence_duration_ms?: number; // Duration of silence to detect stop (default: 1800ms)
threshold?: number; // Sensitivity 0.0-1.0 (default: 0.5)
}

Basic transcription response with text content:
interface Transcription {
text: string;
logprobs?: Array<Transcription.Logprob>; // Only with logprobs include
usage?: Transcription.Tokens | Transcription.Duration;
}

Detailed transcription with timestamps, segments, and word-level timing:
interface TranscriptionVerbose {
text: string;
duration: number; // Duration in seconds
language: string; // Detected language code
segments?: Array<TranscriptionSegment>; // Segment details with timestamps
words?: Array<TranscriptionWord>; // Word-level timing information
usage?: TranscriptionVerbose.Usage;
}

Individual segment with detailed timing and confidence metrics:
interface TranscriptionSegment {
id: number;
start: number; // Start time in seconds
end: number; // End time in seconds
text: string;
temperature: number;
avg_logprob: number; // Average log probability
compression_ratio: number;
no_speech_prob: number; // Probability of silence
tokens: Array<number>;
seek: number;
}

Word-level timing information for precise synchronization:
interface TranscriptionWord {
word: string;
start: number; // Start time in seconds
end: number; // End time in seconds
}

Speaker-identified transcription with segment attribution:
interface TranscriptionDiarized {
text: string;
duration: number;
task: 'transcribe';
segments: Array<TranscriptionDiarizedSegment>; // Annotated with speaker labels
usage?: TranscriptionDiarized.Tokens | TranscriptionDiarized.Duration;
}

Segment with speaker identification:
interface TranscriptionDiarizedSegment {
id: string;
text: string;
start: number; // Start time in seconds
end: number; // End time in seconds
speaker: string; // Speaker label ('A', 'B', etc., or known speaker name)
type: 'transcript.text.segment';
}

Union type for streaming transcription events:
type TranscriptionStreamEvent =
| TranscriptionTextDeltaEvent
| TranscriptionTextDoneEvent
| TranscriptionTextSegmentEvent;

Streaming event with incremental text:
interface TranscriptionTextDeltaEvent {
type: 'transcript.text.delta';
delta: string; // Incremental text
logprobs?: Array<TranscriptionTextDeltaEvent.Logprob>;
segment_id?: string; // For diarized segments
}

Final completion event with full transcription:
interface TranscriptionTextDoneEvent {
type: 'transcript.text.done';
text: string; // Complete transcription
logprobs?: Array<TranscriptionTextDoneEvent.Logprob>;
usage?: TranscriptionTextDoneEvent.Usage;
}

Diarized segment completion event:
interface TranscriptionTextSegmentEvent {
type: 'transcript.text.segment';
id: string;
text: string;
speaker: string; // Speaker label
start: number;
end: number;
}

Transcribe an audio file to text:
import fs from 'fs';
const audioFile = fs.createReadStream('speech.mp3');
const transcription = await client.audio.transcriptions.create({
file: audioFile,
model: 'gpt-4o-transcribe',
});
console.log('Transcribed text:', transcription.text);

Improve accuracy by specifying the language:
const frenchAudio = fs.createReadStream('french_speech.mp3');
const transcription = await client.audio.transcriptions.create({
file: frenchAudio,
model: 'gpt-4o-transcribe',
language: 'fr', // ISO-639-1 language code
prompt: 'This is a technical discussion about software development.', // Style guide
});
console.log('French transcription:', transcription.text);

Get detailed segment and word-level timing information:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('podcast.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
timestamp_granularities: ['word', 'segment'],
});
// Access word-level timing
if (transcription.words) {
transcription.words.forEach(word => {
console.log(`${word.word}: ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s`);
});
}
// Access segments
if (transcription.segments) {
transcription.segments.forEach(segment => {
console.log(`[${segment.start.toFixed(2)}s - ${segment.end.toFixed(2)}s] ${segment.text}`);
console.log(` Confidence: ${(1 - segment.no_speech_prob).toFixed(3)}`);
});
}

Identify and separate different speakers:
const audioFile = fs.createReadStream('multi_speaker_audio.mp3');
const diarization = await client.audio.transcriptions.create({
file: audioFile,
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
});
// View segments with speaker identification
diarization.segments.forEach(segment => {
console.log(`[${segment.start.toFixed(2)}s] Speaker ${segment.speaker}: ${segment.text}`);
});

Provide reference audio for known speakers to improve identification:
const mainAudio = fs.createReadStream('meeting_recording.mp3');
const diarization = await client.audio.transcriptions.create({
file: mainAudio,
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
known_speaker_names: ['John', 'Sarah', 'Mike'],
known_speaker_references: [
'data:audio/mp3;base64,//NExAAR...', // John's voice sample
'data:audio/mp3;base64,//NExAAR...', // Sarah's voice sample
'data:audio/mp3;base64,//NExAAR...', // Mike's voice sample
],
});
// Speakers now labeled by name instead of letters
diarization.segments.forEach(segment => {
console.log(`${segment.speaker}: "${segment.text}"`);
});

Real-time transcription as audio arrives:
const audioStream = fs.createReadStream('live_audio.mp3');
const stream = await client.audio.transcriptions.create({
file: audioStream,
model: 'gpt-4o-transcribe',
stream: true,
response_format: 'json',
});
let fullText = '';
for await (const event of stream) {
if (event.type === 'transcript.text.delta') {
// Incremental text arrives
process.stdout.write(event.delta);
fullText += event.delta;
} else if (event.type === 'transcript.text.done') {
// Final complete text
console.log('\nFinal transcription:', event.text);
}
}

Real-time speaker-identified transcription:
const audioStream = fs.createReadStream('streaming_meeting.mp3');
const stream = await client.audio.transcriptions.create({
file: audioStream,
model: 'gpt-4o-transcribe-diarize',
stream: true,
response_format: 'diarized_json',
});
const speakers: { [key: string]: string } = {};
for await (const event of stream) {
if (event.type === 'transcript.text.segment') {
// Complete segment with speaker information
if (!speakers[event.speaker]) {
console.log(`\n[New Speaker: ${event.speaker}]`);
speakers[event.speaker] = event.speaker;
}
console.log(`${event.speaker} [${event.start.toFixed(2)}s-${event.end.toFixed(2)}s]: ${event.text}`);
}
}

Analyze transcription confidence and quality:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
include: ['logprobs'],
});
// Analyze segment quality
if (transcription.segments) {
transcription.segments.forEach(segment => {
const confidence = 1 - segment.no_speech_prob;
const quality = segment.compression_ratio < 2.4 ? 'good' : 'degraded';
console.log(`Segment "${segment.text}"`);
console.log(` Confidence: ${(confidence * 100).toFixed(1)}%`);
console.log(` Quality: ${quality}`);
console.log(` Compression ratio: ${segment.compression_ratio.toFixed(3)}`);
});
}

Configure voice activity detection for better segment boundaries:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('long_audio.mp3'),
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
chunking_strategy: {
type: 'server_vad',
threshold: 0.6, // Higher threshold = less sensitive; fewer false triggers in noisy audio
silence_duration_ms: 1200, // Shorter pause = new segment
prefix_padding_ms: 500,
},
});
console.log('Segments:', transcription.segments.length);

Translate audio from any language to English text.
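This endpoint can return subtitle strings directly (`response_format: 'srt'`, shown in the examples below). When those strings need post-editing, it helps to parse them back into structured cues. The sketch below is illustrative; the `SrtCue` shape is not an SDK type, and only the SRT block layout itself is standard:

```typescript
// A minimal cue shape for parsed SRT; not part of the openai SDK.
interface SrtCue {
  index: number;
  start: string; // "HH:MM:SS,mmm"
  end: string;
  text: string;
}

// Parse an SRT document into cues. Handles the standard block layout:
// index line, timing line, one or more text lines, blank line separator.
function parseSrt(srt: string): SrtCue[] {
  return srt
    .trim()
    .split(/\r?\n\r?\n/) // blocks are separated by blank lines
    .filter(block => block.trim().length > 0)
    .map(block => {
      const lines = block.split(/\r?\n/);
      const [start, end] = lines[1].split(' --> ');
      return {
        index: Number(lines[0]),
        start: start.trim(),
        end: end.trim(),
        text: lines.slice(2).join('\n'),
      };
    });
}

const sample = '1\n00:00:00,000 --> 00:00:02,500\nHello there.\n\n2\n00:00:02,500 --> 00:00:04,000\nWelcome.';
console.log(parseSrt(sample).length); // 2
```

After editing, cues can be re-serialized in the same block layout and written back out as a subtitle file.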
Translates audio to English with optional detailed segment information.
/**
* Translates audio into English
* @param params - Translation configuration
* @returns Translated English text or detailed translation object
*/
translations.create(
params: TranslationCreateParams
): Promise<Translation | TranslationVerbose | string>;

Parameters:
interface TranslationCreateParams {
/** Audio file to translate (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
file: Uploadable;
/** Model: 'whisper-1' (currently the only available translation model) */
model: AudioModel;
/** Response format: 'json', 'verbose_json', 'text', 'srt', or 'vtt' (default: 'json') */
response_format?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';
/** Text to guide style or continue previous segment (should be in English) */
prompt?: string;
/** Sampling temperature 0-1 (default: 0, uses log probability) */
temperature?: number;
}

Basic translation response with English text:
interface Translation {
text: string;
}

Detailed translation with segment information:
interface TranslationVerbose {
text: string; // Translated English text
duration: number; // Duration in seconds
language: string; // Always 'english' for output
segments?: Array<TranscriptionSegment>; // Segment details with timestamps
}

Translate audio to English:
import fs from 'fs';
const spanishAudio = fs.createReadStream('spanish_interview.mp3');
const translation = await client.audio.translations.create({
file: spanishAudio,
model: 'whisper-1',
});
console.log('English translation:', translation.text);

Guide the translation style with a prompt:
const frenchPodcast = fs.createReadStream('french_podcast.mp3');
const translation = await client.audio.translations.create({
file: frenchPodcast,
model: 'whisper-1',
prompt: 'This is a formal technical discussion. Use precise technical terminology.',
response_format: 'json',
});
console.log('Professional translation:', translation.text);

Get segment-level information for synchronized translation display:
const italianAudio = fs.createReadStream('italian_video.mp3');
const translation = await client.audio.translations.create({
file: italianAudio,
model: 'whisper-1',
response_format: 'verbose_json',
});
console.log(`Full translation: ${translation.text}`);
console.log(`Duration: ${translation.duration} seconds`);
// Display segments with timing for subtitle generation
if (translation.segments) {
translation.segments.forEach(segment => {
const start = segment.start.toFixed(2);
const end = segment.end.toFixed(2);
console.log(`[${start}s - ${end}s] ${segment.text}`);
});
}

Export translations in subtitle formats for video:
const germanAudio = fs.createReadStream('german_video.mp3');
// Get SRT format (SubRip)
const srtTranslation = await client.audio.translations.create({
file: germanAudio,
model: 'whisper-1',
response_format: 'srt',
});
fs.writeFileSync('english_subtitles.srt', srtTranslation);
// Get VTT format (WebVTT)
const vttTranslation = await client.audio.translations.create({
file: germanAudio,
model: 'whisper-1',
response_format: 'vtt',
});
fs.writeFileSync('english_subtitles.vtt', vttTranslation);

Translate multiple audio files:
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];
const translations: Record<string, string> = {};
for (const file of files) {
const audioStream = fs.createReadStream(file);
const result = await client.audio.translations.create({
file: audioStream,
model: 'whisper-1',
});
translations[file] = result.text;
}
// Save all translations
fs.writeFileSync('translations.json', JSON.stringify(translations, null, 2));

Supported audio models for transcription and translation:
type AudioModel =
| 'whisper-1'
| 'gpt-4o-transcribe'
| 'gpt-4o-mini-transcribe'
| 'gpt-4o-transcribe-diarize';

whisper-1 - Reliable transcription and translation model, optimized for various audio qualities
gpt-4o-transcribe - Advanced transcription with improved accuracy and language detection
gpt-4o-mini-transcribe - Lightweight variant for efficient transcription
gpt-4o-transcribe-diarize - Speaker identification and diarization capabilities

Output format options for transcriptions and translations:
type AudioResponseFormat =
| 'json'
| 'text'
| 'srt'
| 'verbose_json'
| 'vtt'
| 'diarized_json';

json - Structured JSON response with text content (default)
text - Plain text without additional metadata
srt - SubRip subtitle format (timing + text)
verbose_json - Detailed JSON with segments, timing, and confidence scores
vtt - WebVTT subtitle format (timing + text)
diarized_json - JSON with speaker identification and segment timing

Handle common audio processing errors:
import { BadRequestError, APIError } from 'openai';
try {
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
});
} catch (error) {
if (error instanceof BadRequestError) {
console.error('Invalid file format or parameters:', error.message);
} else if (error instanceof APIError) {
console.error('API error:', error.message);
}
}

Work with different file input types:
import { toFile } from 'openai';
// From file system
const fromDisk = fs.createReadStream('audio.mp3');
// From Buffer
const buffer = await fs.promises.readFile('audio.mp3');
const fromBuffer = await toFile(buffer, 'audio.mp3', { type: 'audio/mpeg' });
// From URL (requires fetch)
const response = await fetch('https://example.com/audio.mp3');
const blob = await response.blob();
const fromUrl = await toFile(blob, 'audio.mp3', { type: 'audio/mpeg' });
// Use with any sub-resource
const transcription = await client.audio.transcriptions.create({
file: fromBuffer,
model: 'gpt-4o-transcribe',
});

Control request behavior and timeouts:
const transcription = await client.audio.transcriptions.create(
{
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
},
{
timeout: 30000, // 30 second timeout
maxRetries: 2,
}
);

Chain multiple audio operations for complete audio processing:
// 1. Transcribe audio
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('mixed_language.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
timestamp_granularities: ['word'],
});
// 2. Translate the transcribed content using chat completion
const translation = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{
role: 'user',
content: `Translate to Spanish:\n\n${transcription.text}`,
}],
});
// 3. Generate speech from translated text
const speech = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'nova',
input: translation.choices[0].message.content || '',
});
const audioBuffer = Buffer.from(await speech.arrayBuffer());
fs.writeFileSync('translated_speech.mp3', audioBuffer);

Install with Tessl CLI
npx tessl i tessl/npm-openai