Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.
This skill guides the implementation of text-to-speech (TTS) functionality using the z-ai-web-dev-sdk package, enabling conversion of text into natural-sounding speech audio.
Skill Location: {project_path}/skills/TTS
This skill is located at the above path in your project.
Reference Scripts: Example test scripts are available in the {Skill Location}/scripts/ directory for quick testing and reference. See {Skill Location}/scripts/tts.ts for a working example.
Text-to-Speech allows you to build applications that generate spoken audio from text input, supporting various voices, speeds, and output formats for diverse use cases.
IMPORTANT: z-ai-web-dev-sdk MUST be used in backend code only. Never use it in client-side code.
Before implementing TTS functionality, be aware of these important limitations:
When stream: true is enabled, only pcm format is supported.
Supported output formats: wav, pcm, and mp3.
Input text is limited to 1024 characters per request. For longer text, split it into chunks first:

function splitTextIntoChunks(text, maxLength = 1000) {
  const chunks = [];
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  let currentChunk = '';
  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxLength) {
      currentChunk += sentence;
    } else {
      if (currentChunk) chunks.push(currentChunk.trim());
      currentChunk = sentence;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());
  return chunks;
}

The z-ai-web-dev-sdk package is already installed. Import it as shown in the examples below.
For simple text-to-speech conversions, you can use the z-ai CLI instead of writing code. This is ideal for quick audio generation, testing voices, or simple automation.
# Convert text to speech (default WAV format)
z-ai tts --input "Hello, world" --output ./hello.wav
# Using short options
z-ai tts -i "Hello, world" -o ./hello.wav

# Use specific voice
z-ai tts -i "Welcome to our service" -o ./welcome.wav --voice tongtong
# Adjust speech speed (0.5-2.0)
z-ai tts -i "This is faster speech" -o ./fast.wav --speed 1.5
# Slower speech
z-ai tts -i "This is slower speech" -o ./slow.wav --speed 0.8

# MP3 format
z-ai tts -i "Hello World" -o ./hello.mp3 --format mp3
# WAV format (default)
z-ai tts -i "Hello World" -o ./hello.wav --format wav
# PCM format
z-ai tts -i "Hello World" -o ./hello.pcm --format pcm

# Stream audio generation
z-ai tts -i "This is a longer text that will be streamed" -o ./stream.wav --stream

Options:
--input, -i <text>: Required. Text to convert to speech (max 1024 characters).
--output, -o <path>: Required. Output audio file path.
--voice, -v <voice>: Optional. Voice type (default: tongtong).
--speed, -s <number>: Optional. Speech speed, 0.5-2.0 (default: 1.0).
--format, -f <format>: Optional. Output format: wav, mp3, or pcm (default: wav).
--stream: Optional. Enable streaming output (only supports pcm format).

Use CLI for: quick audio generation, testing voices, and simple automation.
Use SDK for: application integration, batch processing, API routes, and custom error handling.
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
async function textToSpeech(text, outputPath) {
const zai = await ZAI.create();
const response = await zai.audio.tts.create({
input: text,
voice: 'tongtong',
speed: 1.0,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
console.log(`Audio saved to ${outputPath}`);
return outputPath;
}
// Usage
await textToSpeech('Hello, world!', './output.wav');

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
async function generateWithVoice(text, voice, outputPath) {
const zai = await ZAI.create();
const response = await zai.audio.tts.create({
input: text,
voice: voice, // Available voices: tongtong, chuichui, xiaochen, jam, kazi, douji, luodo
speed: 1.0,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
return outputPath;
}
// Usage
await generateWithVoice('Welcome to our service', 'tongtong', './welcome.wav');

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
async function generateWithSpeed(text, speed, outputPath) {
const zai = await ZAI.create();
// Speed range: 0.5 to 2.0 (API constraint)
// 0.5 = half speed (slower)
// 1.0 = normal speed (default)
// 2.0 = double speed (faster)
// Values outside this range will cause API errors
const response = await zai.audio.tts.create({
input: text,
voice: 'tongtong',
speed: speed,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
return outputPath;
}
// Usage - slower narration
await generateWithSpeed('This is an important announcement', 0.8, './slow.wav');
// Usage - faster narration
await generateWithSpeed('Quick update', 1.3, './fast.wav');

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
async function generateWithVolume(text, volume, outputPath) {
const zai = await ZAI.create();
// Volume range: greater than 0, up to 10 (API constraint)
// Values must be > 0 (exclusive) and <= 10 (inclusive)
// Default: 1.0 (normal volume)
const response = await zai.audio.tts.create({
input: text,
voice: 'tongtong',
speed: 1.0,
volume: volume, // Optional parameter
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
return outputPath;
}
// Usage - louder audio
await generateWithVolume('This is an announcement', 5.0, './loud.wav');
// Usage - quieter audio
await generateWithVolume('Whispered message', 0.5, './quiet.wav');

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';
async function batchTextToSpeech(textArray, outputDir) {
const zai = await ZAI.create();
const results = [];
// Ensure output directory exists
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
for (let i = 0; i < textArray.length; i++) {
try {
const text = textArray[i];
const outputPath = path.join(outputDir, `audio_${i + 1}.wav`);
const response = await zai.audio.tts.create({
input: text,
voice: 'tongtong',
speed: 1.0,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
results.push({
success: true,
text,
path: outputPath
});
} catch (error) {
results.push({
success: false,
text: textArray[i],
error: error.message
});
}
}
return results;
}
// Usage
const texts = [
'Welcome to chapter one',
'Welcome to chapter two',
'Welcome to chapter three'
];
const results = await batchTextToSpeech(texts, './audio-output');
console.log('Generated:', results.length, 'audio files');

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
class TTSGenerator {
constructor() {
this.zai = null;
}
async initialize() {
this.zai = await ZAI.create();
}
async generateAudio(text, options = {}) {
const {
voice = 'tongtong',
speed = 1.0,
format = 'wav'
} = options;
const response = await this.zai.audio.tts.create({
input: text,
voice: voice,
speed: speed,
response_format: format,
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
return Buffer.from(new Uint8Array(arrayBuffer));
}
async saveAudio(text, outputPath, options = {}) {
const buffer = await this.generateAudio(text, options);
if (buffer) {
fs.writeFileSync(outputPath, buffer);
return outputPath;
}
return null;
}
}
// Usage
const generator = new TTSGenerator();
await generator.initialize();
await generator.saveAudio(
'Hello, this is a test',
'./output.wav',
{ speed: 1.2 }
);

import { NextRequest, NextResponse } from 'next/server';
export async function POST(req: NextRequest) {
try {
const { text, voice = 'tongtong', speed = 1.0 } = await req.json();
// Import ZAI SDK
const ZAI = (await import('z-ai-web-dev-sdk')).default;
// Create SDK instance
const zai = await ZAI.create();
// Generate TTS audio
const response = await zai.audio.tts.create({
input: text.trim(),
voice: voice,
speed: speed,
response_format: 'wav',
stream: false,
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
// Return audio as response
return new NextResponse(buffer, {
status: 200,
headers: {
'Content-Type': 'audio/wav',
'Content-Length': buffer.length.toString(),
'Cache-Control': 'no-cache',
},
});
} catch (error) {
console.error('TTS API Error:', error);
return NextResponse.json(
{
error: error instanceof Error ? error.message : 'Speech generation failed, please try again later',
},
{ status: 500 }
);
}
}

function prepareTextForTTS(text) {
// Remove excessive whitespace
text = text.replace(/\s+/g, ' ').trim();
// Expand common abbreviations for better pronunciation
const abbreviations = {
'Mrs.': 'Misses',
'Mr.': 'Mister',
'Dr.': 'Doctor',
'etc.': 'et cetera'
};
for (const [abbr, full] of Object.entries(abbreviations)) {
// Escape the period so it matches literally ('Mr.' must not match the 'Mr' in 'Mrs.')
const pattern = new RegExp(abbr.replace('.', '\\.'), 'g');
text = text.replace(pattern, full);
}
return text;
}

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
async function safeTTS(text, outputPath) {
try {
// Validate input
if (!text || text.trim().length === 0) {
throw new Error('Text input cannot be empty');
}
if (text.length > 1024) {
throw new Error('Text input exceeds maximum length of 1024 characters');
}
const zai = await ZAI.create();
const response = await zai.audio.tts.create({
input: text,
voice: 'tongtong',
speed: 1.0,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
return {
success: true,
path: outputPath,
size: buffer.length
};
} catch (error) {
console.error('TTS Error:', error);
return {
success: false,
error: error.message
};
}
}

import ZAI from 'z-ai-web-dev-sdk';
// Create a singleton instance
let zaiInstance = null;
async function getZAIInstance() {
if (!zaiInstance) {
zaiInstance = await ZAI.create();
}
return zaiInstance;
}
// Usage
const zai = await getZAIInstance();
const response = await zai.audio.tts.create({ ... });

import express from 'express';
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';
const app = express();
app.use(express.json());
let zaiInstance;
const outputDir = './audio-output';
async function initZAI() {
zaiInstance = await ZAI.create();
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
}
app.post('/api/tts', async (req, res) => {
try {
const { text, voice = 'tongtong', speed = 1.0 } = req.body;
if (!text) {
return res.status(400).json({ error: 'Text is required' });
}
const filename = `tts_${Date.now()}.wav`;
const outputPath = path.join(outputDir, filename);
const response = await zaiInstance.audio.tts.create({
input: text,
voice: voice,
speed: speed,
response_format: 'wav',
stream: false
});
// Get array buffer from Response object
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
fs.writeFileSync(outputPath, buffer);
res.json({
success: true,
audioUrl: `/audio/${filename}`,
size: buffer.length
});
} catch (error) {
res.status(500).json({ error: error.message });
}
});
app.use('/audio', express.static('audio-output'));
initZAI().then(() => {
app.listen(3000, () => {
console.log('TTS API running on port 3000');
});
});

Issue: "Input text exceeds maximum length"
Solution: Split the text with the splitTextIntoChunks function shown in the API Limitations section.

Issue: "Invalid speed parameter" or unexpected speed behavior
Solution: Keep speed within the supported range of 0.5 to 2.0.

Issue: "Invalid volume parameter"
Solution: Use a volume greater than 0 and at most 10.

Issue: "Stream format not supported" with WAV/MP3
Solution: Use response_format: 'pcm' with streaming, or disable streaming (stream: false) for WAV/MP3 output.

Issue: "SDK must be used in backend"
Solution: Move all z-ai-web-dev-sdk calls into backend code (API routes, server scripts); never import the SDK client-side.

Issue: "TypeError: response.audio is undefined"
Solution: Use await response.arrayBuffer() instead of accessing response.audio.

Issue: Generated audio file is empty or corrupted
Solution: Make sure you are awaiting response.arrayBuffer() and properly converting to a Buffer: Buffer.from(new Uint8Array(arrayBuffer)).

Issue: Audio sounds unnatural
Solution: Preprocess the input (normalize whitespace, expand abbreviations) and experiment with different voices and speeds.

Issue: Long processing times
Solution: Split long text into smaller chunks and reuse a single ZAI instance rather than creating one per request.

Issue: Next.js caching an old API route
Solution: Send Cache-Control: no-cache headers from the route (as in the example above) and restart the dev server after changes.
Input Text Length: Maximum 1024 characters per request. For longer text:
// Split long text into chunks
const longText = "..."; // Your long text here
const chunks = splitTextIntoChunks(longText, 1000);
for (const chunk of chunks) {
const response = await zai.audio.tts.create({
input: chunk,
voice: 'tongtong',
speed: 1.0,
response_format: 'wav',
stream: false
});
// Process each chunk...
}

Streaming Format Limitation: When using stream: true, only pcm format is supported. For wav or mp3 output, use stream: false.
Sample Rate: Audio is generated at 24000 Hz sample rate (recommended setting for playback).
The zai.audio.tts.create() method returns a standard Response object (not a custom object with an audio property). Always use:
// ✅ CORRECT
const response = await zai.audio.tts.create({ ... });
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));
// ❌ WRONG - This will not work
const response = await zai.audio.tts.create({ ... });
const buffer = Buffer.from(response.audio); // response.audio is undefined

Available voices:
tongtong - warm and friendly
chuichui - lively and cute
xiaochen - calm and professional
jam - British-accented gentleman
kazi - clear and standard
douji - natural and fluent
luodo - expressive and engaging

Speed range: 0.5 (half speed), 1.0 (normal speed, default), 2.0 (double speed).
Important: Speed values outside the range [0.5, 2.0] will result in API errors.
Volume range: greater than 0 (exclusive) up to 10 (inclusive); 1.0 is normal volume.
Note: The volume parameter is optional. When not specified, it defaults to 1.0.
Key points: always read the result with await response.arrayBuffer() and convert it with Buffer.from(new Uint8Array(arrayBuffer)).