tessl/npm-pinyin

Chinese character to Pinyin conversion with intelligent phrase matching and multiple pronunciation support

Text Segmentation

Integrated support for multiple Chinese text segmentation libraries, which improves conversion accuracy for phrases and compound words by recognizing word boundaries.

Capabilities

Segmentation Libraries

Support for multiple Chinese word segmentation libraries with automatic fallback handling.

type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

Note: Segmentation is handled internally by the main pinyin() function when the segment option is provided. There is no standalone segment function exported from the package.

Segmentation Options

nodejieba

  • Type: C++ implementation
  • Performance: Fastest
  • Dependency: Requires nodejieba as peer dependency
  • Platform: Node.js only

@node-rs/jieba

  • Type: Rust implementation via NAPI
  • Performance: Very fast
  • Dependency: Requires @node-rs/jieba as peer dependency
  • Platform: Node.js only

segmentit

  • Type: Pure JavaScript
  • Performance: Good
  • Dependency: Requires segmentit as peer dependency
  • Platform: Node.js and browsers

Intl.Segmenter

  • Type: Web standard API
  • Performance: Good
  • Dependency: Built into modern environments
  • Platform: Modern browsers and Node.js 16+
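
Of the four options, only Intl.Segmenter needs no extra install. As a standalone illustration (independent of the pinyin package), word-granularity segmentation with the built-in API looks like this; results depend on the runtime's ICU data:

```javascript
// Split Chinese text at word boundaries using the web-standard API.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = Array.from(segmenter.segment("我喜欢编程"), (s) => s.segment);

// The segments always partition the input, so they rejoin to the
// original string; the exact word boundaries depend on ICU data.
console.log(words);
```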

Segmentation Configuration

Segmentation can be enabled through the main pinyin function options:

interface IPinyinOptions {
  /** Text segmentation library for phrase recognition */
  segment?: IPinyinSegment | boolean;
}

Configuration Options:

  • false (default): No segmentation, character-by-character conversion
  • true: Enable segmentation using Intl.Segmenter (recommended default)
  • "nodejieba": Use nodejieba library (fastest, Node.js only)
  • "@node-rs/jieba": Use Rust-based jieba (very fast, Node.js only)
  • "segmentit": Use pure JavaScript segmentation (cross-platform)
  • "Intl.Segmenter": Use web standard segmentation (modern environments)

Usage Examples

Basic Segmentation

import pinyin from "pinyin";

// Without segmentation (character-by-character)
console.log(pinyin("我喜欢编程"));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biān"], ["chéng"]]

// With segmentation and grouping (phrase-aware)
console.log(pinyin("我喜欢编程", { segment: true, group: true }));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biānchéng"]]

// With a specific segmentation library
console.log(pinyin("我喜欢编程", { segment: "nodejieba", group: true }));
// Result: [["wǒ"], ["xǐhuān"], ["biānchéng"]]

Note: joined syllables in the output require group: true. With segmentation alone, the output stays one syllable per character; segmentation mainly improves polyphone disambiguation within recognized words.

Phrase Recognition Benefits

Segmentation significantly improves accuracy for compound words and phrases:

// Without segmentation - less accurate
console.log(pinyin("北京大学"));
// Result: [["běi"], ["jīng"], ["dà"], ["xué"]]

// With segmentation and grouping - phrase-level output
console.log(pinyin("北京大学", { segment: true, group: true }));
// Result: [["běijīng"], ["dàxué"]]

// Complex phrases
console.log(pinyin("人工智能技术", { segment: "nodejieba", group: true }));
// Result: [["réngōng"], ["zhìnéng"], ["jìshù"]]

Segmentation with Grouping

Combine segmentation with group option for phrase-level Pinyin:

// Segmentation with phrase grouping
console.log(pinyin("自然语言处理", { 
  segment: true, 
  group: true 
}));
// Result: [["zìrán"], ["yǔyán"], ["chǔlǐ"]]

// Character-by-character for comparison
console.log(pinyin("自然语言处理"));
// Result: [["zì"], ["rán"], ["yǔ"], ["yán"], ["chǔ"], ["lǐ"]]

Internal Segmentation Process

The segmentation functionality is internal to the pinyin() function and is not exposed as a standalone API. When you enable segmentation through the segment option, the library automatically handles word boundary detection internally before applying Pinyin conversion.
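
Conceptually, the flow can be sketched as segment-first, then convert-per-word. The helpers below are toy stand-ins for illustration, not the library's actual internals or real pinyin APIs:

```javascript
// Hypothetical sketch of a segment-then-convert pipeline.
function segmentThenConvert(text, segmentWords, convertWord) {
  const words = segmentWords(text); // word boundary detection
  return words.map(convertWord);    // per-word conversion
}

// Toy segmenter and converter used only to show the data flow:
const fakeSegment = () => ["北京", "大学"];
const fakeConvert = (word) => [word + "_pinyin"];

const out = segmentThenConvert("北京大学", fakeSegment, fakeConvert);
console.log(out); // [["北京_pinyin"], ["大学_pinyin"]]
```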

Error Handling and Fallbacks

The segmentation system includes robust error handling:

Missing Dependencies

When a specified segmentation library is not available:

// If nodejieba is not installed
console.log(pinyin("测试", { segment: "nodejieba" }));
// Logs: "pinyin v4: 'nodejieba' is peerDependencies"
// Fallback: Returns original text as single segment

Segmentation Failures

If segmentation fails due to errors:

// An unrecognized or failing segment value falls back gracefully
console.log(pinyin("测试", { segment: "invalid-library" }));
// Fallback: the input is treated as a single segment and converted
// character by character

Platform Compatibility

Different libraries work on different platforms:

// Browser environment
console.log(pinyin("测试", { segment: "Intl.Segmenter" }));
// Works in modern browsers

console.log(pinyin("测试", { segment: "nodejieba" }));
// Will fallback - nodejieba only works in Node.js
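
When targeting both browsers and Node.js, you can feature-detect Intl.Segmenter before requesting it. This is a defensive sketch; as described above, the library also falls back on its own:

```javascript
// Pick a segment option based on what the runtime actually supports.
const hasIntlSegmenter =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";

// Use the web-standard segmenter when available; otherwise disable
// segmentation rather than requesting an unavailable backend.
const segmentOption = hasIntlSegmenter ? "Intl.Segmenter" : false;
console.log(segmentOption);
```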

Performance Considerations

Library Performance Comparison

  1. nodejieba: Fastest (C++ implementation)
  2. @node-rs/jieba: Very fast (Rust implementation via NAPI)
  3. Intl.Segmenter: Good (native browser/Node.js API)
  4. segmentit: Good (pure JavaScript)

Memory Usage

  • @node-rs/jieba: Low memory overhead
  • nodejieba: Moderate memory usage
  • segmentit: Higher memory for dictionary loading
  • Intl.Segmenter: Minimal additional memory

Recommendation

For most applications, use segment: true (defaults to Intl.Segmenter) as it provides good performance without additional dependencies.

// Recommended approach
const result = pinyin("中文文本", { segment: true });

For high-performance Node.js applications with many conversions, consider nodejieba or @node-rs/jieba:

// High-performance Node.js
const result = pinyin("中文文本", { segment: "@node-rs/jieba" });

Install with Tessl CLI

npx tessl i tessl/npm-pinyin
