tessl/npm-pinyin

Chinese character to Pinyin conversion with intelligent phrase matching and multiple pronunciation support

Text Segmentation

Integrated support for multiple Chinese text segmentation libraries, which improves conversion accuracy for phrases and compound words by recognizing word boundaries.

Capabilities

Segmentation Libraries

Support for multiple Chinese word segmentation libraries with automatic fallback handling.

type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

Note: Segmentation is handled internally by the main pinyin() function when the segment option is provided. There is no standalone segment function exported from the package.

Segmentation Options

nodejieba

  • Type: C++ implementation
  • Performance: Fastest
  • Dependency: Requires nodejieba as peer dependency
  • Platform: Node.js only

@node-rs/jieba

  • Type: Rust implementation via NAPI
  • Performance: Very fast
  • Dependency: Requires @node-rs/jieba as peer dependency
  • Platform: Node.js only

segmentit

  • Type: Pure JavaScript
  • Performance: Good
  • Dependency: Requires segmentit as peer dependency
  • Platform: Node.js and browsers

Intl.Segmenter

  • Type: Web standard API
  • Performance: Good
  • Dependency: Built into modern environments
  • Platform: Modern browsers and Node.js 16+
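
Of the four options, only Intl.Segmenter needs no extra install. As a standalone illustration (independent of the pinyin package), word-granularity segmentation with the built-in API looks like this; results depend on the runtime's ICU data:

```javascript
// Split Chinese text at word boundaries using the web-standard API.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
const words = Array.from(segmenter.segment("我喜欢编程"), (s) => s.segment);

// The segments always partition the input, so they rejoin to the
// original string; the exact word boundaries depend on ICU data.
console.log(words);
```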

Segmentation Configuration

Segmentation can be enabled through the main pinyin function options:

interface IPinyinOptions {
  /** Text segmentation library for phrase recognition */
  segment?: IPinyinSegment | boolean;
}

Configuration Options:

  • false (default): No segmentation, character-by-character conversion
  • true: Enable segmentation using Intl.Segmenter (recommended default)
  • "nodejieba": Use nodejieba library (fastest, Node.js only)
  • "@node-rs/jieba": Use Rust-based jieba (very fast, Node.js only)
  • "segmentit": Use pure JavaScript segmentation (cross-platform)
  • "Intl.Segmenter": Use web standard segmentation (modern environments)

Usage Examples

Basic Segmentation

import pinyin from "pinyin";

// Without segmentation (character-by-character)
console.log(pinyin("我喜欢编程"));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biān"], ["chéng"]]

// With segmentation and grouping (phrase-aware)
console.log(pinyin("我喜欢编程", { segment: true, group: true }));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biānchéng"]]

// With a specific segmentation library
console.log(pinyin("我喜欢编程", { segment: "nodejieba", group: true }));
// Result: [["wǒ"], ["xǐhuān"], ["biānchéng"]]

Note: joined syllables in the output require group: true. With segmentation alone, the output stays one syllable per character; segmentation mainly improves polyphone disambiguation within recognized words.

Phrase Recognition Benefits

Segmentation significantly improves accuracy for compound words and phrases:

// Without segmentation - less accurate
console.log(pinyin("北京大学"));
// Result: [["běi"], ["jīng"], ["dà"], ["xué"]]

// With segmentation and grouping - phrase-level output
console.log(pinyin("北京大学", { segment: true, group: true }));
// Result: [["běijīng"], ["dàxué"]]

// Complex phrases
console.log(pinyin("人工智能技术", { segment: "nodejieba", group: true }));
// Result: [["réngōng"], ["zhìnéng"], ["jìshù"]]

Segmentation with Grouping

Combine segmentation with group option for phrase-level Pinyin:

// Segmentation with phrase grouping
console.log(pinyin("自然语言处理", { 
  segment: true, 
  group: true 
}));
// Result: [["zìrán"], ["yǔyán"], ["chǔlǐ"]]

// Character-by-character for comparison
console.log(pinyin("自然语言处理"));
// Result: [["zì"], ["rán"], ["yǔ"], ["yán"], ["chǔ"], ["lǐ"]]

Internal Segmentation Process

The segmentation functionality is internal to the pinyin() function and is not exposed as a standalone API. When you enable segmentation through the segment option, the library automatically handles word boundary detection internally before applying Pinyin conversion.
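
Conceptually, the flow can be sketched as segment-first, then convert-per-word. The helpers below are toy stand-ins for illustration, not the library's actual internals or real pinyin APIs:

```javascript
// Hypothetical sketch of a segment-then-convert pipeline.
function segmentThenConvert(text, segmentWords, convertWord) {
  const words = segmentWords(text); // word boundary detection
  return words.map(convertWord);    // per-word conversion
}

// Toy segmenter and converter used only to show the data flow:
const fakeSegment = () => ["北京", "大学"];
const fakeConvert = (word) => [word + "_pinyin"];

const out = segmentThenConvert("北京大学", fakeSegment, fakeConvert);
console.log(out); // [["北京_pinyin"], ["大学_pinyin"]]
```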

Error Handling and Fallbacks

The segmentation system includes robust error handling:

Missing Dependencies

When a specified segmentation library is not available:

// If nodejieba is not installed
console.log(pinyin("测试", { segment: "nodejieba" }));
// Logs: "pinyin v4: 'nodejieba' is peerDependencies"
// Fallback: Returns original text as single segment

Segmentation Failures

If segmentation fails due to errors:

// An unrecognized or failing segment value falls back gracefully
console.log(pinyin("测试", { segment: "invalid-library" }));
// Fallback: the input is treated as a single segment and converted
// character by character

Platform Compatibility

Different libraries work on different platforms:

// Browser environment
console.log(pinyin("测试", { segment: "Intl.Segmenter" }));
// Works in modern browsers

console.log(pinyin("测试", { segment: "nodejieba" }));
// Will fallback - nodejieba only works in Node.js
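
When targeting both browsers and Node.js, you can feature-detect Intl.Segmenter before requesting it. This is a defensive sketch; as described above, the library also falls back on its own:

```javascript
// Pick a segment option based on what the runtime actually supports.
const hasIntlSegmenter =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";

// Use the web-standard segmenter when available; otherwise disable
// segmentation rather than requesting an unavailable backend.
const segmentOption = hasIntlSegmenter ? "Intl.Segmenter" : false;
console.log(segmentOption);
```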

Performance Considerations

Library Performance Comparison

  1. nodejieba: Fastest (C++ implementation)
  2. @node-rs/jieba: Very fast (Rust implementation via NAPI)
  3. Intl.Segmenter: Good (native browser/Node.js API)
  4. segmentit: Good (pure JavaScript)

Memory Usage

  • @node-rs/jieba: Low memory overhead
  • nodejieba: Moderate memory usage
  • segmentit: Higher memory for dictionary loading
  • Intl.Segmenter: Minimal additional memory

Recommendation

For most applications, use segment: true (defaults to Intl.Segmenter) as it provides good performance without additional dependencies.

// Recommended approach
const result = pinyin("中文文本", { segment: true });

For high-performance Node.js applications with many conversions, consider nodejieba or @node-rs/jieba:

// High-performance Node.js
const result = pinyin("中文文本", { segment: "@node-rs/jieba" });

Install with Tessl CLI

npx tessl i tessl/npm-pinyin
