Tessl Tile for npm/pinyin@4.0.0

# Text Segmentation

Integrated support for multiple Chinese text segmentation libraries to improve conversion accuracy for phrases and compound words by recognizing word boundaries.

## Capabilities

### Segmentation Libraries

Support for multiple Chinese word segmentation libraries with automatic fallback handling.

```typescript { .api }
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";
```

**Note**: Segmentation is handled internally by the main `pinyin()` function when the `segment` option is provided. There is no standalone `segment` function exported from the package.
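
Because the library name is a plain string union, a value read from configuration or user input may need checking before it is passed as `segment`. The following is a minimal sketch of a runtime type guard; `SEGMENT_LIBRARIES` and `isPinyinSegment` are hypothetical names introduced here and are not part of the `pinyin` package.

```typescript
// Mirrors the union type documented above.
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

// Hypothetical helper data: the four documented library names.
const SEGMENT_LIBRARIES: readonly IPinyinSegment[] = [
  "nodejieba",
  "segmentit",
  "@node-rs/jieba",
  "Intl.Segmenter",
];

// Hypothetical type guard: narrows an arbitrary value to IPinyinSegment.
function isPinyinSegment(value: unknown): value is IPinyinSegment {
  return typeof value === "string" && SEGMENT_LIBRARIES.some((name) => name === value);
}

// Usage: validate a setting that arrives as an untyped string.
const fromConfig: unknown = "segmentit";
if (isPinyinSegment(fromConfig)) {
  // Inside this branch `fromConfig` has type IPinyinSegment.
  console.log(`segmentation library: ${fromConfig}`);
}
```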

### Segmentation Options

**nodejieba**
- **Type**: C++ implementation
- **Performance**: Very fast (native code)
- **Dependency**: Requires `nodejieba` as peer dependency
- **Platform**: Node.js only

**@node-rs/jieba**
- **Type**: Rust implementation via NAPI
- **Performance**: Very fast (native code)
- **Dependency**: Requires `@node-rs/jieba` as peer dependency
- **Platform**: Node.js only

**segmentit**
- **Type**: Pure JavaScript
- **Performance**: Good
- **Dependency**: Requires `segmentit` as peer dependency
- **Platform**: Node.js and browsers

**Intl.Segmenter** (used when `segment: true`)
- **Type**: Web standard API
- **Performance**: Good
- **Dependency**: Built into modern environments
- **Platform**: Modern browsers and Node.js 16+
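
To see what word-level segmentation produces on its own, `Intl.Segmenter` can be called directly, outside of `pinyin()`. This is a sketch under stated assumptions: `splitWords` and the small structural types are hypothetical and declared locally so the block compiles without the ES2022 `Intl` typings. The exact word boundaries depend on the runtime's dictionary, so no expected output is shown.

```typescript
// Minimal structural types (assumptions for this sketch), declared locally so
// the code compiles even when the ES2022 Intl typings are not enabled.
interface WordSegment { segment: string }
interface WordSegmenter { segment(input: string): Iterable<WordSegment> }
type SegmenterCtor = new (locale: string, options: { granularity: "word" }) => WordSegmenter;

// Feature detection: returns the constructor if the runtime provides it.
function getSegmenterCtor(): SegmenterCtor | undefined {
  const intl = (typeof Intl !== "undefined" ? Intl : undefined) as unknown as
    | { Segmenter?: SegmenterCtor }
    | undefined;
  return intl?.Segmenter;
}

// Hypothetical helper: splits text into word-level segments, or returns the
// whole text as one segment when Intl.Segmenter is unavailable.
function splitWords(text: string, locale = "zh"): string[] {
  const Segmenter = getSegmenterCtor();
  if (!Segmenter) return [text];
  const segmenter = new Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

console.log(splitWords("我喜欢编程"));
```

Whatever boundaries the runtime picks, the segments concatenate back to the original string, which makes the helper easy to sanity-check.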

### Segmentation Configuration

Enable segmentation through the `segment` option of the main `pinyin()` function:

```typescript { .api }
interface IPinyinOptions {
  /** Text segmentation library for phrase recognition */
  segment?: IPinyinSegment | boolean;
}
```

**Configuration Options:**

- `false` (default): No segmentation, character-by-character conversion
- `true`: Enable segmentation using `Intl.Segmenter` (recommended)
- `"nodejieba"`: Use nodejieba library (native C++, Node.js only)
- `"@node-rs/jieba"`: Use Rust-based jieba (native Rust, Node.js only)
- `"segmentit"`: Use pure JavaScript segmentation (cross-platform)
- `"Intl.Segmenter"`: Use web standard segmentation (modern environments)
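
The option values listed above can be summarized as a small mapping from the `segment` setting to the library that ends up being used. The sketch below only restates that list; `resolveSegmentOption` is a hypothetical helper and makes no claim about how the package resolves the option internally.

```typescript
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

// Hypothetical helper: returns the library implied by a `segment` setting,
// or undefined when segmentation is disabled.
function resolveSegmentOption(segment?: IPinyinSegment | boolean): IPinyinSegment | undefined {
  if (segment === undefined || segment === false) return undefined; // default: no segmentation
  if (segment === true) return "Intl.Segmenter"; // per the option list above
  return segment; // an explicit library name is used as given
}

console.log(resolveSegmentOption(true)); // "Intl.Segmenter"
console.log(resolveSegmentOption("segmentit")); // "segmentit"
console.log(resolveSegmentOption()); // undefined
```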

## Usage Examples

### Basic Segmentation

```typescript
import pinyin from "pinyin";

// Without segmentation (character-by-character)
console.log(pinyin("我喜欢编程"));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biān"], ["chéng"]]

// With segmentation (phrase-aware)
console.log(pinyin("我喜欢编程", { segment: true }));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biānchéng"]]

// With specific segmentation library
console.log(pinyin("我喜欢编程", { segment: "nodejieba" }));
// Result: [["wǒ"], ["xǐhuān"], ["biānchéng"]]
```

### Phrase Recognition Benefits

Segmentation significantly improves accuracy for compound words and phrases:

```typescript
// Without segmentation - less accurate
console.log(pinyin("北京大学"));
// Result: [["běi"], ["jīng"], ["dà"], ["xué"]]

// With segmentation - more accurate phrase recognition
console.log(pinyin("北京大学", { segment: true }));
// Result: [["běijīng"], ["dàxué"]]

// Complex phrases
console.log(pinyin("人工智能技术", { segment: "nodejieba" }));
// Result: [["réngōng"], ["zhìnéng"], ["jìshù"]]
```

### Segmentation with Grouping

Combine segmentation with the `group` option for phrase-level Pinyin:

```typescript
// Segmentation with phrase grouping
console.log(pinyin("自然语言处理", {
  segment: true,
  group: true
}));
// Result: [["zìrán"], ["yǔyán"], ["chǔlǐ"]]

// Character-by-character for comparison
console.log(pinyin("自然语言处理"));
// Result: [["zì"], ["rán"], ["yǔ"], ["yán"], ["chǔ"], ["lǐ"]]
```

### Internal Segmentation Process

Segmentation is internal to the `pinyin()` function and is not exposed as a standalone API. When the `segment` option is enabled, the library first detects word boundaries and then applies Pinyin conversion to the resulting segments.
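
The segment-then-convert order matters most for characters with more than one reading. The toy sketch below illustrates the idea only: the two tiny dictionaries and `convertSegments` are assumptions made up for this example, not the library's data or implementation.

```typescript
// Toy dictionaries (illustrative assumptions; the real library ships far larger data).
const CHAR_PINYIN: Record<string, string> = { "银": "yín", "行": "xíng", "走": "zǒu" };
const PHRASE_PINYIN: Record<string, string[]> = {
  "银行": ["yín", "háng"], // 行 reads "háng" inside this word
  "行走": ["xíng", "zǒu"],
};

// Hypothetical converter: looks up each segment as a phrase first and falls
// back to per-character readings when the segment is not a known phrase.
function convertSegments(segments: string[]): string[][] {
  const result: string[][] = [];
  for (const word of segments) {
    const phrase = PHRASE_PINYIN[word];
    if (phrase) {
      for (const syllable of phrase) result.push([syllable]);
    } else {
      for (const char of Array.from(word)) result.push([CHAR_PINYIN[char] ?? char]);
    }
  }
  return result;
}

// Character-by-character input picks the default reading of 行.
console.log(convertSegments(["银", "行"])); // [["yín"], ["xíng"]]
// A segmented word lets the phrase dictionary select the correct reading.
console.log(convertSegments(["银行"])); // [["yín"], ["háng"]]
```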

## Error Handling and Fallbacks

When a segmenter is unavailable or fails, the library falls back to unsegmented text instead of throwing:

### Missing Dependencies

When a specified segmentation library is not available:

```typescript
// If nodejieba is not installed
console.log(pinyin("测试", { segment: "nodejieba" }));
// Logs: "pinyin v4: 'nodejieba' is peerDependencies"
// Fallback: Returns original text as single segment
```

### Segmentation Failures

If segmentation fails due to errors:

```typescript
// Internal behavior, shown for illustration only: the `segment` helper
// is not exported from the package and cannot be called directly.
//   segment("测试", "invalid-library")
// Fallback: Returns original text as array with single element
// Result: ["测试"]
```
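
The fallback described above, returning the original text as a single segment when the segmenter fails, can be sketched as a small wrapper. `Segmenter` and `segmentWithFallback` are hypothetical names for this illustration; the package's internal code may be structured differently.

```typescript
// Any function that splits text into words (assumption for this sketch).
type Segmenter = (text: string) => string[];

// Hypothetical wrapper: never throws; on failure the whole text is one segment.
function segmentWithFallback(text: string, segmenter: Segmenter): string[] {
  try {
    const words = segmenter(text);
    // Treat an empty result like a failure so callers always get the text back.
    return words.length > 0 ? words : [text];
  } catch (error) {
    console.warn(`segmentation failed, using unsegmented text: ${String(error)}`);
    return [text];
  }
}

// A segmenter that simulates a missing or broken library.
const broken: Segmenter = () => {
  throw new Error("library not installed");
};

console.log(segmentWithFallback("测试", broken)); // ["测试"]
```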

### Platform Compatibility

Different libraries work on different platforms:

```typescript
// Browser environment
console.log(pinyin("测试", { segment: "Intl.Segmenter" }));
// Works in modern browsers

console.log(pinyin("测试", { segment: "nodejieba" }));
// Will fall back - nodejieba only works in Node.js
```
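
One way to avoid relying on the fallback is to pick the `segment` value from what the current platform supports. The sketch below is a pure function over an explicit environment description, so it can be tested anywhere. `Environment`, `chooseSegment`, and the preference order are assumptions of this example, not behavior of the package.

```typescript
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

// Hypothetical description of the runtime, supplied by the caller.
interface Environment {
  isNode: boolean; // running under Node.js
  hasIntlSegmenter: boolean; // Intl.Segmenter is available
  installed: readonly IPinyinSegment[]; // peer dependencies known to be installed
  preferNative?: boolean; // opt in to the native libraries on Node.js
}

// Hypothetical selector. Assumed preference order: native libraries only when
// requested and on Node.js, then Intl.Segmenter, then segmentit, else none.
function chooseSegment(env: Environment): IPinyinSegment | false {
  const has = (name: IPinyinSegment) => env.installed.some((item) => item === name);
  if (env.isNode && env.preferNative) {
    if (has("@node-rs/jieba")) return "@node-rs/jieba";
    if (has("nodejieba")) return "nodejieba";
  }
  if (env.hasIntlSegmenter) return "Intl.Segmenter";
  if (has("segmentit")) return "segmentit";
  return false; // character-by-character conversion
}

// A modern browser with no extra packages.
console.log(chooseSegment({ isNode: false, hasIntlSegmenter: true, installed: [] }));
// "Intl.Segmenter"
```

The returned value can be passed directly as the `segment` option, since `false` is also a valid setting.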

## Performance Considerations

### Library Performance Comparison

1. **@node-rs/jieba** and **nodejieba**: Very fast (native Rust and C++ implementations)
2. **Intl.Segmenter**: Good (native browser/Node.js)
3. **segmentit**: Good (pure JavaScript)

### Memory Usage

- **@node-rs/jieba**: Low memory overhead
- **nodejieba**: Moderate memory usage
- **segmentit**: Higher memory for dictionary loading
- **Intl.Segmenter**: Minimal additional memory

### Recommendation

For most applications, use `segment: true` (which uses `Intl.Segmenter`), as it provides good performance without additional dependencies.

```typescript
// Recommended approach
const result = pinyin("中文文本", { segment: true });
```

For high-performance Node.js applications with many conversions, consider `nodejieba` or `@node-rs/jieba`:

```typescript
// High-performance Node.js
const result = pinyin("中文文本", { segment: "@node-rs/jieba" });
```
