# Text Segmentation

The package integrates with several Chinese text segmentation libraries. By recognizing word boundaries, segmentation improves conversion accuracy for phrases and compound words.

## Capabilities

### Segmentation Libraries

The `segment` option accepts any of the following library names. If the chosen library is unavailable, conversion falls back automatically.

```typescript { .api }
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";
```

**Note**: Segmentation is handled internally by the main `pinyin()` function when the `segment` option is provided. There is no standalone `segment` function exported from the package.

### Segmentation Options

**nodejieba**

- **Type**: C++ implementation
- **Performance**: Very fast
- **Dependency**: Requires `nodejieba` as a peer dependency
- **Platform**: Node.js only

**@node-rs/jieba**

- **Type**: Rust implementation via NAPI
- **Performance**: Fastest
- **Dependency**: Requires `@node-rs/jieba` as a peer dependency
- **Platform**: Node.js only

**segmentit**

- **Type**: Pure JavaScript
- **Performance**: Good
- **Dependency**: Requires `segmentit` as a peer dependency
- **Platform**: Node.js and browsers

**Intl.Segmenter (used when `segment: true`)**

- **Type**: Web standard API
- **Performance**: Good
- **Dependency**: Built into modern environments
- **Platform**: Modern browsers and Node.js 16+
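
The platform constraints above suggest a simple selection rule. The following is a minimal sketch, not part of the package: `detectRuntime` and `pickSegmenter` are hypothetical helpers, and the preference order (native library in Node.js when requested, otherwise `Intl.Segmenter`, otherwise `segmentit`) is an assumption you may want to change.

```typescript
// Hypothetical helpers for illustration; not exported by the `pinyin` package.
type SegmenterName = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

interface RuntimeInfo {
  isNode: boolean;           // true when running under Node.js
  hasIntlSegmenter: boolean; // true when the Intl.Segmenter API exists
}

// Inspect the current environment without assuming any typings are present.
function detectRuntime(): RuntimeInfo {
  const g = globalThis as any;
  return {
    isNode: typeof g.process?.versions?.node === "string",
    hasIntlSegmenter: typeof g.Intl?.Segmenter === "function",
  };
}

// Pick a segmenter name that is valid for the given runtime.
function pickSegmenter(runtime: RuntimeInfo, preferNative = false): SegmenterName {
  // Native bindings only load in Node.js (and must be installed separately).
  if (preferNative && runtime.isNode) return "@node-rs/jieba";
  // The built-in API needs no extra dependency.
  if (runtime.hasIntlSegmenter) return "Intl.Segmenter";
  // Pure JavaScript fallback that also runs in older browsers.
  return "segmentit";
}

console.log(pickSegmenter(detectRuntime()));
```

The returned name can then be passed as the `segment` option of `pinyin()`.
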

### Segmentation Configuration

Segmentation is enabled through the `segment` option of the main `pinyin()` function:

```typescript { .api }
interface IPinyinOptions {
  /** Text segmentation library for phrase recognition */
  segment?: IPinyinSegment | boolean;
}
```

**Configuration Options:**

- `false` (default): No segmentation; conversion is character by character
- `true`: Enable segmentation using `Intl.Segmenter` (recommended)
- `"nodejieba"`: Use the nodejieba library (very fast, Node.js only)
- `"@node-rs/jieba"`: Use Rust-based jieba (fastest, Node.js only)
- `"segmentit"`: Use pure JavaScript segmentation (cross-platform)
- `"Intl.Segmenter"`: Use web standard segmentation (modern environments)
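
The mapping from option value to segmenter described in the list above can be sketched as a small function. This is an illustration of the documented behaviour, not the library's source; `resolveSegmenter` is a hypothetical name.

```typescript
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

// Hypothetical helper: returns the segmenter name a given `segment` option
// selects, or null when segmentation is disabled.
function resolveSegmenter(segment?: IPinyinSegment | boolean): IPinyinSegment | null {
  // Omitted or `false`: character-by-character conversion.
  if (segment === undefined || segment === false) return null;
  // `true`: the documented default segmenter.
  if (segment === true) return "Intl.Segmenter";
  // Otherwise an explicit library name was given.
  return segment;
}

console.log(resolveSegmenter(true));        // "Intl.Segmenter"
console.log(resolveSegmenter("segmentit")); // "segmentit"
console.log(resolveSegmenter());            // null
```
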

## Usage Examples

### Basic Segmentation

```typescript
import pinyin from "pinyin";

// Without segmentation (character-by-character)
console.log(pinyin("我喜欢编程"));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biān"], ["chéng"]]

// With segmentation (phrase-aware readings)
console.log(pinyin("我喜欢编程", { segment: true }));
// Result: same shape as above, one entry per character;
// add `group: true` to get one entry per detected word

// With a specific segmentation library
console.log(pinyin("我喜欢编程", { segment: "nodejieba" }));
// Result: same shape as above; word boundaries come from nodejieba
```

### Phrase Recognition Benefits

Segmentation improves accuracy for compound words and phrases, because each character's reading is chosen in the context of the word it belongs to:

```typescript
// Without segmentation: each character is converted in isolation
console.log(pinyin("北京大学"));
// Result: [["běi"], ["jīng"], ["dà"], ["xué"]]

// With segmentation: readings follow the detected words (e.g. 北京 | 大学)
console.log(pinyin("北京大学", { segment: true }));
// Result: still one entry per character unless `group: true` is set

// Complex phrases
console.log(pinyin("人工智能技术", { segment: "nodejieba" }));
// Result: readings follow the words nodejieba detects (e.g. 人工 | 智能 | 技术)
```

### Segmentation with Grouping

Combine segmentation with the `group` option to get one Pinyin entry per word:

```typescript
// Segmentation with phrase grouping
console.log(pinyin("自然语言处理", {
  segment: true,
  group: true
}));
// Result: [["zìrán"], ["yǔyán"], ["chǔlǐ"]]

// Character-by-character for comparison
console.log(pinyin("自然语言处理"));
// Result: [["zì"], ["rán"], ["yǔ"], ["yán"], ["chǔ"], ["lǐ"]]
```

### Internal Segmentation Process

Segmentation is internal to the `pinyin()` function and is not exposed as a standalone API. When the `segment` option is set, the library first detects word boundaries and then converts each word to Pinyin.
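
To give a feel for what word-boundary detection produces, here is a minimal sketch that calls the standard `Intl.Segmenter` API directly. It is an illustration only and not the library's internal code: `splitWords` is a hypothetical helper, the small interfaces exist just so the sketch compiles without ES2022 Intl typings, and the exact boundaries depend on the runtime's dictionary data.

```typescript
// Minimal structural types for the parts of Intl.Segmenter used below.
interface WordSegment { segment: string; isWordLike?: boolean }
interface WordSegmenter { segment(input: string): Iterable<WordSegment> }
type WordSegmenterCtor = new (locale: string, options: { granularity: "word" }) => WordSegmenter;

// Hypothetical helper: split text into word-level segments.
function splitWords(text: string, locale = "zh"): string[] {
  const SegmenterCtor = (Intl as any).Segmenter as WordSegmenterCtor | undefined;
  // Older runtimes have no Intl.Segmenter: treat the text as one segment.
  if (typeof SegmenterCtor !== "function") return [text];
  const segmenter = new SegmenterCtor(locale, { granularity: "word" });
  // Each yielded item carries the substring for one segment.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

console.log(splitWords("我喜欢编程"));
```

Joining the returned segments always reproduces the input, so the segmentation only decides where the cuts fall.
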

## Error Handling and Fallbacks

Segmentation errors do not abort conversion; the library falls back as described below.

### Missing Dependencies

When a specified segmentation library is not available:

```typescript
// If nodejieba is not installed
console.log(pinyin("测试", { segment: "nodejieba" }));
// Logs: "pinyin v4: 'nodejieba' is peerDependencies"
// Fallback: the original text is treated as a single segment
```

### Segmentation Failures

If the segmentation library throws an error at runtime:

```typescript
// Segmentation is internal, so a failure surfaces through pinyin(),
// not through a standalone segment() call
console.log(pinyin("测试", { segment: "segmentit" }));
// Fallback: the segmenter yields the original text as a single segment
// Segments: ["测试"]
```
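
The fallback behaviour described in this section follows a common try/catch pattern. The sketch below shows that pattern under stated assumptions: `segmentWithFallback` and `SegmentFn` are hypothetical names, and the warning text is made up for the example rather than copied from the library.

```typescript
// Any function that turns text into a list of words.
type SegmentFn = (text: string) => string[];

// Hypothetical helper: run a segmenter, and fall back to a single segment
// containing the whole text if it throws or returns nothing.
function segmentWithFallback(text: string, segmenter: SegmentFn): string[] {
  try {
    const words = segmenter(text);
    return words.length > 0 ? words : [text];
  } catch (error) {
    console.warn(`segmentation failed, using the whole text as one segment: ${String(error)}`);
    return [text];
  }
}

// A segmenter that always fails, standing in for a missing or broken library.
const brokenSegmenter: SegmentFn = () => {
  throw new Error("library not installed");
};

console.log(segmentWithFallback("测试", brokenSegmenter)); // ["测试"]
```
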

### Platform Compatibility

Different libraries work on different platforms:

```typescript
// Browser environment
console.log(pinyin("测试", { segment: "Intl.Segmenter" }));
// Works in modern browsers

console.log(pinyin("测试", { segment: "nodejieba" }));
// Falls back: nodejieba only works in Node.js
```

## Performance Considerations

### Library Performance Comparison

1. **@node-rs/jieba**: Fastest (Rust implementation)
2. **nodejieba**: Very fast (C++ implementation)
3. **Intl.Segmenter**: Good (native browser/Node.js)
4. **segmentit**: Good (pure JavaScript)
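
Relative speed depends on the input text and the runtime, so it can be worth measuring with your own data. The sketch below is a hypothetical timing helper (`timeSegmenter` is not part of the package). It accepts any function, so in practice you would pass something like `(t) => pinyin(t, { segment: "nodejieba" })`; the stand-in used here only splits characters so that the block runs without extra dependencies.

```typescript
// Hypothetical helper: average wall-clock milliseconds per call.
function timeSegmenter(
  run: (text: string) => unknown,
  text: string,
  iterations = 100
): number {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    run(text);
  }
  return (Date.now() - start) / iterations;
}

// Stand-in workload; replace with a call to pinyin() using the segmenter under test.
const splitChars = (text: string): string[] => Array.from(text);

const avgMs = timeSegmenter(splitChars, "中文文本", 1000);
console.log(`average: ${avgMs.toFixed(4)} ms per call`);
```

`Date.now()` has millisecond resolution, so use enough iterations for the total to be well above a few milliseconds.
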

### Memory Usage

- **@node-rs/jieba**: Low memory overhead
- **nodejieba**: Moderate memory usage
- **segmentit**: Higher memory for dictionary loading
- **Intl.Segmenter**: Minimal additional memory

### Recommendation

For most applications, use `segment: true` (defaults to `Intl.Segmenter`) as it provides good performance without additional dependencies.

```typescript
// Recommended approach
const result = pinyin("中文文本", { segment: true });
```

For high-performance Node.js applications with many conversions, consider `nodejieba` or `@node-rs/jieba`:

```typescript
// High-performance Node.js
const result = pinyin("中文文本", { segment: "@node-rs/jieba" });
```