# Text Segmentation

The package integrates with several Chinese text segmentation libraries. Recognizing word boundaries improves conversion accuracy for phrases and compound words.

## Capabilities

### Segmentation Libraries

Support for multiple Chinese word segmentation libraries with automatic fallback handling.

```typescript { .api }
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";
```

**Note**: Segmentation is handled internally by the main `pinyin()` function when the `segment` option is provided. There is no standalone `segment` function exported from the package.

### Segmentation Options

**nodejieba (Default)**

- **Type**: C++ implementation
- **Performance**: Fastest
- **Dependency**: Requires `nodejieba` as peer dependency
- **Platform**: Node.js only

**@node-rs/jieba**

- **Type**: Rust implementation via NAPI
- **Performance**: Very fast
- **Dependency**: Requires `@node-rs/jieba` as peer dependency
- **Platform**: Node.js only

**segmentit**

- **Type**: Pure JavaScript
- **Performance**: Good
- **Dependency**: Requires `segmentit` as peer dependency
- **Platform**: Node.js and browsers

**Intl.Segmenter**

- **Type**: Web standard API
- **Performance**: Good
- **Dependency**: Built into modern environments
- **Platform**: Modern browsers and Node.js 16+

### Segmentation Configuration

Segmentation can be enabled through the main `pinyin()` function options:

```typescript { .api }
interface IPinyinOptions {
  /** Text segmentation library for phrase recognition */
  segment?: IPinyinSegment | boolean;
}
```

**Configuration Options:**

- `false` (default): No segmentation, character-by-character conversion
- `true`: Enable segmentation using `Intl.Segmenter` (recommended default)
- `"nodejieba"`: Use nodejieba library (fastest, Node.js only)
- `"@node-rs/jieba"`: Use Rust-based jieba (very fast, Node.js only)
- `"segmentit"`: Use pure JavaScript segmentation (cross-platform)
- `"Intl.Segmenter"`: Use web standard segmentation (modern environments)
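
The mapping in the list above can be sketched as a small normalization function. This is only an illustration of the documented option values: `resolveSegment` is a hypothetical helper name, not a function of the package, and the package's internal logic may differ.

```typescript
// Hypothetical helper (not part of the pinyin package). It mirrors the
// documented meaning of each `segment` option value.
type IPinyinSegment = "nodejieba" | "segmentit" | "@node-rs/jieba" | "Intl.Segmenter";

function resolveSegment(option?: IPinyinSegment | boolean): IPinyinSegment | undefined {
  // `false` or omitted: no segmentation, character-by-character conversion.
  if (option === undefined || option === false) return undefined;
  // `true`: documented as selecting Intl.Segmenter.
  if (option === true) return "Intl.Segmenter";
  // A library name is used as given.
  return option;
}

console.log(resolveSegment(true)); // "Intl.Segmenter"
console.log(resolveSegment("segmentit")); // "segmentit"
console.log(resolveSegment(false)); // undefined
```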

## Usage Examples

### Basic Segmentation

```typescript
import pinyin from "pinyin";

// Without segmentation (character-by-character)
console.log(pinyin("我喜欢编程"));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biān"], ["chéng"]]

// With segmentation (phrase-aware)
console.log(pinyin("我喜欢编程", { segment: true }));
// Result: [["wǒ"], ["xǐ"], ["huān"], ["biānchéng"]]

// With specific segmentation library
console.log(pinyin("我喜欢编程", { segment: "nodejieba" }));
// Result: [["wǒ"], ["xǐhuān"], ["biānchéng"]]
```

### Phrase Recognition Benefits

Segmentation significantly improves accuracy for compound words and phrases:

```typescript
// Without segmentation - less accurate
console.log(pinyin("北京大学"));
// Result: [["běi"], ["jīng"], ["dà"], ["xué"]]

// With segmentation - more accurate phrase recognition
console.log(pinyin("北京大学", { segment: true }));
// Result: [["běijīng"], ["dàxué"]]

// Complex phrases
console.log(pinyin("人工智能技术", { segment: "nodejieba" }));
// Result: [["réngōng"], ["zhìnéng"], ["jìshù"]]
```

### Segmentation with Grouping

Combine segmentation with the `group` option for phrase-level Pinyin:

```typescript
// Segmentation with phrase grouping
console.log(pinyin("自然语言处理", {
  segment: true,
  group: true
}));
// Result: [["zìrán"], ["yǔyán"], ["chǔlǐ"]]

// Character-by-character for comparison
console.log(pinyin("自然语言处理"));
// Result: [["zì"], ["rán"], ["yǔ"], ["yán"], ["chǔ"], ["lǐ"]]
```

### Internal Segmentation Process

Segmentation is internal to the `pinyin()` function and is not exposed as a standalone API. When the `segment` option is set, the library detects word boundaries first and then applies Pinyin conversion to the resulting segments.
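
To make "word boundary detection" concrete, the sketch below uses the standard `Intl.Segmenter` API directly. It illustrates the general technique only and is not the package's internal code; `splitWords` is a hypothetical helper name, and the exact segments returned depend on the runtime's ICU data.

```typescript
// Sketch only: word-boundary detection with the standard Intl.Segmenter API.
// `splitWords` is a hypothetical helper, not a function of the pinyin package.
function splitWords(text: string, locale: string = "zh"): string[] {
  // Cast to `any` so the sketch compiles even without the ES2022 Intl typings.
  const SegmenterCtor = (Intl as any).Segmenter;
  const segmenter = new SegmenterCtor(locale, { granularity: "word" });
  // Each item yielded by segment() carries the matched substring in `.segment`.
  return Array.from(segmenter.segment(text), (part: any) => part.segment as string);
}

const words = splitWords("我喜欢编程");
console.log(words);
// Joining the segments always reproduces the input text.
console.log(words.join("") === "我喜欢编程");
```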

## Error Handling and Fallbacks

The segmentation step handles the following failure cases:

### Missing Dependencies

When a specified segmentation library is not available:

```typescript
// If nodejieba is not installed
console.log(pinyin("测试", { segment: "nodejieba" }));
// Logs: "pinyin v4: 'nodejieba' is peerDependencies"
// Fallback: Returns original text as single segment
```

### Segmentation Failures

If the segmentation library throws an error:

```typescript
// Illustrates the internal segmentation step only: `segment` is an internal
// helper and is not exported from the package.
console.log(segment("测试", "invalid-library"));
// Fallback: Returns original text as array with single element
// Result: ["测试"]
```
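
The fallback pattern described in this section can be sketched as a try/catch wrapper. This is a minimal illustration, not the package's source: `segmentWithFallback` and the `Segmenter` type are hypothetical names introduced here.

```typescript
// Hypothetical wrapper illustrating the documented fallback behavior:
// if the segmenter throws, return the original text as a single segment.
type Segmenter = (text: string) => string[];

function segmentWithFallback(text: string, segmenter: Segmenter): string[] {
  try {
    return segmenter(text);
  } catch (err) {
    console.error("segmentation failed, falling back to a single segment:", err);
    return [text];
  }
}

// A segmenter that always fails, standing in for a missing peer dependency.
const broken: Segmenter = () => {
  throw new Error("library not installed");
};

console.log(segmentWithFallback("测试", broken)); // ["测试"]
```

Returning the whole text as one segment keeps the caller's code path the same whether or not segmentation succeeded.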

### Platform Compatibility

Different libraries work on different platforms:

```typescript
// Browser environment
console.log(pinyin("测试", { segment: "Intl.Segmenter" }));
// Works in modern browsers

console.log(pinyin("测试", { segment: "nodejieba" }));
// Will fall back - nodejieba only works in Node.js
```
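
One way to act on these platform differences is to check for `Intl.Segmenter` at runtime and choose the `segment` value accordingly. The sketch below assumes that `segmentit` (which must still be installed as a peer dependency) is the preferred choice when `Intl.Segmenter` is missing; `hasIntlSegmenter` and `pickPortableSegmenter` are hypothetical helper names, not part of the package.

```typescript
// Hypothetical helpers (not part of the pinyin package).

// Feature-detect the standard Intl.Segmenter API.
function hasIntlSegmenter(): boolean {
  return typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function";
}

// Choose between the two options documented as working in browsers:
// the built-in Intl.Segmenter, or the pure-JavaScript segmentit library.
function pickPortableSegmenter(
  intlAvailable: boolean = hasIntlSegmenter()
): "Intl.Segmenter" | "segmentit" {
  return intlAvailable ? "Intl.Segmenter" : "segmentit";
}

console.log(pickPortableSegmenter());
```

The returned value could then be passed as the `segment` option, for example `pinyin(text, { segment: pickPortableSegmenter() })`.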

## Performance Considerations

### Library Performance Comparison

1. **@node-rs/jieba**: Fastest (Rust implementation)
2. **nodejieba**: Very fast (C++ implementation)
3. **Intl.Segmenter**: Good (native browser/Node.js)
4. **segmentit**: Good (pure JavaScript)

### Memory Usage

- **@node-rs/jieba**: Low memory overhead
- **nodejieba**: Moderate memory usage
- **segmentit**: Higher memory for dictionary loading
- **Intl.Segmenter**: Minimal additional memory

### Recommendation

For most applications, use `segment: true` (defaults to `Intl.Segmenter`) as it provides good performance without additional dependencies.

```typescript
// Recommended approach
const result = pinyin("中文文本", { segment: true });
```

For high-performance Node.js applications with many conversions, consider `nodejieba` or `@node-rs/jieba`:

```typescript
// High-performance Node.js
const result = pinyin("中文文本", { segment: "@node-rs/jieba" });
```