# Fugashi

A high-performance Cython wrapper for MeCab that provides fast, Pythonic Japanese tokenization and morphological analysis. Fugashi exposes MeCab's full tokenization capabilities, with built-in support for UniDic dictionaries and structured extraction of morphological features.

## Package Information

- **Package Name**: fugashi
- **Language**: Python
- **Installation**: `pip install fugashi`
- **Dependencies**: Requires the MeCab system library (bundled automatically in pre-built wheels)
- **Dictionary**: Requires a MeCab dictionary; UniDic is recommended (`pip install 'fugashi[unidic-lite]'`)

## Core Imports

```python
import fugashi
from fugashi import Tagger, GenericTagger, Node, UnidicNode
```

For basic tokenization:

```python
from fugashi import Tagger
```

For advanced dictionary management:

```python
from fugashi import GenericTagger, create_feature_wrapper
```

## Basic Usage

```python
from fugashi import Tagger

# Initialize tagger with UniDic (automatic detection)
tagger = Tagger()

# Tokenize text
text = "麩菓子は、麩を主材料とした日本の菓子。"
nodes = tagger(text)

# Access token information
for node in nodes:
    print(f"Surface: {node.surface}")
    print(f"Lemma: {node.feature.lemma}")
    print(f"POS: {node.pos}")
    print(f"Features: {node.feature}")
    print("---")

# Get formatted output
formatted = tagger.parse(text)
print(formatted)  # Traditional MeCab output format

# Wakati (word-segmented) mode
wakati_tagger = Tagger('-Owakati')
words = wakati_tagger.parse(text)
print(words)  # Space-separated tokens
```

## Architecture

Fugashi provides a layered architecture for Japanese text processing:

- **Taggers**: High-level interfaces (Tagger, GenericTagger) that manage MeCab instances and provide parsing methods
- **Nodes**: Token representations (Node, UnidicNode) containing surface forms, morphological features, and metadata
- **Feature Wrappers**: Named tuple structures (UnidicFeatures17/26/29) providing structured access to dictionary features
- **Dictionary Management**: Functions for building custom dictionaries and accessing dictionary information

This design enables both simple tokenization workflows and sophisticated morphological analysis applications, with automatic dictionary format detection and extensive customization options.
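
As a minimal sketch of how the layers fit together (assuming a UniDic dictionary such as unidic-lite is installed; output depends on the dictionary):

```python
from fugashi import Tagger

# Tagger layer: manages the underlying MeCab instance.
tagger = Tagger()

# Node layer: calling the tagger yields UnidicNode objects.
for node in tagger("水を飲んだ"):
    # Feature-wrapper layer: node.feature is a named tuple
    # (UnidicFeatures17/26/29, chosen by automatic format detection).
    print(node.surface, type(node.feature).__name__, node.feature.pos1)
```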

## Capabilities

### Core Tokenization

Primary tokenization functionality, including text parsing, node-list generation, wakati mode, and n-best parsing. These methods provide the core Japanese text-processing capabilities.

```python { .api }
class Tagger:
    def __init__(self, arg: str = '') -> None: ...
    def __call__(self, text: str) -> List[UnidicNode]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[UnidicNode]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[UnidicNode]]: ...

class GenericTagger:
    def __init__(self, args: str = '', wrapper: Callable = make_tuple, quiet: bool = False) -> None: ...
    def __call__(self, text: str) -> List[Node]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[Node]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[Node]]: ...
```
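
A short usage sketch of these methods (segmentations and output depend on the installed dictionary; `外国人参政権` is just an illustrative, ambiguous input):

```python
from fugashi import Tagger

tagger = Tagger()
text = "外国人参政権"

# parse: classic MeCab-formatted output string
print(tagger.parse(text))

# Calling the tagger and parseToNodeList return the same node list
assert ([n.surface for n in tagger(text)]
        == [n.surface for n in tagger.parseToNodeList(text)])

# nbest: top-N analyses as one formatted string
print(tagger.nbest(text, num=3))

# nbestToNodeList: the same analyses as lists of nodes
for candidate in tagger.nbestToNodeList(text, num=3):
    print([n.surface for n in candidate])
```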

[Tokenization](./tokenization.md)

### Nodes and Features

Token representation and morphological feature access including surface forms, part-of-speech information, lemmas, pronunciation data, and grammatical features. These provide detailed linguistic information for each token.

```python { .api }
class Node:
    @property
    def surface(self) -> str: ...
    @property
    def feature(self) -> NamedTuple: ...
    @property
    def feature_raw(self) -> str: ...
    @property
    def length(self) -> int: ...
    @property
    def rlength(self) -> int: ...
    @property
    def posid(self) -> int: ...
    @property
    def char_type(self) -> int: ...
    @property
    def stat(self) -> int: ...
    @property
    def is_unk(self) -> bool: ...
    @property
    def white_space(self) -> str: ...

class UnidicNode(Node):
    @property
    def pos(self) -> str: ...

UnidicFeatures17 = NamedTuple('UnidicFeatures17', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str)
])

UnidicFeatures26 = NamedTuple('UnidicFeatures26', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('kana', str), ('kanaBase', str), ('form', str), ('formBase', str),
    ('iConType', str), ('fConType', str), ('aType', str), ('aConType', str), ('aModeType', str)
])

UnidicFeatures29 = NamedTuple('UnidicFeatures29', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('iConType', str), ('fConType', str), ('type', str), ('kana', str), ('kanaBase', str),
    ('form', str), ('formBase', str), ('aType', str), ('aConType', str),
    ('aModType', str), ('lid', str), ('lemma_id', str)
])
```
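
For illustration, a sketch that reads node metadata and feature fields; which fields exist depends on the UniDic format in use (17, 26, or 29 fields), so `getattr` with a default is a safe access pattern:

```python
from fugashi import Tagger

tagger = Tagger()

for node in tagger("ヴェネツィアへ行く"):
    print(node.surface)
    print("  length:", node.length, "rlength:", node.rlength)
    print("  posid:", node.posid, "unknown:", node.is_unk)
    # Feature fields are named-tuple attributes; kana only exists
    # in the 26- and 29-field UniDic formats, so fall back gracefully.
    print("  lemma:", getattr(node.feature, "lemma", None))
    print("  kana:", getattr(node.feature, "kana", None))
```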

[Nodes and Features](./nodes-features.md)

### Dictionary Management

Dictionary configuration, information access, and custom dictionary building. These functions enable advanced dictionary management and customization for specific use cases.

```python { .api }
def create_feature_wrapper(name: str, fields: List[str], default: Any = None) -> NamedTuple: ...
def try_import_unidic() -> Optional[str]: ...
def build_dictionary(args: str) -> None: ...

class Tagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...

class GenericTagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...
```
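
A sketch pairing `create_feature_wrapper` with `GenericTagger`; the field names below are hypothetical placeholders for whatever columns your dictionary's CSV actually defines:

```python
from fugashi import GenericTagger, create_feature_wrapper

# Hypothetical field names -- substitute your dictionary's real columns.
MyFeatures = create_feature_wrapper('MyFeatures', ['pos1', 'pos2', 'base', 'reading'])

tagger = GenericTagger(wrapper=MyFeatures)

# Inspect the loaded dictionaries (path, charset, entry counts, etc.).
for entry in tagger.dictionary_info:
    print(entry)

for node in tagger("日本語"):
    print(node.surface, node.feature.pos1)
```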

[Dictionary Management](./dictionary-management.md)

### Command Line Interface

Console scripts for command-line text processing, dictionary information, and dictionary building. These provide direct access to fugashi functionality from the terminal.

```python { .api }
def main():
    """Command-line interface for text tokenization.

    Console script: fugashi

    Processes text from stdin, treating each line as a sentence.
    Supports all MeCab options via command-line arguments.

    Examples:
        echo "日本語" | fugashi
        echo "日本語" | fugashi -Owakati
    """
    ...

def info():
    """Display dictionary and configuration information.

    Console script: fugashi-info

    Shows detailed information about loaded dictionaries including
    version, size, charset, and file paths.

    Example:
        fugashi-info
    """
    ...

def build_dict():
    """Build a custom MeCab user dictionary from CSV input.

    Console script: fugashi-build-dict

    Compiles CSV dictionary sources into MeCab binary format.
    Defaults to UTF-8 encoding for input and output.

    Example:
        fugashi-build-dict -o custom.dic input.csv
    """
    ...
```