Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
npx @tessl/cli install tessl/pypi-fugashi@1.5.00
# Fugashi

A high-performance Cython wrapper for MeCab, providing fast and pythonic Japanese tokenization and morphological analysis. Fugashi offers comprehensive access to MeCab's tokenization capabilities with built-in support for UniDic dictionaries and extensive morphological feature extraction.

## Package Information

- **Package Name**: fugashi
- **Language**: Python
- **Installation**: `pip install fugashi`
- **Dependencies**: Requires the MeCab system library (bundled automatically in pre-built wheels)
- **Dictionary**: Requires a MeCab dictionary; UniDic is recommended (`pip install 'fugashi[unidic-lite]'`)

## Core Imports

```python
import fugashi
from fugashi import Tagger, GenericTagger, Node, UnidicNode
```

For basic tokenization:

```python
from fugashi import Tagger
```

For advanced dictionary management:

```python
from fugashi import GenericTagger, create_feature_wrapper
```

## Basic Usage

```python
from fugashi import Tagger

# Initialize tagger with UniDic (automatic detection)
tagger = Tagger()

# Tokenize text
text = "麩菓子は、麩を主材料とした日本の菓子。"
nodes = tagger(text)

# Access token information
for node in nodes:
    print(f"Surface: {node.surface}")
    print(f"Lemma: {node.feature.lemma}")
    print(f"POS: {node.pos}")
    print(f"Features: {node.feature}")
    print("---")

# Get formatted output
formatted = tagger.parse(text)
print(formatted)  # Traditional MeCab output format

# Wakati (word-segmented) mode
wakati_tagger = Tagger('-Owakati')
words = wakati_tagger.parse(text)
print(words)  # Space-separated tokens
```

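A common follow-up to tokenization is filtering tokens by part of speech. The sketch below is self-contained and uses hand-written stand-in nodes (the `Feature`/`Node` namedtuples and sample tokens are illustrative stand-ins, not real tagger output), mirroring the `surface`, `pos`, and `feature.lemma` attribute shape shown above:

```python
from collections import namedtuple

# Stand-in structures mirroring UnidicNode.surface / .pos / .feature.lemma
Feature = namedtuple("Feature", ["lemma"])
Node = namedtuple("Node", ["surface", "pos", "feature"])

# Hand-written sample tokens (illustrative, not produced by running MeCab)
nodes = [
    Node("麩菓子", "名詞,普通名詞,一般,*", Feature("麩菓子")),
    Node("は", "助詞,係助詞,*,*", Feature("は")),
    Node("菓子", "名詞,普通名詞,一般,*", Feature("菓子")),
]

# Keep only nouns, as identified by the first POS field
nouns = [n.surface for n in nodes if n.pos.split(",")[0] == "名詞"]
print(nouns)  # ['麩菓子', '菓子']
```

With a real tagger, the same comprehension works directly over `tagger(text)`.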
## Architecture

Fugashi provides a layered architecture for Japanese text processing:

- **Taggers**: High-level interfaces (`Tagger`, `GenericTagger`) that manage MeCab instances and provide parsing methods
- **Nodes**: Token representations (`Node`, `UnidicNode`) containing surface forms, morphological features, and metadata
- **Feature Wrappers**: Named tuple structures (`UnidicFeatures17`/`26`/`29`) providing structured access to dictionary features
- **Dictionary Management**: Functions for building custom dictionaries and accessing dictionary information

This design enables both simple tokenization workflows and sophisticated morphological analysis applications, with automatic dictionary format detection and extensive customization options.

## Capabilities
### Core Tokenization

Primary tokenization functionality including text parsing, node list generation, wakati mode, and n-best parsing. These methods provide the essential Japanese text processing capabilities.

```python { .api }
class Tagger:
    def __init__(self, arg: str = '') -> None: ...
    def __call__(self, text: str) -> List[UnidicNode]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[UnidicNode]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[UnidicNode]]: ...

class GenericTagger:
    def __init__(self, args: str = '', wrapper: Callable = make_tuple, quiet: bool = False) -> None: ...
    def __call__(self, text: str) -> List[Node]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[Node]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[Node]]: ...
```

[Tokenization](./tokenization.md)

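The relationship between `parse` (MeCab's traditional tab-separated text output) and `parseToNodeList` (structured nodes) can be sketched without MeCab installed. In this self-contained example, the `sample` string, the `MiniNode` type, and `to_node_list` are illustrative stand-ins written by hand, not fugashi's actual output or implementation:

```python
from typing import List, NamedTuple

# Illustrative MeCab-style output: surface TAB comma-separated features,
# terminated by "EOS". The feature values are hand-written samples.
sample = "麩菓子\t名詞,普通名詞,一般,*\nは\t助詞,係助詞,*,*\nEOS"

class MiniNode(NamedTuple):
    surface: str
    features: List[str]

def to_node_list(mecab_output: str) -> List[MiniNode]:
    """Convert classic MeCab text output into structured nodes."""
    nodes = []
    for line in mecab_output.splitlines():
        if line == "EOS":
            break
        surface, _, feature_str = line.partition("\t")
        nodes.append(MiniNode(surface, feature_str.split(",")))
    return nodes

nodes = to_node_list(sample)
print([n.surface for n in nodes])  # ['麩菓子', 'は']
```

fugashi performs this structuring for you: `parseToNodeList` returns node objects directly, so the text round-trip above is never needed in practice.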
### Nodes and Features

Token representation and morphological feature access including surface forms, part-of-speech information, lemmas, pronunciation data, and grammatical features. These provide detailed linguistic information for each token.

```python { .api }
class Node:
    @property
    def surface(self) -> str: ...
    @property
    def feature(self) -> NamedTuple: ...
    @property
    def feature_raw(self) -> str: ...
    @property
    def length(self) -> int: ...
    @property
    def rlength(self) -> int: ...
    @property
    def posid(self) -> int: ...
    @property
    def char_type(self) -> int: ...
    @property
    def stat(self) -> int: ...
    @property
    def is_unk(self) -> bool: ...
    @property
    def white_space(self) -> str: ...

class UnidicNode(Node):
    @property
    def pos(self) -> str: ...

UnidicFeatures17 = NamedTuple('UnidicFeatures17', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str)
])

UnidicFeatures26 = NamedTuple('UnidicFeatures26', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('kana', str), ('kanaBase', str), ('form', str), ('formBase', str),
    ('iConType', str), ('fConType', str), ('aType', str), ('aConType', str), ('aModeType', str)
])

UnidicFeatures29 = NamedTuple('UnidicFeatures29', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('iConType', str), ('fConType', str), ('type', str), ('kana', str), ('kanaBase', str),
    ('form', str), ('formBase', str), ('aType', str), ('aConType', str),
    ('aModType', str), ('lid', str), ('lemma_id', str)
])
```

[Nodes and Features](./nodes-features.md)

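A node's `feature` attribute is one of these named tuples built from the raw comma-separated feature string (`feature_raw`). The self-contained sketch below mimics that mapping with a plain `namedtuple` using the `UnidicFeatures17` field layout; the raw feature string is a hand-written sample, not real dictionary output:

```python
from collections import namedtuple

# Same field layout as UnidicFeatures17 above
Features17 = namedtuple("Features17", [
    "pos1", "pos2", "pos3", "pos4", "cType", "cForm", "lForm", "lemma",
    "orth", "pron", "orthBase", "pronBase", "goshu", "iType", "iForm",
    "fType", "fForm",
])

# Hand-written raw feature string in UniDic CSV order (illustrative only)
feature_raw = "名詞,普通名詞,一般,*,*,*,フガシ,麩菓子,麩菓子,フガシ,麩菓子,フガシ,和,*,*,*,*"
feature = Features17(*feature_raw.split(","))

print(feature.pos1)   # 名詞
print(feature.lemma)  # 麩菓子
```

This is why `node.feature.lemma` in the Basic Usage example works: each CSV position in the dictionary entry becomes a named attribute.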
### Dictionary Management

Dictionary configuration, information access, and custom dictionary building. These functions enable advanced dictionary management and customization for specific use cases.

```python { .api }
def create_feature_wrapper(name: str, fields: List[str], default: Any = None) -> NamedTuple: ...
def try_import_unidic() -> Optional[str]: ...
def build_dictionary(args: str) -> None: ...

class Tagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...

class GenericTagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...
```

[Dictionary Management](./dictionary-management.md)

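To illustrate the idea behind `create_feature_wrapper` (a wrapper type whose trailing fields fall back to a default when a dictionary entry supplies fewer values), here is a rough pure-Python analog. This is a sketch of the concept only: `make_feature_wrapper` and `from_values` are hypothetical names, not fugashi's actual implementation or API:

```python
from collections import namedtuple

def make_feature_wrapper(name, fields, default=None):
    """Rough analog of create_feature_wrapper: build a namedtuple whose
    trailing fields fall back to `default` when the input is short."""
    base = namedtuple(name, fields)

    def from_values(values):
        # Pad missing trailing fields with the default value
        padded = list(values) + [default] * (len(fields) - len(values))
        return base(*padded)

    base.from_values = staticmethod(from_values)
    return base

# A custom wrapper for a hypothetical three-field dictionary format
CustomFeatures = make_feature_wrapper("CustomFeatures", ["pos1", "pos2", "lemma"])
f = CustomFeatures.from_values(["名詞", "固有名詞"])  # lemma missing
print(f.pos1)   # 名詞
print(f.lemma)  # None
```

In fugashi itself, the resulting wrapper would typically be passed to `GenericTagger` via its `wrapper` argument so custom dictionary formats get named attribute access.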
### Command Line Interface

Console scripts for command-line text processing, dictionary information, and dictionary building. These provide direct access to fugashi functionality from the terminal.

```python { .api }
def main():
    """Command-line interface for text tokenization.

    Console script: fugashi

    Processes text from stdin, treating each line as a sentence.
    Supports all MeCab options via command-line arguments.

    Examples:
        echo "日本語" | fugashi
        echo "日本語" | fugashi -Owakati
    """
    ...

def info():
    """Display dictionary and configuration information.

    Console script: fugashi-info

    Shows detailed information about loaded dictionaries including
    version, size, charset, and file paths.

    Example:
        fugashi-info
    """
    ...

def build_dict():
    """Build custom MeCab user dictionary from CSV input.

    Console script: fugashi-build-dict

    Compiles CSV dictionary sources into MeCab binary format.
    Defaults to UTF-8 encoding for input and output.

    Example:
        fugashi-build-dict -o custom.dic input.csv
    """
    ...
```
```