# Fugashi

A high-performance Cython wrapper for MeCab that provides fast, Pythonic Japanese tokenization and morphological analysis. Fugashi exposes MeCab's full tokenization capabilities, with built-in support for UniDic dictionaries and structured extraction of morphological features.

## Package Information

- **Package Name**: fugashi
- **Language**: Python
- **Installation**: `pip install fugashi`
- **Dependencies**: Requires the MeCab system library (bundled automatically in pre-built wheels)
- **Dictionary**: Requires a MeCab dictionary; UniDic is recommended (`pip install 'fugashi[unidic-lite]'`)

## Core Imports

```python
import fugashi
from fugashi import Tagger, GenericTagger, Node, UnidicNode
```

For basic tokenization:

```python
from fugashi import Tagger
```

For advanced dictionary management:

```python
from fugashi import GenericTagger, create_feature_wrapper
```

## Basic Usage

```python
from fugashi import Tagger

# Initialize tagger with UniDic (automatic detection)
tagger = Tagger()

# Tokenize text
text = "麩菓子は、麩を主材料とした日本の菓子。"
nodes = tagger(text)

# Access token information
for node in nodes:
    print(f"Surface: {node.surface}")
    print(f"Lemma: {node.feature.lemma}")
    print(f"POS: {node.pos}")
    print(f"Features: {node.feature}")
    print("---")

# Get formatted output
formatted = tagger.parse(text)
print(formatted)  # Traditional MeCab output format

# Wakati (word-segmented) mode
wakati_tagger = Tagger('-Owakati')
words = wakati_tagger.parse(text)
print(words)  # Space-separated tokens
```

## Architecture

Fugashi provides a layered architecture for Japanese text processing:

- **Taggers**: High-level interfaces (Tagger, GenericTagger) that manage MeCab instances and provide parsing methods
- **Nodes**: Token representations (Node, UnidicNode) containing surface forms, morphological features, and metadata
- **Feature Wrappers**: Named tuple structures (UnidicFeatures17/26/29) providing structured access to dictionary features
- **Dictionary Management**: Functions for building custom dictionaries and accessing dictionary information

This design enables both simple tokenization workflows and sophisticated morphological analysis applications, with automatic dictionary format detection and extensive customization options.
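
As a minimal sketch of how the layers fit together (assuming a UniDic dictionary such as unidic-lite is installed; output depends on the dictionary):

```python
from fugashi import Tagger

# Tagger layer: manages the underlying MeCab instance.
tagger = Tagger()

# Node layer: calling the tagger yields UnidicNode objects.
for node in tagger("水を飲んだ"):
    # Feature-wrapper layer: node.feature is a named tuple
    # (UnidicFeatures17/26/29, chosen by automatic format detection).
    print(node.surface, type(node.feature).__name__, node.feature.pos1)
```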

## Capabilities

### Core Tokenization

Primary tokenization functionality, including text parsing, node-list generation, wakati mode, and n-best parsing. These methods provide the core Japanese text-processing capabilities.

```python { .api }
class Tagger:
    def __init__(self, arg: str = '') -> None: ...
    def __call__(self, text: str) -> List[UnidicNode]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[UnidicNode]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[UnidicNode]]: ...

class GenericTagger:
    def __init__(self, args: str = '', wrapper: Callable = make_tuple, quiet: bool = False) -> None: ...
    def __call__(self, text: str) -> List[Node]: ...
    def parse(self, text: str) -> str: ...
    def parseToNodeList(self, text: str) -> List[Node]: ...
    def nbest(self, text: str, num: int = 10) -> str: ...
    def nbestToNodeList(self, text: str, num: int = 10) -> List[List[Node]]: ...
```
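
A short usage sketch of these methods (segmentations and output depend on the installed dictionary; `外国人参政権` is just an illustrative, ambiguous input):

```python
from fugashi import Tagger

tagger = Tagger()
text = "外国人参政権"

# parse: classic MeCab-formatted output string
print(tagger.parse(text))

# Calling the tagger and parseToNodeList return the same node list
assert ([n.surface for n in tagger(text)]
        == [n.surface for n in tagger.parseToNodeList(text)])

# nbest: top-N analyses as one formatted string
print(tagger.nbest(text, num=3))

# nbestToNodeList: the same analyses as lists of nodes
for candidate in tagger.nbestToNodeList(text, num=3):
    print([n.surface for n in candidate])
```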

[Tokenization](./tokenization.md)

### Nodes and Features

Token representation and morphological feature access including surface forms, part-of-speech information, lemmas, pronunciation data, and grammatical features. These provide detailed linguistic information for each token.

```python { .api }
class Node:
    @property
    def surface(self) -> str: ...
    @property
    def feature(self) -> NamedTuple: ...
    @property
    def feature_raw(self) -> str: ...
    @property
    def length(self) -> int: ...
    @property
    def rlength(self) -> int: ...
    @property
    def posid(self) -> int: ...
    @property
    def char_type(self) -> int: ...
    @property
    def stat(self) -> int: ...
    @property
    def is_unk(self) -> bool: ...
    @property
    def white_space(self) -> str: ...

class UnidicNode(Node):
    @property
    def pos(self) -> str: ...

UnidicFeatures17 = NamedTuple('UnidicFeatures17', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str)
])

UnidicFeatures26 = NamedTuple('UnidicFeatures26', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('kana', str), ('kanaBase', str), ('form', str), ('formBase', str),
    ('iConType', str), ('fConType', str), ('aType', str), ('aConType', str), ('aModeType', str)
])

UnidicFeatures29 = NamedTuple('UnidicFeatures29', [
    ('pos1', str), ('pos2', str), ('pos3', str), ('pos4', str),
    ('cType', str), ('cForm', str), ('lForm', str), ('lemma', str),
    ('orth', str), ('pron', str), ('orthBase', str), ('pronBase', str),
    ('goshu', str), ('iType', str), ('iForm', str), ('fType', str), ('fForm', str),
    ('iConType', str), ('fConType', str), ('type', str), ('kana', str), ('kanaBase', str),
    ('form', str), ('formBase', str), ('aType', str), ('aConType', str),
    ('aModType', str), ('lid', str), ('lemma_id', str)
])
```
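
For illustration, a sketch that reads node metadata and feature fields; which fields exist depends on the UniDic format in use (17, 26, or 29 fields), so `getattr` with a default is a safe access pattern:

```python
from fugashi import Tagger

tagger = Tagger()

for node in tagger("ヴェネツィアへ行く"):
    print(node.surface)
    print("  length:", node.length, "rlength:", node.rlength)
    print("  posid:", node.posid, "unknown:", node.is_unk)
    # Feature fields are named-tuple attributes; kana only exists
    # in the 26- and 29-field UniDic formats, so fall back gracefully.
    print("  lemma:", getattr(node.feature, "lemma", None))
    print("  kana:", getattr(node.feature, "kana", None))
```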

[Nodes and Features](./nodes-features.md)

### Dictionary Management

Dictionary configuration, information access, and custom dictionary building. These functions enable advanced dictionary management and customization for specific use cases.

```python { .api }
def create_feature_wrapper(name: str, fields: List[str], default: Any = None) -> NamedTuple: ...
def try_import_unidic() -> Optional[str]: ...
def build_dictionary(args: str) -> None: ...

class Tagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...

class GenericTagger:
    @property
    def dictionary_info(self) -> List[Dict[str, Union[str, int]]]: ...
```
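
A sketch pairing `create_feature_wrapper` with `GenericTagger`; the field names below are hypothetical placeholders for whatever columns your dictionary's CSV actually defines:

```python
from fugashi import GenericTagger, create_feature_wrapper

# Hypothetical field names -- substitute your dictionary's real columns.
MyFeatures = create_feature_wrapper('MyFeatures', ['pos1', 'pos2', 'base', 'reading'])

tagger = GenericTagger(wrapper=MyFeatures)

# Inspect the loaded dictionaries (path, charset, entry counts, etc.).
for entry in tagger.dictionary_info:
    print(entry)

for node in tagger("日本語"):
    print(node.surface, node.feature.pos1)
```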

[Dictionary Management](./dictionary-management.md)

### Command Line Interface

Console scripts for command-line text processing, dictionary information, and dictionary building. These provide direct access to fugashi functionality from the terminal.

```python { .api }
def main():
    """Command-line interface for text tokenization.

    Console script: fugashi

    Processes text from stdin, treating each line as a sentence.
    Supports all MeCab options via command-line arguments.

    Examples:
        echo "日本語" | fugashi
        echo "日本語" | fugashi -Owakati
    """
    ...

def info():
    """Display dictionary and configuration information.

    Console script: fugashi-info

    Shows detailed information about loaded dictionaries including
    version, size, charset, and file paths.

    Example:
        fugashi-info
    """
    ...

def build_dict():
    """Build a custom MeCab user dictionary from CSV input.

    Console script: fugashi-build-dict

    Compiles CSV dictionary sources into MeCab binary format.
    Defaults to UTF-8 encoding for input and output.

    Example:
        fugashi-build-dict -o custom.dic input.csv
    """
    ...
```