# File Identification

Core functionality for identifying files using multiple detection methods including path-based analysis, filename pattern matching, extension mapping, and interpreter detection.

## Capabilities

### Path-Based Identification

Comprehensive file identification using file system metadata, permissions, content analysis, and extension matching. This is the primary identification method providing the most complete tag set.

```python { .api }
def tags_from_path(path: str) -> set[str]:
    """
    Identify file tags from a file path using comprehensive analysis.

    Performs file system analysis including:
    - File type detection (file, directory, symlink, socket)
    - Permission analysis (executable, non-executable)
    - Extension and filename matching
    - Shebang parsing for executables
    - Binary/text content detection

    Args:
        path (str): File path to analyze

    Returns:
        set[str]: Set of identifying tags

    Raises:
        ValueError: If path does not exist
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_path

# Python script
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Executable shell script (tags come from the .sh extension)
tags = tags_from_path('/usr/bin/script.sh')
print(tags)  # {'file', 'text', 'shell', 'executable'}

# Directory
tags = tags_from_path('/path/to/directory')
print(tags)  # {'directory'}

# Binary image file
tags = tags_from_path('/path/to/image.png')
print(tags)  # {'file', 'binary', 'image', 'png', 'non-executable'}
```
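
The first two steps listed in the docstring (file type and permission analysis) can be approximated with the standard library alone. The sketch below is not the library's implementation: `basic_tags` is a hypothetical helper that covers only those two steps, assumes POSIX-style permissions, and mirrors the documented `ValueError` for missing paths.

```python
import os
import stat
import tempfile


def basic_tags(path: str) -> set[str]:
    """Hypothetical helper: file-type and permission tags only."""
    try:
        # lstat does not follow symlinks, so a symlink is reported as such.
        mode = os.lstat(path).st_mode
    except OSError:
        raise ValueError(f'{path} does not exist')

    if stat.S_ISDIR(mode):
        return {'directory'}
    if stat.S_ISLNK(mode):
        return {'symlink'}
    if stat.S_ISSOCK(mode):
        return {'socket'}

    # Regular files additionally get exactly one permission tag.
    permission = 'executable' if os.access(path, os.X_OK) else 'non-executable'
    return {'file', permission}


# Demonstration on throwaway files (POSIX permissions assumed).
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'script')
    with open(script, 'w') as f:
        f.write('#!/bin/sh\necho hi\n')

    os.chmod(script, 0o644)
    plain_tags = basic_tags(script)

    os.chmod(script, 0o755)
    exec_tags = basic_tags(script)

    dir_tags = basic_tags(tmp)

print(plain_tags, exec_tags, dir_tags)
```

Non-regular files return early with a single tag, which matches the `{'directory'}` result shown in the usage example above.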

### Filename-Only Identification

Fast identification based solely on filename and extension without accessing the file system. Useful for batch processing or when file system access is not available.

```python { .api }
def tags_from_filename(path: str) -> set[str]:
    """
    Identify file tags based only on filename/extension.

    Matches filename against known patterns and extensions without
    accessing the file system. Supports both extension-based matching
    and special filename recognition (e.g., 'Dockerfile', '.gitignore').

    Args:
        path (str): File path or filename to analyze

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_filename

# Extension-based matching
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

tags = tags_from_filename('script.js')
print(tags)  # {'javascript', 'text'}

# Special filename matching
tags = tags_from_filename('Dockerfile')
print(tags)  # {'dockerfile', 'text'}

tags = tags_from_filename('.gitignore')
print(tags)  # {'gitignore', 'text'}

# No match returns empty set
tags = tags_from_filename('unknown')
print(tags)  # set()
```
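
To make the two matching modes concrete, here is a minimal sketch of name-then-extension lookup. It is an illustration under stated assumptions, not the library's code: `sketch_tags_from_filename`, `SAMPLE_NAMES`, and `SAMPLE_EXTENSIONS` are hypothetical, the tables hold only a few entries taken from the examples above, and lowercasing the extension is an assumption of the sketch.

```python
import os

# Tiny illustrative tables; the real mappings are far larger.
SAMPLE_NAMES = {
    'Dockerfile': {'dockerfile', 'text'},
    '.gitignore': {'gitignore', 'text'},
}
SAMPLE_EXTENSIONS = {
    'yaml': {'yaml', 'text'},
    'js': {'javascript', 'text'},
}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Hypothetical helper: match the basename, then the extension."""
    filename = os.path.basename(path)
    tags: set[str] = set()

    # Special filenames are looked up by their full name.
    tags |= SAMPLE_NAMES.get(filename, set())

    # splitext('.gitignore') yields an empty extension, so dotfiles
    # are only ever matched by name.
    ext = os.path.splitext(filename)[1]
    if ext:
        tags |= SAMPLE_EXTENSIONS.get(ext[1:].lower(), set())

    return tags


print(sketch_tags_from_filename('/repo/config.YAML'))
print(sketch_tags_from_filename('Dockerfile'))
print(sketch_tags_from_filename('unknown'))
```

Because only the string is inspected, the path does not need to exist, which is what makes this mode suitable for batch processing.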

### Interpreter-Based Identification

Identify file types based on interpreter names, typically extracted from shebang lines. Supports version-specific interpreters with fallback to general interpreter names.

```python { .api }
def tags_from_interpreter(interpreter: str) -> set[str]:
    """
    Get tags for a given interpreter name.

    Attempts progressive matching from specific to general by
    dropping trailing dot-separated version components:
    'python3.9.1' -> 'python3.9' -> 'python3'

    Args:
        interpreter (str): Interpreter name (e.g., 'python3', 'bash', 'node')

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_interpreter

# Specific version with fallback
tags = tags_from_interpreter('python3.9.1')
print(tags)  # {'python', 'python3'}

# Shell interpreters
tags = tags_from_interpreter('bash')
print(tags)  # {'shell', 'bash'}

tags = tags_from_interpreter('zsh')
print(tags)  # {'shell', 'zsh'}

# JavaScript runtime
tags = tags_from_interpreter('node')
print(tags)  # {'javascript'}

# Unknown interpreter
tags = tags_from_interpreter('unknown')
print(tags)  # set()
```
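
The path from a shebang line to a tag set can be sketched in two small steps: extract the interpreter name, then fall back from specific to general versions. Everything below is hypothetical and simplified (`interpreter_from_shebang`, `sketch_tags_from_interpreter`, and `SAMPLE_INTERPRETERS` are not library names); the fallback assumes that only trailing dot-separated components are stripped.

```python
import shlex

# Tiny illustrative table; the real mapping covers many more interpreters.
SAMPLE_INTERPRETERS = {
    'python': {'python'},
    'python3': {'python', 'python3'},
    'bash': {'shell', 'bash'},
    'node': {'javascript'},
}


def interpreter_from_shebang(first_line: str) -> str:
    """Hypothetical helper: pull the interpreter name out of a shebang line."""
    if not first_line.startswith('#!'):
        return ''
    parts = shlex.split(first_line[2:])
    # '#!/usr/bin/env python3' names the interpreter in the second token.
    if parts and parts[0].rpartition('/')[2] == 'env':
        parts = parts[1:]
    return parts[0].rpartition('/')[2] if parts else ''


def sketch_tags_from_interpreter(interpreter: str) -> set[str]:
    """Hypothetical helper: drop trailing '.N' components until a match."""
    name = interpreter
    while name:
        if name in SAMPLE_INTERPRETERS:
            return SAMPLE_INTERPRETERS[name]
        # 'python3.9.1' -> 'python3.9' -> 'python3' -> '' (loop ends)
        name = name.rpartition('.')[0]
    return set()


name = interpreter_from_shebang('#!/usr/bin/env python3.9\n')
print(name, sketch_tags_from_interpreter(name))
```

The loop terminates because `rpartition('.')` returns an empty head once no dot remains, so an unknown interpreter yields an empty set.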

## Data Structures

The identification system relies on comprehensive databases of file patterns:

```python { .api }
from identify.extensions import EXTENSIONS, EXTENSIONS_NEED_BINARY_CHECK, NAMES
from identify.interpreters import INTERPRETERS

# Extension to tags mapping (400+ extensions)
EXTENSIONS: dict[str, set[str]]

# Extensions requiring binary content check
EXTENSIONS_NEED_BINARY_CHECK: dict[str, set[str]]

# Special filename to tags mapping (100+ filenames)
NAMES: dict[str, set[str]]

# Interpreter to tags mapping
INTERPRETERS: dict[str, set[str]]
```

**Example Data:**

```python
from identify.extensions import EXTENSIONS, NAMES
from identify.interpreters import INTERPRETERS

# Sample extension mappings
EXTENSIONS['py']   # {'python', 'text'}
EXTENSIONS['js']   # {'javascript', 'text'}
EXTENSIONS['png']  # {'binary', 'image', 'png'}

# Sample filename mappings
NAMES['Dockerfile']    # {'dockerfile', 'text'}
NAMES['.gitignore']    # {'gitignore', 'text'}
NAMES['package.json']  # {'json', 'text', 'npm'}

# Sample interpreter mappings
INTERPRETERS['python3']  # {'python', 'python3'}
INTERPRETERS['bash']     # {'shell', 'bash'}
INTERPRETERS['node']     # {'javascript'}
```
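
Since every table has the shape `dict[str, set[str]]`, it can be inverted to answer the opposite question, such as which extensions carry a given tag. The sketch below uses a hypothetical `extensions_by_tag` helper and a three-entry stand-in table (`SAMPLE_EXTENSIONS`) rather than the real data.

```python
from collections import defaultdict

# Stand-in for an extension -> tags table, using the sample entries above.
SAMPLE_EXTENSIONS = {
    'py': {'python', 'text'},
    'js': {'javascript', 'text'},
    'png': {'binary', 'image', 'png'},
}


def extensions_by_tag(table: dict[str, set[str]]) -> dict[str, set[str]]:
    """Hypothetical helper: invert an extension -> tags mapping."""
    index: dict[str, set[str]] = defaultdict(set)
    for ext, tags in table.items():
        for tag in tags:
            index[tag].add(ext)
    return dict(index)


by_tag = extensions_by_tag(SAMPLE_EXTENSIONS)
print(sorted(by_tag['text']))   # ['js', 'py']
print(sorted(by_tag['image']))  # ['png']
```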

## Tag Categories

The identification system uses standardized tags organized into categories:

- **File Types:** `file`, `directory`, `symlink`, `socket`
- **Permissions:** `executable`, `non-executable`
- **Encoding:** `text`, `binary`
- **Languages:** `python`, `javascript`, `shell`, `c++`, `java`, etc.
- **Formats:** `json`, `yaml`, `xml`, `csv`, `markdown`, etc.
- **Images:** `png`, `jpeg`, `gif`, `svg`, etc.
- **Archives:** `zip`, `tar`, `gzip`, `bzip2`, etc.
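
Because tags are plain sets of strings, selecting files by category reduces to set operations. The following is a minimal sketch, assuming a hypothetical `matches` helper and hand-written tag sets in place of real identification results.

```python
def matches(tags, required, excluded=frozenset()):
    """Hypothetical helper: all required tags present, no excluded tag present."""
    return required <= tags and not (tags & excluded)


# Hand-written tag sets standing in for identification results.
files = {
    'script.py': {'file', 'text', 'python', 'non-executable'},
    'run.sh': {'file', 'text', 'shell', 'executable'},
    'logo.png': {'file', 'binary', 'image', 'png', 'non-executable'},
}

# Text files that are not executable.
selected = sorted(
    name for name, tags in files.items()
    if matches(tags, {'file', 'text'}, excluded={'executable'})
)
print(selected)  # ['script.py']
```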