# identify

`identify` is a file identification library for Python. It determines file types from paths, filenames, extensions, and file content, combining path-based analysis, filename matching, extension mapping, and shebang parsing.

## Package Information

- **Package Name**: identify
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install identify`
- **Optional Features**: `pip install identify[license]` for license identification

## Core Imports

```python
from identify import identify
```

Or import specific functions:

```python
from identify.identify import (
    tags_from_path,
    tags_from_filename,
    tags_from_interpreter,
    file_is_text,
    parse_shebang_from_file,
    license_id,
    ALL_TAGS,
)
```

For the command-line interface:

```python
from identify.cli import main
```

## Basic Usage

```python
from identify.identify import (
    file_is_text,
    parse_shebang_from_file,
    tags_from_filename,
    tags_from_path,
)

# Identify file from path (comprehensive analysis; the path must exist)
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Identify file from filename only (no file system access)
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

# Check if file is text or binary
is_text = file_is_text('/path/to/file.txt')
print(is_text)  # True

# Parse shebang from executable files
shebang = parse_shebang_from_file('/path/to/script.py')
print(shebang)  # e.g. ('python3',); () if there is no shebang or the file is not executable
```

## Architecture

The identify library uses a layered approach to file identification:

- **Path Analysis**: Examines file system metadata (type, permissions, accessibility)
- **Filename Matching**: Matches against known filenames and extensions using predefined mappings
- **Content Analysis**: Performs binary/text detection and shebang parsing for executables
- **Tag System**: Returns standardized tags that categorize files by type, encoding, mode, and language

The library includes comprehensive databases of file extensions (`EXTENSIONS`), special filenames (`NAMES`), and interpreter mappings (`INTERPRETERS`) covering hundreds of file types across multiple programming languages and formats.
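
The filename-matching layer can be pictured as two dictionary lookups. The following is a minimal sketch, not the library's implementation: the two tables are tiny hypothetical stand-ins for `NAMES` and `EXTENSIONS`, and `sketch_tags_from_filename` is an illustrative name.

```python
import os

# Hypothetical, heavily abbreviated stand-ins for the library's mapping tables.
SKETCH_NAMES = {
    "Dockerfile": {"text", "dockerfile"},
    "Makefile": {"text", "makefile"},
}
SKETCH_EXTENSIONS = {
    "py": {"text", "python"},
    "yaml": {"text", "yaml"},
    "png": {"binary", "image", "png"},
}


def sketch_tags_from_filename(path: str) -> set:
    """Collect tags from an exact filename match, then from the extension."""
    filename = os.path.basename(path)
    tags = set()

    # Special filenames such as "Dockerfile" have no extension to go on.
    if filename in SKETCH_NAMES:
        tags |= SKETCH_NAMES[filename]

    # Extensions are compared case-insensitively and without the leading dot.
    ext = os.path.splitext(filename)[1][1:].lower()
    if ext in SKETCH_EXTENSIONS:
        tags |= SKETCH_EXTENSIONS[ext]

    return tags
```

Because this layer only inspects the string, it works for paths that do not exist on disk; an unknown name simply yields an empty set.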

## Capabilities

### File Identification

Core functions for identifying files through path-based analysis, filename matching, and extension mapping. Each returns a set of tags describing characteristics such as type, encoding, language, and mode.

```python { .api }
def tags_from_path(path: str) -> set[str]: ...
def tags_from_filename(path: str) -> set[str]: ...
def tags_from_interpreter(interpreter: str) -> set[str]: ...
```

[File Identification](./file-identification.md)
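
To illustrate what path-based analysis involves, here is a rough sketch of deriving type and mode tags from file system metadata using only the standard library. It is an assumption-laden approximation rather than the library's code: `sketch_path_tags` is a hypothetical name, and the tag strings are assumed to match the constants documented below.

```python
import os
import stat


def sketch_path_tags(path: str) -> set:
    """Derive type and mode tags from file system metadata only."""
    try:
        # lstat does not follow symlinks, so links are reported as links.
        mode = os.lstat(path).st_mode
    except (OSError, ValueError):
        raise ValueError(f"{path} does not exist.") from None

    if stat.S_ISDIR(mode):
        return {"directory"}
    if stat.S_ISLNK(mode):
        return {"symlink"}
    if stat.S_ISSOCK(mode):
        return {"socket"}

    # Regular files additionally get exactly one mode tag.
    tags = {"file"}
    tags.add("executable" if os.access(path, os.X_OK) else "non-executable")
    return tags
```

A full identification would then add filename and content tags on top of this result for regular files.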

### Content Analysis

Functions for analyzing file content to determine text vs binary classification and parse executable shebang lines for interpreter identification.

```python { .api }
def file_is_text(path: str) -> bool: ...
def is_text(bytesio: IO[bytes]) -> bool: ...
def parse_shebang_from_file(path: str) -> tuple[str, ...]: ...
def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]: ...
```

[Content Analysis](./content-analysis.md)
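
Both kinds of content analysis operate on a binary stream. The sketch below shows one plausible way to implement them with the standard library; the names `sketch_is_text` and `sketch_parse_shebang`, the 1024-byte sample size, and the exact set of "text" bytes are assumptions for illustration and may differ from what the library does.

```python
import io
import shlex
from typing import IO, Tuple

# Bytes treated as "text": common control characters, printable ASCII,
# and the high range used by UTF-8 and Latin-1 style encodings.
_TEXT_BYTES = bytes(
    {7, 8, 9, 10, 11, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100))
)


def sketch_is_text(stream: IO[bytes]) -> bool:
    """Return True if the first 1024 bytes contain only text-like bytes."""
    sample = stream.read(1024)
    # Deleting every text byte leaves only the suspicious ones behind.
    return not sample.translate(None, _TEXT_BYTES)


def sketch_parse_shebang(stream: IO[bytes]) -> Tuple[str, ...]:
    """Return the interpreter command from a '#!' line, or () if absent."""
    if stream.read(2) != b"#!":
        return ()
    try:
        line = stream.readline().decode("utf-8")
        cmd = tuple(shlex.split(line))
    except (UnicodeDecodeError, ValueError):
        return ()
    # '/usr/bin/env python3' should be reported as just 'python3'.
    if cmd[:1] == ("/usr/bin/env",):
        cmd = cmd[1:]
    return cmd


print(sketch_is_text(io.BytesIO(b"hello world\n")))
print(sketch_parse_shebang(io.BytesIO(b"#!/usr/bin/env python3\nprint(1)\n")))
```

Taking a stream rather than a path keeps both helpers usable on in-memory data, which mirrors the split between `is_text` / `parse_shebang` and their `*_from_file` counterparts in the API above.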

### License Identification

Optional functionality for identifying software licenses in files using SPDX identifiers through text matching and edit distance algorithms.

```python { .api }
def license_id(filename: str) -> str | None: ...
```

[License Identification](./license-identification.md)
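
The combination of text matching and edit distance can be sketched as follows. This is a toy illustration under stated assumptions, not the library's algorithm: the two-entry `SKETCH_LICENSES` corpus, the normalization rules, the 5% threshold, and all function names are hypothetical, and a plain Levenshtein implementation stands in for whatever edit-distance routine the optional dependency provides.

```python
import re
from typing import Optional

# Hypothetical two-entry corpus; real SPDX license texts are far longer.
SKETCH_LICENSES = {
    "MIT": "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "ISC": "Permission to use, copy, modify, and/or distribute this software for any purpose",
}


def normalize(text: str) -> str:
    """Drop copyright lines and collapse whitespace runs to single spaces."""
    lines = [
        ln for ln in text.splitlines()
        if not ln.strip().lower().startswith("copyright")
    ]
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sketch_license_id(text: str, max_ratio: float = 0.05) -> Optional[str]:
    """Return the closest known license ID, or None if nothing is close."""
    norm = normalize(text)
    best_id, best_dist = None, None
    for spdx_id, reference in SKETCH_LICENSES.items():
        ref = normalize(reference)
        if norm == ref:
            return spdx_id  # exact match after normalization
        dist = edit_distance(norm, ref)
        if dist <= max_ratio * len(ref) and (best_dist is None or dist < best_dist):
            best_id, best_dist = spdx_id, dist
    return best_id
```

Normalizing first lets most files match exactly and cheaply; the edit-distance pass only needs to absorb small wording differences.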

## Constants and Data

```python { .api }
# Basic file type constants
DIRECTORY: str
FILE: str
SYMLINK: str
SOCKET: str
EXECUTABLE: str
NON_EXECUTABLE: str
TEXT: str
BINARY: str

# Tag collections
TYPE_TAGS: frozenset[str]
MODE_TAGS: frozenset[str]
ENCODING_TAGS: frozenset[str]
ALL_TAGS: frozenset[str]
```
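
Because the tag collections are sets, a result can be split into categories with plain set operations. The sketch below redefines three of the collections locally so that it runs without the library installed; the string values are assumed to correspond to the constants above, and `split_tags` is a hypothetical helper.

```python
# Local stand-ins for the library's collections (values are assumptions).
TYPE_TAGS = frozenset({"directory", "file", "symlink", "socket"})
MODE_TAGS = frozenset({"executable", "non-executable"})
ENCODING_TAGS = frozenset({"text", "binary"})


def split_tags(tags: set) -> dict:
    """Partition a tag set into type, mode, encoding, and everything else."""
    return {
        "type": tags & TYPE_TAGS,
        "mode": tags & MODE_TAGS,
        "encoding": tags & ENCODING_TAGS,
        # Whatever remains is typically a language or format tag.
        "other": tags - TYPE_TAGS - MODE_TAGS - ENCODING_TAGS,
    }


parts = split_tags({"file", "text", "python", "non-executable"})
print(parts["other"])  # {'python'}
```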

## Command Line Interface

The package provides a command-line tool for file identification:

```python { .api }
from collections.abc import Sequence

def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface for file identification.

    Args:
        argv: Command line arguments, defaults to sys.argv[1:]

    Returns:
        int: Exit code (0 for success, 1 on error or when no tags are identified)
    """
```

**Usage:**

```bash
# Identify file with full analysis
identify-cli /path/to/file

# Identify using filename only
identify-cli --filename-only /path/to/file
```

Output is a JSON array of tags, sorted alphabetically:

```json
["file", "non-executable", "python", "text"]
```
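
When the tool is called from another program, its output can be parsed with the standard `json` module. This is a small sketch; `parse_cli_output` is a hypothetical helper, and the sample string stands in for the tool's real standard output.

```python
import json


def parse_cli_output(stdout: str) -> set:
    """Parse the JSON array printed by the CLI into a set of tag strings."""
    tags = json.loads(stdout)
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("expected a JSON array of strings")
    return set(tags)


sample = '["file", "non-executable", "python", "text"]'
print("python" in parse_cli_output(sample))  # True
```

With the package installed, the input string could come from something like `subprocess.run(["identify-cli", path], capture_output=True, text=True).stdout`; check the return code first, since a non-zero exit means nothing useful was printed to standard output.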