# Content Analysis

Functions for analyzing file content: classifying a file as text or binary, and parsing executable shebang lines to identify the interpreter.

## Capabilities

### Text/Binary Detection

Determine whether a file contains text or binary content using a byte-level check based on libmagic's detection logic.

```python { .api }
from typing import IO

def file_is_text(path: str) -> bool:
    """
    Determine if a file contains text content.

    Opens the file and checks the first 1KB. The content is treated
    as text unless it contains a non-text byte.

    Args:
        path (str): Path to file to analyze

    Returns:
        bool: True if file appears to be text, False if binary

    Raises:
        ValueError: If path does not exist
    """

def is_text(bytesio: IO[bytes]) -> bool:
    """
    Determine if byte stream content appears to be text.

    Checks the first 1KB of a byte stream. The content is treated
    as text unless it contains a non-text byte. Based on libmagic's
    binary/text detection algorithm.

    Args:
        bytesio (IO[bytes]): Open binary file-like object

    Returns:
        bool: True if content appears to be text, False if binary
    """
```

**Usage Example:**

```python
from identify.identify import file_is_text, is_text
import io

# Check file directly
is_text_file = file_is_text('/path/to/document.txt')
print(is_text_file)  # True

is_binary_file = file_is_text('/path/to/image.png')
print(is_binary_file)  # False

# Check byte stream
with open('/path/to/file.py', 'rb') as f:
    result = is_text(f)
print(result)  # True

# Check bytes in memory
data = b"print('hello world')\n"
stream = io.BytesIO(data)
result = is_text(stream)
print(result)  # True
```

### Shebang Parsing

Parse shebang lines from executable files to extract interpreter and argument information. Handles various shebang formats including env, nix-shell, and quoted arguments.

```python { .api }
from typing import IO

def parse_shebang_from_file(path: str) -> tuple[str, ...]:
    """
    Parse shebang from a file path.

    Extracts shebang information from executable files. Only processes
    files that are executable and have a valid shebang format. Handles
    various shebang patterns including /usr/bin/env usage.

    Args:
        path (str): Path to executable file

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no shebang

    Raises:
        ValueError: If path does not exist
    """

def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]:
    """
    Parse shebang from an open binary file stream.

    Reads and parses the shebang line from the beginning of a binary stream.
    Handles various formats including env, nix-shell, and quoted arguments.
    Only shebang lines made of printable ASCII are parsed.

    Args:
        bytesio (IO[bytes]): Open binary file-like object positioned at start

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no valid shebang
    """
```

**Usage Example:**

```python
from identify.identify import parse_shebang_from_file, parse_shebang
import io

# Parse from file path (the file must exist and be executable)
shebang = parse_shebang_from_file('/usr/bin/python3-script')
print(shebang)  # ('python3',) if the file starts with '#!/usr/bin/env python3'

shebang = parse_shebang_from_file('/path/to/bash-script.sh')
print(shebang)  # ('bash',) if the file starts with '#!/usr/bin/env bash'

# Parse from byte stream
script_content = b'#!/usr/bin/env python3\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ('python3',)

# Complex shebang with arguments
script_content = b'#!/usr/bin/env -S python3 -u\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ('python3', '-u')

# No shebang
script_content = b'print("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ()
```

## Shebang Format Handling

The shebang parser handles several common patterns:

**Standard Format:**
```bash
#!/bin/bash
#!/usr/bin/python3
```

**Environment-based:**
```bash
#!/usr/bin/env python3
#!/usr/bin/env -S python3 -u
```
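
The `env` handling in the patterns above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `strip_env_prefix` is a hypothetical helper, the shebang line is assumed to be already split into tokens, and lines that do not start with `/usr/bin/env` are assumed to pass through unchanged.

```python
from __future__ import annotations


def strip_env_prefix(tokens: tuple[str, ...]) -> tuple[str, ...]:
    """Drop a leading '/usr/bin/env' (and '-S', if present) from shebang tokens.

    Hypothetical helper for illustration only.
    """
    if tokens[:2] == ("/usr/bin/env", "-S"):
        # '#!/usr/bin/env -S python3 -u' -> ('python3', '-u')
        return tokens[2:]
    if tokens[:1] == ("/usr/bin/env",):
        # '#!/usr/bin/env python3' -> ('python3',)
        return tokens[1:]
    # Assumption: anything else is returned as-is.
    return tokens


print(strip_env_prefix(("/usr/bin/env", "python3")))
print(strip_env_prefix(("/usr/bin/env", "-S", "python3", "-u")))
```

Checking the two-token `-S` prefix before the one-token prefix matters here: otherwise `-S` would be left behind as if it were the interpreter name.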

**Nix Shell:**
```bash
#!/usr/bin/env nix-shell
#! /some/path/to/interpreter
```

**Quoted Arguments:**
The parser first attempts shlex-style parsing, then falls back to simple whitespace splitting if the quotes are malformed.
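
The shlex-then-whitespace fallback described above can be sketched with the standard library alone. `split_shebang_line` is a hypothetical name used for illustration, and the input is assumed to be the shebang line with the leading `#!` already removed.

```python
import shlex


def split_shebang_line(line: str) -> list[str]:
    """Split a shebang line (without the leading '#!') into tokens."""
    try:
        # Quote-aware split: a quoted path containing spaces stays one token.
        return shlex.split(line)
    except ValueError:
        # shlex.split raises ValueError on unbalanced quotes, so fall back
        # to plain whitespace splitting.
        return line.split()


print(split_shebang_line('"/path with spaces/python" -u'))
print(split_shebang_line("/bin/sh -c 'unterminated"))
```

In the fallback case the stray quote character stays attached to its token, which is the trade-off for never failing on a malformed line.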

## Character Analysis Details

Text/binary detection checks each byte in the analysis window against a fixed set of text characters:

- **Text Characters**: Control chars (7, 8, 9, 10, 11, 12, 13, 27), printable ASCII (0x20-0x7E), extended ASCII (0x80-0xFF)
- **Analysis Window**: First 1024 bytes of file content
- **Algorithm**: Based on libmagic's encoding detection logic
- **Threshold**: Any non-text byte in the window marks the content as binary

This keeps classification cheap for large-scale file processing, since only the first 1KB is read regardless of file size. The trade-off is that a binary file whose first 1KB contains only text bytes is classified as text.
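
The check described in this section can be sketched in a few lines. This is an illustration rather than the library's implementation: `looks_like_text` and `TEXT_BYTES` are hypothetical names, and the sketch assumes DEL (0x7F) is not counted as a text byte.

```python
import io
from typing import IO

# Bytes treated as text: selected control chars, printable ASCII, and
# extended ASCII. Assumption: DEL (0x7F) is excluded.
TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7F))
    + bytes(range(0x80, 0x100))
)


def looks_like_text(stream: IO[bytes], window: int = 1024) -> bool:
    """Return True if the first `window` bytes contain only text bytes."""
    chunk = stream.read(window)
    # translate(None, delete) removes every text byte from the chunk;
    # anything left over is a non-text byte.
    return not chunk.translate(None, TEXT_BYTES)


print(looks_like_text(io.BytesIO(b"print('hello world')\n")))
print(looks_like_text(io.BytesIO(b"\x89PNG\r\n\x1a\n\x00\x00")))
```

Under this sketch an empty stream counts as text, because there are no non-text bytes to find, and a NUL byte beyond the first 1024 bytes goes unnoticed.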