File identification library for Python that determines file types based on paths, extensions, and content analysis
Functions for analyzing file content to determine text vs binary classification and parse executable shebang lines for interpreter identification.
Determine whether files contain text or binary content using character analysis algorithms based on libmagic's detection logic.
def file_is_text(path: str) -> bool:
    """
    Determine if a file contains text content.

    Opens file and analyzes the first 1KB to determine if content
    appears to be text based on character distribution analysis.

    Args:
        path (str): Path to file to analyze

    Returns:
        bool: True if file appears to be text, False if binary

    Raises:
        ValueError: If path does not exist
    """
def is_text(bytesio: IO[bytes]) -> bool:
    """
    Determine if byte stream content appears to be text.

    Analyzes the first 1KB of a byte stream to determine if content
    appears to be text based on character distribution. Based on
    libmagic's binary/text detection algorithm.

    Args:
        bytesio (IO[bytes]): Open binary file-like object

    Returns:
        bool: True if content appears to be text, False if binary
    """

Usage Example:
from identify.identify import file_is_text, is_text
import io
# Check file directly
is_text_file = file_is_text('/path/to/document.txt')
print(is_text_file) # True
is_binary_file = file_is_text('/path/to/image.png')
print(is_binary_file) # False
# Check byte stream
with open('/path/to/file.py', 'rb') as f:
    result = is_text(f)
    print(result)  # True
# Check bytes in memory
data = b"print('hello world')\n"
stream = io.BytesIO(data)
result = is_text(stream)
print(result)  # True

Parse shebang lines from executable files to extract interpreter and argument information. Handles various shebang formats including env, nix-shell, and quoted arguments.
def parse_shebang_from_file(path: str) -> tuple[str, ...]:
    """
    Parse shebang from a file path.

    Extracts shebang information from executable files. Only processes
    files that are executable and have valid shebang format. Handles
    various shebang patterns including /usr/bin/env usage.

    Args:
        path (str): Path to executable file

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no shebang

    Raises:
        ValueError: If path does not exist
    """
def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]:
    """
    Parse shebang from an open binary file stream.

    Reads and parses shebang line from the beginning of a binary stream.
    Handles various formats including env, nix-shell, and quoted arguments.
    Only processes printable ASCII content.

    Args:
        bytesio (IO[bytes]): Open binary file-like object positioned at start

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no valid shebang
    """

Usage Example:
from identify.identify import parse_shebang_from_file, parse_shebang
import io
# Parse from file path
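shebang = parse_shebang_from_file('/usr/bin/python3-script')
print(shebang)  # e.g. ('python3',) if the script is executable and starts with '#!/usr/bin/env python3'
shebang = parse_shebang_from_file('/path/to/bash-script.sh')
print(shebang)  # e.g. ('bash',) if the script is executable and starts with '#!/usr/bin/env bash'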
# Parse from byte stream
script_content = b'#!/usr/bin/env python3\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang) # ('python3',)
# Complex shebang with arguments
script_content = b'#!/usr/bin/env -S python3 -u\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang) # ('python3', '-u')
# No shebang
script_content = b'print("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ()

The shebang parser handles various common patterns:
Standard Format:
#!/bin/bash
#!/usr/bin/python3

Environment-based:
#!/usr/bin/env python3
#!/usr/bin/env -S python3 -u

Nix Shell:
#!/usr/bin/env nix-shell
#! /some/path/to/interpreter

Quoted Arguments: Parser attempts shlex-style parsing first, then falls back to simple whitespace splitting for malformed quotes.
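The quoted-argument fallback and the env handling can be illustrated with a minimal sketch built only on the standard library. The helper names `split_shebang` and `strip_env` are hypothetical and are not part of the library's API; the sketch only mirrors the behavior described above (shlex-style parsing first, whitespace splitting on malformed quotes, and dropping a leading `/usr/bin/env` with an optional `-S`).

```python
import shlex


def split_shebang(line: str) -> tuple[str, ...]:
    """Split a shebang line (without the leading '#!') into tokens.

    Hypothetical helper for illustration only.
    """
    try:
        # First attempt: shell-style parsing, which honours quotes.
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes make shlex raise; fall back to plain
        # whitespace splitting.
        tokens = line.split()
    return tuple(tokens)


def strip_env(tokens: tuple[str, ...]) -> tuple[str, ...]:
    """Drop a leading '/usr/bin/env' (and an optional '-S') from the tokens.

    Hypothetical helper for illustration only.
    """
    if tokens[:2] == ("/usr/bin/env", "-S"):
        return tokens[2:]
    if tokens[:1] == ("/usr/bin/env",):
        return tokens[1:]
    return tokens


print(strip_env(split_shebang("/usr/bin/env -S python3 -u")))  # ('python3', '-u')
print(split_shebang('"/path with spaces/python" -u'))  # ('/path with spaces/python', '-u')
print(split_shebang('/bin/sh -c "unterminated'))  # ('/bin/sh', '-c', '"unterminated')
```

Trying `shlex.split` first keeps quoted interpreter paths intact, while the fallback ensures that a stray quote still yields usable tokens instead of an exception.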
The text/binary detection uses character distribution analysis of the first 1KB of content, modeled on libmagic's detection logic.
Because only that leading sample is read, classification stays fast for large-scale file processing and is reliable for most file types.
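As a rough illustration of a libmagic-style check, the sketch below reads the first 1KB of a stream and treats the content as text only if every byte belongs to an allowed set. The function name `looks_like_text` and the exact byte set (common control characters, printable ASCII, and high bytes) are assumptions made for this example; they are not confirmed to match the library's internal logic.

```python
import io
from typing import IO

# Assumed "text" byte set: bell, backspace, tab, newline, vertical tab,
# form feed, carriage return, escape, printable ASCII, and all high bytes.
TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7F))
    + bytes(range(0x80, 0x100))
)


def looks_like_text(stream: IO[bytes], sample_size: int = 1024) -> bool:
    """Return True if the leading sample contains only "text" bytes.

    Hypothetical helper for illustration only.
    """
    sample = stream.read(sample_size)
    # Delete every allowed byte; anything left over counts as binary.
    leftover = sample.translate(None, TEXT_BYTES)
    return not leftover


print(looks_like_text(io.BytesIO(b"print('hello world')\n")))  # True
print(looks_like_text(io.BytesIO(b"\x89PNG\r\n\x1a\n\x00\x00")))  # False
```

Note that in this sketch a disallowed byte located after the sampled prefix goes unnoticed, which is the trade-off for reading only a small, fixed amount of data per file.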
Install with Tessl CLI
npx tessl i tessl/pypi-identify