File identification library for Python that determines file types based on paths, extensions, and content analysis
Core functionality for identifying files using multiple detection methods including path-based analysis, filename pattern matching, extension mapping, and interpreter detection.
Comprehensive file identification using file system metadata, permissions, content analysis, and extension matching. This is the primary identification method providing the most complete tag set.
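The metadata part of this analysis can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `metadata_tags` is a hypothetical name, and it covers only the file-type and permission tags, leaving out filename, shebang, and content checks.

```python
import os
import stat
import tempfile


def metadata_tags(path: str) -> set[str]:
    """Hypothetical helper: tags derived from file system metadata only."""
    try:
        mode = os.lstat(path).st_mode  # lstat does not follow symlinks
    except OSError:
        raise ValueError(f'{path} does not exist.')

    if stat.S_ISDIR(mode):
        return {'directory'}
    if stat.S_ISLNK(mode):
        return {'symlink'}
    if stat.S_ISSOCK(mode):
        return {'socket'}

    # Regular file: record whether the current user may execute it.
    if os.access(path, os.X_OK):
        return {'file', 'executable'}
    return {'file', 'non-executable'}


with tempfile.TemporaryDirectory() as tmp:
    dir_tags = metadata_tags(tmp)
    file_path = os.path.join(tmp, 'notes.txt')
    with open(file_path, 'w') as f:
        f.write('hello\n')
    file_tags = metadata_tags(file_path)

print(sorted(dir_tags))     # ['directory']
print('file' in file_tags)  # True
```

Directories, symlinks, and sockets return early with a single tag in this sketch, which matches the `{'directory'}` result shown in the usage example.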
def tags_from_path(path: str) -> set[str]:
    """
    Identify file tags from a file path using comprehensive analysis.

    Performs file system analysis including:
    - File type detection (file, directory, symlink, socket)
    - Permission analysis (executable, non-executable)
    - Extension and filename matching
    - Shebang parsing for executables whose filename matches nothing
    - Binary/text content detection when no encoding tag is known yet

    Args:
        path (str): File path to analyze

    Returns:
        set[str]: Set of identifying tags

    Raises:
        ValueError: If path does not exist
    """

Usage Example:
from identify.identify import tags_from_path
# Python script
tags = tags_from_path('/path/to/script.py')
print(tags) # {'file', 'text', 'python', 'non-executable'}
# Executable shell script (the .sh extension already matches, so the shebang is not read)
tags = tags_from_path('/usr/bin/script.sh')
print(tags) # {'file', 'text', 'shell', 'executable'}
# Directory
tags = tags_from_path('/path/to/directory')
print(tags) # {'directory'}
# Binary image file
tags = tags_from_path('/path/to/image.png')
print(tags) # {'file', 'binary', 'image', 'png', 'non-executable'}

Fast identification based solely on filename and extension, without accessing the file system. Useful for batch processing or when file system access is not available.
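To illustrate the idea of matching on names and extensions without touching the disk, here is a simplified sketch. The tables `SAMPLE_NAMES` and `SAMPLE_EXTENSIONS` and the function `sketch_tags_from_filename` are hypothetical stand-ins with a handful of entries; the library's own lookup may differ in detail.

```python
import os

# Tiny illustrative tables (assumptions); the real mappings are far larger.
SAMPLE_EXTENSIONS = {
    'py': {'python', 'text'},
    'yaml': {'yaml', 'text'},
    'js': {'javascript', 'text'},
}
SAMPLE_NAMES = {
    'Dockerfile': {'dockerfile', 'text'},
    '.gitignore': {'gitignore', 'text'},
}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Hypothetical, simplified lookup: special names first, then extension."""
    filename = os.path.basename(path)
    tags = set(SAMPLE_NAMES.get(filename, ()))

    # splitext('.gitignore') yields an empty extension, so dotfiles are
    # only ever matched through the name table above.
    _, ext = os.path.splitext(filename)
    if ext:
        tags |= SAMPLE_EXTENSIONS.get(ext[1:].lower(), set())
    return tags


print(sorted(sketch_tags_from_filename('conf/config.yaml')))  # ['text', 'yaml']
print(sorted(sketch_tags_from_filename('Dockerfile')))        # ['dockerfile', 'text']
print(sorted(sketch_tags_from_filename('unknown')))           # []
```

Because the function only inspects the string it is given, it works equally well on paths that do not exist.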
def tags_from_filename(path: str) -> set[str]:
    """
    Identify file tags based only on filename/extension.

    Matches filename against known patterns and extensions without
    accessing the file system. Supports both extension-based matching
    and special filename recognition (e.g., 'Dockerfile', '.gitignore').

    Args:
        path (str): File path or filename to analyze

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """

Usage Example:
from identify.identify import tags_from_filename
# Extension-based matching
tags = tags_from_filename('config.yaml')
print(tags) # {'yaml', 'text'}
tags = tags_from_filename('script.js')
print(tags) # {'javascript', 'text'}
# Special filename matching
tags = tags_from_filename('Dockerfile')
print(tags) # {'dockerfile', 'text'}
tags = tags_from_filename('.gitignore')
print(tags) # {'gitignore', 'text'}
# No match returns empty set
tags = tags_from_filename('unknown')
print(tags) # set()

Identify file types based on interpreter names, typically extracted from shebang lines. Supports version-specific interpreters with fallback to general interpreter names.
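The version fallback can be pictured as repeatedly dropping the last dot-separated component until a known name is found. The sketch below assumes a small hypothetical table, `SAMPLE_INTERPRETERS`, and a hypothetical function name; it illustrates the technique rather than reproducing the library's code.

```python
# Illustrative subset only (assumption); the real table has many more entries.
SAMPLE_INTERPRETERS = {
    'python3': {'python', 'python3'},
    'bash': {'shell', 'bash'},
    'node': {'javascript'},
}


def sketch_tags_from_interpreter(interpreter: str) -> set[str]:
    """Hypothetical sketch of specific-to-general interpreter matching."""
    # Keep only the final path component, e.g. '/usr/bin/bash' -> 'bash'.
    name = interpreter.rpartition('/')[2]

    while name:
        if name in SAMPLE_INTERPRETERS:
            return set(SAMPLE_INTERPRETERS[name])
        # Drop the last dot-separated part: 'python3.9.1' -> 'python3.9'.
        # When no dot is left, rpartition returns '' and the loop ends.
        name = name.rpartition('.')[0]

    return set()


print(sorted(sketch_tags_from_interpreter('python3.9.1')))    # ['python', 'python3']
print(sorted(sketch_tags_from_interpreter('/usr/bin/bash')))  # ['bash', 'shell']
print(sorted(sketch_tags_from_interpreter('unknown')))        # []
```

Returning a copy of the table entry keeps callers from accidentally mutating the shared mapping.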
def tags_from_interpreter(interpreter: str) -> set[str]:
    """
    Get tags for a given interpreter name.

    Attempts progressive matching from specific to general by
    dropping trailing version components:
    'python3.9.1' -> 'python3.9' -> 'python3'

    Args:
        interpreter (str): Interpreter name (e.g., 'python3', 'bash', 'node')

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """

Usage Example:
from identify.identify import tags_from_interpreter
# Specific version with fallback
tags = tags_from_interpreter('python3.9.1')
print(tags) # {'python', 'python3'}
# Shell interpreters
tags = tags_from_interpreter('bash')
print(tags) # {'shell', 'bash'}
tags = tags_from_interpreter('zsh')
print(tags) # {'shell', 'zsh'}
# JavaScript runtime
tags = tags_from_interpreter('node')
print(tags) # {'javascript'}
# Unknown interpreter
tags = tags_from_interpreter('unknown')
print(tags) # set()

The identification system relies on comprehensive databases of file patterns:
from identify.extensions import EXTENSIONS, EXTENSIONS_NEED_BINARY_CHECK, NAMES
from identify.interpreters import INTERPRETERS
# Extension to tags mapping (400+ extensions)
EXTENSIONS: dict[str, set[str]]
# Extensions requiring binary content check
EXTENSIONS_NEED_BINARY_CHECK: dict[str, set[str]]
# Special filename to tags mapping (100+ filenames)
NAMES: dict[str, set[str]]
# Interpreter to tags mapping
INTERPRETERS: dict[str, set[str]]

Example Data:
# Sample extension mappings
EXTENSIONS['py'] # {'python', 'text'}
EXTENSIONS['js'] # {'javascript', 'text'}
EXTENSIONS['png'] # {'binary', 'image', 'png'}
# Sample filename mappings
NAMES['Dockerfile'] # {'dockerfile', 'text'}
NAMES['.gitignore'] # {'gitignore', 'text'}
NAMES['Makefile'] # {'makefile', 'text'}
# Sample interpreter mappings
INTERPRETERS['python3'] # {'python', 'python3'}
INTERPRETERS['bash'] # {'shell', 'bash'}
INTERPRETERS['node'] # {'javascript'}

The identification system uses standardized tags organized into categories:
File Types: file, directory, symlink, socket
Permissions: executable, non-executable
Encoding: text, binary
Languages: python, javascript, shell, c++, java, etc.
Formats: json, yaml, xml, csv, markdown, etc.
Images: png, jpeg, gif, svg, etc.
Archives: zip, tar, gzip, bzip2, etc.
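Because every result is a plain `set[str]`, the categories above combine naturally with set operations. The following sketch uses locally defined constants built from the category list (assumptions for illustration, not imports from the library) and a hypothetical helper, `split_tags`, to group a tag set and test for a combination of tags.

```python
# Local constants taken from the category list above (illustrative only).
FILE_TYPE_TAGS = {'file', 'directory', 'symlink', 'socket'}
PERMISSION_TAGS = {'executable', 'non-executable'}
ENCODING_TAGS = {'text', 'binary'}


def split_tags(tags: set[str]) -> dict[str, set[str]]:
    """Hypothetical helper: group a tag set by category."""
    known = FILE_TYPE_TAGS | PERMISSION_TAGS | ENCODING_TAGS
    return {
        'type': tags & FILE_TYPE_TAGS,
        'permission': tags & PERMISSION_TAGS,
        'encoding': tags & ENCODING_TAGS,
        'other': tags - known,  # languages, formats, and so on
    }


tags = {'file', 'text', 'python', 'non-executable'}
groups = split_tags(tags)

# Subset test: does this file carry every tag we require?
is_python_source = {'python', 'text'} <= tags

print(sorted(groups['other']))  # ['python']
print(is_python_source)         # True
```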
Install with Tessl CLI
npx tessl i tessl/pypi-identify