File identification library for Python that determines file types based on paths, extensions, and content analysis
npx @tessl/cli install tessl/pypi-identify@2.6.0
The library provides comprehensive file type detection using multiple methods, including path-based analysis, filename matching, extension mapping, and shebang parsing.
pip install identify
pip install identify[license]  (for license identification)
from identify import identify
Or import specific functions:
from identify.identify import (
    tags_from_path,
    tags_from_filename,
    tags_from_interpreter,
    file_is_text,
    parse_shebang_from_file,
    license_id,
    ALL_TAGS,
)
For command-line interface:
from identify.cli import main
from identify.identify import tags_from_path, tags_from_filename
# Identify file from path (comprehensive analysis)
tags = tags_from_path('/path/to/script.py')
print(tags) # {'file', 'text', 'python', 'non-executable'}
# Identify file from filename only (no file system access)
tags = tags_from_filename('config.yaml')
print(tags) # {'yaml', 'text'}
# Check if file is text or binary
from identify.identify import file_is_text
is_text = file_is_text('/path/to/file.txt')
print(is_text) # True
# Parse shebang from executable files
from identify.identify import parse_shebang_from_file
shebang = parse_shebang_from_file('/path/to/script.py')
print(shebang) # ('python3',) or empty tuple
The identify library uses a layered approach to file identification, combining path-based analysis, filename matching, extension mapping, and shebang parsing.
The library includes comprehensive databases of file extensions (EXTENSIONS), special filenames (NAMES), and interpreter mappings (INTERPRETERS) covering hundreds of file types across multiple programming languages and formats.
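To make the layering concrete, here is a minimal, self-contained sketch of how filename lookup, extension lookup, and a shebang fallback can be combined. It is not the library's implementation: the three tables are tiny hypothetical stand-ins that only borrow the names EXTENSIONS, NAMES, and INTERPRETERS, every `sketch_*` function is invented for illustration, and the file content is passed in as bytes so the example does not touch the file system.

```python
import io
import os
from typing import IO

# Hypothetical, heavily trimmed stand-ins for the library's lookup tables.
EXTENSIONS = {"py": {"text", "python"}, "yaml": {"text", "yaml"}}
NAMES = {"Dockerfile": {"text", "dockerfile"}}
INTERPRETERS = {"python3": {"python", "python3"}, "bash": {"shell", "bash"}}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Collect tags from an exact filename match, then from the extension."""
    name = os.path.basename(path)
    tags: set[str] = set()
    tags.update(NAMES.get(name, set()))
    ext = os.path.splitext(name)[1][1:].lower()
    tags.update(EXTENSIONS.get(ext, set()))
    return tags


def sketch_parse_shebang(stream: IO[bytes]) -> tuple[str, ...]:
    """Return the command on a '#!' first line, or () if there is none."""
    if stream.read(2) != b"#!":
        return ()
    try:
        first_line = stream.readline().decode("utf-8").strip()
    except UnicodeDecodeError:
        return ()
    parts = first_line.split()
    # Drop a leading '/usr/bin/env' so the interpreter comes first.
    if parts and parts[0] == "/usr/bin/env":
        parts = parts[1:]
    return tuple(parts)


def sketch_identify(path: str, content: bytes) -> set[str]:
    """Try the filename layers first; fall back to the shebang line."""
    tags = sketch_tags_from_filename(path)
    if not tags:
        cmd = sketch_parse_shebang(io.BytesIO(content))
        if cmd:
            interpreter = cmd[0].rpartition("/")[2]  # '/bin/bash' -> 'bash'
            tags.update(INTERPRETERS.get(interpreter, set()))
    return tags


print(sketch_identify("config.yaml", b""))
print(sketch_identify("run", b"#!/usr/bin/env python3\nprint(1)\n"))
```

The ordering is the point of the sketch: cheap name-based lookups run first, and content is inspected only when they yield nothing.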
Core functionality for identifying files using path-based analysis, filename matching, and extension mapping. Returns comprehensive tag sets describing file characteristics including type, encoding, language, and mode.
def tags_from_path(path: str) -> set[str]: ...
def tags_from_filename(path: str) -> set[str]: ...
def tags_from_interpreter(interpreter: str) -> set[str]: ...
Functions for analyzing file content to determine text vs binary classification and parse executable shebang lines for interpreter identification.
def file_is_text(path: str) -> bool: ...
def is_text(bytesio: IO[bytes]) -> bool: ...
def parse_shebang_from_file(path: str) -> tuple[str, ...]: ...
def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]: ...
Optional functionality for identifying software licenses in files using SPDX identifiers through text matching and edit distance algorithms.
def license_id(filename: str) -> str | None: ...
# Basic file type constants
DIRECTORY: str
FILE: str
SYMLINK: str
SOCKET: str
EXECUTABLE: str
NON_EXECUTABLE: str
TEXT: str
BINARY: str
# Tag collections
TYPE_TAGS: frozenset[str]
MODE_TAGS: frozenset[str]
ENCODING_TAGS: frozenset[str]
ALL_TAGS: frozenset[str]
The package provides a command-line tool for file identification:
from collections.abc import Sequence

def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface for file identification.

    Args:
        argv: Command line arguments, defaults to sys.argv

    Returns:
        int: Exit code (0 for success, 1 for error)
    """
Usage:
# Identify file with full analysis
identify-cli /path/to/file
# Identify using filename only
identify-cli --filename-only /path/to/file
Output is a JSON array of tags:
["file", "text", "python", "non-executable"]
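Because the output is plain JSON, it is straightforward to consume from another program. The sketch below parses the sample array shown above and sorts each tag into a type, mode, or encoding group. The three frozensets are local stand-ins whose values are assumed to mirror TYPE_TAGS, MODE_TAGS, and ENCODING_TAGS, and `group_cli_tags` is a hypothetical helper, not part of the package.

```python
import json

# Local stand-ins; the values are assumptions mirroring the documented constants.
TYPE_TAGS = frozenset({"directory", "file", "symlink", "socket"})
MODE_TAGS = frozenset({"executable", "non-executable"})
ENCODING_TAGS = frozenset({"text", "binary"})


def group_cli_tags(cli_output: str) -> dict[str, list[str]]:
    """Parse a JSON array of tags and sort each tag into a category."""
    groups: dict[str, list[str]] = {
        "type": [], "mode": [], "encoding": [], "other": [],
    }
    for tag in json.loads(cli_output):
        if tag in TYPE_TAGS:
            groups["type"].append(tag)
        elif tag in MODE_TAGS:
            groups["mode"].append(tag)
        elif tag in ENCODING_TAGS:
            groups["encoding"].append(tag)
        else:
            # Anything else is treated as a language or format tag.
            groups["other"].append(tag)
    return groups


grouped = group_cli_tags('["file", "text", "python", "non-executable"]')
print(grouped)
# {'type': ['file'], 'mode': ['non-executable'], 'encoding': ['text'], 'other': ['python']}
```

In real use, the string passed to `group_cli_tags` would be the captured standard output of an `identify-cli` invocation.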