tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis

File Identification

Name: tessl/pypi-identify
Author: tessl

Core functionality for identifying files using multiple detection methods including path-based analysis, filename pattern matching, extension mapping, and interpreter detection.

Capabilities

Path-Based Identification

Identifies a file on disk by combining file system metadata (file type and permissions), filename and extension matching, shebang parsing, and a text/binary content check. Because it reads the file system, it returns the most complete tag set of the three identification functions.

def tags_from_path(path: str) -> set[str]:
    """
    Identify file tags from a file path using comprehensive analysis.
    
    Performs file system analysis including:
    - File type detection (file, directory, symlink, socket)
    - Permission analysis (executable, non-executable)
    - Extension and filename matching
    - Shebang parsing for executables whose filename matches no known tags
    - Binary/text content detection when the tags do not already say which
    
    Args:
        path (str): File path to analyze
        
    Returns:
        set[str]: Set of identifying tags
        
    Raises:
        ValueError: If path does not exist
    """

Usage Example:

from identify.identify import tags_from_path

# Python script
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Executable shell script (the .sh extension supplies the tags, so the shebang is not read)
tags = tags_from_path('/usr/bin/script.sh')
print(tags)  # {'file', 'text', 'shell', 'executable'}

# Directory
tags = tags_from_path('/path/to/directory')
print(tags)  # {'directory'}

# Binary image file
tags = tags_from_path('/path/to/image.png')
print(tags)  # {'file', 'binary', 'image', 'png', 'non-executable'}
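
The file system checks listed in the docstring can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not the library's implementation: `basic_tags` is a hypothetical helper name, and it covers only the file type and permission tags, leaving out extension, shebang, and content analysis.

```python
import os
import stat


def basic_tags(path: str) -> set:
    """Return file-type and permission tags for a path (hypothetical helper)."""
    try:
        # lstat does not follow symlinks, so a link is reported as a link.
        mode = os.lstat(path).st_mode
    except OSError:
        raise ValueError(f"{path} does not exist.") from None

    if stat.S_ISDIR(mode):
        return {"directory"}
    if stat.S_ISLNK(mode):
        return {"symlink"}
    if stat.S_ISSOCK(mode):
        return {"socket"}

    # Anything else is treated as a regular file in this sketch.
    tags = {"file"}
    tags.add("executable" if os.access(path, os.X_OK) else "non-executable")
    return tags
```

Directories, symlinks, and sockets return a single tag here, which mirrors the `{'directory'}` result in the usage example above.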

Filename-Only Identification

Fast identification based solely on filename and extension without accessing the file system. Useful for batch processing or when file system access is not available.

def tags_from_filename(path: str) -> set[str]:
    """
    Identify file tags based only on filename/extension.
    
    Matches filename against known patterns and extensions without
    accessing the file system. Supports both extension-based matching
    and special filename recognition (e.g., 'Dockerfile', '.gitignore').
    
    Args:
        path (str): File path or filename to analyze
        
    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """

Usage Example:

from identify.identify import tags_from_filename

# Extension-based matching
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

tags = tags_from_filename('script.js')
print(tags)  # {'javascript', 'text'}

# Special filename matching  
tags = tags_from_filename('Dockerfile')
print(tags)  # {'dockerfile', 'text'}

tags = tags_from_filename('.gitignore')
print(tags)  # {'gitignore', 'text'}

# No match returns empty set
tags = tags_from_filename('unknown')
print(tags)  # set()
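
The matching described above can be pictured as a whole-filename lookup combined with an extension lookup. The sketch below is an illustration under assumptions, not the library's code: `TOY_NAMES` and `TOY_EXTENSIONS` are small hypothetical tables standing in for the real data, and `match_filename` is an illustrative name that is not part of the `identify` API.

```python
import os

# Hypothetical, trimmed-down tables for illustration only.
TOY_NAMES = {
    "Dockerfile": {"dockerfile", "text"},
    ".gitignore": {"gitignore", "text"},
}
TOY_EXTENSIONS = {
    "yaml": {"yaml", "text"},
    "js": {"javascript", "text"},
}


def match_filename(path: str) -> set:
    """Collect tags from the filename alone; the file is never opened."""
    filename = os.path.basename(path)
    tags = set()
    # Special filenames such as 'Dockerfile' are matched as a whole.
    tags |= TOY_NAMES.get(filename, set())
    # In this sketch the extension is compared without its leading dot,
    # case-insensitively.
    ext = os.path.splitext(filename)[1][1:].lower()
    tags |= TOY_EXTENSIONS.get(ext, set())
    return tags


print(match_filename("config.yaml") == {"yaml", "text"})  # True
print(match_filename("unknown") == set())                 # True
```

Since nothing touches the disk, the path does not need to exist, which is what makes this style of lookup suitable for batch processing.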

Interpreter-Based Identification

Identify file types based on interpreter names, typically extracted from shebang lines. Supports version-specific interpreters with fallback to general interpreter names.

def tags_from_interpreter(interpreter: str) -> set[str]:
    """
    Get tags for a given interpreter name.
    
    Attempts progressive matching from specific to general by dropping
    trailing dotted version components until a known interpreter matches:
    'python3.9.1' -> 'python3.9' -> 'python3'
    
    Args:
        interpreter (str): Interpreter name (e.g., 'python3', 'bash', 'node')
        
    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """

Usage Example:

from identify.identify import tags_from_interpreter

# Specific version with fallback
tags = tags_from_interpreter('python3.9.1')
print(tags)  # {'python', 'python3'}

# Shell interpreters
tags = tags_from_interpreter('bash')  
print(tags)  # {'shell', 'bash'}

tags = tags_from_interpreter('zsh')
print(tags)  # {'shell', 'zsh'}

# JavaScript runtime
tags = tags_from_interpreter('node')
print(tags)  # {'javascript'}

# Unknown interpreter
tags = tags_from_interpreter('unknown')
print(tags)  # set()
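
To connect shebang lines to the lookup above, the sketch below first pulls an interpreter name out of a shebang line and then applies a version fallback. It is a simplified illustration, not the library's implementation: `TOY_INTERPRETERS`, `interpreter_from_shebang`, and `lookup_interpreter` are hypothetical names, and `env` options such as `-S` are deliberately not handled.

```python
import os
import shlex

# Hypothetical, trimmed-down table for illustration only.
TOY_INTERPRETERS = {
    "python3": {"python", "python3"},
    "bash": {"shell", "bash"},
    "node": {"javascript"},
}


def interpreter_from_shebang(first_line: str) -> str:
    """Return the interpreter basename named by a shebang line, or ''."""
    if not first_line.startswith("#!"):
        return ""
    try:
        parts = shlex.split(first_line[2:].strip())
    except ValueError:  # e.g. unbalanced quotes
        return ""
    if not parts:
        return ""
    # '#!/usr/bin/env python3' names the interpreter in the second word.
    if os.path.basename(parts[0]) == "env" and len(parts) > 1:
        parts = parts[1:]
    return os.path.basename(parts[0])


def lookup_interpreter(interpreter: str) -> set:
    """Drop trailing '.N' components until a table entry matches."""
    name = interpreter
    while name:
        if name in TOY_INTERPRETERS:
            return set(TOY_INTERPRETERS[name])
        # 'python3.9.1' -> 'python3.9' -> 'python3' -> ''
        name, _, _ = name.rpartition(".")
    return set()


name = interpreter_from_shebang("#!/usr/bin/env python3.9\n")
print(name)  # python3.9
print(lookup_interpreter(name) == {"python", "python3"})  # True
```

The fallback stops at the first match, so a more specific table entry would take precedence over a general one.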

Data Structures

The identification functions rely on dictionaries that map extensions, filenames, and interpreter names to tag sets:

from identify.extensions import EXTENSIONS, EXTENSIONS_NEED_BINARY_CHECK, NAMES
from identify.interpreters import INTERPRETERS

# Extension to tags mapping (400+ extensions)
EXTENSIONS: dict[str, set[str]]

# Extensions whose files may be text or binary, so the content must be checked
EXTENSIONS_NEED_BINARY_CHECK: dict[str, set[str]]

# Special filename to tags mapping (100+ filenames)
NAMES: dict[str, set[str]]

# Interpreter to tags mapping
INTERPRETERS: dict[str, set[str]]

Example Data:

# Sample extension mappings
EXTENSIONS['py']     # {'python', 'text'}
EXTENSIONS['js']     # {'javascript', 'text'}  
EXTENSIONS['png']    # {'binary', 'image', 'png'}

# Sample filename mappings
NAMES['Dockerfile']     # {'dockerfile', 'text'}
NAMES['.gitignore']     # {'gitignore', 'text'}
NAMES['Makefile']       # {'makefile', 'text'}

# Sample interpreter mappings  
INTERPRETERS['python3']  # {'python', 'python3'}
INTERPRETERS['bash']     # {'shell', 'bash'}
INTERPRETERS['node']     # {'javascript'}
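
An extension listed in `EXTENSIONS_NEED_BINARY_CHECK` does not settle whether a file is text or binary, so the content has to be inspected. One common heuristic, shown here as a sketch and not as the library's own check, treats a file as binary when its first bytes contain a NUL byte. The helper name `looks_like_text` and the 1024-byte sample size are assumptions made for this example.

```python
import io
from typing import BinaryIO


def looks_like_text(stream: BinaryIO, sample_size: int = 1024) -> bool:
    """Guess text vs. binary from the first bytes of a binary stream."""
    sample = stream.read(sample_size)
    # NUL bytes are rare in text encodings such as ASCII and UTF-8.
    return b"\x00" not in sample


print(looks_like_text(io.BytesIO(b"key: value\n")))       # True
print(looks_like_text(io.BytesIO(b"\x89PNG\r\n\x00\x00")))  # False
```

For a file on disk, the same helper would be called on a handle opened in binary mode, for example `with open(path, 'rb') as f: looks_like_text(f)`.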

Tag Categories

The identification system uses standardized tags organized into categories:

File Types: file, directory, symlink, socket
Permissions: executable, non-executable
Encoding: text, binary
Languages: python, javascript, shell, c++, java, etc.
Formats: json, yaml, xml, csv, markdown, etc.
Images: png, jpeg, gif, svg, etc.
Archives: zip, tar, gzip, bzip2, etc.
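
Because every result is a plain Python set, tags from different categories combine through ordinary set operations. The sketch below filters a few hard-coded results, copied from the path-based usage example, by a required tag set; `RESULTS` and `select` are hypothetical names used only for illustration.

```python
# Tag sets copied from the path-based usage example.
RESULTS = {
    "/path/to/script.py": {"file", "text", "python", "non-executable"},
    "/path/to/image.png": {"file", "binary", "image", "png", "non-executable"},
    "/path/to/directory": {"directory"},
}


def select(results: dict, required: set) -> list:
    """Return the sorted paths whose tag set contains every required tag."""
    return sorted(path for path, tags in results.items() if required <= tags)


print(select(RESULTS, {"file", "text"}))    # ['/path/to/script.py']
print(select(RESULTS, {"non-executable"}))  # ['/path/to/image.png', '/path/to/script.py']
```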