tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis

Name: tessl/pypi-identify
Author: tessl

File identification library for Python that determines file types from paths, extensions, and content. It combines four detection methods: path-based analysis, filename matching, extension mapping, and shebang parsing.

Package Information

Package Name: identify
Package Type: pypi
Language: Python
Installation: pip install identify
Optional Features: pip install identify[license] for license identification

Core Imports

from identify import identify

Or import specific functions:

from identify.identify import (
    tags_from_path, 
    tags_from_filename, 
    tags_from_interpreter,
    file_is_text,
    parse_shebang_from_file,
    license_id,
    ALL_TAGS
)

For command-line interface:

from identify.cli import main

Basic Usage

from identify.identify import tags_from_path, tags_from_filename

# Identify file from path (comprehensive analysis)
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Identify file from filename only (no file system access)  
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

# Check if file is text or binary
from identify.identify import file_is_text
is_text = file_is_text('/path/to/file.txt')
print(is_text)  # True

# Parse shebang from executable files
from identify.identify import parse_shebang_from_file
shebang = parse_shebang_from_file('/path/to/script.py')
print(shebang)  # e.g. ('python3',) for '#!/usr/bin/env python3'; () if the file is not executable or has no shebang

Architecture

The identify library uses a layered approach to file identification:

Path Analysis: Examines file system metadata (type, permissions, accessibility)
Filename Matching: Matches against known filenames and extensions using predefined mappings
Content Analysis: Performs binary/text detection and shebang parsing for executables
Tag System: Returns standardized tags that categorize files by type, encoding, mode, and language

The library ships lookup tables for file extensions (EXTENSIONS), special filenames (NAMES), and interpreter names (INTERPRETERS) that together cover hundreds of file types across programming languages and data formats.
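
The path analysis layer described above can be illustrated with a small standard-library sketch. This is not the library's code: the function name `sketch_path_tags` is hypothetical, and the tag strings other than 'file' and 'non-executable' are assumed to match the constants listed later in this document.

```python
import os
import stat


def sketch_path_tags(path: str) -> set[str]:
    """Return type and mode tags for a path using file system metadata only."""
    try:
        # lstat does not follow symlinks, so a symlink is reported as itself.
        mode = os.lstat(path).st_mode
    except OSError as exc:
        raise ValueError(f"{path} does not exist.") from exc

    # Non-regular entries get a single type tag (tag strings are assumptions).
    if stat.S_ISDIR(mode):
        return {"directory"}
    if stat.S_ISLNK(mode):
        return {"symlink"}
    if stat.S_ISSOCK(mode):
        return {"socket"}

    # Regular files also get a mode tag based on the execute permission.
    mode_tag = "executable" if os.access(path, os.X_OK) else "non-executable"
    return {"file", mode_tag}
```

In a layered design like this, the filename and content layers would only need to run for regular files, since directories, symlinks, and sockets are fully described by their type tag.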

Capabilities

File Identification

Core functionality for identifying files using path-based analysis, filename matching, and extension mapping. Returns comprehensive tag sets describing file characteristics including type, encoding, language, and mode.

def tags_from_path(path: str) -> set[str]: ...
def tags_from_filename(path: str) -> set[str]: ...
def tags_from_interpreter(interpreter: str) -> set[str]: ...
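
To make the filename and interpreter lookups concrete, here is a minimal sketch of how table-driven matching could work. The three `SKETCH_*` tables and both function names are hypothetical stand-ins with only a few entries; the real tables are far larger and the library's matching rules may differ in detail.

```python
import os

# Hypothetical, tiny stand-ins for the library's lookup tables.
SKETCH_EXTENSIONS = {"py": {"text", "python"}, "yaml": {"text", "yaml"}}
SKETCH_NAMES = {"Dockerfile": {"text", "dockerfile"}}
SKETCH_INTERPRETERS = {"python3": {"python", "python3"}, "bash": {"shell", "bash"}}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Look up tags from the file name alone; the file need not exist."""
    filename = os.path.basename(path)
    tags: set[str] = set()

    # Try the whole name first, then its dot-separated parts, so that
    # "Dockerfile.dev" can still match the "Dockerfile" entry.
    for part in [filename, *filename.split(".")]:
        if part in SKETCH_NAMES:
            tags |= SKETCH_NAMES[part]
            break

    # Extensions are compared case-insensitively, without the leading dot.
    ext = os.path.splitext(filename)[1][1:].lower()
    tags |= SKETCH_EXTENSIONS.get(ext, set())
    return tags


def sketch_tags_from_interpreter(interpreter: str) -> set[str]:
    """Map an interpreter such as '/usr/bin/python3.11' to tags."""
    name = interpreter.rpartition("/")[2]
    # Drop trailing version components until a known name is found:
    # "python3.11" -> "python3".
    while name:
        if name in SKETCH_INTERPRETERS:
            return set(SKETCH_INTERPRETERS[name])
        name = name.rpartition(".")[0]
    return set()
```

Because neither function touches the file system, this style of lookup is suitable for paths that do not exist yet, such as names read from a diff or an archive listing.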

Content Analysis

Functions for analyzing file content to determine text vs binary classification and parse executable shebang lines for interpreter identification.

def file_is_text(path: str) -> bool: ...
def is_text(bytesio: IO[bytes]) -> bool: ...
def parse_shebang_from_file(path: str) -> tuple[str, ...]: ...
def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]: ...
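
The two stream-based signatures accept any binary file object, so the underlying ideas can be shown with `io.BytesIO`. The sketch below is a common heuristic written from scratch, not the library's implementation: the names `sketch_is_text` and `sketch_parse_shebang`, the 1 KiB sample size, and the handling of `/usr/bin/env` are all assumptions.

```python
import io
import shlex
from typing import IO

# Bytes that commonly appear in text: a few control characters,
# printable ASCII, and the high range used by UTF-8 and Latin encodings.
_TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7F))
    + bytes(range(0x80, 0x100))
)


def sketch_is_text(bytesio: IO[bytes]) -> bool:
    """Treat the stream as text if its first 1 KiB contains only text bytes."""
    chunk = bytesio.read(1024)
    # translate(None, delete) removes every byte listed in `delete`;
    # anything left over is a byte we do not consider text.
    return not chunk.translate(None, _TEXT_BYTES)


def sketch_parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]:
    """Return the interpreter command from a '#!' first line, else ()."""
    if bytesio.read(2) != b"#!":
        return ()
    try:
        first_line = bytesio.readline().decode("utf-8").strip()
        cmd = tuple(shlex.split(first_line))
    except (UnicodeDecodeError, ValueError):
        # Undecodable bytes or unbalanced quotes: treat as no shebang.
        return ()
    # "#!/usr/bin/env python3" names the interpreter after env.
    if cmd and cmd[0] == "/usr/bin/env":
        cmd = cmd[1:]
    return cmd
```

Sampling only the start of the stream keeps the check cheap on large files, at the cost of misclassifying files whose binary content begins later.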

License Identification

Optional functionality that identifies the software license contained in a file and returns its SPDX identifier, using text matching and edit distance. It requires the license extra (pip install identify[license]).

def license_id(filename: str) -> str | None: ...

Constants and Data

# Basic file type constants
DIRECTORY: str
FILE: str  
SYMLINK: str
SOCKET: str
EXECUTABLE: str
NON_EXECUTABLE: str
TEXT: str
BINARY: str

# Tag collections
TYPE_TAGS: frozenset[str]
MODE_TAGS: frozenset[str] 
ENCODING_TAGS: frozenset[str]
ALL_TAGS: frozenset[str]
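
One use for the tag collections is splitting a result into categories. The sketch below uses local stand-in frozensets so that it runs without the package installed; their string values are assumed to match the constants above, and `split_tags` is a hypothetical helper, not part of the library.

```python
# Local stand-ins mirroring the documented tag groups (values assumed).
SKETCH_TYPE_TAGS = frozenset({"directory", "file", "symlink", "socket"})
SKETCH_MODE_TAGS = frozenset({"executable", "non-executable"})
SKETCH_ENCODING_TAGS = frozenset({"text", "binary"})


def split_tags(tags: set[str]) -> dict[str, set[str]]:
    """Group tags by category; anything else is treated as a format/language tag."""
    known = SKETCH_TYPE_TAGS | SKETCH_MODE_TAGS | SKETCH_ENCODING_TAGS
    return {
        "type": tags & SKETCH_TYPE_TAGS,
        "mode": tags & SKETCH_MODE_TAGS,
        "encoding": tags & SKETCH_ENCODING_TAGS,
        "other": tags - known,
    }
```

With the package installed, the stand-ins could presumably be replaced by the TYPE_TAGS, MODE_TAGS, and ENCODING_TAGS collections listed above.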

Command Line Interface

The package provides a command-line tool for file identification:

from collections.abc import Sequence

def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface for file identification.
    
    Args:
        argv: Command-line arguments; defaults to sys.argv[1:] when None
        
    Returns:
        int: Exit code (0 on success, 1 if the path does not exist or no tags are found)
    """

Usage:

# Identify file with full analysis
identify-cli /path/to/file

# Identify using filename only
identify-cli --filename-only /path/to/file

The output is a JSON array of tags, sorted alphabetically:

["file", "non-executable", "python", "text"]
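
Because the output is plain JSON, it is easy to consume from another program. The following sketch parses the tool's output into a set and wraps the call in `subprocess`; both function names are hypothetical, and `identify_via_cli` assumes that identify-cli is installed and on PATH.

```python
import json
import subprocess


def parse_identify_output(stdout: str) -> set[str]:
    """Convert the CLI's JSON array of tags into a set of strings."""
    tags = json.loads(stdout)
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError(f"unexpected identify-cli output: {stdout!r}")
    return set(tags)


def identify_via_cli(path: str) -> set[str]:
    """Run identify-cli on a path (requires the tool on PATH)."""
    # check=True raises CalledProcessError on a non-zero exit code.
    result = subprocess.run(
        ["identify-cli", path], capture_output=True, text=True, check=True
    )
    return parse_identify_output(result.stdout)
```

Within Python code, calling the library functions directly avoids the process overhead; a wrapper like this is mainly useful from tools that cannot import the package.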

Install with Tessl CLI

npx tessl i tessl/pypi-identify

Workspace: tessl
Visibility: Public
Created: 6 months ago
Last updated: about 1 month ago
Describes: pkg:pypi/identify@2.6.x
Publish Source: CLI