tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis

Content Analysis

Functions that classify file content as text or binary and parse the shebang line of executable files to identify the interpreter.

Capabilities

Text/Binary Detection

Determine whether files contain text or binary content using character analysis algorithms based on libmagic's detection logic.

def file_is_text(path: str) -> bool:
    """
    Determine if a file contains text content.
    
    Opens the file and checks whether the first 1 KB consists
    only of bytes that are considered text.
    
    Args:
        path (str): Path to file to analyze
        
    Returns:
        bool: True if file appears to be text, False if binary
        
    Raises:
        ValueError: If path does not exist
    """

def is_text(bytesio: IO[bytes]) -> bool:
    """
    Determine if byte stream content appears to be text.
    
    Reads the first 1 KB of a byte stream and reports whether it
    consists only of bytes that are considered text. Based on
    libmagic's binary/text detection algorithm.
    
    Args:
        bytesio (IO[bytes]): Open binary file-like object
        
    Returns:
        bool: True if content appears to be text, False if binary
    """

Usage Example:

from identify.identify import file_is_text, is_text
import io

# Check file directly
is_text_file = file_is_text('/path/to/document.txt')
print(is_text_file)  # True

png_is_text = file_is_text('/path/to/image.png')
print(png_is_text)  # False

# Check byte stream
with open('/path/to/file.py', 'rb') as f:
    result = is_text(f)
    print(result)  # True

# Check bytes in memory
data = b"print('hello world')\n"
stream = io.BytesIO(data)
result = is_text(stream)
print(result)  # True

Shebang Parsing

Parse shebang lines from executable files to extract interpreter and argument information. Handles various shebang formats including env, nix-shell, and quoted arguments.

def parse_shebang_from_file(path: str) -> tuple[str, ...]:
    """
    Parse shebang from a file path.
    
    Extracts shebang information from executable files. Only processes
    files that are executable and have valid shebang format. Handles
    various shebang patterns including /usr/bin/env usage.
    
    Args:
        path (str): Path to executable file
        
    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no shebang
        
    Raises:
        ValueError: If path does not exist
    """

def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]:
    """
    Parse shebang from an open binary file stream.
    
    Reads and parses shebang line from the beginning of a binary stream.
    Handles various formats including env, nix-shell, and quoted arguments.
    Only processes printable ASCII content.
    
    Args:
        bytesio (IO[bytes]): Open binary file-like object positioned at start
        
    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no valid shebang
    """

Usage Example:

from identify.identify import parse_shebang_from_file, parse_shebang
import io

# Parse from file path
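# Parse from file path (the result depends on the file's first line)
shebang = parse_shebang_from_file('/usr/bin/python3-script')
print(shebang)  # ('python3',) if the first line is '#!/usr/bin/env python3'

shebang = parse_shebang_from_file('/path/to/bash-script.sh')
print(shebang)  # ('/bin/bash',) if the first line is '#!/bin/bash'; () if the file is not executable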

# Parse from byte stream
script_content = b'#!/usr/bin/env python3\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ('python3',)

# Complex shebang with arguments
script_content = b'#!/usr/bin/env -S python3 -u\nprint("hello")\n'
stream = io.BytesIO(script_content)  
shebang = parse_shebang(stream)
print(shebang)  # ('python3', '-u')

# No shebang
script_content = b'print("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ()
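
Because `parse_shebang_from_file` only processes executable files, a script that has a shebang but no execute bit yields an empty tuple. The gate can be sketched roughly as follows, assuming POSIX permission semantics. `parse_if_executable` and `first_line` are hypothetical names used only for illustration, and this is not the library's source.

```python
import os
import tempfile
from typing import IO, Callable


def parse_if_executable(
    path: str,
    parse: Callable[[IO[bytes]], tuple[str, ...]],
) -> tuple[str, ...]:
    """Hypothetical wrapper: run `parse` only on existing, executable files."""
    if not os.path.lexists(path):
        raise ValueError(f"{path} does not exist.")
    if not os.access(path, os.X_OK):
        # Not executable: behave as if there were no shebang at all.
        return ()
    with open(path, "rb") as f:
        return parse(f)


def first_line(stream: IO[bytes]) -> tuple[str, ...]:
    """Stand-in parser for the demo: returns the raw first line."""
    return (stream.readline().decode("UTF-8").strip(),)


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "script.sh")
    with open(script, "wb") as f:
        f.write(b"#!/bin/bash\necho hi\n")
    os.chmod(script, 0o644)  # readable, but no execute bit
    print(parse_if_executable(script, first_line))  # () on POSIX systems
```

With the real library, `parse` would be `parse_shebang` from `identify.identify`.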

Shebang Format Handling

The shebang parser handles various common patterns:

Standard Format:

#!/bin/bash
#!/usr/bin/python3

Environment-based:

#!/usr/bin/env python3
#!/usr/bin/env -S python3 -u

Nix Shell:

#!/usr/bin/env nix-shell
#! nix-shell -i python3 -p python3

Quoted Arguments: The parser tries shlex-style parsing first and falls back to plain whitespace splitting if the quoting is malformed.
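
The patterns above can be approximated with the standard library alone. The following is a minimal sketch, not the library's source: `sketch_parse_shebang` and `split_shebang_line` are hypothetical names, the nix-shell case is deliberately left out, and the `/usr/bin/env` stripping is an assumption inferred from the outputs in the usage example.

```python
import io
import shlex
import string
from typing import IO

PRINTABLE = frozenset(string.printable)


def split_shebang_line(line: str) -> list[str]:
    """Try shlex first; fall back to whitespace splitting on bad quoting."""
    try:
        return shlex.split(line)
    except ValueError:  # raised by shlex for an unterminated quote
        return line.split()


def sketch_parse_shebang(stream: IO[bytes]) -> tuple[str, ...]:
    """Hypothetical approximation of the shebang patterns listed above."""
    if stream.read(2) != b"#!":
        return ()
    try:
        first_line = stream.readline().decode("UTF-8")
    except UnicodeDecodeError:
        return ()
    # Reject lines that contain anything outside printable ASCII.
    if any(ch not in PRINTABLE for ch in first_line):
        return ()
    cmd = tuple(split_shebang_line(first_line.strip()))
    # Drop the env wrapper so the interpreter comes first (assumed behavior).
    if cmd[:2] == ("/usr/bin/env", "-S"):
        cmd = cmd[2:]
    elif cmd[:1] == ("/usr/bin/env",):
        cmd = cmd[1:]
    return cmd


print(sketch_parse_shebang(io.BytesIO(b"#!/usr/bin/env -S python3 -u\n")))  # ('python3', '-u')
print(sketch_parse_shebang(io.BytesIO(b'#!/bin/sh "unterminated\n')))  # ('/bin/sh', '"unterminated')
print(sketch_parse_shebang(io.BytesIO(b'print("hello")\n')))  # ()
```

Trying `shlex` first means a quoted interpreter path that contains spaces stays a single token, while the whitespace fallback keeps a malformed line from raising an exception.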

Character Analysis Details

The text/binary detection checks every byte in a fixed window against a set of text bytes:

  • Text Characters: Control chars (7,8,9,10,11,12,13,27), printable ASCII (0x20-0x7E), extended ASCII (0x80-0xFF)
  • Analysis Window: First 1024 bytes of file content
  • Algorithm: Based on libmagic's encoding detection logic
  • Threshold: A single non-text byte within the window marks the content as binary

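Because only the first 1024 bytes are read, the cost of the check does not grow with file size, which keeps it cheap for large-scale file processing. It is a heuristic: a binary file whose first 1024 bytes happen to contain only text bytes is classified as text.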
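
The byte-set check described above can be sketched in a few lines. This is an illustrative approximation rather than the library's source: `looks_like_text` and `TEXT_BYTES` are hypothetical names, and the sketch assumes that DEL (0x7F) is not a text byte.

```python
import io
from typing import IO

# Byte values treated as text, following the list above. DEL (0x7F) is left
# out here on the assumption that the printable range stops at 0x7E.
TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7F))
    + bytes(range(0x80, 0x100))
)


def looks_like_text(stream: IO[bytes], window: int = 1024) -> bool:
    """Hypothetical helper: True if the first `window` bytes are all text bytes."""
    head = stream.read(window)
    # translate(None, delete) removes every text byte; anything left is non-text.
    leftovers = head.translate(None, TEXT_BYTES)
    return not leftovers


print(looks_like_text(io.BytesIO(b"print('hello world')\n")))  # True
print(looks_like_text(io.BytesIO(b"\x89PNG\r\n\x1a\n\x00\x00")))  # False
```

Deleting the allowed bytes with `bytes.translate` and testing whether anything remains avoids a Python-level loop over the window. An empty stream leaves nothing behind, so this sketch reports it as text.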

Install with Tessl CLI

npx tessl i tessl/pypi-identify
