tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis

License Identification

Name: tessl/pypi-identify
Author: tessl

Optional functionality for identifying software licenses in files using SPDX identifiers through text matching and edit distance algorithms. This feature requires the optional ukkonen dependency.

Installation

pip install "identify[license]"

Or install the dependency manually:

pip install ukkonen

Capabilities

License Detection

Identify SPDX license identifiers from license file content using exact text matching and fuzzy matching with edit distance algorithms.

def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.
    
    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches
    
    Args:
        filename (str): Path to license file to analyze
        
    Returns:
        str | None: SPDX license identifier or None if no license detected
        
    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """

Usage Example:

from identify.identify import license_id

# Detect common licenses (the result depends on each file's contents)
spdx = license_id('LICENSE')
print(spdx)  # e.g. 'MIT'

spdx = license_id('COPYING')
print(spdx)  # e.g. 'GPL-3.0-or-later'

spdx = license_id('LICENSE.txt')
print(spdx)  # e.g. 'Apache-2.0'

# No license detected
spdx = license_id('README.md')
print(spdx)  # None when the text matches no known license

# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print("Install with: pip install identify[license]")

Algorithm Details

License identification first normalizes the file's text, then compares it against each known license, trying an exact match before falling back to edit distance:

Text Normalization

Copyright Removal: Strips copyright notices using the regex pattern ^\s*(Copyright|\(C\)) .*$
Whitespace Normalization: Replaces all whitespace sequences with single spaces
Trimming: Removes leading/trailing whitespace
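
The three normalization steps above can be sketched with the standard re module. This is a minimal sketch rather than the library's source: the helper name normalize_license_text is hypothetical, and the case-insensitive, multi-line flags are an assumption about how the pattern is applied.

```python
import re

# Assumed flags: ignore case, and let ^/$ match at every line boundary.
COPYRIGHT_RE = re.compile(r'^\s*(Copyright|\(C\)) .*$', re.IGNORECASE | re.MULTILINE)
WHITESPACE_RE = re.compile(r'\s+')


def normalize_license_text(text: str) -> str:
    """Hypothetical helper mirroring the three steps listed above."""
    text = COPYRIGHT_RE.sub('', text)    # 1. drop copyright lines
    text = WHITESPACE_RE.sub(' ', text)  # 2. collapse whitespace runs to one space
    return text.strip()                  # 3. trim leading/trailing whitespace


sample = "Copyright (c) 2024 Jane Doe\n\nPermission is hereby   granted,\nfree of charge\n"
print(normalize_license_text(sample))
# Permission is hereby granted, free of charge
```

Because the copyright line is removed before comparison, two copies of the same license that differ only in the copyright holder normalize to the same string.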

Matching Process

Exact Match: Compares normalized text against known license database
Length Filtering: Skips edit distance for texts with >5% length difference
Edit Distance: Uses Ukkonen algorithm with 5% threshold for fuzzy matching
Best Match: Returns license with minimum edit distance under threshold
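
The matching steps above can be illustrated as follows. This is a sketch under stated assumptions, not the library's implementation: it substitutes a plain dynamic-programming Levenshtein distance for the Ukkonen algorithm so that it runs without the optional dependency, the names edit_distance and best_match are hypothetical, and the "Example-A"/"Example-B" entries are placeholder data rather than real licenses.

```python
from __future__ import annotations

import math


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the Ukkonen algorithm)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def best_match(text: str, known: tuple[tuple[str, str], ...]) -> str | None:
    """Hypothetical matcher: exact match first, then closest text within 5%."""
    if not text:
        return None
    cutoff = math.ceil(0.05 * len(text))
    best_id, best_distance = None, cutoff
    for spdx_id, license_text in known:
        if text == license_text:
            return spdx_id                             # exact match
        if abs(len(text) - len(license_text)) / len(text) > 0.05:
            continue                                   # length filter
        distance = edit_distance(text, license_text)
        if distance < best_distance:                   # must beat the cutoff and the best so far
            best_id, best_distance = spdx_id, distance
    return best_id


known = (("Example-A", "x" * 100), ("Example-B", "y" * 100))
print(best_match("x" * 98 + "zz", known))       # Example-A (2 edits)
print(best_match("x" * 90 + "z" * 10, known))   # None (10 edits is too many)
```

Both inputs are assumed to be already normalized. The length filter is cheap and rules out most candidates before the quadratic distance computation runs.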

License Database

The library ships a vendored database of open source license texts, stored as (SPDX ID, license text) pairs:

from identify.vendor.licenses import LICENSES

# License database structure
LICENSES: tuple[tuple[str, str], ...]

Example License Data:

# Sample license entries (SPDX_ID, license_text)
('MIT', 'Permission is hereby granted, free of charge...'),
('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
('GPL-3.0-or-later', 'This program is free software...'),
('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
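
Since the database is a flat tuple of pairs, ordinary tuple unpacking is enough to inspect it. The sketch below uses stand-in data of the same shape so that it runs without the library installed; SAMPLE_LICENSES, spdx_ids, and text_for are hypothetical names, and the license texts are truncated as in the sample entries above.

```python
from __future__ import annotations

# Stand-in data with the same (SPDX_ID, license_text) shape; texts are truncated.
SAMPLE_LICENSES: tuple[tuple[str, str], ...] = (
    ('MIT', 'Permission is hereby granted, free of charge...'),
    ('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
)


def spdx_ids(licenses: tuple[tuple[str, str], ...]) -> list[str]:
    """Return the SPDX identifiers in sorted order."""
    return sorted(spdx_id for spdx_id, _text in licenses)


def text_for(licenses: tuple[tuple[str, str], ...], spdx_id: str) -> str | None:
    """Look up the license text for one identifier, or None if absent."""
    return dict(licenses).get(spdx_id)


print(spdx_ids(SAMPLE_LICENSES))  # ['BSD-3-Clause', 'MIT']
```

With the library installed, the same helpers should accept LICENSES from identify.vendor.licenses in place of the stand-in tuple.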

Supported Licenses

The license database includes popular open source licenses with SPDX identifiers:

Permissive Licenses:

MIT, BSD-2-Clause, BSD-3-Clause
Apache-2.0, ISC, 0BSD

Copyleft Licenses:

GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0
AGPL-3.0, MPL-2.0

Creative Commons:

CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0

And many others covering most common open source license types.

Error Handling

license_id does not catch errors itself; it lets them propagate, so callers should handle the following exceptions:

from identify.identify import license_id

# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")

# File not found  
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")

# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")
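
The three cases above can be folded into one helper when a caller prefers a result-or-reason pair over exceptions. This is a minimal sketch: try_license_id is a hypothetical name, not part of the library, and the import is deliberately placed inside the try block so that a missing identify or ukkonen package is reported instead of raised.

```python
from __future__ import annotations


def try_license_id(path: str) -> tuple[str | None, str | None]:
    """Hypothetical wrapper: return (spdx_id, error_message)."""
    try:
        # Imported lazily so a missing package is reported, not raised.
        from identify.identify import license_id
        return license_id(path), None
    except ImportError as exc:
        return None, f"missing dependency: {exc}"
    except FileNotFoundError:
        return None, f"file not found: {path}"
    except UnicodeDecodeError:
        return None, f"not valid UTF-8 text: {path}"


spdx_id, error = try_license_id('nonexistent-file')
print(spdx_id, error)
```

A return value of (None, None) then means the file was read successfully but matched no known license, which keeps "no match" distinct from "could not check".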

Performance Considerations

File Size: Reads entire file into memory for analysis
Edit Distance: Computationally expensive, mitigated by length filtering
Caching: No built-in caching; consider caching results for repeated analysis
Encoding: Assumes UTF-8 encoding for all license files
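
Because nothing is cached, repeated scans of the same tree redo the full comparison each time. One possible mitigation is a small wrapper keyed on the file's path, modification time, and size. This is a sketch, not a library feature: make_cached_identifier is a hypothetical helper, and it takes the identification function as a parameter so that it can be tried without the optional dependency.

```python
from __future__ import annotations

import os
from typing import Callable


def make_cached_identifier(identify: Callable[[str], str | None]) -> Callable[[str], str | None]:
    """Hypothetical helper: wrap `identify` with a cache keyed on file state."""
    cache: dict[tuple[str, int, int], str | None] = {}

    def cached(path: str) -> str | None:
        stat = os.stat(path)  # raises FileNotFoundError for missing files
        # A changed mtime or size produces a new key, so edits invalidate the entry.
        key = (os.path.abspath(path), stat.st_mtime_ns, stat.st_size)
        if key not in cache:
            cache[key] = identify(path)
        return cache[key]

    return cached
```

Typical use would be cached_license_id = make_cached_identifier(license_id), then calling cached_license_id(path) wherever license_id(path) was called. The cache grows without bound, which is acceptable for a single scan but may need eviction in a long-running process.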

Integration Example

import os
from identify.identify import license_id, tags_from_filename

def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []

    for filename in sorted(os.listdir(directory)):
        # Check whether the filename suggests a license file
        lowered = filename.lower()
        if not any(keyword in lowered for keyword in ('license', 'copying')):
            continue
        filepath = os.path.join(directory, filename)
        if not os.path.isfile(filepath):
            continue  # skip directories whose names happen to match
        try:
            spdx_id = license_id(filepath)
        except (OSError, UnicodeDecodeError) as e:
            # Unreadable or non-UTF-8 file: report it and keep scanning
            print(f"Error analyzing {filename}: {e}")
            continue
        license_files.append({
            'file': filename,
            'spdx_id': spdx_id,
            'tags': tags_from_filename(filename),
        })

    return license_files
