
tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis


License Identification

Optional functionality for identifying software licenses in files using SPDX identifiers through text matching and edit distance algorithms. This feature requires the optional ukkonen dependency.

Installation

pip install identify[license]

Or install the dependency manually:

pip install ukkonen

Capabilities

License Detection

Identify SPDX license identifiers from license file content using exact text matching and fuzzy matching with edit distance algorithms.

def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.
    
    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches
    
    Args:
        filename (str): Path to license file to analyze
        
    Returns:
        str | None: SPDX license identifier or None if no license detected
        
    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """

Usage Example:

from identify.identify import license_id

# Detect common licenses
# (the printed values assume each file contains the corresponding license text)
spdx = license_id('LICENSE')
print(spdx)  # e.g. 'MIT'

spdx = license_id('COPYING')
print(spdx)  # e.g. 'GPL-3.0'

spdx = license_id('LICENSE.txt')
print(spdx)  # e.g. 'Apache-2.0'

# No license detected
spdx = license_id('README.md')
print(spdx)  # None

# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print("Install with: pip install identify[license]")

Algorithm Details

License identification first normalizes the file text, then matches it against the known licenses in two stages:

Text Normalization

  1. Copyright Removal: Removes lines that start with "Copyright" or "(C)", matched line by line with the regex pattern ^\s*(Copyright|\(C\)) .*$
  2. Whitespace Normalization: Replaces all whitespace sequences with single spaces
  3. Trimming: Removes leading/trailing whitespace
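
The three normalization steps above can be sketched as follows. This is a minimal illustration, not necessarily the library's exact code: the helper name normalize_license_text is hypothetical, and the regex flags (case-insensitive, multiline) are an assumption made so that the quoted pattern applies to each line.

```python
import re

# Pattern quoted in the list above; the flags are an assumption of this sketch.
COPYRIGHT_RE = re.compile(r'^\s*(Copyright|\(C\)) .*$', re.IGNORECASE | re.MULTILINE)
WHITESPACE_RE = re.compile(r'\s+')


def normalize_license_text(text: str) -> str:
    """Hypothetical helper mirroring the three normalization steps."""
    text = COPYRIGHT_RE.sub('', text)    # 1. drop copyright lines
    text = WHITESPACE_RE.sub(' ', text)  # 2. collapse whitespace runs to one space
    return text.strip()                  # 3. trim leading/trailing whitespace


sample = (
    "MIT License\n\n"
    "Copyright (c) 2024 Jane Doe\n\n"
    "Permission is hereby   granted,\nfree of charge"
)
print(normalize_license_text(sample))
# MIT License Permission is hereby granted, free of charge
```

Because the copyright line and all layout differences are removed, two copies of the same license with different holders or line wrapping normalize to the same string.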

Matching Process

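  1. Exact Match: Compares the normalized text against each normalized license text in the database and returns the SPDX ID as soon as one is equal
  2. Length Filtering: Skips the edit distance calculation for any candidate whose length differs from the input's by more than 5%
  3. Edit Distance: Computes the edit distance with the ukkonen library, bounded by a threshold of 5% of the normalized input length
  4. Best Match: Returns the license with the smallest edit distance below the threshold, or None if no candidate qualifies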
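
The matching steps above can be sketched in plain Python. To stay dependency-free, this sketch swaps in a simple dynamic-programming Levenshtein distance for the ukkonen library. The function names, the toy license entries, and the rounding of the 5% threshold are assumptions for illustration only.

```python
import math
from typing import Optional, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (stand-in for ukkonen)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_license(norm: str, known: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the best SPDX ID for already-normalized text, or None."""
    if not norm:
        return None
    cutoff = math.ceil(0.05 * len(norm))  # 5% edit budget (rounding is assumed)
    best_id, best_dist = None, cutoff
    for spdx, text in known:
        if norm == text:                                   # 1. exact match
            return spdx
        if abs(len(norm) - len(text)) / len(norm) > 0.05:  # 2. length filter
            continue
        dist = edit_distance(norm, text)                   # 3. edit distance
        if dist < best_dist:                               # 4. keep the closest
            best_id, best_dist = spdx, dist
    return best_id


# Toy entries, not real license texts.
known = [
    ("Demo-A", "permission is hereby granted to use this software for any purpose"),
    ("Demo-B", "redistribution of this software is permitted only with attribution"),
]
print(match_license(known[0][1], known))        # Demo-A (exact match)
print(match_license(known[0][1] + "s", known))  # Demo-A (one edit away)
print(match_license("hello world", known))      # None
```

The length filter matters because the edit distance between two strings is at least the difference of their lengths, so a candidate that fails the filter could not pass the threshold anyway.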

License Database

The library ships a vendored database of open source license texts, stored as (SPDX ID, license text) pairs:

from identify.vendor.licenses import LICENSES

# License database structure
LICENSES: tuple[tuple[str, str], ...]

Example License Data:

# Sample license entries (SPDX_ID, license_text); texts are abbreviated for illustration
('MIT', 'Permission is hereby granted, free of charge...'),
('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
('GPL-3.0', 'This program is free software...'),
('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
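
Since the database is a tuple of pairs, it converts directly into a dictionary for lookups by SPDX ID. The sketch below uses placeholder entries so it runs without the library installed. The helper name build_license_index and the Demo-* identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def build_license_index(licenses: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each SPDX ID to its license text."""
    return {spdx: text for spdx, text in licenses}


# Placeholder data with the same shape as LICENSES.
sample_licenses = (
    ('Demo-1.0', 'Placeholder text for a first license.'),
    ('Demo-2.0', 'Placeholder text for a second license.'),
)

index = build_license_index(sample_licenses)
print(sorted(index))  # ['Demo-1.0', 'Demo-2.0']
```

With the library installed, passing the imported LICENSES tuple instead of sample_licenses should list every identifier the database covers.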

Supported Licenses

The license database includes popular open source licenses with SPDX identifiers:

Permissive Licenses:

  • MIT, BSD-2-Clause, BSD-3-Clause
  • Apache-2.0, ISC, 0BSD

Copyleft Licenses:

  • GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0
  • AGPL-3.0, MPL-2.0

Creative Commons:

  • CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0

The groups above are a selection; the complete set of supported identifiers is whatever the LICENSES tuple contains.

Error Handling

license_id does not catch these errors itself; callers should handle the exceptions it can raise:

from identify.identify import license_id

# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")

# File not found  
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")

# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")

Performance Considerations

  • File Size: Reads entire file into memory for analysis
  • Edit Distance: Computationally expensive, mitigated by length filtering
  • Caching: No built-in caching; consider caching results for repeated analysis
  • Encoding: Assumes UTF-8 encoding for all license files
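
One way to add the caching suggested above is to key results on the file's path, modification time, and size, so that a changed file is analyzed again. This is a sketch under assumptions: make_cached_detector is a hypothetical helper, and the detector is passed in as a parameter so the example runs without the library.

```python
import os
from functools import lru_cache
from typing import Callable, Optional


def make_cached_detector(
    detect: Callable[[str], Optional[str]],
) -> Callable[[str], Optional[str]]:
    """Wrap a license detector so unchanged files are analyzed only once."""

    @lru_cache(maxsize=256)
    def _cached(path: str, mtime_ns: int, size: int) -> Optional[str]:
        # mtime_ns and size only serve as part of the cache key.
        return detect(path)

    def cached_detect(path: str) -> Optional[str]:
        stat = os.stat(path)  # raises FileNotFoundError for missing files
        return _cached(os.path.abspath(path), stat.st_mtime_ns, stat.st_size)

    return cached_detect
```

With the library installed, usage would look like cached_license_id = make_cached_detector(license_id). Exceptions raised by the detector are not cached, so a failed call is retried on the next request.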

Integration Example

import os
from identify.identify import license_id, tags_from_filename

# Substrings that suggest a file holds license text
LICENSE_NAME_HINTS = ('license', 'copying')

def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        # Skip directories and anything whose name does not suggest a license
        if not os.path.isfile(filepath):
            continue
        if not any(hint in filename.lower() for hint in LICENSE_NAME_HINTS):
            continue
        try:
            spdx_id = license_id(filepath)
        except (OSError, UnicodeDecodeError) as e:
            # Unreadable or non-UTF-8 file: report it and move on
            print(f"Error analyzing {filename}: {e}")
            continue
        license_files.append({
            'file': filename,
            'spdx_id': spdx_id,  # None if no known license matched
            'tags': tags_from_filename(filename),
        })

    return license_files
