File identification library for Python that determines file types based on paths, extensions, and content analysis
Optional functionality for identifying software licenses in files using SPDX identifiers through text matching and edit distance algorithms. This feature requires the optional ukkonen dependency.
pip install identify[license]
Or install the dependency manually:
pip install ukkonen
Identify SPDX license identifiers from license file content using exact text matching and fuzzy matching with edit distance algorithms.
def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.

    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches

    Args:
        filename (str): Path to license file to analyze

    Returns:
        str | None: SPDX license identifier or None if no license detected

    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """
Usage Example:
from identify.identify import license_id
# Detect common licenses
spdx = license_id('LICENSE')
print(spdx) # 'MIT'
spdx = license_id('COPYING')
print(spdx) # 'GPL-3.0-or-later'
spdx = license_id('LICENSE.txt')
print(spdx) # 'Apache-2.0'
# No license detected
spdx = license_id('README.md')
print(spdx) # None
# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print("Install with: pip install identify[license]")
The license identification process normalizes the file's text before matching. During normalization, copyright lines are removed with this pattern:
^\s*(Copyright|\(C\)) .*$
The library bundles a database of open source license texts in identify.vendor.licenses:
from identify.vendor.licenses import LICENSES
# License database structure
LICENSES: tuple[tuple[str, str], ...]
Example License Data:
# Sample license entries (SPDX_ID, license_text)
('MIT', 'Permission is hereby granted, free of charge...'),
('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
('GPL-3.0-or-later', 'This program is free software...'),
('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
The license database includes popular open source licenses with SPDX identifiers:
Permissive Licenses:
Copyleft Licenses:
Creative Commons:
And many others covering most common open source license types.
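The two-phase matching described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `normalize`, `levenshtein`, `match_license`, and the two-entry `SAMPLE_LICENSES` tuple are hypothetical names introduced here, the license texts are shortened placeholders, the regex flags are assumed, and a simple dynamic-programming edit distance stands in for the ukkonen dependency so the sketch runs without extra packages.

```python
import math
import re

# Copyright-line pattern quoted earlier; the flags are an assumption of this sketch.
COPYRIGHT_RE = re.compile(r'^\s*(Copyright|\(C\)) .*$', re.IGNORECASE | re.MULTILINE)
WHITESPACE_RE = re.compile(r'\s+')

# Hypothetical stand-in for the (SPDX_ID, license_text) database, with shortened texts.
SAMPLE_LICENSES = (
    ('MIT',
     'Permission is hereby granted, free of charge, to any person '
     'obtaining a copy of this software.'),
    ('BSD-3-Clause',
     'Redistribution and use in source and binary forms, with or '
     'without modification, are permitted.'),
)


def normalize(text):
    """Drop copyright lines, then collapse every whitespace run to one space."""
    text = COPYRIGHT_RE.sub('', text)
    return WHITESPACE_RE.sub(' ', text).strip()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (stand-in for ukkonen)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_license(text, licenses=SAMPLE_LICENSES, threshold=0.05):
    """Return the best-matching SPDX ID, or None if nothing is close enough."""
    norm = normalize(text)
    if not norm:
        return None
    # Allowed number of edits scales with the length of the normalized input.
    cutoff = math.ceil(threshold * len(norm))
    best_id, best_distance = None, cutoff
    for spdx_id, license_text in licenses:
        candidate = normalize(license_text)
        if norm == candidate:          # phase 1: exact match
            return spdx_id
        distance = levenshtein(norm, candidate)
        if distance < best_distance:   # phase 2: closest match under the cutoff
            best_id, best_distance = spdx_id, distance
    return best_id
```

In this sketch, a license text with a prepended copyright line or a small typo still resolves to the same ID, while unrelated text falls outside the 5% cutoff and yields None. The quadratic edit distance is fine for illustration but slow on full-length license texts, which is presumably why the library relies on a dedicated dependency.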
The function handles various error conditions:
from identify.identify import license_id
# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")
# File not found
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")
# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")

import os
from identify.identify import license_id, tags_from_filename
def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []
    for filename in os.listdir(directory):
        # Only consider entries whose name suggests a license file
        if not any(keyword in filename.lower() for keyword in ('license', 'copying')):
            continue
        filepath = os.path.join(directory, filename)
        try:
            spdx_id = license_id(filepath)
        except (OSError, UnicodeDecodeError) as e:
            # Unreadable or non-UTF-8 entries are reported and skipped;
            # a missing ukkonen dependency (ImportError) is left to propagate
            print(f"Error analyzing {filename}: {e}")
            continue
        license_files.append({
            'file': filename,
            'spdx_id': spdx_id,
            'tags': tags_from_filename(filename),
        })
    return license_files