# License Identification

Optional functionality for identifying the license contained in a file and returning its SPDX identifier, using exact text matching with an edit-distance fallback. This feature requires the optional `ukkonen` dependency.

## Installation

```bash
pip install "identify[license]"
```

Or install the dependency manually:

```bash
pip install ukkonen
```

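Because the dependency is optional, it can help to check for it before calling the license API. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of `identify`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if has_module("ukkonen"):
    print("ukkonen is available; license detection can be used")
else:
    print('ukkonen is missing; run: pip install "identify[license]"')
```

This only looks the module up and does not import it, so the check is cheap to run at startup.
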
## Capabilities

### License Detection

Returns the SPDX identifier for the contents of a license file, trying an exact text match first and falling back to fuzzy matching by edit distance.

```python { .api }
def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.

    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches

    Args:
        filename (str): Path to license file to analyze

    Returns:
        str | None: SPDX license identifier or None if no license detected

    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """
```

**Usage Example:**

```python
from identify.identify import license_id

# Detect common licenses (the result depends on each file's contents)
spdx = license_id('LICENSE')
print(spdx)  # e.g. MIT

spdx = license_id('COPYING')
print(spdx)  # e.g. GPL-3.0

spdx = license_id('LICENSE.txt')
print(spdx)  # e.g. Apache-2.0

# No license detected
spdx = license_id('README.md')
print(spdx)  # None

# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print('Install with: pip install "identify[license]"')
```

## Algorithm Details

License identification normalizes the input text and then compares it against each license in the database.

### Text Normalization

1. **Copyright Removal**: Removes copyright lines matching the regex pattern `^\s*(Copyright|\(C\)) .*$`
2. **Whitespace Normalization**: Replaces every run of whitespace with a single space
3. **Trimming**: Removes leading and trailing whitespace

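The three normalization steps above can be sketched as follows. This is an illustration rather than the library's own code: the function name `normalize_license_text` is hypothetical, and the regex flags (case-insensitive, multiline) are an assumption, since the text above gives only the pattern.

```python
import re

# Pattern taken from the list above. The flags are an assumption:
# IGNORECASE so "copyright" and "(c)" also match, MULTILINE so that
# ^ and $ anchor at every line rather than only at the whole text.
COPYRIGHT_RE = re.compile(
    r'^\s*(Copyright|\(C\)) .*$', re.IGNORECASE | re.MULTILINE,
)
WHITESPACE_RE = re.compile(r'\s+')


def normalize_license_text(text: str) -> str:
    """Apply the three normalization steps in order."""
    text = COPYRIGHT_RE.sub('', text)     # 1. drop copyright lines
    text = WHITESPACE_RE.sub(' ', text)   # 2. collapse whitespace runs
    return text.strip()                   # 3. trim both ends


sample = (
    "MIT License\n\n"
    "Copyright (c) 2024 Jane Doe\n\n"
    "Permission is hereby granted,\n  free of charge"
)
print(normalize_license_text(sample))
# MIT License Permission is hereby granted, free of charge
```

Removing the copyright line before comparison means two copies of the same license with different holders or years normalize to identical text.
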
### Matching Process

1. **Exact Match**: Compares the normalized text against each license in the database
2. **Length Filtering**: Skips the edit distance calculation for licenses whose length differs from the input by more than 5%
3. **Edit Distance**: Computes the edit distance with the `ukkonen` library, using a cutoff of 5% of the normalized input length
4. **Best Match**: Returns the license with the smallest edit distance below the cutoff, or `None` if there is none

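The matching steps above can be sketched without the optional dependency. This example substitutes a plain dynamic-programming Levenshtein distance for `ukkonen`, so it is slower and not the library's implementation; the names `edit_distance` and `closest_license` are hypothetical, and the inputs are assumed to be already normalized.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def closest_license(norm, candidates, threshold=0.05):
    """Return the SPDX ID of the best match for `norm`, or None.

    `candidates` is an iterable of (spdx_id, normalized_text) pairs.
    """
    if not norm:
        return None
    cutoff = math.ceil(threshold * len(norm))
    best_id, best_dist = None, cutoff
    for spdx_id, candidate in candidates:
        if norm == candidate:                 # exact match wins immediately
            return spdx_id
        # Length filter: skip the expensive distance calculation
        if abs(len(norm) - len(candidate)) / len(norm) > threshold:
            continue
        dist = edit_distance(norm, candidate)
        if dist < best_dist:                  # must be under the cutoff
            best_id, best_dist = spdx_id, dist
    return best_id


known = [("AAA", "a" * 100), ("BBB", "b" * 100)]
print(closest_license("a" * 98 + "xx", known))   # AAA
print(closest_license("a" * 90 + "x" * 10, known))  # None
```

The length filter is a cheap necessary condition: two strings whose lengths differ by more than the cutoff cannot be within the cutoff in edit distance, so skipping them does not change the result.
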
### License Database

The library vendors a database of open source license texts, exposed as a tuple of `(SPDX ID, license text)` pairs:

```python { .api }
from identify.vendor.licenses import LICENSES

# License database structure
LICENSES: tuple[tuple[str, str], ...]
```

**Example License Data:**

```python
# Illustrative entry shape: (SPDX_ID, license_text).
# The texts here are abbreviated placeholders, not the verbatim entries.
LICENSES = (
    ('MIT', 'Permission is hereby granted, free of charge...'),
    ('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
    ('GPL-3.0', 'This program is free software...'),
    ('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
)
```

## Supported Licenses

The license database includes popular open source licenses with SPDX identifiers:

**Permissive Licenses:**
- MIT, BSD-2-Clause, BSD-3-Clause
- Apache-2.0, ISC, 0BSD

**Copyleft Licenses:**
- GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0
- AGPL-3.0, MPL-2.0

**Creative Commons:**
- CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0

**And many others** covering most common open source license types.

## Error Handling

`license_id` raises the following exceptions, which callers can handle individually:

```python
from identify.identify import license_id

# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")

# File not found
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")

# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")
```

## Performance Considerations

- **File Size**: Reads the entire file into memory for analysis
- **Edit Distance**: Computationally expensive, mitigated by length filtering
- **Caching**: No built-in caching; consider caching results for repeated analysis
- **Encoding**: Assumes UTF-8 encoding for all license files

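Since there is no built-in caching, one option is a small wrapper that remembers results per file. The sketch below is an assumption about how that could look, not something `identify` provides: `make_cached_detector` is a hypothetical helper that wraps any function with the same shape as `license_id` and recomputes only when the file's modification time or size changes.

```python
import os
from typing import Callable, Dict, Optional, Tuple

Detector = Callable[[str], Optional[str]]


def make_cached_detector(detect: Detector) -> Detector:
    """Wrap `detect` so unchanged files are only analyzed once."""
    # absolute path -> ((mtime_ns, size), detected SPDX ID or None)
    cache: Dict[str, Tuple[Tuple[int, int], Optional[str]]] = {}

    def cached(path: str) -> Optional[str]:
        stat = os.stat(path)  # raises FileNotFoundError for missing files
        fingerprint = (stat.st_mtime_ns, stat.st_size)
        key = os.path.abspath(path)
        hit = cache.get(key)
        if hit is not None and hit[0] == fingerprint:
            return hit[1]
        result = detect(path)
        cache[key] = (fingerprint, result)
        return result

    return cached
```

With the optional dependency installed, this could be used as `cached_license_id = make_cached_detector(license_id)`. A `None` result is cached too, so files without a recognized license are not re-scanned either.
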
## Integration Example

```python
import os
from identify.identify import license_id, tags_from_filename

def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []

    for filename in os.listdir(directory):
        # Check if filename suggests a license file
        tags = tags_from_filename(filename)
        if any(keyword in filename.lower() for keyword in ['license', 'copying']):
            filepath = os.path.join(directory, filename)
            try:
                spdx_id = license_id(filepath)
                license_files.append({
                    'file': filename,
                    'spdx_id': spdx_id,
                    'tags': tags
                })
            except Exception as e:
                print(f"Error analyzing {filename}: {e}")

    return license_files
```