Tessl Tile for pypi/identify@2.6.0

# License Identification

Optional functionality for identifying the software license contained in a file. The result is an SPDX identifier, found by exact text matching with an edit-distance fallback. This feature requires the optional `ukkonen` dependency.

## Installation

```bash
pip install 'identify[license]'
```

Or install the dependency manually:

```bash
pip install ukkonen
```

## Capabilities

### License Detection

Identify SPDX license identifiers from license file content using exact text matching and fuzzy matching with edit distance algorithms.

```python { .api }
def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.

    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches

    Args:
        filename (str): Path to license file to analyze

    Returns:
        str | None: SPDX license identifier or None if no license detected

    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """
```

**Usage Example:**

```python
from identify.identify import license_id

# Detect common licenses (the result depends on each file's contents)
spdx = license_id('LICENSE')
print(spdx)  # e.g. 'MIT'

spdx = license_id('COPYING')
print(spdx)  # e.g. 'GPL-3.0'

spdx = license_id('LICENSE.txt')
print(spdx)  # e.g. 'Apache-2.0'

# No license detected
spdx = license_id('README.md')
print(spdx)  # None

# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print("Install with: pip install 'identify[license]'")
```

## Algorithm Details

License identification works in two stages: the file's text is normalized, then matched against the license database.

### Text Normalization

1. **Copyright Removal**: Strips copyright notices using the regex pattern `^\s*(Copyright|\(C\)) .*$`
2. **Whitespace Normalization**: Replaces every run of whitespace with a single space
3. **Trimming**: Removes leading and trailing whitespace
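The three normalization steps can be sketched in a few lines. This is a minimal illustration, not the library's own code: the helper name `normalize_license_text` is hypothetical, and the case-insensitive, multi-line regex flags are an assumption, since the list above gives only the pattern.

```python
import re

# Pattern quoted in the list above; the flags are an assumption.
COPYRIGHT_RE = re.compile(
    r'^\s*(Copyright|\(C\)) .*$', re.IGNORECASE | re.MULTILINE,
)
WHITESPACE_RE = re.compile(r'\s+')


def normalize_license_text(text: str) -> str:
    """Hypothetical helper mirroring the three steps described above."""
    text = COPYRIGHT_RE.sub('', text)    # 1. drop copyright lines
    text = WHITESPACE_RE.sub(' ', text)  # 2. collapse whitespace runs
    return text.strip()                  # 3. trim both ends


sample = (
    "Copyright (c) 2024 Jane Doe\n\n"
    "Permission is hereby\n   granted, free of charge"
)
print(normalize_license_text(sample))
# Permission is hereby granted, free of charge
```

Because the copyright line is removed before comparison, two copies of the same license with different copyright holders normalize to the same string.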

### Matching Process

1. **Exact Match**: Compares the normalized text against each license text in the database
2. **Length Filtering**: Skips the edit-distance computation for candidates whose length differs by more than 5%
3. **Edit Distance**: Uses the Ukkonen algorithm with a 5% threshold for fuzzy matching
4. **Best Match**: Returns the license with the smallest edit distance under the threshold, or `None` if no candidate qualifies
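The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: a plain dynamic-programming Levenshtein distance stands in for `ukkonen` so the example has no dependencies, the names `levenshtein` and `best_match` are hypothetical, both inputs are assumed to be already normalized, and measuring the 5% threshold against the length of the input text is an assumption.

```python
from __future__ import annotations

import math


def levenshtein(a: str, b: str) -> int:
    """Plain DP edit distance (a stand-in for ukkonen, not its API)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def best_match(norm: str, candidates: list[tuple[str, str]]) -> str | None:
    """Return the ID of the closest candidate within a 5% edit budget."""
    if not norm:
        return None
    cutoff = math.ceil(0.05 * len(norm))
    best_id, best_dist = None, cutoff
    for spdx_id, text in candidates:
        if norm == text:                   # 1. exact match
            return spdx_id
        if abs(len(norm) - len(text)) / len(norm) > 0.05:
            continue                       # 2. length filter
        dist = levenshtein(norm, text)     # 3. edit distance
        if dist < best_dist:               # 4. keep the best under the cutoff
            best_id, best_dist = spdx_id, dist
    return best_id


# Placeholder "license texts" of 100 characters each.
candidates = [('A', 'x' * 100), ('B', 'y' * 100)]
print(best_match('x' * 98 + 'zz', candidates))  # A
print(best_match('x' * 120, candidates))        # None
```

The length filter matters because the edit distance between two strings is at least the difference of their lengths, so a candidate that fails the filter could not pass the threshold anyway.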
86

87
### License Database
88

89
The library includes a comprehensive database of open source licenses:
90

91
```python { .api }
92
from identify.vendor.licenses import LICENSES
93

94
# License database structure
95
LICENSES: tuple[tuple[str, str], ...]
96
```
97

98
**Example License Data:**
99

100
```python
101
# Sample license entries (SPDX_ID, license_text)
102
('MIT', 'Permission is hereby granted, free of charge...'),
103
('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
104
('GPL-3.0-or-later', 'This program is free software...'),
105
('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
106
```
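Because the database is a tuple of `(SPDX ID, text)` pairs, it can be turned into a lookup table directly. The sketch below uses a small stand-in tuple with placeholder texts rather than the real data, so it runs without the package; the helper name `index_by_spdx_id` is hypothetical, and it assumes each ID appears only once.

```python
from __future__ import annotations


def index_by_spdx_id(licenses: tuple[tuple[str, str], ...]) -> dict[str, str]:
    """Build an {SPDX ID: license text} mapping from (id, text) pairs."""
    return dict(licenses)


# Stand-in data with placeholder texts, not the real database contents.
SAMPLE_LICENSES = (
    ('MIT', 'placeholder MIT text'),
    ('ISC', 'placeholder ISC text'),
)

by_id = index_by_spdx_id(SAMPLE_LICENSES)
print(sorted(by_id))  # ['ISC', 'MIT']
print('MIT' in by_id)  # True
```

With `identify` installed, passing the imported `LICENSES` tuple to the same helper should work unchanged, given the structure declared above.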
107

## Supported Licenses

The license database includes popular open source licenses with SPDX identifiers:

**Permissive Licenses:**
- MIT, BSD-2-Clause, BSD-3-Clause
- Apache-2.0, ISC, 0BSD

**Copyleft Licenses:**
- GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0
- AGPL-3.0, MPL-2.0

**Creative Commons:**
- CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0

**And many others** covering most common open source license types.

## Error Handling

`license_id` does not catch these errors itself; callers should handle the following exceptions:

```python
from identify.identify import license_id

# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")

# File not found
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")

# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")
```

## Performance Considerations

- **File Size**: Reads the entire file into memory for analysis
- **Edit Distance**: Computationally expensive; length filtering avoids it for candidates that cannot match
- **Caching**: No built-in caching; consider caching results for repeated analysis
- **Encoding**: Assumes UTF-8 encoding for all license files
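Since there is no built-in cache, one option is to wrap the identification function yourself. The sketch below is one possible approach, not part of the library: the wrapper name `cached_by_mtime` is hypothetical, and it assumes that a file's modification time and size are a good enough signal that its contents are unchanged. The function to wrap is passed in as a parameter, so the example does not depend on the package being installed.

```python
import functools
import os
from typing import Callable, Optional

Identifier = Callable[[str], Optional[str]]


def cached_by_mtime(identify_fn: Identifier) -> Identifier:
    """Wrap identify_fn so an unchanged file is analyzed only once."""

    @functools.lru_cache(maxsize=256)
    def _lookup(path: str, mtime_ns: int, size: int) -> Optional[str]:
        # mtime_ns and size only serve as part of the cache key: a modified
        # file yields a new key and therefore a fresh analysis.
        return identify_fn(path)

    def wrapper(path: str) -> Optional[str]:
        stat = os.stat(path)
        return _lookup(os.path.abspath(path), stat.st_mtime_ns, stat.st_size)

    return wrapper
```

With the package installed, the intended use would be `cached_license_id = cached_by_mtime(license_id)`, then calling `cached_license_id(path)` in place of `license_id(path)`. Exceptions raised by the wrapped function are not cached, so a failed call is retried on the next request.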
157

## Integration Example

```python
import os
from identify.identify import license_id, tags_from_filename

def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        # Only consider regular files whose name suggests a license
        if not os.path.isfile(filepath):
            continue
        if not any(word in filename.lower() for word in ('license', 'copying')):
            continue

        try:
            spdx_id = license_id(filepath)
        except (ImportError, OSError, UnicodeDecodeError) as e:
            print(f"Error analyzing {filename}: {e}")
            continue

        license_files.append({
            'file': filename,
            'spdx_id': spdx_id,  # None if no license matched
            'tags': tags_from_filename(filename),
        })

    return license_files
```
