# License Identification

Optional functionality for identifying the software license contained in a file. Licenses are reported as SPDX identifiers and are matched by exact text comparison, with an edit-distance fallback for near matches. This feature requires the optional `ukkonen` dependency.

## Installation

```bash
# Quoting keeps shells such as zsh from expanding the brackets
pip install "identify[license]"
```

Or install the dependency manually:

```bash
pip install ukkonen
```

## Capabilities

### License Detection

Identifies the SPDX license identifier for a license file, first by exact text matching and then by fuzzy matching on edit distance.

```python { .api }
def license_id(filename: str) -> str | None:
    """
    Return the SPDX ID for the license contained in filename.

    Uses a two-phase approach:
    1. Exact text match after normalization (copyright removal, whitespace)
    2. Edit distance matching with 5% threshold for fuzzy matches

    Args:
        filename (str): Path to license file to analyze

    Returns:
        str | None: SPDX license identifier or None if no license detected

    Raises:
        ImportError: If ukkonen dependency is not installed
        UnicodeDecodeError: If file cannot be decoded as UTF-8
        FileNotFoundError: If filename does not exist
    """
```

**Usage Example:**

```python
from identify.identify import license_id

# Detect common licenses (the result depends on each file's contents)
spdx = license_id('LICENSE')
print(spdx)  # e.g. 'MIT'

spdx = license_id('COPYING')
print(spdx)  # e.g. 'GPL-3.0'

spdx = license_id('LICENSE.txt')
print(spdx)  # e.g. 'Apache-2.0'

# No license detected
spdx = license_id('README.md')
print(spdx)  # None

# Handle missing dependency
try:
    spdx = license_id('LICENSE')
except ImportError:
    print("Install with: pip install identify[license]")
```

## Algorithm Details

License identification normalizes the file's text and then compares it against the bundled license texts, first exactly and then by edit distance:

### Text Normalization

1. **Copyright Removal**: Strips copyright lines matching the regex pattern `^\s*(Copyright|\(C\)) .*$`
2. **Whitespace Normalization**: Replaces every run of whitespace with a single space
3. **Trimming**: Removes leading and trailing whitespace
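
The three normalization steps can be sketched as follows. This is a minimal illustration rather than the library's own code: `normalize_license_text` is a hypothetical helper, and the `re.MULTILINE` flag is an assumption made so that the copyright pattern applies line by line.

```python
import re

# Pattern quoted in the list above; re.MULTILINE is an assumption of this sketch.
COPYRIGHT_RE = re.compile(r'^\s*(Copyright|\(C\)) .*$', re.MULTILINE)
WHITESPACE_RE = re.compile(r'\s+')


def normalize_license_text(text: str) -> str:
    """Hypothetical helper mirroring the three normalization steps."""
    text = COPYRIGHT_RE.sub('', text)    # 1. drop copyright lines
    text = WHITESPACE_RE.sub(' ', text)  # 2. collapse whitespace runs
    return text.strip()                  # 3. trim both ends


sample = "MIT License\n\nCopyright (c) 2024 Jane Doe\n\nPermission is   hereby granted\n"
print(normalize_license_text(sample))  # MIT License Permission is hereby granted
```

Because the copyright line and the layout differences are removed before comparison, two copies of the same license with different holders or line wrapping normalize to the same string.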

### Matching Process

1. **Exact Match**: Compares the normalized text against each normalized license text in the database
2. **Length Filtering**: Skips the edit-distance calculation for any license whose length differs from the input by more than 5%
3. **Edit Distance**: Computes the edit distance with the Ukkonen algorithm, using a 5% threshold for fuzzy matching
4. **Best Match**: Returns the license with the smallest edit distance under the threshold
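
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `match_license` and `edit_distance` are hypothetical names, a simple dynamic-programming Levenshtein distance stands in for the `ukkonen` package, and the 5% cutoff is assumed to be measured against the length of the normalized input.

```python
from __future__ import annotations

import math


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (a stand-in for ukkonen in this sketch)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_license(norm: str, known: list[tuple[str, str]]) -> str | None:
    """Hypothetical matcher; all texts are assumed to be normalized already."""
    if not norm:
        return None
    cutoff = math.ceil(0.05 * len(norm))  # assumed: 5% of the input length
    best_id, best_dist = None, cutoff
    for spdx_id, text in known:
        if norm == text:                                   # 1. exact match
            return spdx_id
        if abs(len(norm) - len(text)) / len(norm) > 0.05:  # 2. length filter
            continue
        dist = edit_distance(norm, text)                   # 3. edit distance
        if dist < best_dist:                               # 4. closest under the cutoff
            best_id, best_dist = spdx_id, dist
    return best_id


TEXT = "lorem ipsum " * 10                   # hypothetical license text
known = [("Example-1.0", TEXT)]              # hypothetical identifier
typo = TEXT.replace("lorem", "lorum", 1)     # one substituted character
print(match_license(typo, known))            # Example-1.0
print(match_license("something else entirely", known))  # None
```

The length filter is what keeps the expensive step affordable: a candidate whose length is more than 5% off cannot fall under a 5% edit-distance cutoff, so it is skipped without computing the distance.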

### License Database

The library bundles a database of open source license texts, each paired with its SPDX identifier:

```python { .api }
from identify.vendor.licenses import LICENSES

# License database structure
LICENSES: tuple[tuple[str, str], ...]
```

**Example License Data:**

```python
# Illustrative entries (SPDX_ID, license_text); the texts are abbreviated excerpts
('MIT', 'Permission is hereby granted, free of charge...'),
('Apache-2.0', 'Licensed under the Apache License, Version 2.0...'),
('GPL-3.0', 'This program is free software...'),
('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
```
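
Because the database is a tuple of `(SPDX_ID, license_text)` pairs, it converts directly into a lookup table. The sketch below uses a hypothetical two-entry stand-in, `SAMPLE_LICENSES`, with the same shape, so it runs without the library installed; the entries and their abbreviated texts are illustrative only.

```python
from __future__ import annotations

# Hypothetical stand-in with the same (spdx_id, text) shape as LICENSES.
SAMPLE_LICENSES: tuple[tuple[str, str], ...] = (
    ('MIT', 'Permission is hereby granted, free of charge...'),
    ('BSD-3-Clause', 'Redistribution and use in source and binary forms...'),
)

# A tuple of pairs converts directly into a dict keyed by SPDX ID.
text_by_id = dict(SAMPLE_LICENSES)

print(sorted(text_by_id))          # ['BSD-3-Clause', 'MIT']
print(text_by_id['MIT'][:22])      # Permission is hereby g
```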

## Supported Licenses

The license database includes popular open source licenses with SPDX identifiers:

**Permissive Licenses:**

- MIT, BSD-2-Clause, BSD-3-Clause
- Apache-2.0, ISC, 0BSD

**Copyleft Licenses:**

- GPL-2.0, GPL-3.0, LGPL-2.1, LGPL-3.0
- AGPL-3.0, MPL-2.0

**Creative Commons:**

- CC0-1.0, CC-BY-4.0, CC-BY-SA-4.0

**And many others** covering most common open source license types.

## Error Handling

`license_id` raises the following exceptions, which callers should handle:

```python
from identify.identify import license_id

# Missing dependency
try:
    result = license_id('LICENSE')
except ImportError as e:
    print(f"Install ukkonen: {e}")

# File not found
try:
    result = license_id('nonexistent-file')
except FileNotFoundError:
    print("License file not found")

# Encoding issues
try:
    result = license_id('binary-file')
except UnicodeDecodeError:
    print("File is not valid UTF-8 text")
```

## Performance Considerations

- **File Size**: Reads the entire file into memory for analysis
- **Edit Distance**: Computationally expensive; length filtering reduces how often it runs
- **Caching**: No built-in caching; consider caching results for repeated analysis
- **Encoding**: Assumes UTF-8 encoding for all license files
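
One way to add the caching suggested above is to wrap the identification function in a small memoizing layer. The sketch below is an assumption-laden illustration: `make_cached_identifier` is a hypothetical helper, not part of the library, and it keys the cache on the absolute path plus the file's modification time and size so that edited files are analyzed again.

```python
import os
from functools import lru_cache
from typing import Callable, Optional


def make_cached_identifier(identify_fn: Callable[[str], Optional[str]]):
    """Wrap a license_id-style function with a cache (hypothetical helper)."""

    @lru_cache(maxsize=256)
    def _cached(path: str, mtime_ns: int, size: int) -> Optional[str]:
        # mtime_ns and size only take part in the cache key.
        return identify_fn(path)

    def identify(path: str) -> Optional[str]:
        stat = os.stat(path)  # raises FileNotFoundError for missing files
        return _cached(os.path.abspath(path), stat.st_mtime_ns, stat.st_size)

    return identify
```

With the library installed, `cached_license_id = make_cached_identifier(license_id)` would give a drop-in replacement that reruns the analysis only when a file changes.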

## Integration Example

```python
import os
from identify.identify import license_id, tags_from_filename

def analyze_license_files(directory):
    """Find and identify all license files in a directory."""
    license_files = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        # Skip directories and other non-regular entries
        if not os.path.isfile(filepath):
            continue
        # Check if the filename suggests a license file
        if any(keyword in filename.lower() for keyword in ('license', 'copying')):
            tags = tags_from_filename(filename)
            try:
                spdx_id = license_id(filepath)
            except (OSError, UnicodeDecodeError) as e:
                # Unreadable or non-UTF-8 file: report it and move on
                print(f"Error analyzing {filename}: {e}")
                continue
            license_files.append({
                'file': filename,
                'spdx_id': spdx_id,
                'tags': tags,
            })

    return license_files
```