# File Identification

Core functionality for identifying files using multiple detection methods including path-based analysis, filename pattern matching, extension mapping, and interpreter detection.

## Capabilities

### Path-Based Identification

Comprehensive file identification using file system metadata, permissions, content analysis, and extension matching. This is the primary identification method providing the most complete tag set.

```python { .api }
def tags_from_path(path: str) -> set[str]:
    """
    Identify file tags from a file path using comprehensive analysis.

    Performs file system analysis including:
    - File type detection (file, directory, symlink, socket)
    - Permission analysis (executable, non-executable)
    - Extension and filename matching
    - Shebang parsing for executables
    - Binary/text content detection

    Args:
        path (str): File path to analyze

    Returns:
        set[str]: Set of identifying tags

    Raises:
        ValueError: If path does not exist
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_path

# Python script
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Executable shell script (the .sh extension supplies the tags, so the shebang is not read)
tags = tags_from_path('/usr/bin/script.sh')
print(tags)  # {'file', 'text', 'shell', 'executable'}

# Directory
tags = tags_from_path('/path/to/directory')
print(tags)  # {'directory'}

# Binary image file
tags = tags_from_path('/path/to/image.png')
print(tags)  # {'file', 'binary', 'image', 'png', 'non-executable'}
```
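
The file-type and permission checks listed in the docstring can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `basic_tags` is a hypothetical helper, and it deliberately skips the extension, shebang, and binary/text steps.

```python
import os
import stat


def basic_tags(path: str) -> set[str]:
    """Hypothetical helper: file-type and permission tags only."""
    try:
        # lstat does not follow symlinks, so a symlink is reported as such.
        mode = os.lstat(path).st_mode
    except OSError:
        raise ValueError(f"{path} does not exist")

    if stat.S_ISDIR(mode):
        return {"directory"}
    if stat.S_ISLNK(mode):
        return {"symlink"}
    if stat.S_ISSOCK(mode):
        return {"socket"}

    # Regular files get exactly one permission tag.
    if os.access(path, os.X_OK):
        return {"file", "executable"}
    return {"file", "non-executable"}
```

In this sketch, directories, symlinks, and sockets return early with a single tag, which matches the `{'directory'}` result shown in the usage example.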

### Filename-Only Identification

Fast identification based solely on filename and extension without accessing the file system. Useful for batch processing or when file system access is not available.

```python { .api }
def tags_from_filename(path: str) -> set[str]:
    """
    Identify file tags based only on filename/extension.

    Matches filename against known patterns and extensions without
    accessing the file system. Supports both extension-based matching
    and special filename recognition (e.g., 'Dockerfile', '.gitignore').

    Args:
        path (str): File path or filename to analyze

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_filename

# Extension-based matching
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

tags = tags_from_filename('script.js')
print(tags)  # {'javascript', 'text'}

# Special filename matching
tags = tags_from_filename('Dockerfile')
print(tags)  # {'dockerfile', 'text'}

tags = tags_from_filename('.gitignore')
print(tags)  # {'gitignore', 'text'}

# No match returns empty set
tags = tags_from_filename('unknown')
print(tags)  # set()
```
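
To make the two matching modes concrete, here is a simplified sketch of a filename-only lookup. It is an illustration under stated assumptions rather than the library's code: `SAMPLE_NAMES`, `SAMPLE_EXTENSIONS`, and `sketch_tags_from_filename` are hypothetical names, and the tables hold only a handful of entries.

```python
import os

# Hypothetical sample tables; the real mappings are far larger.
SAMPLE_NAMES = {
    "Dockerfile": {"dockerfile", "text"},
    ".gitignore": {"gitignore", "text"},
}
SAMPLE_EXTENSIONS = {
    "yaml": {"yaml", "text"},
    "js": {"javascript", "text"},
}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Hypothetical helper: match special names first, then the extension."""
    filename = os.path.basename(path)
    tags: set[str] = set()

    # Special filenames are matched exactly in this sketch.
    tags |= SAMPLE_NAMES.get(filename, set())

    # splitext returns '' as the extension for dotfiles such as '.gitignore'.
    ext = os.path.splitext(filename)[1]
    if ext:
        # Drop the leading dot and compare case-insensitively.
        tags |= SAMPLE_EXTENSIONS.get(ext[1:].lower(), set())
    return tags
```

Because nothing here touches the disk, the function works equally well on paths that do not exist, which is what makes this style of lookup suitable for batch processing.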

### Interpreter-Based Identification

Identify file types based on interpreter names, typically extracted from shebang lines. Supports version-specific interpreters with fallback to general interpreter names.

```python { .api }
def tags_from_interpreter(interpreter: str) -> set[str]:
    """
    Get tags for a given interpreter name.

    Attempts progressive matching from specific to general by dropping
    trailing version components until a known interpreter is found:
    'python3.9.1' -> 'python3.9' -> 'python3'

    Args:
        interpreter (str): Interpreter name (e.g., 'python3', 'bash', 'node')

    Returns:
        set[str]: Set of identifying tags (empty if no matches)
    """
```

**Usage Example:**

```python
from identify.identify import tags_from_interpreter

# Specific version with fallback
tags = tags_from_interpreter('python3.9.1')
print(tags)  # {'python', 'python3'}

# Shell interpreters
tags = tags_from_interpreter('bash')
print(tags)  # {'shell', 'bash'}

tags = tags_from_interpreter('zsh')
print(tags)  # {'shell', 'zsh'}

# JavaScript runtime
tags = tags_from_interpreter('node')
print(tags)  # {'javascript'}

# Unknown interpreter
tags = tags_from_interpreter('unknown')
print(tags)  # set()
```
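
The version fallback can be sketched as a loop that strips one dotted component at a time. This is a minimal illustration, assuming a small hypothetical table (`SAMPLE_INTERPRETERS`) and a hypothetical function name; it is not presented as the library's implementation.

```python
# Hypothetical sample table; the real mapping covers many more interpreters.
SAMPLE_INTERPRETERS = {
    "python3": {"python", "python3"},
    "bash": {"shell", "bash"},
    "node": {"javascript"},
}


def sketch_tags_from_interpreter(interpreter: str) -> set[str]:
    """Hypothetical helper: drop trailing '.N' parts until a name matches."""
    # Keep only the final path component, e.g. '/usr/bin/bash' -> 'bash'.
    name = interpreter.rpartition("/")[2]
    while name:
        if name in SAMPLE_INTERPRETERS:
            return SAMPLE_INTERPRETERS[name]
        # 'python3.9.1' -> 'python3.9' -> 'python3'. A name without a dot
        # becomes '' here, which ends the loop.
        name = name.rpartition(".")[0]
    return set()
```

With this approach a single `python3` entry covers every dotted minor and patch release, so the table does not need one entry per version.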

## Data Structures

The identification functions rely on lookup tables that map extensions, special filenames, and interpreter names to tag sets:

```python { .api }
from identify.extensions import EXTENSIONS, EXTENSIONS_NEED_BINARY_CHECK, NAMES
from identify.interpreters import INTERPRETERS

# Extension to tags mapping (400+ extensions)
EXTENSIONS: dict[str, set[str]]

# Extensions requiring binary content check
EXTENSIONS_NEED_BINARY_CHECK: dict[str, set[str]]

# Special filename to tags mapping (100+ filenames)
NAMES: dict[str, set[str]]

# Interpreter to tags mapping
INTERPRETERS: dict[str, set[str]]
```

**Example Data:**

```python
from identify.extensions import EXTENSIONS, NAMES
from identify.interpreters import INTERPRETERS

# Sample extension mappings
EXTENSIONS['py']   # {'python', 'text'}
EXTENSIONS['js']   # {'javascript', 'text'}
EXTENSIONS['png']  # {'binary', 'image', 'png'}

# Sample filename mappings
NAMES['Dockerfile']    # {'dockerfile', 'text'}
NAMES['.gitignore']    # {'gitignore', 'text'}
NAMES['package.json']  # {'json', 'text', 'npm'}

# Sample interpreter mappings
INTERPRETERS['python3']  # {'python', 'python3'}
INTERPRETERS['bash']     # {'shell', 'bash'}
INTERPRETERS['node']     # {'javascript'}
```
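
Since each table is a plain `dict[str, set[str]]`, it can be inverted to answer the opposite question: which extensions carry a given tag. The sketch below uses a hypothetical three-entry excerpt (`SAMPLE_EXTENSIONS`) so it runs without the library installed; `extensions_by_tag` is likewise an illustrative name.

```python
from collections import defaultdict

# Hypothetical excerpt mirroring the sample extension entries above.
SAMPLE_EXTENSIONS = {
    "py": {"python", "text"},
    "js": {"javascript", "text"},
    "png": {"binary", "image", "png"},
}


def extensions_by_tag(table: dict[str, set[str]]) -> dict[str, list[str]]:
    """Invert an extension -> tags mapping into tag -> sorted extensions."""
    inverted = defaultdict(list)
    for ext, tags in table.items():
        for tag in tags:
            inverted[tag].append(ext)
    # Sort each list so the result does not depend on set iteration order.
    return {tag: sorted(exts) for tag, exts in inverted.items()}


by_tag = extensions_by_tag(SAMPLE_EXTENSIONS)
print(by_tag["text"])   # ['js', 'py']
print(by_tag["image"])  # ['png']
```

The same function should work unchanged on any mapping with the `dict[str, set[str]]` shape shown in the API block.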

## Tag Categories

The identification system uses standardized tags organized into categories:

- **File Types:** `file`, `directory`, `symlink`, `socket`
- **Permissions:** `executable`, `non-executable`
- **Encoding:** `text`, `binary`
- **Languages:** `python`, `javascript`, `shell`, `c++`, `java`, etc.
- **Formats:** `json`, `yaml`, `xml`, `csv`, `markdown`, etc.
- **Images:** `png`, `jpeg`, `gif`, `svg`, etc.
- **Archives:** `zip`, `tar`, `gzip`, `bzip2`, etc.
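
Because tags are plain strings in a set, the categories can be separated with ordinary set operations. The following sketch defines its own category sets from the first three categories listed here and treats everything else as a language or format tag; `split_tags` is a hypothetical helper, not a library function.

```python
# Category sets defined locally for this sketch, copied from the list above.
TYPE_TAGS = {"file", "directory", "symlink", "socket"}
MODE_TAGS = {"executable", "non-executable"}
ENCODING_TAGS = {"text", "binary"}


def split_tags(tags: set[str]) -> dict[str, set[str]]:
    """Hypothetical helper: partition a tag set by category."""
    return {
        "type": tags & TYPE_TAGS,
        "mode": tags & MODE_TAGS,
        "encoding": tags & ENCODING_TAGS,
        # Whatever remains is a language, format, image, or archive tag.
        "other": tags - TYPE_TAGS - MODE_TAGS - ENCODING_TAGS,
    }


parts = split_tags({"file", "text", "python", "non-executable"})
print(parts["other"])  # {'python'}
```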