tessl/pypi-identify

File identification library for Python that determines file types based on paths, extensions, and content analysis

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/identify@2.6.x

To install, run

npx @tessl/cli install tessl/pypi-identify@2.6.0

# identify

File identification library for Python that determines file types based on paths, extensions, and content analysis. The library provides comprehensive file type detection using multiple methods including path-based analysis, filename matching, extension mapping, and shebang parsing.

## Package Information

- **Package Name**: identify
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install identify`
- **Optional Features**: `pip install identify[license]` for license identification

## Core Imports

```python
from identify import identify
```

Or import specific functions:

```python
from identify.identify import (
    tags_from_path,
    tags_from_filename,
    tags_from_interpreter,
    file_is_text,
    parse_shebang_from_file,
    license_id,
    ALL_TAGS,
)
```

For the command-line interface:

```python
from identify.cli import main
```

## Basic Usage

```python
from identify.identify import tags_from_path, tags_from_filename

# Identify file from path (comprehensive analysis)
tags = tags_from_path('/path/to/script.py')
print(tags)  # {'file', 'text', 'python', 'non-executable'}

# Identify file from filename only (no file system access)
tags = tags_from_filename('config.yaml')
print(tags)  # {'yaml', 'text'}

# Check if file is text or binary
from identify.identify import file_is_text
is_text = file_is_text('/path/to/file.txt')
print(is_text)  # True

# Parse shebang from executable files
from identify.identify import parse_shebang_from_file
shebang = parse_shebang_from_file('/path/to/script.py')
print(shebang)  # ('python3',) or empty tuple
```

## Architecture

The identify library uses a layered approach to file identification:

- **Path Analysis**: Examines file system metadata (type, permissions, accessibility)
- **Filename Matching**: Matches against known filenames and extensions using predefined mappings
- **Content Analysis**: Performs binary/text detection and shebang parsing for executables
- **Tag System**: Returns standardized tags that categorize files by type, encoding, mode, and language

The library includes comprehensive databases of file extensions (`EXTENSIONS`), special filenames (`NAMES`), and interpreter mappings (`INTERPRETERS`) covering hundreds of file types across multiple programming languages and formats.
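The layering can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: `sketch_tags_from_path` and the tiny `DEMO_EXTENSIONS` table are hypothetical stand-ins, and the real mappings are far larger. The sketch checks file system metadata first and consults the extension mapping only for regular files.

```python
import os
import stat

# Hypothetical, tiny stand-in for the library's much larger extension table.
DEMO_EXTENSIONS = {"py": {"python", "text"}, "yaml": {"text", "yaml"}}


def sketch_tags_from_path(path: str) -> set[str]:
    """Layered identification sketch: metadata first, then the filename."""
    try:
        mode = os.lstat(path).st_mode
    except OSError as exc:
        raise ValueError(f"{path} does not exist.") from exc

    # Layer 1: file system metadata decides the type tag.
    if stat.S_ISDIR(mode):
        return {"directory"}
    if stat.S_ISLNK(mode):
        return {"symlink"}
    if stat.S_ISSOCK(mode):
        return {"socket"}

    tags = {"file"}
    # The mode tag comes from the execute permission.
    tags.add("executable" if os.access(path, os.X_OK) else "non-executable")

    # Layer 2: extension lookup in the (demo) mapping.
    ext = os.path.basename(path).rpartition(".")[2].lower()
    tags |= DEMO_EXTENSIONS.get(ext, set())
    return tags
```

A content-analysis layer (text detection, shebang parsing) would slot in after the extension lookup for files the mapping does not recognize.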

## Capabilities

### File Identification

Core functionality for identifying files using path-based analysis, filename matching, and extension mapping. Returns comprehensive tag sets describing file characteristics including type, encoding, language, and mode.

```python { .api }
def tags_from_path(path: str) -> set[str]: ...
def tags_from_filename(path: str) -> set[str]: ...
def tags_from_interpreter(interpreter: str) -> set[str]: ...
```

[File Identification](./file-identification.md)
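As a rough illustration of filename matching and extension mapping, the sketch below looks up the exact filename first and then the extension. The function `sketch_tags_from_filename` and both demo tables are hypothetical and only hint at the kind of data the library ships. It works purely on the string and never touches the file system.

```python
import os

# Hypothetical demo tables; the library's own mappings are far larger.
DEMO_NAMES = {
    "Dockerfile": {"dockerfile", "text"},
    "Makefile": {"makefile", "text"},
}
DEMO_EXTENSIONS = {
    "py": {"python", "text"},
    "yaml": {"text", "yaml"},
}


def sketch_tags_from_filename(path: str) -> set[str]:
    """Look up tags by exact filename first, then by extension."""
    name = os.path.basename(path)
    tags: set[str] = set()

    # Special filenames such as "Makefile" carry no useful extension.
    tags |= DEMO_NAMES.get(name, set())

    # The extension lookup is case-insensitive in this sketch.
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    tags |= DEMO_EXTENSIONS.get(ext, set())
    return tags


print(sketch_tags_from_filename("config.yaml") == {"text", "yaml"})
```

An unknown name simply yields an empty set, which lets a caller fall back to content analysis.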

### Content Analysis

Functions for analyzing file content to determine text vs binary classification and parse executable shebang lines for interpreter identification.

```python { .api }
from typing import IO

def file_is_text(path: str) -> bool: ...
def is_text(bytesio: IO[bytes]) -> bool: ...
def parse_shebang_from_file(path: str) -> tuple[str, ...]: ...
def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]: ...
```

[Content Analysis](./content-analysis.md)
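Both ideas can be sketched with the standard library alone. The code below is an assumption-laden illustration and not the library's code: `sketch_is_text` applies a common heuristic (sample the first bytes and look for bytes outside a "texty" set), and `sketch_parse_shebang` splits the first line after `#!` and drops a leading `/usr/bin/env`. The sample size and the byte set are choices made for this example.

```python
import io
import shlex
from typing import IO

# Bytes treated as text: common control characters plus everything
# from 0x20 upward except DEL (0x7F).
TEXT_BYTES = bytes(
    {7, 8, 9, 10, 11, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100))
)


def sketch_is_text(stream: IO[bytes], sample_size: int = 1024) -> bool:
    """Return True if the sampled bytes contain no non-text bytes."""
    sample = stream.read(sample_size)
    # Deleting every text byte leaves only the suspicious ones.
    return not sample.translate(None, TEXT_BYTES)


def sketch_parse_shebang(stream: IO[bytes]) -> tuple[str, ...]:
    """Return the shebang command as a tuple, or () if there is none."""
    if stream.read(2) != b"#!":
        return ()
    first_line = stream.readline().decode("utf-8", errors="replace").strip()
    try:
        parts = shlex.split(first_line)
    except ValueError:  # unbalanced quotes and similar
        return ()
    # "#!/usr/bin/env python3" should report the real interpreter.
    if parts[:1] == ["/usr/bin/env"]:
        parts = parts[1:]
    return tuple(parts)


print(sketch_parse_shebang(io.BytesIO(b"#!/usr/bin/env python3\nprint(1)\n")))
```

Working on a byte stream instead of a path keeps both helpers usable on in-memory data, which is the same split the `is_text` / `file_is_text` pair above suggests.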

### License Identification

Optional functionality (installed with the `license` extra) that identifies the software license in a file and reports it as an SPDX identifier, using text matching and edit distance algorithms.

```python { .api }
def license_id(filename: str) -> str | None: ...
```

[License Identification](./license-identification.md)
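To make the combination of text matching and edit distance concrete, here is a small standard-library sketch. Everything in it is an assumption for illustration: the truncated reference texts in `DEMO_LICENSES`, the normalization rules, the 5% tolerance, and the plain Levenshtein routine standing in for whatever edit-distance implementation the library uses.

```python
import re
from typing import Optional

# Hypothetical, heavily truncated reference texts keyed by SPDX identifier.
DEMO_LICENSES = {
    "MIT": "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "ISC": "Permission to use, copy, modify, and/or distribute this software for any purpose",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sketch_license_id(text: str, tolerance: float = 0.05) -> Optional[str]:
    """Return the SPDX id of the first close-enough reference, else None."""
    norm = normalize(text)
    for spdx, reference in DEMO_LICENSES.items():
        ref = normalize(reference)
        # An exact match is cheap; fall back to a bounded edit distance.
        if norm == ref or edit_distance(norm, ref) <= tolerance * len(ref):
            return spdx
    return None
```

The exact-match check short-circuits the common case, so the quadratic edit-distance pass only runs for texts that differ from the reference.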

## Constants and Data

```python { .api }
# Basic file type constants
DIRECTORY: str
FILE: str
SYMLINK: str
SOCKET: str
EXECUTABLE: str
NON_EXECUTABLE: str
TEXT: str
BINARY: str

# Tag collections
TYPE_TAGS: frozenset[str]
MODE_TAGS: frozenset[str]
ENCODING_TAGS: frozenset[str]
ALL_TAGS: frozenset[str]
```
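
One way the tag collections can be put to use is to split a returned tag set into its type, mode, encoding, and remaining (language or format) tags. The sketch below defines local frozensets with assumed values so that it runs without the package installed; in real code they would be imported from `identify.identify`. The helper `split_tags` is hypothetical.

```python
# Assumed values for illustration; real code would import these
# collections from identify.identify instead of redefining them.
TYPE_TAGS = frozenset({"directory", "file", "symlink", "socket"})
MODE_TAGS = frozenset({"executable", "non-executable"})
ENCODING_TAGS = frozenset({"text", "binary"})


def split_tags(tags: set[str]) -> dict[str, set[str]]:
    """Group a tag set by category; unknown tags land in 'other'."""
    return {
        "type": tags & TYPE_TAGS,
        "mode": tags & MODE_TAGS,
        "encoding": tags & ENCODING_TAGS,
        "other": tags - TYPE_TAGS - MODE_TAGS - ENCODING_TAGS,
    }


groups = split_tags({"file", "text", "python", "non-executable"})
print(sorted(groups["other"]))
```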

## Command Line Interface

The package provides a command-line tool for file identification:

```python { .api }
from collections.abc import Sequence

def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface for file identification.

    Args:
        argv: Command line arguments, defaults to sys.argv[1:]

    Returns:
        int: Exit code (0 for success, 1 for error)
    """
```

**Usage:**

```bash
# Identify file with full analysis
identify-cli /path/to/file

# Identify using filename only
identify-cli --filename-only /path/to/file
```

The output is a JSON array of tags:

```json
["file", "text", "python", "non-executable"]
```
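
Because the output is plain JSON, a calling script can turn it back into a set of tags with the standard library. The helper `parse_cli_output` below is hypothetical, and the sample string stands in for the tool's stdout. In practice that text might come from something like `subprocess.run(["identify-cli", path], capture_output=True, text=True).stdout`.

```python
import json


def parse_cli_output(stdout: str) -> set[str]:
    """Parse the JSON array printed by the CLI into a set of tags."""
    tags = json.loads(stdout)
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("expected a JSON array of strings")
    return set(tags)


# Sample text standing in for the tool's stdout.
sample = '["file", "text", "python", "non-executable"]'
tags = parse_cli_output(sample)
print("python" in tags)
```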