# Content Analysis

Functions for classifying file content as text or binary and for parsing the shebang lines of executable files to identify their interpreter.

## Capabilities

### Text/Binary Detection

Determine whether files contain text or binary content using character analysis algorithms based on libmagic's detection logic.

```python { .api }
def file_is_text(path: str) -> bool:
    """
    Determine if a file contains text content.

    Opens file and analyzes the first 1KB to determine if content
    appears to be text based on character distribution analysis.

    Args:
        path (str): Path to file to analyze

    Returns:
        bool: True if file appears to be text, False if binary

    Raises:
        ValueError: If path does not exist
    """

def is_text(bytesio: IO[bytes]) -> bool:
    """
    Determine if byte stream content appears to be text.

    Analyzes the first 1KB of a byte stream to determine if content
    appears to be text based on character distribution. Based on
    libmagic's binary/text detection algorithm.

    Args:
        bytesio (IO[bytes]): Open binary file-like object

    Returns:
        bool: True if content appears to be text, False if binary
    """
```

**Usage Example:**

```python
from identify.identify import file_is_text, is_text
import io

# Check file directly
is_text_file = file_is_text('/path/to/document.txt')
print(is_text_file)  # True

# A binary file such as an image is reported as not text
png_is_text = file_is_text('/path/to/image.png')
print(png_is_text)  # False

# Check byte stream
with open('/path/to/file.py', 'rb') as f:
    result = is_text(f)
    print(result)  # True

# Check bytes in memory
data = b"print('hello world')\n"
stream = io.BytesIO(data)
result = is_text(stream)
print(result)  # True
```

### Shebang Parsing

Parse shebang lines from executable files to extract interpreter and argument information. Handles various shebang formats including env, nix-shell, and quoted arguments.

```python { .api }
def parse_shebang_from_file(path: str) -> tuple[str, ...]:
    """
    Parse shebang from a file path.

    Extracts shebang information from executable files. Only processes
    files that are executable and have valid shebang format. Handles
    various shebang patterns including /usr/bin/env usage.

    Args:
        path (str): Path to executable file

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no shebang

    Raises:
        ValueError: If path does not exist
    """

def parse_shebang(bytesio: IO[bytes]) -> tuple[str, ...]:
    """
    Parse shebang from an open binary file stream.

    Reads and parses shebang line from the beginning of a binary stream.
    Handles various formats including env, nix-shell, and quoted arguments.
    Only processes printable ASCII content.

    Args:
        bytesio (IO[bytes]): Open binary file-like object positioned at start

    Returns:
        tuple[str, ...]: Tuple of command and arguments, empty if no valid shebang
    """
```

**Usage Example:**

```python
from identify.identify import parse_shebang_from_file, parse_shebang
import io

# Parse from file path (the result depends on the file's shebang line)
shebang = parse_shebang_from_file('/usr/bin/python3-script')
print(shebang)  # e.g. ('python3',) for a '#!/usr/bin/env python3' shebang

shebang = parse_shebang_from_file('/path/to/bash-script.sh')
print(shebang)  # e.g. ('bash',) for a '#!/usr/bin/env bash' shebang

# Parse from byte stream
script_content = b'#!/usr/bin/env python3\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ('python3',)

# Complex shebang with arguments
script_content = b'#!/usr/bin/env -S python3 -u\nprint("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ('python3', '-u')

# No shebang
script_content = b'print("hello")\n'
stream = io.BytesIO(script_content)
shebang = parse_shebang(stream)
print(shebang)  # ()
```
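
The docstring for `parse_shebang_from_file` says that only executable files with a valid shebang are processed. As a rough sketch of what such a guard could look like, the code below checks the path, the executable bit, and the leading `#!` before reading the line. The helper name `shebang_line_if_executable` is hypothetical, the code is not taken from the library, and it assumes a POSIX system where `os.chmod` controls the executable bit.

```python
import os
import tempfile


def shebang_line_if_executable(path: str) -> str:
    """Hypothetical helper: return the text after '#!', or '' if not applicable."""
    if not os.path.lexists(path):
        # Mirrors the documented ValueError for a missing path.
        raise ValueError(f'{path} does not exist.')
    if not os.access(path, os.X_OK):
        # Non-executable files are skipped entirely.
        return ''
    with open(path, 'rb') as f:
        if f.read(2) != b'#!':
            return ''
        return f.readline().decode('utf-8', errors='replace').strip()


# Demonstration on a throwaway script in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'demo.sh')
    with open(script, 'wb') as f:
        f.write(b'#!/bin/sh\necho hi\n')

    os.chmod(script, 0o644)  # no executable bit
    not_executable = shebang_line_if_executable(script)

    os.chmod(script, 0o755)  # executable
    executable = shebang_line_if_executable(script)
```

In this sketch the same file yields an empty result until its executable bit is set, which is the behavior the docstring describes for non-executable paths.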

## Shebang Format Handling

The shebang parser handles various common patterns:

**Standard Format:**

```bash
#!/bin/bash
#!/usr/bin/python3
```

**Environment-based:**

```bash
#!/usr/bin/env python3
#!/usr/bin/env -S python3 -u
```
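
To illustrate the environment-based forms, the `/usr/bin/env` prefix (and an optional `-S` flag) can be dropped from the tokenized shebang so that only the interpreter and its arguments remain. The sketch below uses a hypothetical helper, `strip_env_prefix`, and is inferred from the outputs shown in the usage example rather than taken from the library's source.

```python
def strip_env_prefix(cmd: tuple[str, ...]) -> tuple[str, ...]:
    """Hypothetical helper: drop a leading '/usr/bin/env' (and '-S') from a command."""
    if cmd[:2] == ('/usr/bin/env', '-S'):
        # '#!/usr/bin/env -S python3 -u' -> ('python3', '-u')
        return cmd[2:]
    if cmd[:1] == ('/usr/bin/env',):
        # '#!/usr/bin/env python3' -> ('python3',)
        return cmd[1:]
    # Anything else is returned unchanged in this sketch.
    return cmd


with_flag = strip_env_prefix(('/usr/bin/env', '-S', 'python3', '-u'))
plain_env = strip_env_prefix(('/usr/bin/env', 'python3'))
```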

**Nix Shell:**

```bash
#!/usr/bin/env nix-shell
#! /some/path/to/interpreter
```

**Quoted Arguments:**

The parser first attempts shlex-style parsing, then falls back to simple whitespace splitting when the quoting is malformed.
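
A minimal sketch of that two-step strategy, using the standard library's `shlex.split` and a plain `str.split` fallback. The function name `split_shebang_line` is hypothetical and is only meant to illustrate the idea, not to reproduce the library's internals.

```python
import shlex


def split_shebang_line(line: str) -> list[str]:
    """Hypothetical helper: tokenize a shebang line, tolerating bad quoting."""
    try:
        # Honors quotes, so a quoted path containing spaces stays one token.
        return shlex.split(line)
    except ValueError:
        # shlex raises ValueError on an unterminated quote; split on whitespace instead.
        return line.split()


quoted = split_shebang_line('"/path with spaces/python" -u')
malformed = split_shebang_line('/bin/sh -c "unterminated')
```

With well-formed quoting the quoted path is kept as a single token; with an unterminated quote the fallback simply splits on whitespace and leaves the stray quote character in place.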

## Character Analysis Details

The text/binary detection checks each byte in the analysis window against a fixed set of text characters:

- **Text Characters**: Control chars (7,8,9,10,11,12,13,27), printable ASCII (0x20-0x7E), extended ASCII (0x80-0xFF)
- **Analysis Window**: First 1024 bytes of file content
- **Algorithm**: Based on libmagic's encoding detection logic
- **Threshold**: Any non-text characters indicate binary content

Because only the first 1024 bytes are read, the check stays cheap on large files and across large sets of files. The trade-off is that binary content appearing after that window is not detected.
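
The check described in this section can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the library's code: the name `looks_like_text` is hypothetical, and the byte set assumes that DEL (0x7F) counts as a non-text byte, as it does in libmagic.

```python
import io
from typing import IO

# Bytes treated as text: selected control chars, printable ASCII, extended ASCII.
# Assumption: 0x7F (DEL) is excluded, so the printable range ends at 0x7E.
TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7F))
    + bytes(range(0x80, 0x100))
)


def looks_like_text(stream: IO[bytes], window: int = 1024) -> bool:
    """Hypothetical helper: True if the first `window` bytes are all text bytes."""
    chunk = stream.read(window)
    # bytes.translate(None, delete) removes every byte listed in `delete`;
    # whatever remains is a non-text byte, which marks the content as binary.
    return not chunk.translate(None, TEXT_BYTES)


source_is_text = looks_like_text(io.BytesIO(b"print('hello world')\n"))
nul_is_text = looks_like_text(io.BytesIO(b'\x00\x01\x02'))
```

Note that a single non-text byte inside the window is enough to classify the content as binary, while bytes beyond the window are never inspected.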