
# Core Detection Functions

Primary charset detection methods that analyze raw bytes, file pointers, or file paths to determine character encoding. These functions form the core of charset-normalizer's detection capabilities and support extensive customization through parameters.

## Capabilities

### Bytes Detection

Detects character encoding from raw bytes or bytearray sequences using heuristic analysis.

```python { .api }
def from_bytes(
    sequences: bytes | bytearray,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.2,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from a raw bytes sequence.

    Parameters:
    - sequences: Raw bytes or bytearray to analyze
    - steps: Number of analysis steps (default: 5)
    - chunk_size: Size of data chunks for analysis (default: 512)
    - threshold: Mess-ratio threshold for encoding rejection (default: 0.2)
    - cp_isolation: List of encodings to test exclusively
    - cp_exclusion: List of encodings to exclude from testing
    - preemptive_behaviour: Enable BOM/signature priority detection (default: True)
    - explain: Enable detailed logging for debugging (default: False)
    - language_threshold: Minimum coherence for language detection (default: 0.1)
    - enable_fallback: Enable fallback to common encodings (default: True)

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    TypeError: If sequences is not bytes or bytearray
    """
```

**Usage Example:**

```python
import charset_normalizer

# Basic detection
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text in UTF-8
results = charset_normalizer.from_bytes(raw_data)
best_match = results.best()
print(f"Encoding: {best_match.encoding}")  # utf_8
print(f"Language: {best_match.language}")  # Chinese

# Advanced detection with custom parameters
results = charset_normalizer.from_bytes(
    raw_data,
    steps=10,                                  # More thorough analysis
    threshold=0.1,                             # Stricter mess threshold
    cp_isolation=['utf_8', 'gb2312', 'big5'],  # Test only likely Chinese encodings
    explain=True,                              # Enable debug logging
)
```
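Beyond reading `encoding` and `language`, a `CharsetMatch` can produce the decoded or normalized payload directly. A brief sketch, assuming the `str()` and `output()` members of `CharsetMatch` (the sample text is illustrative):

```python
import charset_normalizer

raw_data = "Bonjour tout le monde".encode("utf_8")
best_match = charset_normalizer.from_bytes(raw_data).best()

text = str(best_match)            # str() decodes with the detected encoding
normalized = best_match.output()  # output() re-encodes the payload (UTF-8 by default)
```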

### File Pointer Detection

Detects character encoding from an open file pointer without closing it.

```python { .api }
def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from file pointer.

    Parameters:
    - fp: Open binary file pointer
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Note: Does not close the file pointer
    """
```

**Usage Example:**

```python
import charset_normalizer

with open('document.txt', 'rb') as fp:
    results = charset_normalizer.from_fp(fp)
    best_match = results.best()
    if best_match:
        print(f"File encoding: {best_match.encoding}")
    # File pointer remains open for further operations
```
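Because `from_fp` accepts any binary file-like object, an in-memory buffer works just as well as a real file. A minimal sketch (the sample text is illustrative):

```python
import io
import charset_normalizer

# Any BinaryIO works, not only file handles opened with open(..., 'rb')
buffer = io.BytesIO("Grüße aus München".encode("utf_8"))
best_match = charset_normalizer.from_fp(buffer).best()

# The buffer is not closed by from_fp and can be rewound for further use
buffer.seek(0)
```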

### File Path Detection

Detects character encoding by opening and reading a file from its path.

```python { .api }
def from_path(
    path: str | bytes | PathLike,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from file path.

    Parameters:
    - path: Path to file (string, bytes, or PathLike object)
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    IOError: If file cannot be opened or read
    """
```

**Usage Example:**

```python
import charset_normalizer
from pathlib import Path

# Using string path
results = charset_normalizer.from_path('data/sample.txt')

# Using Path object
file_path = Path('documents/report.csv')
results = charset_normalizer.from_path(file_path)

# With custom settings for CSV files
results = charset_normalizer.from_path(
    'data.csv',
    cp_isolation=['utf_8', 'iso-8859-1', 'windows-1252'],  # Common for CSV
    threshold=0.15,  # Slightly stricter for structured data
)
```
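`best()` returns `None` when no acceptable match is found, so check the result before using it. A self-contained sketch; the temporary file and its Latin-1 content are purely illustrative:

```python
import os
import tempfile
import charset_normalizer

# Write a small Latin-1 encoded file, then detect and decode it
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("Le café est déjà prêt, naïvement posé sur la fenêtre.".encode("latin_1"))
    path = tmp.name

best_match = charset_normalizer.from_path(path).best()
if best_match is not None:
    text = str(best_match)  # decoded with the detected encoding

os.unlink(path)  # clean up the illustrative file
```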

### Binary Detection

Determines whether input data represents binary (non-text) content.

```python { .api }
def is_binary(
    fp_or_path_or_payload: PathLike | str | BinaryIO | bytes,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = False,
) -> bool:
    """
    Detect if input is binary (non-text) content.

    Parameters:
    - fp_or_path_or_payload: File path, file pointer, or raw bytes
    - Other parameters: Same as from_bytes (enable_fallback defaults to False)

    Returns:
    bool: True if content appears to be binary, False if text

    Note: Uses stricter criteria than text detection to avoid false positives
    """
```

**Usage Example:**

```python
import charset_normalizer

# Check if file is binary
if charset_normalizer.is_binary('image.jpg'):
    print("Binary file detected")
else:
    print("Text file detected")

# Check raw bytes
data = b'\x89PNG\r\n\x1a\n'  # PNG file header
if charset_normalizer.is_binary(data):
    print("Binary data")

# Check with file pointer
with open('document.pdf', 'rb') as fp:
    if charset_normalizer.is_binary(fp):
        print("Binary document")
```

## Parameter Guidelines

### Performance Tuning

- **steps**: Higher values (7-10) for more accuracy, lower (3-5) for speed
- **chunk_size**: Larger chunks (1024-2048) for large files, smaller (256-512) for small files
- **threshold**: Lower values (0.1-0.15) for stricter detection, higher (0.25-0.3) for more permissive detection

### Encoding Control

- **cp_isolation**: Use when you know the likely encoding family (e.g., ['utf_8', 'utf_16'] for Unicode)
- **cp_exclusion**: Exclude problematic encodings that cause false positives
- **preemptive_behaviour**: Disable (False) for pure heuristic analysis without BOM priority

### Language Detection

- **language_threshold**: Lower values (0.05) for better language detection, higher (0.2) to reduce false positives
- **enable_fallback**: Keep True for safety, set False for stricter binary detection
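The trade-offs above can be combined per call. The sketch below contrasts a speed-oriented pass with an accuracy-oriented one; the Cyrillic sample and the chosen code pages are illustrative assumptions:

```python
import charset_normalizer

payload = "Определение кодировки текста по его содержимому".encode("cp1251")

# Speed-oriented: fewer steps, permissive threshold
quick = charset_normalizer.from_bytes(payload, steps=3, threshold=0.3)

# Accuracy-oriented: more steps, stricter threshold,
# candidates restricted to the code pages we expect
careful = charset_normalizer.from_bytes(
    payload,
    steps=10,
    threshold=0.1,
    cp_isolation=["utf_8", "cp1251", "iso8859_5"],
)
```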