tessl/pypi-charset-normalizer

The Real First Universal Charset Detector providing modern, fast, and reliable character encoding detection as an alternative to chardet.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pypipkg:pypi/charset-normalizer@3.4.x

To install, run

npx @tessl/cli install tessl/pypi-charset-normalizer@3.4.0


# Charset Normalizer

Charset Normalizer, billed as "The Real First Universal Charset Detector", is a modern, fast, and reliable character encoding detection library and an open-source alternative to chardet. It detects the encoding of raw bytes heuristically: it tries multiple encoding tables, measures the noise level of each decoded result, and selects the best match using language detection and coherence scoring.

## Package Information

- **Package Name**: charset-normalizer
- **Language**: Python
- **Installation**: `pip install charset-normalizer`

## Core Imports

```python
import charset_normalizer
```

Standard imports for charset detection:

```python
from charset_normalizer import from_bytes, from_fp, from_path, is_binary
from charset_normalizer import CharsetMatch, CharsetMatches
```

Legacy compatibility:

```python
from charset_normalizer import detect  # chardet compatibility
```

Type annotations (for advanced usage):

```python
from typing import BinaryIO, Iterator
from os import PathLike
import logging
```

## Basic Usage

```python
import charset_normalizer

# Detect encoding from bytes
with open('unknown_file.txt', 'rb') as f:
    raw_data = f.read()

results = charset_normalizer.from_bytes(raw_data)
best_guess = results.best()

# best() returns None when no plausible encoding was found
if best_guess is not None:
    print(f"Detected encoding: {best_guess.encoding}")
    # percent_chaos measures noise in the decoded text; lower is better
    print(f"Chaos: {best_guess.percent_chaos}%")
    print(f"Language: {best_guess.language}")

    # Get the decoded text
    decoded_text = str(best_guess)
    print(decoded_text)

# Detect directly from file path
results = charset_normalizer.from_path('unknown_file.txt')
best_guess = results.best()
if best_guess is not None:
    print(f"File encoding: {best_guess.encoding}")

# Check if content is binary (a str argument is treated as a file path)
is_text = not charset_normalizer.is_binary('data_file.bin')
print(f"Is text file: {is_text}")
```

## Architecture

Charset Normalizer uses a multi-step detection process:

- **Heuristic Detection**: Tests multiple character encodings against input data
- **Mess Ratio Analysis**: Measures "chaos" or noise level in decoded content to evaluate encoding quality
- **Language Detection**: Uses letter frequency analysis to detect spoken languages and improve encoding confidence
- **Coherence Scoring**: Evaluates linguistic coherence of decoded text
- **BOM/Signature Detection**: Identifies byte order marks and encoding signatures
- **Fallback Mechanisms**: Provides safe fallbacks when detection is uncertain

This architecture enables highly accurate charset detection across 99+ supported encodings while maintaining performance and reliability.
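To make the idea concrete, here is a toy sketch of the "try each encoding, then score the noise" approach using only the standard library. It is not charset-normalizer's implementation: the candidate list, the `noise_ratio` heuristic, and the `guess_encoding` helper are all assumptions made for illustration.

```python
from __future__ import annotations

import unicodedata

# Hypothetical candidate list; the real library tests far more encodings.
CANDIDATES = ["utf_8", "cp1252", "latin_1"]


def noise_ratio(text: str) -> float:
    """Fraction of characters that look like decoding debris (toy heuristic)."""
    if not text:
        return 0.0
    noisy = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cc" and ch not in "\t\n\r":
            noisy += 1  # unexpected control character
        elif ord(ch) > 127 and not category.startswith("L") and not ch.isspace():
            noisy += 1  # non-ASCII symbol or punctuation, typical of mojibake
    return noisy / len(text)


def guess_encoding(payload: bytes) -> str | None:
    """Return the candidate whose decoded text is least noisy, or None."""
    scored = []
    for encoding in CANDIDATES:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding table cannot represent the payload
        scored.append((noise_ratio(text), encoding))
    if not scored:
        return None
    # min() keeps the first candidate on ties, so list order acts as priority
    return min(scored, key=lambda pair: pair[0])[1]


print(guess_encoding("Héllo wörld, café crème".encode("utf-8")))
print(guess_encoding("café".encode("cp1252")))
```

The library's own analysis is considerably richer (language detection, coherence scoring, BOM handling), so treat this only as a mental model of the first two steps.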

## Capabilities

### Core Detection Functions

Primary charset detection methods for bytes, file pointers, and file paths. Includes binary content detection to distinguish text from non-text data.

```python { .api }
def from_bytes(sequences, **kwargs) -> CharsetMatches: ...
def from_fp(fp, **kwargs) -> CharsetMatches: ...
def from_path(path, **kwargs) -> CharsetMatches: ...
def is_binary(fp_or_path_or_payload, **kwargs) -> bool: ...
```

[Core Detection](./core-detection.md)

### Detection Result Classes

Structured containers for charset detection results, providing detailed information about detected encodings, confidence levels, language detection, and text decoding capabilities.

```python { .api }
class CharsetMatch:
    encoding: str
    language: str
    chaos: float
    coherence: float
    def __str__(self) -> str: ...

class CharsetMatches:
    def best(self) -> CharsetMatch | None: ...
    def __getitem__(self, item) -> CharsetMatch: ...
```

[Detection Results](./detection-results.md)
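
The following sketch shows one way to inspect the result objects. The sample text is illustrative, and `output()` is a `CharsetMatch` method from charset-normalizer 3.x that is not listed in the summary above; it returns the decoded text re-encoded as UTF-8 by default.

```python
from charset_normalizer import from_bytes

text = "Ceci est un exemple de texte accentué, déjà vu, château, naïveté, garçon."
matches = from_bytes(text.encode("utf-8"))

# CharsetMatches is iterable; each item is a CharsetMatch candidate
for match in matches:
    print(match.encoding, match.language, match.chaos, match.coherence)

best = matches.best()
if best is not None:
    decoded = str(best)         # decoded text
    utf8_bytes = best.output()  # re-encoded payload, UTF-8 by default
    first = matches[0]          # index access via __getitem__
```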

### Legacy Compatibility

Chardet-compatible detection function for easy migration from chardet to charset-normalizer while maintaining backward compatibility.

```python { .api }
def detect(byte_str, should_rename_legacy=False, **kwargs) -> dict: ...
```

[Legacy Compatibility](./legacy-compatibility.md)
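
As a rough migration sketch, the dictionary returned by `detect` can be consumed the same way as chardet's. The sample payload is illustrative, and `codecs.lookup` is used only to normalize spelling variants of the encoding name.

```python
import codecs

from charset_normalizer import detect

payload = "Überraschung! Größe, Mädchen, Straße und Käse sind schöne Wörter.".encode("utf-8")

result = detect(payload)
print(result)  # dict with 'encoding', 'language' and 'confidence' keys

# 'encoding' and 'confidence' may be None when detection is inconclusive
if result["encoding"] is not None:
    # codecs.lookup normalizes spelling variants such as 'utf_8' and 'utf-8'
    codec_name = codecs.lookup(result["encoding"]).name
    text = payload.decode(result["encoding"])
```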

### CLI Interface

Command-line interface and programmatic CLI functions for charset detection, file processing, and interactive operations.

```python { .api }
from charset_normalizer.cli import cli_detect, query_yes_no

# Takes command-line style arguments (file paths plus option flags) and
# returns a process exit code. The options cover alternatives, normalize,
# minimal, replace, force, threshold (default 0.2), and verbose.
def cli_detect(argv: list[str] | None = None) -> int: ...

def query_yes_no(question: str, default: str = "yes") -> bool: ...
```

[CLI Interface](./cli-interface.md)

### Utility Functions

Logger configuration and version information utilities.

```python { .api }
def set_logging_handler(
    name: str = "charset_normalizer",
    level: int = logging.INFO,
    format_string: str = "%(asctime)s | %(levelname)s | %(message)s"
) -> None:
    """
    Configure a logger with custom handler, level, and format.

    Parameters:
    - name: Logger name (default: "charset_normalizer")
    - level: Logging level (default: logging.INFO)
    - format_string: Log message format (default: includes timestamp, level, message)

    Returns:
    None

    Note: Sets up a StreamHandler with the specified configuration
    """

__version__: str  # Package version string
VERSION: list[str]  # Version components as list
```

## Types

```python { .api }
# Type aliases for language coherence data
CoherenceMatch = tuple[str, float]  # (language_name, coherence_score)
CoherenceMatches = list[CoherenceMatch]  # List of language matches

# Type aliases for detection results (legacy compatibility)
from typing import TypedDict

class ResultDict(TypedDict):
    """Legacy detection result type for chardet compatibility."""
    encoding: str | None  # Detected encoding name or None
    language: str  # Detected language or empty string
    confidence: float | None  # Confidence score (0.0-1.0) or None

# Import types for function signatures
from typing import BinaryIO, Iterator, Any
from os import PathLike
import logging
```
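
Because `ResultDict` is a plain `TypedDict`, a legacy result can be handled as an ordinary dictionary. The sketch below redefines the type locally so that it runs without the library installed; `decode_with_result` and its `fallback` parameter are hypothetical helpers, not part of charset-normalizer.

```python
from __future__ import annotations

from typing import TypedDict


class ResultDict(TypedDict):
    """Local copy of the legacy result shape described above."""
    encoding: str | None
    language: str
    confidence: float | None


def decode_with_result(payload: bytes, result: ResultDict, fallback: str = "utf-8") -> str:
    """Decode payload with the detected encoding, or the fallback if none was found."""
    encoding = result["encoding"] or fallback
    return payload.decode(encoding, errors="replace")


# Hand-written results standing in for the output of detect()
detected: ResultDict = {"encoding": "cp1252", "language": "French", "confidence": 0.9}
unknown: ResultDict = {"encoding": None, "language": "", "confidence": None}

print(decode_with_result("café".encode("cp1252"), detected))
print(decode_with_result("café".encode("utf-8"), unknown))
```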