# Charset Normalizer

The Real First Universal Charset Detector. A modern, fast, and reliable character encoding detection library that serves as an open-source alternative to chardet. It uses heuristics to detect character encodings from raw bytes: testing multiple encoding tables, measuring the noise ("mess") level of the decoded output, and selecting the best match through language detection and coherence scoring.

## Package Information

- **Package Name**: charset-normalizer
- **Language**: Python
- **Installation**: `pip install charset-normalizer`

## Core Imports

```python
import charset_normalizer
```

Standard imports for charset detection:

```python
from charset_normalizer import from_bytes, from_fp, from_path, is_binary
from charset_normalizer import CharsetMatch, CharsetMatches
```

Legacy compatibility:

```python
from charset_normalizer import detect  # chardet compatibility
```

Type annotations (for advanced usage):

```python
from typing import BinaryIO, Iterator
from os import PathLike
import logging
```

## Basic Usage

```python
import charset_normalizer

# Detect encoding from bytes
with open('unknown_file.txt', 'rb') as f:
    raw_data = f.read()

results = charset_normalizer.from_bytes(raw_data)
best_guess = results.best()

if best_guess:
    print(f"Detected encoding: {best_guess.encoding}")
    print(f"Chaos (mess) level: {best_guess.percent_chaos}%")  # lower is better
    print(f"Language: {best_guess.language}")

    # Get the decoded text
    decoded_text = str(best_guess)
    print(decoded_text)

# Detect directly from file path
results = charset_normalizer.from_path('unknown_file.txt')
best_guess = results.best()
if best_guess:
    print(f"File encoding: {best_guess.encoding}")

# Check if content is binary
is_text = not charset_normalizer.is_binary('data_file.bin')
print(f"Is text file: {is_text}")
```

## Architecture

Charset Normalizer uses a multi-step detection process:

- **Heuristic Detection**: Tests multiple character encodings against the input data
- **Mess Ratio Analysis**: Measures the "chaos" (noise) level of decoded content to evaluate encoding quality
- **Language Detection**: Uses letter-frequency analysis to detect the written language and improve encoding confidence
- **Coherence Scoring**: Evaluates the linguistic coherence of decoded text
- **BOM/Signature Detection**: Identifies byte order marks and encoding signatures
- **Fallback Mechanisms**: Provides safe fallbacks when detection is uncertain

This architecture enables accurate charset detection across 99+ supported encodings while maintaining performance and reliability.

## Capabilities

### Core Detection Functions

Primary charset detection methods for bytes, file pointers, and file paths. Includes binary content detection to distinguish text from non-text data.

```python { .api }
def from_bytes(sequences, **kwargs) -> CharsetMatches: ...
def from_fp(fp, **kwargs) -> CharsetMatches: ...
def from_path(path, **kwargs) -> CharsetMatches: ...
def is_binary(fp_or_path_or_payload, **kwargs) -> bool: ...
```

[Core Detection](./core-detection.md)

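The idea behind `from_bytes` — try candidate encodings and keep the least "messy" decode — can be sketched in plain Python. This is a deliberately naive illustration; the library's real mess-ratio analysis is far more sophisticated:

```python
def naive_detect(payload: bytes, candidates=("utf_8", "cp1252", "latin_1")):
    """Return (encoding, mess_ratio) for the cleanest decode, or None."""
    best = None
    for encoding in candidates:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue  # this table cannot represent the payload at all
        # Crude "mess ratio": share of unprintable, non-whitespace characters
        noisy = sum(1 for ch in text if not ch.isprintable() and ch not in "\r\n\t")
        mess = noisy / len(text) if text else 0.0
        if best is None or mess < best[1]:
            best = (encoding, mess)
    return best

print(naive_detect("héllo wörld".encode("utf_8")))  # → ('utf_8', 0.0)
```

Note that a hard decode failure eliminates a candidate outright, while successful decodes are ranked by noise; this two-tier filtering is the essence of the heuristic approach.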
### Detection Result Classes

Structured containers for charset detection results, providing detailed information about detected encodings, confidence levels, detected language, and text decoding.

```python { .api }
class CharsetMatch:
    encoding: str
    language: str
    chaos: float
    coherence: float
    def __str__(self) -> str: ...

class CharsetMatches:
    def best(self) -> CharsetMatch | None: ...
    def __getitem__(self, item) -> CharsetMatch: ...
```

[Detection Results](./detection-results.md)

### Legacy Compatibility

A chardet-compatible `detect()` function for easy, backward-compatible migration from chardet to charset-normalizer.

```python { .api }
def detect(byte_str, should_rename_legacy=False, **kwargs) -> dict: ...
```

[Legacy Compatibility](./legacy-compatibility.md)

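`detect()` wraps the modern API and returns a chardet-style dict with `encoding`, `language`, and `confidence` keys. The shape of that mapping can be sketched as follows — `to_chardet_dict` is our illustrative name, and the real function derives confidence from the match internals, not necessarily this exact formula:

```python
def to_chardet_dict(encoding, language, chaos):
    """Map a match to a chardet-style result dict (illustrative only)."""
    return {
        "encoding": encoding,            # None when nothing could be decoded
        "language": language or "",      # empty string when unknown
        "confidence": None if encoding is None else round(1.0 - chaos, 2),
    }

print(to_chardet_dict("utf_8", "English", 0.02))
# {'encoding': 'utf_8', 'language': 'English', 'confidence': 0.98}
```

Because the keys match chardet's output, existing call sites typically only need their import line changed.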
### CLI Interface

Command-line interface and programmatic CLI functions for charset detection, file processing, and interactive prompts. `cli_detect` parses an argv-style argument list and supports flags including `--alternatives`, `--normalize`, `--minimal`, `--replace`, `--force`, `--threshold`, and `--verbose`.

```python { .api }
from charset_normalizer.cli import cli_detect, query_yes_no

def cli_detect(argv: list[str] | None = None) -> int: ...
def query_yes_no(question: str, default: str = "yes") -> bool: ...
```

[CLI Interface](./cli-interface.md)

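From a shell, the same functionality is exposed through the `normalizer` entry point installed with the package. Flag names below follow the CLI's help output; file paths are placeholders:

```shell
# Detect the encoding of a file (prints a JSON report)
normalizer unknown_file.txt

# Show alternative candidate encodings as well
normalizer --alternatives unknown_file.txt

# Rewrite the file as UTF-8 next to the original
normalizer --normalize unknown_file.txt
```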
### Utility Functions

Logger configuration and version information utilities.

```python { .api }
def set_logging_handler(
    name: str = "charset_normalizer",
    level: int = logging.INFO,
    format_string: str = "%(asctime)s | %(levelname)s | %(message)s"
) -> None:
    """
    Configure a logger with a custom handler, level, and format.

    Parameters:
    - name: Logger name (default: "charset_normalizer")
    - level: Logging level (default: logging.INFO)
    - format_string: Log message format (default includes timestamp, level, and message)

    Returns:
    None

    Note: Sets up a StreamHandler with the specified configuration.
    """

__version__: str  # Package version string
VERSION: list[str]  # Version components as a list of strings
```

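What `set_logging_handler` does can be reproduced with the standard `logging` module, which is useful if you want the same output format without the helper. This is a stdlib-only equivalent with our own function name, not the library's code:

```python
import logging

def configure_logger(name="charset_normalizer",
                     level=logging.INFO,
                     format_string="%(asctime)s | %(levelname)s | %(message)s"):
    """Attach a StreamHandler with the given format, like set_logging_handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(format_string))
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger

log = configure_logger()
log.info("detection started")  # e.g. "2024-01-01 12:00:00,000 | INFO | detection started"
```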
## Types

```python { .api }
# Type aliases for language coherence data
CoherenceMatch = tuple[str, float]  # (language_name, coherence_score)
CoherenceMatches = list[CoherenceMatch]  # List of language matches

# Legacy detection result type (chardet compatibility)
from typing import TypedDict

class ResultDict(TypedDict):
    """Legacy detection result type for chardet compatibility."""
    encoding: str | None  # Detected encoding name, or None
    language: str  # Detected language, or an empty string
    confidence: float | None  # Confidence score (0.0-1.0), or None

# Imported types used in function signatures
from typing import BinaryIO, Iterator, Any
from os import PathLike
import logging
```