# Charset Normalizer

The Real First Universal Charset Detector. A modern, fast, and reliable character encoding detection library that serves as an open-source alternative to chardet. It uses heuristics to detect character encodings from raw bytes: testing multiple encoding tables, measuring the noise ("mess") level of the decoded output, and selecting the best match through language detection and coherence scoring.

## Package Information

- **Package Name**: charset-normalizer
- **Language**: Python
- **Installation**: `pip install charset-normalizer`

## Core Imports

```python
import charset_normalizer
```

Standard imports for charset detection:

```python
from charset_normalizer import from_bytes, from_fp, from_path, is_binary
from charset_normalizer import CharsetMatch, CharsetMatches
```

Legacy compatibility:

```python
from charset_normalizer import detect  # chardet compatibility
```

Type annotations (for advanced usage):

```python
from typing import BinaryIO, Iterator
from os import PathLike
import logging
```

## Basic Usage

```python
import charset_normalizer

# Detect encoding from bytes
with open('unknown_file.txt', 'rb') as f:
    raw_data = f.read()

results = charset_normalizer.from_bytes(raw_data)
best_guess = results.best()

if best_guess:
    print(f"Detected encoding: {best_guess.encoding}")
    print(f"Chaos (mess) level: {best_guess.percent_chaos}%")  # lower is better
    print(f"Language: {best_guess.language}")

    # Get the decoded text
    decoded_text = str(best_guess)
    print(decoded_text)

# Detect directly from file path
results = charset_normalizer.from_path('unknown_file.txt')
best_guess = results.best()
if best_guess:
    print(f"File encoding: {best_guess.encoding}")

# Check if content is binary
is_text = not charset_normalizer.is_binary('data_file.bin')
print(f"Is text file: {is_text}")
```

## Architecture

Charset Normalizer uses a multi-step detection process:

- **Heuristic Detection**: Tests multiple character encodings against the input data
- **Mess Ratio Analysis**: Measures the "chaos" (noise) level of decoded content to evaluate encoding quality
- **Language Detection**: Uses letter-frequency analysis to detect the written language and improve encoding confidence
- **Coherence Scoring**: Evaluates the linguistic coherence of decoded text
- **BOM/Signature Detection**: Identifies byte order marks and encoding signatures
- **Fallback Mechanisms**: Provides safe fallbacks when detection is uncertain

This architecture enables accurate charset detection across 99+ supported encodings while maintaining performance and reliability.

## Capabilities

### Core Detection Functions

Primary charset detection methods for bytes, file pointers, and file paths. Includes binary content detection to distinguish text from non-text data.

```python { .api }
def from_bytes(sequences, **kwargs) -> CharsetMatches: ...
def from_fp(fp, **kwargs) -> CharsetMatches: ...
def from_path(path, **kwargs) -> CharsetMatches: ...
def is_binary(fp_or_path_or_payload, **kwargs) -> bool: ...
```

[Core Detection](./core-detection.md)

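The idea behind `from_bytes` — try candidate encodings and keep the least "messy" decode — can be sketched in plain Python. This is a deliberately naive illustration; the library's real mess-ratio analysis is far more sophisticated:

```python
def naive_detect(payload: bytes, candidates=("utf_8", "cp1252", "latin_1")):
    """Return (encoding, mess_ratio) for the cleanest decode, or None."""
    best = None
    for encoding in candidates:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue  # this table cannot represent the payload at all
        # Crude "mess ratio": share of unprintable, non-whitespace characters
        noisy = sum(1 for ch in text if not ch.isprintable() and ch not in "\r\n\t")
        mess = noisy / len(text) if text else 0.0
        if best is None or mess < best[1]:
            best = (encoding, mess)
    return best

print(naive_detect("héllo wörld".encode("utf_8")))  # → ('utf_8', 0.0)
```

Note that a hard decode failure eliminates a candidate outright, while successful decodes are ranked by noise; this two-tier filtering is the essence of the heuristic approach.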
### Detection Result Classes

Structured containers for charset detection results, providing detailed information about detected encodings, confidence levels, detected language, and text decoding.

```python { .api }
class CharsetMatch:
    encoding: str
    language: str
    chaos: float
    coherence: float
    def __str__(self) -> str: ...

class CharsetMatches:
    def best(self) -> CharsetMatch | None: ...
    def __getitem__(self, item) -> CharsetMatch: ...
```

[Detection Results](./detection-results.md)

### Legacy Compatibility

A chardet-compatible `detect()` function for easy, backward-compatible migration from chardet to charset-normalizer.

```python { .api }
def detect(byte_str, should_rename_legacy=False, **kwargs) -> dict: ...
```

[Legacy Compatibility](./legacy-compatibility.md)

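`detect()` wraps the modern API and returns a chardet-style dict with `encoding`, `language`, and `confidence` keys. The shape of that mapping can be sketched as follows — `to_chardet_dict` is our illustrative name, and the real function derives confidence from the match internals, not necessarily this exact formula:

```python
def to_chardet_dict(encoding, language, chaos):
    """Map a match to a chardet-style result dict (illustrative only)."""
    return {
        "encoding": encoding,            # None when nothing could be decoded
        "language": language or "",      # empty string when unknown
        "confidence": None if encoding is None else round(1.0 - chaos, 2),
    }

print(to_chardet_dict("utf_8", "English", 0.02))
# {'encoding': 'utf_8', 'language': 'English', 'confidence': 0.98}
```

Because the keys match chardet's output, existing call sites typically only need their import line changed.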
### CLI Interface

Command-line interface and programmatic CLI functions for charset detection, file processing, and interactive prompts. `cli_detect` parses an argv-style argument list and supports flags including `--alternatives`, `--normalize`, `--minimal`, `--replace`, `--force`, `--threshold`, and `--verbose`.

```python { .api }
from charset_normalizer.cli import cli_detect, query_yes_no

def cli_detect(argv: list[str] | None = None) -> int: ...
def query_yes_no(question: str, default: str = "yes") -> bool: ...
```

[CLI Interface](./cli-interface.md)

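From a shell, the same functionality is exposed through the `normalizer` entry point installed with the package. Flag names below follow the CLI's help output; file paths are placeholders:

```shell
# Detect the encoding of a file (prints a JSON report)
normalizer unknown_file.txt

# Show alternative candidate encodings as well
normalizer --alternatives unknown_file.txt

# Rewrite the file as UTF-8 next to the original
normalizer --normalize unknown_file.txt
```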
### Utility Functions

Logger configuration and version information utilities.

```python { .api }
def set_logging_handler(
    name: str = "charset_normalizer",
    level: int = logging.INFO,
    format_string: str = "%(asctime)s | %(levelname)s | %(message)s"
) -> None:
    """
    Configure a logger with a custom handler, level, and format.

    Parameters:
    - name: Logger name (default: "charset_normalizer")
    - level: Logging level (default: logging.INFO)
    - format_string: Log message format (default includes timestamp, level, and message)

    Returns:
    None

    Note: Sets up a StreamHandler with the specified configuration.
    """

__version__: str  # Package version string
VERSION: list[str]  # Version components as a list of strings
```

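What `set_logging_handler` does can be reproduced with the standard `logging` module, which is useful if you want the same output format without the helper. This is a stdlib-only equivalent with our own function name, not the library's code:

```python
import logging

def configure_logger(name="charset_normalizer",
                     level=logging.INFO,
                     format_string="%(asctime)s | %(levelname)s | %(message)s"):
    """Attach a StreamHandler with the given format, like set_logging_handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(format_string))
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger

log = configure_logger()
log.info("detection started")  # e.g. "2024-01-01 12:00:00,000 | INFO | detection started"
```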
## Types

```python { .api }
# Type aliases for language coherence data
CoherenceMatch = tuple[str, float]  # (language_name, coherence_score)
CoherenceMatches = list[CoherenceMatch]  # List of language matches

# Legacy detection result type (chardet compatibility)
from typing import TypedDict

class ResultDict(TypedDict):
    """Legacy detection result type for chardet compatibility."""
    encoding: str | None  # Detected encoding name, or None
    language: str  # Detected language, or an empty string
    confidence: float | None  # Confidence score (0.0-1.0), or None

# Imported types used in function signatures
from typing import BinaryIO, Iterator, Any
from os import PathLike
import logging
```