
# Legacy Compatibility

A chardet-compatible detection function that provides backward compatibility for applications migrating from chardet to charset-normalizer. It keeps the same API and return format as chardet while leveraging charset-normalizer's improved detection algorithms.

## Capabilities

### Chardet-Compatible Detection

Drop-in replacement for chardet.detect() with improved accuracy and performance while maintaining the same return format.

```python { .api }
def detect(
    byte_str: bytes,
    should_rename_legacy: bool = False,
    **kwargs: Any
) -> ResultDict:
    """
    Chardet-compatible charset detection function.

    Provides backward compatibility with the chardet API while using
    charset-normalizer's advanced detection algorithms. Maintained
    for migration purposes but not recommended for new projects.

    Parameters:
    - byte_str: Raw bytes to analyze for encoding detection
    - should_rename_legacy: Whether to rename legacy encodings to modern equivalents
    - **kwargs: Additional arguments (ignored with a warning, for compatibility)

    Returns:
    dict with keys:
    - 'encoding': str | None - Detected encoding name (chardet-compatible)
    - 'language': str - Detected language, or an empty string
    - 'confidence': float | None - Confidence score (0.0-1.0)

    Raises:
    TypeError: If byte_str is not bytes or bytearray

    Note: This function is deprecated for new code. Use from_bytes() instead.
    """
```

**Usage Example:**

```python
import charset_normalizer

# Basic chardet-compatible usage
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # "中文" (Chinese text, UTF-8 encoded)
result = charset_normalizer.detect(raw_data)

print(f"Encoding: {result['encoding']}")      # utf_8 or utf-8
print(f"Language: {result['language']}")      # Chinese or empty string
print(f"Confidence: {result['confidence']}")  # e.g. 0.99 (0.0-1.0 scale)

# Handle None results
if result['encoding']:
    try:
        decoded_text = raw_data.decode(result['encoding'])
        print(f"Text: {decoded_text}")
    except UnicodeDecodeError:
        print("Decoding failed despite detection")
else:
    print("No encoding detected")
```
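
The TypeError documented above is worth guarding against when input may not be binary; a minimal sketch:

```python
# detect() accepts only bytes or bytearray; str input raises TypeError
try:
    charset_normalizer.detect("not bytes")
except TypeError as exc:
    print(f"Invalid input: {exc}")
```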

### Migration from Chardet

Direct replacement patterns for common chardet usage:

```python
# Old chardet code:
# import chardet
# result = chardet.detect(raw_bytes)

# Direct replacement:
import charset_normalizer
result = charset_normalizer.detect(raw_bytes)

# Access the same result structure
encoding = result['encoding']
confidence = result['confidence']
language = result['language']
```
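
If call sites cannot be updated yet, one low-touch option is to alias the import, since the detect() entry point has the same shape (a sketch; only detect() is covered by this alias):

```python
# Alias charset_normalizer so existing chardet.detect() call sites keep working.
import charset_normalizer as chardet

result = chardet.detect("café".encode("latin-1"))  # sample payload
print(result)
```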

### Legacy Encoding Names

Control whether legacy encoding names are modernized:

```python
import charset_normalizer

raw_data = "café".encode("latin-1")  # sample payload

# Default (False) keeps chardet-style names
result = charset_normalizer.detect(raw_data, should_rename_legacy=False)
print(result['encoding'])  # may be 'ISO-8859-1' (chardet style)

# True renames legacy encodings to their modern equivalents
result = charset_normalizer.detect(raw_data, should_rename_legacy=True)
print(result['encoding'])  # name may differ from the chardet-style spelling
```
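
To see the effect on your own data, run both flag values side by side (a minimal sketch; the sample payloads are illustrative and the printed names depend on the library version and its heuristics):

```python
import charset_normalizer

# Illustrative sample payloads; substitute your own data.
samples = {
    "latin-1": "café".encode("latin-1"),
    "tis-620": "สวัสดี".encode("tis-620"),
}

for label, payload in samples.items():
    legacy = charset_normalizer.detect(payload, should_rename_legacy=False)
    modern = charset_normalizer.detect(payload, should_rename_legacy=True)
    print(f"{label}: {legacy['encoding']!r} vs {modern['encoding']!r}")
```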

## Compatibility Notes

### Return Format Differences

While the basic structure matches chardet, there are subtle differences:

```python
# Typical chardet result:
{
    'encoding': 'utf-8',
    'confidence': 0.99,
    'language': ''
}

# Typical charset-normalizer detect() result:
{
    'encoding': 'utf_8',     # Python-style codec name rather than 'utf-8'
    'confidence': 0.98,      # may differ due to improved algorithms
    'language': 'English'    # more comprehensive language detection
}
```
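
Both spellings refer to the same codec, so code that compares encoding names as raw strings should normalize them first; Python's codecs.lookup() gives a canonical name:

```python
import codecs

# codecs.lookup() normalizes aliases, so 'utf_8' and 'utf-8' compare equal:
assert codecs.lookup('utf_8').name == codecs.lookup('utf-8').name == 'utf-8'
```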

### BOM Handling

Charset-normalizer handles the BOM (Byte Order Mark) differently:

```python
import charset_normalizer

# UTF-8 with BOM
utf8_bom_data = b'\xef\xbb\xbfHello World'

# Chardet returns: 'UTF-8-SIG'
# Charset-normalizer detect() returns: 'utf_8_sig' (when a BOM is detected)
result = charset_normalizer.detect(utf8_bom_data)
print(result['encoding'])  # 'utf_8_sig'
```
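
Whichever spelling is reported, decoding with the utf_8_sig codec strips the BOM, while plain utf-8 keeps it as U+FEFF:

```python
utf8_bom_data = b'\xef\xbb\xbfHello World'

# 'utf_8_sig' consumes the BOM; plain 'utf-8' leaves it in the decoded text.
print(utf8_bom_data.decode('utf_8_sig'))  # 'Hello World'
print(utf8_bom_data.decode('utf-8'))      # '\ufeffHello World'
```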

### Confidence Scoring

Confidence calculation differs between the libraries:

```python
import charset_normalizer

raw_data = "café".encode("latin-1")  # sample payload

# For comparison with the modern API
modern_result = charset_normalizer.from_bytes(raw_data).best()
legacy_result = charset_normalizer.detect(raw_data)

if modern_result is not None:
    # Modern confidence (inverse of the chaos ratio)
    modern_confidence = 1.0 - modern_result.chaos

    # Legacy confidence (directly from detect)
    legacy_confidence = legacy_result['confidence']

    # Values may differ due to different calculation methods
    print(f"Modern: {modern_confidence:.3f}")
    print(f"Legacy: {legacy_confidence:.3f}")
```

## Migration Recommendations

### Gradual Migration Strategy

1. **Phase 1**: Direct replacement

```python
# Replace the import only
# from chardet import detect
from charset_normalizer import detect

# Keep existing code unchanged
result = detect(raw_bytes)
```

2. **Phase 2**: Enhanced error handling

```python
import charset_normalizer

def safe_detect(raw_bytes):
    """Enhanced wrapper with better error handling."""
    try:
        result = charset_normalizer.detect(raw_bytes)
        if result['encoding'] and (result['confidence'] or 0) > 0.7:
            return result
        # Fall back to the modern API for better results
        modern_result = charset_normalizer.from_bytes(raw_bytes).best()
        if modern_result:
            return {
                'encoding': modern_result.encoding,
                'confidence': 1.0 - modern_result.chaos,
                'language': modern_result.language,
            }
    except Exception:
        pass

    return {'encoding': None, 'confidence': None, 'language': ''}
```
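
A quick sanity check of the wrapper (the payload is an illustrative sample):

```python
# Illustrative call; any bytes payload works.
info = safe_detect("héllo wörld".encode("cp1252"))
print(info['encoding'], info['confidence'], info['language'])
```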

3. **Phase 3**: Modern API adoption

```python
import charset_normalizer

# Migrate to the modern API for new code
results = charset_normalizer.from_bytes(raw_bytes)
best = results.best()

if best:
    # More detailed information is available
    encoding = best.encoding
    confidence = 1.0 - best.chaos
    language = best.language
    alphabets = best.alphabets
    text = str(best)
```

### Performance Considerations

The legacy detect() delegates to the modern API internally, so timings are usually comparable; measure on your own data:

```python
import time

import charset_normalizer

large_data = ("Hello, World! " * 50_000).encode("utf-8")  # sample payload

# Legacy function (single result)
start = time.perf_counter()
result = charset_normalizer.detect(large_data)
legacy_time = time.perf_counter() - start

# Modern API (multiple candidates)
start = time.perf_counter()
results = charset_normalizer.from_bytes(large_data)
best = results.best()
modern_time = time.perf_counter() - start

# detect() wraps from_bytes().best(), so expect similar numbers;
# the modern API returns more information per call.
print(f"Legacy: {legacy_time:.4f}s, Modern: {modern_time:.4f}s")
```

### Debugging Legacy Issues

When migrating from chardet, note that the legacy entry point does not expose detailed logging:

```python
import charset_normalizer

# Extra keyword arguments such as explain=True are accepted for
# compatibility but ignored (with a warning) by the legacy detect():
result = charset_normalizer.detect(raw_data, explain=True)
```

For actual debugging, use the modern API:

```python
import charset_normalizer

# Better debugging with the modern API
results = charset_normalizer.from_bytes(raw_data, explain=True)
# explain=True logs the detection process in detail
```