
# Legacy Compatibility

A chardet-compatible detection function that provides backward compatibility for applications migrating from chardet to charset-normalizer. It keeps the same API and return format as chardet while leveraging charset-normalizer's improved detection algorithms.

## Capabilities

### Chardet-Compatible Detection

Drop-in replacement for chardet.detect() with improved accuracy and performance while maintaining the same return format.

```python { .api }
def detect(
    byte_str: bytes,
    should_rename_legacy: bool = False,
    **kwargs: Any
) -> ResultDict:
    """
    Chardet-compatible charset detection function.

    Provides backward compatibility with the chardet API while using
    charset-normalizer's advanced detection algorithms. Maintained
    for migration purposes but not recommended for new projects.

    Parameters:
    - byte_str: Raw bytes to analyze for encoding detection
    - should_rename_legacy: Whether to rename legacy encodings to modern equivalents
    - **kwargs: Additional arguments (ignored with a warning, for compatibility)

    Returns:
    dict with keys:
    - 'encoding': str | None - Detected encoding name (chardet-compatible)
    - 'language': str - Detected language, or an empty string
    - 'confidence': float | None - Confidence score (0.0-1.0)

    Raises:
    TypeError: If byte_str is not bytes or bytearray

    Note: This function is deprecated for new code. Use from_bytes() instead.
    """
```

**Usage Example:**

```python
import charset_normalizer

# Basic chardet-compatible usage
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # "中文" (Chinese text, UTF-8 encoded)
result = charset_normalizer.detect(raw_data)

print(f"Encoding: {result['encoding']}")      # utf_8 or utf-8
print(f"Language: {result['language']}")      # Chinese or empty string
print(f"Confidence: {result['confidence']}")  # e.g. 0.99 (0.0-1.0 scale)

# Handle None results
if result['encoding']:
    try:
        decoded_text = raw_data.decode(result['encoding'])
        print(f"Text: {decoded_text}")
    except UnicodeDecodeError:
        print("Decoding failed despite detection")
else:
    print("No encoding detected")
```
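
The TypeError documented above is worth guarding against when input may not be binary; a minimal sketch:

```python
# detect() accepts only bytes or bytearray; str input raises TypeError
try:
    charset_normalizer.detect("not bytes")
except TypeError as exc:
    print(f"Invalid input: {exc}")
```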

### Migration from Chardet

Direct replacement patterns for common chardet usage:

```python
# Old chardet code:
# import chardet
# result = chardet.detect(raw_bytes)

# Direct replacement:
import charset_normalizer
result = charset_normalizer.detect(raw_bytes)

# Access the same result structure
encoding = result['encoding']
confidence = result['confidence']
language = result['language']
```
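
If call sites cannot be updated yet, one low-touch option is to alias the import, since the detect() entry point has the same shape (a sketch; only detect() is covered by this alias):

```python
# Alias charset_normalizer so existing chardet.detect() call sites keep working.
import charset_normalizer as chardet

result = chardet.detect("café".encode("latin-1"))  # sample payload
print(result)
```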

### Legacy Encoding Names

Control whether legacy encoding names are modernized:

```python
import charset_normalizer

raw_data = "café".encode("latin-1")  # sample payload

# Default (False) keeps chardet-style names
result = charset_normalizer.detect(raw_data, should_rename_legacy=False)
print(result['encoding'])  # may be 'ISO-8859-1' (chardet style)

# True renames legacy encodings to their modern equivalents
result = charset_normalizer.detect(raw_data, should_rename_legacy=True)
print(result['encoding'])  # name may differ from the chardet-style spelling
```
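
To see the effect on your own data, run both flag values side by side (a minimal sketch; the sample payloads are illustrative and the printed names depend on the library version and its heuristics):

```python
import charset_normalizer

# Illustrative sample payloads; substitute your own data.
samples = {
    "latin-1": "café".encode("latin-1"),
    "tis-620": "สวัสดี".encode("tis-620"),
}

for label, payload in samples.items():
    legacy = charset_normalizer.detect(payload, should_rename_legacy=False)
    modern = charset_normalizer.detect(payload, should_rename_legacy=True)
    print(f"{label}: {legacy['encoding']!r} vs {modern['encoding']!r}")
```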

## Compatibility Notes

### Return Format Differences

While the basic structure matches chardet, there are subtle differences:

```python
# Typical chardet result:
{
    'encoding': 'utf-8',
    'confidence': 0.99,
    'language': ''
}

# Typical charset-normalizer detect() result:
{
    'encoding': 'utf_8',     # Python-style codec name rather than 'utf-8'
    'confidence': 0.98,      # may differ due to improved algorithms
    'language': 'English'    # more comprehensive language detection
}
```
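
Both spellings refer to the same codec, so code that compares encoding names as raw strings should normalize them first; Python's codecs.lookup() gives a canonical name:

```python
import codecs

# codecs.lookup() normalizes aliases, so 'utf_8' and 'utf-8' compare equal:
assert codecs.lookup('utf_8').name == codecs.lookup('utf-8').name == 'utf-8'
```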

### BOM Handling

Charset-normalizer handles the BOM (Byte Order Mark) differently:

```python
import charset_normalizer

# UTF-8 with BOM
utf8_bom_data = b'\xef\xbb\xbfHello World'

# Chardet returns: 'UTF-8-SIG'
# Charset-normalizer detect() returns: 'utf_8_sig' (when a BOM is detected)
result = charset_normalizer.detect(utf8_bom_data)
print(result['encoding'])  # 'utf_8_sig'
```
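
Whichever spelling is reported, decoding with the utf_8_sig codec strips the BOM, while plain utf-8 keeps it as U+FEFF:

```python
utf8_bom_data = b'\xef\xbb\xbfHello World'

# 'utf_8_sig' consumes the BOM; plain 'utf-8' leaves it in the decoded text.
print(utf8_bom_data.decode('utf_8_sig'))  # 'Hello World'
print(utf8_bom_data.decode('utf-8'))      # '\ufeffHello World'
```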

### Confidence Scoring

Confidence calculation differs between the libraries:

```python
import charset_normalizer

raw_data = "café".encode("latin-1")  # sample payload

# For comparison with the modern API
modern_result = charset_normalizer.from_bytes(raw_data).best()
legacy_result = charset_normalizer.detect(raw_data)

if modern_result is not None:
    # Modern confidence (inverse of the chaos ratio)
    modern_confidence = 1.0 - modern_result.chaos

    # Legacy confidence (directly from detect)
    legacy_confidence = legacy_result['confidence']

    # Values may differ due to different calculation methods
    print(f"Modern: {modern_confidence:.3f}")
    print(f"Legacy: {legacy_confidence:.3f}")
```

## Migration Recommendations

### Gradual Migration Strategy

1. **Phase 1**: Direct replacement

```python
# Replace the import only
# from chardet import detect
from charset_normalizer import detect

# Keep existing code unchanged
result = detect(raw_bytes)
```

2. **Phase 2**: Enhanced error handling

```python
import charset_normalizer

def safe_detect(raw_bytes):
    """Enhanced wrapper with better error handling."""
    try:
        result = charset_normalizer.detect(raw_bytes)
        if result['encoding'] and (result['confidence'] or 0) > 0.7:
            return result
        # Fall back to the modern API for better results
        modern_result = charset_normalizer.from_bytes(raw_bytes).best()
        if modern_result:
            return {
                'encoding': modern_result.encoding,
                'confidence': 1.0 - modern_result.chaos,
                'language': modern_result.language,
            }
    except Exception:
        pass

    return {'encoding': None, 'confidence': None, 'language': ''}
```
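
A quick sanity check of the wrapper (the payload is an illustrative sample):

```python
# Illustrative call; any bytes payload works.
info = safe_detect("héllo wörld".encode("cp1252"))
print(info['encoding'], info['confidence'], info['language'])
```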

3. **Phase 3**: Modern API adoption

```python
import charset_normalizer

# Migrate to the modern API for new code
results = charset_normalizer.from_bytes(raw_bytes)
best = results.best()

if best:
    # More detailed information is available
    encoding = best.encoding
    confidence = 1.0 - best.chaos
    language = best.language
    alphabets = best.alphabets
    text = str(best)
```

### Performance Considerations

The legacy detect() delegates to the modern API internally, so timings are usually comparable; measure on your own data:

```python
import time

import charset_normalizer

large_data = ("Hello, World! " * 50_000).encode("utf-8")  # sample payload

# Legacy function (single result)
start = time.perf_counter()
result = charset_normalizer.detect(large_data)
legacy_time = time.perf_counter() - start

# Modern API (multiple candidates)
start = time.perf_counter()
results = charset_normalizer.from_bytes(large_data)
best = results.best()
modern_time = time.perf_counter() - start

# detect() wraps from_bytes().best(), so expect similar numbers;
# the modern API returns more information per call.
print(f"Legacy: {legacy_time:.4f}s, Modern: {modern_time:.4f}s")
```

### Debugging Legacy Issues

When migrating from chardet, note that the legacy entry point does not expose detailed logging:

```python
import charset_normalizer

# Extra keyword arguments such as explain=True are accepted for
# compatibility but ignored (with a warning) by the legacy detect():
result = charset_normalizer.detect(raw_data, explain=True)
```

For actual debugging, use the modern API:

```python
import charset_normalizer

# Better debugging with the modern API
results = charset_normalizer.from_bytes(raw_data, explain=True)
# explain=True logs the detection process in detail
```