# Core Detection Functions

Primary charset detection methods that analyze raw bytes, file pointers, or file paths to determine character encoding. These functions form the core of charset-normalizer's detection capabilities and support extensive customization through parameters.

## Capabilities

### Bytes Detection

Detects character encoding from raw bytes or bytearray sequences using heuristic analysis.
```python { .api }
def from_bytes(
    sequences: bytes | bytearray,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.2,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from a raw bytes sequence.

    Parameters:
    - sequences: Raw bytes or bytearray to analyze
    - steps: Number of analysis steps (default: 5)
    - chunk_size: Size of data chunks for analysis (default: 512)
    - threshold: Mess ratio threshold above which an encoding is rejected (default: 0.2)
    - cp_isolation: List of encodings to test exclusively
    - cp_exclusion: List of encodings to exclude from testing
    - preemptive_behaviour: Enable BOM/signature priority detection (default: True)
    - explain: Enable detailed logging for debugging (default: False)
    - language_threshold: Minimum coherence for language detection (default: 0.1)
    - enable_fallback: Enable fallback to common encodings (default: True)

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    TypeError: If sequences is not bytes or bytearray
    """
```
**Usage Example:**

```python
import charset_normalizer

# Basic detection
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text in UTF-8
results = charset_normalizer.from_bytes(raw_data)
best_match = results.best()
print(f"Encoding: {best_match.encoding}")  # utf_8
print(f"Language: {best_match.language}")  # Chinese

# Advanced detection with custom parameters
results = charset_normalizer.from_bytes(
    raw_data,
    steps=10,                                  # More thorough analysis
    threshold=0.1,                             # Stricter mess threshold
    cp_isolation=['utf_8', 'gb2312', 'big5'],  # Test only Chinese encodings
    explain=True,                              # Enable debug logging
)
```
### File Pointer Detection

Detects character encoding from an open file pointer without closing it.
```python { .api }
def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from an open file pointer.

    Parameters:
    - fp: Open binary file pointer
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Note: Does not close the file pointer
    """
```
**Usage Example:**

```python
import charset_normalizer

with open('document.txt', 'rb') as fp:
    results = charset_normalizer.from_fp(fp)
    best_match = results.best()
    if best_match:
        print(f"File encoding: {best_match.encoding}")
    # File pointer remains open for further operations
```
### File Path Detection

Detects character encoding by opening and reading a file from its path.
```python { .api }
def from_path(
    path: str | bytes | PathLike,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from a file path.

    Parameters:
    - path: Path to the file (string, bytes, or PathLike object)
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    IOError: If the file cannot be opened or read
    """
```
**Usage Example:**

```python
import charset_normalizer
from pathlib import Path

# Using a string path
results = charset_normalizer.from_path('data/sample.txt')

# Using a Path object
file_path = Path('documents/report.csv')
results = charset_normalizer.from_path(file_path)

# With custom settings for CSV files
results = charset_normalizer.from_path(
    'data.csv',
    cp_isolation=['utf_8', 'iso-8859-1', 'windows-1252'],  # Common for CSV
    threshold=0.15,  # Slightly stricter for structured data
)
```
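Once a best match is found, the decoded text can be recovered directly: a `CharsetMatch` converts to `str` using its detected encoding. A minimal sketch on in-memory bytes (the payload here is illustrative, so the example is self-contained):

```python
import charset_normalizer

# cp1252-encoded text with accented characters
payload = "héllo wörld".encode("cp1252")

best = charset_normalizer.from_bytes(payload).best()
if best is not None:
    text = str(best)  # decodes with the detected encoding
    print(best.encoding, repr(text))
```

Note that the detected codec may be a byte-compatible sibling of the original (e.g. latin-1 instead of cp1252); for these bytes the decoded text is identical either way.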
### Binary Detection

Determines whether input data represents binary (non-text) content.
```python { .api }
def is_binary(
    fp_or_path_or_payload: PathLike | str | BinaryIO | bytes,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = False,
) -> bool:
    """
    Detect whether the input is binary (non-text) content.

    Parameters:
    - fp_or_path_or_payload: File path, file pointer, or raw bytes
    - Other parameters: Same as from_bytes (but enable_fallback defaults to False)

    Returns:
    bool: True if the content appears to be binary, False if text

    Note: Uses stricter criteria than text detection to avoid false positives
    """
```
**Usage Example:**

```python
import charset_normalizer

# Check whether a file is binary
if charset_normalizer.is_binary('image.jpg'):
    print("Binary file detected")
else:
    print("Text file detected")

# Check raw bytes
data = b'\x89PNG\r\n\x1a\n'  # PNG file header
if charset_normalizer.is_binary(data):
    print("Binary data")

# Check with a file pointer
with open('document.pdf', 'rb') as fp:
    if charset_normalizer.is_binary(fp):
        print("Binary document")
```
## Parameter Guidelines

### Performance Tuning

- **steps**: Higher values (7-10) improve accuracy; lower values (3-5) favor speed
- **chunk_size**: Larger chunks (1024-2048) suit large files; smaller chunks (256-512) suit small files
- **threshold**: Lower values (0.1-0.15) make detection stricter; higher values (0.25-0.3) make it more permissive
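The trade-off can be sketched as two passes over the same payload (the sample text and parameter values are illustrative, not recommendations):

```python
import charset_normalizer

# Accented UTF-8 text so detection has something non-trivial to analyze
payload = ("Griaß di, schöne Grüße aus Österreich! " * 50).encode("utf_8")

# Fast pass: fewer steps, bigger chunks, permissive threshold
quick = charset_normalizer.from_bytes(
    payload, steps=3, chunk_size=1024, threshold=0.3
)

# Thorough pass: more steps, smaller chunks, stricter threshold
thorough = charset_normalizer.from_bytes(
    payload, steps=10, chunk_size=256, threshold=0.1
)

print(quick.best().encoding, thorough.best().encoding)
```

For clean input both passes typically agree; the thorough settings mainly pay off on short or noisy payloads.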
### Encoding Control

- **cp_isolation**: Use when you know the likely encoding family (e.g., ['utf_8', 'utf_16'] for Unicode)
- **cp_exclusion**: Exclude encodings known to cause false positives for your data
- **preemptive_behaviour**: Set to False for pure heuristic analysis that ignores BOMs and signatures
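A sketch of both controls on one payload (the text and codec lists are illustrative):

```python
import charset_normalizer

payload = "Feliz año nuevo, señores".encode("cp1252")

# Test only a known family of Western European codecs
isolated = charset_normalizer.from_bytes(
    payload,
    cp_isolation=["utf_8", "cp1252", "iso-8859-1"],
)

# Or rule out a codec suspected of producing false positives
excluded = charset_normalizer.from_bytes(payload, cp_exclusion=["utf_16"])

best = isolated.best()
print(best.encoding if best else "no match")
```

Isolation shrinks the search space (faster, and impossible to match an out-of-family codec); exclusion keeps the full search but vetoes specific candidates.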
### Language Detection

- **language_threshold**: Lower values (0.05) detect languages more eagerly; higher values (0.2) reduce false positives
- **enable_fallback**: Keep True for safety in text detection; set False (the `is_binary` default) for stricter binary classification