# Core Detection Functions

Primary charset detection methods that analyze raw bytes, file pointers, or file paths to determine character encoding. These functions form the core of charset-normalizer's detection capabilities and support extensive customization through parameters.

## Capabilities

### Bytes Detection

Detects character encoding from raw bytes or bytearray sequences using heuristic analysis.
```python { .api }
def from_bytes(
    sequences: bytes | bytearray,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.2,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from a raw bytes sequence.

    Parameters:
    - sequences: Raw bytes or bytearray to analyze
    - steps: Number of analysis steps (default: 5)
    - chunk_size: Size of data chunks for analysis (default: 512)
    - threshold: Mess ratio threshold above which an encoding is rejected (default: 0.2)
    - cp_isolation: List of encodings to test exclusively
    - cp_exclusion: List of encodings to exclude from testing
    - preemptive_behaviour: Enable BOM/signature priority detection (default: True)
    - explain: Enable detailed logging for debugging (default: False)
    - language_threshold: Minimum coherence for language detection (default: 0.1)
    - enable_fallback: Enable fallback to common encodings (default: True)

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    TypeError: If sequences is not bytes or bytearray
    """
```
**Usage Example:**

```python
import charset_normalizer

# Basic detection
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text in UTF-8
results = charset_normalizer.from_bytes(raw_data)
best_match = results.best()
print(f"Encoding: {best_match.encoding}")  # utf_8
print(f"Language: {best_match.language}")  # Chinese

# Advanced detection with custom parameters
results = charset_normalizer.from_bytes(
    raw_data,
    steps=10,                                  # More thorough analysis
    threshold=0.1,                             # Stricter mess threshold
    cp_isolation=['utf_8', 'gb2312', 'big5'],  # Test only Chinese encodings
    explain=True,                              # Enable debug logging
)
```
### File Pointer Detection

Detects character encoding from an open file pointer without closing it.
```python { .api }
def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from an open file pointer.

    Parameters:
    - fp: Open binary file pointer
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Note: Does not close the file pointer
    """
```
**Usage Example:**

```python
import charset_normalizer

with open('document.txt', 'rb') as fp:
    results = charset_normalizer.from_fp(fp)
    best_match = results.best()
    if best_match:
        print(f"File encoding: {best_match.encoding}")
    # File pointer remains open for further operations
```
### File Path Detection

Detects character encoding by opening and reading a file from its path.
```python { .api }
def from_path(
    path: str | bytes | PathLike,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from a file path.

    Parameters:
    - path: Path to the file (string, bytes, or PathLike object)
    - Other parameters: Same as from_bytes

    Returns:
    CharsetMatches: Ordered collection of detection results

    Raises:
    IOError: If the file cannot be opened or read
    """
```
**Usage Example:**

```python
import charset_normalizer
from pathlib import Path

# Using a string path
results = charset_normalizer.from_path('data/sample.txt')

# Using a Path object
file_path = Path('documents/report.csv')
results = charset_normalizer.from_path(file_path)

# With custom settings for CSV files
results = charset_normalizer.from_path(
    'data.csv',
    cp_isolation=['utf_8', 'iso-8859-1', 'windows-1252'],  # Common for CSV
    threshold=0.15,  # Slightly stricter for structured data
)
```
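Once a best match is found, the decoded text can be recovered directly: a `CharsetMatch` converts to `str` using its detected encoding. A minimal sketch on in-memory bytes (the payload here is illustrative, so the example is self-contained):

```python
import charset_normalizer

# cp1252-encoded text with accented characters
payload = "héllo wörld".encode("cp1252")

best = charset_normalizer.from_bytes(payload).best()
if best is not None:
    text = str(best)  # decodes with the detected encoding
    print(best.encoding, repr(text))
```

Note that the detected codec may be a byte-compatible sibling of the original (e.g. latin-1 instead of cp1252); for these bytes the decoded text is identical either way.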
### Binary Detection

Determines whether input data represents binary (non-text) content.
```python { .api }
def is_binary(
    fp_or_path_or_payload: PathLike | str | BinaryIO | bytes,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = False,
) -> bool:
    """
    Detect whether the input is binary (non-text) content.

    Parameters:
    - fp_or_path_or_payload: File path, file pointer, or raw bytes
    - Other parameters: Same as from_bytes (but enable_fallback defaults to False)

    Returns:
    bool: True if the content appears to be binary, False if text

    Note: Uses stricter criteria than text detection to avoid false positives
    """
```
**Usage Example:**

```python
import charset_normalizer

# Check whether a file is binary
if charset_normalizer.is_binary('image.jpg'):
    print("Binary file detected")
else:
    print("Text file detected")

# Check raw bytes
data = b'\x89PNG\r\n\x1a\n'  # PNG file header
if charset_normalizer.is_binary(data):
    print("Binary data")

# Check with a file pointer
with open('document.pdf', 'rb') as fp:
    if charset_normalizer.is_binary(fp):
        print("Binary document")
```
## Parameter Guidelines

### Performance Tuning

- **steps**: Higher values (7-10) improve accuracy; lower values (3-5) favor speed
- **chunk_size**: Larger chunks (1024-2048) suit large files; smaller chunks (256-512) suit small files
- **threshold**: Lower values (0.1-0.15) make detection stricter; higher values (0.25-0.3) make it more permissive
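The trade-off can be sketched as two passes over the same payload (the sample text and parameter values are illustrative, not recommendations):

```python
import charset_normalizer

# Accented UTF-8 text so detection has something non-trivial to analyze
payload = ("Griaß di, schöne Grüße aus Österreich! " * 50).encode("utf_8")

# Fast pass: fewer steps, bigger chunks, permissive threshold
quick = charset_normalizer.from_bytes(
    payload, steps=3, chunk_size=1024, threshold=0.3
)

# Thorough pass: more steps, smaller chunks, stricter threshold
thorough = charset_normalizer.from_bytes(
    payload, steps=10, chunk_size=256, threshold=0.1
)

print(quick.best().encoding, thorough.best().encoding)
```

For clean input both passes typically agree; the thorough settings mainly pay off on short or noisy payloads.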
### Encoding Control

- **cp_isolation**: Use when you know the likely encoding family (e.g., ['utf_8', 'utf_16'] for Unicode)
- **cp_exclusion**: Exclude encodings known to cause false positives for your data
- **preemptive_behaviour**: Set to False for pure heuristic analysis that ignores BOMs and signatures
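A sketch of both controls on one payload (the text and codec lists are illustrative):

```python
import charset_normalizer

payload = "Feliz año nuevo, señores".encode("cp1252")

# Test only a known family of Western European codecs
isolated = charset_normalizer.from_bytes(
    payload,
    cp_isolation=["utf_8", "cp1252", "iso-8859-1"],
)

# Or rule out a codec suspected of producing false positives
excluded = charset_normalizer.from_bytes(payload, cp_exclusion=["utf_16"])

best = isolated.best()
print(best.encoding if best else "no match")
```

Isolation shrinks the search space (faster, and impossible to match an out-of-family codec); exclusion keeps the full search but vetoes specific candidates.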
### Language Detection

- **language_threshold**: Lower values (0.05) detect languages more eagerly; higher values (0.2) reduce false positives
- **enable_fallback**: Keep True for safety in text detection; set False (the `is_binary` default) for stricter binary classification