# Dialect Detection

CleverCSV detects CSV dialects automatically using pattern analysis and data consistency measures. The project reports 97% detection accuracy, a significant improvement over standard library approaches such as `csv.Sniffer` on messy CSV files.

## Capabilities

### Detector Class

The main dialect detection engine. It offers a modern interface (`detect`) alongside a `csv.Sniffer`-compatible interface (`sniff`, `has_header`).

```python { .api }
class Detector:
    """
    Detect CSV dialects using normal forms or data consistency measures.
    Provides a drop-in replacement for Python's csv.Sniffer.
    """

    def detect(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False,
        method: Union[DetectionMethod, str] = 'auto',
        skip: bool = True
    ) -> Optional[SimpleDialect]:
        """
        Detect the dialect of a CSV file sample.

        Parameters:
        - sample: Text sample from CSV file (entire file recommended for best results)
        - delimiters: Set of delimiters to consider (auto-detected if None)
        - verbose: Enable progress output
        - method: Detection method ('auto', 'normal', 'consistency')
        - skip: Skip low-scoring dialects in consistency detection

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def sniff(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False
    ) -> Optional[SimpleDialect]:
        """
        Compatibility method for Python csv.Sniffer interface.

        Parameters:
        - sample: Text sample from CSV file
        - delimiters: Set of delimiters to consider
        - verbose: Enable progress output

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def has_header(self, sample: str, max_rows_to_check: int = 20) -> bool:
        """
        Detect if a CSV sample has a header row.

        Parameters:
        - sample: Text sample from CSV file
        - max_rows_to_check: Maximum number of rows to analyze

        Returns:
        True if header row detected, False otherwise

        Raises:
        NoDetectionResult: If dialect detection fails
        """
```

### Detection Methods

CleverCSV supports multiple detection strategies that can be selected based on your needs and file characteristics.

```python { .api }
class DetectionMethod(str, Enum):
    """Available detection methods for dialect detection."""

    AUTO = 'auto'                # Try normal form first, then consistency
    NORMAL = 'normal'            # Normal form detection only
    CONSISTENCY = 'consistency'  # Data consistency measure only
```

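Because `DetectionMethod` mixes in `str`, a member and its plain string value can be used interchangeably, which is presumably why `method` accepts either. The sketch below demonstrates that behavior with a local stand-in enum that mirrors the definition above; it is not imported from CleverCSV, and `normalize_method` is a hypothetical helper used only for illustration.

```python
from enum import Enum


class DetectionMethod(str, Enum):
    """Local stand-in mirroring the enum above (not the library's own class)."""

    AUTO = "auto"
    NORMAL = "normal"
    CONSISTENCY = "consistency"


def normalize_method(method):
    """Hypothetical helper: accept a member or its string value.

    Raises ValueError for an unknown method name.
    """
    return DetectionMethod(method)


# A string and the matching member resolve to the same enum member.
print(normalize_method("normal") is DetectionMethod.NORMAL)  # True
# The str mixin makes members compare equal to their values.
print(DetectionMethod.AUTO == "auto")  # True
```

Normalizing once at the boundary lets the rest of the code compare against enum members and rejects typos such as `'fast'` early.
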
#### Usage Examples

```python
import clevercsv

# Basic detection with auto method
detector = clevercsv.Detector()
# newline='' keeps the original line endings in the sample
with open('data.csv', 'r', newline='') as f:
    sample = f.read()
dialect = detector.detect(sample)
print(f"Detected: {dialect}")

# Use specific detection method
dialect = detector.detect(sample, method='normal', verbose=True)

# Compatibility with csv.Sniffer
dialect = detector.sniff(sample)

# Check for header row
has_header = detector.has_header(sample)
print(f"Has header: {has_header}")

# Custom delimiters
custom_delims = [',', ';', '|', '\t']
dialect = detector.detect(sample, delimiters=custom_delims)
```

### Convenience Detection Function

High-level function for direct file-based dialect detection without manual file handling.

```python { .api }
def detect_dialect(
    filename: Union[str, PathLike],
    num_chars: Optional[int] = None,
    encoding: Optional[str] = None,
    verbose: bool = False,
    method: str = 'auto',
    skip: bool = True
) -> Optional[SimpleDialect]:
    """
    Detect the dialect of a CSV file.

    Parameters:
    - filename: Path to the CSV file
    - num_chars: Number of characters to read (entire file if None)
    - encoding: File encoding (auto-detected if None)
    - verbose: Enable progress output
    - method: Detection method ('auto', 'normal', 'consistency')
    - skip: Skip low-scoring dialects in consistency detection

    Returns:
    Detected SimpleDialect or None if detection failed
    """
```

#### Usage Examples

```python
import clevercsv

# Simple file-based detection
dialect = clevercsv.detect_dialect('data.csv')
if dialect:
    print(f"Delimiter: '{dialect.delimiter}'")
    print(f"Quote char: '{dialect.quotechar}'")
    print(f"Escape char: '{dialect.escapechar}'")

# Fast detection for large files
dialect = clevercsv.detect_dialect('large_file.csv', num_chars=50000)

# Verbose detection with specific method
dialect = clevercsv.detect_dialect(
    'messy_file.csv',
    method='consistency',
    verbose=True
)

# Custom encoding
dialect = clevercsv.detect_dialect('data.csv', encoding='latin-1')
```

## Detection Algorithms

### Normal Form Detection

The primary detection method. It analyzes patterns in row lengths and data types to identify the most likely dialect, and it is fast and highly accurate for well-structured CSV files.

**How it works:**
1. Tests potential dialects by parsing the sample
2. Analyzes row length consistency and data type patterns
3. Scores dialects based on structural regularity
4. Returns the highest-scoring consistent dialect

**Best for:**
- Regular CSV files with consistent structure
- Files with clear delimiters and quoting patterns
- Most common CSV formats

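To make the idea of scoring by structural regularity concrete, here is a toy sketch built on the standard `csv` module. It is not CleverCSV's algorithm: `row_length_regularity` and `best_delimiter` are hypothetical names, and the score only measures how many rows share the most common field count.

```python
import csv
import io
from collections import Counter


def row_length_regularity(sample, delimiter, quotechar='"'):
    """Toy score: fraction of rows that share the most common field count."""
    reader = csv.reader(io.StringIO(sample), delimiter=delimiter, quotechar=quotechar)
    rows = [row for row in reader if row]  # ignore blank lines
    if not rows:
        return 0.0
    modal_len, modal_count = Counter(len(row) for row in rows).most_common(1)[0]
    if modal_len < 2:
        return 0.0  # one column per row: this delimiter never split anything
    return modal_count / len(rows)


def best_delimiter(sample, candidates=(",", ";", "\t", "|")):
    """Return the candidate with the highest toy score, or None if all are 0."""
    scores = {d: row_length_regularity(sample, d) for d in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = "name;age;city\nAda;36;London\nAlan;41;Wilmslow\n"
print(best_delimiter(sample))  # ;
```

Row lengths alone can tie: a file such as `1,5;2,5;3` splits into three equal-length fields per row under both `,` and `;`. That ambiguity is one reason a detector needs additional signals such as data types.
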
### Consistency Measure

The fallback method, used when normal form detection is inconclusive. It scores candidate dialects by data consistency and is more robust for irregular or messy CSV files.

**How it works:**
1. Parses sample with different potential dialects
2. Measures data consistency within columns
3. Evaluates type consistency and pattern regularity
4. Selects dialect with highest consistency score

**Best for:**
- Messy or irregular CSV files
- Files with mixed data types
- Cases where normal form detection fails

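The type-consistency part of this idea can be illustrated with a small standard-library sketch. This is a simplified stand-in, not the library's scoring function: `TYPE_PATTERNS` and `type_score` are hypothetical, and the handful of regular expressions below is only an assumption about what a "recognizable" cell might look like.

```python
import csv
import io
import re

# Toy cell types, assumed for illustration only.
TYPE_PATTERNS = [
    re.compile(r""),                            # empty cell
    re.compile(r"[+-]?\d+(?:[.,]\d+)?"),        # integer or decimal number
    re.compile(r"\d{4}-\d{2}-\d{2}"),           # ISO date
    re.compile(r"[A-Za-z_][A-Za-z0-9_ ]*"),     # simple word or identifier
]


def type_score(sample, delimiter):
    """Toy score: fraction of cells that fully match a known type pattern."""
    reader = csv.reader(io.StringIO(sample), delimiter=delimiter)
    cells = [cell.strip() for row in reader for cell in row]
    if not cells:
        return 0.0
    known = sum(1 for c in cells if any(p.fullmatch(c) for p in TYPE_PATTERNS))
    return known / len(cells)


# Decimal commas: both "," and ";" give regular row lengths here,
# but only ";" yields cells that look like numbers.
sample = "1,5;2,5;3\n4,1;5,2;6\n"
print(type_score(sample, ";"))  # 1.0
print(type_score(sample, ",") < type_score(sample, ";"))  # True
```

With `,` as the delimiter, cells such as `5;2` match no pattern, so the score drops, and the ambiguity that row lengths cannot resolve disappears.
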
### Auto Detection

The default method that combines both approaches for optimal results:
1. Attempts normal form detection first (fast and accurate)
2. Falls back to consistency measure if needed (robust but slower)
3. Provides the best balance of speed and accuracy

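The try-fast-then-fall-back control flow can be sketched generically. The two detectors below are deliberately simplistic stand-ins, not CleverCSV's normal form or consistency detection, and all names (`DetectFn`, `strict_comma`, `most_frequent`, `detect_with_fallback`) are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A detector takes a text sample and returns a delimiter, or None on failure.
DetectFn = Callable[[str], Optional[str]]


def strict_comma(sample: str) -> Optional[str]:
    """Stand-in for a fast check: accepts only perfectly regular comma files."""
    counts = {line.count(",") for line in sample.splitlines() if line}
    return "," if len(counts) == 1 and counts != {0} else None


def most_frequent(sample: str, candidates: str = ",;|\t") -> Optional[str]:
    """Stand-in for a slower, more tolerant check: most frequent candidate."""
    best = max(candidates, key=sample.count)
    return best if sample.count(best) > 0 else None


def detect_with_fallback(
    sample: str, detectors: Sequence[Tuple[str, DetectFn]]
) -> Tuple[Optional[str], Optional[str]]:
    """Run detectors in order and return (name, result) of the first success."""
    for name, detect in detectors:
        result = detect(sample)
        if result is not None:
            return name, result
    return None, None


CHAIN = [("strict", strict_comma), ("lenient", most_frequent)]
print(detect_with_fallback("a,b\n1,2\n", CHAIN))    # ('strict', ',')
print(detect_with_fallback("a;b;c\n1;2\n", CHAIN))  # ('lenient', ';')
```

Returning the name of the detector that succeeded makes it easy to log how often the slower path is taken.
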
## Performance Optimization

### Sample Size Considerations

```python
# For speed on large files (may reduce accuracy)
dialect = clevercsv.detect_dialect('huge_file.csv', num_chars=10000)

# For maximum accuracy (slower on large files)
dialect = clevercsv.detect_dialect('file.csv')  # reads entire file

# Balanced approach for very large files
dialect = clevercsv.detect_dialect('file.csv', num_chars=100000)
```

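When you read a sample yourself and pass it to `Detector.detect`, a fixed character count can cut the last row in half, which adds one irregular row to the sample. The sketch below trims the sample back to the last complete line. `read_sample` is a hypothetical helper, not part of CleverCSV, and the default encoding is an assumption.

```python
def read_sample(path, num_chars, encoding="utf-8"):
    """Read up to num_chars characters, trimmed back to the last complete line."""
    # newline='' preserves the file's own line endings in the sample.
    with open(path, "r", encoding=encoding, newline="") as f:
        sample = f.read(num_chars)
        at_eof = f.read(1) == ""
    if at_eof:
        return sample  # the whole file fit, nothing was cut
    # Position of the last line-ending character in the sample, or -1.
    cut = max(sample.rfind("\n"), sample.rfind("\r"))
    # Without any line ending there is no complete row to trim to.
    return sample[: cut + 1] if cut >= 0 else sample
```

The result can then be passed to `detector.detect(sample)` as in the earlier examples. Whether trimming measurably improves detection depends on the file, so treat it as an optional precaution.
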
### Detection Method Selection

```python
# Fastest: normal form only (good for regular files)
dialect = clevercsv.detect_dialect('file.csv', method='normal')

# Most robust: consistency only (good for messy files)
dialect = clevercsv.detect_dialect('file.csv', method='consistency')

# Balanced: auto method (recommended default)
dialect = clevercsv.detect_dialect('file.csv', method='auto')
```

## Error Handling and Edge Cases

### Detection Failures

```python
import clevercsv

dialect = clevercsv.detect_dialect('problematic.csv')
if dialect is None:
    print("Detection failed - file may not be valid CSV")
    # Fallback options:
    # 1. Try with specific delimiters
    # 2. Use manual dialect specification
    # 3. Preprocess the file
```

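The second fallback option, manual dialect specification, can be sketched with the standard `csv` module alone. `PipeDialect` and `read_rows` are hypothetical names, and the pipe-separated format is only an assumed example of domain knowledge about the file.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Manually specified dialect for pipe-separated files (illustrative)."""

    delimiter = "|"
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


def read_rows(text, detected=None, fallback=PipeDialect):
    """Parse text with the detected dialect, or with the fallback if it is None."""
    dialect = detected if detected is not None else fallback
    return list(csv.reader(io.StringIO(text), dialect=dialect))


# Detection is assumed to have failed here, so the manual dialect is used.
print(read_rows('a|b\n"x|y"|2\n'))  # [['a', 'b'], ['x|y', '2']]
```

Keeping the fallback as an explicit parameter documents the assumption in one place instead of scattering hard-coded delimiters through the code.
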
### Header Detection Failures

```python
import clevercsv
# The exception class is defined in clevercsv.exceptions
from clevercsv.exceptions import NoDetectionResult

with open('data.csv', 'r', newline='') as f:
    sample = f.read()

try:
    detector = clevercsv.Detector()
    has_header = detector.has_header(sample)
except NoDetectionResult:
    print("Could not detect dialect for header analysis")
    # Fallback: assume no header or use domain knowledge
```

### Custom Delimiter Sets

```python
# For files with unusual delimiters
exotic_delims = ['|', '§', '¦', '•']
detector = clevercsv.Detector()
dialect = detector.detect(sample, delimiters=exotic_delims)
```

## Integration Patterns

### With Standard csv Module

```python
import clevercsv
import csv

# Detect with CleverCSV, use with standard csv
dialect = clevercsv.detect_dialect('data.csv')
if dialect is None:
    raise ValueError("Dialect detection failed for data.csv")
csv_dialect = dialect.to_csv_dialect()

# The csv module expects files opened with newline=''
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f, dialect=csv_dialect)
    data = list(reader)
```

### With Pandas

```python
import clevercsv
import pandas as pd

# Manual detection then pandas
dialect = clevercsv.detect_dialect('data.csv')
df = pd.read_csv('data.csv', dialect=dialect.to_csv_dialect())

# Or use CleverCSV's integrated function
df = clevercsv.read_dataframe('data.csv')  # Detection handled automatically
```