Tessl Tile for pypi/clevercsv@0.8.0

# Dialect Detection

CleverCSV detects CSV dialects automatically, using normal form pattern analysis and a data consistency measure. It identifies dialects with 97% accuracy, a significant improvement over standard library approaches on messy CSV files.

## Capabilities

### Detector Class

The main dialect detection engine. It offers both a modern interface (`detect`) and a `csv.Sniffer`-compatible interface (`sniff`, `has_header`).

```python { .api }
class Detector:
    """
    Detect CSV dialects using normal forms or data consistency measures.
    Provides a drop-in replacement for Python's csv.Sniffer.
    """

    def detect(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False,
        method: Union[DetectionMethod, str] = 'auto',
        skip: bool = True
    ) -> Optional[SimpleDialect]:
        """
        Detect the dialect of a CSV file sample.

        Parameters:
        - sample: Text sample from CSV file (entire file recommended for best results)
        - delimiters: Set of delimiters to consider (auto-detected if None)
        - verbose: Enable progress output
        - method: Detection method ('auto', 'normal', 'consistency')
        - skip: Skip low-scoring dialects in consistency detection

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def sniff(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False
    ) -> Optional[SimpleDialect]:
        """
        Compatibility method for Python csv.Sniffer interface.

        Parameters:
        - sample: Text sample from CSV file
        - delimiters: Set of delimiters to consider
        - verbose: Enable progress output

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def has_header(self, sample: str, max_rows_to_check: int = 20) -> bool:
        """
        Detect if a CSV sample has a header row.

        Parameters:
        - sample: Text sample from CSV file
        - max_rows_to_check: Maximum number of rows to analyze

        Returns:
        True if header row detected, False otherwise

        Raises:
        NoDetectionResult: If dialect detection fails
        """
```

### Detection Methods

CleverCSV supports multiple detection strategies that can be selected based on your needs and file characteristics.

```python { .api }
class DetectionMethod(str, Enum):
    """Available detection methods for dialect detection."""

    AUTO = 'auto'        # Try normal form first, then consistency
    NORMAL = 'normal'    # Normal form detection only
    CONSISTENCY = 'consistency'  # Data consistency measure only
```
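
Because the enum mixes in `str`, its members compare equal to their string values, which is why `method` can be given either way. The sketch below illustrates that behaviour with a local stand-in enum named `Method` and a hypothetical helper `normalize_method`; neither is imported from CleverCSV.

```python
from enum import Enum
from typing import Union


class Method(str, Enum):
    """Local stand-in mirroring the values listed above (not CleverCSV's class)."""

    AUTO = "auto"
    NORMAL = "normal"
    CONSISTENCY = "consistency"


def normalize_method(method: Union[Method, str]) -> Method:
    """Coerce a string or member to a member; unknown values raise ValueError."""
    return Method(method)


print(normalize_method("normal") is Method.NORMAL)  # True
print(Method.AUTO == "auto")                        # True

try:
    normalize_method("fastest")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Validating the value once at the boundary keeps a typo in a method name from surfacing later as a confusing failure.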

#### Usage Examples

```python
import clevercsv

# Basic detection with the auto method
detector = clevercsv.Detector()
with open('data.csv', 'r', newline='') as f:
    sample = f.read()

dialect = detector.detect(sample)
print(f"Detected: {dialect}")

# Use a specific detection method
dialect = detector.detect(sample, method='normal', verbose=True)

# Compatibility with csv.Sniffer
dialect = detector.sniff(sample)

# Check for a header row
has_header = detector.has_header(sample)
print(f"Has header: {has_header}")

# Restrict detection to a custom set of delimiters
custom_delims = [',', ';', '|', '\t']
dialect = detector.detect(sample, delimiters=custom_delims)
```

### Convenience Detection Function

High-level function for direct file-based dialect detection without manual file handling.

```python { .api }
def detect_dialect(
    filename: Union[str, PathLike],
    num_chars: Optional[int] = None,
    encoding: Optional[str] = None,
    verbose: bool = False,
    method: str = 'auto',
    skip: bool = True
) -> Optional[SimpleDialect]:
    """
    Detect the dialect of a CSV file.

    Parameters:
    - filename: Path to the CSV file
    - num_chars: Number of characters to read (entire file if None)
    - encoding: File encoding (auto-detected if None)
    - verbose: Enable progress output
    - method: Detection method ('auto', 'normal', 'consistency')
    - skip: Skip low-scoring dialects in consistency detection

    Returns:
    Detected SimpleDialect or None if detection failed
    """
```

#### Usage Examples

```python
import clevercsv

# Simple file-based detection
dialect = clevercsv.detect_dialect('data.csv')
if dialect:
    print(f"Delimiter: '{dialect.delimiter}'")
    print(f"Quote char: '{dialect.quotechar}'")
    print(f"Escape char: '{dialect.escapechar}'")

# Fast detection for large files
dialect = clevercsv.detect_dialect('large_file.csv', num_chars=50000)

# Verbose detection with specific method
dialect = clevercsv.detect_dialect(
    'messy_file.csv',
    method='consistency',
    verbose=True
)

# Custom encoding
dialect = clevercsv.detect_dialect('data.csv', encoding='latin-1')
```

## Detection Algorithms

### Normal Form Detection

The primary detection method checks whether the sample, read under a candidate dialect, matches one of a small set of regular layouts ("normal forms"). This method is fast and highly accurate for well-structured CSV files.

**How it works:**
1. Builds candidate dialects from the delimiters and quote characters under consideration
2. Checks the sample against each normal form, looking at row length consistency and quoting patterns
3. Returns the first dialect whose form matches
4. Reports no result if no form matches, so that another method can take over

**Best for:**
- Regular CSV files with consistent structure
- Files with clear delimiters and quoting patterns
- Most common CSV formats
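
To make the idea of matching a regular layout concrete, here is a deliberately simplified sketch. It tests only one hypothetical form (no quote characters, every row splitting into the same number of non-empty cells) and is not CleverCSV's implementation; the names `fits_simple_form` and `detect_simple_form` are made up for this example.

```python
from typing import Iterable, Optional


def fits_simple_form(sample: str, delimiter: str) -> bool:
    """True if the sample has no quotes and all rows have the same
    number (more than one) of non-empty cells under this delimiter."""
    if '"' in sample or "'" in sample:
        return False
    rows = [line.split(delimiter) for line in sample.splitlines() if line]
    if len(rows) < 2:
        return False
    width = len(rows[0])
    return width > 1 and all(len(r) == width and all(r) for r in rows)


def detect_simple_form(
    sample: str, delimiters: Iterable[str] = (",", ";", "\t", "|")
) -> Optional[str]:
    """Return the first delimiter whose layout matches, or None."""
    for delimiter in delimiters:
        if fits_simple_form(sample, delimiter):
            return delimiter
    return None


print(detect_simple_form("a;b;c\n1;2;3\n"))  # ;
print(detect_simple_form("a,b\n1,2,3\n"))    # None
```

A check like this is cheap because it either matches or it does not. There is no scoring step, which is also why it gives up on irregular files.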

### Consistency Measure

Fallback method that scores each candidate dialect by how consistent the parsed data is. It is used when normal form detection is inconclusive and is more robust for irregular or messy CSV files.

**How it works:**
1. Parses the sample with each potential dialect
2. Measures data consistency within columns
3. Evaluates type consistency and pattern regularity
4. Selects the dialect with the highest consistency score

**Best for:**
- Messy or irregular CSV files
- Files with mixed data types
- Cases where normal form detection fails
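
The scoring idea can be illustrated with a toy version: parse the sample under each delimiter, then multiply a row-length regularity score by the fraction of cells that look like clean values. This is a rough sketch and not CleverCSV's actual measure; `looks_typed`, `toy_consistency` and `best_delimiter` are hypothetical names, and the type check is intentionally crude.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Optional


def looks_typed(cell: str) -> bool:
    """Crude type check: empty, numeric, or plain alphanumeric text."""
    cell = cell.strip()
    if not cell:
        return True
    try:
        float(cell)
        return True
    except ValueError:
        return cell.replace(" ", "").isalnum()


def toy_consistency(sample: str, delimiter: str) -> float:
    """Row-length regularity multiplied by the share of clean-looking cells."""
    rows = [r for r in csv.reader(io.StringIO(sample), delimiter=delimiter) if r]
    if not rows:
        return 0.0
    width, count = Counter(len(r) for r in rows).most_common(1)[0]
    if width < 2:
        return 0.0  # a single column tells us nothing about the delimiter
    pattern = count / len(rows)
    cells = [c for r in rows for c in r]
    typed = sum(looks_typed(c) for c in cells) / len(cells)
    return pattern * typed


def best_delimiter(
    sample: str, delimiters: Iterable[str] = (",", ";", "|", "\t")
) -> Optional[str]:
    """Pick the delimiter with the highest toy score, or None if all are zero."""
    scores = {d: toy_consistency(sample, d) for d in delimiters}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = "name;score\nalice;3,5\nbob;4,0\n"
print(best_delimiter(sample))  # ;
```

In the sample, commas appear inside decimal numbers, yet splitting on `;` yields regular rows and cleaner cells, so it scores higher than splitting on `,`.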

### Auto Detection

The default method combines both approaches to give the best balance of speed and accuracy:
1. Attempts normal form detection first (fast and accurate)
2. Falls back to the consistency measure if needed (robust but slower)
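
The fallback order above is a plain "first non-empty result wins" chain. A minimal sketch, with two hypothetical stand-in detectors (`strict` and `tolerant`) in place of the real methods:

```python
from typing import Callable, Optional, Sequence


def first_result(
    sample: str, detectors: Sequence[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Run the detectors in order and return the first result that is not None."""
    for detect in detectors:
        result = detect(sample)
        if result is not None:
            return result
    return None


def strict(sample: str) -> Optional[str]:
    """Stand-in for a fast method that gives up on ambiguous input."""
    return "," if "," in sample and ";" not in sample else None


def tolerant(sample: str) -> Optional[str]:
    """Stand-in for a slower method that always commits to an answer."""
    return max(",;", key=sample.count)


print(first_result("a,b\n1,2\n", [strict, tolerant]))    # ,
print(first_result("a;b,c\n1;2\n", [strict, tolerant]))  # ;
```

Putting the cheap, strict method first means the expensive one runs only on the inputs that need it.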

## Performance Optimization

### Sample Size Considerations

```python
# For speed on large files (may reduce accuracy)
dialect = clevercsv.detect_dialect('huge_file.csv', num_chars=10000)

# For maximum accuracy (slower on large files)
dialect = clevercsv.detect_dialect('file.csv')  # reads entire file

# Balanced approach for very large files
dialect = clevercsv.detect_dialect('file.csv', num_chars=100000)
```
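
When you build a sample by hand for `Detector.detect`, a character limit can cut the last row in half. The sketch below reads a bounded sample and trims it back to the last line break. Both helpers (`read_sample`, `trim_to_last_newline`) are hypothetical, and whether CleverCSV does any such trimming internally is not stated here.

```python
import os
import tempfile
from typing import Optional


def read_sample(path: str, num_chars: Optional[int] = None, encoding: str = "utf-8") -> str:
    """Read the whole file, or only its first num_chars characters."""
    with open(path, "r", newline="", encoding=encoding) as f:
        return f.read() if num_chars is None else f.read(num_chars)


def trim_to_last_newline(sample: str) -> str:
    """Drop a trailing partial row so the sample ends on a line break."""
    cut = sample.rfind("\n")
    return sample if cut == -1 else sample[: cut + 1]


# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write("a,b\n1,2\n3,4\n")
    full = read_sample(path)
    head = read_sample(path, num_chars=6)

print(repr(head))                        # 'a,b\n1,'
print(repr(trim_to_last_newline(head)))  # 'a,b\n'
```

Note that this simple trim can still split a quoted field that contains a line break, so it is a precaution rather than a guarantee.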

### Detection Method Selection

```python
# Fastest: normal form only (good for regular files)
dialect = clevercsv.detect_dialect('file.csv', method='normal')

# Most robust: consistency only (good for messy files)
dialect = clevercsv.detect_dialect('file.csv', method='consistency')

# Balanced: auto method (recommended default)
dialect = clevercsv.detect_dialect('file.csv', method='auto')
```

## Error Handling and Edge Cases

### Detection Failures

```python
import clevercsv

dialect = clevercsv.detect_dialect('problematic.csv')
if dialect is None:
    print("Detection failed - file may not be valid CSV")
    # Fallback options:
    # 1. Try with specific delimiters
    # 2. Use manual dialect specification
    # 3. Preprocess the file
```
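
The "manual dialect specification" fallback can be sketched with the standard `csv` module alone. Here `read_rows` takes any detector callable that returns a `csv`-compatible dialect or `None`. The `PipeDialect` class and the `read_rows` helper are assumptions made for this example; with CleverCSV, the callable would wrap `Detector().detect(...)` followed by `to_csv_dialect()`.

```python
import csv
import io
from typing import Callable, List, Optional, Type


class PipeDialect(csv.Dialect):
    """Hand-written fallback dialect (an assumption for this example)."""

    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


def read_rows(
    sample: str,
    detect: Callable[[str], Optional[Type[csv.Dialect]]],
    fallback: Type[csv.Dialect] = PipeDialect,
) -> List[List[str]]:
    """Parse with the detected dialect, or with the fallback if detection returns None."""
    dialect = detect(sample)
    if dialect is None:
        dialect = fallback
    return list(csv.reader(io.StringIO(sample), dialect=dialect))


# A detector that always fails, to exercise the fallback path
rows = read_rows("id|name\n1|Ada\n", lambda s: None)
print(rows)  # [['id', 'name'], ['1', 'Ada']]
```

Keeping the fallback as an explicit parameter makes the assumption visible at the call site instead of burying it in the error branch.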

### Header Detection Failures

```python
import clevercsv
from clevercsv.exceptions import NoDetectionResult

# `sample` holds the CSV text, read as in the earlier examples
detector = clevercsv.Detector()
try:
    has_header = detector.has_header(sample)
except NoDetectionResult:
    print("Could not detect dialect for header analysis")
    has_header = False  # Fallback: assume no header or use domain knowledge
```

### Custom Delimiter Sets

```python
import clevercsv

# For files with unusual delimiters
# (`sample` holds the CSV text, read as in the earlier examples)
exotic_delims = ['|', '§', '¦', '•']
detector = clevercsv.Detector()
dialect = detector.detect(sample, delimiters=exotic_delims)
```

## Integration Patterns

### With Standard csv Module

```python
import clevercsv
import csv

# Detect with CleverCSV, then read with the standard csv module
dialect = clevercsv.detect_dialect('data.csv')
if dialect is None:
    raise ValueError("Could not detect the dialect of data.csv")
csv_dialect = dialect.to_csv_dialect()

with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f, dialect=csv_dialect)
    data = list(reader)
```
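
To show what a conversion to a standard `csv` dialect involves, here is a sketch that builds a `csv.Dialect` subclass from the three attributes a detected dialect exposes. `DetectedDialect` and `as_csv_dialect` are hypothetical stand-ins, not CleverCSV's code, and the sketch assumes that an empty string means "no quote character" or "no escape character".

```python
import csv
import io
from dataclasses import dataclass
from typing import Type


@dataclass
class DetectedDialect:
    """Hypothetical stand-in holding delimiter, quotechar and escapechar."""

    delimiter: str
    quotechar: str
    escapechar: str


def as_csv_dialect(d: DetectedDialect) -> Type[csv.Dialect]:
    """Build a csv.Dialect subclass; empty strings are treated as 'not used'."""

    class _Dialect(csv.Dialect):
        delimiter = d.delimiter
        quotechar = d.quotechar or '"'
        escapechar = d.escapechar or None
        doublequote = True
        # Without a quote character, turn quote handling off entirely
        quoting = csv.QUOTE_MINIMAL if d.quotechar else csv.QUOTE_NONE
        skipinitialspace = False
        lineterminator = "\n"

    return _Dialect


semicolon = as_csv_dialect(DetectedDialect(";", '"', ""))
rows = list(csv.reader(io.StringIO('a;"b;c"\n1;2\n'), dialect=semicolon))
print(rows)  # [['a', 'b;c'], ['1', '2']]
```

The standard module rejects an empty quote or escape character, which is why the empty strings have to be mapped onto `QUOTE_NONE` and `None` before the dialect can be used.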

### With Pandas

```python
import clevercsv
import pandas as pd

# Manual detection then pandas
dialect = clevercsv.detect_dialect('data.csv')
df = pd.read_csv('data.csv', dialect=dialect.to_csv_dialect())

# Or use CleverCSV's integrated function
df = clevercsv.read_dataframe('data.csv')  # Detection handled automatically
```
