# Dialect Detection

CleverCSV detects CSV dialects automatically using pattern analysis and data consistency measures. It identifies dialects with a reported 97% accuracy, a significant improvement over the standard library's approach on messy CSV files.

## Capabilities

### Detector Class

`Detector` is the main dialect detection engine. It provides a modern interface (`detect`) and a compatibility interface (`sniff`, `has_header`) modeled on `csv.Sniffer`.

```python { .api }
class Detector:
    """
    Detect CSV dialects using normal forms or data consistency measures.
    Provides a drop-in replacement for Python's csv.Sniffer.
    """

    def detect(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False,
        method: Union[DetectionMethod, str] = 'auto',
        skip: bool = True
    ) -> Optional[SimpleDialect]:
        """
        Detect the dialect of a CSV file sample.

        Parameters:
        - sample: Text sample from CSV file (entire file recommended for best results)
        - delimiters: Set of delimiters to consider (auto-detected if None)
        - verbose: Enable progress output
        - method: Detection method ('auto', 'normal', 'consistency')
        - skip: Skip low-scoring dialects in consistency detection

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def sniff(
        self,
        sample: str,
        delimiters: Optional[Iterable[str]] = None,
        verbose: bool = False
    ) -> Optional[SimpleDialect]:
        """
        Compatibility method for Python csv.Sniffer interface.

        Parameters:
        - sample: Text sample from CSV file
        - delimiters: Set of delimiters to consider
        - verbose: Enable progress output

        Returns:
        Detected SimpleDialect or None if detection failed
        """

    def has_header(self, sample: str, max_rows_to_check: int = 20) -> bool:
        """
        Detect if a CSV sample has a header row.

        Parameters:
        - sample: Text sample from CSV file
        - max_rows_to_check: Maximum number of rows to analyze

        Returns:
        True if header row detected, False otherwise

        Raises:
        NoDetectionResult: If dialect detection fails
        """
```
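
For comparison, here is a minimal sketch of the standard-library interface that `sniff` and `has_header` mirror. It uses only `csv.Sniffer`, and the sample text is made up for illustration. Going by the listing above, a `clevercsv.Detector` instance should be usable in place of `csv.Sniffer` here.

```python
import csv

# Made-up semicolon-separated sample, used only for illustration.
sample = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\nCarol;41;Madrid\n"

sniffer = csv.Sniffer()

# sniff() returns a csv.Dialect subclass describing the detected format.
stdlib_dialect = sniffer.sniff(sample)
print(f"Delimiter: {stdlib_dialect.delimiter!r}")

# has_header() applies a heuristic to the first rows of the sample.
header_found = sniffer.has_header(sample)
print(f"Has header: {header_found}")
```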

### Detection Methods

CleverCSV supports multiple detection strategies that can be selected based on your needs and file characteristics.

```python { .api }
class DetectionMethod(str, Enum):
    """Available detection methods for dialect detection."""

    AUTO = 'auto'  # Try normal form first, then consistency
    NORMAL = 'normal'  # Normal form detection only
    CONSISTENCY = 'consistency'  # Data consistency measure only
```
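
Because the enum mixes in `str`, a plain string and an enum member can be used interchangeably, which is presumably why `method` accepts either. The sketch below re-declares the enum locally so it runs without CleverCSV installed; `normalize_method` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class DetectionMethod(str, Enum):
    """Local stand-in that mirrors the listing above."""

    AUTO = 'auto'
    NORMAL = 'normal'
    CONSISTENCY = 'consistency'


def normalize_method(method):
    """Hypothetical helper: coerce a string or enum member to a member."""
    # Lookup by value; raises ValueError for an unknown method name.
    return DetectionMethod(method)


# A string and a member resolve to the same enum member.
print(normalize_method('normal') is DetectionMethod.NORMAL)
print(normalize_method(DetectionMethod.AUTO) is DetectionMethod.AUTO)

# The str mixin makes members compare equal to their string value.
print(DetectionMethod.CONSISTENCY == 'consistency')
```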

#### Usage Examples

```python
import clevercsv

# Basic detection with auto method
detector = clevercsv.Detector()
# newline='' leaves line endings untouched for the detector
with open('data.csv', 'r', newline='') as f:
    sample = f.read()
dialect = detector.detect(sample)
print(f"Detected: {dialect}")

# Use specific detection method
dialect = detector.detect(sample, method='normal', verbose=True)

# Compatibility with csv.Sniffer
dialect = detector.sniff(sample)

# Check for header row
has_header = detector.has_header(sample)
print(f"Has header: {has_header}")

# Custom delimiters
custom_delims = [',', ';', '|', '\t']
dialect = detector.detect(sample, delimiters=custom_delims)
```

### Convenience Detection Function

`detect_dialect` is a high-level function that detects the dialect directly from a file path, so you do not need to open and read the file yourself.

```python { .api }
def detect_dialect(
    filename: Union[str, PathLike],
    num_chars: Optional[int] = None,
    encoding: Optional[str] = None,
    verbose: bool = False,
    method: str = 'auto',
    skip: bool = True
) -> Optional[SimpleDialect]:
    """
    Detect the dialect of a CSV file.

    Parameters:
    - filename: Path to the CSV file
    - num_chars: Number of characters to read (entire file if None)
    - encoding: File encoding (auto-detected if None)
    - verbose: Enable progress output
    - method: Detection method ('auto', 'normal', 'consistency')
    - skip: Skip low-scoring dialects in consistency detection

    Returns:
    Detected SimpleDialect or None if detection failed
    """
```
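
To make the `num_chars` parameter concrete, the sketch below shows one way the file-reading step could work. `read_sample` is a hypothetical helper written for illustration, not CleverCSV's implementation. It also skips encoding auto-detection: passing `encoding=None` to `open` falls back to the platform default.

```python
import os
import tempfile
from typing import Optional


def read_sample(filename, num_chars: Optional[int] = None,
                encoding: Optional[str] = None) -> str:
    """Hypothetical helper: read at most num_chars characters from a file."""
    # newline="" keeps line endings untouched so a detector sees the raw text.
    with open(filename, "r", newline="", encoding=encoding) as fp:
        return fp.read() if num_chars is None else fp.read(num_chars)


# Demonstrate on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", newline="", encoding="utf-8") as fp:
        fp.write("a,b,c\n1,2,3\n4,5,6\n")

    full_sample = read_sample(path, encoding="utf-8")
    short_sample = read_sample(path, num_chars=5, encoding="utf-8")

print(repr(full_sample))
print(repr(short_sample))
```

A truncated sample may end in the middle of a row, which is one reason a small `num_chars` can reduce detection accuracy.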

#### Usage Examples

```python
import clevercsv

# Simple file-based detection
dialect = clevercsv.detect_dialect('data.csv')
if dialect:
    print(f"Delimiter: '{dialect.delimiter}'")
    print(f"Quote char: '{dialect.quotechar}'")
    print(f"Escape char: '{dialect.escapechar}'")

# Fast detection for large files
dialect = clevercsv.detect_dialect('large_file.csv', num_chars=50000)

# Verbose detection with specific method
dialect = clevercsv.detect_dialect(
    'messy_file.csv',
    method='consistency',
    verbose=True
)

# Custom encoding
dialect = clevercsv.detect_dialect('data.csv', encoding='latin-1')
```

## Detection Algorithms

### Normal Form Detection

Normal form detection is the primary method. It analyzes patterns in row lengths and data types to identify the most likely dialect, and it is fast and highly accurate for well-structured CSV files.

**How it works:**

1. Tests potential dialects by parsing the sample
2. Analyzes row length consistency and data type patterns
3. Scores dialects based on structural regularity
4. Returns the highest-scoring consistent dialect

**Best for:**

- Regular CSV files with consistent structure
- Files with clear delimiters and quoting patterns
- Most common CSV formats
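
The idea of scoring candidates by structural regularity can be illustrated with a toy example. This is a simplified sketch using only the standard library, not CleverCSV's actual algorithm; `row_length_score` and `pick_delimiter` are hypothetical names, and the sample data is made up.

```python
import csv
import io
from collections import Counter


def row_length_score(sample: str, delimiter: str) -> float:
    """Toy score: share of rows that have the most common field count."""
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    rows = [row for row in rows if row]  # ignore blank lines
    if not rows:
        return 0.0
    counts = Counter(len(row) for row in rows)
    width, frequency = counts.most_common(1)[0]
    # A single column means the delimiter never split anything.
    if width < 2:
        return 0.0
    return frequency / len(rows)


def pick_delimiter(sample: str, candidates=(",", ";", "\t", "|")):
    """Return the best-scoring candidate, or None if nothing splits the rows."""
    scores = {d: row_length_score(sample, d) for d in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Semicolon-separated data with decimal commas: ',' splits rows unevenly.
sample = "id;name;price\n1;Widget;9,99\n2;Gadget;19,50\n3;Gizmo;4,25\n"
print(pick_delimiter(sample))
```

Here the semicolon yields three fields on every row, while the comma splits only the data rows, so the semicolon scores higher.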

### Consistency Measure

The consistency measure is the fallback method, used when normal form detection is inconclusive. It scores candidate dialects by data consistency and is more robust for irregular or messy CSV files.

**How it works:**

1. Parses sample with different potential dialects
2. Measures data consistency within columns
3. Evaluates type consistency and pattern regularity
4. Selects dialect with highest consistency score

**Best for:**

- Messy or irregular CSV files
- Files with mixed data types
- Cases where normal form detection fails
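
As a rough illustration of the type-consistency idea in step 3, the sketch below scores a candidate delimiter by the share of cells that look like a recognizable type. It is a deliberately simplified assumption, not CleverCSV's scoring; `KNOWN_TYPES` and `type_score` are hypothetical names, and the pattern list is far shorter than a real detector would need.

```python
import csv
import io
import re

# Simplified type patterns; a real detector would recognise many more types.
KNOWN_TYPES = [
    re.compile(r"^$"),                      # empty cell
    re.compile(r"^[+-]?\d+$"),              # integer
    re.compile(r"^[+-]?\d*\.\d+$"),         # decimal number
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),     # ISO date
    re.compile(r"^[A-Za-z][A-Za-z _-]*$"),  # plain word(s)
]


def type_score(sample: str, delimiter: str) -> float:
    """Toy score: share of cells whose content matches a known type."""
    cells = [
        cell.strip()
        for row in csv.reader(io.StringIO(sample), delimiter=delimiter)
        for cell in row
    ]
    if not cells:
        return 0.0
    typed = sum(any(p.match(cell) for p in KNOWN_TYPES) for cell in cells)
    return typed / len(cells)


# Made-up pipe-separated sample; '-' is a tempting but wrong delimiter.
sample = "date|city|temp\n2024-01-05|Oslo|-3.5\n2024-01-06|Rome|12.0\n"
scores = {d: type_score(sample, d) for d in (",", "|", "-")}
print(max(scores, key=scores.get))
```

Splitting on the wrong character produces cells such as `05|Oslo|` that match no known type, which lowers the score for that candidate.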

### Auto Detection

The default `auto` method combines both approaches:

1. Attempts normal form detection first (fast and accurate)
2. Falls back to consistency measure if needed (robust but slower)

This provides the best balance of speed and accuracy.
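
The fallback order described above can be sketched as a chain of strategies tried in sequence. The two strategies here are toy stand-ins invented for illustration (they return a delimiter string rather than a dialect object) and do not reflect how CleverCSV implements either method.

```python
from typing import Callable, Iterable, Optional


def detect_with_fallback(
    sample: str,
    strategies: Iterable[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Return the first non-None result from an ordered list of strategies."""
    for strategy in strategies:
        result = strategy(sample)
        if result is not None:
            return result
    return None


def fast_strict(sample: str) -> Optional[str]:
    """Stand-in for a fast method: succeed only on perfectly regular commas."""
    lines = [line for line in sample.splitlines() if line]
    counts = {line.count(",") for line in lines}
    return "," if len(counts) == 1 and counts != {0} else None


def slow_robust(sample: str) -> Optional[str]:
    """Stand-in for a robust method: pick the most frequent candidate."""
    candidates = [",", ";", "|", "\t"]
    best = max(candidates, key=sample.count)
    return best if sample.count(best) > 0 else None


regular = "a,b\n1,2\n3,4\n"
messy = "a;b;c\n1;2\n3;4;5,5\n"
print(detect_with_fallback(regular, [fast_strict, slow_robust]))
print(detect_with_fallback(messy, [fast_strict, slow_robust]))
```

The regular sample is resolved by the first strategy, while the messy one is only resolved after falling through to the second.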

## Performance Optimization

### Sample Size Considerations

```python
import clevercsv

# For speed on large files (may reduce accuracy)
dialect = clevercsv.detect_dialect('huge_file.csv', num_chars=10000)

# For maximum accuracy (slower on large files)
dialect = clevercsv.detect_dialect('file.csv')  # reads entire file

# Balanced approach for very large files
dialect = clevercsv.detect_dialect('file.csv', num_chars=100000)
```

### Detection Method Selection

```python
import clevercsv

# Fastest: normal form only (good for regular files)
dialect = clevercsv.detect_dialect('file.csv', method='normal')

# Most robust: consistency only (good for messy files)
dialect = clevercsv.detect_dialect('file.csv', method='consistency')

# Balanced: auto method (recommended default)
dialect = clevercsv.detect_dialect('file.csv', method='auto')
```

## Error Handling and Edge Cases

### Detection Failures

```python
import clevercsv

dialect = clevercsv.detect_dialect('problematic.csv')
if dialect is None:
    print("Detection failed - file may not be valid CSV")
    # Fallback options:
    # 1. Try with specific delimiters
    # 2. Use manual dialect specification
    # 3. Preprocess the file
```
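
As one possible form of fallback option 2, a dialect can be specified by hand with the standard `csv` module. This is a minimal sketch under the assumption that the file is known to be pipe-separated; `PipeDialect` and the sample text are made up for illustration.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Manually specified dialect for pipe-separated text (illustrative)."""

    delimiter = "|"
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


# The quoted field contains the delimiter and must stay in one cell.
text = 'id|label\n1|"a|b"\n2|plain\n'
rows = list(csv.reader(io.StringIO(text), dialect=PipeDialect))
print(rows)
```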

### Header Detection Failures

```python
import clevercsv

# `sample` is CSV text read earlier, e.g. with f.read()
try:
    detector = clevercsv.Detector()
    has_header = detector.has_header(sample)
except clevercsv.NoDetectionResult:
    print("Could not detect dialect for header analysis")
    # Fallback: assume no header or use domain knowledge
    has_header = False
```

### Custom Delimiter Sets

```python
import clevercsv

# For files with unusual delimiters
# (`sample` is CSV text read earlier, e.g. with f.read())
exotic_delims = ['|', '§', '¦', '•']
detector = clevercsv.Detector()
dialect = detector.detect(sample, delimiters=exotic_delims)
```

## Integration Patterns

### With Standard csv Module

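```python
import clevercsv
import csv

# Detect with CleverCSV, use with standard csv
dialect = clevercsv.detect_dialect('data.csv')
if dialect is None:
    # detect_dialect returns None when detection fails
    raise ValueError("Could not detect the dialect of data.csv")
csv_dialect = dialect.to_csv_dialect()

# newline='' is what the csv module expects for file objects
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f, dialect=csv_dialect)
    data = list(reader)
```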

### With Pandas

```python
import clevercsv
import pandas as pd

# Manual detection then pandas
dialect = clevercsv.detect_dialect('data.csv')
df = pd.read_csv('data.csv', dialect=dialect.to_csv_dialect())

# Or use CleverCSV's integrated function
df = clevercsv.read_dataframe('data.csv')  # Detection handled automatically
```