# File I/O and Encoding

File encoding detection, line ending handling, and file opening utilities for robust text file processing across different encodings and platforms. This module ensures docformatter can handle files with various encodings and line ending conventions.

## Capabilities

### Encoder Class

The main class for handling file encoding detection and file I/O operations.

```python { .api }
class Encoder:
    """Encoding and decoding of files."""

    # Line ending constants
    CR = "\r"      # Carriage return (Mac classic)
    LF = "\n"      # Line feed (Unix/Linux)
    CRLF = "\r\n"  # Carriage return + line feed (Windows)

    # Default encoding
    DEFAULT_ENCODING = "latin-1"

    def __init__(self):
        """
        Initialize an Encoder instance.

        Sets up encoding detection with the default fallback encoding
        and system encoding detection.
        """

    # Instance attributes after initialization
    encoding: str         # Current detected/set file encoding
    system_encoding: str  # System preferred encoding
```
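
For orientation, the constants and attributes documented above can be inspected directly; the import path matches the usage examples later on this page:

```python
from docformatter import Encoder

encoder = Encoder()

# Class-level constants are plain strings.
print(repr(Encoder.LF), repr(Encoder.CRLF), repr(Encoder.CR))  # '\n' '\r\n' '\r'
print(Encoder.DEFAULT_ENCODING)                                # latin-1

# Instance attributes described above.
print(encoder.encoding, encoder.system_encoding)
```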

### Encoding Detection

Methods for detecting and working with file encodings.

```python { .api }
def do_detect_encoding(self, filename) -> None:
    """
    Detect and set the encoding for a file.

    Uses the charset_normalizer library to detect the file encoding with
    high accuracy. Falls back to DEFAULT_ENCODING if detection fails.

    Args:
        filename (str): Path to file for encoding detection

    Side Effects:
        Sets self.encoding to the detected encoding
    """
```
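
Since the docstring above names charset_normalizer as the detection backend, a standalone sketch of that kind of detection may help; this is an illustrative, hypothetical helper with a latin-1 fallback, not the method's actual implementation:

```python
from charset_normalizer import from_path

def detect_encoding(filename, fallback="latin-1"):
    """Return the highest-confidence encoding guess for filename, or fallback."""
    try:
        best = from_path(filename).best()  # best CharsetMatch, or None
    except OSError:
        return fallback
    return best.encoding if best is not None else fallback

print(detect_encoding("example.py"))
```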

### Line Ending Detection

Methods for detecting and normalizing line endings.

```python { .api }
def do_find_newline(self, source: List[str]) -> str:
    """
    Determine the predominant newline style in source lines.

    Analyzes line endings to determine whether the file uses Unix (LF),
    Windows (CRLF), or Mac classic (CR) line endings.

    Args:
        source (List[str]): List of source code lines

    Returns:
        str: Predominant newline character(s) (LF, CRLF, or CR)
    """
```
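
To make "predominant newline" concrete, the idea can be sketched as a simple majority vote over line endings; this is an illustration of the approach, not necessarily how `do_find_newline()` is implemented:

```python
import collections

CR, LF, CRLF = "\r", "\n", "\r\n"

def find_newline(source):
    """Return the most common line ending in source, defaulting to LF."""
    counts = collections.Counter()
    for line in source:
        if line.endswith(CRLF):
            counts[CRLF] += 1
        elif line.endswith(CR):
            counts[CR] += 1
        elif line.endswith(LF):
            counts[LF] += 1
    return counts.most_common(1)[0][0] if counts else LF

print(repr(find_newline(["a\r\n", "b\r\n", "c\n"])))  # '\r\n'
```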

### File Opening

Methods for opening files with proper encoding handling.

```python { .api }
def do_open_with_encoding(self, filename, mode: str = "r"):
    """
    Open a file with the detected encoding.

    Opens the file using the encoding detected by do_detect_encoding().
    Handles encoding errors gracefully.

    Args:
        filename (str): Path to file to open
        mode (str): File opening mode (default: "r")

    Returns:
        File object opened with the proper encoding

    Raises:
        IOError: If the file cannot be opened
        UnicodeDecodeError: If the encoding is incorrect
    """
```
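
Conceptually, opening a file with a previously detected encoding resembles the standard-library call below. This is a minimal sketch of the general pattern, not necessarily the method's internals; `newline=""` keeps the original line endings intact so they can be detected and preserved later:

```python
import io

def open_with_encoding(filename, encoding, mode="r"):
    """Open filename with the given encoding, without translating newlines."""
    return io.open(filename, mode=mode, encoding=encoding, newline="")

with open_with_encoding("example.py", "utf-8") as handle:
    first_line = handle.readline()
    print(repr(first_line))
```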

### Utility Functions

File discovery and processing utilities.

```python { .api }
def find_py_files(sources, recursive, exclude=None):
    """
    Find Python source files in the given sources.

    Generator function that yields Python files (.py extension)
    from the specified sources, with support for recursive directory
    traversal and exclusion patterns.

    Args:
        sources: Iterable of file/directory paths
        recursive (bool): Whether to search directories recursively
        exclude (list, optional): Patterns to exclude from the search

    Yields:
        str: Path to each Python file found
    """

def has_correct_length(length_range, start, end):
    """
    Check whether a docstring is within the specified length range.

    Used with the --docstring-length option to filter docstrings
    by their line count.

    Args:
        length_range (list): [min_length, max_length] or None
        start (int): Starting line number of the docstring
        end (int): Ending line number of the docstring

    Returns:
        bool: True if within range or no range specified
    """

def is_in_range(line_range, start, end):
    """
    Check whether a docstring is within the specified line range.

    Used with the --range option to process only docstrings
    within specific line numbers.

    Args:
        line_range (list): [start_line, end_line] or None
        start (int): Starting line number of the docstring
        end (int): Ending line number of the docstring

    Returns:
        bool: True if in range or no range specified
    """
```
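
To make the generator behaviour concrete, here is one hypothetical way such discovery could be written with `os.walk`; the real `find_py_files()` may match exclusion patterns differently:

```python
import os

def iter_py_files(sources, recursive, exclude=None):
    """Yield .py paths from sources, skipping any path containing an exclude pattern."""
    exclude = exclude or []

    def is_excluded(path):
        return any(pattern in path for pattern in exclude)

    for source in sources:
        if os.path.isfile(source):
            if source.endswith(".py") and not is_excluded(source):
                yield source
        elif os.path.isdir(source):
            for root, _dirs, files in os.walk(source):
                for name in files:
                    path = os.path.join(root, name)
                    if name.endswith(".py") and not is_excluded(path):
                        yield path
                if not recursive:
                    break

for path in iter_py_files(["."], recursive=True, exclude=["__pycache__"]):
    print(path)
```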

## Usage Examples

### Basic Encoding Detection

```python
from docformatter import Encoder

# Create encoder instance
encoder = Encoder()

# Detect encoding for a file
encoder.do_detect_encoding("example.py")
print(f"Detected encoding: {encoder.encoding}")
print(f"System encoding: {encoder.system_encoding}")

# Open file with detected encoding
with encoder.do_open_with_encoding("example.py") as f:
    content = f.read()
    print(f"File content length: {len(content)}")
```

### Line Ending Detection

```python
from docformatter import Encoder

# Read file and detect line endings
encoder = Encoder()
encoder.do_detect_encoding("mixed_endings.py")

with encoder.do_open_with_encoding("mixed_endings.py") as f:
    lines = f.readlines()

# Detect predominant line ending
newline_style = encoder.do_find_newline(lines)
print(f"Detected line ending: {repr(newline_style)}")

if newline_style == encoder.LF:
    print("Unix/Linux line endings")
elif newline_style == encoder.CRLF:
    print("Windows line endings")
elif newline_style == encoder.CR:
    print("Mac classic line endings")
```

### File Processing with Encoding

```python
from docformatter import Encoder

def process_python_file(filename):
    """Process a Python file with proper encoding handling."""
    encoder = Encoder()

    # Detect encoding
    try:
        encoder.do_detect_encoding(filename)
        print(f"Processing {filename} with encoding: {encoder.encoding}")

        # Read file content
        with encoder.do_open_with_encoding(filename) as f:
            lines = f.readlines()

        # Detect line endings
        newline_style = encoder.do_find_newline(lines)

        # Process content (example: count docstring markers)
        content = ''.join(lines)
        docstring_count = content.count('"""') + content.count("'''")

        return {
            'filename': filename,
            'encoding': encoder.encoding,
            'line_ending': newline_style,
            'line_count': len(lines),
            'docstring_markers': docstring_count,
        }

    except Exception as e:
        print(f"Error processing {filename}: {e}")
        return None

# Example usage
result = process_python_file("example.py")
if result:
    print(f"File info: {result}")
```

### Finding Python Files

```python
from docformatter import find_py_files

# Find all .py files in the current directory
files = list(find_py_files(["."], recursive=False))
print(f"Found {len(files)} Python files")

# Find files recursively, excluding test directories
files = list(find_py_files(
    ["."],
    recursive=True,
    exclude=["tests", "__pycache__", ".git"],
))
print(f"Found {len(files)} Python files (excluding tests)")

# Process multiple source locations
sources = ["src/", "scripts/", "tools/"]
for filename in find_py_files(sources, recursive=True):
    print(f"Processing: {filename}")
```

### Range and Length Filtering

```python
from docformatter import has_correct_length, is_in_range

# Check docstring length filtering
length_range = [5, 20]  # Only process docstrings 5-20 lines long
start_line = 10
end_line = 15

if has_correct_length(length_range, start_line, end_line):
    print("Docstring is within length range")

# Check line range filtering
line_range = [1, 100]  # Only process docstrings in lines 1-100
if is_in_range(line_range, start_line, end_line):
    print("Docstring is within line range")

# Example usage in file processing
def should_process_docstring(start, end, length_filter=None, line_filter=None):
    """Determine if docstring should be processed based on filters."""
    if length_filter and not has_correct_length(length_filter, start, end):
        return False
    if line_filter and not is_in_range(line_filter, start, end):
        return False
    return True

# Test with various docstrings
docstrings = [
    (5, 8),    # Lines 5-8 (4 lines)
    (10, 25),  # Lines 10-25 (16 lines)
    (50, 75),  # Lines 50-75 (26 lines)
]

for start, end in docstrings:
    should_process = should_process_docstring(
        start, end,
        length_filter=[3, 20],  # 3-20 lines
        line_filter=[1, 30],    # Lines 1-30
    )
    print(f"Docstring lines {start}-{end}: {'Process' if should_process else 'Skip'}")
```

### Advanced File Processing

```python
from docformatter import Encoder, find_py_files

class FileProcessor:
    def __init__(self):
        self.encoder = Encoder()
        self.processed_files = []

    def process_directory(self, directory, recursive=True, exclude=None):
        """Process all Python files in directory."""
        files = find_py_files([directory], recursive, exclude)

        for filename in files:
            try:
                result = self.process_file(filename)
                if result:
                    self.processed_files.append(result)
            except Exception as e:
                print(f"Error processing {filename}: {e}")

        return self.processed_files

    def process_file(self, filename):
        """Process individual file with encoding detection."""
        # Detect encoding
        self.encoder.do_detect_encoding(filename)

        # Read file
        with self.encoder.do_open_with_encoding(filename) as f:
            lines = f.readlines()

        # Analyze file
        newline_style = self.encoder.do_find_newline(lines)

        return {
            'filename': filename,
            'encoding': self.encoder.encoding,
            'line_ending': repr(newline_style),
            'lines': len(lines),
            'size': sum(len(line.encode(self.encoder.encoding)) for line in lines),
        }

# Usage
processor = FileProcessor()
results = processor.process_directory(
    "src/",
    recursive=True,
    exclude=["__pycache__", "*.pyc", "tests/"],
)

# Print summary
for result in results:
    print(f"{result['filename']}: {result['encoding']} encoding, "
          f"{result['lines']} lines, {result['size']} bytes")
```

### Error Handling

```python
from docformatter import Encoder

def safe_file_processing(filename):
    """Process file with comprehensive error handling."""
    encoder = Encoder()

    try:
        # Try to detect encoding
        encoder.do_detect_encoding(filename)
        print(f"Detected encoding: {encoder.encoding}")

    except FileNotFoundError:
        print(f"File not found: {filename}")
        return None

    except PermissionError:
        print(f"Permission denied: {filename}")
        return None

    except Exception as e:
        print(f"Encoding detection failed: {e}")
        print(f"Using fallback encoding: {encoder.DEFAULT_ENCODING}")
        encoder.encoding = encoder.DEFAULT_ENCODING

    try:
        # Try to open and read file
        with encoder.do_open_with_encoding(filename) as f:
            content = f.read()

        return {
            'success': True,
            'encoding': encoder.encoding,
            'content_length': len(content),
        }

    except UnicodeDecodeError as e:
        print(f"Unicode decode error: {e}")
        print("File may have mixed encodings or be binary")
        return None

    except Exception as e:
        print(f"File reading error: {e}")
        return None

# Test with various files
test_files = ["example.py", "unicode_file.py", "binary_file.so", "missing.py"]

for filename in test_files:
    result = safe_file_processing(filename)
    if result:
        print(f"Successfully processed {filename}")
    else:
        print(f"Failed to process {filename}")
```

## Integration with Docformatter

The file I/O and encoding module integrates with other components:

- **Formatter**: Provides encoding-aware file reading and writing
- **Configuration**: Supports file discovery with exclusion patterns
- **String Processing**: Ensures proper handling of Unicode content
- **Command-Line Interface**: Enables robust batch file processing

## Platform Considerations

The module handles platform-specific differences:

- **Line Endings**: Detects and preserves the original line ending style (see the sketch after this list)
- **Encodings**: Handles Windows-1252, UTF-8, Latin-1, and other encodings
- **File Paths**: Works with both Unix and Windows path conventions
- **Permissions**: Graceful handling of permission-denied errors
- **Unicode**: Full support for international characters and symbols
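
For example, preserving a detected line-ending style when writing a file back out can be done with the built-in `open()` by passing that ending as the `newline` argument; a minimal sketch, assuming `detected_newline` came from `do_find_newline()`:

```python
def write_with_newline(filename, lines, encoding, detected_newline):
    """Write lines so the file keeps the detected line-ending style."""
    with open(filename, "w", encoding=encoding, newline=detected_newline) as handle:
        handle.writelines(lines)  # each "\n" written is translated to detected_newline

write_with_newline("output.py", ["def main():\n", "    pass\n"], "utf-8", "\r\n")
```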

## Performance Considerations

- **Encoding Detection**: Uses fast heuristic-based detection
- **File Reading**: Efficient line-by-line processing for large files
- **Memory Usage**: Streams large files rather than loading them entirely into memory
- **Caching**: Reuses encoding detection results within the same session