# File Format Specifications
Configuration classes for the supported file formats: CSV, JSON Lines, Parquet, and Avro. Each format provides specific configuration options for parsing and processing data files stored in S3.
## Capabilities
### CSV Format Configuration
Configuration for processing CSV (Comma-Separated Values) files with extensive customization options for delimiters, encoding, and parsing behavior.
```python { .api }
class CsvFormat:
    """
    Configuration for CSV file processing with customizable parsing options.
    """

    filetype: str = "csv"
    """File type identifier, always 'csv'"""

    delimiter: str
    """Field delimiter character (e.g., ',', ';', '\t')"""

    infer_datatypes: Optional[bool]
    """Whether to automatically infer column data types"""

    quote_char: str
    """Character used for quoting fields (default: '"')"""

    escape_char: Optional[str]
    """Character used for escaping special characters"""

    encoding: Optional[str]
    """Text encoding (e.g., 'utf-8', 'latin-1')"""

    double_quote: bool
    """Whether to treat two consecutive quote characters as a single quote"""

    newlines_in_values: bool
    """Whether field values can contain newline characters"""

    additional_reader_options: Optional[str]
    """JSON string of additional pandas read_csv options"""

    advanced_options: Optional[str]
    """JSON string of advanced parsing options"""

    block_size: int
    """Block size for reading large CSV files"""
```
### JSON Lines Format Configuration
Configuration for processing JSON Lines (JSONL/NDJSON) files, with options for handling unexpected fields and embedded newlines.
```python { .api }
class UnexpectedFieldBehaviorEnum(str, Enum):
    """Enumeration for handling unexpected fields in JSON objects"""

    ignore = "ignore"  # Ignore unexpected fields
    infer = "infer"  # Infer schema from unexpected fields
    error = "error"  # Raise error on unexpected fields


class JsonlFormat:
    """
    Configuration for JSON Lines file processing.
    """

    filetype: str = "jsonl"
    """File type identifier, always 'jsonl'"""

    newlines_in_values: bool
    """Whether JSON values can contain newline characters"""

    unexpected_field_behavior: UnexpectedFieldBehaviorEnum
    """How to handle fields not defined in the schema"""

    block_size: int
    """Block size for reading large JSONL files"""
```
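
To make the `unexpected_field_behavior` options concrete, here is a small illustration of how each value treats a record containing a field outside the configured schema. The record and schema are hypothetical, and the exact error surfaced by `error` depends on the connector:

```python
# Hypothetical configured schema: {"id", "name"}
record = {"id": 1, "name": "alice", "debug_flag": True}

# ignore -> "debug_flag" is dropped; the emitted record is {"id": 1, "name": "alice"}
# infer  -> "debug_flag" is added to the schema with an inferred (boolean) type
# error  -> the sync fails with a schema validation error
```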
### Parquet Format Configuration
Configuration for processing Apache Parquet files with options for column selection and performance optimization.
```python { .api }
class ParquetFormat:
    """
    Configuration for Parquet file processing with performance optimization options.
    """

    filetype: str = "parquet"
    """File type identifier, always 'parquet'"""

    columns: Optional[List[str]]
    """Specific columns to read from the Parquet file (None for all columns)"""

    batch_size: int
    """Number of rows to read in each batch for memory management"""

    buffer_size: int
    """Buffer size for reading Parquet files"""
```
### Avro Format Configuration
Configuration for processing Apache Avro files with native schema support.
```python { .api }
class AvroFormat:
    """
    Configuration for Avro file processing with native schema support.
    """

    filetype: str = "avro"
    """File type identifier, always 'avro'"""
```
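
Avro files embed the writer's schema in the file header, which is why `AvroFormat` needs no parsing options. As a minimal sketch, assuming the `fastavro` library and a local sample file (neither is part of this spec), the embedded schema can be inspected directly:

```python
from fastavro import reader

# The schema travels with the data, so no schema configuration is needed.
with open("events.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema embedded in the file header
```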
## Usage Examples
### CSV Format Usage
```python
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Basic CSV configuration
csv_format = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=True,
    double_quote=True,
    newlines_in_values=False,
    block_size=1024 * 1024,
)

# Advanced CSV configuration with custom options
csv_advanced = CsvFormat(
    delimiter="|",
    quote_char="'",
    escape_char="\\",
    encoding="latin1",
    infer_datatypes=False,
    additional_reader_options='{"skiprows": 1, "na_values": ["NULL", ""]}',
    advanced_options='{"parse_dates": ["created_at"]}',
    block_size=2 * 1024 * 1024,
)
```
### JSON Lines Format Usage
```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Standard JSONL configuration
jsonl_format = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.infer,
    block_size=1024 * 1024,
)

# Strict JSONL configuration that errors on unexpected fields
jsonl_strict = JsonlFormat(
    newlines_in_values=True,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.error,
    block_size=512 * 1024,
)
```
### Parquet Format Usage
```python
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Full Parquet file reading
parquet_format = ParquetFormat(
    columns=None,  # Read all columns
    batch_size=10000,
    buffer_size=1024 * 1024,
)

# Selective column reading for performance
parquet_selective = ParquetFormat(
    columns=["id", "name", "created_at", "amount"],
    batch_size=50000,
    buffer_size=4 * 1024 * 1024,
)
```
### Avro Format Usage
```python
from source_s3.source_files_abstract.formats.avro_spec import AvroFormat

# Avro configuration (minimal setup required)
avro_format = AvroFormat()
```
### Integration with Stream Configuration
```python
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Configure stream with CSV format
stream_config = FileBasedStreamConfig(
    name="users_data",
    globs=["users/*.csv"],
    format=CsvFormat(
        delimiter=",",
        quote_char='"',
        encoding="utf-8",
        infer_datatypes=True,
    ),
)

# Configure stream with Parquet format
parquet_stream = FileBasedStreamConfig(
    name="analytics_data",
    globs=["analytics/**/*.parquet"],
    format=ParquetFormat(
        columns=["user_id", "event_type", "timestamp", "properties"],
        batch_size=20000,
        buffer_size=2 * 1024 * 1024,
    ),
)
```
## Performance Considerations
### CSV Files
- Use a larger `block_size` for better performance with large files
- Set `infer_datatypes=False` for consistent performance across files
- Use `additional_reader_options` for pandas-specific optimizations, as in the sketch below
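
A sketch of a throughput-oriented `CsvFormat` combining these tips. The `usecols` and `dtype` keys are standard pandas `read_csv` options; whether a particular option is passed through unchanged depends on the connector's validation:

```python
import json

from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Performance-oriented configuration: large blocks, no per-file type
# inference, and explicit pandas options so every file parses the same way.
csv_fast = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=False,  # skip type inference for consistent performance
    double_quote=True,
    newlines_in_values=False,
    block_size=8 * 1024 * 1024,  # larger blocks mean fewer read calls
    additional_reader_options=json.dumps(
        {
            "usecols": ["id", "amount", "created_at"],  # parse only needed columns
            "dtype": {"id": "int64", "amount": "float64"},  # fixed column dtypes
        }
    ),
)
```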
### JSON Lines Files
- A larger `block_size` improves performance for large JSONL files
- `unexpected_field_behavior="ignore"` is fastest when the schema is already known
- Set `newlines_in_values=False` if values contain no embedded newlines; see the sketch below
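
A sketch of a JSONL configuration tuned along these lines, for a pipeline whose schema is already known:

```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Known-schema configuration: skip work on unexpected fields and read in
# large blocks; newlines_in_values=False is safe only when values contain
# no embedded newlines.
jsonl_fast = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.ignore,
    block_size=4 * 1024 * 1024,
)
```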
### Parquet Files
- Specify a `columns` list to read only the data you need
- Increase `batch_size` for better memory utilization
- A larger `buffer_size` can improve I/O performance; see the sketch below
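
These options roughly mirror pyarrow's reader parameters. A standalone sketch of the equivalent pyarrow calls, assuming the connector delegates to `pyarrow.parquet` (an implementation detail this spec does not guarantee):

```python
import pyarrow.parquet as pq

# Column pruning plus batched reads keep memory bounded: only the listed
# columns are decoded, at most batch_size rows at a time.
parquet_file = pq.ParquetFile("analytics.parquet", buffer_size=4 * 1024 * 1024)
for batch in parquet_file.iter_batches(
    batch_size=50_000,
    columns=["user_id", "event_type", "timestamp"],
):
    print(batch.num_rows)  # each RecordBatch holds at most 50,000 rows
```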
### Avro Files
- Avro provides efficient serialization with minimal configuration
- Schema evolution is handled automatically by the Avro format