# File Format Specifications
Configuration classes for the supported file formats: CSV, JSON Lines, Parquet, and Avro. Each format provides specific configuration options for parsing and processing data files stored in S3.
## Capabilities
### CSV Format Configuration
Configuration for processing CSV (Comma-Separated Values) files with extensive customization options for delimiters, encoding, and parsing behavior.
```python { .api }
class CsvFormat:
    """
    Configuration for CSV file processing with customizable parsing options.
    """

    filetype: str = "csv"
    """File type identifier, always 'csv'"""

    delimiter: str
    """Field delimiter character (e.g., ',', ';', '\t')"""

    infer_datatypes: Optional[bool]
    """Whether to automatically infer column data types"""

    quote_char: str
    """Character used for quoting fields (default: '"')"""

    escape_char: Optional[str]
    """Character used for escaping special characters"""

    encoding: Optional[str]
    """Text encoding (e.g., 'utf-8', 'latin-1')"""

    double_quote: bool
    """Whether to treat two consecutive quote characters as a single quote"""

    newlines_in_values: bool
    """Whether field values can contain newline characters"""

    additional_reader_options: Optional[str]
    """JSON string of additional pandas read_csv options"""

    advanced_options: Optional[str]
    """JSON string of advanced parsing options"""

    block_size: int
    """Block size for reading large CSV files"""
```
### JSON Lines Format Configuration
Configuration for processing JSON Lines (JSONL/NDJSON) files, with options for handling unexpected fields and embedded newlines.
```python { .api }
class UnexpectedFieldBehaviorEnum(str, Enum):
    """Enumeration for handling unexpected fields in JSON objects"""

    ignore = "ignore"  # Ignore unexpected fields
    infer = "infer"  # Infer schema from unexpected fields
    error = "error"  # Raise error on unexpected fields


class JsonlFormat:
    """
    Configuration for JSON Lines file processing.
    """

    filetype: str = "jsonl"
    """File type identifier, always 'jsonl'"""

    newlines_in_values: bool
    """Whether JSON values can contain newline characters"""

    unexpected_field_behavior: UnexpectedFieldBehaviorEnum
    """How to handle fields not defined in the schema"""

    block_size: int
    """Block size for reading large JSONL files"""
```
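
To make the `unexpected_field_behavior` options concrete, here is a small illustration of how each value treats a record containing a field outside the configured schema. The record and schema are hypothetical, and the exact error surfaced by `error` depends on the connector:

```python
# Hypothetical configured schema: {"id", "name"}
record = {"id": 1, "name": "alice", "debug_flag": True}

# ignore -> "debug_flag" is dropped; the emitted record is {"id": 1, "name": "alice"}
# infer  -> "debug_flag" is added to the schema with an inferred (boolean) type
# error  -> the sync fails with a schema validation error
```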
### Parquet Format Configuration
Configuration for processing Apache Parquet files with options for column selection and performance optimization.
```python { .api }
class ParquetFormat:
    """
    Configuration for Parquet file processing with performance optimization options.
    """

    filetype: str = "parquet"
    """File type identifier, always 'parquet'"""

    columns: Optional[List[str]]
    """Specific columns to read from the Parquet file (None for all columns)"""

    batch_size: int
    """Number of rows to read in each batch for memory management"""

    buffer_size: int
    """Buffer size for reading Parquet files"""
```
### Avro Format Configuration
Configuration for processing Apache Avro files with native schema support.
```python { .api }
class AvroFormat:
    """
    Configuration for Avro file processing with native schema support.
    """

    filetype: str = "avro"
    """File type identifier, always 'avro'"""
```
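
Avro files embed the writer's schema in the file header, which is why `AvroFormat` needs no parsing options. As a minimal sketch, assuming the `fastavro` library and a local sample file (neither is part of this spec), the embedded schema can be inspected directly:

```python
from fastavro import reader

# The schema travels with the data, so no schema configuration is needed.
with open("events.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # the schema embedded in the file header
```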
## Usage Examples
### CSV Format Usage
```python
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Basic CSV configuration
csv_format = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=True,
    double_quote=True,
    newlines_in_values=False,
    block_size=1024 * 1024,
)

# Advanced CSV configuration with custom options
csv_advanced = CsvFormat(
    delimiter="|",
    quote_char="'",
    escape_char="\\",
    encoding="latin1",
    infer_datatypes=False,
    additional_reader_options='{"skiprows": 1, "na_values": ["NULL", ""]}',
    advanced_options='{"parse_dates": ["created_at"]}',
    block_size=2 * 1024 * 1024,
)
```
### JSON Lines Format Usage
```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Standard JSONL configuration
jsonl_format = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.infer,
    block_size=1024 * 1024,
)

# Strict JSONL configuration that errors on unexpected fields
jsonl_strict = JsonlFormat(
    newlines_in_values=True,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.error,
    block_size=512 * 1024,
)
```
### Parquet Format Usage
```python
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Full Parquet file reading
parquet_format = ParquetFormat(
    columns=None,  # Read all columns
    batch_size=10000,
    buffer_size=1024 * 1024,
)

# Selective column reading for performance
parquet_selective = ParquetFormat(
    columns=["id", "name", "created_at", "amount"],
    batch_size=50000,
    buffer_size=4 * 1024 * 1024,
)
```
### Avro Format Usage
```python
from source_s3.source_files_abstract.formats.avro_spec import AvroFormat

# Avro configuration (minimal setup required)
avro_format = AvroFormat()
```
### Integration with Stream Configuration
```python
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Configure stream with CSV format
stream_config = FileBasedStreamConfig(
    name="users_data",
    globs=["users/*.csv"],
    format=CsvFormat(
        delimiter=",",
        quote_char='"',
        encoding="utf-8",
        infer_datatypes=True,
    ),
)

# Configure stream with Parquet format
parquet_stream = FileBasedStreamConfig(
    name="analytics_data",
    globs=["analytics/**/*.parquet"],
    format=ParquetFormat(
        columns=["user_id", "event_type", "timestamp", "properties"],
        batch_size=20000,
        buffer_size=2 * 1024 * 1024,
    ),
)
```
## Performance Considerations
### CSV Files
- Use a larger `block_size` for better performance with large files
- Set `infer_datatypes=False` for consistent performance across files
- Use `additional_reader_options` for pandas-specific optimizations, as in the sketch below
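
A sketch of a throughput-oriented `CsvFormat` combining these tips. The `usecols` and `dtype` keys are standard pandas `read_csv` options; whether a particular option is passed through unchanged depends on the connector's validation:

```python
import json

from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Performance-oriented configuration: large blocks, no per-file type
# inference, and explicit pandas options so every file parses the same way.
csv_fast = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=False,  # skip type inference for consistent performance
    double_quote=True,
    newlines_in_values=False,
    block_size=8 * 1024 * 1024,  # larger blocks mean fewer read calls
    additional_reader_options=json.dumps(
        {
            "usecols": ["id", "amount", "created_at"],  # parse only needed columns
            "dtype": {"id": "int64", "amount": "float64"},  # fixed column dtypes
        }
    ),
)
```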
### JSON Lines Files
- A larger `block_size` improves performance for large JSONL files
- `unexpected_field_behavior="ignore"` is fastest when the schema is already known
- Set `newlines_in_values=False` if values contain no embedded newlines; see the sketch below
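
A sketch of a JSONL configuration tuned along these lines, for a pipeline whose schema is already known:

```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Known-schema configuration: skip work on unexpected fields and read in
# large blocks; newlines_in_values=False is safe only when values contain
# no embedded newlines.
jsonl_fast = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.ignore,
    block_size=4 * 1024 * 1024,
)
```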
### Parquet Files
- Specify a `columns` list to read only the data you need
- Increase `batch_size` for better memory utilization
- A larger `buffer_size` can improve I/O performance; see the sketch below
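
These options roughly mirror pyarrow's reader parameters. A standalone sketch of the equivalent pyarrow calls, assuming the connector delegates to `pyarrow.parquet` (an implementation detail this spec does not guarantee):

```python
import pyarrow.parquet as pq

# Column pruning plus batched reads keep memory bounded: only the listed
# columns are decoded, at most batch_size rows at a time.
parquet_file = pq.ParquetFile("analytics.parquet", buffer_size=4 * 1024 * 1024)
for batch in parquet_file.iter_batches(
    batch_size=50_000,
    columns=["user_id", "event_type", "timestamp"],
):
    print(batch.num_rows)  # each RecordBatch holds at most 50,000 rows
```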
### Avro Files
- Avro provides efficient serialization with minimal configuration
- Schema evolution is handled automatically by the Avro format