# File Format Specifications

Configuration classes for the supported file formats: CSV, JSON Lines, Parquet, and Avro. Each format provides specific configuration options for parsing and processing data files stored in S3.

## Capabilities

### CSV Format Configuration

Configuration for processing CSV (Comma-Separated Values) files with extensive customization options for delimiters, encoding, and parsing behavior.

```python { .api }
class CsvFormat:
    """
    Configuration for CSV file processing with customizable parsing options.
    """

    filetype: str = "csv"
    """File type identifier, always 'csv'"""

    delimiter: str
    """Field delimiter character (e.g., ',', ';', '\t')"""

    infer_datatypes: Optional[bool]
    """Whether to automatically infer column data types"""

    quote_char: str
    """Character used for quoting fields (default: '"')"""

    escape_char: Optional[str]
    """Character used for escaping special characters"""

    encoding: Optional[str]
    """Text encoding (e.g., 'utf-8', 'latin-1')"""

    double_quote: bool
    """Whether to treat two consecutive quote characters as a single quote"""

    newlines_in_values: bool
    """Whether field values can contain newline characters"""

    additional_reader_options: Optional[str]
    """JSON string of additional pandas read_csv options"""

    advanced_options: Optional[str]
    """JSON string of advanced parsing options"""

    block_size: int
    """Block size for reading large CSV files"""
```
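
Because `additional_reader_options` and `advanced_options` are JSON *strings* rather than Python dicts, it is easy to introduce quoting mistakes by writing them by hand. A minimal sketch of a safer pattern, assuming the remaining `CsvFormat` fields fall back to their defaults (see the usage examples below):

```python
import json

from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Build the pandas options as a plain dict, then serialize it;
# json.dumps handles quoting and escaping so the string is valid JSON.
reader_options = {"skiprows": 1, "na_values": ["NULL", ""]}

csv_format = CsvFormat(
    delimiter=",",
    additional_reader_options=json.dumps(reader_options),
    block_size=1024*1024
)
```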

### JSON Lines Format Configuration

Configuration for processing JSON Lines (JSONL/NDJSON) files with options for handling unexpected fields and newline processing.

```python { .api }
class UnexpectedFieldBehaviorEnum(str, Enum):
    """Enumeration for handling unexpected fields in JSON objects"""
    ignore = "ignore"  # Ignore unexpected fields
    infer = "infer"    # Infer schema from unexpected fields
    error = "error"    # Raise error on unexpected fields


class JsonlFormat:
    """
    Configuration for JSON Lines file processing.
    """

    filetype: str = "jsonl"
    """File type identifier, always 'jsonl'"""

    newlines_in_values: bool
    """Whether JSON values can contain newline characters"""

    unexpected_field_behavior: UnexpectedFieldBehaviorEnum
    """How to handle fields not defined in the schema"""

    block_size: int
    """Block size for reading large JSONL files"""
```
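
To make the three policies concrete, the following standalone sketch (illustrative only, not connector code) applies each behavior to a record containing a field that is missing from the schema:

```python
from enum import Enum


class UnexpectedFieldBehaviorEnum(str, Enum):
    ignore = "ignore"
    infer = "infer"
    error = "error"


def apply_policy(record: dict, schema: set, policy: UnexpectedFieldBehaviorEnum) -> dict:
    """Illustrative only: mirrors what each policy means for one record."""
    unexpected = set(record) - schema
    if not unexpected:
        return record
    if policy is UnexpectedFieldBehaviorEnum.ignore:
        return {k: v for k, v in record.items() if k in schema}  # drop extras
    if policy is UnexpectedFieldBehaviorEnum.infer:
        schema.update(unexpected)  # widen the schema with the new fields
        return record
    raise ValueError(f"Unexpected fields: {sorted(unexpected)}")


record = {"id": 1, "name": "a", "debug": True}
print(apply_policy(record, {"id", "name"}, UnexpectedFieldBehaviorEnum.ignore))
# -> {'id': 1, 'name': 'a'}
```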

### Parquet Format Configuration

Configuration for processing Apache Parquet files with options for column selection and performance optimization.

```python { .api }
class ParquetFormat:
    """
    Configuration for Parquet file processing with performance optimization options.
    """

    filetype: str = "parquet"
    """File type identifier, always 'parquet'"""

    columns: Optional[List[str]]
    """Specific columns to read from the Parquet file (None for all columns)"""

    batch_size: int
    """Number of rows to read in each batch for memory management"""

    buffer_size: int
    """Buffer size for reading Parquet files"""
```

### Avro Format Configuration

Configuration for processing Apache Avro files with native schema support.

```python { .api }
class AvroFormat:
    """
    Configuration for Avro file processing with native schema support.
    """

    filetype: str = "avro"
    """File type identifier, always 'avro'"""
```

## Usage Examples

### CSV Format Usage

```python
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Basic CSV configuration
csv_format = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=True,
    double_quote=True,
    newlines_in_values=False,
    block_size=1024*1024
)

# Advanced CSV configuration with custom options
csv_advanced = CsvFormat(
    delimiter="|",
    quote_char="'",
    escape_char="\\",
    encoding="latin1",
    infer_datatypes=False,
    additional_reader_options='{"skiprows": 1, "na_values": ["NULL", ""]}',
    advanced_options='{"parse_dates": ["created_at"]}',
    block_size=2*1024*1024
)
```

### JSON Lines Format Usage

```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Standard JSONL configuration
jsonl_format = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.infer,
    block_size=1024*1024
)

# Strict JSONL configuration
jsonl_strict = JsonlFormat(
    newlines_in_values=True,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.error,
    block_size=512*1024
)
```

### Parquet Format Usage

```python
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Full Parquet file reading
parquet_format = ParquetFormat(
    columns=None,  # Read all columns
    batch_size=10000,
    buffer_size=1024*1024
)

# Selective column reading for performance
parquet_selective = ParquetFormat(
    columns=["id", "name", "created_at", "amount"],
    batch_size=50000,
    buffer_size=4*1024*1024
)
```

### Avro Format Usage

```python
from source_s3.source_files_abstract.formats.avro_spec import AvroFormat

# Avro configuration (minimal setup required)
avro_format = AvroFormat()
```

### Integration with Stream Configuration

```python
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Configure stream with CSV format
stream_config = FileBasedStreamConfig(
    name="users_data",
    globs=["users/*.csv"],
    format=CsvFormat(
        delimiter=",",
        quote_char='"',
        encoding="utf-8",
        infer_datatypes=True
    )
)

# Configure stream with Parquet format
parquet_stream = FileBasedStreamConfig(
    name="analytics_data",
    globs=["analytics/**/*.parquet"],
    format=ParquetFormat(
        columns=["user_id", "event_type", "timestamp", "properties"],
        batch_size=20000,
        buffer_size=2*1024*1024
    )
)
```
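
The stream configurations above are typically collected into the connector's top-level spec. A hedged sketch, assuming the v4 `Config` follows the file-based CDK convention of a `bucket` field plus a `streams` list (check the class for the exact required fields in your version):

```python
from source_s3.v4 import Config

# Assumed field names: `bucket` and `streams`; other options
# (e.g. AWS credentials) are omitted here and may be required in practice.
config = Config(
    bucket="my-data-bucket",
    streams=[stream_config, parquet_stream]
)
```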

## Performance Considerations

### CSV Files

- Use larger `block_size` for better performance with large files
- Set `infer_datatypes=False` for consistent performance across files
- Use `additional_reader_options` for pandas-specific optimizations
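
Putting these tips together, a throughput-oriented configuration for large files might look like this sketch (the 8 MiB block size and the pass-through `usecols` option are illustrative values, not connector recommendations):

```python
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Throughput-oriented CSV config: large blocks, no per-file type inference,
# and a pandas-level option passed through as a JSON string.
csv_fast = CsvFormat(
    delimiter=",",
    infer_datatypes=False,                                       # consistent types across files
    additional_reader_options='{"usecols": ["id", "amount"]}',   # illustrative pandas option
    block_size=8*1024*1024                                       # 8 MiB blocks for large files
)
```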

### JSON Lines Files

- Larger `block_size` improves performance for large JSONL files
- `unexpected_field_behavior="ignore"` is fastest for known schemas
- Set `newlines_in_values=False` if JSON doesn't contain embedded newlines
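
For example, the fastest path for a known, stable schema combines all three tips (the block size is an illustrative value):

```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Speed-oriented JSONL config: unexpected fields are dropped rather than
# inferred or rejected, and values are assumed to contain no embedded newlines.
jsonl_fast = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.ignore,
    block_size=4*1024*1024  # 4 MiB blocks; tune to your file sizes
)
```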

### Parquet Files

- Specify the `columns` list to read only required data
- Increase `batch_size` for better memory utilization
- Larger `buffer_size` can improve I/O performance

### Avro Files

- The Avro format provides efficient serialization with minimal configuration
- Schema evolution is handled automatically via the schema embedded in each Avro file