S3 connector for Airbyte that syncs data from Amazon S3 and S3-compatible services
Configuration classes for the supported file formats: CSV, JSON Lines, Parquet, and Avro. Each format provides specific configuration options for parsing and processing data files stored in S3.
Configuration for processing CSV (Comma-Separated Values) files with extensive customization options for delimiters, encoding, and parsing behavior.
```python
from typing import Optional

class CsvFormat:
    """Configuration for CSV file processing with customizable parsing options."""

    filetype: str = "csv"
    """File type identifier, always 'csv'."""

    delimiter: str
    """Field delimiter character (e.g., ',', ';', '\t')."""

    infer_datatypes: Optional[bool]
    """Whether to automatically infer column data types."""

    quote_char: str
    """Character used for quoting fields (default: '"')."""

    escape_char: Optional[str]
    """Character used for escaping special characters."""

    encoding: Optional[str]
    """Text encoding (e.g., 'utf-8', 'latin-1')."""

    double_quote: bool
    """Whether to treat two consecutive quote characters as a single quote."""

    newlines_in_values: bool
    """Whether field values can contain newline characters."""

    additional_reader_options: Optional[str]
    """JSON string of additional pandas read_csv options."""

    advanced_options: Optional[str]
    """JSON string of advanced parsing options."""

    block_size: int
    """Block size (in bytes) for reading large CSV files."""
```

Configuration for processing JSON Lines (JSONL/NDJSON) files with options for handling unexpected fields and newline processing.
```python
from enum import Enum

class UnexpectedFieldBehaviorEnum(str, Enum):
    """Enumeration for handling unexpected fields in JSON objects."""

    ignore = "ignore"  # Ignore unexpected fields
    infer = "infer"    # Infer schema from unexpected fields
    error = "error"    # Raise an error on unexpected fields

class JsonlFormat:
    """Configuration for JSON Lines file processing."""

    filetype: str = "jsonl"
    """File type identifier, always 'jsonl'."""

    newlines_in_values: bool
    """Whether JSON values can contain newline characters."""

    unexpected_field_behavior: UnexpectedFieldBehaviorEnum
    """How to handle fields not defined in the schema."""

    block_size: int
    """Block size (in bytes) for reading large JSONL files."""
```

Configuration for processing Apache Parquet files with options for column selection and performance optimization.
```python
from typing import List, Optional

class ParquetFormat:
    """Configuration for Parquet file processing with performance optimization options."""

    filetype: str = "parquet"
    """File type identifier, always 'parquet'."""

    columns: Optional[List[str]]
    """Specific columns to read from the Parquet file (None for all columns)."""

    batch_size: int
    """Number of rows to read in each batch for memory management."""

    buffer_size: int
    """Buffer size (in bytes) for reading Parquet files."""
```

Configuration for processing Apache Avro files with native schema support.
```python
class AvroFormat:
    """Configuration for Avro file processing with native schema support."""

    filetype: str = "avro"
    """File type identifier, always 'avro'."""
```

```python
from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Basic CSV configuration
csv_format = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=True,
    double_quote=True,
    newlines_in_values=False,
    block_size=1024 * 1024,
)

# Advanced CSV configuration with custom options
csv_advanced = CsvFormat(
    delimiter="|",
    quote_char="'",
    escape_char="\\",
    encoding="latin1",
    infer_datatypes=False,
    additional_reader_options='{"skiprows": 1, "na_values": ["NULL", ""]}',
    advanced_options='{"parse_dates": ["created_at"]}',
    block_size=2 * 1024 * 1024,
)
```
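The `additional_reader_options` value is a plain JSON string. As a rough sketch (assuming the connector simply deserializes it into extra keyword arguments for the underlying CSV reader; the exact plumbing is connector-internal), it can be parsed with the standard library:

```python
import json

# Hypothetical illustration: the JSON string becomes a kwargs dict
# that would be forwarded to the underlying pandas-style CSV reader.
additional_reader_options = '{"skiprows": 1, "na_values": ["NULL", ""]}'

reader_kwargs = json.loads(additional_reader_options)
print(reader_kwargs["skiprows"])   # 1
print(reader_kwargs["na_values"])  # ['NULL', '']
```

Because the options travel as a string, malformed JSON here fails at parse time rather than silently misconfiguring the reader.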
```python
from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Standard JSONL configuration
jsonl_format = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.infer,
    block_size=1024 * 1024,
)

# Strict JSONL configuration
jsonl_strict = JsonlFormat(
    newlines_in_values=True,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.error,
    block_size=512 * 1024,
)
```
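To make the three `unexpected_field_behavior` modes concrete, here is a small standalone sketch of how a record containing a field outside the known schema might be treated under each mode. The `handle_record` helper and `schema_fields` set are hypothetical illustrations, not the connector's internal code:

```python
import json

schema_fields = {"id", "name"}  # assumed known schema (illustrative)

def handle_record(line: str, behavior: str) -> dict:
    """Sketch of the three unexpected_field_behavior modes."""
    record = json.loads(line)
    extra = set(record) - schema_fields
    if not extra:
        return record
    if behavior == "error":
        raise ValueError(f"Unexpected fields: {sorted(extra)}")
    if behavior == "ignore":
        return {k: v for k, v in record.items() if k in schema_fields}
    return record  # "infer": keep extra fields and widen the schema

line = '{"id": 1, "name": "a", "debug": true}'
print(handle_record(line, "ignore"))  # {'id': 1, 'name': 'a'}
print(handle_record(line, "infer"))   # {'id': 1, 'name': 'a', 'debug': True}
```

`ignore` does the least work per record, which is why it is the fastest choice when the schema is already known and stable.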
```python
from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Full Parquet file reading
parquet_format = ParquetFormat(
    columns=None,  # Read all columns
    batch_size=10000,
    buffer_size=1024 * 1024,
)

# Selective column reading for performance
parquet_selective = ParquetFormat(
    columns=["id", "name", "created_at", "amount"],
    batch_size=50000,
    buffer_size=4 * 1024 * 1024,
)
```
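How `columns` and `batch_size` interact can be sketched with a pure-Python batching helper. The `iter_batches` function below is illustrative only (not the connector's actual Parquet reader): it yields fixed-size chunks of rows, pruned to the requested columns, which is the shape of work a column-selective batch reader performs.

```python
from itertools import islice

def iter_batches(rows, columns=None, batch_size=2):
    """Hypothetical sketch: yield lists of at most batch_size rows,
    keeping only the requested columns (all columns when None)."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        if columns is not None:
            batch = [{c: r[c] for c in columns} for r in batch]
        yield batch

rows = [{"id": i, "name": f"n{i}", "amount": i * 10} for i in range(5)]
batches = list(iter_batches(rows, columns=["id", "amount"], batch_size=2))
print(len(batches))   # 3 (batches of 2, 2, and 1 rows)
print(batches[0][0])  # {'id': 0, 'amount': 0}
```

Dropping unneeded columns before batching is what makes a `columns` list cheaper than reading whole rows, and `batch_size` bounds how many rows are held in memory at once.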
```python
from source_s3.source_files_abstract.formats.avro_spec import AvroFormat

# Avro configuration (minimal setup required; the schema is embedded in the file)
avro_format = AvroFormat()
```
```python
from source_s3.v4 import Config
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig

# Configure stream with CSV format
stream_config = FileBasedStreamConfig(
    name="users_data",
    globs=["users/*.csv"],
    format=CsvFormat(
        delimiter=",",
        quote_char='"',
        encoding="utf-8",
        infer_datatypes=True,
    ),
)

# Configure stream with Parquet format
parquet_stream = FileBasedStreamConfig(
    name="analytics_data",
    globs=["analytics/**/*.parquet"],
    format=ParquetFormat(
        columns=["user_id", "event_type", "timestamp", "properties"],
        batch_size=20000,
        buffer_size=2 * 1024 * 1024,
    ),
)
```

Performance tips:

- CSV: a larger `block_size` gives better performance with large files; `infer_datatypes=False` gives consistent performance across files; `additional_reader_options` allows pandas-specific optimizations.
- JSONL: a larger `block_size` improves performance for large JSONL files; `unexpected_field_behavior="ignore"` is fastest for known schemas; set `newlines_in_values=False` if the JSON doesn't contain embedded newlines.
- Parquet: use the `columns` list to read only required data; tune `batch_size` for better memory utilization; a larger `buffer_size` can improve I/O performance.

Install with Tessl CLI
```shell
npx tessl i tessl/pypi-airbyte-source-s3
```