tessl/pypi-airbyte-source-s3

S3 connector for Airbyte that syncs data from Amazon S3 and S3-compatible services


docs/file-formats.md

File Format Specifications

Configuration classes for the supported file formats: CSV, JSON Lines, Parquet, and Avro. Each format provides specific configuration options for parsing and processing data files stored in S3.

Capabilities

CSV Format Configuration

Configuration for processing CSV (Comma-Separated Values) files with extensive customization options for delimiters, encoding, and parsing behavior.

from typing import Optional

class CsvFormat:
    """
    Configuration for CSV file processing with customizable parsing options.
    """
    
    filetype: str = "csv"
    """File type identifier, always 'csv'"""
    
    delimiter: str
    """Field delimiter character (e.g., ',', ';', '\t')"""
    
    infer_datatypes: Optional[bool]
    """Whether to automatically infer column data types"""
    
    quote_char: str
    """Character used for quoting fields (default: '"')"""
    
    escape_char: Optional[str]
    """Character used for escaping special characters"""
    
    encoding: Optional[str] 
    """Text encoding (e.g., 'utf-8', 'latin-1')"""
    
    double_quote: bool
    """Whether to treat two consecutive quote characters as a single quote"""
    
    newlines_in_values: bool
    """Whether field values can contain newline characters"""
    
    additional_reader_options: Optional[str]
    """JSON string of additional pandas read_csv options"""
    
    advanced_options: Optional[str]
    """JSON string of advanced parsing options"""
    
    block_size: int
    """Block size for reading large CSV files"""

JSON Lines Format Configuration

Configuration for processing JSON Lines (JSONL/NDJSON) files with options for handling unexpected fields and newline processing.

from enum import Enum

class UnexpectedFieldBehaviorEnum(str, Enum):
    """Enumeration for handling unexpected fields in JSON objects"""
    ignore = "ignore"    # Ignore unexpected fields
    infer = "infer"      # Infer schema from unexpected fields
    error = "error"      # Raise error on unexpected fields

class JsonlFormat:
    """
    Configuration for JSON Lines file processing.
    """
    
    filetype: str = "jsonl"
    """File type identifier, always 'jsonl'"""
    
    newlines_in_values: bool
    """Whether JSON values can contain newline characters"""
    
    unexpected_field_behavior: UnexpectedFieldBehaviorEnum
    """How to handle fields not defined in the schema"""
    
    block_size: int
    """Block size for reading large JSONL files"""

Parquet Format Configuration

Configuration for processing Apache Parquet files with options for column selection and performance optimization.

from typing import List, Optional

class ParquetFormat:
    """
    Configuration for Parquet file processing with performance optimization options.
    """
    
    filetype: str = "parquet"
    """File type identifier, always 'parquet'"""
    
    columns: Optional[List[str]]
    """Specific columns to read from the Parquet file (None for all columns)"""
    
    batch_size: int
    """Number of rows to read in each batch for memory management"""
    
    buffer_size: int
    """Buffer size for reading Parquet files"""

Avro Format Configuration

Configuration for processing Apache Avro files with native schema support.

class AvroFormat:
    """
    Configuration for Avro file processing with native schema support.
    """
    
    filetype: str = "avro"
    """File type identifier, always 'avro'"""

Usage Examples

CSV Format Usage

from source_s3.source_files_abstract.formats.csv_spec import CsvFormat

# Basic CSV configuration
csv_format = CsvFormat(
    delimiter=",",
    quote_char='"',
    encoding="utf-8",
    infer_datatypes=True,
    double_quote=True,
    newlines_in_values=False,
    block_size=1024*1024
)

# Advanced CSV configuration with custom options
csv_advanced = CsvFormat(
    delimiter="|",
    quote_char="'",
    escape_char="\\",
    encoding="latin1",
    infer_datatypes=False,
    additional_reader_options='{"skiprows": 1, "na_values": ["NULL", ""]}',
    advanced_options='{"parse_dates": ["created_at"]}',
    block_size=2*1024*1024
)

JSON Lines Format Usage

from source_s3.source_files_abstract.formats.jsonl_spec import JsonlFormat, UnexpectedFieldBehaviorEnum

# Standard JSONL configuration
jsonl_format = JsonlFormat(
    newlines_in_values=False,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.infer,
    block_size=1024*1024
)

# Strict JSONL configuration
jsonl_strict = JsonlFormat(
    newlines_in_values=True,
    unexpected_field_behavior=UnexpectedFieldBehaviorEnum.error,
    block_size=512*1024
)

Parquet Format Usage

from source_s3.source_files_abstract.formats.parquet_spec import ParquetFormat

# Full Parquet file reading
parquet_format = ParquetFormat(
    columns=None,  # Read all columns
    batch_size=10000,
    buffer_size=1024*1024
)

# Selective column reading for performance
parquet_selective = ParquetFormat(
    columns=["id", "name", "created_at", "amount"],
    batch_size=50000,
    buffer_size=4*1024*1024
)

Avro Format Usage

from source_s3.source_files_abstract.formats.avro_spec import AvroFormat

# Avro configuration (minimal setup required)
avro_format = AvroFormat()

Integration with Stream Configuration

from source_s3.v4 import Config
from airbyte_cdk.sources.file_based.config.file_based_stream_config import FileBasedStreamConfig

# Configure stream with CSV format
stream_config = FileBasedStreamConfig(
    name="users_data",
    globs=["users/*.csv"],
    format=CsvFormat(
        delimiter=",",
        quote_char='"',
        encoding="utf-8",
        infer_datatypes=True
    )
)

# Configure stream with Parquet format
parquet_stream = FileBasedStreamConfig(
    name="analytics_data",
    globs=["analytics/**/*.parquet"],
    format=ParquetFormat(
        columns=["user_id", "event_type", "timestamp", "properties"],
        batch_size=20000,
        buffer_size=2*1024*1024
    )
)
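When the connector is configured through Airbyte's JSON spec rather than Python objects, the format appears as a nested object selected by its `filetype` field. A hypothetical sketch of that shape (field names taken from the classes above; the overall layout is assumed, not taken from the connector spec):

```python
import json

# Assumed JSON shape of a stream configuration with an embedded CSV format
stream_config = {
    "name": "users_data",
    "globs": ["users/*.csv"],
    "format": {
        "filetype": "csv",       # selects the CsvFormat options
        "delimiter": ",",
        "quote_char": '"',
        "encoding": "utf-8",
        "infer_datatypes": True,
    },
}
serialized = json.dumps(stream_config)
```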

Performance Considerations

CSV Files

  • Use larger block_size for better performance with large files
  • Set infer_datatypes=False for consistent performance across files
  • Use additional_reader_options for pandas-specific optimizations

JSON Lines Files

  • Larger block_size improves performance for large JSONL files
  • unexpected_field_behavior="ignore" is fastest for known schemas
  • Set newlines_in_values=False when values do not contain embedded newlines

Parquet Files

  • Specify columns list to read only required data
  • Increase batch_size for better memory utilization
  • Larger buffer_size can improve I/O performance

Avro Files

  • Avro format provides efficient serialization with minimal configuration
  • Schema evolution is handled automatically by the Avro format

Install with Tessl CLI

npx tessl i tessl/pypi-airbyte-source-s3
