tessl/maven-org-apache-flink--flink-connector-filesystem-2-10

Flink filesystem connector providing fault-tolerant rolling file sinks for streaming data to HDFS and Hadoop-compatible filesystems

—

Pending

Overview

Eval results

Files

Bucketing Strategies

Name: tessl/maven-org-apache-flink--flink-connector-filesystem-2-10
Author: tessl

Bucketers determine how streaming data is organized into directory structures within the base path. They control when new bucket directories are created and how elements are routed to appropriate buckets.

Bucketer Interface

The modern bucketing interface used with BucketingSink.

public interface Bucketer<T> extends Serializable

Core Method

Path getBucketPath(org.apache.flink.streaming.connectors.fs.Clock clock, org.apache.hadoop.fs.Path basePath, T element)

Returns the complete bucket path for the provided element.

Parameters:

clock - Clock implementation for getting current time
basePath - Base directory containing all buckets
element - Current element being processed

Returns: Complete Path where the element should be written, including basePath and subtask index

DateTimeBucketer

Creates buckets based on date and time patterns, organizing files into time-based directory structures.

Constructors

public DateTimeBucketer()

Creates a DateTimeBucketer with default format "yyyy-MM-dd--HH" (hourly buckets).

public DateTimeBucketer(String formatString)

Creates a DateTimeBucketer with custom date format pattern.

Parameters:

formatString - Java SimpleDateFormat pattern string

Usage Examples

import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

// Hourly buckets (default): /basePath/2023-12-25--14/
DateTimeBucketer<String> hourly = new DateTimeBucketer<>();

// Daily buckets: /basePath/2023-12-25/
DateTimeBucketer<String> daily = new DateTimeBucketer<>("yyyy-MM-dd");

// Minute-level buckets: /basePath/2023/12/25/14/30/
DateTimeBucketer<String> minutely = new DateTimeBucketer<>("yyyy/MM/dd/HH/mm");

BucketingSink<String> sink = new BucketingSink<>("/tmp/output");
sink.setBucketer(hourly);

Common Date Format Patterns

Pattern	Example Output	Description
`yyyy-MM-dd--HH`	`2023-12-25--14`	Default: hourly buckets
`yyyy-MM-dd`	`2023-12-25`	Daily buckets
`yyyy/MM/dd/HH`	`2023/12/25/14`	Hierarchical hourly
`yyyy-MM-dd/HH/mm`	`2023-12-25/14/30`	Minute-level buckets
`yyyy/MM`	`2023/12`	Monthly buckets
`'year='yyyy'/month='MM'/day='dd`	`year=2023/month=12/day=25`	Hive-style partitioning

BasePathBucketer

Uses the base path as the bucket directory without creating subdirectories. All files are written directly to the base path.

Constructor

public BasePathBucketer()

Creates a BasePathBucketer that writes all files to the base directory.

Usage Example

import org.apache.flink.streaming.connectors.fs.bucketing.BasePathBucketer;

// All files written directly to /tmp/output/
BasePathBucketer<String> bucketer = new BasePathBucketer<>();

BucketingSink<String> sink = new BucketingSink<>("/tmp/output");
sink.setBucketer(bucketer);

Legacy Bucketer Interface (Deprecated)

The original bucketing interface used with RollingSink.

@Deprecated
public interface Bucketer extends Serializable

Methods

@Deprecated
boolean shouldStartNewBucket(org.apache.hadoop.fs.Path basePath, org.apache.hadoop.fs.Path currentBucketPath)

Determines if a new bucket should be started.

@Deprecated
org.apache.hadoop.fs.Path getNextBucketPath(org.apache.hadoop.fs.Path basePath)

Returns the path for the next bucket.

Legacy Implementations

DateTimeBucketer (Legacy)

@Deprecated
public class DateTimeBucketer implements Bucketer

Legacy time-based bucketing for RollingSink.

NonRollingBucketer (Legacy)

@Deprecated
public class NonRollingBucketer implements Bucketer

Legacy single-bucket strategy for RollingSink.

Custom Bucketer Implementation

You can create custom bucketing strategies by implementing the Bucketer interface:

import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.hadoop.fs.Path;

public class CustomBucketer<T> implements Bucketer<T> {
    
    @Override
    public Path getBucketPath(Clock clock, Path basePath, T element) {
        // Custom bucketing logic based on element properties
        String bucketName = determineBucketName(element);
        return new Path(basePath, bucketName);
    }
    
    private String determineBucketName(T element) {
        // Example: bucket by string length
        if (element instanceof String) {
            String str = (String) element;
            return "length-" + str.length();
        }
        return "default";
    }
}

Bucketing Best Practices

Performance Considerations

Avoid Too Many Small Buckets: Excessive bucketing can create many small files, impacting performance
Balance Bucket Size: Consider your data volume and processing patterns
HDFS Block Size: Aim for file sizes that are multiples of HDFS block size (typically 128MB)

Time-based Bucketing Guidelines

// High-volume streams: Use larger time windows
DateTimeBucketer<String> hourly = new DateTimeBucketer<>("yyyy-MM-dd--HH");

// Medium-volume streams: Smaller windows acceptable
DateTimeBucketer<String> tenMinute = new DateTimeBucketer<>("yyyy-MM-dd--HH-mm");

// Low-volume streams: May use very granular bucketing
DateTimeBucketer<String> perMinute = new DateTimeBucketer<>("yyyy/MM/dd/HH/mm");

Combining with Batch Size

// Large buckets with smaller batch sizes for faster file rotation
BucketingSink<String> sink = new BucketingSink<>("/tmp/output");
sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd"))  // Daily buckets
    .setBatchSize(64 * 1024 * 1024);  // 64MB files

Hive-Compatible Partitioning

// Create Hive-compatible partition structure
DateTimeBucketer<String> hiveBucketer = new DateTimeBucketer<>(
    "'year='yyyy'/month='MM'/day='dd'/hour='HH"
);

// Results in: /basePath/year=2023/month=12/day=25/hour=14/
BucketingSink<String> sink = new BucketingSink<>("/tmp/output");
sink.setBucketer(hiveBucketer);

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-flink--flink-connector-filesystem-2-10

docs