CtrlK
BlogDocsLog inGet started
Tessl Logo

tessl/maven-org-apache-flink--flink-connector-hbase-1-4-2-11

Apache Flink HBase 1.4 Connector provides bidirectional data integration between Flink streaming applications and HBase NoSQL database.

Pending
Overview
Eval results
Files

write-options.mddocs/

Write Options and Performance Tuning

Configuration options for optimizing write performance through buffering, batching, and parallelism control in the Apache Flink HBase 1.4 Connector.

Capabilities

HBaseWriteOptions

Configuration class that encapsulates all write-related settings for optimal performance tuning of HBase sink operations.

/**
 * Options for HBase writing operations
 * Provides configuration for buffering, batching, and parallelism
 */
@Internal
public class HBaseWriteOptions implements Serializable {
    
    /**
     * Returns the maximum buffer size in bytes before flushing
     * @return Buffer size threshold in bytes (default: 2MB from ConnectionConfiguration.WRITE_BUFFER_SIZE_DEFAULT)
     */
    public long getBufferFlushMaxSizeInBytes();
    
    /**
     * Returns the maximum number of rows to buffer before flushing
     * @return Row count threshold (default: 0, disabled)
     */
    public long getBufferFlushMaxRows();
    
    /**
     * Returns the flush interval in milliseconds
     * @return Interval in milliseconds between automatic flushes (default: 0, disabled)
     */
    public long getBufferFlushIntervalMillis();
    
    /**
     * Returns the configured parallelism for sink operations
     * @return Parallelism level, or null for framework default
     */
    public Integer getParallelism();
    
    /**
     * Creates a new builder for configuring write options
     * @return Builder instance for fluent configuration
     */
    public static Builder builder();
}

HBaseWriteOptions.Builder

Builder class providing fluent API for configuring write options with method chaining.

/**
 * Builder for HBaseWriteOptions using fluent interface pattern
 * Allows step-by-step configuration of all write parameters
 */
public static class Builder {
    
    /**
     * Sets the maximum buffer size in bytes for flushing
     * @param bufferFlushMaxSizeInBytes Buffer size threshold (default: 2MB)
     * @return Builder instance for method chaining
     */
    public Builder setBufferFlushMaxSizeInBytes(long bufferFlushMaxSizeInBytes);
    
    /**
     * Sets the maximum number of rows to buffer before flushing
     * @param bufferFlushMaxRows Row count threshold (default: 0, disabled)
     * @return Builder instance for method chaining
     */
    public Builder setBufferFlushMaxRows(long bufferFlushMaxRows);
    
    /**
     * Sets the flush interval in milliseconds for time-based flushing
     * @param bufferFlushIntervalMillis Interval in milliseconds (default: 0, disabled)
     * @return Builder instance for method chaining
     */
    public Builder setBufferFlushIntervalMillis(long bufferFlushIntervalMillis);
    
    /**
     * Sets the parallelism level for sink operations
     * @param parallelism Number of parallel sink instances (null for framework default)
     * @return Builder instance for method chaining
     */
    public Builder setParallelism(Integer parallelism);
    
    /**
     * Creates a new HBaseWriteOptions instance with configured settings
     * @return Configured HBaseWriteOptions instance
     */
    public HBaseWriteOptions build();
}

Usage Example:

// Example: Creating optimized write options for high-throughput scenario
HBaseWriteOptions highThroughputOptions = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(10 * 1024 * 1024) // 10MB buffer
    .setBufferFlushMaxRows(5000)                     // 5000 rows per batch
    .setBufferFlushIntervalMillis(5000)              // 5 second interval
    .setParallelism(8)                               // 8 parallel writers
    .build();

// Example: Creating low-latency write options
HBaseWriteOptions lowLatencyOptions = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(100 * 1024)       // 100KB buffer
    .setBufferFlushMaxRows(100)                      // 100 rows per batch
    .setBufferFlushIntervalMillis(500)               // 500ms interval
    .setParallelism(4)                               // 4 parallel writers
    .build();

Buffer Configuration Strategies

Size-Based Flushing

Configure buffer flushing based on memory consumption to control memory usage and write batch sizes.

// Memory-efficient configuration
HBaseWriteOptions memoryOptimized = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(4 * 1024 * 1024) // 4MB buffer
    .setBufferFlushMaxRows(0)                       // Disable row-based flushing
    .setBufferFlushIntervalMillis(0)                // Disable time-based flushing
    .build();

Buffer Size Guidelines:

  • Small datasets (< 1000 records/sec): 1-2MB buffer size
  • Medium datasets (1000-10000 records/sec): 4-8MB buffer size
  • Large datasets (> 10000 records/sec): 10-20MB buffer size
  • Memory-constrained environments: 512KB-1MB buffer size

Row-Based Flushing

Configure buffer flushing based on the number of accumulated rows for predictable batch sizes.

// Row-count based configuration
HBaseWriteOptions rowBased = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(0)                // Disable size-based flushing
    .setBufferFlushMaxRows(2000)                    // Flush every 2000 rows
    .setBufferFlushIntervalMillis(0)                // Disable time-based flushing
    .build();

Row Count Guidelines:

  • Small records (< 1KB each): 1000-5000 rows per batch
  • Medium records (1-10KB each): 500-2000 rows per batch
  • Large records (> 10KB each): 100-500 rows per batch

Time-Based Flushing

Configure periodic flushing to ensure low latency even with small data volumes.

// Time-based configuration for low latency
HBaseWriteOptions timeBased = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(0)                // Disable size-based flushing
    .setBufferFlushMaxRows(0)                       // Disable row-based flushing
    .setBufferFlushIntervalMillis(1000)             // Flush every 1 second
    .build();

Interval Guidelines:

  • Real-time applications: 100-500ms intervals
  • Near real-time applications: 1-5 second intervals
  • Batch processing: 10-60 second intervals

Combined Flushing Strategies

Use multiple flushing triggers simultaneously for optimal performance across varying load patterns.

// Combined strategy for adaptive performance
HBaseWriteOptions combined = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(8 * 1024 * 1024)  // 8MB size limit
    .setBufferFlushMaxRows(3000)                     // 3000 row limit
    .setBufferFlushIntervalMillis(2000)              // 2 second time limit
    .setParallelism(6)                               // 6 parallel writers
    .build();

// This configuration will flush when ANY condition is met:
// - Buffer reaches 8MB in size, OR
// - Buffer contains 3000 rows, OR  
// - 2 seconds have elapsed since last flush

Performance Tuning Guidelines

High-Throughput Configuration

Optimize for maximum data ingestion rates with larger buffers and higher parallelism.

HBaseWriteOptions highThroughput = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(16 * 1024 * 1024) // 16MB buffers
    .setBufferFlushMaxRows(8000)                     // Large batches
    .setBufferFlushIntervalMillis(10000)             // 10s intervals
    .setParallelism(12)                              // High parallelism
    .build();

High-Throughput Characteristics:

  • Large buffer sizes (10-20MB)
  • High row counts (5000-10000 rows)
  • Longer intervals (5-15 seconds)
  • High parallelism (8-16 instances)

Low-Latency Configuration

Optimize for minimal end-to-end latency with smaller buffers and frequent flushing.

HBaseWriteOptions lowLatency = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(256 * 1024)       // 256KB buffers
    .setBufferFlushMaxRows(50)                       // Small batches
    .setBufferFlushIntervalMillis(200)               // 200ms intervals
    .setParallelism(4)                               // Moderate parallelism
    .build();

Low-Latency Characteristics:

  • Small buffer sizes (100KB-1MB)
  • Low row counts (50-500 rows)
  • Short intervals (100-1000ms)
  • Moderate parallelism (2-6 instances)

Balanced Configuration

General-purpose configuration balancing throughput and latency for most use cases.

HBaseWriteOptions balanced = HBaseWriteOptions.builder()
    .setBufferFlushMaxSizeInBytes(4 * 1024 * 1024)  // 4MB buffers
    .setBufferFlushMaxRows(2000)                     // Medium batches
    .setBufferFlushIntervalMillis(2000)              // 2s intervals
    .setParallelism(6)                               // Balanced parallelism
    .build();

SQL Configuration Mapping

Write options can be configured through SQL DDL statements using connector properties.

-- High-throughput SQL configuration
CREATE TABLE high_volume_events (
    event_id STRING,
    event_data ROW<timestamp TIMESTAMP(3), payload STRING, metadata MAP<STRING, STRING>>,
    PRIMARY KEY (event_id) NOT ENFORCED
) WITH (
    'connector' = 'hbase-1.4',
    'table-name' = 'events',
    'zookeeper.quorum' = 'zk1:2181,zk2:2181,zk3:2181',
    'sink.buffer-flush.max-size' = '16mb',           -- setBufferFlushMaxSizeInBytes
    'sink.buffer-flush.max-rows' = '8000',           -- setBufferFlushMaxRows
    'sink.buffer-flush.interval' = '10s',            -- setBufferFlushIntervalMillis
    'sink.parallelism' = '12'                        -- setParallelism
);

-- Low-latency SQL configuration
CREATE TABLE realtime_alerts (
    alert_id STRING,
    alert_info ROW<severity STRING, message STRING, timestamp TIMESTAMP(3)>,
    PRIMARY KEY (alert_id) NOT ENFORCED
) WITH (
    'connector' = 'hbase-1.4',
    'table-name' = 'alerts',
    'zookeeper.quorum' = 'localhost:2181',
    'sink.buffer-flush.max-size' = '256kb',          -- Small buffers
    'sink.buffer-flush.max-rows' = '50',             -- Small batches
    'sink.buffer-flush.interval' = '200ms',          -- Frequent flushing
    'sink.parallelism' = '4'                         -- Moderate parallelism
);

Monitoring Write Performance

Key Performance Indicators

Monitor these metrics to evaluate write performance and adjust configuration:

  • Throughput: Records/second and bytes/second written to HBase
  • Latency: End-to-end time from Flink to HBase persistence
  • Buffer utilization: Percentage of buffer capacity used
  • Flush frequency: Number of flushes per minute
  • Error rate: Percentage of failed write operations

Performance Tuning Process

  1. Baseline measurement: Start with default settings and measure performance
  2. Identify bottlenecks: Determine if limited by CPU, memory, network, or HBase
  3. Incremental tuning: Adjust one parameter at a time and measure impact
  4. Load testing: Validate configuration under expected production load
  5. Monitoring setup: Establish alerting for performance degradation

Common Performance Issues

Problem: High write latency Solution: Reduce buffer sizes and flush intervals

Problem: Low write throughput
Solution: Increase buffer sizes and parallelism

Problem: Memory pressure Solution: Reduce buffer sizes or increase flush frequency

Problem: HBase region hotspots Solution: Optimize row key distribution and increase parallelism

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-flink--flink-connector-hbase-1-4-2-11

docs

index.md

lookup-options.md

sink-operations.md

source-operations.md

table-factory.md

write-options.md

tile.json