tessl/maven-org-apache-hudi--hudi-hadoop-common

Apache Hudi Hadoop common utilities and components providing core functionality for integrating Apache Hudi with the Hadoop ecosystem, including file system operations, configuration management, and Hadoop-specific utilities for managing data lakehouse operations.


Configuration Management

DFS-based configuration management with support for global properties, environment-specific settings, and Hadoop configuration integration, enabling centralized configuration across distributed Hudi deployments.

Capabilities

DFSPropertiesConfiguration

Main configuration class providing DFS-based properties management with support for global configuration files and environment-specific overrides.

/**
 * DFS-based properties configuration extending PropertiesConfig
 * Provides centralized configuration management for Hudi operations
 */
public class DFSPropertiesConfiguration extends PropertiesConfig {
    
    /** Default properties file name */
    public static final String DEFAULT_PROPERTIES_FILE = "hudi-defaults.conf";
    
    /** Environment variable for configuration directory */
    public static final String CONF_FILE_DIR_ENV_NAME = "HUDI_CONF_DIR";
    
    /** Default configuration file directory */
    public static final String DEFAULT_CONF_FILE_DIR = "file:/etc/hudi/conf";
    
    /** Default path for configuration file */
    public static final StoragePath DEFAULT_PATH;
    
    /** Create configuration with Hadoop configuration and file path */
    public DFSPropertiesConfiguration(Configuration hadoopConf, StoragePath filePath);
    
    /** Create configuration with default settings */
    public DFSPropertiesConfiguration();
    
    /** Add properties from file path */
    public void addPropsFromFile(StoragePath filePath);
    
    /** Add properties from BufferedReader stream */
    public void addPropsFromStream(BufferedReader reader, StoragePath cfgFilePath);
    
    /** Get global properties instance */
    public TypedProperties getGlobalProperties();
    
    /** Get instance properties */
    public TypedProperties getProps();
    
    /** Get instance properties with global properties option */
    public TypedProperties getProps(boolean includeGlobalProps);
}

Global Configuration Management

Static methods for managing global configuration properties across the application.

/**
 * Load global properties from default configuration location
 * Loads properties from HUDI_CONF_DIR or default location
 * @return TypedProperties containing global configuration
 */
public static TypedProperties loadGlobalProps();

/**
 * Get global properties (cached version)
 * Returns cached global properties or loads them if not cached
 * @return TypedProperties containing global configuration
 */
public static TypedProperties getGlobalProps();

/**
 * Refresh global properties by reloading from file system
 * Clears cache and reloads properties from configuration files
 */
public static void refreshGlobalProps();

/**
 * Clear global properties cache
 * Forces next access to reload properties from files
 */
public static void clearGlobalProps();

/**
 * Add property to global properties
 * @param key - Property key to add
 * @param value - Property value to set
 * @return Updated TypedProperties with new property
 */
public static TypedProperties addToGlobalProps(String key, String value);
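The static methods above imply a simple load-once, cache, refresh-on-demand lifecycle. As a minimal sketch of that pattern (illustrative only, using plain `java.util.Properties`; Hudi's real implementation reads from DFS and returns `TypedProperties`):

```java
import java.util.Properties;

// Sketch of the cache/refresh pattern behind the static methods above.
// load() stands in for loadGlobalProps(); the property value is hypothetical.
public class GlobalPropsCache {
    private static Properties cached;

    static Properties load() {
        Properties p = new Properties();
        p.setProperty("hoodie.table.type", "COPY_ON_WRITE");
        return p;
    }

    // getGlobalProps(): load once, then serve from cache
    static synchronized Properties get() {
        if (cached == null) cached = load();
        return cached;
    }

    // refreshGlobalProps(): drop the cache and reload immediately
    static synchronized void refresh() {
        cached = load();
    }

    // clearGlobalProps(): next get() triggers a reload
    static synchronized void clear() {
        cached = null;
    }

    public static void main(String[] args) {
        System.out.println(get().getProperty("hoodie.table.type"));
    }
}
```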

Configuration File Locations

The configuration system supports multiple approaches for locating configuration files:

  1. Environment Variable: set HUDI_CONF_DIR to specify a custom configuration directory
  2. Default Location: falls back to file:/etc/hudi/conf if the environment variable is not set
  3. Explicit Path: pass a specific StoragePath to the constructor for a custom location
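The lookup order can be sketched in plain Java as follows (illustrative only; this is not Hudi's code, and the default directory constant mirrors `DEFAULT_CONF_FILE_DIR` above):

```java
// Sketch of the resolution order: HUDI_CONF_DIR first, then the default directory.
public class ConfDirResolver {
    static final String DEFAULT_CONF_FILE_DIR = "file:/etc/hudi/conf";

    static String resolveConfDir() {
        String envDir = System.getenv("HUDI_CONF_DIR");
        return (envDir != null && !envDir.isEmpty()) ? envDir : DEFAULT_CONF_FILE_DIR;
    }

    public static void main(String[] args) {
        // The default properties file is resolved relative to the chosen directory.
        System.out.println(resolveConfDir() + "/hudi-defaults.conf");
    }
}
```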

Hadoop Configuration Integration

Seamless integration with Hadoop Configuration system for unified configuration management.

/**
 * Configuration integration patterns with Hadoop
 */

// Create with existing Hadoop configuration
Configuration hadoopConf = new Configuration();
hadoopConf.addResource("core-site.xml");
hadoopConf.addResource("hdfs-site.xml");

// Custom configuration file location
StoragePath configPath = new StoragePath("hdfs://namenode:8020/config/hudi-custom.conf");
DFSPropertiesConfiguration hudiConf = new DFSPropertiesConfiguration(hadoopConf, configPath);

// Access properties
String tableType = hudiConf.getString("hoodie.table.type", "COPY_ON_WRITE");
int parquetBlockSize = hudiConf.getInt("hoodie.parquet.block.size", 134217728);
boolean inlineCompaction = hudiConf.getBoolean("hoodie.compact.inline", false);

Configuration Properties Access

Inherited methods from PropertiesConfig for accessing configuration values with type safety and defaults.

/**
 * Property access methods (inherited from PropertiesConfig)
 */

// String properties
public String getString(String key);
public String getString(String key, String defaultValue);

// Integer properties  
public int getInt(String key);
public int getInt(String key, int defaultValue);

// Long properties
public long getLong(String key);
public long getLong(String key, long defaultValue);

// Boolean properties
public boolean getBoolean(String key);
public boolean getBoolean(String key, boolean defaultValue);

// Double properties
public double getDouble(String key);
public double getDouble(String key, double defaultValue);

// Get all properties as TypedProperties
public TypedProperties getProps();

Common Hudi Configuration Properties

Standard configuration properties commonly used in Hudi operations:

Table Configuration

  • hoodie.table.name - Name of the Hudi table
  • hoodie.table.type - Table type (COPY_ON_WRITE or MERGE_ON_READ)
  • hoodie.table.base.file.format - Base file format (PARQUET, ORC, etc.)

Write Configuration

  • hoodie.write.markers.type - Marker type for write operations
  • hoodie.write.concurrency.mode - Concurrency mode for writes
  • hoodie.datasource.write.operation - Write operation type (INSERT, UPSERT, etc.)

Compaction Configuration

  • hoodie.compact.inline - Enable inline compaction
  • hoodie.compact.inline.max.delta.commits - Max delta commits before compaction
  • hoodie.compact.strategy - Compaction strategy

File System Configuration

  • hoodie.filesystem.consistency.check.enabled - Enable consistency checks
  • hoodie.filesystem.operation.retry.enable - Enable operation retries
  • hoodie.filesystem.operation.retry.initial.interval - Initial retry interval
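The retry settings above name an initial interval, which suggests a growing backoff between attempts. A hedged sketch of how an initial interval (in milliseconds) might grow under exponential backoff (Hudi's actual retry policy may differ):

```java
// Illustrative exponential backoff: the interval doubles per attempt,
// starting from hoodie.filesystem.operation.retry.initial.interval.
public class RetryIntervals {
    static long intervalForAttempt(long initialMs, int attempt) {
        return initialMs << attempt; // double per attempt
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println(intervalForAttempt(100, attempt)); // 100, 200, 400, 800
        }
    }
}
```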

Configuration File Format

Hudi configuration files use standard Java properties format:

# Hudi configuration file (hudi-defaults.conf)

# Table settings
hoodie.table.type=COPY_ON_WRITE
hoodie.table.base.file.format=PARQUET

# Write settings
hoodie.write.markers.type=TIMELINE_SERVER_BASED
hoodie.write.concurrency.mode=SINGLE_WRITER
hoodie.datasource.write.operation=UPSERT

# Compaction settings
hoodie.compact.inline=false
hoodie.compact.inline.max.delta.commits=10
hoodie.compact.strategy=org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy

# File system settings
hoodie.filesystem.consistency.check.enabled=true
hoodie.filesystem.operation.retry.enable=true
hoodie.filesystem.operation.retry.initial.interval=100

# Parquet settings
hoodie.parquet.block.size=134217728
hoodie.parquet.page.size=1048576
hoodie.parquet.compression.codec=snappy
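Because these files are standard Java properties, any `java.util.Properties` reader can parse them. A small self-contained demonstration (the keys below are taken from the sample file above):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Parses Hudi-style configuration content with the standard Properties parser.
public class HudiConfParseDemo {
    static Properties parse(String conf) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(conf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String conf = "# Compaction settings\n"
            + "hoodie.compact.inline=false\n"
            + "hoodie.compact.inline.max.delta.commits=10\n";
        Properties props = parse(conf);
        System.out.println(props.getProperty("hoodie.compact.inline"));
    }
}
```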

Usage Examples

import org.apache.hudi.common.config.DFSPropertiesConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.storage.StoragePath;
import org.apache.hudi.common.config.TypedProperties;

// Using global configuration
TypedProperties globalProps = DFSPropertiesConfiguration.getGlobalProps();
String defaultTableType = globalProps.getString("hoodie.table.type", "COPY_ON_WRITE");

// Adding to and reading the global configuration
DFSPropertiesConfiguration.addToGlobalProps("hoodie.compact.inline", "false");
boolean inlineCompaction = DFSPropertiesConfiguration.getGlobalProps().getBoolean("hoodie.compact.inline", false);

// Creating custom configuration with Hadoop integration
Configuration hadoopConf = new Configuration();
hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020");

// Custom configuration file location
StoragePath customConfigPath = new StoragePath("hdfs://namenode:8020/apps/hudi/conf/production.conf");
DFSPropertiesConfiguration customConfig = new DFSPropertiesConfiguration(hadoopConf, customConfigPath);

// Access configuration properties with defaults
String tableName = customConfig.getString("hoodie.table.name", "default_table");
int parquetBlockSize = customConfig.getInt("hoodie.parquet.block.size", 134217728);
boolean consistencyCheck = customConfig.getBoolean("hoodie.filesystem.consistency.check.enabled", true);

// Environment-based configuration directory
// Set environment variable: export HUDI_CONF_DIR=hdfs://namenode:8020/config/hudi
// Configuration will automatically load from: hdfs://namenode:8020/config/hudi/hudi-defaults.conf
DFSPropertiesConfiguration envConfig = new DFSPropertiesConfiguration();

// Working with TypedProperties for bulk operations
TypedProperties allProps = customConfig.getProps();
for (String key : allProps.stringPropertyNames()) {
    String value = allProps.getString(key);
    System.out.println(key + " = " + value);
}

// Combining with Hadoop configuration for unified setup
Configuration unifiedConf = new Configuration();
unifiedConf.addResource("core-site.xml");
unifiedConf.addResource("hdfs-site.xml");

DFSPropertiesConfiguration hudiConfig = new DFSPropertiesConfiguration(
    unifiedConf, 
    new StoragePath("hdfs://namenode:8020/config/hudi-defaults.conf")
);

// Use in Hudi operations
String recordKeyField = hudiConfig.getString("hoodie.datasource.write.recordkey.field", "_row_key");
String partitionPathField = hudiConfig.getString("hoodie.datasource.write.partitionpath.field", "partition");
String precombineField = hudiConfig.getString("hoodie.datasource.write.precombine.field", "ts");

Configuration Best Practices

  1. Environment Separation: Use different configuration files for different environments (dev, staging, prod)
  2. Centralized Storage: Store configuration files in HDFS or other distributed storage for consistency
  3. Property Precedence: Understand that programmatically set properties override configuration file values
  4. Default Values: Always provide sensible defaults when accessing configuration properties
  5. Documentation: Document custom configuration properties and their expected values
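Practice 1 (environment separation) can be as simple as selecting a per-environment configuration file before constructing `DFSPropertiesConfiguration`. A hedged sketch; the paths and file names here are hypothetical:

```java
// Picks a configuration file path per environment (hypothetical paths).
public class EnvConfigSelector {
    static String configPathFor(String env) {
        switch (env) {
            case "prod":    return "hdfs://namenode:8020/apps/hudi/conf/production.conf";
            case "staging": return "hdfs://namenode:8020/apps/hudi/conf/staging.conf";
            default:        return "hdfs://namenode:8020/apps/hudi/conf/dev.conf";
        }
    }

    public static void main(String[] args) {
        // The resulting path would be wrapped in a StoragePath and passed
        // to the DFSPropertiesConfiguration constructor.
        System.out.println(configPathFor("prod"));
    }
}
```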

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-hudi--hudi-hadoop-common
