
# Apache Hudi Hadoop Common


Apache Hudi Hadoop Common provides essential Hadoop integration components for Apache Hudi, an open data lakehouse platform. This library contains core utilities and abstractions that enable Hudi to work seamlessly with the Hadoop ecosystem, including HDFS file system operations, configuration management through DFSPropertiesConfiguration, and format-specific utilities for Parquet, ORC, and HFile formats.


## Package Information

- **Package Name**: hudi-hadoop-common
- **Package Type**: maven
- **Language**: Java
- **Group ID**: org.apache.hudi
- **Artifact ID**: hudi-hadoop-common
- **Installation**: Add dependency to pom.xml:

```xml
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-hadoop-common</artifactId>
    <version>1.0.2</version>
</dependency>
```

## Core Imports

```java
import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
import org.apache.hudi.io.hadoop.HoodieHadoopIOFactory;
import org.apache.hudi.common.config.DFSPropertiesConfiguration;
import org.apache.hudi.hadoop.fs.HadoopFSUtils;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.ParquetUtils;
import org.apache.hudi.common.util.AvroOrcUtils;
import org.apache.hudi.common.util.collection.ClosableIterator;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
import org.apache.hudi.common.model.HoodieFileFormat;
import org.apache.hudi.keygen.BaseKeyGenerator;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.ColumnVector;
```

## Basic Usage

```java
import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.storage.StoragePath;

import java.io.InputStream;
import java.io.OutputStream;

// Create Hadoop storage configuration
Configuration hadoopConf = new Configuration();
HadoopStorageConfiguration storageConf = new HadoopStorageConfiguration(hadoopConf);

// Initialize Hadoop storage
StoragePath path = new StoragePath("hdfs://example.com:8020/data");
HoodieHadoopStorage storage = new HoodieHadoopStorage(path, storageConf);

// Perform basic file operations (these calls can throw IOException,
// and the returned streams should be closed after use)
boolean exists = storage.exists(path);
InputStream inputStream = storage.open(path);
OutputStream outputStream = storage.create(path, true);
```

## Architecture

Apache Hudi Hadoop Common is organized around several key architectural components:

- **Storage Abstraction**: `HoodieHadoopStorage` provides a unified interface for Hadoop FileSystem operations with consistency guarantees and retry mechanisms
- **I/O Factory Pattern**: `HoodieHadoopIOFactory` creates format-specific readers and writers for Parquet, ORC, and HFile formats
- **Configuration Management**: `DFSPropertiesConfiguration` and `HadoopStorageConfiguration` handle Hadoop-specific configuration patterns
- **File System Utilities**: `HadoopFSUtils` provides conversion utilities between Hudi and Hadoop path/configuration abstractions
- **Format-Specific Utilities**: Specialized utilities for working with Parquet, ORC, and HFile formats in Hadoop environments

## Capabilities

### Storage Operations

Core Hadoop FileSystem abstraction with consistency guarantees, retry mechanisms, and unified interface for distributed storage operations.

```java { .api }
public class HoodieHadoopStorage implements HoodieStorage {
    public HoodieHadoopStorage(StoragePath path, StorageConfiguration<?> conf);
    public InputStream open(StoragePath path);
    public OutputStream create(StoragePath path, boolean overwrite);
    public boolean exists(StoragePath path);
    public List<StoragePathInfo> listDirectEntries(StoragePath path);
}
```

[Storage Operations](./storage-operations.md)
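
This page mentions retry mechanisms but does not describe them. As a rough illustration of the general idea only, a bounded retry wrapper around a storage call might look like the sketch below. `RetrySketch` and `withRetries` are hypothetical names that use only the JDK; they are not part of Hudi's API and do not reflect how `HoodieHadoopStorage` is implemented.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of a bounded retry loop; not Hudi's implementation. */
final class RetrySketch {

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * Rethrows the last failure when every attempt fails.
     */
    static <T> T withRetries(Supplier<T> action, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // preserve interrupt status
                        break;
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky call: fails twice, then succeeds on the third attempt.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A fixed sleep keeps the sketch short; real retry policies often add exponential backoff and restrict retries to specific exception types.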

### I/O Factory and File Readers/Writers

Factory pattern for creating format-specific file readers and writers with support for Avro, Parquet, and ORC formats in Hadoop environments.

```java { .api }
public class HoodieHadoopIOFactory implements HoodieIOFactory {
    public HoodieHadoopIOFactory(HoodieStorage storage);
    public HoodieFileReaderFactory getReaderFactory(HoodieRecord.HoodieRecordType recordType);
    public HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType);
}
```

[I/O Operations](./io-operations.md)
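
To make the factory idea concrete, the sketch below shows one common way such a factory can choose a format: dispatching on the file extension. It is a JDK-only illustration under stated assumptions. `FormatDispatchSketch`, its `Format` enum, and the extension strings are hypothetical stand-ins, not Hudi classes, and a real factory would return reader or writer objects rather than strings.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of extension-based format dispatch; not Hudi's API. */
final class FormatDispatchSketch {

    /** Illustrative stand-in for a file-format enum. */
    enum Format {
        PARQUET(".parquet"), ORC(".orc"), HFILE(".hfile");

        final String extension;

        Format(String extension) {
            this.extension = extension;
        }
    }

    /** Returns the format whose extension matches the file name, if any. */
    static Optional<Format> formatFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Format f : Format.values()) {
            if (lower.endsWith(f.extension)) {
                return Optional.of(f);
            }
        }
        return Optional.empty();
    }

    /** Describes the reader that would be chosen; rejects unknown extensions. */
    static String describeReader(String fileName) {
        return formatFor(fileName)
            .map(f -> f.name() + " reader for " + fileName)
            .orElseThrow(() -> new UnsupportedOperationException("Unsupported file: " + fileName));
    }

    public static void main(String[] args) {
        System.out.println(describeReader("part-0001.parquet"));
        System.out.println(describeReader("part-0002.orc"));
    }
}
```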

### File System Utilities

Comprehensive utilities for Hadoop FileSystem operations, path conversions, and integration between Hudi and Hadoop abstractions.

```java { .api }
public class HadoopFSUtils {
    public static <T> FileSystem getFs(String pathStr, StorageConfiguration<T> storageConf);
    public static StoragePath convertToStoragePath(Path path);
    public static Path convertToHadoopPath(StoragePath path);
    public static StorageConfiguration<Configuration> getStorageConf(Configuration conf);
}
```

[File System Utilities](./filesystem-utilities.md)

### Configuration Management

DFS-based configuration management with support for global properties, environment-specific settings, and Hadoop configuration integration.

```java { .api }
public class DFSPropertiesConfiguration extends PropertiesConfig {
    public DFSPropertiesConfiguration(Configuration hadoopConf, StoragePath filePath);
    public static TypedProperties getGlobalProps();
    public static DFSPropertiesConfiguration getGlobalDFSPropsConfiguration();
}
```

[Configuration Management](./configuration-management.md)
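
The layering of global properties under more specific settings can be pictured with plain `java.util.Properties`. The sketch below assumes that specific settings override global ones, which this page does not state, and it parses in-memory strings instead of files on a DFS. `LayeredPropsSketch` and the `example.*` keys are made up for illustration and are not Hudi names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of layered properties; not DFSPropertiesConfiguration. */
final class LayeredPropsSketch {

    /** Parses properties text; a real setup would read from a DFS path instead. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns a new Properties where keys in overrides replace the global values. */
    static Properties merge(Properties global, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(global);
        merged.putAll(overrides); // assumption: specific settings win over global ones
        return merged;
    }

    public static void main(String[] args) {
        Properties global = parse("example.parallelism=2\nexample.name=default\n");
        Properties jobSpecific = parse("example.parallelism=8\n");
        Properties effective = merge(global, jobSpecific);
        System.out.println("example.parallelism=" + effective.getProperty("example.parallelism"));
        System.out.println("example.name=" + effective.getProperty("example.name"));
    }
}
```

Copying into a fresh `Properties` object leaves the global set unchanged, so it can be shared across several specific configurations.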

### Format-Specific Utilities

Specialized utilities for working with Parquet, ORC, and HFile formats, including metadata reading, schema conversions, and format-specific optimizations.

```java { .api }
public class ParquetUtils extends FileFormatUtils {
    public static ParquetMetadata readMetadata(HoodieStorage storage, StoragePath parquetFilePath);
    public static Set<Pair<String, Long>> filterRowKeys(HoodieStorage storage, StoragePath filePath, Set<String> filter);
}

public class AvroOrcUtils {
    public static TypeDescription createOrcSchema(Schema avroSchema);
    public static Schema createAvroSchema(TypeDescription orcSchema);
}
```

[Format Utilities](./format-utilities.md)

## Common Types

```java { .api }
// Storage path abstraction
class StoragePath {
    public StoragePath(String path);
    public String toString();
}

// Storage path information
class StoragePathInfo {
    public StoragePath getPath();
    public long getLength();
    public boolean isDirectory();
    public long getModificationTime();
}

// Storage configuration wrapper
interface StorageConfiguration<T> {
    public T unwrap();
    public String get(String key);
    public void set(String key, String value);
}

// Consistency guard for file system operations
interface ConsistencyGuard {
    public void waitTillFileAppears(StoragePath filePath);
    public void waitTillFileDisappears(StoragePath filePath);
}

```
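
The two `ConsistencyGuard` methods suggest a wait-until-visible pattern. One plausible way to build such a wait is a bounded polling loop, sketched below with the JDK only. `ConsistencyPollSketch` and `waitUntil` are hypothetical names, the local temp file stands in for a file on distributed storage, and nothing here reflects how Hudi's guards are actually implemented.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not Hudi's ConsistencyGuard implementation. */
final class ConsistencyPollSketch {

    /**
     * Polls the condition up to maxChecks times, sleeping between checks.
     * Returns true as soon as the condition holds, false if it never does.
     */
    static boolean waitUntil(BooleanSupplier condition, int maxChecks, long sleepMillis) {
        for (int check = 1; check <= maxChecks; check++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (check < maxChecks) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("guard-sketch", ".tmp");
        // "Appears": wait until the file is visible.
        System.out.println(waitUntil(() -> Files.exists(file), 3, 50L));
        Files.delete(file);
        // "Disappears": wait until the file is no longer visible.
        System.out.println(waitUntil(() -> !Files.exists(file), 3, 50L));
    }
}
```

Returning a boolean keeps the sketch simple; an interface with `void` methods like the one above would more likely signal a timeout by throwing an exception.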