# Apache Hudi Hadoop Common

Apache Hudi Hadoop Common provides the Hadoop integration components for Apache Hudi, an open data lakehouse platform. The library contains the core utilities and abstractions that let Hudi work with the Hadoop ecosystem: HDFS file system operations, configuration management through `DFSPropertiesConfiguration`, and format-specific utilities for Parquet, ORC, and HFile.

## Package Information

- **Package Name**: hudi-hadoop-common
- **Package Type**: maven
- **Language**: Java
- **Group ID**: org.apache.hudi
- **Artifact ID**: hudi-hadoop-common
- **Installation**: Add dependency to pom.xml:

```xml
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-hadoop-common</artifactId>
  <version>1.0.2</version>
</dependency>
```

## Core Imports

```java
import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
import org.apache.hudi.io.hadoop.HoodieHadoopIOFactory;
import org.apache.hudi.common.config.DFSPropertiesConfiguration;
import org.apache.hudi.hadoop.fs.HadoopFSUtils;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.ParquetUtils;
import org.apache.hudi.common.util.AvroOrcUtils;
import org.apache.hudi.common.util.collection.ClosableIterator;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieColumnRangeMetadata;
import org.apache.hudi.common.model.HoodieFileFormat;
import org.apache.hudi.keygen.BaseKeyGenerator;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.ColumnVector;
```

## Basic Usage

```java
import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.storage.StoragePath;

import java.io.InputStream;
import java.io.OutputStream;

// Create Hadoop storage configuration
Configuration hadoopConf = new Configuration();
HadoopStorageConfiguration storageConf = new HadoopStorageConfiguration(hadoopConf);

// Initialize Hadoop storage
StoragePath path = new StoragePath("hdfs://example.com:8020/data");
HoodieHadoopStorage storage = new HoodieHadoopStorage(path, storageConf);

// Perform basic file operations. These calls can throw IOException,
// and the caller is responsible for closing the returned streams.
boolean exists = storage.exists(path);
InputStream inputStream = storage.open(path);
OutputStream outputStream = storage.create(path, true);
```

## Architecture

Apache Hudi Hadoop Common is organized around several key architectural components:

- **Storage Abstraction**: `HoodieHadoopStorage` provides a unified interface for Hadoop FileSystem operations with consistency guarantees and retry mechanisms
- **I/O Factory Pattern**: `HoodieHadoopIOFactory` creates format-specific readers and writers for Parquet, ORC, and HFile formats
- **Configuration Management**: `DFSPropertiesConfiguration` and `HadoopStorageConfiguration` handle Hadoop-specific configuration patterns
- **File System Utilities**: `HadoopFSUtils` provides conversion utilities between Hudi and Hadoop path/configuration abstractions
- **Format-Specific Utilities**: Specialized utilities for working with Parquet, ORC, and HFile formats in Hadoop environments

## Capabilities

### Storage Operations

Core Hadoop FileSystem abstraction with consistency guarantees, retry mechanisms, and a unified interface for distributed storage operations.

```java { .api }
public class HoodieHadoopStorage implements HoodieStorage {
    public HoodieHadoopStorage(StoragePath path, StorageConfiguration<?> conf);
    public InputStream open(StoragePath path);
    public OutputStream create(StoragePath path, boolean overwrite);
    public boolean exists(StoragePath path);
    public List<StoragePathInfo> listDirectEntries(StoragePath path);
}
```

[Storage Operations](./storage-operations.md)
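
The retry mechanisms mentioned in this section can be pictured with a small, JDK-only sketch. It does not use any Hudi class and is not Hudi's implementation: `RetrySketch`, `IoCall`, and `withRetries` are hypothetical names, and the attempt count and fixed delay are assumptions chosen for illustration.

```java
import java.io.IOException;

// Hypothetical stand-in: a bounded retry wrapper around an I/O call.
// This is NOT Hudi's retry implementation, only an illustration of the idea.
final class RetrySketch {

  /** An I/O operation that may fail transiently. */
  @FunctionalInterface
  interface IoCall<T> {
    T run() throws IOException;
  }

  /**
   * Runs the call up to maxAttempts times, sleeping delayMillis between
   * failed attempts, and rethrows the last IOException if every attempt fails.
   */
  static <T> T withRetries(IoCall<T> call, int maxAttempts, long delayMillis)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.run();
      } catch (IOException e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Fails twice, then succeeds on the third attempt.
    String result = withRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("transient failure " + calls[0]);
      }
      return "ok";
    }, 5, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

A wrapper of this shape keeps the retry policy in one place, so a call such as `storage.exists(path)` could be passed in as the lambda without changing the calling code.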

### I/O Factory and File Readers/Writers

Factory for creating format-specific file readers and writers for Parquet, ORC, and HFile in Hadoop environments, selected by record type (for example, Avro).

```java { .api }
public class HoodieHadoopIOFactory implements HoodieIOFactory {
    public HoodieHadoopIOFactory(HoodieStorage storage);
    public HoodieFileReaderFactory getReaderFactory(HoodieRecord.HoodieRecordType recordType);
    public HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType);
}
```

[I/O Operations](./io-operations.md)

### File System Utilities

Utilities for Hadoop FileSystem operations, path conversions, and integration between Hudi and Hadoop abstractions.

```java { .api }
public class HadoopFSUtils {
    public static <T> FileSystem getFs(String pathStr, StorageConfiguration<T> storageConf);
    public static StoragePath convertToStoragePath(Path path);
    public static Path convertToHadoopPath(StoragePath path);
    public static StorageConfiguration<Configuration> getStorageConf(Configuration conf);
}
```

[File System Utilities](./filesystem-utilities.md)
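
Hadoop selects a `FileSystem` implementation from the scheme and authority of a path's URI, so it can help to see which parts a path string such as the one in Basic Usage contains. The following JDK-only sketch does not call `HadoopFSUtils`; `PathPartsSketch` and `describe` are hypothetical names used only for illustration.

```java
import java.net.URI;

// JDK-only illustration of the parts of a storage path string.
// It does not call HadoopFSUtils; PathPartsSketch is a hypothetical name.
final class PathPartsSketch {

  /** Returns "scheme|host|port|path" for a fully qualified path string. */
  static String describe(String pathStr) {
    URI uri = URI.create(pathStr);
    return uri.getScheme() + "|" + uri.getHost() + "|" + uri.getPort() + "|" + uri.getPath();
  }

  public static void main(String[] args) {
    // Same example location as in Basic Usage.
    System.out.println(describe("hdfs://example.com:8020/data"));
    // prints "hdfs|example.com|8020|/data"
  }
}
```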

### Configuration Management

DFS-based configuration management with support for global properties, environment-specific settings, and Hadoop configuration integration.

```java { .api }
public class DFSPropertiesConfiguration extends PropertiesConfig {
    public DFSPropertiesConfiguration(Configuration hadoopConf, StoragePath filePath);
    public static TypedProperties getGlobalProps();
    public static DFSPropertiesConfiguration getGlobalDFSPropsConfiguration();
}
```

[Configuration Management](./configuration-management.md)
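
One way to picture global properties combined with more specific settings is a two-layer lookup, where a specific value wins and the global value is the fallback. The sketch below uses only `java.util.Properties` and is an assumption about the general idea, not a description of how `DFSPropertiesConfiguration` resolves values; `LayeredPropsSketch` and the `example.*` keys are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// JDK-only illustration of layering specific properties over global ones.
// It does not use DFSPropertiesConfiguration; all names here are hypothetical.
final class LayeredPropsSketch {

  /** Parses .properties-formatted text into a Properties object backed by the given defaults. */
  static Properties load(String text, Properties defaults) throws IOException {
    Properties props = new Properties(defaults);
    props.load(new StringReader(text));
    return props;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical keys, used only to show the override order.
    Properties global = load("example.parallelism=4\nexample.format=parquet\n", null);
    Properties table = load("example.parallelism=16\n", global);

    System.out.println(table.getProperty("example.parallelism")); // prints "16"
    System.out.println(table.getProperty("example.format"));      // prints "parquet"
  }
}
```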

### Format-Specific Utilities

Specialized utilities for working with Parquet, ORC, and HFile formats, including metadata reading, schema conversions, and format-specific optimizations.

```java { .api }
public class ParquetUtils extends FileFormatUtils {
    public static ParquetMetadata readMetadata(HoodieStorage storage, StoragePath parquetFilePath);
    public static Set<Pair<String, Long>> filterRowKeys(HoodieStorage storage, StoragePath filePath, Set<String> filter);
}

public class AvroOrcUtils {
    public static TypeDescription createOrcSchema(Schema avroSchema);
    public static Schema createAvroSchema(TypeDescription orcSchema);
}
```

[Format Utilities](./format-utilities.md)

## Common Types

```java { .api }
// Storage path abstraction
class StoragePath {
    public StoragePath(String path);
    public String toString();
}

// Storage path information
class StoragePathInfo {
    public StoragePath getPath();
    public long getLength();
    public boolean isDirectory();
    public long getModificationTime();
}

// Storage configuration wrapper
interface StorageConfiguration<T> {
    public T unwrap();
    public String get(String key);
    public void set(String key, String value);
}

// Consistency guard for file system operations
interface ConsistencyGuard {
    public void waitTillFileAppears(StoragePath filePath);
    public void waitTillFileDisappears(StoragePath filePath);
}
```
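
The two `ConsistencyGuard` methods suggest a wait-until-visible contract. A minimal sketch of that idea, assuming a simple polling loop with a fixed number of checks, is shown below. It uses `java.nio.file.Path` as a stand-in for `StoragePath` and is not Hudi's implementation; `PollingGuardSketch`, `maxChecks`, and `sleepMillis` are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeoutException;

// JDK-only illustration of a polling consistency check.
// java.nio.file.Path stands in for StoragePath; this is NOT Hudi's implementation.
final class PollingGuardSketch {
  private final int maxChecks;
  private final long sleepMillis;

  PollingGuardSketch(int maxChecks, long sleepMillis) {
    this.maxChecks = maxChecks;
    this.sleepMillis = sleepMillis;
  }

  /** Blocks until the file is visible, or throws after maxChecks polls. */
  void waitTillFileAppears(Path file) throws TimeoutException, InterruptedException {
    waitFor(file, true);
  }

  /** Blocks until the file is no longer visible, or throws after maxChecks polls. */
  void waitTillFileDisappears(Path file) throws TimeoutException, InterruptedException {
    waitFor(file, false);
  }

  private void waitFor(Path file, boolean shouldExist)
      throws TimeoutException, InterruptedException {
    for (int check = 1; check <= maxChecks; check++) {
      if (Files.exists(file) == shouldExist) {
        return;
      }
      if (check < maxChecks) {
        Thread.sleep(sleepMillis);
      }
    }
    throw new TimeoutException("Gave up after " + maxChecks + " checks: " + file);
  }

  public static void main(String[] args) throws Exception {
    PollingGuardSketch guard = new PollingGuardSketch(3, 50L);
    Path tmp = Files.createTempFile("guard-sketch", ".tmp");
    guard.waitTillFileAppears(tmp);     // returns at once: the file exists
    Files.delete(tmp);
    guard.waitTillFileDisappears(tmp);  // returns at once: the file is gone
    System.out.println("visibility checks passed");
  }
}
```

Bounding the number of checks means a caller gets a clear failure instead of waiting indefinitely when a file never becomes visible.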