Apache Flink ORC format support without Hive dependencies - provides ORC file reading and writing capabilities for Flink applications using a standalone ORC implementation
npx @tessl/cli install tessl/maven-org-apache-flink--flink-orc-nohive_2.11@1.14.00
# Apache Flink ORC NoHive
Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.
## Package Information

- **Package Name**: flink-orc-nohive_2.11
- **Package Type**: maven
- **Language**: Java
- **Group ID**: org.apache.flink
- **Artifact ID**: flink-orc-nohive_2.11
- **Installation**: Add to your Maven `pom.xml`:

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-orc-nohive_2.11</artifactId>
    <version>1.14.6</version>
</dependency>
```
## Core Imports

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
```
## Basic Usage
33
34
```java
35
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
36
import org.apache.flink.table.data.RowData;
37
import org.apache.flink.table.types.logical.LogicalType;
38
import org.apache.flink.table.types.logical.VarCharType;
39
import org.apache.flink.table.types.logical.IntType;
40
import org.apache.hadoop.conf.Configuration;
41
42
// Create bulk writer factory for ORC files
43
Configuration hadoopConfig = new Configuration();
44
String orcSchema = "struct<name:string,age:int>";
45
LogicalType[] fieldTypes = {new VarCharType(), new IntType()};
46
47
OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
48
hadoopConfig,
49
orcSchema,
50
fieldTypes
51
);
52
53
// Use with Flink's streaming file sink
54
StreamingFileSink<RowData> sink = StreamingFileSink
55
.forBulkFormat(outputPath, writerFactory)
56
.build();
57
```
## Architecture

The flink-orc-nohive module is organized around several key components:

- **Writer Factory**: `OrcNoHiveBulkWriterFactory` creates bulk writers for efficient ORC file writing
- **Input Formats**: Helper classes for creating columnar input formats with partition support
- **Vector Layer**: Custom vector implementations that adapt ORC vectors to Flink's vector API
- **Shim Layer**: `OrcNoHiveShim` provides an ORC reader implementation without Hive dependencies
- **Physical Writer**: Handles ORC file writing with relocated Protobuf classes for no-hive compatibility
## Capabilities

### Bulk Writing

Factory for creating ORC bulk writers that efficiently write Flink RowData to ORC files without Hive dependencies.

```java { .api }
public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {
    public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);
    public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
```

[Bulk Writing](./bulk-writing.md)
83
84
### Columnar Reading

Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading.

```java { .api }
public class OrcNoHiveColumnarRowInputFormat {
    public static <SplitT extends FileSourceSplit>
            OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
                    Configuration hadoopConfig,
                    RowType tableType,
                    List<String> partitionKeys,
                    PartitionFieldExtractor<SplitT> extractor,
                    int[] selectedFields,
                    List<OrcFilters.Predicate> conjunctPredicates,
                    int batchSize);
}

public class OrcNoHiveSplitReaderUtil {
    public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
            Configuration conf,
            String[] fullFieldNames,
            DataType[] fullFieldTypes,
            Map<String, Object> partitionSpec,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            int batchSize,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;
}
```

[Columnar Reading](./columnar-reading.md)
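As a sketch of how the split reader utility is typically driven: the snippet below reads a whole ORC file as a single split. The file path, field names, and schema are hypothetical, and it assumes the Flink 1.14 `OrcColumnarRowSplitReader` API (`reachedEnd()` / `nextRecord()`), so treat it as an illustration rather than a verified recipe.

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowSplitReader;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.DataType;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

public class OrcReadSketch {
    public static void main(String[] args) throws IOException {
        // Field names and types must match the ORC file's schema.
        String[] fieldNames = {"name", "age"};
        DataType[] fieldTypes = {DataTypes.STRING(), DataTypes.INT()};

        OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
                OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
                        new Configuration(),
                        fieldNames,
                        fieldTypes,
                        Collections.emptyMap(),    // no partition columns
                        new int[] {0, 1},          // project both fields
                        Collections.emptyList(),   // no pushed-down predicates
                        1024,                      // rows per batch
                        new Path("/tmp/data.orc"), // hypothetical input file
                        0L,                        // split start
                        Long.MAX_VALUE);           // split length: whole file

        while (!reader.reachedEnd()) {
            RowData row = reader.nextRecord(null);
            System.out.println(row.getString(0) + ", " + row.getInt(1));
        }
        reader.close();
    }
}
```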
### Vector Processing

High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing.

```java { .api }
public abstract class AbstractOrcNoHiveVector implements ColumnVector {
    public boolean isNullAt(int i);
    public static ColumnVector createFlinkVector(ColumnVector vector);
    public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
```

[Vector Processing](./vector-processing.md)
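A minimal sketch of the constant-vector factory, which is how partition-column values (not stored in the file itself) can be surfaced per batch. The `org.apache.flink.table.data.vector` import paths are assumed for Flink 1.14; check your Flink version's package layout.

```java
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
import org.apache.flink.table.data.vector.ColumnVector;
import org.apache.flink.table.data.vector.IntColumnVector;
import org.apache.flink.table.types.logical.IntType;

public class ConstantVectorSketch {
    public static void main(String[] args) {
        // Wrap a constant value as a Flink-side vector: every one of the
        // 1024 slots reads as the same non-null value.
        ColumnVector partitionValue =
                AbstractOrcNoHiveVector.createFlinkVectorFromConstant(
                        new IntType(), 42, 1024);

        int value = ((IntColumnVector) partitionValue).getInt(0);
        boolean isNull = partitionValue.isNullAt(0);
        System.out.println(value + " null=" + isNull);
    }
}
```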
### ORC Integration

Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies.

```java { .api }
public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {
    public RecordReader createRecordReader(
            Configuration conf,
            TypeDescription schema,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;

    public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);

    public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
```

[ORC Integration](./orc-integration.md)
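The shim methods above compose into a simple read loop. This is a sketch under assumptions: the file path and schema are hypothetical, and `OrcNoHiveBatchWrapper.getBatch()` is assumed from the batch-wrapper interface the shim returns.

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

public class ShimReadSketch {
    public static void main(String[] args) throws IOException {
        OrcNoHiveShim shim = new OrcNoHiveShim();
        TypeDescription schema =
                TypeDescription.fromString("struct<name:string,age:int>");

        // Open a low-level reader over the whole file as one split.
        RecordReader reader = shim.createRecordReader(
                new Configuration(),
                schema,
                new int[] {0, 1},          // read both columns
                Collections.emptyList(),   // no predicates
                new Path("/tmp/data.orc"), // hypothetical input file
                0L,
                Long.MAX_VALUE);

        // Drain the file batch by batch.
        OrcNoHiveBatchWrapper wrapper = shim.createBatchWrapper(schema, 1024);
        VectorizedRowBatch batch = wrapper.getBatch();
        while (shim.nextBatch(reader, batch)) {
            // batch.size rows of columnar data are now available in batch.cols
        }
        reader.close();
    }
}
```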
## Supported Data Types

The module supports all standard Flink logical types:

- **Primitive Types**: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE
- **String Types**: CHAR, VARCHAR
- **Binary Types**: BINARY, VARBINARY
- **Temporal Types**: DATE, TIME_WITHOUT_TIME_ZONE, TIMESTAMP_WITHOUT_TIME_ZONE, TIMESTAMP_WITH_LOCAL_TIME_ZONE
- **Decimal Types**: DECIMAL with configurable precision and scale
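In practice these types appear in matched pairs: the ORC schema string and the `LogicalType[]` passed to the writer factory must describe the same fields in the same order. A small hypothetical pairing, assuming the standard Flink logical-type classes:

```java
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.DecimalType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.TimestampType;

public class SchemaPairingSketch {
    // ORC type string and Flink logical types for the same three fields;
    // a mismatch between the two surfaces as a runtime exception.
    static final String ORC_SCHEMA =
            "struct<id:bigint,price:decimal(10,2),ts:timestamp>";

    static final LogicalType[] FIELD_TYPES = {
            new BigIntType(),
            new DecimalType(10, 2), // precision 10, scale 2
            new TimestampType(3)    // millisecond precision
    };
}
```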
## Error Handling

All public methods that perform I/O may throw `IOException`. The module throws `UnsupportedOperationException` for unsupported data types or operations.

Common exceptions include:

- `IOException`: file system operations, ORC file reading/writing errors
- `UnsupportedOperationException`: unsupported logical types or vector operations
- `ClassNotFoundException`: serialization/deserialization errors in factory classes
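A hedged sketch of handling these exceptions around the bulk-writer path; `writerFactory`, `out`, and `row` are assumed to be set up as in Basic Usage, and the `BulkWriter` lifecycle (`addElement`/`flush`/`finish`) follows Flink's `BulkWriter` interface.

```java
import java.io.IOException;

import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.table.data.RowData;

public class WriteErrorHandlingSketch {
    static void writeRow(OrcNoHiveBulkWriterFactory writerFactory,
                         FSDataOutputStream out,
                         RowData row) {
        try {
            BulkWriter<RowData> writer = writerFactory.create(out);
            writer.addElement(row); // UnsupportedOperationException for unmapped types
            writer.flush();
            writer.finish();        // finalizes the ORC footer
        } catch (IOException e) {
            // file system or ORC encoding failure
            throw new RuntimeException("Failed to write ORC file", e);
        }
    }
}
```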