Apache Flink ORC format support without Hive dependencies - provides ORC file reading and writing capabilities for Flink applications using a standalone ORC implementation
npx @tessl/cli install tessl/maven-org-apache-flink--flink-orc-nohive_2.11@1.14.00
# Apache Flink ORC NoHive
Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.
## Package Information

- **Package Name**: flink-orc-nohive_2.11
- **Package Type**: maven
- **Language**: Java
- **Group ID**: org.apache.flink
- **Artifact ID**: flink-orc-nohive_2.11
- **Installation**: Add to your Maven `pom.xml`:

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-orc-nohive_2.11</artifactId>
    <version>1.14.6</version>
</dependency>
```
## Core Imports

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
```
## Basic Usage
33
34
```java
35
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
36
import org.apache.flink.table.data.RowData;
37
import org.apache.flink.table.types.logical.LogicalType;
38
import org.apache.flink.table.types.logical.VarCharType;
39
import org.apache.flink.table.types.logical.IntType;
40
import org.apache.hadoop.conf.Configuration;
41
42
// Create bulk writer factory for ORC files
43
Configuration hadoopConfig = new Configuration();
44
String orcSchema = "struct<name:string,age:int>";
45
LogicalType[] fieldTypes = {new VarCharType(), new IntType()};
46
47
OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
48
hadoopConfig,
49
orcSchema,
50
fieldTypes
51
);
52
53
// Use with Flink's streaming file sink
54
StreamingFileSink<RowData> sink = StreamingFileSink
55
.forBulkFormat(outputPath, writerFactory)
56
.build();
57
```
## Architecture

The flink-orc-nohive module is organized around several key components:

- **Writer Factory**: `OrcNoHiveBulkWriterFactory` creates bulk writers for efficient ORC file writing
- **Input Formats**: Helper classes for creating columnar input formats with partition support
- **Vector Layer**: Custom vector implementations that adapt ORC vectors to Flink's vector API
- **Shim Layer**: `OrcNoHiveShim` provides an ORC reader implementation without Hive dependencies
- **Physical Writer**: Handles ORC file writing with relocated Protobuf classes for no-hive compatibility
## Capabilities

### Bulk Writing

Factory for creating ORC bulk writers that efficiently write Flink RowData to ORC files without Hive dependencies.

```java { .api }
public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {
    public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);
    public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
```

[Bulk Writing](./bulk-writing.md)
83
84
### Columnar Reading

Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading.

```java { .api }
public class OrcNoHiveColumnarRowInputFormat {
    public static <SplitT extends FileSourceSplit>
            OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
                    Configuration hadoopConfig,
                    RowType tableType,
                    List<String> partitionKeys,
                    PartitionFieldExtractor<SplitT> extractor,
                    int[] selectedFields,
                    List<OrcFilters.Predicate> conjunctPredicates,
                    int batchSize);
}

public class OrcNoHiveSplitReaderUtil {
    public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
            Configuration conf,
            String[] fullFieldNames,
            DataType[] fullFieldTypes,
            Map<String, Object> partitionSpec,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            int batchSize,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;
}
```

[Columnar Reading](./columnar-reading.md)
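As a sketch of how the split reader utility is typically driven: the snippet below reads a whole ORC file as a single split. The file path, field names, and schema are hypothetical, and it assumes the Flink 1.14 `OrcColumnarRowSplitReader` API (`reachedEnd()` / `nextRecord()`), so treat it as an illustration rather than a verified recipe.

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowSplitReader;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.DataType;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

public class OrcReadSketch {
    public static void main(String[] args) throws IOException {
        // Field names and types must match the ORC file's schema.
        String[] fieldNames = {"name", "age"};
        DataType[] fieldTypes = {DataTypes.STRING(), DataTypes.INT()};

        OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
                OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
                        new Configuration(),
                        fieldNames,
                        fieldTypes,
                        Collections.emptyMap(),    // no partition columns
                        new int[] {0, 1},          // project both fields
                        Collections.emptyList(),   // no pushed-down predicates
                        1024,                      // rows per batch
                        new Path("/tmp/data.orc"), // hypothetical input file
                        0L,                        // split start
                        Long.MAX_VALUE);           // split length: whole file

        while (!reader.reachedEnd()) {
            RowData row = reader.nextRecord(null);
            System.out.println(row.getString(0) + ", " + row.getInt(1));
        }
        reader.close();
    }
}
```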
### Vector Processing

High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing.

```java { .api }
public abstract class AbstractOrcNoHiveVector implements ColumnVector {
    public boolean isNullAt(int i);
    public static ColumnVector createFlinkVector(ColumnVector vector);
    public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
```

[Vector Processing](./vector-processing.md)
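A minimal sketch of the constant-vector factory, which is how partition-column values (not stored in the file itself) can be surfaced per batch. The `org.apache.flink.table.data.vector` import paths are assumed for Flink 1.14; check your Flink version's package layout.

```java
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
import org.apache.flink.table.data.vector.ColumnVector;
import org.apache.flink.table.data.vector.IntColumnVector;
import org.apache.flink.table.types.logical.IntType;

public class ConstantVectorSketch {
    public static void main(String[] args) {
        // Wrap a constant value as a Flink-side vector: every one of the
        // 1024 slots reads as the same non-null value.
        ColumnVector partitionValue =
                AbstractOrcNoHiveVector.createFlinkVectorFromConstant(
                        new IntType(), 42, 1024);

        int value = ((IntColumnVector) partitionValue).getInt(0);
        boolean isNull = partitionValue.isNullAt(0);
        System.out.println(value + " null=" + isNull);
    }
}
```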
### ORC Integration

Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies.

```java { .api }
public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {
    public RecordReader createRecordReader(
            Configuration conf,
            TypeDescription schema,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;

    public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);

    public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
```

[ORC Integration](./orc-integration.md)
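The shim methods above compose into a simple read loop. This is a sketch under assumptions: the file path and schema are hypothetical, and `OrcNoHiveBatchWrapper.getBatch()` is assumed from the batch-wrapper interface the shim returns.

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

public class ShimReadSketch {
    public static void main(String[] args) throws IOException {
        OrcNoHiveShim shim = new OrcNoHiveShim();
        TypeDescription schema =
                TypeDescription.fromString("struct<name:string,age:int>");

        // Open a low-level reader over the whole file as one split.
        RecordReader reader = shim.createRecordReader(
                new Configuration(),
                schema,
                new int[] {0, 1},          // read both columns
                Collections.emptyList(),   // no predicates
                new Path("/tmp/data.orc"), // hypothetical input file
                0L,
                Long.MAX_VALUE);

        // Drain the file batch by batch.
        OrcNoHiveBatchWrapper wrapper = shim.createBatchWrapper(schema, 1024);
        VectorizedRowBatch batch = wrapper.getBatch();
        while (shim.nextBatch(reader, batch)) {
            // batch.size rows of columnar data are now available in batch.cols
        }
        reader.close();
    }
}
```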
## Supported Data Types

The module supports all standard Flink logical types:

- **Primitive Types**: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE
- **String Types**: CHAR, VARCHAR
- **Binary Types**: BINARY, VARBINARY
- **Temporal Types**: DATE, TIME_WITHOUT_TIME_ZONE, TIMESTAMP_WITHOUT_TIME_ZONE, TIMESTAMP_WITH_LOCAL_TIME_ZONE
- **Decimal Types**: DECIMAL with configurable precision and scale
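In practice these types appear in matched pairs: the ORC schema string and the `LogicalType[]` passed to the writer factory must describe the same fields in the same order. A small hypothetical pairing, assuming the standard Flink logical-type classes:

```java
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.DecimalType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.TimestampType;

public class SchemaPairingSketch {
    // ORC type string and Flink logical types for the same three fields;
    // a mismatch between the two surfaces as a runtime exception.
    static final String ORC_SCHEMA =
            "struct<id:bigint,price:decimal(10,2),ts:timestamp>";

    static final LogicalType[] FIELD_TYPES = {
            new BigIntType(),
            new DecimalType(10, 2), // precision 10, scale 2
            new TimestampType(3)    // millisecond precision
    };
}
```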
## Error Handling

All public methods that perform I/O may throw `IOException`. The module throws `UnsupportedOperationException` for unsupported data types or operations.

Common exceptions include:

- `IOException`: file system operations, ORC file reading/writing errors
- `UnsupportedOperationException`: unsupported logical types or vector operations
- `ClassNotFoundException`: serialization/deserialization errors in factory classes
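A hedged sketch of handling these exceptions around the bulk-writer path; `writerFactory`, `out`, and `row` are assumed to be set up as in Basic Usage, and the `BulkWriter` lifecycle (`addElement`/`flush`/`finish`) follows Flink's `BulkWriter` interface.

```java
import java.io.IOException;

import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.table.data.RowData;

public class WriteErrorHandlingSketch {
    static void writeRow(OrcNoHiveBulkWriterFactory writerFactory,
                         FSDataOutputStream out,
                         RowData row) {
        try {
            BulkWriter<RowData> writer = writerFactory.create(out);
            writer.addElement(row); // UnsupportedOperationException for unmapped types
            writer.flush();
            writer.finish();        // finalizes the ORC footer
        } catch (IOException e) {
            // file system or ORC encoding failure
            throw new RuntimeException("Failed to write ORC file", e);
        }
    }
}
```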