Apache Flink ORC format support without Hive dependencies - provides ORC file reading and writing capabilities for Flink applications using a standalone ORC implementation
```
npx @tessl/cli install tessl/maven-org-apache-flink--flink-orc-nohive-2-11@1.14.0
```

Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.
pom.xml:

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-orc-nohive_2.11</artifactId>
    <version>1.14.6</version>
</dependency>
```

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
```

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.flink.table.types.logical.IntType;
import org.apache.hadoop.conf.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Create bulk writer factory for ORC files
Configuration hadoopConfig = new Configuration();
String orcSchema = "struct<name:string,age:int>";
LogicalType[] fieldTypes = {new VarCharType(VarCharType.MAX_LENGTH), new IntType()};
OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
        hadoopConfig,
        orcSchema,
        fieldTypes);

// Use with Flink's streaming file sink (the output directory is illustrative)
Path outputPath = new Path("/tmp/orc-output");
StreamingFileSink<RowData> sink = StreamingFileSink
        .forBulkFormat(outputPath, writerFactory)
        .build();
```

The flink-orc-nohive module is organized around several key components:
- OrcNoHiveBulkWriterFactory creates bulk writers for efficient ORC file writing
- OrcNoHiveColumnarRowInputFormat and OrcNoHiveSplitReaderUtil create columnar input formats and split readers for ORC file reading
- AbstractOrcNoHiveVector and related classes adapt ORC column vectors to Flink's vector API
- OrcNoHiveShim provides an ORC reader implementation without Hive dependencies

Factory for creating ORC bulk writers that efficiently write Flink RowData to ORC files without Hive dependencies:
```java
public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {

    public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);

    public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
```
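Beyond the StreamingFileSink path shown in the quick-start, a writer can also be created directly from the factory. A minimal sketch, reusing writerFactory from above and assuming a local filesystem target; the path and row values are illustrative:

```java
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.StringData;

// Illustrative target file on the local filesystem.
Path target = new Path("file:///tmp/users.orc");
try (FSDataOutputStream out =
        target.getFileSystem().create(target, FileSystem.WriteMode.OVERWRITE)) {
    BulkWriter<RowData> writer = writerFactory.create(out);
    writer.addElement(GenericRowData.of(StringData.fromString("alice"), 30));
    writer.addElement(GenericRowData.of(StringData.fromString("bob"), 25));
    writer.finish(); // seals the ORC file; the surrounding try closes the stream
}
```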
Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading:

```java
public class OrcNoHiveColumnarRowInputFormat {

    public static <SplitT extends FileSourceSplit>
            OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
                    Configuration hadoopConfig,
                    RowType tableType,
                    List<String> partitionKeys,
                    PartitionFieldExtractor<SplitT> extractor,
                    int[] selectedFields,
                    List<OrcFilters.Predicate> conjunctPredicates,
                    int batchSize);
}
```
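A hedged sketch of building a partitioned format for the two-column schema used above; the no-op extractor lambda, projected field indices, and batch size are illustrative assumptions (with an empty partition-key list the extractor is never consulted):

```java
import java.util.ArrayList;
import java.util.Collections;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.table.types.logical.RowType;

RowType tableType = RowType.of(
        new LogicalType[] {new VarCharType(VarCharType.MAX_LENGTH), new IntType()},
        new String[] {"name", "age"});

OrcColumnarRowFileInputFormat<VectorizedRowBatch, FileSourceSplit> format =
        OrcNoHiveColumnarRowInputFormat.createPartitionedFormat(
                new Configuration(),                    // Hadoop configuration
                tableType,
                Collections.emptyList(),                // no partition keys
                (split, fieldName, fieldType) -> null,  // extractor unused here
                new int[] {0, 1},                       // project both columns
                new ArrayList<>(),                      // no pushed-down predicates
                2048);                                  // rows per vectorized batch
```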
```java
public class OrcNoHiveSplitReaderUtil {

    public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
            Configuration conf,
            String[] fullFieldNames,
            DataType[] fullFieldTypes,
            Map<String, Object> partitionSpec,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            int batchSize,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;
}
```
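A minimal read-loop sketch, assuming the reachedEnd()/nextRecord() iteration contract of Flink's ORC split readers; the file path and field layout are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

org.apache.flink.core.fs.Path orcFile = new org.apache.flink.core.fs.Path("/data/users.orc");
long fileLength = orcFile.getFileSystem().getFileStatus(orcFile).getLen();

OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
        OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
                new Configuration(),
                new String[] {"name", "age"},
                new DataType[] {DataTypes.STRING(), DataTypes.INT()},
                Collections.emptyMap(),   // no partition columns
                new int[] {0, 1},         // read both fields
                new ArrayList<>(),        // no predicates
                2048,
                orcFile,
                0L,                       // whole file as a single split
                fileLength);
try {
    while (!reader.reachedEnd()) {
        RowData row = reader.nextRecord(null);
        // process row ...
    }
} finally {
    reader.close();
}
```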
High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing:

```java
public abstract class AbstractOrcNoHiveVector implements ColumnVector {

    public boolean isNullAt(int i);

    public static ColumnVector createFlinkVector(ColumnVector vector);

    public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
```
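A short sketch of both factory methods, assuming the nohive ORC artifact relocates the vector classes under org.apache.orc.storage and that Flink 1.14 exposes its vector interfaces under org.apache.flink.table.data.vector:

```java
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;

// Wrap a native ORC vector so Flink's columnar readers can consume it.
LongColumnVector orcLongs = new LongColumnVector(1024);
orcLongs.vector[0] = 42L;
org.apache.flink.table.data.vector.ColumnVector flinkVector =
        AbstractOrcNoHiveVector.createFlinkVector(orcLongs);

// Constant vectors back partition columns whose value is fixed per split.
org.apache.flink.table.data.vector.ColumnVector partitionYear =
        AbstractOrcNoHiveVector.createFlinkVectorFromConstant(new IntType(), 2021, 1024);
```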
Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies:

```java
public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {

    public RecordReader createRecordReader(
            Configuration conf,
            TypeDescription schema,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;

    public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);

    public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
```
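A hedged sketch of driving the shim directly in a batch loop; the schema and path are illustrative, and the getBatch() accessor is assumed from Flink's OrcVectorizedBatchWrapper contract:

```java
import java.util.ArrayList;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

OrcNoHiveShim shim = new OrcNoHiveShim();
TypeDescription schema = TypeDescription.fromString("struct<name:string,age:int>");
org.apache.flink.core.fs.Path orcFile = new org.apache.flink.core.fs.Path("/data/users.orc");
long fileLength = orcFile.getFileSystem().getFileStatus(orcFile).getLen();

RecordReader reader = shim.createRecordReader(
        new Configuration(), schema,
        new int[] {0, 1},        // selected fields
        new ArrayList<>(),       // no predicates
        orcFile, 0L, fileLength);
OrcNoHiveBatchWrapper batch = shim.createBatchWrapper(schema, 2048);
try {
    while (shim.nextBatch(reader, batch.getBatch())) {
        // batch.getBatch().size rows are ready for columnar processing
    }
} finally {
    reader.close();
}
```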
The module supports all standard Flink logical types. All public methods may throw IOException for I/O operations, and UnsupportedOperationException is thrown for unsupported data types or operations.

Common exceptions include:

- IOException: file system operations, ORC file reading/writing errors
- UnsupportedOperationException: unsupported logical types or vector operations
- ClassNotFoundException: serialization/deserialization errors in factory classes
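A short sketch of where these exceptions surface when writing through the factory; the target path is illustrative and writerFactory is the instance from the quick-start:

```java
import java.io.IOException;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

Path target = new Path("/tmp/orc-output/part-0.orc");
try (FSDataOutputStream out =
        target.getFileSystem().create(target, FileSystem.WriteMode.OVERWRITE)) {
    writerFactory.create(out).finish();   // create(...) may throw IOException
} catch (UnsupportedOperationException e) {
    // a field's logical type has no ORC vector writer
} catch (IOException e) {
    // filesystem or ORC encoding failure
}
```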