Apache Flink ORC format support without Hive dependencies - provides ORC file reading and writing capabilities for Flink applications using a standalone ORC implementation
```
npx @tessl/cli install tessl/maven-org-apache-flink--flink-orc-nohive-2-11@1.14.0
```

Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.
pom.xml:

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-orc-nohive_2.11</artifactId>
    <version>1.14.6</version>
</dependency>
```

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
```

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.flink.table.types.logical.IntType;
import org.apache.hadoop.conf.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Create bulk writer factory for ORC files
Configuration hadoopConfig = new Configuration();
String orcSchema = "struct<name:string,age:int>";
LogicalType[] fieldTypes = {new VarCharType(VarCharType.MAX_LENGTH), new IntType()};
OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
        hadoopConfig,
        orcSchema,
        fieldTypes);

// Use with Flink's streaming file sink (the output directory is illustrative)
Path outputPath = new Path("/tmp/orc-output");
StreamingFileSink<RowData> sink = StreamingFileSink
        .forBulkFormat(outputPath, writerFactory)
        .build();
```

The flink-orc-nohive module is organized around several key components:
- OrcNoHiveBulkWriterFactory creates bulk writers for efficient ORC file writing
- OrcNoHiveColumnarRowInputFormat and OrcNoHiveSplitReaderUtil create columnar input formats and split readers for ORC file reading
- AbstractOrcNoHiveVector and related classes adapt ORC column vectors to Flink's vector API
- OrcNoHiveShim provides an ORC reader implementation without Hive dependencies

Factory for creating ORC bulk writers that efficiently write Flink RowData to ORC files without Hive dependencies:
```java
public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {

    public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);

    public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
```
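Beyond the StreamingFileSink path shown in the quick-start, a writer can also be created directly from the factory. A minimal sketch, reusing writerFactory from above and assuming a local filesystem target; the path and row values are illustrative:

```java
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.StringData;

// Illustrative target file on the local filesystem.
Path target = new Path("file:///tmp/users.orc");
try (FSDataOutputStream out =
        target.getFileSystem().create(target, FileSystem.WriteMode.OVERWRITE)) {
    BulkWriter<RowData> writer = writerFactory.create(out);
    writer.addElement(GenericRowData.of(StringData.fromString("alice"), 30));
    writer.addElement(GenericRowData.of(StringData.fromString("bob"), 25));
    writer.finish(); // seals the ORC file; the surrounding try closes the stream
}
```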
Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading:

```java
public class OrcNoHiveColumnarRowInputFormat {

    public static <SplitT extends FileSourceSplit>
            OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
                    Configuration hadoopConfig,
                    RowType tableType,
                    List<String> partitionKeys,
                    PartitionFieldExtractor<SplitT> extractor,
                    int[] selectedFields,
                    List<OrcFilters.Predicate> conjunctPredicates,
                    int batchSize);
}
```
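A hedged sketch of building a partitioned format for the two-column schema used above; the no-op extractor lambda, projected field indices, and batch size are illustrative assumptions (with an empty partition-key list the extractor is never consulted):

```java
import java.util.ArrayList;
import java.util.Collections;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.table.types.logical.RowType;

RowType tableType = RowType.of(
        new LogicalType[] {new VarCharType(VarCharType.MAX_LENGTH), new IntType()},
        new String[] {"name", "age"});

OrcColumnarRowFileInputFormat<VectorizedRowBatch, FileSourceSplit> format =
        OrcNoHiveColumnarRowInputFormat.createPartitionedFormat(
                new Configuration(),                    // Hadoop configuration
                tableType,
                Collections.emptyList(),                // no partition keys
                (split, fieldName, fieldType) -> null,  // extractor unused here
                new int[] {0, 1},                       // project both columns
                new ArrayList<>(),                      // no pushed-down predicates
                2048);                                  // rows per vectorized batch
```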
```java
public class OrcNoHiveSplitReaderUtil {

    public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
            Configuration conf,
            String[] fullFieldNames,
            DataType[] fullFieldTypes,
            Map<String, Object> partitionSpec,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            int batchSize,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;
}
```
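A minimal read-loop sketch, assuming the reachedEnd()/nextRecord() iteration contract of Flink's ORC split readers; the file path and field layout are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

org.apache.flink.core.fs.Path orcFile = new org.apache.flink.core.fs.Path("/data/users.orc");
long fileLength = orcFile.getFileSystem().getFileStatus(orcFile).getLen();

OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
        OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
                new Configuration(),
                new String[] {"name", "age"},
                new DataType[] {DataTypes.STRING(), DataTypes.INT()},
                Collections.emptyMap(),   // no partition columns
                new int[] {0, 1},         // read both fields
                new ArrayList<>(),        // no predicates
                2048,
                orcFile,
                0L,                       // whole file as a single split
                fileLength);
try {
    while (!reader.reachedEnd()) {
        RowData row = reader.nextRecord(null);
        // process row ...
    }
} finally {
    reader.close();
}
```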
High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing:

```java
public abstract class AbstractOrcNoHiveVector implements ColumnVector {

    public boolean isNullAt(int i);

    public static ColumnVector createFlinkVector(ColumnVector vector);

    public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
```
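A short sketch of both factory methods, assuming the nohive ORC artifact relocates the vector classes under org.apache.orc.storage and that Flink 1.14 exposes its vector interfaces under org.apache.flink.table.data.vector:

```java
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;

// Wrap a native ORC vector so Flink's columnar readers can consume it.
LongColumnVector orcLongs = new LongColumnVector(1024);
orcLongs.vector[0] = 42L;
org.apache.flink.table.data.vector.ColumnVector flinkVector =
        AbstractOrcNoHiveVector.createFlinkVector(orcLongs);

// Constant vectors back partition columns whose value is fixed per split.
org.apache.flink.table.data.vector.ColumnVector partitionYear =
        AbstractOrcNoHiveVector.createFlinkVectorFromConstant(new IntType(), 2021, 1024);
```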
Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies:

```java
public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {

    public RecordReader createRecordReader(
            Configuration conf,
            TypeDescription schema,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            org.apache.flink.core.fs.Path path,
            long splitStart,
            long splitLength) throws IOException;

    public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);

    public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
```
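A hedged sketch of driving the shim directly in a batch loop; the schema and path are illustrative, and the getBatch() accessor is assumed from Flink's OrcVectorizedBatchWrapper contract:

```java
import java.util.ArrayList;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

OrcNoHiveShim shim = new OrcNoHiveShim();
TypeDescription schema = TypeDescription.fromString("struct<name:string,age:int>");
org.apache.flink.core.fs.Path orcFile = new org.apache.flink.core.fs.Path("/data/users.orc");
long fileLength = orcFile.getFileSystem().getFileStatus(orcFile).getLen();

RecordReader reader = shim.createRecordReader(
        new Configuration(), schema,
        new int[] {0, 1},        // selected fields
        new ArrayList<>(),       // no predicates
        orcFile, 0L, fileLength);
OrcNoHiveBatchWrapper batch = shim.createBatchWrapper(schema, 2048);
try {
    while (shim.nextBatch(reader, batch.getBatch())) {
        // batch.getBatch().size rows are ready for columnar processing
    }
} finally {
    reader.close();
}
```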
The module supports all standard Flink logical types. All public methods may throw IOException for I/O operations, and UnsupportedOperationException is thrown for unsupported data types or operations.

Common exceptions include:

- IOException: file system operations, ORC file reading/writing errors
- UnsupportedOperationException: unsupported logical types or vector operations
- ClassNotFoundException: serialization/deserialization errors in factory classes
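A short sketch of where these exceptions surface when writing through the factory; the target path is illustrative and writerFactory is the instance from the quick-start:

```java
import java.io.IOException;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

Path target = new Path("/tmp/orc-output/part-0.orc");
try (FSDataOutputStream out =
        target.getFileSystem().create(target, FileSystem.WriteMode.OVERWRITE)) {
    writerFactory.create(out).finish();   // create(...) may throw IOException
} catch (UnsupportedOperationException e) {
    // a field's logical type has no ORC vector writer
} catch (IOException e) {
    // filesystem or ORC encoding failure
}
```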