Apache Flink ORC NoHive

Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.

Package Information

  • Package Name: flink-orc-nohive_2.11
  • Package Type: maven
  • Language: Java
  • Group ID: org.apache.flink
  • Artifact ID: flink-orc-nohive_2.11
  • Installation: Add to your Maven pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-orc-nohive_2.11</artifactId>
    <version>1.14.6</version>
</dependency>

Core Imports

import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;

Basic Usage

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

// Create a bulk writer factory for ORC files; the field types must match the ORC schema
Configuration hadoopConfig = new Configuration();
String orcSchema = "struct<name:string,age:int>";
LogicalType[] fieldTypes = {new VarCharType(VarCharType.MAX_LENGTH), new IntType()};

OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
    hadoopConfig,
    orcSchema,
    fieldTypes
);

// Use with Flink's streaming file sink (the output directory is an example)
Path outputPath = new Path("/tmp/orc-output");
StreamingFileSink<RowData> sink = StreamingFileSink
    .forBulkFormat(outputPath, writerFactory)
    .build();

Architecture

The flink-orc-nohive module is organized around several key components:

  • Writer Factory: OrcNoHiveBulkWriterFactory creates bulk writers for efficient ORC file writing
  • Input Formats: Helper classes for creating columnar input formats with partition support
  • Vector Layer: Custom vector implementations that adapt ORC vectors to Flink's vector API
  • Shim Layer: OrcNoHiveShim provides ORC reader implementation without Hive dependencies
  • Physical Writer: Handles ORC file writing with relocated Protobuf classes for no-hive compatibility

Capabilities

Bulk Writing

Factory for creating ORC bulk writers that efficiently write Flink RowData to ORC files without Hive dependencies.

public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {
    public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);
    public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
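
A BulkWriter can also be created directly from the factory, outside any sink. A minimal sketch, assuming Flink's standard BulkWriter contract (addElement/finish); the output path is illustrative:

import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

OrcNoHiveBulkWriterFactory factory = new OrcNoHiveBulkWriterFactory(
    new Configuration(),
    "struct<name:string,age:int>",
    new LogicalType[] {new VarCharType(VarCharType.MAX_LENGTH), new IntType()});

// Open a Flink file system stream (the path is an example)
Path path = new Path("/tmp/people.orc");
FSDataOutputStream out = path.getFileSystem().create(path, FileSystem.WriteMode.OVERWRITE);

BulkWriter<RowData> writer = factory.create(out);
writer.addElement(GenericRowData.of(StringData.fromString("alice"), 30));
writer.finish(); // flushes and finalizes the ORC footer, but does not close the stream
out.close();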


Columnar Reading

Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading.

public class OrcNoHiveColumnarRowInputFormat {
    public static <SplitT extends FileSourceSplit> 
        OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
            Configuration hadoopConfig,
            RowType tableType,
            List<String> partitionKeys,
            PartitionFieldExtractor<SplitT> extractor,
            int[] selectedFields,
            List<OrcFilters.Predicate> conjunctPredicates,
            int batchSize
        );
}

public class OrcNoHiveSplitReaderUtil {
    public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
        Configuration conf,
        String[] fullFieldNames,
        DataType[] fullFieldTypes,
        Map<String, Object> partitionSpec,
        int[] selectedFields,
        List<OrcFilters.Predicate> conjunctPredicates,
        int batchSize,
        org.apache.flink.core.fs.Path path,
        long splitStart,
        long splitLength
    ) throws IOException;
}
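
A hedged sketch of reading one file with the split reader utility. It assumes the reachedEnd()/nextRecord() iteration pattern and Closeable contract of flink-orc's split readers; the input path, field layout, and whole-file split bounds are illustrative:

import java.util.Collections;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowSplitReader;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.DataType;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

String[] fieldNames = {"name", "age"};
DataType[] fieldTypes = {DataTypes.STRING(), DataTypes.INT()};

try (OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
         OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
             new Configuration(),
             fieldNames,
             fieldTypes,
             Collections.emptyMap(),      // no partition columns
             new int[] {0, 1},            // project both fields
             Collections.emptyList(),     // no pushed-down predicates
             1024,                        // rows per batch
             new Path("/tmp/people.orc"), // hypothetical input file
             0L,
             Long.MAX_VALUE)) {           // treat the whole file as one split
    while (!reader.reachedEnd()) {
        RowData row = reader.nextRecord(null); // reuse object not needed here
        System.out.println(row.getString(0) + ", " + row.getInt(1));
    }
}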


Vector Processing

High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing.

public abstract class AbstractOrcNoHiveVector implements ColumnVector {
    public boolean isNullAt(int i);
    public static ColumnVector createFlinkVector(ColumnVector vector);
    public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
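
A minimal sketch of adapting an ORC vector, assuming the relocated ORC vector classes under org.apache.orc.storage and Flink 1.14's org.apache.flink.table.data.vector package:

import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
import org.apache.flink.table.types.logical.IntType;
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;

// Fill a relocated ORC long vector (ORC batches store integer columns as longs)
LongColumnVector orcVector = new LongColumnVector(3);
orcVector.vector[0] = 10;
orcVector.vector[1] = 20;
orcVector.noNulls = false;
orcVector.isNull[2] = true;

// Adapt it to Flink's vector API (fully qualified to avoid the ORC/Flink name clash)
org.apache.flink.table.data.vector.ColumnVector flinkVector =
    AbstractOrcNoHiveVector.createFlinkVector(orcVector);
System.out.println(flinkVector.isNullAt(2)); // true

// Constant vector, e.g. for a partition column with one value per split
org.apache.flink.table.data.vector.ColumnVector yearColumn =
    AbstractOrcNoHiveVector.createFlinkVectorFromConstant(new IntType(), 2024, 1024);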


ORC Integration

Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies.

public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {
    public RecordReader createRecordReader(
        Configuration conf,
        TypeDescription schema,
        int[] selectedFields,
        List<OrcFilters.Predicate> conjunctPredicates,
        org.apache.flink.core.fs.Path path,
        long splitStart,
        long splitLength
    ) throws IOException;
    
    public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);
    public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
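
A hedged sketch of driving the shim by hand. The getBatch() accessor on the wrapper is an assumption, mirroring flink-orc's batch wrapper interface; the path and split bounds are illustrative:

import java.util.Collections;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

OrcNoHiveShim shim = new OrcNoHiveShim();
TypeDescription schema = TypeDescription.fromString("struct<name:string,age:int>");

RecordReader reader = shim.createRecordReader(
    new Configuration(),
    schema,
    new int[] {0, 1},            // read both columns
    Collections.emptyList(),     // no predicate push-down
    new Path("/tmp/people.orc"), // hypothetical file
    0L,
    Long.MAX_VALUE);             // whole file as one split

OrcNoHiveBatchWrapper wrapper = shim.createBatchWrapper(schema, 1024);
VectorizedRowBatch batch = wrapper.getBatch(); // getBatch() assumed, per flink-orc's wrapper interface
while (shim.nextBatch(reader, batch)) {
    System.out.println("rows in batch: " + batch.size);
}
reader.close();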


Supported Data Types

The module supports all standard Flink logical types:

  • Primitive Types: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE
  • String Types: CHAR, VARCHAR
  • Binary Types: BINARY, VARBINARY
  • Temporal Types: DATE, TIME_WITHOUT_TIME_ZONE, TIMESTAMP_WITHOUT_TIME_ZONE, TIMESTAMP_WITH_LOCAL_TIME_ZONE
  • Decimal Types: DECIMAL with configurable precision and scale
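
For example, a sketch of how Flink logical types pair with an ORC schema string (the field names are illustrative; DECIMAL precision and scale carry over directly):

import org.apache.flink.table.types.logical.DecimalType;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.TimestampType;
import org.apache.flink.table.types.logical.VarCharType;

// Flink logical types handed to the writer factory...
LogicalType[] fieldTypes = {
    new VarCharType(VarCharType.MAX_LENGTH), // VARCHAR       -> string
    new IntType(),                           // INTEGER       -> int
    new DecimalType(10, 2),                  // DECIMAL(10,2) -> decimal(10,2)
    new TimestampType(3)                     // TIMESTAMP(3)  -> timestamp
};

// ...and the matching ORC schema string
String orcSchema = "struct<name:string,age:int,price:decimal(10,2),ts:timestamp>";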

Error Handling

Methods that read or write files declare IOException; unsupported data types or operations raise UnsupportedOperationException.

Common exceptions include:

  • IOException: File system operations, ORC file reading/writing errors
  • UnsupportedOperationException: Unsupported logical types or vector operations
  • ClassNotFoundException: Serialization/deserialization errors in factory classes