
# Apache Flink ORC NoHive

Apache Flink ORC NoHive provides ORC file format support for Apache Flink without requiring Hive dependencies. It enables efficient reading from and writing to ORC files using a standalone ORC implementation, offering high-performance columnar data processing with vectorized operations.

## Package Information

- **Package Name**: flink-orc-nohive_2.11
- **Package Type**: maven
- **Language**: Java
- **Group ID**: org.apache.flink
- **Artifact ID**: flink-orc-nohive_2.11
- **Installation**: Add the dependency to your Maven `pom.xml`:

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-orc-nohive_2.11</artifactId>
  <version>1.14.6</version>
</dependency>
```

## Core Imports

```java
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.orc.nohive.OrcNoHiveColumnarRowInputFormat;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
```

## Basic Usage

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.OrcNoHiveBulkWriterFactory;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

// Create a bulk writer factory for ORC files
Configuration hadoopConfig = new Configuration();
String orcSchema = "struct<name:string,age:int>";
LogicalType[] fieldTypes = {new VarCharType(VarCharType.MAX_LENGTH), new IntType()};

OrcNoHiveBulkWriterFactory writerFactory = new OrcNoHiveBulkWriterFactory(
    hadoopConfig,
    orcSchema,
    fieldTypes
);

// Use the factory with Flink's streaming file sink
Path outputPath = new Path("/tmp/orc-output"); // example output location
StreamingFileSink<RowData> sink = StreamingFileSink
    .forBulkFormat(outputPath, writerFactory)
    .build();
```

## Architecture

The flink-orc-nohive module is organized around several key components:

- **Writer Factory**: `OrcNoHiveBulkWriterFactory` creates bulk writers for efficient ORC file writing
- **Input Formats**: Helper classes for creating columnar input formats with partition support
- **Vector Layer**: Custom vector implementations that adapt ORC vectors to Flink's vector API
- **Shim Layer**: `OrcNoHiveShim` provides an ORC reader implementation without Hive dependencies
- **Physical Writer**: Handles ORC file writing with relocated Protobuf classes for no-hive compatibility

## Capabilities

### Bulk Writing

Factory for creating ORC bulk writers that efficiently write Flink `RowData` to ORC files without Hive dependencies.

```java { .api }
public class OrcNoHiveBulkWriterFactory implements BulkWriter.Factory<RowData> {
  public OrcNoHiveBulkWriterFactory(Configuration conf, String schema, LogicalType[] fieldTypes);
  public BulkWriter<RowData> create(FSDataOutputStream out) throws IOException;
}
```

[Bulk Writing](./bulk-writing.md)

### Columnar Reading

Helper utilities for creating columnar input formats and split readers with partition support for efficient ORC file reading.

```java { .api }
public class OrcNoHiveColumnarRowInputFormat {
  public static <SplitT extends FileSourceSplit>
      OrcColumnarRowFileInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
          Configuration hadoopConfig,
          RowType tableType,
          List<String> partitionKeys,
          PartitionFieldExtractor<SplitT> extractor,
          int[] selectedFields,
          List<OrcFilters.Predicate> conjunctPredicates,
          int batchSize
      );
}

public class OrcNoHiveSplitReaderUtil {
  public static OrcColumnarRowSplitReader<VectorizedRowBatch> genPartColumnarRowReader(
      Configuration conf,
      String[] fullFieldNames,
      DataType[] fullFieldTypes,
      Map<String, Object> partitionSpec,
      int[] selectedFields,
      List<OrcFilters.Predicate> conjunctPredicates,
      int batchSize,
      org.apache.flink.core.fs.Path path,
      long splitStart,
      long splitLength
  ) throws IOException;
}
```

[Columnar Reading](./columnar-reading.md)
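As a sketch of how the split reader utility might be used: the snippet below opens an unpartitioned ORC file and iterates its rows. The file path `/tmp/data.orc` is hypothetical, and the reader method names (`reachedEnd`, `nextRecord`, `close`) are assumed from the split reader's iterator-style API; treat this as a hedged illustration, not a definitive recipe.

```java
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowSplitReader;
import org.apache.flink.orc.nohive.OrcNoHiveSplitReaderUtil;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.DataType;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

// Read the "name" and "age" columns of an unpartitioned ORC file.
OrcColumnarRowSplitReader<VectorizedRowBatch> reader =
    OrcNoHiveSplitReaderUtil.genPartColumnarRowReader(
        new Configuration(),                          // Hadoop configuration
        new String[] {"name", "age"},                 // full field names
        new DataType[] {DataTypes.STRING(), DataTypes.INT()},
        new HashMap<>(),                              // no partition columns
        new int[] {0, 1},                             // project both fields
        new ArrayList<>(),                            // no filter predicates
        2048,                                         // batch size
        new Path("/tmp/data.orc"),                    // hypothetical input path
        0L,                                           // split start
        Long.MAX_VALUE                                // split length: whole file
    );

while (!reader.reachedEnd()) {
    RowData row = reader.nextRecord(null);
    // process row...
}
reader.close();
```

The split start/length arguments let a source distribute non-overlapping byte ranges of one file across parallel readers; passing `0` and a length covering the file reads it in a single split.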

### Vector Processing

High-performance vector implementations that adapt ORC column vectors to Flink's vector API for efficient columnar data processing.

```java { .api }
public abstract class AbstractOrcNoHiveVector implements ColumnVector {
  public boolean isNullAt(int i);
  public static ColumnVector createFlinkVector(ColumnVector vector);
  public static ColumnVector createFlinkVectorFromConstant(LogicalType type, Object value, int batchSize);
}
```

[Vector Processing](./vector-processing.md)
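A minimal sketch of the two factory methods, assuming the no-hive ORC vectors live under `org.apache.orc.storage.ql.exec.vector` and Flink's vector interface under `org.apache.flink.table.data.vector` (package names vary by Flink version):

```java
import org.apache.flink.orc.nohive.vector.AbstractOrcNoHiveVector;
import org.apache.flink.table.data.vector.ColumnVector;
import org.apache.flink.table.types.logical.IntType;
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;

// Wrap an ORC long vector (ORC uses LongColumnVector for integer types)
// so Flink's columnar readers can consume it directly.
LongColumnVector orcVector = new LongColumnVector(1024);
ColumnVector flinkVector = AbstractOrcNoHiveVector.createFlinkVector(orcVector);

// Build a constant vector, e.g. for a partition column whose value is
// the same for every row in a 1024-row batch.
ColumnVector partitionVector =
    AbstractOrcNoHiveVector.createFlinkVectorFromConstant(new IntType(), 2024, 1024);
```

The constant-vector path is what makes partitioned reads cheap: partition values are not stored in the ORC file, so they are materialized once per batch instead of once per row.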

### ORC Integration

Low-level ORC integration providing record readers and batch wrappers for direct ORC file access without Hive dependencies.

```java { .api }
public class OrcNoHiveShim implements OrcShim<VectorizedRowBatch> {
  public RecordReader createRecordReader(
      Configuration conf,
      TypeDescription schema,
      int[] selectedFields,
      List<OrcFilters.Predicate> conjunctPredicates,
      org.apache.flink.core.fs.Path path,
      long splitStart,
      long splitLength
  ) throws IOException;

  public OrcNoHiveBatchWrapper createBatchWrapper(TypeDescription schema, int batchSize);

  public boolean nextBatch(RecordReader reader, VectorizedRowBatch rowBatch) throws IOException;
}
```

[ORC Integration](./orc-integration.md)
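The shim's three methods compose into a read loop. The sketch below assumes the batch wrapper exposes its underlying batch via a `getBatch()` accessor and that `/tmp/data.orc` exists; both are illustrative assumptions rather than documented guarantees.

```java
import java.util.ArrayList;

import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.nohive.shim.OrcNoHiveShim;
import org.apache.flink.orc.nohive.vector.OrcNoHiveBatchWrapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;

OrcNoHiveShim shim = new OrcNoHiveShim();
TypeDescription schema = TypeDescription.fromString("struct<name:string,age:int>");

// Open a reader over the whole file (split covers everything).
RecordReader reader = shim.createRecordReader(
    new Configuration(),
    schema,
    new int[] {0, 1},            // selected fields
    new ArrayList<>(),           // no filter predicates
    new Path("/tmp/data.orc"),   // hypothetical input path
    0L,
    Long.MAX_VALUE
);

// Reuse one vectorized batch across the whole read.
OrcNoHiveBatchWrapper wrapper = shim.createBatchWrapper(schema, 2048);
VectorizedRowBatch batch = wrapper.getBatch(); // assumed accessor
while (shim.nextBatch(reader, batch)) {
    // batch.size rows are now populated in batch.cols
}
reader.close();
```

Reusing a single `VectorizedRowBatch` across `nextBatch` calls is the point of the wrapper: it avoids per-batch allocation on the hot read path.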

## Supported Data Types

The module supports all standard Flink logical types:

- **Primitive Types**: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE
- **String Types**: CHAR, VARCHAR
- **Binary Types**: BINARY, VARBINARY
- **Temporal Types**: DATE, TIME_WITHOUT_TIME_ZONE, TIMESTAMP_WITHOUT_TIME_ZONE, TIMESTAMP_WITH_LOCAL_TIME_ZONE
- **Decimal Types**: DECIMAL with configurable precision and scale
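The ORC schema string passed to the writer factory must mirror these Flink types, as in the `struct<name:string,age:int>` example above. The helper below is a self-contained sketch of that correspondence for a few common types; the class and method names are illustrative and not part of flink-orc-nohive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrcSchemaSketch {
    // Illustrative mapping from Flink SQL type names to ORC schema types;
    // not part of the library itself. VARCHAR maps to ORC's plain "string"
    // here, matching the schema used in Basic Usage.
    static String toOrcType(String flinkType) {
        switch (flinkType) {
            case "BOOLEAN": return "boolean";
            case "INT":     return "int";
            case "BIGINT":  return "bigint";
            case "DOUBLE":  return "double";
            case "VARCHAR": return "string";
            case "DATE":    return "date";
            default: throw new UnsupportedOperationException(flinkType);
        }
    }

    // Build a "struct<...>" schema string from an ordered field map.
    static String structOf(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("struct<");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append(e.getKey()).append(':').append(toOrcType(e.getValue()));
            first = false;
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "VARCHAR");
        fields.put("age", "INT");
        System.out.println(structOf(fields)); // prints struct<name:string,age:int>
    }
}
```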

## Error Handling

All public methods that perform I/O may throw `IOException`. The module throws `UnsupportedOperationException` for unsupported data types or operations.

Common exceptions include:

- `IOException`: file system operations and ORC file read/write errors
- `UnsupportedOperationException`: unsupported logical types or vector operations
- `ClassNotFoundException`: serialization/deserialization errors in factory classes