tessl/maven-org-apache-flink--flink-hadoop-fs

Hadoop FileSystem integration for Apache Flink enabling seamless access to HDFS and other Hadoop-compatible file systems

Workspace: tessl
Visibility: Public
Describes: pkg:maven/org.apache.flink/flink-hadoop-fs@2.1.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-flink--flink-hadoop-fs@2.1.0


Apache Flink Hadoop FileSystem

Apache Flink Hadoop FileSystem (flink-hadoop-fs) provides seamless integration between Apache Flink's file system abstraction and Hadoop's file system implementations. It enables Flink applications to access HDFS, S3, Azure Blob Storage, Google Cloud Storage, and other Hadoop-compatible file systems with full support for fault-tolerant streaming, exactly-once processing guarantees, and high-performance I/O operations.

Package Information

  • Package Name: flink-hadoop-fs
  • Package Type: maven
  • GroupId: org.apache.flink
  • ArtifactId: flink-hadoop-fs
  • Language: Java
  • Installation: Add to Maven dependencies:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-fs</artifactId>
    <version>2.1.0</version>
</dependency>

Core Imports

import org.apache.flink.runtime.fs.hdfs.HadoopFsFactory;
import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
import org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter;
import org.apache.flink.runtime.util.HadoopUtils;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

Basic Usage

import org.apache.flink.runtime.fs.hdfs.HadoopFsFactory;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.Path;
import org.apache.flink.configuration.Configuration;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Create and configure the factory
HadoopFsFactory factory = new HadoopFsFactory();
Configuration config = new Configuration();
factory.configure(config);

// Create a file system for HDFS
URI hdfsUri = URI.create("hdfs://namenode:9000/");
FileSystem fs = factory.create(hdfsUri);

// Basic file operations
Path filePath = new Path("hdfs://namenode:9000/data/input.txt");
boolean exists = fs.exists(filePath);
FileStatus status = fs.getFileStatus(filePath);

// Read from a file
FSDataInputStream inputStream = fs.open(filePath);
// ... read data
inputStream.close();

// Write to a file (OVERWRITE replaces an existing file)
FSDataOutputStream outputStream =
    fs.create(new Path("hdfs://namenode:9000/data/output.txt"), FileSystem.WriteMode.OVERWRITE);
outputStream.write("Hello, Hadoop!".getBytes(StandardCharsets.UTF_8));
outputStream.close();

Architecture

Apache Flink Hadoop FileSystem is built around several key components:

  • Factory Pattern: HadoopFsFactory creates appropriate file system instances for different schemes (HDFS, S3, etc.)
  • FileSystem Wrapper: HadoopFileSystem wraps Hadoop's FileSystem implementations with Flink's interface (see the sketch after this list)
  • Recoverable Writers: Fault-tolerant writers that support exactly-once processing guarantees through checkpoint/recovery
  • Optimized Streams: High-performance I/O streams with ByteBuffer support and connection limiting
  • Configuration Integration: Seamless bridging between Flink and Hadoop configurations
  • Security Support: Kerberos authentication and delegation token management
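
As a sketch of the wrapper relationship (the Hadoop configuration and URI below are illustrative), an existing org.apache.hadoop.fs.FileSystem can be wrapped in HadoopFileSystem to expose it through Flink's FileSystem interface; in normal use, HadoopFsFactory performs this wrapping for you:

import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
import org.apache.flink.core.fs.FileSystem;
import java.net.URI;

// Illustrative Hadoop configuration and HDFS URI
org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
URI uri = URI.create("hdfs://namenode:9000/");

// Obtain the underlying Hadoop file system and wrap it with Flink's FileSystem interface
org.apache.hadoop.fs.FileSystem hadoopFs = org.apache.hadoop.fs.FileSystem.get(uri, hadoopConf);
FileSystem flinkFs = new HadoopFileSystem(hadoopFs);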

Capabilities

FileSystem Factory

Factory for creating Hadoop-based file systems that automatically detects and instantiates the appropriate implementation based on URI schemes.

public class HadoopFsFactory implements FileSystemFactory {
    public String getScheme();
    public void configure(Configuration config);
    public FileSystem create(URI fsUri) throws IOException;
}

See docs/filesystem-factory.md for details.
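
A brief usage sketch for a non-HDFS scheme (the s3a bucket is a placeholder, and the matching Hadoop connector must be on the classpath):

HadoopFsFactory factory = new HadoopFsFactory();
factory.configure(new Configuration());

// The same factory serves any URI scheme backed by a Hadoop FileSystem implementation
FileSystem s3 = factory.create(URI.create("s3a://my-bucket/"));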

Core FileSystem Operations

Comprehensive file system operations including reading, writing, directory management, and metadata access with support for all Hadoop-compatible file systems.

public class HadoopFileSystem extends FileSystem {
    public HadoopFileSystem(org.apache.hadoop.fs.FileSystem hadoopFileSystem);
    public FileStatus getFileStatus(Path f) throws IOException;
    public HadoopDataInputStream open(Path f) throws IOException;
    public HadoopDataOutputStream create(Path f, WriteMode overwrite) throws IOException;
    public boolean delete(Path f, boolean recursive) throws IOException;
    public FileStatus[] listStatus(Path f) throws IOException;
    public RecoverableWriter createRecoverableWriter() throws IOException;
}

See docs/filesystem-operations.md for details.
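
A short sketch of directory listing and deletion, continuing from the Basic Usage example (paths are illustrative):

Path dir = new Path("hdfs://namenode:9000/data/");

// List the directory and inspect each entry's metadata
for (FileStatus entry : fs.listStatus(dir)) {
    System.out.println(entry.getPath() + " len=" + entry.getLen() + " dir=" + entry.isDir());
}

// Delete a path recursively (true = recursive)
fs.delete(new Path("hdfs://namenode:9000/tmp/old-output"), true);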

High-Performance I/O Streams

Optimized input and output streams with ByteBuffer support, connection limiting, and advanced positioning capabilities for efficient data processing.

public class HadoopDataInputStream extends FSDataInputStream implements ByteBufferReadable {
    public void seek(long seekPos) throws IOException;
    public int read(ByteBuffer byteBuffer) throws IOException;
    public int read(long position, ByteBuffer byteBuffer) throws IOException;
}

public class HadoopDataOutputStream extends FSDataOutputStream {
    public long getPos() throws IOException;
    public void sync() throws IOException;
}

See docs/io-streams.md for details.
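
A sketch of ByteBuffer-based reads, assuming the file system was created through HadoopFsFactory so that open() yields a HadoopDataInputStream (the path is illustrative):

import org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream;
import java.nio.ByteBuffer;

HadoopDataInputStream in = (HadoopDataInputStream) fs.open(new Path("hdfs://namenode:9000/data/input.txt"));

// Read directly into a ByteBuffer; returns the number of bytes read, or -1 at end of stream
ByteBuffer buffer = ByteBuffer.allocate(4096);
int bytesRead = in.read(buffer);

// Positioned read: fill the buffer starting at byte offset 128 of the file
buffer.clear();
int bytesAtOffset = in.read(128L, buffer);

in.close();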

Fault-Tolerant Writing

Recoverable writers that provide exactly-once processing guarantees through persistent state management and checkpoint/recovery mechanisms.

public class HadoopRecoverableWriter implements RecoverableWriter {
    public RecoverableFsDataOutputStream open(Path filePath) throws IOException;
    public RecoverableFsDataOutputStream recover(ResumeRecoverable recoverable) throws IOException;
    public Committer recoverForCommit(CommitRecoverable recoverable) throws IOException;
}

public class HadoopFsRecoverable implements CommitRecoverable, ResumeRecoverable {
    public Path targetFile();
    public Path tempFile();
    public long offset();
}

See docs/recoverable-writers.md for details.
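
A condensed sketch of the write, persist, and commit lifecycle, assuming fs supports recoverable writers for its scheme (the path is illustrative):

import org.apache.flink.core.fs.RecoverableWriter;
import org.apache.flink.core.fs.RecoverableFsDataOutputStream;
import java.nio.charset.StandardCharsets;

RecoverableWriter writer = fs.createRecoverableWriter();

// Data goes to a temporary file until it is committed
RecoverableFsDataOutputStream out = writer.open(new Path("hdfs://namenode:9000/data/part-0"));
out.write("record".getBytes(StandardCharsets.UTF_8));

// persist() returns a resumable handle; after a failure, writer.recover(resumable) resumes from this point
RecoverableWriter.ResumeRecoverable resumable = out.persist();

// closeForCommit() seals the temporary file; commit() publishes it to the target path
RecoverableFsDataOutputStream.Committer committer = out.closeForCommit();
committer.commit();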

Hadoop Integration Utilities

Utility functions for configuration management, security handling, and version compatibility checks when working with Hadoop ecosystems.

public class HadoopUtils {
    public static Configuration getHadoopConfiguration(org.apache.flink.configuration.Configuration flinkConfiguration);
    public static boolean isKerberosSecurityEnabled(UserGroupInformation ugi);
    public static boolean areKerberosCredentialsValid(UserGroupInformation ugi, boolean useTicketCache);
}

See docs/hadoop-utilities.md for details.
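
A sketch of the configuration and security bridge, assuming Hadoop configuration files are discoverable through HADOOP_CONF_DIR or the related Flink options:

import org.apache.flink.runtime.util.HadoopUtils;
import org.apache.hadoop.security.UserGroupInformation;

// Derive a Hadoop Configuration from the Flink configuration
org.apache.flink.configuration.Configuration flinkConf = new org.apache.flink.configuration.Configuration();
org.apache.hadoop.conf.Configuration hadoopConf = HadoopUtils.getHadoopConfiguration(flinkConf);

// Check whether the current user is Kerberos-authenticated and holds valid credentials
UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
if (HadoopUtils.isKerberosSecurityEnabled(ugi)) {
    boolean valid = HadoopUtils.areKerberosCredentialsValid(ugi, true);
}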

Supported File Systems

This package supports all Hadoop-compatible file systems through automatic scheme detection (see the sketch after this list):

  • HDFS: Hadoop Distributed File System (hdfs://)
  • Amazon S3: S3A and S3N implementations (s3a://, s3n://, s3://)
  • Azure Storage: Blob Storage and Data Lake (wasb://, wasbs://, abfs://, abfss://)
  • Google Cloud Storage: GCS connector (gs://)
  • Local File System: Local file access (file://)
  • Other Hadoop FS: Any Hadoop FileSystem implementation
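
A small illustration of scheme-based dispatch (bucket names are placeholders, and the corresponding Hadoop connector must be on the classpath):

// The same Flink API is used regardless of the underlying store; only the URI scheme changes
Path hdfsPath = new Path("hdfs://namenode:9000/data/input.txt");
Path s3Path = new Path("s3a://my-bucket/data/input.txt");
Path gcsPath = new Path("gs://my-bucket/data/input.txt");

// FileSystem.get() resolves the appropriate implementation for each scheme
FileSystem s3Fs = FileSystem.get(s3Path.toUri());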

Error Handling

File system operations throw standard Java IOExceptions; the more specific exception types are:

  • UnsupportedFileSystemSchemeException: When the URI scheme is not supported by Hadoop
  • UnknownHostException: When the file system authority cannot be resolved
  • IOException: General I/O errors during file operations
  • FlinkRuntimeException: Configuration and version compatibility issues

All exceptions include detailed error messages to aid in troubleshooting configuration and connectivity issues.
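
A sketch of handling a scheme that Hadoop cannot serve (the URI is illustrative):

import org.apache.flink.core.fs.UnsupportedFileSystemSchemeException;
import java.io.IOException;

try {
    FileSystem unknown = factory.create(URI.create("unknownscheme://host/path"));
} catch (UnsupportedFileSystemSchemeException e) {
    // No Hadoop FileSystem implementation is registered for this scheme
    System.err.println("Unsupported scheme: " + e.getMessage());
} catch (IOException e) {
    // General I/O failures, e.g. connectivity or authority resolution problems
    System.err.println("I/O error: " + e.getMessage());
}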

Types

// Core file system interface
public abstract class FileSystem {
    public enum WriteMode { NO_OVERWRITE, OVERWRITE }
}

// File metadata
public interface FileStatus {
    long getLen();
    long getBlockSize();
    long getAccessTime();
    long getModificationTime();
    short getReplication();
    Path getPath();
    boolean isDir();
}

// Block location information
public interface BlockLocation extends Comparable<BlockLocation> {
    String[] getHosts() throws IOException;
    long getLength();
    long getOffset();
}

// Path representation
public class Path {
    public Path(String path);
    public Path(URI uri);
    public URI toUri();
}