tessl/pypi-smart-open

Utils for streaming large files from S3, HDFS, GCS, SFTP, Azure Blob Storage, and local filesystem with transparent compression support


Smart Open

Smart Open is a Python library for efficient streaming of very large files to and from a variety of storage systems, including S3, Google Cloud Storage, Azure Blob Storage, HDFS, WebHDFS, HTTP/HTTPS, SFTP, and the local filesystem. It provides transparent, on-the-fly compression and decompression for multiple formats, and it serves as a drop-in replacement for Python's built-in open() function.

Package Information

  • Package Name: smart_open
  • Language: Python
  • Installation: pip install smart_open
  • Version: 7.3.0
  • License: MIT

Core Imports

from smart_open import open

For URI parsing:

from smart_open import parse_uri

For compression handling:

from smart_open import register_compressor

For the legacy smart_open() entry point (deprecated in favor of open()):

from smart_open import smart_open

Basic Usage

from smart_open import open

# Stream from S3
with open('s3://my-bucket/large-file.txt') as f:
    for line in f:
        print(line.strip())

# Stream with compression (automatic detection)
with open('s3://my-bucket/file.txt.gz') as f:
    content = f.read()

# Write to cloud storage
with open('gs://my-bucket/output.txt', 'w') as f:
    f.write('Hello, world!')

# Local files work too (drop-in replacement)
with open('./local-file.txt') as f:
    data = f.read()

# Binary operations with seeking
with open('s3://my-bucket/data.bin', 'rb') as f:
    f.seek(1000)  # Seek to position 1000
    chunk = f.read(100)  # Read 100 bytes

Architecture

Smart Open uses a modular transport architecture:

  • Transport Layer: Protocol-specific implementations (S3, GCS, Azure, HTTP, etc.)
  • Compression Layer: Transparent compression/decompression handling
  • Unified API: Single open() function interface compatible with built-in open()
  • Streaming Design: Memory-efficient processing of arbitrarily large files

Each transport module provides consistent parse_uri(), open_uri(), and open() functions, with Reader/Writer classes implementing standard Python I/O interfaces.

Capabilities

Core File Operations

Universal file operations that work across all supported storage systems with transparent compression support.

def open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None, 
         closefd=True, opener=None, compression='infer_from_extension', 
         transport_params=None): ...

def parse_uri(uri_as_string): ...


Cloud Storage Integration

Access to major cloud storage platforms with native client optimizations and streaming capabilities.

# S3 operations
from smart_open.s3 import open, iter_bucket, Reader, MultipartWriter
# GCS operations  
from smart_open.gcs import open, Reader, Writer
# Azure operations
from smart_open.azure import open, Reader, Writer


Network and Remote Access

HTTP/HTTPS, FTP, and SSH-based file access with authentication and secure connection support.

# HTTP operations
from smart_open.http import open
# FTP operations  
from smart_open.ftp import open
# SSH/SFTP operations
from smart_open.ssh import open


Big Data and Distributed Systems

Integration with Hadoop ecosystem (HDFS, WebHDFS) for big data processing workflows.

# HDFS operations
from smart_open.hdfs import open
# WebHDFS operations
from smart_open.webhdfs import open


Compression and Encoding

Automatic and explicit compression handling for multiple formats with streaming support.

def register_compressor(ext, callback): ...
def get_supported_compression_types(): ...
def get_supported_extensions(): ...
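Following the pattern from the upstream documentation, a new format can be added by registering a callback that wraps the raw binary stream. The sketch below adds .xz support via the standard library's lzma module; the output path is hypothetical:

```python
import lzma
import os
import tempfile
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # Wrap the underlying binary stream so reads and writes are
    # transparently (de)compressed using the xz format.
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# Once registered, the extension is handled like any built-in format
path = os.path.join(tempfile.gettempdir(), 'example.txt.xz')
with open(path, 'w') as f:
    f.write('hello, xz!')

with open(path) as f:
    print(f.read())  # hello, xz!
```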


Utilities and Advanced Usage

Helper functions for URI handling, byte ranges, parallel processing, and custom transport development.

# URI utilities
from smart_open.utils import safe_urlsplit, make_range_string
# Concurrency utilities  
from smart_open.concurrency import create_pool
# Transport registration
from smart_open.transport import register_transport, get_transport


Supported URL Formats

Smart Open supports a wide variety of URL formats:

  • S3: s3://bucket/key, s3://key:secret@bucket/key
  • Google Cloud: gs://bucket/blob
  • Azure: azure://container/blob
  • HTTP/HTTPS: http://example.com/file, https://example.com/file
  • FTP/FTPS: ftp://host/path, ftps://host/path
  • SSH/SFTP: ssh://user@host/path, sftp://user@host/path
  • HDFS: hdfs:///path/file, hdfs://namenode:port/path/file
  • WebHDFS: webhdfs://host:port/path/file
  • Local files: ./path/file, file:///absolute/path, ~/path/file

All formats support transparent compression based on file extensions (.gz, .bz2, .zst, etc.).
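For local files this behavior is easy to observe end to end: writing to a path ending in .gz produces a genuine gzip file on disk, and reading it back through smart_open decompresses automatically. A minimal sketch, using a hypothetical temp-directory path:

```python
import gzip
import os
import tempfile
from smart_open import open

path = os.path.join(tempfile.gettempdir(), 'demo.txt.gz')

# Writing: gzip compression is inferred from the .gz extension
with open(path, 'w') as f:
    f.write('hello compressed world\n')

# Reading back through smart_open decompresses transparently
with open(path) as f:
    print(f.read())

# The file on disk is a real gzip archive, readable by the gzip module
with gzip.open(path, 'rt') as f:
    print(f.read())
```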

Install with Tessl CLI

npx tessl i tessl/pypi-smart-open
Describes: pkg:pypi/smart-open@7.3.x