tessl/pypi-smart-open

Utils for streaming large files from S3, HDFS, GCS, SFTP, Azure Blob Storage, and local filesystem with transparent compression support

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/smart-open@7.3.x

To install, run

npx @tessl/cli install tessl/pypi-smart-open@7.3.0


Smart Open

Smart Open is a Python library for efficiently streaming very large files to and from a variety of storage systems, including S3, Google Cloud Storage, Azure Blob Storage, HDFS, WebHDFS, HTTP/HTTPS, SFTP, and the local filesystem. It provides transparent, on-the-fly compression and decompression for multiple formats and serves as a drop-in replacement for Python's built-in open() function, with which it is 100% compatible.

Package Information

  • Package Name: smart_open
  • Language: Python
  • Installation: pip install smart_open
  • Version: 7.3.0
  • License: MIT

Core Imports

from smart_open import open

For URI parsing:

from smart_open import parse_uri

For compression handling:

from smart_open import register_compressor

For legacy context manager:

from smart_open import smart_open

Basic Usage

from smart_open import open

# Stream from S3
with open('s3://my-bucket/large-file.txt') as f:
    for line in f:
        print(line.strip())

# Stream with compression (automatic detection)
with open('s3://my-bucket/file.txt.gz') as f:
    content = f.read()

# Write to cloud storage
with open('gs://my-bucket/output.txt', 'w') as f:
    f.write('Hello, world!')

# Local files work too (drop-in replacement)
with open('./local-file.txt') as f:
    data = f.read()

# Binary operations with seeking
with open('s3://my-bucket/data.bin', 'rb') as f:
    f.seek(1000)  # Seek to position 1000
    chunk = f.read(100)  # Read 100 bytes

Architecture

Smart Open uses a modular transport architecture:

  • Transport Layer: Protocol-specific implementations (S3, GCS, Azure, HTTP, etc.)
  • Compression Layer: Transparent compression/decompression handling
  • Unified API: Single open() function interface compatible with built-in open()
  • Streaming Design: Memory-efficient processing of arbitrarily large files

Each transport module provides consistent parse_uri(), open_uri(), and open() functions, with Reader/Writer classes implementing standard Python I/O interfaces.
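As an illustration of that contract, the sketch below outlines a hypothetical transport module; the module name, scheme, and function bodies are assumptions for demonstration only, not part of smart_open itself.

# my_transport.py -- hypothetical module, for illustration only
import io
import urllib.parse

SCHEME = "myproto"                        # URI scheme this transport claims
URI_EXAMPLES = ("myproto://host/path",)   # sample URIs shown in documentation
MISSING_DEPS = None                       # optional dependency hint, if any

def parse_uri(uri_as_string):
    split = urllib.parse.urlsplit(uri_as_string)
    return dict(scheme=split.scheme, host=split.netloc, path=split.path)

def open_uri(uri_as_string, mode, transport_params):
    parsed = parse_uri(uri_as_string)
    return open(parsed["host"], parsed["path"], mode, **transport_params)

def open(host, path, mode, **kwargs):
    # A real transport would return a streaming file-like object here.
    return io.BytesIO(b"placeholder contents")

Registering it would then look like smart_open.transport.register_transport('my_transport'), after which open('myproto://host/path') is routed to this module.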

Capabilities

Core File Operations

Universal file operations that work across all supported storage systems with transparent compression support.

def open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None, 
         closefd=True, opener=None, compression='infer_from_extension', 
         transport_params=None): ...

def parse_uri(uri_as_string): ...
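For example, parse_uri breaks a URI into its scheme-specific components. The snippet below assumes an S3-style URI; the exact field names on the returned object vary by scheme.

from smart_open import parse_uri

uri = parse_uri('s3://my-bucket/path/to/file.txt')
print(uri.scheme)     # 's3'
print(uri.bucket_id)  # 'my-bucket'   (field names differ for other schemes)
print(uri.key_id)     # 'path/to/file.txt'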

Core Operations

Cloud Storage Integration

Access to major cloud storage platforms with native client optimizations and streaming capabilities.

# S3 operations
from smart_open.s3 import open, iter_bucket, Reader, MultipartWriter
# GCS operations  
from smart_open.gcs import open, Reader, Writer
# Azure operations
from smart_open.azure import open, Reader, Writer
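Transport-specific behaviour is configured through the transport_params argument of open(). As a sketch, the example below passes a pre-configured boto3 client to the S3 transport; it assumes boto3 is installed, and the bucket and profile names are placeholders.

import boto3
from smart_open import open

# Reuse an existing boto3 session instead of letting smart_open create one.
session = boto3.Session(profile_name='my-profile')
params = {'client': session.client('s3')}

with open('s3://my-bucket/large-file.txt', 'rb', transport_params=params) as f:
    first_kb = f.read(1024)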

Cloud Storage

Network and Remote Access

HTTP/HTTPS, FTP, and SSH-based file access with authentication and secure connection support.

# HTTP operations
from smart_open.http import open
# FTP operations  
from smart_open.ftp import open
# SSH/SFTP operations
from smart_open.ssh import open
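For authenticated endpoints, credentials are likewise passed via transport_params. The sketch below shows HTTP basic authentication; the parameter names follow the http transport, and the URL and credentials are placeholders.

from smart_open import open

params = {'user': 'alice', 'password': 'secret'}  # placeholder credentials

with open('https://example.com/protected/data.csv', 'rb', transport_params=params) as f:
    header_row = f.readline()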

Network Access

Big Data and Distributed Systems

Integration with Hadoop ecosystem (HDFS, WebHDFS) for big data processing workflows.

# HDFS operations
from smart_open.hdfs import open
# WebHDFS operations
from smart_open.webhdfs import open
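Both transports are reached through the same unified open() call; hdfs:// URIs go through the hdfs command-line client, while webhdfs:// uses the REST API. A minimal sketch, with the namenode host, port, and path as placeholders:

from smart_open import open

# Count lines in a file stored in HDFS, streamed via WebHDFS.
with open('webhdfs://namenode:50070/user/hadoop/input.txt') as f:
    line_count = sum(1 for _ in f)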

Big Data Systems

Compression and Encoding

Automatic and explicit compression handling for multiple formats with streaming support.

def register_compressor(ext, callback): ...
def get_supported_compression_types(): ...
def get_supported_extensions(): ...
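For example, a compressor for .xz files can be registered by providing a callback that wraps the underlying file object, mirroring the pattern used for the built-in .gz and .bz2 handlers. The file name below is a placeholder.

import lzma
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # Wrap the raw stream in an LZMA codec for transparent (de)compression.
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

with open('example.txt.xz', 'w') as f:
    f.write('compressed transparently')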

Compression

Utilities and Advanced Usage

Helper functions for URI handling, byte ranges, parallel processing, and custom transport development.

# URI utilities
from smart_open.utils import safe_urlsplit, make_range_string
# Concurrency utilities  
from smart_open.concurrency import create_pool
# Transport registration
from smart_open.transport import register_transport, get_transport
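A small sketch of these helpers: get_transport looks up the module registered for a scheme, and safe_urlsplit parses URIs whose keys contain characters that would confuse urllib.parse.urlsplit. The bucket and key below are placeholders.

from smart_open.transport import get_transport
from smart_open.utils import safe_urlsplit

# Which module handles s3:// URIs?
s3_module = get_transport('s3')
print(s3_module.__name__)

# Split a URI whose key contains a '?' character.
parts = safe_urlsplit('s3://my-bucket/reports/2024?final.csv')
print(parts.netloc, parts.path)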

Utilities

Supported URL Formats

Smart Open supports a wide variety of URL formats:

  • S3: s3://bucket/key, s3://key:secret@bucket/key
  • Google Cloud: gs://bucket/blob
  • Azure: azure://container/blob
  • HTTP/HTTPS: http://example.com/file, https://example.com/file
  • FTP/FTPS: ftp://host/path, ftps://host/path
  • SSH/SFTP: ssh://user@host/path, sftp://user@host/path
  • HDFS: hdfs:///path/file, hdfs://namenode:port/path/file
  • WebHDFS: webhdfs://host:port/path/file
  • Local files: ./path/file, file:///absolute/path, ~/path/file

All formats support transparent compression based on file extensions (.gz, .bz2, .zst, etc.).
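When the extension alone is not enough, the compression argument of open() can override the default inference. The sketch below forces gzip handling for a key without a .gz suffix and disables decompression to read raw bytes; the bucket and key names are placeholders.

from smart_open import open

# Force gzip decompression even though the key has no .gz extension.
with open('s3://my-bucket/archived-log', 'rb', compression='.gz') as f:
    text_bytes = f.read()

# Read the stored bytes as-is, skipping decompression entirely.
with open('s3://my-bucket/file.txt.gz', 'rb', compression='disable') as f:
    raw_bytes = f.read()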