Utils for streaming large files from S3, HDFS, GCS, SFTP, Azure Blob Storage, and local filesystem with transparent compression support
npx @tessl/cli install tessl/pypi-smart-open@7.3.0
# Smart Open

Smart Open is a Python library for **efficient streaming of very large files** from/to various storage systems, including S3, Google Cloud Storage, Azure Blob Storage, HDFS, WebHDFS, HTTP/HTTPS, SFTP, and the local filesystem. It provides transparent, on-the-fly compression and decompression for multiple formats and serves as a drop-in replacement for Python's built-in `open()` function.

## Package Information

- **Package Name**: smart_open
- **Language**: Python
- **Installation**: `pip install smart_open`
- **Version**: 7.3.0
- **License**: MIT

## Core Imports

```python
from smart_open import open
```

For URI parsing:

```python
from smart_open import parse_uri
```

For compression handling:

```python
from smart_open import register_compressor
```

For the legacy context manager:

```python
from smart_open import smart_open
```

## Basic Usage

```python
from smart_open import open

# Stream from S3
with open('s3://my-bucket/large-file.txt') as f:
    for line in f:
        print(line.strip())

# Stream with compression (automatic detection)
with open('s3://my-bucket/file.txt.gz') as f:
    content = f.read()

# Write to cloud storage
with open('gs://my-bucket/output.txt', 'w') as f:
    f.write('Hello, world!')

# Local files work too (drop-in replacement)
with open('./local-file.txt') as f:
    data = f.read()

# Binary operations with seeking
with open('s3://my-bucket/data.bin', 'rb') as f:
    f.seek(1000)         # Seek to position 1000
    chunk = f.read(100)  # Read 100 bytes
```

## Architecture

Smart Open uses a modular transport architecture:

- **Transport Layer**: Protocol-specific implementations (S3, GCS, Azure, HTTP, etc.)
- **Compression Layer**: Transparent compression/decompression handling
- **Unified API**: Single `open()` function interface compatible with built-in `open()`
- **Streaming Design**: Memory-efficient processing of arbitrarily large files

Each transport module provides consistent `parse_uri()`, `open_uri()`, and `open()` functions, with Reader/Writer classes implementing standard Python I/O interfaces.

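That per-module contract can be sketched as a minimal, self-contained module. Everything below is illustrative: `myscheme`, the returned dictionary, and the in-memory stub are hypothetical, not part of any shipped smart_open transport.

```python
# A minimal sketch of the transport-module contract: three functions with
# the same names every real transport (smart_open.s3, smart_open.hdfs, ...)
# provides. "myscheme" is a made-up protocol for illustration only.
import io
import urllib.parse

SCHEME = "myscheme"

def parse_uri(uri_as_string):
    """Break a URI into the pieces this transport needs."""
    split = urllib.parse.urlsplit(uri_as_string)
    return {"scheme": split.scheme, "host": split.netloc, "path": split.path}

def open(path, mode="rb"):
    """Open a stream for the given path; here just an in-memory stub."""
    return io.BytesIO(b"hello from %s" % path.encode())

def open_uri(uri_as_string, mode, transport_params):
    """Glue: parse the URI, then open the resulting path."""
    parsed = parse_uri(uri_as_string)
    return open(parsed["path"], mode)
```

A real transport module would additionally be registered with smart_open's transport registry so that `smart_open.open('myscheme://...')` dispatches to it.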
## Capabilities

### Core File Operations

Universal file operations that work across all supported storage systems with transparent compression support.

```python { .api }
def open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None,
         closefd=True, opener=None, compression='infer_from_extension',
         transport_params=None): ...

def parse_uri(uri_as_string): ...
```

[Core Operations](./core-operations.md)

### Cloud Storage Integration

Access to major cloud storage platforms with native client optimizations and streaming capabilities.

```python { .api }
# S3 operations
from smart_open.s3 import open, iter_bucket, Reader, MultipartWriter
# GCS operations
from smart_open.gcs import open, Reader, Writer
# Azure operations
from smart_open.azure import open, Reader, Writer
```

[Cloud Storage](./cloud-storage.md)

### Network and Remote Access

HTTP/HTTPS, FTP, and SSH-based file access with authentication and secure connection support.

```python { .api }
# HTTP operations
from smart_open.http import open
# FTP operations
from smart_open.ftp import open
# SSH/SFTP operations
from smart_open.ssh import open
```

[Network Access](./network-access.md)

### Big Data and Distributed Systems

Integration with the Hadoop ecosystem (HDFS, WebHDFS) for big data processing workflows.

```python { .api }
# HDFS operations
from smart_open.hdfs import open
# WebHDFS operations
from smart_open.webhdfs import open
```

[Big Data Systems](./big-data.md)

### Compression and Encoding

Automatic and explicit compression handling for multiple formats with streaming support.

```python { .api }
def register_compressor(ext, callback): ...
def get_supported_compression_types(): ...
def get_supported_extensions(): ...
```

[Compression](./compression.md)

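For example, a handler for `.xz` files can be registered with `register_compressor`. The callback receives the underlying binary file object and the mode, and returns a wrapped stream; here the standard-library `lzma` module does the actual (de)compression.

```python
import lzma
import os
import tempfile
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # smart_open hands us the raw binary stream and a binary mode;
    # we return a wrapped stream that (de)compresses on the fly.
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# After registration, .xz paths compress/decompress transparently:
path = os.path.join(tempfile.gettempdir(), 'example.txt.xz')
with open(path, 'w') as f:
    f.write('hello xz')
with open(path) as f:
    print(f.read())  # 'hello xz'
```

The same pattern works for any extension; the callback only has to honor the file-object/mode contract.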
### Utilities and Advanced Usage

Helper functions for URI handling, byte ranges, parallel processing, and custom transport development.

```python { .api }
# URI utilities
from smart_open.utils import safe_urlsplit, make_range_string
# Concurrency utilities
from smart_open.concurrency import create_pool
# Transport registration
from smart_open.transport import register_transport, get_transport
```

[Utilities](./utilities.md)

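As a small sketch: `safe_urlsplit` behaves like `urllib.parse.urlsplit`, but is tolerant of characters such as `?` that are legal in object-store keys, keeping them in the path rather than treating them as a query separator (behavior as understood from smart_open's utilities; verify against your installed version).

```python
from smart_open.utils import safe_urlsplit

# '?' is a legal character in an S3 key; safe_urlsplit keeps it in the path
parts = safe_urlsplit('s3://my-bucket/logs/2024?.txt')
print(parts.scheme)  # 's3'
print(parts.netloc)  # 'my-bucket'
```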
## Supported URL Formats

Smart Open supports a wide variety of URL formats:

- **S3**: `s3://bucket/key`, `s3://key:secret@bucket/key`
- **Google Cloud**: `gs://bucket/blob`
- **Azure**: `azure://container/blob`
- **HTTP/HTTPS**: `http://example.com/file`, `https://example.com/file`
- **FTP/FTPS**: `ftp://host/path`, `ftps://host/path`
- **SSH/SFTP**: `ssh://user@host/path`, `sftp://user@host/path`
- **HDFS**: `hdfs:///path/file`, `hdfs://namenode:port/path/file`
- **WebHDFS**: `webhdfs://host:port/path/file`
- **Local files**: `./path/file`, `file:///absolute/path`, `~/path/file`

All formats support transparent compression based on file extensions (`.gz`, `.bz2`, `.zst`, etc.).
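For instance, writing to a path ending in `.gz` compresses on the fly, and reading it back decompresses transparently. A local path is used below so the example is self-contained; the same applies to remote URIs.

```python
import gzip
import os
import tempfile
from smart_open import open

path = os.path.join(tempfile.gettempdir(), 'demo.txt.gz')

# Writing through smart_open gzip-compresses based on the extension
with open(path, 'w') as f:
    f.write('compressed transparently')

# The bytes on disk really are gzip; the stdlib can read them back
with gzip.open(path, 'rt') as f:
    print(f.read())  # 'compressed transparently'
```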