tessl/pypi-smart-open

Utils for streaming large files from S3, HDFS, GCS, SFTP, Azure Blob Storage, and local filesystem with transparent compression support

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: `pypi/smart-open@7.3.x`

To install, run

npx @tessl/cli install tessl/pypi-smart-open@7.3.0

# Smart Open

Smart Open is a Python library for **efficient streaming of very large files** from/to various storage systems including S3, Google Cloud Storage, Azure Blob Storage, HDFS, WebHDFS, HTTP/HTTPS, SFTP, and local filesystem. It provides transparent, on-the-fly compression/decompression for multiple formats and serves as a drop-in replacement for Python's built-in `open()` function with 100% compatibility.

## Package Information

- **Package Name**: smart_open
- **Language**: Python
- **Installation**: `pip install smart_open`
- **Version**: 7.3.0
- **License**: MIT

## Core Imports

```python
from smart_open import open
```

For URI parsing:

```python
from smart_open import parse_uri
```

For compression handling:

```python
from smart_open import register_compressor
```

For the legacy `smart_open` context manager:

```python
from smart_open import smart_open
```

## Basic Usage

```python
from smart_open import open

# Stream from S3
with open('s3://my-bucket/large-file.txt') as f:
    for line in f:
        print(line.strip())

# Stream with compression (automatic detection)
with open('s3://my-bucket/file.txt.gz') as f:
    content = f.read()

# Write to cloud storage
with open('gs://my-bucket/output.txt', 'w') as f:
    f.write('Hello, world!')

# Local files work too (drop-in replacement)
with open('./local-file.txt') as f:
    data = f.read()

# Binary operations with seeking
with open('s3://my-bucket/data.bin', 'rb') as f:
    f.seek(1000)         # Seek to position 1000
    chunk = f.read(100)  # Read 100 bytes
```

## Architecture

Smart Open uses a modular transport architecture:

- **Transport Layer**: Protocol-specific implementations (S3, GCS, Azure, HTTP, etc.)
- **Compression Layer**: Transparent compression/decompression handling
- **Unified API**: Single `open()` function interface compatible with built-in `open()`
- **Streaming Design**: Memory-efficient processing of arbitrarily large files

Each transport module provides consistent `parse_uri()`, `open_uri()`, and `open()` functions, with Reader/Writer classes implementing standard Python I/O interfaces.

## Capabilities

### Core File Operations

Universal file operations that work across all supported storage systems, with transparent compression support.

```python { .api }
def open(uri, mode='r', buffering=-1, encoding=None, errors=None, newline=None,
         closefd=True, opener=None, compression='infer_from_extension',
         transport_params=None): ...

def parse_uri(uri_as_string): ...
```

[Core Operations](./core-operations.md)
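
The `compression='infer_from_extension'` default means the codec is chosen from the file extension. A hedged sketch of that idea, with an illustrative mapping (smart_open's full table comes from its compression registry, see `get_supported_extensions()`):

```python
import os

# Illustrative extension -> codec table; not smart_open's actual table.
_CODECS = {".gz": "gzip", ".bz2": "bzip2", ".zst": "zstandard"}

def infer_compression(path):
    # Only the last extension matters: 'logs.txt.gz' -> '.gz' -> gzip.
    _, ext = os.path.splitext(path)
    return _CODECS.get(ext.lower())  # None means: pass bytes through untouched

print(infer_compression("s3://bucket/logs.txt.gz"))  # gzip
print(infer_compression("plain.txt"))                # None
```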

### Cloud Storage Integration

Access to the major cloud storage platforms with native client optimizations and streaming capabilities.

```python { .api }
# S3 operations
from smart_open.s3 import open, iter_bucket, Reader, MultipartWriter

# GCS operations
from smart_open.gcs import open, Reader, Writer

# Azure operations
from smart_open.azure import open, Reader, Writer
```

[Cloud Storage](./cloud-storage.md)

### Network and Remote Access

HTTP/HTTPS, FTP, and SSH-based file access with authentication and secure-connection support.

```python { .api }
# HTTP operations
from smart_open.http import open

# FTP operations
from smart_open.ftp import open

# SSH/SFTP operations
from smart_open.ssh import open
```

[Network Access](./network-access.md)

### Big Data and Distributed Systems

Integration with the Hadoop ecosystem (HDFS, WebHDFS) for big-data processing workflows.

```python { .api }
# HDFS operations
from smart_open.hdfs import open

# WebHDFS operations
from smart_open.webhdfs import open
```

[Big Data Systems](./big-data.md)

### Compression and Encoding

Automatic and explicit compression handling for multiple formats, with streaming support.

```python { .api }
def register_compressor(ext, callback): ...
def get_supported_compression_types(): ...
def get_supported_extensions(): ...
```

[Compression](./compression.md)
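
The callback passed to `register_compressor` receives the already-opened binary file object plus the mode, and returns a wrapping file object. The sketch below exercises that callback shape standalone over an in-memory buffer using stdlib `lzma`, rather than going through smart_open itself:

```python
import io
import lzma

# A register_compressor-style callback: wrap an open binary stream in a
# compressing/decompressing file object.
def handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

# With smart_open you would hook it up as:
#   register_compressor('.xz', handle_xz)
# after which open('path/to/file.xz') compresses/decompresses transparently.

buf = io.BytesIO()
with handle_xz(buf, "wb") as f:
    f.write(b"compressed payload")

buf.seek(0)
with handle_xz(buf, "rb") as f:
    print(f.read())  # b'compressed payload'
```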

### Utilities and Advanced Usage

Helper functions for URI handling, byte ranges, parallel processing, and custom transport development.

```python { .api }
# URI utilities
from smart_open.utils import safe_urlsplit, make_range_string

# Concurrency utilities
from smart_open.concurrency import create_pool

# Transport registration
from smart_open.transport import register_transport, get_transport
```

[Utilities](./utilities.md)
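
Byte-range helpers like `make_range_string` build the HTTP `Range` header used for partial reads. A sketch in that spirit — the exact signature in `smart_open.utils` may differ, so treat this as illustrative:

```python
# RFC 7233 byte ranges are inclusive on both ends:
# 'bytes=0-99' requests the first 100 bytes of the object.
def make_range_string(start, stop=None):
    if stop is None:
        return f"bytes={start}-"       # open-ended: from start to end of object
    return f"bytes={start}-{stop}"

print(make_range_string(0, 99))  # bytes=0-99
print(make_range_string(1000))   # bytes=1000-
```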

## Supported URL Formats

Smart Open supports a wide variety of URL formats:

- **S3**: `s3://bucket/key`, `s3://key:secret@bucket/key`
- **Google Cloud**: `gs://bucket/blob`
- **Azure**: `azure://container/blob`
- **HTTP/HTTPS**: `http://example.com/file`, `https://example.com/file`
- **FTP/FTPS**: `ftp://host/path`, `ftps://host/path`
- **SSH/SFTP**: `ssh://user@host/path`, `sftp://user@host/path`
- **HDFS**: `hdfs:///path/file`, `hdfs://namenode:port/path/file`
- **WebHDFS**: `webhdfs://host:port/path/file`
- **Local files**: `./path/file`, `file:///absolute/path`, `~/path/file`

All formats support transparent compression based on file extension (`.gz`, `.bz2`, `.zst`, etc.).