# Google Cloud BigQuery Storage

A high-performance Python client library for the Google BigQuery Storage API that enables efficient streaming of large datasets from BigQuery tables. The library provides streaming reads, streaming writes with transactional semantics, support for multiple serialization formats (Avro, Arrow, Protocol Buffers), and integration with popular data analysis frameworks such as pandas and pyarrow.

## Package Information

- **Package Name**: google-cloud-bigquery-storage
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install google-cloud-bigquery-storage`
- **Optional Dependencies**:
  - `pip install google-cloud-bigquery-storage[fastavro]` - for Avro format support
  - `pip install google-cloud-bigquery-storage[pyarrow]` - for Arrow format support
  - `pip install google-cloud-bigquery-storage[pandas]` - for pandas DataFrame support

## Core Imports

```python
from google.cloud import bigquery_storage
```

Import specific clients and types:

```python
from google.cloud.bigquery_storage import BigQueryReadClient, BigQueryWriteClient, BigQueryWriteAsyncClient
from google.cloud.bigquery_storage import types
from google.cloud.bigquery_storage import ReadRowsStream, AppendRowsStream

# Access package version
import google.cloud.bigquery_storage
print(google.cloud.bigquery_storage.__version__)
```

Access beta/alpha and v1 APIs:

```python
# Explicit v1 API access
from google.cloud import bigquery_storage_v1

# Beta versions for metastore services
from google.cloud import bigquery_storage_v1beta
from google.cloud import bigquery_storage_v1beta2

# Alpha version for experimental features
from google.cloud import bigquery_storage_v1alpha
```

## Basic Usage

### Reading BigQuery Data

```python
from google.cloud.bigquery_storage import BigQueryReadClient, types

# Create client
client = BigQueryReadClient()

# Configure read session
table = "projects/your-project/datasets/your_dataset/tables/your_table"
requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO
)

# Create read session
session = client.create_read_session(
    parent="projects/your-project",
    read_session=requested_session,
    max_stream_count=1
)

# Read data
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```

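
For analysis workflows, the session can also be narrowed to specific columns and rows, and the stream converted directly to a pandas DataFrame. A minimal sketch, assuming the `pandas` and `fastavro` extras are installed and reusing the placeholder project, dataset, and table names; the column names and filter are hypothetical:

```python
from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()

# Limit the session to two columns and a filtered subset of rows.
read_options = types.ReadSession.TableReadOptions(
    selected_fields=["name", "state"],
    row_restriction='state = "CA"',
)
requested_session = types.ReadSession(
    table="projects/your-project/datasets/your_dataset/tables/your_table",
    data_format=types.DataFormat.AVRO,
    read_options=read_options,
)
session = client.create_read_session(
    parent="projects/your-project",
    read_session=requested_session,
    max_stream_count=1,
)

# Materialize the first stream as a DataFrame.
reader = client.read_rows(session.streams[0].name)
df = reader.to_dataframe(session)
print(df.head())
```
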
### Writing BigQuery Data

```python
from google.cloud.bigquery_storage import BigQueryWriteClient, types

# Create client
client = BigQueryWriteClient()

# Create a pending write stream
parent = client.table_path("your-project", "your_dataset", "your_table")
write_stream = types.WriteStream(type_=types.WriteStream.Type.PENDING)
stream = client.create_write_stream(parent=parent, write_stream=write_stream)

# Append data (requires protocol buffer serialized data)
request = types.AppendRowsRequest(write_stream=stream.name)
# ... configure with serialized row data
responses = client.append_rows([request])
```

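
Rows appended to a `PENDING` stream become visible in the table only after the stream is finalized and committed. A minimal sketch of that final step, assuming `client`, `parent`, and `stream` from the example above:

```python
from google.cloud.bigquery_storage import types

# No further appends are accepted once the stream is finalized.
client.finalize_write_stream(name=stream.name)

# Atomically commit one or more finalized streams to the table.
commit_request = types.BatchCommitWriteStreamsRequest(
    parent=parent,
    write_streams=[stream.name],
)
commit_response = client.batch_commit_write_streams(request=commit_request)
print(commit_response.commit_time)
```
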
## Architecture

The BigQuery Storage API uses a streaming architecture designed for high-performance data transfer:

- **Read Sessions**: Logical containers that define what data to read and how to format it
- **Read Streams**: Individual data streams within a session that can be processed in parallel
- **Write Streams**: Buffered, append-only streams for inserting data with transactional semantics
- **Data Formats**: Support for Avro, Arrow, and Protocol Buffer serialization
- **Helper Classes**: High-level abstractions (`ReadRowsStream`, `AppendRowsStream`) for easier stream management

This design enables:

- **Parallel Processing**: Multiple streams can be read or written concurrently (see the sketch after this list)
- **Format Flexibility**: Choose the optimal serialization format for your use case
- **Integration**: Seamless conversion to pandas DataFrames and Apache Arrow
- **Transactional Writes**: ACID guarantees for write operations

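
To illustrate the parallelism, here is a minimal sketch that fans several read streams out to a thread pool and concatenates the partial results into one DataFrame. It assumes the `pyarrow` and `pandas` extras are installed and reuses the placeholder table from the examples above:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas
from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()

session = client.create_read_session(
    parent="projects/your-project",
    read_session=types.ReadSession(
        table="projects/your-project/datasets/your_dataset/tables/your_table",
        data_format=types.DataFormat.ARROW,
    ),
    max_stream_count=4,  # ask the service for up to four parallel streams
)

def read_stream(stream: types.ReadStream) -> pandas.DataFrame:
    # Each stream is an independent, non-overlapping slice of the table.
    return client.read_rows(stream.name).to_dataframe(session)

# Read every stream concurrently and combine the partial results.
with ThreadPoolExecutor(max_workers=len(session.streams) or 1) as executor:
    frames = list(executor.map(read_stream, session.streams))

df = pandas.concat(frames, ignore_index=True)
print(len(df))
```
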
## Capabilities

### Reading Data

High-performance streaming reads from BigQuery tables with support for parallel processing, column selection, row filtering, and multiple data formats. Includes conversion utilities for pandas and Arrow. Available in both synchronous and asynchronous versions.

```python { .api }
class BigQueryReadClient:
    def create_read_session(
        self,
        parent: str,
        read_session: ReadSession,
        max_stream_count: int = None
    ) -> ReadSession: ...

    def read_rows(self, name: str, offset: int = 0) -> ReadRowsStream: ...

    def split_read_stream(
        self,
        name: str,
        fraction: float = None
    ) -> SplitReadStreamResponse: ...

class BigQueryReadAsyncClient:
    async def create_read_session(
        self,
        parent: str,
        read_session: ReadSession,
        max_stream_count: int = None
    ) -> ReadSession: ...

    def read_rows(self, name: str, offset: int = 0) -> ReadRowsStream: ...

    async def split_read_stream(
        self,
        name: str,
        fraction: float = None
    ) -> SplitReadStreamResponse: ...
```

[Reading Data](./reading-data.md)

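
`split_read_stream` is listed above but not demonstrated elsewhere in this overview; it supports dynamic work rebalancing by splitting the unread remainder of a stream so another worker can take part of it. A minimal sketch, assuming `client` and `session` from the Basic Usage example:

```python
from google.cloud.bigquery_storage import types

# Split the remaining rows of the first stream roughly in half.
split = client.split_read_stream(
    request=types.SplitReadStreamRequest(
        name=session.streams[0].name,
        fraction=0.5,
    )
)

# The primary stream keeps roughly the first half; the remainder stream
# can be handed to another worker and read like any other stream.
remainder_reader = client.read_rows(split.remainder_stream.name)
for row in remainder_reader.rows(session):
    print(row)
```
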
### Writing Data

Streaming write operations with support for multiple write stream types, transactional semantics, and batch commit operations. Rows are appended as Protocol Buffer serialized data.

```python { .api }
class BigQueryWriteClient:
    def create_write_stream(
        self,
        parent: str,
        write_stream: WriteStream
    ) -> WriteStream: ...

    def append_rows(
        self,
        requests: Iterator[AppendRowsRequest]
    ) -> Iterator[AppendRowsResponse]: ...

    def finalize_write_stream(self, name: str) -> FinalizeWriteStreamResponse: ...

    def batch_commit_write_streams(
        self,
        parent: str,
        write_streams: List[str]
    ) -> BatchCommitWriteStreamsResponse: ...
```

[Writing Data](./writing-data.md)

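
For completeness, the higher-level `AppendRowsStream` helper (from `google.cloud.bigquery_storage_v1.writer`) manages the bidirectional `append_rows` connection and per-request offsets. A minimal sketch of appending Protocol Buffer rows through it, assuming `client` and the pending `stream` from the Basic Usage example; `sample_data_pb2` is a hypothetical compiled proto module whose `SampleData` message matches the destination table schema:

```python
from google.cloud.bigquery_storage import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2

import sample_data_pb2  # hypothetical generated proto module

# Template request: declare the target stream and writer schema once.
request_template = types.AppendRowsRequest(write_stream=stream.name)
proto_descriptor = descriptor_pb2.DescriptorProto()
sample_data_pb2.SampleData.DESCRIPTOR.CopyToProto(proto_descriptor)
request_template.proto_rows = types.AppendRowsRequest.ProtoData(
    writer_schema=types.ProtoSchema(proto_descriptor=proto_descriptor)
)

append_stream = writer.AppendRowsStream(client, request_template)

# Serialize a batch of rows and append it at offset 0.
proto_rows = types.ProtoRows()
row = sample_data_pb2.SampleData(name="Alice", age=30)
proto_rows.serialized_rows.append(row.SerializeToString())

request = types.AppendRowsRequest()
request.offset = 0
request.proto_rows = types.AppendRowsRequest.ProtoData(rows=proto_rows)

future = append_stream.send(request)
print(future.result())  # blocks until the append is acknowledged

append_stream.close()
```
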
### Types and Schemas

Comprehensive type system for BigQuery Storage operations, including session configuration, stream management, data formats, error handling, and schema definitions.

```python { .api }
class DataFormat(enum.Enum):
    AVRO = 1
    ARROW = 2
    PROTO = 3

class ReadSession:
    table: str
    data_format: DataFormat
    read_options: TableReadOptions
    streams: List[ReadStream]

class WriteStream:
    name: str
    type_: WriteStream.Type
    create_time: Timestamp
    state: WriteStream.State
```

[Types and Schemas](./types-schemas.md)

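
As a usage note, the write stream types above control when appended rows become visible. A minimal sketch, assuming placeholder project, dataset, and table names; only the server-populated fields listed in the `WriteStream` summary are inspected:

```python
from google.cloud.bigquery_storage import BigQueryWriteClient, types

client = BigQueryWriteClient()
parent = client.table_path("your-project", "your_dataset", "your_table")

# The stream types trade off visibility and commit semantics.
pending = types.WriteStream(type_=types.WriteStream.Type.PENDING)      # visible after batch commit
committed = types.WriteStream(type_=types.WriteStream.Type.COMMITTED)  # visible immediately
buffered = types.WriteStream(type_=types.WriteStream.Type.BUFFERED)    # visible after flush

stream = client.create_write_stream(parent=parent, write_stream=pending)

# Fields populated by the service on creation.
print(stream.name)
print(stream.type_)
print(stream.create_time)
```
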
### Metastore Services

Beta and alpha services for managing BigQuery external table metastore partitions. Supports batch operations for creating, updating, deleting, and listing Hive-style partitions in external tables.

```python { .api }
class MetastorePartitionServiceClient:
    def batch_create_metastore_partitions(
        self,
        parent: str,
        requests: List[CreateMetastorePartitionRequest]
    ) -> BatchCreateMetastorePartitionsResponse: ...

    def batch_delete_metastore_partitions(
        self,
        parent: str,
        partition_names: List[str]
    ) -> None: ...

    def list_metastore_partitions(
        self,
        parent: str,
        filter: str = None
    ) -> List[MetastorePartition]: ...
```

[Metastore Services](./metastore-services.md)

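
A minimal sketch of listing metastore partitions through the beta surface, following the signature summarized above. The parent resource string is a placeholder, and whether the method accepts flattened keyword arguments (rather than a request object) may vary by library version, so treat this as an illustration rather than a verified call sequence:

```python
from google.cloud import bigquery_storage_v1beta

client = bigquery_storage_v1beta.MetastorePartitionServiceClient()

# Placeholder resource name for the external table whose partitions are listed.
parent = "projects/your-project/datasets/your_dataset/tables/your_external_table"

# Call shape follows the API summary above; some versions may require an
# explicit ListMetastorePartitionsRequest instead of keyword arguments.
response = client.list_metastore_partitions(parent=parent)
print(response)
```
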