Google BigQuery connector for pandas DataFrames
npx @tessl/cli install tessl/pypi-pandas-gbq@0.29.00
# pandas-gbq

Google BigQuery connector for pandas DataFrames providing seamless two-way data integration. The library offers high-level functions for executing SQL queries and returning results as DataFrames and for uploading DataFrames to BigQuery tables, with comprehensive support for authentication, schema inference, data type mapping, and performance optimization.

## Package Information

- **Package Name**: pandas-gbq
- **Language**: Python
- **Installation**: `pip install pandas-gbq`
- **Optional Features**:
  - BigQuery Storage API: `pip install pandas-gbq[bqstorage]`
  - Progress bars: `pip install pandas-gbq[tqdm]`
  - Geographic data: `pip install pandas-gbq[geopandas]`
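
The extras above can also be combined in a single install using pip's standard comma-separated extras syntax (quoting keeps shells from globbing the brackets):

```shell
# Install pandas-gbq with the Storage API and progress-bar extras in one step.
pip install 'pandas-gbq[bqstorage,tqdm]'
```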

## Core Imports

```python
import pandas_gbq
```

Standard import pattern:

```python
from pandas_gbq import read_gbq, to_gbq
```

For context management:

```python
from pandas_gbq import context, Context
```

## Basic Usage

```python
import pandas as pd
from pandas_gbq import read_gbq, to_gbq

# Reading data from BigQuery
df = read_gbq(
    "SELECT name, age FROM `my_project.my_dataset.users` WHERE age > 21",
    project_id="my-gcp-project"
)

# Writing DataFrame to BigQuery
sample_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'London', 'Tokyo']
})

to_gbq(
    dataframe=sample_data,
    destination_table="my_dataset.new_users",
    project_id="my-gcp-project",
    if_exists="append"
)

# Using context for session-wide configuration
import pandas_gbq
pandas_gbq.context.project = "my-gcp-project"
pandas_gbq.context.dialect = "standard"

# Now queries can omit project_id
df = read_gbq("SELECT * FROM `my_dataset.my_table`")
```

## Architecture

pandas-gbq serves as a high-level connector between pandas DataFrames and Google BigQuery, built on top of the Google Cloud BigQuery client library. It handles the complexities of BigQuery integration across several layers:

- **Authentication Layer**: Supports multiple authentication methods including Application Default Credentials, service account keys, and user OAuth flows
- **Data Type Mapping**: Automatically handles conversion between pandas data types and BigQuery types, with support for nullable integers, timestamps, and geographic data
- **Query Execution**: Executes SQL queries through BigQuery's REST API with support for both standard and legacy SQL dialects
- **Data Transfer**: Optimizes data transfer with chunking for large uploads, the BigQuery Storage API for fast reads, and streaming inserts for real-time data
- **Session Management**: Provides a global context system for credential and configuration management across multiple operations

The design prioritizes ease of use while exposing advanced BigQuery features through comprehensive parameter options, making it suitable for both simple data analysis tasks and complex ETL pipelines.
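
The Data Type Mapping layer pairs each pandas dtype with a BigQuery column type. The lookup below is an illustrative sketch of that idea, not pandas-gbq's actual table (the real mapping covers many more cases and inspects values for `object` columns):

```python
# Illustrative (hypothetical) dtype-to-BigQuery type lookup, sketching what
# the Data Type Mapping layer does. Not pandas-gbq's internal implementation.
PANDAS_TO_BIGQUERY = {
    "int64": "INTEGER",
    "Int64": "INTEGER",          # pandas nullable integer
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",          # commonly strings; real inference checks values
}

def bigquery_type(dtype_name: str) -> str:
    """Map a pandas dtype name to a BigQuery type, defaulting to STRING."""
    return PANDAS_TO_BIGQUERY.get(dtype_name, "STRING")
```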

## Capabilities

### Data Reading

Execute SQL queries or read tables directly from Google BigQuery, returning results as pandas DataFrames with full data type support and optional BigQuery Storage API acceleration.

```python { .api }
def read_gbq(
    query_or_table: str,
    project_id: str = None,
    index_col: str = None,
    columns: list[str] = None,
    reauth: bool = False,
    auth_local_webserver: bool = True,
    dialect: str = None,
    location: str = None,
    configuration: dict = None,
    credentials = None,
    use_bqstorage_api: bool = False,
    max_results: int = None,
    verbose: bool = None,
    private_key: str = None,
    progress_bar_type: str = "tqdm",
    dtypes: dict = None,
    auth_redirect_uri: str = None,
    client_id: str = None,
    client_secret: str = None,
    *,
    col_order: list[str] = None,
    bigquery_client = None
) -> pd.DataFrame:
    """
    Read data from Google BigQuery to a pandas DataFrame.

    Parameters:
    - query_or_table: SQL query string or table ID (dataset.table format)
    - project_id: Google Cloud Platform project ID (optional if set in environment/context)
    - index_col: Column name to use as DataFrame index
    - columns: List of column names in desired order for results
    - reauth: Force re-authentication (useful for multiple accounts)
    - auth_local_webserver: Use local webserver flow instead of console flow
    - dialect: SQL dialect ('standard' or 'legacy', default: 'standard')
    - location: Geographic location for query job execution
    - configuration: Advanced query configuration parameters
    - credentials: Google auth credentials object
    - use_bqstorage_api: Enable BigQuery Storage API for faster reads
    - max_results: Maximum number of rows to return
    - verbose: Enable verbose logging (deprecated)
    - private_key: Service account private key (deprecated)
    - progress_bar_type: Progress bar type ('tqdm' or None)
    - dtypes: Explicit data type mapping for columns
    - auth_redirect_uri: Custom OAuth redirect URI
    - client_id: OAuth client ID
    - client_secret: OAuth client secret
    - col_order: Column order for results (keyword-only)
    - bigquery_client: Pre-configured BigQuery client instance (keyword-only)

    Returns:
    pandas.DataFrame: Query results as DataFrame with appropriate data types
    """
```

Usage examples:

```python
# Basic query
df = read_gbq("SELECT * FROM `dataset.table` LIMIT 1000", project_id="my-project")

# Direct table read
df = read_gbq("my_dataset.my_table", project_id="my-project")

# With BigQuery Storage API for large results
df = read_gbq(
    "SELECT * FROM `big_dataset.huge_table`",
    project_id="my-project",
    use_bqstorage_api=True
)

# With custom column ordering and types
df = read_gbq(
    "SELECT name, age, created_at FROM `users.profile`",
    project_id="my-project",
    columns=["name", "age", "created_at"],
    dtypes={"age": "Int64"}  # Use nullable integer type
)
```

### Data Writing

Upload pandas DataFrames to Google BigQuery tables with flexible schema handling, multiple upload methods, and comprehensive data type support.

```python { .api }
def to_gbq(
    dataframe: pd.DataFrame,
    destination_table: str,
    project_id: str = None,
    chunksize: int = None,
    reauth: bool = False,
    if_exists: str = "fail",
    auth_local_webserver: bool = True,
    table_schema: list[dict] = None,
    location: str = None,
    progress_bar: bool = True,
    credentials = None,
    api_method: str = "default",
    verbose: bool = None,
    private_key: str = None,
    auth_redirect_uri: str = None,
    client_id: str = None,
    client_secret: str = None,
    user_agent: str = None,
    rfc9110_delimiter: bool = False,
    bigquery_client = None
) -> None:
    """
    Write a DataFrame to a Google BigQuery table.

    Parameters:
    - dataframe: pandas DataFrame to upload
    - destination_table: Target table in format 'dataset.table' or 'project.dataset.table'
    - project_id: Google Cloud Platform project ID (optional if set in environment/context)
    - chunksize: Number of rows per upload chunk (None = upload all at once)
    - reauth: Force re-authentication
    - if_exists: Behavior when table exists ('fail', 'replace', 'append')
    - auth_local_webserver: Use local webserver flow instead of console flow
    - table_schema: Custom BigQuery schema specification as list of field dicts
    - location: Geographic location for table and load job
    - progress_bar: Show upload progress bar
    - credentials: Google auth credentials object
    - api_method: Upload method ('default', 'load_parquet', or 'load_csv')
    - verbose: Enable verbose logging (deprecated)
    - private_key: Service account private key (deprecated)
    - auth_redirect_uri: Custom OAuth redirect URI
    - client_id: OAuth client ID
    - client_secret: OAuth client secret
    - user_agent: Custom user agent string
    - rfc9110_delimiter: Use RFC 9110 compliant field delimiters
    - bigquery_client: Pre-configured BigQuery client instance

    Returns:
    None
    """
```

Usage examples:

```python
import pandas as pd
from pandas_gbq import to_gbq

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [95.5, 87.2, 92.1],
    'active': [True, False, True]
})

# Basic upload
to_gbq(df, "my_dataset.scores", project_id="my-project")

# Append to existing table
to_gbq(df, "my_dataset.scores", project_id="my-project", if_exists="append")

# Replace existing table
to_gbq(df, "my_dataset.scores", project_id="my-project", if_exists="replace")

# Custom schema specification
schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "score", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "active", "type": "BOOLEAN", "mode": "NULLABLE"}
]

to_gbq(
    df,
    "my_dataset.scores",
    project_id="my-project",
    table_schema=schema,
    if_exists="replace"
)

# Chunked upload for large DataFrames
to_gbq(
    large_df,
    "my_dataset.big_table",
    project_id="my-project",
    chunksize=10000,
    if_exists="append"
)
```
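
What `chunksize` does can be sketched with plain batching arithmetic. This is illustrative only; the library performs the equivalent split internally before issuing its upload requests:

```python
def row_batches(n_rows: int, chunksize: int) -> list:
    """Return (start, stop) row ranges covering n_rows in chunks of chunksize.
    The final batch may be smaller than chunksize."""
    return [(start, min(start + chunksize, n_rows))
            for start in range(0, n_rows, chunksize)]
```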

### Session Context Management

Manage session-wide configuration for credentials, default project, and SQL dialect preferences to reduce parameter repetition across function calls.

```python { .api }
class Context:
    """Storage for objects to be used throughout a session."""

    @property
    def credentials(self) -> google.auth.credentials.Credentials:
        """
        Credentials to use for Google APIs.

        Returns:
        google.auth.credentials.Credentials object or None
        """

    @credentials.setter
    def credentials(self, value: google.auth.credentials.Credentials) -> None:
        """Set credentials for Google APIs."""

    @property
    def project(self) -> str:
        """
        Default project to use for calls to Google APIs.

        Returns:
        str: Project ID or None
        """

    @project.setter
    def project(self, value: str) -> None:
        """Set default project ID."""

    @property
    def dialect(self) -> str:
        """
        Default SQL dialect to use in read_gbq.

        Returns:
        str: 'standard', 'legacy', or None
        """

    @dialect.setter
    def dialect(self, value: str) -> None:
        """Set default SQL dialect ('standard' or 'legacy')."""

# Global context instance
context: Context
```

Usage examples:

```python
import pandas_gbq
from pandas_gbq import Context, read_gbq, to_gbq
from google.oauth2 import service_account

# Set credentials from service account file
credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service-account-key.json'
)
pandas_gbq.context.credentials = credentials

# Set default project
pandas_gbq.context.project = "my-gcp-project"

# Set SQL dialect preference
pandas_gbq.context.dialect = "standard"

# Now function calls can omit these parameters
df = read_gbq("SELECT * FROM my_dataset.my_table")
to_gbq(df, "my_dataset.output_table", if_exists="replace")

# Create custom context instances
custom_context = Context()
custom_context.project = "different-project"
```

## Exception Handling

pandas-gbq defines custom exceptions for different error conditions:

```python { .api }
# Core exceptions
class DatasetCreationError(ValueError):
    """Raised when dataset creation fails."""

class TableCreationError(ValueError):
    """Raised when table creation fails."""

    @property
    def message(self) -> str:
        """Error message."""

class InvalidSchema(ValueError):
    """Raised when DataFrame doesn't match BigQuery table schema."""

    @property
    def message(self) -> str:
        """Error message."""

class NotFoundException(ValueError):
    """Raised when project, table, or dataset not found."""

class GenericGBQException(ValueError):
    """Raised for unrecognized Google API errors."""

class AccessDenied(ValueError):
    """Raised for authentication/authorization failures."""

class ConversionError(GenericGBQException):
    """Raised for DataFrame conversion problems."""

# Query and data handling exceptions
class QueryTimeout(ValueError):
    """Raised when query exceeds timeout."""

class InvalidColumnOrder(ValueError):
    """Raised when column order doesn't match schema."""

class InvalidIndexColumn(ValueError):
    """Raised when index column doesn't match schema."""

class InvalidPageToken(ValueError):
    """Raised for BigQuery page token failures."""

class InvalidPrivateKeyFormat(ValueError):
    """Raised for invalid private key format."""

# Warning types
class LargeResultsWarning(UserWarning):
    """Warning for results beyond recommended DataFrame size."""

class PerformanceWarning(RuntimeWarning):
    """Warning for unsupported performance features."""
```

Common error handling patterns:

```python
from pandas_gbq import read_gbq, to_gbq
from pandas_gbq.exceptions import (
    NotFoundException,
    AccessDenied,
    QueryTimeout,
    InvalidSchema,
    TableCreationError
)

try:
    df = read_gbq("SELECT * FROM nonexistent.table", project_id="my-project")
except NotFoundException as e:
    print(f"Table not found: {e}")
except AccessDenied as e:
    print(f"Access denied: {e}")
except QueryTimeout as e:
    print(f"Query timed out: {e}")

try:
    to_gbq(df, "dataset.table", project_id="my-project")
except InvalidSchema as e:
    print(f"Schema mismatch: {e.message}")
except TableCreationError as e:
    print(f"Failed to create table: {e.message}")
```
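
Since `GenericGBQException` covers unrecognized (often transient) API errors, callers sometimes wrap operations in a retry loop. A minimal, illustrative sketch; the exception class here is a local stand-in so the example is self-contained, and in real code you would catch `pandas_gbq.exceptions.GenericGBQException` instead:

```python
import time

class GenericGBQException(ValueError):
    """Stand-in for pandas_gbq.exceptions.GenericGBQException."""

def with_retries(operation, attempts=3, base_delay=1.0):
    """Run operation(), retrying on GenericGBQException with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except GenericGBQException:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the last error
            time.sleep(base_delay * 2 ** attempt)
```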

## Type Definitions

Key data types and interfaces used throughout the pandas-gbq API:

```python { .api }
# Authentication types (from google-auth)
from google.auth.credentials import Credentials
from google.cloud.bigquery import Client as BigQueryClient

# pandas-gbq specific type hints
from typing import Optional, Union, List, Dict, Any

# Common type aliases used in function signatures
ProjectId = Optional[str]
TableId = str  # Format: "dataset.table" or "project.dataset.table"
QueryString = str
ColumnList = Optional[List[str]]
SchemaList = Optional[List[Dict[str, Any]]]
ConfigDict = Optional[Dict[str, Any]]
DtypeDict = Optional[Dict[str, Any]]

# BigQuery schema field format
SchemaField = Dict[str, Any]  # Example: {"name": "col1", "type": "STRING", "mode": "NULLABLE"}

# API method options
ApiMethod = str  # "default", "load_parquet", or "load_csv"
IfExistsOption = str  # "fail", "replace", or "append"
DialectOption = str  # "standard" or "legacy"
ProgressBarType = Optional[str]  # "tqdm" or None
```
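
A `SchemaList` in the shape above can be produced programmatically. The helper below is a small illustrative sketch (hypothetical, not part of pandas-gbq) that builds a `table_schema` list from a column-to-type mapping:

```python
from typing import Any, Dict, List

def build_table_schema(columns: Dict[str, str],
                       required: frozenset = frozenset()) -> List[Dict[str, Any]]:
    """Build a BigQuery table_schema list from {column_name: BIGQUERY_TYPE}.
    Columns named in `required` get mode REQUIRED; the rest are NULLABLE."""
    return [
        {"name": name,
         "type": bq_type,
         "mode": "REQUIRED" if name in required else "NULLABLE"}
        for name, bq_type in columns.items()
    ]
```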

## Version Information

```python { .api }
__version__: str  # Current package version (0.29.2)
```

## Authentication Methods

pandas-gbq supports multiple authentication methods:

1. **Application Default Credentials (ADC)**: Automatic detection from environment
2. **Service Account Key Files**: JSON key files for service accounts
3. **User Account OAuth**: Interactive browser-based authentication
4. **Credentials Objects**: Pre-constructed google-auth credential objects passed via the `credentials` parameter
5. **Custom BigQuery Client**: Pre-configured client instances passed via `bigquery_client`

```python
from pandas_gbq import read_gbq

# ADC (recommended for production)
df = read_gbq("SELECT 1", project_id="my-project")  # Uses ADC automatically

# Service account file
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('key.json')
df = read_gbq("SELECT 1", project_id="my-project", credentials=credentials)

# User OAuth (interactive)
df = read_gbq("SELECT 1", project_id="my-project", reauth=True)

# Pre-configured client
from google.cloud import bigquery
client = bigquery.Client(project="my-project")
df = read_gbq("SELECT 1", bigquery_client=client)
```