Lakebase Provisioned

Patterns and best practices for using Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads.

When to Use

Use this skill when:

  • Building applications that need a PostgreSQL database for transactional workloads
  • Adding persistent state to Databricks Apps
  • Implementing reverse ETL from Delta Lake to an operational database
  • Storing chat/agent memory for LangChain applications

Overview

Lakebase Provisioned is Databricks' managed PostgreSQL database service for OLTP (Online Transaction Processing) workloads. It provides a fully managed PostgreSQL-compatible database that integrates with Unity Catalog and supports OAuth token-based authentication.

Feature | Description
Managed PostgreSQL | Fully managed instances with automatic provisioning
OAuth Authentication | Token-based auth via Databricks SDK (1-hour expiry)
Unity Catalog | Register databases for governance
Reverse ETL | Sync data from Delta tables to PostgreSQL
Apps Integration | First-class support in Databricks Apps

Available Regions (AWS): us-east-1, us-east-2, us-west-2, eu-central-1, eu-west-1, ap-south-1, ap-southeast-1, ap-southeast-2

Quick Start

Create and connect to a Lakebase Provisioned instance:

from databricks.sdk import WorkspaceClient

# Initialize client
w = WorkspaceClient()

# Create a database instance
instance = w.database.create_database_instance(
    name="my-lakebase-instance",
    capacity="CU_1",  # CU_1, CU_2, CU_4, CU_8
    stopped=False
)
print(f"Instance created: {instance.name}")
print(f"DNS endpoint: {instance.read_write_dns}")

Common Patterns

Generate OAuth Token

from databricks.sdk import WorkspaceClient
import uuid

w = WorkspaceClient()

# Generate OAuth token for database connection
cred = w.database.generate_database_credential(
    request_id=str(uuid.uuid4()),
    instance_names=["my-lakebase-instance"]
)
token = cred.token  # Use this as password in connection string

Connect from Notebook

import psycopg
from databricks.sdk import WorkspaceClient
import uuid

# Get instance details
w = WorkspaceClient()
instance = w.database.get_database_instance(name="my-lakebase-instance")

# Generate token
cred = w.database.generate_database_credential(
    request_id=str(uuid.uuid4()),
    instance_names=["my-lakebase-instance"]
)

# Connect using psycopg3
conn_string = f"host={instance.read_write_dns} dbname=postgres user={w.current_user.me().user_name} password={cred.token} sslmode=require"
with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone())

SQLAlchemy with Token Refresh (Production)

OAuth tokens expire after 1 hour, so long-running applications must refresh them:

import asyncio
import uuid

from databricks.sdk import WorkspaceClient
from sqlalchemy import event
from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine

# Token refresh state
_current_token = None
_token_refresh_task = None
TOKEN_REFRESH_INTERVAL = 50 * 60  # 50 minutes (refresh before the 1-hour expiry)

def _generate_token(instance_name: str) -> str:
    """Generate fresh OAuth token."""
    w = WorkspaceClient()
    cred = w.database.generate_database_credential(
        request_id=str(uuid.uuid4()),
        instance_names=[instance_name]
    )
    return cred.token

async def _token_refresh_loop(instance_name: str):
    """Background task to refresh token every 50 minutes."""
    global _current_token
    while True:
        await asyncio.sleep(TOKEN_REFRESH_INTERVAL)
        _current_token = await asyncio.to_thread(_generate_token, instance_name)

def init_database(instance_name: str, database_name: str, username: str) -> AsyncEngine:
    """Initialize database with OAuth token injection."""
    global _current_token, _token_refresh_task
    
    w = WorkspaceClient()
    instance = w.database.get_database_instance(name=instance_name)
    
    # Generate the initial token, then keep it fresh in the background
    # (call init_database from async code so an event loop is running)
    _current_token = _generate_token(instance_name)
    _token_refresh_task = asyncio.get_running_loop().create_task(
        _token_refresh_loop(instance_name)
    )
    
    # Build URL (password injected via do_connect)
    url = f"postgresql+psycopg://{username}@{instance.read_write_dns}:5432/{database_name}"
    
    engine = create_async_engine(
        url,
        pool_size=5,
        max_overflow=10,
        pool_recycle=3600,
        connect_args={"sslmode": "require"}
    )
    
    # Inject token on each connection
    @event.listens_for(engine.sync_engine, "do_connect")
    def provide_token(dialect, conn_rec, cargs, cparams):
        cparams["password"] = _current_token
    
    return engine
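
A minimal usage sketch (the instance name, database, and username below are placeholders for your own values):

import asyncio
from sqlalchemy import text

async def main():
    # init_database starts the token refresh task on the running loop
    engine = init_database("my-lakebase-instance", "postgres", "user@example.com")
    async with engine.connect() as conn:
        result = await conn.execute(text("SELECT 1"))
        print(result.scalar())
    await engine.dispose()

asyncio.run(main())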

Databricks Apps Integration

For Databricks Apps, use environment variables for configuration:

# Environment variables set by Databricks Apps:
# - LAKEBASE_PG_URL: Full PostgreSQL connection URL (alternative to the settings below)
# - LAKEBASE_INSTANCE_NAME: Instance name
# - LAKEBASE_DATABASE_NAME: Database name
# - LAKEBASE_USERNAME: Username (optional, defaults to service principal)

import os

def is_lakebase_configured() -> bool:
    """Check if Lakebase is configured for this app."""
    return bool(
        os.environ.get("LAKEBASE_PG_URL") or
        (os.environ.get("LAKEBASE_INSTANCE_NAME") and 
         os.environ.get("LAKEBASE_DATABASE_NAME"))
    )
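
A sketch of opening a connection from these variables, reusing the OAuth-token pattern shown earlier (the fallback to the current principal's username is an assumption based on the variable description above):

import os
import uuid

import psycopg
from databricks.sdk import WorkspaceClient

def connect_from_app_env():
    """Connect to Lakebase using the env vars Databricks Apps sets (sketch)."""
    w = WorkspaceClient()
    instance_name = os.environ["LAKEBASE_INSTANCE_NAME"]
    instance = w.database.get_database_instance(name=instance_name)
    cred = w.database.generate_database_credential(
        request_id=str(uuid.uuid4()),
        instance_names=[instance_name]
    )
    # LAKEBASE_USERNAME is optional; assume the current principal otherwise
    user = os.environ.get("LAKEBASE_USERNAME") or w.current_user.me().user_name
    return psycopg.connect(
        host=instance.read_write_dns,
        dbname=os.environ["LAKEBASE_DATABASE_NAME"],
        user=user,
        password=cred.token,
        sslmode="require",
    )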

Add Lakebase as an app resource via CLI:

databricks apps add-resource $APP_NAME \
    --resource-type database \
    --resource-name lakebase \
    --database-instance my-lakebase-instance

Register with Unity Catalog

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Register database in Unity Catalog
w.database.register_database_instance(
    name="my-lakebase-instance",
    catalog="my_catalog",
    schema="my_schema"
)

MLflow Model Resources

Declare Lakebase as a model resource for automatic credential provisioning:

import mlflow
from mlflow.models.resources import DatabricksLakebase

resources = [
    DatabricksLakebase(database_instance_name="my-lakebase-instance"),
]

# When logging model
mlflow.langchain.log_model(
    model,
    artifact_path="model",
    resources=resources,
    pip_requirements=["databricks-langchain[memory]"]
)

MCP Tools

The following MCP tools are available for managing Lakebase infrastructure. Use type="provisioned" for Lakebase Provisioned.

Database Management

Tool | Description
create_or_update_lakebase_database | Create or update a database. Finds by name, creates if new, updates if existing. Use type="provisioned", capacity (CU_1-CU_8), stopped params.
get_lakebase_database | Get database details or list all. Pass name to get one, omit to list all. Use type="provisioned" to filter.
delete_lakebase_database | Delete a database and its resources. Use type="provisioned", force=True to cascade.
generate_lakebase_credential | Generate OAuth token for PostgreSQL connections (1-hour expiry). Pass instance_names for provisioned.
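
As an illustrative sketch only, a create-or-update call's arguments (parameter names come from the table above; the exact invocation depends on your MCP client):

create_db_args = {
    "name": "my-lakebase-instance",
    "type": "provisioned",
    "capacity": "CU_1",   # CU_1 through CU_8
    "stopped": False,
}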

Reverse ETL (Catalog + Synced Tables)

Tool | Description
create_or_update_lakebase_sync | Set up reverse ETL: ensures UC catalog registration exists, then creates a synced table from Delta to Lakebase. Params: instance_name, source_table_name, target_table_name, scheduling_policy ("TRIGGERED"/"SNAPSHOT"/"CONTINUOUS").
delete_lakebase_sync | Remove a synced table and optionally its UC catalog registration.
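
Similarly, a hedged sketch of create_or_update_lakebase_sync arguments (the table names are placeholders; the source uses a Unity Catalog three-level name):

sync_args = {
    "instance_name": "my-lakebase-instance",
    "source_table_name": "main.sales.orders",   # Delta source table in UC
    "target_table_name": "public.orders",       # target table in Lakebase
    "scheduling_policy": "TRIGGERED",           # or "SNAPSHOT" / "CONTINUOUS"
}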

Reference Files

  • connection-patterns.md - Detailed connection patterns for different use cases
  • reverse-etl.md - Syncing data from Delta Lake to Lakebase

CLI Quick Reference

# Create instance
databricks database create-database-instance \
    --name my-lakebase-instance \
    --capacity CU_1

# Get instance details
databricks database get-database-instance --name my-lakebase-instance

# Generate credentials
databricks database generate-database-credential \
    --request-id $(uuidgen) \
    --json '{"instance_names": ["my-lakebase-instance"]}'

# List instances
databricks database list-database-instances

# Stop instance (saves cost)
databricks database stop-database-instance --name my-lakebase-instance

# Start instance
databricks database start-database-instance --name my-lakebase-instance

Common Issues

Issue | Solution
Token expired during long query | Implement a token refresh loop (see SQLAlchemy with Token Refresh); tokens expire after 1 hour
DNS resolution fails on macOS | Resolve the hostname with dig and pass hostaddr to psycopg (see the sketch after this table)
Connection refused | Ensure the instance is not stopped; check instance.state
Permission denied | The user must be granted access to the Lakebase instance
SSL required error | Always use sslmode=require in the connection string
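
A sketch of the macOS DNS workaround (assumes dig is on PATH and uses psycopg 3's hostaddr parameter; instance, w, and cred come from the earlier connection examples):

import subprocess
import psycopg

host = instance.read_write_dns  # from get_database_instance (see Connect from Notebook)
# Resolve the hostname out-of-band with dig, then hand psycopg the raw IP
ip = subprocess.run(
    ["dig", "+short", host], capture_output=True, text=True
).stdout.strip().splitlines()[-1]  # last line is the resolved address
conn = psycopg.connect(
    host=host,            # still used for TLS and logging
    hostaddr=ip,          # skips the failing DNS lookup
    dbname="postgres",
    user=w.current_user.me().user_name,
    password=cred.token,  # from generate_database_credential
    sslmode="require",
)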

SDK Version Requirements

  • Databricks SDK for Python: >= 0.61.0 (0.81.0+ recommended for full API support)
  • psycopg: 3.x (supports hostaddr parameter for DNS workaround)
  • SQLAlchemy: 2.x with postgresql+psycopg driver

%pip install -U "databricks-sdk>=0.81.0" "psycopg[binary]>=3.0" sqlalchemy

Notes

  • Capacity values use compute unit sizing: CU_1, CU_2, CU_4, CU_8.
  • Lakebase Autoscaling is a newer offering with automatic scaling but limited regional availability. This skill focuses on Lakebase Provisioned, which is more widely available.
  • For memory/state in LangChain agents, use databricks-langchain[memory] which includes Lakebase support.
  • Tokens are short-lived (1 hour); production apps MUST implement token refresh.
