tessl/npm-pulumi--aws

A Pulumi package for creating and managing Amazon Web Services (AWS) cloud resources with infrastructure-as-code.


AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy to prepare and load data for analytics.

Package

import * as pulumi from "@pulumi/pulumi"; // required for pulumi.interpolate in the examples below
import * as aws from "@pulumi/aws";
import * as glue from "@pulumi/aws/glue";

Key Resources

Glue Catalog Database

Data catalog database for metadata storage.

const database = new aws.glue.CatalogDatabase("analytics-db", {
    name: "analytics_database",
    description: "Analytics data catalog",
});

Glue Catalog Table

Define a table schema in the Data Catalog. The example assumes an S3 bucket resource named bucket is defined elsewhere.

const table = new aws.glue.CatalogTable("events-table", {
    name: "events",
    databaseName: database.name,
    tableType: "EXTERNAL_TABLE",
    storageDescriptor: {
        location: pulumi.interpolate`s3://${bucket.id}/events/`,
        inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        serDeInfo: {
            serializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            parameters: {
                "field.delim": ",",
            },
        },
        columns: [
            { name: "event_id", type: "string" },
            { name: "user_id", type: "string" },
            { name: "event_type", type: "string" },
            { name: "timestamp", type: "timestamp" },
        ],
    },
});

Glue Crawler

Automatically discover and catalog data in S3. The example assumes an IAM role resource named crawlerRole that the Glue service can assume.

const crawler = new aws.glue.Crawler("data-crawler", {
    name: "s3-data-crawler",
    role: crawlerRole.arn,
    databaseName: database.name,
    s3Targets: [{
        path: pulumi.interpolate`s3://${bucket.id}/data/`,
    }],
    schedule: "cron(0 1 * * ? *)", // Daily at 1:00 AM UTC
});
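
The crawler and job examples reference crawlerRole and glueRole, which are not defined above. A minimal sketch of such a role, assuming the AWS-managed AWSGlueServiceRole policy covers your data sources (resource names here are illustrative):

```typescript
import * as aws from "@pulumi/aws";

// IAM role that the Glue service can assume (illustrative name)
const crawlerRole = new aws.iam.Role("glue-crawler-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "glue.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Attach the AWS-managed Glue service policy. For real workloads,
// also grant read/write access to the specific S3 buckets involved.
new aws.iam.RolePolicyAttachment("glue-service-policy", {
    role: crawlerRole.name,
    policyArn: "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
});
```

The same pattern works for the job's glueRole; only the attached S3 permissions differ.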

Glue Job

ETL job for data transformation. The example assumes a glueRole IAM role and a scriptBucket S3 bucket holding the job script.

const etlJob = new aws.glue.Job("etl-job", {
    name: "data-transformation",
    roleArn: glueRole.arn,
    command: {
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
    },
    defaultArguments: {
        "--job-language": "python",
        "--enable-metrics": "true",
        "--enable-continuous-cloudwatch-log": "true",
    },
    maxRetries: 1,
    timeout: 60,
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
});

Common Patterns

Glue ETL with PySpark

const pysparkJob = new aws.glue.Job("pyspark-etl", {
    name: "pyspark-transformation",
    roleArn: glueRole.arn,
    command: {
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
        name: "glueetl",
    },
    defaultArguments: {
        "--TempDir": pulumi.interpolate`s3://${tempBucket.id}/temp/`,
        "--enable-metrics": "true",
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": pulumi.interpolate`s3://${logBucket.id}/spark-logs/`,
        "--enable-job-insights": "true",
        "--enable-glue-datacatalog": "true",
        "--job-bookmark-option": "job-bookmark-enable",
    },
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
    executionProperty: {
        maxConcurrentRuns: 2,
    },
});

Glue Workflow

Orchestrate multiple ETL jobs.

const workflow = new aws.glue.Workflow("data-pipeline", {
    name: "data-processing-pipeline",
    description: "Complete data processing workflow",
});

const crawlerTrigger = new aws.glue.Trigger("crawler-trigger", {
    name: "start-crawler",
    type: "SCHEDULED",
    schedule: "cron(0 2 * * ? *)",
    workflowName: workflow.name,
    actions: [{
        crawlerName: crawler.name,
    }],
});

const jobTrigger = new aws.glue.Trigger("job-trigger", {
    name: "start-etl",
    type: "CONDITIONAL",
    workflowName: workflow.name,
    predicate: {
        conditions: [{
            crawlerName: crawler.name,
            crawlState: "SUCCEEDED",
        }],
    },
    actions: [{
        jobName: etlJob.name,
    }],
});

Partition Indexes

Improve query performance with partition indexes. The indexed keys must be declared as partitionKeys on the target table, so the events table above would need year, month, and day partition columns for this index to apply.

const partitionIndex = new aws.glue.PartitionIndex("date-index", {
    databaseName: database.name,
    tableName: table.name,
    partitionIndex: {
        indexName: "date_idx",
        keys: ["year", "month", "day"],
    },
});

Key Properties

Database Properties

  • name - Database name
  • description - Database description
  • locationUri - Database location URI
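
A brief sketch using locationUri to set the database's default storage location (the bucket path is illustrative):

```typescript
import * as aws from "@pulumi/aws";

// Database whose default storage location is an S3 prefix
const locatedDb = new aws.glue.CatalogDatabase("located-db", {
    name: "located_db",
    locationUri: "s3://my-data-lake/databases/located_db/", // illustrative path
});
```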

Table Properties

  • name - Table name
  • databaseName - Parent database
  • storageDescriptor - Storage and schema information
  • partitionKeys - Partition columns
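
As a sketch of partitionKeys, a partitioned variant of the events table might look like the following (bucket and column names are illustrative):

```typescript
import * as aws from "@pulumi/aws";

// Partition columns are declared separately from the regular
// columns listed in storageDescriptor.
const partitionedTable = new aws.glue.CatalogTable("events-partitioned", {
    name: "events_partitioned",
    databaseName: "analytics_database", // assumes the database created above
    tableType: "EXTERNAL_TABLE",
    partitionKeys: [
        { name: "year", type: "string" },
        { name: "month", type: "string" },
        { name: "day", type: "string" },
    ],
    storageDescriptor: {
        location: "s3://my-bucket/events/", // illustrative bucket
        columns: [
            { name: "event_id", type: "string" },
            { name: "event_type", type: "string" },
        ],
    },
});
```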

Crawler Properties

  • name - Crawler name
  • role - IAM role ARN
  • databaseName - Target database
  • s3Targets - S3 paths to crawl
  • schedule - Cron schedule

Job Properties

  • name - Job name
  • roleArn - IAM role ARN
  • command - Job command configuration
  • glueVersion - Glue version
  • numberOfWorkers - Number of workers
  • workerType - Worker type (Standard, G.1X, G.2X)

Output Properties

  • id - Resource identifier
  • arn - Resource ARN
  • name - Resource name
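
These outputs can be exported as Pulumi stack outputs, for example:

```typescript
import * as aws from "@pulumi/aws";

const database = new aws.glue.CatalogDatabase("export-example", {
    name: "export_example_db",
});

// Stack outputs, visible via `pulumi stack output`
export const databaseName = database.name;
export const databaseArn = database.arn;
```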

Use Cases

  • Data Lake ETL: Transform and catalog data lake content
  • Database Migration: Move and transform database data
  • Log Processing: Parse and structure log files
  • Data Quality: Clean and validate data
  • Schema Evolution: Track schema changes over time

Related Services

  • S3 - Data source and destination
  • Athena - Query cataloged data
  • Redshift - Data warehouse integration
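
As a sketch of the Athena integration, a saved query against the cataloged table might look like this (the database, table, and query are illustrative):

```typescript
import * as aws from "@pulumi/aws";

// A saved Athena query that reads the Glue-cataloged events table
const eventsQuery = new aws.athena.NamedQuery("recent-events", {
    database: "analytics_database",
    query: "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type;",
    description: "Event counts by type from the Glue catalog",
});
```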

Install with Tessl CLI

npx tessl i tessl/npm-pulumi--aws
