A Pulumi package for creating and managing Amazon Web Services (AWS) cloud resources with infrastructure-as-code.
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy to prepare and load data for analytics.
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
```

Data catalog database for metadata storage.

```typescript
const database = new aws.glue.CatalogDatabase("analytics-db", {
    name: "analytics_database",
    description: "Analytics data catalog",
});
```

Define table schema in the data catalog.
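The table's `location` references a `bucket` variable that is not defined in this snippet; it is assumed to be an S3 bucket created elsewhere in the program, for example:

```typescript
import * as aws from "@pulumi/aws";

// S3 bucket backing the external table (name is illustrative)
const bucket = new aws.s3.Bucket("analytics-data");
```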
```typescript
const table = new aws.glue.CatalogTable("events-table", {
    name: "events",
    databaseName: database.name,
    tableType: "EXTERNAL_TABLE",
    storageDescriptor: {
        location: pulumi.interpolate`s3://${bucket.id}/events/`,
        inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        serDeInfo: {
            serializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            parameters: {
                "field.delim": ",",
            },
        },
        columns: [
            { name: "event_id", type: "string" },
            { name: "user_id", type: "string" },
            { name: "event_type", type: "string" },
            { name: "timestamp", type: "timestamp" },
        ],
    },
    // Partition columns (a partition index can only reference declared partition keys)
    partitionKeys: [
        { name: "year", type: "string" },
        { name: "month", type: "string" },
        { name: "day", type: "string" },
    ],
});
```

Automatically discover and catalog data.
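The crawler and jobs below reference IAM roles (`crawlerRole`, `glueRole`) that are not defined in this snippet. A minimal sketch of such a role, assuming the AWS-managed `AWSGlueServiceRole` policy is sufficient (S3 access to the crawled buckets must still be granted separately):

```typescript
import * as aws from "@pulumi/aws";

// Role that AWS Glue assumes when running the crawler (name is illustrative)
const crawlerRole = new aws.iam.Role("crawler-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "glue.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Attach the AWS-managed Glue service policy
new aws.iam.RolePolicyAttachment("crawler-glue-policy", {
    role: crawlerRole.name,
    policyArn: "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
});
```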
```typescript
const crawler = new aws.glue.Crawler("data-crawler", {
    name: "s3-data-crawler",
    role: crawlerRole.arn,
    databaseName: database.name,
    s3Targets: [{
        path: pulumi.interpolate`s3://${bucket.id}/data/`,
    }],
    schedule: "cron(0 1 * * ? *)", // Daily at 1 AM
});
```

ETL job for data transformation.
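The job's `scriptLocation` assumes the transformation script already exists in S3; one way to upload it from the same program (bucket and file paths are illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Bucket that holds ETL scripts (name is illustrative)
const scriptBucket = new aws.s3.Bucket("glue-scripts");

// Upload the local script to the key the job expects
new aws.s3.BucketObject("transform-script", {
    bucket: scriptBucket.id,
    key: "scripts/transform.py",
    source: new pulumi.asset.FileAsset("./glue/transform.py"),
});
```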
```typescript
const etlJob = new aws.glue.Job("etl-job", {
    name: "data-transformation",
    roleArn: glueRole.arn,
    command: {
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
    },
    defaultArguments: {
        "--job-language": "python",
        "--enable-metrics": "true",
        "--enable-continuous-cloudwatch-log": "true",
    },
    maxRetries: 1,
    timeout: 60, // minutes
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
});
```

A PySpark variant with job bookmarks, the Spark UI, and concurrent runs enabled.

```typescript
const pysparkJob = new aws.glue.Job("pyspark-etl", {
    name: "pyspark-transformation",
    roleArn: glueRole.arn,
    command: {
        name: "glueetl",
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
    },
    defaultArguments: {
        "--TempDir": pulumi.interpolate`s3://${tempBucket.id}/temp/`,
        "--enable-metrics": "true",
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": pulumi.interpolate`s3://${logBucket.id}/spark-logs/`,
        "--enable-job-insights": "true",
        "--enable-glue-datacatalog": "true",
        "--job-bookmark-option": "job-bookmark-enable",
    },
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
    executionProperty: {
        maxConcurrentRuns: 2,
    },
});
```

Orchestrate multiple ETL jobs.
```typescript
const workflow = new aws.glue.Workflow("data-pipeline", {
    name: "data-processing-pipeline",
    description: "Complete data processing workflow",
});

const crawlerTrigger = new aws.glue.Trigger("crawler-trigger", {
    name: "start-crawler",
    type: "SCHEDULED",
    schedule: "cron(0 2 * * ? *)",
    workflowName: workflow.name,
    actions: [{
        crawlerName: crawler.name,
    }],
});

const jobTrigger = new aws.glue.Trigger("job-trigger", {
    name: "start-etl",
    type: "CONDITIONAL",
    workflowName: workflow.name,
    predicate: {
        conditions: [{
            crawlerName: crawler.name,
            crawlState: "SUCCEEDED",
        }],
    },
    actions: [{
        jobName: etlJob.name,
    }],
});
```

Improve query performance with partition indexes.
```typescript
const partitionIndex = new aws.glue.PartitionIndex("date-index", {
    databaseName: database.name,
    tableName: table.name,
    partitionIndex: {
        indexName: "date_idx",
        keys: ["year", "month", "day"],
    },
});
```

Key properties:

**CatalogDatabase**
- `name` - Database name
- `description` - Database description
- `locationUri` - Database location URI

**CatalogTable**
- `name` - Table name
- `databaseName` - Parent database
- `storageDescriptor` - Storage and schema information
- `partitionKeys` - Partition columns

**Crawler**
- `name` - Crawler name
- `role` - IAM role ARN
- `databaseName` - Target database
- `s3Targets` - S3 paths to crawl
- `schedule` - Cron schedule

**Job**
- `name` - Job name
- `roleArn` - IAM role ARN
- `command` - Job command configuration
- `glueVersion` - Glue version
- `numberOfWorkers` - Number of workers
- `workerType` - Worker type (Standard, G.1X, G.2X)

Common outputs:
- `id` - Resource identifier
- `arn` - Resource ARN
- `name` - Resource name

Install with the Tessl CLI:
```shell
npx tessl i tessl/npm-pulumi--aws
```