A Pulumi package for creating and managing Amazon Web Services (AWS) cloud resources with infrastructure-as-code.
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy to prepare and load data for analytics.
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
```

Data catalog database for metadata storage.

```typescript
const database = new aws.glue.CatalogDatabase("analytics-db", {
    name: "analytics_database",
    description: "Analytics data catalog",
});
```

Define table schema in the data catalog.
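The table's `location` references a `bucket` variable that is not defined in this snippet; it is assumed to be an S3 bucket created elsewhere in the program, for example:

```typescript
import * as aws from "@pulumi/aws";

// S3 bucket backing the external table (name is illustrative)
const bucket = new aws.s3.Bucket("analytics-data");
```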
```typescript
const table = new aws.glue.CatalogTable("events-table", {
    name: "events",
    databaseName: database.name,
    tableType: "EXTERNAL_TABLE",
    storageDescriptor: {
        location: pulumi.interpolate`s3://${bucket.id}/events/`,
        inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        serDeInfo: {
            serializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            parameters: {
                "field.delim": ",",
            },
        },
        columns: [
            { name: "event_id", type: "string" },
            { name: "user_id", type: "string" },
            { name: "event_type", type: "string" },
            { name: "timestamp", type: "timestamp" },
        ],
    },
    // Partition columns (a partition index can only reference declared partition keys)
    partitionKeys: [
        { name: "year", type: "string" },
        { name: "month", type: "string" },
        { name: "day", type: "string" },
    ],
});
```

Automatically discover and catalog data.
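The crawler and jobs below reference IAM roles (`crawlerRole`, `glueRole`) that are not defined in this snippet. A minimal sketch of such a role, assuming the AWS-managed `AWSGlueServiceRole` policy is sufficient (S3 access to the crawled buckets must still be granted separately):

```typescript
import * as aws from "@pulumi/aws";

// Role that AWS Glue assumes when running the crawler (name is illustrative)
const crawlerRole = new aws.iam.Role("crawler-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "glue.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Attach the AWS-managed Glue service policy
new aws.iam.RolePolicyAttachment("crawler-glue-policy", {
    role: crawlerRole.name,
    policyArn: "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
});
```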
```typescript
const crawler = new aws.glue.Crawler("data-crawler", {
    name: "s3-data-crawler",
    role: crawlerRole.arn,
    databaseName: database.name,
    s3Targets: [{
        path: pulumi.interpolate`s3://${bucket.id}/data/`,
    }],
    schedule: "cron(0 1 * * ? *)", // Daily at 1 AM
});
```

ETL job for data transformation.
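The job's `scriptLocation` assumes the transformation script already exists in S3; one way to upload it from the same program (bucket and file paths are illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Bucket that holds ETL scripts (name is illustrative)
const scriptBucket = new aws.s3.Bucket("glue-scripts");

// Upload the local script to the key the job expects
new aws.s3.BucketObject("transform-script", {
    bucket: scriptBucket.id,
    key: "scripts/transform.py",
    source: new pulumi.asset.FileAsset("./glue/transform.py"),
});
```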
```typescript
const etlJob = new aws.glue.Job("etl-job", {
    name: "data-transformation",
    roleArn: glueRole.arn,
    command: {
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
    },
    defaultArguments: {
        "--job-language": "python",
        "--enable-metrics": "true",
        "--enable-continuous-cloudwatch-log": "true",
    },
    maxRetries: 1,
    timeout: 60, // minutes
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
});
```

A PySpark variant with job bookmarks, the Spark UI, and concurrent runs enabled.

```typescript
const pysparkJob = new aws.glue.Job("pyspark-etl", {
    name: "pyspark-transformation",
    roleArn: glueRole.arn,
    command: {
        name: "glueetl",
        scriptLocation: pulumi.interpolate`s3://${scriptBucket.id}/scripts/transform.py`,
        pythonVersion: "3",
    },
    defaultArguments: {
        "--TempDir": pulumi.interpolate`s3://${tempBucket.id}/temp/`,
        "--enable-metrics": "true",
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": pulumi.interpolate`s3://${logBucket.id}/spark-logs/`,
        "--enable-job-insights": "true",
        "--enable-glue-datacatalog": "true",
        "--job-bookmark-option": "job-bookmark-enable",
    },
    glueVersion: "4.0",
    numberOfWorkers: 10,
    workerType: "G.1X",
    executionProperty: {
        maxConcurrentRuns: 2,
    },
});
```

Orchestrate multiple ETL jobs.
```typescript
const workflow = new aws.glue.Workflow("data-pipeline", {
    name: "data-processing-pipeline",
    description: "Complete data processing workflow",
});

const crawlerTrigger = new aws.glue.Trigger("crawler-trigger", {
    name: "start-crawler",
    type: "SCHEDULED",
    schedule: "cron(0 2 * * ? *)",
    workflowName: workflow.name,
    actions: [{
        crawlerName: crawler.name,
    }],
});

const jobTrigger = new aws.glue.Trigger("job-trigger", {
    name: "start-etl",
    type: "CONDITIONAL",
    workflowName: workflow.name,
    predicate: {
        conditions: [{
            crawlerName: crawler.name,
            crawlState: "SUCCEEDED",
        }],
    },
    actions: [{
        jobName: etlJob.name,
    }],
});
```

Improve query performance with partition indexes.
```typescript
const partitionIndex = new aws.glue.PartitionIndex("date-index", {
    databaseName: database.name,
    tableName: table.name,
    partitionIndex: {
        indexName: "date_idx",
        keys: ["year", "month", "day"],
    },
});
```

Key properties:

**CatalogDatabase**
- `name` - Database name
- `description` - Database description
- `locationUri` - Database location URI

**CatalogTable**
- `name` - Table name
- `databaseName` - Parent database
- `storageDescriptor` - Storage and schema information
- `partitionKeys` - Partition columns

**Crawler**
- `name` - Crawler name
- `role` - IAM role ARN
- `databaseName` - Target database
- `s3Targets` - S3 paths to crawl
- `schedule` - Cron schedule

**Job**
- `name` - Job name
- `roleArn` - IAM role ARN
- `command` - Job command configuration
- `glueVersion` - Glue version
- `numberOfWorkers` - Number of workers
- `workerType` - Worker type (Standard, G.1X, G.2X)

Common outputs:
- `id` - Resource identifier
- `arn` - Resource ARN
- `name` - Resource name

Install with the Tessl CLI:
```shell
npx tessl i tessl/npm-pulumi--aws
```