AWS Glue for ETL (extract, transform and load) processes

Alejandro Cora González
5 min read · Jan 29, 2020

Building serverless ETL pipelines using AWS Glue.

(Image taken from: https://aws.amazon.com/glue/)

AWS Glue automatically discovers and categorizes your data, making it immediately searchable and queryable using Amazon Athena, Amazon Redshift, or Amazon EMR.

In this example project you’ll learn how to use AWS Glue to transform data stored in S3 buckets and query it using Athena.

Create required resources

We will use Terraform to create the IAM role with the required permissions. Of course, you could create all these resources manually, but if you feel comfortable with Terraform you can use the following definitions.

// resources.tf
provider "aws" {
  region = "us-east-1"
}

resource "aws_iam_role" "aws_iam_glue_role" {
  name = "AWSGlueServiceRoleDefault"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "glue_service_attachment" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  role       = aws_iam_role.aws_iam_glue_role.id
}

resource "aws_iam_role_policy" "s3_policy" {
  name = "s3_policy"
  role = aws_iam_role.aws_iam_glue_role.id

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::${var.bucket_for_glue}",
        "arn:aws:s3:::${var.bucket_for_glue}/*"
      ]
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "glue_service_s3" {
  name   = "glue_service_s3"
  role   = aws_iam_role.aws_iam_glue_role.id
  policy = aws_iam_role_policy.s3_policy.policy
}

variable "bucket_for_glue" {
  description = "Bucket for AWS Glue..."
  default     = "aws-glue-etl-process"
}

resource "aws_s3_bucket" "bucket_for_glue" {
  bucket        = var.bucket_for_glue
  force_destroy = true
}

We also need an S3 bucket to hold the files to process, the output results, and some other files managed by Glue; the aws_s3_bucket resource above creates it.

Remember to run:

$ terraform init
$ terraform apply

To copy some example logs (nginx_logs.txt in the project folder), execute:

$ aws s3 cp ./nginx_logs.txt s3://{bucket}/data/raw/nginx_logs.txt

First, we need to create a database in the AWS Glue Data Catalog.

Then, you must create a “Crawler” to populate the AWS Glue Data Catalog with tables. A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog.

Once the crawler is created, you need to run it.
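Both steps can be done from the console, but if you prefer to script them, a minimal boto3 sketch might look like the following (the database and crawler names are placeholders chosen for illustration; the bucket matches the Terraform default above):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the database in the Glue Data Catalog (name is a placeholder)
glue.create_database(DatabaseInput={"Name": "nginx_logs"})

# Create a crawler pointing at the raw logs prefix of the bucket
glue.create_crawler(
    Name="nginx-logs-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="nginx_logs",
    Targets={"S3Targets": [{"Path": "s3://aws-glue-etl-process/data/raw/"}]},
)

# Run the crawler to populate the Data Catalog with the table schema
glue.start_crawler(Name="nginx-logs-crawler")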

After that, the Data Catalog is updated and you can see the table schema and other information.

Amazon Athena is a serverless service that makes it really easy to analyze data in Amazon S3 using standard SQL.
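If you want to run a quick sanity-check query programmatically rather than from the Athena console, a small boto3 sketch could look like this (the database, table, and results prefix are assumptions for illustration; use whatever names your crawler created):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Ask Athena to run a simple query over the crawled table
athena.start_query_execution(
    QueryString="SELECT * FROM raw LIMIT 10",
    QueryExecutionContext={"Database": "nginx_logs"},
    ResultConfiguration={"OutputLocation": "s3://aws-glue-etl-process/athena-results/"},
)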

At this point we are ready to perform an ETL job over the data, in this case the nginx logs we copied earlier. The purpose of the next step is to extract only some fields and save them in Parquet format.
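Glue job scripts are plain PySpark with Glue’s DynamicFrame helpers. A rough sketch of what such a script could look like, assuming hypothetical column names produced by the crawler and an output prefix of data/processed/, is:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created (database/table names are placeholders)
logs = glue_context.create_dynamic_frame.from_catalog(
    database="nginx_logs", table_name="raw"
)

# Keep only a few fields, casting the status code to an integer on the way
mapped = ApplyMapping.apply(
    frame=logs,
    mappings=[
        ("remote_addr", "string", "remote_addr", "string"),
        ("time_local", "string", "time_local", "string"),
        ("request", "string", "request", "string"),
        ("status", "string", "status", "int"),
    ],
)

# Write the result back to S3 in Parquet format
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-etl-process/data/processed/"},
    format="parquet",
)

job.commit()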

When the job is created, execute it and wait a while; this process could take a few minutes.

Once the process finishes, you should be able to see the output data in your S3 bucket.
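A quick way to verify without opening the console is to list the (assumed) output prefix with boto3:

import boto3

s3 = boto3.client("s3")

# List whatever the job wrote under the output prefix used in the job script
response = s3.list_objects_v2(Bucket="aws-glue-etl-process", Prefix="data/processed/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])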

Run the crawler once again and you will see your Data Catalog updated with the new table.

Using Custom Classifiers

You might need to define a custom classifier if your data doesn’t match any built-in classifiers, or if you want to customize the tables that are created by the crawler.

For this example we’ll use a custom file with very simple information:

"Ipad mini",EU,50.0,2
"Lenovo Ideapad",AU,350.0,1
"Huawei Y9 2019",UE,120.0,2
"MSI",LATAM,500.0,6
"Samsung 27-VA",CA,50.0,3

Copy the file to your S3 bucket using:

$ aws s3 cp ./custom_data.csv s3://{bucket}/sales/custom_data.csv

This time we can use Terraform (it is not required; you could do it manually) to create a new database, a custom classifier, and a new crawler.

resource "aws_glue_catalog_database" "sales_database" {
name = "sales"
}

resource "aws_glue_classifier" "aws_glue_csv_classifier" {
name = "csv-classifier"

csv_classifier {
header = ["PRODUCT_NAME", "COUNTRY", "PRICE", "QUANTITY"]
contains_header = "ABSENT"
quote_symbol = "\""
delimiter = ","
}
}

resource "aws_glue_crawler" "aws_glue_custom_csv_crawler" {
name = "custom-csv-crawler"
database_name = aws_glue_catalog_database.sales_database.name
classifiers = [aws_glue_classifier.aws_glue_csv_classifier.id]
role = aws_iam_role.aws_iam_glue_role.arn

s3_target {
path = "s3://${aws_s3_bucket.bucket_for_glue.bucket}/sales/"
}
}

Then, create a new job to process the sales data and convert it to the Parquet file format.
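The script has the same shape as the earlier job; only the catalog source, the mapping, and the output prefix change. A brief, hypothetical excerpt (the table name and output path depend on your crawler and bucket layout):

# Same boilerplate as the previous job; only source, mapping and target differ.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="sales"
)

mapped = ApplyMapping.apply(
    frame=sales,
    mappings=[
        ("product_name", "string", "product_name", "string"),
        ("country", "string", "country", "string"),
        ("price", "double", "price", "double"),
        ("quantity", "long", "quantity", "long"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-etl-process/sales-parquet/"},
    format="parquet",
)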

You can check your S3 bucket and use Athena to query your data.

That’s it. See you in the next adventure!!! Bye…

Don’t forget to delete all your resources (remember that the resources created manually must be destroyed in the same way)!!!

$ terraform destroy

Code repository: https://github.com/alekglez/aws-glue-etl-pipeline


Alejandro Cora González

Solutions Architect | Python Developer | Serverless Advocate