Building Serverless ETL Pipelines with AWS Glue
- 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Thurgood
Principal Solutions Architect
Building Serverless ETL Pipelines
With AWS Glue
- 2.
Can I get you
to go ahead
and…
- 3.
…prepare
our data for
analysis
- 4.
Generate Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
- 5.
Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
Generate
ERP
Connected
devices
Transactions
Social
media
Web logs /
cookies
- 6.
Generate Store
Extract
Transform
Load
Analyse
Visualise/
Report
Collect
Polling Application
Amazon
Kinesis Stream
Amazon
Kinesis Firehose
- 7.
Generate Store
Extract
Transform
Load
Analyse
Visualise/
Report
Collect
AWS Snowball
Amazon S3
AWS DMS
- 8.
Generate Collect
Extract
Transform
Load
Analyse
Visualise/
Report
Store
Amazon
RDS
Amazon S3
Database
on EC2
- 9.
Generate Collect Store
Extract
Transform
Load
Visualise/
Report
Analyse
Amazon Redshift &
Redshift Spectrum
Amazon EMR
Amazon Athena
Amazon Kinesis
Analytics
- 10.
Generate Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
Data
scientists
Business
users
Engagement
platforms
Automation/
events
Data
analysts
- 11.
Generate Collect Store Analyse
Visualise/
Report
Extract
Transform
Load
AWS
Lambda
Amazon
Kinesis Enabled
Amazon EMR
- 12.
No Problem…?
Deal with these Terabytes and Petabytes of data
Simplify querying disparate data sets
Combine existing / legacy data with modern data sets
Prepare data for machine learning
- 16.
Some extra challenges…
Volumes will grow (the new oil)
Adding data sources
Large proportion of ETL is hand coding
Data formats change over time
• Within data you already have
• Changes will be coming soon
Target schemas change
- 17.
Extract Transform
Load
Analyse
Visualise/
Report
Generate
Collect
Store
- 18.
And… ETL Is Not the Rewarding Part
Time
Value
ETL
Analyse and Consume
Generate
Collect
Store
- 19.
Generate Collect Store Analyse
Visualise/
Report
Extract
Transform
Load
AWS Glue
- 20.
Why AWS Glue?
- 21.
Automate your ETL
Automatically discover and categorise your data
• Connect to your data sources
• Generate your Data Catalog
Make it immediately searchable and queryable
• Athena
• Redshift
• EMR
- 22.
Automate your ETL
Generates your ETL code
• Clean
• Enrich
• Move
Adaptable code
Extension to Spark in Python or Scala
- 23.
Automate your ETL
Runs your ETL jobs serverless
• Managed
• Control the amount of resources used
• Scales out automatically
Schedule or trigger jobs
- 24.
Glue Customer Examples
- 25.
How do I ETL my data?
- 26.
Four Steps
Crawl
Map
Edit and Explore
Schedule
- 27.
How Do I Discover My Data?
- 28.
Glue Data Catalog: Crawlers
• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Apache Hive style partitions on Amazon S3
• Built-in classifiers for popular data types
• Custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless – only pay when crawler runs
- 29.
Crawlers: Classifiers
IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon
RDS
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Redshift
Avro
Parquet
ORC
JSON & BSON
Logs
(Apache, Linux, MS, Ruby, Redis, and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressed Formats
(ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom
Classifiers with Grok!
- 30.
Example Classifier
2018-03-18T01:44:19+00:00 [prefix-p-123-a-7z] WARN: There is a message
Grok expression example:
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
Built in patterns:
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
GREEDYDATA .*
Custom patterns
CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*
Handy Grok debugger:
https://grokdebug.herokuapp.com/
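To see what the classifier above extracts, the Grok patterns can be approximated with plain Python regular expressions. This is only a sketch: the regexes are hand-translated simplifications of the Grok built-ins, the names `LINE_RE` and `parse_line` are hypothetical, and the separator around the colon is matched loosely because the sample line and the expression differ in spacing.

```python
import re

# Simplified, hand-written stand-ins for the Grok patterns (assumptions,
# not the definitions Glue ships).
TIMESTAMP_ISO8601 = (
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:?\d{2}(?::?\d{2})?(?:Z|[+-]\d{2}:?\d{2})?"
)
MESSAGEPREFIX = r".*-.*-.*-.*-.*"
CRAWLERLOGLEVEL = r"BENCHMARK|ERROR|WARN|INFO|TRACE"

# Named groups play the role of the Grok field names.
LINE_RE = re.compile(
    rf"(?P<timestamp>{TIMESTAMP_ISO8601}) "
    rf"\[(?P<message_prefix>{MESSAGEPREFIX})\] "
    rf"(?P<loglevel>{CRAWLERLOGLEVEL})\s*:\s*"
    rf"(?P<message>.*)"
)


def parse_line(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


sample = "2018-03-18T01:44:19+00:00 [prefix-p-123-a-7z] WARN: There is a message"
parsed = parse_line(sample)
print(parsed)
```

Each named group becomes a column in the resulting table, which is the same role the `:name` suffix plays in a Grok expression.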
- 31.
Crawler: Detecting Partitions
[Figure: S3 bucket hierarchy mapped to a table definition]
S3 bucket hierarchy: month=Nov / date=10, date=15, … / file 1 … file N
Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
Similarity scores shown in the figure: sim=.99, sim=.95, sim=.93
Table definition (Column: Type): month: str, date: str, col 1: int, col 2: float
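The partition idea can be sketched in a few lines of Python. Folder names of the form `name=value` become partition columns. The similarity function is a stand-in (Jaccard overlap of column/type pairs), since the slide does not say which measure the crawler uses. The function names and the `logs/` prefix are hypothetical.

```python
def parse_partitions(key):
    """Split an S3 object key into (partition columns, file name)."""
    *folders, filename = key.split("/")
    partitions = {}
    for folder in folders:
        # Hive-style partition folders look like "month=Nov".
        if "=" in folder:
            name, value = folder.split("=", 1)
            partitions[name] = value
    return partitions, filename


def schema_similarity(schema_a, schema_b):
    """Jaccard overlap of (column, type) pairs; an assumed stand-in metric."""
    a, b = set(schema_a.items()), set(schema_b.items())
    return len(a & b) / len(a | b) if a | b else 1.0


keys = [
    "logs/month=Nov/date=10/file1.json",
    "logs/month=Nov/date=15/file1.json",
]
for key in keys:
    print(parse_partitions(key))

print(schema_similarity({"col1": "int", "col2": "float"},
                        {"col1": "int", "col2": "string"}))
```

Files whose schemas score close to 1 can share one table definition, with the folder names supplying the extra `month` and `date` columns.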
- 32.
Glue Data Catalog
- 33.
Glue Data Catalog: Table Properties
Table schema
Table properties
Data statistics
Nested fields
- 34.
Glue Data Catalog: Version control
List of table versions
Compare schema versions
- 35.
How Do I Build The ETL?
- 36.
Job Authoring: Automatic Code Generation
- 37.
Job Authoring: Automatic Code Generation
- 38.
Job Authoring: ETL Code
Human-readable, editable, and portable PySpark code
- 39.
Job Authoring: Glue Dynamic Frames
[Figure: dynamic frame schema; labels: A, B1, B2, C, D [ ], X, Y]
Like Apache Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured
data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling
transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent changing
data types with choices, e.g. integer or string
• Automatically mark and separate error records
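The "schema on the fly" and "choice" ideas can be illustrated without Spark. The sketch below is not Glue's implementation. It is a plain-Python illustration (the name `infer_schema` is hypothetical) that builds a schema in one pass and records a choice wherever a field shows up with more than one type.

```python
from collections import defaultdict


def infer_schema(records):
    """Single pass over the records, collecting every type seen per field."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    # A field observed with more than one type becomes a "choice".
    return {
        field: next(iter(types)) if len(types) == 1 else {"choice": sorted(types)}
        for field, types in seen.items()
    }


records = [
    {"id": 1, "price": 10},
    {"id": 2, "price": "12.50", "coupon": "SPRING"},  # new field, changed type
]
schema = infer_schema(records)
print(schema)
```

The new `coupon` field is simply added to the schema, and `price` is kept as an int-or-string choice until a later step decides how to resolve it.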
- 40.
Job Authoring: Glue Transforms
[Figure: transforms applied to a dynamic frame with fields A, B, C, X, Y]
ResolveChoice(): project, cast, or separate into cols
ApplyMapping()
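As a rough mental model of these two transforms, here is a plain-Python sketch that works on lists of dicts rather than Glue dynamic frames. The helper names and the three-part mapping tuples are hypothetical simplifications and should not be read as the signatures of Glue's own `ResolveChoice` and `ApplyMapping`.

```python
def resolve_choice_cast(records, field, to_type):
    """Cast `field` to one type; records that cannot be cast are set aside."""
    good, errors = [], []
    for record in records:
        try:
            fixed = dict(record)
            if field in fixed:
                fixed[field] = to_type(fixed[field])
            good.append(fixed)
        except (TypeError, ValueError):
            errors.append(record)
    return good, errors


def apply_mapping(records, mappings):
    """mappings: list of (source_name, target_name, target_type) tuples."""
    return [
        {target: cast(record[source])
         for source, target, cast in mappings if source in record}
        for record in records
    ]


records = [
    {"id": 1, "price": 10},
    {"id": 2, "price": "12.50"},
    {"id": 3, "price": "n/a"},  # cannot be cast to float
]
clean, rejected = resolve_choice_cast(records, "price", float)
mapped = apply_mapping(clean, [("id", "order_id", int), ("price", "amount", float)])
print(mapped)
print(rejected)
```

Casting settles the int-or-string ambiguity, the mapping step renames and retypes the columns, and any field not listed in the mapping is dropped.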
- 41.
Job Authoring: Relationalize() Transform
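[Figure: semi-structured schema converted to a relational schema]
Semi-structured schema: A, B, C (with nested X and Y), D [ ]
Relational schema: a root table with columns A, B, C.X, C.Y (including a foreign key, FK) and a child table with columns PK, Offset, Value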
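The flattening idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not Glue's `Relationalize` implementation: it handles one level of nesting and a single array field, and the function name and the lower-case `pk`, `offset`, `value` column names are chosen here to echo the figure.

```python
def relationalize(records, array_field):
    """Flatten nested dicts into dotted columns and pivot one array into a child table."""
    root, child = [], []
    for pk, record in enumerate(records):
        row = {}
        for key, value in record.items():
            if key == array_field:
                row[key] = pk  # foreign key pointing at the child rows
                for offset, item in enumerate(value):
                    child.append({"pk": pk, "offset": offset, "value": item})
            elif isinstance(value, dict):
                # Nested struct: one column per sub-field, e.g. "C.X".
                for sub_key, sub_value in value.items():
                    row[f"{key}.{sub_key}"] = sub_value
            else:
                row[key] = value
        root.append(row)
    return root, child


records = [{"A": 1, "B": [10, 20], "C": {"X": "x1", "Y": "y1"}}]
root, child = relationalize(records, "B")
print(root)
print(child)
```

The root table keeps one row per input record, while each array element becomes its own row in the child table, so both can be loaded into a relational store and joined on the key.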
- 42.
Job Authoring: ETL Code
• Human-readable, editable, and portable PySpark code
• Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
• Customisable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
- 43.
Job Authoring: Developer Endpoints
• Environment to iteratively develop and test ETL code.
• Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
• When you are satisfied with the results you can create an ETL job that runs your code.
[Figure: a remote interpreter connected to an interpreter server in the Glue Apache Spark environment]
- 44.
Job Authoring: ETL Code
• Human-readable, editable, and portable PySpark code
• Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
• Customisable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
• Collaborative: share code snippets via GitHub, reuse code across jobs
- 45.
How do I run ETL jobs?
- 46.
Serverless Job Execution
Auto-configure VPC & role-based access: security & isolation preserved
Customers can specify job capacity using Data Processing Units (DPU)
Automatically scale resources
Only pay for the resources you consume: per-second billing (10-minute min)
[Figure: compute instances connected to customer VPCs]
- 47.
Data Processing Units (DPUs)
1 DPU = 4 vCPU + 16 GB RAM
Data Catalog storage:
• Free for the first million objects stored
• $1 per 100,000 objects stored above 1M, per month
Data Catalog requests:
• Free for the first million requests per month
• $1 per million requests above 1M in a month
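The billing rules on these two slides can be turned into a small calculator. This is a sketch: the per-DPU-hour price is not stated in the deck, so it is a parameter, the 0.44 in the example is only a placeholder, and charging partial blocks of objects or requests pro rata is an assumption.

```python
def job_cost(dpus, runtime_seconds, price_per_dpu_hour, minimum_seconds=600):
    """Per-second billing with a 10-minute (600 s) minimum per job run."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600 * price_per_dpu_hour


def catalog_storage_cost(objects_stored):
    """Monthly cost: first million objects free, then $1 per 100,000 objects."""
    billable = max(objects_stored - 1_000_000, 0)
    return billable / 100_000 * 1.00


def catalog_request_cost(requests):
    """Monthly cost: first million requests free, then $1 per million requests."""
    billable = max(requests - 1_000_000, 0)
    return billable / 1_000_000 * 1.00


# A 5-minute job on 10 DPUs is billed as 10 minutes.
# 0.44 is a placeholder rate, not a figure taken from the slides.
print(job_cost(dpus=10, runtime_seconds=300, price_per_dpu_hour=0.44))
print(catalog_storage_cost(1_500_000))
print(catalog_request_cost(3_000_000))
```

Because of the 10-minute minimum, very short jobs cost the same as a 10-minute run, which is worth keeping in mind when deciding how finely to split work across jobs.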
- 48.
Job Composition: Example
Data based
>10 MB new
ad-click
logs
Sales: Revenue by
customer segment
Schedule
Central: ROI by
customer segment
weekly
sales
- 49.
Sounds Good In Theory…
What’s It Really Like?
- 50.
Demo context
Amazon
RDS
Amazon S3
AWS Glue
Amazon Redshift &
Redshift Spectrum
Amazon EMR
Amazon Athena
- 51.
How AWS Glue Helps with ETL
Automatically discover your data
Generate ETL code
Run your ETL jobs serverless
- 52.
Speaker Contact:
Ben Thurgood
Principal Solutions
Architect
btgood@amazon.com
Homework suggestion:
https://amzn.to/2iWVYey