How to Build a Data Lake with AWS Glue Data Catalog (ABD213-R) re:Invent 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
How to Build a Data Lake with AWS
Glue Data Catalog
Prajakta Damle
Senior Product Manager, AWS Glue
ABD213

What to expect from this session
Data challenge today
What is a data lake?
What is AWS Glue Data Catalog?
How does AWS Glue catalogue my data?
My data is catalogued, what’s next?
Q&A

Your data today
Documents and files, records, streams
Amazon RDS
Amazon DynamoDB
AWS IoT
On-premises databases
Amazon Kinesis Streams
Spreadsheets
Infrastructure logs
Clickstream data
Mobile app data
Social media data
Amazon Redshift
Device data
Amazon Kinesis Firehose
Sensor data
ERP
Multiple sources and formats… and growing every day

Why is this a new problem?
Web and mobile data
Logs
Social media data
Streaming data
IoT data
Spreadsheets
Structured data
Unstructured and Semi-structured data
Dark data

Dark data challenge
[Chart: data volume, 1990 to 2020, comparing generated data with data available for analysis; the difference is labeled "The Data Gap"]

Multiple consumers and requirements
Data duplication
Data Scientists
Analysts
Business Users
Applications
Agile Real time
Flexible Scale

What is a data lake?
Collect and store all data, at any
scale, and low cost
Help locate, curate, and secure your
data
Provide democratized access to data
within your organization
Quickly and easily perform new
types of data analysis

Benefits of a data lake
Quickly ingest and store
any type of data, at any
scale, and at low cost
Have a single source of
truth and quickly search
and find the relevant data
Easily query the data
through a unified set of
tools

Layers of a data lake
Layers: Store, Secure, Analyze, AI
Services shown: Athena, Amazon QuickSight, Lex, Amazon Rekognition, Amazon Polly, Amazon ML

The missing piece
• A unified view into your data, no matter where it is stored
• Integration with your analytics tools
• A way to automatically build your metadata and keep it in sync with your data as it evolves

What is AWS Glue?
Discover: Automatically discover and categorize your data, making it immediately searchable and queryable across data sources.
Develop: Generate code to clean, enrich, and reliably move data between various data sources. Easily customize this code or bring your own.
Deploy: Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to provision or manage.

Select AWS Glue customers

AWS Glue Components
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting

What is the AWS Glue Data Catalog?
Unified metadata repository across relational databases, Amazon RDS,
Amazon Redshift, and Amazon S3…with support for more coming soon!
• Get a single view into your data, no matter where it is stored
• Automatically classify your data in one central list that is searchable
• Track data evolution using schema versioning
• Query your data using Amazon Athena or Amazon Redshift Spectrum
• Apache Hive metastore compatible; can be used as an external
metastore for applications running on Amazon EMR
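
To illustrate the Hive metastore compatibility point, here is a minimal sketch of the configuration list one might pass when launching an EMR cluster so that Hive and Spark SQL use the Glue Data Catalog. The helper name `glue_metastore_configurations` is hypothetical, and the classification and property names should be checked against the EMR release in use.

```python
# Client factory class that points Hive-compatible engines on EMR at the
# Glue Data Catalog instead of a self-managed Hive metastore.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations(include_spark=True):
    """Return an EMR 'Configurations' list (hypothetical helper)."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = [
        {"Classification": "hive-site", "Properties": dict(properties)}
    ]
    if include_spark:
        # Spark SQL reads its Hive settings from a separate classification.
        configurations.append(
            {"Classification": "spark-hive-site", "Properties": dict(properties)}
        )
    return configurations
```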

Data lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
AWS GLUE ETL

Logical data lake with AWS Glue
Unified view
AWS GLUE ETL

How do I set up the Glue Data Catalog?
• Call Glue’s CreateTable API
• Create table manually
• Run Hive DDL statement
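
As a rough sketch of the CreateTable route, the Python below builds a `TableInput` structure for a JSON dataset on Amazon S3 and passes it to the boto3 Glue client. The helper names, the table and column names, and the choice of JSON SerDe are assumptions made for illustration and are not taken from the session.

```python
def build_table_input(name, s3_location, columns):
    """Build a TableInput dict; columns is a list of (name, hive_type)."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": (
                "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
            ),
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


def create_table(database, table_input):
    """Call Glue's CreateTable API (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    glue = boto3.client("glue")
    return glue.create_table(DatabaseName=database, TableInput=table_input)
```

For example, `create_table("analytics", build_table_input("events", "s3://example-bucket/events/", [("id", "string")]))` would register one table, assuming the hypothetical database `analytics` already exists.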

Easier way to build the Glue Data Catalog
1. Tell us where your data is
2. Tell us how often you want to check for updates
And you are done! Your Data Catalog is ready for search and querying
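
The two steps above map onto a crawler definition: a data location and an optional schedule. A minimal sketch with boto3 follows; the helper names, the role ARN, the bucket path, and the cron expression in the usage note are placeholders chosen for illustration.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for Glue's CreateCrawler API."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # Step 1: tell the crawler where the data is.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Step 2: tell it how often to check for updates.
        request["Schedule"] = schedule
    return request


def create_crawler(request):
    """Create the crawler (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    return boto3.client("glue").create_crawler(**request)
```

A daily crawl could then be requested with a schedule such as `"cron(0 2 * * ? *)"`; omitting the schedule leaves the crawler to be run on demand.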

What are crawlers?
Crawlers automatically build your Data Catalog and keep it in sync
• Scan your data stored in various data stores, extract metadata and data statistics,
and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
  • You can write your own using Grok expressions
• Discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Run on demand or on a schedule; serverless – only pay when crawler runs
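
Hive-style partitions, mentioned in the list above, encode partition columns as `name=value` path segments in the object key. The sketch below shows how such keys can be read; it only illustrates the naming convention and is not the crawler's actual detection logic.

```python
def hive_partitions(key):
    """Return the partition columns encoded in an S3 object key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


example = hive_partitions("logs/year=2017/month=11/day=29/part-0000.json")
```

Here `example` maps `year`, `month`, and `day` to the values found in the path, while the `logs` prefix is ignored because it has no `=`.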

GitHub timeline data
20+ event types
githubarchive.org
unique payload per event type

A table in the Glue Data Catalog
Table schema
Table properties
Data statistics
Nested fields

How is my data classified?
Crawlers apply a set of classifiers to the data as they scan it and add the metadata as tables to the Data Catalog.
A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, which helps crawlers determine if there is a match.
Glue has a list of built-in classifiers that are applied with every crawl, but you can also write your own.
You can set up your crawler with an ordered set of classifiers. Crawlers invoke classifiers in the order they were provided until a match is found.
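
The ordered invocation described above can be sketched in a few lines of Python. The two toy classifiers, the `threshold` parameter, and the type names are assumptions made for illustration; Glue's real classifiers are far more thorough.

```python
import json


def json_classifier(sample):
    """Toy classifier: certain (1.0) only for a JSON object."""
    try:
        value = json.loads(sample)
    except ValueError:
        return 0.0, None
    if isinstance(value, dict):
        return 1.0, {k: type(v).__name__ for k, v in value.items()}
    return 0.0, None


def csv_classifier(sample):
    """Toy classifier: certain when every line has the same comma count."""
    lines = sample.splitlines()
    if not lines:
        return 0.0, None
    header = lines[0].split(",")
    if len(header) > 1 and all(len(l.split(",")) == len(header) for l in lines):
        return 1.0, {name: "string" for name in header}
    return 0.0, None


def classify(sample, classifiers, threshold=1.0):
    """Invoke classifiers in order and stop at the first match."""
    for name, classifier in classifiers:
        certainty, schema = classifier(sample)
        if certainty >= threshold:
            return name, schema
    return None


CLASSIFIERS = [("json", json_classifier), ("csv", csv_classifier)]
```

Calling `classify('{"id": 1}', CLASSIFIERS)` stops at the first classifier, while a comma-delimited sample falls through to the second.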

Crawlers: automatic schema inference
[Diagram: enumerate S3 objects (file 1, file 2, …, file N) → identify file type and parse files → semi-structured per-file schema → semi-structured unified schema]
Custom classifiers: Grok-based parser
Built-in classifiers: JSON parser, CSV parser, Parquet parser, …

Detecting schema similarity
Schema A
root
  name: str
  id: num
  addr
    street: str
    city: str
    zip: num
Schema B
root
  name: str
  id: num
  addr: str
Schema similarity heuristic
• 1 point for matching name
• 1 point for matching data type
• Match when similarity index > 0.7
sim = intersection / min(A, B) = 7 / 8 = 0.875
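
One way to reproduce the 7/8 figure is sketched below. It assumes each node (including the root and nested structs) can earn one point for its name and one for its type, and that a schema's size is the total number of points it could earn. This counting scheme is an interpretation of the slide, not a documented Glue algorithm.

```python
def flatten(schema, path="root"):
    """Map every node path to its type; nested dicts count as structs."""
    nodes = {path: "struct"}
    for name, value in schema.items():
        child = f"{path}.{name}"
        if isinstance(value, dict):
            nodes.update(flatten(value, child))
        else:
            nodes[child] = value
    return nodes


def similarity(a, b):
    """Points shared by both schemas divided by the smaller schema's points."""
    nodes_a, nodes_b = flatten(a), flatten(b)
    points = 0
    for path, type_name in nodes_a.items():
        if path in nodes_b:
            points += 1  # matching name
            if nodes_b[path] == type_name:
                points += 1  # matching data type
    return points / min(2 * len(nodes_a), 2 * len(nodes_b))


SCHEMA_A = {
    "name": "str",
    "id": "num",
    "addr": {"street": "str", "city": "str", "zip": "num"},
}
SCHEMA_B = {"name": "str", "id": "num", "addr": "str"}

score = similarity(SCHEMA_A, SCHEMA_B)
is_match = score > 0.7
```

Under these assumptions, `addr` earns only its name point because it is a struct in Schema A and a string in Schema B, which gives 7 of Schema B's 8 possible points.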

What can crawlers classify?
[Diagram: IAM role → Glue crawler → databases (Amazon RDS), data warehouse (Amazon Redshift), data lakes (Amazon S3)]
Built-in classifiers
• JDBC connection: MySQL, MariaDB, PostgreSQL, Oracle, Microsoft SQL Server, Amazon Aurora, Amazon Redshift
• Object connection:
  • Avro, Parquet, ORC, XML, JSON & BSON
  • Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
  • Delimited (comma, pipe, tab, semicolon)
  • Compressions (ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional custom classifiers with Grok!

How can I write my own classifiers?
You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema.
A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
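
To make the "named set of regular expressions" idea concrete, the sketch below expands a Grok-style pattern into a Python regex with named groups. The regex definitions for `TIMESTAMP_ISO8601`, `MESSAGEPREFIX`, and `CRAWLERLOGLEVEL` are simplified assumptions (the session does not define them), the square brackets are escaped so the regex treats them literally, and the sample log line is invented.

```python
import re

# Simplified, hypothetical definitions for the named patterns.
GROK_DEFINITIONS = {
    "TIMESTAMP_ISO8601": (
        r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
        r"(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
    ),
    "MESSAGEPREFIX": r"[^\]]+",
    "CRAWLERLOGLEVEL": r"BENCHMARK|ERROR|WARN|INFO|TRACE",
    "GREEDYDATA": r".*",
}


def grok_to_regex(pattern):
    """Replace each %{NAME:field} with a named regex group."""

    def expand(match):
        grok_name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, GROK_DEFINITIONS[grok_name])

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, pattern))


PATTERN = (
    r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
    r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}"
)

line = "2017-11-29 10:15:30 [abc-123] INFO : Crawler started"
fields = grok_to_regex(PATTERN).match(line).groupdict()
```

Each named field in the pattern becomes a key in `fields`, which is roughly how a matched line turns into columns of a table schema.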

Custom classifiers
1. Write a custom classifier
2. Add it to your crawler

Available partitions
Automatically detected partitions

How are partitions detected?
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…

Automatically update table version as data evolves
Automatic schema versioning

Import/Export your metadata
Import from an external metastore: Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog
Export to an external metastore: AWS Glue Data Catalog → AWS Glue ETL → Apache Hive Metastore
Find the import/export ETL script on Glue’s GitHub repository

Your data is catalogued…what’s next?

Quickly find your data
Search on key terms
Save results and come back to them later
Query data in Amazon Athena

Analyze same data with different engines
Amazon QuickSight

What is Amazon Athena?
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Query instantly: Zero setup cost; just point to Amazon S3 and start querying
Pay per query: Pay only for queries run; save 30-90% on per-query costs through compression
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy: Serverless, with zero infrastructure and zero administration; integrated with Amazon QuickSight
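
As a hedged illustration of querying a cataloged table from code, the sketch below builds a request for Athena's StartQueryExecution API with boto3. The helper names, database name, table name, and results bucket are placeholders and do not come from the session.

```python
def build_athena_query(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        # The database is resolved through the Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Start the query (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


request = build_athena_query(
    "SELECT type, COUNT(*) AS events FROM github_events GROUP BY type",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
```

The returned execution ID can then be polled for status before fetching results.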

What is Amazon EMR?
Analytics and ML at scale with 19 open-source projects
Integration with AWS Glue Data Catalog for Apache Spark, Apache Hive, and Presto
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release
Low cost: Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto scaling to reduce costs 50-80%
Use S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
Easy: Launch fully managed Apache Hadoop and Apache Spark in minutes; no cluster setup, node provisioning, or cluster tuning

What is Amazon Redshift Spectrum?
Extend the data warehouse to your S3 data lake
[Diagram: the Amazon Redshift Spectrum query engine spans Amazon Redshift data and the S3 data lake]
Exabyte-scale Amazon Redshift SQL queries against S3
Join data across Amazon Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
Parquet, ORC, Grok, Avro, & CSV data formats
Pay only for the amount of data scanned

Serverless data exploration
[Diagram: Data → Crawlers → AWS Glue Data Catalog (unified view) → Data explorer]
• Data scientists want fast access to disparate datasets for data exploration
• Gain insight in minutes without the need to configure and operationalize infrastructure
• Glue automatically catalogues heterogeneous data sources, and offers serverless Apache Spark infrastructure for interactive analysis

Move data across storage systems
Unified view

Data lake vs. data warehouse
Data lake | Data warehouse
Semi-structured / unstructured / structured data | Structured data
Schema on read | Schema on write
Data science, predictive analysis, BI use cases | SQL-based BI use cases
Great for storing granular data; raw as well as processed data | Great for storing frequently accessed data as well as data aggregates and summary
Separation of compute and storage | Tightly coupled compute and storage

Interoperate data lake and data warehouse
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AWS GLUE ETL
Amazon QuickSight

Key announcements (coming soon)
• Write Glue ETL jobs in Scala, in addition to PySpark
• Glue available in eu-west-1 (Ireland)
• Glue available in ap-northeast-1 (Tokyo)

THANK YOU!
Glue-pm@amazon.com
https://aws.amazon.com/glue/developer-resources/
