AWS Lake Formation Deep Dive

© 2020, Amazon Web Services, Inc. or its Affiliates.
Cobus Bernard
Senior Developer Advocate
Amazon Web Services

What’s coming up:
Bonus
• QR codes
• Data Lake refresher
• AWS Lake Formation & Related Services
• Use-case examples
• Going forward
• QA

Poll – Which of these are you using?
• S3 Lifecycle policy
• Glue ETL with ML
• Athena to query Relational Databases
• AWS Datalake

Data Lake: Amazon S3 | AWS Glue
Data Movement (on-premises): AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Data Movement (real-time): AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

Data Lake Refresher
Concepts & flow

Data Lake Definition
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

Data Lake Main Concepts
• All data in one place, a single source of truth
• Supports different formats: structured, semi-structured, unstructured, and raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules

Traditional Data lake
Then:
- Enterprise Data Warehouse
- Batch-based ETL for BI analysis
- Dashboarding tools
Data silos to
[Diagram: ERP, CRM, LOB | DW Silo 1 | Business Intelligence; Devices, Web, Sensors, Social | DW Silo 2 | Business Intelligence]

Modern Data lake
Now:
- More data, less structured
- Accessed in various ways
- Real-time streaming
- Machine learning, scientific, analytics, regulatory requirements
- Low-cost storage & analytics
[Diagram: OLTP, ERP, CRM, LOB | Data Warehouse | Business Intelligence; Devices, Web, Sensors, Social | Data Lake with Catalog | Machine Learning, DW Queries, Big data processing, Interactive, Real-time]

Why choose AWS for data lakes and analytics?
• Most secure infrastructure for analytics
• Most scalable and cost effective
• Easiest to build data lakes and analytics
• Most comprehensive and open

Moving to data lake architectures
Extends or evolves DW architectures
Store any data in any format
Durable, available, and exabyte scale
Secure, compliant, auditable
Run any type of analytics from DW to Predictive
[Diagram: Data lake with Data Warehousing, Analytics, and Machine Learning]

More data lakes and analytics than anywhere else
Tens of thousands of data lakes run on AWS across all industries

Workflow of AWS services for big data
End-to-end Pipeline

Simplified Data Pipeline
[Diagram: Data Sources, Ingest, Store (Amazon S3), Catalog, Process & Analyze, Consume]

Multiple Data Sources
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Other stages: Ingest, Store (Amazon S3), Catalog, Process & Analyze, Consume

Amazon DynamoDB
Fully managed, multi-region, multi-master database
Nonrelational database that delivers reliable performance at any scale
Consistent single-digit millisecond latency
Built-in security, backup and restore, and in-memory caching
Supports streams

Ingestion Options
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service
Other stages: Store (Amazon S3), Catalog, Process & Analyze, Consume

Amazon Kinesis: Streaming Data Made Easy
Amazon Kinesis Data Streams
• For technical developers
• Build your own custom applications that process or analyze streaming data
Amazon Kinesis Data Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch
Amazon Kinesis Data Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
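
As a rough illustration of the producer side of Kinesis Data Firehose, the sketch below groups events into batches shaped like the `Records` argument of the `PutRecordBatch` API, which accepts at most 500 records per call. The event fields and the stream name `example-stream` are hypothetical, payload size limits are not checked, and the boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Yield lists of {'Data': bytes} records, one JSON document per line."""
    batch = []
    for event in events:
        batch.append({"Data": (json.dumps(event) + "\n").encode("utf-8")})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


# Hypothetical sensor readings used only for illustration.
events = [{"device_id": i, "temperature": 20 + i % 5} for i in range(1200)]
batches = list(to_firehose_batches(events))

# With credentials configured, each batch could be sent like this:
# import boto3
# firehose = boto3.client("firehose")
# for records in batches:
#     firehose.put_record_batch(DeliveryStreamName="example-stream", Records=records)
```

Newline-delimited JSON is used here because it keeps the objects that Firehose writes to S3 easy to crawl and query later.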

Storage Layer
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service
Store: Amazon S3
Other stages: Catalog, Process & Analyze, Consume

Amazon S3 is the Base
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data, from websites, mobile apps, corporate applications, and IoT sensors, at any scale
Store data in the format you want:
Unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
Storage lifecycle integration:
Amazon S3 Standard | Amazon S3 Infrequent Access | Amazon Glacier
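
The storage lifecycle integration can be sketched as a lifecycle configuration in the shape that boto3's `put_bucket_lifecycle_configuration` expects. The `raw/` prefix, the bucket name, and the 30/90-day thresholds are assumptions chosen for illustration, not values from the talk.

```python
def lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build one rule: Standard -> Infrequent Access -> Glacier for a key prefix."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle_configuration = {"Rules": [lifecycle_rule("raw/")]}

# With credentials configured, the configuration could be applied like this
# (the bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```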

Data Discovery and Catalog
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Other stages: Process & Analyze, Consume

AWS Glue - Serverless Data Catalog and ETL
Automatically discovers data and stores schema
Makes data searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless
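
To make the discovery step concrete, here is a minimal sketch of the request for a scheduled Glue crawler, in the shape accepted by boto3's `glue.create_crawler`. The crawler name, role ARN, database name, S3 path, and schedule are all hypothetical.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for glue.create_crawler targeting one S3 path."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # Glue cron expression
    return request


request = crawler_request(
    name="example-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-data-lake/raw/",
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)

# With credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name=request["Name"])
```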

Process and Analyze
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch
Other stages: Consume

Amazon Athena - Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports multiple data formats; define schema on demand
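
A query against the catalogued data might be submitted as sketched below. The request matches the shape of boto3's `athena.start_query_execution`; the database, table, column, and result bucket names are hypothetical.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    sql="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-athena-results/",
)

# With credentials configured:
# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**request)["QueryExecutionId"]
# Results are then fetched with athena.get_query_results(QueryExecutionId=execution_id)
# once the query has finished.
```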

Querying the Data Lake
Data sources: Amazon DynamoDB, Web logs / cookies, ERP, Connected devices
Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch
Consume: BI Tools, Jupyter Notebooks, Amazon API Gateway, Amazon QuickSight

Amazon QuickSight
Supports a variety of data sources and targets
Fully managed and scalable
Super fast and easy to use
Cost-effective

Poll – Which do you struggle most with?
• Data migration/ingestion
• Data processing/ETL
• Data access security/auditing
• Cancelling your line with Telkom

Data preparation accounts for ~80% of the work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other

Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time consuming

Lake Formation
Data lake infrastructure & management
S3/Glacier | AWS Glue | Lake Formation

Build a secure data lake in days
with AWS Lake Formation
Simplified ingest and cleaning enables data engineers to build faster
Centralized management of fine-grained permissions empowers security officers
Comprehensive set of integrated tools enables consistent user access
[Diagram: AWS Lake Formation (Blueprints, ML Transforms, Data catalog, Access control), AWS Glue, Amazon S3 data lake storage]

Top features of Lake Formation
• Enhanced governance layer - security and governance layer at the Data Catalog level
• Blueprints / Data Importers - templates for ETL, metadata (schema) and partition management
• ML Transformations - ML algorithms that customers can use to create their own ML Transforms (e.g. record de-duplication)
• Enhanced Data Catalog - enables users to record more metadata and tag Data Catalog objects (i.e. databases, tables, columns)

Easier way to load data to the lake
Sources: Logs, DBs
Blueprints: One-shot, Incremental

How it works

Lake Formation
Security

Security permissions in Lake Formation
Search and view permissions granted
to a user, role, or group in one place
Verify permissions granted to a user
Easily revoke policies for a user

Security permissions in Lake Formation
Control data access with simple grant
and revoke permissions
Specify permissions on tables and
columns rather than on buckets and
objects
Easily view policies granted to a
particular user
Audit all data access at one place
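
A column-level grant can be sketched as follows. The request shape matches boto3's `lakeformation.grant_permissions`, and the same arguments can be passed to `revoke_permissions`; the principal ARN, database, table, and column names are hypothetical.

```python
def column_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Build keyword arguments for a Lake Formation grant on specific columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


grant = column_grant(
    principal_arn="arn:aws:iam::123456789012:user/example-analyst",
    database="example_db",
    table="patients",
    columns=["patient_id", "city"],  # sensitive columns are simply left out
)

# With credentials configured:
# import boto3
# lakeformation = boto3.client("lakeformation")
# lakeformation.grant_permissions(**grant)
# lakeformation.revoke_permissions(**grant)  # same shape to undo the grant
```

The grant names tables and columns only; no bucket or object policy appears in it.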

Roles in a data lake project
IAM Administrator: provisions AWS resources and accounts
e.g.
- Create Data Lake Admin
- Create S3 bucket
- Create Roles/Policies
Data Lake Admin: manages data lake resources
e.g.
- Registers Data Lake
- Secure data lake/add access controls
Data Lake Developer/Analyst: accesses and queries the data lake
e.g.
- Query Data Lake
- Creates ML Transforms
- Writes ETL Jobs
- Visualization using Amazon QuickSight

Configuration and Execution Steps
Step | Activity | User
1 | Provision the data lake administrator, the data lake analyst/developer, and the data lake and Glue roles | IAM Administrator
2 | Register a data lake | Data Lake Administrator
3 | Assign Lake Formation permissions to the data lake analyst | Data Lake Administrator
4 | Create an AWS Glue database | Data Lake Administrator
5 | Crawl and catalog patient data in AWS Glue | Data Lake Analyst
6 | Assign table permissions to the data lake analyst | Data Lake Administrator
7 | Observe the data pattern and duplicates in the data using Amazon Athena | Data Lake Analyst
8 | Create, teach and tune an AWS Lake Formation ML Transform | Data Lake Analyst
9 | Create an AWS Glue ETL job to use the ML Transform for data deduplication | Data Lake Analyst
10 | Catalog the de-duplicated data and query it using Amazon Athena | Data Lake Analyst
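
Steps 2 and 4 could look roughly like this when scripted. The request shapes follow boto3's `lakeformation.register_resource` and `glue.create_database`; the bucket and database names are hypothetical, and using the service-linked role is an assumption made to keep the sketch short.

```python
def register_request(bucket, prefix=""):
    """Build keyword arguments for registering an S3 location as data lake storage."""
    arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        arn = f"{arn}/{prefix.strip('/')}"
    return {"ResourceArn": arn, "UseServiceLinkedRole": True}


def database_request(name, description=""):
    """Build keyword arguments for creating a Data Catalog database."""
    return {"DatabaseInput": {"Name": name, "Description": description}}


step_2 = register_request("example-data-lake", "patients/")
step_4 = database_request("example_db", "Patient records for the deduplication lab")

# With credentials configured and the data lake administrator's permissions:
# import boto3
# boto3.client("lakeformation").register_resource(**step_2)
# boto3.client("glue").create_database(**step_4)
```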

Secure once, access in multiple ways
[Diagram: Lake Formation Admin, Access Control, Data Catalog, Data Lake Storage]

Search and collaborate across multiple users
Text-based, faceted search across all metadata
Add attributes like data owners, stewards, and others as table properties
Add data sensitivity level, column definitions, and others as column properties
Text-based search and filtering
Query data in Amazon Athena
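
The idea of text-based, faceted search over catalog metadata can be illustrated with a small in-memory sketch. This is not how Lake Formation implements search; the table entries, property names, and the `search_catalog` helper are all hypothetical.

```python
def search_catalog(tables, text="", **facets):
    """Return names of tables whose metadata contains `text` and matches every facet."""
    text = text.lower()
    matches = []
    for table in tables:
        props = table.get("properties", {})
        # Searchable text: table name, column names, property keys and values.
        haystack = " ".join(
            [table["name"], *table.get("columns", []), *props.keys(), *map(str, props.values())]
        ).lower()
        if text in haystack and all(props.get(k) == v for k, v in facets.items()):
            matches.append(table["name"])
    return matches


catalog = [
    {
        "name": "patients",
        "columns": ["patient_id", "full_name", "city"],
        "properties": {"data_owner": "clinical-ops", "sensitivity": "high"},
    },
    {
        "name": "web_logs",
        "columns": ["ts", "url", "status"],
        "properties": {"data_owner": "platform", "sensitivity": "low"},
    },
]

by_text = search_catalog(catalog, text="patient")
by_facet = search_catalog(catalog, sensitivity="low")
```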

Enhanced Governance Layer
AWS Lake Formation provides a security and governance layer at the Data
Catalog level. Users can grant or revoke permissions to the Data Catalog
objects such as databases, tables and columns for IAM principals (IAM
users and roles). This functionality will be extended to row-level access in
subsequent releases.

Lake Formation
Glue/Blueprints

AWS Glue
Simple, flexible, and cost-effective ETL & Data Catalog
Less hassle
Integrated across AWS: supports Aurora, RDS, Redshift, S3, and common database engines in your VPC running on EC2
Serverless
No infrastructure to provision or manage
More power
Automatically generates the code to execute your data transformations and loading processes

Blueprints build on AWS Glue

Blueprints / Data Importers
Blueprints are templates for data ingestion, transformation, metadata
(schema) and partition management. Blueprints help customers to
quickly and easily build and maintain a data lake.

With blueprints
You
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
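
The incremental behaviour described above (partition the data and keep track of what was already processed) can be sketched with a simple bookmark. This is a conceptual illustration only, not the blueprint implementation; the `id` bookmark key, the `event_date` partition column, and the sample rows are assumptions.

```python
from collections import defaultdict


def incremental_load(rows, bookmark, key="id", partition_by="event_date"):
    """Select rows newer than the bookmark and group them by partition value.

    Returns the partitioned rows and the bookmark to store for the next run.
    """
    new_rows = [row for row in rows if row[key] > bookmark]
    partitions = defaultdict(list)
    for row in new_rows:
        partitions[row[partition_by]].append(row)
    new_bookmark = max((row[key] for row in new_rows), default=bookmark)
    return dict(partitions), new_bookmark


source_rows = [
    {"id": 1, "event_date": "2020-03-01", "value": "a"},
    {"id": 2, "event_date": "2020-03-01", "value": "b"},
    {"id": 3, "event_date": "2020-03-02", "value": "c"},
    {"id": 4, "event_date": "2020-03-02", "value": "d"},
]

# First run: rows 1 and 2 were loaded earlier, so the stored bookmark is 2.
first_partitions, bookmark = incremental_load(source_rows, bookmark=2)
# Second run with no new source rows loads nothing and keeps the bookmark.
second_partitions, bookmark_after = incremental_load(source_rows, bookmark=bookmark)
```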

Glue Features and Tips
• Use VPC Endpoint for Glue, Athena and S3
• Daisy-chain JDBC jobs to avoid startup pain
• Use Lambda for tricks that aren’t yet available

Lambda with Glue
• Automatically start an AWS Glue job when a file is uploaded to S3
• Use Lambda to auto-run a newly created Glue Crawler
• Use CloudWatch with Lambda to receive SNS notices for specific Glue events
• Create & enable a Glue Trigger in Lambda
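
The first tip, starting a Glue job when a file lands in S3, might look like the Lambda handler sketched below. The job name and the `--source_path` argument are hypothetical; the optional `glue` parameter exists only so the handler can be exercised with a stand-in client, and Lambda itself calls the function with just `event` and `context`.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "example-etl-job"  # hypothetical job name


def handler(event, context, glue=None):
    """Start one Glue job run per object in an S3 event notification."""
    if glue is None:
        import boto3  # available in the Lambda Python runtime

        glue = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Inside the job script, the `--source_path` value would typically be read with `getResolvedOptions` from `awsglue.utils`.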

Lake Formation
ML Transformations

ML Transformations
Deduplication & Fuzzy Matching of your Data

MLTransforms

1. Create a new MLTransform

2.

Fuzzy deduplication – Under the hood
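
To give a feel for what fuzzy deduplication means, the sketch below groups near-identical names using string similarity from Python's standard `difflib`. This is a deliberately simple stand-in and not the algorithm behind the FindMatches ML Transform, which learns from labelled examples; the 0.85 threshold and the sample names are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(records, threshold=0.85):
    """Greedily place each record in the first group whose first member is similar enough."""
    groups = []
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:  # no existing group matched
            groups.append([record])
    return groups


patients = ["John Smith", "Jon Smith", "Alice Jones"]
groups = group_duplicates(patients)
```

Here "John Smith" and "Jon Smith" fall into one group while "Alice Jones" stays separate. A fixed threshold on one field is brittle, which is why a learned transform that weighs several columns is attractive for real patient records.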

What next?
Recommended things to check out

http://bit.ly/LFML-2019
Lake Formation Lab

AWS Training and Certification
Resources created by the experts at AWS to propel your organization and career forward
Free foundational to advanced digital courses cover AWS services and teach architecting best practices
Classroom offerings, including Architecting on AWS, feature AWS expert instructors and hands-on labs
Validate expertise with the AWS Certified Solutions Architect - Associate or AWS Certified Solutions Architect - Professional exams
Visit aws.amazon.com/training/path-architecting/

AWS Big Data blogs
Real deep, use-cases, step by step.
• “Use AWS Glue to run ETL jobs against non-native JDBC data sources”
• “Matching patient records with the AWS Lake Formation FindMatches transform”
• “Discovering metadata with AWS Lake Formation”
https://aws.amazon.com/blogs/?awsf.blog-master-category=category%23big-data

Q&A

Thank you!
