AWS Lake Formation Deep Dive
- 1. © 2020, Amazon Web Services, Inc. or its Affiliates.
Cobus Bernard
Senior Developer Advocate
Amazon Web Services
AWS Lake Formation
Deep Dive
- 2.
What’s coming up:
Bonus
• QR codes
• Data Lake refresher
• AWS Lake Formation & Related Services
• Use-case examples
• Going forward
• QA
- 3.
Poll – Which of these are you using?
• S3 Lifecycle policy
• Glue ETL with ML
• Athena to query Relational Databases
• AWS Datalake
Data Lake: Amazon S3 | AWS Glue
On-premises Data Movement: AWS Direct Connect | AWS Snowball | AWS Snowmobile | AWS Database Migration Service
Real-time Data Movement: AWS IoT Core | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Kinesis Video Streams
Machine Learning: Amazon SageMaker | AWS Deep Learning AMIs | Amazon Rekognition | Amazon Lex | AWS DeepLens | Amazon Comprehend | Amazon Translate | Amazon Transcribe | Amazon Polly
Analytics: Amazon Athena | Amazon EMR | Amazon Redshift | Amazon Elasticsearch Service | Amazon Kinesis | Amazon QuickSight
- 4.
Data Lake Refresher
Concepts & flow
- 5.
Data Lake Definition
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
- 6.
Data Lake Main Concepts
• All data in one place, a single source of truth
• Supports different formats: structured, semi-structured, unstructured, and raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
- 7.
Traditional Data lake
Then:
- Enterprise Data Warehouse
- Batch-based ETL for BI analysis
- Dashboarding tools
Data silos to
ERP CRM LOB
DW Silo 1
Business
Intelligence
Devices Web Sensors Social
DW Silo 2
Business
Intelligence
- 8.
Modern Data lake
Now:
- More data, less structured
- Accessed in various ways
- Real-time streaming
- Machine learning, scientific, analytics, regulatory requirements
- Low cost storage & analytics
OLTP | ERP | CRM | LOB
Data Warehouse
Business Intelligence
Data Lake
Devices | Web | Sensors | Social
Catalog
Machine Learning | DW Queries | Big data processing | Interactive | Real-time
- 9.
Why choose AWS for data lakes and analytics?
• Most secure infrastructure for analytics
• Most scalable and cost effective
• Easiest to build data lakes and analytics
• Most comprehensive and open
- 10.
Moving to data lake architectures
Extends or evolves DW architectures
Store any data in any format
Durable, available, and exabyte scale
Secure, compliant, auditable
Run any type of analytics from DW to Predictive
Data Warehousing | Analytics | Machine Learning | Data lake
- 11.
More data lakes and analytics than anywhere else
Tens of thousands of data lakes run on AWS across all industries
- 12.
Workflow of AWS services for big data
End-to-end Pipeline
- 13.
Simplified Data Pipeline
Data Sources | Ingest | Store (Amazon S3) | Catalog | Process & Analyze | Consume
- 14.
Multiple Data Sources
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest | Store (Amazon S3) | Catalog | Process & Analyze | Consume
- 15.
Amazon DynamoDB
Fully managed, multi-region, multi-master database
Nonrelational database that delivers reliable performance at any scale
Consistent single-digit millisecond latency
Built-in security, backup and restore, in-memory caching
Supports streams
- 16.
Ingestion Options
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest: Amazon Kinesis | AWS Snowball | Amazon MSK | Database Migration Service
Store (Amazon S3) | Catalog | Process & Analyze | Consume
- 17.
Amazon Kinesis: Streaming Data Made Easy
Amazon Kinesis Data Streams
• For technical developers
• Build your own custom applications that process or analyze streaming data
Amazon Kinesis Data Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch
Amazon Kinesis Data Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
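As a rough sketch of the Data Streams versus Data Firehose split above, the snippet below builds the boto3 request parameters for sending one record to each service. The stream name `clickstream`, the delivery stream name `clickstream-to-s3`, and the `device_id` field are hypothetical; the helper functions are illustrative and not part of the talk.

```python
import json


def build_stream_record(stream_name, event, partition_key):
    """Parameters for kinesis.put_record (Kinesis Data Streams)."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": partition_key,
    }


def build_firehose_record(delivery_stream_name, event):
    """Parameters for firehose.put_record (Kinesis Data Firehose)."""
    # Firehose concatenates records on delivery, so a trailing newline
    # keeps one JSON object per line in the target.
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return {"DeliveryStreamName": delivery_stream_name, "Record": {"Data": payload}}


def send(event):
    """Send one event to both services. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work without it

    boto3.client("kinesis").put_record(
        **build_stream_record("clickstream", event, str(event["device_id"]))
    )
    boto3.client("firehose").put_record(
        **build_firehose_record("clickstream-to-s3", event)
    )
```

With Data Streams you also write the consumer yourself; with Firehose the delivery to S3, Redshift, or Elasticsearch is configured on the delivery stream.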
- 18.
Storage Layer
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest: Amazon Kinesis | AWS Snowball | Amazon MSK | Database Migration Service
Store: Amazon S3
Catalog | Process & Analyze | Consume
- 19.
Amazon S3 is the Base
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data, from web sites, mobile apps, corporate applications, and IoT sensors, at any scale
Store data in the format you want:
Unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
Storage lifecycle integration
Amazon S3-Standard | Amazon S3-Infrequent Access | Amazon Glacier
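The storage lifecycle integration can be sketched as a lifecycle rule that moves objects from S3 Standard to Infrequent Access and then to Glacier. This is a minimal sketch assuming boto3; the prefix, bucket name, and day counts are hypothetical and would need tuning for a real workload.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=90):
    """LifecycleConfiguration for s3.put_bucket_lifecycle_configuration."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    rule_id = "tier-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, prefix):
    """Apply the rule to a bucket. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_configuration(prefix),
    )
```

For example, `apply_lifecycle("my-data-lake-bucket", "raw/logs/")` would tier only the objects under that prefix.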
- 20.
Data Discovery and Catalog
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest: Amazon Kinesis | AWS Snowball | Amazon MSK | Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Process & Analyze | Consume
- 21.
AWS Glue - Serverless Data Catalog and ETL
Automatically discovers data and stores schema
Makes data searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless
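The "automatically discovers data and stores schema" step is done by a Glue crawler. The following is a minimal sketch, assuming boto3, of creating and starting one against an S3 path; the crawler name, role ARN, database name, and path are all hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Parameters for glue.create_crawler targeting one S3 path."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Glue schedules use cron syntax, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily.
        request["Schedule"] = schedule
    return request


def create_and_start(request):
    """Create the crawler and run it once. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Once the crawler finishes, the discovered tables appear in the Data Catalog database named in the request.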
- 22.
Process and Analyze
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest: Amazon Kinesis | AWS Snowball | Amazon MSK | Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Process & Analyze: Amazon Athena | Amazon EMR | Amazon Redshift | Amazon Elasticsearch
Consume
- 23.
Amazon Athena - Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports multiple data formats - define schema on demand
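Querying S3 data with standard SQL through Athena might look like the sketch below, assuming boto3. The database name, SQL, and results location are hypothetical; Athena runs queries asynchronously, so the caller polls for completion.

```python
import time


def build_query_request(sql, database, output_location):
    """Parameters for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location, poll_seconds=1.0):
    """Run a query and return the first page of result rows. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    request = build_query_request(sql, database, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        status = status["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

A call such as `run_query("SELECT * FROM logs LIMIT 10", "lake", "s3://my-athena-results/")` reads the table definition from the Data Catalog and scans the files in place.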
- 24.
Querying the Data Lake
Data sources: Amazon DynamoDB | Web logs / cookies | ERP | Connected devices
Ingest: Amazon Kinesis | AWS Snowball | Amazon MSK | Database Migration Service
Store: Amazon S3
Catalog: AWS Glue
Process & Analyze: Amazon Athena | Amazon EMR | Amazon Redshift | Amazon Elasticsearch
Consume: BI Tools | Jupyter Notebooks | Amazon API Gateway | Amazon QuickSight
- 25.
Amazon QuickSight
Supports a variety of data sources and targets
Fully managed and scalable
Super fast and easy to use
Cost-effective
- 26.
Poll – Which do you struggle most with?
• Data migration/ingestion
• Data processing/ETL
• Data access security/auditing
• Cancelling your line with Telkom
- 27.
Data preparation accounts for ~80% of the work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
- 28.
Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time-consuming
- 29.
Lake Formation
Data lake infrastructure & management
S3/Glacier | AWS Glue | Lake Formation
- 31.
Build a secure data lake in days with AWS Lake Formation
Amazon S3 data lake storage
AWS Lake Formation
Simplified ingest and cleaning enables data engineers to build faster
Centralized management of fine-grained permissions empowers security officers
Comprehensive set of integrated tools enables consistent user access
AWS Glue | Blueprints | ML Transforms | Data catalog | Access control
- 32.
Top features of Lake Formation
• Enhanced governance layer - security and governance layer at the Data Catalog level
• Blueprints / Data Importers - templates for ETL, metadata (schema) and partition management
• ML Transformations - ML algorithms that customers can use to create their own ML Transforms (e.g. record de-duplication)
• Enhanced Data Catalog - enables users to record more metadata and tag Data Catalog objects (i.e. databases, tables, columns)
- 33.
Easier way to load data to the lake
Logs
DBs
Blueprints
One-shot
Incremental
- 36.
Lake Formation
Security
- 37.
Security permissions in Lake Formation
Search and view permissions granted
to a user, role, or group in one place
Verify permissions granted to a user
Easily revoke policies for a user
- 38.
Security permissions in Lake Formation
Control data access with simple grant
and revoke permissions
Specify permissions on tables and
columns rather than on buckets and
objects
Easily view policies granted to a
particular user
Audit all data access in one place
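Granting on tables and columns rather than on buckets and objects can be sketched with the Lake Formation API as below, assuming boto3. The principal ARN, database, table, and column names are hypothetical.

```python
def build_column_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Parameters shared by lakeformation.grant_permissions and revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


def grant(request):
    """Grant the permissions. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("lakeformation").grant_permissions(**request)


def revoke(request):
    """Revoke the same permissions again."""
    import boto3

    boto3.client("lakeformation").revoke_permissions(**request)
```

For example, an analyst role could be given `SELECT` on only the non-sensitive columns of a patient table, with no S3 bucket policy changes for that role.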
- 39.
Roles in a data lake project
IAM Administrator: provisions AWS resources and accounts, e.g.
- Create Data Lake Admin
- Create S3 bucket
- Create Roles/Policies
Data Lake Admin: manages data lake resources, e.g.
- Registers Data Lake
- Secure data lake/add access controls
Data Lake Developer/Analyst: accesses/queries the data lake, e.g.
- Query Data Lake
- Creates ML Transforms
- Writes ETL Jobs
- Visualization using Amazon QuickSight
- 40.
Configuration and Execution Steps
Step | Activity | User
1 | Provision the data lake administrator, the data lake analyst/developer, and the data lake and Glue roles | IAM Administrator
2 | Register a data lake | Data Lake Administrator
3 | Assign Lake Formation permissions to the data lake analyst | Data Lake Administrator
4 | Create AWS Glue database | Data Lake Administrator
5 | Crawl and catalog patient data in AWS Glue | Data Lake Analyst
6 | Assign table permissions to the data lake analyst | Data Lake Administrator
7 | Observe the data pattern and duplicates in the data using Amazon Athena | Data Lake Analyst
8 | Create, teach, and tune an AWS Lake Formation ML Transform | Data Lake Analyst
9 | Create an AWS Glue ETL job to use the ML Transform for data deduplication | Data Lake Analyst
10 | Catalog de-duplicated data and query using Amazon Athena | Data Lake Analyst
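The administrator-side steps above (designating the data lake administrator, registering the data lake location in step 2, and creating the Glue database in step 4) could be scripted roughly as follows, assuming boto3. The ARNs and database name are hypothetical, and the grouping into one helper is only for illustration.

```python
def build_setup_requests(admin_arn, bucket_arn, database_name):
    """Request parameters keyed by the boto3 method they are meant for."""
    return {
        # put_data_lake_settings replaces the existing settings, so a real
        # script would usually read them first with get_data_lake_settings.
        "put_data_lake_settings": {
            "DataLakeSettings": {
                "DataLakeAdmins": [{"DataLakePrincipalIdentifier": admin_arn}]
            }
        },
        "register_resource": {"ResourceArn": bucket_arn, "UseServiceLinkedRole": True},
        "create_database": {"DatabaseInput": {"Name": database_name}},
    }


def run_setup(requests):
    """Apply the requests in order. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    lakeformation = boto3.client("lakeformation")
    glue = boto3.client("glue")
    lakeformation.put_data_lake_settings(**requests["put_data_lake_settings"])
    lakeformation.register_resource(**requests["register_resource"])
    glue.create_database(**requests["create_database"])
```

Steps 3 and 6 would then be `grant_permissions` calls made by the data lake administrator for the analyst principal.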
- 41.
Secure once, access in multiple ways
Data Lake Storage | Data Catalog | Access Control
Lake Formation
Admin
- 42.
Search and collaborate across multiple users
Text-based, faceted search across all metadata
Add attributes like data owners, stewards, and others as table properties
Add data sensitivity level, column definitions, and others as column properties
Text-based search and filtering
Query data in Amazon Athena
- 43.
Enhanced Governance Layer
AWS Lake Formation provides a security and governance layer at the Data
Catalog level. Users can grant or revoke permissions to the Data Catalog
objects such as databases, tables and columns for IAM principals (IAM
users and roles). This functionality will be extended to row level access in
subsequent releases.
- 44.
Lake Formation
Glue/Blueprints
- 45.
AWS Glue
Simple, flexible, and cost-effective ETL & Data Catalog
Less hassle
Integrated across AWS: supports Aurora, RDS, Redshift, S3, and common database engines in your VPC running on EC2
Serverless
No infrastructure to provision or manage
More power
Automatically generates the code to execute your data transformations and loading processes
- 46.
Blueprints build on AWS Glue
- 47.
Blueprints / Data Importers
Blueprints are templates for data ingestion, transformation, metadata
(schema) and partition management. Blueprints help customers to
quickly and easily build and maintain a data lake.
- 48.
With blueprints
You
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
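To make the "partition the data" and "keep track of data that was already processed" steps concrete, here is a plain-Python illustration of an incremental load with a bookmark. It is not how Blueprints are implemented; the `updated_at` field and the date-based partition layout are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime


def incremental_load(rows, bookmark, partition_format="%Y/%m/%d"):
    """Return (partitions, new_bookmark) for rows newer than the bookmark.

    rows: iterable of dicts with an 'updated_at' datetime (hypothetical schema).
    bookmark: datetime of the newest row already loaded, or None on the first run.
    """
    partitions = defaultdict(list)
    newest = bookmark
    for row in rows:
        timestamp = row["updated_at"]
        if bookmark is not None and timestamp <= bookmark:
            continue  # already processed in an earlier run
        partitions[timestamp.strftime(partition_format)].append(row)
        if newest is None or timestamp > newest:
            newest = timestamp
    return dict(partitions), newest
```

A one-shot load is the same call with `bookmark=None`; an incremental load passes the bookmark returned by the previous run.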
- 49.
Glue Features and Tips
• Use VPC Endpoint for Glue, Athena and S3
• Daisy-chain JDBC jobs to avoid startup pain
• Use Lambda for tricks that aren’t yet available
- 50.
Lambda with Glue
• Automatically start an AWS Glue job when a file is uploaded to S3
• Use Lambda to auto-run a newly created Glue crawler
• Use CloudWatch with Lambda to receive SNS notices for specific Glue events
• Create and enable a Glue trigger in Lambda
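The first bullet, starting a Glue job when a file is uploaded to S3, might be wired up with a Lambda handler like the sketch below, assuming boto3 (available in the Lambda Python runtime). The job name `clean-uploads` and the `--source_path` job argument are hypothetical.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "clean-uploads"  # hypothetical Glue job name


def job_arguments(event):
    """Turn the first record of an S3 notification event into Glue job arguments."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return {"--source_path": f"s3://{bucket}/{key}"}


def lambda_handler(event, context):
    """Entry point for an S3 ObjectCreated notification."""
    import boto3  # imported lazily so job_arguments can be tested without it

    run = boto3.client("glue").start_job_run(
        JobName=GLUE_JOB_NAME, Arguments=job_arguments(event)
    )
    return {"JobRunId": run["JobRunId"]}
```

The Lambda execution role would need permission to start the Glue job, and the bucket needs an event notification that targets the function.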
- 51.
Lake Formation
ML Transformations
- 52.
ML Transformations
Deduplication & Fuzzy Matching of your Data
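To give a feel for what fuzzy matching means, the toy example below groups near-identical names using Python's standard `difflib`. This is only an illustration of the idea and is not the algorithm behind the Lake Formation ML Transforms; the 0.85 threshold and the greedy grouping are arbitrary choices.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(names, threshold=0.85):
    """Greedily put each name into the first group whose first member is similar enough."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group matched, so this name starts a new one.
            groups.append([name])
    return groups
```

For example, `group_duplicates(["John Smith", "Jon Smith", "Mary Jones"])` puts the two Smith spellings in one group and Mary Jones in another. An ML-based transform learns which differences matter from labeled examples instead of relying on a fixed threshold.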
- 53.
ML Transforms
- 54.
1. Create a new ML Transform
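Outside the console, creating the transform could be sketched with the Glue API as below, assuming boto3. The transform name, role ARN, table names, and primary key column are hypothetical; labeling and tuning happen in later steps.

```python
def build_find_matches_request(name, role_arn, database, table, primary_key,
                               precision_recall=0.5):
    """Parameters for glue.create_ml_transform with the FindMatches transform type."""
    if not 0.0 <= precision_recall <= 1.0:
        raise ValueError("precision_recall must be between 0.0 and 1.0")
    return {
        "Name": name,
        "Role": role_arn,
        "InputRecordTables": [{"DatabaseName": database, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": primary_key,
                # Values closer to 1.0 favor precision, closer to 0.0 favor recall.
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }


def create_transform(request):
    """Create the transform and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    return boto3.client("glue").create_ml_transform(**request)["TransformId"]
```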
- 55.
2.
- 56.
Fuzzy deduplication – Under the hood
- 57.
What next?
Recommended things to check out
- 58.
http://bit.ly/LFML-2019
Lake Formation Lab
- 59.
AWS Training and Certification
Resources created by the experts at AWS to propel your organization and career forward
Free foundational to advanced digital courses cover AWS services and teach architecting best practices
Classroom offerings, including Architecting on AWS, feature AWS expert instructors and hands-on labs
Validate expertise with the AWS Certified Solutions Architect - Associate or AWS Certified Solutions Architect - Professional exams
Visit aws.amazon.com/training/path-architecting/
- 60.
AWS Big Data blogs
Real deep, use-cases, step by step.
• “Use AWS Glue to run ETL jobs against non-native JDBC data sources”
• “Matching patient records with the AWS Lake Formation FindMatches transform”
• “Discovering metadata with AWS Lake Formation”
https://aws.amazon.com/blogs/?awsf.blog-master-category=category%23big-data
- 61.
Q&A
- 62.
Thank you!
Editor's Notes
- Data Lake
The new, the old
End to end pipeline
Lake formation
Security
Glue Blueprints
Lambda
ML
- Who doesn’t have an AWS account?
- Dark data
Schema on read means defined in a catalog
Parquet preferred
Compute managed, but options
- Business decisions were made around the Enterprise Data warehousing with BI tools.
Less relational, more diverse.
10x every 5 years
Who has access/what type of access
Datalake is the evolution of data warehousing.
- 1/ easy path to build a data lake and start running diverse analytics workloads,
2/ secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads,
3/ a fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open source and standard languages, engines, and platforms, and
4/ performance, the most scalability, and the lowest cost for analytics.
- analyze in a variety of ways with different engines
go beyond insights, from operational reporting on historical data, to being able to perform ML and real-time analytics => accurately predict future outcomes.
S3 to provide even more insight without the delays and cost from moving or transforming your data.
- Many mature companies with dark data, to startup companies running real time applications built on what they learn
- (How to Build a Data Lake.pptx)
- Spend more time here, ask what people are using
- Natively supported by big data frameworks (Spark, Hive, Presto, and others)
Decouple storage and compute
No need to run compute clusters for storage (unlike HDFS)
Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
Multiple & heterogeneous analysis clusters and services can use the same data
Designed for 99.999999999% durability
No need to pay for data replication within a region
Secure – SSL, client/server-side encryption at rest
- Encryptable
Hive compatible
- Mini-ETL with Create Table As Statement, Views, Workgroups, query JSON, catalog upgrades
- Who doesn’t love a good pie chart.
- I think that “other” might be designing their data storage structure.
As someone helping solve issues in customer datasets, I can tell you some people need to spend more time defining their partition structure and data generation size
- Mention security here how governments + medical companies are using it
There is a slide coming, detail in there, but just mention the security here
- Many exists, but still not simple enough.
Manual and time-consuming tasks such as loading data from diverse sources, monitoring these data flows, setting up partitions, turning on encryption and managing keys, re-organizing data into columnar format, and granting and auditing access.
days, not months.
enables secured self-service discovery and access for users
Aware of multiple analytics services,
easy on-demand access to specific resources that fit the processor and memory requirements of each analytics workload.
The data is curated and cataloged, already prepared for any flavor of analytics, and related records are matched and de-duplicated with machine learning.
Automation reduces the time it takes to get to answers when your data lake is built on top of AWS
- Lake Formation simplifies this manual process and automates many of the steps, allowing customers to set up a data lake in just a few clicks from a single, unified dashboard. This reduces the time to set up a data lake from months to days.
To eliminate silos, you need to build a data lake
automates many of the complex steps required to set up a data lake, reducing the time required to build a secure data lake from months to days.
Security control at the object level for our object storage (data lake storage) layer. Other cloud vendors only provide bucket level security control.
Deep integration across services that are needed to get answers from your data, including storage, compute, networking, and data movement. For example, Amazon EMR makes it easy to use EC2 Spot instances to save up to 90% on analytic workloads. Amazon Redshift allows you to query your S3 objects directly from your data warehouse.
A single security model across all analytic services. AWS Lake Formation provides a single way to control access to your data whether you are accessing that data from a data warehouse, a Spark cluster, or a serverless query technology.
Mature analytics services. Amazon EMR was first released in 2009 and Amazon Redshift first launched in 2013. Amazon S3 was one of the first AWS products and has been available since 2006. Tens of thousands of customers have data lakes on AWS and X exabytes of data is analyzed every day.
A single object storage layer that is compatible with all AWS analytics and machine learning services. Amazon S3 is our only object storage service, we do not have different versions of S3 and we do not have separate “data lake storage.”
5 storage tiers and intelligent tiering in Amazon S3, so customers are able to store more data at a lower cost and with less manual data lifecycle work than with any other cloud provider.
Amazon S3 and AWS managed services store customer data in independent data centers across three Availability Zones within a single AWS Region and can be configured to replicate data automatically between Regions regardless of storage class, providing a very high degree of fault tolerance and data durability out-of-the-box.
AWS analytics services provide best of breed performance. Amazon Redshift is 2x faster than the next most popular competitor and Amazon EMR runs Apache Spark workloads over 10x faster than open source Spark. Speed helps get to answers quickly and also helps keep costs down for complex analytics.
- AWS Lake Formation has an enhanced Data Catalog to enable users to record more metadata and Tags at Databases, Tables and Columns. All this data will be searchable.
- Good time to break?
- Database, table, column
- IAM is API based, but it isn’t designed for real-world access control.
On Glue Catalog we can grant resource level permissions, again, it is API based and doesn’t give granular enough access.
-
Register location till file level.
Need to give IAM role that has access to that location and trust Lake Formation
We update SLR policy. By default has permissions to List bucket, but as you register locations, we add additional permissions
- Data lake admin not the IAM Admin by default. Keep separate for security.
Data lake will not ignore IAM. Need to add themselves as DataLake admin.
In LF you would give permissions to a table, without having to give access to the bucket.
- Glue customer but not lake-formation customer, will have full access. Until location registered.
Tells Athena, Redshift and EMR to check LF when querying.
By default, no one has access. Grant permissions to any IAM users/roles.
All permissions are on Catalog objects. After registering, Athena won’t work.
Root is denied, shouldn’t access. Allowed on users/roles
- Takes in principal (user/role), then resource (db/table/column).
Table permissions grant/deny. Similar to databases, but differences.
Grant permissions are permissions to give permissions, for example for managers. Best practice for design.
Example: Athena query:
get-table
get-temporary-credentials
We don’t use
- Grant permissions on table, must specify database as well.
Athena/Redshift can do column level filtering. Glue can’t.
- Column level support.
Encryption on catalog
Service side security
- Relationship advice
Who here is building ETL with Glue?
Who uses Crawler?
Who uses Workflow?
- 1/ integrated
natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your VPC running on Amazon EC2.
2/ Cost Effective / Serverless: AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
3/ More Power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
- Console only feature
Why no S3 to S3?
- Endpoint is more secure, less latency.
Examples:
- AWS Lake Formation includes specialized ML-based dataset transformation algorithms customers can use to create their own ML Transforms. These include record de-duplication and match finding.
- 1/ Less Hassle: AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
2/ Cost Effective / Serverless: AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
3/ More Power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
- Go read the docs!!
- STORY BACKGROUND
Georgia-Pacific, owned by Koch Industries, is an American wood products, pulp, and paper company based in Atlanta, Georgia. The organization is one of the world’s largest manufacturers and distributors of pulp, towel and tissue paper and dispensers, packaging, and wood and gypsum building products.
They use an S3 data lake as part of an advanced analytics and ML solution to gain new insights, optimize processes, and maximize resources.
They now save millions annually by leveraging new insights to improve equipment failure predictions, run more production lines efficiently, and ensure high quality products.
https://aws.amazon.com/solutions/case-studies/georgia-pacific/
In the first six months, Georgia-Pacific transferred about 50 TB of production data—more than 500 billion records—from hundreds of large, complex manufacturing and converting-process machines. The company uses Amazon Kinesis to stream real-time data from manufacturing equipment to a central data lake based on Amazon Simple Storage Service (Amazon S3), allowing it to efficiently ingest and analyze structured and unstructured data at scale.
Georgia-Pacific knew it could learn from its structured and unstructured data, but the company lacked a cost-effective storage mechanism to ingest, transform, house, and analyze this data.
Georgia-Pacific uses Amazon Elastic MapReduce (Amazon EMR) to transform the data before delivering it in a structured fashion to data analysts through Amazon Redshift. The analysts use Amazon Athena on top of Amazon S3 to query the raw data, which includes information on pulping mechanisms, paper machines, converting lines, vibration trends, throughput, and paper quality.
Georgia-Pacific also uses Amazon SageMaker, an AWS machine-learning (ML) solution, to build, train, and deploy ML models at scale. Using ML models built with raw production data, Amazon SageMaker provides real-time feedback to machine operators regarding optimum machine speeds and other adjustable variables, enabling less experienced operators to detect breaks earlier and maintain quality.
- Sticky-note feedback.