© 2020, Amazon Web Services, Inc. or its Affiliates.
Cobus Bernard
Senior Developer Advocate
Amazon Web Services
AWS Lake Formation
Deep Dive
What’s coming up:
Bonus
• QR codes
• Data Lake refresher
• AWS Lake Formation & Related Services
• Use-case examples
• Going forward
• Q&A
Poll – Which of these are you using?
• S3 Lifecycle policy
• Glue ETL with ML
• Athena to query Relational Databases
• AWS Datalake
[Diagram: AWS services around the data lake]
Data Lake: Amazon S3 | AWS Glue
Data Movement (on-premises): AWS Direct Connect | AWS Snowball | AWS Snowmobile | AWS Database Migration Service
Data Movement (real-time): AWS IoT Core | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Kinesis Video Streams
Machine Learning: Amazon SageMaker | AWS Deep Learning AMIs | Amazon Rekognition | Amazon Lex | AWS DeepLens | Amazon Comprehend | Amazon Translate | Amazon Transcribe | Amazon Polly
Analytics: Amazon Athena | Amazon EMR | Amazon Redshift | Amazon Elasticsearch Service | Amazon Kinesis | Amazon QuickSight
Data Lake Refresher
Concepts & flow
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
Data Lake Definition
• All data in one place, a single source of truth
• Supports different formats: structured, semi-structured, unstructured, and raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
Data Lake Main Concepts
Traditional Data lake
Then:
- Enterprise Data Warehouse
- Batch-based ETL for BI analysis
- Dashboarding tools
Data silos to
[Diagram labels: ERP, CRM, LOB | DW Silo 1 | Business Intelligence | Devices, Web, Sensors, Social | DW Silo 2 | Business Intelligence]
Modern Data lake
Now:
- More data, less structured
- Accessed in various ways
- Real-time streaming
- Machine learning, scientific, analytics, regulatory requirements
- Low cost storage & analytics
[Diagram labels: OLTP, ERP, CRM, LOB | Data Warehouse | Business Intelligence | Devices, Web, Sensors, Social | Data Lake | Catalog | Machine Learning | DW Queries | Big data processing | Interactive | Real-time]
Why choose AWS for data lakes and analytics?
• Most secure infrastructure for analytics
• Most scalable and cost effective
• Easiest to build data lakes and analytics
• Most comprehensive and open
Moving to data lake architectures
Extends or evolves DW architectures
Store any data in any format
Durable, available, and exabyte scale
Secure, compliant, auditable
Run any type of analytics from DW to Predictive
[Diagram labels: Data Warehousing | Analytics | Machine Learning | Data lake]
More data lakes and analytics than anywhere else
Tens of thousands of data lakes run on AWS across all industries
Workflow of AWS services for big data
End-to-end Pipeline
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Simplified Data Pipeline
[Diagram: Data Sources | Ingest | Store (Amazon S3) | Catalog | Process & Analyze | Consume]
Multiple Data Sources
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest | Store (Amazon S3) | Catalog | Process & Analyze | Consume]
Amazon DynamoDB
Fully managed, multi-region, multi-master database
Nonrelational database that delivers reliable performance at any scale
Consistent single-digit millisecond latency
Built-in security, backup and restore, and in-memory caching
Supports Streams
Ingestion Options
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest (Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service) | Store (Amazon S3) | Catalog | Process & Analyze | Consume]
Amazon Kinesis Data Streams
• For technical developers
• Build your own custom applications that process or analyze streaming data
Amazon Kinesis Data Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch
Amazon Kinesis Data Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
Amazon Kinesis: Streaming Data Made Easy
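As an illustration of the Firehose path above, here is a minimal sketch of sending one JSON event to a delivery stream. The stream name `clickstream-to-s3` and the event fields are hypothetical, and the client passed to `send_event` is assumed to be something like `boto3.client("firehose")`; only the request payload is built here, so the shape can be inspected without an AWS account.

```python
import json


def build_firehose_record(event: dict) -> dict:
    """Build keyword arguments for the Firehose `put_record` call.

    Firehose delivers raw bytes, so each JSON event is newline-terminated
    to keep one record per line in the delivered objects.
    """
    payload = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    return {
        "DeliveryStreamName": "clickstream-to-s3",  # hypothetical stream name
        "Record": {"Data": payload},
    }


def send_event(firehose_client, event: dict) -> dict:
    """Send one event; `firehose_client` is assumed to be a boto3 Firehose client."""
    return firehose_client.put_record(**build_firehose_record(event))
```

Keeping the payload builder separate from the network call makes it easy to unit test the record format before pointing it at a real stream.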
Storage Layer
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest (Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service) | Store (Amazon S3) | Catalog | Process & Analyze | Consume]
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data (web sites, mobile apps, corporate applications, and IoT sensors) at any scale
Store data in the format you want:
Unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
Storage lifecycle integration
Amazon S3-Standard | Amazon S3-Infrequent Access | Amazon Glacier
Amazon S3 is the Base
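The storage lifecycle integration above can be sketched as a lifecycle rule that moves objects from S3 Standard to Infrequent Access and then to Glacier. This is a minimal sketch, not the presenter's configuration: the 30 and 90 day thresholds and the rule ID are assumptions, and `s3_client` is assumed to be something like `boto3.client("s3")`.

```python
def build_lifecycle_configuration(prefix, ia_days=30, glacier_days=90):
    """Return a LifecycleConfiguration dict for one prefix of a data lake bucket.

    The day thresholds are illustrative defaults, not recommendations.
    """
    if not 0 < ia_days < glacier_days:
        raise ValueError("expected 0 < ia_days < glacier_days")
    return {
        "Rules": [
            {
                # Hypothetical rule ID derived from the prefix
                "ID": "tier-down-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, prefix):
    """Apply the rule to a bucket; `s3_client` is assumed to be a boto3 S3 client."""
    return s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_configuration(prefix),
    )
```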
Data Discovery and Catalog
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest (Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service) | Store (Amazon S3) | Catalog (AWS Glue) | Process & Analyze | Consume]
Automatically discovers data and stores schema
Makes data searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless
AWS Glue - Serverless Data Catalog and ETL
Process and Analyze
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest (Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service) | Store (Amazon S3) | Catalog (AWS Glue) | Process & Analyze (Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch) | Consume]
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports multiple data formats - define schema on demand
Amazon Athena - Interactive Analysis
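To make the "standard SQL over S3" idea concrete, here is a minimal sketch of submitting a query to Athena. The database, table, and result location used in the usage note are hypothetical, and `athena_client` is assumed to be something like `boto3.client("athena")`; Athena runs queries asynchronously, so the call only returns an execution ID.

```python
def build_athena_query(database, sql, output_location):
    """Build keyword arguments for the Athena `start_query_execution` call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena_client, database, sql, output_location):
    """Submit the query and return its execution ID for later polling."""
    response = athena_client.start_query_execution(
        **build_athena_query(database, sql, output_location)
    )
    return response["QueryExecutionId"]
```

A call might look like `run_query(client, "weblogs_db", "SELECT status, COUNT(*) FROM requests GROUP BY status", "s3://my-athena-results/")`, where every name is a placeholder.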
Querying the Data Lake
[Diagram: Data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) | Ingest (Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service) | Store (Amazon S3) | Catalog (AWS Glue) | Process & Analyze (Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch) | Consume (BI Tools, Jupyter Notebooks, Amazon API Gateway, Amazon QuickSight)]
Amazon QuickSight
Supports a variety of data sources and targets
Fully managed and scalable
Super fast and easy to use
Cost-effective
Poll – Which do you struggle most with?
• Data migration/ingestion
• Data processing/ETL
• Data access security/auditing
• Cancelling your line with Telkom
Data preparation accounts for ~80% of the work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time-consuming
Lake Formation
Data lake infrastructure & management
S3/Glacier | AWS Glue | Lake Formation
Build a secure data lake in days
with AWS Lake Formation
Amazon S3 data lake storage
AWS Lake Formation
Simplified ingest and cleaning enables data
engineers to build faster
Centralized management of fine-grained permissions empowers security officers
Comprehensive set of integrated tools enables consistent user access
[Diagram labels: AWS Glue | Blueprints | ML Transforms | Data catalog | Access control]
Top features of Lake Formation
• Enhanced governance layer - security and governance layer at the Data Catalog level
• Blueprints / Data Importers - templates for ETL, metadata (schema) and partition management
• ML Transformations - ML algorithms that customers can use to create their own ML Transforms (e.g. record de-duplication)
• Enhanced Data Catalog - enables users to record more metadata and tag Data Catalog objects (e.g. databases, tables, columns)
Easier way to load data to the lake
Logs
DBs
Blueprints
One-shot
Incremental
How it works
Lake Formation
Security
Security permissions in Lake Formation
Search and view permissions granted
to a user, role, or group in one place
Verify permissions granted to a user
Easily revoke policies for a user
Security permissions in Lake Formation
Control data access with simple grant
and revoke permissions
Specify permissions on tables and
columns rather than on buckets and
objects
Easily view policies granted to a
particular user
Audit all data access in one place
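The grant and revoke model on tables and columns can be sketched with the Lake Formation API. This is a minimal sketch under assumptions: the role ARN, database, table, and column names used in the usage note are hypothetical, and `lf_client` is assumed to be something like `boto3.client("lakeformation")`.

```python
def column_permission_request(principal_arn, database, table, columns,
                              permissions=("SELECT",)):
    """Build the request shared by `grant_permissions` and `revoke_permissions`.

    The resource is a table restricted to specific columns, rather than an
    S3 bucket or object.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


def grant(lf_client, request):
    """Grant the permissions described by `request`."""
    return lf_client.grant_permissions(**request)


def revoke(lf_client, request):
    """Revoke exactly what an earlier `grant` call gave."""
    return lf_client.revoke_permissions(**request)
```

For example, `column_permission_request("arn:aws:iam::111122223333:role/analyst", "patients_db", "patients", ["first_name", "city"])` would describe SELECT access to two columns only; all of those identifiers are placeholders.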
Roles in a data lake project
IAM Administrator: provisions AWS resources and accounts, e.g.
- Create Data Lake Admin
- Create S3 bucket
- Create Roles/Policies
Data Lake Admin: manages data lake resources, e.g.
- Registers Data Lake
- Secure data lake/add access controls
Data Lake Developer/Analyst: accesses/queries the data lake, e.g.
- Query Data Lake
- Creates ML Transforms
- Writes ETL Jobs
- Visualization using Amazon QuickSight
Configuration and Execution Steps
Step | Activity | User
1 | Provision the data lake administrator, the data lake analyst/developer, and the data lake and Glue roles | IAM Administrator
2 | Register a data lake | Data Lake Administrator
3 | Assign Lake Formation permissions to data lake analyst | Data Lake Administrator
4 | Create AWS Glue database | Data Lake Administrator
5 | Crawl and catalog patient data in AWS Glue | Data Lake Analyst
6 | Assign table permissions to data lake analyst | Data Lake Administrator
7 | Observe the data pattern and duplicates in data using Amazon Athena | Data Lake Analyst
8 | Create, teach, and tune an AWS Lake Formation ML Transform | Data Lake Analyst
9 | Create an AWS Glue ETL job to use the ML Transform for data deduplication | Data Lake Analyst
10 | Catalog de-duplicated data and query using Amazon Athena | Data Lake Analyst
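Steps 2, 4, and 5 above (register the data lake location, create the Glue database, crawl the data) can be sketched as an ordered list of API calls. This is a sketch under assumptions, not the lab's actual scripts: the crawler naming scheme is made up, and `clients` is assumed to map service names to clients such as `boto3.client("lakeformation")` and `boto3.client("glue")`. In the walkthrough, different personas run different steps, so a real setup would use separate credentials per call.

```python
def data_lake_setup_calls(bucket_arn, database, crawler_role_arn, s3_path):
    """Return (service, method, kwargs) tuples for steps 2, 4, and 5."""
    crawler_name = database + "-crawler"  # hypothetical naming scheme
    return [
        # Step 2: register the S3 location with Lake Formation
        ("lakeformation", "register_resource",
         {"ResourceArn": bucket_arn, "UseServiceLinkedRole": True}),
        # Step 4: create the Glue database that will hold the tables
        ("glue", "create_database", {"DatabaseInput": {"Name": database}}),
        # Step 5: define and start a crawler to catalog the data
        ("glue", "create_crawler",
         {"Name": crawler_name, "Role": crawler_role_arn,
          "DatabaseName": database,
          "Targets": {"S3Targets": [{"Path": s3_path}]}}),
        ("glue", "start_crawler", {"Name": crawler_name}),
    ]


def run_calls(clients, calls):
    """Execute each call in order against the matching client."""
    for service, method, kwargs in calls:
        getattr(clients[service], method)(**kwargs)
```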
Secure once, access in multiple ways
Data Lake Storage
Data
Catalog
Access
Control
Lake Formation
Admin
Search and collaborate across multiple users
Text-based, faceted search across all metadata
Add attributes like data owners, stewards, and others as table properties
Add data sensitivity level, column definitions, and others as column properties
Text-based search and filtering
Query data in Amazon Athena
Enhanced Governance Layer
AWS Lake Formation provides a security and governance layer at the Data
Catalog level. Users can grant or revoke permissions to the Data Catalog
objects such as databases, tables and columns for IAM principals (IAM
users and roles). This functionality will be extended to row level access in
subsequent releases.
Lake Formation
Glue/Blueprints
AWS Glue
Less hassle
Integrated across AWS: supports Aurora, RDS, Redshift, S3, and common database engines in your VPC running on EC2
Serverless
Serverless: no infrastructure to provision or manage
More power
Automatically generates the code to execute your data transformations and loading processes
Simple, flexible, and cost-effective ETL & Data Catalog
Blueprints build on AWS Glue
Blueprints / Data Importers
Blueprints are templates for data ingestion, transformation, metadata
(schema) and partition management. Blueprints help customers to
quickly and easily build and maintain a data lake.
With blueprints
You
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
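Two of the blueprint behaviours listed above, partitioning and keeping track of already processed data, can be illustrated with a small pure-Python sketch. This is a conceptual illustration only and does not reflect how Lake Formation blueprints are implemented; the function names, the bookmark column, and the Hive-style `column=value` partition layout are assumptions.

```python
def incremental_load(source_rows, bookmark_column, last_bookmark=None):
    """Return rows newer than the bookmark, plus the updated bookmark.

    The bookmark is the highest value of `bookmark_column` seen so far,
    so a repeated run skips rows that were already loaded.
    """
    new_rows = [
        row for row in source_rows
        if last_bookmark is None or row[bookmark_column] > last_bookmark
    ]
    new_bookmark = max(
        (row[bookmark_column] for row in new_rows), default=last_bookmark
    )
    return new_rows, new_bookmark


def partition_path(base, row, partition_column):
    """Build a partitioned target location for one row."""
    return f"{base.rstrip('/')}/{partition_column}={row[partition_column]}/"
```

Running `incremental_load` twice over the same source with the bookmark carried forward returns an empty list the second time, which is the "one-shot versus incremental" distinction in miniature.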
Glue Features and Tips
• Use VPC endpoints for Glue, Athena, and S3
• Daisy-chain JDBC jobs to avoid startup pain
• Use Lambda for tricks that aren’t yet available
Lambda with Glue
• Automatically start an AWS Glue job when a file is uploaded to S3
• Use Lambda to auto-run a newly created Glue crawler
• Use CloudWatch with Lambda to receive SNS notices for specific Glue events
• Create and enable a Glue trigger in Lambda
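The first pattern above, starting a Glue job when a file lands in S3, can be sketched as a Lambda handler subscribed to S3 object-created notifications. The job name `clean-uploads` and the `--source_path` job argument are hypothetical, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS access.

```python
import urllib.parse

GLUE_JOB_NAME = "clean-uploads"  # hypothetical Glue job name


def job_arguments_from_event(event):
    """Extract the uploaded object's location from an S3 notification event."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications
    key = urllib.parse.unquote_plus(s3_info["object"]["key"])
    return {"--source_path": f"s3://{bucket}/{key}"}


def lambda_handler(event, context, glue_client=None):
    """Start the Glue job for the uploaded file and return the job run ID."""
    if glue_client is None:
        import boto3  # assumed available, as in the AWS Lambda Python runtime
        glue_client = boto3.client("glue")
    response = glue_client.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments=job_arguments_from_event(event),
    )
    return response["JobRunId"]
```

The same shape works for the crawler pattern by swapping the call for `start_crawler(Name=...)`.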
Lake Formation
ML Transformations
ML Transformations
Deduplication & Fuzzy Matching of your Data
ML Transforms
1. Create a new ML Transform
2.
Fuzzy deduplication – Under the hood
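To give a feel for what fuzzy deduplication means, here is a deliberately naive sketch that flags record pairs whose text is nearly identical. It uses plain string similarity from Python's standard library and is not the algorithm behind the Lake Formation FindMatches transform, which is trained from labelled examples; the 0.85 threshold and the sample records in the usage note are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs (i, j), i < j, whose records look like duplicates.

    Compares every pair, so this only suits small inputs.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

With records such as "Jon Smith, 12 Main St" and "John Smith, 12 Main St.", an exact-match comparison would treat them as different people, while a similarity score above the threshold pairs them up.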
What next?
Recommended things to check out
http://bit.ly/LFML-2019
Lake Formation Lab
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Free foundational to advanced digital courses cover AWS services and teach architecting best practices
AWS Training and Certification
Visit aws.amazon.com/training/path-architecting/
Classroom offerings, including Architecting on AWS, feature AWS expert instructors and hands-on labs
Validate expertise with the AWS Certified Solutions Architect - Associate or AWS Certified Solutions Architect - Professional exams
Resources created by the experts at AWS to propel your organization and career forward
AWS Big Data blogs
Deep dives, use cases, step by step.
• “Use AWS Glue to run ETL jobs against non-native JDBC data sources”
• “Matching patient records with the AWS Lake Formation FindMatches transform”
• “Discovering metadata with AWS Lake Formation”
https://aws.amazon.com/blogs/?awsf.blog-master-category=category%23big-data
Q&A
Thank you!

More Related Content

AWS Lake Formation Deep Dive

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. Cobus Bernard Senior Developer Advocate Amazon Web Services AWS Lake Formation Deep Dive
  • 2. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. What’s coming up: Bonus • QR codes • Data Lake refresher • AWS Lake Formation & Related Services • Use-case examples • Going forward • QA
  • 3. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Poll – Which of these are you using? • S3 Lifecycle policy • Glue ETL with ML • Athena to query Relational Databases • AWS Datalake Amazon S3 | AWS Glue AWS Direct Connect AWS Snowball AWS Snowmobile AWS Database Migration Service AWS IoT Core Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Amazon Kinesis Video Streams On-premises Data Movement Amazon SageMaker AWS Deep Learning AMIs Amazon Rekognition Amazon Lex AWS DeepLens Amazon Comprehend Amazon Translate Amazon Transcribe Amazon Polly Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Service Amazon Kinesis Amazon QuickSight Analytics Machine Learning Real-time Data Movement Data Lake
  • 4. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Data Lake Refresher Concepts & flow
  • 5. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. A data lake isa centralized repository that allows you to store all your structured and unstructured dataat any scale DataLakeDefinition
  • 6. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. • All data in one place, a single source of truth • Support Different Formats - structured/semi-structured/unstructured/raw data • Supports fast ingestion and consumption • Schema on read • Designed for low-cost storage • Decouples storage and compute • Supports protection and security rules DataLakeMainConcepts
  • 7. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Traditional Data lake Then: - Enterprise Data Warehouse - Batched based ETL for BI analysis - Dashboarding tools Data silos to ERP CRM LOB DW Silo 1 Business Intelligence Devices Web Sensors Social DW Silo 2 Business Intelligence
  • 8. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Modern Data lake Now: - More data, less structured - Accessed in various ways - Real time-streaming - Machine learning, scientific, analytics, regulatory requirements - Low cost storage & analytics OLTP ERP CRM LOB DataWarehouse Business Intelligence Data Lake 10011000010010101110010101 011100101010000101111101101 0 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big data processing Interactive Real-time
  • 9. © 2020, Amazon Web Services, Inc. or its Affiliates. Why choose AWS for data lakes and analytics? Most secure infrastructure for analytics Most scalable and cost effective Easiest to build data lakes and analytics Most comprehensive and open 1 2 3 4
  • 10. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Moving to data lake architectures Extends or evolves DW architectures Store any data in any format Durable, available, and exabyte scale Secure, compliant, auditable Run any type of analytics from DW to Predictive Data Warehousing Analytics Machine Learning Data lake
  • 11. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. More data lakes and analytics than anywhere else Tens of thousands of data lakes run on AWS across all industries
  • 12. © 2020, Amazon Web Services, Inc. or its Affiliates. Workflow of AWS services for big data End-to-end Pipeline
  • 13. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Simplified Data Pipeline Data Sources Ingest Process & Analyze Consume Amazon S3 Catalog Store Amazon S3 Store
  • 14. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Multiple Data Sources Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices Ingest Process & Analyze Consume Amazon S3 Catalog Store Amazon S3 Store
  • 15. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Amazon DynamoDB Fully managed, multi-region, multi-master database Nonrelational database that delivers reliable performance at any scale Consistent single-digit millisecond latency Built-in security, backup and restore, in-memory Caching Support Streams
  • 16. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Process & Analyze Consume Ingestion Options Ingest Amazon Kinesis AWS Snowball Amazon MSK Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices Database Migration Service Catalog Store Amazon S3 Store
  • 17. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Amazon Kinesis Data Streams • For technical developers • Build your own custom applications that process or analyze streaming data Amazon Kinesis Data Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Amazon Kinesis Data Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries Amazon Kinesis:StreamingData MadeEasy
  • 18. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Storage Layer Process & Analyze Consume Catalog IngestIngest Amazon Kinesis AWS Snowball Amazon MSK Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices Database Migration Service Amazon S3 Store Amazon S3
  • 19. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Secure, highly scalable, durable object storage with millisecond latency for data access Store any type of data–web sites, mobile apps, corporate applications, and IoT sensors, at any scale Store data in the format you want: Unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet) Storage lifecycle integration Amazon S3-Standard |Amazon S3-Infrequent Access | Amazon Glacier AmazonS3is theBase
  • 20. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Store Data Discovery and Catalog Amazon S3 Process & Analyze Consume Catalog AWS Glue IngestIngest Amazon Kinesis AWS Snowball Amazon MSK Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices Database Migration Service Store Amazon S3
  • 21. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Automatically discovers data and stores schema Data searchable, and available for ETL Generates customizable code Schedules and runs your ETL jobs Serverless AWSGlue -ServerlessDataCatalogand ETL
  • 22. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Ingest Consume Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Store Amazon S3 Process & Analyze Process and Analyze Ingest Amazon Kinesis AWS Snowball Amazon MSK Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices Database Migration Service Catalog AWS Glue
  • 23. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Supports Multiple Data Formats – Define Schema on Demand AmazonAthena -InteractiveAnalysis
  • 24. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Ingest Consume Amazon Kinesis BITools Querying the Data Lake Database Migration Service AWS Snowball Amazon MSK Amazon Athena Amazon EMR Amazon Redshift Amazon Elasticsearch Process & Analyze Jupyter Notebooks Amazon API Gateway Amazon QuickSight Catalog AWS Glue Store Amazon S3 Store Amazon S3 Data sources Amazon DynamoDB Web logs / cookies ERP Connected devices
  • 25. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Amazon QuickSight Supports variety of Data source andTargets Fully managed and scalable Super fast and easy to use Cost-effective
  • 26. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Poll – Which do you struggle most with? • Data migration/ingestion • Data processing/ETL • Data access security/auditing • Cancelling your line with Telkom
  • 27. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Data preparation accounts for ~80% of the work Building training sets Cleaning and organizing data Collecting data sets Mining data for patterns Refining algorithms Other
  • 28. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Sample of steps required Find sources Create Amazon Simple Storage Service (Amazon S3) locations Configure access policies Map tables to Amazon S3 locations ETL jobs to load and clean data Create metadata access policies Configure access from analytics services Rinse and repeat for other: data sets, users, and end-services And more: manage and monitor ETL jobs update metadata catalog as data changes update policies across services as users and permissions change manually maintain cleansing scripts create audit processes for compliance … Manual | Error-prone |Time consuming
  • 29. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Lake Formation Data lake infrastructure & management S3/Glacier AWS GlueLake Formation
  • 30. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Lake Formation Data lake infrastructure & management S3/Glacier AWS GlueLake Formation
  • 31. © 2020, Amazon Web Services, Inc. or its Affiliates. Build a secure data lake in days with AWS Lake Formation Amazon S3 data lake storage AWS Lake Formation Simplified ingest and cleaning enables data engineers to build faster Centralized management of fine-grained permissions empower security officers Comprehensive set of integrated tools enable user access consistently AWS Glue Blueprints ML Transforms Data catalog Access control
  • 32. © 2020, Amazon Web Services, Inc. or its Affiliates.© 2020, Amazon Web Services, Inc. or its Affiliates. Top features of Lake Formation • Enhanced governance layer - security and governance layer at the Data Catalog level • Blueprints / Data Importers - templates for ETL, metadata (schema) and partition management • ML Transformations – ML algorithms that customers can use to create their own ML Transforms (i.e. record de-duplication) • Enhanced Data Catalog - enable users to record more metadata and tag Data Catalog objects (i.e. databases, tables, columns)
  • 33. © 2020, Amazon Web Services, Inc. or its Affiliates. Easier way to load data to the lake Logs DBs Blueprints One-shot Incremental
  • 34. How it works
  • 35. How it works
  • 36. Lake Formation Security
  • 37. Security permissions in Lake Formation: search and view permissions granted to a user, role, or group in one place; verify permissions granted to a user; easily revoke policies for a user.
  • 38. Security permissions in Lake Formation: control data access with simple grant and revoke permissions; specify permissions on tables and columns rather than on buckets and objects; easily view policies granted to a particular user; audit all data access in one place.
  • 39. Roles in a data lake project
      - IAM Administrator: provisions AWS resources and accounts, e.g. create Data Lake Admin, create S3 bucket, create roles/policies
      - Data Lake Admin: manages data lake resources, e.g. registers the data lake, secures the data lake/adds access controls
      - Data Lake Developer/Analyst: accesses/queries the data lake, e.g. queries the data lake, creates ML Transforms, writes ETL jobs, visualization using Amazon QuickSight
  • 40. Configuration and Execution Steps (step, activity, [user])
      1. Provision the following: data lake administrator, data lake analyst/developer, data lake and Glue roles [IAM Administrator]
      2. Register a data lake [Data Lake Administrator]
      3. Assign Lake Formation permissions to the data lake analyst [Data Lake Administrator]
      4. Create an AWS Glue database [Data Lake Administrator]
      5. Crawl and catalog patient data in AWS Glue [Data Lake Analyst]
      6. Assign table permissions to the data lake analyst [Data Lake Administrator]
      7. Observe the data pattern and duplicates in the data using Amazon Athena [Data Lake Analyst]
      8. Create, teach and tune an AWS Lake Formation ML Transform [Data Lake Analyst]
      9. Create an AWS Glue ETL job to use the ML Transform for data deduplication [Data Lake Analyst]
      10. Catalog the de-duplicated data and query it using Amazon Athena [Data Lake Analyst]
  • 41. Secure once, access in multiple ways. Data Lake Storage | Data Catalog | Access Control | Lake Formation Admin
  • 42. Search and collaborate across multiple users: text-based, faceted search across all metadata; add attributes like data owners, stewards, and others as table properties; add data sensitivity level, column definitions, and others as column properties; text-based search and filtering; query data in Amazon Athena.
  • 43. Enhanced Governance Layer: AWS Lake Formation provides a security and governance layer at the Data Catalog level. Users can grant or revoke permissions to the Data Catalog objects such as databases, tables and columns for IAM principals (IAM users and roles). This functionality will be extended to row-level access in subsequent releases.
  • 44. Lake Formation Glue/Blueprints
  • 45. AWS Glue: simple, flexible, and cost-effective ETL & Data Catalog. Less hassle: integrated across AWS, supports Aurora, RDS, Redshift, S3, and common database engines in your VPC running on EC2. Serverless: no infrastructure to provision or manage. More power: automatically generates the code to execute your data transformations and loading processes.
  • 46. Blueprints build on AWS Glue
  • 47. Blueprints / Data Importers: Blueprints are templates for data ingestion, transformation, metadata (schema) and partition management. Blueprints help customers to quickly and easily build and maintain a data lake.
  • 48. With blueprints
      You:
      1. Point us to the source
      2. Tell us the location to load to in your data lake
      3. Specify how often you want to load the data
      Blueprints:
      1. Discover the source table(s) schema
      2. Automatically convert to the target data format
      3. Automatically partition the data based on the partitioning schema
      4. Keep track of data that was already processed
      5. You can customize any of the above
  • 49. Glue Features and Tips
      - Use VPC endpoints for Glue, Athena and S3
      - Daisy-chain JDBC jobs to avoid startup pain
      - Use Lambda for tricks that aren’t yet available
  • 50. Lambda with Glue
      - Automatically start an AWS Glue job when a file is uploaded to S3
      - Use Lambda to auto-run a newly created Glue crawler
      - Use CloudWatch with Lambda to receive SNS notices for specific Glue events
      - Create and enable a Glue trigger in Lambda
  • 51. Lake Formation ML Transformations
  • 52. ML Transformations: deduplication & fuzzy matching of your data
  • 53. ML Transforms
  • 54. 1. Create a new ML Transform
  • 55. 2.
  • 56. Fuzzy deduplication: under the hood
  • 57. What next? Recommended things to check out
  • 58. Lake Formation Lab: http://bit.ly/LFML-2019
  • 59. AWS Training and Certification: resources created by the experts at AWS to propel your organization and career forward. Free foundational to advanced digital courses cover AWS services and teach architecting best practices. Classroom offerings, including Architecting on AWS, feature AWS expert instructors and hands-on labs. Validate expertise with the AWS Certified Solutions Architect - Associate or AWS Certified Solutions Architect - Professional exams. Visit aws.amazon.com/training/path-architecting/
  • 60. AWS Big Data blogs: real deep, use cases, step by step.
      - “Use AWS Glue to run ETL jobs against non-native JDBC data sources”
      - “Matching patient records with the AWS Lake Formation FindMatches transform”
      - “Discovering metadata with AWS Lake Formation”
      https://aws.amazon.com/blogs/?awsf.blog-master-category=category%23big-data
  • 61. Q&A
  • 62. Thank you!
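Slide 48 says a blueprint keeps track of data that was already processed, and slide 33 distinguishes one-shot from incremental loads. The idea can be sketched in plain Python, assuming the source has a bookmark column whose values only increase. The names `incremental_load` and `bookmark_key` and the row layout are hypothetical; this illustrates the concept and is not how Lake Formation implements blueprints internally.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


def incremental_load(
    rows: Iterable[Row],
    bookmark_key: str,
    last_bookmark: Optional[Any] = None,
) -> Tuple[List[Row], Optional[Any]]:
    """Return the rows newer than last_bookmark plus the updated bookmark.

    Assumes bookmark_key holds a value that only ever increases,
    for example an auto-increment id or an insert timestamp.
    """
    new_rows = [
        row for row in rows
        if last_bookmark is None or row[bookmark_key] > last_bookmark
    ]
    if not new_rows:
        # Nothing new: keep the previous bookmark unchanged.
        return [], last_bookmark
    return new_rows, max(row[bookmark_key] for row in new_rows)


source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# One-shot load: no bookmark yet, so every row is picked up.
first_batch, bookmark = incremental_load(source, "id")

# Incremental load: only rows added since the stored bookmark are picked up.
source.append({"id": 3, "name": "c"})
second_batch, bookmark = incremental_load(source, "id", bookmark)
```

A real pipeline would persist the bookmark between runs; here it is simply passed back in.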
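The first bullet on slide 50, starting an AWS Glue job when a file is uploaded to S3, could look roughly like the Lambda handler below. It is a sketch under assumptions: the job name `my-etl-job` and the `--source_path` job argument are hypothetical, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS credentials. The Glue `start_job_run` call and the S3 event layout are the standard ones.

```python
import urllib.parse
from typing import Any, Dict, List, Optional

GLUE_JOB_NAME = "my-etl-job"  # hypothetical job name, replace with your own


def lambda_handler(
    event: Dict[str, Any],
    context: Any,
    glue_client: Optional[Any] = None,
) -> List[str]:
    """Start one Glue job run per object listed in an S3 event notification."""
    if glue_client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical argument name; the Glue script decides what it reads.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

The function would be wired to the bucket through an S3 event notification, and its execution role would need permission to start the Glue job.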
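Slides 52 and 56 mention fuzzy matching and deduplication without showing detail. To give a feel for what fuzzy deduplication means, here is a deliberately simple stand-in that compares records with plain string similarity from the Python standard library. It is not the FindMatches algorithm, which is trained from labelled examples; the field names, the sample records and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Record = Dict[str, str]


def similarity(a: Record, b: Record, fields: List[str]) -> float:
    """Average string similarity (0.0 to 1.0) over the chosen fields."""
    scores = [
        SequenceMatcher(None, a[f].lower().strip(), b[f].lower().strip()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def deduplicate(records: List[Record], fields: List[str], threshold: float = 0.85) -> List[Record]:
    """Keep the first record of every group of near-duplicates."""
    kept: List[Record] = []
    for record in records:
        if not any(similarity(record, k, fields) >= threshold for k in kept):
            kept.append(record)
    return kept


patients = [
    {"name": "John Smith", "city": "Cape Town"},
    {"name": "Jon Smith", "city": "Cape Town"},   # likely the same person
    {"name": "Mary Jones", "city": "Durban"},
]
unique = deduplicate(patients, ["name", "city"])
```

This pairwise comparison is quadratic in the number of records, which is one reason a managed, distributed transform is attractive for real data sets.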

Editor's Notes

  1. Data Lake: the new, the old. End-to-end pipeline. Lake Formation. Security. Glue. Blueprints. Lambda. ML.
  2. Who doesn’t have an AWS account?
  3. Dark data. Schema on read means the schema is defined in a catalog. Parquet preferred. Compute is managed, but there are options.
  4. Business decisions were made around enterprise data warehousing with BI tools. Less relational, more diverse. 10x every 5 years. Who has access/what type of access.
  5. Business decisions were made around enterprise data warehousing with BI tools. Less relational, more diverse. 10x every 5 years. Who has access/what type of access. The data lake is the evolution of data warehousing.
  6. 1/ an easy path to build a data lake and start running diverse analytics workloads, 2/ secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads, 3/ a fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open source and standard languages, engines, and platforms, and 4/ performance, the most scalability, and the lowest cost for analytics.
  7. Analyze in a variety of ways with different engines. Go beyond insights: from operational reporting on historical data to being able to perform ML and real-time analytics => accurately predict future outcomes. S3 to provide even more insight without the delays and cost of moving or transforming your data.
  8. From mature companies with dark data to startup companies running real-time applications built on what they learn.
  9. (How to Build a Data Lake.pptx)
  10. Spend more time here, ask what people are using
  11. Natively supported by big data frameworks (Spark, Hive, Presto, and others). Decouple storage and compute: no need to run compute clusters for storage (unlike HDFS); can run transient Amazon EMR clusters with Amazon EC2 Spot Instances; multiple and heterogeneous analysis clusters and services can use the same data. Designed for 99.999999999% durability. No need to pay for data replication within a Region. Secure: SSL, client/server-side encryption at rest.
  12. Encryptable. Hive compatible.
  13. Mini-ETL with Create Table As Select (CTAS), views, workgroups, querying JSON, catalog upgrades.
  14. Who doesn’t love a good pie chart?
  15. I think that “other” might be designing their data storage structure. As someone helping solve issues in customer datasets, I can tell you some people need to spend more time defining their partition structure and data generation size
  16. Mention security here: how governments and medical companies are using it. There is a slide coming with the detail, so just mention the security here.
  17. Many exist, but they are still not simple enough. Manual and time-consuming tasks such as loading data from diverse sources, monitoring these data flows, setting up partitions, turning on encryption and managing keys, re-organizing data into columnar format, and granting and auditing access. Days, not months. Enables secured self-service discovery and access for users. Aware of multiple analytics services, with easy on-demand access to specific resources that fit the processor and memory requirements of each analytics workload. The data is curated and cataloged, already prepared for any flavor of analytics, and related records are matched and de-duplicated with machine learning. Automation reduces the time it takes to get to answers when your data lake is built on top of AWS.
  18. Lake Formation simplifies this manual process and automates many of the steps, allowing customers to set up a data lake in just a few clicks from a single, unified dashboard. This reduces the time to set up a data lake from months to days. To eliminate silos, you need to build a data lake. Lake Formation automates many of the complex steps required to set up a data lake, reducing the time required to build a secure data lake from months to days. Security control at the object level for our object storage (data lake storage) layer. Other cloud vendors only provide bucket-level security control. Deep integration across services that are needed to get answers from your data, including storage, compute, networking, and data movement. For example, Amazon EMR makes it easy to use EC2 Spot Instances to save up to 90% on analytic workloads. Amazon Redshift allows you to query your S3 objects directly from your data warehouse. A single security model across all analytic services. AWS Lake Formation provides a single way to control access to your data whether you are accessing that data from a data warehouse, a Spark cluster, or a serverless query technology. Mature analytics services. Amazon EMR was first released in 2009 and Amazon Redshift first launched in 2013. Amazon S3 was one of the first AWS products and has been available since 2006. Tens of thousands of customers have data lakes on AWS and X exabytes of data is analyzed every day. A single object storage layer that is compatible with all AWS analytics and machine learning services. Amazon S3 is our only object storage service; we do not have different versions of S3 and we do not have separate “data lake storage.” 5 storage tiers and intelligent tiering in Amazon S3, so you are able to store more data at a lower cost and with less manual data lifecycle work than with any other cloud provider.
Amazon S3 and AWS managed services store customer data in independent data centers across three Availability Zones within a single AWS Region and can replicate data between Regions regardless of storage class, providing a very high degree of fault tolerance and data durability out of the box. AWS analytics services provide best-of-breed performance. Amazon Redshift is 2x faster than the next most popular competitor and Amazon EMR runs Apache Spark workloads over 10x faster than open source Spark. Speed helps get to answers quickly and also helps keep costs down for complex analytics.
  19. AWS Lake Formation has an enhanced Data Catalog to enable users to record more metadata and Tags at Databases, Tables and Columns. All this data will be searchable. 
  20. Good time to break?
  21. Database, table, column
  22. IAM is API-based, but it isn’t designed for real-world data access control. On the Glue Data Catalog we can grant resource-level permissions, but again, that is API-based and doesn’t give granular enough access.
  23. Register a location down to the file level. You need to provide an IAM role that has access to that location and trusts Lake Formation. We update the service-linked role (SLR) policy. By default it has permission to list the bucket, but as you register locations, we add additional permissions.
  24. The data lake admin is not the IAM admin by default. Keep them separate for security. The data lake will not ignore IAM. IAM admins need to add themselves as data lake admin. In Lake Formation you would give permissions to a table, without having to give access to the bucket.
  25. A Glue customer who is not a Lake Formation customer will have full access, until the location is registered. Registration tells Athena, Redshift and EMR to check Lake Formation when querying. By default, no one has access. Grant permissions to any IAM users/roles. All permissions are on Catalog objects. After registering, Athena won’t work. Root is denied and shouldn’t access. Allowed on users/roles.
  26. Takes in a principal (user/role), then a resource (db/table/column). Table permissions grant/deny. Similar to databases, but with differences. Grant permissions are permissions to give permissions, for example for managers. Best practice for design. Example: Athena query: get-table get-temporary-credentials We don’t use
  27. To grant permissions on a table, you must specify the database as well. Athena/Redshift can do column-level filtering. Glue can’t.
  28. Column-level support. Encryption on the catalog. Service-side security.
  29. Relationship advice. Who here is building ETL with Glue? Who uses Crawler? Who uses Workflow?
  30. 1/ Integrated: AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases running on Amazon EC2. 2/ Cost Effective / Serverless: AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running. 3/ More Power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
  31. Console-only feature. Why no S3 to S3?
  32. Endpoint is more secure, less latency. Examples:
  33. AWS Lake Formation includes specialized ML-based dataset transformation algorithms customers can use to create their own ML Transforms. These include record de-duplication and match finding.
  34. 1/ Less Hassle: AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. 2/ Cost Effective / Serverless: AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running. 3/ More Power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
  35. Go read the docs!!
  36. STORY BACKGROUND Georgia-Pacific, owned by Koch Industries, is an American wood products, pulp, and paper company based in Atlanta, Georgia. The organization is one of the world’s largest manufacturers and distributors of pulp, towel and tissue paper and dispensers, packaging, and wood and gypsum building products. They use an S3 data lake as part of an advanced analytics and ML solution to gain new insights, optimize processes, and maximize resources. They now save millions annually by leveraging new insights to improve equipment failure predictions, run more production lines efficiently, and ensure high quality products. https://aws.amazon.com/solutions/case-studies/georgia-pacific/ In the first six months, Georgia-Pacific transferred about 50 TB of production data—more than 500 billion records—from hundreds of large, complex manufacturing and converting-process machines. The company uses Amazon Kinesis to stream real-time data from manufacturing equipment to a central data lake based on Amazon Simple Storage Service (Amazon S3), allowing it to efficiently ingest and analyze structured and unstructured data at scale. Georgia-Pacific knew it could learn from its structured and unstructured data, but the company lacked a cost-effective storage mechanism to ingest, transform, house, and analyze this data.   Georgia-Pacific uses Amazon Elastic MapReduce (Amazon EMR) to transform the data before delivering it in a structured fashion to data analysts through Amazon Redshift. The analysts use Amazon Athena on top of Amazon S3 to query the raw data, which includes information on pulping mechanisms, paper machines, converting lines, vibration trends, throughput, and paper quality. Georgia-Pacific also uses Amazon SageMaker, an AWS machine-learning (ML) solution, to build, train, and deploy ML models at scale. 
Using ML models built with raw production data, Amazon SageMaker provides real-time feedback to machine operators regarding optimum machine speeds and other adjustable variables, enabling less experienced operators to detect breaks earlier and maintain quality.
  37. Sticky-note feedback.
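Note 23 describes registering an S3 location with Lake Formation, either with a custom IAM role that trusts Lake Formation or with the service-linked role. A hedged sketch of what the request for the `RegisterResource` API might look like follows. The helper `build_register_request`, the bucket name and the prefix are hypothetical; the parameter names `ResourceArn`, `RoleArn` and `UseServiceLinkedRole` are those of the boto3 `register_resource` call.

```python
from typing import Any, Dict, Optional


def build_register_request(
    bucket: str,
    prefix: str = "",
    role_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation RegisterResource call."""
    resource_arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        resource_arn += "/" + prefix.strip("/")
    if role_arn:
        # Custom role: it must be able to access the location and trust Lake Formation.
        return {"ResourceArn": resource_arn, "RoleArn": role_arn}
    # Otherwise use the service-linked role that Lake Formation manages.
    return {"ResourceArn": resource_arn, "UseServiceLinkedRole": True}


request = build_register_request("my-data-lake-bucket", "raw/patients/")

# With AWS credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("lakeformation").register_resource(**request)
```

Building the request separately from sending it keeps the example runnable offline and makes the two role options easy to compare.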
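Notes 26 and 27 describe a grant as a principal plus a resource (the database must be named along with the table), optionally limited to columns, and mention grant permissions as the right to pass permissions on. The sketch below builds such a request for the boto3 `grant_permissions` call. The helper name, the example account ID, role, database, table and column names are all hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


def build_grant_request(
    principal_arn: str,
    database: str,
    table: str,
    columns: Optional[Sequence[str]] = None,
    grantable: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for a SELECT grant in Lake Formation."""
    if columns:
        # Column-level grant: only the listed columns become visible.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}

    request: Dict[str, Any] = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }
    if grantable:
        # Lets the principal grant SELECT to others, e.g. a manager.
        request["PermissionsWithGrantOption"] = ["SELECT"]
    return request


analyst_grant = build_grant_request(
    "arn:aws:iam::111122223333:role/DataLakeAnalyst",  # hypothetical role
    database="patients_db",
    table="patients",
    columns=["patient_id", "city"],
)

# With AWS credentials configured, the grant could be applied with:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**analyst_grant)
```

Revoking works on the same principal and resource shape, which is what makes permissions easy to review in one place.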