What's New with Big Data
Analytics
Jia-Ren Lin, Support Engineer, AWS
Agenda
• Real-time data
• Introducing Amazon Managed Streaming for Kafka (Amazon MSK)
• Comparing Amazon MSK with Amazon Kinesis Data Streams
• Why did we build AWS Lake Formation?
• What is AWS Lake Formation?
• How does AWS Lake Formation help you?
Data is produced continuously
Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
The diminishing value of data over time
Challenges operating Apache Kafka
Difficult to set up
Hard to achieve high availability
Tricky to scale
AWS integrations = development
No console, no visible metrics
f(Kafka usage) = Σ (n = 1 → ∞) SREs
Introducing Amazon Managed Streaming for
Kafka (Amazon MSK)
A fully managed, highly available, and secure service for Apache Kafka
Now available in public preview in the US East (N. Virginia) Region
Getting started with Amazon MSK is easy
• Fully compatible with Apache Kafka v1.1.1
• AWS Management Console and AWS API for provisioning
• Clusters are set up automatically
• Provision Apache Kafka brokers and storage
• Create and tear down clusters on-demand
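As a sketch, the provisioning steps above map onto the boto3 `kafka` control-plane client. The subnet IDs and sizing values below are illustrative assumptions, and the live API call is guarded so the request builder can be inspected without AWS credentials:

```python
# Sketch: provisioning an Amazon MSK cluster via the control-plane API.
# Subnet IDs, volume size, and cluster name are placeholder assumptions.

def msk_cluster_request(name, client_subnets, num_brokers=3,
                        kafka_version="1.1.1", broker_type="kafka.m5.large",
                        volume_gib=100):
    """Build the create_cluster request for the MSK control-plane API."""
    return {
        "ClusterName": name,
        "KafkaVersion": kafka_version,
        "NumberOfBrokerNodes": num_brokers,  # spread across the subnets' AZs
        "BrokerNodeGroupInfo": {
            "InstanceType": broker_type,
            "ClientSubnets": client_subnets,
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": volume_gib}},
        },
    }

if __name__ == "__main__":
    import boto3
    kafka = boto3.client("kafka")  # the MSK control-plane client
    resp = kafka.create_cluster(
        **msk_cluster_request("demo-cluster",
                              ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]))
    print(resp["ClusterArn"])
```

Tearing the cluster down again is the matching `delete_cluster` call on the returned ARN.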
Automation drives higher availability
@ Preview
• Cluster lifecycle is fully automated
• Brokers and Apache Zookeeper nodes auto-heal
• IPs remain intact
• Patches are applied automatically
@ GA
• Service level agreement (SLA)
• Apache Kafka version upgrades
Scalability and configurability
@ GA
• Scale a cluster
• Horizontally (add more of the same)
• Vertically (add larger brokers)
• Supports Apache Kafka partition reassignment tooling
• Define custom cluster configurations
• Auto scale storage
Deeply integrated with AWS services
@ Preview
• Amazon Virtual Private Cloud (Amazon VPC) for network isolation
• AWS Key Management Service (AWS KMS) for at-rest encryption
• AWS Identity and Access Management (IAM) for control-plane API control
• Amazon CloudWatch for Apache Kafka broker, topic, and ZK metrics
• Amazon Elastic Compute Cloud (Amazon EC2) M5 instances as brokers
• Amazon EBS GP2 broker storage
• Offered in the US East (N. Virginia) AWS Region
@ GA
• Tagging
• AWS CloudTrail
• AWS CloudFormation
• Offered worldwide
What Amazon MSK does for you
• Makes Apache Kafka more accessible to your organization
• Drives best practices through design, defaults, and automation
• Allows developers to focus more on app development, less on infrastructure
management
• Amazon MSK is committed to improving open-source Apache Kafka
f(Kafka usage) = Σ (n = 1 → ∞) Streaming Apps
How it works
How pricing works
• On-demand, hourly pricing prorated to the second
• Broker and storage pricing
• Broker pricing starts with kafka.m5.large @ $0.21/hr
• Storage pricing is $0.10 per GB-month
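Using the rates above, a rough monthly estimate for a small cluster (the broker count and storage size are illustrative; 730 hours approximates one month):

```python
# Rough monthly cost estimate from the on-demand rates quoted above.
BROKER_RATE_HR = 0.21      # kafka.m5.large, USD per broker-hour
STORAGE_RATE_GB_MO = 0.10  # USD per GB-month

def monthly_estimate(num_brokers, storage_gb_per_broker, hours=730):
    """Brokers are billed per hour; storage per GB-month."""
    broker_cost = num_brokers * BROKER_RATE_HR * hours
    storage_cost = num_brokers * storage_gb_per_broker * STORAGE_RATE_GB_MO
    return round(broker_cost + storage_cost, 2)

# 3 brokers with 100 GB each: 3 * 0.21 * 730 + 300 * 0.10 = 459.90 + 30.00
print(monthly_estimate(3, 100))  # → 489.9
```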
Comparing Amazon Kinesis Data Streams to MSK
[Diagram: side by side, an Amazon Kinesis data stream with 3 shards and an Amazon MSK topic with 3 partitions. In both, writes from producers are appended in order within each shard/partition, from oldest to newest data.]
Comparing Amazon Kinesis Data Streams to MSK
Amazon Kinesis Data Streams
• AWS API experience
• Throughput provisioning model
• Seamless scaling
• Typically lower costs
• Deep AWS integrations
Amazon MSK
• Open-source compatibility
• Strong third-party tooling
• Cluster provisioning model
• Apache Kafka scaling isn’t seamless to clients
• Raw performance
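The API-experience difference shows up directly in producer code. As a sketch, the same write expressed against each service (stream, topic, and endpoint names are placeholders; the Kafka side assumes an open-source client such as the third-party kafka-python library):

```python
# Sketch: one record written via the Kinesis AWS API vs. a Kafka client.
import json

def encode(record: dict) -> bytes:
    """Both APIs take bytes; serialize records as UTF-8 JSON."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def kinesis_put(client, stream, record, partition_key):
    # AWS API experience: records are routed to a shard by PartitionKey,
    # and throughput is provisioned per shard.
    return client.put_record(StreamName=stream, Data=encode(record),
                             PartitionKey=partition_key)

def kafka_send(producer, topic, record, key):
    # Open-source compatibility: any Kafka client works against MSK;
    # records are routed to a partition of the topic by key.
    return producer.send(topic, value=encode(record), key=key.encode())

if __name__ == "__main__":
    import boto3
    kinesis_put(boto3.client("kinesis"), "clickstream",
                {"page": "/home"}, partition_key="user-42")
```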
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
The concept of a Data Lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
Typical steps of building a data lake
• Set up storage
• Move data
• Cleanse, prep, and catalog data
• Configure and enforce security and compliance policies
• Make data available for analytics
Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time consuming
Data lake on AWS
[Architecture diagram, summarized:]
• Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
• Central Storage: scalable, secure, cost-effective (Amazon S3)
• Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Manage & Secure: AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch
• Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
How it works
Register existing data or import new
Amazon S3 forms the storage layer
for Lake Formation
Register existing S3 buckets that
contain your data
Ask Lake Formation to create
required S3 buckets and import data
into them
Data is stored in your account. You
have direct access to it. No lock-in.
[Diagram: Lake Formation components — data lake storage, data catalog, access control, data import, crawlers, and ML-based data prep.]
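Registering an existing bucket can be sketched with the boto3 `lakeformation` client (the bucket and prefix below are placeholder assumptions, and the preview-era API shape may differ from the client shown here):

```python
# Sketch: registering an existing S3 location with Lake Formation.
# Bucket name and prefix are placeholders.

def s3_location_arn(bucket, prefix=""):
    """Build the ARN Lake Formation expects for an S3 path."""
    arn = f"arn:aws:s3:::{bucket}"
    return f"{arn}/{prefix}" if prefix else arn

if __name__ == "__main__":
    import boto3
    lf = boto3.client("lakeformation")
    lf.register_resource(
        ResourceArn=s3_location_arn("my-data-lake", "raw/clickstream"),
        UseServiceLinkedRole=True,  # let the service manage S3 access
    )
```

Because the data stays in your own S3 buckets, deregistering the location leaves the objects untouched — this is the "no lock-in" property above.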
Easily load data to your data lake
[Diagram: blueprints load data from logs and databases into data lake storage, either one-shot or incrementally, via Lake Formation's data import, crawlers, and ML-based data prep.]
With blueprints
You
1. Point us to the source
2. Tell us the location to load to in
your data lake
3. Specify how often you want to
load the data
Blueprints
1. Discover the schema of the source
table(s)
2. Automatically convert to the
target data format
3. Automatically partition the data
based on the partitioning schema
4. Keep track of data that was
already processed
5. You can customize any of the
above
Blueprints build on AWS Glue
Easily de-duplicate your data with ML transforms
Fuzzy de-duplication – under the hood
Naïve: score all pairs, O(N²). State-of-the-art: block records into candidate pairs and score only those.
[Diagram: a graph of candidate record pairs with pairwise similarity scores between 0.1 and 0.8.]
Fuzzy de-duplication – Innovations
Intersection Dynamic Blocking (VLDB 2008)
• Parallelizable and performant
• Blocks on a dynamic mix of columns
• 400M+ rows, 7.5B+ candidate pairs, 2.5 hours
SuperPart
• Partitions based on customer-provided ground truth
• Gives a confidence for each grouping
• Effective without tuning knobs
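The fuzzy de-duplication above surfaces through AWS Glue as a FindMatches ML transform. A sketch of creating one (database, table, key column, and role ARN are placeholder assumptions):

```python
# Sketch: creating a FindMatches ML transform via the AWS Glue API.
# Database, table, and IAM role names are placeholders.

def find_matches_request(db, table, key_column, role_arn,
                         precision_recall=0.9):
    """Build the create_ml_transform request for fuzzy de-duplication."""
    return {
        "Name": f"dedupe-{table}",
        "Role": role_arn,
        "InputRecordTables": [{"DatabaseName": db, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": key_column,
                # Bias toward precision (fewer false merges) or recall.
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }

if __name__ == "__main__":
    import boto3
    glue = boto3.client("glue")
    resp = glue.create_ml_transform(
        **find_matches_request("sales", "customers", "customer_id",
                               "arn:aws:iam::123456789012:role/GlueRole"))
    print(resp["TransformId"])
```

The transform is then taught from labeled example pairs — the "customer-provided ground truth" that SuperPart partitions on.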
Secure once, access in multiple ways
[Diagram: Lake Formation — data lake storage, data catalog, access control]
1. Admin sets up user access in Lake Formation
2. User tries to access data via one of the services
3. Service sends user credentials to Lake Formation
4. Lake Formation returns temporary credentials allowing data access
Security permissions in Lake Formation
Control data access with simple
grant and revoke permissions
Specify permissions on tables and
columns rather than on buckets
and objects
Easily view policies granted to a
particular user
Audit all data access in one place
Security permissions in Lake Formation
Search and view permissions
granted to a user, role, or group in
one place
Verify permissions granted to a
user
Easily revoke policies for a user
Grant table and column-level permissions
[Screenshots: example permission grants for User 1 and User 2]
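A column-level SELECT grant can be sketched through the Lake Formation permissions API (the principal ARN, database, table, and column names below are placeholder assumptions):

```python
# Sketch: granting column-level SELECT via the Lake Formation API.
# Principal ARN and table/column names are placeholders.

def column_grant(principal_arn, database, table, columns):
    """Build a grant_permissions request scoped to specific columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnNames": columns,  # the grant covers only these columns
        }},
        "Permissions": ["SELECT"],
    }

if __name__ == "__main__":
    import boto3
    lf = boto3.client("lakeformation")
    lf.grant_permissions(**column_grant(
        "arn:aws:iam::123456789012:user/analyst",
        "clickstream_db", "events", ["page", "timestamp"]))
```

A matching `revoke_permissions` call takes the same shape, which is what makes the "easily revoke policies for a user" workflow simple.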
Security – deep dive
Principals can be IAM users and roles, or Active Directory users via federation.
1. The user issues a query against table T
2. The service requests access to T from AWS Lake Formation
3. Lake Formation returns short-term credentials for T
4. The service requests the objects comprising T from Amazon S3, which returns them
End-services retrieve the underlying data directly from S3.
Search and collaborate across multiple users
Text-based, faceted search
across all metadata
Add attributes like data owners,
stewards, and others as table
properties
Add data sensitivity level,
column definitions, and others
as column properties
Text-based search and filtering
Query data in Amazon Athena
Audit and monitor in real time
See detailed alerts in the
console
Download audit logs for further
analytics
Data ingest and catalog
notifications are also published to
Amazon CloudWatch Events
Example: a data lake in 3 easy steps
1. Use blueprints to ingest data
2. Grant permissions to securely share data
3. Query the data (Amazon Athena)
Step 1: Blueprints to ingest data
Monitor the import
Imported data as table in the data lake
Step 2: Grant permissions to securely share data
Step 3: Run query in Amazon Athena
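Step 3 can be sketched against the Athena API (the database, table, SQL, and results bucket are placeholder assumptions):

```python
# Sketch: running the step-3 query through Amazon Athena.
# Database, table, and output bucket names are placeholders.

def athena_request(sql, database, output_s3):
    """Build a start_query_execution request."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

if __name__ == "__main__":
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(**athena_request(
        "SELECT page, count(*) FROM events GROUP BY page",
        "clickstream_db", "s3://my-athena-results/"))
    print(resp["QueryExecutionId"])
```

Because Athena goes through Lake Formation, the result set reflects the table- and column-level grants from step 2.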
AWS Lake Formation Pricing
No additional charges – you pay only for the
underlying services used.
Thank you!
Follow @DamianWylie on Twitter for live updates
Join the preview: https://pages.awscloud.com/lake-formation-preview.html
lakeformation-pm@amazon.com