1. What's New with Big Data
Analytics
Jia-Ren Lin, Support Engineer, AWS
2. Agenda
• Real-time data
• Introducing Amazon Managed Streaming for Kafka (Amazon MSK)
• Comparing Amazon MSK with Amazon Kinesis Data Streams
• Why did we build AWS Lake Formation?
• What is AWS Lake Formation?
• How does AWS Lake Formation help you?
3. Data is produced continuously
Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
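A raw record like the Apache error-log line above can be parsed into structured fields before it lands in a stream. A minimal sketch in Python, assuming the single-line form `[timestamp] [level] [client ip] message`:

```python
import re

# Simplified pattern for an Apache-style error-log line.
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"\[(?P<level>\w+)\]\s+"
    r"\[client (?P<client>[\d.]+)\]\s+"
    r"(?P<message>.*)"
)

def parse_error_log(line: str) -> dict:
    """Return the structured fields of one log line, or {} if it doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else {}

record = parse_error_log(
    "[Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] "
    "client denied by server configuration: /export/home/live/ap/htdocs/test"
)
```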
5. Challenges operating Apache Kafka
• Difficult to set up
• Hard to achieve high availability
• Tricky to scale
• AWS integrations = development
• No console, no visible metrics

f(Kafka usage) = Σ (n = 1 … ∞) SRE
6. Introducing Amazon Managed Streaming for
Kafka (Amazon MSK)
A fully managed, highly available, and secure service for Apache Kafka
Now available in public preview in the US East (N. Virginia) Region
7. Getting started with Amazon MSK is easy
• Fully compatible with Apache Kafka v1.1.1
• AWS Management Console and AWS API for provisioning
• Clusters are set up automatically
• Provision Apache Kafka brokers and storage
• Create and tear down clusters on-demand
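Provisioning through the AWS API can be sketched with boto3's `kafka` client `create_cluster` call. The cluster name, subnet IDs, and sizes below are illustrative placeholders, not real resources:

```python
# Sketch of an Amazon MSK cluster request (placeholder names and IDs).
cluster_request = {
    "ClusterName": "demo-msk-cluster",
    "KafkaVersion": "1.1.1",   # version supported at preview
    "NumberOfBrokerNodes": 3,  # e.g. one broker per Availability Zone
    "BrokerNodeGroupInfo": {
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},  # GiB per broker
    },
}

# With AWS credentials configured, the call would be:
#   import boto3
#   msk = boto3.client("kafka")
#   response = msk.create_cluster(**cluster_request)
```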
8. Automation drives higher availability
@ Preview
• Cluster lifecycle is fully automated
• Brokers and Apache Zookeeper nodes auto-heal
• IPs remain intact
• Patches are applied automatically
@ GA
• Service level agreement (SLA)
• Apache Kafka version upgrades
9. Scalability and configurability
@ GA
• Scale a cluster horizontally (add more of the same) or vertically (add larger brokers)
• Supports Apache Kafka partition reassignment tooling
• Define custom cluster configurations
• Auto scale storage
10. Deeply integrated with AWS services
@ Preview
• Amazon Virtual Private Cloud (Amazon VPC) for network isolation
• AWS Key Management Service (AWS KMS) for at-rest encryption
• AWS Identity and Access Management (IAM) for control-plane API control
• Amazon CloudWatch for Apache Kafka broker, topic, and ZK metrics
• Amazon Elastic Compute Cloud (Amazon EC2) M5 instances as brokers
• Amazon EBS GP2 broker storage
• Offered in the US-East (N. Virginia) AWS Region
@ GA
• Tagging
• AWS CloudTrail
• AWS CloudFormation
• Offered worldwide
11. What Amazon MSK does for you
• Makes Apache Kafka more accessible to your organization
• Drives best practices through design, defaults, and automation
• Allows developers to focus more on app development, less on infrastructure management
• Amazon MSK is committed to improving open-source Apache Kafka

f(Kafka usage) = Σ (n = 1 … ∞) Streaming Apps
13. How pricing works
• On-demand, hourly pricing prorated to the second
• Broker and storage pricing
• Broker pricing starts with kafka.m5.large @ $0.21/hr
• Storage pricing is $0.10 per GB-month
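Using the preview prices on this slide ($0.21/hr per kafka.m5.large broker, $0.10 per GB-month of storage), a back-of-envelope monthly estimate looks like this; the 730-hour month is the customary figure for hourly pricing:

```python
# Back-of-envelope MSK cost estimate from the slide's preview prices.
BROKER_HOURLY_USD = 0.21
STORAGE_GB_MONTH_USD = 0.10
HOURS_PER_MONTH = 730

def monthly_cost(num_brokers: int, storage_gb_per_broker: int) -> float:
    broker_cost = num_brokers * BROKER_HOURLY_USD * HOURS_PER_MONTH
    storage_cost = num_brokers * storage_gb_per_broker * STORAGE_GB_MONTH_USD
    return round(broker_cost + storage_cost, 2)

# Three brokers with 100 GB of storage each:
estimate = monthly_cost(3, 100)  # 3 * 0.21 * 730 + 3 * 100 * 0.10 = 489.90
```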
14. Comparing Amazon Kinesis Data Streams to MSK
[Diagram: side by side, an Amazon Kinesis data stream with 3 shards and an Amazon MSK topic with 3 partitions. In both, producers write records into a shard/partition, where records are ordered from oldest to newest.]
15. Comparing Amazon Kinesis Data Streams to MSK
Amazon Kinesis Data Streams:
• AWS API experience
• Throughput provisioning model
• Seamless scaling
• Typically lower costs
• Deep AWS integrations

Amazon MSK:
• Open-source compatibility
• Strong third-party tooling
• Cluster provisioning model
• Apache Kafka scaling isn't seamless to clients
• Raw performance
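The shard/partition diagrams above reflect the same underlying idea: both services route a record by hashing its key, so records with the same key land in the same shard or partition and stay ordered there. A simplified sketch (Kinesis actually maps an MD5 of the partition key into per-shard hash-key ranges, and Kafka's default partitioner uses murmur2; the MD5-mod below is an illustrative stand-in):

```python
import hashlib

def route(key: str, num_slots: int) -> int:
    """Map a record key to a shard/partition index (simplified stand-in)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % num_slots

# The same key always routes to the same shard/partition:
a = route("sensor-42", 3)
b = route("sensor-42", 3)
```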
16. A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
17. The concept of a Data Lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
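"Schema on read" means raw records are stored untouched and structure is applied only at query time. A toy sketch (the field names and reader-side schema are hypothetical, for illustration):

```python
import json

# Raw JSON records stored as-is, including fields no reader asked for.
raw_records = [
    '{"device": "d1", "temp": "21.5"}',
    '{"device": "d2", "temp": "19.0", "extra": "ignored"}',
]

# A reader-side schema: field name -> type to cast to at read time.
read_schema = {"device": str, "temp": float}

def apply_schema(line: str, schema: dict) -> dict:
    """Parse a raw record and cast only the fields this reader cares about."""
    doc = json.loads(line)
    return {field: cast(doc[field]) for field, cast in schema.items() if field in doc}

rows = [apply_schema(r, read_schema) for r in raw_records]
```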
18. Typical steps of building a data lake
• Set up storage
• Move data
• Cleanse, prep, and catalog data
• Configure and enforce security and compliance policies
• Make data available for analytics
19. Sample of steps required
• Find sources
• Create Amazon Simple Storage Service (Amazon S3) locations
• Configure access policies
• Map tables to Amazon S3 locations
• ETL jobs to load and clean data
• Create metadata access policies
• Configure access from analytics services
• Rinse and repeat for other data sets, users, and end-services
And more:
• Manage and monitor ETL jobs
• Update the metadata catalog as data changes
• Update policies across services as users and permissions change
• Manually maintain cleansing scripts
• Create audit processes for compliance
• …
Manual | Error-prone | Time consuming
20. Data lake on AWS
[Architecture diagram: Central Storage (Amazon S3: scalable, secure, cost-effective) surrounded by:
• Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
• Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS
• Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Manage & Secure: AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch]
21. AWS Lake Formation: build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
23. Register existing data or import new
• Amazon S3 forms the storage layer for Lake Formation
• Register existing S3 buckets that contain your data
• Ask Lake Formation to create required S3 buckets and import data into them
• Data is stored in your account. You have direct access to it. No lock-in.
[Diagram: Lake Formation (data import, crawlers, ML-based data prep) managing data lake storage, a data catalog, and access control]
24. Easily load data to your data lake
• Use blueprints to load data from logs and databases, either one-shot or incrementally
[Diagram: blueprints ingest logs and DBs into Lake Formation (data import, crawlers, ML-based data prep), which manages data lake storage, a data catalog, and access control]
25. With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data

Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
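The partitioning step a blueprint performs can be sketched as laying records out under Hive-style `year=/month=/day=` prefixes in S3. The bucket and table names below are made up for illustration:

```python
from datetime import date

def partition_path(bucket: str, table: str, event_date: date) -> str:
    """Build a Hive-style partitioned S3 prefix for one day's data."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={event_date.year}/month={event_date.month:02d}/day={event_date.day:02d}/"
    )

path = partition_path("my-data-lake", "clickstream", date(2018, 11, 28))
```

Partitioning this way lets query engines such as Athena prune whole prefixes when a query filters on the partition columns.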
28. Fuzzy de-duplication – under the hood
Naïve approach: compare all pairs, O(N²); state-of-the-art approaches prune the candidate pairs first.
[Diagram: a similarity graph over records with pairwise match scores of 0.6, 0.8, 0.4, 0.1, and 0.1]
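The contrast above can be made concrete with a toy blocking example: the naïve approach treats every pair as a candidate, while blocking only compares records that share a cheap key. The first-3-characters key below is a toy stand-in for the dynamic column-mix blocking described on the next slide:

```python
from collections import defaultdict
from itertools import combinations

records = ["jon smith", "john smith", "jane doe", "jane d.", "bob jones"]

# Naïve: every pair is a candidate, O(N^2).
naive_pairs = list(combinations(range(len(records)), 2))

# Blocking: only records sharing a blocking key are compared.
blocks = defaultdict(list)
for i, rec in enumerate(records):
    blocks[rec[:3]].append(i)  # toy key: first 3 characters

candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(ids, 2)
]
```

Here blocking cuts 10 naïve pairs down to the single pair ("jane doe", "jane d.") that shares a key; real blocking keys are chosen so true duplicates rarely fall in different blocks.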
29. Fuzzy de-duplication – Innovations
Intersection Dynamic Blocking (VLDB 2008):
• Parallelizable and performant
• Blocks on a dynamic mix of columns
• 400M+ rows, 7.5B+ candidate pairs, 2.5 hours

SuperPart:
• Partitions based on customer-provided ground truth
• Gives a confidence for each grouping
• Effective without tuning knobs
30. Secure once, access in multiple ways
1. Admin sets up user access in Lake Formation
2. User tries to access data via one of the services
3. Service sends user credentials to Lake Formation
4. Lake Formation returns temporary credentials allowing data access
[Diagram: Lake Formation (data catalog, access control) brokering access between users, analytics services, and data lake storage]
31. Security permissions in Lake Formation
• Control data access with simple grant and revoke permissions
• Specify permissions on tables and columns rather than on buckets and objects
• Easily view policies granted to a particular user
• Audit all data access in one place
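The grant/revoke model can be sketched as a toy permission store keyed on tables and columns rather than buckets and objects. This is an illustrative model of the idea, not Lake Formation's API; the class and principal names are made up:

```python
# Toy model of table/column-level grants (illustrative, not the real API).
class Grants:
    def __init__(self):
        self._grants = set()  # (principal, table, column); "*" means all columns

    def grant(self, principal: str, table: str, column: str = "*") -> None:
        self._grants.add((principal, table, column))

    def revoke(self, principal: str, table: str, column: str = "*") -> None:
        self._grants.discard((principal, table, column))

    def can_read(self, principal: str, table: str, column: str) -> bool:
        return ((principal, table, column) in self._grants
                or (principal, table, "*") in self._grants)

    def grants_for(self, principal: str):
        """Audit view: everything granted to one principal, in one place."""
        return sorted(g for g in self._grants if g[0] == principal)

acl = Grants()
acl.grant("analyst", "sales", "region")
acl.grant("admin", "sales")            # all columns
acl.revoke("analyst", "sales", "region")
```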
32. Security permissions in Lake Formation
• Search and view permissions granted to a user, role, or group in one place
• Verify permissions granted to a user
• Easily revoke policies for a user
34. Security – deep dive
• Principals can be IAM users and roles, or Active Directory users via federation
• End-services retrieve the underlying data directly from Amazon S3
[Diagram: a user queries table T through an end-service; the service requests access to T from AWS Lake Formation and receives short-term credentials for T; it then requests the objects comprising T from Amazon S3, which returns them]
35. Search and collaborate across multiple users
• Text-based, faceted search across all metadata
• Add attributes like data owners, stewards, and others as table properties
• Add data sensitivity level, column definitions, and others as column properties
• Text-based search and filtering
• Query data in Amazon Athena
36. Audit and monitor in real time
• See detailed alerts in the console
• Download audit logs for further analytics
• Data ingest and catalog notifications are also published to Amazon CloudWatch Events
37. Example: a data lake in 3 easy steps
1. Use blueprints to ingest data
2. Grant permissions to securely share data
3. Query the data (Amazon Athena)
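Step 3 can be sketched with boto3's `athena` client `start_query_execution` call; the database name, output bucket, and query below are illustrative placeholders:

```python
# Sketch of querying the data lake with Amazon Athena (placeholder names).
query_params = {
    "QueryString": "SELECT region, COUNT(*) FROM sales GROUP BY region",
    "QueryExecutionContext": {"Database": "my_data_lake"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}

# With AWS credentials configured, the call would be:
#   import boto3
#   athena = boto3.client("athena")
#   execution = athena.start_query_execution(**query_params)
#   # then poll athena.get_query_execution(QueryExecutionId=...) until it finishes
```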
43. AWS Lake Formation Pricing
No additional charges; you only pay for the underlying services used.
44. Thank you!
Follow @DamianWylie on Twitter for live updates
Join the preview: https://pages.awscloud.com/lake-formation-preview.html
lakeformation-pm@amazon.com