AWS Big Data Landscape
- 1. COPYRIGHT – 1 BILLION TECH | CONFIDENTIAL
AWS BIG DATA LANDSCAPE
TECH TALK
[29 OCT 2020]
- 2. AGENDA
Big Data – An Introduction
Big Data – The Open Source Stack
Data Lake Reference Architecture on AWS
Big Data – The AWS Stack
- 3. BIG DATA ANALYTICS - THE NEED
As we become a more digital society, the amount of data being created and collected is growing at an accelerating rate.
Analyzing this ever-growing data becomes a challenge with traditional analytical tools.
We require innovation to bridge the gap between the data being generated and the data that can be analyzed effectively.
Data management architectures have evolved from the traditional data warehousing
model to more complex architectures that address more requirements.
- 4. THREE MAIN CHALLENGES
There are three main challenges that Big Data is trying to address:
● Variety (the diversity of data sources)
● Volume (the size of the data)
● Velocity (the high frequency of data)
- 5. WHAT LEADS TO BIG DATA?
Social Media
Mobile Data
IoT Devices
- 6. WHAT IS A BIG DATA APPLICATION?
Data must be in terabytes / petabytes
More than one source
Huge processing loads
Real time streaming
High Scalability
Advanced Analytics
- 8. SINGAPORE LAND TRANSPORT SYSTEM (LTA)
Source: How Cities using Big Data in Asia? - FutureGov Report
Data Collection:
● Junction Electronic Eyes, Green Link Determining System, Web Cams, Parking Guidance Systems, Expressway Monitoring Systems, Traffic Scans
Data Processing:
● All the data is fed "real-time" into the integrated I-Transport Processing System
Data Visualization:
● Via government web portals, navigation devices, central control rooms, smart phones, etc.
● Certain elements are abstracted to Open Data
- 10. THE HADOOP ARCHITECTURE
Hadoop is a framework that provides open-source libraries for distributed computing
It has two main components: MapReduce and HDFS
Designed to scale out from a few computing nodes to thousands of machines, each offering local computation and storage
Leverages Massively Parallel Processing (MPP) to take advantage of Big Data, generally by using lots of inexpensive commodity servers with a high tolerance of hardware failure
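The MapReduce model at the heart of Hadoop can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on a classic word count; the function names are made up and this is not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each mapper emits (word, 1) pairs, like a Hadoop Mapper
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: the framework groups all values by key across mappers
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each reducer sums the counts for its keys
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big analytics", "data lake"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce phases run in parallel across many nodes, with HDFS providing the distributed storage underneath.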
- 16. APACHE HIVE
Works as a SQL interface on top of MapReduce
Uses a SQL-like syntax called HiveQL (quite close to standard SQL)
It is much easier to write Hive scripts than to write MapReduce code in Java
Hive can run on both MapReduce and Tez
Used primarily in Data Warehouses / Data Lakes (OLAP applications)
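To illustrate how close HiveQL is to SQL, here is a hypothetical query held in a Python string; the `web_logs` table and its columns are invented for the example:

```python
# A hypothetical HiveQL query; `web_logs` and its columns are invented
# for illustration. Hive compiles statements like this into MapReduce
# (or Tez) jobs behind the scenes - no Java required.
HIVEQL = """
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE year = 2020
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
"""
```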
- 17. APACHE PIG
Pig Latin is a language for querying large data sets that reduces the complexity of writing MapReduce Java code
It runs on MapReduce to ease MapReduce's complexity
It can run on Apache Tez as well
Pig is a legacy technology these days, with other frameworks in the picture
- 19. BIG DATA IN AWS
AWS provides a broad platform of “managed services” to help you build, secure and
seamlessly scale end-to-end big data applications quickly and with ease.
Analyzing large data sets requires significant compute capacity that can vary in size based
on the amount of input data and the type of analysis
This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud
computing model, where applications can easily scale up and down based on demand
There is no hardware to procure and no infrastructure to maintain and scale; all you need to do is build the correct Big Data pipeline (Collect, Store, Process and Analyze)
- 20. BIG DATA SERVICES IN AWS
Amazon Kinesis
Amazon Managed Streaming for Apache Kafka (MSK)
AWS Lambda
Amazon Elastic MapReduce (EMR)
AWS Glue
Amazon Redshift
Amazon Athena
Amazon Elasticsearch Service
Amazon QuickSight
AWS Data Pipeline
- 21. AWS DATA LAKE ARCHITECTURE
“Data lake is more like a body of water in its natural
state. Data flows from the streams (the source
systems) to the lake. Users have access to the lake to
examine, take samples or dive in.”
The Maskeliya Reservoir – Sri Lanka
- 22. DATA LAKE FEATURES
Data Lakes retain ALL data
Data Lakes support ALL data types
Data Lakes support ALL user levels
Data Lakes can adapt to changes easily
Data Lakes provide faster insights
- 23. DATA LAKE OR DATA WAREHOUSE
The Hybrid Approach
● If you already have a Data Warehouse, do not change it
● Build a Data Lake alongside the Data Warehouse and feed the warehouse from the lake
● This gives you the benefits of both
- 24. DATA LAKE BEST PRACTICES
Know your data sources well – "Discover First, Schema Later"
Study them well – analyze and determine how multiple sources interact and correlate with each other
Improve the data – identifying relationships, separating out technical and business data types, and filtering can lead to a unified data model (data catalog)
Separate out all four layers (ingest, store, process and visualize) at all times, and let them work independently
Once you build the data lake, apply security and monitor data movements closely
Do not over-engineer – you will never discover all your data sources in the first attempt. Apply an iterative design, keep adding more data sources, and enhance your data lake over time.
- 25. AWS DATA LAKE ARCHITECTURE
Source: https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
- 26. AMAZON KINESIS
Used in "real-time" Big Data applications with a huge scale of data ingestion
Kinesis is a managed alternative to Apache Kafka
Great service to gather application logs, metrics, IoT data and clickstreams
Great for stream processing frameworks such as Spark, NiFi, etc.
Data is automatically replicated synchronously across 3 Availability Zones
Kinesis consists of three main services:
● 1. Kinesis Data Streams: collect and store data streams
● 2. Kinesis Data Firehose: load data streams into AWS data stores
● 3. Kinesis Data Analytics: analyze data streams with SQL
Amazon Kinesis
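A minimal producer sketch using boto3: `put_record` is the real Kinesis Data Streams API call, but the stream name, event shape and partition-key field are assumptions for illustration.

```python
import json

def build_kinesis_record(stream_name, event):
    # Shape of a single PutRecord call for Kinesis Data Streams.
    # The "device_id" partition-key field is an assumption for this sketch.
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["device_id"]),  # determines the target shard
    }

def send(event, stream_name="clickstream"):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    kinesis = boto3.client("kinesis")
    return kinesis.put_record(**build_kinesis_record(stream_name, event))

params = build_kinesis_record("clickstream", {"device_id": 42, "page": "/home"})
```

Records with the same partition key always land on the same shard, which preserves per-device ordering.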
- 27. AWS MSK
MSK = Managed Streaming for Apache Kafka
A fully managed Apache Kafka on AWS and an alternative to Amazon Kinesis
It allows you to create, update and delete MSK clusters; this control plane is managed for you by MSK
It creates and manages the Apache Kafka broker nodes and ZooKeeper nodes for you
Can deploy an MSK cluster in your AWS VPC in a multi-AZ setup
Automatic recovery from Apache Kafka failures
Can do custom configurations (e.g. number of Availability Zones, VPC and subnets, broker instance type, number of brokers per AZ)
Data is stored in EBS volumes (the data plane), which are your responsibility to manage
AWS MSK
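The custom configuration options above map onto the parameters of MSK's `CreateCluster` API. A hedged sketch of a minimal request via boto3; the instance type, Kafka version, EBS size, and subnet/security-group IDs are all illustrative values:

```python
def msk_cluster_config(name, subnets, security_groups):
    # Minimal CreateCluster parameters for Amazon MSK (the control plane).
    # Kafka version, instance type and volume size are assumptions.
    return {
        "ClusterName": name,
        "KafkaVersion": "2.2.1",
        "NumberOfBrokerNodes": len(subnets),  # here: one broker per AZ/subnet
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m5.large",
            "ClientSubnets": subnets,           # one subnet per AZ in your VPC
            "SecurityGroups": security_groups,
            # EBS storage is the data plane the slide mentions (size in GiB)
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},
        },
    }

def create_cluster(config):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    return boto3.client("kafka").create_cluster(**config)

cfg = msk_cluster_config("demo", ["subnet-a", "subnet-b", "subnet-c"], ["sg-1"])
```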
- 28. AWS GLUE
Works as a central metadata repository for your data lake (S3). This can later be used by data analytics services such as Athena, EMR and Redshift, and then visualized with Amazon QuickSight
Can work as an ETL (Extract, Transform, Load) tool as well, running Apache Spark underneath
Formerly this ETL task was carried out by AWS Data Pipeline
AWS Glue is completely serverless and fully managed
AWS Glue provides a Data Catalog, which is populated with the help of Glue Crawlers
Glue data sources: S3, RDS, DynamoDB, Kinesis Data Streams and Kafka
CloudWatch can be used to monitor Glue job progress
AWS Glue
- 29. GLUE CRAWLER AND DATA CATALOG
Data Catalog:
● Stores only metadata (table definitions and schema details) in the catalog; the data itself remains in S3
● You can have one Data Catalog per region per account
● You can create Glue Crawlers for data sources
Glue Crawler:
● Scans data sources (such as S3) and populates the Data Catalog as a central metadata repository
● A Glue Crawler extracts partitions based on how your data is organized within S3
● How those S3 partitions are structured can really impact the performance of your downstream queries
● Hence, it pays to think up front and organize your S3 directories if you are planning to query them at a later stage
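The partition layout a crawler extracts is the Hive-style `key=value` directory convention. A small sketch of building such a path; the bucket and table names are hypothetical:

```python
def partition_path(prefix, table, **keys):
    # Hive-style partition layout that Glue crawlers recognize, e.g.
    # s3://bucket/logs/year=2020/month=10/day=29/
    # Each key=value segment becomes a partition column in the catalog.
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{prefix}/{table}/{parts}/"

path = partition_path("s3://my-data-lake", "logs", year=2020, month=10, day=29)
```

Queries that filter on `year`, `month` or `day` can then skip whole directories instead of scanning the entire table.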
- 30. GLUE ETL
ETL is all about transforming, cleaning and enriching data before the analysis
The ETL code Glue generates is in Python or Scala, and you can modify the generated code
Besides the Glue-generated ETL code, you can also upload your own ETL code written in Python or Scala to AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use it. You can reuse the same code across multiple jobs by pointing them to the same code location on S3
ETL jobs run on a Spark platform under the hood; the Glue Scheduler can schedule the jobs
ETL Transformations:
● Joining data, filtering data, mapping data, dropping null fields, etc.
● Machine Learning transformations
● Format conversions: CSV, Parquet, Avro, JSON, XML
● Spark transformations (e.g. K-Means clustering)
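A pure-Python analogue of three of the transforms listed above (filter, drop null fields, map/rename columns) on an in-memory sample; in a real Glue job these would be DynamicFrame operations running on Spark, and the record fields here are invented for the example:

```python
records = [
    {"user": "a", "amount": "10", "note": None},
    {"user": "b", "amount": None, "note": "x"},
    {"user": "c", "amount": "7", "note": None},
]

def drop_null_fields(record):
    # Analogue of Glue's DropNullFields transform
    return {k: v for k, v in record.items() if v is not None}

def apply_mapping(record):
    # Analogue of ApplyMapping: rename and cast columns
    return {"customer": record["user"], "amount": float(record["amount"])}

cleaned = [
    apply_mapping(drop_null_fields(r))
    for r in records
    if r["amount"] is not None          # analogue of the Filter transform
]
```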
- 31. ELASTIC MAP REDUCE (EMR)
EMR is the AWS managed service for the Apache Hadoop framework
Not serverless (you need to decide the number of server instances in the cluster)
Used for Big Data processing, manipulation, analytics, indexing, transformation, etc.
Apache Hadoop is an open-source framework which can handle Big Data workloads
Includes other frameworks such as Spark, HBase, Presto, Flink, Hive, Pig, etc.
EMR Notebooks can be used along with EMR clusters running Apache Spark to create and open Jupyter Notebook and JupyterLab interfaces from the AWS console
Amazon EMR
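Because EMR is not serverless, you specify the cluster size yourself when launching. A hedged sketch of a minimal `RunJobFlow` request via boto3; the release label, instance types, roles and log bucket are illustrative values:

```python
def emr_cluster_spec(name, node_count):
    # Minimal RunJobFlow parameters for a Spark + Hive cluster on EMR.
    # Release label, instance type, roles and log bucket are assumptions.
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,   # not serverless: you pick the size
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        "LogUri": "s3://my-emr-logs/",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def launch(spec):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    return boto3.client("emr").run_job_flow(**spec)

spec = emr_cluster_spec("tech-talk-demo", 3)
```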
- 32. AWS DATA PIPELINE
A task scheduling framework
It is a service which helps move data from one data storage service to another at specified intervals
For example, moving EC2 logs to S3 on a schedule and then analyzing them in an EMR cluster
Destinations: S3, RDS, DynamoDB, Redshift, EMR
Manages task dependencies
Cross-region replication
Precondition checks
Highly available
AWS Data Pipeline
- 33. AWS ATHENA
It is an interactive query service for S3 using SQL
Supported data formats in S3: JSON, CSV, ORC, Parquet, Avro
Of these, Parquet and ORC have columnar formats, while Avro has a row-based format
Using columnar formats such as Parquet and ORC can reduce costs by 30%-90% while improving performance at the same time
No need to load data into it; it queries S3 in place
It is serverless
It uses Presto under the hood
Amazon Athena
- 34. ATHENA WITH GLUE
Athena can be integrated with Glue to give structure to unstructured data
Glue adds table definitions and columns to that unstructured data using the Glue Data Catalog
This is a best practice, since Glue can serve as the unified data repository for your data analytics applications
Partitioning your data within S3 can reduce query latency drastically
Do not use Athena for highly formatted visualizations / reports (QuickSight can do that for you)
- 35. AWS REDSHIFT
It is a fully managed, petabyte-scale Data Warehouse solution on AWS
Cost effective compared to other cloud Data Warehouse solutions
Amazon claims that Redshift is up to 10x faster than other Data Warehouses on the market
Since it is a Data Warehouse service, it is specifically designed for OLAP queries
Massively Parallel Processing (MPP) query execution – improves performance
Columnar storage – improves performance
Column-level compression
Can scale up or down
Built-in replication and backups
Monitoring via CloudWatch / CloudTrail
Amazon Redshift
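Loading S3 data into Redshift is typically done with the `COPY` command, which the compute nodes execute in parallel across slices (which is where the MPP speedup comes from). The sketch below only builds the SQL string; the table name, S3 path and IAM role ARN are placeholders:

```python
def copy_statement(table, s3_path, iam_role):
    # Redshift COPY loads files from S3 in parallel across cluster slices.
    # FORMAT AS PARQUET loads columnar Parquet files directly.
    # Table, path and role ARN below are placeholders, not real resources.
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET;"
    )

sql = copy_statement(
    "sales",
    "s3://my-lake/sales/",
    "arn:aws:iam::123456789012:role/RedshiftCopy",  # placeholder ARN
)
```

The statement would then be executed against the cluster with any SQL client or driver.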
- 36. REDSHIFT SPECTRUM
This allows you to query exabytes of unstructured data in S3 without loading it first. You could do this via Glue and Athena, but Spectrum is an alternative that keeps the query inside Redshift
Limitless concurrency
Horizontal scaling
Supports a wide variety of data formats
- 37. AWS QUICKSIGHT
It is an end-user visualization tool providing a fast, easy, cloud-powered business analytics service
It has the ability to:
● Build visualizations
● Perform ad-hoc analysis
● Quickly get business insights from data
● Be accessed from any device (web or mobile)
It is serverless, highly available and highly durable
Data sets are imported into an engine called SPICE. SPICE is a super-fast, parallel, in-memory calculation engine that can accelerate large interactive queries. Each QuickSight user gets 10 GB of SPICE capacity
Pricing – subscription based
Amazon QuickSight
- 38. REFERENCES
AWS Data Analytics White Paper :
https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
Data Lakes and Analytics on AWS:
https://aws.amazon.com/big-data/datalakes-and-analytics/
Big Data Case Studies: https://aws.amazon.com/big-data/use-cases/
Build a Data Lake Foundation with AWS Glue and Amazon S3:
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-
amazon-s3/
Big Data Architectural Patterns and Best Practices (AWS re:Invent 2017):
https://www.youtube.com/watch?v=a3713oGB6Zk