AWS Big Data Landscape
- 1. COPYRIGHT – 1 BILLION TECH | CONFIDENTIAL
AWS BIG DATA LANDSCAPE
TECH TALK
[29 OCT 2020]
- 2. AGENDA
Big Data – An Introduction
Big Data – The Open Source Stack
Data Lake Reference Architecture on AWS
Big Data – The AWS Stack
- 3. BIG DATA ANALYTICS - THE NEED
As we become a more digital society, the amount of data being created and collected is growing at an accelerating rate.
Analyzing this ever-growing data becomes a challenge with traditional analytical tools.
We require innovation to bridge the gap between the data being generated and the data that can be analyzed effectively.
Data management architectures have evolved from the traditional data warehousing
model to more complex architectures that address more requirements.
- 4. THREE MAIN CHALLENGES
There are three main challenges that Big Data is trying to address:
● Variety (the diversity of data sources)
● Volume (the size of the data)
● Velocity (the high frequency of data)
- 5. WHAT LEADS TO BIG DATA?
Social Media
Mobile Data
IoT Devices
- 6. WHAT IS A BIG DATA APPLICATION?
Data must be in terabytes / petabytes
More than one source
Huge processing loads
Real time streaming
High Scalability
Advanced Analytics
- 8. SINGAPORE LAND TRANSPORT SYSTEM (LTA)
Source: How Cities using Big Data in Asia? - FutureGov Report
Data Collection:
● Junction Electronic Eyes, Green Link Determining System, Web Cams, Parking Guidance Systems, Expressway Monitoring Systems, Traffic Scans
Data Processing:
● All the data is fed "real-time" into the integrated I-Transport Processing System
Data Visualization:
● Via government web portals, navigation devices, central control rooms, smart phones, etc.
● Certain elements are abstracted to Open Data
- 10. THE HADOOP ARCHITECTURE
Hadoop is a framework that provides open-source libraries for distributed computing
It has two main components: MapReduce and HDFS
Designed to scale out from a few computing nodes to thousands of machines, each offering local computation and storage
Leverages Massively Parallel Processing (MPP) to take advantage of Big Data, generally by using lots of inexpensive commodity servers with a high tolerance of hardware failure
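The MapReduce model at the heart of Hadoop can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on a classic word count; the function names are made up and this is not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each mapper emits (word, 1) pairs, like a Hadoop Mapper
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: the framework groups all values by key across mappers
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each reducer sums the counts for its keys
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big analytics", "data lake"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce phases run in parallel across many nodes, with HDFS providing the distributed storage underneath.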
- 16. APACHE HIVE
Works as a SQL interface on top of MapReduce
Uses a SQL-like syntax called HiveQL (quite close to standard SQL)
It is much easier to write Hive scripts than to write MapReduce code in Java
Hive can run on both MapReduce and Tez
Used primarily in Data Warehouses / Data Lakes (OLAP applications)
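To illustrate how close HiveQL is to SQL, here is a hypothetical query held in a Python string; the `web_logs` table and its columns are invented for the example:

```python
# A hypothetical HiveQL query; `web_logs` and its columns are invented
# for illustration. Hive compiles statements like this into MapReduce
# (or Tez) jobs behind the scenes - no Java required.
HIVEQL = """
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE year = 2020
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
"""
```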
- 17. APACHE PIG
Pig Latin is a language for querying large data sets that reduces the complexity of writing MapReduce Java code
It runs on MapReduce to ease MapReduce's complexity
It can run on Apache Tez as well
Pig is a legacy technology these days, with other frameworks in the picture
- 19. BIG DATA IN AWS
AWS provides a broad platform of “managed services” to help you build, secure and
seamlessly scale end-to-end big data applications quickly and with ease.
Analyzing large data sets requires significant compute capacity that can vary in size based
on the amount of input data and the type of analysis
This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud
computing model, where applications can easily scale up and down based on demand
There is no hardware to procure and no infrastructure to maintain and scale; all you need to do is build the correct Big Data pipeline (Collect, Store, Process and Analyze)
- 20. BIG DATA SERVICES IN AWS
Amazon Kinesis
Amazon Managed Streaming for Apache Kafka (MSK)
AWS Lambda
Amazon Elastic MapReduce (EMR)
AWS Glue
Amazon Redshift
Amazon Athena
Amazon Elasticsearch Service
Amazon QuickSight
AWS Data Pipeline
- 21. AWS DATA LAKE ARCHITECTURE
“Data lake is more like a body of water in its natural
state. Data flows from the streams (the source
systems) to the lake. Users have access to the lake to
examine, take samples or dive in.”
The Maskeliya Reservoir – Sri Lanka
- 22. DATA LAKE FEATURES
Data Lakes retain ALL data
Data Lakes support ALL data types
Data Lakes support ALL user levels
Data Lakes can adapt to changes easily
Data Lakes provide faster insights
- 23. DATA LAKE OR DATA WAREHOUSE
The Hybrid Approach
● If you already have a Data Warehouse, do not change it
● Build a Data Lake alongside the Data Warehouse and feed the warehouse from the lake
● This gives you the benefits of both
- 24. DATA LAKE BEST PRACTICES
Know your data sources well – "Discover First, Schema Later"
Study them well – analyze and determine how multiple sources interact and correlate with each other
Improve the data – identifying relationships, separating out technical and business data types, and filtering can lead to a unified data model (data catalog)
Separate out all four layers (ingest, store, process and visualize) at all times, and let them work independently
Once you build the data lake, apply security and monitor data movements closely
Do not over-engineer – you will never discover all your data sources in the first attempt. Apply an iterative design, keep adding more data sources, and enhance your data lake over time.
- 25. AWS DATA LAKE ARCHITECTURE
Source: https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
- 26. AMAZON KINESIS
Used in "real-time" Big Data applications with a huge scale of data ingestion
Kinesis is a managed alternative to Apache Kafka
Great service to gather application logs, metrics, IoT data and clickstreams
Great for stream processing frameworks such as Spark, NiFi, etc.
Data is automatically replicated synchronously across 3 Availability Zones
Kinesis consists of three main services:
● 1. Kinesis Data Streams: collect and store data streams
● 2. Kinesis Data Firehose: load data streams into AWS data stores
● 3. Kinesis Data Analytics: analyze data streams with SQL
Amazon Kinesis
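A minimal producer sketch using boto3: `put_record` is the real Kinesis Data Streams API call, but the stream name, event shape and partition-key field are assumptions for illustration.

```python
import json

def build_kinesis_record(stream_name, event):
    # Shape of a single PutRecord call for Kinesis Data Streams.
    # The "device_id" partition-key field is an assumption for this sketch.
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["device_id"]),  # determines the target shard
    }

def send(event, stream_name="clickstream"):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    kinesis = boto3.client("kinesis")
    return kinesis.put_record(**build_kinesis_record(stream_name, event))

params = build_kinesis_record("clickstream", {"device_id": 42, "page": "/home"})
```

Records with the same partition key always land on the same shard, which preserves per-device ordering.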
- 27. AWS MSK
MSK = Managed Streaming for Apache Kafka
A fully managed Apache Kafka on AWS and an alternative to Amazon Kinesis
It allows you to create, update and delete MSK clusters; this control plane is managed for you by MSK
It creates and manages the Apache Kafka broker nodes and ZooKeeper nodes for you
Can deploy an MSK cluster in your AWS VPC in a multi-AZ setup
Automatic recovery from Apache Kafka failures
Can do custom configurations (e.g. number of Availability Zones, VPC and subnets, broker instance type, number of brokers per AZ)
Data is stored in EBS volumes (the data plane), which are your responsibility to manage
AWS MSK
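The custom configuration options above map onto the parameters of MSK's `CreateCluster` API. A hedged sketch of a minimal request via boto3; the instance type, Kafka version, EBS size, and subnet/security-group IDs are all illustrative values:

```python
def msk_cluster_config(name, subnets, security_groups):
    # Minimal CreateCluster parameters for Amazon MSK (the control plane).
    # Kafka version, instance type and volume size are assumptions.
    return {
        "ClusterName": name,
        "KafkaVersion": "2.2.1",
        "NumberOfBrokerNodes": len(subnets),  # here: one broker per AZ/subnet
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m5.large",
            "ClientSubnets": subnets,           # one subnet per AZ in your VPC
            "SecurityGroups": security_groups,
            # EBS storage is the data plane the slide mentions (size in GiB)
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},
        },
    }

def create_cluster(config):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    return boto3.client("kafka").create_cluster(**config)

cfg = msk_cluster_config("demo", ["subnet-a", "subnet-b", "subnet-c"], ["sg-1"])
```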
- 28. AWS GLUE
Works as a central metadata repository for your data lake (S3). This can later be used by data analytics services such as Athena, EMR and Redshift, and then visualized with Amazon QuickSight
Can work as an ETL (Extract, Transform, Load) tool as well, running Apache Spark underneath
Formerly this ETL task was carried out by AWS Data Pipeline
AWS Glue is completely serverless and fully managed
AWS Glue provides a Data Catalog, which is populated with the help of Glue Crawlers
Glue data sources: S3, RDS, DynamoDB, Kinesis Data Streams and Kafka
CloudWatch can be used to monitor Glue job progress
AWS Glue
- 29. GLUE CRAWLER AND DATA CATALOG
Data Catalog:
● Stores only metadata (table definitions and schema details) in the catalog; the data itself remains in S3
● You can have one Data Catalog per region per account
● You can create Glue Crawlers for data sources
Glue Crawler:
● Scans data sources (such as S3) and populates the Data Catalog as a central metadata repository
● A Glue Crawler extracts partitions based on how your data is organized within S3
● How those S3 partitions are structured can really impact the performance of your downstream queries
● Hence, it pays to think up front and organize your S3 directories if you are planning to query them at a later stage
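The partition layout a crawler extracts is the Hive-style `key=value` directory convention. A small sketch of building such a path; the bucket and table names are hypothetical:

```python
def partition_path(prefix, table, **keys):
    # Hive-style partition layout that Glue crawlers recognize, e.g.
    # s3://bucket/logs/year=2020/month=10/day=29/
    # Each key=value segment becomes a partition column in the catalog.
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{prefix}/{table}/{parts}/"

path = partition_path("s3://my-data-lake", "logs", year=2020, month=10, day=29)
```

Queries that filter on `year`, `month` or `day` can then skip whole directories instead of scanning the entire table.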
- 30. GLUE ETL
ETL is all about transforming, cleaning and enriching data before the analysis
The ETL code Glue generates is in Python or Scala, and you can modify the generated code
Besides the Glue-generated ETL code, you can also upload your own ETL code written in Python or Scala to AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use it. You can reuse the same code across multiple jobs by pointing them to the same code location on S3
ETL jobs run on a Spark platform under the hood; the Glue Scheduler can schedule the jobs
ETL Transformations:
● Joining data, filtering data, mapping data, dropping null fields, etc.
● Machine Learning transformations
● Format conversions: CSV, Parquet, Avro, JSON, XML
● Spark transformations (e.g. K-Means clustering)
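A pure-Python analogue of three of the transforms listed above (filter, drop null fields, map/rename columns) on an in-memory sample; in a real Glue job these would be DynamicFrame operations running on Spark, and the record fields here are invented for the example:

```python
records = [
    {"user": "a", "amount": "10", "note": None},
    {"user": "b", "amount": None, "note": "x"},
    {"user": "c", "amount": "7", "note": None},
]

def drop_null_fields(record):
    # Analogue of Glue's DropNullFields transform
    return {k: v for k, v in record.items() if v is not None}

def apply_mapping(record):
    # Analogue of ApplyMapping: rename and cast columns
    return {"customer": record["user"], "amount": float(record["amount"])}

cleaned = [
    apply_mapping(drop_null_fields(r))
    for r in records
    if r["amount"] is not None          # analogue of the Filter transform
]
```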
- 31. ELASTIC MAP REDUCE (EMR)
EMR is the AWS managed service for the Apache Hadoop framework
Not serverless (you need to decide the number of server instances in the cluster)
Used for Big Data processing, manipulation, analytics, indexing, transformation, etc.
Apache Hadoop is an open-source framework which can handle Big Data workloads
Includes other frameworks such as Spark, HBase, Presto, Flink, Hive, Pig, etc.
EMR Notebooks can be used along with EMR clusters running Apache Spark to create and open Jupyter Notebook and JupyterLab interfaces from the AWS console
Amazon EMR
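Because EMR is not serverless, you specify the cluster size yourself when launching. A hedged sketch of a minimal `RunJobFlow` request via boto3; the release label, instance types, roles and log bucket are illustrative values:

```python
def emr_cluster_spec(name, node_count):
    # Minimal RunJobFlow parameters for a Spark + Hive cluster on EMR.
    # Release label, instance type, roles and log bucket are assumptions.
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,   # not serverless: you pick the size
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        "LogUri": "s3://my-emr-logs/",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def launch(spec):
    # boto3 imported here so the sketch runs without AWS credentials installed
    import boto3
    return boto3.client("emr").run_job_flow(**spec)

spec = emr_cluster_spec("tech-talk-demo", 3)
```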
- 32. AWS DATA PIPELINE
A task scheduling framework
It is a service which helps move data from one data storage service to another at specified intervals
For example, moving EC2 logs to S3 on a schedule and then analyzing them in an EMR cluster
Destinations: S3, RDS, DynamoDB, Redshift, EMR
Manages task dependencies
Cross-region replication
Precondition checks
Highly available
AWS Data Pipeline
- 33. AWS ATHENA
It is an interactive query service for S3 using SQL
Supported data formats in S3: JSON, CSV, ORC, Parquet, Avro
Of these, Parquet and ORC have columnar formats, while Avro has a row-based format
Using columnar formats such as Parquet and ORC can reduce costs by 30%-90% while improving performance at the same time
No need to load data into it; it queries S3 in place
It is serverless
It uses Presto under the hood
Amazon Athena
- 34. ATHENA WITH GLUE
Athena can be integrated with Glue to give structure to unstructured data
Glue adds table definitions and columns to that unstructured data using the Glue Data Catalog
This is a best practice, since Glue can serve as the unified data repository for your data analytics applications
Partitioning your data within S3 can reduce query latency drastically
Do not use Athena for highly formatted visualizations / reports (QuickSight can do that for you)
- 35. AWS REDSHIFT
It is a fully managed, petabyte-scale Data Warehouse solution on AWS
Cost effective compared to other cloud Data Warehouse solutions
Amazon claims that Redshift is up to 10x faster than other Data Warehouses on the market
Since it is a Data Warehouse service, it is specifically designed for OLAP queries
Massively Parallel Processing (MPP) query execution – improves performance
Columnar storage – improves performance
Column-level compression
Can scale up or down
Built-in replication and backups
Monitoring via CloudWatch / CloudTrail
Amazon Redshift
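Loading S3 data into Redshift is typically done with the `COPY` command, which the compute nodes execute in parallel across slices (which is where the MPP speedup comes from). The sketch below only builds the SQL string; the table name, S3 path and IAM role ARN are placeholders:

```python
def copy_statement(table, s3_path, iam_role):
    # Redshift COPY loads files from S3 in parallel across cluster slices.
    # FORMAT AS PARQUET loads columnar Parquet files directly.
    # Table, path and role ARN below are placeholders, not real resources.
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET;"
    )

sql = copy_statement(
    "sales",
    "s3://my-lake/sales/",
    "arn:aws:iam::123456789012:role/RedshiftCopy",  # placeholder ARN
)
```

The statement would then be executed against the cluster with any SQL client or driver.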
- 36. REDSHIFT SPECTRUM
This allows you to query exabytes of unstructured data in S3 without loading it first. You could do this via Glue and Athena, but Spectrum is an alternative that keeps the query inside Redshift
Limitless concurrency
Horizontal scaling
Supports a wide variety of data formats
- 37. AWS QUICKSIGHT
It is an end-user visualization tool providing a fast, easy, cloud-powered business analytics service
It has the ability to:
● Build visualizations
● Perform ad-hoc analysis
● Quickly get business insights from data
● Be accessed from any device (web or mobile)
It is serverless, highly available and highly durable
Data sets are imported into an engine called SPICE. SPICE is a super-fast, parallel, in-memory calculation engine that can accelerate large interactive queries. Each QuickSight user gets 10 GB of SPICE capacity
Pricing – subscription based
Amazon QuickSight
- 38. REFERENCES
AWS Data Analytics White Paper :
https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
Data Lakes and Analytics on AWS:
https://aws.amazon.com/big-data/datalakes-and-analytics/
Big Data Case Studies: https://aws.amazon.com/big-data/use-cases/
Build a Data Lake Foundation with AWS Glue and Amazon S3:
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-
amazon-s3/
Big Data Architectural Patterns and Best Practices (AWS re:Invent 2017):
https://www.youtube.com/watch?v=a3713oGB6Zk