Viet stack 2nd meetup - BigData in Cloud Computing

BigData in Cloud computing
Viet-Trung Tran
@Vietstack
Sunday 1 February 15

Bio
Viet-Trung Tran
trungtv@soict.hust.edu.vn
https://www.facebook.com/groups/BigDataStartUp/
SoICT, Trendiction S.A Luxembourg, Microsoft Research Cambridge,
INRIA France, BKAV

Google trends
Google MapReduce paper 2004

BigData in science

The Data Science: The 4th Paradigm
for Scientific Discovery
Thousand years ago: description of natural phenomena
Last few hundred years: Newton’s laws, Maxwell’s equations…
  $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
Last few decades: simulation of complex phenomena
Today and the Future
Credits: Dennis Gannon

What’s BigData
Data has always been big. What differs now from the past is the sheer
scale and accessibility of data, a direct result of the speed at
which data can now be processed. Big Data is therefore an
all-encompassing term for any collection of large data sets that were
once difficult to process.
Big data requires exceptional technologies to efficiently process large
quantities of data within tolerable elapsed times.

Data mining -> BigData mining?

Simplified BigData stack
Data analytics & visualization
Data processing frameworks (Streaming, MapReduce, BSP model)
Data management systems (BlobSeer)

BigData management

The last 25 years of commercial DBMS development can be summed
up in a single phrase: "one size fits all". This phrase refers to the fact
that the traditional DBMS architecture (originally designed and
optimized for business data processing) has been used to support
many data-centric applications with widely varying characteristics and
requirements. In this paper, we argue that this concept is no longer
applicable to the database market, and that the commercial
world will fracture into a collection of independent database
engines, some of which may be unified by a common front-end

Why NoSQL
“The whole point of seeking alternatives [to RDBMS systems] is that you need to
solve a problem that relational databases are a bad fit for.” Eric Evans,
Rackspace
ACID does not scale
Web applications have different needs:
  Scalability
  Elasticity
  Flexible schema / semi-structured data
  Geographic distribution
Web applications do not always need:
  Transactions
  Strong consistency
  Complex queries
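
To make "flexible schema / semi-structured data" concrete, here is a minimal sketch of a schemaless document store. The `DocumentStore` class and its `put`, `get`, and `find` methods are hypothetical names invented for illustration; they are not the API of any real NoSQL product, and the sketch ignores distribution, persistence, and consistency entirely.

```python
from typing import Any, Optional


class DocumentStore:
    """Toy in-memory document store (hypothetical, for illustration only).

    Documents are plain dictionaries; no schema is declared, so two
    documents may carry completely different fields.
    """

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def put(self, key: str, doc: dict[str, Any]) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._docs[key] = dict(doc)

    def get(self, key: str) -> Optional[dict[str, Any]]:
        # Simple key lookup: the only access path a pure key-value store offers.
        return self._docs.get(key)

    def find(self, **criteria: Any) -> list[str]:
        # Return the sorted keys of documents whose fields equal all criteria.
        # A document that lacks a requested field simply does not match.
        return sorted(
            key
            for key, doc in self._docs.items()
            if all(field in doc and doc[field] == value
                   for field, value in criteria.items())
        )


store = DocumentStore()
store.put("u1", {"name": "An", "city": "Hanoi"})
store.put("u2", {"name": "Binh", "followers": 120})  # different fields, no migration
```

Note what is missing compared with an RDBMS: no joins, no multi-document transactions, no schema enforcement. That is the trade the slide describes.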

Big Data processing engines
MapReduce
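
As a reminder of the programming model, the following is a single-process sketch of the three MapReduce phases (map, shuffle, reduce) applied to word counting. The function names are illustrative assumptions, not Hadoop's API; a real engine runs the map and reduce functions in parallel across many machines and performs the shuffle over the network.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word of one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cloud"])
```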

Stream processing
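
Unlike a batch job, a stream processor emits results while data keeps arriving. A minimal sketch of one common building block, the tumbling (fixed-size, non-overlapping) window, follows. It assumes events arrive in timestamp order as `(timestamp_seconds, key)` pairs; the function name is hypothetical, and real engines additionally handle late and out-of-order events.

```python
from collections import Counter
from typing import Iterable, Iterator


def tumbling_window_counts(
    events: Iterable[tuple[float, str]], window_seconds: float
) -> Iterator[tuple[int, Counter]]:
    """Yield (window_index, per-key counts) each time a window closes.

    Assumes events are ordered by timestamp (an assumption of this sketch).
    """
    current_window = None
    counts: Counter = Counter()
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            # The event belongs to a later window: emit the finished one.
            yield current_window, counts
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_window, counts


events = [(0.5, "a"), (1.2, "b"), (9.9, "a"), (10.0, "a"), (25.0, "c")]
windows = list(tumbling_window_counts(events, window_seconds=10))
```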

Large scale graph processing
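
Large-scale graph engines commonly follow the BSP (bulk synchronous parallel) model: vertices exchange messages in supersteps separated by barriers. The sketch below simulates this on one machine to compute connected components by propagating the smallest vertex id. The function name and the adjacency-dictionary input are assumptions made for illustration; the graph must be undirected, with every edge listed in both directions and every vertex present as a key.

```python
def connected_components(edges: dict[int, list[int]]) -> dict[int, int]:
    """Label every vertex with the smallest vertex id of its component."""
    labels = {vertex: vertex for vertex in edges}

    # Superstep 0: every vertex sends its own label to all neighbours.
    inbox: dict[int, list[int]] = {vertex: [] for vertex in edges}
    for vertex, neighbours in edges.items():
        for neighbour in neighbours:
            inbox[neighbour].append(labels[vertex])

    # Later supersteps: a vertex that learns a smaller label adopts it and
    # forwards it; the computation halts when no messages are in flight.
    while any(inbox.values()):
        outbox: dict[int, list[int]] = {vertex: [] for vertex in edges}
        for vertex, messages in inbox.items():
            if messages and min(messages) < labels[vertex]:
                labels[vertex] = min(messages)
                for neighbour in edges[vertex]:
                    outbox[neighbour].append(labels[vertex])
        inbox = outbox  # barrier: messages become visible in the next superstep
    return labels


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
components = connected_components(graph)
```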

Vanilla Hadoop ecosystem

Hortonworks Data Platform

Hadoop ecosystem: Microsoft HDInsight

BigData & Cloud
A Match made in heaven?

Cloud features

Data in the Clouds
IDC estimates that by 2020 about 40% of data
globally will be touched by cloud computing.
Cloud adoption is accelerating: the number of
objects stored in Amazon Web Services (AWS) S3
cloud storage jumped from 262 billion
in 2010 to over 1 trillion by the end of the
second quarter of 2012.

While enterprises often keep their most sensitive data in-house, huge
volumes of data such as social media data may be located externally.
Data that is too big to process is usually also too big to transfer
anywhere, so it is the analytical program that should be moved,
not the data.
"You don't want to be shipping terabytes and petabytes around."
"Keep the data where it is, and then you move the analytics … to that
data."
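
A back-of-the-envelope calculation shows why moving the analytics beats moving the data. The sketch below assumes a sustained 1 Gbit/s link and decimal units (1 TB = 10^12 bytes); both the bandwidth and the 10 MB program size are illustrative assumptions, not measurements.

```python
TB = 10**12          # bytes in a terabyte (decimal)
PB = 10**15          # bytes in a petabyte (decimal)
GBIT_PER_S = 10**9   # assumed sustained link speed in bits per second


def transfer_time_seconds(size_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retransmissions."""
    return size_bytes * 8 / bandwidth_bits_per_second


one_terabyte = transfer_time_seconds(TB, GBIT_PER_S)          # 8,000 s, about 2.2 hours
one_petabyte = transfer_time_seconds(PB, GBIT_PER_S)          # 8,000,000 s, about 93 days
program = transfer_time_seconds(10 * 10**6, GBIT_PER_S)       # a 10 MB job: under 0.1 s
```

Under these assumptions, shipping the program is roughly eight orders of magnitude faster than shipping a petabyte of input.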

Cloud enables BigData
Some of the first adopters of big data in cloud computing are users that deployed Hadoop clusters in highly scalable and elastic clouds: IBM, Azure, AWS
Cloud computing democratizes big data: any enterprise can now work with unstructured data at a huge scale.
Analytics-as-a-service (AaaS) models for cloud-based big data analytics

Drivers for big data on cloud adoption
Cost reduction
  Cloud-based big data infrastructure is cost-effective, scalable, and fast to build.
Rapid provisioning/time to market
  Faster provisioning matters for big data applications because the value of data
  decays quickly over time.
Flexibility/scalability
  Big data analysis, especially in the life sciences industry, requires huge compute
  power for a brief amount of time. For this type of analysis, servers need to be
  provisioned in minutes.
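
The cost argument for short, bursty workloads can be sketched with simple arithmetic. Every number below (server count, prices, lifetime, run frequency) is a hypothetical placeholder chosen only to show the shape of the comparison; substitute real quotes before drawing conclusions.

```python
def on_demand_cost_per_year(
    servers: int, hours_per_run: float, runs_per_year: int, price_per_server_hour: float
) -> float:
    """Yearly cost of renting servers only while the analysis runs."""
    return servers * hours_per_run * runs_per_year * price_per_server_hour


def owned_cost_per_year(
    servers: int, purchase_price: float, lifetime_years: float, yearly_running_cost: float
) -> float:
    """Yearly cost of owning the same servers: amortized purchase plus running cost."""
    return servers * (purchase_price / lifetime_years + yearly_running_cost)


# Hypothetical scenario: 500 servers needed for 3 hours, once a month.
cloud = on_demand_cost_per_year(
    servers=500, hours_per_run=3, runs_per_year=12, price_per_server_hour=0.50
)
owned = owned_cost_per_year(
    servers=500, purchase_price=3000, lifetime_years=3, yearly_running_cost=500
)
```

With these made-up inputs, the rented cluster costs 9,000 per year against 750,000 for the owned one. The gap narrows as utilization rises, which is why steady, always-on workloads are a weaker fit for this argument.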

BigData is not always Cloud-appropriate
Low latency realtime data
  Virtualization overhead
  Multi-tenancy overhead
Scalability
  Lack of cloud computing features to support RDBMS
Availability
  “Rain cloud” incorporates clouds
Data integrity/privacy
  Data must be accessible only to authorized users
  Currently, most researchers rely on encryption to ensure data privacy in the cloud

NoSQL vs SQL in the Cloud

Data security/performance trade-offs
Distributed nodes
Distributed data
Internode communication
RPC over TCP/IP?
Encrypted IO?
Security/performance trade-offs

Cloud Architecture for Big Data
Resource scheduling and SLA for Big Data on Cloud
Storage and computation management in Cloud for Big Data
Large-scale data intensive workflow in support of Big Data processing on Cloud
Multiple source data processing and integration on Cloud
Virtualisation and visualisation of Big Data on Cloud
Fault tolerance and reliability for Big Data processing on Cloud
MapReduce with Cloud for Big Data processing
Distributed file storage system with Cloud for Big Data
Inter-cloud technology for Big Data
Security, privacy and trust in Big Data processing on Cloud
Green, energy-efficient models and sustainability issues in Cloud for Big Data processing
Cloud infrastructure for social networking with Big Data
User friendly Cloud access for Big Data processing
Innovative Cloud data centre networking for Big Data
Wireless and mobility support in Cloud data centre for Big Data

BigData use cases

Security Analytics

Thank you for your attention

8 big trends in big data analytics
http://www.computerworld.com/article/2690856/8-big-trends-in-big-data-analytics.html

Reference
http://www.oracle.com/us/corporate/profit/big-ideas/012314-spasalapudi-2112687.html
https://gigaom.com/2014/10/15/cloud-computing-is-going-to-absorb-your-big-data-workloads-too/

Classification of BigData

Relationship between Cloud and BigData

Open research issues
Data staging
Distributed storage systems: NoSQL, NewSQL
Data analysis
Data security

Unfortunately, it’s not all good news.
DB administrators don’t have an easy ride. The NoSQL databases
that have appeared in the last few years, with their key-value pairs,
document stores, and missing schemas,
