SlideShare a Scribd company logo
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 1
Mohammed Guller
Oct 02, 2016
Introduction to Big Data
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 2
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
Agenda
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 4
About Me
• Engineering Manager / Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and
machine learning
• www.linkedin.com/in/mohammedguller
• @MohammedGuller
4
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 6
• Hands-on guide with lots of examples
• Covers both fundamental and advanced
topics such as machine learning
• Includes a primer on functional
programming and Scala
• Introduces other important Big Data
technologies such as HDFS, Parquet,
Kafka, HBase, Cassandra, Mesos, and
YARN
Big Data Analytics with Spark
Available on Amazon

Recommended for you

Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...

Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail. Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at. Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, Deutshe Bahn. Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada. Languages: Java, Python, JRuby, JEE , PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON Databases: Oracle, MS SQL Server, MYSQL, PostreSQL Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Japser Reports, Alfresco, Yslow, Terracotta, Toad, SoapUI, Dozer, Sonar, Git Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax , Xstream Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1

# hadoop # enterprisehadoop
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives

This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.

clouderahadoop for governmenthadoop
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016

At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms. As part of this talk, we will be sharing our journey of transformation showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development, provisioning data through APIs, streams and deploying models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale, integrating analytics with our core product platforms to turn data into actionable insights.

© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 7
About Glassbeam
Glassbeam brings structure and meaning to data from any connected machine or device while providing
actionable intelligence
Cloud based analytics platform that helps
organizations turn raw machine data to insights
Making sense of multi
structured machine data
 Data center devices
 Medical devices
 Sensors
 ATMs
 Automobiles
 Data from any machine
Providing comprehensive set of apps
& tools for machine data analysis
 50,000+ systems being tracked today
 1,500+ different software rev codes
 1.2 Billion sensor readings per day
 1+ Trillion sensor readings tracked
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 8
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 9
Data Growing At a Faster Pace Than Ever
9
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 10
Internet of Things (IoT)
• Network of objects embedded with
software for collecting and sending data
over the Internet
• 5x more connected things than people by
2020

Recommended for you

Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution

This presentation suggests the top 5 things architects and IT managers need to look for in a big data solution.

solrdatastaxapache cassandra
Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?

Industry thought leaders Gaurav Dhillon and David Linthicum discuss the future of cloud integration and data management in the API economy. Topics from this webinar and the accompanying slides include: key considerations of today's CIOs, approaching the reality of the multi-cloud world and new solutions for managing cloud and on-premise data. To learn more, visit: http://www.snaplogic.com/.

data integrationenterprise cloud integrationcloud computing
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

Non-interactive big-data analysis prohibits experimentation and can interrupt the analyst’s train of thoughts but analyzing and drawing insights in real time is no easy task with jobs often taking minutes/hours to complete. What if you want to put a interactive interface in front of that data that allows iterative insights? What if you need that interactive experience to be sub second? Traditional SQL and most MPP/NoSQL databases cannot run complex calculations over large data in a performant manner. Popular distributed systems such as Hadoop or Spark can execute jobs but their job overhead prohibits sub second response times. Learn how an in-memory computing framework enabled us to perform complex analysis jobs on massive data points with sub second response times — allowing us to plug it into a simple, drag-and-drop web 2.0 interface.

fast databig dataimcsummit
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 11
Industrial IoT
• Manufacturing
• Automotive
• Medical
• Data Center
• EVC
• Smart Meter
11
Glassbeam target market is focused on driving opera onal & business
analy cs value for connected product companies in Industrial IoT market
IT & Networks Medical & Health Care
Transporta on
EV Chargers & Smart Grid
Industrial & Mfg
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 12
Key Attributes of Big Data
Volume
Scale of Data
Variety
Diversity of Data
Velocity
Speed of Data
•
•
•
•
•
•
•
•
•
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 13
Big Data Comes with Big Challenges
• Storage
• Processing
• Value
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 14
Storage Challenges
• Legacy SAN / NAS storage devices are expensive
• Traditional RDBMS were not designed for Big Data
• Cannot handle volume, velocity, variety of Big Data
14

Recommended for you

MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson

This document discusses best practices for using Hadoop as an enterprise data hub. It provides an overview of how big data is driving new analytical workloads and the need for deeper customer insights. It discusses challenges with analyzing new sources of structured, unstructured and multi-structured data. It introduces the concept of a Hadoop enterprise data hub and data refinery to simplify access to new insights from big data. Key components of the data hub include a data reservoir to capture raw data from various sources, a data refinery to cleanse and transform the data, and publishing high value insights to data warehouses and other systems.

big datadata hubwebinar
How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016

Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success? This session will discuss some of the items to consider when getting started with Hadoop and how to go about making the decision to move to the de facto big data platform. Starting small can be a good approach when your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well known data sets on virtual machines can provide a low cost and effort implementation to know if your big data journey will be successful with Hadoop.

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture

This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.

hadooplambda architecturefastdata
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 15
Processing Challenges
• Diverse processing
• Organizations want do more than just BI / traditional analytics
• Go beyond SQL queries
• Timeliness
• Process data in reasonable amount of time
• Value of data decreases over time
15
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 16
How Much Data Can a Standard Server Process
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 17
•
•
17
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 18
• Large number of CPUs / cores
• Faster cores
• Large amount of memory
• Faster memory bus
• High-performance architecture
Scale-up with Powerful High-end Server
18

Recommended for you

Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment

Sean McKeown, Technical Solutions Architect discusses big data architecture and deployment at Cisco Connect Toronto.

cisco systemscisco connect toronto 2015technology innovation
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data

If you missed Strata + Hadoop World, you missed quite a bit. This year's event was packed with Big Data practitioners across industries who shared their experiences and how they are driving new innovations like never before. Just because you weren't there, doesn't mean you missed out. In this session, we'll touch on a few of the key highlights from the show, including: Key trends in Big Data adoption The enterprise data hub How the enterprise data hub is used in practice

enterprise data hubhadoopbig data hub
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search

Cloudera Tech Day Presentation by Eva Andreasson, Director Product Management, Cloudera. Text-based search recently has become a critical part of the Hadoop stack, and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for enabling next-generation big data analytics applications.

clouderasolrsearch
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 19
Disadvantages of Scale-up Architecture
• Proprietary
• Expensive
• Limited scalability
19
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 20
• Cluster of servers
• Commodity machines
• Pool together resources
• CPU
• Memory
• Disk
Scale-out Architecture
20
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 21
Benefits of Scale-out Architecture
• Relatively inexpensive
• Economical to scale
• No huge upfront investment
• Start small and expand cluster as workload increases
21
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 22
Challenges With Scale-out Architecture
• Writing distributed applications is very hard
• Split job into chunks that can be distributed across a cluster
• Schedule compute resources among different jobs
• Manage inter-node communication
• Handle network and node failures
• Hardware failures are more common at a cluster level
• Probability of a single node failing is low
• Probability of any one node in a large cluster failing is high
22

Recommended for you

Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos

This document discusses ING NL's efforts to create a data lake architecture using Hadoop to integrate all of the bank's data sources onto a single processing platform. The data lake aims to collect data in a unified format, securely store it to prevent manipulation and unauthorized access, and make it available for analytical applications. Some of the challenges discussed include managing security, aligning with legacy systems, and facilitating interdepartmental cooperation on agile delivery. The presentation focuses on one part of the data lake, the archive, and how a Hadoop cluster can effectively address the goals of collecting, storing, and accessing data for business intelligence and data science purposes.

apache hadoophadoop summit
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...

The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.

dataworks summitdws17dataworks summit 2017
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. - Cloudera's Data Operating System (CDH) is an enterprise-grade distribution of Apache Hadoop that includes additional components for management, security, and integration with existing systems. - CDH enables enterprises to leverage Hadoop for data agility, consolidation of structured and unstructured data sources, complex data processing using various programming languages, and economical storage of data regardless of type or size.

hadoop in productionapache hadoophadoop enterprise
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 23
Getting Value Out of Big Data
• Traditional analytics / BI
• Custom processing
• Machine Learning
• Predictive analytics
• Automate complex tasks
• Stream processing
• Analyze in real-time/near real-time
• React in real-time
23
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 24
Traditional Analytics / BI
• What
• Customer growth for the last month/quarter/year
• Segmentation of customers by demographics
• Average time spent by mobile app users
• Why
• Sales growth slowed
• regional issue
• supply issue
• Profit dropped
• revenue dropped
• expenses increased
24
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 25
Custom Processing
• Index web pages
• Google
• Bing
• Process genome data
• Identify mutations linked to cancer, Alzheimer's and other disease
• Click analysis
• Log analysis
• 360-degree real time view of a customer
25
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 26
Predictive Analytics
• Advertisements that a visitor will most likely click
• Movies / songs / news that a customer will like
• Products that a customer will buy
• Patient will have an heart attack
26

Recommended for you

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics

Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.

hadoop summit
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark

This document discusses big data analytics using Spark. It provides an overview of the history and growth of data from the 1980s to present. It then demonstrates how to perform word count analytics on text data using both traditional MapReduce techniques in Hadoop as well as using Spark. The code examples show how to tokenize text, count word frequencies, and output results.

big dataanalyticsspark
Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014

This document discusses the big data market and winners and losers. It finds that traditional companies like Oracle, Teradata, and SAP are under pressure as newer big data technologies like Hadoop and NoSQL have seen rapid growth. While the big data market is expected to be worth over $50 billion by 2017, practitioners have faced barriers adopting big data within their organizations. Overall, practitioners are seen as the biggest winners from leveraging big data, though the market remains in early stages.

big datahadoopanalytics
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 27
• Virtual assistant
• Siri
• Google Now
• Autonomous machine
• Self-driving car
• Robots
• Tag Images
• Facebook
• Flickr
• Expert System
• Medical diagnosis
• Personalized medicine
• Security
• Fraud detection
• Network Security
• Music recognition
• Shazam
• SoundHound
Automate Complex Tasks
27
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 28
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 29
•
•
•
•
•
•
29
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 30
•
•
•
•
•
•
30

Recommended for you

Create your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouseCreate your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouse

The document discusses big data market trends and provides advice on how organizations can develop a big data strategy and implementation plan. It outlines a 5 step approach for modernizing an organization's data warehouse with new big data technologies: 1) enhancing the data warehouse with unstructured data, 2) extending it with data virtualization, 3) increasing scalability with MPP databases, 4) accelerating analytics with in-database processing, and 5) creating an operational data store with Hadoop. The document also provides tips for selecting big data vendors, such as evaluating a vendor's ability to integrate with existing systems and make analytics accessible to both power users and business users.

data warehousehadoop
Steps towards a Data Value Chain
Steps towards a Data Value ChainSteps towards a Data Value Chain
Steps towards a Data Value Chain

This document discusses steps towards a data value chain, including big data, public open data, and linked (open) data. It provides definitions and examples for each topic. For big data, it discusses the large volumes of data being created and challenges in working with such data. For public open data, it outlines principles like completeness and ease of access. It also shows examples of apps using open government data. For linked open data, it discusses moving from a web of documents to a web of interconnected data through using URIs and typed links. It also shows the growth of the linked open data cloud over time.

linked data
Becoming a Data Driven Organisation
Becoming a Data Driven OrganisationBecoming a Data Driven Organisation
Becoming a Data Driven Organisation

Fully embracing a BI tool can mean the difference between the full payoff of your data analytics and returns that are just so-so. Learn how to avoid BI pitfalls and boost BI adoption to become a truly data driven organisation.

analyticsbig databusiness intelligence
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 31
• Text
• CSV
• JSON
• XML
• Binary
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar
(ORC)
File Formats
31
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 32
• Hive
• Spark SQL
• Impala
• Presto
• Drill
• Phoenix
• HAWQ
• Tajo
Distributed SQL Query Engine
32
Data Warehouse
Distributed
Storage
Distributed
Query Engine
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 33
•
•
•
•
•
•
33
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 34
•
•
•
34

Recommended for you

#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"

Talk at #BigDataCanarias (June 16, 2014)

big dataanalyticsdata science
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details

Are you lost between web pages and links to big data ? I've collected all about big data and Hadoop togeather.

dataanalyticshadoop
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...

Title: BigData, AllData, Old Data: Predictive Analytics in a Changing Data Landscape Abstract: The landscape of the platform, access methodologies, shapes, and storage representations has changed dramatically. Much of the assumptions of a structured data world dominated by relational databases have been rendered obsolete. Today’s data analyst faces big challenges and a bewildering environment of technologies and challenges involving semi-structured and unstructured data with access methodologies that have almost no relation to the past. This talk will cover issues and challenges in how to make the benefits of advanced analytics fit within the application environment. The requirement for Real-time data streaming and in situ data mining is stronger than ever. We demonstrate how many of the critical problems remain open with much opportunity for innovative solutions to play a huge enabling role. This opportunity extends equally well to Knowledge Management and several related fields.

data sciencedata miningbigdata predictive analytics
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 35
•
•
35
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 36
Publish – Subscribe / Messaging Systems
• Kafka
• RabbitMQ
• ActiveMQ
• ZeroMQ
36
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 37
• Batch
• Hadoop MapReduce
• HPCC
• Stream
• Kafka Streams
• Heron
• Storm
• Samza
• Batch and Stream
• Spark
• Flink
• Beam
• Apex
• Ignite
Big Data Computing Frameworks
37
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 38
Big Data
Big Data Technologies
Kafka
Hadoop
Spark

Recommended for you

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016

This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.

hadoopbigdata
Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture

The document discusses the importance of a data-driven culture for businesses. It provides the following key points: 1. Research has shown that companies that emphasize data-driven decision making have 5-6% higher productivity and output than comparable companies. This relationship also appears in other financial metrics like return on equity. 2. Data science draws from various fields like operations research, probability theory, analytics, and computer science. It is used for optimal decision making, handling uncertainties, generating insights from data, and implementing analytical solutions. 3. When adopting a data-driven approach, companies should focus on specific business goals and KPIs rather than just collecting data. Iterative testing is also important to measure impact

data sciencedata-drivenorganization culture
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data

This presentation introduces concepts of Big Data in a layman's language. Author does not claim the originality of the content. The presentation is made by compiling from various sources. Author does not claim copyrights or privacy issues. Big data is exponentially rising in today's age of information and digital shrinkage. This presentation potentially clears the concept and revolving hype around it.

hadoopcomputersoftware
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 39
• Distributed publish-subscribe
messaging system
• Partitioned and replicated
commit log service for
building distributed datastore
Kafka
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 40
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
40
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 41
•
•
•
•
•
•
•
•
41
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 42
•
•
•
•
•
•
42

Recommended for you

How to reach a Data Driven culture
How to reach a Data Driven cultureHow to reach a Data Driven culture
How to reach a Data Driven culture

Presentation about how to reach a data driven culture. I gave the presentation at the Data Driven Digital Marketing Event 26 September 2016

data driven marketingculturedigital marketing
The big data value chain r1-31 oct13
The big data value chain r1-31 oct13The big data value chain r1-31 oct13
The big data value chain r1-31 oct13

Big Data 101 - Originally presented during a seminar when the idea behind big data was just beginning to catch on.

Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015

This presentation by Gartner discusses big data industry insights and trends. It provides an overview of organizations' investments in big data technology, the challenges they face in adoption, the types of big data being analyzed now and planned for the future, and examples of how different industries are using big data to address key business problems.

big dataanalyticsgartner
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 43
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 44
Hadoop
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 45
•
•
•
•
45
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 46
•
•
•
•
•

Recommended for you

Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning

This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential but is very difficult to realize that potential as it requires strong mathematics skills.

machine learningdata sciencebig data
Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2

Datameer 6 is completely re-imagining the user experience for modern BI, helping you deliver new insights faster and more results to your data-hungry business. During this one-hour webinar, we demonstrated all that's new with Datameer 6 and how you can: Discover answers to a new range of business questions using an iterative, exploratory approach Find answers faster and deliver more insights with a new faster analytic workflow Utilize Spark to speed analytic processing time without needing to know technical details Watch this on-demand webinar, with special guest speaker, Sean Anderson, Senior Product Marketing Manager, from Cloudera, who discusses Cloudera's view of the Hadoop data processing stack and how the market place is benefiting from Spark.

C1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyC1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategy

The document discusses Oracle's approach to enterprise cloud strategy. It notes that most public and private cloud offerings are incompatible and incomplete for enterprise needs. Oracle proposes an integrated cloud solution providing enterprise SaaS, PaaS and IaaS that can span both private and public clouds. This would allow enterprises to run applications and workloads across their on-premises infrastructure and Oracle's public cloud platform. Oracle argues this integrated approach is needed to bring true cloud agility to enterprise applications and IT.

© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 47
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 48
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 49
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 50
Hadoop is Not a Single Product
50

Recommended for you

Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics

Learn how to get started with Big Data using a platform based on Apache Hadoop, Apache Spark, and IBM BigInsights technologies. The emphasis here is on free or low-cost options that require modest technical skills.

hadoopbiginsightsinterconnect
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?

Data integration is just plain hard and there is no magic bullet. That said, three new data integration techniques do ameliorate the misery, making silo-busting possible, if not trivial. The three approaches – data lakes, virtual databases (aka federated databases), and data hubs – are a boon to organizations big enough to have separate systems, separate lines of business, and redundant acquired or COTS data stores. Each approach has its place, but how do you make the right decision about which data silo integration approach to choose and when? This webinar describes how you can use the key concepts of data Movement, Harmonization, and Indexing to determine what you are giving up or investing in, and make the best decision for your project.

nosqldata hubdata lake
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020

This session will describe and demonstrate the longstanding integration between Couchbase Server and Apache Kafka and will include descriptions of both the mechanics of the integration and practical situations when combining these products is appropriate.

developerintermediatefinancial services
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 51
Hadoop Core Components
51
=
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 52
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 53
•
•
•
53
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 54
•
•
•
•
•
•
•
54

Recommended for you

Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture

Oracle Itay Systems Presales Team presents : Big Data in any flavor, on-prem, public cloud and cloud at customer. Presentation done at Digital Transformation event - February 2017

oracleoracle cloud machinecloud@customer
How Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data CenterHow Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data Center

The keynote presentation discusses how cloud providers are impacting traditional data centers. It notes that as companies grow from startups to established enterprises, their hosting needs change from fully public cloud to hybrid models. The presentation outlines the tradeoffs of different hosting options like owning your own data center, colocation, managed hosting, and public cloud. It argues that a hybrid multi-cloud approach combining on-premises, dedicated, managed, public and other specialty clouds provides the most flexibility, cost savings, and ability to put the right workload in the right environment. Case studies are presented showing how hybrid cloud delivered major cost reductions and performance gains for Explore.org and enabled critical security and compliance requirements for Samsung. The presentation concludes that

data centercloudhybrid cloud
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels

3 Things to Learn: -How data is driving digital transformation to help businesses innovate rapidly -How Choice Hotels (one of largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business -How Choice Hotels has transformed business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements

businessdigital transformationon-premise
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 55
Adoption of Spark is Growing Rapidly
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 56
Spark
Fast, easy-to-use, general-purpose cluster computing framework
for processing large datasets using a simpler programming
model
56
• • •
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 57
Benefits
• Scale
• Fault-tolerance
• Abstracts distributed computing
• Hides the messy details of writing distributed applications
• Allows developers to just focus on the data processing logic
• Same code works on a laptop or a cluster of servers
• Ease-of-use
• Speed
• Flexibility
57
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 58
Easy To Use
• Library with an expressive API
• Scala, Java, Python, R
• RDD API with 80+ operators (MR has only two)
• Dataset/DataFrame API
• Interactive development
• spark-shell
• notebooks
58

Recommended for you

Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs

Hyper-converged systems offer a great deal of promise and yet come with a set of limitations. While they allow enterprises to re-integrate system components into a single enclosure and reduce the physical complexity, floor space and cost of supporting a workload in the data center, they also often will not support existing storage in local SANs or offered by cloud service providers. There are solutions available to address these challenges and allow hyper-converged systems to realize their promise. During this session you will learn: - What are hyper-converged systems? - What challenges do they pose? - What should the ideal solution to those challenges look like? - About a solution that helps integrate hyper-converged systems with existing SANs

vsanvirtual sansoftware
Journey to the Cloud: What I Wish I Knew Before I Started
 Journey to the Cloud: What I Wish I Knew Before I Started Journey to the Cloud: What I Wish I Knew Before I Started
Journey to the Cloud: What I Wish I Knew Before I Started

Moving to the cloud can raise more questions than answers. Do I move and improve to Infrastructure as a service, or redesign my business processes on Software as a Service. This presentation covers several cloud migrations to OAC, EBS/Cloud (HCM/Financials) and Hyperion, and outlines what went well and what could have gone better.

oracle ebscloudhcm
Building the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing MicroservicesBuilding the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing Microservices

One of the challenges that comes from deploying multi-tiered distributed systems, or microservices, atop a dynamic scheduler is the introduction of new problems surrounding load balancing. There are some inherent challenges in building a load balancer that's meant to operate in a highly available way, without any single points of failure. In this talk, Sargun Dhillon will walk through the distributed load balancing mechanism that he built for Mesos. This service discovery mechanism is meant to have the same kinds of features, api, and availability that existed in legacy, statically partitioned environments. The purpose of this is to ease the transition, and remove some of the largest road blocks in moving applications over to modern datacenters. In addition, he will speak to why he built it as opposed to other alternatives for service discovery and load balancing such as using Zookeeper, and the challenges that came from it. We built a library called Lashup that has a membership protocol, a multicast layer, failure detector, and CRDT key/value store. This has allowed us to build applications that orchestrate Mesos clusters with great ease.

sdnnetworkingerlang
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 59
• Batch processing
• Interactive analytics
• Stream analysis
• Machine learning
• Graph analytics
Integrated Libraries For a Variety of DP Tasks
Spark Core
Spark
SQL
GraphX
Spark
Streaming
MLlib
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 60
Benefits of a Unified Platform
• Solve a variety of problems with a single toolkit
• No need to learn different tools for each use case
• Avoid code and data duplication
• Achieve operational simplicity
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 61
Why is Spark Fast
• Advanced job execution engine
• Allows applications to cache data in memory
61
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 62
Advanced Job Execution Engine
• Directed Acyclic Graph (DAG) of stages
• simple job can contain just one stage
• complex job can contain many stages
• eliminates expensive operations between multiple jobs
• synchronization
• serialization/deserialization
• disk I/O
• Lazy operator evaluation
• Pipelined operations
62

Recommended for you

Journey to analytics in the cloud
Journey to analytics in the cloudJourney to analytics in the cloud
Journey to analytics in the cloud

We all know that consumer behavior has changed dramatically. How consumers engage with companies, do research and even purchase leaves a deluge of data that companies have never had. Those companies that can parse that data drive business results like never before. This session presentation at Dog Food Con 2016 helps you to learn how Big Data technology can drive business outcomes from data ingestion to cloud and talks about one company’s journey to Customer 360 and their decision process when moving to the cloud.

big datacloud computinginsurance analytics
Emerging trends in data analytics
Emerging trends in data analyticsEmerging trends in data analytics
Emerging trends in data analytics

- Deep learning (Tensorflow) and microservices (Kubernetes, Docker, Kafka) are emerging trends. - While batch processing is still dominant, stream processing is gaining traction with technologies like Kafka, Flink, and Beam. - Python has surpassed Java as the most popular language for data analytics, according to Stack Overflow trends.

DC/OS 1.8 Container Networking
DC/OS 1.8 Container NetworkingDC/OS 1.8 Container Networking
DC/OS 1.8 Container Networking

This document discusses the evolution of data center networking from 2007 to present day. It describes how earlier networks were static with clear divisions between teams, while modern networks are more dynamic with blurred lines between developers and operations. It outlines projects within DC/OS like Mesos-DNS, Minuteman, and Lashup that provide service discovery, load balancing, and a distributed control plane to manage today's complex networks and microservices applications. Future plans include improved security, quality of service, and potential rewriting of operating systems to enable zero-overhead network functions virtualization.

erlangmesospheredcos
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 65
Allows Applications to Cache Data in Memory
•Minimize disk I/O
•Reading data from memory is orders of magnitude
faster than reading from disk
•In-memory data sharing across DAGs
• different jobs can work with the same cached data
65
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 66
Why Caching Makes Applications Run Faster
66
100 MB/s
500 MB/s
10 GB/s
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 67
Read Latency Comparison
67
0
50
100
150
200
1 TB
Time (Min)
Data Read
HDD
SSD
RAM
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 74
Spark Does Not Provide Storage
• Works with a variety of data sources
• No need to import data into Spark
• Scale compute and storage cluster independently

Recommended for you

Erlang containers
Erlang containersErlang containers
Erlang containers

This document discusses the history and development of container networking and service discovery solutions. It describes how Mesosphere developed DC/OS to provide networking features like load balancing and service discovery using Erlang microservices including Spartan, Minuteman, and Lashup. Spartan provides high availability DNS, Minuteman provides distributed load balancing, and Lashup uses HyParView to maintain global network state across the cluster. The document outlines how these services were developed to enable dynamic container networking and service discovery.

distributed systemsnetworkingcontainers
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018

Uptake is the industrial analytics platform that delivers products to major industries to increase productivity, security, safety and reliability. About the Event Launching a successful startup takes more than building on the most flexible, reliable, and scalable infrastructure available today. Startup Day is an opportunity to hear from successful startups about how they've tackled the unique technical challenges in their industry. It's also an opportunity to meet other startup leaders in your community to share ideas, help each other grow, and inspire each other while tackling problems that affect your organization. Who Should Attend? This event is built for early stage (pre-seed & bootstrapped) technical leads and entrepreneurs. Attendees will learn from a technical perspective what did and didn't work well from other startups across a diverse range of industries. Hear what funded and late stage startups wish they knew before they began building. Learn how companies are leveraging SageMaker for optimizing Machine Learning training and deployment, deploying ECS for container orchestration, and using Lambda to build companies that are entirely serverless.

awsamazonwebservicescloud computing
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage

The document discusses how legacy customer data stored in organizations can provide a competitive advantage for training AI/machine learning models and powering personalized customer experiences while ensuring privacy protection. It explains that legacy data is needed to train accurate predictive models, enable cross-channel personalization, and allow for strong governance and control over sensitive customer information. Finally, it states that without access to legacy customer data stores, organizations cannot fully leverage AI/ML to drive predictive marketing, deliver personalized experiences, or comprehensively protect customer privacy.

© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 75
Process Data From a Variety Of Data Sources
And Many More
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 76
Spark Does Not Replace Hadoop
76
= =
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 77
Hadoop is Optional
77
= =
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 78
Ideal Applications
• Complex data processing
• multi-step pipeline
• Iterative algorithm
• Machine Learning
• Graph analytics
• Ad hoc analysis
• Interactive

Recommended for you

Linthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-serviceLinthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-service

The document discusses emerging cloud computing technologies including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Database as a Service. It notes that IaaS is currently the fastest growing cloud service, with Gartner reporting 42.4% growth in 2012. Popular IaaS providers include Amazon Web Services, CloudStack, and OpenStack. PaaS offerings from Google App Engine, Heroku, and Amazon Elastic Beanstalk are analyzed in terms of their approaches and limitations. Best practices for adopting PaaS include considering application requirements, resources, data needs, and interactions beyond the platform.

The Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the cloudsThe Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the clouds

A high level overview of 'The Cloud', Microsoft Windows Azure and experiences building a cloud platform

azuredevelopment operationswindows azure
Introduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use CasesIntroduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use Cases

This document provides an introduction to integration platform as a service (iPaaS) and SnapLogic. It discusses the drivers for iPaaS adoption including big data, hybrid cloud environments, and the need for faster integration. Ten requirements for modern integration are outlined. The document then introduces SnapLogic and its unified platform for connecting applications, data and APIs anywhere through a library of pre-built connectors. Four primary iPaaS use cases are described: hybrid application integration, cloud data warehousing/analytics, big data ingestion/transformation/delivery, and replacing legacy integration platforms.

ipaastechnologyetl tools
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 110110

More Related Content

What's hot

Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
Mark van Rijmenam
 
Big Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and moreBig Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and more
Softweb Solutions
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
DATAVERSITY
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
StampedeCon
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
DataStax
 
Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?
SnapLogic
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
In-Memory Computing Summit
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Technologies
 
How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016
StampedeCon
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
Cisco Canada
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
Cloudera, Inc.
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
Cloudera, Inc.
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
DataWorks Summit
 
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
DataWorks Summit
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Cloudera, Inc.
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
DataWorks Summit/Hadoop Summit
 

What's hot (19)

Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
Big Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and moreBig Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and more
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
 
How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 

Viewers also liked

Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
Mohammed Guller
 
Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014
Jeff Kelly
 
Create your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouseCreate your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouse
Jeff Kelly
 
Steps towards a Data Value Chain
Steps towards a Data Value ChainSteps towards a Data Value Chain
Steps towards a Data Value Chain
PRELIDA Project
 
Becoming a Data Driven Organisation
Becoming a Data Driven OrganisationBecoming a Data Driven Organisation
Becoming a Data Driven Organisation
Wizdee
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
Marcos Colebrook-Santamaria
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
Mahmoud Yassin
 
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
Ranjith Sekar
 
Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture
Johan Himberg
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Karan Desai
 
How to reach a Data Driven culture
How to reach a Data Driven cultureHow to reach a Data Driven culture
How to reach a Data Driven culture
Mark Beekman
 
The big data value chain r1-31 oct13
The big data value chain r1-31 oct13The big data value chain r1-31 oct13
The big data value chain r1-31 oct13
Rei Lynn Hayashi
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015
Den Reymer
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 

Viewers also liked (15)

Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014
 
Create your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouseCreate your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouse
 
Steps towards a Data Value Chain
Steps towards a Data Value ChainSteps towards a Data Value Chain
Steps towards a Data Value Chain
 
Becoming a Data Driven Organisation
Becoming a Data Driven OrganisationBecoming a Data Driven Organisation
Becoming a Data Driven Organisation
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
How to reach a Data Driven culture
How to reach a Data Driven cultureHow to reach a Data Driven culture
How to reach a Data Driven culture
 
The big data value chain r1-31 oct13
The big data value chain r1-31 oct13The big data value chain r1-31 oct13
The big data value chain r1-31 oct13
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 

Similar to Introduction to Big Data

Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
Datameer
 
C1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyC1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategy
Dr. Wilfred Lin (Ph.D.)
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Cynthia Saracco
 
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?
DATAVERSITY
 
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
HostedbyConfluent
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
Riccardo Romani
 
How Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data CenterHow Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data Center
Hostway|HOSTING
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
Cloudera, Inc.
 
Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs
DataCore Software
 
Journey to the Cloud: What I Wish I Knew Before I Started
 Journey to the Cloud: What I Wish I Knew Before I Started Journey to the Cloud: What I Wish I Knew Before I Started
Journey to the Cloud: What I Wish I Knew Before I Started
Datavail
 
Building the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing MicroservicesBuilding the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing Microservices
Sargun Dhillon
 
Journey to analytics in the cloud
Journey to analytics in the cloudJourney to analytics in the cloud
Journey to analytics in the cloud
Saama
 
Emerging trends in data analytics
Emerging trends in data analyticsEmerging trends in data analytics
Emerging trends in data analytics
Wei-Chiu Chuang
 
DC/OS 1.8 Container Networking
DC/OS 1.8 Container NetworkingDC/OS 1.8 Container Networking
DC/OS 1.8 Container Networking
Sargun Dhillon
 
Erlang containers
Erlang containersErlang containers
Erlang containers
Sargun Dhillon
 
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Amazon Web Services
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Precisely
 
Linthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-serviceLinthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-service
David Linthicum
 
The Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the cloudsThe Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the clouds
Mark Rodseth
 
Introduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use CasesIntroduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use Cases
Synerzip
 

Similar to Introduction to Big Data (20)

Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
 
C1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyC1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategy
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
 
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?
 
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
 
How Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data CenterHow Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data Center
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs
 
Journey to the Cloud: What I Wish I Knew Before I Started
 Journey to the Cloud: What I Wish I Knew Before I Started Journey to the Cloud: What I Wish I Knew Before I Started
Journey to the Cloud: What I Wish I Knew Before I Started
 
Building the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing MicroservicesBuilding the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing Microservices
 
Journey to analytics in the cloud
Journey to analytics in the cloudJourney to analytics in the cloud
Journey to analytics in the cloud
 
Emerging trends in data analytics
Emerging trends in data analyticsEmerging trends in data analytics
Emerging trends in data analytics
 
DC/OS 1.8 Container Networking
DC/OS 1.8 Container NetworkingDC/OS 1.8 Container Networking
DC/OS 1.8 Container Networking
 
Erlang containers
Erlang containersErlang containers
Erlang containers
 
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Linthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-serviceLinthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-service
 
The Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the cloudsThe Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the clouds
 
Introduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use CasesIntroduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use Cases
 

Recently uploaded

[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
Amazon Web Services Korea
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
kamli sharma#S10
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
yogita singh$A17
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
simmi singh$A17
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
sanjay singh
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
fatimaezzahraboumaiz2
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
vasudha malikmonii$A17
 
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
 
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
nikita dubey$A17
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
chetankumar9855
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
Amazon Web Services Korea
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
Amazon Web Services Korea
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
taqyea
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
javier ramirez
 
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
shoeb2926
 
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
nehadubay1
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
taqyea
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 

Recently uploaded (20)

[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
 
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...
 
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 

Introduction to Big Data

  • 1. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 1 Mohammed Guller Oct 02, 2016 Introduction to Big Data
  • 2. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 2 Big Data Big Data Technologies Kafka Hadoop Spark Agenda
  • 3. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 4 About Me • Engineering Manager / Principal Architect at Glassbeam • Founded two startups • Passionate about building products, big data analytics, and machine learning • www.linkedin.com/in/mohammedguller • @MohammedGuller 4
  • 4. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 6 • Hands-on guide with lots of examples • Covers both fundamental and advanced topics such as machine learning • Includes a primer on functional programming and Scala • Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN Big Data Analytics with Spark Available on Amazon
  • 5. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 7 About Glassbeam Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence Cloud based analytics platform that helps organizations turn raw machine data to insights Making sense of multi structured machine data  Data center devices  Medical devices  Sensors  ATMs  Automobiles  Data from any machine Providing comprehensive set of apps & tools for machine data analysis  50,000+ systems being tracked today  1,500+ different software rev codes  1.2 Billion sensor readings per day  1+ Trillion sensor readings tracked
  • 6. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 8 Big Data Big Data Technologies Kafka Hadoop Spark
  • 7. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 9 Data Growing At a Faster Pace Than Ever 9
  • 8. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 10 Internet of Things (IoT) • Network of objects embedded with software for collecting and sending data over the Internet • 5x more connected things than people by 2020
  • 9. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 11 Industrial IoT • Manufacturing • Automotive • Medical • Data Center • EVC • Smart Meter 11 Glassbeam target market is focused on driving opera onal & business analy cs value for connected product companies in Industrial IoT market IT & Networks Medical & Health Care Transporta on EV Chargers & Smart Grid Industrial & Mfg
  • 10. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 12 Key Attributes of Big Data Volume Scale of Data Variety Diversity of Data Velocity Speed of Data • • • • • • • • •
  • 11. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 13 Big Data Comes with Big Challenges • Storage • Processing • Value
  • 12. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 14 Storage Challenges • Legacy SAN / NAS storage devices are expensive • Traditional RDBMS were not designed for Big Data • Cannot handle volume, velocity, variety of Big Data 14
  • 13. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 15 Processing Challenges • Diverse processing • Organizations want do more than just BI / traditional analytics • Go beyond SQL queries • Timeliness • Process data in reasonable amount of time • Value of data decreases over time 15
  • 14. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 16 How Much Data Can a Standard Server Process
  • 15. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 17 • • 17
  • 16. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 18 • Large number of CPUs / cores • Faster cores • Large amount of memory • Faster memory bus • High-performance architecture Scale-up with Powerful High-end Server 18
  • 17. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 19 Disadvantages of Scale-up Architecture • Proprietary • Expensive • Limited scalability 19
  • 18. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 20 • Cluster of servers • Commodity machines • Pool together resources • CPU • Memory • Disk Scale-out Architecture 20
  • 19. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 21 Benefits of Scale-out Architecture • Relatively inexpensive • Economical to scale • No huge upfront investment • Start small and expand cluster as workload increases 21
  • 20. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 22 Challenges With Scale-out Architecture • Writing distributed applications is very hard • Split job into chunks that can be distributed across a cluster • Schedule compute resources among different jobs • Manage inter-node communication • Handle network and node failures • Hardware failures are more common at a cluster level • Probability of a single node failing is low • Probability of any one node in a large cluster failing is high 22
  • 21. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 23 Getting Value Out of Big Data • Traditional analytics / BI • Custom processing • Machine Learning • Predictive analytics • Automate complex tasks • Stream processing • Analyze in real-time/near real-time • React in real-time 23
  • 22. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 24 Traditional Analytics / BI • What • Customer growth for the last month/quarter/year • Segmentation of customers by demographics • Average time spent by mobile app users • Why • Sales growth slowed • regional issue • supply issue • Profit dropped • revenue dropped • expenses increased 24
  • 23. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 25 Custom Processing • Index web pages • Google • Bing • Process genome data • Identify mutations linked to cancer, Alzheimer's and other disease • Click analysis • Log analysis • 360-degree real time view of a customer 25
  • 24. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 26 Predictive Analytics • Advertisements that a visitor will most likely click • Movies / songs / news that a customer will like • Products that a customer will buy • Patient will have an heart attack 26
  • 25. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 27 • Virtual assistant • Siri • Google Now • Autonomous machine • Self-driving car • Robots • Tag Images • Facebook • Flickr • Expert System • Medical diagnosis • Personalized medicine • Security • Fraud detection • Network Security • Music recognition • Shazam • SoundHound Automate Complex Tasks 27
  • 26. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 28 Big Data Big Data Technologies Kafka Hadoop Spark
  • 27. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 29 • • • • • • 29
  • 28. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 30 • • • • • • 30
  • 29. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 31 • Text • CSV • JSON • XML • Binary • Sequence File • Avro • Parquet • Optimized Row Columnar (ORC) File Formats 31
  • 30. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 32 • Hive • Spark SQL • Impala • Presto • Drill • Phoenix • HAWQ • Tajo Distributed SQL Query Engine 32 Data Warehouse Distributed Storage Distributed Query Engine
  • 31. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 33 • • • • • • 33
  • 32. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 34 • • • 34
  • 33. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 35 • • 35
  • 34. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 36 Publish – Subscribe / Messaging Systems • Kafka • RabbitMQ • ActiveMQ • ZeroMQ 36
  • 35. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 37 • Batch • Hadoop MapReduce • HPCC • Stream • Kafka Streams • Heron • Storm • Samza • Batch and Stream • Spark • Flink • Beam • Apex • Ignite Big Data Computing Frameworks 37
  • 36. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 38 Big Data Big Data Technologies Kafka Hadoop Spark
  • 37. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 39 • Distributed publish-subscribe messaging system • Partitioned and replicated commit log service for building distributed datastore Kafka
  • 38. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 40 • • • • • • • • • • • • • • • 40
  • 39. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 41 • • • • • • • • 41
  • 40. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 42 • • • • • • 42
  • 41. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 43 Big Data Big Data Technologies Kafka Hadoop Spark
  • 42. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 44 Hadoop
  • 43. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 45 • • • • 45
  • 44. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 46 • • • • •
  • 45. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 47
  • 46. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 48
  • 47. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 49
  • 48. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 50 Hadoop is Not a Single Product 50
  • 49. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 51 Hadoop Core Components 51 =
  • 50. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 52 Big Data Big Data Technologies Kafka Hadoop Spark
  • 51. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 53 • • • 53
  • 52. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 54 • • • • • • • 54
  • 53. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 55 Adoption of Spark is Growing Rapidly
  • 54. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 56 Spark Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simpler programming model 56 • • •
  • 55. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 57 Benefits • Scale • Fault-tolerance • Abstracts distributed computing • Hides the messy details of writing distributed applications • Allows developers to just focus on the data processing logic • Same code works on a laptop or a cluster of servers • Ease-of-use • Speed • Flexibility 57
  • 56. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 58 Easy To Use • Library with an expressive API • Scala, Java, Python, R • RDD API with 80+ operators (MR has only two) • Dataset/DataFrame API • Interactive development • spark-shell • notebooks 58
  • 57. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 59 • Batch processing • Interactive analytics • Stream analysis • Machine learning • Graph analytics Integrated Libraries For a Variety of DP Tasks Spark Core Spark SQL GraphX Spark Streaming MLlib
  • 58. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 60 Benefits of a Unified Platform • Solve a variety of problems with a single toolkit • No need to learn different tools for each use case • Avoid code and data duplication • Achieve operational simplicity
  • 59. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 61 Why is Spark Fast • Advanced job execution engine • Allows applications to cache data in memory 61
  • 60. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 62 Advanced Job Execution Engine • Directed Acyclic Graph (DAG) of stages • simple job can contain just one stage • complex job can contain many stages • eliminates expensive operations between multiple jobs • synchronization • serialization/deserialization • disk I/O • Lazy operator evaluation • Pipelined operations 62
  • 61. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 65 Allows Applications to Cache Data in Memory •Minimize disk I/O •Reading data from memory is orders of magnitude faster than reading from disk •In-memory data sharing across DAGs • different jobs can work with the same cached data 65
  • 62. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 66 Why Caching Makes Applications Run Faster 66 100 MB/s 500 MB/s 10 GB/s
  • 63. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 67 Read Latency Comparison 67 0 50 100 150 200 1 TB Time (Min) Data Read HDD SSD RAM
  • 64. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 74 Spark Does Not Provide Storage • Works with a variety of data sources • No need to import data into Spark • Scale compute and storage cluster independently
  • 65. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 75 Process Data From a Variety Of Data Sources And Many More
  • 66. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 76 Spark Does Not Replace Hadoop 76 = =
  • 67. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 77 Hadoop is Optional 77 = =
  • 68. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 78 Ideal Applications • Complex data processing • multi-step pipeline • Iterative algorithm • Machine Learning • Graph analytics • Ad hoc analysis • Interactive
  • 69. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 110110