Dipping Your Toe Into
Hadoop
How to get started in Big Data without Big
Costs
Bobby Dewitt
VP, Systems Architect
Aisle411
StampedeCon 2016
My Background
• Oracle, MySQL, and PostgreSQL DBA with 15
years of experience
• Led database, infrastructure, and business
intelligence teams to deliver highly available
data systems
• Currently responsible for design,
implementation, and operational availability of
infrastructure and systems at Aisle411
Aisle411
• Digitizing the indoor world
• Indoor maps, positioning, and analytics
• Asset and customer tracking within
locations
• Using augmented reality to make
indoor solutions more interactive
• Small company - big data
RDBMS Versus Hadoop
• Relational databases
• Very structured data
• Good for transactional and operational systems
• Difficult to scale out
• Hardware failures can be disastrous
• Hadoop
• Semistructured or unstructured data
• Good for batch and bulk processing as well as
analytic systems
• Simple to scale out
• Hardware failures are handled seamlessly
Hadoop Adoption
• Still not a reality for many companies
• Major barriers include
• Lack of skilled employees
• Difficulty getting value from the investment
• Constant changes to the ecosystem
Kick the Tires
• Play around with it
• A Hadoop cluster can reside on a single
machine
• Pre-loaded virtual machines
• Install on EC2 or another cloud VM (a quick smoke test is sketched below)
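Before committing to anything bigger, a two-line smoke test confirms a sandbox is alive. This is a minimal sketch (not from the original slides), assuming a single-node install or vendor VM with the standard Hadoop CLI on the PATH:

```python
import subprocess

# Sanity checks for a fresh sandbox: the CLI is installed and
# HDFS is up and answering requests.
subprocess.run(["hadoop", "version"], check=True)        # prints build info
subprocess.run(["hdfs", "dfs", "-ls", "/"], check=True)  # lists the HDFS root
```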
What Data Should I Use?
• Stick with what you know
• Choose a dataset that is not specific to
your company
• Try documented examples and use
cases
Example Datasets
• Apache web server logs
• Twitter feeds
• Stock market prices
• Census data
• Sports statistics
• Song data
Apache Web Log Data
• Many online resources
• Potentially large dataset
• Real business value
• Combine with other data sources (a parsing sketch follows)
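Since the slides settle on Apache access logs, here is a minimal Python sketch of the first step: pulling fields out of a combined-format log line. The regex and the sample line are illustrative only; the field names follow the standard combined log layout.

```python
import re

# Apache "combined" log format:
# host ident authuser [date] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08"')
print(parse_line(sample)["status"])  # prints '200'
```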
From Batch to Streaming
• Initial testing was done with a batch load using HDFS tools (sketched after this list)
• Set up streaming to provide near-real-time updates
• Used several Hadoop components
• HDFS
• Flume
• Morphlines
• Avro
• Hive
• Impala
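The batch-load step mentioned above boils down to copying files into HDFS. A hedged sketch, assuming the `hdfs dfs` CLI is on the PATH; the local log directory and HDFS target path are hypothetical stand-ins:

```python
import subprocess
from pathlib import Path

# Illustrative paths -- substitute your own log location and HDFS target.
LOCAL_LOG_DIR = Path("/var/log/apache2")
HDFS_TARGET = "/data/weblogs/raw"

def hdfs(*args):
    """Run an `hdfs dfs` subcommand, raising on failure."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create the target directory (idempotent), then push each rotated
# access log up to HDFS in one batch.
hdfs("-mkdir", "-p", HDFS_TARGET)
for log_file in sorted(LOCAL_LOG_DIR.glob("access.log*")):
    hdfs("-put", "-f", str(log_file), HDFS_TARGET)
```

The streaming side of the slide (Flume tailing the logs, Morphlines transforming events, Avro as the wire format) is driven by configuration files rather than application code, so it is not sketched here.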
Quick Wins
• Get data into HDFS
• Get data into Hive or Impala (see the DDL sketch below)
• Stream live data
• Combine with other data sources
• Create pretty graphs and charts
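For the "get data into Hive or Impala" win, a common pattern is an external table pointed at the raw log directory. A sketch, assuming the files loaded in the previous step; the table name and location are hypothetical, and the RegexSerDe shown ships with recent Hive versions (it requires all columns to be STRING):

```python
import subprocess

# Hypothetical external table over the raw logs loaded earlier.
# Hive's built-in RegexSerDe maps each capture group to one column.
DDL = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_raw (
  host STRING, ident STRING, auth_user STRING, log_time STRING,
  request STRING, status STRING, size STRING,
  referer STRING, agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION '/data/weblogs/raw';
"""

# Run the DDL through the Hive CLI; beeline or impala-shell work similarly.
subprocess.run(["hive", "-e", DDL], check=True)
```

Once the table exists, Impala can query it after an `INVALIDATE METADATA`, and any SQL-capable charting tool can produce the "pretty graphs."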
Costs
• Start small with a data puddle
• Use virtual machines, not the big
appliance
• Research and experimentation time
may be biggest cost
Where Am I?
• Evaluate your initial trials
• Is Hadoop everything you thought it would
be?
• Do you have a real business need to use it?
• Can you migrate any existing data or
processes?
Training
• Hortonworks University
• MapR Academy
• Cloudera quick start tutorials
• Online classes through Coursera, edX, and
others
• Conferences like StampedeCon
Hadoop Is Not For Everyone
• Your “big data” may not be big enough
• Still some work to be done with security
and tools
• Skills are being learned, but not quickly
enough
Thank You
• Questions?
rdewitt@aisle411.com
