How to get started in Big Data without Big Costs - StampedeCon 2016
- 1. Dipping Your Toe Into Hadoop
How to get started in Big Data without Big Costs
Bobby Dewitt
VP, Systems Architect
Aisle411
StampedeCon 2016
- 2. My Background
• Oracle, MySQL, and PostgreSQL DBA with 15 years of experience
• Led database, infrastructure, and business intelligence teams to deliver highly available data systems
• Currently responsible for design, implementation, and operational availability of infrastructure and systems at Aisle411
- 3. Aisle411
• Digitizing the indoor world
• Indoor maps, positioning, and analytics
• Asset and customer tracking within locations
• Using augmented reality to make indoor solutions more interactive
• Small company - big data
- 4. RDBMS Versus Hadoop
• Relational databases
  • Very structured data
  • Good for transactional and operational systems
  • Difficult to scale out
  • Hardware failures can be disastrous
• Hadoop
  • Semistructured or unstructured data
  • Good for batch and bulk processing as well as analytic systems
  • Simple to scale out
  • Hardware failures are handled seamlessly
- 5. Hadoop Adoption
• Still not a reality for many companies
• Major barriers include:
  • Lack of skilled employees
  • Getting value out of the investment
  • Constant changes to the ecosystem
- 6. Kick the Tires
• Play around with it
• A Hadoop cluster can reside on a single machine
• Pre-loaded virtual machines
• Install on EC2 or other cloud VM
- 7. What Data Should I Use?
• Stick with what you know
• Choose a dataset that is not specific to your company
• Try documented examples and use cases
- 8. Example Datasets
• Apache web server logs
• Twitter feeds
• Stock market prices
• Census data
• Sports statistics
• Song data
- 9. Apache Web Log Data
• Many online resources
• Potentially large dataset
• Real business value
• Combine with other data sources
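Before the logs ever touch Hadoop, it helps to know what fields they contain. A minimal Python sketch of parsing one line, assuming the default Apache "combined" LogFormat (the function and field names here are illustrative, not from the talk):

```python
import re

# Regex for the Apache "combined" LogFormat (an assumption; adjust to
# match your server's actual LogFormat directive).
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields from one access-log line, or None on no match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

sample = ('203.0.113.7 - - [10/Oct/2016:13:55:36 -0500] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0"')
rec = parse_line(sample)
```

Running a script like this over a day's worth of logs is a cheap way to check data quality before loading anything into HDFS.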
- 10. From Batch to Streaming
• Initial testing done with a batch load using HDFS tools
• Set up streaming to provide near real-time updates
• Used several Hadoop components:
  • HDFS
  • Flume
  • Morphlines
  • Avro
  • Hive
  • Impala
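A sketch of how the streaming pieces above might be wired together: a Flume agent that tails the web server's access log into HDFS. The agent name, log path, and HDFS path below are placeholder assumptions, not the speaker's actual configuration:

```properties
# Hypothetical Flume agent "a1": exec source -> memory channel -> HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail the live access log (path is an assumption).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Write events into date-partitioned HDFS directories as plain text.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/weblogs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

The same pipeline can later be hardened (e.g., a file channel instead of a memory channel, or a Morphlines interceptor to emit Avro) without changing the overall shape.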
- 11. Quick Wins
• Get data into HDFS
• Get data into Hive or Impala
• Stream live data
• Combine with other data sources
• Create pretty graphs and charts
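"Get data into Hive" can be as little as one DDL statement over files already sitting in HDFS. A hedged sketch using the `RegexSerDe` that ships with Hive; the table name, columns, regex, and location are assumptions for illustration:

```sql
-- Hypothetical external table over raw access-log files in HDFS;
-- dropping the table leaves the underlying files untouched.
CREATE EXTERNAL TABLE weblogs (
  host    STRING,
  ts      STRING,
  request STRING,
  status  STRING,
  size    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+).*"
)
LOCATION '/flume/weblogs/';
```

Once the table exists, the same data is immediately queryable from Impala as well, which is where the quick graphs and charts come from.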
- 12. Costs
• Start small with a data puddle
• Use virtual machines, not the big appliance
• Research and experimentation time may be the biggest cost
- 13. Where Am I?
• Evaluate your initial trials
• Is Hadoop everything you thought it would be?
• Do you have a real business need to use it?
• Can you migrate any existing data or processes?
- 15. Hadoop Is Not For Everyone
• Your “big data” may not be big enough
• Still some work to be done with security and tools
• Skills are being learned, but not quickly enough