www.geekseat.com.au Agile Software Development
Welcome to “Big Data” Jungle
Welly Tambunan
(welly.tambunan@danamon.co.id)
Solution and Integration Architect Lead
Analytics & Data Warehouse Department
Outline
• Big Data Overview and History
• Introduction to Hadoop
• Hadoop Ecosystem
• Hadoop Distribution
• Cloudera
• Big Data Architecture
• ETL vs ELT
• Talend for ETL Tools
Big Data Overview and History
• Google Search Engine
  • Search Engine Architecture
    • Crawler
    • Indexer
    • Search Algorithm / PageRank
• Doug Cutting and Search Engine
  • Apache Lucene
  • Apache Nutch
• Google File System + MapReduce
• Hadoop Birth
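
To make the crawler / indexer / search split above concrete, here is a minimal sketch of the indexer and search steps as an inverted index. It only illustrates the idea; it is not how Google, Lucene, or Nutch implement it, it does no ranking (no PageRank), and the page URLs and texts are made-up sample data.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Search: return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, standing in for what a crawler would have fetched.
pages = {
    "a.example": "Hadoop stores big data",
    "b.example": "Lucene indexes text data",
    "c.example": "Nutch crawls the web",
}
index = build_index(pages)
print(search(index, "big data"))
print(search(index, "data"))
```

The index is built once and reused for every query, so a search only touches the word lists it needs instead of rescanning every page.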
Hadoop
• HDFS (Hadoop Distributed File System)
• MapReduce
• Hadoop = HDFS + MapReduce
• Hadoop = Storage + Processing
• Features
  • Schemaless: no predefined structure, i.e. no rigid schema of tables and columns (with column types and sizes)
  • Durable: once data is written, it should never be lost
  • Fault tolerant: handles component failure (e.g. CPU, disk, memory, network, power supply, motherboard) without human intervention
  • Automatically rebalanced to even out disk space consumption across the cluster
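
The "Processing" half of Hadoop = Storage + Processing can be sketched as a word count in the MapReduce style. This is a single-process illustration in plain Python, not the Hadoop API: the function names are hypothetical, and in a real cluster the map and reduce tasks run on many nodes against blocks stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two sample lines stand in for two input splits.
    print(word_count(["big data jungle", "big data big cluster"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what lets the model scale.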
Hadoop Ecosystem
• SQL on Hadoop
  • Hive
  • Impala
• HBase
• Hue
• Kafka
• Oozie
• Sqoop
Hadoop Ecosystem
• YARN
• ZooKeeper
• Spark
  • Batch
  • Streaming
• Flink
  • Batch
  • Streaming
Hadoop Distribution
• Cloudera (Danamon's choice)
• Hortonworks
• MapR
• IBM
• etc.
Cloudera Demo
• Cloudera Manager
• Hue
• File
  • Format
    • CSV
    • Parquet
    • Avro
  • Compression
    • Gzip
    • Snappy
    • Deflate
• Read files as database tables from
  • Hive
  • Impala
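
As a rough illustration of the format and compression choices above, the sketch below writes a small CSV payload in memory and compresses it with Gzip and with Deflate, using only the Python standard library. Snappy, Parquet, and Avro need third-party packages, so they are left out. The column names and rows are invented sample data, and `zlib.compress` is used here as the Deflate stand-in (it emits a DEFLATE stream with a zlib header).

```python
import csv
import gzip
import io
import zlib


def make_csv(rows):
    """Serialize rows to CSV bytes, with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "branch", "amount"])  # hypothetical columns
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


# Invented sample data: 1000 repetitive rows, which compress well.
rows = [(i, f"branch-{i % 5}", i * 10) for i in range(1000)]
raw = make_csv(rows)

gz = gzip.compress(raw)        # Gzip: DEFLATE with a gzip header and trailer
deflated = zlib.compress(raw)  # Deflate: DEFLATE with a zlib header

print("raw bytes:    ", len(raw))
print("gzip bytes:   ", len(gz))
print("deflate bytes:", len(deflated))

# Both codecs are lossless: decompressing returns the original bytes.
assert gzip.decompress(gz) == raw
assert zlib.decompress(deflated) == raw
```

Exact sizes depend on the data and the compression level, so it is worth measuring on a real sample before picking a codec for a table.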
ETL vs ELT
• ETL: Extract, Transform, Load
• ELT: Extract, Load, Transform
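
The difference between the two orderings can be sketched with an in-memory SQLite database standing in for the target system. This is an assumption-laden toy, not a Talend or Hadoop job: the table names, the sample rows, and the `clean_amount` helper are all hypothetical.

```python
import sqlite3

# Extract: pretend these rows came from a source system, amounts still as text.
SOURCE_ROWS = [("alice", "1,000"), ("bob", "250"), ("carol", "3,400")]


def clean_amount(text):
    """Transform helper: strip thousands separators and convert to int."""
    return int(text.replace(",", ""))


def run_etl(conn):
    """ETL: transform in the pipeline, then load only the cleaned rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount INTEGER)")
    transformed = [(name, clean_amount(amount)) for name, amount in SOURCE_ROWS]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)


def run_elt(conn):
    """ELT: load the raw rows first, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", SOURCE_ROWS)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT customer, CAST(REPLACE(amount, ',', '') AS INTEGER) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

etl_total = conn.execute("SELECT SUM(amount) FROM sales_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM sales_elt").fetchone()[0]
print(etl_total, elt_total)
```

Both paths end with the same cleaned table. The practical difference is where the transform runs and whether the raw data stays available in the target (`sales_raw` here) for later reprocessing.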
Talend for ETL/ELT Tools
• Demo for Standard Job with Database
• Demo for Batch Job
• Demo for Streaming Job
Announcement
• https://weltam.wordpress.com/ is back with a Big Data flavor
Questions?
Rock On!
