Big data 101 v1
- 1. www.geekseat.com.au Agile Software Development
Welcome to “Big Data” Jungle
Welly Tambunan
(welly.tambunan@danamon.co.id)
Solution and Integration Architect Lead
Analytics & Data warehouse Department
- 2. Outline
Big Data Overview and History
Introduction to Hadoop
Hadoop Ecosystem
Hadoop Distribution
Cloudera
Big Data Architecture
ETL vs ELT
Talend for ETL
- 3. Big Data Overview and History
Google Search Engine
Search Engine Architecture
Crawler
Indexer
Search Algorithm / PageRank
Doug Cutting and Search Engine
Apache Lucene
Apache Nutch
Google File System + MapReduce
Birth of Hadoop
- 4. Hadoop
HDFS (Hadoop Distributed File System)
MapReduce
Hadoop = HDFS + MapReduce
Hadoop = Storage + Processing
Features
Schemaless: no predefined structure, i.e. no rigid schema of tables and columns (with column types and sizes)
Durable: once data is written, it should never be lost
Fault tolerant: handles component failure (e.g. CPU, disk, memory, network, power supply, motherboard) without human intervention
Automatically rebalanced: evens out disk space consumption across the cluster
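To make the "Processing" half of Hadoop concrete, here is a minimal single-process sketch of the MapReduce model, using the classic word count. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel over HDFS blocks on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one line of an input file (illustrative data).
    lines = ["big data jungle", "big data hadoop"]
    print(word_count(lines))
```

The map and reduce functions only ever see one line or one key at a time, which is what lets a framework like Hadoop spread them across machines.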
- 8. Cloudera Demo
Cloudera Manager
Hue
File Format
CSV
Parquet
Avro
Compression
Gzip
Snappy
Deflate
Read as database tables from
Hive
Impala
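As a rough illustration of the compression options above, the sketch below builds a small CSV in memory and compares its size raw, Gzip-compressed, and Deflate-compressed, using only the Python standard library. The sample data and function names are hypothetical. Snappy, Parquet, and Avro are left out because they need third-party libraries, and the demo itself uses Cloudera tooling rather than this script.

```python
import csv
import gzip
import io
import zlib


def make_csv(rows):
    """Serialize rows to CSV bytes with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "city"])
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def compress_sizes(raw):
    """Return the byte size of the raw data and of each compressed form."""
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw)),
        # zlib.compress produces a zlib-wrapped DEFLATE stream.
        "deflate": len(zlib.compress(raw)),
    }


def roundtrip_gzip(raw):
    """Compress and decompress to confirm Gzip is lossless."""
    return gzip.decompress(gzip.compress(raw))


if __name__ == "__main__":
    # Illustrative, highly repetitive data; real data compresses less well.
    data = make_csv([(i, "Jakarta") for i in range(1000)])
    print(compress_sizes(data))
```

The achievable ratio depends heavily on the data, so sizes from a toy file like this should not be read as a benchmark of the codecs.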
- 9. ETL vs ELT
Extract Transform Load
Extract Load Transform
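The difference between the two orderings can be sketched with an in-memory SQLite database standing in for the target system. In the ETL path the rows are cleaned in application code before loading; in the ELT path the raw rows are loaded first and then transformed with SQL inside the target. The table names, sample rows, and cleaning rules are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical raw extract: untrimmed names and amounts stored as text.
RAW_ROWS = [("  alice ", "100"), ("BOB", "250"), ("carol", "75")]


def etl(conn, rows):
    """ETL: transform in application code, then load the clean rows."""
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount INTEGER)")
    cleaned = [(name.strip().lower(), int(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load raw rows as-is, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM sales_raw"
    )


def run():
    """Run both pipelines and return their resulting rows for comparison."""
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_ROWS)
    elt(conn, RAW_ROWS)
    etl_rows = conn.execute(
        "SELECT name, amount FROM sales_etl ORDER BY name"
    ).fetchall()
    elt_rows = conn.execute(
        "SELECT name, amount FROM sales_elt ORDER BY name"
    ).fetchall()
    conn.close()
    return etl_rows, elt_rows


if __name__ == "__main__":
    etl_rows, elt_rows = run()
    print(etl_rows == elt_rows)
```

Both paths end with the same clean table. What changes is where the transformation runs, and ELT additionally keeps the untouched raw table in the target for later reprocessing.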
- 10. Talend for ETL/ELT
Demo for Standard Job with Database
Demo for Batch Job
Demo for Streaming Job