Cloud Infrastructure:
  GFS & MapReduce
          Andrii Vozniuk
  Based on slides by Jeff Dean and Ed Austin


Data Management in the Cloud
            EPFL
      February 27, 2012
Outline
•   Motivation
•   Problem Statement
•   Storage: Google File System (GFS)
•   Processing: MapReduce
•   Benchmarks
•   Conclusions
Motivation
• Huge amounts of data to store and process
• Example @2004:
  – 20+ billion web pages x 20 KB/page = 400+ TB
  – Reading from one disk at 30-35 MB/s:
     • Four months just to read the web
     • 1000 hard drives just to store the web
     • Even more complicated if we want to process the data
• Exponential growth: the solution must be scalable.
Motivation
• Buy super fast, ultra reliable hardware?
  – Ultra expensive
  – Controlled by third party
  – Internals can be hidden and proprietary
  – Hard to predict scalability
  – Fails less often, but still fails!
  – No suitable solution on the market
Motivation
• Use commodity hardware? Benefits:
   –   Commodity machines offer much better perf/$
   –   Full control over and understanding of internals
   –   Can be highly optimized for their workloads
   –   Really smart people can do really smart things

• Not that easy:
   – Fault tolerance: something breaks all the time
   – Application development
   – Debugging, Optimization, Locality
   – Communication and coordination
   – Status reporting, monitoring
• You must handle all of these issues for every problem you want to solve
Problem Statement
• Develop a scalable distributed file system for
  large data-intensive applications running on
  inexpensive commodity hardware
• Develop a tool for processing large data sets in
  parallel on inexpensive commodity hardware
• Develop both in coordination for optimal
  performance
Google Cluster Environment
  Servers → Racks → Clusters → Datacenters → Cloud
Google Cluster Environment
• @2009:
   –   200+ clusters
   –   1000+ machines in many of them
   –   4+ PB file systems
   –   40 GB/s read/write load
   –   Frequent HW failures
   –   100s to 1000s of active jobs (1 to 1000 tasks each)
• A cluster is 1000s of machines
   – Stuff breaks: with 1000 machines expect ~1 failure per day, with 10000 ~10 per day
   – How to store data reliably with high throughput?
   – How to make it easy to develop distributed applications?
Google Technology Stack
  (stack diagram; this talk focuses on the GFS and MapReduce layers)
GFS: Google File System*
•   Inexpensive commodity hardware
•   High throughput favored over low latency
•   Large files (multi GB)
•   Multiple clients
•   Workload
    – Large streaming reads
    – Small random writes
    – Concurrent append to the same file
* The Google File System. S. Ghemawat, H. Gobioff, S. Leung. SOSP, 2003
GFS: Architecture
  (architecture diagram: clients, a single Master, and many Chunk Servers)
•   User-level process running on commodity Linux machines
•   Consists of Master Server and Chunk Servers
•   Files broken into chunks (typically 64 MB), 3x redundancy (clusters, DCs)
•   Data transfers happen directly between clients and Chunk Servers
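To make the chunk addressing concrete, here is a minimal Python sketch (mine, not from the slides) of how a client maps a byte offset in a file onto a chunk index and a chunk-local offset, assuming the 64 MB chunk size above; the master lookup and the direct chunk-server read are described only in the comments.

    # Minimal sketch of client-side chunk addressing, assuming 64 MB chunks.
    CHUNK_SIZE = 64 * 1024 * 1024  # typical GFS chunk size

    def chunk_coordinates(byte_offset: int) -> tuple[int, int]:
        """Map a file byte offset to (chunk index, offset within that chunk)."""
        return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

    # The client sends (file name, chunk index) to the Master, receives a chunk
    # handle plus replica locations, and then reads directly from a Chunk Server;
    # the Master never sits on the data path.
    print(chunk_coordinates(200 * 1024 * 1024))  # -> (3, 8388608)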
GFS: Master Node
• Centralization for simplicity
• Namespace and metadata management
• Managing chunks
   –   Where they are (file → chunks → replica locations)
   –   Where to put new chunks
   –   When to re-replicate (failure, load balancing)
   –   When and what to delete (garbage collection)
• Fault tolerance
   –   Shadow masters
   –   Monitoring infrastructure outside of GFS
   –   Periodic snapshots
   –   Mirrored operation log
GFS: Master Node
• Metadata is kept in memory – it’s fast!
  – A 64 MB chunk needs less than 64 B of metadata, so 640
    TB of data needs less than 640 MB of metadata
• Master asks Chunk Servers for their chunk locations when
  – the Master starts
  – a Chunk Server joins the cluster
• Operation log
  – Used to serialize concurrent operations
  – Replicated
  – Respond to the client only after the log is flushed locally and
    remotely
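The memory estimate above checks out with simple arithmetic, assuming roughly 64 bytes of metadata per 64 MB chunk as stated on the slide:

    # Back-of-the-envelope check: ~64 bytes of metadata per 64 MB chunk.
    TB = 10**12  # decimal units are fine for an order-of-magnitude estimate
    chunks = (640 * TB) // (64 * 1024 * 1024)   # roughly 9.5 million chunks for 640 TB
    metadata_mb = chunks * 64 / 10**6           # roughly 610 MB of in-memory metadata
    print(chunks, metadata_mb)                  # -> 9536743 610.351552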
GFS: Chunk Servers
• 64MB chunks as Linux files
  – Reduce the size of the master's data structures
  – Reduce client-master interaction
  – Internal fragmentation => allocate space lazily
  – Possible hotspots => re-replicate
• Fault tolerance
  – Heart-beat to the master
  – Something wrong => master initiates re-replication
GFS: Mutation Order
1. Client asks the Master which Chunk Server holds the current lease (the primary) and where the other replicas are
2. Master replies with the identity of the primary and the locations of the replicas (cached by the client)
3. Client pushes the data to all replicas (steps 3a, 3b, 3c in the original figure)
4. Client sends the write request to the primary; the primary assigns a serial number to the mutation, applies it, and forwards the write request to the secondaries
5. Secondaries apply the mutation and report completion to the primary
6. Primary replies to the client: operation completed, or an error report
GFS: Control & Data Flow
• Decouple control flow and data flow
• Control flow
  – Master -> Primary -> Secondaries
• Data flow
  – Carefully picked chain of Chunk Servers
       • Forward to the closest first
       • Distance estimated based on IP
  – Fully utilize outbound bandwidth
  – Pipelining to exploit full-duplex links
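As an illustration of "forward to the closest first", here is a small sketch of my own (not the slides' or the paper's actual algorithm) that orders replicas into a forwarding chain using a crude shared-IP-prefix distance; real GFS estimates distance from IP addresses in its own way.

    # Hypothetical forwarding chain: each hop pushes data to the "closest"
    # remaining replica, judged by how many leading IPv4 octets are shared.
    # Each hop can start forwarding bytes to the next one as they arrive (pipelining).
    def shared_octets(ip_a: str, ip_b: str) -> int:
        count = 0
        for x, y in zip(ip_a.split("."), ip_b.split(".")):
            if x != y:
                break
            count += 1
        return count

    def forwarding_chain(client_ip: str, replica_ips: list[str]) -> list[str]:
        chain, current, remaining = [], client_ip, set(replica_ips)
        while remaining:
            nxt = max(remaining, key=lambda ip: shared_octets(current, ip))
            chain.append(nxt)
            remaining.remove(nxt)
            current = nxt
        return chain

    print(forwarding_chain("10.1.2.3", ["10.3.0.9", "10.1.2.7", "10.1.9.1"]))
    # -> ['10.1.2.7', '10.1.9.1', '10.3.0.9']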
GFS: Other Important Things
• Snapshot operation – make a copy very fast
• Smart chunk creation policy
  – Prefer servers with below-average disk utilization; limit the number of recent creations per server
• Smart re-replication policy
  – Under-replicated chunks first
  – Chunks that are blocking client progress first
  – Chunks of live files first (rather than recently deleted ones)
• Rebalance and GC periodically

          How to process data stored in GFS?
MapReduce*
• A simple programming model applicable to many
  large-scale computing problems
• Divide and conquer strategy
• Hide messy details in MapReduce runtime library:
     –   Automatic parallelization
     –   Load balancing
     –   Network and disk transfer optimizations
     –   Fault tolerance
     –   Part of the stack: improvements to the core library benefit
         all of its users
*MapReduce: Simplified Data Processing on Large Clusters.
J. Dean, S. Ghemawat. OSDI, 2004
MapReduce: Typical Problem
• Read a lot of data and break it into parts
• Map: extract something important from each part
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform the Map results
• Write the results
• Chain or cascade multiple MapReduce jobs

• Implement Map and Reduce to fit the problem
MapReduce: Nice Example
MapReduce: Other Suitable Examples
•   Distributed grep
•   Distributed sort (Hadoop MapReduce won TeraSort)
•   Term-vector per host
•   Document clustering
•   Machine learning
•   Web access log stats
•   Web link-graph reversal
•   Inverted index construction
•   Statistical machine translation
MapReduce: Model
• Programmer specifies two primary methods
    – Map(k,v) -> <k’,v’>*
    – Reduce(k’, <v’>*) -> <k’,v’>*
• All v’ with same k’ are reduced together, in order
• Usually also specify how to partition k’
    – Partition(k’, total partitions) -> partition for k’
        • Often a simple hash of the key
        • Allows reduce operations for different k’ to be parallelized
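The partition function is usually just a hash of the intermediate key modulo the number of reduce tasks; the sketch below shows that default style of partitioner, not Google's actual implementation.

    # Default-style partitioner: hash(key) mod #reducers, so every value for a
    # given key is routed to the same reduce task.
    import hashlib

    def partition(key: str, num_partitions: int) -> int:
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()  # stable across processes
        return int(digest, 16) % num_partitions

    print(partition("cloud", 4))                           # a fixed value in 0..3
    print(partition("cloud", 4) == partition("cloud", 4))  # True: deterministic routing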
MapReduce: Code
Map
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
          EmitIntermediate(w, "1");
Reduce
    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
          result += ParseInt(v);
      Emit(AsString(result));
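The pseudocode above is the word-count example from the MapReduce paper; below is a small runnable Python equivalent with a toy single-process driver standing in for the distributed runtime, just to show how map, shuffle/sort, and reduce fit together.

    # Word count as map/shuffle/reduce, with a toy in-process "runtime".
    from collections import defaultdict

    def map_fn(doc_name, contents):
        for word in contents.split():      # value: document contents
            yield word, 1

    def reduce_fn(word, counts):
        yield word, sum(counts)            # values: an iterator of counts

    def run_mapreduce(inputs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for doc_name, contents in inputs:                  # map phase
            for key, value in map_fn(doc_name, contents):
                groups[key].append(value)
        results = {}
        for key in sorted(groups):                         # shuffle & sort, then reduce
            for out_key, out_value in reduce_fn(key, iter(groups[key])):
                results[out_key] = out_value
        return results

    docs = [("doc1", "the cloud stores the web"), ("doc2", "the web is huge")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'cloud': 1, 'huge': 1, 'is': 1, 'stores': 1, 'the': 3, 'web': 2}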
MapReduce: Architecture
        • One master, many workers
        • Infrastructure manages scheduling and distribution
        • User implements Map and Reduce
        • Combiner = a local Reduce applied to each mapper's output before the shuffle
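A combiner can be sketched as the same aggregation run locally on one mapper's output before anything is shuffled across the network; the helper below is a hypothetical illustration, not part of the original slides.

    # Hypothetical combiner: pre-aggregate a single mapper's (word, 1) pairs
    # locally so far fewer pairs cross the network during the shuffle.
    from collections import defaultdict

    def combine(intermediate_pairs):
        local = defaultdict(int)
        for word, count in intermediate_pairs:
            local[word] += count
        return list(local.items())

    mapper_output = [("the", 1), ("cloud", 1), ("the", 1), ("web", 1), ("the", 1)]
    print(combine(mapper_output))  # [('the', 3), ('cloud', 1), ('web', 1)]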
MapReduce: Important Things
•   Mappers scheduled close to data
•   Chunk replication improves locality
•   Reducers often run on same machine as mappers
•   Fault tolerance
    –   Map worker crash – re-run all map tasks of that machine
    –   Reduce worker crash – re-run only the crashed task
    –   Master crash – re-run the whole job
    –   Skip bad records
• Fighting ‘stragglers’ by launching backup tasks
• Proven scalability
  – In September 2009 Google ran 3,467,000 MapReduce jobs, averaging
    488 machines per job
• Extensively used at Yahoo! and Facebook via Hadoop
Benchmarks: GFS
Benchmarks: MapReduce
Conclusions
“We believe we get tremendous competitive advantage
by essentially building our own infrastructure”
-- Eric Schmidt

• GFS & MapReduce
   – Google achieved their goals
   – A fundamental part of their stack
• Open source implementations
   – GFS → Hadoop Distributed File System (HDFS)
   – MapReduce → Hadoop MapReduce
Thank you for your attention!
   Andrii.Vozniuk@epfl.ch
