Distributed Data Storage and Parallel Processing Engine: Sector & Sphere. Yunhong Gu, Univ. of Illinois at Chicago
What is Sector/Sphere? Sector: Distributed File System. Sphere: Parallel Data Processing Engine (generic MapReduce). Open source software, GPL/BSD, written in C++. Started in 2006; current version 1.23. http://sector.sf.net
Overview: Motivation, Sector, Sphere, Experimental Results
Motivation. Super-computer model: expensive, data I/O bottleneck. Sector/Sphere model: inexpensive, parallel data I/O, data locality.
Motivation. Parallel/distributed programming with MPI, etc.: flexible and powerful, but too complicated. Sector/Sphere (cloud) model: the cluster appears as a single entity to the developer, with a simplified programming interface; limited to certain data-parallel applications.
Motivation. Systems built for a single data center require additional effort to locate and move data. The Sector/Sphere model supports wide-area data collection and distribution.
Sector Distributed File System. [Architecture diagram: a Security Server (user accounts, data protection, system security) communicates with the Masters over SSL; the Masters (metadata, scheduling, service provider) communicate with Clients over SSL; Clients (system access tools, application programming interfaces) exchange data directly with the Slaves (storage and processing) over UDT, with optional encryption.]
Sector Distributed File System. Sector stores files on the native/local file system of each slave node and does not split files into blocks. Pro: simple and robust, suitable for wide-area deployment, fast and flexible data processing. Con: users need to handle file size properly. The master nodes maintain the file system metadata; no permanent metadata is needed. The system is topology aware.
Sector: Performance. The data channel is set up directly between a slave and a client. Multiple active-active masters (load balancing) are supported starting from version 1.24. UDT is used for high-speed data transfer; UDT is a high-performance UDP-based data transfer protocol that is much faster than TCP over wide-area networks.
UDT: UDP-based Data Transfer. http://udt.sf.net. An open source UDP-based data transfer protocol with reliability control and congestion control. Fast, firewall friendly, and easy to use. Already used in many commercial and research software products.
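UDT exposes a BSD-socket-style C++ API. Below is a minimal client sketch based on the API documented at http://udt.sf.net (UDT::startup, UDT::socket, UDT::connect, UDT::send, UDT::close, UDT::cleanup); the server address, port, and payload are placeholders, not values from this presentation.

    #include <udt.h>          // UDT SDK header from http://udt.sf.net
    #include <arpa/inet.h>
    #include <cstring>
    #include <iostream>

    int main()
    {
       UDT::startup();                                  // initialize the UDT library

       UDTSOCKET client = UDT::socket(AF_INET, SOCK_STREAM, 0);

       sockaddr_in serv_addr;
       std::memset(&serv_addr, 0, sizeof(serv_addr));
       serv_addr.sin_family = AF_INET;
       serv_addr.sin_port = htons(9000);                // placeholder port
       inet_pton(AF_INET, "192.168.0.1", &serv_addr.sin_addr);   // placeholder host

       if (UDT::ERROR == UDT::connect(client, (sockaddr*)&serv_addr, sizeof(serv_addr)))
       {
          std::cout << "connect: " << UDT::getlasterror().getErrorMessage() << std::endl;
          return 1;
       }

       const char* msg = "hello over UDT";
       UDT::send(client, msg, std::strlen(msg) + 1, 0); // reliable, TCP-like send

       UDT::close(client);
       UDT::cleanup();                                  // release library resources
       return 0;
    }

UDT also supports rendezvous connection setup, which is how Sector opens data channels without a listening server (see the security slide below).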
Sector: Fault Tolerance. Sector uses replication for better reliability and availability. Replicas can be made either at write time (instantly) or periodically. Sector supports multiple active-active masters for high availability.
Sector: Security. Sector uses a security server to maintain user accounts and IP access control for masters, slaves, and clients. Control messages are encrypted (not completely finished in the current version). Data transfer can optionally be encrypted. The data transfer channel is set up by rendezvous; there is no listening server.
Sector: Tools and API. Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download; wildcard characters are supported. System monitoring: sysinfo. C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo. A FUSE interface is also available.
Sphere: Simplified Data Processing. Designed for data-parallel applications: data is processed where it resides, or on the nearest possible node (locality). The same user-defined function (UDF) is applied to all elements (records, blocks, or files). Processing output can be written to Sector files or sent back to the client. A generalized Map/Reduce.
Sphere: Simplified Data Processing

    for each file F in (SDSS datasets)
        for each image I in F
            findBrownDwarf(I, …);

    SphereStream sdss;
    sdss.init("sdss files");
    SphereProcess myproc;
    myproc->run(sdss, "findBrownDwarf", …);
    myproc->read(result);

    findBrownDwarf(char* image, int isize, char* result, int rsize);
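As a concrete illustration of the UDF interface shown above, here is a sketch with the same signature; only the signature comes from the slide, while the body, return type, and detection rule are placeholder assumptions, not the actual SDSS brown-dwarf code.

    #include <cstdio>

    // Sketch only: signature from the slide; the body is a toy placeholder.
    int findBrownDwarf(char* image, int isize, char* result, int rsize)
    {
       // "image" holds one input element (an image blob of isize bytes);
       // the UDF writes its output into "result" (rsize bytes available).
       int candidates = 0;
       for (int i = 0; i + 1 < isize; ++i)
       {
          // Toy detection rule: count adjacent bright pixels.
          if ((unsigned char)image[i] > 200 && (unsigned char)image[i + 1] > 200)
             ++candidates;
       }

       // Sphere collects this buffer and either writes it to Sector files
       // or returns it to the client, as described on the previous slide.
       std::snprintf(result, rsize, "candidates=%d", candidates);
       return 0;
    }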
Sphere: Data Movement. Slave -> Slave (local); Slave -> Slaves (shuffle/hash); Slave -> Client.
Sphere/UDF vs. MapReduce

    Sphere               | MapReduce
    ---------------------+----------------------
    Record Offset Index  | Parser / Input Reader
    UDF                  | Map
    Hashing / Bucket     | Partition
    -                    | Compare
    UDF                  | Reduce
    -                    | Output Writer
Sphere/UDF vs. MapReduce. Sphere is more straightforward and flexible: a UDF can be applied directly to records, blocks, files, and even directories; native binary data is supported. Sorting is required by Reduce but is optional in Sphere. Sphere uses a PUSH model for data movement, which is faster than the PULL model used by MapReduce.
Why Doesn't Sector Split Files? Certain applications need to process a whole file or even a whole directory. Certain legacy applications need a file or a directory as input. Certain applications need multiple inputs, e.g., everything in a directory. In Hadoop, all blocks would have to be moved to one node for processing, so there is no data locality benefit.
Load Balance. The number of data segments is much larger than the number of SPEs. When an SPE completes a data segment, a new segment is assigned to it, as in the sketch below. Data transfer is balanced across the system to optimize network bandwidth usage.
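The assignment policy can be pictured with a short, self-contained illustration (this is not Sector source code; the SPE and segment counts are made-up values).

    #include <queue>
    #include <cstdio>

    int main()
    {
       const int num_spes = 8;                    // hypothetical number of SPEs
       std::queue<int> pending;                   // segment ids awaiting processing
       for (int i = 0; i < 1000; ++i)             // many more segments than SPEs
          pending.push(i);

       // Simulated completion events: whenever an SPE reports "done", it is
       // immediately handed the next pending segment, so all SPEs stay busy
       // regardless of how long individual segments take.
       int event = 0;
       while (!pending.empty())
       {
          int finished_spe = event++ % num_spes;  // stand-in for a real completion event
          int segment = pending.front();
          pending.pop();
          if (segment % 250 == 0)
             std::printf("SPE %d <- segment %d\n", finished_spe, segment);
       }
       return 0;
    }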
Fault Tolerance. Map failure is recoverable: if an SPE fails, the data segment assigned to it is re-assigned to another SPE and processed again. Reduce failure is unrecoverable. In small to medium systems, machine failure at run time is rare; if necessary, developers can split the input into multiple sub-tasks to reduce the cost of a reduce failure.
Open Cloud Testbed. Four racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2). 10 Gb/s inter-site connections over CiscoWave; 2 Gb/s inter-rack connections. Nodes have two dual-core AMD CPUs, 12 GB RAM, and a single 1 TB disk. The testbed will be doubled by Sept. 2009.
Open Cloud Testbed
The TeraSort Benchmark. Data is split into small files and scattered across all slaves. Stage 1: on each slave, an SPE scans the local files and sends each record to a bucket file on a remote node according to the record's key. Stage 2: on each destination node, an SPE sorts all data inside each bucket.
TeraSort. Each 100-byte record consists of a 10-byte key and a 90-byte value. Stage 1: hash each record into one of 1024 buckets (Bucket-0 through Bucket-1023) based on the first 10 bits of its key. Stage 2: sort each bucket on its local node.
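A minimal sketch of the stage 1 bucketing, assuming only the layout stated above (10-byte key, 90-byte value, bucket chosen from the first 10 bits of the key); the function and variable names are illustrative.

    #include <cstdint>
    #include <cstdio>

    // The first 10 bits of the 10-byte key select one of 1024 buckets.
    inline int bucket_id(const unsigned char* record)
    {
       // Combine the first two key bytes, then keep the top 10 bits: 0..1023.
       uint16_t first16 = (uint16_t(record[0]) << 8) | uint16_t(record[1]);
       return first16 >> 6;
    }

    int main()
    {
       unsigned char record[100] = {0};       // 10-byte key + 90-byte value
       record[0] = 0xAB;
       record[1] = 0xCD;                      // example key prefix
       std::printf("record -> bucket %d of 1024\n", bucket_id(record));
       return 0;
    }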
Performance Results: TeraSort. Run time in seconds, Sector v1.16 vs. Hadoop 0.17.

    Data Size | Location                       | Sphere | Hadoop (3 replicas) | Hadoop (1 replica)
    ----------+--------------------------------+--------+---------------------+-------------------
    300GB     | UIC                            | 1265   | 2889                | 2252
    600GB     | UIC + StarLight                | 1361   | 2896                | 2617
    900GB     | UIC + StarLight + Calit2       | 1430   | 4341                | 3069
    1.2TB     | UIC + StarLight + Calit2 + JHU | 1526   | 6675                | 3702
Performance Results: TeraSort. Sorting 1.2TB on 120 nodes. Sphere hash + local sort: 981s + 545s; Hadoop: 3702s (1 replica) / 6675s (3 replicas). Resource usage: Sphere hash, 130% CPU and 900MB memory; Sphere local sort, 80% CPU and 1.4GB memory; Hadoop, 150% CPU and 2GB memory.
The MalStone Benchmark. Drive-by problem: visit a web site and get compromised by malware. MalStone-A: compute the infection ratio of each site. MalStone-B: compute the infection ratio of each site from the beginning of the data to the end of each week. http://code.google.com/p/malgen/
MalStone. Each text record has the form Event ID | Timestamp | Site ID | Compromise Flag | Entity ID, for example: 00000000005000000043852268954353585368|2008-11-08 17:56:52.422640|3857268954353628599|1|000000497829. Stage 1: transform each text record into a key/value pair (key: site ID; value: time and compromise flag) and hash it into one of 1000 buckets (site-000X through site-999X) according to the site ID. Stage 2: compute the infection rate for each site (merchant).
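The stage 1 transform can be sketched as follows, using the example record above; the split helper and the exact bucket rule (here, the last three digits of the site ID) are our own assumptions about the 000-999 bucketing, not the benchmark's specification.

    #include <cstdlib>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split a pipe-delimited record into its fields.
    static std::vector<std::string> split(const std::string& line, char delim)
    {
       std::vector<std::string> fields;
       std::stringstream ss(line);
       std::string f;
       while (std::getline(ss, f, delim))
          fields.push_back(f);
       return fields;
    }

    int main()
    {
       // Example record copied from the slide.
       std::string rec = "00000000005000000043852268954353585368|2008-11-08 17:56:52.422640|"
                         "3857268954353628599|1|000000497829";

       std::vector<std::string> fields = split(rec, '|');
       const std::string& site_id = fields[2];
       bool compromised = (fields[3] == "1");

       // Assumed bucket rule: last three digits of the site ID give 000-999.
       int bucket = std::atoi(site_id.substr(site_id.size() - 3).c_str());

       std::cout << "site " << site_id << (compromised ? " (compromised)" : "")
                 << " -> bucket " << bucket << std::endl;
       return 0;
    }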
Performance Results: MalStone. Processing 10 billion records on 20 OCT nodes (local).

    System                  | MalStone-A | MalStone-B
    ------------------------+------------+-----------
    Hadoop                  | 454m 13s   | 840m 50s
    Hadoop Streaming/Python | 87m 29s    | 142m 32s
    Sector/Sphere           | 33m 40s    | 43m 44s

* Courtesy of Collin Bennet and Jonathan Seidman of Open Data Group.
System Monitoring (Testbed)
System Monitoring (Sector/Sphere)
For More Information. Sector/Sphere code & docs: http://sector.sf.net. Open Cloud Consortium: http://www.opencloudconsortium.org. NCDM: http://www.ncdm.uic.edu
