Increase computational power
with distributed processing
Neil Stein 03 Nov 2012
Distributed processing
A Discussion Example…
Getting the data and ordering it as needed…
Familiar with grep and sort?

—  “grep” extracts all the matching lines
—  “sort” sorts all the lines
grep “some_record_parameters” hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3
A Discussion Example…
—  As the amount of data increases, the process requires more and more resources

—  What if hl7_transfer.data-file is 500GB or bigger?
—  What if there are hundreds or thousands of data files?
—  What if there are multiple types of data files?
grep “provider 1” hl7_transfer.data-file | sort

—  Ignoring the process for a moment, how do we write all the data to
disk in the first place?

Need to rethink the process
Distributed processing
Distributed File-System – “the cloud”
—  Files can be stored across many machines
—  Files can be replicated across many machines
—  Files can be in a hybrid-cloud model
—  Share the file-system transparently
—  You simply see the usual file structure
—  Opportunity to leverage private and public cloud environments
Distributed processing
Map-Reduce – “the cloud”
—  A way of processing large amounts of data across many machines
—  Must be able to split the data into chunks for processing (Map)
—  Results are recombined after processing (Reduce)
—  Requires a constant flow of data from one simple state to another
—  Allows for a simple way of breaking down a large task into smaller
manageable tasks

—  Increases the available computational power
A look at Hadoop
What is Hadoop
—  A Map-Reduce framework
—  Designed to run applications on clusters of
local and remote systems

—  HDFS
—  The file system of Hadoop (Hadoop Distributed
File System)
—  Designed to store and access data across clusters of local and remote systems
Putting the pieces together…
First, we need some code…
Two pieces: a Map step and a Reduce step, each sketched below.
Map

Hadoop streams the input to the mapper on STDIN
Emit one key/value record per line; key and value are separated by a tab (for Hadoop)
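The original slide shows the mapper as a screenshot. Below is a minimal sketch of what such a streaming mapper could look like in Python, assuming the record format from the grep example earlier; the file name mapper.py and the field layout are assumptions, not taken from the deck.

#!/usr/bin/env python
# mapper.py -- hypothetical Hadoop-streaming mapper (a sketch, not the original slide code)
# Hadoop streams the raw input to us on STDIN, one line at a time.
import sys

for line in sys.stdin:
    line = line.strip()
    # assumed record format: "[2012/02/25/ 9:15] records sent to healthcare-1"
    if "records sent to" not in line:
        continue  # skip records that do not match
    provider = line.rsplit(" ", 1)[-1]  # last field: the receiving provider
    # emit one key/value record per line, key and value separated by a tab
    sys.stdout.write("%s\t1\n" % provider)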
Reduce

Hadoop streams the sorted map output back to us on STDIN
Output the aggregated records
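Again a sketch rather than the slide's code: it assumes the tab-separated counts emitted by the hypothetical mapper above. Hadoop sorts the map output by key before it reaches the reducer, so all records for one key arrive adjacent.

#!/usr/bin/env python
# reducer.py -- hypothetical Hadoop-streaming reducer (a sketch, not the original slide code)
# Hadoop streams the sorted map output to us on STDIN; equal keys are adjacent.
import sys

current_key, count = None, 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, _, value = line.partition("\t")
    if key != current_key:
        if current_key is not None:
            # the key changed: output the aggregated record for the previous key
            sys.stdout.write("%s\t%d\n" % (current_key, count))
        current_key, count = key, 0
    count += int(value)

if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, count))  # flush the last key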
Sanity Checking
Run the pipeline locally and check the results
This should work with small data-sets
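The check can be run without a cluster at all: pipe the two scripts together, with a plain sort standing in for Hadoop's shuffle phase (file names as assumed above).

cat hl7_transfer.data-file | python mapper.py | sort | python reducer.py

If the aggregated counts look right on a small data-set, the same two scripts can be handed to Hadoop unchanged.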
Push file to “the distributed file system”

Put file on the DFS

Check that the file is in the cloud
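With a stock Hadoop install these are plain hadoop fs commands; the target path /data/ is an assumption for illustration.

# put the file on the DFS
hadoop fs -put hl7_transfer.data-file /data/hl7_transfer.data-file
# check that the file is in the cloud
hadoop fs -ls /data/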
Running in “the distributed environment”

Call the Hadoop streaming command
Pass the appropriate parameters
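On a 2012-era (Hadoop 1.x) install the invocation would look roughly like this; the streaming jar location varies by version, and the paths follow the assumed layout above. The -file options ship the local scripts to every node (the scripts need a shebang line and the execute bit).

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/hl7_transfer.data-file \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py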
Checking Status
—  Cluster Summary
—  Running Jobs
—  Completed Jobs
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
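On Hadoop 1.x all of this is visible in the JobTracker web UI (port 50030 by default); the command line gives the same information. The job id below is a placeholder.

# running jobs
hadoop job -list
# all jobs, including completed and failed ones
hadoop job -list all
# statistics for a single job
hadoop job -status job_201211030915_0001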
Checking Distributed Cluster Health
—  List Data-Nodes
—  Dead Nodes
—  Node heartbeat information
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
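DataNode and heartbeat details come from the HDFS side: the NameNode web UI (port 50070 by default) shows them, and so does dfsadmin.

# live/dead data-nodes, capacity, and last-contact (heartbeat) per node
hadoop dfsadmin -report
# overall file-system health
hadoop fsck /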
Conclusion
—  A different paradigm for solving large-scale problems
—  Designed to solve specific problems that can be defined
in a focused map-reduce manner
