Increase computational power
with distributed processing
Neil Stein 03 Nov 2012
Distributed processing
A Discussion Example…
Getting the data and ordering it as needed…
Familiar with grep and sort?

—  “grep” extracts all the matching lines
—  “sort” sorts all the lines
grep “some_record_parameters” hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3
A Discussion Example…
—  As the amount of data increases, the process requires more and more resources

—  What if hl7_transfer.data-file is 500GB or bigger?
—  What if there are hundreds or thousands of data files?
—  What if there are multiple types of data files?
grep “provider 1” hl7_transfer.data-file | sort

—  Ignoring the process for a moment, how do we write all the data to
disk in the first place?

Need to rethink the process
Distributed processing
Distributed File-System – “the cloud”
—  Files can be stored across many machines
—  Files can be replicated across many machines
—  Files can be in a hybrid-cloud model
—  Share the file-system transparently
—  You simply see the usual file structure
—  Opportunity to leverage private and public cloud environments
Distributed processing
Map-Reduce – “the cloud”
—  A way of processing large amounts of data across many machines
—  Must be able to split the data into chunks for processing (Map)
—  Results are recombined after processing (Reduce)
—  Requires a constant flow of data from one simple state to another
—  Allows for a simple way of breaking down a large task into smaller
manageable tasks

—  Increases the available computational power
A look at Hadoop
What is Hadoop
—  A Map-Reduce framework
—  Designed to run applications on clusters of
local and remote systems

—  HDFS
—  The file system of Hadoop (Hadoop Distributed
File System)
—  Designed to store and access data across clusters of local and remote systems
Putting the pieces together…
First, we need some code…
Two pieces: a Map step and a Reduce step, each sketched below.
Map

Hadoop streams the input to the mapper on STDIN
Emit one key/value record per line; key and value are separated by a tab (for Hadoop)
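The original slide shows the mapper as a screenshot. Below is a minimal sketch of what such a streaming mapper could look like in Python, assuming the record format from the grep example earlier; the file name mapper.py and the field layout are assumptions, not taken from the deck.

#!/usr/bin/env python
# mapper.py -- hypothetical Hadoop-streaming mapper (a sketch, not the original slide code)
# Hadoop streams the raw input to us on STDIN, one line at a time.
import sys

for line in sys.stdin:
    line = line.strip()
    # assumed record format: "[2012/02/25/ 9:15] records sent to healthcare-1"
    if "records sent to" not in line:
        continue  # skip records that do not match
    provider = line.rsplit(" ", 1)[-1]  # last field: the receiving provider
    # emit one key/value record per line, key and value separated by a tab
    sys.stdout.write("%s\t1\n" % provider)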
Reduce

Hadoop streams the sorted map output back to us on STDIN
Output the aggregated records
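Again a sketch rather than the slide's code: it assumes the tab-separated counts emitted by the hypothetical mapper above. Hadoop sorts the map output by key before it reaches the reducer, so all records for one key arrive adjacent.

#!/usr/bin/env python
# reducer.py -- hypothetical Hadoop-streaming reducer (a sketch, not the original slide code)
# Hadoop streams the sorted map output to us on STDIN; equal keys are adjacent.
import sys

current_key, count = None, 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, _, value = line.partition("\t")
    if key != current_key:
        if current_key is not None:
            # the key changed: output the aggregated record for the previous key
            sys.stdout.write("%s\t%d\n" % (current_key, count))
        current_key, count = key, 0
    count += int(value)

if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, count))  # flush the last key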
Sanity Checking
Run the pipeline locally and check the results
This should work with small data-sets
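The check can be run without a cluster at all: pipe the two scripts together, with a plain sort standing in for Hadoop's shuffle phase (file names as assumed above).

cat hl7_transfer.data-file | python mapper.py | sort | python reducer.py

If the aggregated counts look right on a small data-set, the same two scripts can be handed to Hadoop unchanged.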
Push file to “the distributed file system”

Put file on the DFS

Check that the file is in the cloud
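With a stock Hadoop install these are plain hadoop fs commands; the target path /data/ is an assumption for illustration.

# put the file on the DFS
hadoop fs -put hl7_transfer.data-file /data/hl7_transfer.data-file
# check that the file is in the cloud
hadoop fs -ls /data/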
Running in “the distributed environment”

Call the Hadoop streaming command
Pass the appropriate parameters
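On a 2012-era (Hadoop 1.x) install the invocation would look roughly like this; the streaming jar location varies by version, and the paths follow the assumed layout above. The -file options ship the local scripts to every node (the scripts need a shebang line and the execute bit).

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/hl7_transfer.data-file \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py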
Checking Status
—  Cluster Summary
—  Running Jobs
—  Completed Jobs
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
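On Hadoop 1.x all of this is visible in the JobTracker web UI (port 50030 by default); the command line gives the same information. The job id below is a placeholder.

# running jobs
hadoop job -list
# all jobs, including completed and failed ones
hadoop job -list all
# statistics for a single job
hadoop job -status job_201211030915_0001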
Checking Distributed Cluster Health
—  List Data-Nodes
—  Dead Nodes
—  Node heartbeat information
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
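DataNode and heartbeat details come from the HDFS side: the NameNode web UI (port 50070 by default) shows them, and so does dfsadmin.

# live/dead data-nodes, capacity, and last-contact (heartbeat) per node
hadoop dfsadmin -report
# overall file-system health
hadoop fsck /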
Conclusion
—  A different paradigm for solving large-scale problems
—  Designed to solve specific problems that can be defined
in a focused map-reduce manner
