Distributed processing
- 3. A Discussion Example
Getting the data and ordering it as needed
Familiar with grep and sort?
— “grep” extracts all the matching lines
— “sort” sorts all the lines
grep "some_record_parameters" hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3
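The grep and sort pipeline above can be tried on a small file. This is a minimal sketch: the file name matches the slide, but the file contents and the search string "records sent" are made up for illustration.

```shell
# Build a tiny sample file (contents are illustrative, not real HL7 data).
workdir=$(mktemp -d)
cat > "$workdir/hl7_transfer.data-file" <<'EOF'
[2012/03/12/ 10:30] records sent to healthcare-3
[2012/02/26/ 7:00] connection check
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
EOF

# grep keeps only the matching lines; sort orders them as plain text.
# LC_ALL=C forces byte-wise comparison so the order does not depend on locale.
result=$(grep "records sent" "$workdir/hl7_transfer.data-file" | LC_ALL=C sort)
printf '%s\n' "$result"
```

Sorting as plain text works here because the months and days are zero-padded. The times are not, so two records from the same day (say 9:15 and 10:30) would not sort chronologically without a proper sort key.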
- 4. A Discussion Example
— As the amount of data increases, the process requires more and
more resources
— What if hl7_transfer.data-file is 500GB or bigger?
— What if there are hundreds or thousands of data files?
— What if there are multiple types of data files?
grep "provider 1" hl7_transfer.data-file | sort
— Ignoring the process for a moment, how do we write all the data to
disk in the first place?
Need to rethink the process
- 6. Distributed File-System – “the cloud”
— Files can be stored across many machines
— Files can be replicated across many machines
— Files can be in a hybrid-cloud model
— Share the file-system transparently
— You simply see the usual file structure
— Opportunity to leverage private and public cloud environments
- 8. Map-Reduce – the cloud
— A way of processing large amounts of data across many machines
— Must be able to split the data into chunks for processing (Map)
— Results are recombined after processing (Reduce)
— Requires a constant flow of data from one simple state to another
— Allows for a simple way of breaking down a large task into smaller
manageable tasks
— Increases the available computational power
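The split, map, and reduce steps can be sketched on a single machine with ordinary shell tools. This is only a local stand-in for the idea, not how Hadoop itself is implemented: each chunk file plays the role of one machine's share of the data, and the sample records and chunk size are made up.

```shell
# Sample input (illustrative records only).
workdir=$(mktemp -d)
cat > "$workdir/input.txt" <<'EOF'
[2012/03/12/ 10:30] provider 1 records sent
[2012/02/25/ 9:15] provider 2 records sent
[2012/02/28/ 6:15] provider 1 records sent
[2012/01/05/ 8:00] provider 1 records sent
[2012/04/01/ 7:45] provider 3 records sent
[2012/02/01/ 9:00] provider 1 records sent
EOF

# Split: cut the input into chunks of 2 lines each (chunk_aa, chunk_ab, ...).
split -l 2 "$workdir/input.txt" "$workdir/chunk_"

# Map: filter and sort every chunk independently; these runs could happen in parallel.
for chunk in "$workdir"/chunk_*; do
    # "|| true" keeps going when a chunk has no matching lines (grep exits 1).
    { grep "provider 1" "$chunk" || true; } | LC_ALL=C sort > "$chunk.mapped"
done

# Reduce: merge the already-sorted partial results into one sorted output.
LC_ALL=C sort -m "$workdir"/chunk_*.mapped > "$workdir/result.txt"
cat "$workdir/result.txt"
```

The merged result is the same as running grep and sort over the whole file at once. The difference is that no single step ever needs to see all of the data.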
- 10. What is Hadoop
— A Map-Reduce framework
— Designed to run applications on clusters of
local and remote systems
— HDFS
— The file system of Hadoop (Hadoop Distributed
File System)
— Designed to access clusters of local and
remote systems
- 16. Push file to “the distributed file system”
Put file on the DFS
Check that the file is in the cloud
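A minimal sketch of these two steps with the HDFS command-line client. The target directory is a hypothetical path, and the block assumes a reasonably recent Hadoop release where the hdfs command is on the PATH (older installs use the equivalent "hadoop fs" form).

```shell
# Hypothetical paths: adjust both to your own environment.
local_file="hl7_transfer.data-file"
dfs_dir="/user/${USER:-hadoop}/hl7"

if command -v hdfs >/dev/null 2>&1 && [ -f "$local_file" ]; then
    hdfs dfs -mkdir -p "$dfs_dir"            # create the target directory if missing
    hdfs dfs -put "$local_file" "$dfs_dir/"  # copy the local file into the DFS
    hdfs dfs -ls "$dfs_dir"                  # check that the file now shows up
    put_status="done"
else
    echo "hdfs client or $local_file not found; skipping upload"
    put_status="skipped"
fi
```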
- 17. Running in “the distributed environment”
Call the Hadoop streaming command
Pass the appropriate parameters
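A hedged sketch of a streaming run that reproduces the earlier grep and sort example. The mapper script name, the jar location, and the DFS paths are all assumptions; the jar path in particular varies between installations. The local dry run at the top uses the common "map, sort, reduce" pipeline to check the scripts before touching a cluster.

```shell
workdir=$(mktemp -d)

# Mapper: keep only the lines of interest. Hadoop streaming feeds input on stdin.
cat > "$workdir/mapper.sh" <<'EOF'
#!/bin/sh
# grep exits with status 1 when nothing matches, which would fail the task,
# so a non-zero exit is turned into success here (this also hides real grep errors).
grep "provider 1" || true
EOF
chmod +x "$workdir/mapper.sh"

# Local dry run of the same flow: map, then sort (the shuffle), then reduce (cat).
local_result=$(printf '%s\n' \
    "[2012/02/28/ 6:15] provider 1 records sent" \
    "[2012/02/25/ 9:15] provider 2 records sent" \
    "[2012/01/05/ 8:00] provider 1 records sent" \
    | sh "$workdir/mapper.sh" | LC_ALL=C sort | cat)
printf '%s\n' "$local_result"

# Cluster run. The jar location and the DFS paths below are assumptions.
streaming_jar=$(ls "${HADOOP_HOME:-/usr/lib/hadoop}"/share/hadoop/tools/lib/hadoop-streaming-*.jar 2>/dev/null | head -n 1) || true
if command -v hadoop >/dev/null 2>&1 && [ -n "$streaming_jar" ]; then
    # -files is a generic option and has to come before the streaming options.
    hadoop jar "$streaming_jar" \
        -files "$workdir/mapper.sh" \
        -input "/user/${USER:-hadoop}/hl7/hl7_transfer.data-file" \
        -output "/user/${USER:-hadoop}/hl7-out" \
        -mapper mapper.sh \
        -reducer cat
else
    echo "hadoop or the streaming jar not found; skipped the cluster run"
fi
```

The output directory must not exist before the job starts. The reducer is plain cat because the framework already sorts the mapper output by key between the map and reduce phases.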
- 23. Checking Distributed Cluster Health
— List Data-Nodes
— Dead Nodes
— Node Heart-beat information
— Failed Jobs
— Job Statistics
— Detailed Job Logs
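Most of these checks are also available from the command line. The sketch below assumes a YARN-based cluster (Hadoop 2 or later) with the hdfs and yarn clients installed; run_if_available is a hypothetical helper, and APP_ID is a placeholder you would set to a real application ID.

```shell
# Run a command only if its tool is installed, so the sketch is safe to try anywhere.
run_if_available() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

# Data-nodes, dead nodes, and last-contact (heart-beat) times.
run_if_available hdfs dfsadmin -report

# Failed jobs (applications) known to YARN.
run_if_available yarn application -list -appStates FAILED

# Statistics and detailed logs need a real ID; APP_ID is a placeholder you set yourself.
if [ -n "${APP_ID:-}" ]; then
    run_if_available yarn application -status "$APP_ID"
    run_if_available yarn logs -applicationId "$APP_ID"
fi
```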
- 24. Conclusion
— A different paradigm for solving large-scale problems
— Designed to solve specific problems that can be defined
in a focused map-reduce manner