The document discusses using Python for MapReduce development with Hadoop Streaming. Hadoop Streaming allows any language to be used, as long as the mapper and reducer are executables that read from standard input and write to standard output. Python mapper and reducer examples are provided that count word frequencies in a text file via Hadoop Streaming.
5. print 'Hello BC'
hello.py
Typing `py` before the script every time is tedious.
Make the Python script itself an executable file!
#!/home/hadoop/python/py
print 'Hello BC'
hello.py
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Running Python
Running the Hello BC example
#! (shebang)
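The same idea with a portable shebang: `#!/usr/bin/env python3` finds the interpreter on the PATH instead of hard-coding `/home/hadoop/python/py`. A sketch assuming `python3` is installed (the slides themselves use Python 2):

```shell
# Create a hello script whose shebang locates the interpreter via env,
# rather than hard-coding an interpreter path.
cat > hello.py <<'EOF'
#!/usr/bin/env python3
print('Hello BC')
EOF
chmod 755 hello.py   # mark it executable, as on the slide
./hello.py           # prints: Hello BC
```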
6. #!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
mapper.py
Python MAP
Example: running the mapper via standard I/O
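The mapper's logic can be checked locally without Hadoop. A minimal sketch in Python 3 syntax; the helper `map_line` is hypothetical, introduced here only for illustration:

```python
# Local check of the mapper logic, no Hadoop needed: split a line into
# words and emit one "word<TAB>1" record per word, as the mapper prints.
def map_line(line):
    return ['{0}\t{1}'.format(word, 1) for word in line.strip().split()]

for record in map_line("bc bc bc card bc card it"):
    print(record)
```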
7. [hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted on the first field
Python MAP
Sorting the mapper output
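The `sort -k 1` step stands in for Hadoop's shuffle phase: it brings identical keys together so the reducer sees them consecutively. The same effect, sketched in Python 3:

```python
# The shuffle step, sketched: lexicographic sort brings equal keys
# (the first, tab-separated field) next to each other.
pairs = ['bc\t1', 'bc\t1', 'bc\t1', 'card\t1', 'bc\t1', 'card\t1', 'it\t1']
for pair in sorted(pairs):
    print(pair)
```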
8. import sys

current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count
        current_word = word
if current_word == word:
    print '{0}\t{1}'.format(current_word, current_count)
reducer.py
If it equals the current word, add its count
If the current word is not None
Print the M/R result
Set the new current word
Handles the final line
Python REDUCE
Example: reducer via standard I/O
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
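An alternative reducer sketch, not from the slides: because the input arrives sorted by key, `itertools.groupby` can collect each word's counts without the manual `current_word` bookkeeping. The helper `reduce_sorted` is hypothetical, shown in Python 3 syntax:

```python
from itertools import groupby

def reduce_sorted(lines):
    # Each line is "word<TAB>count"; split into (word, count) pairs.
    parsed = (line.rstrip('\n').split('\t', 1) for line in lines)
    # groupby relies on the sort step having grouped equal keys together.
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield '{0}\t{1}'.format(word, total)

for out in reduce_sorted(['bc\t1', 'bc\t1', 'bc\t1', 'bc\t1',
                          'card\t1', 'card\t1', 'it\t1']):
    print(out)
```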
9. Python ♥ Hadoop
Hadoop Streaming
1. In Hadoop Streaming, the mapper/reducer must be given as an executable command.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the scripts' directory on the PATH so they are reachable from everywhere.
Requirements
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
If you don't, you get the error above
11. Python ♥ Hadoop
Hadoop Streaming command options
Parameter                                     Optional/Required  Description
-input directoryname or filename              Required           Input location for mapper
-output directoryname                         Required           Output location for reducer
-mapper executable or JavaClassName           Required           Mapper executable
-reducer executable or JavaClassName          Required           Reducer executable
-file filename                                Optional           Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName                    Optional           Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName                   Optional           Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName                    Optional           Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName   Optional           Combiner executable for map output
-cmdenv name=value                            Optional           Pass environment variable to streaming commands
-inputreader                                  Optional           For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose                                      Optional           Verbose output
-lazyOutput                                   Optional           Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks                               Optional           Specify the number of reducers
-mapdebug                                     Optional           Script to call when map task fails
-reducedebug                                  Optional           Script to call when reduce task fails
hadoop command [genericOptions] [streamingOptions]
12. Python ♥ Hadoop
Hadoop Streaming generic options
Parameter                  Optional/Required  Description
-conf configuration_file   Optional           Specify an application configuration file
-D property=value          Optional           Use value for given property
-fs host:port or local     Optional           Specify a namenode
-files                     Optional           Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars                   Optional           Specify comma-separated jar files to include in the classpath
-archives                  Optional           Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions]
Example usage
hadoop jar hadoop-streaming-2.5.1.jar \
    -D mapreduce.job.reduces=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
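Because streaming mappers and reducers are plain executables, the pipeline above can be previewed locally with shell pipes standing in for the cluster; a sketch, with `sort` playing the role of the shuffle:

```shell
# Local preview of the -mapper /bin/cat, -reducer /usr/bin/wc pipeline:
# cat passes records through unchanged, sort mimics the shuffle,
# wc -l counts the records.
printf 'bc card\nbc it\n' | /bin/cat | sort | wc -l
```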
13. Python ♥ Hadoop
Running Hadoop Streaming: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters
…..
17. Python ♥ Hadoop
Hadoop Streaming example: refining WordCount
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub('[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
Modified mapper.py
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice2 \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
Regular expression: strip special characters
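What the modified mapper's `re.sub` call does, in isolation (Python 3 syntax; the sample sentence is illustrative):

```python
import re

def clean(line):
    # Strip the special characters listed in the slide's character class;
    # the apostrophe must be escaped inside the single-quoted pattern.
    return re.sub('[=.#/?:$\'!,"}]', '', line)

print(clean("Alice's Adventures, Chapter #1: Down the Rabbit-Hole."))
```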