Hadoop Streaming
IT Merchant Development Team
이태영
2014.11.11
5th study session
Developing MapReduce in Python
Hadoop Streaming

• MapReduce jobs can be written in whatever language the developer and the team know best
• Libraries provided by that language can be used
• Data is exchanged over standard I/O, so performance is lower than native Java MapReduce
• But what if that cost buys development productivity?

Hadoop Streaming works with any language that supports standard input/output (stdio).

※ Two components must be defined (a minimal in-process sketch follows):
1. An executable mapper file implementing the map function
2. An executable reducer file implementing the reduce function
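For orientation, here is a minimal in-process sketch of those two roles (illustrative only; on the cluster each role lives in its own executable file, which the following slides build out):

def map_step(lines):
    # map: turn each input line into (key, value) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_step(pairs):
    # reduce: aggregate every value that shares a key
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

print reduce_step(map_step(['bc bc card']))  # e.g. {'bc': 2, 'card': 1}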
MapReduce

1. MAP: reads the input data from standard input
2. MAP: writes Key, Value pairs to standard output
3. REDUCER: reads the map output <Key, Value> from standard input
4. REDUCER: writes Key, Value pairs to standard output
[Diagram] Data input (file read, pipe, streaming, etc.) → Python map step → PIPE → Python reduce step → MR result output
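This flow can be rehearsed locally with ordinary OS pipes before touching a cluster. A sketch, assuming the mapper.py and reducer.py built in the following slides sit in the current directory and a hypothetical sample.txt holds the input:

import subprocess

# map -> sort (a local stand-in for Hadoop's shuffle) -> reduce, wired with pipes
pipeline = 'cat sample.txt | ./mapper.py | sort -k 1 | ./reducer.py'
print subprocess.check_output(pipeline, shell=True)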
Installing Python
Setup for the standard-I/O mapper example

1. Download Python 2.7.8 from the python.org site and unpack it
2. In the account's home directory, create a symbolic link named python to the Python directory
3. ./configure
4. make

Typing even the python command is tedious, so the interpreter is shortened to py (hence /home/hadoop/python/py).
Running Python
Running the Hello BC example

hello.py:
print 'Hello BC'

Typing py every time is tedious. Make the Python script itself executable with a #! (shebang) line:

hello.py:
#!/home/hadoop/python/py
print 'Hello BC'

[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
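If the interpreter lives at a different path on other machines, a common alternative (not what these slides use) is to resolve it through env:

#!/usr/bin/env python
# Same Hello BC script; env finds python on the PATH instead of
# hard-coding /home/hadoop/python/py.
print 'Hello BC'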
Python MAP
Running the mapper over standard I/O

mapper.py:
#!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)

[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
Python MAP
Sorting the mapper output

[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1

Sorted on the first field
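The sort stands in for Hadoop's shuffle: the reducer on the next slide only merges runs of adjacent equal keys, so unsorted input silently splits the counts. A small pure-Python sketch of that adjacent-key aggregation shows the failure mode:

def aggregate(pairs):
    # merge runs of adjacent equal keys, exactly as the streaming reducer does
    result, current_word, current_count = [], None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                result.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        result.append((current_word, current_count))
    return result

pairs = [('bc', 1), ('card', 1), ('bc', 1)]
print aggregate(pairs)          # unsorted: [('bc', 1), ('card', 1), ('bc', 1)]
print aggregate(sorted(pairs))  # sorted:   [('bc', 2), ('card', 1)]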
Python REDUCE
A reducer over standard I/O

reducer.py:
#!/home/hadoop/python/py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:          # same word as the current one: add its count
        current_count += count
    else:
        if current_word:              # current word is not None: emit the finished group
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count         # start a new current word
        current_word = word

if current_word == word:              # handle the last group
    print '{0}\t{1}'.format(current_word, current_count)

[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
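The same reducer can be written more compactly with itertools.groupby, which makes the adjacent-equal-keys contract explicit (an alternative sketch, not the version used in the rest of the deck):

#!/home/hadoop/python/py
# Alternative reducer: relies on the same contract as reducer.py above,
# i.e. tab-separated "word<TAB>count" lines already sorted by word.
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        word, count = line.strip().split('\t', 1)
        yield word, int(count)

for word, group in groupby(parse(sys.stdin), key=lambda pair: pair[0]):
    total = sum(count for _, count in group)
    print '{0}\t{1}'.format(word, total)

Substituting it for ./reducer.py in the pipeline above should produce identical output.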
Python ♥ Hadoop
Hadoop Streaming: conditions

1. In Hadoop Streaming, the mapper/reducer must be specified as a directly executable command.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …

2. Set the directory PATH so the Python scripts are reachable from anywhere.

If you don't:

Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
Python ♥ Hadoop
Hadoop Streaming

hadoop command [genericOptions] [streamingOptions]

hadoop jar hadoop-streaming-2.5.1.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Location of Hadoop Streaming in Hadoop 2.x:
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
Python ♥ Hadoop
Hadoop Streaming command options (streamingOptions)
Parameter                                   | Optional/Required | Description
-input directoryname or filename            | Required | Input location for the mapper
-output directoryname                       | Required | Output location for the reducer
-mapper executable or JavaClassName         | Required | Mapper executable
-reducer executable or JavaClassName        | Required | Reducer executable
-file filename                              | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName                  | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName                 | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName                  | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value                          | Optional | Pass an environment variable to the streaming commands
-inputreader                                | Optional | For backwards compatibility: specifies a record reader class (instead of an input format class)
-verbose                                    | Optional | Verbose output
-lazyOutput                                 | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks                             | Optional | Specify the number of reducers
-mapdebug                                   | Optional | Script to call when a map task fails
-reducedebug                                | Optional | Script to call when a reduce task fails
hadoop command [genericOptions] [streamingOptions]
Python ♥ Hadoop
Hadoop Streaming generic options
Parameter                | Optional/Required | Description
-conf configuration_file | Optional | Specify an application configuration file
-D property=value        | Optional | Use the given value for the given property
-fs host:port or local   | Optional | Specify a namenode
-files                   | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars                 | Optional | Specify comma-separated jar files to include in the classpath
-archives                | Optional | Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions] 
Example:
hadoop jar hadoop-streaming-2.5.1.jar \
    -D mapreduce.job.reduces=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
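Because the order matters (generic options such as -D must precede the streaming options), it can help to assemble the command programmatically. A hypothetical helper; the jar path and option values here are assumptions for illustration:

import subprocess

def run_streaming(input_path, output_path, mapper, reducer,
                  files=(), reduces=None,
                  jar='hadoop-streaming-2.5.1.jar'):
    cmd = ['hadoop', 'jar', jar]
    if reduces is not None:
        # generic option: must come before the streaming options
        cmd += ['-D', 'mapreduce.job.reduces=%d' % reduces]
    cmd += ['-input', input_path, '-output', output_path,
            '-mapper', mapper, '-reducer', reducer]
    for f in files:
        cmd += ['-file', f]  # ship the script to the compute nodes
    return subprocess.call(cmd)

# e.g. run_streaming('myInputDirs', 'myOutputDir', '/bin/cat', '/usr/bin/wc', reduces=2)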
Python ♥ Hadoop
Running Hadoop Streaming: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null 
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters 
…..
Python ♥ Hadoop
Checking the Hadoop Streaming results
Opening part-00000:

….
you'd	8
you'll	4
you're	15
you've	5
you,	25
you,'	6
you--all	1
you--are	1
you.	1
you.'	1
you:	1
you?	2
you?'	7
young	5
your	62
yours	1
yours."'	1
yourself	5
yourself!'	1
yourself,	1
yourself,'	1
yourself.'	2
youth,	3
youth,'	3
zigzag,	1
Python ♥ Hadoop
Hadoop Streaming example: an improved WordCount
mapper.py, modified: a regular expression strips special characters.

#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub(r'[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice2 \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null 
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
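To see exactly what the substitution strips, the character class can be exercised on a few of the problem tokens from the previous result (a quick standalone check):

import re

# the same character class as in the modified mapper: = . # / ? : $ ' ! , " }
for token in ["you'd", 'you,', "yours.\"'", 'you--all', 'zigzag,']:
    print '%-10s -> %s' % (token, re.sub(r'[=.#/?:$\'!,"}]', '', token))
# you'd      -> youd
# you,       -> you
# yours."'   -> yours
# you--all   -> you--all   (dashes are kept)
# zigzag,    -> zigzag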
Python ♥ Hadoop
Checking the Hadoop Streaming results
Opening part-00000 in wc_alice2:

…..
ye;	1
year	2
years	2
yelled	1
yelp	1
yer	4
yesterday	3
yet	18
yet--Oh	1
yet--and	1
yet--its	1
you	357
you)	1
you--all	1
you--are	1
youd	8
youll	4
young	5
your	62
youre	15
yours	2
yourself	10
youth	6
youve	5
zigzag	1
The end.
