The document discusses using Python for MapReduce development with Hadoop Streaming. Hadoop Streaming allows any language to be used, as long as the mapper and reducer are executables that read from standard input and write to standard output. Python mapper and reducer examples are provided that count word frequencies in a text file via Hadoop Streaming.
5. print 'Hello BC'
hello.py
Typing `py` before the script every time is tedious.
Make the Python script itself an executable file!
#!/home/hadoop/python/py
print 'Hello BC'
hello.py
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Running Python
Running the Hello BC example
#! (shebang)
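The same idea with a portable shebang: `#!/usr/bin/env python3` finds the interpreter on the PATH instead of hard-coding `/home/hadoop/python/py`. A sketch assuming `python3` is installed (the slides themselves use Python 2):

```shell
# Create a hello script whose shebang locates the interpreter via env,
# rather than hard-coding an interpreter path.
cat > hello.py <<'EOF'
#!/usr/bin/env python3
print('Hello BC')
EOF
chmod 755 hello.py   # mark it executable, as on the slide
./hello.py           # prints: Hello BC
```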
6. #!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
mapper.py
Python MAP
Example: running the mapper via standard I/O
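The mapper's logic can be checked locally without Hadoop. A minimal sketch in Python 3 syntax; the helper `map_line` is hypothetical, introduced here only for illustration:

```python
# Local check of the mapper logic, no Hadoop needed: split a line into
# words and emit one "word<TAB>1" record per word, as the mapper prints.
def map_line(line):
    return ['{0}\t{1}'.format(word, 1) for word in line.strip().split()]

for record in map_line("bc bc bc card bc card it"):
    print(record)
```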
7. [hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted on the first field
Python MAP
Sorting the mapper output
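The `sort -k 1` step stands in for Hadoop's shuffle phase: it brings identical keys together so the reducer sees them consecutively. The same effect, sketched in Python 3:

```python
# The shuffle step, sketched: lexicographic sort brings equal keys
# (the first, tab-separated field) next to each other.
pairs = ['bc\t1', 'bc\t1', 'bc\t1', 'card\t1', 'bc\t1', 'card\t1', 'it\t1']
for pair in sorted(pairs):
    print(pair)
```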
8. import sys

current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count
        current_word = word
if current_word == word:
    print '{0}\t{1}'.format(current_word, current_count)
reducer.py
If it equals the current word, add its count
If the current word is not None
Print the M/R result
Set the new current word
Handles the final line
Python REDUCE
Example: reducer via standard I/O
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
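An alternative reducer sketch, not from the slides: because the input arrives sorted by key, `itertools.groupby` can collect each word's counts without the manual `current_word` bookkeeping. The helper `reduce_sorted` is hypothetical, shown in Python 3 syntax:

```python
from itertools import groupby

def reduce_sorted(lines):
    # Each line is "word<TAB>count"; split into (word, count) pairs.
    parsed = (line.rstrip('\n').split('\t', 1) for line in lines)
    # groupby relies on the sort step having grouped equal keys together.
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield '{0}\t{1}'.format(word, total)

for out in reduce_sorted(['bc\t1', 'bc\t1', 'bc\t1', 'bc\t1',
                          'card\t1', 'card\t1', 'it\t1']):
    print(out)
```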
9. Python ♥ Hadoop
Hadoop Streaming
1. In Hadoop Streaming, the mapper/reducer must be given as an executable command.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the scripts' directory on the PATH so they are reachable from everywhere.
Requirements
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
If you don't, you get the error above
11. Python ♥ Hadoop
Hadoop Streaming command options
Parameter                                     Optional/Required  Description
-input directoryname or filename              Required           Input location for mapper
-output directoryname                         Required           Output location for reducer
-mapper executable or JavaClassName           Required           Mapper executable
-reducer executable or JavaClassName          Required           Reducer executable
-file filename                                Optional           Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName                    Optional           Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName                   Optional           Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName                    Optional           Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName   Optional           Combiner executable for map output
-cmdenv name=value                            Optional           Pass environment variable to streaming commands
-inputreader                                  Optional           For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose                                      Optional           Verbose output
-lazyOutput                                   Optional           Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks                               Optional           Specify the number of reducers
-mapdebug                                     Optional           Script to call when map task fails
-reducedebug                                  Optional           Script to call when reduce task fails
hadoop command [genericOptions] [streamingOptions]
12. Python ♥ Hadoop
Hadoop Streaming generic options
Parameter                  Optional/Required  Description
-conf configuration_file   Optional           Specify an application configuration file
-D property=value          Optional           Use value for given property
-fs host:port or local     Optional           Specify a namenode
-files                     Optional           Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars                   Optional           Specify comma-separated jar files to include in the classpath
-archives                  Optional           Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions]
Example usage
hadoop jar hadoop-streaming-2.5.1.jar \
    -D mapreduce.job.reduces=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
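Because streaming mappers and reducers are plain executables, the pipeline above can be previewed locally with shell pipes standing in for the cluster; a sketch, with `sort` playing the role of the shuffle:

```shell
# Local preview of the -mapper /bin/cat, -reducer /usr/bin/wc pipeline:
# cat passes records through unchanged, sort mimics the shuffle,
# wc -l counts the records.
printf 'bc card\nbc it\n' | /bin/cat | sort | wc -l
```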
13. Python ♥ Hadoop
Running Hadoop Streaming: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters
…..
17. Python ♥ Hadoop
Hadoop Streaming example: refining WordCount
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub('[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
Modified mapper.py
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice2 \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
Regular expression: strip special characters
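What the modified mapper's `re.sub` call does, in isolation (Python 3 syntax; the sample sentence is illustrative):

```python
import re

def clean(line):
    # Strip the special characters listed in the slide's character class;
    # the apostrophe must be escaped inside the single-quoted pattern.
    return re.sub('[=.#/?:$\'!,"}]', '', line)

print(clean("Alice's Adventures, Chapter #1: Down the Rabbit-Hole."))
```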