
I want to read a list from a file in my hadoop streaming job. Here is my simple mapper.py:

#!/usr/bin/env python

import sys
import json

def read_file():
    id_list = []
    #read ids from a file
    f = open('../user_ids','r')
    for line in f:
        line = line.strip()
        id_list.append(line)
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)

and here is my reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

current_id = None
current_list = []
id = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    id, line = line.split('\t', 1)

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: the id) before it is passed to the reducer
    if current_id == id:
        current_list.append(line)
    else:
        if current_id:
            # write result to STDOUT
            print '%s\t%s' % (current_id, current_list)
        current_id = id
        current_list = [line]

# do not forget to output the last word if needed!
if current_id == id:
    print '%s\t%s' % (current_id, current_list)

now to run it I say:

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
    -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
    -input test/input.txt  -output test/output -file '../user_ids' 

The job starts to run:

13/11/07 05:04:52 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/07 05:05:21 INFO streaming.StreamJob:  map 100%  reduce 100%
13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:

I get the error:

job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.         LastFailedTask: task_201309172143_1390_m_000001
13/11/07 05:05:21 INFO streaming.StreamJob: killJob...

When I do not read the ids from the file ../user_ids it does not give me any errors, so I think the problem is that it cannot find my ../user_ids file. I have also tried using the location in HDFS and it still did not work. Thanks for your help.

  • You may want to write some log files in your Hadoop streaming code, so you can tell what's going on. Commented Nov 7, 2013 at 10:45
  • I second the need for logging, as some of your lines might well fail on user_id = line['user']['id'].
    – alko
    Commented Nov 7, 2013 at 11:13
  • @alko It works when I use a static list instead of reading from a file, so that line should not be the problem. That is, when I write if __name__ == '__main__': id_list = ['1', '2'] it works.
    – Elham
    Commented Nov 7, 2013 at 11:16

2 Answers

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
  -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
  -input test/input.txt  -output test/output -file '../user_ids'

Does ../user_ids exist on your local file path when you execute the job? If it does then you need to amend your mapper code to account for the fact that this file will be available in the local working directory of the mapper at runtime:

f = open('user_ids','r')
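
A minimal sketch of the amended read_file(), assuming the file is shipped with -file and therefore lands in the task's working directory under its bare name:

def read_file():
    # -file copies ../user_ids into the task's working directory as 'user_ids',
    # so open it by its bare name instead of a parent-relative path
    id_list = []
    with open('user_ids', 'r') as f:
        for line in f:
            id_list.append(line.strip())
    return id_list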
  • Thanks for your suggestion. It worked when I put it in the same directory as my mapper! I would upvote your answer if I had enough reputation. I am new to the community :)
    – Elham
    Commented Nov 7, 2013 at 13:19
  • Why do you need to have both -file ./mapper.py and -mapper ./mapper.py?
    – gsamaras
    Commented Feb 7, 2016 at 0:13
  • In newer versions of Hadoop, the -file option is deprecated; the program advises using the generic option -files instead: -files <comma separated list of files> specifies comma-separated files to be copied to the map-reduce cluster (a sketch follows below). Commented Mar 22, 2017 at 18:22
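
For reference, a rough sketch of the same job using the generic -files option (assuming a Hadoop version that supports it; generic options such as -files must come before the streaming-specific options, and the jar path below is simply the one from the question):

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar \
    -files ./mapper.py,./reducer.py,../user_ids \
    -mapper mapper.py -reducer reducer.py \
    -input test/input.txt -output test/output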

Try giving the full path of the file, or, while executing the hadoop command, make sure you are in the same directory in which the user_ids file is present.
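
For example (a hypothetical invocation; /absolute/path/to/user_ids below is a placeholder for wherever the file actually lives):

# replace /absolute/path/to/user_ids with the real location of the file
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
    -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
    -input test/input.txt -output test/output -file /absolute/path/to/user_ids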

