Hadoop Program in Python: March 2014

This blog post shows how to run a Hadoop word count program written in Python.

Prerequisites:

· Hadoop should be installed

· Python should be installed

Step 1 :

Create the map program

To create the map function, read the input line by line and remove the surrounding whitespace with the Python function strip(),

then split each line into words with the function split().

Print <word>, 1 for each word.

map.py

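#!/usr/bin/python

import sys

# Read the input line by line from standard input
for line in sys.stdin:
    # Remove leading and trailing whitespace
    lineval = line.strip()
    # Split the line into words
    wordval = lineval.split()
    for w in wordval:
        # Emit <word> TAB 1 (this print form works in Python 2 and 3)
        print("%s\t%s" % (w, 1))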

Step 2 :

Test your map function using the command below

echo "one two three four five one file " | python map.py
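The command above should print one tab-separated pair per word, with "one" appearing twice because the mapper does not combine anything. A minimal sketch of the same logic as a function, which is easier to check without a shell (map_line is a hypothetical name used only for this illustration, it is not part of map.py):

```python
def map_line(line):
    """Return the (word, 1) pairs that map.py would print for one line."""
    return [(word, 1) for word in line.strip().split()]


sample = "one two three four five one file "
for word, count in map_line(sample):
    # Same output format as map.py: <word> TAB 1
    print("%s\t%s" % (word, count))
```

Each word gets its own line, and repeated words are repeated in the output. Adding the counts together is left to the reducer.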

Step 3 :

Write your reduce program:
create one dictionary (d)
if the word does not exist in the dictionary, add it with count 1
if the word exists in the dictionary, get its current count and add 1

reduce.py

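#!/usr/bin/python

import sys

# Dictionary of word -> count
d = {}

for line in sys.stdin:
    line1 = line.strip()
    # Each mapper line has the form <word> TAB <count>
    word1, cnt1 = line1.split('\t')

    if word1 in d:
        # Word already seen: add 1 to its current count
        val = d[word1]
        d[word1] = val + 1
    else:
        # First occurrence: start the count at 1
        d[word1] = 1

# Print the whole dictionary once all input has been read
print(d)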
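The reducer above prints the whole dictionary as a single line. If one tab-separated line per word is preferred in the job output, a possible variant looks like the sketch below. It is an assumption on my side, not the program used in this post: count_words is a hypothetical name, and it adds the count carried on each line instead of always adding 1.

```python
def count_words(lines):
    """Sum the counts per word from '<word> TAB <count>' lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word, count = line.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts


# Sample mapper output, kept in memory so the sketch runs without stdin
mapper_output = ["one\t1", "two\t1", "one\t1"]
for word, total in sorted(count_words(mapper_output).items()):
    print("%s\t%d" % (word, total))
```

In a real reducer the list would be replaced by sys.stdin.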

Step 4 :

Test your reduce program

echo "one two three four five one five " | python map.py | python reduce.py
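To cross-check the result of the pipeline, the same counts can be computed directly in Python. This sketch uses collections.Counter from the standard library, which is a different technique from the reducer in this post and is shown only for comparison:

```python
from collections import Counter

sample = "one two three four five one five "

# Counter counts how often each word occurs in the split line
expected = Counter(sample.split())

for word, total in sorted(expected.items()):
    print("%s\t%d" % (word, total))
```

The words "one" and "five" should have count 2 and the other words count 1, the same as in the dictionary printed by the reduce program.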

Step 5 :

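Create the data file (data.txt) and move it into HDFS.
Make the map and reduce Python programs available as well; the command in Step 6 reads them from the local path /usr/local/hadoop/ and ships them to the cluster with -file.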
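The copy into HDFS can be done with the hadoop fs shell (-mkdir and -put). A minimal sketch that builds the two commands, assuming the hadoop executable is on the PATH, the local file is called data.txt, and the target directory is /input (the directory of the -input path used in Step 6). hdfs_upload_commands is a hypothetical helper, not a Hadoop API:

```python
import posixpath


def hdfs_upload_commands(local_path, hdfs_dir, hadoop="hadoop"):
    """Build the 'hadoop fs' commands that copy one local file into HDFS."""
    # HDFS paths always use forward slashes, so posixpath is used here
    target = posixpath.join(hdfs_dir, posixpath.basename(local_path))
    return [
        [hadoop, "fs", "-mkdir", hdfs_dir],          # create the directory
        [hadoop, "fs", "-put", local_path, target],  # copy the file
    ]


commands = hdfs_upload_commands("data.txt", "/input")
for command in commands:
    # Only printed here; pass each list to subprocess.check_call to run it
    print(" ".join(command))
```

The sketch only prints the commands, so it is safe to run on a machine without Hadoop.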

Step 6 :

Run the job with Hadoop streaming. The general form of the command is:

<hadoop> jar <streaming jar file>
-file <Python code for Mapper> -mapper <Python code for Mapper>
-file <Python code for Reducer> -reducer <Python code for Reducer>
-input <Input Path> -output <Output Path>

bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
-file /usr/local/hadoop/map.py -mapper /usr/local/hadoop/map.py \
-file /usr/local/hadoop/reduce.py -reducer /usr/local/hadoop/reduce.py \
-input /input/data.txt -output /pythonoutput
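The template maps onto the concrete command one option at a time. A sketch that assembles the same argument list in Python, using the paths from this post (streaming_command is a hypothetical helper written for this illustration, not part of Hadoop):

```python
def streaming_command(hadoop, jar, mapper, reducer, input_path, output_path):
    """Build the argument list for a Hadoop streaming job."""
    return [
        hadoop, "jar", jar,
        "-file", mapper, "-mapper", mapper,     # ship and use the mapper
        "-file", reducer, "-reducer", reducer,  # ship and use the reducer
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    "bin/hadoop",
    "/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
    "/usr/local/hadoop/map.py",
    "/usr/local/hadoop/reduce.py",
    "/input/data.txt",
    "/pythonoutput",
)
# Only printed here; subprocess.check_call(cmd) would start the job
print(" ".join(cmd))
```

Hadoop refuses to start the job if the output directory already exists, so remove /pythonoutput or choose a new output path before running the command a second time.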

Sunday, 23 March 2014