Sunday, 23 March 2014

Hadoop Program in Python

This post shows how to run a Hadoop word count program written in Python.

Prerequisites :
·        Hadoop should be installed
·        Python should be installed

Step 1 :
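Create the Map program
To create the map function, read the input line by line and remove the surrounding whitespace with the Python function strip(),
then split each line into words with the function split().
Print <word>, 1 for each word, separated by a tab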


map.py

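#!/usr/bin/python
import sys

# Read the input line by line from standard input
for line in sys.stdin:
    # Remove leading and trailing whitespace
    lineval = line.strip()
    # Split the cleaned line into words
    wordval = lineval.split()
    for w in wordval:
        # Emit <word><tab>1
        print("%s\t%s" % (w, 1))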



Step 2 :
Test your map function using the command below

echo "one two three four five one file " | python map.py
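
The same check can be sketched in plain Python, without a shell. The helper name map_words is hypothetical (it is not part of map.py); it only repeats the strip/split logic from Step 1 on an in-memory string so the emitted records can be inspected.

```python
def map_words(lines):
    """Yield one tab-separated (word, 1) record per word, mirroring map.py."""
    for line in lines:
        # strip() removes surrounding whitespace, split() breaks on whitespace
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)


if __name__ == "__main__":
    sample = ["one two three four five one file "]
    for record in map_words(sample):
        print(record)
```

Note that the mapper does not add anything up: the word "one" occurs twice in the sample, so it is emitted twice, each time with the count 1. Adding up is the job of the reducer.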




Step 3 :
Write your Reduce program
Create one dictionary (d)
If the word does not exist in the dictionary, add the word with count 1
If the word already exists in the dictionary, get its current count and add 1

reduce.py

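#!/usr/bin/python

import sys

# Dictionary that maps each word to its running count
d = {}

for line in sys.stdin:
    line1 = line.strip()
    # Skip empty lines, they carry no record
    if not line1:
        continue
    # Each mapper record has the form <word><tab><count>
    word1, cnt1 = line1.split('\t')

    if word1 in d:
        # Word already seen: add 1 to its current count
        val = d[word1]
        d[word1] = val + 1
    else:
        # First occurrence of the word
        d[word1] = 1

# Print the dictionary once all input has been read
print(d)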
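
The dictionary above keeps every distinct word in memory. Hadoop streaming hands the reducer its input sorted by key, so an alternative (not the approach used in this post) is to add up each run of identical words as it arrives. The sketch below assumes tab-separated <word><tab><count> records; the function name reduce_sorted is hypothetical.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) pairs from records that are already sorted by word."""
    # Turn each non-empty record into a [word, count] pair
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, so the input must be sorted
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    sorted_records = ["five\t1\n", "five\t1\n", "one\t1\n", "two\t1\n"]
    for word, total in reduce_sorted(sorted_records):
        print("%s\t%d" % (word, total))
```

This version holds only one word at a time and prints one tab-separated line per word, but it gives wrong totals if the input is not sorted, for example when it is tested locally without a sort step between the mapper and the reducer.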




Step 4 :
Test your reduce program

echo "one two three four five one five " | python map.py | python reduce.py
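
The shell pipeline above can also be imitated inside one Python process, which is handy for a quick check before submitting a job. This is only a sketch: the function names run_map and run_reduce are hypothetical, and the sorted() call stands in for the sort that Hadoop performs between the map and reduce phases.

```python
def run_map(lines):
    """Emit a (word, 1) pair for every word, as map.py does."""
    return [(word, 1) for line in lines for word in line.strip().split()]


def run_reduce(pairs):
    """Count words with a dictionary, as reduce.py does."""
    counts = {}
    for word, count in pairs:
        if word in counts:
            counts[word] = counts[word] + count
        else:
            counts[word] = count
    return counts


if __name__ == "__main__":
    text = ["one two three four five one five "]
    # sorted() plays the role of Hadoop's sort between map and reduce
    result = run_reduce(sorted(run_map(text)))
    print(result)
```

For the sample line, "one" and "five" are counted twice and the other three words once.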


Step 5 :
Create a data file (data.txt) and move it into HDFS
Copy the map and reduce Python programs to /usr/local/hadoop, the path used by the -file options in Step 6
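
Step 5 gives no commands, so here is one possible way to do the upload, driven from Python. It assumes that the hadoop launcher is on the PATH and that the target directory is /input, as in the command of Step 6; the helper names are hypothetical. hadoop fs -mkdir and hadoop fs -put are the standard file system shell commands.

```python
import subprocess


def hdfs_upload_commands(local_file, hdfs_dir):
    """Return the 'hadoop fs' commands that create hdfs_dir and copy local_file into it."""
    return [
        ["hadoop", "fs", "-mkdir", hdfs_dir],
        ["hadoop", "fs", "-put", local_file, hdfs_dir],
    ]


def upload(local_file, hdfs_dir):
    """Run the commands one by one; needs a working Hadoop installation."""
    for command in hdfs_upload_commands(local_file, hdfs_dir):
        # check_call raises an error if the command fails
        subprocess.check_call(command)


if __name__ == "__main__":
    # Only print the commands here; call upload("data.txt", "/input") on a real cluster
    for command in hdfs_upload_commands("data.txt", "/input"):
        print(" ".join(command))
```

If /input already exists, the -mkdir command may fail; in that case run only the -put command.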





Step 6 :
Run the job with the Hadoop streaming jar

The general form of the command is
<hadoop> jar <streaming jar file>
-file <Python code for Mapper> -mapper <Python code for Mapper>
-file <Python code for Reducer> -reducer <Python code for Reducer>
-input <Input path> -output <Output path>


bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
-file /usr/local/hadoop/map.py -mapper /usr/local/hadoop/map.py \
-file /usr/local/hadoop/reduce.py -reducer /usr/local/hadoop/reduce.py \
-input /input/data.txt -output /pythonoutput