This blog post shows how to run a Hadoop word count program written in Python.
Prerequisites :
· Hadoop should be installed
· Python should be installed
Step 1 :
Create the Map program
To create the map function, read the input line by line and remove surrounding whitespace with the Python function strip(),
then split each line into words with the function split().
Print <word><tab>1 for each word.
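map.py
#!/usr/bin/python
import sys

# Read the input line by line from standard input
for line in sys.stdin:
    # Remove surrounding whitespace, then split the line into words
    lineval = line.strip()
    wordval = lineval.split()
    for w in wordval:
        # Emit "<word><TAB>1" for every word
        print("%s\t%s" % (w, 1))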
Step 2 :
Test your map function using the command below
echo "one two three four five one file " | python map.py
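The same check can be sketched in plain Python, which helps when you want to see exactly which pairs the mapper emits for a line. This is only an illustration: map_line is a hypothetical helper name, not part of map.py.

```python
def map_line(line):
    """Return the (word, 1) pairs the mapper would emit for one input line."""
    return [(word, 1) for word in line.strip().split()]


# Same sample text as the echo command in this step
sample = "one two three four five one file "
for word, count in map_line(sample):
    # One tab-separated "<word><TAB>1" line per word, repeated words included
    print("%s\t%s" % (word, count))
```

Note that the mapper does no counting itself: "one" appears twice in the sample, so it is emitted twice with a count of 1 each time.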
Step 3 :
Write your Reduce program
Create one dictionary (d)
If the word does not exist in the dictionary, add it to the dictionary with count 1
If the word exists in the dictionary, get its current count and add 1 to it
reduce.py
#!/usr/bin/python
import sys

# Dictionary mapping each word to its running count
d = {}
for line in sys.stdin:
    line1 = line.strip()
    if not line1:
        continue  # skip blank lines
    word1, cnt1 = line1.split('\t')
    if word1 in d:
        # Word already seen: add to its current count (the mapper always emits 1)
        d[word1] = d[word1] + int(cnt1)
    else:
        # First occurrence of the word
        d[word1] = int(cnt1)
# Print one "<word><TAB><count>" line per word
for word1 in d:
    print("%s\t%s" % (word1, d[word1]))
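Hadoop sorts the mapper output by key before it reaches the reducer, so a reducer can also add up counts one group of consecutive lines at a time instead of keeping every word in a dictionary. The sketch below shows that alternative; it is not the approach used in this post, and reduce_sorted is a hypothetical name. It assumes the input lines are already sorted by word.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) pairs from tab-separated lines that are sorted by word."""
    # Split each non-blank line into [word, count]
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges neighbouring items, which is why sorted input is required
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Hand-written sample of sorted mapper output
sorted_lines = ["five\t1\n", "five\t1\n", "one\t1\n", "one\t1\n", "two\t1\n"]
for word, total in reduce_sorted(sorted_lines):
    print("%s\t%d" % (word, total))
```

This version only holds one word's group in memory at a time, which may matter for large vocabularies. With unsorted input it would print the same word more than once.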
Step 4 :
Test your reduce program using the command below
echo "one two three four five one five " | python map.py | python reduce.py
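The shell test above pipes the mapper straight into the reducer, while Hadoop sorts the mapper output in between. A rough local imitation of the full map, sort, reduce flow could look like the sketch below. The function names mapper, reducer and run_local are made up for illustration, and sorted() only stands in for Hadoop's shuffle and sort phase.

```python
def mapper(lines):
    """Yield a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reducer(pairs):
    """Add up the counts per word using a dictionary."""
    counts = {}
    for word, count in pairs:
        counts[word] = counts.get(word, 0) + count
    return counts


def run_local(lines):
    """Run map, then sort, then reduce, all in one process."""
    # sorted() stands in for the sort Hadoop performs between map and reduce
    return reducer(sorted(mapper(lines)))


result = run_local(["one two three four five one five "])
for word in sorted(result):
    print("%s\t%d" % (word, result[word]))
```

Because the reducer here uses a dictionary, the sort does not change the counts. It is included only to mirror the order in which Hadoop would deliver the data.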
Step 5 :
Create a data file (data.txt) and move it into HDFS
Copy the map and reduce Python programs into the Hadoop directory (/usr/local/hadoop in the Step 6 command)
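As a rough sketch of this step, the Python snippet below writes a small data.txt and prints the hadoop fs commands that would create the /input directory and upload the file. The file contents are made up, hdfs_upload_commands is a hypothetical helper, and the snippet only prints the commands rather than running them, so check them against your own setup first.

```python
import os


def hdfs_upload_commands(local_file, hdfs_dir):
    """Build the `hadoop fs` commands that create hdfs_dir and upload local_file into it."""
    target = hdfs_dir.rstrip("/") + "/" + os.path.basename(local_file)
    return [
        ["hadoop", "fs", "-mkdir", hdfs_dir],
        ["hadoop", "fs", "-put", local_file, target],
    ]


# Write a small sample input file (contents are illustrative only)
with open("data.txt", "w") as handle:
    handle.write("one two three four five one five\n")

# Print the commands so they can be reviewed and run by hand
for command in hdfs_upload_commands("data.txt", "/input"):
    print(" ".join(command))
```

The /input directory matches the -input path used in Step 6.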
Step 6 :
Run the command below
<Hadoop> jar <streaming jar file> \
-file <Python code for Mapper> -mapper <Python code for Mapper> \
-file <Python code for Reducer> -reducer <Python code for Reducer> \
-input <Input path> -output <Output path>
bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
-file /usr/local/hadoop/map.py -mapper /usr/local/hadoop/map.py \
-file /usr/local/hadoop/reduce.py -reducer /usr/local/hadoop/reduce.py \
-input /input/data.txt \
-output /pythonoutput
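If you launch the job from a script, the template above can be filled in with a small Python helper. This is a sketch only: streaming_command is a hypothetical name, and the paths passed to it are the ones from this post, so adjust them to your installation.

```python
def streaming_command(hadoop_bin, streaming_jar, mapper, reducer, input_path, output_path):
    """Assemble the Hadoop Streaming command line as a list of arguments."""
    return [
        hadoop_bin, "jar", streaming_jar,
        "-file", mapper, "-mapper", mapper,      # ship the mapper and use it
        "-file", reducer, "-reducer", reducer,   # ship the reducer and use it
        "-input", input_path,
        "-output", output_path,
    ]


command = streaming_command(
    "bin/hadoop",
    "/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
    "/usr/local/hadoop/map.py",
    "/usr/local/hadoop/reduce.py",
    "/input/data.txt",
    "/pythonoutput",
)
# Print the command for review; it could then be passed to subprocess.call(command)
print(" ".join(command))
```

Keeping the arguments in a list avoids shell quoting problems if a path ever contains spaces.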


