Installing Hadoop for Fedora & Oracle Linux (Single Node Cluster) Part-II

In the last post we saw how to set up a single node cluster. In this post we will test that setup by running a sample MapReduce Java program. For the purposes of this tutorial we will use the wordcount example that ships with the Hadoop distribution. Let's start:
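
For reference, the wordcount example bundled with Hadoop is essentially the classic mapper/reducer pair from the Hadoop tutorial. The sketch below, written against the Hadoop 1.x mapreduce API, shows roughly what it does; the bundled source may differ slightly. The mapper emits (word, 1) for every token it sees, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");      // Hadoop 1.x style constructor
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}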

First, start your single node cluster if you have not already done so:

start-all.sh

Next, check whether all the Hadoop Java processes have started by using the jps command.

[hduser@Fuji ~]$ jps
6794 NameNode
7362 TaskTracker
7107 SecondaryNameNode
7494 Jps
6938 DataNode
7205 JobTracker

Change your working directory to $HADOOP_HOME; this will make the rest of the commands easier to type.

 cd $HADOOP_HOME

We need some text files to act as input for our test program. Put all the text files you can manage into a directory, and let's call this directory input_local. A large input is recommended so that you can see the power of Hadoop; I am using a text file of 2.8 GB.
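
For example, assuming your text files already live somewhere on the local disk (the source path below is only illustrative):

mkdir input_local
cp /path/to/your/textfiles/*.txt input_local/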

Hadoop reads its input from HDFS, so we need to copy the input directory from the local file system to HDFS. The following command will do the job, assuming input_local is located in our $HADOOP_HOME.

hadoop fs -put input_local /user/hduser/input_HDFS 

This will create a directory named input_HDFS inside HDFS. To check whether the copy succeeded, list the contents of HDFS:

hadoop fs -ls /user/hduser 

The output should be something like this:

[hduser@Fuji ~]$ hadoop fs -ls /user/hduser
Found 1 items
-rw-r--r--   3 hduser supergroup 2918796150 2013-06-17 15:38 /user/hduser/input_HDFS

Now let's run the wordcount job itself. Type the following command while still in the $HADOOP_HOME directory.

hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input_HDFS /user/hduser/output

The output should look as follows.

[hduser@Fuji hadoop]$ hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input_HDFS /user/hduser/output
13/06/17 22:57:30 INFO input.FileInputFormat: Total input paths to process : 1
13/06/17 22:57:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/17 22:57:30 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/17 22:57:31 INFO mapred.JobClient: Running job: job_201306171550_0012
13/06/17 22:57:32 INFO mapred.JobClient:  map 0% reduce 0%
13/06/17 22:57:41 INFO mapred.JobClient:  map 2% reduce 0%
13/06/17 22:57:44 INFO mapred.JobClient:  map 4% reduce 0%
13/06/17 22:57:47 INFO mapred.JobClient:  map 5% reduce 0%
13/06/17 22:57:50 INFO mapred.JobClient:  map 6% reduce 0%
13/06/17 22:57:51 INFO mapred.JobClient:  map 7% reduce 0%
13/06/17 22:57:54 INFO mapred.JobClient:  map 8% reduce 0%
13/06/17 22:58:00 INFO mapred.JobClient:  map 10% reduce 0%
13/06/17 22:58:03 INFO mapred.JobClient:  map 11% reduce 0%
13/06/17 22:58:06 INFO mapred.JobClient:  map 12% reduce 0%
13/06/17 22:58:07 INFO mapred.JobClient:  map 13% reduce 3%
13/06/17 22:58:09 INFO mapred.JobClient:  map 14% reduce 3%
13/06/17 22:58:12 INFO mapred.JobClient:  map 15% reduce 3%
13/06/17 22:58:16 INFO mapred.JobClient:  map 16% reduce 3%
13/06/17 22:58:18 INFO mapred.JobClient:  map 17% reduce 3%
13/06/17 22:58:19 INFO mapred.JobClient:  map 18% reduce 3%
13/06/17 22:58:21 INFO mapred.JobClient:  map 19% reduce 3%
13/06/17 22:58:22 INFO mapred.JobClient:  map 20% reduce 4%

Alternatively, we can prefix the above command with the Linux time command to see the total time Hadoop takes to process the input. The result should help you appreciate Hadoop's speed: for me it was 230 seconds for 2.8 GB of input on an Intel Core i5 machine.
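
For example (note that a MapReduce job fails if its output directory already exists, so remove the previous run's output first with hadoop fs -rmr /user/hduser/output if you re-run):

time hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input_HDFS /user/hduser/output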

Our results are written to the HDFS directory /user/hduser/output. To list the output files created, use this command.

[hduser@Fuji hadoop]$ hadoop fs -ls /user/hduser/output
Found 3 items
-rw-r--r--   3 hduser supergroup          0 2013-06-17 16:12 /user/hduser/output/_SUCCESS
drwxr-xr-x   - hduser supergroup          0 2013-06-17 15:54 /user/hduser/output/_logs
-rw-r--r--   3 hduser supergroup     439043 2013-06-17 16:12 /user/hduser/output/part-r-00000

/user/hduser/output/part-r-00000 is our desired output file. To view its contents, type the following command:

hadoop fs -cat /user/hduser/output/part-r-00000

The output will look like this:

"1490   1
"1498," 1
"35"    1
"A      2
"AS-IS".        1
"A_     1
"Absoluti       1
"Alack 1

Next time we will see how to set up a multi-node cluster, which should not be a hard task after this.
