Installing Hadoop in pseudo-distributed mode lets you mimic a multi-server cluster on a single machine. Unlike standalone mode, all the Hadoop daemons run in this mode. The data in pseudo-distributed mode is also stored in HDFS rather than on the local hard disk.
If you have followed the last post, the first three steps of this tutorial are the same.
- Create Hadoop user
- Install Java
- Download and unpack Hadoop
- Configure SSH
- Configure Hadoop
- Format Hadoop NameNode
- Start Hadoop
- Test Hadoop installation
Create Hadoop User
It is recommended to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine.
Open System Preference > Users & Groups
Click the ‘+’ button at the bottom of the window that lists the existing users. You may first need to click the lock icon and enter an administrator name and password. Once unlocked, click the ‘+’ button and enter the following:-
Full Name: hadoop
Account Name: hadoop
Also set the password for the account. Click on “Create User”. You can now login to the hadoop account to install Hadoop.
Install Java
If you are running Mac OS, you will already have Java installed on your system. But just to make sure, open the terminal and enter the following command.
$:~ java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06-434-11M3909)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)
By doing so, you will see the version of Java installed on your system (1.6.0_37 in this case). Java 6 or later is required to run Hadoop. If your version number suggests otherwise, please download the latest JDK.
Download and unpack Hadoop
Download Hadoop from the Hadoop release pages (http://hadoop.apache.org/releases.html). Make sure to download the latest stable version of Hadoop (hadoop-1.2.1.tar.gz as of this post). Save the file in /Users/hadoop (or any other location of your choice).
Unpack the file using the following command:-
$:~ tar -xzvf hadoop-1.2.1.tar.gz
Set the hadoop user as the owner of the extracted Hadoop files.
$:~ chown -R hadoop hadoop-1.2.1
Configure SSH
SSH (secure shell) allows two networked devices to exchange data over a secure channel. Since pseudo-distributed mode mimics a multi-server cluster, the Hadoop control scripts need SSH to perform cluster-wide operations. For example, the start-all.sh script starts all the daemons in the cluster.
To work seamlessly with SSH, we need to set up password-less login for the hadoop user on the machines of the cluster. Since we are in pseudo-distributed mode, we only need to set up password-less login to localhost.
To do this, we need to generate a public/private key pair. On a real cluster the public key would be placed in a location shared across the machines (over NFS, for example); in pseudo-distributed mode it only has to live on the local machine.
First generate the key pair by typing the following command in the hadoop user account:
$:~ ssh-keygen -t rsa -f ~/.ssh/id_rsa
This will generate the key pair, storing the private key in ~/.ssh/id_rsa and the public key in ~/.ssh/id_rsa.pub.
Now we would like to share the public key with all the machines in the cluster. To do this, we need to make sure the public key is stored in the ~/.ssh/authorized_keys file on every machine in the cluster that we want to connect to.
To do that, type the following command:
$:~ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
You can now test the password-less ssh login by the following command:
$:~ ssh localhost
Last login: Mon Aug 19 10:49:42 2013
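If ssh still prompts for a password at this point, overly permissive file modes are the usual culprit: sshd ignores key files it considers accessible by others. A minimal sketch of tightening the permissions (these modes are standard OpenSSH requirements, not anything Hadoop-specific):

```shell
# sshd refuses to use authorized_keys if the file or the ~/.ssh
# directory is writable by anyone other than the owner.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
# create authorized_keys if the cat command above has not been run yet
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After fixing the modes, retry `ssh localhost`.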
Configure Hadoop
The Hadoop configuration is more elaborate in comparison to the one in standalone mode as we need to configure the daemons.
Before we go any further, let us understand the different Hadoop configuration files and their usage. These files are stored in the $HADOOP_HOME/conf folder.
- hadoop-env.sh – Environment variables used by the scripts that run Hadoop.
- core-site.xml – Configuration settings for Hadoop core, common to HDFS and MapReduce.
- hdfs-site.xml – Configuration settings for HDFS daemons.
- mapred-site.xml – Configuration settings for MapReduce daemons.
To start off, set the Java home path so that Hadoop can find the version of Java you want to use. To do this, enter the following in hadoop-env.sh:
export JAVA_HOME=/Library/Java/Home
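On newer versions of OS X the /Library/Java/Home stub may not resolve to a JDK. Apple ships a java_home utility that locates the active JDK; a hedged alternative fragment for hadoop-env.sh, only needed if the path above does not work on your system:

```shell
# hadoop-env.sh fragment: let OS X resolve the active JDK location
# instead of hard-coding a path.
export JAVA_HOME=$(/usr/libexec/java_home)
```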
Set the property fs.default.name, which specifies the location where HDFS resides. We do this by adding the following in core-site.xml under the configuration tags (note the hdfs:// scheme):

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Set the property dfs.replication, which tells HDFS how many copies to make of each block. Since we only have a single node, set it to 1 by adding the following in hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Set the property mapred.job.tracker, which gives the host and port where the JobTracker runs. To do this, add the following lines in mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
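As a recap, all three XML files can also be written from the terminal with here-documents. The sketch below assumes Hadoop was unpacked as hadoop-1.2.1 under the current directory (adjust CONF otherwise); the values follow the standard Hadoop 1.x forms, an hdfs:// URI for fs.default.name and a plain host:port pair for mapred.job.tracker:

```shell
# Write the three pseudo-distributed config files in one go.
# CONF is an assumed path; change it to match your unpack location.
CONF=hadoop-1.2.1/conf
mkdir -p "$CONF"

cat > "$CONF/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > "$CONF/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

cat > "$CONF/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```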
Format Hadoop NameNode
The first step is to format HDFS, the distributed filesystem that Hadoop runs on. You only need to do this the first time you set up the installation.
You can do this by typing the following command:
$:~ hadoop-1.2.1/bin/hadoop namenode -format
You will get an output similar to the following:
13/08/19 12:08:34 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Prashants-MacBook-Pro.local/***.***.*.**
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/08/19 12:08:40 INFO util.GSet: VM type       = 64-bit
13/08/19 12:08:40 INFO util.GSet: 2% max memory = 39.83375 MB
13/08/19 12:08:40 INFO util.GSet: capacity      = 2^22 = 4194304 entries
13/08/19 12:08:40 INFO util.GSet: recommended=4194304, actual=4194304
13/08/19 12:08:41 INFO namenode.FSNamesystem: fsOwner=hadoop
13/08/19 12:08:41 INFO namenode.FSNamesystem: supergroup=supergroup
13/08/19 12:08:41 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/08/19 12:08:41 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/08/19 12:08:41 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/08/19 12:08:41 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/08/19 12:08:41 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/08/19 12:08:41 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/08/19 12:08:41 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/08/19 12:08:41 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
13/08/19 12:08:41 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ******-MacBook-Pro.local/***.***.*.**
************************************************************/
Start Hadoop
Starting Hadoop essentially means running all the Hadoop daemons. To do this, we execute the following command:
$:~ hadoop-1.2.1/bin/start-all.sh
starting namenode, logging to /Users/hadoop/hadoop-1.1.2/libexec/../logs/hadoop-hadoop-namenode-Prashants-MacBook-Pro.local.out
localhost: starting datanode, logging to /Users/hadoop/hadoop-1.1.2/libexec/../logs/hadoop-hadoop-datanode-Prashants-MacBook-Pro.local.out
localhost: starting secondarynamenode, logging to /Users/hadoop/hadoop-1.1.2/libexec/../logs/hadoop-hadoop-secondarynamenode-Prashants-MacBook-Pro.local.out
starting jobtracker, logging to /Users/hadoop/hadoop-1.1.2/libexec/../logs/hadoop-hadoop-jobtracker-Prashants-MacBook-Pro.local.out
localhost: starting tasktracker, logging to /Users/hadoop/hadoop-1.1.2/libexec/../logs/hadoop-hadoop-tasktracker-Prashants-MacBook-Pro.local.out
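To confirm that all five daemons actually came up, you can check their pid files: each Hadoop 1.x daemon writes one under HADOOP_PID_DIR (default /tmp) named hadoop-&lt;user&gt;-&lt;daemon&gt;.pid. A small sketch, assuming the daemons run under the hadoop user as in this tutorial:

```shell
# Report which of the five expected daemons left a pid file behind.
# /tmp is the Hadoop 1.x default for HADOOP_PID_DIR.
PID_DIR=${HADOOP_PID_DIR:-/tmp}
for d in namenode datanode secondarynamenode jobtracker tasktracker; do
  f="$PID_DIR/hadoop-hadoop-$d.pid"
  if [ -f "$f" ]; then
    echo "$d is up (pid $(cat "$f"))"
  else
    echo "$d: no pid file found"
  fi
done
```

Alternatively, running `jps` (which ships with the JDK) should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.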
Test Hadoop installation
To test the Hadoop installation, execute the following command, which runs an example MapReduce job that estimates the value of pi:
$:~ hadoop-1.2.1/bin/hadoop jar hadoop-1.2.1/hadoop-examples-1.2.1.jar pi 10 100
This concludes the Hadoop installation in Pseudo Distributed mode. However, if you are a beginner like me, I strongly suggest that you install the Cloudera Hadoop Demo Virtual Machine with VirtualBox. Please follow the next post to see how it can be done.
—
References
1. Hadoop – The Definitive Guide (Tom White)
2. http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)