In order to make ssh keys work and to avoid giving superuser permissions to students who want to run a map/reduce program, we first create a user called hadoop with a home folder in /ahome (which is kept off NFS). Then we extract the hadoop distribution and change its owner to hadoop. So that this user is not shared across NFS, we choose a user ID and group ID lower than 700 (to be safe); in this case, 666:
sudo mkdir hadoop (in /ahome)
sudo groupadd -g 666 hadoop
sudo useradd -g hadoop -u 666 -d /ahome/hadoop hadoop
sudo passwd hadoop
sudo tar xvf /ahome/sadmin/Desktop/hadoop-0.20.2.tar.gz (in /usr/local)
sudo chown -R hadoop:hadoop hadoop-0.20.2/
cd /ahome
sudo chown -R hadoop:hadoop hadoop/
Next, we switch to the hadoop user, generate a new password-less RSA key (the -P "" parameter), and append the public key to the ssh authorized_keys file so that hadoop can ssh to a computer without a password:
su - hadoop
ssh-keygen -t rsa -P ""
cat /ahome/hadoop/.ssh/id_rsa.pub >> /ahome/hadoop/.ssh/authorized_keys
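If ssh still prompts for a password after this, the usual culprit is file permissions: sshd ignores authorized_keys when the .ssh directory or the key file is group- or world-accessible. A quick fix, as a sketch (the path comes from the steps above; the guard just makes it safe to run anywhere):

```shell
# Tighten permissions so sshd will honor the key (sshd's StrictModes,
# on by default, rejects loose permissions). SSH_DIR is overridable;
# it defaults to the hadoop user's .ssh directory from above.
SSH_DIR=${SSH_DIR:-/ahome/hadoop/.ssh}
if [ -d "$SSH_DIR" ]; then
    chmod 700 "$SSH_DIR"
    chmod 600 "$SSH_DIR/authorized_keys"
fi
```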
Next, we edit hadoop's hadoop-env.sh configuration file to work around problems with IPv6 (by telling Java to prefer the IPv4 stack) and to specify the location of the Java virtual machine:
cd /usr/local/hadoop-0.20.2/conf
pico hadoop-env.sh
Change:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Next, we specify where the temporary directory for hadoop is, point fs.default.name and the jobtracker at the name of the master in the cluster, and set the dfs replication number to the number of slaves in the cluster:
pico core-site.xml
Change:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-0.20.2/hadoop-temp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
pico mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
pico hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>10</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
We make the temp directory that was specified in core-site.xml:
cd ..
mkdir hadoop-temp (in /usr/local/hadoop-0.20.2)
Next, we add the IP address and name combinations into the /etc/hosts file:
In /etc/hosts:
192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2
etc...
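For a larger cluster, the host lines can be generated instead of typed by hand. A sketch, assuming the 192.168.0.x addressing above and ten slaves (to match the replication factor); redirect the output into /etc/hosts as root:

```shell
# Print master plus slave1..slave10, numbered consecutively from
# 192.168.0.1 (addresses and slave count are assumptions -- adjust
# to your cluster).
echo "192.168.0.1 master"
for i in $(seq 1 10); do
    echo "192.168.0.$((i + 1)) slave$i"
done
```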
To complete the ssh portion of the setup, we ssh as the user hadoop from every computer in the cluster to every computer in the cluster, accepting each host key when prompted:
su - hadoop
ssh master
ssh slave1
ssh slave2
etc...
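The logins above can also be written as a loop, which is handy with many slaves. This sketch only prints the commands (drop the echo to actually connect); StrictHostKeyChecking=no, an addition not in the steps above, auto-accepts each host key:

```shell
# Dry run: print one ssh command per host (hostnames assumed from
# /etc/hosts above). Remove the echo to actually log in to each machine.
for h in master slave1 slave2; do
    echo ssh -o StrictHostKeyChecking=no "$h" exit
done
```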
Then we go back to the conf/ directory and edit the masters file so it contains the name of the master; this file actually tells Hadoop where to run the secondary namenode, which here is the master (THIS IS ONLY FOR THE MASTER):
pico masters
(then add the name of the master)
Then we edit the slaves file in the conf/ directory to include all of the slaves in the cluster (DO THIS ON EVERY MACHINE IN THE CLUSTER):
pico slaves
(then add the names of the slaves)
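With the example hostnames above, the two files would look like this (one hostname per line; extend the slaves list to match your cluster):

```
conf/masters (master only):
master

conf/slaves (every machine):
slave1
slave2
```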
Finally, we format the namenode from the /usr/local/hadoop-0.20.2 directory (THIS IS ONLY FOR THE MASTER):
bin/hadoop namenode -format