Thursday, December 22, 2011

Hadoop: CDH installation on CentOS (multi-node)

I'm going to explain how to install Cloudera's Hadoop distribution, known as CDH (Cloudera's Distribution for Hadoop), on CentOS.


I prepared a physical server for this installation, installed CentOS 5.6 with the Xen hypervisor, and configured five VMs to host all of the Hadoop nodes.





In my case, there were a master namenode, a secondary namenode, and three datanodes. I allocated 1GB of RAM and 100GB of storage to each VM. I didn't worry about performance in this environment, because my priority was to better understand how Hadoop works.


1. Add the Cloudera repository file to the /etc/yum.repos.d/ directory.
If you don't have a Cloudera repo file there, create a new one. For example, create a file named "cloudera-cdh3.repo" and save the following lines in it:
[cloudera-cdh3]
name=Cloudera's Distribution for Hadoop, Version 3
mirrorlist=http://archive.cloudera.com/redhat/cdh/3/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera 
gpgcheck = 1


Now you can search for and install Hadoop packages via yum:
$ yum search hadoop-0.20 
$ yum install hadoop-0.20
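If you want to confirm that yum picked up the new repository before installing anything, a quick check like this should be enough (the exact repo label shown can differ slightly):
# Confirm the cloudera-cdh3 repository is visible to yum
$ sudo yum repolist | grep -i cloudera
# Optionally import the GPG key up front so the first install doesn't prompt for it
$ sudo rpm --import http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera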


2. Install components
The Hadoop installation is split into several daemon packages:
  • namenode
  • datanode
  • secondarynamenode
  • jobtracker
  • tasktracker
You need to install the namenode and jobtracker daemons on the namenode machine, and the datanode and tasktracker daemons on each datanode machine.


You can execute yum like this:
$ yum install hadoop-0.20-<daemon type>


* Before starting the installation, it's good to create a dedicated user to control Hadoop, for security reasons. I created a user named "huser" and gave it the privileges needed for Hadoop-related jobs.
# Add a user
$ useradd huser


# Allow members of the "huser" group to run superuser commands.
$ vi /etc/sudoers 
%huser ALL=(ALL) NOPASSWD: ALL
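To make sure the sudoers entry actually took effect, switch to the new account and try a harmless privileged command; sudo should not ask for a password:
# Verify passwordless sudo for huser
$ su - huser
$ sudo whoami
root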


# Check /etc/hosts (never edit the first line)
127.0.0.1 localhost.localdomain localhost
XXX.XXX.XXX.171 name01.hadoop.com name01
XXX.XXX.XXX.172 name02.hadoop.com name02
XXX.XXX.XXX.173 node01.hadoop.com node01
XXX.XXX.XXX.174 node02.hadoop.com node02
XXX.XXX.XXX.175 node03.hadoop.com node03
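It's also worth checking that every node can resolve the others by hostname before going further; the same /etc/hosts entries should exist on all five machines:
# Quick resolution check (run on each node)
$ getent hosts name01 name02 node01 node02 node03
$ ping -c 1 name01.hadoop.com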


# Generate an SSH key on the master node and on each slave node
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (~/.ssh/id_rsa):
Creating directory '~/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your public key has been saved in ~/.ssh/id_rsa.pub.
The key fingerprint is: 


$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys


# Copy the SSH key of the master node to all slaves (here, that includes the secondary namenode and the datanodes)
$ scp ~/.ssh/id_rsa.pub <slave hostname>:~/.ssh/authorized_keys
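This scp overwrites any existing authorized_keys on the slave, so append instead if one is already there. In either case, the .ssh directory and key file need strict permissions, and a login test from the master confirms everything works (node01 here is just one of the slaves listed in /etc/hosts):
# On each slave: tighten permissions so sshd accepts the key
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
# From the master: this should print the slave's hostname without asking for a password
$ ssh huser@node01 hostname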


2.1) On the namenode:
$ sudo yum install hadoop-0.20
$ sudo yum install hadoop-0.20-namenode
$ sudo yum install hadoop-0.20-jobtracker


2.2) On each datanode:
$ sudo yum install hadoop-0.20
$ sudo yum install hadoop-0.20-datanode
$ sudo yum install hadoop-0.20-tasktracker


2.3) On the secondary namenode:
$ sudo yum install hadoop-0.20


Originally I meant to install the secondarynamenode daemon package on this node as well, but it worked fine without that module.
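To double-check which Hadoop packages actually ended up on a given machine, listing them with rpm is a quick sanity check:
# List the installed Hadoop packages on this node
$ rpm -qa | grep hadoop-0.20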


After installation, the relevant directories are as follows:
  • Hadoop home: /usr/lib/hadoop-0.20
  • JDK home: /usr/java/jdk1.6.0_29
You also need to change the owner of the Hadoop-related directories to the huser account:
$ sudo chown -R huser:huser /usr/lib/hadoop-0.20
$ sudo chown -R huser:huser /etc/hadoop-0.20/conf
$ sudo chown -R huser:huser /var/log/hadoop-0.20
$ sudo chown -R huser:huser /var/run/hadoop-0.20


You should also create a hadoop directory for HDFS and MapReduce data:
$ mkdir /hadoop
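Since /hadoop sits directly under the filesystem root, creating it normally requires root privileges, and it has to exist on every node with huser as the owner so the daemons can write to it. Something along these lines on each machine should do (the paths match the config values used in section 3):
# On every node: create the Hadoop data directory and hand it over to huser
$ sudo mkdir -p /hadoop
$ sudo chown -R huser:huser /hadoop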


3. Hadoop configuration
Go to the /usr/lib/hadoop-0.20/conf directory and modify the Hadoop config files.


3.1) hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_29
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
export HADOOP_PID_DIR=${HADOOP_HOME}/pids


3.2) core-site.xml
<property>
    <name>fs.default.name</name>
    <value>hdfs://name01.hadoop.com:9000</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp</value>
</property>
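Note that the property snippets in sections 3.2 to 3.4 all belong inside the <configuration> element of their respective files; they are not complete files on their own. As a sketch, my whole core-site.xml looks roughly like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://name01.hadoop.com:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop/tmp</value>
    </property>
</configuration>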

3.3) hdfs-site.xml
<property>
    <name>dfs.name.dir</name>
    <value>/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

3.4) mapred-site.xml
<property>
    <name>mapred.job.tracker</name>
    <value>name01.hadoop.com:9001</value>
</property>
<property>
    <name>mapred.local.dir</name>
    <value>/hadoop/mapred/local</value>
</property>
<property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
</property>

3.5) slaves
node01
node02
node03


3.6) masters (despite the name, this file lists the host that runs the secondarynamenode)
name02
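These config files have to be identical on every node, so I edit them once on the master and copy them out. A simple loop like this works, assuming huser owns the conf directory on each node as set up in section 2:
# Push the configuration from the master node to all other nodes
$ for host in name02 node01 node02 node03; do scp /usr/lib/hadoop-0.20/conf/* huser@$host:/usr/lib/hadoop-0.20/conf/; done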



4. Format the namenode
Run this on the master namenode as huser:
$ hadoop namenode -format
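If the format succeeded, the directory set as dfs.name.dir should now contain the namenode metadata; a quick listing confirms it (the exact file names vary a little between versions):
# The freshly formatted name directory should exist and be non-empty
$ ls /hadoop/dfs/name/current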



5. Start hadoop on the master namenode
Go to /usr/lib/hadoop-0.20/bin and then execute ./start-all.sh
$ ./start-all.sh
starting namenode, logging to .....
node01: starting datanode, logging to ......
node03: starting datanode, logging to ......
node02: starting datanode, logging to ......
name02: starting secondarynamenode, logging to .....
starting jobtracker, logging to .....
node01: starting tasktracker, logging to ....
node02: starting tasktracker, logging to ....
node03: starting tasktracker, logging to ....



# View MapReduce jobs (JobTracker) in a web browser
http://<ip address>:50030
# View HDFS (NameNode) in a web browser
http://<ip address>:50070
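To confirm the cluster is really up, I check the running Java daemons and try a small HDFS operation from the master as huser (the /smoketest path is just an arbitrary test directory):
# jps ships with the JDK and lists the running Java daemons on this node
$ jps
# Basic HDFS smoke test: create a directory, list the root, and check cluster status
$ hadoop fs -mkdir /smoketest
$ hadoop fs -ls /
$ hadoop dfsadmin -report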





2 comments:

  1. You said, "It should create hadoop directory for hdfs and MapReduce."

    What is "it?"

  2. It meant that a directory named "hadoop" needed to be created.

    Regards,
    Yeonki
