
Detailed Explanation of Using Docker to Build a Hadoop Distributed Cluster

Building and deploying a Hadoop distributed cluster using Docker

I searched for a long time and still could not find a document on using Docker to build a Hadoop distributed cluster, so I had no choice but to write one myself.

One: Environment preparation

1. First, you need a CentOS 7 operating system; it can be installed in a virtual machine.

2. Install Docker in CentOS 7. The Docker version used here is 1.8.2.

The installation steps are as follows:

<1> Install the specified version of Docker:

yum install -y docker-1.8.2-10.el7.centos 

<2> An error may occur during installation; if so, remove this conflicting dependency:

rpm -e lvm2-7:2.02.105-14.el7.x86_64

Start docker

service docker start

Verify the installation result:

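For example, you can check the client version and the daemon status with:

docker -v
docker info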

<3> After starting, executing docker info will show two warning lines about the firewall.

You need to disable the firewall and restart the system:

systemctl stop firewalld
systemctl disable firewalld
#Note: After executing the above command, you need to restart the system
reboot  # restart the system

<4> Running a container may report an error.

You need to disable SELinux.

Solution:

1. setenforce 0 (takes effect immediately, no restart of the operating system required)

2. Edit /etc/selinux/config, set SELINUX to disabled, and then restart the system for the change to take effect.

It is recommended to do both, so that SELinux stays disabled after the system restarts.
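
For the second step, one non-interactive way to make the edit (assuming the stock /etc/selinux/config layout) is:

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config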

3. Next, build the base images for Hadoop using Dockerfiles.

First, build an image with SSH enabled for later use. (Note that enabling SSH in a container may have an impact on its security.)

Note: The password for the root user in this image is root

mkdir centos-ssh-root
cd centos-ssh-root
vi Dockerfile
# Choose an existing OS image as the base 
FROM centos 
# The author of the image 
MAINTAINER crxy 
# Install openssh-server and sudo software packages, and set the sshd's UsePAM parameter to no 
RUN yum install -y openssh-server sudo 
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config 
# Install openssh-clients
RUN yum install -y openssh-clients
# Add test user root, password root, and add this user to sudoers 
RUN echo "root:root" | chpasswd 
RUN echo "root  ALL=(ALL)    ALL" >> /etc/sudoers 
# The following two statements are special; on CentOS 6 they must be present, otherwise sshd in the container will not accept logins
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key 
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key 
# Start the sshd service and expose port 22
RUN mkdir /var/run/sshd 
EXPOSE 22 
CMD ["/usr/sbin/sshd", "-D"]

Build command:

docker build -t="crxy/centos-ssh-root" .

Query the image that was just built:

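For example, listing the local images should now show crxy/centos-ssh-root:

docker images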

4. Build a new image with the JDK, based on the image above.
Note: JDK version 1.7 is used.

mkdir centos-ssh-root-jdk
cd centos-ssh-root-jdk
cp ../jdk-7u75-linux-x64.tar.gz .
vi Dockerfile
FROM crxy/centos-ssh-root
ADD jdk-7u75-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.7.0_75 /usr/local/jdk1.7
ENV JAVA_HOME /usr/local/jdk1.7
ENV PATH $JAVA_HOME/bin:$PATH

Build command:

docker build -t="crxy/centos-ssh-root-jdk" .

Query the successfully built image


5. Build a new image with Hadoop, based on the JDK image.

Note: Hadoop version 2.4.1 is used.

mkdir centos-ssh-root-jdk-hadoop
cd centos-ssh-root-jdk-hadoop
cp ../hadoop-2.4.1.tar.gz .
vi Dockerfile
FROM crxy/centos-ssh-root-jdk
ADD hadoop-2.4.1.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.4.1 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

Build command:

docker build -t="crxy/centos-ssh-root-jdk-hadoop" .

Query the successfully built image
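
As an optional sanity check (not part of the original steps), you can also run hadoop version in a throwaway container built from this image; it should report Hadoop 2.4.1:

docker run --rm crxy/centos-ssh-root-jdk-hadoop hadoop version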

Second: Build the Hadoop distributed cluster

1. Cluster planning

Prepare to build a cluster with three nodes: one master and two slaves.

Master node: hadoop0, ip: 192.168.2.10

Slave node 1: hadoop1, ip: 192.168.2.11

Slave node 2: hadoop2, ip: 192.168.2.12

Since a Docker container's IP changes after it is restarted, we need to set fixed IPs. Here, pipework is used to assign a fixed IP to each container.

2. Start three containers, named hadoop0, hadoop1, and hadoop2

Execute the following commands on the host to set each container's hostname and container name, and to map hadoop0's ports 50070 and 8088 to the host:

docker run --name hadoop0 --hostname hadoop0 -d -P -p 50070:50070 -p 8088:8088 crxy/centos-ssh-root-jdk-hadoop
docker run --name hadoop1 --hostname hadoop1 -d -P crxy/centos-ssh-root-jdk-hadoop
docker run --name hadoop2 --hostname hadoop2 -d -P crxy/centos-ssh-root-jdk-hadoop

Use docker ps to view the three containers that were just started.

3. Set a fixed IP for these three containers

1. Download pipework from: https://github.com/jpetazzo/pipework.git

2. Upload the downloaded zip package to the host, unzip it, and rename it:

unzip pipework-master.zip
mv pipework-master pipework
cp -rp pipework/pipework /usr/local/bin/ 

3. Install bridge-utils:

yum -y install bridge-utils

4. Create a bridge network:

brctl addbr br0
ip link set dev br0 up
ip addr add 192.168.2.1/24 dev br0

5. Set a fixed IP for each container:

pipework br0 hadoop0 192.168.2.10/24
pipework br0 hadoop1 192.168.2.11/24
pipework br0 hadoop2 192.168.2.12/24

Verify by pinging the three IPs; if all of them respond, everything is fine.
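
For example, from the host:

ping -c 3 192.168.2.10
ping -c 3 192.168.2.11
ping -c 3 192.168.2.12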

4. Set up the Hadoop cluster

First, connect to hadoop0 with the following command:

docker exec -it hadoop0 /bin/bash

The following steps walk through the configuration of the Hadoop cluster.

1. Set up the hostname-to-IP mapping; edit /etc/hosts in all three containers: vi /etc/hosts

Add the following configuration

192.168.2.10  hadoop0
192.168.2.11  hadoop1
192.168.2.12  hadoop2

2. Set up SSH passwordless login

Execute the following operations on hadoop0:

cd ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa    (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2

On hadoop1, execute the following operations:

cd ~
cd .ssh
ssh-keygen -t rsa    (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1

On hadoop2, execute the following operations:

cd ~
cd .ssh
ssh-keygen -t rsa    (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
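
To confirm that passwordless login works, a quick optional check from hadoop0 is to run a command on each slave over SSH; it should complete without asking for a password:

ssh hadoop1 hostname
ssh hadoop2 hostname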

3. On hadoop0, modify the Hadoop configuration files

Enter the /usr/local/hadoop/etc/hadoop directory.

Modify the configuration files in this directory: hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.

(1)hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.7

(2)core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop0:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
     <property>
         <name>fs.trash.interval</name>
         <value>1440</value>
    </property>
</configuration>

(3)hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

(4)yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property> 
        <name>yarn.log-aggregation-enable</name> 
        <value>true</value> 
    </property>
</configuration>

(5)Rename the file: mv mapred-site.xml.template mapred-site.xml

vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

(6)Format

Enter the /usr/local/hadoop directory and execute the formatting command:

bin/hdfs namenode -format

Note: An error will be reported during execution because the which command is missing; it needs to be installed.

Execute the following command to install

yum install -y which

If the output contains a message saying the storage directory has been successfully formatted, the formatting succeeded.

The formatting operation should not be run repeatedly. If you do need to reformat, the -force parameter can be used.
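
For example, to reformat non-interactively (use with care, as this wipes the existing HDFS metadata):

bin/hdfs namenode -format -force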

(7)Start pseudo-distributed hadoop

Command:

sbin/start-all.sh

You need to enter 'yes' to confirm during the first start-up process.

Use jps to check whether the processes started normally. If you see the following processes, pseudo-distributed mode has started successfully:

[root@hadoop0 hadoop]# jps
3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager

(8)Stop pseudo-distributed hadoop

Command:

sbin/stop-all.sh

(9)Specify the ResourceManager address, so the NodeManagers know where to register; modify the file yarn-site.xml and add:

<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
 </property>

(10) On hadoop0, modify the Hadoop configuration file etc/hadoop/slaves

Delete the original content and replace it with the following:

hadoop1
hadoop2

(11) Execute the following commands on hadoop0 to copy the configured Hadoop directory to the slave nodes:

 scp -rq /usr/local/hadoop  hadoop1:/usr/local
 scp -rq /usr/local/hadoop  hadoop2:/usr/local

(12) Start the Hadoop distributed cluster service

Execute sbin/start-all.sh

Note: An error will be reported during execution because the which command is missing on the two slave nodes; just install it.

Execute the following command on each of the two slave nodes to install

yum install -y which

Then start the cluster (if the cluster is already running, it needs to be stopped first)

sbin/start-all.sh

(13) Verify whether the cluster is normal

First, check the process:

The following processes must be present on hadoop0

[root@hadoop0 hadoop]# jps
4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager

The following processes must be present on hadoop1

[root@hadoop1 hadoop]# jps
715 NodeManager
849 Jps
645 DataNode

The following processes must be present on hadoop2

[root@hadoop2 hadoop]# jps
456 NodeManager
589 Jps
388 DataNode

Use a program to verify the cluster service

Create a local file

vi a.txt
hello you
hello me

Upload a.txt to HDFS

hdfs dfs -put a.txt /

Execute the wordcount program

cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /a.txt /out

Check the program execution result

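For example, assuming the default single output file produced by the wordcount job, the result can be viewed like this:

hdfs dfs -cat /out/part-r-00000
# expected output for the a.txt above:
# hello   2
# me      1
# you     1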

This means that the cluster is normal.

Access the cluster services through a browser
Because ports 50070 and 8088 were mapped to the corresponding host ports when the hadoop0 container was started:

adb9eba7142b                  crxy/centos-ssh-root-jdk-hadoop     "/usr/sbin/sshd -D"     About an hour ago     Up About an hour      0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32770->22/tcp     hadoop0

Therefore, the Hadoop services inside the containers can be accessed directly through the host machine.
The IP of the host machine is 192.168.1.144:

http://192.168.1.144:50070/

http://192.168.1.144:8088/
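
If you prefer to check from a terminal first, a simple HTTP probe against the mapped ports should print response headers from the two web UIs (replace the IP with your own host's address):

curl -sI http://192.168.1.144:50070/
curl -sI http://192.168.1.144:8088/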

Three: Restart cluster nodes

Stop the three containers by executing the following commands on the host machine:

docker stop hadoop0
docker stop hadoop1
docker stop hadoop2

After the containers are stopped, the fixed IPs set earlier will also disappear. When using these containers again, the fixed IPs need to be set again.
First, start the three containers that were stopped earlier:

docker start hadoop0
docker start hadoop1
docker start hadoop2

Execute the following commands on the host machine to set fixed IPs for the containers again:

pipework br0 hadoop0 192.168.2.10/24
pipework br0 hadoop1 192.168.2.11/24
pipework br0 hadoop2 192.168.2.12/24

We also need to reconfigure the hostname-to-IP mapping inside the containers, and it is tedious to write it by hand every time.

Write a script, runhosts.sh

#!/bin/bash
echo 192.168.2.10    hadoop0 >> /etc/hosts
echo 192.168.2.11    hadoop1 >> /etc/hosts
echo 192.168.2.12    hadoop2 >> /etc/hosts
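
A slightly more defensive variant of the same script (an optional tweak) appends each entry only if it is not already present, so running it repeatedly does not create duplicate lines:

#!/bin/bash
# append each hostname mapping only if it is missing from /etc/hosts
for entry in "192.168.2.10    hadoop0" "192.168.2.11    hadoop1" "192.168.2.12    hadoop2"; do
    grep -q "$entry" /etc/hosts || echo "$entry" >> /etc/hosts
done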

Add execute permission:

chmod +x runhosts.sh 

Copy this script to all nodes and run it on each node:

scp runhosts.sh hadoop1:~
scp runhosts.sh hadoop2:~

Execute the script on each node:

./runhosts.sh
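
Because passwordless SSH from hadoop0 was configured earlier, the script can also be run on the slave nodes remotely from hadoop0, for example:

ssh hadoop1 'bash ~/runhosts.sh'
ssh hadoop2 'bash ~/runhosts.sh'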

Check the /etc/hosts file to see whether the entries were added successfully.

Note: Some Docker versions do not automatically generate mappings like the following in the hosts file, which is why we manually set fixed IPs for the containers here and configure the hostname-to-IP mapping ourselves.

172.17.0.25     hadoop0
172.17.0.25     hadoop0.bridge
172.17.0.26     hadoop1
172.17.0.26     hadoop1.bridge
172.17.0.27     hadoop2
172.17.0.27     hadoop2.bridge

Start hadoop cluster

sbin/start-all.sh

That's all for this article. I hope it will be helpful to everyone's learning, and I also hope everyone will support the Yell Tutorial.

