Building and deploying a Hadoop distributed cluster using Docker
I searched for a long time but could not find a document on using Docker to build a Hadoop distributed cluster, so there was no choice but to write one myself.
One: Environment preparation
1. First, you need a CentOS 7 operating system; it can be installed in a virtual machine.
2. Install Docker in CentOS 7. The Docker version used here is 1.8.2.
The installation steps are as follows:
<1> Install the specified version of Docker
yum install -y docker-1.8.2-10.el7.centos
<2> An error may occur during installation; you need to remove this dependency first:
rpm -e lvm2-7:2.02.105-14.el7.x86_64
Start docker
service docker start
Verify the installation result:
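For example, a simple check (assuming the service was started as above) is:
docker -v
service docker status
The first command should print the installed version (1.8.2 here), and the second should show that the docker service is running.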
<3> After Docker starts, running docker info shows the following two warning lines.
You need to disable the firewall and then restart the system:
systemctl stop firewalld
systemctl disable firewalld
# Note: after executing the commands above, you need to restart the system
reboot -h
<4> Running a container may report an error.
You need to disable SELinux.
Solution:
1. setenforce 0 (takes effect immediately, without restarting the operating system)
2. Edit /etc/selinux/config, set SELINUX=disabled, and then restart the system for the change to take effect.
It is recommended to do both steps so that SELinux is still disabled after the system restarts.
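As a minimal sketch of both steps (the sed pattern assumes the default SELINUX=enforcing line in /etc/selinux/config):
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# restart for the config change to take effect
reboot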
3. Build a base image for Hadoop, using a Dockerfile for the construction.
First build an image with SSH support for later use. (Note that this may have an impact on the security of the container.)
Note: The password for the root user in this image is root
mkdir centos-ssh-root
cd centos-ssh-root
vi Dockerfile
# Choose an existing OS image as the base
FROM centos
# The author of the image
MAINTAINER crxy
# Install the openssh-server and sudo packages, and set sshd's UsePAM parameter to no
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
# Install openssh-clients
RUN yum install -y openssh-clients
# Add test user root with password root, and add this user to sudoers
RUN echo "root:root" | chpasswd
RUN echo "root   ALL=(ALL)       ALL" >> /etc/sudoers
# The following two statements are special; in CentOS 6 they must exist, otherwise sshd in the container cannot be logged into
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
# Start the sshd service and expose port 22
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
Build command:
docker build -t=crxy/centos-ssh-root .
Check that the image was built successfully.
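For example (crxy/centos-ssh-root is the tag given in the build command above):
docker images | grep crxy/centos-ssh-root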
4. Build a new image with the JDK, based on the image above
Note: JDK version 1.7 is used.
mkdir centos-ssh-root-jdk
cd centos-ssh-root-jdk
cp ../jdk-7u75-linux-x64.tar.gz .
vi Dockerfile
FROM crxy/centos-ssh-root
ADD jdk-7u75-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.7.0_75 /usr/local/jdk1.7
ENV JAVA_HOME /usr/local/jdk1.7
ENV PATH $JAVA_HOME/bin:$PATH
Build command:
docker build -t=crxy/centos-ssh-root-jdk .
Check that the image was built successfully.
5. Build a new image with Hadoop, based on the JDK image
Note: Hadoop version 2.4.1 is used.
mkdir centos-ssh-root-jdk-hadoop
cd centos-ssh-root-jdk-hadoop
cp ../hadoop-2.4.1.tar.gz .
vi Dockerfile
FROM crxy/centos-ssh-root-jdk
ADD hadoop-2.4.1.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.4.1 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH
Build command:
docker build -t=crxy/centos-ssh-root-jdk-hadoop .
Check that the image was built successfully.
Two: Build the Hadoop distributed cluster
1. Cluster planning
Prepare to build a cluster with three nodes: one master and two slaves.
Master node: hadoop0, IP: 192.168.2.10
Slave node 1: hadoop1, IP: 192.168.2.11
Slave node 2: hadoop2, IP: 192.168.2.12
However, since the IP of a Docker container changes after it is restarted, we need to assign fixed IPs. pipework is used to set a fixed IP for each Docker container.
2. Start three containers: hadoop0, hadoop1, and hadoop2
Execute the following commands on the host to set the hostname and container name for each container, and to map ports 50070 and 8088 of hadoop0 to the host:
docker run --name hadoop0 --hostname hadoop0 -d -P -p 50070:50070 -p 8088:8088 crxy/centos-ssh-root-jdk-hadoop
docker run --name hadoop1 --hostname hadoop1 -d -P crxy/centos-ssh-root-jdk-hadoop
docker run --name hadoop2 --hostname hadoop2 -d -P crxy/centos-ssh-root-jdk-hadoop
Use docker ps to view the three containers that were just started.
3. Set a fixed IP for these three containers
1. pipework download address: https://github.com/jpetazzo/pipework.git
2. Upload the downloaded zip package to the host server, unzip it, and rename it:
unzip pipework-master.zip
mv pipework-master pipework
cp -rp pipework/pipework /usr/local/bin/
3. Install bridge-utils
yum -y install bridge-utils
4. Create a network
brctl addbr br0
ip link set dev br0 up
ip addr add 192.168.2.1/24 dev br0
5. Set a fixed IP for each container
pipework br0 hadoop0 192.168.2.10/24
pipework br0 hadoop1 192.168.2.11/24
pipework br0 hadoop2 192.168.2.12/24
To verify, ping each of the three IPs; if they all respond, the setup is working.
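For example, run the following on the host; all three addresses should respond:
ping -c 3 192.168.2.10
ping -c 3 192.168.2.11
ping -c 3 192.168.2.12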
4. Set up the Hadoop cluster
First connect to hadoop0, using the command:
docker exec -it hadoop0 /bin/bash
The following steps describe the configuration of the Hadoop cluster.
1. Set up the hostname-to-IP mapping. In all three containers, edit /etc/hosts:
Add the following configuration:
192.168.2.10    hadoop0
192.168.2.11    hadoop1
192.168.2.12    hadoop2
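If you prefer not to edit each container by hand, a sketch like the following (run on the host, using the container names above; it simply appends the same three lines) does the same thing:
for c in hadoop0 hadoop1 hadoop2; do
  docker exec $c bash -c 'cat >> /etc/hosts <<EOF
192.168.2.10 hadoop0
192.168.2.11 hadoop1
192.168.2.12 hadoop2
EOF'
done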
2. Set up SSH passwordless login
Execute the following operations on hadoop0
cd ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2
Execute the following operations on hadoop1:
cd ~
cd .ssh
ssh-keygen -t rsa (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1
Execute the following operations on hadoop2:
cd ~
cd .ssh
ssh-keygen -t rsa (keep pressing Enter)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
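To confirm that passwordless login works, a quick check from hadoop0 (assuming the keys were copied as above) is:
ssh hadoop1 hostname
ssh hadoop2 hostname
Each command should print the corresponding hostname without prompting for a password.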
3. Modify the Hadoop configuration files on hadoop0
Go to the /usr/local/hadoop/etc/hadoop directory.
Modify hadoop-env.sh and the configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml in this directory.
(1)hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.7
(2)core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
</configuration>
(3)hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
(4)yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
(5)Rename the file: mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
(6)Format
Go to the /usr/local/hadoop directory
and execute the formatting command:
bin/hdfs namenode -format
Note: An error will be reported during execution because the 'which' command is missing. Installation is required
Execute the following command to install
yum install -y which
Output showing that the NameNode storage directory has been successfully formatted means the formatting succeeded.
The formatting operation should not be executed repeatedly. If you really need to reformat, add the -force parameter.
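For reference, the forced re-format (which discards the existing NameNode metadata) would look like this:
bin/hdfs namenode -format -force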
(7)Start pseudo-distributed hadoop
Command:
sbin/start-all.sh
You need to enter 'yes' to confirm during the first start-up process.
Use jps to check whether the processes started normally. If you see the following processes, the pseudo-distributed mode has started successfully:
[root@hadoop0 hadoop]# jps
3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager
(8)Stop pseudo-distributed hadoop
Command:
sbin/stop-all.sh
(9) Specify the ResourceManager address (so the NodeManagers know where to connect); modify the file yarn-site.xml and add:
<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
</property>
(10) On hadoop0, modify the Hadoop configuration file etc/hadoop/slaves
Delete all the original content and replace it with the following:
hadoop1
hadoop2
(11) Execute the following commands on hadoop0 to copy Hadoop to the slave nodes:
scp -rq /usr/local/hadoop hadoop1:/usr/local
scp -rq /usr/local/hadoop hadoop2:/usr/local
(12) Start the Hadoop distributed cluster service
Execute sbin/start-all.sh
Note: An error will be reported during execution because the two slave nodes are missing the which command. Just install it
Execute the following command on each of the two slave nodes to install
yum install -y which
Then start the cluster (if the cluster is already running, it needs to be stopped first)
sbin/start-all.sh
(13) Verify whether the cluster is normal
First, check the process:
The following processes must be present on Hadoop0
[root@hadoop0 hadoop]# jps
4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager
The following processes must be present on hadoop1:
[root@hadoop1 hadoop]# jps
715 NodeManager
849 Jps
645 DataNode
The following processes must be present on hadoop2:
[root@hadoop2 hadoop]# jps
456 NodeManager
589 Jps
388 DataNode
Use the program to verify the cluster service
Create a local file
vi a.txt
hello you
hello me
Upload a.txt to HDFS
hdfs dfs -put a.txt /
Execute the wordcount program
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /a.txt /out
Check the program execution result
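For example (the /out directory is the one given in the wordcount command above, and part-r-00000 is the usual name of the reducer output file):
hdfs dfs -cat /out/part-r-00000
For the sample a.txt above, the expected counts are hello 2, me 1, you 1.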
If the result looks correct, the cluster is working normally.
Access the cluster service through the browser
Because ports 50070 and 8088 were mapped to the corresponding ports on the host machine when the hadoop0 container was started:
adb9eba7142b crxy/centos-ssh-root-jdk-hadoop "/usr/sbin/sshd -D" About an hour ago Up About an hour 0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32770->22/tcp hadoop0
Therefore, the Hadoop cluster services inside the container can be accessed directly through the host machine.
The IP of the host machine is 192.168.1.144
http://192.168.1.144:50070/
http://192.168.1.144:8088/
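If a browser is not handy, a quick command-line check from the host (using the same addresses) is:
curl -I http://192.168.1.144:50070/
curl -I http://192.168.1.144:8088/
Each request should return an HTTP response header from the corresponding web UI.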
Three: Restart cluster nodes
Stop the three containers by executing the following commands on the host machine:
docker stop hadoop0
docker stop hadoop1
docker stop hadoop2
After the containers are stopped, the fixed IPs set earlier disappear. When you use these containers again, the fixed IPs need to be set again.
First, start the three containers that were stopped earlier
docker start hadoop0
docker start hadoop1
docker start hadoop2
Execute the following command on the host machine to set a fixed ip for the container again
pipework br0 hadoop0 192.168.2.10/24
pipework br0 hadoop1 192.168.2.11/24
pipework br0 hadoop2 192.168.2.12/24
We also still need to reconfigure the hostname-to-IP mapping inside the containers, and it is troublesome to write it by hand each time.
So write a script, runhosts.sh:
#!/bin/bash
echo 192.168.2.10 hadoop0 >> /etc/hosts
echo 192.168.2.11 hadoop1 >> /etc/hosts
echo 192.168.2.12 hadoop2 >> /etc/hosts
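Note that this script appends the entries every time it is run. A variant (a sketch, not part of the original article) that skips lines already present in /etc/hosts could look like this:
#!/bin/bash
for entry in "192.168.2.10 hadoop0" "192.168.2.11 hadoop1" "192.168.2.12 hadoop2"; do
  grep -q "$entry" /etc/hosts || echo "$entry" >> /etc/hosts
done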
Add execute permission:
chmod +x runhosts.sh
Copy this script to all nodes and execute it on each node.
scp runhosts.sh hadoop1:~
scp runhosts.sh hadoop2:~
Command to execute the script
./runhosts.sh
Check /etc/hosts to confirm that the entries were added successfully.
Note: some Docker versions do not automatically generate the following mappings in the hosts file, which is why we manually set a fixed IP for each container and configure the hostname-to-IP mapping ourselves.
172.17.0.25 hadoop0
172.17.0.25 hadoop0.bridge
172.17.0.26 hadoop1
172.17.0.26 hadoop1.bridge
172.17.0.27 hadoop2
172.17.0.27 hadoop2.bridge
Start hadoop cluster
sbin/start-all.sh
That's all for this article. I hope it is helpful for your learning, and I also hope everyone will support the Yell Tutorial.