If you are a Hadoop novice, I strongly suggest you begin your study by building a single node; you can learn from this website. After you have finished building a single node, you can read my blog to learn how to run an N-node cluster on just your computer.
This blog introduces how to use one computer to build an N-node cluster. I suggest you use ubuntu to build it. You can also use Windows, but you had better install virtualbox and run a desktop ubuntu inside it as your base server. In this blog, we will try two different ways to build a hadoop cluster on one computer.
Before you start, you can download the required software from the Internet.
- JDK
We can also install it with the apt tool, but that may be slow in China, so you had better download it from the website. Choose “jdk-8u181-linux-x64.tar.gz” to download. You can also install it on your master machine later; you can read about that in this blog.
- Hadoop
I choose the latest 2.8.5 version; you can download it from this website.
- Ubuntu Image
In this trip, we choose Ubuntu 16.04 server to build the cluster. You can use the 163 mirrors to speed up your download.
- VirtualBox
We use virtualbox to create our cluster machines. It's easy to install virtualbox in ubuntu; you can read this article to install virtualbox-5.2.
- Docker
We will also try using docker to build our cluster; it's easy to install in ubuntu. The install tutorial is https://docs.docker.com/install/linux/docker-ce/ubuntu/#uninstall-old-versions
Now I assume you're working on a ubuntu 16.04 desktop OS. Let's begin our trip.
First, let's initialize a master. After we install the required software on the master, we can use the virtualbox clone function to easily build the slaves.
- Create a new machine named master
- Choose 2G RAM, VDI
Then run this machine and load the iso file you downloaded. Pay attention: make sure you install the ssh server (or you can install it after the OS install, see the sketch below).
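If you skipped that installer option, here is a minimal sketch of installing and checking the SSH server afterwards (assuming a standard Ubuntu 16.04 setup):

```
# install the OpenSSH server and confirm the daemon is running
sudo apt-get update && sudo apt-get install -y openssh-server
sudo service ssh status   # should report the service as running
```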
Before we start installing hadoop and the java SDK, let me tell you something about the network requirements.
For our cluster to run, we need a working network between the master and the slaves. If we have many computers, it's simple: they just need public IPs, or private IPs in one LAN. But if we are on one computer, how can we have an independent IP for our master and each slave?
This is why we install virtualbox: it provides independent computers inside just one computer. Moreover, it can provide a simulated NIC for each machine, so each one can have its own private IP in the LAN.
So the key to running our cluster is the network mode: we need to choose the bridged adapter so that the master and slaves sit in the same LAN. Pay attention: make sure you choose your real NIC. In ubuntu, just run `ifconfig` and find the interface that has a line like `inet addr:192.168.1.12`. Usually it's `eth0` in ubuntu.
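For example, this one-liner lists only the interfaces that already have an IPv4 address, which makes the right NIC easy to spot (names and addresses will differ on your machine):

```
# show each interface name together with its IPv4 line
ifconfig | grep -B1 'inet addr'
```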
When you have finished the OS installation, you can log in and start installing the hadoop cluster.
Step 1. Configure Static IP
In your virtual machine, your IP can change on every reboot, because ubuntu uses DHCP to get an IP from the gateway. We need to make sure our master and slaves have fixed IPs so their connection survives reboots.
To do this, first make sure your installation is OK: try `ping baidu.com` to check whether you are connected to the Internet. Then we need to know our gateway address: run `route` in the shell and you will see a table; in the `Gateway` column you can find one or more IPs like `192.168.0.1`, and this is your gateway. Now open the network settings file `/etc/network/interfaces`; you will see something like this:
```
auto eth0
iface eth0 inet dhcp
```
`eth0` is your NIC (yours may be different), and `dhcp` means we get the IP dynamically. Now we need to change it to static:
```
auto eth0
iface eth0 inet static
address 192.168.0.105
netmask 255.255.255.0
gateway 192.168.0.1
```
PS: make sure you change the `gateway` IP to yours. The `address` IP must be inside the gateway's subnet as defined by the netmask; e.g. you can't set your IP address to 10.1.1.1 if your gateway is 192.168.0.1. The easiest way is to keep the values DHCP gave you and just change the last number. If you still can't connect to the Internet, try adding one line: `dns-nameservers 220.127.116.11`.
```
ifdown eth0
ifup eth0
```
Now run the commands above in your VM (`eth0` should be your NIC name). If you run `ifconfig` again, you can see the IP address has changed to `192.168.0.105`.
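A quick sanity check after the network restart, using the example addresses from above:

```
# confirm the static address stuck, then test gateway, DNS and Internet
ifconfig eth0 | grep 'inet addr'   # expect inet addr:192.168.0.105
ping -c 3 192.168.0.1              # the gateway should answer
ping -c 3 baidu.com                # proves DNS resolution also works
```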
Step 2. Add Hostname Aliases
Because hadoop uses hostnames to identify nodes, we now add hostname-IP pairs to smooth the connection. Open `/etc/hosts` and add the three lines below:
```
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
```
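You can verify that the aliases resolve right away (the slave VMs don't exist yet, so this only tests name resolution, not connectivity):

```
# each alias should print with the IP you assigned in /etc/hosts
getent hosts master slave1 slave2
```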
Step 3. Enable SSH Root Login
Because hadoop needs to log in as root over `SSH`, we need to allow `root` to log in on ubuntu. Open `/etc/ssh/sshd_config`, change the line `PermitRootLogin prohibit-password` to `PermitRootLogin yes`, then run `service ssh restart`.
You also need to use `sudo` to set a password for root:
```
sudo passwd root
```
Now check that you can log in as root over ssh.
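A quick way to test it from the host machine, using the example address from Step 1:

```
# should print "master" without a 'Permission denied' error
ssh root@192.168.0.105 hostname
```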
Step 4. Set Hadoop Env
First, we need to install the `JDK` for hadoop. Go back to your host computer, where we will use `scp` to upload the `JDK` to the VM. Add the lines below to `/etc/hosts` on your host machine:
```
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
```
Then you can easily upload your `JDK` and `Hadoop` to your VM (you need to unpack the tar.gz files first):
```
scp -r /path/your/jdk root@master:/usr/lib/jvm/java-8-oracle
scp -r /path/your/hadoop root@master:/usr/local/hadoop
```
PS: you can also install the `JDK` with the apt tool inside the VM, as mentioned above.
Now we have `JDK` and `Hadoop` installed in our VM. Let's go back to the VM and initialize our Hadoop settings.
- Set JDK Home
Open the `hadoop-env.sh` file (under `/usr/local/hadoop/etc/hadoop/`) and add `export JAVA_HOME=/usr/lib/jvm/java-8-oracle` to tell `Hadoop` where the local JDK lives.
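A sketch of doing both the edit and a sanity check from the shell (the JDK path matches the scp target used above):

```
# append JAVA_HOME to hadoop-env.sh, then make sure the JDK answers
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
/usr/lib/jvm/java-8-oracle/bin/java -version
```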
- Set Core IP
We need a boss to manage all the employees. So edit `core-site.xml` (under `/usr/local/hadoop/etc/hadoop/`) and add a property inside `<configuration>`:
```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000/</value>
</property>
```
Each node in the cluster will send its heartbeat to the master through this address.
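For reference, a minimal sketch of the whole file written from the shell (the path assumes the install location used above; adapt if yours differs):

```
# overwrite core-site.xml with the single required property
cat > /usr/local/hadoop/etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
EOF
```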
- Set HDFS Replication and File Dirs
The foundation of hadoop is HDFS. Open `hdfs-site.xml` and add three properties:
```
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///root/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///root/hdfs/datanode</value>
</property>
```
`dfs.replication` means the number of backup copies of each block. `dfs.datanode.data.dir` is optional; if you don't set it, data is stored under `/tmp` (which is wiped on reboot).
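Since we pointed HDFS at `/root/hdfs`, create those directories now so the namenode and datanode can start cleanly:

```
# directories referenced by dfs.namenode.name.dir and dfs.datanode.data.dir
mkdir -p /root/hdfs/namenode /root/hdfs/datanode
```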
In hadoop2, we use `Yarn` to manage our MapReduce jobs. Run `cp mapred-site.xml.template mapred-site.xml` and then add a property to set `Yarn` as our MapReduce framework:
```
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```
We also need to tell `Yarn` the master of the cluster, and we need to enable the `Shuffle` function so MapReduce works properly. Open `yarn-site.xml` and add two properties:
```
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
```
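Before starting anything, a quick grep confirms each property landed in the right file:

```
# each command should print the matching <name> and <value> lines
grep -A1 fs.defaultFS /usr/local/hadoop/etc/hadoop/core-site.xml
grep -A1 mapreduce.framework.name /usr/local/hadoop/etc/hadoop/mapred-site.xml
grep -A1 yarn.resourcemanager.hostname /usr/local/hadoop/etc/hadoop/yarn-site.xml
```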
Now we have completed the basic `Hadoop` settings, so we can try to run it:
```
cd /usr/local/hadoop/
bin/hadoop namenode -format
sbin/start-dfs.sh
```
We format our namenode and start the dfs server. Now run `jps`; you can see the `NameNode` and `SecondaryNameNode` servers started.
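Roughly, the output should look like this (process IDs will differ on your machine):

```
jps
# 2481 NameNode
# 2694 SecondaryNameNode
# 2833 Jps
```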
Now we start `Yarn` with `sbin/start-yarn.sh`. Run `jps` again and you can see the `ResourceManager` running. You can also try `netstat -tuplen | grep 8088`; you will find the `ResourceManager` listening there, and it opens other tcp ports such as 8031, 8033, etc. Port `8088` serves the cluster management website: you can open http://master:8088 to see the cluster status. For now you will only see an empty node list, for we have not started any slave yet.
Congratulations, our master is running. While starting, we need to type our password several times; after finishing all the slave builds, we can use an ssh key to log in automatically.
Now let’s build our slaves.
Using the virtualbox clone function, we clone `master` to a new VM named `slave1`. Because the clone copies everything to `slave1`, we need to shut down `master`, then go into `slave1` and change its hostname and static IP to make it a real slave.
The first thing we need to do is rename the VM: edit `/etc/hostname` and change it to `slave1`. Then we set `slave1`'s static IP as we did above; just replace the IP with `192.168.0.104`. Then we reboot `slave1`, and start `master` again at the same time.
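A sketch of the whole reconfiguration, run as root inside the cloned VM; the sed pattern assumes the exact `address` line we wrote in Step 1:

```
# rename the host, swap the static IP, then reboot to apply both
echo slave1 > /etc/hostname
sed -i 's/address 192.168.0.105/address 192.168.0.104/' /etc/network/interfaces
reboot
```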
Now let's go to the master to bring up our `slave1`. In our `master` VM, edit the `/usr/local/hadoop/etc/hadoop/slaves` file and add one line: `slave1`. Also make sure you have added the slaves' hostname aliases to `/etc/hosts`.
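Both steps from the master's shell, as a quick sketch:

```
# register the slave, then confirm its alias resolves
echo slave1 >> /usr/local/hadoop/etc/hadoop/slaves
getent hosts slave1    # expect: 192.168.0.104 slave1
```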
Then we try to start our cluster:
```
cd /usr/local/hadoop
bin/hadoop namenode -format
sbin/start-dfs.sh && sbin/start-yarn.sh
```
After running these commands, check http://master:8088 to see that the master has one slave online, named `slave1`.
PS: now you can generate an ssh key for logging in to the slaves. Just run `ssh-keygen -t rsa && ssh-copy-id slave1`, and you won't need to type your password to start the cluster anymore.
Now we have a one-slave cluster; if you want more, you can add more slaves by repeating the procedure above, as shown in the sketch below.
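For example, once `slave2` (the 192.168.0.103 address we reserved in `/etc/hosts`) is cloned the same way, distributing your key to every slave is one small loop:

```
# generate a key once, then copy it to each slave (one last password prompt each)
ssh-keygen -t rsa
for h in slave1 slave2; do
  ssh-copy-id "root@$h"
done
```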
After you build your N-node cluster, you can run the commands below to check whether hadoop is working or not.
```
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt

# create input directory on HDFS
hadoop fs -mkdir -p input

# put input files to HDFS
hdfs dfs -put ./input/* input

# run wordcount
cd /usr/local/hadoop/bin
hadoop jar ../share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-*-sources.jar org.apache.hadoop.examples.WordCount input output

# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt

# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
```
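If everything works, the wordcount output is just a tally of the words in the two input files:

```
Docker	1
Hadoop	1
Hello	2
```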
PS: by the way, if you want to run this cluster for a long time, you can try `vboxmanage` to manage the VMs. You can simply use `vboxmanage startvm master --type headless` to start master in the background (change `master` to another VM name to start the others too).
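A small sketch that starts the whole cluster headless and shuts a VM down gracefully (VM names are the ones created above):

```
# boot every VM in the background
for vm in master slave1 slave2; do
  vboxmanage startvm "$vm" --type headless
done

# later, send a clean ACPI shutdown to one of them
vboxmanage controlvm master acpipowerbutton
```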
The difficulty of building a cluster in virtualbox is knowing how the master and slaves connect to each other; if you set up the network correctly, running the cluster is easy. But there are some problems with virtualbox: we can't easily share our network beyond the host LAN with the virtualbox `bridge`. So next we will show you how to build the cluster in `Docker`, where we can run it as a swarm cluster in a real environment.
Building a cluster is much easier in docker, for docker provides an easy network bridge on a single computer or across a swarm cluster.
We use the `kiwenlau/hadoop:1.0` image for our test (its hadoop version is 2.7). Just run:
sudo docker pull kiwenlau/hadoop:1.0
After a few minutes, we have the hadoop image. Now we need to create our private LAN network; just use the command below (if you want to run a swarm cluster across many computers, just change the driver to `overlay`; powerful, isn't it?).
```
sudo docker network create --driver=bridge hadoop
```
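Optionally, you can confirm the network exists and see which subnet docker picked for it:

```
# prints something like 172.18.0.0/16
sudo docker network inspect hadoop --format '{{(index .IPAM.Config 0).Subnet}}'
```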
Now let's start our master server:
```
sudo docker run -itd \
    --net=hadoop \
    -p 50070:50070 \
    -p 8088:8088 \
    --name hadoop-master \
    --hostname hadoop-master \
    kiwenlau/hadoop:1.0 &> /dev/null
```
In this command, we set the master's hostname to `hadoop-master`, and we needn't edit `/etc/hosts` to add it as we did with virtualbox; docker will do that for us.
Now we start our slaves:
```
sudo docker run -itd \
    --net=hadoop \
    --name hadoop-slave1 \
    --hostname hadoop-slave1 \
    kiwenlau/hadoop:1.0 &> /dev/null

sudo docker run -itd \
    --net=hadoop \
    --name hadoop-slave2 \
    --hostname hadoop-slave2 \
    kiwenlau/hadoop:1.0 &> /dev/null
```
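If you want more slaves, the same pattern scales with a loop. Note this only launches the containers: the `kiwenlau/hadoop` image ships with a fixed slaves list, so extra nodes may also need registering in the master's Hadoop config.

```
# hypothetical scale-out: hadoop-slave3 and hadoop-slave4 follow the same pattern
for i in 3 4; do
  sudo docker run -itd --net=hadoop \
    --name "hadoop-slave$i" --hostname "hadoop-slave$i" \
    kiwenlau/hadoop:1.0 &> /dev/null
done
```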
After that, all the software setup is finished. Now just run `sudo docker exec -it hadoop-master bash` to get into the master, and then start the cluster (the image ships a start script, `./start-hadoop.sh`).
Now you can enjoy your cluster within a few minutes: open http://127.0.0.1:8088/ to see it running happily.
After introducing two ways to build a hadoop cluster, you will find it's easy to build one once you know how the pieces work together. In a word, we rather like using `Docker` to run hadoop clusters: we can add more `Hadoop` slaves with just one command, and we can use an `overlay` network to build a safer hadoop cluster across machines.