How to Install and Configure Apache Hadoop on a Single Node in CentOS 7
Apache Hadoop is an open-source framework built for distributed Big Data storage and processing across computer clusters. The project is based on the following components:
- Hadoop Common – contains the Java libraries and utilities needed by other Hadoop modules.
- HDFS (Hadoop Distributed File System) – a Java-based, scalable file system distributed across multiple nodes.
- MapReduce – a YARN-based framework for parallel processing of large data sets.
- Hadoop YARN – a framework for cluster resource management and job scheduling.
This article will guide you through installing Apache Hadoop on a single node cluster in CentOS 7 (it also works on RHEL 7 and Fedora 23+). This type of configuration is also referred to as Hadoop Pseudo-Distributed Mode.
Step 1: Install Java on CentOS 7
1. Before proceeding with the Java installation, first log in with the root user or a user with root privileges and set up your machine hostname with the following command.
# hostnamectl set-hostname master
Also, add a new record in the /etc/hosts file with your own machine FQDN pointing to your system IP address.
# vi /etc/hosts
Add the below line:
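Assuming, for example, that your machine's IP address is 192.168.1.80 (an illustrative value; use your own) and the FQDN master.hadoop.lan used later in this tutorial, the record looks like:

```
192.168.1.80    master.hadoop.lan    master
```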
Replace the above hostname and FQDN records with your own settings.
2. Next, go to the Oracle Java download page and grab the latest version of the Java SE Development Kit 8 on your system with the help of the curl command:
# curl -LO -H "Cookie: oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.rpm"
3. After the Java binary download finishes, install the package by issuing the below command:
# rpm -Uvh jdk-8u92-linux-x64.rpm
Step 2: Install Hadoop Framework in CentOS 7
4. Next, create a new user account on your system without root privileges, which we'll use for the Hadoop installation path and working environment. The new account home directory will reside in the /opt/hadoop directory.
# useradd -d /opt/hadoop hadoop
# passwd hadoop
5. In the next step, visit the Apache Hadoop page to get the link for the latest stable version and download the archive on your system.
# curl -O http://apache.javapipe.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
6. Extract the archive and copy the directory content to the hadoop account home path. Also, make sure you change the copied files' permissions accordingly.
# tar xfz hadoop-2.7.2.tar.gz
# cp -rf hadoop-2.7.2/* /opt/hadoop/
# chown -R hadoop:hadoop /opt/hadoop/
7. Next, log in with the hadoop user and configure the Hadoop and Java environment variables on your system by editing the .bash_profile file.
# su - hadoop
$ vi .bash_profile
Append the following lines at the end of the file:
## JAVA env variables
## HADOOP env variables
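Under those two headers, a set of exports matching the paths used in this tutorial could look like the following sketch. The Hadoop home is /opt/hadoop as created above; /usr/java/default is an assumption about where the Oracle JDK RPM installs Java, so verify the actual path on your system:

```shell
# JAVA env variables (assuming the Oracle JDK RPM layout)
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin

# HADOOP env variables (Hadoop installed under /opt/hadoop)
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
```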
8. Now, initialize the environment variables and check their status by issuing the below commands:
$ source .bash_profile
$ echo $HADOOP_HOME
$ echo $JAVA_HOME
9. Finally, configure SSH key-based authentication for the hadoop account by running the below commands (replace the hostname or FQDN in the ssh-copy-id command accordingly).
Also, leave the passphrase field blank in order to automatically log in via ssh.
$ ssh-keygen -t rsa
$ ssh-copy-id master.hadoop.lan
Step 3: Configure Hadoop in CentOS 7
10. Now it's time to set up the Hadoop cluster on a single node in pseudo-distributed mode by editing its configuration files.
The hadoop configuration files are located in $HADOOP_HOME/etc/hadoop/, where $HADOOP_HOME is the hadoop account home directory (/opt/hadoop/) in this tutorial.
Once you're logged in with the hadoop user you can start editing the following configuration files.
The first file to edit is core-site.xml. This file contains information about the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for data storage and the size of Read/Write buffers.
$ vi etc/hadoop/core-site.xml
Add the following properties between the <configuration> ... </configuration> tags. Use localhost or your machine FQDN for the hadoop instance.
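A minimal sketch, assuming the FQDN master.hadoop.lan set at the beginning of this tutorial and HDFS listening on port 9000 (a commonly used default; adjust to taste):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master.hadoop.lan:9000/</value>
</property>
```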
11. Next, open and edit the hdfs-site.xml file. The file contains information about the data replication value, the namenode path and the datanode path on the local file system.
$ vi etc/hadoop/hdfs-site.xml
Here, add the following properties between the <configuration> ... </configuration> tags. In this guide we'll use the /opt/volume/ directory to store our hadoop file system.
Replace the dfs.data.dir and dfs.name.dir values accordingly.
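A sketch using the /opt/volume/ layout described above, with replication set to 1 since this is a single-node setup:

```xml
<property>
  <name>dfs.data.dir</name>
  <value>file:///opt/volume/datanode</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>file:///opt/volume/namenode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```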
12. Because we've specified /opt/volume/ as our hadoop file system storage, we need to create those two directories (datanode and namenode) from the root account and grant all permissions to the hadoop account by executing the below commands.
$ su root
# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode
# chown -R hadoop:hadoop /opt/volume/
# ls -al /opt/ #Verify permissions
# exit #Exit root account to turn back to hadoop user
13. Next, create the mapred-site.xml file to specify that we are using the yarn MapReduce framework.
$ vi etc/hadoop/mapred-site.xml
Add the following excerpt to mapred-site.xml file:
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
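Below the stylesheet declaration, the file needs the mapreduce.framework.name property enclosed in <configuration> tags; a minimal sketch:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```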
14. Now, edit the yarn-site.xml file with the below statements enclosed between the <configuration> ... </configuration> tags:
$ vi etc/hadoop/yarn-site.xml
Add the following excerpt to yarn-site.xml file:
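A minimal property set for the shuffle service, the standard requirement for running MapReduce on YARN in Hadoop 2.x (a sketch; adjust if your deployment needs additional YARN tuning):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```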
15. Finally, set the Java home variable for the Hadoop environment by editing the below line in the hadoop-env.sh file:
$ vi etc/hadoop/hadoop-env.sh
Edit the following line to point to your Java system path.
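For the Oracle JDK RPM installed earlier, the JAVA_HOME line would typically become the following (/usr/java/default is an assumption about the RPM's install path; verify it on your system):

```shell
export JAVA_HOME=/usr/java/default/
```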
16. Also, replace the localhost value in the slaves file with your machine hostname set up at the beginning of this tutorial.
$ vi etc/hadoop/slaves
Step 4: Format Hadoop Namenode
17. Once the hadoop single node cluster has been set up, it's time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command:
$ hdfs namenode -format
Step 5: Start and Test Hadoop Cluster
18. The Hadoop scripts are located in the $HADOOP_HOME/sbin directory. In order to start the Hadoop services, run the below commands on your console:
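Assuming the PATH additions from the .bash_profile earlier (so the scripts in $HADOOP_HOME/sbin are resolvable), the standard Hadoop 2.x start scripts are:

```shell
$ start-dfs.sh
$ start-yarn.sh
```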
Check the services status with the following command.
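The jps tool shipped with the JDK lists the running Java processes; on a healthy pseudo-distributed node it should show daemons such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:

```shell
$ jps
```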
Alternatively, you can view a list of all open sockets for Apache Hadoop on your system using the ss command.
$ ss -tul
$ ss -tuln # Numerical output
19. To test the hadoop file system cluster, create a random directory in the HDFS file system and copy a file from the local file system to HDFS storage (insert data into HDFS):
$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage
To view a file content or list a directory inside HDFS file system issue the below commands:
$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/
To retrieve data from HDFS to our local file system use the below command:
$ hdfs dfs -get /my_storage/ ./
Get the full list of HDFS command options by issuing:
$ hdfs dfs -help
Step 6: Browse Hadoop Services
20. In order to access Hadoop services from a remote browser, visit the following links (replace the IP address or FQDN accordingly). Also, make sure the below ports are open on your system firewall.
For Hadoop Overview of NameNode service.
For Hadoop file system browsing (Directory Browse).
For Cluster and Apps Information (ResourceManager).
For NodeManager Information.
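In Hadoop 2.x the default web UI ports for these services are the following (shown with the illustrative address 192.168.1.80; substitute your own IP or FQDN):

```
http://192.168.1.80:50070                  # NameNode overview
http://192.168.1.80:50070/explorer.html    # HDFS directory browse
http://192.168.1.80:8088                   # ResourceManager - cluster and apps
http://192.168.1.80:8042                   # NodeManager
```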
Step 7: Manage Hadoop Services
21. To stop all hadoop instances run the below commands:
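The standard Hadoop 2.x counterparts of the start scripts stop the daemons, YARN first and then HDFS:

```shell
$ stop-yarn.sh
$ stop-dfs.sh
```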
22. In order to start the Hadoop daemons system-wide at boot, log in with the root user, open the /etc/rc.local file for editing and add the below lines:
$ su - root
# vi /etc/rc.local
Add this excerpt to the rc.local file:
su - hadoop -c "/opt/hadoop/sbin/start-dfs.sh"
su - hadoop -c "/opt/hadoop/sbin/start-yarn.sh"
Then, add executable permissions to the rc.local file and enable, start and check the service status by issuing the below commands:
# chmod +x /etc/rc.d/rc.local
# systemctl enable rc-local
# systemctl start rc-local
# systemctl status rc-local
That's it! The next time you reboot your machine the Hadoop services will be started automatically. All you need to do is fire up a Hadoop-compatible application and you're ready to go!
For additional information please consult official Apache Hadoop documentation webpage and Hadoop Wiki page.