Introduction and Requirements
Follow this step-by-step tutorial to set up Hadoop for running experiments and assignments in the Computer Science Instructional Facility (CSIF). Much of the setup described here is specific to the CSIF, so to mimic it in another location please use the links at the bottom of the page to find further information.
NOTE: The CSIF machines reboot every night. Any Hadoop daemons you leave running will die during reboot. Any DFS data, which is stored in /tmp, will be deleted during reboot.
Set your Hadoop-related environment variables.
Hadoop uses a few environment variables, and to make it easier to run, both hadoop and pig should be in your PATH.
Find what version of Pig to use
$ ls -ld /usr/local/pig*
drwxr-xr-x 14 root root 4096 May 5 11:20 /usr/local/pig-0.10.0
In the next example, we will use: /usr/local/pig-0.10.0
Add Pig to your Path
Once you know the path to the version you will be using, add it to your PATH variable in your ~/.cshrc file. Edit your ~/.cshrc file and change, or add a PATH line so it looks like this:
set path=( $path /usr/local/pig-0.10.0/bin )
If you use a shell other than tcsh or csh, please see that shell's documentation for setting environment variables.
Add Hadoop environmental variables
Edit your ~/.cshrc file and add these lines
setenv JAVA_HOME /usr/java/default
setenv HADOOP_LOG_DIR ~/hadoop/logs
setenv HADOOP_CONF_DIR ~/hadoop/conf
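The lines above are tcsh syntax. If your login shell is a Bourne-style shell such as bash, a sketch of the equivalent lines (covering both the Pig PATH addition and the three variables; the paths are the ones used in this tutorial, so adjust them to your install):

```shell
# bash/sh equivalents of the tcsh "set path" and "setenv" lines above.
# These would go in ~/.bashrc rather than ~/.cshrc.
export PATH="$PATH:/usr/local/pig-0.10.0/bin"
export JAVA_HOME=/usr/java/default
export HADOOP_LOG_DIR="$HOME/hadoop/logs"
export HADOOP_CONF_DIR="$HOME/hadoop/conf"
```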
Remember that you will need to re-exec your shell ($ exec tcsh), or log in again, for the new environment variables to take effect. After logging back in, type "env" to see your environment variables. There will be a lot of output, but it should include your new PATH directories and new variables:
Test that Hadoop is in your path
Run this command
$ hadoop
It will output the usage documentation for hadoop if hadoop is in your path. If it says “hadoop: Command not found” then hadoop isn’t yet in your path.
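If hadoop isn't found, a quick way to check whether the PATH change took effect (from an sh or bash shell) is command -v, which prints the full path of a program if the shell can find it:

```shell
# Print where the shell finds hadoop, or a message if it is not on PATH.
# Works in any POSIX shell (for tcsh, "which hadoop" does the same job).
if command -v hadoop >/dev/null 2>&1; then
    command -v hadoop
else
    echo "hadoop is not on your PATH yet"
fi
```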
Test pig, run pig in shell mode
Run this command
$ pig -x local
Pig should respond with a “grunt>” shell line. Type “quit” and hit return. If it says “pig: Command not found” then pig isn’t yet in your path.
Run the CSIF hadoop-config program
After setting up your environment variables, you can set up Hadoop. The hadoop-config program does these things:
- creates a directory for your hadoop configuration files (in this example the directory name is “hadoop”)
- asks you to enter a good passphrase for your hadoop ssh key (called ~/.ssh/id_dsa_hadoop)
- installs hadoop configuration files, using unique port numbers to avoid port conflicts with multiple users in the CSIF
- installs example files for testing Hadoop.
Choose a directory name for your hadoop install
The suggested name for your hadoop directory is “hadoop”.
Choose a good password for use with your new SSH keys
DO NOT use a blank password. Use a good passphrase.
Run hadoop-config in the directory where you want the configuration directory to be installed (usually your home directory)
$ /home/software/bin/hadoop-config hadoop
When it says this, type in your good passphrase.
creating an SSH key for you, PLEASE USE A GOOD PASSWORD
Generating public/private dsa key pair.
Enter passphrase (empty for no passphrase):
After entering your ssh key passphrase you should see the key is generated, and messages will show you the key’s location, fingerprint, randomart and so on.
Prepare SSH and ssh-agent and Test SSH keys
These next steps will prepare your SSH configuration for use with distributing Hadoop.
Start ssh-agent by typing these two commands.
Each time you log in to run hadoop, you will need to start ssh-agent to manage your passphrase.
$ ssh-agent $SHELL
$ ssh-add ~/.ssh/id_dsa_hadoop
Enter the passphrase you created above when it says this.
Enter passphrase for /home/youraccount/.ssh/id_dsa_hadoop:
If that goes well, your passphrase for that new SSH key should be cached and you won’t have to enter the passphrase to use SSH to localhost or other CSIF machines.
Test your SSH key in the CSIF
First, try to ssh to the machine you are on, with localhost with this command.
$ ssh localhost
The first time you try to SSH to any machine, you might see something like this, type yes and hit enter to authorize the new key on localhost:
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is …
Are you sure you want to continue connecting (yes/no)?
Once you have successfully logged into ‘localhost’ type “exit” to exit back to your first shell.
NOTE: If you try to ssh into "localhost" and you get an error message telling you: WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED, you probably need to delete the "localhost" line in your ~/.ssh/known_hosts file. This can be avoided by using the same CSIF machine each time you run hadoop.
NOTE: If you plan on running Hadoop in a fully distributed way, make sure to SSH into all the CSIF machines you plan to make into Hadoop slave nodes. This records the authenticity of those hosts, so the "Are you sure you want to continue connecting (yes/no)?" question won't pop up when running Hadoop.
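Before settling on a set of slave machines, a loop like this can check which ones accept your cached key without prompting. This is a sketch using the example host names pc23 through pc27 from the fully distributed section below; substitute the machines you actually plan to use, and note that first-time hosts still need the authenticity question answered once:

```shell
# Check which candidate slave machines are reachable over SSH using the
# key cached in ssh-agent. BatchMode=yes makes ssh fail instead of
# prompting for a passphrase, so unreachable or down machines are
# reported rather than hanging the loop.
for host in pc23 pc24 pc25 pc26 pc27; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "$host: ok"
    else
        echo "$host: unreachable - pick another machine"
    fi
done
```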
Starting up Hadoop
To start hadoop and test your installation, run these commands
First, format a new Distributed Filesystem (DFS)
$ hadoop namenode -format
Start all Hadoop daemons with the “start-all.sh” command.
You can check your Hadoop logs in the “logs” directory in your hadoop configuration files directory (~/hadoop/logs, for this set of examples).
Check that the web interfaces for Hadoop are up
In your hadoop configuration files directory there is a README.html file. Open it with firefox or another browser.
$ firefox ~/hadoop/README.html
There are some help links, such as one that comes back to this page. There are also two links to the web interfaces, with your unique port numbers.
Click on the JobTracker link.
It may take a few minutes for the Hadoop instance to start up. When it is running you should see information on the state of Hadoop.
Click on the NameNode link.
Before you proceed with the next steps, the Live Nodes count must be 1 or higher. It may take a few minutes for this to happen. Reload the web page until you see the Live Nodes come up. You should be able to browse the filesystem (see link on NameNode page) and not get errors.
Copy the input files needed into the DFS
NOTE: The NameNode web interface should show at least 1 “Live Nodes” before this step.
First change directory into your hadoop configuration files directory, then run the hadoop fs command to copy the input files from the conf directory.
$ hadoop fs -put conf input
NOTE: If the hadoop fs command hangs for minutes, or if you see a mess of error messages after running this command, including "org.apache.hadoop.ipc.RemoteException: java.io.IOException: File … could only be replicated to 0 nodes, instead of 1", then you probably didn't wait for the NameNode web interface "Live Nodes" to reach at least 1. Try the command again after "Live Nodes" reaches at least 1.
You can browse the filesystem from the NameNode's web interface (see your README.html file)
Test Hadoop with an example jar
Run this command to test Hadoop. You can check the JobTracker web interface to watch it run, get data on how it ran, check error messages in the logs, and more (see your README.html file for the link to JobTracker).
$ hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Look at your output files:
Copy the files locally then view
One way is to copy the output files from the DFS to your home directory, then you can view them
$ hadoop fs -get output output
$ less output/*
Look at the Web Interface
You can also look at the files via your NameNode web interface
See the README.html file, in firefox.
When you are done using Hadoop in the CSIF you should stop all of its processes. To stop hadoop, issue the "stop-all.sh" command.
Running Hadoop as a Fully Distributed Cluster
To run Hadoop more like a cluster, follow these steps. Make sure hadoop is stopped with the “stop-all.sh” command first!
Change localhost to PC hostname
Change localhost to the hostname of the machine you are logged into in various hadoop conf files.
Find your hostname with this command
$ hostname
Find IP of host
$ nslookup hostname
(Replace hostname with what you got from the above command)
Edit conf files
In the conf directory of your hadoop configuration files directory, change these files:
In conf/core-site.xml, change the hostname “localhost” to the PC’s IP, so if hostname told you “PC22” and nslookup told you “18.104.22.168” the new line should be similar to this (your port number will be different)
In conf/mapred-site.xml, do the same thing as in the core-site.xml. The changed line should be similar to this (your port number will be different)
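For reference, a sketch of what the changed lines might look like. The property names fs.default.name and mapred.job.tracker are the Hadoop 1.x defaults; the IP is the example from above, and the port numbers are placeholders, since yours were assigned by hadoop-config:

```xml
<!-- conf/core-site.xml: replace "localhost" with your machine's IP.
     9123 is a placeholder; keep the unique port number hadoop-config
     assigned to you. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://18.104.22.168:9123</value>
</property>

<!-- conf/mapred-site.xml: the same change for the JobTracker address. -->
<property>
  <name>mapred.job.tracker</name>
  <value>18.104.22.168:9124</value>
</property>
```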
Do the same for the conf/masters file: the single "localhost" line becomes a single line containing your machine's hostname.
Add some slave nodes
In the conf/slaves file, add a list of host names of a handful of machines that are running the same OS (32 bit or 64 bit, see above). Make sure you can SSH to your slave machines, from your master machine, using ssh-agent and not needing a password (see above). If you can’t SSH to a few machines, choose new machines as those may be down for maintenance.
In this example we will use pc23 through pc27, so the file would look like this
pc23
pc24
pc25
pc26
pc27
If you are using a 32-bit OS, list 32-bit machines instead.
If you put your master machine itself on this list, use its IP address rather than its hostname; all the other slaves work with just their hostnames.
Rebuild your DFS
Reformat your DFS after removing it.
Remove your old DFS.
NOTE: THIS WILL DESTROY ANY DATA ON YOUR DFS, SO BACK IT UP IF YOU WANT YOUR DATA SAVED
Run this command on any masters or slaves you have already used to run hadoop
$ rm -rf /tmp/hadoop-accountname*
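Since accountname has to be substituted by hand, a variant that asks the system for your user name can avoid typos. This is a sketch, assuming the DFS data follows the /tmp/hadoop-accountname* pattern shown above:

```shell
# Remove this user's Hadoop DFS data under /tmp. Destructive: back up
# anything you want to keep first, and run it on every master and slave
# node you have used.
rm -rf /tmp/hadoop-"$(id -un)"*
```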
Reformat your DFS
$ hadoop namenode -format
You are ready to run
Don’t forget to stop-all.sh when you are done!
NameNode DFS errors
If a DFS starts producing errors, you might need to rebuild it. It's suggested that you issue a "stop-all.sh" to stop the Hadoop daemons, then remove all the files associated with Hadoop in /tmp. This command has worked (use your own account name instead of accountname). Execute the rm command on all slave nodes if you are running distributed.
$ rm -rf /tmp/hadoop-accountname*
Format the new DFS with the “hadoop namenode -format” command. Then “start-all.sh” to start all daemons again. Don’t forget to “hadoop fs -put” your data back!
Sometimes things go horribly wrong and you need to rebuild your Hadoop installation. Issue a "stop-all.sh", back up your work and data if needed, then remove or move your hadoop configuration files directory (hadoop in this example). Then follow the instructions again, starting from "Run the CSIF hadoop-config program".
Where to get more information