CSIF Hadoop Setup Instructions

Introduction and Requirements
Set your Hadoop related environmental variables
Run the CSIF hadoop-config program
Prepare SSH and ssh-agent and Test SSH keys
Starting up Hadoop
Copy the input files needed into the DFS
Test Hadoop with an example jar
Stopping Hadoop
Running Hadoop as a Fully Distributed Cluster
Troubleshooting
More Information

Introduction and Requirements

Follow this step-by-step tutorial to set up Hadoop for running experiments and assignments in the Computer Science Instructional Facility (CSIF). Much of the software mentioned here is exclusive to the CSIF, so to mimic this setup in another location, please use the links at the bottom of the page to find further information.

NOTE: The CSIF machines reboot every night. Any Hadoop daemons you leave running will die during reboot. Any DFS data, which is stored in /tmp, will be deleted during reboot.

Set your Hadoop related environmental variables.

Hadoop uses a few environmental variables, and to make things easier to run, the hadoop (and pig) commands should be in your PATH.

Find what version of Pig to use

$ ls -ld /usr/local/pig*

drwxr-xr-x 14 root root 4096 May 5 11:20 /usr/local/pig-0.10.0

In the next example, we will use: /usr/local/pig-0.10.0

Add Pig to your Path

Once you know the path to the version you will be using, add it to your PATH variable in your ~/.cshrc file. Edit your ~/.cshrc file and change, or add a PATH line so it looks like this:

set path=( $path /usr/local/pig-0.10.0/bin )

For shells other than tcsh and csh, please see that shell's documentation for setting environmental variables.

Add Hadoop environmental variables

Edit your ~/.cshrc file and add these lines

setenv JAVA_HOME /usr/java/default

setenv HADOOP_LOG_DIR ~/hadoop/logs

setenv HADOOP_CONF_DIR ~/hadoop/conf
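If your login shell is bash rather than tcsh or csh, the equivalent settings go in your ~/.bashrc instead. This is just a sketch assuming the same paths as the tcsh example above; adjust the pig version to whatever you found with the ls command:

export PATH=$PATH:/usr/local/pig-0.10.0/bin

export JAVA_HOME=/usr/java/default

export HADOOP_LOG_DIR=~/hadoop/logs

export HADOOP_CONF_DIR=~/hadoop/conf

After editing ~/.bashrc, run exec bash or log in again, just as with tcsh below.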

Remember that you will need to exec your shell ($ exec tcsh) or log in again for the environmental variables to be added to the shell. After logging back in, type “env” to see your environmental variables. There will be a lot of output, but it should show your new PATH directories and new variables:

$ env

PATH=/usr/local/bin:/bin:/usr/bin:/pkg/bin:/usr/local/bin:/usr/local/pig-0.10.0/bin

JAVA_HOME=/usr/java/default

HADOOP_LOG_DIR=/home/gribble/hadoop/logs

HADOOP_CONF_DIR=/home/gribble/hadoop/conf

Test that Hadoop is in your path

Run this command

$ hadoop

It will output the usage documentation for hadoop if hadoop is in your path. If it says “hadoop: Command not found” then hadoop isn’t yet in your path.
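As another quick check, the which command will print the full path to the hadoop command if it is in your PATH, and an error otherwise:

$ which hadoop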

Test pig, run pig in shell mode

Run this command

$ pig -x local

Pig should respond with a “grunt>” shell line. Type “quit” and hit return. If it says “pig: Command not found” then pig isn’t yet in your path.

Run the CSIF hadoop-config program

After setting up your environmental variables, you can set up Hadoop. The hadoop-config program does these things:

  • creates a directory for your hadoop configuration files (in this example the directory name is “hadoop”)
  • asks you to enter a good password for your hadoop ssh key (called ~/.ssh/id_dsa_hadoop)
  • installs hadoop configuration files, using unique port numbers to avoid port conflicts with multiple users in the CSIF
  • installs example files for testing Hadoop.

Choose a directory name for your hadoop install

The suggested name for your hadoop directory is “hadoop”.

Choose a good password for use with your new SSH keys

DO NOT use a blank password. Use a good passphrase.

Run hadoop-config in the directory where you want the configuration directory to be installed (usually your home directory)

$ cd

$ /home/software/bin/hadoop-config hadoop

When it says this, type in your good passphrase.

creating an SSH key for you, PLEASE USE A GOOD PASSWORD

Generating public/private dsa key pair.

Enter passphrase (empty for no passphrase):

After entering your ssh key passphrase you should see the key is generated, and messages will show you the key’s location, fingerprint, randomart and so on.
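If you want to double-check that the key pair was created, you can list the files; you should see the private key and its matching .pub file (this is just a sanity check, not a required step):

$ ls -l ~/.ssh/id_dsa_hadoop ~/.ssh/id_dsa_hadoop.pub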

Prepare SSH and ssh-agent and Test SSH keys

These next steps will prepare your SSH configuration for use with distributing Hadoop.

Start ssh-agent by typing these two commands.

Each time you login to run hadoop, you will need to start ssh-agent to manage your password.

$ ssh-agent $SHELL

$ ssh-add ~/.ssh/id_dsa_hadoop

Enter the passphrase you created above when it says this.

Enter passphrase for /home/youraccount/.ssh/id_dsa_hadoop:

If that goes well, your passphrase for that new SSH key should be cached and you won’t have to enter the passphrase to use SSH to localhost or other CSIF machines.
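To confirm that the key really is cached, you can ask ssh-agent to list the identities it is holding; the fingerprint of id_dsa_hadoop should appear in the output:

$ ssh-add -l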

Test your SSH key in the CSIF

First, try to ssh to the machine you are on, with localhost with this command.

$ ssh localhost

The first time you try to SSH to any machine, you might see something like this, type yes and hit enter to authorize the new key on localhost:

The authenticity of host 'localhost (127.0.0.1)' can't be established.

RSA key fingerprint is …

Are you sure you want to continue connecting (yes/no)?

Once you have successfully logged into ‘localhost’ type “exit” to exit back to your first shell.

NOTE: If you try to ssh into “localhost” and you get an error message telling you WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED, you probably need to delete the “localhost” line in your ~/.ssh/known_hosts file. This can be avoided by using the same CSIF machine each time you run hadoop.
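As an alternative to editing ~/.ssh/known_hosts by hand, ssh-keygen can remove the stale localhost entry for you (it removes only the matching entry and leaves the rest of the file alone):

$ ssh-keygen -R localhost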

NOTE: If you plan on running Hadoop in a fully distributed way, make sure to SSH into all the CSIF machines you plan to make into Hadoop slave nodes. This way the authenticity of those hosts is recorded and the “Are you sure you want to continue connecting (yes/no)?” question won’t pop up when running Hadoop.

Starting up Hadoop

To start hadoop and test your installation, run these commands.

First, format a new Distributed Filesystem (DFS)

$ hadoop namenode -format

Start all Hadoop daemons with the “start-all.sh” command.

$ start-all.sh

You can check your Hadoop logs in the “logs” directory in your hadoop configuration files directory (~/hadoop/logs, for this set of examples).
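For example, to see whether the NameNode started cleanly you can look at its log. The exact file names include your account name and the machine's hostname, so a wildcard is used here (a sketch, assuming the log directory from the examples above):

$ ls ~/hadoop/logs

$ tail ~/hadoop/logs/*namenode*.log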

Check that the web interfaces for Hadoop are up

In your hadoop configuration files directory there is a README.html file. Open it with firefox or another browser.

$ firefox ~/hadoop/README.html

There are some help links, such as one that comes back to this page. There are also two links to the web interfaces, using your unique port numbers.

Click on the JobTracker link.

It may take a few minutes for the Hadoop instance to start up. When it is running you should see information on the state of Hadoop.

Click on the NameNode link.

Before you proceed with the next steps the Live Nodes must be 1 or higher. It may take a few minutes for this to happen. Reload the web page until you see the Live Nodes come up. You should be able to browse the filesystem (see link on NameNode page) and not get errors.
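You can also check the number of live DataNodes from the command line instead of reloading the page; the report printed by this command includes a count of available DataNodes:

$ hadoop dfsadmin -report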

Copy the input files needed into the DFS

NOTE: The NameNode web interface should show at least 1 “Live Nodes” before this step.

First change directory into your hadoop configuration files directory, then run the hadoop fs command to copy the input files from the conf directory.

cd ~/hadoop

hadoop fs -put conf input

NOTE: If the hadoop fs command hangs for minutes, or if you see a mess of error messages after running this command, including “org.apache.hadoop.ipc.RemoteException: java.io.IOException: File … could only be replicated to 0 nodes, instead of 1”, then you probably didn’t wait for the NameNode web interface “Live Nodes” to reach at least 1. Try the command again after “Live Nodes” reaches at least 1.

You can browse the filesystem from the NameNode’s web interface (see your README.html file).
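You can also list the copied files from the command line to confirm the put worked:

$ hadoop fs -ls input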

Test Hadoop with an example jar

Run this command to test Hadoop. You can check the JobTracker web interface to watch it run, get data on how it ran, check error messages in the logs, and more (see your README.html file for the link to JobTracker).

hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Look at your output files:

Copy the files locally then view

One way is to copy the output files from the DFS to your home directory, then you can view them

$ hadoop fs -get output output

$ less output/*
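Alternatively, you can view the output directly from the DFS without copying it locally:

$ hadoop fs -cat output/*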

Look at the Web Interface

You can also look at the files via your NameNode web interface

See the README.html file, in firefox.

Stopping Hadoop

When you are done using Hadoop in the CSIF you should stop all of its processes. To stop hadoop, issue the “stop-all.sh” command.

$ stop-all.sh

Running Hadoop as a Fully Distributed Cluster

To run Hadoop more like a cluster, follow these steps. Make sure hadoop is stopped with the “stop-all.sh” command first!

Change localhost to your PC’s IP address

In several hadoop conf files, you will change localhost to the IP address of the machine you are logged into. First find that machine’s hostname, then look up its IP.

Find your host name with this command

$ hostname

Find IP of host

$ nslookup hostname

(Replace hostname with what you got from the above command)

Edit conf files

In the conf directory of your hadoop configuration files directory, change these files:

In conf/core-site.xml, change the hostname “localhost” to the PC’s IP address. So if hostname told you “PC22” and nslookup told you “169.237.5.122”, the new line should be similar to this (your port number will be different):

<value>hdfs://169.237.5.122:10000</value>

In conf/mapred-site.xml, do the same thing as in the core-site.xml. The changed line should be similar to this (your port number will be different)

<value>169.237.5.122:20000</value>
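For reference, the <value> lines above live inside property blocks. In the Hadoop version used here those blocks typically look like the sketch below; the property names (fs.default.name and mapred.job.tracker) and the port numbers in your generated files may differ, so change only the value, not the rest of the block:

<property>
  <name>fs.default.name</name>
  <value>hdfs://169.237.5.122:10000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>169.237.5.122:20000</value>
</property>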

Do the same for the conf/masters file, localhost becomes a single line

169.237.5.122

Add some slave nodes

In the conf/slaves file, add a list of host names of a handful of machines that are running the same OS (32 bit or 64 bit, see above). Make sure you can SSH from your master machine to your slave machines using ssh-agent, without needing a password (see above; a quick way to check this is sketched below, after the example lists). If you can’t SSH to a few machines, choose different machines, as those may be down for maintenance.

In this example we will use pc23 through pc27, the file would look like this

pc23

pc24

pc25

pc26

pc27

If you are using a 32-bit OS, the file would look like this

pc13

pc14

pc15

pc16

pc17

If you put the machine you are using as the master on this list, make sure you put its IP number and not its hostname; all the other slaves work with just their hostname.
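As mentioned above, it is worth confirming that you can reach every slave over SSH before starting Hadoop. A quick tcsh loop like the one below (using the example hostnames; substitute your own) should print each slave's hostname without prompting for a password if ssh-agent is set up correctly:

foreach h ( pc23 pc24 pc25 pc26 pc27 )
    ssh $h hostname
end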

Rebuild your DFS

Reformat your DFS after removing it.

Remove your old DFS.

NOTE: THIS WILL DESTROY ANY DATA ON YOUR DFS, SO BACK IT UP IF YOU WANT YOUR DATA SAVED

Run this command on any masters or slaves you have already used to run hadoop

rm -rf /tmp/hadoop-accountname*

Reformat your DFS

hadoop namenode -format

You are ready to run

start-all.sh

Don’t forget to stop-all.sh when you are done!

Troubleshooting

NameNode DFS errors

If a DFS starts producing errors, you might need to rebuild it. It is suggested that you issue a “stop-all.sh” to stop the Hadoop daemons. Then remove all the files associated with Hadoop in /tmp. This command has worked (use your own account name instead of accountname). Execute the rm command on all slave nodes if you are running distributed.

rm -rf /tmp/hadoop-accountname*

Format the new DFS with the “hadoop namenode -format” command. Then “start-all.sh” to start all daemons again. Don’t forget to “hadoop fs -put” your data back!
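Putting those steps together, a typical rebuild session looks like this (run the rm on every node that has hosted Hadoop, replace accountname with your own account name, and wait for the NameNode web interface to show at least 1 live node before the final put):

$ stop-all.sh

$ rm -rf /tmp/hadoop-accountname*

$ hadoop namenode -format

$ start-all.sh

$ cd ~/hadoop

$ hadoop fs -put conf input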

Rebuilding

Sometimes things go horribly wrong and you need to rebuild your Hadoop installation. Issue a “stop-all.sh”, back up your work and data if needed, then remove or move your hadoop configuration files directory (hadoop in this example). Then follow these instructions again, starting from “Run the CSIF hadoop-config program”.

More Information

Where to get more information

SSH

http://www.openssh.org/manual.html

Hadoop single node setup and example

http://hadoop.apache.org/common/docs/current/single_node_setup.html
