Running Hadoop on Athena
This is a beta guide to running Hadoop 0.19.0 on the Athena cluster. This setup will not use HDFS at all; instead it will use the PolyServe shared storage resource for input and output. Temporary storage is still configured to use the node-local hard drive.
Hadoop installation
First you need a copy of Java JDK 1.6 in your home directory:
% cp -a ~gardnerj/jdk1.6.0_11 ~
% ln -s ~/jdk1.6.0_11 ~/java
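If you want to confirm that the copy and symlink worked, you can ask the JDK for its version (an optional sanity check):

% ~/java/bin/java -version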
Now determine which of the /share partitions (NOT your home directory) will hold the main Hadoop installation. For example, let's say your username is <username> and you want to keep it in /share/data1/<username>:
% cp -a /share/sdata1/gardnerj/hadoop-0.19.0 /share/data1/<username>
% cd /share/data1/<username>
% ln -s hadoop-0.19.0 hadoop
Now set the environment variables JAVA_HOME and HADOOP_HOME to the correct locations (for csh, adjust appropriately for bash):
% setenv JAVA_HOME $HOME/java
% setenv HADOOP_HOME /share/data1/<username>/hadoop
You should probably stick these environment variable definitions in your .cshrc or .bashrc file.
Now you must decide where Hadoop will read in and write out its data. This should also be in one of the shared data partitions. Let's say we want to put this in /share/data1/<username> as well:
% mkdir /share/data1/<username>/hadoop_data
% setenv HADOOP_DATA_DIR /share/data1/<username>/hadoop_data
Lastly, decide where you want Hadoop to put its logs. Let's say it's also in the same place:
% mkdir /share/data1/<username>/hadoop_log
% setenv HADOOP_LOG_DIR /share/data1/<username>/hadoop_log
Also be sure to put these environment variable definitions in your .cshrc or .bashrc file.
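For bash users, the equivalent definitions in ~/.bashrc would look something like the following (using the example locations above; adjust the paths to your own choices):

# bash equivalents of the csh setenv lines above (example paths)
export JAVA_HOME=$HOME/java
export HADOOP_HOME=/share/data1/<username>/hadoop
export HADOOP_DATA_DIR=/share/data1/<username>/hadoop_data
export HADOOP_LOG_DIR=/share/data1/<username>/hadoop_log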
Hadoop configuration
Take a look at $HADOOP_HOME/conf/hadoop-site_template.xml. This has the Athena-specific settings. During your job, the string "REPLACE_WITH_JOBTRACKER" will be replaced with the name of the compute node that will function as job tracker. More on this later. The cluster that this will set up has the following properties:
- HDFS is not used! The PolyServe distributed file server (i.e. the "/share" partitions) will be used instead
- The Hadoop "master" and "jobtracker" nodes will be the same
- The Hadoop "master" node will also be a slave
- The local data store for each node is /state/partition1/
Edit the hadoop-site_template.xml file:
- Change the value line of fs.default.name to the appropriate location for your Hadoop data directory.
- You may want to change the port of the job tracker (set to ":9112") to some higher number.
- Although the dfs.* parameters are set, you can ignore them since you will not be starting Hadoop DFS.
- Change the value line of hadoop.tmp.dir to "/state/partition1/<username>/hadoop-tmp", where <username> is your username.
Once you have done all of the above, Hadoop is installed and configured as much as it can be without knowing the precise nodes it will run on. Since the Athena nodes you are assigned will likely change for every Hadoop session, the remaining configuration detailed below will have to be redone each time Hadoop is started.
Running Hadoop interactively (not within a PBS batch script)
When testing Hadoop, you may wish to run it within an interactive session first. To launch an interactive session from within PBS, use the "-I" option on the "qsub" command line. Make sure to request all 8 cores per node. Once your session is active, you can get a list of all of the nodes that have been allocated to you by running "cat $PBS_NODEFILE" from within the interactive job shell.
There is a special Hadoop queue which can be accessed using the "-q hadoop" option in PBS. For example, to request 4 Hadoop nodes for 1 hour:
% qsub -I -q hadoop -l walltime=1:00:00,nodes=4:ppn=8
If you are going to configure Hadoop by hand from within an interactive session:
- cd to $HADOOP_HOME/conf
- Copy the hadoop-site_template.xml file to hadoop-site.xml.
- In hadoop-site.xml replace the string "REPLACE_WITH_JOBTRACKER" with the node of the jobtracker (e.g. "compute-5-1.local").
- create the file "masters" with the name of the master node (e.g. compute-5-2.local)
- create the file "slaves" with the names of the slave nodes. Use one entry per node (not the 8 that will appear in $PBS_NODEFILE).
Now you are ready to run Hadoop:
- cd to $HADOOP_HOME.
- bin/start-mapred.sh
- To check that it has started, run "bin/hadoop job -list"; Hadoop should return an empty job queue and no errors.
- DO NOT START THE DFS!!! Do not use "bin/start-dfs.sh" since we are not using the DFS.
To stop the Hadoop cluster, run "bin/stop-mapred.sh".
Running Hadoop from within PBS
Normally, you will want to run Hadoop from within a PBS job. This can be done by using some very simple scripts to put the relevant node assignments into the proper Hadoop configuration files.
STILL UNDER CONSTRUCTION. MORE ON THIS SOON.
awk -v jobtracker="$BLAH" '{sub(/REPLACE_WITH_JOBTRACKER/,jobtracker);print}' < hadoop-site_template.xml
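In the meantime, here is one possible sketch of such a PBS script, built around the awk fragment above. Everything apart from that fragment is an assumption about how the pieces might be wired together (including using the first allocated node as master and jobtracker, and the example HADOOP_HOME location), so treat it as a starting point rather than a finished recipe:

#! /bin/csh -f
#PBS -q hadoop
#PBS -l walltime=1:00:00,nodes=4:ppn=8

# Example location from this guide; this must match your own setup
setenv HADOOP_HOME /share/data1/<username>/hadoop

cd ${HADOOP_HOME}/conf

# One entry per allocated node; use the first as master and jobtracker
sort -u $PBS_NODEFILE > slaves
head -1 slaves > masters
set JOBTRACKER = `head -1 slaves`

# Fill the jobtracker name into the Athena template
awk -v jobtracker="$JOBTRACKER" '{sub(/REPLACE_WITH_JOBTRACKER/,jobtracker);print}' \
    < hadoop-site_template.xml > hadoop-site.xml

# Start MapReduce only (do NOT start the DFS), run your job, then shut down
cd ${HADOOP_HOME}
bin/start-mapred.sh
# ... run your Hadoop job(s) here ...
bin/stop-mapred.sh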
Running the Hadoop "Sort" benchmark
I have set up 3 scripts that are useful for running the Hadoop sort example application and benchmark. The first one configures the relevant locations and variables. Change these as needed.
sort_config.csh:
#! /bin/csh -exf
set HADOOP_VERSION = 0.19.0
set HADOOP_ROOT = /share/sdata1/gardnerj/hadoop-${HADOOP_VERSION}
set NMAPS = 56
set NREDUCES = 7
# PolyServe
set INDIR = ${HADOOP_DATA_DIR}/terasort_input
set OUTDIR = ${HADOOP_DATA_DIR}/terasort_output
# If you are using HDFS
#set INDIR = terasort_input
#set OUTDIR = terasort_output
First, you will need to generate a set of files for Sort to sort. This is done using the "randomwriter" example application, as in the script "run_randomwriter.csh" below:
run_randomwriter.csh:
#! /bin/csh -exf
source ./sort_config.csh
# Our output directory is the sort stage's input directory
if ( -e ${INDIR} ) then
  rm -r ${INDIR}
endif
time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    randomwriter ${INDIR}
More documentation on randomwriter is available on the Randomwriter Wiki page. You can also append the name of an XML configuration file. Here is an example of one:
randomwriter_cfg.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>test.randomwrite.min_key</name>
    <value>10</value>
  </property>
  <property>
    <name>test.randomwrite.max_key</name>
    <value>10</value>
  </property>
  <property>
    <name>test.randomwrite.min_value</name>
    <value>90</value>
  </property>
  <property>
    <name>test.randomwrite.max_value</name>
    <value>90</value>
  </property>
  <property>
    <name>test.randomwrite.total_bytes</name>
    <value>1048576</value>
  </property>
</configuration>
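To use such a file, append its name to the randomwriter invocation. Based on run_randomwriter.csh above, that might look like the following (a sketch; double-check the argument order against the Randomwriter Wiki page for this Hadoop version):

time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    randomwriter ${INDIR} randomwriter_cfg.xml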
Once you have generated sort input files to your satisfaction, you can run the sort benchmark itself:
run_sort.csh:
#! /bin/csh -exf
source ./sort_config.csh
# Other options to the sort method:
#   [-inFormat input format class]
#   [-outFormat output format class]
#   [-outKey output key class]
#   [-outValue output value class]
#
time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    sort -m ${NMAPS} -r ${NREDUCES} ${INDIR} ${OUTDIR}
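Putting it together, a typical benchmark run from inside an interactive Hadoop session might look like this (a sketch; it assumes the three scripts sit in your current directory and that the MapReduce daemons are already running):

% ./run_randomwriter.csh
% ./run_sort.csh
# Output goes to PolyServe rather than HDFS, so you can inspect it directly
% ls ${HADOOP_DATA_DIR}/terasort_output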
See the Hadoop Sort Wiki page for more information.