Running Hadoop on Athena
This is a beta guide to running Hadoop 0.19.0 on the Athena cluster. This setup will not use HDFS at all; instead it will use the PolyServe shared storage resource for input and output. Temporary storage is still configured to use the node-local hard drive.
Hadoop installation
First you need a copy of Java JDK 1.6 in your home directory:
% cp -a ~gardnerj/jdk1.6.0_11 ~
% ln -s ~/jdk1.6.0_11 ~/java
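If you want to confirm that the copy and symlink worked, you can ask the JDK for its version (an optional sanity check):

% ~/java/bin/java -version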
Now determine which of the /share partitions (NOT your home directory) will hold the main Hadoop installation. For example, let's say your username is <username> and you want to keep it in /share/data1/<username>:
% cp -a /share/sdata1/gardnerj/hadoop-0.19.0 /share/data1/<username>
% cd /share/data1/<username>
% ln -s hadoop-0.19.0 hadoop
Now set the environment variables JAVA_HOME and HADOOP_HOME to the correct locations (for csh, adjust appropriately for bash):
% setenv JAVA_HOME $HOME/java
% setenv HADOOP_HOME /share/data1/<username>/hadoop
You should probably stick these environment variable definitions in your .cshrc or .bashrc file.
Now you must decide where Hadoop will read in and write out its data. This should also be in one of the shared data partitions. Let's say we want to put this in /share/data1/<username> as well:
% mkdir /share/data1/<username>/hadoop_data
% setenv HADOOP_DATA_DIR /share/data1/<username>/hadoop_data
Lastly, decide where you want Hadoop to put its logs. Let's say it's also in the same place:
% mkdir /share/data1/<username>/hadoop_log
% setenv HADOOP_LOG_DIR /share/data1/<username>/hadoop_log
Also be sure to put these environment variable definitions in your .cshrc or .bashrc file.
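For bash users, the equivalent definitions in ~/.bashrc would look something like the following (using the example locations above; adjust the paths to your own choices):

# bash equivalents of the csh setenv lines above (example paths)
export JAVA_HOME=$HOME/java
export HADOOP_HOME=/share/data1/<username>/hadoop
export HADOOP_DATA_DIR=/share/data1/<username>/hadoop_data
export HADOOP_LOG_DIR=/share/data1/<username>/hadoop_log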
Hadoop configuration
Take a look at $HADOOP_HOME/conf/hadoop-site_template.xml. This has the Athena-specific settings. During your job, the string "REPLACE_WITH_JOBTRACKER" will be replaced with the name of the compute node that will function as job tracker. More on this later. The cluster that this will set up has the following properties:
- HDFS is not used! The PolyServe distributed file server (i.e. the "/share" partitions) will be used instead
- The Hadoop "master" and "jobtracker" nodes will be the same
- The Hadoop "master" node will also be a slave
- The local data store for each node is /state/partition1/
Edit the hadoop-site_template.xml file:
- Change the value line of fs.default.name to the appropriate location for your Hadoop data directory.
- You may want to change the port of the job tracker (set to ":9112") to some higher number.
- Although the dfs.* parameters are set, you can ignore them since you will not be starting Hadoop DFS.
- Change the value line of hadoop.tmp.dir to "/state/partition1/<username>/hadoop-tmp", where <username> is your username.
Once you have done all of the above, Hadoop is installed and configured as much as it can be without knowing the precise nodes it will run on. Since the Athena nodes you are assigned will likely change for every Hadoop session, the remaining configuration detailed below will have to be redone each time Hadoop is started.
Running Hadoop interactively (not within a PBS batch script)
When testing Hadoop, you may wish to run it within an interactive session first. To launch an interactive session from within PBS, use the "-I" option on the "qsub" command line. Make sure to request all 8 cores per node. Once your session is active, you can get a list of all of the nodes that have been allocated to you by running "cat $PBS_NODEFILE" from within the interactive job shell.
There is a special Hadoop queue which can be accessed using the "-q hadoop" option in PBS. For example, to request 4 Hadoop nodes for 1 hour:
% qsub -I -q hadoop -l walltime=1:00:00,nodes=4:ppn=8
If you are going to configure Hadoop by hand from within an interactive session:
- cd to $HADOOP_HOME/conf
- Copy the hadoop-site_template.xml file to hadoop-site.xml.
- In hadoop-site.xml replace the string "REPLACE_WITH_JOBTRACKER" with the node of the jobtracker (e.g. "compute-5-1.local").
- create the file "masters" with the name of the master node (e.g. compute-5-2.local)
- create the file "slaves" with the names of the slave nodes. Use one entry per node (not the 8 that will appear in $PBS_NODEFILE).
Now you are ready to run Hadoop:
- cd to $HADOOP_HOME.
- bin/start-mapred.sh
- To check that it has started, run "bin/hadoop job -list"; Hadoop should return an empty job queue and no errors.
- DO NOT START THE DFS!!! Do not use "bin/start-dfs.sh" since we are not using the DFS.
To stop the Hadoop cluster, run "bin/stop-mapred.sh".
Running Hadoop from within PBS
Normally, you will want to run Hadoop from within a PBS job. This can be done by using some very simple scripts to put the relevant node assignments into the proper Hadoop configuration files.
STILL UNDER CONSTRUCTION. MORE ON THIS SOON.
awk -v jobtracker="$BLAH" '{sub(/REPLACE_WITH_JOBTRACKER/,jobtracker);print}' < hadoop-site_template.xml
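In the meantime, here is one possible sketch of such a PBS script, built around the awk fragment above. Everything apart from that fragment is an assumption about how the pieces might be wired together (including using the first allocated node as master and jobtracker, and the example HADOOP_HOME location), so treat it as a starting point rather than a finished recipe:

#! /bin/csh -f
#PBS -q hadoop
#PBS -l walltime=1:00:00,nodes=4:ppn=8

# Example location from this guide; this must match your own setup
setenv HADOOP_HOME /share/data1/<username>/hadoop

cd ${HADOOP_HOME}/conf

# One entry per allocated node; use the first as master and jobtracker
sort -u $PBS_NODEFILE > slaves
head -1 slaves > masters
set JOBTRACKER = `head -1 slaves`

# Fill the jobtracker name into the Athena template
awk -v jobtracker="$JOBTRACKER" '{sub(/REPLACE_WITH_JOBTRACKER/,jobtracker);print}' \
    < hadoop-site_template.xml > hadoop-site.xml

# Start MapReduce only (do NOT start the DFS), run your job, then shut down
cd ${HADOOP_HOME}
bin/start-mapred.sh
# ... run your Hadoop job(s) here ...
bin/stop-mapred.sh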
Running the Hadoop "Sort" benchmark
I have set up 3 scripts that are useful for running the Hadoop sort example application and benchmark. The first one configures the relevant locations and variables. Change these as needed.
sort_config.csh:
#! /bin/csh -exf
set HADOOP_VERSION = 0.19.0
set HADOOP_ROOT = /share/sdata1/gardnerj/hadoop-${HADOOP_VERSION}
set NMAPS = 56
set NREDUCES = 7
# PolyServe
set INDIR = ${HADOOP_DATA_DIR}/terasort_input
set OUTDIR = ${HADOOP_DATA_DIR}/terasort_output
# If you are using HDFS
#set INDIR = terasort_input
#set OUTDIR = terasort_output
First, you will need to generate a set of files for Sort to sort. This is done using the "randomwriter" example application, as in the script "run_randomwriter.csh" below:
run_randomwriter.csh:
#! /bin/csh -exf
source ./sort_config.csh
# Our output directory is the sort stage's input directory
if ( -e ${INDIR} ) then
  rm -r ${INDIR}
endif
time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    randomwriter ${INDIR}
More documentation on randomwriter is available on the Randomwriter Wiki page. You can also append the name of an XML configuration file. Here is an example of one:
randomwriter_cfg.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>test.randomwrite.min_key</name>
    <value>10</value>
  </property>
  <property>
    <name>test.randomwrite.max_key</name>
    <value>10</value>
  </property>
  <property>
    <name>test.randomwrite.min_value</name>
    <value>90</value>
  </property>
  <property>
    <name>test.randomwrite.max_value</name>
    <value>90</value>
  </property>
  <property>
    <name>test.randomwrite.total_bytes</name>
    <value>1048576</value>
  </property>
</configuration>
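To use such a file, append its name to the randomwriter invocation. Based on run_randomwriter.csh above, that might look like the following (a sketch; double-check the argument order against the Randomwriter Wiki page for this Hadoop version):

time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    randomwriter ${INDIR} randomwriter_cfg.xml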
Once you have generated sort input files to your satisfaction, you can run the sort benchmark itself:
run_sort.csh:
#! /bin/csh -exf
source ./sort_config.csh
# Other options to the sort method:
#   [-inFormat input format class]
#   [-outFormat output format class]
#   [-outKey output key class]
#   [-outValue output value class]
#
time ${HADOOP_ROOT}/bin/hadoop \
    jar ${HADOOP_ROOT}/hadoop-${HADOOP_VERSION}-examples.jar \
    sort -m ${NMAPS} -r ${NREDUCES} ${INDIR} ${OUTDIR}
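Putting it together, a typical benchmark run from inside an interactive Hadoop session might look like this (a sketch; it assumes the three scripts sit in your current directory and that the MapReduce daemons are already running):

% ./run_randomwriter.csh
% ./run_sort.csh
# Output goes to PolyServe rather than HDFS, so you can inspect it directly
% ls ${HADOOP_DATA_DIR}/terasort_output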
See the Hadoop Sort Wiki page for more information.