ug4
Hermit

Architecture

Hermit is a Cray XE6. In Installation step 1, the XE6 is a cluster of 3552 nodes. Each node is a dual-socket AMD Opteron 6276 (Interlagos) @ 2.3 GHz with 16 cores per socket, which results in 113,664 cores in total. Normal nodes have 32 GB RAM, 480 special nodes have 64 GB (126 TB in total). (Architecture) That is 1 GB RAM for each process when running the maximum of 32 processes on a normal node. The current maximum number of cores for one job is 64000. Contact the administration if you need more.
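
The totals quoted above can be checked with a few lines of shell arithmetic. This is only a sanity check, assuming that the 3552 - 480 remaining nodes are the 32 GB ones and that 1 TB is counted as 1024 GB:

```shell
#!/bin/sh
# Sanity-check the Hermit totals (integer arithmetic only).
nodes=3552            # total number of nodes
cores_per_node=32     # 2 sockets x 16 cores
big_nodes=480         # nodes with 64 GB RAM (assumption: all others have 32 GB)

total_cores=$((nodes * cores_per_node))
total_ram_gb=$(( (nodes - big_nodes) * 32 + big_nodes * 64 ))
total_ram_tb=$((total_ram_gb / 1024))
ram_per_proc_gb=$((32 / cores_per_node))   # on a normal node, fully populated

echo "cores: $total_cores"
echo "RAM: $total_ram_gb GB ($total_ram_tb TB)"
echo "RAM per process on a normal node: $ram_per_proc_gb GB"
```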


General

Be aware that Hermit blocks all connections to the outside internet. To check out on Hermit, you have to use SVN over an SSH tunnel (see secSVNSSHTunneling) and uginstall's -svnServer option (see uginstall - Scripts for installation).

The job scheduler on Hermit is supported by ugsubmit, which provides unified job scheduling on all clusters. You might want to use its -Hermit-workspace option.

Note
You have to choose modules every time you log in (you might want to add your module load/swap commands to your .bashrc or similar).

GCC

First, check which modules are loaded:

module list

One of them is named PrgEnv-cray or, more generally, PrgEnv-*. Swap it for PrgEnv-gnu:

module swap PrgEnv-cray PrgEnv-gnu

There's a one-liner for this task:

module swap $(module li 2>&1 | awk '/PrgEnv/{print $2}') PrgEnv-gnu
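
To see what the one-liner does, the awk filter can be tried on canned text. The listing below is a made-up stand-in for the output of module li (the first two module names are purely illustrative), and current_prgenv is a hypothetical helper name. On the real command, 2>&1 is needed because module writes its listing to stderr.

```shell
# Hypothetical helper: print the loaded PrgEnv module from
# `module list`-style text read on stdin.
current_prgenv() {
    # Lines look like "  3) PrgEnv-cray/4.0.46"; field 2 is the module name.
    awk '/PrgEnv/{print $2}'
}

# Illustrative sample, not real Hermit output.
sample_listing='Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) xt-asyncpe/5.11
  3) PrgEnv-cray/4.0.46'

# The extracted name is what the one-liner passes to `module swap`.
printf '%s\n' "$sample_listing" | current_prgenv
```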

Then start cmake with a toolchain file:

cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/hermit.cmake ..

Cray CC

The Cray C Compiler is not working at the moment because there is an internal compiler error in release mode.

Toolchain file is ../cmake/toolchain/hermit.cmake, and the module is PrgEnv-cray.

module swap $(module li 2>&1 | awk '/PrgEnv/{print $2}') PrgEnv-cray
cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/hermit.cmake ..

Workspace Mechanism

Access to the user file system (anything in your home directory) from a running job is very slow on Hermit. This is very noticeable in runs with 1024 cores, even if you only access small files once, such as your script files. The workspace mechanism lets you create a directory on a specialized parallel file system. First create a workspace in your ug4 directory:

ln -s `ws_allocate ug4ws 31` workspace

Each time before you run ugshell, you have to run a script like the following:

mkdir -p workspace/ugcore
rsync -a -W --exclude=.git ugcore/scripts/ workspace/ugcore/scripts/
rsync -a -W --exclude=.svn apps/ workspace/apps/
export UG4_ROOT=workspace

See also the rsync documentation (-a = archive mode: recurse into subdirectories and preserve timestamps and permissions; -W = copy whole files).

Now start your job with ugsubmit from inside your workspace.

Be aware that files written by your job can also be damaged if you are not using the workspace mechanism. See also the "-dir" and "-Hermit-workspace" options in ugsubmit - Job Scheduling on Clusters.

Note the export UG4_ROOT part:

export UG4_ROOT=$runDir

If this environment variable is not specified, ug4 will look in ../scripts/ for scripts and ../data/ for data, relative to the path of the binary.
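
The lookup rule can be pictured with a small shell function. This is only an illustration of the behavior described above, not ug4's actual code: ug4_scripts_dir is a hypothetical name, and the location $UG4_ROOT/ugcore/scripts is an assumption inferred from the rsync target used earlier.

```shell
# Hypothetical sketch of the script-directory lookup; not ug4's real implementation.
# $1 = path to the ugshell binary
ug4_scripts_dir() {
    if [ -n "${UG4_ROOT:-}" ]; then
        # Assumption: layout matches the rsync target (workspace/ugcore/scripts).
        printf '%s\n' "$UG4_ROOT/ugcore/scripts"
    else
        # Fallback described in the text: ../scripts/ relative to the binary.
        printf '%s\n' "$(dirname "$1")/../scripts"
    fi
}

unset UG4_ROOT
ug4_scripts_dir bin/ugshell      # falls back to the path next to the binary

UG4_ROOT=workspace
ug4_scripts_dir bin/ugshell      # uses the workspace copy instead
```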

Warning
Be aware that the lifetime of your workspace is limited. You have to run ws_extend at least once a month to prevent your workspace from being deleted. Deletion also removes your output, so be sure to extend your workspace in time and copy your results out of the workspace as soon as possible.

Debugging on Hermit

On Hermit the parallel debugger DDT is available. See https://wickie.hlrs.de/platforms/index.php/DDT.

Be sure to compile ug4 as a debug build.

DDT uses X11 for its graphical user interface. Be sure to log in with X11 forwarding enabled, for example by passing the -X or -Y option to ssh.

Basic usage of DDT:

  1. Load the DDT module:
    module load ddt
  2. Start a job in interactive mode with
    qsub -IX [other job options]
    -I requests an interactive job; -X enables X11 forwarding. Add -V to export all environment variables in qsub's command environment to the job.
  3. Start DDT by typing
    ddt
  4. Click on "Run and Debug a Program" in the "Welcome" dialog box, enter the executable to debug, the number of processes to run, the executable's parameters etc., then click the "Submit" button.

    You can also specify the application to debug and its parameters directly on the command line by typing ddt [<your app> <parameters>]

The rest should be fairly self-explanatory.

Example debug session:

  1. Start interactive session:
    qsub -I -V -X -l mppwidth=64,mppnppn=32,walltime=00:30:00
    qsub: waiting for job 309598.sdb to start
    The walltime setting means that the session in this example lasts 30 minutes.
  2. Start DDT with ugshell as executable and some ug4 parameters:

    ddt ./ugshell -ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 1 -hRedistFirstLevel 4 -hRedistStepSize 2 -hRedistNewProcsPerStep 4 -numRefs 8

Notes:

  • -I declares that the job is to be run "interactively".
  • -V declares that all environment variables are passed to the job.
  • -X Enables X11 forwarding, necessary to get DDT's GUI as X11 window.
  • After the job is started with qsub you've entered a new shell, so you have to go (again) into ug4's bin directory before starting the debugger.
  • When retrying the above procedure, the following error message appeared after executing the qsub command (19.09.2012):

    PrgEnv-cray/4.0.46(11):ERROR:150: Module 'PrgEnv-cray/4.0.46' conflicts with the currently loaded module(s) 'PrgEnv-gnu/4.0.46'
    PrgEnv-cray/4.0.46(11):ERROR:102: Tcl command execution failed: conflict PrgEnv-gnu
    ModuleCmd_Switch.c(172):ERROR:152: Module 'PrgEnv-cray' is currently not loaded

    This can seemingly be ignored.

  • In the same test, DDT complained that no licence file was found:

    ddt ./ugshell ...
    Unable to obtain valid licence for DDT.
    The licence file tried was specified using the (optional) environment
    variable specifier DDT_LICENCE_FILE.
    The file "/opt/cray/ddt/3.2/Licence" does not exist.
    If you do not already have a licence for DDT, please visit the Allinea website
    http://www.allinea.com/products/ddt/ to obtain an evaluation licence.

    As the message suggests, it was possible to point DDT to the licence file of an earlier version via the environment variable:

    nid03538:~ igcingo$ export DDT_LICENCE_FILE=/opt/cray/ddt/3.1/Licence

    After that, running the ddt command again eventually starts DDT.
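
The licence workaround can be wrapped in a small guard that sets the variable to the first licence file that actually exists. This is a sketch under the assumption that the two paths from the messages above are the only candidates; pick_ddt_licence is a hypothetical helper name.

```shell
# Hypothetical helper: print the first existing file among the arguments.
pick_ddt_licence() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # none of the candidates exists
}

# Paths taken from the error message and the workaround above.
if licence=$(pick_ddt_licence /opt/cray/ddt/3.2/Licence /opt/cray/ddt/3.1/Licence); then
    export DDT_LICENCE_FILE="$licence"
else
    echo "no DDT licence file found" >&2
fi
```

Run this in the interactive job shell before starting ddt, since that is where the variable has to be set.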