JuGene
Attention
Please note that the BG/P system JuGene was replaced by the BG/Q system named JuQueen in July 2012; see page JuQueen for instructions on working with this successor of JuGene! Although with this replacement not only the hardware has changed, the following information on JuGene may remain of some use (especially some links to out-sourced parts of the documentation) and is left here for the careful reader. In addition, a ug4 user may some day have access to another BG/P machine.

General Information about JuGene

JuGene, the 72-rack Blue Gene/P (BG/P) system at Jülich Supercomputing Centre (JSC, FZ Jülich), provides a total of 294,912 cores (288 Ki) and 144 Tbyte of RAM. Each of its 73,728 compute nodes (CN) has a quad-core 32-bit PowerPC 450 processor running at 850 MHz and 2 Gbyte of RAM.

Half a rack (2048 cores) is called a midplane. The JuGene system uses five different networks dedicated to various tasks and functionalities of the machine. Most relevant for us is the 3-D torus network, a point-to-point network in which each CN has six nearest-neighbour connections.

See more on the architecture of JuGene here.

More information about JuGene "in order to enable users of the system to achieve good performance of their applications" can be found in "PRACE Best-Practice Guide for JUGENE".

Note
Note that the login nodes run SuSE Linux Enterprise Server 10 (SLES 10), while the CNs run a limited version of Linux called the Compute Node Kernel (CNK). Therefore it is necessary to cross-compile for JuGene (cf. sec. CMake, Toolchains, Compilers; sec. Configuration of ug4 for JuGene).

Access to JuGene's Login Nodes

JuGene is reached via two so-called front-end or login nodes (jugene3 and jugene4) for interactive access and the submission of batch jobs.

These login nodes are reached via

ssh <user>@jugene.fz-juelich.de

i.e., for login there is only the generic hostname jugene.fz-juelich.de, from which a connection to either jugene3 or jugene4 is established automatically.

The front-end nodes have an identical environment, but multiple sessions of one user may reside on different nodes, which must be taken into account when killing processes.

It is necessary to upload the SSH key of the machine from which you want to connect to one of JuGene's login nodes. See Logging on to JUGENE (also for X11 problems).

To be able to connect to JuGene from different machines, you may find it useful to define one of GCSC's machines (e.g. speedo, quadruped, ...) as a "springboard" to one of JuGene's login nodes (so that you log in to this machine first, then to JuGene); see SSH Hopping.
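
Such a springboard can be configured once in ~/.ssh/config on your local machine, so that a single ssh jugene command hops via the GCSC machine. The following is only a sketch; the host name speedo.gcsc.uni-frankfurt.de and the user names are assumptions, so adapt them to your actual accounts:

# hypothetical ~/.ssh/config entry; adapt host and user names
Host jugene
    HostName jugene.fz-juelich.de
    User <jugene_user>
    # tunnel through the GCSC springboard (requires OpenSSH >= 5.4 for -W)
    ProxyCommand ssh <gcsc_user>@speedo.gcsc.uni-frankfurt.de -W %h:%p

Afterwards ssh jugene (or ssh -Y jugene for X11 forwarding) connects through the springboard.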


Configuration of ug4 for JuGene

For JuGene you have to cross-compile, and to do so you use a specific Toolchain File. Start CMake like this:

cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene.cmake ..

or, for a static build, which is the configuration of choice if you want to process very large jobs:

cmake -DSTATIC=ON -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene_static.cmake ..

See also Very large Jobs on JuGene!

Note
A static build in which all system libraries are also linked statically currently needs some additional manual work: after configuration with CMake, edit the following files, replacing all occurrences of libXXXXXXX.so by libXXXXXXX.a (this has to be done only once):
CMakeCache.txt,
ugbase/ug_shell/CMakeFiles/ugshell.dir/link.txt,
ugbase/ug_shell/CMakeFiles/ugshell.dir/build.make

Or use this sed command:

sed -i 's/\([[:alnum:]]*\)\.so/\1.a/g' CMakeCache.txt ugbase/ug_shell/CMakeFiles/ugshell.dir/link.txt ugbase/ug_shell/CMakeFiles/ugshell.dir/build.make

You can check your executable by running the (standard unix) ldd command ("list dynamic dependencies") on it:

ldd ugshell

The answer should be "not a dynamic executable" for a completely static build!

Debug builds: Since the pre-installed GCC 4.1.2 (as of April 2012) is not able to compile a "debug" build, one should add the flag -DDEBUG_FORMAT=-gstabs to the CMake call (cf. GCC 4.1.2).
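
For example, a complete call for a static debug configuration might look as follows. This is only a sketch: the -DDEBUG=ON switch is assumed to be the usual ug4 CMake option for debug builds; adapt it if your setup differs.

# assumed example: static debug configuration for JuGene (cross-compiling)
cmake -DSTATIC=ON -DDEBUG=ON -DDEBUG_FORMAT=-gstabs \
      -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene_static.cmake ..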

For debugging a parallel application on JuGene see Debugging on JuGene


Working with ug4 on JuGene

Basic Job Handling

Please take the time to fill in the details in ugsubmit / uginfo / ugcancel (ugsubmit - Job Scheduling on Clusters).

See Quick Introduction to job handling. Also look at job/mpirun options.

Here we only introduce some important details (everything else should become clear by examining the examples provided):

  • Jobs can be submitted with the llrun command (interactively) or with the mpirun command (as a batch job, defined in a LoadLeveler script). See below for some details.
  • There are different execution modes which can be specified by the mpirun and llrun parameter -mode {VN | DUAL | SMP}:

    • Quad Mode (a.k.a. "Virtual Node Mode"): each of the four cores runs its own MPI process. Memory/MPI process = 1/4 CN RAM: -mode VN.
    • Dual Mode: two MPI processes per CN, each of which can use two cores (hybrid MPI/OpenMP). Memory/MPI process = 1/2 CN RAM: -mode DUAL.
    • SMP Mode ("Symmetrical Multiprocessing Mode"): one MPI process per CN, which can use all four cores (hybrid MPI/OpenMP). Memory/MPI process = CN RAM: -mode SMP.

    Note that in quad mode (using all 4 cores of a compute node) each core has only about 512 Mbyte of RAM (474 Mbyte to be more precise, since the CNK also needs some memory).

    Obviously "VN" is the preferred execution mode if large numbers of processes should be achieved — and ug4 works with VN mode (at least up to ~64 Ki DoFs per process)!

  • The mpirun parameter -mapfile specifies the order in which the MPI processes are mapped to the CNs / the cores of the BG/P partition reserved for the run. This order can be specified either by a permutation of the letters X, Y, Z and T or by the name of a mapfile in which the distribution of the tasks is given: -mapfile {<mapping>|<mapfile>}:

    • <mapping> is a permutation of X, Y, Z and T.

      The standard mapping on JuGene is to place the tasks in "XYZT" order, where X, Y, and Z are the torus coordinates of the nodes in the partition and T is the number of the cores within each node (T=0,1,2,3).

      When the tasks are distributed across the nodes the first dimension is increased first, i.e. for XYZT the first three tasks would be executed by the nodes with the torus coordinates <0,0,0,0>, <1,0,0,0> and <2,0,0,0>, which obviously is not what we want for our simulation runs. For now we recommend -mapfile TXYZ which fills up a CN before going to the next CN so that MPI processes working on adjacent subdomains are placed closely in the 3-D torus.

    • <mapfile> is the name of a mapfile in which the distribution of the tasks is specified: it contains a line of x y z t coordinates for each MPI process. See sec. 6 of the Best-Practice Guide mentioned above for an example and for the LoadLeveler keywords to use.

  • If ug4 was dynamically linked, add -env LD_LIBRARY_PATH=/bgsys/drivers/ppcfloor/comm/lib/ to the mpirun parameters.
    Note
    This parameter is (obviously) not necessary for completely static builds!
  • "Modules" can be loaded by executing the module command.

    E.g. module load lapack for LAPACK (see http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/Libraries/LAPACKJugene.html for choosing one of several versions).

    This is also how performance analysis tools are loaded, e.g. module load scalasca, module load UNITE etc.
    


  • llview is a tool with a graphical X11 user interface for displaying system status, running jobs, scheduling and prediction of the start time of jobs (the latter can also be achieved by the llq command, see below).

  • Interactive jobs can be started with the llrun command. llrun is invoked in the following way:

    llrun [<llrun_options>] [<mpirun_options>] [<executable>]

    llrun replaces mpirun and can be used with the same command-line options as mpirun. One llrun option which might be interesting: -w <hh:mm:ss>: submit the job with a wallclock limit (default: 00:30:00). For other options see e.g. http://www.prace-ri.eu/IMG/pdf/Best-practise-guide-JUGENE-v0-3.pdf, sec. 3.5.3.

    Example:

    llrun -np 4 -mode VN -mapfile TXYZ -verbose 2 -exe ./ugshell_debug -args "-ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 3 -numRefs 7"
    Note
    Note the quotation marks around the executable's arguments!
    Please note that llrun only allows jobs up to 256 (-mode SMP) / 512 (-mode DUAL) / 1024 (-mode VN) MPI processes!
    
  • Batch Jobs are defined in so-called "LoadLeveler scripts" and submitted with the llsubmit command to the IBM Tivoli Workload Scheduler LoadLeveler (TWS LoadLeveler), typically in the directory where the ug4 executable resides:

    llsubmit <cmdfile>

    <cmdfile> is a (plain Unix) shell script file (i.e., the "LoadLeveler script"), which contains job definitions given by "LoadLeveler keywords" (some important examples are explained below).

    If llsubmit was able to submit the job it outputs a job name (e.g. jugene4b.zam.kfa-juelich.de.298508) with which a specific job can be identified in further commands, e.g. to cancel it (see below).

    The output of the run (messages from the front-end and back-end MPI, and the output of ug4) is written to a file in the directory where llsubmit was executed; its name begins with the job name you have specified in the LoadLeveler script and ends with <job number>.out.

    For some example LoadLeveler scripts used with ug4 see subdirectory scripts/shell/:

    • ll_scale_gmg.x contains job definitions for a complete scalability study for GMG in 2-D and 3-D.
    • ll_template.x also contains some documentation of LoadLeveler and mpirun parameters.

    (All mpirun commands therein are commented out — to perform a specific run remove the comment sign.)

    In your copy of one of these scripts, please change at least the value of the notify_user keyword before submitting a job ...

    See also this more recent JSC documentation, and especially the Job File Samples for more details.

  • LoadLeveler keywords are strings embedded in comments beginning with the characters "# @".

    Some selected keywords:

    • job_type = bluegene specifies that the job is running on JuGene's CNs.
    • job_name specifies the name of the job, which will get part of the name of the output file.
    • bg_size specifies the size of the BG/P partition reserved for the job in number of compute nodes.

      That is, for <NP> MPI processes, bg_size must be >= (<NP>)/4.

      Alternatively the size of a job can be defined by the bg_shape keyword.

      See comments in the example LoadLeveler scripts for proper settings.

    • bg_connection specifies the "connection type", i.e. the network topology used by a job.

      The connection type can be one of [TORUS | MESH | PREFER_TORUS]. Default is bg_connection = MESH.

      bg_connection = TORUS — utilising the 3-D torus network — is the preferred topology for our jobs. For this bg_size (see above) must be >= 512.

      See also comments and usage in the example LoadLeveler scripts for proper settings.

    Please note that keywords must not be followed by comments on the same line!

    A minimal LoadLeveler script sketch combining some of the keywords above is shown below, after this list.

    A nice introduction to LoadLeveler command file syntax is given e.g. here.

  • llq is used to display the status of jobs (of a specified user) in the queue/executed:

    llq [-u <userid>]

    The estimated start time of a job can be determined by llq -s <job-id> (cf. https://docs.loni.org/wiki/Useful_LoadLeveler_Commands), where <job-id> was determined by the previous command, e.g. for the job jugene4b.304384.0:

    llq -s 304384
    ===== EVALUATIONS FOR JOB STEP jugene4b.zam.kfa-juelich.de.304384.0 =====
    Step state : Idle
    Considered for scheduling at : Sun 01 Apr 2012 00:02:55 CEST
    Top dog estimated start time : Sun 01 Apr 2012 16:38:55 CEST
    ...

  • llcancel is used to cancel a job (<jobname> as displayed by llq):
    llcancel <jobname>
  • Debugging: see the documentation of the tools /bgsys/drivers/ppcfloor/tools/coreprocessor, gdbserver, ...
  • Available file systems. See JSC documentation for more details.
  • Querying Quota Status:

    q_cpuquota <options>

    Useful options:

    • -? usage information and all options.
    • -j <jobstepid> for a single job.
    • -t <time> for all jobs in the specified time, e.g. q_cpuquota -t 23.11.2011 01.12.2011.
    • -d <number> for the last <number> days (positive integer).
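
As a complement to the example scripts in scripts/shell/, the following is a minimal sketch of a LoadLeveler script for a VN-mode batch run. It only illustrates the keywords and mpirun options discussed above; the job name, wall clock limit, e-mail address, partition size and ug4 arguments are placeholders and assumptions, so adapt them before use:

#!/bin/bash
# Minimal LoadLeveler script sketch (all values are placeholders; adapt before use).
# @ job_name = ug4_scaling_test
# @ job_type = bluegene
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).out
# @ wall_clock_limit = 00:30:00
# @ notification = error
# @ notify_user = <your e-mail address>
# bg_size is given in compute nodes: 512 CNs >= 2048 MPI processes / 4 (VN mode),
# and bg_size >= 512 also allows bg_connection = TORUS.
# @ bg_size = 512
# @ bg_connection = TORUS
# @ queue

# 2048 MPI processes in VN mode; -mapfile TXYZ fills up one CN before using the next.
mpirun -np 2048 -mode VN -mapfile TXYZ -verbose 2 \
       -exe ./ugshell \
       -args "-ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 3 -numRefs 7"

Such a script would then be submitted with llsubmit <cmdfile> as described above.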

Very large Jobs on JuGene

  • Although it is possible on JuGene to create shared libraries and to run dynamically linked executables, this is in general not recommended, since loading shared libraries can delay the startup of such an application considerably, especially when using large partitions (8 racks or more). See also Shared Libraries and Dynamic Executables.

    So, for very large jobs be sure to have ug4 built as a completely static executable (cf. Configuration of ug4 for JuGene), since otherwise loading the shared libraries consumes too much wall time!

  • Very large jobs (e.g. jobs larger than 32 racks) normally run on Tuesday only.

    Exceptions to this rule are possible in urgent cases (please contact the SC Support under sc@fz-juelich.de).


Debugging on JuGene

On JuGene the parallel debuggers TotalView and DDT are available.

Be sure to compile ug4 as a debug build.

DDT uses X11 for its graphical user interface. Be sure to log in with X11 forwarding enabled; this could mean using the -X or -Y option of ssh.
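
For example (assuming the direct login described above; if you go through a springboard machine, X11 forwarding has to be enabled on every hop):

ssh -Y <user>@jugene.fz-juelich.de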

Basic usage of DDT:

  1. Load the appropriate module:
    module load UNITE ddt
  2. Start DDT by typing ddt.
  3. A "DDT" logo appears, wait a bit (or click on the logo) to get the welcome dialog box.
  4. Click on "Run and Debug a Program" in the "Welcome" dialog box
  5. Enter your ug4 parameters (don't forget to enclose them by quotation marks), numbers of processes to run, a (hopefully) appropriate wall time to do all your debugging work (after "Queue submission Parameters"), "MPIRun parameters" (e.g. -mode VN -mapfile TXYZ -verbose 2) in the fields provided after clicking "Advanced\>\>" etc., then click the "Submit" button.

    Wait until the job is launched (you might have to exercise some patience); DDT will catch the focus automatically when resources are available.

The rest should be quite self-explanatory.

One can also specify the application to debug and its parameters right away by typing ddt [<your app> <parameters>].

Basic usage of TotalView:

  1. Load the appropriate module:
    module load UNITE totalview
  2. Start TotalView by adding the -tv flag to your llrun call with which you normally would run your executable in interactive mode (see above).

Example debug sessions:

  1. Start a DDT debugging session (immediately specifying the executable to debug):

    ddt

    Specifying the executable and its parameters on the command line should also work; currently the parameters are written into the appropriate field but are not recognised ...

    ddt ./ugshell_debug -ex ../apps/scaling_tests/modular_scalability_test.lua -numRefs 7

    The mpirun parameters, e.g. -mapfile TXYZ -verbose 2 can be placed in the fields accessible after clicking on the "Advanced" button.

  2. Start a TotalView debugging session:

    llrun -tv -np 4 -mode VN -mapfile TXYZ -verbose 2 -exe ./ugshell_debug -args "-ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 3 -numRefs 7"

    Use the llrun option -w <hh:mm:ss> to specify a (hopefully) appropriate wall time to do all your debugging work.

For additional information (especially for debugging with TotalView) see http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUGENE/UserInfo/ParallelDebugging.html.