JuGene, the 72-rack Blue Gene/P (BG/P) system at Jülich Supercomputing Centre (JSC) at FZ Jülich, provides in total 294,912 cores (288 Ki) and 144 Tbyte of RAM. Each of the 73,728 compute nodes (CN) has a quad-core 32-bit PowerPC 450 processor running at 850 MHz and 2 Gbyte of RAM.
Half a rack (2048 cores) is called a midplane. The JuGene system uses five different networks dedicated to various tasks and functionalities of the machine. Relevant for us is the 3-D torus network: a point-to-point network in which each CN has six nearest-neighbour connections.
See more on the architecture of JuGene here.
More information about JuGene "in order to enable users of the system to achieve good performance of their applications" can be found in "PRACE Best-Practice Guide for JUGENE".
JuGene is reached via two so-called front-end or login nodes (jugene3 and jugene4) for interactive access and the submission of batch jobs.
These login nodes are reached via ssh:
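A sketch (with <userid> a placeholder for your account name):

    ssh <userid>@jugene.fz-juelich.de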
i.e., for login there is only a generic hostname, jugene.fz-juelich.de, from which a connection to either jugene3 or jugene4 is established automatically.
The front-end nodes have an identical environment, but multiple sessions of one user may reside on different nodes which must be taken into account when killing processes.
It is necessary to upload the SSH key of the machine from which you want to connect to one of JuGene's login nodes. See Logging on to JUGENE (also for X11 problems).
To be able to connect to JuGene from different machines, you may find it useful to define one of GCSC's machines (e.g. speedo, quadruped, ...) as a "springboard" to one of JuGene's login nodes (so that you first log in to this machine, then to JuGene); see SSH Hopping.
For JuGene you have to "cross compile", and to do so you use a specific Toolchain File. Start CMake like this:
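A minimal sketch, assuming the toolchain file ships as cmake/toolchain/jugene.cmake in your ug4 tree (the exact path and name may differ in your checkout):

    cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene.cmake ..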
or, for static builds, which are the configuration of choice if you want to run very large jobs:
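Again only a sketch; the option for a static build is assumed here to be -DSTATIC=ON, so check the exact option name of your ug4 version:

    cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene.cmake -DSTATIC=ON ..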
See also Very large Jobs on JuGene!
For a completely static build it may be necessary to replace every shared library reference libXXXXXXX.so by its static counterpart libXXXXXXX.a in the CMake cache (this has to be done only once), either by hand or with a sed command:
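A sketch of such a sed call, assuming the references live in CMakeCache.txt (adapt the file name to your setup):

    # replace every reference to a shared library lib*.so by the static archive lib*.a
    sed -i 's/\(lib[A-Za-z0-9_+-]*\)\.so/\1.a/g' CMakeCache.txt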
You can check your executable by running the (standard Unix) ldd command ("list dynamic dependencies") on it:
The answer should be "not a dynamic executable" for a completely static build!
Debug builds: Since the pre-installed GCC 4.1.2 (as of April 2012) is not able to compile a "debug" build, one should add the flag -DDEBUG_FORMAT=-gstabs to the CMake call (cf. GCC 4.1.2).
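A sketch, assuming ug4's usual -DDEBUG=ON switch for debug builds (check the option name of your ug4 version):

    cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchain/jugene.cmake -DDEBUG=ON -DDEBUG_FORMAT=-gstabs ..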
For debugging a parallel application on JuGene see Debugging on JuGene.
Please take the time to fill out the details in ugsubmit / uginfo / ugcancel (ugsubmit - Job Scheduling on Clusters).
See Quick Introduction to job handling. Also look at job/mpirun options.
Here we only introduce some important details (everything else should become clear by examining the examples provided):
Jobs are started either with the llrun command (interactively) or with the mpirun command (as batch job; defined in a LoadLeveler script). See below for some details. There are different execution modes, which can be specified by the mpirun and llrun parameter -mode {VN | DUAL | SMP}:
- -mode VN ("Virtual Node" mode): 4 MPI processes per CN, one per core.
- -mode DUAL: 2 MPI processes per CN, each with up to 2 threads.
- -mode SMP: 1 MPI process per CN, with up to 4 threads.
Note that in quad mode (using all 4 processors of a compute node) this means each core has only ca. 512 Mbyte of RAM (474 Mbyte to be more specific, since the CNK also needs some memory).
Obviously "VN" is the preferred execution mode if large numbers of processes are to be achieved, and ug4 works in VN mode (at least up to ~64 Ki DoFs per process)!
The mpirun parameter -mapfile specifies the order in which the MPI processes are mapped to the CNs / the cores of the BG/P partition reserved for the run. This order can be specified either by a permutation of the letters X, Y, Z and T or by the name of a mapfile in which the distribution of the tasks is specified, -mapfile {<mapping>|<mapfile>}:
- <mapping> is a permutation of X, Y, Z and T.
The standard mapping on JuGene is to place the tasks in "XYZT" order, where X, Y, and Z are the torus coordinates of the nodes in the partition and T is the number of the core within each node (T=0,1,2,3). When the tasks are distributed across the nodes, the first dimension is increased first, i.e. for XYZT the first three tasks would be executed by the nodes with the torus coordinates <0,0,0,0>, <1,0,0,0> and <2,0,0,0>, which is obviously not what we want for our simulation runs. For now we recommend -mapfile TXYZ, which fills up a CN before going to the next one, so that MPI processes working on adjacent subdomains are placed close together in the 3-D torus.
- <mapfile> is the name of a mapfile in which the distribution of the tasks is specified: it contains one line of x y z t coordinates for each MPI process (a tiny sketch is given below). See sec. 6 of the Best-Practice Guide mentioned above for an example and for the LoadLeveler keywords to use.
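A purely illustrative mapfile for 8 processes on two neighbouring CNs, filling each node before moving to the next (TXYZ-like order):

    0 0 0 0
    0 0 0 1
    0 0 0 2
    0 0 0 3
    1 0 0 0
    1 0 0 1
    1 0 0 2
    1 0 0 3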
Further mpirun parameters: environment variables can be passed with -env, e.g. -env LD_LIBRARY_PATH=/bgsys/drivers/ppcfloor/comm/lib/ (relevant for dynamically linked executables).
"Modules" can be loaded by executing the module command.
E.g. module load lapack
for LAPACK (see http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/Libraries/LAPACKJugene.html for choosing one of several versions).
This is also the way to load performance analysis tools, e.g. module load scalasca, module load UNITE, etc.
llview is a tool with a graphical X11 user interface for displaying system status, running jobs, scheduling, and a prediction of the start time of jobs (the latter can also be obtained with the llq command, see below).
Interactive jobs can be started with the llrun command. llrun is invoked in the following way:
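A sketch of the general form (the executable and its arguments are placeholders):

    llrun [llrun/mpirun options] <executable> [arguments of the executable]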
llrun replaces mpirun and can be used with the same command line options as mpirun. One llrun option which might be interesting is -w <hh:mm:ss>: submit the job with the given wallclock limit (default: 00:30:00). For other options see e.g. http://www.prace-ri.eu/IMG/pdf/Best-practise-guide-JUGENE-v0-3.pdf, sec. 3.5.3.
Example:
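A hedged sketch; the executable name ugshell, the process count and the Lua script are placeholders to adapt to your run:

    llrun -np 256 -mode VN -mapfile TXYZ -verbose 2 -w 00:30:00 ./ugshell -ex <your_script>.lua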
Please note that llrun only allows jobs with up to 256 (-mode SMP) / 512 (-mode DUAL) / 1024 (-mode VN) MPI processes!
Batch jobs are defined in so-called "LoadLeveler scripts" and submitted with the llsubmit command to the IBM Tivoli Workload Scheduler LoadLeveler (TWS LoadLeveler), typically in the directory where the ug4 executable resides:
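For example, using one of the example scripts mentioned below:

    llsubmit ll_template.x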
<cmdfile> is a (plain Unix) shell script file (i.e., the "LoadLeveler script") which contains job definitions given by "LoadLeveler keywords" (some important examples are explained below).
If llsubmit was able to submit the job, it outputs a job name (e.g. jugene4b.zam.kfa-juelich.de.298508) with which a specific job can be identified in further commands, e.g. to cancel it (see below).
The output of the run (messages from the frontend and backend MPI, and the output of ug4) is written to a file in the directory where llsubmit was executed; its name begins with the job name you have specified in the LoadLeveler script and ends with <job number>.out.
For some example LoadLeveler scripts used with ug4 see the subdirectory scripts/shell/:
- ll_scale_gmg.x contains job definitions for a complete scalability study for GMG in 2-D and 3-D.
- ll_template.x also contains some documentation of LoadLeveler and mpirun parameters. (All mpirun commands therein are commented out; to perform a specific run, remove the comment sign.)
Please change in your copy of one of these scripts at least the value of the notify_user keyword before submitting a job ...
See also this more recent JSC documentation, and especially the Job File Samples for more details.
LoadLeveler keywords are strings embedded in comments beginning with the characters "# @".
Some selected keywords:
- job_type = bluegene specifies that the job runs on JuGene's CNs.
- job_name specifies the name of the job, which becomes part of the name of the output file.
- bg_size specifies the size of the BG/P partition reserved for the job, in number of compute nodes. That is, for <NP> MPI processes, bg_size must be >= <NP>/4. Alternatively the size of a job can be defined by the bg_shape keyword. See comments in the example LoadLeveler scripts for proper settings.
- bg_connection specifies the "connection type", i.e. the network topology used by a job. The connection type can be one of [TORUS | MESH | PREFER_TORUS]; the default is bg_connection = MESH. bg_connection = TORUS, utilising the 3-D torus network, is the preferred topology for our jobs; for this, bg_size (see above) must be >= 512. See also comments and usage in the example LoadLeveler scripts for proper settings.
Please note that keywords must not be followed by comments in the same line!
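A minimal sketch of a LoadLeveler command file; the job name, sizes, e-mail address and the ug4 script are placeholders, and the mpirun call is only assumed to use the BG/P -exe/-args style (cf. ll_template.x for the exact form used with ug4):

    # @ job_name = ug4_test
    # @ job_type = bluegene
    # @ bg_size = 512
    # @ bg_connection = TORUS
    # @ wall_clock_limit = 00:30:00
    # @ notify_user = <your email address>
    # @ output = $(job_name).$(jobid).out
    # @ error = $(job_name).$(jobid).err
    # @ queue
    mpirun -np 2048 -mode VN -mapfile TXYZ -verbose 2 -exe ./ugshell -args "-ex <your_script>.lua"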
A nice introduction to LoadLeveler command file syntax is given e.g. here.
llq is used to display the status of jobs (of a specified user) in the queue or being executed:
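For example, restricted to your own jobs (<userid> is a placeholder):

    llq -u <userid>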
The estimated start time of a job can be determined by llq -s <job-id> (cf. https://docs.loni.org/wiki/Useful_LoadLeveler_Commands), where <job-id> was determined by the previous command, e.g. for the job jugene4b.304384.0:
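For instance, with the job id from above:

    llq -s jugene4b.304384.0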
llcancel is used to cancel a job (<jobname> as displayed by llq):
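A sketch:

    llcancel <jobname>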
Further low-level tools for analysing problems, e.g. /bgsys/drivers/ppcfloor/tools/coreprocessor and gdbserver, are also available (cf. the debugging section below).
Querying quota status: the CPU quota consumed by your jobs can be checked with the q_cpuquota command.
Useful options:
- -? prints usage information and all options.
- -j <jobstepid> for a single job.
- -t <time> for all jobs in the specified time span, e.g. q_cpuquota -t 23.11.2011 01.12.2011.
- -d <number> for the last <number> days (positive integer).
Although it is possible on JuGene to create shared libraries and run dynamically linked executables, this is in general not recommended, since loading of shared libraries can delay the startup of such an application considerably, especially when using large partitions (8 racks or more). See also Shared Libraries and Dynamic Executables.
So, for very large jobs be sure to have ug4 built as a completely static executable (cf. Configuration of ug4 for JuGene), since otherwise loading of the shared libraries consumes too much wall time!
Very large jobs (e.g. jobs larger than 32 racks) normally run on Tuesday only.
Exceptions to this rule are possible in urgent cases (please contact the SC Support under sc@fz-juelich.de).
On JuGene the parallel debuggers TotalView and DDT are available.
Be sure to compile ug4 as a debug build.
DDT uses X11 for its graphical user interface. Be sure to log in with X window forwarding enabled. This could mean using the -X or -Y option to ssh.
Basic usage of DDT:
Start DDT by typing ddt. Enter your ug4 parameters (don't forget to enclose them in quotation marks), the number of processes to run, a (hopefully) appropriate wall time to do all your debugging work (after "Queue Submission Parameters"), and the "MPIRun parameters" (e.g. -mode VN -mapfile TXYZ -verbose 2) in the fields provided after clicking "Advanced >>" etc., then click the "Submit" button.
Wait until the job is launched (you might have to exercise some patience); DDT will catch the focus automatically when resources are available. The rest should be quite self-explanatory.
One can also immediately specify the application to debug and its parameters by typing ddt [<your app> <parameters>].
Basic usage of TotalView: add the -tv flag to the llrun call with which you would normally run your executable in interactive mode (see above).
Example debug sessions:
Start a DDT debugging session (immediately specifying the executable to debug):
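A hedged sketch; the executable name ugshell and the script are placeholders:

    ddt ./ugshell "-ex <your_script>.lua"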
Specifying the executable together with its parameters should also work: the parameters are written into the appropriate field, but are not recognised ...
The mpirun parameters, e.g. -mapfile TXYZ -verbose 2, can be placed in the fields accessible after clicking on the "Advanced" button.
Start a TotalView debugging session:
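A hedged sketch; process count, wall time, executable and script are placeholders:

    llrun -tv -np 4 -mode VN -w 01:00:00 ./ugshell -ex <your_script>.lua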
Use the llrun option -w <hh:mm:ss> to specify a (hopefully) appropriate wall time to do all your debugging work.
For additional information (especially for debugging with TotalView) see http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUGENE/UserInfo/ParallelDebugging.html.