JuQueen — the 24-rack Blue Gene/Q (BG/Q) system at the Jülich Supercomputing Centre (JSC) of FZ Jülich — is the successor of JuGene, the 72-rack BG/P system at FZJ (288 Ki cores; 144 TByte RAM; see page JuGene for some documentation). Its installation began in April 2012 with 8 racks.
JuQueen currently provides 393,216 cores (384 Ki) and 375 TByte RAM in total (as of late October 2012), organized in 24 racks. It is intended to extend JuQueen to 28 racks (448 Ki cores; source: c't 15/2012, p. 58).
According to the output of the llbgstatus command and the sketch provided by llview (see below) it looks like all 28 racks were installed in the last upgrade step, but only 24 racks are active at the moment (in llview the midplanes of those last four racks are labeled "do", probably meaning "down", while the active ones are labeled "ru", i.e. "running").
Each rack contains 2^5 = 32 node cards, each consisting of 2^5 = 32 compute cards (or compute nodes, CNs).
Each CN has a 16-core IBM PowerPC A2 running at 1.6 GHz (plus one additional core executing the operating system, plus one spare core), with 16 GByte of SDRAM-DDR3 RAM in total, i.e. 1 GByte per core.
So one rack provides 2^5 × 2^5 = 2^10 = 1 Ki CNs, or 16 Ki (16,384) cores. Half a rack (8,192 cores) is called a midplane.
According to the announcement of Blue Gene/Q at JSC it should be necessary for efficiency to use a hybrid parallelization strategy (e.g. MPI/OpenMP or MPI/Pthreads), but our results so far indicate that pure MPI parallelization works fine.
One of the four networks is a 5-D torus: each CN is connected to its six direct neighbors, plus "loops" in all three space dimensions, plus further connections in the "fourth" and "fifth" dimension.
For some explanation see the description given here (p. 3).
See more on the architecture of JuQueen in this short description or, in more detail, in these system overview slides.
You can use ugsubmit
to run your jobs on JUQUEEN. Make sure to source ug4/trunk/scripts/shell/ugbash and to export the variables
e.g. in ~/.bashrc
. (of course you have to replace 'your@' with your real email adress...). emai l.com
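A minimal sketch of such a ~/.bashrc snippet (the variable names UGSUBMIT_TYPE and UGSUBMIT_EMAIL are assumptions here; check the ugsubmit documentation for the names it actually evaluates):

    # source ug4's shell helpers and configure ugsubmit (variable names assumed)
    source $HOME/ug4/trunk/scripts/shell/ugbash
    export UGSUBMIT_TYPE=Juqueen
    export UGSUBMIT_EMAIL=your@email.com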
Not possible at the moment!
Read ugsubmit - Job Scheduling on Clusters for further instructions on unified ugsubmit usage.
ugsubmit (see ugsubmit - Job Scheduling on Clusters) does the job of submitting jobs for you. It is strongly recommended to use it! The following section is thus only of concern to people who want to submit jobs manually.
For some basic information see the Quick Introduction to job handling and the references therein. Here we only provide some important details (everything else should become clear by examining the examples provided):
Batch jobs are submitted with the llsubmit command. The command is typically executed in the directory where the ug4 executable resides ("ug4-bin"):
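In its simplest form the call looks like this (where <jobscript> is the LoadLeveler script described below):

    llsubmit <jobscript>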
llsubmit invokes the IBM Tivoli Workload Scheduler LoadLeveler (TWS LoadLeveler), which is used as the batch system on Blue Gene/Q.
The script <jobscript> — also called "LoadLeveler script" — is a (plain Unix) shell script file and contains the actual job definitions, specified by "LoadLeveler keywords" (some prominent examples are explained below), and a call of the runjob command (also introduced below). It can also contain standard Unix commands.
If llsubmit was able to submit the job, it outputs a job name (or job-id; e.g. juqueen1c1.40051.0) with which a specific job can be identified in further commands, e.g. to cancel it (see below).
The output of the run (messages from frontend and backend MPI, and the output of ug4) is written to a file in the directory where llsubmit was executed, whose name begins with the job name you have specified in the LoadLeveler script and ends with <job-id>.out (see below how to specify the job name).
Some details concerning the definition of batch jobs:
Example LoadLeveler scripts used with ug4 can be found in ug4's subdirectory scripts/shell/:
ll_scale_gmg_bgq.x contains job definitions for a scalability study from 4 to 64 Ki PE (2-D Laplace, solved with geometric multigrid). (All runjob commands therein are commented out — to perform a specific run remove its comment sign.)
Hint: It might be helpful for understanding the following details to open an example LoadLeveler script in your favorite text editor!
If you want to use one of these scripts, copy it to your "ug4-bin" directory. Please change at least the value of the notify_user keyword (see below) before submitting a job. Furthermore it might be good to check (and possibly change) the values of some other keywords (especially bg_size; search for the string "TO CHECK")!
See also this batch job documentation, and especially the Job File Samples for more details.
LoadLeveler keywords are strings embedded in comments beginning with the characters "# @".
These keywords inform the LoadLeveler about the resources required for the job to run, the job environment, the name of output files, notification of the user about the job result etc.
A few important keywords:
job_type = bluegene specifies that the job is running on JuQueen's CNs.
job_name specifies the name of the job, which will become part of the name of the output file (see above).
bg_size specifies the size of the BG/Q partition reserved for the job, in number of compute nodes.
That is, for <NP> MPI processes, bg_size must be ≥ <NP>/16 (since one CN has 16 cores).
From the section "Blue Gene specific keywords" in this batch job documentation:
Blue Gene/Q only allows blocks of 32, 64, 128, 256 and multiples of 512 compute nodes.
Thus e.g. a bg_size of 1 specifies a block of size 32 and a bg_size of 129 requests a partition of size 256.
The actual number of MPI processes is specified with the runjob command described below. Since bg_size is relevant for the charging of computing time, it is wise to keep its value as small as possible. Alternatively the size of a job can be defined by the bg_shape keyword (see the same source for details).
See comments in the example LoadLeveler scripts for proper settings.
bg_connectivity specifies the "connection type", i.e. the network topology used by a job.
The connection type can be one of [TORUS | MESH | EITHER | <Xa Xb Xc Xd>].
The specification for Xy is either Torus or Mesh, specified for dimension y. Default is MESH.
bg_connectivity = TORUS — utilising the 5-D torus network — is the preferred topology for our jobs. See also comments and usage in the example LoadLeveler scripts for proper settings.
Note: On JuQueen the connection type is specified by the keyword bg_connectivity instead of bg_connection as it was on JuGene!
A nice introduction to LoadLeveler command file syntax is given e.g. at https://docs.loni.org/wiki/LoadLeveler_Command_File_Syntax (please keep in mind that, to the author's knowledge, this documentation is not specifically about Blue Gene machines).
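Putting the keywords together, a minimal LoadLeveler script might look like the following sketch (job name, sizes, wall clock limit and the ugshell arguments are placeholders; wall_clock_limit, output, error and queue are standard LoadLeveler keywords not discussed above; cf. the example scripts in scripts/shell/ for authoritative versions):

    #!/bin/bash
    # @ job_name = laplace_gmg
    # @ job_type = bluegene
    # @ bg_size = 32
    # @ bg_connectivity = TORUS
    # @ wall_clock_limit = 00:30:00
    # @ notify_user = your@email.com
    # @ output = $(job_name).$(jobid).out
    # @ error = $(job_name).$(jobid).err
    # @ queue
    # 512 processes on 32 CNs, one rank per core (arguments are placeholders)
    runjob --np 512 --ranks-per-node 16 : ugshell -ex laplace.lua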
Jobs are actually started by the runjob command (this replaces the mpirun command used on JuGene).
Example:
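A sketch of such a call (the colon syntax is explained below; the ugshell arguments are placeholders):

    runjob --np 64 --ranks-per-node 16 : ugshell -ex laplace.lua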
This starts a job with 64 MPI processes running on 4 CNs, where on each CN one process per core is executed.
Some explanations of the runjob parameters used above:
The runjob parameter --np <n> specifies the number <n> of MPI processes of the entire job.
The runjob parameter --ranks-per-node <r> specifies the number of ranks per CN: <r> in {1, 2, 4, 8, 16, 32, 64} (default: <r> = 1).
Background: each of the 16 cores of a CN is four-way hardware threaded, so up to 64 processes/threads per CN are possible.
Please note that in this case each process/thread has only 16 GByte/64 = 256 MByte of memory available.
There are three "execution modes":
The runjob parameter --exe <executable> specifies the executable. The runjob parameter --args specifies the arguments of the executable (and not of the runjob command!).
NOTE: There is a different syntax, used in the example above, where the executable and its arguments are separated from the runjob parameters by a colon (":"), and where no --exe and no --args parameter has to be specified. This seems to be the preferred way!
The runjob parameter --mapping specifies the order in which the MPI processes are mapped to the CNs / the cores of the BG/Q partition reserved for the run.
This order can either be specified by a permutation of the letters A, B, C, D, E and T, or by the name of a mapfile in which the distribution of the tasks is specified: --mapping {<mapping>|<mapfile>}:
<mapping> is a permutation of A, B, C, D, E and T, where A, B, C, D, E are the torus coordinates of the CNs in the partition and T is the number of the core within each node (T = 0, 1, ..., 15). The standard mapping on JuQueen is to place the tasks in "ABCDET" order — T increments first, then E (which can become at most 2 on BG/Q) etc.
<mapfile> is the name of a mapfile in which the distribution of the tasks is specified:
it contains a line of A B C D E T coordinates for each MPI process.
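For illustration, a hypothetical mapfile placing four MPI ranks on cores T = 0, ..., 3 of the first node (torus coordinates 0 0 0 0 0) would contain one line per rank, in rank order:

    0 0 0 0 0 0
    0 0 0 0 0 1
    0 0 0 0 0 2
    0 0 0 0 0 3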
The runjob parameter --verbose <level> specifies the "verbosity level" of the job output. The lowest level is 0 or "OFF" (or "O"), the highest level is 7 or "ALL" (or "A").
Level 3 or "WARN" (or "W") seems to be a good compromise between briefness and excessively detailed system information.
Please note that a verbosity level higher than 3 tends to mix up system output and output produced by ug4 itself, which looks really messy (but might sometimes give helpful debug information).
To display the status of jobs in the queue you can use the llq command. This prints the job names ("Id's"), the owners of the jobs, submission times etc., and whether and where a job is running.
To display the status of the jobs of a specific user use:
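    llq -u <userid>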
This displays all jobs in the queue submitted by the user with id <userid>. Example (only one job in the queue, not running):
To get an estimate of the expected start time of a job use llq -s <job-id> (cf. the intro already cited above), where <job-id> was determined by the previous command, e.g. for the job juqueen1c1.40051.0:
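    llq -s juqueen1c1.40051.0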
To cancel a job you have to use the llcancel command. The job to be cancelled has to be specified by its <job-id> (as displayed by llq):
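    llcancel <job-id>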
Graphical monitoring tool llview:
llview has a graphical X11 based user interface and provides a schematic sketch of the machine. It displays the system status, running jobs, scheduling, and a prediction of the start time of jobs (the latter can also be obtained with the llq command, see above).
See also the JuGene documentation on llview.
The status of the whole machine is displayed by the llbgstatus command.
See man llbgstatus for more information.
Status information about your account: querying the quota status with the q_cpuquota command.
Useful options:
-? prints usage information and all options.
-j <jobstepid> shows the data for a single job.
-t <time> shows the data for all jobs in the specified time interval, e.g. q_cpuquota -t 23.11.2011 01.12.2011.
-d <number> shows the data for the last <number> days (positive integer).
Very large jobs (e.g. jobs larger than 32 racks) normally run on Tuesdays only.
Exceptions to this rule are possible in urgent cases (please contact the SC Support at sc@fz-juelich.de).
See this JSC documentation, section "Other filesystems", for some details.
As usual you need to compile ug4 in "debug mode" (i.e. configured with cmake -DDEBUG=ON ..; cf. Run CMake) so that the necessary debugging information is available.
Since November 2012 a parallel debugger, TotalView, is finally available on JuQueen.
Basic usage of TotalView:
TotalView uses X11 for its graphical user interface. Be sure to log in with X window forwarding enabled. This could mean using the -X or -Y option of ssh (on all ssh connections if you aren't connected directly; cf. the section on SSH hopping).
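For example (assuming juqueen.fz-juelich.de is the login node):

    ssh -X <userid>@juqueen.fz-juelich.de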
Load the appropriate module:
(UNITE — "UNIform Integrated Tool Environment" for debugging and performance analysis; see http://apps.fz-juelich.de/unite/index.php/Main_Page.)
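For example (the exact module names are an assumption here; check module avail for the ones actually installed):

    module load UNITE totalview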
Use the lltv command to start TotalView (cf. the output of module help totalview):
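Based on the description below, the generic form might look like this sketch (see the output of module help totalview for the authoritative syntax):

    lltv -n <numNodes> : -default_parallel_attach_subset=<rank-range> \
        runjob -a --np <numProcs> --exe <executable>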
In the TotalView windows appearing as soon as the job has started, specify/control your parameters, then run your job, as illustrated in the following example.
This will start the executable <executable> to be debugged on <numNodes> nodes utilising <numProcs> processes, attaching TotalView to the ranks <rank-range> (a space separated list of ranks, see documentation).
Example session:
Start the debugger:
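A concrete call might look like this sketch (the ugshell arguments are placeholders):

    # debug 16 ranks on one CN, attaching TotalView to all of them
    lltv -n 1 : -default_parallel_attach_subset=0-15 \
        runjob -a --np 16 --ranks-per-node 16 : ugshell -ex laplace.lua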
I.e., to get the corresponding interactive debug job one can (according to the author's experience so far) just take the runjob line of a batch job definition (as written in a LoadLeveler script; see above), place it after the colon of the lltv command, and add the option -a to the runjob command!
The system answers with something like
Additionally TotalView displays three windows (and a few temporary windows) after the job has started:
Depending on system load / network traffic you will experience some waiting time.
In the "startup-parameter" window enter at least your ug4 arguments (or control your arguments given on the command line) in the "Arguments" tag.
According to the message in the "Parallel" tag do not change anything here. When finished with your input click "OK".
In the TotalView GUI press the "GO" button first.
Another (dialog) window appears with the question "Process runjob is a parallel job. Do you want to stop the job now?", to which one has to answer "Yes" (which is not really intuitive ...).
Another small window appears showing a progress bar while the job is starting ("Loading symbols from shell ...").
In the TotalView GUI press the "GO" button again to run your executable. Now you can begin your debugging work.
Additional hints:
As for batch jobs, it is possible to shorten a command line like the one in the example above by using an environment variable, e.g.:
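A sketch (the variable name UGARGS and the ugshell arguments are placeholders here):

    export UGARGS="-ex laplace.lua"
    lltv -n 1 : -default_parallel_attach_subset=0-15 runjob -a --np 16 --ranks-per-node 16 : ugshell $UGARGS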
The parameter -default_parallel_attach_subset=0-15 can be omitted if one wants to attach to all MPI processes.
TotalView stores its settings in ~/.totalview.
Additional information for debugging with TotalView on JuQueen might be found in this FZJ short introduction to parallel debugging.
For detailed information see ${TV_ROOT}/doc/pdf/ — this is something like /usr/local/UNITE/packages/totalview/toolworks/totalview.8.11.0-0/doc/pdf/ (December 2012).

Basic usage of DDT:
DDT uses X11 for its graphical user interface. Be sure to log in with X window forwarding enabled. This could mean using the -X or -Y option of ssh.
Start DDT with the ddt command. Enter your ug4 parameters (don't forget to enclose them in quotation marks), the number of processes to run, a (hopefully) appropriate wall time for all your debugging work (after "Queue Submission Parameters"), and the "MPIRun parameters" (e.g. -mode VN -mapfile TXYZ -verbose 2) in the fields provided after clicking "Advanced>>" etc., then click the "Submit" button.
Wait until the job is launched (you might need to exercise some patience); DDT will catch the focus automatically when resources are available.
The rest should be quite self-explanatory.
One can also immediately specify the application to debug and its parameters by typing ddt [<your app> <parameters>].
Example DDT session (directly specifying the executable to debug on the command line); specifying the executable and its parameters this way should also work:
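For instance (a sketch; the ugshell arguments are placeholders):

    ddt ugshell -ex laplace.lua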
The mpirun parameters, e.g. -mapfile TXYZ -verbose 2, can be placed in the fields accessible after clicking on the "Advanced" button.
If a job crashes, normally one or more corefiles are created. Corefiles are plain text files (stored in the "Parallel Tools Consortium Lightweight Corefile Format") and named like core.0, core.1, core.2, ...
You can analyze those corefiles so that you hopefully are able to find the cause of the crash. The appropriate way to accomplish this task on JuQueen is to use a tool called coreprocessor.pl. This is a Perl script which provides an X11 based GUI.
Usage:
Go into your "ug4-bin" directory, then start coreprocessor.pl:
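    coreprocessor.pl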
From the drop down list "Select Grouping Mode" select e.g. "Stack Traceback (detailed)".
As a result the call stack (or "(back) trace"), which describes the call order of all active methods/functions, is listed in the main window. Each line contains the name and parameters of a called method. When clicking/marking a line (a frame) of this trace, the filename and the line number of the call are shown at the bottom of the window.
More convenient is to specify the location of the corefiles and the executable directly on the command line (after loading, proceed with step 3 from above):
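For example, assuming the executable is ugshell and the corefiles reside in the current working directory:

    coreprocessor.pl -c=. -b=ugshell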
-b=ugshell specifies the executable ("binary"), i.e. ugshell, and -c=. means: load all corefiles in the working directory.
For more usage information execute
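(the help option name below is an assumption)

    coreprocessor.pl -help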
Maybe you can find some additional information in this ALCF documentation.