JuQueen — the 24-rack Blue Gene/Q (BG/Q) system at the Jülich Supercomputing Centre (JSC) of FZ Jülich — is the successor of JuGene, the 72-rack BG/P system at FZJ (288 Ki cores; 144 TByte RAM; see page JuGene for some documentation). Its installation began in April 2012 with 8 racks.
JuQueen currently provides 393,216 cores (384 Ki) and 375 TByte RAM in total (as of late October 2012), organized in 24 racks. It is intended to extend JuQueen to 28 racks (448 Ki cores; source: c't 15/2012, p. 58).
According to the output of the llbgstatus command and the sketch provided by llview (see below) it looks like all 28 racks were installed in the last upgrade step, but only 24 racks are active at the moment (in llview the midplanes of those last four racks are labeled "do", probably meaning "down", while the active ones are labeled "ru", i.e. "running").
Each rack contains 2^5 = 32 node cards, each consisting of 2^5 = 32 compute cards (or compute nodes, CNs).
Each CN has a 16-core IBM PowerPC A2 running at 1.6 GHz (plus one additional core executing the operating system, plus one spare core), with 16 GByte of SDRAM-DDR3 RAM in total, i.e. 1 GByte per core.
So one rack provides 2^5 × 2^5 = 2^10 = 1 Ki CNs, or 16 Ki (16,384) cores. Half a rack (8,192 cores) is called a midplane.
According to the announcement of Blue Gene/Q at JSC it should be necessary for efficiency to use a hybrid parallelization strategy (e.g. MPI/OpenMP or MPI/Pthreads), but our results so far indicate that pure MPI parallelization works fine.
One of the four networks is a 5-D torus: each CN is connected to its six direct neighbors, plus "loops" in all three space dimensions, plus further connections in the "fourth" and "fifth" dimension.
For some explanation see the description given here (p. 3).
See more on the architecture of JuQueen in this short description or, in more detail, in these system overview slides.
You can use ugsubmit
to run your jobs on JUQUEEN. Make sure to source ug4/trunk/scripts/shell/ugbash and to export the variables
e.g. in ~/.bashrc
. (of course you have to replace 'your@' with your real email adress...). emai l.com
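A minimal sketch of such a ~/.bashrc snippet (the variable names UGSUBMIT_TYPE and UGSUBMIT_EMAIL are assumptions here; check the ugsubmit documentation for the names it actually evaluates):

    # source ug4's shell helpers and configure ugsubmit (variable names assumed)
    source $HOME/ug4/trunk/scripts/shell/ugbash
    export UGSUBMIT_TYPE=Juqueen
    export UGSUBMIT_EMAIL=your@email.com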
Not possible at the moment!
Read ugsubmit - Job Scheduling on Clusters for further instructions on unified ugsubmit usage.
ugsubmit (see ugsubmit - Job Scheduling on Clusters) does the job of submitting jobs for you. It is strongly recommended to use it! The following section is thus only of concern to people who want to submit jobs manually.
For some basic information see the Quick Introduction to job handling and the references therein. Here we only provide some important details (everything else should become clear by examining the examples provided):
Batch jobs are submitted with the llsubmit command. The command is typically executed in the directory where the ug4 executable resides ("ug4-bin"):
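In its simplest form the call looks like this (where <jobscript> is the LoadLeveler script described below):

    llsubmit <jobscript>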
llsubmit invokes the IBM Tivoli Workload Scheduler LoadLeveler (TWS LoadLeveler), which is used as the batch system on Blue Gene/Q.
The script <jobscript> — also called "LoadLeveler script" — is a (plain Unix) shell script file and contains the actual job definitions, specified by "LoadLeveler keywords" (some prominent examples are explained below), and a call of the runjob command (also introduced below). It can also contain standard Unix commands.
If llsubmit was able to submit the job, it outputs a job name (or job-id; e.g. juqueen1c1.40051.0) with which a specific job can be identified in further commands, e.g. to cancel it (see below).
The output of the run (messages from frontend and backend MPI, and the output of ug4) is written to a file in the directory where llsubmit was executed, whose name begins with the job name you have specified in the LoadLeveler script and ends with <job-id>.out (see below how to specify the job name).
Some details concerning the definition of batch jobs:
Example LoadLeveler scripts used with ug4 can be found in ug4's subdirectory scripts/shell/:
ll_scale_gmg_bgq.x contains job definitions for a scalability study from 4 to 64 Ki PE (2-D Laplace, solved with geometric multigrid). (All runjob commands therein are commented out — to perform a specific run remove its comment sign.)
Hint: It might be helpful for understanding the following details to open an example LoadLeveler script in your favorite text editor!
If you want to use one of these scripts, copy it to your "ug4-bin" directory. Please change at least the value of the notify_user keyword (see below) before submitting a job. Furthermore it might be good to check (and possibly change) the values of some other keywords (especially bg_size; search for the string "TO CHECK")!
See also this batch job documentation, and especially the Job File Samples for more details.
LoadLeveler keywords are strings embedded in comments beginning with the characters "# @".
These keywords inform the LoadLeveler about the resources required for the job to run, the job environment, the name of output files, notification of the user about the job result etc.
A few important keywords:
job_type = bluegene specifies that the job is running on JuQueen's CNs.
job_name specifies the name of the job, which will become part of the name of the output file (see above).
bg_size specifies the size of the BG/Q partition reserved for the job, in number of compute nodes.
That is, for <NP> MPI processes, bg_size must be ≥ <NP>/16 (since one CN has 16 cores).
From the section "Blue Gene specific keywords" in this batch job documentation:
Blue Gene/Q only allows blocks of 32, 64, 128, 256 and multiples of 512 compute nodes.
Thus e.g. a bg_size of 1 specifies a block of size 32 and a bg_size of 129 requests a partition of size 256.
The actual number of MPI processes is specified with the runjob command described below. Since bg_size is relevant for the charging of computing time, it is wise to keep its value as small as possible. Alternatively the size of a job can be defined by the bg_shape keyword (see the same source for details).
See comments in the example LoadLeveler scripts for proper settings.
bg_connectivity specifies the "connection type", i.e. the network topology used by a job.
The connection type can be one of [TORUS | MESH | EITHER | <Xa Xb Xc Xd>].
The specification for Xy is either Torus or Mesh, specified for dimension y. Default is MESH.
bg_connectivity = TORUS — utilising the 5-D torus network — is the preferred topology for our jobs. See also comments and usage in the example LoadLeveler scripts for proper settings.
Note: On JuQueen the connection type is specified by the keyword bg_connectivity instead of bg_connection as it was on JuGene!
A nice introduction to LoadLeveler command file syntax is given e.g. at https://docs.loni.org/wiki/LoadLeveler_Command_File_Syntax (please keep in mind that, to the author's knowledge, this documentation is not specifically about Blue Gene machines).
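Putting the keywords together, a minimal LoadLeveler script might look like the following sketch (job name, sizes, wall clock limit and the ugshell arguments are placeholders; wall_clock_limit, output, error and queue are standard LoadLeveler keywords not discussed above; cf. the example scripts in scripts/shell/ for authoritative versions):

    #!/bin/bash
    # @ job_name = laplace_gmg
    # @ job_type = bluegene
    # @ bg_size = 32
    # @ bg_connectivity = TORUS
    # @ wall_clock_limit = 00:30:00
    # @ notify_user = your@email.com
    # @ output = $(job_name).$(jobid).out
    # @ error = $(job_name).$(jobid).err
    # @ queue
    # 512 processes on 32 CNs, one rank per core (arguments are placeholders)
    runjob --np 512 --ranks-per-node 16 : ugshell -ex laplace.lua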
Jobs are actually started by the runjob command (this replaces the mpirun command used on JuGene).
Example:
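A sketch of such a call (the colon syntax is explained below; the ugshell arguments are placeholders):

    runjob --np 64 --ranks-per-node 16 : ugshell -ex laplace.lua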
This starts a job with 64 MPI processes running on 4 CNs, where on each CN one process per core is executed.
Some explanations of the runjob parameters used above:
The runjob parameter --np <n> specifies the number <n> of MPI processes of the entire job.
The runjob parameter --ranks-per-node <r> specifies the number of ranks per CN: <r> in {1, 2, 4, 8, 16, 32, 64} (default: <r> = 1).
Background: each of the 16 cores of a CN is four-way hardware threaded, so up to 64 processes/threads per CN are possible.
Please note that in this case each process/thread has only 16 GByte/64 = 256 MByte of memory available.
There are three "execution modes":
The runjob parameter --exe <executable> specifies the executable. The runjob parameter --args specifies the arguments of the executable (and not of the runjob command!).
NOTE: There is a different syntax, used in the example above, where the executable and its arguments are separated from the runjob parameters by a colon (":"), and where no --exe and no --args parameter has to be specified. This seems to be the preferred way!
The runjob parameter --mapping specifies the order in which the MPI processes are mapped to the CNs / the cores of the BG/Q partition reserved for the run.
This order can either be specified by a permutation of the letters A, B, C, D, E and T, or by the name of a mapfile in which the distribution of the tasks is specified: --mapping {<mapping>|<mapfile>}:
<mapping> is a permutation of A, B, C, D, E and T, where A, B, C, D, E are the torus coordinates of the CNs in the partition and T is the number of the core within each node (T = 0, 1, ..., 15). The standard mapping on JuQueen is to place the tasks in "ABCDET" order — T increments first, then E (which can become at most 2 on BG/Q) etc.
<mapfile> is the name of a mapfile in which the distribution of the tasks is specified:
it contains a line of A B C D E T coordinates for each MPI process.
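For illustration, a hypothetical mapfile placing four MPI ranks on cores T = 0, ..., 3 of the first node (torus coordinates 0 0 0 0 0) would contain one line per rank, in rank order:

    0 0 0 0 0 0
    0 0 0 0 0 1
    0 0 0 0 0 2
    0 0 0 0 0 3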
The runjob parameter --verbose <level> specifies the "verbosity level" of the job output. The lowest level is 0 or "OFF" (or "O"), the highest level is 7 or "ALL" (or "A").
Level 3 or "WARN" (or "W") seems to be a good compromise between briefness and excessively detailed system information.
Please note that a verbosity level higher than 3 tends to mix up system output and output produced by ug4 itself, which looks really messy (but might sometimes give helpful debug information).
To display the status of jobs in the queue you can use the llq command. This prints the job names ("Id's"), the owners of the jobs, submission times etc., and whether and where a job is running.
To display the status of the jobs of a specific user use:
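    llq -u <userid>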
This displays all jobs in the queue submitted by the user with id <userid>. Example (only one job in the queue, not running):
To get an estimate of the expected start time of a job use llq -s <job-id> (cf. the intro already cited above), where <job-id> was determined by the previous command, e.g. for the job juqueen1c1.40051.0:
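    llq -s juqueen1c1.40051.0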
To cancel a job you have to use the llcancel command. The job to be cancelled has to be specified by its <job-id> (as displayed by llq):
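    llcancel <job-id>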
Graphical monitoring tool llview:
llview has a graphical X11 based user interface and provides a schematic sketch of the machine. It displays the system status, running jobs, scheduling, and a prediction of the start time of jobs (the latter can also be obtained with the llq command, see above).
See also the JuGene documentation on llview.
The status of the whole machine is displayed by the llbgstatus command.
See man llbgstatus for more information.
Status information about your account: querying the quota status with the q_cpuquota command.
Useful options:
-? prints usage information and all options.
-j <jobstepid> shows the data for a single job.
-t <time> shows the data for all jobs in the specified time interval, e.g. q_cpuquota -t 23.11.2011 01.12.2011.
-d <number> shows the data for the last <number> days (positive integer).
Very large jobs (e.g. jobs larger than 32 racks) normally run on Tuesdays only.
Exceptions to this rule are possible in urgent cases (please contact the SC Support at sc@fz-juelich.de).
See this JSC documentation, section "Other filesystems", for some details.
As usual you need to compile ug4 in "debug mode" (i.e. configured with cmake -DDEBUG=ON ..; cf. Run CMake) so that the necessary debugging information is available.
Since November 2012 a parallel debugger, TotalView, is finally available on JuQueen.
Basic usage of TotalView:
TotalView uses X11 for its graphical user interface. Be sure to log in with X window forwarding enabled. This could mean using the -X or -Y option of ssh (on all ssh connections if you aren't connected directly; cf. the section on SSH hopping).
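For example (assuming juqueen.fz-juelich.de is the login node):

    ssh -X <userid>@juqueen.fz-juelich.de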
Load the appropriate module:
(UNITE — "UNIform Integrated Tool Environment" for debugging and performance analysis; see http://apps.fz-juelich.de/unite/index.php/Main_Page.)
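For example (the exact module names are an assumption here; check module avail for the ones actually installed):

    module load UNITE totalview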
Use the lltv command to start TotalView (cf. the output of module help totalview):
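Based on the description below, the generic form might look like this sketch (see the output of module help totalview for the authoritative syntax):

    lltv -n <numNodes> : -default_parallel_attach_subset=<rank-range> \
        runjob -a --np <numProcs> --exe <executable>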
In the TotalView windows appearing as soon as the job has started, specify/control your parameters, then run your job, as illustrated in the following example.
This will start the executable <executable> to be debugged on <numNodes> nodes utilising <numProcs> processes, attaching TotalView to the ranks <rank-range> (a space separated list of ranks, see documentation).
Example session:
Start the debugger:
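A concrete call might look like this sketch (the ugshell arguments are placeholders):

    # debug 16 ranks on one CN, attaching TotalView to all of them
    lltv -n 1 : -default_parallel_attach_subset=0-15 \
        runjob -a --np 16 --ranks-per-node 16 : ugshell -ex laplace.lua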
I.e., to get the corresponding interactive debug job one can (according to the author's experience so far) just take the runjob line of a batch job definition (as written in a LoadLeveler script; see above), place it after the colon of the lltv command, and add the option -a to the runjob command!
The system answers with something like
Additionally TotalView displays three windows (and a few temporary windows) after the job has started:
Depending on system load / network traffic you will experience some waiting time.
In the "startup-parameter" window enter at least your ug4 arguments (or control your arguments given on the command line) in the "Arguments" tag.
According to the message in the "Parallel" tag do not change anything here. When finished with your input click "OK".
In the TotalView GUI press the "GO" button first.
Another (dialog) window appears with the question "Process runjob is a parallel job. Do you want to stop the job now?", to which one has to answer "Yes" (which is not really intuitive ...).
Another small window appears showing a progress bar while the job is starting ("Loading symbols from shell ...").
In the TotalView GUI press the "GO" button again to run your executable. Now you can begin your debugging work.
Additional hints:
As for batch jobs, it is possible to shorten a command line like the one in the example above by using an environment variable, e.g.:
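A sketch (the variable name UGARGS and the ugshell arguments are placeholders here):

    export UGARGS="-ex laplace.lua"
    lltv -n 1 : -default_parallel_attach_subset=0-15 runjob -a --np 16 --ranks-per-node 16 : ugshell $UGARGS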
The parameter -default_parallel_attach_subset=0-15 can be omitted if one wants to attach to all MPI processes.
TotalView stores its settings in ~/.totalview.
Additional information for debugging with TotalView on JuQueen might be found in this FZJ short introduction to parallel debugging.
For detailed information see ${TV_ROOT}/doc/pdf/ — this is something like /usr/local/UNITE/packages/totalview/toolworks/totalview.8.11.0-0/doc/pdf/ (December 2012).

Basic usage of DDT:
DDT uses X11 for its graphical user interface. Be sure to log in with X window forwarding enabled. This could mean using the -X or -Y option of ssh.
Start DDT with the ddt command. Enter your ug4 parameters (don't forget to enclose them in quotation marks), the number of processes to run, a (hopefully) appropriate wall time for all your debugging work (after "Queue Submission Parameters"), and the "MPIRun parameters" (e.g. -mode VN -mapfile TXYZ -verbose 2) in the fields provided after clicking "Advanced>>" etc., then click the "Submit" button.
Wait until the job is launched (you might need to exercise some patience); DDT will catch the focus automatically when resources are available.
The rest should be quite self-explanatory.
One can also immediately specify the application to debug and its parameters by typing ddt [<your app> <parameters>].
Example DDT session (directly specifying the executable to debug on the command line); specifying the executable and its parameters this way should also work:
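For instance (a sketch; the ugshell arguments are placeholders):

    ddt ugshell -ex laplace.lua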
The mpirun parameters, e.g. -mapfile TXYZ -verbose 2, can be placed in the fields accessible after clicking on the "Advanced" button.
If a job crashes, normally one or more corefiles are created. Corefiles are plain text files (stored in the "Parallel Tools Consortium Lightweight Corefile Format") and named like core.0, core.1, core.2, ...
You can analyze those corefiles so that you hopefully are able to find the cause of the crash. The appropriate way to accomplish this task on JuQueen is to use a tool called coreprocessor.pl. This is a Perl script which provides an X11 based GUI.
Usage:
Go into your "ug4-bin" directory, then start coreprocessor.pl:
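    coreprocessor.pl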
From the drop down list "Select Grouping Mode" select e.g. "Stack Traceback (detailed)".
As a result the call stack (or "(back) trace"), which describes the call order of all active methods/functions, is listed in the main window. Each line contains the name and parameters of a called method. When clicking/marking a line (a frame) of this trace, the filename and the line number of the call are shown at the bottom of the window.
More convenient is to specify the location of the corefiles and the executable directly on the command line (after loading, proceed with step 3 from above):
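For example, assuming the executable is ugshell and the corefiles reside in the current working directory:

    coreprocessor.pl -c=. -b=ugshell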
-b=ugshell specifies the executable ("binary"), i.e. ugshell, and -c=. means: load all corefiles in the working directory.
For more usage information execute
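(the help option name below is an assumption)

    coreprocessor.pl -help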
Maybe you can find some additional information in this ALCF documentation.