Scalability Tests

General Practical Information about Scalability Tests


  • First of all, you have to enable profiling by configuring your build with CMake (cf. Enable Profiling in ug4):
    cmake -DPROFILER=ON [-DPROFILE_PCL=ON] ..
  • Then, to get "clean" timing measurements, you should do a release build, i.e.

    cmake -DDEBUG=OFF ..

    (Obviously, the above configuration options can be combined into a single cmake command.)
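    For example, such a combined call, using only the flags shown above (include -DPROFILE_PCL=ON only if you want PCL profiling), might look like this:

    cmake -DPROFILER=ON -DPROFILE_PCL=ON -DDEBUG=OFF ..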

  • Furthermore, since output generation can be very time consuming for jobs with a very large number of MPI processes, it is highly recommended to turn off any output generation, i.e. to disable debug writers and any other print/write operations that save large data to files (vectors and matrices, refined grids, partition maps, solutions, ...).
  • Timing measurements are only meaningful at points where the processes are synchronised, e.g. after computing global norms, or after calls to pcl::Synchronize(), pcl::AllProcsTrue(), ...

  • For weak scaling studies, e.g. of GMG: check whether the number of iterations stays constant over all problem sizes.


Specific Information about Scalability Tests

Hierarchical Redistribution

The approach of (re)distributing the grid to all MPI processes involved in a simulation run in a hierarchical fashion turned out to be essential for good performance of large jobs (running on >= 1024 PE) on JuGene (see Working with ug4 on JuGene).

  • Example (applicable in a LoadLeveler script on JuGene):
    mpirun -np 1024 -exe ./ugshell -mode VN -mapfile TXYZ -args "<other ug4 args> -numPreRefs 3 -numRefs 10 -hRedistFirstLevel 5 -hRedistStepSize 100 -hRedistNewProcsPerStep 16"
    Or (a smaller, not very reasonable one) on cekon:
    salloc -n 64 mpirun ./ugshell -ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 1 -hRedistFirstLevel 4 -hRedistStepSize 2 -hRedistNewProcsPerStep 4 -numRefs 8
  • Parameters that control hierarchical redistribution:
    • -numPreRefs (as usual): level where the grid is distributed the first time.
    • -numRefs (as usual): toplevel of the grid hierarchy.
    • -hRedistFirstLevel (default -1): first level at which the grid is redistributed; the default -1 means that hierarchical redistribution is deactivated.
    • -hRedistStepSize (default 1): specifies after how many further refinements the grid will be redistributed again.
    • -hRedistNewProcsPerStep (default: 2^dim): number of MPI processes ("target procs") to which each process that has already received its part of the grid redistributes it in a redistribution step.
    • -hRedistMaxSteps (default: 1000000000; not used in the example above): Limits the number of redistribution steps (to avoid useless redistributions involving only a few processes at the "end of the hierarchy").
    The following inequality must apply: numPreRefs < hRedistFirstLevel < numRefs.
  • Sketch of the algorithm:

    1. At first the (pristine) grid (as defined by the grid file) will be refined numPreRefs times => toplevel of the grid hierarchy is now level numPreRefs.
    2. At this level the grid will be distributed to k MPI processes (k will be explained below).
    3. Now the grid will be further refined until level hRedistFirstLevel is reached.
    4. There the grid will be redistributed: each process which has already received a part of the grid redistributes it to another hRedistNewProcsPerStep MPI processes.
    5. Then the grid will be refined at most hRedistStepSize more times:
      If neither the numRefs refinement steps nor the number of redistribution steps allowed by -hRedistMaxSteps have been reached yet, go to 4 (otherwise: finished).

    Now all MPI processes of the simulation run have their part of the grid. To make things clear:

    • After numPreRefs refinement steps the grid will be distributed, and
    • after the grid has been distributed to some processes it will be redistributed at level(s) hRedistFirstLevel + i * hRedistStepSize
      (numPreRefs + hRedistFirstLevel + i * hRedistStepSize < numRefs; 0 <= i < hRedistMaxSteps),
      until at the end all MPI processes have received a portion of the grid.

    So, all parameters whose names contain "Redist" refer to the redistribution of an already distributed grid.
    The number of processes k used in the first distribution step is determined by the (total) number of MPI processes, numProcs, on the one hand, and by the redistribution parameters on the other, working "from top" (i.e. the topmost redistribution level) "to bottom" (the first distribution step):
    numProcs / hRedistNewProcsPerStep is the number of target procs to which the grid is distributed in the second to last redistribution step,
    numProcs / hRedistNewProcsPerStep / hRedistNewProcsPerStep is the number of target procs in the third to last redistribution step (or in the first distribution step, if only one redistribution step is performed), etc.
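    The following small stand-alone LUA sketch recomputes these numbers for the JuGene example from above (numProcs 1024, numPreRefs 3, numRefs 10, hRedistFirstLevel 5, hRedistStepSize 100, hRedistNewProcsPerStep 16). It is only an illustration of the arithmetic described here: the function printRedistSketch is hypothetical, it assumes hRedistFirstLevel denotes an absolute grid level, and the actual implementation lives in domain_distribution_util.lua:

      -- hypothetical helper: prints the (re)distribution levels and process
      -- counts resulting from the hierarchical redistribution parameters
      function printRedistSketch(numProcs, numPreRefs, numRefs,
                                 hRedistFirstLevel, hRedistStepSize,
                                 hRedistNewProcsPerStep, hRedistMaxSteps)
        hRedistMaxSteps = hRedistMaxSteps or 1000000000
        -- collect the levels on which a redistribution takes place
        local redistLevels = {}
        local lvl, steps = hRedistFirstLevel, 0
        while lvl < numRefs and steps < hRedistMaxSteps do
          table.insert(redistLevels, lvl)
          lvl = lvl + hRedistStepSize
          steps = steps + 1
        end
        -- "from top to bottom": divide numProcs by hRedistNewProcsPerStep
        -- once per redistribution step to obtain k for the first distribution
        local k = numProcs
        for i = 1, #redistLevels do
          k = math.floor(k / hRedistNewProcsPerStep)
        end
        print("first distribution on level " .. numPreRefs .. " to " .. k .. " procs")
        local procs = k
        for _, l in ipairs(redistLevels) do
          procs = procs * hRedistNewProcsPerStep
          print("redistribution on level " .. l .. " -> " .. procs .. " procs")
        end
      end

      -- JuGene example from above; prints
      --   first distribution on level 3 to 64 procs
      --   redistribution on level 5 -> 1024 procs
      printRedistSketch(1024, 3, 10, 5, 100, 16)

    For an actual dry run of a parameter combination, use parameter_test.lua as described in the next item.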

  • A test of the parameters for hierarchical redistribution can be performed with the LUA script parameter_test.lua, e.g.
    ./ugshell -ex ../apps/scaling_tests/parameter_test.lua -numPreRefs 3 -numRefs 10 -hRedistFirstLevel 5 -hRedistStepSize 100 -hRedistNewProcsPerStep 16 -numProcs 1024
    Or the dry run for the smaller job on cekon above:
    ./ugshell -ex ../apps/scaling_tests/parameter_test.lua -numPreRefs 1 -hRedistFirstLevel 4 -hRedistStepSize 2 -hRedistNewProcsPerStep 4 -numRefs 8 -numProcs 64
    That is, since this is a serial run, one only has to specify the number of MPI processes in addition to the distribution parameters.
  • Please note that this is only a "dry run": the script basically executes the same algorithm (ddu.PrintSteps()) as the one that actually carries out the (re)distribution (ddu.RefineAndDistributeDomain(); cf. domain_distribution_util.lua).

Please note that hierarchical redistribution is not compatible with the "grid distribution type" (distributionType) "grid2d" (see Mapping of MPI processes). At the moment (March 2012) grid distribution type "metis" is also unsupported.

See also e.g. ll_scale_gmg.x (in scripts/shell/) for usage examples (specific to JuGene, but also instructive in general).

Mapping of MPI processes

"Topology aware mapping" of MPI processes to nodes / cores with respect to the network topology of the parallel machine on which a parallel job is run might be important one day.

  • Parameter -distType: Available values: "grid2d", "bisect", "metis"
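    For example, the distribution type could be selected like this (an illustrative variant of the small cekon job from the section above, here without hierarchical redistribution):
    salloc -n 64 mpirun ./ugshell -ex ../apps/scaling_tests/modular_scalability_test.lua -numPreRefs 1 -numRefs 8 -distType bisect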

Utilities for Scalability Tests

  • There exist some LUA scripts specifically tuned for timing measurements (paths relative to ug4's main directory):
    • For the Laplace problem: apps/scaling_tests/modular_scalability_test.lua.
    • For the Elder problem: apps/d3f/elder_scalability_test.lua.
  • For a (quite convenient) analysis of the profiling results of several simulation runs there also exists a special analyzer script: scripts/tools/scaling_analyzer.lua.
    1. To get log files of your simulation runs, use the ugshell parameter -logtofile. This is of course not necessary if logfiles are created automatically by the resource manager (as is the case e.g. on JuGene; see General Information about JuGene).
    2. In the list variable inFiles of the analyzer script, enter the names of the logfiles of all runs whose profiling results should be analysed (edit a local copy of the script); see the sketch below.
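      A minimal sketch of such an entry (the log file names below are purely hypothetical; inFiles is the list variable already present in the script):
      inFiles = {
        "laplace_2d_gmg_pe64.log",   -- hypothetical logfile of a 64-PE run
        "laplace_2d_gmg_pe256.log",  -- hypothetical logfile of a 256-PE run
      }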
    3. Then execute (with the stand-alone LUA interpreter; see Installation of LUA (optional) for installation if necessary):
      lua my_scaling_analyzer.lua
      or, redirecting the output to a file :
      lua my_scaling_analyzer.lua > jugene_ug4-static_laplace-2d_gmg_weak-scaling_pe4-256k_rev4354.txt
      If ugshell is executable on the machine used for this analysis (which is not the case if you are working on a login node of e.g. JuGene), one can also execute ugshell -ex jugene/scaling_analyzer.lua (adapt the file paths in inFiles so that they are relative to ugshell).
  • See also the util.printStats(stats) functionality, e.g. in apps/scaling_tests/modular_scalability_test.lua.