ug4
|
For any questions, contact marti or n.ru pp@gc sc.u ni-fr ankf urt.d esebas . tian .reit er@g csc.u ni-f rankf urt. de
You enable profiling in ug4 by setting the CMake flag -DPROFILER=<Profiler>
. E.g., using a shell open in your subdirectory build_debug
(as explained here or here), and enter
cmake -DPROFILER=<Profiler> .. make
The currently supported options for <Profiler>
are: None, Shiny, Scalasca, Vampir, ScoreP
Note that CMake "remembers" the flags you set. You can see the flags set in the output (for not changing any flags just enter cmake ..
).
To enable profiling of the PCL (Parallel Communication Layer), set -DPROFILE_PCL=ON
.
Check the output of cmake ..
for changes and additional flags. Keep in mind that you have to do a make when changing Cmake flags.
PROFILE_*
that must be inserted into the code manually. Choosing a compiler those marks are defined for the requested profiler API. The external profilers Scalasca, Vampir and ScoreP also provide a automatic instrumentation facility, that inserts profiling marks in every function automatically. However, using C++ this is error-prone and not usable with ug4, since all inline functions are profiled resulting in far to many profiling marks, hugh runtime overhead and even not starting binaries due to buffer size restrictions.Using the Shiny Profiler which is included with ug4 can be enabled setting the CMake flag -DPROFILER=Shiny
.
For getting profiling output at the end of the execution of your script, you can add
at the beginning of your script. With this, profiler statistics are output at the end of the execution of ugshell
, there you will see the total call tree (from main on) and a flat list with timings.
For a more visual output you can add
At the end of your script. For viewing that file, download the ShinyProfileViewer.
The Scalasca profiler is enabled by setting the CMake flag -DPROFILER=Scalasca
. This will specify the user-defined PROFILE
profiling marks (see below) to write profiling output using the EPIK-API provided by Scalasca.
In addition the scalsca compiler-wrapper must be used when building the code. Make sure you have the scalasca
binary in your PATH and run cmake with
CC="scalasca -instrument -comp=none -user mpicc" CXX="scalasca -instrument -comp=none -user mpicxx" cmake -DPROFILER=Scalasca ...
Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Your application is the analysed using
scalasca -analyse mpirun ugshell -ex ...
See the profiling result via
scalasca -examine <epik-folder>
The VampirTrace profiler is enabled by setting the CMake flag -DPROFILER=Vampir
. This will specify the user-defined PROFILE
profiling marks (see below) to write profiling output using the VampirTrace-API.
In addition the vampirtrace compiler-wrapper must be used when building the code. Make sure you have the vtcc, vtcxx
binary in your PATH and run cmake with
CC="vtcc" CXX="vtcxx" cmake -DPROFILER=Vampir ...
Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Run your application normally, trace output will be produced automatically.
The ScoreP profiler is enabled by setting the CMake flag -DPROFILER=ScoreP
. This will specify the user-defined PROFILE
profiling marks (see below) to write profiling output using the SCOREP-API.
In addition the scorep compiler-wrapper must be used when building the code. Make sure you have the scorep
binary in your PATH and run cmake with
CC="scorep --user --nocompiler mpicc" CXX="scorep --user --nocompiler mpicxx" cmake -DPROFILER=ScoreP ...
Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Run your application normally, trace output will be produced automatically.
You can enable profiling all calls made through the ug4 Registry / Bridge by calling CMake with
cmake -DPROFILE_BRIDGE=ON ..
To enable profiling of your LUA Script, simply add a
at the beginning of your script. You can then print profiling information like in Enable Profiling in ug4 or More Profiling Control in the Script. However, you can also use
at the end of your code. This will list you your LUA script files together with line numbers and time spend in these lines.
ProfileLUA(false) AssembleLinearOperatorRhsAndSolution(linOp, u, b) ProfileLUA(true)
If you want to measure the time spent in your functions or in sections of your code, you can use our profiling functions:
PROFILE_FUNC_GROUP()
is for profiling functions: Put PROFILE_FUNC_GROUP()
at the beginning of the function. Here name
is the unique name of the profile node, and group
is a group where you want to put that node in. For example, you can put all sections of code which do file access in the group "io": PROFILE_FUNC_GROUP(group)
not at the beginning of the scope you or others might get unexpected behaviour.PROFILE_BEGIN_GROUP(name, group)
is for profiling sections of code. Profiling is done from the line where PROFILE_BEGIN_GROUP(name, group)
is until PROFILE_END()
or until the end of the scope where PROFILE_BEGIN_GROUP(name, group)
was in.
Here the profile node MyFunction_CalculationA
measures only the part in do calculation A
, MyFunction_CalculationB
measures only the part in do calculation B
, and MyFunction_io
measures only the part in do rest
.
PROFILE_FUNC_GROUP()
is basically only a PROFILE_BEGIN_GROUP
with name = function name. PROFILE_BEGIN(name)
and PROFILE_FUNC()
is the same only that group is set to NULL (no group).
PROFILE_BEGIN
or arguments with destructors which are passed as value: PROFILE_BEGIN
has a small overhead, so dont use it in functions which do only very little. Called MyFunction 1 times. Spend 0.1044 ms for MyFunction alone (without child nodes). Spend 725.6421 ms in MyFunction altogether. MyFunction call tree call tree hits self time total time MyFunction 1 104.4 us 0% 725.64 ms 100% FunctionC 1 900 ns 0% 725.54 ms 99% FunctionB 2 579.6 us 0% 725.49 ms 99% FunctionA 2000 98.969 ms 13% 98.969 ms 13% FunctionA_bad 2000 243.79 ms 33% 625.94 ms 86% FunctionA_bad_in_loop 2e+06 382.15 ms 52% 382.15 ms 52% FunctionA 1 49.5 us 0% 49.5 us 0% MyFunction call list self time sorted self time sorted hits self time total time FunctionC 1 900 ns 0% 725.54 ms 99% FunctionA 1 49.5 us 0% 49.5 us 0% MyFunction 1 104.4 us 0% 725.64 ms 100% FunctionB 2 579.6 us 0% 725.49 ms 99% FunctionA 2000 98.969 ms 13% 98.969 ms 13% FunctionA_bad 2000 243.79 ms 33% 625.94 ms 86% FunctionA_bad_in_loop 2e+06 382.15 ms 52% 382.15 ms 52% MyFunction call list total time sorted total time sorted hits self time total time FunctionA 1 49.5 us 0% 49.5 us 0% FunctionA 2000 98.969 ms 13% 98.969 ms 13% FunctionA_bad_in_loop 2e+06 382.15 ms 52% 382.15 ms 52% FunctionA_bad 2000 243.79 ms 33% 625.94 ms 86% FunctionB 2 579.6 us 0% 725.49 ms 99% FunctionC 1 900 ns 0% 725.54 ms 99% MyFunction 1 104.4 us 0% 725.64 ms 100% MyFunction call list entry count sorted entry count sorted hits self time total time MyFunction 1 104.4 us 0% 725.64 ms 100% FunctionC 1 900 ns 0% 725.54 ms 99% FunctionA 1 49.5 us 0% 49.5 us 0% FunctionB 2 579.6 us 0% 725.49 ms 99% FunctionA 2000 98.969 ms 13% 98.969 ms 13% FunctionA_bad 2000 243.79 ms 33% 625.94 ms 86% FunctionA_bad_in_loop 2e+06 382.15 ms 52% 382.15 ms 52%Here we see that
FunctionA_bad
takes six times as long as FunctionA
, even though they both perform the same task. This is because the small overhead of PROFILE_BEGIN
accumulated to a measureable time when it was executed two million times. self time
of a method is the time spent in this method, without the time spent in all methods called from the first one — if they also use the profiler. To make it clear: If a called method doesn't use the profiler the time spent for it accounts for the self time
of the calling method! Printing the whole call tree at the end of the execution can be pretty much information, and that is why there is also a direct way of accessing timing information for a specific Profiler Node (Profiler Nodes are all nodes which got created with PROFILE_BEGIN
or PROFILE_FUNC
, see Profiling your C/C++ code).
Example usage is
The UGProfilerNode
has the following functions:
dMarginal != 0.0
, only nodes which account for more than dMarginal*100 % of the total time
are part of the output. So if you want to print the call tree, and want to display only nodes which account for more than 3% of the total time, you can write For nodes other than "main", you should check that the profile node could be found. It's also a good idea to check if the profiler is available - otherwise every GetProfileNode(...):is_valid() = false. For this you use GetProfilerAvailable() == true
.
Here's an example to view detailed information about ALS_ApplyLinearSolver without having to look at the whole call tree of main.
if pcl:GetProcRank() == 0 then
to prevent all processors from writing.Another way of profiling is
This will list the time spent in different groups, for example mpirun -np 4 ./ugshell -ex conv_diff/laplace.lua -numRefs 8
will give you
algebra 3.90574 s max: 3.96512 s min: 3.90574 s diff: 59.3775 ms (1.4975 %) debug 21.6 us max: 21.6 us min: 14.4 us diff: 7.2 us (33.3333 %) parallelization 8.0802 ms max: 8.1243 ms min: 8.0613 ms diff: 63 us (0.775451 %)
This will give you also parallel information, and you can check if some cores spend more time in some group than others (shown in diff).