Enable Profiling in ug4

You enable profiling in ug4 by setting the CMake flag -DPROFILER=<Profiler>. E.g., using a shell open in your subdirectory build_debug (as explained here or here), and enter

cmake -DPROFILER=<Profiler> ..
make

The currently supported options for <Profiler> are: None, Shiny, Scalasca, Vampir, ScoreP

Note that CMake "remembers" the flags you set. You can see the flags set in the output (for not changing any flags just enter cmake ..).

To enable profiling of the PCL (Parallel Communication Layer), set -DPROFILE_PCL=ON.

Check the output of cmake .. for changes and additional flags. Keep in mind that you have to do a make when changing Cmake flags.

Note: The Shiny profiler is shipped with the ug4 distribution and can always be used. However, for the Scalasca, Vampir and ScoreP profiling additional software must be installed (refer to the provided links below). One some machines this software is already installed, e.g. on cekon include "/home/profiler/bin" to your PATH in order to obtain scalasca and scorep. VampirTrace is provided with Open MPI 1.3 and higher.; All profilings rely on the user-defined profiling marks PROFILE_* that must be inserted into the code manually. Choosing a compiler those marks are defined for the requested profiler API. The external profilers Scalasca, Vampir and ScoreP also provide a automatic instrumentation facility, that inserts profiling marks in every function automatically. However, using C++ this is error-prone and not usable with ug4, since all inline functions are profiled resulting in far to many profiling marks, hugh runtime overhead and even not starting binaries due to buffer size restrictions.

Shiny Profiling

Using the Shiny Profiler which is included with ug4 can be enabled setting the CMake flag -DPROFILER=Shiny.

For getting profiling output at the end of the execution of your script, you can add

SetOutputProfileStats(true)

at the beginning of your script. With this, profiler statistics are output at the end of the execution of ugshell, there you will see the total call tree (from main on) and a flat list with timings.

For a more visual output you can add

WriteProfileData("pd.pdxml")

At the end of your script. For viewing that file, download the ShinyProfileViewer.

Scalasca Profiling

The Scalasca profiler is enabled by setting the CMake flag -DPROFILER=Scalasca. This will specify the user-defined PROFILE profiling marks (see below) to write profiling output using the EPIK-API provided by Scalasca.

In addition the scalsca compiler-wrapper must be used when building the code. Make sure you have the scalasca binary in your PATH and run cmake with

CC="scalasca -instrument -comp=none -user mpicc" CXX="scalasca -instrument -comp=none -user mpicxx" cmake -DPROFILER=Scalasca ...

Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Your application is the analysed using

scalasca -analyse mpirun ugshell -ex ...

See the profiling result via

scalasca -examine <epik-folder>

VampirTrace Profiling

The VampirTrace profiler is enabled by setting the CMake flag -DPROFILER=Vampir. This will specify the user-defined PROFILE profiling marks (see below) to write profiling output using the VampirTrace-API.

In addition the vampirtrace compiler-wrapper must be used when building the code. Make sure you have the vtcc, vtcxx binary in your PATH and run cmake with

CC="vtcc" CXX="vtcxx" cmake -DPROFILER=Vampir ...

Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Run your application normally, trace output will be produced automatically.

ScoreP Profiling

The ScoreP profiler is enabled by setting the CMake flag -DPROFILER=ScoreP. This will specify the user-defined PROFILE profiling marks (see below) to write profiling output using the SCOREP-API.

In addition the scorep compiler-wrapper must be used when building the code. Make sure you have the scorep binary in your PATH and run cmake with

CC="scorep --user --nocompiler mpicc" CXX="scorep --user --nocompiler mpicxx" cmake -DPROFILER=ScoreP ...

Note, that the compiler is only set at the very first cmake run and not changed on successive reruns. In order to change the compiler wrapper you must remove your build completely and run cmake from scratch. Run your application normally, trace output will be produced automatically.

Note: Currently the ScoreP support for ug4 is experimental and may not be successful

Profiling Registry

You can enable profiling all calls made through the ug4 Registry / Bridge by calling CMake with

cmake -DPROFILE_BRIDGE=ON ..

Profiling your LUA Script

To enable profiling of your LUA Script, simply add a

ProfileLUA(true)

at the beginning of your script. You can then print profiling information like in Enable Profiling in ug4 or More Profiling Control in the Script. However, you can also use

PrintLUA()

at the end of your code. This will list you your LUA script files together with line numbers and time spend in these lines.

Note

For the moment, we only profile the top-level calls, not all subroutine calls. This has performance reasons. Expect a performance drawback when calling small functions very often. You can also disable the LUA profiler for these calls:

ProfileLUA(false)
AssembleLinearOperatorRhsAndSolution(linOp, u, b)
ProfileLUA(true)

Profiling your C/C++ code

If you want to measure the time spent in your functions or in sections of your code, you can use our profiling functions:

#include "common/profiler/profiler.h"

profiler.h

PROFILE_FUNC_GROUP() is for profiling functions: Put PROFILE_FUNC_GROUP() at the beginning of the function. Here name is the unique name of the profile node, and group is a group where you want to put that node in. For example, you can put all sections of code which do file access in the group "io":
void WriteAMGResultsToFile(const char* filename, std::vector<double> &results)

{

PROFILE_FUNC_GROUP("io");

// do stuff

}

PROFILE_FUNC_GROUP
#define PROFILE_FUNC_GROUP(groups)
Definition profiler.h:258

Note
When you put PROFILE_FUNC_GROUP(group) not at the beginning of the scope you or others might get unexpected behaviour.

You can also put profile nodes in more than one group at once by seperating groups with a space: "io amg debug".
PROFILE_BEGIN_GROUP(name, group) is for profiling sections of code. Profiling is done from the line where PROFILE_BEGIN_GROUP(name, group) is until PROFILE_END() or until the end of the scope where PROFILE_BEGIN_GROUP(name, group) was in.
void MyFunction(size_t n)

{

// first do some stuff

PROFILE_BEGIN_GROUP(MyFunction_CalculationA, "calculation")

// do calculation A

PROFILE_END()

PROFILE_BEGIN_GROUP(MyFunction_CalculationB, "calculation")

// do calculation B

PROFILE_END()

PROFILE_BEGIN_GROUP(MyFunction_io, "io")

// do rest

// MyFunction_io is evaluated until the end

}

PROFILE_END
#define PROFILE_END()
Definition profiler.h:256

PROFILE_BEGIN_GROUP
#define PROFILE_BEGIN_GROUP(name, groups)
Definition profiler.h:255

Here the profile node MyFunction_CalculationA measures only the part in do calculation A, MyFunction_CalculationB measures only the part in do calculation B, and MyFunction_io measures only the part in do rest.

Note
PROFILE_FUNC_GROUP() is basically only a PROFILE_BEGIN_GROUP with name = function name.
PROFILE_BEGIN(name) and PROFILE_FUNC() is the same only that group is set to NULL (no group).
When a profile node is auto-ended by the scope mechanism, we use destructors to accomplish this. However, it is possible that some destructors are called AFTER the destructor of the profile node. The reason for this is that destructors are called in reverse order of constructors: First constructed, last destructed (see here). So the times spend in some destructors might not be added to this profile node. Examples for this are objects which are declared before PROFILE_BEGIN or arguments with destructors which are passed as value:
void MyFunction(std::vector<int> vec)

{

PROFILE_FUNC_GROUP("myGroup") // does not measure destructor of vec.

std::vector<int> vec2;

PROFILE_BEGIN_GROUP(MyFunctionSectionA, "myGroup") // does not measure destructor of vec and vec2.

std::vector<int> vec3;

// implicit destructor of vec3

// implicit PROFILE_END(MyFunctionSectionA)

// implicit destructor of vec2

// implicit PROFILE_END(MyFunction)

// implicit destructor of vec

}

std
Definition smart_pointer.h:814
You can also nest Profiling:
void MyFunction(size_t n)

{

PROFILE_FUNC_GROUP("myGroup")

for(size_t i=0; i<5; i++)

{

PROFILE_BEGIN_GROUP(MyFunction_InLoop, "myGroup");

// do some stuff

}

// do some other stuff (not measure by MyFunction_InLoop anymore)

}

for
for(yylen=0;yystr[yylen];yylen++) continue
Be aware that, no matter which compiler you are using, PROFILE_BEGIN has a small overhead, so dont use it in functions which do only very little.

C++ Example

Example Code to be profiled:
void FunctionA(double &b)

{

PROFILE_FUNC();

for(size_t i=0; i<1000; i++)

b += sin(b);

}

void FunctionA_bad(double &b)

{

PROFILE_FUNC();

for(size_t i=0; i<1000; i++)

{

// this section is too small!

PROFILE_BEGIN(FunctionA_bad_in_loop)

b += sin(b);

}

}

void FunctionB(double &b)

{

PROFILE_FUNC();

for(size_t i=0; i<1000; i++)

{

FunctionA(b);

FunctionA_bad(b);

}

}

void FunctionC(double &b)

{

PROFILE_FUNC();

FunctionB(b);

FunctionA(b);

FunctionB(b);

}

void MyFunction()

{

PROFILE_FUNC();

double d=0.1;

FunctionC(d);

cout << d << endl;

}

PROFILE_BEGIN
#define PROFILE_BEGIN(name)
Definition profiler.h:254

PROFILE_FUNC
#define PROFILE_FUNC()
Definition profiler.h:257
Now we write a LUA Script:
MyFunction()

-- check if profiler is available

if GetProfilerAvailable() == true then

-- get node

pn = GetProfileNode("MyFunction")

-- check if node is valid

if pn:is_valid() then

print("Called MyFunction "..pn:get_avg_entry_count().." times.")

print("Spend "..pn:get_avg_self_time_ms().." ms for MyFunction alone (without child nodes).")

print("Spend "..pn:get_avg_total_time_ms().." ms in MyFunction altogether.\n")

print("MyFunction call tree")

print(pn:call_tree())

print("MyFunction call list self time sorted")

print(pn:child_self_time_sorted())

print("MyFunction call list total time sorted")

print(pn:total_time_sorted())

print("MyFunction call list entry count sorted")

print(pn:entry_count_sorted())

else

print("MyFunction is not known to the profiler.")

end

else

print("Profiler not available.")

end

print
function table print(data, style)

Results:

Called MyFunction 1 times.
Spend 0.1044 ms for MyFunction alone (without child nodes).
Spend 725.6421 ms in MyFunction altogether.

MyFunction call tree
call tree                                            hits       self time      total time 
MyFunction                                             1  104.4 us    0% 725.64 ms  100% 
FunctionC                                              1    900 ns    0% 725.54 ms   99%
  FunctionB                                            2  579.6 us    0% 725.49 ms   99% 
  FunctionA                                         2000 98.969 ms   13% 98.969 ms   13%
  FunctionA_bad                                     2000 243.79 ms   33% 625.94 ms   86%
    FunctionA_bad_in_loop                          2e+06 382.15 ms   52% 382.15 ms   52%
  FunctionA                                            1   49.5 us    0%   49.5 us    0% 

MyFunction call list self time sorted
self time sorted                                     hits       self time      total time 
FunctionC                                              1    900 ns    0% 725.54 ms   99% 
FunctionA                                              1   49.5 us    0%   49.5 us    0% 
MyFunction                                             1  104.4 us    0% 725.64 ms  100% 
FunctionB                                              2  579.6 us    0% 725.49 ms   99% 
FunctionA                                           2000 98.969 ms   13% 98.969 ms   13% 
FunctionA_bad                                       2000 243.79 ms   33% 625.94 ms   86% 
FunctionA_bad_in_loop                              2e+06 382.15 ms   52% 382.15 ms   52%

MyFunction call list total time sorted
total time sorted                                    hits       self time      total time 
FunctionA                                              1   49.5 us    0%   49.5 us    0% 
FunctionA                                           2000 98.969 ms   13% 98.969 ms   13% 
FunctionA_bad_in_loop                              2e+06 382.15 ms   52% 382.15 ms   52%
FunctionA_bad                                       2000 243.79 ms   33% 625.94 ms   86% 
FunctionB                                              2  579.6 us    0% 725.49 ms   99% 
FunctionC                                              1    900 ns    0% 725.54 ms   99% 
MyFunction                                             1  104.4 us    0% 725.64 ms  100% 

MyFunction call list entry count sorted
entry count sorted                                   hits       self time      total time 
MyFunction                                             1  104.4 us    0% 725.64 ms  100% 
FunctionC                                              1    900 ns    0% 725.54 ms   99% 
FunctionA                                              1   49.5 us    0%   49.5 us    0% 
FunctionB                                              2  579.6 us    0% 725.49 ms   99% 
FunctionA                                           2000 98.969 ms   13% 98.969 ms   13% 
FunctionA_bad                                       2000 243.79 ms   33% 625.94 ms   86% 
FunctionA_bad_in_loop                              2e+06 382.15 ms   52% 382.15 ms   52%

Here we see that FunctionA_bad takes six times as long as FunctionA, even though they both perform the same task. This is because the small overhead of PROFILE_BEGIN accumulated to a measureable time when it was executed two million times.

Explanations:
- The self time of a method is the time spent in this method, without the time spent in all methods called from the first one — if they also use the profiler. To make it clear: If a called method doesn't use the profiler the time spent for it accounts for the self time of the calling method!
Utilities for the Analysis of Profiling Results (automatic comparison of several simulation runs etc.) see Scalability Tests .

More Profiling Control in the Script

Printing the whole call tree at the end of the execution can be pretty much information, and that is why there is also a direct way of accessing timing information for a specific Profiler Node (Profiler Nodes are all nodes which got created with PROFILE_BEGIN or PROFILE_FUNC, see Profiling your C/C++ code).

Warning: If a part of code never got executed, the profiler nodes in there do not exist.

UGProfileNode *GetProfileNode(const char *str);

Example usage is

print(GetProfileNode("main"):call_tree())

The UGProfilerNode has the following functions:

class UGProfilerNode
{
  double get_avg_entry_count() const;                         // number of entries in this profiler node
  double get_avg_self_time_ms() const;                        // time in milliseconds spend in this node excluding subnodes
  double get_avg_total_time_ms() const;                       // time in milliseconds spend in this node including subnodes
  string call_tree(double dMarginal=0.0) const;               // returns string with call tree
  string child_self_time_sorted(double dMarginal=0.0) const;  // string with sorted childs (sorted by self time)
  string total_time_sorted(double dMarginal=0.0) const;       // string with sorted childs (sorted by total time)
  string entry_count_sorted(double dMarginal=0.0) const;      // string with sorted childs (sorted by entry count)
  string groups() const;                    // lists time spent in groups
  bool is_valid() const;                                      // true if this is a valid node (i.e. != NULL)
}

Attention: If dMarginal != 0.0, only nodes which account for more than dMarginal*100 % of the total time are part of the output. So if you want to print the call tree, and want to display only nodes which account for more than 3% of the total time, you can write
GetProfileNode("main"):call_tree(0.03)

For nodes other than "main", you should check that the profile node could be found. It's also a good idea to check if the profiler is available - otherwise every GetProfileNode(...):is_valid() = false. For this you use GetProfilerAvailable() == true.

Here's an example to view detailed information about ALS_ApplyLinearSolver without having to look at the whole call tree of main.

if GetProfilerAvailable() == true then
  pn = GetProfileNode("ALS_ApplyLinearSolver")
  if pn:is_valid() then   -- check if node is valid
 
    print("Called MyFunction "..pn:get_avg_entry_count().." times.")
    print("Spend "..pn:get_avg_self_time_ms().." ms for ALS_ApplyLinearSolver alone (without child nodes).")
    print("Spend "..pn:get_avg_total_time_ms().." ms in ALS_ApplyLinearSolver altogether.\n")
    print("ALS_ApplyLinearSolver call tree")
    print(pn:call_tree())
    print("ALS_ApplyLinearSolver call list self time sorted")
    print(pn:child_self_time_sorted())
    print("ALS_ApplyLinearSolver call list total time sorted")
    print(pn:total_time_sorted())
    print("ALS_ApplyLinearSolver call list entry count sorted")
    print(pn:entry_count_sorted())
  else
    print("ALS_ApplyLinearSolver is not known to the profiler.")
  end
else
  print("Profiler not available.")
end

Note: You can also write the strings returned by e.g. call_tree to a file. Remember to add some if pcl:GetProcRank() == 0 then to prevent all processors from writing.

Another way of profiling is

print(GetProfileNodes("main"):groups())

This will list the time spent in different groups, for example mpirun -np 4 ./ugshell -ex conv_diff/laplace.lua -numRefs 8 will give you

algebra             3.90574 s  max: 3.96512 s  min: 3.90574 s  diff: 59.3775 ms (1.4975 %)
debug               21.6    us max: 21.6    us min: 14.4    us diff: 7.2     us (33.3333 %)
parallelization     8.0802  ms max: 8.1243  ms min: 8.0613  ms diff: 63      us (0.775451 %)

This will give you also parallel information, and you can check if some cores spend more time in some group than others (shown in diff).

Table of Contents