# Shared memory debugging

This `debug_test` directory contains scripts for running a set of short runs, intended to be used with the `--debug` flag to check for bugs (e.g. race conditions). The output is not checked - the intention is just to catch errors raised by the debugging checks.

The inputs only have 3 time-steps and very few grid points, because the debug checks are very slow. The actual output is not important, so it does not matter that the runs are badly under-resolved.

It may be necessary to pass the `--compiled-modules=no` flag to Julia for changes to the `--debug` setting to be picked up correctly. This setting means that all precompilation is redone each time Julia is started, which can be slow. An alternative workaround is to hard-code the `moment_kinetics.debugging._debug_level` variable in `debugging.jl` to the desired value.
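
For example, a minimal sketch of that workaround (the exact form of the assignment in `debugging.jl` may differ from this; the value corresponds to the `--debug` levels described below):

```julia
# In debugging.jl, override whatever value would be set from the `--debug`
# option (sketch only - the real code that sets _debug_level may look different)
_debug_level = 2   # 2 or higher activates @debug_shared_array
```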

To run the debug tests, call (from the top-level `moment_kinetics` directory) something like

```
julia --project --check-bounds=yes --compiled-modules=no debug_test/runtests.jl --debug 99
```

## Collision operator and 'anyv' region

The collision operator uses a slightly hacky, special set of functions for shared-memory parallelism, which allow the outer loop over species and spatial dimensions to be parallelised while also parallelising inner loops over `vperp`, `vpa`, or both `vperp` and `vpa` - i.e. the type of inner-loop parallelism can change within the outer loop. This happens within an 'anyv' region, which is started with the `begin_s_r_z_anyv_region()` function. The debug checks within an 'anyv' region only check for correctness on the sub-block communicator that parallelises over velocity space, so errors due to incorrect species or spatial parallelism would not (or at least might not) be detected. Such errors should be unlikely, as the collision operator only writes to a single species at a single spatial point.
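
Schematically, the pattern looks something like the sketch below. Only `begin_s_r_z_anyv_region()` is taken from the text above; the inner `begin_anyv_*_region()` helper names, loop macros, and arrays are assumptions for illustration, not a definitive implementation.

```julia
# Schematic sketch of changing inner-loop parallelism inside an 'anyv' region
begin_s_r_z_anyv_region()
@loop_s_r_z is ir iz begin
    # Parallelise the inner work over both vperp and vpa... (helper name assumed)
    begin_anyv_vperp_vpa_region()
    @loop_vperp_vpa ivperp ivpa begin
        result[ivpa, ivperp] = collision_term(ivpa, ivperp, iz, ir, is)
    end

    # ...then switch to parallelising over vpa only, still inside the same
    # outer species/spatial iteration (helper name assumed)
    begin_anyv_vpa_region()
    @loop_vpa ivpa begin
        correction[ivpa] = sum(@view result[ivpa, :])
    end
end
```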

## Finding race conditions

The code is parallelized using MPI with shared-memory arrays. 'Race conditions' can occur if a shared array is accessed incorrectly. All the processes sharing an array can be synchronized, ensuring they pass through the following code block with a consistent state, by using the `_block_synchronize()` function (which calls `MPI.Barrier()` to synchronize the processes). A race condition occurs if, between consecutive calls to `_block_synchronize()`, any shared array is:
- written by two or more processes at the same position
- written by one process at a certain position, and read by one or more other processes at the same position.

If a race condition occurs, it can result in errors in the results. These are sometimes small, but often show up as inconsistent results between runs (because the results erroneously depend on the execution order of the different processes). Race conditions are undefined behaviour, though, and so can also cause anything up to segfaults.
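
As a hypothetical sketch of the second pattern (a write on one process overlapping a read on another), with the region helper, loop macro, arrays and functions invented for illustration:

```julia
# Hypothetical sketch - region/loop helpers, arrays and functions are illustrative only
begin_z_region()
@loop_z iz begin
    shared_array[iz] = source_term(iz)   # each process writes only its own range of iz
end

# Below, every process reads the whole of shared_array, including positions
# written by other processes. Without this synchronization between the write
# and the read, that would be a race condition.
_block_synchronize()

@loop_z iz begin
    result[iz] = shared_array[iz] - sum(shared_array) / length(shared_array)
end
```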

The provided debugging routines can help to pin down where either of these errors happens.

The `@debug_shared_array` macro (activated at `--debug 2` or higher) counts all reads and writes to shared arrays by each process, and checks at each `_block_synchronize()` call whether either pattern has occurred since the previous `_block_synchronize()`. If one has, and in addition `@debug_track_array_allocate_location` is active (`--debug 3` or higher), then the array for which the error occurred is identified by printing a stack trace of the location where it was allocated, and the stack trace of the exception shows the location of the `_block_synchronize()` call where the error occurred.
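
For example, to run the debug tests with both of these checks active, the same command as above can be used with a lower debug level:

```
julia --project --check-bounds=yes --compiled-modules=no debug_test/runtests.jl --debug 3
```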

`@debug_block_synchronize` (activated at `--debug 4`) checks that all processes called `_block_synchronize()` from the same place - i.e. the same line in the code - checked by comparing stack traces.

`@debug_detect_redundant_block_synchronize` (activated at `--debug 5`) aims to find any unnecessary calls to `_block_synchronize()`. These calls can be somewhat expensive (at least for large numbers of processes), so it is good to minimise their number. When this mode is active, at each `_block_synchronize()` a check is made whether there would be a race-condition error if the previous `_block_synchronize()` call were removed. If there would not be, then the previous call was unnecessary and could be removed. The tricky part is that whether it was necessary or not can depend on the options being used... Detecting redundant `_block_synchronize()` calls requires that all dimensions that could be split over processes are actually split over processes, which demands a large number of processes. The `@debug_detect_redundant_block_synchronize` flag, when activated, modifies the splitting algorithm to force every dimension to be split if possible, and to raise an error if this is not possible.

The suggested debugging strategy for race conditions is:

- Look at the loop types and ensure that there is an appropriate `begin_*_region()` call before each new loop type.
- Run `debug_test/runtests.jl` with `@debug_shared_array` activated, but not `@debug_detect_redundant_block_synchronize`. It will be faster to first run without `@debug_track_array_allocate_location` to find failing tests, then with `@debug_track_array_allocate_location` to help identify the cause of the failure. Usually a failure should indicate where there is a missing `begin_*_region()` call. There may be places, though, where synchronization is required even though the type of loop macros used does not change (for example, when `phi` is calculated, contributions from all ion species need to be summed, resulting in an unusual pattern of array accesses); in this case `_block_synchronize()` can be called directly.
  - The function `debug_check_shared_memory()` can be inserted between `begin_*_region()` calls when debugging, to narrow down the location where the incorrect array access occurred (see the sketch after this list). It is defined when `@debug_shared_array` is active, and can be imported with `using ..communication: debug_check_shared_memory`. The function runs the same error checks as are added by `@debug_shared_array` in `_block_synchronize()`.
  - The tests in `debug_test/` check for correctness by looping over the dimensions and forcing each to be split over separate processes in turn. This allows the correctness checks to be run using only 2 processes, which would not be possible if all dimensions had to be split at the same time.
- [This final level of checking only looks for minor optimizations rather than finding bugs, so it is much less important than the checks above.] Run `debug_test/debug_redundant_synchronization/runtests.jl` with `@debug_detect_redundant_block_synchronize` activated. This should show if any call to `_block_synchronize()` (including the ones inside `begin_*_region()` calls) was 'unnecessary' - i.e. there would be no incorrect array accesses if it was removed. This test needs to be run on a suitable combination of grid sizes and numbers of processes, so that all dimensions are split across multiple processes, to avoid false positives. Any redundant calls that appear in all tests can be deleted. Redundant calls that appear in only some tests (unless they are in some code block that is simply not reached in the other tests) should preferably be moved inside a suitable conditional block, if one exists, so that they are called only when necessary. If there is no conditional block that the call can be moved into, it may sometimes be necessary to just test one or more options before calling, e.g. `moments.evolve_upar && _block_synchronize()`.
  - The checks for redundant `_block_synchronize()` calls have been separated from the correctness checks so that the correctness checks can be run in the CI using only 2 processes, while the redundancy checks can be run manually on a machine with enough memory and CPU cores.
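
As referenced above, a hypothetical sketch of narrowing down an incorrect access with `debug_check_shared_memory()` (the loop macro and array names are invented for illustration):

```julia
using ..communication: debug_check_shared_memory

begin_s_r_z_region()
@loop_s_r_z is ir iz begin
    first_shared_array[iz, ir, is] = 0.0
end
# If the incorrect access happened in the loop above, the same checks that
# @debug_shared_array adds to _block_synchronize() raise an error here...
debug_check_shared_memory()

@loop_s_r_z is ir iz begin
    second_shared_array[iz, ir, is] = 1.0
end
# ...otherwise the problem is in the second loop.
debug_check_shared_memory()
```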

You can find out which loop type is currently active by looking at `loop_ranges[].parallel_dims`. This variable is a `Tuple` containing a `Symbol` for each dimension that is currently being parallelized.
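
For example, a quick way to check this while debugging (the output shown in the comment is illustrative):

```julia
# Prints something like (:s, :r, :z), depending on the currently active region type
println("Currently parallelized dimensions: ", loop_ranges[].parallel_dims)
```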