StarPU Handbook
Similar to other runtimes, StarPU introduces some overhead in managing tasks. This overhead, while not always negligible, is mitigated by its intelligent scheduling and data management capabilities. The typical order of magnitude for this overhead is a few microseconds, which is notably smaller than the inherent CUDA overhead. To ensure that this overhead remains insignificant, the work assigned to a task should be substantial enough.
Tasks should ideally be long enough to amortize this overhead. It is advised to consider the offline performance feedback, which provides insights into task lengths. Monitoring task lengths becomes crucial if you are encountering suboptimal performance.
To gauge the scalability potential based on task size, you can run the tests/microbenchs/tasks_size_overhead.sh script. It provides a visual representation of the speedup achievable with independent tasks of very small sizes.
This benchmark is installed in $STARPU_PATH/lib/starpu/examples/. It gives a glimpse into how long a task should be (in µs) for StarPU overhead to be low enough to keep efficiency. The script generates a plot illustrating the speedup trends for tasks of different sizes, correlated with the number of CPUs in use.
For example, in the figure below, for 128 µs tasks (the red line), StarPU overhead is low enough to guarantee a good speedup if the number of CPUs is not more than 36. But with the same number of CPUs, 64 µs tasks (the black line) cannot have a correct speedup. The number of CPUs must be decreased to about 17 in order to keep efficiency.
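[Figure: speedup as a function of the number of CPUs for independent tasks of different sizes, as plotted by tasks_size_overhead.sh]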
To determine the task size your application is using, you can use starpu_fxt_data_trace, as explained in Data trace and tasks length.
The selection of a scheduler in StarPU also plays a significant role. Different schedulers have varying impacts on the overall execution. For example, the dmda scheduler may require additional time to make decisions, while the eager scheduler tends to be more immediate in its decisions.
To assess the impact of scheduler choice on your target machine, you can once again utilize the tasks_size_overhead.sh script. This script provides valuable insights into how different schedulers affect performance in conjunction with task sizes.
To enable StarPU to perform online optimizations effectively, it is recommended to submit tasks asynchronously whenever possible. The goal is to maximize the level of asynchronous submission, allowing StarPU to have more flexibility in optimizing the scheduling process. Ideally, all tasks should be submitted asynchronously, and the use of functions like starpu_task_wait_for_all() or starpu_data_unregister() should be limited to waiting for task completion.
StarPU will then be able to rework the whole schedule, overlap computation with communication, manage accelerator local memory usage, etc. A simple example is in the file examples/basic_examples/variable.c.
StarPU's default behavior considers tasks in the order they are submitted by the application. However, in scenarios where the application programmer possesses knowledge about certain tasks that should take priority due to their impact on performance (such as tasks whose output is crucial for subsequent tasks), the starpu_task::priority field can be utilized to convey this information to StarPU's scheduling process.
An example is provided in the application examples/heat/dw_factolu_tag.c.
The maximum number of data handles that a task can manage is fixed by the macro STARPU_NMAXBUFS. This macro has a default value which can be customized through the configure option --enable-maxbuffers.
However, if you have specific cases where you need tasks to manage more data than the maximum allowed, you can use the field starpu_task::dyn_handles when defining a task, along with the field starpu_codelet::dyn_modes when defining the corresponding codelet.
This dynamic handle mechanism enables tasks to handle additional data beyond the usual limit imposed by STARPU_NMAXBUFS.
The whole code for this example is available in the file examples/basic_examples/dynamic_handles.c.
Normally, the number of data handles given to a task is set with starpu_codelet::nbuffers. This field can however be set to STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers must be set, and starpu_task::modes (or starpu_task::dyn_modes, see Setting Many Data Handles For a Task) should be used to specify the modes for the handles. Examples in examples/basic_examples/dynamic_handles.c show how to implement it.
StarPU provides the wrapper function starpu_task_insert() to ease the creation and submission of tasks.
Here is the implementation of a codelet:
And the call to starpu_task_insert():
The call to starpu_task_insert() is equivalent to the following code:
In the example file tests/main/insert_task_value.c, we use these two ways to create and submit tasks.
Instead of calling starpu_codelet_pack_args(), one can also call starpu_codelet_pack_arg_init(), then starpu_codelet_pack_arg() for each argument, then starpu_codelet_pack_arg_fini() as follows:
A full code example is in file tests/main/pack.c.
Here is a similar call using STARPU_DATA_ARRAY.
If some part of the task insertion depends on the value of some computation, the macro STARPU_DATA_ACQUIRE_CB can be very convenient. For instance, assuming that the index variable i was registered as handle A_handle[i]:
The macro STARPU_DATA_ACQUIRE_CB submits an asynchronous request for acquiring data i for the main application, and will execute the code given as the third parameter when it is acquired. In other words, as soon as the value of i computed by the codelet which_index can be read, the portion of code passed as the third parameter of STARPU_DATA_ACQUIRE_CB will be executed, and is allowed to read from i to use it e.g. as an index. Note that this macro is only available when compiling StarPU with the compiler gcc. In the example file tests/datawizard/acquire_cb_insert.c, this macro is used.
StarPU also provides a utility function starpu_codelet_unpack_args() to retrieve the STARPU_VALUE arguments passed to the task. There are several ways of calling starpu_codelet_unpack_args(). The full code examples are available in the file tests/main/insert_task_value.c.
Instead of calling starpu_codelet_unpack_args(), one can also call starpu_codelet_unpack_arg_init(), then starpu_codelet_unpack_arg() or starpu_codelet_dup_arg() or starpu_codelet_pick_arg() for each argument, then starpu_codelet_unpack_arg_fini() as follows:
During unpacking, one can also call starpu_codelet_unpack_discard_arg() to skip an argument without saving it.
A full code example is in file tests/main/pack.c.
Here is a list of other functions to help with task management.
StarPU provides several functions to help insert data into a task. The function starpu_task_insert_data_make_room() is used to allocate memory space for a data structure that is required for inserting data into a task. This function is called before inserting any data handles into a task, and ensures that enough memory is available for the data to be stored. Once memory is allocated, the data handle can be inserted into the task using the following functions