StarPU Handbook - StarPU Introduction
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License”.
The use of specialized hardware, such as accelerators or coprocessors, offers an interesting approach to overcoming the physical limits encountered by processor architects. As a result, many machines are now equipped with one or several accelerators (e.g. a GPU), in addition to the usual processor(s). While significant efforts have been devoted to offloading computation onto such accelerators, very little attention has been paid to portability concerns on the one hand, and to the possibility of having heterogeneous accelerators and processors interact on the other hand.
StarPU is a runtime system that provides support for heterogeneous multicore architectures. It not only offers a unified view of the computational resources (i.e. CPUs and accelerators simultaneously) but also takes care of efficiently mapping and executing tasks onto a heterogeneous machine while transparently handling low-level issues such as data transfers in a portable manner.
StarPU is a software tool designed to enable programmers to harness the computational capabilities of both CPUs and GPUs, all while sparing them the need to meticulously adapt their programs for specific target machines and processing units.
At the heart of StarPU lies its runtime support library, which takes charge of scheduling tasks supplied by applications on heterogeneous CPU/GPU systems. Furthermore, StarPU provides programming language support through an OpenCL front-end (SOCLOpenclExtensions).
StarPU's runtime mechanism and programming language extensions are built around a task-based programming model. In this model, applications submit computational tasks, with CPU and/or GPU implementations. StarPU schedules these tasks and manages the associated data transfers across available CPUs and GPUs. The data that a task operates on are automatically exchanged between accelerators and the main memory, thereby sparing programmers the intricacies of scheduling and the technical details tied to these transfers.
StarPU is adept at scheduling tasks efficiently using established algorithms from the literature (TaskSchedulingPolicy). In addition, it provides the flexibility for scheduling experts, such as compiler or computational library developers, to implement custom scheduling policies in a manner that is easily portable (HowToDefineANewSchedulingPolicy).
The remainder of this section describes the main concepts used in StarPU.
A video, lasting 26 minutes, accessible on the StarPU website (https://starpu.gitlabpages.inria.fr/) presents these concepts.
Additionally, a series of tutorials can be found at https://starpu.gitlabpages.inria.fr/tutorials/
One of the tutorials is available within a Docker image at https://starpu.gitlabpages.inria.fr/tutorials/docker/
One of StarPU's key data structures is the codelet. A codelet defines a computational kernel that can potentially be implemented across various architectures, including CPUs, CUDA devices, or OpenCL devices.
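The idea of a kernel with one implementation per architecture can be sketched in plain C. This is a simplified model, not StarPU's API: every `toy_` name and the `double_cpu` kernel are hypothetical and exist only for illustration. StarPU's own `struct starpu_codelet` plays this role and carries considerably more information.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical architecture identifiers for this sketch. */
enum toy_arch { TOY_CPU = 0, TOY_CUDA = 1, TOY_OPENCL = 2, TOY_NARCH = 3 };

/* A kernel receives the buffer it works on and the buffer length. */
typedef void (*toy_kernel)(float *data, size_t n);

/* Toy codelet: at most one implementation per architecture. */
struct toy_codelet {
    const char *name;
    toy_kernel impl[TOY_NARCH]; /* NULL means "no implementation here" */
};

/* CPU implementation of a kernel that doubles every element. */
static void double_cpu(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Return 1 if the codelet has an implementation for this architecture. */
static int toy_codelet_can_run(const struct toy_codelet *cl, enum toy_arch arch)
{
    return cl->impl[arch] != NULL;
}

/* Run the codelet on one architecture; the caller must pick a valid one. */
static void toy_codelet_run(const struct toy_codelet *cl, enum toy_arch arch,
                            float *data, size_t n)
{
    assert(toy_codelet_can_run(cl, arch));
    cl->impl[arch](data, n);
}

/* A codelet that only provides a CPU implementation. */
static const struct toy_codelet double_cl = {
    .name = "double",
    .impl = { [TOY_CPU] = double_cpu },
};
```

In this model a scheduler would only consider the architectures for which `toy_codelet_can_run()` returns 1, which is why a codelet may provide several implementations of the same computation.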
Another pivotal data structure is the task. Executing a StarPU task involves applying a codelet to a data set, utilizing one of the architectures on which the codelet is implemented. Therefore, a task describes the codelet that it uses, the data accessed, and how they are accessed during the computation (read and/or write). StarPU tasks are asynchronous, meaning that submitting a task to StarPU is a non-blocking operation. The task structure can also specify a callback function, which is called once StarPU successfully completes the task. Additionally, it contains optional fields that the application may use to provide hints to the scheduler, such as priority levels.
By default, task dependencies are inferred from data dependency (sequential coherency) within StarPU. However, the application has the ability to disable sequential coherency for specific data, and dependencies can also be explicitly defined. A task can be uniquely identified by a 64-bit number, chosen by the application, referred to as a tag. Task dependencies can be enforced through callback functions, by submitting other tasks, or by specifying dependencies between tags (which can correspond to tasks that have yet to be submitted).
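A minimal sketch of what a task carries, and of what non-blocking submission means, is given below. It is a plain-C model under simplifying assumptions, not StarPU's API: all `toy_` names, `add_one` and `count_done` are hypothetical, each task touches a single buffer, and pending tasks are run in submission order on the calling thread when the application waits for them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical access modes: how a task touches its data. */
enum toy_access { TOY_R = 1, TOY_W = 2, TOY_RW = 3 };

typedef void (*toy_kernel)(float *data, size_t n);
typedef void (*toy_callback)(void *arg);

/* Toy task: a kernel applied to one buffer, plus optional extras. */
struct toy_task {
    toy_kernel kernel;      /* the computation (stands in for a codelet) */
    float *data;            /* the data accessed */
    size_t n;
    enum toy_access mode;   /* how the data is accessed */
    toy_callback callback;  /* optional, called after completion */
    void *callback_arg;
    int priority;           /* scheduling hint; this FIFO toy ignores it */
};

#define TOY_QUEUE_MAX 16
static struct toy_task toy_queue[TOY_QUEUE_MAX];
static size_t toy_queue_len = 0;

/* Non-blocking submission: record the task and return immediately.
 * Returns 0 on success, -1 if the queue is full. */
static int toy_task_submit(const struct toy_task *t)
{
    assert(t != NULL && t->kernel != NULL);
    if (toy_queue_len == TOY_QUEUE_MAX)
        return -1;
    toy_queue[toy_queue_len++] = *t;
    return 0;
}

/* Run every pending task in submission order, then its callback. */
static void toy_wait_for_all(void)
{
    for (size_t i = 0; i < toy_queue_len; i++) {
        struct toy_task *t = &toy_queue[i];
        t->kernel(t->data, t->n);
        if (t->callback != NULL)
            t->callback(t->callback_arg);
    }
    toy_queue_len = 0;
}

/* Example kernel: add 1 to every element. */
static void add_one(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] += 1.0f;
}

/* Example callback: count completed tasks. */
static void count_done(void *arg)
{
    ++*(int *)arg;
}
```

After `toy_task_submit()` returns, the buffer is still unmodified. The kernel and its callback only run once `toy_wait_for_all()` is called, which mirrors the asynchronous behaviour described above.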
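How dependencies can be inferred from data accesses alone is illustrated by the following sketch. It is a simplified model with hypothetical `toy_` names, assuming one piece of data per task, and it applies a pairwise rule: a later task must wait for an earlier one when both touch the same data and at least one of them writes. A real runtime would typically track only the last writer and the readers since then, which yields the same ordering with fewer recorded dependencies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical access modes; TOY_RW is the bitwise OR of read and write. */
enum toy_access { TOY_R = 1, TOY_W = 2, TOY_RW = 3 };

/* What one submitted task does: which data it touches, and how. */
struct toy_access_record {
    int data_id;
    enum toy_access mode;
};

/* Return 1 if `later` must wait for `earlier`: same data, and at least
 * one of the two accesses writes. Two reads never conflict. */
static int toy_must_wait(const struct toy_access_record *earlier,
                         const struct toy_access_record *later)
{
    if (earlier->data_id != later->data_id)
        return 0;
    return ((earlier->mode & TOY_W) || (later->mode & TOY_W)) ? 1 : 0;
}

/* Count how many previously submitted tasks task `idx` must wait for. */
static size_t toy_count_deps(const struct toy_access_record *tasks, size_t idx)
{
    size_t deps = 0;
    for (size_t i = 0; i < idx; i++)
        deps += (size_t)toy_must_wait(&tasks[i], &tasks[idx]);
    return deps;
}
```

For example, two tasks that only read the same data after a writer both depend on the writer but not on each other, so they could run concurrently. A later read-write task on that data has to wait for all three.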
As StarPU dynamically schedules tasks at runtime, the need for data transfers is automatically managed in a "just-in-time" manner between different processing units. This automated approach relieves application programmers of the burden of explicitly handling data transfers. Furthermore, to minimize needless transfers, StarPU retains data at the location of its last use, even if modifications were made there. Additionally, StarPU allows multiple instances of the same data to coexist across various processing units simultaneously, as long as the data remains unaltered.
This section briefly explains the concept of "taskifying" an application.
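The bookkeeping behind this behaviour can be sketched as follows. This is a toy model with hypothetical `toy_` names and an assumed fixed number of memory nodes. It does not describe StarPU's actual implementation. It only tracks which nodes hold an up-to-date copy and counts the transfers that would be needed.

```c
#include <assert.h>

#define TOY_NNODES 3 /* assumption: main memory plus two accelerators */

/* Per-data state: which memory nodes hold an up-to-date copy. */
struct toy_data {
    int valid[TOY_NNODES];
    unsigned transfers; /* number of copies performed so far */
};

/* Initially only the home node holds the data. */
static void toy_data_init(struct toy_data *d, int home)
{
    assert(home >= 0 && home < TOY_NNODES);
    for (int i = 0; i < TOY_NNODES; i++)
        d->valid[i] = (i == home);
    d->transfers = 0;
}

/* Just-in-time fetch: copy the data to `node` only if it is missing. */
static void toy_fetch(struct toy_data *d, int node)
{
    assert(node >= 0 && node < TOY_NNODES);
    if (!d->valid[node]) {
        d->valid[node] = 1;
        d->transfers++;
    }
}

/* Read access: several nodes may hold valid copies at the same time. */
static void toy_acquire_read(struct toy_data *d, int node)
{
    toy_fetch(d, node);
}

/* Read-write access: the other copies become stale, and the data
 * stays on `node` afterwards instead of being copied back. */
static void toy_acquire_readwrite(struct toy_data *d, int node)
{
    toy_fetch(d, node);
    for (int i = 0; i < TOY_NNODES; i++)
        if (i != node)
            d->valid[i] = 0;
}
```

Repeated reads on the same node cost no additional transfer, and a modification leaves the only valid copy where the write happened until another node asks for it.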
Before transitioning to StarPU, you must transform your application as follows:
Once this restructuring is complete, integrating StarPU or any similar task-based library becomes straightforward. You merely replace function calls with task submissions, leveraging the library's capabilities.
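Replacing function calls with task submissions can be pictured with the sketch below. It assumes functions that only work on the data passed as arguments, and it uses a hypothetical in-process queue (`toy_submit`, `toy_wait_for_all`) in place of a real task-based library. The kernels `scale2` and `add1` are likewise made up for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_func)(float *data, size_t n);

/* Two functions that only touch the data they are given. */
static void scale2(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

static void add1(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] += 1.0f;
}

/* Original version: plain function calls. */
static void run_sequential(float *v, size_t n)
{
    scale2(v, n);
    add1(v, n);
}

/* Hypothetical stand-in for a task library: a queue of pending calls. */
struct toy_call { toy_func f; float *data; size_t n; };

#define TOY_MAX_PENDING 8
static struct toy_call toy_pending[TOY_MAX_PENDING];
static size_t toy_npending = 0;

/* Record the call instead of executing it. */
static void toy_submit(toy_func f, float *data, size_t n)
{
    assert(toy_npending < TOY_MAX_PENDING);
    toy_pending[toy_npending++] = (struct toy_call){ f, data, n };
}

/* Execute all recorded calls in submission order. */
static void toy_wait_for_all(void)
{
    for (size_t i = 0; i < toy_npending; i++)
        toy_pending[i].f(toy_pending[i].data, toy_pending[i].n);
    toy_npending = 0;
}

/* Taskified version: each call site becomes a submission. */
static void run_taskified(float *v, size_t n)
{
    toy_submit(scale2, v, n);
    toy_submit(add1, v, n);
    toy_wait_for_all();
}
```

Both versions compute the same result. The difference is that the taskified version hands the calls to a layer that is free to decide when and where each one runs, as long as the order imposed by the data is respected.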
The chapter "A Stencil Application" shows how to convert an existing application to use StarPU.
Research papers about StarPU can be found at https://starpu.gitlabpages.inria.fr/publications/.
A good overview is available in the research report at http://hal.archives-ouvertes.fr/inria-00467677.