StarPU Handbook - StarPU Extensions
When using StarPU, one may need to store more data than the main memory (RAM) can hold. This part describes how to add a new memory node on a disk and how to use it.
Similarly to what happens with GPUs (it's actually exactly the same code), when available main memory becomes scarce, StarPU will evict unused data to the disk, thus leaving room for new allocations. Whenever some evicted data is needed again for a task, StarPU will automatically fetch it back from the disk.
The principle is that one calls starpu_disk_register() to register a disk memory node, giving it a set of functions to manipulate data and a disk location. StarPU sees this location as a void*; it can be, for instance, a Unix path for the stdio, unistd or unistd_o_direct backends, a leveldb database for the leveldb backend, or an HDF5 file path for the HDF5 backend. The disk backend opens this location with its plug() method.
StarPU can then start using it to allocate room and store data there with the disk write method, without user intervention.
Users can also use starpu_disk_open() to explicitly open an object within the disk, e.g. a file name in the stdio or unistd cases, or a database key in the leveldb case, and then use the starpu_*_register functions to turn it into a StarPU data handle. StarPU will then use this object as an external source of data, and automatically read and write data as appropriate. At the end, use starpu_disk_close() to close the object.
In any case, users also need to set STARPU_LIMIT_CPU_MEM to the amount of main memory that StarPU is allowed to use. By default, StarPU will use the machine memory size, but part of it is taken by the kernel, the system, daemons, and the application's own allocated data, whose size cannot be predicted. That is why users need to specify what StarPU can afford.
Some out-of-core tests are worth reading, see tests/disk/*.c
To use a disk memory node, you have to register it with starpu_disk_register().
With the starpu_disk_unistd_ops backend, the unistd library is used to realize the read/write operations, i.e. read/write. The call must be given a path where to store files, as well as the maximum size the software can afford to store on the disk.
Don't forget to check the value returned by the registration!
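The registration described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names mib_to_bytes() and register_tmp_swap() are hypothetical, the /tmp/ path and the 200 MiB size are arbitrary, and the __has_include guard is only there so that the snippet still compiles on a machine without the StarPU headers. It assumes starpu_init() has already succeeded.

```c
#include <assert.h>

#if defined(__has_include)
#  if __has_include(<starpu.h>)
#    include <starpu.h>
#    define SKETCH_HAVE_STARPU 1
#  endif
#endif

/* Hypothetical helper: convert a size given in MiB to bytes. */
long long mib_to_bytes(long long mib)
{
    assert(mib >= 0);
    return mib * 1024LL * 1024LL;
}

#ifdef SKETCH_HAVE_STARPU
/* Hypothetical helper: register /tmp/ as a 200 MiB disk memory node with
 * the unistd backend. Must be called after starpu_init(). The value
 * returned by starpu_disk_register() is passed back unchanged so that
 * the caller can check it. */
int register_tmp_swap(void)
{
    return starpu_disk_register(&starpu_disk_unistd_ops,
                                (void *) "/tmp/",
                                (starpu_ssize_t) mib_to_bytes(200));
}
#endif
```

Passing the size through a small conversion helper keeps the unit explicit at the call site, since the size argument is a plain byte count.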
This can also be achieved by just setting the environment variables STARPU_DISK_SWAP, STARPU_DISK_SWAP_BACKEND and STARPU_DISK_SWAP_SIZE:
export STARPU_DISK_SWAP=/tmp
export STARPU_DISK_SWAP_BACKEND=unistd
export STARPU_DISK_SWAP_SIZE=200
The backend can be set to stdio (some caching is done by libc and the kernel), unistd (only caching in the kernel), unistd_o_direct (no caching), leveldb, or hdf5.
It is important to understand that when the backend is not set to unistd_o_direct, some caching will occur at the kernel level (the page cache), which will also consume memory. STARPU_LIMIT_CPU_MEM might need to be set to less than half of the machine memory just to leave room for the kernel's page cache, otherwise the kernel will struggle to get memory. Using unistd_o_direct avoids this caching, which allows setting STARPU_LIMIT_CPU_MEM to the machine memory size (minus some memory for normal kernel operations, system daemons, and application data).
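The sizing rule above can be turned into a small helper. The sketch below is not part of StarPU: the function names are hypothetical, the halving rule is just one way to follow the advice given here, and it assumes STARPU_LIMIT_CPU_MEM is expressed in megabytes and is read when StarPU initializes, so the variable has to be exported before starpu_init().

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: pick a value in MB for STARPU_LIMIT_CPU_MEM.
 *   total_mb   machine memory
 *   reserve_mb memory left to the kernel, daemons and application data
 *   o_direct   non-zero when the unistd_o_direct backend is used */
long cpu_mem_limit_mb(long total_mb, long reserve_mb, int o_direct)
{
    assert(total_mb > 0 && reserve_mb >= 0 && reserve_mb < total_mb);
    long usable = total_mb - reserve_mb;
    /* Without O_DIRECT, leave about half of it to the page cache. */
    return o_direct ? usable : usable / 2;
}

/* Hypothetical helper: export the limit to the environment.
 * Returns 0 on success, like setenv(). */
int export_cpu_mem_limit(long limit_mb)
{
    char buf[32];
    snprintf(buf, sizeof(buf), "%ld", limit_mb);
    return setenv("STARPU_LIMIT_CPU_MEM", buf, 1);
}
```

For a 16 GB machine with 2 GB set aside, this gives 14336 MB with unistd_o_direct and 7168 MB with the other backends.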
When the register call is made, StarPU will benchmark the disk. This can take some time.
Warning: the size therefore has to be at least STARPU_DISK_SIZE_MIN bytes!
StarPU will then automatically try to evict unused data to this new disk. One can also use the standard StarPU memory node API to prefetch data etc., see the Standard Memory Library and the Data Interfaces.
The disk is unregistered during the execution of starpu_shutdown().
StarPU will only be able to achieve Out-Of-Core eviction if it controls memory allocation. For instance, if the application allocates a buffer itself (e.g. with malloc()), fills it, and registers it with starpu_matrix_data_register(), StarPU will not be able to release the corresponding memory: the application allocated it, StarPU cannot know how, and thus cannot know how to release it.
One thus has to register the handle without any initial buffer (home node -1 and a NULL pointer) and let a task, here one running a codelet cl_fill_with_data, produce the data.
This makes StarPU automatically do the allocation when the task running cl_fill_with_data gets executed. If it needs to, it will then be able to release the memory after having pushed the data to the disk. Since no initial buffer is provided to starpu_matrix_data_register(), the handle does not have any initial value right after this call, and thus the very first task using the handle needs to use the STARPU_W mode; STARPU_R or STARPU_RW would not make sense.
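The StarPU-side allocation described above can be sketched as follows. The matrix dimensions, the fill pattern, and the helper names fill_with_data(), fill_cpu() and register_and_fill() are assumptions made for the illustration; only cl_fill_with_data is named in the text. The __has_include guard merely lets the snippet compile where the StarPU headers are not installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__has_include)
#  if __has_include(<starpu.h>)
#    include <starpu.h>
#    define SKETCH_HAVE_STARPU 1
#  endif
#endif

/* Hypothetical kernel body: fill an nx x ny matrix whose rows are ld
 * elements apart with an arbitrary pattern. */
void fill_with_data(float *p, uint32_t nx, uint32_t ny, uint32_t ld)
{
    assert(ld >= nx);
    for (uint32_t j = 0; j < ny; j++)
        for (uint32_t i = 0; i < nx; i++)
            p[(size_t) j * ld + i] = (float) (i + j);
}

#ifdef SKETCH_HAVE_STARPU
/* CPU implementation: the buffer was allocated by StarPU, not by us. */
static void fill_cpu(void *buffers[], void *cl_arg)
{
    (void) cl_arg;
    fill_with_data((float *) STARPU_MATRIX_GET_PTR(buffers[0]),
                   STARPU_MATRIX_GET_NX(buffers[0]),
                   STARPU_MATRIX_GET_NY(buffers[0]),
                   STARPU_MATRIX_GET_LD(buffers[0]));
}

static struct starpu_codelet cl_fill_with_data =
{
    .cpu_funcs = { fill_cpu },
    .nbuffers = 1,
    .modes = { STARPU_W },
};

/* Register a 1024x1024 float matrix with no initial buffer (home node
 * -1, NULL pointer), then submit the task that writes its first value.
 * Returns the value of starpu_task_insert(). */
int register_and_fill(starpu_data_handle_t *h)
{
    starpu_matrix_data_register(h, -1, (uintptr_t) NULL,
                                1024, 1024, 1024, sizeof(float));
    return starpu_task_insert(&cl_fill_with_data, STARPU_W, *h, 0);
}
#endif
```

Keeping the fill logic in a plain function makes it possible to exercise it on an ordinary array, independently of the runtime.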
By default, StarPU will try to push any data handle to the disk. To specify whether a given handle should be pushed to the disk, starpu_data_set_ooc_flag() should be used. To find out whether a given handle may be pushed to the disk, starpu_data_get_ooc_flag() should be used.
By default, StarPU uses a Least-Recently-Used (LRU) algorithm to determine which data should be evicted to the disk. This algorithm can be given a hint by calling starpu_data_wont_use() on a handle whose data will not be used in the near future.
StarPU will then mark the data as "inactive" and tend to evict it to the disk rather than other data.
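To make the effect of the hint concrete, here is a toy model of victim selection. It is not StarPU's implementation: the structure, its fields and pick_victim() are hypothetical, and the model always prefers hinted data, whereas StarPU only tends to.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a cached piece of data; not a StarPU structure. */
struct toy_entry
{
    unsigned long last_use; /* logical time of the last access */
    int wont_use;           /* non-zero once the hint has been given */
};

/* Return the index of the entry to evict: an entry marked as "won't be
 * used" is taken before any unmarked one, and among entries with the
 * same mark the least recently used one is taken. n must be >= 1. */
size_t pick_victim(const struct toy_entry *e, size_t n)
{
    assert(n > 0);
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
    {
        if (e[i].wont_use != e[victim].wont_use)
        {
            /* Marks differ: the marked entry wins. */
            if (e[i].wont_use)
                victim = i;
        }
        else if (e[i].last_use < e[victim].last_use)
        {
            /* Same mark: plain LRU. */
            victim = i;
        }
    }
    return victim;
}
```

With three entries last used at times 10, 5 and 20, plain LRU evicts the second one; marking the third one makes it the victim even though it is the most recently used.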
Full example codes are provided in the files tests/disk/disk_copy.c and tests/disk/disk_compute.c
Scheduling heuristics for Out-of-core are still relatively experimental. The tricky part is that you usually have to find a compromise between favoring locality (which avoids going back and forth to the disk) and favoring the critical path, i.e. taking priorities into account to avoid a lack of parallelism at the end of the task graph.
It is notably better to avoid assigning different priorities to low-priority tasks, since that will make the scheduler want to schedule them by levels of priority, at the expense of locality.
The scheduling algorithms worth trying are thus dmdar and lws, which favor data locality over priorities. Further work in this area is planned.
Beyond pure performance feedback, some figures are interesting to have a look at.
Using export STARPU_BUS_STATS=1 gives an overview of the data transfers which were needed. STARPU_BUS_STATS_FILE can be used to define a file in which to write the statistics; by default the standard error stream is used. The values can also be obtained at runtime by using starpu_bus_get_profiling_info(). An example can be read in src/profiling/profiling_helpers.c.
#---------------------
Data transfer speed for /tmp/sthibault-disk-DJzhAj (node 1):
0 -> 1: 99 MB/s
1 -> 0: 99 MB/s
0 -> 1: 23858 µs
1 -> 0: 23858 µs
#---------------------
TEST DISK MEMORY
#---------------------
Data transfer stats:
        Disk 0 -> NUMA 0        0.0000 GB       0.0000 MB/s     (transfers : 0 - avg -nan MB)
        NUMA 0 -> Disk 0        0.0625 GB       63.6816 MB/s    (transfers : 2 - avg 32.0000 MB)
Total transfers: 0.0625 GB
#---------------------
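The "avg" column of the listing above can be recomputed from the other two figures. The helper below is a hypothetical illustration, not StarPU code; it assumes 1 GB = 1024 MB, which is what makes the sample figures agree (0.0625 GB over 2 transfers gives 32 MB). A line with zero transfers divides zero by zero, which is presumably where the "-nan" comes from.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: average size in MB of one transfer, from the
 * total volume in GB and the number of transfers. With zero transfers
 * and a zero volume the result is NaN (0.0 / 0.0). */
double avg_transfer_mb(double total_gb, unsigned transfers)
{
    assert(total_gb >= 0.0);
    return total_gb * 1024.0 / (double) transfers;
}
```

A caller that wants to print such a table without NaN entries can test the transfer count, or the result with isnan(), before formatting it.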
Using export STARPU_ENABLE_STATS=1 gives information for each memory node on data miss/hit and allocation miss/hit.
#---------------------
MSI cache stats :
memory node NUMA 0
        hit : 32 (66.67 %)
        miss : 16 (33.33 %)
memory node Disk 0
        hit : 0 (0.00 %)
        miss : 0 (0.00 %)
#---------------------
#---------------------
Allocation cache stats:
memory node NUMA 0
        total alloc : 16
        cached alloc: 0 (0.00 %)
memory node Disk 0
        total alloc : 8
        cached alloc: 0 (0.00 %)
#---------------------
There are various ways to operate a disk memory node, described by the structure starpu_disk_ops. For instance, the variable starpu_disk_unistd_ops uses read/write functions.
All structures are in Out Of Core.
Examples are provided in src/core/disk_ops/disk_*.c