StarPU Handbook - StarPU Extensions
Maxeler provides hardware and software solutions for accelerating computing applications on dataflow engines (DFEs). DFEs are in-house designed accelerators that encapsulate reconfigurable high-end FPGAs at their core and are equipped with large amounts of DDR memory.
We extend the StarPU task programming library, which initially targeted heterogeneous architectures, to support Field Programmable Gate Arrays (FPGAs).
To create StarPU/FPGA applications exploiting DFE configurations, MaxCompiler allows an application to be split into three parts:
- Kernel, which implements the computational components of the application in hardware.
- Manager configuration, which connects Kernels to the CPU, engine RAM, other Kernels and other DFEs via MaxRing.
- CPU application, which interacts with the DFEs to read and write data to the Kernels and engine RAM.

The Simple Live CPU interface (SLiC) is Maxeler’s application programming interface for seamless CPU-DFE integration. SLiC allows CPU applications to configure and load a number of DFEs as well as to subsequently schedule and run actions on those DFEs using simple function calls. In StarPU/FPGA applications, we use the Dynamic SLiC Interface to exchange data streams between the CPU (Main Memory) and DFE (Local Memory).
To port an application to FPGA, set the field starpu_codelet::max_fpga_funcs to provide StarPU with the function that implements the FPGA version, for instance:
struct starpu_codelet cl =
{
    .max_fpga_funcs = {myfunc},
    .nbuffers = 1,
};
A basic example is available in the file tests/maxfpga/max_fpga_basic_static.c.
To give you an idea of the interface that we used to exchange data between host (CPU) and FPGA (DFE), here is an example, based on one of the examples of Maxeler (https://trac.version.fz-juelich.de/reconfigurable/wiki/Public).
StreamFMAKernel.maxj contains the Java kernel code; it implements a very simple kernel (c=a+b). Test.c starts it from the fpga_add function, which first sets up streaming from the CPU pointers, then triggers execution and waits for the result. The API used to interact with DFEs is called SLiC, which also involves the MaxelerOS runtime.

- StreamFMAKernel.maxj: the DFE part is described in the MaxJ programming language, which is a Java-based metaprogramming approach.
- StreamFMAManager.maxj: also described in the MaxJ programming language; it orchestrates data movement between the host and the DFE.

Once StreamFMAKernel.maxj and StreamFMAManager.maxj are written, the remaining steps are:
$ maxjc -1.7 -cp $MAXCLASSPATH streamfma/
$ java -XX:+UseSerialGC -Xmx2048m -cp $MAXCLASSPATH:. streamfma.StreamFMAManager DFEModel=MAIA maxFileName=StreamFMA target=DFE_SIM
$ sliccompile StreamFMA.max
Test.c: to interface the StarPU task-based runtime system with Maxeler's DFE devices, we use the advanced dynamic interface of SLiC in non_blocking mode.
Test code must include MaxSLiCInterface.h and MaxFile.h. The .max file contains the bitstream. The StarPU/FPGA application can be written in C, C++, etc. Some examples are available in the directory tests/maxfpga.
To write the StarPU/FPGA application: first, the programmer must describe the codelet using StarPU’s C API. This codelet provides both a CPU implementation and an FPGA one. It also specifies that the task has two inputs and one output through the starpu_codelet::nbuffers and starpu_codelet::modes attributes.
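As an illustration of the CPU side of such a codelet, the following minimal sketch computes the same element-wise sum as the kernel (c=a+b) on plain arrays and compares it with a buffer read back from the DFE. The names cpu_add and count_mismatches and the int32_t element type are assumptions made for this sketch, not part of StarPU or SLiC; a real codelet function would receive its arrays through StarPU's buffers argument, which is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* CPU reference for the kernel: c[i] = a[i] + b[i] for every i in [0, n).
 * Hypothetical helper; the int32_t element type is an assumption and
 * should match the type actually streamed by the kernel. */
void cpu_add(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Count the positions where the output obtained from the DFE (actual)
 * differs from the CPU reference (expected). Hypothetical helper. */
size_t count_mismatches(const int32_t *expected, const int32_t *actual, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (expected[i] != actual[i])
            bad++;
    return bad;
}
```

Running cpu_add on the same two inputs as the FPGA task and passing both outputs to count_mismatches gives a simple check that the simulated or hardware run produced the expected stream; a return value of 0 means the two outputs agree.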
The fpga_add function is the FPGA implementation and is mainly divided into four steps:
In the main function, there are four important steps:
The rest of the application (data registration, task submission, etc.) is as usual with StarPU.
The design load can also be delegated to StarPU by specifying an array of load specifications in starpu_conf::max_fpga_load, and then using starpu_max_fpga_get_local_engine() to access the loaded max engines.
Complete examples are available in tests/maxfpga/*.c.
The communication between the host and the DFE is done through the advanced dynamic interface of SLiC, which exchanges data between the main memory and the local memory of the DFE.
For the moment, we use STARPU_MAIN_RAM to send and store data to/from the DFE's local memory. However, we aim to use a multiplexer to choose which memory node is used to read and write data. Users could then state, for example, whether the computational kernel takes its data from the main memory or from the DFE's local memory.
In StarPU applications, when starpu_codelet::specific_nodes is set to 1, the starpu_codelet::nodes array specifies the memory node to which each piece of data should be sent for task execution.
To configure StarPU with Maxeler FPGA accelerators, make sure that slic-config is available from your PATH environment variable.
Maxeler provides a simple tutorial on using MaxCompiler (https://trac.version.fz-juelich.de/reconfigurable/wiki/Public). Running the Java program to generate the maxfile and SLiC headers for the actual hardware (Maxeler's DFE device) takes a very long time, approximately 2 hours even for this very small example. That is why we use the simulation.
$ maxcompilersim -c LIMA -n StreamFMA restart
$ export LD_LIBRARY_PATH=$MAXELEROSDIR/lib:$LD_LIBRARY_PATH
$ export SLIC_CONF="use_simulation=StreamFMA"
$ STARPU_NCPU=0 ./StreamFMA
$ maxcompilersim -c LIMA -n StreamFMA stop