In our last example we allocated memory simply, but as we saw, the DMA engine requires that our buffers be aligned to 4 KiB page boundaries. If the buffers are not so aligned, which they likely won't be unless we explicitly ask for it, the runtime will copy the buffers so that their contents are aligned.
That’s an expensive operation, but can we quantify how expensive? And how can we allocate aligned memory?
This is a relatively short example: we're changing only four lines versus Example 1, all in our buffer allocation. There are various ways to allocate aligned memory, but in this case we'll make use of a POSIX function, posix_memalign(). This change replaces our previous allocation from listing 3.4 with the code in listing 3.10. We also need to include an additional header not shown in the listing, memory.
Listing 3.10: Allocating Aligned Buffers
uint32_t *a = NULL, *b = NULL, *c = NULL, *d = NULL;
posix_memalign((void **)&a, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&b, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&c, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&d, 4096, BUFSIZE * sizeof(uint32_t));
Note that in our calls to posix_memalign(), we're passing in our requested alignment: 4 KiB, as previously noted.
Otherwise, this is the only change to the code versus the first example. Note that we have changed the allocation for all of the buffers, including buffer d, which is only used by the CPU baseline VADD function. We'll see whether this has any impact on the runtime performance of both the accelerator and the CPU.
With XRT initialized, run the application with the following command from the build directory:
The program will output a message similar to this:
-- Example 2: Vector Add with Aligned Allocation --
Loading XCLBin to program the Alveo board:
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Running kernel test with aligned virtual buffers
Simple malloc vadd example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 256.254 ms
Allocating memory buffer: 0.055 ms
Populating buffer inputs: 47.884 ms
Software VADD run: 35.808 ms
Map host buffers to OpenCL buffers : 9.103 ms
Memory object migration enqueue: 6.615 ms
Set kernel arguments: 0.014 ms
OCL Enqueue task: 0.116 ms
Wait for kernel to complete: 92.110 ms
Read back computation results: 2.479 ms
This seems at first glance to be much better! Let’s compare these results to our results from Example 1 to see how things have changed. Refer to table 3.2 for details, noting that we’ll exclude minor run-to-run variation from the comparison to help keep things clean.
Table 3.2: Timing Summary - Example 2
| Operation | Example 1 | Example 2 | Δ1→2 |
|---|---|---|---|
| OCL Initialization | 247.371 ms | 256.254 ms | - |
| Buffer Allocation | 30 μs | 55 μs | 25 μs |
| Buffer Population | 47.955 ms | 47.884 ms | - |
| Software VADD | 35.706 ms | 35.808 ms | - |
| Buffer Mapping | 64.656 ms | 9.103 ms | −55.553 ms |
| Write Buffers Out | 24.829 ms | 6.615 ms | −18.214 ms |
| Set Kernel Args | 9 μs | 14 μs | - |
| Kernel Runtime | 92.118 ms | 92.110 ms | - |
| Read Buffer In | 24.887 ms | 2.479 ms | −22.408 ms |
| ΔAlveo→CPU | −418.228 ms | −330.889 ms | 87.339 ms |
| ΔAlveo→CPU (algorithm only) | −170.857 ms | −74.269 ms | 96.588 ms |
Nice! By changing only four lines of code we've managed to shave nearly 100 ms off of our execution time. The CPU is still faster, but just by changing one minor detail of how we allocate memory we saw a huge improvement. That's down to the memory copy needed for alignment: if we take a few extra microseconds to ensure the buffers are aligned when we allocate them, we can save orders of magnitude more time later when those buffers are consumed.
Also note that as expected in this use case, the software runtime is the same. We’re changing the alignment of the allocated memory, but otherwise it’s normal userspace memory allocation.
Some things to try to build on this experiment:
Now we’re getting somewhere! Let’s try using the OpenCL API to allocate memory and see what happens.
Rob Armstrong leads the AI and Software Acceleration technical marketing team at Xilinx, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.