Overview

In our last example we allocated memory simply, but as we saw, the DMA engine requires that our buffers be aligned to 4 KiB page boundaries. If the buffers are not so aligned, which they likely won’t be if we don’t explicitly ask for it, then the runtime will copy the buffers so that their contents are aligned.

That’s an expensive operation, but can we quantify how expensive?  And how can we allocate aligned memory?


Key Code

This is a relatively short example in that we’re only changing four lines vs. Example 1: our buffer allocation. There are various ways to allocate aligned memory, but in this case we’ll make use of a POSIX function, posix_memalign(). This change replaces our previous allocation from listing 3.4 with the code from listing 3.10. We also need to include an additional header not shown in the listing, memory.

Listing 3.10: Allocating Aligned Buffers

	
// Initialize every pointer, then request 4096-byte (4 KiB) aligned storage for each
uint32_t *a = NULL, *b = NULL, *c = NULL, *d = NULL;
posix_memalign((void **)&a, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&b, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&c, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&d, 4096, BUFSIZE * sizeof(uint32_t));

Note that in each call to posix_memalign() the second argument is our requested alignment: 4096 bytes, or 4 KiB, as previously noted.

Otherwise, this is the only change to the code vs. the first example. Note that we have changed the allocation for all of the buffers, including buffer d, which is only used by the CPU baseline VADD function. We’ll see if this has any impact on the runtime performance for both the accelerator and the CPU.


Running the Application

With XRT initialized, run the application with the following command from the build directory:

./02_aligned_malloc alveo_examples

The program will output a message similar to this:

-- Example 2: Vector Add with Aligned Allocation --

Loading XCLBin to program the Alveo board:

Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: ’./alveo_examples.xclbin’
Running kernel test with aligned virtual buffers

Simple malloc vadd example complete!

--------------- Key execution times ---------------
OpenCL Initialization: 256.254 ms
Allocating memory buffer: 0.055 ms
Populating buffer inputs: 47.884 ms
Software VADD run: 35.808 ms
Map host buffers to OpenCL buffers : 9.103 ms
Memory object migration enqueue: 6.615 ms
Set kernel arguments: 0.014 ms
OCL Enqueue task: 0.116 ms
Wait for kernel to complete: 92.110 ms
Read back computation results: 2.479 ms

This seems at first glance to be much better!  Let’s compare these results to our results from Example 1 to see how things have changed.  Refer to table 3.2 for details, noting that we’ll exclude minor run-to-run variation from the comparison to help keep things clean.

Table 3.2: Timing Summary - Example 2

Operation                        Example 1      Example 2      ∆1→2
OCL Initialization               247.371 ms     256.254 ms     -
Buffer Allocation                30 μs          55 μs          25 μs
Buffer Population                47.955 ms      47.884 ms      -
Software VADD                    35.706 ms      35.808 ms      -
Buffer Mapping                   64.656 ms      9.103 ms       −55.553 ms
Write Buffers Out                24.829 ms      6.615 ms       −18.214 ms
Set Kernel Args                  9 μs           14 μs          -
Kernel Runtime                   92.118 ms      92.110 ms      -
Read Buffer In                   24.887 ms      2.479 ms       −22.408 ms
∆Alveo→CPU                       −418.228 ms    −330.889 ms    87.339 ms
∆Alveo→CPU (algorithm only)      −170.857 ms    −74.269 ms     96.588 ms

Nice! By changing only four lines of code we’ve managed to shave nearly 100 ms off of our execution time. The CPU is still faster, but just by changing one minor thing about how we’re allocating memory we saw a huge improvement. That’s really down to the memory copy that’s needed for alignment: if we take a few extra microseconds to ensure the buffers are aligned when we allocate them, we can save orders of magnitude more time later when those buffers are consumed.

Also note that as expected in this use case, the software runtime is the same.  We’re changing the alignment of the allocated memory, but otherwise it’s normal userspace memory allocation.


Extra Exercises

Some things to try to build on this experiment:

  • Once again vary the size of the buffers allocated.  Do the relationships that you derived in the previous example still hold true?
  • Experiment with other methods of allocating aligned memory (not the OCL API).  Do you see differences between the approaches, beyond minor run-to-run fluctuations?

Key Takeaways

  • Unaligned memory will kill your performance.  Always ensure buffers you want to share with the Alveo card are aligned.

Now we’re getting somewhere!  Let’s try using the OpenCL API to allocate memory and see what happens.


About Rob Armstrong


Rob Armstrong leads the AI and Software Acceleration technical marketing team at Xilinx, bringing the power of adaptive compute to bear on today’s most exciting challenges.  Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.