In our last example we allocated memory simply, but as we saw, the DMA engine requires that our buffers be aligned to 4 KiB page boundaries. If the buffers are not so aligned, which they likely won't be unless we explicitly ask for it, the runtime will copy the buffers so that their contents are aligned.
That’s an expensive operation, but can we quantify how expensive? And how can we allocate aligned memory?
This is a relatively short example: we're changing only four lines versus Example 1, all in our buffer allocation. There are various ways to allocate aligned memory, but in this case we'll make use of a POSIX function, posix_memalign(). This change replaces our previous allocation from listing 3.4 with the code in listing 3.10. We also need to include an additional header not shown in the listing, memory.
Listing 3.10: Allocating Aligned Buffers
uint32_t *a = NULL, *b = NULL, *c = NULL, *d = NULL;
posix_memalign((void **)&a, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&b, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&c, 4096, BUFSIZE * sizeof(uint32_t));
posix_memalign((void **)&d, 4096, BUFSIZE * sizeof(uint32_t));
Note that in our calls to posix_memalign(), we're passing in our requested alignment: 4 KiB, as previously noted.
Otherwise, this is the only change to the code versus the first example. Note that we have changed the allocation for all of the buffers, including buffer d, which is only used by the CPU baseline VADD function. We'll see whether this has any impact on the runtime performance of both the accelerator and the CPU.
With XRT initialized, run the application with the following command from the build directory:
The program will output a message similar to this:
-- Example 2: Vector Add with Aligned Allocation --
Loading XCLBin to program the Alveo board:
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Running kernel test with aligned virtual buffers
Simple malloc vadd example complete!
--------------- Key execution times ---------------
OpenCL Initialization: 256.254 ms
Allocating memory buffer: 0.055 ms
Populating buffer inputs: 47.884 ms
Software VADD run: 35.808 ms
Map host buffers to OpenCL buffers : 9.103 ms
Memory object migration enqueue: 6.615 ms
Set kernel arguments: 0.014 ms
OCL Enqueue task: 0.116 ms
Wait for kernel to complete: 92.110 ms
Read back computation results: 2.479 ms
This seems at first glance to be much better! Let’s compare these results to our results from Example 1 to see how things have changed. Refer to table 3.2 for details, noting that we’ll exclude minor run-to-run variation from the comparison to help keep things clean.
Table 3.2: Timing Summary - Example 2
| Operation | Example 1 | Example 2 | Δ1→2 |
|---|---|---|---|
| OCL Initialization | 247.371 ms | 256.254 ms | - |
| Buffer Allocation | 30 μs | 55 μs | 25 μs |
| Buffer Population | 47.955 ms | 47.884 ms | - |
| Software VADD | 35.706 ms | 35.808 ms | - |
| Buffer Mapping | 64.656 ms | 9.103 ms | −55.553 ms |
| Write Buffers Out | 24.829 ms | 6.615 ms | −18.214 ms |
| Set Kernel Args | 9 μs | 14 μs | - |
| Kernel Runtime | 92.118 ms | 92.110 ms | - |
| Read Buffer In | 24.887 ms | 2.479 ms | −22.408 ms |
| ΔAlveo→CPU | −418.228 ms | −330.889 ms | 87.339 ms |
| ΔAlveo→CPU (algorithm only) | −170.857 ms | −74.269 ms | 96.588 ms |
Nice! By changing only four lines of code we've managed to shave nearly 100 ms off of our execution time. The CPU is still faster, but just by changing one minor detail of how we allocate memory we saw a huge improvement. That's down to the memory copy needed for alignment: if we take a few extra microseconds to ensure the buffers are aligned when we allocate them, we can save orders of magnitude more time later when those buffers are consumed.
Also note that as expected in this use case, the software runtime is the same. We’re changing the alignment of the allocated memory, but otherwise it’s normal userspace memory allocation.
Some things to try to build on this experiment:
Now we’re getting somewhere! Let’s try using the OpenCL API to allocate memory and see what happens.
Rob Armstrong leads the AI and Software Acceleration technical marketing team at Xilinx, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.