Overview

We’ve made some big gains in our overall system throughput, but the bottleneck is now very clearly the accelerator itself.  Remember that there are two axes on which we can optimize a design: we can throw more resources at a fixed problem to solve it faster, with the achievable speed-up bounded by Amdahl’s Law, or we can use the added resources to do more work in the same amount of time, following Gustafson’s Law.

In this case we’ll try optimizing along the axis of Amdahl’s Law. Up until now our accelerator has been following basically the same algorithm as the CPU, performing one 32-bit addition on each clock tick.  But because the CPU has a significantly faster clock (and has the advantage of not having to transfer the data over PCIe), it’s always been able to beat us.  Now it’s time to turn the tables.

Our DDR controller internally has a native 512-bit wide interface.  If we parallelize the data flow in the accelerator, we’ll be able to process 16 32-bit array elements per clock tick instead of one.  So, we should be able to get an instant 16x speed-up just by vectorizing the input.
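
To make that concrete, here is a minimal sketch of what a widened kernel loop could look like in HLS C++ (Vitis HLS syntax).  This is not the actual wide_vadd source; the function and port names, the pragmas, and the assumption that the element count is a multiple of 16 are all illustrative.

#include <ap_int.h>
#include <stdint.h>

// One 512-bit memory word carries 16 packed 32-bit elements
#define WORDS_PER_BEAT (512 / 32)

extern "C" void wide_vadd_sketch(const ap_uint<512> *in1,
                                 const ap_uint<512> *in2,
                                 ap_uint<512> *out,
                                 int size)  // element count, assumed a multiple of 16
{
    for (int i = 0; i < size / WORDS_PER_BEAT; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<512> a = in1[i];
        ap_uint<512> b = in2[i];
        ap_uint<512> c;
        // Fully unrolled: 16 independent 32-bit adders working in parallel
        for (int w = 0; w < WORDS_PER_BEAT; w++) {
#pragma HLS UNROLL
            int lo = 32 * w;
            uint32_t lane_a = a.range(lo + 31, lo).to_uint();
            uint32_t lane_b = b.range(lo + 31, lo).to_uint();
            c.range(lo + 31, lo) = lane_a + lane_b;
        }
        out[i] = c;
    }
}

The fully unrolled inner loop is what turns each 512-bit memory word into 16 parallel 32-bit additions per clock tick.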


Key Code

In this example, if we continue to focus on the host software and treat the kernel as a black box we wind up with code that’s essentially identical to the code from Example 3.  All we need to do is change the kernel name from vadd to wide_vadd.

Since the code is the same, this is a good opportunity to introduce another concept of working with memory in XRT and OpenCL.  On the Alveo™ Data Center accelerator cards we have four memory banks available, and while it isn’t necessary with these simple accelerators you may find that you wish to spread operations among them to help maximize the bandwidth available to each of the interfaces.  For our simple vector addition example, just for illustration purposes we’ve used the topology shown in figure 3.3.

Figure 3.3: Wide VADD Memory Connectivity

What we gain by doing this is the ability to perform high-bandwidth transactions simultaneously with different external memory banks.  Remember, long bursts are generally better for performance than many small reads and writes, but a single memory bank fundamentally can’t service two operations at the same time.

It’s easy enough to specify the hardware connectivity via command line switches, but when we want to actually transfer buffers down to the hardware we need to tell XRT which memory bank each buffer should use.  Xilinx accomplishes this with an extension to the standard OpenCL library: the struct cl_mem_ext_ptr_t, used in combination with the buffer allocation flag CL_MEM_EXT_PTR_XILINX.  This looks like the code in listing 3.15.

Listing 3.15: Specifying External Memory Banks with XRT

    
// Extended pointers tie each buffer to a specific bank in the memory topology
cl_mem_ext_ptr_t bank1_ext, bank2_ext;
bank2_ext.flags = 2 | XCL_MEM_TOPOLOGY;  // memory bank 2
bank2_ext.obj   = NULL;
bank2_ext.param = 0;
bank1_ext.flags = 1 | XCL_MEM_TOPOLOGY;  // memory bank 1
bank1_ext.obj   = NULL;
bank1_ext.param = 0;
// Input buffer a: read-only, allocated in bank 1
cl::Buffer a_buf(context,
                 static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                           CL_MEM_EXT_PTR_XILINX),
                 BUFSIZE * sizeof(uint32_t),
                 &bank1_ext,
                 NULL);
// Input buffer b: read-only, allocated in bank 2
cl::Buffer b_buf(context,
                 static_cast<cl_mem_flags>(CL_MEM_READ_ONLY |
                                           CL_MEM_EXT_PTR_XILINX),
                 BUFSIZE * sizeof(uint32_t),
                 &bank2_ext,
                 NULL);
// Output buffer c: read-write, allocated in bank 1
cl::Buffer c_buf(context,
                 static_cast<cl_mem_flags>(CL_MEM_READ_WRITE |
                                           CL_MEM_EXT_PTR_XILINX),
                 BUFSIZE * sizeof(uint32_t),
                 &bank1_ext,
                 NULL);

You can see that this code is very similar to listing 3.11, the difference being that we pass our cl_mem_ext_ptr_t object to the cl::Buffer constructor and use its flags field to specify the memory bank for that particular buffer.  Note again that there is fundamentally no reason to do this for such a simple example, but since this example is structurally so similar to the previous one it seemed like a good time to mix it in.  This can be a very useful performance optimization technique for very heavy workloads.
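
For reference, the hardware side of this bank assignment is made when the kernel is linked.  With the Vitis v++ linker this is typically expressed as --connectivity.sp switches or, equivalently, as sp entries in a configuration file passed with --config.  A hedged example follows: the kernel instance and port names (wide_vadd_1, in1, in2, out) and the file name are assumptions for illustration, and the memory indices must line up with the XCL_MEM_TOPOLOGY flags used in the host code.

# connectivity.cfg (hypothetical file name), passed to the v++ link step with --config
[connectivity]
sp=wide_vadd_1.in1:DDR[1]
sp=wide_vadd_1.in2:DDR[2]
sp=wide_vadd_1.out:DDR[1]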

Otherwise, aside from switching to the wide_vadd kernel and bumping up the buffer size to better accentuate the differences between the hardware and the software, the code here is unchanged.


Running the Application

With XRT initialized, run the application by entering the following command from the build directory:

./04_wide_vadd alveo_examples

The program will output a message similar to this:

-- Example 4: Parallelizing the Data Path --

Loading XCLBin to program the Alveo board:

Found Platform
Platform Name: Xilinx
XCLBIN File Name: alveo_examples
INFO: Importing ./alveo_examples.xclbin
Loading: './alveo_examples.xclbin'
Running kernel test with XRT-allocated buffers and wide data path:

OCL-mapped contiguous buffer example complete!

--------------- Key execution times ---------------
OpenCL Initialization: 244.463 ms
Allocate contiguous OpenCL buffers: 37.903 ms
Map buffers to userspace pointers: 0.333 ms
Populating buffer inputs: 30.033 ms
Software VADD run: 21.489 ms
Memory object migration enqueue: 4.639 ms
Set kernel arguments: 0.012 ms
OCL Enqueue task: 0.090 ms
Wait for kernel to complete: 9.003 ms
Read back computation results: 2.197 ms

Table 3.4: Timing Summary - Example 4

Operation                      Example 3       Example 4       ∆3→4
OCL Initialization             247.460 ms      244.463 ms      -
Buffer Allocation               30.365 ms       37.903 ms      7.538 ms
Buffer Population               22.527 ms       30.033 ms      7.506 ms
Software VADD                   24.852 ms       21.489 ms      −3.363 ms
Buffer Mapping                     222 μs          333 μs      -
Write Buffers Out               66.739 ms        4.639 ms      -
Set Kernel Args                     14 μs           12 μs      -
Kernel Runtime                  92.068 ms        9.003 ms      −83.065 ms
Read Buffer In                   2.243 ms        2.197 ms      -
∆Alveo→CPU                    −323.996 ms     −247.892 ms      −76.104 ms
∆FPGA→CPU (algorithm only)     −76.536 ms        5.548 ms      −82.084 ms

Mission accomplished - we’ve managed to beat the CPU!

There are a couple of interesting bits here, though.  The first thing to notice is that, as you probably expected, the time required to transfer the data into and out of the FPGA hasn’t changed.  As we discussed in earlier sections, that’s effectively a fixed cost determined by your memory topology, the amount of data, and overall system memory bandwidth utilization.  You’ll see some minor run-to-run variability here, especially in a virtualized environment like a cloud data center, but for the most part you can treat it as a fixed-time operation.

The more interesting one, though, is the achieved speed-up.  You might be surprised to see that widening the data path to 16 words per clock did not, in fact, result in a 16x speed-up.  The kernel itself processes 16 words per clock, but what we’re seeing in the timed results is the cumulative effect of external DDR latency.  Each interface bursts in data, processes it, and bursts it back out, and the inherent latency of going back and forth to DDR slows things down, so we only actually achieve roughly a 10x speed-up over the single-word implementation (9.003 ms versus 92.068 ms of kernel runtime).

Unfortunately, we’re hitting a fundamental problem with vector addition: it’s too easy.  The computational complexity of vadd is O(N), and a very simple O(N) at that: each output element costs only a single addition for the 12 bytes of data it moves.  As a result you very quickly become bound by I/O bandwidth rather than by computation.  With more complex algorithms, such as nested loops that are O(Nˣ) or even computationally heavy O(N) algorithms like filters that need many calculations per sample, we can achieve significantly higher acceleration by performing more computation inside the FPGA fabric instead of going out to memory quite so often.  The best candidate algorithms for acceleration transfer very little data and perform lots of calculations.

We can also run another experiment with a larger buffer.  Change the buffer size as in listing 3.16:

Listing 3.16: Larger Memory Sizes

#define BUFSIZE (1024 * 1024 * 256)  // 256 Mi elements * sizeof(uint32_t) = 1 GiB

Rebuild and run it again, and now let’s switch over to a simplified set of metrics as in table 3.5.  I think you get the idea on OpenCL initialization by now, too, so we’ll only compare the algorithm’s performance.  That doesn’t mean that this initialization time isn’t important, but you should architect your application so it’s a one-time cost that you pay during other initialization operations.

We will also stop tracking the buffer allocation and initialization times.  Again this is something that you’d generally want to do when setting up your application.  If you’re routinely allocating large buffers in the critical path of your application, odds are you’d get more “bang for your buck” rethinking your architecture rather than trying to apply hardware acceleration.

Table 3.5: 1 GiB Buffer Timing Summary - Example 4

Operation                  Example 4
Software VADD                820.596 ms
Buffer PCIe TX               383.907 ms
VADD Kernel                  484.050 ms
Buffer PCIe RX               316.825 ms
Hardware VADD (Total)       1184.897 ms
∆Alveo→CPU                   364.186 ms

Oh no!  We had such a good thing going, and here we went and ruined it.  With this larger buffer we’re not beating the CPU anymore at all.  What happened?

Looking closely at the numbers, we can see that while the hardware kernel itself is processing the data pretty efficiently, the data transfer to and from the FPGA is really eating up a lot of our time.  In fact, if you were to draw out our overall execution timeline, it would look something like figure 3.4 at this point.

Figure 3.4: Kernel Execution Timeline

This chart is figurative, but drawn roughly to relative scale.  Looking closely at the numbers, we see that in order to beat the CPU our entire kernel would need to run to completion in only about 120 ms: the 820 ms software runtime minus the roughly 700 ms we spend transferring buffers over PCIe.  For that to work we would need to run four times faster (not likely), process four times as much data per clock, or run four accelerators in parallel. Which one should we pick?  Let’s leave that for the next example.


Extra Exercises

Some things to try to build on this experiment:


Key Takeaways

  • Parallelizing the kernel data path can provide great results, but you need to take care that your data transfer doesn’t start to dominate your execution time, or if it does, that you still stay under your target execution time.
  • Simple computation isn’t always a good candidate for acceleration, unless you’re trying to free up the processor to work on other tasks.  Better candidates are computationally complex workloads, especially O(Nˣ) algorithms like nested loops.
  • If you have to wait for all of the data to transfer before you can begin processing, you may have trouble hitting your overall performance target even if your kernels are highly optimized.

Now we’ve seen that just parallelizing the data path isn’t quite enough.  Let’s see what we can do about those transfer times.


About Rob Armstrong

Rob Armstrong leads the AI and Software Acceleration technical marketing team at AMD, bringing the power of adaptive compute to bear on today’s most exciting challenges.  Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.