Rather than dive immediately into the API, especially if you don’t have much experience with acceleration, let’s start with a simplified metaphor to help explain what must happen in an acceleration system.
Imagine that you have been given the job of giving tours of your city. Your city may be large or small, dense or sparse, and may have specific traffic rules that you must follow. This is your application space. Along the way, you are tasked with teaching your wards a specific set of facts about your city, and you must hit a particular set of sights (and shops - you have to get paid, after all). This set of things you must do is your algorithm.
Given your application space and your algorithm, you started small: your tours fit into one car, and every year you would buy a new one that was a bit bigger, a bit faster, and so on. But, as your popularity grew, more and more people started to sign up for the tours. There’s a limit to how fast your car can go (even the sports model), so how can you scale? This is the problem with CPU acceleration.
The answer is a tour bus! It might have a lower top speed and take much longer to load and unload than your earlier sports car idea, but you can transport many more people - much more data- through your algorithm. Over time, your fleet of buses can expand and you can get more and more people through the tour. This is the model for GPU acceleration. It scales very well - until, that is, your fuel bills start to add up, there’s a traffic jam in front of the fountain, and wouldn’t you know it, the World Cup is coming to town!
If only the city would allow you to build a monorail. On their dollar, with the permits pre-approved. Train after train, each one full, making all the stops along the way (and with full flexibility to bulldoze and change the route as needed). For the tour guide in our increasingly tortured metaphor, this is fantasy. But for you, this is the model for Alveo™ acceleration. FPGAs combine the parallelism of a GPU with the low-latency streaming of a domain-specific architecture for unparalleled performance.
But, as the joke goes, even a Ferrari isn’t fast enough if you never learn to change gears. So, let’s abandon metaphor and roll up our sleeves.
In general, for acceleration systems we are trying to hit a particular performance target, whether that target is expressed in terms of overall end-to-end latency, frames per second, raw throughput, or something else.
Generally speaking, candidates for acceleration are algorithmic processing blocks that work on large chunks of data in deterministic ways. In 1967, Gene Amdahl famously proposed what has since become known as Amdahl’s Law. It expresses the potential speedup in a system:
In this equation,
Slatency represents the theoretical speedup of a task,
s is the speedup of the portion of the algorithm that benefits from acceleration, and
p represents the proportion of the execution time the task occupied pre-acceleration.
Amdahl’s Law demonstrates a clear limit to the benefit of acceleration in a given application, as the portion of the task that cannot be accelerated (generally involving decision making, I/O, or other system overhead tasks) will always become the system bottleneck.
If Amdahl’s Law is correct, however, then why do so many modern systems use either general-purpose or domain-specific acceleration? The crux lies here: modern systems are processing an ever-increasing amount of data, and Amdahl’s Law only applies to cases where the problem size is fixed. That is to say, the limit is imposed only so long as p is a constant fraction of overall execution time. In 1988 John Gustafson and Edwin Barsis proposed a reformulation that has since become known as Gustafson’s Law. It re-casts the problem:
Again, Slatency represents the theoretical speedup of a task,
s is the speedup in latency of the task benefiting from parallelism, and
p is the percentage of the overall task latency of the part benefiting from improvement (prior to the application of the improvement).
Gustafson’s Law re-frames the question raised by Amdahl’s Law. Rather than more compute resources speeding up the execution of a single task, more compute resources allow more computation in the same amount of time.
Both of the laws frame the same thing in different ways: to accelerate an application, we are going to parallelize it. Through parallelization we are attempting to either process more data in the same amount of time, or the same amount of data in less time. Both approaches have mathematical limitations, but both also benefit from more resources (although to different degrees).
In general, Alveo cards are useful for accelerating algorithms with lots of “number crunching” on large data sets - things like video transcoding, financial analysis, genomics, machine learning, and other applications with large blocks of data that can be processed in parallel. It’s important to approach acceleration with a firm understanding of how and when to apply external acceleration to a software algorithm. Throwing random code onto any accelerator, Alveo or otherwise, does not generally give you optimal results. This is for two reasons: first and foremost, leveraging acceleration sometimes requires restructuring a sequential algorithm to increase its parallelism. Second, while the Alveo architecture allows you to quickly process parallel data, transferring data over PCIe and between DDR memories has an inherent additive latency. You can think of this as a kind of “acceleration tax” that you must pay to share data with any external accelerator.
With that in mind, we want to look for areas of code that satisfy several conditions. They should:
Before diving into the software, let’s familiarize ourselves with the capabilities of the Alveo Data Center accelerator card itself. Each Alveo card combines three essential things: a powerful FPGA or ACAP for acceleration, high-bandwidth DDR4 memory banks, and connectivity to a host server via a high-bandwidth PCIe Gen3x16 link. This link is capable of transferring approximately 16 GiB of data per second between the Alveo card and the host.
Recall that unlike a CPU, GPU, or ASIC, an FPGA is effectively a blank slate. You have a pool of very low-level logic resources like flip-flops, gates, and SRAM, but very little fixed functionality. Every interface the device has, including the PCIe and external memory links, is implemented using at least some of those resources.
To ensure that the PCIe link, system monitoring, and board health interfaces are always available to the host processor, Alveo designs are split into a shell and role conceptual model. The shell contains all of the static functionality: external links, configuration, clocking, etc. Your design fills the role portion of the model with custom logic implementing your specific algorithm(s). You can see this topology reflected in figure 2.1.
The Alveo FPGA is further subdivided into multiple super logic regions (SLRs), which aid in the architecture of very high-performance designs. But this is a slightly more advanced topic that will remain largely unnoticed as you take your first steps into Alveo development.
Alveo cards have multiple on-card DDR4 memories. These memories have a high bandwidth to and from the Alveo device, and in the context of OpenCL are collectively referred to as the device global memory. Each memory bank has a capacity of 16 GiB and operates at 2400 MHzDDR. This memory has a very high bandwidth, and your kernels can easily saturate it if desired. There is, however, a latency hit for either reading from or writing to this memory, especially if the addresses accessed are not contiguous or we are reading/writingshort data beats.
The PCIe lane, on the other hand, has a decently high bandwidth but not nearly as high as the DDR memory on the Alveo card itself. In addition, the latency hit involved with transferring data over PCIe is quite high. As a general rule of thumb, transfer data over PCIe as infrequently as possible. For continuous data processing, try designing your system so that your data is transferring while the kernels are processing something else. For instance, while waiting for Alveo to process one frame of video you can be simultaneously transferring the next frame to the global memory from the CPU.
There are many more details we could cover on the FPGA architecture, the Alveo cards themselves, etc. But for introductory purposes, we have enough details to proceed. From the perspective of designing an acceleration architecture, the important points to remember are:
Although this may seem obvious, any hardware acceleration system can be broadly discussed in two parts: the hardware architecture and implementation, and the software that interacts with that hardware. For the Alveo cards, regardless of any higher-level software frameworks you may be using in your application such as FFmpeg, GStreamer, or others, the software library that fundamentally interacts with the Alveo hardware is the Xilinx Runtime (XRT).
While XRT consists of many components, its primary role can be boiled down to three simple things:
Rob Armstrong leads the AI and Software Acceleration technical marketing team at Xilinx, bringing the power of adaptive compute to bear on today’s most exciting challenges. Rob has extensive experience developing FPGA and ACAP accelerated hardware applications ranging from small-scale, low-power edge applications up to high-performance, high-demand workloads in the datacenter.