Background

Many of today's workloads and applications, such as AI, data analytics, live video transcoding, and genomic analytics, require an increasing amount of memory bandwidth. Traditional DDR memory solutions have not been able to keep up with growing compute capacity, and for memory bandwidth-intensive workloads data movement and access have become the bottleneck. The figure below shows compute capacity growth versus traditional DDR bandwidth growth.

Figure: Compute capacity improvement vs. DDR bandwidth improvement

High-bandwidth memory (HBM) helps alleviate this bottleneck by providing more storage capacity and data bandwidth, using system-in-package (SiP) memory technology to stack DRAM chips vertically and connect them through a wide (1024-bit) interface.

Figure: HBM 2.5D structure

The Virtex UltraScale+ HBM-enabled devices (VU+ HBM) close the bandwidth gap with greatly improved bandwidth capabilities, up to 460GB/s delivered by two HBM2 stacks. These devices also include up to 2.85 million logic cells and up to 9,024 DSP slices capable of delivering 28.1 peak INT8 TOPs. For more details on how Xilinx VU+ HBM devices accelerate applications, refer to WP508.

The purpose of this article is to discuss which design aspects can negatively impact memory bandwidth, what options are available to improve the bandwidth, and one way to profile the HBM bandwidth to illustrate the trade-offs. These same techniques can be used to profile HBM bandwidth on the Alveo U280, the VCU128, and any Xilinx UltraScale+ HBM device. They can also be used with any accelerated application based on a pre-existing DSA or a custom DSA. We'll explain the process for creating a custom DSA in Vivado and how to use the Xilinx® Vitis™ unified software platform to create C/C++ kernels and memory traffic to profile the HBM stacks.


What can impact memory bandwidth?

Before discussing what impacts memory bandwidth, let's explain how bandwidth is calculated. Using VU+ HBM as an example, with two HBM2 stacks available these devices can provide a theoretical bandwidth of up to 460GB/s:

  • 2 HBM2 stacks
  • Each stack has 16 channels
  • Each channel is 64 data (DQ) bits wide
  • Data can be transferred at up to 1800Mbps per pin
  • Theoretical bandwidth = 2 x 16 x 64 x 1800Mbps = 3.686Tb/s, or 460GB/s
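
If you prefer to see the arithmetic spelled out, here is a minimal C sketch of the same calculation, including how a measured number can be turned into an efficiency figure. The names and values are illustrative only (measured_gbs is a placeholder, not a benchmark result).

#include <stdio.h>

/* Back-of-the-envelope HBM bandwidth numbers, using the figures above. */
int main(void)
{
    const double stacks    = 2.0;    /* HBM2 stacks                 */
    const double channels  = 16.0;   /* channels per stack          */
    const double dq_bits   = 64.0;   /* data (DQ) bits per channel  */
    const double rate_mbps = 1800.0; /* per-pin data rate in Mb/s   */

    double total_tbps = stacks * channels * dq_bits * rate_mbps / 1.0e6; /* Tb/s */
    double total_gbs  = total_tbps * 1000.0 / 8.0;                       /* GB/s */
    double per_mc_gbs = total_gbs / 16.0;   /* 16 memory controllers in total   */

    double measured_gbs = 400.0;            /* placeholder measured throughput  */
    printf("Theoretical: %.3f Tb/s (%.1f GB/s total, %.1f GB/s per MC)\n",
           total_tbps, total_gbs, per_mc_gbs);
    printf("Efficiency : %.1f %%\n", 100.0 * measured_gbs / total_gbs);
    return 0;
}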

Anyone who has worked with external DRAM interfaces knows achieving theoretical bandwidth is not possible. In fact, depending on several different factors, it can be difficult to even come close. Here are several of the top contributing factors that can negatively impact your effective bandwidth.

  1. Traffic Pattern (a small address-pattern sketch follows this list)
    1. Addressing Pattern
      1. Random access pattern: DRAM requires opening (ACT) and closing (PRE) rows within a bank. Random accesses require more of this maintenance, and no data can be transferred while it takes place.
      2. Bank Group pattern: Some DRAM architectures (e.g. DDR4, HBM) have overhead associated with consecutive accesses to the same Bank Group.
      3. Short bursts of, or alternating, read/write data: The DQ bits are bi-directional and incur a bus turnaround time when switching direction.
  2. DRAM maintenance and overhead
    1. Activate (ACT): opens a new row within a bank
    2. Precharge (PRE): closes a row within a bank
    3. Refresh (REF): runs periodically to refresh and restore the memory cell values
    4. ZQ Calibration (ZQCL/ZQCS): required to compensate for voltage and temperature drift
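
To make the traffic-pattern point concrete, here is a small, hypothetical sketch of how a software traffic generator might sequence addresses: a linear walk that keeps hitting open rows versus a random pattern that forces frequent ACT/PRE cycles. The constants and helper names are assumptions for illustration, not from any Xilinx IP.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative address-pattern generators for a software traffic generator.
   RANGE_BYTES and BURST_BYTES are arbitrary example values.                 */
#define RANGE_BYTES  (256u * 1024u * 1024u)  /* e.g. one 256MB pseudo channel */
#define BURST_BYTES  256u                    /* e.g. one AXI burst of data    */

/* Linear pattern: consecutive bursts stay within an open row for as long as
   possible, so the controller issues few ACT/PRE commands.                   */
static uint32_t next_linear(uint32_t prev)
{
    return (prev + BURST_BYTES) % RANGE_BYTES;
}

/* Random pattern: each burst is likely to land in a different row/bank,
   adding ACT/PRE overhead and lowering effective bandwidth.                  */
static uint32_t next_random(void)
{
    return ((uint32_t)rand() % (RANGE_BYTES / BURST_BYTES)) * BURST_BYTES;
}

int main(void)
{
    uint32_t lin = 0;
    for (int i = 0; i < 4; ++i) {
        lin = next_linear(lin);
        printf("linear: 0x%08x   random: 0x%08x\n", lin, next_random());
    }
    return 0;
}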

In VU+ HBM, there is a hardened AXI Switch which enables access from any of the 32 AXI channels to any of the HBM pseudo channels and addressable memory. 

Figure: Hardened AXI switch in UltraScale+ HBM

There are many advantages to having a hardened switch, such as flexible addressing and a reduction in design complexity and routing congestion; WP485 highlights many of them if you're interested. To enable flexible addressing across the entire HBM stacks, the hardened AXI switch contains switch boxes arranged in segments of 4 masters x 4 slaves.

Figure: HBM switch internal connections

This facilitates the flexible addressing, but there is a limitation that can impact memory bandwidth: only four horizontal paths are available between segments, so depending on which AXI channel is accessing which addressable memory location in the HBM stack, arbitration for those paths can significantly limit your achievable bandwidth.
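
As a rough illustration of why locality matters, the sketch below maps an AXI port index to the pseudo channel (PC) whose address range sits directly below it, assuming 32 pseudo channels of 256MB each (an 8GB, two-stack device such as the one on the U280 or VCU128). Keeping a master's traffic in or near its own 4-port switch segment avoids contending for the four horizontal paths. The helper names, and the assumption of a flat port-to-PC alignment, are for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: assumes 32 pseudo channels (PCs) of 256MB each and that
   AXI port i lines up with PC i in the flat HBM address map.                 */
#define PC_SIZE_BYTES   (256ull * 1024 * 1024)
#define NUM_PCS         32u
#define PORTS_PER_SEG   4u   /* AXI ports sharing one 4x4 switch segment */

static uint64_t pc_base_addr(unsigned pc)     { return (uint64_t)pc * PC_SIZE_BYTES; }
static unsigned local_pc_for_port(unsigned p) { return p; }  /* same index */
static unsigned segment_of_port(unsigned p)   { return p / PORTS_PER_SEG; }

int main(void)
{
    for (unsigned port = 0; port < 8; ++port) {
        unsigned pc = local_pc_for_port(port);
        printf("AXI port %2u -> PC %2u, base 0x%09llx, switch segment %u\n",
               port, pc, (unsigned long long)pc_base_addr(pc), segment_of_port(port));
    }
    return 0;
}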


How to maximize memory bandwidth

Now that we know some of the contributing factors to poor memory bandwidth, let's discuss some options available to mitigate them.

Consider changing your command and addressing patterns. Since random accesses and short bursts of read/write transactions result in the worst bandwidth, see if you can alter these in the user application. This will give you the biggest bang for your buck.

If you’re unable to change your traffic pattern the HBM Memory Controller IP has several options available that may help:

  • Custom Address Mapping: As mentioned previously, random accesses require higher rates of ACT and PRE commands. With a custom address map, you can define how AXI addresses map to HBM memory addresses, which can increase the number of page hits and improve bandwidth.
  • Bank Group Interleave: Enables sequential address operations to alternate between even and odd bank groups to maximize bandwidth efficiency.
  • Enable Request Re-Ordering: Enables the controller to re-order commands (e.g. coalescing commands to reduce bus turnaround times).
  • Enable Close Page Reorder: Enables the controller to close a page after an instruction has completed. If disabled, the page remains open until a higher priority operation is requested for another page in the same bank. This can be advantageous or not, depending on whether you are using a random, linear, or custom addressing pattern.
  • Enable Look Ahead Pre-Charge: Enables controller to re-order commands to minimize PRE commands.
  • Enable Look Ahead Activate: Enables controller to re-order commands to minimize ACT commands.
  • Enable Lookahead Single Bank Refresh: Enables the controller to insert refresh operations based on pending operations to maximize efficiency.
  • Single Bank Refresh: Instructs the controller to refresh banks individually instead of all at once.
  • Enable Refresh Period Temperature Compensation: This enables the controller to dynamically adjust the refresh rate based on the temperature of the memory stacks.
  • Hold Off Refresh for Read/Write: This allows the controller to delay a refresh to permit operations to complete first.

The figures below are taken from our VCU128 HBM Performance and Latency demo and attempt to highlight the bandwidth/throughput results from several different AXI Switch configurations.

Figure: Bandwidth results from different AXI switch configurations

HBM Monitor

New to Vivado is the HBM monitor which, similar to SysMon, can display the die temperature of each HBM2 die stack individually. It can also display the bandwidth on a per-MC or per-Pseudo Channel (PC) basis.

Figure: HBM monitor activity set-up window

For this test, read-only traffic is sent across all MCs. Only MC0 was added to the HBM monitor, and it reports that the read bandwidth is 26.92GBps. This is around 90% efficiency, with the theoretical bandwidth being 30GBps.

Figure: HBM monitor window within the Vivado hardware manager

To profile your hardware design and HBM configuration properly, start with the default HBM settings and capture the read/write throughput as your baseline. Then regenerate new .bit files using each of the HBM MC options discussed earlier, and combinations of them, to determine which provides the highest throughput. Note that how the AXI switch is configured can also impact the HBM bandwidth and throughput, and it should be considered when profiling as well.

A future update to this article will provide profiling results from using various MC options. We will also explore using the AXI Performance Monitors for profiling bandwidth to the HBM AXI channels.

If you’re using a pre-existing design and the Vitis tool, you will need to modify the hardware platform design using a custom DSA flow. This flow will be described later in the article.


Design requirements

To profile the HBM bandwidth, create a new design or use an existing design or application. To profile different HBM configurations, you will need access to the hardware design in order to modify the HBM IP core and then generate new bitstreams and new .xsa/.dsa files, which are used in the Vitis tool for software development.

What is Vitis technology you ask? Vitis is a unified software tool that provides a framework for developing and delivering FPGA accelerated data center applications using standard programming languages and for creating software platforms targeting embedded processors.

For existing designs, refer to GitHub, the SDAccel Examples repositories, the U280 product page, and the VCU128 product page, which contain targeted reference designs (TRDs). If you are targeting a custom platform, or even the U280 or VCU128, and need to create a custom hardware platform design, this can also be done.

Why do I need to create a custom hardware platform for the Alveo U280 if DSAs already exist? As workload algorithms evolve, reconfigurable hardware enables Alveo to adapt faster than fixed-function accelerator card product cycles. Having the flexibility to customize and reconfigure the hardware gives Alveo a unique advantage over the competition. In the context of this tutorial, we want to customize and generate several new hardware platforms using different HBM IP core configurations, profile their impact on memory bandwidth, and determine which provides the best results.

There are several ways to build a custom hardware platform, but the quickest is to use Vivado IP Integrator (IPI). I'll walk you through one way to do this using MicroBlaze to generate the HBM memory traffic in software. This could also be done in HLS, SDAccel, or the Vitis tool with hardware-accelerated memory traffic. Using MicroBlaze as the traffic generator makes it easy to control the traffic pattern, including the memory address locations, and we can modify the default memory test template to create loops and various patterns to help profile the HBM bandwidth effectively.
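
As a rough sketch of the idea, the bare-metal loop below (in the spirit of the SDK/Vitis memory test template) writes and then reads back a block of one HBM pseudo channel and derives an effective bandwidth from a free-running timer. The base address, test size, timer frequency, and the read_timer() helper are placeholders that depend on your block design (the timer could be an AXI Timer read through the XTmrCtr driver, for example); this is not the actual template code.

#include <stdint.h>
#include "xil_io.h"       /* Xil_Out32 / Xil_In32 from the standalone BSP */
#include "xil_printf.h"   /* xil_printf                                   */

/* Placeholders: adjust to match your block design and timer source.      */
#define HBM_BASE_ADDR   0x00000000u        /* PC0 base in this example    */
#define TEST_BYTES      (1024u * 1024u)    /* bytes touched per pass      */
#define TIMER_FREQ_HZ   100000000ull       /* e.g. an AXI Timer at 100MHz */

/* Hypothetical helper: returns a free-running cycle count (implementation
   not shown; e.g. an AXI Timer read via the XTmrCtr driver).              */
extern uint64_t read_timer(void);

static uint64_t mbytes_per_sec(uint64_t bytes, uint64_t cycles)
{
    return (bytes * TIMER_FREQ_HZ) / cycles / (1024u * 1024u);
}

int main(void)
{
    uint64_t start, cycles;
    uint32_t addr;

    /* Write pass: simple linear pattern over one pseudo channel. */
    start = read_timer();
    for (addr = 0; addr < TEST_BYTES; addr += 4)
        Xil_Out32(HBM_BASE_ADDR + addr, addr);
    cycles = read_timer() - start;
    xil_printf("Write: %d MB/s\r\n", (int)mbytes_per_sec(TEST_BYTES, cycles));

    /* Read pass over the same range. */
    start = read_timer();
    for (addr = 0; addr < TEST_BYTES; addr += 4)
        (void)Xil_In32(HBM_BASE_ADDR + addr);
    cycles = read_timer() - start;
    xil_printf("Read : %d MB/s\r\n", (int)mbytes_per_sec(TEST_BYTES, cycles));

    return 0;
}

Keep in mind that single 32-bit accesses from a soft processor will not come close to saturating the HBM links; the value of this approach is in exercising the address map and comparing one HBM configuration against another, while peak-bandwidth numbers are better gathered with hardware traffic generators or accelerated kernels as mentioned above.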

The steps to build a design in the Vitis tool or the SDK are similar and will look something like this:

  1. Open Vivado
    File=>Project=>New
    1. Create a new or open an existing Vivado design
      1. Target the U280, VCU128, or whichever US+ HBM device is being used

Figure: Board selection for the Vivado project

  2. Create the block design
    1. Add the HBM IP core
      Note: Ensure the project contains an IP Integrator (IPI) block design that includes HBM and MicroBlaze. This block design is what we refer to as the hardware design; to achieve near maximum theoretical bandwidth (460GB/s) for both HBM2 stacks you'll need to drive continuous traffic to all 16 available Memory Controllers (MCs) via the AXI channels.
Figure: Configuration view within the HBM IP core

    2. Add MicroBlaze, a UART, and any additional peripheral IP needed

Figure: Example HBM block design
  3. Validate the design and generate output products
    1. validate_bd_design
    2. generate_target all [get_files <>.bd]
  4. Create the HDL wrapper for the .bd
    1. make_wrapper -files [get_files <>.bd] -top
  5. Run synthesis
  6. Run Implementation
  7. Generate Bitstream
  8. Export Hardware
    1. File=>Export Hardware…
    2. If using the Vitis tool you may need to follow these instructions:
      1. (If using 2019.2) write_hw_platform -fixed <>.xsa
      2. (If using 2019.1) write_dsa -fixed <>.dsa
  9. Launch the Vitis tool

  10. Select a workspace

Figure: Select workspace

  11. Create a new application project and Board Support Package

Figure: Create new application project

  12. Click Next, select Create from hardware, click "+", and point to the .xsa

  13. Click Next, select CPU: MicroBlaze and Language: C

  14. Click Next, select "Memory Tests" and click Finish

  15. Build and run the memory test on the target

The memory test template is a good starting point for generating traffic, as it runs through all of the AXI channels enabled in your design, and the HBM memory ranges and traffic patterns can be easily modified.
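
For example, one simple modification is to sweep the test over each HBM pseudo-channel range enabled in the design so that every AXI channel is exercised. The sketch below is illustrative only: the range table and base addresses are hypothetical placeholders (the real template derives its memory ranges from your hardware design), and Xil_TestMem32 is the standalone BSP test routine.

#include <stdint.h>
#include "xil_printf.h"
#include "xil_testmem.h"   /* Xil_TestMem32 from the standalone BSP */

/* Hypothetical table of HBM ranges to sweep; the memory test template
   generates a similar list from your hardware design.                 */
struct hbm_range { const char *name; uint32_t base; uint32_t words; };

static const struct hbm_range ranges[] = {
    { "PC0", 0x00000000u, 4096u },
    { "PC1", 0x10000000u, 4096u },   /* example bases only; use your design's map */
};

int main(void)
{
    for (unsigned i = 0; i < sizeof(ranges) / sizeof(ranges[0]); ++i) {
        int status = Xil_TestMem32((u32 *)(uintptr_t)ranges[i].base,
                                   ranges[i].words,
                                   0xA5A5A5A5u,
                                   XIL_TESTMEM_ALLMEMTESTS);
        xil_printf("%s: %s\r\n", ranges[i].name, status == 0 ? "PASS" : "FAIL");
    }
    return 0;
}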

Note: A future update to this article will include a reference design that can be used for HBM profiling.


Conclusion

This article has explained why HBM is needed to keep up with growing bandwidth demands that traditional DDR cannot meet, and hopefully it has educated you on what can impact DRAM bandwidth, the options available to maximize your bandwidth, and how to monitor and profile your results.

Using Vitis technology to generate and accelerate HBM traffic is a quick and easy way to verify that your bandwidth requirements are met and to profile various HBM configurations to determine which is optimal for your system.

Stay tuned to this article for future updates including reference designs, software accelerated traffic, custom hardware DSAs to profile different MC options, and bandwidth profiling results in the HBM monitor.

Citations

UG1352 – Get Moving with Alveo

WP485 – Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance

PG276 – AXI High Bandwidth Memory Controller v1.0


About Chris Riley

Chris Riley is a FAE based in Colorado with particular expertise in all things memory-related.  He has spent his entire career at AMD troubleshooting technical issues for customers and still enjoys it (imagine that!).  In his spare time, he enjoys spending time and traveling with his wife and two young kids.  He is a ski bum at heart, and can spend hours talking shop and all things ski.  He has also become obsessed with mountain biking which occupies any remaining free time when there’s no snow on the ground.