Introduction to DPUv3E

DPUv3E is a member of the Xilinx® DPU IP family for convolutional neural network (CNN) inference applications. It is designed for the latest Xilinx Alveo U50/U280 adaptable accelerator cards with HBM support. DPUv3E is a high-performance CNN inference IP optimized for throughput and data center workloads. It runs a highly optimized instruction set and supports all mainstream convolutional neural networks, such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.

DPUv3E is one of the fundamental IPs (overlays) of the Xilinx Vitis™ AI development environment, and users can use the Vitis AI toolchain to complete full-stack ML development with DPUv3E. Users can also use the standard Vitis flow to integrate DPUv3E with other customized acceleration kernels to realize powerful X+ML solutions. DPUv3E is provided in encrypted RTL or XO file format for the Vivado or Vitis based integration flows.

The major supported neural network operators include:

  • Convolution / Deconvolution
  • Max pooling / Average pooling
  • ReLU, ReLU6, and Leaky ReLU
  • Concat
  • Elementwise-sum
  • Dilation
  • Reorg
  • Fully connected layer
  • Batch Normalization
  • Split

DPUv3E is highly configurable. A DPUv3E kernel consists of several Batch Engines, an Instruction Scheduler, a Shared Weights Buffer, and a Control Register Bank. Following is the block diagram of a DPUv3E kernel with five Batch Engines.

DPUv3E Kernel

Batch Engine

The Batch Engine is the core computation unit of DPUv3E. Each Batch Engine handles one input image at a time, so multiple Batch Engines in a DPUv3E kernel can process several input images simultaneously. The number of Batch Engines in a DPUv3E kernel can be configured based on the available FPGA resources and the customer's performance requirements. For example, on the Alveo U280 card, SLR0 (with direct HBM connection) can contain a DPUv3E kernel with a maximum of four Batch Engines, while SLR1 or SLR2 can contain a DPUv3E kernel with five Batch Engines. Each Batch Engine contains a convolution engine to handle regular convolution/deconvolution computation and a MISC engine to handle pooling, ReLU, and other miscellaneous operations. The MISC engine is also configurable with optional functions according to specific neural network requirements. Each Batch Engine uses an AXI read/write master interface to exchange feature map data with device memory (HBM).

Instruction Scheduler

Conceptually similar to a general-purpose processor, the Instruction Scheduler carries out instruction fetch, decode, and dispatch. Since all the Batch Engines in a DPUv3E kernel run the same neural network, the Instruction Scheduler serves all the Batch Engines with the same instruction stream. The instruction stream is loaded by the host CPU into device memory (HBM) via the PCIe interface, and the Instruction Scheduler uses an AXI read master interface to fetch DPU instructions for the Batch Engines.

Shared Weights Buffer

The Shared Weights Buffer includes the buffering strategy and control logic needed to manage loading neural network weights from Alveo device memory and transferring them to the Batch Engines efficiently. Since all the Batch Engines in a DPUv3E kernel run the same neural network, the weights are loaded into the on-chip buffer once and shared by all the Batch Engines, eliminating unnecessary memory accesses and saving bandwidth. The Shared Weights Buffer uses two AXI read master interfaces to load weight data from device memory (HBM).
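To make the bandwidth saving concrete, here is a minimal standalone C++ sketch. It is illustrative only and not part of the DPUv3E deliverables; the weight size and engine count are made-up placeholder numbers. It simply compares the HBM weight traffic per batch when every Batch Engine would fetch its own copy of the weights versus when one copy is read once and shared on-chip.

    #include <cstdio>

    int main() {
        const double weight_mb     = 25.0; // hypothetical weight size of a CNN model, in MB
        const int    batch_engines = 5;    // Batch Engines in one DPUv3E kernel

        // Without sharing, each Batch Engine would read its own copy of the weights from HBM.
        const double traffic_private = weight_mb * batch_engines;

        // With the Shared Weights Buffer, the weights are read from HBM once per batch
        // and broadcast on-chip to all Batch Engines.
        const double traffic_shared = weight_mb;

        std::printf("HBM weight traffic per batch: %.0f MB (per-engine copies) vs %.0f MB (shared)\n",
                    traffic_private, traffic_shared);
        return 0;
    }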

Control Register Bank

The Control Register Bank is the control interface between the DPUv3E kernel and the host CPU. It implements a set of control registers compliant with the Vitis development flow. The Control Register Bank has an AXI slave interface.


Performance and Resource Utilization

The following table lists the performance data of a DPUv3E single kernel / single batch engine (the smallest configuration) for some typical neural networks, as well as two possible implementations on Alveo U50/U280. Note that because of the characteristics of the HBM memory system, the overall performance is nearly linear with the number of kernels and batch engines, which provides great flexibility to satisfy specific performance requirements with the least resource occupation and power consumption.

Configuration                                      NN Model        Frame Rate (FPS)
Single kernel with a single batch engine           ResNet50        98
                                                   Inception V1    193
                                                   Inception V2    138
                                                   Inception V3    44
                                                   Inception V4    23
                                                   YOLO V2         30
                                                   Tiny YOLO V2    90
                                                   FRCNN           35
Two kernels with five batch engines each (5+5)     ResNet50        1000
(U50 without power throttle limitation)
Three kernels with five, five, and four batch      ResNet50        1550
engines (5+5+4) (U280)
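To illustrate the nearly linear scaling, the short C++ sketch below divides the measured ResNet50 frame rates from the table by the total number of Batch Engines in each configuration; the per-engine throughput stays roughly constant. This is only a first-order illustration of the scaling behavior, not an official sizing tool.

    #include <cstdio>

    int main() {
        // Data points taken from the ResNet50 rows of the table above.
        struct Config { const char* name; int total_engines; double fps; };
        const Config configs[] = {
            {"1 kernel  x 1 engine (single batch engine)", 1,  98.0},
            {"2 kernels x 5 engines (U50, 5+5)",           10, 1000.0},
            {"3 kernels, 5+5+4 engines (U280)",            14, 1550.0},
        };

        // If scaling is nearly linear, FPS per Batch Engine should stay roughly constant.
        for (const Config& c : configs)
            std::printf("%-44s %6.0f FPS -> %5.1f FPS per engine\n",
                        c.name, c.fps, c.fps / c.total_engines);
        return 0;
    }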

Following are the resource utilization statistics of a typical DPUv3E kernel with five Batch Engines.

Configuration                              LUT       FF        BRAM    URAM    DSP
5 Batch Engines with Leaky ReLU support    250290    310752    628     320     2600

Integrating DPUv3E on Alveo HBM card with Vitis Flow

DPUv3E uses the HBM of Alveo HBM cards (U280, U50) as its external memory. The Alveo U280 and U50 have 8 GB of HBM (two 4 GB stacks), providing thirty-two HBM pseudo channels for customer logic and thirty-two 256-bit hardened HBM AXI ports. The Vitis target platforms for U280 and U50 (such as xilinx_u280_xdma_201920_1 and xilinx_u50_xdma_201920_1) provide the needed AXI bus fabric as well as the host-device interaction layer.

With Vitis, the steps to integrate DPUv3E on a U280/U50 card are simple and straightforward:

  1. Prepare the DPUv3E kernel files with a specific configuration (officially released .XO file)
  2. Prepare specific acceleration kernel files (.XO files, if any)
  3. Figure out the HBM port assignment for the DPUv3E kernels and other customized kernels (if any)
  4. Edit the v++ command-line scripts, Makefile, and/or configuration file
  5. Use v++ to finish the system build and get the final XCLBIN overlay file for U50/U280
  6. Use the Vitis and Vitis AI software flows for host application development

The following diagram shows an example HBM connection scheme on the Alveo U280 with two DPUv3E kernels (five batch engines each), four JPEG decoder kernels, and an image resizer kernel.

hbm-connection

The JPEG decoder in the example is a Xilinx IP, an RTL kernel packaged as an XO file. Following is the top-level port diagram; the kernel name is jpeg_decoder_v1_0.

jpeg-decoder

The image resizer kernel is a Vitis Vision library based HLS kernel synthesized to an XO file. Below is the main body of the kernel; the kernel name is resize_accel.

 

    #include "ap_int.h"
#include "common/xf_common.hpp"
#include "common/xf_utility.hpp"
#include "hls_stream.h"
#include "imgproc/xf_resize.hpp"
 
#define AXI_WIDTH 256
#define NPC XF_NPPC8
#define TYPE XF_8UC3
#define MAX_DOWN_SCALE 7
 
#define PRAGMA_SUB(x) _Pragma(#x)
#define DYN_PRAGMA(x) PRAGMA_SUB(x)
 
#define MAX_IN_WIDTH 3840
#define MAX_IN_HEIGHT 2160
#define MAX_OUT_WIDTH 3840
#define MAX_OUT_HEIGHT 2160
 
#define STREAM_DEPTH 8
#define MAX_DOWN_SCALE 7
 
extern "C"
{
    void resize_accel (ap_uint<AXI_WIDTH> *image_in,
                            ap_uint<AXI_WIDTH> *image_out,
                            int width_in,
                            int height_in,
                            int width_out,
                            int height_out)
    {
#pragma HLS INTERFACE m_axi port = image_in offset = slave bundle = image_in_gmem
#pragma HLS INTERFACE m_axi port = image_out offset = slave bundle = image_out_gmem
#pragma HLS INTERFACE s_axilite port = image_in bundle = control
#pragma HLS INTERFACE s_axilite port = image_out bundle = control
#pragma HLS INTERFACE s_axilite port = width_in bundle = control
#pragma HLS INTERFACE s_axilite port = height_in bundle = control
#pragma HLS INTERFACE s_axilite port = width_out bundle = control
#pragma HLS INTERFACE s_axilite port = height_out bundle = control
#pragma HLS INTERFACE s_axilite port = return bundle = control
 
        xf::cv::Mat<TYPE, MAX_IN_HEIGHT, MAX_IN_WIDTH, NPC> in_mat(height_in, width_in);
        DYN_PRAGMA(HLS stream variable = in_mat.data depth = STREAM_DEPTH)
 
        xf::cv::Mat<TYPE, MAX_OUT_HEIGHT, MAX_OUT_WIDTH, NPC> out_mat(height_out, width_out);
        DYN_PRAGMA(HLS stream variable = out_mat.data depth = STREAM_DEPTH)
 
#pragma HLS DATAFLOW
 
        xf::cv::Array2xfMat<AXI_WIDTH, TYPE, MAX_IN_HEIGHT, MAX_IN_WIDTH, NPC>(image_in, in_mat);
        xf::cv::resize<XF_INTERPOLATION_AREA,
                        XF_8UC3,
                        MAX_IN_HEIGHT,
                        MAX_IN_WIDTH,
                        MAX_OUT_HEIGHT,
                        MAX_OUT_WIDTH,
                        NPC,
                        MAX_DOWN_SCALE>(in_mat, out_mat);
        xf::cv::xfMat2Array<AXI_WIDTH, TYPE, MAX_OUT_HEIGHT, MAX_OUT_WIDTH, NPC>(out_mat, image_out);
    }
}

Following is the example v++ configuration file corresponding to the HBM connection block diagram.

    [connectivity]
    # ---------------------------------------------------------------
    # multiple instances of 'jpeg_decoder_v1_0'
    nk=jpeg_decoder_v1_0:4:jpeg_decoder_1.jpeg_decoder_2.jpeg_decoder_3.jpeg_decoder_4

    # SLR assignment of 'jpeg_decoder'
    slr=jpeg_decoder_1:SLR0
    slr=jpeg_decoder_2:SLR0
    slr=jpeg_decoder_3:SLR0
    slr=jpeg_decoder_4:SLR0

    # HBM port assignment of 'jpeg_decoder'
    sp=jpeg_decoder_1.m00_axi:HBM[16]
    sp=jpeg_decoder_2.m00_axi:HBM[17]
    sp=jpeg_decoder_3.m00_axi:HBM[18]
    sp=jpeg_decoder_4.m00_axi:HBM[19]

    # ---------------------------------------------------------------
    # single instance of 'resize_accel'
    nk=resize_accel:1:resize_accel_1

    # SLR assignment of 'resize_accel'
    slr=resize_accel_1:SLR0

    # HBM port assignment of 'resize_accel'
    sp=resize_accel_1.m_axi_image_in_gmem:HBM[20]
    sp=resize_accel_1.m_axi_image_out_gmem:HBM[21]

    # ---------------------------------------------------------------
    # multiple instances of 'dpuv3e_5be'
    nk=dpuv3e_5be:2:dpuv3e_5be_1.dpuv3e_5be_2

    # SLR assignment of 'dpuv3e_5be'
    slr=dpuv3e_5be_1:SLR1
    slr=dpuv3e_5be_2:SLR2

    # HBM port assignment of 'dpuv3e_5be'
    sp=dpuv3e_5be_1.dpu_axi_0:HBM[0]
    sp=dpuv3e_5be_1.dpu_axi_1:HBM[1]
    sp=dpuv3e_5be_1.dpu_axi_2:HBM[2]
    sp=dpuv3e_5be_1.dpu_axi_3:HBM[3]
    sp=dpuv3e_5be_1.dpu_axi_4:HBM[4]
    sp=dpuv3e_5be_1.dpu_axi_i:HBM[5]
    sp=dpuv3e_5be_1.dpu_axi_w0:HBM[6]
    sp=dpuv3e_5be_1.dpu_axi_w1:HBM[7]

    sp=dpuv3e_5be_2.dpu_axi_0:HBM[8]
    sp=dpuv3e_5be_2.dpu_axi_1:HBM[9]
    sp=dpuv3e_5be_2.dpu_axi_2:HBM[10]
    sp=dpuv3e_5be_2.dpu_axi_3:HBM[11]
    sp=dpuv3e_5be_2.dpu_axi_4:HBM[12]
    sp=dpuv3e_5be_2.dpu_axi_i:HBM[13]
    sp=dpuv3e_5be_2.dpu_axi_w0:HBM[14]
    sp=dpuv3e_5be_2.dpu_axi_w1:HBM[15]
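As a quick sanity check of step 3 (HBM port planning), the following small C++ sketch (purely illustrative) tallies the AXI master ports consumed by the connectivity file above: each five-batch-engine DPUv3E kernel uses eight ports (dpu_axi_0..4, dpu_axi_i, dpu_axi_w0/w1), each JPEG decoder uses one (m00_axi), and the resizer uses two, leaving headroom within the thirty-two available HBM AXI ports.

    #include <cstdio>

    int main() {
        // AXI master ports per kernel instance, as listed in the connectivity file above.
        const int dpu_kernels = 2,    ports_per_dpu    = 8;  // 5 feature-map + 1 instruction + 2 weight ports
        const int jpeg_kernels = 4,   ports_per_jpeg   = 1;  // m00_axi
        const int resize_kernels = 1, ports_per_resize = 2;  // image_in_gmem + image_out_gmem

        const int used = dpu_kernels * ports_per_dpu
                       + jpeg_kernels * ports_per_jpeg
                       + resize_kernels * ports_per_resize;

        std::printf("HBM AXI ports used: %d of 32\n", used);  // 22 of 32 in this example
        return 0;
    }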

You can use the following command line to build the XCLBIN (assuming the XO files for DPUv3E, the JPEG decoder, and the image resizer are dpuv3e_5be.xo, jpeg_decoder_v1_0.xo, and resize_accel.xo):

    v++ --link                               \
        --target hw                          \
        --platform xilinx_u280_xdma_201920_1 \
        --config example_config.txt          \
        --output dpuv3e_integration.xclbin   \
        dpuv3e_5be.xo jpeg_decoder_v1_0.xo resize_accel.xo
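Once the XCLBIN is built, the host application can be developed with the Vitis host programming APIs, while the DPUv3E kernels themselves are normally driven through the Vitis AI runtime. As an illustration only, the sketch below shows how the resize_accel kernel inside dpuv3e_integration.xclbin could be exercised with the XRT native C++ API, assuming an XRT release that provides xrt::device, xrt::kernel, and xrt::bo; the image dimensions are placeholder values and buffer sizing is simplified.

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include "xrt/xrt_bo.h"
    #include "xrt/xrt_device.h"
    #include "xrt/xrt_kernel.h"

    int main() {
        const int in_w = 1920, in_h = 1080;  // example input resolution (assumption)
        const int out_w = 224, out_h = 224;  // example output resolution (assumption)
        // XF_8UC3 pixels are 3 bytes each; sizing is simplified for illustration.
        const size_t in_bytes  = static_cast<size_t>(in_w) * in_h * 3;
        const size_t out_bytes = static_cast<size_t>(out_w) * out_h * 3;

        xrt::device device(0);                                      // open the Alveo card
        auto uuid = device.load_xclbin("dpuv3e_integration.xclbin");
        auto resize = xrt::kernel(device, uuid, "resize_accel");    // HLS kernel built above

        // Allocate device buffers in the HBM banks connected to the kernel arguments.
        xrt::bo in_bo(device, in_bytes, resize.group_id(0));
        xrt::bo out_bo(device, out_bytes, resize.group_id(1));

        std::vector<uint8_t> image(in_bytes, 0);                    // placeholder input image
        in_bo.write(image.data());
        in_bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);

        // Argument order matches the resize_accel signature shown earlier.
        auto run = resize(in_bo, out_bo, in_w, in_h, out_w, out_h);
        run.wait();

        out_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
        std::vector<uint8_t> resized(out_bytes);
        out_bo.read(resized.data());
        return 0;
    }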

Conclusion

DPUv3E is a flexible, high-performance convolutional neural network inference IP targeting Alveo HBM cards. The number of DPUv3E kernels and the number of Batch Engines in each kernel are fully configurable, which makes it highly adaptable to the requirements of a specific scenario. Users can easily integrate DPUv3E with their own acceleration kernels using the Vitis and Vitis AI flows.

