Introduction

The Ultra96™ is a great platform for building edge machine learning applications. The 96Boards form factor, together with the programmable logic on the Zynq® MPSoC ZU3 device, gives us the flexibility to add the common MIPI CSI-2 RX interface used for video input in these types of end applications, while the Xilinx® Deep Learning Processing Unit (DPU) can be composed into the design to drive high-performance, low-power machine learning at the edge.

Since many users adopting the Xilinx Vitis™ unified software platform flow for the first time will be starting with existing Vivado-based designs, we'll begin our tutorial by converting a traditional design implemented in Vivado IP Integrator into an acceleration-ready Vitis target platform. Taking advantage of the common 96Boards form factor of the Ultra96, the MIPI pipeline in this design uses an OV5640 imaging sensor on a MIPI imaging mezzanine card, configured for YUYV output, so that video passes directly from the MIPI CSI-2 RX IP through a framebuffer write DMA core into the PS DDR. Next, we'll show the steps to update the PetaLinux project with the libraries and drivers needed to create a Vitis software platform capable of supporting hardware-accelerated workflows, including a DPU-based machine learning application.

Once the hardware and software platform components are complete, we'll use the Vitis development kit to combine them into a Vitis acceleration platform that we can build hardware-accelerated software applications against. Finally, we'll walk through the integration of the DPU for machine learning acceleration and use the provided DPU runtime to evaluate a high-performance face detection application running on streaming MIPI input from the generated platform.


Requirements for the Design

This section lists the software and hardware tools required to use the Xilinx® Deep Learning Processing Unit (DPU) IP to accelerate machine learning algorithms.

  • Xilinx Design tools 2019.2
  • Board files for Ultra96 v1
  • The Ultra96 board (v1)
  • 12V power supply for Ultra96
  • MicroUSB to USB-A cable
  • AES-ACC-USB-JTAG board
  • A blank, FAT32 formatted microSD card
  • Xilinx CSI2 RX MIPI IP License
  • D3 Engineering Designcore Camera Mezzanine Board OV5640 (MIPI Input Source)

Optional

  • DisplayPort monitor
  • Mini-display port cable suitable for the chosen monitor
  • USB Webcam

Preparing your workspace

Clone this repository to your local machine and download the reference files. After downloading, unzip them into the reference_files directory of the cloned repository. The rest of the folders in this hierarchy will be empty after cloning and will be populated as you work through the tutorial.
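For example, the workspace setup from a terminal might look roughly like the following (a sketch only; the repository URL and archive name below are placeholders, not the real project locations):

    # Placeholders: substitute the actual repository URL and reference-files archive name
    git clone <repository-url> u96v1_mipi_tutorial
    cd u96v1_mipi_tutorial
    unzip <reference_files_archive>.zip -d reference_files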


Generate the base MIPI project

We will first create the original, non-accelerated MIPI project in the Vivado® and Petalinux tools. After this step, you will have bootable hardware and software images and can launch a pipeline to view the MIPI video input on the Ultra96.

Vivado

  1. Copy the "scripts" folder from the [reference_files/base_project/vivado] directory to the top level vivado directory
  2. Launch Vivado and create a new project named "u96v1_mipi" in the top level "vivado" directory making sure to select the "create project subdirectory" option, then click next
  3. Select to create an "RTL project," then click next
  4. Select "Add Files" and choose the "u96v1_mipi_wrapper.v" file from the [reference_files/base_project/vivado/sources] directory, then click next
  5. Select "Add Files" and choose the "cnst.xdc" file from the [reference_files/base_project/vivado/sources] directory, then click next
  6. Select the "Boards" tab and choose the Ultra96 V1 board for the project, before selecting next and finish to create the project
  7. Within the TCL console tab at the bottom of the screen, change directory to the top level "vivado" directory
  8. Use the TCL console to call source ./scripts/u96v1_mipi_static.tcl
  9. Select to "Generate Block Design" within the Flow Navigator window and allow this to complete
  10. Build the project to bitstream
  11. Select "File --> Export --> Export Hardware" and export the hardware to the hw_platform top level directory, including the bitstream

Petalinux

  1. Source the Petalinux tools
  2. Change directory to the top level directory and create a new petalinux project with the following command: petalinux-create -t project -n petalinux -s reference_files/base_project/bsps/u96v1_mipi_static.bsp
  3. Change directory into the newly created project directory
  4. Run petalinux-config --get-hw-description ../hw_platform --silentconfig to import the hardware you generated
  5. Explore the petalinux-config -c rootfs and petalinux-config -c kernel menus to see what customizations were made to include MIPI on the Ultra96
  6. Run petalinux-build to build the system
  7. Run petalinux-package --boot --force --fsbl --pmufw --u-boot --fpga to create BOOT.bin
  8. Copy BOOT.bin and image.ub from the [petalinux/images/linux] directory to an SD card and use this to boot the system.
  9. Test the video input with a GStreamer pipeline (an example sketch is shown below this list)
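The exact pipeline depends on how the capture subsystem enumerates under Linux; the commands below are only a sketch that assumes the MIPI pipeline shows up as /dev/video0, that the resolution caps shown are supported by your sensor configuration, and that a DisplayPort monitor is attached for kmssink output:

    # Sketch: assumes the MIPI capture path enumerates as /dev/video0 with YUY2 output
    gst-launch-1.0 v4l2src device=/dev/video0 ! video/x-raw,format=YUY2,width=640,height=480 ! videoconvert ! kmssink
    # Headless alternative: report the capture frame rate without a display
    gst-launch-1.0 -v v4l2src device=/dev/video0 ! video/x-raw,format=YUY2,width=640,height=480 ! fpsdisplaysink video-sink=fakesink text-overlay=false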

Creating the Vitis Hardware Platform

We'll now make the necessary additions and modifications to the hardware design to prepare it for software-defined acceleration. Open the base Vivado project to get started.

Configure the MPSoC blocks

As we add components to the hardware design to accommodate acceleration, we need to customize the Processing System. Here, we will modify the configuration to create additional clocks, open up additional interrupt ports, and enable an AXI master port so that we can add additional peripherals to the design.

  1. Double click the Zynq IP block to open the Processor Configuration Wizard (PCW)
  2. Go to the Clock Configuration tab, select the "Output Clocks" tab, and expand Low Power Domain Clocks->PL Fabric Clocks
  3. Enable PL1, and change the requested frequency of both clocks to 100MHz, selecting IOPLL as the source clock
  4. Go to the PS-PL Configuration tab, expand General->Fabric Resets, and enable a second fabric reset
  5. Add another AXI Master port that we'll use later to connect our interrupt controller - click on PS-PL interfaces->Master interface, and enable AXI HPM0 LPD
  6. Expand General->Interrupts->PL to PS
  7. Enable IRQ1[0-7]
  8. Leave the PCW by clicking OK

Configure Platform interfaces

In order for the Vitis tool to insert our hardware acceleration blocks into the design, we need to leave open and designate the interfaces it can use to connect those blocks. In this design, we'll need a few memory-mapped interfaces so that the DPU can connect into the PS DDR. We'll open up three HP slave ports on this platform, since the DPU block has three memory-mapped master interfaces. This portion of the process also allows us to "name" each port, giving it a shorter nickname we can use to designate connections later on.

  1. In the Window menu, select "Platform Interfaces"
  2. In the Platform Interfaces tab, click "Enable platform interfaces"
  3. Add the three PS HPx_FPD slaves that we're not already using by right clicking on the interface and selecting Enable (HP 0, 1, 2)
  4. Also enable the HPM0 master interface (make sure this interface is disabled on the Zynq PS block); this master will be used by the tools to connect to the accelerators
  5. For each slave interface enabled, add a "sptag" value in the options tab that will be used to reference the port later in the flow: HP0, HP1, and HP2, respectively

Designate Platform clocks

Similarly to how we designated the interfaces for the platform, we now have to indicate to the tools which clocks to use for the accelerators placed in the platform. In this case, the DPU uses two clocks (a 1x and a 2x clock), so we will expose both a 250 MHz and a 500 MHz clock to the platform. The DPU can be clocked faster or slower than this; these rates were chosen to balance power and frame-rate performance in our application.

  1. Right click in the block design, select "Add IP" and then add a Clocking Wizard IP
  2. Change the name of the instance to clk_wiz_dynamic
  3. Double-click on the clk_wiz_dynamic IP, and make the following changes in the Output Clocks tab: [clk_out1=250MHz], [clk_out2=500MHz], [Matched routing on both], [Reset Type = Active Low]
  4. Move the original clocking wizard (clk_wiz_static) from pl_clk0 to pl_clk1
  5. In the Platform Interfaces tab, enable clk_out1 and clk_out2 of the clk_wiz_dynamic instance
  6. Set the slower clock (clk_out1 in this case) as the default
  7. clk_out1 should have its id set to 0, and clk_out2 should have its id set to 1
  8. Make sure the proc_sys_reset block listed in each window is set to the instance that is connected to that clock

Separate the original components

In this design, we've chosen to place the original components (the MIPI subsystem) on a separate clock coming from the PS. We're connecting the clock wizard and processor system resets for the accelerators to the PL0 clock, and the MIPI subsystem to the PL1 clock. This ensures that any change in clock frequency (or clock gating) on the original or acceleration components will not affect the operation of the other.

  1. Right click on the pl_clk0 and select "Disconnect Pin" in the menu
  2. Connect pl_clk0 to clk_wiz_dynamic clk_in1, and connect pl_clk1 to clk_in1 of clk_wiz_static
  3. Delete the net connected to pl_reset0
  4. Right click on the block design, select "Add IP", and add a processor system reset IP for each of the new clocks.
  5. Name them proc_sys_reset_dynamic_1 and proc_sys_reset_dynamic_2
  6. Connect the clk_out1 and clk_out2 outputs of the clk_wiz_dynamic block to the proc_sys_reset_dynamic_1 and proc_sys_reset_dynamic_2 slowest_sync_clk inputs, respectively
  7. Connect the PS pl_reset0 to the ext_reset_in input of the two new processor system reset blocks
  8. Connect pl_reset0 to the resetn port of clk_wiz_dynamic
  9. Connect pl_reset1 to the resetn port of clk_wiz_static and to the ext_reset_in pin of proc_sys_reset_200
  10. Connect the locked output of clk_wiz_dynamic to the dcm_locked inputs of the two new processor system reset blocks.

Enable Interrupt based Kernels

The default scheduling mode for the acceleration kernels is polled. In order to enable interrupt-based processing within our platform, we need to add an interrupt controller. Within the current design, we will connect a constant "gnd" to the interrupt controller and will not connect any valid interrupt sources at this time. Paired with the AXI Interrupt Controller is a "dynamic_postlink" tcl script in the Vivado sources, which will select the interrupt constant net, disconnect it from the concatenation block, and then automatically connect up our acceleration kernel after it has been added by the Vitis tool.

  1. Right click on the block design, select "Add IP", and add an AXI Interrupt controller
  2. In the block properties for the interrupt controller, set the name to axi_intc_dynamic
  3. Add a "Concat" IP to concatenate inputs into the interrupt controller
  4. In the block properties for the Concat block, set the name to int_concat_dynamic
  5. Double click the Concat block and modify the number of ports to 8
  6. Add a "Constant" IP to provide a constant "0" to the interrupt controller - this constant will get disconnected and replaced by a connection to acceleration interrupts by the tool at compile time
  7. Double click the Constant IP and modify the constant value to 0
  8. Click the "Run Connection Automation" link in the Designer Assistance bar to connect the AXI Interrupt controller's Slave AXI interface - choose the HPM0_LPD since the HPM1_FPD is being used for the video subsystem.
  9. Connect the output of the Constant block to all eight inputs of the Concat block
  10. Connect the output of the Concat block to the intr input of the interrupt controller
  11. Connect the output of the interrupt controller to pl_ps_irq1 on the PS block
  12. Select the output net of the constant block and name it int_const_net

Generate the Design and XSA

Now that we've customized this design, it can be exported to the Vitis tool through a Xilinx Support Archive (XSA). Note that we're not going to build this project to a bitstream. The Vitis tool will utilize the archive to import the design, compose in our hardware accelerators, and build the bitstream at that point. We'll automate a portion of this process using the dsa.tcl script, which sets the platform name and details before exporting the XSA file into the hw_platform directory. This script also links the dynamic_postlink.tcl script mentioned earlier, so that the script specific to this platform is included inside the archive.

  1. Generate the block design
  2. Export the hardware platform by running source ./scripts/dsa.tcl

Creating the Software Platform

The software platform requires some changes to the Petalinux project to add the necessary Xilinx Runtime (XRT) components into the design. At this point, there are two options: follow all of the steps below to copy in the necessary files and enable those components in Petalinux, or skip them and replace the Petalinux project with a new one created from u96v1_mipi_dynamic.bsp at [reference_files/platform_project/bsps/u96v1_mipi_dynamic.bsp] (a command sketch for this option is shown below).
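If you choose the BSP route, the replacement project is created the same way as the base project (a sketch only; run it from the top-level directory after removing or renaming the existing petalinux project):

    # Sketch: recreate the Petalinux project from the pre-built dynamic BSP
    petalinux-create -t project -n petalinux -s reference_files/platform_project/bsps/u96v1_mipi_dynamic.bsp
    cd petalinux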

Add the Xilinx Runtime recipes

The first step to creating our acceleration platform is adding in the library components mentioned above: the Xilinx Runtime, and the DPU runtime (dnndk). These come in the form of recipes, which we'll add into the user layer within our Petalinux build. First, we'll copy over the files and build recipes, and then we will enable them through the Petalinux root filesystem configuration menu.

  1. Change directory into the Petalinux directory
  2. Add a recipe to add the DPU utilities, libraries, and header files into the root file system
    cp -rp ../reference_files/platform_project/plnx/recipes-apps/dnndk project-spec/meta-user/recipes-apps
  3. Add a recipe to add the Xilinx Runtime (XRT) drivers
    cp -rp ../reference_files/platform_project/plnx/recipes-xrt project-spec/meta-user
  4. Add a recipe to create hooks for adding an "autostart" script to run automatically during Linux boot
    cp -rp ../reference_files/platform_project/plnx/recipes-apps/autostart project-spec/meta-user/recipes-apps
  5. Add the recipes above to the Petalinux image configuration
    vi project-spec/meta-user/recipes-core/images/petalinux-image-full.bbappend
    and add this to the end of the document:
    IMAGE_INSTALL_append = " dnndk"
    IMAGE_INSTALL_append = " autostart"
    IMAGE_INSTALL_append = " opencl-headers"
    IMAGE_INSTALL_append = " ocl-icd"
    IMAGE_INSTALL_append = " xrt"
    IMAGE_INSTALL_append = " xrt-dev"
    IMAGE_INSTALL_append = " zocl"
  6. Update the Petalinux project with the new exported XSA from Vivado
    petalinux-config --get-hw-description=../hw_platform --silentconfig
  7. Open the Petalinux root filesystem configuration GUI to enable the recipes above
    petalinux-config -c rootfs
    and then enable all of the recipes above within the "User Packages" sub menu

Modifying the Linux Device tree

The Linux Device Tree needs to be modified so that the Xilinx Runtime kernel drivers are probed correctly. We'll modify [project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi] to add the Zynq OpenCL node to the Device Tree.

  1. At the bottom of project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi, add the following text:
    &amba {
        zyxclmm_drm: zyxclmm_drm@0xA0000000 {
            reg = <0x0 0xA0000000 0x0 0x800000>;
            compatible = "xlnx,zocl";
            status = "okay";
            interrupt-parent = <&axi_intc_dynamic>;
            interrupts = <0 1>, <1 1>, <2 1>, <3 1>,
                         <4 1>, <5 1>, <6 1>, <7 1>;
        };
    };
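Later, once the rebuilt image is booted, a quick way to confirm that the zocl driver probed against this node is to check the kernel log (a sketch; the exact messages vary between XRT releases):

    # On the booted Ultra96: confirm the Zynq OpenCL (zocl) driver loaded
    dmesg | grep -i zocl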

Build Petalinux and package Software Components

Now that we've made all of the necessary configuration changes for the Petalinux build, we can kick off the build. This may take a while, depending on the processing power of your machine. After the Linux build is complete, we need to move all of the built software components into a common directory; placing all of our boot components in one place makes it easy to package up the hardware and software sides into the resulting platform. We will also use Petalinux to build the sysroot, which provides the complete cross-compilation environment for this software platform. The sysroot will also be included in the software portion of the platform, as it is needed to provide the correct versions of headers and includes when compiling for our platform.

  1. Build Petalinux
    petalinux-build
  2. Copy all .elf files from the [petalinux/images/linux] directory to [sw_platform/boot]. This should copy over the following files:
    • ARM Trusted Firmware - bl31.elf
    • PMU Firmware - pmufw.elf
    • U-Boot - u-boot.elf
    • Zynq FSBL - zynqmp_fsbl.elf
  3. Copy the image.ub file from the [petalinux/images/linux] directory to [sw_platform/image]
  4. Copy the linux.bif file from the [reference_files/platform_project/plnx] directory to [sw_platform/boot]
  5. Build the Yocto SDK (this provides our sysroot) from the project:
    petalinux-build --sdk
  6. Move [petalinux/images/linux/sdk.sh] to [sw_platform/sysroot] and then extract the SDK:
    cd sw_platform/sysroot
    ./sdk.sh -d ./ -y
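For reference, the copy operations in steps 2 through 4 can be scripted from a terminal roughly as follows (a sketch only; it assumes you run it from the top-level directory and use the directory layout from this tutorial):

    # Sketch: run from the top-level directory; creates the sw_platform folders if needed
    mkdir -p sw_platform/boot sw_platform/image sw_platform/sysroot
    cp petalinux/images/linux/*.elf sw_platform/boot/
    cp petalinux/images/linux/image.ub sw_platform/image/
    cp reference_files/platform_project/plnx/linux.bif sw_platform/boot/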

Generate the Vitis Software Platform

The Vitis software platform is a set of components that comprise everything needed to boot and develop for a particular board/board configuration, and contains both a hardware and software component. Now that we have built the hardware (XSA) and software (Linux image and boot elf files) components for the platform, we can use these components to generate and export our custom user-defined platform. We're going to walk through these steps in the Xilinx Vitis development kit.

  1. Open the Vitis IDE and select the top-level workspace directory as the workspace
  2. Select File, New, and "Platform Project"
  3. Name the platform "u96v1_mipi"
  4. Select to create from hardware specification and select the XSA in [hw_platform]
  5. Select the Linux operating system and the psu_cortexa53
  6. Double click on the platform.spr in the file navigator to open the project
  7. Customize the "linux on psu_cortexa53" domain to point to the boot components and bif in [sw_platform/boot]
  8. Customize the "linux on psu_cortexa53" domain to point to the image directory in [sw_platform/image]
  9. Click the hammer icon or the "Generate Platform" button to generate the output products from the platform project

Now that the platform has been generated, you'll note that there is an "export" directory. This export directory is the complete, generated platform: it can be zipped up and shared, providing everything needed to enable new developers on the custom platform.
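For example, the generated platform could be archived for hand-off with something like the following (a sketch; the exact folder layout under the platform project depends on your workspace and platform name):

    # Sketch: archive the exported platform for sharing; paths depend on your workspace layout
    cd u96v1_mipi/export
    zip -r u96v1_mipi_platform.zip u96v1_mipi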


Create the Face Detection Application Project

For the final application, we can target our MIPI platform for a machine learning application. We will use the pre-generated Xilinx Deep Learning Processing Unit (DPU) as our acceleration kernel, compile this kernel into the platform using the Xilinx Vitis IDE, and then build a user-space application that calls that hardware to run our custom face detection application.

Create a new Application Project

We'll start by creating the new application project. In the Vitis tool, the Application Project exists inside of a System Project container in order to provide a method for cohesive system development across the enabled domains in the platform (for instance, A53 and R5). Since we're working in the same workspace as before, we can simply target the platform that we generated earlier - but you can also add additional platform repos by clicking the "plus" button and pointing to the directory which contains your xpfm within the Platform Selection dialog.

  1. Open the Vitis IDE and select the top-level workspace directory as the workspace
  2. Select File, New, and "Application Project"
  3. Name the project "face_detection" and use the auto-generated System project name
  4. Select the "u96v1_mipi" platform that you just created
  5. Confirm that the Linux domain is selected in the next screen, and then point to the sysroot you generated in sw_platform
  6. Select "Empty Application" as the template and "Finish"

Edit the Build Settings

  1. Right click on [face_detection/src] in the file navigator, and select "Import"
  2. Choose General, and Filesystem
  3. Use [reference_files/application] as the source location and import all of the files
  4. Right click on "face_detection" in the file navigator and select "C/C++ Build Settings"
  5. Navigate to the C/C++ Build, Settings menu if you are not there
  6. For Configuration, select "All Configurations"
  7. In the GCC Host Linker, Libraries submenu below, click the green "+" to add the following libraries:
    - n2cube
    - dputils
    - opencv_core
    - opencv_imgcodecs
    - opencv_highgui
    - opencv_imgproc
    - opencv_videoio
  8. In the Host Linker, Miscellaneous, Other Objects submenu, add the dpu_densebox.elf from the workspace ./src location
  9. In the Host Compiler, Includes section, select the include path for XILINX_VIVADO_HLS and click the red "X" to remove
  10. Click Apply and Close

Add the DPU as a Hardware Accelerator

Finally, we will add the DPU in as our hardware acceleration kernel, and use Vitis to connect and compile the design.

  1. Double click the project.sdx under your face_detection project to open the project view
  2. Under Hardware Functions, click the lightning bolt logo to add a new accelerator
  3. Select the "dpu_xrt_top" kernel included as part of the dpu.xo file that was imported with the application sources earlier
  4. Click on binary_container_1 to change the name to dpu
  5. Right click on "dpu" and select Edit V++ options
  6. Add --config ../src/connections.ini to designate which ports of the DPU connect to the platform interfaces you created earlier
  7. In the upper right corner, change the active build to "System"
  8. Click the hammer icon to build the project

This may take approximately 30 minutes or longer, depending on your build machine. You may have noticed earlier that we never took the hardware portion of the design through bitstream generation. While running, the tool uses the "open" interfaces in the hardware design, imports the DPU into the design, and connects those interfaces to match what is called out in "connections.ini". After it finalizes the design with those new components, it runs synthesis and implementation to generate a binary to load into the fabric.


Running the Application on the Ultra96

Following the build process, you will have a populated sd_card folder under the System directory of your project. Copy the contents of the sd_card folder to your formatted SD card to boot the board. After the board boots successfully, you can follow a few quick steps to run the design.

  1. On the board, change directory to [/run/media/mmcblk0p1/] (where the SD card is mounted)
  2. Copy the dpu.xclbin file to /usr/lib
  3. Run face_detection.elf
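Put together, the board-side setup looks roughly like this (a sketch; the SD card mount point under /run/media can vary between boards and card layouts):

    # Sketch: assumes the SD card is mounted at /run/media/mmcblk0p1
    cd /run/media/mmcblk0p1
    cp dpu.xclbin /usr/lib/
    ./face_detection.elf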

When run without arguments, the face_detection application prints a help dialog with example pipelines through the application (MIPI, webcam, UDP stream). The output is given to the application as a GStreamer-style sink description, which makes it easy to customize the face detection app.

Example Pipelines:
"./face_detection -i /dev/video0 -o autovideosink" will display over x11 forwarding or on local monitor
"./face_detection -i /dev/video0 -o udpsink host=192.168.1.50 port=8080" will stream over UDP\


About Parker Holloway

Parker Holloway has been with AMD for just over a year and focuses on edge platforms and acceleration design work. His focus on these topics comes from an interest in a software-centric approach to algorithm design on FPGA and ACAP devices, specifically in the fields of computer vision and robotics. Parker is a graduate of Southern Methodist University and lives in Dallas, Texas.