The Ultra96™ is a great platform for building edge machine learning applications. The 96Boards form factor, along with the programmable logic on the Zynq® MPSoC ZU3 device, gives us the flexibility to add the common MIPI CSI-2 RX standard interface for video input used in these types of end applications, while the Xilinx Deep Learning Processing Unit (DPU) can be composed into the design to drive high-performance, low-power machine learning edge applications. Since many users adopting the Xilinx® Vitis™ unified software platform flow for the first time will be starting with existing Vivado-based designs, we'll begin our tutorial by converting a traditional design implemented in Vivado IPI into an acceleration-ready Vitis target platform. Taking advantage of the common 96Boards form factor of the Ultra96, the MIPI pipeline in this design uses an OV5640 imaging sensor on a MIPI imaging mezzanine card and uses the YUYV output format, feeding the MIPI RX IP directly into a framebuffer write DMA core and from there into the PS DDR.

Next, we'll show the steps to update the PetaLinux project to include the necessary libraries and drivers to create a Vitis software platform capable of supporting hardware-accelerated workflows, including a DPU-based machine learning application. Once we have the hardware and software platform components completed, we'll use the Vitis development kit to combine them into a Vitis acceleration platform that we can then build hardware-accelerated software applications against. Finally, we'll walk through the integration of the Xilinx DPU for machine learning acceleration applications. Following the addition of the DPU, we'll use the provided DPU runtime to evaluate a high-performance Face Detection application using streaming MIPI input on the generated platform.
This section lists the software and hardware tools required to use the Xilinx® Deep Learning Processor (DPU) IP to accelerate machine learning algorithms.
Clone this repository to your local machine and download the reference files archive. After downloading, unzip it into the reference_files directory of the cloned repository. The remaining folders in this hierarchy are empty at this point and will be populated as you work through the tutorial.
We will first create the original non-accelerated MIPI project in the Vivado® and PetaLinux tools. After this step, you will have bootable hardware and software images and can launch a pipeline to view the MIPI video input on the Ultra96.
./scripts/u96v1_mipi_static.tcl
Create the PetaLinux project from the provided BSP:
petalinux-create -t project -n petalinux -s reference_files/base_project/bsps/u96v1_mipi_static.bsp
Import the hardware you generated:
petalinux-config --get-hw-description ../hw_platform --silentconfig
Browse the petalinux-config -c rootfs and petalinux-config -c kernel menus to see what customizations were made to include MIPI on the Ultra96. Then build the system:
petalinux-build
Finally, package the boot components to create BOOT.BIN:
petalinux-package --boot --force --fsbl --pmufw --u-boot --fpga
We'll now make the necessary additions and modifications to the hardware design to prepare it for software-defined acceleration: open the base Vivado project to get started.
As we add additional components to the hardware design to accommodate acceleration, we need to customize the Processing System. Here, we will modify its configuration to create additional clocks, open up additional interrupt ports, and enable an AXI master port so that we can add additional peripherals to the design.
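These changes are normally made through the block re-customization GUI, but they can also be applied from the Vivado Tcl console. The following is a minimal sketch under the assumption that the PS instance is named zynq_ultra_ps_e_0; it shows only a representative subset of parameters, so match the values to the configuration described above.

# Sketch: applying the PS changes from the Tcl console (instance name and parameter subset are assumptions)
set_property -dict [list \
    CONFIG.PSU__USE__M_AXI_GP0  {1} \
    CONFIG.PSU__USE__IRQ0       {1} \
    CONFIG.PSU__USE__IRQ1       {1} \
    CONFIG.PSU__FPGA_PL0_ENABLE {1} \
    CONFIG.PSU__FPGA_PL1_ENABLE {1} \
] [get_bd_cells zynq_ultra_ps_e_0]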
In order for the Vitis tool to insert our hardware acceleration blocks into the design, we need to leave open and designate the interfaces that it can use to connect the blocks. In this design, we'll need a few memory-mapped interfaces so that the DPU can connect into the PS DDR. We'll open up three HP Slave ports on this platform since there are three memory mapped masters on the DPU block. This portion of the process also allows us to "name" the port, giving it a shorter nickname to designate connections later on.
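The Platform Interfaces tab is the usual way to open and name these ports, but the same designation can be made with the PFM.AXI_PORT property from the Tcl console. This is a sketch only; the PS instance name and the HP0/HP1/HP2 nicknames (sptags) are assumptions, so adjust them to match your block design and the connection file used later.

# Sketch: declaring the open HP slave ports and their nicknames for the platform
set_property PFM.AXI_PORT {
    S_AXI_HP0_FPD {memport "S_AXI_HP" sptag "HP0"}
    S_AXI_HP1_FPD {memport "S_AXI_HP" sptag "HP1"}
    S_AXI_HP2_FPD {memport "S_AXI_HP" sptag "HP2"}
} [get_bd_cells /zynq_ultra_ps_e_0]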
Similar to how we designated the interfaces for the platform, we now have to indicate to the tools which clocks they should use for the accelerators placed in the platform. In this case, the DPU uses two clocks (a 1x and a 2x clock), so we will provide the platform with both a 250 MHz and a 500 MHz clock. The DPU can be clocked faster or slower than this; these rates were chosen to balance power and frame-rate performance in our application.
[clk_out1=250MHz], [clk_out2=500MHz], [Matched routing on both], [Reset Type = Active Low]
In this design, we've chosen to place the original components (the MIPI subsystem) on a separate clock coming from the PS. We're connecting the clock wizards and processor system resets for the accelerators to the PL0 clock, and the MIPI subsystem to the PL1 clock. This ensures that any change in clock frequency (or clock gating) on either the original or acceleration components will not affect the operation of the other.
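The clock designation can be scripted in the same way with the PFM.CLOCK property. A minimal sketch, assuming the clocking wizard instance is clk_wiz_0 and its two outputs are paired with proc_sys_reset_0 and proc_sys_reset_1 (rename to match your design):

# Sketch: declaring the two accelerator clocks for the platform
set_property PFM.CLOCK {
    clk_out1 {id "0" is_default "true" proc_sys_reset "proc_sys_reset_0"}
    clk_out2 {id "1" is_default "false" proc_sys_reset "proc_sys_reset_1"}
} [get_bd_cells /clk_wiz_0]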
The default scheduling mode for the acceleration kernels is polled. In order to enable interrupt-based processing within our platform, we need to add an interrupt controller. Within the current design, we will connect a constant "gnd" to the interrupt controller and will not connect any valid interrupt sources at this time. Paired with the AXI Interrupt Controller is a "dynamic_postlink" Tcl script in the Vivado sources, which selects the interrupt constant net, disconnects it from the concatenation block, and then automatically connects our acceleration kernel after it has been added by the Vitis tool.
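To give a sense of what such a hook does, here is a minimal Tcl sketch of the same idea; the constant, concat, and kernel pin names are placeholders rather than the literal contents of the provided dynamic_postlink.tcl:

# Locate the constant net tied to the interrupt concat input, remove it,
# and wire the newly linked kernel's interrupt pin in its place.
set const_net [get_bd_nets -of_objects [get_bd_pins xlconstant_gnd/dout]]
disconnect_bd_net $const_net [get_bd_pins xlconcat_interrupt_0/In0]
connect_bd_net [get_bd_pins dpu_kernel_1/interrupt] [get_bd_pins xlconcat_interrupt_0/In0]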
Now that we've customized this design, it can be exported to the Vitis tool through a Xilinx Support Archive (XSA). Note that we're not going to build this project to a bitstream here. The Vitis tool will use this archive to import the design, compose in our hardware accelerators, and only then build a bitstream. We'll automate a portion of this process using the dsa.tcl script, which sets the naming and platform details before exporting the Xilinx Support Archive (.xsa) file into the hw_platform directory. This script also links the dynamic_postlink.tcl script mentioned earlier, so that the script specific to this platform is included inside the archive.
source ./scripts/dsa.tcl
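For reference, a platform-export script like this typically boils down to a few platform properties plus the archive write. The following sketch illustrates the idea; the names, paths, and exact property set are assumptions, not the literal contents of dsa.tcl:

# Sketch: naming the platform, registering the post-link hook, and writing the XSA
set_property PFM_NAME "xilinx.com:ultra96:u96v1_mipi:1.0" [get_files *.bd]
set_property platform.default_output_type "sd_card" [current_project]
set_property platform.design_intent.embedded "true" [current_project]
set_property platform.design_intent.datacenter "false" [current_project]
set_property platform.post_sys_link_tcl_hook ./scripts/dynamic_postlink.tcl [current_project]
write_hw_platform -force ./hw_platform/u96v1_mipi.xsa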
The software platform requires some changes to the PetaLinux project, adding the necessary Xilinx Runtime (XRT) components into the design. At this point, there are two options: follow all of the steps below to copy in the necessary files and enable those components in PetaLinux, or skip steps 1-8 and replace the PetaLinux project with a new one created from the u96v1_mipi_dynamic.bsp at [reference_files/platform_project/bsps/u96v1_mipi_dynamic.bsp].
The first step in creating our acceleration platform is adding the library components mentioned above: the Xilinx Runtime (XRT) and the DPU runtime (dnndk). These come in the form of recipes, which we'll add to the user layer within our PetaLinux build. First, we'll copy over the files and build recipes, and then we'll enable them through the PetaLinux root filesystem configuration menu.
cp -rp ../reference_files/platform_project/plnx/recipes-apps/dnndk project-spec/meta-user/recipes-apps
cp -rp ../reference_files/platform_project/plnx/recipes-xrt project-spec/meta-user
cp -rp ../reference_files/platform_project/plnx/recipes-apps/autostart project-spec/meta-user/recipes-apps
vi project-spec/meta-user/recipes-core/images/petalinux-image-full.bbappend
IMAGE_INSTALL_append = " dnndk"
IMAGE_INSTALL_append = " autostart"
IMAGE_INSTALL_append = " opencl-headers"
IMAGE_INSTALL_append = " ocl-icd"
IMAGE_INSTALL_append = " xrt"
IMAGE_INSTALL_append = " xrt-dev"
IMAGE_INSTALL_append = " zocl"
petalinux-config --get-hw-description=../hw_platform --silentconfig
petalinux-config -c rootfs
The Linux Device Tree needs to be modified so that the Xilinx Runtime kernel drivers are probed correctly. We'll modify [project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi] to add the Zynq OpenCL node to the Device Tree.
&amba {
    zyxclmm_drm: zyxclmm_drm@a0000000 {
        reg = <0x0 0xA0000000 0x0 0x800000>;
        compatible = "xlnx,zocl";
        status = "okay";
        interrupt-parent = <&axi_intc_dynamic>;
        interrupts = <0 1>, <1 1>, <2 1>, <3 1>,
                     <4 1>, <5 1>, <6 1>, <7 1>;
    };
};
Now that we've made all of the necessary configuration changes for the PetaLinux build, we can kick off the build. This may take quite a while, depending on the processing power of your machine. After the Linux build is complete, we need to move all of the built software components into a common directory; placing all of our boot components in one place makes it easy to package the hardware and software sides into the resulting platform. We will also use PetaLinux to build the sysroot in order to provide the complete cross-compilation environment for this software platform. This sysroot will also be included in the software portion of the platform, as it is needed to provide the correct version of headers/includes when compiling for our platform.
petalinux-build
petalinux-build --sdk
cd sw_platform/sysroot
./sdk.sh -d ./ -y
The Vitis software platform is a set of components that comprise everything needed to boot and develop for a particular board/board configuration, and contains both a hardware and software component. Now that we have built the hardware (XSA) and software (Linux image and boot elf files) components for the platform, we can use these components to generate and export our custom user-defined platform. We're going to walk through these steps in the Xilinx Vitis development kit.
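If you prefer a scripted flow over the IDE wizard, the same platform can be assembled with the XSCT command-line tool. This is a sketch only, assuming the boot files, Linux image directory, and sysroot were staged under sw_platform as described above and that the XSA sits in hw_platform; adjust the names and paths to your layout.

# XSCT sketch of platform assembly (names and paths are assumptions)
platform create -name u96v1_mipi -hw ./hw_platform/u96v1_mipi.xsa -out ./platform_repo
domain create -name xrt -os linux -proc psu_cortexa53
domain config -boot ./sw_platform/boot
domain config -image ./sw_platform/image
domain config -sysroot ./sw_platform/sysroot
platform generate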
Now that the platform has been generated, you'll note that there is an "export" directory. This export directory is the complete, generated platform and can be zipped up and shared - providing the components to enable new developers on the custom platform.
For the final application, we'll target our MIPI platform for a machine learning application. We will use the pre-generated Xilinx Deep Learning Processing Unit (DPU) as our acceleration kernel, compile this kernel into the platform using the Xilinx Vitis IDE, and then build a user-space application that calls that hardware to run our custom Face Detection application.
We'll start by creating the new application project. In the Vitis tool, the Application Project exists inside of a System Project container in order to provide a method for cohesive system development across the enabled domains in the platform (for instance, A53 and R5). Since we're working in the same workspace as before, we can simply target the platform that we generated earlier - but you can also add additional platform repos by clicking the "plus" button and pointing to the directory which contains your xpfm within the Platform Selection dialog.
Finally, we will add the DPU in as our hardware acceleration kernel, and use Vitis to connect and compile the design.
--config ../src/connections.ini
Use this option to designate which ports of the DPU connect to the Platform Interfaces you created earlier.

The build may take approximately 30 minutes or longer, depending on your build machine. You may have noticed earlier that we never took the hardware portion of the design to bitstream generation. While running, the tool uses the "open" interfaces in the hardware design, imports the DPU into the design, and then connects those interfaces to match what is called out in connections.ini. After it finalizes the design and those new components, it runs synthesis and implementation to generate a binary to load onto the fabric.
Following the build process, you will have a populated sd_card folder under the System directory of your project. Copy the contents of the sd_card folder to your formatted SD card and boot the board. After the board boots successfully, you can follow a few quick steps to run the design.
When run without arguments, the face_detection application provides a help dialog with example pipelines through the application (MIPI, webcam, UDP stream). These are given to the application as GStreamer-like sinks to provide easy customization of the Face Detection app.
Example Pipelines:
"./face_detection -i /dev/video0 -o autovideosink" will display over x11 forwarding or on local monitor
"./face_detection -i /dev/video0 -o udpsink host=192.168.1.50 port=8080" will stream over UDP\
Parker Holloway has been with AMD for just over a year and focuses on Edge platforms and Acceleration design work. His focus on these topics comes from an interest in a Software centric approach to algorithm design on FPGA and ACAP devices, specifically in the fields of Computer Vision and Robotics. Parker is a graduate of Southern Methodist University and lives in Dallas, Texas.