Traditional Data Acceleration

Traditional data acceleration in a Xilinx SoC would begin with an architect determining which functions would be best accelerated in the programmable logic (PL).

[Figure: Traditional data acceleration]

The FPGA designer would then need to create the infrastructure to establish the communication link between the processing system (PS) and the PL, as well as the accelerated function itself, which includes the following:

  • PS-to-PL AXI Interface
  • Datamover(s)
  • DMA(s)
  • kernel(s)

The embedded SW designer would need to develop the software drivers and handle the management of the kernel binary and kernel execution.

This is a significant amount of engineering effort to develop and verify for each custom design. To address this, Xilinx introduced the Xilinx Runtime (XRT), which supports OpenCL APIs running on Linux to schedule the kernels and control data movement. XRT was initially targeted at cloud deployment (remote or on-premise) utilizing an x86 host and a PCIe-based accelerator card. Vitis technology has since unified Xilinx's data acceleration offering and now additionally supports an Arm host with an AXI-based accelerator in programmable logic.

[Figure: Deployed accelerator]

Open Computing Language (OpenCL)

The goal of this section is to give you a quick overview of the Open Computing Language so that when you read the OpenCL APIs and the description of the host application code, you will understand the associated terms such as Platform, Device, Context, Kernel, and Command-Queue.

Below is the definition of the Open Computing Language (OpenCL) from Wikipedia:

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages (based on C99 and C++11) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

The authors of the OpenCL Programming Guide concisely state that an application for a heterogeneous platform must carry out the following steps in every case:

  1. Discover the components that make up the heterogeneous system.
  2. Probe the characteristics of these components so that the software can adapt to the specific features of different hardware elements.
  3. Create the blocks of instructions (kernels) that will run on the platform.
  4. Set up and manipulate memory objects involved in the computation.
  5. Execute the kernels in the right order and on the right components of the system.
  6. Collect the final results.

The steps summarized above are achieved by utilizing the OpenCL APIs. The only exception is kernel development which, in the case of Xilinx, can be done in OpenCL, C/C++, or RTL.

There are four models associated with OpenCL. They are listed below, and a brief discussion of each follows.

  • Platform Model
  • Execution Model
  • Memory Model
  • Programming Models

Platform Model

An OpenCL platform model is a representation of any heterogeneous platform utilizing OpenCL. A Platform always includes a single host. The host is responsible for communication with the environment external to the OpenCL program (e.g., I/O or interaction with the user).

The host can be connected to one or more OpenCL Devices; the device is where a kernel is executed. The device can be any CPU, GPU, DSP, FPGA, etc. that supports OpenCL. A device may be divided into Compute Units (CUs), which are further divided into Processing Elements (PEs). This blog is focused on the basics of Vitis data acceleration utilizing the OpenCL framework, so I will not go into further detail on CUs and PEs, but please take the time to research these topics as you focus more on kernel development. The Platform Model is shown below.

[Figure: OpenCL Platform Model]
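
As a small illustration (a hypothetical snippet, assuming a device_id obtained as shown later in this blog), the number of Compute Units a device exposes can be queried with the clGetDeviceInfo API:

    cl_uint num_cus = 0;
    // CL_DEVICE_MAX_COMPUTE_UNITS returns the number of parallel Compute Units on the device
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);
    std::cout << "Compute Units: " << num_cus << std::endl;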

Execution Model

Execution of an OpenCL program occurs in two parts: kernels that execute on one or more OpenCL devices and a host program that executes on the host. The host program defines the context for the kernels and manages their execution.

The host program will define the kernel and then issue a command to execute that kernel on an OpenCL device. There are many details that go along with understanding how the execution of a kernel is performed on the device. Some of the terms you will want to become familiar with include work-items, work-groups, and NDRange. Again, this blog is focused on the basics of the Vitis tool utilizing the OpenCL framework, so I will not go into further detail on the Execution Model and its associated terms, but please take the time to research these topics as you focus more on kernel development.

The OpenCL specification defines the Context and Command Queues as shown below.

The host defines a Context for the execution of the kernels. The context includes the following resources:

  1. Devices: The collection of OpenCL devices to be used by the host.
  2. Kernels: The OpenCL functions that run on OpenCL devices.
  3. Program Objects: The program source and executable that implement the kernels.
  4. Memory Objects: A set of memory objects visible to the host and the OpenCL devices. Memory objects contain values that can be operated on by instances of a kernel.

The context is created and manipulated by the host using functions from the OpenCL API. The host creates a data structure called a Command-Queue to coordinate the execution of the kernels on the devices. The host places commands into the command-queue, which are then scheduled onto the devices within the context. These include:

  • Kernel execution commands: Execute a kernel on the processing elements of a device.
  • Memory commands: Transfer data to, from, or between memory objects, or map and unmap memory objects from the host address space.
  • Synchronization commands: Constrain the order of execution of commands.
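
As a sketch of how a synchronization command constrains ordering (a hedged example using variables from the design later in this blog; the actual example relies on clFinish instead), an event returned by one command can gate a later command in an out-of-order queue:

    cl_event write_done;
    // The migration signals completion through the write_done event
    clEnqueueMigrateMemObjects(command_queue, 1, &buffer_a, 0,
                               0, NULL, &write_done);
    // The kernel launch waits on write_done before it can begin
    clEnqueueTask(command_queue, kernel, 1, &write_done, NULL);
    clReleaseEvent(write_done);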

Memory Model

Work-items executing a kernel have access to four distinct memory regions:

  • Global Memory: This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.
  • Constant Memory: A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.
  • Local Memory: A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.
  • Private Memory: A region of memory private to a work-item. Variables defined in one work-item’s private memory are not visible to another work-item.

Below is the Memory Model graphic as presented in the OpenCL Specification.

[Figure: OpenCL Memory Model]
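
To make the four regions concrete, below is a minimal OpenCL C sketch (not part of this design) showing where each address-space qualifier appears in a kernel:

    __constant int scale = 2;                      // constant memory: fixed during kernel execution
    __kernel void regions(__global const int* in,  // global memory: visible to all work-items
                          __global int* out)
    {
        __local int tile[64];                      // local memory: shared within a work-group
        int gid = get_global_id(0);                // private memory: per work-item variables
        tile[get_local_id(0) % 64] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);              // synchronize the work-group before reading
        out[gid] = tile[get_local_id(0) % 64] * scale;
    }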

Now that the terms Platform, Device, Context, Command-Queue, and Program have been defined, let's move on to the Xilinx Runtime environment which manages the OpenCL APIs.


Xilinx Runtime (XRT)

Support for OpenCL APIs using the Linux-based Xilinx Runtime (XRT) is provided to schedule the HW kernels and control data movement.

Xilinx Runtime (XRT) is implemented as a combination of userspace and kernel driver components. XRT supports both PCIe-based accelerator cards and MPSoC-based embedded architectures, providing a standardized software interface to Xilinx programmable logic.

[Figure: XRT stack]

For more details on the Xilinx Runtime, please reference the Xilinx Runtime (XRT) Documentation GitHub.


Example Design

The Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit (Part #: EK-U1-ZCU102-G) was utilized for this example design. The predefined Vitis target platform for the ZCU102 Evaluation Kit is the zcu102_base. This platform includes the Zynq UltraScale+ MPSoC XCZU9EG-2FFVB1156E as well as the XRT.

The OpenCL Host will be targeted to the Quad Core Cortex-A53 APU and the OpenCL Device will be targeted to the Programmable Logic. Communication between the Host and the Device, including control and data transfers, occurs across the AXI interconnect (embedded platform).

The kernel will be a simple Array Add function written in OpenCL. This kernel is provided as an example in the Vitis unified software platform as part of the data acceleration flow.

Vitis technology was used for this example design.

Please note that the C++ code presented is NOT reflective of the coding style of a typical software programmer. I have not used any function calls (with the exception of a helper function that reads in a file); instead, I kept the code as a series of OpenCL API calls to illustrate the flow at a low level. The code was also not written in an object-oriented fashion, but by referencing the OpenCL Class Diagram available in the OpenCL API 1.2 Reference Guide, you could easily adapt it.

 

[Figure: OpenCL class diagram]

Programming the Kernel

The kernel is not the focus of this blog so a simple Array Adder function will be accelerated in the PL. Below is the OpenCL code for the krnl_arrayAdd kernel.

There are two input arrays a and b with a single output array c. This example will perform the addition of a and b for each array index and output the result on the respective c array index. The constant input n_elements is used to define the size of the array and therefore the number of iterations.

    #define BUFFER_SIZE 256
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void krnl_arrayAdd(global const int* a,
                   global const int* b,
                   global int* c,
                   const int n_elements)
{
    int arrayA[BUFFER_SIZE];
    for (int i = 0 ; i < n_elements ; i += BUFFER_SIZE)
    {
        int size = BUFFER_SIZE;
        //boundary check
        if (i + size > n_elements) size = n_elements - i;

        //Burst reading A
        readA: for (int j = 0 ; j < size ; j++)
            arrayA[j] = a[i+j];

        //Burst reading B, calculating C, and burst writing
        // to Global memory
        vadd_writeC: for (int j = 0 ; j < size ; j++)
            c[i+j] = arrayA[j] + b[i+j];
    }
}


Programming the Host Application

Main

To begin coding the host application, the following headers were included. It is important to include the <CL/cl2.hpp> header as this provides access to the OpenCL APIs Library.

  • <iostream> - Standard Input / Output Streams Library (i.e. cout)
  • <CL/cl2.hpp> - OpenCL APIs Library

The static constant DATA_SIZE was declared to define the array size and therefore the number of iterations for the kernel. The load_file_to_memory function prototype was defined as this was the only function utilized in the host application code. This helper function was copied directly from the Vitis Programmers Guide (UG1357) as it does not directly apply to OpenCL.

The host application will be launched using the application name followed by the binary filename which includes the kernel to be accelerated. The binary filename from the command line will be stored in the variable krnl_file. For example, the application name is ocl_1 and the kernel name is krnl_arrayAdd, which resides in the binary_container_1.xclbin binary file. The following command would be typed at the Linux command line to properly launch the host application with the associated kernel:

./ocl_1.exe binary_container_1.xclbin

The required array size memory is calculated and stored in the variable size_in_bytes which will be used later to size the OpenCL buffer objects. The array size is 4096, as defined by DATA_SIZE, and each index is of type int.

Finally, the variables for the OpenCL objects are defined and initialized. As previously mentioned, the OpenCL environment is comprised of the following objects:

  • platform (platform_id)
  • device(s) (device_id)
  • context (context)
  • command queue (command_queue)
  • program (program)
  • kernel (kernel)
  • buffer(s) (buffer_a, buffer_b, buffer_res)

There are three int pointers defined (ptr_a, ptr_b, ptr_res) which are used as pointers to the OpenCL buffers. Finally, there is an errs variable which holds OpenCL API return values and can be used for error checking. In general, the errs variable is not checked in the illustrations below. However, two C++ files are included: one without error checking (ocl_1.cpp) and one with error checking (ocl_2.cpp).
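
As a preview, the error checking pattern used throughout ocl_2.cpp looks like the following:

    errs = clGetPlatformIDs(0, NULL, &num_platforms);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to find any platforms" << std::endl;
        exit(EXIT_FAILURE);
    }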

At this point, the application is defined and we will walk through the steps to Initialize the OpenCL Environment followed by the steps to Execute Host-to-Kernel Commands.

    #include <iostream>
#include <cstdio>    // FILE, fopen, fread (used by load_file_to_memory)
#include <cstdlib>   // malloc, free, exit
#include <cstring>   // strcmp
#include <CL/cl2.hpp>

static const int DATA_SIZE = 4096;

int load_file_to_memory(const char *filename, char **result);

int main(int argc, char* argv[]) {
    //Kernel binary filename to be passed from gcc command line
    if(argc != 2) {
        std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
        return EXIT_FAILURE;
    }

    char *krnl_file = argv[1];

    // Compute the size of array in bytes
    size_t size_in_bytes = DATA_SIZE * sizeof(int);

    cl_platform_id   platform_id   = 0;
    cl_device_id     device_id     = 0;
    cl_context       context       = 0;
    cl_command_queue command_queue = 0;
    cl_program       program       = 0;
    cl_kernel        kernel        = 0;
    cl_mem           buffer_a      = 0;
    cl_mem           buffer_b      = 0;
    cl_mem           buffer_res    = 0;
    int              *ptr_a        = 0;
    int              *ptr_b        = 0;
    int              *ptr_res      = 0;
    cl_int           errs;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // WILL BEGIN ADDING CODE HERE                                     //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////
    
}

int load_file_to_memory(const char *filename, char **result) {
    uint size = 0;
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {
        *result = NULL;
        return -1; // -1 means file opening fail
    }
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 0, SEEK_SET);
    *result = (char *)malloc(size+1);
    if (size != fread(*result, sizeof(char), size, f)) {
        free(*result);
        fclose(f);   // close the file handle on the failed-read path, too
        return -2; // -2 means file reading fail
    }
    fclose(f);
    (*result)[size] = 0;
    return size;
}


Initialize OpenCL Environment

Below are the steps for initializing the OpenCL environment. I have included the definitions of the OpenCL API functions utilized at the beginning of each step for easy reference.

Step 1: Select Platform

For full details on the OpenCL API functions used in this section, select the desired link.

clGetPlatformIDs

Obtain the list of platforms available.

    cl_int clGetPlatformIDs (cl_uint num_entries,
                         cl_platform_id *platforms,
                         cl_uint *num_platforms)

clGetPlatformInfo

Get specific information about the OpenCL platform.

    cl_int clGetPlatformInfo (cl_platform_id platform,
                          cl_platform_info param_name,
                          size_t param_value_size,
                          void *param_value,
                          size_t *param_value_size_ret)

The first step in initializing the OpenCL environment is to identify the Xilinx OpenCL Platform and extract the Platform ID. First, the host application utilizes the clGetPlatformIDs API: by setting the platforms argument to NULL, the API call returns the number of OpenCL platforms available and stores that value in the variable num_platforms. After determining the number of OpenCL platforms available, the application allocates memory for the platform_ids variable. The clGetPlatformIDs API is then utilized again, this time with the known platform count (num_platforms), to store the list of available platforms in the platform_ids variable.

The clGetPlatformInfo API will now be used to identify the Xilinx based platform by iterating through the platform_ids list and using a string compare to match the cl_platform_vendor parameter to the string "Xilinx". The result of this iteration is assigning the Platform ID to the platform_id variable. In the next step, the Device ID needs to be identified.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 1 - FIND PLATFORM  ID       //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_uint num_platforms = 0;
    char    cl_platform_vendor[1001];

    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id *platform_ids = (cl_platform_id *) malloc(sizeof(platform_id) * num_platforms);

    clGetPlatformIDs(num_platforms, platform_ids, NULL);

    for (unsigned int iplat = 0; iplat < num_platforms; iplat++) {
        clGetPlatformInfo(platform_ids[iplat], CL_PLATFORM_VENDOR,
                                 1000, (void *)cl_platform_vendor, NULL);
        if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
            platform_id = platform_ids[iplat];
        }
    }

Step 2: Get Device ID

For full details on the OpenCL API functions used in this section, select the desired link.

clGetDeviceIDs

Obtain the list of devices available on a platform.

    cl_int clGetDeviceIDs (cl_platform_id  platform,
                       cl_device_type  device_type,
                       cl_uint  num_entries,
                       cl_device_id  *devices,
                       cl_uint  *num_devices)

clGetDeviceInfo

Get information about an OpenCL device.

    cl_int clGetDeviceInfo (cl_device_id  device,
                        cl_device_info  param_name,
                        size_t  param_value_size,
                        void  *param_value,
                        size_t  *param_value_size_ret)

The second step in initializing the OpenCL environment is to identify the Xilinx OpenCL Device and extract the Device ID. The devices[] array was created with a size of 16 on the assumption that most platforms will not have more than 16 OpenCL devices. In this example, there is only one OpenCL device.

The clGetDeviceIDs API in the application code is used to extract the number of Xilinx OpenCL devices available on the Platform that are of type CL_DEVICE_TYPE_ACCELERATOR. The number of OpenCL devices that match this type is stored in the num_devices variable and the list of OpenCL devices found is stored in the devices list.

The clGetDeviceInfo API will now be used to identify the Xilinx based device by iterating through the devices list and extracting the device name string for each Xilinx OpenCL device. The result of this iteration is assigning the Xilinx OpenCL Device ID to the device_id variable.

In this example, the CL_DEVICE_NAME returned and stored in the cl_device_name[] variable was zcu102_base.

In the next step, the Compute Context needs to be created.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 2 - FIND DEVICE ID          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_device_id devices[16];
    char         cl_device_name[1001];
    cl_uint      num_devices;

    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16,
                          devices, &num_devices);

    for (uint idev = 0; idev < num_devices; idev++) {
        clGetDeviceInfo(devices[idev], CL_DEVICE_NAME, sizeof(cl_device_name), cl_device_name, 0);
        device_id = devices[idev];
    }

Step 3: Create Compute Context

For full details on the OpenCL API functions used in this section, select the desired link.

clCreateContext

Creates an OpenCL context.

    cl_context clCreateContext (const cl_context_properties *properties,
                            cl_uint num_devices,
                            const cl_device_id *devices,
                            (void CL_CALLBACK  *pfn_notify) (
                                const char *errinfo,
                                const void *private_info, size_t cb,
                                void *user_data
                            ),
                            void *user_data,
                            cl_int *errcode_ret)

The third step in initializing the OpenCL environment is to create the OpenCL Context with one or more Xilinx devices that will communicate with the host. This step is very straightforward, as the clCreateContext API is used to create a context with the Xilinx OpenCL device declared in the device_id variable.

In the next step, the Command Queue needs to be created.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 3 - CREATE CONTEXT          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    context = clCreateContext(0, 1, &device_id, NULL, NULL, &errs);

Step 4: Create Command Queue

For full details on the OpenCL API functions used in this section, select the desired link.

clCreateCommandQueue

Create a command-queue on a specific device.

    cl_command_queue clCreateCommandQueue (cl_context context,
                                       cl_device_id device,
                                       cl_command_queue_properties properties,
                                       cl_int *errcode_ret)

The fourth step in initializing the OpenCL environment is to create a Command Queue for each Xilinx device. This step is very straightforward, as the clCreateCommandQueue API is used to create a command queue on the Xilinx OpenCL device declared in the device_id variable within its OpenCL context. Please note that in the example, the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property was set to allow commands to execute out-of-order. Without this property, the commands execute in-order.

In the next step, the Program Object needs to be created and the binary bits loaded.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 4 - CREATE COMMAND QUEUE    //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    command_queue = clCreateCommandQueue(context, device_id,
                                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                         &errs);
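
For comparison (a minimal variant, not used in this example), passing 0 for the properties argument creates an in-order queue:

    // In-order queue: commands execute in the order they were enqueued
    command_queue = clCreateCommandQueue(context, device_id, 0, &errs);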

Step 5: Program

For full details on the OpenCL API functions used in this section, select the desired link.

clCreateProgramWithBinary

Creates a program object for a context, and loads the binary bits specified by binary into the program object.

    cl_program clCreateProgramWithBinary (cl_context context,
                                      cl_uint num_devices,
                                      const cl_device_id *device_list,
                                      const size_t *lengths,
                                      const unsigned char **binaries,
                                      cl_int *binary_status,
                                      cl_int *errcode_ret)

The fifth and final step in initializing the OpenCL environment is to create an OpenCL Program Object and load the PL kernel binary into the Program Object for the Xilinx OpenCL Device. The binary file (binary_container_1.xclbin) must be read in by the application code and loaded into the OpenCL Program Object. This is where the helper function load_file_to_memory is utilized. The application code reads the kernel binary file, identified by the filename stored in the krnl_file variable, into the krnl_bin buffer. The clCreateProgramWithBinary API is then used to load the kernel into the PL.

This completes the steps to initialize the OpenCL environment. In the next section, the Host-to-Kernel commands are exercised to interact with the PL kernel.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 5 - PROGRAM KERNEL          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_int err_code;
    unsigned char *krnl_bin;

    const size_t krnl_size = load_file_to_memory(krnl_file, (char **) &krnl_bin);
    program = clCreateProgramWithBinary(context, 1, &device_id, &krnl_size,
                                        (const unsigned char **) &krnl_bin,
                                        &errs, &err_code);


Host-to-Kernel Commands

Step 1: Create Kernel Object & Buffers

For full details on the OpenCL API functions used in this section, select the desired link.

clCreateKernel

Creates a kernel object.

    cl_kernel clCreateKernel (cl_program program,
                          const char *kernel_name,
                          cl_int *errcode_ret)

clCreateBuffer

Creates a buffer object.

    cl_mem clCreateBuffer (cl_context context,
                       cl_mem_flags flags,
                       size_t size,
                       void *host_ptr,
                       cl_int *errcode_ret)

clSetKernelArg

Used to set the argument value for a specific argument of a kernel.

    cl_int clSetKernelArg (cl_kernel kernel,
                       cl_uint arg_index,
                       size_t arg_size,
                       const void *arg_value)

The first step in Host-to-Kernel Commands is to create an OpenCL Kernel Object for the PL kernel. Using the clCreateKernel API, the kernel name (krnl_arrayAdd) is provided to identify the kernel contained within the PL kernel binary file which can now be used by the host application.

In order for the host application and the kernel to communicate, OpenCL Buffer Objects must be created. In this example, three OpenCL Buffer Objects are required; two for the a and b input vectors and one for the res output vector. Each of these three OpenCL Buffer Objects is created using the clCreateBuffer API, assigning the CL_MEM_READ_ONLY or CL_MEM_WRITE_ONLY flag to signify its communication direction from the kernel perspective. The size of these buffers was calculated earlier and stored in the size_in_bytes variable.

The clSetKernelArg API is used to define arguments for the kernel. Each argument has its own index, starting at 0 and incrementing from there. In the example, the clSetKernelArg API calls were used to set the 4 arguments required by the krnl_arrayAdd kernel. These are comprised of three buffer arguments and one scalar argument. The three buffer arguments are used for large data transfers. The scalar argument is used for small data transfers and is write-only. The arg_size argument is defined by the size of the argument type, which is cl_mem in the case of the three buffer arguments and cl_uint in the case of the one scalar argument. The arg_value argument is a pointer to the data. In the case of the three buffer arguments, this is &buffer_a, &buffer_b, and &buffer_res. In the case of the scalar argument, this is &numIterations. In the example, the scalar argument was used to send the numIterations variable required by the krnl_arrayAdd kernel. As mentioned previously, the number of iterations is equal to the array size, which was defined by DATA_SIZE.

In the next step, the Host will transfer data from Host Memory to the Device Global Memory.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 1 - CREATE KERNEL OBJ & BUFFERS  //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    kernel = clCreateKernel(program, "krnl_arrayAdd", &errs);
    buffer_a = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes,
    		                  NULL, &errs);
    buffer_b = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes,
    		                  NULL, &errs);
    buffer_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size_in_bytes,
    		                    NULL, &errs);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &buffer_res);
    cl_uint numIterations = DATA_SIZE;
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &numIterations);

Step 2: Buffer(s) Transfer from Host to Kernel

For full details on the OpenCL API functions used in this section, select the desired link.

clEnqueueMapBuffer

Enqueues a command to map a region of the buffer object given by buffer into the host address space and returns a pointer to this mapped region.

 

    void * clEnqueueMapBuffer (cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_map,
                           cl_map_flags map_flags,
                           size_t offset,
                           size_t size,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event,
                           cl_int *errcode_ret)

clEnqueueMigrateMemObjects

Enqueues a command to indicate which device a set of memory objects should be associated with.

    cl_int clEnqueueMigrateMemObjects (cl_command_queue  command_queue,
                                   cl_uint  num_mem_objects,
                                   const cl_mem  *mem_objects,
                                   cl_mem_migration_flags  flags,
                                   cl_uint  num_events_in_wait_list,
                                   const cl_event  *event_wait_list,
                                   cl_event  *event)

The second step in Host-to-Kernel Commands is to transfer data from Host Memory to the Device Global Memory. The clEnqueueMapBuffer API maps the specified Buffer Object and returns a pointer created by the Xilinx Runtime to this mapped region. You will use this host side pointer to fill your data. In this example, ptr_a is the host side pointer for buffer_a, ptr_b is the host side pointer for buffer_b, and ptr_res is the host side pointer for buffer_res. Please note that buffer_a and buffer_b are assigned the CL_MAP_WRITE flag since the host will be writing to these buffers and buffer_res is assigned the CL_MAP_READ flag since the host will be reading from this buffer.

In this example, a for loop is used to fill the host side pointer memory for the a and b input vectors and to clear the res output vector.

The host application will transfer the buffer_a, buffer_b, and buffer_res data at the same time. The reason for including buffer_res, even though there are no results available yet, is to make sure the cache has the latest data. Therefore, the mems[] array of type cl_mem was created to make a list of the OpenCL Buffer Objects. The clEnqueueMigrateMemObjects API is used to start the data transfer from Host Memory to the Device Global Memory (buffer_a & buffer_b) and from the Device Global Memory to the Host Memory (buffer_res).

In the next step, the Host will launch the kernel.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 2 - BUFFER(S) XFER HOST > KERNEL //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    ptr_a   = (int *) clEnqueueMapBuffer(command_queue, buffer_a, true,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    ptr_b   = (int *) clEnqueueMapBuffer(command_queue, buffer_b, true,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    ptr_res = (int *) clEnqueueMapBuffer(command_queue, buffer_res, true,
                                         CL_MAP_READ, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);

    for(int i = 0 ; i< DATA_SIZE; i++){
        ptr_a[i]   = 10;
        ptr_b[i]   = 20;
        ptr_res[i] = 0;
    }

    const cl_mem mems[3] = {buffer_a, buffer_b, buffer_res};

    clEnqueueMigrateMemObjects(command_queue, 3, mems, 0, 0, NULL, NULL);
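
As an aside, clEnqueueWriteBuffer is an alternative standard OpenCL API for moving input data; a hedged sketch (using a plain host array rather than the mapped pointers of this design) would be:

    int host_a[DATA_SIZE];                       // ordinary host-side source array
    // Blocking write (CL_TRUE) copies host_a into buffer_a before returning
    clEnqueueWriteBuffer(command_queue, buffer_a, CL_TRUE, 0,
                         size_in_bytes, host_a, 0, NULL, NULL);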

Step 3: Launch the Kernel

For full details on the OpenCL API functions used in this section, select the desired link.

clEnqueueTask

Enqueues a command to execute a kernel on a device.

    cl_int clEnqueueTask (cl_command_queue  command_queue,
                      cl_kernel  kernel,
                      cl_uint  num_events_in_wait_list,
                      const cl_event  *event_wait_list,
                      cl_event  *event)

clFinish

Blocks until all previously queued OpenCL commands in a command-queue are issued to the associated device and have completed.

    cl_int clFinish (cl_command_queue command_queue)

The third step in Host-to-Kernel Commands is to launch the kernel. This step is very straightforward as the clEnqueueTask API is used to identify the kernel to launch. In this example, the kernel object is defined by the kernel variable.

To guarantee that the kernel has had time to complete the execution prior to transferring the results back from Device Global Memory to Host Memory, the clFinish API is used to block until all previously queued OpenCL commands in the command queue are issued and the associated device has completed.

In the next step, the Host will transfer data from the Device Global Memory to the Host Memory.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 3 - LAUNCH KERNEL                //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
    clFinish(command_queue);
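
A finer-grained alternative (a sketch only, not used in this design) is to capture an event from clEnqueueTask and wait on just that event rather than draining the entire queue:

    cl_event kernel_done;
    clEnqueueTask(command_queue, kernel, 0, NULL, &kernel_done);
    clWaitForEvents(1, &kernel_done);   // block until this specific kernel completes
    clReleaseEvent(kernel_done);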

Step 4: Buffer(s) Transfer from Kernel to Host

For full details on the OpenCL API functions used in this section, select the desired link.

clEnqueueMigrateMemObjects

Enqueues a command to indicate which device a set of memory objects should be associated with.

    cl_int clEnqueueMigrateMemObjects (cl_command_queue  command_queue,
                                   cl_uint  num_mem_objects,
                                   const cl_mem  *mem_objects,
                                   cl_mem_migration_flags  flags,
                                   cl_uint  num_events_in_wait_list,
                                   const cl_event  *event_wait_list,
                                   cl_event  *event)

The fourth step in Host-to-Kernel Commands is to transfer data from Device Global Memory to the Host Memory once the kernel execution is complete. This step is very straightforward, as the clEnqueueMigrateMemObjects API is used to start the data transfer from Device Global Memory to Host Memory. Since the migration is enqueued asynchronously, clFinish is called once more so that the transfer is complete before the results are read through ptr_res.

In the next step, the Host will perform post-processing and PL cleanup.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 4 - BUFFER(S) XFER KERNEL > HOST //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueMigrateMemObjects(command_queue, 1, &buffer_res,
                                      CL_MIGRATE_MEM_OBJECT_HOST, 0,
                                      NULL, NULL);
    clFinish(command_queue); // ensure the transfer completes before reading ptr_res
    int match = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        int host_result = ptr_a[i] + ptr_b[i];
        if (ptr_res[i] != host_result) {
            std::cout << "Result Mismatch = " << ptr_res[i] << std::endl;
            match = 1;
            break;
        }

    }
    if (match == 0) {
        std::cout << "The result matches; kernel run was successful" << std::endl;
    }

Step 5: Post Processing

For full details on the OpenCL API functions used in this section, select the desired link.

clEnqueueUnmapMemObject

Enqueues a command to unmap a previously mapped region of a memory object.

    cl_int clEnqueueUnmapMemObject (cl_command_queue  command_queue,
                                cl_mem  memobj,
                                void  *mapped_ptr,
                                cl_uint  num_events_in_wait_list,
                                const cl_event  *event_wait_list,
                                cl_event  *event )

clReleaseCommandQueue

Decrements the command_queue reference count.

    cl_int clReleaseCommandQueue (cl_command_queue command_queue)

clReleaseContext

Decrement the context reference count.

    cl_int clReleaseContext (cl_context context)

clReleaseDevice

Decrements the device reference count.

    cl_int clReleaseDevice (cl_device_id device)

clReleaseKernel

Decrements the kernel reference count.

    cl_int clReleaseKernel (cl_kernel kernel)

clReleaseProgram

Decrements the program reference count.

    cl_int clReleaseProgram (cl_program program)

The fifth and final step in Host-to-Kernel Commands is for the Host to perform some post-processing and PL cleanup. This step is very straightforward, as you can see that each OpenCL Object is released. The clEnqueueUnmapMemObject API unmaps the previously mapped region of memory. This API is called for all three Buffer Objects.

Next, the clReleaseCommandQueue API is called to decrement the reference count of the Command Queue Object and once the count becomes zero and all commands queued have finished, the command queue is deleted.

Next, the clReleaseContext API is called to decrement the reference count of the Context Object and once the count becomes zero and all the objects associated to the context are released, the context is deleted.

Next, the clReleaseDevice API is called to decrement the reference count of the Device Object and once the count becomes zero and all objects associated with the device are released, the device is deleted.

Next, the clReleaseKernel API is called to decrement the reference count of the Kernel Object and once the number of instances of that kernel become zero and the kernel object is no longer needed by any enqueued commands, the kernel is deleted.

Finally, the clReleaseProgram API is called to decrement the reference count of the Program Object and once all of the kernel objects associated with the program object have been deleted, the program is deleted.

It is also good practice to free the memory allocated for any variables that are no longer in use or valid. In this example, memory was allocated for platform_ids but not for device_ids (hence the commented-out free).

That is it! The application is now ready to be compiled and then run on the ZCU102 board.

      /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 5 - POST-PROCESSING/FPGA CLEANUP //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueUnmapMemObject(command_queue, buffer_a, ptr_a, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_b, ptr_b, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_res, ptr_res, 0, NULL, nullptr);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    clReleaseDevice(device_id);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    free(platform_ids);
    //free(device_ids);
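
One additional cleanup that could be added (my suggestion; it is not in the original listing): the kernel binary buffer allocated by load_file_to_memory is also a candidate for freeing once the Program Object has been created.

    free(krnl_bin);   // buffer allocated by load_file_to_memory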


Building the Host Application and Kernel Application in Vitis

  1. Launch Vitis
  2. Select "Create Application Project"

3. Enter the Project name: ocl_1 and select "Next>"


4. Select the zcu102_base [custom] platform which supports Embedded Acceleration (i.e., XRT)


5. Provide the path to the Linux Sysroot with XRT support


6. Select "Empty Application" and select "Finish".


7. From the Explorer window, select the "src" folder followed by selecting the "Import Files..." icon.


8. Select the "Browse..." button next to the "From directory:" cell and navigate to the directory location of the ioc_1.cpp and krnl_arrayAdd.cl source files.


9. Switch "Active build configuration:" to "Hardware" using the pulldown menu.


10. Select the "Add Hardware Functions..." icon in the "Hardware Functions" window.


11. Select the "krnl_arrayAdd" kernel from the list and select "OK".


12. Select the "Build" icon in the "Assistant" window.


13. NOTICE - this step should only be used if your "Build" does not fully populate the ocl_1_system > ocl_1 > Hardware > SD Card Image directory. If that happens, add an advanced parameter to the V++ Kernel Linker as follows:

     1. Select the ocl_1 [ xrt ] project in the "Explorer" window.

     2. Select "File" > "Properties" from the pulldown menu.

     3. Select the C/C++ Build > Settings > V++ Kernel Linker > Miscellaneous.

     4. In "Other flags", select the "Add..." icon.

     5. Add the following --advanced.param "compiler.addOutputTypes=sd_card" and select "OK".

     6. Select "Apply and Close".


Running the Host Application

  1. Connect the JTAG and UART to the ZCU102 platform via the micro USB cables.
  2. Set ZCU102 configuration to "SD Card" (SW6[1-4] = ON-OFF-OFF-OFF).
  3. Insert the SD card with the contents of the Vitis tool <workspace>/ocl_1/Hardware/sd_card/ directory copied onto it.
  4. Connect to the ZCU102 UART via a terminal of choice (e.g., PuTTY).
  5. Power on ZCU102 board.
  6. Wait for Linux to launch.

7. Navigate to the SD card mount point.

cd /mnt


8. Launch the application.

./ocl_1.exe binary_container_1.xclbin


9. Wait for the kernel execution result: "The result matches; kernel run was successful".


Conclusion

By adopting the Vitis unified software platform and the Data Acceleration flow utilizing the OpenCL framework, a designer can significantly increase productivity, as it is no longer necessary to develop and verify custom hardware acceleration infrastructure for every design. Designers can now focus their efforts on optimizing their kernels and leave the management of those kernels to the XRT. Even better, the designer can choose to target an embedded platform (i.e. Zynq-7000 SoC, Zynq UltraScale+ MPSoC, Versal) or a data accelerator card (i.e. Alveo), and the underlying framework remains the same regardless of the communication channel (AXI interconnect for an embedded platform or PCIe for a data accelerator card).

Useful References

  • Vitis Programmers Guide (UG1357; v2019.2) PDF
  • Xilinx Runtime (XRT) Documentation GitHub HTML
  • Khronos OpenCL website HTML
  • Khronos OpenCL 1.2 API and C Language Specification (November 14, 2012) PDF
    • Vitis 2019.2 environment provides an OpenCL 1.2 embedded profile conformant runtime API
  • Khronos OpenCL 1.2 Reference Guide PDF
  • Khronos OpenCL 2.2 API Specification (July 19, 2019) PDF
  • Khronos OpenCL 2.2 Reference Guide PDF
    • latest OpenCL version
  • Khronos OpenCL 2.0 C Language Specification (July 19, 2019) PDF
  • Khronos OpenCL 1.0 C++ Language Specification (July 19, 2019) PDF
  • OpenCL Programming Guide by Munshi, Gaster, Mattson, Fung, and Ginsburg (2011; ISBN-13: 978-0321749642) HTML
    • OpenCL 1.1
  • Wikipedia OpenCL webpage HTML

Full Application Code Listing [ocl_1.cpp]

    #include <iostream>
#include <cstdio>    // FILE, fopen, fread (used by load_file_to_memory)
#include <cstdlib>   // malloc, free
#include <cstring>   // strcmp
#include <CL/cl2.hpp>

static const int DATA_SIZE = 4096;

int load_file_to_memory(const char *filename, char **result);

int main(int argc, char* argv[]) {
    //Kernel binary file to be passed from gcc command line
    if(argc != 2) {
        std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
        return EXIT_FAILURE;
    }

    char *krnl_file = argv[1];

    // Compute the size of array in bytes
    size_t size_in_bytes = DATA_SIZE * sizeof(int);

    cl_platform_id   platform_id   = 0;
    cl_device_id     device_id     = 0;
    cl_context       context       = 0;
    cl_command_queue command_queue = 0;
    cl_program       program       = 0;
    cl_kernel        kernel        = 0;
    cl_mem           buffer_a      = 0;
    cl_mem           buffer_b      = 0;
    cl_mem           buffer_res    = 0;
    int              *ptr_a        = 0;
    int              *ptr_b        = 0;
    int              *ptr_res      = 0;
    cl_int           errs;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 1 - FIND PLATFORM ID        //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_uint num_platforms = 0;
    char    cl_platform_vendor[1001];

    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id *platform_ids = (cl_platform_id *) malloc(sizeof(platform_id) * num_platforms);

    clGetPlatformIDs(num_platforms, platform_ids, NULL);

    for (unsigned int iplat = 0; iplat < num_platforms; iplat++) {
        clGetPlatformInfo(platform_ids[iplat], CL_PLATFORM_VENDOR, 1000,
        		          (void *)cl_platform_vendor, NULL);
        if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
            platform_id = platform_ids[iplat];
        }
    }

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 2 - FIND DEVICE ID          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_device_id devices[16];
    char         cl_device_name[1001];
    cl_uint      num_devices;

    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16,
                          devices, &num_devices);

    for (uint idev = 0; idev < num_devices; idev++) {
        clGetDeviceInfo(devices[idev], CL_DEVICE_NAME, sizeof(cl_device_name), cl_device_name, 0);
        device_id = devices[idev];
    }

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 3 - CREATE CONTEXT          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    context = clCreateContext(0, 1, &device_id, NULL, NULL, &errs);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 4 - CREATE COMMAND QUEUE    //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    command_queue = clCreateCommandQueue(context, device_id,
                                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                         &errs);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 5 - PROGRAM KERNEL          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_int err_code;
    unsigned char *krnl_bin;

    const size_t krnl_size = load_file_to_memory(krnl_file, (char **) &krnl_bin);
    program = clCreateProgramWithBinary(context, 1, &device_id, &krnl_size,
                                        (const unsigned char **) &krnl_bin,
                                        &errs, &err_code);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 1 - CREATE KERNEL OBJ & BUFFERS  //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    kernel = clCreateKernel(program, "krnl_arrayAdd", &errs);
    buffer_a = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes,
    		                  NULL, &errs);
    buffer_b = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes,
    		                  NULL, &errs);
    buffer_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size_in_bytes,
    		                    NULL, &errs);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &buffer_res);
    cl_uint numIterations = DATA_SIZE;
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &numIterations);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 2 - BUFFER(S) XFER HOST > KERNEL //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    ptr_a   = (int *) clEnqueueMapBuffer(command_queue, buffer_a, true,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    ptr_b   = (int *) clEnqueueMapBuffer(command_queue, buffer_b, true,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    ptr_res = (int *) clEnqueueMapBuffer(command_queue, buffer_res, true,
                                         CL_MAP_READ, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);

    for(int i = 0 ; i< DATA_SIZE; i++){
        ptr_a[i]   = 10;
        ptr_b[i]   = 20;
        ptr_res[i] = 0;
    }

    const cl_mem mems[3] = {buffer_a, buffer_b, buffer_res};

    clEnqueueMigrateMemObjects(command_queue, 3, mems, 0, 0, NULL, NULL);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 3 - LAUNCH KERNEL                //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
    clFinish(command_queue);

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 4 - BUFFER(S) XFER KERNEL > HOST //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueMigrateMemObjects(command_queue, 1, &buffer_res,
                                      CL_MIGRATE_MEM_OBJECT_HOST, 0,
                                      NULL, NULL);
    clFinish(command_queue); // ensure the transfer completes before reading ptr_res
    int match = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        int host_result = ptr_a[i] + ptr_b[i];
        if (ptr_res[i] != host_result) {
            std::cout << "Result Mismatch = " << ptr_res[i] << std::endl;
            match = 1;
            break;
        }

    }
    if (match == 0) {
        std::cout << "The result matches; kernel run was successful" << std::endl;
    }

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 5 - POST-PROCESSING/FPGA CLEANUP //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueUnmapMemObject(command_queue, buffer_a, ptr_a, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_b, ptr_b, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_res, ptr_res, 0, NULL, nullptr);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    clReleaseDevice(device_id);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    free(platform_ids);
    //free(device_ids);

}


int load_file_to_memory(const char *filename, char **result) {
    uint size = 0;
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {
        *result = NULL;
        return -1; // -1 means file opening fail
    }
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 0, SEEK_SET);
    *result = (char *)malloc(size+1);
    if (size != fread(*result, sizeof(char), size, f)) {
        free(*result);
        fclose(f);   // close the file handle on the failed-read path, too
        return -2; // -2 means file reading fail
    }
    fclose(f);
    (*result)[size] = 0;
    return size;
}

Full Application Code Listing with Error Checking [ocl_2.cpp]

    #include <iostream>
#include <cstdio>    // FILE, fopen, fread (used by load_file_to_memory)
#include <cstdlib>   // malloc, free, exit
#include <cstring>   // strcmp
#include <CL/cl2.hpp>

static const int DATA_SIZE = 4096;

int load_file_to_memory(const char *filename, char **result);

int main(int argc, char* argv[]) {
    //Kernel binary file to be passed from gcc command line
    if(argc != 2) {
        std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
        return EXIT_FAILURE;
    }

    char *krnl_file = argv[1];

    // Compute the size of array in bytes
    size_t size_in_bytes = DATA_SIZE * sizeof(int);

    cl_platform_id   platform_id   = 0;
    cl_device_id     device_id     = 0;
    cl_context       context       = 0;
    cl_command_queue command_queue = 0;
    cl_program       program       = 0;
    cl_kernel        kernel        = 0;
    cl_mem           buffer_a      = 0;
    cl_mem           buffer_b      = 0;
    cl_mem           buffer_res    = 0;
    int              *ptr_a        = 0;
    int              *ptr_b        = 0;
    int              *ptr_res      = 0;
    cl_int           errs;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 1 - FIND PLATFORM ID        //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    cl_uint num_platforms = 0;
    char    cl_platform_vendor[1001];

    // Use clGetPlatformIDs API to extract # of platforms available
    errs = clGetPlatformIDs(0, NULL, &num_platforms);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to find any platforms" << std::endl;
        exit(EXIT_FAILURE);
    }
    std::cout << "The number of platforms found is " << num_platforms << std::endl;

    cl_platform_id *platform_ids = (cl_platform_id *) malloc(sizeof(platform_id) * num_platforms);

    // Use # of platforms from the above clGetPlatformIDs API call
    // to limit the # of platforms read into "platform_ids" list
    errs = clGetPlatformIDs(num_platforms, platform_ids, NULL);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to retrieve Platform ID" << std::endl;
        exit(EXIT_FAILURE);
    }
    std::cout << "num_platforms = " << num_platforms << std::endl;

    // Loop through platforms list and extract the CL_PLATFORM_VENDOR
    // field to find the platform targeted for "Xilinx"
    for (unsigned int iplat = 0; iplat < num_platforms; iplat++) {
        errs = clGetPlatformInfo(platform_ids[iplat], CL_PLATFORM_VENDOR,
                                 1000, (void *)cl_platform_vendor, NULL);
        if (errs != CL_SUCCESS) {
            std::cout << "FAILED to retrieve Platform Info" << std::endl;
            exit(EXIT_FAILURE);
        }
        if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
            platform_id = platform_ids[iplat];
        }
    }

    if (platform_id == 0) {
        std::cout << "FAILED to find a Xilinx platform" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "The Platform ID = " << platform_id << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 2 - FIND DEVICE ID          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    // Use 16 as a safe upper bound for the number of devices on the platform
    cl_device_id devices[16];
    char         cl_device_name[1001];
    cl_uint      num_devices;

    errs = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16,
                          devices, &num_devices);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to find any devices" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "Found " << num_devices << " device(s)" << std::endl;

    // This assumes ONLY 1 Xilinx device is found for the sake of simplicity
    for (cl_uint idev = 0; idev < num_devices; idev++) {
        // Pass sizeof(cl_device_name) so a long device name cannot overflow the buffer
        errs = clGetDeviceInfo(devices[idev], CL_DEVICE_NAME,
                               sizeof(cl_device_name), cl_device_name, 0);
        if (errs != CL_SUCCESS) {
            std::cout << "FAILED to retrieve Device Info" << std::endl;
            exit(EXIT_FAILURE);
        }
        std::cout << "Device Name (CL_DEVICE_NAME) is " << cl_device_name << std::endl;
        device_id = devices[idev];
    }

    std::cout << "The Device ID = " << device_id << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 3 - CREATE CONTEXT          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    context = clCreateContext(0, 1, &device_id, NULL, NULL, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to create a compute context" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "The Platform Context = " << context << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 4 - CREATE COMMAND QUEUE    //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    // OpenCL 1.2 compliant call; this will generate a "clCreateCommandQueue is
    // deprecated" warning when compiled against OpenCL 2.0+ headers
    command_queue = clCreateCommandQueue(context, device_id,
                                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                         &errs);
    // OpenCL 2.0 compliant equivalent (the Xilinx runtime currently implements
    // OpenCL 1.2); note it takes a zero-terminated properties list, not a bitfield:
    //cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
    //command_queue = clCreateCommandQueueWithProperties(context, device_id, props, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to create a compute command queue" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "The Command Queue was created" << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // INITIALIZE OPENCL ENVIRONMENT: STEP 5 - PROGRAM KERNEL          //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////
    cl_int binary_status;
    unsigned char *krnl_bin;

    const int krnl_size = load_file_to_memory(krnl_file, (char **) &krnl_bin);
    if (krnl_size < 0) {
        std::cout << "FAILED to load kernel binary file " << krnl_file << std::endl;
        exit(EXIT_FAILURE);
    }
    const size_t krnl_size_bytes = (size_t) krnl_size;

    // Note the argument order: the per-device binary-load status is returned
    // before the overall error code
    program = clCreateProgramWithBinary(context, 1, &device_id, &krnl_size_bytes,
                                        (const unsigned char **) &krnl_bin,
                                        &binary_status, &errs);
    if (errs != CL_SUCCESS || binary_status != CL_SUCCESS) {
        std::cout << "FAILED to program the kernel" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "The PL was programmed with the kernel" << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 1 - CREATE KERNEL OBJ & BUFFERS  //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    kernel = clCreateKernel(program, "krnl_arrayAdd", &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to identify the kernel" << std::endl;
        exit(EXIT_FAILURE);
    }
    buffer_a = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes, NULL, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to create buffer_a" << std::endl;
        exit(EXIT_FAILURE);
    }
    buffer_b = clCreateBuffer(context, CL_MEM_READ_ONLY, size_in_bytes, NULL, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to create buffer_b" << std::endl;
        exit(EXIT_FAILURE);
    }
    buffer_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size_in_bytes, NULL, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to create buffer_res" << std::endl;
        exit(EXIT_FAILURE);
    }
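
    // The argument indices passed to clSetKernelArg (0 through 3) must
    // match the parameter order of krnl_arrayAdd in the kernel source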
    errs = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_a);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to set Kernel Arguments for buffer_a" << std::endl;
        exit(EXIT_FAILURE);
    }
    errs = clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_b);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to set Kernel Arguments for buffer_b" << std::endl;
        exit(EXIT_FAILURE);
    }
    errs = clSetKernelArg(kernel, 2, sizeof(cl_mem), &buffer_res);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to set Kernel Arguments for buffer_res" << std::endl;
        exit(EXIT_FAILURE);
    }
    cl_uint numIterations = DATA_SIZE;
    errs = clSetKernelArg(kernel, 3, sizeof(cl_uint), &numIterations);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to set Kernel Arguments for numIterations" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::cout << "The Kernel Object was created" << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 2 - BUFFER(S) XFER HOST > KERNEL //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    // Map the buffers into host address space; CL_TRUE makes each map blocking
    ptr_a   = (int *) clEnqueueMapBuffer(command_queue, buffer_a, CL_TRUE,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to Enqueue Map Buffer 'buffer_a'" << std::endl;
        exit(EXIT_FAILURE);
    }
    ptr_b   = (int *) clEnqueueMapBuffer(command_queue, buffer_b, CL_TRUE,
                                         CL_MAP_WRITE, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to Enqueue Map Buffer 'buffer_b'" << std::endl;
        exit(EXIT_FAILURE);
    }
    // buffer_res is also mapped for write so it can be zero-initialized below
    ptr_res = (int *) clEnqueueMapBuffer(command_queue, buffer_res, CL_TRUE,
                                         CL_MAP_WRITE | CL_MAP_READ, 0, size_in_bytes,
                                         0, nullptr, nullptr, &errs);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to Enqueue Map Buffer 'buffer_res'" << std::endl;
        exit(EXIT_FAILURE);
    }

    for(int i = 0 ; i< DATA_SIZE; i++){
        ptr_a[i]   = 10;
        ptr_b[i]   = 20;
        ptr_res[i] = 0;
    }

    const cl_mem mems[3] = {buffer_a, buffer_b, buffer_res};

    // Flags = 0 migrates the buffer contents from host memory to the
    // device's global memory
    errs = clEnqueueMigrateMemObjects(command_queue, 3, mems, 0, 0,
                                      NULL, NULL);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to enqueue the buffer migrations to the device" << std::endl;
        exit(EXIT_FAILURE);
    }

    // The command queue is out-of-order, so wait for the migration to
    // complete before the kernel launch is enqueued
    clFinish(command_queue);

    std::cout << "The transfer from Host Memory to Global Memory was successful" << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 3 - LAUNCH KERNEL                //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////
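    // clEnqueueTask launches the kernel as a single work-item task,
    // equivalent to clEnqueueNDRangeKernel with a global size of one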
    errs = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to execute kernel" << std::endl;
        exit(EXIT_FAILURE);
    }

    // Wait for all commands in command queue to complete
    clFinish(command_queue);

    std::cout << "The kernel execution was successful" << std::endl;

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 4 - BUFFER(S) XFER KERNEL > HOST //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////
    errs = clEnqueueMigrateMemObjects(command_queue, 1, &buffer_res,
                                      CL_MIGRATE_MEM_OBJECT_HOST, 0,
                                      NULL, NULL);
    if (errs != CL_SUCCESS) {
        std::cout << "FAILED to enqueue the buffer_res migration to the host" << std::endl;
        exit(EXIT_FAILURE);
    }
    // Wait for the migration to complete before reading the results
    clFinish(command_queue);
    std::cout << "The transfer from Global Memory to Host Memory was successful" << std::endl;

    //Verify the result against a host-computed reference
    int mismatch = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        int host_result = ptr_a[i] + ptr_b[i];
        if (ptr_res[i] != host_result) {
            std::cout << "Result Mismatch at index " << i << ": expected "
                      << host_result << ", got " << ptr_res[i] << std::endl;
            mismatch = 1;
            break;
        }
    }
    if (mismatch == 0) {
        std::cout << "The result matches; kernel run was successful" << std::endl;
    }

  /////////////////////////////////////////////////////////////////////
  //                                                                 //
  // EXEC HOST-TO-KERNEL CMDS: STEP 5 - POST-PROCESSING/FPGA CLEANUP //
  //                                                                 //
  /////////////////////////////////////////////////////////////////////

    clEnqueueUnmapMemObject(command_queue, buffer_a, ptr_a, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_b, ptr_b, 0, NULL, nullptr);
    clEnqueueUnmapMemObject(command_queue, buffer_res, ptr_res, 0, NULL, nullptr);
    // Ensure the unmap commands complete before the OpenCL objects are released
    clFinish(command_queue);
    // Release the buffers and the remaining OpenCL objects
    clReleaseMemObject(buffer_a);
    clReleaseMemObject(buffer_b);
    clReleaseMemObject(buffer_res);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    clReleaseDevice(device_id);
    free(platform_ids);
    free(krnl_bin);

    return mismatch ? EXIT_FAILURE : EXIT_SUCCESS;
}


int load_file_to_memory(const char *filename, char **result) {
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {
        *result = NULL;
        return -1; // -1 means the file could not be opened
    }
    // Determine the file size, then rewind and read the whole file
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    *result = (char *)malloc(size + 1);
    if (*result == NULL ||
        (size_t)size != fread(*result, sizeof(char), (size_t)size, f)) {
        free(*result);
        *result = NULL;
        fclose(f); // close the file on the read-failure path as well
        return -2; // -2 means the file could not be read
    }
    fclose(f);
    (*result)[size] = 0;
    return (int)size;
}
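
For completeness: the host code above assumes a kernel named krnl_arrayAdd that takes two read-only input buffers, one write-only output buffer, and an element count, and that has been compiled (with the Vitis v++ compiler) into the .xclbin passed on the command line. If that kernel were written in HLS C++, it might look something like the minimal sketch below; this is illustrative only, with interface pragmas omitted, and is not the exact kernel source used for this post.

// Hypothetical HLS C++ source matching the four arguments the host
// sets with clSetKernelArg; illustrative only.
extern "C" void krnl_arrayAdd(const int *a,      // arg 0: input buffer
                              const int *b,      // arg 1: input buffer
                              int *res,          // arg 2: output buffer
                              unsigned int n) {  // arg 3: element count
    for (unsigned int i = 0; i < n; i++) {
        res[i] = a[i] + b[i];                    // element-wise addition
    }
}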



About Mike Rockel

Mike Rockel is located near Chicago (IL) and has served as a Field Applications Engineer (FAE) supporting AMD for 20 years. Mike began supporting AMD in the days of XC4000 FPGAs, XC9500 CPLDs, and M1/Foundation development tools. During these past 20 years, Mike has focused on training customers on the current AMD silicon and development solutions, assisting with design integration issues, and assisting with initial board bring-up, with the ultimate goal of getting his customers to market as quickly as possible. Mike has been happily married for 25 years and has two boys, both of whom are currently studying Computer Science in college (possibly future AMD developers!). In Mike's free time, his interests include following Big Ten college basketball, following USA soccer, music, concerts, and detailing cars.