## Introduction

When you search for a product on an e-commerce site, the site uses your profile and past purchases to surface products and services you are most likely to be interested in. Similarly, when you use a streaming entertainment service such as Netflix, your profile and viewing history drive the recommendations you see for shows and movies. One common approach in such recommendation systems is to find the buyers or viewers most similar to you and recommend the products and services those similar buyers or viewers chose. Although most visible in the retail and entertainment industries, identifying the similarity between two items is employed across a variety of industries and applications, including recruitment (identifying similar candidates), academia and publishing (detecting plagiarism), machine learning (training classifiers), health care (identifying similar patients), and bioinformatics (identifying similar proteins).

There are multiple algorithms for computing the similarity between two items, such as Jaccard, Pearson, Cosine, Overlap, and Euclidean. All of them require representing the properties of an item as a vector of numerical values, which we will call a Property Vector in this article. To compute the similarity between two items, their Property Vectors are compared using an algorithm that captures a geometric relationship, such as the angle between the Property Vectors when they are plotted in an N-dimensional space.

As both the number of items and the number of properties per item have grown exponentially in recent years, there is a pressing need to compute similarity faster and at lower cost. This article describes how the computation of Cosine Similarity, the most popular of the similarity algorithms, can be accelerated using Xilinx Alveo U50 or U280 cards, enabling businesses to offer better products and services at lower operational cost. Although this article focuses on Cosine Similarity, any of the similarity algorithms can be accelerated on Alveo in the same way by choosing an appropriate compute function (called a kernel) for that algorithm.

## Cosine Similarity

Cosine Similarity is a measure of the similarity of two non-zero vectors of numbers. Specifically, it is the cosine of the angle between the two vectors when they are plotted in an N-dimensional coordinate system. The smaller the angle, the more similar the vectors. The angle can vary from 0 (most similar) to 180 degrees (most dissimilar); accordingly, the cosine of the angle (cos θ) varies from 1 (most similar) to -1 (most dissimilar). Mathematically, it is defined as:

similarity(A, B) = cos θ = (A · B) / (||A|| × ||B||)

where A and B are vectors of numbers, A · B is their dot product, ||A|| is the magnitude of vector A, and ||B|| is the magnitude of vector B.

Let us take two examples to illustrate.

Example 1:

A = \[3, 4] and B = \[5, 12]

||A|| = sqrt( 3 x 3 + 4 x 4) = 5

||B|| = sqrt(5 x 5 + 12 x 12) = 13

similarity(A, B) = cos θ = (3 x 5 + 4 x 12) /  ( 5 x 13) = 0.969

If points A and B were plotted in a 2-D space, with a line drawn from the origin to each point, the angle between the two lines is about 15º, and its cosine is similarity(A, B).

Example 2:

A = \[3, -4] and B = \[5, 12]

||A|| = sqrt(3 x 3 + (-4) x (-4)) = 5

||B|| = sqrt(5 x 5 + 12 x 12) = 13

similarity(A, B) = cos θ = (3 x 5 + (-4) x 12) /  ( 5 x 13) = -0.5077

If points A and B were plotted in a 2-D space, with a line drawn from the origin to each point, the angle between the two lines is about 120º, and its cosine is similarity(A, B).

As can be noted, the A and B of Example 1 are more similar to each other than the A and B of Example 2 are.

Now, this same concept extends to vectors of size greater than two, such as A = \[2, 4, 5, 6, 8] and B = \[4, 5, 7, 8, 9]. Since the vectors are of size 5 in this case, imagine a 5-dimensional space and plot points A and B in it; the angle between the lines from the origin to points A and B is θ, and cos θ is similarity(A, B).

## Applying Cosine Similarity to a problem

Items being compared may not always have numerical properties. For example, a property may be a string of characters. In such cases, a mechanism to uniquely represent a string or other non-numerical value as a numerical value (such as computing a hash) can be devised and used in the similarity calculation.

### Representative Problem

We present a solution for the following representative problem:

There are 10 million existing items, each with a Property Vector of size 198. We will call these the "target" items, against which the similarity of a "source" item (also with a Property Vector of size 198) needs to be computed. After computing the 10 million cosine similarities, the top 100 highest scores need to be displayed. Note that the "source" item can be one of the 10 million "target" items or a completely new item.

## Implementation

The records of the "target" items can be stored in any database. A process or query running on the CPU can traverse the database to construct a vector of 200 integers per item: locations 1 through 198 store the item's Property Vector, location 199 stores the magnitude of the Property Vector (the square root of the sum of squares of its elements), and location 200 stores a unique ID for the item. These 200-integer vectors for the 10 million items, comprising 8 GB of data in total, can be stored in the High Bandwidth Memory (HBM) of an Alveo U50 or U280 card. To compute the cosine similarity of a "source" item, a vector of 200 integers (laid out like the target item vectors) is constructed on the CPU and sent to the Alveo card along with a request to compute cosine similarity. The Alveo card then computes and returns the top similarity scores to the CPU for any small amount of remaining computation, followed by presentation of the results.

A two-compute-unit (CU) system can be implemented on the Programmable Logic of the U50 or U280 using an HLS C++ kernel, as shown in Figure 1 below, to compute the top cosine similarities between a source item and the 10 million target items stored in HBM. The two CUs run in parallel, with each CU connecting to one HBM stack that stores the Property Vectors of 5 million target items.

The design of the HLS C++ kernel for each CU is illustrated in the figure below. Each CU contains 16 fully pipelined cosine similarity processing elements (PEs) and one MaxK component that selects the top similarities. The 16 PEs are connected to 16 channels of HBM to access the 5 million target Property Vectors in parallel. The key implementation components, e.g., the data movers and dot-product computation primitives, come from the Vitis BLAS library. The incoming source item record is transmitted by the host to the FPGA's PLRAM and then duplicated to the 16 PEs. The MaxK primitive computes the top cosine similarities and their corresponding indices and writes them to PLRAM, from which the host reads them out. Finally, the host performs a simple computation to extract the final top similarities from the two top-similarity sets returned by the two CUs.

This streamlined architecture not only provides a practical usage example for the Vitis BLAS L1 primitives but also demonstrates the design strategy of the Vitis BLAS library, a composable linear algebra acceleration library for FPGA. As illustrated by Figure 3, the library components cover three levels, namely L1 primitives, L2 kernels, and L3 APIs, allowing both hardware and software developers to build efficient FPGA accelerators. The L1 primitives provided by the Vitis BLAS library include computation primitives with stream interfaces and data-movement primitives for retrieving data from different memory layouts. These primitives allow hardware developers to quickly assemble custom logic, in this case MaxK, with library primitives. The L1 data movers and dot-product primitives are also used to construct a highly efficient streamlined data path.

## Results

| Card | Frequency | Resources Used | Time | Time Savings vs Xeon E5 | Time Savings vs Xeon 8268 | TDP | Power Savings vs Xeon E5 | Power Savings vs Xeon 8268 | Cost | Cost Savings vs Xeon E5 | Cost Savings vs Xeon 8268 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Xilinx U50 | 280 MHz | 27.24% DSP, 15.7% BRAM, 29.52% REG, 36.72% LUT | 29 ms | 56X | 20X | 75 W | 2.4X | 5.4X | \$2K | 3X | 20X |
| Xilinx U280 | 303 MHz | 17.96% DSP, 9.34% BRAM, 18.05% REG, 21.94% LUT | 27 ms | 60X | 22X | 225 W | 0.8X | 1.8X | \$7.5K | 0.8X | 5X |
| 2x Intel Xeon E5-2640 v3 | 2.6 GHz | 2x8 = 16 physical cores (run with 16 threads) | 1633 ms | NA | NA | 180 W | NA | NA | \$6K | NA | NA |
| 2x Intel Xeon 8268 | 2.9 GHz | 2x24 = 48 physical cores (run with 48 threads) | 600 ms | NA | NA | 410 W | NA | NA | \$40K | NA | NA |

## Conclusion

Xilinx Alveo offers compelling time, power, and cost advantages over CPUs for computing cosine similarity. As an example, the U50 Alveo card employs a very power-efficient Xilinx FPGA with thousands of multiply-accumulate units (5,900 DSPs), millions of hardware elements for building custom digital logic from an algorithm written in C++ (or Verilog), and more than 25 MB of distributed on-chip fast memory capable of delivering 30 TB/s of bandwidth.

CPUs are constrained by the von Neumann fixed instruction-set architecture. FPGAs, on the contrary, are free to choose an optimal hardware-level implementation for an algorithm. For example, to get the top 100 sorted similarity scores out of 10 million, the Xilinx compiler tools synthesize a digital circuit with time complexity O(N), so it takes only about 10 million clock cycles. CPUs take many more clock cycles because their compilers must map the algorithm onto the instructions that fixed ALUs are capable of executing. For instance, an optimized CPU implementation using a heap of depth log2(100) would take about 10 million × log2(100) × (an average of 16 cycles per heap manipulation) + 100 × log2(100) = 10,000,000 × 7 × 16 + 700 ≈ 1 billion clock cycles to produce the top 100 sorted scores, roughly 100X more clock cycles than on the FPGA. So even if the CPU clock frequency is 5X faster, the overall time taken on the CPU is still 20X longer.

Additionally, FPGAs do not incur thread-synchronization overhead: synchronization between massively parallel circuit-level compute blocks is built into the clock-driven synchronization put in place when the circuit is synthesized. Last but not least, data movement across different compute blocks is much faster on an FPGA, as data produced by one block is made available to the next block in streaming fashion, enabling a very deep pipeline (at times more than 200 stages) that yields much higher throughput.
Alvin Clark is a Sr. Technical Marketing Engineer working on Software and AI Platforms at Xilinx, helping usher in the ACAP era of programmable devices.  Alvin has spent his career working and supporting countless FPGA and embedded designs in many varied fields ranging from Consumer to Medical to Aerospace & Defense.  He has a degree from the University of California, San Diego and is working on his graduate degree at the Georgia Institute of Technology.  