Published: Dec 04, 2019

When you visit a website to search for a product, an e-commerce site can use your profile and past purchases to find products and services that you are most likely to be interested in. Similarly, when you use a streaming entertainment service such as Netflix, your profile and viewing history drive the recommendations you see for shows and movies. One approach employed by such recommendation systems is to find the buyers or viewers most similar to you and recommend the products and services those similar buyers or viewers have chosen. Although most visible in the retail and entertainment industries, identifying the similarity between two items is employed in a variety of industries and applications: recruitment to identify similar candidates, academia and publishing to detect plagiarism, machine learning to train classifiers, health care to identify similar patients, and bioinformatics to identify similar proteins.

There are multiple algorithms to compute the similarity between two items, such as Jaccard, Pearson, Cosine, Overlap, and Euclidean. All of them require representing the properties of an item as a vector of numerical values, which we will call a Property Vector in this article. To compute the similarity between two items, the Property Vectors of the two items are compared using an algorithm that captures a geometric relationship between them, such as the angle between the Property Vectors when plotted in an N-dimensional space.

As both the number of items and the number of properties per item have grown exponentially in recent years, there is a pressing need to compute similarity faster and at a lower cost. This article describes how the computation of Cosine Similarity, the most popular of the similarity algorithms, can be accelerated using Xilinx Alveo U50 or U280 cards, enabling businesses to offer better products and services at lower operational cost. Although this article focuses on Cosine Similarity, any of the similarity algorithms can be accelerated on Alveo in the same way by choosing an appropriate compute function (called a kernel) for that algorithm.

Cosine similarity is defined as:

similarity(A, B) = cos θ = (A · B) / (||A|| x ||B||)

where A and B are vectors of numbers, A · B is the dot product of A and B, ||A|| is the magnitude of vector A, and ||B|| is the magnitude of vector B.

Let us take two examples to illustrate.

Example 1:

A = [3, 4] and B = [5, 12]

||A|| = sqrt( 3 x 3 + 4 x 4) = 5

||B|| = sqrt(5 x 5 + 12 x 12) = 13

similarity(A, B) = cos θ = (3 x 5 + 4 x 12) / ( 5 x 13) = 0.969

If points A and B were plotted in a 2-D space, the blue line represents A and the red line represents B. The cosine of the angle between them (around 15º) is similarity(A, B).

Example 2:

A = [3, -4] and B = [5, 12]

||A|| = sqrt(3 x 3 + (-4) x (-4)) = 5

||B|| = sqrt(5 x 5 + 12 x 12) = 13

similarity(A, B) = cos θ = (3 x 5 + (-4) x 12) / ( 5 x 13) = -0.5077

If points A and B were plotted in a 2-D space, the blue line represents A and the red line represents B. The cosine of the angle between them (around 120º) is similarity(A, B).

As can be noted, A and B in Example 1 are more similar to each other than A and B in Example 2 are: the higher the cosine value, the greater the similarity.

Now, this same concept can be extended to vectors of size greater than two, such as A = [2, 4, 5, 6, 8] and B = [4, 5, 7, 8, 9]. Since the vectors are of size 5 in this case, imagine a 5-dimensional space and plot points A and B in it; the angle between the lines from the origin to points A and B is θ, and cos θ is similarity(A, B).

Items being compared may not always have numerical properties. For example, a property may be a string of characters. In such cases, a mechanism to uniquely represent a string or other non-numerical value as a numerical value (such as computing a hash) can be devised and used for the similarity calculation.

We present a solution for the following representative problem:

There are 10 million existing items, each with a Property Vector of size 198. We will call these the "target" items, against which the similarity of a "source" item (also with a Property Vector of size 198) needs to be computed. After computing the 10 million cosine similarities, the top 100 highest scores need to be displayed. Note that the "source" item can be one of the 10 million "target" items or a completely new item.

The records of the "target" items can be stored in any database. A process or query run on the CPU can traverse the database to construct a vector of 200 integers per item: the 1st through 198th locations store the item's Property Vector, the 199th stores the magnitude of the Property Vector (square root of the sum of squares of its elements), and the 200th stores a unique ID for the item. These 200-integer vectors for the 10 million items, comprising 8 GB of data in total, can be stored in the High Bandwidth Memory (HBM) of an Alveo U50 or U280 card. To compute the cosine similarity of a "source" item, a vector of 200 integers (in the same layout as the target item vectors) can be constructed on the CPU and sent to the Alveo card along with a request to compute cosine similarity. The Alveo card then computes and returns the top similarity scores to the CPU, which performs any remaining light computation and presents the results.

A two compute unit (CU) system can be implemented on the Programmable Logic of the U50 or U280 using an HLS C++ kernel, as shown in Figure 1 below, to compute the top cosine similarities between a source item and the 10 million target items stored in HBM. The two CUs run in parallel, each connecting to one HBM stack that stores the Property Vectors of 5 million target items.

Card | Frequency | Resources Used | Time | Time Savings over Xeon E5 | Time Savings over Xeon 8268 | TDP | Power Savings over Xeon E5 | Power Savings over Xeon 8268 | Cost | Cost Savings over Xeon E5 | Cost Savings over Xeon 8268
---|---|---|---|---|---|---|---|---|---|---|---
Xilinx U50 | 280 MHz | 27.24% DSP, 15.7% BRAM, 29.52% REG, 36.72% LUT | 29 ms | 56X | 20X | 75 W | 2.4X | 5.4X | $2K | 3X | 20X
Xilinx U280 | 303 MHz | 17.96% DSP, 9.34% BRAM, 18.05% REG, 21.94% LUT | 27 ms | 60X | 22X | 225 W | 0.8X | 1.8X | $7.5K | 0.8X | 5X
2x Intel Xeon E5-2640 v3 | 2.6 GHz | 2x8 = 16 physical cores (run with 16 threads) | 1633 ms | NA | NA | 180 W | NA | NA | $6K | NA | NA
2x Intel Xeon 8268 | 2.9 GHz | 2x24 = 48 physical cores (run with 48 threads) | 600 ms | NA | NA | 410 W | NA | NA | $40K | NA | NA

Alvin Clark is a Sr. Technical Marketing Engineer working on Software and AI Platforms at Xilinx, helping usher in the ACAP era of programmable devices. Alvin has spent his career working and supporting countless FPGA and embedded designs in many varied fields ranging from Consumer to Medical to Aerospace & Defense. He has a degree from the University of California, San Diego and is working on his graduate degree at the Georgia Institute of Technology.

Kumar Deepak is a Distinguished Engineer in the Data Center Group (DCG) at Xilinx. He has over 20 years of experience in architecting and developing large-scale complex software and hardware systems. Currently, he is serving as the lead of the Debug & Profile features of the Vitis software suite and architect of solutions for Database Acceleration using Alveo. He received his B.S. in Electronics and Communication Engineering from the Indian Institute of Technology, Kharagpur.

Dr. Liang Ma is a Senior FPGA Design Engineer serving in Data Center Group (DCG) at Xilinx Ireland. He has many designs and publications in FPGA acceleration for various financial algorithms, machine learning algorithms, and other HPC algorithms. Before he joined Xilinx, he received his Ph.D. degree cum laude from the Polytechnic University of Turin with research interests in low-power and high-performance computing. Most of his previous work focused on FPGA accelerators’ development via Xilinx high-level synthesis and software-hardware co-design toolchains.