One of the fundamental challenges of modern design with SoCs and FPGAs is meeting the system power goal.  This gets even trickier with the acceleration flows in the Xilinx® Vitis™ unified software platform, as tradeoffs can move power from the Processing System (PS) to the Programmable Logic (PL).  While a Zynq® UltraScale+™ device has many power-saving features in both the PS and the PL fabric, it is often hard to quantify how system design and settings are impacting power when making design tradeoffs.  Fortunately, many Xilinx development boards, such as the Zynq ZCU102, ZCU104, and RFSoC ZCU111, were designed to allow monitoring of the power.  In this article, we will show you how to easily get power measurements to examine design tradeoffs on the ZCU102.

Specifically, on the ZCU102 board, the power rails have been kept separate, and shunt resistors have been placed on each of the rails that touch the FPGA, and even on the supply for the FMC card.  Each shunt resistor is connected to a TI INA226, which measures both the voltage and the current and is connected to an I2C bus for easy measurement-data collection.  Details on the INA226 can be found at http://www.ti.com/product/INA226.  Table 1 shows all the rails and the information on each.

table-zcu102-monitored-power-rails

A simple Linux application was written to interact with the INA226 devices on the ZCU102 to enable voltage and current monitoring.  The application leverages the /sys/class/hwmon interface, which is documented here:

https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface

The ZCU102 BSP already includes the I2C device entries linking each INA226 to the TI ina2xx driver, which plugs into the hwmon framework.  Each entry includes the I2C address as well as the shunt resistor value.  The application reads the curr1_input and in1_input attributes to extract the current and voltage for each rail listed in Table 1.  The application also consolidates all the rails to report back PS, PL, and MGT power.  This application heavily leveraged the pre-existing code here:

https://github.com/parker-xilinx/xilinx-linux-power-utility/blob/master/src/ina_bm.cc

The modified version can be found at the end of the article.

Although this application is tailored to the ZCU102, the top of the application can be modified to match the rails of any board, as long as the device tree includes the appropriate entries.  The device tree and the corresponding BSP should already be correct for all Xilinx Zynq UltraScale+ boards.

The application provides several switches to make the logging process simpler:

- -t sets the interval, in seconds, between measurements.  The default is one second.

- -o allows the user to specify the name of an output file.

- -v enables a verbose mode.

- -d displays the power values on the terminal as well.

- -l lists the hwmon interfaces that were found.

- -n specifies the number of readings to log.

Most of these options exist to support generic power monitoring, so not all were used to capture the data in this article.  Since we were interested in the power during actual operation, the following command was used to capture power data after the application under test was started.  The command below takes 5 readings with a delay of 2 seconds between each reading and writes them to test.txt.

./powerapp.elf -n 5 -t 2 -o test.txt

When the power app is run, it calculates the following four power values in Watts.  The screen capture in Figure 1 shows everything from compiling powerapp.c through running it on the board.

PS = VCCPSINTFP + VCCPSINTLP + VCCPSAUX + VCCPSPLL + VCCPSDDR + VCCOPS + VCCOPS3 + VCCPSDDRPLL

PL = VCCINT + VCCBRAM + VCCAUX + VCC1V2 + VCC3V3

MGT = MGTRAVCC + MGTRAVTT + MGTAVCC + MGTAVTT

Total Power = PS + PL + MGT

screen-capture-of-compiling-and-using-powerapp

Now that we understand the basics of the power app, let's test it out on a ZCU102 running an artificial intelligence application that accelerates the computation of a Tiny YOLO v3 network while varying the number of DPU threads.  For our testing, the reference design that comes with the Xilinx DNNDK 3.1 PetaLinux image was used in conjunction with the AI SDK.  This combination makes it easy to launch an AI acceleration and vary the number of threads being accelerated.  For information on this DNNDK 3.1 design and the AI SDK toolkit, please refer to the Xilinx AI Developer Hub for Edge applications.

The design that comes with the DNNDK PetaLinux image for the ZCU102 contains three B4096 DPUs that run at 333 MHz.  For the initial test, a trained Tiny YOLO v3 network for object detection is used and the number of threads is varied.  In this test case, the images for detection come from the SD card, and both the performance in frames per second and the power are measured.  To run the test, the following commands were used in two separate terminals: one for launching the acceleration and one for the power measurement.  The data in Table 2 and the chart in Figure 2 are from this test.

./test_performance_yolov3_tiny_coco image.lst -t 1 -s 60

./powerapp.elf -n 5 -t 2 -o onethreads.txt

./test_performance_yolov3_tiny_coco image.lst -t 2 -s 60

./powerapp.elf -n 5 -t 2 -o twothreads.txt

...

./test_performance_yolov3_tiny_coco image.lst -t 5 -s 60

./powerapp.elf -n 5 -t 2 -o fivethreads.txt

accelerated-tiny-yolo-v3-vs-number-of-threads
tiny-yolo-v3-frames-second-and-power-vs-threads

The data collected shows that with DNNDK 3.1, the DPU's clock gating is effective and the power required is proportional to the amount of processing.  (In early versions of the DPU the clocks were always running; now they are gated and only run when the DPU is processing data.)  Furthermore, it is easy to see that performance increases linearly up to 4 threads on the three B4096 DPUs.  After 4 threads, the design becomes performance-limited and additional threads do not increase performance.

While the PL power increased by about 4 W per thread as additional processing threads were added, the power on the PS only increased by about 0.2 W per thread.  During these same changes, MGT power was nearly constant, as the MGTs are used to transmit the image to a DisplayPort display.

When using the Vitis tool to develop or accelerate a design, it is important to minimize the accelerated design's power while meeting performance goals.  Using this application together with the power-monitor features available on Xilinx development boards makes it easier to spot thermal issues and to make that tradeoff with real measurements.

The code in the demo was tailored specifically to the ZCU102, but it could easily be modified to work with the ZCU104 or ZCU111 boards.  The power-supply monitoring information for those boards is in each board's User Guide.  Also, while this was used as a standalone app, it could easily become a function call used in a custom design to record the system power.

Modified version of the code:

#include <stdlib.h>
#include <stdint.h>
#include <dirent.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

//These are specific to ZCU102
#define VCCPSINTFP 0
#define VCCINTLP 1
#define VCCPSAUX 2
#define VCCPSPLL 3
#define MGTRAVCC 4
#define MGTRAVTT 5
#define VCCPSDDR 6
#define VCCOPS 7
#define VCCOPS3 8
#define VCCPSDDRPLL 9
#define VCCINT  10
#define VCCBRAM 11
#define VCCAUX 12
#define VCC1V2 13
#define VCC3V3 14
#define VADJ_FMC 15
#define MGTAVCC 16
#define MGTAVTT 17

const char railname_arr[50][12] = {
		"VCCPSINTFP",
		"VCCINTLP",
		"VCCPSAUX",
		"VCCPSPLL",
		"MGTRAVCC",
		"MGTRAVTT",
		"VCCPSDDR",
		"VCCOPS",
		"VCCOPS3",
		"VCCPSDDRPLL",
		"VCCINT",
		"VCCBRAM",
		"VCCAUX",
		"VCC1V2",
		"VCC3V3",
		"VADJ_FMC",
		"MGTAVCC",
		"MGTAVTT"
};



typedef struct ina {

	char current_path[50];
	char voltage_path[50];
	char name[12];
	int current;
	int voltage;
	int last;

} ina;

int cmp_ina(const void *a, const void *b) {
	ina *temp1 = (ina*)a;
	ina *temp2 = (ina*)b;
	int len1 = strlen(temp1->current_path);
	int len2 = strlen(temp2->current_path);

	if(len1==len2){
		return strcmp(temp1->current_path, temp2->current_path);
	} else if(len1>len2){
		return 1;
	} else {
		return -1;
	}

}

void populate_ina_array(ina *inas) {
	DIR *d;
	struct dirent *dir;

	char buffer[100];
	char fname_buff[100];

	FILE *fptr;

	d = opendir("/sys/class/hwmon/");
	if (d == NULL)
		return;
	int counter = 0;

	while ((dir = readdir(d)) != NULL) {
		if (strncmp(".", dir->d_name, 1) == 0) {
			continue;
		}
		//printf("tree: %s\n", dir->d_name);
		strcpy(fname_buff, "/sys/class/hwmon/");
		strcat(fname_buff, dir->d_name);
		strcat(fname_buff, "/name");

		//printf("name: %s\n", fname_buff);

		fptr = fopen(fname_buff, "r");
		if (fptr == NULL)
			continue;
		memset(buffer, 0, sizeof(buffer)); // ensure the name is NUL-terminated
		fread(buffer, 1, 10, fptr);
		fclose(fptr);
		//printf("device type: %s", buffer);

		if (strncmp(buffer, "ina", 3) == 0) {
			fname_buff[strlen(fname_buff)-5] = 0;

			strcpy(inas[counter].current_path,fname_buff);
			strcat(inas[counter].current_path,"/curr1_input");

			strcpy(inas[counter].voltage_path,fname_buff);
			strcat(inas[counter].voltage_path,"/in1_input");

//			printf("found: %s\n", inas[counter].ina_dir);
			inas[counter].last = 0;
			counter++;
		}

	}

	qsort(inas, counter, sizeof(ina), cmp_ina);
	if (counter > 0)
		inas[counter-1].last = 1;

	// Assign rail names in sorted order
	for (int i = 0; i < counter; i++)
		strcpy(inas[i].name, railname_arr[i]);

	closedir(d);
}

void list_inas (ina *inas) {
	int counter = 0;
	while(1) {
		printf("Found INA%03d at dir: %s\n", counter, inas[counter].current_path);
		if(inas[counter].last == 1)
			break;

		counter++;
	}
	return;
}

void run_bm (char target_file[50], int sleep_per, int iterations, int verbose, int display, ina *inas) {
	FILE *sav_ptr;
	FILE *ina_ptr;

	sav_ptr = fopen(target_file, "w");
	if (sav_ptr == NULL) {
		printf("Could not open %s for writing\n", target_file);
		return;
	}

	char buffer[20];
	float plpower = 0;
	float pspower = 0;
	float mgtpower = 0;

	int counter = 0;
	while(1) {
		if (verbose == 1) {
			fprintf(sav_ptr, "%s mV,%s mA,", inas[counter].name, inas[counter].name);
		}
		if(inas[counter].last == 1)
			break;

		counter++;
	}

	if (verbose == 1) {
		fprintf(sav_ptr, "\n");
	}

	for (int j = 0; j < iterations; j++) {
		counter = 0;
		while(1) {

			ina_ptr = fopen(inas[counter].voltage_path, "r");

			fscanf(ina_ptr,"%19[^\n]", buffer);

			inas[counter].voltage = atoi(buffer);

			if(verbose==1) {
				printf("Voltage # %d = %d \n", counter, atoi(buffer));
				fprintf(sav_ptr, "%s,", buffer);
			}
			fclose(ina_ptr);

			ina_ptr = fopen(inas[counter].current_path, "r");

			fscanf(ina_ptr,"%19[^\n]", buffer);

			inas[counter].current = atoi(buffer);
			if(verbose==1) {
				printf("Current # %d = %d \n", counter, atoi(buffer));
				fprintf(sav_ptr, "%s,", buffer);
			}




			if(inas[counter].last) {
				if(verbose==1){
					fprintf(sav_ptr, "\n");
				}
				if (j == 0){
					fprintf(sav_ptr, "PS Power, PL Power, MGT Power, Total Power");
					if(display==1){
						printf("PS Power, PL Power, MGT Power, Total Power\n");
					}
					fprintf(sav_ptr, "\n");
				}

				pspower = (float) (inas[VCCPSINTFP].voltage*inas[VCCPSINTFP].current+
						inas[VCCINTLP].voltage*inas[VCCINTLP].current+
						inas[VCCPSAUX].voltage*inas[VCCPSAUX].current+
						inas[VCCPSPLL].voltage*inas[VCCPSPLL].current+
						inas[VCCPSDDR].voltage*inas[VCCPSDDR].current+
						inas[VCCOPS].voltage*inas[VCCOPS].current+
						inas[VCCOPS3].voltage*inas[VCCOPS3].current+
						inas[VCCPSDDRPLL].voltage*inas[VCCPSDDRPLL].current)/1000000.0;

				fprintf(sav_ptr, " %.3f,", pspower);
				if(display==1){
					printf(" %.3f,", pspower);
				}
				plpower = (float) (inas[VCCINT].voltage*inas[VCCINT].current+
						inas[VCCBRAM].voltage*inas[VCCBRAM].current+
						inas[VCCAUX].voltage*inas[VCCAUX].current+
						inas[VCC1V2].voltage*inas[VCC1V2].current+
						inas[VCC3V3].voltage*inas[VCC3V3].current)/1000000.0;

				fprintf(sav_ptr, " %.3f,", plpower);
				if(display==1){
					printf(" %.3f,", plpower);
				}
				// VCC3V3 is already counted in the PL total above
				mgtpower = (float) (inas[MGTRAVCC].voltage*inas[MGTRAVCC].current+
						inas[MGTRAVTT].voltage*inas[MGTRAVTT].current+
						inas[MGTAVCC].voltage*inas[MGTAVCC].current+
						inas[MGTAVTT].voltage*inas[MGTAVTT].current)/1000000.0;

				fprintf(sav_ptr, " %.3f,", mgtpower);
				if(display==1){
					printf(" %.3f,", mgtpower);
				}

				fprintf(sav_ptr, " %.3f", mgtpower+plpower+pspower);
				if(display==1){
					printf(" %.3f\n", mgtpower+plpower+pspower);
				}
				fprintf(sav_ptr, "\n");

				fclose(ina_ptr);
				break;
			}

			fclose(ina_ptr);

			counter++;

		}

		sleep(sleep_per);
	}
	fclose(sav_ptr);
}

int main(int argc, char *argv[]) {

	ina inas[30];
	populate_ina_array(inas);

	int opt;
	int sleep_per = 1;
	int iterations = 1;
	int verbose = 0;
	int display = 0;
	char target_file[50] = "./out.txt";

	while ((opt = getopt(argc, argv, "t:o:vdn:l")) != -1) {

		switch (opt) {

			case 't':
				printf("Running with sleep @ %d\n", atoi(optarg));
				sleep_per = atoi(optarg);
				break;
			case 'o':
				printf("File output to %s\n", optarg);
				strcpy(target_file, optarg);
				break;
			case 'v':
				printf("Verbose mode on\n");
				verbose = 1;
				break;
			case 'd':
				printf("Display mode on\n");
				display = 1;
				break;
			case 'l':
				list_inas(inas);
				break;
			case 'n':
				printf("Testing %d iterations\n", atoi(optarg));
				iterations = atoi(optarg);
				break;
		}
	}
	run_bm(target_file, sleep_per, iterations, verbose, display, inas);

	return 0;
}


About Don Matson

Don Matson is an experienced FAE with AMD in the Seattle area.  His areas of expertise include timing closure, power, thermal solutions, and machine learning.  Outside of work, he enjoys playing basketball, skiing, hiking, and kayaking, as well as traveling the world with his wife.


About Luis Bielich

Luis Bielich is a Colorado native who holds Bachelor's and Master's degrees from the Colorado School of Mines.  He has been with AMD for 12 years, supporting Colorado engineers with their FPGA/SoC designs.  He is the author of five application notes and holds one patent.  His recent time has been focused on embedded systems and AMD software solutions.  He is looking forward to the new capabilities of the ACAP solution.