Measuring power consumption of ASIC digital blocks (even before manufacturing)

Recently, many articles on FPGA development have appeared on Habrahabr, with the direct participation of my colleagues and other users. Such articles help popularize this area of development and show that there is already significant interest in hardware development as a whole (colloquially called “iron”).

In this article, I will step into the almost “unplowed field” of ASIC development and talk about one interesting aspect of creating digital parts (IP blocks) in ASIC chips. This area of development is even narrower than FPGA.

ASIC (application-specific integrated circuit, “special purpose integrated circuit”) is an integrated circuit specialized for solving a specific problem.

This article illustrates a commonly used method of estimating the power consumption of a single IP block inside a chip before it can be measured on a fabricated chip. Already at an early stage, this estimate allows you to:

compare different variants of a digital data processing algorithm,
choose the best implementation by the criterion of consumption / digital losses,
estimate fairly accurately, in numbers, the power consumption in a chip manufactured in a specific technology.

A preliminary assessment will show in advance which blocks will consume the most energy in active mode.

With this method, and under some assumptions, it is possible to compare quite accurately several implementations of an algorithm written in an HDL (hardware description language). In our case this is Verilog, which is the most popular language for ASIC development.

Two assumptions speed up the comparison of multiple implementations:

I will not run the complete flow to obtain the final layout of the IP block (which includes all the parasitic capacitances that also affect consumption), and will confine myself to the so-called ideal wire-load model of the synthesized IP block. * A more accurate estimate in absolute numbers is obtained by synthesis in the advanced topographical mode (which takes the physical layout into account), but for a relative comparison this can be neglected.
The consumption of the clock tree cannot be estimated or taken into account after wire-load synthesis. To evaluate it, you need to do the full place and route on the die. In synchronous circuits, the clock tree can account for a significant share of the consumption of the entire block during operation. But when comparing different implementations of a block with approximately the same flip-flop area, the consumption of the clock tree can also be considered approximately the same.

What we need to measure consumption:

A standard cell library for the target technology (130/90/65 nm; provided by the manufacturer under NDA)
A program that synthesizes a netlist from Verilog in the basis of the selected cell library
A program that estimates consumption
A program that simulates our IP block and logs its switching activity in the operating mode (we want an accurate estimate of consumption in the operating mode, not a statistical estimate of the block consumption)

In particular, I used Synopsys DC (Design Compiler) for synthesis and consumption calculation, and ModelSim to simulate the block and log how many times the signals in the circuit switch. Similar data and results can be obtained with tools from other vendors.

To calculate consumption, you need to know which elements of the synthesized IP block switched from 1 to 0 and from 0 to 1, and how many times (in a digital circuit, elements can only be in these two states). You could, of course, skip collecting exact switching data and derive it from statistical assumptions (for example, assume that a signal switches only 10% of the total time), but then the consumption estimate will also be statistical. We need accurate estimates for several implementations in order to compare them. Therefore, we will simulate the operation of the IP block with a testbench and log all switching activity in the block under test.
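Why the exact toggle count matters can be seen from the usual first-order relation for net switching power, P ≈ 0.5 · C · V² · (toggles per second). The sketch below is only an illustration: the load capacitance, supply voltage and toggle counts are hypothetical values chosen for the example, and the “10% activity” guess is modeled here as one toggle in 10% of the clock cycles.

```python
import math


def switching_power_w(cap_farads, vdd_volts, toggle_count, duration_s):
    """Average net switching power: each transition dissipates about 0.5*C*V^2."""
    toggle_rate_hz = toggle_count / duration_s
    return 0.5 * cap_farads * vdd_volts ** 2 * toggle_rate_hz


# Hypothetical values, for illustration only (not taken from any library).
LOAD_CAP_F = 10e-15      # 10 fF load on one net
VDD_V = 1.0              # supply voltage
WINDOW_S = 200e-6        # observation window
CLOCK_HZ = 26e6          # clock frequency of the block

# Statistical guess: the net toggles in 10% of the clock cycles.
guessed_toggles = 0.10 * CLOCK_HZ * WINDOW_S
# Count logged by a simulation (hypothetical number).
measured_toggles = 1324

p_guess = switching_power_w(LOAD_CAP_F, VDD_V, guessed_toggles, WINDOW_S)
p_measured = switching_power_w(LOAD_CAP_F, VDD_V, measured_toggles, WINDOW_S)

print(f"guessed : {guessed_toggles:.0f} toggles -> {p_guess * 1e9:.2f} nW")
print(f"measured: {measured_toggles} toggles -> {p_measured * 1e9:.2f} nW")
```

With these made-up numbers the two estimates for a single net differ by more than a factor of two, which is why the comparison of implementations relies on logged switching data rather than on a default activity factor.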

The evaluation process in stages will be shown using an example.

As an example, we will use the source code of a digital unit that processes data from the output of an ADC (analog-to-digital converter). Its task is digital signal processing (DSP) that implements digital down-conversion (DDC) for further processing. Using this example, I will go step by step through the procedure that yields the consumption figure for any IP block written in Verilog / VHDL. * Some names in the example are changed because the source code of this digital block cannot be published as is.

To automate testing of different implementations of the IP block, I wrote scripts that run sequentially in 3 steps:

Synthesis of the IP block for the target technology (in our case TSMC 90nm, library: typical)
Simulation of the synthesized description (netlist)
Calculation of consumption from the switching data collected during the simulation.

And now the measurement process step by step, with the result obtained and comments

The commands for each stage can be put into a script, as I did to automate the process, or entered one at a time.

Script run_srs for the first stage (Design Compiler):

saif_map -start

read_verilog ~/srs/ddc_notch.v
read_verilog ~/srs/ddc_qnt3b.v
read_verilog ~/srs/ddc_intp.v
read_verilog ~/srs/ddc_qsr0.v
read_verilog ~/srs/ddc_qsr1.v
read_verilog ~/srs/ddc_lpf0.v
read_verilog ~/srs/ddc_qsr_lpf1.v
read_verilog ~/srs/ddc_reg.v
read_verilog ~/srs/ddc_top.v
current_design ddc_top

create_clock clk_ddc -period 20
set_clock_uncertainty 0.15 [all_clocks]

compile -gate_clock

change_names -rules verilog -hierarchy
write -format verilog -hierarchy -output ddc_top.v
write -format ddc -hierarchy -output ddc_top.ddc

report_area -hierarchy > ./log/area_ddc.rpt
exit

Enable tracking of name changes, so that signal names from the simulation can later be matched to the synthesized design when calculating consumption.

saif_map -start

Compile the source

 read_verilog ~/srs/ddc_notch.v
 read_verilog ~/srs/ddc_qnt3b.v
 read_verilog ~/srs/ddc_intp.v
 read_verilog ~/srs/ddc_qsr0.v
 read_verilog ~/srs/ddc_qsr1.v
 read_verilog ~/srs/ddc_lpf0.v
 read_verilog ~/srs/ddc_qsr_lpf1.v
 read_verilog ~/srs/ddc_reg.v
 read_verilog ~/srs/ddc_top.v
 current_design ddc_top

Here we only need to specify constraints for the clock, although for complex blocks quite a lot of constraints are needed for the synthesis result to meet expectations.

 create_clock clk_ddc -period 38.46
 set_clock_uncertainty 0.15 [all_clocks]

We compile with the -gate_clock option, which automatically inserts clock-gating circuits during synthesis (they switch off the clock of a flip-flop (FF, register) while it is inactive, provided certain rules for describing registers in Verilog are followed). This is the most significant method of reducing the active consumption of a circuit in an ASIC.

 compile -gate_clock

We write out the synthesis results for the second and third stages

 change_names -rules verilog -hierarchy
 write -format verilog -hierarchy -output ddc_top.v
 write -format ddc -hierarchy -output ddc_top.ddc

ddc_top.ddc is the design database that is read back in the third stage; ddc_top.v is the Verilog netlist.

Script vsim.do for the second stage (simulating and logging switching in Modelsim)

vlib work
vmap work work
vlog -work work ~/work/tsmc090.v
vlog -work work ~/work/ddc_top.v
vlog -work work ~/work/tb.v
vsim +notimingchecks -novopt work.tb +nowarn3017 +nowarn3722

run 90us
power add -r tb/ddc_top/*
run 200us
power report -all -bsaif saif.saif
exit

Create a library in which we will compile and simulate

 vlib work
 vmap work work

We compile the simulation library of TSMC 90nm cells (tsmc090.v), the netlist synthesized in the first stage (ddc_top.v) and the testbench (tb.v)

 vlog -work work ~/work/tsmc090.v
 vlog -work work ~/work/ddc_top.v
 vlog -work work ~/work/tb.v

Verilog code for testbench (tb.v)

 `timescale 1ps/1ps

 module tb;

 reg clk_ddc;
 reg rstz_ddc;
 reg in_valid;
 reg [2:0] in_i;
 reg [2:0] in_q;
 reg [2:0] i_temp;
 reg [2:0] q_temp;

 // Random 3-bit values for the I and Q channels
 always @(posedge clk_ddc)
   if (~rstz_ddc) i_temp <= 'd0;
   else if (in_valid) i_temp <= $random % 8;

 always @(posedge clk_ddc)
   if (~rstz_ddc) q_temp <= 'd0;
   else if (in_valid) q_temp <= $random % 8;

 // Map the random values to input codes (code 000 is never generated)
 always @*
   case (i_temp)
     3'd0: in_i = 3'b001;
     3'd1: in_i = 3'b001;
     3'd2: in_i = 3'b010;
     3'd3: in_i = 3'b011;
     3'd4: in_i = 3'b100;
     3'd5: in_i = 3'b101;
     3'd6: in_i = 3'b110;
     default: in_i = 3'b111;
   endcase

 always @*
   case (q_temp)
     3'd0: in_q = 3'b001;
     3'd1: in_q = 3'b001;
     3'd2: in_q = 3'b010;
     3'd3: in_q = 3'b011;
     3'd4: in_q = 3'b100;
     3'd5: in_q = 3'b101;
     3'd6: in_q = 3'b110;
     default: in_q = 3'b111;
   endcase

 // Clock: half-period 19230 ps
 initial clk_ddc = 'd0;
 always #19230 clk_ddc = ~clk_ddc; // 26 MHz

 // Input data is valid on every cycle after reset is released
 always @(posedge clk_ddc)
   if (~rstz_ddc) in_valid <= 'd0;
   else in_valid <= 'd1;

 // Reset sequence, then let the block run
 initial begin
   rstz_ddc = 'd0;
   #80000;
   @(posedge clk_ddc);
   rstz_ddc = 'd1;
   @(posedge clk_ddc);
   @(posedge clk_ddc);
   #50000000; $display ($time);
   #50000000; $display ($time);
   #50000000; $display ($time);
   #400000000;
 end

 ddc_top ddc_top (
   .clk_ddc       ( clk_ddc       ),
   .rstz_ddc      ( rstz_ddc      ),
   // APB
   .clk_apb       ( 1'd0          ),
   .reg_adr       ( 10'd0         ),
   .reg_we        ( 1'd0          ),
   .reg_wd        ( 32'd0         ),
   .reg_rd        ( reg_rd        ),
   .ddc_qi        ( {in_q,in_i}   ),
   .ddc_in_valid  ( in_valid      ),
   .ddc_out_i     (               ),
   .ddc_out_q     (               ),
   .ddc_out_valid (               )
 );

 endmodule

Note that we feed randomly generated input data rather than data from a model of the real input signal. For our case, measuring the consumption of a digital circuit whose input signal is “under the noise”, the two are equivalent. In effect, the actual input signal is “white noise” (approximated here by the uniform distribution of the random generator). Strictly speaking, it is not perfectly “white” because of the limited signal bandwidth and the influence of the analog input amplifiers, but this generally does not affect the simulation result or the measured consumption.
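How much input activity this stimulus produces can be worked out exactly. The sketch below assumes that consecutive samples are independent and uniformly distributed over the eight values of i_temp (an idealization of $random % 8), and uses the code table copied from the testbench. It only illustrates the stimulus and is not part of the measurement flow; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

# Code table copied from the testbench case statement: index = i_temp, value = in_i.
CODE_TABLE = [0b001, 0b001, 0b010, 0b011, 0b100, 0b101, 0b110, 0b111]


def bit_toggle_probabilities(table, width=3):
    """Exact per-bit toggle probability between two independent uniform samples."""
    pairs = list(product(table, repeat=2))  # all (previous, next) combinations
    probabilities = []
    for bit in range(width):
        toggles = sum(((prev ^ nxt) >> bit) & 1 for prev, nxt in pairs)
        probabilities.append(Fraction(toggles, len(pairs)))
    return probabilities


toggle_probs = bit_toggle_probabilities(CODE_TABLE)
for bit, prob in enumerate(toggle_probs):
    print(f"in_i[{bit}] toggles on {float(prob):.3f} of the input samples")
```

Under these assumptions every input bit changes on roughly half of the samples, so the block is exercised with noise-like, high-activity data.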

We run the simulation without optimization, for accurate logging, and without the timing checks from the TSMC library (and also suppress certain library warnings).

 vsim +notimingchecks -novopt work.tb +nowarn3017 +nowarn3722

Skip the initialization phase so that it does not affect the comparison.

 run 90us

And we start logging all signals in our block.

 power add -r tb/ddc_top/*

We collect data for 200 µs in the operating mode

 run 200us

And write the collected data in the Switching Activity Interchange Format (SAIF) to a file for use in the third stage.

 power report -all -bsaif saif.saif

An excerpt from the SAIF file

 (INSTANCE dff_20_reg_6_
	 NET
		 (flag (T0 0) (T1 200000000) (TX 0) (TC 0) (IG 0))
		 (n0 (T0 100534440) (T1 99465560) (TX 0) (TC 1324) (IG 0))
		 (clk (T0 150002000) (T1 49998000) (TX 0) (TC 5200) (IG 0))
		 (xRN (T0 0) (T1 200000000) (TX 0) (TC 0) (IG 0))
		 (xSN (T0 0) (T1 200000000) (TX 0) (TC 0) (IG 0))

It shows how many times each internal netlist signal switched during the test (TC, the toggle count), and how long the signal stayed at 0 (T0) and at 1 (T1).
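The counters in such records are easy to turn into per-net activity figures. The sketch below parses three of the lines from the excerpt and derives the toggle rate and the fraction of time spent at 1. It assumes that the SAIF time unit is 1 ps (the testbench timescale; a real file states this in its header), and the regular expression handles only the flat record layout shown here, not the full SAIF grammar. The function names are illustrative.

```python
import math
import re

# Three net records copied from the excerpt above.
SAIF_EXCERPT = """
(flag (T0 0) (T1 200000000) (TX 0) (TC 0) (IG 0))
(n0 (T0 100534440) (T1 99465560) (TX 0) (TC 1324) (IG 0))
(clk (T0 150002000) (T1 49998000) (TX 0) (TC 5200) (IG 0))
"""

NET_RE = re.compile(
    r"\((\S+)\s+\(T0 (\d+)\)\s+\(T1 (\d+)\)\s+\(TX (\d+)\)\s+\(TC (\d+)\)\s+\(IG (\d+)\)\)"
)


def parse_nets(text):
    """Return {net_name: {"T0": ..., "T1": ..., "TX": ..., "TC": ..., "IG": ...}}."""
    nets = {}
    for name, t0, t1, tx, tc, ig in NET_RE.findall(text):
        nets[name] = {"T0": int(t0), "T1": int(t1), "TX": int(tx),
                      "TC": int(tc), "IG": int(ig)}
    return nets


def activity(net, timescale_s=1e-12):
    """Toggle rate (toggles per second) and fraction of time at logic 1."""
    total_units = net["T0"] + net["T1"] + net["TX"]
    duration_s = total_units * timescale_s  # assumes 1 ps time units by default
    return {
        "duration_s": duration_s,
        "toggle_rate_hz": net["TC"] / duration_s,
        "time_at_1": net["T1"] / total_units,
    }


nets = parse_nets(SAIF_EXCERPT)
for name, net in nets.items():
    act = activity(net)
    print(f"{name}: {act['toggle_rate_hz'] / 1e6:.2f} Mtoggles/s, "
          f"at 1 for {act['time_at_1'] * 100:.1f}% of the time")
```

For the clk pin of this flip-flop the result is about 26 million toggles per second, whereas a free-running 26 MHz clock toggles twice per cycle, that is 52 million times per second. Under the stated timescale assumption, this register therefore received clock pulses in only about half of the cycles, which is consistent with the clock gating inserted during synthesis.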

Script run for the third stage (calculating block consumption in Design Compiler from the SAIF data)

read_ddc ./ddc_top.ddc
current_design ddc_top
read_saif -input ./saif.saif -instance tb/ddc_top
report_power -hierarchy -levels 1 -analysis_effort high > ./log/power_ddc.rpt
report_saif

Read the previously saved database for the synthesized netlist and connect the previously obtained saif

 read_ddc ./ddc_top.ddc
 current_design ddc_top
 read_saif -input ./saif.saif -instance tb/ddc_top

Run the measurement of power consumption and write to the file

 report_power -hierarchy -levels 1 -analysis_effort high > ./log/power_ddc.rpt

We check that all internal signals were matched correctly and taken into account in the consumption analysis, and that no signals are left without switching statistics from the SAIF file (this happens for about 50% of the signals if you simulate the original Verilog rather than the netlist synthesized for the target library).

 report_saif

The report of this command for our example

 --------------------------------------------------------------------------------
               User               Default            Propagated
 Object type   Annotated (%)      Activity (%)       Activity (%)       Total
 --------------------------------------------------------------------------------

  Nets         2029 (100.00%)     0 (0.00%)          0 (0.00%)          2029
  Ports        655 (100.00%)      0 (0.00%)          0 (0.00%)          655
  Pins         7492 (100.00%)     0 (0.00%)          0 (0.00%)          7492
 --------------------------------------------------------------------------------

And now you can run all 3 scripts in succession with one command and see what happened for our example.

 dc_shell -f ./run_srs ; vsim -do ./vsim.do ; dc_shell -f ./run

Here is a final report

 Library(s) Used:
     typical (File: ~/lib/lib90nm/typical.db)

 Power-specific unit information :
     Voltage Units = 1V
     Capacitance Units = 1.000000pf
     Time Units = 1ns
     Dynamic Power Units = 1mW    (derived from V,C,T units)
     Leakage Power Units = 1pW

 --------------------------------------------------------------------------------
                                 Switch    Int       Leak      Total
 Hierarchy                       Power     Power     Power     Power     %
 --------------------------------------------------------------------------------
 ddc_top                         6.60e-02  0.315     1.10e+08  0.491     100.0
   r313 (ddc_top_DW01_inc_0)     0.000     0.000     1.40e+05  1.40e-04  0.0
   ddc_reg (ddc_reg)             0.000     1.27e-03  3.78e+06  5.06e-03  1.0
   ddc_intp (ddc_intp)           8.41e-03  3.34e-02  5.75e+06  4.76e-02  9.7
   ddc_qnt3b (ddc_qnt3b)         2.59e-03  9.17e-03  2.20e+06  1.40e-02  2.8
   ddc_qsr_lpf1 (ddc_qsr_lpf1)   2.44e-02  0.110     1.54e+07  0.150     30.4
   ddc_notch3 (ddc_notch_1)      1.75e-03  8.81e-03  1.05e+07  2.11e-02  4.3
   ddc_notch2 (ddc_notch_2)      1.69e-03  8.81e-03  1.05e+07  2.10e-02  4.3
   ddc_notch1 (ddc_notch_3)      1.69e-03  8.81e-03  1.05e+07  2.10e-02  4.3
   ddc_notch0 (ddc_notch_0)      2.16e-03  9.98e-03  1.05e+07  2.27e-02  4.6
   ddc_qsr1 (ddc_qsr1)           1.46e-03  2.98e-03  4.32e+05  4.86e-03  1.0
   ddc_lpf0_q (ddc_lpf0_1)       9.79e-03  5.75e-02  3.80e+06  7.11e-02  14.5
   ddc_lpf0_i (ddc_lpf0_0)       1.06e-02  6.05e-02  3.90e+06  7.50e-02  15.3
   ddc_qsr0 (ddc_qsr0)           1.33e-03  2.68e-03  2.34e+05  4.24e-03  0.9

Total consumption consists of three components:

Cell Leakage power - power due to leakage current. The static component, which depends on the manufacturing technology (90/65/28 nm) and the operating conditions of the circuit (temperature / voltage). The calculated value is proportional to the block area.
Cell Internal power - power dissipated inside a library component (cell; for example, the logical “or” ORX2 or a DFF flip-flop) when the state of its inputs / outputs changes. Dynamic component.
Net Switching power - power spent on recharging the output load capacitances of a component when it switches. Dynamic component.

In our example, at a clock frequency of 26 MHz, our digital processing unit consumes 491 µW (0.491 mW in the report) in active mode for 90nm.
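The Total Power column can be cross-checked from the other three columns once the leakage figure is converted from pW to mW, the unit of the dynamic components. A minimal sketch, with the ddc_top and ddc_qsr_lpf1 rows copied from the report above; the function and constant names are illustrative.

```python
import math

PW_TO_MW = 1e-9  # the report gives leakage in pW and dynamic power in mW


def total_power_mw(switch_mw, internal_mw, leak_pw):
    """Sum the three report columns in a common unit (mW)."""
    return switch_mw + internal_mw + leak_pw * PW_TO_MW


# Rows copied from the report: (switch, internal, leakage).
top_total = total_power_mw(6.60e-02, 0.315, 1.10e+08)
lpf1_total = total_power_mw(2.44e-02, 0.110, 1.54e+07)

print(f"ddc_top      total: {top_total:.3f} mW")
print(f"ddc_qsr_lpf1 total: {lpf1_total:.3f} mW")
print(f"leakage share of ddc_top: {1.10e+08 * PW_TO_MW / top_total * 100:.1f}%")
```

For ddc_top the three components add up to the reported total, and the leakage term alone accounts for roughly a fifth of it, so the static component is far from negligible for this block at 26 MHz.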

In a follow-up article, one could analyze the consumption of the same IP block implemented in an FPGA. For this, each FPGA manufacturer (Altera, Xilinx, Microsemi) provides specialized programs within its CAD (computer-aided design) system. In particular, Altera's tool is called PowerPlay Power Analysis, and it automates the process described above for the company's FPGAs. It is well known, however, that the ability to “program” any IP block into an FPGA has an obvious downside in the form of much higher consumption. The difference can reach tens of times when comparing the implementation of the same IP block in a 90nm ASIC and in a modern FPGA made in 28nm technology.


Source: https://habr.com/ru/post/264389/
