Static power analysis for power-driven synthesis

S.-Y. Yuan
K.-H. Chen
J.-Y. Jou
S.-Y. Kuo

Indexing terms: Probabilistic method, Simulation-based method, Symbolic simulator

Abstract: A new static power analysis method for CMOS combinational circuits is presented. This approach integrates the simulation-based method and the probabilistic method, and can establish the relationships between the primary inputs and the internal nodes in the circuit. Based on the relationships, our approach can also indicate which internal node or input sequence consumes the most power. It is thus suitable for performing power estimation in the synthesis environment for power optimisation. To the best of our knowledge, this is the first attempt to develop a systematic way to symbolically represent the relationships between the primary inputs and the power consumption at every internal node of a circuit. Furthermore, by using the existing piecewise linear delay model, as well as the proposed algorithm, this novel method is also very accurate and efficient. For a set of benchmark circuits, the experimental results show that the power estimated by our technique is within 5% error as compared with that by the exact SPICE simulation, while the execution speed is more than four orders of magnitude faster.

1 Introduction

Power dissipation has emerged as an important design parameter in the design of microelectronic circuits, especially in portable computing and personal communication applications. More generally, as the density and the size of chips and systems continue to increase, the problem of power consumption becomes a critical concern in VLSI design [1]. Low power design techniques are becoming increasingly important in today's integrated circuit designs [2]. Therefore, a fast and accurate power estimator is necessary for a low power circuit/system designer.

During the synthesis for low power at higher level of abstractions such as the register transfer level, it is almost impossible to have a very accurate estimation about the power consumption of each module. However, if the estimation model is able to correctly sort the alternative designs according to the estimated power, we can still have the right designs for low power because there are relatively few alternative designs and each design has very different trade-off among power, area and performance. However, performing synthesis for low power at gate level or below, the power estimation with high accuracy becomes a must, because there could be many alternative designs and it is difficult to have estimation with relatively low accuracy but with good fidelity [3]. Therefore, with a less accurate estimation of power, we may optimise the circuits in the wrong spots such that we cannot lower the power consumption with minimum overheads in terms of area and performance.

There are quite a few approaches proposed [4-6] to reduce the inefficiency of the SPICE while maintaining acceptable estimation errors. The PowerMill approach [4] is a transistor-level power simulator, which uses an event-driven simulation algorithm to increase the speed by two to three orders of magnitude over SPICE. Although the PowerMill is relatively accurate, it is still not suitable for power-driven synthesis. This is because, when power optimisation is performed, the circuits will be modified frequently. Power estimation should be done incrementally to speed up the process. The simulation-based approach cannot be used in this situation. Switch-level simulation techniques are, in general, much faster than circuit-level simulation techniques, but are not as accurate or versatile. Standard simulators, such as IRSIM [6], can be easily modified to report the switched capacitance (thus the dynamic power dissipation) during a simulation run.

However, the approaches mentioned above suffer three major problems as power simulation tools. First, they must simulate the 'chosen patterns' with many iterations to determine the average power consumption of each node, which slows down the simulation speed. Also, the average results strongly depend on the 'chosen patterns', and we may get biased simulation results. Secondly, if the PIs are not fully independent, the choices of patterns need much more attention and the number of patterns needed is large. Thirdly, due to the nature of simulation-based simulators, they cannot provide enough power information such as the percentage of glitch power of each node's power consumption, the difference between generation of glitches and passing of glitches, the source node of a glitch, or the cause of the glitch. Thus, these approaches can leave the user or synthesiser in a vague situation to improve or resynthesise a circuit.

© IEE, 1998

IEE Proceedings online no. 19981909

Paper received 5th January 1998

S.-Y. Yuan, K.-H. Chen and S.-Y. Kuo are with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, Republic of China

J.-Y. Jou is with the Department of Electronic Engineering, National Chiao Tung University, Hsinchu, Taiwan, Republic of China

In this paper, we will develop a new method which retains the advantages of both the simulation-based methods and the probabilistic methods. This power analyser can estimate the total power consumption due to the circuit itself as well as the power consumption due to the transition, and the spike at every internal node in the circuit efficiently and accurately. Furthermore, this new approach can not only establish the relationships between the primary inputs and the internal nodes in the circuit but also increase the efficiency of the simulation-based method. Ghosh proposed a symbolic simulator based on the binary decision diagram (BDD) [7]. The major difference between our method and the method in [7] is that we use the idea of cube representation to simplify the process of simulation. By using the idea of a cube, we can get much more information from the simulation than the method in [7]. To the best of our knowledge, our approach is the first attempt to develop a systematic way to symbolically represent the relationships between the primary inputs and the power consumption at every internal node of a circuit. The simulation results show that our new method provides much more information on power consumption than other methods. The approach can thus be integrated into a synthesis environment to determine where it can be improved or resynthesised for low power. Thus, this method is very suitable for performing power estimation in the synthesis environment for power optimisation.

2 Power dissipation model and definitions

2.1 Power dissipation model
It is well known that the dynamic power estimation formula is \( P = 1/2 \alpha CV^2 f \) where \( P \) is the average power, \( \alpha \) is the switching activity, \( V \) is the supply voltage, \( f \) is the frequency, and \( C \) is the load capacitance of the gate [8, 9]. We use an ideal gate (e.g. AND, OR, etc.) and equivalent input and output capacitances to model a real gate. Using [1], we can estimate \( C \).

\[
P = \frac{1}{2} \alpha CV^2 f
\]

\[
= \frac{1}{2} \times \frac{\text{actual-node-transition-number}}{\text{max-node-transition-number}} \times CV^2 \times \frac{1}{P_{\text{gate}}}
\]

\[
\Rightarrow C = \frac{2 \times p \times \text{simulation-time}}{V^2 \times \text{actual-node-transition-number}} \quad (1)
\]

We also established the database of multiple-SPICE simulation-based thresholds [10] to approach the piecewise linear delay model, which will further improve the precision and efficiency of the simulation. The simulation result was very accurate and efficient in [10]. The error percentage is less than 5% as compared with the HSPICE simulation, while the execution speed is more than three orders of magnitude faster. For some circuits, the speedups are even more than four orders of magnitude larger. However, a large number of different input sequences is required for [10], and this simulation-based method was still very time consuming. Furthermore, the information on the average power consumption is not enough for performing power optimisation in the synthesis environment. In other words, with only these average power values derived from the simulation-based method, we cannot efficiently figure out where the most power is consumed and why. Therefore, it is necessary to combine a simulation-based method with a probabilistic method. In the following discussions, we will develop a new method based on the static analysis approach. This method cannot only construct the relationships between the primary inputs and the internal nodes in the circuit but also increase the efficiency of the simulation-based method.

2.2 Node probabilities
The signal probability \( p(X) \) of a node \( X \) is defined as the probability that node \( X \) has a value of logic 1. Let us now define three special probabilities \( P_1 \), \( P_0 \), and \( P_S \). Assume that node \( X \) is the output of a gate \( g \). Thus, the switching probability of \( X \), \( P_s(X) \), is equal to \( 2 \times p(X) \times (1 - p(X)) \) and is defined as the probability that node \( X \) will switch from low to high or high to low if any input(s) of gate \( g \) changes. The holding-one probability of \( X \), \( P_h(X) \), is equal to \( p(X) \) and is defined as the probability that node \( X \) will hold in high (one) if any input(s) of gate \( g \) changes. The holding-zero probability of \( X \), \( P_0(X) \), is equal to \( (1 - p(X)) \) and is defined as the probability that node \( X \) will hold in low (zero) if any input(s) of gate \( g \) changes.

We define a cube with \( n \) elements as: \( \langle PK(P_{I1})(P_{I2}), PK(P_{I1})(P_{I2}), ..., PK(P_{I1})(P_{I2}) \rangle \), given a circuit with \( n \) primary inputs \( (P_{Ij}, j = 1, ..., n) \), \( m \) primary outputs \( (P_{O_k}, k = 1, ..., m) \), and many internal nodes. The probability of this cube is: \( \Pi_{j=1}^{n} PK(P_{Ij})(P_{Ij}) \), where \( K(P_{Ij}) \) can be 1, 0, or \( S \). Here we assume that the primary inputs are uncorrelated for the sake of easy explanation.

![Fig. 1 Example circuit](image)

For example, in Fig. 1, \( a, b, c \) are \( P_{I} \)s, \( f \) is \( P_{O} \), and \( d \), \( e \) are internal nodes.

Assume the AND/OR gate has a delay time of 1.5 units, the NOR gate has a delay time of 1.0 unit, and the inertial of each gate is 0.3 units. Given the signal probabilities of the primary inputs \( a, b \), and \( c \), we can determine \( P_1(a), P_0(a), P_h(a), ..., P_s(c) \). By applying the symbolic simulation approach, we can further calculate the \( P_1(a), P_0(d), P_h(d), ..., P_s(f) \). For example, let \( C_1(d) = \langle P_1(a), P_1(b), 1 \rangle \) for internal node \( d \), the cube \( C_1(d) \) means that to have internal node \( d \) in the holding-one state, the primary input node \( a \) and node \( b \) must be both in the holding-one state and the node \( c \) can be in any of the three states. The probability \( P_1(d) \) is equal to \( P_1(a) \times P_1(b) \times 1 \).

2.3 Definitions of symbols and cubes
Since each element of the cube represents the corresponding input, we do not have to write the \( X \) explicitly. Therefore, we introduce simple notations to be used in the cube as follows:

(i) \( 1 \): the probability of any particular \( P_I \) in the holding-one state.
(ii) \( 0 \): the probability of any particular \( P_I \) in the holding-zero state.
(iii) \( s \) or \( b \): the probability of any particular \( P_I \) in the switching state. All the \( P_I \)s with the same switching direction (i.e. low to high (high to low)), are represented as \( s \) (b), while all other \( P_I \)s with opposite switching direction are represented as \( b \) (s).
(iv) \(S(i)\) or \(B(i)\): the probability of any particular \(PI\) in the switching state \((i)\) is just a sequence index. However, the effect of this switching is blocked by some gates between the \(PIs\) and the node under consideration, and hence no transition is generated by the switching of \(PIs\) at this node. Thus, this switching state is less restrictive than \(s\) or \(b\). We call this a don't-care switching state.

(v) \(-\): this input is not related to the cube. We call this a don't-care state. The probability is one.

For the determination of the effect of a spike, we extend the cube by adding two fields, the beginning time and the lasting time, to describe the timing information. The beginning time represents the starting time of the action and the lasting time represents how long such an action will last. We will simply call the holding-one cube as the 1-cube (represented as \(C_1(x)\)), the holding-zero cube as the 0-cube (represented as \(C_0(x)\)), the switching cube as the \(S\)-cube (represented as \(C_s(x)\)), and the spike cube as the \(G\)-cube (represented as \(C_g(x)\)). The 1-cube and the 0-cube specify the conditions and probabilities for any particular node to hold in the 1 state and the 0 state respectively. Thus, their beginning times are always set to 0.0 and their lasting times are always set to \(\infty\). For the \(S\)-cube, the beginning time is set to the time when the switching starts and the lasting time is set to \(\infty\). The \(G\)-cube's beginning time is the beginning time of the first calculated switching, and its lasting time is the spike duration time. For example, assume that there is a node \(y\) whose \(C_s(y) = <1.35, \infty, s, b, 1, S(1), -, B(1), S(2)>\). The circuit has seven \(PIs\) (\(PI_1 \ldots PI_7\)). The \(C_g(y)\) means that one of the cases to make node \(y\) switch is to let \(PI_1\) in the switching state, \(PI_2\) in the switching state with a switching direction opposite to the switching direction of \(PI_1\), \(PI_3\) in the holding-one state, \(PI_4\) in the don't-care switching state, \(PI_5\) in the don't-care state, \(PI_6\) in the don't-care switching state with opposite switching direction to \(PI_6\), and \(PI_7\) also in the don't-care switching state not related with any other \(PIs\). The switching will start at time \(= 1.35\) units.

The symbols 0, 1, and \(s\) are similar to \(P_{b0}^0, P_{b1}^1, P_{b}^0, P_{b1}^1\), and \(P_{b}^1\) proposed in [7]. However, Ghosh in [7] used the BDD representation to simulate the circuit. Instead of BDD, we use the cube-based operation in our simulator. It cannot only easily simulate the circuit but can also provide us with much more information for synthesising low-power circuits. This information includes which internal node or which input sequence consumes the most power. Therefore, our new proposed method is more suitable to be used in the synthesis environment.

### 2.4 Definition of cube sets

We call the union of the same type of cubes a 'cube set'. Every node in a circuit has four cube sets (i.e. 1, 0, \(S\), and \(G\) cube sets). For a node \(X\), the four sets are represented as \(\{C_0(X)\}, \{C_1(X)\}, \{C_s(X)\}, \{C_g(X)\}\). For example, \(\{C_g(d)\} = \{<1.5, \infty, 1, s, \rightarrow, <1.5, \infty, s, 1, \rightarrow, <1.5, \infty, s, \rightarrow, <1.5, \infty, s, \rightarrow, <1.5, \infty, s, \rightarrow>\}\) and \(\{C_s(f)\} = \{<2.5, 0.5, s, s, s, s, s, s, \rightarrow\}\) The \(\{C_g(d)\}\) has three \(S\)-cubes in it. Each represents a different \(PI\) combination to make node \(d\) switch. The example \(G\)-cube in the \(\{C_g(f)\}\) means that the node \(f\) will have a spike after a delay of 2.5 time units and the spike duration is 0.5 time unit, if all the \(PIs\) switch in the same direction.

### 3 Operators and algorithm

We will, in this Section, define operations at the cube level, the cube set level, and the logic node level, respectively. All higher level operations are built upon the lower level operations.

#### 3.1 Cube level operators

The proposed operators for cubes are intersection \((\cap)\), e.g. \(A \cap B\), bar\(-\), e.g. \(\neg A\), and don't-care \((dc\), e.g. \(dc(A)\). Given an \(OR\) gate \(g\) with a gate delay of 1.3 time units and inertial delay of 0.3 time unit, its fanout is node \(z\) and its fanins are node \(x\) and node \(y\). The nodes \(x, y, z\) are all internal nodes. Assume a 1-cube in \(\{C_1(x)\}\) is \(<0.0, \infty, 1, -, S(1), B(1), s, S(2)>\) and an \(S\)-cube in \(\{C_s(y)\}\) is \(<1.4, \infty, 1, S(2), S(2), s, \rightarrow\). The main function of the \((\cap)\) operator is to find the \(PI\) state which is compatible with the \(PI\)-part of both cubes. The procedure is explained as follows and is illustrated in Fig. 2.

\[
\begin{align*}
&\cap <0.0, \infty, 1, -, S(1), B(1), s > \text{-} 1\text{-cube of node } x \\
&\cap <1.4, \infty, 1, S(2), S(2), s > \text{-} S\text{-cube of node } y \\
&\cap <2.7, \infty, 1, b, b, s, s >> 1\text{-cube of node } z
\end{align*}
\]

(i) \(PI_1\): \(s \cap - \Rightarrow s\).

(ii) \(PI_2\): \(B(1) \cap s \Rightarrow s\). Since \(s\) is more restricted than \(B(1)\), the result is \(s\).

(iii) \(PI_3\): \(S(1) \cap S(2) = b \cup S(2) \Rightarrow b\). Since \(B(1)\) in \(PI_4\) position of the first cube is changed to \(s\), the \(S(1)\) in \(PI_3\) position of the first cube has to be changed to \(b\), which is opposite to \(s\), and \(S(2)\) in \(PI_3\) position of the second cube is also changed to \(b\).

(iv) \(PI_5\): \(\cap S(2) = - \cap b \Rightarrow b\), since the \(S(2)\) in \(PI_3\) position of the second cube has been changed to \(b\).

(v) \(PI_1\): \(1 \cap 1 \Rightarrow 1\).

We can see that the \((\cap)\) operator is not a straightforward element-wise AND operation on cubes. Because the result of intersection is in the 1-cube set of node \(z\), the cube cannot contain \(s\) (or \(b\)) and the beginning time must be reset to 0.0. So we apply the \(dc\) operator on the intersection result, such that \(dc(\{<2.7, \infty, 1, b, b, s, s, \Rightarrow\} \cap <0.0, 00, 1, B(1), B(1), S(1), S(1)>\). Another operator is the bar \((\neg)\) cube. It inverts all switching/don't-care switching elements in a cube only. For example, given the delay time \(d\) of an inverter, \(<0.5, \infty, 1, s, \rightarrow, B(2), B(1), S(1)> \Rightarrow <0.5 + d, \infty, 1, b, -S(2), s, B(1)>\).

#### 3.2 Spike determination

A spike is generated by two different signals going through the same gate with different arrival times and opposite transition directions. In Fig. 3, we show a...
spike generated at node \( f \). Here we assume the parameters of Fig. 3 are the same as Fig. 1.

The \( G \)-cube is determined as:

(i) beginning time \( = \) the smaller of the beginning times of the two cubes + the delay of this gate,

(ii) lasting time \( = \) absolute value of the difference of the two cubes' beginning times,

(iii) PI-part: the intersection of the two cubes' PI-part with one cube inverted.

For example, the \( C_\delta(f) \) can be obtained by: \( C_\delta(d) \cap C_\delta(e) \Rightarrow \{1,0, \infty, s, s\} \cap \{1,0, \infty, s, s\} \Rightarrow \{1, s\} \).

If the difference between the two cubes' beginning times is less than \( i \), the result is an empty cube; otherwise, the way to derive the new cube is:

\[
\text{timing fields: beginning time + delay of this gate of the two cubes;}
\]

\[
\text{PI-part: Intersection of the PI-parts of the two cubes;}
\]

3.3.4 \( \cap(d, i, 1) \): If both lasting times are less than \( i \), the result is an empty cube; otherwise, the way to derive the new cube is:

\[
\text{timing fields: lasting time of the two cubes;}
\]

\[
\text{PI-part: Intersection of the PI-parts of the two cubes;}
\]

3.3.5 \( \cap(d, i, 2) \): If the difference between the two cubes' beginning times is less than \( i \), the result is an empty cube; otherwise, the way to derive the new cube is:

\[
\text{timing fields: difference of the two cubes' beginning times;}
\]

\[
\text{PI-part: Intersection of the PI-parts of the two cubes;}
\]

3.4 Formal definitions of cube-set level operators

Let \( C_u \), \( C_v \) and \( C_w \) be three cube sets. Union is the ordinary union operator on sets. For the cube set \( (1, 0, S, G) \), there are six operators.

(i) \(- (d, i)\): \( C_u = C_u \cup C_v \) is union of \( \forall v \in C_v \), \( \text{bar}(d, i) \) (\( v \))

(ii) \( DC \): \( C_u = DC(C_u) \cup \forall v \in C_v \), \( \text{dc}(w) \) \( \forall \)

(iii) \( x(d) \): \( C_u = C_x \cup \forall v \in C_v \), \( \text{dc}(w) \) \( \forall \)

(iv) \( x(d, i) \): \( C_u = C_x \cup \forall \in C_v \), \( \text{dc}(w) \) \( \forall \)

(v) \( \text{and} \): \( C_v = C_v \cup \forall \in C_v \), \( \text{dc}(w) \) \( \forall \)

(vi) \( DC \): \( C_v = C_v \cup \forall \in C_v \), \( \text{dc}(w) \) \( \forall \)

3.5 Operators at the node level

We will develop the operators only for the two input AND, OR, and NOT gates. The operators for all other complex gates can be built based on this foundation.

3.5.1 Operator NOT: Given an inverter with input \( a \) and output \( c \) and with delay \( d \) and inertial \( i \), the corresponding cube set of output \( c \) is calculated as follows:

(i) \( \{C_0(c)\} = \{C_0(a)\}, \{C_0(c)\} = \{C_0(a)\} \)

(ii) \( \{C_0(c)\} = \{C_0(a)\}, \{C_0(c)\} = \{C_0(a)\} \)

3.5.2 operator AND: Given an AND gate with inputs \( a \) and \( b \) and output \( c \) and with delay \( d \) and inertial \( i \), the corresponding cube sets of output \( c \) is calculated as follows:

(i) \( \{C_0(c)\} = DC\{C_0(a)\} \cap \{C_0(b)\} \)

(ii) \( \{C_0(c)\} = DC\{C_0(a)\} \cap \{C_0(b)\} \)

(iii) \( \{C_0(c)\} = \{C_0(a)\} \cap \{C_0(b)\} \)

(iv) \( \{C_0(c)\} = \{C_0(a)\} \cap \{C_0(b)\} \)

3.5.3 Operator OR: Given an OR gate with inputs \( a \) and \( b \) and output \( c \) and with delay \( d \) and inertial \( i \), the corresponding cube sets of output \( c \) is calculated as follows:

(i) \( \{C_0(c)\} = DC\{C_0(a)\} \cap \{C_0(b)\} \)

(ii) \( \{C_0(c)\} = DC\{C_0(a)\} \cap \{C_0(b)\} \)

(iii) \( \{C_0(c)\} = \{C_0(a)\} \cap \{C_0(b)\} \)

(iv) \( \{C_0(c)\} = \{C_0(a)\} \cap \{C_0(b)\} \)
3.6 Partitioning and cube reduction algorithm

To store the cube and cubeset completely, the memory may not be enough. If we can reduce the number of PIS, we can reduce the size of individual cube and cubeset. Also, if we can reduce the number of cubes, we can surely reduce the memory requirement. We solve the problem with two steps: partitioning the circuit into groups, then reducing the number of cubes dynamically.

3.6.1 Partitioning into groups: We first determine the distances between different POs, then divide POs into several groups by using the distance information. Then, the grouped POs, the PIS and the internal nodes that have fanouts to these POs are put into the same partition. Note that we have not lost any information at this step. We just separate the unrelated PIs and internal nodes into different groups to reduce the unnecessary calculations.

\[ d(x, y) = 1, \text{ if there exists a gate } g \text{ such that } x \text{ (or } y) \text{ is } g's \text{ fanin, and the other is } g's \text{ fanout.} \]

\[ d(x, y) = \min_{z \in \text{internal node}} \{d(x, z) + d(z, y)\} \]

3.6.2 Cube reduction algorithm: When a predefined memory usage limitation is reached, the dynamic cube reduction algorithm is executed to average the \( C_1 \) and \( CO \) cubeset of some frontier nodes. Make them as new PIs and have a new simulation starting from these nodes. Since the most important cubesets are \( C_S \) and \( C_G \), we focus on how to reduce these cube sets.

3.7 Algorithm

Given a netlist, we first sort the gates in the netlist topologically. We then apply the corresponding operators defined previously to each gate based on the topological order. The complexity of traverse is proportional to the number of gates. The overall algorithm is shown below.

```
static-power-analysis
{
  Read netlist;
  Read technology file;
  Sort nodes topologically;
  Partition the netlist by predefined distance;
  for (each partition) do
  {
    for (each node in this partition) do
    {
      if (memory limit is reached) do cube-reduction;
      Apply node level operations to this node;
    }
  }
  Report results;
}
```

4 Experimental results

The proposed analyser based on the above power estimation model and delay model has been implemented in C on a SUN SPARCstation 10 with 7000 lines of C codes. The transistor models used are the models of a TSMC 0.8\( \mu \text{m} \) SPDM CMOS technology. Table 2 shows the full timing data of the basic gates derived from the SPICE simulation by applying the method proposed in [10]. We run the SPICE and the proposed analyser on the circuit shown in Fig. 5 to evaluate the quality and efficiency of the method. Table 3 shows the detail simulation results the power consumption.

![Example circuit](image-url)

Table 2: Full specification of timing data

<table>
<thead>
<tr>
<th>Fanin</th>
<th>Equ out ca</th>
<th>Equ inp cb</th>
<th>Delay (ns)</th>
<th>Threshold (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>inv</td>
<td>1</td>
<td>0.5</td>
<td>0.4</td>
<td>0.035</td>
</tr>
<tr>
<td>and</td>
<td>2</td>
<td>1.36</td>
<td>0.28</td>
<td>0.195</td>
</tr>
<tr>
<td>and</td>
<td>3</td>
<td>1.29</td>
<td>0.26</td>
<td>0.195</td>
</tr>
<tr>
<td>nand</td>
<td>2</td>
<td>0.37</td>
<td>0.25</td>
<td>0.13</td>
</tr>
<tr>
<td>nand</td>
<td>3</td>
<td>0.41</td>
<td>0.26</td>
<td>0.13</td>
</tr>
<tr>
<td>or</td>
<td>2</td>
<td>1.45</td>
<td>0.37</td>
<td>0.195</td>
</tr>
<tr>
<td>or</td>
<td>3</td>
<td>1.53</td>
<td>0.37</td>
<td>0.195</td>
</tr>
<tr>
<td>nor</td>
<td>2</td>
<td>0.54</td>
<td>0.37</td>
<td>0.13</td>
</tr>
<tr>
<td>nor</td>
<td>3</td>
<td>0.62</td>
<td>0.36</td>
<td>0.13</td>
</tr>
</tbody>
</table>

\( \ast \) stands for equivalent output capacitance, unit: \( \text{f} \)

\( \ast \) stands for equivalent input capacitance, unit: \( \text{f} \)

Table 3: Simulation results on an example circuit

<table>
<thead>
<tr>
<th>Node</th>
<th>SPICE (mW)</th>
<th>Proposed method (mW)</th>
<th>Error (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>real</td>
<td>spike</td>
<td>total</td>
</tr>
<tr>
<td>d</td>
<td>0.157</td>
<td>0.174</td>
<td>0.174</td>
</tr>
<tr>
<td>e</td>
<td>0.238</td>
<td>0.218</td>
<td>0.044</td>
</tr>
<tr>
<td>f</td>
<td>0.156</td>
<td>0.174</td>
<td>0</td>
</tr>
<tr>
<td>g</td>
<td>0.192</td>
<td>0.168</td>
<td>0.021</td>
</tr>
<tr>
<td>h</td>
<td>0.118</td>
<td>0.106</td>
<td>0</td>
</tr>
<tr>
<td>i</td>
<td>0.282</td>
<td>0.218</td>
<td>0.067</td>
</tr>
<tr>
<td>j</td>
<td>0.105</td>
<td>0.098</td>
<td>0</td>
</tr>
<tr>
<td>k</td>
<td>0.176</td>
<td>0.118</td>
<td>0.069</td>
</tr>
<tr>
<td>o</td>
<td>0.070</td>
<td>0.045</td>
<td>0.023</td>
</tr>
<tr>
<td>total</td>
<td>1.494</td>
<td>1.319</td>
<td>0.234</td>
</tr>
</tbody>
</table>

From the cube sets derived above, we can easily determine the number of real transitions \( E_S(\text{sw}) \) and the number of spikes \( E_G(\text{sw}) \). Therefore, \( E_{\text{sw}}(\text{sw}) = E_S(\text{sw}) + E_G(\text{sw}) \). The simulation results of all the nodes by the SPICE and the proposed method are shown in Table 3. The estimation on the total power consumption by the proposed method has an error of about 4% compared with that by the SPICE simulation.
input pattern is more than three days. Therefore, the

marks include decoders, full adders, and a multiplexer.

terns. It means that the proposed method is more effi-

cient than the SPICE method. From the c432 to c2670,

are taken out. The ‘c*’ benchmarks (c17

techniques from spikes and has incremental
capability. Last but not least, the analyser can identify 
the input transitions which cause the large power con-
sumption so that power optimisation can be appropri-
ately applied to improve the circuit. Therefore, this

analyser is very useful in the synthesis environment for 
low power.

One of the main future pieces of work is to solve 
the spatial and temporal dependency of the primary inputs.
By giving different weights to different cubes of the S-
cube set and the G-cube set, which means the patterns of 
some cubes may appear more frequently in the input 
vector than others, we can calculate the temporal dependency approximately. Using a table lookup method instead of the simple multiplication to calculate the \( E(\text{sw}) \), which means different primary inputs are not independent of each other, we can approximate the spatial dependency among different primary inputs. Another direction of future works is to expand this approach to cover other abstraction levels such as circuit level and functional level by extending the symbols and the definitions of cube and cube operations.

6 Acknowledgment

This research was supported by the National Science Council, Taiwan, ROC under Grant NSC 84-2213-E-005-035.

7 References


Table 4: Simulation results

<table>
<thead>
<tr>
<th>Circuit</th>
<th>Events</th>
<th>Inputs</th>
<th>Outputs</th>
<th>Number of gates</th>
<th>SPICE Power (mW)</th>
<th>Time (s)</th>
<th>Proposed method Power (mW)</th>
<th>Time (s)</th>
<th>Error (%)</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>decoder</td>
<td>64</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>0.5344</td>
<td>106</td>
<td>0.520</td>
<td>0.055</td>
<td>-2.62</td>
<td>1927</td>
</tr>
<tr>
<td>1b-adder</td>
<td>64</td>
<td>3</td>
<td>2</td>
<td>10</td>
<td>1.569</td>
<td>141.5</td>
<td>1.699</td>
<td>0.109</td>
<td>2.54</td>
<td>1298</td>
</tr>
<tr>
<td>1b-mul</td>
<td>256</td>
<td>4</td>
<td>4</td>
<td>14</td>
<td>1.543</td>
<td>884</td>
<td>1.510</td>
<td>0.549</td>
<td>-2.14</td>
<td>1810</td>
</tr>
<tr>
<td>adder2</td>
<td>64</td>
<td>3</td>
<td>2</td>
<td>16</td>
<td>1.457</td>
<td>101</td>
<td>1.483</td>
<td>0.136</td>
<td>1.78</td>
<td>1183</td>
</tr>
<tr>
<td>decoder2</td>
<td>256</td>
<td>4</td>
<td>8</td>
<td>19</td>
<td>1.765</td>
<td>1110</td>
<td>1.821</td>
<td>0.718</td>
<td>3.17</td>
<td>1546</td>
</tr>
<tr>
<td>c17</td>
<td>1024</td>
<td>5</td>
<td>3</td>
<td>6</td>
<td>0.4731</td>
<td>1270</td>
<td>0.458</td>
<td>0.165</td>
<td>3.17</td>
<td>7697</td>
</tr>
<tr>
<td>cs27</td>
<td>1000</td>
<td>7</td>
<td>4</td>
<td>10</td>
<td>1.429</td>
<td>2197</td>
<td>1.71</td>
<td>0.275</td>
<td>0.96</td>
<td>7989</td>
</tr>
<tr>
<td>cs208</td>
<td>100</td>
<td>19</td>
<td>10</td>
<td>102</td>
<td>6.52</td>
<td>6320</td>
<td>6.60</td>
<td>7.30</td>
<td>1.23</td>
<td>866</td>
</tr>
<tr>
<td>cs298</td>
<td>100</td>
<td>17</td>
<td>20</td>
<td>164</td>
<td>25.77</td>
<td>1361</td>
<td>26.47</td>
<td>8.77</td>
<td>-4.27</td>
<td>1159</td>
</tr>
<tr>
<td>c880</td>
<td>100</td>
<td>60</td>
<td>26</td>
<td>447</td>
<td>71.65</td>
<td>44600</td>
<td>74.32</td>
<td>41.41</td>
<td>3.73</td>
<td>1077</td>
</tr>
<tr>
<td>c432</td>
<td>-</td>
<td>36</td>
<td>7</td>
<td>267</td>
<td>-</td>
<td>-</td>
<td>73.14</td>
<td>28.72</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>cs1196</td>
<td>-</td>
<td>31</td>
<td>31</td>
<td>623</td>
<td>-</td>
<td>-</td>
<td>76.33</td>
<td>85.38</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>c1355</td>
<td>-</td>
<td>41</td>
<td>32</td>
<td>682</td>
<td>-</td>
<td>-</td>
<td>126.45</td>
<td>58.36</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>c1908</td>
<td>-</td>
<td>33</td>
<td>25</td>
<td>1049</td>
<td>-</td>
<td>-</td>
<td>213.62</td>
<td>241.0</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>c2670</td>
<td>-</td>
<td>157</td>
<td>64</td>
<td>1385</td>
<td>-</td>
<td>-</td>
<td>378.12</td>
<td>258.0</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

Table 4 shows the simulation results on several cir-
cuits when exhaustive simulation is performed using 
both the SPICE and our analyser. These test bench-
circuits are taken out. The ‘c*’ benchmarks (cs27
- 157 64
cube calculation algorithm. From the cs208 to c2670, 
we use 50 groups of 100 random patterns for SPICE simulation and then average the power of the different 50 groups to compare with our simulation. Due to the partitioning, the results of our simulator for these benchmarks are three orders faster than the SPICE. However, we must remember that the result of SPICE is from the average of 50 groups of 100 random patterns. It means that the proposed method is more efficient than the SPICE method. From the c432 to c2670, it takes an enormous time to get the results of SPICE simulation results. The simulation time of a single input pattern is more than three days. Therefore, the results by the SPICE simulation are ignored.

Since our analyser can recalculate the results by using 
the existing cube sets, whenever the transition density 
of any node is changed, our algorithm has the incre-
camental capability which the SPICE does not have.
Therefore, our estimator is particularly useful in the 
synthesis environment for power optimisation.

5 Conclusions and future work

We have proposed a novel static power analyser for 
CMOS combinational circuits. The analyser can esti-
mate the power consumption of a circuit very fast (4 orders faster than SPICE) and very accurately (with a 5% error compared with SPICE). The analyser is also applicable for different delay models with or without inertial delays. Furthermore, it can distinguish func-
tional transitions from spikes and has incremental 
capability. Last but not least, the analyser can identify 
the input transitions which cause the large power con-
sumption so that power optimisation can be appropri-
ately applied to improve the circuit. Therefore, this

analyser is very useful in the synthesis environment for 
low power.

One of the main future pieces of work is to solve 
the spatial and temporal dependency of the primary inputs.
By giving different weights to different cubes of the S-
cube set and the G-cube set, which means the patterns of 
some cubes may appear more frequently in the input 
vector than others, we can calculate the temporal dependency approximately. Using a table lookup method instead of the simple multiplication to calculate the \( E(\text{sw}) \), which means different primary inputs are not independent of each other, we can approximate the spatial dependency among different primary inputs. Another direction of future works is to expand this approach to cover other abstraction levels such as circuit level and functional level by extending the symbols and the definitions of cube and cube operations.

6 Acknowledgment

This research was supported by the National Science Council, Taiwan, ROC under Grant NSC 84-2213-E005-035.

7 References

2 LANDMAN, P.E.: ‘Low-power architectural methodologies’. Technical report, Dept. of EECS, University of California, Berke-
ley, 1994
6 SALZ, A., and HOROWITZ, M.A.: 'IRSIM: An incremental MOS switch-level simulator'. Proceedings of the 26th Design auto-


10 CHEN, K.-H., YUAN, S.-Y., JOU, J.-Y., and KUO, S.-Y.: 'Cell-based power estimation for CMOS combinational circuits using a logic simulator'. The 7th VLSI Design/CAD symposium proceedings, 1996, pp. 81–84