VLSI Neuroprocessors for Video Motion Detection

Ji-Chien Lee, Member, IEEE, Bing J. Sheu, Senior Member, IEEE, Wai-Chi Fang, Member, IEEE, and Rama Chellappa, Fellow, IEEE

Abstract—The system design of a locally connected competitive neural network for video motion detection is presented. The motion information from a sequence of image data can be determined through a two-dimensional multiprocessor array in which each processing element consists of an analog neuroprocessor. Massively parallel neurocomputing is done by compact and efficient neuroprocessors. Local data transfer between the neuroprocessors is performed by using an analog point-to-point interconnection scheme. To maintain strong signal strength over the whole system, global data communication between the host computer and neuroprocessors is carried out in a digital common bus. A mixed-signal very large scale integration (VLSI) neural chip that includes multiple neuroprocessors for fast video motion detection has been developed. Measured results of the programmable synapse and winner-take-all circuits are presented. The intrinsic speed-up factor over a Sun-4/75 workstation is around 180.

I. INTRODUCTION

RAPID advances in very large scale integration (VLSI) technologies have made possible the integration of multiple millions of transistors on a single chip. In 1992, the feature size for computer memory technologies is around 0.4 μm [1]. The use of VLSI circuits can greatly reduce the physical size and enhance the performance and reliability of microelectronic systems. In the microprocessor domain, continuous progress on reduced instruction set computers (RISC) has enabled the introduction of the Intel-i860 chip [2], the SPARC chip from Sun Microsystems, Inc. [3], and the 400-MIPS Alpha chip from Digital Equipment Corporation [4]. In the digital signal processing domain, the TMS-320C40 chip from Texas Instruments, Inc. [5] includes six communication ports to facilitate various data communication schemes. In the dedicated neural computing domain, the 11-million-transistor CNAPS chip from Adaptive Solutions, Inc. and Inova Microelectronics, Inc. [6] includes 64 digital processors for general-purpose neural network execution.

A desirable configuration for an integrated information processing system is shown in Fig. 1. A powerful multimedia data-fusion machine is equipped with several smart interface units to communicate with the analog signals in the real world. These analog signals contain information with some degree of fuzziness [7]. The image acquisition/understanding and advanced display units are used to process visual information. The speech recognizer and synthesizer units are used to process audio information [8]. The microsensor [9] and controller units are used to perform physical actions. Such an intelligent system can be used in offices, factories, and autonomous vehicles. VLSI neural chips can be effectively used in the construction of the interface units.

Various design approaches are applicable for the construction of neural chips. The analog circuit approach is quite attractive in terms of hardware size, power consumption, and speed [10]. Analog neural networks were used as sensory devices to preprocess the real-world data, as reported by Mead et al. [11], [12] and Abidi et al. [13]. In addition, many other analog VLSI neural chips have been reported [14]–[16]. One unique feature of a purely analog neural network is the limited computational precision for complex problems. To enhance the performance of analog VLSI neurocomputing, extra analog switching circuits can be used to facilitate the reconfigurability and scalability of analog neural networks [17]. The digital circuit approach offers greater flexibility, scalability, and accuracy than the analog circuit approach. By using logic and memory, a large problem can be partitioned and processed by the digital neural networks. Some general-purpose digital VLSI neural chips were reported [18]–[20].

In this paper, a mixed-signal design approach is used to exploit the massively parallel computational power of the neural network architecture for video motion detection. To solve low-level vision processing problems, multiple neurons and synapses can be clustered together to function as one pixel processing element. By using compact analog circuit design for the neuron and synapse cells, highly parallel computation on the pixel level can be achieved.

Fig. 1. An integrated information system. Information can be transferred between the system and the real world in multimedia format.

II. SYSTEM ARCHITECTURE

Motion information extracted from a sequence of time-varying images plays a key role in the image understanding and automated control processes. The requirement of an enormous amount of computational power for analyzing image sequences is always a major barrier to real-world applications of most vision-processing algorithms. By using multiprocessor-based VLSI design, the parallelism embedded in low-level vision processes can be fully explored. The single instruction multiple data (SIMD) architecture is a good example [21], [22]. Two specific multiprocessor-based neural engines based on the SIMD architecture have been reported. The CNAPS machine, from Adaptive Solutions, Inc. [18], consists of an array of processing nodes (PN’s). Each PN is an arithmetic processor with its own local memory. The array is sequenced by a system controller. Thus every PN executes the same instruction at a given clock period. The input data and control commands are broadcast to all PN’s through the common bus. The output data of the PN’s are transmitted through the data bus by the time-multiplexing scheme. A local digital data link exists between adjacent PN’s to allow quick data transfer. Due to the simplicity of the broadcast scheme, no complex routing networks are required. The systolic/cellular array processors (SCAP) system, from Hughes Research Lab. in Malibu, CA [23], consists of a 16 x 16 processor array, a dual-port array memory, and a system controller. The mesh-connection architecture is used. The boundary columns are connected via the wrap-around scheme, and the top and bottom rows are connected to the two ports of the system memory. Data communication can be conducted in the parallel format or in the pipelined format.

In our design, a mesh-connected two-dimensional neuroprocessor array is used for high-speed video motion detection. Each processing element can extract the velocity information for one pixel. Interprocessor communication is done by dedicated analog point-to-point interconnections. Data communication between the host computer and array processors is carried out through the digital common bus to preserve signal strength and to achieve simple network scalability. By using this efficient communication among processors, a high computational power per unit silicon area can be achieved.

III. MOTION DETECTION ALGORITHM

Many features from the images, such as points, lines, curves, and optical flow, can be used to estimate motion parameters. Optical flow is the apparent motion of the brightness patterns in a time-varying image sequence.


In order to obtain a dense flow field, the intensity-based approach is preferable. However, the intensity value may be corrupted by the noise present in natural images, and partial derivatives of the intensity value are sensitive to rotation. It is difficult to detect rotating objects in natural images based on such measurement primitives. Under the assumption that changes in intensity are strictly due to the motion of the object, Zhou et al. [32], [33] use the principal curvatures of the intensity function to compute the optical flow because they are rotation-invariant. The intensity values and their principal curvatures are estimated by using a polynomial fitting technique. Under the assumption of local rigid motion and the smoothness constraint, a self-organizing neural network [34]–[36] was developed to compute the optical flow. A deterministic decision rule was used for the updating of neuron states.

Let the velocity field consist of two components \(k\) and \(l\). A set of \((2D_k + 1)(2D_l + 1)\) modules of neurons is used to represent the optical flow field, where \(D_k\) and \(D_l\) are the maximum values of the velocity components in the \(k\) and \(l\) directions, respectively. For implementation purposes, the velocity component range is sampled using bins of size \(Q\). As shown in Fig. 2, each module corresponds to a velocity value and contains \(N_r \times N_c\) neurons if the images are of size \(N_r \times N_c\). All neurons in the same module are self-connected and locally interconnected with other neurons in a neighborhood of size \(\Gamma \times \Gamma\). Every pixel is represented by \((2D_k + 1)(2D_l + 1)\) mutually exclusive neurons which form a hypercolumn for velocity selection. When the neuron at the point \((i,j)\) in the \((k,l)\)th module is one, the actual velocities in the \(k\) and \(l\) directions at the point \((i,j)\) are \(kQ\) and \(lQ\), respectively.
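As a small illustration of this indexing, the following Python sketch enumerates the velocities represented by the neurons of one hypercolumn. The function name and the example values \(D_k = D_l = 2\), \(Q = 1\) are assumptions chosen for the illustration only.

```python
def hypercolumn_velocities(d_k, d_l, q):
    """Return the (k*Q, l*Q) velocity represented by each neuron of one hypercolumn."""
    return [(k * q, l * q)
            for k in range(-d_k, d_k + 1)
            for l in range(-d_l, d_l + 1)]


velocities = hypercolumn_velocities(d_k=2, d_l=2, q=1)
print(len(velocities))  # (2*2 + 1) * (2*2 + 1) = 25 neurons per pixel
```

Each pixel therefore owns one entry per candidate velocity, and exactly one of these entries is expected to be active once the network has settled.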

Let \(V = \{v_{i,j,k,l}, 1 \leq i \leq N, 1 \leq j \leq N, -D_k \leq k \leq D_k, -D_l \leq l \leq D_l\} \) be the binary state set of the neural network, with \(v_{i,j,k,l}\) denoting the state of the \((i,j,k,l)\)th neuron, which is located at point \((i,j)\) in the \((k,l)\)th module. Let \(T_{i,j,k,l,m,n,k,l} \) be the synaptic interconnection strength from neuron \((i,j,k,l)\) to neuron \((m,n,k,l)\), and let \(I_{i,j,k,l}\) be the bias input.

At each step, the neuron \((i,j,k,l)\) synchronously receives signals from itself, neighboring neurons, and a bias input

\[
u_{i,j,k,l} = \sum_{(m-i,\,n-j) \in S_0} T_{i,j,k,l,m,n,k,l}\, v_{m,n,k,l} + I_{i,j,k,l}
\tag{1}
\]

where \(S_0\) is an index set for all neighbors in a \(\Gamma \times \Gamma\) window centered at point \((i,j)\). The \(u_{i,j,k,l}\) is then processed by the winner-take-all circuitry to determine the velocity of the pixel

\[v_{i,j,k,l} = g(u_{i,j,k,l}) \tag{2}\]

where \(g(x_{i,j,k,l})\) is the winner-take-all function:

\[g(x_{i,j,k,l}) = \begin{cases} 1 & \text{if } x_{i,j,k,l} = \max(x_{i,j,p,q}; -D_k \leq p \leq D_k, -D_l \leq q \leq D_l), \\ 0 & \text{otherwise.} \end{cases} \tag{3}\]
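The update of one pixel can be sketched in Python as follows. This is a minimal software illustration, not the chip's circuit behavior: the function names are hypothetical, the neighbor states and weights are passed as flat lists, and ties are resolved by taking the first maximum, which is an assumption of the sketch.

```python
def excitation(weights, neighbor_states, bias):
    """Weighted sum of the neighbor states in one module plus the bias input."""
    return sum(t * v for t, v in zip(weights, neighbor_states)) + bias


def winner_take_all(excitations):
    """One-hot output that marks the largest excitation in a hypercolumn."""
    winner = max(range(len(excitations)), key=lambda idx: excitations[idx])
    return [1 if idx == winner else 0 for idx in range(len(excitations))]


# Three candidate velocities for one pixel, each with two neighbors (toy values).
u = [
    excitation([2.0, 2.0], [1, 0], -1.0),
    excitation([2.0, 2.0], [1, 1], -3.5),
    excitation([2.0, 2.0], [0, 0], -0.5),
]
print(winner_take_all(u))
```

Only the candidate with the largest excitation is set to one, so the hypercolumn always reports a single velocity.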

The network evolves until the energy function

\[
E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=-D_k}^{D_k} \sum_{l=-D_l}^{D_l} \sum_{(m-i,\,n-j) \in S_0} T_{i,j,k,l,m,n,k,l}\, v_{i,j,k,l}\, v_{m,n,k,l} - \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=-D_k}^{D_k} \sum_{l=-D_l}^{D_l} I_{i,j,k,l}\, v_{i,j,k,l}
\tag{4}
\]

reaches a minimum.

Two important features of the network should be noted:

i) The synaptic interconnection strengths between neurons in different modules are zero because only the neurons in the same module are connected, i.e.,

\[T_{i,j,k,l,m,n,p,q} = 0, \quad \text{for } (k,l) \neq (p,q), \quad (i,j) \neq (m,n).\]

ii) A maximum evolution function is used to ensure that only one neuron which has the maximum excitation is fired and the other \((2D_k + 1)(2D_l + 1) - 1\) neurons are turned off.

As reported in [32], a smoothness constraint is used for obtaining a smooth optical flow field and a line process is employed for detecting motion discontinuities. The line process consists of vertical and horizontal lines, \(L^v\) and \(L^h\), respectively. Each line can be in either one of the two states: 1 for being active and 0 for being idle. The error function for computing the optical flow from a pair of image frames can be expressed as

\[
\begin{aligned}
E ={}& \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=-D_k}^{D_k} \sum_{l=-D_l}^{D_l} \Big\{ A\big[(k_{11}(i,j) - k_{21}(i+k,j+l))^2 + (k_{12}(i,j) - k_{22}(i+k,j+l))^2\big] + (g_1(i,j) - g_2(i+k,j+l))^2 \Big\}\, v_{i,j,k,l} \\
&+ \frac{B}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=-D_k}^{D_k} \sum_{l=-D_l}^{D_l} \sum_{s \in S} (v_{i,j,k,l} - v_{(i,j)+s,k,l})^2 \\
&+ \frac{C}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=-D_k}^{D_k} \sum_{l=-D_l}^{D_l} \big[ (v_{i,j,k,l} - v_{i,j+1,k,l})^2 (1 - L^h_{i,j,k,l}) + (v_{i,j,k,l} - v_{i+1,j,k,l})^2 (1 - L^v_{i,j,k,l}) \big]
\end{aligned}
\tag{6}
\]

where \(k_{11}(i,j)\) and \(k_{12}(i,j)\) are the principal curvatures of the first image, \(k_{21}(i+k,j+l)\) and \(k_{22}(i+k,j+l)\) are the principal curvatures of the second image, and \(g_1(i,j)\) and \(g_2(i+k,j+l)\) are the intensity values of the first and second images, respectively. Here, \(S = S_0 \setminus \{(0,0)\}\) is the index set excluding \((0,0)\). \(A\), \(B\), and \(C\) are empirical constants.

The principal curvatures are defined as [37]

\[k_1(i,j) = M + (M^2 - G)^{1/2}\]

and

\[k_2(i,j) = M - (M^2 - G)^{1/2}\]

where \(k_1(i,j)\) and \(k_2(i,j)\) are the principal curvatures, \(G\) and \(M\) are the Gaussian and mean curvatures given by

\[G = \frac{\partial^2 g(i,j)}{\partial i^2} \cdot \frac{\partial^2 g(i,j)}{\partial j^2} - \left( \frac{\partial^2 g(i,j)}{\partial i \partial j} \right)^2\]

A polynomial fitting technique can be used to estimate the derivatives. The \( k_{11}, k_{21}, k_{12}, \) and \( k_{22} \) values are calculated from the images by the host computer and sent to the neuroprocessor for network evaluation.
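The relation between the principal curvatures and the pair \(M\), \(G\) can be illustrated with a short Python sketch of the kind of computation the host computer would perform. The function name is hypothetical, and clamping a slightly negative discriminant to zero is an assumption added to guard against numerical noise.

```python
import math


def principal_curvatures(mean_curv, gauss_curv):
    """Return (k1, k2) = M +/- sqrt(M^2 - G) for mean curvature M and Gaussian curvature G."""
    # Assumption: clamp small negative discriminants caused by estimation noise.
    root = math.sqrt(max(mean_curv ** 2 - gauss_curv, 0.0))
    return mean_curv + root, mean_curv - root


k1, k2 = principal_curvatures(1.5, 2.0)
print(k1, k2)  # k1 + k2 equals 2*M and k1 * k2 equals G
```

The sum of the two curvatures recovers \(2M\) and their product recovers \(G\), which gives a quick consistency check on any estimated values.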

In (6), the first term is to find velocity values such that all points of two images are matched as closely as possible in a least-squares sense. The second term, which is weighted by \( B \), is the smoothness constraint on the solution and the third term, which is weighted by \( C \), is a line process to weaken the smoothness constraint and to detect motion discontinuities. The constant \( A \) in the first term determines the relative importance of the intensity values and their principal curvatures to achieve the best results. The line process weakens the smoothness constraints by changing the smoothing weights, resulting in space-variant smoothing weights. For example, if all lines are on, the weights will be \( B/2 \). If all lines are off, the weights at the four nearest neighbors of the center point are increased by \( C/2 \).

By choosing the interconnection strengths and bias inputs as

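\[
\begin{aligned}
T_{i,j,k,l,m,n,k,l} ={}& -\big[48B + C(4 - L_{i,j,k,l}^h - L_{i,j-1,k,l}^h - L_{i,j,k,l}^v - L_{i-1,j,k,l}^v)\big]\, \delta_{i,m}\delta_{j,n} \\
&+ C\big[(1 - L_{i,j,k,l}^h)\,\delta_{i,m}\delta_{j+1,n} + (1 - L_{i,j-1,k,l}^h)\,\delta_{i,m}\delta_{j-1,n} \\
&\qquad + (1 - L_{i,j,k,l}^v)\,\delta_{i+1,m}\delta_{j,n} + (1 - L_{i-1,j,k,l}^v)\,\delta_{i-1,m}\delta_{j,n}\big] \\
&+ 2B \sum_{s \in S} \delta_{(i,j),(m,n)+s}
\end{aligned}
\tag{11}
\]

and

\[
I_{i,j,k,l} = -\big\{ A\big[(k_{11}(i,j) - k_{21}(i + k, j + l))^2 + (k_{12}(i,j) - k_{22}(i + k, j + l))^2\big] + (g_{1}(i,j) - g_{2}(i + k, j + l))^2 \big\}
\tag{12}
\]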

where \( \delta_{a,b} \) is the Kronecker delta function, the error function in (6) is mapped into the energy function of the neural network in (4). Notice that the interconnection strengths consist of constants and the line process only, whereas the bias inputs contain all the information from the images. When the network reaches a stable condition, the optical flow field is determined by the neuron states. The size of a typical smoothing window is \( 5 \times 5 \).
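A minimal Python sketch of this mapping is given below for the case with the line process switched off (\(C = 0\)). The function names and argument layout are hypothetical. The weight values follow from the constants quoted above for a \(5 \times 5\) window: a self-connection of \(-48B\) and a connection of \(2B\) to each of the 24 other neurons in the window.

```python
def bias_input(a, k11, k12, k21_shifted, k22_shifted, g1, g2_shifted):
    """Bias input: negative matching error between pixel (i, j) and (i+k, j+l)."""
    curvature_error = (k11 - k21_shifted) ** 2 + (k12 - k22_shifted) ** 2
    intensity_error = (g1 - g2_shifted) ** 2
    return -(a * curvature_error + intensity_error)


def smoothing_weight(b, offset, window=5):
    """Weight between a neuron and the neuron at the given (di, dj) offset, assuming C = 0."""
    half = window // 2
    di, dj = offset
    if abs(di) > half or abs(dj) > half:
        return 0.0  # outside the smoothing window: no connection
    if (di, dj) == (0, 0):
        return -2.0 * b * (window * window - 1)  # -48B for a 5 x 5 window
    return 2.0 * b


print(smoothing_weight(1.0, (0, 0)), smoothing_weight(1.0, (2, -1)))
```

Because the weights depend only on the offset, one set of values can be shared by every neuron, while the image content enters solely through the bias inputs.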

Since the first and second terms in (6) do not contain the line process, the line process is updated before the neuron states. Let \( L_{i,j,k,l}^{v,\text{new}} \) and \( L_{i,j,k,l}^{v,\text{old}} \) denote the new and old states of the vertical line \( L_{i,j,k,l}^v \), respectively. Let \( \Psi_{i,j,k,l} \) be the potential of the vertical line \( L_{i,j,k,l}^v \) given by

\[
\Psi_{i,j,k,l} = \frac{C}{2} (v_{i,j,k,l} - v_{i+1,j,k,l})^2.
\tag{13}
\]

Then, the new state is determined by

\[
L_{i,j,k,l}^{v,\text{new}} = \begin{cases} 1 & \text{if } \Psi_{i,j,k,l} > 0 \\ 0 & \text{otherwise.} \end{cases}
\tag{14}
\]

Whenever the states of neurons \( v_{i,j,k,l} \) and \( v_{i+1,j,k,l} \) are different, the vertical line \( L_{i,j,k,l}^v \) will be active, provided that the parameter \( C \) is greater than zero. If \( C = 0 \), then all lines are inactive, which means that no line process exists in the network operation. The choice of \( C \) is closely related to the selection of the smoothness parameter \( B \) in (6). A similar updating scheme is also used for the horizontal lines. In the prototype neural chip design, computation for the terms which are weighted by the parameter \( C \) is not included.
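The line update rule can be written as a few lines of Python. This is a sketch with a hypothetical function name; the two state arguments stand for the pair of neighboring neurons separated by the line.

```python
def update_line(c, v_here, v_neighbor):
    """Return the new line state: active (1) when the potential is positive."""
    potential = 0.5 * c * (v_here - v_neighbor) ** 2
    return 1 if potential > 0 else 0


print(update_line(1.0, 1, 0))  # neighboring states differ and C > 0
print(update_line(1.0, 1, 1))  # neighboring states agree
print(update_line(0.0, 1, 0))  # C = 0 disables the line process
```

With \(C = 0\) the potential is always zero, which matches the statement that no line process exists in that case.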

The state of each neuron is synchronously evaluated and updated according to (1) and (2). The initial states of the neurons are set as

\[
v_{i,j,k,l} = \begin{cases} 1 & \text{if } I_{i,j,k,l} \geq \max(I_{i,j,p,q}; -D_k \leq p \leq D_k, -D_l \leq q \leq D_l) \\ 0 & \text{otherwise} \end{cases}
\tag{15}
\]


Fig. 5. Circuit schematic of the velocity-sensitive component. It contains a programmable synapse array, a summing neuron, and a winner-take-all cell.

where \( I_{i,j,k,l} \) is the bias input. The initial conditions are completely determined by the bias inputs. If there are two maximal bias inputs at point \((i,j)\), then only the neuron corresponding to the smaller velocity is initially set to 1 and the other one is set to 0. This is consistent with the minimal mapping theory [38]. In the updating scheme, the minimal mapping theory is also used to handle the case of two neurons having the same largest inputs.
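The initialization with its tie-breaking rule can be sketched in Python as follows. The function name and the dictionary layout are hypothetical. Interpreting "smaller velocity" as the smaller squared magnitude \(k^2 + l^2\), and breaking any remaining tie by tuple order, are assumptions of this sketch.

```python
def initial_states(bias_by_velocity):
    """Map each (k, l) of one pixel to its initial state, given its bias input."""
    best = max(bias_by_velocity.values())
    tied = [vel for vel, bias in bias_by_velocity.items() if bias == best]
    # Assumption: prefer the smallest velocity magnitude among tied candidates.
    winner = min(tied, key=lambda vel: (vel[0] ** 2 + vel[1] ** 2, vel))
    return {vel: int(vel == winner) for vel in bias_by_velocity}


states = initial_states({(0, 0): -1.0, (1, 0): -1.0, (2, 2): -3.0})
print(states)
```

In this toy case the candidates \((0,0)\) and \((1,0)\) share the largest bias input, and the sketch selects \((0,0)\).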

IV. THE NEURAL-BASED NEUROPROCESSOR DESIGN

A. VLSI Architecture

To implement the electronic neural network processor, a VLSI architecture has been developed which maps the three-dimensional neural network configuration onto a two-dimensional plane. As shown in Fig. 3, each small frame represents one velocity-selective hypercolumn which contains \((2D_k + 1)(2D_l + 1)\) velocity-sensitive components. Each hypercolumn is locally interconnected with the \(\Gamma \times \Gamma - 1\) neighboring hypercolumns. The hypercolumn is designed as a neuroprocessor within which the velocity selection for an image pixel is conducted. Mixed analog/digital design technologies are utilized for the neuroprocessor design to achieve compact and programmable synapses and neurons for massively parallel neural computation [39].

To simplify the two-dimensional interconnection design for computation of optical flow, the analog point-to-point interconnection for local communication and the digital common bus for global communication are used. Since the velocity information of one pixel is affected by its neighbors, each neuroprocessor receives information from the neighboring neuroprocessors during the network operation. Data communication between these locally interconnected neuroprocessors is one key factor in the overall system performance. There are three different methods to accomplish the local data communication, with trade-offs between the operation speed and silicon area.

The first method is to use the digital bit-parallel point-to-point interconnection. The $(2D_k + 1)\times (2D_l + 1)$-bit $v_{i,j,p,q}$s, where $-D_k \leq p \leq D_k, -D_l \leq q \leq D_l$, are transmitted using the word-wide point-to-point interconnections. The data transfer speed is very fast. However, the total number of interconnection lines for each neuroprocessor is as large as $(2D_k + 1)\times (2D_l + 1)\times (\Gamma \times \Gamma)$. The silicon area for the interconnection routing is large. The required large pin count becomes a major constraint for hardware implementation.

The second method is to use the digital bit-serial point-to-point interconnection. The $v_{i,j,p,q}$s are sent in a bit-serial order by using a time-multiplexing technique. The total number of interconnection lines is reduced by a factor of $(2D_k + 1)\times (2D_l + 1)$. However, the time required for data transfer increases with the same factor [40]. In addition, the required hardware overhead for time-multiplexing includes a one-bit latch for each synapse cell, the multiplexing control signals, and the associated decoding circuitry.

The third method is to use the analog bit-parallel point-to-point interconnection. The $v_{i,j,p,q}$s are converted to an analog value, and then sent to the neighboring neuroprocessors. The $v_{i,j,p,q}$s are converted back into digital values at the receiving sites. The required hardware overhead includes the digital/analog converter or analog/digital converter at the two ends of interconnection wire. Both the analog interconnection method and the multiplexing digital interconnection method are suitable in the neural network processor design. For applications with large $D_k$ and $D_l$ values, the multiplexing digital interconnection method is preferred.
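The wiring trade-off among the three methods can be made concrete with a short Python sketch. The function name is hypothetical. The bit-parallel and bit-serial counts follow the expressions given above; treating the analog method as one wire per link is an assumption drawn from the description that each set of outputs is converted to a single analog value.

```python
def interconnect_lines(d_k, d_l, gamma):
    """Interconnection lines per neuroprocessor for the three local-link methods."""
    velocities = (2 * d_k + 1) * (2 * d_l + 1)
    links = gamma * gamma
    return {
        "digital_bit_parallel": velocities * links,
        "digital_bit_serial": links,  # fewer wires, but transfer time grows by `velocities`
        "analog": links,              # assumption: one analog wire per link
    }


print(interconnect_lines(d_k=2, d_l=2, gamma=5))
```

For \(D_k = D_l = 2\) and a \(5 \times 5\) window, the bit-parallel digital method needs 25 times as many lines as either alternative.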

A functional diagram of the velocity-selective neuroprocessor is shown in Fig. 4. It includes a velocity-sensitive component array, and a data conversion block. The array has $(2D_k + 1)\times (2D_l + 1)$ velocity-sensitive components which are laterally connected through the winner-take-all circuit. The velocity of the neuroprocessor is determined by competition which is performed by the winner-take-all circuit. Only one velocity component which has the maximum excitation will be the winner to represent the velocity of that pixel. The data conversion block is used for the analog point-to-point interprocessor interconnection.

As shown in Fig. 5, the velocity-sensitive component is constructed with one synapse array, one summing neuron, and one winner-take-all cell. The synapse array contains $\Gamma \times \Gamma + 1$ programmable synapses. The synapse weights $T_{i,j,k,l,m,n,k,l}$ are stored as charge packets on capacitors and must be refreshed periodically [17], [41]. The binary outputs $v_{m,n,p,q}$ from the neighboring neuroprocessors are routed to the corresponding mask ports of the synapse cells to conduct the network operation. A summing neuron functions as a parallel current-mode adder. Each summing neuron with its associated programmable synapse array performs a complete inner-product computation. The binary outputs of the winner-take-all circuit represent the velocity status.

The synapse weights and bias inputs are calculated by the host computer or a digital coprocessor and stored in a digital static-RAM. The 8-bit digital/analog converter transforms the digital representation of the synapse weights into analog values for charging the weight-storage capacitances of the synapse matrix. A two-port static-RAM and differential amplifier-based synapse design allows network retrieving and learning processes to occur concurrently.

B. Detailed Circuit Design

In Fig. 5, a transconductance amplifier consisting of transistors $M_1$-$M_5$ produces synapse output current $I_{o,i,j}$ according to mask voltage $V_{mask}$ and weight voltage $V_{w,i,j}$. The bias voltage $V_{bias}$ controls the dynamic range of synapse cells by adjusting the bias current in the transconductance amplifiers. When the $V_{mask}$ is at logic 1, the $V_{bias}$ is connected to $V_{on}$ to provide the amplifier with a specific bias current \( I_{\text{bias}}^{\text{max}} \). When the \( V_{\text{mask}} \) is at logic 0, the \( V_{\text{bias}} \) is connected to the negative power supply so that no synapse output current is produced. Therefore, the \( V_{\text{mask}} \) performs a masking operation on the synapse weight voltage \( V_{w} \). The mask voltage of each synapse cell is directly related to the value of the \( v_{m,n,k,l} \), which represents the velocity information of the neighboring pixels. The maximum synapse conductance is decided by the device sizes of the differential pair and the bias current \( I_{\text{bias}}^{\text{max}} \), while the minimum synapse conductance is determined by the resolution of the weight value on the MOS capacitance.

The synapse output currents are summed up and converted to the voltage format at the summing neuron. This compact synapse circuit performs a two-quadrant multiplication. The polarity of the synapse output depends on the value of the weight voltage \( V_{w} \). An 8-bit resolution can be easily supported in the DRAM-style synapse cell. The required refreshing time is around 100 ms. Recently, detailed designs of four-quadrant Gilbert multipliers for synapse cells were reported [17], [42]. Multiple differential pairs and current-mirror circuits make the wide-range operation possible. If more than 8-bit resolution is required for the synapse function, large-geometry MOS transistors and a shorter refreshing time will be needed. In the EEPROM-style synapse cell [43], [44], at least a 6-bit resolution can be obtained.

The summing neuron functions as a current-to-voltage converter and is realized by using a two-stage operational amplifier and a feedback resistor. The circuit schematic diagram of the two-stage operational amplifier is shown in Fig. 6. Transistors \( M_{13} \) and \( M_{14} \) form an improved cascode stage to increase the voltage gain, and \( M_{24} \) operates as a resistor for proper frequency compensation. An amplifier voltage gain of 100 dB can be achieved.

The outputs of the winner-take-all circuit are binary values. Only the one winner cell with the maximum input voltage will have the logic-1 output value. The other cells will have the logic-0 output value. The winner-take-all circuitry functions as a multiple-input parallel comparator. Fig. 7 shows the circuit schematic diagram of two winner-take-all cells. The high resolution and expandability of this winner-take-all circuit make it suitable for many competitive learning neural networks [45]–[48]. Winner-take-all circuits with transistors biased in the subthreshold region were reported in the literature. Such low-power winner-take-all circuits [49], [50] have a slower speed response and are suitable for special applications in which the power consumption is crucial.

After the winner-take-all circuit, the \( (2D_{k} + 1)(2D_{l} + 1) \) binary outputs represent the velocity information of one image pixel. Combinational logic gates are used to encode these \( (2D_{k} + 1)(2D_{l} + 1) \) binary signals and to store the result into a data latch. Fig. 8 shows the digital circuits of the data latch with the associated read/write control logic. The final velocity result is read by the host computer from the data latches through the digital common bus.

Fig. 9 shows a voltage-scaling digital-to-analog converter [51] which is used to convert the encoded binary code to the analog value and send it to the neighboring neuroprocessors. The voltage-scaling converter uses a series of resistors connected between \( V_{\text{ref}} \) and \( -V_{\text{ref}} \) to provide intermediate voltage values. For an \( N \)-bit converter, the resistor string would have \( 2^{N} + 1 \) resistor segments. In Fig. 9, a total of 26 resistor segments are used. The resistor is implemented in the \( P \)-well diffusion layer of the MOS fabrication process. The sheet resistance of the \( P \)-well layer is around 2 \( \Omega/\square \) from the MOSIS 1.2-\( \mu \)m CMOS \( P \)-well process [52], [53]. Unity-gain followers are used to buffer the resistor string from conductive loading. Each tap is connected to a switching tree whose switches are controlled by the bits of the digital word. Each switch is implemented by a CMOS transmission gate.

When the analog velocity information from the neighboring neuroprocessors is received, a total of \( \Gamma \times \Gamma - 1 \) analog-to-digital converters are used to convert these analog values back to the binary values with \( (2D_{k} + 1)(2D_{l} + 1) \) bits.

Fig. 11. The layout of one velocity-selective neuroprocessor. It contains 25 neurons and a 25 × 27 synapse matrix and is able to detect a moving object with 25 different velocities. It occupies a silicon area of 2482 × 5636 λ². The synapse cell is shown in the inset.

Only one of these bits is logic-1 and the others are logic-0. To achieve high-speed performance and a compact silicon area, a parallel and distributed analog-to-digital converter has been designed. One voltage scaling resistor-chain is used. As shown in Fig. 10, the comparators and the associated digital decoding circuitries are distributed into the synapse cells. The comparators included in the same velocity-sensitive component use the same reference voltage provided by the resistor-chain. The distributed decoding circuitries make sure that only one of \((2D_k + 1)(2D_l + 1)\) binary outputs is logic 1 and the others are inhibited to logic 0.
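The encode and decode path between two neuroprocessors can be mimicked in software with the Python sketch below. It is an idealized model, not the circuit: the function names are hypothetical, the 26-segment string follows the description of Fig. 9, and placing each comparator threshold half a step below its tap is an assumption made so that the round trip is unambiguous.

```python
def dac_tap_voltage(index, v_ref=1.0, segments=26):
    """Voltage at interior tap `index` (0-based) of a resistor string from -Vref to +Vref."""
    step = 2.0 * v_ref / segments
    return -v_ref + (index + 1) * step


def adc_one_hot(voltage, v_ref=1.0, segments=26):
    """Compare against every tap in parallel, then keep only the highest comparator that fired."""
    levels = segments - 1
    step = 2.0 * v_ref / segments
    # Assumption: threshold n sits half a step below tap n.
    fired = [voltage > dac_tap_voltage(n, v_ref, segments) - step / 2
             for n in range(levels)]
    return [int(fired[n] and (n == levels - 1 or not fired[n + 1]))
            for n in range(levels)]


code = adc_one_hot(dac_tap_voltage(7))
print(code.index(1), sum(code))
```

The decoding step plays the role of the distributed inhibit logic: at most one of the 25 outputs is set for any input voltage.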

V. EXPERIMENTAL RESULTS

Fig. 12. The layout of the VLSI neural chip. It consists of 64 neuroprocessors and occupies a 1.5 × 2.8 cm² silicon area.

Fig. 13. The detailed layout of interconnects among four neuroprocessors.

TABLE I
Performance Comparison of Two Interconnect Methods

| Performance | Bit-Parallel Analog Interconnect Method | Bit-Parallel Digital Interconnect Method |
| --- | --- | --- |
| Chip size (cm²) | 1.5 × 2.8 | 1.5 × 2.8 |
| Number of neuroprocessors | 64 | 12 |
| Interconnection routing area (%) | 23 | 85 |
| Pin count | 178 | 3250 |
| Network iteration time (ns) | 522 | 250 |
| Speed performance (connections/second) | 8.03 × 10^10 | 2.08 × 10^10 |

Fig. 14. The system diagram for high-speed motion detection using multiple VLSI neural chips. Each VLSI neural chip can accommodate 64 neuroprocessors with 1.2-μm CMOS technology and a 1.5 × 2.8 cm² chip area. Each neuroprocessor in the VLSI neural chip can communicate with its neighbors through the analog point-to-point interconnections. Standard IC parts such as SRAM and 8-bit DAC are used for refreshing the synapse weights.

Fig. 15. The layout of the test module which includes key circuit blocks.

Fig. 16. Measured results of programmable synapse characteristics. A large bias voltage can provide a large dynamic range.

Fig. 17. Measured results of winner-take-all circuits. (a) One input, as indicated on the x-axis, changes linearly from -1.53 V to -1.48 V, the second input is connected to -1.5 V, and the other seven inputs are kept at -1.525 V. (b) One input, as indicated on the x-axis, changes linearly from 1.47 V to 1.52 V, the second input is connected to 1.5 V, and the other seven inputs are kept at 1.475 V.

In the prototype neuroprocessor chip design, \(D_k = D_l = 2\) and a 5 × 5 smoothing window are used. The physical layout of the velocity-selective neuroprocessor for one image pixel, drawn with the scalable CMOS design rules, is shown in Fig. 11. It occupies an area of 2482 × 5636 λ², contains 25 neurons and 25 × 27 synapse cells, and is able to detect a moving object with 25 different velocities. In the hardware implementation, two rows of synapses are used to increase the resolution of synapse weights coming from the bias inputs and also to enhance the fault tolerance of the network. With an advanced 1.2-μm CMOS technology, 64 neuroprocessors can be accommodated into one VLSI neural chip of 1.5 × 2.8 cm² in size. The chip layout is shown in Fig. 12. It requires a 178-pin PGA package. The analog interprocessor data communication requires 128 pins. The detailed layout of interconnects among four neuroprocessors is shown in Fig. 13. The interconnection routing area occupies 23% of the chip area. A performance comparison against the digital bit-parallel point-to-point interconnection method is listed in Table I. In the digital bit-parallel method, each data link requires 25 lines. Only 12 neuroprocessors can be accommodated in the same chip area, and 85% of the chip area will be used for interconnection routing.

<table>
<thead>
<tr>
<th>Circuit Response Time</th>
<th>Measured Results</th>
</tr>
</thead>
<tbody>
<tr>
<td>Analog-to-Digital Conversion</td>
<td>73 ns</td>
</tr>
<tr>
<td>Synapse Multiplication</td>
<td>120 ns</td>
</tr>
<tr>
<td>Neuron Thresholding</td>
<td>20 ns</td>
</tr>
<tr>
<td>Winner-Take-All Operation</td>
<td>38 ns</td>
</tr>
<tr>
<td>Encoder &amp; Data Latch</td>
<td>8 ns</td>
</tr>
<tr>
<td>Digital-to-Analog Conversion</td>
<td>84 ns+, 263 ns*</td>
</tr>
<tr>
<td>Total</td>
<td>343 ns+, 522 ns*</td>
</tr>
</tbody>
</table>

Note: \( T_{\text{inv}} = 202 \) A in MOSIS 1.2-μm CMOS technology.
+ with an output loading of 5 pF
* with an output loading of 50 pF

TABLE II
PERFORMANCE OF VLSI MOTION ESTIMATION SYSTEM

<table>
<thead>
<tr>
<th>Operation</th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Synapse Weight Loading Time</td>
<td>2,080 μs (50 ns per write)</td>
</tr>
<tr>
<td>Network Iteration Time (for 36 iterations)</td>
<td>18.792 μs</td>
</tr>
<tr>
<td>Neuron State Read Out Time</td>
<td>409.6 μs (50 ns per read)</td>
</tr>
<tr>
<td>Total Processing Time</td>
<td>2.509 ms</td>
</tr>
</tbody>
</table>

Fig. 18. The statistical distribution of measured synapse output conductances. (a) The synapse conductances can be described by a Gaussian distribution with a mean value of 14.07 \( \mu A/V \) and a standard deviation of 0.042 \( \mu A/V \) at weight voltage \( V_{w} = 2 \) V. (b) The synapse conductances can be described by a Gaussian distribution with a mean value of -13.69 \( \mu A/V \) and a standard deviation of 0.036 \( \mu A/V \) at weight voltage \( V_{w} = -2 \) V.

With 128 VLSI neural chips and supporting standard IC parts, such as SRAMs and 8-bit DACs for storing the weight information and dynamically refreshing the synapse cells, the optical flow of an image with 64 x 128 pixels and 256 gray levels can be computed at a rate of 30 frames per second. The proposed system setup for fast motion detection using multiple VLSI neural chips is shown in Fig. 14.
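The frame-rate claim can be checked against the timing figures listed for the motion estimation system. The following is a minimal arithmetic sketch using only numbers quoted in this section; the variable names are hypothetical, and reading the 50-ns access time as one write per synapse cell and one read per pixel is an assumption, not a statement from the paper.

```python
CHIPS = 128
NEUROPROCESSORS_PER_CHIP = 64
IMAGE_ROWS, IMAGE_COLS = 64, 128

# All times in nanoseconds, taken from the timing figures quoted in the text.
WEIGHT_LOAD_NS = 2_080_000      # 2,080 us synapse weight loading
ITERATION_NS = 522              # one network iteration
ITERATIONS = 36
READOUT_NS = 409_600            # 409.6 us neuron state read-out
ACCESS_NS = 50                  # per write and per read
FRAMES_PER_SECOND = 30

pixels = IMAGE_ROWS * IMAGE_COLS
neuroprocessors = CHIPS * NEUROPROCESSORS_PER_CHIP   # one neuroprocessor per pixel

network_ns = ITERATIONS * ITERATION_NS
total_ns = WEIGHT_LOAD_NS + network_ns + READOUT_NS

# Assumed interpretation: number of 50-ns accesses implied by each phase.
writes = WEIGHT_LOAD_NS // ACCESS_NS
reads = READOUT_NS // ACCESS_NS

frame_period_ns = 1_000_000_000 / FRAMES_PER_SECOND

print(f"pixels = {pixels}, neuroprocessors = {neuroprocessors}")
print(f"36 iterations take {network_ns / 1000:.3f} us")
print(f"total processing time = {total_ns / 1e6:.3f} ms")
print(f"writes = {writes}, reads = {reads}")
print(f"fraction of a 30 frames/s period used = {total_ns / frame_period_ns:.1%}")
```

The summed time is about 2.51 ms, in line with the total listed for the system and well below the 33.3-ms frame period. Weight loading dominates the budget, while the 36 network iterations account for less than 1% of it.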

To obtain the electrical properties of the basic circuit blocks, a test structure containing key circuit components was fabricated in a 2-μm CMOS process from Orbit Semiconductor, Inc. through the MOSIS Service of USC/Information Sciences Institute at Marina del Rey, CA, and then tested. The picture of the test structure is shown in Fig. 15. Measured transfer curves of the synapse cell with different bias voltages are shown in Fig. 16. The dynamic range of the synapse cell is controlled by the bias voltage. Experimental data on the winner-take-all circuit are shown in Fig. 17. The circuit consists of nine winner-take-all cells. Two experiments were conducted. In Fig. 17(a), one input sweeps linearly from -1.53 to -1.48 V, the second input is connected to -1.5 V, and the other seven inputs are kept at -1.525 V. In Fig. 17(b), one input sweeps linearly from 1.47 to 1.52 V, the second input is connected to 1.5 V, and the other seven inputs are kept at 1.475 V. The winner-take-all function is successfully implemented with a resolution of 15 mV.
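The first experiment can be mimicked with an idealized behavioral model of a nine-input winner-take-all stage. This is only a sketch: it assumes that the 15-mV resolution means the winner is reliably resolved once it leads the runner-up by at least 15 mV, and the function name is hypothetical. Voltages are kept as integer millivolts to avoid rounding effects.

```python
def winner_take_all(inputs_mv, resolution_mv=15):
    """Return (index of largest input, True if it leads the runner-up by the resolution)."""
    ranked = sorted(range(len(inputs_mv)), key=lambda i: inputs_mv[i], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    margin_mv = inputs_mv[winner] - inputs_mv[runner_up]
    return winner, margin_mv >= resolution_mv


def experiment_a(sweep_mv):
    """Input 0 is swept, input 1 sits at -1.5 V, and the other seven at -1.525 V."""
    return [sweep_mv, -1500] + [-1525] * 7


for sweep_mv in (-1530, -1510, -1490, -1480):
    winner, resolved = winner_take_all(experiment_a(sweep_mv))
    print(f"sweep = {sweep_mv} mV -> winner = input {winner}, resolved = {resolved}")
```

Under this assumed reading, the fixed -1.5 V input wins at the start of the sweep, and the swept input is a clearly resolved winner only once it exceeds -1.485 V.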

The processing time for one network iteration is around 522 ns. Each iteration cycle includes synapse multiplication, neuron summing, winner-take-all operation, data storage on latches, digital/analog and analog/digital conversion, and interprocessor data transfer. SPICE [54] simulation results on various circuit blocks are listed in Table II. The large response time of the synapse multiplication is due to the significant capacitance loading on the current-summation line. For the digital/analog conversion simulations, effective capacitance loadings of 5 pF and 50 pF are estimated for intrachip and off-chip data communication, respectively. The major delay comes from the off-chip interprocessor data communication. A total computing power of $8.32 \times 10^{10}$ connections per second can be achieved by one VLSI neural chip containing 1600 neurons and 41,600 synapse cells and operated at a master clock rate of 2 MHz. Based on the results of Table II, the speed comparison of a system using 128 VLSI neural chips with a Sun-4/75 SPARC workstation is listed in Table III. The intrinsic speed-up factor is around 180.
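The iteration time and the quoted computing power follow from simple arithmetic on the figures given above. The sketch below adds up the simulated circuit response times and multiplies the synapse count by the clock rate; the dictionary keys and function name are hypothetical, and counting one connection per synapse cell per clock cycle is an assumed reading of the throughput figure.

```python
# Simulated response times of the circuit blocks, in nanoseconds.
STAGE_NS = {
    "analog_to_digital": 73,
    "synapse_multiplication": 120,
    "neuron_thresholding": 20,
    "winner_take_all": 38,
    "encoder_and_latch": 8,
}
# Digital-to-analog conversion time depends on the output loading.
DAC_NS = {"5 pF": 84, "50 pF": 263}


def iteration_time_ns(load):
    """Total time of one network iteration for the given DAC output loading."""
    return sum(STAGE_NS.values()) + DAC_NS[load]


SYNAPSES_PER_CHIP = 41_600
CLOCK_HZ = 2_000_000

# Assumed: every synapse cell performs one connection per master clock cycle.
connections_per_second = SYNAPSES_PER_CHIP * CLOCK_HZ

for load in DAC_NS:
    print(f"{load}: {iteration_time_ns(load)} ns per iteration")
print(f"{connections_per_second:.3e} connections per second")
```

With the 50-pF loading the sum reproduces the 522-ns iteration time, which corresponds to an iteration rate of roughly 1.9 MHz, close to the 2-MHz master clock used in the throughput figure.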

System-level analysis has been conducted to illustrate the performance of the motion detection chip. The mismatch effect of analog synapse components has been included. Fig. 18 shows the statistical distribution of measured synapse output conductances. A total of 300 synapses were measured. In Fig. 18(a), the synapse conductances can be described by a Gaussian distribution with a mean value of 14.07 $\mu$A/V and a standard deviation of 0.042 $\mu$A/V at weight voltage $V_{w} = 2$ V. In Fig. 18(b), the synapse conductances can be described by a Gaussian distribution with a mean value of $-13.69$ $\mu$A/V and a standard deviation of 0.036 $\mu$A/V at weight voltage $V_{w} = -2$ V. During computer analysis, the effects of process variation on synapse weights are included through the use of a Gaussian function.
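One way such a mismatch model can be set up in software is to draw each synapse conductance from a Gaussian distribution with the measured mean and standard deviation. The sketch below is an assumption about the general approach, not the authors' analysis code; the function name and seeds are hypothetical.

```python
import random
import statistics


def sample_conductances(mean, std, count, seed=0):
    """Draw `count` synapse conductances (in uA/V) from a Gaussian mismatch model."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    return [rng.gauss(mean, std) for _ in range(count)]


# Measured statistics quoted in the text for weight voltages of +2 V and -2 V.
positive = sample_conductances(14.07, 0.042, 300)
negative = sample_conductances(-13.69, 0.036, 300, seed=1)

for label, values in (("+2 V", positive), ("-2 V", negative)):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    print(f"{label}: mean = {mean:.3f} uA/V, std = {std:.4f} uA/V, "
          f"relative spread = {abs(std / mean):.2%}")
```

The relative spread implied by the measured values is about 0.3% of the mean conductance at either weight voltage.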

A set of four successive image frames directly produced by a Sony XC-77 CCD camera was used as the input data. Fig. 19(a)-(d) shows four successive image frames of a mobile missile launcher moving from left to right against a stationary background. The size of each image frame is 130 $\times$ 160 pixels. The maximum displacement of the mobile missile launcher between successive image frames is 7 pixels. To estimate the principal curvatures and intensity values, a 5 $\times$ 5 window and a third-order polynomial were used for all frames. By setting $A = 4$, $B = 850$, $C = 0$, $D_k = 7$, and $D_l = 1$, the velocity field was obtained after 36 iterations. The parameter $A$ is set to 4 because four successive image frames are used. The parameter $B$ is chosen by trial and error. The parameter $C$ is set to 0 in the prototype design to simplify the neuron-state updating scheme of the network. Fig. 19(c) shows the final result obtained with synapse weights that include the effects of process variation. Compared with the result in Fig. 19(f), in which the effects of process variation are not included, the motion of the moving object can still be detected successfully.

VI. CONCLUSION

A mixed-signal two-dimensional mesh-connected architecture for high-speed motion detection has been presented. A compact and efficient VLSI neuroprocessor, which includes 25 neurons and $25 \times 27$ synapse cells, is able to estimate the motion of each pixel with 25 different velocities. Multiple neuroprocessors can be connected as a two-dimensional mesh to fully exploit the massively parallel computational power of neural networks. In this architecture, the local computation is performed in the analog neuroprocessors and the local data communication is performed in parallel. Each $1.5 \times 2.8\ \text{cm}^2$ VLSI neural chip fabricated in a 1.2-$\mu$m CMOS technology can operate at a rate of 83.2 giga connections per second.

ACKNOWLEDGMENT

Discussions with Dr. Y.-T. Zhou on artificial neural networks and with Dr. B. W. Lee on neural circuit design are highly appreciated. J. Choi provided detailed information on winner-take-all circuit design. The authors would like to thank the reviewers for valuable comments and suggestions.

REFERENCES

[1] H. Koike et al., "A 30 ns 64 Mb DRAM with built-in self-test and repair
function," in Tech. Dig. IEEE Int. Solid-State Circ. Conf., San Francisco,

[2] L. Kohn and N. Margulis, "Introducing the Intel i860 64-bit microproces-


[4] D. Dobberpuhl et al., "A 200MHz 64b dual-issue CMOS microproces-

Dallas, TX, 1990.

million transistor neural network execution engine," in Tech. Dig. IEEE


[8] R. W. Brodersen, "Low-power design of a wireless multimedia termi-

with on-chip digital front-end processor," in Tech. Dig. IEEE Int. Solid-

[10] Y. Tsididis, "Analog MOS integrated circuits—certain new ideas,
June 1987.


Gaussian convolution with embedded image sensing," in Tech. Dig.
pp. 216–217.

programmable network topology," in Tech. Dig. IEEE Int. Solid-State

network with dynamically updated weights," in Tech. Dig. IEEE Int.

R. Etienne, and P. Kinget, "An analog neural computer with modular
architecture for real-time dynamic computations," IEEE J. Solid-


[18] D. Hammerstrom, "A VLSI architecture for high-performance, low-cost,

digital processor for implementing large artificial neural networks," in
1991, pp. 16.3.1–16.3.4.

[20] B. J. Sheu, "VLSI Neurocomputing with analog programmable chips
digital systolic array chips," presented at the Int. Symp. Circuits and

[21] K. S. Fu and T. Ichikawa, Special Computer Architectures for Pattern
was also a recipient of the 1990 and 1991 Best Presenter Award at the IEEE International Conference on Computer Design. He has published more than one hundred papers in international scientific and technical journals and conferences and is a coauthor of the book *Hardware Annealing in Analog VLSI Neurocomputing* (Kluwer Academic Publishers). He has served on the Technical Program Committees and as the session chairman in IEEE/International Joint Conference on Neural Networks, International Conference on Computer Design, International Symposium on Circuits and Systems, Custom Integrated Circuits Conference, and on the VLSI Committee of the IEEE Signal Processing Society. He is on the editorial board of the Journal of Analog Integrated Circuits and Signal Processing. He also serves as a Guest Editor for *IEEE Journal of Solid-State Circuits* (the March 1992 and March 1993 Issues). He serves as an Associate Editor and a Guest Editor for the *IEEE Transactions on Very Large Scale Integration Systems* and an Associate Editor of the *IEEE Circuits and Devices Magazine*. He is a senior member of the Signal Processing Society, Circuits and Systems Society, Communication Society, Computer Society of IEEE, a member of International Neural Networks Society, Phi Tau Phi, and Eta Kappa Nu.

**Wai-Chi Fang** (S’81–M’86) was born in Taiwan in 1956. He received the B.S. degree in electronics engineering in 1978 from National Chiao Tung University at Hsin-Chu, Taiwan, and the Master of Engineering and Ph.D. degrees in electrical engineering from University of Southern California in 1987 and 1992, respectively.

From 1988 to 1992, he studied VLSI implementation of image and video compression systems, digital neurocomputing and systolic array-based image understanding in the VLSI Signal Processing Laboratory at University of Southern California. Since 1985, he has also been with the Jet Propulsion Laboratory at Pasadena, California and worked on the architecture and design of high-performance computing and image processing systems. He used advanced CAD tools to generate applications-specific VLSI chips in the full-custom and standard-cell design styles based on the specially modified signal processing algorithms. He is currently involved in satellite image data compression for Cassini Titan Radar Mapper in the Radar Science and Engineering Division. He has published more than fifteen papers in scientific conferences and journals. His research interests include high-speed data compression for multimedia applications, artificial neural networks, image and video processing, signal processing for synthetic aperture radars, and VLSI system design.

**Rama Chellappa** (S’78–M’79–SM’83–F’92) was born in Tanjore, Madras, India in 1953. He received the B.S. degree (with honors) in electronics and communication engineering from the University of Madras in 1975 and the M.S. degree (with distinction) in electrical communication engineering from the Indian Institute of Science, Bangalore, in 1977. He received the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN in 1978 and 1981, respectively.

From 1979 to 1981, he was a faculty research assistant at the Computer Vision Laboratory, University of Maryland, College Park. From 1981 to 1991, he was a faculty member in the Department of Electrical Engineering-Systems, University of Southern California. From 1988 to 1990, he was also the Director of Signal and Image Processing Institute at USC. Presently, he is a Professor in the Department of Electrical Engineering at the University of Maryland, where he is also affiliated with the Center for Automation Research, Computer Science Department and the Institute for Advanced Computer Studies. He has authored ten book chapters and over a hundred journal and conference papers. Many of his papers have been reprinted in collected works published by IEEE Press, IEEE Computer Society Press, and MIT Press. His current research interests are in signal and image processing, computer vision, and pattern recognition. He is the coeditor of two volumes of selected papers on digital image processing and analysis published in 1985. He was an Associate Editor for the *IEEE Transactions on Acoustics, Speech, and Signal Processing* from 1987 to 1989. Currently, he is a coeditor-in-chief of *Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing* and is an associate editor for the *IEEE Transactions on Neural Networks* and *IEEE Transactions on Image Processing*.

Dr. Chellappa received a national scholarship from the Government of India during the period from 1969 to 1975, and he received the 1975 Jawaharlal Nehru Memorial Award from the Department of Education, Government of India. In addition, he received the 1985 National Science Foundation (NSF) Presidential Young Investigator Award and the 1985 IBM Faculty Development Award. In 1990, he received the Excellence in Teaching Award from the School of Engineering at USC. He was the general chairman of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition and of the IEEE Computer Society Workshop on Artificial Intelligence for Computer Vision, both held in San Diego, CA in June 1989. He was a program cochairman for the NSF-sponsored Workshop on Markov Random Fields, also held in San Diego in June 1989.