A Programmable Vision Chip with High Speed Image Processing

Jérôme Dubois, Michel Paindavoine, Dominique Ginhac

To cite this version:

Jérôme Dubois, Michel Paindavoine, Dominique Ginhac. A Programmable Vision Chip with High Speed Image Processing. 28th International Congress on High-Speed Imaging and Photonics (ICH-SIP), Nov 2008, Canberra, Australia. pp.1-10, 2009, <10.1117/12.822294>. <hal-00785915>

HAL Id: hal-00785915
https://hal-univ-bourgogne.archives-ouvertes.fr/hal-00785915
Submitted on 7 Feb 2013


Jérôme Dubois\textsuperscript{a}, Michel Paindavoine\textsuperscript{b} and Dominique Ginhac\textsuperscript{b}

\textsuperscript{a}LCE CEA Saclay, 91190 Gif sur Yvette, Saclay, France;
\textsuperscript{b}LE2I UMR CNRS 5158, 21078 Dijon Cedex, Dijon, France

ABSTRACT

This paper describes a high-speed analog VLSI image acquisition and pre-processing system. A $64 \times 64$ pixel retina extracts the magnitude and direction of spatial gradients from images. The sensor implements low-level image processing in a massively parallel way, within each pixel, using an architecture based on four-quadrant multipliers. Spatial gradients and various convolutions such as the Sobel filter or the Laplacian are described and implemented on the circuit. Each pixel includes a photodiode, an amplifier, two storage capacitors and an analog arithmetic unit. A maximal output frame rate of about 10 000 frames per second with image acquisition only, and of 2000 to 5000 frames per second with image processing, is achieved in a standard 0.35 $\mu$m CMOS process. The retina provides address-event coded output on three asynchronous buses: one is dedicated to the gradient and the other two to the pixel values. A prototype based on this principle has been designed. Simulation results obtained with Mentor Graphics\textsuperscript{TM} software and the AustriaMicrosystems design kit are presented.

Keywords: CMOS Image Sensor, Parallel architecture, High-speed image processing, Analog arithmetic unit.

1. INTRODUCTION

Today, improvements continue to be made in the growing digital imaging world with two main image sensor technologies: charge-coupled devices (CCD) and CMOS sensors. The continuous advances in CMOS technology for processors and DRAMs have made CMOS sensor arrays a viable alternative to the popular CCD sensors. New technologies provide the potential for integrating a significant amount of VLSI electronics onto a single chip, greatly reducing the cost, power consumption and size of the camera.\textsuperscript{1–4} This advantage is especially important for implementing full image systems requiring significant processing, such as digital cameras and computational sensors.\textsuperscript{5} Most of the work on complex CMOS systems concerns the integration of sensors with a processing unit at chip level or column level.\textsuperscript{6, 7} Indeed, pixel-level processing is generally dismissed because the resulting pixel sizes are often too large to be of practical use. However, integrating a processing element at each pixel or group of neighbouring pixels seems promising. More significantly, employing a processing element per pixel offers the opportunity to achieve massively parallel computations. This benefits traditional high-speed image capture\textsuperscript{8–10} and enables the implementation of several complex applications at standard video rates.\textsuperscript{11, 12}

Vision chips are designed based on the concept that analog VLSI systems with low precision are sufficient for implementing many low-level vision algorithms. The precision of analog VLSI systems is affected by many factors that are usually not controllable. As a result, if the algorithm does not account for these inaccuracies, the processing reliability may be severely affected. Vision chips implement a specific algorithm in a limited silicon area. They are always full-custom designs, which is a challenging task known to be time-consuming and error-prone. One should consider issues ranging from visual processing algorithms to low-level circuit design problems, and from phototransduction principles to high-level VLSI architectural issues.\textsuperscript{13}

This paper describes the design and implementation of a $64 \times 64$ active pixel sensor with a per-pixel programmable processing element. Both the circuit design and the layout target manufacturing in a standard 0.35 $\mu$m double-poly quadruple-metal CMOS technology. The main objectives of our design are: (1) to evaluate the speed of the sensor and, in particular, to reach a rate of 10,000 frames/s, (2) to demonstrate a versatile and programmable processing unit at pixel level, and (3) to provide an original platform dedicated to embedded image processing.

Send correspondence to J.D.

J.D.: E-mail: jerome1.dubois@cea.fr, Telephone: +33 (0)1 69 08 60 51
M.P.: E-mail: paindav@u-bourgogne.fr, Telephone: +33 (0)3 80 39 60 43

The rest of this paper is organized as follows. Section 2 describes the main characteristics of the overall architecture. Section 3 is dedicated to the operational principle at pixel level in the sensor. Sections 4 and 5 respectively describe the details of the pixel design and the Analog Arithmetic Unit embedded in each pixel. Finally, simulation results of high-speed image acquisition with processing at pixel level are presented in Section 6.

2. OVERALL ARCHITECTURE

The core of the circuit is the two-dimensional array of pixels. This array is organized as a Cartesian grid of 64×64 pixels and contains 160,000 transistors on a 3.5 mm × 3.5 mm die. The full layout of the retina is shown in Fig. 1 and the main chip characteristics are listed in Table 1. Each pixel contains a photodiode for light-to-voltage transduction and 38 transistors integrating all the analog circuitry needed to implement the algorithms described in Section 3. This circuitry includes a preloading circuit, two 'Analog Memory, Amplifier and Multiplexer' blocks ([AM]$^2$) and an 'Analog Arithmetic Unit' ($A^2U$) based on a four-quadrant multiplier architecture. The full pixel size is 35 µm × 35 µm with a 25% fill factor.
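As a quick consistency check on these figures, the following Python sketch recomputes the transistor budget and the photosensitive area per pixel from the values quoted above. The variable names are illustrative, and reading the remaining transistors as peripheral circuitry (decoder, readout, test structures) is an assumption, not a statement from the design.

```python
# Figures quoted in the text (names are illustrative, not from the paper).
ROWS, COLS = 64, 64
TRANSISTORS_PER_PIXEL = 38
PIXEL_PITCH_UM = 35.0
FILL_FACTOR = 0.25
TOTAL_TRANSISTORS = 160_000  # rounded figure quoted for the chip

pixels = ROWS * COLS
array_transistors = pixels * TRANSISTORS_PER_PIXEL
# Assumption: whatever is left belongs to the decoder, readout and test structures.
peripheral_transistors = TOTAL_TRANSISTORS - array_transistors

array_side_mm = COLS * PIXEL_PITCH_UM / 1000.0      # side of the pixel array
pixel_area_um2 = PIXEL_PITCH_UM ** 2                # area of one pixel
photodiode_area_um2 = FILL_FACTOR * pixel_area_um2  # photosensitive part

print(pixels, array_transistors, peripheral_transistors)
print(array_side_mm, pixel_area_um2, photodiode_area_um2)
```

Under these numbers the pixel array alone accounts for most of the quoted transistor count, and the array side fits within the 3.5 mm die side.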

The left part of the sensor is dedicated to a row decoder for addressing the successive rows of pixels. Below the chip core are the readout circuits with the three asynchronous buses described in section 4. The chip also contains test structures used for detailed characterization of the photodiodes and processing units. The test structures can be seen on the bottom left of the chip.

![Figure 1. Layout of the retina in a standard 0.35 µm CMOS technology](image)

3. OPERATIONAL PRINCIPLE

3.1 Photodiode Structure

In a standard design, a pixel includes a photodiode (called PD in our chip) and a processing unit (implemented in the zone called free surface), as shown in Fig. 2(a). In the proposed approach, we focus on optimizing the mapping of the processing unit. Each pixel integrates image processing based on a neighborhood of pixels, so each processing unit must have easy access to its nearest neighbors. For this reason, an original structure was chosen, as shown in Fig. 2(b). The major advantage of this structure is that it limits the length of the metal interconnections between adjacent pixels and the processing units, which contributes to a better fill factor.

Table 1. Chip Characteristics

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>0.35 µm 4-metal CMOS</td>
</tr>
<tr>
<td>Array size</td>
<td>64 × 64</td>
</tr>
<tr>
<td>Chip size</td>
<td>11 mm²</td>
</tr>
<tr>
<td>Number of transistors</td>
<td>160 000</td>
</tr>
<tr>
<td>Number of transistors / pixel</td>
<td>38</td>
</tr>
<tr>
<td>Pixel size</td>
<td>35 µm × 35 µm</td>
</tr>
<tr>
<td>Sensor Fill Factor</td>
<td>25 %</td>
</tr>
<tr>
<td>Dynamic power consumption</td>
<td>250 mW</td>
</tr>
<tr>
<td>Supply voltage</td>
<td>3.3 V</td>
</tr>
<tr>
<td>Frame rate</td>
<td>10 000 fps</td>
</tr>
</tbody>
</table>

![Figure 2. (a) Classical photosite structure, (b) Approach considered in our system](image)

N-type photodiodes consist of an n⁺-type diffusion in a p-type silicon substrate. During the preloading period, the depletion region forms in the neighborhood of the photodiode cathode. However, given the layout constraints, a cross-shaped photodiode does not appear realistic. Finally, a quasi-octagonal structure, shown in Fig. 3, was selected because of three main properties:

1. The surface reserved for the interconnections is about 12% lower than with a square shape.
2. The depletion region is more efficient at the edges of the photodiode.
3. This shape, based on 45° structures, can be manufactured by the foundry.

![Figure 3. (a) Photodiode shape, (b) Photodiode layout](image)

3.2 Spatial Gradients

The structure of our processing unit is well suited to the computation of spatial gradients. The main idea for evaluating these gradients in situ is based on the definition of the first-order derivative of a 2-D function along the direction of a vector $\xi$, which can be expressed as:

$$\frac{\partial V(x, y)}{\partial \xi} = \cos(\beta) \frac{\partial V(x, y)}{\partial x'} + \sin(\beta) \frac{\partial V(x, y)}{\partial y'} \qquad (1)$$

where $\beta$ is the angle of the vector. A discretization of equation 1 at the pixel level, according to Fig. 5, is given by:

$$\frac{\partial V_i}{\partial \xi} = (V_1 - V_3) \cos(\beta) + (V_2 - V_4) \sin(\beta) \qquad (2)$$

where $V_i$, $i \in \{1, \dots, 4\}$, is the luminance at pixel $i$, i.e., the photodiode output. In this way, the local derivative in the direction of vector $\xi$ is continuously computed as a linear combination of two basis functions, the derivatives in the $x'$ and $y'$ directions.
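To make equation 2 concrete, the Python sketch below evaluates the directional derivative from four neighbouring pixel values and derives the gradient magnitude and direction from the two basis derivatives. The function names and the sample voltages are hypothetical, and the code is a numerical illustration rather than a model of the analog circuit.

```python
import math

def directional_derivative(v1, v2, v3, v4, beta):
    """Equation 2: derivative along the direction at angle beta (radians)."""
    return (v1 - v3) * math.cos(beta) + (v2 - v4) * math.sin(beta)

def gradient_magnitude_and_angle(v1, v2, v3, v4):
    """Magnitude and direction of the gradient from the two basis derivatives."""
    dx = v1 - v3  # derivative along x'
    dy = v2 - v4  # derivative along y'
    return math.hypot(dx, dy), math.atan2(dy, dx)

# Hypothetical photodiode voltages (in volts) for the four neighbours.
v1, v2, v3, v4 = 1.8, 1.5, 1.2, 1.1
magnitude, angle = gradient_magnitude_and_angle(v1, v2, v3, v4)

# The directional derivative is largest when beta points along the gradient,
# where it equals the gradient magnitude.
assert math.isclose(directional_derivative(v1, v2, v3, v4, angle), magnitude)
```

Setting `beta` to 0 or to π/2 recovers the two basis derivatives `v1 - v3` and `v2 - v4` respectively.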

Using a four-quadrant multiplier, shown in Fig. 4, the products of the derivatives with the sine and cosine functions can easily be computed. Expanding equation 2 gives:

$$P = V_1 \cos(\beta) + V_2 \sin(\beta) - V_3 \cos(\beta) - V_4 \sin(\beta) \qquad (3)$$

According to Fig. 5, the processing element implemented at the pixel level computes a linear combination of the four adjacent pixels using the four associated weights ($\text{coef}_i$, $i \in \{1, \dots, 4\}$). To evaluate equation 3, with $\text{coef}_i$ weighting $V_i$, the coefficients must be set to the following values:

$$\begin{bmatrix} \text{coef}_1 & \text{coef}_2 \\ \text{coef}_3 & \text{coef}_4 \end{bmatrix} = \begin{bmatrix} \cos(\beta) & \sin(\beta) \\ -\cos(\beta) & -\sin(\beta) \end{bmatrix} \qquad (4)$$

3.3 Sobel operator

The structure of our architecture is also suited to various algorithms based on convolutions using binary masks on a neighborhood of pixels. As an example, the Sobel operator is described here. With our chip, the evaluation of the Sobel algorithm yields a result directly centered on the photosensor and directed along the natural axes of the image, according to Fig. 5. In order to compute the mathematical operation, a $3 \times 3$ neighborhood is applied on the whole image. To carry out the derivative operation discretized in two dimensions, along the horizontal and vertical axes, it is necessary to build two $3 \times 3$ matrices called $h_1$ and $h_2$:

$$h_1 = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \quad h_2 = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \qquad (5)$$

Within the four processing elements numbered from 1 to 4 (see Fig. 5), two $2 \times 2$ masks are locally applied. According to equation 5, this allows the evaluation of the following series of operations:

$$\begin{aligned} h_1: \quad & I_{11} = I_3 + 2I_6 + I_9, \quad I_{13} = I_1 + 2I_4 + I_7 \\ h_2: \quad & I_{22} = I_1 + 2I_2 + I_3, \quad I_{24} = I_7 + 2I_8 + I_9 \end{aligned} \qquad (6)$$

with $I_{1k}$ and $I_{2k}$ provided by processing element $k$. From these simple operations, the discrete amplitudes of the derivatives along the vertical axis ($I_{h1}=I_{11}-I_{13}$) and the horizontal axis ($I_{h2}=-I_{22}+I_{24}$) can then be computed. All these operations can be carried out in two retina cycles. We call a retina cycle the set of processing operations carried out at the end of a frame acquisition.
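The decomposition of equation 6 can be checked numerically. The Python sketch below computes the four partial sums and compares the resulting differences with a direct application of the masks of equation 5. It assumes that the pixels $I_1$ to $I_9$ are numbered row by row in the $3 \times 3$ neighborhood, which is an assumption since Fig. 5 is not reproduced here; the function names and the sample values are illustrative.

```python
H1 = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
H2 = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_partial_sums(n):
    """Partial sums of equation 6; n is a 3x3 neighbourhood.

    Assumption: I1..I9 are numbered row by row.
    """
    (i1, i2, i3), (i4, _i5, i6), (i7, i8, i9) = n
    i11 = i3 + 2 * i6 + i9  # right column
    i13 = i1 + 2 * i4 + i7  # left column
    i22 = i1 + 2 * i2 + i3  # top row
    i24 = i7 + 2 * i8 + i9  # bottom row
    return i11, i13, i22, i24

def sobel_from_partial_sums(n):
    """Return (I_h1, I_h2) as differences of the partial sums."""
    i11, i13, i22, i24 = sobel_partial_sums(n)
    return i11 - i13, -i22 + i24

def apply_mask(mask, n):
    """Reference result: element-wise product of mask and neighbourhood, summed."""
    return sum(mask[r][c] * n[r][c] for r in range(3) for c in range(3))

# Hypothetical neighbourhood: intensity increasing from left to right.
neighbourhood = [[10, 20, 30], [10, 20, 30], [10, 20, 30]]
i_h1, i_h2 = sobel_from_partial_sums(neighbourhood)
assert i_h1 == apply_mask(H1, neighbourhood)
assert i_h2 == apply_mask(H2, neighbourhood)
```

For this left-to-right ramp only the $h_1$ response is non-zero, as expected for a mask that differences the right and left columns.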

3.4 Second-order detector: Laplacian

Edge detection based on second-order derivatives such as the Laplacian can also be implemented on our architecture. Unlike the spatial gradients previously described, the Laplacian is a scalar (see equation 7) and does not provide any indication about the edge direction.

$$\delta = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \qquad (7)$$

From this $3 \times 3$ mask, the following operation can be extracted, following the principles used previously for the evaluation of the Sobel operator: $\delta = I_1 + I_2 + I_3 + I_4 - 4I_5$. This operation can be carried out in only one retina cycle.
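As an illustration, the Python sketch below applies this operation to every interior pixel of a small image, taking $I_1$ to $I_4$ as the four direct neighbours and $I_5$ as the centre pixel. The function names and the test image are hypothetical, and leaving the border pixels at zero is a simplifying assumption.

```python
LAPLACIAN_MASK = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

def laplacian(i1, i2, i3, i4, i5):
    """Sum of the four direct neighbours I1..I4 minus four times the centre I5."""
    return i1 + i2 + i3 + i4 - 4 * i5

def laplacian_image(image):
    """Apply the operation to every interior pixel of a list-of-lists image.

    Assumption: border pixels, which lack a full neighbourhood, stay at zero.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = laplacian(image[r - 1][c], image[r][c - 1],
                                  image[r][c + 1], image[r + 1][c],
                                  image[r][c])
    return out

# Hypothetical 5x5 test image: a single bright pixel on a dark background.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 1
result = laplacian_image(image)

# The response to a single bright pixel reproduces the mask of equation 7.
assert [row[1:4] for row in result[1:4]] == LAPLACIAN_MASK
```

A uniform image gives a zero response everywhere, which reflects the fact that the mask coefficients sum to zero.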

4. ARCHITECTURE AND OPERATION

The proposed chip is based on a classical architecture widely used in the literature, as shown in Fig. 6. The CMOS image sensor consists of an array of pixels that are typically selected one row at a time by a row decoder. The pixels are read out to vertical column buses that connect the selected row of pixels to an output multiplexer. The chip includes three asynchronous output buses: the first is dedicated to the image processing results, whereas the other two provide parallel outputs for full high-rate acquisition of raw images.
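To give an order of magnitude for the readout, the Python sketch below derives the frame period, the time available per row and the raw pixel rate at the target frame rate. It assumes that readout occupies the whole frame period and that the two raw-image buses share the pixel stream evenly; both are assumptions made for illustration, and the variable names are not taken from the design.

```python
FRAME_RATE_FPS = 10_000  # raw acquisition rate targeted in the text
ROWS, COLS = 64, 64
RAW_BUSES = 2            # buses dedicated to raw pixel values

frame_period_us = 1e6 / FRAME_RATE_FPS
# Assumption: rows are read one after another over the whole frame period.
row_time_us = frame_period_us / ROWS
pixel_rate = FRAME_RATE_FPS * ROWS * COLS
# Assumption: the two raw buses share the pixel stream evenly.
pixel_rate_per_bus = pixel_rate / RAW_BUSES

print(frame_period_us, row_time_us)
print(pixel_rate, pixel_rate_per_bus)
```

Under these assumptions each row has roughly 1.6 µs of readout time and each raw bus carries about 20 million pixel values per second.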

As in a classical APS, all reset transistor gates are connected in parallel, so that the whole array is reset when the reset line is active. In order to supervise the integration period (i.e., the time during which incident light on each pixel generates a photocurrent that is integrated and stored as a voltage in the photodiode), the global output called Out_int provides the average incident illumination of the whole matrix of pixels. If the average level of the image is too low, the exposure time may be increased. Conversely, if the scene is too bright, the integration period may be reduced.
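The exposure rule described above can be sketched as a simple feedback step. The Python function below is a hypothetical controller, not the authors' implementation: the normalisation of Out_int to the range 0 to 1, the target level, the tolerance, the step ratio and the time limits are all illustrative assumptions.

```python
def adjust_integration_time(t_int_us, out_int, target=0.5, tolerance=0.1,
                            step=1.25, t_min_us=10.0, t_max_us=1000.0):
    """One step of a hypothetical exposure loop driven by Out_int.

    out_int is the average illumination, assumed normalised to [0, 1].
    All thresholds, the step ratio and the limits are illustrative.
    """
    if out_int < target - tolerance:    # image too dark: integrate longer
        t_int_us *= step
    elif out_int > target + tolerance:  # image too bright: integrate less
        t_int_us /= step
    # Keep the integration time within the assumed limits.
    return min(max(t_int_us, t_min_us), t_max_us)

# Hypothetical usage: a dark frame lengthens the exposure for the next frame.
next_t_int_us = adjust_integration_time(100.0, out_int=0.2)
```

A multiplicative step is used here only because it behaves uniformly over a wide range of illumination levels; any other update rule with the same sign convention would follow the text equally well.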

To increase the algorithmic possibilities of the architecture, the acquisition of the light inside the photodiode and the readout of the stored value at pixel level are decoupled and can be executed independently. Two specific circuits, each including an analog memory, an amplifier and a multiplexer, are therefore implemented at pixel level, as shown in Fig. 7.

With these circuits, called [AM]$^2$ (Analog Memory, Amplifier and Multiplexer), the capture sequence can take place in the first memory in parallel with a readout and/or processing sequence of the previous image stored in the second memory (see Fig. 8). With this strategy, the frame rate can be increased without reducing the exposure time. Indeed, the acquisition speed can be doubled. Simulations of the chip show that frame rates up to 10 000 frames per second can be achieved with an illuminance above 15 000 lux.
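The benefit of the two memories can be illustrated with a minimal timing model. In the Python sketch below, the frame period is the sum of the exposure and readout times when both steps run in sequence, and the longer of the two when they overlap. The function names and the 100 µs timings are hypothetical and chosen only so that the overlapped case matches the 10 000 frames per second figure.

```python
def frame_period_us(t_int_us, t_read_us, double_buffered):
    """Frame period for sequential versus overlapped capture and readout."""
    if double_buffered:
        # Capture into one memory while the other is read out or processed.
        return max(t_int_us, t_read_us)
    return t_int_us + t_read_us

def frames_per_second(period_us):
    """Convert a frame period in microseconds to a frame rate."""
    return 1e6 / period_us

# Hypothetical timings: exposure and readout both take 100 us.
sequential = frames_per_second(frame_period_us(100.0, 100.0, False))
overlapped = frames_per_second(frame_period_us(100.0, 100.0, True))
print(sequential, overlapped)
```

In this simple model the factor of two is reached only when exposure and readout take the same time; otherwise the gain is smaller and the longer of the two steps sets the frame rate.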

The chip operates from a single 3.3 V power supply. In each pixel, as seen in Fig. 9, the photosensor is an NMOS photodiode associated with a PMOS reset transistor, which represents the first stage of the capture circuit. The pixel array is held in a reset state until the 'reset' signal goes high. Then, the photodiode discharges according to the incident luminous flux. This signal is biased around $V_{DD}/2$, i.e., half the power supply voltage. While the 'read' signal remains high, the analog switch is open and the capacitor $C_{AM}$ of the analog memory stores the pixel value. During the frame capture, the $C_{AM}$ capacitors are able to store the pixel values coming either from switch 1 or from switch 2. The following inverter, biased at $V_{DD}/2$, serves as an amplifier of the stored value and provides a voltage level proportional to the incident pixel illumination.

Figure 10. Architecture of the $A^2U$

5. ANALOG ARITHMETIC UNIT: $A^2U$

The analog arithmetic unit ($A^2U$) is the central part of the pixel and includes four multipliers (M1, M2, M3 and M4), as illustrated in Figs. 10 and 11. The four multipliers are all interconnected with a diode-connected load (i.e., an NMOS transistor with its gate connected to its drain). The operation result at the 'node' point is a linear combination of the four adjacent pixels.

Figs. 12 and 13 show the simulation results of this multiplier structure with cosine signals as inputs, i.e.,

$$\text{coef}_1 = A \cos(2\pi f_1 t) \quad \text{with} \quad f_1 = 2.5\,\text{kHz}$$
$$V_i = B \cos(2\pi f_2 t) \quad \text{with} \quad f_2 = 20\,\text{kHz}$$

In this case, the output $Node$ value can be written as follows:

$$Node = \frac{AB}{2} \left[\cos(2\pi(f_2 - f_1)t) + \cos(2\pi(f_2 + f_1)t)\right]$$

The signal spectrum, shown in Fig. 13, contains two frequencies (17.5 kHz and 22.5 kHz) around the carrier frequency. The residues that appear in the spectrum are known as intermodulation products. They are mainly due to the nonlinearity of the structure (around 10 kHz and 30 kHz) and to defects in the insulation of the input pads (at 40 kHz). However, the amplitudes of these intermodulation products are significantly lower than those of the two main frequencies.
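The two main spectral lines can be reproduced with an ideal multiplier. The Python sketch below multiplies two sampled cosines at 2.5 kHz and 20 kHz and measures the amplitude at 17.5 kHz, 20 kHz and 22.5 kHz with a single-bin discrete Fourier transform. The unit amplitudes, the sampling rate and the record length are assumptions chosen so that each frequency falls exactly on a 250 Hz bin; the intermodulation residues of the real circuit are not modelled.

```python
import math

F1, F2 = 2_500.0, 20_000.0  # frequencies quoted in the text (Hz)
A, B = 1.0, 1.0             # hypothetical amplitudes
FS, N = 100_000.0, 400      # assumed sampling rate and record length

# Ideal multiplier output: sample-by-sample product of the two cosines.
node = [A * math.cos(2 * math.pi * F1 * n / FS) *
        B * math.cos(2 * math.pi * F2 * n / FS) for n in range(N)]

def amplitude_at(signal, freq_hz):
    """Amplitude of the component at freq_hz, from a single-bin DFT."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / FS)
             for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / FS)
             for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

lower = amplitude_at(node, F2 - F1)  # 17.5 kHz line
upper = amplitude_at(node, F2 + F1)  # 22.5 kHz line
carrier = amplitude_at(node, F2)     # 20 kHz: no component in the ideal case
print(lower, upper, carrier)
```

In this ideal case both lines have amplitude $AB/2$ and nothing remains at 20 kHz, which matches the product-to-sum expression.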

Furthermore, in order to obtain the best linearity of the multiplier, the amplitude of the signal $V_i$ is limited to the range 0.6 V to 2.6 V. In the full chip, the signal $V_i$ corresponds to the voltage coming from the pixel and can easily be kept within this range.

6. LAYOUT AND PRELIMINARY RESULTS

The layout of a 2×2 pixel block is shown in Fig. 14. This layout is built symmetrically in order to reduce fixed pattern noise among the four pixels and to ensure uniform spatial sampling. An experimental 64×64 pixel image sensor has been developed in a 0.35 µm, 3.3 V, standard CMOS technology with poly-poly capacitors. This prototype was sent to the foundry at the beginning of 2006 and will be available at the end of the second quarter of the year.

Fig. 15 shows a simulation of the capture operation. Various levels of illumination are simulated by activating different readout signals (read 1 and read 2). The two outputs (output 1 and output 2) give levels between GND and $V_{DD}$, corresponding to the incident illumination on the pixels. The calibration of the structure is ensured by the biasing ($V_{bias} = 1.35$ V). Moreover, in this simulation, the 'node' output is the result of the difference operation (out1-out2). The coefficients were fixed at the following values: $\text{coef}_1 = \text{coef}_2 = V_{DD}$ and $\text{coef}_3 = \text{coef}_4 = V_{DD}/2$. The MOS transistors operate in the sub-threshold region. No energy is spent transferring information from one level of processing to another. According to the simulation results, the voltage gain of the amplifier stage of the two $[AM]^2$ is $A_v = 10$ and the disparities on the output levels are about 4.3%.

7. CONCLUSION

An experimental pixel sensor implemented in a standard digital 0.35 $\mu$m CMOS process was described. Each $35\,\mu\text{m} \times 35\,\mu\text{m}$ pixel contains 38 transistors implementing a circuit with photocurrent integration, two $[AM]^2$ (Analog Memory, Amplifier and Multiplexer) and an $A^2U$ (Analog Arithmetic Unit).

Chip simulations show that raw image acquisition at 10 000 frames per second can easily be achieved using the parallel $A^2U$ implemented at pixel level. With basic image processing, the maximal frame rate drops to about 5000 fps.

The next step in our research will be the characterization of the real circuit as soon as the chip comes back from the foundry. Furthermore, we are working on the development of a fast analog-to-digital converter (ADC). The integration of this ADC on future chips will allow us to provide new and sophisticated vision systems on chip dedicated to digital embedded image processing at thousands of frames per second.

REFERENCES


