SESSION XIII: DIGITAL SIGNAL PROCESSORS

Chairman: H.M. Ahmed, **Boston University**, Boston, MA

## THAM 13.1: A 60ns CMOS DSP with On-Chip Instruction Cache

Craig J. Caren, Bruce P. Benjamin, James R. Boddie, Michael L. Fuccio, Renato N. Gadenz, W. Patrick Hays, Leonard McMillan
AT&T Bell Laboratories, Holmdel, NJ

James L. Henry, Lawrence E. Bays, Arupratan Gupta, James J. Klinikowski, Goh Komoriya, Lawrence A. Rigge, David A. Willenbecher, Kaichuen Wong
AT&T Bell Laboratories, Allentown, PA

THIS PAPER will describe a programmable 16b fixed-point Digital Signal Processor (DSP) with an instruction cycle time of 60ns. The $51\,\mathrm{mm}^2$ chip was processed using an advanced $1.0\,\mu\mathrm{m}$, twin-tub, double-level-metal CMOS technology; Figure 1. The processor has an enhanced architecture that incorporates an on-chip instruction cache memory for vector operations. A full-custom design methodology was employed to design and verify the circuits and layout.

Figure 2 is a block diagram of the DSP. The Data Arithmetic Unit (DAU) is instruction-configurable as a 2-stage multiply/accumulate unit or a 1-stage ALU. The multiply/add signal processing data path consists of a 16 x 16 full-precision two's complement parallel multiplier and a 36b adder feeding two 36b accumulators. A product bit-alignment shifter and saturation protection are provided to simplify programming. Microprocessor-style operations use the ALU data path, which bypasses the multiplier and passes 16 or 32b data to a 15-function ALU and an 8-function shifter, with 4 shifts to both right and left. A set of conditional accumulator functions performs nonlinear signal operations and assists in the performance of many algorithms.

On-chip memory includes 2048 x 16 words of ROM for instructions and fixed coefficients, and 512 x 16 words of RAM for variable data.
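
The multiply/accumulate data path described above (16 x 16 two's complement multiply, 36b accumulation, saturation protection) can be illustrated with a small behavioral sketch. This is not the authors' model: the function names are hypothetical, and the saturation rule shown (clamping the 36b accumulator to a 32b result) is an assumption, since the paper does not specify when or how saturation is applied.

```python
def to_signed(value, bits):
    """Wrap an integer into the two's complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, x, y):
    """One multiply/accumulate step: 16b x 16b product added into a 36b accumulator."""
    product = to_signed(x, 16) * to_signed(y, 16)  # full-precision product fits in 32 bits
    return to_signed(acc + product, 36)            # the 36b adder wraps at 36 bits


def saturate32(acc):
    """Clamp an accumulator value to the 32b signed range (assumed saturation rule)."""
    hi, lo = (1 << 31) - 1, -(1 << 31)
    return max(lo, min(hi, acc))


def dot(xs, ys):
    """Accumulate a sum of products, as in a nonrecursive filter tap loop."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


if __name__ == "__main__":
    # Four worst-case products exceed 32 bits but still fit in the 36b accumulator.
    total = dot([-32768] * 4, [-32768] * 4)
    print(total, saturate32(total))
```

The four extra accumulator bits beyond the 32b product act as headroom, so intermediate sums can exceed the 32b range without wrapping before any final clamp.
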
The internal ROM can be replaced with a larger external memory of 64K words for full-speed prototyping, or for applications that require either frequent program modification or more memory than is provided on-chip.

Two addressing units support high-speed, register-indirect memory addressing with post-modification. Four address registers can be used for either read or write addresses to the RAM without restrictions. Modulo addressing of arbitrary length and memory organization is supported. One address register is dedicated to the ROM for table look-up.

The DSP includes two I/O ports. A Serial I/O port (SIO) provides double buffering and interfaces with codecs, time-division-multiplexed data, and other DSPs in a zero-chip multiprocessor environment. A Parallel I/O port (PIO) is capable of operating as a bus master or slave for 8 and 16b interfaces. Communication with DSPs, microprocessors, or other I/O peripherals is supported. Both the SIO and the PIO include a user-maskable interrupt capability for automatic I/O handling. The processor is packaged in an 84-pin plastic or ceramic chip carrier.

Many signal processing algorithms consist of repetitive multiply/accumulate instruction sequences. In principle, three busses (one instruction and two data) and three memories are required to execute these operations in a single cycle. Because of the repetitive nature of the algorithms, a simpler dual-bus architecture was chosen, which stores the repeated instructions in a small 15-word cache memory and replays them from there; Figure 3. This memory structure allows parallel access of an instruction (from CACHE) and two data operands, one fixed (from ROM) and one variable (from RAM), in each machine cycle for high-speed processing. Automatic replaying provides hardware looping, which reduces the need for *in-line* code and, therefore, conserves ROM storage.

Fully-static CMOS circuits were used throughout the chip.
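
The effect of the instruction cache on ROM traffic and ROM storage can be estimated with a short sketch. It assumes that the first pass of a loop body is fetched from ROM while being captured in the cache, that the remaining passes replay from the cache, and that each instruction occupies one word; the paper does not state these details, and the function names are hypothetical.

```python
CACHE_WORDS = 15  # cache depth stated in the paper


def replay_cost(body_len, k):
    """Return (ROM instruction fetches, cache replays) for a loop body run k times."""
    if not 1 <= body_len <= CACHE_WORDS:
        raise ValueError("loop body must fit in the 15-word cache")
    if k < 1:
        raise ValueError("repeat count must be at least 1")
    rom_fetches = body_len              # assumed: first pass comes from ROM and fills the cache
    cache_replays = body_len * (k - 1)  # assumed: remaining passes replay from the cache
    return rom_fetches, cache_replays


def rom_words_saved(body_len, k):
    """ROM words saved versus fully in-line code (loop-control overhead ignored)."""
    return body_len * k - body_len


if __name__ == "__main__":
    # Hypothetical example: a one-instruction tap loop repeated 64 times.
    print(replay_cost(1, 64), rom_words_saved(1, 64))
```

During the replayed passes the ROM bus carries no instruction fetches under these assumptions, which is what leaves it free to supply the fixed coefficient operand in the same cycle.
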
The basic clocking approach uses two-phased overlapping clocks and level-sensitive latches as registers. A 33MHz clock is divided by two on chip to achieve a 50% duty cycle, and clock distribution is done in metal paths to minimize clock skew. The circuit design and layout are full custom, and two levels of metal have been used extensively in all of the modules, including the memories. Also developed were two static PLAs whose performance and structure were optimized for their particular architectural requirements. The process has been described previously<sup>1</sup>.

To minimize the design cycle of the DSP, advanced CAD tools were systematically applied for design and verification; Figure 4. The architecture was modeled via a multi-level simulator, and circuit design was initiated from C language descriptions and high-level schematics. Before layout, schematic extractions of the circuit design were simulated using a switch-level simulator and verified by comparison to the high-level models. A hierarchical verification procedure culminated in simulation of the complete, extracted circuit schematic with over 20,000 assembly language instructions. From the completed layout, connectivity, transistor sizes and parasitics were extracted. The connectivity was compared against the schematic connectivity, and this extracted database was again completely exercised by the test programs. The layout underwent extensive design rule checking during assembly and after layout completion. Finally, all subsystems, including the 20,000-transistor DAU, were simulated with a limited set of test vectors using a transistor timing simulator.

<sup>1</sup> Tran, A., et al., "Device Characteristics of a 1.0 $\mu$m CMOS Technology for Logic and Custom VLSI Applications", IEEE Custom Integrated Circuits Conference; 1986.

The DSP has been designed for applications such as speech coding, high-speed modems, secure voice communications, cellular phones, and high-speed control.
The device can compute a nonrecursive filter at the rate of 60ns per tap, a double-precision adaptive FIR filter at the rate of 420ns per tap, and a five-multiply second-order section in 420ns. Figure 5 is a summary of the DSP.

## Acknowledgments

We gratefully acknowledge the contributions of J.E. Beck, D.J. Emrich, A.E. Bryfogle, D.M. Guerrerro, G.E. Hall, L.V. Tran, R.E. Vannicolo, W.O. Yoder, and B. Tang to the design, testing and fabrication of this circuit. We also thank R.A. Pederson for his guidance and support throughout the project.

FIGURE 2-DSP block diagram.

[See pages 376-377 for Figures 1, 3, 4, 5.]

FIGURE 1-DSP chip photomicrograph.

do K {instr1, instr2, ... instrN}
redo K

FIGURE 3-Cache architecture and programming syntax.

FIGURE 4-DSP design methodology.

FIGURE 5-DSP summary.

| PARAMETER | VALUE |
|---|---|
| TECHNOLOGY | 1.0 µm 2 LEVEL AR CMOS |
| TRANSISTORS | 140,000 |
| CHIP SIZE | 6.0 mm x 8.5 mm = 51 sq mm |
| EXTERNAL CLOCK RATE | 33.33 MHz |
| INTERNAL MACHINE CYCLE | 60 ns |
| INSTRUCTION/COEFFICIENT ROM | 2048 x 16 |
| DATA RAM | 512 x 16 |
| COMPUTATION PRECISION | (16 x 16 -> 32) + 36 |
| POWER SUPPLY | 5 V (NOMINAL) |
| POWER DISSIPATION | 500 mW AT 60 ns (WORST CASE) |
| PACKAGE | 84 PIN PLASTIC LEADED CC; 84 PIN CERAMIC LEADLESS CC |
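
The timing figures quoted in the text and in Figure 5 can be cross-checked with a little arithmetic. The sketch below assumes only the stated numbers (33.33 MHz external clock divided by two, 60ns per nonrecursive filter tap, 420ns for the other two benchmarks); the function names and the 64-tap example are hypothetical, and per-sample overhead such as I/O and loop setup is ignored.

```python
EXTERNAL_CLOCK_MHZ = 33.33  # external clock rate from Figure 5
CYCLE_NS = 60               # internal machine cycle from Figure 5


def machine_cycle_ns(external_clock_mhz=EXTERNAL_CLOCK_MHZ):
    """Machine cycle after the on-chip divide-by-two, in nanoseconds."""
    return 2 * 1000.0 / external_clock_mhz


def cycles(duration_ns, cycle_ns=CYCLE_NS):
    """Convert a quoted duration to a whole number of machine cycles."""
    count, remainder = divmod(duration_ns, cycle_ns)
    if remainder:
        raise ValueError("duration is not a whole number of cycles")
    return count


def fir_output_rate_hz(taps, ns_per_tap=CYCLE_NS):
    """Upper bound on output sample rate for a nonrecursive filter (overhead ignored)."""
    return 1e9 / (taps * ns_per_tap)


if __name__ == "__main__":
    print(round(machine_cycle_ns(), 3))   # close to the quoted 60 ns cycle
    print(cycles(420))                    # the 420 ns benchmarks in machine cycles
    print(round(fir_output_rate_hz(64)))  # hypothetical 64-tap filter, samples per second
```

Under these figures the 420ns benchmarks each correspond to seven machine cycles, and a single-cycle tap is consistent with one cached instruction plus one ROM and one RAM operand fetched per cycle.
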