Title

NULL convention multiply and accumulate unit with conditional rounding, scaling, and saturation

Authors

S. C. Smith; R. F. DeMara; J. S. Yuan; M. Hagedorn; D. Ferguson

Abbreviated Journal Title

J. Syst. Architect.

Keywords

asynchronous circuit design; multiply and accumulate unit; array; multiplication; modified Baugh-Wooley algorithm; Booth's algorithm; gate-level pipelining; NULL convention logic; Computer Science, Hardware & Architecture; Computer Science, Software; Engineering

Abstract

Approaches for maximizing throughput of self-timed multiply-accumulate units (MACs) are developed and assessed using the NULL convention logic paradigm. In this class of self-timed circuits, the functional correctness is independent of any delays in circuit elements, through circuit construction, and independent of any wire delays, through the isochronic fork assumption [1,2], where wire delays are assumed to be much less than gate delays. Therefore, self-timed circuits provide distinct advantages for System-on-a-Chip applications. First, a number of alternative MAC algorithms are compared and contrasted in terms of throughput and area to determine which approach will yield the maximum throughput with the least area. It was determined that two algorithms that meet these criteria well are the Modified Baugh-Wooley and Modified Booth2 algorithms. Dual-rail non-pipelined versions of these algorithms were first designed using the threshold combinational reduction method [3]. The non-pipelined designs were then optimized for throughput using the gate-level pipelining method [4]. Finally, each design was simulated using Synopsys to quantify the advantage of the dual-rail pipelined Modified Baugh-Wooley MAC, which yielded a speedup of 2.5 over its initial non-pipelined version. This design also required 20% fewer gates than the dual-rail pipelined Modified Booth2 MAC that had the same throughput. The resulting design employs a three-stage feed-forward multiply pipeline connected to a four-stage feedback multifunctional loop to perform a 72 + 32 × 32 MAC in 12.7 ns on average using a 0.25 µm CMOS process at 3.3 V, thus outperforming other delay-insensitive/self-timed MACs in the literature. (C) 2002 Elsevier Science B.V. All rights reserved.
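
As a rough illustration of the arithmetic the abstract refers to (a 72-bit accumulator plus the product of two 32-bit operands, with saturation as named in the title), the following behavioral sketch may help. It is not the authors' circuit: it assumes two's complement signed operands and clamping to the signed 72-bit range, and the names `mac`, `saturate`, `ACC_BITS`, and `OP_BITS` are hypothetical. The conditional rounding and scaling steps are omitted because this record does not describe them.

```python
# Behavioral sketch only; assumes signed two's complement operands
# and saturation to the signed 72-bit range (assumptions, not from the paper).
ACC_BITS = 72  # accumulator width stated in the abstract
OP_BITS = 32   # operand width stated in the abstract


def signed_range(bits):
    """Return (min, max) of a signed two's complement value of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def saturate(value, bits):
    """Clamp value to the signed range of the given width."""
    lo, hi = signed_range(bits)
    return max(lo, min(hi, value))


def mac(acc, x, y):
    """Compute acc + x * y, saturated to ACC_BITS."""
    op_lo, op_hi = signed_range(OP_BITS)
    for operand in (x, y):
        if not op_lo <= operand <= op_hi:
            raise ValueError("operand does not fit in 32 signed bits")
    acc_lo, acc_hi = signed_range(ACC_BITS)
    if not acc_lo <= acc <= acc_hi:
        raise ValueError("accumulator does not fit in 72 signed bits")
    return saturate(acc + x * y, ACC_BITS)
```

For example, `mac(10, 3, -4)` accumulates the product -12 onto 10, while an accumulator already at the largest signed 72-bit value stays there when a positive product is added.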

Journal Title

Journal of Systems Architecture

Volume

47

Issue/Number

12

Publication Date

1-1-2002

Document Type

Article

Language

English

First Page

977

Last Page

998

WOS Identifier

WOS:000177024100002

ISSN

1383-7621
