
BIT-SERIAL MULTIPLIER USING VERILOG HDL

    A

    Mini Project Report

    Submitted in the Partial Fulfillment of the

    Requirements

    for the Award of the Degree of

    BACHELOR OF TECHNOLOGY

    IN

    ELECTRONICS AND COMMUNICATION ENGINEERING

    Submitted

    By

    K.BHARGAV 11885A0401

    P.DEVSINGH 11885A0404

    Under the Guidance of

    Mr. S. RAJENDAR

    Associate Professor

    Department of ECE

    Department of Electronics and Communication Engineering

    VARDHAMAN COLLEGE OF ENGINEERING (AUTONOMOUS)

    (Approved by AICTE, Affiliated to JNTUH & Accredited by NBA)

2013-14

VARDHAMAN COLLEGE OF ENGINEERING (AUTONOMOUS)

Estd. 1999, Shamshabad, Hyderabad 501218

    Kacharam (V), Shamshabad (M), Ranga Reddy (Dist.) 501 218, Hyderabad, A.P. Ph: 08413-253335, 253201, Fax: 08413-253482, www.vardhaman.org

    Department of Electronics and Communication Engineering

    CERTIFICATE

This is to certify that the mini project work entitled Bit-Serial Multiplier Using

    Verilog HDL carried out by Mr. K.Bhargav, Roll Number 11885A0401, Mr. P.Devsingh, Roll

Number 11885A0404, submitted to the Department of Electronics and Communication Engineering, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering, during the year 2013-14.

    Name & Signature of the Supervisor

    Mr. S. Rajendar

    Associate Professor

    Name & Signature of the HOD

    Dr. J. V. R. Ravindra

    Head, ECE


    ACKNOWLEDGEMENTS

The satisfaction that accompanies the successful completion of this task would be incomplete without mentioning the people who made it possible, whose constant guidance and encouragement crown all efforts with success.

I express my heartfelt thanks to Mr. S. Rajendar, Associate Professor, mini project supervisor, for his suggestions in selecting and carrying out the in-depth study of the topic. His valuable guidance, encouragement and critical reviews really helped to shape this report to perfection.

I wish to express my deep sense of gratitude to Dr. J. V. R. Ravindra, Head of the Department, for his able guidance and useful suggestions, which helped me in completing the mini project on time.

    I also owe my special thanks to our Director Prof. L. V. N. Prasad for his intense

    support, encouragement and for having provided all the facilities and support.

Finally, thanks to all my family members and friends for their continuous support

    and enthusiastic help.

    K.Bhargav 11885A0401

    P.Devsingh 11885A0404


    ABSTRACT

Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire length, and lower floor space requirement in VLSI. In fact, the compactness of the design may allow us to run a bit-serial multiplier at a clock rate high enough to make the unit almost competitive with much more complex designs with regard to speed. In addition, in certain application contexts inputs are supplied bit-serially anyway. In such a case, using a parallel multiplier would be quite wasteful, since the parallelism may not lead to any speed benefit. Furthermore, in applications that call for a large number of independent multiplications, multiple bit-serial multipliers may be more cost-effective than a complex highly pipelined unit.

Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of processing elements that are interconnected by only short, local wires, thus allowing very high clock rates. Let us begin by introducing a semisystolic multiplier, so named because its design involves broadcasting a single bit of the multiplier x to a number of circuit elements, thus violating the short, local wires requirement of pure systolic design.


    CONTENTS

Acknowledgements

Abstract

List of Figures

1 INTRODUCTION

1.1 The Context of Computer Arithmetic

1.2 What is computer arithmetic

1.3 Multiplication

1.4 Organization of report

2 VLSI

2.1 Introduction

2.2 What is VLSI?

2.2.1 History of Scale Integration

2.3 Advantages of ICs over discrete components

2.4 VLSI And Systems

2.5 Applications of VLSI

2.6 Conclusion

3 VERILOG HDL

3.1 Introduction

3.2 Major Capabilities

3.3 Synthesis

3.4 Conclusion

4 BIT-SERIAL MULTIPLIER

4.1 Multiplier

4.2 Background

4.2.1 Binary Multiplication

4.2.2 Hardware Multipliers

4.2.3 Array Multipliers

4.3 Variations in Multipliers

4.4 Bit-serial Multipliers

5 IMPLEMENTATION

5.1 Tools Used

5.2 Coding Steps

5.3 Simulation steps

5.4 Full adder code

5.5 Full adder flowchart

5.6 Full adder testbench

5.7 Bit-serial multiplier algorithm

5.8 Bit-Serial multiplier code

5.9 Full adder waveform

5.10 Bit-serial multiplier testbench

5.11 Bit-serial multiplier waveforms

6 CONCLUSIONS

REFERENCES


    LIST OF FIGURES

3.1 Mixed level modeling

3.2 Synthesis process

3.3 Typical design process

4.1 Basic Multiplication Data flow

4.2 Two Rows of an Array Multiplier

4.3 Data Flow through a Pipelined Array Multiplier

4.4 Bit-serial multiplier; 4x4 multiplication in 8 clock cycles

4.5 Bit Serial multiplier design in dot notation

5.1 Project directory structure

5.2 Simulation window

5.3 Waveform window

5.4 Full adder flowchart

5.5 Bit-Serial multiplier flowchart

5.6 Full adder output waveforms

5.7 Bit serial multiplier input/output waveforms

5.8 Bit serial multiplier with intermediate waveforms


    CHAPTER 1

    INTRODUCTION

    1.1 The Context of Computer Arithmetic

    Advances in computer architecture over the past two decades have allowed the

    performance of digital computer hardware to continue its exponential growth, despite

    increasing technological difficulty in speed improvement at the circuit level. This

    phenomenal rate of growth, which is expected to continue in the near future, would not

    have been possible without theoretical insights, experimental research, and tool-building

    efforts that have helped transform computer architecture from an art into one of the most

    quantitative branches of computer science and engineering. Better understanding of the

    various forms of concurrency and the development of a reasonably efficient and user-

friendly programming model have been key enablers of this success story.

    The downside of exponentially rising processor performance is an unprecedented

    increase in hardware and software complexity. The trend toward greater complexity is not

    only at odds with testability and verifiability but also hampers adaptability, performance

    tuning, and evaluation of the various trade-offs, all of which contribute to soaring

    development costs. A key challenge facing current and future computer designers is to

    reverse this trend by removing layer after layer of complexity, opting instead for clean,

    robust, and easily certifiable designs, while continuing to try to devise novel methods for

    gaining performance and ease-of-use benefits from simpler circuits that can be readily

    adapted to application requirements.

In the computer designer's quest for user-friendliness, compactness, simplicity, high performance, low cost, and low power, computer arithmetic plays a key role. It is one of the oldest subfields of computer architecture. The bulk of hardware in early digital computers resided in accumulators and other arithmetic/logic circuits. Thus, first-generation computer designers were motivated to simplify and share hardware to the extent possible and to carry out detailed cost-performance analyses before proposing a

    design. Many of the ingenious design methods that we use today have their roots in the

    bulky, power-hungry machines of 30-50 years ago.

In fact, computer arithmetic has been so successful that it has, at times, become transparent. Arithmetic circuits are no longer dominant in terms of complexity; registers, memory and memory management, instruction issue logic, and pipeline control have become the dominant consumers of chip area in today's processors. Correctness and high performance of arithmetic circuits are routinely expected, and episodes such as the Intel Pentium division bug are indeed rare.

    The preceding context is changing for several reasons. First, at very high clock

    rates, the interfaces between arithmetic circuits and the rest of the processor become

    critical. Arithmetic units can no longer be designed and verified in isolation. Rather, an

    integrated design optimization is required, which makes the development even more

    complex and costly. Second, optimizing arithmetic circuits to meet design goals by taking

    advantage of the strengths of new technologies, and making them tolerant to the

    weaknesses, requires a reexamination of existing design paradigms. Finally, incorporation

    of higher-level arithmetic primitives into hardware makes the design, optimization, and

    verification efforts highly complex and interrelated.

    This is why computer arithmetic is alive and well today. Designers and

    researchers in this area produce novel structures with amazing regularity. Carry-

    lookahead adders comprise a case in point. We used to think, in the not so distant past,

    that we knew all there was to know about carry-lookahead fast adders. Yet, new designs,

    improvements, and optimizations are still appearing. The ANSI/IEEE standard floating-

    point format has removed many of the concerns with compatibility and error control in

    floating-point computations, thus resulting in new designs and products with mass-market

    appeal. Given the arithmetic-intensive nature of many novel application areas (such as

    encryption, error checking, and multimedia), computer arithmetic will continue to thrive

    for years to come.

    1.2 What is computer arithmetic

    A sequence of events, begun in late 1994 and extending into 1995, embarrassed

the world's largest computer chip manufacturer and put the normally dry subject of computer arithmetic on the front pages of major newspapers. The events were rooted in the work of Thomas Nicely, a mathematician at Lynchburg College in Virginia, who is interested in twin primes (consecutive odd numbers such as 29 and 31 that are both prime). Nicely's work involves the distribution of twin primes and, particularly, the sum of their reciprocals S = 1/5 + 1/7 + 1/11 + 1/13 + 1/17 + 1/19 + 1/29 + 1/31 + ... + 1/p + 1/(p+2) + .... While it is known that the infinite sum S has a finite value, no one knows what the value is.

    Nicely was using several different computers for his work and in March 1994

    added a machine based on the Intel Pentium processor to his collection. Soon he began

    noticing inconsistencies in his calculations and was able to trace them back to the values

    computed for 1 / p and 1 / (p + 2) on the Pentium processor. At first, he suspected his own

    programs, the compiler, and the operating system, but by October, he became convinced


    that the Intel Pentium chip was at fault. This suspicion was confirmed by several other

    researchers following a barrage of e-mail exchanges and postings on the Internet. The

    diagnosis finally came from Tim Coe, an engineer at Vitesse Semiconductor. Coe built a

model of the Pentium's floating-point division hardware based on the radix-4 SRT algorithm and came up with an example that produces the worst-case error. Using double-precision floating-point computation, the ratio c = 4 195 835 / 3 145 727 = 1.333 820 44... is computed as 1.333 739 06 on the Pentium. This latter result is accurate to only 14 bits; the error is even larger than that of single-precision floating-point and more than 10 orders of magnitude worse than what is expected of double-precision computation.

    The rest, as they say, is history. Intel at first dismissed the severity of the problem

    and admitted only a subtle flaw, with a probability of 1 in 9 billion, or once in 27,000

    years for the average spreadsheet user, of leading to computational errors. It nevertheless

    published a white paper that described the bug and its potential consequences and

    announced a replacement policy for the defective chips based on customer need; that is,

    customers had to show that they were doing a lot of mathematical calculations to get a

    free replacement. Under heavy criticism from customers, manufacturers using the

    Pentium chip in their products, and the on-line community, Intel later revised its policy to

    no-questions-asked replacement.

    Whereas supercomputing, microchips, computer networks, advanced applications

    (particularly chess-playing programs), and many other aspects of computer technology

    have made the news regularly in recent years, the Intel Pentium bug was the first instance

    of arithmetic (or anything inside the CPU for that matter) becoming front-page news.

    While this can be interpreted as a sign of pedantic dryness, it is more likely an indicator

    of stunning technological success. Glaring software failures have come to be routine

    events in our information-based society, but hardware bugs are rare and newsworthy.

    Within the hardware realm, we will be dealing with both general-purpose

arithmetic/logic units (ALUs), of the type found in many commercially available

    processors, and special-purpose structures for solving specific application problems. The

    differences in the two areas are minor as far as the arithmetic algorithms are concerned.

    However, in view of the specific technological constraints, production volumes, and

    performance criteria, hardware implementations tend to be quite different. General-

    purpose processor chips that are mass-produced have highly optimized custom designs.

Implementations of low-volume, special-purpose systems, on the other hand, typically

    rely on semicustom and off-the-shelf components. However, when critical and strict

    requirements, such as extreme speed, very low power consumption, and miniature size,


preclude the use of semicustom or off-the-shelf components, the much higher cost of a

    custom design may be justified even for a special-purpose system.

    1.3 Multiplication

Multiplication (often denoted by the cross symbol "×", or by the absence of a symbol) is the third basic mathematical operation of arithmetic, the others being addition, subtraction and division (division is counted as the fourth because it requires multiplication

    to be defined). The multiplication of two whole numbers is equivalent to the addition of

    one of them with itself as many times as the value of the other one; for example, 3

    multiplied by 4 (often said as "3 times 4") can be calculated by adding 4 copies of 3

together: 3 times 4 = 3 + 3 + 3 + 3 = 12. Here 3 and 4 are the "factors" and 12 is the

    "product". One of the main properties of multiplication is that the result does not depend

    on the place of the factor that is repeatedly added to it (commutative property). 3

    multiplied by 4 can also be calculated by adding 3 copies of 4 together: 3 times 4 = 4 + 4

    + 4 = 12. The multiplication of integers (including negative numbers), rational numbers

    (fractions) and real numbers is defined by a systematic generalization of this basic

    definition. Multiplication can also be visualized as counting objects arranged in a

    rectangle (for whole numbers) or as finding the area of a rectangle whose sides have

    given lengths. The area of a rectangle does not depend on which side is measured first,

    which illustrates the commutative property. In general, multiplying two measurements

gives a new type, depending on the measurements. For instance, 2.5 meters × 4.5 meters = 11.25 square meters, and 11 meters/second × 9 seconds = 99 meters. The inverse operation of multiplication is division. For example, since 4 multiplied by 3 equals

    12, then 12 divided by 3 equals 4. Multiplication by 3, followed by division by 3, yields

    the original number (since the division of a number other than 0 by itself equals 1).

    Multiplication is also defined for other types of numbers, such as complex numbers, and

    more abstract constructs, like matrices. For these more abstract constructs, the order that

    the operands are multiplied sometimes does matter.

Multiplication, often realized by k cycles of shifting and adding, is a heavily used

    arithmetic operation that figures prominently in signal processing and scientific

    applications. In this part, after examining shift/add multiplication schemes and their

    various implementations, we note that there are but two ways to speed up the underlying

    multi operand addition: reducing the number of operands to be added leads to high-radix

    multipliers, and devising hardware multi operand adders that minimize the latency and/or

    maximize the throughput leads to tree and array multipliers. Of course, speed is not the

    only criterion of interest. Cost, VLSI area, and pin limitations favor bit-serial designs,


    while the desire to use available building blocks leads to designs based on additive

    multiply modules. Finally, the special case of squaring is of interest as it leads to

considerable simplification.
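As a concrete illustration of the shift/add scheme (a worked example added here, not part of the original text), the product of two k-bit numbers can be written as a*x = x0*a*2^0 + x1*a*2^1 + ... + x(k-1)*a*2^(k-1), with one shifted copy of the multiplicand added per cycle. For k = 4, a = 1101 (13) and x = 1001 (9), only x0 and x3 are 1, so the product is 1101 + 1101000 = 1110101 (117).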

    1.4 Organization of report

This report starts with an introduction to computer arithmetic and then introduces multiplication. It then explains the implementation of one such multiplier, the bit-serial multiplier.

Chapter 1: Introduction. This chapter explains the importance of computer arithmetic and multiplication in computations.

Chapter 2: VLSI. This chapter focuses on VLSI and its evolution, applications, and advantages.

Chapter 3: Verilog HDL. This chapter explains how HDLs shorten the design cycle in VLSI and how design automation enables faster implementation.

Chapter 4: Bit-serial multiplier. This chapter explains multipliers and their types and why the bit-serial multiplier is useful.

Chapter 5: Implementation. This chapter explains the implementation flow of the bit-serial multiplier, its Verilog code, and the output waveforms.

Chapter 6: Conclusions. This chapter summarizes the bit-serial multiplier and its future improvements.


    CHAPTER 2

    VLSI

    2.1 Introduction

    Very-large-scale integration (VLSI) is the process of creating integrated

    circuits by combining thousands of transistor-based circuits into a single chip. VLSI

    began in the 1970s when complex semiconductor and communication technologies

    were being developed. The microprocessor is a VLSI device. The term is no longer as

    common as it once was, as chips have increased in complexity into the hundreds of

    millions of transistors.

    The first semiconductor chips held one transistor each. Subsequent advances

    added more and more transistors, and, as a consequence, more individual functions or

    systems were integrated over time. The first integrated circuits held only a few

    devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it

    possible to fabricate one or more logic gates on a single device. Now known

retrospectively as "small-scale integration" (SSI), improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past

    this mark and today's microprocessors have many millions of gates and hundreds of

    millions of individual transistors.

    At one time, there was an effort to name and calibrate various levels of large-

    scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were

    used. But the huge number of gates and transistors available on common devices has

    rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of

    integration are no longer in widespread use. Even VLSI is now somewhat quaint,

    given the common assumption that all microprocessors are VLSI or better.

    As of early 2008, billion-transistor processors are commercially available, an

    example of which is Intel's Montecito Itanium chip. This is expected to become more

    commonplace as semiconductor fabrication moves from the current generation of 65 nm

    processes to the next 45 nm generations (while experiencing new challenges such as

    increased variation across process corners).

This microprocessor is unique in the fact that its 1.4 billion transistor count,

    capable of a teraflop of performance, is almost entirely dedicated to logic (Itanium's

    transistor count is largely due to the 24MB L3 cache). Current designs, as opposed to

    the earliest devices, use extensive design automation and automated logic synthesis to


    lay out the transistors, enabling higher levels of complexity in the resulting logic

    functionality. Certain high-performance logic blocks like the SRAM cell, however, are

    still designed by hand to ensure the highest efficiency (sometimes by bending or

    breaking established design rules to obtain the last bit of performance by trading

    stability).

    2.2 What is VLSI?

    VLSI stands for "Very Large Scale Integration". This is the field which involves

    packing more and more logic devices into smaller and smaller areas.

Simply put, an integrated circuit is many transistors on one chip.

Design/manufacturing of extremely small, complex circuitry using modified semiconductor material.

An integrated circuit (IC) may contain millions of transistors, each a few µm in size.

Applications are wide ranging: most electronic logic devices.

    2.2.1 History of Scale Integration

late 40s: Transistor invented at Bell Labs

late 50s: First IC (JK-FF by Jack Kilby at TI)

early 60s: Small Scale Integration (SSI) - 10s of transistors on a chip

late 60s: Medium Scale Integration (MSI) - 100s of transistors on a chip

early 70s: Large Scale Integration (LSI) - 1000s of transistors on a chip

early 80s: VLSI - 10,000s of transistors on a chip (later 100,000s and now 1,000,000s)

Ultra LSI is sometimes used for 1,000,000s

    2.3 Advantages of ICs over discrete components

    While we will concentrate on integrated circuits, the properties of integrated

circuits (what we can and cannot efficiently put in an integrated circuit) largely

    determine the architecture of the entire system. Integrated circuits improve system

    characteristics in several critical ways. ICs have three key advantages over digital

    circuits built from discrete components:

Size: Integrated circuits are much smaller; both transistors and wires are shrunk to

    micrometer sizes, compared to the millimeter or centimeter scales of discrete


    components. Small size leads to advantages in speed and power consumption, since

    smaller components have smaller parasitic resistances, capacitances, and inductances.

    Speed: Signals can be switched between logic 0 and logic 1 much quicker within a chip

    than they can between chips. Communication within a chip can occur hundreds of

    times faster than communication between chips on a printed circuit board. The high

speed of circuits on-chip is due to their small size; smaller components and wires have smaller parasitic capacitances to slow down the signal.

Power consumption: Logic operations within a chip also take much less power. Once again, lower power consumption is largely due to the small size of circuits on the chip; smaller parasitic capacitances and resistances require less power to drive them.

    2.4 VLSI And Systems

    These advantages of integrated circuits translate into advantages at the system

    level:

Smaller physical size: Smallness is often an advantage in itself; consider portable

    televisions or handheld cellular telephones.

    Lower power consumption: Replacing a handful of standard parts with a single chip

    reduces total power consumption. Reducing power consumption has a ripple effect on

    the rest of the system: a smaller, cheaper power supply can be used; since less power

    consumption means less heat, a fan may no longer be necessary; a simpler cabinet with

less shielding for electromagnetic interference may be feasible, too.

    Reduced cost: Reducing the number of components, the power supply requirements,

    cabinet costs, and so on, will inevitably reduce system cost. The ripple effect of

    integration is such that the cost of a system built from custom ICs can be less, even

    though the individual ICs cost more than the standard parts they replace.


    Understanding why integrated circuit technology has such profound influence

    on the design of digital systems requires understanding both the technology of IC

    manufacturing and the economics of ICs and digital systems.

    2.5 Applications of VLSI

    Electronic systems now perform a wide variety of tasks in daily life. Electronic

    systems in some cases have replaced mechanisms that operated mechanically,

    hydraulically, or by other means; electronics are usually smaller, more flexible, and

    easier to service. In other cases electronic systems have created totally new applications.


    Electronic systems perform a variety of tasks, some of them visible, some more hidden.

    Electronic systems in cars operate stereo systems and displays; they also control fuel

    injection systems, adjust suspensions to varying terrain, and perform the control

    functions required for anti-lock braking (ABS) systems.

    Digital electronics compress and decompress video, even at high-definition data

    rates, on-the-fly in consumer electronics.

    Low-cost terminals for Web browsing still require sophisticated electronics,

    despite their dedicated function.

    Personal computers and workstations provide word-processing, financial analysis,

    and games. Computers include both central processing units (CPUs) and special-

    purpose hardware for disk access, faster screen display, etc.

    Medical electronic systems measure bodily functions and perform complex

    processing algorithms to warn about unusual conditions. The availability of these

    complex systems, far from overwhelming consumers, only creates demand for

    even more complex systems.

    2.6 Conclusion

    The growing sophistication of applications continually pushes the design and

    manufacturing of integrated circuits and electronic systems to new levels of complexity.

And perhaps the most amazing characteristic of this collection of systems is its variety:

    as systems become more complex, we build not a few general-purpose computers but

    an ever wider range of special-purpose systems. Our ability to do so is a testament to

    our growing mastery of both integrated circuit manufacturing and design, but the

    increasing demands of customers continue to test the limits of design and

    manufacturing.


    CHAPTER 3

    VERILOG HDL

    3.1 Introduction

    Verilog HDL is a hardware description language that can be used to model a

    digital system at many levels of abstraction ranging from the algorithmic-level to the

    gate-level to the switch-level. The complexity of the digital system being modeled

    could vary from that of a simple gate to a complete electronic digital system, or

    anything in between. The digital system can be described hierarchically and timing

    can be explicitly modeled within the same description.

The Verilog HDL language includes capabilities to describe the behavioral

    nature of a design, the dataflow nature of a design, a design's structural composition,

    delays and a waveform generation mechanism including aspects of response monitoring

    and verification, all modeled using one single language. In addition, the language

    provides a programming language interface through which the internals of a design can

    be accessed during simulation including the control of a simulation run.

    The language not only defines the syntax but also defines very clear simulation

    semantics for each language construct. Therefore, models written in this language

    can be verified using a Verilog simulator. The language inherits many of its operator

    symbols and constructs from the C programming language. Verilog HDL provides an

    extensive range of modeling capabilities, some of which are quite difficult to

    comprehend initially. However, a core subset of the language is quite easy to learn and

    use. This is sufficient to model most applications.

The Verilog HDL language was first developed by Gateway Design Automation in 1983 as a hardware modeling language for their simulator product. At that time, it was a proprietary language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as a usable and practical language by a number of designers. In an effort to increase the popularity of the language, the language was placed in the public domain in 1990. Open Verilog International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue standardization of Verilog HDL as an IEEE standard. This effort was successful and the language became an IEEE standard in 1995. The complete standard is described in the Verilog hardware description language reference manual. The standard is called IEEE Std 1364-1995.


    3.2 Major Capabilities

Listed below are the major capabilities of the Verilog hardware description language:

Primitive logic gates, such as and, or and nand, are built into the language.

Flexibility of creating a user-defined primitive (UDP). Such a primitive could either be a combinational logic primitive or a sequential logic primitive.

Switch-level modeling primitive gates, such as pmos and nmos, are also built into the language.

A design can be modeled in three different styles or in a mixed style. These styles are: behavioral style, modeled using procedural constructs; dataflow style, modeled using continuous assignments; and structural style, modeled using gate and module instantiations.

There are two data types in Verilog HDL: the net data type and the register data type. The net type represents a physical connection between structural elements, while a register type represents an abstract data storage element.

Figure 3.1 shows the mixed-level modeling capability of Verilog HDL, that is, in one design, each module may be modeled at a different level.

Figure 3.1: Mixed level modeling

Verilog HDL also has built-in logic functions such as & (bitwise-and) and | (bitwise-or).

High-level programming language constructs such as conditionals, case statements, and loops are available in the language.

The notions of concurrency and time can be explicitly modeled.

Powerful file read and write capabilities are provided.

The language is non-deterministic under certain situations, that is, a model may produce different results on different simulators; for example, the ordering of events on an event queue is not defined by the standard.
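As a brief illustration of the three modeling styles listed above (an added example with assumed module names, not taken from the report), the same 2-to-1 multiplexer can be described behaviorally, in dataflow style, and structurally:

//behavioral style: procedural always block

module mux2_beh(output reg y, input a,b,sel);

always@(*) y = sel ? b : a;

endmodule

//dataflow style: continuous assignment

module mux2_df(output y, input a,b,sel);

assign y = sel ? b : a;

endmodule

//structural style: gate instantiation

module mux2_str(output y, input a,b,sel);

wire nsel,w1,w2;

not g0(nsel,sel);

and g1(w1,a,nsel);

and g2(w2,b,sel);

or g3(y,w1,w2);

endmodule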


3.3 Synthesis

Synthesis is the process of constructing a gate-level netlist from a register-transfer-level model of a circuit described in Verilog HDL. Figure 3.2 shows such a process. A synthesis system may, as an intermediate step, generate a netlist that is composed of register-transfer-level blocks such as flip-flops, arithmetic-logic units, and multiplexers, interconnected by wires. In such a case, a second program called the RTL module builder is necessary. The purpose of this builder is to build, or acquire from a library of predefined components, each of the required RTL blocks in the user-specified target technology.

Figure 3.2: Synthesis process

The above figure shows the basic elements of Verilog HDL and the elements used in hardware. A mapping mechanism or a construction mechanism has to be provided that translates the Verilog HDL elements into their corresponding hardware elements, as shown in Figure 3.3.
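As a small illustration of this mapping (an added sketch with assumed names, not an example from the report), the register-transfer-level description below would typically be translated by a synthesis tool into a D flip-flop with a multiplexer (load enable) at its input, taken from the target library:

module load_reg(output reg q, input d, load, clk);

always@(posedge clk)

if(load)

q <= d; //maps to a flip-flop whose input is selected between d and its current value

endmodule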

    3.4 Conclusion

The Verilog HDL language includes capabilities to describe the behavioral nature of a design, the dataflow nature of a design, a design's structural composition, delays, and a waveform generation mechanism including aspects of response monitoring and verification, all modeled using one single language. The language not only defines the syntax but also defines very clear simulation semantics for each language construct. Therefore, models written in this language can be verified using a Verilog simulator.


    Figure 3.3: Typical design process


    CHAPTER 4

    BIT-SERIAL MULTIPLIER

    4.1 Multiplier

    Multipliers are key components of many high performance systems such as FIR

filters, microprocessors, digital signal processors, etc. A system's performance is generally determined by the performance of the multiplier, because the multiplier is generally the slowest element in the system. Furthermore, it is generally the most area consuming. Hence, optimizing the speed and area of the multiplier is a major design issue. However, area and speed are usually conflicting constraints, so that improving speed results mostly in larger areas. As a result, a whole spectrum of multipliers with different area-speed trade-offs has been designed, ranging from fully serial to fully parallel processing. In between are digit-serial multipliers, where single digits consisting of several bits are operated on. These multipliers have moderate performance in both speed and area. However, existing digit-serial multipliers have been plagued by complicated switching systems and/or irregularities in design. Radix-2^n multipliers, which operate on digits in a parallel fashion instead of bits, bring the pipelining to the digit level and avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative

    and modular. The pipelining done at the digit level brings the benefit of constant

    operation speed irrespective of the size of the multiplier. The clock speed is only

    determined by the digit size which is already fixed before the design is implemented.

    The growing market for fast floating-point co-processors, digital signal processing

    chips, and graphics processors has created a demand for high speed, area-efficient

    multipliers. Current architectures range from small, low-performance shift and add

    multipliers, to large, high performance array and tree multipliers. Conventional linear

    array multipliers achieve high performance in a regular structure, but require large

    amounts of silicon. Tree structures achieve even higher performance than linear arrays

    but the tree interconnection is more complex and less regular, making them even larger

    than linear arrays. Ideally, one would want the speed benefits of a tree structure, the

    regularity of an array multiplier, and the small size of a shift and add multiplier.

    4.2 Background

Webster's dictionary defines multiplication as a mathematical operation that at

    its simplest is an abbreviated process of adding an integer to itself a specified number of

    times. A number (multiplicand) is added to itself a number of times as specified by


    another number (multiplier) to form a result (product). In elementary school, students

    learn to multiply by placing the multiplicand on top of the multiplier. The multiplicand is

    then multiplied by each digit of the multiplier beginning with the rightmost, Least

    Significant Digit (LSD). Intermediate results (partial-products) are placed one atop the

    other, offset by one digit to align digits of the same weight. The final product is

    determined by summation of all the partial-products. Although most people think of

    multiplication only in base 10, this technique applies equally to any base, including

binary. Figure 4.1 shows the data flow for the basic multiplication technique just

    described. Each black dot represents a single digit.

    Figure 4.1: Basic Multiplication Data flow

    4.2.1 Binary Multiplication

In the binary number system the digits, called bits, are limited to the set {0, 1}. The result of multiplying any binary number by a single binary bit is either 0 or the original number. This makes forming the intermediate partial-products simple and efficient. Summing these partial-products is the time consuming task for binary multipliers. One

    logical approach is to form the partial-products one at a time and sum them as they are

    generated. Often implemented by software on processors that do not have a hardware

    multiplier, this technique works fine, but is slow because at least one machine cycle is

    required to sum each additional partial-product.

    For applications where this approach does not provide enough performance,

    multipliers can be implemented directly in hardware.
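A minimal behavioral sketch of this software-style shift/add approach is given below; it is an added illustration with assumed names, written for 4-bit unsigned operands, and is not one of the report's design files (the report's bit-serial design appears in Chapter 5):

module shift_add_mult(output reg [7:0] product, input [3:0] a, x);

integer i;

always@(*)

begin

product = 8'd0;

for(i = 0; i < 4; i = i + 1)

if(x[i])

product = product + (a << i); //add the shifted multiplicand when the multiplier bit is 1

end

endmodule

Each pass of the loop plays the role of one machine cycle in the software scheme: one partial product is formed and immediately summed.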

    4.2.2 Hardware Multipliers

    Direct hardware implementations of shift and add multipliers can increase

    performance over software synthesis, but are still quite slow. The reason is that as each

additional partial-product is summed, a carry must be propagated from the least

    significant bit (LSB) to the most significant bit (MSB). This carry propagation is time


    consuming, and must be repeated for each partial product to be summed.

    One method to increase multiplier performance is by using encoding techniques to

    reduce the number of partial products to be summed. Just such a technique was first

proposed by Booth. The original Booth's algorithm skips over contiguous strings of 1s by using the property that 2^n + 2^(n-1) + 2^(n-2) + ... + 2^(n-m) = 2^(n+1) - 2^(n-m). Although Booth's algorithm produces at most N/2 encoded partial products from an N-bit operand, the number of partial products produced varies. This has caused designers to use modified versions of Booth's algorithm for hardware multipliers. Modified 2-bit Booth encoding halves the number of partial products to be summed.
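For instance (a worked illustration of this property, added here), the multiplier 0111100 (decimal 60) contains a run of 1s from bit 5 down to bit 2, so 2^5 + 2^4 + 2^3 + 2^2 = 2^6 - 2^2 = 64 - 4 = 60; Booth recoding replaces four additions of shifted multiplicands with one addition (at position 6) and one subtraction (at position 2).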

Since the resulting encoded partial-products can then be summed using any suitable method, modified 2-bit Booth encoding is used on most modern floating-point chips [LU 88], [MCA 86]. A few designers have even turned to modified 3-bit Booth encoding, which reduces the number of partial products to be summed by a factor of three [BEN 89]. The problem with 3-bit encoding is that the carry-propagate addition required to form the 3X multiples often overshadows the potential gains of 3-bit Booth encoding.

To achieve even higher performance, advanced hardware multiplier architectures

    search for faster and more efficient methods for summing the partial-products. Most

    increase performance by eliminating the time consuming carry propagate additions. To

    accomplish this, they sum the partial-products in a redundant number representation. The

    advantage of a redundant representation is that two numbers, or partial-products, can be

    added together without propagating a carry across the entire width of the number. Many

    redundant number representations are possible. One commonly used representation is

    known as carry-save form. In this redundant representation two bits, known as the carry

    and sum, are used to represent each bit position. When two numbers in carry-save form

    are added together any carries that result are never propagated more than one bit position.

    This makes adding two numbers in carry-save form much faster than adding two normal

    binary numbers where a carry may propagate. One common method that has been

    developed for summing rows of partial products using a carry-save representation is the

    array multiplier.
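To make the carry-save idea concrete (an added sketch with assumed names, not part of the report's code), a 4-bit carry-save adder is simply four full adders operating in parallel on the bits of three operands; the carry word it produces is shifted left by one position before the next addition:

module csa(output [3:0] sum, carry, input [3:0] x, y, z);

assign sum = x ^ y ^ z; //bitwise sums, with no carry propagation across the word

assign carry = (x & y) | (y & z) | (x & z); //per-bit carries, to be shifted left one place

endmodule

Because no carry ever travels across the full word width, each row of such adders is fast regardless of the operand width, which is what the array multiplier exploits.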

    4.2.3 Array Multipliers

    Conventional linear array multipliers consist of rows of carry-save adders (CSA).

    A portion of an array multiplier with the associated routing can be seen in Figure 4.2.


    Figure 4.2: Two Rows of an Array Multiplier

    In a linear array multiplier, as the data propagates down through the array, each

    row of CSAs adds one additional partial-product to the partial sum. Since the

    intermediate partial sum is kept in a redundant, carry-save form there is no carry

    propagation. This means that the delay of an array multiplier is only dependent upon the

    depth of the array, and is independent of the partial-product width. Linear array

    multipliers are also regular, consisting of replicated rows of CSAs. Their high

    performance and regular structure have perpetuated the use of array multipliers for VLSI

    math co-processors and special purpose DSP chips.

    The biggest problem with full linear array multipliers is that they are very large.

    As operand sizes increase, linear arrays grow in size at a rate equal to the square of the

    operand size. This is because the number of rows in the array is equal to the length of the

multiplier, with the width of each row equal to the width of the multiplicand. The large size

    of full arrays typically prohibits their use, except for small operand sizes, or on special

    purpose math chips where a major portion of the silicon area can be assigned to the

    multiplier array.

    Another problem with array multipliers is that the hardware is underutilized. As

    the sum is propagated down through the array, each row of CSAs computes a result only

    once, when the active computation front passes that row. Thus, the hardware is doing

    useful work only a very small percentage of the time. This low hardware utilization in

    conventional linear array multipliers makes performance gains possible through increased

efficiency. For example, by overlapping calculations, pipelining can achieve a large gain in throughput. Figure 4.3 shows a full array pipelined after each row of CSAs. Once the partial sum has passed the first row of CSAs, represented by the shaded row of CSAs in cycle 1, a subsequent multiply can be started on the next cycle. In cycle 2, the first partial sum has passed to the second row of CSAs, and the second multiply, represented by the cross-hatched row of CSAs, has begun. Although pipelining a full array can greatly increase throughput, both the size and latency are increased due to the additional latches. While high throughput is desirable, for general purpose computers size and latency tend to be more important; thus, fully pipelined linear array multipliers are seldom found.

    Figure 4.3: Data Flow through a Pipelined Array Multiplier

    4.3 Variations in Multipliers

    We do not always synthesize our multipliers from scratch but may desire, or be

    required, to use building blocks such as adders, small multipliers, or lookup tables.

    Furthermore, limited chip area and/or pin availability may dictate the use of bit-serial

    designs. In this chapter, we discuss such variations and also deal with modular

    multipliers, the special case of squaring, and multiply-accumulators.

    Divide-and-Conquer Designs

    Additive Multiply Modules

    Bit-Serial Multipliers

    Modular Multipliers

    The Special Case of Squaring

    Combined Multiply-Add Units


    4.4 Bit-serial Multipliers

    Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire

    length, and lower floor space requirements in VLSI. In fact, the compactness of the

    design may allow us to run a bit-serial multiplier at a clock rate high enough to make the

    unit almost competitive with much more complex designs with regard to speed. In

    addition, in certain application contexts inputs are supplied bit-serially anyway. In such a

    case, using a parallel multiplier would be quite wasteful, since the parallelism may not

    lead to any speed benefit. Furthermore, in applications that call for a large number of

    independent multiplications, multiple bit-serial multipliers may be more cost-effective

    than a complex highly pipelined unit.

    Figure 4.4: Bit-serial multiplier; 4x4 multiplication in 8 clock cycles

    Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of

processing elements that are interconnected by only short, local wires, thus allowing very

    high clock rates. Let us begin by introducing a semisystolic multiplier, so named because

    its design involves broadcasting a single bit of the multiplier x to a number of circuit

    elements, thus violating the short, local wires requirement of pure systolic design.

    Figure 4.4 shows a semisystolic 4 x 4 multiplier. The multiplicand a is supplied in

    parallel from above and the multiplier x is supplied bit-serially from the right, with its

least significant bit arriving first. Each bit xi of the multiplier is multiplied by a and the


    result added to the cumulative partial product, kept in carry-save form in the carry and

    sum latches. The carry bit stays in its current position, while the sum bit is passed on to

    the neighboring cell on the right. This corresponds to shifting the partial product to the

    right before the next addition step (normally the sum bit would stay put and the carry bit

    would be shifted to the left). Bits of the result emerge serially from the right as they

    become available.

    A k-bit unsigned multiplier x must be padded with k zeros to allow the carries to

    propagate to the output, yielding the correct 2k-bit product. Thus, the semisystolic

    multiplier of Figure 4.4 can perform one k x k unsigned integer multiplication every 2k

    clock cycles. If k-bit fractions need to be multiplied, the first k output bits are discarded

    or used to properly round the most significant k bits.
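As a concrete illustration consistent with this description, for k = 4 the multiplier x = x3 x2 x1 x0 is fed in serially as x0, x1, x2, x3 followed by four 0s, and the eight product bits p0 through p7 emerge from the serial output on successive clock cycles, least significant bit first.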

    To make the multiplier of Figure 4.4 fully systolic, we must remove the

    broadcasting of the multiplier bits. This can be accomplished by a process known as

systolic retiming, which is briefly explained below.

    Consider a synchronous (clocked) circuit, with each line between two functional

    parts having an integral number of unit delays (possibly 0). Then, if we cut the circuit into

    two parts CL and CR, we can delay (advance) all the signals going in one direction and

    advance (delay) the ones going in the opposite direction by the same amount without

    affecting the correct functioning or external timing relations of the circuit. Of course, the

primary inputs and outputs to the two parts CL and CR must be correspondingly advanced

    or delayed, too.

    For the retiming to be possible, all the signals that are advanced by d must have

    had original delays of d or more (negative delays are not allowed). Note that all the

    signals going into CL have been delayed by d time units. Thus, CL will work as before,

    except that everything, including output production, occurs d time units later than before

    retiming. Advancing the outputs by d time units will keep the external view of the circuit

    unchanged.

    We apply the preceding process to the multiplier circuit of Figure 4.4 in three

    successive steps corresponding to cuts 1, 2, and 3, each time delaying the left-moving

    signal by one unit and advancing the right-moving signal by one unit. Verifying that the

resulting retimed multiplier works correctly is left as an exercise. This new version of our

    multiplier does not have the fan-out problem of the design in Figure 4.4 but it suffers

    from long signal propagation delay through the four FAs in each clock cycle, leading to

    inferior operating speed. Note that the culprits are zero-delay lines that lead to signal

    propagation through multiple circuit elements.


    One way of avoiding zero-delay lines in our design is to begin by doubling all the

    delays in Figure 4.4. This is done by simply replacing each of the sum and carry flip-flops

    with two cascaded flip-flops before retiming is applied. Since the circuit is now operating

    at half its original speed, the multiplier x must also be applied on alternate clock cycles.

    The resulting design is fully systolic, inasmuch as signals move only between adjacent

    cells in each clock cycle. However, twice as many cycles are needed.

    The easiest way to derive a multiplier with both inputs entering bit-serially is to

    allow k clock ticks for the multiplicand bits to be put into place in a shift register and then

    use the design of Figure 4.4 to compute the product. This increases the total delay by k

    cycles.

    Figure 4.5 uses dot notation to show the justification for the bit-serial multiplier

    design above. Figure 4.5 depicts the meanings of the various partial operands and results.

    Figure 4.5: Bit Serial multiplier design in dot notation


    CHAPTER 5

    IMPLEMENTATION

    5.1 Tools Used

1) PC installed with the Linux operating system

2) Installed Cadence tools:

ncvlog - for checking errors

ncverilog - for execution of the code

simvision - to view waveforms

    5.2 Coding Steps

    1) Create directory structure for the project as below

    Figure 5.1: Project directory structure

2) Write the RTL code in a text file and save it with a .v extension in the rtl directory

3) Write the code for the testbench and store it in the tb directory

    5.3 Simulation steps

The commands that are used in Cadence for the execution are:

1) Initially we should mount the server using mount -a.

2) Go to the C shell environment with the command csh.

3) The source file should be loaded by the command source /root/cshrc.

4) The next command is to go to the cadence_digital_labs directory:

#cd .../../cadence_digital_labs/

5) Then check the file for errors by the command ncvlog ../rtl/filename.v -mess.

6) Then execute the files using ncverilog +access+rwc ../rtl/filename.v ../tb/file_tb.v +nctimescale+1ns/1ps

(rwc - read, write and connectivity access; gui - graphical user interface)

7) After running the program, we open the simulation window with the command simvision &.


    Figure 5.2: Simulation window

8) After the simulation, the waveforms are shown in the other window.

    Figure 5.3: Waveform window

    5.4 Full adder code

module fulladder(output reg cout,sum,input a,b,cin,rst);

always@(a,b,cin)

{cout,sum}=a+b+cin; //combinational sum and carry

//asynchronous clear; the body below completes the original listing, which is truncated here

always@(posedge rst)

begin

sum=1'b0;

cout=1'b0;

end

endmodule


    5.5 Full adder flowchart

    Figure 5.4: Full adder flowchart

    5.6 Full adder testbench

module full_adder_tb;

wire cout,sum;

reg a,b,cin,rst;

parameter period=10; //time unit used by the stimulus (value assumed; not given in the original listing)

//dut

fulladder fa(cout,sum,a,b,cin,rst);

initial

begin

#2 rst=1'b1;

#(period/2) rst=1'b0;

a=1'b1;

b=1'b0;

cin=1'b1;

#5 a=1'b0;

b=1'b1;

cin=1'b1;

#period $finish; //small delay added so the last vector is visible before the simulation ends

end

endmodule
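For these vectors (assuming the completed full adder above), both input combinations add up to decimal 2, so the expected outputs are sum = 0 and cout = 1 in each case; this is what the full adder output waveforms should show.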


    5.7 Bit-serial multiplier algorithm

    Figure 5.5: Bit-Serial multiplier flowchart

    5.8 Bit-Serial multiplier code

module serial_mult(output product,input [3:0] a,input b,clk,rst);

wire s1,s2,s3;

reg s1o,s2o,s3o; //latches for sum at various stages

wire c0,c1,c2,c3;

reg c0o,c1o,c2o,c3o; //latches for carry at various stages

wire a3o,a2o,a1o,a0o;

reg s; //constant 0 sum input of the leftmost cell

//chain of full adders; each stage's sum feeds the stage to its right

fulladder fa0(c0,product,a0o,s1o,c0o,rst);

fulladder fa1(c1,s1,a1o,s2o,c1o,rst);

fulladder fa2(c2,s2,a2o,s3o,c2o,rst);

fulladder fa3(c3,s3,a3o,s,c3o,rst);

//AND gates form the partial product of the multiplicand a with the current multiplier bit b

and n0(a0o,a[0],b);

and n1(a1o,a[1],b);

and n2(a2o,a[2],b);

and n3(a3o,a[3],b);

//on every clock the carry stays in place and the sum moves one cell to the right;

//the body below completes the original listing, which is truncated after the reset branch

always@(posedge clk, posedge rst)

begin

if(rst)

begin

s=0;

c0o<=0; c1o<=0; c2o<=0; c3o<=0;

s1o<=0; s2o<=0; s3o<=0;

end

else

begin

c0o<=c0; c1o<=c1; c2o<=c2; c3o<=c3;

s1o<=s1; s2o<=s2; s3o<=s3;

end

end

endmodule

5.9 Full adder waveform

Figure 5.6: Full adder output waveforms

5.10 Bit-serial multiplier testbench

module serial_mult_tb;

//declarations and DUT instantiation reconstructed; this part of the original listing is missing

wire product;

reg [3:0] a;

reg b,clk,rst;

parameter period=10; //half clock period (value assumed)

serial_mult sm(product,a,b,clk,rst);

    initial clk=0;

    always #period clk=~clk;

    initial

    begin

    #2 rst=1'b1;

    #(period/2) rst=1'b0;

    a=4'b1101;

    b=1;

    @(posedge clk) b=0;

    @(posedge clk) b=0;

    @(posedge clk) b=1;

    @(posedge clk) b=0;

    @(posedge clk) b=0;

    @(posedge clk) b=0;

    @(posedge clk) b=0;

    #period $finish;

    end

    endmodule
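For this stimulus (assuming the code reconstructed above), a = 1101 (decimal 13) and the serially applied values of b form the multiplier 1001 (decimal 9), so the expected product is 13 x 9 = 117 = 01110101; its bits should emerge least significant bit first at the product output, one per clock cycle, over the eight cycles shown in Figure 5.7.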

    5.11 Bit-serial multiplier waveforms

    Figure 5.7: Bit serial multiplier input/output waveforms

    Figure 5.8: Bit serial multiplier with intermediate waveforms


    CHAPTER 6

    CONCLUSIONS

Multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried, and are trying, to design multipliers that offer one or more of the following design targets: high speed, low power consumption, and regularity of layout (and hence less area), or even a combination of them in one multiplier, thus making them suitable for various high-speed, low-power and compact VLSI implementations. The common multiplication method is the add-and-shift algorithm. In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier. To reduce the number of partial products to be added, the Modified Booth algorithm is one of the most popular algorithms. To achieve speed improvements, the Wallace Tree algorithm can be used to reduce the number of sequential adding stages. Further, by combining both the Modified Booth algorithm and the Wallace Tree technique, we can obtain the advantages of both algorithms in one multiplier. However, with increasing parallelism, the amount of shifting between the partial products and intermediate sums to be added will increase, which may result in reduced speed, an increase in silicon area due to irregularity of structure, and also increased power consumption due to the increase in interconnect resulting from complex routing. On the other hand, serial-parallel multipliers compromise speed to achieve better performance for area and power consumption. The selection of a parallel or serial multiplier actually depends on the nature of the application.

    A key challenge facing current and future computer designers is to reverse the

    trend by removing layer after layer of complexity, opting instead for clean, robust, and

    easily certifiable designs, while continuing to try to devise novel methods for gaining

    performance and ease-of-use benefits from simpler circuits that can be readily adapted to

    application requirements.

The bit-serial multiplier is one example of such a simple, compact circuit.


    REFERENCES

[1] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2009.

[2] F. Sadiq M. Sait and Gerhard Beckoff, "A Novel Technique for Fast Multiplication," IEEE Fourteenth Annual International Phoenix Conference on Computers and Communications, pp. 109-114, 1995.

[3] C. Ghest, "Multiplying Made Easy for Digital Assemblies," Electronics, Vol. 44, pp. 56-61, November 22, 1971.

[4] P. Ienne and M. A. Viredaz, "Bit-Serial Multipliers and Squarers," IEEE Transactions on Computers, Vol. 43, No. 12, pp. 1445-1450, 1994.

[5] Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall Professional, 2003.