International Journal of Electrical and Computer Engineering (IJECE) Vol. 3, No. 6, December 2013, pp. 805~814 ISSN: 2088-8708  805 Evaluation of High Speed Hardware Multipliers - Fixed Point and Floating point Awais Ahmed1, Syed Haider Abbas2, Muhammad Faheem Siddique2, Hussnain Haider2 1 2 School of Electrical Engineering, University of Faisalabad, Pakistan Departement of Electrical Engineering, Sarhad University of Science and IT, Pakistan Article Info ABSTRACT Article history: There is a huge demand in high speed arithmetic blocks, due to increased performance of processing units. For higher frequency clocks of the system, the arithmetic blocks must keep pace with greater requirement of more computational power. Area and speed are usually conflicting constraints so that improving speed results mostly in larger areas. In our research we will try to determine the best solution to this problem by comparing the results of different multipliers. Different sized of two algorithms for high speed hardware multipliers were studied and implemented ie. Parallel multiplier, Bit serial multiplier. The workings of these two multipliers were compared by implementing each of them separately in VHDL. A number of high speed adder designs are developed and algorithm and design of these adders are discussed. The result of this research will help us to choose the better option between serial and parallel multipliers for both fixed point and floating point multipliers to fabricate in different systems. As multipliers form one of the most important components of many systems, analysing different multipliers will help us to frame a better system with area and better speed. Received Aug 11, 2013 Revised Oct 14, 2013 Accepted Nov 1, 2013 Keyword: Bit Serial Multiplier Fixed Point Floating Point Parallel Multiplier VHDL Copyright © 2013 Institute of Advanced Engineering and Science. All rights reserved. Corresponding Author: Syed Haider Abbas Departement of Electrical Engineering, Sarhad University of Science and Information Technology, Hayatabad Link, Ring Road, Peshawar, Pakistan Email: habbas33@gmail.com 1. INTRODUCTION Multipliers are the major components of high performance systems used extensively in digital electronics such as microprocessors, digital signal processors and FIR Filters etc. The performance of any system is determined by the performance of multipliers because they are the slowest part in the system. Moreover, they require greater area than other components. Therefore, optimizing the speed and area of the multipliers is the foremost issue. As area and speed are both conflicting constraints this means that for greater speed we need larger area. A number of algorithms are proposed and used to design multipliers and the actual implementation is mostly some little refinements and variations of the few basic algorithms presented here. In addition to choosing those algorithms for addition, subtraction, multiplication etc an architect must make other decisions like how exceptions should be handled and what precisions should be implemented. We have designed two typed of multiplier for fixed and floating point [1-2]. Our discussion on floating point will focus almost exclusively on the IEEE floating-point standard (IEEE 754) because of its rapidly increasing acceptance. Although floating point arithmetic involves manipulating exponents and shifting fractions, the bulk of the time in floating point is spent operating on fractions using integer algorithms. Thus, after our discussion of floating point we will take a more detailed look at efficient algorithms and architectures. Journal homepage: http://iaesjournal.com/online/index.php/IJECE IJECE ISSN: 2088-8708  806 When two binary numbers, multiplicand “N” and multiplier “M” are multiplied the algorithm utilises distributive property of multiplication. The multiplication of M*N consists of partial products which represents a single component of the total product [3]. A binary multiplier can be decomposed in to a sum of partial products. It can be done by selection of a partial product value, and some shifting that is independent of the selection value. Opting a group length for multiplication limit us to make a trade-off between the number of partial products and the complexity of computing them. 2. BINARY NUMBER REPRESENTATION 2.1. Fixed Point Arithmetic Most of the general Digital processing applications use fixed point multiplication due to its easier algorithm and clear understanding. A multiplication can be implemented using a network of binary shifts and adders [1] [4-6]. The hardware cost can be approximated with the required numbers of adders and subtractors. The basic concept of multiplication is similar to the traditional pencil and paper method. A similar example is shown in the Figure 1. Figure 1. Block Diagram of the Hardware Setup of Project We can see that, there is no carry unlike the decimal multiplication which makes the hardware easier. From above example we can see that the partial products were either zero or same as multiplicand except that they were shifted to left. 2.2. Floating Point Arithmetic There are number of applications that require decimal notations. The non-integer numbers are represented in various forms. Whereas, one of those method has gain wide recognition and is called floating point representation.The real number is represented in scientific notation where the binary word is divided into mantissa and exponent. For instance 123.456 can be represented as 1.23456 x 102. This method solves many problems. However, it has a fixed window of representation, which limits it from representing very large or very small numbers [7-9]. Floating point representation is not straight forward as the schematic of the rest of representation methods. It contains variation which includes the number of bits allocated to the exponent and mantissa, the range of exponents, how rounding is carried out, the action taken in exceptional conditions like underflow and overflow. These days IEEE 754-1985 standard also known as international standard IEC 559 is used vastly. IEEE 754 has defined two types of storage layouts for floating points [10] Single Precision 32 bits: 1 S-bit, 8 Exponent bits & 23 fraction bits. Double Precision 64 bits: 1 S-bit, 11 Exponent bits & 52 fraction bits. A simple floating point multiplication for two binary numbers can be seen below mathematically. ( F1 × 2 E 1 ) × ( F 2 × 2 E 2 ) = ( F1 × F 2 ) × 2 ( E 1 + E 2 ) = F × 2E The multiplication includes the addition of exponents and multiplication of fractions. The product fraction is then normalised and exponent overflow or underflow is observed. Finally the product fraction is rounded. The flow chart diagram for floating point multiplication can be seen in Figure 2. Evaluation of High Speed Hardware Multipliers - Fixed Point and Floating point (Syed Haider Abbas) 807  ISSN: 2088-8708 Figure 2. Flowchart for floating point multiplication 3. ALGORITHMS AND ARCHITECTURES There are many algorithms and architectures used for multipliers some of them are suitable for FPGA while few are not. Here, we will discuss and implement two of these algorithms i.e, bit-serial multipliers and parallel multipliers. 3.1. Bit-Sequential Multiplier In Bit sequential multiplier also known as bit serial multiplier, input bits arrive on a single wire over a number of clock cycles and output bits are produces serially (Figure 3). Digital serial architecture process more than one bits at a time. This is done by dividing the N-bit word into X different digits of Y bits wide each, where N is X*Y. The main objective behind this approach is to maximise the speed of operator by a factor greater than its size. It works in the similar manner of manual multiplication of two decimals. This is relatively slow because adding N partial product requires N clock cycles. The easiest clocking scheme is to make use of a system clock. If the multiplier is embedded in a larger system. The clock in normally much slower than the maximum speed at which the simple iterative multiplier can be clocked, so if the delay is to be minimized and expensive and tricky clock multiplier is needed or the hardware must be self clocked [2]. IJECE Vol. 3, No. 6, December 2013 : 805 – 813 IJECE ISSN: 2088-8708  808 Figure 3. Simple bit-serial multiplier 3.1.1. Fixed Point 16x16 Bit-serial Design To design the fixed point 16x16 Bit Serial multiplier 33 bit accumulator was used to store the result. Bit adders were designed using “and”, “or” and “xor” gates [3]. A state machine was designed for multiplication to shift the bits. Total number of 32 states were used to perform the operation. when state is 0 and st is 1 the multiplier is loaded into accumulator and at the same time state changes to 1. The function named add16 is used to computes two 16 bit vectors whose result is 17 bit. This represents the adder output which is loaded into ACC and at the same time the state counter is incremented. The right shift on the ACC is accomplished by loading ACC with 0 concatenated with upper 32 bits of accumulator. The states will define the number of shifting and bit to bit addition of multiplier and multiplicand. Figure 4 is the simulation result of the fixed point bit serial multiplier. Figure 4. 16x16 Bit Serial Fixed point multiplier simulation We can seen that at 800ns 16 bit multiplicand having value of 229 in decimal is multiplied with the multiplier of 63549 value in decimal and their product is 14552721 which is at 1350ns because all the states are finished and the done signal is up at this stage. Evaluation of High Speed Hardware Multipliers - Fixed Point and Floating point (Syed Haider Abbas) 809  ISSN: 2088-8708 3.1.2. Floating Point 16x16 Bit-serial Design The same multiplier is designed for greater number of bits by using greater accumulator size. Floating Point 16x16 Bit serial design 16 bits for multiplicand and multipliers were divided into 10 bit mantissas and 6 bits of exponents. The 10x10 bit serial multiplication for mantissas of the 16x16 bit floating point multiplier. The 10x10bit multiplication is done in the similar manner as done above for the fixed point multiplier. The exponents are added to find the exponent of the product. Afterwards resulted exponents were normalised and adjusted the decimal point for the product. The final exponent and mantissa of the product is loaded properly according to floating point representation. The simulation results of the floated point can be seen below in Figure 5. X=(0101110010101111)2(floating) = (44000.25)10 Y=(0000000000000000)2(floating) = (0)10 Z=X*Y=(0001111010101111)2(floating) = (0.00002047)10≈ (0)10 Figure 5. 16x16 Bit Serial Floating point multiplier simulation Note that representations of 0 is an irritating anomaly, but the IEEE committee decided that it was benign compared to the complexity of any techniques they considered for avoiding it. Floating point representation has some limitations which includes the length of mantissa for example 0000000000000000 is when converted to decimal it comes to 4.65e-10. This is because of the non zero digit to the left side of the mantissa when converting it to the decimal. This is the same reason why we are getting not exact values for our examples. 3.2. Parallel Multiplier In bit parallel design all the input bits of the sample word are read in single clock cycle and output bits are read together. In addition, when the partial products are to be added the adders are connected in parallel. This requires no more hardware than a linear array but does have more complex interconnections. The time require to add partial products is now proportional to log N, so this can be much faster than large values of N. On other hand the extra complexity in the interconnection may result in greater size and more delay [4]. The partial products can be formed by an array of AND gates. These partial products can be added IJECE Vol. 3, No. 6, December 2013 : 805 – 813 IJECE ISSN: 2088-8708  810 together using three adders to add the four partial products after partial products are shifted and filled with zeros. For instance, A0 is 00001011 and A1 is 00000000 A2 is 00101100 and A3 is 01011000 whereas the partial product is equal to A0+A1+A2+A3. Another approach can be tried using carry save adder arrays [5]. 3.2.1. Fixed Point 16x16 bits Parallel Multiplier Design To design the fixed point parallel multiplier in VHDL 16 bit multiplier and multiplicand vectors were used and other vectors for partial product PP, partial sum PS, and partial carry PC are declared. Multiplier and multiplicand was loaded onto first layer of partial product. Full adderswere implemented for the addition of partial products [6]. Whereas, the product output from the partial sum forms the most significant bit of the product from the left most full adder’s carry output. Figure 6 is the 16x16 fixed point multiplier. We can see from the figure that there is no clock because the design has generate statements. Moreover, the product is shown at the same time in parallel multiplier. Figure 6.16x16 Bit Fixed point Parallel multiplier simulation 3.2.2. Floating Point 16x16 bits Parallel Multiplier Design Design of the floating point parallel multiplier is similar to that of bit serial parallel multiplier. The only difference is that instead of multiplying the 10 bit mantissa using bit serial multiplication we use parallel multiplication technique. The rest of the code is similar and need not many changes. The simulation results of the floated point can be seen in Figure 7. X=(1101110010101111)2(floating) = (44000.25)10 Y=(1100110100100010)2(floating) = (200.5)10 Z=X*Y= (0110110000011010)2(floating) = (8814592)10 ≈ (8822050.125)10 4. IMPLEMENTATION RESULTS The main objective of this research is to compare different multiplier algorithms. The multiplier designs presented above are designed using VHDL and synthesised using Xilinx ISE 13.1. Here, we present the implementation results of multiplier designs. The performance measures considered here are cell usage (Area) and delay [8]. Two different techniques of multiplier designs are investigated for fixed point for the various size of the multipliers. These sizes provides fair comparison of the two fixed point multipliers described in previous sections. Similarly for floating point multipliers 16 bit coefficient of the same multiplication techniques were implemented and their comparison is made between themselves and fixed point multiplication. During synthesis similar constraints and operating conditions were made as far as possible. Evaluation of High Speed Hardware Multipliers - Fixed Point and Floating point (Syed Haider Abbas) 811  ISSN: 2088-8708 Figure 7. 16x16 Bit Floating point Parallel multiplier simulation The table below gives the implementation results from synthesis of these multipliers using Xilinx ISE 13.1 tool. The best results based upon compiled values on maximum speed and minimum area is presented here. Speed optimization id focused on implementing the design in target technology library with shortest delay and fastest clock rate. Whereas area optimization looks for functional implementation of the design with smallest total cell area in the target library [9]. The area column is taken from the Xilinx synthesis tool. It shows the cell usage which includes the number of slices occupied by each design. Another result taken from the same tool is the estimated maximum clock rate for each design and optimization mode. it also represents the worst case propagation delay for each design in these tables. Table 1. Comparison results for fixed point multiplier Fixed Point Multipliers 8x8 bit 16x16 bit 32x32 bit 64x64 bit 128x128 bit Area 61 118 257 516 772 Serial Delay (ns) 7.190 8.124 9.816 11.754 145.632 Area 71 290 1194 3612 8229 Parallel Delay(ns) 27.569 52.708 101.214 199.053 391.98 The Table 1 gives the results of fixed point multipliers of different sizes form 8 bits to 128 bits simulated for two different types of multipliers Bit serial andParallel. It can be observed that the parallel multipliers have relatively greater delay and it occupies more area while bit serial multipliers consumes less space and little delay except 128x128 bit multiplier where greater delay can be seen but still its better than the parallel multipliers. Table 2. Comparison result for floating point multiplier Floating Point Multiplier (16x16 bit) Bit Serial Parallel IJECE Vol. 3, No. 6, December 2013 : 805 – 813 Area (No of slices used) Delay (ns) 59 9.883 108 30.637 IJECE  ISSN: 2088-8708 812 Table two gives the synthesis results for the floating point multiplier for 16x16 bit vector. These results can be compared with the fixed point multipliers of same technique to observe the difference. Now we will compare the fixed point multiplier results with the floating point designs the area for the both multiplier design with 16x16 bit size can be seen in the figure below. Where it can be observe clearly that the floating point design occupied less area than the fixed point multiplier. This result was anticipated earlier and the reason behind it is simple as in floating point multiplier only mantissa bits are multiplier while other bits for sign and exponent are added. Therefore, the number of slices occupied by the floating point is lower in number because few bits are multiplied using the multiplication techniques as shown. 350 300 250 150 Area 200 100 50 0 Bit Serial Fixed Point Parallel Floating point Figure 8. Fixed point and Floating point Area comparison graph The delay comparison can be seen below between fixed point and floating point multiplier and it can be seen clearly that fixed point multiplier has greater delay than floating point multiplier design whereas in bit serial the delay of both designs are very close to each other. 60 50 30 20 Delay (ns) 40 10 0 Bit serial Fixed Point Parallel Floating Point Figure 9. Fixed point and Floating point Delay comparison graph Evaluation of High Speed Hardware Multipliers - Fixed Point and Floating point (Syed Haider Abbas) 813  ISSN: 2088-8708 5. CONCLUSION Multipliers are used as benchmark designs. Two different multipliers for floating and fixed point multiplication are designed, implemented and studied. A comparative analysis of these multipliers is presented, which is the principal contribution of this research. It was found that as we increase in size the delay as well as area of the hardware increases. Moreover, Parallel multiplier posses more area as well as delay as compared to serial multiplier. Whereas, a floating point design of a parallel multiplier occupies lesser area than the fixed point design of the bit serial multiplier. REFERENCES [1] P Denyerand D Renshaw. ‘’VLSI Signal Processing.A Bit Serial Approach’’. [2] SG Smith, MS McGregor and PB Denyer. “Techniques to increase the computational throughput of bit serial architectures”. IEEE Proc on Intl Conf on Accoustics, Speech and Signal Processing. [3] L Kuhnel and Hartmut Schmeck. “A closer look at VLSI Multiplication”. Integration- the VLSI Journal. Sept 1988 [4] Avizienis A. “Signed digit number representations for fast parallel arithmetic”. IRE Trans, Elec Computer. Sept 1961. [5] Baugh CR and BA Wooly. “A two complement parallel array multiplier”. Dec 1973. [6] Oscar Gustafsson, Henrikohlsson. “Minimum adder integer multipliers using carry save adders”. Linkoping university SWEDON. [7] Carl Burch, Hendrix College, September 2011, ‘’Floating point representation IEEE-754’’ [8] Garry W Bewick. “Fast multiplication: Algorithms and implementation”. Feb 1994. [9] L Dadda. “Some schemes for parallel multipliers”. Alta Frequenza. May 1965. [10] “IEEE Standard for Binary floating point arithmetic”. ANSUIEEE std 754-1985, Newyork Aug 1985. IJECE Vol. 3, No. 6, December 2013 : 805 – 813