90 CHAPTER 4 DDS USING HWP CORDIC ALGORITHM 4.1 INTRODUCTION Conventional DDFS implementations have disadvantages in area and power (Song and Kim 2004b). The conventional implementation of DDS is a brute-force approach using a lookup table (Nicholas 1988). In this approach, the lookup table size grows exponentially as more bits are used to represent the sine and cosine waveforms, although a memory compression method applied to reduce the size of the lookup table. In 1959Volder described the Coordinate Rotation Digital Computer or CORDIC for the calculation of trigonometric functions, multiplication, division and conversion between binary and mixed radix number systems. A pure computational algorithm such as CORDIC has been proposed to replace the large-sized lookup table. Several algorithms are proposed for calculation of sine and cosine function. The digit-by-digit methods for the computation of the elementary functions such as trigonometric, inverse trigonometric, logarithm, exponential, multiplication, and division functions described by Henry Briggs in 1624 in “Arithmetica Logarithmica” (Metafas and Goutis 1990). The programmable CORDIC chip for DSP applications were proposed by Timmermann et al (1991). These are iterative pseudo division 91 and pseudo multiplication processes, which resemble repeated-addition multiplication and repeated-subtraction division. In 2000, Volder had proposed a special purpose digital computing unit known as CORDIC, while building a real time navigational computer for use in an aircraft. This algorithm was initially developed for trigonometric functions which were expressed in terms of basic plane rotations. The CORDIC algorithm computes 2D rotation using iterative equations employing shift and add operations. Walther (1971) proposed a unified algorithm to compute rotation in circular, linear, and hyperbolic coordinate systems using the CORDIC algorithm. The CORDIC algorithm developed by using an unified approach was proposed which is used in many applications (Walther2000). The DDS architecture based on the differential CORDIC (DCORDIC) algorithm has presented by Kang et al (2006). The digit-level pipelining in the CORDIC angle path implemented a two-dimensional systolic array. More recently, the advances in the VLSI technology and the advent of EDA tools have extended the application of CORDIC algorithm to the field of software defined radio, MIMO systems (Wang et al 2006) and neural networks (Meyer-Base et al 2003) etc. Optimized DDFS for complex demodulation using CORDIC was discussed by Jridi (2009). DDFS with 8Hz tuning frequency resolution and 20 bits output data (for sine and cosine waves) implemented in Xilinx FPGA device giving a maximum operating frequency of more than 306 MHz and a SFDR of 112dBc. 92 Sung et al (2009) had reported Hybrid CORDIC algorithm for design and implementation of the DDFS. The multiplier-less architecture with small ROM and pipelined data path provides a spurious free dynamic range of 84.4dBc. Later on, Eugene (1998) had proposed DDFS based on the Modified Coordinate Rotation (CORDIC) algorithm. A second order parabolic approximation for sine function has been proposed by Sodagar (2001) which is so close to the sine function that satisfies the accuracy requirements for sine computation in sine-output DDS. The ROMless Parabolic DDS was designed with a 32-bits phase accumulator whose output was truncated to 12 bits and the synthesizer had a 10-bit output. The maximum operating frequency of this method was 175MHz, frequency resolution was 0.0029Hz and maximum harmonic level was found to be 64dBc. The CORDIC is an iterative approximation method, which can be implemented without a lookup table but generates higher spurious harmonic tones. In order to reduce the spurious tones, a significant number of iterations and high resolution of the data path are required. Using a computational algorithm and a lookup table in a combined approach can offer better optimization in hardware complexity and speed (Torosyan et al 2002). Further reduction of the lookup table size is desired as the implementation cost of a digital communication system becomes more important. Antelo et al (2008) proposed the unfolded and pipelined CORDIC by using linear approximation. The linear approximation schemes for rotation 93 (multiplication) and vectoring (division) were complicated the implementation in a single unit. This design demands on high resources on FPGA with more power. Although to minimize the delay, the scale factor compensation technique was combined with linear approximation. Even though, the pipelined CORDIC used more number of registers to improve the throughput. Later, Jun Ma et al (2000) developed the CORDIC-based IIR digital filters based on fast orthogonal micro rotations for the realization. Each orthogonal section realizes one real zero or a pair of complex conjugate zeros of the transfer function. The CORDIC based IIR filter implementation leads low sensitivity to finite word-length truncation in the filter stop band. The CORDIC algorithm is performing vector rotations by arbitrary angles using only shift and add operations. Volder’s algorithm is derived from the general equations for a vector rotation. If a vector V with Coordinates (x, y) is rotated through an angle then a vector can be obtained with coordinates (x , y ) where x and y can be obtained using (x,y) and by the following method X= rcos , Y= rsin (4.1) V= (4.2) = 94 Figure 4.1 Rotation of a vector ‘V’ by an angle As shown in Figure 4.1, a vector V(x,y) can be resolved in two parts along the x-axis and y-axis as rcos and rsin respectively. Figure 4.2 illustrates the rotation of a vector V= by the angle . Figure 4.2 Vector ‘V’ with magnitude r and phase i.e x= r cos & y= r sin (4.3) 95 Similarly, from Figure 4.1 it can be seen that vector V and V can be resolved in to two parts. Let V has its magnitude and phase as r and respectively and V has its magnitude and phase as r and where V came in to picture after anticlockwise rotation of vector V by an angle . From Figure 4.1 it can be observed - = (4.4) = + (4.5) OX = x =r cos = r cos( ) = r (cos .cos -sin .sin ) = (rcos )cos -(rsin )sin (4.6) Using Figure 4.2 and equation 4.3 OX can be represented as OX =x =xcos – ysin (4.7) Similarly, OY OY =y =ycos +xsin (4.8) Similarly, value for the vector V in the clockwise direction rotating the vector V by the angle and the equations obtain in this case becomes x = xcos + ysin (4.9) y = xsin – ysin (4.10) The Equations (4.7) to (4.10) can be represented in the matrix form as = (4.11) The individual equations for x and y can be rewritten as: 96 x = xcos( ) ± ysin( ) (4.12) y = ycos( )±xsin( ) (4.13) Volder observed that by factoring out a cos from both the sides, resulting equation be in terms of the tangent of the angle , the angle of which to find sine and cosine. Then this equation can be rewritten as an iterative formula. x = cos (x ± ytan( ) (4.14) y = cos (y±xtan( ) (4.15) z =z ± , here is the angle of rotation (± sign is showing the direction of rotation) and z is the argument. For the ease of calculation here only rotation anticlockwise direction is observed first. Rearranging the Equation (4.7) and (4.8) x = cos( ) ((x-ytan( )) (4.16) y =cos( )(y+xtan( )) (4.17) The multiplication by the tangent can be avoided in the rotation angles and therefore tan( ) is restricted so that tan( ) = 2-i. In the digital hardware this denotes a simple shift operation. Furthermore, if those rotations are performed iteratively and in both directions every value of tan( ) is representable. With = arctan(2-i ) the cosine term could also be simplified and since cos ( ) = cos(- ). It is a constant for a fixed number of iterations. 97 This iterative rotation can be expressed as: Xi+1 = ki(xi-yi.d i.2-i) (4.18) Yi+1 = ki(yi-xi.d i.2-i) (4.19) Where ‘i’ denotes the number of rotation required to reach the required angle of the required vector ki= cos (arctan(2-i)) and di = ±1. The product of the ki’s represent K factor (Walther 1971): K= (4.20) is the angle of rotation here for n times rotation). Table 4.1 12-bit CORDIC hardware i tan i=2-i -i i=arctan(2 ) 0 1 45 0.7854 1 0.5 26.565 0.4636 2 0.25 14.036 0.2450 3 0.125 7.125 0.1244 4 0.0625 3.576 0.0624 5 0.03125 1.7876 0.0312 6 0.015625 0.8938 0.0156 7 0.0078125 0.4469 0.0078 8 0.00390625 0.22345 0.0039 9 0.001953125 0.111725 0.00195 10 0.0009765625 0.0558625 0.000975 11 0.00048828125 0.02793125 0.0004875 i in radians 98 K is the gain and its value changes as the number of iteration increases. For 8-bit hardware CORDIC approximation method the value of Ki as K= cos i = cos 0. cos 1 … … cos 7 = cos 45°cos26.565° … … … … … cos 0.4469°. =0.6703 (4.21) From the Table 4.1 it can be seen that precision up to 0.4469 is possible for 8-bit CORDIC hardware. This l are stored in the ROM hardware of the CORIC hardware as a look up table. 4.2 IMPLEMENTATION OF CORDIC ARCHITECTURES In this chapter, different hardware architecture for sine and cosine computation using CORDIC has been presented. If the sine and cosine functions have been implemented in digital hardware, it needs more number of multipliers for many algebraic methods. The alternative techniques such as polynomial approximation, table-lookup method as well as shift and add algorithms have reported by Andraka (1998). FPGAs are suitable for hardware implementation of CORDIC algorithm as with other hardware circuitry. The CORDIC algorithm only performs shift and add operations and it can be easy to implement and also suitable for FPGA based design. It is necessary to analyze the various hardware architecture of CORDIC for the present research work. This work has considered four different hardware architectures of CORDIC. Its speed and area performance has been analyzed. The most obvious methods of implementing a CORDIC such as bit-serial, bitparallel, unrolled and iterative, are described and compared in the following sections. 99 4.2.1 Iterative CORDIC Architecture The CORDIC structure as described in Equations (4.10) to (4.14) is represented by the schematics in Figure 4.3 when directly translated into hardware. Each branch consists of an adder-subtractor combination, a shift unit and a register for buffering the output. At the beginning of a calculation initial values are fed into the register by the multiplexer, where the MSB of the stored value in the z-branch determines the operation mode for the addersubtractor unit. The data in the x and the y branch pass the shift units and are then added to or subtracted from the unshifted data in the opposite path. Figure 4.3 Block Diagram of Iterative CORDIC Architecture The z branch arithmetically combines the register values with the values taken from a lookup table whose address is changed according to the number of iterations. For n iterations the output is mapped back to the registers before initial values are fed. Then the final sine value can be accessed at the output. The design has been implemented in a FPGA the initial values for the vector coordinates as well as the constant values in the 100 LUT can be hardwired. The adder and the subtractor components are carried out separately in the architecture. The multiplexer unit is controlled by the sign of angle accumulator, which distinguishes between addition and subtraction operations. The shift operations require a high fan in and reduce the maximum speed when numbers of iterations are more. In addition the throughput rate is also limited by the operations that are performed iteratively. 4.2.2 Unrolled CORDIC Architecture In Unrolled CORDIC architecture, the output of one stage is the input of the next one as shown in Figure 4.4. First, the shift operations for each step can be performed by wiring the connections between stages appropriately. Second, there is no need for changing constant values and that can be hardwired as well. The purely unrolled design only consists of combinatorial components and computes one sine and cosine value per clock cycle. The input values find their path through the architecture on their own and do not need to be controlled. Obviously the resources in a FPGA are not very suitable for this kind of architecture. A bit-parallel unrolled design with 16 bit word length, each stage contains 48 inputs and outputs with a great number crossconnections between single stages. These cross connections from the x-path through the shift components to the y-path and vice versa. It makes the design difficult to route in a FPGA and cause additional delay times. From Table 4.2 shows the performance and resource usage change with the number of iterations if implemented in a XILINX FPGA. The area in FPGAs can be measured in CLBs. The CLB consists of two lookup tables as well as storage cells with additional control components (Andhraka 1998). 101 For the purely combinatorial design the CLBs function generators perform the add and shift operations and storage cells are not used. This means registers could be inserted easily without significantly increasing the area. Figure 4.4 Block Diagram of Unrolled CORDIC Architecture 102 However, inserting registers between stages would also reduce the maximum path delays and also increases the speed. It can be seen that the number of CLBs are increased with the reduction in the maximum frequency. The reason for that is the decreasing amount of combinatorial logic between sequential cells. Obviously, the gain of speed when inserting registers exceeds the cost of area and therefore the fully pipelined CORDIC is a suitable solution for generating a sine wave in FPGA’s. Table 4.2 Performance and CLB usage in Xilinx Spartan 3 FPGA No. of Iterations Complexity(CLB) Max path delay(ns) 4.2.3 8 9 10 11 12 184 208 232 256 280 163.75 177.17 206.9 225.72 256.87 Pipelined CORDIC Architecture Both the Unrolled and the iterative bit-Parallel designs, show disadvantages in terms of complexities and path delays along with the large number of cross connections between single stages. In order to reduce the complexity, the architecture of bit parallel unrolled CORDIC has been modified. The modified architecture is known as Bit serial iterative CORDIC (Wang 2005). Bit-serial means only one bit is processed at the time and hence the cross connection become one bit-wide data paths. 103 The throughput of pipelined CORDIC is as high as with the unrolled design as shown in Figure 4.5. The structural simplicity of a bitserial design has been achieved high throughput with speed. Figure 4.5 Block diagram of Pipelined CORDIC Architecture The pipeline CORDIC design is implemented in a XILINX Spartan device. The performance is constrained by the use of multiplexers for the shift operation and even more for the constant LUT. The latter could be replaced by a RAM or serial ROM where values are read by simply incrementing the memory’s address. 104 4.2.4 HWP CORDIC Architecture The development of the HWP CORDIC architecture was carried out for achieving the highest throughput rate and reduction of hardwarecomplexity as well as the computational latency of implementation. Some of the typical approaches for reducing complexity implementation are targeted on minimization of using the scaling-operation and complexity of barrelshifters and adders in the CORDIC engine. Angle recoding schemes, mixedgrain rotation and higher radix CORDIC have been developed for reduced latency realization. The parallel and pipelined CORDIC have been suggested for high-throughput computation (Lakshmi 2010). The inherent drawbacks of the conventional pipeline CORDIC algorithm are computational latency and hardware complexity. Hence, an attempt has been made in the present research work to concentrate on (i) reduction of computational latency (ii) reduction of clock routing and hardware complexity of the pipelined CORDIC architecture based-on Hybrid wave pipelining technique. 105 Sign(z) X0 R e g X Register R e g Adder/ Subtractor x(n) Y0 y(n) R e g Y Register R e g Adder/ Subtractor Sign(z) Z0 Sign(z) z(n) R e g Angle Register R e g Adder/ Subtractor Constant LUT Figure 4.6 Block digram of HWP CORDIC Architecture Figure 4.6 shows the architecture of the present HWP CORDIC processor. This architecture includes X register, Y Register, Angle Register, Adder/Subtractor and Multiplexers. The Adder/Subtractor module has been implemented with basic full Adder Circuit. The HWP CORDIC processor can accept 12 bit input data and produce sine and cosine outputs. The X Register, Y Register and Angle Registers were separated with D-Flip flip or Register to compensate the path delays and increases the throughput. A multiplexer can be used to change position according to the current iteration. The initial values X0, Y0 and Z0 are fed into the array at the left end of this serial-in serial-out register. Finally, when all iterations are passed the input 106 multiplexers switch again and initial values enter the HWP CORDIC processor as the computed sine and cosine values exit. The HWP CORDIC architecture has been implemented on to Xilinx Spartan FPGA and its critical paths and path delay characteristics were analyzed. In the conventional pipeline CORDIC architecture Dmax has been observed at 5.265ns and Dmin was found to be 2.345ns. The maximum and minimum delay difference was measured as 2.92ns. In HWP CORDIC architecture the critical paths were adjusted and it was found to be 4.365ns (Dmax) and 2.784nsec (Dmin) respectively. The path delay difference was measured at 1.581ns. From the observation it is clear that the Dmax and Dmin values are less in the present HWP CORDIC architecture. 4.3 COMPARISON OF CORDIC ARCHITECTURES Various CORDIC architectures used for FPGA implementations have been discussed in previous section. Table 4.3 illustrates the comparison between the various CORDIC architectures such as iterative, unrolled, Pipelined and HWP CORDIC. The iterative design stands out due to its low area usage and low speed, whereas the maximum throughput rate is much lower compared to the bit-parallel designs. The bit-parallel unrolled and fully pipelined design uses the resources extensively but shows the best latency per sample and maximum throughput rate. The HWP CORDIC design provides a balance between unrolled and bit-serial design. It shows an optimum usage of the resources with high speed and maximum throughput rate. 107 Table 4.3 Performance and Area usage for the various CORDIC architectures Architecture Type 4.4 # Slices # LUT #Flip-flops Speed (MHz) Iterative CORDIC 52 63 68 48 Unrolled CORDIC 68 84 78 76 Pipelined CORDIC 57 72 56 96 HWP CORDIC 36 33 44 188 RESULTS AND DISCUSSION The present research work has also taken the comparison among various CORDIC architecture. The VHDL code has been developed for Pipeline CORDIC and its architecture has been modified into HWP CORDIC by altering the critical path delays. From the timing details of the synthesis report, the maximum(Dmax)and minimum(Dmin) path delays have been idenified after downloading the source code into Spatran FPGA. The input data have been applied and delay values are altered by using Xilinx FPGA Editor.The Figure 4.6a shows the ModelSim simulation result of pipelined CORDIC.The output samples have been taken and sine and cosine output waveforms are plotted in Figure 4.6b. Figure 4.6a Sine and Cosine output of CORDIC using ModelSim6.2g 108 Sine Plot 100 50 0 1 Angle in degrees 150 -50 -100 -150 Time in ns Sine Value Cosine Plot 150 Angle in degrees 100 50 0 -50 -100 -150 Time in ns Cosine Figure 4.6b Sine and Cosine waveform generated by CORDIC samples The conventional pipelinedCORDIC and HWP CORDIC based DDS architectures have been considered for performance analyis of CORDIC. From the comparion result, that the pipeline CORDIC and HWP CORDIC produce better performance than other two architectures. Hence, pipelined CORDIC and HWP CORDIC have been taken for DDS system integration. This chapter deals with the performance analysis of both pipeline CORDIC DDS and HWP CORDIC DDS. Figures 4.7and 4.11 illustrate the design summary of pipeline CORDIC DDS and HWP CORDIC DDS architetcure respectively. The pipeline CORDIC DDS uses 47 slices, 41 slice flipflops, 73 four input LUTs, 48 IOBs and 40 IOB Flipflops. Whereas HWP CORDIC DDS utilizes 45 slice flipflops, 32 four input LUTs, 83 IOBs and 43 109 IOB flipflops. Figures 4.8 and 4.12 show the RTL schematic of pipeline CORDIC DDS and HWP CORDIC DDS. Figure 4.7 Design summary of Pipelined CORDIC DDS Figure 4.8 RTL Schematic of pipelined CORDIC DDS architecture 110 Figure 4.9 Power summary of pipelined CORDIC DDS Figure 4.10 shows the block diagram of HWP CORDIC DDS architecture. This DDS design includes the conventional phase accumulatorand HWP CORDIC. The input of the PA is 32bits and its output is 16bits. The 16bits are quantized into 12bits and it is given to HWP CORDIC. If the 16bit input is used for the HWP CORDIC the number of iterations are increased in sine and cosine calculation. In order to minimize the hardware complexity 12bit input is used for HWP CORDIC. Phase Accumulator FCW Input Sine Cosine fclk Figure 4.10 Block Diagram of HWP CORDIC DDS architecture 111 Figure 4.11 Design summary of HWP CORDIC DDS architecture The pipeline CORDIC DDS has been consumed 82mW of power and HWP CORDIC DDS consumed 78mW of power dissipation as shown in Figures 4.9 and 4.13. Figure 4.12 RTL Schematic of HWP CORDIC DDS 112 Figure 4.13 Power summary of HWP CORDIC DDS Table 4.4 illustrates the performance comparision of area and speed of the DDS architectures pipelined CORDIC and HWP CORDIC. The area usuage of pipeline CORDIC DDS and HWP CORDIC DDS are almost equal in terms of Flipflops and Four input LUTs. The total estimated power of HWP CORDIC DDS has been reduced to 78mW and throughput is increased from 75% to 81% with 176.34MHz frequency improvement. Table 4.4 Performance comparison of Area and Speed Parameters Pipelined CORDIC DDS HWP CORDIC DDS # slice flip-flop 41 45 # Slices 47 23 # 4input LUT 73 32 # Bonded IOB 48 83 Total estimated Power Consumption (mW) 82 78 Throughput (%) 75 81 164.26 176.34 Maximum frequency(MHz) 113 Figure 4.14 shows the ModelSim simulation result of HWP CORDIC DDS architecture design. It uses 100MHz reference clock and generates the high frequency of 47MHz output. The same design can produce minimum frequency to maximum frequencyas shown in Figures 4.15 to 4.21. The common reference clock frequency and input bits have been consider for this work. Because these values can produce the optimum outputs such as frequency resolution, speed, area, throughput, power and SFDR etc. Figure 4.15 has taken 28F5C28h FCW and it can generate the 1MHz output sine and cosine waveform. Similarly, Figures 4.16, 4.17, 4.18 and 4.19 shows the output frequencies such as 1.7, 3.5, 4.5 and 7.5MHz. The correcsponding FCWs are 45A1CACh, 8F5C28Fh, 251EB851h and 13333333h respectively. The output frequencies 25 and 50MHz also has been generated by applying the FCWs of 40000000h and 80000000h respectively as shown in Figures 4.20 and 4.21. 114 Figure 4.14 ModelSim output of HWP CORDIC DDS architecture Figure 4.15 1MHz HWP CORDIC DDS output 115 Figure 4.16 1.7MHz HWP CORDIC DDS output Figure 4.17 3.5MHz HWP CORDIC DDS output Figure 4.18 4.5MHz HWP CORDIC DDS output 116 Figure 4.19 7.5MHz HWP CORDIC DDS output Figure 4.20 25MHz HWP CORDIC DDS output Figure 4.21 50MHz HWP CORDIC DDS output 117 4.5 CONCLUSION In this chapter, the researcher discusses the basic CORDIC equation for sine and cosine calculation and analyzed about various CORDIC architecture. The VHDL code has been developed for HWP CORDIC architecture and it is simulated using Xilinx ISE9.2i and ModelSim6.2g. The sine and cosine samples are taken and plotted. The HWP CORDIC has been developed with DDS top module, the synthesis and simulation results are verified. The Pipeline PA with pipelined CORDIC based DDS has been compared with HWP CORDIC based DDS design in terms of speed and area. From the synthesis and implementation, it is clear that HWP CORDIC has produced the better performance than pipelined CORDIC.
© Copyright 2024 ExpyDoc