CHAPTER 4 DDS USING HWP CORDIC ALGORITHM

90
CHAPTER 4
DDS USING HWP CORDIC ALGORITHM
4.1
INTRODUCTION
Conventional DDFS implementations have disadvantages in area
and power (Song and Kim 2004b). The conventional implementation of DDS
is a brute-force approach using a lookup table (Nicholas 1988). In this
approach, the lookup table size grows exponentially as more bits are used to
represent the sine and cosine waveforms, although a memory compression
method applied to reduce the size of the lookup table.
In 1959Volder described the Coordinate Rotation Digital Computer
or CORDIC for the calculation of trigonometric functions, multiplication,
division and conversion between binary and mixed radix number systems. A
pure computational algorithm such as CORDIC has been proposed to replace
the large-sized lookup table.
Several algorithms are proposed for calculation of sine and cosine
function. The digit-by-digit methods for the computation of the elementary
functions
such
as
trigonometric,
inverse
trigonometric,
logarithm,
exponential, multiplication, and division functions described by Henry Briggs
in 1624 in “Arithmetica Logarithmica” (Metafas and Goutis 1990).
The programmable CORDIC chip for DSP applications were
proposed by Timmermann et al (1991). These are iterative pseudo division
91
and pseudo multiplication processes, which resemble repeated-addition
multiplication and repeated-subtraction division.
In 2000, Volder had proposed a special purpose digital computing
unit known as CORDIC, while building a real time navigational computer for
use in an aircraft. This algorithm was initially developed for trigonometric
functions which were expressed in terms of basic plane rotations.
The CORDIC algorithm computes 2D rotation using iterative
equations employing shift and add operations. Walther (1971) proposed a
unified algorithm to compute rotation in circular, linear, and hyperbolic
coordinate systems using the CORDIC algorithm. The CORDIC algorithm
developed by using an unified approach was proposed which is used in many
applications (Walther2000).
The DDS architecture based on the differential CORDIC
(DCORDIC) algorithm has presented by Kang et al (2006). The digit-level
pipelining in the CORDIC angle path implemented a two-dimensional
systolic array. More recently, the advances in the VLSI technology and the
advent of EDA tools have extended the application of CORDIC algorithm to
the field of software defined radio, MIMO systems (Wang et al 2006) and
neural networks (Meyer-Base et al 2003) etc.
Optimized DDFS for complex demodulation using CORDIC was
discussed by Jridi (2009). DDFS with 8Hz tuning frequency resolution and 20
bits output data (for sine and cosine waves) implemented in Xilinx FPGA
device giving a maximum operating frequency of more than 306 MHz and a
SFDR of 112dBc.
92
Sung et al (2009) had reported Hybrid CORDIC algorithm for
design and implementation of the DDFS. The multiplier-less architecture with
small ROM and pipelined data path provides a spurious free dynamic range of
84.4dBc. Later on, Eugene (1998) had proposed DDFS based on the Modified
Coordinate Rotation (CORDIC) algorithm.
A second order parabolic approximation for sine function has been
proposed by Sodagar (2001) which is so close to the sine function that
satisfies the accuracy requirements for sine computation in sine-output DDS.
The ROMless Parabolic DDS was designed with a 32-bits phase accumulator
whose output was truncated to 12 bits and the synthesizer had a 10-bit output.
The maximum operating frequency of this method was 175MHz, frequency
resolution was 0.0029Hz and maximum harmonic level was found to be
64dBc.
The CORDIC is an iterative approximation method, which can be
implemented without a lookup table but generates higher spurious harmonic
tones. In order to reduce the spurious tones, a significant number of iterations
and high resolution of the data path are required.
Using a computational algorithm and a lookup table in a combined
approach can offer better optimization in hardware complexity and speed
(Torosyan et al 2002). Further reduction of the lookup table size is desired as
the implementation cost of a digital communication system becomes more
important.
Antelo et al (2008) proposed the unfolded and pipelined CORDIC
by using linear approximation. The linear approximation schemes for rotation
93
(multiplication)
and
vectoring
(division)
were
complicated
the
implementation in a single unit.
This design demands on high resources on FPGA with more power.
Although to minimize the delay, the scale factor compensation technique was
combined with linear approximation. Even though, the pipelined CORDIC
used more number of registers to improve the throughput.
Later, Jun Ma et al (2000) developed the CORDIC-based IIR
digital filters based on fast orthogonal micro rotations for the realization. Each
orthogonal section realizes one real zero or a pair of complex conjugate zeros
of the transfer function. The CORDIC based IIR filter implementation leads
low sensitivity to finite word-length truncation in the filter stop band.
The CORDIC algorithm is performing vector rotations by arbitrary
angles using only shift and add operations. Volder’s algorithm is derived from
the general equations for a vector rotation. If a vector V with Coordinates
(x, y) is rotated through an angle
then a vector
can be obtained with
coordinates (x , y ) where x and y can be obtained using (x,y) and
by the
following method
X= rcos , Y= rsin
(4.1)
V=
(4.2)
=
94
Figure 4.1 Rotation of a vector ‘V’ by an angle
As shown in Figure 4.1, a vector V(x,y) can be resolved in two
parts along the x-axis and y-axis as rcos and rsin respectively. Figure 4.2
illustrates the rotation of a vector V=
by the angle .
Figure 4.2 Vector ‘V’ with magnitude r and phase
i.e x= r cos & y= r sin
(4.3)
95
Similarly, from Figure 4.1 it can be seen that vector V and V can
be resolved in to two parts. Let V has its magnitude and phase as r and
respectively and V has its magnitude and phase as r and
where V came in to
picture after anticlockwise rotation of vector V by an angle
. From
Figure 4.1 it can be observed
- =
(4.4)
= +
(4.5)
OX = x =r cos
= r cos(
)
= r (cos .cos -sin .sin )
= (rcos )cos -(rsin )sin
(4.6)
Using Figure 4.2 and equation 4.3 OX can be represented as
OX =x =xcos – ysin
(4.7)
Similarly, OY
OY =y =ycos +xsin
(4.8)
Similarly, value for the vector V in the clockwise direction rotating
the vector V by the angle
and the equations obtain in this case becomes
x = xcos + ysin
(4.9)
y = xsin – ysin
(4.10)
The Equations (4.7) to (4.10) can be represented in the matrix form as
=
(4.11)
The individual equations for x and y can be rewritten as:
96
x = xcos( ) ± ysin( )
(4.12)
y = ycos( )±xsin( )
(4.13)
Volder observed that by factoring out a cos
from both the sides,
resulting equation be in terms of the tangent of the angle
, the angle of
which to find sine and cosine. Then this equation can be rewritten as an
iterative formula.
x = cos (x ± ytan( )
(4.14)
y = cos (y±xtan( )
(4.15)
z =z ±
, here
is the angle of rotation (± sign is showing the
direction of rotation) and z is the argument. For the ease of calculation here
only rotation anticlockwise direction is observed first. Rearranging the
Equation (4.7) and (4.8)
x = cos( ) ((x-ytan( ))
(4.16)
y =cos( )(y+xtan( ))
(4.17)
The multiplication by the tangent can be avoided in the rotation
angles and therefore tan( ) is restricted so that tan( ) = 2-i. In the digital
hardware this denotes a simple shift operation. Furthermore, if those rotations
are performed iteratively and in both directions every value of tan( ) is
representable. With
= arctan(2-i ) the cosine term could also be simplified
and since cos ( ) = cos(- ). It is a constant for a fixed number of iterations.
97
This iterative rotation can be expressed as:
Xi+1 = ki(xi-yi.d i.2-i)
(4.18)
Yi+1 = ki(yi-xi.d i.2-i)
(4.19)
Where ‘i’ denotes the number of rotation required to reach the required angle
of the required vector ki= cos (arctan(2-i)) and di = ±1. The product of the ki’s
represent K factor (Walther 1971):
K=
(4.20)
is the angle of rotation here for n times rotation).
Table 4.1 12-bit CORDIC hardware
i
tan i=2-i
-i
i=arctan(2 )
0
1
45
0.7854
1
0.5
26.565
0.4636
2
0.25
14.036
0.2450
3
0.125
7.125
0.1244
4
0.0625
3.576
0.0624
5
0.03125
1.7876
0.0312
6
0.015625
0.8938
0.0156
7
0.0078125
0.4469
0.0078
8
0.00390625
0.22345
0.0039
9
0.001953125
0.111725
0.00195
10
0.0009765625
0.0558625
0.000975
11
0.00048828125
0.02793125
0.0004875
i
in radians
98
K is the gain and its value changes as the number of iteration
increases. For 8-bit hardware CORDIC approximation method the value of Ki
as K=
cos i = cos 0. cos 1 … … cos 7
= cos 45°cos26.565° … … … … … cos 0.4469°.
=0.6703
(4.21)
From the Table 4.1 it can be seen that precision up to 0.4469 is
possible for 8-bit CORDIC hardware. This
l are
stored in the ROM hardware
of the CORIC hardware as a look up table.
4.2
IMPLEMENTATION OF CORDIC ARCHITECTURES
In this chapter, different hardware architecture for sine and cosine
computation using CORDIC has been presented. If the sine and cosine
functions have been implemented in digital hardware, it needs more number
of multipliers for many algebraic methods. The alternative techniques such as
polynomial approximation, table-lookup method as well as shift and add
algorithms have reported by Andraka (1998). FPGAs are suitable for
hardware implementation of CORDIC algorithm as with other hardware
circuitry.
The CORDIC algorithm only performs shift and add operations and
it can be easy to implement and also suitable for FPGA based design. It is
necessary to analyze the various hardware architecture of CORDIC for the
present research work. This work has considered four different hardware
architectures of CORDIC. Its speed and area performance has been analyzed.
The most obvious methods of implementing a CORDIC such as bit-serial, bitparallel, unrolled and iterative, are described and compared in the following
sections.
99
4.2.1
Iterative CORDIC Architecture
The CORDIC structure as described in Equations (4.10) to (4.14) is
represented by the schematics in Figure 4.3 when directly translated into
hardware. Each branch consists of an adder-subtractor combination, a shift
unit and a register for buffering the output. At the beginning of a calculation
initial values are fed into the register by the multiplexer, where the MSB of
the stored value in the z-branch determines the operation mode for the addersubtractor unit. The data in the x and the y branch pass the shift units and are
then added to or subtracted from the unshifted data in the opposite path.
Figure 4.3 Block Diagram of Iterative CORDIC Architecture
The z branch arithmetically combines the register values with the
values taken from a lookup table whose address is changed according to the
number of iterations. For n iterations the output is mapped back to the
registers before initial values are fed. Then the final sine value can be
accessed at the output. The design has been implemented in a FPGA the
initial values for the vector coordinates as well as the constant values in the
100
LUT can be hardwired. The adder and the subtractor components are carried
out separately in the architecture. The multiplexer unit is controlled by the
sign of angle accumulator, which distinguishes between addition and
subtraction operations. The shift operations require a high fan in and reduce
the maximum speed when numbers of iterations are more. In addition the
throughput rate is also limited by the operations that are performed iteratively.
4.2.2
Unrolled CORDIC Architecture
In Unrolled CORDIC architecture, the output of one stage is the
input of the next one as shown in Figure 4.4. First, the shift operations for
each step can be performed by wiring the connections between stages
appropriately. Second, there is no need for changing constant values and that
can be hardwired as well. The purely unrolled design only consists of
combinatorial components and computes one sine and cosine value per clock
cycle. The input values find their path through the architecture on their own
and do not need to be controlled.
Obviously the resources in a FPGA are not very suitable for this
kind of architecture. A bit-parallel unrolled design with 16 bit word length,
each stage contains 48 inputs and outputs with a great number crossconnections between single stages. These cross connections from the x-path
through the shift components to the y-path and vice versa. It makes the design
difficult to route in a FPGA and cause additional delay times.
From Table 4.2 shows the performance and resource usage change
with the number of iterations if implemented in a XILINX FPGA. The area in
FPGAs can be measured in CLBs. The CLB consists of two lookup tables as
well as storage cells with additional control components (Andhraka 1998).
101
For the purely combinatorial design the CLBs function generators perform the
add and shift operations and storage cells are not used. This means registers
could be inserted easily without significantly increasing the area.
Figure 4.4 Block Diagram of Unrolled CORDIC Architecture
102
However, inserting registers between stages would also reduce the
maximum path delays and also increases the speed. It can be seen that the
number of CLBs are increased with the reduction in the maximum frequency.
The reason for that is the decreasing amount of combinatorial logic between
sequential cells. Obviously, the gain of speed when inserting registers exceeds
the cost of area and therefore the fully pipelined CORDIC is a suitable
solution for generating a sine wave in FPGA’s.
Table 4.2 Performance and CLB usage in Xilinx Spartan 3 FPGA
No. of Iterations
Complexity(CLB)
Max path delay(ns)
4.2.3
8
9
10
11
12
184
208
232
256
280
163.75 177.17
206.9
225.72
256.87
Pipelined CORDIC Architecture
Both the Unrolled and the iterative bit-Parallel designs, show
disadvantages in terms of complexities and path delays along with the large
number of cross connections between single stages. In order to reduce the
complexity, the architecture of bit parallel unrolled CORDIC has been
modified. The modified architecture is known as Bit serial iterative CORDIC
(Wang 2005). Bit-serial means only one bit is processed at the time and hence
the cross connection become one bit-wide data paths.
103
The throughput of pipelined CORDIC is as high as with the
unrolled design as shown in Figure 4.5. The structural simplicity of a bitserial design has been achieved high throughput with speed.
Figure 4.5 Block diagram of Pipelined CORDIC Architecture
The pipeline CORDIC design is implemented in a XILINX Spartan
device. The performance is constrained by the use of multiplexers for the shift
operation and even more for the constant LUT. The latter could be replaced
by a RAM or serial ROM where values are read by simply incrementing the
memory’s address.
104
4.2.4
HWP CORDIC Architecture
The development of the HWP CORDIC architecture was carried
out for achieving the highest throughput rate and reduction of hardwarecomplexity as well as the computational latency of implementation. Some of
the typical approaches for reducing complexity implementation are targeted
on minimization of using the scaling-operation and complexity of barrelshifters and adders in the CORDIC engine. Angle recoding schemes, mixedgrain rotation and higher radix CORDIC have been developed for reduced
latency realization. The parallel and pipelined CORDIC have been suggested
for high-throughput computation (Lakshmi 2010). The inherent drawbacks of
the conventional pipeline CORDIC algorithm are computational latency and
hardware complexity. Hence, an attempt has been made in the present
research work to concentrate on (i) reduction of computational latency (ii)
reduction of clock routing and hardware complexity of the pipelined CORDIC
architecture based-on Hybrid wave pipelining technique.
105
Sign(z)
X0
R
e
g
X Register
R
e
g
Adder/
Subtractor
x(n)
Y0
y(n)
R
e
g
Y Register
R
e
g
Adder/
Subtractor
Sign(z)
Z0
Sign(z)
z(n)
R
e
g
Angle Register
R
e
g
Adder/
Subtractor
Constant LUT
Figure 4.6 Block digram of HWP CORDIC Architecture
Figure 4.6 shows the architecture of the present HWP CORDIC
processor. This architecture includes X register, Y Register, Angle Register,
Adder/Subtractor and Multiplexers. The Adder/Subtractor module has been
implemented with basic full Adder Circuit. The HWP CORDIC processor can
accept 12 bit input data and produce sine and cosine outputs. The X Register,
Y Register and Angle Registers were separated with D-Flip flip or Register to
compensate the path delays and increases the throughput. A multiplexer can
be used to change position according to the current iteration. The initial
values X0, Y0 and Z0 are fed into the array at the left end of this serial-in
serial-out register. Finally, when all iterations are passed the input
106
multiplexers switch again and initial values enter the HWP CORDIC
processor as the computed sine and cosine values exit.
The HWP CORDIC architecture has been implemented on to
Xilinx Spartan FPGA and its critical paths and path delay characteristics were
analyzed. In the conventional pipeline CORDIC architecture Dmax has been
observed at 5.265ns and Dmin was found to be 2.345ns. The maximum and
minimum delay difference was measured as 2.92ns.
In HWP CORDIC
architecture the critical paths were adjusted and it was found to be 4.365ns
(Dmax) and 2.784nsec (Dmin) respectively. The path delay difference was
measured at 1.581ns. From the observation it is clear that the Dmax and Dmin
values are less in the present HWP CORDIC architecture.
4.3
COMPARISON OF CORDIC ARCHITECTURES
Various CORDIC architectures used for FPGA implementations
have been discussed in previous section. Table 4.3 illustrates the comparison
between the various CORDIC architectures such as iterative, unrolled,
Pipelined and HWP CORDIC. The iterative design stands out due to its low
area usage and low speed, whereas the maximum throughput rate is much
lower compared to the bit-parallel designs. The bit-parallel unrolled and fully
pipelined design uses the resources extensively but shows the best latency per
sample and maximum throughput rate. The HWP CORDIC design provides a
balance between unrolled and bit-serial design. It shows an optimum usage of
the resources with high speed and maximum throughput rate.
107
Table 4.3
Performance and Area usage for the various CORDIC
architectures
Architecture Type
4.4
# Slices
# LUT
#Flip-flops
Speed (MHz)
Iterative CORDIC
52
63
68
48
Unrolled CORDIC
68
84
78
76
Pipelined CORDIC
57
72
56
96
HWP CORDIC
36
33
44
188
RESULTS AND DISCUSSION
The present research work has also taken the comparison among
various CORDIC architecture. The VHDL code has been developed for
Pipeline CORDIC and its architecture has been modified into HWP CORDIC
by altering the critical path delays. From the timing details of the synthesis
report, the maximum(Dmax)and minimum(Dmin) path delays have been
idenified after downloading the source code into Spatran FPGA. The input
data have been applied and delay values are altered by using Xilinx FPGA
Editor.The Figure 4.6a shows the ModelSim simulation result of pipelined
CORDIC.The output samples have been taken and sine and cosine output
waveforms are plotted in Figure 4.6b.
Figure 4.6a Sine and Cosine output of CORDIC using ModelSim6.2g
108
Sine Plot
100
50
0
1
Angle in degrees
150
-50
-100
-150
Time in ns
Sine Value
Cosine Plot
150
Angle in degrees
100
50
0
-50
-100
-150
Time in ns
Cosine
Figure 4.6b Sine and Cosine waveform generated by CORDIC samples
The conventional pipelinedCORDIC and HWP CORDIC based
DDS architectures
have been considered for performance analyis of
CORDIC. From the comparion result, that the pipeline CORDIC and HWP
CORDIC produce better performance than other two architectures. Hence,
pipelined CORDIC and HWP CORDIC have been taken for DDS system
integration. This chapter deals with the performance analysis of both pipeline
CORDIC DDS and HWP CORDIC DDS. Figures 4.7and 4.11 illustrate the
design summary of pipeline CORDIC DDS and HWP CORDIC DDS
architetcure respectively. The pipeline CORDIC DDS uses 47 slices, 41 slice
flipflops, 73 four input LUTs, 48 IOBs and 40 IOB Flipflops. Whereas HWP
CORDIC DDS utilizes 45 slice flipflops, 32 four input LUTs, 83 IOBs and 43
109
IOB flipflops. Figures 4.8 and 4.12 show the RTL schematic of pipeline
CORDIC DDS and HWP CORDIC DDS.
Figure 4.7 Design summary of Pipelined CORDIC DDS
Figure 4.8 RTL Schematic of pipelined CORDIC DDS architecture
110
Figure 4.9 Power summary of pipelined CORDIC DDS
Figure 4.10 shows the block diagram of HWP CORDIC DDS
architecture.
This
DDS
design
includes
the
conventional
phase
accumulatorand HWP CORDIC. The input of the PA is 32bits and its output
is 16bits. The 16bits are quantized into 12bits and it is given to HWP
CORDIC. If the 16bit input is used for the HWP CORDIC the number of
iterations are increased in sine and cosine calculation. In order to minimize
the hardware complexity 12bit input is used for HWP CORDIC.
Phase Accumulator
FCW
Input
Sine
Cosine
fclk
Figure 4.10 Block Diagram of HWP CORDIC DDS architecture
111
Figure 4.11 Design summary of HWP CORDIC DDS architecture
The pipeline CORDIC DDS has been consumed 82mW of power
and HWP CORDIC DDS consumed 78mW of power dissipation as shown in
Figures 4.9 and 4.13.
Figure 4.12 RTL Schematic of HWP CORDIC DDS
112
Figure 4.13 Power summary of HWP CORDIC DDS
Table 4.4 illustrates the performance comparision of area and
speed of the DDS architectures pipelined CORDIC and HWP CORDIC. The
area usuage of pipeline CORDIC DDS and HWP CORDIC DDS are almost
equal in terms of Flipflops and Four input LUTs. The total estimated power
of HWP CORDIC DDS has been reduced to 78mW and throughput is
increased from 75% to 81% with 176.34MHz frequency improvement.
Table 4.4 Performance comparison of Area and Speed
Parameters
Pipelined CORDIC DDS
HWP CORDIC DDS
# slice flip-flop
41
45
# Slices
47
23
# 4input LUT
73
32
# Bonded IOB
48
83
Total estimated Power Consumption
(mW)
82
78
Throughput (%)
75
81
164.26
176.34
Maximum frequency(MHz)
113
Figure 4.14 shows the ModelSim simulation result of HWP
CORDIC DDS architecture design. It uses 100MHz reference clock and
generates the high frequency of 47MHz output. The same design
can
produce minimum frequency to maximum frequencyas shown in Figures 4.15
to 4.21. The common reference clock frequency and input bits have been
consider for this work. Because these values can
produce the optimum
outputs such as frequency resolution, speed, area, throughput, power and
SFDR etc. Figure 4.15 has taken 28F5C28h FCW and it can generate the
1MHz output sine and cosine waveform.
Similarly, Figures 4.16, 4.17, 4.18 and 4.19 shows the output
frequencies such as 1.7, 3.5, 4.5 and 7.5MHz. The correcsponding FCWs are
45A1CACh, 8F5C28Fh, 251EB851h and 13333333h respectively. The output
frequencies 25 and 50MHz also has been generated by applying the FCWs of
40000000h and 80000000h respectively as shown in Figures 4.20 and 4.21.
114
Figure 4.14 ModelSim output of HWP CORDIC DDS architecture
Figure 4.15 1MHz HWP CORDIC DDS output
115
Figure 4.16 1.7MHz HWP CORDIC DDS output
Figure 4.17 3.5MHz HWP CORDIC DDS output
Figure 4.18 4.5MHz HWP CORDIC DDS output
116
Figure 4.19 7.5MHz HWP CORDIC DDS output
Figure 4.20 25MHz HWP CORDIC DDS output
Figure 4.21 50MHz HWP CORDIC DDS output
117
4.5
CONCLUSION
In this chapter, the researcher discusses the basic CORDIC
equation for sine and cosine calculation and analyzed about various CORDIC
architecture. The VHDL code has been developed for HWP CORDIC
architecture and it is simulated using Xilinx ISE9.2i and ModelSim6.2g. The
sine and cosine samples are taken and plotted. The HWP CORDIC has been
developed with DDS top module, the synthesis and simulation results are
verified. The Pipeline PA with pipelined CORDIC based DDS has been
compared with HWP CORDIC based DDS design in terms of speed and area.
From the synthesis and implementation, it is clear that HWP CORDIC has
produced the better performance than pipelined CORDIC.