Low Power VLSI Architectures for Variable-Length - KIT

Low Power VLSI Architectures for Variable-Length Encoding and Decoding
Stephen Molloy and Rajeev Jain
UCLA Department of Electrical Engineering
Engineering IV, 53-138N15
Los Angeles, CA 90024
Abstract - New low power VLSI architectures are presented for
variable-length encoding and decoding. Look-up table partitioning by symbol probability is shown to reduce the total
power consumption of the variable-length decoder and encoder
are reduced by as much as 66% and 75%, respectively. Design
examples for a subband image CODEC are presented with
measurements from extracted VLSI layouts.
I. INTRODUCTION
Variable-length coding is a lossless form of coding
that achieves compression by assigning shorter length codewords to commonly occurring symbols in a source alphabet,
while assigning longer codewords to statistically rare symbols [ 13. Variable-length coding has been effectively applied
as one step in the MPEG and P E G image compression standards, as well as in alternative non-standard coding
schemes, such as those based on wavelets. These coding
schemes typically consist of a prediction and transformation
step, followed by a lossy quantization step, and finally one or
more lossless coding steps such as run-length coding and/or
huffman coding. Previous work in the design of circuits for
variable-length encoding and decoding have concentrated
almost exclusively on achieving high throughput decoding
for video and high-definition television (HDTV) applications, and are described in [2-51.
With the growing popularity of portable computers
with high-quality audio and video playback capability, low
power architectures and techniques are required to provide
maximum battery life, without sacrificing real-time throughput. Previous work in low power video signal processing
have concentrated on other steps in the video compression
and decompression process such as the discrete cosine transform. To minimize the total power consumption of multimedia systems, low power, high throughput, designs are
required for all steps in the compression and decompression
processes, including variable-length encoding and decoding.
II. LOOK-UP TABLE PARTITIONING
A significant source of power consumption in a
variable-length encoder (VLE) or variable-length decoder
(VLD) is the look-up table containing the mapping from
source symbols to variable-length codewords. If we assume
that the table power consumption increases linearly with
size, we can represent the table power consumption as:
0-7803-3694-1
/97/$10.00 1997 IEEE
@
,P = N a
(1)
where a is the power consumption per table entry, P is the
total power, and the tablle has N entries. Next, consider partitioning the table into k isolatee iub-tables, so that only one
sub-table is selected when encoding or decoding a source
symbol. The power consumption then becomes:
i E
S,
i e S,
i e S,
where each summation is performed over a set of source
alphabet symbols S,, where a is once again the power consumption per table entry, andp, is the probability that the i-th
symbol of the source alphabet occurs. The union of all the
subsets 8, is the entire source alphabet and the intersection
of the subsets is the null set - simply meaning that each subtable contains some unique partition of the entire table. The
sum can of course be minimized by equalizing the contribution of each of the right-hand-side terms of the equation.
Note that visualizing a plot of source symbol probability versus symbol number, this partitioning is equivalent to dividing the source pdf into k segments with equal area under the
curve.
If the source alphabet had a uniform probability distribution, the terms in the right-hand-side of equation 2
would be equalized by choosing k tables of size N k , each
having a probability Ilk of being selected. The resulting
power consumption would become:
k
i=l
Provided that a decision could be made which of
the k sub-tables is to be selected and the remaining subtables consume no power unless selected, the active power
consumption is reduced by a factor of k, minus the power
consumed in the process of deciding which sub-table to
select.
In applications such as image coding, transform
coefficients typically have highly skewed probability distri-
997
Authorized licensed use limited to: Kalpataru Institute of Technology. Downloaded on February 1, 2010 at 00:50 from IEEE Xplore. Restrictions apply.
butions, many times modeled as Laplacian or Generalized
Gaussian sources. In this case, we expect that a smaller number of symbols in the source alphabet will be very common,
while the large remainder of the alphabet occur less often.
The probability distribution of d l e v e l pairs after scalar
quantization of subband coefficients in a wavelet-transformed image has a maximum for small runs and small levels, and falls off along the run and level axes. We therefore
also expect that as a consequence of this, the right-hand-side
terms of equation 2 for highly skewed sources will consist
of small sets of highly probably source symbols and larger
sets of less probable source symbols. This partitioning has a
simple but compelling effect on power consumption - it is
statistically likely we will access small, and therefore low
power look-up tables, and it is statistically unlikely we will
access larger, higher power look-up tables. The goal is to
reduce power consumption in the average sense - much in
the sanie way that a cache benefits the throughput of a
microprocessor in the average sense by making most
accesses to a small, fast memory and rare accesses to slower
main memory.
pipelining could be performed by adding registers to the
input or output of the look-up table and using retiming transformations to balance the levels of logic between registers.
Finally, an output multiplexer selects the output of the appropriate look-up table.
Input Data from Fifo
m
Shift
4
Codeword Length
111. LOW POWER VLD ARCHITECTURE
A conventional architecture for implementing variable-length decoding is shown in figure 1. This architecture
consists of a pair of input registers with a combined
wordlength equal to twice the maximum codeword length, a
barrel shifter to select a “sliding” segment of bits, a look-up
table for converting the input codeword to an output run/
level pair and a codeword length L, and an accumulator for
shifting out the L bits corresponding to the current codeword. This is a constant output rate architecture, capable of
decoding one variable-length codeword per cycle
An architecture implementing the low-power table
partitioning for variable-length decoding is illustrated in figure 2. The VLD in figure 2 is also a constant output-rate
architecture capable of decoding a sustained one symbol per
cycle, and is based on the designs of [2] and [3]. The VLD
consists of two parts: (1) a variable-length shifting datapath
and (2) the partitioned codeword tables. The shifting datapath uses a smaller look-up table containing the codeword
lengths together with an accumulator to shift the input bitstream past each variable-length codeword. The LUT Select
module counts the number of leading ones or zeros in the
prefix of the received codeword, using this count to select
one of the four individual look-up tables. At the input to each
look-up table is a register with an enable signal generated
fiom the decoded prefix length. The registers serve to isolate
the unselected tables so that only one of the four look-up
tables consumes power on any particular cycle. Note that the
codeword look-up tables used to determine the corresponding d e v e l pair are not in the feedback path of the shifting
datapath, and can therefore be pipelined arbitrarily. This
1
RunLevel Pair
Figure 1: Conventional VLD Architecture
The table-partitioning has been applied to the
design of a VLD for a second-generation of the wavelet
image CODEC described in [6], and is used to convert variable-length codewords into d e v e l pairs. The VLD table
contains a total of 529 entries, has a maximum codeword
length of 16-bitsYand produces a 9-bit run and 7-bit level.
VLD circuits with and without table partitioning were implemented in synthesizable VHDL, synthesized to a standard
cell netlist, and mapped to a layout in 0.5-micron CMOS.
Power consumption measurements were estimated from the
extracted layout at a clock rate of 12.27 MHz using a commercial power analysis tool. The 12.27 MHz clock frequency corresponds to the square-pixel NTSC clock rate
used in the image CODEC. Using the common architecture
of [2] without table partitioning, the power consumption of
the VLD is 4.2 mW, 74% of which is consumed in the single
look-up table. Of the remaining 1.1 mW, 1.03 mW is consumed in the shifting datapath and the remaining 60 UW in
the non-partitioned length look-up table.
Applying look-up table partitioning reduces the
active power consumption of the look-up tables by 98%,
fkom 3.1 mW to 63 uW. The 63 UW average power consumption corresponds to the power of each of the four subtables weighted by the probability that they are accessed.
The gatecounts for the synthesized tables (given in terms of
998
Authorized licensed use limited to: Kalpataru Institute of Technology. Downloaded on February 1, 2010 at 00:50 from IEEE Xplore. Restrictions apply.
two-input NAND gates), power consumption, and probability of access are given in table I. Note that the sum of the
probabilities does not add up to 100% because additional
variable-length codewords are allocated to special symbols
such as escape codes and end-of-subband markers that assist
in the decoding process.
feed-forward, the look-up table producing the codeword
length is merged with the look-up table producing the variable-length codeword itself, rather than splitting the two
tables as shown for the VLI).
R d eve1 Pair
i
Input Data from FIFO FIFO Pop Signal
.L
I
LA
Look-up Table
I
I
f
Length
I
I
Length
I
LUT
3.1 Mux
:I I
-+
Fifo Data
RudLevel Pair
Figure 3: Conventional VLE Architecture
Figure 2: Low Power VLD Architecture
TABLE I: VLD Table Partitioning Results
I/ I 1
Power
(mW)
Probability
of Access
I I
Table
Entries
Weighted
IV. LOW POWER VLE ARCHITECTURE
The VLE architecture is the dual of the VLD, converting d e v e l pairs into variable-length symbols, and
packing these symbols into a fixed-length output bitstream at
a sustained one symbol per cycle. A conventional architecture is shown in figure 3, consisting of a look-up table to
convert d e v e l pairs to variable-length symbols, and a
shift-network to concatenate the variable-length symbols
into a fixed-length output stream. The low-power VLE is
shown in figure 4. Because this architecture is completely
The LUT Select niodule computes a d e v e l distance metric, and uses the result to select one of the four
individual look-up tables. The specifics of the d e v e l metric will be described in the: full paper. At the input to each
look-up table is a register with an enable signal generated
fiom the decoded prefix length. As with the VLD, the registers serve to isolate the unselected tables so that only one of
the four look-up tables coinsumes power on any particular
cycle. As previously mentioned, because the codeword symbol and length look-up tables are completely feedforward
they can be pipelined arbitrarily using retiming transformations. The outputs of the fcur look-up tables are then multiplexed, with the resulting variable-length codeword and its
length used in the shifting network to concatenate the current
codeword to the end of the outgoing stream.
This architecture has been implemented in the same
wavelet image CODEC used in the implementation example
of the previous section. Lilke the VLD, the VLE table contains a total of 529 entries, takes a 9-bit run and 7-bit level as
inputs, and produces a a codeword symbol of 16-bits maximum, and four bits representing the length of the codeword.
Table partitioning has been applied to the VLE, and were
implemented in synthesizable VHDL, synthesized to a standard cell netlist, and mapped to a layout in 0.5-micron
CMOS. As with the decoder, power consumption measurements were estimated fiom the extracted layout at the square
999
Authorized licensed use limited to: Kalpataru Institute of Technology. Downloaded on February 1, 2010 at 00:50 from IEEE Xplore. Restrictions apply.
pixel NTSC clock rate of 12.27 MHz using a commercial
power analysis tool. Using a straightforward single look-up
table architecture without partitioning, the power consumption of the complete VLE is 4.722 mW, 85% of which, or
4.01 mW, is consumed in the look-up table. The remaining
7 10 mW is consumed in the shifting datapath.
R d e v e l Pair
---I
Code
*
Fifo Data
Fifo Pop
Figure 4: Low Power VLE Architecture
Applying a four-way look-up table partitioning
reduces the active power consumption of the look-up tables
by 95%, from 4.01 mW to 192 uW. This 192 uW average
power consumption corresponds to the power of each of the
four sub-tables weighted by the probability that they are
accessed. The gatecounts for the synthesized tables, power
consumption, and probability of access are given in tableII.
Gatecount
Power
(mW)
Probability
of Access
Table
Entries
Weighted
Power
(mw)
0
38
0.022
0.7182
10
0.0155
1
11
293
I
0.271
I
0.1718
I
56
I
0.0466
2
11
794
I
0.955
I
0.0779
I
154
I
0.0744
V. CONCLUSIONS
In this paper we have shown that the dominant
source of power consumption in variable-length encoding
and decoding circuits is the power consumed in the look-up
tables. A new table-partitioning technique based on symbol
probabilities was presented, and applied to both the variablelength encoding and variable-length decoding problems.
VLSI architectures utilizing this technique as part of a subband image CODEC have been presented and shown to
reduce the look-up table power consumption by as much as
98% for decoding and 95% for encoding, while reducing the
total power consumption up to 68% and 75% for variablelength decoding and encoding, respectively.
One important benefit of this work is the ability to
implement large symbol tables to maximize the coding eficiency, without paying the usual linear increase in power
consumption with increased table size. This will become
increasingly important in handheld devices communicating
over low-bitrate channels, where both low power and high
coding efficiency are important. A second benefit of this
work is that the power reduction is achieved without sacrificing throughput, as the architectures can sustain one symbol
per cycle throughput and are amenable to additional pipelining of the tables by retiming. A final important benefit of this
work is that these relatively large reductions in active power
consumption were achieved entirely using algorithmic optimizations. The growing body of work on low-level circuit
design techniques, gate-level power optimization, and low
supply-voltage operation can be used together with the techniques presented in this paper to greatly reduce the power
consumed in variable-length encoding and decoding circuits.
REFERENCES
[I] M.I. Lu and C.F. Chen, “An Encoding Procedure and a Decoding Procedure for a New Modified Huffman Code,” IEEE Transactions on
Acoustics, Speech andsignal Processing, Vol. 28, No. 1, January 1990.
TABLE 11: VLE Table Partitioning Results
Table
design. The new dominant component of the VLE power
consumption is the shifting datapath.
[2] M.T. Sun and S.M. Lei, “A Parallel Variable-Length-Code Decoder for
Advanced Television Applications,” Proceedings of the Third Intemational Workshop on HDTV, Torino, Italy, August 1989.
I
I
I
[3] P. Ruetz, P. Tong, D. Luthi, and P. Ang, “A Video-Rate P E G Chip Set,”
Journal of VLSI Signal Processing, Vol. 5, No. 3, April 1993.
[4]
~
3
1457
1.795
0.0307
309
0.0551
Total
2582
3.042
0.9930
529
0.1916
S.F. Chang and D. Messerschmitt, “Designing High-Throughput VLC
Decoder Part I - Concurrent Architectures,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196,
June 1992.
[5] H.D. Lin and D. Messerschnutt, “Designing High-Throughput VLC
Decoder Part I1 Parallel Decoding Methods,” IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206,
The total power consumption of the VLE after
table partitioning is 1.14 mW, a 75% reduction in power consumption when compared with the original, non-partitioned
June 1992.
[6] S. Molloy, K. Nishibori, and R. Jain, “A Video CODEC Chipset for
1000
Wireless Multimedia Networking,” IEEE Workshop on VLSI Signal
Processing, pp. 381-390, Osaka, Japan, October 1995.