Power optimization of digital baseband WCDMA receiver components on algorithmic and architectural level

High data rates combined with high mobility represent a challenge for the design of cellular devices. Advanced algorithms are required, which result in higher complexity, more chip area and increased power consumption. This conflicts with the limited power supply of mobile devices. This presentation discusses an HSDPA receiver which has been optimized for power consumption, with the focus on the algorithmic and architectural levels. On the algorithmic level the Rake combiner, the Prefilter Rake equalizer and the MMSE equalizer are compared regarding their BER performance. Both equalizer approaches provide a significant increase in performance for high data rates compared to the Rake combiner, which is commonly used for lower data rates. For both equalizer approaches several adaptive algorithms are available which differ in complexity and convergence properties. To identify the algorithm which achieves the required performance with the lowest power consumption, the algorithms have been investigated using SystemC models regarding their performance and arithmetic complexity. Additionally, for the Prefilter Rake equalizer the power estimations of a modified Griffith (LMS) and a Levinson (RLS) algorithm have been compared with the tool ORINOCO® supplied by ChipVision. The accuracy of this tool has been verified with a scalable architecture of the UMTS channel estimation described both in SystemC and VHDL, targeting a 130 nm CMOS standard cell library. An architecture combining all three approaches with an adaptive control unit is presented. The control unit monitors the current condition of the propagation channel and adjusts receiver parameters such as filter size and oversampling ratio to minimize the power consumption while maintaining the required performance. The optimization strategies result in a reduction of the number of arithmetic operations of up to 70% for single components, which leads to an estimated power reduction of up to 40% while the BER performance is not affected. This work utilizes SystemC and ORINOCO® for a first estimation of power consumption at an early step of the design flow. In this way algorithms can be compared in different operating modes, including the effects of control units. Here an algorithm with higher peak complexity and power consumption but more flexibility showed lower consumption in normal operating modes than the algorithm which is optimized for peak performance.

Correspondence to: M. Schämann (marcus.schaemann@mez.rub.de)


Introduction
The evolution of mobile systems provides the user with ever increasing data rates. These rising data rates, however, require advanced algorithms to encode and decode the signals. Therefore, an increased number of arithmetic operations has to be carried out in the receiver. Figure 1 shows various predictions for the complexity of mobile receivers measured in million instructions per second (Rabaey, 2001; Hausner, 2001; Takano, 2002; Perkins, 2005). The predictions differ because of different assumptions, e.g. regarding which components are included. However, the main trend shows that the complexity rises exponentially, as does the data rate.
This relationship creates a challenge for the design of mobile systems. A growing chip area can be compensated by smaller feature sizes and higher integration, but the increased power consumption becomes a major design issue for two reasons. First, the mobile receiver operates on a limited power supply, so standby and talk times become shorter when the power consumption increases. Second, the rising power consumption generates heat which has to be dissipated by the device.
Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.
The optimization of power consumption can be carried out on all levels of abstraction of the design. For example, the leakage of transistors can be minimized on the layout and transistor levels. On the logic level the power consumption can be reduced by clock gating, and on the architectural level by switching components on and off.
However, the higher levels of abstraction provide a higher potential for power savings, e.g. by choosing and implementing algorithms of lower complexity and with less activity (Fig. 2). A higher abstraction level also has the side effect of a faster power estimation. The accuracy of this estimation has to be checked carefully, though, to ensure that the results of the high-level estimation agree with the properties of the design at a lower level.
Therefore, the motivation of our work is to enhance the reception of high data rates according to the HSDPA specification, even under bad propagation conditions, with a minimum of energy consumption.
The report is organized as follows: Sect. 2 gives a short overview of the system parameters, the properties of the propagation channel and possible approaches to the receiver design. Sect. 3 compares the performance, complexity, area and power consumption of different algorithms. Sect. 4 discusses the optimization of the receiver architecture by an adaptive control unit. Results on the reduction of complexity and energy consumption lead to the conclusions of this report.

System model and propagation channel
Figure 3 shows the tasks which have to be carried out in the physical layer of the transmitter in the base station and of the receiver in the mobile device of a High Speed Downlink Packet Access (HSDPA) system. The critical part for the transmission of high data rates is the fading multipath propagation between transmitter and receiver. Thus, it is mainly the receiver that has to be enhanced by an improved multipath combination to restore the transmitted signal.
The figure of merit for the user is the achievable throughput of the transmission link, which is related to the Frame Error Rate (FER), i.e. the fraction of transmitted blocks which were not received correctly. However, the raw Bit Error Rate (BER) can act as a more precise indicator for the performance of the critical inner part. Simulations show that a raw BER smaller than 2% corresponds to a FER lower than 10%. This FER is sufficient to fulfill the throughput requirements of the 3GPP standard (3GPP TS 25.101, 2003).
A common approach to multipath combination is the use of a Rake combiner due to its simple structure. The Rake combiner joins the information at maximum ratio to one chip position which is used for decoding, but it does not suppress interference at adjacent chip positions. An improved reception can be achieved by using chip-level equalizers which compensate the propagation channel and restore the orthogonality of the transmitted signal (Hooli, 2002). These approaches improve the performance drastically, but only at the cost of a higher complexity. A chip-level equalizer has the advantage of restoring the orthogonality of the signal before decoding, which simplifies the subsequent tasks. However, the data equalization operates at the oversampled chip frequency of the pulse-shape matched filter (Fig. 3). Therefore, the activity of the equalizer is comparatively high, which results in increased power consumption. The optimization of this component plays a key role in minimizing the overall power consumption and is discussed in detail below.
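The maximum ratio combining performed by the Rake combiner can be illustrated with a minimal Python sketch. The function name, the finger delays and the path gains are illustrative assumptions and are not taken from the receiver described here; despreading is omitted.

```python
def rake_combine(samples, delays, gains):
    """Maximum ratio combining of Rake fingers (illustrative sketch).

    samples: received complex samples
    delays:  delay of each finger in samples (assumed to be known)
    gains:   estimated complex path gain of each finger (assumed to be known)
    """
    n_out = len(samples) - max(delays)
    out = []
    for k in range(n_out):
        # Each finger reads the sample at its own delay and weights it
        # with the complex conjugate of the estimated path gain.
        acc = sum(g.conjugate() * samples[k + d] for d, g in zip(delays, gains))
        out.append(acc)
    return out


# Noise-free example: chips [1, -1, 1, 1] sent over a hypothetical
# two-path channel with gains 1 (delay 0) and 0.5j (delay 2).
rx = [1, -1, 1 + 0.5j, 1 - 0.5j, 0.5j, 0.5j]
combined = rake_combine(rx, delays=[0, 2], gains=[1 + 0j, 0.5j])
```

In this example the real parts of the combined values carry the signs of the transmitted chips, while the remaining imaginary parts stem from the neighbouring chips that reach the fingers via the other path. This is the interference which the Rake combiner does not suppress.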

Optimizations on the algorithmic level
There are several algorithms which compute the coefficients of a Finite Impulse Response (FIR) filter that equalizes the propagation channel. The algorithm which obtains the best performance is the Minimum Mean Square Error (MMSE) equalizer, which is often given in the literature as (Krauss, 2000):
$$\mathbf{w}_{\mathrm{MMSE}} = \left(\mathbf{H}^H \mathbf{H} + \frac{\sigma_N^2}{\sigma_S^2}\,\mathbf{I}\right)^{-1} \mathbf{H}^H \boldsymbol{\delta}_D \qquad (1)$$
In this equation $\mathbf{H}$ is the convolution matrix composed of the channel impulse response estimated in the receiver, $\mathbf{I}$ is the identity matrix, and $\sigma_N^2$ and $\sigma_S^2$ are the noise and signal power, respectively. $\boldsymbol{\delta}_D$ is the unit vector with a 1 at the $D$th position. Depending on the signal-to-noise ratio, the MMSE equalizer either performs like a zero-forcing equalizer ($\sigma_N^2 \ll \sigma_S^2$):
$$\mathbf{w}_{\mathrm{ZF}} \approx \left(\mathbf{H}^H \mathbf{H}\right)^{-1} \mathbf{H}^H \boldsymbol{\delta}_D \qquad (2)$$
or, for a high noise level ($\sigma_N^2 \gg \sigma_S^2$), approximates the Rake combiner:
$$\mathbf{w}_{\mathrm{Rake}} \approx \frac{\sigma_S^2}{\sigma_N^2}\, \mathbf{H}^H \boldsymbol{\delta}_D \qquad (3)$$
The MMSE equalizer can be transformed into a Prefilter Rake equalizer by separating the equalization into two tasks: first, the Prefilter minimizes the cross correlation between the samples caused by multipath propagation. Then the signal can be processed by a normal Rake combiner to maximize the signal (Heikkilae, 2001).
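As an illustration of the direct MMSE solution, the following Python sketch computes the filter coefficients from a channel impulse response. It assumes the common form w = (H^H H + (σ_N²/σ_S²) I)^(-1) H^H δ_D with a Toeplitz convolution matrix H. The helper names and the plain Gaussian elimination are illustrative choices, not the implementation investigated in this work.

```python
def convolution_matrix(h, n_taps):
    """Toeplitz matrix H with H[i][j] = h[i - j], so that H @ w equals h * w."""
    rows = len(h) + n_taps - 1
    return [[h[i - j] if 0 <= i - j < len(h) else 0j for j in range(n_taps)]
            for i in range(rows)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def mmse_taps(h, n_taps, delay, noise_power, signal_power):
    """MMSE equalizer coefficients for channel h and target delay D = delay."""
    H = convolution_matrix(h, n_taps)
    ratio = noise_power / signal_power
    # Gram matrix H^H H plus the term (sigma_N^2 / sigma_S^2) I
    gram = [[sum(row[a].conjugate() * row[b] for row in H)
             + (ratio if a == b else 0.0)
             for b in range(n_taps)] for a in range(n_taps)]
    # H^H delta_D selects the conjugated row D of H
    rhs = [H[delay][a].conjugate() for a in range(n_taps)]
    return solve(gram, rhs)


# Hypothetical two-tap channel, eight filter coefficients, delay D = 4
w_low_noise = mmse_taps([1.0, 0.5], 8, 4, noise_power=1e-9, signal_power=1.0)
w_high_noise = mmse_taps([1.0, 0.5], 8, 4, noise_power=1e6, signal_power=1.0)
```

For the low noise setting the coefficients approach the zero-forcing solution, and for the high noise setting they become proportional to the conjugated channel taps, i.e. to a Rake-like matched filter. The sketch also shows why the direct solution is expensive: it requires forming and solving an N x N system for every coefficient update.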
Fig. 4 compares the raw BER performance of the Rake combiner, the MMSE equalizer and the Prefilter Rake equalizer for a vehicular propagation channel (VA30) and the HSDPA test case H-Set 3 (3GPP TS 25.101, 2003), obtained by a fixed-point SystemC simulation. The MMSE equalizer reaches the best results while the Rake combiner has a very high raw BER. The Prefilter Rake also achieves good raw BER results, but due to the separation of the MMSE equation a small fraction of the gain in performance is lost. There are multiple ways to solve the MMSE and Prefilter approaches. The method given in Eq. (1), which computes the pseudo inverse of H, is usually the most complex version because it requires several matrix computations. More reasonable versions in terms of complexity are adaptive algorithms like Least Mean Square (LMS) (e.g. Widrow, 1985) or Recursive Least Squares (RLS) (e.g. Moon, 1999) and their variations.
In terms of computational effort, LMS algorithms have the advantage of a simple structure and of scaling linearly with the filter size, while RLS algorithms scale quadratically. Due to this advantage, LMS algorithms are often preferred for low-complexity implementations.
However, as simulations for the Prefilter Rake equalizer show, the computational complexity measured in multiplications per slot is higher for the LMS algorithm than for the RLS algorithm (Fig. 5). This is caused by a different runtime behavior: for the filter coefficients to converge, the LMS algorithm has to adapt the coefficients continuously, while the RLS algorithm can operate block-based with 2.5 updates per slot.
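The effect of the update rate can be made plausible with a rough operation count. The following Python sketch uses a simplified cost model that is an assumption for illustration only (2N multiplications per LMS update, 2N² multiplications per block update of a Levinson-type solver); the figures behind Fig. 5 are not reproduced here. The slot length of 2560 chips is the UMTS value.

```python
CHIPS_PER_SLOT = 2560  # length of a UMTS slot in chips


def lms_mults_per_slot(n_taps, updates_per_slot=CHIPS_PER_SLOT):
    # Assumed cost model: N multiplications for the error signal plus
    # N multiplications for the coefficient update, once per chip.
    return 2 * n_taps * updates_per_slot


def rls_mults_per_slot(n_taps, updates_per_slot=2.5):
    # Assumed cost model: on the order of 2 * N^2 multiplications per
    # block update, with only a few block updates per slot.
    return 2 * n_taps ** 2 * updates_per_slot


for n_taps in (8, 16, 32, 64):
    print(n_taps, lms_mults_per_slot(n_taps), rls_mults_per_slot(n_taps))
```

Under these assumptions the continuously adapting LMS algorithm needs more multiplications per slot than the block-based algorithm for all practical filter sizes, although its cost per update is lower.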
Further results on the area and energy consumption of both Prefilter algorithms have been obtained using the tool ORINOCO® supplied by ChipVision. The required area for the RLS algorithm is larger because more multiplications have to be performed simultaneously, but the energy consumption of both algorithms is comparable. An additional advantage of the RLS algorithm is that it yields better filter coefficients, which enables the FIR filter to achieve a better performance with even fewer filter coefficients.
A summary of the different algorithms investigated with SystemC and ORINOCO® for the HSDPA receiver application is given in Table 2. The receiver with the lowest complexity, area and power consumption is the Rake combiner. But it also has the worst performance and is not able to fulfill the requirement of the 3GPP standard under vehicular propagation conditions. The best performance can be obtained with the MMSE equalizer, but only with the highest complexity, area and power consumption. A good trade-off is the Prefilter Rake equalizer, which achieves a substantial increase in performance in comparison with the Rake combiner with only a small increase in complexity and energy consumption. When comparing both algorithms for the Prefilter Rake equalizer, the Levinson algorithm (RLS) has the best trade-off between performance and energy consumption.
www.adv-radio-sci.net/6/325/2008/ Adv. Radio Sci., 6, 325-330, 2008

Optimizations on the architectural level
To create an architecture which is able to reach a high performance with a minimum of power consumption, the MMSE equalizer (best performance), the Rake combiner (lowest complexity) and the Prefilter Rake equalizer using the Levinson algorithm (best trade-off) have been combined with an adaptive control unit in the proposed architecture in Fig. 6. The control unit monitors the current state of the propagation channel and chooses the algorithm which achieves the necessary performance with the lowest complexity. For example, in vehicular environments the MMSE equalizer is used, while in indoor or pedestrian environments the Rake combiner is activated, which can maintain the performance with a lower complexity.
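The selection rule of the control unit can be sketched in a few lines of Python. The sketch assumes that a predicted raw BER is available for each receiver type under the current channel conditions; how this prediction is obtained is not specified here, and the dictionary keys and function name are hypothetical. The 2% target is the raw BER limit stated in Sect. 2.

```python
# Receiver types ordered from lowest to highest complexity
RECEIVERS = ["rake", "prefilter_rake", "mmse"]
TARGET_RAW_BER = 0.02  # raw BER limit corresponding to the required FER


def select_receiver(predicted_ber):
    """Return the least complex receiver that meets the raw BER target.

    predicted_ber: dict mapping receiver name to its predicted raw BER
                   for the current channel state (hypothetical input).
    """
    for name in RECEIVERS:
        if predicted_ber[name] < TARGET_RAW_BER:
            return name
    # No receiver meets the target: use the best performing one.
    return RECEIVERS[-1]


# Hypothetical predictions for a pedestrian and a vehicular channel
print(select_receiver({"rake": 0.01, "prefilter_rake": 0.008, "mmse": 0.006}))
print(select_receiver({"rake": 0.10, "prefilter_rake": 0.03, "mmse": 0.015}))
```

Ordering the candidates by complexity and stopping at the first one that meets the target keeps the decision logic itself cheap, which matters because the control unit adds to the overall power budget.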
Apart from the receiver algorithm used for multipath combination, other parameters of the algorithms can also be adjusted to reduce the complexity to the extent that is actually necessary:
- Oversampling ratio: The oversampling ratio determines the temporal resolution of the multipath propagation, which influences the performance of the multipath combination. But it also influences the data rate which has to be processed and consequently the complexity and power consumption of the receiver. Adjusting the oversampling ratio can therefore be used to trade off performance and complexity (Schämann, 2006).
- Convergence masking vector: By use of a convergence masking vector (Guo, 2005) the state of convergence of each filter coefficient can be monitored. If a coefficient no longer changes, no further calculations are required for it. The coefficient is marked and skipped in the following iterations, which reduces the number of multiplications and additions and consequently the power consumption.
- Active filter size: Depending on the current delay spread of the multipath propagation, the FIR filter, which is designed for the worst-case scenario, is not required to operate with its full range of filter coefficients. By switching off the parts of the filter which are not necessary, the complexity can be reduced further.
- Number of iterations: The fading of the propagation channel is mainly caused by the movement of the receiver device and its surroundings. At higher velocities the propagation conditions change faster, which requires more frequent updates of the filter coefficients. In the case of slow or no movement, however, the time between updates can be increased to minimize the activity of the receiver.
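The idea behind the convergence masking vector can be illustrated with an LMS-type coefficient update in Python. This is a minimal sketch under assumptions: the convergence criterion (update magnitude below a threshold `eps`), the step size and the multiplication count are illustrative and do not reproduce the scheme of Guo (2005).

```python
def masked_lms_update(weights, mask, inputs, error, step=0.01, eps=1e-4):
    """One LMS update that skips coefficients marked as converged.

    Convention: filter output y = sum(w[i] * x[i]), error = desired - y,
    update w[i] += step * error * conj(x[i]).
    mask[i] is True once coefficient i is considered converged.
    Returns the number of multiplications spent (assumed: two per coefficient).
    """
    mults = 0
    for i, x in enumerate(inputs):
        if mask[i]:
            continue  # converged coefficient: no arithmetic at all
        delta = step * error * x.conjugate()
        mults += 2
        weights[i] += delta
        # Assumed criterion: a negligible update marks the coefficient.
        if abs(delta) < eps:
            mask[i] = True
    return mults


weights = [0j, 0j, 0j]
mask = [False, True, False]  # coefficient 1 was marked in an earlier iteration
spent = masked_lms_update(weights, mask, [1 + 0j, 1 + 0j, 0j], error=0.5, step=0.1)
```

Only the unmasked coefficients contribute to the operation count, so the effort per update shrinks as more coefficients converge. A practical scheme would also need a rule for clearing the mask again when the channel changes, which is omitted here.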
The results of the adjustment of parameters for the MMSE equalizer are shown in Table 3. Depending on the control mechanism used in the receiver, the computational complexity can be reduced by up to 71.9% while the average BER for the VA30 propagation channel stays below 2%. The SystemC model of the MMSE equalizer including the control unit was also analyzed with ORINOCO®, and reductions of the energy consumption of up to 42% were observed for subcomponents.

Conclusions
The importance of estimating the power consumption on the algorithmic level early in the design process has been shown. The high potential for the reduction of power consumption was demonstrated by comparing the estimated area and energy consumption using SystemC and ORINOCO®. By monitoring the propagation conditions, selecting the algorithm adaptively and adjusting the receiver's parameters, the complexity and power consumption were reduced further. For the investigated application of an HSDPA receiver, the number of required multiplications and additions could be reduced by up to 71.9% and the energy consumption by up to 42% for subcomponents of the MMSE equalizer.

Fig. 4. Comparison of performance of different algorithms for a vehicular propagation channel (VA30) and HSDPA test case H-Set 3.

Fig. 5. Complexity of different algorithms measured in multiplications per time unit.

Fig. 6. Architecture combining three algorithms and the proposed adaptive control unit to trade off performance with complexity and power consumption.

Table 1. Results obtained with ORINOCO® for both Prefilter algorithms for a 130 nm standard cell library mapping.

Table 2. Summarized comparison of approaches and algorithms regarding their BER performance for the vehicular propagation channel VA30, their complexity (number of multiplications and additions), area and estimated power consumption.

Table 3. Reduction of complexity measured in multiplications per time unit by the adaptive control unit for the MMSE equalizer.