Computing Givens Rotations reliably and efficiently on IA-64 Architecture

Liang Huang

huangl@cs.ucdavis.edu

March 20, 2003

 

Abstract

Due to possible intermediate overflow or underflow, the widely used Givens rotation can be tricky to implemented in a robust and efficient way. Robust often comes with scaling to avoid spurious intermediate overflow or underflow. But scaling is not cheap, so one would like to do scaling only when necessary. However deciding whether to do scaling or not costs too. An ideal way would be somehow to get rid of scaling altogether. Fortunately, Intel IA-64 architecture provides many innovative features. These features can get rid of scaling in the algorithm for computing Givens rotation. At a same time, speed of computing Givens rotation can be improved, accuracy and reliability of the algorithm is maintained. To verify this conclusion, a simpler, faster, reliable and accurate program for computing Givens rotation by exploiting features of IA-64 is presented as a result in this report.

 

1.    Introduction

 

How to compute math kernels efficiently and reliably is one of the most important problems in the numerical computing field. Therefore, using IA-64 architecture to simplify algorithms for computing math kernels and improving performances of math kernels becomes my objective of this study. Givens rotation, a widely used math kernel, is used as an example to show how to simplify the algorithms without losing reliability by exploiting the features of IA-64 and HP-UX C. The definition of Givens rotation and algorithms for computing it will be introduced in section 4.

 

The latest work for computing Givens rotation reliably and efficiently is proposed by D. Bindel, J. Demmel, W. Kahan and O. Marques, who are from UC Berkeley. In [3], an elaborate and carefully designed method is presented for computing single complex Givens rotation. It uses double precision floating-point format to store intermediate results. Ren-Cang Li, who is at HP lab, extended this approach for computing double complex and 80-bit extended complex Givens rotations in a similar way: store intermediate results in IA-64 82-bit floating-point format.

 

As described in [1], IA-64 is a 64-bit architecture developed by Intel and Hewlett-Packard (HP). IA-64 architecture has many innovative features [2]. For examples, more and wider registers than usual registers, fused multiply-add(FMA) floating-point arithmetic operation, predicated execution and control speculation. Two of these features will be studied in depth in this report, namely wider floating-point register type and FMA operation. They will be studied in section 2.

 

As an important accessorial part of IA-64, HP-UX C will also be studied in this report. The features of HP-UX C that will be studied in section 3 are in-line assembly instructions and pragmas.

 

What I will do in this report is that: testing the code for computing double precision Givens rotation on IA-64, which is programmed by Ren-Cang Li, comparing the results with some other math softwares and verifying the conclusion given in the abstract. A reason why not to test the code for 80-bit extended precision is that: no other 80-bit extended precision programs for computing Givens rotation can be found to do some comparisons.

 

The rest of this report is organized as follows. Section 2 and 3 introduce the features of IA-64 architecture and HP-UX C respectively. Section 4 studies how to compute Givens rotation reliably and efficiently on IA-64 architecture, and present a simpler algorithm for computing Givens rotation. Section 5 will put the new algorithm into practice, give results and present some analysis. Section 6 draws conclusions and future work.

 

2.    Useful features of IA-64 architecture

 

2.1 82-bit double-extended floating-point register type

 

IA-64 architecture provides 82-bit double-extended floating-point arithmetic as its basic floating-point type[2]. Three IEEE floating-point types: single precision, double precision and 80-bit double-extended precision are all subsets of this type. You can see the differences between them in the table 1:

 

 

 

Single

Double

80-bit double –extended

IA-64 82-bit double-extended

Sign bit

1

1

1

1

Exponent bit

8

11

15

17

Significand bit

23

52

63

63

Explicit bit

0

0

1

1

Bias

127

1023

16383

65535

Table 1

 

The value of an 82-bit floating-point number  is:

The IA-64 floating-point register format contains 7 exponent bits beyond the 11 required for treatment of double precision numbers. The additional bits allow computation near the double overflow/underflow thresholds to progress without generating exceptions. Of course, if the final result in a chain of computation is out of range, forcing the exponent to its usual narrow format will be interpreted as overflow or underflow.

 

For example, in computing a complex product , it may be that any of the products ,, or  could overflow or underflow, while neither  nor  lies outside the target exponent range. By deferring exception reporting and allowing intermediate results slightly outside the normal double exponent range, it becomes much simpler to code a robust complex multiplication sequence.

 

In other words, IA-64 architecture provides a wider exponent range than IEEE double format. As we will see in section 4, we can take advantages by using this feature in computing math kernels.

 

2.2 Fused multiply-add

 

The basic arithmetic operation in IA-64 is a fused multiply–add instruction. The instruction computes  with only one rounding error. The product  is computed first, and all 128 product bits are retained for the addition with . After the addition, the result is rounded to the precision specified in the instruction.

 

Here are some in-line instructions for fused multiply-add and its variants:

                  _Asm_fma:      

                  _Asm_fms:      

                  _Asm_fnma:   

                  _Asm_fadd:     

                  _Asm_fsub:     

                  _Asm_fmpy:     

                  _Asm_fnmpy:  

 

A major benefit of fused multiply-add instructions is the ability to compute quantities with over 64 bits of precision in the arithmetic unit, before rounding. Therefore the computed result will have more than 64+n valid bits before rounding to the desired precision, which is greater than the need of double precision. Therefore fused multiply-add can provide more accuracy than standard floating-point operations.

 

Another benefit is that this instruction can be executed in the same time as a simple multiplication instruction, which is faster than the time for computing a multiplication and a addition separately. Therefore FMA can also provide shorter execution time.

 

We will use FMA to improve the speed and accuracy of our code.

 

2.3 Some other useful features of IA-64

 

IA-64 doesn’t provide hardware-based floating-point division and square root. Replacing these two most expensive floating-point operations by fast instructions that produce an approximation to these operations has two beneficial results. First, special hardware for division and square root, which has relatively low use, is not needed. Second, by using a succession of fused multiply-add operations to complete the division or square root, an entire floating–point unit is not rendered busy during the entire execution time of division or square root.

 

3.    Useful extensions of HP-UX C

 

3.1 In-line Assembly Instructions

 

HP-UX C compiler for IA-64 architecture has been enhanced with “in-line assembly” capabilities, which enable most of the detailed floating-point architectural features to be accessed directly from C.

 

When the header file “machine/sys/inline.h” is included into a program, the new data types, __fpreg, which represents computational precision, becomes available. Computations with data declared as __fpreg use the full 17-bit exponent.

 

Also defined are what appear to be functions of the form _Asm_XXX(PC, a, b, c), where XXX is an IA-64 operation code, such as fma, and PC is for precision control, it is of the enum type: PC{_PC_S, _PC_D, _PC_NONE}, meaning to round the results to float, double or extended respectively.

 

An important in-line assembly instruction that will be used in the code is:

 

              uint64_t _Asm_frsqrta(__fpreg *r, __fpreg a).

 

It computes approximation of reciprocal square root of a. And return value is 0 if a is in  and 1 otherwise. When return value is 0, r is set to be the IEEE-754 mandated square root of a; otherwise r approximates . This in-line instruction can also reduce the time for computing Givens rotation, which is discussed in Section 2.3.

 

3.2 Pragmas

 

The HP-UX C compiler recognizes statements of the form:

 

              #pragma xxxxx,

 

where xxxxx is information that controls some aspects of the compilation environment. Some useful pragmas are listed below:

 

#pragma _USE_SF [0|1|2|3] The _USE_SF pragma permits the C programmer access to the multiple status fields of IA-64. The pragma must precede any declaration or executable statements within a block, and it affects only that block. The default status field is SF0. The status fields most relevant to us here are SF1 besides the default SF0. When user set status filed to SF1, it will disable all 3 standard floating-point types, and uses the computational floating-point precision (82-bit IA-64 double extended precision) as the default. SF2 and SF3 are designed for purposes beyond the scope this report.

 

#pragma Estimated_Frequency f The Estimated_Frequency pragma indicates the estimated relative execution frequency of the current block as compared with the immediately surrounding block. This may be used to indicate the average trip count in the body of a for loop, or to indicate the fraction of time a then clause is executed. The frequency, f, may be expressed as a floating-point constant.

 

Proper pragmas can help compiler optimize the code more efficiently than usual.

 

4.    Computing Givens rotation

 

The definition of Givens rotation is: given two complex numbers and , a Givens rotation is a 2-by-2 unitary matrix  such that

.

The fact that  is unitary implies

                           

Where I is the identity matrix. From this we see that

                           

However this definition of Givens rotation is not clear enough. Different rotation matrices  can satisfy this definition. For example, if given two complex numbers and , and their Givens rotation , for any , it is easy to see  is also a Givens rotation. They are two different rotations when , where . A new definition is proposed in [3]:

if  (includes the case )

; ;

elseif =0 and

; ;

else  and

; ; ;

Where   . The benefit of this definition is that we can minimize the dimensions of the sets of discontinuity of c, s and r. In other words, the value of c, s and r should be continuous functions of the inputs f and g, if possible.

 

4.1 Algorithm without IA-64 support

 

To compute Givens rotation correctly, the “traditional” method divides input data into four cases and handles each case respectively. Four cases are:

 

(I)            f and g are “well scaled”, i.e. we can compute Givens rotation without worrying about the overflow or underflow.

(II)        f is much larger than g, i.e.  will be rounded to

(III)    g is much larger than f, i.e.  will be rounded to

(IV)        both f and g are very large or very small, i.e. we can’t compute its Givens rotation directly due to the overflow or underflow. But we can scale f and g to a proper size. After that, the algorithm for case 1 is used to handle the rest.

 

To compute Givens rotation for case (I)

(1)    Computes

(2)    Computes

(3)    Computes

(4)    Computes  by computing the reciprocal square root of the product of (1) and (3)

(5)    Computes  by computing the product of (1) and (4)

(6)    Computes   by computing the product of (3) and (4)

(7)    Computes  by computing the product of  and (6)

(8)    Computes  by computing the product of  and (4)

(9)    Computes  by computing the product of (8) and conjugate number of

 

The results we need: c, s and r, are produced at step (5), (9) and (7) respectively.

 

The reason why to use such an algorithm for case (I) is that: we must reduce the number of division and square root operations. As we mentioned in section 2.3, division and square root operations are the most expensive floating-point operations. To compute Givens rotation, it is easy to see that one square root and one division are necessary. (Or a reciprocal square root operation is used, as what we have done in our algorithm.) In contrast, the algorithm in the ZROTG routine in the Fortran reference BLAS uses at least 5 square roots and 5 divisions, and perhaps 13 divisions, depending on the aggressiveness of the compiler optimizations and implementation of the complex absolute value function cabs. Therefore we can improve the speed by using only one reciprocal square root operation, which is provided by IA-64.

 

For case (II), (III) and (IV), since what we care is the new algorithm on IA-64, we won’t waste space to describe them here. You can check the details of this algorithm in [3].

 

4.2 Algorithm with IA-64 support

 

With support from IA-64 architecture, we can compute Givens rotation with a much simpler way: just ignore the last three cases. Since IA-64 provides a floating-point format with a much wider exponent range, as we mentioned in section 3, the possibilities of overflow or underflow occurring is zero.

 

For example, for any multiplication, addition, subtraction, division or square root operation, whose operands are both double precision floating-point numbers, results of operations can’t exceed the boundary of 82-bit IA-64 floating-point exponent range.

 

Therefore a simplified algorithm is: for any input data, we just apply the traditional algorithm for case (1). We call this algorithm new ZLARTG, to distinguish from old ZLARTG, which is part of the CLAPACK (a f2c version of LAPACK).

 

5.    Experiment results

 

By using the new ZLARTG algorithm, which is given in section 4, the original code for computing double complex Givens rotation is programmed in HP-UX C. I tested this code on the Itanium machine of our department, which is called “Lanczos”. To compare performances of different algorithms both in speed and accuracy, 4 algorithms are used:

 

1.    Our new ZLARTG, which is presented in section 4.

2.    Old ZLARTG, which is in CLAPACK3.0.

3.    ATLAS ZROTG, which is in ATLAS BLAS.

4.    Vendor ZROTG, which is in INTEL Math Kernel Library (MKL).

 

We note that different compilers and different flags may cause different performance. Therefore, to obtain the optimal performance, we compile 4 routines with different compilers and flags:

1.        For our new ZLARTG, we use HP-UX C compiler (cc), and flags are: +O3 –fpwidetypes

2.        For CLAPACK ZLARTG, we use the latest GNU C compiler (gcc 3.2.2), flags are: -O3

3.        For ATLAS ZROTG, we also use the latest GNU C compiler and ATLAS pre-built library, which is for LINUX and dual processor Intel Itanium, flags are: -O3

4.        For MKL ZROTG, we use Intel C compiler (ecc) to optimize its performance, flags are: -O3

 

Input data are in table 2, which is from [3].

Input data for testing double complex Givens rotations

Case

Case in algorithm

f

g

1

1

(0.11e+01)+I*(0.22e+01)

(0.33e+01)+I*(0.44e+01)

2

2

(0.37e+08)+I*(0.74e+08)

(0.33e+01)+I*(0.44e+01)

3

2

(0.12e+16)+I*(0.25e+16)

(0.11e+09)+I*(0.15e+09)

4

2

(0.42e+23)+I*(0.83e+23)

(0.37e+16)+I*(0.50e+16)

5

2

(0.14e+31)+I*(0.28e+31)

(0.12e+24)+I*(0.17e+24)

6

2

(0.14e+31)+I*(0.28e+31)

(0.33e+01)+I*(0.44e+01)

7

2

(0.14e+31)+I*(0.28e+31)

(0.26e-29)+I*(0.35e-29)

8

2

(0.11e+01)+I*(0.22e+01)

(0.26e-29)+I*(0.35e-29)

9

2

(0.29e-22)+I*(0.58e-22)

(0.26e-29)+I*(0.35e-29)

10

2

(0.98e-15)+I*(0.20e-14)

(0.87e-22)+I*(0.12e-21)

11

2

(0.33e-08)+I*(0.66e-08)

(0.29e-14)+I*(0.39e-14)

12

3

(0.11e+01)+I*(0.22e+01)

(0.11e+09)+I*(0.15e+09)