Computing
Givens Rotations reliably and efficiently on IA-64 Architecture
huangl@cs.ucdavis.edu
March
20, 2003
Due to possible
intermediate overflow or underflow, the widely used Givens rotation can be
tricky to implemented in a robust and efficient way. Robust often comes with
scaling to avoid spurious intermediate overflow or underflow. But scaling is
not cheap, so one would like to do scaling only when necessary. However
deciding whether to do scaling or not costs too. An ideal way would be somehow
to get rid of scaling altogether. Fortunately, Intel IA-64 architecture
provides many innovative features. These features can get rid of scaling in the
algorithm for computing Givens rotation. At a same time, speed of computing
Givens rotation can be improved, accuracy and reliability of the algorithm is
maintained. To verify this conclusion, a simpler, faster, reliable and accurate
program for computing Givens rotation by exploiting features of IA-64 is
presented as a result in this report.
1.
Introduction
How to compute math kernels efficiently and
reliably is one of the most important problems in the numerical computing
field. Therefore, using IA-64 architecture to simplify algorithms for computing
math kernels and improving performances of math kernels becomes my objective of
this study. Givens rotation, a widely used math kernel, is used as an
example to show how to simplify the algorithms without losing reliability by
exploiting the features of IA-64 and HP-UX C. The definition of Givens rotation
and algorithms for computing it will be introduced in section 4.
The latest work for computing Givens rotation
reliably and efficiently is proposed by D. Bindel, J. Demmel, W. Kahan and O.
Marques, who are from UC Berkeley. In [3], an elaborate and carefully designed
method is presented for computing single complex Givens rotation. It uses
double precision floating-point format to store intermediate results. Ren-Cang
Li, who is at HP lab, extended this approach for computing double complex and 80-bit
extended complex Givens rotations in a similar way: store intermediate results
in IA-64 82-bit floating-point format.
As described in [1], IA-64 is a 64-bit
architecture developed by Intel and Hewlett-Packard (HP). IA-64 architecture has
many innovative features [2]. For examples, more and wider registers than usual
registers, fused multiply-add(FMA) floating-point arithmetic operation,
predicated execution and control speculation. Two of these features will be
studied in depth in this report, namely wider floating-point register type and
FMA operation. They will be studied in section 2.
As an important accessorial part of IA-64,
HP-UX C will also be studied in this report. The features of HP-UX C that will
be studied in section 3 are in-line assembly instructions and pragmas.
What I will do in this report is that:
testing the code for computing double precision Givens rotation on IA-64, which
is programmed by Ren-Cang Li, comparing the results with some other math
softwares and verifying the conclusion given in the abstract. A reason why not
to test the code for 80-bit extended precision is that: no other 80-bit
extended precision programs for computing Givens rotation can be found to do
some comparisons.
The rest of this report is organized as follows.
Section 2 and 3 introduce the features of IA-64 architecture and HP-UX C
respectively. Section 4 studies how to compute Givens rotation reliably and
efficiently on IA-64 architecture, and present a simpler algorithm for
computing Givens rotation. Section 5 will put the new algorithm into practice,
give results and present some analysis. Section 6 draws conclusions and future
work.
2.
Useful features of IA-64 architecture
2.1 82-bit double-extended floating-point
register type
IA-64 architecture provides 82-bit double-extended floating-point arithmetic as its
basic floating-point type[2]. Three IEEE floating-point types: single
precision, double precision and 80-bit double-extended precision are all
subsets of this type. You can see the differences between them in the table 1:
|
|
Single |
Double |
80-bit double –extended |
IA-64 82-bit double-extended |
|
Sign bit |
1 |
1 |
1 |
1 |
|
Exponent bit |
8 |
11 |
15 |
17 |
|
Significand bit |
23 |
52 |
63 |
63 |
|
Explicit bit |
0 |
0 |
1 |
1 |
|
Bias |
127 |
1023 |
16383 |
65535 |
Table 1
The value of an 82-bit floating-point number
is:
![]()
The IA-64 floating-point register format
contains 7 exponent bits beyond the 11 required for treatment of double
precision numbers. The additional bits allow computation near the double overflow/underflow
thresholds to progress without generating exceptions. Of course, if the final
result in a chain of computation is out of range, forcing the exponent to its
usual narrow format will be interpreted as overflow or underflow.
For example, in computing a complex product
, it may be that any of the products
,
,
or
could overflow
or underflow, while neither
nor
lies outside the
target exponent range. By deferring exception reporting and allowing
intermediate results slightly outside the normal double exponent range, it
becomes much simpler to code a robust complex multiplication sequence.
In other words, IA-64 architecture provides a
wider exponent range than IEEE double format. As we will see in section 4, we
can take advantages by using this feature in computing math kernels.
2.2 Fused multiply-add
The basic arithmetic operation in IA-64 is a
fused multiply–add instruction. The instruction computes
with only one
rounding error. The product
is computed
first, and all 128 product bits are retained for the addition with
. After the addition, the result is rounded to the precision
specified in the instruction.
Here are some in-line instructions for fused
multiply-add and its variants:
_Asm_fma: ![]()
_Asm_fms: ![]()
_Asm_fnma: ![]()
_Asm_fadd: ![]()
_Asm_fsub: ![]()
_Asm_fmpy: ![]()
_Asm_fnmpy: ![]()
A major benefit of fused multiply-add instructions
is the ability to compute quantities with over 64 bits of precision in the
arithmetic unit, before rounding. Therefore the computed result will have more
than 64+n valid bits before rounding to the desired precision, which is greater
than the need of double precision. Therefore fused multiply-add can provide
more accuracy than standard floating-point operations.
Another benefit is that this instruction can
be executed in the same time as a simple multiplication instruction, which is
faster than the time for computing a multiplication and a addition separately.
Therefore FMA can also provide shorter execution time.
We will use FMA to improve the speed and
accuracy of our code.
2.3 Some other useful features of IA-64
IA-64 doesn’t provide hardware-based
floating-point division and square root. Replacing these two most expensive
floating-point operations by fast instructions that produce an approximation to
these operations has two beneficial results. First, special hardware for
division and square root, which has relatively low use, is not needed. Second,
by using a succession of fused multiply-add operations to complete the division
or square root, an entire floating–point unit is not rendered busy during the
entire execution time of division or square root.
3.
Useful extensions of HP-UX C
3.1 In-line Assembly Instructions
HP-UX C compiler for IA-64 architecture has been enhanced with “in-line assembly” capabilities, which enable most of the detailed floating-point architectural features to be accessed directly from C.
When the header file “machine/sys/inline.h” is included into a program, the new data types, __fpreg, which represents computational precision, becomes available. Computations with data declared as __fpreg use the full 17-bit exponent.
Also defined are what appear to be
functions of the form _Asm_XXX(PC, a, b, c), where XXX is an IA-64
operation code, such as fma, and PC is for precision control, it
is of the enum type: PC
{_PC_S, _PC_D, _PC_NONE}, meaning to round
the results to float, double or extended respectively.
An important in-line assembly instruction that will be used in the code is:
uint64_t _Asm_frsqrta(__fpreg *r, __fpreg a).
It computes approximation of reciprocal
square root of a. And return value is 0 if a is in
and 1 otherwise.
When return value is 0, r is set to be the IEEE-754 mandated square root of a;
otherwise r approximates
. This in-line instruction can also reduce the time for
computing Givens rotation, which is discussed in Section 2.3.
3.2 Pragmas
The HP-UX C compiler recognizes statements of
the form:
#pragma
xxxxx,
where xxxxx is information that controls some
aspects of the compilation environment. Some useful pragmas are listed below:
#pragma _USE_SF [0|1|2|3]
The _USE_SF pragma permits the C programmer access to the multiple status
fields of IA-64. The pragma must precede any declaration or executable
statements within a block, and it affects only that block. The default status
field is SF0. The status fields most relevant to us here are SF1
besides the default SF0. When user set status filed to SF1, it
will disable all 3 standard floating-point types, and uses the computational
floating-point precision (82-bit IA-64 double extended precision) as the
default. SF2 and SF3 are designed for purposes beyond the scope
this report.
#pragma Estimated_Frequency f The Estimated_Frequency pragma
indicates the estimated relative execution frequency of the current block as
compared with the immediately surrounding block. This may be used to indicate
the average trip count in the body of a for loop, or to indicate
the fraction of time a then clause is executed. The frequency, f,
may be expressed as a floating-point constant.
Proper pragmas can help compiler optimize the
code more efficiently than usual.
4.
Computing Givens rotation
The definition of Givens rotation is: given two
complex numbers
and
, a Givens rotation is a 2-by-2 unitary matrix
such that
.
The fact that
is unitary
implies
![]()

Where I is the identity matrix. From
this we see that
![]()
However this definition of Givens rotation is
not clear enough. Different rotation matrices
can satisfy this
definition. For example, if given two complex numbers
and
, and their Givens rotation
, for any
, it is easy to see
is also a Givens
rotation. They are two different rotations when
, where
. A new definition is proposed in [3]:
if
(includes the
case
)
;
; ![]()
elseif
=0 and ![]()
;
; ![]()
else
and ![]()
;
;
;
Where
. The benefit of this definition is that we can minimize the
dimensions of the sets of discontinuity of c, s and r. In other words, the
value of c, s and r should be continuous functions of the inputs f and g, if
possible.
4.1 Algorithm without IA-64 support
To compute Givens rotation correctly, the “traditional” method divides input data into four cases and handles each case respectively. Four cases are:
(I)
f and g are “well scaled”, i.e. we can
compute Givens rotation without worrying about the overflow or underflow.
(II)
f is much larger than g, i.e.
will be rounded
to ![]()
(III) g
is much larger than f, i.e.
will be rounded
to ![]()
(IV)
both f and g are very large or very small,
i.e. we can’t compute its Givens rotation directly due to the overflow or
underflow. But we can scale f and g to a proper size. After that, the algorithm
for case 1 is used to handle the rest.
To compute Givens rotation for case (I)
(1) Computes
![]()
(2) Computes
![]()
(3) Computes
![]()
(4) Computes
by computing the
reciprocal square
root of the product of (1) and (3)
(5) Computes
by computing the
product of (1) and (4)
(6) Computes
by computing
the product of (3) and (4)
(7) Computes
by computing the
product of
and (6)
(8) Computes
by computing the
product of
and (4)
(9) Computes
by computing the
product of (8) and conjugate number of ![]()
The results we need: c, s and r, are produced
at step (5), (9) and (7) respectively.
The reason why to use such an algorithm for
case (I) is that: we must reduce the number of division and square root
operations. As we mentioned in section 2.3, division and square root operations
are the most expensive floating-point operations. To compute Givens rotation,
it is easy to see that one square root and one division are necessary. (Or a
reciprocal square root operation is used, as what we have done in our
algorithm.) In contrast, the algorithm in the ZROTG routine in the Fortran
reference BLAS uses at least 5 square roots and 5 divisions, and perhaps 13
divisions, depending on the aggressiveness of the compiler optimizations and
implementation of the complex absolute value function cabs. Therefore we
can improve the speed by using only one reciprocal square root operation, which
is provided by IA-64.
For case (II), (III) and (IV), since what we
care is the new algorithm on IA-64, we won’t waste space to describe them here.
You can check the details of this algorithm in [3].
4.2 Algorithm with IA-64 support
With support from IA-64 architecture, we can
compute Givens rotation with a much simpler way: just ignore the last three
cases. Since IA-64 provides a floating-point format with a much wider exponent
range, as we mentioned in section 3, the possibilities of overflow or underflow
occurring is zero.
For example, for any multiplication,
addition, subtraction, division or square root operation, whose operands are
both double precision floating-point numbers, results of operations can’t
exceed the boundary of 82-bit IA-64 floating-point exponent range.
Therefore a simplified algorithm is: for any
input data, we just apply the traditional algorithm for case (1). We call this
algorithm new ZLARTG, to distinguish from old ZLARTG, which is part of the
CLAPACK (a f2c version of LAPACK).
5.
Experiment results
By using the new ZLARTG algorithm, which is
given in section 4, the original code for computing double complex Givens
rotation is programmed in HP-UX C. I tested this code on the Itanium machine of
our department, which is called “Lanczos”. To compare performances of different
algorithms both in speed and accuracy, 4 algorithms are used:
1. Our
new ZLARTG, which is presented in section 4.
2. Old
ZLARTG, which is in CLAPACK3.0.
3. ATLAS
ZROTG, which is in ATLAS BLAS.
4. Vendor
ZROTG, which is in INTEL Math Kernel Library (MKL).
We note that different compilers and
different flags may cause different performance. Therefore, to obtain the
optimal performance, we compile 4 routines with different compilers and flags:
1.
For our new ZLARTG, we use HP-UX C compiler (cc),
and flags are: +O3 –fpwidetypes
2.
For CLAPACK ZLARTG, we use the latest GNU C
compiler (gcc 3.2.2), flags are: -O3
3.
For ATLAS ZROTG, we also use the latest GNU
C compiler and ATLAS pre-built library, which is for LINUX and dual
processor Intel Itanium, flags are: -O3
4.
For MKL ZROTG, we use Intel C compiler (ecc)
to optimize its performance, flags are: -O3
Input data are in table 2, which is from [3].
|
Input
data for testing double complex Givens rotations |
|||
|
Case |
Case in algorithm |
f |
g |
|
1 |
1 |
(0.11e+01)+I*(0.22e+01) |
(0.33e+01)+I*(0.44e+01) |
|
2 |
2 |
(0.37e+08)+I*(0.74e+08) |
(0.33e+01)+I*(0.44e+01) |
|
3 |
2 |
(0.12e+16)+I*(0.25e+16) |
(0.11e+09)+I*(0.15e+09) |
|
4 |
2 |
(0.42e+23)+I*(0.83e+23) |
(0.37e+16)+I*(0.50e+16) |
|
5 |
2 |
(0.14e+31)+I*(0.28e+31) |
(0.12e+24)+I*(0.17e+24) |
|
6 |
2 |
(0.14e+31)+I*(0.28e+31) |
(0.33e+01)+I*(0.44e+01) |
|
7 |
2 |
(0.14e+31)+I*(0.28e+31) |
(0.26e-29)+I*(0.35e-29) |
|
8 |
2 |
(0.11e+01)+I*(0.22e+01) |
(0.26e-29)+I*(0.35e-29) |
|
9 |
2 |
(0.29e-22)+I*(0.58e-22) |
(0.26e-29)+I*(0.35e-29) |
|
10 |
2 |
(0.98e-15)+I*(0.20e-14) |
(0.87e-22)+I*(0.12e-21) |
|
11 |
2 |
(0.33e-08)+I*(0.66e-08) |
(0.29e-14)+I*(0.39e-14) |
|
12 |
3 |
(0.11e+01)+I*(0.22e+01) |
(0.11e+09)+I*(0.15e+09) |
|
| |||