HTML code for multiplication

In this article, we look at arithmetic operations such as multiplication and division in JavaScript. An arithmetic operation works on two numbers, and those numbers are called operands.

Multiplication. The multiplication operator (*) multiplies two or more numbers.

Example

var a = 15;
var b = 12;
var c = a * b;

Approach. Create an HTML form that takes input from the user for the multiplication. Add JavaScript inside the HTML to perform the multiplication logic. document.getElementById(id).value returns the value of the value attribute of a text field.
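The approach described above can be sketched roughly as follows. This is a minimal illustration, not the original listing: the function name multiplyInputs and the element ids "num1", "num2", "multiply" and "result" are assumptions made for the example.

```javascript
// Hypothetical helper: parse the text of two input fields and multiply them.
function multiplyInputs(firstText, secondText) {
  const first = Number.parseFloat(firstText);
  const second = Number.parseFloat(secondText);
  if (Number.isNaN(first) || Number.isNaN(second)) {
    return NaN; // tell the caller that at least one input was not a number
  }
  return first * second;
}

// Browser wiring, skipped when no DOM is available.
// The ids used here are assumed; adapt them to the actual form.
if (typeof document !== "undefined") {
  const button = document.getElementById("multiply");
  if (button) {
    button.addEventListener("click", () => {
      const a = document.getElementById("num1").value;
      const b = document.getElementById("num2").value;
      document.getElementById("result").textContent = String(multiplyInputs(a, b));
    });
  }
}
```

Keeping the arithmetic in a plain function makes it possible to check the logic without a page loaded.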

Example. Below is an implementation of the approach above.

HTML

<body style="margin: 30px">

(The rest of the HTML listing did not survive extraction.)

Division. The division operator (/) divides one number by another.

var a = 50;
var b = 20;
var c = a / b;

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General -- instruction set design; 3.4 [Programming Languages]: Processors -- compilers, optimization; 1.0 [Numerical Analysis]: General -- computer arithmetic, error analysis, numerical algorithms. (Secondary) D.2.1 [Software Engineering]: Requirements/Specifications -- languages; 3.4 [Programming Languages]: Formal Definitions and Theory -- semantics; 4.1 [Operating Systems]: Process Management -- synchronization

General Terms: Algorithms, Design, Languages

Additional Key Words and Phrases: denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow

Introduction

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first section, Rounding Error, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling.

I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called "The Details," so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the z symbol. When a proof is not included, the z appears immediately following the statement of the theorem

Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. The section Guard Digits discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations.

Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. Floating-point representations have a base $\beta$ (which is always assumed to be even) and a precision $p$. If $\beta = 10$ and $p = 3$, then the number 0.1 is represented as $1.00 \times 10^{-1}$. If $\beta = 2$ and $p = 24$, then the decimal number 0.1 cannot be represented exactly, but is approximately $1.10011001100110011001101 \times 2^{-4}$.
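The inexact binary representation of 0.1 can be observed directly in JavaScript, whose numbers are IEEE 754 double precision ($\beta = 2$, $p = 53$). The sketch below also uses Math.fround, which rounds to single precision ($p = 24$, the format mentioned above). The variable names are chosen for illustration only.

```javascript
// 0.1 rounded to single precision (24-bit significand), then widened back to double.
const single = Math.fround(0.1);

// The single-precision and double-precision approximations of 0.1 differ,
// because neither format can hold the infinitely repeating binary expansion.
const singleEqualsDouble = single === 0.1;

// Binary expansion of the double nearest to 0.1: the 0011 pattern repeats
// until the significand runs out of bits.
const binaryDigits = (0.1).toString(2);

// A familiar consequence: the rounding errors in 0.1 and 0.2 do not cancel.
const sumMatches = 0.1 + 0.2 === 0.3;

console.log(singleEqualsDouble, sumMatches);
console.log(binaryDigits);
```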

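In general, a floating-point number will be represented as $\pm d.dd \ldots d \times \beta^e$, where $d.dd \ldots d$ is called the significand and has $p$ digits. More precisely $\pm d_0.d_1 d_2 \ldots d_{p-1} \times \beta^e$ represents the number

(1) $\pm \left( d_0 + d_1 \beta^{-1} + \cdots + d_{p-1} \beta^{-(p-1)} \right) \beta^e, \quad 0 \le d_i < \beta$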

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, $e_{\max}$ and $e_{\min}$. Since there are $\beta^p$ possible significands, and $e_{\max} - e_{\min} + 1$ possible exponents, a floating-point number can be encoded in

$\lceil \log_2(e_{\max} - e_{\min} + 1) \rceil + \lceil \log_2(\beta^p) \rceil + 1$

bits, where the final +1 is for the sign bit. The precise encoding is not important for now.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when $\beta = 2$, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range, that is, its absolute value is larger than $\beta \times \beta^{e_{\max}}$ or smaller than $1.0 \times \beta^{e_{\min}}$. Most of this paper discusses issues due to the first reason. However, numbers that are out of range will be discussed in later sections.

Floating-point representations are not necessarily unique. For example, both $0.01 \times 10^1$ and $1.00 \times 10^{-1}$ represent 0.1. If the leading digit is nonzero ($d_0 \ne 0$ in equation (1) above), then the representation is said to be normalized. The floating-point number $1.00 \times 10^{-1}$ is normalized, while $0.01 \times 10^1$ is not. When $\beta = 2$, $p = 3$, $e_{\min} = -1$ and $e_{\max} = 2$ there are 16 normalized floating-point numbers, as shown in FIGURE D-1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero. A natural way to represent 0 is with $1.0 \times \beta^{e_{\min} - 1}$, since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographic ordering of their floating-point representations. When the exponent is stored in a $k$ bit field, that means that only $2^k - 1$ values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation, and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression $(2.5 \times 10^{-3}) \times (4.0 \times 10^{2})$ involves only a single floating-point multiplication.

FIGURE D-1 Normalized numbers when $\beta = 2$, $p = 3$, $e_{\min} = -1$, $e_{\max} = 2$
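The count of 16 normalized numbers can be checked by enumeration. The sketch below lists every normalized value for given parameters. The function name normalizedNumbers is an assumption for this example, and significands are held as integers so that the arithmetic stays exact for the small parameters used here.

```javascript
// Enumerate all positive normalized floating-point numbers for base beta,
// precision p and exponent range [emin, emax].
function normalizedNumbers(beta, p, emin, emax) {
  const values = [];
  const firstSignificand = beta ** (p - 1); // 1.00...0 as an integer (leading digit nonzero)
  const lastSignificand = beta ** p - 1;    // largest p-digit significand as an integer
  for (let e = emin; e <= emax; e++) {
    for (let s = firstSignificand; s <= lastSignificand; s++) {
      // s counts units in the last place, so the value is s * beta^(e - (p - 1)).
      values.push(s * beta ** (e - (p - 1)));
    }
  }
  return values;
}

const numbers = normalizedNumbers(2, 3, -1, 2);
console.log(numbers.length, numbers);
```

For these parameters there are 4 significands (1.00, 1.01, 1.10, 1.11 in binary) for each of the 4 exponents.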

Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with $\beta = 10$ and $p = 3$, which will be used throughout this section. If the result of a floating-point computation is $3.12 \times 10^{-2}$, and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as $3.14 \times 10^{-2}$, then it is in error by .159 units in the last place. In general, if the floating-point number $d.d \ldots d \times \beta^e$ is used to represent $z$, then it is in error by $|d.d \ldots d - (z/\beta^e)|\,\beta^{p-1}$ units in the last place. The term ulps will be used as shorthand for "units in the last place." If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as .5 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. For example the relative error committed when approximating 3.14159 by $3.14 \times 10^0$ is $.00159/3.14159 \approx .0005$.

To compute the relative error that corresponds to .5 ulp, observe that when a real number is approximated by the closest possible floating-point number $d.dd \ldots dd \times \beta^e$, the error can be as large as $0.00 \ldots 00\beta' \times \beta^e$, where $\beta'$ is the digit $\beta/2$, there are $p$ units in the significand of the floating-point number, and $p$ units of 0 in the significand of the error. This error is $((\beta/2)\beta^{-p}) \times \beta^e$. Since numbers of the form $d.dd \ldots dd \times \beta^e$ all have the same absolute error, but have values that range between $\beta^e$ and $\beta \times \beta^e$, the relative error ranges between $((\beta/2)\beta^{-p}) \times \beta^e/\beta^e$ and $((\beta/2)\beta^{-p}) \times \beta^e/\beta^{e+1}$. That is,

(2) $\frac{1}{2}\beta^{-p} \le \frac{1}{2}\,\text{ulp} \le \frac{\beta}{2}\beta^{-p}$

In particular, the relative error corresponding to .5 ulp can vary by a factor of $\beta$. This factor is called the wobble. Setting $\epsilon = (\beta/2)\beta^{-p}$ to the largest of the bounds in (2) above, we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by $\epsilon$, which is referred to as machine epsilon.

In the example above, the relative error was $.00159/3.14159 \approx .0005$. In order to avoid such small numbers, the relative error is normally written as a factor times $\epsilon$, which in this case is $\epsilon = (\beta/2)\beta^{-p} = 5(10)^{-3} = .005$. Thus the relative error would be expressed as $((.00159/3.14159)/.005)\,\epsilon \approx 0.1\epsilon$.
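The two error measures can be computed side by side for the $\beta = 10$, $p = 3$ format used in this section. The following is a small sketch under that assumption. The names errorInUlps and relativeError are illustrative, and because the helper itself runs in binary double precision, its results are accurate only to roughly 15 digits.

```javascript
const BETA = 10;
const PRECISION = 3;
// Machine epsilon for this format: (beta / 2) * beta^(-p), about .005 here.
const EPSILON = (BETA / 2) * BETA ** -PRECISION;

// Error, in units in the last place, of significand * BETA^exponent as an
// approximation to the real number `exact`: |d.dd - exact / beta^e| * beta^(p - 1).
function errorInUlps(significand, exponent, exact) {
  return Math.abs(significand - exact / BETA ** exponent) * BETA ** (PRECISION - 1);
}

// Relative error: |approx - exact| / |exact|.
function relativeError(approx, exact) {
  return Math.abs(approx - exact) / Math.abs(exact);
}

console.log(errorInUlps(3.12, -2, 0.0314));           // about 2 ulps
console.log(errorInUlps(3.14, -2, 0.0314159));        // about .159 ulps
console.log(relativeError(3.14, 3.14159) / EPSILON);  // about 0.1, i.e. 0.1 epsilon
```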

To illustrate the difference between ulps and relative error, consider the real number $x = 12.35$. It is approximated by $\tilde{x} = 1.24 \times 10^1$. The error is 0.5 ulps, the relative error is $0.8\epsilon$. Next consider the computation $8\tilde{x}$. The exact value is $8x = 98.8$, while the computed value is $8\tilde{x} = 9.92 \times 10^1$. The error is now 4.0 ulps, but the relative error is still $0.8\epsilon$. The error measured in ulps is 8 times larger, even though the relative error is the same. In general, when the base is $\beta$, a fixed relative error expressed in ulps can wobble by a factor of up to $\beta$. And conversely, as equation (2) above shows, a fixed error of .5 ulps results in a relative error that can wobble by $\beta$.

The most natural way to measure rounding error is in ulps. For example rounding to the nearest floating-point number corresponds to an error of less than or equal to .5 ulp. However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the error analysis given later in the paper. Since $\epsilon$ can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of $\beta$, error estimates of formulas will be tighter on machines with a small $\beta$.

When only the order of magnitude of rounding error is of interest, ulps and $\epsilon$ may be used interchangeably, since they differ by at most a factor of $\beta$. For example, when a floating-point number is in error by $n$ ulps, that means that the number of contaminated digits is $\log_\beta n$. If the relative error in a computation is $n\epsilon$, then

(3) contaminated digits $\approx \log_\beta n$

Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming $p = 3$, $2.15 \times 10^{12} - 1.25 \times 10^{-5}$ would be calculated as

$x = 2.15 \times 10^{12}$
$y = .0000000000000000125 \times 10^{12}$
$x - y = 2.1499999999999999875 \times 10^{12}$

which rounds to $2.15 \times 10^{12}$. Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose that the number of digits kept is $p$, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then $2.15 \times 10^{12} - 1.25 \times 10^{-5}$ becomes

$x = 2.15 \times 10^{12}$
$y = 0.00 \times 10^{12}$
$x - y = 2.15 \times 10^{12}$

The answer is exactly the same as if the difference had been computed exactly and then rounded. Take another example: 10.1 - 9.93. This becomes

$x = 1.01 \times 10^{1}$
$y = 0.99 \times 10^{1}$
$x - y = .02 \times 10^{1}$

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit. How bad can the error be?

Theorem 1

Using a floating-point format with parameters $\beta$ and $p$, and computing differences using $p$ digits, the relative error of the result can be as large as $\beta - 1$.

Proof

A relative error of $\beta - 1$ in the expression $x - y$ occurs when $x = 1.00 \ldots 0$ and $y = .\rho\rho \ldots \rho$, where $\rho = \beta - 1$. Here $y$ has $p$ digits (all equal to $\rho$). The exact difference is $x - y = \beta^{-p}$. However, when computing the answer using only $p$ digits, the rightmost digit of $y$ gets shifted off, and so the computed difference is $\beta^{-p+1}$. Thus the error is $\beta^{-p+1} - \beta^{-p} = \beta^{-p}(\beta - 1)$, and the relative error is $\beta^{-p}(\beta - 1)/\beta^{-p} = \beta - 1$. z

When $\beta = 2$, the relative error can be as large as the result, and when $\beta = 10$, it can be 9 times larger. Or to put it another way, when $\beta = 2$, equation (3) shows that the number of contaminated digits is $\log_2(1/\epsilon) = \log_2(2^p) = p$. That is, all of the $p$ digits in the result are wrong. Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to $p + 1$ digits, and then the result of the subtraction is rounded to $p$ digits. With a guard digit, the previous example becomes

$x = 1.010 \times 10^{1}$
$y = 0.993 \times 10^{1}$
$x - y = .017 \times 10^{1}$

and the answer is exact. With a single guard digit, the relative error of the result may be greater than $\epsilon$, as in 110 - 8.59.

$x = 1.10 \times 10^{2}$
$y = .085 \times 10^{2}$
$x - y = 1.015 \times 10^{2}$

This rounds to 102, compared with the correct answer of 101.41, for a relative error of .006, which is greater than $\epsilon = .005$. In general, the relative error of the result can be only slightly larger than $\epsilon$. More precisely,

Theorem 2

If $x$ and $y$ are floating-point numbers in a format with parameters $\beta$ and $p$, and if subtraction is done with $p + 1$ digits (i.e. one guard digit), then the relative rounding error in the result is less than $2\epsilon$.

This theorem is proven later in the paper. Addition is included in the above theorem since $x$ and $y$ can be positive or negative.
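The effect of a guard digit can be reproduced with a small decimal simulation. The sketch below is an illustration under simplifying assumptions, not a model of any real hardware: operands are positive with x at least as large as y, significands are stored as integers, the shifted operand is truncated, and the final rounding of the result back to $p$ digits is left out (the raw difference is returned). The name subtractShifted is hypothetical.

```javascript
// Operands are { sig, exp } with sig an integer of p decimal digits, so that
// value = sig * 10^(exp - (p - 1)). For example 10.1 is { sig: 101, exp: 1 } when p = 3.
function subtractShifted(x, y, p, guardDigits) {
  const shift = x.exp - y.exp;     // how far y must be shifted right to align with x
  const scale = 10 ** guardDigits; // extra digits kept to the right of x's last place
  // Align y with x, keeping only p + guardDigits digits (discarded, not rounded).
  const alignedY = Math.floor((y.sig * scale) / 10 ** shift);
  const units = x.sig * scale - alignedY;
  // The difference equals units * 10^exp10.
  return { units, exp10: x.exp - (p - 1) - guardDigits };
}

const x = { sig: 101, exp: 1 }; // 10.1
const y = { sig: 993, exp: 0 }; // 9.93
console.log(subtractShifted(x, y, 3, 0)); // 2 * 10^-1 = .2, far from the true .17
console.log(subtractShifted(x, y, 3, 1)); // 17 * 10^-2 = .17, exact
```

For 110 - 8.59 with one guard digit the same function returns 1015 × 10^-1 = 101.5, which matches the intermediate value 1.015 × 10^2 shown above before it is rounded to three digits.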

Cancellation

The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example, in the quadratic formula, the expression $b^2 - 4ac$ occurs. The quantities $b^2$ and $4ac$ are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider $b = 3.34$, $a = 1.22$, and $c = 2.28$. The exact value of $b^2 - 4ac$ is .0292. But $b^2$ rounds to 11.2 and $4ac$ rounds to 11.1, hence the final answer is .1, which is an error of 70 ulps, even though 11.2 - 11.1 is exactly equal to .1. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.

Benign cancellation occurs when subtracting exactly known quantities. If $x$ and $y$ have no rounding error, then by Theorem 2, if the subtraction is done with a guard digit, the difference $x - y$ has a very small relative error (less than $2\epsilon$).

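A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

(4) $r_1 = \dfrac{-b + \sqrt{b^2 - 4ac}}{2a}, \quad r_2 = \dfrac{-b - \sqrt{b^2 - 4ac}}{2a}$

When $b^2 \gg ac$, then $b^2 - 4ac$ does not involve a cancellation and $\sqrt{b^2 - 4ac} \approx |b|$. But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of $r_1$ by $-b - \sqrt{b^2 - 4ac}$ (and similarly for $r_2$) to obtain

(5) $r_1 = \dfrac{2c}{-b - \sqrt{b^2 - 4ac}}, \quad r_2 = \dfrac{2c}{-b + \sqrt{b^2 - 4ac}}$

If $b^2 \gg ac$ and $b > 0$, then computing $r_1$ using formula (4) will involve a cancellation. Therefore, use formula (5) for computing $r_1$ and (4) for $r_2$. On the other hand, if $b < 0$, use (4) for computing $r_1$ and (5) for $r_2$.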
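The rule for choosing between formulas (4) and (5) according to the sign of b can be written compactly. The following is one possible sketch for real roots only. The function name quadraticRoots and the error handling are assumptions for the example, and it does not address overflow or the rounding of the discriminant itself.

```javascript
// Return the two real roots of a*x^2 + b*x + c = 0, avoiding the subtraction
// of nearly equal quantities in the numerator.
function quadraticRoots(a, b, c) {
  if (a === 0) throw new RangeError("not a quadratic: a must be nonzero");
  const discriminant = b * b - 4 * a * c;
  if (discriminant < 0) throw new RangeError("no real roots");
  const root = Math.sqrt(discriminant);
  // Pick the sign so that -b and the square root term have the same sign:
  // the two magnitudes are added, so no cancellation occurs.
  const q = b >= 0 ? -b - root : -b + root;
  if (q === 0) return [0, 0]; // only possible when b = 0 and c = 0
  // One root comes from formula (4) as q / (2a), the other from formula (5) as 2c / q.
  return [q / (2 * a), (2 * c) / q];
}

// With b^2 much larger than ac, the textbook formula loses most digits of the small root.
const naiveSmallRoot = (-1e8 + Math.sqrt(1e8 * 1e8 - 4)) / 2;
const [largeRoot, smallRoot] = quadraticRoots(1, 1e8, 1);
console.log(naiveSmallRoot, smallRoot, largeRoot);
```

Here the true small root is very close to -1e-8, which the rearranged computation reproduces while the naive expression does not.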

The expression $x^2 - y^2$ is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as $(x - y)(x + y)$. Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one. By Theorem 2, the relative error in $x - y$ is at most $2\epsilon$. The same is true of $x + y$. Multiplying two quantities with a small relative error results in a product with a small relative error.
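The difference between the two forms shows up in double precision once the squares no longer fit in the 53-bit significand. The sketch below assumes inputs that are exactly representable integers, so that any error comes from the arithmetic and not from the inputs. The variable names are illustrative.

```javascript
// Both inputs are exactly representable in double precision.
const xVal = 100000001; // 1e8 + 1
const yVal = 100000000; // 1e8

// x*x = 10000000200000001 needs more than 53 bits, so it is rounded before
// the subtraction, and the subtraction then exposes that rounding error.
const squaredFirst = xVal * xVal - yVal * yVal;

// x - y and x + y are exact here, and so is their product.
const factored = (xVal - yVal) * (xVal + yVal);

console.log(squaredFirst, factored); // the exact answer is 200000001
```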

To avoid confusion between exact and computed values, the following notation is used. Whereas $x - y$ denotes the exact difference of $x$ and $y$, $x \ominus y$ denotes the computed difference (i.e., with rounding error). Similarly $\oplus$, $\otimes$, and $\oslash$ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in LN(x) or SQRT(x). Lowercase functions and traditional mathematical notation denote their exact values as in $\ln(x)$ and $\sqrt{x}$.

Although $(x \ominus y) \otimes (x \oplus y)$ is an excellent approximation to $x^2 - y^2$, the floating-point numbers $x$ and $y$ might themselves be approximations to some true quantities $\hat{x}$ and $\hat{y}$. For example, $\hat{x}$ and $\hat{y}$ might be exactly known decimal numbers that cannot be expressed exactly in binary. In this case, even though $x \ominus y$ is a good approximation to $x - y$, it can have a huge relative error compared to the true expression $\hat{x} - \hat{y}$, and so the advantage of $(x + y)(x - y)$ over $x^2 - y^2$ is not as dramatic. Since computing $(x + y)(x - y)$ is about the same amount of work as computing $x^2 - y^2$, it is clearly the preferred form in this case. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not always) an approximation. But eliminating a cancellation entirely (as in the quadratic formula) is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible.

The expression $x^2 - y^2$ is more accurate when rewritten as $(x - y)(x + y)$ because a catastrophic cancellation is replaced with a benign one. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation.

The area of a triangle can be expressed directly in terms of the lengths of its sides $a$, $b$, and $c$ as

(6) $A = \sqrt{s(s - a)(s - b)(s - c)}, \quad \text{where } s = (a + b + c)/2$

Suppose the triangle is very flat; that is, $a \approx b + c$. Then $s \approx a$, and the term $(s - a)$ in formula (6) subtracts two nearby numbers, one of which may have rounding error. For example, if $a = 9.0$, $b = c = 4.53$, the correct value of $s$ is 9.03 and $A$ is 2.342. Even though the computed value of $s$ (9.05) is in error by only 2 ulps, the computed value of $A$ is 3.04, an error of 70 ulps.

There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan 1986]. It is

(7) $A = \dfrac{\sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}}{4}, \quad a \ge b \ge c$

If $a$, $b$, and $c$ do not satisfy $a \ge b \ge c$, rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of $a$, $b$, and $c$ above gives a computed area of 2.35, which is 1 ulp in error and much more accurate than the first formula.
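Formulas (6) and (7) translate directly into code. The sketch below is illustrative: the function names heronArea and kahanArea are assumptions, the sort performs the renaming so that a ≥ b ≥ c, and the parentheses in the second function are essential and must not be simplified.

```javascript
// Formula (6): accurate for well-shaped triangles, poor for very flat ones.
function heronArea(a, b, c) {
  const s = (a + b + c) / 2;
  return Math.sqrt(s * (s - a) * (s - b) * (s - c));
}

// Formula (7): the same quantity, arranged so that every subtraction is benign.
function kahanArea(sideA, sideB, sideC) {
  // Rename the sides so that a >= b >= c.
  const [a, b, c] = [sideA, sideB, sideC].sort((u, v) => v - u);
  const product = (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c));
  return Math.sqrt(product) / 4; // NaN if the lengths do not form a triangle
}

console.log(kahanArea(3, 4, 5));       // right triangle with area 6
console.log(kahanArea(9, 4.53, 4.53)); // the flat triangle from the text, about 2.342
console.log(heronArea(9, 4.53, 4.53));
```

In double precision both functions agree closely on the example from the text; the difference becomes visible only for triangles that are flat relative to the available precision, as in the three-digit arithmetic used above.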

Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.

Theorem 3

The rounding error incurred when using (7) to compute the area of a triangle is at most $11\epsilon$, provided that subtraction is performed with a guard digit, $\epsilon \le .005$, and that square roots are computed to within 1/2 ulp.

The condition that $\epsilon < .005$ is met in virtually every actual floating-point system. For example when $\beta = 2$, $p \ge 8$ ensures that $\epsilon < .005$, and when $\beta = 10$, $p \ge 3$ is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

(8) $\mathrm{SQRT}((a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))) \oslash 4$

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of $0.7\epsilon$, which is much less than $11\epsilon$. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is $(1 + x)^n$, where $x \ll 1$. This expression arises in financial calculations. Consider depositing $100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If $n = 365$ and $i = .06$, the amount of money accumulated at the end of one year is

$100 \, \dfrac{(1 + i/n)^n - 1}{i/n}$

dollars. If this is computed using $\beta = 2$ and $p = 24$, the result is $37615.45 compared to the exact answer of $37614.05, a discrepancy of $1.40. The reason for the problem is easy to see. The expression $1 + i/n$ involves adding 1 to .0001643836, so the low order bits of $i/n$ are lost. This rounding error is amplified when $1 + i/n$ is raised to the $n$th power.

The troublesome expression $(1 + i/n)^n$ can be rewritten as $e^{n \ln(1 + i/n)}$, where now the problem is to compute $\ln(1 + x)$ for small $x$. One approach is to use the approximation $\ln(1 + x) \approx x$, in which case the payment becomes $37617.26, which is off by $3.21 and even less accurate than the obvious formula. But there is a way to compute $\ln(1 + x)$ very accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields $37614.07, accurate to within two cents.

Theorem 4 assumes that LN(x) approximates $\ln(x)$ to within 1/2 ulp. The problem it solves is that when $x$ is small, $\mathrm{LN}(1 \oplus x)$ is not close to $\ln(1 + x)$ because $1 \oplus x$ has lost the information in the low order bits of $x$. That is, the computed value of $\ln(1 + x)$ is not close to its actual value when $x \ll 1$.

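Theorem 4

If $\ln(1 + x)$ is computed using the formula

$\ln(1 + x) = \begin{cases} x & \text{for } 1 \oplus x = 1 \\ \dfrac{x \ln(1 + x)}{(1 + x) - 1} & \text{for } 1 \oplus x \ne 1 \end{cases}$

the relative error is at most $5\epsilon$ when $0 \le x < 3/4$, provided subtraction is performed with a guard digit, $\epsilon < 0.1$, and ln is computed to within 1/2 ulp.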

This formula will work for any value of $x$ but is only interesting for $x \ll 1$, which is where catastrophic cancellation occurs in the naive formula $\ln(1 + x)$. Although the formula may seem mysterious, there is a simple explanation for why it works. Write $\ln(1 + x)$ as

$x \left( \dfrac{\ln(1 + x)}{x} \right) = x \mu(x)$

The left hand factor can be computed exactly, but the right hand factor $\mu(x) = \ln(1 + x)/x$ will suffer a large rounding error when adding 1 to $x$. However, $\mu$ is almost constant, since $\ln(1 + x) \approx x$. So changing $x$ slightly will not introduce much error. In other words, if $\tilde{x} \approx x$, computing $x\mu(\tilde{x})$ will be a good approximation to $x\mu(x) = \ln(1 + x)$. Is there a value for $\tilde{x}$ for which $\tilde{x}$ and $\tilde{x} + 1$ can be computed accurately? There is; namely $\tilde{x} = (1 \oplus x) \ominus 1$, because then $1 + \tilde{x}$ is exactly equal to $1 \oplus x$.
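The formula of Theorem 4 is short enough to try directly. The sketch below is illustrative: the name log1pGuarded is an assumption, Math.log stands in for LN, and the built-in Math.log1p is used only as a reference value for comparison.

```javascript
// ln(1 + x) following Theorem 4: use w = 1 (+) x consistently in both the
// logarithm and the denominator, so that the error made in forming w cancels.
function log1pGuarded(x) {
  const w = 1 + x;           // computed sum; low order bits of x may be lost here
  if (w === 1) return x;     // x is so small that ln(1 + x) rounds to x
  return (x * Math.log(w)) / (w - 1);
}

const tiny = 1e-10;
const naive = Math.log(1 + tiny);   // suffers from the rounding in 1 + tiny
const guarded = log1pGuarded(tiny);
const reference = Math.log1p(tiny); // library routine, used for comparison only
console.log(naive, guarded, reference);
```

For this input the naive value is wrong from about the eighth significant digit onward, while the guarded value agrees with the reference to nearly full precision.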

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as the rewritten formula for computing the area of a triangle and the expression $\ln(1 + x)$. Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not.

Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded. The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding

So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting {0, 1, 2, 3, 4} round down, and {5, 6, 7, 8, 9} round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation's VAX computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

Theorem 5

Let $x$ and $y$ be floating-point numbers, and define $x_0 = x$, $x_1 = (x_0 \ominus y) \oplus y$, ..., $x_n = (x_{n-1} \ominus y) \oplus y$. If $\oplus$ and $\ominus$ are exactly rounded using round to even, then either $x_n = x$ for all $n$ or $x_n = x_1$ for all $n \ge 1$. z

To clarify this result, consider $\beta = 10$, $p = 3$ and let $x = 1.00$, $y = -.555$. When rounding up, the sequence becomes

$x_0 \ominus y = 1.56$, $x_1 = 1.56 \ominus .555 = 1.01$, $x_1 \ominus y = 1.01 \oplus .555 = 1.57$,

and each successive value of $x_n$ increases by .01, until $x_n = 9.45$ ($n \le 845$). Under round to even, $x_n$ is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.
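The drift in this example can be reproduced with a small simulation of three-digit decimal arithmetic. The sketch below is an illustration under stated assumptions: values are stored as integer counts of thousandths so that the exact sums are available before rounding, and the names roundToDigits and iterate are hypothetical.

```javascript
const DIGITS = 3; // precision p of the simulated decimal format

// Round an integer (a count of thousandths) to DIGITS significant decimal digits.
// mode "up" rounds halfway cases away from zero; mode "even" rounds them to an even digit.
function roundToDigits(units, mode) {
  const sign = units < 0 ? -1 : 1;
  const magnitude = Math.abs(units);
  const length = String(magnitude).length;
  if (length <= DIGITS) return units;        // already fits, nothing to round
  const factor = 10 ** (length - DIGITS);
  let kept = Math.floor(magnitude / factor); // the leading DIGITS digits
  const dropped = magnitude % factor;
  const half = factor / 2;
  if (dropped > half || (dropped === half && (mode === "up" || kept % 2 === 1))) {
    kept += 1;
  }
  return sign * kept * factor;
}

// Apply x <- (x (-) y) (+) y the given number of times.
function iterate(x, y, steps, mode) {
  for (let i = 0; i < steps; i++) {
    const difference = roundToDigits(x - y, mode);
    x = roundToDigits(difference + y, mode);
  }
  return x;
}

console.log(iterate(1000, -555, 5, "up"));   // 1050, i.e. 1.05: drifted upward by .01 per step
console.log(iterate(1000, -555, 5, "even")); // 1000, i.e. 1.00: no drift
```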

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic

The key to multiplication in this system is representing a product $xy$ as a sum, where each summand has the same precision as $x$ and $y$. This can be done by splitting $x$ and $y$. Writing $x = x_h + x_l$ and $y = y_h + y_l$, the exact product is

$xy = x_h y_h + x_h y_l + x_l y_h + x_l y_l$

If $x$ and $y$ have $p$ bit significands, the summands will also have $p$ bit significands provided that $x_l$, $x_h$, $y_h$, $y_l$ can be represented using $\lfloor p/2 \rfloor$ bits. When $p$ is even, it is easy to find a splitting. The number $x_0.x_1 \ldots x_{p-1}$ can be written as the sum of $x_0.x_1 \ldots x_{p/2-1}$ and $0.0 \ldots 0x_{p/2} \ldots x_{p-1}$. When $p$ is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. For example, if $\beta = 2$, $p = 5$, and $x = .10111$, $x$ can be split as $x_h = .11$ and $x_l = -.00001$. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

Theorem 6

Let $p$ be the floating-point precision, with the restriction that $p$ is even when $\beta > 2$, and assume that floating-point operations are exactly rounded. Then if $k = \lceil p/2 \rceil$ is half the precision (rounded up) and $m = \beta^k + 1$, $x$ can be split as $x = x_h + x_l$, where

$x_h = (m \otimes x) \ominus (m \otimes x \ominus x), \quad x_l = x \ominus x_h,$

and each $x_i$ is representable using $\lfloor p/2 \rfloor$ bits of precision.

To see how this theorem works in an example, let $\beta = 10$, $p = 4$, $b = 3.476$, $a = 3.463$, and $c = 3.479$. Then $b^2 - ac$ rounded to the nearest floating-point number is .03480, while $b \otimes b = 12.08$, $a \otimes c = 12.05$, and so the computed value of $b^2 - ac$ is .03. This is an error of 480 ulps. Using Theorem 6 to write $b = 3.5 - .024$, $a = 3.5 - .037$, and $c = 3.5 - .021$, $b^2$ becomes $3.5^2 - 2 \times 3.5 \times .024 + .024^2$. Each summand is exact, so $b^2 = 12.25 - .168 + .000576$, where the sum is left unevaluated at this point. Similarly, $ac = 3.5^2 - (3.5 \times .037 + 3.5 \times .021) + .037 \times .021 = 12.25 - .2030 + .000777$. Finally, subtracting these two series term by term gives an estimate for $b^2 - ac$ of $0 \oplus .0350 \ominus .000201 = .03480$, which is identical to the exactly rounded result. To show that Theorem 6 really requires exact rounding, consider $p = 3$, $\beta = 2$, and $x = 7$. Then $m = 5$, $mx = 35$, and $m \otimes x = 32$. If subtraction is performed with a single guard digit, then $(m \otimes x) \ominus x = 28$. Therefore, $x_h = 4$ and $x_l = 3$, hence $x_l$ is not representable with $\lfloor p/2 \rfloor = 1$ bit.
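The splitting of Theorem 6 can be tried in JavaScript, whose numbers are IEEE 754 doubles with $p = 53$, so that $k = 27$ and $m = 2^{27} + 1$. The sketch below is an illustration that relies on exactly rounded operations and ignores overflow for very large inputs. The names split and twoProduct are assumptions, and the way the four partial products are combined into an error term is the commonly used arrangement, which is not spelled out in the text above.

```javascript
const SPLITTER = 2 ** 27 + 1; // m = beta^k + 1 with beta = 2, k = ceil(53 / 2) = 27

// Theorem 6: x = hi + lo, where both halves have short enough significands
// that products of halves are computed without rounding error.
function split(x) {
  const t = SPLITTER * x; // m (*) x
  const hi = t - (t - x); // (m (*) x) (-) (m (*) x (-) x)
  const lo = x - hi;      // x (-) hi
  return [hi, lo];
}

// Represent x * y as product + error, where product is the rounded result.
function twoProduct(x, y) {
  const product = x * y;
  const [xh, xl] = split(x);
  const [yh, yl] = split(y);
  // Each partial product below is exact; summing from large to small
  // recovers the rounding error of the floating-point product.
  const error = ((xh * yh - product) + xh * yl + xl * yh) + xl * yl;
  return [product, error];
}

const value = 1 + 2 ** -30;
console.log(split(Math.PI));
console.log(twoProduct(value, value)); // exact square is 1 + 2^-29 + 2^-60
```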

As a final example of exact rounding, consider dividing $m$ by 10. The result is a floating-point number that will in general not be equal to $m/10$. When $\beta = 2$, multiplying $m \oslash 10$ by 10 will restore $m$, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip it.

Theorem 7

When β = 2, if m and n are integers with |m| < 2^(p-1) and n has the special form n = 2^i + 2^j, then (m ⊘ n) ⊗ n = m, provided floating-point operations are exactly rounded.

Proof

Scaling by a power of two is harmless, since it changes only the exponent, not the significand. If q = m/n, then scale n so that 2^(p-1) ≤ n < 2^p and scale m so that 1/2 < q < 1. Thus, 2^(p-2) < m < 2^p. Since m has p significant bits, it has at most one bit to the right of the binary point. Changing the sign of m is harmless, so assume that q > 0.

If $\bar{q}$ = m ⊘ n, to prove the theorem requires showing that

(9) $|n\bar{q} - m| \le \frac{1}{4}$

That is because m has at most 1 bit right of the binary point, so $n\bar{q}$ will round to m. To deal with the halfway case when $|n\bar{q} - m| = \frac{1}{4}$, note that since the initial unscaled m had |m| < 2^(p-1), its low-order bit was 0, so the low-order bit of the scaled m is also 0. Thus, halfway cases will round to m.

Suppose that q = .q1q2..., and let $\hat{q}$ = .q1q2...qp1. To estimate $|n\bar{q} - m|$, first compute

$|\hat{q} - q| = \left| \frac{N}{2^{p+1}} - \frac{m}{n} \right|$,

where N is an odd integer. Since n = 2^i + 2^j and 2^(p-1) ≤ n < 2^p, it must be that n = 2^(p-1) + 2^k for some k ≤ p - 2, and thus

$|\hat{q} - q| = \left| \frac{nN - 2^{p+1}m}{n2^{p+1}} \right| = \left| \frac{(2^{p-1-k} + 1)N - 2^{p+1-k}m}{n2^{p+1-k}} \right|$.

The numerator is an integer, and since N is odd, it is in fact an odd integer. Thus, $|\hat{q} - q| \ge \frac{1}{n2^{p+1-k}}$.

Assume $q < \hat{q}$ (the case $q > \hat{q}$ is similar). Then $n\bar{q} < m$, and

$|m - n\bar{q}| = m - n\bar{q} = n(q - \bar{q}) = n(q - (\hat{q} - 2^{-p-1})) \le n\left(2^{-p-1} - \frac{1}{n2^{p+1-k}}\right) = (2^{p-1} + 2^k)2^{-p-1} - 2^{-p-1+k} = \frac{1}{4}$

This establishes (9) and proves the theorem. ∎

The theorem holds true for any base β, as long as 2^i + 2^j is replaced by β^i + β^j. As β gets larger, however, denominators of the form β^i + β^j are farther and farther apart.
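Theorem 7 can be spot-checked numerically. The following is only a sketch, assuming Python floats are IEEE 754 doubles (β = 2, p = 53) with round to nearest; the helper name `restores` and the sampling scheme are illustrative choices, and a passing check is evidence, not a proof.

```python
import random

P = 53  # significand precision of an IEEE 754 double


def restores(m, n):
    """Return True if (m / n) * n gives back m exactly in double precision."""
    return (float(m) / float(n)) * float(n) == float(m)


random.seed(0)  # fixed seed so the check is repeatable

# Divisors of the special form n = 2^i + 2^j with i > j (3, 5, 6, 9, 10, ...).
divisors = [2**i + 2**j for i in range(12) for j in range(i)]

# Integers m with |m| < 2^(p-1), as the theorem requires.
samples = [random.randrange(-(2 ** (P - 1)) + 1, 2 ** (P - 1)) for _ in range(2000)]

all_restored = all(restores(m, n) for n in divisors for m in samples)
print(all_restored)
```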

We are now in a position to answer the question, Does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense they have a small relative error. An earlier section discussed several algorithms that require guard digits to produce correct results in this sense. If the inputs to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting. The reason is that the benign cancellation x - y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value.

The IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision and p = 53 for double precision [IEEE 1987]. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either β = 2 or β = 10 and, unlike 754, does not specify how floating-point numbers are encoded into bits [Cody et al. 1984]. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision. The term IEEE Standard will be used when discussing properties common to both standards.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. For full details consult the standards themselves [IEEE 1987; Cody et al. 1984]

Formats and Operations

Base

It is clear why IEEE 854 allows β = 10. Base ten is how humans exchange and think about numbers. Using β = 10 is especially appropriate for calculators, where the result of each operation is displayed by the calculator in decimal.

There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. An earlier section mentioned one reason: the results of error analyses are much tighter when β is 2 because a rounding error of .5 ulp wobbles by a factor of β when computed as a relative error, and error analyses are almost always simpler when based on relative error. A related reason has to do with the effective precision for large bases. Consider β = 16, p = 1 compared to β = 2, p = 4. Both systems have 4 bits of significand. Consider the computation of 15/8. When β = 2, 15 is represented as 1.111 × 2^3, and 15/8 as 1.111 × 2^0. So 15/8 is exact. However, when β = 16, 15 is represented as F × 16^0, where F is the hexadecimal digit for 15. But 15/8 is represented as 1 × 16^0, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p - 3 rather than 4p binary bits. Since large values of β have these problems, why did IBM choose β = 16 for its system/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the system/370 has β = 16, p = 6. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. Thus the magnitude of representable numbers ranges from about 16^(-2^6) to about 16^(2^6) = 2^(2^8). To get a similar exponent range when β = 2 would require 9 bits of exponent, leaving only 22 bits for the significand. However, it was just pointed out that when β = 16, the effective precision can be as low as 4p - 3 = 21 bits. Even worse, when β = 2 it is possible to gain an extra bit of precision (as explained later in this section), so the β = 2 machine has 23 bits of precision to compare with a range of 21 to 24 bits for the β = 16 machine.

Another possible explanation for choosing β = 16 has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. In the β = 16, p = 1 system, all the numbers between 1 and 15 have the same exponent, and so no shifting is required when adding any of the (15 choose 2) = 105 possible pairs of distinct numbers from this set. However, in the β = 2, p = 4 system, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.
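The pair counts quoted above can be reproduced with a short enumeration. This is a sketch only; the helper name `binary_exponent` is illustrative.

```python
from itertools import combinations

values = range(1, 16)  # the integers 1..15


def binary_exponent(n):
    """Exponent e such that n = 1.xxx * 2**e when normalized with beta = 2."""
    return n.bit_length() - 1


pairs = list(combinations(values, 2))

# In the beta = 2, p = 4 system a shift is needed whenever the exponents differ.
needs_shift = [(a, b) for a, b in pairs if binary_exponent(a) != binary_exponent(b)]

print(len(pairs), len(needs_shift))
```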

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base. Another advantage of using β = 2 is that there is a way to gain an extra bit of significance. Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in an earlier section that this requires a special convention for 0. The method given there was that an exponent of emin - 1 and a significand of all zeros represents not 1.0 × 2^(emin - 1), but rather 0.

IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits

Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In IEEE 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. Extended precision is a format that offers at least a little extra precision and exponent range (TABLE D-1).

TABLE D-1 IEEE 754 Format Parameters

Parameter                 Single   Single-Extended   Double   Double-Extended
p                         24       ≥ 32              53       ≥ 64
emax                      +127     ≥ +1023           +1023    > 16383
emin                      -126     ≤ -1022           -1022    ≤ -16382
Exponent width in bits    8        ≥ 11              11       ≥ 15
Format width in bits      32       ≥ 43              64       ≥ 79

The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that "Implementations should support the extended format corresponding to the widest basic format supported, ..."

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally. By displaying only 10 of the 13 digits, the calculator appears to the user as a "black box" that computes exponentials, cosines, etc. to 10 digits of accuracy. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator

Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (as a later section shows). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. From TABLE D-1, p ≥ 32, and since 10^9 < 2^32 ≈ 4.3 × 10^9, N can be represented exactly in single-extended. Next, find the appropriate power 10^P necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute 10^|P|. If |P| ≤ 13, then this is also represented exactly, because 10^13 = 2^13 × 5^13 and 5^13 < 2^32. Finally, multiply (or divide if P < 0) N and 10^|P|. If this last operation is done exactly, then the closest binary number is recovered. Thus for |P| ≤ 13, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded). If |P| > 13, then single-extended is not enough for the above algorithm to always compute the exactly rounded binary equivalent, but Coonen [1984] shows that it is enough to guarantee that the conversion of binary to decimal and back will recover the original binary number.

If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17-digit decimal number and back would require the double-extended format
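The claim that 9 decimal digits recover a single precision number can be checked on a sample. The sketch below does not implement the single-extended algorithm described above. It relies on Python's own correctly rounded string conversions and uses `struct` to round to single precision, and the helper names are illustrative. It scans 1000 consecutive singles starting at 1000.0, where 8 digits are too coarse to tell neighbours apart.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_trips(x, digits):
    """Print x with `digits` significant decimal digits and read it back."""
    text = "%.*g" % (digits, x)
    return to_f32(float(text)) == x


start = struct.unpack("<I", struct.pack("<f", 1000.0))[0]
singles = [f32_from_bits(start + i) for i in range(1000)]  # consecutive singles

ok_with_9 = all(round_trips(x, 9) for x in singles)
failures_with_8 = sum(not round_trips(x, 8) for x in singles)
print(ok_with_9, failures_with_8 > 0)
```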

Exponent

Since the exponent can be positive or negative, some method must be chosen to represent its sign. Two common methods of representing signed numbers are sign/magnitude and two's complement. Sign/magnitude is the system used for the sign of the significand in the IEEE formats: one bit is used to hold the sign, the rest of the bits represent the magnitude of the number. The two's complement representation is often used in integer arithmetic. In this scheme, a number in the range [-2^(p-1), 2^(p-1) - 1] is represented by the smallest nonnegative number that is congruent to it modulo 2^p.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is k - 127. This is often called the unbiased exponent to distinguish it from the biased exponent k.

Referring to TABLE D-1, single precision has emax = 127 and emin = -126. The reason for having |emin| < emax is so that the reciprocal of the smallest number will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. The section Base explained that emin - 1 is used for representing 0, and the section Special Quantities will introduce a use for emax + 1. In IEEE single precision, this means that the unbiased exponents range between emin - 1 = -127 and emax + 1 = 128, whereas the biased exponents range between 0 and 255, which are exactly the nonnegative numbers that can be represented using 8 bits.
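The single precision layout (1 sign bit, 8 exponent bits, 23 stored significand bits), the hidden bit, and the bias of 127 can be seen by taking a value apart. This is a sketch using Python's `struct` module; the function names are illustrative, and `value_of_normal` only handles normalized numbers.

```python
import struct


def decode_single(x):
    """Return (sign, biased exponent, unbiased exponent, fraction bits) of x as a single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF      # 8 exponent bits read as an unsigned integer
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    return sign, biased, biased - 127, fraction


def value_of_normal(sign, biased, fraction):
    """Rebuild a normalized value, restoring the hidden leading 1 bit."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (biased - 127)


fields = decode_single(-0.15625)      # -0.15625 = -1.01 (binary) x 2^-3
print(fields, value_of_normal(fields[0], fields[1], fields[3]))
```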

Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). An earlier section pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg 1990]. Thus the standard can be implemented efficiently.

One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Once an algorithm is proven to be correct for IEEE arithmetic, it will work correctly on any machine supporting the IEEE standard

Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of the earlier sections, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.

There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [1986] have proposed adding inner product to the list of operations that are precisely specified. They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example sums are a special case of inner products, and the sum ((2 × 10^-30 + 10^30) - 10^30) - 10^-30 is exactly equal to 10^-30, but on a machine with IEEE arithmetic the computed result will be -10^-30. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch 1987].
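The sum in this example can be tried directly in double precision. As a sketch, Python's `math.fsum` stands in for a summation that rounds only once at the end; it is not the inner product hardware proposed by Kulisch and Miranker.

```python
import math

terms = [2e-30, 1e30, -1e30, -1e-30]

# Left-to-right evaluation, rounding after every addition as IEEE arithmetic does.
naive = ((terms[0] + terms[1]) + terms[2]) + terms[3]

# math.fsum tracks the low-order parts that each addition loses
# and rounds only once at the end.
accurate = math.fsum(terms)

print(naive, accurate)
```

The small term 2 × 10^-30 is absorbed when it is added to 10^30, so the naive sum keeps only the final -10^-30.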

All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen 1984]

The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.

Special Quantities

On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX™ reserves some bit patterns to represent special numbers called reserved operands. This idea goes back to the CDC 6600, which had bit patterns for the special quantities INDEFINITE and INFINITY.

The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Under IBM System/370 FORTRAN, the default action in response to computing the square root of a negative number like -4 results in the printing of an error message. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √|-4| = 2 is returned. In IEEE arithmetic, a NaN is returned in this situation.

The IEEE standard specifies the following special values (see TABLE D-2): ±0, denormalized numbers, ±∞ and NaNs (there is more than one NaN, as explained in the next section). These special values are all encoded with exponents of either emax + 1 or emin - 1 (it was already pointed out that 0 has an exponent of emin - 1).

TABLE D-2 IEEE 754 Special Values

Exponent           Fraction   Represents
e = emin - 1       f = 0      ±0
e = emin - 1       f ≠ 0      0.f × 2^emin
emin ≤ e ≤ emax    --         1.f × 2^e
e = emax + 1       f = 0      ±∞
e = emax + 1       f ≠ 0      NaN

NaNs

Traditionally, the computation of 0/0 or √-1 has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search. That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or √-1, and the computation would halt, unnecessarily aborting the zero finding process.

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √-1 produce NaN, rather than halting. A list of some of the situations that can cause a NaN is given in TABLE D-3. Then when zero(f) probes outside the domain of f, the code for f will return NaN, and the zero finder can continue. That is, zero(f) is not "punished" for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of f is return(-b + sqrt(d))/(2*a). If d < 0, then f should return a NaN. Since d < 0, sqrt(d) is a NaN, and -b + sqrt(d) will be a NaN, if the sum of a NaN and any other number is a NaN. Similarly if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.
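The propagation rule can be watched in a few lines. One caveat for this sketch: Python's `math.sqrt` raises an exception for negative arguments instead of returning a NaN, so the hypothetical helper `ieee_sqrt` below supplies the IEEE behaviour. The function `f` and its arguments are illustrative stand-ins for the final statement discussed above.

```python
import math

NAN = float("nan")


def ieee_sqrt(d):
    """sqrt that returns NaN for negative input instead of raising."""
    return math.sqrt(d) if d >= 0 else NAN


def f(a, b, d):
    """Hypothetical last statement of f: (-b + sqrt(d)) / (2*a)."""
    return (-b + ieee_sqrt(d)) / (2 * a)


inside = f(1.0, -3.0, 1.0)    # d >= 0: an ordinary number
outside = f(1.0, -3.0, -4.0)  # d < 0: the NaN flows through + and / and out of f

print(inside, math.isnan(outside))
```

A caller such as a zero finder can test the returned value with `math.isnan` and simply move on to its next probe.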

TABLE D-3 Operations That Produce a NaN

Operation   NaN Produced By
+           ∞ + (-∞)
×           0 × ∞
/           0/0, ∞/∞
REM         x REM 0, ∞ REM y
√           √x (when x < 0)

Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals. The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first

Infinity

Just as NaNs provide a way to continue a computation when expressions like 0/0 or √-1 are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number. As an example, consider computing √(x^2 + y^2), when β = 10, p = 3, and emax = 98. If x = 3 × 10^70 and y = 4 × 10^70, then x^2 will overflow, and be replaced by 9.99 × 10^98. Similarly y^2, and x^2 + y^2 will each overflow in turn, and be replaced by 9.99 × 10^98. So the final result will be √(9.99 × 10^98) = 3.16 × 10^49, which is drastically wrong: the correct answer is 5 × 10^70. In IEEE arithmetic, the result of x^2 is ∞, as is y^2, x^2 + y^2 and √(x^2 + y^2). So the final result is ∞, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.
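This example can be replayed as a sketch with Python's `decimal` module standing in for the β = 10, p = 3, emax = 98 machine. Traps are switched off so that overflow produces Infinity instead of raising an exception; the variable names are illustrative.

```python
from decimal import Context, Decimal

# Toy format from the text: beta = 10, p = 3, emax = 98. With no traps
# enabled, an overflowing operation returns Infinity instead of raising.
ctx = Context(prec=3, Emax=98, Emin=-98, traps=[])

x, y = Decimal("3E70"), Decimal("4E70")

# IEEE-style: x*x and y*y overflow to Infinity and the infinity propagates.
sum_sq = ctx.add(ctx.multiply(x, x), ctx.multiply(y, y))
ieee_result = ctx.sqrt(sum_sq)

# Saturating alternative: every overflow is replaced by the largest number.
largest = Decimal("9.99E98")
saturated_result = ctx.sqrt(largest)

print(ieee_result, saturated_result)
```

The saturated result is an ordinary-looking number about 21 orders of magnitude away from 5 × 10^70, while the infinity at least signals that something went wrong.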

The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: 1/0 = ∞, -1/0 = -∞. The reason for the distinction is this: if f(x) → 0 and g(x) → 0 as x approaches some limit, then f(x)/g(x) could have any value. For example, when f(x) = sin x and g(x) = x, then f(x)/g(x) → 1 as x → 0. But when f(x) = 1 - cos x, f(x)/g(x) → 0. When thinking of 0/0 as the limiting situation of a quotient of two very small numbers, 0/0 could represent anything. Thus in the IEEE standard, 0/0 results in a NaN. But when c > 0, f(x) → c, and g(x) → 0, then f(x)/g(x) → ±∞, for any analytic functions f and g. If g(x) < 0 for small x, then f(x)/g(x) → -∞, otherwise the limit is +∞. So the IEEE standard defines c/0 = ±∞, as long as c ≠ 0. The sign of ∞ depends on the signs of c and 0 in the usual way, so that -10/0 = -∞, and -10/-0 = +∞. You can distinguish between getting ∞ because of overflow and getting ∞ because of division by zero by checking the status flags (which will be discussed in detail in a later section). The overflow flag will be set in the first case, the division by zero flag in the second.

The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number x and take the limit as x → ∞. Thus 3/∞ = 0, because lim(x → ∞) 3/x = 0. Similarly, 4 - ∞ = -∞, and √∞ = ∞. When the limit doesn't exist, the result is a NaN, so ∞/∞ will be a NaN (TABLE D-3 has additional examples). This agrees with the reasoning used to conclude that 0/0 should be a NaN.

When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN. In the case of ±∞ however, the value of the expression might be an ordinary floating-point number because of rules like 1/∞ = 0. Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function x/(x^2 + 1). This is a bad formula, because not only will it overflow when x is larger than √β × β^(emax/2), but infinity arithmetic will give the wrong answer because it will yield 0, rather than a number near 1/x. However, x/(x^2 + 1) can be rewritten as 1/(x + x^-1). This improved expression will not overflow prematurely and because of infinity arithmetic will have the correct value when x = 0: 1/(0 + 0^-1) = 1/(0 + ∞) = 1/∞ = 0. Without infinity arithmetic, the expression 1/(x + x^-1) requires a test for x = 0, which not only adds extra instructions, but may also disrupt a pipeline. This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity (as x/(x^2 + 1) did).
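The two formulas can be compared in double precision. Python raises `ZeroDivisionError` for float division by zero instead of returning an infinity, so this sketch routes divisions through a hypothetical helper `ieee_div` that follows the IEEE rules described above; the names `naive` and `rewritten` are illustrative.

```python
import math


def ieee_div(a, b):
    """Division following IEEE 754 rules for a zero divisor (Python raises instead)."""
    if b != 0:
        return a / b
    if a == 0 or math.isnan(a):
        return float("nan")          # 0/0 and NaN/0 give NaN
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def naive(x):
    return ieee_div(x, x * x + 1)    # x*x overflows to infinity for large x


def rewritten(x):
    return ieee_div(1.0, x + ieee_div(1.0, x))


big = 1e200
print(naive(big), rewritten(big), rewritten(0.0))
```

For the large argument the naive form collapses to 0 while the rewritten form stays near 1/x, and the rewritten form also gives 0 at x = 0 without an explicit test.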

Signed Zero

Zero is represented by the exponent emin - 1 and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and -0. If a distinction were made when comparing +0 and -0, simple tests like if (x = 0) would have very unpredictable behavior, depending on the sign of x. Thus the IEEE standard defines comparison so that +0 = -0, rather than -0 < +0. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3·(+0) = +0, and +0/-3 = -0. If zero did not have a sign, then the relation 1/(1/x) = x would fail to hold when x = ±∞. The reason is that 1/-∞ and 1/+∞ both result in 0, and 1/0 results in +∞, the sign information having been lost. One way to restore the identity 1/(1/x) = x is to only have one kind of infinity, however that would result in the disastrous consequence of losing the sign of an overflowed quantity.
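These rules can be observed with Python floats, assumed here to be IEEE 754 doubles. Because +0 and -0 compare equal, the sketch reads the sign bit with `math.copysign`; the helper name `sign_of` is illustrative.

```python
import math

pos_zero, neg_zero = 0.0, -0.0


def sign_of(x):
    """Return +1.0 or -1.0 from the sign bit, which also works for zeros."""
    return math.copysign(1.0, x)


equal = pos_zero == neg_zero                 # comparison ignores the sign of zero
product_sign = sign_of(3.0 * pos_zero)       # 3 * (+0) = +0
quotient_sign = sign_of(pos_zero / -3.0)     # +0 / -3 = -0
reciprocal_sign = sign_of(1.0 / -math.inf)   # 1 / -inf = -0, the sign is kept

print(equal, product_sign, quotient_sign, reciprocal_sign)
```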

Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number.

Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation √(1/z) = 1/√z. This is certainly true when z ≥ 0. If z = -1, the obvious computation gives √(1/(-1)) = √(-1) = i and 1/√(-1) = 1/i = -i. Thus, √(1/z) ≠ 1/√z. The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. This leaves the problem of what to do for the negative real numbers, which are of the form -x + i0, where x > 0. Signed zero provides a perfect way to resolve this problem. Numbers of the form x + i(+0) have one sign (i√|x|) and numbers of the form x + i(-0) on the other side of the branch cut have the other sign (-i√|x|). In fact, the natural formulas for computing √ will give these results.

Back to √(1/z) = 1/√z. If z = -1 = -1 + i0, then

1/z = 1/(-1 + i0) = [(-1 - i0)]/[(-1 + i0)(-1 - i0)] = (-1 - i0)/((-1)^2 + 0^2) = -1 + i(-0),

and so √(1/z) = √(-1 + i(-0)) = -i, while 1/√z = 1/i = -i. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation x = y ⇔ 1/x = 1/y, which is false when x = +0 and y = -0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.
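The identity can be checked for z = -1 with Python's `cmath`, whose square root uses the sign of a zero imaginary part to choose the side of the branch cut. In this sketch the reciprocal is written out as conj(z)/|z|^2, mirroring the computation above; the helper name `reciprocal` is illustrative.

```python
import cmath


def reciprocal(z):
    """1/z computed as conj(z) / |z|^2, the formula used in the text."""
    d = z.real * z.real + z.imag * z.imag
    return complex(z.real / d, -z.imag / d)


z = complex(-1.0, 0.0)                 # -1 + i(+0)
lhs = cmath.sqrt(reciprocal(z))        # sqrt(1/z), where 1/z = -1 + i(-0)
rhs = reciprocal(cmath.sqrt(z))        # 1/sqrt(z)

print(reciprocal(z), lhs, rhs)
```

Both sides come out as -i because the reciprocal carries a negative zero in its imaginary part.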

Denormalized Numbers

Consider normalized floating-point numbers with β = 10, p = 3, and emin = -98. The numbers x = 6.87 × 10^-97 and y = 6.81 × 10^-97 appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number 1.00 × 10^-98. They have a strange property, however: x ⊖ y = 0 even though x ≠ y! The reason is that x - y = .06 × 10^-97 = 6.0 × 10^-99 is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property

(10) x = y ⇔ x - y = 0 ?

It's very easy to imagine writing the code fragment, if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing an earlier formula, it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x - y. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.

The IEEE standard uses denormalized numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard. The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is $e_{min}$, the significand does not have to be normalized, so that when β = 10, p = 3 and $e_{min} = -98$, $1.00 \times 10^{-98}$ is no longer the smallest floating-point number, because $0.98 \times 10^{-98}$ is also a floating-point number.

There is a small snag when β = 2 and a hidden bit is being used, since a number with an exponent of $e_{min}$ will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in the table of IEEE 754 special values. The exponent $e_{min}$ is used to represent denormals. More formally, if the bits in the significand field are $b_1, b_2, \ldots, b_{p-1}$, and the value of the exponent is e, then when $e > e_{min} - 1$, the number being represented is $1.b_1b_2 \ldots b_{p-1} \times 2^{e}$, whereas when $e = e_{min} - 1$, the number being represented is $0.b_1b_2 \ldots b_{p-1} \times 2^{e+1}$. The +1 in the exponent is needed because denormals have an exponent of $e_{min}$, not $e_{min} - 1$.
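The two cases in the preceding paragraph can be checked against a real format. The sketch below assumes IEEE 754 double precision (β = 2, p = 53, $e_{min} = -1022$); the helper name decode_double is hypothetical and not part of any library. It rebuilds a finite double from its sign, exponent and fraction fields using the normalized formula when $e > e_{min} - 1$ and the denormal formula when $e = e_{min} - 1$.

```python
import math
import struct

E_MIN = -1022  # smallest normalized exponent of an IEEE 754 double (assumed format)

def decode_double(x):
    """Rebuild a finite double from its sign, exponent, and fraction fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = -1.0 if bits >> 63 else 1.0
    biased = (bits >> 52) & 0x7FF          # stored exponent field
    fraction = bits & ((1 << 52) - 1)      # b1 b2 ... b52
    if biased == 0x7FF:
        raise ValueError("infinity and NaN are not handled in this sketch")
    e = biased - 1023                      # exponent value; e_min - 1 marks zero/denormal
    if e > E_MIN - 1:
        # normalized: 1.b1b2...b52 x 2^e (hidden leading bit is 1)
        return sign * math.ldexp(1.0 + fraction / 2.0**52, e)
    # e == e_min - 1: 0.b1b2...b52 x 2^(e + 1), i.e. exponent e_min
    return sign * math.ldexp(fraction / 2.0**52, e + 1)
```

Every arithmetic step here is a scaling by a power of two or an addition that fits in 53 bits, so the round trip is expected to be exact for all finite doubles, including the smallest denormal.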

Recall the example of β = 10, p = 3, $e_{min} = -98$, $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ presented at the beginning of this section. With denormals, x - y does not flush to zero but is instead represented by the denormalized number $.6 \times 10^{-98}$. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
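The same experiment can be repeated in binary. The sketch below assumes Python floats are IEEE 754 doubles with gradual underflow (true on common platforms) and scales the example to the bottom of the double range. The function flush_to_zero is a hypothetical emulation of a flush-to-zero machine, not a real hardware mode switch.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min   # 2**-1022 for an IEEE 754 double

def flush_to_zero(value):
    """Emulate flush-to-zero: any result below the normalized range becomes 0."""
    return 0.0 if abs(value) < SMALLEST_NORMAL else value

x = 6.87e-308
y = 6.81e-308
gradual = x - y                  # a denormal under gradual underflow
flushed = flush_to_zero(x - y)   # what a flush-to-zero machine would deliver

# With gradual underflow the guard "x != y" really does protect the division;
# with the emulated flush to zero it does not.
guard_is_safe_gradual = (x != y) and (gradual != 0.0)
guard_is_safe_flushed = (x != y) and (flushed != 0.0)
```

Note that 1/gradual can still overflow to infinity, since the reciprocal of a small denormal is out of range; property (10) only rules out the spurious division by zero.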

FIGURE D-2 Flush To Zero Compared With Gradual Underflow

FIGURE D-2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The "gulf" is filled in, and when the result of a calculation is less than $1.0 \times \beta^{e_{min}}$, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from $\beta^{-p+1}\beta^{e_{min}}$ to $\beta^{e_{min}}$, which is a factor of $\beta^{p-1}$, rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

Without gradual underflow, the simple expression x - y can have a very large relative error for normalized inputs, as was seen above for $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula

$$\frac{a + ib}{c + id} = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}$$

suffers from the problem that if either component of the denominator c + id is larger than $\sqrt{\beta}\,\beta^{e_{max}/2}$, the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith's formula

(11) $$\frac{a + ib}{c + id} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + i\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |d| < |c| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + i\,\dfrac{-a + b(c/d)}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases}$$

Applying Smith's formula to $(2 \cdot 10^{-98} + i\,10^{-98})/(4 \cdot 10^{-98} + i(2 \cdot 10^{-98}))$ gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to $1.0 \times \beta^{e_{min}}$.
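As an illustration, here is a minimal sketch of both formulas on Python floats (assumed to be IEEE 754 doubles). The function names are hypothetical, and a zero denominator is not handled. With components near $10^{200}$ the obvious formula overflows in its intermediate products, while Smith's formula stays in range.

```python
def naive_divide(a, b, c, d):
    """(a + ib)/(c + id) by the obvious formula; c*c + d*d may overflow."""
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom

def smith_divide(a, b, c, d):
    """(a + ib)/(c + id) by Smith's formula (11); returns (real, imaginary)."""
    if abs(d) < abs(c):
        r = d / c
        denom = c + d * r
        return (a + b * r) / denom, (b - a * r) / denom
    r = c / d
    denom = d + c * r
    return (b + a * r) / denom, (-a + b * r) / denom

big = 1e200
naive_result = naive_divide(big, big, big, big)   # intermediate products overflow
smith_result = smith_divide(big, big, big, big)   # quotient is exactly 1 + 0i
small_case = smith_divide(2.0, 1.0, 4.0, 2.0)     # the example above, unscaled
```

Smith's formula trades two extra divisions for intermediate quantities that stay close to the magnitude of the inputs.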

Exceptions, Flags and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for 0/0 and $\sqrt{-1}$, and ∞ for 1/0 and overflow. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are "sticky" in that once set, they remain set until explicitly cleared. Testing the flags is the only way to distinguish 1/0, which is a genuine infinity, from an overflow.

Sometimes continuing execution in the face of exception conditions is not appropriate. An earlier section gave the example of $x/(x^2 + 1)$. When $x > \sqrt{\beta}\,\beta^{e_{max}/2}$, the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as $1/(x + x^{-1})$, rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.
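The effect is easy to reproduce. This sketch assumes Python floats are IEEE 754 doubles, where a multiplication that overflows quietly returns infinity; the function names are illustrative only.

```python
import math

def f_naive(x):
    """x / (x^2 + 1): the square overflows to infinity for large x."""
    return x / (x * x + 1.0)

def f_rewritten(x):
    """1 / (x + 1/x): same function for x != 0, with no overflowing intermediate."""
    return 1.0 / (x + 1.0 / x)

x = 1e200
wrong = f_naive(x)       # denominator is infinite, so the quotient is 0
right = f_rewritten(x)   # close to 1e-200
```

One caveat: in Python f_rewritten(0.0) raises ZeroDivisionError, whereas under the IEEE defaults 1/0 would be infinity and the final result 0.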

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in the table of operations that produce a NaN, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions in that table.

TABLE D-4 Exceptions in IEEE 754*

| Exception | Result when traps disabled | Argument to trap handler |
|---|---|---|
| overflow | ±∞ or ±$x_{max}$ | round($x2^{-\alpha}$) |
| underflow | 0, $\pm 2^{e_{min}}$ or denormal | round($x2^{\alpha}$) |
| divide by zero | ±∞ | operands |
| invalid | NaN | operands |
| inexact | round(x) | round(x) |

*x is the exact result of the operation, α = 192 for single precision, 1536 for double, and $x_{max} = 1.11 \ldots 11 \times 2^{e_{max}}$.

The inexact exception is raised when the result of a floating-point operation is not exact. In the β = 10, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 · 4.3 = 15.05), and raises an inexact exception. A later section discusses an algorithm that uses the inexact exception. A summary of the behavior of all five exceptions is given in TABLE D-4.
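Python's decimal module follows a closely related decimal arithmetic specification and exposes sticky status flags, so the β = 10, p = 3 example can be tried directly. This is a sketch under that assumption; decimal contexts are a software model and not the hardware flags of a binary IEEE 754 unit.

```python
import decimal
from decimal import Decimal

ctx = decimal.Context(prec=3)   # beta = 10, p = 3; default rounding is to nearest even
ctx.clear_flags()

exact_product = ctx.multiply(Decimal("3.5"), Decimal("4.2"))     # 14.7, nothing lost
flag_after_exact = bool(ctx.flags[decimal.Inexact])

rounded_product = ctx.multiply(Decimal("3.5"), Decimal("4.3"))   # 15.05 rounds to 15.0
flag_after_inexact = bool(ctx.flags[decimal.Inexact])

# The flag is sticky: a later exact operation does not clear it.
ctx.multiply(Decimal("3.5"), Decimal("4.2"))
flag_still_set = bool(ctx.flags[decimal.Inexact])

ctx.clear_flags()               # only an explicit clear resets it
flag_after_clear = bool(ctx.flags[decimal.Inexact])
```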

There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled.

Trap Handlers

One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like do S until (x >= 100). Since comparing a NaN to a number with <, ≤, >, ≥, or = (but not ≠) always returns false, this code will go into an infinite loop if x ever becomes a NaN.

There is a more interesting use for trap handlers that comes up when computing products such as $\prod_{i=1}^{n} x_i$ that could potentially overflow. One solution is to use logarithms, and compute $\exp\left(\sum \log x_i\right)$ instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression $\prod x_i$, even if there is no overflow. There is another solution using trap handlers called over/underflow counting that avoids both of these problems [Sterbenz 1974].

The idea is as follows. There is a global counter initialized to zero. Whenever the partial product $p_k = \prod_{i=1}^{k} x_i$ overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. In IEEE 754 single precision, $e_{max} = 127$, so if $p_k = 1.45 \times 2^{130}$, it will overflow and cause the trap handler to be called, which will wrap the exponent back into range, changing $p_k$ to $1.45 \times 2^{-62}$ (see below). Similarly, if $p_k$ underflows, the counter would be decremented, and the negative exponent would get wrapped around into a positive one. When all the multiplications are done, if the counter is zero then the final product is $p_n$. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost. Even if there are over/underflows, the calculation is more accurate than if it had been computed with logarithms, because each $p_k$ was computed from $p_{k-1}$ using a full precision multiply. Barnett [1987] discusses a formula where the full accuracy of over/underflow counting turned up an error in earlier tables of that formula.

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by $2^{\alpha}$, and then rounded to the relevant precision. For underflow, the result is multiplied by $2^{\alpha}$. The exponent α is 192 for single precision and 1536 for double precision. This is why $1.45 \times 2^{130}$ was transformed into $1.45 \times 2^{-62}$ in the example above.
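Python cannot install floating-point trap handlers, so the following is only an emulation of over/underflow counting, under stated assumptions: the factors are treated as single-precision values, a Python float (a double) is used as a wider carrier so the out-of-range product can be inspected, and rounding to single precision is not modeled. The name counted_product is hypothetical.

```python
import math

ALPHA = 192                 # exponent wrap for single precision
E_MAX, E_MIN = 127, -126    # single-precision exponent range

def counted_product(factors):
    """Return (p, counter); the true product is p * 2**(ALPHA * counter)."""
    counter = 0
    p = 1.0
    for x in factors:
        p *= x
        if p == 0.0:
            continue
        exponent = math.frexp(p)[1] - 1      # exponent of p written as 1.f x 2^exponent
        if exponent > E_MAX:                 # what the overflow trap handler would do
            p = math.ldexp(p, -ALPHA)
            counter += 1
        elif exponent < E_MIN:               # what the underflow trap handler would do
            p = math.ldexp(p, ALPHA)
            counter -= 1
    return p, counter

wrapped = counted_product([math.ldexp(1.45, 130)])
balanced = counted_product([2.0**100, 2.0**100, 2.0**-100, 2.0**-100])
```

In the second call one overflow and one underflow cancel, so the counter returns to zero and the final product is in range even though an intermediate product was not.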

Rounding Modes

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward -∞. When used with the convert to integer operation, round toward -∞ causes the convert to become the floor function, while round toward +∞ is ceiling. The rounding mode affects overflow, because when round toward 0 or round toward -∞ is in effect, an overflow of positive magnitude causes the default result to be the largest representable number, not +∞. Similarly, overflows of negative magnitude will produce the largest negative number when round toward +∞ or round toward 0 is in effect.
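Both effects can be observed with Python's decimal module, which offers directed rounding modes in software. The sketch assumes a toy decimal format (p = 3, maximum exponent 5) chosen only to make overflow easy to trigger; ROUND_DOWN plays the role of round toward 0, ROUND_FLOOR of round toward -∞, and ROUND_CEILING of round toward +∞.

```python
import decimal
from decimal import Decimal

value = Decimal("-2.5")
as_floor = value.to_integral_value(rounding=decimal.ROUND_FLOOR)      # toward -inf
as_ceiling = value.to_integral_value(rounding=decimal.ROUND_CEILING)  # toward +inf

def overflowing_product(rounding):
    """Multiply 9E+5 by 9 in a tiny format whose largest number is 9.99E+5."""
    ctx = decimal.Context(prec=3, Emax=5, Emin=-5, rounding=rounding, traps=[])
    return ctx.multiply(Decimal("9E+5"), Decimal("9"))

to_nearest = overflowing_product(decimal.ROUND_HALF_EVEN)   # default result: +Infinity
toward_zero = overflowing_product(decimal.ROUND_DOWN)       # largest finite number
toward_minus_inf = overflowing_product(decimal.ROUND_FLOOR) # largest finite number
toward_plus_inf = overflowing_product(decimal.ROUND_CEILING)  # +Infinity
```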

One application of rounding modes occurs in interval arithmetic (another is mentioned in a later section). When using interval arithmetic, the sum of two numbers x and y is an interval $[\underline{z}, \overline{z}]$, where $\underline{z}$ is x ⊕ y rounded toward -∞, and $\overline{z}$ is x ⊕ y rounded toward +∞. The exact result of the addition is contained within the interval $[\underline{z}, \overline{z}]$. Without rounding modes, interval arithmetic is usually implemented by computing $\underline{z} = (x \oplus y)(1 - \epsilon)$ and $\overline{z} = (x \oplus y)(1 + \epsilon)$, where ε is machine epsilon. This results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. If two intervals $[\underline{x}, \overline{x}]$ and $[\underline{y}, \overline{y}]$ are added, the result is $[\underline{z}, \overline{z}]$, where $\underline{z}$ is $\underline{x} \oplus \underline{y}$ with the rounding mode set to round toward -∞, and $\overline{z}$ is $\overline{x} \oplus \overline{y}$ with the rounding mode set to round toward +∞.

When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.
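Interval addition with directed rounding can be sketched with two decimal contexts, one per rounding direction. This assumes a β = 10, p = 3 format purely for readability; interval_add is a hypothetical helper, and intervals are plain (low, high) tuples.

```python
import decimal
from decimal import Decimal

P = 3
ROUND_TO_MINUS_INF = decimal.Context(prec=P, rounding=decimal.ROUND_FLOOR)
ROUND_TO_PLUS_INF = decimal.Context(prec=P, rounding=decimal.ROUND_CEILING)

def interval_add(x, y):
    """Add intervals x = (x_lo, x_hi) and y = (y_lo, y_hi) with outward rounding."""
    (x_lo, x_hi), (y_lo, y_hi) = x, y
    z_lo = ROUND_TO_MINUS_INF.add(x_lo, y_lo)   # lower bound rounded toward -inf
    z_hi = ROUND_TO_PLUS_INF.add(x_hi, y_hi)    # upper bound rounded toward +inf
    return z_lo, z_hi

x = (Decimal("1.23"), Decimal("1.24"))
y = (Decimal("0.00456"), Decimal("0.00457"))
z_lo, z_hi = interval_add(x, y)   # encloses every exact sum of a point in x and a point in y
```

The exact bounds are 1.23456 and 1.24457; rounding them outward to three digits gives an interval that is slightly wider but still guaranteed to contain the exact result.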

Flags

The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. There are four rounding modes: round toward nearest, round toward +∞, round toward 0, and round toward -∞. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use. A more sophisticated example is discussed in a later section.

Consider writing a subroutine to compute $x^n$, where n is an integer. When n > 0, a simple routine like the following will do.

PositivePower(x,n) {
 // strip factors of two from n by repeated squaring
 while (n is even) {
     x = x*x
     n = n/2
 }
 u = x
 while (true) {
     n = n/2          // integer division
     if (n==0) return u
     x = x*x
     if (n is odd) u = u*x
 }
}

If n < 0, then a more accurate way to compute $x^n$ is not to call PositivePower(1/x, -n) but rather 1/PositivePower(x, -n), because the first expression multiplies n quantities each of which have a rounding error from the division (i.e., 1/x). In the second expression these are exact (i.e., x), and the final division commits just one additional rounding error. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, -n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if $x^{-n}$ underflows, then $x^n$ will either overflow or be in range. But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits. It then computes 1/PositivePower(x, -n). If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits. If one of the status bits is set, it restores the flags and redoes the calculation using PositivePower(1/x, -n), which causes the correct exceptions to occur.
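The strategy can be sketched with Python's decimal module, which (unlike Python floats) exposes status flags and trap enables. The assumptions are loud ones: decimal arithmetic stands in for binary IEEE arithmetic, a scratch copy of the context stands in for "save and restore the flags", and the names positive_power and power are hypothetical.

```python
import decimal
from decimal import Decimal

def positive_power(x, n, ctx):
    """x**n for integer n > 0 by repeated squaring, with all arithmetic done in ctx."""
    while n % 2 == 0:
        x = ctx.multiply(x, x)
        n //= 2
    u = x
    while True:
        n //= 2
        if n == 0:
            return u
        x = ctx.multiply(x, x)
        if n % 2 == 1:
            u = ctx.multiply(u, x)

def power(x, n, ctx):
    """x**n for any integer n, raising only the exceptions that x**n itself deserves."""
    if n > 0:
        return positive_power(x, n, ctx)
    if n == 0:
        return Decimal(1)
    # Work in a scratch context: traps off, flags clear, caller's flags untouched.
    scratch = ctx.copy()
    scratch.clear_traps()
    scratch.clear_flags()
    result = scratch.divide(Decimal(1), positive_power(x, -n, scratch))
    if scratch.flags[decimal.Overflow] or scratch.flags[decimal.Underflow]:
        # Redo with 1/x in the caller's context so the correct exception occurs.
        return positive_power(ctx.divide(Decimal(1), x), -n, ctx)
    return result
```

A simplification worth noting: when the fast path succeeds, this sketch does not propagate the scratch context's inexact flag back to the caller.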

Another example of the use of flags occurs when computing arccos via the formula

arccos x = 2 arctan $\sqrt{\dfrac{1 - x}{1 + x}}$.

If arctan(∞) evaluates to π/2, then arccos(-1) will correctly evaluate to 2·arctan(∞) = π, because of infinity arithmetic. However, there is a small snag, because the computation of (1 - x)/(1 + x) will cause the divide by zero exception flag to be set, even though arccos(-1) is not exceptional. The solution to this problem is straightforward. Simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation.
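The infinity arithmetic in this formula can be sketched in Python with one caveat: Python's / raises ZeroDivisionError instead of returning the IEEE default, and Python floats expose no status flags, so the flag save and restore cannot be shown. The helper ieee_div below is a hypothetical stand-in for the IEEE default result of a division.

```python
import math

def ieee_div(a, b):
    """Division with the IEEE default results: x/0 gives a signed infinity, 0/0 gives NaN."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)

def arccos(x):
    """arccos x = 2 arctan sqrt((1 - x)/(1 + x)), relying on arctan(inf) = pi/2."""
    return 2.0 * math.atan(math.sqrt(ieee_div(1.0 - x, 1.0 + x)))

at_minus_one = arccos(-1.0)   # (1 - x)/(1 + x) is 2/0 = +inf on the way
at_half = arccos(0.5)
at_one = arccos(1.0)
```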

Systems Aspects

The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider the following BASIC program.

q = 3.0/7.0
if q = 3.0/7.0 then print "Equal":
    else print "Not Equal"

When compiled and run using Borland's Turbo Basic on an IBM PC, the program prints Not Equal! This example will be analyzed in the next section.

Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E. This is hardly a cure-all because it raises as many questions as it answers. What should the value of E be? If x < 0 and y > 0 are within E, should they really be considered to be equal, even though they have different signs? Furthermore, the relation defined by this rule, a ~ b ⇔ |a - b| < E, is not an equivalence relation because a ~ b and b ~ c does not imply that a ~ c.
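A three-number counterexample makes the failure of transitivity concrete. The tolerance E and the values below are arbitrary choices for illustration.

```python
E = 1e-9

def approx_equal(a, b, tolerance=E):
    """The relation a ~ b defined by |a - b| < E."""
    return abs(a - b) < tolerance

a, b, c = 0.0, 0.6e-9, 1.2e-9
a_like_b = approx_equal(a, b)   # neighbors are within E ...
b_like_c = approx_equal(b, c)
a_like_c = approx_equal(a, c)   # ... but the two ends are not
```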

Instruction Sets

It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. One example occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$. As discussed elsewhere in this paper, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of $b^2 - 4ac$ in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved.

The computation of $b^2 - 4ac$ in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result. In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.
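The effect on the discriminant can be sketched by emulating single precision in Python. The assumptions: Python floats are IEEE doubles, and struct's "f" format rounds a double to the nearest single (ties to even). The helper names are hypothetical, and the inputs are chosen so that $b^2$ and $4ac$ differ only in a bit that single precision cannot hold.

```python
import struct

def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def discriminant_single(a, b, c):
    """b^2 - 4ac with every intermediate rounded to single precision."""
    b_squared = to_single(b * b)
    four_ac = to_single(to_single(4.0 * a) * c)
    return to_single(b_squared - four_ac)

def discriminant_double(a, b, c):
    """b^2 - 4ac on single-precision inputs, with products kept in double."""
    return b * b - 4.0 * a * c

# All three coefficients are exactly representable in single precision.
b = 1.0 + 2.0**-12
a = (1.0 + 2.0**-11) / 4.0
c = 1.0

in_single = discriminant_single(a, b, c)   # the 2**-24 term of b*b is rounded away
in_double = discriminant_double(a, b, c)   # exact, since each product fits in 53 bits
```

In single precision the two roots appear to coincide, while the double precision discriminant shows they differ by a relative amount near $2^{-12}$, about half of the single precision digits.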

If an instruction that combines two single precision operands to produce a double precision product was only useful for the quadratic formula, it wouldn't be worth adding to an instruction set. However, this instruction has many other uses. Consider the problem of solving a system of linear equations,

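$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\;\;\vdots \\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n \end{aligned}$$

which can be written in matrix form as Ax = b, where

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$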

Suppose that a solution $x^{(1)}$ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute

(12) $\xi = Ax^{(1)} - b$

and then solve the system

(13) $Ay = \xi$

Note that if $x^{(1)}$ is an exact solution, then ξ is the zero vector, as is y. In general, the computation of ξ and y will incur rounding error, so $Ay \approx \xi \approx Ax^{(1)} - b = A(x^{(1)} - x)$, where x is the (unknown) true solution. Then $y \approx x^{(1)} - x$, so an improved estimate for the solution is

(14) $x^{(2)} = x^{(1)} - y$

The three steps (12), (13), and (14) can be repeated, replacing $x^{(1)}$ with $x^{(2)}$, and $x^{(2)}$ with $x^{(3)}$. This argument that $x^{(i+1)}$ is more accurate than $x^{(i)}$ is only informal. For more information, see [Golub and Van Loan 1989].

When performing iterative improvement, ξ is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus iterative improvement is not very useful unless $\xi = Ax^{(1)} - b$ is computed in double precision. Once again, this is a case of computing the product of two single precision numbers (A and $x^{(1)}$), where the full double precision result is needed.
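Steps (12) to (14) can be sketched on a 2 × 2 system. Everything here is an illustrative assumption: single precision is emulated by rounding Python doubles through struct, the "some method" solver is Cramer's rule rather than Gaussian elimination, and an exact rational solve is included only to measure the errors.

```python
import struct
from fractions import Fraction

def to_single(v):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def solve_single(A, b):
    """Cramer's rule for a 2x2 system, every operation rounded to single."""
    (a11, a12), (a21, a22) = A
    det = to_single(to_single(a11 * a22) - to_single(a12 * a21))
    n1 = to_single(to_single(b[0] * a22) - to_single(a12 * b[1]))
    n2 = to_single(to_single(a11 * b[1]) - to_single(b[0] * a21))
    return [to_single(n1 / det), to_single(n2 / det)]

def residual_double(A, x, b):
    """Step (12): xi = A x - b, with products and sums carried in double."""
    return [A[i][0] * x[0] + A[i][1] * x[1] - b[i] for i in range(2)]

def improve(A, b, x):
    """One round of iterative improvement."""
    xi = residual_double(A, x, b)                       # (12)
    y = solve_single(A, [to_single(v) for v in xi])     # (13)
    return [to_single(x[i] - y[i]) for i in range(2)]   # (14)

def solve_exact(A, b):
    """Reference solution in exact rational arithmetic (for measuring error only)."""
    (a11, a12), (a21, a22) = [[Fraction(v) for v in row] for row in A]
    b1, b2 = Fraction(b[0]), Fraction(b[1])
    det = a11 * a22 - a12 * a21
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det]

A = [[to_single(0.1), to_single(0.2)], [to_single(0.3), to_single(0.7)]]
b = [to_single(0.3), 1.0]
x1 = solve_single(A, b)
x2 = improve(A, b, x1)
exact = solve_exact(A, b)
errors_before = [abs(Fraction(v) - t) for v, t in zip(x1, exact)]
errors_after = [abs(Fraction(v) - t) for v, t in zip(x2, exact)]
```

Comparing errors_before with errors_after shows what the double precision residual buys; after one round the remaining error should be on the order of a single rounding error in single precision.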

To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section.

Languages and Compilers

The interaction of compilers and floating-point is discussed in Farnum [1988], and much of the discussion in this section is taken from that paper.

Ambiguity

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error. If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.

Remarkably enough, some languages don't clearly specify that if x is a floating-point variable (with say a value of 3.0/10.0), then every occurrence of (say) 10.0*x must have the same value. For example Ada, which is based on Brown's model, seems to imply that floating-point arithmetic only has to satisfy Brown's axioms, and thus expressions can have one of many possible values. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined. In the IEEE model, we can prove that (3.0/10.0)*10.0 evaluates to 3 (Theorem 7). In Brown's model, we cannot.

Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point.

Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when $x = 10^{30}$, $y = -10^{30}$ and z = 1 (it is 1 in the former case, 0 in the latter). The importance of preserving parentheses cannot be overemphasized. The algorithms presented in Theorems 3, 4 and 6 all depend on it. For example, in Theorem 6, the formula $x_h = mx - (mx - x)$ would reduce to $x_h = x$ if it weren't for parentheses, thereby destroying the entire algorithm. A language definition that does not require parentheses to be honored is useless for floating-point calculations.
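The example is directly reproducible with Python floats, assuming they are IEEE 754 doubles.

```python
x, y, z = 1e30, -1e30, 1.0

left_grouping = (x + y) + z    # x + y is exactly 0, so the sum is 1
right_grouping = x + (y + z)   # y + z rounds back to y, so the sum is 0
```

A compiler that regroups this sum under the associative law would silently turn one result into the other.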

Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y is the product performed in single or double precision? Another example: in x + m/n where m and n are integers, is the division an integer operation or a floating-point one? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks. First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants. In the expression 0.1*x, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. If 0.1 is still treated as a single precision constant, then there will be a compile time error. The programmer will have to hunt down and change every floating-point constant.

The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided. There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie 1978]. This leads to anomalies like the example at the beginning of this section. The expression 3.0/7.0 is computed in double precision, but if q is a single-precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision is different from its stored value in single precision. Thus the comparison q = 3/7 fails. This suggests that computing every expression in the highest precision available is not a good rule.
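The anomaly can be imitated in Python by storing the quotient through a single-precision round trip. This is a sketch assuming struct's "f" format rounds a double to the nearest single; to_single is a hypothetical helper standing in for assignment to a single-precision variable.

```python
import struct

def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

q = to_single(3.0 / 7.0)                      # quotient rounded for storage in q
mixed_comparison = (q == 3.0 / 7.0)           # stored single vs. freshly computed double
uniform_comparison = (q == to_single(3.0 / 7.0))  # both sides rounded the same way
```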

Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in a later section). If d is a double precision variable, and x[] and y[] are single precision arrays, then the inner product loop will look like d = d + x[i]*y[i]. If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable.
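The difference between the two ways of evaluating d = d + x[i]*y[i] can be sketched by emulating single precision with a struct round trip. The helper names and the two-element vectors are assumptions chosen so that the lost bits are easy to see.

```python
import struct

def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def dot_single_products(x, y):
    """Double accumulator, but each product is rounded to single first."""
    d = 0.0
    for xi, yi in zip(x, y):
        d = d + to_single(xi * yi)
    return d

def dot_double_products(x, y):
    """Double accumulator with the full double-precision product."""
    d = 0.0
    for xi, yi in zip(x, y):
        d = d + xi * yi
    return d

# Single-precision inputs whose exact inner product is 2**-24.
x = [1.0 + 2.0**-12, -1.0]
y = [1.0 + 2.0**-12, 1.0 + 2.0**-11]

with_single_products = dot_single_products(x, y)
with_double_products = dot_double_products(x, y)
```

With single-precision products the low-order bit of the first term is gone before the accumulator ever sees it, so the double-precision sum cannot recover it.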

A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression. Then q = 3.0/7.0 will be computed entirely in single precision and will have the boolean value true, whereas d = d + x[i]*y[i] will be computed in double precision, gaining the full advantage of double precision accumulation. However, this rule is too simplistic to cover all cases cleanly. If dx and dy are double precision variables, the expression y = x + single(dx-dy) contains a double precision variable, but performing the sum in double precision would be pointless, because both operands are single precision, as is the result.

A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree. Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. In the case of q = 3.0/7.0, every leaf is single precision, so all the operations are done in single precision. In the case of d = d + x[i]*y[i], the tentative precision of the multiply operation is single precision, but in the second pass it gets promoted to double precision, because its parent operation expects a double precision operand. And in y = x + single(dx-dy), the addition is done in single precision. Farnum [1988] presents evidence that this algorithm is not difficult to implement.
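The two passes can be sketched on a small expression tree. The node classes, the integer encoding of precisions, and the treatment of the assignment target and of explicit conversions as "the parent" are all assumptions of this sketch rather than details taken from Farnum [1988].

```python
from dataclasses import dataclass

SINGLE, DOUBLE = 1, 2   # larger number means higher precision

@dataclass
class Leaf:             # a variable or constant with a declared precision
    name: str
    prec: int

@dataclass
class Convert:          # an explicit conversion such as single(...)
    target: int
    child: object

@dataclass
class Op:               # a binary operation
    symbol: str
    left: object
    right: object
    tentative: int = 0
    prec: int = 0

def first_pass(node):
    """Leaves to root: tentative precision is the maximum over the operands."""
    if isinstance(node, Leaf):
        return node.prec
    if isinstance(node, Convert):
        first_pass(node.child)
        return node.target
    node.tentative = max(first_pass(node.left), first_pass(node.right))
    return node.tentative

def second_pass(node, expected):
    """Root to leaves: final precision is max(tentative, what the parent expects)."""
    if isinstance(node, Leaf):
        return
    if isinstance(node, Convert):
        second_pass(node.child, node.target)
        return
    node.prec = max(node.tentative, expected)
    second_pass(node.left, node.prec)
    second_pass(node.right, node.prec)

def assign_precisions(target_prec, expr):
    """Resolve 'target = expr', where target_prec is the precision of the target."""
    first_pass(expr)
    second_pass(expr, target_prec)
    return expr
```

For d = d + x[i]*y[i], the multiply node ends up with tentative precision SINGLE and final precision DOUBLE, matching the description above.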

The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded. This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1 which are not exactly representable in binary.

Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen 1982]. If ** is the exponentiation operator, then (-3)**3 certainly has the value -27. However, (-3.0)**3.0 is problematical. If the ** operator checks for integer powers, it would compute (-3.0)**3.0 as $-3.0^3 = -27$. On the other hand, if the formula $x^y = e^{y \log x}$ is used to define ** for real arguments, then depending on the log function, the result could be a NaN (using the natural definition of log(x) = NaN when x < 0). If the FORTRAN CLOG function is used however, then the answer will be -27, because the ANSI FORTRAN standard defines CLOG(-3.0) to be iπ + log 3 [ANSI 1978]. The programming language Ada avoids this problem by only defining exponentiation for integer powers, while ANSI FORTRAN prohibits raising a negative number to a real power.

In fact, the FORTRAN standard says that

Any arithmetic operation whose result is not mathematically defined is prohibited.

Unfortunately, with the introduction of ±∞ by the IEEE standard, the meaning of not mathematically defined is no longer totally clear cut. One definition might be to use the method shown in an earlier section. For example, to determine the value of $a^b$, consider non-constant analytic functions f and g with the property that f(x) → a and g(x) → b as x → 0. If $f(x)^{g(x)}$ always approaches the same limit, then this should be the value of $a^b$. This definition would set $2^{\infty} = \infty$, which seems quite reasonable. In the case of $1.0^{\infty}$, when f(x) = 1 and g(x) = 1/x the limit approaches 1, but when f(x) = 1 - x and g(x) = 1/x the limit is $e^{-1}$. So $1.0^{\infty}$ should be a NaN. In the case of $0^0$, $f(x)^{g(x)} = e^{g(x) \log f(x)}$. Since f and g are analytic and take on the value 0 at 0, $f(x) = a_1x^1 + a_2x^2 + \cdots$ and $g(x) = b_1x^1 + b_2x^2 + \cdots$. Thus $\lim_{x \to 0} g(x) \log f(x) = \lim_{x \to 0} x \log(x(a_1 + a_2x + \cdots)) = \lim_{x \to 0} x \log(a_1x) = 0$. So $f(x)^{g(x)} \to e^0 = 1$ for all f and g, which means that $0^0 = 1$. Using this definition would unambiguously define the exponential function for all arguments, and in particular would define (-3.0)**3.0 to be -27.
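For comparison, here is how one present-day language resolves these cases. The sketch uses only Python's built-in ** operator and the standard math and cmath modules; the behavior shown is Python's, which follows the C library conventions for pow and is not a consequence of the limit definition above.

```python
import cmath
import math

integer_power = (-3) ** 3        # integers: no ambiguity
float_power = (-3.0) ** 3.0      # ** accepts a negative base when the power is integral

# Defining x**y as exp(y * log(x)) with a real log fails for x < 0:
try:
    math.exp(3.0 * math.log(-3.0))
    real_log_outcome = "value"
except ValueError:
    real_log_outcome = "ValueError"   # Python signals an error rather than returning NaN

# With a complex log (log(-3) = log 3 + i*pi, like CLOG) the formula gives about -27.
via_complex_log = cmath.exp(3.0 * cmath.log(-3.0))

two_to_infinity = math.pow(2.0, math.inf)
one_to_infinity = math.pow(1.0, math.inf)   # C convention: 1.0, not NaN
zero_to_zero = math.pow(0.0, 0.0)

# The two limits discussed for 1**infinity, sampled at a small x:
x = 1e-6
limit_a = 1.0 ** (1.0 / x)          # f(x) = 1
limit_b = (1.0 - x) ** (1.0 / x)    # f(x) = 1 - x, close to exp(-1)
```

Python agrees with the limit definition for $2^{\infty}$ and $0^0$ but returns 1 rather than NaN for $1.0^{\infty}$.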

The IEEE Standard

An earlier section discussed many of the features of the IEEE standard. However, the IEEE standard says nothing about how these features are to be accessed from a programming language. Thus, there is usually a mismatch between floating-point hardware that supports the standard and programming languages like C, Pascal or FORTRAN. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines. For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example. Constants to represent ±∞ could be supplied by a subroutine. But that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.

A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful. As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson 1991].

There are a number of minor points that need to be considered when implementing the IEEE standard in a language. Since x - x = +0 for all x, (+0) - (+0) = +0. However, -(+0) = -0, thus -x should not be defined as 0 - x. The introduction of NaNs can be confusing, because a NaN is never equal to any other number (including another NaN), so x = x is no longer always true. In fact, the expression x ≠ x is the simplest way to test for a NaN if the IEEE recommended function Isnan is not provided. Furthermore, NaNs are unordered with respect to all other numbers, so x ≤ y cannot be defined as not x > y. Since the introduction of NaNs causes floating-point numbers to become partially ordered, a compare function that returns one of <, =, >, or unordered can make it easier for the programmer to deal with comparisons.
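These points can be checked directly in any language whose floats are IEEE doubles. The following Python sketch is only an illustration; the helper names is_nan and compare are hypothetical stand-ins for the functions mentioned above.

```python
import math

def is_nan(x):
    """A NaN is the only value that compares unequal to itself."""
    return x != x

def compare(x, y):
    """Four-way comparison: returns '<', '=', '>' or 'unordered'."""
    if x < y:
        return '<'
    if x > y:
        return '>'
    if x == y:
        return '='
    return 'unordered'

nan = float('nan')
pos_zero = 0.0
negated = -pos_zero          # -(+0) is -0
difference = 0.0 - pos_zero  # (+0) - (+0) is +0

# math.copysign exposes the sign bit, which == cannot see because -0 == +0.
negated_sign = math.copysign(1.0, negated)
difference_sign = math.copysign(1.0, difference)
```

Note that not (nan > 1.0) is true while nan <= 1.0 is false, which is why the two expressions cannot be treated as interchangeable.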

Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example, when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.
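A minimal sketch of such a compound operation, assuming the caller wants NaN back only when no ordinary value is present (the name max_ignoring_nan is hypothetical):

```python
def max_ignoring_nan(values):
    """Largest non-NaN value, or NaN if there is no such value."""
    ordinary = [v for v in values if v == v]   # v == v is False only for a NaN
    return max(ordinary) if ordinary else float('nan')

scale = max_ignoring_nan([1.0, float('nan'), 3.5, 2.0])
empty = max_ignoring_nan([float('nan')])
```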

Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can't use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.

Optimizers

Compiler texts tend to ignore the subject of floating-point. For example Aho et al. [1986] mentions replacing x/2.0 with x*0.5, leading the reader to assume that x/10.0 should be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. This textbook also suggests replacing x*y-x*z by x*(y-z), even though we have seen that these two expressions can have quite different values when y ≈ z. Although it does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important. Whether or not the language standard specifies that parentheses must be honored, (x+y)+z can have a totally different answer than x+(y+z), as discussed above. There is a problem closely related to preserving parentheses that is illustrated by the following code

     eps = 1;
     do eps = 0.5*eps; while (eps + 1 > 1);

This is designed to give an estimate for machine epsilon. If an optimizing compiler notices that eps + 1 > 1 ⇔ eps > 0, the program will be changed completely. Instead of computing the smallest number x such that 1 ⊕ x is still greater than 1 ($x \approx \epsilon \approx \beta^{-p}$), it will compute the largest number x for which x/2 is rounded to 0 ($x \approx \beta^{e_{\min}}$). Avoiding this kind of "optimization" is so important that it is worth presenting one more very useful algorithm that is totally ruined by it.
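The effect of these rewrites can be observed in Python, whose floats are IEEE doubles. The sketch below runs the epsilon loop as written and again after the invalid rewrite eps + 1 > 1 ⇒ eps > 0; the function names are hypothetical, and the two small comparisons at the end illustrate the x/10.0 and reassociation examples.

```python
def machine_epsilon_estimate():
    """Halve eps until adding it to 1 no longer changes the sum."""
    eps = 1.0
    while True:
        eps = 0.5 * eps
        if not (eps + 1.0 > 1.0):
            return eps

def wrongly_optimized():
    """The same loop after the invalid rewrite of the test to eps > 0."""
    eps = 1.0
    while True:
        eps = 0.5 * eps
        if not (eps > 0.0):
            return eps

x = 3.0
division = x / 10.0         # correctly rounded quotient
multiplication = x * 0.1    # uses the already-rounded constant 0.1

left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
```

The first loop stops near the unit roundoff, while the rewritten loop runs all the way down through the denormalized numbers to zero.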

Many problems, such as numerical integration and the numerical solution of differential equations, involve computing sums with many terms. Because each addition can potentially introduce an error as large as .5 ulp, a sum involving thousands of terms can have quite a bit of rounding error. A simple way to correct for this is to store the partial sum in a double precision variable and to perform each addition using double precision. If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. However, there is a much more efficient method which dramatically improves the accuracy of sums, namely

Theorem 8 (Kahan Summation Formula)

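Suppose that $\sum_{j=1}^{N} x_j$ is computed using the following algorithm

     S = X[1];
     C = 0;
     for j = 2 to N {
         Y = X[j] - C;
         T = S + Y;
         C = (T - S) - Y;
         S = T;
     }

Then the computed sum S is equal to $\sum x_j(1 + \delta_j) + O(N\epsilon^2)\sum |x_j|$, where $|\delta_j| \le 2\epsilon$.

Using the naive formula $\sum x_j$, the computed sum is equal to $\sum x_j(1 + \delta_j)$ where $|\delta_j| < (N - j)\epsilon$.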

An optimizer that believed floating-point arithmetic obeyed the laws of algebra would conclude that C = [T-S] - Y = [(S+Y)-S] - Y = 0, rendering the algorithm completely useless. These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables
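The algorithm of Theorem 8 is short enough to try directly. The following Python sketch compares it with naive summation, using math.fsum only as a yardstick for the correctly rounded sum; the data set is an arbitrary choice for illustration.

```python
import math

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Compensated summation following the algorithm of Theorem 8."""
    it = iter(xs)
    s = next(it, 0.0)    # S = X[1]
    c = 0.0              # running correction
    for x in it:
        y = x - c
        t = s + y
        c = (t - s) - y  # rounding error of s + y; algebraically this is zero
        s = t
    return s

data = [0.1] * 1_000_000
reference = math.fsum(data)
naive_error = abs(naive_sum(data) - reference)
kahan_error = abs(kahan_sum(data) - reference)
```

Python evaluates the expressions exactly as written, so the correction term survives; a compiler that simplified (t - s) - y to zero would reduce kahan_sum to naive_sum.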

Another way that optimizers can change the semantics of floating-point code involves constants. In the expression 1.0E-40*x, there is an implicit decimal to binary conversion operation that converts the decimal number to a binary constant. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should be set if the expression is evaluated in single precision. Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts 1.0E-40 to binary at compile time would be changing the semantics of the program. However, constants like 27.5 which are exactly representable in the smallest available precision can be safely converted at compile time, since they are always exact, cannot raise any exception, and are unaffected by the rounding modes. Constants that are intended to be converted at compile time should be done with a constant declaration, such as const pi = 3.14159265.

Common subexpression elimination is another example of an optimization that can change floating-point semantics, as illustrated by the following code

     C = A*B;
     RndMode = Up
     D = A*B;

Although A*B can appear to be a common subexpression, it is not because the rounding mode is different at the two evaluation sites. Three final examples: x = x cannot be replaced by the boolean constant true, because it fails when x is a NaN; -x = 0 - x fails for x = +0; and x < y is not the opposite of x ≥ y, because NaNs are neither greater than nor less than ordinary floating-point numbers.

Despite these examples, there are useful optimizations that can be done on floating-point code. First of all, there are algebraic identities that are valid for floating-point numbers. Some examples in IEEE arithmetic are x + y = y + x, 2 × x = x + x, 1 × x = x, and 0.5 × x = x/2. However, even these simple identities can fail on a few machines such as CDC and Cray supercomputers. Instruction scheduling and in-line procedure substitution are two other potentially useful optimizations.
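A quick spot check of these identities on IEEE doubles, as a sketch only; random sampling with a fixed seed illustrates the claim but does not prove it.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [(rng.uniform(-1e6, 1e6), rng.uniform(-1e6, 1e6)) for _ in range(10_000)]

commutative = all(x + y == y + x for x, y in samples)
doubling = all(2.0 * x == x + x for x, _ in samples)
identity = all(1.0 * x == x for x, _ in samples)
halving = all(0.5 * x == x / 2.0 for x, _ in samples)
```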

As a final example, consider the expression dx = x*y, where x and y are single precision variables, and dx is double precision. On machines that have an instruction that multiplies two single precision numbers to produce a double precision number, dx = x*y can get mapped to that instruction, rather than compiled to a series of instructions that convert the operands to double and then perform a double to double precision multiply.
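The reason such an instruction is attractive is that the product of two 24-bit significands needs at most 48 bits, so it fits in a double without rounding. The sketch below checks this in Python, emulating single precision with the struct module; the helper to_single is hypothetical.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single, returned as a double."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = to_single(0.1)
y = to_single(1.0 / 3.0)

dx = x * y                           # double precision product of two singles
exact = Fraction(x) * Fraction(y)    # exact rational product
single_product = to_single(dx)       # what a single precision multiply would return
```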

Some compiler writers view restrictions which prohibit converting (x + y) + z to x + (y + z) as irrelevant, of interest only to programmers who use unportable tricks. Perhaps they have in mind that floating-point numbers model real numbers and should obey the same laws that real numbers do. The problem with real number semantics is that they are extremely expensive to implement. Every time two n bit numbers are multiplied, the product will have 2n bits. Every time two n bit numbers with widely spaced exponents are added, the number of bits in the sum is n + the space between the exponents. The sum could have up to ($e_{\max} - e_{\min}$) + n bits, or roughly 2·$e_{\max}$ + n bits. An algorithm that involves thousands of operations (such as solving a linear system) will soon be operating on numbers with many significant bits, and be hopelessly slow. The implementation of library functions such as sin and cos is even more difficult, because the values of these transcendental functions aren't rational numbers. Exact integer arithmetic is often provided by lisp systems and is handy for some problems. However, exact floating-point arithmetic is rarely useful.

The fact is that there are useful algorithms (like the Kahan summation formula) that exploit the fact that (x + y) + z ≠ x + (y + z), and work whenever the bound

$a \oplus b = (a + b)(1 + \delta)$

holds (as well as similar bounds for -, × and /). Since these bounds hold for almost all commercial hardware, it would be foolish for numerical programmers to ignore such algorithms, and it would be irresponsible for compiler writers to destroy these algorithms by pretending that floating-point variables have real number semantics.

Exception Handling

The topics discussed up to now have primarily concerned systems implications of accuracy and precision. Trap handlers also raise some interesting systems issues. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and an earlier section gave some applications of user defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands, otherwise, with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well. For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination.

The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines which have pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter. Hardware support for identifying exactly which operation trapped may be necessary

Another problem is illustrated by the following program fragment

     x = y*z;
     z = x*w;
     a = b + c;
     d = a/x;

Suppose the second multiply raises an exception, and the trap handler wants to use the value of a. On hardware that can do an add and multiply in parallel, an optimizer would probably move the addition operation ahead of the second multiply, so that the add can proceed in parallel with the first multiply. Thus when the second multiply traps, a = b + c has already been executed, potentially changing the result of a. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated. This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly. Instead, the handler can be given the operands or result as an argument.

But there are still problems. In the fragment

     x = y*z;
     z = a + b;

the two instructions might well be executed in parallel. If the multiply traps, its argument z could already have been overwritten by the addition, especially since addition is usually faster than multiply. Computer systems that support the IEEE standard must provide some way to save the value of z, either in hardware or by having the compiler avoid such a situation in the first place.

W. Kahan has proposed using presubstitution instead of trap handlers to avoid these problems. In this method, the user specifies an exception and the value he wants to be used as the result when the exception occurs. As an example, suppose that in code for computing (sin x)/x, the user decides that x = 0 is so rare that it would improve performance to avoid a test for x = 0, and instead handle this case when a 0/0 trap occurs. Using IEEE trap handlers, the user would write a handler that returns a value of 1 and install it before computing sin x/x. Using presubstitution, the user would specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs

The advantage of presubstitution is that it has a straightforward hardware implementation. As soon as the type of exception has been determined, it can be used to index a table which contains the desired result of the operation. Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers
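Presubstitution can be imitated in software. In Python a float division by zero raises ZeroDivisionError instead of setting a flag or returning a NaN, so the sketch below catches that exception and returns a value chosen in advance. The names with_presubstitution and sinc are hypothetical, and this emulation says nothing about the cost of a hardware implementation.

```python
import math

def with_presubstitution(operation, substitute):
    """Run operation; if it divides by zero, return the value chosen in advance."""
    try:
        return operation()
    except ZeroDivisionError:
        return substitute

def sinc(x):
    # No explicit test for x == 0: the rare 0/0 case yields the presubstituted 1.0.
    return with_presubstitution(lambda: math.sin(x) / x, 1.0)

at_zero = sinc(0.0)
near_zero = sinc(1e-3)
```

The substitute value must be fixed before the operation runs, which is exactly the restriction that distinguishes presubstitution from a general trap handler.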

The Details

A number of claims have been made in this paper concerning properties of floating-point arithmetic. We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically. This section is divided into three parts. The first part presents an introduction to error analysis, and provides the details for the earlier discussion of rounding error. The second part explores binary to decimal conversion, filling in some gaps from the discussion of the IEEE standard. The third part discusses the Kahan summation formula, which was used as an example in the discussion of optimizers.

Rounding Error

In the discussion of rounding error, it was stated that a single guard digit is enough to guarantee that addition and subtraction will always be accurate (Theorem 2). We now proceed to verify this fact. Theorem 2 has two parts, one for subtraction and one for addition. The part for subtraction is

Theorem 9

If x and y are positive floating-point numbers in a format with parameters β and p, and if subtraction is done with p + 1 digits (i.e. one guard digit), then the relative rounding error in the result is less than $(\beta/2 + 1)\beta^{-p} = (1 + 2/\beta)\epsilon \le 2\epsilon$.

Proof

Interchange x and y if necessary so that x > y. It is also harmless to scale x and y so that x is represented by $x_0.x_1 \ldots x_{p-1} \times \beta^0$. If y is represented as $y_0.y_1 \ldots y_{p-1}$, then the difference is exact. If y is represented as $0.y_1 \ldots y_p$, then the guard digit ensures that the computed difference will be the exact difference rounded to a floating-point number, so the rounding error is at most ε. In general, let $y = 0.0 \ldots 0y_{k+1} \ldots y_{k+p}$ and let $\bar{y}$ be y truncated to p + 1 digits. Then

(15) $y - \bar{y} < (\beta - 1)(\beta^{-p-1} + \beta^{-p-2} + \cdots + \beta^{-p-k})$.

From the definition of guard digit, the computed value of x - y is $x - \bar{y}$ rounded to be a floating-point number, that is, $(x - \bar{y}) + \delta$, where the rounding error δ satisfies

(16) $|\delta| \le (\beta/2)\beta^{-p}$.

The exact difference is x - y, so the error is $(x - y) - (x - \bar{y} + \delta) = \bar{y} - y + \delta$. There are three cases. If $x - y \ge 1$ then the relative error is bounded by

(17) $\frac{y - \bar{y} + |\delta|}{1} \le \beta^{-p}\left[(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k}) + \beta/2\right] < \beta^{-p}(1 + \beta/2)$.

Secondly, if $x - \bar{y} < 1$, then δ = 0. Since the smallest that x - y can be is

$1.0 - 0.\underbrace{0 \ldots 0}_{k}\underbrace{\rho \ldots \rho}_{p} > (\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})$, where $\rho = \beta - 1$,

in this case the relative error is bounded by

(18) $\frac{y - \bar{y} + |\delta|}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} < \frac{(\beta - 1)\beta^{-p}(\beta^{-1} + \cdots + \beta^{-k})}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} = \beta^{-p}$.

The final case is when x - y < 1 but $x - \bar{y} \ge 1$. The only way this could happen is if $x - \bar{y} = 1$, in which case δ = 0. But if δ = 0, then (18) applies, so that again the relative error is bounded by $\beta^{-p} < \beta^{-p}(1 + \beta/2)$. ∎

When β = 2, the bound is exactly 2ε, and this bound is achieved for $x = 1 + 2^{2-p}$ and $y = 2^{1-p} - 2^{1-2p}$ in the limit as $p \to \infty$. When adding numbers of the same sign, a guard digit is not necessary to achieve good accuracy, as the following result shows.

Theorem 10

If $x \ge 0$ and $y \ge 0$, then the relative error in computing x + y is at most 2ε, even if no guard digits are used.

Proof

The algorithm for addition with k guard digits is similar to that for subtraction. If $x \ge y$, shift y right until the radix points of x and y are aligned. Discard any digits shifted past the p + k position. Compute the sum of these two p + k digit numbers exactly. Then round to p digits. We will verify the theorem when no guard digits are used; the general case is similar. There is no loss of generality in assuming that $x \ge y \ge 0$ and that x is scaled to be of the form $d.dd \ldots d \times \beta^0$. First, assume there is no carry out. Then the digits shifted off the end of y have a value less than $\beta^{-p+1}$, and the sum is at least 1, so the relative error is less than $\beta^{-p+1}/1 = 2\epsilon$. If there is a carry out, then the error from shifting must be added to the rounding error of $\frac{1}{2}\beta^{-p+2}$. The sum is at least β, so the relative error is less than

$\left(\beta^{-p+1} + \frac{1}{2}\beta^{-p+2}\right)/\beta = (1 + \beta/2)\beta^{-p} \le 2\epsilon$. ∎

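It is obvious that combining these two theorems gives Theorem 2. Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of $x^2 - y^2$ and (x + y)(x - y) requires knowing the relative error of multiple operations. The relative error of $x \ominus y$ is $\delta_1 = [(x \ominus y) - (x - y)] / (x - y)$, which satisfies $|\delta_1| \le 2\epsilon$. Or to write it another way

(19) $x \ominus y = (x - y)(1 + \delta_1)$, $|\delta_1| \le 2\epsilon$

Similarly

(20) $x \oplus y = (x + y)(1 + \delta_2)$, $|\delta_2| \le 2\epsilon$

Assuming that multiplication is performed by computing the exact product and then rounding, the relative error is at most .5 ulp, so

(21) $u \otimes v = uv(1 + \delta_3)$, $|\delta_3| \le \epsilon$

for any floating-point numbers u and v. Putting these three equations together (letting $u = x \ominus y$ and $v = x \oplus y$) gives

(22) $(x \ominus y) \otimes (x \oplus y) = (x - y)(1 + \delta_1)(x + y)(1 + \delta_2)(1 + \delta_3)$

So the relative error incurred when computing (x - y)(x + y) is

(23) $\frac{(x \ominus y) \otimes (x \oplus y) - (x^2 - y^2)}{x^2 - y^2} = (1 + \delta_1)(1 + \delta_2)(1 + \delta_3) - 1$

This relative error is equal to $\delta_1 + \delta_2 + \delta_3 + \delta_1\delta_2 + \delta_1\delta_3 + \delta_2\delta_3 + \delta_1\delta_2\delta_3$, which is bounded by $5\epsilon + 8\epsilon^2$ (plus a third order term of at most $4\epsilon^3$). In other words, the maximum relative error is about 5 rounding errors (since ε is a small number, $\epsilon^2$ is almost negligible).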

A similar analysis of $(x \otimes x) \ominus (y \otimes y)$ cannot result in a small value for the relative error, because when two nearby values of x and y are plugged into $x^2 - y^2$, the relative error will usually be quite large. Another way to see this is to try and duplicate the analysis that worked on $(x \ominus y) \otimes (x \oplus y)$, yielding

$(x \otimes x) \ominus (y \otimes y) = [x^2(1 + \delta_1) - y^2(1 + \delta_2)](1 + \delta_3)$
$= ((x^2 - y^2)(1 + \delta_1) + (\delta_1 - \delta_2)y^2)(1 + \delta_3)$

When x and y are nearby, the error term $(\delta_1 - \delta_2)y^2$ can be as large as the result $x^2 - y^2$. These computations formally justify our claim that (x - y)(x + y) is more accurate than $x^2 - y^2$.
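The difference between the two forms is easy to observe on IEEE doubles. The sketch below uses exact rational arithmetic as the reference; the inputs are an illustrative choice for which the rounding of x*x is visible, and eps here denotes $2^{-53}$.

```python
from fractions import Fraction

def relative_error(computed, exact):
    """Relative error of a float against an exact rational value."""
    return abs((Fraction(computed) - exact) / exact)

x = 1.0 + 2.0 ** -27    # x*x needs 55 bits, so it is rounded
y = 1.0

exact = Fraction(x) ** 2 - Fraction(y) ** 2
squares_error = relative_error(x * x - y * y, exact)
factored_error = relative_error((x - y) * (x + y), exact)
eps = Fraction(1, 2 ** 53)
```

For these inputs the factored form stays within the bound of about five rounding errors, while the difference of squares loses roughly half of its significant bits.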

We next turn to an analysis of the formula for the area of a triangle. In order to estimate the maximum error that can occur when computing with that formula, the following fact will be needed.

Theorem 11

If subtraction is performed with a guard digit, and $y/2 \le x \le 2y$, then x - y is computed exactly.

Proof

Note that if x and y have the same exponent, then certainly $x \ominus y$ is exact. Otherwise, from the condition of the theorem, the exponents can differ by at most 1. Scale and interchange x and y if necessary so that $0 \le y \le x$, and x is represented as $x_0.x_1 \ldots x_{p-1}$ and y as $0.y_1 \ldots y_p$. Then the algorithm for computing $x \ominus y$ will compute x - y exactly and round to a floating-point number. If the difference is of the form $0.d_1 \ldots d_p$, the difference will already be p digits long, and no rounding is necessary. Since $x \le 2y$, $x - y \le y$, and since y is of the form $0.d_1 \ldots d_p$, so is x - y. ∎
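Theorem 11 can be spot-checked on IEEE doubles, which satisfy its hypothesis because every operation is exactly rounded. The following sketch samples pairs with $y/2 \le x \le 2y$ and compares each floating-point difference with the exact rational one; the sampling ranges are arbitrary.

```python
import random
from fractions import Fraction

def subtraction_is_exact(x, y):
    """True when the floating-point difference equals the exact difference."""
    return Fraction(x - y) == Fraction(x) - Fraction(y)

rng = random.Random(1)    # fixed seed so the run is repeatable
pairs = []
for _ in range(10_000):
    y = rng.uniform(1e-3, 1e3)
    x = y * rng.uniform(0.5, 2.0)
    if y / 2.0 <= x <= 2.0 * y:      # keep only pairs satisfying the hypothesis
        pairs.append((x, y))

all_exact = all(subtraction_is_exact(x, y) for x, y in pairs)
outside_hypothesis = subtraction_is_exact(1.0, 2.0 ** -60)
```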

When β > 2, the hypothesis of Theorem 11 cannot be replaced by $y/\beta \le x \le \beta y$; the stronger condition $y/2 \le x \le 2y$ is still necessary. The analysis of the error in (x - y)(x + y), immediately following the proof of Theorem 10, used the fact that the relative error in the basic operations of addition and subtraction is small (namely equations (19) and (20)). This is the most common kind of error analysis. However, analyzing the triangle-area formula requires something more, namely Theorem 11, as the following proof will show.

Theorem 12

If subtraction uses a guard digit, and if a, b and c are the sides of a triangle ($a \ge b \ge c$), then the relative error in computing (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c)) is at most 16ε, provided ε < .005.

Proof

Let's examine the factors one by one. From Theorem 10, $b \oplus c = (b + c)(1 + \delta_1)$, where $\delta_1$ is the relative error, and $|\delta_1| \le 2\epsilon$. Then the value of the first factor is

$(a \oplus (b \oplus c)) = (a + (b \oplus c))(1 + \delta_2) = (a + (b + c)(1 + \delta_1))(1 + \delta_2)$,

and thus

$(a + b + c)(1 - 2\epsilon)^2 \le [a + (b + c)(1 - 2\epsilon)](1 - 2\epsilon)$
$\le a \oplus (b \oplus c)$
$\le [a + (b + c)(1 + 2\epsilon)](1 + 2\epsilon)$
$\le (a + b + c)(1 + 2\epsilon)^2$

This means that there is an $\eta_1$ so that

(24) $(a \oplus (b \oplus c)) = (a + b + c)(1 + \eta_1)^2$, $|\eta_1| \le 2\epsilon$.

The next term involves the potentially catastrophic subtraction of c and $a \ominus b$, because $a \ominus b$ may have rounding error. Because a, b and c are the sides of a triangle, $a \le b + c$, and combining this with the ordering $c \le b \le a$ gives $a \le b + c \le 2b \le 2a$. So a - b satisfies the conditions of Theorem 11. This means that $a - b = a \ominus b$ is exact, hence $c \ominus (a - b)$ is a harmless subtraction which can be estimated from Theorem 9 to be

(25) $(c \ominus (a \ominus b)) = (c - (a - b))(1 + \eta_2)$, $|\eta_2| \le 2\epsilon$

The third term is the sum of two exact positive quantities, so

(26) $(c \oplus (a \ominus b)) = (c + (a - b))(1 + \eta_3)$, $|\eta_3| \le 2\epsilon$

Finally, the last term is

(27) $(a \oplus (b \ominus c)) = (a + (b - c))(1 + \eta_4)^2$, $|\eta_4| \le 2\epsilon$,

using both Theorem 9 and Theorem 10. If multiplication is assumed to be exactly rounded, so that $x \otimes y = xy(1 + \zeta)$ with $|\zeta| \le \epsilon$, then combining (24), (25), (26) and (27) gives

$(a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))$
$= (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))\,E$

where

$E = (1 + \eta_1)^2(1 + \eta_2)(1 + \eta_3)(1 + \eta_4)^2(1 + \zeta_1)(1 + \zeta_2)(1 + \zeta_3)$

An upper bound for E is $(1 + 2\epsilon)^6(1 + \epsilon)^3$, which expands out to $1 + 15\epsilon + O(\epsilon^2)$. Some writers simply ignore the $O(\epsilon^2)$ term, but it is easy to account for it. Writing $(1 + 2\epsilon)^6(1 + \epsilon)^3 = 1 + 15\epsilon + \epsilon R(\epsilon)$, R(ε) is a polynomial in ε with positive coefficients, so it is an increasing function of ε. Since R(.005) = .505, R(ε) < 1 for all ε < .005, and hence $E \le (1 + 2\epsilon)^6(1 + \epsilon)^3 < 1 + 16\epsilon$. To get a lower bound on E, note that $1 - 15\epsilon - \epsilon R(\epsilon) < E$, and so when ε < .005, $1 - 16\epsilon < (1 - 2\epsilon)^6(1 - \epsilon)^3$. Combining these two bounds yields $1 - 16\epsilon < E < 1 + 16\epsilon$. Thus the relative error is at most 16ε. ∎

Theorem 12 certainly shows that there is no catastrophic cancellation in the triangle-area formula. So although it is not necessary to show that the formula is numerically stable, it is satisfying to have a bound for the entire formula, which is what Theorem 3 gives.
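The bound of Theorem 12 can be checked for a particular triangle on IEEE doubles, using exact rational arithmetic as the reference. In this sketch the side lengths are an illustrative needle-like triangle, the function names are hypothetical, and eps denotes $2^{-53}$.

```python
import math
from fractions import Fraction

def rearranged_product(a, b, c):
    """Floating-point (a+(b+c))(c-(a-b))(c+(a-b))(a+(b-c)); requires a >= b >= c."""
    return (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))

def exact_product(a, b, c):
    """The same product evaluated without rounding on the stored input values."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    return (a + b + c) * (c - a + b) * (c + a - b) * (a + b - c)

a, b, c = 9.0, 4.53, 4.53
computed = rearranged_product(a, b, c)
exact = exact_product(a, b, c)
rel_error = abs(Fraction(computed) - exact) / exact
eps = Fraction(1, 2 ** 53)
area = math.sqrt(computed) / 4.0
```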

Proof of Theorem 3

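Let

$q = (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))$

and

$Q = (a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))$.

Then, Theorem 12 shows that Q = q(1 + δ), with $|\delta| \le 16\epsilon$. It is easy to check that

(28) $1 - 0.52|\delta| \le \sqrt{1 - |\delta|} \le \sqrt{1 + |\delta|} \le 1 + 0.52|\delta|$

provided $|\delta| \le .04/(.52)^2 \approx .15$, and since $|\delta| \le 16\epsilon \le 16(.005) = .08$, δ does satisfy the condition. Thus $\sqrt{Q} = \sqrt{q(1 + \delta)} = \sqrt{q}(1 + \delta_1)$, with $|\delta_1| \le .52|\delta| \le 8.5\epsilon$. If square roots are computed to within .5 ulp, then the error when computing $\sqrt{Q}$ is $(1 + \delta_1)(1 + \delta_2)$, with $|\delta_2| \le \epsilon$. If β = 2, then there is no further error committed when dividing by 4. Otherwise, one more factor $1 + \delta_3$ with $|\delta_3| \le \epsilon$ is necessary for the division, and using the method in the proof of Theorem 12, the final error bound of $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)$ is dominated by $1 + \delta_4$, with $|\delta_4| \le 11\epsilon$. ∎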

To make the heuristic explanation immediately following the statement of Theorem 4 precise, the next theorem describes just how closely µ(x) approximates a constant

Theorem 13

If µ(x) = ln(1 + x)/x, then for $0 \le x \le 3/4$, $1/2 \le \mu(x) \le 1$ and the derivative satisfies $|\mu'(x)| \le 1/2$.

Proof

Note that $\mu(x) = 1 - x/2 + x^2/3 - \cdots$ is an alternating series with decreasing terms, so for $x \le 1$, $\mu(x) \ge 1 - x/2 \ge 1/2$. It is even easier to see that because the series for µ is alternating, $\mu(x) \le 1$. The Taylor series of µ'(x) is also alternating, and if $x \le 3/4$ has decreasing terms, so $-1/2 \le \mu'(x) \le -1/2 + 2x/3$, or $-1/2 \le \mu'(x) \le 0$, thus $|\mu'(x)| \le 1/2$. ∎

Proof of Theorem 4

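Since the Taylor series for ln, $\ln(1 + x) = x - x^2/2 + x^3/3 - \cdots$, is an alternating series, $0 < x - \ln(1 + x) < x^2/2$, so the relative error incurred when approximating ln(1 + x) by x is bounded by x/2. If $1 \oplus x = 1$, then $|x| < \epsilon$, so the relative error is bounded by ε/2.

When $1 \oplus x \ne 1$, define $\hat{x}$ via $1 \oplus x = 1 + \hat{x}$. Then since $0 \le x < 1$, $(1 \oplus x) \ominus 1 = \hat{x}$. If division and logarithms are computed to within 1/2 ulp, then the computed value of the expression ln(1 + x)/((1 + x) - 1) is

(29) $\frac{\ln(1 \oplus x)}{(1 \oplus x) \ominus 1}(1 + \delta_1)(1 + \delta_2) = \frac{\ln(1 + \hat{x})}{\hat{x}}(1 + \delta_1)(1 + \delta_2) = \mu(\hat{x})(1 + \delta_1)(1 + \delta_2)$

where $|\delta_1| \le \epsilon$ and $|\delta_2| \le \epsilon$. To estimate $\mu(\hat{x})$, use the mean value theorem, which says that

(30) $\mu(\hat{x}) - \mu(x) = (\hat{x} - x)\mu'(\xi)$

for some ξ between x and $\hat{x}$. From the definition of $\hat{x}$, it follows that $|\hat{x} - x| \le \epsilon$, and combining this with Theorem 13 gives $|\mu(\hat{x}) - \mu(x)| \le \epsilon/2$, or $|\mu(\hat{x})/\mu(x) - 1| \le \epsilon/(2|\mu(x)|) \le \epsilon$, which means that $\mu(\hat{x}) = \mu(x)(1 + \delta_3)$, with $|\delta_3| \le \epsilon$. Finally, multiplying by x introduces a final $\delta_4$, so the computed value of $x \cdot \ln(1 \oplus x)/((1 \oplus x) \ominus 1)$ is

$x\,\mu(x)(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4) = \ln(1 + x)(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4)$, $|\delta_i| \le \epsilon$

It is easy to check that if ε < 0.1, then $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4) = 1 + \delta$, with $|\delta| \le 5\epsilon$. ∎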
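The formula analyzed in this proof can be tried on IEEE doubles. The sketch below compares it with the naive expression, taking the library routine math.log1p as the reference; the function name log1p_via_theorem4 is hypothetical, and the library logarithm is assumed to be accurate to about an ulp rather than the half ulp the theorem assumes.

```python
import math

def log1p_via_theorem4(x):
    """ln(1 + x) as x * ln(1 + x) / ((1 + x) - 1), or x itself when 1 + x rounds to 1."""
    w = 1.0 + x
    if w == 1.0:
        return x
    return x * math.log(w) / (w - 1.0)

x = 1e-10
reference = math.log1p(x)
naive = math.log(1.0 + x)
improved = log1p_via_theorem4(x)

naive_rel_error = abs(naive - reference) / reference
improved_rel_error = abs(improved - reference) / reference
```

The naive form inherits the rounding error of 1 + x, while the quotient cancels that error because numerator and denominator are both computed from the same rounded value.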

An interesting example of error analysis using formulas (19), (20), and (21) occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$. An earlier section explained how rewriting the equation will eliminate the potential cancellation caused by the ± operation. But there is another potential cancellation that can occur when computing $d = b^2 - 4ac$. This one cannot be eliminated by a simple rearrangement of the formula. Roughly speaking, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. Here is an informal proof (another approach to estimating the error in the quadratic formula appears in Kahan [1972]).

If $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula.

Proof. Write $(b \otimes b) \ominus (4a \otimes c) = (b^2(1 + \delta_1) - 4ac(1 + \delta_2))(1 + \delta_3)$, where $|\delta_i| \le \epsilon$. Using $d = b^2 - 4ac$, this can be rewritten as $(d(1 + \delta_1) - 4ac(\delta_2 - \delta_1))(1 + \delta_3)$. To get an estimate for the size of this error, ignore second order terms in $\delta_i$, in which case the absolute error is $d(\delta_1 + \delta_3) - 4ac\,\delta_4$, where $|\delta_4| = |\delta_1 - \delta_2| \le 2\epsilon$. Since $d \ll 4ac$, the first term $d(\delta_1 + \delta_3)$ can be ignored. To estimate the second term, use the fact that $ax^2 + bx + c = a(x - r_1)(x - r_2)$, so $a r_1 r_2 = c$. Since $b^2 \approx 4ac$, then $r_1 \approx r_2$, so the second error term is $4ac\,\delta_4 \approx 4a^2 r_1^2 \delta_4$. Thus the computed value of $\sqrt{d}$ is

$\sqrt{d + 4a^2 r_1^2 \delta_4}$

The inequality

$p - q \le \sqrt{p^2 - q^2} \le \sqrt{p^2 + q^2} \le p + q$, $p \ge q > 0$

shows that

$\sqrt{d + 4a^2 r_1^2 \delta_4} = \sqrt{d} + E$

where

$|E| \le \sqrt{4a^2 r_1^2 |\delta_4|}$

so the absolute error in $\sqrt{d}/2a$ is about $r_1\sqrt{\delta_4}$. Since $\delta_4 \approx \beta^{-p}$, $\sqrt{\delta_4} \approx \beta^{-p/2}$, and thus the absolute error of $r_1\sqrt{\delta_4}$ destroys the bottom half of the bits of the roots $r_1 \approx r_2$. In other words, since the calculation of the roots involves computing with $\sqrt{d}/2a$, and this expression does not have meaningful bits in the position corresponding to the lower order half of $r_i$, then the lower order bits of $r_i$ cannot be meaningful. ∎

Finally, we turn to the proof of Theorem 6. It is based on the following fact, which is proven elsewhere in this paper.

Theorem 14

Let 0 < k < p, and set $m = \beta^k + 1$, and assume that floating-point operations are exactly rounded. Then $(m \otimes x) \ominus (m \otimes x \ominus x)$ is exactly equal to x rounded to p - k significant digits. More precisely, x is rounded by taking the significand of x, imagining a radix point just left of the k least significant digits and rounding to an integer.

Proof of Theorem 6

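By Theorem 14, $x_h$ is x rounded to $p - k = \lfloor p/2 \rfloor$ places. If there is no carry out, then certainly $x_h$ can be represented with $\lfloor p/2 \rfloor$ significant digits. Suppose there is a carry-out. If $x = x_0.x_1 \ldots x_{p-1} \times \beta^e$, then rounding adds 1 to $x_{p-k-1}$, and the only way there can be a carry-out is if $x_{p-k-1} = \beta - 1$, but then the low order digit of $x_h$ is $1 + x_{p-k-1} = 0$, and so again $x_h$ is representable in $\lfloor p/2 \rfloor$ digits.

To deal with $x_l$, scale x to be an integer satisfying $\beta^{p-1} \le x \le \beta^p - 1$. Let $x = \bar{x}_h + \bar{x}_l$, where $\bar{x}_h$ is the p - k high order digits of x, and $\bar{x}_l$ is the k low order digits. There are three cases to consider. If $\bar{x}_l < (\beta/2)\beta^{k-1}$, then rounding x to p - k places is the same as chopping and $x_h = \bar{x}_h$, and $x_l = \bar{x}_l$. Since $\bar{x}_l$ has at most k digits, if p is even, then $\bar{x}_l$ has at most $k = \lceil p/2 \rceil = \lfloor p/2 \rfloor$ digits. Otherwise, β = 2 and $\bar{x}_l < 2^{k-1}$ is representable with $k - 1 \le \lfloor p/2 \rfloor$ significant bits. The second case is when $\bar{x}_l > (\beta/2)\beta^{k-1}$, and then computing $x_h$ involves rounding up, so $x_h = \bar{x}_h + \beta^k$, and $x_l = x - x_h = x - \bar{x}_h - \beta^k = \bar{x}_l - \beta^k$. Once again, $x_l$ has at most k digits, so is representable with $\lfloor p/2 \rfloor$ digits. Finally, if $\bar{x}_l = (\beta/2)\beta^{k-1}$, then $x_h = \bar{x}_h$ or $\bar{x}_h + \beta^k$ depending on whether there is a round up. So $x_l$ is either $(\beta/2)\beta^{k-1}$ or $(\beta/2)\beta^{k-1} - \beta^k = -\beta^k/2$, both of which are represented with 1 digit. ∎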
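For IEEE doubles, p = 53, so k = 27 and $m = 2^{27} + 1$. The sketch below performs the split of Theorem 6 in Python and uses it to write a product as a sum of four partial products, each of which is computed without rounding; the function names are hypothetical, and overflow and underflow are ignored.

```python
from fractions import Fraction

SPLITTER = 2.0 ** 27 + 1.0    # m = beta**k + 1 with beta = 2, k = 27

def split(x):
    """Return (x_h, x_l) with x = x_h + x_l, following Theorem 6."""
    t = SPLITTER * x
    x_h = t - (t - x)
    x_l = x - x_h
    return x_h, x_l

def partial_products(x, y):
    """Four products of half-width pieces; their exact sum is x*y."""
    x_h, x_l = split(x)
    y_h, y_l = split(y)
    return [x_h * y_h, x_h * y_l, x_l * y_h, x_l * y_l]

def significant_bits(v):
    """Number of bits in the significand of v once trailing zeros are removed."""
    numerator, _ = v.as_integer_ratio()   # lowest terms, so the numerator is odd or 0
    return abs(numerator).bit_length()

x, y = 0.1, 0.3
x_h, x_l = split(x)
terms = partial_products(x, y)
```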

Theorem 6 gives a way to express the product of two working precision numbers exactly as a sum. There is a companion formula for expressing a sum exactly. If $|x| \ge |y|$ then $x + y = (x \oplus y) + (x \ominus (x \oplus y)) \oplus y$ [Dekker 1971; Knuth 1981, Theorem C in section 4.2.2]. However, when using exactly rounded operations, this formula is only true for β = 2, and not for β = 10 as the example x = .99998, y = .99997 shows.
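On binary IEEE arithmetic the companion formula can be verified against exact rational arithmetic. In this sketch the name fast_two_sum is hypothetical, and the caller is assumed to guarantee $|x| \ge |y|$.

```python
from fractions import Fraction

def fast_two_sum(x, y):
    """Return (s, e) with s the rounded sum and s + e equal to x + y exactly."""
    s = x + y
    e = (x - s) + y
    return s, e

s1, e1 = fast_two_sum(1.0, 1e-17)   # y is entirely lost in the rounded sum
s2, e2 = fast_two_sum(0.3, 0.1)
```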

Binary to Decimal Conversion

Since single precision has p = 24, and $2^{24} < 10^8$, you might expect that converting a binary number to 8 decimal digits would be sufficient to recover the original binary number. However, this is not the case.

Theorem 15

When a binary IEEE single precision number is converted to the closest eight digit decimal number, it is not always possible to uniquely recover the binary number from the decimal one. However, if nine decimal digits are used, then converting the decimal number to the closest binary number will recover the original floating-point number

Proof

Binary single precision numbers lying in the half open interval $[10^3, 2^{10}) = [1000, 1024)$ have 10 bits to the left of the binary point, and 14 bits to the right of the binary point. Thus there are $(2^{10} - 10^3)2^{14} = 393{,}216$ different binary numbers in that interval. If decimal numbers are represented with 8 digits, then there are $(2^{10} - 10^3)10^4 = 240{,}000$ decimal numbers in the same interval. There is no way that 240,000 decimal numbers could represent 393,216 different binary numbers. So 8 decimal digits are not enough to uniquely represent each single precision binary number. To show that 9 digits are sufficient, it is enough to show that the spacing between binary numbers is always greater than the spacing between decimal numbers. This will ensure that for each decimal number N, the interval

$[N - \frac{1}{2}\,\mathrm{ulp},\; N + \frac{1}{2}\,\mathrm{ulp}]$

contains at most one binary number. Thus each binary number rounds to a unique decimal number which in turn rounds to a unique binary number. To show that the spacing between binary numbers is always greater than the spacing between decimal numbers, consider an interval $[10^n, 10^{n+1}]$. On this interval, the spacing between consecutive decimal numbers is $10^{(n+1)-9}$. On $[10^n, 2^m]$, where m is the smallest integer so that $10^n < 2^m$, the spacing of binary numbers is $2^{m-24}$, and the spacing gets larger further on in the interval. Thus it is enough to check that $10^{(n+1)-9} < 2^{m-24}$. But in fact, since $10^n < 2^m$, then $10^{(n+1)-9} = 10^n 10^{-8} < 2^m 10^{-8} < 2^m 2^{-24}$. ∎

The same argument applied to double precision shows that 17 decimal digits are required to recover a double precision number
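The counting argument can be observed directly. The sketch below emulates single precision with the struct module and tries every single precision number in [1000, 1001), printing each with 8 or 9 significant digits and reading it back; the helper names are hypothetical, and the decimal string passes through a double before being rounded to single, which does not affect the outcome here.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def survives_round_trip(x, digits):
    """Format the single x with the given number of significant digits and read it back."""
    text = '%.*e' % (digits - 1, x)
    return to_single(float(text)) == x

step = 2.0 ** -14                                        # spacing of singles in [1000, 1024)
singles = [1000.0 + i * step for i in range(2 ** 14)]    # every single in [1000, 1001)

failures_8 = [x for x in singles if not survives_round_trip(x, 8)]
failures_9 = [x for x in singles if not survives_round_trip(x, 9)]
```

There are 16,384 binary numbers in this subinterval but only about 10,000 eight-digit decimals, so many of the eight-digit round trips must fail, while every nine-digit round trip succeeds.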

Binary-decimal conversion also provides another example of the use of flags. Recall from an earlier section that to recover a binary number from its decimal expansion, the decimal to binary conversion must be computed exactly. That conversion is performed by multiplying the quantities N and $10^{|P|}$ (which are both exact if P < 13) in single-extended precision and then rounding this to single precision (or dividing if P < 0; both cases are similar). Of course the computation of $N \cdot 10^{|P|}$ cannot be exact; it is the combined operation round($N \cdot 10^{|P|}$) that must be exact, where the rounding is from single-extended to single precision. To see why it might fail to be exact, take the simple case of β = 10, p = 2 for single, and p = 3 for single-extended. If the product is to be 12.51, then this would be rounded to 12.5 as part of the single-extended multiply operation. Rounding to single precision would give 12. But that answer is not correct, because rounding the product to single precision should give 13. The error is due to double rounding.
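The decimal example can be reproduced with Python's decimal module, using one context with two digits of precision and one with three. The context names are hypothetical, and round-half-even is assumed for both roundings.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

single = Context(prec=2, rounding=ROUND_HALF_EVEN)      # p = 2
extended = Context(prec=3, rounding=ROUND_HALF_EVEN)    # p = 3

product = Decimal('12.51')
rounded_twice = single.plus(extended.plus(product))     # 12.51 -> 12.5 -> 12
rounded_once = single.plus(product)                     # 12.51 -> 13
```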

By using the IEEE flags, double rounding can be avoided as follows. Save the current value of the inexact flag, and then reset it. Set the rounding mode to round-to-zero. Then perform the multiplication N · 10. P. Store the new value of the inexact flag in

     x = x*x

39, and restore the rounding mode and inexact flag. If

     x = x*x

39 is 0, then N · 10. P. is exact, so round(N · 10. P. ) will be correct down to the last bit. If

     x = x*x

39 is 1, then some digits were truncated, since round-to-zero always truncates. The significand of the product will look like 1. b1. b22b23. b31. A double rounding error may occur if b23 . b31 = 10. 0. Một cách đơn giản để giải thích cho cả hai trường hợp là thực hiện một

     x = x*x

42 hợp lý của

     x = x*x

39 với b31. Sau đó làm tròn (N · 10. P. ) sẽ được tính toán chính xác trong mọi trường hợp
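The decimal example (12.51 with p = 3 extended and p = 2 single) can be replayed with Python's `decimal` module. This is a sketch under stated assumptions: the function names are hypothetical, and the "sticky digit" step is a decimal analogue of OR-ing the inexact flag into the last extended bit, chosen for illustration rather than taken from the paper.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_DOWN, Inexact

def round_once(x: Decimal) -> Decimal:
    """Round directly to 2 significant digits (the correct answer)."""
    return Context(prec=2, rounding=ROUND_HALF_EVEN).plus(x)

def round_twice(x: Decimal) -> Decimal:
    """Round to 3 digits, then to 2: subject to double rounding."""
    ext = Context(prec=3, rounding=ROUND_HALF_EVEN).plus(x)
    return Context(prec=2, rounding=ROUND_HALF_EVEN).plus(ext)

def round_with_flag(x: Decimal) -> Decimal:
    """Truncate to 3 digits, record the inexact flag, fold it into the
    last digit, then round to 2 digits."""
    ctx = Context(prec=3, rounding=ROUND_DOWN)
    ctx.clear_flags()
    ext = ctx.plus(x)                       # round-to-zero
    inexact = bool(ctx.flags[Inexact])      # were any digits discarded?
    sign, digits, exp = ext.as_tuple()
    # Decimal analogue of the logical OR: if digits were lost and the last
    # kept digit is 0 or 5, nudge it so the value is no longer a tie/exact.
    if inexact and digits[-1] in (0, 5):
        ext = Decimal((sign, digits[:-1] + (digits[-1] + 1,), exp))
    return Context(prec=2, rounding=ROUND_HALF_EVEN).plus(ext)

x = Decimal("12.51")
print(round_once(x), round_twice(x), round_with_flag(x))   # 13 12 13
```

In binary the nudge reduces to a single OR because the only problematic pattern is a trailing 10...0; base 10 needs the slightly wider test shown here.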

Errors in Summation

An earlier section mentioned the problem of accurately computing very long sums. The simplest approach to improving accuracy is to double the precision. To get a rough estimate of how much doubling the precision improves the accuracy of a sum, let $s_1 = x_1$, $s_2 = s_1 \oplus x_2$, ..., $s_i = s_{i-1} \oplus x_i$. Then $s_i = (1 + \delta_i)(s_{i-1} + x_i)$, where $|\delta_i| \le \varepsilon$, and ignoring second order terms in $\delta_i$ gives

$$s_n = \sum_{j=1}^{n} x_j \Bigl(1 + \sum_{k=j}^{n} \delta_k\Bigr) = \sum_{j=1}^{n} x_j + \sum_{j=1}^{n} x_j \Bigl(\sum_{k=j}^{n} \delta_k\Bigr) \qquad (31)$$

The first equality of (31) shows that the computed value of $\sum x_j$ is the same as if an exact summation was performed on perturbed values of $x_j$. The first term $x_1$ is perturbed by $n\varepsilon$, the last term $x_n$ by only $\varepsilon$. The second equality in (31) shows that the error term is bounded by $n\varepsilon \sum |x_j|$. Doubling the precision has the effect of squaring $\varepsilon$. If the sum is being done in an IEEE double precision format, $1/\varepsilon \approx 10^{16}$, so that $n\varepsilon \ll 1$ for any reasonable value of $n$. Thus, doubling the precision takes the maximum perturbation of $n\varepsilon$ and changes it to $n\varepsilon^2 \ll \varepsilon$. Thus the $2\varepsilon$ error bound for the Kahan summation formula (Theorem 8) is not as good as using double precision, even though it is much better than single precision.

For an intuitive explanation of why the Kahan summation formula works, consider the steps of the procedure one at a time.

Each time a summand is added, there is a correction factor C which will be applied on the next loop. So first subtract the correction C computed in the previous loop from Xj, giving the corrected summand Y. Then add this summand to the running sum S, giving T. The low order bits of Y (namely Yl) are lost in the sum. Next compute the high order bits of Y by computing T - S. When Y is subtracted from this, the low order bits of Y will be recovered. These are the bits that were lost in the first sum. They become the correction factor for the next loop. A formal proof of Theorem 8, taken from Knuth [1981] page 572, appears in the section "Theorem 14 and Theorem 8."
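The steps just described translate directly into code. The sketch below is illustrative only: the names `naive_sum` and `kahan_sum` are hypothetical, the loop starts from S = 0 rather than S = X[1] (equivalent here), and `math.fsum` is used as a correctly rounded reference.

```python
import math

def naive_sum(xs):
    """Plain left-to-right summation."""
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Kahan (compensated) summation."""
    s = 0.0          # running sum S
    c = 0.0          # correction C carried from the previous iteration
    for x in xs:
        y = x - c            # corrected summand Y
        t = s + y            # low order bits of Y are lost here
        c = (t - s) - y      # recover (minus) the lost low order bits
        s = t
    return s

xs = [0.1] * 1_000_000
reference = math.fsum(xs)
print(abs(naive_sum(xs) - reference), abs(kahan_sum(xs) - reference))
```

The parentheses in `(t - s) - y` matter: an optimizer that treats floating-point arithmetic as real arithmetic would simplify the correction to zero. CPython performs no such rewriting.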

Summary

It is not uncommon for computer system designers to neglect the parts of a system related to floating-point. This is probably due to the fact that floating-point is given very little (if any) attention in the computer science curriculum. This in turn has caused the apparently widespread belief that floating-point is not a quantifiable subject, and so there is little point in fussing over the details of hardware and software that deal with it.

This paper has demonstrated that it is possible to reason rigorously about floating-point. For example, floating-point algorithms involving cancellation can be proven to have small relative errors if the underlying hardware has a guard digit, and there is an efficient algorithm for binary-decimal conversion that can be proven to be invertible, provided that extended precision is supported. The task of constructing reliable floating-point software is made much easier when the underlying computer system is supportive of floating-point. In addition to the two examples just mentioned (guard digits and extended precision), earlier sections of this paper give examples ranging from instruction set design to compiler optimization that illustrate how to better support floating-point.

The increasing acceptance of the IEEE floating-point standard means that codes that utilize features of the standard are becoming ever more portable. Earlier sections gave numerous examples illustrating how the features of the IEEE standard can be used in writing practical floating-point codes.

Acknowledgments

This article was inspired by a course given by W. Kahan at Sun Microsystems from May through July of 1988, which was very ably organized by David Hough of Sun. My hope is to enable others to learn about the interaction of floating-point and computer systems without having to get up in time to attend 8:00 a.m. lectures. Thanks are due to Kahan and many of my colleagues at Xerox PARC (especially John Gilbert) for reading drafts of this paper and providing many useful comments. Reviews from Paul Hilfinger and an anonymous referee also helped improve the presentation.

References

Aho, Alfred V., Sethi, R., and Ullman J. D. 1986. Compilers: Principles, Techniques and Tools, Addison-Wesley, Reading, MA.

ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978, American National Standards Institute, New York, NY.

Barnett, David 1987. A Portable Floating-Point Environment, unpublished manuscript.

Brown, W. S. 1981. A Simple but Realistic Model of Floating-Point Computation, ACM Trans. on Math. Software 7(4), pp. 445-480.

Cody, W. J et al. 1984. A Proposed Radix- and Word-length-independent Standard for Floating-point Arithmetic, IEEE Micro 4(4), pp. 86-100.

Cody, W. J. 1988. Floating-Point Standards -- Theory and Practice, in "Reliability in Computing: the role of interval methods in scientific computing", ed. by Ramon E. Moore, pp. 99-107, Academic Press, Boston, MA.

Coonen, Jerome 1984. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, PhD Thesis, Univ. of California, Berkeley.

Dekker, T. J. 1971. A Floating-Point Technique for Extending the Available Precision, Numer. Math. 18(3), pp. 224-242.

Demmel, James 1984. Underflow and the Reliability of Numerical Software, SIAM J. Sci. Stat. Comput. 5(4), pp. 887-919.

Farnum, Charles 1988. Compiler Support for Floating-point Computation, Software-Practice and Experience, 18(7), pp. 701-709.

Forsythe, G. E. and Moler, C. B. 1967. Computer Solution of Linear Algebraic Systems, Prentice-Hall, Englewood Cliffs, NJ.

Goldberg, I. Bennett 1967. 27 Bits Are Not Enough for 8-Digit Accuracy, Comm. of the ACM. 10(2), pp. 105-106.

Goldberg, David 1990. Computer Arithmetic, in "Computer Architecture: A Quantitative Approach", by David Patterson and John L. Hennessy, Appendix A, Morgan Kaufmann, Los Altos, CA.

Golub, Gene H. and Van Loan, Charles F. 1989. Matrix Computations, 2nd edition, The Johns Hopkins University Press, Baltimore, Maryland.

Graham, Ronald L., Knuth, Donald E. and Patashnik, Oren. 1989. Concrete Mathematics, Addison-Wesley, Reading, MA, p. 162.

Hewlett Packard 1982. HP-15C Advanced Functions Handbook.

IEEE 1987. IEEE Standard 754-1985 for Binary Floating-point Arithmetic, IEEE, (1985). Reprinted in SIGPLAN 22(2) pp. 9-25.

Kahan, W. 1972. A Survey of Error Analysis, in Information Processing 71, Vol 2, pp. 1214-1239 (Ljubljana, Yugoslavia), North Holland, Amsterdam.

Kahan, W. 1986. Calculating Area and Angle of a Needle-like Triangle, unpublished manuscript.

Kahan, W. 1987. Branch Cuts for Complex Elementary Functions, in "The State of the Art in Numerical Analysis", ed. by M. J. D. Powell and A. Iserles (Univ. of Birmingham, England), Chapter 7, Oxford University Press, New York.

Kahan, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, CA.

Kahan, W. and Coonen, Jerome T. 1982. The Near Orthogonality of Syntax, Semantics, and Diagnostics in Numerical Programming Environments, in "The Relationship Between Numerical Computation and Programming Languages", ed. by J. K. Reid, pp. 103-115, North-Holland, Amsterdam.

Kahan, W. and LeBlanc, E. 1985. Anomalies in the IBM Acrith Package, Proc. 7th IEEE Symposium on Computer Arithmetic (Urbana, Illinois), pp. 322-331.

Kernighan, Brian W. and Ritchie, Dennis M. 1978. The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ.

Kirchner, R. and Kulisch, U. 1987. Arithmetic for Vector Processors, Proc. 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.

Knuth, Donald E., 1981. The Art of Computer Programming, Volume II, Second Edition, Addison-Wesley, Reading, MA.

Kulisch, U. W., and Miranker, W. L. 1986. The Arithmetic of the Digital Computer: A New Approach, SIAM Review 28(1), pp. 1-36.

Matula, D. W. and Kornerup, P. 1985. Finite Precision Rational Arithmetic: Slash Number Systems, IEEE Trans. on Comput. C-34(1), pp. 3-18.

Nelson, G. 1991. Systems Programming With Modula-3, Prentice-Hall, Englewood Cliffs, NJ.

Reiser, John F. and Knuth, Donald E. 1975. Evading the Drift in Floating-point Addition, Information Processing Letters 3(3), pp. 84-87.

Sterbenz, Pat H. 1974. Floating-Point Computation, Prentice-Hall, Englewood Cliffs, NJ.

Swartzlander, Earl E. and Alexopoulos, Aristides G. 1975. The Sign/Logarithm Number System, IEEE Trans. Comput. C-24(12), pp. 1238-1242.

Walther, J. S., 1971. A unified algorithm for elementary functions, Proceedings of the AFIP Spring Joint Computer Conf. 38, pp. 379-385.

Theorem 14 and Theorem 8

This section contains two of the more technical proofs that were omitted from the text.

Theorem 14

Let $0 < k < p$, set $m = \beta^k + 1$, and assume that floating-point operations are exactly rounded. Then $(m \otimes x) \ominus (m \otimes x \ominus x)$ is exactly equal to $x$ rounded to $p - k$ significant digits. More precisely, $x$ is rounded by taking the significand of $x$, imagining a radix point just left of the $k$ least significant digits, and rounding to an integer.

Proof

The proof breaks up into two cases, depending on whether or not the computation of $mx = \beta^k x + x$ has a carry-out. Assume first that there is no carry-out. It is harmless to scale $x$ so that it is an integer. Then the computation of $mx = x + \beta^k x$ looks like this:

            aa...aabb...bb
     +aa...aabb...bb
      zz.........zzbb...bb

where $x$ has been partitioned into two parts. The low order $k$ digits are marked b and the high order $p - k$ digits are marked a. To compute $m \otimes x$ from $mx$ involves rounding off the low order $k$ digits (the ones marked b), so

$$m \otimes x = mx - x \bmod \beta^k + r\beta^k \qquad (32)$$

The value of $r$ is 1 if .bb...b is greater than 1/2 and 0 otherwise. More precisely,

$$r = 1 \text{ if } a.bb \ldots b \text{ rounds to } a + 1, \quad r = 0 \text{ otherwise.} \qquad (33)$$

Next compute $m \otimes x - x = mx - x \bmod \beta^k + r\beta^k - x = \beta^k(x + r) - x \bmod \beta^k$. The picture below shows the computation of $m \otimes x - x$ rounded, that is, $(m \otimes x) \ominus x$. The top line is $\beta^k(x + r)$, where B is the digit that results from adding $r$ to the lowest order digit b.

      aa...aabb...bB00...00
     -              bb...bb
      zz...zzZ00...00

If .bb...b < 1/2, then $r = 0$, subtracting causes a borrow from the digit marked B, but the difference is rounded up, and so the net effect is that the rounded difference equals the top line, which is $\beta^k x$. If .bb...b > 1/2, then $r = 1$, and 1 is subtracted from B because of the borrow, so the result is $\beta^k x$. Finally consider the case .bb...b = 1/2. If $r = 0$ then B is even, Z is odd, and the difference is rounded up, giving $\beta^k x$. Similarly when $r = 1$, B is odd, Z is even, the difference is rounded down, so again the difference is $\beta^k x$. To summarize,

$$(m \otimes x) \ominus x = \beta^k x \qquad (34)$$

Combining equations (32) and (34) gives $(m \otimes x) - (m \otimes x \ominus x) = x - x \bmod \beta^k + r\beta^k$. The result of performing this computation is

             r00...00
     + aa...aabb...bb
     -        bb...bb
       aa...aA00...00

The rule for computing $r$, equation (33), is the same as the rule for rounding a...ab...b to $p - k$ places. Thus computing $mx - (mx - x)$ in floating-point arithmetic is exactly equal to rounding $x$ to $p - k$ places, in the case when $x + \beta^k x$ does not carry out.

When $x + \beta^k x$ does carry out, then $mx = \beta^k x + x$ looks like this:

            aa...aabb...bb
     +aa...aabb...bb
      zz.........zZbb...bb

Thus, $m \otimes x = mx - x \bmod \beta^k + w\beta^k$, where $w = -Z$ if $Z < \beta/2$, but the exact value of $w$ is unimportant. Next, $m \otimes x - x = \beta^k x - x \bmod \beta^k + w\beta^k$. In a picture:

      aa...aabb...bb00...00
     -              bb...bb
     +             w
      zz...zZbb...bb

Rounding gives $(m \otimes x) \ominus x = \beta^k x + w\beta^k - r\beta^k$, where $r = 1$ if .bb...b > 1/2, or if .bb...b = 1/2 and $b_0 = 1$. Finally,

$$(m \otimes x) - (m \otimes x \ominus x) = mx - x \bmod \beta^k + w\beta^k - (\beta^k x + w\beta^k - r\beta^k) = x - x \bmod \beta^k + r\beta^k.$$

And once again, $r = 1$ exactly when rounding a...ab...b to $p - k$ places involves rounding up. Thus Theorem 14 is proven in all cases. ∎
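Theorem 14 can be exercised numerically in the binary case. The sketch below is an illustration, not part of the proof: it assumes Python floats are IEEE doubles (β = 2, p = 53) with round-to-nearest-even, takes k = 27 so that the high part keeps p - k = 26 bits, and uses hypothetical helper names `split` and `significant_bits`.

```python
import math

def split(x: float, k: int = 27):
    """Return (hi, lo) with hi = x rounded to 53 - k significant bits,
    computed as (m*x) - (m*x - x) with m = 2**k + 1, and lo = x - hi."""
    m = float(2**k + 1)
    mx = m * x
    hi = mx - (mx - x)
    lo = x - hi
    return hi, lo

def significant_bits(v: float) -> int:
    """Number of bits between the leading and trailing 1 of v's significand."""
    if v == 0.0:
        return 0
    mant, _ = math.frexp(abs(v))     # mant in [0.5, 1)
    n = int(mant * 2**53)            # exact: scaling by a power of two
    trailing_zeros = (n & -n).bit_length() - 1
    return 53 - trailing_zeros

hi, lo = split(math.pi)
print(hi, lo, significant_bits(hi), significant_bits(lo))
```

Because both parts fit in 26 bits, products such as `hi * hi` and `hi * lo` are exact in double precision, which is what makes the split useful for simulated multiple precision arithmetic.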

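Theorem 8 (Kahan Summation Formula)

Suppose that $\sum_{j=1}^{N} x_j$ is computed using the following algorithm

     S = X[1];
     C = 0;
     for j = 2 to N {
         Y = X[j] - C;
         T = S + Y;
         C = (T - S) - Y;
         S = T;
     }

Then the computed sum $S$ is equal to $S = \sum x_j (1 + \delta_j) + O(N\varepsilon^2) \sum |x_j|$, where $|\delta_j| \le 2\varepsilon$.

Proof

First recall how the error estimate for the simple formula $\sum x_i$ went. Introduce $s_1 = x_1$, $s_i = (1 + \delta_i)(s_{i-1} + x_i)$. Then the computed sum is $s_n$, which is a sum of terms, each of which is an $x_i$ multiplied by an expression involving $\delta_j$'s. The exact coefficient of $x_1$ is $(1 + \delta_2)(1 + \delta_3) \cdots (1 + \delta_n)$, and so by renumbering, the coefficient of $x_2$ must be $(1 + \delta_3)(1 + \delta_4) \cdots (1 + \delta_n)$, and so on. The proof of Theorem 8 runs along exactly the same lines, only the coefficient of $x_1$ is more complicated. In detail, $s_0 = c_0 = 0$ and

$$\begin{aligned}
y_k &= x_k \ominus c_{k-1} = (x_k - c_{k-1})(1 + \eta_k) \\
s_k &= s_{k-1} \oplus y_k = (s_{k-1} + y_k)(1 + \sigma_k) \\
c_k &= (s_k \ominus s_{k-1}) \ominus y_k = [(s_k - s_{k-1})(1 + \gamma_k) - y_k](1 + \delta_k)
\end{aligned}$$

where all the Greek letters are bounded by $\varepsilon$. Although the coefficient of $x_1$ in $s_k$ is the ultimate expression of interest, it turns out to be easier to compute the coefficient of $x_1$ in $s_k - c_k$ and $c_k$. When $k = 1$,

$$\begin{aligned}
c_1 &= (s_1(1 + \gamma_1) - y_1)(1 + \delta_1) \\
    &= y_1((1 + \sigma_1)(1 + \gamma_1) - 1)(1 + \delta_1) \\
    &= x_1(\sigma_1 + \gamma_1 + \sigma_1\gamma_1)(1 + \delta_1)(1 + \eta_1) \\
s_1 - c_1 &= x_1[(1 + \sigma_1) - (\sigma_1 + \gamma_1 + \sigma_1\gamma_1)(1 + \delta_1)](1 + \eta_1) \\
    &= x_1[1 - \gamma_1 - \sigma_1\delta_1 - \sigma_1\gamma_1 - \delta_1\gamma_1 - \sigma_1\gamma_1\delta_1](1 + \eta_1)
\end{aligned}$$

Calling the coefficients of $x_1$ in these expressions $C_k$ and $S_k$ respectively, then

$$\begin{aligned}
C_1 &= 2\varepsilon + O(\varepsilon^2) \\
S_1 &= 1 + \eta_1 - \gamma_1 + 4\varepsilon^2 + O(\varepsilon^3)
\end{aligned}$$

To get the general formula for $S_k$ and $C_k$, expand the definitions of $s_k$ and $c_k$, ignoring all terms involving $x_i$ with $i > 1$, to get

$$\begin{aligned}
s_k &= (s_{k-1} + y_k)(1 + \sigma_k) \\
    &= [s_{k-1} + (x_k - c_{k-1})(1 + \eta_k)](1 + \sigma_k) \\
    &= [(s_{k-1} - c_{k-1}) - \eta_k c_{k-1}](1 + \sigma_k) \\
c_k &= [\{s_k - s_{k-1}\}(1 + \gamma_k) - y_k](1 + \delta_k) \\
    &= [\{((s_{k-1} - c_{k-1}) - \eta_k c_{k-1})(1 + \sigma_k) - s_{k-1}\}(1 + \gamma_k) + c_{k-1}(1 + \eta_k)](1 + \delta_k) \\
    &= [\{(s_{k-1} - c_{k-1})\sigma_k - \eta_k c_{k-1}(1 + \sigma_k) - c_{k-1}\}(1 + \gamma_k) + c_{k-1}(1 + \eta_k)](1 + \delta_k) \\
    &= [(s_{k-1} - c_{k-1})\sigma_k(1 + \gamma_k) - c_{k-1}(\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))](1 + \delta_k) \\
s_k - c_k &= ((s_{k-1} - c_{k-1}) - \eta_k c_{k-1})(1 + \sigma_k) \\
    &\quad - [(s_{k-1} - c_{k-1})\sigma_k(1 + \gamma_k) - c_{k-1}(\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))](1 + \delta_k) \\
    &= (s_{k-1} - c_{k-1})((1 + \sigma_k) - \sigma_k(1 + \gamma_k)(1 + \delta_k)) \\
    &\quad + c_{k-1}(-\eta_k(1 + \sigma_k) + (\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))(1 + \delta_k)) \\
    &= (s_{k-1} - c_{k-1})(1 - \sigma_k(\gamma_k + \delta_k + \gamma_k\delta_k)) \\
    &\quad + c_{k-1}[-\eta_k + \gamma_k + \eta_k(\gamma_k + \sigma_k\gamma_k) + (\gamma_k + \eta_k(\sigma_k + \gamma_k + \sigma_k\gamma_k))\delta_k]
\end{aligned}$$

Since $S_k$ and $C_k$ are only being computed up to order $\varepsilon^2$, these formulas can be simplified to

$$\begin{aligned}
C_k &= (\sigma_k + O(\varepsilon^2))S_{k-1} + (-\gamma_k + O(\varepsilon^2))C_{k-1} \\
S_k &= (1 + 2\varepsilon^2 + O(\varepsilon^3))S_{k-1} + (2\varepsilon + O(\varepsilon^2))C_{k-1}
\end{aligned}$$

Using these formulas gives

$$\begin{aligned}
C_2 &= \sigma_2 + O(\varepsilon^2) \\
S_2 &= 1 + \eta_1 - \gamma_1 + 10\varepsilon^2 + O(\varepsilon^3)
\end{aligned}$$

and in general it is easy to check by induction that

$$\begin{aligned}
C_k &= \sigma_k + O(\varepsilon^2) \\
S_k &= 1 + \eta_1 - \gamma_1 + (4k + 2)\varepsilon^2 + O(\varepsilon^3)
\end{aligned}$$

Finally, what is wanted is the coefficient of $x_1$ in $s_k$. To get this value, let $x_{n+1} = 0$, let all the Greek letters with subscripts of $n + 1$ equal 0, and compute $s_{n+1}$. Then $s_{n+1} = s_n - c_n$, and the coefficient of $x_1$ in $s_n$ is less than the coefficient in $s_{n+1}$, which is $S_n = 1 + \eta_1 - \gamma_1 + (4n + 2)\varepsilon^2 = (1 + 2\varepsilon + O(n\varepsilon^2))$. ∎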

Differences Among IEEE 754 Implementations

Note - This section is not part of the published paper. It has been added to clarify certain points and correct possible misconceptions about the IEEE standard that the reader might infer from the paper. This material was not written by David Goldberg, but it appears here with his permission.

The preceding paper has shown that floating-point arithmetic must be implemented carefully, since programmers may depend on its properties for the correctness and accuracy of their programs. In particular, the IEEE standard requires a careful implementation, and it is possible to write useful programs that work correctly and deliver accurate results only on systems that conform to the standard. The reader might be tempted to conclude that such programs should be portable to all IEEE systems. Indeed, portable software would be easier to write if the remark "When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic," were true.

Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical results on all conforming systems. Most programs will actually produce different results on different systems for a variety of reasons. For one, most programs involve the conversion of numbers between decimal and binary formats, and the IEEE standard does not completely specify the accuracy with which such conversions must be performed. For another, many programs use elementary functions supplied by a system library, and the standard doesn't specify these functions at all. Of course, most programmers know that these features lie beyond the scope of the IEEE standard.

Many programmers may not realize that even a program that uses only the numeric formats and operations prescribed by the IEEE standard can compute different results on different systems. In fact, the authors of the standard intended to allow different implementations to obtain different results. Their intent is evident in the definition of the term destination in the IEEE 754 standard: "A destination may be either explicitly designated by the user or implicitly supplied by the system (for example, intermediate results in subexpressions or arguments for procedures). Some languages place the results of intermediate calculations in destinations beyond the user's control. Nonetheless, this standard defines the result of an operation in terms of that destination's format and the operands' values." (IEEE 754-1985, p. 7) In other words, the IEEE standard requires that each result be rounded correctly to the precision of the destination into which it will be placed, but the standard does not require that the precision of that destination be determined by a user's program. Thus, different systems may deliver their results to destinations with different precisions, causing the same program to produce different results (sometimes dramatically so), even though those systems all conform to the standard.

Several of the examples in the preceding paper depend on some knowledge of the way floating-point arithmetic is rounded. In order to rely on examples such as these, a programmer must be able to predict how a program will be interpreted, and in particular, on an IEEE system, what the precision of the destination of each arithmetic operation may be. Alas, the loophole in the IEEE standard's definition of destination undermines the programmer's ability to know how a program will be interpreted. Consequently, several of the examples given above, when implemented as apparently portable programs in a high-level language, may not work correctly on IEEE systems that normally deliver results to destinations with a different precision than the programmer expects. Other examples may work, but proving that they work may lie beyond the average programmer's ability.

In this section, we classify existing implementations of IEEE 754 arithmetic based on the precisions of the destination formats they normally use. We then review some examples from the paper to show that delivering results in a wider precision than a program expects can cause it to compute wrong results even though it is provably correct when the expected precision is used. We also revisit one of the proofs in the paper to illustrate the intellectual effort required to cope with unexpected precision even when it doesn't invalidate our programs. These examples show that despite all that the IEEE standard prescribes, the differences it allows among different implementations can prevent us from writing portable, efficient numerical software whose behavior we can accurately predict. To develop such software, then, we must first create programming languages and environments that limit the variability the IEEE standard permits and allow programmers to express the floating-point semantics upon which their programs depend.

Current IEEE 754 Implementations

Current implementations of IEEE 754 arithmetic can be divided into two groups distinguished by the degree to which they support different floating-point formats in hardware. Extended-based systems, exemplified by the Intel x86 family of processors, provide full support for an extended double precision format but only partial support for single and double precision: they provide instructions to load or store data in single and double precision, converting it on the fly to or from the extended double format, and they provide special modes (not the default) in which the results of arithmetic operations are rounded to single or double precision. (Motorola 68000 series processors round results to both the precision and range of the single or double formats in these modes. Intel x86 and compatible processors round results to the precision of the single or double formats but retain the same range as the extended double format.) Single/double systems, including most RISC processors, provide full support for single and double precision formats but no support for an IEEE compliant extended double precision format. (The IBM POWER architecture provides only partial support for single precision, but for the purpose of this section, we classify it as a single/double system.)

To see how a computation might behave differently on an extended-based system than on a single/double system, consider a C version of the example from an earlier section:

     #include <stdio.h>

     int main() {
         double q;

         q = 3.0/7.0;
         if (q == 3.0/7.0) printf("Equal\n");
         else printf("Not Equal\n");
         return 0;
     }

Here the constants 3.0 and 7.0 are interpreted as double precision floating-point numbers, and the expression 3.0/7.0 inherits the double data type. On a single/double system, the expression will be evaluated in double precision since that is the most efficient format to use. Thus, q will be assigned the value 3.0/7.0 rounded correctly to double precision. In the next line, the expression 3.0/7.0 will again be evaluated in double precision, and of course the result will be equal to the value just assigned to q, so the program will print "Equal" as expected.

On an extended-based system, even though the expression 3.0/7.0 has type double, the quotient will be computed in a register in extended double format, and thus in the default mode, it will be rounded to extended double precision. When the resulting value is assigned to the variable q, however, it may then be stored in memory, and since q is declared double, the value will be rounded to double precision. In the next line, the expression 3.0/7.0 may again be evaluated in extended precision, yielding a result that differs from the double precision value stored in q and causing the program to print "Not Equal". Of course, other outcomes are possible, too: the compiler could decide to store and thus round the value of the expression 3.0/7.0 in the second line before comparing it with q, or it could keep q in a register in extended precision without storing it. An optimizing compiler might evaluate the expression 3.0/7.0 at compile time, perhaps in double precision or perhaps in extended double precision. (With one x86 compiler, the program prints "Equal" when compiled with optimization and "Not Equal" when compiled for debugging.) Finally, some compilers for extended-based systems automatically change the rounding precision mode to cause operations producing results in registers to round those results to single or double precision, albeit possibly with a wider range. Thus, on these systems, we can't predict the behavior of the program simply by reading its source code and applying a basic understanding of IEEE 754 arithmetic. Neither can we accuse the hardware or the compiler of failing to provide an IEEE 754 compliant environment; the hardware has delivered a correctly rounded result to each destination, as it is required to do, and the compiler has assigned some intermediate results to destinations that are beyond the user's control, as it is allowed to do.
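Whether the C program prints "Equal" depends on the compiler and hardware, so it cannot be reproduced portably. What can be reproduced is the arithmetic behind the "Not Equal" outcome. The sketch below simulates it with exact rationals; the helper `round_to_bits` is hypothetical, and 64 and 53 are assumed as the significand widths of the x86 extended double and IEEE double formats.

```python
from fractions import Fraction

def round_to_bits(x: Fraction, p: int) -> Fraction:
    """Round x to p significant bits, ties to even (exponent range ignored)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    a = abs(x)
    # Find e with 2**e <= a < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if a < Fraction(2) ** e:
        e -= 1
    scale = Fraction(2) ** (p - 1 - e)
    n = round(a * scale)          # Fraction rounding is ties-to-even
    return sign * Fraction(n) / scale

exact = Fraction(3, 7)
q_register = round_to_bits(exact, 64)       # quotient kept in an extended register
q_memory = round_to_bits(q_register, 53)    # the same quotient after a store to a double

print("Equal" if q_memory == q_register else "Not Equal")   # prints Not Equal
```

The stored value differs from the register value simply because 3/7 has a non-terminating binary expansion, so its 64-bit rounding carries nonzero bits below the 53rd.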

Pitfalls in Computations on Extended-Based Systems

Conventional wisdom maintains that extended-based systems must produce results that are at least as accurate, if not more accurate, than those delivered on single/double systems, since the former always provide at least as much precision and often more than the latter. Trivial examples such as the C program above, as well as more subtle programs based on the examples discussed below, show that this wisdom is naive at best: some apparently portable programs, which are indeed portable across single/double systems, deliver incorrect results on extended-based systems precisely because the compiler and hardware conspire to occasionally provide more precision than the program expects.

Current programming languages make it difficult for a program to specify the precision it expects. As an earlier section mentions, many programming languages don't specify that each occurrence of an expression like 10.0*x in the same context should evaluate to the same value. Some languages, such as Ada, were influenced in this respect by variations among different arithmetics prior to the IEEE standard. More recently, languages like ANSI C have been influenced by standard-conforming extended-based systems. In fact, the ANSI C standard explicitly allows a compiler to evaluate a floating-point expression to a precision wider than that normally associated with its type. As a result, the value of the expression 10.0*x may vary in ways that depend on a variety of factors: whether the expression is immediately assigned to a variable or appears as a subexpression in a larger expression; whether the expression participates in a comparison; whether the expression is passed as an argument to a function, and if so, whether the argument is passed by value or by reference; the current precision mode; the level of optimization at which the program was compiled; the precision mode and expression evaluation method used by the compiler when the program was compiled; and so on.

Language standards are not entirely to blame for the vagaries of expression evaluation. Extended-based systems run most efficiently when expressions are evaluated in extended precision registers whenever possible, yet values that must be stored are stored in the narrowest precision required. Constraining a language to require that 10.0*x evaluate to the same value everywhere would impose a performance penalty on those systems. Unfortunately, allowing those systems to evaluate 10.0*x differently in syntactically equivalent contexts imposes a penalty of its own on programmers of accurate numerical software by preventing them from relying on the syntax of their programs to express their intended semantics.

Do real programs depend on the assumption that a given expression always evaluates to the same value? Recall the algorithm presented in Theorem 4 for computing ln(1 + x), written here in Fortran:

     real function log1p(x)
     real x
     if (1.0 + x .eq. 1.0) then
        log1p = x
     else
        log1p = log(1.0 + x) * x / ((1.0 + x) - 1.0)
     endif
     return

On an extended-based system, a compiler may evaluate the expression 1.0 + x in the third line in extended precision and compare the result with 1.0. When the same expression is passed to the log function in the sixth line, however, the compiler may store its value in memory, rounding it to single precision. Thus, if x is not so small that 1.0 + x rounds to 1.0 in extended precision but small enough that 1.0 + x rounds to 1.0 in single precision, then the value returned by log1p(x) will be zero instead of x, and the relative error will be one, rather larger than $5\varepsilon$. Similarly, suppose the rest of the expression in the sixth line, including the reoccurrence of the subexpression 1.0 + x, is evaluated in extended precision. In that case, if x is small but not quite small enough that 1.0 + x rounds to 1.0 in single precision, then the value returned by log1p(x) can exceed the correct value by nearly as much as x, and again the relative error can approach one. For a concrete example, take x to be $2^{-24} + 2^{-47}$, so x is the smallest single precision number such that 1.0 + x rounds up to the next larger number, $1 + 2^{-23}$. Then log(1.0 + x) is approximately $2^{-23}$. Because the denominator in the expression in the sixth line is evaluated in extended precision, it is computed exactly and delivers x, so log1p(x) returns approximately $2^{-23}$, which is nearly twice as large as the exact value. (This actually happens with at least one compiler. When the preceding code is compiled by the Sun WorkShop Compilers 4.2.1 Fortran 77 compiler for x86 systems using the -O optimization flag, the generated code computes 1.0 + x exactly as described. As a result, the function delivers zero for log1p(1.0e-10) and 1.19209E-07 for log1p(5.97e-8).)
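The failure mode just described can be imitated without an extended-based machine. In the sketch below, which is a simulation under stated assumptions rather than a reproduction of any compiler's output, Python's double plays the role of the wide register format and IEEE single plays the role of the stored format; `log1p_thm4`, `log1p_mixed`, and `to_single` are hypothetical names.

```python
import math
import struct

def to_single(v: float) -> float:
    """Round a double to the nearest IEEE single (simulated store to memory)."""
    return struct.unpack("f", struct.pack("f", v))[0]

def log1p_thm4(x: float) -> float:
    """Theorem 4's algorithm with 1.0 + x evaluated consistently."""
    w = 1.0 + x
    if w == 1.0:
        return x
    return math.log(w) * x / (w - 1.0)

def log1p_mixed(x: float) -> float:
    """Same formula, but the argument of log is narrowed while the
    comparison and the denominator keep the wide value of 1.0 + x."""
    wide = 1.0 + x
    if wide == 1.0:
        return x
    narrow = to_single(wide)          # inconsistent evaluation of 1.0 + x
    return math.log(narrow) * x / (wide - 1.0)

x = 2.0**-24 + 2.0**-47
print(log1p_thm4(x), log1p_mixed(x))   # the second is nearly twice the first
print(log1p_mixed(1.0e-10))            # 0.0
```

The consistent version stays accurate only because both occurrences of `1.0 + x` refer to the same stored value `w`; the source language gives no such guarantee for a repeated expression.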

For the algorithm of Theorem 4 to work correctly, the expression 1.0 + x must be evaluated the same way each time it appears; the algorithm can fail on extended-based systems only when 1.0 + x is evaluated to extended double precision in one instance and to single or double precision in another. Of course, since log is a generic intrinsic function in Fortran, a compiler could evaluate the expression 1.0 + x in extended precision throughout, computing its logarithm in the same precision, but evidently we cannot assume that the compiler will do so. (One can also imagine a similar example involving a user-defined function. In that case, a compiler could still keep the argument in extended precision even though the function returns a single precision result, but few if any existing Fortran compilers do this, either.) We might therefore attempt to ensure that 1.0 + x is evaluated consistently by assigning it to a variable. Unfortunately, if we declare that variable real, we may still be foiled by a compiler that substitutes a value kept in a register in extended precision for one appearance of the variable and a value stored in memory in single precision for another. Instead, we would need to declare the variable with a type that corresponds to the extended precision format. Standard FORTRAN 77 does not provide a way to do this, and while Fortran 95 offers the SELECTED_REAL_KIND mechanism for describing various formats, it does not explicitly require implementations that evaluate expressions in extended precision to allow variables to be declared with that precision. In short, there is no portable way to write this program in standard Fortran that is guaranteed to prevent the expression 1.0 + x from being evaluated in a way that invalidates our proof.

There are other examples that can malfunction on extended-based systems even when each subexpression is stored and thus rounded to the same precision. The cause is double-rounding. In the default precision mode, an extended-based system will initially round each result to extended double precision. If that result is then stored to double precision, it is rounded again. The combination of these two roundings can yield a value that is different from what would have been obtained by rounding the first result correctly to double precision. This can happen when the result as rounded to extended double precision is a "halfway case", i.e., it lies exactly halfway between two double precision numbers, so the second rounding is determined by the round-ties-to-even rule. If this second rounding rounds in the same direction as the first, the net rounding error will exceed half a unit in the last place. (Note, though, that double-rounding only affects double precision computations. One can prove that the sum, difference, product, or quotient of two p-bit numbers, or the square root of a p-bit number, rounded first to q bits and then to p bits gives the same value as if the result were rounded just once to p bits, provided $q \ge 2p + 2$. Thus, extended double precision is wide enough that single precision computations don't suffer double-rounding.)
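The halfway-case mechanism can be reproduced without extended-precision hardware by modelling each rounding exactly with rational arithmetic. The sketch below is illustrative only: round_to_bits is a hypothetical helper (not part of the source) that rounds an exact rational to p significant bits with ties to even and an unbounded exponent range, and the sample value x was chosen here for the demonstration.

```python
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even.

    Hypothetical helper: models a binary format with unbounded exponent range.
    """
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    # Find e such that 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    # round() on a Fraction rounds to the nearest integer, ties to even.
    return sign * round(mag / ulp) * ulp


# Sample value lying just above the midpoint between 1 and 1 + 2**-52.
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**70)

once = round_to_bits(x, 53)                      # one rounding, to double
twice = round_to_bits(round_to_bits(x, 64), 53)  # extended double, then double

ulp = Fraction(1, 2**52)
print("rounded once gives 1 + ulp:", once == 1 + ulp)
print("rounded twice gives 1     :", twice == 1)
print("net error exceeds ulp/2   :", abs(twice - x) > ulp / 2)
```

The first rounding discards the small term that placed x above the midpoint, so the second rounding sees an exact tie and resolves it toward the even neighbour.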

Some algorithms that depend on correct rounding can fail with double-rounding. In fact, even some algorithms that don't require correct rounding and work correctly on a variety of machines that don't conform to IEEE 754 can fail with double-rounding. The most useful of these are the portable algorithms for performing simulated multiple precision arithmetic mentioned in the section on exactly rounded operations. For example, the procedure described in Theorem 6 for splitting a floating-point number into high and low parts doesn't work correctly in double-rounding arithmetic: try to split the double precision number $2^{52} + 3 \times 2^{26} - 1$ into two parts each with at most 26 bits. When each operation is rounded correctly to double precision, the high order part is $2^{52} + 2^{27}$ and the low order part is $2^{26} - 1$, but when each operation is rounded first to extended double precision and then to double precision, the procedure produces a high order part of $2^{52} + 2^{28}$ and a low order part of $-2^{26} - 1$. The latter number occupies 27 bits, so its square can't be computed exactly in double precision. Of course, it would still be possible to compute the square of this number in extended double precision, but the resulting algorithm would no longer be portable to single/double systems. Also, later steps in the multiple precision multiplication algorithm assume that all partial products have been computed in double precision. Handling a mixture of double and extended double variables correctly would make the implementation significantly more expensive.
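The splitting failure can be checked with an exact model of the two rounding regimes. This is a sketch under stated assumptions: the split follows the usual form of the Theorem 6 procedure, x_hi = (m*x) - ((m*x) - x) with m = 2**k + 1 and k = 27 for double precision, which is assumed here because the theorem itself is stated elsewhere in the paper; round_to_bits, split and significant_bits are hypothetical helper names.

```python
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even (hypothetical helper)."""
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return sign * round(mag / ulp) * ulp


def to_double(v):
    """Each operation rounded once, correctly, to 53 bits."""
    return round_to_bits(v, 53)


def via_extended(v):
    """Each operation rounded to 64 bits, then again to 53 bits."""
    return round_to_bits(round_to_bits(v, 64), 53)


def split(x, k, rnd):
    """Split x into high and low parts using m = 2**k + 1 (assumed Theorem 6 form)."""
    m = 2**k + 1
    t = rnd(m * x)
    x_hi = rnd(t - rnd(t - x))
    x_lo = rnd(x - x_hi)
    return x_hi, x_lo


def significant_bits(v):
    """Number of bits between the leading and trailing one bits of an integer value."""
    n = abs(int(v))
    while n and n % 2 == 0:
        n //= 2
    return n.bit_length()


x = Fraction(2**52 + 3 * 2**26 - 1)
hi_ok, lo_ok = split(x, 27, to_double)
hi_bad, lo_bad = split(x, 27, via_extended)

print("correct rounding:", int(hi_ok), int(lo_ok), significant_bits(lo_ok), "bits in low part")
print("double rounding :", int(hi_bad), int(lo_bad), significant_bits(lo_bad), "bits in low part")
```

In both regimes the parts still sum to x; what breaks is the guarantee that each part fits in 26 bits.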

Likewise, portable algorithms for adding multiple precision numbers represented as arrays of double precision numbers can fail in double-rounding arithmetic. These algorithms typically rely on a technique similar to Kahan's summation formula. As the informal explanation of the summation formula given in the section Errors in Summation suggests, if s and y are floating-point variables with $|s| \ge |y|$ and we compute:

    t = s + y;
    e = (s - t) + y;

then in most arithmetics, e recovers exactly the roundoff error that occurred in computing t. This technique doesn't work in double-rounded arithmetic, however: if $s = 2^{52} + 1$ and $y = 1/2 - 2^{-54}$, then s + y rounds first to $2^{52} + 3/2$ in extended double precision, and this value rounds to $2^{52} + 2$ in double precision by the round-ties-to-even rule; thus the net rounding error in computing t is $1/2 + 2^{-54}$, which is not representable exactly in double precision and so can't be computed exactly by the expression shown above. Here again, it would be possible to recover the roundoff error by computing the sum in extended double precision, but then a program would have to do extra work to reduce the final outputs back to double precision, and double-rounding could afflict this process, too. For this reason, although portable programs for simulating multiple precision arithmetic by these methods work correctly and efficiently on a wide variety of machines, they do not work as advertised on extended-based systems.
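The error-recovery example with $s = 2^{52} + 1$ and $y = 1/2 - 2^{-54}$ can be traced step by step in an exact model. The following is a minimal sketch, not an implementation from the source: the variable names s, y, t and e follow the two-statement fragment t = s + y; e = (s - t) + y, and round_to_bits and recover are hypothetical helpers.

```python
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even (hypothetical helper)."""
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return sign * round(mag / ulp) * ulp


def to_double(v):
    return round_to_bits(v, 53)


def via_extended(v):
    return round_to_bits(round_to_bits(v, 64), 53)


def recover(s, y, rnd):
    """Compute t = s + y and e = (s - t) + y, rounding every operation with rnd."""
    t = rnd(s + y)
    e = rnd(rnd(s - t) + y)
    return t, e


s = Fraction(2**52 + 1)
y = Fraction(1, 2) - Fraction(1, 2**54)

t1, e1 = recover(s, y, to_double)
t2, e2 = recover(s, y, via_extended)

# With correct rounding, t + e reproduces the exact sum; with double rounding it does not.
print("correct rounding exact:", t1 + e1 == s + y)
print("double rounding exact :", t2 + e2 == s + y)
print("unrecovered error     :", (s + y) - (t2 + e2))
```

The leftover term is exactly the $2^{-54}$ that cannot be carried alongside the $1/2$ in a 53-bit significand.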

Finally, some algorithms that at first sight appear to depend on correct rounding may in fact work correctly with double-rounding. In these cases, the cost of coping with double-rounding lies not in the implementation but in the verification that the algorithm works as advertised. To illustrate, we prove the following variant of Theorem 7:

Theorem 7'

If m and n are integers representable in IEEE 754 double precision with $|m| < 2^{52}$ and n has the special form $n = 2^i + 2^j$, then $(m \oslash n) \otimes n = m$, provided both floating-point operations are either rounded correctly to double precision or rounded first to extended double precision and then to double precision.

Proof

Assume without loss of generality that $m > 0$. Let $q = m \oslash n$. Scaling by powers of two, we can consider an equivalent setting in which $2^{52} \le m < 2^{53}$ and likewise for q, so that both m and q are integers whose least significant bits occupy the units place (i.e., ulp(m) = ulp(q) = 1). Before scaling, we assumed $m < 2^{52}$, so after scaling, m is an even integer. Also, because the scaled values of m and q satisfy $m/2 < q < 2m$, the corresponding value of n must have one of two forms depending on which of m or q is larger: if $q < m$, then evidently $1 < n < 2$, and since n is a sum of two powers of two, $n = 1 + 2^{-k}$ for some k; similarly, if $q > m$, then $1/2 < n < 1$, so $n = 1/2 + 2^{-(k+1)}$. (As n is the sum of two powers of two, the closest possible value of n to one is $n = 1 + 2^{-52}$. Because $m/(1 + 2^{-52})$ is no larger than the next smaller double precision number less than m, we can't have $q = m$.)

Let e denote the rounding error in computing q, so that $q = m/n + e$, and the computed value $q \otimes n$ will be the (once or twice) rounded value of $m + ne$. Consider first the case in which each floating-point operation is rounded correctly to double precision. In this case, $|e| < 1/2$. If n has the form $1/2 + 2^{-(k+1)}$, then $ne = nq - m$ is an integer multiple of $2^{-(k+1)}$ and $|ne| < 1/4 + 2^{-(k+2)}$. This implies that $|ne| \le 1/4$. Recall that the difference between m and the next larger representable number is 1 and the difference between m and the next smaller representable number is either 1 if $m > 2^{52}$ or 1/2 if $m = 2^{52}$. Thus, as $|ne| \le 1/4$, $m + ne$ will round to m. (Even if $m = 2^{52}$ and $ne = -1/4$, the product will round to m by the round-ties-to-even rule.) Similarly, if n has the form $1 + 2^{-k}$, then ne is an integer multiple of $2^{-k}$ and $|ne| < 1/2 + 2^{-(k+1)}$; this implies $|ne| \le 1/2$. We can't have $m = 2^{52}$ in this case because m is strictly greater than q, so m differs from its nearest representable neighbors by ±1. Thus, as $|ne| \le 1/2$, again $m + ne$ will round to m. (Even if $|ne| = 1/2$, the product will round to m by the round-ties-to-even rule because m is even.) This completes the proof for correctly rounded arithmetic.

In double-rounding arithmetic, it may still happen that q is the correctly rounded quotient (even though it was actually rounded twice), so $|e| < 1/2$ as above. In this case, we can appeal to the arguments of the previous paragraph provided we consider the fact that $q \otimes n$ will be rounded twice. To account for this, note that the IEEE standard requires that an extended double format carry at least 64 significant bits, so that the numbers $m \pm 1/2$ and $m \pm 1/4$ are exactly representable in extended double precision. Thus, if n has the form $1/2 + 2^{-(k+1)}$, so that $|ne| \le 1/4$, then rounding $m + ne$ to extended double precision must produce a result that differs from m by at most 1/4, and as noted above, this value will round to m in double precision. Similarly, if n has the form $1 + 2^{-k}$, so that $|ne| \le 1/2$, then rounding $m + ne$ to extended double precision must produce a result that differs from m by at most 1/2, and this value will round to m in double precision. (Recall that $m > 2^{52}$ in this case.)

Finally, we are left to consider cases in which q is not the correctly rounded quotient due to double-rounding. In these cases, we have $|e| < 1/2 + 2^{-(d+1)}$ in the worst case, where d is the number of extra bits in the extended double format. (All existing extended-based systems support an extended double format with exactly 64 significant bits; for this format, d = 64 - 53 = 11.) Because double-rounding only produces an incorrectly rounded result when the second rounding is determined by the round-ties-to-even rule, q must be an even integer. Thus if n has the form $1/2 + 2^{-(k+1)}$, then $ne = nq - m$ is an integer multiple of $2^{-k}$, and

$$|ne| < \left(1/2 + 2^{-(k+1)}\right)\left(1/2 + 2^{-(d+1)}\right) = 1/4 + 2^{-(k+2)} + 2^{-(d+2)} + 2^{-(k+d+2)}.$$

If $k \le d$, this implies $|ne| \le 1/4$. If $k > d$, we have $|ne| \le 1/4 + 2^{-(d+2)}$. In either case, the first rounding of the product will deliver a result that differs from m by at most 1/4, and by previous arguments, the second rounding will round to m. Similarly, if n has the form $1 + 2^{-k}$, then ne is an integer multiple of $2^{-(k-1)}$, and

$$|ne| < 1/2 + 2^{-(k+1)} + 2^{-(d+1)} + 2^{-(k+d+1)}.$$

If $k \le d$, this implies $|ne| \le 1/2$. If $k > d$, we have $|ne| \le 1/2 + 2^{-(d+1)}$. In either case, the first rounding of the product will deliver a result that differs from m by at most 1/2, and again by previous arguments, the second rounding will round to m. ∎
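A proof like this one is easy to get subtly wrong, so an empirical cross-check is a reasonable complement. The sketch below exercises the statement of Theorem 7' on randomly chosen inputs in an exact model of both rounding regimes. It is only a spot check under assumptions made here: round_to_bits and roundtrip_holds are hypothetical helpers, the exponent range is treated as unbounded, and the random seed and sample size are arbitrary.

```python
import random
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even (hypothetical helper)."""
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return sign * round(mag / ulp) * ulp


def to_double(v):
    return round_to_bits(v, 53)


def via_extended(v):
    return round_to_bits(round_to_bits(v, 64), 53)


def roundtrip_holds(m, n, rnd):
    """Check (m / n) * n == m when both operations are rounded with rnd."""
    q = rnd(Fraction(m, n))
    return rnd(q * n) == m


rng = random.Random(754)  # fixed seed so the run is repeatable
failures = 0
for _ in range(2000):
    m = rng.randrange(1, 2**52)
    n = 2 ** rng.randrange(0, 53) + 2 ** rng.randrange(0, 53)
    for rnd in (to_double, via_extended):
        if not roundtrip_holds(m, n, rnd):
            failures += 1

# Theorem 7' predicts that no sampled case fails in either regime.
print("failures:", failures)
```

A clean run does not replace the proof, but a single failure would point to an error in either the argument or the model.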

The preceding proof shows that the product can incur double-rounding only if the quotient does, and even then, it rounds to the correct result. The proof also shows that extending our reasoning to include the possibility of double-rounding can be challenging even for a program with only two floating-point operations. For a more complicated program, it may be impossible to systematically account for the effects of double-rounding, not to mention more general combinations of double and extended double precision computations.

Programming Language Support for Extended Precision

The preceding examples should not be taken to suggest that extended precision per se is harmful. Many programs can benefit from extended precision when the programmer is able to use it selectively. Unfortunately, current programming languages do not provide sufficient means for a programmer to specify when and how extended precision should be used. To indicate what support is needed, we consider the ways in which we might want to manage the use of extended precision.

In a portable program that uses double precision as its nominal working precision, there are five ways we might want to control the use of a wider precision:

1. Compile to produce the fastest code, using extended precision where possible on extended-based systems. Clearly most numerical software does not require more of the arithmetic than that the relative error in each operation is bounded by the "machine epsilon". When data in memory are stored in double precision, the machine epsilon is usually taken to be the largest relative roundoff error in that precision, since the input data are (rightly or wrongly) assumed to have been rounded when they were entered and the results will likewise be rounded when they are stored. Thus, while computing some of the intermediate results in extended precision may yield a more accurate result, extended precision is not essential. In this case, we might prefer that the compiler use extended precision only when it will not appreciably slow the program and use double precision otherwise.
2. Use a format wider than double if it is reasonably fast and wide enough, otherwise resort to something else. Some computations can be performed more easily when extended precision is available, but they can also be carried out in double precision with only somewhat greater effort. Consider computing the Euclidean norm of a vector of double precision numbers. By computing the squares of the elements and accumulating their sum in an IEEE 754 extended double format with its wider exponent range, we can trivially avoid premature underflow or overflow for vectors of practical lengths. On extended-based systems, this is the fastest way to compute the norm. On single/double systems, an extended double format would have to be emulated in software (if one were supported at all), and such emulation would be much slower than simply using double precision, testing the exception flags to determine whether underflow or overflow occurred, and if so, repeating the computation with explicit scaling. Note that to support this use of extended precision, a language must provide both an indication of the widest available format that is reasonably fast, so that a program can choose which method to use, and environmental parameters that indicate the precision and range of each format, so that the program can verify that the widest fast format is wide enough (e.g., that it has wider range than double).
3. Use a format wider than double even if it has to be emulated in software. For more complicated programs than the Euclidean norm example, the programmer may simply wish to avoid the need to write two versions of the program and instead rely on extended precision even if it is slow. Again, the language must provide environmental parameters so that the program can determine the range and precision of the widest available format.
4. Don't use a wider precision; round results correctly to the precision of the double format, albeit possibly with extended range. For programs that are most easily written to depend on correctly rounded double precision arithmetic, including some of the examples mentioned above, a language must provide a way for the programmer to indicate that extended precision must not be used, even though intermediate results may be computed in registers with a wider exponent range than double. (Intermediate results computed in this way can still incur double-rounding if they underflow when stored to memory: if the result of an arithmetic operation is rounded first to 53 significant bits, then rounded again to fewer significant bits when it must be denormalized, the final result may differ from what would have been obtained by rounding just once to a denormalized number. Of course, this form of double-rounding is highly unlikely to affect any practical program adversely.)
5. Round results correctly to both the precision and range of the double format. This strict enforcement of double precision would be most useful for programs that test either numerical software or the arithmetic itself near the limits of both the range and precision of the double format. Such careful test programs tend to be difficult to write in a portable way; they become even more difficult (and error prone) when they must employ dummy subroutines and other tricks to force results to be rounded to a particular format. Thus, a programmer using an extended-based system to develop robust software that must be portable to all IEEE 754 implementations would quickly come to appreciate being able to emulate the arithmetic of single/double systems without extraordinary effort.
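The Euclidean norm case from the second option can be sketched for a single/double system as follows. This is a minimal illustration under assumptions made here, not the source's code: Python does not expose the IEEE exception flags, so the sketch tests the result for infinity or zero as a stand-in for checking the overflow and underflow flags (it will not notice a partial underflow of only some squares), and the function names are hypothetical.

```python
import math


def norm_naive(v):
    """Sum of squares in double precision; squares may overflow or underflow."""
    return math.sqrt(sum(x * x for x in v))


def norm_scaled(v):
    """Same norm with explicit scaling by the largest magnitude."""
    scale = max((abs(x) for x in v), default=0.0)
    if scale == 0.0 or math.isinf(scale):
        return scale
    return scale * math.sqrt(sum((x / scale) ** 2 for x in v))


def norm(v):
    """Try the fast path first; repeat with scaling if the result looks spoiled."""
    r = norm_naive(v)
    if math.isinf(r) or r == 0.0:
        return norm_scaled(v)
    return r


big = [3 * 2.0**600, 4 * 2.0**600]      # squares overflow the double range
tiny = [3 * 2.0**-600, 4 * 2.0**-600]   # squares underflow to zero

print(norm_naive(big), norm(big) == 5 * 2.0**600)
print(norm_naive(tiny), norm(tiny) == 5 * 2.0**-600)
print(norm([3.0, 4.0]))
```

With an extended format's wider exponent range, the fast path alone would handle both of these vectors, which is the point the list item makes.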

No current language supports all five of these options. In fact, few languages have attempted to give the programmer the ability to control the use of extended precision at all. One notable exception is the ISO/IEC 9899:1999 Programming Languages - C standard, the latest revision to the C language, which at the time of writing was in the final stages of standardization.

The C99 standard allows an implementation to evaluate expressions in a format wider than that normally associated with their type, but the C99 standard recommends using one of only three expression evaluation methods. The three recommended methods are characterized by the extent to which expressions are "promoted" to wider formats, and the implementation is encouraged to identify which method it uses by defining the preprocessor macro FLT_EVAL_METHOD: if FLT_EVAL_METHOD is 0, then each expression is evaluated in a format that corresponds to its type; if FLT_EVAL_METHOD is 1, then float expressions are promoted to the format that corresponds to double; and if FLT_EVAL_METHOD is 2, then float and double expressions are promoted to the format that corresponds to long double. (An implementation is allowed to set FLT_EVAL_METHOD to -1 to indicate that the expression evaluation method is indeterminable.) The C99 standard also requires that the <math.h> header file define the types float_t and double_t, which are at least as wide as float and double, respectively, and are intended to match the types used to evaluate float and double expressions. For example, if FLT_EVAL_METHOD is 2, then both float_t and double_t are long double. Finally, the C99 standard requires that the <float.h> header file define preprocessor macros that specify the range and precision of the formats corresponding to each floating-point type.
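For comparison, the kind of environmental parameters described here can be inspected from Python, whose sys.float_info structure reports the properties of the platform's C double. The sketch below is only an analogy: Python offers no counterpart to FLT_EVAL_METHOD, and the dictionary name params and the wide_enough check are illustrative choices made here.

```python
import sys

info = sys.float_info

# Each field mirrors a C <float.h> macro for the double type.
params = {
    "radix": info.radix,        # FLT_RADIX
    "mant_dig": info.mant_dig,  # DBL_MANT_DIG: significant bits
    "min_exp": info.min_exp,    # DBL_MIN_EXP
    "max_exp": info.max_exp,    # DBL_MAX_EXP
    "epsilon": info.epsilon,    # DBL_EPSILON
}

# A program can verify at run time that the format is at least IEEE 754 double.
wide_enough = params["mant_dig"] >= 53 and params["max_exp"] >= 1024

for name, value in params.items():
    print(f"{name:9s} {value}")
print("at least IEEE double:", wide_enough)
```

A program that chooses between two algorithms based on the available format would make a check of this kind before selecting a method.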

The combination of features required or recommended by the C99 standard supports some of the five options listed above but not all. For example, if an implementation maps the long double type to an extended double format and defines FLT_EVAL_METHOD to be 2, the programmer can reasonably assume that extended precision is relatively fast, so programs like the Euclidean norm example can simply use intermediate variables of type long double (or double_t). On the other hand, the same implementation must keep anonymous expressions in extended precision even when they are stored in memory (e.g., when the compiler must spill floating-point registers), and it must store the results of expressions assigned to variables declared double to convert them to double precision even if they could have been kept in registers. Thus, neither the double nor the double_t type can be compiled to produce the fastest code on current extended-based hardware.

Likewise, the C99 standard provides solutions to some of the problems illustrated by the examples in this section but not all. A C99 standard version of the log1p function is guaranteed to work correctly if the expression 1.0 + x is assigned to a variable (of any type) and that variable is used throughout. A portable, efficient C99 standard program for splitting a double precision number into high and low parts, however, is more difficult: how can we split at the correct position and avoid double-rounding if we cannot guarantee that double expressions are rounded correctly to double precision? One solution is to use the double_t type to perform the splitting in double precision on single/double systems and in extended precision on extended-based systems, so that in either case the arithmetic will be correctly rounded. Theorem 14 says that we can split at any bit position provided we know the precision of the underlying arithmetic, and the FLT_EVAL_METHOD and environmental parameter macros should give us this information.

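The following fragment shows one possible implementation:

     #include <math.h>
     #include <float.h>

     /* PWR2 is the split position: the evaluation precision minus half of double's. */
     #if (FLT_EVAL_METHOD==2)
     #define PWR2  (LDBL_MANT_DIG - (DBL_MANT_DIG/2))
     #elif ((FLT_EVAL_METHOD==1) || (FLT_EVAL_METHOD==0))
     #define PWR2  (DBL_MANT_DIG - (DBL_MANT_DIG/2))
     #else
     #error FLT_EVAL_METHOD unknown!
     #endif

     ...
         double   x, xh, xl;
         double_t m;

         m = scalbn(1.0, PWR2) + 1.0;  /* 2**PWR2 + 1 */
         xh = (m * x) - ((m * x) - x);
         xl = x - xh;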

Of course, to find this solution, the programmer must know that double expressions may be evaluated in extended precision, that the ensuing double-rounding problem can cause the algorithm to malfunction, and that extended precision may be used instead according to Theorem 14. A more obvious solution is simply to specify that each expression be rounded correctly to double precision. On extended-based systems, this merely requires changing the rounding precision mode, but unfortunately, the C99 standard does not provide a portable way to do this. (Early drafts of the Floating-Point C Edits, the working document that specified the changes to be made to the C90 standard to support floating-point, recommended that implementations on systems with rounding precision modes provide fegetprec and fesetprec functions to get and set the rounding precision, analogous to the fegetround and fesetround functions that get and set the rounding direction. This recommendation was removed before the changes were made to the C99 standard.)
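The two remedies discussed here can be compared in an exact model: splitting with every operation rounded to 64 bits at position 64 - 53/2 = 38, and splitting with every operation rounded correctly to 53 bits at position 27. This is a sketch under assumptions made here: the split has the form x_hi = (m*x) - ((m*x) - x) with m = 2**k + 1, the helper names are hypothetical, and the test value is the one used earlier in this section.

```python
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even (hypothetical helper)."""
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return sign * round(mag / ulp) * ulp


def split(x, k, p):
    """Split x using m = 2**k + 1, with every operation rounded once to p bits."""
    m = 2**k + 1
    t = round_to_bits(m * x, p)
    x_hi = round_to_bits(t - round_to_bits(t - x, p), p)
    x_lo = round_to_bits(x - x_hi, p)
    return x_hi, x_lo


def significant_bits(v):
    """Number of bits between the leading and trailing one bits of an integer value."""
    n = abs(int(v))
    while n and n % 2 == 0:
        n //= 2
    return n.bit_length()


x = Fraction(2**52 + 3 * 2**26 - 1)

hi_ext, lo_ext = split(x, 64 - 53 // 2, 64)  # extended arithmetic, k = 38
hi_dbl, lo_dbl = split(x, 53 - 53 // 2, 53)  # strict double arithmetic, k = 27

for label, hi, lo in (("extended", hi_ext, lo_ext), ("double", hi_dbl, lo_dbl)):
    print(label, int(hi), int(lo), significant_bits(hi), significant_bits(lo))
```

In this model both approaches yield parts of at most 26 bits, so storing them to double afterwards loses nothing.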

Coincidentally, the C99 standard's approach to supporting portability among systems with different integer arithmetic capabilities suggests a better way to support different floating-point architectures. Each C99 standard implementation supplies an <inttypes.h> header file that defines those integer types the implementation supports, named according to their sizes and efficiency: for example, int32_t is an integer type exactly 32 bits wide, int_fast16_t is the implementation's fastest integer type at least 16 bits wide, and intmax_t is the widest integer type supported. One can imagine a similar scheme for floating-point types: for example, float53_t could name a floating-point type with exactly 53 bit precision but possibly wider range, float_fast24_t could name the implementation's fastest type with at least 24 bit precision, and floatmax_t could name the widest reasonably fast type supported. The fast types could allow compilers on extended-based systems to generate the fastest possible code subject only to the constraint that the values of named variables must not appear to change as a result of register spilling. The exact width types would cause compilers on extended-based systems to set the rounding precision mode to round to the specified precision, allowing wider range subject to the same constraint. Finally, double_t could name a type with both the precision and range of the IEEE 754 double format, providing strict double evaluation. Together with environmental parameter macros named accordingly, such a scheme would readily support all five options described above and allow programmers to indicate easily and unambiguously the floating-point semantics their programs require.

Must language support for extended precision be so complicated? On single/double systems, four of the five options listed above coincide, and there is no need to differentiate fast and exact width types. Extended-based systems, however, pose difficult choices: they support neither pure double precision nor pure extended precision computation as efficiently as a mixture of the two, and different programs call for different mixtures. Moreover, the choice of when to use extended precision should not be left to compiler writers, who are often tempted by benchmarks (and sometimes told outright by numerical analysts) to regard floating-point arithmetic as "inherently inexact" and therefore neither deserving nor capable of the predictability of integer arithmetic. Instead, the choice must be presented to programmers, and they will require languages capable of expressing their selection.

Conclusion

The foregoing remarks are not intended to disparage extended-based systems but to expose several fallacies, the first being that all IEEE 754 systems must deliver identical results for the same program. We have focused on differences between extended-based systems and single/double systems, but there are further differences among systems within each of these families. For example, some single/double systems provide a single instruction to multiply two numbers and add a third with just one final rounding. This operation, called a fused multiply-add, can cause the same program to produce different results across different single/double systems, and, like extended precision, it can even cause the same program to produce different results on the same system depending on whether and when it is used. (A fused multiply-add can also foil the splitting process of Theorem 6, although it can be used in a non-portable way to perform multiple precision multiplication without the need for splitting.) Even though the IEEE standard didn't anticipate such an operation, it nevertheless conforms: the intermediate product is delivered to a "destination" beyond the user's control that is wide enough to hold it exactly, and the final sum is rounded correctly to fit its single or double precision destination.
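The effect of a fused multiply-add on a program's result can be illustrated by comparing one rounding against two. The sketch below models both in exact rational arithmetic rather than calling a hardware instruction; round_to_bits and to_double are hypothetical helpers, and the operands a, b and c were picked here so that the product falls on a halfway case.

```python
from fractions import Fraction


def round_to_bits(value, p):
    """Round an exact rational to p significant bits, ties to even (hypothetical helper)."""
    value = Fraction(value)
    if value == 0:
        return value
    sign = -1 if value < 0 else 1
    mag = abs(value)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    return sign * round(mag / ulp) * ulp


def to_double(v):
    return round_to_bits(v, 53)


a = 1 + Fraction(1, 2**27)
b = 1 - Fraction(1, 2**27)
c = Fraction(-1)

# Separate multiply and add: the product 1 - 2**-54 is rounded to 1 before the add.
separate = to_double(to_double(a * b) + c)

# Fused multiply-add: the exact product is added to c, then rounded once.
fused = to_double(a * b + c)

# The same expression in native double arithmetic, evaluated as two operations.
native = (1 + 2.0**-27) * (1 - 2.0**-27) - 1.0

print("separate:", separate)
print("fused   :", fused)
print("native  :", native)
```

Both results are correctly rounded under their own rules, yet one is zero and the other is not, so a test such as a*b + c == 0 could branch differently on two conforming systems.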

The idea that IEEE 754 prescribes precisely the result a given program must deliver is nonetheless appealing. Many programmers like to believe that they can understand the behavior of a program and prove that it will work correctly without reference to the compiler that compiles it or the computer that runs it. In many ways, supporting this belief is a worthwhile goal for the designers of computer systems and programming languages. Unfortunately, when it comes to floating-point arithmetic, the goal is virtually impossible to achieve. The authors of the IEEE standards knew that, and they didn't attempt to achieve it. As a result, despite nearly universal conformance to (most of) the IEEE 754 standard throughout the computer industry, programmers of portable software must continue to cope with unpredictable floating-point arithmetic.

If programmers are to exploit the features of IEEE 754, they will need programming languages that make floating-point arithmetic predictable. The C99 standard improves predictability to some degree at the expense of requiring programmers to write multiple versions of their programs, one for each FLT_EVAL_METHOD. Whether future languages will choose instead to allow programmers to write a single program with syntax that unambiguously expresses the extent to which it depends on IEEE 754 semantics remains to be seen. Existing extended-based systems threaten that prospect by tempting us to assume that the compiler and the hardware can know better than the programmer how a computation should be performed on a given system. That assumption is the second fallacy: the accuracy required in a computed result depends not on the machine that produces it but only on the conclusions that will be drawn from it, and of the programmer, the compiler, and the hardware, at best only the programmer can know what those conclusions may be.


