In this article, we look at arithmetic operations such as multiplication and division in JavaScript. An arithmetic operation works on two numbers, and those numbers are called operands.
Multiplication. The multiplication operator (*) multiplies two or more numbers.
Example
var a = 15; var b = 12; var c = a * b; // c is 180
Approach. Create an HTML form that takes input from the user for the multiplication. Add JavaScript code inside the HTML to perform the multiplication logic. document.getElementById(id).value returns the value of the value attribute of a text field.
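The multiplication logic described in the approach can be sketched as a small function. This is only an illustration: the name multiplyInputs is hypothetical, and the two arguments stand in for the strings that text fields would supply.

```javascript
// Hypothetical helper: multiplies two values given as text, the way
// they would arrive from text fields in a form.
function multiplyInputs(firstText, secondText) {
  const first = Number(firstText);
  const second = Number(secondText);
  if (Number.isNaN(first) || Number.isNaN(second)) {
    throw new TypeError("both inputs must be numeric");
  }
  return first * second;
}

console.log(multiplyInputs("15", "12"));
```

In a page, the two arguments would typically come from document.getElementById(id).value for whichever ids the form uses.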
Example. Below is an implementation of the approach above
HTML
Signed zero provides a perfect way to resolve this problem. Numbers of the form x + i(+0) have one sign, $i\sqrt{|x|}$, and numbers of the form x + i(-0) on the other side of the branch cut have the other sign, $-i\sqrt{|x|}$. In fact, the natural formulas for computing $\sqrt{\cdot}$ will give these results.
Back to $\sqrt{1/z} = 1/\sqrt{z}$. If z = -1 = -1 + i0, then
$1/z = 1/(-1 + i0) = (-1 - i0)/((-1 + i0)(-1 - i0)) = (-1 - i0)/((-1)^2 - 0^2) = -1 + i(-0)$, and so $\sqrt{1/z} = \sqrt{-1 + i(-0)} = -i$, while $1/\sqrt{z} = 1/i = -i$. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation x = y ⇔ 1/x = 1/y, which is false when x = +0 and y = -0.
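The confusing side of signed zero is easy to observe in JavaScript, whose numbers are IEEE doubles. The sketch below only illustrates the relation x = y ⇔ 1/x = 1/y failing; the variable names are chosen for this example.

```javascript
const positiveZero = 0;
const negativeZero = -0;

// The two zeros compare equal...
const zerosCompareEqual = positiveZero === negativeZero;

// ...but their reciprocals are infinities of opposite sign.
const reciprocalOfPositive = 1 / positiveZero;
const reciprocalOfNegative = 1 / negativeZero;
const reciprocalsCompareEqual = reciprocalOfPositive === reciprocalOfNegative;

// Object.is is one way to tell the two zeros apart.
const zerosDistinguishable = !Object.is(positiveZero, negativeZero);

console.log(zerosCompareEqual, reciprocalOfPositive, reciprocalOfNegative,
            reciprocalsCompareEqual, zerosDistinguishable);
```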
Denormalized Numbers
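Consider normalized floating-point numbers with β = 10, p = 3, and $e_{min} = -98$. The numbers $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ appear to be perfectly ordinary floating-point numbers, more than a factor of 10 larger than the smallest floating-point number $1.00 \times 10^{-98}$. They have a strange property, however: x ⊖ y = 0 even though x ≠ y. The reason is that $x - y = 0.06 \times 10^{-97} = 6.0 \times 10^{-99}$ is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property x = y ⇔ x - y = 0?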
It's very easy to imagine writing the code fragment, if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing an earlier formula, it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x - y. Similarly, knowing that x = y ⇔ x - y = 0 holds makes writing reliable floating-point code easier.
The IEEE standard uses denormalized numbers, which guarantee x = y ⇔ x - y = 0, as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard. The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is $e_{min}$, the significand does not have to be normalized, so that when β = 10, p = 3 and $e_{min} = -98$, $1.00 \times 10^{-98}$ is no longer the smallest floating-point number, because $0.98 \times 10^{-98}$ is also a floating-point number.
There is a small snag when
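Recall the example of $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$ (with β = 10, p = 3, $e_{min} = -98$). With denormals, x - y does not flush to zero but is instead represented by the denormalized number $0.6 \times 10^{-98}$. This behavior is called gradual underflow.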
FIGURE D-2 Flush To Zero Compared With Gradual Underflow
Figure D-2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number $1.0 \times \beta^{e_{min}}$. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The "gulf" is filled in, and when the result of a calculation is less than $1.0 \times \beta^{e_{min}}$, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the
spacing abruptly changes from $\beta^{-p+1}\beta^{e_{min}}$ to $\beta^{e_{min}}$, which is a factor of $\beta^{p-1}$, rather than the orderly change by a factor of β.
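Gradual underflow can be observed directly with JavaScript doubles. The sketch below picks two distinct normalized numbers just above the smallest normal double and shows that their difference is a nonzero denormal rather than zero. The particular values are chosen for this illustration.

```javascript
// Smallest positive normalized double, 2^-1022.
const smallestNormal = 2.2250738585072014e-308;

// Two distinct normalized numbers close to the underflow threshold.
const x = 1.5 * smallestNormal;
const y = 1.25 * smallestNormal;

// The exact difference is 0.25 * smallestNormal, which lies in the
// "gulf" below the smallest normal and is stored as a denormal.
const difference = x - y;

console.log(x !== y, difference !== 0, difference < smallestNormal);
```

With flush to zero, difference would be 0 even though x ≠ y.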
Without gradual underflow, the simple expression x - y can have a very large relative error for normalized inputs, as was seen above for $x = 6.87 \times 10^{-97}$ and $y = 6.81 \times 10^{-97}$. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula
$$\frac{a + ib}{c + id} = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}$$
suffers from the problem that if either component of the denominator c + id is larger than $\sqrt{\beta}\,\beta^{e_{max}/2}$, the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith's formula
(11) $$\frac{a + ib}{c + id} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + i\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |d| < |c| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + i\,\dfrac{-a + b(c/d)}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases}$$
Applying Smith's formula to $(2 \cdot 10^{-98} + i10^{-98})/(4 \cdot 10^{-98} + i(2 \cdot 10^{-98}))$ gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to $1.0 \times \beta^{e_{min}}$.
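The difference between the obvious formula and Smith's formula can be sketched in JavaScript. The function names divideNaive and divideSmith are made up for this example, and the demonstration uses overflow in double precision rather than the decimal toy format of the text.

```javascript
// Obvious formula: (ac + bd)/(c^2 + d^2) + i (bc - ad)/(c^2 + d^2).
function divideNaive(a, b, c, d) {
  const denominator = c * c + d * d;
  return {
    re: (a * c + b * d) / denominator,
    im: (b * c - a * d) / denominator,
  };
}

// Smith's formula: work with the ratio of the smaller denominator
// component to the larger one, so c^2 and d^2 are never formed.
function divideSmith(a, b, c, d) {
  if (Math.abs(d) < Math.abs(c)) {
    const ratio = d / c;
    const denominator = c + d * ratio;
    return { re: (a + b * ratio) / denominator, im: (b - a * ratio) / denominator };
  }
  const ratio = c / d;
  const denominator = d + c * ratio;
  return { re: (b + a * ratio) / denominator, im: (-a + b * ratio) / denominator };
}

// (1e200 + i 1e200) / (1e200 + i 1e200) is exactly 1, but c^2 overflows.
const big = 1e200;
const naiveQuotient = divideNaive(big, big, big, big);
const smithQuotient = divideSmith(big, big, big, big);
console.log(naiveQuotient, smithQuotient);
```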
Exceptions, Flags and Trap Handlers
When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for 0/0 and $\sqrt{-1}$, and ∞ for 1/0 and overflow.
Sometimes continuing execution in the face of exception conditions is not appropriate. An earlier section gave the example of $x/(x^2 + 1)$. When $x > \sqrt{\beta}\,\beta^{e_{max}/2}$, the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as $1/(x + x^{-1})$, rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.
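The default results and the x/(x² + 1) example can be reproduced with JavaScript doubles. This is a small sketch with names chosen for the illustration. JavaScript offers no trap handlers or status flags, so only the "deliver a result and continue" behavior is visible.

```javascript
// Default results delivered for exceptional operations.
const zeroOverZero = 0 / 0;             // invalid operation
const rootOfMinusOne = Math.sqrt(-1);   // invalid operation
const oneOverZero = 1 / 0;              // division by zero
const overflowed = Number.MAX_VALUE * 2; // overflow

// x*x overflows for large x, so the straightforward formula returns 0.
const x = 1e200;
const straightforward = x / (x * x + 1);

// The rewritten form stays in range and gives roughly 1/x.
const rewritten = 1 / (x + 1 / x);

console.log(zeroOverZero, rootOfMinusOne, oneOverZero, overflowed,
            straightforward, rewritten);
```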
The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the operations that produce a NaN, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions for an invalid operation.
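TABLE D-4 Exceptions in IEEE 754*
Exception        Result when traps disabled      Argument to trap handler
overflow         ±∞ or ±x_max                    round(x·2^-α)
underflow        0, ±2^e_min or denormal         round(x·2^α)
divide by zero   ±∞                              operands
invalid          NaN                             operands
inexact          round(x)                        round(x)
*x is the exact result of the operation, α = 192 for single precision, 1536 for double, and x_max = 1.11...11 × 2^e_max.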
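The inexact exception is raised when the result of a floating-point operation is not exact. In the β = 10, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 · 4.3 = 15.05), and raises an inexact exception.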
There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled
Trap Handlers
One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like do S until (x >= 100). Since comparing a NaN to a number with <, ≤, >, ≥, or = (but not ≠) always returns false, this code will go into an infinite loop if x ever becomes a NaN.
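The loop hazard can be reproduced safely in JavaScript by adding an iteration cap. The function doUntilAtLeast100 and its maxIterations parameter are inventions of this sketch, not part of the original loop.

```javascript
// Every ordered comparison with NaN is false; only "not equal" is true.
const orderedComparisons = [NaN < 1, NaN <= 1, NaN > 1, NaN >= 1, NaN === 1];
const notEqualComparison = NaN !== 1;

// "do S until (x >= 100)", with a cap so a NaN cannot hang the program.
function doUntilAtLeast100(start, step, maxIterations) {
  let x = start;
  let iterations = 0;
  do {
    x = x + step; // the loop body S
    iterations++;
  } while (!(x >= 100) && iterations < maxIterations);
  return { x, iterations };
}

console.log(orderedComparisons, notEqualComparison);
console.log(doUntilAtLeast100(0, 30, 1000));
console.log(doUntilAtLeast100(0, NaN, 1000));
```

Without the cap, the second call would never terminate, because x >= 100 is false on every pass once x is NaN.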
There is a more interesting use for trap handlers that comes up when computing products such as $\prod_{i=1}^{n} x_i$ that could potentially overflow. One solution is to use logarithms, and compute $\exp\left(\sum \log x_i\right)$ instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression $\prod x_i$, even if there is no overflow. There is another solution using trap handlers called over/underflow counting that avoids both of these problems [Sterbenz 1974].
The idea is as follows. There is a global counter initialized to zero. Whenever the partial product $p_k = \prod_{i=1}^{k} x_i$ overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. In IEEE 754 single precision, $e_{max} = 127$, so if $p_k = 1.45 \times 2^{130}$, it will overflow and cause the trap handler to be called, which will wrap the exponent back into range, changing $p_k$ to $1.45 \times 2^{-62}$ (see below). Similarly, if $p_k$ underflows, the counter would be decremented, and the negative exponent would get wrapped around into a positive one. When all the multiplications are done, if the counter is zero then the final product is $p_n$. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost. Even if there are over/underflows, the calculation is more accurate than if it had been computed with logarithms, because each $p_k$ was computed from $p_{k-1}$ using a full precision multiply. Barnett [1987] discusses a formula where the full accuracy of over/underflow counting turned up an error in earlier tables of that formula.
IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by $2^{\alpha}$, and then rounded to the relevant precision. For underflow, the result is multiplied by $2^{\alpha}$. The exponent α is 192 for single precision, which is why $1.45 \times 2^{130}$ was transformed into $1.45 \times 2^{-62}$ in the example above.
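JavaScript exposes no trap handlers, so the counting idea can only be imitated. The sketch below swaps the trap for an explicit range check after each multiply and rescales by a power of two, which is exact. The wrap exponent of 600, the thresholds, and the function names are all choices made for this illustration (they are not the standard's α), and the sketch assumes every factor has magnitude between 2^-700 and 2^700.

```javascript
// Builds 2^exponent by repeated doubling or halving, which is exact.
function powerOfTwo(exponent) {
  let value = 1;
  for (let i = 0; i < Math.abs(exponent); i++) {
    value = exponent > 0 ? value * 2 : value / 2;
  }
  return value;
}

const WRAP_BITS = 600;                // illustrative, not the standard's alpha
const WRAP = powerOfTwo(WRAP_BITS);
const UPPER = powerOfTwo(300);
const LOWER = powerOfTwo(-300);

// Returns { product, counter } where the true product is
// product * 2^(WRAP_BITS * counter).
function countedProduct(factors) {
  let product = 1;
  let counter = 0;
  for (const factor of factors) {
    product *= factor;
    while (Number.isFinite(product) && Math.abs(product) >= UPPER) {
      product /= WRAP; // wrap the exponent down, count one "overflow"
      counter++;
    }
    while (product !== 0 && Math.abs(product) < LOWER) {
      product *= WRAP; // wrap the exponent up, count one "underflow"
      counter--;
    }
  }
  return { product, counter };
}

const factors = new Array(10).fill(1e100);
const plainProduct = factors.reduce((p, f) => p * f, 1); // overflows
const counted = countedProduct(factors);
console.log(plainProduct, counted);
```

A zero counter means the returned product is the final answer. A positive counter means the true product overflowed and a negative one means it underflowed, as in the trap-based scheme.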
Rounding Modes
In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward -∞.
One application of rounding modes occurs in interval arithmetic (another is mentioned elsewhere). When using interval arithmetic, the sum of two numbers x and y is an interval $[\underline{z}, \overline{z}]$, where $\underline{z}$ is x ⊕ y rounded toward -∞, and $\overline{z}$ is x ⊕ y rounded toward +∞. The exact result of the addition is contained within the interval $[\underline{z}, \overline{z}]$.
When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.
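JavaScript cannot change the rounding mode, so directed rounding has to be approximated. The sketch below is a stand-in, not true directed rounding: it computes the round-to-nearest sum and then steps one representable number outward on each side, which yields a slightly wider but still enclosing interval. The helper names nextUp, nextDown, intervalAdd and point are invented for this example.

```javascript
const floatView = new Float64Array(1);
const bitsView = new BigUint64Array(floatView.buffer);

// Next representable double above x.
function nextUp(x) {
  if (Number.isNaN(x) || x === Infinity) return x;
  if (x === 0) return Number.MIN_VALUE;
  floatView[0] = x;
  // For positive x the next double has the next bit pattern; for
  // negative x, moving up means shrinking the magnitude.
  bitsView[0] += x > 0 ? 1n : -1n;
  return floatView[0];
}

// Next representable double below x.
function nextDown(x) {
  return -nextUp(-x);
}

// Interval sum with outward widening in place of directed rounding.
function intervalAdd(first, second) {
  return {
    lo: nextDown(first.lo + second.lo),
    hi: nextUp(first.hi + second.hi),
  };
}

const point = (value) => ({ lo: value, hi: value });
const sum = intervalAdd(point(0.1), point(0.2));
console.log(sum);
```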
Flags
The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. There are four rounding modes: round toward nearest, round toward +∞, round toward 0, and round toward -∞.
Consider writing a subroutine to compute $x^n$, where n is an integer. When n > 0, a simple routine like
PositivePower(x,n) {
    while (n is even) {
        x = x*x
        n = n/2
    }
    u = x
    while (true) {
        n = n/2              // integer division
        if (n==0) return u
        x = x*x
        if (n is odd) u = u*x
    }
}
If n < 0, then a more accurate way to compute $x^n$ is not to call PositivePower(1/x, -n) but rather 1/PositivePower(x, -n), because the first expression multiplies n quantities each of which have a rounding error from the division (i.e., 1/x). In the second expression these are exact (i.e., x), and the final division commits just one additional rounding error. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, -n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if $x^{-n}$ underflows, then $x^n$ will either overflow or be in range. But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits. It then computes 1/PositivePower(x, -n). If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits. If one of the status bits is set, it restores the flags and redoes the calculation using PositivePower(1/x, -n), which causes the correct exceptions to occur.
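The routine translates almost line for line into JavaScript. In the sketch below, positivePower follows the pseudocode, while the wrapper power and the argument check are additions made for this example. JavaScript gives no access to the status flags or trap enable bits, so the save-and-restore step described above cannot be shown and is simply omitted.

```javascript
// Computes x^n for a positive integer n by repeated squaring.
function positivePower(x, n) {
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  while (n % 2 === 0) {
    x = x * x;
    n = n / 2;
  }
  let u = x;
  while (true) {
    n = Math.floor(n / 2); // integer division
    if (n === 0) return u;
    x = x * x;
    if (n % 2 === 1) u = u * x;
  }
}

// For negative n, divide once at the end instead of powering 1/x.
function power(x, n) {
  if (n === 0) return 1;
  return n > 0 ? positivePower(x, n) : 1 / positivePower(x, -n);
}

console.log(positivePower(3, 5), positivePower(2, 10), power(2, -2));
```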
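Another example of the use of flags occurs when computing arccos via the formula
$\arccos x = 2 \arctan \sqrt{\frac{1 - x}{1 + x}}$. If arctan(∞) evaluates to π/2, then arccos(-1) will correctly evaluate to 2·arctan(∞) = π, because of infinity arithmetic. However, there is a small snag, because the computation of (1 - x)/(1 + x) will cause the divide by zero exception flag to be set, even though arccos(-1) is not exceptional. The solution is to save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation.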
Systems Aspects
The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider the following BASIC program
q = 3.0/7.0
if q = 3.0/7.0 then print "Equal":
    else print "Not Equal"
When compiled and run using Borland's Turbo Basic on an IBM PC, the program prints Not Equal! This example will be analyzed in the next section.
Instruction Sets
It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. One example occurs in the quadratic formula $(-b \pm \sqrt{b^2 - 4ac})/2a$. As discussed earlier, when $b^2 \approx 4ac$, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of $b^2 - 4ac$ in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved.
The computation of $b^2 - 4ac$ in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result. In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.
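The effect can be imitated in JavaScript, where Math.fround rounds a double to the nearest single precision value. The coefficients below are chosen for this sketch so that b² and 4ac agree in every single precision digit. Rounding each operation with Math.fround is an emulation of single precision arithmetic, not a hardware instruction.

```javascript
const single = Math.fround;
const ULP = 1 / 8388608; // 2^-23, the spacing of singles just above 1

// Single precision coefficients with b^2 very close to 4ac.
const a = 1;
const b = 1 + ULP;        // 1 + 2^-23
const c = 0.25 + ULP / 2; // (1 + 2^-22) / 4

// Every operation rounded to single: the tiny difference is lost.
const discriminantSingle = single(single(b * b) - single(single(4 * a) * c));

// Products of singles are exact in double, so this is the true value.
const discriminantDouble = b * b - 4 * a * c;

console.log(discriminantSingle, discriminantDouble);
```

Here the single precision discriminant has no correct digits at all, while the double precision one is exact.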
If an instruction that combines two single precision operands to produce a double precision product was only useful for the quadratic formula, it wouldn't be worth adding to an instruction set. However, this instruction has many other uses. Consider the problem of solving a system of linear equations,
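$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\;\;\vdots \\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n \end{aligned}$$
which can be written in matrix form as Ax = b, where $A = (a_{ij})$ is the n × n matrix of coefficients, $x = (x_1, \ldots, x_n)^T$ and $b = (b_1, \ldots, b_n)^T$.
Suppose that a solution $x^{(1)}$ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute
(12) $\xi = Ax^{(1)} - b$
and then solve the system
(13) $Ay = \xi$
Note that if $x^{(1)}$ is an exact solution, then ξ is the zero vector, as is y. In general, the computation of ξ and y will incur rounding error, so $Ay \approx \xi \approx A(x^{(1)} - x)$, where x is the (unknown) true solution. Then $y \approx x^{(1)} - x$, so an improved estimate for the solution is
(14) $x^{(2)} = x^{(1)} - y$
The three steps (12), (13), and (14) can be repeated, replacing $x^{(1)}$ with $x^{(2)}$, and $x^{(2)}$ with $x^{(3)}$. This argument that $x^{(i+1)}$ is more accurate than $x^{(i)}$ is only informal. For more information, see [Golub and Van Loan 1989].
When performing iterative improvement, ξ is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus iterative improvement is not very useful unless $\xi = Ax^{(1)} - b$ is computed in double precision. Once again, this is a case of computing the product of two single precision numbers (A and $x^{(1)}$), where the full double precision result is needed.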
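Steps (12) to (14) can be sketched for a 2 × 2 system. Several things here are assumptions of the illustration rather than part of the text: the solver is Cramer's rule instead of Gaussian elimination, single precision is emulated with Math.fround, the iterates are kept in double precision so the gain is visible, and the matrix and function names are made up.

```javascript
const single = Math.fround;

// Low precision solver: Cramer's rule with every operation rounded
// to single precision.
function solveSingle(A, b) {
  const det = single(single(A[0][0] * A[1][1]) - single(A[0][1] * A[1][0]));
  const x0 = single(single(single(b[0] * A[1][1]) - single(A[0][1] * b[1])) / det);
  const x1 = single(single(single(A[0][0] * b[1]) - single(b[0] * A[1][0])) / det);
  return [x0, x1];
}

// Step (12): residual xi = A x - b, computed in double precision.
function residual(A, x, b) {
  return A.map((row, i) => row[0] * x[0] + row[1] * x[1] - b[i]);
}

function improve(A, b, steps) {
  let x = solveSingle(A, b);
  for (let k = 0; k < steps; k++) {
    const xi = residual(A, x, b);   // (12)
    const y = solveSingle(A, xi);   // (13)
    x = [x[0] - y[0], x[1] - y[1]]; // (14)
  }
  return x;
}

const A = [[4, 1], [1, 3]];
const b = [1, 2];
const exact = [1 / 11, 7 / 11];
const errorOf = (x) => Math.abs(x[0] - exact[0]) + Math.abs(x[1] - exact[1]);

console.log(errorOf(improve(A, b, 0)), errorOf(improve(A, b, 2)));
```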
To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section
Languages and Compilers
The interaction of compilers and floating-point is discussed in Farnum [1988], and much of the discussion in this section is taken from that paper
Ambiguity
Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error. If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.
Remarkably enough, some languages don't clearly specify that if x is a floating-point variable (with say a value of 3.0/10.0), then every occurrence of (say) 10.0*x must have the same value. For example Ada, which is based on Brown's model, seems to imply that floating-point arithmetic only has to satisfy Brown's axioms, and thus expressions can have one of many possible values. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined. In the IEEE model, we can prove that (3.0/10.0)*10.0 evaluates to 3 (Theorem 7). In Brown's model, we cannot.
Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when $x = 10^{30}$, $y = -10^{30}$ and z = 1 (it is 1 in the former case, 0 in the latter). The importance of preserving parentheses cannot be overemphasized. The algorithms presented in theorems 3, 4 and 6 all depend on it. For example, in Theorem 6, the formula $x_h = mx - (mx - x)$ would reduce to $x_h = x$ if it weren't for parentheses, thereby destroying the entire algorithm. A language definition that does not require parentheses to be honored is useless for floating-point calculations.
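The parenthesization example can be checked directly with JavaScript doubles, using the same three values. The variable names are chosen for this sketch.

```javascript
const x = 1e30;
const y = -1e30;
const z = 1;

// x and y cancel exactly first, so z survives.
const leftGrouping = (x + y) + z;

// z is absorbed when added to y, and the result then cancels with x.
const rightGrouping = x + (y + z);

console.log(leftGrouping, rightGrouping);
```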
Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y is the product performed in single or double precision? Another example: in x + m/n where m and n are integers, is the division an integer operation or a floating-point one? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks. First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants. In the expression 0.1*x, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. If 0.1 is still treated as a single precision constant, then there will be a compile time error. The programmer will have to hunt down and change every floating-point constant.
The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided. There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie 1978]. This leads to anomalies like the example at the beginning of this section. The expression 3.0/7.0 is computed in double precision, but if q is a single-precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision is different from its stored value in single precision. Thus the comparison q = 3/7 fails. This suggests that computing every expression in the highest precision available is not a good rule.
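The failing comparison can be imitated in JavaScript, where arithmetic is done in double precision and Math.fround plays the role of storing into a single precision variable. This is an emulation for illustration only.

```javascript
// The quotient is computed in double, then rounded to single for storage.
const q = Math.fround(3.0 / 7.0);

// Comparing the stored single against the double quotient fails...
const mixedComparison = q === 3.0 / 7.0;

// ...while comparing single against single succeeds.
const singleComparison = q === Math.fround(3.0 / 7.0);

console.log(mixedComparison ? "Equal" : "Not Equal",
            singleComparison ? "Equal" : "Not Equal");
```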
Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in a later section). If sum is a double precision variable, and x[] and y[] are single precision arrays, then the inner product loop will look like sum = sum + x[i]*y[i]. If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable.
A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression. Then q = 3.0/7.0 will be computed entirely in single precision and will have the boolean value true, whereas sum = sum + x[i]*y[i] will be computed in double precision, gaining the full advantage of double precision accumulation. However, this rule is too simplistic to cover all cases cleanly. If dx and dy are double precision variables, the expression y = x + single(dx-dy) contains a double precision variable, but performing the sum in double precision would be pointless, because both operands are single precision, as is the result.
A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree. Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. In the case of q = 3.0/7.0, every leaf is single precision, so all the operations are done in single precision. In the case of sum = sum + x[i]*y[i], the tentative precision of the multiply operation is single precision, but in the second pass it gets promoted to double precision, because its parent operation expects a double precision operand. And in y = x + single(dx-dy), the addition is done in single precision. Farnum [1988] presents evidence that this algorithm is not difficult to implement.
The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded. This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1 which are not exactly representable in binary.
Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen 1982]. If ** is the exponentiation operator, then (-3)**3 certainly has the value -27. However, (-3.0)**3.0 is problematical. If the ** operator checks for integer powers, it would compute (-3.0)**3.0 as $-3.0^3 = -27$. On the other hand, if the formula $x^y = e^{y \log x}$ is used to define ** for real arguments, then depending on the log function, the result could be a NaN (using the natural definition of log(x) = NaN when x < 0). If the FORTRAN CLOG function is used however, then the answer will be -27, because the ANSI FORTRAN standard defines CLOG(-3.0) to be iπ + log 3.
In fact, the FORTRAN standard says that
Any arithmetic operation whose result is not mathematically defined is prohibited
Unfortunately, with the introduction of ±∞ by the IEEE standard, the meaning of "not mathematically defined" is no longer totally clear cut. A definition based on taking limits would unambiguously define the exponential function for all arguments, and in particular would define (-3.0)**3.0 to be -27.
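JavaScript happens to have a ** operator, so the cases discussed above can be tried directly. The sketch only reports what JavaScript itself does. Its treatment of infinite exponents is one language's choice and is not presented as the definition the text argues for.

```javascript
// Integer-looking and real-looking operands give the same answer,
// because JavaScript checks for integral exponents.
const integerPower = (-3) ** 3;
const realPower = Math.pow(-3.0, 3.0);

// Defining the power through exp and log instead yields a NaN,
// since the real logarithm of a negative number is NaN.
const viaLogarithm = Math.exp(3.0 * Math.log(-3.0));

// Two cases involving infinity.
const twoToInfinity = Math.pow(2, Infinity);
const oneToInfinity = Math.pow(1, Infinity);

console.log(integerPower, realPower, viaLogarithm, twoToInfinity, oneToInfinity);
```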
The IEEE Standard
An earlier section discussed many of the features of the IEEE standard. However, the IEEE standard says nothing about how these features are to be accessed from a programming language. Thus, there is usually a mismatch between floating-point hardware that supports the standard and programming languages like C, Pascal or FORTRAN. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines. For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example. Constants to represent ±∞ could be supplied by a subroutine. But that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.
A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful. As the examples in the Flags section show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson 1991].
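There are a number of minor points that need to be considered when implementing the IEEE standard in a language. Since x - x = +0 for all x, (+0) - (+0) = +0. However, -(+0) = -0, thus -x should not be defined as 0 - x. The introduction of NaNs can be confusing, because a NaN is never equal to any other number (including another NaN), so x = x is no longer always true. In fact, the expression x ≠ x is the simplest way to test for a NaN if the IEEE recommended function Isnan is not provided. Furthermore, NaNs are unordered with respect to all other numbers, so x ≤ y cannot be defined as not x > y. Since the introduction of NaNs causes floating-point numbers to become partially ordered, a compare function that returns one of <, =, >, or unordered can make it easier for the programmer to deal with comparisons.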
Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example, when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.
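Both points can be sketched in JavaScript: a self-comparison test for NaN, and a maximum that skips NaNs. The built-in Math.max propagates NaN, which is the basic-operation behavior described above. The functions isNaNBySelfComparison and maxIgnoringNaN are written for this illustration.

```javascript
// A NaN is the only value that is not equal to itself.
function isNaNBySelfComparison(x) {
  return x !== x;
}

// Maximum of the values that are not NaN; NaN if there are none.
function maxIgnoringNaN(values) {
  let best = NaN;
  for (const value of values) {
    if (value !== value) continue; // skip NaNs
    if (best !== best || value > best) best = value;
  }
  return best;
}

const samples = [1, NaN, 3, 2];
console.log(Math.max(...samples), maxIgnoringNaN(samples));
```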
Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can't use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.
Optimizers
Compiler texts tend to ignore the subject of floating-point. For example Aho et al. [1986] mentions replacing x/2.0 with x*0.5, leading the reader to assume that x/10.0 should be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. This textbook also suggests replacing x*y-x*z by x*(y-z), even though we have seen that these two expressions can have quite different values when y ≈ z. Whether or not the language standard specifies that parentheses must be honored, (x+y)+z can have a totally different answer than x+(y+z), as discussed above. There is a problem closely related to preserving parentheses that is illustrated by the following code
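eps = 1;
do eps = 0.5*eps; while (eps + 1 > 1);
This is designed to give an estimate for machine epsilon. If an optimizing compiler notices that eps + 1 > 1 ⇔ eps > 0, the program will be changed completely. Instead of computing the smallest number x such that 1 ⊕ x is still greater than 1 ($x \approx \epsilon \approx \beta^{-p}$), it will compute the largest number x for which x/2 is rounded to 0 ($x \approx \beta^{e_{min}}$). Avoiding this kind of "optimization" is so important that it is worth presenting one more very useful algorithm that is totally ruined by it.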
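Both versions of the loop can be run in JavaScript to see how far apart their results are. The second function writes out by hand the rewrite a careless optimizer might perform. Its name and the lastPositive variable are additions for this sketch, used to report the last nonzero value before the loop exits.

```javascript
// The loop as written: stops once adding eps to 1 no longer changes 1.
function estimateEpsilon() {
  let eps = 1;
  do {
    eps = 0.5 * eps;
  } while (eps + 1 > 1);
  return eps;
}

// The same loop after replacing "eps + 1 > 1" with "eps > 0":
// it now runs until eps underflows to zero.
function estimateAfterBadRewrite() {
  let eps = 1;
  let lastPositive = eps;
  do {
    lastPositive = eps;
    eps = 0.5 * eps;
  } while (eps > 0);
  return lastPositive;
}

console.log(estimateEpsilon(), estimateAfterBadRewrite());
```

The first result is near the rounding unit of double precision, and the second is the smallest denormal, hundreds of orders of magnitude smaller.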
Many problems, such as numerical integration and the numerical solution of differential equations, involve computing sums with many terms. Because each addition can potentially introduce an error as large as .5 ulp, a sum involving thousands of terms can have quite a bit of rounding error. A simple way to correct for this is to store the partial summand in a double precision variable and to perform each addition using double precision. If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. However, there is a much more efficient method which dramatically improves the accuracy of sums, namely
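Theorem 8 (Kahan Summation Formula)
Suppose that $\sum_{j=1}^{N} x_j$ is computed using the following algorithm
S = X[1];
C = 0;
for j = 2 to N {
    Y = X[j] - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
}
Then the computed sum S is equal to $\sum x_j(1 + \delta_j) + O(N\epsilon^2)\sum |x_j|$, where $|\delta_j| \le 2\epsilon$.
Using the naive formula $\sum x_j$, the computed sum is equal to $\sum x_j(1 + \delta_j)$ where $|\delta_j| < (n - j)\epsilon$.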
An optimizer that believed floating-point arithmetic obeyed the laws of algebra would conclude that C = (T-S) - Y = ((S+Y)-S) - Y = 0, rendering the algorithm completely useless. These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables.
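The summation formula carries over to JavaScript directly. The sketch below follows the algorithm in Theorem 8 with zero-based indexing, and compares it with plain left-to-right summation on an input chosen for this example: one large term followed by ten terms that are each too small to change it on their own.

```javascript
// Plain left-to-right summation.
function naiveSum(values) {
  let s = 0;
  for (const value of values) s += value;
  return s;
}

// Kahan summation: c carries the low-order part lost by each addition.
function kahanSum(values) {
  if (values.length === 0) return 0;
  let s = values[0];
  let c = 0;
  for (let j = 1; j < values.length; j++) {
    const y = values[j] - c;
    const t = s + y;
    c = (t - s) - y; // must not be simplified algebraically
    s = t;
  }
  return s;
}

const values = [1, ...new Array(10).fill(1e-16)];
console.log(naiveSum(values), kahanSum(values));
```

The true sum is 1 + 10^-15. Plain summation never moves away from 1, because each 1e-16 is below half an ulp of 1, while the compensated sum recovers the small terms.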
Another way that optimizers can change the semantics of floating-point code involves constants. In the expression 1.0E-40*x, there is an implicit decimal to binary conversion operation that converts the decimal number to a binary constant. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should be set if the expression is evaluated in single precision. Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts 1.0E-40 to binary at compile time would be changing the semantics of the program. However, constants like 27.5 which are exactly representable in the smallest available precision can be safely converted at compile time, since they are always exact, cannot raise any exception, and are unaffected by the rounding modes. Constants that are intended to be converted at compile time should be done with a constant declaration, such as const pi = 3.14159265.
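The difference between the two kinds of constants can be observed from Python, with the caveat that Python exposes neither the IEEE flags nor the rounding modes, so this sketch only checks exactness and magnitude. The helper `to_single` is an illustrative name; it rounds a double to single precision through the `struct` module.

```python
import struct
from decimal import Decimal

def to_single(value):
    """Round a Python float (a double) to the nearest IEEE single value."""
    return struct.unpack("f", struct.pack("f", value))[0]

# 27.5 is 11011.1 in binary, so its conversion is exact in any precision.
exact_constant = Decimal(27.5) == Decimal("27.5")

# 1.0E-40 has no finite binary expansion, so the stored double is only close.
inexact_constant = Decimal(1.0e-40) != Decimal("1.0E-40")

# In single precision 1.0E-40 lies below the smallest normalized number
# (2**-126), so it can only be held as a denormal.
single_value = to_single(1.0e-40)
is_denormal = 0.0 < single_value < 2.0 ** -126

print(exact_constant, inexact_constant, is_denormal)
```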
Common subexpression elimination is another example of an optimization that can change floating-point semantics, as illustrated by the following code
C = A*B;
RndMode = Up
D = A*B;
Although A*B can appear to be a common subexpression, it is not because the rounding mode is different at the two evaluation sites. Three final examples: x = x cannot be replaced by the boolean constant true, because it fails when x is a NaN; -x = 0 - x fails for x = +0; and x < y is not the opposite of x ≥ y, because NaNs are neither greater than nor less than ordinary floating-point numbers.
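The three identities that fail can be checked in a few lines. This is a Python sketch; the variable names are illustrative, and `math.copysign` is used only to make the sign of a zero visible.

```python
import math

nan = float("nan")
x_equals_x = (nan == nan)               # False, so "x = x" is not always true

pos_zero = 0.0
negated = -pos_zero                     # -0.0
subtracted = 0.0 - pos_zero             # +0.0
sign_negated = math.copysign(1.0, negated)        # sign of -x
sign_subtracted = math.copysign(1.0, subtracted)  # sign of 0 - x

less_than = nan < 1.0                   # False
not_greater_equal = not (nan >= 1.0)    # True, so the two tests disagree

print(x_equals_x, sign_negated, sign_subtracted, less_than, not_greater_equal)
```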
Despite these examples, there are useful optimizations that can be done on floating-point code. First of all, there are algebraic identities that are valid for floating-point numbers. Some examples in IEEE arithmetic are x + y = y + x, 2 × x = x + x, 1 × x = x, and 0.5 × x = x/2. However, even these simple identities can fail on a few machines such as CDC and Cray supercomputers. Instruction scheduling and in-line procedure substitution are two other potentially useful optimizations.
As a final example, consider the expression dx = x*y, where x and y are single precision variables, and dx is double precision. On machines that have an instruction that multiplies two single precision numbers to produce a double precision number, dx = x*y can get mapped to that instruction, rather than compiled to a series of instructions that convert the operands to double and then perform a double to double precision multiply.
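One reason such an instruction is attractive is that the product of two single precision numbers (24 significant bits each) always fits in the 53-bit significand of a double, barring overflow or underflow. The Python sketch below checks this for one pair of values; `to_single` is an illustrative helper and the inputs 1.1 and 3.3 are arbitrary.

```python
import struct
from fractions import Fraction

def to_single(value):
    """Round a Python float (a double) to the nearest IEEE single value."""
    return struct.unpack("f", struct.pack("f", value))[0]

x = to_single(1.1)
y = to_single(3.3)
dx = x * y  # Python floats are doubles, so this is a double precision multiply

exact_product = Fraction(x) * Fraction(y)        # rational arithmetic, no rounding
double_is_exact = Fraction(dx) == exact_product
single_is_exact = Fraction(to_single(dx)) == exact_product

print(double_is_exact, single_is_exact)
```

The double result carries the full product, while rounding it back to single precision discards low-order bits.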
Some compiler writers view restrictions which prohibit converting (x + y) + z to x + (y + z) as irrelevant, of interest only to programmers who use unportable tricks. Perhaps they have in mind that floating-point numbers model real numbers and should obey the same laws that real numbers do. The problem with real number semantics is that they are extremely expensive to implement. Every time two n bit numbers are multiplied, the product will have 2n bits. Every time two n bit numbers with widely spaced exponents are added, the number of bits in the sum is n + the space between the exponents. The sum could have up to (e_max - e_min) + n bits, or roughly 2·e_max + n bits. An algorithm that involves thousands of operations (such as solving a linear system) will soon be operating on numbers with many significant bits, and be hopelessly slow. The implementation of library functions such as sin and cos is even more difficult, because the values of these transcendental functions aren't rational numbers. Exact integer arithmetic is often provided by Lisp systems and is handy for some problems. However, exact floating-point arithmetic is rarely useful.
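The growth in operand size is easy to see with Python's arbitrary-precision integers. In this sketch the starting value, a 53-bit integer the size of a double's significand, is an arbitrary illustrative choice.

```python
x = 2**53 - 1          # a 53-bit integer, the size of a double's significand
bit_lengths = [x.bit_length()]
for _ in range(4):
    x = x * x          # exact product: nothing is rounded away
    bit_lengths.append(x.bit_length())

print(bit_lengths)     # [53, 106, 212, 424, 848]
```

Four exact multiplications already require 848 bits, which is why exact semantics quickly become too slow for long computations.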
The fact is that there are useful algorithms (like the Kahan summation formula) that exploit the fact that (x + y) + z ≠ x + (y + z), and work whenever the bound
$a \oplus b = (a + b)(1 + \delta)$
holds (as well as similar bounds for -, × and /). Since these bounds hold for almost all commercial hardware, it would be foolish for numerical programmers to ignore such algorithms, and it would be irresponsible for compiler writers to destroy these algorithms by pretending that floating-point variables have real number semantics.
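Both facts, the failure of associativity and the bound on a single addition, can be verified numerically. This is a Python sketch with illustrative operands; it assumes IEEE double arithmetic with round to nearest, for which the relative error of one addition is at most $2^{-53}$.

```python
from fractions import Fraction

# Associativity fails: z is absorbed by y before x can cancel it.
x, y, z = 1e20, -1e20, 1.0
left = (x + y) + z     # 0.0 + 1.0
right = x + (y + z)    # 1e20 + (-1e20)

# A single addition still satisfies a (+) b = (a + b)(1 + delta).
a, b = 0.1, 0.2
exact_sum = Fraction(a) + Fraction(b)           # rational arithmetic, no rounding
delta = (Fraction(a + b) - exact_sum) / exact_sum

print(left, right, float(delta))
```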
Exception Handling
The topics discussed up to now have primarily concerned systems implications of accuracy and precision. Trap handlers also raise some interesting systems issues. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and an earlier section gave some applications of user-defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands, otherwise, with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well. For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination.
The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines which have pipelining or multiple arithmetic units, when an exception occurs, it may not be enough to simply have the trap handler examine the program counter. Hardware support for identifying exactly which operation trapped may be necessary.
Another problem is illustrated by the following program fragment
x = y*z;
z = x*w;
a = b + c;
d = a/x;
Suppose the second multiply raises an exception, and the trap handler wants to use the value of a. On hardware that can do an add and multiply in parallel, an optimizer would probably move the addition operation ahead of the second multiply, so that the add can proceed in parallel with the first multiply. Thus when the second multiply traps, a = b + c has already been executed, potentially changing the result of a. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated. This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly. Instead, the handler can be given the operands or result as an argument.
But there are still problems. In the fragment
x = y*z;
z = a + b;
the two instructions might well be executed in parallel. If the multiply traps, its argument z could already have been overwritten by the addition, especially since addition is usually faster than multiply. Computer systems that support the IEEE standard must provide some way to save the value of z, either in hardware or by having the compiler avoid such a situation in the first place.
W. Kahan has proposed using presubstitution instead of trap handlers to avoid these problems. In this method, the user specifies an exception and the value to be used as the result when the exception occurs. As an example, suppose that in code for computing (sin x)/x, the user decides that x = 0 is so rare that it would improve performance to avoid a test for x = 0, and instead handle this case when a 0/0 trap occurs. Using IEEE trap handlers, the user would write a handler that returns a value of 1 and install it before computing (sin x)/x. Using presubstitution, the user would specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs.
The advantage of presubstitution is that it has a straightforward hardware implementation. As soon as the type of exception has been determined, it can be used to index a table which contains the desired result of the operation. Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.
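The table-lookup idea can be imitated in software. The following Python sketch is only an emulation under stated assumptions: the names `PRESUBSTITUTED`, `divide` and `sinc` are hypothetical, and because Python raises `ZeroDivisionError` instead of delivering an IEEE invalid operation result for 0.0/0.0, the 0/0 case is detected with an explicit test, which is precisely the test that real presubstitution hardware would make unnecessary.

```python
import math

# Hypothetical presubstitution table: exception class -> value to substitute.
# It is filled in before the computation runs, as the method requires.
PRESUBSTITUTED = {"invalid": 1.0}

def divide(numerator, denominator):
    """Division that looks up the table instead of trapping on 0/0."""
    if numerator == 0.0 and denominator == 0.0:
        return PRESUBSTITUTED["invalid"]
    return numerator / denominator

def sinc(x):
    """(sin x)/x, with the value at x = 0 supplied by presubstitution."""
    return divide(math.sin(x), x)

print(sinc(0.0), sinc(0.5))
```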
The Details
A number of claims have been made in this paper concerning properties of floating-point arithmetic. We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically. This section is divided into three parts. The first part presents an introduction to error analysis, and provides the details for the earlier discussion of rounding error. The second part explores binary to decimal conversion, filling in some gaps from the earlier discussion of that topic. The third part discusses the Kahan summation formula, which was used as an example in an earlier section.
Rounding Error
In the discussion of rounding error, it was stated that a single guard digit is enough to guarantee that addition and subtraction will always be accurate (Theorem 2). We now proceed to verify this fact. Theorem 2 has two parts, one for subtraction and one for addition. The part for subtraction is
Theorem 9
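If x and y are positive floating-point numbers in a format with parameters $\beta$ and p, and if subtraction is done with p + 1 digits (i.e. one guard digit), then the relative rounding error in the result is less than $\left(\frac{\beta}{2} + 1\right)\beta^{-p} = \left(1 + \frac{2}{\beta}\right)\epsilon \le 2\epsilon$.
Proof
Interchange x and y if necessary so that x > y. It is also harmless to scale x and y so that x is represented by $x_0.x_1 \ldots x_{p-1} \times \beta^0$. If y is represented as $y_0.y_1 \ldots y_{p-1}$, then the difference is exact. If y is represented as $0.y_1 \ldots y_p$, then the guard digit ensures that the computed difference will be the exact difference rounded to a floating-point number, so the rounding error is at most $\epsilon$. In general, let $y = 0.0 \ldots 0\,y_{k+1} \ldots y_{k+p}$ and let $\bar{y}$ be y truncated to p + 1 digits. Then
(15) $y - \bar{y} < (\beta - 1)(\beta^{-p-1} + \beta^{-p-2} + \cdots + \beta^{-p-k})$.
From the definition of guard digit, the computed value of x - y is $x - \bar{y}$ rounded to be a floating-point number, that is, $(x - \bar{y}) + \delta$, where the rounding error $\delta$ satisfies
(16) $|\delta| \le (\beta/2)\beta^{-p}$.
The exact difference is x - y, so the error is $(x - y) - (x - \bar{y} + \delta) = \bar{y} - y - \delta$. There are three cases. If $x - y \ge 1$, then the relative error is bounded by
(17) $\frac{y - \bar{y} + |\delta|}{1} \le \beta^{-p}\left[(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k}) + \frac{\beta}{2}\right] < \beta^{-p}\left(1 + \frac{\beta}{2}\right)$.
Secondly, if $x - \bar{y} < 1$, then $\delta = 0$. Since the smallest that x - y can be is $1.0 - 0.0 \ldots 0\rho \ldots \rho > (\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})$, where $\rho = \beta - 1$ and the subtrahend has k zeros followed by p digits $\rho$,
in this case the relative error is bounded by
(18) $\frac{y - \bar{y} + |\delta|}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} < \frac{(\beta - 1)\beta^{-p}(\beta^{-1} + \cdots + \beta^{-k})}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} = \beta^{-p}$.
The final case is when x - y < 1 but $x - \bar{y} \ge 1$. The only way this could happen is if $x - \bar{y} = 1$, in which case $\delta = 0$. But if $\delta = 0$, then (18) applies, so that again the relative error is bounded by $\beta^{-p} < \beta^{-p}(1 + \beta/2)$.
When $\beta = 2$, the bound is exactly $2\epsilon$, and this bound is achieved for $x = 1 + 2^{2-p}$ and $y = 2^{1-p} - 2^{1-2p}$ in the limit as $p \to \infty$. When adding numbers of the same sign, a guard digit is not necessary to achieve good accuracy, as the following result shows.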
Theorem 10
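If $x \ge 0$ and $y \ge 0$, then the relative error in computing x + y is at most $2\epsilon$, even if no guard digits are used.
Proof
The algorithm for addition with k guard digits is similar to that for subtraction. If $x \ge y$, shift y right until the radix points of x and y are aligned. Discard any digits shifted past the p + k position. Compute the sum of these two p + k digit numbers exactly. Then round to p digits. We will verify the theorem when no guard digits are used; the general case is similar. There is no loss of generality in assuming that $x \ge y \ge 0$ and that x is scaled to be of the form $d.dd \ldots d \times \beta^0$. First, assume there is no carry out. Then the digits shifted off the end of y have a value less than $\beta^{-p+1}$, and the sum is at least 1, so the relative error is less than $\beta^{-p+1}/1 = 2\epsilon$. If there is a carry out, then the error from shifting must be added to the rounding error of $\frac{1}{2}\beta^{-p+2}$. The sum is at least $\beta$, so the relative error is less than $(\beta^{-p+1} + \frac{1}{2}\beta^{-p+2})/\beta = (1 + \beta/2)\beta^{-p} \le 2\epsilon$.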
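It is obvious that combining these two theorems gives Theorem 2. Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of $x^2 - y^2$ and (x + y)(x - y) requires knowing the relative error of multiple operations. The relative error of $x \ominus y$ is $\delta_1 = [(x \ominus y) - (x - y)]/(x - y)$, which satisfies $|\delta_1| \le 2\epsilon$. Or to write it another way
(19) $x \ominus y = (x - y)(1 + \delta_1), \quad |\delta_1| \le 2\epsilon$
Similarly
(20) $x \oplus y = (x + y)(1 + \delta_2), \quad |\delta_2| \le 2\epsilon$
Assuming that multiplication is performed by computing the exact product and then rounding, the relative error is at most .5 ulp, so
(21) $u \otimes v = uv(1 + \delta_3), \quad |\delta_3| \le \epsilon$
for any floating-point numbers u and v. Putting these three equations together (letting $u = x \ominus y$ and $v = x \oplus y$) gives
(22) $(x \ominus y) \otimes (x \oplus y) = (x - y)(1 + \delta_1)(x + y)(1 + \delta_2)(1 + \delta_3)$
So the relative error incurred when computing (x - y)(x + y) is
(23) $\frac{(x \ominus y) \otimes (x \oplus y) - (x^2 - y^2)}{x^2 - y^2} = (1 + \delta_1)(1 + \delta_2)(1 + \delta_3) - 1$
This relative error is equal to $\delta_1 + \delta_2 + \delta_3 + \delta_1\delta_2 + \delta_1\delta_3 + \delta_2\delta_3 + \delta_1\delta_2\delta_3$, which is bounded by $5\epsilon + 8\epsilon^2$ up to a term of order $\epsilon^3$. In other words, the maximum relative error is about 5 rounding errors (since $\epsilon$ is a small number, $\epsilon^2$ is almost negligible).
A similar analysis of $(x \otimes x) \ominus (y \otimes y)$ cannot result in a small value for the relative error, because when two nearby values of x and y are plugged into $x^2 - y^2$, the relative error will usually be quite large. Another way to see this is to try and duplicate the analysis that worked on $(x \ominus y) \otimes (x \oplus y)$, yielding
$(x \otimes x) \ominus (y \otimes y) = [x^2(1 + \delta_1) - y^2(1 + \delta_2)](1 + \delta_3)$
$= ((x^2 - y^2)(1 + \delta_1) + (\delta_1 - \delta_2)y^2)(1 + \delta_3)$
When x and y are nearby, the error term $(\delta_1 - \delta_2)y^2$ can be as large as the result $x^2 - y^2$. These computations formally justify our claim that (x - y)(x + y) is more accurate than $x^2 - y^2$.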
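The contrast between the two ways of evaluating $x^2 - y^2$ can be measured exactly with rational arithmetic. In this Python sketch the operands are illustrative choices: x is picked so that squaring it rounds away a low-order bit, while every step of the factored form happens to be exact.

```python
from fractions import Fraction

x = 1.0 + 2.0**-27
y = 1.0

exact = Fraction(x) ** 2 - Fraction(y) ** 2   # rational arithmetic, no rounding
naive = x * x - y * y                         # x*x is rounded before subtracting
factored = (x - y) * (x + y)                  # subtraction of nearby values first

naive_rel_error = abs(Fraction(naive) - exact) / exact
factored_rel_error = abs(Fraction(factored) - exact) / exact

print(float(naive_rel_error), float(factored_rel_error))
```

For these inputs the factored form has no error at all, while the naive form is off by roughly $2^{-28}$ in relative terms, far more than the handful of rounding errors allowed by the bound on (23).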
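We next turn to an analysis of the formula for the area of a triangle. In order to estimate the maximum error that can occur when computing with this formula, the following fact will be needed.
Theorem 11
If subtraction is performed with a guard digit, and $y/2 \le x \le 2y$, then x - y is computed exactly.
Proof
Note that if x and y have the same exponent, then certainly $x \ominus y$ is exact. Otherwise, from the condition of the theorem, the exponents can differ by at most 1. Scale and interchange x and y if necessary so that $0 \le y \le x$, and x is represented as $x_0.x_1 \ldots x_{p-1}$ and y as $0.y_1 \ldots y_p$. Then the algorithm for computing $x \ominus y$ will compute x - y exactly and round to a floating-point number. If the difference is of the form $0.d_1 \ldots d_p$, the difference will already be p digits long, and no rounding is necessary. Since $x \le 2y$, $x - y \le y$, and since y is of the form $0.d_1 \ldots d_p$, so is x - y.
When $\beta > 2$, the hypothesis of Theorem 11 cannot be replaced by $y/\beta \le x \le \beta y$; the stronger condition $y/2 \le x \le 2y$ is still necessary. The analysis of the error in (x - y)(x + y), immediately following the proof of Theorem 10, used the fact that the relative error in the basic operations of addition and subtraction is small (namely equations (19) and (20)). This is the most common kind of error analysis. However, analyzing the triangle formula requires something more, namely Theorem 11, as the following proof will show.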
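Theorem 11 lends itself to a randomized check. The Python sketch below assumes IEEE double arithmetic; the sampling ranges and the seed are arbitrary, and exactness is tested by comparing against rational arithmetic.

```python
import random
from fractions import Fraction

random.seed(0)
checked = 0
all_exact = True
for _ in range(10_000):
    y = random.uniform(1.0, 1000.0)
    x = y * random.uniform(0.5, 2.0)
    if not (y / 2 <= x <= 2 * y):   # rounding can push x just outside the range
        continue
    checked += 1
    # The floating-point difference must equal the exact rational difference.
    if Fraction(x - y) != Fraction(x) - Fraction(y):
        all_exact = False

print(checked, all_exact)
```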
Theorem 12
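If subtraction uses a guard digit, and if a, b and c are the sides of a triangle ($a \ge b \ge c$), then the relative error in computing (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c)) is at most $16\epsilon$, provided $\epsilon < .005$.
Proof
Let's examine the factors one by one. From Theorem 10, $b \oplus c = (b + c)(1 + \delta_1)$, where $\delta_1$ is the relative error, and $|\delta_1| \le 2\epsilon$. Then the value of the first factor is $(a \oplus (b \oplus c)) = (a + (b \oplus c))(1 + \delta_2) = (a + (b + c)(1 + \delta_1))(1 + \delta_2)$, and thus
$(a + b + c)(1 - 2\epsilon)^2 \le [a + (b + c)(1 - 2\epsilon)](1 - 2\epsilon) \le a \oplus (b \oplus c) \le [a + (b + c)(1 + 2\epsilon)](1 + 2\epsilon) \le (a + b + c)(1 + 2\epsilon)^2$
This means that there is an $\eta_1$ so that
(24) $(a \oplus (b \oplus c)) = (a + b + c)(1 + \eta_1)^2, \quad |\eta_1| \le 2\epsilon$.
The next term involves the potentially catastrophic subtraction of c and $a \ominus b$, because $a \ominus b$ may have rounding error. Because a, b and c are the sides of a triangle, $a \le b + c$, and combining this with the ordering $c \le b \le a$ gives $a \le b + c \le 2b \le 2a$. So a - b satisfies the conditions of Theorem 11. This means that $a - b = a \ominus b$ is exact, hence $c \ominus (a - b)$ is a harmless subtraction which can be estimated from Theorem 9 to be
(25) $(c \ominus (a \ominus b)) = (c - (a - b))(1 + \eta_2), \quad |\eta_2| \le 2\epsilon$
The third term is the sum of two exact positive quantities, so
(26) $(c \oplus (a \ominus b)) = (c + (a - b))(1 + \eta_3), \quad |\eta_3| \le 2\epsilon$
Finally, the last term is
(27) $(a \oplus (b \ominus c)) = (a + (b - c))(1 + \eta_4)^2, \quad |\eta_4| \le 2\epsilon$,
using both Theorem 9 and Theorem 10. If multiplication is assumed to be exactly rounded, so that $x \otimes y = xy(1 + \zeta)$ with $|\zeta| \le \epsilon$, then combining (24), (25), (26) and (27) gives
$(a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c)) = (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))\,E$
where
$E = (1 + \eta_1)^2 (1 + \eta_2)(1 + \eta_3)(1 + \eta_4)^2 (1 + \zeta_1)(1 + \zeta_2)(1 + \zeta_3)$
An upper bound for E is $(1 + 2\epsilon)^6 (1 + \epsilon)^3$, which expands out to $1 + 15\epsilon + O(\epsilon^2)$. Some writers simply ignore the $O(\epsilon^2)$ term, but it is easy to account for it. Writing $(1 + 2\epsilon)^6 (1 + \epsilon)^3 = 1 + 15\epsilon + \epsilon R(\epsilon)$, $R(\epsilon)$ is a polynomial in $\epsilon$ with positive coefficients, so it is an increasing function of $\epsilon$. Since R(.005) = .505, $R(\epsilon) < 1$ for all $\epsilon < .005$, and hence $E \le (1 + 2\epsilon)^6 (1 + \epsilon)^3 < 1 + 16\epsilon$. To get a lower bound on E, note that $1 - 15\epsilon - \epsilon R(\epsilon) < E$, and so when $\epsilon < .005$, $1 - 16\epsilon < (1 - 2\epsilon)^6 (1 - \epsilon)^3$. Combining these two bounds yields $1 - 16\epsilon < E < 1 + 16\epsilon$. Thus the relative error is at most $16\epsilon$.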
Theorem 12 certainly shows that there is no catastrophic cancellation in the triangle formula. So although it is not necessary to show that the formula is numerically stable, it is satisfying to have a bound for the entire formula, which is what Theorem 3 gives.
Proof of Theorem 3
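Let
q = (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))
and
$Q = (a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))$.
Then, Theorem 12 shows that $Q = q(1 + \delta)$, with $|\delta| \le 16\epsilon$. It is easy to check that
(28) $1 - .52|\delta| \le \sqrt{1 - |\delta|} \le \sqrt{1 + |\delta|} \le 1 + .52|\delta|$
provided $|\delta| \le .04/(.52)^2 \approx .15$, and since $|\delta| \le 16\epsilon \le 16(.005) = .08$, $\delta$ does satisfy the condition. Thus $\sqrt{Q} = \sqrt{q(1 + \delta)} = \sqrt{q}\,(1 + \delta_1)$, with $|\delta_1| \le .52|\delta| \le 8.5\epsilon$. If square roots are computed to within .5 ulp, then the error when computing $\sqrt{Q}$ is $(1 + \delta_1)(1 + \delta_2)$, with $|\delta_2| \le \epsilon$. If $\beta = 2$, then there is no further error committed when dividing by 4. Otherwise, one more factor $1 + \delta_3$ with $|\delta_3| \le \epsilon$ is necessary for the division, and using the method in the proof of Theorem 12, the final error bound of $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)$ is dominated by $1 + \delta_4$, with $|\delta_4| \le 11\epsilon$.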
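The bound can be sanity-checked numerically. The Python sketch below evaluates the four-factor product q in double precision and compares it with the same expression evaluated to 60 decimal digits; the function names are illustrative, the needle-like test triangle is an arbitrary choice, and `math.sqrt` is assumed to be accurate to within .5 ulp as on typical IEEE platforms.

```python
import math
from decimal import Decimal, getcontext

def triangle_area(a, b, c):
    """Area as sqrt(q)/4, with the sides sorted so that a >= b >= c."""
    a, b, c = sorted((a, b, c), reverse=True)
    q = (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))
    return math.sqrt(q) / 4.0

def reference_area(a, b, c):
    """The same expression evaluated with 60 significant decimal digits."""
    getcontext().prec = 60
    a, b, c = (Decimal(v) for v in sorted((a, b, c), reverse=True))
    q = (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))
    return q.sqrt() / 4

a, b, c = 9.0, 4.53, 4.53   # a flat, needle-like triangle
reference = reference_area(a, b, c)
relative_error = abs(Decimal(triangle_area(a, b, c)) - reference) / reference

print(triangle_area(3.0, 4.0, 5.0), float(relative_error))
```

The observed relative error should stay within a few units of $\epsilon = 2^{-53}$, consistent with the $11\epsilon$ bound.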
To make the heuristic explanation immediately following the statement of Theorem 4 precise, the next theorem describes just how closely µ(x) approximates a constant.
Theorem 13
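If µ(x) = ln(1 + x)/x, then for $0 \le x \le 3/4$, $1/2 \le \mu(x) \le 1$ and the derivative satisfies $|\mu'(x)| \le 1/2$.
Proof
Note that $\mu(x) = 1 - x/2 + x^2/3 - \ldots$ is an alternating series with decreasing terms, so for $x \le 1$, $\mu(x) \ge 1 - x/2 \ge 1/2$. It is even easier to see that because the series for µ is alternating, $\mu(x) \le 1$. The Taylor series of $\mu'(x)$ is also alternating, and if $x \le 3/4$ it has decreasing terms, so $-1/2 \le \mu'(x) \le -1/2 + 2x/3$, or $-1/2 \le \mu'(x) \le 0$, thus $|\mu'(x)| \le 1/2$.
Proof of Theorem 4
Since the Taylor series for ln, $\ln(1 + x) = x - x^2/2 + x^3/3 - \ldots$, is an alternating series, $0 < x - \ln(1 + x) < x^2/2$, the relative error incurred when approximating ln(1 + x) by x is bounded by x/2. If $1 \oplus x = 1$, then $|x| < \epsilon$, so the relative error is bounded by $\epsilon/2$.
When $1 \oplus x \ne 1$, define $\hat{x}$ via $1 \oplus x = 1 + \hat{x}$. Then since $0 \le x < 1$, $(1 \oplus x) \ominus 1 = \hat{x}$. If division and logarithms are computed to within 1/2 ulp, then the computed value of the expression ln(1 + x)/((1 + x) - 1) is
(29) $\frac{\ln(1 \oplus x)}{(1 \oplus x) \ominus 1}(1 + \delta_1)(1 + \delta_2) = \frac{\ln(1 + \hat{x})}{\hat{x}}(1 + \delta_1)(1 + \delta_2) = \mu(\hat{x})(1 + \delta_1)(1 + \delta_2)$
where $|\delta_1| \le \epsilon$ and $|\delta_2| \le \epsilon$. To estimate $\mu(\hat{x})$, use the mean value theorem, which says that
(30) $\mu(\hat{x}) - \mu(x) = (\hat{x} - x)\mu'(\xi)$
for some $\xi$ between x and $\hat{x}$. From the definition of $\hat{x}$, it follows that $|\hat{x} - x| \le \epsilon$, and combining this with Theorem 13 gives $|\mu(\hat{x}) - \mu(x)| \le \epsilon/2$, or $|\mu(\hat{x})/\mu(x) - 1| \le \epsilon/(2|\mu(x)|) \le \epsilon$, which means that $\mu(\hat{x}) = \mu(x)(1 + \delta_3)$, with $|\delta_3| \le \epsilon$. Finally, multiplying by x introduces a final $\delta_4$, so the computed value of x·ln(1 + x)/((1 + x) - 1) is
$\frac{x \ln(1 + x)}{(1 + x) - 1}(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4), \quad |\delta_i| \le \epsilon$
It is easy to check that if $\epsilon < 0.1$, then $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)(1 + \delta_4) = 1 + \delta$, with $|\delta| \le 5\epsilon$.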