Bug in Half-Precision Floating Point Object

Posted by Cleve Moler, December 20, 2017


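My post on May 8 was about "half-precision" and "quarter-precision" arithmetic. I also added code for the objects fp16 and fp8 to Cleve's Laboratory. A few days ago I heard from Pierre Blanchard and my good friend Nick Higham at the University of Manchester about a serious bug in the constructors for those objects.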

The Bug

Let

format long
e = eps(fp16(1))

e = 9.765625000000000e-04

The value of e is

1/1024

ans = 9.765625000000000e-04

This is the relative precision of half-precision floating point numbers, which is the spacing of half-precision numbers in the interval between 1 and 2. So in binary the next number after 1 is

disp(binary(1+e))

0 01111 0000000001

And the last number before 2 is

disp(binary(2-e))

0 01111 1111111111

The three fields displayed are the sign, which is one bit, the exponent, which has five bits, and the fraction, which has ten bits.
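As a rough illustration of that layout, here is a minimal sketch that splits a 16-bit pattern into its three fields using only base MATLAB functions. It does not use the fp16 object, and the variable names (sgn, ex, fr, val) are illustrative assumptions, not names taken from the fp16 code.

```matlab
% Sketch: split a half-precision bit pattern into sign, exponent, fraction.
% Base MATLAB only; all names here are illustrative, not from fp16.
u = uint16(15361);              % bit pattern of 1 + e: 0 01111 0000000001
b = dec2bin(double(u), 16);     % 16-character string of '0' and '1'
sgn = b(1);                     % sign, 1 bit
ex  = b(2:6);                   % biased exponent, 5 bits (bias is 15)
fr  = b(7:16);                  % fraction, 10 bits
fprintf('%s %s %s\n', sgn, ex, fr);

% Decode a normalized value: (-1)^sign * 2^(exponent-15) * (1 + fraction/1024)
val = (-1)^bin2dec(sgn) * 2^(bin2dec(ex) - 15) * (1 + bin2dec(fr)/1024);
fprintf('%.15g\n', val);        % 1 + 1/1024
```

The decoding formula applies only to normalized numbers, where the exponent field is neither all zeros nor all ones.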

So far, so good. The bug shows up when I try to convert any number between 2-e and 2 to half-precision. There aren't any more half-precision numbers between those limits. The values in the lower half of the interval should round down to 2-e and the values in the upper half should round up to 2. The round-to-even convention says that the midpoint, 2-e/2, should round to 2.

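But I wasn't careful about how I did the rounding. I just used the MATLAB round function, which doesn't follow the round-to-even convention. Worse yet, I didn't check whether the fraction rounded up into the exponent. I tried to do it all in one statement.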

dbtype oldfp16 48:49

48 u = bitxor(uint16(round(1024*f)), ...
49 bitshift(uint16(e+15),10));

For values between 2-e/2 and 2, the round(1024*f) is 1024, which requires eleven bits. The bitxor then clobbers the exponent field. I won't show the result here. If you have the May half-precision object on your machine, give it a try.
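To see the mechanism without the fp16 object, here is a small sketch in base MATLAB. It assumes that f reaches the packing statement as the fraction in [0,1) and e as the unbiased exponent. That is an assumption about the surrounding constructor code, which is not reproduced in this post. Note that e is reused here for the exponent, as in the constructor, and h stands for the spacing 1/1024.

```matlab
% Sketch of the failure, assuming f in [0,1) and e the unbiased exponent.
h = 2^(-10);              % spacing of half-precision numbers between 1 and 2
x = 2 - h/4;              % lies between 2 - h/2 and 2, so it should round to 2
e = floor(log2(x));       % unbiased exponent, 0 here
f = x/2^e - 1;            % fraction, 0.999755859375
t = uint16(round(1024*f));                     % 1024, which needs eleven bits
u = bitxor(t, bitshift(uint16(e + 15), 10));   % bit 10 of the exponent is flipped
disp(dec2bin(double(u), 16))                   % 0011100000000000
```

Under these assumptions the pattern is 0 01110 0000000000. The exponent field has dropped from 15 to 14, so the pattern decodes to 0.5 instead of 2.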

This doesn't just happen for values a little bit less than 2; it happens close to any power of 2.

The Fix

We need a proper round-to-even function.

dbtype fp16 31

31 rndevn = @(s) round(s-(rem(s,2)==0.5));
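A quick way to see what the helper does is to compare it with round on a few ties. The sketch below only exercises positive ties of 1.5 and above. The test values are chosen for illustration and are not taken from the fp16 code.

```matlab
% Compare MATLAB's round (ties away from zero) with the round-to-even helper.
rndevn = @(s) round(s - (rem(s,2) == 0.5));
s = [1.5 2.5 3.5 1022.5 1023.5];   % ties only, chosen for illustration
r_away = round(s);                 % [2 3 4 1023 1024]
r_even = rndevn(s);                % [2 2 4 1022 1024]
disp([s; r_away; r_even])
```

A tie whose integer part is even has rem(s,2) equal to 0.5. Subtracting 1 moves it to the tie just below, which round then pushes up to that even integer. Ties with an odd integer part already round up to an even integer, and values that are not ties are left alone.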

Then don't try to do it all at once.

dbtype fp16 50:56

50 % normal
51 t = uint16(rndevn(1024*f));
52 if t==1024
53 t = uint16(0);
54 e = e+1;
55 end
56 u = bitxor(t, bitshift(uint16(e+15),10));
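Here is a sketch of the two-step packing applied to a value just below 2, again in base MATLAB. As before, it assumes f is the fraction in [0,1) and e is the unbiased exponent; the way x is split into f and e is an illustrative assumption, not the constructor's own code.

```matlab
% Sketch: round the fraction first, then carry into the exponent if needed.
rndevn = @(s) round(s - (rem(s,2) == 0.5));
x = 2 - 2^(-12);          % just below 2, in the upper half of [2 - 1/1024, 2]
e = floor(log2(x));       % unbiased exponent, 0 here
f = x/2^e - 1;            % fraction, 0.999755859375
t = uint16(rndevn(1024*f));     % 1024: the fraction has overflowed ten bits
if t == 1024
    t = uint16(0);        % fraction wraps to zero
    e = e + 1;            % and the carry goes into the exponent
end
u = bitxor(t, bitshift(uint16(e + 15), 10));
disp(dec2bin(double(u), 16))    % 0100000000000000
```

The pattern 0 10000 0000000000 has exponent field 16 and a zero fraction, which is exactly 2.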

It turns out that the branch for denormals is OK, once round is replaced by rndevn. The exponent for denormals is all zeros, so when the fraction encroaches on the exponent field it produces the correct result.
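The denormal case can be sketched the same way. The code below assumes the denormal fraction is formed by scaling x by 2^24, so that t counts multiples of the smallest denormal, 2^-24. That scaling is an assumption made for illustration; the actual denormal branch of fp16 is not shown in this post.

```matlab
% Sketch: a denormal input that rounds up to the smallest normal number.
rndevn = @(s) round(s - (rem(s,2) == 0.5));
x = 2^(-14) - 2^(-26);          % a quarter of a denormal spacing below 2^-14
t = uint16(rndevn(x * 2^24));   % 1023.75 rounds to 1024
u = bitxor(t, bitshift(uint16(0), 10));   % exponent field is all zeros
disp(dec2bin(double(u), 16))    % 0000010000000000
```

The extra bit lands in the lowest exponent bit, giving 0 00001 0000000000. That is 2^-14, the smallest normalized number, so no separate carry step is needed here.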

A similar fix is required for the quarter-precision constructor, fp8.