John R. Hauser
2019 July 29
1. Introduction
2. Limitations
3. Acknowledgments and License
4. HardFloat Package Directory Structure
5. Floating-Point Representations
    5.1. Standard Formats
    5.2. Recoded Formats
    5.3. Raw Deconstructions
6. Common Control and Mode Inputs
    6.1. Control Input
    6.2. Rounding Mode
7. Exception Results
8. Specialization
    8.1. Width and Default Value for the Control Input
    8.2. Integer Results on Exceptions
    8.3. NaN Results
9. Main Modules
    9.1. Conversions Between Standard and Recoded Floating-Point (fNToRecFN, recFNToFN)
    9.2. Conversions from Integer (iNToRecFN, iNToRawFN)
    9.3. Conversions to Integer (recFNToIN)
    9.4. Conversions Between Formats (recFNToRecFN)
    9.5. Addition and Subtraction (addRecFN, addRecFNToRaw)
    9.6. Multiplication (mulRecFN, mulRecFNToRaw, mulRecFNToFullRaw)
    9.7. Fused Multiply-Add (mulAddRecFN, mulAddRecFNToRaw)
    9.8. Division and Square Root (divSqrtRecFN_small, divSqrtRecFNToRaw_small)
    9.9. Comparisons (compareRecFN)
10. Common Submodules
    10.1. isSigNaNRecFN
    10.2. recFNToRawFN
    10.3. roundAnyRawFNToRecFN
    10.4. roundRawFNToRecFN
11. Testing HardFloat
12. Contact Information
Berkeley HardFloat is a hardware implementation of binary floating-point that
conforms to the IEEE Standard for Floating-Point Arithmetic.
HardFloat supports a wide range of floating-point formats, using module
parameters to independently determine the widths of the exponent and
significand fields.
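The set of possible formats includes the standard ones of 16-bit half-precision, 32-bit single-precision, 64-bit double-precision, and 128-bit quadruple-precision.
For any supported format, modules are available for addition and subtraction, multiplication, fused multiply-add, division and square root, comparisons, conversions to and from integers, and conversions to and from other floating-point formats.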
This document covers the Verilog version of Berkeley HardFloat. An understanding of both Verilog and the IEEE Floating-Point Standard is required.
HardFloat is currently in its first documented release.
In its Verilog version, Berkeley HardFloat requires development tools that conform at a minimum to the 2001 IEEE Standard for the Verilog language.
For a particular IEEE-conforming binary floating-point format, if w is the width of the format’s exponent field and p is the precision as defined by the floating-point standard (that is, p is one more than the width of the trailing significand field), then HardFloat requires that w ≥ 3, p ≥ 3, and
    p ≤ 2^(w − 2) + 3
Formats not satisfying these constraints will not always operate correctly.
Although HardFloat supports an infinite range of binary floating-point formats
within these constraints,
While the range of HardFloat’s parameters includes the
“bfloat16” format (
HardFloat is designed to operate on IEEE floating-point values by converting
them into a recoded format, performing arithmetic in the recoded format,
and then eventually converting back to a standard IEEE-defined encoding.
HardFloat’s recoded formats encode the exact same set of values as the
standard floating-point encodings, so recoding entails only a change of
representation as bits, not a change of value.
It is intended that this recoding be made invisible to other system components
by converting values back to a standard encoding whenever necessary.
Nevertheless, there may exist situations where use of the recoded formats
cannot be tolerated.
For more about the recoding, see section 5.2, Recoded Formats.
The HardFloat package was written by me, John R. Hauser.
- Par Lab: Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, NVIDIA, Oracle, and Samsung.
- ASPIRE Lab: DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, Oracle, and Samsung.
- ADEPT Lab: ADEPT industrial sponsor Intel and ADEPT affiliates Google, Futurewei, Seagate, Siemens, and SK Hynix.
The following applies to the whole of HardFloat
Copyright 2019 The Regents of the University of California. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”, AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The directory structure for the Verilog version of HardFloat is as follows:
    doc
    source
        8086-SSE
        ARM-VFPv2
        RISCV
    test
        source
            Verilator
        build
            Verilator-GCC
            IcarusVerilog
The source files that define the Verilog modules are all in the main source directory.
The test directory has files used solely for testing.
Most of HardFloat’s source files are regular Verilog ‘.v’ files containing module definitions.
There are also a few ‘.vi’ files that define macros and constant functions.
These latter types of files get textually included (using compiler directive `include) in the other Verilog files.
Within the main source directory are subdirectories for files that specialize the floating-point behavior to match particular processor families:
- 8086-SSE: Intel’s x86 processors with Streaming SIMD Extensions (SSE) and later compatible extensions (but not including the older 80-bit double-extended-precision format)
- ARM-VFPv2: Arm’s VFPv2 or later floating-point
- RISCV: RISC-V floating-point
Each of these specialization subdirectories contains two files, HardFloat_specialize.vi and HardFloat_specialize.v.
Specialization is covered in detail in section 8, Specialization.
HardFloat has three main ways of representing binary floating-point values:
- Standard IEEE formats (‘fN’): A computation’s original floating-point inputs and final results will typically be represented in what the IEEE Standard calls binary interchange formats, such as binary32 for single-precision and binary64 for double-precision.
- Recoded formats (‘recFN’): Rather than operate on the standard formats directly, HardFloat prefers to translate IEEE-encoded values into equivalent recoded formats and operate on those alternative representations instead. The main purpose of the recoding is to make subnormals be normalized, aligned with other floating-point values, thus reducing the complexity of floating-point operations.
- Raw deconstructions (‘rawFN’): During the computation of individual arithmetic operations, HardFloat often represents floating-point values in a deconstructed “raw” form, with separated sign, exponent, significand, etc. Among other uses, this representation is employed for intermediate results before rounding.
Each of these representations is covered in a subsection below.
HardFloat uses the term standard format to refer to an encoding of floating-point in the style of a binary interchange format defined by the IEEE Standard. Although the standard officially limits binary interchange formats to certain combinations of exponent width w and precision p, HardFloat loosens those restrictions to allow any w and p, so long as each parameter is no smaller than 3, and this relationship is also satisfied:
    p ≤ 2^(w − 2) + 3
In module names, HardFloat uses the abbreviation ‘fN’ (distinct from ‘recFN’ or ‘rawFN’) to refer to standard IEEE-style formats.
A particular format is indicated by parameters expWidth (= w) and sigWidth (= p).
The total width of the format is expWidth + sigWidth, which is subdivided as 1 bit for the sign, expWidth bits for the encoded exponent, and sigWidth − 1 bits for the trailing significand.
The parameters for the common standard formats are:

    format                          expWidth (w)   sigWidth (p)
    16-bit half-precision           5              11
    32-bit single-precision         8              24
    64-bit double-precision         11             53
    128-bit quadruple-precision     15             113
As supplied, HardFloat distinguishes signaling NaNs from quiet NaNs in the way the IEEE Standard prefers; that is, if the most-significant bit of a NaN’s trailing significand is 0, the NaN is signaling, and if that bit is 1, the NaN is quiet.
For each standard format, HardFloat defines a corresponding
recoded format that encodes the same set of values in a slightly
different representation.
The recoded formats have sign, exponent, and significand fields just like the
standard formats, but with one extra bit for the exponent.
Hence, a standard 32-bit format is recoded in 33 bits, and a standard 64-bit format in 65 bits.
The following table summarizes the recoding:
                         standard format                     HardFloat’s recoded format
                         sign  exponent    significand       sign  exponent       significand
    zeros                s     0           0                 s     000xx...xx     0
    subnormal numbers    s     0           F                 s     2^k + 2 − n    normalized F << n
    normal numbers       s     E           F                 s     E + 2^k + 1    F
    infinities           s     1111...11   0                 s     110xx...xx     xxxxxxxxx...xxxx
    NaNs                 s     1111...11   F                 s     111xx...xx     F
The parameter k is expWidth − 1, where expWidth is the standard width of the exponent field, not the recoded width.
An x represents a “don’t care” bit.
For the special values of zeros, infinities, and NaNs, only the three most-significant bits are relevant in the recoded exponent field.
If those three bits are 000, 110, or 111, the floating-point value is a zero, infinity, or NaN, respectively; otherwise, the value is a normalized finite number.
In the latter case, if the recoded exponent field is less than 2^k + 2, the value is one that would be subnormal in the standard format.
In the table, “normalized F << n” means F normalized by shifting left by n bits (1 or more), so that its most-significant 1 bit becomes the implicit leading bit.
For most floating-point values (zeros, normal numbers, and NaNs) the only thing different about the recoded formats is the encoding of the exponent; the sign and significand fields are the same as the corresponding standard format. Infinities are special only in that the significand field is ignored in the recoded formats and might not be zero. Only for subnormals is there a significant transformation of both exponent and significand fields, due to normalization.
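The recoding table can be read as a small combinational function. The following is a minimal behavioral sketch for the 16-bit half-precision case only (expWidth = 5, sigWidth = 11, so k = 4), reading the table’s exponent entries as 2^k + 2 − n and E + 2^k + 1. The module name recodeF16Sketch and all internal names are hypothetical. This is not HardFloat’s fNToRecFN, and it fills every “don’t care” bit with 0, which the real module need not do.

```verilog
// Hypothetical behavioral model of the recoding table for binary16 only.
// Not HardFloat source; "don't care" bits are arbitrarily driven to 0.
module recodeF16Sketch (
    input  [15:0] in,    // standard half-precision encoding
    output [16:0] out    // recoded: 1 sign + 6 exponent + 10 significand bits
);
    wire       sign  = in[15];
    wire [4:0] expIn = in[14:10];
    wire [9:0] fract = in[9:0];

    wire isZeroExp   = (expIn == 5'd0);
    wire isMaxExp    = (expIn == 5'd31);
    wire isZeroFract = (fract == 10'd0);

    // n = (number of leading zeros in fract) + 1, meaningful for subnormals.
    // The highest set bit wins because the loop runs upward.
    reg [3:0] n;
    integer i;
    always @* begin
        n = 4'd10;
        for (i = 0; i <= 9; i = i + 1)
            if (fract[i]) n = 10 - i;
    end

    // With k = 4: subnormal exponent is 2^k + 2 - n = 18 - n,
    // normal exponent is E + 2^k + 1 = E + 17.
    wire [5:0] recExp =
        isZeroExp ? (isZeroFract ? 6'b000000 : 6'd18 - {2'b00, n})
      : isMaxExp  ? (isZeroFract ? 6'b110000 : 6'b111000)
      :             {1'b0, expIn} + 6'd17;

    // Subnormals are shifted left by n; the leading 1 falls off the top
    // and becomes the implicit bit.
    wire [9:0] recFract =
        (isZeroExp && !isZeroFract) ? (fract << n)
      : (isMaxExp && isZeroFract)   ? 10'd0
      :                               fract;

    assign out = {sign, recExp, recFract};
endmodule
```

For instance, under this sketch the smallest subnormal (fract = 1) gets n = 10 and a recoded exponent of 8, while the smallest normal number (E = 1) gets a recoded exponent of 18.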
A large number of possible bit patterns in the recoded formats don’t conform to entries in the table above, such as when the recoded exponent’s most-significant three bits are 000, indicating a zero, yet the significand field is not zero.
Such patterns are not valid.
If an invalid recoded input is given to a HardFloat module, all module outputs
are unspecified.
Invalid recoded results are never generated by HardFloat’s modules, so
long as all module inputs are valid.
For normalized numbers (not zeros, infinities, or NaNs), a valid recoded exponent field is always in the range 2^k + 3 − sigWidth to 3×2^k − 1, inclusive.
In module names, a recoded format is abbreviated as ‘recFN’.
Like the standard formats, a particular recoded format is specified by parameters expWidth and sigWidth.
The parameters to give, however, are those of the corresponding standard format, without adjustment for the wider recoded exponent field.
For example, the recoded format for common 32-bit single-precision is specified by expWidth = 8 and sigWidth = 24.
The width of the recoded exponent field is expWidth + 1, and the total width of a recoded format is expWidth + sigWidth + 1.
Within the implementation of individual arithmetic operations (addition, multiplication, etc.), HardFloat often represents floating-point values in a deconstructed form, with six separate components:
- isNaN: A single bit indicating whether the floating-point value is a NaN. If isNaN is true (= 1), the other components are irrelevant except possibly for sign and sig. Components sign and sig are relevant for a NaN only if HardFloat is configured to propagate NaN payloads.
- isInf: A single bit indicating whether the floating-point value is an infinity. If isNaN is false and isInf is true, the other components are irrelevant except for sign.
- isZero: A single bit indicating whether the floating-point value is a zero. If isNaN and isInf are both false and isZero is true, the other components are irrelevant except for sign.
- sign: The floating-point sign bit. Ignored only if isNaN is true and NaN payloads are not propagated.
- sExp: For a finite nonzero floating-point value, the biased floating-point exponent as a signed integer. For NaNs, infinities, and zeros, sExp is ignored. As there are no special encodings of sExp to indicate special values, sExp is always just a simple signed integer, with the same bias as in the recoded formats. For a specified expWidth, the actual width of sExp is expWidth + 2, giving it a range of −2^(w+1) to 2^(w+1) − 1, where w = expWidth as usual.
- sig: For a NaN, the NaN payload (sometimes ignored). For a finite nonzero floating-point value, the complete significand, including the usually implicit 1 bit. For a specified sigWidth, the width of sig is sigWidth + 1, with 2 extra bits at the most-significant end when compared to the usual trailing significand of the standard and recoded formats. In some cases, these two bits are always binary 01. In other cases, the most-significant 1 bit of the significand can be either of the two leading bits of sig, allowing for a 1-bit slack in the normalization of the significand. A value of sig with the two most-significant bits both zeros (00) is invalid (unless one of isNaN, isInf, or isZero is true).
Unlike the recoded formats, the deconstructed forms allow many valid encodings that do not correspond directly to IEEE floating-point values. These extra-precise and/or out-of-bounds values may be mapped to the set of valid IEEE values by rounding, possibly resulting in underflow or overflow.
In HardFloat’s module names, the deconstructed forms are abbreviated as ‘rawFN’.
They are typically used in two ways.
First, inputs to an operation get converted from recoded formats into deconstructed forms using recFNToRawFN submodules.
An intermediate numeric result is then computed, also in raw form, and this gets rounded to a valid IEEE value using module roundRawFNToRecFN or roundAnyRawFNToRecFN.
These modules for handling rawFN are documented in more detail in section 10, Common Submodules.
HardFloat’s floating-point arithmetic modules commonly take two inputs that may adjust the behavior of the module:
control
roundingMode
These are covered in the following subsections.
The control input is a vehicle for supplying to floating-point operations any number of miscellaneous control parameters that are not expected to change frequently.
The width of this input is specified by macro floatControlWidth, which is defined by default in file HardFloat_consts.vi, although it may be overridden in HardFloat_specialize.vi.
Currently, the default width for control is only 1 bit.
The default single bit of control determines detection of tininess for underflow.
In the terminology of the IEEE Standard, HardFloat can detect tininess for underflow either before or after rounding.
The following are the names of macros whose values can be bitwise ORed to form a value for the control input:
    flControl_tininessBeforeRounding
    flControl_tininessAfterRounding
If the control bit specified by macro flControl_tininessAfterRounding is 1, tininess is detected after rounding; if the bit is 0, tininess is detected before rounding.
The roundingMode input to a module chooses the rounding mode for a floating-point operation.
All five rounding modes defined by the 2008 IEEE Floating-Point Standard are implemented for all operations that require rounding.
HardFloat adds support also for a sixth mode, round to odd.
The value of roundingMode may be that of any of these macros defined in HardFloat_consts.vi:
    round_near_even      round to nearest, with ties to even
    round_near_maxMag    round to nearest, with ties to maximum magnitude (away from zero)
    round_minMag         round to minimum magnitude (toward zero)
    round_min            round to minimum (down)
    round_max            round to maximum (up)
    round_odd            round to odd (jamming)
Rounding mode round_odd operates as follows:
For a conversion to an integer, if the input is not already an integer value, the rounded result is the closest odd integer.
For operations that return a floating-point value, the floating-point result is first rounded to minimum magnitude, the same as round_minMag, and then, if the result is inexact, the least-significant bit of the result is set to 1.
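The “truncate, then set the least-significant bit if inexact” rule can be illustrated on a plain unsigned magnitude. The sketch below is only that illustration, not HardFloat’s rounding logic; the module name roundOddSketch and its parameters are hypothetical, and it assumes inWidth is greater than outWidth.

```verilog
// Hypothetical illustration of round-to-odd (jamming) on an unsigned value:
// keep the top outWidth bits and OR any discarded nonzero bits into the LSB.
// Assumes inWidth > outWidth.
module roundOddSketch#(parameter inWidth = 8, parameter outWidth = 4) (
    input  [(inWidth - 1):0]  in,
    output [(outWidth - 1):0] out
);
    // Round to minimum magnitude: simply drop the low-order bits.
    wire [(outWidth - 1):0] truncated = in[(inWidth - 1):(inWidth - outWidth)];

    // The result is inexact if any discarded bit was 1.
    wire inexact = |in[(inWidth - outWidth - 1):0];

    // The 1-bit inexact flag is zero-extended, so this sets only the LSB.
    assign out = truncated | inexact;
endmodule
```

With the default parameters, an input of 8'b1010_0001 yields 4'b1011 (inexact, so the low bit is forced to 1), while 8'b1010_0000 yields 4'b1010 (exact, left unchanged).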
HardFloat supports all five exception flags required by the IEEE Floating-Point Standard.
Most modules for floating-point operations have an exceptionFlags output with 5 bits, ordered from most-significant to least-significant as:
    {invalid, infinite, overflow, underflow, inexact}
The infinite exception is what the standard calls “divide by zero”, meaning an infinite result generated from finite operands.
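As a small sketch of how a surrounding design might consume this output, the helper below splits the 5-bit vector into named wires. The module name unpackExceptionFlags and its port names are hypothetical, and the bit order assumed is the one listed above, with invalid as the most-significant bit.

```verilog
// Hypothetical helper (not part of HardFloat): split the 5-bit exceptionFlags
// vector into individually named flags.  Assumes the order
// {invalid, infinite, overflow, underflow, inexact}, MSB first.
module unpackExceptionFlags (
    input  [4:0] exceptionFlags,
    output invalid,
    output infinite,    // what the IEEE Standard calls "divide by zero"
    output overflow,
    output underflow,
    output inexact
);
    assign {invalid, infinite, overflow, underflow, inexact} = exceptionFlags;
endmodule
```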
The module that converts from floating-point to integer, recFNToIN, drops the infinite and underflow bits, because conversions to integer can never underflow or deliver an infinite result.
This module has instead an intExceptionFlags output with 3 bits:
    {invalid, overflow, inexact}
Note that, although recFNToIN distinguishes overflow from invalid exceptions in intExceptionFlags as shown, the IEEE Standard does not permit conversions to integer to signal a floating-point overflow exception.
Rather, if a system has no other way to indicate overflow from conversions to integer, the standard requires that the floating-point invalid exception be signaled, not floating-point overflow.
Hence, it will often be the case that the invalid and overflow bits from the intExceptionFlags output must be ORed together to form the floating-point invalid flag.
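One way to apply this rule is a small adapter that folds integer overflow into invalid while widening the 3-bit result to the usual 5-bit layout. This is a sketch only; the module name foldIntOverflow is hypothetical, and it assumes the bit orders {invalid, overflow, inexact} and {invalid, infinite, overflow, underflow, inexact} described in this section.

```verilog
// Hypothetical adapter (not part of HardFloat): map recFNToIN's 3-bit
// intExceptionFlags onto the 5-bit floating-point flag layout, reporting
// integer overflow as a floating-point invalid exception.
module foldIntOverflow (
    input  [2:0] intExceptionFlags,   // assumed {invalid, overflow, inexact}
    output [4:0] exceptionFlags       // {invalid, infinite, overflow, underflow, inexact}
);
    wire intInvalid  = intExceptionFlags[2];
    wire intOverflow = intExceptionFlags[1];
    wire intInexact  = intExceptionFlags[0];

    // infinite, overflow, and underflow can never arise from this conversion.
    assign exceptionFlags =
        {intInvalid | intOverflow, 1'b0, 1'b0, 1'b0, intInexact};
endmodule
```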
Depending on how it is configured, HardFloat has the ability to create distinct NaNs for different exceptions, and to propagate NaN payloads from operation inputs to output. These options for NaNs are determined by the specialization of HardFloat, covered in the next section.
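The IEEE Floating-Point Standard allows for some variation among conforming implementations, particularly concerning NaNs.
The HardFloat source directory is supplied with some specialization subdirectories containing possible definitions for this implementation-specific behavior.
For example, the 8086-SSE subdirectory specializes HardFloat to match Intel’s x86 processors with SSE.
A specialization determines:
- the width of the control input, if it’s not the default of 1 bit;
- the default value for the control input, including whether tininess for underflow is detected before or after rounding by default;
- the integer results returned when conversions to integer overflow or are invalid; and
- the NaN results of floating-point operations.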
A specialization subdirectory is expected to contain two files: an “include” file named HardFloat_specialize.vi and a regular Verilog file named HardFloat_specialize.v.
If the specific implementation adds any bits to the control input, its width must be defined in HardFloat_specialize.vi by undefining (`undef) the existing macro floatControlWidth and redefining it to the correct number of bits.
If needed, a default value for the control input can be specified by defining macro flControl_default in HardFloat_specialize.vi.
This macro should be defined to either `flControl_tininessBeforeRounding or `flControl_tininessAfterRounding, combined with defaults for any added control bits.
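As a sketch of what such definitions could look like, the fragment below widens control to 2 bits and supplies a default. The macro flControl_exampleOption and its meaning are invented for illustration and are not defined by HardFloat; the fragment also assumes the tininess bit remains bit 0 and that HardFloat_consts.vi has already been included, so that floatControlWidth and the tininess macros exist.

```verilog
// Hypothetical excerpt from a HardFloat_specialize.vi that widens `control`
// from the default 1 bit to 2 bits.  Assumes HardFloat_consts.vi was
// included first.

`undef  floatControlWidth
`define floatControlWidth 2

// Invented extra control bit, placed above the tininess bit (assumed bit 0).
// HardFloat itself assigns no meaning to it.
`define flControl_exampleOption 2'b10

// Default control value: tininess detected after rounding, extra bit off.
`define flControl_default (2'b00 | `flControl_tininessAfterRounding)
```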
To determine the result returned when a conversion to integer type overflows or raises the invalid exception, file HardFloat_specialize.v is expected to define a module with these parameters and ports:

    module iNFromException#(parameter width) (
        input signedOut,
        input isNaN,
        input sign,
        output [(width - 1):0] out
    );

Input signedOut indicates whether the result integer type is signed or unsigned, with 0 being unsigned and 1 being signed.
The other two inputs, isNaN and sign, provide information about the original floating-point input that can affect the choice of the integer result.
If isNaN is true, the floating-point input is a NaN; else it is a number (possibly an infinity) outside the range of the integer type.
The correct integer result is returned in out.
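The following is a minimal sketch of one possible policy for this module: saturate to the largest integer for NaNs and positive out-of-range inputs, and to the smallest integer for negative out-of-range inputs. The policy is an assumption chosen for illustration, and a given specialization may need different results. The default value for width is added only so the sketch compiles as Verilog-2001, and the sketch assumes width is at least 2.

```verilog
// Hypothetical iNFromException for a saturating policy (illustration only).
// Assumes width >= 2.
module iNFromException#(parameter width = 32) (
    input signedOut,                // 1 = signed result type, 0 = unsigned
    input isNaN,                    // original floating-point input was a NaN
    input sign,                     // sign of the original floating-point input
    output [(width - 1):0] out
);
    // Choose the maximum integer for NaNs and for positive inputs.
    wire maxInt = isNaN || !sign;

    // signed max   = 0111...1    signed min   = 1000...0
    // unsigned max = 1111...1    unsigned min = 0000...0
    assign out = {signedOut ^ maxInt, {(width - 1){maxInt}}};
endmodule
```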
The specialization also determines the specific NaNs delivered when operations return NaNs.
One must first choose whether NaN payloads will be propagated.
Defining macro HardFloat_propagateNaNPayloads in HardFloat_specialize.vi enables NaN payload propagation.
(If defined, the content of this macro is ignored; only its effect on `ifdef directives matters.)
The remaining specification of NaN results depends on whether macro HardFloat_propagateNaNPayloads is defined.
Without NaN payload propagation
When NaN payloads are not propagated, a NaN result will usually be just the default NaN for the result format, regardless of any input NaNs.
Default NaNs must be specified by two other macros defined in HardFloat_specialize.vi.
First, a default NaN’s sign bit is chosen by macro HardFloat_signDefaultNaN, which can be defined as either 0 or 1.
Second, a default NaN’s fraction (trailing significand) is given by macro HardFloat_fractDefaultNaN(sigWidth), which must evaluate to an integer of sigWidth − 1 bits.
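A sketch of the two default-NaN macros, assuming a specialization that wants a positive quiet NaN with only the most-significant fraction bit set. The values are illustrative assumptions, not a statement of what any supplied specialization uses.

```verilog
// Hypothetical excerpt from HardFloat_specialize.vi for a specialization that
// does not propagate NaN payloads (HardFloat_propagateNaNPayloads is
// deliberately left undefined).

// Default NaN sign bit: 0 (positive).
`define HardFloat_signDefaultNaN 0

// Default NaN fraction, sigWidth - 1 bits wide: most-significant bit 1
// (quiet), all remaining bits 0.
`define HardFloat_fractDefaultNaN(sigWidth) {1'b1, {((sigWidth) - 2){1'b0}}}
```

Because every supported format has sigWidth of at least 3, the replication count in the second macro is always at least 1.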
With NaN payload propagation
If NaN propagation is enabled (macro HardFloat_propagateNaNPayloads is defined), then NaN results are determined in an entirely different way.
In this case, several modules must be supplied in HardFloat_specialize.v to implement the desired propagation.
The names, parameters, and ports of these modules are as follows:

    module propagateFloatNaN_add#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

    module propagateFloatNaN_mul#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

    module propagateFloatNaN_mulAdd#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        input invalidProd,
        input isNaNC,
        input signC,
        input [(sigWidth - 2):0] fractC,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

    module propagateFloatNaN_divSqrt#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input sqrtOp,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );
A different NaN-propagation module is needed for each of addition/subtraction, multiplication, fused multiply-add, and division/square-root. In all cases:
- The control input comes directly from the inputs of the original floating-point operation.
- Inputs isNaNA, signA, and fractA provide information about the first floating-point operand. (signA and fractA are expected to be useful only when isNaNA is true.)
- Likewise, inputs isNaNB, signB, and fractB inform about the second floating-point operand.
- The sign and fraction of the NaN result are returned in outputs signNaN and fractNaN.
Module propagateFloatNaN_add has an extra input, subOp, indicating whether the operation is addition (subOp = 0) or subtraction (subOp = 1).
Likewise, module propagateFloatNaN_divSqrt has an extra input, sqrtOp, indicating whether the operation is division (sqrtOp = 0) or square root (sqrtOp = 1).
For square root, the b operand (isNaNB, signB, fractB) should be ignored.
Module propagateFloatNaN_mulAdd has these extra inputs:
- Input op duplicates the same input to the mulAdd module.
- Input invalidProd indicates whether the product of floating-point operands a and b is invalid (because one operand is a zero and the other is an infinity).
- Inputs isNaNC, signC, and fractC inform about the third floating-point operand, which is the other addend besides the product a × b.
If no floating-point operand is a NaN (all isNaN inputs are false), the modules must return the correct default NaN for an invalid operation of the given kind.
It may be that correct NaN results depend in part on whether floating-point operands are signaling NaNs.
If so, inputs isNaNA and fractA contain enough information for a module to determine whether operand a is a signaling NaN, and likewise for operands b and c.
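To make the interface concrete, here is a minimal sketch of a propagateFloatNaN_mul module under an assumed policy: return the first operand’s NaN if it is one, otherwise the second operand’s, in either case with the quiet bit forced to 1; if neither operand is a NaN, return a positive quiet NaN as the default. The policy, the default NaN, and the default value for sigWidth are all assumptions for illustration, not the behavior of any supplied specialization.

```verilog
// Hypothetical NaN-propagation module for multiplication (illustration only).
// Intended to live in a HardFloat_specialize.v; the includes supply
// `floatControlWidth.
`include "HardFloat_consts.vi"
`include "HardFloat_specialize.vi"

module propagateFloatNaN_mul#(parameter sigWidth = 24) (
    input [(`floatControlWidth - 1):0] control,   // unused by this policy
    input isNaNA,
    input signA,
    input [(sigWidth - 2):0] fractA,
    input isNaNB,
    input signB,
    input [(sigWidth - 2):0] fractB,
    output signNaN,
    output [(sigWidth - 2):0] fractNaN
);
    // The most-significant fraction bit distinguishes quiet (1) from
    // signaling (0) NaNs.
    wire [(sigWidth - 2):0] quietBit = {1'b1, {(sigWidth - 2){1'b0}}};

    // Prefer operand a, then operand b, else an assumed default NaN.
    assign signNaN  = isNaNA ? signA : isNaNB ? signB : 1'b0;
    assign fractNaN =
        isNaNA ? (fractA | quietBit)
      : isNaNB ? (fractB | quietBit)
      :          quietBit;
endmodule
```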
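HardFloat defines modules for the following floating-point conversions and operations:
- conversions between standard and recoded floating-point;
- conversions from integer to floating-point;
- conversions from floating-point to integer;
- conversions between floating-point formats;
- addition and subtraction;
- multiplication;
- fused multiply-add;
- division and square root;
- comparisons.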
Only for the first bullet item above do HardFloat’s modules directly touch floating-point encoded in a standard IEEE format.
Otherwise, floating-point inputs are always in HardFloat’s recoded format, and floating-point outputs are either in a recoded format or in a “raw” deconstructed form.
When an output is in deconstructed form, the output value is before rounding, and it must still be processed through roundRawFNToRecFN or roundAnyRawFNToRecFN to obtain a correctly rounded result in conformance with IEEE rules.
Modules roundRawFNToRecFN and roundAnyRawFNToRecFN are documented later, in section 10, Common Submodules.
Many HardFloat modules take inputs named control and roundingMode.
These are always as documented in section 6, Common Control and Mode Inputs.
Likewise, output exceptionFlags is always the five exception flags reported in section 7, Exception Results.
Naturally, if recoded floating-point inputs to a module are not valid according to section 5.2, Recoded Formats, the module’s outputs are unspecified.
9.1. Conversions Between Standard and Recoded Floating-Point (fNToRecFN, recFNToFN)
HardFloat has only two modules that input or output floating-point encoded in a standard IEEE format.
One module, fNToRecFN, converts from a standard format into HardFloat’s equivalent recoded format, and the other, recFNToFN, converts back from a recoded format to standard format.
For all other functions, HardFloat requires the use of either its recoded format or a “raw” deconstructed form.
The two conversion modules are complementary, with these parameters and ports:

    module fNToRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth - 1):0] in,
        output [(expWidth + sigWidth):0] out
    );

    module recFNToFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output [(expWidth + sigWidth - 1):0] out
    );
Because the set of values encoded in the recoded format is identical to the corresponding standard format, there are no possible exceptions from converting in either direction.
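As a minimal sketch of how the two conversion modules fit together, the wrapper below recodes a 32-bit single-precision value and immediately converts it back. The wrapper name roundTripF32 and its port names are hypothetical; only fNToRecFN, recFNToFN, and their ports come from the declarations above. It assumes the HardFloat source files are compiled alongside it.

```verilog
// Hypothetical wrapper (not part of HardFloat): recode a binary32 value and
// convert it straight back.  Requires HardFloat's sources to be compiled too.
module roundTripF32 (
    input  [31:0] fIn,       // standard IEEE single-precision encoding
    output [32:0] recoded,   // recoded form: expWidth + sigWidth + 1 = 33 bits
    output [31:0] fOut       // standard encoding recovered from the recoded form
);
    // Standard -> recoded (expWidth = 8, sigWidth = 24 for single-precision).
    fNToRecFN#(8, 24) toRec (.in(fIn), .out(recoded));

    // Recoded -> standard; neither direction can raise an exception.
    recFNToFN#(8, 24) fromRec (.in(recoded), .out(fOut));
endmodule
```

Since recoding changes only the representation and not the value, fOut is expected to encode the same value as fIn.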
9.2. Conversions from Integer (iNToRecFN, iNToRawFN)
Module iNToRecFN converts from an integer type to floating-point in recoded form.
Its parameters and ports are:

    module iNToRecFN#(parameter intWidth, parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input signedIn,
        input [(intWidth - 1):0] in,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

The input named in is interpreted as an unsigned integer if signedIn is false, or as a signed integer if signedIn is true.
Module iNToRawFN performs a similar function, but returns a floating-point value in deconstructed form:

    module iNToRawFN#(parameter intWidth) (
        input signedIn,
        input [(intWidth - 1):0] in,
        output isZero,
        output sign,
        output signed [(clog2(intWidth) + 2):0] sExp,
        output [intWidth:0] sig
    );
This module lacks parameters expWidth and sigWidth.
Instead, the output format is determined from the width of the input integer as follows:
    expWidth = clog2(intWidth) + 1
    sigWidth = intWidth
Function clog2(x) is the smallest integer at least as large as the base-2 logarithm of x.
For example, if intWidth is 17, the base-2 logarithm of intWidth is approximately 4.087, so clog2(intWidth) is 5, and the width of sExp is thus 8 bits (clog2(intWidth) + 3).
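The width arithmetic can be checked with a constant function. The sketch below defines its own clog2Sketch, a hypothetical stand-in written for this example (HardFloat supplies its own clog2), and evaluates the intWidth = 17 case at elaboration time. The module and parameter names are likewise hypothetical.

```verilog
// Hypothetical demonstration module; clog2Sketch is a stand-in written for
// this example and is not HardFloat's own clog2.
module clog2Demo;

    // Smallest integer at least as large as log2(x), for x >= 1.
    function integer clog2Sketch;
        input integer x;
        integer v;
        begin
            clog2Sketch = 0;
            v = x - 1;
            while (v > 0) begin
                clog2Sketch = clog2Sketch + 1;
                v = v >> 1;
            end
        end
    endfunction

    localparam intWidth  = 17;
    localparam expWidth  = clog2Sketch(intWidth) + 1;   // output format's expWidth
    localparam sExpWidth = clog2Sketch(intWidth) + 3;   // bits in [(clog2 + 2):0]

    initial begin
        // Prints: expWidth = 6, sExp width = 8
        $display("expWidth = %0d, sExp width = %0d", expWidth, sExpWidth);
    end

endmodule
```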
The deconstructed output from iNToRawFN omits isNaN and isInf, because these are known always to be false for an integer.
The widths chosen for the output exponent and significand allow the floating-point result from iNToRawFN to be always exactly equal in value to the integer input.
Much like recFNToRawFN, iNToRawFN guarantees the following for its outputs:
- sExp is never negative.
- If the value is not zero (isZero is false), the two most-significant bits of sig are binary 01.
- If the value is zero (isZero is true), sign and sig are zeros.
9.3. Conversions to Integer (recFNToIN)
Module recFNToIN converts from a floating-point value in recoded form to an integer type.
Its parameters and ports are:

    module recFNToIN#(parameter expWidth, parameter sigWidth, parameter intWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] in,
        input [2:0] roundingMode,
        input signedOut,
        output [(intWidth - 1):0] out,
        output [2:0] intExceptionFlags
    );

The output named out is an unsigned integer if input signedOut is false, or is a signed integer if signedOut is true.
As explained earlier in section 7, Exception Results, intExceptionFlags reports exceptions invalid, overflow, and inexact.
Although intExceptionFlags distinguishes integer overflow separately from invalid exceptions, the IEEE Standard does not permit conversions to integer to raise a floating-point overflow exception.
Instead, if a system has no other way to indicate that a conversion to integer overflowed, the standard requires that the floating-point invalid exception be raised, not floating-point overflow.
Hence, the invalid and overflow bits from intExceptionFlags will typically be ORed together.
9.4. Conversions Between Formats (recFNToRecFN)
Module recFNToRecFN converts a recoded floating-point value to a different recoded format (such as from single-precision to double-precision, or vice versa):

    module recFNToRecFN#(
        parameter inExpWidth,
        parameter inSigWidth,
        parameter outExpWidth,
        parameter outSigWidth
    ) (
        input [(`floatControlWidth - 1):0] control,
        input [(inExpWidth + inSigWidth):0] in,
        input [2:0] roundingMode,
        output [(outExpWidth + outSigWidth):0] out,
        output [4:0] exceptionFlags
    );
This module requires no special explanation.
9.5. Addition and Subtraction (addRecFN, addRecFNToRaw)
Module addRecFN adds or subtracts two recoded floating-point values, returning a result in the same format.
Its parameters and ports are:

    module addRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

When input subOp is 0, the operation is a + b; when subOp is 1, the operation is a − b.
A variant module, addRecFNToRaw, returns the intermediate result of addition or subtraction before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

    module addRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Boolean output invalidExc is true if the operation should raise an invalid exception.
Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.
9.6. Multiplication (mulRecFN, mulRecFNToRaw, mulRecFNToFullRaw)
Module mulRecFN multiplies two recoded floating-point values, returning a result in the same format:

    module mulRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

A variant module, mulRecFNToRaw, returns the intermediate result of multiplication before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

    module mulRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Boolean output invalidExc is true if the operation should raise an invalid exception.
Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.
Module mulRecFNToFullRaw is a different variant, acting like mulRecFNToRaw except returning the complete double-width product significand:

    module mulRecFNToFullRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth*2 - 1):0] out_sig
    );

Unlike mulRecFNToRaw, the result from mulRecFNToFullRaw is the exact product of the operands, without any rounding approximation.
This full-size deconstructed floating-point result can be correctly rounded to any recoded format using roundAnyRawFNToRecFN.
9.7. Fused Multiply-Add (mulAddRecFN, mulAddRecFNToRaw)
Module mulAddRecFN implements fused multiply-add as defined by the IEEE Floating-Point Standard:

    module mulAddRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [(expWidth + sigWidth):0] c,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

When op = 0, the function computed is (a × b) + c.
If one of the multiplication operands a and b is infinite and the other is zero, the invalid exception is indicated even if operand c is a quiet NaN.
The bits of input op affect the signs of the addends, making it possible to
turn addition into subtraction (much like the subOp input to addRecFN).
The exact effects of op are summarized in this table:
    op[1]  op[0]  Function
      0      0    (a × b) + c
      0      1    (a × b) − c
      1      0    c − (a × b)
      1      1    −(a × b) − c
In all cases, the function is computed with only a single rounding, of course.
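For instance, a fused multiply-subtract can be sketched by tying op to a constant taken from the table above. The module name mulSubRecFN and its default parameter values are hypothetical, and the include line assumes that HardFloat_consts.vi supplies `floatControlWidth.

```verilog
`include "HardFloat_consts.vi"  // assumed to define `floatControlWidth

// Hypothetical wrapper computing (a × b) − c with a single rounding.
module mulSubRecFN#(parameter expWidth = 8, parameter sigWidth = 24) (
    input [(`floatControlWidth - 1):0] control,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [(expWidth + sigWidth):0] c,
    input [2:0] roundingMode,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    mulAddRecFN#(.expWidth(expWidth), .sigWidth(sigWidth)) mulAdd (
        .control(control),
        .op(2'b01),  // op[1] = 0, op[0] = 1 selects (a × b) − c
        .a(a), .b(b), .c(c),
        .roundingMode(roundingMode),
        .out(out), .exceptionFlags(exceptionFlags)
    );

endmodule
```

Driving op from a signal instead of a constant lets one instance serve all four functions in the table.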
A variant module, mulAddRecFNToRaw, returns the intermediate result of the
fused multiply-add before rounding, as a “raw” deconstructed floating-point
value with two extra bits of significand:
    module mulAddRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [(expWidth + sigWidth):0] c,
        input [2:0] roundingMode,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );
Boolean output invalidExc
is true if the operation should
raise an invalid exception.
Module roundRawFNToRecFN
can be used to round the intermediate
result in conformance with the IEEE Standard.
9.8. Division and Square Root (divSqrtRecFN_small, divSqrtRecFNToRaw_small)
Besides basic addition and multiplication, HardFloat has modules for computing
either division or square root, where the choice of function is controlled by a
module input.
Implementing as they do the most complex operations that HardFloat supports,
these combined division/square-root modules are unique within HardFloat for
being sequential, meaning they take a clock input and execute over
more than one clock cycle.
The division/square-root modules in HardFloat use classic one-bit-per-cycle
techniques that are simple and inexpensive but also slower than other known
algorithms for computing division and square root.
This character is reflected in the suffix ‘_small’ in the modules’ names.
HardFloat’s principal division/square-root module is divSqrtRecFN_small, with
these parameters and ports:
    module divSqrtRecFN_small#(
        parameter expWidth, parameter sigWidth, parameter options = 0
    ) (
        input nReset,
        input clock,
        input [(`floatControlWidth - 1):0] control,
        output inReady,
        input inValid,
        input sqrtOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output outValid,
        output sqrtOpOut,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );
Currently, the only valid value for the options
parameter is zero.
The reset input, nReset, is active-negative and operates asynchronously within
the module.
(By applying reset asynchronously, the module should accept any valid form of
reset, assuming proper care is taken elsewhere to synchronize the release of
reset with the clock.)
No clock cycles are needed during reset.
Apart from reset, state changes within the module occur only on the rising
edges of clock.
The module asserts inReady true when it is ready to begin a new operation.
When inReady and inValid are both true at a rising edge of clock, a new
operation is begun, and inputs control, sqrtOp, a, b, and roundingMode must
also be valid.
The computation of inValid (or any other module input) should not depend on
inReady within the same clock cycle.
If sqrtOp = 0, the operation is the division a ÷ b.
If sqrtOp = 1, the operation is the square root of a, and operand b is
ignored.
After some number of clock cycles, outValid is asserted true for exactly one
clock cycle, at which time outputs sqrtOpOut, out, and exceptionFlags are all
valid.
These outputs become invalid again in the very next clock cycle if another
division or square root operation gets initiated before or during the cycle
that outValid is true.
There is no mechanism within the module to retain a result for more than one
cycle once another division or square root has begun.
On the other hand, if no subsequent operation has yet been started when
outValid is asserted, then divSqrtRecFN_small will hold its result outputs
constant until a new operation is begun (by asserting inValid while inReady is
true).
Even so, outValid is never asserted for more than one clock cycle per result,
indicating the first cycle when the result is valid.
Output sqrtOpOut is merely a copy of the original sqrtOp for the operation.
It is expected that clients will rarely have a need for this output.
The number of clock cycles to complete an operation is not guaranteed to be
constant, but may in fact depend on all inputs to the operation, including a,
b, and roundingMode.
Because the module employs algorithms that compute one result bit per cycle,
the number of cycles is typically sigWidth + n for some small n.
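Because a result is guaranteed valid only during the cycle that outValid is true, a client that may start the next operation immediately needs its own holding register. The sketch below shows one way to do that. The module name divSqrtHold and the outputs resultValid, result, and resultFlags are hypothetical, as are the default parameter values; the include line assumes that HardFloat_consts.vi supplies `floatControlWidth.

```verilog
`include "HardFloat_consts.vi"  // assumed to define `floatControlWidth

// Hypothetical wrapper that latches each result when outValid pulses, so
// the result survives even if another operation is started right away.
module divSqrtHold#(parameter expWidth = 8, parameter sigWidth = 24) (
    input nReset,
    input clock,
    input [(`floatControlWidth - 1):0] control,
    output inReady,
    input inValid,
    input sqrtOp,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [2:0] roundingMode,
    output reg resultValid,  // set once the first result has been captured
    output reg [(expWidth + sigWidth):0] result,
    output reg [4:0] resultFlags
);

    wire outValid;
    wire [(expWidth + sigWidth):0] out;
    wire [4:0] exceptionFlags;

    divSqrtRecFN_small#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) divSqrt (
        .nReset(nReset), .clock(clock), .control(control),
        .inReady(inReady), .inValid(inValid), .sqrtOp(sqrtOp),
        .a(a), .b(b), .roundingMode(roundingMode),
        .outValid(outValid),
        .sqrtOpOut(),  // left unconnected; rarely needed by clients
        .out(out), .exceptionFlags(exceptionFlags)
    );

    // Same reset style as the wrapped module: asynchronous, active-negative.
    always @(posedge clock or negedge nReset) begin
        if (!nReset) begin
            resultValid <= 1'b0;
            result <= 0;
            resultFlags <= 5'b0;
        end else if (outValid) begin
            resultValid <= 1'b1;
            result <= out;
            resultFlags <= exceptionFlags;
        end
    end

endmodule
```

Note that inValid is passed straight through from the client, so the rule that inValid must not depend on inReady within the same cycle still applies to whatever drives this wrapper.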
A variant module, divSqrtRecFNToRaw_small, returns the intermediate result of
a division or square root operation before rounding, as a “raw” deconstructed
floating-point value with two extra bits of significand:
    module divSqrtRecFNToRaw_small#(
        parameter expWidth, parameter sigWidth, parameter options = 0
    ) (
        input nReset,
        input clock,
        input [(`floatControlWidth - 1):0] control,
        output inReady,
        input inValid,
        input sqrtOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output outValid,
        output sqrtOpOut,
        output [2:0] roundingModeOut,
        output invalidExc,
        output infiniteExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );
Functionally, the only difference between this module and divSqrtRecFN_small
is the form of the outputs.
Like sqrtOpOut, output roundingModeOut is simply a copy of the original
roundingMode for the operation.
Boolean output invalidExc is true if the operation should raise an invalid
exception, while infiniteExc is true if the operation should raise an infinite
exception (“divide by zero”).
Module roundRawFNToRecFN
can be used to round the intermediate
result in conformance with the IEEE Standard.
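The sketch below connects the raw variant to roundRawFNToRecFN, using roundingModeOut so that the rounding step sees the mode that was supplied when the operation began. The module name divSqrt_viaRaw, the wire names, and the default parameter values are hypothetical. The include line assumes that HardFloat_consts.vi supplies `floatControlWidth, and the sketch assumes control is held constant for the duration of an operation because it feeds the rounding step directly.

```verilog
`include "HardFloat_consts.vi"  // assumed to define `floatControlWidth

// Hypothetical wrapper: raw division/square root followed by rounding.
module divSqrt_viaRaw#(parameter expWidth = 8, parameter sigWidth = 24) (
    input nReset,
    input clock,
    input [(`floatControlWidth - 1):0] control,  // assumed stable per operation
    output inReady,
    input inValid,
    input sqrtOp,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [2:0] roundingMode,
    output outValid,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    wire [2:0] roundingModeOut;
    wire invalidExc, infiniteExc;
    wire raw_isNaN, raw_isInf, raw_isZero, raw_sign;
    wire signed [(expWidth + 1):0] raw_sExp;
    wire [(sigWidth + 2):0] raw_sig;

    divSqrtRecFNToRaw_small#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) divSqrtToRaw (
        .nReset(nReset), .clock(clock), .control(control),
        .inReady(inReady), .inValid(inValid), .sqrtOp(sqrtOp),
        .a(a), .b(b), .roundingMode(roundingMode),
        .outValid(outValid), .sqrtOpOut(),
        .roundingModeOut(roundingModeOut),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .out_isNaN(raw_isNaN), .out_isInf(raw_isInf),
        .out_isZero(raw_isZero), .out_sign(raw_sign),
        .out_sExp(raw_sExp), .out_sig(raw_sig)
    );

    roundRawFNToRecFN#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) roundRaw (
        .control(control),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .in_isNaN(raw_isNaN), .in_isInf(raw_isInf),
        .in_isZero(raw_isZero), .in_sign(raw_sign),
        .in_sExp(raw_sExp), .in_sig(raw_sig),
        .roundingMode(roundingModeOut),  // mode captured with the operation
        .out(out), .exceptionFlags(exceptionFlags)
    );

endmodule
```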
9.9. Comparisons (compareRecFN)
Module compareRecFN
compares two recoded floating-point values for
equality or inequality, according to the IEEE Standard.
Its parameters and ports are as follows:
    module compareRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input signaling,
        output lt,
        output eq,
        output gt,
        output unordered,
        output [4:0] exceptionFlags
    );
If Boolean input signaling
is true, a signaling
comparison is done, meaning that an invalid exception is raised if
either operand is any kind of NaN.
If signaling
is false, a quiet comparison is done,
meaning that quiet NaNs do not cause an invalid exception.
The IEEE Standard mandates that equality comparisons (= and ≠) be quiet, while
inequality comparisons (<, ≤, ≥, and >) are signaling by default.
Boolean outputs lt, eq, and gt indicate whether a < b, a = b, or a > b,
respectively.
Output unordered is true if either operand is a NaN.
Exactly one of lt, eq, gt, and unordered will be true for any pair of
operands, a and b.
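Since the four outputs are mutually exclusive, the remaining comparison predicates can be derived with simple gates. The sketch below is one way to do so. The module name compareMoreRecFN, the outputs le, ge, and ne, and the default parameter values are hypothetical and are not part of HardFloat.

```verilog
// Hypothetical wrapper deriving <=, >=, and != from compareRecFN.
module compareMoreRecFN#(parameter expWidth = 8, parameter sigWidth = 24) (
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input signaling,
    output le,
    output ge,
    output ne,
    output [4:0] exceptionFlags
);

    wire lt, eq, gt, unordered;

    compareRecFN#(.expWidth(expWidth), .sigWidth(sigWidth)) compare (
        .a(a), .b(b), .signaling(signaling),
        .lt(lt), .eq(eq), .gt(gt), .unordered(unordered),
        .exceptionFlags(exceptionFlags)
    );

    assign le = lt | eq;  // false when unordered
    assign ge = gt | eq;  // false when unordered
    assign ne = !eq;      // true for lt, gt, and unordered (NaN operands)

endmodule
```

Following the convention described above, a client would normally drive signaling true when using le or ge and false when using ne.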
10. Common Submodules
A few HardFloat modules are components that are shared by multiple
floating-point operations.
One such module tests whether a floating-point value is a signaling NaN.
Other modules convert into and out of the deconstructed “raw” forms
documented earlier in Section 5.3, Raw Deconstructions.
These modules are found in source files isSigNaNRecFN.v and
HardFloat_rawFN.v.
These modules are documented here for those who may need to use them directly. Otherwise, this section can be skipped.
10.1. isSigNaNRecFN
Module isSigNaNRecFN
tells whether a floating-point value is a
signaling NaN.
It has these parameters and ports:
    module isSigNaNRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output isSigNaN
    );
As indicated by the module’s name, the input floating-point value is in
HardFloat’s recoded format.
The output isSigNaN
is true if the input is a signaling
NaN.
10.2. recFNToRawFN
Module recFNToRawFN
converts a floating-point value from
HardFloat’s recoded format into the “raw” deconstructed form.
Its parameters and ports are as follows:
    module recFNToRawFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output isNaN,
        output isInf,
        output isZero,
        output sign,
        output signed [(expWidth + 1):0] sExp,
        output [sigWidth:0] sig
    );
For general information about the deconstructed form, see Section 5.3, Raw
Deconstructions.
The outputs of recFNToRawFN additionally satisfy these properties:
- isNaN, isInf, and isZero are always valid, so at most one is true.
  (Ordinarily, when isNaN is true, the value of isInf is irrelevant, and when
  isNaN or isInf is true, isZero is irrelevant.)
- sExp is never negative.
- For finite nonzero values (isNaN, isInf, and isZero all being false):
  - sExp is in the recoded format’s range of 2^(expWidth − 1) + 3 − sigWidth
    to 3×2^(expWidth − 1) − 1.
  - The two most-significant bits of sig are binary 01.
- For zeros (isZero is true):
  - sExp is less than the value it has for any finite nonzero number in the
    same format (i.e., less than 2^(expWidth − 1) + 3 − sigWidth).
  - sig is zero.
10.3. roundAnyRawFNToRecFN
Module roundAnyRawFNToRecFN
takes an intermediate floating-point
value in deconstructed form and rounds it to a valid IEEE-conformant value in a
recoded format, applying the requested rounding mode and taking account of any
exceptional conditions such as underflow or overflow.
This module is declared with these parameters and ports:
    module roundAnyRawFNToRecFN#(
        parameter inExpWidth,
        parameter inSigWidth,
        parameter outExpWidth,
        parameter outSigWidth,
        parameter options = 0
    ) (
        input [(`floatControlWidth - 1):0] control,
        input invalidExc,
        input infiniteExc,
        input in_isNaN,
        input in_isInf,
        input in_isZero,
        input in_sign,
        input signed [(inExpWidth + 1):0] in_sExp,
        input [inSigWidth:0] in_sig,
        input [2:0] roundingMode,
        output [(outExpWidth + outSigWidth):0] out,
        output [4:0] exceptionFlags
    );
Parameters inExpWidth
and inSigWidth
characterize the
floating-point input to be rounded, while outExpWidth
and
outSigWidth
control the result format.
The options
parameter is explained later below.
Inputs control and roundingMode are as documented in Section 6, Common Control
and Mode Inputs.
The in_* inputs supply the incoming floating-point value in deconstructed
form.
Output out is the rounded IEEE floating-point value, in recoded format, while
exceptionFlags delivers the five floating-point exception flags as documented
in Section 7, Exception Results.
That leaves only inputs invalidExc and infiniteExc, both Booleans.
If invalidExc is true, it forces an invalid exception to be asserted,
independent of the other inputs, so the floating-point result delivered by out
will be a NaN, and an invalid exception will be indicated in exceptionFlags.
Similarly, infiniteExc asserts an infinite exception (“divide by zero”)
independent of most other inputs.
If invalidExc is false and infiniteExc is true, the floating-point output will
be an infinity with sign in_sign, and an infinite exception will be indicated
in exceptionFlags.
When the floating-point input to roundAnyRawFNToRecFN is the intermediate
result of an operation, its precision typically needs to be at least two bits
greater than the output precision to avoid corrupting the result value, i.e.,
inSigWidth ≥ outSigWidth + 2.
If the input is instead an exact value (for example, the output of iNToRawFN
or recFNToRawFN), then no relationship between input and output formats is
necessary.
Source file HardFloat_consts.vi defines several ‘flRoundOpt_’ macros to be
used for the options parameter, with the following meanings:
flRoundOpt_sigMSBitAlwaysZero
- For finite nonzero values, the two most-significant bits of in_sig are always binary 01.
flRoundOpt_subnormsAlwaysExact
- Whenever the floating-point result is a subnormal, the result is always exact, requiring no real rounding. The inexact exception is therefore never indicated for subnormal results. (This case commonly arises with floating-point addition and subtraction.)
flRoundOpt_neverUnderflows
- Underflow never occurs, because, for finite nonzero values, the floating-point exponent is never below the normal range.
flRoundOpt_neverOverflows
- Overflow never occurs, because, for finite nonzero values, the floating-point exponent is never above the normal range.
The options
parameter can be set to the bitwise OR of any
combination of these macro values, or to 0, if none is applicable.
By setting options
to the maximal set of conditions that apply,
the efficiency of the module may be improved.
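As a sketch of combining options, the wrapper below is for a hypothetical client that can guarantee its raw values never have exponents outside the normal range, so neither underflow nor overflow can occur. The module name roundInRangeRawFN, the localparam roundOptions, and the default parameter values are illustrative. The include line assumes that HardFloat_consts.vi supplies both `floatControlWidth and the flRoundOpt_ macros.

```verilog
`include "HardFloat_consts.vi"  // assumed to define `floatControlWidth
                                // and the flRoundOpt_ macros

// Hypothetical wrapper: rounding for values known to stay in normal range.
module roundInRangeRawFN#(parameter expWidth = 8, parameter sigWidth = 24) (
    input [(`floatControlWidth - 1):0] control,
    input invalidExc,
    input infiniteExc,
    input in_isNaN,
    input in_isInf,
    input in_isZero,
    input in_sign,
    input signed [(expWidth + 1):0] in_sExp,
    input [(sigWidth + 2):0] in_sig,
    input [2:0] roundingMode,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    // Bitwise OR of every condition the client can guarantee.
    localparam roundOptions =
        `flRoundOpt_neverUnderflows | `flRoundOpt_neverOverflows;

    roundRawFNToRecFN#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(roundOptions)
    ) roundRaw (
        .control(control),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .in_isNaN(in_isNaN), .in_isInf(in_isInf),
        .in_isZero(in_isZero), .in_sign(in_sign),
        .in_sExp(in_sExp), .in_sig(in_sig),
        .roundingMode(roundingMode),
        .out(out), .exceptionFlags(exceptionFlags)
    );

endmodule
```

Claiming a condition that does not actually hold would presumably produce incorrect results, so each option should be set only when the guarantee is certain.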
If the floating-point output is a NaN (because either invalidExc is true, or
invalidExc and infiniteExc are false and in_isNaN is true), and if macro
HardFloat_propagateNaNPayloads is defined (refer back to Section 8.3, NaN
Results), then the NaN result is determined by in_sign and in_sig.
This is true even when the NaN result should be the default NaN.
The client is responsible for controlling when a NaN result will be the default
NaN, by setting in_sign and in_sig appropriately.
On the other hand, if macro HardFloat_propagateNaNPayloads is not defined, any
NaN outputs from roundAnyRawFNToRecFN are always the default NaN, and inputs
in_sign and in_sig are ignored for NaNs.
10.4. roundRawFNToRecFN
Module roundRawFNToRecFN is a variation on roundAnyRawFNToRecFN, with the same
set of ports but fewer parameters:
    module roundRawFNToRecFN#(
        parameter expWidth, parameter sigWidth, parameter options = 0
    ) (
        input [(`floatControlWidth - 1):0] control,
        input invalidExc,
        input infiniteExc,
        input in_isNaN,
        input in_isInf,
        input in_isZero,
        input in_sign,
        input signed [(expWidth + 1):0] in_sExp,
        input [(sigWidth + 2):0] in_sig,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );
roundRawFNToRecFN is identical to roundAnyRawFNToRecFN with this assignment of
parameters:
    inExpWidth  = expWidth
    inSigWidth  = sigWidth + 2
    outExpWidth = expWidth
    outSigWidth = sigWidth
Note that the deconstructed input has implicitly two more bits of precision
than the specified sigWidth would normally indicate.
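The parameter assignment above can be written out as an explicit instantiation. This is only a sketch of the stated equivalence, not a copy of HardFloat's source: the module name roundRawFNToRecFN_viaAny and the default parameter values are hypothetical, and the include line assumes that HardFloat_consts.vi supplies `floatControlWidth.

```verilog
`include "HardFloat_consts.vi"  // assumed to define `floatControlWidth

// Hypothetical stand-in with the same ports as roundRawFNToRecFN.
module roundRawFNToRecFN_viaAny#(
    parameter expWidth = 8, parameter sigWidth = 24, parameter options = 0
) (
    input [(`floatControlWidth - 1):0] control,
    input invalidExc,
    input infiniteExc,
    input in_isNaN,
    input in_isInf,
    input in_isZero,
    input in_sign,
    input signed [(expWidth + 1):0] in_sExp,
    input [(sigWidth + 2):0] in_sig,
    input [2:0] roundingMode,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    roundAnyRawFNToRecFN#(
        .inExpWidth(expWidth),
        .inSigWidth(sigWidth + 2),  // two extra bits of precision on input
        .outExpWidth(expWidth),
        .outSigWidth(sigWidth),
        .options(options)
    ) roundAny (
        .control(control),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .in_isNaN(in_isNaN), .in_isInf(in_isInf),
        .in_isZero(in_isZero), .in_sign(in_sign),
        .in_sExp(in_sExp), .in_sig(in_sig),
        .roundingMode(roundingMode),
        .out(out), .exceptionFlags(exceptionFlags)
    );

endmodule
```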
11. Testing HardFloat
The HardFloat package includes a subdirectory named test containing source
code and example Makefiles for testing HardFloat’s Verilog modules.
To execute the tests, either a Verilog simulator or Verilator is required.
(Verilator is a free tool for converting a subset of synthesizable Verilog or
SystemVerilog into C++ code.
When compiled into an executable program, the code generated by Verilator has
been found to run much faster than some Verilog simulators.)
HardFloat’s test infrastructure also depends on Berkeley TestFloat, which must be obtained and compiled separately. And building TestFloat furthermore requires Berkeley SoftFloat, thus completing the three-part set of Berkeley HardFloat, SoftFloat, and TestFloat. Information about TestFloat and SoftFloat can be found at their respective Web pages:
http://www.jhauser.us/arithmetic/TestFloat.html
http://www.jhauser.us/arithmetic/SoftFloat.html
Separate documentation is supplied according to whether one is using a Verilog simulator or Verilator for testing:
HardFloat-test-Verilog.html
- Documentation for testing HardFloat using Verilog simulation.
HardFloat-test-Verilator.html
- Documentation for testing HardFloat using Verilator.
12. Contact Information
At the time of this writing, the most up-to-date information about HardFloat
and the latest release can be found at the Web page
http://www.jhauser.us/arithmetic/HardFloat.html