Berkeley HardFloat Release 1: Verilog Modules

John R. Hauser
2019 July 29

Contents

1. Introduction
2. Limitations
3. Acknowledgments and License
4. HardFloat Package Directory Structure
5. Floating-Point Representations
5.1. Standard Formats
5.2. Recoded Formats
5.3. Raw Deconstructions
6. Common Control and Mode Inputs
6.1. Control Input
6.2. Rounding Mode
7. Exception Results
8. Specialization
8.1. Width and Default Value for the Control Input
8.2. Integer Results on Exceptions
8.3. NaN Results
9. Main Modules
9.1. Conversions Between Standard and Recoded Floating-Point (fNToRecFN, recFNToFN)
9.2. Conversions from Integer (iNToRecFN, iNToRawFN)
9.3. Conversions to Integer (recFNToIN)
9.4. Conversions Between Formats (recFNToRecFN)
9.5. Addition and Subtraction (addRecFN, addRecFNToRaw)
9.6. Multiplication (mulRecFN, mulRecFNToRaw, mulRecFNToFullRaw)
9.7. Fused Multiply-Add (mulAddRecFN, mulAddRecFNToRaw)
9.8. Division and Square Root (divSqrtRecFN_small, divSqrtRecFNToRaw_small)
9.9. Comparisons (compareRecFN)
10. Common Submodules
10.1. isSigNaNRecFN
10.2. recFNToRawFN
10.3. roundAnyRawFNToRecFN
10.4. roundRawFNToRecFN
11. Testing HardFloat
12. Contact Information

1. Introduction

Berkeley HardFloat is a hardware implementation of binary floating-point that conforms to the IEEE Standard for Floating-Point Arithmetic. HardFloat supports a wide range of floating-point formats, using module parameters to independently determine the widths of the exponent and significand fields. The set of possible formats includes the standard ones of 16-bit half-precision, 32-bit single-precision, 64-bit double-precision, and 128-bit quadruple-precision. Some historical extended formats, such as Intel’s old 80-bit double-extended-precision floating-point, are not directly supported. (But HardFloat could be used in the implementation of these and other IEEE-based formats with the addition of modules to convert between encodings.)

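For any supported format, the following arithmetic functions are available:

addition and subtraction;
multiplication;
fused multiply-add;
division and square root;
comparisons;
conversions to and from integer types;
conversions between floating-point formats.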

This document covers the Verilog version of Berkeley HardFloat. An understanding of both Verilog and the IEEE Floating-Point Standard is required.

HardFloat is currently in its first documented release, called Release 1.

2. Limitations

In its Verilog version, Berkeley HardFloat requires development tools that conform at a minimum to the 2001 IEEE Standard for the Verilog language.

For a particular IEEE-conforming binary floating-point format, if w is the width of the format’s exponent field and p is the precision as defined by the floating-point standard (that is, p = 1 + the width of the trailing significand field), HardFloat requires that w ≥ 3 and p ≥ 3, and also that w and p satisfy this relationship:

p  ≤  2^(w − 2) + 3

Formats not satisfying these constraints will not always operate correctly.
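As a quick check of this relationship: for half-precision, w = 5 gives 2^(5 − 2) + 3 = 11, so p = 11 sits exactly at the limit, while single-precision (limit 67), double-precision (limit 515), and quadruple-precision (limit 8195) fall well inside it. The following is a minimal sketch of how the constraints might be checked at the start of a simulation. Module checkFormatSketch is hypothetical and not part of HardFloat.

```verilog
// Hypothetical helper, not part of HardFloat: reports at the start of
// simulation whether a format meets the documented constraints.
module checkFormatSketch;
    parameter expWidth = 5;
    parameter sigWidth = 11;
    // Largest supported precision for this exponent width: 2^(w - 2) + 3.
    // (Assumes expWidth is small enough for a 32-bit constant.)
    localparam maxSigWidth = (1 << (expWidth - 2)) + 3;
    initial begin
        if ((expWidth < 3) || (sigWidth < 3) || (sigWidth > maxSigWidth))
            $display(
                "unsupported format: expWidth=%0d, sigWidth=%0d",
                expWidth, sigWidth);
        else
            $display("format is within the documented limits");
    end
endmodule
```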

Although HardFloat supports an infinite range of binary floating-point formats within these constraints, Release 1 has been tested only for the common IEEE-defined formats of 16-bit half-precision, 32-bit single-precision, 64-bit double-precision, and 128-bit quadruple-precision. You should assume there is a greater risk of failure for any format that is not one of the four that have been tested.

While the range of HardFloat’s parameters includes the “bfloat16” format (w = 8, p = 8) first defined by Google and subsequently adopted by Intel and others, HardFloat’s floating-point always includes support for subnormals as dictated by the IEEE Standard, whereas at least some versions of bfloat16 (as implemented by Intel, for example) officially exclude subnormals. Getting HardFloat to implement bfloat16 without subnormals requires some modifications, such as by adding wrappers around HardFloat’s modules to force subnormals to zeros.
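One possible shape for such a wrapper is sketched below. It forces subnormal inputs in a standard format to zeros of the same sign before they reach HardFloat. Module flushSubnormalFNSketch is hypothetical, and the sketch covers only inputs; results would need similar treatment, and the exact behavior wanted for a given bfloat16 variant is an assumption to be checked against that variant's definition.

```verilog
// Hypothetical wrapper, not part of HardFloat: replaces a subnormal input
// in standard format with a zero of the same sign.
module
    flushSubnormalFNSketch#(parameter expWidth = 8, parameter sigWidth = 8) (
        input [(expWidth + sigWidth - 1):0] in,
        output [(expWidth + sigWidth - 1):0] out
    );
    wire sign = in[expWidth + sigWidth - 1];
    // Encoded exponent field: expWidth bits just below the sign bit.
    wire [(expWidth - 1):0] expIn = in[(expWidth + sigWidth - 2):(sigWidth - 1)];
    // An all-zeros exponent field means the value is a zero or a subnormal.
    wire isZeroOrSubnormal = (expIn == 0);
    assign out =
        isZeroOrSubnormal ? {sign, {(expWidth + sigWidth - 1){1'b0}}} : in;
endmodule
```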

HardFloat is designed to operate on IEEE floating-point values by converting them into a recoded format, performing arithmetic in the recoded format, and then eventually converting back to a standard IEEE-defined encoding. HardFloat’s recoded formats encode the exact same set of values as the standard floating-point encodings, so recoding entails only a change of representation as bits, not a change of value. It is intended that this recoding be made invisible to other system components by converting values back to a standard encoding whenever necessary. Nevertheless, there may exist situations where use of the recoded formats cannot be tolerated. For more about the recoding, see section 5, Floating-Point Representations.

3. Acknowledgments and License

The HardFloat package was written by me, John R. Hauser. The project was done in the employ of the University of California, Berkeley, within the Department of Electrical Engineering and Computer Sciences, first for the Parallel Computing Laboratory (Par Lab), then for the ASPIRE Lab, and lastly for the ADEPT Lab. The work was officially overseen by Prof. Krste Asanovic, with funding provided by these sources:

Par Lab: Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia, NVIDIA, Oracle, and Samsung.
ASPIRE Lab: DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA, Oracle, and Samsung.
ADEPT Lab: ADEPT industrial sponsor Intel and ADEPT affiliates Google, Futurewei, Seagate, Siemens, and SK Hynix.

The following applies to the whole of HardFloat Release 1 as well as to each source file individually.

Copyright 2019 The Regents of the University of California. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”, AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

4. HardFloat Package Directory Structure

The directory structure for the Verilog version of HardFloat is as follows:

doc
source
    8086-SSE
    ARM-VFPv2
    RISCV
test
    source
        Verilator
    build
        Verilator-GCC
        IcarusVerilog

The source files that define the Verilog modules are all in the main source directory. The test directory has files used solely for testing.

Most of HardFloat’s source files are regular Verilog ‘.v’ files containing module definitions. There are also a few ‘.vi’ files that define macros and constant functions. These latter types of files get textually included (using compiler directive `include) in the other Verilog files.

Within the main source directory are subdirectories for files that specialize the floating-point behavior to match particular processor families:

8086-SSE
Intel’s x86 processors with Streaming SIMD Extensions (SSE) and later compatible extensions (but not including the older 80-bit double-extended-precision format)
ARM-VFPv2
Arm’s VFPv2 or later floating-point
RISCV
RISC-V floating-point

Each of these specialization subdirectories contains two files, HardFloat_specialize.vi and HardFloat_specialize.v. Specialization is covered in detail in section 8.

5. Floating-Point Representations

HardFloat has three main ways of representing binary floating-point values:

Standard IEEE formats (‘fN’)
A computation’s original floating-point inputs and final results will typically be represented in what the IEEE Standard calls binary interchange formats, such as binary32 for single-precision and binary64 for double-precision.
Recoded formats (‘recFN’)
Rather than operate on the standard formats directly, HardFloat prefers to translate IEEE-encoded values into equivalent recoded formats and operate on those alternative representations instead. The main purpose of the recoding is to make subnormals be normalized, aligned with other floating-point values, thus reducing the complexity of floating-point operations.
Raw deconstructions (‘rawFN’)
During the computation of individual arithmetic operations, HardFloat often represents floating-point values in a deconstructed “raw” form, with separated sign, exponent, significand, etc. Among other uses, this representation is employed for intermediate results before rounding.

Each of these representations is covered in a subsection below.

5.1. Standard Formats

HardFloat uses the term standard format to refer to an encoding of floating-point in the style of a binary interchange format defined by the IEEE Standard. Although the standard officially limits binary interchange formats to certain combinations of exponent width w and precision p, HardFloat loosens those restrictions to allow any w and p, so long as each parameter is no smaller than 3, and this relationship is also satisfied:

p  ≤  2^(w − 2) + 3

In module names, HardFloat uses the abbreviation ‘fN’ (distinct from ‘recFN’ or ‘rawFN’) to refer to standard IEEE-style formats. A particular format is indicated by parameters expWidth (= w) and sigWidth (= p). The size in bits of a complete floating-point value is expWidth + sigWidth, which is subdivided as 1 bit for the sign, expWidth bits for the encoded exponent, and sigWidth − 1 bits for the trailing significand. The parameters for the common (and tested) floating-point formats are

                              expWidth (w)   sigWidth (p)
16-bit half-precision         5              11
32-bit single-precision       8              24
64-bit double-precision       11             53
128-bit quadruple-precision   15             113

As supplied, HardFloat distinguishes signaling NaNs from quiet NaNs in the way the IEEE Standard prefers; that is, if the most-significant bit of a NaN’s trailing significand is 0, the NaN is signaling; if this bit is 1, the NaN is quiet. When NaN payloads are propagated through operations, signaling NaNs are ordinarily converted into quiet NaNs by flipping this bit from 0 to 1.

5.2. Recoded Formats

For each standard format, HardFloat defines a corresponding recoded format that encodes the same set of values in a slightly different representation. The recoded formats have sign, exponent, and significand fields just like the standard formats, but with one extra bit for the exponent. Hence, standard 32-bit single-precision, for example, gets recoded in 33 bits: one bit for the sign, 9 for the encoded exponent (one more than usual), and 23 for the trailing significand. The recoded formats simplify the implementation of floating-point arithmetic in a few ways, the most important being to normalize subnormals so they can be treated nearly the same as regular floating-point numbers.

The following table summarizes the recoding:

                    standard format               HardFloat’s recoded format
                    sign  exponent   significand  sign  exponent       significand
zeros               s     0          0            s     000xx...xx     0
subnormal numbers   s     0          F            s     2^k + 2 − n    normalized F<<n
normal numbers      s     E          F            s     E + 2^k + 1    F
infinities          s     1111...11  0            s     110xx...xx     xxxxxxx...xxxx
NaNs                s     1111...11  F            s     111xx...xx     F

The parameter k is expWidth − 1, where expWidth is the standard width of the exponent field, not the recoded width. An x represents a “don’t care” bit. For the special values of zeros, infinities, and NaNs, only the three most-significant bits are relevant in the recoded exponent field. If those three bits are 000, 110, or 111, the floating-point value is a zero, infinity, or NaN, respectively; otherwise, the value is a normalized finite number. In the latter case, if the recoded exponent field is 2^k + 2 or more, the value is a regular normal number, and if it’s less, the value is a normalized subnormal number. (The notation “normalized F<<n” means F normalized by shifting left by n bits and discarding the leading 1.)

For most floating-point values (zeros, normal numbers, and NaNs) the only thing different about the recoded formats is the encoding of the exponent; the sign and significand fields are the same as the corresponding standard format. Infinities are special only in that the significand field is ignored in the recoded formats and might not be zero. Only for subnormals is there a significant transformation of both exponent and significand fields, due to normalization.

A large number of possible bit patterns in the recoded formats don’t conform to entries in the table above, such as when the recoded exponent’s most-significant three bits are 000, indicating a zero, yet the significand field is not zero. Such patterns are not valid. If an invalid recoded input is given to a HardFloat module, all module outputs are unspecified. Invalid recoded results are never generated by HardFloat’s modules, so long as all module inputs are valid. For normalized numbers (not zeros, infinities, or NaNs), a valid recoded exponent field is always in the range 2^k + 3 − sigWidth (for the tiniest subnormal) to 3×2^k − 1 (for the largest normal number).
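The classification rules for the recoded exponent field can be sketched as follows. Module classifyRecFNSketch and its port names are hypothetical, not part of HardFloat, and the outputs are meaningful only for valid recoded inputs.

```verilog
// Hypothetical helper, not part of HardFloat: classifies a valid recoded
// value using only its exponent field.
module
    classifyRecFNSketch#(parameter expWidth = 8, parameter sigWidth = 24) (
        input [(expWidth + sigWidth):0] in,
        output isZero,
        output isInf,
        output isNaN,
        output isSubnormal
    );
    // Smallest recoded exponent of a normal number: 2^k + 2, k = expWidth - 1.
    localparam minNormalExp = (1 << (expWidth - 1)) + 2;
    // Recoded exponent field: expWidth + 1 bits just below the sign bit.
    wire [expWidth:0] recExp = in[(expWidth + sigWidth - 1):(sigWidth - 1)];
    wire [2:0] top3 = recExp[expWidth:(expWidth - 2)];
    assign isZero = (top3 == 3'b000);
    assign isInf = (top3 == 3'b110);
    assign isNaN = (top3 == 3'b111);
    // A finite nonzero value below the normal range is a normalized subnormal.
    assign isSubnormal =
        !isZero && !isInf && !isNaN && (recExp < minNormalExp);
endmodule
```

For single-precision (k = 7), minNormalExp is 130, and valid normalized numbers have recoded exponents from 107 to 383.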

In module names, a recoded format is abbreviated as ‘recFN’. Like the standard formats, a particular recoded format is specified by parameters expWidth and sigWidth. The parameters to give, however, are those of the corresponding standard format, without adjustment for the wider recoded exponent field. For example, the recoded format for common 32-bit single-precision is specified by expWidth = 8 and sigWidth = 24, the same as the standard format. (Refer back to the previous subsection on the standard formats.) Accordingly, the width of a recoded floating-point exponent field is really expWidth + 1, and the size in bits of the entire recoded floating-point format is expWidth + sigWidth + 1.

5.3. Raw Deconstructions

Within the implementation of individual arithmetic operations (addition, multiplication, etc.), HardFloat often represents floating-point values in a deconstructed form, with six separate components:

isNaN
A single bit indicating whether the floating-point value is a NaN. If isNaN is true (= 1), the other components are irrelevant except possibly for sign and sig. Components sign and sig are relevant for a NaN only if HardFloat is configured to propagate NaN payloads.
isInf
A single bit indicating whether the floating-point value is an infinity. If isNaN is false and isInf is true, the other components are irrelevant except for sign.
isZero
A single bit indicating whether the floating-point value is a zero. If isNaN and isInf are both false and isZero is true, the other components are irrelevant except for sign.
sign
The floating-point sign bit. Ignored only if isNaN is true and NaN payloads are not propagated.
sExp
For a finite nonzero floating-point value, the biased floating-point exponent as a signed integer. For NaNs, infinities, and zeros, sExp is ignored. As there are no special encodings of sExp to indicate special values, sExp is always just a simple signed integer, with the same bias as in the recoded formats. For a specified expWidth, the actual width of sExp is expWidth + 2, giving it a range of −2^(w+1) to 2^(w+1) − 1, where w = expWidth as usual.
sig
For a NaN, the NaN payload (sometimes ignored). For a finite nonzero floating-point value, the complete significand, including the usually implicit 1 bit. For a specified sigWidth, the width of sig is sigWidth + 1, with 2 extra bits at the most-significant end when compared to the usual trailing significand of the standard and recoded formats. In some cases, these two bits are always binary 01. In other cases, the most-significant 1 bit of the significand can be either of the two leading bits of sig, allowing for a 1-bit slack in the normalization of the significand. A value of sig with the two most-significant bits both zeros (00) is invalid (unless one of isNaN, isInf, or isZero is true).

Unlike the recoded formats, the deconstructed forms allow many valid encodings that do not correspond directly to IEEE floating-point values. These extra-precise and/or out-of-bounds values may be mapped to the set of valid IEEE values by rounding, possibly resulting in underflow or overflow.

In HardFloat’s module names, the deconstructed forms are abbreviated as ‘rawFN’. They are typically used in two ways. First, inputs to an operation get converted from recoded formats into deconstructed forms using recFNToRawFN submodules. Second, an intermediate numeric result is computed, also in raw form, and this gets rounded to a valid IEEE value using module roundRawFNToRecFN or roundAnyRawFNToRecFN. These modules for handling rawFN are documented in more detail in section 10, Common Submodules.

6. Common Control and Mode Inputs

HardFloat’s floating-point arithmetic modules commonly take two inputs that may adjust the behavior of the module:

control
roundingMode

These are covered in the following subsections.

6.1. Control Input

The control input is a vehicle for supplying to floating-point operations any number of miscellaneous control parameters that are not expected to change frequently. The width of this input is specified by macro floatControlWidth, which is defined by default in file HardFloat_consts.vi, although it may be overridden in HardFloat_specialize.vi. Currently, the default width for control is only 1 bit.

The default single bit of control determines detection of tininess for underflow. In the terminology of the IEEE Standard, HardFloat can detect tininess for underflow either before or after rounding. The following are the names of macros whose values can be bitwise ORed into the control input:

flControl_tininessBeforeRounding
flControl_tininessAfterRounding

If the control bit specified by macro flControl_tininessAfterRounding is set (= 1), then tininess is detected after rounding. If this bit is not set (= 0), tininess is detected before rounding. Detecting tininess after rounding is usually slightly better because it results in fewer spurious underflow signals. The option for detecting tininess before rounding is provided for compatibility with some systems. As required by the IEEE Standard since 2008, HardFloat always detects loss of accuracy for underflow as an inexact result.
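A minimal sketch of forming a control value follows. It assumes the default 1-bit control width and assumes that the flControl macros are made available by including HardFloat_consts.vi. Module controlValueSketch is hypothetical.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical module, not part of HardFloat: drives a control value that
// selects detection of tininess after rounding.
module
    controlValueSketch (
        output [(`floatControlWidth - 1):0] control
    );
    assign control = `flControl_tininessAfterRounding;
endmodule
```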

6.2. Rounding Mode

The roundingMode input to a module chooses the rounding mode for a floating-point operation (obviously). All five rounding modes defined by the 2008 IEEE Floating-Point Standard are implemented for all operations that require rounding. HardFloat adds support also for a sixth mode, round to odd. The value of roundingMode may be that of any of these macros defined in HardFloat_consts.vi:

round_near_even     round to nearest, with ties to even
round_near_maxMag   round to nearest, with ties to maximum magnitude (away from zero)
round_minMag        round to minimum magnitude (toward zero)
round_min           round to minimum (down)
round_max           round to maximum (up)
round_odd           round to odd (jamming)

Rounding mode round_odd operates as follows: For a conversion to an integer, if the input is not already an integer value, the rounded result is the closest odd integer. For operations that return a floating-point value, the floating-point result is first rounded to minimum magnitude, the same as round_minMag, and then, if the result is inexact, the least-significant bit of the result is set to 1. Rounding to odd is also known as jamming.
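The floating-point case of round to odd can be illustrated on an unsigned magnitude alone. The sketch below is not HardFloat's rounding logic; module roundOddSketch is hypothetical, and it assumes inWidth is greater than outWidth.

```verilog
// Hypothetical illustration, not part of HardFloat: shortens an unsigned
// magnitude from inWidth to outWidth bits using round to odd (jamming).
module
    roundOddSketch#(parameter inWidth = 8, parameter outWidth = 4) (
        input [(inWidth - 1):0] in,
        output [(outWidth - 1):0] out,
        output inexact
    );
    // Round to minimum magnitude: keep only the upper outWidth bits.
    wire [(outWidth - 1):0] truncated = in[(inWidth - 1):(inWidth - outWidth)];
    // The result is inexact if any discarded bit is nonzero.
    assign inexact = |in[(inWidth - outWidth - 1):0];
    // If inexact, set the least-significant bit of the result to 1.
    assign out = truncated | inexact;
endmodule
```

With the default parameters, an input of 8'b1010_0001 truncates to 4'b1010 with a nonzero discarded part, so the output is 4'b1011. An input of 8'b1010_0000 is exact and stays 4'b1010.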

7. Exception Results

HardFloat supports all five exception flags required by the IEEE Floating-Point Standard. Most modules for floating-point operations have an exceptionFlags output with 5 bits in this order:

{invalid, infinite, overflow, underflow, inexact}

The infinite exception is what the standard calls “divide by zero”, meaning an infinite result generated from finite operands.

The module that converts from floating-point to integer, recFNToIN, drops the infinite and underflow bits, because conversions to integer can never underflow or deliver an infinite result. This module has instead an intExceptionFlags output with 3 bits in this order:

{invalid, overflow, inexact}

Note that, although recFNToIN distinguishes overflow from invalid exceptions in intExceptionFlags as shown, the IEEE Standard does not permit conversions to integer to signal a floating-point overflow exception. Rather, if a system has no other way to indicate overflow from conversions to integer, the standard requires that the floating-point invalid exception be signaled, not floating-point overflow. Hence, it will often be the case that the invalid and overflow bits from the intExceptionFlags output must be ORed together to signal the usual floating-point invalid exception.
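The merging of flags described above might look like the following sketch. Module intFlagsToFloatFlagsSketch is hypothetical. Clearing inexact whenever invalid is signaled is an assumption made here for safety, not something this document specifies.

```verilog
// Hypothetical helper, not part of HardFloat: maps the 3-bit
// intExceptionFlags of recFNToIN onto the usual 5 floating-point flags.
module
    intFlagsToFloatFlagsSketch (
        input [2:0] intExceptionFlags,
        output [4:0] exceptionFlags
    );
    // Bit order of intExceptionFlags: {invalid, overflow, inexact}.
    wire intInvalid = intExceptionFlags[2];
    wire intOverflow = intExceptionFlags[1];
    wire intInexact = intExceptionFlags[0];
    // Integer overflow is reported as the floating-point invalid exception.
    wire invalid = intInvalid | intOverflow;
    // Bit order: {invalid, infinite, overflow, underflow, inexact}.
    assign exceptionFlags = {invalid, 1'b0, 1'b0, 1'b0, intInexact & ~invalid};
endmodule
```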

Depending on how it is configured, HardFloat has the ability to create distinct NaNs for different exceptions, and to propagate NaN payloads from operation inputs to output. These options for NaNs are determined by the specialization of HardFloat, covered in the next section.

8. Specialization

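The IEEE Floating-Point Standard allows for some variation among conforming implementations, particularly concerning NaNs. The HardFloat source directory is supplied with some specialization subdirectories containing possible definitions for this implementation-specific behavior. For example, the 8086-SSE subdirectory has source files that specialize HardFloat’s behavior to match that of Intel’s newer x86 processors. The files in a specialization subdirectory determine:

any bits added to the control input, and the default value of that input (section 8.1);
the integer results returned when conversions to integer overflow or are invalid (section 8.2);
whether NaN payloads are propagated, and which NaNs are returned by operations (section 8.3).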

A specialization subdirectory is expected to contain two files: an “include” file named HardFloat_specialize.vi and a regular Verilog file named HardFloat_specialize.v.

8.1. Width and Default Value for the Control Input

If the specific implementation adds any bits to the control input, its width must be defined in HardFloat_specialize.vi by undefining (`undef) the existing macro floatControlWidth and redefining it to the correct number of bits.

If needed, a default value for the control input can be specified by defining macro flControl_default in HardFloat_specialize.vi. This macro should be defined to either `flControl_tininessBeforeRounding or `flControl_tininessAfterRounding, combined with defaults for any added control bits.
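As an illustration, a fragment of HardFloat_specialize.vi that widens control by one bit and sets a default might look like this. The macro flControl_exampleExtraBit is invented for the example, and the fragment assumes HardFloat_consts.vi has already been included so that floatControlWidth and the tininess macros exist.

```verilog
// HardFloat_specialize.vi (illustrative fragment)

// Widen the control input from the default 1 bit to 2 bits.
`undef floatControlWidth
`define floatControlWidth 2

// Hypothetical added control bit; this name is not defined by HardFloat.
`define flControl_exampleExtraBit 2'b10

// Default control value: tininess detected after rounding, with the
// hypothetical extra bit left clear.
`define flControl_default `flControl_tininessAfterRounding
```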

8.2. Integer Results on Exceptions

To determine the result returned when a conversion to integer type overflows or raises the invalid exception, file HardFloat_specialize.v is expected to define a module with these parameters and ports:

module
    iNFromException#(parameter width) (
        input signedOut,
        input isNaN,
        input sign,
        output [(width - 1):0] out
    );

Input signedOut indicates whether the result integer type is signed or unsigned, with 0 being unsigned and 1 being signed. The other two inputs, isNaN and sign, provide information about the original floating-point input that can affect the choice of the integer result. If isNaN is true, the floating-point input is a NaN; else it is a number (possibly an infinity) outside the range of the integer type. The correct integer result is returned in out.
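A minimal sketch of one possible policy follows: NaNs and positive out-of-range inputs return the largest integer of the result type, and negative out-of-range inputs return the smallest. This saturating policy is chosen only for illustration, and the sketch assumes width is at least 2. Default parameter values are added here so the block stands alone.

```verilog
// Illustrative saturating policy for integer results on exceptions.
module
    iNFromException#(parameter width = 32) (
        input signedOut,
        input isNaN,
        input sign,
        output [(width - 1):0] out
    );
    // Return the maximum integer for NaNs and for positive inputs.
    wire maxInt = isNaN || !sign;
    // signed max:   0111...1     signed min:   1000...0
    // unsigned max: 1111...1     unsigned min: 0000...0
    assign out = {signedOut ^ maxInt, {(width - 1){maxInt}}};
endmodule
```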

8.3. NaN Results

The specialization also determines the specific NaNs delivered when operations return NaNs. One must first choose whether NaN payloads will be propagated. Defining macro HardFloat_propagateNaNPayloads in HardFloat_specialize.vi enables NaN payload propagation. (If defined, the content of this macro is ignored; only its effect on `ifdef directives matters.) The remaining specification of NaN results depends on whether macro HardFloat_propagateNaNPayloads is defined.

Without NaN payload propagation

When NaN payloads are not propagated, a NaN result will usually be just the default NaN for the result format, regardless of any input NaNs. Default NaNs must be specified by two other macros defined in HardFloat_specialize.vi. First, a default NaN’s sign bit is chosen by macro HardFloat_signDefaultNaN, which can be defined as either 0 or 1. Second, the bulk of the default NaN payload (apart from the sign) is determined by a function-like macro with one argument, HardFloat_fractDefaultNaN(sigWidth), which must evaluate to an integer of sigWidth − 1 bits.
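For example, a specialization whose default NaN is positive and has only the most-significant fraction bit set (a quiet NaN under the convention of section 5.1) could define the two macros roughly as follows. The particular default chosen here is an assumption for illustration.

```verilog
// HardFloat_specialize.vi (illustrative fragment)

// Default NaN sign bit: 0 (positive).
`define HardFloat_signDefaultNaN 0

// Default NaN fraction, sigWidth - 1 bits wide: 100...0.
`define HardFloat_fractDefaultNaN(sigWidth) \
    {1'b1, {((sigWidth) - 2){1'b0}}}
```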

With NaN payload propagation

If NaN propagation is enabled (macro HardFloat_propagateNaNPayloads is defined), then NaN results are determined in an entirely different way. In this case, several modules must be supplied in HardFloat_specialize.v to implement the desired propagation. The names, parameters, and ports of these modules are as follows:

module
    propagateFloatNaN_add#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

module
    propagateFloatNaN_mul#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

module
    propagateFloatNaN_mulAdd#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        input invalidProd,
        input isNaNC,
        input signC,
        input [(sigWidth - 2):0] fractC,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

module
    propagateFloatNaN_divSqrt#(parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input sqrtOp,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );

A different NaN-propagation module is needed for each of addition/subtraction, multiplication, fused multiply-add, and division/square-root. In all cases:

Module propagateFloatNaN_add has an extra input, subOp, indicating whether the operation is addition (= 0) or subtraction (= 1). Similarly, propagateFloatNaN_divSqrt has an extra input, sqrtOp, indicating whether the operation is division (= 0) or square root (= 1). For square roots, the b operand (isNaNB, signB, fractB) should be ignored. Module propagateFloatNaN_mulAdd has these extra inputs:

If no floating-point operand is a NaN (all isNaN inputs are false), the modules must return the correct default NaN for an invalid operation of the given kind.

It may be that correct NaN results depend in part on whether floating-point operands are signaling NaNs. If so, inputs isNaNA and fractA contain enough information for a module to determine whether operand a is a signaling NaN, and likewise for operands b and c.
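As a rough sketch of what one of these modules might contain, the following version of propagateFloatNaN_mul returns operand a's NaN if a is a NaN, otherwise operand b's NaN if b is a NaN, and otherwise a default NaN. The selection policy, the default NaN (sign 0, fraction 100...0), and the default parameter value are all assumptions made for illustration; a real specialization must follow the rules of the target processor.

```verilog
`include "HardFloat_consts.vi"

// Illustrative policy only: prefer operand a's NaN, then operand b's NaN,
// else an assumed default NaN.  The result is always made quiet.
module
    propagateFloatNaN_mul#(parameter sigWidth = 3) (
        input [(`floatControlWidth - 1):0] control,
        input isNaNA,
        input signA,
        input [(sigWidth - 2):0] fractA,
        input isNaNB,
        input signB,
        input [(sigWidth - 2):0] fractB,
        output signNaN,
        output [(sigWidth - 2):0] fractNaN
    );
    wire useA = isNaNA;
    wire useB = !isNaNA && isNaNB;
    wire [(sigWidth - 2):0] fractSel = useA ? fractA : fractB;
    // Fraction with only its most-significant (quiet) bit set.
    wire [(sigWidth - 2):0] quietBit = {1'b1, {(sigWidth - 2){1'b0}}};
    assign signNaN = useA ? signA : (useB ? signB : 1'b0);
    // Setting the quiet bit converts a signaling NaN into a quiet NaN.
    assign fractNaN = (useA || useB) ? (fractSel | quietBit) : quietBit;
endmodule
```

The control input is unused in this sketch; it is present only because the port list requires it.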

9. Main Modules

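HardFloat defines modules for the following floating-point conversions and operations:

conversions between standard and recoded floating-point (fNToRecFN, recFNToFN);
conversions from integer (iNToRecFN, iNToRawFN);
conversions to integer (recFNToIN);
conversions between formats (recFNToRecFN);
addition and subtraction (addRecFN, addRecFNToRaw);
multiplication (mulRecFN, mulRecFNToRaw, mulRecFNToFullRaw);
fused multiply-add (mulAddRecFN, mulAddRecFNToRaw);
division and square root (divSqrtRecFN_small, divSqrtRecFNToRaw_small);
comparisons (compareRecFN).

Only for the first item above (conversions between standard and recoded floating-point) do HardFloat’s modules directly touch floating-point encoded in a standard IEEE format. Otherwise, floating-point inputs are always in HardFloat’s recoded format, and floating-point outputs are either in a recoded format or in a “raw” deconstructed form. When an output is in deconstructed form, the output value is before rounding, and it must still be processed through roundRawFNToRecFN or roundAnyRawFNToRecFN to obtain a correctly rounded result in conformance with IEEE rules. Modules roundRawFNToRecFN and roundAnyRawFNToRecFN are documented later, in section 10, Common Submodules.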

Many HardFloat modules take inputs named control and roundingMode. These are always as documented in section 6, Common Control and Mode Inputs. Likewise, an output named exceptionFlags is always the five exception flags reported in section 7, Exception Results.

Naturally, if recoded floating-point inputs to a module are not valid according to section 5.2, Recoded Formats, then module outputs become unspecified.

9.1. Conversions Between Standard and Recoded Floating-Point (fNToRecFN, recFNToFN)

HardFloat has only two modules that input or output floating-point encoded in a standard IEEE format. One module, fNToRecFN, converts from a standard format into HardFloat’s equivalent recoded format, and the other, recFNToFN, converts back from a recoded format to standard format. For all other functions, HardFloat requires the use of either its recoded format or a “raw” deconstructed form. The two conversion modules are complementary, with these parameters and ports:

module
    fNToRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth - 1):0] in,
        output [(expWidth + sigWidth):0] out
    );

module
    recFNToFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output [(expWidth + sigWidth - 1):0] out
    );

Because the set of values encoded in the recoded format is identical to the corresponding standard format, there are no possible exceptions from converting in either direction.
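A minimal usage sketch for single-precision follows, converting a standard 32-bit value to the 33-bit recoded format and back. Module roundTripF32Sketch and its signal names are hypothetical.

```verilog
// Hypothetical example, not part of HardFloat: recodes a 32-bit
// single-precision value and converts it back to standard format.
module
    roundTripF32Sketch (
        input [31:0] a,
        output [31:0] z
    );
    // Recoded single-precision is expWidth + sigWidth + 1 = 33 bits wide.
    wire [32:0] recA;
    fNToRecFN#(.expWidth(8), .sigWidth(24)) toRec (.in(a), .out(recA));
    recFNToFN#(.expWidth(8), .sigWidth(24)) fromRec (.in(recA), .out(z));
endmodule
```

In a real design, arithmetic modules operating on recoded values would sit between the two conversions.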

9.2. Conversions from Integer (iNToRecFN, iNToRawFN)

Module iNToRecFN converts from an integer type to floating-point in recoded form. Its parameters and ports are:

module
    iNToRecFN#(parameter intWidth, parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input signedIn,
        input [(intWidth - 1):0] in,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

The input named in is interpreted as an unsigned integer if signedIn is false, or as a signed integer if signedIn is true.
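A usage sketch converting a signed 32-bit integer to recoded single-precision is shown below. Module int32ToRecF32Sketch is hypothetical, and the example assumes the default 1-bit control width with the control and rounding-mode macros taken from HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical example, not part of HardFloat: converts a signed 32-bit
// integer to recoded single-precision, rounding to nearest even.
module
    int32ToRecF32Sketch (
        input [31:0] in,
        output [32:0] out,
        output [4:0] exceptionFlags
    );
    iNToRecFN#(.intWidth(32), .expWidth(8), .sigWidth(24)) conv (
        .control(`flControl_tininessAfterRounding),
        .signedIn(1'b1),
        .in(in),
        .roundingMode(`round_near_even),
        .out(out),
        .exceptionFlags(exceptionFlags)
    );
endmodule
```

Because single-precision has only 24 bits of precision, some 32-bit integers cannot be represented exactly, so the inexact flag can be raised here.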

Module iNToRawFN performs a similar function, but returns a floating-point value in deconstructed form:

module
    iNToRawFN#(parameter intWidth) (
        input signedIn,
        input [(intWidth - 1):0] in,
        output isZero,
        output sign,
        output signed [(clog2(intWidth) + 2):0] sExp,
        output [intWidth:0] sig
    );

This module lacks parameters expWidth and sigWidth. Instead, the output format is determined from the width of the input integer as follows:

expWidth = clog2(intWidth) + 1
sigWidth = intWidth

Function clog2(x) is the smallest integer at least as large as the base-2 logarithm of x. For example, if intWidth is 17, the base-2 logarithm of intWidth is approximately 4.087, so clog2(intWidth) is 5, and port sExp is thus 8 bits wide.
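The arithmetic can be sketched with a constant function. Function clog2Sketch below is a stand-in written for this example under a different name; it is not HardFloat's own clog2, and module clog2Example is likewise hypothetical.

```verilog
// Hypothetical example, not part of HardFloat: computes the widths of the
// iNToRawFN outputs for a given intWidth.
module clog2Example;
    parameter intWidth = 17;

    // Smallest integer at least as large as the base-2 logarithm of x.
    function integer clog2Sketch;
        input integer x;
        integer v;
        begin
            clog2Sketch = 0;
            v = x - 1;
            while (v > 0) begin
                clog2Sketch = clog2Sketch + 1;
                v = v >> 1;
            end
        end
    endfunction

    // Port sExp is declared as [(clog2(intWidth) + 2):0].
    localparam sExpWidth = clog2Sketch(intWidth) + 3;
    // Port sig is declared as [intWidth:0].
    localparam sigPortWidth = intWidth + 1;

    initial
        $display("sExp: %0d bits, sig: %0d bits", sExpWidth, sigPortWidth);
endmodule
```

For intWidth = 17, the loop runs five times (v = 16, 8, 4, 2, 1), so sExp is 8 bits wide and sig is 18 bits wide.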

The deconstructed output from iNToRawFN omits isNaN and isInf, because these are known always to be false for an integer.

The widths chosen for the output exponent and significand allow the floating-point result from iNToRawFN to be always exactly equal in value to the integer input (much like recFNToRawFN, section 10.2). Furthermore, iNToRawFN guarantees the following for its outputs:

9.3. Conversions to Integer (recFNToIN)

Module recFNToIN converts from a floating-point value in recoded form to an integer type. Its parameters and ports are:

module
    recFNToIN#(parameter expWidth, parameter sigWidth, parameter intWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] in,
        input [2:0] roundingMode,
        input signedOut,
        output [(intWidth - 1):0] out,
        output [2:0] intExceptionFlags
    );

The output named out is an unsigned integer if input signedOut is false, or is a signed integer if signedOut is true.

As explained earlier in section 7, Exception Results, the 3-bit output named intExceptionFlags reports exceptions invalid, overflow, and inexact. Although intExceptionFlags distinguishes integer overflow separately from invalid exceptions, the IEEE Standard does not permit conversions to integer to raise a floating-point overflow exception. Instead, if a system has no other way to indicate that a conversion to integer overflowed, the standard requires that the floating-point invalid exception be raised, not floating-point overflow. Hence, the invalid and overflow bits from intExceptionFlags will typically be ORed together to contribute to the usual floating-point invalid exception.
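
As a minimal sketch of that ORing, the wrapper below converts recoded single-precision to a 32-bit integer and folds integer overflow into a single invalid flag. The wrapper name is hypothetical, the bit order {invalid, overflow, inexact} for intExceptionFlags is assumed from section 7, and `floatControlWidth is assumed to be made available by including HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper, not part of HardFloat.
module recF32ToI32_ieeeFlags (
    input [(`floatControlWidth - 1):0] control,
    input [32:0] in,                // recoded single-precision (8 + 24 + 1 bits)
    input [2:0] roundingMode,
    input signedOut,
    output [31:0] out,
    output invalid,                 // floating-point invalid exception
    output inexact
);

    wire [2:0] intExceptionFlags;

    recFNToIN#(.expWidth(8), .sigWidth(24), .intWidth(32)) toInt (
        .control(control),
        .in(in),
        .roundingMode(roundingMode),
        .signedOut(signedOut),
        .out(out),
        .intExceptionFlags(intExceptionFlags)
    );

    // Assumed bit order: [2] invalid, [1] overflow, [0] inexact.
    assign invalid = intExceptionFlags[2] | intExceptionFlags[1];
    assign inexact = intExceptionFlags[0];

endmodule
```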

9.4. Conversions Between Formats (recFNToRecFN)

Module recFNToRecFN converts a recoded floating-point value to a different recoded format (such as from single-precision to double-precision, or vice versa):

module
    recFNToRecFN#(
        parameter inExpWidth,
        parameter inSigWidth,
        parameter outExpWidth,
        parameter outSigWidth
    ) (
        input [(`floatControlWidth - 1):0] control,
        input [(inExpWidth + inSigWidth):0] in,
        input [2:0] roundingMode,
        output [(outExpWidth + outSigWidth):0] out,
        output [4:0] exceptionFlags
    );

This module requires no special explanation.

9.5. Addition and Subtraction (addRecFN,  addRecFNToRaw)

Module addRecFN adds or subtracts two recoded floating-point values, returning a result in the same format. Its parameters and ports are:

module
    addRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

When input subOp is 0, the operation is addition (a + b), and when it is 1, the operation is subtraction (a − b).

A variant module, addRecFNToRaw, returns the intermediate result of addition or subtraction before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

module
    addRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input subOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Boolean output invalidExc is true if the operation should raise an invalid exception. Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.
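
The pairing of addRecFNToRaw with roundRawFNToRecFN might look like the following sketch, which uses only the ports declared in this document. The wrapper name addRecFN_twoStage and the default parameter values are hypothetical, options is left at 0 as the conservative choice, and `floatControlWidth is assumed to come from HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper: add/subtract built from the raw stage plus rounding.
module addRecFN_twoStage#(parameter expWidth = 8, parameter sigWidth = 24) (
    input [(`floatControlWidth - 1):0] control,
    input subOp,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [2:0] roundingMode,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    // Unrounded intermediate result in deconstructed form.
    wire invalidExc, raw_isNaN, raw_isInf, raw_isZero, raw_sign;
    wire signed [(expWidth + 1):0] raw_sExp;
    wire [(sigWidth + 2):0] raw_sig;

    addRecFNToRaw#(.expWidth(expWidth), .sigWidth(sigWidth)) adder (
        .control(control),
        .subOp(subOp),
        .a(a),
        .b(b),
        .roundingMode(roundingMode),
        .invalidExc(invalidExc),
        .out_isNaN(raw_isNaN),
        .out_isInf(raw_isInf),
        .out_isZero(raw_isZero),
        .out_sign(raw_sign),
        .out_sExp(raw_sExp),
        .out_sig(raw_sig)
    );

    // Addition never signals divide-by-zero, so infiniteExc is tied low.
    roundRawFNToRecFN#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) rounder (
        .control(control),
        .invalidExc(invalidExc),
        .infiniteExc(1'b0),
        .in_isNaN(raw_isNaN),
        .in_isInf(raw_isInf),
        .in_isZero(raw_isZero),
        .in_sign(raw_sign),
        .in_sExp(raw_sExp),
        .in_sig(raw_sig),
        .roundingMode(roundingMode),
        .out(out),
        .exceptionFlags(exceptionFlags)
    );

endmodule
```

Splitting the operation this way gives a natural place to insert a pipeline register between the raw result and the rounding step, if one is wanted.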

9.6. Multiplication (mulRecFN,  mulRecFNToRaw,  mulRecFNToFullRaw)

Module mulRecFN multiplies two recoded floating-point values, returning a result in the same format:

module
    mulRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

A variant module, mulRecFNToRaw, returns the intermediate result of multiplication before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

module
    mulRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Boolean output invalidExc is true if the operation should raise an invalid exception. Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.

Module mulRecFNToFullRaw is a different variant, acting like mulRecFNToRaw except returning the complete double-width product significand:

module
    mulRecFNToFullRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth*2 - 1):0] out_sig
    );

Unlike mulRecFNToRaw, the result from mulRecFNToFullRaw is the exact product of the operands, without any rounding approximation. This full-size deconstructed floating-point result can be correctly rounded to any recoded format using roundAnyRawFNToRecFN.
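
For illustration, the sketch below multiplies two recoded single-precision values and rounds the exact product to recoded double-precision. The wrapper name is hypothetical, and the settings inExpWidth = expWidth and inSigWidth = sigWidth*2 − 1 are inferred here by matching the port widths of the two modules as declared in this document, so they should be treated as an assumption. `floatControlWidth is assumed to come from HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper: single-precision operands, double-precision product.
module mulRecF32ToRecF64 (
    input [(`floatControlWidth - 1):0] control,
    input [32:0] a,
    input [32:0] b,
    input [2:0] roundingMode,
    output [64:0] out,              // recoded double-precision (11 + 53 + 1 bits)
    output [4:0] exceptionFlags
);

    localparam expWidth = 8;
    localparam sigWidth = 24;

    // Exact product in deconstructed form.
    wire invalidExc, prod_isNaN, prod_isInf, prod_isZero, prod_sign;
    wire signed [(expWidth + 1):0] prod_sExp;
    wire [(sigWidth*2 - 1):0] prod_sig;

    mulRecFNToFullRaw#(.expWidth(expWidth), .sigWidth(sigWidth)) multiplier (
        .control(control),
        .a(a),
        .b(b),
        .invalidExc(invalidExc),
        .out_isNaN(prod_isNaN),
        .out_isInf(prod_isInf),
        .out_isZero(prod_isZero),
        .out_sign(prod_sign),
        .out_sExp(prod_sExp),
        .out_sig(prod_sig)
    );

    // in_sig is [inSigWidth:0], so a 48-bit product implies inSigWidth = 47.
    roundAnyRawFNToRecFN#(
        .inExpWidth(expWidth),
        .inSigWidth(sigWidth*2 - 1),
        .outExpWidth(11),
        .outSigWidth(53),
        .options(0)
    ) rounder (
        .control(control),
        .invalidExc(invalidExc),
        .infiniteExc(1'b0),
        .in_isNaN(prod_isNaN),
        .in_isInf(prod_isInf),
        .in_isZero(prod_isZero),
        .in_sign(prod_sign),
        .in_sExp(prod_sExp),
        .in_sig(prod_sig),
        .roundingMode(roundingMode),
        .out(out),
        .exceptionFlags(exceptionFlags)
    );

endmodule
```

Because a 48-bit product significand fits within the 53 bits of double-precision, this particular combination should never need to round a finite result.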

9.7. Fused Multiply-Add (mulAddRecFN,  mulAddRecFNToRaw)

Module mulAddRecFN implements fused multiply-add as defined by the IEEE Floating-Point Standard:

module
    mulAddRecFN#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [(expWidth + sigWidth):0] c,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

When op = 0, the module computes (a × b) + c with a single rounding. If one of the multiplication operands a and b is infinite and the other is zero, the invalid exception is indicated even if operand c is a quiet NaN.

The bits of input op affect the signs of the addends, making it possible to turn addition into subtraction (much like the subOp input to addRecFN). The exact effects of op are summarized in this table:

op[1]   op[0]   Function
  0       0     (a × b) + c
  0       1     (a × b) − c
  1       0     c − (a × b)
  1       1     −(a × b) − c

In all cases, the function is computed with only a single rounding, of course.
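
As a brief sketch of how op might be used, the wrapper below fixes op at binary 01 to obtain a fused multiply-subtract, (a × b) − c. The wrapper name fmsubRecFN and the default parameter values are hypothetical, and `floatControlWidth is assumed to come from HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper: fused multiply-subtract, (a * b) - c, one rounding.
module fmsubRecFN#(parameter expWidth = 8, parameter sigWidth = 24) (
    input [(`floatControlWidth - 1):0] control,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [(expWidth + sigWidth):0] c,
    input [2:0] roundingMode,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    mulAddRecFN#(.expWidth(expWidth), .sigWidth(sigWidth)) fma (
        .control(control),
        .op(2'b01),                 // op[1] = 0, op[0] = 1: (a * b) - c
        .a(a),
        .b(b),
        .c(c),
        .roundingMode(roundingMode),
        .out(out),
        .exceptionFlags(exceptionFlags)
    );

endmodule
```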

A variant module, mulAddRecFNToRaw, returns the intermediate result of the fused multiply-add before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

module
    mulAddRecFNToRaw#(parameter expWidth, parameter sigWidth) (
        input [(`floatControlWidth - 1):0] control,
        input [1:0] op,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [(expWidth + sigWidth):0] c,
        input [2:0] roundingMode,
        output invalidExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Boolean output invalidExc is true if the operation should raise an invalid exception. Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.

9.8. Division and Square Root (divSqrtRecFN_small,  divSqrtRecFNToRaw_small)

Besides basic addition and multiplication, HardFloat has modules for computing either division or square root, where the choice of function is controlled by a module input. Implementing as they do the most complex operations that HardFloat supports, these combined division/square-root modules are unique within HardFloat for being sequential, meaning they take a clock input and execute over more than one clock cycle. The division/square-root modules in HardFloat use classic one-bit-per-cycle techniques that are simple and inexpensive but also slower than other known algorithms for computing division and square root. This character is reflected in the suffix ‘_small’ in the modules’ names.

HardFloat’s principal division/square-root module is divSqrtRecFN_small, with these parameters and ports:

module
    divSqrtRecFN_small#(
        parameter expWidth,
        parameter sigWidth,
        parameter options = 0
    ) (
        input nReset,
        input clock,
        input [(`floatControlWidth - 1):0] control,
        output inReady,
        input inValid,
        input sqrtOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output outValid,
        output sqrtOpOut,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

Currently, the only valid value for the options parameter is zero.

The reset input, nReset, is active-negative and operates asynchronously within the module. (By applying reset asynchronously, the module should accept any valid form of reset, assuming proper care is taken elsewhere to synchronize the release of reset with the clock.) No clock cycles are needed during reset. Apart from reset, state changes within the module occur only on the rising edges of clock.

The module asserts inReady true (= 1) in any clock cycle when it is ready to start a new division or square root operation. Whenever inReady and inValid are both true at a rising edge of clock, a new operation is begun, and inputs control, sqrtOp, a, b, and roundingMode must also be valid. The computation of inValid (or any other module input) should not depend on inReady within the same clock cycle.

If sqrtOp is 0 when a new operation is started, the operation is division (a ÷ b). If sqrtOp is 1, the operation is the square root of a, and operand b is ignored.

After some number of clock cycles, outValid is asserted true for exactly one clock cycle, at which time outputs sqrtOpOut, out, and exceptionFlags are all valid. These outputs become invalid again in the very next clock cycle if another division or square root operation gets initiated before or during the cycle that outValid is true. There is no mechanism within the module to retain a result for more than one cycle once another division or square root has begun.

On the other hand, if no subsequent operation has yet been started when outValid is asserted, then divSqrtRecFN_small will hold its result outputs constant until a new operation is begun (by asserting inValid while inReady is true). Even so, outValid is never asserted for more than one clock cycle per result, indicating the first cycle when the result is valid.

Output sqrtOpOut is merely a copy of the original sqrtOp for the operation. It is expected that clients will rarely have a need for this output.

The number of clock cycles to complete an operation is not guaranteed to be constant, but may in fact depend on all inputs to the operation, including a, b, and roundingMode. Because the module employs algorithms that compute one result bit per cycle, the number of cycles is typically sigWidth + n, for some small n. However, for exceptional cases (zero divided by infinity, for example), a result may be delivered much sooner.
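
The handshake described above could be driven by a client along the following lines. This is a sketch under stated assumptions, not a recommended design: the module name divSqrtClient and the ports start, busy, done, result, and resultFlags are all hypothetical, the client allows only one operation in flight, and `floatControlWidth is assumed to come from HardFloat_consts.vi. Note that inValid is driven from a register, so it does not depend on inReady within the same clock cycle.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical single-outstanding-operation client for divSqrtRecFN_small.
module divSqrtClient#(parameter expWidth = 8, parameter sigWidth = 24) (
    input nReset,
    input clock,
    input [(`floatControlWidth - 1):0] control,
    input start,                    // request pulse; ignored while busy
    input sqrtOp,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [2:0] roundingMode,
    output busy,
    output reg done,                // one-cycle pulse when result is captured
    output reg [(expWidth + sigWidth):0] result,
    output reg [4:0] resultFlags
);

    reg pending, running;           // waiting for acceptance / computing
    reg [(`floatControlWidth - 1):0] control_r;
    reg sqrtOp_r;
    reg [(expWidth + sigWidth):0] a_r, b_r;
    reg [2:0] roundingMode_r;

    wire inReady, outValid, sqrtOpOut;
    wire [(expWidth + sigWidth):0] out;
    wire [4:0] exceptionFlags;

    divSqrtRecFN_small#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) divSqrt (
        .nReset(nReset), .clock(clock), .control(control_r),
        .inReady(inReady), .inValid(pending), .sqrtOp(sqrtOp_r),
        .a(a_r), .b(b_r), .roundingMode(roundingMode_r),
        .outValid(outValid), .sqrtOpOut(sqrtOpOut),
        .out(out), .exceptionFlags(exceptionFlags)
    );

    assign busy = pending | running;

    always @(posedge clock or negedge nReset) begin
        if (!nReset) begin
            pending <= 1'b0;
            running <= 1'b0;
            done <= 1'b0;
        end else begin
            if (start && !busy) begin
                pending <= 1'b1;            // begin asserting inValid
            end else if (pending && inReady) begin
                pending <= 1'b0;            // operation accepted at this edge
                running <= 1'b1;
            end
            if (outValid) running <= 1'b0;
            done <= outValid;
        end
    end

    // Operands are held in registers until the operation completes.
    always @(posedge clock) begin
        if (start && !busy) begin
            control_r <= control;
            sqrtOp_r <= sqrtOp;
            a_r <= a;
            b_r <= b;
            roundingMode_r <= roundingMode;
        end
        if (outValid) begin
            result <= out;                  // capture in the one valid cycle
            resultFlags <= exceptionFlags;
        end
    end

endmodule
```

Capturing out and exceptionFlags in the cycle that outValid is true means the client does not rely on the module holding its outputs afterward.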

A variant module, divSqrtRecFNToRaw_small, returns the intermediate result of a division or square root operation before rounding, as a “raw” deconstructed floating-point value with two extra bits of significand:

module
    divSqrtRecFNToRaw_small#(
        parameter expWidth,
        parameter sigWidth,
        parameter options = 0
    ) (
        input nReset,
        input clock,
        input [(`floatControlWidth - 1):0] control,
        output inReady,
        input inValid,
        input sqrtOp,
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input [2:0] roundingMode,
        output outValid,
        output sqrtOpOut,
        output [2:0] roundingModeOut,
        output invalidExc,
        output infiniteExc,
        output out_isNaN,
        output out_isInf,
        output out_isZero,
        output out_sign,
        output signed [(expWidth + 1):0] out_sExp,
        output [(sigWidth + 2):0] out_sig
    );

Functionally, the only difference between this module and divSqrtRecFN_small is the form of the outputs. Like sqrtOpOut, output roundingModeOut is simply a copy of the original roundingMode for the operation. Boolean output invalidExc is true if the operation should raise an invalid exception, while infiniteExc is true if the operation should raise an infinite exception (“divide by zero”). Module roundRawFNToRecFN can be used to round the intermediate result in conformance with the IEEE Standard.
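
One way to combine the raw variant with roundRawFNToRecFN is sketched below, using roundingModeOut to supply the rounding mode and passing infiniteExc through to the rounding module. The wrapper name is hypothetical, the client is assumed to hold control stable until the result has been taken, and `floatControlWidth is assumed to come from HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper: raw division/square root followed by rounding.
module divSqrtRecFN_twoStage#(parameter expWidth = 8, parameter sigWidth = 24) (
    input nReset,
    input clock,
    input [(`floatControlWidth - 1):0] control,
    output inReady,
    input inValid,
    input sqrtOp,
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    input [2:0] roundingMode,
    output outValid,
    output sqrtOpOut,
    output [(expWidth + sigWidth):0] out,
    output [4:0] exceptionFlags
);

    wire [2:0] roundingModeOut;
    wire invalidExc, infiniteExc;
    wire raw_isNaN, raw_isInf, raw_isZero, raw_sign;
    wire signed [(expWidth + 1):0] raw_sExp;
    wire [(sigWidth + 2):0] raw_sig;

    divSqrtRecFNToRaw_small#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) divSqrtToRaw (
        .nReset(nReset), .clock(clock), .control(control),
        .inReady(inReady), .inValid(inValid), .sqrtOp(sqrtOp),
        .a(a), .b(b), .roundingMode(roundingMode),
        .outValid(outValid), .sqrtOpOut(sqrtOpOut),
        .roundingModeOut(roundingModeOut),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .out_isNaN(raw_isNaN), .out_isInf(raw_isInf),
        .out_isZero(raw_isZero), .out_sign(raw_sign),
        .out_sExp(raw_sExp), .out_sig(raw_sig)
    );

    // Round with the mode that was supplied when the operation started.
    roundRawFNToRecFN#(
        .expWidth(expWidth), .sigWidth(sigWidth), .options(0)
    ) rounder (
        .control(control),
        .invalidExc(invalidExc), .infiniteExc(infiniteExc),
        .in_isNaN(raw_isNaN), .in_isInf(raw_isInf),
        .in_isZero(raw_isZero), .in_sign(raw_sign),
        .in_sExp(raw_sExp), .in_sig(raw_sig),
        .roundingMode(roundingModeOut),
        .out(out), .exceptionFlags(exceptionFlags)
    );

endmodule
```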

9.9. Comparisons (compareRecFN)

Module compareRecFN compares two recoded floating-point values for equality or inequality, according to the IEEE Standard. Its parameters and ports are as follows:

module
    compareRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] a,
        input [(expWidth + sigWidth):0] b,
        input signaling,
        output lt,
        output eq,
        output gt,
        output unordered,
        output [4:0] exceptionFlags
    );

If Boolean input signaling is true, a signaling comparison is done, meaning that an invalid exception is raised if either operand is any kind of NaN. If signaling is false, a quiet comparison is done, meaning that quiet NaNs do not cause an invalid exception. The IEEE Standard mandates that equality comparisons (=, ≠) ordinarily are quiet, while inequality comparisons (<, ≤, ≥, >) ordinarily are signaling.

Boolean outputs lt, eq, and gt indicate whether a < b, a = b, or a > b, respectively. Output unordered is true if either operand is a NaN. Exactly one of lt, eq, gt, and unordered will be true for any pair of operands, a and b.
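
The other relational predicates can be derived from these outputs. The sketch below uses one quiet instance for = and ≠ and one signaling instance for < and ≤, following the usual IEEE conventions noted above. The wrapper and output names are hypothetical. A real design would more likely share a single instance and select the signaling input according to the predicate requested.

```verilog
// Hypothetical wrapper deriving common predicates from compareRecFN.
module comparePredicatesRecFN#(
    parameter expWidth = 8, parameter sigWidth = 24
) (
    input [(expWidth + sigWidth):0] a,
    input [(expWidth + sigWidth):0] b,
    output a_eq_b,
    output a_ne_b,
    output a_lt_b,
    output a_le_b,
    output [4:0] quietFlags,        // flags for the = and != predicates
    output [4:0] signalingFlags     // flags for the < and <= predicates
);

    wire q_lt, q_eq, q_gt, q_unordered;
    wire s_lt, s_eq, s_gt, s_unordered;

    // Quiet comparison: quiet NaNs do not raise invalid.
    compareRecFN#(.expWidth(expWidth), .sigWidth(sigWidth)) quietCompare (
        .a(a), .b(b), .signaling(1'b0),
        .lt(q_lt), .eq(q_eq), .gt(q_gt), .unordered(q_unordered),
        .exceptionFlags(quietFlags)
    );

    // Signaling comparison: any NaN operand raises invalid.
    compareRecFN#(.expWidth(expWidth), .sigWidth(sigWidth)) signalingCompare (
        .a(a), .b(b), .signaling(1'b1),
        .lt(s_lt), .eq(s_eq), .gt(s_gt), .unordered(s_unordered),
        .exceptionFlags(signalingFlags)
    );

    assign a_eq_b = q_eq;
    // Exactly one of lt, eq, gt, unordered is true, so !eq covers
    // lt, gt, and unordered; "not equal" is thus true for NaN operands.
    assign a_ne_b = !q_eq;
    assign a_lt_b = s_lt;
    assign a_le_b = s_lt | s_eq;

endmodule
```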

10. Common Submodules

A few HardFloat modules are components that are shared by multiple floating-point operations. One such module tests whether a floating-point value is a signaling NaN. Other modules convert into and out of the deconstructed “raw” forms documented earlier in section 5.3, Raw Deconstructions. These component modules are found in files isSigNaNRecFN.v and HardFloat_rawFN.v.

These modules are documented here for those who may need to use them directly. Otherwise, this section can be skipped.

10.1. isSigNaNRecFN

Module isSigNaNRecFN tells whether a floating-point value is a signaling NaN. It has these parameters and ports:

module
    isSigNaNRecFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output isSigNaN
    );

As indicated by the module’s name, the input floating-point value is in HardFloat’s recoded format. The output isSigNaN is true if the input is a signaling NaN.

10.2. recFNToRawFN

Module recFNToRawFN converts a floating-point value from HardFloat’s recoded format into the “raw” deconstructed form. Its parameters and ports are as follows:

module
    recFNToRawFN#(parameter expWidth, parameter sigWidth) (
        input [(expWidth + sigWidth):0] in,
        output isNaN,
        output isInf,
        output isZero,
        output sign,
        output signed [(expWidth + 1):0] sExp,
        output [sigWidth:0] sig
    );

For general information about the deconstructed form, see section 5.3, Raw Deconstructions. Besides the usual rules for the deconstructed form, this module guarantees the following for its outputs:

10.3. roundAnyRawFNToRecFN

Module roundAnyRawFNToRecFN takes an intermediate floating-point value in deconstructed form and rounds it to a valid IEEE-conformant value in a recoded format, applying the requested rounding mode and taking account of any exceptional conditions such as underflow or overflow. This module is declared with these parameters and ports:

module
    roundAnyRawFNToRecFN#(
        parameter inExpWidth,
        parameter inSigWidth,
        parameter outExpWidth,
        parameter outSigWidth,
        parameter options = 0
    ) (
        input [(`floatControlWidth - 1):0] control,
        input invalidExc,
        input infiniteExc,
        input in_isNaN,
        input in_isInf,
        input in_isZero,
        input in_sign,
        input signed [(inExpWidth + 1):0] in_sExp,
        input [inSigWidth:0] in_sig,
        input [2:0] roundingMode,
        output [(outExpWidth + outSigWidth):0] out,
        output [4:0] exceptionFlags
    );

Parameters inExpWidth and inSigWidth characterize the floating-point input to be rounded, while outExpWidth and outSigWidth control the result format. The options parameter is explained later below.

Inputs control and roundingMode are as documented in section 6, Common Control and Mode Inputs. The in_* inputs obviously supply the incoming floating-point value in deconstructed form. Output out is the rounded IEEE floating-point value, in recoded format, while exceptionFlags delivers the five floating-point exception flags as documented in section 7, Exception Results.

That leaves only inputs invalidExc and infiniteExc, both Booleans. If invalidExc is true, it forces an invalid exception to be asserted, independent of the other inputs, so the floating-point result delivered by out will be a NaN, and an invalid exception will be indicated in exceptionFlags. Similarly, infiniteExc asserts an infinite exception (“divide by zero”) independent of most other inputs. If invalidExc is false and infiniteExc is true, the floating-point output will be an infinity with sign in_sign, and an infinite exception will be indicated in exceptionFlags.

When the floating-point input to roundAnyRawFNToRecFN is the intermediate result of an operation, its precision typically needs to be at least two bits greater than the output precision to avoid corrupting the result value, i.e., inSigWidth ≥ outSigWidth + 2. In this case, the least-significant bit of the input is typically called the sticky bit of the computation. On the other hand, if the floating-point input is always an exact value (such as from iNToRawFN or recFNToRawFN), then no relationship between input and output formats is necessary.
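
A common way to form a sticky bit when an intermediate significand is wider than needed is to OR all discarded low-order bits into the least-significant bit that is kept. The stand-alone sketch below illustrates that general technique only; the module stickyNarrow is a hypothetical helper written for this example and is not part of HardFloat.

```verilog
// Hypothetical helper: narrow a significand, folding every discarded bit
// into a sticky least-significant bit. Requires inWidth > outWidth >= 2.
module stickyNarrow#(parameter inWidth = 8, parameter outWidth = 5) (
    input [(inWidth - 1):0] in,
    output [(outWidth - 1):0] out
);

    // Keep the top (outWidth - 1) bits; the last bit is the OR of the rest.
    assign out =
        {in[(inWidth - 1):(inWidth - outWidth + 1)],
         |in[(inWidth - outWidth):0]};

endmodule

module stickyNarrow_demo;

    reg [7:0] in;
    wire [4:0] out;

    stickyNarrow#(.inWidth(8), .outWidth(5)) narrow(.in(in), .out(out));

    initial begin
        in = 8'b1011_0000;
        #1 $display("%b -> %b", in, out);   // 10110000 -> 10110
        in = 8'b1011_0001;
        #1 $display("%b -> %b", in, out);   // 10110001 -> 10111
    end

endmodule
```

The second case shows a single discarded 1 bit being preserved as the sticky bit, so that a later rounding step can still tell the value was not exact.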

Source file HardFloat_consts.vi defines several ‘flRoundOpt_’ macros to be used for the options parameter, with the following meanings:

flRoundOpt_sigMSBitAlwaysZero
    For finite nonzero values, the two most-significant bits of in_sig are always binary 01.
flRoundOpt_subnormsAlwaysExact
    Whenever the floating-point result is a subnormal, the result is always exact, requiring no real rounding. The inexact exception is therefore never indicated for subnormal results. (This case commonly arises with floating-point addition and subtraction.)
flRoundOpt_neverUnderflows
    Underflow never occurs, because, for finite nonzero values, the floating-point exponent is never below the normal range.
flRoundOpt_neverOverflows
    Overflow never occurs, because, for finite nonzero values, the floating-point exponent is never above the normal range.

The options parameter can be set to the bitwise OR of any combination of these macro values, or to 0, if none is applicable. By setting options to the maximal set of conditions that apply, the efficiency of the module may be improved.
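
As an example of combining options, the sketch below converts a 32-bit integer to recoded double-precision by pairing iNToRawFN with roundAnyRawFNToRecFN. A nonzero 32-bit integer has magnitude at least 1 and less than 2^32, well inside the normal range of double-precision, so both flRoundOpt_neverUnderflows and flRoundOpt_neverOverflows apply. The wrapper name is hypothetical, the standard Verilog-2005 system function $clog2 is used in place of the clog2 function described in section 9.2, and the macros are assumed to be made available by including HardFloat_consts.vi.

```verilog
`include "HardFloat_consts.vi"

// Hypothetical wrapper: 32-bit integer to recoded double-precision.
module i32ToRecF64 (
    input [(`floatControlWidth - 1):0] control,
    input signedIn,
    input [31:0] in,
    input [2:0] roundingMode,
    output [64:0] out,              // recoded double-precision (11 + 53 + 1 bits)
    output [4:0] exceptionFlags
);

    localparam intWidth = 32;
    // Format produced by iNToRawFN: expWidth = clog2(intWidth) + 1 = 6,
    // sigWidth = intWidth = 32.
    localparam intExpWidth = $clog2(intWidth) + 1;

    wire isZero, sign;
    wire signed [(intExpWidth + 1):0] sExp;
    wire [intWidth:0] sig;

    iNToRawFN#(.intWidth(intWidth)) toRaw (
        .signedIn(signedIn),
        .in(in),
        .isZero(isZero),
        .sign(sign),
        .sExp(sExp),
        .sig(sig)
    );

    // An integer is never a NaN or infinity, and no exception is forced.
    roundAnyRawFNToRecFN#(
        .inExpWidth(intExpWidth),
        .inSigWidth(intWidth),
        .outExpWidth(11),
        .outSigWidth(53),
        .options(`flRoundOpt_neverUnderflows | `flRoundOpt_neverOverflows)
    ) rounder (
        .control(control),
        .invalidExc(1'b0),
        .infiniteExc(1'b0),
        .in_isNaN(1'b0),
        .in_isInf(1'b0),
        .in_isZero(isZero),
        .in_sign(sign),
        .in_sExp(sExp),
        .in_sig(sig),
        .roundingMode(roundingMode),
        .out(out),
        .exceptionFlags(exceptionFlags)
    );

endmodule
```

If the output format were narrow enough that a 32-bit integer could exceed its range, flRoundOpt_neverOverflows would no longer be valid and would have to be dropped.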

If the floating-point output is a NaN (because either invalidExc is true, or invalidExc and infiniteExc are false and in_isNaN is true), and if macro HardFloat_propagateNaNPayloads is defined (refer back to section 8.3, NaN Results), then the NaN’s payload is specified by inputs in_sign and in_sig. This is true even when the NaN result should be the default NaN. The client is responsible for controlling when a NaN result will be the default NaN, by setting in_sign and in_sig appropriately. On the other hand, if macro HardFloat_propagateNaNPayloads is not defined, any NaN outputs from roundAnyRawFNToRecFN are always the default NaN, and inputs in_sign and in_sig are ignored for NaNs.

10.4. roundRawFNToRecFN

Module roundRawFNToRecFN is a variation on roundAnyRawFNToRecFN, with the same set of ports but fewer parameters:

module
    roundRawFNToRecFN#(
        parameter expWidth,
        parameter sigWidth,
        parameter options = 0
    ) (
        input [(`floatControlWidth - 1):0] control,
        input invalidExc,
        input infiniteExc,
        input in_isNaN,
        input in_isInf,
        input in_isZero,
        input in_sign,
        input signed [(expWidth + 1):0] in_sExp,
        input [(sigWidth + 2):0] in_sig,
        input [2:0] roundingMode,
        output [(expWidth + sigWidth):0] out,
        output [4:0] exceptionFlags
    );

roundRawFNToRecFN is identical to roundAnyRawFNToRecFN with this assignment of parameters:

inExpWidth = expWidth
inSigWidth = sigWidth + 2
outExpWidth = expWidth
outSigWidth = sigWidth

Note that the deconstructed input has implicitly two more bits of precision than the specified sigWidth would normally indicate.

11. Testing HardFloat

The HardFloat package includes a subdirectory named test containing source code and example Makefiles for testing HardFloat’s Verilog modules. To execute the tests, either a Verilog simulator or Verilator is required. (Verilator is a free tool for converting a subset of synthesizable Verilog or SystemVerilog into C++ code. When compiled into an executable program, the code generated by Verilator has been found to run much faster than some Verilog simulators.)

HardFloat’s test infrastructure also depends on Berkeley TestFloat, which must be obtained and compiled separately. And building TestFloat furthermore requires Berkeley SoftFloat, thus completing the three-part set of Berkeley HardFloat, SoftFloat, and TestFloat. Information about TestFloat and SoftFloat can be found at their respective Web pages:

http://www.jhauser.us/arithmetic/TestFloat.html
http://www.jhauser.us/arithmetic/SoftFloat.html

Separate documentation is supplied according to whether one is using a Verilog simulator or Verilator for testing:

HardFloat-test-Verilog.html      Documentation for testing HardFloat using Verilog simulation.
HardFloat-test-Verilator.html    Documentation for testing HardFloat using Verilator.

12. Contact Information

At the time of this writing, the most up-to-date information about HardFloat and the latest release can be found at the Web page http://www.jhauser.us/arithmetic/HardFloat.html.