**What you learn:**

- Why floating-point precision matters for developing machine-learning models.
- Which floating-point formats are commonly used for machine learning.

Over the past two decades, computer-intensive artificial intelligence (AI) tasks have promoted the use of custom hardware to efficiently power these robust new systems. Machine-learning (ML) models, one of the most widely used forms of AI, are trained to handle these intensive tasks using floating-point arithmetic.

However, because full-precision floating-point arithmetic is extremely resource-intensive, AI systems often rely either on integer quantization or on one of a handful of now-standard reduced-precision floating-point formats, such as Google’s bfloat16 and IEEE’s FP16.

Since computer memory is limited, numbers cannot be stored with infinite precision, whether they are binary fractions or decimal numbers; every stored value is an approximation. How much rounding error an application can tolerate varies, and training AI models is one case where that trade-off must be made deliberately.
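The point about finite precision is easy to see in any language. A minimal sketch in Python (using only the standard library) shows that even the decimal fraction 0.1 has no exact binary representation, so small rounding errors appear in simple arithmetic:

```python
from decimal import Decimal

# Binary floating point cannot store most decimal fractions exactly,
# so rounding error shows up even in trivial sums.
total = 0.1 + 0.2
print(total)          # 0.30000000000000004, not 0.3
print(total == 0.3)   # False

# The value actually stored for 0.1 is a nearby binary fraction;
# Decimal reveals it in full.
print(Decimal(0.1))
```

The error here is tiny, but it illustrates why formats must choose how many bits of precision to spend on each value.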

While software engineers can design machine-learning algorithms, they often cannot rely on constantly changing hardware to execute those algorithms efficiently. The same can be said of hardware manufacturers, who often produce next-generation CPUs that are not task-oriented: the CPU is designed as a well-rounded platform that handles most tasks acceptably rather than targeting specific applications.

When it comes to computation, floating-point formats are arithmetic representations of real numbers that approximate values in order to trade range against precision: the ability to cover huge spans of magnitudes against the ability to store results accurately. Because of this, floating-point arithmetic is often used in systems that mix very small and very large numbers and require fast processing times.
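The range-versus-precision trade-off is concrete in the 16-bit half-precision format discussed later in this article. As a sketch, Python's `struct` module can round-trip a value through IEEE half precision (format code `'e'`); the helper name below is ours for illustration:

```python
import struct

def to_fp16_and_back(x):
    # Round-trip a Python double through IEEE 754 half precision.
    # Values that FP16 cannot represent are rounded to the nearest
    # representable value; values beyond its range raise OverflowError.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16_and_back(1.0001))    # 1.0 -- the gap near 1.0 is 2**-10
print(to_fp16_and_back(65504.0))   # 65504.0, the largest finite FP16 value
# struct.pack('<e', 1e5) would raise OverflowError: beyond FP16's range
```

With only 16 bits, FP16 keeps about three decimal digits of precision and tops out near 65,504, which is exactly the kind of compromise the paragraph above describes.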

It is widely known that deep neural networks can tolerate lower numerical precision; high-precision computation is simply less efficient for training and inference, because the additional precision provides no benefit while costing speed and memory.

In fact, some models can even achieve higher accuracy with lower precision. A paper posted on arXiv (hosted by Cornell University) attributes this effect to the regularization that reduced precision provides.

### Floating point formats

Although there are a multitude of floating-point formats, only a few have gained traction in machine learning applications, as these formats require the appropriate hardware and firmware support to run efficiently. In this section, we’ll look at several examples of floating-point formats designed to handle machine learning development.

*IEEE 754*

IEEE standard 754 *(Fig. 1)* is one of the most commonly used formats for AI applications. It is a set of representations of numeric values and symbols, including FP16, FP32, and FP64 (also known as the half-, single-, and double-precision formats). FP32, for example, is laid out as a sequence of 32 bits labeled b31, b30, b29, and so on down to b0.

A floating-point format is specified by a base (*b*), which is either 2 (binary) or 10 (decimal); a precision (*p*); and an exponent range from emin to emax, with emin = 1 − emax for all IEEE 754 formats. The format's finite numbers are each described by three integers.

These integers are s, a sign (zero or one); c, a significand (or coefficient) that has no more than p digits when written in base b (i.e., an integer in the range 0 to b^p − 1); and q, an exponent such that emin ≤ q + p − 1 ≤ emax, giving the value (−1)^s × c × b^q. The format also includes two infinities (+∞ and −∞) and two kinds of NaN (not a number): a quiet NaN (qNaN) and a signaling NaN (sNaN).

The details here are extensive, but this is the general shape of how IEEE 754 floating point works; more detailed information can be found in the standard itself. FP32 and FP64 sit at the larger end of the floating-point spectrum and are supported by x86 CPUs and most of today’s GPUs, as well as by C/C++ and by frameworks such as PyTorch and TensorFlow. FP16, on the other hand, is not widely supported by modern CPUs, but it is well supported by current GPUs in conjunction with machine-learning frameworks.
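The sign/exponent/significand layout described above can be inspected directly. Below is a minimal sketch in Python that unpacks the 32 bits of an FP32 value into its three fields and reconstructs the value for normal numbers; the function name is ours for illustration, and subnormals, infinities, and NaNs are deliberately ignored:

```python
import struct

def decode_fp32(x):
    # Reinterpret the bytes of an IEEE 754 single-precision float
    # as a 32-bit unsigned integer, then slice out the fields.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31                # b31: sign bit
    exponent = (bits >> 23) & 0xFF   # b30..b23: 8 biased exponent bits
    mantissa = bits & 0x7FFFFF       # b22..b0: 23 fraction bits
    # For normal numbers, value = (-1)^s * 1.fraction * 2^(exponent - 127).
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_fp32(-6.25))   # (1, 129, 4718592, -6.25)
```

Here −6.25 = −1.5625 × 2², so the biased exponent is 127 + 2 = 129 and the fraction bits encode 0.5625.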

*Bfloat16*

Google’s bfloat16 *(Fig. 2)* is another widely used floating-point format aimed at machine-learning workloads. The Brain Floating Point format is essentially a truncated version of IEEE 754 single precision (FP32), allowing fast conversion to and from that format. When applied to machine learning, there are generally three kinds of values to store: weights, activations, and gradients.

Google recommends storing weights and gradients in FP32 format and storing activations in bfloat16. Depending on the circumstances, weights can also be stored in bfloat16 without significant degradation.

At its core, bfloat16 consists of one sign bit, eight exponent bits, and seven mantissa bits. This differs from IEEE 16-bit floating point (five exponent bits and ten mantissa bits), which was not designed with deep-learning applications in mind. The format is used in Intel AI processors, including the Nervana NNP-L1000, Xeon processors, and Intel FPGAs, as well as Google Cloud TPUs.

Unlike the IEEE formats, bfloat16 is not used with the C/C++ programming languages. It is, however, supported by TensorFlow, AMD’s ROCm, NVIDIA’s CUDA, and the ARMv8.6-A architecture for AI applications.
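Because bfloat16 is just the top 16 bits of an FP32 value, the conversion described above is a simple bit shift. A minimal sketch (helper names ours; this version truncates rather than rounding to nearest, which real hardware typically does):

```python
import struct

def fp32_to_bfloat16_bits(x):
    # bfloat16 keeps FP32's sign bit and full 8-bit exponent but only
    # the top 7 of the 23 fraction bits, so truncating to the high
    # 16 bits preserves range while dropping precision.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 16

def bfloat16_bits_to_fp32(b):
    # Conversion back is just a 16-bit left shift.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

pi_bf16 = fp32_to_bfloat16_bits(3.141592653589793)
print(bfloat16_bits_to_fp32(pi_bf16))   # 3.140625 -- about 3 digits survive
```

The cheapness of this conversion, in both directions, is a large part of why the format caught on: existing FP32 data paths and storage need almost no changes.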

*TensorFloat*

NVIDIA’s TensorFloat *(Fig. 3)* is another excellent floating-point format. However, it was designed specifically to take advantage of the Tensor Cores in NVIDIA’s GPUs built for AI applications. According to NVIDIA, “TensorFloat-32 is the new math mode in NVIDIA A100 GPUs for handling the matrix math aka tensor operations used at the heart of AI and certain HPC applications. TF32 runs on Tensor Cores in A100 GPUs can provide up to 10X speedups compared to single-precision floating-point math (FP32) on Volta GPUs.”

The format is essentially a 32-bit float with 13 bits of precision dropped so that it runs on Tensor Cores. It thus has the precision of FP16 (a 10-bit mantissa) but the range of FP32 (an 8-bit exponent) from the IEEE 754 format.

NVIDIA states that TF32 uses the same 10-bit mantissa as the half-precision FP16 math, which has proven to have more than enough margin for the precision requirements of AI workloads. TF32 also uses the same 8-bit exponent as FP32, so it can support the same numeric range. This means that content can be converted from FP32 to TF32, making it easy to switch platforms.
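The precision loss NVIDIA describes can be approximated in software. A sketch under our own assumptions: zeroing the low 13 fraction bits of an FP32 value mimics TF32's 10-bit mantissa (by truncation, whereas the hardware rounds), while the sign and 8-bit exponent pass through untouched:

```python
import struct

def round_to_tf32(x):
    # TF32 keeps FP32's sign and 8-bit exponent but only 10 of the
    # 23 fraction bits. Masking off the low 13 bits simulates the
    # reduced mantissa (truncation, not the hardware's rounding).
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << 13) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(round_to_tf32(1.0009765625))  # 1.0009765625 -- exactly 1 + 2**-10, kept
print(round_to_tf32(1.0001))        # 1.0 -- the low fraction bits are dropped
```

The second call shows the cost: near 1.0 the format can only resolve steps of 2⁻¹⁰, which NVIDIA argues is still ample margin for AI workloads.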

Currently, TF32 is not exposed through the C/C++ programming languages, but NVIDIA says that the TensorFlow framework, and a version of the PyTorch framework with TF32 support, are available to developers on NGC. Although this limits the hardware and software the format can be used with, its performance on enterprise GPUs is exceptional.

### Summary

This is just a basic overview of floating-point formats: an introduction to a larger, more comprehensive world of techniques designed to reduce the hardware and software requirements of AI and drive innovation in the industry. It will be interesting to see how these formats evolve over the coming years as AI becomes more advanced and more ingrained in our lives, making the development of machine-learning applications increasingly efficient.