
WO2020142192 - NEURAL NETWORK ACTIVATION COMPRESSION WITH NARROW BLOCK FLOATING-POINT


CLAIMS

1. A computing system comprising:

one or more processors;

bulk memory comprising computer-readable storage devices and/or memory;

a block floating-point compressor formed from at least one of the processors, the block floating-point compressor being in communication with the bulk memory; and

the computing system being configured to:

with at least one of the processors, perform forward propagation for a layer of a neural network to produce first activation values in a first block floating-point format, the first block floating-point format having a first numerical precision;

with the block floating-point compressor, convert at least one of the activation values to a second block floating-point format to produce compressed activation values, the second block floating-point format having a second numerical precision less than the first numerical precision; and

with at least one of the processors, store the compressed activation values in the bulk memory.
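
For illustration only (not part of the claim language), the following is a minimal NumPy sketch of the conversion claim 1 recites: activation values held in a wider block floating-point (BFP) format, where a block of values shares one exponent and each value keeps a narrow signed mantissa, are requantized to a narrower BFP format before being written to bulk memory. The helper names (`to_bfp`, `from_bfp`), the mantissa widths, and the single block-wide shared exponent are all assumptions of the sketch, not details taken from the claims.

```python
import numpy as np

def to_bfp(values, mantissa_bits):
    """Quantize a block to block floating-point: one shared exponent
    for the whole block, a narrow signed integer mantissa per value."""
    max_mag = np.max(np.abs(values))
    shared_exp = int(np.floor(np.log2(max_mag))) if max_mag > 0 else 0
    # Scale so the largest magnitude fits the signed mantissa range.
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(values / scale), -limit, limit).astype(np.int32)
    return shared_exp, mantissas

def from_bfp(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate real values from a BFP block."""
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    return mantissas.astype(np.float64) * scale

# First (wider) BFP format, e.g. 8-bit mantissas from forward propagation.
activations = np.random.randn(4, 4)
exp8, man8 = to_bfp(activations, mantissa_bits=8)

# Second (narrower) BFP format for bulk storage, e.g. 4-bit mantissas.
exp4, man4 = to_bfp(from_bfp(exp8, man8, 8), mantissa_bits=4)
```

Each 4-bit mantissa occupies half the storage of its 8-bit counterpart, which is the memory saving the claim targets; the shared exponent amortizes to a fraction of a bit per value.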

2. The computing system of claim 1, wherein:

the second block floating-point format has a lower-precision mantissa and/or a lower-precision exponent than the first block floating-point format; and

the second block floating-point format has a different sharing format for a common exponent than the first block floating-point format, the sharing format differing based on per-row, per-column, or per-tile sharing of a common exponent for the compressed activation values.
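
As a purely illustrative companion to claim 2, the sketch below computes the shared-exponent grid for a 2-D block of activation values under the three sharing schemes the claim names; the function name, the tile size, and the exponent rule are assumptions carried over from the claim 1 sketch.

```python
import numpy as np

def shared_exponents(values, sharing, tile=2):
    """Shared exponents for a 2-D array under per-row, per-column,
    or per-tile sharing (illustrative only)."""
    def exp_of(block):
        m = np.max(np.abs(block))
        return int(np.floor(np.log2(m))) if m > 0 else 0

    if sharing == "per-row":
        return [exp_of(row) for row in values]
    if sharing == "per-column":
        return [exp_of(col) for col in values.T]
    if sharing == "per-tile":
        rows, cols = values.shape
        return [[exp_of(values[r:r + tile, c:c + tile])
                 for c in range(0, cols, tile)]
                for r in range(0, rows, tile)]
    raise ValueError(f"unknown sharing scheme: {sharing}")
```

Finer-grained sharing (per-tile) tracks local dynamic range more closely, reducing quantization error, at the cost of storing more exponents per block.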

3. The computing system of claim 1, wherein the computing system is further configured to:

convert the first activation values to a normal-precision format, producing converted normal-precision values; and

convert the at least one of the activation values to the second block floating-point format by converting the converted normal-precision values to the second block floating-point format to produce compressed activation values.
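
A short illustration of the two-step path claim 3 recites, reusing the hypothetical `to_bfp`/`from_bfp` helpers from the claim 1 sketch: the first-format values are expanded to a normal-precision format (float32 is an assumption here) and then requantized to the narrower second format.

```python
# Claim 3 path (sketch): first BFP format -> normal precision -> second BFP format.
normal_vals = from_bfp(exp8, man8, mantissa_bits=8).astype(np.float32)
exp4, man4 = to_bfp(normal_vals, mantissa_bits=4)
```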

4. The computing system of claim 1, wherein the compressor is further configured to further compress the compressed activation values prior to the storing by performing one or more of the following: entropy compression, zero compression, run-length encoding, compressed sparse row compression, or compressed sparse column compression.
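
To illustrate one of the further-compression options claim 4 lists, here is a minimal zero/run-length encoding pass over the narrow mantissas; the encoding layout (value, run-length pairs) is an assumption of the sketch, not a format from the specification.

```python
def zero_rle(mantissas):
    """Run-length encode zeros: each entry is (value, run_length), with
    runs longer than 1 used only for zero mantissas (illustrative only)."""
    out, i, flat = [], 0, list(mantissas.ravel())
    while i < len(flat):
        if flat[i] == 0:
            run = 1
            while i + run < len(flat) and flat[i + run] == 0:
                run += 1
            out.append((0, run))
            i += run
        else:
            out.append((int(flat[i]), 1))
            i += 1
    return out

encoded = zero_rle(man4)  # mantissas from the claim 1 sketch
```

Activation tensors downstream of ReLU-style layers are typically zero-heavy, which is why zero compression and run-length encoding pair well with the narrow mantissas.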

5. The computing system of claim 1, wherein the computing system is further configured to:

perform backward propagation for a layer of the neural network by converting the stored, compressed activation values to uncompressed activation values in the first block floating-point format; and

perform a gradient operation with the uncompressed activation values.
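
Continuing the claim 1 sketch, claim 5's backward pass can be pictured as decompressing the stored narrow-format block back to the first (wider) format and feeding the result to a gradient computation; the elementwise gradient below is a stand-in, not an operation prescribed by the claims.

```python
# Decompress: second (4-bit) format back to the first (8-bit) BFP format.
exp8b, man8b = to_bfp(from_bfp(exp4, man4, 4), mantissa_bits=8)
uncompressed = from_bfp(exp8b, man8b, mantissa_bits=8)

# Illustrative gradient operation using the uncompressed activations.
upstream_grad = np.random.randn(*uncompressed.shape)  # stand-in for dL/d(output)
weight_grad = upstream_grad * uncompressed            # elementwise example only
```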

6. The computing system of claim 1, wherein the layer is a first layer, the compressed activation values are first compressed activation values, and wherein the computing system is further configured to:

with at least one of the processors, perform forward propagation for a different, second layer of a neural network to produce second activation values in the first block floating-point format;

with the block floating-point compressor, for at least one of the second activation values, convert the at least one of the second activation values to a third block floating-point format to produce second compressed activation values, the third block floating-point format having a numerical precision different than the second numerical precision; and

with at least one of the processors, store the second compressed activation values in the bulk memory.

7. The computing system of claim 1, wherein:

the processors comprise at least one of the following: a tensor processing unit, a neural network accelerator, a graphics processing unit, or a processor implemented in a reconfigurable logic array;

the bulk memory is situated on a different integrated circuit than the processors, the bulk memory including dynamic random access memory (DRAM) or embedded DRAM; and

the computing system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of only one layer of the neural network, the hardware accelerator memory including static RAM (SRAM) or a register file.

8. A method of operating a computing system implementing a neural network, the method comprising:

with the computing system:

forward propagating a layer of the neural network to generate activation values in a first block floating-point format;

converting at least one of the activation values to a second block floating-point format different than the first block floating-point format, generating compressed activation values; and

storing the compressed activation values in a computer-readable memory or storage device.

9. The method of claim 8, wherein the second block floating-point format differs from the first block floating-point format in at least one of the following ways: a different mantissa format, a different exponent format, or a different exponent sharing scheme.

10. The method of claim 8, further comprising:

prior to the storing, further compressing the compressed activation values to be stored in the computer-readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run-length encoding, compressed sparse row compression, or compressed sparse column compression.

11. The method of claim 8, further comprising:

with the computing system, performing backward propagation for a layer of the neural network by converting the stored, compressed activation values to uncompressed activation values in the first block floating-point format;

with the computing system, performing a gradient operation with the uncompressed activation values; and

with the computing system, updating weights for a portion of at least one node of the neural network based on the uncompressed activation values, wherein the at least one node is one of the following: a long short-term memory (LSTM) node or a gated recurrent unit (GRU).
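
Claim 11 extends the backward pass of claim 5 with a weight update driven by the uncompressed activations. A minimal SGD-style continuation of the claim 5 sketch follows; the learning rate, the weight shape, and plain SGD itself are all assumptions, not details from the claims.

```python
learning_rate = 0.01                             # assumed hyperparameter
weights = np.random.randn(*weight_grad.shape)    # stand-in node weights
weights -= learning_rate * weight_grad           # simple gradient-descent update
```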

12. The method of claim 8, further comprising:

selecting the second block floating-point format based on an attribute of the layer, the attribute being selected from the group consisting of the following: the layer being a convolution layer, the layer comprising a long short-term memory (LSTM) node, the layer comprising a gated recurrent unit (GRU), the layer being fully-connected to another layer, the layer being sparsely-connected to another layer, the layer being an attention layer, and the layer being a normalization layer.
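
One way to picture claim 12's selection step is a lookup from layer attribute to storage format. Every entry below (the attribute names and the mantissa widths) is a made-up placeholder, since the claims do not tie specific precisions to specific layer types.

```python
# Hypothetical mapping from layer attribute to the second (storage) BFP format.
STORAGE_MANTISSA_BITS = {
    "convolution": 4,
    "lstm": 6,
    "gru": 6,
    "fully_connected": 4,
    "sparsely_connected": 5,
    "attention": 6,
    "normalization": 8,
}

def select_storage_format(layer_attribute):
    # Default: keep the wider first-format width if the attribute is unknown.
    return STORAGE_MANTISSA_BITS.get(layer_attribute, 8)
```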

13. One or more computer-readable storage devices or media storing computer-executable instructions, which when executed by a computer, cause the computer to perform a method of configuring a computer system to implement an artificial neural network, the instructions comprising:

instructions that cause the computer system to implement a first layer of a neural network using first weights and/or first activation values expressed in a first block floating-point format;

instructions that cause the computer system to forward propagate values from the first layer of the neural network to a second layer of the neural network, thereby generating second activation values expressed in the first block floating-point format; and

instructions that cause the computer system to, prior to performing back propagation for the neural network, store the second activation values in a second, different block floating-point format in a bulk memory or storage device in communication with the computer system.

14. The computer-readable storage devices or media of claim 13, further comprising:

instructions that cause the computer system to temporarily store the first weights and/or the first activation values in a different memory than the bulk memory or storage device.

15. The computer-readable storage devices or media of claim 13, further comprising:

instructions that cause the computer system to further compress the second activation values prior to storing the further compressed values in the bulk memory or storage device.