WHAT IS CLAIMED IS:

1. A method comprising:

receiving at a graphics processing unit (GPU) [100] a set of commands for execution, the GPU comprising a plurality of compute units (CUs) [105, 106, 107, 108], the set of commands including a plurality of matrix multiplication operations [103, 114];

in response to receiving the set of commands, scheduling a first matrix multiplication operation of the plurality of matrix multiplication operations at a first subset of CUs [110] and a second matrix multiplication operation of the plurality of matrix multiplication operations at a second subset of the CUs [111], the second subset of CUs different from the first subset of CUs; and

executing the first and second matrix multiplication operations at the respective first subset and second subset of CUs.

2. The method of claim 1, further comprising:

providing results of the first matrix multiplication operation from the first subset of CUs to the second subset of CUs to perform the second matrix multiplication operation.

3. The method of claim 2, further comprising:

providing results of the second matrix multiplication operation to a third subset of CUs [112] of the plurality of CUs to perform a third matrix multiplication operation, the third subset of CUs different from the first subset and the second subset of CUs.

4. The method of claim 3, further comprising:

providing results of the third matrix multiplication operation from the third subset of CUs to the first subset of CUs to perform a fourth matrix multiplication operation.

5. The method of claim 2, wherein:

the first matrix multiplication operation comprises a first multiplication and a second multiplication;

the second matrix multiplication operation comprises a third multiplication; and

wherein executing the first and second matrix multiplication operations comprises executing the second multiplication concurrent with the third multiplication.

6. The method of claim 5, wherein:

the third multiplication multiplies a result of the first multiplication.

7. The method of claim 2, wherein:

the first matrix multiplication operation comprises a first multiplication and a second multiplication; and

wherein executing the first matrix multiplication operation comprises executing the first multiplication at a first cluster of the first subset of CUs and the second multiplication at a second cluster of the first subset of CUs.

8. The method of claim 7, wherein:

executing the first matrix multiplication operation comprises executing the first multiplication concurrent with the second multiplication.

9. The method of claim 1, further comprising:

generating an output of a recurrent neural network (RNN) [102] based on the first and second matrix multiplication operations.

10. A method, comprising:

receiving, at a graphics processing unit (GPU) [100] comprising a plurality of compute units (CUs) [105, 106, 107, 108], a plurality of matrix multiplication operations [103, 114];

in response to receiving the plurality of matrix multiplication operations, scheduling different ones of the plurality of matrix multiplication operations at different corresponding subsets [110, 111, 112, 113] of the plurality of CUs; and

pipelining results of the plurality of matrix multiplication operations between the different subsets of the plurality of CUs.

11. The method of claim 10, further comprising:

concurrently executing portions of the plurality of matrix multiplication operations at different subsets of the plurality of CUs.

12. A graphics processing unit (GPU) [100], comprising:

a plurality of CUs [105, 106, 107, 108], including a first subset of CUs [110] and a second subset of CUs [111], the second subset of CUs different from the first subset of CUs;

a scheduler [104] configured to:

receive a set of commands for execution, the set of commands including a plurality of matrix multiplication operations [103, 114];

in response to receiving the set of commands, schedule a first matrix multiplication operation of the plurality of matrix multiplication operations at the first subset of CUs and a second matrix multiplication operation of the plurality of matrix multiplication operations at the second subset of the CUs; and

wherein the first subset of CUs and second subset of CUs are configured to execute the first and second matrix multiplication operations.

13. The GPU of claim 12, wherein:

the first subset of CUs is configured to provide results of the first matrix multiplication operation to the second subset of CUs to perform the second matrix multiplication operation.

14. The GPU of claim 13, wherein:

the second subset of CUs is configured to provide results of the second matrix multiplication operation to a third subset of CUs [112] of the plurality of CUs to perform a third matrix multiplication operation, the third subset of CUs different from the first subset and the second subset of CUs.

15. The GPU of claim 14, wherein:

the third subset of CUs is configured to provide results of the third matrix multiplication operation to the first subset of CUs to perform a fourth matrix multiplication operation.

16. The GPU of claim 13, wherein:

the first matrix multiplication operation comprises a first multiplication and a second multiplication;

the second matrix multiplication operation comprises a third multiplication; and

wherein the first subset of CUs is configured to execute the second multiplication concurrent with the second subset of CUs executing the third multiplication.

17. The GPU of claim 16, wherein:

the third multiplication multiplies a result of the first multiplication.

18. The GPU of claim 13, wherein:

the first subset of CUs comprises a first cluster of CUs and a second cluster of CUs, the second cluster different from the first cluster;

the first matrix multiplication operation comprises a first multiplication and a second multiplication; and

wherein the first subset of CUs is configured to execute the first multiplication at the first cluster of the first subset of CUs and the second multiplication at the second cluster of the first subset of CUs.

19. The GPU of claim 18, wherein:

the first subset of CUs is configured to execute the first multiplication concurrent with the second multiplication.

20. The GPU of claim 12, wherein the GPU is configured to:

generate an output of a recurrent neural network (RNN) based on the first and second matrix multiplication operations.
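
For illustration only, and forming no part of the claims, the scheduling and result forwarding recited above can be sketched in software. The sketch below is a sequential Python simulation under stated assumptions: the names `matmul`, `schedule`, and `run_pipeline`, the subset labels, and the round-robin assignment policy are all hypothetical and do not appear in the claims. It models only which subset of CUs receives each matrix multiplication operation and how each result feeds the next operation; it does not model concurrent execution on hardware.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product, standing in for the work done by one subset of CUs."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def schedule(num_ops: int, subsets: List[str]) -> List[str]:
    """Assign operation i to subset i mod len(subsets) (assumed round-robin policy)."""
    return [subsets[i % len(subsets)] for i in range(num_ops)]


def run_pipeline(x: Matrix, weights: List[Matrix],
                 subsets: List[str]) -> Tuple[Matrix, List[str]]:
    """Chain the products so each subset's result is the input to the next operation."""
    assignment = schedule(len(weights), subsets)
    trace = []          # records which subset handled each operation, in order
    result = x
    for subset, w in zip(assignment, weights):
        result = matmul(result, w)   # result is forwarded to the next subset
        trace.append(subset)
    return result, trace


# Hypothetical labels loosely mirroring reference numerals 110, 111, 112.
subsets = ["subset_110", "subset_111", "subset_112"]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
x = [[1.0, 2.0], [3.0, 4.0]]

result, trace = run_pipeline(x, [double, identity, double, double], subsets)
print(trace)
print(result)
```

With four operations and three subsets, the fourth operation in this sketch is assigned back to the first subset, which corresponds to the arrangement in which results of the third matrix multiplication operation are provided to the first subset of CUs.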