1. WO2016138375 - METHOD AND APPARATUS FOR PREDICTING GPU MALFUNCTIONS

Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

WHAT IS CLAIMED IS:

1. A method of predicting GPU malfunctions, the method comprising:

installing a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period;

obtaining the GPU status parameters from the GPU node; and

comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction, wherein the mean status fault parameters are obtained by use of a pre-configured statistical model.
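
The periodic collection step of claim 1 can be sketched in Python as follows. This is only an illustration: `GpuStatus`, `collect_periodically`, `read_status`, and the injectable `sleep` are hypothetical names not found in the claims, and how a real daemon reads hardware status is left to the caller-supplied `read_status`.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GpuStatus:
    """One sample of GPU status parameters (field names are illustrative)."""
    model: str
    temperature_c: float
    power_w: float
    usage_hours: float


def collect_periodically(
    read_status: Callable[[], GpuStatus],
    period_s: float,
    samples: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[GpuStatus]:
    """Collect `samples` readings, waiting `period_s` seconds between them."""
    readings: List[GpuStatus] = []
    for i in range(samples):
        readings.append(read_status())
        if i < samples - 1:
            sleep(period_s)  # the pre-determined time period
    return readings
```

Passing `sleep` as a parameter is a design choice of this sketch so that the loop can be exercised without real delays.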

2. The method of claim 1, wherein the status parameter is temperature, and comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction comprises:

performing statistics to obtain a temperature count of a number of times the GPU incurs a temperature greater than a pre-determined temperature threshold and a GPU temperature standard deviation threshold;

comparing the obtained temperature count with a mean temperature fault count;

comparing a GPU temperature fault standard deviation with the GPU temperature standard deviation threshold;

if the temperature count is greater than the mean temperature fault count, and if the GPU temperature fault standard deviation is less than the GPU temperature standard deviation threshold, determining that the GPU is to malfunction; and

if the temperature count is less than the mean temperature fault count, or if the GPU temperature fault standard deviation is greater than the GPU temperature standard deviation threshold, determining that the GPU is not to malfunction.
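
The two-condition rule of claim 2 can be expressed as a single predicate. The sketch below uses hypothetical parameter names; the claim does not say what happens when a value exactly equals its reference, so equality is treated here as "not to malfunction", which is an assumption.

```python
def predict_temperature_fault(
    temperature_count: int,
    mean_fault_count: float,
    fault_std_dev: float,
    std_dev_threshold: float,
) -> bool:
    """Return True when both claim-2 conditions for a predicted malfunction hold."""
    # Count must exceed the mean fault count AND the fault standard
    # deviation must be below its threshold. Every other case, including
    # equality (not addressed by the claim), yields False.
    return (temperature_count > mean_fault_count
            and fault_std_dev < std_dev_threshold)
```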

3. The method of claim 2, wherein:

the GPU status parameters collected from the GPU node comprise a GPU model, temperature, and usage status; and

the obtaining mean status fault parameters by use of a pre-configured statistical model comprises:

determining, based on the usage status, whether the GPU malfunctions;

in response to a determination that the GPU does not malfunction, based upon the GPU model, storing the temperature collected from the GPU node in an information storage space corresponding to the GPU model; and

in response to a determination that the GPU malfunctions, based on the GPU model, obtaining temperatures stored in an information storage space corresponding to the GPU model, and based on the temperature collected from the GPU node and the stored temperatures obtained from the storage space corresponding to the GPU model, performing statistics to compute an arithmetic mean temperature fault count and a temperature fault standard deviation, by use of a pre-configured temperature statistical model.
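
The store-or-compute branching of claim 3 might look like the sketch below. `TemperatureStore` and `record` are hypothetical names. The claim does not spell out how a mean "fault count" is derived from temperatures, so this sketch simply takes the arithmetic mean and the population standard deviation of the values for the model, which is an assumption.

```python
from collections import defaultdict
from statistics import fmean, pstdev
from typing import DefaultDict, List, Optional, Tuple


class TemperatureStore:
    """Keeps collected temperatures in one list per GPU model."""

    def __init__(self) -> None:
        self._by_model: DefaultDict[str, List[float]] = defaultdict(list)

    def record(
        self, model: str, temperature: float, malfunctioning: bool
    ) -> Optional[Tuple[float, float]]:
        """Store a healthy reading, or return (mean, std dev) on a fault."""
        if not malfunctioning:
            # No fault: keep the reading in the space for this GPU model.
            self._by_model[model].append(temperature)
            return None
        # Fault: combine the new reading with the stored ones for this model.
        values = self._by_model[model] + [temperature]
        return fmean(values), pstdev(values)
```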

4. The method of claim 3, wherein the pre-configured temperature statistical model comprises:

a mean temperature fault count model configured to compute an arithmetic mean based on the GPU temperature collected from the GPU node and the stored GPU temperatures obtained from the information storage space corresponding to the GPU model; and

a temperature fault standard deviation model configured to compute a temperature fault standard deviation based on the GPU temperature collected from the GPU node and the stored GPU temperatures obtained from the information storage space corresponding to the GPU model, and the mean temperature fault count.
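
Read literally, the two models of claim 4 are an arithmetic mean and a standard deviation taken about that mean. With x_1, ..., x_{n-1} denoting the stored values and x_n the value collected at the fault (symbols introduced here for illustration only), one way to write them is shown below. The claim does not say whether a population (1/n) or sample (1/(n-1)) form is intended; the population form is shown as an assumption.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
```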

5. The method of claim 1, wherein the status parameter is power consumption, and comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction comprises:

performing statistics to obtain a power consumption count of a number of times the GPU incurs power consumption greater than a pre-determined power consumption threshold and a GPU power consumption standard deviation threshold;

comparing the obtained power consumption count with a mean power consumption fault count;

comparing a GPU power consumption fault standard deviation with the GPU power consumption standard deviation threshold;

if the power consumption count is greater than the mean power consumption fault count, and if the GPU power consumption fault standard deviation is less than the GPU power consumption standard deviation threshold, determining that the GPU is to malfunction; and

if the power consumption count is less than the mean power consumption fault count, or if the GPU power consumption fault standard deviation is greater than the GPU power consumption standard deviation threshold, determining that the GPU is not to malfunction.
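
For claim 5, the "performing statistics" step can be read as counting how many collected readings exceed the power consumption threshold before applying the two-condition rule. The sketch below follows that reading; the function names, the watt unit, and the treatment of equality as "not to malfunction" are assumptions, not part of the claim.

```python
from typing import Sequence


def count_exceedances(samples_w: Sequence[float], threshold_w: float) -> int:
    """Count readings strictly above the pre-determined power threshold."""
    return sum(1 for p in samples_w if p > threshold_w)


def predict_power_fault(
    samples_w: Sequence[float],
    threshold_w: float,
    mean_fault_count: float,
    fault_std_dev: float,
    std_dev_threshold: float,
) -> bool:
    """Apply the claim-5 rule to a series of power readings."""
    count = count_exceedances(samples_w, threshold_w)
    # Predict a malfunction only when both conditions hold.
    return count > mean_fault_count and fault_std_dev < std_dev_threshold
```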

6. The method of claim 5, wherein:

the GPU status parameters collected from the GPU node comprise a GPU model, power consumption, and usage status; and

the obtaining mean status fault parameters by use of a pre-configured statistical model comprises:

determining, based on the usage status, whether the GPU malfunctions;

in response to a determination that the GPU does not malfunction, based upon the GPU model, storing the power consumption collected from the GPU node in an information storage space corresponding to the GPU model; and

in response to a determination that the GPU malfunctions, based on the GPU model, obtaining power consumption stored in an information storage space corresponding to the GPU model, and based on the power consumption collected from the GPU node and the stored power consumption obtained from the storage space corresponding to the GPU model, performing statistics to compute an arithmetic mean power consumption fault count and a power consumption fault standard deviation, by use of a pre-configured power consumption statistical model.

7. The method of claim 6, wherein the pre-configured power consumption statistical model comprises:

a mean power consumption fault count model configured to compute an arithmetic mean based on the GPU power consumption collected from the GPU node and the stored GPU power consumption obtained from the information storage space corresponding to the GPU model; and

a power consumption fault standard deviation model configured to compute a power consumption fault standard deviation based on the GPU power consumption collected from the GPU node and the stored GPU power consumption obtained from the information storage space corresponding to the GPU model, and the mean power consumption fault count.

8. The method of claim 1, wherein the status parameter is usage duration, and comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction comprises:

comparing the obtained GPU usage duration with a mean fault usage duration;

if the GPU usage duration is greater than the mean fault usage duration, determining the GPU is to malfunction; and

if the GPU usage duration is less than the mean fault usage duration, determining the GPU is not to malfunction.

9. The method of claim 8, wherein:

the GPU status parameters collected from the GPU node comprise a GPU model, usage duration, and usage status; and

the obtaining mean status fault parameters by use of a pre-configured statistical model comprises:

determining, based on the usage status, whether the GPU malfunctions;

in response to a determination that the GPU does not malfunction, based upon the GPU model, storing the usage duration collected from the GPU node in an information storage space corresponding to the GPU model; and

in response to a determination that the GPU malfunctions, based on the GPU model, obtaining usage duration stored in an information storage space corresponding to the GPU model, and based on the usage duration collected from the GPU node and the stored usage duration obtained from the storage space corresponding to the GPU model, performing statistics to compute an arithmetic mean fault usage duration, by use of a pre-configured usage duration statistical model.

10. The method of claim 9, wherein the pre-configured usage duration statistical model comprises a mean fault usage duration model configured to compute an arithmetic mean based on the GPU usage duration collected from the GPU node and the stored GPU usage duration obtained from the information storage space corresponding to the GPU model.
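
Claims 8 to 10 together describe computing a mean fault usage duration and comparing a GPU's current usage duration against it. A minimal sketch follows, with hypothetical function names and hours as an assumed unit. The claims do not cover the case where the two durations are equal; it is treated here as "not to malfunction".

```python
from statistics import fmean
from typing import Sequence


def mean_fault_usage_duration(
    stored_hours: Sequence[float], collected_hours: float
) -> float:
    """Arithmetic mean over the stored durations plus the newly collected one."""
    return fmean([*stored_hours, collected_hours])


def predict_usage_fault(usage_hours: float, mean_fault_hours: float) -> bool:
    """True when the usage duration exceeds the mean fault usage duration."""
    return usage_hours > mean_fault_hours
```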

11. An apparatus for predicting GPU malfunctions, the apparatus comprising:

a processor; and

a non-transitory computer-readable medium operably coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:

an installation module configured to install a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period;

a collecting module configured to obtain the GPU status parameters from the GPU node; and

a processing module configured to compare the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction, wherein the mean status fault parameters are obtained by use of a pre-configured statistical model.

12. The apparatus of claim 11, wherein the status parameter is temperature, and the processing module comprises:

a first statistical module configured to perform statistics to obtain a temperature count of a number of times the GPU incurs a temperature greater than a pre-determined temperature threshold and a GPU temperature standard deviation threshold;

a first comparison module configured to compare the obtained temperature count with a mean temperature fault count and to compare a GPU temperature fault standard deviation with the GPU temperature standard deviation threshold;

a first determination module configured to, if the temperature count is greater than the mean temperature fault count, and if the GPU temperature fault standard deviation is less than the GPU temperature standard deviation threshold, determine that the GPU is to malfunction; and

a second determination module configured to, if the temperature count is less than the mean temperature fault count, or if the GPU temperature fault standard deviation is greater than the GPU temperature standard deviation threshold, determine that the GPU is not to malfunction.

13. The apparatus of claim 12, wherein after the installation module installs the daemon program on the GPU node, the daemon program periodically collects the GPU model and GPU status parameters corresponding to the GPU node at a pre-determined time period; wherein the collecting module comprises a first collecting module configured to collect from the GPU node a GPU model, temperature, and usage status; and wherein the processing module further comprises:

a first decision module configured to decide, based on the usage status, whether the GPU malfunctions;

a first storing module configured to, in response to a determination that the GPU does not malfunction, based upon the GPU model, store the temperature collected from the GPU node in an information storage space corresponding to the GPU model; and

a first computing module configured to, in response to a determination that the GPU malfunctions, based on the GPU model, obtain temperatures stored in an information storage space corresponding to the GPU model, and based on the temperature collected from the GPU node and the stored temperatures obtained from the storage space corresponding to the GPU model, perform statistics to compute an arithmetic mean temperature fault count and a temperature fault standard deviation, by use of a pre-configured temperature statistical model.

14. The apparatus of claim 13, wherein the pre-configured temperature statistical model comprises:

a mean temperature fault count model configured to compute an arithmetic mean based on the GPU temperature collected from the GPU node and the stored GPU temperatures obtained from the information storage space corresponding to the GPU model; and

a temperature fault standard deviation model configured to compute a temperature fault standard deviation based on the GPU temperature collected from the GPU node and the stored GPU temperatures obtained from the information storage space corresponding to the GPU model, and the mean temperature fault count.

15. The apparatus of claim 11, wherein the status parameter is power consumption, and wherein the processing module comprises:

a second statistical module configured to perform statistics to obtain a power consumption count of a number of times the GPU incurs power consumption greater than a pre-determined power consumption threshold and a GPU power consumption standard deviation threshold;

a second comparison module configured to compare the obtained power consumption count with a mean power consumption fault count and to compare a GPU power consumption fault standard deviation with the GPU power consumption standard deviation threshold;

a third determination module configured to, if the power consumption count is greater than the mean power consumption fault count, and if the GPU power consumption fault standard deviation is less than the GPU power consumption standard deviation threshold, determine that the GPU is to malfunction; and

a fourth determination module configured to, if the power consumption count is less than the mean power consumption fault count, or if the GPU power consumption fault standard deviation is greater than the GPU power consumption standard deviation threshold, determine that the GPU is not to malfunction.

16. The apparatus of claim 15, wherein, after the installation module installs the daemon program on the GPU node, the daemon program periodically collects the GPU model and GPU status parameters corresponding to the GPU node at a pre-determined time period; wherein the collecting module comprises a second collecting module configured to collect from the GPU node a GPU model, power consumption, and usage status; and wherein the processing module further comprises:

a second decision module configured to, based on the usage status, decide whether the GPU malfunctions;

a second storing module configured to, in response to a determination that the GPU does not malfunction, based upon the GPU model, store the power consumption collected from the GPU node in an information storage space corresponding to the GPU model; and

a second computing module configured to, in response to a determination that the GPU malfunctions, based on the GPU model, obtain power consumption stored in an information storage space corresponding to the GPU model, and based on the power consumption collected from the GPU node and the stored power consumption obtained from the storage space corresponding to the GPU model, perform statistics to compute an arithmetic mean power consumption fault count and a power consumption fault standard deviation, by use of a pre-configured power consumption statistical model.

17. The apparatus of claim 16, wherein the pre-configured power consumption statistical model comprises:

a mean power consumption fault count model configured to compute an arithmetic mean based on the GPU power consumption collected from the GPU node and the stored GPU power consumption obtained from the information storage space corresponding to the GPU model; and

a power consumption fault standard deviation model configured to compute a power consumption fault standard deviation based on the GPU power consumption collected from the GPU node and the stored GPU power consumption obtained from the information storage space corresponding to the GPU model, and the mean power consumption fault count.

18. The apparatus of claim 11, wherein the status parameter is usage duration; and wherein the processing module comprises:

a third comparison module configured to compare the obtained GPU usage duration with a mean fault usage duration;

a fifth determination module configured to, if the GPU usage duration is greater than the mean fault usage duration, determine the GPU is to malfunction; and

a sixth determination module configured to, if the GPU usage duration is less than the mean fault usage duration, determine the GPU is not to malfunction.

19. The apparatus of claim 18, wherein, after the installation module installs the daemon program on the GPU node, the daemon program periodically collects the GPU model and GPU status parameters corresponding to the GPU node at a pre-determined time period; wherein the collecting module comprises a third collecting module configured to collect from the GPU node a GPU model, usage duration, and usage status; and wherein the processing module further comprises:

a third decision module configured to, based on the usage status, decide whether the GPU malfunctions;

a third storing module configured to, in response to a determination that the GPU does not malfunction, based upon the GPU model, store the usage duration collected from the GPU node in an information storage space corresponding to the GPU model; and

a third computing module configured to, in response to a determination that the GPU malfunctions, based on the GPU model, obtain usage duration stored in an information storage space corresponding to the GPU model, and based on the usage duration collected from the GPU node and the stored usage duration obtained from the storage space corresponding to the GPU model, perform statistics to compute an arithmetic mean fault usage duration, by use of a pre-configured usage duration statistical model.

20. The apparatus of claim 19, wherein the pre-configured usage duration statistical model comprises a mean fault usage duration model configured to compute an arithmetic mean based on the GPU usage duration collected from the GPU node and the stored GPU usage duration obtained from the information storage space corresponding to the GPU model.