WO2016138375 - METHOD AND APPARATUS FOR PREDICTING GPU MALFUNCTIONS

Publication Number WO/2016/138375
Publication Date 01.09.2016
International Application No. PCT/US2016/019765
International Filing Date 26.02.2016
IPC
G06F 11/07 2006.01
G PHYSICS
06 COMPUTING; CALCULATING OR COUNTING
F ELECTRIC DIGITAL DATA PROCESSING
11 Error detection; Error correction; Monitoring
07 Responding to the occurrence of a fault, e.g. fault tolerance
CPC
G06F 11/008
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
F ELECTRIC DIGITAL DATA PROCESSING
11 Error detection; Error correction; Monitoring
008 Reliability or availability analysis
G06F 11/0721
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
F ELECTRIC DIGITAL DATA PROCESSING
11 Error detection; Error correction; Monitoring
07 Responding to the occurrence of a fault, e.g. fault tolerance
0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
0706 the processing taking place on a specific hardware platform or in a specific software environment
0721 within a central processing unit [CPU]
G06F 11/076
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
F ELECTRIC DIGITAL DATA PROCESSING
11 Error detection; Error correction; Monitoring
07 Responding to the occurrence of a fault, e.g. fault tolerance
0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
0751 Error or fault detection not based on redundancy
0754 by exceeding limits
076 by exceeding a count or rate limit, e.g. word- or bit count limit
G06F 11/079
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
F ELECTRIC DIGITAL DATA PROCESSING
11 Error detection; Error correction; Monitoring
07 Responding to the occurrence of a fault, e.g. fault tolerance
0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
079 Root cause analysis, i.e. error or fault diagnosis
G06T 1/20
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
1 General purpose image data processing
20 Processor architectures; Processor configuration, e.g. pipelining
G06T 2200/28
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
2200 Indexing scheme for image data processing or generation, in general
28 involving image processing hardware
Applicants
  • ALIBABA GROUP HOLDING LIMITED
Inventors
  • FEI, Hui
Agents
  • MURABITO, Anthony, C.
Priority Data
201510088768.3 26.02.2015 CN
Publication Language English (EN)
Filing Language English (EN)
Designated States
Title
(EN) METHOD AND APPARATUS FOR PREDICTING GPU MALFUNCTIONS
(FR) PROCÉDÉ ET APPAREIL DE PRÉDICTION DE DYSFONCTIONNEMENTS DE GPU
Abstract
(EN)
A method of predicting GPU malfunctions includes installing a daemon program at a GPU node; the daemon program periodically collects GPU status parameters for that node at a predetermined interval. The method also includes obtaining the GPU status parameters from the GPU node and comparing them with mean status fault parameters to determine whether the GPU is about to malfunction, where the mean status fault parameters are obtained from a pre-configured statistical model. Before a GPU enters a malfunction state, it can be replaced, or the programs executing on it can be migrated to other GPUs, without affecting normal business operations.
(FR)
La présente invention concerne un procédé de prédiction de dysfonctionnements de GPU consistant à installer un programme fantôme au niveau d'un nœud de GPU, le programme fantôme collectant périodiquement des paramètres d'état de GPU correspondant au nœud de GPU sur une période de temps prédéfinie. Le procédé consiste également à obtenir les paramètres d'état de GPU du nœud de GPU et à comparer les paramètres d'état de GPU obtenus avec des paramètres d'anomalie d'état moyens pour déterminer si la GPU présente un dysfonctionnement, les paramètres d'anomalie d'état moyens étant obtenus au moyen d'un modèle statistique préconfiguré. Avant l'entrée de la GPU en état de dysfonctionnement, il est possible de remplacer la GPU, ou de faire migrer les programmes en cours d'exécution sur la GPU vers d'autres GPU à des fins d'exécution, sans affecter les opérations commerciales normales.
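Illustrative sketch (not part of the published record): the collect-and-compare flow described in the abstract could look roughly like the Python below. Everything specific here is an assumption made for illustration. The parameter names (temperature_c, power_w), the function names, the use of a plain arithmetic mean as the "statistical model", and the rule that a reading at or above the mean fault value counts as a warning are not taken from the application, which does not state them in the abstract.

```python
import time
from statistics import mean


def mean_fault_parameters(fault_samples):
    """Average each status parameter over samples taken from GPUs that
    later malfunctioned. A plain mean stands in for the 'pre-configured
    statistical model' (assumption)."""
    keys = fault_samples[0].keys()
    return {key: mean(sample[key] for sample in fault_samples) for key in keys}


def predict_malfunction(status, mean_fault):
    """Return the names of parameters whose current value has reached or
    passed the mean fault value. The >= rule is an assumed comparison."""
    return [
        name
        for name, value in status.items()
        if name in mean_fault and value >= mean_fault[name]
    ]


def collect_status(read_status, period_s, rounds, sleep=time.sleep):
    """Daemon-style loop: call read_status() once per period.
    read_status is a hypothetical callable supplied by the caller; it
    would wrap whatever tool actually queries the GPU node."""
    samples = []
    for i in range(rounds):
        samples.append(read_status())
        if i < rounds - 1:
            sleep(period_s)  # wait out the predetermined interval
    return samples
```

In this sketch a non-empty list from predict_malfunction would be the signal to replace the GPU or migrate its workloads before it fails. The sleep function is injectable so the loop can be exercised without real waiting.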