1. (WO2018219163) MAPREDUCE-BASED DISTRIBUTED CLUSTER PROCESSING METHOD FOR LARGE-SCALE DATA

Pub. No.: WO/2018/219163
International Application No.: PCT/CN2018/087567
Publication Date: Fri Dec 07 00:59:59 CET 2018
International Filing Date: Sat May 19 01:59:59 CEST 2018
IPC: G06F 17/30
Applicants: NORTHEASTERN UNIVERSITY
东北大学
Inventors: GAO, Tianhan
高天寒
KONG, Xue
孔雪
Title: MAPREDUCE-BASED DISTRIBUTED CLUSTER PROCESSING METHOD FOR LARGE-SCALE DATA
Abstract:
The present invention provides a MapReduce-based distributed cluster processing method for large-scale data. The method comprises: sampling the large-scale data according to an equal-proportion, non-repetition principle; inputting the sampled data into a MapReduce distributed parallel framework and calculating the local density of the sampled data and their average density; taking all sampled data whose local density is greater than the average density as the candidate point set of initial cluster center points, and feeding the candidate point set back to a master node, wherein adjacent candidate points that lie farther apart than twice a set range are selected as the initial cluster center points; performing the clustering task in parallel on the MapReduce distributed parallel framework, wherein an average value of the distance between the data is calculated for each cluster in order to update the cluster center points; the child nodes applying a sum-of-squared-errors criterion function to determine whether to continue iterating; and the child nodes clustering the large-scale data according to the cluster center points. The invention thereby implements parallel clustering, reducing the number of clustering iterations while increasing clustering accuracy and the efficiency of parallel clustering.
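The steps in the abstract can be illustrated with a minimal single-process Python sketch. It is not the patented implementation: the map and reduce phases are ordinary functions rather than jobs on a MapReduce cluster, and all function names (sample_points, local_densities, initial_centers, map_assign, reduce_update, sum_squared_error, cluster) are hypothetical. The sketch further assumes that local density means the number of other points within a fixed radius, that candidates are examined densest first and kept only if they lie more than 2 * radius from every center already chosen, and that a center is updated to the mean of its assigned points as in standard k-means. The abstract does not spell out any of these details.

```python
import math
import random
from collections import defaultdict

def sample_points(data, fraction, seed=0):
    """Draw a fixed fraction of the points without repetition."""
    k = max(1, int(len(data) * fraction))
    return random.Random(seed).sample(data, k)

def local_densities(points, radius):
    """Assumed density measure: number of other points within `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

def initial_centers(points, radius):
    """Choose above-average-density points lying more than 2 * radius apart."""
    dens = local_densities(points, radius)
    average = sum(dens) / len(dens)
    # Candidate set: local density greater than the average, densest first.
    ranked = sorted(zip(dens, points), key=lambda t: -t[0])
    candidates = [p for d, p in ranked if d > average]
    centers = []
    for p in candidates:  # greedy selection (an assumption of this sketch)
        if all(math.dist(p, c) > 2 * radius for c in centers):
            centers.append(p)
    return centers

def map_assign(points, centers):
    """Map step: emit (index of nearest center, point) pairs."""
    for p in points:
        yield min(range(len(centers)), key=lambda i: math.dist(p, centers[i])), p

def reduce_update(pairs, centers):
    """Reduce step: move each center to the mean of its assigned points."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    return [
        tuple(sum(axis) / len(groups[i]) for axis in zip(*groups[i])) if groups[i] else c
        for i, c in enumerate(centers)
    ]

def sum_squared_error(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def cluster(data, centers, tol=1e-9, max_iter=100):
    """Iterate map and reduce until the error sum of squares stops changing."""
    previous = math.inf
    for _ in range(max_iter):
        centers = reduce_update(map_assign(data, centers), centers)
        sse = sum_squared_error(data, centers)
        if abs(previous - sse) < tol:
            break
        previous = sse
    return centers
```

A typical call sequence under these assumptions would be cluster(data, initial_centers(sample_points(data, fraction), radius)): the initial centers come from the sample, while the final clustering runs over the full data set. Seeding from dense, well-separated points is what the abstract credits with reducing the number of iterations compared with random initialization.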