Large-scale automatic k-means clustering for heterogeneous many-core supercomputer

This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the hierarchical parallelism of both the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design can process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering: clustering tasks are automatically generated and executed for a set of candidate hyper-parameters, and the optimal hyper-parameter is then determined using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the k-means algorithm to new application areas.
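The hyper-parameter determination loop described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not specify the proposed evaluation method, so the mean silhouette score is used here as a stand-in, and the farthest-point initialisation and the names `kmeans`, `silhouette`, and `auto_kmeans` are assumptions made for illustration. None of the Sunway-specific multilevel partitioning is modelled.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_farthest_point(points, k):
    """Deterministic seeding (an assumption): start from the first point,
    then repeatedly add the point farthest from its nearest centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = init_farthest_point(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette score; a stand-in for the paper's evaluation method."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def auto_kmeans(points, candidate_ks):
    """Run k-means once per candidate k and keep the best-scoring k."""
    scores = {}
    for k in candidate_ks:
        labels, _ = kmeans(points, k)
        scores[k] = silhouette(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Calling `auto_kmeans(points, [2, 3, 4, 5])` on a list of coordinate tuples returns the selected k together with the score of every candidate. Because each candidate run is independent, the runs could in principle be dispatched in parallel, which is the property an automatic sweep on a large machine would rely on.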

Citation

Yu, T, Zhao, W, Liu, P, Janjic, V, Yan, X, Wang, S, Fu, H, Yang, G & Thomson, J D 2020, 'Large-scale automatic k-means clustering for heterogeneous many-core supercomputer', IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 5, pp. 997-1008. https://doi.org/10.1109/TPDS.2019.2955467

Publication

IEEE Transactions on Parallel and Distributed Systems

Status

Peer reviewed

DOI

https://doi.org/10.1109/TPDS.2019.2955467

ISSN

1045-9219

Type

Journal article

Description

Funding: UK EPSRC grants "Discovery" EP/P020631/1, "ABC: Adaptive Brokerage for the Cloud" EP/R010528/1.

Collections

University of St Andrews Research

URI

https://hdl.handle.net/10023/19096