A Fast Distributed K-Means Algorithm for the Approximate Aggregation of a Distributed Dataset

Abstract In this paper we give a fast algorithm to solve the distributed K-means problem. This problem is as follows: we have a group of machines that each has a dataset of points in a real vector space. The goal is to choose a set of K vectors that (approximately) minimises the classic K-means objective function relative to the union of all the datasets. We consider the broadcast model - in which, at any time, one machine is broadcasting data to all others. We give a distributed algorithm that needs only to broadcast a small amount of data. The amount of data which needs to be broadcast by our algorithm is far smaller than the previous state of the art, whilst our theoretical guarantee of the quality of our solution is essentially the same. In addition to our theoretical performance guarantees we perform experiments with very promising preliminary results.
Authors
  • Stephen Pasteris (UCL)
  • Ting He (PSU)
  • Mark Herbster (UCL)
  • Richard Tomsett (IBM UK)
Date Sep-2020
Venue 4th Annual Fall Meeting of the DAIS ITA, 2020