When dealing with lots of data points, clustering algorithms may be used to group them. The k-means algorithm partitions n data points into k clusters and finds the centroids of these clusters incrementally.
The algorithm assigns data points to the closest cluster, and the centroids of each cluster are re-calculated. These steps are repeated until the centroids do not changing anymore.
The basic k-means algorithm is initialized with k centroids at random positions. This implementation addresses some disadvantages of the arbitrary initialization method with the k-means++ algorithm (see "Further reading" at the end).
Install kmpp as Node.js module via NPM:
$ npm install kmppvar kmpp = require('kmpp');
kmpp([
  [x1, y1, ...],
  [x2, y2, ...],
  [x3, y3, ...],
  ...
], {
  k: 4
});
// =>
// { converged: true,
//   centroids: [[xm1, ym1, ...], [xm2, ym2, ...], [xm3, ym3, ...]],
//   counts: [ 7, 6, 7 ],
//   assignments: [ 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 ]
// }Exectes the k-means++ algorithm on points.
Arguments:
points(Array): An array-of-arrays containing the points in format[[x1, y1, ...], [x2, y2, ...], [x3, y3, ...], ...]opts: object containing configuration parameters. Parameters aredistance(function): Optional function that takes two points and returns the distance between them.initialize(Boolean): Perform initialization. If false, uses the initial state provided incentroidsandassignments. Otherwise discards any initial state and performs initialization.k(Number): number of centroids. If not provided,sqrt(n / 2)is used, wherenis the number of points.kmpp(Boolean, default:true): If true, uses k-means++ initialization. Otherwise uses naive random assignment.maxIterations(Number, default:100): Maximum allowed number of iterations.norm(Number, default:2): L-norm used for distance computation.1is Manhattan norm,2is Euclidean norm. Ignored ifdistancefunction is provided.centroids(Array): An array of centroids. Ifinitializeis false, used as initialization for the algorithm, otherwise overwritten in-place if of the correct size.assignments(Array): An array of assignments. Used for initialization, otherwise overwritten.counts(Array): An output array used to avoid extra allocation. Values are discarded and overwritten.
Returns an object containing information about the centroids and point assignments. Values are:
converged:trueif the algorithm converged successfullycentroids: a list of centroidscounts: the number of points assigned to each respective centroidassignments: a list of integer assignments of each point to the respective centroiditerations: number of iterations used
- 
Jared Harkins improved the performance by reducing the amount of function calls, reverting to Manhattan distance for measurements and improved the random initialization by choosing from points
 - 
Ricky Reusser refactored API
 
- Wikipedia: k-means clustering
 - Wikipedia: Determining the number of clusters in a data set
 - k-means++: The advantages of careful seeding, Arthur Vassilvitskii
 - k-means++: The advantages of careful seeding, Presentation by Arthur Vassilvitskii (Presentation)
 
© 2017-2019. MIT License.