geokmeans provides fast C++ implementations of several k-means clustering algorithms behind a single, uniform R interface. The headline method is Geometric-k-means, a bound-free algorithm that uses geometry (scalar projection) to focus distance computations only on the points that can actually change cluster membership, while skipping the rest. It returns the same clustering as Lloyd’s k-means when initialized identically, but typically with far fewer distance computations, less memory, and lower energy use (Sharma et al., 2026, ).
The package exposes seven algorithms:
| Function | Algorithm | Type |
|---|---|---|
geo_kmeans() |
Geometric-k-means | bound-free |
ball_kmeans() |
Ball k-means++ | bound-free |
lloyd_kmeans() |
Lloyd’s k-means | baseline |
elkan_kmeans() |
Elkan | bounded |
hamerly_kmeans() |
Hamerly | bounded |
annulus_kmeans() |
Annulus | bounded |
exponion_kmeans() |
Exponion | bounded |
All share the same interface and return value; they differ only in the acceleration strategy used internally.
Every function takes a numeric matrix (observations in rows, features
in columns) and centers, which is either the number of
clusters or a matrix of initial centroids.
X <- rbind(
matrix(rnorm(200, mean = 0), ncol = 2),
matrix(rnorm(200, mean = 6), ncol = 2)
)
fit <- geo_kmeans(X, centers = 2)
fit
#> k-means clustering (geokmeans) with 2 clusters
#> Iterations: 2 | Distance computations: 601
#> Cluster sizes: 100, 100
#> Centroids:
#> [,1] [,2]
#> [1,] 6.0297 6.0516
#> [2,] 0.1089 -0.0378The result is a geokmeans object: a list with the final
centroids, the per-point cluster assignment, and computational
statistics.
str(fit)
#> List of 6
#> $ centroids : num [1:2, 1:2] 6.0297 0.1089 6.0516 -0.0378
#> $ iterations : int 2
#> $ distance_calculations: num 601
#> $ cluster : int [1:200] 2 2 2 2 2 2 2 2 2 2 ...
#> $ method : chr "geokmeans"
#> $ k : int 2
#> - attr(*, "class")= chr "geokmeans"plot(X, col = fit$cluster, pch = 19, cex = 0.6,
xlab = "x1", ylab = "x2", main = "geo_kmeans result")
points(fit$centroids, pch = 8, cex = 2, lwd = 2)Because every variant returns the same partition (they are all
exact), the choice is about speed, not quality. Call a function
directly, or use the kmeans_dc() dispatcher with
method:
kmeans_dc(X, centers = 2, method = "elkan")$centroids
#> [,1] [,2]
#> [1,] 0.1088874 -0.03780808
#> [2,] 6.0296755 6.05160093The algorithms differ in how many point-to-centroid distances they compute. On the same data and seed, they reach the same solution with very different amounts of work:
set.seed(42)
Y <- do.call(rbind, lapply(seq_len(6), function(i)
matrix(rnorm(500, mean = 4 * i), ncol = 2)))
methods <- c("lloyd", "hamerly", "annulus", "exponion",
"ball", "geokmeans")
comparison <- data.frame(
method = methods,
distance_calcs = vapply(methods, function(m) {
kmeans_dc(Y, centers = 6, method = m, seed = 1)$distance_calculations
}, numeric(1)),
row.names = NULL
)
comparison[order(comparison$distance_calcs), ]
#> method distance_calcs
#> 4 exponion 16789
#> 6 geokmeans 16954
#> 5 ball 17414
#> 3 annulus 19458
#> 2 hamerly 21241
#> 1 lloyd 54000Lower is better. The relative savings grow with the number of observations and clusters.
For a numeric centers, the starting centroids are chosen
by init:
"random" (default) – random observations, drawn with
R’s RNG;"sequential" – the first k
observations.Because initialization uses R’s random number generator, results are
reproducible. Passing a non-NULL seed calls
set.seed() internally so a call is self-contained and
repeatable; the default seed = NULL leaves the RNG
untouched and honours a preceding set.seed().
a <- geo_kmeans(X, centers = 2, seed = 7)
b <- geo_kmeans(X, centers = 2, seed = 7)
identical(a$centroids, b$centroids)
#> [1] TRUEYou can also supply your own starting centroids as a matrix
(mirroring stats::kmeans()):
Two small datasets ship with the package under
extdata:
You cannot form more non-empty clusters than there are distinct observations. Requesting too many is an error with an explanatory message:
D <- rbind(matrix(0.1, 50, 2), matrix(9, 50, 2)) # only 2 distinct rows
geo_kmeans(D, centers = 3)
#> Error:
#> ! Requested 3 clusters, but 'data' has only 2 distinct row(s).
#> k-means cannot form more non-empty clusters than there are distinct observations.
#> Set 'centers' to at most 2, or provide data with at least 3 distinct rows.If a run leaves a cluster empty, it is dropped by default and the
labels are renumbered (drop_empty = TRUE); set
drop_empty = FALSE to keep the requested number of
centroids. Set with_labels = FALSE to skip computing
per-point assignments when you only need the centroids.