Currently, this package is benchmarked against similar implementations in Python, R, and Julia. All reproducible benchmarks can be found in the [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory.
docs/src/index.md
## Motivation

It's actually a funny story that led to the development of this package.

What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.

Say hello to `ParallelKMeans`!
## Installation
If you are using Julia in the recommended [Juno IDE](https://junolab.org/), the number of threads is already set to the number of available CPU cores, so multithreading is enabled out of the box.
For other IDEs, multithreading must be enabled by exporting the relevant environment variable before launching the Julia REPL from the command line.
*TIP*: One needs to navigate to, or point to, the Julia executable file to be able to launch it from the command line.
Enable multithreading on Mac/Linux systems via:

```bash
export JULIA_NUM_THREADS=n  # where n is the number of threads/cores
```
For Windows systems:

```bash
set JULIA_NUM_THREADS=n  # where n is the number of threads/cores
```
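To confirm the setting took effect, the active thread count can be queried from within a fresh Julia session; a quick check (assuming `n` was exported as above):

```julia
julia> Threads.nthreads()  # should report the value of n
```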
You can grab the latest stable version of this package from the Julia registries by simply running:
*NB:* Don't forget to invoke Julia's package manager with `]`
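Concretely, the installation step looks like the following in the REPL's package mode (a sketch, assuming the package is registered under the name `ParallelKMeans`):

```julia
julia> ] add ParallelKMeans
```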
- [X] Implementation of the [Hamerly algorithm](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
- [X] Interface for inclusion in the Alan Turing Institute's [MLJModels](https://github.com/alan-turing-institute/MLJModels.jl#who-is-this-repo-for).
- [X] Full implementation of the triangle inequality based on [Elkan - 2003 "Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
- [X] Implementation of [Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf).
- [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
- [ ] Support for other distance metrics supported by [Distances.jl](https://github.com/JuliaStats/Distances.jl#supported-distances).
- [ ] Support for MLJ's random generation hyperparameter.
- [ ] Native support for tabular data inputs outside of MLJModels' interface.
- [ ] Refactoring and finalization of API design.
- [ ] GPU support.
```julia
r.iterations  # number of elapsed iterations
r.converged   # whether the procedure converged
```
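For context, result fields such as `iterations` and `converged` come from a complete `kmeans` call. A minimal end-to-end sketch, assuming the convention that observations are stored as columns of the design matrix and that `max_iters` and `tol` keyword arguments are supported:

```julia
using ParallelKMeans

# toy design matrix: 5 features × 100 observations
X = rand(5, 100)

# cluster into 3 groups; keyword values shown are illustrative
r = kmeans(X, 3; max_iters=300, tol=1e-6)

r.iterations  # number of elapsed iterations
r.converged   # whether the procedure converged
```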
### Supported KMeans algorithm variations and recommended use cases
- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf) - Default algorithm, but only recommended for very small matrices (switch to `n_threads = 1` to avoid overhead).
- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster) - Useful in most cases. If uncertain about your use case, try this!
- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - Recommended for high-dimensional data.
- [Yinyang()](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf) - An excellent choice for most cases; a Swiss Army knife for many use cases.
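Per the list above, a variation is selected by passing an algorithm instance as the first argument to `kmeans`; a sketch assuming the same column-major data convention:

```julia
using ParallelKMeans

X = rand(10, 1_000)

# Hamerly is a sensible default when uncertain about the use case
r = kmeans(Hamerly(), X, 4)
```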