This package is based on the following paper - [SpectralNet](https://openreview.

## Installation

### From PyPI

```bash
pip install spectralnet
```

### From source (with pixi)

[pixi](https://pixi.sh) is the recommended way to set up a fully reproducible
development environment after cloning the repo.

```bash
# 1. Install pixi (once, system-wide)
curl -fsSL https://pixi.sh/install.sh | sh

# 2. Clone and enter the repo
git clone https://github.com/shaham-lab/SpectralNet.git
cd SpectralNet

# 3. Install all dependencies (conda + PyPI) into an isolated environment
pixi install

# 4. Run the test suite to verify everything works
pixi run test
```

After `pixi install` you can prefix any command with `pixi run` to execute it
inside the managed environment, or activate the environment with:

```bash
pixi shell
```

## Usage

### Clustering: small datasets (in-memory tensor)

For datasets that fit in RAM, pass a `torch.Tensor` directly:

```python
from spectralnet import SpectralNet

spectralnet = SpectralNet(n_clusters=10)
spectralnet.fit(X)  # X: torch.Tensor of shape (N, ...)
cluster_assignments = spectralnet.predict(X)
```

To measure ACC and NMI when labels are available:

```python
from spectralnet import SpectralNet, Metrics

spectralnet = SpectralNet(n_clusters=2)
spectralnet.fit(X, y)  # y: integer label tensor
cluster_assignments = spectralnet.predict(X)

y_np = y.detach().cpu().numpy()
acc_score = Metrics.acc_score(cluster_assignments, y_np, n_clusters=2)
nmi_score = Metrics.nmi_score(cluster_assignments, y_np)
print(f"ACC: {acc_score:.3f}  NMI: {nmi_score:.3f}")
```

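ACC is subtler than plain accuracy: cluster IDs are arbitrary, so the score must first find the best one-to-one mapping between cluster IDs and ground-truth labels. A minimal sketch of that idea, brute-forcing over label permutations (only feasible for small `n_clusters`; a real implementation like the library's would typically use the Hungarian algorithm):

```python
from itertools import permutations

def clustering_accuracy(assignments, labels, n_clusters):
    """Best accuracy over all one-to-one mappings of cluster IDs to labels."""
    best = 0.0
    for perm in permutations(range(n_clusters)):
        hits = sum(1 for a, y in zip(assignments, labels) if perm[a] == y)
        best = max(best, hits / len(labels))
    return best

# Cluster 0 mostly corresponds to label 1 and vice versa.
assignments = [0, 0, 0, 1, 1, 1]
labels      = [1, 1, 0, 0, 0, 0]
print(clustering_accuracy(assignments, labels, n_clusters=2))  # 5/6 ≈ 0.833
```

The swapped mapping (cluster 0 → label 1, cluster 1 → label 0) matches 5 of the 6 points, so ACC reports 0.833 rather than the naive 0.167.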
### Clustering: large datasets (streaming from disk)

For datasets too large to hold in RAM (e.g. millions of images on disk),
define a `torch.utils.data.Dataset` that loads **one sample at a time**
and pass it to `fit()`. Nothing large ever lives in memory at once; every
trainer pulls mini-batches through its own `DataLoader` internally.

```python
from torch.utils.data import Dataset, DataLoader
from spectralnet import SpectralNet
from PIL import Image
import torchvision.transforms as T
import os

class ImageFolderDataset(Dataset):
    def __init__(self, root):
        self.paths = [
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".jpg")
        ]
        self.transform = T.Compose([T.Resize(64), T.ToTensor(), T.Normalize(0.5, 0.5)])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.transform(Image.open(self.paths[idx]).convert("RGB"))

dataset = ImageFolderDataset("/path/to/images")

spectralnet = SpectralNet(
    n_clusters=10,
    should_use_ae=True,  # compress images before clustering
    ae_hiddens=[2048, 512, 64, 10],
    spectral_hiddens=[512, 512, 10],
)
spectralnet.fit(dataset)

# predict() also accepts a DataLoader for large test sets
test_loader = DataLoader(dataset, batch_size=512, shuffle=False)
cluster_assignments = spectralnet.predict(test_loader)
```

> **Note on Siamese training with large datasets:** the Siamese network
> builds exact k-NN pairs, which requires loading all features into memory.
> For very large datasets either disable it (`should_use_siamese=False`),
> enable approximate neighbours (`siamese_use_approx=True`), or pass a
> representative subset as the Dataset.

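The memory constraint in the note comes from exact k-NN itself: it scores all pairwise distances, so the full feature matrix must be in memory at once. A toy sketch of the subset workaround, using plain Python lists for clarity (a real pipeline would subsample the Dataset and work on `torch` or `numpy` arrays):

```python
import math
import random

def knn_pairs(features, k):
    """Exact k-NN positive pairs: O(N^2) distance computations, so
    `features` must fit in memory -- hence subsampling huge datasets."""
    pairs = []
    for i, a in enumerate(features):
        dists = sorted(
            (math.dist(a, b), j) for j, b in enumerate(features) if j != i
        )
        pairs.extend((i, j) for _, j in dists[:k])
    return pairs

# Representative random subset instead of the full (too large) dataset
random.seed(0)
subset = [[random.random(), random.random()] for _ in range(100)]
pairs = knn_pairs(subset, k=2)
print(len(pairs))  # 100 points x 2 neighbours = 200 pairs
```

Approximate neighbour search (e.g. via `annoy` or `faiss`) avoids the quadratic blow-up entirely, which is what the `siamese_use_approx=True` option trades exactness for.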
### Running examples

To run the model on the two moons or MNIST datasets:

```bash
cd examples
python3 cluster_twomoons.py
python3 cluster_mnist.py
```

<!-- ### Data reduction and visualization
