```
data
+- <dataset>
   +- devices
   +- central
      +- devicesCells
      +- consolidate
      +- results
      +- DBSCAN
   +- config
script
sumData
getCluster
```
- `data`: This directory contains the data to be processed.
- `<dataset>`: For organization, each dataset has its own directory, named `<dataset>`.
- `devices`: Holds the datasets from the devices (CSV files).
- `central`: Used to store all the data in the central node. If it doesn't exist, it is created at run time.
- `devicesCells`: Receives the cells sent by each device. For testing purposes it may also contain raw data.
- `consolidate`: The script gathers all the CSV cell data received in the `devicesCells` directory and creates a single file with all the cells. This directory may contain raw data if the devices sent raw data for testing purposes.
- `results`: Stores the results of the clustering algorithms:
  - CSV files
  - an SVG file, if the clustering dataset has 2 dimensions
- `DBSCAN`: Files created by the DBSCAN algorithm.
- `config`: Stores the CSV configuration files.
All the scripts to run the experiment are stored in the `script` directory.
The main script is `complete.py`, which runs all phases of the process.
Below is its help screen:
```
Options      Description
-h           Show this help
-d <dir>     The data files gathered by devices will be found in the <dir>/devices directory
-e <epsilon> Value of Epsilon (default: 10)
-m <cells>   Minimum Cells (default: 3)
-f <force>   Minimum Force (default: 150)
-r           Don't draw rectangles
-g           Don't draw edges
-p           Draw points
-b           Draw numbers
-x           Don't use prefix on the output files
```
First of all, you need to create a new directory to store the dataset in CSV format.
The CSV files must be stored in the `devices` directory below the `<dataset>` directory you've just created.
The script runs sumData for each CSV file, simulating the process that runs on each device.
You'll also need a configuration file, which by default is looked for in the `<dataset>/config` directory under the name `config-<dataset>.csv`.
This CSV file must have 4 lines:

- Header: the names of the variables.
  - Example: `X,Y,Id,Classification`
- Variable identification:
  - (C)lustered: variables to be clustered
  - (N)ot clustered
  - C(L)assification: the ground-truth label (if it exists), used to test the clustering algorithm.
  - Example: `C,C,N,L`
- Max values: values to be used in the linear normalization. This line must contain the max value of each column.
  - Example: `30,30,0,0`
- Min values: values to be used in the linear normalization. This line must contain the min value of each column.
  - Example: `2.9,3.7,0,0`

Note: all lines must have the same number of columns.
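The 4-line configuration file and the linear normalization it drives can be sketched as follows (function names are illustrative, not part of the actual scripts):

```python
import csv

def load_config(path):
    """Parse the 4-line config-<dataset>.csv: header, variable roles,
    max values, and min values, in that order."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    assert len({len(r) for r in rows}) == 1, "all lines must have the same number of columns"
    header, roles, max_vals, min_vals = rows[0], rows[1], rows[2], rows[3]
    return header, roles, [float(v) for v in max_vals], [float(v) for v in min_vals]

def normalize(value, col, min_vals, max_vals):
    """Linear normalization of one value into [0, 1] using the config's min/max."""
    return (value - min_vals[col]) / (max_vals[col] - min_vals[col])
```

With the example config above, X = 30 normalizes to 1.0 and X = 2.9 to 0.0.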
The sumData program is stored in the `sumData/bin` directory. It summarizes the data stored in `<dataset>/devices/*.csv` and creates a new CSV file with the cells produced by the summarization.

- Input:
  - `<dataset>/devices/*.csv`
- Outputs (complete.py script):
  - `<dataset>/central/devicesCells/cell-<dataset>-<seq>.csv`, where `<seq>` is a sequential number.
  - `<dataset>/central/devicesCells/point-<dataset>-<seq>.csv`. Note: this file is generated only if the `-p` option is used.
- Parameter:
  - `-e`: Epsilon parameter. The default value for Epsilon is 10.
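The summarization step can be illustrated with a sketch. This is an assumption about how sumData works, not its actual code: it supposes points are already normalized to [0, 1] per dimension, that Epsilon is interpreted as the number of grid divisions per dimension, and that each occupied cell keeps a point count and a center of mass:

```python
from collections import defaultdict

def summarize(points, epsilon=10):
    """Hypothetical per-device summarization: bin normalized points into an
    epsilon-per-dimension grid; keep per-cell point count and center of mass."""
    cells = defaultdict(lambda: [0, None])  # cell index -> [count, coordinate sums]
    for p in points:
        # clamp so a coordinate of exactly 1.0 falls in the last cell
        idx = tuple(min(int(x * epsilon), epsilon - 1) for x in p)
        count, sums = cells[idx]
        if sums is None:
            sums = [0.0] * len(p)
        cells[idx] = [count + 1, [s + x for s, x in zip(sums, p)]]
    # center of mass = coordinate sums divided by point count
    return {idx: (c, [s / c for s in sums]) for idx, (c, sums) in cells.items()}
```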
The script gathers all the output files created by sumData and creates one single file.

- Inputs:
  - `<dataset>/central/devicesCells/cell-<dataset>-<seq>.csv`, where `<seq>` is a sequential number.
  - `<dataset>/central/devicesCells/point-<dataset>-<seq>.csv`, if the `-p` option is used.
- Outputs:
  - `<dataset>/central/consolidate/cells-<dataset>.csv`
  - `<dataset>/central/consolidate/points-<dataset>.csv`, if the `-p` option is used.
With the cells consolidated in one single file, the clustering algorithm takes place.
It's important to notice that the complete.py script creates different file names for different parameters; this is useful for remembering which parameters were used when comparing results.
The script creates a prefix for the output files that identifies the Epsilon and the Force used to run the process.
The prefix has the format `ennnfd.dddd`, where `nnn` is the value of Epsilon and `d.dddd` is the value of Force. Example: if Epsilon = 35 and Force = 0.0777, the prefix will be `e035f0.0777`.
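The prefix scheme can be reproduced with a one-line format string (the function name is illustrative):

```python
def make_prefix(epsilon, force):
    """Build the ennnfd.dddd output-file prefix: Epsilon zero-padded
    to 3 digits, Force with 4 decimal places."""
    return "e%03df%.4f" % (epsilon, force)
```

For instance, Epsilon = 14 and Force = 0.15 produce `e014f0.1500`, matching the example shown in the DBSCAN.py help screen below.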
- Inputs:
  - `<dataset>/central/consolidate/cells-<dataset>.csv`
  - `<dataset>/central/consolidate/points-<dataset>.csv`, if the `-p` option is used.
- Outputs (complete.py script):
  - `<dataset>/central/results/<prefix>-cells-<dataset>.csv`
  - `<dataset>/central/results/<prefix>-points-<dataset>.csv`, if the `-p` option is used.
  - `<dataset>/central/results/<prefix>-points-<dataset>.svg`. If dimension = 2, it's possible to generate a plotting file (SVG), which can be opened in any browser.
- Clustering parameters:
  - `-e`: Epsilon parameter. The default value for Epsilon is 10.
  - `-f`: Force parameter. If the force between two cells is greater than or equal to this parameter, the cells are joined.
  - `-m`: Minimum Cells. If a cluster is formed by fewer cells than this parameter, the cluster is discarded.
- Plotting parameters, used to configure the SVG output:
  - `-r`: Don't draw the cells (rectangles).
  - `-g`: Don't draw the edges that link the cells forming the clusters.
  - `-p`: Draw points. Used to compare raw data with the clustering results.
  - `-b`: Draw numbers. Draws the cluster labels inside the cells and the ground-truth labels (classification column of the raw data) inside the points.
- Prefix parameter:
  - `-x`: This option can be used when testing Epsilon and Force parameters, to avoid generating a bunch of files for each group of parameters tested.
The output files are in CSV format. The cells file has the following fields:

| Field | Description |
|---|---|
| cell-id | Sequential number |
| number-points | Number of points summarized in the cell |
| CM-0 … CM-n | Coordinates of the center of mass |
| qty-cells-cluster | Quantity of cells in the cluster |
| gCluster-label | Label of the cluster defined by gCluster |
| ground-truth-cell-label | Label of the ground-truth cluster (class column). The ground truth of a cell is determined by the class of the point closest to its center of mass |
The points file has the following fields:

| Field | Description |
|---|---|
| Coord-0 … Coord-n | Coordinates of the point |
| gCluster-label | Label of the cluster defined by gCluster |
| ground-truth-label | Label of the ground truth (class column) |
The DBSCAN.py script runs the DBSCAN clustering algorithm to compare results.
It's possible to run DBSCAN over the points (raw data) or over the centers of mass of the cells generated by the sumData program.
The goal is to compare the gCluster algorithm with DBSCAN in two situations:

- Run DBSCAN over all the data, to compare the centralized-data and distributed-data approaches.
- Run DBSCAN over the cells' centers of mass, to verify whether a conventional and mature algorithm performs well over summarized data.
```
Options    Description
-h         Show this help
-d <dir>   Directory of files
-pr <pre>  Prefix of files (e<epsilon>f<force (with 4 decimals)> - Ex. e014f0.1500)
-t <opt>   <opt> = c or p (for cells or points respectively)
-e <value> Epsilon value
-m <value> Min points value
-l         Print legend
-x         Don't create files with prefix
```
- Options:
  - `-d`: the `<dataset>` directory.
  - `-t`: type of data: c for cells (centers of mass of the cells) or p for points (raw data).
  - `-pr`: prefix of the files. This is the best way to ensure you are comparing correctly; the script uses this prefix to find the input file.
- DBSCAN parameters:
  - `-e`: Epsilon value. It's important to notice that, due to the normalization, the distance between the minimum and maximum values in every dimension is one, which is a good reference for choosing a good value of Epsilon.
- Inputs:
  - `<dataset>/central/results/<prefix>-cells-<dataset>.csv` for cells (option `-t c`)
  - `<dataset>/central/results/<prefix>-points-<dataset>.csv` for points (option `-t p`)
- Outputs:
  - `<dataset>/central/DBSCAN/<prefix>-cells-DBSCAN-<dataset>.csv` for cells (option `-t c`)
  - `<dataset>/central/DBSCAN/<prefix>-points-DBSCAN-<dataset>.csv` for points (option `-t p`)
The DBSCAN output files have the following fields:

| Field | Description |
|---|---|
| CM-0 … CM-n | Coordinates of the point or cell |
| gCluster-label | Label of the cluster defined by DBSCAN |
| ground-truth-label | Label of the ground-truth cluster (class column) |
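The internals of DBSCAN.py aren't shown here; a minimal plain-Python reference implementation of the algorithm it runs (with the Epsilon and min-points parameters) might look like this sketch:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 for noise.
    Reference sketch only; the project's DBSCAN.py may differ."""
    def neighbors(i):
        # all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # expand only from core points
                queue.extend(nb)
    return labels
```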
The validation.py script computes the Fowlkes-Mallows index to compare the result of an algorithm with its ground truth.
Each point has its ground-truth label and the label found by the algorithm being evaluated.
The set of 2-combinations of the n points is created, and each pair is classified as:

- ss (same/same): the two points belong to the same cluster in both gCluster and the ground truth
- sd (same/different): the two points belong to the same cluster in gCluster and to different clusters in the ground truth
- ds (different/same): the two points belong to different clusters in gCluster and to the same cluster in the ground truth
- dd (different/different): the points belong to different clusters in both partitions
To calculate the Fowlkes-Mallows index:

m1 = ss + sd (number of pairs placed together by the algorithm)
m2 = ss + ds (number of pairs placed together by the ground truth)

FM = ss / sqrt(m1.m2)
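The pair counting and the index can be sketched in Python, using the standard Fowlkes-Mallows definition FM = ss / sqrt((ss + sd)(ss + ds)):

```python
from itertools import combinations

def fowlkes_mallows(pred, truth):
    """Fowlkes-Mallows index from two label lists, by classifying
    every pair of points as ss, sd, ds, or dd."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_truth = truth[i] == truth[j]
        if same_pred and same_truth:
            ss += 1
        elif same_pred:
            sd += 1
        elif same_truth:
            ds += 1
        else:
            dd += 1
    m1 = ss + sd  # pairs together in the evaluated clustering
    m2 = ss + ds  # pairs together in the ground truth
    return ss / (m1 * m2) ** 0.5
```

Identical partitions give FM = 1; the index decreases as the partitions disagree.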
```
Options    Description
-h         Show this help
-d <dir>   Directory of files
-m <file>  File with map of indexes
-t <opt>   <opt> = c or p (for cells or points respectively)
-pr <pre>  Prefix of files
           if gCluster: pr = e<epsilon (3 digits)>f<force (with 4 decimals)> - Ex. e014f0.1500
           if DBSCAN:   pr = e<epsilon (4 decimals)>m<minPts (3 digits)>     - Ex. e0.1100m003
-b         Use this if you'll validate DBSCAN
```
- Options:
  - `-d`: the `<dataset>` directory.
  - `-m <file>`: map file (see below).
  - `-t <opt>`: type of file, (c)ell or (p)oint.
  - `-pr <pre>`: prefix of the file, the same prefix as in the previous phases. The format differs between gCluster and DBSCAN; see the help screen above.
  - `-b`: indicates DBSCAN.
- Inputs:
  - If `-t c` (type = cells, algo = gCluster): `<dataset>/central/results/<prefix>-cells-result-<dataset>.csv`
  - If `-t p` (type = points, algo = gCluster): `<dataset>/central/results/<prefix>-points-result-<dataset>.csv`
  - If `-t c -b` (type = cells, algo = DBSCAN): `<dataset>/central/results/<prefix>-cells-DBSCAN-<dataset>.csv`
  - If `-t p -b` (type = points, algo = DBSCAN): `<dataset>/central/results/<prefix>-points-DBSCAN-<dataset>.csv`
- Outputs, shown on screen:
  - the values of ss, sd, ds, and dd
  - the value of the FM index
In this case the script loads the cells output of the gCluster algorithm (file `<dataset>/central/results/<prefix>-cells-result-<dataset>.csv`). See the file format here.
It compares the cluster labels generated by gCluster with the labels from the ground truth.
To find the ground-truth label of a cell, we choose the point closest to the cell's center of mass.
In Figure 1, the ground-truth label of the cell is 22.

Figure 1: Cell label = 248, Ground Truth Label = 22
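The closest-point rule for a cell's ground-truth label can be sketched as follows (names are illustrative):

```python
def cell_ground_truth(center_of_mass, points, labels):
    """Ground-truth label of a cell: the class of the raw point
    closest to the cell's center of mass."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    best = min(range(len(points)), key=lambda i: sq_dist(points[i], center_of_mass))
    return labels[best]
```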
In Figure 2, all the points from the raw data have the same label, but gCluster didn't join the two graphs (green and blue), probably due to the chosen Force parameter.
In the validation, however, the cells from clusters 20 and 13 will have the same ground truth.

Figure 2: Two different clusters found by gCluster with the same ground truth
In this case the script loads the points generated by the gCluster algorithm (file `<dataset>/central/results/<prefix>-points-result-<dataset>.csv`). See the file format here.
When type = points (`-t p`) is used, the script simulates having all the points in the central node, in order to compare the same clustering algorithm running over summarized data and over raw data.
The idea is subtly different from type = cells: to determine the cluster the algorithm found for a point, it simply assigns the cluster label of the cell the point belongs to.
Examples:

- Figure 1: points with ground-truth label 22 will be assigned label 43 if they lie inside cells with label 43, and -1 if they lie in cells that don't belong to any cluster.
- Figure 2: by the same idea, points with ground-truth label 2 will receive gCluster labels 20, 13, or -1, depending on the cell they fall into.
In this case the script loads the cells output of the DBSCAN algorithm (file `<dataset>/central/results/<prefix>-cells-DBSCAN-<dataset>.csv`). See the file format here.
It compares the cluster labels generated by DBSCAN with the labels from the ground truth.
To find the ground-truth label of a cell, for each center of mass the algorithm finds the closest point within a distance less than or equal to minPts; if there is no point close enough, the center of mass receives the label -1.
In this case the script loads the points generated by the DBSCAN algorithm (file `<dataset>/central/results/<prefix>-points-DBSCAN-<dataset>.csv`). See the file format here.
When type = points is used, the script simulates having all the points in the central node, in order to compare the same clustering algorithm running over summarized data and over raw data.
As you can see in Figures 1 to 3, the labels are defined automatically by the algorithms. For the validation to work, the labels must match, because the script uses the labels to determine whether or not two points belong to the same cluster.
So it's necessary to create a map file that relates the labels created by the algorithms to the labels provided by the ground-truth file.
As the labels can change depending on the algorithms' parameters, a map file is needed for each set of parameters; the file name therefore uses the prefix to identify the parameters used.
The map file is in CSV format, and the validation.py script expects the following names:

- `<dataset>/<prefix>-map-<dataset>.csv` for gCluster tests
- `<dataset>/<prefix>-DBSCAN-<dataset>.csv` for DBSCAN tests
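Loading and applying such a map can be sketched as follows. The exact column layout of the real map file isn't specified here; this sketch assumes two columns, algorithm-label and ground-truth-label:

```python
import csv

def load_label_map(path):
    """Load a map file relating algorithm labels to ground-truth labels.
    Assumed layout (hypothetical): algorithm-label, ground-truth-label."""
    with open(path, newline="") as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}

def remap(labels, mapping):
    """Translate algorithm labels before comparing with the ground truth;
    unmapped labels (e.g. the noise label -1) are left unchanged."""
    return [mapping.get(l, l) for l in labels]
```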