
cluster_validation

Methods for validating Spyral clustering algorithms with attpc_engine data. We implement a custom Spyral phase for analyzing the accuracy of clustering.

Installation

Download the repository with

git clone https://github.com/ATTPC/cluster_validation.git

Enter the repo, then create a Python environment and activate it

cd cluster_validation
python -m venv .venv
source .venv/bin/activate

Note that you may need to use a specific Python version (e.g. replace python with python3.11), and the activation step may differ depending on your platform (in particular on Windows). Now install the dependencies from the requirements.txt file

pip install -r requirements.txt

Usage

Before using this repo, you will need to generate some attpc_engine data. See the engine docs for details.

This repo supports two modes of usage. The first is a notebook for visualizing the analysis and understanding the techniques used. To run the notebook, run the command

jupyter-lab

and select the notebook cluster_validation.ipynb. You'll have to set some paths and choose clustering parameters, which you can explore with the visualizations.

However, the notebook is not a good way to explore a large set of events. To run an entire dataset through the analysis, we use the Spyral framework. Included in the repo is the file cluster_validation.py. This runs a customized Spyral phase that clusters and validates the data. Note that you will need to edit cluster_validation.py to set paths and parameters (a sketch of what these edits look like follows the command below). See the Spyral docs for details on the various parameters. To run the analysis, simply use

python cluster_validation.py
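
The exact edits depend on the current version of the script, but they are generally plain assignments near the top of the file. The following is a hypothetical excerpt; the variable names are illustrative and not necessarily the names used in cluster_validation.py:

from pathlib import Path

# Hypothetical configuration block; the real names live in cluster_validation.py.
workspace_path = Path("/path/to/spyral/workspace")  # where Spyral writes its output
trace_path = Path("/path/to/attpc_engine/data")     # simulated events to analyze
run_min = 0  # first run number to process
run_max = 0  # last run number to process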

Once you've run the validation pipeline, you will have a Spyral workspace containing a directory called ClusterValidation. Inside this directory will be a standard Spyral cluster datafile (run_#.h5) and a parquet file (run_#.parquet) that contains the validation data. This repo also contains a simple tool, evaluate_cluster_accuracy.py, which provides some aggregate validation results. To run the aggregator use

python evaluate_cluster_accuracy.py <path/to/parquet.parquet> <truth_label>

This will evaluate clustering statistics for the truth label you specify; that is, it evaluates the accuracy of clustering a specific nucleus.

Modifying clustering

Generally, you will want to use this repo to evaluate the performance of different clustering methods. You can modify the validation phase by editing the code in the validation directory. The normal rules for custom Spyral phases apply here.

Accuracy Evaluation

One of the complications of evaluating the clusters is that the truth label value given by the attpc_engine does not (in general) equal the clustering label value assigned by the analysis. So a comparison is made by taking a given truth label and a given analysis label and evaluating whether they include the same points over the entire point cloud. A simple example is a three-point point cloud with truth labels [1,2,2] and analysis labels [0,1,1]. In this case we say analysis cluster 1 corresponds to truth cluster 2 with 100% accuracy. However, if the analysis labels were [1,1,1], we would say that analysis cluster 1 corresponds to truth cluster 2 with 66% accuracy, because the labeling disagrees at the first point. You can see how this becomes complicated over larger point clouds: a cluster must not only include the correct points, it must also exclude the correct points. It is also clear that to decide whether a given analysis label "matches" a truth label, we need an accuracy threshold. For this analysis the following quantities are defined:

  • total accuracy: the accuracy of a comparison over all points. This accuracy is used to evaluate whether a truth and analysis cluster match.
  • inclusive accuracy: the accuracy when considering only points which lie within the truth label cluster.
  • exclusive accuracy: the accuracy when considering only points which lie outside the truth label cluster.
  • accuracy threshold: the minimum total accuracy for which a truth cluster and analysis cluster are considered to match.
  • size rejected: indicates that an event was rejected based on the size of the point cloud and not due to clustering.

With these quantities we can, we believe, completely define the accuracy of a clustering analysis.
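
To make these definitions concrete, here is a minimal sketch of the point-by-point comparison for a single (truth, analysis) label pair. This is an illustration of the idea, not the repo's actual implementation:

import numpy as np

def label_accuracies(truth: np.ndarray, predicted: np.ndarray,
                     truth_label: int, predicted_label: int) -> tuple[float, float, float]:
    # Membership masks: which points belong to each cluster
    in_truth = truth == truth_label
    in_pred = predicted == predicted_label
    # A point "agrees" if both labelings make the same membership decision
    agree = in_truth == in_pred
    total = float(np.mean(agree))                 # over all points
    inclusive = float(np.mean(agree[in_truth]))   # only points inside the truth cluster
    exclusive = float(np.mean(agree[~in_truth]))  # only points outside the truth cluster
    return total, inclusive, exclusive

# The example from above: truth [1, 2, 2] vs. analysis [1, 1, 1],
# matching analysis cluster 1 against truth cluster 2
print(label_accuracies(np.array([1, 2, 2]), np.array([1, 1, 1]), 2, 1))
# prints (0.666..., 1.0, 0.0)

A match would then be declared when the total accuracy exceeds the accuracy threshold.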

Output Dataframe

The output validation data frame has the following columns (a sketch of reading the file follows the list):

  • event: the event number for this row
  • truth_labels: the list of found truth labels for this point cloud
  • n_truth: the number of truth labels
  • n_predicted: the number of labels predicted by the analysis
  • n_validated: the number of (unique) matches between predicted and truth labels
  • matches: a list of label pairs, (predicted, truth), which were matched
  • total_acc: the total accuracy of the matches, in the same order as matches
  • inclusive_acc: the inclusive accuracy of the matches, in the same order as matches
  • exclusive_acc: the exclusive accuracy of the matches, in the same order as matches
  • size_rejected: a boolean indicating if this event was rejected by size (true if rejected)
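
If you want to compute your own aggregate statistics instead of (or in addition to) using evaluate_cluster_accuracy.py, the validation data is an ordinary parquet file. A minimal sketch, assuming the file is read with polars and using the column names above (the file name is illustrative):

import polars as pl

df = pl.read_parquet("run_0001.parquet")
# Ignore events rejected by point cloud size; they say nothing about clustering
valid = df.filter(~pl.col("size_rejected"))
# Fraction of events where every truth cluster was matched by an analysis cluster
fully_matched = valid.filter(pl.col("n_validated") == pl.col("n_truth"))
print(f"Fully matched events: {len(fully_matched) / len(valid):.2%}")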

Requirements

Requires Python >= 3.10, < 3.13
