Collate results from Eddie3 cellprofiler jobs
Following a cellprofiler analysis with ExportToSpreadsheet, we produce many .csv files that we want to merge together.
If there is a lot of data, then the sensible option is to create a database of the results. We can use meld to scan through multiple sub-directories and build an sqlite database from the .csv files, with the option of aggregating the data as we go.
If we have a directory of results containing multiple sub-folders, e.g:
results/
├── output_1
│ ├── DATA.csv
│ └── IMAGE.csv
├── output_2
│ ├── DATA.csv
│ └── IMAGE.csv
└── output_3
├── DATA.csv
└── IMAGE.csv
import meld
merger = meld.Merger("/results")We then want to tell merge_to_db where to store the database.
merger.create_db("/path/to/db/location")Now we want the database to create separate tables for DATA and IMAGE. We
can specify the name of the .csv file we want to store in each table.
merger.to_db("DATA")
merger.to_db("IMAGE")This will automatically scan through the sub-directories, and read in the DATA and IMAGE files respectively, appending each to the appropriate table.
CellProfiler can combine the results of different objects into a single csv file, when this is done it produces a .csv file with multi-indexed columns.
merge_to_db can automatically flatten these column headers before storing in
the database if we specify the number of headers beforehand. So if DATA has
two indices:
merger.to_db("DATA", header=[0,1])As the default output from CellProfiler is cell-level data, whereby we have a
row per object, it's normally convenient to aggregate this to image or well
averages. We can do this automatically when appending the raw-data to the
database with the method to_db_agg(), which produces a separate table in the
database named <object>_agg, where object is the name of the .csv file
containing the raw data.
usage:
merger.to_db_agg(select="DATA", header=[0,1], by="ImageNumber")This will group the data by ImageNumber and create a row with a median value for each image.
We can change the aggregation function by passing the method argument.
merger.to_db_agg(select="DATA", header=[0,1], by="ImageNumber", method="mean")This will create a table called DATA_agg in the database, with a row per
image.
If you're collapsing multi-indexed columns and aggregating data, you have to specify the column you wish to aggregate by with the collapsed name.
For example:
| Image | Image | ... |
|---|---|---|
| ImageNumber | Intensity_channel_1 | ... |
| 1 | 0.758 | ... |
| ... | ... | ... |
You would aggregate by Image_ImageNumber, as the Image and ImageNumber column headers will be collapsed before aggregation.
merger.to_db_agg(select="DATA", header=[0,1], by="Image_ImageNumber")