Pigeon iterative MAG binning + isolate assembly #127
|
Woohoo! This looks great. I'll check and merge by Mon afternoon unless you
want it earlier.
…On Sat, May 16, 2026, 8:54 PM Austin Marshall ***@***.***> wrote:
@microbemarsh <https://github.com/microbemarsh> requested your review on:
treangenlab/somatem#127 <#127>
Bioconda cleanup i109.
|
|
Summary of recent commit:
Big points: added isolate_analysis.nf and created the summary_report module for pipeline summaries.
Nitty gritty (AI-generated from the git commit) below:
|
ppreshant
left a comment
Great updates. Other than reverting the path changes in the example CSV and YAML files, everything looks good to merge.
If you're unable to get to the path changes, I can do them early next week.
| @@ -1,2 +1,2 @@ |
| sample,fastq_1 |
| -zymo,assets/data/assembly/mock20_hiq100k.fastq.gz |
| \ No newline at end of file |
| +zymo,/home/agm/Documents/somatem/assets/data/assembly/mock20_hiq100k.fastq.gz |
| \ No newline at end of file |
These paths are hardcoded to specific directories and will fail for other users. The original path assets/data/assembly/mock20_hiq100k.fastq.gz works when users download the example files using the script subworkflows/local/download_example_datasets.nf, which saves files here:
params.save_dir = params.save_dir ?: "${launchDir}/assets" // default save directory:
// currently users can't change this since the config yml files are hardcoded to use this path
I will revert this and other similar changes manually for now.
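A quick way to catch this kind of regression before review is to scan the example samplesheets for absolute paths. The following is a minimal sketch, assuming a two-column CSV with the header `sample,fastq_1` as in the diff above; the function name `find_absolute_paths` is hypothetical and not part of Somatem.

```sh
#!/usr/bin/env sh
# Hypothetical helper (not part of Somatem): print every samplesheet row whose
# fastq_1 column is an absolute path, and return non-zero if any were found.
# Assumes a header row followed by rows of the form: sample,fastq_1
find_absolute_paths() {
    # $1 = path to the samplesheet CSV
    awk -F',' '
        NR > 1 && $2 ~ /^\// { print; found = 1 }   # skip header, flag leading "/"
        END { exit (found ? 1 : 0) }
    ' "$1"
}

# Example usage (path taken from the review thread):
#   find_absolute_paths assets/samplesheet.csv || echo "absolute paths found"
```

Relative paths such as assets/data/assembly/mock20_hiq100k.fastq.gz pass the check, while a path starting with /home/ is printed and makes the function return 1.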
| # 1. laptop = runs on a laptop. Only does taxonomic profiling (default) |
| # 2. cluster = runs on a cluster or workstation. Can perform all analysis_types (including assembly) |
| -running_mode: "cluster" |
| +running_mode: "laptop" |
I think running_mode is not currently used, so it is redundant and can be removed from these params to minimize distraction.
| ### Set the cache dir | ||
| Nextflow's cache dir is set to `~/micromamba/other-envs` in the `nextflow.config` file's `conda.cacheDir` parameter. So create a directory at this location the first time. | ||
| _The default cache dir for nextflow created environments is `Somatem/work/..`_; Since I like to delete `work/` frequently to save space, I set this to a different location outside the repo. | ||
| Nextflow's conda cache and downloaded databases default to the Somatem data directory. In a Bioconda or micromamba install this resolves to `$CONDA_PREFIX/var/somatem`, so generated environments and databases stay with the active Somatem environment. |
Actually, on second thought, this will cause issues by storing other environments inside the Somatem environment. Nextflow activates and switches between them independently, so it is best to leave conda.cacheDir at its default location, $workDir/conda, and mention that power users can change it.
I think the Nextflow config features automatic detection of the Somatem home and sets the environments directory based on the conda or mamba path, so this documentation might need to be updated?
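For power users who do want the environments outside `work/`, one option is an opt-in shell helper rather than a changed default. This is a sketch under assumptions: `NXF_CONDA_CACHEDIR` is Nextflow's environment variable for the conda cache location, and as far as I know it is only consulted when `conda.cacheDir` is not set in the config; the function name `use_external_conda_cache` is hypothetical, and the default directory is the one from the older README text.

```sh
#!/usr/bin/env sh
# Hypothetical opt-in helper (not part of Somatem). Leaving NXF_CONDA_CACHEDIR
# unset keeps Nextflow's default conda cache location ($workDir/conda).
use_external_conda_cache() {
    # $1 = optional cache directory; defaults to the path from the old README
    cache_dir="${1:-$HOME/micromamba/other-envs}"
    # Create the directory up front so users do not have to do it by hand
    mkdir -p "$cache_dir" || return 1
    export NXF_CONDA_CACHEDIR="$cache_dir"
}

# Example usage before launching the pipeline:
#   use_external_conda_cache "$HOME/micromamba/other-envs"
```

Creating the directory inside the helper also sidesteps the question of whether Nextflow creates a missing cache directory on its own.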
| Nextflow's conda cache and downloaded databases default to the Somatem data directory. In a Bioconda or micromamba install this resolves to `$CONDA_PREFIX/var/somatem`, so generated environments and databases stay with the active Somatem environment. | ||
|
|
||
| ```sh | ||
| mkdir -p ~/micromamba/other-envs |
Is this cache directory created automatically by Nextflow?
- If not, we should have a Nextflow setup module that creates this directory rather than asking users to create it.
|
|
||
| // Run SingleM pipe on bins - FIXED: Add suffix to avoid filename collision | ||
| SINGLEM_PIPE_BINS(SEMIBIN_SINGLEEASYBIN.out.output_fasta, ch_singlem_db, 'genome') | ||
| SINGLEM_PIPE_BINS(ch_bins_for_downstream, ch_singlem_db, 'genome') |
Minor note (for the future): rename ch_bins_for_downstream to something more descriptive? Since it is used in many places in the pipeline, a self-documenting name would make the code clearer.
| input = "${projectDir}/assets/samplesheet.csv" // bundled default file should resolve from the installed pipeline root | ||
| db_base_dir = "/home/dbs" // change to "./assets/databases" if not using a shared directory | ||
| unified_db_base_dir = "/home/Users/pacbio_bakeoff/data/ref_db/refseq03032025" // path to Eddy's unified databases (Ensemble analysis: species detection) | ||
| somatem_home = System.getenv('SOMATEM_HOME') ?: ( System.getenv('CONDA_PREFIX') ? "${System.getenv('CONDA_PREFIX')}/var/somatem" : ( System.getenv('MAMBA_ROOT_PREFIX') ? "${System.getenv('MAMBA_ROOT_PREFIX')}/var/somatem" : "${System.getenv('HOME')}/.somatem" ) ) |
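The nested ternary in the somatem_home line is easier to review when its fallback order is written out. The sketch below mirrors that order in shell so the resolved directory can be checked from a terminal; the function name `resolve_somatem_home` is hypothetical, and empty variables are treated as unset, which matches how Groovy's `?:` and ternary treat empty strings.

```sh
#!/usr/bin/env sh
# Hypothetical mirror of the somatem_home fallback chain (not part of Somatem).
# Order: SOMATEM_HOME, then CONDA_PREFIX, then MAMBA_ROOT_PREFIX, then HOME.
resolve_somatem_home() {
    if [ -n "${SOMATEM_HOME:-}" ]; then
        printf '%s\n' "$SOMATEM_HOME"
    elif [ -n "${CONDA_PREFIX:-}" ]; then
        printf '%s\n' "$CONDA_PREFIX/var/somatem"
    elif [ -n "${MAMBA_ROOT_PREFIX:-}" ]; then
        printf '%s\n' "$MAMBA_ROOT_PREFIX/var/somatem"
    else
        printf '%s\n' "$HOME/.somatem"
    fi
}

# Example usage:
#   echo "Somatem home resolves to: $(resolve_somatem_home)"
```

Because an active conda environment sets CONDA_PREFIX, the second branch is the one most Bioconda or micromamba installs would hit unless SOMATEM_HOME is exported explicitly.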
|
@copilot please address my comments from the review in a few commits! |
Agent-Logs-Url: https://github.com/treangenlab/somatem/sessions/c367543a-34f2-4367-8c5e-6a497b81b5fd Co-authored-by: ppreshant <14856479+ppreshant@users.noreply.github.com>
Addressed in commit 1c7ca47. I reverted hardcoded local paths in the MAG example sample/config files, removed the unused |
Bioconda cleanup, launcher fixes, and Pigeon iterative MAG binning
Summary
This branch cleans up the local/Bioconda Somatem experience and hardens the MAG assembly workflow. It fixes packaged launcher shortcuts, adds local CPU/memory detection, updates problematic module environments, and introduces Pigeon-driven iterative MAG binning as the default binning route.
Related to #109. Fully fixes #90 and partially fixes #41.
This was getting pretty long and I decided it was worthy of a merge before I kept adding more.
Key Changes
- bin/somatem shortcut commands so packaged utility workflows resolve correctly.
- PIGEON_ITERATIVE_BINNING, while preserving the original SemiBin2-only path with mag_iterative_binning_enabled: false.
- bin/pigeon_iterate.py to run SemiBin2, MetaBAT2, and VAMB across bounded iterations, select candidates with Pigeon metrics, and run DAS Tool consensus.
- iterative_report.html plus TSV/JSON selection outputs so users can inspect Pigeon's candidate trajectory and selected bins.
Validation
Tested this on the Villapol lab PC with the example MAG dataset.
The MAG workflow was run on the zymo example dataset to debug Hostile, minimap2, VAMB, DAS Tool, Bakta, and Pigeon output-publishing issues.
Reviewer Notes
- mag_iterative_binning_enabled: true now sends MAG bins through Pigeon iterative binning by default.
- mag_iterative_binning_enabled: false keeps the standard SemiBin2-only route available.