Pigeon iterative MAG binning + isolate assembly #127

Open
microbemarsh wants to merge 4 commits into main from bioconda_cleanup_i109

@microbemarsh
Collaborator

@microbemarsh microbemarsh commented May 17, 2026

Bioconda cleanup, launcher fixes, and Pigeon iterative MAG binning

Summary

This branch cleans up the local/Bioconda Somatem experience and hardens the MAG assembly workflow. It fixes packaged launcher shortcuts, adds local CPU/memory detection, updates problematic module environments, and introduces Pigeon-driven iterative MAG binning as the default binning route.
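
As a rough illustration of what local CPU/memory detection with caps can look like, here is a minimal Python sketch. It is not taken from this branch: the function names, the one-core reserve, and the 8 GB fallback are all assumptions made for the example.

```python
import os


def detect_local_resources(reserve_cpus=1, fallback_mem_gb=8.0):
    """Return (cpus, mem_gb) for this machine (hypothetical helper)."""
    # Leave a core free for the launcher itself, but never report fewer than 1.
    cpus = max(1, (os.cpu_count() or 1) - reserve_cpus)
    try:
        # POSIX only: total physical memory = page size * number of pages.
        mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        mem_gb = mem_bytes / 1024**3
    except (ValueError, OSError, AttributeError):
        # os.sysconf is missing on some platforms; use an assumed fallback.
        mem_gb = fallback_mem_gb
    return cpus, mem_gb


def cap_request(requested_cpus, requested_mem_gb, local_cpus, local_mem_gb):
    """Clamp a process request so it never exceeds what the machine has."""
    return min(requested_cpus, local_cpus), min(requested_mem_gb, local_mem_gb)
```

A high-resource process that asks for 32 CPUs and 128 GB on an 8-core, 16 GB machine would then be capped to 8 CPUs and 16 GB instead of failing at submission.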

Related to #109. Fully fixes #90 and partially fixes #41.

This was getting pretty long and I decided it was worthy of a merge before I kept adding more.

Key Changes

  • Fixed bin/somatem shortcut commands so packaged utility workflows resolve correctly.
  • Added local resource detection and caps for high-resource processes.
  • Replaced the default MAG binning path with PIGEON_ITERATIVE_BINNING, while preserving the original SemiBin2-only path with mag_iterative_binning_enabled: false.
  • Added bin/pigeon_iterate.py to run SemiBin2, MetaBAT2, and VAMB across bounded iterations, select candidates with Pigeon metrics, and run DAS Tool consensus.
  • Improved Pigeon scoring with an adaptive graph-aware loss objective.
  • Added iterative_report.html plus TSV/JSON selection outputs so users can inspect Pigeon’s candidate trajectory and selected bins.
  • Published iterative binning, taxonomy, quality, assembly, and annotation outputs into user-facing result folders.
  • Fixed Hostile staged index handling, minimap2 local resource failures, DAS Tool compressed assembly handling, VAMB residual mismatch handling, and the Bakta 1.11.x annotation crash.
  • Removed the experimental CheckM2-in-Pigeon scoring path; downstream CheckM2 quality assessment remains unchanged.
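
The bounded iteration described in the list above can be sketched roughly as follows. This is an illustrative Python outline, not the contents of bin/pigeon_iterate.py: the `iterate_binners` name, the `min_gain` stopping rule, and the idea that each binner returns a single loss value are assumptions, and the DAS Tool consensus step is left out.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, int, float]  # (binner name, iteration, loss)


def iterate_binners(
    binners: Dict[str, Callable[[int], float]],
    max_iterations: int = 3,
    min_gain: float = 0.01,
) -> List[Candidate]:
    """Keep the lowest-loss candidate per round; stop when gains stall."""
    selected: List[Candidate] = []
    best_loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        # Run every binner for this round and take the lowest-loss result.
        round_best = min(
            ((name, iteration, run(iteration)) for name, run in binners.items()),
            key=lambda candidate: candidate[2],
        )
        # Stop early if this round does not improve on the best loss so far.
        if best_loss - round_best[2] < min_gain:
            break
        selected.append(round_best)
        best_loss = round_best[2]
    return selected
```

In this sketch each callable stands in for one binner run (SemiBin2, MetaBAT2, or VAMB) scored by some loss. The `max_iterations` bound guarantees termination even if the loss keeps improving.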

Validation

Tested on the Villapol lab PC with the example MAG dataset.

The MAG workflow was run on the zymo example dataset to debug Hostile, minimap2, VAMB, DAS Tool, Bakta, and Pigeon output-publishing issues.

Reviewer Notes

  • mag_iterative_binning_enabled: true now sends MAG bins through Pigeon iterative binning by default.
  • mag_iterative_binning_enabled: false keeps the standard SemiBin2-only route available.
  • The Pigeon HTML report uses Plotly when available and falls back to table-only HTML if needed.
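
The Plotly fallback mentioned in the last note can be sketched like this. It is a hedged Python example, not the report code from this branch: `render_report` and the `iteration`/`loss` column names are hypothetical, and only `plotly.graph_objects` calls that exist in Plotly are used.

```python
import html


def render_report(rows, title="Iterative binning report"):
    """Build an HTML report; add a chart only if Plotly can be imported."""
    columns = list(rows[0]) if rows else []
    header = "".join(f"<th>{html.escape(str(col))}</th>" for col in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(row.get(col, '')))}</td>" for col in columns)
        + "</tr>"
        for row in rows
    )
    table = f"<table><tr>{header}</tr>{body}</table>"
    try:
        import plotly.graph_objects as go  # optional dependency

        fig = go.Figure(
            go.Scatter(
                x=[row.get("iteration") for row in rows],
                y=[row.get("loss") for row in rows],
                mode="lines+markers",
            )
        )
        chart = fig.to_html(full_html=False, include_plotlyjs="cdn")
    except ImportError:
        # Table-only fallback when Plotly is not installed.
        chart = ""
    return f"<html><body><h1>{html.escape(title)}</h1>{chart}{table}</body></html>"
```

Importing Plotly inside the function keeps it optional, so the table is still produced in environments where Plotly is absent.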

@microbemarsh microbemarsh requested a review from ppreshant May 17, 2026 01:53
@microbemarsh microbemarsh added the enhancement New feature or request label May 17, 2026
@microbemarsh microbemarsh self-assigned this May 17, 2026
@microbemarsh microbemarsh added the bug Something isn't working label May 17, 2026
@ppreshant
Member

ppreshant commented May 17, 2026 via email

@microbemarsh microbemarsh marked this pull request as draft May 21, 2026 15:44
@microbemarsh
Collaborator Author

microbemarsh commented May 22, 2026

Summary of recent commit:

Big points: added isolate_analysis.nf and created a summary_report module for pipeline summaries.

Nitty-gritty details, AI-generated from the git commit, below:

  • Added shared SOMaTeM HTML summary reporting infrastructure via bin/somatem_report.py and a reusable SOMATEM_SUMMARY_REPORT module.
  • Wired summary reports into pre-processing, taxonomic profiling, MAG assembly, genome dynamics, and isolate analysis workflows.
  • Added a new long-read-first isolate-analysis workflow that consumes preprocessed reads.
  • Added isolate read classification with Kraken2, assembly with Autocycler/Flye, optional hybrid polishing with Polypolish and pypolca, Bakta annotation, and optional BTyper3 typing.
  • Added tool-first local modules and Conda environments for Autocycler, Flye-for-Autocycler, BWA, Polypolish, pypolca, BTyper3, and FASTA finalization.
  • Reused and updated existing Bakta and Kraken2 modules for isolate workflow outputs and version reporting.
  • Added isolate_analysis CLI support through bin/somatem.
  • Updated database handling to support isolate-analysis Bakta DB setup.
  • Updated EMU database download/staging logic and default EMU DB config (#56, "Add latest emu and lemur databases (2025) from Mike").
  • Improved example dataset download behavior with configurable output location and completion tracking.
  • Added Conda environment stability tweaks, including longer create timeouts and nodefaults for Hostile environments.
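
One common way to implement the completion tracking mentioned for example dataset downloads is a marker file written only after a download finishes. The Python sketch below illustrates that idea under stated assumptions: `download_once`, the `.complete` marker naming, and the `fetch` callable are hypothetical and are not the logic in download_example_datasets.nf.

```python
from pathlib import Path


def download_once(name, out_dir, fetch):
    """Fetch `name` into `out_dir` unless a completion marker already exists.

    `fetch` is any callable that writes the file at the path it is given.
    Returns True if a download ran, False if it was skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # configurable output location
    marker = out_dir / f".{name}.complete"
    if marker.exists():
        return False
    fetch(out_dir / name)
    # Write the marker last so an interrupted download is retried next time.
    marker.touch()
    return True
```

Because the marker is created after `fetch` returns, a partially written file from an interrupted run does not count as complete.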

@microbemarsh microbemarsh marked this pull request as ready for review May 22, 2026 13:39
Member

@ppreshant ppreshant left a comment

Great updates. Other than reverting the path changes in the example CSV and YAML files, everything looks good to merge.

If you're unable to get to the path changes, I can do them early next week.

Comment thread assets/samplesheets/mag_samples.csv Outdated
@@ -1,2 +1,2 @@
 sample,fastq_1
-zymo,assets/data/assembly/mock20_hiq100k.fastq.gz
\ No newline at end of file
+zymo,/home/agm/Documents/somatem/assets/data/assembly/mock20_hiq100k.fastq.gz
\ No newline at end of file
Member

These paths are hardcoded to specific directories and will fail for other users. The original path assets/data/assembly/mock20_hiq100k.fastq.gz works when users download the example files with the script subworkflows/local/download_example_datasets.nf, which saves files here:

params.save_dir = params.save_dir ?: "${launchDir}/assets" // default save directory: 
  // currently users can't change this since the config yml files are hardcoded to use this path

I will revert this and other similar changes manually for now
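
A small check along these lines could catch machine-specific paths before they reach an example samplesheet. This is a minimal Python sketch and not part of the PR: `absolute_path_rows` is a hypothetical helper, and the `sample`/`fastq_1` column names are taken from the samplesheet diff above.

```python
import csv
import io
from pathlib import PurePosixPath


def absolute_path_rows(samplesheet_text, column="fastq_1"):
    """Return sample names whose path column holds an absolute path."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    return [
        row["sample"]
        for row in reader
        if PurePosixPath(row[column]).is_absolute()
    ]
```

An empty result means every path is relative and should resolve from wherever the example data was downloaded.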

Comment thread assets/user_config/mag_params.yml Outdated
 # 1. laptop = runs on a laptop. Only does taxonomic profiling (default)
 # 2. cluster = runs on a cluster or workstation. Can perform all analysis_types (including assembly)
-running_mode: "cluster"
+running_mode: "laptop"
Member

I think running_mode is not currently used, so it is redundant and can be removed from these params to minimize distraction.

Comment thread docs/installation.md Outdated
 ### Set the cache dir
-Nextflow's cache dir is set to `~/micromamba/other-envs` in the `nextflow.config` file's `conda.cacheDir` parameter. So create a directory at this location the first time.
-_The default cache dir for nextflow created environments is `Somatem/work/..`_; Since I like to delete `work/` frequently to save space, I set this to a different location outside the repo.
+Nextflow's conda cache and downloaded databases default to the Somatem data directory. In a Bioconda or micromamba install this resolves to `$CONDA_PREFIX/var/somatem`, so generated environments and databases stay with the active Somatem environment.
Member

Actually, now that I rethink this, it will cause issues by storing other environments within the Somatem environment. Nextflow activates and switches between them independently, so it is best to leave conda.cacheDir at its default location ($workDir/conda) and mention that power users can change it.

I think the Nextflow config now detects the Somatem home automatically and sets the environments directory from the conda or mamba prefix, so this documentation might need to be updated?
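
For reference, the detection order in the `somatem_home` expression from this PR's nextflow.config diff can be written out in Python. This is only a readable restatement of that Groovy one-liner, not code from the branch. The `resolve_somatem_home` name is hypothetical.

```python
import os


def resolve_somatem_home(env=None):
    """Mirror the precedence of the somatem_home config expression."""
    env = os.environ if env is None else env
    # 1. An explicit SOMATEM_HOME always wins.
    if env.get("SOMATEM_HOME"):
        return env["SOMATEM_HOME"]
    # 2. Otherwise use the active conda prefix, then the mamba root prefix.
    for prefix_var in ("CONDA_PREFIX", "MAMBA_ROOT_PREFIX"):
        if env.get(prefix_var):
            return f"{env[prefix_var]}/var/somatem"
    # 3. Fall back to a dot-directory in the user's home.
    #    (Assumption: an unset HOME yields "/.somatem" here, whereas the
    #    Groovy string interpolation would render it as "null/.somatem".)
    return f"{env.get('HOME', '')}/.somatem"
```

Written this way, it is easier to see that any directory derived from `somatem_home` lands inside the active environment whenever `CONDA_PREFIX` is set, which is the situation the comment above is concerned about.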

Comment thread docs/installation.md
Nextflow's conda cache and downloaded databases default to the Somatem data directory. In a Bioconda or micromamba install this resolves to `$CONDA_PREFIX/var/somatem`, so generated environments and databases stay with the active Somatem environment.

```sh
mkdir -p ~/micromamba/other-envs
```
Member

Not sure if this cache directory is automatically made by nextflow?

  • If not, we should have a setup nextflow module that makes this dir rather than have users make it?


 // Run SingleM pipe on bins - FIXED: Add suffix to avoid filename collision
-SINGLEM_PIPE_BINS(SEMIBIN_SINGLEEASYBIN.out.output_fasta, ch_singlem_db, 'genome')
+SINGLEM_PIPE_BINS(ch_bins_for_downstream, ch_singlem_db, 'genome')
Member

Minor note (for future): could ch_bins_for_downstream be renamed to something more descriptive? It is used in many places in the pipeline, so a self-documenting name would make things clearer.

Comment thread nextflow.config
input = "${projectDir}/assets/samplesheet.csv" // bundled default file should resolve from the installed pipeline root
db_base_dir = "/home/dbs" // change to "./assets/databases" if not using a shared directory
unified_db_base_dir = "/home/Users/pacbio_bakeoff/data/ref_db/refseq03032025" // path to Eddy's unified databases (Ensemble analysis: species detection)
somatem_home = System.getenv('SOMATEM_HOME') ?: ( System.getenv('CONDA_PREFIX') ? "${System.getenv('CONDA_PREFIX')}/var/somatem" : ( System.getenv('MAMBA_ROOT_PREFIX') ? "${System.getenv('MAMBA_ROOT_PREFIX')}/var/somatem" : "${System.getenv('HOME')}/.somatem" ) )
Member

Interesting method!

@ppreshant ppreshant changed the title from "Bioconda cleanup i109" to "Pigeon iterative MAG binning + isolate assembly" May 23, 2026
@ppreshant
Member

@copilot please address my comments from the review in a few commits!

Copilot AI commented May 23, 2026

@copilot please address my comments from the review in a few commits!

Addressed in commit 1c7ca47. I reverted hardcoded local paths in the MAG example sample/config files, removed the unused running_mode entry from assets/user_config/mag_params.yml, and updated docs/installation.md cache guidance to default Nextflow behavior plus override env vars. No UI changes were made in this update.

Copilot AI requested a review from ppreshant May 23, 2026 11:59

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collect important outputs from the whole workflow (with global publish)

3 participants