
Conversation


@sainsachiko sainsachiko commented Nov 25, 2025

Add pbmarkdup, which identifies and marks duplicate reads in PacBio HiFi (CCS) data.
#9456
Thank you for reviewing!

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@muffato muffato removed their request for review November 26, 2025 10:17
@sainsachiko sainsachiko requested a review from mashehu November 26, 2025 11:06

@DLBPointon DLBPointon left a comment


Well done!

I just have a few comments to flesh it out more.

It'll mean adding another test, sorry. Feel free to get another opinion though.

- pbmarkdup:
description: |
pbmarkdup identifies and marks duplicate reads in PacBio HiFi (CCS) data. It clusters
highly similar CCS reads to detect PCR duplicates and flags them in the BAM output

Change the bit about it being BAM output, as there are 3 formats it can output.


output:
tuple val(meta), path("${prefix}.${suffix}"), emit: markduped
path "versions.yml" , emit: versions
@DLBPointon DLBPointon Nov 26, 2025


I feel like the --dup-file dups.fasta flag needs better support, otherwise the duplicates file won't be captured and emitted as an output.

Perhaps an output channel, and something to capture the name the user will use in the config?

output:
tuple val(meta), path("${dup_file}"), optional: true, emit: duplicates

script:

// This little chunk is solely to have a string to give to the output tuple;
// it's not needed in the script itself as it already exists in the args.
def matcher = (task.ext.args =~ /--dup-file\s+(\S+)/)
dup_file = matcher.find() ? matcher[0][1] : ""

"""
    pbmarkdup \\
        -j ${task.cpus} \\
        $input \\
        ${prefix}.${suffix} \\
        $args
"""

Feel free to get another reviewer here. Otherwise, you can't capture the dups file unless you simplify your existing output by removing the suffix, and then you'd have to deal with a tuple(meta, file, file) in the workflow.
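For illustration only, the flag-extraction idea above can be sketched in Python (the module itself would stay in Groovy; the args strings below are hypothetical examples, not real pipeline config):

```python
import re


def extract_dup_file(ext_args: str) -> str:
    """Return the filename passed to --dup-file, or "" if the flag is absent."""
    match = re.search(r"--dup-file\s+(\S+)", ext_args)
    return match.group(1) if match else ""


# Hypothetical task.ext.args strings:
print(extract_dup_file("--log-level INFO --dup-file dups.fasta"))  # dups.fasta
print(extract_dup_file("--log-level INFO"))                        # empty string
```

Returning an empty string when the flag is missing is what lets the `optional: true` output channel stay empty instead of failing.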


script:
def args = task.ext.args ?: ''


pbmarkdup can take a list of files as input, so I think you may just need to do some double checks to make sure names don't conflict.

Make sure input and output file names don't conflict, which seems like it could happen.

Adapted from CAT_CAT

    if (file_list.contains("${prefix}.${suffix}")) {
        error "PBMARKDUP: The name of the input file can't be the same as the output. " +
        "Change the prefix to avoid the conflict."
    }

As it can take a list of input files, perhaps a check to make sure they are in fact unique files too? However, I think this can be optional, as it should be checked in the input_check or workflow of the pipeline.

    def input_files = input.collect { it.baseName }
    if (input_files.size() != input_files.unique(false).size()) {
        error "PBMARKDUP: Input files must have unique names. Found duplicates: ${input_files}. " +
        "Check your input reads."
    }
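As a rough Python stand-in for that uniqueness check (the real check stays in Groovy; the file paths below are made up), comparing base names without extensions mirrors Groovy's `baseName`:

```python
from pathlib import Path


def check_unique_basenames(inputs: list) -> None:
    """Raise if any two input files share a base name (extension stripped)."""
    names = [Path(p).stem for p in inputs]
    if len(names) != len(set(names)):
        raise ValueError(
            f"PBMARKDUP: Input files must have unique names. Found: {names}"
        )


check_unique_basenames(["a/reads1.bam", "b/reads2.bam"])    # passes silently
# check_unique_basenames(["a/reads.bam", "b/reads.fastq"])  # would raise:
# the base names collide even though the extensions differ
```

Note that comparing stems rather than full names also catches `reads.bam` vs `reads.fastq`, which would otherwise collide once the tool writes per-input intermediates.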


Thanks @DLBPointon for kindly reviewing this. I have updated the code regarding your reviews, please take a look, thank you!
