Model improvements#2
Conversation
Updated task type descriptions and example gene/network file paths in README.
- GraphSAGE: remove activation+dropout after output layer (distorted logits) - GIN: forward() now uses all conv layers dynamically instead of hardcoding 2 - GraphTransformer: fix dimension mismatch for num_layers > 2 - GCNII: fix loop variable shadowing, pass theta through instead of hardcoding None - Evaluate test metric every epoch for accurate best-metric tracking - Replace bare except clauses with except Exception - hyperopt: remove broken sys.path.append, dead code, fix n_jobs default to 1
There was a problem hiding this comment.
Pull request overview
This pull request introduces comprehensive "Model improvements" that expand the GNN-Suite pipeline to support multiple task types (binary, multiclass, regression), adds MLflow experiment tracking, implements hyperparameter optimization with Optuna, and improves the overall testing and documentation infrastructure.
Changes:
- Added support for multiclass classification and regression tasks alongside binary classification
- Integrated MLflow for experiment tracking with optional configuration
- Enhanced hyperparameter optimization using YAML-based configuration and Optuna
- Expanded CI/CD with comprehensive integration tests for all task types
- Improved documentation with detailed parameter references and examples
Reviewed changes
Copilot reviewed 7 out of 17 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| testdata/sfari_multiclass_genes.csv | New test dataset with 1256 genes for multiclass classification |
| nextflow.config | Added parameters for task_type, MLflow integration, num_heads, and plotting control |
| mlflow_server.sh | Script to start MLflow tracking server with SQLite backend |
| main.nf | Enhanced workflow to support multiple task types and dynamic hyperparameter loading |
| conf/hyperparams.yaml | Centralized hyperparameter search space configuration |
| bin/plot.py | Updated to handle task-specific metrics and auto-detect available metrics |
| bin/models.py | Fixed GIN concatenation logic and improved GCNII implementation |
| bin/load_hyperparams.py | New script to load optimized hyperparameters or use defaults |
| bin/hyperopt.py | Refactored with YAML-driven configuration and task type support |
| bin/gnn.py | Major refactoring to support all three task types with appropriate losses and metrics |
| README_mlflow.md | Quick start guide for MLflow integration |
| README.md | Comprehensive documentation updates with parameter tables and examples |
| .gitignore | Added MLflow-related directories |
| .github/workflows/ci.yml | Extended with integration tests for all task types and hyperparameter optimization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| - name: Integration Test - Multiclass Classification | ||
| run: | | ||
| nextflow run . -profile test,ci,docker \ | ||
| --task-type multiclass \ |
There was a problem hiding this comment.
Parameter name inconsistency: Use --task_type (with underscore) instead of --task-type (with hyphen) to match the workflow parameter definition in nextflow.config.
|
|
||
| - name: Integration Test - Regression | ||
| run: | | ||
| nextflow run . -profile test,ci,docker \ |
There was a problem hiding this comment.
Parameter name inconsistency: Use --task_type (with underscore) instead of --task-type (with hyphen) to match the workflow parameter definition in nextflow.config.
| - name: Integration Test - MLflow Enabled | ||
| run: | | ||
| nextflow run . -profile test,ci,docker \ | ||
| --task-type binary \ |
There was a problem hiding this comment.
Parameter name inconsistency: Use --task_type (with underscore) instead of --task-type (with hyphen) to match the workflow parameter definition in nextflow.config.
| --geneFile testdata/multiclass.csv \ | ||
| --networkFile testdata/multiclass_network.tsv \ | ||
| --task_type multiclass | ||
|
|
||
| # Regression | ||
| nextflow run main.nf -profile docker \ | ||
| --geneFile testdata/regression.csv \ | ||
| --networkFile testdata/regression_network.tsv \ |
There was a problem hiding this comment.
Incorrect file paths in documentation: The test data files referenced here don't exist. According to the testdata directory listing, the actual multiclass file is sfari_multiclass_genes.csv, not multiclass.csv. Similarly for the network file and regression files. Update these examples to use the correct file paths that exist in the repository.
| --geneFile testdata/multiclass.csv \ | |
| --networkFile testdata/multiclass_network.tsv \ | |
| --task_type multiclass | |
| # Regression | |
| nextflow run main.nf -profile docker \ | |
| --geneFile testdata/regression.csv \ | |
| --networkFile testdata/regression_network.tsv \ | |
| --geneFile testdata/sfari_multiclass_genes.csv \ | |
| --networkFile testdata/sfari_multiclass_network.tsv \ | |
| --task_type multiclass | |
| # Regression | |
| nextflow run main.nf -profile docker \ | |
| --geneFile testdata/sfari_regression_genes.csv \ | |
| --networkFile testdata/sfari_regression_network.tsv \ |
| --geneFile testdata/genes.csv \ | ||
| --networkFile testdata/multiclass_network.tsv \ | ||
| --task_type multiclass | ||
|
|
||
| # Regression | ||
| nextflow run main.nf -profile docker -entry hyperopt \ | ||
| --geneFile testdata/genes.csv \ | ||
| --networkFile testdata/regression_network.tsv \ |
There was a problem hiding this comment.
Incorrect file paths in documentation: The multiclass and regression examples reference non-existent files. The correct file paths should be testdata/sfari_multiclass_genes.csv with testdata/sfari_multiclass_network.tsv for multiclass, and testdata/sfari_regression_genes.csv with testdata/sfari_regression_network.tsv for regression. Update these examples to match the actual test data files in the repository.
| --geneFile testdata/genes.csv \ | |
| --networkFile testdata/multiclass_network.tsv \ | |
| --task_type multiclass | |
| # Regression | |
| nextflow run main.nf -profile docker -entry hyperopt \ | |
| --geneFile testdata/genes.csv \ | |
| --networkFile testdata/regression_network.tsv \ | |
| --geneFile testdata/sfari_multiclass_genes.csv \ | |
| --networkFile testdata/sfari_multiclass_network.tsv \ | |
| --task_type multiclass | |
| # Regression | |
| nextflow run main.nf -profile docker -entry hyperopt \ | |
| --geneFile testdata/sfari_regression_genes.csv \ | |
| --networkFile testdata/sfari_regression_network.tsv \ |
| @@ -0,0 +1,7 @@ | |||
| # Start MLflow tracking server | |||
There was a problem hiding this comment.
Missing shebang: Bash scripts should start with #!/bin/bash or #!/usr/bin/env bash to ensure proper execution. Add this as the first line of the file.
| - name: Integration Test - Binary Classification | ||
| run: | | ||
| nextflow run . -profile test,ci,docker \ | ||
| --task-type binary \ |
There was a problem hiding this comment.
Parameter name inconsistency: This test uses --task-type (with hyphen) but the workflow definition in nextflow.config defines it as task_type (with underscore). Nextflow parameters should use underscores, not hyphens. Change to --task_type to match the configuration.
| eval \$(load_hyperparams.py ${model} ${dataSet} ${resultsDir} \ | ||
| ${params.learning_rate} \ | ||
| ${params.weight_decay} \ | ||
| ${params.dropout} \ | ||
| ${params.alpha} \ | ||
| ${params.theta} \ | ||
| ${params.num_heads}) | ||
|
|
||
| echo "# Using hyperparameters for ${model}: LR=\${LEARNING_RATE}, WD=\${WEIGHT_DECAY}, DROPOUT=\${DROPOUT}, ALPHA=\${ALPHA}, THETA=\${THETA}, NUM_HEADS=\${NUM_HEADS}" >&2 | ||
|
|
||
| gnn.py ${geneFile} ${networkFile} \ | ||
| --train-size ${params.train_size} \ | ||
| --model-name ${model} \ | ||
| --learning-rate ${params.learning_rate} \ | ||
| --weight-decay ${params.weight_decay} \ | ||
| --learning-rate \${LEARNING_RATE} \ | ||
| --weight-decay \${WEIGHT_DECAY} \ | ||
| --epochs ${epoch} \ | ||
| --eval-threshold ${params.eval_threshold} \ | ||
| --verbose-interval ${params.verbose_interval} \ | ||
| --dropout ${params.dropout} \ | ||
| --alpha ${params.alpha} \ | ||
| --theta ${params.theta} \ | ||
| > full-${model}-${epoch}-run-${run}.txt | ||
|
|
||
| split_data.py full-${model}-${epoch}-run-${run}.txt | ||
| --dropout \${DROPOUT} \ | ||
| --alpha \${ALPHA} \ | ||
| --theta \${THETA} \ | ||
| --num-heads \${NUM_HEADS} \ | ||
| --task-type ${params.task_type} \ | ||
| > full-${model}-${epoch}-run-${run}-${params.task_type}.txt |
There was a problem hiding this comment.
The TrainGNN process builds a shell command using eval $(load_hyperparams.py ${model} ${dataSet} ${resultsDir} ...) with unquoted parameters, which allows command injection if any of model, dataSet, or resultsDir (or other CLI params like learning_rate) contain shell metacharacters such as ; or $(...). An attacker who can control these workflow parameters (e.g., via --dataSet, --models, or a crafted --geneFile basename) can execute arbitrary commands in the task environment, leading to full compromise of the worker node or container. Replace eval with a safer mechanism (e.g., parsing the hyperparameter script output without shell evaluation, or sourcing a strictly validated key=value file) and ensure all interpolated parameters passed to shell commands are safely quoted or avoided by using direct process invocation.
| mlflow server \ | ||
| --backend-store-uri sqlite:///mlflow.db \ | ||
| --default-artifact-root ./mlflow-artifacts \ | ||
| --host 0.0.0.0 \ |
There was a problem hiding this comment.
The MLflow tracking server is started bound to 0.0.0.0 over plain HTTP with no authentication, which exposes experiment metadata and artifacts to anyone on the reachable network. In multi-user or shared environments this can allow unauthorized users to read or tamper with experiment data and models. Restrict the bind address to 127.0.0.1 or a trusted interface, and place MLflow behind authenticated, TLS-terminating infrastructure (or enable MLflow’s security features) before using this script in non-local setups.
| mlflow server \ | |
| --backend-store-uri sqlite:///mlflow.db \ | |
| --default-artifact-root ./mlflow-artifacts \ | |
| --host 0.0.0.0 \ | |
| MLFLOW_HOST="${MLFLOW_HOST:-127.0.0.1}" | |
| mlflow server \ | |
| --backend-store-uri sqlite:///mlflow.db \ | |
| --default-artifact-root ./mlflow-artifacts \ | |
| --host "${MLFLOW_HOST}" \ |
No description provided.