For PBS systems (and, with the forthcoming ARCHER2 changes, for Slurm systems too), the monc/misc folder contains a continuation script for running continuation jobs. It uses dependency chains so that each job starts only after the previous job has completed, restarting from the previous job's checkpoint.
A similar script can be written for the ARC4 SGE system using the -hold_jid flag of qsub (single dash), which does the same job as the Slurm --dependency flag.
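As a rough illustration of the SGE approach, the sketch below submits the same job script several times, holding each submission until the previous one has finished. It is not taken from the MONC repository: the function name submit_chain and the job script name are hypothetical, and it assumes qsub is the SGE one, where -terse prints only the job ID and -hold_jid takes the ID of the job to wait for.

```shell
#!/bin/bash
# Sketch only: chain several submissions of one job script on SGE.
# submit_chain and the script name passed to it are hypothetical.

# $1 = job script to submit, $2 = number of jobs in the chain
submit_chain() {
    local script="$1"
    local njobs="$2"
    local jid=""
    local i=1
    while [ "${i}" -le "${njobs}" ]; do
        if [ -n "${jid}" ]; then
            # Hold this job in the queue until the previous one has finished.
            jid=$(qsub -terse -hold_jid "${jid}" "${script}") || return 1
        else
            # First job in the chain: nothing to wait for.
            jid=$(qsub -terse "${script}") || return 1
        fi
        echo "Submitted job ${i}/${njobs} with ID ${jid}"
        i=$((i + 1))
    done
}
```

Calling submit_chain submit_monc.sh 5 (script name hypothetical) would queue five jobs. Note that -hold_jid releases the next job when the previous one ends, whether or not it succeeded, so the job script itself still needs checks like those shown further down to stop the chain once the run is complete or no longer making progress.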
The job submission script used by @craigpoku provides this functionality with a function that checks whether the run has already completed. That script and monc/misc/continuation.sh would be a good place to start. The relevant code is below:
# --- Checks:

# Check for run completion message in MONC output file:
function check_complete() {
  if [ -r "${MONC_OUT}" ] ; then
    if grep -q 'Model run complete due to model time' "${MONC_OUT}" ; then
      echo 'MONC run appears to have completed (exceeded termination time)'
      # Display end time:
      echo "END TIME: $(date)"
      exit 0
    fi
  fi
}
check_complete

# Check for previous checkpoint file
# (use the last match if the output file mentions several):
if [ -r "${MONC_OUT}" ] ; then
  PREV_CKPT_FILE=$(basename "$(grep \
    'Restarted configuration from checkpoint file' \
    "${MONC_OUT}" | grep -E -o '[0-9a-zA-Z_/-]+\.nc' | tail -n 1)" \
    2> /dev/null)
fi

# Check for most recent existing checkpoint file:
CKPT_FILE=$(basename "$(\ls -1v "${CKPT_DIR}" 2> /dev/null | tail -n 1)" 2> /dev/null)

# If current checkpoint file is same as previous, give up:
if [ -n "${PREV_CKPT_FILE}" ] && [ -n "${CKPT_FILE}" ] ; then
  if [ "${PREV_CKPT_FILE}" = "${CKPT_FILE}" ] ; then
    echo "Previous checkpoint file is same as current (${CKPT_FILE})"
    # Display end time:
    echo "END TIME: $(date)"
    exit 1
  fi
fi

# If we have a checkpoint file, restart MONC, else, start from config:
if [ -n "${CKPT_FILE}" ] ; then
  MONC_ARGS="--checkpoint=${CKPT_DIR}/${CKPT_FILE}"
else
  MONC_ARGS="--config=${MONC_CONFIG}"
fi
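
The restart decision at the end of the script can also be wrapped in a small function, which makes it easy to try out on a scratch directory before using it in a batch job. This is a minimal sketch, not part of the original script: the function name monc_args_for and the file names in the usage note are hypothetical, and it assumes, as the ls -1v call above does, that version-sorted file names put the newest checkpoint last.

```shell
#!/bin/bash
# Sketch only: choose how MONC should be started for this job.
# monc_args_for is a hypothetical helper, not part of the MONC scripts.

# $1 = checkpoint directory, $2 = MONC configuration file
# Prints --checkpoint=... if the directory holds any file,
# otherwise prints --config=...
monc_args_for() {
    local ckpt_dir="$1"
    local config="$2"
    local latest
    # Version sort (-v) so that e.g. dump_10.nc sorts after dump_2.nc.
    latest=$(\ls -1v "${ckpt_dir}" 2> /dev/null | tail -n 1)
    if [ -n "${latest}" ]; then
        echo "--checkpoint=${ckpt_dir}/${latest}"
    else
        echo "--config=${config}"
    fi
}
```

In a job script this could be used as MONC_ARGS=$(monc_args_for "${CKPT_DIR}" "${MONC_CONFIG}"). Like the original, it treats every file in the checkpoint directory as a checkpoint, so that directory should contain nothing else.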