installation of new cluster doesn't complete #34

@boegel

I've made two attempts this afternoon to create a new CitC on AWS using the one-click installer, but for some reason the installation "hangs".

The management node is being created, and I can SSH into it, but the finish command keeps producing this (with or without a limits.yaml file):

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

The last part in /root/ansible-pull.log is this:

TASK [slurm : open all ports] **************************************************
Friday 19 February 2021  14:19:11 +0000 (0:00:00.045)       0:06:17.021 *******

That was over an hour ago, and there has been no progress since then...
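
To put a number on "no progress", one option is to read the last task header from the log and compare the log's modification time with the current time. The following is only a sketch: `last_task` and `log_idle_seconds` are hypothetical helper names (not part of CitC), and it assumes GNU `stat` and a log that uses the `TASK [...]` header format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking whether an ansible-pull run has stalled.

# Print the name of the last task header in the log,
# e.g. "slurm : open all ports" for the excerpt above.
last_task() {
    grep '^TASK \[' "$1" | tail -n 1 | sed -e 's/^TASK \[//' -e 's/\].*$//'
}

# Print the number of seconds since the log file was last modified.
# Assumes GNU stat (-c %Y prints the mtime as seconds since the epoch).
log_idle_seconds() {
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$1")
    echo $(( now - mtime ))
}
```

Run as root on the management node, `last_task /root/ansible-pull.log` and `log_idle_seconds /root/ansible-pull.log` would show which task was last started and how long the log has been silent.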

/var/log/slurm exists, but it is entirely empty.

Running processes:

root        1515  0.0  1.0 372592 40816 ?        Ss   14:12   0:00 /usr/libexec/platform-python /usr/bin/cloud-init modules --mode=final
root        1997  0.0  0.0 217052   732 ?        S    14:12   0:00  \_ tee -a /var/log/cloud-init-output.log
root        2037  0.0  0.0 235744  3412 ?        S    14:12   0:00  \_ /bin/bash /var/lib/cloud/instance/scripts/part-001
root        4767  0.0  0.9 406240 34832 ?        S    14:12   0:00      \_ /usr/bin/python3 -u /usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
root        9929  7.3  1.6 590508 61548 ?        Sl   14:12   5:24          \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1.com
root       27615  0.0  1.4 583004 54488 ?        S    14:19   0:00              \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1
root       27616  0.0  0.0 235744  3372 ?        S    14:19   0:00                  \_ /bin/sh -c /usr/libexec/platform-python && sleep 0
root       27617  0.0  0.8 415588 30484 ?        S    14:19   0:00                      \_ /usr/libexec/platform-python
dirsrv     17078  0.1  2.1 662068 81740 ?        Ssl  14:14   0:06 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-mgmt -i /run/dirsrv/slapd-mgmt.pid
citc       17138  0.0  0.2  93904  9968 ?        Ss   14:15   0:00 /usr/lib/systemd/systemd --user
citc       17142  0.0  0.1 257440  5068 ?        S    14:15   0:00  \_ (sd-pam)
mysql      21671  0.0  2.4 1776020 93568 ?       Ssl  14:15   0:01 /usr/libexec/mysqld --basedir=/usr
munge      22577  0.0  0.1 125220  4048 ?        Sl   14:17   0:00 /usr/sbin/munged
root       24674  0.0  1.0 509096 41380 ?        Ssl  14:17   0:00 /usr/libexec/platform-python -s /usr/sbin/firewalld --nofork --nopid
root       27703  0.0  0.0 232532  2036 ?        Ss   15:01   0:00 /usr/sbin/anacron -s
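
The deepest child in the tree above (PID 27617, `/usr/libexec/platform-python`) is the process the playbook appears to be waiting on. A possible next step, sketched here under the assumption of a Linux `/proc` filesystem, is to look at that process's state and kernel wait channel. `describe_pid` is a hypothetical helper name, and the output only hints at where the process is blocked; it does not diagnose the cause.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the name, state, parent and kernel wait channel of one PID.
describe_pid() {
    local pid=$1
    # /proc/<pid>/status holds human-readable fields such as Name, State and PPid.
    grep -E '^(Name|State|PPid):' "/proc/$pid/status"
    # wchan may be empty or "0" when the process is not blocked inside the kernel.
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
}
```

For example, `describe_pid 27617` run as root would report on the leaf process from the listing above.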

Any suggestions on how to figure out what went wrong?

Labels: AWS, bug
