On a CentOS 7 system:

[hamzy@oscloud5 ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.3.1611 (Core) 
Release:        7.3.1611
Codename:       Core
[stack@oscloud5 ~]$ uname -a
Linux oscloud5.stglabs.ibm.com 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately, it seems that Environment setup for a baremetal environment does not explain how to install the undercloud. There are three machines in this scenario:

arch     use                portname1   MAC1               IP1            portname2   MAC2               IP2
x86_64   undercloud         eno2        6c:ae:8b:29:2a:02  9.114.219.30   eno4        6c:ae:8b:29:2a:04  9.114.118.98
ppc64le  overcloud control  eth3        00:0a:f7:73:3c:c3  9.114.219.134  eth2        00:0a:f7:73:3c:c2  9.114.118.156
ppc64le  overcloud compute  enP3p5s0f2  00:90:fa:74:05:52  9.114.219.49   enP3p5s0f3  00:90:fa:74:05:53  9.114.118.154

So, following Undercloud installation, I perform the following:

[hamzy@oscloud5 ~]$ sudo useradd stack
[hamzy@oscloud5 ~]$ sudo passwd stack
[hamzy@oscloud5 ~]$ echo "stack ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/stack
[hamzy@oscloud5 ~]$ sudo chmod 0440 /etc/sudoers.d/stack
[hamzy@oscloud5 ~]$ sudo su - stack
[stack@oscloud5 ~]$ sudo hostnamectl set-hostname oscloud5.stglabs.ibm.com
[stack@oscloud5 ~]$ sudo hostnamectl set-hostname --transient oscloud5.stglabs.ibm.com
[stack@oscloud5 ~]$ sudo curl -L -o /etc/yum.repos.d/delorean.repo https://trunk.rdoproject.org/centos7-master/current-passed-ci/delorean.repo
[stack@oscloud5 ~]$ sudo curl -L -o /etc/yum.repos.d/delorean-deps.repo https://trunk.rdoproject.org/centos7/delorean-deps.repo
[stack@oscloud5 ~]$ sudo yum install -y python-tripleoclient
[stack@oscloud5 ~]$ cp /usr/share/instack-undercloud/undercloud.conf.sample ~/undercloud.conf
[stack@oscloud5 ~]$ cat << '__EOF__' > instackenv.json
{
    "nodes": [
        {
            "pm_type":"pxe_ipmitool",
            "mac":[
                "00:0a:f7:73:3c:c2"
            ],
            "cpu":"16",
            "memory":"1048576",
            "disk":"1000",
            "arch":"ppc64le",
            "pm_password":"update",
            "pm_addr":"9.114.118.157"
        },
        {
            "pm_type":"pxe_ipmitool",
            "mac":[
                "00:90:fa:74:05:53"
            ],
            "cpu":"16",
            "memory":"1048576",
            "disk":"1000",
            "arch":"ppc64le",
            "pm_password":"update",
            "pm_addr":"9.114.118.155"
        }
    ]
}
__EOF__
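
Before importing the nodes it is worth making sure the file is well-formed JSON; a quick sanity check (using the Python that ships with CentOS 7) is:

[stack@oscloud5 ~]$ python -m json.tool instackenv.json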

I transfer over the built overcloud images:

[hamzy@pkvmci853 ~]$ (OCB=$(dig @192.168.122.1 -4 +short Overcloud.virbr0); UC=9.114.118.98; ssh-keygen -f ~/.ssh/known_hosts -R ${UC}; ssh-keyscan ${UC} >> ~/.ssh/known_hosts; scp -3 hamzy@${OCB}:~/*{initrd,initramfs,kernel,vmlinuz,qcow2}* stack@${UC}:~/)
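
To confirm the images actually landed on the undercloud (this simply re-uses the same glob as the copy above), something like the following can be run as the stack user:

[stack@oscloud5 ~]$ ls -lh ~/*{initrd,initramfs,kernel,vmlinuz,qcow2}*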

I then modify undercloud.conf as follows:

[stack@oscloud5 ~]$ cat << __EOF__ | patch -p0
--- undercloud.conf.orig        2017-08-25 12:04:54.935063830 +0000
+++ undercloud.conf 2017-08-25 12:05:17.561063576 +0000
@@ -17,21 +17,25 @@
 # defined by local_interface, with the netmask defined by the prefix
 # portion of the value. (string value)
 #local_ip = 192.168.24.1/24
+local_ip = 9.114.118.98/24
 
 # Network gateway for the Neutron-managed network for Overcloud
 # instances. This should match the local_ip above when using
 # masquerading. (string value)
 #network_gateway = 192.168.24.1
+network_gateway = 9.114.118.98
 
 # Virtual IP or DNS address to use for the public endpoints of
 # Undercloud services. Only used with SSL. (string value)
 # Deprecated group/name - [DEFAULT]/undercloud_public_vip
 #undercloud_public_host = 192.168.24.2
+undercloud_public_host = 9.114.118.98
 
 # Virtual IP or DNS address to use for the admin endpoints of
 # Undercloud services. Only used with SSL. (string value)
 # Deprecated group/name - [DEFAULT]/undercloud_admin_vip
 #undercloud_admin_host = 192.168.24.3
+undercloud_admin_host = 9.114.118.98
 
 # DNS nameserver(s) to use for the undercloud node. (list value)
 #undercloud_nameservers =
@@ -74,6 +78,7 @@
 # Network interface on the Undercloud that will be handling the PXE
 # boots and DHCP for Overcloud instances. (string value)
 #local_interface = eth1
+local_interface = eno4
 
 # MTU to use for the local_interface. (integer value)
 #local_mtu = 1500
@@ -82,18 +87,22 @@
 # instances. This should be the subnet used for PXE booting. (string
 # value)
 #network_cidr = 192.168.24.0/24
+network_cidr = 9.114.118.0/24
 
 # Network that will be masqueraded for external access, if required.
 # This should be the subnet used for PXE booting. (string value)
 #masquerade_network = 192.168.24.0/24
+masquerade_network = 9.114.118.0/24
 
 # Start of DHCP allocation range for PXE and DHCP of Overcloud
 # instances. (string value)
 #dhcp_start = 192.168.24.5
+dhcp_start = 9.114.118.240
 
 # End of DHCP allocation range for PXE and DHCP of Overcloud
 # instances. (string value)
 #dhcp_end = 192.168.24.24
+dhcp_end = 9.114.118.248
 
 # Path to hieradata override file. If set, the file will be copied
 # under /etc/puppet/hieradata and set as the first file in the hiera
@@ -112,12 +121,14 @@
 # doubt, use the default value. (string value)
 # Deprecated group/name - [DEFAULT]/discovery_interface
 #inspection_interface = br-ctlplane
+inspection_interface = br-ctlplane
 
 # Temporary IP range that will be given to nodes during the inspection
 # process.  Should not overlap with the range defined by dhcp_start
 # and dhcp_end, but should be in the same network. (string value)
 # Deprecated group/name - [DEFAULT]/discovery_iprange
 #inspection_iprange = 192.168.24.100,192.168.24.120
+inspection_iprange = 9.114.118.249,9.114.118.250
 
 # Whether to enable extra hardware collection during the inspection
 # process. Requires python-hardware or python-hardware-detect package
__EOF__
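
To double-check which values are now active (only the uncommented settings matter), the non-comment lines can be listed with something like:

[stack@oscloud5 ~]$ grep -E '^[a-z_]+ *=' undercloud.conf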

And install the undercloud:

[stack@oscloud5 ~]$ time openstack undercloud install 2>&1 | tee output.undercloud.install
...
Undercloud install complete.
...

There is a bug where a userid is required for machines using IPMI, which needs to be patched around.

[stack@oscloud5 ~]$ (cd /usr/lib/python2.7/site-packages/tripleo_common/utils/; cat << __EOF__ | sudo patch -p0)
--- nodes.py.orig       2017-08-24 15:54:07.614226329 +0000
+++ nodes.py    2017-08-24 15:54:29.699440619 +0000
@@ -105,7 +105,7 @@
             'pm_user': '%s_username' % prefix,
             'pm_password': '%s_password' % prefix,
         }
-        mandatory_fields = list(mapping)
+        mandatory_fields = ['pm_addr', 'pm_password'] # list(mapping)
 
         if has_port:
             mapping['pm_port'] = '%s_port' % prefix
__EOF__
[stack@oscloud5 ~]$ (for SERVICE in openstack-mistral-api.service openstack-mistral-engine.service openstack-mistral-executor.service; do sudo systemctl restart ${SERVICE}; done)
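
It is worth confirming the Mistral services came back up after the restart, for example:

[stack@oscloud5 ~]$ (for SERVICE in openstack-mistral-api.service openstack-mistral-engine.service openstack-mistral-executor.service; do systemctl is-active ${SERVICE}; done)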

Ironic needs some different settings to be able to support PXE for ppc64le:

[stack@oscloud5 ~]$ (cd /etc/ironic; cat << '__EOF__' | sudo patch -p0)
--- ironic.conf.orig    2017-09-11 17:46:28.760794196 +0000
+++ ironic.conf 2017-09-11 17:49:55.637796731 +0000
@@ -343,6 +343,7 @@
 # for this option to be unset. (string value)
 # Allowed values: debug, info, warning, error, critical
 #notification_level = <None>
+notification_level = debug
 
 # Directory where the ironic python module is installed.
 # (string value)
@@ -3512,6 +3513,7 @@
 # configuration per node architecture. For example:
 # aarch64:/opt/share/grubaa64_pxe_config.template (dict value)
 #pxe_config_template_by_arch =
+pxe_config_template_by_arch = ppc64le:$pybasedir/drivers/modules/pxe_config.template
 
 # IP address of ironic-conductor node's TFTP server. (string
 # value)
@@ -3551,10 +3553,11 @@
 # Bootfile DHCP parameter per node architecture. For example:
 # aarch64:grubaa64.efi (dict value)
 #pxe_bootfile_name_by_arch =
+pxe_bootfile_name_by_arch = ppc64le:config
 
 # Enable iPXE boot. (boolean value)
 #ipxe_enabled = false
-ipxe_enabled=True
+ipxe_enabled = false
 
 # On ironic-conductor node, the path to the main iPXE script
 # file. (string value)
__EOF__
[stack@oscloud5 ~]$ for I in openstack-ironic-conductor.service openstack-ironic-inspector.service openstack-ironic-inspector-dnsmasq.service; do sudo systemctl restart ${I}; done
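
To verify that the Ironic settings took effect, the options set in the patch above can be checked directly in the file:

[stack@oscloud5 ~]$ sudo grep -E '^(notification_level|pxe_config_template_by_arch|pxe_bootfile_name_by_arch|ipxe_enabled)' /etc/ironic/ironic.conf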

I then go through the process of installing the overcloud:

[stack@oscloud5 ~]$ source stackrc
(undercloud) [stack@oscloud5 ~]$ time openstack overcloud image upload
...

The overcloud-full qcow2 image needs to be recreated in glance so that it loses both the kernel_id and the ramdisk_id. This way a full disk image can be deployed.

(undercloud) [stack@oscloud5 ~]$ (FILE="overcloud-full.qcow2"; UUID=$(openstack image list -f value | grep 'overcloud-full ' | awk '{print $1;}'); openstack image delete ${UUID}; openstack image create --container-format bare --disk-format qcow2 --min-disk 0 --min-ram 0 --file ${FILE} --public overcloud-full)
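
To confirm that the recreated image no longer carries the kernel_id and ramdisk_id properties, grep the image details (no output means they are gone):

(undercloud) [stack@oscloud5 ~]$ openstack image show overcloud-full | grep -iE 'kernel_id|ramdisk_id'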

Now import the baremetal nodes and assign them profiles.

(undercloud) [stack@oscloud5 ~]$ time openstack overcloud node import --provide instackenv.json 2>&1 | tee output.overcloud.node.import
...
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
| ff2fdac5-6cc5-47a9-a095-d942b3960795 |           | available       | None            |                   |
| ef3d7b3b-97b8-42ab-b501-896474df658f |           | available       | None            |                   |
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
(undercloud) [stack@oscloud5 ~]$ (COMPUTE=""; CONTROL=""; while IFS=$' ' read -r -a PROFILES; do if [ -z "${COMPUTE}" ]; then COMPUTE=${PROFILES[0]}; ironic node-update ${COMPUTE} replace properties/capabilities=profile:compute,boot_option:local; continue; fi; if [ -z "${CONTROL}" ]; then CONTROL=${PROFILES[0]}; ironic node-update ${CONTROL} replace properties/capabilities=profile:control,boot_option:local; continue; fi; done < <(openstack overcloud profiles list -f value))
(undercloud) [stack@oscloud5 ~]$ openstack overcloud profiles list
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
| ff2fdac5-6cc5-47a9-a095-d942b3960795 |           | available       | compute         |                   |
| ef3d7b3b-97b8-42ab-b501-896474df658f |           | available       | control         |                   |
+--------------------------------------+-----------+-----------------+-----------------+-------------------+
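
For reference, the profile-assignment one-liner above is equivalent to the following, more readable form (it assigns the compute profile to the first node listed and the control profile to the second):

COMPUTE=""
CONTROL=""
while IFS=$' ' read -r -a PROFILES; do
    # First node in the list becomes the compute node
    if [ -z "${COMPUTE}" ]; then
        COMPUTE=${PROFILES[0]}
        ironic node-update ${COMPUTE} replace properties/capabilities=profile:compute,boot_option:local
        continue
    fi
    # Second node in the list becomes the control node
    if [ -z "${CONTROL}" ]; then
        CONTROL=${PROFILES[0]}
        ironic node-update ${CONTROL} replace properties/capabilities=profile:control,boot_option:local
        continue
    fi
done < <(openstack overcloud profiles list -f value)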

Patch the openstack-tripleo-heat-templates locally with our configuration and then do the deploy.

(undercloud) [stack@oscloud5 ~]$ cp -r /usr/share/openstack-tripleo-heat-templates templates
(undercloud) [stack@oscloud5 ~]$ (cd templates/; curl --silent -o - https://hamzy.fedorapeople.org/openstack-tripleo-heat-templates.patch | patch -p1)
(undercloud) [stack@oscloud5 ~]$ time openstack overcloud deploy --debug --templates /home/stack/templates -e /home/stack/templates/environments/network-environment.yaml -e /home/stack/templates/environments/network-isolation-custom.yaml --control-scale 1 --compute-scale 1 --control-flavor control --compute-flavor compute 2>&1 | tee output.overcloud.deploy

You will now see the following error:

...
 Stack overcloud CREATE_FAILED

Heat Stack create failed.
Heat Stack create failed.
overcloud.AllNodesDeploySteps.ControllerDeployment_Step3.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: dab03cd8-439c-43cc-acf5-885c31ad9541
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
        ],
        "changed": false,
        "failed": true,
        "failed_when_result": true
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/60ae6a13-927f-4f6a-938b-af8df97ac61a_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=3    changed=1    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |
...
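
The full failure details can also be pulled out of Heat (assuming the python-heatclient in this release provides the failures subcommand):

(undercloud) [stack@oscloud5 ~]$ openstack stack failures list --long overcloud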

Debugging further, it seems to be a haproxy/corosync issue:

(undercloud) [stack@oscloud5 ~]$ openstack port list
...
| 2a7939b9-8f31-43c5-ad9a-c790ea445718 | control_virtual_ip | fa:16:3e:42:5d:85 | ip_address='9.114.118.250', subnet_id='1a69d433-0911-4ae0-8c4e-370b3fdfff7d' | DOWN   |
...
(undercloud) [stack@oscloud5 ~]$ (NODE="overcloud-controller-0"; LINE=$(openstack server list -f value | grep ${NODE}) || exit; IP=$(echo ${LINE} | sed -r -e 's,^.*ctlplane=([0-9.]*) .*$,\1,'); ssh-keygen -f ~/.ssh/known_hosts -R ${IP}; ssh-keyscan ${IP} >> ~/.ssh/known_hosts; ssh heat-admin@${IP})
[heat-admin@overcloud-controller-0 ~]$ sudo su -
[root@overcloud-controller-0 ~]# cat /var/log/messages
...
Oct 23 00:23:06 localhost os-collect-config: "Notice: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]/returns: DBConnectionError: (pymysql.err.OperationalError) (2003, \"Can't connect to MySQL server on '9.114.118.250' ([Errno 113] EHOSTUNREACH)\")"
Oct 23 00:23:06 localhost os-collect-config: ],
Oct 23 00:23:06 localhost os-collect-config: "changed": false,
Oct 23 00:23:06 localhost os-collect-config: "failed": true,
Oct 23 00:23:06 localhost os-collect-config: "failed_when_result": true
Oct 23 00:23:06 localhost os-collect-config: }
Oct 23 00:23:06 localhost os-collect-config: to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/60ae6a13-927f-4f6a-938b-af8df97ac61a_playbook.retry
...
[root@overcloud-controller-0 ~]# ip a | grep 9.114.118.250
[root@overcloud-controller-0 ~]# systemctl status -l corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-10-23 20:08:30 UTC; 1h 28min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 70353 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + return 1
Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + '[' -x /bin/plymouth ']'
Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + return 0
Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + rtrn=1
Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + echo
Oct 23 20:08:30 overcloud-controller-0 corosync[70353]: + exit 1
Oct 23 20:08:30 overcloud-controller-0 systemd[1]: corosync.service: control process exited, code=exited status=1
Oct 23 20:08:30 overcloud-controller-0 systemd[1]: Failed to start Corosync Cluster Engine.
Oct 23 20:08:30 overcloud-controller-0 systemd[1]: Unit corosync.service entered failed state.
Oct 23 20:08:30 overcloud-controller-0 systemd[1]: corosync.service failed.
[root@overcloud-controller-0 ~]# systemctl start corosync.service
Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.
Error: unable to start corosync
[root@overcloud-controller-0 ~]# corosync -f; echo $?
8

That does not seem to be an error code. According to:

http://www.linux-ha.org/doc/dev-guides/_literal_ocf_running_master_literal_8.html

"The resource was found to be running in the Master role. This applies only to stateful (Master/Slave) resources, and only to their monitor action."