This question is unrelated to the one I just asked about the OpenStack metadata server. That is another setup, on other machines, and the two are clearly not connected to one another.
I set up OpenStack on 8 nodes (gn008..gn015), where all nodes are compute (libvirt/kvm), network (linuxbridge), and storage (lvm) nodes; gn011 additionally runs all OpenStack administrative services. I occasionally had issues when the /var/log partition got full, especially on gn011, but nothing that removing big log files and restarting the associated daemons would not fix.
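As an aside, the cleanup itself was nothing fancy; a minimal sketch of the kind of thing I mean (the log path and service name are just examples, the actual offender varied):

du -ah /var/log | sort -rh | head -n 20      # spot the biggest logs
truncate -s 0 /var/log/cinder/volume.log     # truncating in place keeps the daemon's open file handle valid
systemctl restart openstack-cinder-volume    # restart whichever daemon owns the log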
Now the volume service part of OpenStack fails when creating any new volume, even a blank one. To rule out a lack of storage space, I removed a few virtual machines and their associated volumes; but volume deletion also fails. Now I find myself with volumes attached to non-existent virtual machines (see volume block1 in the trace below):
[root@gn011 ~]# openstack volume list
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
| 217d9087-3175-4565-91f9-dcca2e1be383 | cpu1_instances | in-use | 50 | Attached to cpu1 on /dev/vdb |
| dde062b6-0fc6-4e76-b936-1d2cfed14af4 | cpu1 | in-use | 16 | Attached to cpu1 on /dev/vda |
| c7d11144-278e-4785-9024-a685a0406215 | block1_volumes | available | 50 | |
| 67d7882d-903a-4e2a-b386-5d0af65b6c65 | block1 | in-use | 16 | Attached to c0866654-fcc5-48f1-a446-1c33e518a10e on /dev/vda |
| 680c085c-2959-45ee-85bf-b249b8f0a6bd | block0_volumes | in-use | 50 | Attached to 399aa8dd-9aea-4059-bed7-eb66209813f9 on /dev/vdb |
| 9a272ff6-c443-49d6-befd-730d1635d6eb | block0 | in-use | 16 | Attached to 399aa8dd-9aea-4059-bed7-eb66209813f9 on /dev/vda |
| 1e716bcc-a381-4da6-80c6-44572437f610 | designate01 | in-use | 16 | Attached to 5ca79be8-4fe1-43fe-a04a-aed4dfcf2158 on /dev/vda |
| 53adcf54-543c-4d37-95c4-37696e76b747 | cinder01_conversion | in-use | 20 | Attached to d2bd428b-dbc7-45a1-bf6c-6f82d05c3a89 on /dev/vdb |
| 356d474f-13a5-42c1-8ee7-0411e5671f98 | cinder01 | in-use | 16 | Attached to d2bd428b-dbc7-45a1-bf6c-6f82d05c3a89 on /dev/vda |
| 4f13a989-ee46-48cf-99ef-19922f8f9564 | horizon01 | in-use | 16 | Attached to f2ba2db1-c0aa-4256-a2c7-a0d9047cd374 on /dev/vda |
| b9c3b5b6-d242-408f-b939-503096ab51c7 | neutron01 | in-use | 16 | Attached to a6d4e0d3-ac6e-40bd-ac5d-5e2347b0b9e4 on /dev/vda |
| a212296b-6380-4f22-ad01-749d28dec198 | nova01 | in-use | 16 | Attached to e76a0843-f093-49d6-80a1-a44cd55335be on /dev/vda |
| 4c6e239c-67f7-43a7-bdb8-178fbb639159 | placement01 | in-use | 16 | Attached to 044290a6-ffc4-48d0-a709-0628e4a1fa57 on /dev/vda |
| 50f6ca48-2d79-40ac-8a37-b7719eadbce1 | glance01_images | in-use | 100 | Attached to c7fb6b3b-f13a-45f9-95c2-248233d7a982 on /dev/vdb |
| e88d302e-c06c-4751-b117-8a3c44ced804 | glance01 | in-use | 16 | Attached to c7fb6b3b-f13a-45f9-95c2-248233d7a982 on /dev/vda |
| 50621c38-6c38-496b-89df-c6fcbf466186 | keystone01 | in-use | 16 | Attached to df51164e-15ff-460c-bfcd-fb91bc569f47 on /dev/vda |
| eb48421c-121e-4cf0-9833-c617cc7fadec | rabbitmq01 | in-use | 16 | Attached to e64f26d1-6405-4f0c-9427-b5ae1656f30f on /dev/vda |
| 844cacae-ec0c-42b4-9229-423b48fd9eb6 | memcached01 | in-use | 16 | Attached to edabcee3-69f3-4a60-861e-b4f961501102 on /dev/vda |
| d54a0aff-efea-43fe-9ce8-7eba4597b7ea | openstack_base | available | 16 | |
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
[root@gn011 ~]# openstack server show c0866654-fcc5-48f1-a446-1c33e518a10e
No server with a name or ID of 'c0866654-fcc5-48f1-a446-1c33e518a10e' exists.
[root@gn011 ~]#
I did manage to set some of the volumes I am trying to delete to the error state and then delete them, but now the deletion stays stuck for days and the volumes are still not deleted. I also noticed that every Cinder daemon got deadlocked due to some issue with greenlet or eventlet. I don't have the exact trace anymore, but it looked like this bug report. I have hit this bug quite often in the past, so I just set heartbeat_in_pthread = true in /etc/cinder/cinder.conf on each of the machines running the openstack-cinder-volume service and restarted the service on all of gn008-gn015, but everything is still stuck.
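For clarity, this is the whole extent of that change, an excerpt of what each node's /etc/cinder/cinder.conf now contains (heartbeat_in_pthread is an oslo.messaging option, hence the section it sits in):

[oslo_messaging_rabbit]
heartbeat_in_pthread = true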
When I did manage to make progress deleting volumes, I noticed that restarting rabbitmq-server helped a bit, if only for a few tens of seconds after the restart. But this is not helping anymore either. Restarting httpd on gn011 takes a long time, but it does not help. I cannot log in via the dashboard anymore, but I found this line in /var/log/httpd/error_log:
Timeout when reading response headers from daemon process 'dashboard': /usr/share/openstack-dashboard/openstack_dashboard/wsgi.py, referer: http://openstack.svc.lunarc/dashboard/auth/login/?next=/dashboard/project/
I tried to get more information by adding logging settings in /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.py:
LOGGING = {
    'version': 1,
    # When set to True this will disable all logging except
    # for loggers specified in this configuration dictionary. Note that
    # if nothing is specified here and disable_existing_loggers is True,
    # django.db.backends will still log unless it is disabled explicitly.
    'disable_existing_loggers': False,
    # If apache2 mod_wsgi is used to deploy OpenStack dashboard
    # timestamp is output by mod_wsgi. If WSGI framework you use does not
    # output timestamp for logging, add %(asctime)s in the following
    # format definitions.
    'formatters': {
        'console': {
            'format': '%(levelname)s %(name)s %(message)s'
        },
        'operation': {
            # The format of "%(message)s" is defined by
            # OPERATION_LOG_OPTIONS['format']
            'format': '%(message)s'
        },
        'verbose': {
            'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
        },
    },
    'handlers': {
        ...
        #'file': {
        #    'level': 'DEBUG' if DEBUG else 'INFO',
        #    'class': 'logging.FileHandler',
        #    'filename': '/var/log/httpd/dashboard.log',
        #    'formatter': 'console',
        #},
        'syslog': {
            'level': 'DEBUG' if DEBUG else 'INFO',
            'class': 'logging.handlers.SysLogHandler',
            'formatter': 'console',
            'facility': 'user',
        },
    },
    'loggers': {
        'horizon': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'horizon.file_log': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'openstack_dashboard': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'novaclient': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        # ... long, long list, all set to DEBUG with 'handlers' set to ['syslog'] ...
    },
}
and /etc/rsyslog.conf with:
...
user.* /var/log/user.log
...
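For completeness, these changes only take effect once both services are restarted (stock systemd unit names on Rocky 8):

systemctl restart rsyslog    # pick up the new user.* rule
systemctl restart httpd      # reload the dashboard with the new LOGGING settings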
But /var/log/user.log does not give anything useful:
Jan 4 18:50:01 gn011 httpd[2275299]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 10:51:00 gn011 httpd[2375720]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 11:22:08 gn011 httpd[2380573]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 15:32:42 gn011 httpd[2412748]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 16:28:21 gn011 httpd[2420940]: Server configured, listening on: port 5000, port 8778, port 80
I also tried outputting logs directly to a file in /tmp, but it would always fail, reporting a lack of permissions.
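My uneducated guess (not verified) is that this is systemd's PrivateTmp sandboxing of httpd rather than plain filesystem permissions; something like this would confirm it:

systemctl show httpd -p PrivateTmp    # PrivateTmp=yes means httpd writes to a private /tmp, not the real one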
At this point, I would restart gn011 (controller and compute node), but I am reluctant to do so before I can migrate or delete all the virtual machines and volumes it hosts, which I can't because everything seems to be stuck. If that still does not work, I would restart all nodes. This worked in the past, although back then every volume in use was effectively lost, and so were the VMs using them. I don't really care about them here, but I would really like to be able to solve this issue without losing VMs and data if (when?) I face it in a production install.
Does anyone have any direction I can take to get my setup back online?
OS (same for all nodes):
[root@gn011 httpd]# cat /etc/redhat-release
Rocky Linux release 8.6 (Green Obsidian)
[root@gn011 httpd]#
OpenStack version (I don't recall what release that was):
[root@gn011 httpd]# rpm -qa | grep openstack
openstack-dashboard-20.1.2-1.el8.noarch
openstack-designate-api-13.0.0-1.el8.noarch
openstack-placement-common-6.0.0-1.el8.noarch
openstack-designate-producer-13.0.0-1.el8.noarch
openstack-nova-novncproxy-24.1.0-1.el8.noarch
openstack-designate-ui-13.0.0-2.el8.noarch
openstack-neutron-linuxbridge-19.3.0-1.el8.noarch
openstack-neutron-common-19.3.0-1.el8.noarch
openstack-neutron-19.3.0-1.el8.noarch
python-openstackclient-lang-5.6.0-1.el8.noarch
openstack-designate-sink-13.0.0-1.el8.noarch
openstack-dashboard-theme-20.1.2-1.el8.noarch
openstack-cinder-19.1.0-1.el8.noarch
openstack-designate-agent-13.0.0-1.el8.noarch
openstack-nova-common-24.1.0-1.el8.noarch
openstack-neutron-ml2-19.3.0-1.el8.noarch
openstack-keystone-20.0.0-2.el8.noarch
openstack-placement-api-6.0.0-1.el8.noarch
openstack-designate-mdns-13.0.0-1.el8.noarch
openstack-nova-conductor-24.1.0-1.el8.noarch
openstack-designate-worker-13.0.0-1.el8.noarch
openstack-nova-api-24.1.0-1.el8.noarch
openstack-glance-23.0.0-2.el8.noarch
openstack-nova-scheduler-24.1.0-1.el8.noarch
python3-openstacksdk-0.59.0-1.el8.noarch
python3-openstackclient-5.6.0-1.el8.noarch
openstack-selinux-0.8.27-1.el8.noarch
openstack-designate-common-13.0.0-1.el8.noarch
openstack-nova-compute-24.1.0-1.el8.noarch
openstack-designate-central-13.0.0-1.el8.noarch
[root@gn011 httpd]#
(tagging as centos for the lack of a rocky tag)
Update on the cinder-volume services: I discovered that three nodes, including the one holding the volume I can't remove, still log greenlet.error: cannot switch to a different thread, although oslo_messaging_rabbit\heartbeat_in_pthread is set to true in /etc/cinder/cinder.conf.
Update on the nodes: one of them (gn011) is additionally the control node; there is no redundancy. None of the nodes (20 "real" cores) is busier than 2-3 processes reported at 20% by top, or has active I/O reaching the MB/s range (from iotop). I can see nothing suspicious in the rabbitmq-server logs, except maybe a long and consecutive alternation of Asked to [re-]register this node (rabbit@gn011) with epmd... and [Re-]registered this node (rabbit@gn011) with epmd at port 25672.

Update on how I remove stuck volumes (recapped as commands below): I first delete the attachment (openstack volume attachment delete <UUID>), then I set their state to "error" (openstack volume set --state error <UUID>), and then I proceed with removing the volume. Some users definitely have been abusing the install. I don't remember what exactly (I recall they deleted the default security group using Horizon), but I had to remove manually (i.e., with a SQL command) some references to (guessing) VMs that no longer appeared in the GUI but still prevented volumes from being removed.
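For reference, the removal sequence recapped as commands, as described above (<UUID> left as a placeholder; the final delete is the plain openstack volume delete command):

openstack volume attachment delete <UUID>    # drop the attachment to the no-longer-existing server
openstack volume set --state error <UUID>    # force the volume out of "in-use"
openstack volume delete <UUID>               # then attempt the actual removal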