This question is unrelated to the one I just asked about the OpenStack metadata server. That is another setup, on other machines, and the two are clearly not connected to one another.
I set up OpenStack on 8 nodes (gn008..gn015), where all nodes are compute (libvirt/kvm), network (linuxbridge), and storage (lvm) nodes; gn011 additionally runs all OpenStack administrative services. I occasionally had issues when the /var/log partition got full, especially on gn011, but nothing that removing big log files and restarting the associated daemons would not fix.
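As an aside, the cleanup itself was nothing fancy; a minimal sketch of the kind of thing I mean (the log path and service name are just examples, the actual offender varied):

du -ah /var/log | sort -rh | head -n 20      # spot the biggest logs
truncate -s 0 /var/log/cinder/volume.log     # truncating in place keeps the daemon's open file handle valid
systemctl restart openstack-cinder-volume    # restart whichever daemon owns the log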
Now the volume service part of OpenStack fails when creating any new volume, even a blank one. To rule out a lack of storage space, I removed a few virtual machines and their associated volumes; but volume deletion also fails. Now I find myself with volumes attached to non-existent virtual machines (see volume block1 in the trace below):
[root@gn011 ~]# openstack volume list
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
| 217d9087-3175-4565-91f9-dcca2e1be383 | cpu1_instances | in-use | 50 | Attached to cpu1 on /dev/vdb |
| dde062b6-0fc6-4e76-b936-1d2cfed14af4 | cpu1 | in-use | 16 | Attached to cpu1 on /dev/vda |
| c7d11144-278e-4785-9024-a685a0406215 | block1_volumes | available | 50 | |
| 67d7882d-903a-4e2a-b386-5d0af65b6c65 | block1 | in-use | 16 | Attached to c0866654-fcc5-48f1-a446-1c33e518a10e on /dev/vda |
| 680c085c-2959-45ee-85bf-b249b8f0a6bd | block0_volumes | in-use | 50 | Attached to 399aa8dd-9aea-4059-bed7-eb66209813f9 on /dev/vdb |
| 9a272ff6-c443-49d6-befd-730d1635d6eb | block0 | in-use | 16 | Attached to 399aa8dd-9aea-4059-bed7-eb66209813f9 on /dev/vda |
| 1e716bcc-a381-4da6-80c6-44572437f610 | designate01 | in-use | 16 | Attached to 5ca79be8-4fe1-43fe-a04a-aed4dfcf2158 on /dev/vda |
| 53adcf54-543c-4d37-95c4-37696e76b747 | cinder01_conversion | in-use | 20 | Attached to d2bd428b-dbc7-45a1-bf6c-6f82d05c3a89 on /dev/vdb |
| 356d474f-13a5-42c1-8ee7-0411e5671f98 | cinder01 | in-use | 16 | Attached to d2bd428b-dbc7-45a1-bf6c-6f82d05c3a89 on /dev/vda |
| 4f13a989-ee46-48cf-99ef-19922f8f9564 | horizon01 | in-use | 16 | Attached to f2ba2db1-c0aa-4256-a2c7-a0d9047cd374 on /dev/vda |
| b9c3b5b6-d242-408f-b939-503096ab51c7 | neutron01 | in-use | 16 | Attached to a6d4e0d3-ac6e-40bd-ac5d-5e2347b0b9e4 on /dev/vda |
| a212296b-6380-4f22-ad01-749d28dec198 | nova01 | in-use | 16 | Attached to e76a0843-f093-49d6-80a1-a44cd55335be on /dev/vda |
| 4c6e239c-67f7-43a7-bdb8-178fbb639159 | placement01 | in-use | 16 | Attached to 044290a6-ffc4-48d0-a709-0628e4a1fa57 on /dev/vda |
| 50f6ca48-2d79-40ac-8a37-b7719eadbce1 | glance01_images | in-use | 100 | Attached to c7fb6b3b-f13a-45f9-95c2-248233d7a982 on /dev/vdb |
| e88d302e-c06c-4751-b117-8a3c44ced804 | glance01 | in-use | 16 | Attached to c7fb6b3b-f13a-45f9-95c2-248233d7a982 on /dev/vda |
| 50621c38-6c38-496b-89df-c6fcbf466186 | keystone01 | in-use | 16 | Attached to df51164e-15ff-460c-bfcd-fb91bc569f47 on /dev/vda |
| eb48421c-121e-4cf0-9833-c617cc7fadec | rabbitmq01 | in-use | 16 | Attached to e64f26d1-6405-4f0c-9427-b5ae1656f30f on /dev/vda |
| 844cacae-ec0c-42b4-9229-423b48fd9eb6 | memcached01 | in-use | 16 | Attached to edabcee3-69f3-4a60-861e-b4f961501102 on /dev/vda |
| d54a0aff-efea-43fe-9ce8-7eba4597b7ea | openstack_base | available | 16 | |
+--------------------------------------+---------------------+-----------+------+---------------------------------------------------------------+
[root@gn011 ~]# openstack server show c0866654-fcc5-48f1-a446-1c33e518a10e
No server with a name or ID of 'c0866654-fcc5-48f1-a446-1c33e518a10e' exists.
[root@gn011 ~]#
I did manage to set some of the volumes I am trying to delete to the error state and then delete them, but now the deletion stays stuck for days and the volumes are still not deleted. I also noticed that every Cinder daemon got deadlocked due to some issue with greenlet or eventlet. I don't have the exact trace anymore, but it looked like this bug report. I have hit this bug quite often in the past, so I just set heartbeat_in_pthread = true in /etc/cinder/cinder.conf on each of the machines running the openstack-cinder-volume service and restarted the service on all of gn008-gn015, but everything is still stuck.
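For clarity, this is the whole extent of that change, an excerpt of what each node's /etc/cinder/cinder.conf now contains (heartbeat_in_pthread is an oslo.messaging option, hence the section it sits in):

[oslo_messaging_rabbit]
heartbeat_in_pthread = true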
When I did manage to make progress deleting volumes, I noticed that restarting rabbitmq-server helped a bit, if only for a few tens of seconds after the restart. But this is not helping anymore either. Restarting httpd on gn011 takes a long time, but it does not help. I cannot log in via the dashboard anymore, but I found this line in /var/log/httpd/error_log:
Timeout when reading response headers from daemon process 'dashboard': /usr/share/openstack-dashboard/openstack_dashboard/wsgi.py, referer: http://openstack.svc.lunarc/dashboard/auth/login/?next=/dashboard/project/
I tried to get more information by adding logging settings in /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.py:
LOGGING = {
    'version': 1,
    # When set to True this will disable all logging except
    # for loggers specified in this configuration dictionary. Note that
    # if nothing is specified here and disable_existing_loggers is True,
    # django.db.backends will still log unless it is disabled explicitly.
    'disable_existing_loggers': False,
    # If apache2 mod_wsgi is used to deploy OpenStack dashboard
    # timestamp is output by mod_wsgi. If WSGI framework you use does not
    # output timestamp for logging, add %(asctime)s in the following
    # format definitions.
    'formatters': {
        'console': {
            'format': '%(levelname)s %(name)s %(message)s'
        },
        'operation': {
            # The format of "%(message)s" is defined by
            # OPERATION_LOG_OPTIONS['format']
            'format': '%(message)s'
        },
        'verbose': {
            'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
        },
    },
    'handlers': {
        ...
        #'file': {
        #    'level': 'DEBUG' if DEBUG else 'INFO',
        #    'class': 'logging.FileHandler',
        #    'filename': '/var/log/httpd/dashboard.log',
        #    'formatter': 'console',
        #},
        'syslog': {
            'level': 'DEBUG' if DEBUG else 'INFO',
            'class': 'logging.handlers.SysLogHandler',
            'formatter': 'console',
            'facility': 'user',
        },
    },
    'loggers': {
        'horizon': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'horizon.file_log': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'openstack_dashboard': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        'novaclient': {
            'handlers': ['syslog'],
            'level': 'DEBUG',
            'propagate': False,
        },
        # ... long, long list, all set to DEBUG with 'handlers' set to ['syslog'] ...
    },
}
and /etc/rsyslog.conf with:
...
user.* /var/log/user.log
...
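For completeness, these changes only take effect once both services are restarted (stock systemd unit names on Rocky 8):

systemctl restart rsyslog    # pick up the new user.* rule
systemctl restart httpd      # reload the dashboard with the new LOGGING settings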
But /var/log/user.log does not give anything useful:
Jan 4 18:50:01 gn011 httpd[2275299]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 10:51:00 gn011 httpd[2375720]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 11:22:08 gn011 httpd[2380573]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 15:32:42 gn011 httpd[2412748]: Server configured, listening on: port 5000, port 8778, port 80
Jan 5 16:28:21 gn011 httpd[2420940]: Server configured, listening on: port 5000, port 8778, port 80
I also tried outputting logs directly to a file in /tmp, but it would always fail, reporting a lack of permissions.
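My uneducated guess (not verified) is that this is systemd's PrivateTmp sandboxing of httpd rather than plain filesystem permissions; something like this would confirm it:

systemctl show httpd -p PrivateTmp    # PrivateTmp=yes means httpd writes to a private /tmp, not the real one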
At this point, I would restart gn011 (controller and compute node), but I am reluctant to do so before I can migrate or delete all the virtual machines and volumes it hosts, which I can't because everything seems to be stuck. If that still does not work, I would restart all nodes. This worked in the past, although back then every volume in use was effectively lost, and so were the VMs using them. I don't really care about them here, but I would really like to be able to solve this issue without losing VMs and data if (when?) I face it in a production install.
Does anyone have any direction I can take to get my setup back online?
OS (same for all nodes):
[root@gn011 httpd]# cat /etc/redhat-release
Rocky Linux release 8.6 (Green Obsidian)
[root@gn011 httpd]#
OpenStack version (I don't recall what release that was):
[root@gn011 httpd]# rpm -qa | grep openstack
openstack-dashboard-20.1.2-1.el8.noarch
openstack-designate-api-13.0.0-1.el8.noarch
openstack-placement-common-6.0.0-1.el8.noarch
openstack-designate-producer-13.0.0-1.el8.noarch
openstack-nova-novncproxy-24.1.0-1.el8.noarch
openstack-designate-ui-13.0.0-2.el8.noarch
openstack-neutron-linuxbridge-19.3.0-1.el8.noarch
openstack-neutron-common-19.3.0-1.el8.noarch
openstack-neutron-19.3.0-1.el8.noarch
python-openstackclient-lang-5.6.0-1.el8.noarch
openstack-designate-sink-13.0.0-1.el8.noarch
openstack-dashboard-theme-20.1.2-1.el8.noarch
openstack-cinder-19.1.0-1.el8.noarch
openstack-designate-agent-13.0.0-1.el8.noarch
openstack-nova-common-24.1.0-1.el8.noarch
openstack-neutron-ml2-19.3.0-1.el8.noarch
openstack-keystone-20.0.0-2.el8.noarch
openstack-placement-api-6.0.0-1.el8.noarch
openstack-designate-mdns-13.0.0-1.el8.noarch
openstack-nova-conductor-24.1.0-1.el8.noarch
openstack-designate-worker-13.0.0-1.el8.noarch
openstack-nova-api-24.1.0-1.el8.noarch
openstack-glance-23.0.0-2.el8.noarch
openstack-nova-scheduler-24.1.0-1.el8.noarch
python3-openstacksdk-0.59.0-1.el8.noarch
python3-openstackclient-5.6.0-1.el8.noarch
openstack-selinux-0.8.27-1.el8.noarch
openstack-designate-common-13.0.0-1.el8.noarch
openstack-nova-compute-24.1.0-1.el8.noarch
openstack-designate-central-13.0.0-1.el8.noarch
[root@gn011 httpd]#
(tagging as centos for the lack of a rocky tag)
Update on the cinder-volume services: I discovered that three nodes, including the one holding the volume I can't remove, still log greenlet.error: cannot switch to a different thread, although oslo_messaging_rabbit\heartbeat_in_pthread is set to true in /etc/cinder/cinder.conf.
Update on the nodes: one of them (gn011) is additionally the control node; there is no redundancy. None of the nodes (20 "real" cores) is busier than 2-3 processes reported at 20% by top, or has active I/O reaching the MB/s range (from iotop). I can see nothing suspicious in the rabbitmq-server logs, except maybe a long and consecutive alternation of Asked to [re-]register this node (rabbit@gn011) with epmd... and [Re-]registered this node (rabbit@gn011) with epmd at port 25672.

Update on how I remove stuck volumes (recapped as commands below): I first delete the attachment (openstack volume attachment delete <UUID>), then I set their state to "error" (openstack volume set --state error <UUID>), and then I proceed with removing the volume. Some users definitely have been abusing the install. I don't remember what exactly (I recall they deleted the default security group using Horizon), but I had to remove manually (i.e., with a SQL command) some references to (guessing) VMs that no longer appeared in the GUI but still prevented volumes from being removed.
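For reference, the removal sequence recapped as commands, as described above (<UUID> left as a placeholder; the final delete is the plain openstack volume delete command):

openstack volume attachment delete <UUID>    # drop the attachment to the no-longer-existing server
openstack volume set --state error <UUID>    # force the volume out of "in-use"
openstack volume delete <UUID>               # then attempt the actual removal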