Impact of highly unbalanced Ceph storage
How to fix undersized degraded PGs in Ceph
781 Words
2025-03-27
At the institute, we are using Ceph as the storage backend of our hyper-converged infrastructure (HCI). This means that each node in the cluster is both a storage node and a compute node at the same time, running Proxmox in our case.
Basic storage setup
This is a three-node Ceph cluster: two nodes with 8 TB disks and one node with 2 TB disks. The initial idea behind this setup was to have one fast caching node and two high-usage nodes. There is also a fourth node, which is mainly used for compute and monitoring and does not participate in the Ceph cluster.

The four Dell PowerEdge R760 nodes
The problem described here is similar to the one described in this reddit post, but the impact is more severe.
General information
So we have three nodes running Proxmox with Ceph, with each disk added as an OSD. The cluster is configured to distribute data across hosts first, to make sure that not all replicas of a piece of data end up on a single host:
ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
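To double-check which rule and replication size a pool actually uses, the following commands should do (the pool name rbd-data is just a placeholder, not one of our actual pool names):
# show the CRUSH rule assigned to a pool (pool name is a placeholder)
ceph osd pool get rbd-data crush_rule
# list all pools with their size, min_size and crush_rule in one overview
ceph osd pool ls detail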
So with 10 OSDs per host, this makes ~80 TB of raw capacity on each of the first two nodes and ~20 TB on the third node. Of course, our cluster is configured to require at least 2 replicas and to always create 3 replicas:
osd_pool_default_min_size = 2
osd_pool_default_size = 3
The attentive reader has surely noticed that each node can effectively use only ~20 TB of its storage: the smallest node sets the limit, as everything has to be replicated to all three nodes.
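The raw capacity and utilization per host and per OSD can be verified directly, which is also a good way to see the imbalance at a glance:
# capacity and usage, aggregated per host in the CRUSH tree
ceph osd df tree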
Problem - undersized degraded since 2w
After adding a bunch of data to this system, we noticed that some data was not placed correctly and became degraded, as some PGs could not be placed anywhere. This is especially weird as the 2 TB disks were only about 50% full, while the 8 TB disks were only ~12.5% full (1/8, which makes sense, since every host holds the same amount of data).
But still, some PGs could not place their third replica on the host with the small disks for some reason, even after weeks of runtime. Ceph did not balance this out.
While this would be understandable when nearing something like 80% utilization on the small disks, I do not understand why it happens at only 50%.
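To see which PGs are affected and which OSDs they are (not) mapped to, these commands are useful; the exact PG IDs will of course differ per cluster:
# health status including the list of undersized/degraded PGs
ceph health detail
# PGs stuck in the undersized state, with their acting OSD sets
ceph pg dump_stuck undersized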
Ceph makes it worse - by reweighting the small disks
While trying to fix this, I first tried increasing and decreasing the PG count per pool, which did not help. So I tried to let Ceph fix this itself using
ceph osd reweight-by-utilization
Of course, the small disks have high utilization, as the pressure to store data on them is very high: the small node has to keep one replica of everything as well. Therefore, Ceph reduces the weight of the small disks, making the situation even worse.
moved 35 / 6339 (0.552138%)
avg 147.419
stddev 28.5785 -> 27.2297 (expected baseline 11.9996)
min osd.12 with 109 -> 111 pgs (0.739391 -> 0.752958 * mean)
max osd.28 with 250 -> 236 pgs (1.69585 -> 1.60088 * mean)
oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.1545
overload_utilization 0.1854
osd.28 weight 1.0000 -> 0.9500
osd.11 weight 1.0000 -> 0.9500
osd.2 weight 0.9000 -> 0.8500
osd.15 weight 1.0000 -> 0.9500
Only the weights of the small disks are decreased.
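In hindsight it would have been better to preview this first. Ceph has a dry-run variant of the command that prints the planned weight changes without applying anything:
# dry run: show what reweight-by-utilization would change
ceph osd test-reweight-by-utilization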
Solution - do not create imbalanced hosts
So while it is still unclear why Ceph does not just place the degraded PGs in the free space on the small node, there are ways to fix this.
- The first one is of course to reset the weight of all small OSDs back to 1, and possibly adjust the overall CRUSH weight as well (see the sketch after this list).
- The second is to physically move disks between the nodes to remove the imbalance entirely.
This of course requires removing/deleting the OSDs and freshly re-adding the disks on the other node, as Ceph does not like OSDs reappearing on other nodes, since metadata is kept on the initial host (a rough sketch follows below).
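A minimal sketch of the first option, assuming the affected OSDs are the ones from the reweight output above (osd.2, osd.11, osd.15, osd.28):
# reset the reweight values that reweight-by-utilization lowered
ceph osd reweight 2 1.0
ceph osd reweight 11 1.0
ceph osd reweight 15 1.0
ceph osd reweight 28 1.0
For the second option, the rough per-disk procedure on Proxmox looks something like this (osd.28 and /dev/sdX are placeholders, and the cluster must be healthy again before touching the next disk):
# on the old node: take the OSD out and let the data drain
ceph osd out 28
# once all PGs are active+clean again, stop the daemon and destroy the OSD
systemctl stop ceph-osd@28.service
pveceph osd destroy 28 --cleanup
# after physically moving the disk, create a fresh OSD on the new node
pveceph osd create /dev/sdX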

A similar disk-movement setup to the one planned for the R760 nodes I am maintaining
It mainly seems that Ceph is not intended for such setups (which are bad anyway, as 60 + 60 = 120 TB of raw storage can never be used in this scenario).
Some even say that one should not use Ceph with disks of such different sizes at all. I still have to decide whether the small disks should become part of a separate pool, defined through a separate replication rule for their device class.
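A rough sketch of what that could look like, using a custom device class; all OSD IDs and names here are hypothetical, nothing of this is decided yet:
# tag the small OSDs with a custom device class (OSD ids are examples)
ceph osd crush rm-device-class osd.20 osd.21
ceph osd crush set-device-class small osd.20 osd.21
# replication rule that only picks OSDs of that class, host failure domain
ceph osd crush rule create-replicated replicated_small default host small
# either create a new pool on that rule, or move an existing pool to it
ceph osd pool create small-pool 128 128 replicated replicated_small
ceph osd pool set some-existing-pool crush_rule replicated_small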
For now it is just good to finally have a better understanding of what was going on and why.