Ceph: Safely Available Storage Calculator

The only way I have ever managed to break Ceph is by not giving it enough raw storage to work with. You can abuse Ceph in all kinds of ways and it will recover, but when it runs out of storage, really bad things happen. It is surprisingly easy to get into trouble, mainly because the default safety mechanisms (the nearfull and full ratios) assume that you are running a cluster with at least 7 nodes. For smaller clusters the defaults are too risky. For that reason I created this calculator. It calculates how much storage you can safely consume.



Assumptions:
Number of Replicas (check with: ceph osd pool get {pool-name} size)
Units:

Node Name    Total OSD size
 
Total cluster size
Total raw storage purchased. You will never be able to use this much unless you turn off all replication (which is foolish).
Worst failure replication size
Amount of data that will have to be re-replicated after the "worst failure" occurs. The assumption is that the worst failure we can have is the failure of the biggest node. You decide whether this assumption is sufficiently conservative.
 
Risky cluster size
How much raw storage is available if you are OK with running in a degraded state while the failed node is fixed? This assumes you can fix it at least partially by recovering some OSDs from it. If you go this route, you should probably run "ceph osd set noout" to keep re-replication from eating up all free space, and/or have a very quick disaster recovery plan. If you just let the cluster fix itself, it will run out of space and/or lose data. So this is not a good plan unless you really know what you are doing.
Risky efficiency
Same as above in percent
 
Safe cluster size
How much raw storage is safely available even in the worst case? If you use no more than this amount of storage, you can sleep well at night knowing that you do not have to intervene in case of failure. Ceph will magically fix itself. (Only this time, though. All bets are off for the next failure, as you will probably be in the "risky" scenario after this first failure is handled.)
Safe efficiency
Same as above in percent
Safe nearfull ratio
Set the osd nearfull ratio to this number to get a proper warning when the safety margin is exceeded. (The default is 0.85, which may be too high and therefore too risky.)
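
For readers who want to reason about these numbers by hand, here is a minimal Python sketch of one conservative way to estimate them. It is not necessarily the formula this calculator uses. It assumes (my assumption, not stated above) that after the biggest node fails, all raw data in the cluster must still fit on the surviving nodes, so the safe raw size is the total minus the biggest node. The function name safe_capacity, the result keys, and the example node sizes are hypothetical.

```python
def safe_capacity(node_sizes, replicas):
    """Estimate safely usable storage under a simple, conservative model.

    node_sizes: dict mapping node name -> total OSD size on that node
                (any unit, as long as it is the same for every node).
    replicas:   pool replica count (the pool's "size").

    ASSUMPTION (illustrative, not taken from the calculator): the worst
    failure is the loss of the biggest node, and all raw data must fit
    on the remaining nodes after that failure.
    """
    if not node_sizes:
        raise ValueError("at least one node is required")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")

    total_raw = sum(node_sizes.values())      # total cluster size
    worst_failure = max(node_sizes.values())  # biggest node fails
    safe_raw = total_raw - worst_failure      # raw space left after failure

    return {
        "total_raw": total_raw,
        "worst_failure": worst_failure,
        "safe_raw": safe_raw,
        # Fraction of raw storage that is safe to fill; under this model
        # it doubles as a candidate nearfull ratio.
        "safe_ratio": safe_raw / total_raw,
        # Logical (post-replication) data that fits in the safe raw size.
        "safe_usable": safe_raw / replicas,
    }


# Hypothetical three-node cluster, sizes in TB, 2 replicas.
result = safe_capacity({"node1": 10, "node2": 10, "node3": 20}, replicas=2)
for key, value in result.items():
    print(f"{key}: {value}")
```

Note that this model ignores uneven data distribution across OSDs and any failure-domain constraints in the CRUSH rules, both of which can make the real safe limit lower.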