19 December 2018

Today was crazy. Discovered we have a really interesting bug with one of our k8s clusters. Turns out the kernel version we are running has an old and not so great Ceph module. This module causes the block device to hang when (I think) terminating a pod that uses an RBD for storage. The consequences of this RBD getting stuck are many. For starters, the Docker API begins to slow down, which causes a significant increase in Docker API timeout errors. The second observed problem is that some pods won't start back up if others in the deployment haven't yet finished terminating, causing deployments to fail. And finally, our monitoring system loses its mind because it's trying to scan disks and getting stuck trying to read block devices that are hung.

Today, I feel a combination of amused and exhausted.