Unable to Find Cause of Random CPU Spikes

Daniel · April 20, 2019, 3:22am

You could try oomd instead of the standard OOM killer.

I don’t know what’ll happen if you disable the OOM killer and swap, and something eats all the memory. Likely stuff will start crashing. Swap actually shouldn’t be too bad if it’s on an SSD, as long as swap usage is only temporary and doesn’t persist for long periods of time. Persistent swap usage could be a sign the server needs more RAM.

Animazing · April 20, 2019, 6:52am

If it happens again, I recommend you whip out perf and try running perf top for a bit. It will show you where the kernel spends its time. This should help you get to the root cause.

jetchirag · April 20, 2019, 6:59am

The problem with swap was that it was being used at 50-100% during such events even though the system was only using 25% of RAM. If the OOM killer is disabled, CloudLinux will use SIGKILL instead.

I’ll try enabling the OOM killer and swap in a few days to see whether that was the problem.

jetchirag · April 20, 2019, 6:59am

I’ll be trying that next time. Thanks!

Daniel · April 20, 2019, 7:14am

What’s your swappiness set to (cat /proc/sys/vm/swappiness)? It isn’t a strict percentage threshold; it’s a weight that tells the kernel how readily to swap out application memory versus dropping page cache when it needs to reclaim RAM. The default is normally 60. I often change it to 10, which makes the kernel strongly prefer dropping cache over swapping, but you could set it to 1 so that swap is only used as a last resort. I wouldn’t use 0: on current kernels it doesn’t disable swap, but it makes the kernel hold off swapping until memory is almost exhausted.

You can set it in /etc/sysctl.conf:

vm.swappiness = 10
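
A minimal sketch for checking that the running value matches what the config file asks for, since /etc/sysctl.conf is only read at boot or when reloaded with sysctl -p. The helper names here are made up for illustration, the parser only handles simple key = value lines, and it ignores any drop-in files under /etc/sysctl.d that may override the main file.

```python
from pathlib import Path


def configured_swappiness(conf_text):
    """Return the last vm.swappiness value set in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf accepts both # and ;).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "vm.swappiness":
            value = int(raw.strip())  # later lines override earlier ones
    return value


def live_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the value the kernel is using right now."""
    return int(Path(path).read_text().strip())
```

Calling configured_swappiness(Path("/etc/sysctl.conf").read_text()) and comparing it with live_swappiness() shows whether the setting was actually applied.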

jetchirag · April 20, 2019, 7:18am

Yup, it was set to 10.

Daniel · April 20, 2019, 7:20am

Hmm weird, it shouldn’t be swapping if 75% RAM is still free with swappiness of 10. It might be possible that something is suddenly consuming all the RAM, causing everything to swap.

I’d recommend using netdata if you don’t already use it. It tracks metrics every second, and you can look back and see if there were any sudden large spikes in RAM usage (particularly if one program spikes a lot). Definitely better than manually watching for the issue.
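
As a rough stand-in until a proper monitoring tool is installed, here is a small sketch that logs RAM and swap usage from /proc/meminfo once per second. It is not how netdata works internally; the function names are illustrative, and the fallback for kernels without MemAvailable is only an approximation.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def usage_percent(info):
    """Return (ram_used_pct, swap_used_pct) from a parsed meminfo dict."""
    total = info["MemTotal"]
    # MemAvailable is missing on some older kernels; fall back to a rough estimate.
    available = info.get(
        "MemAvailable",
        info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0),
    )
    ram = 100.0 * (total - available) / total
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    swap = 100.0 * swap_used / swap_total if swap_total else 0.0
    return ram, swap


def sample_forever(interval=1.0, path="/proc/meminfo"):
    """Print a timestamped RAM/swap usage line every `interval` seconds."""
    while True:
        with open(path) as f:
            ram, swap = usage_percent(parse_meminfo(f.read()))
        print(f"{time.strftime('%H:%M:%S')} ram={ram:.1f}% swap={swap:.1f}%", flush=True)
        time.sleep(interval)
```

Running sample_forever() with its output redirected to a file gives a log you can search after the next incident for the moment swap usage jumps.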

jetchirag · April 20, 2019, 7:41am

Will try this with perf after enabling them back. Thank you!

jetchirag · April 23, 2019, 4:19pm

Hey guys.

Since the issue was fixed, there have been a lot of unexplainable web server downtimes. The server’s working fine, and PHP is working fine. LiteSpeed shows as running but does not respond to any requests; the page just keeps loading. Give it a few minutes and it’s back up. Load drops to 1 in the meantime.

Today, even SSH was inaccessible.

If someone with a good reputation here can take a look (at their usual price), I’d be grateful.

WSS · April 23, 2019, 4:42pm

What kernel build are you on, and what I/O scheduler?

Daniel · April 23, 2019, 4:52pm

Could it be unrelated network issues? Try mtr from both ends (on your computer to the server, and on the server to somewhere else) and see if there’s packet loss when it occurs.

WSS · April 23, 2019, 4:55pm

It looks like network issues, but judging by his other issues, I’m leaning towards THP (transparent huge pages).

jetchirag · April 23, 2019, 5:44pm

Kernel: 3.10.0-962.3.2.lve1.5.24.10.el7.x86_64
Scheduler: CFQ
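
For anyone wanting to collect the same two details on their own box, a small sketch: the kernel release comes from platform.release(), and the active I/O scheduler is the bracketed entry in /sys/block/<device>/queue/scheduler. The helper names are illustrative, and devices that list no bracketed entry simply show up as None.

```python
import glob
import platform


def active_choice(text):
    """Return the bracketed entry from text like 'noop deadline [cfq]', or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report():
    """Print the running kernel release and each block device's active scheduler."""
    print("kernel:", platform.release())
    for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
        with open(path) as f:
            # Path looks like /sys/block/sda/queue/scheduler; element 3 is the device.
            print(path.split("/")[3], active_choice(f.read()))
```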

jetchirag · April 23, 2019, 5:58pm

Except for the last time, it’s very unlikely. I was able to SSH in, use WHM, etc.

FHR · April 23, 2019, 6:37pm

You know, this could actually very well be a THP problem.

One of the issues:

While streaming out, we experiencing delays (more than one to ten seconds)
because of a process freeze and at the same point in time
a bunch of memory is being freed

Sounds relevant

deank · April 23, 2019, 6:43pm

Have you considered the remote possibility of your system having a PMS occasionally?

jetchirag · April 23, 2019, 6:55pm

I’ve disabled THP and a bunch of other things CloudLinux’s tuned profile offers. Let’s see how it goes.
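
To confirm that THP really is off after a change like this, the current mode can be read back from sysfs. The sketch below assumes the standard /sys/kernel/mm/transparent_hugepage location (some older Red Hat kernels use a different directory), and the function names are made up for illustration. Keep in mind that a value written at runtime does not survive a reboot unless something reapplies it.

```python
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_mode(text):
    """Return the bracketed mode from text like 'always madvise [never]', or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def thp_status(base=THP_DIR):
    """Return the active mode of the 'enabled' and 'defrag' files that exist."""
    status = {}
    for name in ("enabled", "defrag"):
        f = Path(base) / name
        if f.exists():
            status[name] = active_mode(f.read_text())
    return status
```

If thp_status() reports "never" for enabled, THP is off for the running system.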
Thanks!

WSS · April 23, 2019, 8:40pm

Why does everyone assume that because I am an incessant shitposter, that I have no other redeeming value?

SonOfAMotherlessGoat · April 23, 2019, 10:24pm

Because we are all devout Pythonistas and only believe in duck typing.

WSS · April 23, 2019, 10:28pm