Unable to Find Cause of Random CPU Spikes

You could try oomd instead of the standard OOM killer.

I don’t know what’ll happen if you disable the OOM killer and swap, and something eats all the memory. Likely stuff will start crashing. Swap actually shouldn’t be too bad if it’s on an SSD, as long as swap usage is only temporary. If it persists for long periods of time, that could be a sign the server needs more RAM.

If it happens again I recommend you whip out perf and try running perf top for a bit. It will show you where the kernel spends its time. This should help you get to the root cause.

The problem with swap was that it was at 50-100% usage during such events even though the system was only using 25% of RAM. If the OOM killer is disabled, CloudLinux will use SIGKILL instead.
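
For anyone wanting to capture those two numbers while an event is happening, here is a minimal sketch that reads them from /proc/meminfo. The function names are made up for illustration, and the fallback for kernels without MemAvailable is only a rough estimate, so treat the output as a hint rather than an exact figure.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def usage_percent(info):
    """Return (ram_used_pct, swap_used_pct) from a parsed meminfo dict."""
    total = info["MemTotal"]
    # MemAvailable is not present on every kernel; when it is missing,
    # approximate it as free + buffers + cached (an assumption, not exact).
    available = info.get("MemAvailable")
    if available is None:
        available = (info.get("MemFree", 0)
                     + info.get("Buffers", 0)
                     + info.get("Cached", 0))
    ram_pct = 100.0 * (total - available) / total
    swap_total = info.get("SwapTotal", 0)
    if swap_total:
        swap_pct = 100.0 * (swap_total - info.get("SwapFree", 0)) / swap_total
    else:
        swap_pct = 0.0  # no swap configured
    return ram_pct, swap_pct


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        ram, swap = usage_percent(parse_meminfo(f.read()))
    print(f"RAM used: {ram:.1f}%  swap used: {swap:.1f}%")
```

Running it from cron or a loop during a spike would show whether swap really fills up while most of the RAM is still available.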

I’ll try enabling the OOM killer and swap in a few days to see if that was the problem or not.

I’ll be trying that next time. Thanks!

What’s your swappiness set to (cat /proc/sys/vm/swappiness)? It’s often described as the RAM-usage percentage at which swapping starts, but it’s really a weight that controls how willing the kernel is to swap out application memory versus dropping page cache when it has to reclaim memory. The default is normally 60. I often change it to 10, which makes the kernel much less eager to swap, and you could set it to 1 to keep swapping to a minimum (0 makes the kernel avoid swap until memory is almost exhausted, so don’t use 0).

You can set it in /etc/sysctl.conf:

vm.swappiness = 10
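
A quick way to check that the running value matches what the file says is sketched below. The helper name is hypothetical, and it only looks at /etc/sysctl.conf, so anything set in other sysctl drop-in files or by a tuned profile is not covered.

```python
import os


def configured_swappiness(conf_text):
    """Return the last vm.swappiness value in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # sysctl.conf treats lines starting with '#' or ';' as comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "vm.swappiness":
            try:
                value = int(val.strip())
            except ValueError:
                pass  # ignore malformed values
    return value


if __name__ == "__main__":
    live_path = "/proc/sys/vm/swappiness"
    conf_path = "/etc/sysctl.conf"
    if os.path.exists(live_path):
        with open(live_path) as f:
            print("running value:", f.read().strip())
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            print("sysctl.conf value:", configured_swappiness(f.read()))
```

If the two differ, the file has probably not been loaded yet: editing it does not change the running value until it is reloaded (for example with sysctl -p) or the server reboots.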

Yup, it was set to 10.

Hmm, weird. It shouldn’t be swapping if 75% of RAM is still free with a swappiness of 10. It might be that something is suddenly consuming all the RAM, causing everything to swap.

I’d recommend using netdata if you don’t already use it. It tracks metrics every second, and you can look back and see if there were any sudden large spikes in RAM usage (particularly if one program spikes a lot). Definitely better than manually watching for the issue :P

Will try this with perf after re-enabling them. Thank you!

Hey guys.

Since the issue was fixed, there have been a lot of unexplained web server downtimes. The server’s working fine, PHP is working fine. LiteSpeed shows as running but does not respond to any requests. The page just keeps loading. Give it a few minutes and it’s back up. Load drops to 1 in the meantime.

Today, even SSH was inaccessible.

If someone with a good reputation here can take a look (at their price/cost), I’d be grateful.

What kernel build are you on, and which I/O scheduler?

Could it be unrelated network issues? Try mtr from both ends (on your computer to the server, and on the server to somewhere else) and see if there’s packet loss when it occurs.

It looks like network issues, but judging by his other issues, I’m leaning towards THP.

Kernel: 3.10.0-962.3.2.lve1.5.24.10.el7.x86_64
Scheduler: CFQ

Except for the last time, it’s very unlikely. I was able to ssh, use WHM etc.

You know, this could actually very well be a THP problem.

One of the issues:

While streaming out, we experiencing delays (more than one to ten seconds)
because of a process freeze and at the same point in time
a bunch of memory is being freed

Sounds relevant

Have you considered the remote possibility of your system having a PMS occasionally?

I’ve disabled THP and a bunch of other things CloudLinux’s tuned profile offers. Let’s see how it goes.
Thanks!
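
To confirm THP is really off after a change like this, a small sketch along these lines can read the kernel’s own view. It assumes the standard sysfs location /sys/kernel/mm/transparent_hugepage, which may differ on older or vendor kernels, and the helper name is made up for illustration. The kernel marks the active setting with square brackets, e.g. "always madvise [never]".

```python
import os
import re

# Assumed standard location; some older vendor kernels use a different path.
THP_DIR = "/sys/kernel/mm/transparent_hugepage"


def active_choice(text):
    """Return the bracketed (active) option from a sysfs choice line, or None."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ("enabled", "defrag"):
        path = os.path.join(THP_DIR, name)
        if os.path.exists(path):
            with open(path) as f:
                print(f"THP {name}: {active_choice(f.read())}")
        else:
            print(f"THP {name}: {path} not found")
```

If "enabled" still reports always after a reboot, the setting was probably only applied at runtime and something at boot is putting it back.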

Why does everyone assume that because I am an incessant shitposter, I have no other redeeming value?

Because we are all devout Pythonistas and only believe in duck typing.
