Unable to Find Cause of Random CPU Spikes

Hey,

I’ve been trying to debug this problem for almost a week now. This is a cPanel server with CloudLinux and LiteSpeed. Randomly, the server load climbs above 200 (often around 250) and the server becomes unresponsive.

Things I’ve tried so far:

  • Disabling SWAP
  • Making sure everything’s up to date
  • Scanning accounts for malware/exploits (even with LVE enabled)

Things I’m thinking of trying:

  • Disable CloudLinux OOM killer
  • Try Kdump

The problem is that even when the load is high, there are no processes using a significant amount of CPU, so I suspect high IO. Here is the monitoring log from one such event:

!-------------------------------------------- top 50
top - 04:28:10 up 3 days,  1:59,  1 user,  load average: 131.37, 139.55, 102.51
Tasks: 1206 total, 270 running, 934 sleeping,   1 stopped,   1 zombie
%Cpu(s): 17.1 us, 80.1 sy,  1.1 ni,  0.0 id,  0.2 wa,  0.0 hi,  1.5 si,  0.0 st
KiB Mem : 65768876 total,  1533392 free, 19082896 used, 45152588 buff/cache
KiB Swap: 10491896 total,     7776 free, 10484120 used. 34913172 avail Mem

  PID USER	PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10245 mysql     26   6   18.7g   4.4g   3344 S 106.5  7.0   2586:18 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mys$
17068 root	20   0  271208  56792   1532 R  50.2  0.1  57:06.76 cxswatch - scanning
  192 root	20   0       0      0	   0 R  42.9  0.0 102:07.40 [kswapd0]
 9242 mongod    20   0 1325612 249392   2772 S  37.7  0.4 367:50.58 /usr/local/jetapps/usr/bin/mongod --quiet -f /usr/local/jet$
28994 digikrea  20   0  338700  24912   3444 R  36.4  0.0   0:01.29 lsphp:/home/someacc/public_html/index.php
  193 root	20   0       0      0	   0 R  34.2  0.0 112:06.32 [kswapd1]
17067 root	20   0  272140  58868   1432 R  33.8  0.1  57:32.19 cxswatch - scanning

IOTOP:
    !------------------------------------- iotop -b -n 3
    Total DISK READ :     184.97 M/s | Total DISK WRITE :    2016.29 K/s
    Actual DISK READ:      98.64 M/s | Actual DISK WRITE:      24.53 M/s
      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    27604 be/4 anotheracc    0.00 B/s    0.00 B/s  0.00 % 96.25 % lsphp:/home/anotheracc/public_html/wp-login.php
    28277 be/4 lyricsa1   20.52 K/s    0.00 B/s  0.00 % 53.80 % lsphp:e/anotheracc/public_html/anotheracc/index.php
    28520 ?err {none}    182.79 K/s    0.00 B/s  0.00 % 10.18 % {no such process}

Looking at the spike in the network graph, I suspect either outgoing connections or a DDoS. However, the LiteSpeed admin console mostly shows free HTTP(S) connections. Another thing I found in the LiteSpeed logs during the event is:

2019-04-16 16:54:44.881725 [WARN] Redirect destination does not specified, ignore!
# grep "2019-04-16 16:54" error_log | grep Redirect | wc -l
2002
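
(A rough way to see whether those warnings cluster around the load events is to bucket them per minute; this sketch assumes the same error_log and timestamp format shown above:)

grep 'Redirect destination' error_log | cut -c1-16 | sort | uniq -c | sort -rn | head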

Any pointers?

Your sy (kernel-space) CPU usage is quite high at 80%… and looking at the graphs it seems that your server is putting a lot of things into swap for some reason, up to the point where it fills up?

on command line, try

sysctl -w net.ipv4.ip_forward=0
sysctl -w net.ipv4.conf.all.send_redirects=0

if it fixes the problem, add the equivalent settings to /etc/sysctl.conf so they persist across reboots - something like the sketch below.
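
A minimal sketch of those persistent entries (the send_redirects key lives under net.ipv4.conf.*; apply without a reboot via sysctl -p):

# /etc/sysctl.conf
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0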

Are you on a VPS or a Dedi? Could also be noisy neighbours?

Maybe the neighbours are having a party


Something that immediately stands out to me:

270 processes running! I doubt you have 270 cores in the server, so the CPU would be doing a lot of context switching.

Processors can only execute one process at a time. Context switching is when it switches from one process to another (processors do this a lot, of course). Your system has 270 processes that are either running or “ready to run”. How many CPU cores do you have? If you have 32 cores (as an example), only 32 of those processes can actually run concurrently at any one time, meaning that at any one time, 238 of them will be blocked, waiting for access to the CPU (actually this is slightly wrong due to Hyper Threading - many systems will actually execute two processes per CPU core, but you get the idea).
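
If you want to put numbers on this, the core count and the run-queue/context-switch rates are easy to check (a quick sketch, nothing box-specific assumed):

nproc          # number of logical CPUs
vmstat 1 5     # r = runnable processes, cs = context switches/s, in = interrupts/s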

Load average is the number of processes that are actively running, plus the number of processes that could run but are blocked for some reason - usually waiting for a shared resource like IO or CPU. Load average is given as three numbers - the average over the past one, five and fifteen minutes. A load average of 1.00 generally means that one core has been fully utilized (100% CPU usage) for the entire past one, five or fifteen minutes.
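
For example, a rough per-core figure is just the 1-minute load divided by the logical CPU count:

awk -v c="$(nproc)" '{printf "1-min load per core: %.2f\n", $1 / c}' /proc/loadavg

With the top output above (load ~131), a 32-core box would be sitting at roughly four runnable tasks per core.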

It’s entirely possible that the 80% system CPU usage (80.1 sy) could almost entirely be caused by context switching, or other interrupts (like maybe all those processes are trying to read from / write to the network, and one is absolutely hammering the network card with a large number of very small packets, meaning all the others have to wait). netdata has a nice interrupts graph that you could use.
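
If you don’t have netdata on the box, watching /proc/interrupts during a spike also shows which IRQs (e.g. the NIC queues) are firing hardest; the -d flag highlights counters that change between refreshes:

watch -n1 -d 'cat /proc/interrupts'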

I’d say to first get a list of all the processes in “runnable” state:

sudo ps -o comm,pid,ppid,user,time,etime,start,pcpu,state --sort=comm aH | grep '^COMMAND\|R$'

and work out what they are.

It’s very likely the load is actually caused by processes in the ‘runnable’ state, which are technically using 0% CPU (as they’re not even running yet - they’re waiting to be able to run!)
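
A shorter variant that just tallies runnable processes per command name (same idea, assumes procps ps):

ps -eo state,comm --no-headers | awk '$1 ~ /^R/ {count[$2]++} END {for (c in count) print count[c], c}' | sort -rn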

@Andrei Yes, even with so much available RAM, swap usage reaches nearly full. I’ve disabled swap now to see what happens.

@anon40039896 Not sure how that’s related, are you referring to the LS log?

@Ympker Dedi

@Daniel The problem is I don’t understand why the number of running processes spikes suddenly. It might be cron jobs firing at the same time, but several of the processes are lsphp running WordPress’s index.php under different user accounts, and the remaining jobs have always been there, running peacefully.

Nothing much changed in the previous week; the issue appeared out of nowhere.


Are you using a default MySQL config?

5.7 - mostly default, just a few changes, with strict mode disabled.

How should this help?
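
(For what it’s worth, given mysqld sitting at ~100% CPU in the top output, a quick way to see whether the config is still effectively default and whether the buffer pool fits the data - a sketch assuming root shell access to MySQL:)

mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW GLOBAL STATUS LIKE 'Threads_running'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';"

The 5.7 default buffer pool is only 128M, which would force a lot of disk reads on a busy shared box.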

Did you check the webserver logs?
It could be caused by an aggressive crawler hammering a heavy application.
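
A rough way to check for that from the access logs (sketch; the path below is the usual cPanel domlog location, adjust if yours differs, and the user-agent field assumes the combined log format):

# top client IPs across all domain logs
cat /usr/local/apache/domlogs/*/* 2>/dev/null | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# top user agents
cat /usr/local/apache/domlogs/*/* 2>/dev/null | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20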

I did, but there’s no single user with excessive usage; just several lsphp processes running under several different user accounts.

I think 50% of them are cron jobs, but after checking them individually, they’re supposed to run every x minutes, so why would they cause a problem only at random times? Also, none of them has been running for very long.
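
(One way to make any pile-up visible is to dump every account’s crontab side by side and look for entries that all fire on the same minute - rough sketch, run as root:)

for u in $(cut -d: -f1 /etc/passwd); do
    crontab -l -u "$u" 2>/dev/null | grep -v '^[[:space:]]*#' | sed "s/^/$u: /"
done

Note that WordPress’s own wp-cron is normally triggered by page visits rather than system cron, so those lsphp index.php processes wouldn’t necessarily show up here.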

I also came across this message:

-F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected kernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: near stack overflow (cur: general protection fault Unable to handle kernel double fault: RTNL: assertion failed Eeek! page_mapcount(page) went negative! Badness at NETDEV WATCHDOG sysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD

Does not seem related but just in case.

I mean you could try exporting all the cPanel accounts as backups and restoring them on a fresh server, then pointing the nameservers over to see if it makes any difference.

I’d say try to find out what’s hogging all of the RAM/SWAP at the time of the spike, because just disabling SWAP would probably fill up your RAM and end up in a crash as well.
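
Per-process swap usage is visible in /proc; something along these lines run during a spike would show who holds the swapped pages (values in kB):

for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {n = $2} /^VmSwap:/ {print $2, n}' "$f"
done | sort -rn | head -20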

Can do, but most of these accounts were only recently transferred from another server, and doing that again… it’s tiring, expensive and a lot of work right after a migration… I’ll do it as a last resort, but I’m hoping I won’t have to…

@Andrei - As you can see from the graph, nothing is taking a major portion of RAM, just swap. A lot of RAM is still available but simply not being used.


Hmm… yeah, that sucks. Often enough I’ve found cPanel not performing significantly better on a self-hosted VPS/dedi than on a premium reseller like RamNode etc., which ultimately made me stick to reseller plans when offering shared hosting. It’s cheaper and far less of a pain to maintain, and I usually had 1-2 other reseller accounts for private use on standby that I could have used if things went bad with the reseller my clients were on. I’m not saying there are no perks to self-hosting everything, but at some point you end up doing the same things the reseller hosts do: paying for cPanel/JetBackup/CloudLinux licenses, managing the server load, and constantly monitoring things closely. After thinking it over a few times, I decided I’d rather focus on providing good support and let the infrastructure be handled by someone else, while still choosing reputable reseller hosts, of course.

Not telling you to go down the same path. Just think about it once or twice :slight_smile:

Aren’t CloudLinux limits meant for exactly this? I was thinking of buying a license, but I’m confused now.

Yeah, properly set CloudLinux limits would prevent this scenario.
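
(For reference, the limits live in LVE and are managed with lvectl; a minimal sketch - the exact flags below are from memory and should be checked against lvectl --help / the CloudLinux docs:)

lvectl list                                                   # current per-LVE limits and usage
lvectl set 1001 --speed=100% --pmem=1024M --nproc=100 --save  # example limits for LVE/UID 1001 (assumed flags)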


fixed


Just wanted to update here: it’s been around a day and I haven’t noticed it recur. Here are the things I did:

  • Disabled SWAP
  • Disabled OOM killer
  • Disabled CXSWATCH (which was also spawning clamd)

If everything stays fine for another 24 hours, I’ll try re-enabling them one by one, except for swap. Let’s see.

Big thanks to everyone for the suggestions <3
