I haven’t had this exact thing happen, but an MX500 1TB SSD was being a ding dong on an Ubuntu 18.04 LTS (I believe?) box because of false positives from the older smartmontools version on that OS release.
Are you running an older OS? That might hint at an older smartmontools version than what’s current.
If so, your SMART drive database might not be up to date either. I would verify whether that’s the case and make sure both smartmontools and its database are current before writing this off, because the last thing you want is a misdiagnosis when your disk(s) might very well be on their way out.
As I said, MX500s were being misdiagnosed on my end when the database was on a bad version.
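For reference, this is roughly how I check it (a sketch; package names and paths can differ per distro, and on some distros update-smart-drivedb isn’t shipped, in which case updating the smartmontools package itself is the way to go):

    smartctl --version               # shows which smartmontools release is in use
    sudo update-smart-drivedb        # refreshes the drive database (drivedb.h) to current entries
    sudo smartctl -P show /dev/sda   # shows which database entry matches that drive (device name is just an example)

If the database entry for the drive is stale, attributes can get mislabeled or misread, which is exactly the kind of false positive I was seeing.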
If those NVMes are failing SMART, that’s cause for concern. Make sure your backups are up to date, take a manual backup, then ask your data center provider to replace them at once. A SMART failure might not mean the disks are on the way out, but it usually doesn’t get better.
I had a 500GB HDD doing this too, and ever since it has kept incrementing reallocated sectors in SMART (it would only pass short self-tests), so it was/is on the way out.
I still keep it around as a SHTF drive, but a drive behaving like this shouldn’t be in service.
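If you want to look the drives over yourself before opening the ticket, something along these lines should do it (device names are examples, adjust to your box; nvme-cli is only needed for the last one):

    sudo smartctl -H /dev/nvme0      # overall SMART health verdict
    sudo smartctl -a /dev/nvme0      # full report: media errors, percentage used, error log, etc.
    sudo nvme smart-log /dev/nvme0   # same health data via nvme-cli

On my old HDD it was the reallocated sector count in the smartctl attribute output that kept climbing; on NVMe the rough equivalents to watch are the media/data integrity errors and the percentage used.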
That screenshot is from a different server than the one I was referring to when I first started this topic.
The one I was referring to looks way better; here’s a screenshot.
Correct, you need to confirm that you actually have the prerequisites installed correctly as per the documentation for those NVMes, or else they will not test properly on that end.
The above seems poorly worded; it looks like not listing the disks is what one should expect. In my case it listed the installed disk… so I’d say it’s correctly installed.
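For anyone following along, the quick sanity check I’d do here (assuming smartmontools and nvme-cli are installed) is:

    sudo smartctl --scan   # enumerates the disks smartctl can see
    sudo nvme list         # lists the installed NVMe drives

If the drives show up in both, the tooling side is set up correctly and any remaining problem is with the drives or the platform, not the prerequisites.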
It really is a pain on hosts that actually use TLC-only disks.
SLC-cache-backed TLC drives are MUCH better in my testing with both Samsung 860 EVOs and SK Hynix Gold S31s; even reboots alone are MUCH better off on disks of that same class.
Hetzner disagrees.
I asked them to look into a replacement; they did and got back to me with…
Our test shows no issues with your drives.
-----------------%<-----------------
HDDTEST S4GENX0N420995: Ok
HDDTEST S4GENX0N421048: Ok
-----------------%<-----------------
If you mean drives not supporting S.M.A.R.T. and/or live testing, then…
The nerve of a provider choosing to run NVMes without S.M.A.R.T. support is really not cool, if that is actually the case.
We as renters should have a means of accessing S.M.A.R.T. and other critical information from the disks on a running system, if at all possible.
As I said before, if a drive can’t be validly tested by trusted means, then I label it a “questionable” disk that should be pulled from a production node as soon as possible.
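A quick way to see whether a drive actually exposes S.M.A.R.T. and self-tests from the running system (SATA example, device names are illustrative):

    sudo smartctl -i /dev/sda   # identity info, plus whether SMART support is available and enabled
    sudo smartctl -c /dev/sda   # capabilities, including which self-tests the drive supports

For NVMe, a reasonably recent smartmontools can read the health log with smartctl -a /dev/nvme0 as well. If none of that works from the live system, that’s exactly when I’d start calling the disk “questionable”.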
Of course not. If you use any SSD, it will obviously wear out; that is how it works. 25% means a quarter of the guaranteed lifetime has been used up, so you have 75% left to go, and there is no reason to change it. You would not change your car’s tires after only 25% of wear, would you?
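For what it’s worth, that percentage comes straight from the drive; on NVMe it’s the “Percentage Used” field, which you can read yourself (device name is just an example):

    sudo smartctl -a /dev/nvme0 | grep -i 'percentage used'
    sudo nvme smart-log /dev/nvme0 | grep -i percentage_used

A value of 25% just means a quarter of the rated endurance has been consumed, not that the drive is 25% broken.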
So of course Hetzner is going to deny that replacement request. A (pending) RAID rebuild does not automatically mean there is something wrong with the disk, especially if it is just software RAID. More likely, any unclean shutdown can cause this.
TL;DR: stop worrying about the disks, there is nothing wrong with them.