
Remove overhead from RAID 10 mdadm on Linux?!


OCFreely101

New Member
Joined
Jul 24, 2020
I have a RAID made up of six WD RE4s:

https://hdd.userbenchmark.com/SpeedTest/5792/WDC-WD2503ABYX-01WERA0
https://gzhls.at/blob/ldb/7/6/1/e/ee7a79edc97b885933949eeefdb2d9fbdf1b.pdf

Per-drive write speeds look realistic compared to the UserBenchmark page, around 95 MB/s. But put in any RAID configuration with mdadm, the array never goes above 120 MB/s. It basically acts like a single drive with redundancy, when it should be getting up to 3x.

Code:
sudo hdparm -Tt /dev/md/localhost-live.attlocal.net:10(redacted)

 Timing cached reads:   5790 MB in  2.00 seconds = 2895.92 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 866 MB in  3.08 seconds = 281.53 MB/sec
This is getting the correct read speed for Near2.
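For comparison, each member can be timed the same way on its own (untested one-liner; the device names are the ones listed in the mdadm output further down):

Code:
# buffered read test on each member individually
for d in /dev/sd{c..h}; do sudo hdparm -t "$d"; done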

[Screenshot from GNOME Disk Utility showing my realistic speeds.]



Code:
sudo mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Fri Jul 31 02:37:02 2020
        Raid Level : raid10
        Array Size : 735129600 (701.07 GiB 752.77 GB)
     Used Dev Size : 245043200 (233.69 GiB 250.92 GB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Aug  2 19:45:38 2020
             State : clean 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 2048K

Consistency Policy : bitmap

              Name : localhost-live.attlocal.net:10  (local to host localhost-live.attlocal.net)
              UUID : 93a69ccd:be91b7b5:2d7dd8ea:e1630957
            Events : 6

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync set-A   /dev/sdc
       1       8       48        1      active sync set-B   /dev/sdd
       2       8       64        2      active sync set-A   /dev/sde
       3       8       80        3      active sync set-B   /dev/sdf
       4       8       96        4      active sync set-A   /dev/sdg
       5       8      112        5      active sync set-B   /dev/sdh

I've tested this with 4 and 6 drives, every combination of RAID 10 and RAID 1+0, and all chunk sizes, and I can't get more than 120 MB/s. What is creating the excessive overhead? Or what causes it to begin with?
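For reference, the sort of command I've been rebuilding the array with looks roughly like this (a rough sketch from memory, and it wipes the member disks; 512K is just one of the chunk values to try):

Code:
# stop the existing array, then recreate it - this destroys the data on the members
sudo mdadm --stop /dev/md127
sudo mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=512 \
    --raid-devices=6 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh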

BTW, if I make a RAID 1 out of these drives, the write speed drops to 69 MB/s in gnome-disk-utility. This seems to show where the overhead is introduced. It is then reduced further by the overhead of putting the RAID 1s into a RAID 10. It realistically runs between 90-120 MB/s in all configs. Is there a way to reduce this overhead and run a RAID 10 at its maximum performance?

I know they can get better performance, because I've run Fedora 25 (from an install USB) alongside my current Fedora 32 and it was capable of running closer to full speed. I'm not familiar enough with Linux to know what is causing what.

My (wild) guesses are around alignment, the file system, or things like asynchronous writes or other odd settings. (Obviously I have no idea.)
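In case it helps anyone diagnose, these are the kinds of sanity checks I can run and post (a sketch; the 65536 readahead value is only an example, not something I've confirmed helps):

Code:
# current I/O scheduler on each member (prints filename:value per drive)
grep . /sys/block/sd[c-h]/queue/scheduler
# readahead on the array, in 512-byte sectors
sudo blockdev --getra /dev/md127
# example value only (32 MiB) - not a confirmed fix
sudo blockdev --setra 65536 /dev/md127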

I've tried maximising APM and other things, and it went up to about 130 MB/s. It's literally acting like one disk.

Could it also be things like 32-bit limitations, or not using enough cores/threads? Does anyone know why it acts this way?

I originally wanted to do the offset layout so I could get double the sustained read and a max read speed near my SSD. (An 850 Pro 256GB drive @ 540 MB/s.)
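If it matters, the offset version I was aiming for would be created with basically the same command as above, only the layout flag changes (again just a sketch that wipes the members):

Code:
# o2 = offset layout with 2 copies
sudo mdadm --create /dev/md0 --level=10 --layout=o2 --chunk=512 \
    --raid-devices=6 /dev/sd[c-h]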

Edit: I can run it in a RAID 0 and get closer to some of the numbers I want, but I don't want to run RAID 0 because of the risk of data loss. And it's still not up to the expected read speeds in practice. I don't understand what makes it act this way. (Literally, I don't know and am curious.)

BTW, what does it mean that my RAID device is located at /dev/md/localhost-live.attlocal.net:10?

Does that mean it's on a network, or is it just part of my system's internal naming?

Is there also a way to get it to use asynchronous writes? I read that's better and lowers CPU overhead. I think I got it to max out 3-4 cores using a different file system manager or something with RAID 0.
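As far as I know, ext4 and XFS already write asynchronously unless the sync mount option is set. This is how I've been checking the mount options on the array (md127 as in the output above):

Code:
# show where the array is mounted and with which options
findmnt -no TARGET,OPTIONS /dev/md127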

I tested with a War Thunder install folder at 33 GB. Under Fedora 32 a copy job takes 3-5 minutes. Under the Fedora 25 install USB I get around 1 minute for the same copy... That is a serious real-world performance impact. What is different between the two?
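To make the comparison fair I've started flushing the page cache before timing the copy, roughly like this (the paths are placeholders for my actual folders):

Code:
# flush dirty data and drop the page cache so the copy isn't served from RAM
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# time the copy including the final flush to disk
time sh -c 'cp -a /path/to/WarThunder /home/USER/Storage/wt-copy && sync'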

One other oddity is that it normally seems to max out only one CPU core, if any. I had one config (I forget which one) that was using more.

Do I need to activate a tri-state setting, either on the HDDs with hdparm or in the BIOS? Edit: I may be thinking of memory tri-stating. Not sure if that is the same thing or not.

Memclock tri-stating
Determines whether to enable memory clock tri-stating in CPU C3 or Alt VID mode. (Default: Disabled)

Could DQS Training help at all? Or would that hinder?

DQS Training Control
Enables or disables memory DQS training each time the system restarts. (Default: Skip DQS)

So far neither of those did anything that I can tell. I'll try turning off virtualization next.
 
I'm not an MD RAID expert, but I did use it for years. I've moved to ZFS, but perhaps some of the same principles apply. When performance is discussed on the mailing list, the consensus seems to be that mirroring provides the best performance. (Striping is not generally discussed.) Anything that uses parity to ensure integrity will not provide the best performance.

And when it comes to benchmarks... It is usually suggested to test with the particular load the storage will see for meaningful results.
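For example, something like fio gets closer to a real load than dd does; a rough sketch (fio isn't something you mentioned, and the file path is just a placeholder on the array's mount point):

Code:
# sequential 1M writes with direct I/O to a test file on the array
sudo fio --name=seqwrite --filename=/mnt/array/fio.test --rw=write \
    --bs=1M --size=4G --ioengine=libaio --direct=1 --numjobs=1 --group_reporting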

I've never looked much at performance because my file servers serve over my Gigabit LAN and are fast enough to saturate that. More speed provides no benefit (at least until the cost of 10G or at least 2.5G hardware comes down. ;) )

Also I should make sure you are aware of the SMR drives that WD has sold as RAID drives. They will perform very poorly. If you haven't already, you need to make sure that these are not the WD reds that you are using.

Many drives identify as having 512-byte sectors but internally are really 4K. You should align your partitions as if the drive had 4K sectors. Someday, if you need to replace a drive, it might be with a 4K drive and you will be ready for that. Most modern tools will automatically align for best performance, but you can override that if you're not careful.
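For example, something along these lines will show what the drives report and whether any partitions are aligned (the parted check only applies if the array members are partitions rather than whole disks):

Code:
# physical vs logical sector size and alignment offset for every block device
lsblk -o NAME,PHY-SEC,LOG-SEC,ALIGNMENT
# check partition 1 on one drive; repeat per member if partitioned
sudo parted /dev/sdc align-check optimal 1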

I am not familiar with Memclock tristating or DQS training control.

Sorry I can't offer more help.
 
Those aren't WD Reds. They are WD RE4s, which are older enterprise RAID disks.

They are disks from around 2010. They get around 95 MB/s write individually per disk, but I'm only getting 70-120 MB/s with a 6-disk RAID 10. I can't figure out why. These might be old enough to still be 512-byte native. I'll have to try with 4096 again and test performance.
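This is the kind of check I'm planning to run first to confirm what the drives actually report (a sketch, using one member from the output above):

Code:
# sector sizes as reported by the drive and by the kernel
sudo hdparm -I /dev/sdc | grep -i 'sector size'
cat /sys/block/sdc/queue/logical_block_size /sys/block/sdc/queue/physical_block_size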

I tested the tri-state option and other things and none of it helped. I'm starting to think the problem is Linux itself and/or the file system. I have a feeling they changed something on a whim or for a security reason over the past several versions.

Here is some more testing using dd:

Code:
$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.62504 s, 232 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.34143 s, 201 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.95121 s, 272 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=direct conv=fdatasync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.08304 s, 263 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=direct conv=fsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.1799 s, 257 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=dsync conv=fsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.28045 s, 203 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 oflag=dsync conv=fdatasync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.16565 s, 208 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 conv=fdatasync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.37737 s, 200 MB/s

$ sudo dd if=/dev/zero of=/home/*****/Storage/test1.img bs=1G count=1 conv=fsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.35503 s, 201 MB/s

Storage is the mount point of the RAID; it's not mounted like a normal directory under home. Not sure if that makes any difference. All the testing I've done works out the same way with a real-world folder. Nothing goes above 130 MB/s easily. It started climbing once, but only got to around 145 MB/s. Nothing near what it should be getting.
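I also want to try reading straight from the md device to take the filesystem and page cache out of the picture (a sketch; it's read-only, so it shouldn't touch the data):

Code:
# sequential read from the raw array, bypassing the filesystem and page cache
sudo dd if=/dev/md127 of=/dev/null bs=1M count=4096 iflag=direct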

https://en.wikipedia.org/wiki/GVfs

Could this have anything to do with it?! Maybe it's burdened with so many file system checks that it's destroying all access performance. GVfs is just for showing things in the GUI... (It could still be causing problems, though, I guess.)

It's got to be something with async and the operating system at some level... There has to be a way to run this at proper performance. I can understand 10% overhead or so, but this is over 66% overhead.

The Linux developers are pathetic if they find this acceptable for normal performance and don't care. How can they not have proper software RAID support out of the box, considering how long it's been around? There must be some massive security issue or something. 99% of all RAIDs would use near=2 and the other defaults, so how hard would it be to at least make the most common automatic setups for HDDs and SSDs a standard thing? Let alone make it easier to find information on the settings, or provide an actual GUI or other tools for RAID and similar issues.

They need far more open-ended software. They need to stop doing everything under the hood and just make properly complete, transparent GUIs and other software. Even the CLI stuff is hidden behind incomplete descriptions in the manual pages. They seem to rely on others to write the info needed for everything instead of producing complete documentation. It would be so easy to make a general Linux reference set that goes beyond the man pages and fills in the rest. Dictionary and encyclopaedia pages would be great; they could cross-reference them and have proper full documentation for Linux, and for computers in general. They don't seem to be in the business of educating anymore, sadly.

I'll have to look into XFS more. I thought I tried it and it didn't work.

https://docs.fedoraproject.org/en-US/Fedora/14/html/Storage_Administration_Guide/xfsmain.html

In fact, XFS isn't showing up beyond Fedora 26 in those docs. Maybe that is why my Fedora 25 install was handling it so well.

Edit: XFS did not help. In fact it made the performance worse by about 10%... It would be nice if people in the Linux community would start being helpful again and actually answer questions. Linux forums do not answer questions unless you already know the answer, and they are obviously very hard to get answered outside of Linux forums. It's been turned into a little boys' club instead of a serious engineering environment.

Actually, next I'm going to look into optimizing XFS, since there might be info on how to customize it.
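If I'm reading the XFS docs right, the stripe geometry can be passed at mkfs time. Something like this, assuming the 2048K chunk and the 6-drive near=2 layout from the mdadm output (which should mean 3 data-bearing stripes). It reformats the array, so it's only a sketch:

Code:
# WARNING: reformats the array; su=2m matches the 2048K chunk, sw=3 assumes near=2 on 6 drives
sudo mkfs.xfs -f -d su=2m,sw=3 /dev/md127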
 