
Software RAID (mdadm) and you!


WARNING: This post contains some THEORY (where noted) and may describe features that are not actually available in anything discussed (software or hardware). This post is for information gathering only. If any of this information is confirmed working, I will update the thread.

-----------------------------

I've been looking into software RAID; it has been my "project" for the night. It is a very interesting idea and there are a few things I would really like to test without risking data loss. I may pick up some drives so that I can test this out in a "live environment".

So, let's start with the basics. Mdadm is a utility available on Linux operating systems that allows you to run multiple hard drives in a software RAID array. Mdadm will not drop a drive that is unresponsive, which means you can use drives with long error recovery times (i.e., Green drives). The array will simply be unresponsive (or slow, if it reads from parity) until the drive comes back online. This means a substantially lower hardware cost, as a RAID card is not needed. Where did I find this out? On Backblaze's website. They use HUGE servers (45 1.5tb drives per server) with mdadm as their RAID, and they use "home drives", which is something I found extremely interesting. In combination with their article (which I had originally read when it was released) and RJARRRPCGP's post, I've been thinking very hard about how this would function.

For example, mdadm allows the expansion of RAID arrays through a simple command. This would let a user start off with a small array and grow it as their server grows. Granted, the rebuild time is something that would need to be tested, and I wouldn't expect anything less than a few days once you start getting into the 5+ TB range, but it is still better than backing up to a device that can store all the data, rebuilding the array with the new drives, and moving the data back.

So, what can you do with this? What would be reasons to use mdadm over hardware RAID?

  • Cost is substantially lower
  • Drives are not dropped out of the array when they go unresponsive
  • Allows you to easily expand an array (some RAID cards do this too)
  • Allows [virtually] an unlimited number of hard drives to be added to an array
  • Easily monitor the hard drives' SMART data, including temperature, without using proprietary software (warnings/emails can be scripted in; a sketch follows this list)
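For example, here is a minimal sketch of the kind of SMART check that could be scripted. It assumes smartmontools is installed and that /dev/sdb is one of the member drives; both are assumptions, so adjust for your system:

Code:
smartctl -A /dev/sdb
#Dump all SMART attributes for one member drive

smartctl -A /dev/sdb | grep -i temperature
#Pull out just the temperature line; this could go in a cron job that emails when a threshold is crossed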
What are possible (including untested/unknown) downsides? Some of these can be tested in a virtual environment.

  • Array performance will [probably] be substantially lower
  • Rebuilds will take longer since you don't have an optimized, dedicated processor (the RAID card itself) to calculate parity
  • How does TLER respond to an array?
  • If a drive is unresponsive, so is the array? What about with parity?
  • Recovering an array that is good, but the server has failed (bad motherboard, operating system drive, etc)
  • Recovering an array that is failed
-----------------------------

THEORY:

My question is this, though: can you shrink an array through mdadm? If so, it would let you move data completely between two different mdadm arrays. For example, take my server and say I had a ton of money lying around (I wish!). My server can run 20 hard drives at a time. Say I filled it up with cheap 1tb drives and have a running capacity of 18tb (RAID 6). Now say I'm nearing the capacity of this array and want to upgrade to larger drives, and I don't want to purchase a second server full of 2tb drives just to copy the data over.

If you could shrink the mdadm RAID by 4 drives (by cleaning up space on the array, resizing the filesystem, then removing drives from the array; you'd need 4tb of free space, which could be offloaded to external drives, other internal drives, or other computers), you would have four open bays for a new RAID 6 array. So you resize the filesystem and drop four of the 1tb drives out of the array, leaving you with 14tb of data and little to no free space. Shut the server down, remove the unused 1tb drives and plug in your shiny new 2tb drives. Fire the server up and build a RAID 6 array from the four new 2tb drives, which gives you 4tb of space to work with. Format the new array and move 4tb of data onto it. Continue shrinking the old array and growing the new one until there is no data left on the 1tb array. Remove all the 1tb drives, plug in the remaining 2tb drives and resize. You have successfully gone from an 18tb RAID 6 array to a 36tb RAID 6 array without using two computers or two servers. Bottom line: I know you can't do this with hardware RAID; can it be done with software?
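In the meantime, here is a rough, untested sketch of what a shrink might look like, pieced together from the mdadm man pages. The sizes and device names are placeholders, the filesystem must be shrunk below the new array size first, and whether --grow can actually reduce the device count depends entirely on your mdadm and kernel versions:

Code:
umount /dev/md0
e4fsck -f /dev/md0
#Always check the filesystem before changing any sizes

resize4fs /dev/md0 <new-fs-size>
#Shrink the filesystem BELOW the target array size (placeholder value)

mdadm --grow /dev/md0 --array-size=<new-array-size>
#Clamp the usable array size; a reversible safety step before the real reshape (size syntax varies by mdadm version)

mdadm --grow /dev/md0 --raid-devices=16 --backup-file=/root/md0-backup
#Reshape down to 16 devices; expect this to take a very long time

mdadm /dev/md0 --remove /dev/sdq
#Remove each drive the reshape pushed out as a spare, then power down and pull them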

Even without being able to remove drives from an array, software RAID would be extremely useful if it proves to be stable and fast enough for multiple data streams.

-----------------------------

I do have a few ideas for testing without any hardware cost, so I can see how this works. "How would you do that, Thideras?", you may ask. Simple: virtual machines. I can create multiple small virtual hard drives that would allow me to create an array. The only downside is that these are "fake" hard drives and will not reflect how an actual drive responds to being in an array, but they will show how the mdadm utility handles drives falling out of the array, as well as adding and (possibly) removing them. It should give a fair indication of how fast it can rebuild/resize, how much CPU it takes up, and where the bottlenecks are.
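As a side note, you don't even strictly need a VM to poke at the mdadm side of this; plain files attached as loopback devices can stand in for drives just as well, with the same "fake drive" caveat. A quick sketch, with arbitrary paths and sizes:

Code:
dd if=/dev/zero of=/tmp/disk1.img bs=1M count=1 seek=5119
#Create a ~5gb sparse file to act as a fake drive (repeat for disk2.img, disk3.img, disk4.img)

losetup /dev/loop1 /tmp/disk1.img
#Attach the file as a loop device (repeat for the others)

mdadm --create --verbose /dev/md9 --level=5 --raid-devices=4 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
#Build a throwaway test array out of the loop devices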

To-do list:

  • Install host operating system (CentOS)
  • Update the host operating system and install utilities needed
  • Create multiple small hard drives (unknown size as of now, maybe 5gb in size?)
  • Use these drives in a RAID using the mdadm utility
  • Test what happens when a drive suddenly stops responding (delete the drive or find another method to break it; see the sketch after this list)
  • See how quickly it can rebuild an array when forced (easy to test)
  • See how quickly it adds drives to an array
  • See how quickly it adds multiple drives to an array
  • See if you can change the RAID level without shutting down the array
  • See if you can remove drives from the array
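For the "drive suddenly stops responding" and rebuild items above, mdadm can fake a failure itself, which should at least exercise the degraded and rebuild paths; the device names here are examples:

Code:
mdadm /dev/md0 --fail /dev/sdc
#Mark a member as faulty; the array should keep running in degraded mode

mdadm /dev/md0 --remove /dev/sdc
#Pull the faulty member out of the array

cat /proc/mdstat
#Confirm the array is degraded

mdadm /dev/md0 --add /dev/sdc
#Re-add the drive; a rebuild onto it should start automatically

mdadm -D /dev/md0
#Shows "Rebuild Status" while the rebuild runs (progress is also visible in /proc/mdstat)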
-----------------------------

If everything works, we need to identify a few things:

  • What port multipliers can be used to interface the drives?
  • What are the best drives for the array?
In addition to that, if this turns out to have positive results, I think I will personally write an article on the subject.

Now, I'm sorry if this doesn't make any sense; I wrote this late on a Saturday night of hard thinking. Please let me know if there are any questions on anything I discussed. I would also like to note that this "test server" (or any server with this purpose, really) would not have "enterprise quality" drives and equipment, because that raises costs higher than most home users can afford, myself included. The purpose of this is massive amounts of storage, with parity, for cheap. Quality can be worked around through different methods (RAID, backups, etc.).
 
Got the virtual machine created and am just starting the normal install tonight, since it will take a bit to get going. The guest operating system of choice is CentOS 5.5 32-bit; the host is CentOS 5.5 64-bit running on my normal file server.

[Screenshots: the CentOS virtual machine in VMware and the CentOS installer running]

The test operating system will be very basic: no window manager and very few services installed. Samba will be available in case this idea somehow works.
 
Well - I haven't been a Unix admin for about 15 years, but I do manage them and lead the technical planning. We have over 2 PB (petabytes) of hardware RAID, but in some instances we do use software RAID. Where we have done this is in cases where an older RAID card in a server has caused performance issues. It seems to me that it is more work for the admin to manage software RAID, as hardware RAID offers some time-saving tools.

As you mentioned, software RAID can be more flexible than some hardware RAID. It does impact system CPUs, but we usually have spare cycles and our bottleneck is I/O. We've recently had some disk failures which led to quite a bit of manual intervention, where a hardware RAID might have handled some of that for us. If you have spare CPU cycles and don't mind some more manual intervention, then software RAID can be beneficial. Also, you should use ext4 filesystems; we've seen about a 25% performance improvement over ext3. ZFS would be best, but that is currently only available natively on Solaris, and the Linux emulations are not advised. Eventually, if ZFS is integrated into Linux, it could be your software RAID.
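If you do put ext4 on top of md, it may also be worth aligning the filesystem to the array geometry at creation time. A hedged example for a 64K chunk RAID 5 with three data disks, assuming 4K filesystem blocks and a reasonably recent e2fsprogs/e4fsprogs (the extended option names can differ between versions):

Code:
mkfs.ext4 -E stride=16,stripe-width=48 /dev/md0
#stride = chunk size / block size = 64K / 4K = 16
#stripe-width = stride x number of data disks = 16 x 3 = 48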

Our experience with software RAID is that even with a low-end server, you will have higher performance than server-based hardware RAID, such as a PERC card, and better than low-end hardware RAID. I'm not sure how software RAID stacks up against mid-range hardware RAID; it may keep up. It probably can't come close to high-end hardware RAID. Lastly, software RAID allows you to dedicate more memory to caching than a hardware RAID card does, as long as you have some to spare.
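On the caching point, one md-specific knob worth knowing about for RAID 5/6 is the stripe cache; a small sketch follows (the value is arbitrary, and the cache costs roughly entries x 4K x number of member drives in RAM):

Code:
cat /sys/block/md0/md/stripe_cache_size
#Show the current number of stripe cache entries (the default is fairly small)

echo 8192 > /sys/block/md0/md/stripe_cache_size
#Raise it; this often helps RAID 5/6 write and reshape throughput at the cost of some RAM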
 
Good information, I didn't know you used software RAID there. Makes this test a little more promising!
 
There is very good information in the man pages for mdadm.

Grow: Grow (or shrink) an array, or otherwise reshape it in some way. Currently supported growth options including changing the active size of component devices in RAID level 1/4/5/6 and changing the number of active devices in RAID1
You can also use spare drives:

-x, --spare-devices=
Specify the number of spare (eXtra) devices in the initial array. Spares can also be added and removed later. The number of component devices listed on the command line must equal the number of raid devices plus the number of spare devices.
It also looks like level changes while growing will be supported in the future:
-l, --level=
Set raid level. When used with --create, options are: linear, raid0, 0, stripe, raid1, 1, mirror, raid4, 4, raid5, 5, raid6, 6, raid10, 10, multipath, mp, faulty. Obviously some of these are synonymous.
When used with --build, only linear, stripe, raid0, 0, raid1, multipath, mp, and faulty are valid.
Not yet supported with --grow.
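Putting the -x option into practice, here is a hedged example of creating the same sort of RAID 5 but with one hot spare (the device names are placeholders):

Code:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
#Three active devices plus one spare; if a member fails, mdadm rebuilds onto the spare automatically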
-----------------------------

I have a RAID 5 array of 3 devices created right now. It is pulling a constant 30 MB/s while writing and 48 MB/s while reading. Considering this is in a virtual machine sitting on another file system in a different RAID array, that isn't too bad. It may also be an issue that the actual hardware interface it is using has an MTU of 9000 while the virtual device in the virtual machine runs at 1500. I'll try adding and removing drives to see how well it works.
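For anyone who wants to compare numbers, a rough way to measure sequential throughput is plain dd with direct I/O so the page cache doesn't flatter the results; the path and size here are arbitrary:

Code:
dd if=/dev/zero of=/raid/raid5/testfile bs=1M count=2048 oflag=direct
#Sequential write test, bypassing the page cache

dd if=/raid/raid5/testfile of=/dev/null bs=1M iflag=direct
#Sequential read test, bypassing the page cache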
 
SUCCESS

Here is how you create an array with mdadm. Anything that starts with # is a note; do not type it. The RAID level, device names, and mount point below are specific to my setup and almost certainly will NOT match your system, so change them to fit your configuration.

Code:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
#Create a RAID 5 array with the sdb, sdc and sdd devices and set them as "/dev/md0"
#Pay attention to what devices you add, you may need to change them

mkfs.ext4 /dev/md0
#Create the file system on /dev/md0

mount /dev/md0 /raid/raid5
#Mount the array to a specified folder
Now, we need to add this to the configuration file so the array persists after a restart. DO NOT SKIP THIS:

Open the configuration file:
Code:
nano -w /etc/mdadm.conf
Add your devices to the DEVICE line; you can do it multiple ways. In this example, I have sdb, sdc and sdd. You only want to use one of these forms, depending on your configuration.
Code:
DEVICE /dev/sd[bcd]
#Character-class form; useful if your devices are not consecutive (e.g., sdb, sdc and sde would be /dev/sd[bce])

DEVICE /dev/sd[b-d]
#Range form; useful if your devices are consecutive

DEVICE /dev/sdb /dev/sdc /dev/sdd
#Long-hand version of the above
After that, save the file and close it. Type the following at the command line:
Code:
mdadm --detail --scan >> /etc/mdadm.conf
Open the file back up and you will see something similar to this at the end of the file:
Code:
ARRAY /dev/md0 level=raid5 num-devices=3 metadata=0.90 UUID=0e364d44:896cb5c1:3fc5c4c1:855f11ff
Your system is now ready to go! Just remember to add the /dev/md0 device to the /etc/fstab file so it mounts on restart; after that, you can use it just like a normal device.
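As an example, the fstab entry for the array above might look like this (the mount point and options are just what I used earlier; adjust to taste):

Code:
#device      mount point     fs      options     dump  fsck order
/dev/md0     /raid/raid5     ext4    defaults    0     0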

--------------------

Here is how you grow an already existing array:

Code:
umount /dev/md0
#Unmount the array

mdadm --add /dev/md0 /dev/sde
#Add the new drive as a hot spare

mdadm /dev/md0 --grow --raid-disks=4
#Grow the array onto the new spare, making it an active md device

mdadm -D /dev/md0 | grep Reshape
#Watch the reshape status (this can take a long time; do not continue until it finishes)

e4fsck -f /dev/md0
#Before we resize the filesystem, we need to check it for errors

resize4fs /dev/md0
#Resize the filesystem to fill the array.  NOTE: The command is "resize4fs" on CentOS 5.5 and may be "resize2fs" on other versions or operating systems!

mount /dev/md0 /raid/raid5
#Mount the partition and use as normal
Now we need to update the configuration file so it builds properly on restart. Open the configuration file through your favorite editor and find the line for the array:
Code:
DEVICE /dev/sd[bcd]

ARRAY /dev/md0 level=raid5 num-devices=3 metadata=0.90 UUID=0e364d44:896cb5c1:3fc5c4c1:855f11ff
Since we added /dev/sde in our example, we need to update it to the following:
Code:
DEVICE /dev/sd[bcde]

ARRAY /dev/md0 level=raid5 num-devices=4 metadata=0.90 UUID=0e364d44:896cb5c1:3fc5c4c1:855f11ff
 
Got the RAID 6 set up very easily. Went to add a drive and got "Invalid argument". It seems that RAID 6 reshape support hadn't been added yet in the 2.6.18 kernel (CentOS 5.5). I'm currently compiling 2.6.35.7, and I've enabled the EXPERIMENTAL option for multithreaded parity calculations for software RAID.

The kernel is compiling right now, so I'm waiting on that.
 
SUCCESS

I'm running the 2.6.35.7 kernel with RAID 6 reshaping support that I built from source! It was a PITA to get through since LVM isn't enabled by default in the kernel, but I finally got it! Check this bad boy out!

Code:
[root@localhost ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 0.91
  Creation Time : Sun Oct  3 14:48:29 2010
     Raid Level : raid6
     Array Size : 8388480 (8.00 GiB 8.59 GB)
  Used Dev Size : 4194240 (4.00 GiB 4.29 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sun Oct  3 18:40:49 2010
          State : clean, recovering
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

 Reshape Status : 10% complete
  Delta Devices : 2, (4->6)

           UUID : e33fdea7:69436750:f4f9893a:9994f9a5
         Events : 0.47

    Number   Major   Minor   RaidDevice State
       0       8       80        0      active sync   /dev/sdf
       1       8       96        1      active sync   /dev/sdg
       2       8      112        2      active sync   /dev/sdh
       3       8      128        3      active sync   /dev/sdi
       4       8      160        4      active sync   /dev/sdk
       5       8      144        5      active sync   /dev/sdj
 
I broke the other arrays and moved all the drives to the RAID 6 array, and it went fairly quickly. With 4gb per "drive", it went from 6 to 22 drives in roughly 10 minutes. The main reason I did this is that RAID 10 does not support reshaping, so adding drives is a no-go. It does work fine if you don't need to change the number of drives, though.
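For reference, tearing down an old array so its drives can be reused generally looks something like this (a sketch; the md device and drive names are examples, and zeroing the superblock destroys the old array for good):

Code:
umount /dev/md1
mdadm --stop /dev/md1
#Stop the old array

mdadm --zero-superblock /dev/sdf
#Wipe the md metadata so the drive looks blank (repeat for each drive)

mdadm /dev/md0 --add /dev/sdf
mdadm --grow /dev/md0 --raid-devices=22
#Hand the drives to the surviving array and grow onto them, as shown earlier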

I've not seen a VMWare machine with this many drives before:

[Screenshot: the CentOS virtual machine in VMware with 23 virtual drives attached]


Code:
[root@localhost ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Sun Oct  3 14:48:29 2010
     Raid Level : raid6
     Array Size : 83884800 (80.00 GiB 85.90 GB)
  Used Dev Size : 4194240 (4.00 GiB 4.29 GB)
   Raid Devices : 22
  Total Devices : 22
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Oct  4 06:18:19 2010
          State : clean
 Active Devices : 22
Working Devices : 22
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : e33fdea7:69436750:f4f9893a:9994f9a5
         Events : 0.334

    Number   Major   Minor   RaidDevice State
       0       8       80        0      active sync   /dev/sdf
       1       8       96        1      active sync   /dev/sdg
       2       8      112        2      active sync   /dev/sdh
       3       8      128        3      active sync   /dev/sdi
       4       8      160        4      active sync   /dev/sdk
       5       8      144        5      active sync   /dev/sdj
       6      65       96        6      active sync   /dev/sdw
       7      65       80        7      active sync   /dev/sdv
       8      65       64        8      active sync   /dev/sdu
       9      65       48        9      active sync   /dev/sdt
      10      65       32       10      active sync   /dev/sds
      11      65       16       11      active sync   /dev/sdr
      12      65        0       12      active sync   /dev/sdq
      13       8      240       13      active sync   /dev/sdp
      14       8      224       14      active sync   /dev/sdo
      15       8      208       15      active sync   /dev/sdn
      16       8      192       16      active sync   /dev/sdm
      17       8      176       17      active sync   /dev/sdl
      18       8       64       18      active sync   /dev/sde
      19       8       48       19      active sync   /dev/sdd
      20       8       32       20      active sync   /dev/sdc
      21       8       16       21      active sync   /dev/sdb
Code:
[root@localhost ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       15G  4.5G  9.4G  33% /
/dev/sda1              99M   30M   65M  32% /boot
tmpfs                1013M     0 1013M   0% /dev/shm
/dev/md0               79G   34G   42G  46% /raid/raid6
 
e4fsck -f /dev/md0
On some other distros it tends to be e2fsck, as you noted on the next command. Minor, but it could cause problems for someone following along. Cool article as always, Thid. Looking forward to seeing how it all pans out.
 
I've been working with software RAID in Ubuntu for the last 7 months or so, no problems until today, lol.

Added another 2tb drive to the array and finished reshaping; now I just need to resize the partition, which has been a no-go.

Running e2fsck -f /dev/md0 hangs at 11.6%. Going to let it run overnight but doubt it will have done anything by morning.

The file system is ext3; sdb, sdc, sdd, sde, sdf and sdg are all 2tb Hitachis.
 
That is odd. It doesn't give any errors at all? What about load levels and drive access? Is there a lot of "wa" percent in 'top'? I can't think of a reason it would stop mid-check and not give any sort of notification as to what is wrong. I don't think interrupting it is a problem if it is idle, but if there is disk access (or other load from that process), it should still be working.
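If top isn't conclusive, per-disk statistics will show whether the check is actually touching the drives; a small sketch using iostat from the sysstat package:

Code:
iostat -x 5
#Extended per-device stats every 5 seconds; watch the utilization and await columns for the md members

cat /proc/mdstat
#Also confirms whether md itself is resyncing or reshaping underneath the fsck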
 
I'm not overly concerned, yet. ;)

If I can bring this back to the point of your thread, lol, I will say that with jumbo frames turned on I see anywhere from 80-100 MB/s transfer rates across my gigabit network, which I think is excellent for software RAID.

I'm running Ubuntu 10.04 on a dual Xeon with 6 GB of RAM (overkill on both counts, I'm sure) and couldn't be happier with the performance.
 
How large is the array and what parity level are you using?

I want to test this out some time, but it will be difficult taking my current array offline. I have a lot of data stored on it and don't have money to buy other drives. I also need to upgrade my switch to one that can do more than 85 MB/s. QQ
 

It was an 8tb RAID 5 with a spare. I was running low on space :D and attempted to grow the array onto the spare. I'm planning to pick up another drive or two the next time there's a good deal, although it looks like the Hitachis may be harder to find.

I realize that a good hardware RAID card could make processes like this much easier, but the cost/performance has been so good that just doing software RAID works for me.

Like you, though, if I wanted to migrate to another solution down the road, how am I going to move what is now just under 7tb of data to a new array without purchasing several more drives?
 
Well, another great experience with software RAID in my book.

This morning it had finished e2fsck and I was able to resize from 7.1tb to 9.3tb. Now to browse for a couple more 2tb drives, lol.
 
Anything I can assist with for this thread? I did a RAID 1 to RAID 5 conversion a while back following an online guide that Google turned up; it may be useful for the average user who doesn't have quite enough hard drives to copy all their existing data to.

Without that I'd have had to purchase another 2tb drive when I originally did the setup.


You need to add how to edit fstab; for a beginner, that will be one step that could be confusing (as it was for me). I normally edit via "sudo nano -w /etc/fstab"; I know there's another way that involves the Insert key that I could never get quite right, lol.
 
I don't want to risk your data, but I want to test a few things.

-What happens if you just yank a drive out?
----It should fall back on parity. What happens when you put it back in?
----What happens if you remove n+1 drives (2 on RAID 5)? Does the array halt and then rebuild when the drive is available?

-What happens with TLER? Does the drive actually drop from the array or does it wait?

These are the things I wanted to test. Almost everything else has been answered. TLER is a bit tricky.
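On the TLER question, many newer drives expose it through SCT Error Recovery Control, which a recent smartctl can read and, on drives that allow it, set. This is a hedged sketch; not every drive (Green drives especially) supports it, and the device name is an example:

Code:
smartctl -l scterc /dev/sdb
#Show the current read/write error recovery timeouts, if the drive supports SCT ERC

smartctl -l scterc,70,70 /dev/sdb
#Request 7-second timeouts (the values are in tenths of a second)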
 