A highly resilient raid solution!
By Red Squirrel
For a raid, you obviously need multiple hard drives. With software raid, you'll want a separate OS drive (which can itself sit on hardware raid) and then the data drives. You CAN use md raid for the OS, but I have never tried it myself and am not sure of the process. Perhaps I can cover that in a future article once I've done it.
Ideally, you want to build a machine with removable disk bays on the front to make hot swapping easy. The board the drives plug into is known as a backplane. Some server cases have it built in; for standard cases you can buy an enclosure that fits in the 5.25" bays. The advantage of a setup like this is that when a drive fails, you simply pull the faulty one out, insert a new one, and let the array rebuild. Some enclosures require you to screw the drive to a tray, while others take the bare drive directly, since the sata/power connector layout is standard on all drives (that I've seen). You do not need hot swap bays, though. If you prefer, you can still put the drives inside the case; it's your choice.
It's also a good idea to label the drive bays and keep a list somewhere of the serial numbers and which bays they are in. I will cover this more later.
The first thing you want to do is decide which drives to use. If you are as particular about the physical placement of the drives as I am, the easiest way to ensure a single array occupies a single area of the bays is to start with all the drives out, then insert them one at a time. After each insert, run the dmesg command, which will show what name the drive was given. dmesg -c prints the log and then clears it, so run it before inserting each drive to make the new messages easy to spot. Take note of the name, insert the next drive, and repeat.
If you don't care and already have all the drives inserted, you can type fdisk /dev/sd and hit tab. Tab completion works the same with any other command; fdisk is just an example. You will see something like this:
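As an alternative to reading dmesg by eye, the one-drive-at-a-time routine can be sketched as a comparison of two device listings. This is only a sketch: new_devices is a hypothetical helper name, and it assumes the drives show up as /dev/sdX names as they do in this article.

```shell
# Hypothetical helper: print device names present in the second list
# but not in the first. Each argument is a space-separated list such
# as "sda sdb sdc".
new_devices() {
    before=$1
    after=$2
    for dev in $after; do
        case " $before " in
            *" $dev "*) ;;                   # was already there
            *) printf '%s\n' "$dev" ;;       # appeared after the insert
        esac
    done
}

# Intended use on a live system (not run here):
#   before=$(ls /dev | grep -E '^sd[a-z]+$' | tr '\n' ' ')
#   ... insert the drive and wait a few seconds ...
#   after=$(ls /dev | grep -E '^sd[a-z]+$' | tr '\n' ' ')
#   new_devices "$before" "$after"
```

Whatever name it prints is the drive that was just inserted, which is the name to write down next to the bay.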
[root@raidtest ~]# fdisk /dev/sd [tab]
sda sda1 sda2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo
Remember, one of these is your OS drive, and you want to leave that one alone. When building a server I always make sure to put the OS drive on the first sata port. Ports are normally numbered 0, 1, 2 and so on, so in this case the OS drive is sda. It is also easy to spot because it has multiple partitions: sda1 and sda2. If you want more information on a drive, such as its serial number, make, and size, you can use the smartctl command.
[root@raidtest ~]# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD6402AAEX-00Z3A0
Serial Number:    WD-WCATR4058891
Firmware Version: 05.01D05
User Capacity:    640,135,028,736 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Sep 5 22:20:29 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
                                       was suspended by an interrupting command from host.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
                                       without error or no self-test has ever been run.
Total time to complete Offline data collection: (12360) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 145) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   170   021    Pre-fail  Always       -       4308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6708
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       66
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This also shows smart information and is a great tool to assess if a hard drive may be failing.
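Since the smartctl output is long, a small filter can pull out just the parts needed for the bay list and for a quick health check. The sketch below assumes the output layout shown above (other smartctl versions or drive types may label fields differently). The function names drive_identity and smart_warning_counters are hypothetical, and the choice of three counters to watch is my own assumption about which attributes are most telling, not something smartctl prescribes.

```shell
# Hypothetical helpers; the function names are not part of smartmontools.

# Read `smartctl -i` or `smartctl -a` output on stdin and print
# "<serial><TAB><model>", handy for the bay/serial number list.
drive_identity() {
    awk -F': *' '
        /^Device Model:/  { model = $2 }
        /^Serial Number:/ { serial = $2 }
        END { print serial "\t" model }
    '
}

# Read `smartctl -A` or `smartctl -a` output on stdin and print the raw
# value (column 10) of three counters that often rise on a failing drive.
smart_warning_counters() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
            print $2, $10
        }
    '
}

# Intended use on a live system (not run here):
#   smartctl -i /dev/sdb | drive_identity
#   smartctl -A /dev/sdb | smart_warning_counters
```

On the drive shown above all three counters are 0. A non-zero and growing value is a reason to look closer, not proof of failure.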
Creating an array
All Linux raid related work is done with the mdadm command. A Linux raid is actually surprisingly easy to set up. mdadm --help shows all the different commands, and mdadm --command --help shows more info for that command. Instead of just pasting the output of --help like a lot of articles tend to do, I'll actually walk you through the process, but --help does serve as a nice reference if you forget the exact wording of a command.
Let's start by making a raid5 array using 4 drives: /dev/sdb, /dev/sdc, /dev/sdd and /dev/sde. These are 1GB drives. To calculate the actual size of the array, simply add up the sizes of all the drives minus one, since raid5 uses one drive's worth of space for parity. That gives us 3GB.
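The size arithmetic can be written down as a small function. This is a minimal sketch assuming all drives are the same size. The article only uses raid5; the other levels are included for comparison, and raid_usable_size is a hypothetical name.

```shell
# Usable capacity of an array built from equal-size drives.
# Args: raid level (0, 1, 5 or 6), number of drives, size of one drive.
raid_usable_size() {
    level=$1; count=$2; size=$3
    case $level in
        0) echo $(( count * size )) ;;          # striping, no redundancy
        1) echo "$size" ;;                      # mirror: one drive's worth
        5) echo $(( (count - 1) * size )) ;;    # one drive's worth of parity
        6) echo $(( (count - 2) * size )) ;;    # two drives' worth of parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

raid_usable_size 5 4 1   # the array in this article: 4 x 1GB in raid5
```

The last line prints 3, matching the 3GB worked out above.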
[root@raidtest ~]# mdadm --create --level=5 --raid-devices=4 /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: array /dev/md0 started.
[root@raidtest ~]#
--level is the raid level, which is raid 5. --raid-devices is the number of drives we are using, which is 4. /dev/md0 is the name of the raid device. Raid devices always start with md, and the array will show up like any other drive. You can call it anything you want, but it's best to stick to the proper convention to avoid confusion. If you use another name, an md-named device gets created anyway, so you may as well name it md yourself. Lastly, the 4 member drives are listed at the end, and you press enter. At this point the raid is building, which could take anywhere from an hour to days depending on the number of drives, their size, their speed, and the processor's speed. To see the status of the raid you can use this command:
[root@raidtest ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep 6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Sep 6 18:31:55 2011
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 19% complete

           Name : raidtest.loc:0 (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 4

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       4       8       64        3      spare rebuilding   /dev/sde
[root@raidtest ~]#
If you want to easily monitor the progress of one or more rebuilds, you can also use this command:
[root@raidtest ~]# watch 'cat /proc/mdstat'
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde sdd sdc sdb
      3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [======>..............]  recovery = 34.6% (363392/1048064) finish=0.7min speed=15141K/sec

unused devices: <none>
This will update every 2 seconds. Alternatively you can just use cat /proc/mdstat directly to show it once.
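If you would rather have a script report rebuild progress than watch the screen, the percentage can be picked out of /proc/mdstat. The sketch below assumes the mdstat layout shown above, where the array name starts a line and the progress line contains "recovery =" or "resync =". mdstat_progress is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "<array> <percent>" for every array that is currently rebuilding.
mdstat_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }            # remember which array we are in
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print array, $i  # the field ending in % is the progress
        }
    '
}

# Intended use on a live system (not run here):
#   mdstat_progress < /proc/mdstat
```

It prints nothing when no array is rebuilding, which makes it easy to use in a loop that waits for the build to finish.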
Once the array is complete, it will look like this:
[root@raidtest ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Sep 6 18:31:41 2011
     Raid Level : raid5
     Array Size : 3144192 (3.00 GiB 3.22 GB)
  Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Sep 6 18:32:49 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : raidtest.loc:0 (local to host raidtest.loc)
           UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
         Events : 20

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       4       8       64        3      active sync   /dev/sde
[root@raidtest ~]#
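The State line is the quickest thing to check in that output. As a sketch, assuming the "State :" label shown above, a one-line filter can extract it for use in scripts. array_state is a hypothetical name.

```shell
# Hypothetical helper: read `mdadm --detail` output on stdin and print
# only the value of the "State :" line, with surrounding spaces trimmed.
array_state() {
    sed -n 's/^ *State : *\(.*[^ ]\) *$/\1/p'
}

# Intended use on a live system (not run here):
#   mdadm --detail /dev/md0 | array_state
```

For the two outputs in this article it would give "clean, degraded, recovering" during the build and "clean" afterwards.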
Whether or not the array has finished building, it is ready to use. This means you can format it, mount it, and put data on it, so let's do that:
[root@raidtest ~]# mkfs.ext4 /dev/md0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=384 blocks
196608 inodes, 786048 blocks
39302 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=805306368
24 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912

Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@raidtest ~]#
[root@raidtest ~]#
[root@raidtest ~]# mkdir /mnt/md0
[root@raidtest ~]#
[root@raidtest ~]# mount /dev/md0 /mnt/md0
[root@raidtest ~]# dir /mnt/md0
lost+found
[root@raidtest ~]#
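The Stride and Stripe width values in the mkfs.ext4 output above were worked out automatically from the array's geometry. The sketch below shows where the numbers most likely come from: stride is the chunk size divided by the filesystem block size, and stripe width is the stride times the number of data drives (drives minus one for raid5). ext4_stripe_params is a hypothetical helper name.

```shell
# ext4 stride and stripe width for a striped array.
# Args: chunk size in KiB, filesystem block size in KiB, number of data drives.
ext4_stripe_params() {
    chunk_kib=$1; block_kib=$2; data_drives=$3
    stride=$(( chunk_kib / block_kib ))           # filesystem blocks per chunk
    stripe_width=$(( stride * data_drives ))      # filesystem blocks per full stripe
    echo "stride=$stride stripe_width=$stripe_width"
}

ext4_stripe_params 512 4 3   # 512K chunk, 4K blocks, 4-drive raid5 = 3 data drives
```

This prints stride=128 stripe_width=384, the same values mke2fs reported. If you ever see different numbers after formatting, it is worth checking the chunk size in mdadm --detail.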
Optionally you could create a few partitions on it first. A quick way to confirm that the mount is working is seeing the lost+found folder. You can also use the df command which displays the disk space usage on all local devices:
[root@raidtest ~]# df -hl
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       7.9G  2.9G  4.7G  39% /
tmpfs           499M     0  499M   0% /dev/shm
/dev/sda1       194M   25M  159M  14% /boot
/dev/md0        3.0G   69M  2.8G   3% /mnt/md0
[root@raidtest ~]#
At this point, should any one of those 4 drives fail, the data will still be available and normal operations will continue. Before we go on, we need to save the configuration. Each drive carries a UUID that identifies it as part of the array, but as things stand, if we reboot we will need to reassemble the array by hand. To make this automatic we need to save the settings. To do this, issue this command:
[root@raidtest ~]# mdadm --detail --scan > /etc/mdadm.conf
[root@raidtest ~]#
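Note that the > redirect replaces whatever was in /etc/mdadm.conf. If the file already holds other settings, a more careful approach is to append only the arrays that are not listed yet. The following is a sketch under the assumption that mdadm --detail --scan prints one ARRAY line per array containing a UUID=... field. merge_array_lines is a hypothetical helper name.

```shell
# Hypothetical helper: append ARRAY lines from stdin to a config file,
# skipping any array whose UUID is already present in that file.
merge_array_lines() {
    conf=$1
    touch "$conf"
    while IFS= read -r line; do
        # Pull the UUID=... value out of the line; empty if there is none.
        uuid=$(printf '%s\n' "$line" | sed -n 's/.*UUID=\([0-9a-f:]*\).*/\1/p')
        [ -n "$uuid" ] || continue
        grep -q "UUID=$uuid" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
}

# Intended use on a live system (not run here):
#   cp /etc/mdadm.conf /etc/mdadm.conf.bak
#   mdadm --detail --scan | merge_array_lines /etc/mdadm.conf
```

Running it twice leaves the file unchanged the second time, so it is safe to repeat after adding a new array.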
Now if we reboot, the /dev/md0 device will exist and just needs to be mounted. If you want to automate mounting you can add it to /etc/fstab or to another startup script. Personally I do not like using /etc/fstab, because if the mount fails for whatever reason, the entire system will fail to boot. I prefer to add it to /etc/rc.local or a similar place. However, if any programs such as mysql depend on this mount, you will need to put it in fstab so it mounts before those programs load.
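For the /etc/rc.local route, it helps to check whether the array is already mounted before trying, and to log a message rather than stop if the mount fails. A minimal sketch, assuming the /mnt/md0 mount point used in this article; is_mounted is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given mount point is listed in
# /proc/mounts-style text read from stdin (the mount point is field 2).
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }'
}

# Sketch of the lines that could go in /etc/rc.local (not run here):
#   if ! is_mounted /mnt/md0 < /proc/mounts; then
#       mount /dev/md0 /mnt/md0 || echo "md0 failed to mount" >&2
#   fi
```

Because a failed mount here only prints a message, the rest of the boot carries on, which is the behavior the fstab approach lacks.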
Assembling an array
Let's say you forgot to save the settings and the system rebooted. You are not out of luck, but you will need to know which 4 drives make up the raid. Unless you've been swapping drives around, they should have the same names as before. To re-assemble, simply do the following:
[root@raidtest ~]# mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: /dev/md0 has been started with 4 drives.
[root@raidtest ~]#
Even with the configuration file set, I have seen some situations where the array does not start on its own. With the configuration file in place, you start the array with the above command, except you do not need to specify the individual drives. One more thing to know is that the physical drive order does not matter anymore. You can actually turn off the server, swap all the drives around, and it will boot up fine and the array will work fine. Of course, don't do that, as it will mess up your documentation. Any physical drive change you make, you should document.
Now it's fine and dandy to have redundancy, but I have seen cases, even in corporate environments, where a failure goes unnoticed because there is no proper alerting system. Then another drive fails, and it's game over. By default, root gets all mail related to mdadm. This may be fine if you have a forward set up, but in other cases you may want that mail to go to another address. To do this, add the following command to a startup script such as /etc/rc.local:
mdadm --monitor --scan --mail=[email address] --delay=1800 &
If you want to test to ensure any alert emails will be received you can issue this command:
[root@raidtest ~]# mdadm --monitor --scan --test
^C
[root@raidtest ~]#
You will need to ctrl+c out of it. You should get an email that looks something like this:
Subject: TestMessage event on /dev/md0:raidtest.loc
From: mdadm monitoring
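As a second line of defense alongside the mdadm mail alerts, a cron job can look at /proc/mdstat directly. The sketch below assumes the status map format shown earlier, where [UUUU] means all members are up and an underscore such as [UUU_] marks a missing or failed member. degraded_arrays is a hypothetical helper name, and this is a complement to mdadm --monitor, not a replacement for it.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# name of every array whose status map contains an underscore.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { array = $1 }             # remember which array we are in
        /\[[U_]*_[U_]*\]/ { print array }        # e.g. [UUU_] or [_U]
    '
}

# Intended use from cron on a live system (not run here):
#   bad=$(degraded_arrays < /proc/mdstat)
#   [ -z "$bad" ] || echo "Degraded arrays: $bad"
```

Cron mails any output to the job's owner, so an empty result stays silent and a degraded array produces a message.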
On the next page, we'll continue by looking at how to deal with a hard drive failure.
© Copyright 2017 Ryan Auclair/IceTeks, All rights reserved