First Phase : Building 14TB

Contents

  • Setting up a Storage Node
  • Setting up a Service Node
  • iSCSI Fabric Performance Notes
  • NFS Client Performance Notes
  • System Deployment
  • System And Device Failure Management
  • Setting Up The Second Service Node
  • Fault Archive
  • Major Events

  • 26th Aug 2006 : Purchase order is sent out for 2x storage nodes and 1x service node.
  • 13th Sep 2006 : Partial delivery of components. The first storage node is partially assembled (no CPU).
  • 26th Sep 2006 : Both storage nodes assembled, all hardware is verified and OS is running, storage runs at 200+MB/sec.
  • 27th Sep 2006 : Got iSCSI up and running for the first time. Initial performance over GE is not impressive (70MB/sec).
  • 29th Sep 2006 : I've assembled the service node. The AMD Athlon X2 3800+ was dead on arrival (poor AMD QC ?). Server can't boot up.
  • 4th Oct 2006 : I'm still waiting for the replacement AMD cpu to arrive, I decide to try out a new kernel 2.6.18 in the mean time.
  • 9th Oct 2006 : All storage servers are now racked up and hooked up on the network. I create a 2TB volume exported on NFS to backup SBCR.
  • 13th Oct 2006 : I have deployed a (nostalgic) Alpha DS10L running FreeBSD 6.1 as a management server.
  • 22nd Oct 2006 : The Alpha DS10L suffered a CPU fan failure, the chip pretty much got baked to death. Fortunately I grabbed a replacement.
  • 23rd Oct 2006 : Vendor finally delivers replacement AMD Athlon X2 3800+ CPU, but I've already bought my own replacement.
  • 20th July 2007 : The service node encountered (what appears to be) a kernel bug. I knew linux was NOT enterprise eady !

  • Setting up a Storage Node

  • Image gallery
  • Bios screen shots
  • Initial Hardware Preparations

    The majorly unfortunate discovery was that the 3ware 9590SE controller's bios doesn't appear to get along with the Asus M2N32 WS bios. What happens is that upon powering on, the 3ware's "Alt-3" command to enter its bios is never displayed. Consequently, even if volumes have already been configured on the 3ware, the motherboard's bios is not aware of them and naturally will not be able to boot from it. The 3ware's bios was upgraded to 9.3.0.6 but this did not correct the problem. A check on the Asus website revealed that this incompatibility had not been addressed.

    I managed to workaround this problem on the next day. I first installed the OS on a sata disk connected directly to the motherboard. Once the machine could boot off its own OS, I copied the tw_cli utility and created the raid5 volume on the 3ware card. Next I duplicated the OS onto the newly created 3ware raid5 volume. Note that it takes 8-9 hours to build the raid5 (about 230MB/sec) of 15 disks. Finally, I copied /boot onto a usb thumb drive and installed grub on it. From then on, the motherboard bios boots off the thumb drive. Grub loads the kernel and initrd off the thumbdrive. Once the kernel detects the 3ware controller and the volume(s), it mounts the root filesystem off the 3ware. The sata disk connected to the motherboard is then removed.

    The storage node is now up and running :

    Although the raw storage on paper is 8TB (ie 16 * 500GB disks), the final storage offered works out to 6.36TB. This is because :

    Various disappointments and/or surprises :


    Setting up a Service Node

  • Image gallery
  • Initial Hardware Preparations

    Assembling the service node is fairly straight forward. However, since the server has no hardware raid controller, the OS is installed on a pair of SATA disks and the OS configured to run off a software raid1. In order to accomodate the possibility of a disk failure, I decided to place the boot loader and the "/boot" files on a USB thumbdrive. This would provide a highly reliable boot loader. Everything else is essentially identical to the storage node, except that :

    Subsequently, I removed the thumb drive because it didn't seem to get along with the USB controller. I kept getting :

      usb 2-2: new high speed USB device using ehci_hcd and address 5
      usb 2-2: device not accepting address 5, error -110
      

    So I decided to drop the idea of booting from thumbdrive and instead put the grub boot loader on both disks, since the rootfs was a raid1 anyway. This was done using the following commands from the grub shell :

      grub> root (hd0,0)
      grub> setup (hd0)
      grub> root (hd1,0)
      grub> setup (hd1)
      

    iSCSI Fabric Performance Notes

    NOTE! Running iptraf reduces performance immediately by about 10-20%.

    Intel GE

    nVidia (Marvell) GE

    iSCSI Target Software

    Alternative Hardware

    Using Kernel 2.6.18


    NFS Client Performance Notes

    No particular tuning or experimentation has been done yet. NFS services were configured "out of the (CentOS) box", but i decided to toss in the "async" option in /etc/exports (ie, lie to NFS clients that writes have already been committed). Using LVM to manage the underlying iSCSI volumes, the first logical volume I created was 2TB and formatted with reiserfs.

    Solaris Clients

    Linux Clients

    System Deployment

  • Image gallery
  • Putting It All Together

    The system was deployed on the 9th Oct 2006, about 6 weeks after the purchase order was sent out. The setup is somewhere between proof of concept and production (proof of conceptual production?), and is therefore undergoing changes and tuning. I obtained a 3com 3870 switch and configured it to segregate traffic into VLANs illustrated by the diagram below.

    All ethernet interfaces run at gigabit speeds and the iSCSI fabric runs at 9000 byte mtu. The service node (silo1) gets 2Gbps of bandwidth to the fabric via channel bonding / link aggregation / 802.3ad / (or whatever terminology you wish). The 2nd storage node actually has an Intel dual GE PCIe card but is tentatively not used. As the name suggests, the iSCSI fabric is a subnet for strictly iSCSI traffic only. Thus a seperate network is setup for administration / management / monitoring. Finally, a single exit point presents a means for clients to utilize the facilities of the the storage equipment.

    There isn't much of a plan for an upgrade path at this point. As it is, we need to watch how the storage is used, such that genuine needs can be identified. Each of the 3 servers have 2 PCI-X slots available for future expansion. Additional storage or service nodes may also be added to the fabric (there're a lot of unused ports on the 48-port switch). The current 2 storage nodes have 6.5TB of available storage each. This volume is actually split into 2 LUNs, each 3.2TB. This permits 2 service nodes to mount a single storage node. Until then, the single service node runs LVM which performs all volume management. The following diagram illustrates this structure.

    As you can see, the many layers involved in acquiring the final volume for export can get messy. I felt that calling iscsiadm alone to perform mapping and unmapping of block devices was a bit incomplete. A more comprehensive wrapper which could report and handle errors gracefully (and also provide positive feedback on successful operations) would be nice. Thus, to ease the management of LVM's iSCSI physical volumes, I wrote a perl script to assist in bringing iscsi devices online and offline.

    Equipment Monitoring

    In order to protect the hardware from damage, I realized I needed an automated shutdown mechanism should a system overheat. It's always better to power off a machine (and risk damaging the file system), than to keep it running and risk damaging the hardware. The motherboard is capable of monitoring fan RPMs as well as board and CPU temperatures. Unfortunately, the default distribution of the 2.6.18 kernel did not detect the IT8716F chip. Fortunately a patch exists on the www.lm-sensors.org website. Next, the distribution's lm_sensors (version 2.8.7) was too old and I had to get a newer one (version 2.10.1) to interoperate with the new kernel.

    I wrote a script which checks that temperatures don't rise above certain levels and that fan RPMs don't drop below certain threshold. What was interesting was that I initially set the fan RPM threshold at 1000 RPM, since I observed it spinning at about 1500 RPM. However, when I let the machine cool (down to 22 degrees), I did not realize the initial power up RPM would only be 800 RPM. Thus the server kept switching itself off ! All environment monitoring parameters (ie temperature and RPM) are checked and logged every minute. Presently, I have the following environmental checks :

    I wrote another script which checks on each of the (physical) disks. Not only does this provide disk failure indicators, but it (should) provide pre-failure signs (ie hardware errors). For the service node, these would be the 2x OS disks. For the storage node, this would be the 16x disks behind the 3ware controller. For each disk, it regularly polls its temperature, read errors, seek error and crc errors. This information is then made available for the Management Node.

    Management Node

    On the 13th of Oct 2006, I setup a management node, based on FreeBSD 6.1 running on an (ancient) Alpha DS10L. Its role in life would be to poll the switch/service/storage nodes, charting their work loads as well as sending notification of events of interest (eg, disk failures, fan failures, etc). I wrote some scripts to pull data from both the 3com switch and the servers. Since I didn't want the poor old alpha to execute lots of fork() (think: cut, sed, awk, head, tail, etc), I decided to write them in perl.

    iSCSI Target Services

    The Service node started off by providing NFS volumes only. Around the beginning of 2007, users started to express interest in obtaining block devices as iSCSI targets from the service node. This allows users to run any filesystem, thus utilizing more optimal block level caching and also bypasses NFS locking. I wrote a script to simplify my job of administering iSCSI targets to users.


    System And Device Failure Management

    The Linux Volume Manager (LVM)

    Perhaps the first question on everybody's mind is : if LVM constructs volumes based on underlying iscsi disks, what happens should an underlying iscsi disk go offline ? To find out exactly what happens, I created a volume group consisting of 2 physical volumes obtained over iscsi. I then created a logical volume big enough to require both physical volumes. I then detached one of the physical volumes by terminating one of the iscsi sessions. Upon "losing" one of its physical volumes, LVM complained, but when I re-attached the physical volume, LVM was happy again (took a few seconds for LVM to "detect" the reattached drive). The following illustrates what happened :

      # vgcreate vol1 /dev/sda /dev/sdb
        Volume group "vol1" successfully created
      # lvcreate -L 200M -n lv1 blah
        Logical volume "lv1" created
      # map_iscsi.pl unmap sda
      unmapping 04:553e23 -> sda
      # lvscan
        Couldn't find device with uuid '67F3jR-92gj-K7mP-HGI6-0MA7-V8iE-Ehx9p6'.
        Couldn't find all physical volumes for volume group vol1.
        Volume group "vol1" not found
      # cat /proc/partitions
        :      :       :
       254     3     204800 dm-3	(Note: logical volume still exists)
      # map_iscsi.pl map 553e23
      553e23 = sde			(Note: device returns as different drive)
      # lvscan
        ACTIVE '/dev/vol1/lv1' [200.00 MB] inherit
      

    Unfortunately, even though the device seems to be "restored", attempts to access the file system on the logical volume failed. The following shows what happens when an attempt is made to mount the logical volume.

      # mount -o ro /dev/vol1/lv1 /mnt/disk
      /dev/vol1/lv1: Input/output error
      mount: /dev/vol1/lv1: can't read superblock
      

    The following messages appeared in syslog

      Oct 25 13:25:45 virgo kernel: 4:0:0:0: rejecting I/O to dead device
      Oct 25 13:25:45 virgo kernel: Buffer I/O error on device dm-3, logical block 0
      Oct 25 13:25:45 virgo kernel: Buffer I/O error on device dm-3, logical block 1
      Oct 25 13:25:45 virgo kernel: Buffer I/O error on device dm-3, logical block 2
      Oct 25 13:25:45 virgo kernel: Buffer I/O error on device dm-3, logical block 3
      

    It might be interesting to note that if the logical volume is actually a reiserfs which has been mounted read/write, the kernel switches the volume to read only upon detecting the IO error. The files in the "damaged" file system remain readable as long as they are cached in memory. From this point on, the only remaining action would be to umount the file system, which occurs without anybody grumbling. The following messages are generated by failed IO on reiserfs.

      Oct 25 14:29:51 virgo kernel: 9:0:0:0: rejecting I/O to dead device
      Oct 25 14:29:51 virgo kernel: Buffer I/O error on device dm-3, logical block 59
      Oct 25 14:29:51 virgo kernel: lost page write due to I/O error on dm-3
      Oct 25 14:29:51 virgo kernel: Buffer I/O error on device dm-3, logical block 60
      Oct 25 14:29:51 virgo kernel: lost page write due to I/O error on dm-3
      Oct 25 14:29:51 virgo kernel: REISERFS: abort (device dm-3): Journal write error in flush_commit_list
      Oct 25 14:29:51 virgo kernel: REISERFS: Aborting journal for filesystem on dm-3
      

    Recall that the major/minor numbers for sda are 8, 0 and the major/minor numbers for sdb are 8, 16. The reason for the above failure is actually due to the device mapper holding on to the major/minor numbers of the old scsi device, "sda". This can be seen by checking the dependencies of /dev/mapper/vol1/lv1 in :

      # dmsetup -v deps /dev/mapper/vol1-lv1
      Name:              vol1-lv1
      State:             ACTIVE
      Tables present:    LIVE
      Open count:        0
      Event number:      0
      Major, minor:      254, 3
      Number of targets: 2
      UUID: rOBqBOyT4GXROB11cPuUdHerIeCZ0Ltdighru3QKuatopReA3zYWDUDrlBI81qdB
      2 dependencies  : (8, 16) (8, 0)
      

    At this stage, since the device instances of sda and sdb are already references by the device mapper, the iscsi layer will not be able to reattach disks as sda or sdb, even if they already do not show up in /proc/partitions. The remedy is to remove these drive references in the device mapper first, remove any drive references in /proc/partitions and remap them all over again :

      # dmsetup remove /dev/mapper/vol1-lv1
      # map_iscsi.pl map 553e23
      553e23 = sda
      # map_iscsi.pl map 553e24
      553e24 = sdb
      # vgchange -a y vol1
        1 logical volume(s) in volume group "vol1" now active
      # lvscan
        ACTIVE '/dev/vol1/lv1' [200.00 MB] inherit
      

    Network Layer Failures

    The fundamental question asked here is : what happens if I temporarily unplug the network cable between the iscsi target and the iscsi initiator ? Similarly, what happens if the switch connecting iscsi target and iscsi initiator is rebooted ? In order to document what happens under such circumstances, I set up an iscsi drive sda, and mounted it and wrote an infinite loop which keeps appending to a file while printing timestamps every 10 seconds. The command used was :

      while [ true ] ; do
        date				# print some feedback of progress
        date >> /mnt/disk/date.out	# append to a file
        sync				# flush to disk
        sleep 10
      done
      
      Time Event iSCSI Initiator Server
      17:17:50 Script runs. Script reports current time is 17:17:50.
      17:17:51 Cable unplugged. TCP/IP stack performs standard retransmissions.
      17:18:30   Syslog messages appear :
      Oct 25 17:18:30 virgo kernel: connection12:0: iscsi: detected conn error (1011)
      Oct 25 17:18:31 virgo iscsid: detected iSCSI connection 12:0 error (1011) state (3)
      17:18:31   iscsid attempts to make a new TCP connection (SYN packet) to the iscsi target.
      17:18:47   iscsid attempts to make yet another new TCP connection (SYN packet) to the iscsi target.
      17:19:03 Cable reconnected. iscsid sucessfully makes a new TCP connection to the iscsi target, and sends syslog messages :
      Oct 25 17:19:03 virgo kernel: iscsi: host reset succeeded
      Oct 25 17:19:04 virgo iscsid: connection12:0 is operational after recovery (3 attempts)
      17:19:14   Data transfer observed over iscsi TCP session.
      17:19:23   Script reports time is 17:19:23.

    iSCSI Target Failures

    The typical scenerio explored here is rebooting an iSCSI target server. In this case, we examine an orderly reboot. The iSCSI target (assumed to be running iscsitarget-0.4.13) executes an orderly shutdown via the commands :

      # ietadm --op delete
      

    This means the iSCSI target sends the TCP FIN packet to the iSCSI initiator. The following message then appears on the iSCSI initiator's syslog :

      Oct 25 18:05:31 virgo kernel: connection12:0: iscsi: detected conn error (1011)

    The iSCSI initiator attempts to re-establish the TCP session by sending a SYN back to the iSCSI target, but the iSCSI target promptly refuses to entertain the iSCSI initiator anymore. The following messages then appear on the iSCSI initiator's syslog :

      Oct 25 18:05:33 virgo iscsid: detected iSCSI connection 12:0 error (1011) state (3)
      Oct 25 18:05:33 virgo iscsid: Login negotiation failed, can't accept MaxOutstandingR2T 0
      Oct 25 18:07:32 virgo kernel: session12: iscsi: session recovery timed out after 120 secs

    At this point, even if the iSCSI target becomes available again, the iSCSI initiator remains perpetually blocked. This also means that it is not possible to umount the filesystem affected. Interestingly, the command "umount -l /mnt/disk" works and the affected filesystem no longer shows up. Unfortunately, nothing much can be done with the broken iSCSI session as the iscsiadm tool is no longer able to communicate with iscsid (which is blocked). The system's load climbs as each IO thread tries to write to the broken disk and is blocked. Attempts to reboot the machine using "init 6" or "reboot" failed. And finally the only option was to use "reboot -n" (do not sync before reboot).

    The scenerio differs slightly if the "ietadm --op delete" command is not issued, but instead the iscsi target server is rebooted without explicitly shutting down ietd (ie just use the "reboot" command). Using the same infinite loop where we write out the current date in 10 second intervals, the following events are observed :

      Time Event iSCSI Initiator Server
      09:35:00 Script runs. Script reports current time is 09:35:00.
      09:35:14 iSCSI target reboots. TCP/IP stack performs standard retransmissions.
      09:35:54   Syslog messages appear :
      Oct 26 09:35:54 virgo kernel: connection0:0: iscsi: detected conn error (1011)
      Oct 26 09:35:55 virgo iscsid: detected iSCSI connection 0:0 error (1011) state (3)
      09:37:28 iSCSI target is up.
      Port 3260 not listening.
      iSCSI initiator fails to connect to port 3260, syslog reports :
      Oct 26 09:37:28 virgo kernel: connection0:0: iscsi: detected conn error (1011)
      Oct 26 09:37:29 virgo iscsid: connect failed (111)
      Oct 26 09:37:29 virgo iscsid: detected iSCSI connection 0:0 error (1011) state (2)
      Oct 26 09:37:29 virgo iscsid: connect failed (111)
      09:37:54   iSCSI initiator seems to have given up hope, syslog reports :
      Oct 26 09:37:54 virgo kernel: session0: iscsi: session recovery timed out after 120 secs
      Oct 26 09:37:54 virgo kernel: sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
      Oct 26 09:37:54 virgo kernel: sd 0:0:0:0: SCSI error: return code = 0x20000
      Oct 26 09:37:54 virgo kernel: end_request: I/O error, dev sda, sector 2116
      Oct 26 09:37:54 virgo kernel: Buffer I/O error on device sda1, logical block 258
      Oct 26 09:37:54 virgo kernel: lost page write due to I/O error on sda1
      Oct 26 09:37:54 virgo kernel: sd 0:0:0:0: rejecting I/O to offline device
      09:38:04 Port 3260 is online. iSCSI initiator connects to iSCSI target, unfortunately, reiserfs has become read-only, syslog reports,
      Oct 26 09:38:06 virgo iscsid: connection0:0 is operational after recovery (13 attempts)

      Interestingly, the last entry written to "date.out" is timestamped 09:37:00.
      09:40:00   The filesystem is umount'ed and remounted, the last entry in "date.out" is in fact 09:35:00.
      You could see that reiserfs rolled back some transactions in the syslog message,
      Oct 26 09:40:37 virgo kernel: ReiserFS: sda1: checking transaction log (sda1)
      Oct 26 09:40:37 virgo kernel: ReiserFS: sda1: replayed 2 transactions in 0 seconds

    After I reviewed the behavior of the iscsi initiator during iscsi target failure, I noticed that after 120 seconds, the underlying iscsi session would be considered failed, and IO errors are returned to reiserfs. Thus I increased this timing window to 1 hour, repeated the entire test, duplicating the iscsi target failure (ie rebooted the iscsi target server). Naturally, the iscsi target became online once again before the iscsi session timeout. A review of the "date.out" file shows one entry at 13:21:31, and the next entry at 13:24:35. For all practical purposes, reiserfs transactions were paused during this period. The following syslog messages show the loss, followed by recovery, of the iscsi session.

      Oct 26 13:22:11 virgo kernel: connection4:0: iscsi: detected conn error (1011)
      Oct 26 13:22:12 virgo iscsid: detected iSCSI connection 4:0 error (1011) state (3)
      Oct 26 13:23:44 virgo kernel: connection4:0: iscsi: detected conn error (1011)
      Oct 26 13:23:44 virgo iscsid: connect failed (111)
      Oct 26 13:23:44 virgo iscsid: detected iSCSI connection 4:0 error (1011) state (2)
      Oct 26 13:23:50 virgo iscsid: connect failed (111)
      Oct 26 13:23:56 virgo iscsid: connect failed (111)
      Oct 26 13:24:02 virgo iscsid: connect failed (111)
      Oct 26 13:23:08 virgo iscsid: connect failed (111)
      Oct 26 13:24:15 virgo kernel: iscsi: host reset succeeded
      Oct 26 13:24:15 virgo iscsid: connection4:0 is operational after recovery (12 attempts)

    Thus, in the file "iscsid.conf", increase the value of "node.session.timeo.replacement_timeout" to a large enough value such that the iscsi initiator continues to retry throughout the duration of any temporary iscsi target failures.

    Setting Up The Second Service Node

    After about 9 months of smooth sailing, I finally hit my first bump. When routinely provisioning a new NFS volume, the service node hit a kernel fault which led to the dreaded kernel stack trace, the cryptic CPU registers and whatever. The server didn't actually go down, and restarting NFS services brought life back to usual (for the usual customers). However the problem was still there, and it had to get fixed. Since the fault occured within kernel space, the logical way forward would be to replace the kernel with a newer one. The current running kernel is 2.6.18, and the latest linux kernel at this moment is 2.6.22.1. Unfortunately, building and testing a new kernel was not something you could do in a few minutes (with live customers depending on you). Thus a separate server would be required while silo1 is brought down for maintenance.

    Fortunately, much of silo1's allocated storage was used by various teams for bulk data backups, and they were fine with some days of downtime. A small number of volumes had to remain online and these consisted of less than a terabyte of disk space. Thus, I decided to setup another service node, that would provide data storage while I bring down silo1 for a kernel upgrade and some testing.

  • Server Overview : silo2

  • Fault Archive

    The fault archive stores all (note worthy) records of hardware and software anomalies, bugs or failures experienced by the storage equipment. Life's a learning experience, so hopefully this makes a good story.

  • 20th Oct 2006 :
  • Allocation failures for kswapd0, istd1 and istiod1.
  • 1st Nov 2006 :
  • Disk error on storage node 2, port 13.
  • 5th Mar 2007 :
  • Disk error on storage node 1, port 1.
  • 5th Apr 2007 :
  • Disk error on storage node 1, port 1.
  • 20th July 2007 :
  • NFS Mount Point Kernel Failure