Gregory Steulet blog

High-Availability and Open Source fields
How to recover data from a "dead" hard drive or from a human error

Who has never had a failing disk, accidentally deleted a valuable file, corrupted a FAT (File Allocation Table) or simply hit bad disk sectors? Of course, restoring a backup would be the easiest way to recover precious data in such cases. Unfortunately, it is often in exactly this situation that we discover the backup is a month old, or that the files were never backed up at all. The following lines tell two such stories and introduce some ways to recover data with open source software.


A nice sunny day

It was a nice sunny Saturday and I was invited to my girlfriend’s mother’s home for tea. During lunch she mentioned weird behavior on her Windows XP PC: error messages, files that could not be deleted from the desktop and a lot of other strange things. I was quite proud to offer some help in solving those problems.

So I tried to switch on the computer by pressing the power button, but the PC shut down after 5 seconds! Furthermore, the power button stayed pushed in each time I pressed it, because of a broken piece of plastic inside the power switch. After 5 minutes I finally managed to start the PC, and a wonderful Windows XP logo appeared, followed by error messages about missing “.dll” files. After removing 2 of the 3 antivirus/antispyware programs and some unused software, the computer behaved more smoothly, although it was still very slow, so I decided to run chkdsk on the C:\ drive. As you know, such a check on the Windows system drive requires a reboot, which I did. The check started, step 1 of 5, step 2 of 5, and then the following messages appeared: “unable to locate the file name attribute of index entry”, “not enough space to rebuild index”, etc.

When chkdsk ended, the computer restarted, the Windows XP logo appeared and the computer automatically switched off. The power switch had nothing to do with it this time: it was the disk. No data was visible on it any more. This nice sunny day became really cloudy when my girlfriend told me that her mother had stored a lot of important work documents and many pictures on it. Of course, like the majority of people storing precious data, she had no backup.

I decided to take the hard drive home and plug it in as a ”slave” on my own PC. The disk was detected, but Windows reported 0 disk space available and 0 disk space used. Disappointed, I contacted a company specialized in data recovery to ask for a quote. It arrived a few days later: 1684 euros for a complete data recovery, with no guarantee on the integrity of the recovered data. It stated: “The reasons for the defective data are major structure damages.”

That was simply too expensive for what might turn out to be nothing but unusable blocks.

 

[Figure: cause of data loss]

Let’s save the data with open source software



I decided to look for software on the internet to solve this problem. I was amazed to discover so many recovery tools, e.g. Active Partition Recovery, Ontrack EasyRecovery, Active Uneraser, Lost & Found, Prosoft Media Tools, Active Undelete and many others.

Attempting to repair the filesystem directly on the defective disk would generate unnecessary disk activity and carries the risk of further damage to the drive. Therefore the first step in such a situation is to copy the hard drive onto another one.

I first thought of dd for this task, then noticed that better tools have been developed for copying data from a damaged disk, such as ddrescue, dd_rescue and dd_rhelp (a wrapper script for dd_rescue). I chose ddrescue because it combines the advantages of dd_rescue and dd_rhelp. Unlike dd with its default settings, ddrescue does not stop when it encounters read errors, which makes it especially useful on failing media. In addition, it can read blocks backwards, so when an error sits in the middle of a block, ddrescue copies the data both before and after the error inside this block.

An Ubuntu distribution was used for this recovery. Note that an installed Linux system is not required: Knoppix boots from a live CD and provides all the software needed to copy and recover the data.

In the following case the partition /dev/sdc2 needs to be duplicated. Before the copy can start, a new partition of at least the same size as the defective one must be created on a backup disk. fdisk fulfils this mission perfectly.

The partition size can be calculated by subtracting the Start column (first cylinder) from the End column (last cylinder) and adding one, since both cylinders belong to the partition. In the case below, 10340 - 833 + 1 = 9508 cylinders.

steulet@steulet-desktop:~$ sudo fdisk /dev/sdc

Command (m for help): p

Disk /dev/sdc: 80.0 GB, 80060424192 bytes
240 heads, 63 sectors/track, 10341 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes
Disk identifier: 0x1549f232

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1         832     6289888+   c  W95 FAT32 (LBA)
/dev/sdc2   *         833       10340    71880480    7  HPFS/NTFS
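
As a quick cross-check of the cylinder arithmetic, the following shell sketch recomputes the size of /dev/sdc2 from the Start, End and Units values of the fdisk listing. The helper names partition_cylinders and partition_bytes are made up for this illustration. The count includes both the first and the last cylinder, which gives 9508 cylinders, and the resulting byte count agrees with the Blocks column (71880480 blocks of 1 KiB).

```shell
#!/bin/sh
# Sketch only: partition size from fdisk's cylinder-based listing.
# partition_cylinders and partition_bytes are hypothetical helper names.

# Number of cylinders, counting both the first and the last one.
partition_cylinders() {
    start=$1
    end=$2
    echo $(( end - start + 1 ))
}

# Size in bytes: cylinders * bytes per cylinder (the "Units" line of fdisk).
partition_bytes() {
    cylinders=$(partition_cylinders "$1" "$2")
    bytes_per_cylinder=$3
    echo $(( cylinders * bytes_per_cylinder ))
}

# Values for /dev/sdc2 taken from the fdisk listing.
cyl=$(partition_cylinders 833 10340)
bytes=$(partition_bytes 833 10340 7741440)
echo "cylinders: $cyl"
echo "bytes:     $bytes"
echo "1K blocks: $(( bytes / 1024 ))"
```

A cylinder count is only comparable between two disks that report the same cylinder size. When the backup disk uses a different geometry, it is safer to compare the sizes in bytes.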

Creating the new partition /dev/sdd1 on the new disk /dev/sdd with fdisk is a straightforward process.

steulet@steulet-desktop:~$ sudo fdisk /dev/sdd

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-38913, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-38913, default 38913): 9508

 

ddrescue can then duplicate the blocks onto the second disk. As shown below, the average copy rate is about 1 MB/s on a small configuration, meaning that about 28 hours are needed to copy a 100 GB partition and about 12 days for 1 TB.

 

steulet@steulet-desktop:~$ sudo ddrescue /dev/sdc2 /dev/sdd1

Press Ctrl-C to interrupt
rescued:    73605 MB,  errsize:       0 B,  current rate:     699 kB/s
   ipos:    73605 MB,   errors:       0,    average rate:    1064 kB/s
   opos:    73605 MB
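
The duration estimates quoted for a rate of about 1 MB/s can be reproduced with a small calculation. The sketch below assumes a constant average rate and decimal units (1 GB = 10^9 bytes); copy_hours is a hypothetical helper name, not part of ddrescue.

```shell
#!/bin/sh
# Sketch only: estimated copy duration at a constant average rate.
# copy_hours is a hypothetical helper; sizes use decimal units.

# copy_hours SIZE_BYTES RATE_BYTES_PER_SECOND -> hours with one decimal
copy_hours() {
    LC_ALL=C awk -v size="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", size / rate / 3600 }'
}

echo "100 GB at 1 MB/s: $(copy_hours 100000000000 1000000) hours"
echo "1 TB at 1 MB/s:   $(copy_hours 1000000000000 1000000) hours"
```

This gives 27.8 hours for 100 GB and 277.8 hours, roughly 11.6 days, for 1 TB. A real copy from a failing disk rarely runs at a constant rate, so these figures are only an order of magnitude.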

 

Once the partition of the failing disk is fully copied, data recovery can start. Once again, a variety of useful tools can be found on the internet. The one used in this example is TestDisk, a free open source program originally created to recover lost partitions and to make non-booting disks bootable again.

TestDisk has a large set of features, such as undeleting files from FAT, NTFS and ext2 filesystems, recovering or rebuilding the NTFS boot sector, fixing FAT tables, copying files from deleted FAT, NTFS and ext2/ext3 partitions, and many others. In addition, TestDisk runs on many operating systems, including Windows, Linux, macOS, SunOS and BSD.
A short look at TestDisk shows some of its interesting features.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

  TestDisk is free software, and
comes with ABSOLUTELY NO WARRANTY.

Select a media (use Arrow keys, then press Enter):
Disk /dev/sda - 160 GB / 149 GiB - ATA SAMSUNG HD160JJ
Disk /dev/sdb - 160 GB / 149 GiB - ATA SAMSUNG HD160JJ
Disk /dev/sdc - 80 GB / 74 GiB - ATA SAMSUNG SP0802N
Disk /dev/sdd - 320 GB / 298 GiB - Hitachi HTS543232L9A300

[Proceed ]  [  Quit  ]

Note: Disk capacity must be correctly detected for a successful recovery.
If a disk listed above has incorrect size, check HD jumper settings, BIOS
detection, and install the latest OS patches and disk drivers.



Once the disk to analyze has been selected, TestDisk asks for the partition table type. In the current case “Intel” is used. Note that Mac, Sun and even Xbox partition tables are also supported.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Disk /dev/sdd1 - 85 GB / 79 GiB - Hitachi HTS543232L9A300

Please select the partition table type, press Enter when done.
[Intel  ]  Intel/PC partition
[EFI GPT]  EFI GPT partition map (Mac i386, some x86_64...)
[Mac    ]  Apple partition map
[None   ]  Non partitioned media
[Sun    ]  Sun Solaris partition
[XBox   ]  XBox partition
[Return ]  Return to disk selection

Note: Do NOT select 'None' for media with only a single partition. It's very
rare for a drive to be 'Non-partitioned'.

 

After analyzing the disk, TestDisk shows the available partitions and also offers features such as rebuilding the boot sector.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63
     Partition                  Start        End    Size in sectors
 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

Boot sector
Status: OK

Backup boot sector
Status: Bad

Sectors are not identical.

A valid NTFS Boot sector must be present in order to access
any data; even if the partition is not bootable.

[  Quit  ]  [  List  ]  [Org. BS ]  [Rebuild BS]  [  Dump  ]
                      Copy boot sector over backup sector

 

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63
     Partition                  Start        End    Size in sectors
 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

filesystem size           166128102 166112100
sectors_per_cluster       8 8
mft_lcn                   10 10
mftmirr_lcn               1048576 1048576
clusters_per_mft_record   -10 -10
clusters_per_index_record 1 1
Extrapolated boot sector and current boot sector are different.

[  Dump  ]  [  List  ]  [ Write  ]  [  Quit  ]
                           List directories and files

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Write new NTFS boot sector, confirm ? (Y/N)



As expected, TestDisk restored the backup boot sector.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63
     Partition                  Start        End    Size in sectors
 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

Boot sector
Status: OK

Backup boot sector
Status: OK

Sectors are identical.

A valid NTFS Boot sector must be present in order to access
any data; even if the partition is not bootable.

[  Quit  ]  [  List  ]  [Rebuild BS]  [Repair MFT]  [  Dump  ]
                            Return to Advanced menu

 

Sometimes the MFT (Master File Table) is corrupted as well, and Microsoft Check Disk (chkdsk) can fail while trying to repair it. TestDisk offers to repair the MFT through the “Repair MFT” entry of the advanced menu, available once the NTFS partition has been selected, as shown above.

 

TestDisk provides plenty of other features: adding a partition, changing its type, listing the files inside a partition and copying them to another location, etc.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63
     Partition               Start        End    Size in sectors
* HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

Structure: Ok.  Use Up/Down Arrow keys to select partition.
Use Left/Right Arrow keys to CHANGE partition characteristics:
*=Primary bootable  P=Primary  L=Logical  E=Extended  D=Deleted
Keys A: add partition, L: load backup, T: change type, P: list files,
     Enter: to continue
NTFS, 85 GB / 79 GiB
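
The size line at the bottom of this screen can be related to the "Size in sectors" column with a short calculation. The sketch below assumes 512-byte sectors; sectors_to_bytes is a hypothetical helper name, not a TestDisk function.

```shell
#!/bin/sh
# Sketch only: convert a sector count to bytes, GB and GiB.
# Assumes 512-byte sectors; sectors_to_bytes is a hypothetical helper name.

sectors_to_bytes() {
    echo $(( $1 * 512 ))
}

bytes=$(sectors_to_bytes 166128102)
echo "bytes: $bytes"
echo "GB:    $(( bytes / 1000000000 ))"   # decimal gigabytes, rounded down
echo "GiB:   $(( bytes / 1073741824 ))"   # binary gibibytes, rounded down
```

With the sector count shown by TestDisk, this yields 85 GB and 79 GiB, the same two figures as on the screen: GB counts in powers of ten, GiB in powers of two.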

 

 

The example below shows how files can be copied from the defective media to another location.

 

TestDisk 6.10, Data Recovery Utility, July 2008
Christophe GRENIER <grenier@cgsecurity.org>
http://www.cgsecurity.org
   * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]
Directory /

dr-xr-xr-x     0     0         0  6-Sep-2009 17:52 .
dr-xr-xr-x     0     0         0  6-Sep-2009 17:52 ..
-r--r--r--     0     0        50  1-Jan-2005 19:49 AUTOEXEC.BAT
-r--r--r--     0     0       218 30-Oct-2005 18:04 BOOT.BAK
-r--r--r--     0     0       298 30-Oct-2005 18:04 boot.ini
dr-xr-xr-x     0     0         0  2-Jan-2005 04:00 Config.Msi
-r--r--r--     0     0         0 23-Nov-2004 15:21 CONFIG.SYS
dr-xr-xr-x     0     0         0 13-May-2006 12:18 C_DILLA
-r--r--r--     0     0        62  6-Sep-2009 17:16 delfichier.bat
dr-xr-xr-x     0     0         0  2-Jan-2005 05:01 Documents and Settings
-r--r--r--     0     0   528011264  6-Sep-2009 17:24 hiberfil.sys
dr-xr-xr-x     0     0         0  2-Jan-2005 05:05 hp
-r--r--r--     0     0         0 23-Nov-2004 15:21 IO.SYS
-r--r--r--     0     0         0 23-Nov-2004 15:21 MSDOS.SYS
-r--r--r--     0     0     47564  5-Aug-2004 13:00 NTDETECT.COM
-r--r--r--     0     0    252240  5-Aug-2004 13:00 ntldr
-r--r--r--     0     0   792723456 23-Dec-2007 05:35 pagefile.sys
dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 Program Files
dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 Python22
dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 RECYCLER
dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 sysprep
dr-xr-xr-x     0     0         0 28-Oct-2005 17:29 System Volume Information
dr-xr-xr-x     0     0         0  2-Jan-2005 04:12 system.sav
dr-xr-xr-x     0     0         0 17-Mar-2006 17:36 temp
dr-xr-xr-x     0     0         0  2-Jan-2005 05:37 WINDOWS

Use Right arrow to change directory, c to copy,
    q to quit

 

But what about the ext3 filesystem?

One of TestDisk’s limitations is that it cannot recover deleted files from an ext3 partition.

Reading forums about ext3 file recovery will often discourage you. If you are not yet convinced that file recovery on ext3 is impossible, a look at the ext3 FAQ (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html ) will definitively remove any hope.

1.1      Q: How can I recover (undelete) deleted files from my ext3 partition?

Actually, you can't! This is what one of the developers, Andreas Dilger, said about it:

In order to ensure that ext3 can safely resume an unlink after a crash, it actually zeros out the block pointers in the inode, whereas
ext2 just marks these blocks as unused in the block bitmaps and marks the inode as "deleted" and leaves the block pointers alone.

Your only hope is to "grep" for parts of your files that have been deleted and hope for the best.

Fortunately, this statement seems to be too categorical. When a file is removed, its data is not actually overwritten. On an ext3 filesystem, the pointers that reference the file’s blocks are simply removed, meaning that the disk area can be overwritten as soon as new write operations occur. Therefore, the first thing to do after such a mistake is to avoid any additional write operation. The best way to achieve that is simply to unmount the filesystem.

Once the filesystem is unmounted, one can take the time to look for recovery tools. I found two tools able to recover deleted files.

  • ext3grep  is a simple tool developed by Carlo Wood and intended to aid anyone who accidentally deletes a file on an ext3 filesystem.

Both of them are intended to be run on disk images, meaning that an image of the partition that held the removed files must be created first.

 

Some doubts?

 

A concrete example calling for file restoration on an ext3 partition is the removal of all *.log files on a partition hosting the redo log files of an Oracle database. Redo logs named with a *.log extension (which, by the way, is strongly discouraged for precisely this reason) would be removed by such an operation. As you may know, an Oracle database cannot work without at least two redo log groups, so such a delete leads to a database crash.
The storage setup in the following example consists of two RAID 5 groups of three disks each, configured with mdadm. On top of them, a RAID 0 (striping) has been configured with LVM2. This RAID 50 configuration is illustrated in the figure below.

 

[Figure: RAID 50]

 

The Oracle database version is 10.2.0.4 and the filesystemio_options parameter is set to “setall”.

After the redo logs of the SOUK database have been removed, the following appears in the Oracle alert log.

 

LGWR: Failed to archive log 2 thread 1 sequence 72 (16198)
Sun Nov  8 17:17:32 2009
Thread 1 advanced to log sequence 72 (LGWR switch)
  Current log# 2 seq# 72 mem# 0: /u05/oradata/SOUK/redog02a_SOUK.log
  Current log# 2 seq# 72 mem# 1: /u05/oradata/SOUK/redog02b_SOUK.log
Sun Nov  8 17:18:47 2009
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01b_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01a_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory


In order to restore the redo log files, the first step is to avoid any additional write on the filesystem. The more write operations occur, the lower the chance of recovering those files. That is why every process that writes to this specific filesystem must be stopped.

 

oracle@slo02test:~/ [SOUK] sqh

SQL*Plus: Release 10.2.0.4.0 - Production on Fri Oct 23 00:08:36 2009
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> shutdown abort

ORACLE instance shut down.

The filesystem needs to be unmounted to avoid any write access.

[root@slo02test ~]# umount /u05


Then an image copy of the filesystem can be made with the “dd” command as shown below. “dd” perfectly fits our needs in this case.

[root@slo02test ~]# dd if=/dev/mapper/vgdata-lvdata of=/u99/copyU05
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 241.446 seconds, 13.3 MB/s
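
The figures printed by dd can be cross-checked by hand. Since no bs= option was given, dd used its default block size of 512 bytes, so the byte count is simply the record count times 512. The sketch below redoes that multiplication; dd_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: cross-check the byte count printed by dd.
# dd_bytes is a hypothetical helper; 512 bytes is dd's default block size.

# dd_bytes RECORDS [BLOCK_SIZE] -> total bytes
dd_bytes() {
    records=$1
    block_size=${2:-512}
    echo $(( records * block_size ))
}

bytes=$(dd_bytes 6291456)
echo "bytes copied: $bytes"
echo "size in GiB:  $(( bytes / 1073741824 ))"
```

The result, 3221225472 bytes, is exactly 3 GiB, which matches the size of the logical volume being imaged. A larger block size is a common choice to speed up such a copy, at the price of coarser granularity if a read error occurs.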


Now we simply have to run ext3grep with the filesystem image as parameter. Several options are provided to restore files from a specific date, a specific file, or simply everything.

[root@slo02test u99]# ext3grep /u99/copyU05 --restore-all
Running ext3grep version 0.10.0
Number of groups: 24
Minimum / maximum journal block: 713 / 17115
Loading journal descriptors... sorting... done
The oldest inode block that is still in the journal, appears to be from 1257696360 = Sun Nov  8 17:06:00 2009
Number of descriptors in journal: 108; min / max sequence numbers: 3 / 34
Writing output to directory RESTORED_FILES/
Finding all blocks that might be directories.
D: block containing directory start, d: block containing more directory entries.
Each plus represents a directory start that references the same inode as a directory start that we found previously.

Searching group 0: DDD+D+++
Searching group 1:
Searching group 2:


Searching group 22:
Searching group 23:

Writing analysis so far to 'copyU05.ext3grep.stage1'. Delete that file if you want to do this stage again.

Result of stage one:
  4 inodes are referenced by one or more directory blocks, 4 of those inodes are still allocated.
  3 inodes are referenced by more than one directory block, 3 of those inodes are still allocated.
  0 blocks contain an extended directory.

Result of stage two:
  4 of those inodes could be resolved because they are still allocated.

All directory inodes are accounted for!

Writing analysis so far to 'copyU05.ext3grep.stage2'. Delete that file if you want to do this stage again.
Restoring oradata/SOUK/redog01a_SOUK.log
Restoring oradata/SOUK/redog01b_SOUK.log
Restoring oradata/SOUK/redog02a_SOUK.log
Restoring oradata/SOUK/redog02b_SOUK.log
Restoring oradata/SOUK/redog03a_SOUK.log
Restoring oradata/SOUK/redog03b_SOUK.log


Restored files are written to the “RESTORED_FILES” directory under the current path. Now we only have to copy them to the correct directory.

[root@slo02test ~]# ls -ls RESTORED_FILES/ -R

RESTORED_FILES/:
total 8
4 drwx------ 2 root root 4096 Oct 22 23:06 lost+found
4 drwxr-xr-x 3 root root 4096 Oct 22 23:09 oradata

RESTORED_FILES/lost+found:
total 0

RESTORED_FILES/oradata:
total 4
4 drwxr-xr-x 2 root root 4096 Oct 22 23:38 SOUK

RESTORED_FILES/oradata/SOUK:
total 82080
10260 -rw-r----- 1 root root 10486272 Oct 22 23:17 redog01a_SOUK.log

10260 -rw-r----- 1 root root 10486272 Nov 22 23:17 redog01b_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog02a_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog02b_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog03a_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog03b_SOUK.log

[root@slo02test ~]# mount /u05

[root@slo02test ~]# cp RESTORED_FILES/oradata/SOUK/redog0?a_SOUK.log /u05/oradata/SOUK/
[root@slo02test ~]# chown oracle.oinstall /u05/oradata/SOUK/ -R

Once the redo logs are restored and copied to their original path, the database can be started. However, keep in mind that committed transactions could be lost depending on your database and filesystem configuration!

oracle@slo02test:~/ [SOUK] sqh
SQL*Plus: Release 10.2.0.4.0 - Production on Thu Oct 22 23:43:41 2009

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to an idle instance.

SQL> startup

ORACLE instance started.

Total System Global Area 1073741824 bytes
Fixed Size                  1271588 bytes
Variable Size             264243420 bytes
Database Buffers          805306368 bytes
Redo Buffers                2920448 bytes
Database mounted.

Database opened.

 

As we can see, dropping redo log files or any other file does not necessarily lead to definitively losing transactions. ext3grep can be used as an additional way to recover your database.

 

Conclusion

Several situations can call for media recovery: site disaster, block corruption, a corrupted FAT, etc. In case of hardware errors, the safest option is often to call a data recovery company. Tackling physical damage to a hard drive with such tools can destroy any last hope of a successful recovery. However, if there is no physical damage, these tools can recover precious data.

The first thing to do when a media error occurs is to stop all activity on the media immediately. Only then can we take the time to think about how to proceed. As shown in this article, the second step is generally to copy the failing media onto a trusted one. Working directly on the failing media could lead to definitively losing valuable data. That’s why duplicating the data before any other operation is strongly recommended. Several tools provide duplication features, such as ddrescue or dd_rescue.

Once the failing media has been duplicated, recovery can proceed on the copy. Once again, a variety of tools can be used depending on the filesystem and the data to recover.

Finally, once the data is recovered, it may be worth making a backup. Although these tools can get you out of seemingly hopeless situations and add one more way to recover files, testing backup and restore processes at regular intervals is probably the best way to avoid spending time on such recovery work.

Gregory Steulet
Oracle Certified Professional 10G
MySQL Cluster 5.1 Certified
Avaloq Certified Professional

 

Trivadis SA
Rue Marterey 5
CH-1005 Lausanne
Tel: +41-21-321 47 00
Fax: +41-21-321 47 01
Internet:  www.trivadis.com
Mail: info@trivadis.com

 

Literature and Links…

http://www.cgsecurity.org/wiki/TestDisk

http://www.gnu.org/software/ddrescue/ddrescue.html

http://www.knoppix.org/

http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html

http://foremost.sourceforge.net/

 

 

http://www.hiren.info/

http://www.krollontrack.com/

http://www.giis.co.in/

 

       

Dolphin interconnect
Gregory Steulet  .  Consultant  .  05.04.2009
 

 

Several technologies such as clusters need ever more bandwidth and the lowest possible latency on their interconnect infrastructure. Storage technologies are at the dawn of a quantum leap with SSD, ioMemory and spintronics. In addition, the data market is growing day after day. To meet these requirements, Dolphin Interconnect Solutions launched the Dolphin Express products. This article provides an overview of the Dolphin Express interconnect, some possible applications and performance tests with Distributed Replicated Block Device (DRBD).

 

 What is Dolphin interconnect?

Dolphin Interconnect Solutions products are used to connect multiple computers together to build a high performance architecture for specific applications. Those applications could be Oracle Real Application Clusters, Oracle Data Guard, MySQL replication, MySQL Cluster, DRBD, as well as any application that uses generic BSD sockets for communication. Dolphin’s products are available through a worldwide network of resellers and distributors.

These interconnect solutions come as adapter cards that require a standard PCI Express slot supporting x8 or x16 peripheral cards. Cable length ranges from 1 to 3 meters for copper cables and from 10 to 100 meters for fiber optic cables. The cards serve as a transparent booster for existing Ethernet interfaces and operate with the same IP addresses. Dolphin products run on Windows, Linux and Solaris.

Two product lines are available for Dolphin Express: Dolphin Express D and Dolphin Express DX. The difference between the two series is that the DX series needs a dedicated switch when more than two nodes are used. In addition, the DX series is the latest interconnect from Dolphin and provides better performance. Only the DX series is considered in this article. The supported architecture for DX series products is x86 (32 and 64 bits). Below are pictures of a copper cable (DXC1M-A, 1 meter copper cable) and a PCI card (DXH510 PCI) used in a Dolphin Express interconnect setup.

[Figure: Dolphin interconnect DX series]


Software installation and configuration

Installing Dolphin PCI Express cards and software is really not rocket science. First, we need to download the software from http://www.dolphinics.com/support. Then we simply execute a shell script, which prompts for some information (node names, target path, etc.) before installing the software. This script automatically installs the software stack on each node of the cluster, fills in the configuration files and runs some connectivity and performance tests at the end of the installation. Definitely a straightforward process!

 

[root@srv01 sbin]# ./DIS_DX_install_3_1_0_2_LINUX.sh
...
...
...
#+ About to BUILD Dolphin Express interconnect drivers on this node:
... srv01.penguins.com
#+ About to INSTALL Dolphin Express interconnect drivers on these nodes:
... srv01.penguins.com
... srv02.penguins.com
#+ About to install management and control services on the frontend machine:
... srv01.penguins.com
#* Installing to default target path /opt/DIS on all machines
.. (or the current installation path if this is an update installation).
# >>> OK to proceed? [Y/n]
...

 

 

In addition, the SuperSockets software is licensed under the GNU General Public License v2. Several command line tools are provided in the installation package. Here is a non-exhaustive list of what these programs can be used for:
  • Testing that all adapters are working correctly: dxdiag
  • Testing the interconnect under load (stress test): sciconntest
  • Benchmarking interconnect performance: scibench2, scipp, intr_bench, dma_bench, scimemcopybench
  • Benchmarking and testing SuperSockets: sockperf, latency_bench, sockrpb
  • Collecting throughput statistics on all SuperSockets: dis_ssock_stats, dis_status
  • Many other functions
Two graphical configuration tools are also provided:
  • dishosteditor
  • dxadmin
dishosteditor presents the nodes in the cluster and the chosen topology. Through this interface, several settings and the general cluster configuration can be changed. The interconnect can be in one of several states: up, reduced, degraded, failed and unstable. With dishosteditor, it is possible to set notifications and actions on interconnect status changes.

[Figure: dishosteditor]

dxadmin makes it possible to view the interconnect status, to configure and administer a node or the whole cluster, and to test and diagnose the cables and adapters through three main menus: “Admin”, “Cluster” and “Node”.

[Figure: dxadmin]

Use case with DRBD and Oracle Data Guard

Before going further with the tests, a few configuration changes are needed. Let’s have a look at the DRBD configuration file, /etc/drbd.conf.


global {
}
common {
}
resource r0 {
  protocol C;
  startup {
  }
  disk {
    on-io-error   detach;
  }
  syncer {
    rate 900M;
  }
  on srv01.penguins.com {
    device     /dev/drbd0;
    disk       /dev/sdd3;
    address    sci 10.0.0.11:7778;
    meta-disk  internal;
  }
  on srv02.penguins.com {
    device     /dev/drbd0;
    disk       /dev/sdd3;
    address    sci 10.0.0.12:7778;
    meta-disk  internal;
  }
}


The only change compared to a “normal” DRBD configuration file is the addition of the keyword “sci” before the IP address. Of course, we have to pay attention to the rate parameter, which defaults to 250 KB/sec and could become a bottleneck.

In an Oracle Data Guard environment, the following environment variable needs to be set before starting Oracle.

LD_PRELOAD=libksupersockets.so

No other changes are needed to run Oracle Data Guard with Dolphin Interconnect. In an Oracle Clusterware setup, SuperSockets comes with a script, supersockets_for_oracle, which takes care of all necessary settings.

./supersockets_for_oracle enable


To check if the Oracle software is currently benefiting from the supersocket, the following commands can be run:



[root@srv01 ~]# echo streams >/proc/net/af_sci/stats
[root@srv01 ~]# tail -f /var/log/messages
Mar 17 16:12:58 srv01 kernel: dis_ssocks: Listen sockets:
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [0.0.0.0:1521] LISTEN (pid 23951), pend: 1, incompl: 0
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [0.0.0.0:51044] LISTEN (pid 8839), pend: 1, incompl: 0
Mar 17 16:12:58 srv01 kernel: dis_ssocks: Server sockets:
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [192.168.1.101:1521] --> [192.168.1.101:16218] FALLBACK_ACTIVE LOOP (pid 23951) 00000000
...
Mar 17 16:12:58 srv01 kernel: dis_ssocks: Client sockets:
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [10.0.0.11:61992] --> [10.0.0.12:1521] ESTABLISHED  (pid 24147) 00000000
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [192.168.1.101:16964] --> [192.168.1.101:1521] FALLBACK_ACTIVE LOOP (pid 8807) 00000000
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [10.0.0.11:26905] --> [10.0.0.12:1521] ESTABLISHED  (pid 9017) 00000000
Mar 17 16:12:58 srv01 kernel: dis_ssocks:  [10.0.0.11:53594] --> [10.0.0.12:1521] ESTABLISHED  (pid 9014) 00000000
...
  

A look at those processes shows that they are Oracle processes and therefore that SuperSocket is currently in use for them.


[root@srv01 ~]# ps -ef | grep 23951
oracle   23951     1  0 15:30 ?        00:00:00 /u00/app/oracle/product/11.1.0/bin/tnslsnr LISTENER -inherit
[root@srv01 ~]# ps -ef | grep 8839
oracle    8839     1  0 14:18 ?        00:00:00 ora_d000_souk
[root@srv01 ~]# ps -ef | grep 24147
oracle   24147     1  0 15:31 ?        00:00:00 ora_lnsc_souk
[root@srv01 ~]# ps -ef | grep 8807
oracle    8807     1  0 14:18 ?        00:00:00 ora_pmon_souk
[root@srv01 ~]# ps -ef | grep 9017
oracle    9017     1  0 14:18 ?        00:00:00 ora_arc1_souk
[root@srv01 ~]# ps -ef | grep 9014
oracle    9014     1  0 14:18 ?        00:00:00 ora_arc0_souk
...
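
The manual check above (read the socket list, then look up each pid) can be scripted. The following is a minimal sketch, assuming the log line layout shown in this article and assuming that ESTABLISHED marks a connection carried by SuperSockets while FALLBACK_* marks one that fell back to regular networking. That reading is inferred from the output above, not taken from the Dolphin documentation, and the function names are hypothetical.

```python
import re

# Matches connection lines such as:
#   "... dis_ssocks:  [10.0.0.11:61992] --> [10.0.0.12:1521] ESTABLISHED  (pid 24147) 00000000"
CONN_RE = re.compile(
    r"dis_ssocks:\s+\[(?P<local>[\d.]+:\d+)\]\s+-->\s+\[(?P<remote>[\d.]+:\d+)\]\s+"
    r"(?P<state>[A-Z_ ]+?)\s+\(pid (?P<pid>\d+)\)"
)


def parse_connections(log_text):
    """Extract connection entries from the dis_ssocks lines of a syslog excerpt."""
    conns = []
    for line in log_text.splitlines():
        m = CONN_RE.search(line)
        if m:
            conns.append({
                "local": m.group("local"),
                "remote": m.group("remote"),
                "state": m.group("state"),
                "pid": int(m.group("pid")),
            })
    return conns


def pids_on_supersockets(conns):
    """PIDs owning at least one ESTABLISHED connection (assumed: not in fallback)."""
    return sorted({c["pid"] for c in conns if c["state"] == "ESTABLISHED"})
```

The resulting pid list can then be passed to ps, as done by hand above, to confirm that the processes are Oracle processes.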
 
High-Availability

SuperSockets is in use by our application, but what happens if both connection cables between the DXH510 adapters fail? To demonstrate this, let’s assume that DRBD is in use and that an 8 GB file is being streamed between the two nodes when both cables are disconnected. In this use case, two resources are configured in Primary/Secondary mode.


[root@srv01 ~]# cat /proc/drbd
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 

The first resource r0 is mounted on /u05 on the primary node.

[root@srv01 ~]# mount -t ext3 /dev/drbd0 /u05
 
The drbdx_receiver processes, where x stands for the resource number, use SuperSockets.


[root@srv01 ~]# ps -ef | grep drbd
root     14597     1  0 13:16 ?        00:00:00 [drbd0_worker]
root     14605     1  9 13:16 ?        00:00:19 [drbd1_worker]
root     14615     1  0 13:16 ?        00:00:00 [drbd0_receiver]
root     14618     1  0 13:16 ?        00:00:00 [drbd1_receiver]
root     14634     1  0 13:16 ?        00:00:00 [drbd0_asender]
root     14635     1  6 13:16 ?        00:00:13 [drbd1_asender]
[root@srv01 ~]# echo streams >/proc/net/af_sci/stats
...
Mar 19 13:19:51 srv01 kernel: dis_ssocks: Server sockets:
Mar 19 13:19:51 srv01 kernel: dis_ssocks:  [10.0.0.11:7777] --> [10.0.0.12:52051] ESTABLISHED  (pid 14618) 00000000
Mar 19 13:19:51 srv01 kernel: dis_ssocks:  [10.0.0.11:7778] --> [10.0.0.12:63595] ESTABLISHED  (pid 14615) 00000000
...


In this test, an 8 GB file, /u05/test.txt, is created on node srv01 and replicated by DRBD to srv02. Just after the copy starts, both copper cables are disconnected.

[root@srv01 ~]# dd if=/dev/zero of=/u05/test.txt count=1000000 bs=8192
 

[Figure: Interconnect crash]

SuperSockets transparently falls back to Ethernet when the connections between the adapters are down. The only visible consequence is that throughput drops.

 
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 69.6321 seconds, 118 MB/s
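
The rate printed by dd can be recomputed from the byte count and the elapsed time. The small sketch below does that for a summary line in the format shown above. The helper name is hypothetical, and it assumes that dd reports decimal megabytes (1 MB = 10^6 bytes).

```python
import re


def dd_throughput_mb_s(summary_line):
    """Return the throughput in MB/s (1 MB = 10**6 bytes) from a dd summary line."""
    m = re.search(r"(\d+) bytes .* copied, ([\d.]+) seconds", summary_line)
    if m is None:
        raise ValueError("not a dd summary line")
    copied_bytes = int(m.group(1))
    elapsed_seconds = float(m.group(2))
    return copied_bytes / elapsed_seconds / 1e6


line = "8192000000 bytes (8.2 GB) copied, 69.6321 seconds, 118 MB/s"
rate = dd_throughput_mb_s(line)
# count * bs from the dd command line gives the same byte count.
assert 1000000 * 8192 == 8192000000
```

For the run above this gives roughly 117.6 MB/s, which dd rounds to 118 MB/s. That is close to what a single Gigabit Ethernet link can carry, which is consistent with the fallback to Ethernet.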
 
Dolphin Express and DRBD write performances

The following tests were run on two Dell PowerEdge T300 servers with dedicated disks for DRBD. In the first case, only a standard BCM5722 network adapter was used (red column). In the second case, a Dolphin DXH510 interconnect was added to the standard adapter (blue column). The write performance test is a simple write operation like the following:


 [root@srv01 ~]# dd if=/dev/zero of=/u06/test.txt count=1000000 bs=8192

No attempt was made to tune performance: DRBD parameters such as unplug-watermark, max-buffers and max-epoch-size were not modified. Adjusting those parameters should make it possible to improve the results.

[Figure: Dolphin Interconnect Performance]

This graph points out several things. Firstly, the Dolphin Express interconnect can bring a good performance improvement, 60% in this case, provided a good SSD drive is used. Higher improvements are possible, depending on your hardware configuration and settings. In this configuration, a throughput of 185 MB/s was reached with DRBD and the Dolphin interconnect in conjunction with an Intel X25-E SSD drive.
Obviously, when DRBD is not in use, no network communication is needed, so the same very high throughput is obtained with a normal interconnect and with the Dolphin interconnect. In this case, only the storage system affects performance. As the performance chart above shows, an Intel X25-E SSD can be expected to deliver about twice the throughput of a 15K SAS disk.

Secondly, we cannot expect a big performance improvement without a good storage system. A good storage system is probably the first step in improving hardware performance. The very small throughput difference between the red and blue columns when a SAS disk is used supports this statement. In the case of MySQL Cluster, where data is stored in memory by default, we can expect a huge performance improvement: Dolphin Interconnect reports a 300% performance improvement for MySQL Cluster. For Oracle RAC, the improvement would be close to 400% in terms of throughput and response time. Of course, those results were obtained on specific architectures and need to be interpreted with care.

The throughput gap between a system running DRBD and one that does not in the preceding tests can also be explained by the fact that no time was spent tuning DRBD. However, we definitely cannot expect the same throughput from DRBD in protocol C as from a “standalone” disk. Local write operations on the primary node are considered complete only after both the local and the remote disk writes have been confirmed. The hardware and software layers between the local and the “replicated” host add latency. This latency can be reduced by optimizing the software and hardware stacks, but it cannot be eliminated. Dolphin Express adapters can significantly reduce the network latency. The sequence diagram below, which is greatly simplified, shows the main elements that contribute to latency in a DRBD write operation.

[Figure: DRBD sequence diagram]

It is also possible to run low-level benchmarks of the Dolphin interconnect with the provided benchmark tools, such as scibench2, scipp, intr_bench and dma_bench. Below is a sample output from scibench2:
 
----------------------------------------------------------------------------
Segment Size:   Average Send Latency:           Throughput:
----------------------------------------------------------------------------
 
      4                   0.13 us                 30.11 MBytes/s
      8                   0.13 us                 60.67 MBytes/s
     16                   0.14 us                113.59 MBytes/s
     32                   0.48 us                 66.18 MBytes/s
     64                   0.16 us                402.92 MBytes/s
    128                   0.13 us                993.24 MBytes/s
    256                   0.19 us               1348.72 MBytes/s
    512                   0.41 us               1256.16 MBytes/s
   1024                   0.77 us               1338.15 MBytes/s
   2048                   1.58 us               1292.59 MBytes/s
   4096                   3.17 us               1290.56 MBytes/s
   8192                   6.33 us               1294.84 MBytes/s
  16384                  12.60 us               1299.97 MBytes/s
  32768                  25.20 us               1300.22 MBytes/s
  65536                  50.43 us               1299.66 MBytes/s
 

In this example, the minimum latency to write 4 bytes to remote memory is 0.13 microseconds, and the maximum bandwidth for writing to remote memory is 1348 MB/s.
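
Both figures can be pulled out of a scibench2 report automatically. The sketch below assumes the three-column layout of the sample output above (segment size, latency in us, throughput in MBytes/s). The function names are hypothetical and this is not a tool shipped with the Dolphin software.

```python
import re

# One data row: "<segment size>  <latency> us  <throughput> MBytes/s"
ROW_RE = re.compile(r"^\s*(\d+)\s+([\d.]+)\s+us\s+([\d.]+)\s+MBytes/s\s*$")


def parse_scibench2(text):
    """Return a list of (segment_size, latency_us, throughput_mb_s) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


def summarize(rows):
    """Return (row with the lowest latency, row with the highest throughput).

    When several rows share the lowest latency, the first one (the smallest
    segment size in a sorted report) is returned.
    """
    best_latency = min(rows, key=lambda r: r[1])
    best_throughput = max(rows, key=lambda r: r[2])
    return best_latency, best_throughput
```

Header and separator lines do not match the row pattern and are simply skipped, so the whole report can be passed in as is.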
 

Conclusion

The Dolphin interconnect provides a fast and reliable way to interconnect high-availability systems that need low latency and high throughput. The software stack is easy to install and provides an efficient way to monitor system behavior as well as performance. However, in many cases performance does not depend on hardware alone: taking care of processes, models and workflows is also part of the game. Additionally, it is not worth spending time and money on a fast interconnect without good storage performance.

We also tested the Dolphin support team by asking several questions, and we were pleasantly surprised by its efficiency: we received answers within a very short time.

Finally, even if it sounds obvious, please perform a lot of tests before going into production; this is a best practice in every high-availability project anyway. Testing is mandatory in order to know whether this solution fits your requirements in terms of availability and performance. Trivadis supports several customers with high-availability architectures. As with any high-availability project (MySQL replication, MySQL Cluster, Oracle Data Guard, RAC, Veritas Storage Foundation, …), the complexity of a cluster architecture shouldn’t be underestimated. Several steps are very important in such a project: concept, documentation, testing/validation, and so on.

Thanks to Yann Neuhaus for his expertise and contribution in this paper.

 

Gregory Steulet
Oracle Certified Professional 10G
MySQL Cluster 5.1 Certified
Avaloq Certified Professional

Trivadis SA
Rue Marterey 5
CH-1005 Lausanne
Tel: +41-21-321 47 00
Fax: +41-21-321 47 01
Internet:  www.trivadis.com
Mail: info@trivadis.com

 Literature and Links…

http://www.dolphinics.com
http://ww.dolphinics.no/download/DX_3_0_0_LINUX_DOC/index.html
http://www.drbd.org/users-guide/users-guide.html
http://www.mysql.com/products/enterprise/drbd.html
http://www.oracle.com/technology/products/database/clustering/index.html
http://en.wikipedia.org/wiki/Spintronics

 

Posted: Apr 06 2009, 03:12 by Gregory Steulet | with 7 comment(s)

High Availability cluster with DRBD

Nowadays many systems need to be highly available. One solution that provides such availability at the storage level is DRBD (Distributed Replicated Block Device). What kinds of usage suit DRBD? Can we trust this technology, and how does DRBD affect performance? With these questions in mind we analyzed DRBD in depth at Trivadis.

What is DRBD?

DRBD is an open source product made by an innovative Austrian company called LINBIT. It is designed for building high-availability clusters on Linux systems. This is achieved by mirroring a whole block device over a (dedicated) network. In the free distribution, you can basically see it as a network RAID 1 (mirroring) over two nodes. At the time of writing, the current version of DRBD is 8.2.6, and our tests were performed on this release.

A second version of DRBD, DRBD+, is sold commercially by LINBIT. It allows you to build a two-node cluster and to add a third, asynchronous mirror for a high-availability cluster over an unlimited distance.

DRBD supports three kinds of replication mode:

  • Protocol A: for high-latency networks. A write I/O is reported as completed as soon as it has reached the local disk and the local TCP send buffer.

  • Protocol B: for lower-risk scenarios. A write I/O is reported as completed as soon as it has reached the local disk and the remote buffer cache.

  • Protocol C: for most cases; preserves transactional semantics. A write I/O is reported as completed as soon as it has reached both the local and the remote disk.

In our tests only protocol C has been considered, as it is by far the most widely used and the safest.
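
The three protocols differ only in the point at which a write is acknowledged to the application. The toy model below makes that difference concrete. It is a sketch under stated assumptions, not DRBD's actual behaviour: the local write and the replication are assumed to proceed in parallel, copying into the TCP send buffer is treated as free, and the function and its parameters are hypothetical.

```python
def ack_latency_ms(protocol, local_disk_ms, net_one_way_ms, remote_disk_ms):
    """Estimate when a write is reported as completed, in milliseconds.

    Simplified model (assumption): local write and replication run in
    parallel, and the remote confirmation needs one network round trip.
    """
    round_trip_ms = 2 * net_one_way_ms
    if protocol == "A":
        # Local disk + local TCP send buffer (buffer copy assumed negligible).
        return local_disk_ms
    if protocol == "B":
        # Local disk + data received in the remote buffer cache.
        return max(local_disk_ms, round_trip_ms)
    if protocol == "C":
        # Local disk + data written to the remote disk.
        return max(local_disk_ms, round_trip_ms + remote_disk_ms)
    raise ValueError("unknown protocol: %r" % (protocol,))


# Example with made-up numbers: 5 ms disks and a 0.1 ms one-way network hop.
for proto in "ABC":
    print(proto, ack_latency_ms(proto, 5.0, 0.1, 5.0))
```

In this model protocol C always pays at least one network round trip on top of the remote disk write, which is why both network latency and remote storage speed matter for the write performance discussed later in this article.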

Possible applications

DRBD provides two basic kinds of synchronous replication: master-slave (active-passive) and multi-master (active-active). In the active-passive case, operations are only allowed on the master because the slave device is unmounted. This mode guarantees that only one cluster node manipulates the data at any given point in time.

Several kinds of applications can benefit from a DRBD architecture, such as a web server, an FTP server or even a database server. A server running DRBD in active-passive mode can be used as a failover cluster.

Since version 8.0, DRBD provides an active-active configuration mode, which can be used within a load-balanced cluster (both nodes of the cluster are in use). However, because concurrent access to the same data becomes possible, this mode requires a shared cluster file system such as GFS (Global File System) or OCFS2 (Oracle Cluster File System).

DRBD has no load-balancing functionality. To benefit from the active-active mode, a load balancer or a cluster-aware application is mandatory.

High-availability

When we speak about high availability we also speak about downtime. There are two major kinds of downtime: planned and unplanned. Unplanned downtime can have several causes, such as a power outage, human error, data corruption, or software or hardware errors. Planned downtime can be due to maintenance or routine operations such as upgrades, space management or system reconfiguration. In this section we focus on how DRBD handles maintenance operations and split-brain situations, which are two causes of downtime in the DRBD field.

For general information about high availability, you can read the first pages of the Trivadis white paper on this topic:

www.trivadis.com/uploads/tx_cabagdownloadarea/Trivadis_HA_white_paper_release_2_2_final.pdf

Maintenance operations

Before looking at some maintenance operations, it is useful to know how to monitor the current status. For that you can use one of the following commands:

  • cat /proc/drbd

  • /etc/init.d/drbd status

  • drbdadm cstate/state/dstate

  • cat /var/log/messages | grep drbd

a10:/etc/init.d # ./drbd status

m:res cs st ds p mounted fstype
1:rVot01 Connected Primary/Primary UpToDate/UpToDate C /u101 ocfs2
2:rVot02 Connected Primary/Primary UpToDate/UpToDate C /u102 ocfs2

This output shows two resources that are both connected (cs column), configured in active-active mode (st column), with data up to date on both nodes (ds column). In addition, the status shows the mount points (/u101, /u102) and the file system in use (ocfs2).
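
For monitoring scripts, the same reading of the status columns can be automated. The following sketch parses the output format shown above into dictionaries. It assumes the column order m:res, cs, st, ds, p, mounted, fstype exactly as printed here, and the function names are hypothetical.

```python
def parse_drbd_status(text):
    """Parse the output of '/etc/init.d/drbd status' into a list of dicts."""
    resources = []
    for line in text.splitlines():
        fields = line.split()
        # Resource lines start with "<minor>:<name>"; the header starts with "m:res".
        if len(fields) < 5 or ":" not in fields[0]:
            continue
        minor, name = fields[0].split(":", 1)
        if not minor.isdigit():
            continue
        resources.append({
            "minor": int(minor),
            "name": name,
            "cs": fields[1],              # connection state
            "st": fields[2].split("/"),   # roles: [local, peer]
            "ds": fields[3].split("/"),   # disk states: [local, peer]
            "protocol": fields[4],
            "mounted": fields[5] if len(fields) > 5 else None,
            "fstype": fields[6] if len(fields) > 6 else None,
        })
    return resources


def is_healthy(resource):
    """True if the resource is connected and up to date on both nodes."""
    return resource["cs"] == "Connected" and all(
        state == "UpToDate" for state in resource["ds"]
    )
```

A cron job or monitoring plugin could run the status command, feed its output to parse_drbd_status and raise an alert for any resource where is_healthy returns False.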

DRBD allows many maintenance operations to be carried out without impacting system availability. For example, resources can be added online by following the procedure below:

  1. Partition your disks with fdisk and run partprobe (max. 15 logical partitions on Linux) on both hosts

  2. Create the Logical Volumes on both hosts

  3. Create the appropriate entries in the /etc/drbd.conf file on both hosts

  4. Reload your DRBD config, you might use the --dry-run first

  5. Start the new DRBD device on both hosts

  6. Force one site to be primary, and then the resynchronisation starts

  7. Set the second site to primary as well, create the FS and mount them. Ready!

DRBD also supports online resizing of a resource. However, before growing or shrinking a file system online, three criteria must be fulfilled:

  1. The resource must be on a logical volume management subsystem, such as LVM or EVMS

  2. The resource must currently be in the “Connected” connection state

  3. The file system must support online growing or shrinking

Even a rolling upgrade is possible in a primary-secondary configuration, by upgrading the secondary node first. A DRBD rolling upgrade consists of:

  1. Putting the standby server offline

  2. Upgrading DRBD on the standby node

  3. Bringing the standby server online

  4. Doing the failover

  5. Upgrading the former active

Finally, DRBD allows resources to be reconfigured while they are up and running, by making the necessary changes to the /etc/drbd.conf file and issuing the “drbdadm adjust all” command on both nodes.

Split Brain

Split brain is a term that comes from the medical field and describes a state where the link between the two hemispheres of the brain is severed. In computer science, the cluster split brain is based on this definition. However, the DRBD definition of split brain differs slightly. In DRBD, a split brain is (according to LINBIT) "a situation where, due to temporary failure of all network links between cluster nodes, and possibly due to intervention by a cluster management software or human error, both nodes switched to the primary role while disconnected". To summarize, a split brain means having two primary resources while the network between them has failed. The loss of connectivity between hosts itself is referred to in DRBD as a cluster partition. For convenience and to avoid confusion, the term split brain in this article refers to the DRBD definition.

Although a split brain occurs when there is no longer a connection between two primary resources, it is also possible to get one in an active-passive configuration. If the connection between both servers is down, it is still possible to switch a resource from passive to active, and therefore to end up with a DRBD split brain. DRBD detects this situation when the connection between the hosts becomes available again. As soon as the split brain has been detected, DRBD stops the replication.

Feb 10 17:01:53 node1 kernel: drbd0: Split-Brain detected, dropping connection!

There are two ways to handle such a situation: manually, by discarding the data on one node, or automatically, by using one of the algorithms provided by DRBD. The available algorithms depend on the configuration (0, 1 or 2 primaries) at the time the split brain is detected. In addition, the choice of an algorithm strongly depends on the applications running on top of DRBD. The possible algorithms include:

  1. Discarding modifications made on the “younger” primary

  2. Discarding modifications made on the “older” primary

  3. Discarding modifications on the primary with fewer changes

  4. Graceful recovery from split brain if one host has had no intermediate changes

  5. Discarding modifications on the current secondary
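
To make the five strategies concrete, the sketch below picks the node whose modifications would be discarded under each of them. It is an illustration only: DRBD takes this decision internally, the Node fields are hypothetical bookkeeping invented for the example, and the policy labels are modelled on DRBD's after-sb-* option values but are used here as plain strings.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str             # "primary" or "secondary" when the split brain is detected
    primary_since: float  # hypothetical: time at which the node became primary
    changed_blocks: int   # hypothetical: blocks modified while disconnected


def pick_victim(policy, a, b):
    """Return the node whose changes are discarded, or None if undecidable."""
    if policy == "discard-younger-primary":
        # The node that became primary most recently loses its changes.
        return a if a.primary_since > b.primary_since else b
    if policy == "discard-older-primary":
        return a if a.primary_since < b.primary_since else b
    if policy == "discard-least-changes":
        return a if a.changed_blocks < b.changed_blocks else b
    if policy == "discard-zero-changes":
        # Graceful recovery: only possible if one node changed nothing.
        unchanged = [n for n in (a, b) if n.changed_blocks == 0]
        return unchanged[0] if unchanged else None
    if policy == "discard-secondary":
        secondaries = [n for n in (a, b) if n.role == "secondary"]
        return secondaries[0] if len(secondaries) == 1 else None
    raise ValueError("unknown policy: %r" % (policy,))
```

Returning None stands for the case where no automatic decision is possible and the split brain has to be resolved manually.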

So it is possible to manage the “after split-brain” situation, but what about avoiding it in the first place? Dopd (DRBD outdate-peer daemon), in conjunction with heartbeat, provides a solution by outdating the resources on the secondary node through another network path. An outdated resource cannot become primary until the connection has been re-established or some maintenance operations have been performed. To enable dopd, append the following lines to /etc/ha.d/ha.cf on both nodes:

respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

Then reload heartbeat and modify drbd.conf as shown below:

resource rDrbd01 {
  handlers {
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  disk {
    fencing resource-only;
  }
  ...
}

The modifications take effect after restarting DRBD or adjusting its configuration:

drbdadm adjust rDrbd01

Performances

Depending on which protocol is used (A, B or C), performance can be more or less affected. All read operations are served from the local hard drive, even in active-active mode, but writes suffer from the DRBD overhead. In our tests we used protocol C. Two kinds of tests were run: a pure write test with “dd”, and a second one with MySQL and SysBench. Of course these results fully depend on the architecture and on the software, but the proportions should hold for a “standard” architecture.

 

 

The first graph (DRBD write performances) shows how DRBD in primary/secondary mode can hurt write performance if you do not pay attention to the settings (difference between the blue and purple columns). After spending some time on DRBD and kernel settings, a significant improvement can be obtained (difference between the blue and light blue columns).

The second graph (MySQL Architecture Performances) compares several high-availability solutions for MySQL. The highest throughput is achieved with MySQL 5.1 using InnoDB as the storage engine. For MySQL Cluster, the tests were run on a three-node architecture including two NDB nodes with a dedicated network. DRBD in primary/secondary mode is stuck at about 80 transactions/second whatever the number of connections. Those poor results are due to the fact that MySQL 5.1 performs a lot of serialized synchronous writes.

Write performance problems with DRBD are mostly due to disk and network latency, so if you plan to use DRBD and have demanding write performance requirements, you should look at a Battery Backed Write Cache (BBWC) controller and at InfiniBand or the Dolphin interconnect. Dolphin and LINBIT announced their partnership on September 12th. The Dolphin Express low-latency protocol will be supported in version 8.2.7, which should be available in a few weeks.


How could I tune DRBD?

There is no silver bullet that brings a significant improvement in every situation. However, there is a range of DRBD parameters that can affect write performance, such as:

  • Activity Log size (al-extents)

  • Unplug watermark (unplug-watermark)

  • Maximum I/O request buffers allocated (max-buffers)

It is also possible to change some network kernel parameters, such as rmem_default, rmem_max, wmem_default, wmem_max, tcp_mem, tcp_wmem and tcp_rmem. In any case, before changing those parameters, look at the application requirements and make sure you have gathered the required know-how.
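
Before changing any of these kernel parameters it is worth recording their current values. The sketch below reads them from the /proc/sys tree on Linux. The full sysctl names (net.core.* and net.ipv4.*) are the standard locations of the parameters listed above, while the function names and the root argument are assumptions made for this example.

```python
from pathlib import Path

# Standard sysctl names of the parameters mentioned above.
PARAMS = [
    "net.core.rmem_default", "net.core.rmem_max",
    "net.core.wmem_default", "net.core.wmem_max",
    "net.ipv4.tcp_mem", "net.ipv4.tcp_wmem", "net.ipv4.tcp_rmem",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file, e.g. net.core.rmem_max."""
    return Path(root).joinpath(*name.split("."))


def read_params(names=PARAMS, root="/proc/sys"):
    """Return {name: [int, ...]} for each parameter, or None if unreadable."""
    values = {}
    for name in names:
        try:
            text = sysctl_path(name, root).read_text()
            values[name] = [int(v) for v in text.split()]
        except (OSError, ValueError):
            values[name] = None
    return values
```

The tcp_* entries hold three integers each, which is why every value is returned as a list. Saving the result before and after a tuning session makes it easy to roll back a change that did not help.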

Conclusion

DRBD provides a reliable and stable host-based mirroring solution. Installation and configuration are straightforward, provided of course that you know what you are doing. In addition, this solution is free in a two-node configuration, and MySQL officially supports it.

On the other hand, DRBD only works on Linux; other operating systems sell their own solutions. For the moment, SuSE Linux Enterprise supports only version 0.7, and Red Hat provides no support. As mentioned in the performance section, you may encounter serious write performance degradation if your application is strongly write-oriented.

Finally, even if it sounds obvious, please perform a lot of tests before going into production; this is a best practice in every high-availability project anyway. Testing is mandatory in order to know whether this solution fits your requirements in terms of availability and performance. In addition, do not be fooled: no product can deliver high availability by itself. High availability is a concept, a set of best practices and good habits, before it is a product.

Trivadis supports several customers with high-availability architectures. As with any high-availability project (MySQL replication, MySQL Cluster, Oracle Data Guard or RAC, Veritas Storage Foundation, …), the complexity of a cluster architecture shouldn’t be underestimated. Several steps are very important in such a project: concept, documentation, testing/validation, and so on.

Thanks to William Sescu for his active participation and contribution in this paper, to Michael Wirz for his contribution in the project and Yann Neuhaus for his expertise in the field of high-availability and contribution in this paper. Without them this article wouldn’t have been possible.


Gregory Steulet
Oracle Certified Professional 10G
MySQL Cluster 5.1 Certified

Trivadis SA
Rue Marterey 5
CH-1005 Lausanne
Tel: +41-21-321 47 00
Fax: +41-21-321 47 01

Internet:  www.trivadis.com
Mail: info@trivadis.com

Literature and Links…

http://www.trivadis.com/uploads/tx_cabagdownloadarea/DRBDArticle_08_OK_YAN.pdf
http://www.drbd.org/users-guide/users-guide.html
http://blogs.linbit.com/florian
http://www.mysql.com/products/enterprise/drbd.html
http://www.mysqlperformanceblog.com
http://sysbench.sourceforge.net/
http://www.mysql.org

Posted: Nov 11 2008, 10:26 by Gregory Steulet | with 9 comment(s)