Created:
1/29/2010 12:58:30 PM

Author:
Przemek Radzikowski

permalink [Permalink]





International Careers & Jobs - An international employment directory, reviewing world-wide top job sites




  • Home  ›
  • Articles  ›
  • How to Restore a SQL 2008 Failover Cluster onto New Disk Volumes and Fix the Microsoft Distributed Transaction Coordinator (MSDTC) on Windows Server 2008 x64



| More

How to Restore a SQL 2008 Failover Cluster onto New Disk Volumes and Fix the Microsoft Distributed Transaction Coordinator (MSDTC) on Windows Server 2008 x64

This article will show you how to restore a SQL 2008 Failover Cluster after you have either lost your disks or volumes due to damage or failure and general disaster. We also demonstrate a way to get the Microsoft Distributed Transaction Coordinator (MSDTC) to run again.


 

 

Introduction

Recently we found ourselves in a delicate situation when the SAN crashed along with our Data Protection Manager 2007 (DPM) storage pool LUN (logical unit number).  Luckily we had a secondary DPM server in the Disaster Recovery (DR) site which held up-to-date replicas of what was protected. 

Although the data was protected, the important part of the SAN, the metadata, was corrupt beyond repair.  Metadata stores the physical to logical mapping of disks to virtual disks, virtual in the sense that a virtual disk (vdisk) on a SAN can span many physical disks.

Having lost the metadata we were forced to re-initialize (format) the SAN and start from scratch.  This of course meant that each LUN had to be recreated and attached to the servers.  Doing this poses a number of problems for clustered services, in particular services which rely on exact placement and permissions of files to be restored.

So we find ourselves with a 4 node SQL 2008 Failover Cluster, running on Windows Server 2008 x64 without a quorum disk or any other disks for that matter, because the disks that were presented to the cluster when we formed it no longer exist.

Initialize and Format all Cluster Volumes

Assuming that you have presented the LUNs to all the cluster nodes, we need to ensure that they are reformatted and made ready for use.  In our case we had three instances of SQL Server running on the cluster, a single instance of SQL 2005 and two instances of SQL 2008.  Therefore, we’ll have three sets of three drives plus the Quorum drive to configure.

We need to initialize all the disks and format.  The easiest way to do this is to:

  1. Terminal server to the first node
  2. Online each disk by right clicking it and selecting online.
  3. Initialize each disk by right clicking it and selecting Initialize.  You should choose either GPT or MBR as you normally would.
  4. Create the volume and format with NTFS. Give the volume a meaningful name, such as DB01 Data if the volume is going to be used to hold the database or DB01 Log if it will hold the log.
  5. Don’t forget to do the same to your Quorum drive.

The resulting disk configuration should look something like the following diagram.

[Figure 1 – Volumes and Quorum drive for SQL Cluster]

With all the disks initialized and formatted we must now relink, or in cluster language, repair the cluster disk resources so that they can be recognized as new replacements.  The next few steps will demonstrate how to repair the quorum and remaining disks in order to get the SQL 2008 Failover Cluster back online.

Restoring the Quorum Drive

The first thing we need to do before any other action is to ensure that the quorum is restored between all the nodes.  As you can see in Figure 2, all our resources are offline and trying to online them results in errors. This is because the physical disks that the cluster is configured to use no longer exist.

[Figure 2 – SQL Cluster Disk Resources Showing Offline State]

The Failover Cluster feature in Windows Server 2008 x64 is a marked improvement over earlier versions of Windows.  To bring the Quorum drive back online, right click it and select More Actions… > Repair

[Figure 3 – Quorum Witness Disk Repair]

On the Repair a Disk Resource dialog, tick the disk that you have formatted to be the new Quorum disk.  In our case it is Disk 1.  Then click OK.

[Figure 4 – Repair a Cluster Disk Resource]

At this stage the disk should be repaired and we can now bring the Quorum drive back online.  To do this, simply right click the Quorum disk and select Bring this resource online.

[Figure 5 – Bring this resource online]

Repairing the Remaining Cluster Disk Resources

Use the exact same method to repair and bring the remaining SQL 2008 Failover Cluster disk back online.  After you repair all the disks you should seek all disks with green online status.

[Figure 6 – All Cluster Disks Online]

Restoring the SQL 2008 & 2005 Data, Log and MSDTC Files

Whether you use Microsoft DPM or some over backup solution, the only thing you must remember is to ensure that you restore the full folder structures exactly as they were before the failure.  If you were only backing up the databases, you may have some more work ahead of you in order to get things working.  Luckily we were backing up the volume contents including the folder structure, so for us it was a simple copy to the new volumes.

Restore Files to Original Volumes

Our SQL Cluster contained SQL 2005 and SQL 2008 instances, and we needed to ensure that everything was restored to exactly the same location.  Ordinarily with DPM this wouldn’t be a problem, we could have restored directly onto the newly attached cluster disks, however our primary DPM server was dead and we had some challenges in switching the secondary server to function as the primary so that we could restore to source. 

Our only option was to restore to a network folder and then copy the files to the respective LUNs.

Errors 17058, 4163, 17207 When Starting DB Services

Some could say that our troubles were over, we had good backups, and we were able to restore the full folder structure to the required LUNs.  But we were unable to start neither the clustered DB nor the MSDTC resource. 

Some of the errors we were getting are shown below:

Event 17058: initerrlog: Could not open error log file 'I:\MSSQL10.DB02\MSSQL\Log\ERRORLOG'. Operating system error = 5(failed to retrieve text for this error. Reason: 15105).

[Figure 7 – Error 17058 initerrlog: Could not open error log file]

Event 4163: MS DTC log file not found. After ensuring that all Resource Managers coordinated by MS DTC have no indoubt transactions, please run msdtc -resetlog to create the log file.

[Figure 8 – Error 4163 MS DTC log file not found]

Event 17207: FCB::Open: Operating system error 5(error not found) occurred while creating or opening file 'G:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\OperationsManager.ldf'. Diagnose and correct the operating system error, and retry the operation.

[Figure 9 – Error 17207 – FCB::Open: Operating system error 5]

Solving the File Permission Problem

After looking through the logs we eventually found that it was a permissions issue.  For some reason the restore didn’t preserve the file permissions and the SQL as well as MSDTC were unable to read and/or write the files.

Before we go into the file permissions, we need to understand how SQL operates, and in particular when SQL is running under a domain account.  In our case the first instance (DB01) was running under the s-GGG-SQL1DB01 domain service account.

[Figure 10 – SQL Server Instance User Account]

We referred to one of our other servers for guidance and copied permissions across from the directory structure.  The following table will demonstrate which permissions must be added to which folders.  Keep in mind that your drive letters may differ depending on whether you split your databases and logs.

Folder

Account

Permissions

F:\Microsoft SQL Server\MSSQL.1\MSSQL

s-GGG-SQL1DB01-DB

Read & execute

List folder contes

Read

F:\Microsoft SQL Server\MSSQL.1\MSSQL\Backup

s-GGG-SQL1DB01-DB

Full control

F:\Microsoft SQL Server\MSSQL.1\MSSQL\Data

s-GGG-SQL1DB01-DB

Full control

F:\Microsoft SQL Server\MSSQL.1\MSSQL\FTData

s-GGG-SQL1DB01-DB

s-GGG-SQL1DB01-FT

Full control

F:\Microsoft SQL Server\MSSQL.1\MSSQL\JOBS

s-GGG-SQL1DB01-AG

Full control

F:\Microsoft SQL Server\MSSQL.1\MSSQL\Log

s-GGG-SQL1DB01-DB

s-GGG-SQL1DB01-AG

Full control

F:\Microsoft SQL Server\MSSQL.1\MSSQL\repldata

s-GGG-SQL1DB01-DB

 

Full control

G:\Microsoft SQL Server\MSSQL.1\MSSQL

s-GGG-SQL1DB01-DB

Read & execute

List folder contes

Read

G:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG

s-GGG-SQL1DB01-AG

Full control

[Table 1 – SQL Database and MSDTC File Permissions List]

Fixing MSDTC Log File Error

Our H: drive held the MSDTC log file which also needed to have its permissions changed.  The problem was that the permissions on the log file H:\MSDTC\a5a85ece-45c4-4780-a969-584a6c2e871c\MSDTC.LOG were created by the MSDTC service itself and the user does not exist in Active Directory or local machine and is in the form MSDTC${SOME LONG GUID}.

[Figure 11 – MSDTC GUID Permission]

Without the change of permissions our MSDTC resource will fail to start.  In order to get around that problem we need to do three things. 

  1. Check if the Distributed Transaction Coordinator (MSDTC) service is started.


    [Figure 12 – Distributed Transaction Coordinator Service]

  2. Run the following command to reset the MSDTC log file

    msdtc -resetlog
  3. Run one of the following commands:

    msdtc -resetclustermlog {guid}
    or

    msdtc -resetclustertmlog <resource_name>
    Where {guid} is the GUID of the DTC resrouce (this is the name of the folder that the Cluster log resides on a shared disk), and <resource_name> is the name of the MSDTC resource listed in the Failover Clustering interface.

    An easy way to find the MSDTC GUID is to browse to where the MSDTC log files are stored and then follow the folder structure until you reach a folder named with a GUID. The folder name is the GUID for this particular DTC. But it is far easier to just use the Failover Cluster Management MMC to get the MSDTC.

 The file permissions should have now been updated.

Finishing Off the SQL Cluster Database Fiasco

Now that all the preliminary work has been done, you shouldn’t have any problems starting all the cluster resources.

[Figure 13 – SQL 2008 Failover Cluster Up and Running]

Perform the same steps for any other cluster resources or databases you may have.

Conclusion

Troubleshooting SQL 2008 Failover Cluster isn’t as hard as one would think provided that the user has a basic understanding of cluster operation and SQL 2008 permissions.  Certainly, MSDTC needs to be handled differently but on the whole, failed drives and volumes prove no hurdle to restoring SQL cluster operations on Windows Server2008 x64.

References

How to Configure Multiple Instances of Distributed Transaction Coordinator (DTC) on a Windows Server Failover Cluster 2008
http://blogs.technet.com/askcore/archive/2009/02/18/how-to-configure-multiple-instances-of-distributed-transaction-coordinator-dtc-on-a-windows-server-failover-cluster-2008.aspx

Managing MTS Transactions
http://technet.microsoft.com/en-us/library/cc750857.aspx

Troubleshooting Problems with MSDTC
http://msdn.microsoft.com/en-us/library/aa561924(BTS.10).aspx

 

 

 

permalink [Permalink] - Updated: Monday, October 28, 2013





| More

 

Articles of Interest


International Careers & Jobs - An international employment directory, reviewing world-wide top job sites


 
 
(c) Capitalhead Pty Ltd
Contact Capitalhead About Us Articles & Publications Partners Solutions & Services Products Valid XHTML Valid CSS