A few weeks ago something happened on our TFS instance - we discovered that DBCC CHECKDB under certain conditions can mark a database as corrupted.
Long story short, this was due to a peculiar condition related to a high volume of transactions during that operation, not something you see every day. Microsoft Support was really good at helping us get back to normal.
In retrospect, what really struck me was how resilient TFS was thanks to SQL Server AlwaysOn. As you know, I am a huge fan of AlwaysOn because of how transparent it makes High Availability.
For us, maintaining availability meant a simple failover to the other node. Given that we are running the Availability Group in Synchronous-Commit Mode (my default choice when it comes to TFS), the replica that took over as Primary was already up to date with the latest transaction, so there was no data loss.
Team Foundation Server did not lose a single heartbeat. When things go south like this, if you are in the middle of something during the issue itself or during the failover, you will get a JobInitializationError, which is self-explanatory. As this is a transactional system by design, nothing is left hanging in the balance, unlike good ol' SourceSafe :)
Of course we were in limited availability while we were troubleshooting and fixing this problem (always change the Failover Mode to Manual when you are doing so), but there was no downtime.
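As a rough illustration of the "switch to Manual" step, here is a minimal Python sketch that only builds the T-SQL you would run against the Primary. The Availability Group name `TfsAG` and the replica names `SQLNODE1` and `SQLNODE2` are made up for the example and are not our real ones; the `ALTER AVAILABILITY GROUP ... MODIFY REPLICA ON ... WITH (FAILOVER_MODE = ...)` statement itself is standard SQL Server syntax.

```python
def quote_ident(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value: str) -> str:
    """Build an N'...' Unicode string literal, escaping single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def set_failover_mode(ag_name: str, replica: str, mode: str) -> str:
    """Return the T-SQL that sets the failover mode of one replica."""
    mode = mode.upper()
    if mode not in ("MANUAL", "AUTOMATIC"):
        raise ValueError(f"unsupported failover mode: {mode}")
    return (
        f"ALTER AVAILABILITY GROUP {quote_ident(ag_name)} "
        f"MODIFY REPLICA ON {quote_literal(replica)} "
        f"WITH (FAILOVER_MODE = {mode});"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for replica in ("SQLNODE1", "SQLNODE2"):
        print(set_failover_mode("TfsAG", replica, "MANUAL"))
```

Once the troubleshooting is over, the same helper with `"AUTOMATIC"` produces the statements to put things back the way they were.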
As for recovery, at the end of the day we had to restore backups on the Secondary Replica to get back to proper synchronisation. Again, a bit tedious and time-consuming given the sizes involved, but it was flawless.