Two weeks ago I dealt with a head-scratching outage. A few
minutes of downtime, for a very stupid reason.
So, we start from a total outage: none of the
services which rely on the production AlwaysOn Availability Group can
connect to the server. People start screaming, emails flow in at a rate of tens per second…
well, it wasn't that bad, but you get the idea 😊 Outages are always annoying.
So off we go with the usual stuff – the TFS Management
Console does not load any data from the Data Tier, so the first port of call
is checking the database servers.
Which are humming along totally fine. What the hell?! The
network stack works as expected too: I can ping all the machines involved!
When checking the database servers, I can see that the
Availability Group is totally fine – everything is green, synchronised and with
no issues. While this is very good on its own (no backups to restore, nothing
to sweat too much about), it still does not explain why the Application Tier
cannot talk to the Data Tier.
Then the awakening – whenever I try to connect to the
AlwaysOn Listener I get a network error, while going directly to the database
server works without problems. There it is!
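If you want to script this comparison instead of clicking around, here is a minimal Python sketch of the same check: try a plain TCP connection to the Listener and to each database server, then compare the results. It is an illustration, not what I ran during the outage. The host names are hypothetical, and port 1433 is only the SQL Server default, so your Listener may well use a different one.

```python
import socket
from typing import Callable, Sequence


def can_connect(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def diagnose(
    listener: str,
    replicas: Sequence[str],
    port: int = 1433,
    probe: Callable[[str, int], bool] = can_connect,
) -> str:
    """Compare Listener reachability against the individual database servers."""
    if probe(listener, port):
        return "listener reachable"
    # The Listener is unreachable: check whether the servers behind it still answer.
    if any(probe(replica, port) for replica in replicas):
        return "listener down, replicas up: suspect the listener resource"
    return "nothing reachable: suspect network or SQL Server"
```

Calling something like `diagnose("ag-listener", ["sql1", "sql2"])` (made-up names again) in my situation would have returned the second message, which is exactly the symptom described above. Keep in mind that this only tests TCP reachability: a successful connection does not prove that SQL Server would accept a login.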
Indeed, pinging the Listener does not work either. But why? All the
cluster resources were green and online,
yet for some reason the affected resource
failed to perform its duties.
Given that all the other moving pieces were perfectly fine,
a manual AlwaysOn failover solved the problem. The lesson learned here is that
in a complex architecture there is always something unnoticeable but critical –
it’s like breaking a malleolus.