Monday 22 January 2018

Tips on dealing with WinRM and remoting using the Test Agent

Despite the push we’ve seen in the last few years, the Hosted Build Service might not be the right product for you for whatever reason.

Then, if you are in a situation where your agents aren’t running in the same domain as Team Foundation Server’s and you want to use the Test Agent then you really risk opening the Pandora’s box, courtesy of WinRM and PowerShell remoting.

And to be completely clear – I have nothing against them Smile the only downside is that they need to be approached in the right way, otherwise the can-of-worms effect is just behind the corner.

First and foremost, remember that whenever you target a machine for Test Agent deployment you only need to consider the Build Agent-Test Agent relationship. All the errors you will get are going to be from the Test box, not the build box.

So when you need to configure WinRM, the Test box is the machine that is going to be accepting the connections. While it sounds straightforward, sometimes things happen and one is tempted to look at the Build box first: don’t.

Also, if you really want to use HTTP and WinRM, remember that this is the trickiest combination – so think twice before going down that route!

Then in terms of errors – you will likely face WinRM errors of all sorts. The most common is this:

image

If you are outside a domain then REMEMBER about Shadow Accounts – it is the only way to keep identity issues to a minimum. You’ll also need to set the TrustedHosts value to the machines pushing the agent.

Then this:

image

Remember that passwords need to match, and that mixing users at setup time isn’t really a good idea if you are going down the workgroup/non-trusted domain route.

Always triple check passwords, and I recommend to use the same account for both provisioning and execution, at least as a baseline. This will make sure you have a safety net incase things don’t pan out as expected.

Eventually there is this error, that really puzzles me:

image

This is actually an aggregated exception:

image

Look at UAC and execution context for this – it always happens when you are not running stuff as Administrator when that’s supposed to be elevated. It always drives me mad.

Wednesday 10 January 2018

Don’t overlook the details during a TFS outage

Two weeks ago I dealt with a head-scratching outage. A few minutes of downtime, for a very stupid reason.

So, we are starting in a situation of total outage. All the services which rely on the production AlwaysOn Availability Group cannot connect to the server. People start screaming, emails flow in at a rate of tens-per-second… well it wasn't that bad, but you get the idea 😊 outages are always annoying.

So off we go with the usual stuff – the TFS Management Console does not load any data from the Data Tier, so the first point of call is checking the database servers.

Which are humming along totally fine. What the hell?! The network stack works as expected, I can ping all the machines involved!

When checking the database servers, I can see that the Availability Group is totally fine – everything is green, synchronised and with no issues. While this is very good on its own (no backups to restore, nothing to sweat too much about), it still does not explain why the Application Tier cannot talk to the Data Tier.

Then the awakening – whenever I try to connect to the AlwaysOn Listener I get a network error, while going directly to the database server works without problems. There it is!

Pinging the Listener does not work indeed. But why? All the cluster resources were green, online. 
But for some reason the affected resource failed to perform its duties.


Given that all the other moving pieces were perfectly fine, a manual AlwaysOn failover solved the problem. The lesson learned here is that in a complex architecture there is always something unnoticeable but critical – it’s like breaking a malleolus.