Tuesday, 17 July 2018

I want to move my project from TFVC on TFS to Git on VSTS, without command-line tools. Can I do it?


Many people do not realise how easy it is to use the technology they already have to accomplish a certain scenario. This happened to me just last week.

For example: you have a project on a Team Foundation Server which uses TFVC. TFS is only reachable via the corporate LAN, you want to move the project to the company’s new VSTS account, and you also want to switch to Git. Throwing an extra spanner in the works, you want something easy to use which does not require any kind of command-line tooling.

Does it sound too complicated? It is actually a matter of a couple of clicks.

The first step is to use the Import Repository feature on your local TFS, which converts a TFVC branch ($/MyProject/main for example) into a brand new Git repository.

You can retain up to 180 days of history, which is more than enough IMHO. If you need more, you can keep the old system around and look it up there. Why? Because of how TFVC and Git differ, migrating the full history would not really make sense – you would just be adding weight to a repository that should be as nimble as possible. Also, you are limited to 1GB per imported branch.

Once you are happy with it you can add your VSTS target repository as a remote, and push it there. Job done.
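
If you do want to run that last step from a prompt (the same can be achieved from Visual Studio’s Team Explorer, keeping the no-command-line promise), a minimal sketch could look like this – the account and repository URLs are placeholders:

git remote add vsts https://myaccount.visualstudio.com/DefaultCollection/MyProject/_git/MyProject
git push vsts --all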

Tuesday, 10 July 2018

Review – Accelerate


As you know, I am not only a technology enthusiast but also very much into the business side of DevOps. And as a fan of The Phoenix Project, I really could not refrain from purchasing it 😊
Also, the focus is on High Performing Technology Organisations (HPTO from now on), which is a very broad subject intertwining technology, management and strategy. Enough to keep me interested.

I read it twice before writing this review. Yes, twice. And the conclusion is very simple: it carries a huge amount of horizontal value. This book is not the typical technical or business book; its approach is more scientific, almost academic.

A real HPTO is a well-oiled machine that requires lots of work all across the board. And that is where the book shines for business value: despite the academic approach, each chapter can be picked up by any company as a standalone project to improve itself and move towards the maturity required to ‘be’ an HPTO.

Technical best practices? Chapter four. Infosec and the shift left on security? Chapter six. Employee empowerment through management? Chapter nine. Each chapter has enough material to keep you, your teams and your company busy for months, if you actually start a project on it. And given that I do not think every reader of this book works in an HPTO, you definitely should start some projects 😊

Summarising it in a single sentence: software is the actual business engine. That is what the book underlines as well – without a good software factory you simply cannot deliver value to your users, and if you don’t deliver value…

Wednesday, 27 June 2018

A set of tricky situations with HTTPS and TFS

HTTPS is more and more commonplace, not just for public websites but for internal ones too. This is extremely good for a number of reasons, but from an administration standpoint there are a few bits to keep in mind.

In particular, when it comes to Team Foundation Server, here is a list of errors and problems that all go away thanks to a common denominator: the right certificate.

The number one offender is of course the out-of-domain machine. If you have domain-joined machines these problems simply do not happen, because the internal certificate is deployed by the domain GPO – hence you don't have to fiddle with it. When your machine is not domain-joined, things can easily go south.

Bear in mind - these are not security tips, this is just a collection of situations which you will face if you deploy HTTPS with TFS.

Non domain-joined machines

If you are running a non domain-joined machine then you need to procure the root certificate for your domain and install it in the Trusted Root Certification Authorities store on your machine. This needs to be done on any machine not part of your domain, otherwise you won't be able to do pretty much anything.
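
A minimal sketch of that import, from an elevated PowerShell prompt – the path is a placeholder, and the .cer file is assumed to contain your internal root CA:

# install the internal root CA into the machine-wide Trusted Root store
Import-Certificate -FilePath "C:\Temp\corporate-root-ca.cer" -CertStoreLocation Cert:\LocalMachine\Root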


Build agents

Build agents need to be reconfigured. You can't run away from this: if you don't do it, they will keep working until the authentication token expires, then they go offline and you will start seeing this error in the Event Log:

Agent connect error: The audience of the token is invalid.. Retrying every 30 seconds until reconnected

You need to de-register (config.cmd remove) and re-register your build agents in any pool. Not too bad, but it needs to be planned for.
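
A rough sketch of what that looks like from the agent folder – URL, pool and agent names are placeholders, and the exact switches depend on the agent version you are running:

.\config.cmd remove
.\config.cmd --url https://tfs.contoso.local/tfs --auth Integrated --pool Default --agent BUILD01 --runAsService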


The Deploy Test Agent task in Build and Release

If you don't have your certificate installed on both the Agent (if outside the domain) and the target machine (again, if outside the domain) then you will get this cryptic error:

The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: Exception calling ".ctor" with "2" argument(s): "One or more errors occurred."

It's a communication issue between the target machine and TFS. Once the certificate is installed it goes away and the task works normally. This GitHub issue also recommends enabling TLS v1.2, which is not a bad idea.
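
If you want to follow that advice, the usual way to opt the .NET Framework (and therefore the agents and tasks built on it) into strong cryptography, TLS 1.2 included, is the SchUseStrongCrypto registry value – a sketch, to be tried on a non-production machine first:

# opt .NET 4.x into strong crypto / TLS 1.2 (64-bit and 32-bit hives)
New-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\.NETFramework\v4.0.30319" -Name "SchUseStrongCrypto" -Value 1 -PropertyType DWord -Force
New-ItemProperty -Path "HKLM:\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v4.0.30319" -Name "SchUseStrongCrypto" -Value 1 -PropertyType DWord -Force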


Git

Git holds a special spot in this collection because of how it handles SSL. Newer versions of Git for Windows made this really straightforward (hint: they support the Windows Credential Manager), but if you aren't running the latest and greatest, this is what could happen with Git on your local machine, even if it is joined to the domain:

C:\>git clone https://myserver/Collection/_git/Project 
Cloning into 'Project'... 
fatal: unable to access 'https://myserver/Collection/_git/Project/': SSL certificate problem: unable to get local issuer certificate

You can sort this out in many ways, but the best one is Philip Kelley's approach. It just works, even if it is a bit of a walkthrough. This applies not only to the client but also to the build agent, if you are not running a recent version of the agent itself: it is easily corrected by replacing the ca-bundle.crt file over there, as that file is not going to be refreshed until you update the agent to a newer version.
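
If you would rather not touch the bundle that ships with Git, another route is pointing Git at a bundle of your own (which still needs to contain your internal root CA) – the path below is just a placeholder:

git config --global http.sslCAInfo "C:\certs\ca-bundle.crt"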

Also, a false friend:

error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 10054
fatal: read error: Invalid argument, 255.05 MiB | 1.35 MiB/s
fatal: early EOF
fatal: index-pack failed

It can be all sorts of things, especially as the error points at OpenSSL – but check your connection's stability before messing with Git's postBuffer and compression settings 😃 if the git clone operation starts at all, the problem is not SSL authentication.

Wednesday, 20 June 2018

Easily handle internal settings while orchestrating components' deployments and parameters

After ten years of attending, and then speaking at, conferences it always strikes me that what demos often miss are the real-world details that really make the difference.

Like...deploying an application with a pipeline. Everybody talks about it, right? And everybody (including myself!) has some demo-ready stuff to show around in case it might be required.

I am working on a sample application right now, and I realised how blind I was - even if I am deploying stuff to different slots and environments and whatnot, I am still treating everything as a single monolith. Not really what you want these days, right?

Well, let's sort it out. Say you have an API component and a Frontend component: the best thing to do is to decouple the two so they can be independently deployed *and* mixed and matched depending on the requirement.

It is .NET Core in my case, so in my Frontend component's appsettings.json I created a dedicated configuration section.

Of course I modified the application so I could register the configuration in my ConfigureServices method and consume it in my Controller. The variable part in this case is the Slot property.

Now comes the fun side of the story - of course I have a pipeline in place. How do I handle these settings?

The best approach here, given the relative complexity of this exercise, is to scope the relevant value by environment: Dev will always point at Dev, Staging at Staging, and the last two environments are effectively production so I do not need to worry about adding a slot. It's not like I have cross-environment settings here.

The variables are named that way because I am using the JSON variable substitution option in the Azure App Service Deploy task, and as my property is not at the first level of the JSON file it needs to be written out explicitly – for a hypothetical nested property such as ApiConfiguration:Slot, the release variable would be named ApiConfiguration.Slot.

Doing this ensures that each environment has its own setting, and it also helps you stay sane while handling internal app settings across your applications and environments 😉 it is really easy to do as well, so there is really no reason to skimp on it.

Saturday, 16 June 2018

Quickly deploy a baseline SQL database with VSTS

"Sometimes we go full steam ahead with a complex solution for a very simple problem..."

That was the answer I gave to a friend of mine who asked me how to feed some baseline database for testing purposes with VSTS in Azure.

The obvious answer would be to have your versioned SQL scripts in a dedicated repository which you can use to rebuild the whole thing from code (which is by all accounts the most correct solution to this problem). But in this case there are other avenues.

Databases have been treated like second-class citizens for years – by tools and practices alike. For example, why not use BACPAC files for this exercise? At the end of the day, a BACPAC file contains the packaged version of a database at a certain point in time, including its data.

So if you have your BACPAC somewhere, like an Azure storage account, you can run this SqlPackage command inside a VSTS PowerShell Script task (of course you need to replace the variables and provide the actual path):

& 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\130\sqlpackage.exe' /Action:Import  /TargetServerName:$(DBUrl) /TargetDatabaseName:$(DBName) /TargetUser:$(DBAdmin) /TargetPassword:$(DBPassword) /SourceFile:"<your location>/sample.bacpac"

Don't get me wrong, I love seeing a database fully integrated with the pipeline and that's how it should be. But in this specific case, I feel the tradeoff is worth it.

Also – this is a baseline database, so nothing prevents us from running delta scripts against it depending on needs. But given it was for testing purposes, I highly doubt there is going to be much development on it in the future!

Thursday, 7 June 2018

How to run UI tests in a Deployment Group with TFS and VSTS

Especially if you are testing client applications, you might want to run UI tests on a Deployment Group instead of a Build Agent. While the technology is the same, there are a couple of things to keep in mind.

In order to enable a machine to run UI tests you need to make sure your InteractiveSession capability is set to true.

In order to do so, you need to re-configure the agent, or manually change the script used to add a machine to the Deployment Group. Starting from the standard registration script, the first step is removing the --runasservice switch from it.

Once you run the configuration script, the process will guide you through configuring the agent for interactive use. Set it to auto-start so you get an unattended experience when rebooting the machine, while still being able to run interactive sessions on it.
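
For reference, a registration command without the service switch could look something like this – the group, project, agent and URL names are all placeholders, and the switches vary with the agent version:

.\config.cmd --deploymentgroup --deploymentgroupname "UI Test Machines" --projectname "MyProject" --url https://tfs.contoso.local/tfs --auth Integrated --agent UITEST01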

Finally, I always recommend using the VSTest Platform Installer task to make sure you have a consistent environment to run your tests from, and then pointing the Visual Studio Test task at the tools it installs.

Wednesday, 30 May 2018

A story of high availability with SQL Server AlwaysOn and TFS

A few weeks ago something happened on our TFS instance - we discovered that DBCC CHECKDB under certain conditions can mark a database as corrupted.

Long story short, this was due to a peculiar condition related to a high volume of transactions during that operation – not something you see every day. Microsoft Support was really good at helping us get back to normality.

In retrospect, what really struck me was how resilient TFS was thanks to SQL Server AlwaysOn. As you know, I am a huge fan of AlwaysOn because of how transparent it makes High Availability.

For us, maintaining availability meant a simple failover to the other node. Given that we run the Availability Group in Synchronous-Commit Mode (my default choice when it comes to TFS), the newly promoted Primary Replica was already up to date with the latest transaction, so there was no data loss.

Team Foundation Server did not lose a single heartbeat. When things go south like this, if you are doing something during the issue itself or during the failover you will get a JobInitializationError, which is self-explanatory. As this is a transactional system by design, nothing is left hanging in the balance like with good ol' SourceSafe :)

Of course we were in limited availability while we were troubleshooting and fixing this problem (always change the Failover Mode to Manual when you are doing so), but there was no downtime.

Also, talking about recovery: at the end of the day we had to restore backups on the Secondary Replica to get back to proper synchronisation. Again, a bit tedious and time-consuming given the sizes involved, but it was flawless.

Tuesday, 22 May 2018

Small details carrying a huge value

I was reading this post on the Microsoft Premier Developer blog, and it was a nice throwback to past times when I had to deal with this type of request because of the existing process in place.

I also thought about how easy customising a process has become with VSTS compared to TFS, and the first thing that sprang to mind was to pair this up with the Board Styling options.

This will cause cards that are not assigned to a single individual but are assigned to a group to be highlighted on the board.

There can be so many reasons why a team might choose to do this – and it does not just apply to product development. Think about situations where telemetry operators escalate events, or where tickets are integrated into the backlog.

Why am I focusing on such small details? Well, this is the kind of personalisation (I cannot really call them customisations 😊) that enables cross-role consumption of the stack.
It does not have to be anything extremely complicated, but whenever you can bring an existing process inside the tool in a frictionless manner you are already paving the way for a better reception and adoption of the tool itself.

Friday, 11 May 2018

Elevate your telemetry from silo to valuable data source


I am going to speak at DevOpsDays Kiel next week about telemetry, and I was thinking about how much Application Insights has evolved over the last few years.

Even without mentioning the awesome Application Insights Analytics, I was really pleased with how easy it is to bring valuable data to the forefront.

For example, this has been there pretty much since the inception.

It’s great, but it is kind of buried in the detailed information provided. What I really enjoyed, on the other hand, was this.

This is an organic and straightforward way to escalate a single piece of information. Why, you ask?
Well, because the previous screen is a summary, with a single button named Operations in a pane called Take Action.

So, from a UX point of view, it comes naturally to dig into the details of a single request raising an exception and promote that information to an actionable backlog item.

A development team does not (usually) need quantity; it needs quality in order to fix the problems raised by telemetry. It is the natural evolution of telemetry systems to integrate with DevOps stacks in an effortless way – the real challenge is doing so without being excessively verbose, while still providing the much needed value to close the loop.

Monday, 30 April 2018

Review – Professional Visual Studio 2017

I recently gave this book a go, because I feel it is important as a stepping stone for whoever is approaching the IDE – remember, there is always somebody who is just starting out 😊


It does its job well, to be fair: it covers all the features pretty thoroughly and it is fairly up-to-date with the RTM release of the IDE. The only problem I find with it is that the Visual Studio release cadence is moving pretty fast, so it will always be a matter of playing catch-up with the team. There is so much being added and updated on a regular basis that it is almost inevitable for a book like this to fall behind.

Regardless of that, there is also a nice introduction to the Continuous Delivery Tools for Visual Studio, which happens to be a good starting point for the DevOps and CD pipeline tooling as well – including Code Analysis.

Visual Studio Team Services is mentioned at the end, instead of Team Foundation Server. It is a change that makes sense, as it is extremely quick and easy to get started there instead of installing TFS.

Wednesday, 11 April 2018

On-premise Blue-Green deployments with TFS 2018 Update 2


Like I said in the previous post, modern deployment patterns are not exclusive to the OTT providers, and they do not require cloud technologies.

After Rolling Deployments, another very common pattern you might want to tackle is Blue-Green deployments. In a nutshell, it means having two identical environments to use in order to deploy new versions of your application with minimal downtime.

It is a bit harder compared to a Rolling Deployment – mainly because there could be countless variations on the technical details, depending on how your environment is composed, but let’s try to jot down a skeleton version of a Blue-Green pipeline you can use.

So in my case, I am using the same application I used in the previous post, with an additional environment (which happens to be a cluster, just to keep things a little more realistic). Let me walk through what the pipeline looks like.

Each environment follows the same process, described below.

Let’s say we are running all of this against the blue cluster, which is currently production.

The first phase is an Agent Phase – it swaps production traffic from the blue cluster to the green one. 

I want it to be independent of the environments so that it can deal with the router that manages traffic between the two clusters. As I do not have an appliance or anything special in front of them (I am just playing with CNAME records in my lab domain), this ensures the process is not tied to any particular machine.
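
As I am only swapping a CNAME record, a minimal sketch of what that phase could run is below – zone, alias and target names are placeholders, and the DnsServer cmdlets are assumed to run on (or be remoted to) the DNS server hosting the zone:

# point the 'app' alias at the green cluster
Remove-DnsServerResourceRecord -ZoneName "lab.local" -RRType CName -Name "app" -Force
Add-DnsServerResourceRecordCName -ZoneName "lab.local" -Name "app" -HostNameAlias "green-cluster.lab.local"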

Moreover, this pipeline is designed to be used just after everything is deemed production-ready, so if it fails it is not meant to be run again without a hitch.

The reason behind this choice is that I wanted to share a general idea of how to do this on-premise, and there are so many permutations of what you might need to do or what could go wrong that an example with all the possible fail-safes in place would have been way too complex.

Up next, the Rolling Deployment we saw in the last post for both nodes of the blue cluster, one at a time.

Then, even if you are running this for a production application you still need to make sure your smoke tests are passing. This is literally the last line of defense before the switch.

Finally, a warm-up script ensures that my application will respond correctly once my users hit it.
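
A warm-up can be as simple as hitting the home page and failing the step if it does not come back healthy – the URL below is a placeholder:

# hit the freshly deployed application once; a non-successful response fails the step
$response = Invoke-WebRequest -Uri "http://blue-cluster.lab.local/" -UseBasicParsing
if ($response.StatusCode -ne 200) { throw "Warm-up failed with status $($response.StatusCode)" }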

Now the magic happens: as soon as you move to the second Environment, traffic is seamlessly switched back to the blue cluster (which now runs v2 and is warm enough for production traffic) while the whole process repeats against the green cluster.

Of course, there are some things to consider. The first one is that this pipeline is not designed to be a commit-to-production pipeline: there is no backup mechanism in it and no revert process if one environment fails (this sits alongside the fact that you should already have the pipelines defined in the previous post, though 😊).

You want to use approvals to manage the switch from green to blue, so that you only go ahead once everything has been checked.

Finally (and this is quite important), your application must be able to cope with the environment change – it should be message-based, or stateless. Traditional stateful applications can have problems with it, which can be mitigated with message queues for example, so we are back to square one 😊

Wednesday, 4 April 2018

On-premise rolling deployments with TFS 2018 Update 2


Team Foundation Server delivers – as usual – the periodic snapshot of VSTS goodness on-premise.
One particular feature I am really happy landed in our datacentres is Deployment Groups. With them you can target the sets of machines you are going to deploy your applications to.

It is really amazing because it enables scenarios like Rolling Deployments for your existing applications running on-premise. These patterns are not exclusive to the big boys!

For example, I am targeting a two-node cluster with a very simple ASP.NET MVC application (running on the full .NET Framework, so no .NET Core or anything that fancy – pretty much the run-of-the-mill internal application you might find in any company).

I am targeting one server at a time – it comes as a simple option, but it is crucial here, as you could otherwise run the deployment on all the nodes in parallel.

Then it is fairly straightforward: stop the node draining all the connections (this is quite important), stop the website, deploy the package via the trusted MSDeploy, restart the website and re-join the node.

To handle the cluster nodes, you can easily use the NLB PowerShell cmdlets:
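
A minimal sketch of what those calls could look like on each node – the drain timeout is arbitrary and the NetworkLoadBalancingClusters module is assumed to be available there:

Import-Module NetworkLoadBalancingClusters
# drain the existing connections and take the local node out of the NLB cluster
Stop-NlbClusterNode -Drain -Timeout 60
# ...stop the website, deploy via MSDeploy, restart the website...
# then bring the node back into rotation
Start-NlbClusterNode
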
This Release definition is going to run against each of the nodes, making individual node management very easy. Of course it is just a starting point and I am simplifying some of the situations you might find, but all the foundations are there!

Thursday, 29 March 2018

Selective branch indexing with TFS and the Search Server


Team Foundation Server’s Search Server can be tough. I mean, it works really well but it takes a certain degree of planning, otherwise it can easily sink your instance’s performance.

I’ve mentioned in the past that there are scripts from the Product Team that help with the daily administration of the server; they are still the number one choice IMHO from an admin point of view.

But it’s not all command-line. For example, if you look into the Version Control settings of your Team Project, you will discover that each Git repository has a nice setting for selective indexing.

This makes a lot of sense: you can index only the common branches and make rational use of your Elasticsearch instance.

There is an excellent reason for that: you don’t want *all of your branches* to be searchable. They would feature a ridiculous amount of duplicates, hence you would be wasting resources.

Wednesday, 28 March 2018

Something strange with SQL Server AlwaysOn Automatic Seeding and TFS

I ran into this strange issue the other day in my homelab, and it is worth sharing: I was trying to set up a highly available Team Foundation Server data tier with AlwaysOn Automatic Seeding instead of the usual backup-and-restore process, but the Tfs_Configuration database (for some reason) was not cooperating.

Automatic seeding of availability database 'Tfs_Configuration' in availability group 'TFSAG' failed with an unrecoverable error. Correct the problem, then issue an ALTER AVAILABILITY GROUP command to set SEEDING_MODE = AUTOMATIC on the replica to restart seeding.

We are talking about a plain, empty instance, so... it was a bit of a needle in a haystack!

Let's take a step back: SQL Server AlwaysOn Automatic Seeding is a feature of SQL Server 2016 and above that syncs up a database in an Availability Group without leveraging backup and restore. This is a lifesaver in certain situations, as you can avoid the computational load of a backup and of a restore that might take a long time.

There are some constraints – above all, the instances making up the Availability Group must be *identical*. Yes, identical in everything, including the paths used by SQL Server. It is a very cloud-first approach at the end of the day, where you have identical, commodity resources at your disposal and your actual goal is to provide a frictionless experience to whoever is going to consume the service you offer.

So cool, right? Still, for some reason, my Configuration database didn't stream from Primary to Secondary replica. I checked the DMV, and I got an obscure 1200 failed_state error - Internal Error.

The first thing I did (as the instances really are identical – they were provisioned the day before) was to check that I was on the latest CU, as there are fixes available for Automatic Seeding. Check.

I had a look at the script used by the wizard to add the databases to the Availability Group – nothing too fancy, to be fair. Reading around, it seems there is still a chance that things might suddenly break, so I took another path.

Yes, a Full Backup (taken with the TFS Administration Console, no less) was supposed to be enough to enable Automatic Seeding, as it starts the recovery chain. Would another Transaction Log backup hurt? I don't think so.

After taking the faulty database off the Availability Group, I ran the speedy Transaction Log backup and added the database back into the Availability Group with the script. Guess what: it worked! And my new TFS instance is up and running.
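
For reference, the same dance can be scripted with the SqlServer module cmdlets – a rough sketch where instance, Availability Group and path names are placeholders:

# remove the database from the Availability Group on the primary replica
Remove-SqlAvailabilityDatabase -Path "SQLSERVER:\SQL\PRIMARY01\DEFAULT\AvailabilityGroups\TFSAG\AvailabilityDatabases\Tfs_Configuration"
# take a quick transaction log backup to keep the recovery chain going
Backup-SqlDatabase -ServerInstance "PRIMARY01" -Database "Tfs_Configuration" -BackupAction Log -BackupFile "\\backups\Tfs_Configuration.trn"
# add the database back, letting automatic seeding do its job
Add-SqlAvailabilityDatabase -Path "SQLSERVER:\SQL\PRIMARY01\DEFAULT\AvailabilityGroups\TFSAG" -Database "Tfs_Configuration"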

Of course this is totally transparent for TFS as usual, as the configuration wizard is smart enough to set the right connection string from the beginning. But you still need to make sure the Availability Group is correctly set up, otherwise at the first failover you will be left with nothing.

Wednesday, 14 March 2018

How Team Foundation Server saves you from a potential mess with IIS and SSL


This is an example of how TFS is robust enough to prevent you from making silly and potentially costly (in terms of time) mistakes.
Let’s say you are configuring a new instance, and you just got your SSL certificate installed on the machine. So you select the HTTPS and HTTP option in the configuration settings, and you select your certificate. And you get an error.

The error message contains a link, and clicking on it creates the correct bindings for that certificate. Fair enough.
But the Public URL is not what you like, so you change it to something else. And you go ahead. The result?

The readiness checks prevent you from doing this. They also check for SNI validity amongst other things, something that comes in handy when you deal with Chrome.

Thursday, 1 March 2018

Not all is lost if your cube gets corrupted…

I ran into this very odd situation yesterday with the Reporting capability of my production TFS instance – I realised the Incremental Analysis Database Sync job and the Optimize Databases job had been running for hours!


I stopped the Incremental Analysis, and the Optimize Databases job completed successfully. Fine.
But – for whatever reason – my SSAS cube got corrupted! I couldn’t even connect to the Analysis Engine with SSMS. I also found errors in the Event Viewer pointing at a corrupted cube:


Errors in the metadata manager. An error occurred when loading the 'Team System' cube, from the file, '\\?\<path>\Tfs_Analysis.0.db\Team System.3330.cub.xml'.
Errors in the metadata manager. An error occurred when loading the 'Test Configuration' dimension, from the file, '\\?\<path>\Tfs_Analysis.0.db\Configuration.254.dim.xml'.

Now, what to do? It looked like a full-blown rebuild was in order, and it is a costly operation, given that the rebuild drops both the data warehouse and the SSAS cube, rebuilds the warehouse with data from the TFS databases and then rebuilds the cube.

It is not like being without source code or Work Items, but still… it is an outage, and it is painful to swallow.

Now, in this case the data warehouse was perfectly healthy – the report showed an update age of just a few minutes. So all the raw data in this case was fine, and all you need to do is rebuild the way you look at it.

The SSAS cube is just a way of looking at the data warehouse. If your warehouse is fine, just wait for the next scheduled Incremental Analysis Database Sync job to run: it will recreate the cube (turning that run into a Full Analysis Database Sync rather than an Incremental one) without going through the full rebuild.


Why didn’t I process this myself by using the WarehouseControlService? Simply because the less you mess with the scheduled jobs the better it is 😊 hiccups happen, but the system is robust enough to withstand such problems and pretty much heal itself once the stumbling block is removed.

Tuesday, 27 February 2018

Containers and DevOps: where are we? Some of my thoughts.


I spent most of February talking about Containers from the DevOps perspective – why, you (might) ask? Well, the reason is pretty straightforward: if you are a newbie trying to find resources on containerisation technologies (so not just Docker and Kubernetes!) you will mostly find developer-focused articles, closely followed by C-suite overviews.

Hey, it is totally understandable, don’t get me wrong. They are the hip and cool technology to be aware of in 2018, the market has moved and – admittedly – they are a brilliant idea.
The problem is that I believe (and this is a personal opinion) this drive to understand what containers are is skewing the market. Containers are not “like VMs, but better/cooler”; they exist to serve a business purpose.

This business purpose is two-sided. One side appeals to the technical person – you can run the same bits (meaning code, configuration, and toolset) everywhere with some resource tuning, and you are basically pushing Infrastructure as Code (another buzzword, but so 2017…) to the limit. And this is very cool.

But the other side of the business purpose – and the most important one IMHO – is that you can literally change how complex applications are deployed, maximising resource usage and enabling scenarios (blue-green is the first one I can think of) that were previously exclusive to the OTT providers.


This is what really matters.

And to be totally fair with you, containerising an application is not that hard – but it won’t magically improve: it would remain a legacy application running in a container instead of a VM or a physical host.

On the other hand, an application which actually embraces the philosophy behind containers has a much better chance of bringing a tangible benefit to the company, as it naturally adopts many DevOps best practices.

Yes, it is unavoidable – wherever you see Containers you cannot avoid DevOps.
There is only one mistake you should never make: Containers are not DevOps. DevOps brings together practices and concepts that fit perfectly when using Containers; it is not enabled by Containers. You can do DevOps with anything, including Containers.

The two together are a match made in heaven. Just don’t forget they are not the same thing.

Friday, 16 February 2018

A quick look at the new SonarQube tasks for TFS and VSTS

Last week SonarSource released a new version of their tasks for TFS and VSTS, with a couple of very welcome additions.

Up to v3, we basically had to do everything manually – especially passing parameters with the /d:… switch.

v4 introduces a context-aware switch where you can specify what you are using for your build:

 image

The Use standalone scanner option is quite interesting, as it guides you towards providing a .properties file.

Also, gone are the days of using /d:… inline. There is a very handy Additional Properties textbox, parsed line by line, which makes overriding properties very easy.

The tasks are also now split into Prepare Analysis, Run Code Analysis and Publish Analysis Result, allowing a more streamlined design of your Build Definition.

Thursday, 8 February 2018

Build Agents losing connection after switch to HTTPS

A quick one I am dealing with these days – if you switch the Public URL of your Team Foundation Server to HTTPS, you might see your Build Agents lose their connection with the server.

This usually happens because of a known bug in TFS: an OAuth token isn’t registered, so all the authentication tokens on the agents expire.

Of course YMMV, so always double check with Support before running a Stored Procedure on your production instance.

If you happen to run into this problem, you can mitigate it by reverting the HTTPS switch and changing the Public URL back to the HTTP version. Doing that will re-establish the connection between the server and the agents.

Monday, 22 January 2018

Tips on dealing with WinRM and remoting using the Test Agent

Despite the push we’ve seen in the last few years, the Hosted Build Service might not be the right product for you for whatever reason.

Then, if you are in a situation where your agents aren’t running in the same domain as Team Foundation Server and you want to use the Test Agent, you really risk opening Pandora’s box, courtesy of WinRM and PowerShell remoting.

And to be completely clear – I have nothing against them 😊 the only downside is that they need to be approached in the right way, otherwise the can-of-worms effect is just around the corner.

First and foremost, remember that whenever you target a machine for Test Agent deployment you only need to consider the Build Agent-Test Agent relationship. All the errors you will get are going to be from the Test box, not the build box.

So when you need to configure WinRM, the Test box is the machine that is going to be accepting the connections. While it sounds straightforward, sometimes things happen and one is tempted to look at the Build box first: don’t.

Also, if you really want to use HTTP and WinRM, remember that this is the trickiest combination – so think twice before going down that route!

Then, in terms of errors – you will likely face WinRM errors of all sorts.

If you are outside a domain then remember Shadow Accounts – they are the only way to keep identity issues to a minimum. You’ll also need to set the TrustedHosts value to include the machines pushing the agent.
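
Setting that value is a one-liner from an elevated PowerShell prompt – the machine name is a placeholder, and keep in mind that this overwrites any existing entry:

Set-Item WSMan:\localhost\Client\TrustedHosts -Value "BUILDAGENT01" -Force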

Then this:

image

Remember that passwords need to match, and that mixing users at setup time isn’t really a good idea if you are going down the workgroup/non-trusted domain route.

Always triple-check passwords, and I recommend using the same account for both provisioning and execution, at least as a baseline. This will make sure you have a safety net in case things don’t pan out as expected.

Finally, there is an error that really puzzles me – it is actually an aggregated exception.

Look at UAC and the execution context for this one – it always happens when something that is supposed to run elevated is not being run as Administrator. It drives me mad every time.

Wednesday, 10 January 2018

Don’t overlook the details during a TFS outage

Two weeks ago I dealt with a head-scratching outage. A few minutes of downtime, for a very stupid reason.

So, we are starting in a situation of total outage. All the services which rely on the production AlwaysOn Availability Group cannot connect to the server. People start screaming, emails flow in at a rate of tens-per-second… well it wasn't that bad, but you get the idea 😊 outages are always annoying.

So off we go with the usual stuff – the TFS Management Console does not load any data from the Data Tier, so the first port of call is checking the database servers.

Which are humming along totally fine. What the hell?! The network stack works as expected, and I can ping all the machines involved!

When checking the database servers, I can see that the Availability Group is totally fine – everything is green, synchronised and with no issues. While this is very good on its own (no backups to restore, nothing to sweat too much about), it still does not explain why the Application Tier cannot talk to the Data Tier.

Then the awakening – whenever I try to connect to the AlwaysOn Listener I get a network error, while going directly to the database server works without problems. There it is!
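
For a quick check of the same symptom, something along these lines does the trick – the listener and node names are placeholders:

# the Listener does not answer on the SQL port...
Test-NetConnection -ComputerName "tfs-listener.contoso.local" -Port 1433
# ...while the node behind it does
Test-NetConnection -ComputerName "sqlnode01.contoso.local" -Port 1433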

Indeed, pinging the Listener does not work. But why? All the cluster resources were green and online, yet for some reason the affected resource failed to perform its duties.


Given that all the other moving pieces were perfectly fine, a manual AlwaysOn failover solved the problem. The lesson learned here is that in a complex architecture there is always something unnoticeable but critical – it’s like breaking a malleolus.