Wednesday, 11 April 2018

On-premise Blue-Green deployments with TFS 2018 Update 2

As I said in the previous post, modern deployment patterns are not exclusive to the OTT providers, and they do not require cloud technologies.

After Rolling Deployments, another very common pattern you might want to tackle is Blue-Green deployments. In a nutshell, it means having two identical environments to use in order to deploy new versions of your application with minimal downtime.

It is a bit harder compared to a Rolling Deployment – mainly because there could be countless variations on the technical details, depending on how your environment is composed, but let’s try to jot down a skeleton version of a Blue-Green pipeline you can use.

So in my case, I am using the same application I used in the previous post, with an additional environment (which happens to be a cluster, just to keep things a little more realistic). This is what my pipeline looks like:

Each environment follows this process:

Let’s say we are running all of this against the blue cluster, which is currently production.

The first phase is an Agent Phase – it swaps production traffic from the blue cluster to the green one. 

I want it to be independent from the environments so that it can deal with the router that manages traffic between the two clusters. As I do not have an appliance or anything special in front of them (I am just playing with CNAME records in my lab domain) this ensures the process is not tied to any machine.

Moreover, this pipeline is designed to be used only after everything is deemed production-ready, so if it fails, do not expect to simply run it again without a hitch.

The reason behind this choice is that I wanted to share a general idea of how to do this on-premise, and there might be so many permutations of what you might need to do or what could go wrong that my example with all the possible fail safes in place would have been way too complex.

Up next, the Rolling Deployment we saw in the last post for both nodes of the blue cluster, one at a time.

Then, even if you are running this for a production application you still need to make sure your smoke tests are passing. This is literally the last line of defense before the switch.

Finally, a warm-up script ensures that my application responds correctly by the time my users reach it.

Now the magic happens: as soon as you move to the second Environment, traffic is switched back to the blue cluster (which is now running v2, and warm enough for production traffic) in a seamless way while the whole process runs against the green cluster.
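To make the sequence easier to follow, here is a minimal Python sketch of the four phases each environment goes through. It is only a model of the flow, not what the Release definition actually runs: the `Cluster` and `Router` classes, the `run_environment` function and the node names are all hypothetical, and the real traffic switch (CNAME records in my lab) is reduced to a single attribute.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cluster:
    name: str
    nodes: List[str]
    version: str
    log: List[str] = field(default_factory=list)


@dataclass
class Router:
    # Stand-in for whatever steers production traffic (CNAME records in the post).
    live: str

    def switch_to(self, cluster: Cluster) -> None:
        self.live = cluster.name


def run_environment(router: Router, target: Cluster, other: Cluster,
                    new_version: str,
                    smoke_test: Callable[[Cluster], bool]) -> None:
    """Run one environment of the pipeline against `target`."""
    # 1. Agent phase: move production traffic away from the cluster being updated.
    router.switch_to(other)
    # 2. Rolling deployment, one node at a time.
    for node in target.nodes:
        target.log.append(f"deploy {new_version} to {node}")
    target.version = new_version
    # 3. Smoke tests: the last line of defence before this cluster takes traffic again.
    if not smoke_test(target):
        raise RuntimeError(f"smoke tests failed on {target.name}")
    # 4. Warm-up, so the first real users do not pay the start-up cost.
    target.log.append("warm-up")


if __name__ == "__main__":
    blue = Cluster("blue", ["blue-1", "blue-2"], "v1")
    green = Cluster("green", ["green-1", "green-2"], "v1")
    router = Router(live="blue")
    # First environment: update blue while green serves traffic.
    run_environment(router, blue, green, "v2", lambda c: True)
    # Second environment: traffic goes back to blue, green gets updated.
    run_environment(router, green, blue, "v2", lambda c: True)
    print(router.live, blue.version, green.version)
```

Running the second environment is what flips traffic back to the freshly updated cluster, which is why that transition is the natural place for a manual approval.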

Of course, there are some things to consider. The first one is that this pipeline is not designed to be a commit-to-production pipeline: there is no backup mechanism in it and no revert process if one environment fails (this goes along with the fact that you should already have the pipelines defined in the previous post though 😊).

You want to use approvals to manage the switch from green to blue, so that you can go ahead only once everything has been checked.

Finally (and this is quite important), your application must be able to cope with the environment change – it should be message-based, or stateless. Traditional stateful applications can have problems with it, which can be mitigated with message queues for example, so we are back to square one 😊

Wednesday, 4 April 2018

On-premise rolling deployments with TFS 2018 Update 2

Team Foundation Server delivers – as usual – the periodic snapshot of VSTS goodness on-premise.
One particular feature I am really happy has landed in our datacentres is Deployment Groups. With it you can target the sets of machines you are going to deploy your applications to.

It is really amazing because it enables scenarios like Rolling Deployments for your existing applications running on-premise. These patterns are not exclusive to the big boys!

For example, I am targeting a two-node cluster with a very simple ASP.NET MVC application (running on the full .NET Framework, so no .NET Core or anything that fancy, just the run-of-the-mill internal application you might find in pretty much any company) like this:

I am targeting one server at a time – it is just a simple option, but it is crucial, as otherwise the nodes could be done in parallel.

Then it is fairly straightforward: stop the node, draining all the connections (this is quite important), stop the website, deploy the package via the trusted MSDeploy, restart the website and re-join the node.

To handle the cluster nodes, you can easily use the NLB PowerShell cmdlets:

This Release definition is going to run against each of the nodes, making individual node management very easy. Of course it is just a starting point and I am simplifying some of the situations you might find, but all the foundation is there!
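The per-node sequence described above can be sketched in Python as follows. This is an illustration of the ordering only: the step names and the `rolling_deploy` function are hypothetical, and `run_step` is a placeholder for whatever really performs each action (the NLB cmdlets and MSDeploy in my case). The sketch also tracks how many nodes are still in rotation, which is the whole point of going one server at a time.

```python
from typing import Callable, List, Tuple

# Order of operations for a single node, as described in the post.
STEPS = (
    "drain_and_stop_node",
    "stop_website",
    "deploy_package",
    "start_website",
    "rejoin_node",
)


def rolling_deploy(nodes: List[str],
                   run_step: Callable[[str, str], None]) -> List[Tuple[str, str, int]]:
    """Run every step on one node before touching the next one.

    Returns (node, step, nodes_still_in_rotation) for each step executed.
    """
    in_rotation = set(nodes)
    history = []
    for node in nodes:
        for step in STEPS:
            run_step(node, step)  # placeholder for the real action
            if step == "drain_and_stop_node":
                in_rotation.discard(node)
            elif step == "rejoin_node":
                in_rotation.add(node)
            history.append((node, step, len(in_rotation)))
    return history


if __name__ == "__main__":
    for entry in rolling_deploy(["node1", "node2"], lambda node, step: None):
        print(entry)
```

With two nodes, one of them is always left serving traffic; running the same steps in parallel would take both out of rotation at once.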

Thursday, 29 March 2018

Selective branch indexing with TFS and the Search Server

Team Foundation Server’s Search Server can be tough. I mean, it works really well but it takes a certain degree of planning, otherwise it can easily sink your instance’s performance.

I’ve mentioned in the past that there are scripts from the Product Team that help with the daily administration of the server; IMHO they are still the number one choice from an admin point of view.

But it’s not all command-line. For example, if you look into the Version Control settings of your Team Project, you will discover that each Git repository has a nice setting for selective indexing.

This makes a lot of sense: you can index only the common branches and make rational use of your Elasticsearch instance.

There is an excellent reason for that: you don’t want *all of your branches* to be searchable. They will feature a ridiculous amount of duplicates, hence you would be wasting resources.

Wednesday, 28 March 2018

Something strange with SQL Server AlwaysOn Automatic Seeding and TFS

I ran into this strange issue the other day in my homelab, and it is worth sharing: I was trying to set up a highly available Team Foundation Server data tier with AlwaysOn Automatic Seeding instead of the usual backup and restore process, but the TFS_Configuration database (for some reason) was not cooperating.

Automatic seeding of availability database 'Tfs_Configuration' in availability group 'TFSAG' failed with an unrecoverable error. Correct the problem, then issue an ALTER AVAILABILITY GROUP command to set SEEDING_MODE = AUTOMATIC on the replica to restart seeding.

We are talking about a plain, empty instance, so... it was a bit of a needle in a haystack!

Let's take a step back: SQL Server AlwaysOn Automatic Seeding is a feature introduced in SQL Server 2016 that syncs up a database in an Availability Group without leveraging backup and restore. This is a life saver in certain situations, as you can avoid the computational load of a backup and of a restore that might take a long time.

There are some constraints - above all, the instances making up the Availability Group must be *identical*. Yes, identical in everything, including paths used by SQL Server. It is a very cloud-first approach at the end of the day, where you have identical, commodity resources at your disposal and your actual target is to provide a friction-less experience to whoever is going to consume the service you'll offer.

So cool, right? Still, for some reason, my Configuration database didn't stream from Primary to Secondary replica. I checked the DMV, and I got an obscure 1200 failed_state error - Internal Error.

The first thing I did (as the instances are really identical, they were provisioned the day before) was to check that I was on the latest CU, as there are fixes available for Automatic Seeding. Check.

I had a look at the script used by the wizard to add the databases to the Availability Group, nothing too fancy to be fair. Reading around, it seems that there is still a chance that things might suddenly break, so I took another path.

Yes, a Full Backup (taken with the TFS Administration Console nonetheless) was supposed to be enough to enable Automatic Seeding as the recovery chain is started. Would another Transaction Log backup hurt? I don't think so.

After taking the faulty database off the Availability Group, I ran the speedy Transaction Log backup and added the database back in the Availability Group with the script. Guess what, it worked! And my new TFS instance is up-and-running.
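For reference, the three steps above map onto three T-SQL statements run on the primary replica. The small Python helper below only builds the statements as strings so they can be reviewed before being executed; the `reseed_statements` function and the backup path are hypothetical, and the statements are my reconstruction of the sequence rather than the exact script produced by the wizard.

```python
from typing import List


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier."""
    return "[" + name.replace("]", "]]") + "]"


def reseed_statements(availability_group: str, database: str,
                      log_backup_path: str) -> List[str]:
    """Build the statements for: remove from AG, back up the log, add back."""
    ag = quote_name(availability_group)
    db = quote_name(database)
    path = log_backup_path.replace("'", "''")  # escape quotes in the literal
    return [
        # 1. Take the faulty database off the Availability Group.
        f"ALTER AVAILABILITY GROUP {ag} REMOVE DATABASE {db};",
        # 2. Take a Transaction Log backup on top of the existing Full Backup.
        f"BACKUP LOG {db} TO DISK = N'{path}';",
        # 3. Add the database back, which restarts seeding.
        f"ALTER AVAILABILITY GROUP {ag} ADD DATABASE {db};",
    ]


if __name__ == "__main__":
    # The backup path is an assumption made for the example.
    for statement in reseed_statements(
            "TFSAG", "Tfs_Configuration", r"D:\Backup\Tfs_Configuration.trn"):
        print(statement)
```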

Of course this is totally transparent as usual for TFS, as the configuration wizard is smart enough to set the right connection string from the beginning. But you still need to make sure the Availability Group is correctly set, otherwise at the first failover you will be left with nothing.

Wednesday, 14 March 2018

How Team Foundation Server saves you from a potential mess with IIS and SSL

This is an example of how TFS is robust enough to prevent you from making silly and potentially costly (in terms of time) mistakes.
Let’s say you are configuring a new instance, and you just got your SSL certificate installed on the machine. So you select the HTTPS and HTTP option in the configuration settings, and you select your certificate. And you get an error:

Clicking on that link creates the correct bindings for that certificate. Fair enough.
But the Public URL is not what you like, so you change it to something else. And you go ahead. The result?

The readiness checks prevent you from doing this. They also check for SNI validity amongst other things, something that comes in handy when you deal with Chrome.

Thursday, 1 March 2018

Not all is lost if your cube gets corrupted…

I ran into this very odd situation yesterday with the Reporting capability of my production TFS instance – I realised the Incremental Analysis Database Sync job and the Optimize Databases job were running for hours!


I stopped the Incremental Analysis, and the Optimize Databases job completed successfully. Fine.
But – for whatever reason – my SSAS cube got corrupted! I couldn’t even connect to the Analysis Engine with SSMS. I also found errors in the Event Viewer pointing at a corrupted cube:


Errors in the metadata manager. An error occurred when loading the 'Team System' cube, from the file, '\\?\<path>\Tfs_Analysis.0.db\Team System.3330.cub.xml'.
Errors in the metadata manager. An error occurred when loading the 'Test Configuration' dimension, from the file, '\\?\<path>\Tfs_Analysis.0.db\Configuration.254.dim.xml'.

Now, what to do? It looked like a full-blown rebuild was in order, and it is a costly operation, given that the rebuild drops both the data warehouse and the SSAS cube, rebuilds the warehouse with data from the TFS databases and then rebuilds the cube.

It is not like being without source code or Work Items, but still… it is an outage, and it is painful to swallow.

Now, in this case the data warehouse was perfectly healthy – the report showed an update age just a few minutes old. So all the raw data in this case is fine, and all you need to do is to rebuild how you look at this data.

The SSAS cube is just a way of looking at the data warehouse. If your warehouse is fine, just wait for the next scheduled Incremental Analysis Database Sync job to run: it will recreate the cube (thus making the Analysis Database Sync job a Full one rather than an Incremental one) without going through the full rebuild.


Why didn’t I process this myself by using the WarehouseControlService? Simply because the less you mess with the scheduled jobs the better it is 😊 Hiccups happen, but the system is robust enough to withstand such problems and pretty much self-heal once the stumbling block is removed.

Tuesday, 27 February 2018

Containers and DevOps: where are we? Some of my thoughts.

I spent most of February talking about Containers from the DevOps perspective – why, you (might) ask? Well, the reason is pretty straightforward: if you are a newbie and you are trying to find resources on containerisation technologies (not just Docker and Kubernetes then!) you will mostly find developer-focused articles, closely followed by C-suite overviews.

Hey, it is totally understandable, don’t get me wrong. They are the hip and cool technology to be aware of in 2018, the market moved and – admittedly – they are a brilliant idea.
The problem is that I believe (and this is a personal opinion) this drive to understand what containers are is skewing the market. Containers are not “like VMs, but better/cooler”, they exist to serve a business purpose.

This business purpose is twin-faced: one appeals to the technological person – you can run the same bits (meant as code, configuration, and toolset) everywhere with some resource tuning, and you are basically pushing Infrastructure as Code (another buzzword, but so 2017…) to the limit. And this is very cool.

But the other side of the business purpose – and the most important one IMHO – is that you can literally change how complex applications are deployed, maximising resource usage and enabling scenarios (blue-green is the first one I can think about) that were exclusive to the OTT before.

This is what really matters.

And to be totally fair with you, containerising an application is not that hard – but it won’t magically improve: it will remain a legacy application running in a container instead of a VM or a physical host.

On the other hand, an application which actually adopts the philosophy behind containers has more chances to actually bring a tangible benefit to the company as it naturally adopts many best practices from DevOps. 

Yes, it is unavoidable – whenever you see Containers you cannot avoid DevOps.
There is only one mistake you should never make. Containers are not DevOps. DevOps brings together practices and concepts that fit in perfectly when using Containers, but it is not enabled by Containers. You can do DevOps with anything, including Containers.

The two together are a match made in heaven. Just don’t forget they are not the same thing.