Windows Agents and Failover – Debunking the Myth!

The myth: “If the primary Management Server is down, the Windows agents will automatically fail over to any Management Server in the Resource Pool.”

It’s been 6 years since the release of SCOM 2012, and yet the failover process in SCOM is still widely misunderstood. SCOM 2012 introduced the concept of “Resource Pools”, essentially replacing and enhancing the previous “Root Management Server” concept, and Resource Pools remain a common source of confusion.

Why was the concept of Resource Pools introduced? For failover? Sure, but probably not in the way you are thinking. I talk very frequently to other SCOMers in person and online, and I often find that their understanding of Resource Pools is not very accurate. So I decided to write a two-part blog explaining the failover process in SCOM: one part for the Windows agents and the other for Unix/Linux and network agents.

I will talk about the Windows Agents failover part here, and my friend Stoyan Chalakov was generous enough to agree to write on the U/L and networking part. So, let’s get started!

Before we jump into the actual failover process, let’s briefly recap what Resource Pools are and what they do.

Basically, the concept of Resource Pools was introduced to eliminate the Root Management Server (RMS) as the single point of failure. Up to and including SCOM 2007, the RMS was the boss and the other Management Servers (MS) sat under it in the management group hierarchy. Many critical workflows were specifically targeted at the RMS, so there was a risk of your SCOM being paralyzed if the RMS went down. On top of that, the only way to make the RMS highly available was to cluster it, which was complex to set up and maintain.

So, starting with SCOM 2012, Microsoft introduced Resource Pools along with the idea that all Management Servers are peers rather than members of a hierarchy. That simplified many things, and the workflows that used to run on the RMS now run on the members of the Resource Pools.

When you install SCOM, you get 3 default Resource Pools out of the box: the “All Management Servers Resource Pool”, which handles most of the legacy RMS workflows; the “Notifications Resource Pool”, which handles notifications (the alert subscription service); and the “AD Assignment Resource Pool”, which handles AD Integration.
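
If you want to see which pools exist in your own management group and who their members are, something like the following should do. Treat it as a rough sketch: it assumes you run it in the Operations Manager Shell (or have the OperationsManager module loaded with a management group connection already open), and the Pool and Members column names are just labels I picked for the output.

```powershell
# Assumption: the OperationsManager module is installed on this machine
# and a management group connection is already available.
Import-Module OperationsManager

# List every Resource Pool together with the health services that are its members.
Get-SCOMResourcePool | ForEach-Object {
    [pscustomobject]@{
        Pool    = $_.DisplayName
        Members = ($_.Members | ForEach-Object { $_.DisplayName }) -join ', '
    }
}
```

Note that you will only see Management Servers (and possibly Gateways) in the Members column, never Windows agents.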

Now, going into much detail on Resource Pools is beyond the scope of this blog, but there are a couple of very good blogs out there that discuss them in great detail. The one we’ll discuss here is the “All Management Servers Resource Pool”, and specifically what it DOES NOT do.

Some reading material on Resource Pools:
Understanding SCOM Resource Pools

Resource pool design considerations

OpsMgr (#SCOM) Resources Pools–What they do not do [#SYSCTR]

Now, coming back to failover: I’m sure most of you have read or heard that Resource Pools provide failover and high availability in SCOM. Which is true. But you may also be thinking that Resource Pools (notably the All Management Servers Resource Pool) provide failover for your Windows agents. This is simply not true. In almost all of the blogs, and even in the official Microsoft documentation, there is a line somewhere in the Resource Pools material that says “Windows agents do not report to resource pools”, and that’s it. No further explanation, no further discussion at all. That is why it is often just skimmed over or simply forgotten.

So, what does “Windows agents do not report to resource pools” actually mean?

Let’s take an example setup:

3 Management Servers: MS1, MS2, MS3

2 Gateway Servers: GW1 (reports to MS1) and GW2 (reports to MS2)

As the name suggests, we have all the MS in the “All Management Servers Resource Pool”.

Now let’s look at how failover takes place should an MS or a GW go down.

Case 1: Management Server goes down –

Let’s say it’s MS3 that failed. All the agents reporting to MS3 RANDOMLY fail over to either MS1 or MS2 (for a successful failover, of course, the agents need port 5723 open to all MS). This is an out-of-the-box feature of SCOM and does not require you to set up AD Integration. The process is random by default, but you CAN configure which Management Server you want the agents to fail over to, using PowerShell:

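# Agents whose primary Management Server is MS3 get MS1 as their explicit failover server
$pri = Get-SCOMManagementServer -Name "MS3"
$sec = Get-SCOMManagementServer -Name "MS1"
$agents = Get-SCOMAgent | where {$_.PrimaryManagementServerName -eq $pri.Name}
# -PrimaryServer and -FailoverServer belong to separate parameter sets, so only the failover server is set here
Set-SCOMParentManagementServer -Agent $agents -FailoverServer $sec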

Now, once you run this, the agents will fail over to the Management Server YOU want instead of failing over randomly. This is NOT affected by which Management Servers you have in which Resource Pool. Let’s say I removed one (or all) of the Management Servers from the All Management Servers Resource Pool: this behavior would NOT be affected (don’t do that though, it’ll cause other problems!). The agents will still fail over to any Management Server in the Management Group.

When you install a Windows agent, you configure it to report to a particular Management Server (or GW) only. The Resource Pool simply doesn’t play a role here.

In conclusion, Windows agents will fail over to any available Management Server RANDOMLY (unless explicitly configured), and this behavior is NOT affected by any Resource Pool (default or custom).
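
To check what your agents are actually configured with, you can list each agent’s primary Management Server and any explicitly assigned failover servers. This is a minimal sketch that assumes an Operations Manager Shell session with a management group connection already open; the Agent, Primary and Failover column names are my own labels.

```powershell
# Report the primary and the explicitly configured failover servers for every Windows agent.
# Assumption: run in the Operations Manager Shell with a management group connection open.
Get-SCOMAgent | ForEach-Object {
    [pscustomobject]@{
        Agent    = $_.DisplayName
        Primary  = $_.GetPrimaryManagementServer().DisplayName
        Failover = ($_.GetFailoverManagementServers() | ForEach-Object { $_.DisplayName }) -join ', '
    }
}
```

An empty Failover column simply means no explicit failover server has been assigned to that agent. Nowhere in this output will you find a Resource Pool.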

Case 2: Management Server with a GW reporting goes down –

Let’s say MS1 goes down. GW failover is not automatic: unlike agents, Gateways DO NOT fail over randomly to any other available MS. You need to explicitly configure the GW to fail over to MS2 or MS3, using PowerShell.

# GW1 reports to MS1; MS2 becomes its failover Management Server
$primaryMS = Get-SCOMManagementServer | where {$_.Name -match "MS1"}
$failoverMS = Get-SCOMManagementServer | where {$_.Name -match "MS2"}
# Pick only GW1, not every Gateway in the management group
$gatewayMS = Get-SCOMManagementServer | where {$_.IsGateway -eq $true -and $_.Name -match "GW1"}
Set-SCOMParentManagementServer -GatewayServer $gatewayMS -PrimaryServer $primaryMS
Set-SCOMParentManagementServer -GatewayServer $gatewayMS -FailoverServer $failoverMS
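
To verify the result, you can read the configuration back from each Gateway. Again, this is only a sketch under the same assumption (an Operations Manager Shell session connected to the management group), and the column names are labels I made up for the output.

```powershell
# Show the primary and failover Management Servers configured for each Gateway.
# Assumption: run in the Operations Manager Shell with a management group connection open.
Get-SCOMManagementServer | Where-Object { $_.IsGateway } | ForEach-Object {
    [pscustomobject]@{
        Gateway  = $_.DisplayName
        Primary  = $_.GetPrimaryManagementServer().DisplayName
        Failover = ($_.GetFailoverManagementServers() | ForEach-Object { $_.DisplayName }) -join ', '
    }
}
```

A Gateway with an empty Failover column has nowhere to go if its primary Management Server goes down.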

Case 3: The GW server goes down –

Let’s say GW1 goes down. Again, the agents will NOT automatically fail over to another GW server. You will need to configure the agents to use another Gateway (GW2) as their failover server, using a PowerShell script.

# Agents reporting to "GW1" - failover to "GW2"
$primaryMS = Get-SCOMManagementServer | where {$_.Name -eq "GW1"}
$failoverMS = Get-SCOMManagementServer | where {$_.Name -eq "GW2"}
$agent = Get-SCOMAgent | where {$_.PrimaryManagementServerName -eq "GW1"}
Set-SCOMParentManagementServer -Agent $agent -PrimaryServer $primaryMS
Set-SCOMParentManagementServer -Agent $agent -FailoverServer $failoverMS

Note: Scripts are for example only. You may need to modify them according to your requirements.

Now you’re probably thinking: how come people say that Resource Pools are used for failover and high availability, then? Fair question! The answer is that they do provide automatic failover, but for the workflows running on the health services of the members of the Resource Pool. Windows agents run their workflows on their own local health services; hence they have no relationship with Resource Pools.

In other words, the failover and high availability that Resource Pools provide is for the Management Servers themselves, and not for the Windows agents reporting to them.

However, this is not the case with Unix/Linux agents. I will not go into details of it here though, because Stoyan will have an entire blog dedicated to this in part 2, so I’ll let him dive into the details. 😉

Hope this clarifies some misunderstandings and helps someone out there plan their deployment correctly!

Cheers!
