
April 05, 2010

Upgrade Domains and Fault Domains in Windows Azure

Recently on the Windows Azure Forum I have seen the question “What is a fault domain?” a couple of times. People ask because of the following statement in the Windows Azure SLA:

For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time.

Upgrade domains are a fairly well-known concept in Windows Azure. Fault domains, however, are something customers have little visibility into, so they may need some clarification.

Fault Domain and Upgrade Domain Definitions

Here are the two simple definitions for fault domain and upgrade domain:

  • A Fault Domain is a physical unit of failure and is closely related to the physical infrastructure in the data centers. In Windows Azure a rack can be considered a fault domain, although there is no 1:1 mapping between fault domains and racks.
    The Windows Azure Fabric is responsible for deploying the instances of your application in different fault domains. Right now the Fabric makes sure that your application uses at least 2 (two) fault domains; depending on capacity and VM availability, it may be spread across more than that.
    Right now you, as a developer, have no direct control over how many fault domains your application will use, but the way you configure it may impact your availability (see below).
  • An Upgrade Domain is a logical unit that determines how a particular service will be upgraded.
    The default number of upgrade domains configured for your application is 5 (five). You can control how many upgrade domains your application uses through the upgradeDomainCount attribute in your service definition file (CSDEF).
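To make the upgrade domain definition more concrete, here is a minimal Python sketch of an in-place upgrade that takes one upgrade domain offline at a time. The round-robin assignment of instances to upgrade domains and the function names are assumptions made for illustration; this is a toy model, not the Fabric's actual algorithm.

```python
def assign_upgrade_domains(instance_count, upgrade_domain_count=5):
    """Assign instances to upgrade domains round-robin (an assumed policy)."""
    domains = [[] for _ in range(upgrade_domain_count)]
    for i in range(instance_count):
        domains[i % upgrade_domain_count].append(f"Instance #{i + 1}")
    # With fewer instances than upgrade domains, some domains stay unused.
    return [d for d in domains if d]


def upgrade_walk(instance_count, upgrade_domain_count=5):
    """Yield (step, offline_instances, still_serving) per upgrade domain."""
    domains = assign_upgrade_domains(instance_count, upgrade_domain_count)
    for step, offline in enumerate(domains, start=1):
        yield step, offline, instance_count - len(offline)


if __name__ == "__main__":
    for step, offline, serving in upgrade_walk(7, upgrade_domain_count=5):
        print(f"step {step}: upgrading {offline}, {serving} still serving")
```

In this model a role with a single instance has nothing left serving while its only upgrade domain is being upgraded, which is one reason to run at least two instances.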

 

How Fault Domains and Upgrade Domains Work

What matters to you as a Windows Azure customer is having your application up and running all the time. Although the infrastructure is quite well abstracted, there are a few things you should be aware of when configuring your application.

I will explain how these two concepts work with a simple example: the Hello World sample application. Hello World has only one web role, and by default it is configured to run only one instance of that role. Such a configuration will not ensure that your application has 99.95% availability! Why? Quite simply, it has only one instance, and a single instance cannot be deployed in two fault domains or two upgrade domains. Hence the SLA states that you need two or more instances in order to get 99.95% availability.

If you change the configuration of Hello World to have 2 (two) instances for the web role and redeploy, your application will be deployed in two fault domains. Here is how the Hello World instances will be deployed with two instances for the web role:

 

 

                      Fault Domain #1     Fault Domain #2
Upgrade Domain #1     Instance #1
Upgrade Domain #2                         Instance #2

 

In this case, if for example Fault Domain #1 fails, Instance #2 (in Fault Domain #2) will continue to be available. Of course, when the Fabric notices that Instance #1 doesn’t respond, it will deploy your application to a new VM in a fault domain other than Fault Domain #2.
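The reasoning above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the Instance tuple and both function names are hypothetical, and the check encodes one reading of the SLA sentence (at least two instances that differ in both fault domain and upgrade domain).

```python
from collections import namedtuple

# Fault and upgrade domains are plain integers in this toy model.
Instance = namedtuple("Instance", "name fault_domain upgrade_domain")


def meets_sla_precondition(instances):
    """True if some pair of instances differs in both fault and upgrade domain."""
    return any(
        a.fault_domain != b.fault_domain and a.upgrade_domain != b.upgrade_domain
        for i, a in enumerate(instances)
        for b in instances[i + 1:]
    )


def survivors(instances, failed_fault_domain):
    """Names of the instances still running after one fault domain fails."""
    return [x.name for x in instances if x.fault_domain != failed_fault_domain]


if __name__ == "__main__":
    single = [Instance("Instance #1", 1, 1)]
    pair = [Instance("Instance #1", 1, 1), Instance("Instance #2", 2, 2)]
    print(meets_sla_precondition(single))  # False
    print(meets_sla_precondition(pair))    # True
    print(survivors(pair, failed_fault_domain=1))  # ['Instance #2']
```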

This is the trivial case, though. A more interesting one is when you have more than two instances for the role. In that case it is up to the Windows Azure Fabric’s algorithms to decide how to deploy your application. Here are two possible options for 3 role instances (assuming the default configuration of 5 upgrade domains):

 

Deployment Option #1:

 

                      Fault Domain #1     Fault Domain #2     Fault Domain #3
Upgrade Domain #1     Instance #1
Upgrade Domain #2                         Instance #2
Upgrade Domain #3                                             Instance #3

 

Deployment Option #2: 

 

Fault domains: Fault Domain #1, Fault Domain #2 (shared by the three instances)

Upgrade Domain #1     Instance #1
Upgrade Domain #2     Instance #2
Upgrade Domain #3     Instance #3

 

 

Although option #1 is preferred, it may not always be possible; it depends heavily on the state of the cluster where the deployment is happening.
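As a rough illustration of how the two options can arise, the Python sketch below places 3 instances when either three or two fault domains are available. The round-robin policy and the function names are assumptions for this example only; the Fabric's real placement logic is not public in this post, and which fault domain Instance #3 shares in the second case is purely an artifact of that assumption.

```python
def place(instance_count, fault_domain_count, upgrade_domain_count=5):
    """Return (instance, upgrade_domain, fault_domain) tuples, all 1-based.

    Both domain indexes are assigned round-robin (an assumed policy).
    """
    return [
        (f"Instance #{i + 1}", i % upgrade_domain_count + 1, i % fault_domain_count + 1)
        for i in range(instance_count)
    ]


def fault_domains_used(placement):
    """Number of distinct fault domains that appear in a placement."""
    return len({fd for _, _, fd in placement})


if __name__ == "__main__":
    for label, available in (("Option #1", 3), ("Option #2", 2)):
        print(label)
        for name, ud, fd in place(3, fault_domain_count=available):
            print(f"  {name}: Upgrade Domain #{ud}, Fault Domain #{fd}")
```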

There are even more possible allocations when you have more than one role (Web, Worker, or a combination) in your application.

Querying Fault Domain and Upgrade Domain Information

The Windows Azure SDK provides properties you can use to query fault domain and upgrade domain information.

The RoleInstance class has a property called FaultDomain that you can read to find out in which fault domain your role instance is running. There is a catch, though: querying the FaultDomain property will return either 1 (one) or 2 (two). This is because you are entitled to only 2 fault domains for your application. If your application is deployed across more fault domains, you will not be able to determine this using the FaultDomain property.

The same class has a property called UpdateDomain that you can read to find out in which upgrade domain your role instance is running (yes, the naming is inconsistent :)).

As mentioned above, you have no control over the number of fault domains, but you can use the upgradeDomainCount attribute of the ServiceDefinition element in the CSDEF file to change the number of upgrade domains.
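FaultDomain and UpdateDomain are properties of the .NET SDK, so the Python sketch below does not call them. It assumes the values have already been collected from each instance (for example, written to a log) as plain tuples, and lays them out as a grid like the ones shown earlier. The function name and the tuple layout are hypothetical. Because FaultDomain reports only two values, the grid can never show more than two fault domain columns for real data.

```python
def render_grid(records, width=20):
    """Render (instance_name, fault_domain, update_domain) records as a text grid."""
    records = list(records)
    fault_domains = sorted({fd for _, fd, _ in records})
    update_domains = sorted({ud for _, _, ud in records})

    # Group instance names by (update domain, fault domain) cell.
    cells = {}
    for name, fd, ud in records:
        cells.setdefault((ud, fd), []).append(name)

    header = "".ljust(width) + "".join(
        f"Fault Domain #{fd}".ljust(width) for fd in fault_domains
    )
    lines = [header.rstrip()]
    for ud in update_domains:
        row = f"Update Domain #{ud}".ljust(width)
        for fd in fault_domains:
            row += ", ".join(cells.get((ud, fd), [])).ljust(width)
        lines.append(row.rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    reported = [("Instance #1", 1, 1), ("Instance #2", 2, 2), ("Instance #3", 1, 3)]
    print(render_grid(reported))
```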

Guidelines for Fault Domains and Upgrade Domains

Here are some general guidelines you can use for configuring your application.

  • You should think about the Fabric algorithms as follows: 1.) When any two role instances require different fault domains, they get placed on nodes (physical machines) in different racks. 2.) When any two role instances are required to be in different upgrade domains, they get placed on different nodes (machines). 3.) If two role instances are in the same fault domain, they may or may not be on the same rack.
  • Always configure your application for redundancy. This means always configure at least 2 (two) instances per role.
  • Always consider your capacity and availability requirements when configuring your application.
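The capacity guideline can be turned into simple arithmetic. The Python sketch below estimates how many instances keep serving when one domain (fault or upgrade) goes offline, and how many instances you would need to keep a required number serving. It assumes instances are spread as evenly as possible across domains, which the post does not guarantee; the function names are hypothetical.

```python
import math


def worst_case_serving(instance_count, domain_count):
    """Instances left when the most heavily loaded domain goes offline."""
    used = min(instance_count, domain_count)
    if used == 0:
        return 0
    largest = math.ceil(instance_count / used)  # size of the fullest domain
    return instance_count - largest


def instances_needed(required_serving, domain_count):
    """Smallest instance count that keeps `required_serving` instances up
    after losing one domain (same even-spread assumption)."""
    if domain_count < 2:
        raise ValueError("at least two domains are needed to survive losing one")
    n = required_serving
    while worst_case_serving(n, domain_count) < required_serving:
        n += 1
    return n


if __name__ == "__main__":
    print(worst_case_serving(1, 2))   # 0: a single instance has no redundancy
    print(instances_needed(4, 2))     # 8: with 2 fault domains, half can be lost
    print(instances_needed(4, 5))     # 5: with 5 upgrade domains, one fifth is offline
```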

 

UPDATE (March 17th 2011)

At the time I wrote this post, I had the following statement (and a picture) at the end of the definitions section above: “Windows Azure Fabric ensures that particular upgrade domain is not within single fault domain (see picture below).” After some discussions with the Fabric team I decided to remove this statement as well as the picture, because it may give the impression that Upgrade Domains are always spread across Fault Domains. Although Upgrade Domain and Fault Domain are distinct concepts, and a single Upgrade Domain can span multiple Fault Domains (just as multiple Upgrade Domains can live in a single Fault Domain), there is no guarantee that all instances from a single Upgrade Domain will be placed in different Fault Domains. For optimization purposes the Fabric places all instances of an Upgrade Domain into a single Fault Domain if resources are available.

 
