Cloud Architecture Pattern: High Availability and Disaster Recovery for Azure Service Bus

A service bus implements a messaging system or middleware between enterprise solution components. If Azure Service Bus is mission-critical for your application, it must be available continuously. An outage can mean partial or complete unavailability. A persisted messaging service can suffer an outage for many reasons, known or unknown; Microsoft lists some of them in its documentation:

  • Unavailability of a particular messaging store.
  • Datacenter-wide outage, e.g. a power failure or a failed network switch.
  • Network connectivity

Replicating Azure Service Bus across regions protects your application from unplanned downtime. We will discuss Active-Active Replication and Active-Passive Replication for Azure Service Bus; both are cross-region patterns. We will also review Paired Namespace for scaled-down scenarios.

A message bus defeats its purpose if it cannot promise message delivery. Microsoft Azure Service Bus is highly reliable and promises message delivery. However, Microsoft does not guarantee the same during outages or failures.

Typically, an outage does not cause loss of messages or other data.

A disaster, by contrast, is the permanent or long-term loss of a data centre or part of one.

Typically a disaster causes loss of some or all messages or other data.

Both statements beginning with "Typically" above are quoted from the Microsoft documentation. The takeaway is that there is no absolute guarantee: no cloud provider or vendor can ensure 100% uptime. However, that does not mean we cannot achieve a resilient, highly available Service Bus implementation.

Azure Service Bus replication protects you against both outage and disaster. Typically, it serves as a common solution for High Availability as well as Disaster Recovery.

Azure Service Bus – Service Level Agreements (SLA)

Service Bus guarantees at least 99.9% availability for most of the service, including Relay, Queues and Topics, and Notification Hubs. Let’s use 99.9% as the base SLA to calculate various probabilities and downtime.

Effective SLA: P(A∪B), the probability that either service fails, because a failure of either one takes the system offline. Azure Service Bus supports ACS credentials; during an ACS outage, Azure Service Bus cannot authenticate and authorise the sender application, and the solution is unavailable until both services are available again.
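As a rough illustration of that arithmetic, the sketch below multiplies the availabilities of two services that must both be up. It assumes the failures are independent, and the 99.9% figure for ACS is an assumption made only for the example; the function name is illustrative.

```python
def serial_availability(*slas):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of each other.
    """
    result = 1.0
    for sla in slas:
        result *= sla
    return result


service_bus_sla = 0.999  # base SLA used in this article
acs_sla = 0.999          # assumed value for the ACS dependency (illustration only)

available = serial_availability(service_bus_sla, acs_sla)
p_failure = 1.0 - available  # P(A∪B): at least one of the two services is down

print(f"effective availability: {available:.4%}")
print(f"probability of outage:  {p_failure:.4%}")
```

With these inputs the effective availability drops to roughly 99.8%, which is lower than either service on its own.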

For High Availability, Shared Access Signature (SAS) tokens are useful: in this scenario the client authenticates directly with Service Bus using a self-minted token signed with a secret key.
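A minimal sketch of self-minting such a token is shown here. It follows the commonly documented SAS layout (an HMAC-SHA256 signature over the URL-encoded resource URI and the expiry time); the namespace, queue, policy name and key are placeholders invented for the example, and the `now` parameter exists only to make the function testable.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def mint_sas_token(resource_uri, key_name, key, ttl_seconds=3600, now=None):
    """Build a SAS token string for the given resource, valid for ttl_seconds."""
    issued_at = time.time() if now is None else now
    expiry = str(int(issued_at + ttl_seconds))
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The signed string is the encoded URI and the expiry, separated by a newline.
    string_to_sign = f"{encoded_uri}\n{expiry}".encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return (
        f"SharedAccessSignature sr={encoded_uri}"
        f"&sig={signature}&se={expiry}&skn={key_name}"
    )


# Placeholder values, not real credentials.
token = mint_sas_token(
    "https://example-namespace.servicebus.windows.net/orders",
    key_name="send-policy",
    key="not-a-real-key",
)
print(token)
```

Because the token is computed locally from the shared key, the sender does not depend on a separate token-issuing service being available.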

To achieve higher availability, cross-region replication is very useful.

The effective SLA is calculated from P(A∩B) for two regions and P(A∩B∩C) for three, that is, the probability that all regions fail at once, because the system remains available as long as one region is available. We will go through the variants in the following sections.
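To make the numbers concrete, this sketch computes the availability of one, two and three replicated regions from the 99.9% base SLA. It assumes regional failures are independent, which real outages do not always honour; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def replicated_availability(region_sla, regions):
    """Availability when the system is up as long as any one region is up."""
    p_all_down = (1.0 - region_sla) ** regions  # P(A∩B∩...), independence assumed
    return 1.0 - p_all_down


def downtime_seconds_per_year(availability):
    """Expected unavailable time per (non-leap) year, in seconds."""
    return (1.0 - availability) * SECONDS_PER_YEAR


for regions in (1, 2, 3):
    a = replicated_availability(0.999, regions)
    print(f"{regions} region(s): {a:.7%} available, "
          f"{downtime_seconds_per_year(a):.3f} s of downtime per year")
```

Under these assumptions a single region allows about 8.76 hours of downtime per year, while two regions bring that down to roughly half a minute.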

Tip

A system can achieve six-sigma-equivalent uptime by implementing replication.

Active-Active Azure Service Bus Replication

Active replication sends every message to both Azure Service Bus instances, irrespective of the availability of either instance. The client receives messages from both queues; the receiver processes the first copy of a message that arrives and discards or suppresses the second copy. This requires a correlation ID or some other identifying field in the message context to recognise the duplicate.
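The sending side of this pattern can be sketched as follows. To stay independent of any particular SDK, each Service Bus instance is represented by a plain callable that accepts a message; `send_to_all` and the dictionary message shape are hypothetical names chosen for the example.

```python
import uuid


def send_to_all(senders, body):
    """Send one logical message to every instance, tagged with one correlation ID.

    Succeeds if at least one instance accepts the message.
    """
    message = {"correlation_id": str(uuid.uuid4()), "body": body}
    delivered = 0
    errors = []
    for send in senders:
        try:
            send(message)
            delivered += 1
        except Exception as exc:  # an unavailable region must not block the others
            errors.append(exc)
    if delivered == 0:
        raise RuntimeError(f"all {len(senders)} instances rejected the message: {errors}")
    return message["correlation_id"], delivered


# In-memory stand-ins for two regional queues.
region_a, region_b = [], []
correlation_id, copies = send_to_all([region_a.append, region_b.append], "order #1")
print(correlation_id, copies)
```

Both copies carry the same correlation ID, which is what lets the receiver recognise the second arrival as a duplicate.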

Ideally, the receiver or client should build a dedupe wrapper service (short-span caching) to ensure that each message reaches the application only once. Alternatively, if the receiver already implements idempotency patterns, then it benefits from that design.
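One possible shape for such a dedupe wrapper is an in-memory cache of recently seen correlation IDs with a short time-to-live. The class and function names here are hypothetical, the five-minute default is an arbitrary assumption, and a receiver running on several machines would need a shared cache instead.

```python
import time


class Deduplicator:
    """Remembers correlation IDs for a short time span to suppress second copies."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # correlation_id -> time the first copy arrived

    def is_first_copy(self, correlation_id):
        now = self._clock()
        # Evict entries older than the TTL so the cache stays small.
        expired = [cid for cid, seen_at in self._seen.items() if now - seen_at > self._ttl]
        for cid in expired:
            del self._seen[cid]
        if correlation_id in self._seen:
            return False
        self._seen[correlation_id] = now
        return True


def handle(message, dedupe, process):
    """Process the message only if its correlation ID has not been seen recently."""
    if dedupe.is_first_copy(message["correlation_id"]):
        process(message)
        return True
    return False


dedupe = Deduplicator()
processed = []
for copy in ({"correlation_id": "abc", "body": "order #1"},) * 2:
    handle(copy, dedupe, processed.append)
print(len(processed))
```

The TTL only needs to cover the longest expected gap between the two copies arriving; a longer window costs memory without adding protection.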

Advantages

  • Advanced protection against message and data loss.
  • No limit on the number of replicating regions.
  • A copy of the same message in each replicating region ensures the SLA for individual messages.
  • No message delays (due to unplanned/planned outage or partial outage).
  • Protects against throttling.

Disadvantages

  • Cost multiplies with the number of replicating regions.
  • Additional engineering effort to develop and support the idempotency or de-duper component.

Active-Passive Azure Service Bus Replication

Passive replication uses only one of the two Azure Service Bus instances at any point in time. If the send operation fails (see scenario 2 in the figure below), the sender retries the same message against the secondary Azure Service Bus instance. The receiver application or endpoint listens to both Service Bus instances.
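The sender-side failover can be sketched in a few lines. As an assumption made to keep the example SDK-neutral, each instance is represented by a callable that raises an exception when the send fails; `send_with_failover` is a hypothetical name.

```python
def send_with_failover(send_primary, send_secondary, message):
    """Try the primary instance first; on failure, retry against the secondary.

    Returns the name of the instance that accepted the message. If the
    secondary also fails, its exception propagates to the caller.
    """
    try:
        send_primary(message)
        return "primary"
    except Exception:
        send_secondary(message)
        return "secondary"


def unavailable(message):
    # Stand-in for a primary instance that is currently down.
    raise ConnectionError("primary instance unavailable")


secondary_queue = []
used = send_with_failover(unavailable, secondary_queue.append, {"body": "order #1"})
print(used, len(secondary_queue))
```

Only one instance holds each message, so the receiver needs no duplicate detection, but a message already accepted by the primary stays there until that instance recovers.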

Advantages

  • No significant cost to maintain the secondary region.
  • No need for specific engineering patterns or de-duper component.

Disadvantages

  • Could lose data or messages in disaster or outage scenarios.
  • Messages in transit could be delayed during an outage.
  • Cannot ensure a QoS SLA for every individual transaction (message).
  • Does not protect against throttling.

Paired Namespace

Paired Namespace supports scenarios in which a Service Bus instance within a data centre becomes unavailable: the primary namespace fails over to the secondary namespace. It is very much like Active-Passive Replication, with a few more restrictions:

  • Pairing only preserves send availability.
  • Message order is not preserved; messages sent to a given queue or topic may arrive out of order.
  • Session state is only maintained on the primary namespace.

The only benefit of this pattern is that application logic does not interact with the secondary namespace, nor does it have to handle failover. It is a good solution for non-critical, small-scale Azure Service Bus implementations.

Summary

Active-Active replication has clear benefits over Active-Passive replication. However, the choice of approach should be based on the business case. Both patterns provide High Availability, Disaster Recovery and resiliency. The significant difference lies in the per-message SLA (Active-Active replication) versus cost (Active-Passive replication).
