Operating reliable cloud applications

Design and operate reliable cloud applications that meet your service-level objectives (SLOs). Start with the resilient Azure platform foundation and take advantage of built-in features across high availability, disaster recovery, and backup scenarios, then design and test your applications with the Well-Architected Framework and the latest chaos engineering tools.

Operate with confidence by building on the resilient foundation of the Azure platform: world-class global infrastructure that meets and exceeds our platform reliability commitments, including financially backed service-level agreements (SLAs) for each service.

Highlights of our reliability commitments include:

  • 99.99% compute availability monthly

    Zone-redundant Azure VMs

  • 99.99% identity availability monthly

    Azure Active Directory authentications

  • 99.995% database availability monthly

    Zone-redundant Azure SQL deployments

  • 99.99999999999999% object durability annually

    Objects in an Azure geo-zone-redundant storage account

  • 100% Azure DNS availability monthly

    All valid Azure DNS requests guaranteed to receive a response

Learn more about SLAs on Azure
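To make these percentages concrete, the following minimal sketch converts an availability target into the downtime it allows over a period. The function name `downtime_budget_minutes` and the 30-day month are assumptions chosen for illustration, not part of any Azure SLA definition. Actual SLA terms define how downtime is measured for each service.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime an availability target allows over a period.

    Assumes a simple model: budget = total minutes * (1 - availability).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Targets taken from the list above; labels are informal.
    for label, target in [("compute and identity", 99.99), ("database", 99.995), ("DNS", 100.0)]:
        budget = downtime_budget_minutes(target)
        print(f"{label}: {target}% allows {budget:.2f} minutes per 30 days")
```

Under this model, a 99.99% monthly target leaves roughly 4.32 minutes of downtime in a 30-day month, and 99.995% leaves roughly 2.16 minutes.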

Around the world

Top-performing cloud network

Every day, customers around the world connect and pass trillions of requests to Azure, Bing, Dynamics 365, Office 365, Xbox, and other services.

To offer the lowest latency and highest reliability, we manage traffic across our global network as close to the customer as possible by default. Microsoft’s global network is analogous to a highway system—datacenters grouped into Azure regions are the major cities, edge PoPs are the smaller cities, and networking assets are the common highway with multiple lanes that all vehicles can use. Traffic moving between our datacenters, and between any one datacenter and the edge, all stays on Microsoft’s highway system.

More than 190 network PoPs are placed close to customers for low-latency network performance, and 99 percent of Azure inter-region latencies beat the performance baseline for internet traffic in a key study.

Deep in the Atlantic Ocean

Marea: The highest capacity transatlantic subsea cable

When Hurricane Sandy struck, it revealed a potential single point of failure—that existing transatlantic subsea cables all landed in New York and New Jersey.

To combat this, Microsoft partnered with Facebook and Telxius to develop the highest capacity subsea cable to cross the Atlantic. Providing up to 160 terabits of data per second, the Marea (Spanish for “tide”) is the first subsea cable to connect Virginia and Spain—and because the cable is situated so many miles south of the previous connection points, connectivity is safeguarded against natural disasters or other major events across the Atlantic.

Platform changes are both inevitable and beneficial—how can we make changes and releases safer?

What is the primary cause of service reliability issues that we see in Azure, other than small but common hardware failures? Change.

We manage change automation with safe deployment practices (SDP) so that all code and configuration updates go through well-defined stages. This way, we catch regressions and bugs before they reach customers or, if they do make it past the early stages, they affect as few customers as possible.
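The idea of staged rollout can be sketched in a few lines. This is an illustration only, not Azure's deployment pipeline: the stage names, `staged_rollout`, and the `deploy` and `is_healthy` callbacks are all hypothetical. The sketch deploys to one stage at a time and stops at the first stage that fails its health check, so later (larger) stages never receive the change.

```python
from typing import Callable, List, Sequence

# Hypothetical stage names, ordered from smallest to largest blast radius.
STAGES: Sequence[str] = ("canary", "pilot", "first-region", "all-regions")


def staged_rollout(
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[str] = STAGES,
) -> List[str]:
    """Deploy stage by stage and halt at the first unhealthy stage.

    Returns the stages that were deployed and passed their health check.
    """
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            break  # Stop here so the regression never reaches later stages.
        completed.append(stage)
    return completed


if __name__ == "__main__":
    deployed: List[str] = []
    # Simulate a regression that only shows up in the "pilot" stage.
    passed = staged_rollout(deployed.append, lambda stage: stage != "pilot")
    print("deployed to:", deployed)
    print("passed health checks:", passed)
```

In the simulated run, the change reaches only the canary and pilot stages. A real pipeline would also roll back the failed stage and wait between stages to let problems surface.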

Planned maintenance on the Azure platform: Low and no‑impact maintenance technologies

Azure approaches all maintenance with the goal of minimizing the impact on customer workloads.

Learn about the low and no-impact update technologies—including hot patching, memory-preserving maintenance, and live migration—that Azure uses to maintain its infrastructure with little or no customer impact or downtime.

Narya starts with a hardware failure prediction, makes a smart decision on how to respond, implements the response, then measures the customer impact and incorporates it via a feedback loop.

Advancing failure prediction

Project Narya

Project Narya is a holistic, end-to-end prediction and mitigation service named after the Ring of Fire from The Lord of the Rings, which is known to resist the weariness of time.

Narya is designed not only to predict and mitigate Azure host failures but also to measure the impact of its mitigation actions and to use an automatic feedback loop to intelligently adjust its mitigation strategy. It uses our Resource Central platform, a general machine learning and prediction-serving system that we have deployed to all Azure compute clusters worldwide. Narya has been running in production for over a year and, on average, has reduced virtual machine interruptions by 26 percent, helping you run your Azure workloads more smoothly.
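The predict, act, measure, and adjust loop described above can be illustrated with a small sketch. This is not Narya's actual algorithm, which the text does not specify. It is a generic epsilon-greedy feedback loop, and the class `MitigationPolicy`, the action names, and the numeric impact scores are all hypothetical. The policy tries each action once, then mostly picks the action with the lowest measured customer impact while occasionally exploring alternatives.

```python
import random
from collections import defaultdict
from typing import Iterable


class MitigationPolicy:
    """Choose a mitigation action and learn from measured customer impact."""

    def __init__(self, actions: Iterable[str], explore_rate: float = 0.1, seed: int = 0):
        self.actions = list(actions)
        self.explore_rate = explore_rate
        self.rng = random.Random(seed)
        self.total_impact = defaultdict(float)  # Sum of observed impact per action.
        self.count = defaultdict(int)           # Number of observations per action.

    def mean_impact(self, action: str) -> float:
        n = self.count[action]
        return self.total_impact[action] / n if n else 0.0

    def choose(self) -> str:
        # Try every action at least once before comparing averages.
        untried = [a for a in self.actions if self.count[a] == 0]
        if untried:
            return untried[0]
        # Occasionally explore so the estimates keep up with changing conditions.
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(self.actions)
        # Otherwise pick the action with the lowest average measured impact.
        return min(self.actions, key=self.mean_impact)

    def record(self, action: str, impact: float) -> None:
        """Feed the measured impact of an action back into the policy."""
        self.total_impact[action] += impact
        self.count[action] += 1


if __name__ == "__main__":
    policy = MitigationPolicy(["live_migrate", "defer_repair"], explore_rate=0.0)
    policy.record("live_migrate", 1.0)   # Hypothetical impact scores.
    policy.record("defer_repair", 5.0)
    print("preferred action:", policy.choose())
```

The design point is the feedback step: without `record`, the policy would keep applying a fixed rule regardless of how customers were actually affected.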

Operating reliable infrastructure relies on a shared responsibility model. Take advantage of optional Azure services and features to achieve your specific reliability goals.

Regional failover with Azure Site Recovery

Azure’s distributed global infrastructure gives you the flexibility you need to respond to workload failures, even in region-down disaster recovery scenarios.

When a localized outage hits your primary region, you can use Azure Site Recovery to fail over your critical applications to a secondary region and keep your business apps and workloads running. Once the primary region has recovered, you can fail back to it. Azure Site Recovery supports regional disaster recovery as well as zonal disaster recovery in regions with Availability Zones.

Protect your data with storage redundancy

To protect against unplanned disruptions, every Azure Storage account replicates your data three times within its primary region.

For applications that need high availability or disaster recovery, consider additional redundancy options, including:

  • Zone-redundant storage spreads your data across three different Availability Zones within your primary region (each with independent power, networking, and cooling).
  • Geo-redundant storage copies your data to a secondary region hundreds of miles away, to ensure your data is durable even if a regional outage renders your primary region unrecoverable.
  • A combination of the two, geo-zone-redundant storage, provides the benefits of both.

When deciding which redundancy option is best for your scenarios, consider the tradeoffs between lower costs and higher availability.
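The options above can be summarized as a small decision helper. This is a simplified sketch: the function `choose_redundancy` and its two yes-or-no inputs are assumptions made for illustration, and the returned strings are the standard Azure Storage SKU names for locally redundant, zone-redundant, geo-redundant, and geo-zone-redundant storage. A real decision would also weigh cost, read access to the secondary region, and regional support.

```python
def choose_redundancy(survive_zone_outage: bool, survive_region_outage: bool) -> str:
    """Map two resilience requirements to an Azure Storage redundancy SKU name."""
    if survive_zone_outage and survive_region_outage:
        return "Standard_GZRS"  # Zone-redundant in the primary region plus a secondary region.
    if survive_region_outage:
        return "Standard_GRS"   # Copied to a secondary region.
    if survive_zone_outage:
        return "Standard_ZRS"   # Spread across Availability Zones in the primary region.
    return "Standard_LRS"       # Three copies within the primary region only.


if __name__ == "__main__":
    for zone in (False, True):
        for region in (False, True):
            sku = choose_redundancy(zone, region)
            print(f"zone outage: {zone}, region outage: {region} -> {sku}")
```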

Improve application resilience with Azure Chaos Studio

A fully managed chaos engineering experimentation platform, Azure Chaos Studio helps you to disrupt your apps intentionally to identify gaps and plan mitigations before your customers are impacted by a problem.

Experiment by subjecting your Azure apps to real or simulated faults in a controlled manner to better understand application resiliency. Observe how your apps will respond to real-world disruptions such as network latency, an unexpected storage outage, expiring secrets, or even a full datacenter outage.
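The principle behind a fault-injection experiment can be shown with a local simulation. This sketch does not use the Azure Chaos Studio API. The names `InjectedFault`, `with_faults`, `call_with_retry`, and `run_experiment` are hypothetical, and the "dependency" is a stand-in function. It injects failures at a chosen rate and measures how often a simple retry strategy still succeeds.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_faults(call: Callable[[], T], failure_rate: float, rng: random.Random) -> Callable[[], T]:
    """Wrap a call so that it fails with the given probability."""
    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency outage")
        return call()
    return wrapped


def call_with_retry(call: Callable[[], T], attempts: int = 3) -> T:
    """Call the dependency, retrying on injected faults up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except InjectedFault:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the failure to the caller.
    raise AssertionError("unreachable")


def run_experiment(failure_rate: float, trials: int = 1000, seed: int = 0) -> float:
    """Return the fraction of requests that succeed despite injected faults."""
    rng = random.Random(seed)
    flaky = with_faults(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retry(flaky)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5):
        print(f"failure rate {rate}: success ratio {run_experiment(rate):.3f}")
```

Running the same experiment with different retry counts or failure rates shows where the application's resilience breaks down, which is the kind of gap a chaos experiment is meant to reveal.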

Customer Tech Talks sits down with Clear.Bank to discuss their journey to Azure.

United Kingdom

Clear.Bank’s reliability journey

Clear.Bank migrated to Azure and took advantage of Azure Availability Zones to process payments—even during outages.

“We had a zonal outage on Azure, and luckily for us, we had invested in this space. And we had got to a place where we had adopted Availability Zones, and in this scenario, we didn’t experience any customer impact—our payments carried on, going as expected. We did get some alerts that we nervously were watching, but everything came back as expected.” -Tom Harris, Chief Technology Officer, Clear.Bank

High availability even at your low points

Reliability is a shared responsibility. We continue to invest in core capabilities to bolster the foundation and features of the Azure platform—take advantage of both to ensure your Azure applications are resilient and highly available. Even in the face of rare but inevitable technical issues, this joint commitment aims to ensure that all of your critical services continue operating.