What to Do When AWS Goes Offline?

Amazon’s services are not completely ironclad, and applications built on the platform are still vulnerable to the occasional outage. When one happens, responsibility is shared between AWS and the developers of the affected web applications. AWS services are designed with redundancy in mind, and much of the physical infrastructure comes with backup databases and availability zones to cover for instances that go down or are taken down for maintenance. Although this mitigates data loss, it is still advisable to prepare for the next incident.

Active-Active and Active-Passive Failsafes via Route 53

Active-Active is a general term for a setup in which every node serves traffic. When a node fails, its inbound traffic is redirected to a node that is still functional or balanced across all remaining active nodes. In AWS’ case, this can be done at different scales, across availability zones or across regions. Keeping multiple copies of an application’s infrastructure running in multiple availability zones is more expensive, but it is well worth the effort to keep the application up when an instance goes down. By comparison, the Active-Passive method is a cheaper solution in which fully redundant backup nodes are only brought online when the primary node goes offline. The trade-off is that failover is typically slower than with Active-Active, because the standby has to take over before traffic is served again.
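
The difference between the two methods can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how AWS implements it; the `Node` class, the function names, and the region names used as node labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Node:
    """Hypothetical stand-in for an instance, zone, or region."""
    name: str
    healthy: bool = True


def active_active_targets(nodes):
    """Active-Active: every healthy node is eligible to serve traffic."""
    return [node for node in nodes if node.healthy]


def active_passive_target(primary, standby):
    """Active-Passive: use the primary while it is healthy, else the standby."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    return None  # both nodes are down, so nothing can serve traffic


def route_requests(nodes, request_count):
    """Spread request_count requests round-robin over the healthy nodes."""
    targets = active_active_targets(nodes)
    if not targets:
        return []
    rotation = cycle(targets)
    return [next(rotation).name for _ in range(request_count)]
```

With two healthy nodes, `route_requests` alternates between them. Once one is marked unhealthy, every request lands on the survivor, while `active_passive_target` only switches to the standby after the primary fails.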

To accomplish this, AWS provides a managed service called Route 53. It is a scalable cloud Domain Name System (DNS) service with additional features such as health checks and traffic policies. After detecting an unreachable endpoint, it starts redirecting traffic. Endpoints are configured under the health checks option of Route 53. In the Route 53 console, there are options for what to monitor, how frequently to run a health check, how many times a check must fail before action is taken, latency graphs, and which regions to check from. Once the health checks are defined, policies can be set up to redirect traffic to alternate endpoints. The next time an outage occurs, traffic will be redirected automatically with minimal oversight.
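
The same settings can also be expressed as request payloads instead of console clicks. The sketch below only builds plain dictionaries in the shape the Route 53 API expects for a health check and a primary/secondary failover record pair; it makes no AWS calls. The helper names, the `/health` path, the domain, and the IP addresses are placeholders chosen for illustration.

```python
def health_check_config(domain, path="/health", interval_seconds=30, failure_threshold=3):
    """Build a HealthCheckConfig payload for an HTTPS endpoint check."""
    if interval_seconds not in (10, 30):
        raise ValueError("Route 53 accepts a request interval of 10 or 30 seconds")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval_seconds,    # how often each checker probes
        "FailureThreshold": failure_threshold,  # consecutive failures before unhealthy
    }


def failover_record(name, ip_address, role, health_check_id=None, ttl=60):
    """Build one UPSERT change for a PRIMARY or SECONDARY failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-endpoint",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, standby_ip, primary_check_id):
    """Pair a health-checked primary record with a standby record."""
    return {
        "Comment": "Active-Passive failover pair",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", primary_check_id),
            failover_record(name, standby_ip, "SECONDARY"),
        ],
    }
```

Assuming boto3 is installed and credentials are configured, these payloads would be passed to the Route 53 client as `create_health_check(CallerReference=..., HealthCheckConfig=...)` and `change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`. A short TTL is used here so that resolvers pick up the failover quickly.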

Multi-Region Asset Replication

Another method is automatic asset replication. While it is enabled, new objects written to a targeted S3 bucket are automatically copied to a designated bucket in a different AWS region. In the S3 console, users can find a list of their S3 buckets along with each bucket’s property controls. Cross-region replication requires Versioning, so enable Versioning first and then set up a replication rule. Users will still need to adjust how the application references static content. While it is possible to simply change the asset hostname, a more thorough solution is recommended. If all else fails, keep physical backups of all of the application’s data locally in anticipation of needing to do a refresh.
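
As a rough sketch of what that setup looks like outside the console, the code below builds the versioning and replication payloads as plain dictionaries and adds a small helper that picks which bucket the application should load static assets from. The bucket names, the role ARN, and the helper names are illustrative assumptions, and nothing here calls AWS.

```python
def versioning_configuration():
    """Versioning must be enabled on both buckets before replication."""
    return {"Status": "Enabled"}


def replication_configuration(role_arn, destination_bucket, rule_id="replicate-all"):
    """Build a replication configuration with a single rule covering all objects."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def asset_base_url(primary, replica, primary_available):
    """Return the base URL for static assets, falling back to the replica.

    primary and replica are (bucket_name, region) tuples.
    """
    bucket, region = primary if primary_available else replica
    return f"https://{bucket}.s3.{region}.amazonaws.com"
```

Assuming boto3, the payloads would go to the S3 client as `put_bucket_versioning(Bucket=..., VersioningConfiguration=...)` and `put_bucket_replication(Bucket=..., ReplicationConfiguration=...)`. The `asset_base_url` helper is the simple "change the hostname" approach mentioned above; a more thorough solution would put the switch behind DNS or a CDN so the application itself does not have to change.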

Develop a Backup Plan and Stress Test

The safest route is to treat failure as an inevitability and act as though one will happen at some point. Along with establishing the necessary policies for rerouting traffic, it’s good to have a plan for when the service does fail. Preparing test cases and designing the protocols and actions to take around them will leave the organization significantly better prepared. Each outage has the potential to damage an application’s state, logins, and configurations, so it is important to be able to restore as much as possible to the last state before the incident.
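
One way to turn such test cases into something repeatable is a small restore drill: simulate the damage, run the restore procedure, and check that the result matches the pre-incident state. The sketch below works on an in-memory dictionary purely for illustration. The `Scenario` class, `run_restore_drill`, and the `take_backup` and `restore` callables are hypothetical stand-ins for an organization's real backup and recovery steps.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """A named failure to rehearse; inject_failure mutates the state it is given."""
    name: str
    inject_failure: Callable[[dict], None]


def run_restore_drill(state, scenarios, take_backup, restore):
    """Rehearse each scenario and report whether the restore matched the original."""
    results = {}
    for scenario in scenarios:
        expected = copy.deepcopy(state)     # last known-good state
        backup = take_backup(state)         # whatever the backup process captures
        damaged = copy.deepcopy(state)
        scenario.inject_failure(damaged)    # simulate the outage on a copy
        recovered = restore(damaged, backup)
        results[scenario.name] = recovered == expected
    return results
```

A drill that returns `False` for a scenario points at something the backup does not capture, which is far cheaper to learn in a rehearsal than during a real outage.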

Built-in Monitoring Tools

AWS comes with a range of tools for monitoring the health of the services an application is built on. Use the entire suite of monitoring tools rather than a single one: Amazon CloudWatch alone can only go so far, because it is tied internally to whichever service it is applied to. An external monitoring service can provide greater insight into how a service went down, which helps minimize data loss in future outages. When building on the Amazon cloud, it also helps to review in advance the current status and status history of all Amazon services across multiple regions, as listed on the AWS Health Dashboard.
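
A minimal external probe, run from somewhere outside AWS, might look like the sketch below. It uses only the Python standard library. The `OutageDetector` class, the threshold of three consecutive failures, and the idea of passing the probe in as a callable are assumptions made for the example rather than features of any particular monitoring product.

```python
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True when the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


class OutageDetector:
    """Declare an outage only after several consecutive failed probes."""

    def __init__(self, probe, failure_threshold=3):
        self.probe = probe                  # callable returning True or False
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.history = []                   # (timestamp, ok) pairs for later review

    def check(self, now=None):
        """Run one probe, record it, and return True while the service counts as down."""
        ok = self.probe()
        self.history.append((time.time() if now is None else now, ok))
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold
```

In practice the detector would be constructed with something like `OutageDetector(lambda: http_probe("https://app.example.com/health"))` and called on a schedule. The recorded history gives a timeline of the outage that does not depend on the platform that went down.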

Well-Architected Framework

Following the guidelines set out by the AWS Well-Architected Framework can help make an AWS architecture much more reliable. It provides best-practice groundwork across six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. A fundamentally well-designed structure helps an application from the moment it is conceptualized, and we can help build it. Check out our offering on the AWS Marketplace if building a reliable application on AWS is a priority.

An Ounce of Prevention

Losing connection to an application on AWS is the least of your concerns. Data can be lost in the cloud, and it would ease everyone’s concerns if such a loss were made less consequential. Amazon does provide redundancies in the event of damage, but designers must also structure their data flows accordingly to take full advantage of them.

Dolan Cleary

I am a recent graduate of the University of Wisconsin - Stout and now work with AllCode as a web technician, currently within the marketing department.
