What to Do When AWS Goes Offline?

Amazon’s services are not completely iron-clad, and applications built on the platform remain vulnerable to the occasional outage. When one happens, responsibility is shared between AWS and the developers of the web application. AWS designs its services with redundancy in mind, and much of the physical infrastructure comes with backup databases and multiple availability zones to cover for instances that fail or are taken down for maintenance. Even though this mitigates data loss, it is still advisable to prepare for the next incident.

Active-Active and Active-Passive Failover via Route 53

Active-Active is a general failover pattern in which every node serves traffic at the same time. When a node fails, its inbound traffic is redirected to a node that is still functional or balanced across all remaining active nodes. In AWS’ case, this can be done at different scales, including across availability zones and across regions. Running multiple full copies of an application’s infrastructure is more expensive, but it is well worth the cost when constant uptime matters and an instance does go down. By comparison, Active-Passive is the cheaper option: the backup nodes are fully redundant but only start receiving traffic when the primary node goes offline. The trade-off is that failover generally takes longer than with Active-Active, because the standby has to take over before service resumes.
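The difference between the two patterns can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how AWS implements it; the `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str              # hypothetical label, e.g. a region or availability zone
    healthy: bool
    standby: bool = False  # True for an Active-Passive backup node


def active_active_targets(nodes):
    """Every healthy node serves traffic; failed nodes simply drop out."""
    return [n.name for n in nodes if n.healthy]


def active_passive_targets(nodes):
    """Healthy primaries serve traffic; standbys are used only if no primary is healthy."""
    primaries = [n.name for n in nodes if n.healthy and not n.standby]
    if primaries:
        return primaries
    return [n.name for n in nodes if n.healthy and n.standby]
```

In the Active-Active case the surviving nodes were already serving traffic, which is why recovery tends to be quicker; in the Active-Passive case the standby appears in the target list only after the primary has failed.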

AWS provides a tool for automating this called Amazon Route 53. It is a scalable cloud Domain Name System (DNS) service with additional features such as health checks and traffic policies. Once a health check reports an endpoint as unhealthy, Route 53 stops returning that endpoint in DNS answers and sends traffic to the healthy alternatives. Endpoints are configured under the Health checks option of the Route 53 console. There are options for what to monitor, how frequently to run a check, how many consecutive checks must fail before the endpoint is considered unhealthy, whether to graph latency, and which regions to check from. After the health checks are defined, routing policies can be set up to redirect traffic to alternate endpoints. Because the redirection happens through DNS, clients switch over as their cached records expire, so a short TTL on failover records helps. The next time an outage occurs, traffic will be redirected automatically with minimal oversight.
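The same settings can also be expressed as API payloads. The sketch below only builds the dictionaries that Route 53's CreateHealthCheck and ChangeResourceRecordSets operations accept; it makes no AWS calls. The helper names, the `/health` path, the default values, and the record identifiers are assumptions made for illustration.

```python
def health_check_config(fqdn, path="/health", interval=30, failure_threshold=3):
    """Build a HealthCheckConfig dict for an HTTPS endpoint (values are illustrative)."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval,            # seconds between checks: 10 or 30
        "FailureThreshold": failure_threshold,  # consecutive failures before "unhealthy"
    }


def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, secondary_ip, primary_check_id):
    """Pair a health-checked primary record with a secondary (Active-Passive) record."""
    return {
        "Comment": "Active-Passive failover records",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary", primary_check_id),
            failover_record(name, secondary_ip, "SECONDARY", "secondary"),
        ],
    }
```

With boto3, these dictionaries would typically be passed as the `HealthCheckConfig` argument of `create_health_check` and the `ChangeBatch` argument of `change_resource_record_sets` on a Route 53 client, together with a caller reference and a hosted zone ID of your own.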

Multi-Region Asset Replication

Another option is automatic asset replication. While it is enabled, new objects written to a targeted S3 bucket are automatically copied to a designated bucket in a different AWS region. In the S3 console, each bucket has its own set of property and management controls. After Versioning is enabled on both the source and destination buckets, cross-region replication can be configured. Users will still need to adjust how the application references static content so that it can be served from the replica. While it is possible to simply change the asset hostname, a more thorough solution is recommended. If all else fails, keep physical backups of the application’s data locally in anticipation of needing to do a full restore.
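As a rough sketch, the replication settings and the hostname fallback might look like the following in Python. The functions only build the dictionaries that S3's PutBucketVersioning and PutBucketReplication operations accept and pick an asset host; nothing here calls AWS. The function names, the rule ID, and the `is_up` callback are hypothetical.

```python
def versioning_configuration():
    """Versioning must be enabled on both buckets before replication can be configured."""
    return {"Status": "Enabled"}


def replication_configuration(role_arn, destination_bucket, rule_id="replicate-all"):
    """Build a single-rule replication configuration covering every object key."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all keys
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def asset_url(key, hosts, is_up):
    """Return the asset URL on the first host that `is_up` reports as reachable."""
    for host in hosts:
        if is_up(host):
            return f"https://{host}/{key.lstrip('/')}"
    raise RuntimeError("no asset host is currently reachable")
```

With boto3, the two dictionaries would typically be passed to `put_bucket_versioning` (as `VersioningConfiguration`) and `put_bucket_replication` (as `ReplicationConfiguration`) on an S3 client. `asset_url` shows the simple hostname swap; a more thorough solution would put a CDN or DNS failover in front of both buckets.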

Develop a Backup Plan and Stress Test

The safest approach is to treat an outage as an inevitability and plan as though a failure will happen at some point. Along with establishing the policies for rerouting traffic, have a plan for what to do when the service does fail. Writing test cases for failure scenarios, and designing the protocols and actions to take for each, will prepare the organization significantly. Each outage has the potential to damage an application’s state, logins, and configurations, so it is important to be able to restore as much as possible to the last known state before the incident.
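One simple test case is a restore drill: snapshot the application state, then confirm that the snapshot restores to an identical copy. The sketch below assumes the state fits in a JSON-serializable dictionary; the function names and the snapshot layout are hypothetical, and a real application would snapshot databases and configuration stores rather than an in-memory object.

```python
import hashlib
import json


def take_snapshot(state):
    """Serialize the state and record a SHA-256 checksum alongside it."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "sha256": digest}


def restore_snapshot(snapshot):
    """Verify the checksum before restoring, so a damaged backup is never applied."""
    digest = hashlib.sha256(snapshot["payload"].encode("utf-8")).hexdigest()
    if digest != snapshot["sha256"]:
        raise ValueError("snapshot is corrupted; refusing to restore")
    return json.loads(snapshot["payload"])


def restore_drill(state):
    """Return True if a snapshot of `state` restores to an identical copy."""
    return restore_snapshot(take_snapshot(state)) == state
```

Running a drill like this on a schedule, instead of only after an incident, is what shows whether the backups can actually be restored.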

Built-in Monitoring Tools

AWS comes with a number of tools for monitoring the health of the services an application is built on. Use the entire suite rather than a single tool: Amazon CloudWatch alone can only go so far, because it runs inside the same platform as the services it monitors. An external monitoring service can provide greater insight into how and when a service went down. That insight helps minimize data loss in future outages. When building on AWS, it also helps to look in advance at the current status and status history of AWS services across regions on the AWS Health Dashboard.
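An external check can be as small as a script run from outside AWS. The following is a minimal sketch using only the Python standard library; the function names, the timeout, and the threshold of three failures are assumptions, and a hosted monitoring service would normally replace something this basic.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Send one GET request and return (ok, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP error codes.
        ok = False
    return ok, time.monotonic() - start


def is_down(results, failure_threshold=3):
    """True once the most recent `failure_threshold` probe results all failed."""
    if len(results) < failure_threshold:
        return False
    return not any(results[-failure_threshold:])
```

Requiring several consecutive failures before raising an alert follows the same reasoning as the failure threshold on a Route 53 health check: one dropped request should not count as an outage.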

Well-Architected Framework

Following the guidelines set out by the AWS Well-Architected Framework can help make an AWS architecture much more reliable. It lays out best practices across six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. A fundamentally well-designed structure helps an application from the moment it is conceptualized, and we can help build it. Check out our offering on the AWS Marketplace if building a reliable application on AWS is a priority.

An Ounce of Prevention

Losing connection to an application on AWS is the least of your concerns. Data can also be lost during an outage, and it would ease everyone’s concerns if that loss were made less consequential. Amazon does provide redundancies in the event of damage, but designers must also structure their data flows accordingly to take full advantage of them.

Dolan Cleary

I am a recent graduate from the University of Wisconsin - Stout and am now working with AllCode as a web technician. Currently working within the marketing department.
