
AWS DevOps Cloud and Network Monitoring


AllCode implements continuous monitoring as a core DevOps practice for our customers. We employ a comprehensive approach to identify, observe, and detect issues and threats throughout each phase of the DevOps pipeline. Our toolset includes Amazon CloudWatch, CloudTrail, VPC Flow Logs, Config Rules, GuardDuty, IAM Access Analyzer, Security Hub, KMS logging, and third-party services like DataDog.

We collaborate with customers to customize these monitoring solutions based on the maturity level of their continuous monitoring practices. This ensures tailored and effective monitoring strategies that align with their specific needs.

Here is an example of the monitoring we provide:

URComped ECS Alarms and Logs

Notification System

These alarms are deployed for both the Website and Trio ECS environments and send notifications to the SNS topics ‘CORE-UrCompedWebsite-Notifications’ and ‘CORE-Trio-Notifications’, respectively. The notifications are then forwarded to Slack through a Lambda function named ‘alarms2slack’, along with the ApplicationLogs and the ECS Event Logs.
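A relay Lambda like ‘alarms2slack’ typically parses the CloudWatch alarm JSON out of each SNS record and posts it to a Slack webhook. The sketch below illustrates that flow under those assumptions; the webhook URL and message shape are hypothetical, not the real implementation.

```python
import json
import urllib.request

# Hypothetical webhook URL -- the real 'alarms2slack' Lambda would read its own config.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"

def build_slack_message(sns_record):
    """Turn one SNS record (a CloudWatch alarm notification) into a Slack payload."""
    alarm = json.loads(sns_record["Sns"]["Message"])
    text = (
        f"*{alarm['AlarmName']}* is {alarm['NewStateValue']}\n"
        f"{alarm['NewStateReason']}"
    )
    return {"text": text}

def lambda_handler(event, context):
    """Entry point: relay every SNS record in the event to Slack."""
    for record in event["Records"]:
        payload = build_slack_message(record)
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

The same handler can be reused for log-based notifications by branching on the record contents before formatting.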

Alarms Descriptions

ECS Critical Health

This alarm verifies that at least two healthy containers are running in the service, since two is the minimum number of instances that should be up at all times.

Detailed information:

  • Statistic: Minimum
  • Period: 5 minutes
  • Threshold: < 2 running tasks
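The settings above map directly onto a CloudWatch PutMetricAlarm call. The sketch below builds those parameters, assuming Container Insights provides the RunningTaskCount metric; the cluster, service, and topic names are illustrative, not the real resource names.

```python
def critical_health_alarm_params(cluster, service, sns_topic_arn):
    """Sketch of the 'ECS Critical Health' alarm parameters."""
    return {
        "AlarmName": f"{service}-ECS-Critical-Health",
        "Namespace": "ECS/ContainerInsights",   # assumes Container Insights is enabled
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Minimum",
        "Period": 300,                # 5 minutes
        "EvaluationPeriods": 1,
        "Threshold": 2,
        "ComparisonOperator": "LessThanThreshold",  # alarm when < 2 running tasks
        "AlarmActions": [sns_topic_arn],
    }
```

The dict can be passed straight to `boto3.client("cloudwatch").put_metric_alarm(**params)`.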

High CPU Usage

Since autoscaling is configured, this threshold should only be crossed after a sudden spike in processing demand or after autoscaling has reached its maximum number of containers; both cases require inspection.

Detailed information:

  • Statistic: Average
  • Period: 10 minutes
  • Threshold: >= 90% CPU Usage

High Memory Usage

Much like the High CPU alarm, High Memory Usage is expected to trigger when the load exceeds the capacity provisioned by autoscaling, and it likewise requires review.

Detailed information:

  • Statistic: Average
  • Period: 10 minutes
  • Threshold: >= 90% Memory usage
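The High CPU and High Memory alarms share the same shape, so a single parameterized sketch covers both. It assumes the standard AWS/ECS service metrics; resource names are illustrative.

```python
def high_utilization_alarm_params(cluster, service, metric_name, sns_topic_arn):
    """Shared sketch for the High CPU / High Memory Usage alarms."""
    return {
        "AlarmName": f"{service}-High-{metric_name}",
        "Namespace": "AWS/ECS",
        "MetricName": metric_name,   # "CPUUtilization" or "MemoryUtilization"
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 600,               # 10 minutes
        "EvaluationPeriods": 1,
        "Threshold": 90,             # percent
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```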

High Request Number

This alarm acts as a simple DoS detector, since a sudden burst of requests can degrade the app’s performance. At its peak, Website receives between 2,000 and 3,000 requests every five minutes, so roughly double that number (6,000 requests) triggers the alarm.

Detailed information:

  • Statistic: Sum
  • Period: 5 minutes
  • Threshold: >= 6000 Requests
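This alarm plausibly sits on the load balancer’s RequestCount metric; the sketch below assumes an Application Load Balancer, with an illustrative resource name.

```python
def high_request_alarm_params(load_balancer, sns_topic_arn):
    """Sketch of the High Request Number alarm on the ALB RequestCount metric."""
    return {
        "AlarmName": "High-Request-Number",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 300,           # 5 minutes
        "EvaluationPeriods": 1,
        "Threshold": 6000,       # ~2x the 2k-3k peak seen every five minutes
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```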

High App Latency

This alarm is meant to detect situations in which the health of a container is compromised without taking down the app. When some of the async components (such as the caches or Hangfire) have had issues, the metrics showed high latency but the instance health checks were unaffected. This alarm should therefore trigger when there is an app-side issue or performance is degraded.

Detailed information:

  • Statistic: P99
  • Period: 15 minutes
  • Threshold: >= 59 seconds
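Percentile statistics such as P99 are not plain CloudWatch statistics, so an alarm like this one would use ExtendedStatistic instead. A sketch, again assuming the ALB TargetResponseTime metric and an illustrative resource name:

```python
def high_latency_alarm_params(load_balancer, sns_topic_arn):
    """Sketch of the High App Latency alarm on the ALB TargetResponseTime metric."""
    return {
        "AlarmName": "High-App-Latency",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "ExtendedStatistic": "p99",  # percentiles go here, not in "Statistic"
        "Period": 900,               # 15 minutes
        "EvaluationPeriods": 1,
        "Threshold": 59,             # seconds
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```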

Logs Descriptions

ApplicationLogs

These are the logs sent by the applications through Serilog, which is the main logging system. The logs are separated by application across the log streams.

Log Groups:

  • ApplicationLogs

Log Streams:

  • Prod-URCompedWebsiteCORE
  • Prod-TrioCORE

Notification filter:

# Send notifications only for unhandled exceptions

[Uu]nhandled [Ee]xception
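A filter like this is typically attached to the log group as a subscription filter whose destination is the notification Lambda. The sketch below builds the parameters for that call; the filter name is illustrative, and the pattern is quoted verbatim from above.

```python
def application_logs_filter_params(destination_lambda_arn):
    """Sketch: attach the unhandled-exception filter to the ApplicationLogs group."""
    return {
        "logGroupName": "ApplicationLogs",
        "filterName": "unhandled-exceptions",        # illustrative name
        "filterPattern": "[Uu]nhandled [Ee]xception",
        "destinationArn": destination_lambda_arn,    # e.g. the alarms2slack Lambda
    }
```

The dict matches the shape of `boto3.client("logs").put_subscription_filter(**params)`.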

Slack notification for ApplicationLogs:

ECS Logs

These are the OS-level Event Log streams fetched from the containers through the Windows EventLog. They include OS-level errors as well as unhandled exceptions that are not logged through Serilog, such as Hangfire exceptions.

Log Groups:

  • /ecs/CORE-UrCompedWebsite
  • /ecs/CORE-Trio

Log Streams:

  • ecs/urcomped_prod_run/<container_id>

Notification filter:

# Send only Application Errors

{ ($.LogEntry.Channel = "Application") && ($.LogEntry.Level = "Error") && ($.LogEntry.EventId != 29) }
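To make the selection logic explicit, the small sketch below mirrors the filter pattern in plain Python against an illustrative JSON log event; the LogEntry field shape is taken from the pattern itself.

```python
import json

def should_notify(raw_event):
    """Mirror of the CloudWatch Logs filter:
    Channel == "Application" AND Level == "Error" AND EventId != 29.
    """
    entry = json.loads(raw_event).get("LogEntry", {})
    return (
        entry.get("Channel") == "Application"
        and entry.get("Level") == "Error"
        and entry.get("EventId") != 29
    )
```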

Slack notification for ECS Logs:

Application Logs

Exceptions handled within the applications are posted to the ApplicationLogs CloudWatch log groups.