AllCode implements continuous monitoring as a core DevOps practice for our customers. We employ a comprehensive approach to detecting and observing issues and threats throughout each phase of the DevOps pipeline. Our toolset includes Amazon CloudWatch, CloudTrail, VPC Flow Logs, Config Rules, GuardDuty, IAM Access Analyzer, Security Hub, KMS logging, and third-party services like DataDog.
We collaborate with customers to customize these monitoring solutions based on the maturity level of their continuous monitoring practices. This ensures tailored and effective monitoring strategies that align with their specific needs.
Here’s an example of some of the monitoring that we do:
URComped ECS Alarms and Logs
Notification System
These alarms are deployed for both the Website and Trio ECS environments and send notifications to the SNS topics ‘CORE-UrCompedWebsite-Notifications’ and ‘CORE-Trio-Notifications’, respectively. The notifications are then forwarded to Slack, along with the ApplicationLogs and ECS Event Logs, through a Lambda function called ‘alarms2slack’.
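For illustration, here is a minimal sketch of what a forwarder like ‘alarms2slack’ could look like. The SLACK_WEBHOOK_URL environment variable and the message formatting are assumptions, not the production implementation.

import json
import os
import urllib.request

# Hypothetical SNS-to-Slack forwarder in the spirit of ‘alarms2slack’.
def handler(event, context):
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration
    for record in event["Records"]:
        # SNS delivers the CloudWatch alarm as a JSON string in the Message field.
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f"*{alarm.get('AlarmName', 'Unknown alarm')}* is "
            f"{alarm.get('NewStateValue', '?')}: {alarm.get('NewStateReason', '')}"
        )
        req = urllib.request.Request(
            webhook_url,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)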
Alarms Descriptions
ECS Critical Health
This alarm verifies that there are at least two healthy containers in the Service, since this is the minimum number of instances that should be up at all times.
Detailed information:
- Statistic: Minimum
- Period: 5 minutes
- Threshold: < 2 running tasks
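For reference, this alarm could be recreated with boto3 roughly as follows. The sketch assumes Container Insights is enabled (so the RunningTaskCount metric exists); the cluster, service, and topic names are illustrative placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical recreation of the ECS Critical Health alarm.
cloudwatch.put_metric_alarm(
    AlarmName="ECS-Critical-Health",
    Namespace="ECS/ContainerInsights",  # assumes Container Insights is enabled
    MetricName="RunningTaskCount",
    Dimensions=[
        {"Name": "ClusterName", "Value": "CORE-UrCompedWebsite"},  # placeholder
        {"Name": "ServiceName", "Value": "website"},               # placeholder
    ],
    Statistic="Minimum",
    Period=300,  # 5 minutes
    EvaluationPeriods=1,
    Threshold=2,
    ComparisonOperator="LessThanThreshold",  # alarm when fewer than 2 tasks are running
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:CORE-UrCompedWebsite-Notifications"],  # placeholder ARN
)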
High CPU Usage
Since autoscaling is configured, this threshold should only be crossed after a sudden spike in processing demand or after autoscaling has reached its maximum number of containers, both of which warrant investigation (a combined sketch of this alarm and the memory alarm follows the High Memory Usage details below).
Detailed information:
- Statistic: Average
- Period: 10 minutes
- Threshold: >= 90% CPU Usage
High Memory Usage
Much like the High CPU Usage alarm, High Memory Usage is expected to trigger when the load exceeds the capacity that autoscaling can provide, and as such it warrants review.
Detailed information:
- Statistic: Average
- Period: 10 minutes
- Threshold: >= 90% Memory usage
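Both utilization alarms follow the same shape and can be sketched together; AWS/ECS publishes CPUUtilization and MemoryUtilization per service out of the box. The cluster, service, and topic names below are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical recreation of the High CPU Usage and High Memory Usage alarms.
for metric in ("CPUUtilization", "MemoryUtilization"):
    cloudwatch.put_metric_alarm(
        AlarmName=f"High-{metric}",
        Namespace="AWS/ECS",
        MetricName=metric,
        Dimensions=[
            {"Name": "ClusterName", "Value": "CORE-UrCompedWebsite"},  # placeholder
            {"Name": "ServiceName", "Value": "website"},               # placeholder
        ],
        Statistic="Average",
        Period=600,  # 10 minutes
        EvaluationPeriods=1,
        Threshold=90,  # >= 90% utilization
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:CORE-UrCompedWebsite-Notifications"],  # placeholder ARN
    )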
High Request Number
This alarm is meant as a simple DoS detector, since a sudden burst of requests might affect the app’s performance. At its peak, Website receives between 2,000 and 3,000 requests every five minutes, so double that number, 6,000 requests, triggers the alarm.
Detailed information:
- Statistic: Sum
- Period: 5 minutes
- Threshold: >= 6000 Requests
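Assuming request volume is measured at an Application Load Balancer, the alarm could be sketched as follows; the LoadBalancer dimension value and the topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical recreation of the High Request Number alarm.
cloudwatch.put_metric_alarm(
    AlarmName="High-Request-Number",
    Namespace="AWS/ApplicationELB",  # assumed source of the request metric
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/core-website/0123456789abcdef"}],  # placeholder
    Statistic="Sum",
    Period=300,  # 5 minutes
    EvaluationPeriods=1,
    Threshold=6000,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:CORE-UrCompedWebsite-Notifications"],  # placeholder ARN
)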
High App Latency
This alarm is meant to detect situations in which the health of the container is compromised but the app is not necessarily taken down. When there have been issues with some of the async components, such as the caches or Hangfire, the metrics reflected high latency but did not affect the instances’ health checks. As such, this alarm should trigger when there is an app-side issue or performance is degraded.
Detailed information:
- Statistic: P99
- Period: 15 minutes
- Threshold: >= 59 seconds
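In a boto3 sketch, a percentile like P99 is expressed through ExtendedStatistic rather than Statistic. This version assumes the latency is taken from the load balancer’s TargetResponseTime metric (in seconds), with placeholder names.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical recreation of the High App Latency alarm.
cloudwatch.put_metric_alarm(
    AlarmName="High-App-Latency",
    Namespace="AWS/ApplicationELB",  # assumed source of the latency metric
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/core-website/0123456789abcdef"}],  # placeholder
    ExtendedStatistic="p99",  # percentile statistics use ExtendedStatistic
    Period=900,  # 15 minutes
    EvaluationPeriods=1,
    Threshold=59,  # seconds
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:CORE-UrCompedWebsite-Notifications"],  # placeholder ARN
)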
Logs Descriptions
ApplicationLogs
These are the logs sent by the applications through Serilog, and they form the main logging system; the logs are separated by application into log streams.
Log Groups:
- ApplicationLogs
Log Streams:
- Prod-URCompedWebsiteCORE
- Prod-TrioCORE
Notification filter:
# Send notifications only for unhandled exceptions
[Uu]nhandled [Ee]xception
[Screenshot: Slack notification for ApplicationLogs]
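One way such a filter can be wired up is as a CloudWatch Logs subscription filter that invokes the ‘alarms2slack’ Lambda. The sketch below is an assumption about the wiring; the OR-of-terms pattern approximates the case-insensitive match, since classic filter syntax has no character classes, and the destination ARN is a placeholder.

import boto3

logs = boto3.client("logs")

# Hypothetical subscription filter feeding unhandled exceptions to alarms2slack.
logs.put_subscription_filter(
    logGroupName="ApplicationLogs",
    filterName="unhandled-exceptions",
    filterPattern='?"Unhandled Exception" ?"unhandled exception"',  # approximates [Uu]nhandled [Ee]xception
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:alarms2slack",  # placeholder ARN
)

Note that the Lambda also needs a resource policy allowing logs.amazonaws.com to invoke it before the subscription filter can deliver events.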
ECS Logs
These are the OS-level event streams fetched through the Windows Event Log of the containers. They include OS-level errors as well as unhandled exceptions that do not get logged through Serilog, such as Hangfire exceptions.
Log Groups:
- /ecs/CORE-UrCompedWebsite
- /ecs/CORE-Trio
Log Streams:
- ecs/urcomped_prod_run/<container_id>
Notification filter:
# Send only Application Errors
{ ($.LogEntry.Channel = "Application") && ($.LogEntry.Level = "Error") && ($.LogEntry.EventId != 29) }
[Screenshot: Slack notification for ECS Logs]
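This JSON pattern can be attached to the container log groups as a subscription filter in the same way. The sketch below assumes the ‘alarms2slack’ Lambda is the destination, with a placeholder ARN.

import boto3

logs = boto3.client("logs")

# Hypothetical subscription filters applying the Application Error pattern.
pattern = '{ ($.LogEntry.Channel = "Application") && ($.LogEntry.Level = "Error") && ($.LogEntry.EventId != 29) }'

for group in ("/ecs/CORE-UrCompedWebsite", "/ecs/CORE-Trio"):
    logs.put_subscription_filter(
        logGroupName=group,
        filterName="application-errors",
        filterPattern=pattern,
        destinationArn="arn:aws:lambda:us-east-1:123456789012:function:alarms2slack",  # placeholder ARN
    )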
Application Logs
Exceptions handled within the applications are posted to the ApplicationLogs CloudWatch log group.