AWS DevOps Distributed Tracing

AllCode leverages AWS X-Ray to analyze and debug production applications across microservices, including Lambda Functions and Step Functions, using AWS CDK. Additionally, we collaborate with customers to integrate distributed tracing into their preferred tools, such as Splunk and Dynatrace, if needed.

We used X-Ray in combination with AWS CloudWatch and AWS CloudTrail.

We started with the following three steps:

1. Defining, Collecting, and Analyzing Workload Health Metrics
AWS Services: Use AWS CloudWatch to monitor Lambda functions and SQS queues.
3rd Party Tools: Integrate with seed.run for CI/CD and log analysis.

2. Exporting Standard Application Logs
AWS Services: Use AWS CloudTrail and CloudWatch Logs to capture all API calls and standard application logs.
3rd Party Tools: Seed.run checks Lambda logs for errors and sends email notifications.
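
As an illustration of reviewing the API calls that CloudTrail records, the sketch below builds the request parameters for CloudTrail's LookupEvents API, filtered to one service at a time. The helper name `build_lookup_request`, the 24-hour default window, and the result limit are assumptions made for this example, not part of the setup described here.

```python
from datetime import datetime, timedelta, timezone

# CloudTrail event sources for the two services discussed in this article.
EVENT_SOURCES = {
    "lambda": "lambda.amazonaws.com",
    "sqs": "sqs.amazonaws.com",
}


def build_lookup_request(service, hours_back=24, now=None):
    """Build keyword arguments for CloudTrail's LookupEvents API.

    `service` is a key of EVENT_SOURCES; `hours_back` is an illustrative default.
    """
    now = now or datetime.now(timezone.utc)
    return {
        # LookupEvents accepts a single lookup attribute per request.
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": EVENT_SOURCES[service]}
        ],
        "StartTime": now - timedelta(hours=hours_back),
        "EndTime": now,
        "MaxResults": 50,
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("cloudtrail").lookup_events(**build_lookup_request("lambda"))`.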

3. Defining Thresholds for Operational Metrics
Thresholds: Define CloudWatch Alarms for key metrics like error rates, latency, and SQS Dead Letter Queue (DLQ) sizes.

Customer Example: Monitoring and Alerts for Lambda-SQS Architecture

Workload Health Metrics:

Utilized AWS CloudWatch to set up metrics for Lambda error rates, SQS queue depth, and DLQ sizes.
Integrated seed.run to monitor logs and automatically notify personnel in charge if errors are detected.
Standard Application Logs:

Enabled CloudTrail for capturing API calls made on the Lambda functions and SQS services.
Used seed.run to provide an overview of past errors, their timestamps, and corresponding X-Ray traces.
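
To show how an error overview with timestamps and X-Ray traces can be assembled, here is a minimal sketch that pairs error lines in Lambda logs with the trace ID from the matching REPORT line. It is not how seed.run works internally. It assumes events shaped like those returned by CloudWatch Logs (`timestamp` in milliseconds, `message` as text), that error lines carry the request ID, and that active tracing is enabled so REPORT lines include `XRAY TraceId`. The function name `summarize_errors` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Lambda request IDs are UUIDs; REPORT lines carry the X-Ray trace ID when tracing is active.
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
TRACE = re.compile(r"XRAY TraceId:\s*(\S+)")
ERROR = re.compile(r"ERROR|Task timed out")  # illustrative error markers


def summarize_errors(events):
    """Return one record per error line, with its timestamp and X-Ray trace ID."""
    # First pass: map each request ID to the trace ID on its REPORT line.
    trace_by_request = {}
    for event in events:
        message = event["message"]
        if message.startswith("REPORT"):
            request, trace = UUID.search(message), TRACE.search(message)
            if request and trace:
                trace_by_request[request.group(0)] = trace.group(1)

    # Second pass: collect error lines and attach the trace ID where known.
    errors = []
    for event in events:
        message = event["message"]
        if not ERROR.search(message):
            continue
        request = UUID.search(message)
        request_id = request.group(0) if request else None
        when = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        errors.append({
            "time": when.isoformat(),
            "request_id": request_id,
            "trace_id": trace_by_request.get(request_id),
            "message": message.strip(),
        })
    return errors
```

The trace ID in each record can then be looked up in the X-Ray console to inspect the full request path.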
Thresholds for Alerts:

CloudWatch Alarms were set for:
Lambda error rates above 1%.
SQS queue depth exceeding 1000 messages.
More than 5 messages in the DLQ.
When any of these alarms is triggered, an email is sent to the person in charge.
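
The three thresholds above can be sketched as CloudWatch alarm definitions. The code below only builds the parameter dictionaries for the CloudWatch PutMetricAlarm API. The function, queue, and topic names, the 5-minute period, and the single evaluation period are assumptions for illustration. The email delivery is assumed to go through an SNS topic with an email subscription, which the original setup may or may not use.

```python
PERIOD_SECONDS = 300  # assumed evaluation window


def lambda_error_rate_alarm(function_name, threshold_percent, topic_arn):
    """Alarm on Errors / Invocations (as a percentage) using metric math."""
    def lambda_sum(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": PERIOD_SECONDS,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "Metrics": [
            lambda_sum("errors", "Errors"),
            lambda_sum("invocations", "Invocations"),
            {"Id": "error_rate", "Expression": "100 * errors / invocations",
             "Label": "Error rate (%)", "ReturnData": True},
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def sqs_depth_alarm(queue_name, threshold, topic_arn):
    """Alarm when the number of visible messages in a queue exceeds the threshold."""
    return {
        "AlarmName": f"{queue_name}-depth",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": PERIOD_SECONDS,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def build_alarms(function_name, queue_name, dlq_name, topic_arn):
    """Return the three alarm definitions using the thresholds from the text."""
    return [
        lambda_error_rate_alarm(function_name, 1.0, topic_arn),  # error rate above 1%
        sqs_depth_alarm(queue_name, 1000.0, topic_arn),          # more than 1000 messages
        sqs_depth_alarm(dlq_name, 5.0, topic_arn),               # more than 5 DLQ messages
    ]
```

With boto3 available, each definition could be applied with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Keeping the definitions as plain dictionaries makes the thresholds easy to review and test before anything is created in an account.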
With these KPIs and metrics in place, the monitoring and alerting system supports quick error detection and resolution.

Evidence:
Standardized Document:

A comprehensive guide detailing the above KPIs and metrics is available in our internal wiki.
Customer Example Implementation:

In a recent customer engagement, we implemented the above monitoring and KPIs for a system that used Lambda functions orchestrated with SQS.
By following these practices, we ensure optimal health monitoring and quick response times for operational events.