Thursday, 26 June 2025

CloudWatch Dashboard for DevOps – Full Lake House Stack

 


 

🔹 1. S3 – Storage, Replication & Ingestion Health 

| Widget | Metric | Purpose |
| --- | --- | --- |
| Number | BucketSizeBytes | Detect abnormal growth or cleanup issues |
| Line Chart | NumberOfObjects | A sudden drop or a flat line can indicate an ingestion failure |
| Custom Widget | Replication failure events (CRR) | Detect DR sync issues |
| Log Table | Failed PUT/GET requests (via S3 access logs) | Spot IAM or app issues |
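
As a rough sketch of the first two rows, the helper below builds the widget definitions in Python using only the standard library. The function name `s3_storage_widget`, the bucket name, and the region are placeholders, and the daily period reflects the fact that S3 storage metrics are reported once per day.

```python
import json


def s3_storage_widget(bucket, metric, storage_type, title, view, x=0, y=0):
    """Build one dashboard metric widget for a daily S3 storage metric."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "metrics": [
                ["AWS/S3", metric, "BucketName", bucket, "StorageType", storage_type]
            ],
            "period": 86400,  # S3 storage metrics are reported once per day
            "stat": "Average",
            "view": view,  # "singleValue" for a Number widget, "timeSeries" for a line chart
            "region": "us-east-1",  # placeholder region
            "title": title,
        },
    }


widgets = [
    s3_storage_widget("your-bucket-name", "BucketSizeBytes", "StandardStorage",
                      "S3 Bucket Size", "singleValue", x=0),
    s3_storage_widget("your-bucket-name", "NumberOfObjects", "AllStorageTypes",
                      "S3 Object Count", "timeSeries", x=12),
]
print(json.dumps({"widgets": widgets}, indent=2))
```

BucketSizeBytes is tracked per storage class, so a bucket that uses several classes would need one metric row per class.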

 

🔹 2. Glue – ETL Job & Crawler Status 

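| Widget | Metric | Purpose |
| --- | --- | --- |
| Line/Bar | glue.driver.aggregate.numFailedTasks | Alert on failed tasks in a job run |
| Number | glue.driver.aggregate.elapsedTime | Job duration health |
| Log Insights Table | /aws-glue/jobs/output and /aws-glue/jobs/error logs | View recent errors |
| Number | Crawler run status (custom logs or CloudWatch metric filters) | Detect stale metadata |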
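
For the Log Insights Table row, a log widget can embed a Logs Insights query directly. This is a minimal sketch: the function name `glue_error_log_widget`, the filter pattern, and the region are assumptions to adapt to your own job output.

```python
import json


def glue_error_log_widget(log_group="/aws-glue/jobs/output", limit=20):
    """Build a dashboard log widget listing recent error lines from a Glue log group."""
    query = (
        f"SOURCE '{log_group}' "
        "| fields @timestamp, @logStream, @message "
        "| filter @message like /ERROR|Exception/ "  # assumed pattern, tune to your logs
        "| sort @timestamp desc "
        f"| limit {limit}"
    )
    return {
        "type": "log",
        "x": 0,
        "y": 0,
        "width": 24,
        "height": 6,
        "properties": {
            "query": query,
            "region": "us-east-1",  # placeholder region
            "view": "table",
            "title": "Recent Glue Errors",
        },
    }


widget = glue_error_log_widget()
print(json.dumps(widget, indent=2))
```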

 

🔹 3. Athena – Query Scan & Failures 

| Widget | Metric | Purpose |
| --- | --- | --- |
| Line | Query latency or scan size (TotalExecutionTime, ProcessedBytes) | Cost + performance watch |
| Number | Failed queries | Alert on syntax/data issues |
| Pie Chart | Queries per user (via Logs Insights) | Spot abuse or overuse |
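
One way to sketch the first two rows is shown below. It assumes the workgroup has CloudWatch query metrics enabled and that Athena publishes `TotalExecutionTime` and `ProcessedBytes` with `QueryState`, `QueryType`, and `WorkGroup` dimensions; the helper name `athena_widget` is hypothetical. Counting failed queries is approximated by taking the `SampleCount` of a per-query metric filtered to the FAILED state.

```python
import json


def athena_widget(metric, stat, query_state, title, view="timeSeries",
                  workgroup="primary", x=0, y=0):
    """Build a dashboard metric widget for an Athena per-query metric."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "metrics": [
                ["AWS/Athena", metric, "QueryState", query_state,
                 "QueryType", "DML", "WorkGroup", workgroup]
            ],
            "period": 300,
            "stat": stat,
            "view": view,
            "region": "us-east-1",  # placeholder region
            "title": title,
        },
    }


# Scan size of successful queries, summed per period (cost watch)
scan = athena_widget("ProcessedBytes", "Sum", "SUCCEEDED", "Athena Bytes Scanned")
# Number of failed queries: one data point per query, so SampleCount counts them
failed = athena_widget("TotalExecutionTime", "SampleCount", "FAILED",
                       "Athena Failed Queries", view="singleValue", x=12)
print(json.dumps({"widgets": [scan, failed]}, indent=2))
```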

 

🔹 4. Lake Formation – Governance + Access 

| Widget | Metric/Source | Purpose |
| --- | --- | --- |
| Log Table | CloudTrail: AccessDenied events | Detect policy issues |
| List | GrantPermissions count over time | Track governance changes |
| Alarm Summary | Spike in permission errors | Alert on misconfigurations |
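
Both CloudTrail-based rows can be sketched as Logs Insights queries, assuming the trail is delivered to a CloudWatch Logs group. The log group name `your-cloudtrail-log-group` and the helper `log_widget` are placeholders; the field names follow the standard CloudTrail event format.

```python
import json

CLOUDTRAIL_LOG_GROUP = "your-cloudtrail-log-group"  # placeholder name

QUERIES = {
    "Lake Formation AccessDenied Events": (
        "fields @timestamp, userIdentity.arn, eventName, errorMessage "
        "| filter eventSource = 'lakeformation.amazonaws.com' "
        "and errorCode like /AccessDenied/ "
        "| sort @timestamp desc | limit 50"
    ),
    "GrantPermissions Calls per Hour": (
        "filter eventSource = 'lakeformation.amazonaws.com' "
        "and eventName = 'GrantPermissions' "
        "| stats count(*) as grants by bin(1h)"
    ),
}


def log_widget(title, query, y):
    """Wrap a Logs Insights query in a full-width dashboard log widget."""
    return {
        "type": "log",
        "x": 0,
        "y": y,
        "width": 24,
        "height": 6,
        "properties": {
            "query": f"SOURCE '{CLOUDTRAIL_LOG_GROUP}' | {query}",
            "region": "us-east-1",  # placeholder region
            "view": "table",
            "title": title,
        },
    }


widgets = [log_widget(title, query, y=i * 6)
           for i, (title, query) in enumerate(QUERIES.items())]
print(json.dumps({"widgets": widgets}, indent=2))
```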

 

🔹 5. Redshift – Query Engine Health 

| Widget | Metric | Purpose |
| --- | --- | --- |
| Line Chart | CPUUtilization | Detect load spikes |
| Line Chart | HealthStatus | Cluster health (1 = healthy) |
| Number | DatabaseConnections | Detect app or dashboard overuse |
| Line Chart | QueryDuration | Detect slow queries |
| Log Table | STL_ERROR details (queried via SQL) and exported audit logs in CloudWatch | Investigate failed queries |

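🔁 Enable Redshift → CloudWatch integration (send audit logs to CloudWatch Logs instead of an S3 bucket):

aws redshift enable-logging --cluster-identifier my-cluster --log-destination-type cloudwatch --log-exports connectionlog useractivitylog userlog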
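
To make load spikes easier to read on the CPUUtilization chart, a horizontal annotation can mark a threshold. The sketch below assumes an 80% threshold and a hypothetical helper name `redshift_cpu_widget`; pick a value that matches your own alarm settings.

```python
import json


def redshift_cpu_widget(cluster_id, threshold=80):
    """Build a Redshift CPU line chart with a horizontal threshold line."""
    return {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
            "metrics": [
                ["AWS/Redshift", "CPUUtilization", "ClusterIdentifier", cluster_id]
            ],
            "period": 300,
            "stat": "Average",
            "view": "timeSeries",
            "region": "us-east-1",  # placeholder region
            "title": "Redshift CPU Utilization",
            "yAxis": {"left": {"min": 0, "max": 100}},  # CPU is a percentage
            "annotations": {
                "horizontal": [
                    {"label": f"CPU above {threshold}%", "value": threshold}
                ]
            },
        },
    }


widget = redshift_cpu_widget("your-cluster-id")
print(json.dumps(widget, indent=2))
```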
 

 

🔹 6. RDS – Database Health (Aurora, MySQL, Postgres) 

| Widget | Metric | Purpose |
| --- | --- | --- |
| Line Chart | CPUUtilization | Spot performance bottlenecks |
| Line Chart | FreeStorageSpace | Detect storage running out |
| Number | DatabaseConnections | A high count suggests a connection flood from apps |
| Line Chart | ReadIOPS / WriteIOPS | Workload pattern analysis |
| Alarm Widget | Status alarms (disk, CPU, connections) | Actionable alerts |

🧠 RDS metrics live in the AWS/RDS namespace, including:

CPUUtilization, FreeableMemory, DiskQueueDepth, DatabaseConnections
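
The status alarms in the table need alarm definitions behind them. As a minimal sketch, the function below only builds the parameters for a low-storage alarm; the alarm name, the 10 GiB threshold, and the SNS topic ARN are assumptions. With boto3 installed, the resulting dict could be passed to the CloudWatch client's `put_metric_alarm` call.

```python
import json

GIB = 1024 ** 3  # FreeStorageSpace is reported in bytes


def rds_low_storage_alarm(db_instance, min_free_gib=10, sns_topic_arn=None):
    """Build parameters for an alarm that fires when free storage runs low."""
    params = {
        "AlarmName": f"{db_instance}-low-free-storage",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,  # three consecutive 5-minute periods
        "Threshold": min_free_gib * GIB,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "missing",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


params = rds_low_storage_alarm("your-db-instance")
print(json.dumps(params, indent=2))
# Assumed usage with boto3 (not executed here):
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```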
 

 

🔹 7. Alerts & Incident Response 

| Widget | Metric | Purpose |
| --- | --- | --- |
| Alarm Widget | All CloudWatch alarms | See at a glance which components are in alarm |
| Log Table | SNS or Lambda invocations | Verify auto-remediation ran |
| List | Step Function failures | Workflow incident insights |
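
The alarm widget takes a list of alarm ARNs. The sketch below assembles those ARNs from alarm names; the account ID, region, alarm names, and the helper name `alarm_status_widget` are all placeholders.

```python
import json


def alarm_status_widget(alarm_names, account_id, region="us-east-1"):
    """Build an alarm status widget from a list of CloudWatch alarm names."""
    arns = [
        f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{name}"
        for name in alarm_names
    ]
    return {
        "type": "alarm",
        "x": 0,
        "y": 0,
        "width": 24,
        "height": 4,
        "properties": {
            "title": "Lake House Alarms",
            "alarms": arns,
        },
    }


widget = alarm_status_widget(
    ["glue-job-failures", "rds-low-free-storage", "redshift-high-cpu"],  # hypothetical names
    account_id="123456789012",  # placeholder account ID
)
print(json.dumps(widget, indent=2))
```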

 

🚀 Bonus: Quick Links on Dashboard 

Add custom “View Logs” buttons linking to:

  • Glue logs

  • Athena query results

  • Redshift STL_ERROR logs

  • RDS Performance Insights
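
Quick links like these can live in a text widget, which renders Markdown. The sketch below uses plain Markdown links; every URL is a placeholder to replace with the console deep link for your own account and region, and the helper name `quick_links_widget` is hypothetical.

```python
import json

# Placeholder URLs: swap in the real console deep links for your environment
LINKS = {
    "Glue logs": "https://example.com/glue-logs",
    "Athena query results": "https://example.com/athena-results",
    "Redshift STL_ERROR logs": "https://example.com/redshift-errors",
    "RDS Performance Insights": "https://example.com/rds-performance-insights",
}


def quick_links_widget(links, y=0):
    """Build a text widget whose Markdown body is a list of links."""
    lines = ["## View Logs"] + [f"* [{label}]({url})" for label, url in links.items()]
    return {
        "type": "text",
        "x": 0,
        "y": y,
        "width": 24,
        "height": 3,
        "properties": {"markdown": "\n".join(lines)},
    }


widget = quick_links_widget(LINKS)
print(json.dumps(widget, indent=2))
```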

 

🛠️ Build the Dashboard: Tools You Can Use 

| Tool | Purpose |
| --- | --- |
| CloudWatch Console | Manually build widgets via UI |
| CloudWatch Dashboard JSON | Build in code, version in Git |
| Terraform | Automate dashboard creation |
| Grafana CloudWatch plugin | Visualize metrics with filters + alerts |

 

Sample CloudWatch Dashboard JSON (paste it into the dashboard's source editor in the console, after replacing the placeholder names and region)

{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["Glue", "glue.driver.aggregate.numFailedTasks", "JobName", "your-job-name", "JobRunId", "ALL", "Type", "count"]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-east-1",
        "title": "Glue Failed Tasks"
      }
    },
    {
      "type": "metric",
      "x": 12,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["AWS/S3", "NumberOfObjects", "BucketName", "your-bucket-name", "StorageType", "AllStorageTypes"]
        ],
        "period": 86400,
        "stat": "Average",
        "region": "us-east-1",
        "title": "S3 Object Count"
      }
    },
    {
      "type": "metric",
      "x": 0,
      "y": 6,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["AWS/Athena", "TotalExecutionTime", "QueryState", "SUCCEEDED", "QueryType", "DML", "WorkGroup", "primary"]
        ],
        "period": 300,
        "stat": "SampleCount",
        "region": "us-east-1",
        "title": "Athena Query Success Count"
      }
    },
    {
      "type": "metric",
      "x": 12,
      "y": 6,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["AWS/Redshift", "CPUUtilization", "ClusterIdentifier", "your-cluster-id"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Redshift CPU Utilization"
      }
    },
    {
      "type": "metric",
      "x": 0,
      "y": 12,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "your-db-instance"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "RDS CPU Utilization"
      }
    },
    {
      "type": "metric",
      "x": 12,
      "y": 12,
      "width": 12,
      "height": 6,
      "properties": {
        "metrics": [
          ["AWS/RDS", "FreeStorageSpace", "DBInstanceIdentifier", "your-db-instance"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "RDS Free Storage"
      }
    }
  ]
}
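
Hand-writing x/y coordinates gets tedious as the dashboard grows. The sketch below generates the same two-column layout used in the JSON above (widgets 12 units wide and 6 tall on CloudWatch's 24-column grid); the helper names `layout` and `metric_widget` are hypothetical.

```python
import json

GRID_WIDTH = 24  # CloudWatch dashboards use a 24-column grid


def metric_widget(title, metric, stat="Average", period=300, region="us-east-1"):
    """Build a metric widget without position information."""
    return {
        "type": "metric",
        "properties": {
            "metrics": [metric],
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def layout(widgets, columns=2, height=6):
    """Assign x/y/width/height so widgets fill the grid row by row."""
    width = GRID_WIDTH // columns
    placed = []
    for index, widget in enumerate(widgets):
        row, col = divmod(index, columns)
        placed.append({**widget, "x": col * width, "y": row * height,
                       "width": width, "height": height})
    return placed


widgets = layout([
    metric_widget("Redshift CPU Utilization",
                  ["AWS/Redshift", "CPUUtilization", "ClusterIdentifier", "your-cluster-id"]),
    metric_widget("RDS CPU Utilization",
                  ["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "your-db-instance"]),
    metric_widget("RDS Free Storage",
                  ["AWS/RDS", "FreeStorageSpace", "DBInstanceIdentifier", "your-db-instance"]),
])
dashboard_body = json.dumps({"widgets": widgets}, indent=2)
print(dashboard_body)
```

Saved to a file, the printed body can be versioned in Git and applied with `aws cloudwatch put-dashboard --dashboard-name <name> --dashboard-body file://<path>`.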

 
