Self-Hosted Observability
Dedicated observability platform on Amazon EKS for centralized metrics, alerting, and dashboards with Prometheus, Thanos, and Grafana.
Problems This Architecture Solves
- Avoids running disconnected monitoring stacks in every workload cluster with different dashboards, retention rules, and alert definitions.
- Keeps historical metrics available after workload clusters are upgraded, recreated, or scaled down by storing blocks in object storage through Thanos.
- Isolates observability infrastructure from product workloads so that a failure in a workload cluster does not also take down the monitoring needed for incident response.
Recommended Topology
Workload clusters
- Run Prometheus Agent in plat-dev and plat-staging to scrape cluster, node, ingress, and application metrics.
- Run an HA Prometheus pair in plat-prod when local recording rules or short-lived local querying are needed during incidents.
- Keep local retention short, typically 6 to 24 hours, and push samples to the central observability cluster with remote_write.
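As a rough illustration of the workload-side setup, the fragment below sketches a Prometheus configuration for one replica of the plat-prod pair. The Receive hostname, certificate paths, and label values are assumptions made for this example; port 19291 and the /api/v1/receive path are the Thanos Receive defaults for remote write.

```yaml
# prometheus.yml fragment for one replica of the plat-prod HA pair
global:
  external_labels:
    cluster: plat-prod      # identifies the source cluster in central queries
    replica: prometheus-0   # differs per replica so the pair can be deduplicated at query time

remote_write:
  # Hypothetical private hostname for Thanos Receive in the observability cluster
  - url: https://thanos-receive.observability.example.internal:19291/api/v1/receive
    tls_config:             # client certificate for mTLS; paths are placeholders
      ca_file: /etc/prometheus/tls/ca.crt
      cert_file: /etc/prometheus/tls/tls.crt
      key_file: /etc/prometheus/tls/tls.key
```

Local retention is set on the Prometheus server command line rather than in this file, for example with `--storage.tsdb.retention.time=12h`. If both replicas write to Receive, Thanos Query would typically be started with `--query.replica-label=replica` so the duplicate series collapse into one.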
Observability cluster
- Run Grafana, Alertmanager, and the Thanos control-plane components on a dedicated EKS cluster separated from product workloads.
- Use separate node groups for query traffic, ingest traffic, and background compaction so noisy dashboards do not starve write or maintenance paths.
- Isolate platform services into namespaces such as monitoring, thanos, grafana, and alerting.
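One way to pin a component to its node group is a node selector plus a matching toleration. The sketch below assumes the query node group carries a hypothetical label pool=query and a taint pool=query:NoSchedule; neither name comes from the original design.

```yaml
# Pod template fragment for the Thanos Query Deployment
# (the "pool" label and taint key are hypothetical)
spec:
  template:
    spec:
      nodeSelector:
        pool: query          # schedule only onto nodes labeled for query traffic
      tolerations:
        - key: pool
          operator: Equal
          value: query
          effect: NoSchedule # tolerate the taint that keeps other pods off these nodes
```

The same pattern, with different values, would apply to the ingest and compaction node groups.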
Data Flow
- Prometheus instances in workload clusters scrape Kubernetes, infrastructure, and application endpoints.
- Samples are streamed to Thanos Receive in the observability cluster over private network connectivity.
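- Thanos Receive keeps recent samples in its local TSDB and uploads completed blocks to Amazon S3 for durable, long-term storage.
- Thanos Query fans out to Thanos Receive for recent samples and to Thanos Store Gateway for historical blocks in S3, exposing both through one query layer.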
- Grafana uses Thanos Query Frontend as its datasource for dashboards and ad hoc troubleshooting.
- Thanos Ruler evaluates recording and alerting rules against the query layer, then sends notifications through Alertmanager to on-call channels.
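The Grafana end of this flow can be expressed as a provisioned datasource. This is a minimal sketch: the service name, namespace, and port are assumptions about how Query Frontend is exposed in the cluster, not values taken from the original design.

```yaml
# Grafana datasource provisioning file, e.g. provisioning/datasources/thanos.yaml
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus          # Thanos Query Frontend serves the Prometheus HTTP API
    access: proxy             # Grafana's backend issues the queries, not the browser
    # Hypothetical in-cluster address for Thanos Query Frontend
    url: http://thanos-query-frontend.thanos.svc.cluster.local:10902
    isDefault: true
    jsonData:
      httpMethod: POST        # avoids URL length limits on large queries
```

Keeping this file in Git alongside dashboards means the datasource is recreated identically whenever Grafana is redeployed.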
Storage and Retention
- Use Amazon S3 as the system of record for metrics blocks instead of relying on node-local Prometheus disks for history.
- Enable Thanos compaction and downsampling so long retention does not make wide queries unusably expensive.
- Keep raw retention short enough to control storage costs, then keep 5-minute and 1-hour downsampled data for trend analysis and capacity planning.
- Protect the S3 bucket with versioning, lifecycle policies, and KMS encryption.
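The storage points above might translate into configuration along the following lines. The bucket name, region, KMS key ARN, file paths, and retention periods are all placeholders chosen for illustration; appropriate retention values depend on cost and query needs.

```yaml
# objstore.yml, shared by Thanos Receive, Store Gateway, and Compactor
type: S3
config:
  bucket: example-thanos-metrics            # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  sse_config:
    type: SSE-KMS                           # server-side encryption with a KMS key
    kms_key_id: arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID
---
# Fragment of the Thanos Compactor container spec
args:
  - compact
  - --wait                                  # keep running and compact continuously
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d          # illustrative: raw samples
  - --retention.resolution-5m=180d          # illustrative: 5-minute downsampled data
  - --retention.resolution-1h=730d          # illustrative: 1-hour downsampled data
```

Downsampling is on by default in the Compactor, so no extra flag is needed to enable it. Only one Compactor should run against a given bucket at a time.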
Access and Security
- Put Grafana behind an internal ALB and require access through a VPN, a Zero Trust proxy, or the corporate network.
- Federate Grafana authentication with the same SSO provider used elsewhere in the platform.
- Use IAM Roles for Service Accounts (IRSA) for Thanos and Prometheus components so S3 access is scoped to the exact service accounts that need it.
- Restrict remote write ingress with security groups, mTLS, or authenticated ingress depending on how workload clusters connect.
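With IRSA, the link between a pod and its IAM role is an annotation on the Kubernetes service account. In the sketch below, the service account name, AWS account ID, and role name are hypothetical.

```yaml
# Service account for Thanos Store Gateway, bound to an IAM role through IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: thanos-storegateway                 # hypothetical name
  namespace: thanos
  annotations:
    # Placeholder role ARN; the role's policy should allow only the metrics bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/thanos-storegateway-s3
```

To keep the scope tight, the role's trust policy would normally restrict the OIDC subject to `system:serviceaccount:thanos:thanos-storegateway`, so that no other service account in the cluster can assume it.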
Deployment Model
- Install the stack with Helm charts managed by ArgoCD so cluster state, dashboards, datasources, and alert rules stay in Git.
- Keep ServiceMonitors, PodMonitors, recording rules, and Alertmanager routes versioned alongside platform manifests.
- Promote changes from lower environments before updating the production observability cluster, especially query, storage, and alert routing changes.
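A GitOps-managed install could look like the following Argo CD Application. The repository URL, chart path, values file, and application name are assumptions for illustration only.

```yaml
# Argo CD Application that deploys the Thanos chart from Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: thanos
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/observability.git  # hypothetical repository
    path: charts/thanos                                          # hypothetical chart path
    targetRevision: main
    helm:
      valueFiles:
        - values-prod.yaml        # per-environment values support staged promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: thanos
  syncPolicy:
    automated:
      prune: true                 # remove resources that were deleted from Git
      selfHeal: true              # revert manual drift in the cluster
    syncOptions:
      - CreateNamespace=true
```

For the production observability cluster, some teams prefer to leave out the automated sync policy and trigger syncs manually, so that query, storage, and alert routing changes are applied only after review.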
When to Use This Pattern
- Use it when multiple EKS clusters need one metrics plane, one dashboard surface, and shared alert routing.
- Use it when historical metrics retention matters for incident review, capacity planning, or SLO reporting.
- Avoid it for very small environments where Amazon Managed Service for Prometheus or a single-cluster Prometheus deployment is simpler to operate.