Self-Hosted Observability

Dedicated observability platform on Amazon EKS for centralized metrics, alerting, and dashboards with Prometheus, Thanos, and Grafana.

Problems this Architecture solves

Avoids running disconnected monitoring stacks in every workload cluster with different dashboards, retention rules, and alert definitions.
Keeps historical metrics available after workload clusters are upgraded, recreated, or scaled down by storing blocks in object storage through Thanos.
Isolates observability infrastructure from product workloads so that a failing workload cluster does not take monitoring down with it and blind incident response.

Run Prometheus Agent in plat-dev and plat-staging to scrape cluster, node, ingress, and application metrics.
Run an HA Prometheus pair in plat-prod when local recording rules or short-lived local querying are needed during incidents.
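Push samples to the central observability cluster with remote_write. Where a full Prometheus server runs, keep local retention short, typically 6 to 24 hours; Prometheus Agent keeps only a write-ahead log and has no queryable local storage.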
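
As a minimal sketch of the forwarding setup described above, the following Python builds the global and remote_write sections of a Prometheus configuration and prints them as JSON, which a YAML parser also accepts. The hostname, the replica name, and the queue value are illustrative assumptions, not values from this architecture. The port 19291 and the path /api/v1/receive are the usual Thanos Receive defaults and should be checked against the actual deployment.

```python
import json


def remote_write_config(cluster: str, replica: str, receive_url: str) -> dict:
    """Build the global and remote_write sections of a Prometheus config."""
    return {
        "global": {
            # External labels identify the source cluster and replica so the
            # central query layer can tell series apart and deduplicate HA pairs.
            "external_labels": {"cluster": cluster, "replica": replica},
        },
        "remote_write": [
            {
                "url": receive_url,
                # Illustrative tuning value; adjust to observed throughput.
                "queue_config": {"max_samples_per_send": 2000},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical internal endpoint for Thanos Receive.
    cfg = remote_write_config(
        cluster="plat-prod",
        replica="prometheus-0",
        receive_url="https://thanos-receive.observability.internal:19291/api/v1/receive",
    )
    print(json.dumps(cfg, indent=2))
```

Giving each member of the HA pair a distinct replica label is what lets the query layer deduplicate their series later.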

Run Grafana, Alertmanager, and the Thanos control-plane components on a dedicated EKS cluster separated from product workloads.
Use separate node groups for query traffic, ingest traffic, and background compaction so noisy dashboards do not starve write or maintenance paths.
Isolate platform services into namespaces such as monitoring, thanos, grafana, and alerting.
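
One way to express the node group separation is to give each component a nodeSelector and a matching toleration. The sketch below assumes managed node groups named obs-query, obs-ingest, and obs-compaction, each tainted with a hypothetical key named dedicated. The assignment of components to groups is also an assumption. The label eks.amazonaws.com/nodegroup is the one EKS applies to managed node group nodes.

```python
# Hypothetical assignment of components to node groups.
NODE_GROUP_FOR = {
    "thanos-query": "query",
    "thanos-query-frontend": "query",
    "thanos-store-gateway": "query",
    "thanos-receive": "ingest",
    "thanos-compactor": "compaction",
}


def scheduling_for(component: str) -> dict:
    """Return pod scheduling fields that pin a component to its node group."""
    group = NODE_GROUP_FOR[component]
    return {
        # Managed node groups label their nodes with the node group name.
        "nodeSelector": {"eks.amazonaws.com/nodegroup": f"obs-{group}"},
        # Assumes each group is tainted dedicated=<group>:NoSchedule.
        "tolerations": [
            {
                "key": "dedicated",
                "operator": "Equal",
                "value": group,
                "effect": "NoSchedule",
            }
        ],
    }


if __name__ == "__main__":
    print(scheduling_for("thanos-receive"))
```

The returned dictionary can be merged into the Helm values of each component, so that a burst of dashboard queries cannot consume the nodes reserved for ingest or compaction.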

Prometheus instances in workload clusters scrape Kubernetes, infrastructure, and application endpoints.
Samples are streamed to Thanos Receive in the observability cluster over private network connectivity.
Thanos Receive uploads completed TSDB blocks to Amazon S3 for durable, long-term storage.
Thanos Query fans out to Thanos Receive for recent samples and to Thanos Store Gateway for historical blocks in S3, exposing both through one query layer.
Grafana uses Thanos Query Frontend as its datasource for dashboards and ad hoc troubleshooting.
Thanos Ruler evaluates recording and alerting rules, then sends notifications through Alertmanager to on-call channels.
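
The query end of this flow can be exercised through the Prometheus-compatible HTTP API that Thanos Query exposes. The sketch below only builds the request URL and makes no network call. The service hostname is a hypothetical in-cluster name. The dedup and partial_response parameters are Thanos extensions to the standard query endpoint, and it is an assumption here that the Query Frontend passes them through unchanged.

```python
from urllib.parse import urlencode


def instant_query_url(
    base_url: str,
    promql: str,
    dedup: bool = True,
    partial_response: bool = False,
) -> str:
    """Build an instant-query URL for a Thanos Query style endpoint."""
    params = {
        "query": promql,
        # Merge series that differ only by the replica label.
        "dedup": str(dedup).lower(),
        # Fail the query instead of returning silently incomplete data.
        "partial_response": str(partial_response).lower(),
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical service name inside the observability cluster.
    print(instant_query_url("http://thanos-query-frontend.thanos.svc:9090", "up"))
```

Disabling partial responses is a conservative choice for troubleshooting: a missing store shows up as an error rather than as a gap that looks like real data.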

Use Amazon S3 as the system of record for metrics blocks instead of relying on node-local Prometheus disks for history.
Enable Thanos compaction and downsampling so long retention does not make wide queries unusably expensive.
Keep raw retention short enough to control storage costs, then keep 5-minute and 1-hour downsampled data for trend analysis and capacity planning.
Configure the S3 bucket with versioning, lifecycle policies, and KMS encryption.
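
The tiered retention above maps onto the per-resolution retention flags of the Thanos compactor. The following sketch assembles a compactor argument list. The retention values and the config file path are illustrative assumptions, not recommendations from this architecture. Thanos only downsamples blocks once they reach a certain age, so raw retention should be checked against the Thanos documentation before it is shortened aggressively.

```python
def compactor_args(
    raw: str,
    five_min: str,
    one_hour: str,
    objstore_config_file: str,
) -> list[str]:
    """Build the argument list for a long-running `thanos compact` process."""
    return [
        "compact",
        # Keep running and compact continuously instead of exiting after one pass.
        "--wait",
        f"--objstore.config-file={objstore_config_file}",
        # Retention per resolution: raw samples, 5-minute, and 1-hour downsampled data.
        f"--retention.resolution-raw={raw}",
        f"--retention.resolution-5m={five_min}",
        f"--retention.resolution-1h={one_hour}",
    ]


if __name__ == "__main__":
    # Illustrative values: 30 days raw, 180 days at 5m, 2 years at 1h.
    args = compactor_args("30d", "180d", "730d", "/etc/thanos/objstore.yaml")
    print("thanos " + " ".join(args))
```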

Put Grafana behind an internal ALB and require access through VPN, Zero Trust proxy, or corporate network.
Federate Grafana authentication with the same SSO provider used elsewhere in the platform.
Use IRSA for Thanos and Prometheus components so S3 access is scoped to the exact service accounts that need it.
Restrict remote write ingress with security groups, mTLS, or authenticated ingress depending on how workload clusters connect.
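
To illustrate how IRSA scopes S3 access to a single service account, the sketch below generates the two IAM documents involved: a trust policy that lets only one Kubernetes service account assume the role, and a permissions policy limited to one bucket. The account ID, OIDC provider ID, bucket name, namespace, and service account name are all placeholders.

```python
import json


def irsa_trust_policy(
    account_id: str, oidc_provider: str, namespace: str, service_account: str
) -> dict:
    """Trust policy allowing exactly one service account to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to one namespace and service account.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


def metrics_bucket_policy(bucket: str) -> dict:
    """Permissions policy limited to the metrics bucket and its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # All identifiers below are placeholders.
    provider = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789"
    print(json.dumps(irsa_trust_policy("111122223333", provider, "thanos", "thanos-receive"), indent=2))
    print(json.dumps(metrics_bucket_policy("example-metrics-bucket"), indent=2))
```

Read-only components such as Store Gateway could use a narrower variant without the put and delete actions. A bucket encrypted with a customer-managed KMS key would also need matching KMS permissions, which this sketch omits.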

Install the stack with Helm charts managed by Argo CD so cluster state, dashboards, datasources, and alert rules stay in Git.
Keep ServiceMonitors, PodMonitors, recording rules, and Alertmanager routes versioned alongside platform manifests.
Promote changes from lower environments before updating the production observability cluster, especially query, storage, and alert routing changes.
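
A GitOps-managed Helm release of this kind is typically declared as an Argo CD Application resource. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl also accepts. The application name, chart version, and target namespace are illustrative assumptions. The example points at the community kube-prometheus-stack chart, which the source does not name, so treat that choice as an assumption too.

```python
import json


def argocd_application(
    name: str, repo_url: str, chart: str, chart_version: str, namespace: str
) -> dict:
    """Build an Argo CD Application that installs a Helm chart."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "chart": chart,
                # Pin the chart version so promotion between environments is explicit.
                "targetRevision": chart_version,
                "helm": {"releaseName": name},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {
                # Revert manual drift and remove resources deleted from Git.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


if __name__ == "__main__":
    app = argocd_application(
        name="monitoring-stack",
        repo_url="https://prometheus-community.github.io/helm-charts",
        chart="kube-prometheus-stack",
        chart_version="0.0.0",  # placeholder; pin a real chart version
        namespace="monitoring",
    )
    print(json.dumps(app, indent=2))
```

Pinning targetRevision per environment gives a concrete handle for the promotion flow: the lower environment moves to a new chart version first, and production follows only after that change has been validated.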

Use it when multiple EKS clusters need one metrics plane, one dashboard surface, and shared alert routing.
Use it when historical metrics retention matters for incident review, capacity planning, or SLO reporting.
Avoid it for very small environments where Amazon Managed Service for Prometheus or a single-cluster Prometheus deployment is simpler to operate.