Job Description

Job Summary: As a Site Reliability Engineer, you are expected to take ownership of platform reliability, monitoring, logging, incident response, and operational excellence. This role requires strong accountability, calm decision-making during incidents, and the ability to fix and restore systems under pressure.

Key Responsibilities

Reliability & Operations:

  • Own availability, performance, and reliability of production systems
  • Participate in on-call rotations and lead incident resolution
  • Perform root cause analysis (RCA) and implement preventive fixes
  • Drive reliability improvements through automation and observability

Monitoring, Logging & Observability:


  • Implement and maintain observability using Prometheus, Grafana, Grafana Alloy, Loki, and Datadog
  • Build dashboards, logs, and actionable alerts
  • Correlate metrics, logs, and al...

Ready to Apply?

Take the next step in your AI career. Submit your application to CoreStack today.
