Job Description
Job Summary: As a Site Reliability Engineer, you will own platform reliability, monitoring, logging, incident response, and operational excellence. This role requires strong accountability, calm decision-making during incidents, and the ability to fix and restore systems under pressure.
Key Responsibilities
Reliability & Operations:
- Own availability, performance, and reliability of production systems
- Participate in on-call rotations and lead incident resolution
- Perform root cause analysis (RCA) and implement preventive fixes
- Drive reliability improvements through automation and observability
- Implement and maintain observability using Prometheus, Grafana, Grafana Alloy, Loki, and Datadog
- Build dashboards, logs, and actionable alerts
- Correlate metrics, logs, and al...
Take the next step in your AI career. Submit your application to CoreStack today.