Job Description
We are seeking an experienced Principal Site Reliability Engineer to join a dynamic Platform Tribe. This role focuses on a high-end, microservice-based platform designed to process billions of financial transactions per day. You will be part of a team chasing zero latency and ensuring a smooth connection for global users, regardless of bandwidth.
What you will be doing:
- Manage day-to-day alerts, system checks, and issue escalation.
- Provide 24x7 on-call support for critical SaaS events.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to clusters using Terraform and Helm/Flux.
- Enhance infrastructure health by implementing checks and scripts for known issues.
- Maintain and develop deployment code and integrate new Cloud Infrastructure technologies.
- Conduct RCA (Root Cause Analysis) and take corrective actions to prevent recurrence.
- Collaborate with teams to ensure minimal imp...
Ready to Apply?
Take the next step in your AI career. Submit your application to Explore Group today.