Job Description

We are seeking an experienced Principal Site Reliability Engineer to join a dynamic Platform Tribe. This role focuses on a high-end, microservice-based platform designed to process billions of financial transactions per day. You will be part of a team chasing zero latency and ensuring a smooth connection for global users, regardless of bandwidth.

What you will be doing:

  • Manage day-to-day alerts, system checks, and issue escalation.
  • Provide 24x7 on-call support for critical SaaS events.
  • Proactively create monitors within the EKS/K8s ecosystem.
  • Deploy to clusters using Terraform and Helm/Flux.
  • Enhance infrastructure health by implementing checks and scripts for known issues.
  • Maintain and develop deployment code, and integrate new cloud infrastructure technologies.
  • Conduct root cause analysis (RCA) and take corrective actions to prevent recurrence.
  • Collaborate with teams to ensure minimal imp...

Ready to Apply?

Take the next step in your AI career. Submit your application to Explore Group today.
