Job Description
Job Responsibilities
Design, build, and maintain highly available, scalable, and reliable production systems.
Ensure system uptime, performance, and reliability by proactively monitoring systems and by troubleshooting and resolving incidents.
Implement and manage monitoring, alerting, and observability solutions (metrics, logs, traces).
Automate operational tasks to reduce manual effort and improve system reliability.
Lead incident response, root cause analysis (RCA), and post-incident reviews.
Collaborate with development teams to define SLIs, SLOs, and error budgets.
Improve CI/CD pipelines to enable safe, fast, and reliable deployments.
Manage capacity planning, performance tuning, and cost optimization.
Ensure security best practices are followed across infrastructure and application layers.
Participate in on-call rotations and provide production support.
Required Skills & Qualifications
Technical Skills
5+ years of experience in Site Reliability Engineering, DevOps, o...
Ready to Apply?
Take the next step in your AI career. Submit your application to Net2Source Inc. today.