Job Description
Job Responsibilities
- Design, build, and maintain highly available, scalable, and reliable production systems.
- Ensure system uptime, performance, and reliability by proactively monitoring, troubleshooting, and resolving incidents.
- Implement and manage monitoring, alerting, and observability solutions (metrics, logs, traces).
- Automate operational tasks to reduce manual effort and improve system reliability.
- Lead incident response, root cause analysis (RCA), and post-incident reviews.
- Collaborate with development teams to define SLIs, SLOs, and error budgets.
- Improve CI/CD pipelines to enable safe, fast, and reliable deployments.
- Manage capacity planning, performance tuning, and cost optimization.
- Ensure security best practices across infrastructure and application layers.
- Participate in on-call rotations and provide production support...
Ready to Apply?
Take the next step in your AI career. Submit your application to Net2Source (N2S) today.